Understanding the impact of heteroscedasticity on the predictive ability of modern regression methods

Date created: 
Keywords: 
Regression trees
Random forests
Bayesian additive regression trees
Artificial neural networks
Multivariate adaptive regression splines

As the size and complexity of modern data sets grow, more and more prediction methods are being developed. Despite the growing sophistication of these methods, there is not a well-developed literature on how heteroscedasticity affects them. We aim to understand the impact of heteroscedasticity on the predictive ability of modern regression methods. To do so, we review the visualization and diagnosis of heteroscedasticity and develop a measure for quantifying it. These tools are applied to 42 real data sets in order to understand the prevalence and magnitude of heteroscedasticity "typical" of real data. We use the knowledge from this analysis to develop a simulation study that explores the predictive ability of nine regression methods. We vary a number of factors to determine how they influence prediction accuracy in conjunction with, and separately from, heteroscedasticity. These factors include data linearity, the number of explanatory variables, the proportion of unimportant explanatory variables, and the signal-to-noise ratio. We compare prediction accuracy with and without a variance-stabilizing log-transformation. The predictive ability of each method is compared using the mean squared error, a popular measure of regression accuracy, and the median absolute standardized deviation, a measure that accounts for potential heteroscedasticity.
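
As a rough illustration of why two accuracy measures are reported, the following Python sketch simulates data with constant and with non-constant error variance and evaluates predictions from the true mean function. Everything in it is an assumption made for illustration, not the design used in the project: the linear mean `1 + 2x`, the variance function `0.2 * (1 + x)`, the sample size, and in particular the form of `median_abs_std_dev`, which is taken here to be the median of the absolute errors divided by the known error standard deviation. The abstract does not define the measure, so the project's definition may differ.

```python
import random
import statistics


def simulate(n, hetero, seed=0):
    """Draw (x, y, sigma) triples with mean 1 + 2x (illustrative choice)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        # Assumed variance function: error sd grows with x when hetero is True.
        sigma = 0.2 * (1.0 + x) if hetero else 1.0
        y = 1.0 + 2.0 * x + rng.gauss(0.0, sigma)
        rows.append((x, y, sigma))
    return rows


def mse(rows, predict):
    """Mean squared prediction error."""
    return statistics.fmean((y - predict(x)) ** 2 for x, y, _ in rows)


def median_abs_std_dev(rows, predict):
    """Assumed form: median of |y - prediction| / sigma (sigma known here)."""
    return statistics.median(abs(y - predict(x)) / s for x, y, s in rows)


def true_mean(x):
    return 1.0 + 2.0 * x


results = {}
for hetero in (False, True):
    rows = simulate(2000, hetero)
    results[hetero] = (mse(rows, true_mean), median_abs_std_dev(rows, true_mean))
    print("heteroscedastic:", hetero, "MSE, standardized median:", results[hetero])
```

Under these assumptions the mean squared error changes with the variance pattern even though the predictor is identical, because observations from high-variance regions dominate the average. The standardized measure stays on a common scale across the two settings.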

Document type: 
Graduating extended essay / Research project
This thesis may be printed or downloaded for non-commercial research and scholarly purposes. Copyright remains with the author.
Thomas Loughin
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.