Statistical Learning Tools for Heteroskedastic Data

Date created: 
Information Criteria
Random Forest
Regression Tree

Many regression procedures are affected by heteroskedasticity, or non-constant variance. A classic solution is to transform the response y and model h(y) instead. Common functions h require a direct relationship between the variance and the mean. Unless the transformation is known in advance, it can be found by applying a model for the variance to the squared residuals from a regression fit. Unfortunately, this approach additionally requires the strong assumption that the regression model for the mean is 'correct', whereas many regression problems involve model uncertainty. Consequently it is undesirable to make the assumption that the mean model can be correctly specified at the outset. An alternative is to model the mean and variance simultaneously, where it is possible to try different mean models and variance models together in different combinations, and to assess the fit of each combination using a single criterion. We demonstrate this approach in three different problems: unreplicated factorials, regression trees, and random forests. For the unreplicated factorial problem, we develop a model for joint identification of mean and variance effects that can reliably identify active effects of both types. The joint model is estimated using maximum likelihood, and effect selection is done using a specially derived information criterion (IC). Our method is capable of identifying sensible location-dispersion models that are not considered by methods that rely on sequential estimation of location and dispersion effects. We take a similar approach to modeling variances in regression trees. We develop an alternative likelihood-based split-selection criterion that has the capacity to account for local variance in the regression in an unstructured manner, and the tree is built using a specially derived IC. Our IC explicitly accounts for the split-selection parameter and our IC also leads to a faster pruning algorithm that does not require crossvalidation. We show that the new approach performs better for mean estimation under heteroskedasticity. Finally we use these likelihood-based trees as base learners in an ensemble much like a random forest, and improve the random forest procedure itself. First, we show that typical random forests are inefficient at fitting flat mean functions. Our first improvement is the novel alpha-pruning algorithm, which adaptively changes the number of observations in the terminal nodes of the regression trees depending on the flatness. Second, we show that random forests are inefficient at estimating means when the data are heteroskedastic, which we address by using our likelihood-based regression trees as a base learner. This allows explicit variance estimation and improved mean estimation under heteroskedasticity. Our unifying and novel contribution to these three problems is the specially derived IC. Our solution is to simulate values of the IC for several models and to store these values in a lookup table. With the lookup table, models can be evaluated and compared without needing either crossvalidation or a holdout set. We call this approach the Corrected Heteroskedastic Information Criterion (CHIC) paradigm and we demonstrate that applying the CHIC paradigm is a principled way to model variance in finite sample sizes.

Document type: 
This thesis may be printed or downloaded for non-commercial research and scholarly purposes. Copyright remains with the author.
Tom Loughin
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.