Resource type
Date created
2020-07-14
Authors/Contributors
Author: Jiang, Haiyang
Abstract
Many Statistical Learning (SL) regression methods have been developed over roughly the last two decades, but no single method has been found to perform best across all data sets. It would be useful to have guidance on when each method can be expected to give more accurate or more precise predictions than its competitors. We speculate that certain measurable features of a data set influence how accurately each method can predict. This thesis explores whether measurable characteristics of a data set can be used to estimate the prediction performance of different SL regression methods. We demonstrate this process on an existing collection of 42 benchmark data sets. On each data set, we measure a variety of properties that might help distinguish regression methods likely to perform well from those likely to perform poorly. Using cross-validation, we measure the actual relative prediction performance of 12 well-known regression methods, including both classical linear techniques and more modern flexible approaches. Finally, we combine the performance measures and the data set properties in a multivariate regression model to identify which properties appear most important and to estimate the expected prediction performance of each method.
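The cross-validation step described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the two stand-in methods (a mean-only predictor and a one-predictor least-squares line) replace the 12 methods studied, the data are synthetic, the function names are hypothetical, and "relative performance" is taken here to mean each method's cross-validated RMSE divided by the best RMSE on the same data set. The final multivariate regression of performance on data set properties is not shown.

```python
import math
import random


def fit_mean(xs, ys):
    """Baseline: always predict the training mean of y."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Simple least-squares line with a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def cv_rmse(fit, xs, ys, k=5, seed=0):
    """k-fold cross-validated RMSE; the seed fixes the fold assignment."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_err = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((ys[i] - predict(xs[i])) ** 2 for i in fold)
    return math.sqrt(sq_err / len(xs))


def relative_performance(methods, xs, ys, k=5, seed=0):
    """RMSE of each method divided by the best RMSE (assumes best > 0).

    All methods are scored on the same folds, so the ratios are comparable.
    """
    rmse = {name: cv_rmse(fit, xs, ys, k, seed) for name, fit in methods.items()}
    best = min(rmse.values())
    return {name: r / best for name, r in rmse.items()}


# Hypothetical stand-ins for the regression methods being compared.
METHODS = {"mean": fit_mean, "linear": fit_line}

# Synthetic data set with a strong linear signal plus noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]

rel = relative_performance(METHODS, xs, ys)
for name, ratio in sorted(rel.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.2f}")
```

Under this assumed measure, a ratio of 1 marks the best method on a data set, and larger values indicate how much worse a method did. Repeating the calculation over many data sets would give one row of ratios per data set, which could then be related to measured data set properties.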
Document
Identifier
etd20930
Copyright statement
Copyright is held by the author.
Scholarly level
Member of collection
| Download file | Size |
|---|---|
| etd20930.pdf | 8.9 MB |