Resource type
Thesis type
(Thesis) M.Sc
Date created
2022-11-21
Authors/Contributors
Author: Sutton, Hannah
Abstract
Random forests are often regarded as black-box machine learning models: they are complex enough that they are not easily interpretable. This has inspired a variety of research into improving the interpretability of random forests, which is the focus of this thesis. Specifically, we wish to capture dissimilarities between random forest trees using several comparison functions on the decision trees that comprise the random forest, allowing the structure of the random forest to be quantified. These include a phylogenetic metric designed for transmission trees, along with others we developed that involve the count and location of variables in each tree and the depths of the trees. The comparison functions allow us to visualize an underlying grouping of the trees using a heatmap and hierarchical clustering, and to analyze the predictive accuracy of the decision tree clusters. Finally, we propose a method for generating random decision trees, which we then use to generate synthetic data from a small set of trees. We use a random forest trained on this data to determine which comparison functions are statistically significant and contribute to the overall clustering. Additionally, we investigate whether the random forest is capable of recovering the original trees from which the data was created.
Document
Extent
69 pages.
Identifier
etd22289
Copyright statement
Copyright is held by the author(s).
Supervisor or Senior Supervisor
Thesis advisor: Colijn, Caroline
Thesis advisor: Elliott, Lloyd
Language
English
Member of collection
| Download file | Size |
|---|---|
| etd22289.pdf | 1.13 MB |