Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays


The analysis of serve decisions in tennis using Bayesian hierarchical models

Author: 
File(s): 
Date created: 
2021-07-07
Supervisor(s): 
Tim Swartz
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.
Abstract: 

Anticipating an opponent’s serve is a salient skill in tennis: a skill that undoubtedly requires hours of deliberate practice to properly hone. Awareness of one’s own serve tendencies is equally important and helps maintain unpredictable serve patterns that keep the returner off balance. This project investigates intended serve direction with Bayesian hierarchical models applied to an extensive, now publicly available data source on professional tennis players at Roland Garros. We find discernible differences between men’s and women’s tennis, and between individual players. General serve tendencies are revealed, such as a preference for serving towards the Body on second serves and on high-pressure points.
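As a minimal illustration of how hierarchical pooling shrinks a player's estimated serve-direction probabilities toward tour-wide tendencies, the sketch below uses a conjugate Dirichlet-multinomial update. All counts, player names, and the prior weight are invented for illustration; the project's full hierarchical model is considerably richer.

```python
# Minimal sketch of hierarchical shrinkage for serve directions, assuming a
# conjugate Dirichlet-multinomial model; counts and names are hypothetical.

DIRECTIONS = ["Wide", "Body", "T"]

# Hypothetical per-player counts of serve directions.
counts = {
    "Player A": [40, 10, 50],
    "Player B": [5, 3, 2],   # few observations -> shrinks toward the pool
}

# Tour-wide pooled counts act as a Dirichlet prior, scaled to a prior weight.
pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
prior_weight = 10.0
alpha = [prior_weight * p / sum(pooled) for p in pooled]

def posterior_mean(player_counts):
    """Posterior mean serve-direction probabilities under Dirichlet(alpha)."""
    n = sum(player_counts)
    return [(c + a) / (n + sum(alpha)) for c, a in zip(player_counts, alpha)]

for player, c in counts.items():
    print(player, [round(p, 3) for p in posterior_mean(c)])
```

Player B's estimate sits between that player's raw proportions and the pooled tour-wide proportions, which is the shrinkage effect a hierarchical model delivers.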

Document type: 
Graduating extended essay / Research project

Factorial designs under baseline parameterization and space-filling designs with applications to big data

Author: 
File(s): 
Date created: 
2021-06-18
Supervisor(s): 
Boxin Tang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.
Abstract: 

This dissertation reports my research work on three topics in the areas of two-level factorial designs under the baseline parameterization, space-filling designs, and sub-data selection for big data. When studying two-level factorial designs, factorial effects are usually given by the orthogonal parameterization. But if each factor has an intrinsic baseline level, the baseline parameterization is a more appropriate alternative. We obtain a relationship between these two types of parameterization, and show that certain design properties are invariant. The relationship also allows us to construct an attractive class of robust baseline designs. We then consider two classes of space-filling designs driven by very different considerations: uniform projection designs and strong orthogonal arrays (SOAs), where the former are obtained by minimizing the uniform projection criterion while the latter are a special kind of orthogonal arrays. We express the uniform projection criterion in terms of the stratification characteristics related to an SOA. This new expression is then used to show that certain SOAs are optimal or nearly optimal under the uniform projection criterion. Finally, we consider the problem of selecting a representative sub-dataset from a big dataset for the purpose of statistical analyses without massive computation. Under the nonparametric regression situation, we present a two-phase selection method, which embodies two important ideas. First, the sub-dataset should be a space-filling subset within the full dataset. Second, in the area where the response surface is more rugged, more data points should be selected. Simulations are conducted to demonstrate the usefulness of our method.
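The space-filling idea behind the first phase of sub-data selection can be sketched with a generic greedy maximin (farthest-point) rule: repeatedly pick the point farthest from everything already chosen. This is only an illustration on invented data, not the dissertation's actual algorithm.

```python
# Greedy maximin (farthest-point) subset selection: a generic sketch of
# picking a space-filling subset from a larger dataset.

import math

def maximin_subset(points, k):
    """Greedily pick k indices, each maximizing its distance to those chosen."""
    chosen = [0]  # start from the first point for determinism
    while len(chosen) < k:
        best, best_d = None, -1.0
        for i, p in enumerate(points):
            if i in chosen:
                continue
            # distance to the nearest already-chosen point
            d = min(math.dist(p, points[j]) for j in chosen)
            if d > best_d:
                best, best_d = i, d
        chosen.append(best)
    return chosen

# A hypothetical 1-D dataset clustered near 0, plus two far points.
data = [(0.0,), (0.1,), (0.2,), (10.0,), (5.0,)]
print(maximin_subset(data, 3))  # -> [0, 3, 4]: the far points get picked
```

The second phase described in the abstract would then add points preferentially where the response surface is rugged, which this sketch does not attempt.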

Document type: 
Thesis

Covariance-adjusted, sparse, reduced-rank regression with adjustment for confounders

Author: 
File(s): 
Date created: 
2021-08-18
Supervisor(s): 
Brad McNeney
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.
Abstract: 

There is evidence that common genetic variation in the gene NEDD9 is associated with developing Alzheimer’s Disease (AD). In this project, we study the relationship between brain-imaging biomarkers of AD and the gene NEDD9 while adjusting for the effects of genetic population structure. The data used in this project, collected by the Alzheimer’s Disease Neuroimaging Initiative (ADNI), consist of magnetic resonance imaging (MRI) measures of 56 brain regions of interest for 200 cognitively normal people and genetic data on single nucleotide polymorphisms (SNPs) obtained from 33 candidate genes for AD. The standard solution to such a multiple-response problem is to fit a separate linear regression model for each response. Such an approach neglects the correlations among the 56 brain regions and possible sparsity in the SNP effects. In this project, we review a sparse, covariance-adjusted reduced-rank regression approach that can select significant predictors and estimate the error covariance simultaneously, and we extend the approach to adjust for confounding variables. We apply the proposed algorithm to the ADNI data as well as to simulated data.

Document type: 
Graduating extended essay / Research project

Q-learning with online trees

File(s): 
Date created: 
2021-08-13
Supervisor(s): 
Lloyd T. Elliott
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.
Abstract: 

Reinforcement learning is one of the major areas of artificial intelligence and has been studied rigorously in recent years. Among numerous methodologies, Q-learning is one of the most fundamental model-free reinforcement learning algorithms, and it has inspired many researchers. Several studies have shown great results by approximating the action-value function, one of the essential elements of Q-learning, with non-linear supervised learning models such as deep neural networks. This combination has led to surpassing human-level performance in complex problems such as Atari games and Go, which had been difficult to solve with standard tabular Q-learning. However, both Q-learning and the deep neural networks typically used as function approximators require very large computational resources to train. To mitigate this, we propose using online random forests as the function approximator for the action-value function. We grow one online random forest for each possible action in a Markov decision process (MDP) environment. Each forest approximates the action-value function for its action, and the agent chooses the action in the succeeding state according to the resulting approximated action-value functions. When the agent executes an action, an observation consisting of the state, action, reward, and subsequent state is stored in an experience replay. Observations are then randomly sampled from the replay to grow the online random forests. For each sample, the terminal nodes of the corresponding trees randomly generate candidate tests for decision tree splits, and the test that gives the lowest residual sum of squares after splitting is selected. Trees grown in this way age each time they take in a sample observation; a tree older than a certain age may then be selected at random and replaced by a new tree according to its out-of-bag error.
In our study, forest size plays an important role. Our algorithm constitutes an adaptation of previously developed online random forests to reinforcement learning. To reduce computational costs, we first grow a small forest and then expand it after a certain number of episodes. In our experiments, this forest-size expansion yielded better performance in later episodes. Furthermore, our method outperformed some deep neural networks in simple MDP environments. We hope this study will promote research on combining reinforcement learning with tree-based methods.
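The tabular Q-learning update that the online forests approximate can be sketched on a toy chain MDP; in the project, the Q-table below is replaced by one online random forest per action. The environment, rewards, and hyperparameters here are all invented for illustration.

```python
# Tabular Q-learning on a toy 4-state chain: moving right from the
# second-to-last state pays reward 1 and ends the episode.

import random

random.seed(0)
n_states = 4
actions = [0, 1]                          # 0 = move left, 1 = move right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, eps = 0.5, 0.9, 0.1         # learning rate, discount, exploration

def step(s, a):
    """Toy chain dynamics: reward 1 only for the final rightward move."""
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if (s == n_states - 2 and a == 1) else 0.0
    return s2, reward, s2 == n_states - 1  # next state, reward, done

for _ in range(500):                       # episodes
    s, done = 0, False
    while not done:
        if random.random() < eps:          # epsilon-greedy exploration
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[s][act])
        s2, r, done = step(s, a)
        # Q-learning update: bootstrap from the greedy value of the next state
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([round(max(q), 2) for q in Q])  # discounting gives 0.81, 0.9, 1.0 along the chain
```

The project's method swaps the `Q` table for per-action online random forests trained on samples from the experience replay, but the update target `r + gamma * max(Q[s2])` is the same.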

Document type: 
Graduating extended essay / Research project

Exploring out of distribution: Deep neural networks and the human brain

Author: 
File(s): 
Date created: 
2021-08-17
Supervisor(s): 
Lloyd T. Elliott
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.
Abstract: 

Deep neural networks have achieved state-of-the-art performance across a wide range of tasks. Convolutional neural networks, with their ability to learn complex spatial features, have surpassed human-level accuracy on many image classification problems. However, these architectures are still often unable to make accurate predictions when the test data distribution differs from that of the training data. In contrast, humans naturally excel at such out-of-distribution generalization. Novel solutions have been developed to improve a deep neural network's ability to handle out-of-distribution data; methods such as Push-Pull and AugMix have improved model robustness and generalization. We assess which of these models achieves the most human-like generalization across a wide variety of image classification tasks, and we identify AugMix as the most human-like deep neural network under our set of benchmarks. Identifying such models sheds light on human cognition and the analogy between neural networks and the human brain. We also show that, contrary to our intuition, transfer learning worsens the performance of Push-Pull.

Document type: 
Graduating extended essay / Research project

On the Bayesian estimation of jump-diffusion models in finance

File(s): 
Date created: 
2021-05-19
Supervisor(s): 
Jean-François Bégin
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.
Abstract: 

The jump-diffusion framework introduced by Duffie et al. (2000) encompasses most one-factor models used in finance. Due to the complexity of this framework, the particle filter (e.g., Hurn et al., 2015; Jacobs & Liu, 2018) and combinations of Gibbs and Metropolis-Hastings samplers (e.g., Eraker et al., 2003; Eraker, 2004) have been the tools of choice for its estimation. However, Bégin & Boudreault (2020) recently showed that the discrete nonlinear filter (DNF) of Kitagawa (1987) can also be used for fast and accurate maximum likelihood estimation of jump-diffusion models. In this project report, we combine the DNF with Markov chain Monte Carlo (MCMC) methods for Bayesian estimation, in the spirit of the particle MCMC algorithm of Andrieu et al. (2010). In addition, we show that derivative prices (i.e., European option prices) can easily be included in the DNF’s likelihood evaluations, which allows for efficient joint Bayesian estimation.
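The idea of pairing a likelihood evaluator with an MCMC sampler can be sketched generically. Below is a minimal random-walk Metropolis-Hastings sampler in which a toy Gaussian log-likelihood stands in for the DNF's likelihood evaluation; the target value 1.3, the proposal scale, and the iteration counts are all invented, and a flat prior is implied.

```python
# Random-walk Metropolis-Hastings with a stand-in log-likelihood; in the
# project the DNF would supply log_lik for the jump-diffusion model.

import math
import random

random.seed(3)

def log_lik(theta):
    """Hypothetical stand-in for a filter-based likelihood evaluation."""
    return -0.5 * (theta - 1.3) ** 2 / 0.2 ** 2

theta, draws = 0.0, []
for _ in range(20000):
    prop = theta + random.gauss(0, 0.3)               # random-walk proposal
    if math.log(random.random()) < log_lik(prop) - log_lik(theta):
        theta = prop                                  # accept the proposal
    draws.append(theta)                               # keep current state

post = draws[5000:]                                   # discard burn-in
print(round(sum(post) / len(post), 1))                # posterior mean near 1.3
```

Replacing `log_lik` with a DNF evaluation of the jump-diffusion likelihood (optionally including option-pricing errors) gives the structure of the estimation scheme the report describes.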

Document type: 
Graduating extended essay / Research project

Numerical approximation algorithms for pension funding

Author: 
File(s): 
Date created: 
2021-08-04
Supervisor(s): 
Jean-François Bégin
Barbara Sanders
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.
Abstract: 

It is difficult to find closed-form optimal decisions in the context of pension plans, so we often need to rely on numerical algorithms to find approximate optimal decisions. In this report, we present two numerical algorithms that can be applied to optimal pension funding problems: value function approximation and grid value approximation. The value function approximation method applies to models with infinite time horizons and approximates the parameters of the value function by minimizing the difference between the true and approximate evaluations of the Hamilton–Jacobi–Bellman (HJB) equation. The grid value approximation method is used for models with finite time horizons; it works iteratively with backward and forward stages and approximates the optimal decisions directly, without using the HJB equation. Numerical results compare approximate and true solutions for the optimal contribution and the share invested in risky assets in classic problems from the pension literature.
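The backward stage of a grid-based method can be sketched on a toy problem. The following minimal dynamic program assumes a made-up deterministic setup: the state is a fund level on a grid, the decision is a contribution level, and the per-period cost penalizes contributions and deviations from a target level. The report's actual models (stochastic returns, a forward stage, richer objectives) go well beyond this.

```python
# Backward induction over a discretized state grid: a toy deterministic
# pension-funding sketch with invented dynamics and costs.

GRID = list(range(11))    # discretized fund levels 0..10
CONTRIBS = [0, 1, 2]      # admissible contribution levels
TARGET, T = 5, 10         # target fund level, time horizon

def running_cost(x, c):
    """Penalize both contributions and deviation from the target level."""
    return c ** 2 + (x - TARGET) ** 2

V = {x: (x - TARGET) ** 2 for x in GRID}       # terminal value function
policy = []
for t in reversed(range(T)):                   # backward stage
    V_new, pol = {}, {}
    for x in GRID:
        best_c, best_v = None, float("inf")
        for c in CONTRIBS:
            x_next = min(max(x + c - 1, 0), 10)   # benefit outflow of 1
            v = running_cost(x, c) + V[x_next]
            if v < best_v:
                best_c, best_v = c, v
        V_new[x], pol[x] = best_v, best_c
    V, policy = V_new, [pol] + policy

print(policy[0][TARGET])  # -> 1: at the target, contribute 1 to offset the outflow
```

A forward stage, as the abstract describes, would then simulate the fund forward under this decision rule; here the optimal rule at the target is simply to hold the fund level steady.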

Document type: 
Graduating extended essay / Research project

Autoregressive mixed effects models and an application to annual income of cancer survivors

Author: 
File(s): 
Date created: 
2021-04-26
Supervisor(s): 
Rachel Altman
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.
Abstract: 

Longitudinal observations of income are often strongly autocorrelated, even after adjusting for independent variables. We explore two common longitudinal models that allow for residual autocorrelation: (1) the autoregressive error model, a linear mixed effects model with an AR(1) covariance structure, and (2) the autoregressive response model, a linear mixed effects model that includes the first lag of the response variable as an independent variable. We examine the theoretical properties of these models and illustrate the behaviour of the parameter estimates in a simulation study. Additionally, we apply the models to a data set containing repeated (annual) observations of income and sociodemographic variables for a sample of breast cancer survivors. Our preliminary results suggest that the autoregressive response model may severely overestimate the magnitude of the effect of cancer. Our findings will guide a future, comprehensive study of the short- and long-term effects of a breast cancer diagnosis on a survivor’s annual net income.
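To make the autoregressive response model concrete, the sketch below simulates a series in which the response depends on its first lag and recovers the AR coefficient by simple least squares on the lagged response. All parameter values are invented, and the covariates and random effects of the full mixed model are omitted.

```python
# Simulate an AR(1)-response series y_t = beta0 + phi * y_{t-1} + e_t and
# recover phi by regressing y_t on y_{t-1} (closed-form simple regression).

import random

random.seed(1)
beta0, phi, sigma = 2.0, 0.6, 0.5      # intercept, AR coefficient, noise sd

y = [beta0 / (1 - phi)]                # start at the stationary mean
for _ in range(5000):
    y.append(beta0 + phi * y[-1] + random.gauss(0, sigma))

x, z = y[:-1], y[1:]                   # lagged response and response
mx, mz = sum(x) / len(x), sum(z) / len(z)
phi_hat = (sum((a - mx) * (b - mz) for a, b in zip(x, z))
           / sum((a - mx) ** 2 for a in x))
print(round(phi_hat, 2))               # close to the true phi of 0.6
```

In the autoregressive error model, by contrast, the lag enters through the covariance of the residuals rather than as a regressor, which is one source of the differing behaviour the abstract reports.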

Document type: 
Graduating extended essay / Research project

Sequence clustering for genetic mapping of binary traits

File(s): 
Date created: 
2021-08-24
Supervisor(s): 
Jinko Graham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.
Abstract: 

Sequence relatedness has potential application to fine-mapping genetic variants contributing to inherited traits. We investigate the utility of genealogical tree-based approaches to fine-mapping causal variants in three projects. In the first project, through coalescent simulation, we compare the ability of several popular association-mapping methods to localize causal variants within a sub-region of a candidate genomic region. We consider four broad classes of association methods, which we describe as single-variant, pooled-variant, joint-modelling, and tree-based, under an additive genetic-risk model. We also investigate whether differentiating case sequences based on their carrier status for a causal variant can improve fine-mapping. Our results support the potential of tree-based methods for genetic fine-mapping of disease. In the second project, we develop an R package to dynamically cluster a set of single-nucleotide variant sequences; the resulting partition structures provide important insight into sequence relatedness. In the third project, we investigate the ability of methods based on sequence relatedness to fine-map rare causal variants and compare them to genotypic association methods. Since the true gene genealogy is unknown in practice, we apply the methods developed in the second project to estimate sequence relatedness. We also pursue reclassifying case sequences by their carrier status using genealogical nearest neighbours. We find that methods based on sequence relatedness are competitive for fine-mapping rare causal variants, and we propose some general recommendations for fine-mapping rare variants in case-control association studies.

Document type: 
Thesis

Post-selection inference

Author: 
File(s): 
Date created: 
2021-04-21
Supervisor(s): 
Richard Lockhart
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.
Abstract: 

Forward stepwise selection is a widely used model selection algorithm. It is, however, hard to do inference for a model that has been cherry-picked by the data. We investigate a post-selection inference method called selective inference. Beginning with very simple examples and working towards more complex ones, we evaluate the method's performance in terms of its power and coverage probability through a simulation study. We investigate the target of inference and the impact of the amount of information used to construct conditional confidence intervals. To achieve the same level of coverage probability, the more conditions we use, the wider the confidence interval, and the effect can be extreme. Moreover, we investigate the impact of multiple conditioning, as well as the importance of the normality assumption on which the underlying theory is based. For models with not very many parameters (p << n), we find that normality is not crucial for the coverage probability of the tests.
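Forward stepwise selection itself is easy to sketch: at each step, add the candidate predictor that most reduces the residual sum of squares. The toy implementation below uses invented data in which only two of four predictors are active; it illustrates the selection stage only, not the selective-inference adjustment the project studies.

```python
# Forward stepwise selection by residual sum of squares, with least squares
# solved via the normal equations (Gaussian elimination); data are invented.

import random

random.seed(2)

def ols_rss(X_cols, y):
    """RSS of least squares of y on the given columns plus an intercept."""
    cols = [[1.0] * len(y)] + X_cols
    k = len(cols)
    # Augmented normal-equations matrix [X'X | X'y]
    A = [[sum(cols[i][t] * cols[j][t] for t in range(len(y))) for j in range(k)]
         + [sum(cols[i][t] * y[t] for t in range(len(y)))] for i in range(k)]
    for i in range(k):                     # elimination with partial pivoting
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    b = [0.0] * k
    for i in reversed(range(k)):           # back substitution
        b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    fit = [sum(b[j] * cols[j][t] for j in range(k)) for t in range(len(y))]
    return sum((yt - ft) ** 2 for yt, ft in zip(y, fit))

# Hypothetical data: y depends on predictors 0 and 2 only.
n, p = 200, 4
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2 * X[0][t] - 1.5 * X[2][t] + random.gauss(0, 1) for t in range(n)]

selected = []
for _ in range(2):                         # two forward steps
    best = min((j for j in range(p) if j not in selected),
               key=lambda j: ols_rss([X[i] for i in selected + [j]], y))
    selected.append(best)
print(selected)                            # the truly active predictors
```

Selective inference then conditions on the event that this data-driven procedure chose exactly these predictors, which is what widens the resulting confidence intervals.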

Document type: 
Graduating extended essay / Research project