Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays

Statistical machine learning in computational genetics

Author: 
Date created: 
2020-07-03
Abstract: 

Statistical machine learning has played a key role in many areas, such as biology, health sciences, finance and genetics. Important tasks in computational genetics include disease prediction, capturing shapes within images, computation of genetic sharing between pairs of individuals, genome-wide association studies and image clustering. This thesis develops several learning methods to address these computational genetics problems. Firstly, motivated by the need for fast computation of genetic sharing among pairs of individuals, we propose the fastest algorithms for computing the kinship coefficients of a set of individuals with a known large pedigree. Moreover, we consider the possibility that the founders of the known pedigree may themselves be inbred and compute the appropriate inbreeding-adjusted kinship coefficients, a case that has not been addressed in the literature. Secondly, motivated by an imaging genetics study of the Alzheimer's Disease Neuroimaging Initiative, we develop a Bayesian bivariate spatial group lasso model for multivariate regression analysis that can be used to examine the influence of genetic variation on brain structure and that accommodates the correlation structures typically seen in structural brain imaging data. We develop a mean-field variational Bayes algorithm and a Gibbs sampling algorithm to fit the model. We also incorporate Bayesian false discovery rate procedures to select SNPs. The new spatial model demonstrates superior performance over a standard model in our application. Thirdly, we propose the Random Tessellation Process (RTP) to model complex genetic data structures to predict disease status. The RTP is a multi-dimensional partitioning tree with non-axis-aligned cuts. We develop a sequential Monte Carlo (SMC) algorithm for inference. Our process is self-consistent and can relax axis-aligned constraints, allowing complex inter-dimensional dependence to be captured.
Fourthly, we propose the Random Tessellation with Splines (RTS) to capture complex shapes within images. The RTS provides a framework for describing Bayesian nonparametric models based on partitioning two-dimensional Euclidean space with splines. We also develop an inference algorithm that is "embarrassingly parallel". Finally, we extend the mixtures of spatial spline regression with mixed-effects model under the Bayesian framework to accommodate streaming image data. We propose an SMC algorithm to analyze brain images in an online fashion.
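
For readers unfamiliar with kinship coefficients, the quantity discussed in the first contribution can be illustrated with the textbook recursion over a pedigree. This is only a minimal sketch, not the fast algorithm the thesis proposes. The function name `make_kinship`, the `parents` dictionary layout, and the `founder_inbreeding` argument are hypothetical, and the sketch assumes that distinct founders are unrelated to each other and that parents are listed before their children.

```python
from functools import lru_cache


def make_kinship(parents, founder_inbreeding=None):
    """Return a function phi(i, j) giving the kinship coefficient of i and j.

    parents maps each individual to (father, mother), or to None for a
    founder. Individuals must be listed with parents before children.
    founder_inbreeding optionally maps founders to inbreeding coefficients
    (assumed 0 when absent). Distinct founders are assumed unrelated.
    """
    founder_inbreeding = founder_inbreeding or {}
    order = {ind: k for k, ind in enumerate(parents)}

    @lru_cache(maxsize=None)
    def phi(i, j):
        # Always recurse through the individual listed later, so that j
        # can never be a descendant of i.
        if order[i] < order[j]:
            i, j = j, i
        if parents[i] is None:
            if i == j:
                # Self-kinship of a (possibly inbred) founder.
                return 0.5 * (1.0 + founder_inbreeding.get(i, 0.0))
            return 0.0
        father, mother = parents[i]
        if i == j:
            # Self-kinship depends on the kinship between the parents.
            return 0.5 * (1.0 + phi(father, mother))
        # Average the kinship of j with each parent of i.
        return 0.5 * (phi(father, j) + phi(mother, j))

    return phi


# Two founders (A, B) with two children (C, D), i.e. full siblings.
pedigree = {"A": None, "B": None, "C": ("A", "B"), "D": ("A", "B")}
phi = make_kinship(pedigree)
phi_inbred = make_kinship(pedigree, founder_inbreeding={"A": 0.5})
```

Here `phi("C", "D")` gives the familiar full-sibling value of 0.25, and making founder A inbred raises it to 0.3125, which shows why ignoring founder inbreeding can understate relatedness.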

Document type: 
Thesis
File(s): 
Supervisor(s): 
Liangliang Wang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Approximate marginal likelihoods for shrinkage parameter estimation in penalized logistic regression analysis of case-control data

Author: 
Date created: 
2020-04-17
Abstract: 

Inference of associations between disease status and rare exposures is complicated by the finite-sample bias of the maximum likelihood estimator for logistic regression. Penalised likelihood methods are useful for reducing such bias. In this project, we study penalisation by a family of log-F priors indexed by a shrinkage parameter m. We propose a method for estimating m based on an approximate marginal likelihood obtained by Laplace approximation. Derivatives of the approximate marginal likelihood for m are challenging to compute, and so we explore several derivative-free optimization approaches to obtaining the maximum marginal likelihood estimate. We conduct a simulation study to evaluate the performance of our method under a variety of data-generating scenarios, and apply the method to real data from a genetic association study of Alzheimer's disease.
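
As a small illustration of the penalty involved, the log-density of a log-F(m, m) prior on a single log odds ratio can be written down directly. This is a sketch under the assumption that the standard log-F(m, m) form is meant, with density exp(m*beta/2) / (B(m/2, m/2) * (1 + exp(beta))^m). The function name `log_f_logpdf` is hypothetical, and the project's Laplace-approximated marginal likelihood is not reproduced here.

```python
import math


def log_f_logpdf(beta, m):
    """Log-density of a log-F(m, m) prior evaluated at a log odds ratio beta."""
    # log B(m/2, m/2) via log-gamma functions.
    log_beta_fn = 2.0 * math.lgamma(m / 2.0) - math.lgamma(m)
    # Numerically stable log(1 + exp(beta)).
    softplus = max(beta, 0.0) + math.log1p(math.exp(-abs(beta)))
    return 0.5 * m * beta - m * softplus - log_beta_fn


def penalised_loglik(loglik, betas, m):
    """Add independent log-F(m, m) penalties to a given log-likelihood value."""
    return loglik + sum(log_f_logpdf(b, m) for b in betas)
```

With m = 2 the prior reduces to the standard logistic density, and larger values of m concentrate the prior more tightly around zero, which is the sense in which m acts as a shrinkage parameter.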

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Brad McNeney
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

A bivariate longitudinal model for psychometric data

Author: 
Date created: 
2020-04-30
Abstract: 

Psychometric test data are useful for predicting a variety of important life outcomes and personality characteristics. The Cognitive Reflection Test (CRT) is a short, well-validated rationality test, designed to assess subjects' ability to override intuitively appealing but incorrect responses to a series of math- and logic-based questions. The CRT is predictive of many other cognitive abilities and tendencies, such as verbal intelligence, numeracy, and religiosity. Cognitive psychologists and psychometricians are concerned with whether subjects improve their scores on the test with repeated exposure, as this may threaten the test's predictive validity. This project uses the first publicly available longitudinal dataset derived from subjects who took the CRT multiple times over a predefined period. The dataset includes a multitude of predictors, including number of previous exposures to the test (our variable of primary interest). Also included are two response variables measured with each test exposure: CRT score and time taken to complete the CRT. These responses serve as a proxy for underlying latent variables, "rationality" and "reflectiveness", respectively. We propose methods to describe the relationship between the responses and selected predictors. Specifically, we employ a bivariate longitudinal model to account for the presumed dependence between our two responses. Our model also allows for subpopulations ("clusters") of individuals whose responses exhibit similar patterns. We estimate the parameters of our one- and two-cluster models via adaptive Gaussian quadrature. We also develop an Expectation-Maximization algorithm for estimating models with greater numbers of clusters. We use our fitted models to address a range of subject-specific questions in a formal way (building on earlier work relying on ad hoc methods). 
In particular, we find that test exposure has a greater estimated effect on test scores than previously reported and we find evidence of at least two subpopulations. Additionally, our work has generated numerous avenues for future investigation.

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Rachel Altman
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Statistical analysis of data from opioid use disorder study

Author: 
Date created: 
2020-04-24
Abstract: 

This project presents statistical analyses of data from a population-based opioid use disorder research program. The primary interest is in estimating the association of a range of demographic, clinical and provider-related characteristics with retention in treatment for opioid use disorders. This focus was motivated by the province’s efforts to respond to the opioid overdose crisis, and by the methodological challenges inherent in analyzing the recurrent nature of opioid use disorder and its treatment episodes. We start with a network analysis to clarify the influence of provider-related characteristics, including individual-, case-mix- and prescriber network-related characteristics, on treatment retention. We observe that the network characteristics have a statistically significant impact on retention in opioid agonist treatment (OAT). Then we use a Cox proportional hazards model with a gamma frailty, while also considering how the ending of the previous episode affects future ones, to begin our investigation into the importance of episode endings. Moreover, we consider three different analyses under multiple scenarios to reach our final goal of analyzing the multi-type events. The OAT episode counts of the study subjects throughout follow-up are analyzed using Poisson regression models. Logistic regression analyses of the records of the OAT episode types are conducted with mixed effects. Lastly, we analyze the OAT episode duration times marginally via an estimating function approach. A robust variance estimator is identified for the estimator of the model parameters. In addition, we conduct a simulation study to verify the findings of the data analysis. The outcomes of the analyses indicate that the OAT episode counts and duration times are significantly associated with a few covariates, such as gender and birth era, and that the relationships vary according to the OAT episode type.

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
X. Joan Hu
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Incorporating statistical clustering methods into mortality models to improve forecasting performances

Author: 
Date created: 
2020-04-09
Abstract: 

Statistical clustering is a procedure for classifying a set of objects such that objects in the same class (called a cluster) are more similar to each other, with respect to some features or characteristics, than to those in other classes. In this project, we apply four clustering approaches to improve the forecasting performance of the Lee-Carter and CBD models. First, each of four clustering methods (Ward's hierarchical clustering, divisive hierarchical clustering, K-means clustering, and Gaussian mixture model clustering) is adopted to determine, based on some characteristics of mortality rates, the number and members of age subgroups from a whole group of ages 25-84. Next, we forecast 10-year and 20-year mortality rates for each of the age subgroups using the Lee-Carter and CBD models, respectively. Finally, numerical illustrations are given with the R packages "NbClust" and "mclust" for clustering. Mortality data for both genders of the US and the UK are obtained from the Human Mortality Database, and the MAPE (mean absolute percentage error) measure is adopted to evaluate forecasting performance. Comparisons of MAPE values are made with and without clustering, which demonstrate that all the proposed clustering methods can improve the forecasting performance of the Lee-Carter and CBD models.
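
The MAPE measure mentioned above is simple to compute. The sketch below uses the usual definition (the average of |actual - forecast| / |actual|, expressed as a percentage). The function name `mape` and the sample mortality rates are made up for illustration and are not taken from the project.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent, of forecasts against actual values."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and of equal length")
    # Each term is the absolute forecast error relative to the actual value.
    relative_errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Hypothetical observed and forecast mortality rates for two ages.
observed = [0.010, 0.020]
predicted = [0.011, 0.018]
error_percent = mape(observed, predicted)  # about 10 percent
```

A lower MAPE for the clustered forecasts than for the unclustered ones is the kind of comparison the abstract describes.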

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Cary Chi-Liang Tsai
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Functional neural networks for scalar prediction

Author: 
Date created: 
2020-04-07
Abstract: 

We introduce a methodology for integrating functional data into densely connected feed-forward neural networks. The model is defined for scalar responses with at least one functional covariate and some number of scalar covariates. A by-product of the method is a set of functional parameters that are dynamic to the learning process, which leads to interpretability. The model is shown to perform well in a number of contexts, including prediction of new data and recovery of the true underlying coefficient function; these results were confirmed through cross-validation and simulation studies. A collection of useful functions is built on top of the Keras/TensorFlow architecture, allowing for general use of the approach.

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Jiguo Cao
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Assessing the performance of an open spatial capture-recapture method on grizzly bear populations when age data is missing

Author: 
Date created: 
2020-02-13
Abstract: 

It is often difficult in capture-recapture (CR) studies of grizzly bear populations to determine the age of detected bears. As a result, analyses often omit age terms in CR models despite past studies suggesting that age influences detection probability. This paper explores how failing to account for age in the detection function of an open, spatially-explicit CR model, introduced in Efford & Schofield (2019), affects estimates of apparent survival, apparent recruitment, population growth, and grizzly bear home-range sizes. Using a simulation study, it was found that estimates of all parameters of interest excluding home-range size were robust to this omission. The effects of using two different types of detectors for data collection (bait sites and rub objects) on bias in estimates of the above parameters were also explored via simulation. No evidence was found that one detector type was more prone to producing biased parameter estimates than the other.

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Steven Thompson
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Optimal investment and consumption strategy for a retiree under stochastic force of mortality

Author: 
Date created: 
2020-01-15
Abstract: 

With the increase in self-driven retirement plans during the past few decades, more and more retirees are managing their retirement portfolios on their own. Therefore, they need to know the optimal amount of consumption they can afford each year, and the optimal proportion of wealth they should invest in the financial market. In this project, we study the optimization strategy proposed by Delong and Chen (2016). Their model determines the optimal consumption and investment strategy for a retiree facing (1) a minimum lifetime consumption, (2) a stochastic force of mortality following a geometric Brownian motion process, (3) an annuity income, and (4) non-exponential discounting of future income. We use a modified version of the Cox, Ingersoll, and Ross (1985) model to capture the stochastic mortality intensity of the retiree and, subsequently, determine a new optimal consumption and investment strategy using their framework. We use an expansion method to solve the classic Hamilton-Jacobi-Bellman equation by perturbing the non-exponential discounting parameter using partial differential equations.
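
To give a feel for a mean-reverting mortality intensity of the Cox, Ingersoll, and Ross type, the sketch below simulates the standard CIR dynamics d(mu) = kappa * (theta - mu) dt + sigma * sqrt(mu) dW with an Euler scheme truncated at zero. The project uses a modified version of this model whose details are not given in the abstract, so the standard form, the function name `simulate_cir`, and all parameter values here are assumptions for illustration only.

```python
import math
import random


def simulate_cir(mu0, kappa, theta, sigma, horizon, steps, seed=0):
    """Simulate one path of a standard CIR process with a truncated Euler scheme.

    mu0: initial intensity; kappa: speed of mean reversion;
    theta: long-run level; sigma: volatility; horizon: length in years.
    """
    rng = random.Random(seed)
    dt = horizon / steps
    path = [mu0]
    for _ in range(steps):
        mu = max(path[-1], 0.0)  # guard the square root against negatives
        drift = kappa * (theta - mu) * dt
        shock = sigma * math.sqrt(mu * dt) * rng.gauss(0.0, 1.0)
        # Truncate at zero so the simulated intensity stays non-negative.
        path.append(max(mu + drift + shock, 0.0))
    return path


def survival_probability(path, horizon):
    """Approximate exp(-integral of the intensity) along one simulated path."""
    dt = horizon / (len(path) - 1)
    return math.exp(-sum(path[:-1]) * dt)


# Hypothetical parameters: a 30-year horizon with monthly steps.
path = simulate_cir(mu0=0.01, kappa=0.3, theta=0.04, sigma=0.05, horizon=30.0, steps=360)
```

Unlike geometric Brownian motion, this process is pulled back toward the long-run level `theta`, which is one common motivation for choosing a CIR-type intensity.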

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Jean-François Bégin
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Foul Accumulation in the NBA

Author: 
Date created: 
2020-01-13
Abstract: 

This project investigates the fouling time distribution of players in the National Basketball Association. A Bayesian analysis is presented based on the assumption that fouling times follow a Gamma distribution. Various insights are obtained including the observation that players accumulate their nth foul more quickly for increasing n. Methods are developed that will allow coaches to better manage playing time in the presence of fouls such that key players are available in the latter stages of matches.

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Tim Swartz
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Prediction for Canadian federal election aided by Canadian Community Health Survey

Author: 
Date created: 
2019-09-05
Abstract: 

This project aims to develop predictive models for Canadian federal elections. We begin with explanatory analyses of two sets of data: some publicly accessible election data and data extracted from the Canadian Community Health Survey (CCHS) 2007-2018 on life satisfaction and other potentially associated socio-demographics. We propose to predict federal election outcomes using the information on longitudinal Canadian life satisfaction. Specifically, we model the federal election outcome for a riding, expressed as the change from its previous election, jointly with its longitudinal life satisfaction since the previous election. Election data from the years 2008 and 2011 and the CCHS data of 2008-2011 are employed to fit the model via both two-stage estimation and maximum likelihood estimation by the Monte Carlo EM algorithm. The analysis results indicate that life satisfaction is an important factor in election prediction. It appears that young adults are more likely to vote for a change but male voters are less likely to do so. Using voter information or CCHS respondents' information to model the election outcomes produces different estimation results. Two applications are presented to further illustrate the proposed joint modeling approach.

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
X. Joan Hu
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.