Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays

Statistical analysis of event times with missing origins aided by auxiliary information, with application to wildfire management

Author: 
Date created: 
2020-08-20
Abstract: 

Motivated partly by an analysis of lightning-caused wildfire data from Alberta, this dissertation develops statistical methodology for analyzing event times with missing origins, aided by auxiliary information such as associated longitudinal measures and other relevant information available prior to the time origin. We begin with an analysis of the motivating data to estimate the distribution of the time to initial attack since a wildfire starts burning with flames, i.e., the duration between the start time and the initial attack time of a fire, using two conventional approaches: one neglects the missing origin and performs inference on the observed portion of the duration, and the other views the observation on the event time of interest as interval-censored with a pre-determined interval. The counterintuitive or non-informative results of this preliminary analysis lead us to propose new approaches to tackling the issue of the missing origin. To facilitate methodological development, we first consider estimation of the duration distribution with independently and identically distributed (iid) observations. We link the unobserved time origin to the available longitudinal measures of burnt area via a first-hitting-time model. This yields an intuitive and easy-to-implement adaptation of the empirical distribution function to the event time data. We establish consistency and weak convergence of the proposed estimator and present its variance estimation. We then extend the proposed approach to studying the association of the duration with a list of potential risk factors. A semi-parametric accelerated failure time (AFT) regression model is considered together with a Wiener process model with random drift for the longitudinal measures. Further, we accommodate the potential spatial correlation of the wildfires by specifying the drift of the Wiener process as a function of covariates and spatially correlated random effects. Moreover, we propose a method to aid the estimation of the duration distribution with lightning data. It leads to an alternative approach to estimating the duration distribution by adapting the Turnbull estimator for interval-censored observations. A prominent byproduct of this approach is an estimation procedure for the distribution of the ignition time using all the lightning data together with the sub-sampled data. The finite-sample performance of the proposed approaches is examined via simulation studies. We use the motivating Alberta wildfire data to illustrate the proposed approaches throughout the thesis. The data analyses and simulation studies show that the two conventional approaches could give rise to misleading inference under the current data structure. The proposed approaches provide intuitive, easy-to-implement alternatives for the analysis of event times with missing origins. We anticipate that the methodology has many applications in practice, such as in infectious disease research.
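The following is a minimal illustrative sketch (not the estimator developed in the thesis) of the first-hitting-time idea on simulated data: if the burnt area is taken to grow roughly linearly at a known Wiener drift rate from the unknown origin, a single area measurement at the report time can be inverted to impute the origin, and the empirical distribution function of the resulting durations estimates the duration distribution. All variable names and values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical quantities (not the thesis's data): for each fire we observe
# the report time, a burnt-area measurement at report, and the initial-attack time.
n = 500
t_origin = rng.uniform(0, 10, n)               # true (unobserved) ignition times
mu = 1.5                                       # drift of the Wiener growth model
t_report = t_origin + rng.exponential(1.0, n)  # fire detected some time after ignition
area_at_report = (mu * (t_report - t_origin)
                  + 0.3 * np.sqrt(t_report - t_origin) * rng.standard_normal(n))
duration = 2.0 * rng.weibull(1.8, n)           # true duration: origin -> initial attack
t_attack = t_origin + duration

# Back out the origin from the Wiener-with-drift growth model, E[area] = mu * elapsed time,
# so an imputed origin is (report time) - area / mu.
t_origin_hat = t_report - np.maximum(area_at_report, 0.0) / mu
duration_hat = t_attack - t_origin_hat

# Adapted empirical distribution function of the duration, evaluated on a grid.
grid = np.linspace(0, duration_hat.max(), 200)
F_hat = (duration_hat[:, None] <= grid[None, :]).mean(axis=0)
print(F_hat[::50])
```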

Document type: 
Thesis
File(s): 
Supervisor(s): 
Joan Hu
John Braun
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Modeling human decision-making in spatial and temporal systems

Author: 
Date created: 
2020-08-20
Abstract: 

In this thesis, we analyze three applications of human decision-making in spatial and temporal environments. The first two projects are statistical applications to basketball, while the third project analyzes an experiment that aims to understand decision-making processes in games. The first project explores how efficiently players in a basketball lineup collectively allocate shots. We propose a new metric for allocative efficiency by comparing a player's field goal percentage (FG%) to their field goal attempt (FGA) rate in the context of both their four teammates on the court and the spatial distribution of their shots. Leveraging publicly available data provided by the National Basketball Association (NBA), we estimate player FG% at every location in the offensive half court using a Bayesian hierarchical model. By ordering a lineup's estimated FG%s and pairing these rankings with the lineup's empirical FGA rate rankings, we detect areas where the lineup exhibits inefficient shot allocation. Lastly, we analyze the impact that suboptimal shot allocation has on a team's overall offensive potential, finding that inefficient shot allocation correlates with reduced scoring. In the second basketball application, we model basketball plays as episodes from team-specific nonstationary Markov decision processes (MDPs) with shot-clock-dependent transition probabilities. Bayesian hierarchical models are employed in the parametrization of the transition probabilities to borrow strength across players and through time. To ensure computational feasibility, we combine lineup-specific MDPs into team-average MDPs using a novel transition weighting scheme. Specifically, we derive the dynamics of the team-average process such that the expected transition count for an arbitrary state pair equals the weighted sum of the expected counts of the separate lineup-specific MDPs. We then utilize these nonstationary MDPs in the creation of a basketball play simulator with uncertainty propagated via posterior samples of the model components. After calibration, we simulate seasons both on-policy and under altered policies and explore the net changes in efficiency and production under the alternate policies. We also discuss the game-theoretic ramifications of testing alternative decision policies. For the final project, we take a different perspective on the behavior of the decision-makers. Broadly speaking, both basketball projects assume the agents (players) act sub-optimally, and the goal of the analyses is to evaluate the impact their suboptimal behavior has on point production and scoring efficiency. By contrast, in the final project we assume that the agents' actions are optimal, but that the criteria over which they optimize are unknown. The goal of the analysis is to make inference on these latent optimization criteria. This type of problem can be termed an inverse decision problem. Our project explores the inverse problem of Bayesian optimization. Specifically, we seek to estimate an agent's latent acquisition function based on their observed search paths. After introducing a probabilistic solution framework for the problem, we illustrate our method by analyzing human behavior from an experiment. The experiment was designed to force subjects to balance exploration and exploitation in search of a global optimum. We find that subjects exhibit a wide range of acquisition preferences; however, some subjects' behavior does not map well to any of the candidate acquisition functions we consider. Guided by the model discrepancies, we augment the candidate acquisition functions to yield a superior fit to the human behavior in this task.
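As an illustration of the transition-weighting idea described above, the sketch below (hypothetical lineups, states, and visit counts; not the thesis's parametrization) weights each lineup's transition row for a state by that lineup's share of expected visits to the state, so that the team-average expected transition counts equal the sum of the lineup-specific expected counts.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: 3 lineups, 5 states; P[l] are lineup-specific transition matrices
# and visits[l, s] is each lineup's expected number of visits to state s.
n_lineups, n_states = 3, 5
P = rng.dirichlet(np.ones(n_states), size=(n_lineups, n_states))  # P[l, s, :] sums to 1
visits = rng.poisson(50, size=(n_lineups, n_states)).astype(float)

# Weight each lineup's row for state s by its share of visits to s, so that the
# team-average expected transition count for (s, s') equals the sum of the
# lineup-specific expected counts: sum_l visits[l, s] * P[l, s, s'].
w = visits / visits.sum(axis=0, keepdims=True)   # w[l, s]: lineup l's share of visits to s
P_team = np.einsum('ls,lst->st', w, P)           # team-average transition matrix

assert np.allclose(P_team.sum(axis=1), 1.0)
expected_counts_team = visits.sum(axis=0)[:, None] * P_team
expected_counts_sum = np.einsum('ls,lst->st', visits, P)
assert np.allclose(expected_counts_team, expected_counts_sum)
```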

Document type: 
Thesis
File(s): 
Supervisor(s): 
Luke Bornn
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Understanding jump dynamics using liquidity measures

Author: 
Date created: 
2020-07-15
Abstract: 

Numerous past studies investigate the relationship between volatility and other relevant variables, e.g., asset jumps and liquidity factors. However, empirical studies examining the link between liquidity and jumps are almost non-existent. In this report, we investigate the possible improvement in estimating so-called jump distribution parameters computed from intraday returns by including liquidity measures. More specifically, we first calculate the jump distribution parameters using classic jump detection techniques in the spirit of Lee and Mykland (2008) and Tauchen and Zhou (2011), and we then use them as responses in the heterogeneous autoregressive (HAR) model (e.g., Corsi, 2009). We examine the in-sample performance of our model and find that liquidity measures do provide extra information in the estimation of the jump intensity and the jump size variation. We also apply the same technique using one-period-ahead instead of contemporaneous responses; we again find extra explanatory power when the liquidity measures are included.
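A minimal sketch of a HAR-type regression augmented with a liquidity measure, on simulated series rather than the report's intraday data; the jump response, the liquidity proxy, and the lag windows (1, 5, 22 days) are assumptions for illustration only.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical daily series: a jump-based response and a liquidity measure.
T = 1000
jump_var = np.abs(rng.standard_normal(T))    # e.g., daily jump size variation
liquidity = np.abs(rng.standard_normal(T))   # e.g., a daily liquidity proxy (bid-ask spread)

def lagged_mean(x, window):
    """Rolling mean of the previous `window` observations (HAR-style regressor)."""
    out = np.full_like(x, np.nan)
    for t in range(window, len(x)):
        out[t] = x[t - window:t].mean()
    return out

# HAR components: daily, weekly (5-day), and monthly (22-day) averages of the response,
# augmented with the lagged liquidity measure as an extra regressor.
X = np.column_stack([
    lagged_mean(jump_var, 1),
    lagged_mean(jump_var, 5),
    lagged_mean(jump_var, 22),
    lagged_mean(liquidity, 1),
])
keep = ~np.isnan(X).any(axis=1)
fit = sm.OLS(jump_var[keep], sm.add_constant(X[keep])).fit()
print(fit.params, fit.rsquared)
```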

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Jean-François Bégin
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Supervised basis functions applied to functional regression and classification

Author: 
Date created: 
2020-07-29
Abstract: 

In fitting functional linear models, including scalar-on-function regression (SoFR) and function-on-function regression (FoFR), the intrinsically infinite dimension of the problem often demands a restriction to a subspace spanned by a finite number of basis functions. In this sense, the choice and construction of basis functions matter. We discuss herein certain supervised choices of basis functions for regression and classification with densely or sparsely observed curves, from both numerical and theoretical perspectives. For SoFR, functional principal component (FPC) regression may fail to provide good estimation or prediction if the response is highly correlated with some excluded FPCs. This is not rare, since the construction of FPCs never involves the response. We hence develop regression on functional continuum (FC) basis functions, a framework that includes, as special cases, both FPC and functional partial least squares (FPLS) basis functions. Aiming at binary classification of functional data, we then propose the continuum centroid classifier (CCC), built upon projections of functional data onto the direction parallel to the FC regression coefficient. One of the two subtypes of CCC asymptotically achieves zero misclassification. Implementation of FPLS traditionally demands that each predictor curve be recorded as densely as possible over the entire time span. This prerequisite is sometimes violated, e.g., in longitudinal studies and missing data problems. We adapt FPLS for SoFR to scenarios where curves are sparsely observed. We establish the consistency of the proposed estimators and give confidence intervals for responses. FPLS is also widely used to fit FoFR. Its implementation is far from unique but typically involves iterative eigendecomposition. We introduce a new route for FoFR based upon Krylov subspaces. The method can be expressed in two equivalent forms: one is non-iterative with explicit forms for estimators and predictions, facilitating the theoretical derivation; the other stabilizes the numerical outputs. Our route turns out to be less time-consuming than other methods, with competitive accuracy.
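For concreteness, a small sketch of the FPC regression baseline discussed above (not the FC or FPLS methodology of the thesis), using simulated densely observed curves: project centered curves onto leading principal components and regress the scalar response on the scores.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical densely observed curves on a common grid (not the thesis's data).
n, p = 200, 100
tgrid = np.linspace(0, 1, p)
X = np.array([np.sin(2 * np.pi * tgrid * rng.uniform(0.5, 2)) + 0.1 * rng.standard_normal(p)
              for _ in range(n)])
beta = np.cos(2 * np.pi * tgrid)                 # true coefficient function
y = X @ beta / p + 0.1 * rng.standard_normal(n)  # scalar-on-function responses

# Functional principal component (FPC) regression: project centered curves onto the
# leading eigenfunctions of the sample covariance, then regress y on the FPC scores.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 4                                            # number of retained FPCs
scores = Xc @ Vt[:k].T                           # FPC scores
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), scores]), y, rcond=None)
print(np.round(coef, 3))
```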

Document type: 
Thesis
File(s): 
Supervisor(s): 
Richard A. Lockhart
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Understanding and estimating predictive performance of statistical learning methods based on data properties

Author: 
Date created: 
2020-07-14
Abstract: 

Many Statistical Learning (SL) regression methods have been developed over roughly the last two decades, but no one model has been found to be the best across all sets of data. It would be useful if guidance were available to help identify when each different method might be expected to provide more accurate or precise predictions than competitors. We speculate that certain measurable features of a data set might influence methods' potential ability to provide relatively accurate predictions. This thesis explores the potential to use measurable characteristics of a data set to estimate the prediction performance of different SL regression methods. We demonstrate this process on an existing set of 42 benchmark data sets. We measure a variety of properties on each data set that might be useful for differentiating between likely good- or poor-performing regression methods. Using cross-validation, we measure the actual relative prediction performance of 12 well-known regression methods, including both classical linear techniques and more modern flexible approaches. Finally, we combine the performance measures and the data set properties into a multivariate regression model to identify which properties appear to be most important and to estimate the expected prediction performance of each method.
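A minimal sketch of one step of this pipeline, using a synthetic data set and only three of the regression methods; the particular data-set properties computed here are illustrative assumptions, not the thesis's full list.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor

# Measure simple properties of a data set and the cross-validated error of several
# regression methods (here a synthetic data set rather than the 42 benchmark sets).
X, y = make_regression(n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=0)

properties = {
    "n": X.shape[0],
    "p": X.shape[1],
    "n_over_p": X.shape[0] / X.shape[1],
    "mean_abs_corr": np.mean(np.abs(np.corrcoef(X, rowvar=False)[np.triu_indices(X.shape[1], k=1)])),
}

methods = {
    "ols": LinearRegression(),
    "lasso": Lasso(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
performance = {name: -cross_val_score(m, X, y, cv=10, scoring="neg_mean_squared_error").mean()
               for name, m in methods.items()}
print(properties)
print(performance)

# Repeating this across many data sets yields a table of (properties, per-method errors)
# that can itself be modeled with a multivariate regression.
```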

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Thomas Loughin
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Statistical machine learning in computational genetics

Author: 
Date created: 
2020-07-03
Abstract: 

Statistical machine learning has played a key role in many areas, such as biology, health sciences, finance, and genetics. Important tasks in computational genetics include disease prediction, capturing shapes within images, computation of genetic sharing between pairs of individuals, genome-wide association studies, and image clustering. This thesis develops several learning methods to address these computational genetics problems. First, motivated by the need for fast computation of genetic sharing among pairs of individuals, we propose the fastest algorithms for computing the kinship coefficients of a set of individuals with a known large pedigree. Moreover, we consider the possibility that the founders of the known pedigree may themselves be inbred and compute the appropriate inbreeding-adjusted kinship coefficients, which has not previously been addressed in the literature. Second, motivated by an imaging genetics study of the Alzheimer's Disease Neuroimaging Initiative, we develop a Bayesian bivariate spatial group lasso model for multivariate regression analysis, applicable to examining the influence of genetic variation on brain structure while accommodating the correlation structures typically seen in structural brain imaging data. We develop a mean-field variational Bayes algorithm and a Gibbs sampling algorithm to fit the model, and incorporate Bayesian false discovery rate procedures to select SNPs. The new spatial model demonstrates superior performance over a standard model in our application. Third, we propose the Random Tessellation Process (RTP) to model complex genetic data structures for predicting disease status. The RTP is a multi-dimensional partitioning tree with non-axis-aligned cuts. We develop a sequential Monte Carlo (SMC) algorithm for inference. Our process is self-consistent and relaxes axis-aligned constraints, allowing complex inter-dimensional dependence to be captured. Fourth, we propose the Random Tessellation with Splines (RTS) to capture complex shapes within images. The RTS provides a framework for describing Bayesian nonparametric models based on partitioning two-dimensional Euclidean space with splines. We also develop an inference algorithm that is "embarrassingly parallel". Finally, we extend the mixtures of spatial spline regressions with mixed effects model, under a Bayesian framework, to accommodate streaming image data, and propose an SMC algorithm to analyze brain images in an online fashion.
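For context, a sketch of the classical recursive kinship computation on a small hypothetical pedigree; it is not the faster algorithms proposed in the thesis and omits the inbred-founder adjustment.

```python
from functools import lru_cache

# Hypothetical pedigree: ids map to (father, mother); None marks a founder parent.
# Individuals are numbered so that parents always precede their children.
pedigree = {
    1: (None, None), 2: (None, None), 3: (None, None),
    4: (1, 2), 5: (1, 2), 6: (3, 4), 7: (5, 6),
}

@lru_cache(maxsize=None)
def kinship(i, j):
    """Kinship coefficient phi(i, j) via the standard pedigree recursion."""
    if i is None or j is None:
        return 0.0
    if i == j:
        f, m = pedigree[i]
        return 0.5 * (1.0 + kinship(f, m))
    if i > j:                       # recurse on the younger individual's parents
        i, j = j, i
    f, m = pedigree[j]
    return 0.5 * (kinship(i, f) + kinship(i, m))

print(kinship(4, 5))   # full siblings of unrelated founders: 0.25
print(kinship(7, 7))   # self-kinship exceeds 0.5 because 7's parents are related
```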

Document type: 
Thesis
File(s): 
Supervisor(s): 
Liangliang Wang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Approximate marginal likelihoods for shrinkage parameter estimation in penalized logistic regression analysis of case-control data

Author: 
Date created: 
2020-04-17
Abstract: 

Inference of associations between disease status and rare exposures is complicated by the finite-sample bias of the maximum likelihood estimator for logistic regression. Penalised likelihood methods are useful for reducing such bias. In this project, we study penalisation by a family of log-F priors indexed by a shrinkage parameter m. We propose a method for estimating m based on an approximate marginal likelihood obtained by Laplace approximation. Derivatives of the approximate marginal likelihood with respect to m are challenging to compute, so we explore several derivative-free optimization approaches to obtaining the maximum marginal likelihood estimate. We conduct a simulation study to evaluate the performance of our method under a variety of data-generating scenarios, and apply the method to real data from a genetic association study of Alzheimer's disease.
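A rough sketch of the approach on simulated data, assuming the log-F(m, m) log-prior takes the form (m/2)β − m log(1 + e^β) per coefficient (plus its normalizing constant), using a BFGS curvature estimate in the Laplace approximation, and applying a bounded derivative-free search over m; the details differ from the project's implementation.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.special import betaln

rng = np.random.default_rng(5)

# Hypothetical case-control-style data (not the project's study data); no intercept handling.
n, p = 200, 3
X = rng.standard_normal((n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.5 * X[:, 0])))

def penalized_negloglik(beta, m):
    """Negative logistic log-likelihood plus assumed log-F(m, m) penalty on each coefficient."""
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    prior = np.sum(0.5 * m * beta - m * np.log1p(np.exp(beta)) - betaln(0.5 * m, 0.5 * m))
    return -(loglik + prior)

def approx_log_marginal(m):
    """Laplace approximation: penalized mode plus curvature at the mode."""
    fit = minimize(penalized_negloglik, np.zeros(p), args=(m,), method="BFGS")
    sign, logdet_inv = np.linalg.slogdet(fit.hess_inv)  # BFGS approximation to inverse Hessian
    return -fit.fun + 0.5 * p * np.log(2 * np.pi) + 0.5 * logdet_inv

# Derivative-free maximization of the approximate marginal likelihood over m.
res = minimize_scalar(lambda m: -approx_log_marginal(m), bounds=(0.1, 20.0), method="bounded")
print("estimated shrinkage parameter m:", res.x)
```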

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Brad McNeney
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

A bivariate longitudinal model for psychometric data

Author: 
Date created: 
2020-04-30
Abstract: 

Psychometric test data are useful for predicting a variety of important life outcomes and personality characteristics. The Cognitive Reflection Test (CRT) is a short, well-validated rationality test designed to assess subjects' ability to override intuitively appealing but incorrect responses to a series of math- and logic-based questions. The CRT is predictive of many other cognitive abilities and tendencies, such as verbal intelligence, numeracy, and religiosity. Cognitive psychologists and psychometricians are concerned with whether subjects improve their scores on the test with repeated exposure, as this may threaten the test's predictive validity. This project uses the first publicly available longitudinal dataset derived from subjects who took the CRT multiple times over a predefined period. The dataset includes a multitude of predictors, including the number of previous exposures to the test (our variable of primary interest). Also included are two response variables measured at each test exposure: CRT score and time taken to complete the CRT. These responses serve as proxies for the underlying latent variables "rationality" and "reflectiveness", respectively. We propose methods to describe the relationship between the responses and selected predictors. Specifically, we employ a bivariate longitudinal model to account for the presumed dependence between our two responses. Our model also allows for subpopulations ("clusters") of individuals whose responses exhibit similar patterns. We estimate the parameters of our one- and two-cluster models via adaptive Gaussian quadrature, and develop an Expectation-Maximization algorithm for estimating models with more clusters. We use our fitted models to address a range of subject-specific questions in a formal way, building on earlier work that relied on ad hoc methods. In particular, we find that test exposure has a greater estimated effect on test scores than previously reported, and we find evidence of at least two subpopulations. Additionally, our work has generated numerous avenues for future investigation.
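A small sketch of the marginal-likelihood computation for a shared random intercept using ordinary (non-adaptive) Gauss-Hermite quadrature on simulated bivariate responses; the model form, parameter names, and data are illustrative assumptions, not the project's specification, and in practice this function would be passed to a numerical optimizer.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.stats import norm

rng = np.random.default_rng(11)

# Hypothetical bivariate longitudinal data (not the CRT data): for each subject,
# two responses per occasion (a score and a log completion time) sharing a
# subject-level random intercept b_i ~ N(0, sigma_b^2).
n_subj, n_occ, sigma_b = 50, 4, 0.8
b = sigma_b * rng.standard_normal(n_subj)
exposure = np.tile(np.arange(n_occ), (n_subj, 1))
score = 1.0 + 0.3 * exposure + b[:, None] + 0.5 * rng.standard_normal((n_subj, n_occ))
logtime = 2.0 - 0.2 * exposure - 0.7 * b[:, None] + 0.4 * rng.standard_normal((n_subj, n_occ))

nodes, weights = hermgauss(15)        # Gauss-Hermite rule for the random-effect integral

def marginal_loglik(theta):
    """Marginal log-likelihood, integrating out b_i by Gauss-Hermite quadrature."""
    b0, b1, c0, c1, lam, s1, s2, sb = theta
    total = 0.0
    bq = np.sqrt(2.0) * sb * nodes    # quadrature points on the N(0, sb^2) scale
    for i in range(n_subj):
        # per-quadrature-point likelihood contributions for both responses
        l1 = norm.pdf(score[i][:, None], b0 + b1 * exposure[i][:, None] + bq[None, :], s1)
        l2 = norm.pdf(logtime[i][:, None], c0 + c1 * exposure[i][:, None] + lam * bq[None, :], s2)
        total += np.log(np.sum(weights * np.prod(l1 * l2, axis=0)) / np.sqrt(np.pi))
    return total

print(marginal_loglik(np.array([1.0, 0.3, 2.0, -0.2, -0.7, 0.5, 0.4, 0.8])))
```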

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Rachel Altman
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Statistical analysis of data from opioid use disorder study

Author: 
Date created: 
2020-04-24
Abstract: 

This project presents statistical analyses of data from a population-based opioid use disorder research program. The primary interest is in estimating the association of a range of demographic, clinical, and provider-related characteristics with retention in treatment for opioid use disorder. This focus was motivated by the province's efforts to respond to the opioid overdose crisis, and by the methodological challenges inherent in analyzing the recurrent nature of opioid use disorder and its treatment episodes. We start by conducting a network analysis to clarify the influence of provider-related characteristics, including individual-, case-mix-, and prescriber-network-related characteristics, on retention in opioid agonist treatment (OAT). We observe that the network characteristics have a statistically significant impact on OAT retention. We then use a Cox proportional hazards model with a gamma frailty, considering how the ending of a previous episode affects subsequent ones, to begin our investigation into the importance of episode endings. Moreover, we consider three different analyses under multiple scenarios to reach our final goal of analyzing the multi-type events. The OAT episode counts of the study subjects over follow-up are analyzed using Poisson regression models. Logistic regression analyses of the records of OAT episode types are conducted with mixed effects. Lastly, we analyze the OAT episode duration times marginally via an estimating function approach, and identify a robust variance estimator for the estimator of the model parameters. In addition, we conduct a simulation study to verify the findings of the data analysis. The analyses indicate that the OAT episode counts and duration times are significantly associated with a few covariates, such as gender and birth era, and that these relationships vary across the OAT episode types.
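As one illustrative piece of such an analysis, a sketch of a Poisson regression of episode counts with a log follow-up offset on simulated data; the covariates and effect sizes are hypothetical, and the frailty, mixed-effects, and marginal duration analyses are not shown.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Hypothetical analysis data set (not the study data): OAT episode counts per subject
# with gender, birth era, and follow-up length as an exposure offset.
n = 500
gender = rng.binomial(1, 0.5, n)
birth_era = rng.integers(0, 3, n)            # e.g., three birth cohorts
followup_years = rng.uniform(1, 5, n)
rate = np.exp(-0.5 + 0.3 * gender + 0.2 * birth_era)
counts = rng.poisson(rate * followup_years)

# Poisson regression of episode counts, with log follow-up time as an offset.
X = sm.add_constant(np.column_stack([gender, birth_era]))
fit = sm.GLM(counts, X, family=sm.families.Poisson(), offset=np.log(followup_years)).fit()
print(fit.summary())
```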

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
X. Joan Hu
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Incorporating statistical clustering methods into mortality models to improve forecasting performances

Author: 
Date created: 
2020-04-09
Abstract: 

Statistical clustering is a procedure for classifying a set of objects such that objects in the same class (called a cluster) are more similar to each other, with respect to some features or characteristics, than to those in other classes. In this project, we apply four clustering approaches to improve the forecasting performance of the Lee-Carter and CBD models. First, each of the four clustering methods (Ward's hierarchical clustering, divisive hierarchical clustering, K-means clustering, and Gaussian mixture model clustering) is adopted to determine, based on some characteristics of mortality rates, the number and membership of age subgroups within the whole age group 25-84. Next, we forecast 10-year and 20-year mortality rates for each of the age subgroups using the Lee-Carter and CBD models. Finally, numerical illustrations are given with the R packages "NbClust" and "mclust" for clustering. Mortality data for both genders of the US and the UK are obtained from the Human Mortality Database, and the MAPE (mean absolute percentage error) measure is adopted to evaluate forecasting performance. Comparisons of MAPE values with and without clustering demonstrate that all the proposed clustering methods can improve the forecasting performance of the Lee-Carter and CBD models.
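A small sketch of the clustering-then-evaluation idea on a simulated mortality surface (standing in for the HMD data), using K-means from scikit-learn rather than the R packages named above; the features used for clustering and the MAPE helper are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

# Hypothetical log central death rates for ages 25-84 over a span of years.
ages = np.arange(25, 85)
years = np.arange(1970, 2000)
log_mx = (-9.0 + 0.09 * (ages[:, None] - 25)        # level increases with age
          - 0.015 * (years[None, :] - 1970)          # overall improvement over time
          + 0.05 * rng.standard_normal((ages.size, years.size)))

# Cluster ages on simple characteristics of their mortality curves:
# the average level and the average annual improvement of log m_x.
features = np.column_stack([log_mx.mean(axis=1), np.diff(log_mx, axis=1).mean(axis=1)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print({k: ages[labels == k].tolist() for k in range(3)})

def mape(actual, forecast):
    """Mean absolute percentage error used to compare forecasts with and without clustering."""
    return np.mean(np.abs((forecast - actual) / actual)) * 100

# A Lee-Carter or CBD model would then be fitted within each age cluster and its
# forecasts scored with mape(); that step is omitted here.
```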

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Cary Chi-Liang Tsai
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.