Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays

Quarterback evaluation in the National Football League

Author: 
Date created: 
2020-08-20
Abstract: 

This project evaluates quarterback performance in the National Football League. With the availability of player tracking data, it is now possible to assess the various options available to a quarterback on a given play and the expected points resulting from each option. The quarterback's execution is then measured against the optimal available option. Because decision making does not depend on the quality of a quarterback's teammates, the resulting metric provides a novel perspective on an understudied aspect of quarterback assessment.
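
As a rough illustration of the idea, the sketch below scores a single play by comparing the expected points of the option the quarterback chose against the best available option. The option names and expected-point values are hypothetical, and the abstract does not specify how the metric aggregates across plays; averaging is one plausible choice.

```python
import numpy as np

def decision_score(expected_points, chosen):
    """Expected points of the chosen option minus the best available
    option; 0 means the quarterback chose optimally."""
    return expected_points[chosen] - max(expected_points.values())

# One hypothetical play: expected points for each option, as a
# tracking-data model might estimate them at the moment of the throw.
play = {"receiver_A": 0.85, "receiver_B": 1.40, "scramble": 0.30,
        "throwaway": -0.10}

print(f"{decision_score(play, 'receiver_A'):+.2f}")  # -0.55: suboptimal read

# A season-level decision metric could average these per-play scores,
# independent of whether teammates convert the chosen option.
scores = [decision_score(play, "receiver_A"), decision_score(play, "receiver_B")]
print(f"average: {np.mean(scores):+.2f}")
```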

Document type: 
Graduating extended essay / Research project
Supervisor(s): 
Tim Swartz
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Curating and combining big data from genetic studies

Author: 
Date created: 
2020-08-19
Abstract: 

Big data curation is often underappreciated by users of processed data. With the development of high-throughput genotyping technology, large-scale genome-wide data are available for genetic association analysis with disease. In this project, we describe a data-curation protocol to deal with genotyping errors and missing values in genetic data. We obtain publicly available genetic data from three studies in the Alzheimer's Disease Neuroimaging Initiative (ADNI), and with the aid of the freely available HapMap3 reference panel, we improve the quality and size of the ADNI genetic data. We use the software PLINK to manage data format, SHAPEIT to check DNA strand alignment and phase the genetic markers inherited from the same parent, IMPUTE2 to impute missing SNP genotypes, and GTOOL to merge files and convert file formats. After merging the genetic data across these studies, we also use the reference panel to investigate the population structure of the processed data. ADNI's participants are recruited in the U.S., where the majority of the population are descendants of relatively recent immigrants. We use principal component analysis to understand the population structure of the participants, and model-based clustering to investigate the genetic composition of each participant and compare it with self-reported ethnicity information. This project is intended to serve as a guide for future users of the processed data.
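
To illustrate the population-structure step, here is a minimal sketch of principal component analysis on a genotype matrix, with simulated allele counts standing in for the curated ADNI data. The standardization by the binomial standard deviation is a common convention in genetic PCA, not necessarily the exact choice made in the project.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical genotype matrix: rows = participants, columns = SNPs,
# entries = minor-allele counts in {0, 1, 2} after curation/imputation.
G = rng.integers(0, 3, size=(200, 5000)).astype(float)

# Standardize each SNP: center by mean, scale by binomial SD.
p = G.mean(axis=0) / 2.0                 # estimated allele frequencies
keep = (p > 0) & (p < 1)                 # drop monomorphic SNPs
Z = (G[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))

# Principal components via SVD; top PCs capture ancestry structure.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
pcs = U[:, :10] * s[:10]                 # participant scores on top 10 PCs
print(pcs.shape)                         # (200, 10)
```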

Document type: 
Graduating extended essay / Research project
Supervisor(s): 
Jinko Graham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Statistical methods for tracking data in sports

Author: 
Date created: 
2020-08-21
Abstract: 

In this thesis, we examine player tracking data in basketball and soccer and explore statistical methods and applications related to this type of data. First, we present a method for nonparametric estimation of continuous-state Markov transition densities, using as our foundation a Poisson process representation of the joint input-output space of the Markovian transitions. Representing transition densities with a non-stationary point process allows the form of the transition density to vary rapidly over the space, resulting in a very flexible estimator of the transition mechanism. A key feature of this point process representation is that it allows the presence of spatial structure to inform transition density estimation. We illustrate this by using our method to model ball movement in the National Basketball Association, enabling us to capture the effects of spatial features, such as the three-point line, that impact transition density values. Next, we consider a sports science application. Sports science has seen substantial benefit from player tracking data, as high-resolution coordinate data permit sports scientists to have to-the-second estimates of the external load metrics traditionally used to understand the physical toll a game takes on an athlete. Unfortunately, such data are not widely available. Algorithms have been developed that convert a traditional broadcast feed to x-y coordinate data, making tracking data easier to acquire, but coordinates are available for an athlete only when that player is within the camera frame. This leads to inaccuracies in player load estimates, limiting the usefulness of these data for sports scientists. In this research, we develop models that predict offscreen load metrics and demonstrate the viability of broadcast-derived tracking data for understanding external load in soccer. Finally, we address a tactics question in soccer. A key piece of information when evaluating a matchup in soccer is understanding the formations utilized by the different teams. Multiple researchers have developed methodology for learning these formations from tracking data, but these methods fail when faced with the heavy censoring inherent to broadcast tracking data. We present an algorithm for aligning broadcast tracking data with the origin, and then show how the aligned data can be used to learn formations, with performance comparable to formations learned from the full tracking data.
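
As a simplified illustration of transition density estimation on the joint input-output space, the sketch below substitutes an ordinary kernel density estimate for the thesis's non-stationary Poisson process representation, and uses a simulated one-dimensional trajectory rather than NBA ball-movement data. The conditional transition density is recovered by normalizing the joint estimate over the output coordinate.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Hypothetical 1-D locations observed at successive time steps.
x = np.cumsum(rng.normal(size=500))

# Joint input-output sample: columns are (x_t, x_{t+1}) transition pairs.
pairs = np.vstack([x[:-1], x[1:]])

# Density on the joint space; the thesis uses a non-stationary Poisson
# process intensity here, and a KDE is a much simpler stand-in.
joint = gaussian_kde(pairs)

def transition_density(x_now, y_grid):
    """p(x_{t+1} = y | x_t = x_now), normalized over y_grid."""
    f = joint(np.vstack([np.full_like(y_grid, x_now), y_grid]))
    return f / np.trapz(f, y_grid)

y = np.linspace(x.min(), x.max(), 200)
print(transition_density(x_now=0.0, y_grid=y).max())
```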

Document type: 
Thesis
Supervisor(s): 
Luke Bornn
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Statistical analysis of event times with missing origins aided by auxiliary information, with application to wildfire management

Author: 
Date created: 
2020-08-20
Abstract: 

Motivated partly by the analysis of lightning-caused wildfire data from Alberta, this dissertation develops statistical methodology for analyzing event times with missing origins, aided by auxiliary information such as associated longitudinal measures and other relevant information prior to the time origin. We begin with an analysis of the motivating data to estimate the distribution of time to initial attack since a wildfire starts burning with flames, i.e., the duration between the start time and initial attack time of a fire, using two conventional approaches: one neglects the missing origin and performs inference on the observed portion of the duration, and the other views the observation on the event time of interest as subject to interval censoring with a pre-determined interval. The counterintuitive, non-informative results of this preliminary analysis lead us to propose new approaches to tackling the issue of the missing origin. To facilitate methodological development, we first consider estimation of the duration distribution with independently and identically distributed (iid) observations. We link the unobserved time origin to the available longitudinal measures of burnt area via a first-hitting-time model. This yields an intuitive and easy-to-implement adaptation of the empirical distribution function for the event time data. We establish consistency and weak convergence of the proposed estimator and present its variance estimation. We then extend the proposed approach to studying the association of the duration with a list of potential risk factors. A semi-parametric accelerated failure time (AFT) regression model is considered together with a Wiener process model with random drift for the longitudinal measures. Further, we accommodate the potential spatial correlation of the wildfires by specifying the drift of the Wiener process as a function of covariates and spatially correlated random effects. Moreover, we propose a method to aid the duration distribution estimation with lightning data. It leads to an alternative approach to estimating the distribution of the duration by adapting the Turnbull estimator to interval-censored observations. A prominent byproduct of this approach is an estimation procedure for the distribution of ignition time using all the lightning data and the sub-sampled data. The finite-sample performance of the proposed approaches is examined via simulation studies. We use the motivating Alberta wildfire data to illustrate the proposed approaches throughout the thesis. The data analyses and simulation studies show that the two conventional approaches could, with the current data structure, give rise to misleading inference. The proposed approaches provide intuitive, easy-to-implement alternatives for the analysis of event times with missing origins. We anticipate that the methodology has many applications in practice, such as infectious disease research.
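
A minimal sketch of the first-hitting-time idea, assuming the mean burnt-area path is linear in time (as for a Wiener process with constant drift): back-extrapolating the longitudinal measurements to zero estimates the missing origin, from which a duration can be formed. All numbers below are hypothetical.

```python
import numpy as np

def estimate_origin(times, burnt_area):
    """Estimate the missing fire start time by back-extrapolating the
    longitudinal burnt-area measurements to zero; under a Wiener process
    with constant drift, the mean path is linear in elapsed time."""
    slope, intercept = np.polyfit(times, burnt_area, 1)
    return -intercept / slope       # time at which the mean path hits 0

# Hypothetical fire: measurements taken after the report at time 0.
t = np.array([0.0, 2.0, 5.0, 8.0])      # hours since report
area = np.array([1.2, 2.1, 3.6, 5.3])   # hectares burnt

t0 = estimate_origin(t, area)           # negative => ignited before report
initial_attack = 6.0                    # hours since report
duration = initial_attack - t0
print(f"estimated origin {t0:.2f} h, duration {duration:.2f} h")

# With one estimated duration per fire, the usual empirical distribution
# function can then be adapted to the missing-origin setting.
```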

Document type: 
Thesis
Supervisor(s): 
Joan Hu
John Braun
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Modeling human decision-making in spatial and temporal systems

Author: 
Date created: 
2020-08-20
Abstract: 

In this thesis, we analyze three applications of human decision-making in spatial and temporal environments. The first two projects are statistical applications to basketball, while the third project analyzes an experiment that aims to understand decision-making processes in games. The first project explores how efficiently players in a basketball lineup collectively allocate shots. We propose a new metric for allocative efficiency by comparing a player's field goal percentage (FG%) to their field goal attempt (FGA) rate in the context of both their four teammates on the court and the spatial distribution of their shots. Leveraging publicly available data provided by the National Basketball Association (NBA), we estimate player FG% at every location in the offensive half court using a Bayesian hierarchical model. By ordering a lineup's estimated FG%s and pairing these rankings with the lineup's empirical FGA rate rankings, we detect areas where the lineup exhibits inefficient shot allocation. Lastly, we analyze the impact that suboptimal shot allocation has on a team's overall offensive potential, finding that inefficient shot allocation correlates with reduced scoring. In the second basketball application, we model basketball plays as episodes from team-specific nonstationary Markov decision processes (MDPs) with shot-clock-dependent transition probabilities. Bayesian hierarchical models are employed in the parametrization of the transition probabilities to borrow strength across players and through time. To ensure computational feasibility, we combine lineup-specific MDPs into team-average MDPs using a novel transition weighting scheme. Specifically, we derive the dynamics of the team-average process such that the expected transition count for an arbitrary state pair is equal to the weighted sum of the expected counts of the separate lineup-specific MDPs. We then utilize these nonstationary MDPs in the creation of a basketball play simulator with uncertainty propagated via posterior samples of the model components. After calibration, we simulate seasons both on policy and under altered policies, and explore the net changes in efficiency and production under the alternate policies. We also discuss the game-theoretic ramifications of testing alternative decision policies. For the final project, we take a different perspective on the behavior of the decision-makers. Broadly speaking, both basketball projects assume the agents (players) act sub-optimally, and the goal of the analyses is to evaluate the impact their suboptimal behavior has on point production and scoring efficiency. By contrast, in the final project we assume that the agents' actions are optimal, but that the criteria over which they optimize are unknown. The goal of the analysis is to make inference on these latent optimization criteria. This type of problem can be termed an inverse decision problem. Our project explores the inverse problem of Bayesian optimization. Specifically, we seek to estimate an agent's latent acquisition function based on their observed search paths. After introducing a probabilistic solution framework for the problem, we illustrate our method by analyzing human behavior from an experiment. The experiment was designed to force subjects to balance exploration and exploitation in search of a global optimum. We find that subjects exhibit a wide range of acquisition preferences; however, some subjects' behavior does not map well to any of the candidate acquisition functions we consider. Guided by the model discrepancies, we augment the candidate acquisition functions to yield a superior fit to the human behavior in this task.
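
The allocative-efficiency comparison in the first project can be illustrated with a toy version that omits the Bayesian hierarchical model and the spatial component: rank a lineup's FG% estimates against its FGA shares and flag disagreements. The five players and their numbers below are hypothetical.

```python
import numpy as np

# Hypothetical five-man lineup: estimated FG% (which the thesis obtains
# from a hierarchical model) and observed share of the lineup's attempts.
players = ["P1", "P2", "P3", "P4", "P5"]
fg_pct = np.array([0.52, 0.47, 0.44, 0.41, 0.39])
fga_share = np.array([0.15, 0.30, 0.25, 0.10, 0.20])

# Rank both quantities (0 = highest). Under efficient allocation the two
# orderings agree; disagreements flag over- or under-used players.
fg_rank = np.argsort(np.argsort(-fg_pct))
fga_rank = np.argsort(np.argsort(-fga_share))

for name, r1, r2 in zip(players, fg_rank, fga_rank):
    tag = "over-used" if r2 < r1 else "under-used" if r2 > r1 else "efficient"
    print(f"{name}: FG% rank {r1}, FGA rank {r2} -> {tag}")
```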

Document type: 
Thesis
Supervisor(s): 
Luke Bornn
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Understanding jump dynamics using liquidity measures

Author: 
Date created: 
2020-07-15
Abstract: 

Numerous past studies investigate the relationship between volatility and other relevant variables, e.g., asset jumps and liquidity factors. However, empirical studies examining the link between liquidity and jumps are almost non-existent. In this report, we investigate the possible improvement in estimating so-called jump distribution parameters computed from intraday returns by including liquidity measures. More specifically, we first calculate the jump distribution parameters using classic jump detection techniques in the spirit of Lee and Mykland (2008) and Tauchen and Zhou (2011), and we then use them as responses in the heterogeneous autoregressive model (e.g., Corsi, 2009). We examine the in-sample performance of our model and find that liquidity measures do provide extra information in the estimation of jump intensity and jump size variation. We also apply the same technique using one-period-ahead instead of contemporaneous responses; we again find extra explanatory power when the liquidity measures are included.
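
A stripped-down sketch of the modeling step, on simulated series: a heterogeneous autoregressive (HAR) regression of a jump parameter on its daily, weekly, and monthly components, augmented with one liquidity measure. The report's jump detection step and actual choice of liquidity measures are not reproduced here.

```python
import numpy as np

def har_design(x, liq):
    """HAR regressors: daily value plus weekly (5-day) and monthly
    (22-day) rolling means, augmented with a liquidity measure."""
    rows = []
    for t in range(22, len(x) - 1):
        rows.append([1.0, x[t],
                     x[t - 4:t + 1].mean(),     # weekly component
                     x[t - 21:t + 1].mean(),    # monthly component
                     liq[t]])
    return np.array(rows), x[23:]               # one-period-ahead response

rng = np.random.default_rng(1)
jump_intensity = np.abs(rng.normal(size=500))   # hypothetical response series
spread = np.abs(rng.normal(size=500))           # hypothetical liquidity measure

X, y = har_design(jump_intensity, spread)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # last coefficient: incremental effect of liquidity
```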

Document type: 
Graduating extended essay / Research project
Supervisor(s): 
Jean-François Bégin
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Supervised basis functions applied to functional regression and classification

Author: 
Date created: 
2020-07-29
Abstract: 

In fitting functional linear models, including scalar-on-function regression (SoFR) and function-on-function regression (FoFR), the intrinsically infinite dimension of the problem often demands a restriction to a subspace spanned by a finite number of basis functions. In this sense, the choice and construction of basis functions matter. We discuss herein certain supervised choices of basis functions for regression and classification with densely or sparsely observed curves, from both numerical and theoretical perspectives. For SoFR, functional principal component (FPC) regression may fail to provide good estimation or prediction if the response is highly correlated with some excluded FPCs. This is not rare, since the construction of FPCs never involves the response. We hence develop regression on functional continuum (FC) basis functions, whose framework includes, as special cases, both FPCs and functional partial least squares (FPLS) basis functions. Aiming at the binary classification of functional data, we then propose the continuum centroid classifier (CCC), built upon projections of functional data onto the direction parallel to the FC regression coefficient. One of the two subtypes of CCC asymptotically achieves zero misclassification. Implementation of FPLS traditionally demands that each predictor curve be recorded as densely as possible over the entire time span. This prerequisite is sometimes violated by, e.g., longitudinal studies and missing-data problems. We adapt FPLS for SoFR to scenarios where curves are sparsely observed, establish the consistency of the proposed estimators, and give confidence intervals for responses. FPLS is also widely used to fit FoFR. Its implementation is far from unique but typically involves iterative eigendecomposition. We introduce a new route for FoFR based upon Krylov subspaces. The method can be expressed in two equivalent forms: one is non-iterative, with explicit forms of estimators and predictions, facilitating the theoretical derivation; the other stabilizes numerical outputs. Our route turns out to be less time-consuming than other methods while achieving competitive accuracy.
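
The Krylov-subspace route can be sketched in the simpler univariate-response setting: for scalar y, the k-component PLS fit coincides with the least-squares solution restricted to the Krylov subspace generated by X'X and X'y. The sketch below uses simulated data and normalized (not orthogonalized) Krylov vectors, so it is only numerically sensible for small k; the thesis's FoFR construction is more elaborate.

```python
import numpy as np

def krylov_pls(X, y, k):
    """k-component PLS fit via least squares on the Krylov subspace
    span{X'y, (X'X)X'y, ..., (X'X)^{k-1} X'y} -- a non-iterative
    characterization that avoids eigendecomposition."""
    XtX, Xty = X.T @ X, X.T @ y
    V = np.empty((X.shape[1], k))
    v = Xty
    for j in range(k):
        V[:, j] = v / np.linalg.norm(v)   # normalize for stability
        v = XtX @ V[:, j]
    # Solve min ||y - X V a|| over a, then map back: beta = V a.
    a, *_ = np.linalg.lstsq(X @ V, y, rcond=None)
    return V @ a

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 30))            # e.g. basis-expanded curves
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=100)
beta = krylov_pls(X, y, k=3)
print(beta[:5].round(2))
```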

Document type: 
Thesis
Supervisor(s): 
Richard A. Lockhart
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Understanding and estimating predictive performance of statistical learning methods based on data properties

Author: 
Date created: 
2020-07-14
Abstract: 

Many Statistical Learning (SL) regression methods have been developed over roughly the last two decades, but no one model has been found to be the best across all sets of data. It would be useful if guidance were available to help identify when each different method might be expected to provide more accurate or precise predictions than competitors. We speculate that certain measurable features of a data set might influence methods' potential ability to provide relatively accurate predictions. This thesis explores the potential to use measurable characteristics of a data set to estimate the prediction performance of different SL regression methods. We demonstrate this process on an existing set of 42 benchmark data sets. We measure a variety of properties on each data set that might be useful for differentiating between likely good- or poor-performing regression methods. Using cross-validation, we measure the actual relative prediction performance of 12 well-known regression methods, including both classical linear techniques and more modern flexible approaches. Finally, we combine the performance measures and the data set properties into a multivariate regression model to identify which properties appear to be most important and to estimate the expected prediction performance of each method.
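
A miniature version of this pipeline, with simulated data sets standing in for the 42 benchmarks, three methods instead of 12, and two ad hoc data-set properties: cross-validate each method on each data set, then regress performance on the properties.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score

methods = {"ols": LinearRegression(), "lasso": LassoCV(cv=5),
           "rf": RandomForestRegressor(n_estimators=100, random_state=0)}

rows = []
for seed in range(5):                            # stand-in benchmark suite
    X, y = make_friedman1(n_samples=200, noise=1.0, random_state=seed)
    props = [X.shape[0] / X.shape[1],            # aspect ratio n/p
             np.abs(np.corrcoef(X, rowvar=False)).mean()]  # mean |corr|
    perf = [cross_val_score(m, X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()
            for m in methods.values()]
    rows.append(props + perf)

table = np.array(rows)
# Regress each method's CV performance on the data-set properties.
meta = LinearRegression().fit(table[:, :2], table[:, 2:])
print(meta.coef_.round(3))    # one row of property effects per method
```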

Document type: 
Graduating extended essay / Research project
Supervisor(s): 
Thomas Loughin
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Statistical machine learning in computational genetics

Author: 
Date created: 
2020-07-03
Abstract: 

Statistical machine learning has played a key role in many areas, such as biology, health sciences, finance, and genetics. Important tasks in computational genetics include disease prediction, capturing shapes within images, computation of genetic sharing between pairs of individuals, genome-wide association studies, and image clustering. This thesis develops several learning methods to address these computational genetics problems. Firstly, motivated by the need for fast computation of genetic sharing among pairs of individuals, we propose the fastest algorithms for computing the kinship coefficient of a set of individuals with a known large pedigree. Moreover, we consider the possibility that the founders of the known pedigree may themselves be inbred and compute the appropriate inbreeding-adjusted kinship coefficients, which has not been addressed in the literature. Secondly, motivated by an imaging genetics study of the Alzheimer's Disease Neuroimaging Initiative, we develop a Bayesian bivariate spatial group lasso model for multivariate regression analysis, applicable to examining the influence of genetic variation on brain structure while accommodating the correlation structures typically seen in structural brain imaging data. We develop a mean-field variational Bayes algorithm and a Gibbs sampling algorithm to fit the model. We also incorporate Bayesian false discovery rate procedures to select SNPs. The new spatial model demonstrates superior performance over a standard model in our application. Thirdly, we propose the Random Tessellation Process (RTP) to model complex genetic data structures for predicting disease status. The RTP is a multi-dimensional partitioning tree with non-axis-aligned cuts. We develop a sequential Monte Carlo (SMC) algorithm for inference. Our process is self-consistent and can relax axis-aligned constraints, allowing complex inter-dimensional dependence to be captured. Fourthly, we propose the Random Tessellation with Splines (RTS) to capture complex shapes within images. The RTS provides a framework for describing Bayesian nonparametric models based on partitioning two-dimensional Euclidean space with splines. We also develop an inference algorithm that is "embarrassingly parallel". Finally, we extend the mixtures of spatial spline regression with mixed-effects model under the Bayesian framework to accommodate streaming image data. We propose an SMC algorithm to analyze brain images in an online fashion.
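
The kinship computation can be illustrated with the classic pairwise recursion on a small hypothetical pedigree. This sketch assumes non-inbred founders; the thesis's contribution also covers inbred founders and fast computation on much larger pedigrees.

```python
from functools import lru_cache

# Hypothetical pedigree: individual -> (father, mother); founders -> None.
pedigree = {"A": None, "B": None, "C": ("A", "B"),
            "D": ("A", "B"), "E": ("C", "D")}

def depth(i):
    """Generation number: 0 for founders, 1 + max(parent depths) otherwise."""
    p = pedigree[i]
    return 0 if p is None else 1 + max(depth(p[0]), depth(p[1]))

@lru_cache(maxsize=None)
def kinship(i, j):
    """Kinship coefficient phi(i, j) via the classic recursion, assuming
    non-inbred founders: phi(f, f) = 1/2 and phi(f, g) = 0 for founders."""
    if i == j:
        if pedigree[i] is None:
            return 0.5                       # founder self-kinship
        f, m = pedigree[i]
        return 0.5 * (1.0 + kinship(f, m))   # non-founder self-kinship
    if depth(i) < depth(j):                  # recurse through the deeper one,
        i, j = j, i                          # which cannot be j's ancestor
    if pedigree[i] is None:
        return 0.0                           # two distinct founders
    f, m = pedigree[i]
    return 0.5 * (kinship(f, j) + kinship(m, j))

print(kinship("E", "E"))   # 0.625: E's parents are full siblings (phi = 1/4)
```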

Document type: 
Thesis
Supervisor(s): 
Liangliang Wang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Approximate marginal likelihoods for shrinkage parameter estimation in penalized logistic regression analysis of case-control data

Author: 
Date created: 
2020-04-17
Abstract: 

Inference of associations between disease status and rare exposures is complicated by the finite-sample bias of the maximum likelihood estimator for logistic regression. Penalized likelihood methods are useful for reducing such bias. In this project, we study penalization by a family of log-F priors indexed by a shrinkage parameter m. We propose a method for estimating m based on an approximate marginal likelihood obtained by Laplace approximation. Derivatives of the approximate marginal likelihood with respect to m are challenging to compute, so we explore several derivative-free optimization approaches to obtaining the maximum marginal likelihood estimate. We conduct a simulation study to evaluate the performance of our method under a variety of data-generating scenarios, and apply the method to real data from a genetic association study of Alzheimer's disease.
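
A compact sketch of this scheme on simulated data: penalize the non-intercept coefficients with log-F(m, m) priors, Laplace-approximate the marginal likelihood of m at the posterior mode, and maximize it with a derivative-free (bounded Brent) search. Using the BFGS inverse-Hessian estimate in place of the exact curvature is an extra approximation made for brevity.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.special import betaln

rng = np.random.default_rng(4)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([-1.0, 0.8, 0.0, -0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def neg_log_post(beta, m):
    """Negative penalized log-likelihood: logistic likelihood plus an
    independent log-F(m, m) prior on each non-intercept coefficient."""
    eta = X @ beta
    loglik = y @ eta - np.logaddexp(0, eta).sum()
    b = beta[1:]                              # penalized coefficients
    logprior = (m / 2 * b - m * np.logaddexp(0, b)
                - betaln(m / 2, m / 2)).sum()
    return -(loglik + logprior)

def neg_approx_marginal(m):
    """Negative Laplace approximation to log p(y | m)."""
    fit = minimize(neg_log_post, np.zeros(p + 1), args=(m,), method="BFGS")
    d = p + 1
    # log p(y|m) ~ log p(y, beta_hat|m) + (d/2) log 2*pi + (1/2) log|Sigma|,
    # with Sigma approximated by BFGS's inverse-Hessian estimate.
    log_ml = (-fit.fun + d / 2 * np.log(2 * np.pi)
              + 0.5 * np.linalg.slogdet(fit.hess_inv)[1])
    return -log_ml

# Derivative-free search for the shrinkage parameter m (bounded Brent).
opt = minimize_scalar(neg_approx_marginal, bounds=(0.1, 20), method="bounded")
print(f"estimated m = {opt.x:.2f}")
```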

Document type: 
Graduating extended essay / Research project
Supervisor(s): 
Brad McNeney
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.