Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays


New perspectives on non-negative matrix factorization for grouped topic models

Author: 
Date created: 
2020-08-18
Abstract: 

Probabilistic topic models (PTMs) have become a ubiquitous approach for finding a set of latent themes ("topics") in collections of unstructured text. A simpler, linear algebraic technique for the same problem is non-negative matrix factorization (NMF): we are given a matrix with non-negative entries and asked to find a pair of low-rank matrices, also non-negative, whose product is approximately the original matrix. A drawback of NMF is the non-convex nature of the optimization problem it poses. Recent work by the theoretical computer science community addresses this issue, utilizing NMF's inherent structure to find conditions under which the objective function admits convexity. With convexity comes tractability, and the central theme of this thesis is the exploitation of this tractability to ally NMF with resampling-based nonparametrics. Our motivating example is one in which a document collection exhibits some kind of partitioning according to a discrete, indexical covariate, and the goal is to assess the influence of this partitioning on document content; we call this scenario a grouped topic model. Computation relies on several well-studied tools from numerical linear algebra and convex programming which are especially well suited for synthesis with permutation tests and the bootstrap. The result is a set of simple, fast, and easily implementable methodologies for performing inference in grouped topic models. This is in contrast to parallel developments in PTMs, where ever-more cumbersome inference schemes are required to fit complex graphical models.
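
Because the grouped-topic setting above combines a non-negative factorization with resampling-based inference, a small illustration may help. The following is a minimal sketch, not the thesis's method: it factors a toy document-term matrix with scikit-learn's NMF and then runs a permutation test against a hypothetical group covariate; the group labels, topic count, and test statistic are all illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(200, 500)).astype(float)   # toy document-term counts
    groups = np.repeat([0, 1], 100)                        # hypothetical indexical covariate

    # low-rank non-negative factorization: rows of W are per-document topic weights
    W = NMF(n_components=5, init="nndsvda", random_state=0).fit_transform(X)

    def stat(w, g):
        # illustrative statistic: between-group difference in mean topic weights
        return np.abs(w[g == 0].mean(axis=0) - w[g == 1].mean(axis=0)).sum()

    observed = stat(W, groups)
    perms = [stat(W, rng.permutation(groups)) for _ in range(999)]
    p_value = (1 + sum(p >= observed for p in perms)) / (1 + len(perms))
    print(f"observed statistic {observed:.3f}, permutation p-value {p_value:.3f}")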

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
David Campbell
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Bayesian logistic regression with the local bouncy particle sampler for COVID-19

Author: 
Date created: 
2020-08-24
Abstract: 

A novel coronavirus, SARS-CoV-2, has caused the COVID-19 pandemic. The global economy and people's health and lives face a tremendous threat from COVID-19. This project aims to determine important factors in COVID-19 severity based on 137 Tianjin patients who had been exposed to COVID-19 since January 5, 2020. We fit a logistic regression model and estimate the parameters using standard Markov chain Monte Carlo (MCMC) methods. Due to the weaknesses and limitations of standard MCMC methods, we then perform model estimation with a special case of a piecewise deterministic Markov process, the Bouncy Particle Sampler (BPS). This method, also known as a rejection-free and irreversible MCMC, can draw samples from our target distribution efficiently. One variant of the BPS algorithm, the Local Bouncy Particle Sampler (LBPS), has advantages in computational efficiency. We apply the standard MCMC method and the LBPS to our dataset. We conclude that age and Wuhan-related exposure (i.e., having lived in or traveled from Wuhan) are two important factors in COVID-19 severity.
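
As a point of reference for the standard-MCMC baseline mentioned in the abstract, here is a minimal sketch of Bayesian logistic regression fitted with random-walk Metropolis on simulated data; the predictors (age and a Wuhan-related exposure indicator), priors, and tuning are illustrative assumptions, and the bouncy particle sampler itself is not reproduced.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 137
    age = rng.normal(45, 15, n)                      # toy ages
    wuhan = rng.binomial(1, 0.4, n)                  # toy Wuhan-related exposure indicator
    X = np.column_stack([np.ones(n), (age - age.mean()) / age.std(), wuhan])
    y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([-1.0, 0.8, 1.2])))))

    def log_post(beta):
        eta = X @ beta
        # Bernoulli log-likelihood plus an N(0, 10) prior on each coefficient
        return np.sum(y * eta - np.logaddexp(0.0, eta)) - 0.5 * beta @ beta / 10.0

    beta, draws = np.zeros(3), []
    for _ in range(5000):
        prop = beta + rng.normal(0, 0.2, 3)          # random-walk proposal
        if np.log(rng.uniform()) < log_post(prop) - log_post(beta):
            beta = prop
        draws.append(beta)
    print(np.mean(draws[2500:], axis=0))             # posterior means after burn-in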

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Liangliang Wang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Quarterback evaluation in the National Football League

Author: 
Date created: 
2020-08-20
Abstract: 

This project evaluates quarterback performance in the National Football League. With the availability of player tracking data, there exists the capability to assess various options that are available to quarterbacks and the expected points resulting from each option. The quarterback’s execution is then measured against the optimal available option. Since decision making does not rely on the quality of teammates, a quarterback metric is introduced that provides a novel perspective on an understudied aspect of quarterback assessment.
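
The comparison described above reduces, per play, to the gap between the expected points of the chosen option and the best available option. A minimal sketch with invented expected-point values:

    # Hypothetical options and expected-point values for a single play.
    options_ep = {"checkdown": 0.4, "deep_out": 1.1, "screen": 0.2}
    chosen = "checkdown"
    best = max(options_ep, key=options_ep.get)
    decision_loss = options_ep[best] - options_ep[chosen]   # expected points given up by the decision
    print(f"best option: {best}; expected points lost by the chosen option: {decision_loss:.2f}")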

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Tim Swartz
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Curating and combining big data from genetic studies

Author: 
Date created: 
2020-08-19
Abstract: 

Big data curation is often underappreciated by users of processed data. With the development of high-throughput genotyping technology, large-scale genome-wide data are available for genetic association analysis with disease. In this project, we describe a data-curation protocol to deal with the genotyping errors and missing values in genetic data. We obtain publicly-available genetic data from three studies in the Alzheimer’s Disease Neuroimaging Initiative (ADNI), and with the aid of the freely-available HapMap3 reference panel, we improve the quality and size of the ADNI genetic data. We use the software PLINK to manage data format, SHAPEIT to check DNA strand alignment and perform phasing of the genetic markers that have been inherited from the same parent, IMPUTE2 to impute missing SNP genotypes, and GTOOL to merge files and convert file formats. After merging the genetic data across these studies, we also use the reference panel to investigate the population structure of the processed data. ADNI's participants were recruited in the U.S., where the majority of the population are descendants of relatively recent immigrants. We use principal component analysis to understand the population structure of the participants, and model-based clustering to investigate the genetic composition of each participant and compare it with self-reported ethnicity information. This project is intended to serve as a guide for future users of the processed data.
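
To illustrate the population-structure step described above, here is a minimal sketch of principal component analysis applied to a toy genotype matrix coded as 0/1/2 minor-allele counts; the real analysis uses the merged, imputed ADNI data, whereas the data here are simulated.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)
    freqs = rng.uniform(0.05, 0.5, size=1000)                   # toy minor-allele frequencies
    G = rng.binomial(2, freqs, size=(300, 1000)).astype(float)  # individuals x SNPs
    G = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-8)           # standardize each SNP

    pcs = PCA(n_components=2).fit_transform(G)                  # leading axes of genetic variation
    print(pcs[:5])  # in real data, these PCs often separate ancestry groups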

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Jinko Graham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Statistical methods for tracking data in sports

Author: 
Date created: 
2020-08-21
Abstract: 

In this thesis, we examine player tracking data in basketball and soccer and explore statistical methods and applications related to this type of data. First, we present a method for nonparametric estimation of continuous-state Markov transition densities, using as our foundation a Poisson process representation of the joint input-output space of the Markovian transitions. Representing transition densities with a non-stationary point process allows the form of the transition density to vary rapidly over the space, resulting in a very flexible estimator of the transition mechanism. A key feature of this point process representation is that it allows the presence of spatial structure to inform transition density estimation. We illustrate this by using our method to model ball movement in the National Basketball Association, enabling us to capture the effects of spatial features, such as the three-point line, that impact transition density values. Next, we consider a sports science application. Sports science has seen substantial benefit from player tracking data, as high-resolution coordinate data permits sports scientists to have to-the-second estimates of external load metrics traditionally used to understand the physical toll a game takes on an athlete. Unfortunately, this data is not widely available. Algorithms have been developed that allow a traditional broadcast feed to be converted to x-y coordinate data, making tracking data easier to acquire, but coordinates are available for an athlete only when that player is within the camera frame. This leads to inaccuracies in player load estimates, limiting the usefulness of this data for sports scientists. In this research, we develop models that predict offscreen load metrics and demonstrate the viability of broadcast-derived tracking data for understanding external load in soccer. Finally, we address a tactics question in soccer. A key piece of information when evaluating a matchup in soccer is understanding the formations utilized by the different teams. Multiple researchers have developed methodology for learning these formations from tracking data, but these methods do not work when faced with the heavy censoring inherent to broadcast tracking data. We present an algorithm for aligning broadcast tracking data with the origin, and then show how the aligned data can be used to learn formations, with performance comparable to formations learned from the full tracking data.
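
To make the external-load idea above concrete, here is a minimal sketch of one common load metric, total (and high-speed) distance covered, computed from x-y tracking coordinates; the 10 Hz sampling rate, the 5.5 m/s speed threshold, and the simulated positions are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    xy = np.cumsum(rng.normal(0, 0.3, size=(600, 2)), axis=0)    # 60 s of toy 10 Hz positions (m)
    step_dist = np.linalg.norm(np.diff(xy, axis=0), axis=1)      # metres moved per frame
    total_distance = step_dist.sum()
    high_speed_distance = step_dist[step_dist * 10 > 5.5].sum()  # frames faster than 5.5 m/s
    print(f"distance {total_distance:.1f} m, high-speed distance {high_speed_distance:.1f} m")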

Document type: 
Thesis
File(s): 
Supervisor(s): 
Luke Bornn
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Statistical analysis of event times with missing origins aided by auxiliary information, with application to wildfire management

Author: 
Date created: 
2020-08-20
Abstract: 

Motivated partly by analysis of lightning-caused wildfire data from Alberta, this dissertation develops statistical methodology for analyzing event times with missing origins aided by auxiliary information, such as associated longitudinal measures and other relevant information prior to the time origin. We begin with an analysis of the motivating data to estimate the distribution of time to initial attack since a wildfire starts burning with flames, i.e., the duration between the start time and the initial attack time of a fire, using two conventional approaches: one neglects the missing origin and performs inference on the observed portion of the duration, and the other views the event time of interest as subject to interval censoring with a pre-determined interval. The counterintuitive/non-informative results of the preliminary analysis lead us to propose new approaches to tackling the issue of the missing origin. To facilitate methodological development, we first consider estimation of the duration distribution with independently and identically distributed (iid) observations. We link the unobserved time origin to the available longitudinal measures of burnt areas via the first-hitting-time model. This yields an intuitive and easy-to-implement adaptation of the empirical distribution function with the event time data. We establish consistency and weak convergence of the proposed estimator and present its variance estimation. We then extend the proposed approach to studying the association of the duration time with a list of potential risk factors. A semi-parametric accelerated failure time (AFT) regression model is considered together with a Wiener process model with random drift for the longitudinal measures. Further, we accommodate the potential spatial correlation of the wildfires by specifying the drift of the Wiener process as a function of covariates and spatially correlated random effects. Moreover, we propose a method to aid the duration distribution estimation with lightning data. It leads to an alternative approach to estimating the distribution of the duration by adapting the Turnbull estimator with interval-censored observations. A prominent byproduct of this approach is an estimation procedure for the distribution of ignition time using all the lightning data and the sub-sampled data. The finite-sample performance of the proposed approaches is examined via simulation studies. We use the motivating Alberta wildfire data to illustrate the proposed approaches throughout the thesis. The data analyses and simulation studies show that the two conventional approaches with the current data structure could give rise to misleading inference. The proposed approaches provide intuitive, easy-to-implement alternatives for the analysis of event times with missing origins. We anticipate the methodology has many applications in practice, such as infectious disease research.
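
To convey the idea of linking the missing origin to longitudinal burnt-area measures, here is a minimal sketch: under a drifted process started at zero at the unobserved ignition time, the origin can be back-extrapolated from the fitted growth of the observed measures. The numbers and the simple least-squares fit are illustrative assumptions, not the estimator developed in the thesis.

    import numpy as np

    rng = np.random.default_rng(4)
    true_origin = 2.0                                  # hours; unknown in practice
    obs_times = np.array([5.0, 6.0, 7.0, 8.0])         # times of longitudinal burnt-area measures
    area = 1.5 * (obs_times - true_origin) + rng.normal(0, 0.2, obs_times.size)

    slope, intercept = np.polyfit(obs_times, area, 1)  # least-squares line through the measures
    origin_hat = -intercept / slope                    # time at which the fitted area hits zero
    attack_time = 9.0                                  # observed initial-attack time
    duration_hat = attack_time - origin_hat            # duration with its missing origin imputed
    print(f"estimated origin {origin_hat:.2f} h, estimated duration {duration_hat:.2f} h")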

Document type: 
Thesis
File(s): 
Supervisor(s): 
Joan Hu
John Braun
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Modeling human decision-making in spatial and temporal systems

Author: 
Date created: 
2020-08-20
Abstract: 

In this thesis, we analyze three applications of human decision-making in spatial and temporal environments. The first two projects are statistical applications to basketball, while the third project analyzes an experiment that aims to understand decision-making processes in games. The first project explores how efficiently players in a basketball lineup collectively allocate shots. We propose a new metric for allocative efficiency by comparing a player's field goal percentage (FG%) to their field goal attempt (FGA) rate in the context of both their four teammates on the court and the spatial distribution of their shots. Leveraging publicly available data provided by the National Basketball Association (NBA), we estimate player FG% at every location in the offensive half court using a Bayesian hierarchical model. By ordering a lineup's estimated FG%s and pairing these rankings with the lineup's empirical FGA rate rankings, we detect areas where the lineup exhibits inefficient shot allocation. Lastly, we analyze the impact that suboptimal shot allocation has on a team's overall offensive potential, finding that inefficient shot allocation correlates with reduced scoring. In the second basketball application, we model basketball plays as episodes from team-specific nonstationary Markov decision processes (MDPs) with shot-clock-dependent transition probabilities. Bayesian hierarchical models are employed in the parametrization of the transition probabilities to borrow strength across players and through time. To enable computational feasibility, we combine lineup-specific MDPs into team-average MDPs using a novel transition weighting scheme. Specifically, we derive the dynamics of the team-average process such that the expected transition count for an arbitrary state pair is equal to the weighted sum of the expected counts of the separate lineup-specific MDPs. We then utilize these nonstationary MDPs in the creation of a basketball play simulator with uncertainty propagated via posterior samples of the model components. After calibration, we simulate seasons both on policy and under altered policies and explore the net changes in efficiency and production under the alternate policies. We also discuss the game-theoretic ramifications of testing alternative decision policies. For the final project, we take a different perspective on the behavior of the decision-makers. Broadly speaking, both basketball projects assume the agents (players) act sub-optimally, and the goal of the analyses is to evaluate the impact their suboptimal behavior has on point production and scoring efficiency. By contrast, in the final project we assume that the agents' actions are optimal, but that the criteria over which they optimize are unknown. The goal of the analysis is to make inference on these latent optimization criteria. This type of problem can be termed an inverse decision problem. Our project explores the inverse problem of Bayesian optimization. Specifically, we seek to estimate an agent's latent acquisition function based on their observed search paths. After introducing a probabilistic solution framework for the problem, we illustrate our method by analyzing human behavior from an experiment. The experiment was designed to force subjects to balance exploration and exploitation in search of a global optimum. We find that subjects exhibit a wide range of acquisition preferences; however, some subjects' behavior does not map well to any of the candidate acquisition functions we consider. Guided by the model discrepancies, we augment the candidate acquisition functions to yield a superior fit to the human behavior in this task.
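
A minimal sketch of the allocative-efficiency comparison in the first project: within a lineup, rank (estimated) FG% against the ranking of FGA rates and flag mismatches. The player names and numbers are invented for illustration.

    import numpy as np

    players = ["P1", "P2", "P3", "P4", "P5"]
    fg_pct = np.array([0.52, 0.47, 0.44, 0.41, 0.38])    # estimated FG% in some court region
    fga_rate = np.array([0.15, 0.30, 0.20, 0.25, 0.10])  # share of the lineup's attempts there

    fg_rank = (-fg_pct).argsort().argsort()              # 0 = best shooter
    fga_rank = (-fga_rate).argsort().argsort()           # 0 = most attempts
    for p, r1, r2 in zip(players, fg_rank, fga_rank):
        flag = "" if r1 == r2 else "  <- rank mismatch (possible misallocation)"
        print(f"{p}: FG% rank {r1 + 1}, FGA rank {r2 + 1}{flag}")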

Document type: 
Thesis
File(s): 
Supervisor(s): 
Luke Bornn
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Understanding jump dynamics using liquidity measures

Author: 
Date created: 
2020-07-15
Abstract: 

Numerous past studies investigate the relationship between volatility and other relevant variables, e.g., asset jumps and liquidity factors. However, empirical studies examining the link between liquidity and jumps are almost non-existent. In this report, we investigate the possible improvement in estimating so-called jump distribution parameters computed from intraday returns by including liquidity measures. More specifically, we first calculate the jump distribution parameters by using classic jump detection techniques in the spirit of Lee and Mykland (2008) and Tauchen and Zhou (2011), and we then use them as our responses in the heterogeneous autoregressive model (e.g., Corsi, 2009). We examine the in-sample performance of our model and find that liquidity measures do provide extra information in the estimation of the jump intensity and jump size variation. We also apply the same technique but using one-period-ahead instead of contemporaneous responses; we again find extra explanatory power when the liquidity measures are included.
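
For readers unfamiliar with the heterogeneous autoregressive (HAR) setup referenced above, here is a minimal sketch: a daily jump-related response regressed on its lagged daily, weekly, and monthly averages, with a liquidity measure added as an extra regressor. The series are simulated stand-ins; the actual responses come from intraday jump detection.

    import numpy as np

    rng = np.random.default_rng(5)
    T = 500
    y = np.abs(rng.normal(0, 1, T))                     # toy daily jump-size variation
    liq = rng.normal(0, 1, T)                           # toy daily liquidity measure

    def lagged_mean(x, t, k):                           # mean over the k days before day t
        return x[t - k:t].mean()

    rows, resp = [], []
    for t in range(22, T - 1):
        rows.append([1.0, y[t - 1], lagged_mean(y, t, 5), lagged_mean(y, t, 22), liq[t - 1]])
        resp.append(y[t])                               # replace with y[t + 1] for one-period-ahead fits
    X, resp = np.array(rows), np.array(resp)
    coef, *_ = np.linalg.lstsq(X, resp, rcond=None)
    print(coef)                                         # last entry: coefficient on the liquidity measure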

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Jean-François Bégin
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Supervised basis functions applied to functional regression and classification

Author: 
Date created: 
2020-07-29
Abstract: 

In fitting functional linear models, including scalar-on-function regression (SoFR) and function-on-function regression (FoFR), the intrinsically infinite dimension of the problem often demands a restriction to a subspace spanned by a finite number of basis functions. In this sense, the choice and construction of basis functions matter. We discuss herein certain supervised choices of basis functions for regression/classification with densely/sparsely observed curves, and give both numerical and theoretical perspectives. For SoFR, functional principal component (FPC) regression may fail to provide good estimation or prediction if the response is highly correlated with some excluded FPCs. This is not rare, since the construction of FPCs never involves the response. We hence develop regression on functional continuum (FC) basis functions, whose framework includes, as special cases, both FPC and functional partial least squares (FPLS) basis functions. Aiming at the binary classification of functional data, we then propose the continuum centroid classifier (CCC), built upon projections of functional data onto the direction parallel to the FC regression coefficient. One of the two subtypes of CCC (asymptotically) achieves zero misclassification. Implementation of FPLS traditionally demands that each predictor curve be recorded as densely as possible over the entire time span. This prerequisite is sometimes violated by, e.g., longitudinal studies and missing data problems. We adapt FPLS for SoFR to scenarios where curves are sparsely observed. We establish the consistency of the proposed estimators and give confidence intervals for responses. FPLS is also widely used to fit FoFR. Its implementation is far from unique but typically involves iterative eigendecomposition. We introduce a new route for FoFR based upon Krylov subspaces. The method can be expressed in two equivalent forms: one of them is non-iterative with explicit forms of estimators and predictions, facilitating the theoretical derivation; the other stabilizes numerical outputs. Our route turns out to be less time-consuming than other methods while achieving competitive accuracy.
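
As background for the FPC baseline that the thesis builds on, here is a minimal sketch of scalar-on-function regression via functional principal components: discretize the predictor curves, take leading principal component scores as basis coefficients, and regress the scalar response on those scores. The simulated curves and the choice of four components are illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(6)
    t = np.linspace(0, 1, 100)
    curves = np.array([np.sin(2 * np.pi * (t + rng.uniform())) + rng.normal(0, 0.1, t.size)
                       for _ in range(150)])             # densely observed predictor curves
    y = curves @ np.cos(2 * np.pi * t) / t.size + rng.normal(0, 0.05, 150)

    scores = PCA(n_components=4).fit_transform(curves)   # FPC scores (an unsupervised basis)
    model = LinearRegression().fit(scores, y)            # scalar-on-function fit via the scores
    print(f"in-sample R^2 of FPC regression: {model.score(scores, y):.3f}")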

Document type: 
Thesis
File(s): 
Supervisor(s): 
Richard A. Lockhart
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Understanding and estimating predictive performance of statistical learning methods based on data properties

Author: 
Date created: 
2020-07-14
Abstract: 

Many Statistical Learning (SL) regression methods have been developed over roughly the last two decades, but no one model has been found to be the best across all sets of data. It would be useful if guidance were available to help identify when each different method might be expected to provide more accurate or precise predictions than competitors. We speculate that certain measurable features of a data set might influence methods' potential ability to provide relatively accurate predictions. This thesis explores the potential to use measurable characteristics of a data set to estimate the prediction performance of different SL regression methods. We demonstrate this process on an existing set of 42 benchmark data sets. We measure a variety of properties on each data set that might be useful for differentiating between likely good- or poor-performing regression methods. Using cross-validation, we measure the actual relative prediction performance of 12 well-known regression methods, including both classical linear techniques and more modern flexible approaches. Finally, we combine the performance measures and the data set properties into a multivariate regression model to identify which properties appear to be most important and to estimate the expected prediction performance of each method.
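
A minimal sketch of the workflow described above, with toy stand-ins for the benchmark data sets, the regression methods, and the data-set properties: measure cross-validated error for each method on each data set, then regress relative performance on the measured properties.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(7)
    methods = [LinearRegression(), Ridge(alpha=1.0),
               RandomForestRegressor(n_estimators=50, random_state=0)]
    props, perf = [], []
    for _ in range(5):                                   # a handful of toy data sets
        n, p = int(rng.integers(100, 300)), int(rng.integers(5, 30))
        X, y = make_regression(n_samples=n, n_features=p, noise=10.0,
                               random_state=int(rng.integers(1000)))
        props.append([n, p, p / n])                      # simple measurable data-set properties
        perf.append([-cross_val_score(m, X, y, cv=5,
                                      scoring="neg_root_mean_squared_error").mean()
                     for m in methods])
    props, perf = np.array(props), np.array(perf)
    rel_rf = perf[:, 2] / perf[:, 0]                     # forest error relative to the linear model
    meta = LinearRegression().fit(props, rel_rf)         # which properties drive relative performance?
    print(dict(zip(["n", "p", "p_over_n"], meta.coef_.round(3))))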

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Thomas Loughin
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.