Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays

Systematic comparison of designs and emulators for computer experiments using a library of test functions

Author: 
Date created: 
2020-12-16
Abstract: 

As computational resources have become faster and more economical, scientific research has transitioned from relying solely on physical experiments to simulation-based exploration. A body of literature has since grown around the design and analysis of so-called computer experiments. While this literature is large and active, little work has focused on comparing methods. This project presents ways of comparing and evaluating both design and emulation methods for computer experiments. Using a suite of test functions, introduced in this work as the Virtual Library of Computer Experiments, a procedure is established that can provide guidance on how to proceed in simulation problems. An illustrative comparison is performed for each context, putting first three emulators and then four experimental designs up against each other, while also highlighting possible considerations for the choice of test function.
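The comparison the abstract describes can be illustrated in miniature. The sketch below is not the project's actual procedure; the design and emulator choices are my own simplifications. It evaluates the Branin function, a standard test function for computer experiments, under two designs (a simple Latin hypercube versus uniform random sampling) and scores a Gaussian radial-basis-function emulator by out-of-sample RMSE:

```python
import numpy as np

def branin(x1, x2):
    # Branin test function on [-5, 10] x [0, 15], a standard benchmark
    a, b, c = 1.0, 5.1 / (4 * np.pi**2), 5 / np.pi
    r, s, t = 6.0, 10.0, 1 / (8 * np.pi)
    return a * (x2 - b * x1**2 + c * x1 - r)**2 + s * (1 - t) * np.cos(x1) + s

rng = np.random.default_rng(0)

def lhs(n):
    # simple (unoptimized) Latin hypercube design on the unit square
    u = (rng.permutation(n) + rng.random(n)) / n
    v = (rng.permutation(n) + rng.random(n)) / n
    return np.column_stack([u, v])

def rbf_emulator(X, y, eps=25.0):
    # Gaussian radial-basis-function interpolator as a stand-in emulator
    K = np.exp(-eps * ((X[:, None, :] - X[None, :, :])**2).sum(-1))
    w = np.linalg.solve(K + 1e-6 * np.eye(len(X)), y)
    return lambda Z: np.exp(-eps * ((Z[:, None, :] - X[None, :, :])**2).sum(-1)) @ w

def design_rmse(design, n_train=40, n_test=200):
    # fit the emulator on a design, score it on random held-out points
    X = design(n_train)
    f = lambda U: branin(15 * U[:, 0] - 5, 15 * U[:, 1])  # rescale to Branin domain
    pred = rbf_emulator(X, f(X))
    Z = rng.random((n_test, 2))
    return np.sqrt(np.mean((pred(Z) - f(Z))**2))

print(design_rmse(lhs), design_rmse(lambda n: rng.random((n, 2))))
```

Averaging such scores over many seeds and many test functions is the kind of systematic comparison the project formalizes.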

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Derek Bingham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Efficient Bayesian parameter inference for COVID-19 transmission models

Author: 
Date created: 
2020-12-17
Abstract: 

Many transmission models have been proposed and adapted to reflect policy changes aimed at mitigating the spread of COVID-19. Often these models are applied without any formal comparison with previously existing models. In this project, we use an annealed sequential Monte Carlo (ASMC) algorithm to estimate the parameters of these transmission models. We also use Bayesian model selection to provide a framework through which the relative performance of transmission models can be compared in a statistically rigorous manner. The ASMC algorithm provides an unbiased estimate of the marginal likelihood at no additional computational cost. This offers a significant advantage over MCMC methods, which require expensive post hoc computation to estimate the marginal likelihood. We find that ASMC can produce results comparable to MCMC in a fraction of the time.
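The marginal likelihood estimate that ASMC yields as a by-product can be demonstrated on a toy model where the answer is known in closed form. The sketch below uses an illustrative one-parameter Gaussian model, not the COVID-19 transmission models of the project: it anneals from an N(0, 1) prior to the posterior and accumulates the log marginal likelihood from the incremental weights.

```python
import numpy as np

rng = np.random.default_rng(1)
y = 1.5  # single observation; likelihood y ~ N(theta, 1)
loglik = lambda th: -0.5 * (y - th)**2 - 0.5 * np.log(2 * np.pi)

def asmc_log_marginal(n_particles=4000, n_temps=50):
    # anneal from the N(0,1) prior (temperature 0) to the posterior (1)
    temps = np.linspace(0, 1, n_temps + 1)
    th = rng.normal(0, 1, n_particles)          # sample from the prior
    logZ = 0.0
    for t0, t1 in zip(temps[:-1], temps[1:]):
        logw = (t1 - t0) * loglik(th)           # incremental importance weights
        logZ += np.log(np.mean(np.exp(logw)))   # running marginal-likelihood estimate
        w = np.exp(logw - logw.max()); w /= w.sum()
        th = th[rng.choice(n_particles, n_particles, p=w)]  # resample
        # one Metropolis move targeting the tempered posterior at t1
        prop = th + rng.normal(0, 0.5, n_particles)
        logacc = (t1 * loglik(prop) - 0.5 * prop**2) - (t1 * loglik(th) - 0.5 * th**2)
        th = np.where(np.log(rng.random(n_particles)) < logacc, prop, th)
    return logZ

# exact answer: marginally y ~ N(0, 2)
true_logZ = -0.5 * np.log(2 * np.pi * 2) - y**2 / 4
print(asmc_log_marginal(), true_logZ)
```

The same loop structure applies when `loglik` is replaced by a transmission model's likelihood; the marginal likelihood then feeds directly into Bayes factors for model comparison.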

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Liangliang Wang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Contextual batting and bowling in limited overs cricket

Author: 
Date created: 
2020-11-25
Abstract: 

Cricket is a sport for which many batting and bowling statistics have been proposed. However, a feature of cricket is that the level of aggressiveness adopted by batsmen depends on match circumstances. It is therefore relevant to consider these circumstances when evaluating batting and bowling performances. This project considers batting performance in the second innings of limited overs cricket when a target has been set. The runs required, the number of overs completed, and the wickets taken are all relevant in assessing batting performance. We produce a visualization for second-innings batting which describes how a batsman performs under different circumstances. The visualization is then reduced to a single statistic, “clutch batting”, which can be used to compare batsmen. An analogous analysis is then provided for bowlers based on the symmetry between batting and bowling, and we define a corresponding statistic, “clutch bowling”.

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Harsha Perera
Timothy Swartz
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

A flexible group benefits framework for pricing deposit rates

Author: 
Date created: 
2020-11-16
Abstract: 

Currently, most flexible group benefit plans are designed and priced based on deterministic assumptions about plan members’ option selections. This can trigger an adverse selection spiral, threatening the sustainability of the plan. We therefore propose a comprehensive framework with a novel pricing formula that incorporates both a model for claims and a model for plan members’ enrollment decisions to prevent adverse selection. We find through simulation that our proposed pricing formula outperforms traditional pricing practice by keeping flex plans sustainable over time. In addition to preventing the adverse selection spiral through pricing, our framework also serves as a tool to evaluate the impact of other parameters, such as changes in plan design, health costs, and member decisions.

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Jean-François Bégin
Barbara Sanders
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

New perspectives on non-negative matrix factorization for grouped topic models

Author: 
Date created: 
2020-08-18
Abstract: 

Probabilistic topic models (PTMs) have become a ubiquitous approach for finding a set of latent themes (“topics”) in collections of unstructured text. A simpler, linear algebraic technique for the same problem is non-negative matrix factorization (NMF): we are given a matrix with non-negative entries and asked to find a pair of low-rank matrices, also non-negative, whose product is approximately the original matrix. A drawback of NMF is the non-convex nature of the optimization problem it poses. Recent work by the theoretical computer science community addresses this issue, utilizing NMF's inherent structure to find conditions under which the objective function admits convexity. With convexity comes tractability, and the central theme of this thesis is the exploitation of this tractability to ally NMF with resampling-based nonparametrics. Our motivating example is one in which a document collection exhibits some kind of partitioning according to a discrete, indexical covariate, and the goal is to assess the influence of this partitioning on document content; we call this scenario a grouped topic model. Computation relies on several well-studied tools from numerical linear algebra and convex programming which are especially well suited for synthesis with permutation tests and the bootstrap. The result is a set of simple, fast, and easily implementable methodologies for performing inference in grouped topic models. This is in contrast to parallel developments in PTMs, where ever-more cumbersome inference schemes are required to fit complex graphical models.
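The NMF problem stated in the abstract can be made concrete with the classical Lee-Seung multiplicative updates, one standard (non-convex) way to fit the factorization; the convex formulations the thesis builds on are more involved. Below, a small invented term-document matrix with two planted topics is factored and the relative reconstruction error checked:

```python
import numpy as np

rng = np.random.default_rng(2)

def nmf(V, rank, n_iter=200):
    # Lee-Seung multiplicative updates for min ||V - WH||_F with W, H >= 0
    n, m = V.shape
    W = rng.random((n, rank)) + 0.1
    H = rng.random((rank, m)) + 0.1
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# toy term-document matrix: 10 terms, 4 documents, two planted "topics"
V = np.vstack([np.outer(rng.random(5), [1, 1, 0, 0]),
               np.outer(rng.random(5), [0, 0, 1, 1])]) + 0.01
W, H = nmf(V, rank=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err)
```

In a grouped topic model, refitting after permuting the group labels of the documents (columns of V) gives a null distribution against which the observed between-group difference in topic usage can be compared.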

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
David Campbell
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Bayesian logistic regression with the local bouncy particle sampler for COVID-19

Author: 
Date created: 
2020-08-24
Abstract: 

A novel coronavirus, SARS-CoV-2, has caused the COVID-19 pandemic. COVID-19 poses a tremendous threat to the global economy and to people’s health and lives. This project aims to determine some important factors in COVID-19 severity based on 137 Tianjin patients who had been exposed to COVID-19 since January 5, 2020. We fit a logistic regression model and estimate the parameters using standard Markov chain Monte Carlo (MCMC) methods. Due to the weaknesses and limitations of standard MCMC methods, we then perform model estimation with one special example of a piecewise deterministic Markov process, the Bouncy Particle Sampler (BPS). This method is a rejection-free, irreversible MCMC algorithm that can draw samples from the target distribution efficiently. One variant of the BPS algorithm, the Local Bouncy Particle Sampler (LBPS), has advantages in computational efficiency. We apply the standard MCMC method and the LBPS to our dataset. We conclude that age and Wuhan-related exposure (i.e., having lived in or traveled from Wuhan) are two important factors in COVID-19 severity.
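The standard MCMC baseline the project compares the LBPS against can be sketched with a random-walk Metropolis sampler for Bayesian logistic regression. The data below are simulated for illustration (a single standardized covariate with an invented true slope), not the Tianjin dataset:

```python
import numpy as np

rng = np.random.default_rng(3)

# synthetic data: one covariate (e.g. standardized age), known true coefficients
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-0.5, 1.5])
y_obs = rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))

def log_post(beta):
    # Bernoulli log-likelihood plus a N(0, 10 I) prior
    eta = X @ beta
    return np.sum(y_obs * eta - np.log1p(np.exp(eta))) - beta @ beta / 20

def metropolis(n_iter=5000, step=0.15):
    beta = np.zeros(2)
    lp = log_post(beta)
    draws = np.empty((n_iter, 2))
    for i in range(n_iter):
        prop = beta + step * rng.normal(size=2)
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept/reject
            beta, lp = prop, lp_prop
        draws[i] = beta
    return draws[n_iter // 2:]                    # discard burn-in

post = metropolis()
print(post.mean(axis=0))  # posterior mean should sit near beta_true
```

The BPS/LBPS replaces the accept/reject step with deterministic particle trajectories and gradient-driven "bounces", which is where its rejection-free efficiency comes from.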

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Liangliang Wang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Quarterback evaluation in the National Football League

Author: 
Date created: 
2020-08-20
Abstract: 

This project evaluates quarterback performance in the National Football League. With the availability of player tracking data, it is now possible to assess the various options available to a quarterback and the expected points resulting from each option. The quarterback’s execution is then measured against the optimal available option. Since decision making does not depend on the quality of a quarterback’s teammates, the resulting metric provides a novel perspective on an understudied aspect of quarterback assessment.

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Tim Swartz
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Curating and combining big data from genetic studies

Author: 
Date created: 
2020-08-19
Abstract: 

Big data curation is often underappreciated by users of processed data. With the development of high-throughput genotyping technology, large-scale genome-wide data are available for genetic association analysis with disease. In this project, we describe a data-curation protocol to deal with the genotyping errors and missing values in genetic data. We obtain publicly-available genetic data from three studies in the Alzheimer’s Disease Neuroimaging Initiative (ADNI), and with the aid of the freely-available HapMap3 reference panel, we improve the quality and size of the ADNI genetic data. We use the software PLINK to manage data format, SHAPEIT to check DNA strand alignment and perform phasing of the genetic markers that have been inherited from the same parent, IMPUTE2 to impute missing SNP genotypes, and GTOOL to merge files and convert file formats. After merging the genetic data across these studies, we also use the reference panel to investigate the population structure of the processed data. ADNI participants are recruited in the U.S., where the majority of the population are descendants of relatively recent immigrants. We use principal component analysis to understand the population structure of the participants, and model-based clustering to investigate the genetic composition of each participant and compare it with self-reported ethnicity information. This project is intended to serve as a guide to future users of the processed data.
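One step of the protocol, using principal component analysis to reveal population structure, can be sketched on simulated genotypes. The sketch below is illustrative only: two hypothetical populations with drifted allele frequencies are simulated, and the top principal component of the standardized genotype matrix is computed, the same recipe the protocol applies to the imputed ADNI data.

```python
import numpy as np

rng = np.random.default_rng(4)

# simulate SNP genotypes (0/1/2 minor-allele counts) for two populations
# whose allele frequencies have drifted apart
n_per_pop, n_snps = 50, 300
f1 = rng.uniform(0.1, 0.9, n_snps)
f2 = np.clip(f1 + rng.normal(0, 0.15, n_snps), 0.05, 0.95)
G = np.vstack([rng.binomial(2, f1, (n_per_pop, n_snps)),
               rng.binomial(2, f2, (n_per_pop, n_snps))]).astype(float)

# standardize each SNP column, as in the usual genotype-PCA recipe
Gs = (G - G.mean(0)) / (G.std(0) + 1e-9)

# top principal component of the individuals via SVD
U, S, Vt = np.linalg.svd(Gs, full_matrices=False)
pc1 = U[:, 0] * S[0]

# the two populations separate along PC1
print(pc1[:n_per_pop].mean(), pc1[n_per_pop:].mean())
```

In practice such PC scores are plotted against reference-panel populations (e.g. HapMap3) to interpret each cluster's likely ancestry.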

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Jinko Graham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Statistical methods for tracking data in sports

Author: 
Date created: 
2020-08-21
Abstract: 

In this thesis, we examine player tracking data in basketball and soccer and explore statistical methods and applications related to this type of data. First, we present a method for nonparametric estimation of continuous-state Markov transition densities, using as our foundation a Poisson process representation of the joint input-output space of the Markovian transitions. Representing transition densities with a non-stationary point process allows the form of the transition density to vary rapidly over the space, resulting in a very flexible estimator of the transition mechanism. A key feature of this point process representation is that it allows the presence of spatial structure to inform transition density estimation. We illustrate this by using our method to model ball movement in the National Basketball Association, enabling us to capture the effects of spatial features, such as the three-point line, that impact transition density values. Next, we consider a sports science application. Sports science has seen substantial benefit from player tracking data, as high resolution coordinate data permits sports scientists to have to-the-second estimates of external load metrics traditionally used to understand the physical toll a game takes on an athlete. Unfortunately, this data is not widely available. Algorithms have been developed that allow a traditional broadcast feed to be converted to x-y coordinate data, making tracking data easier to acquire, but coordinates are available for an athlete only when that player is within the camera frame. This leads to inaccuracies in player load estimates, limiting the usefulness of this data for sports scientists. In this research, we develop models that predict offscreen load metrics and demonstrate the viability of broadcast-derived tracking data for understanding external load in soccer. Finally, we address a tactics question in soccer. A key piece of information when evaluating a matchup in soccer is understanding the formations utilized by the different teams. Multiple researchers have developed methodology for learning these formations from tracking data, but their methods do not work when faced with the heavy censoring inherent to broadcast tracking data. We present an algorithm for aligning broadcast tracking data with the origin, and then show how the aligned data can be used to learn formations, with performance comparable to formations learned from the full tracking data.

Document type: 
Thesis
File(s): 
Supervisor(s): 
Luke Bornn
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Statistical analysis of event times with missing origins aided by auxiliary information, with application to wildfire management

Author: 
Date created: 
2020-08-20
Abstract: 

Motivated partly by the analysis of lightning-caused wildfire data from Alberta, this dissertation develops statistical methodology for analyzing event times with missing origins, aided by auxiliary information such as associated longitudinal measures and other relevant information prior to the time origin. We begin with an analysis of the motivating data to estimate the distribution of time to initial attack since a wildfire starts burning with flames, i.e., the duration between the start time and the initial-attack time of a fire, using two conventional approaches: one neglects the missing origin and performs inference on the observed portion of the duration, and the other views the observation of the event time of interest as subject to interval censoring with a pre-determined interval. The counterintuitive, non-informative results of this preliminary analysis lead us to propose new approaches to tackling the issue of the missing origin. To facilitate methodological development, we first consider estimation of the duration distribution with independently and identically distributed (iid) observations. We link the unobserved time origin to the available longitudinal measures of burnt area via a first-hitting-time model. This yields an intuitive and easy-to-implement adaptation of the empirical distribution function to the event time data. We establish consistency and weak convergence of the proposed estimator and present its variance estimation. We then extend the proposed approach to studying the association of the duration time with a list of potential risk factors. A semi-parametric accelerated failure time (AFT) regression model is considered together with a Wiener process model using random drift for the longitudinal measures. Further, we accommodate the potential spatial correlation of the wildfires by specifying the drift of the Wiener process as a function of covariates and spatially correlated random effects. Moreover, we propose a method to aid the duration distribution estimation with lightning data. It leads to an alternative approach to estimating the distribution of the duration by adapting the Turnbull estimator to interval-censored observations. A prominent byproduct of this approach is an estimation procedure for the distribution of ignition time using all the lightning data and the sub-sampled data. The finite-sample performance of the proposed approaches is examined via simulation studies. We use the motivating Alberta wildfire data to illustrate the proposed approaches throughout the thesis. The data analyses and simulation studies show that the two conventional approaches could give rise to misleading inferences under the current data structure. The proposed approaches provide intuitive, easy-to-implement alternatives for the analysis of event times with missing origins. We anticipate the methodology has many applications in practice, such as infectious diseases research.
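The first-hitting-time construction linking a longitudinal process to the unobserved origin can be illustrated by simulation. For a Wiener process with positive drift mu, the time to first cross a barrier a follows an inverse Gaussian distribution with mean a/mu. The sketch below uses illustrative parameters, not the wildfire model's fitted values, and checks this by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(5)

def first_hitting_times(mu=2.0, sigma=1.0, barrier=5.0, n_paths=2000,
                        dt=0.005, t_max=20.0):
    # simulate Wiener processes with drift and record the first time each
    # path crosses the barrier (discretization slightly overestimates times)
    n_steps = int(t_max / dt)
    x = np.zeros(n_paths)
    times = np.full(n_paths, np.nan)
    alive = np.ones(n_paths, dtype=bool)
    for i in range(1, n_steps + 1):
        x[alive] += mu * dt + sigma * np.sqrt(dt) * rng.normal(size=alive.sum())
        hit = alive & (x >= barrier)
        times[hit] = i * dt
        alive &= ~hit
        if not alive.any():
            break
    return times

t = first_hitting_times()
print(np.nanmean(t))  # theory: inverse Gaussian mean = barrier / mu = 2.5
```

In the dissertation's setting, the role of the barrier is played by the burnt-area measure at detection, which is what allows the missing origin to be inferred from the longitudinal data.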

Document type: 
Thesis
File(s): 
Supervisor(s): 
Joan Hu
John Braun
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.