# Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays

## Bayesian Integration for Assessing the Quality of the Laplace Approximation

Author:
Date created:
2017-11-24
Abstract:

The number of nuisance parameters increases as more data are collected. In dynamic models, this typically results in more parameters than observations, making direct estimation intractable. The Laplace Approximation is the standard tool for approximating the high-dimensional integral required to marginalize over the nuisance parameters. However, the Laplace Approximation relies on asymptotic arguments that are unobtainable for nuisance parameters. Assessing the quality of the Laplace Approximation relies on much slower MCMC-based methods. In this work, a probabilistic integration approach is used to develop a diagnostic for the quality of the Laplace Approximation.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
David Alexander Campbell
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## Forecasting Batting Averages in MLB

Author:
Date created:
2017-11-14
Abstract:

We consider new baseball data from Statcast, which includes launch angle, launch velocity, and hit distance for batted balls in Major League Baseball during the 2015 and 2016 seasons. Using logistic regression, we train two models on 2015 data to get the probability that a player will get a hit on each of their 2015 at-bats. For each player, we sum these predictions and divide by their total at-bats to predict their 2016 batting average. We then fit a linear regression that expresses actual 2016 batting averages as a linear combination of 2016 Statcast predictions and 2016 PECOTA predictions. When using this procedure to obtain 2017 predictions, we find that the combined prediction performs better than PECOTA. This information may be used to make better predictions of batting averages for future seasons.
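
The aggregation step described in the abstract (sum each player's per-at-bat hit probabilities, then divide by their number of at-bats) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `(player_id, hit_probability)` input layout, and the sample values are all assumptions, and the probabilities would in practice come from the fitted logistic regression models.

```python
from collections import defaultdict


def predicted_batting_averages(at_bats):
    """Average per-at-bat hit probabilities for each player.

    `at_bats` is an iterable of (player_id, hit_probability) pairs,
    one pair per at-bat (hypothetical layout, assumed for illustration).
    """
    prob_sum = defaultdict(float)  # running sum of hit probabilities
    n_at_bats = defaultdict(int)   # number of at-bats seen per player
    for player, prob in at_bats:
        prob_sum[player] += prob
        n_at_bats[player] += 1
    # Predicted batting average = summed probabilities / total at-bats.
    return {p: prob_sum[p] / n_at_bats[p] for p in prob_sum}


# Made-up probabilities for two hypothetical players.
sample = [("A", 0.5), ("A", 0.25), ("B", 0.75), ("A", 0.0), ("B", 0.25)]
averages = predicted_batting_averages(sample)
```

Here player `A` has three at-bats with summed probability 0.75, and player `B` has two at-bats with summed probability 1.0.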

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Timothy Swartz
Jason Loeppky
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## Estimating conditional intensity function of a neural spike train by particle Markov chain Monte Carlo and smoothing

Author:
Date created:
2017-08-14
Abstract:

Understanding neural activities is fundamental and challenging in decoding how the brain processes information. An essential part of the problem is to define a meaningful and quantitative characterization of neural activities when they are represented by a sequence of action potentials, or a neural spike train. The thesis uses a point process to represent a neural spike train, and this representation provides a conditional intensity function (CIF) to describe neural activities. The estimation procedure for the CIF, including particle Markov chain Monte Carlo (PMCMC) and smoothing, is introduced and applied to a real data set. From the CIF of a neural spike train and its derivative, we can successfully observe adaptation behavior. A simulation study verifies that the estimation procedure provides reliable estimates of the CIF. This framework provides a definite quantification of neural activities and facilitates further investigation of the brain from a neurological perspective.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Jiguo Cao
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## Adjusting for Scorekeeper Bias in NBA Box Scores

Author:
Date created:
2017-06-01
Abstract:

Box score statistics in the National Basketball Association are used to measure and evaluate player performance. Some of these statistics are subjective in nature, and since box score statistics are recorded by scorekeepers hired by the home team for each game, there exists potential for inconsistency and bias. These inconsistencies can have far-reaching consequences, particularly with the rise in popularity of daily fantasy sports. Using box score data, we estimate models able to quantify both the bias and the generosity of each scorekeeper for two of the most subjective statistics: assists and blocks. We then use optical player tracking data for the 2015-2016 season to improve the assist model by including other contextual spatio-temporal variables such as time of possession, player locations, and distance traveled. From this model, we present results measuring the impact of the scorekeeper and of the other contextual variables on the probability of a pass being recorded as an assist. Results for adjusting season assist totals to remove scorekeeper influence are also presented.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Luke Bornn
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## Differences in Prescription Drug use Among 5-year Survivors of Childhood, Adolescent, and Young Adult Cancer and the General Population in British Columbia, Canada

Author:
Date created:
2017-07-13
Abstract:

In this project, we analyze the prescription drug use of childhood, adolescent, and young adult cancer survivors identified by the CAYACS program in BC. Understanding the patterns of prescription use and factors associated with the tendency to be on prescriptions is important to policy and health care planners. Since data on actual prescription usage are not available, we use prescription dispensing data as a proxy. We examine the differences in prescription use between survivors and matched controls selected from the general population, and assess the impact of age and other clinical and sociodemographic factors on prescription use. Specifically, we model subjects' on-/off-prescription status by a first-order Markov transition model, and handle the between-subject heterogeneity using a random effect. Our method captures the differences in prescription drug use between survivors and the general population, as well as differences within the survivor population. Our results show that survivors tend to exhibit a higher probability of going on prescriptions compared to the general population over the course of their lifetime. Further, females appear to have a higher probability of going on prescriptions than males over the course of their lifetime. A simulation study is conducted to assess the performance of the estimators of the model.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Rachel Altman
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## Distributions of Time to First Spot Fire

Author:
Date created:
2017-08-15
Abstract:

In wildfire management, a spot fire is the result of an airborne ember igniting a separate fire away from the main wildfire. Under certain environmental and wildfire conditions, a burning ember can breach a fuel break, such as a river or road, and result in the production of a spot fire. This project derives distributions of the time to the first spot fire in various situations, and verifies them by simulation. To demonstrate the implementation of the distributions in practice, we incorporate a stochastic fire spread model. This research assesses the likelihood of a spot fire occurring past a fuel break while taking into account both spotting distance and spotting rate. This contrasts with the traditional approach, which involves only the maximal spotting distance, and can be a tool for fire management.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Joan Hu
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## Mendelian randomization for causal inference of the relationship between obesity and 28-day survival following septic shock

Author:
Date created:
2017-08-10
Abstract:

Septic shock is a leading cause of death in intensive care units. Septic shock occurs when a body-wide infection leads to low blood pressure, and ultimately organ failure. Some recent studies suggest that overweight and obese patients have a better chance of survival following septic shock than normal or underweight patients. In this project we apply Mendelian randomization to assess whether the observed obesity effect on 28-day survival following septic shock is causal or more likely due to unmeasured confounding variables. Mendelian randomization is an instrumental variables approach that uses genetic markers as instruments. Under modelling assumptions, unconfounded estimates of the obesity effect can be obtained by fitting a model for 28-day survival that includes a residual obesity term. Data for the project comes from the Vasopressin and Septic Shock Trial (VASST). Our analysis suggests that the observed obesity effect on survival following septic shock is not causal.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## A Multi-Dimensional Bühlmann Credibility Approach to Modeling Multi-Population Mortality Rates

Author:
Date created:
2017-06-08
Abstract:

In this project, we first propose a multi-dimensional Bühlmann credibility approach to forecasting mortality rates for multiple populations, and then compare forecasting performances among the proposed approach and the joint-k/co-integrated/augmented common factor Lee-Carter models. The model is applied to mortality data of the Human Mortality Database for both genders of three well-developed countries with an age span and a wide range of fitting year spans. Empirical illustrations show that the proposed multi-dimensional Bühlmann credibility approach contributes to more accurate forecast results, measured by MAPE (mean absolute percentage error), than those based on the Lee-Carter model.
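
For reference, the MAPE criterion named in the abstract can be sketched as below. This uses the standard definition of mean absolute percentage error; the abstract does not say how errors are pooled across ages, years, or populations, so the flat averaging, the function name, and the sample rates here are assumptions for illustration only.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Standard definition: the average of |actual - forecast| / |actual|
    over all pairs, times 100. Actual values must be non-zero.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Made-up mortality rates, each forecast off by 10% of the actual value.
example_mape = mape([0.01, 0.02], [0.011, 0.018])
```

A lower MAPE indicates forecasts that are closer to the observed rates in relative terms.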

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Cary Chi-Liang Tsai
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## Using AI and Statistical Techniques to Correct Play-by-play Substitution Errors

Author:
Date created:
2017-05-26
Abstract:

Play-by-play is an important data source for basketball analysis, particularly for leagues that cannot afford the infrastructure for collecting video tracking data; it enables advanced metrics like adjusted plus-minus and lineup analysis like With Or Without You (WOWY). However, this analysis is not possible unless all substitutions are recorded and are correct. In this paper we use six seasons of play-by-play from the Canadian university league to derive a framework for automated cleaning of play-by-play that is littered with substitution logging errors. These errors include missing substitutions, unequal number of players subbing in and out, substitution patterns of a player not alternating between in/out, and more. We define features to build a prediction model for identifying correct/incorrect recorded substitutions and outline a simple heuristic for player activity to use for inferring the players who were not accounted for in the substitutions. We define two performance measures for objectively quantifying the effectiveness of this framework. The play-by-play which results from the algorithm opens up a set of statistics that were not obtainable for the Canadian university league which improves their analytics capabilities; coaches can improve strategy leading to a more competitive product, and media can introduce modern statistics in their coverage to increase engagement from fans.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Tim Swartz
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## An applied analysis of high-dimensional logistic regression

Author:
Date created:
2017-05-16
Abstract:

In the high-dimensional setting, we investigate common regularization approaches for fitting logistic regression models with binary response variables. A literature review is provided on generalized linear models; regularization approaches, which include the lasso, ridge, elastic net, and relaxed lasso; and recent post-selection methods for obtaining p-values of coefficient estimates proposed by Lockhart et al. and Bühlmann et al. We consider varying n, p conditions, and assess model performance based on several evaluation metrics, such as sparsity, accuracy, and algorithmic time efficiency. Through a simulation study, we find that Bühlmann et al.'s multi-sample-splitting method performed poorly when selected covariates were highly correlated. When λ was chosen through cross-validation, the elastic net had similar levels of performance to the lasso, but it did not possess the level of sparsity Zou and Hastie have suggested.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Richard Lockhart
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.