Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays


Multivariate CACE analysis with an application to the Arthritis Health Journal Study

Author: 
Date created: 
2018-05-07
Abstract: 

Treatment noncompliance is a common issue in randomized controlled trials: it can undermine the randomization and bias the estimated treatment effect. The complier-average causal effect (CACE) model has become a popular way to estimate method effectiveness under noncompliance. Performing separate univariate CACE analyses fails to capture the potential correlations among multivariate outcomes, which leads to biased estimates and a significant loss of power in detecting the actual treatment effect. Motivated by the Arthritis Health Journal Study, we propose a multivariate CACE model to better account for the correlations among outcomes. In our simulation study, the global likelihood ratio test of the treatment effect fails to control the type I error rate at moderate sample sizes, so we also perform a parametric bootstrap test to address this issue. Our simulation results suggest that the multivariate CACE model outperforms multiple univariate CACE models in precision of estimation and statistical power when the multivariate outcomes are correlated.
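
As a rough illustration of the bootstrap step described above (not the authors' implementation), the sketch below is a generic parametric bootstrap likelihood-ratio test in Python. The functions `fit_null`, `fit_alt`, and `simulate_null` are hypothetical placeholders for the CACE model fits under no treatment effect and under the alternative.

```python
import numpy as np

def bootstrap_lrt(y, fit_null, fit_alt, simulate_null, n_boot=999, seed=1):
    """Parametric bootstrap p-value for a likelihood-ratio statistic.

    fit_null / fit_alt return (estimate, maximized log-likelihood);
    simulate_null draws one data set from the fitted null model.
    """
    rng = np.random.default_rng(seed)
    theta0, ll0 = fit_null(y)
    _, ll1 = fit_alt(y)
    lrt_obs = 2.0 * (ll1 - ll0)           # observed LR statistic
    lrt_boot = np.empty(n_boot)
    for b in range(n_boot):
        y_b = simulate_null(theta0, rng)  # data under H0: no treatment effect
        _, ll0_b = fit_null(y_b)
        _, ll1_b = fit_alt(y_b)
        lrt_boot[b] = 2.0 * (ll1_b - ll0_b)
    # bootstrap p-value: share of null statistics at least as extreme
    return (1.0 + np.sum(lrt_boot >= lrt_obs)) / (n_boot + 1.0)
```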

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Hui Xie
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

A hierarchical credibility approach to modelling mortality rates for multiple populations

Author: 
Date created: 
2018-05-08
Abstract: 

A hierarchical credibility model is a generalization of the Bühlmann credibility model and the Bühlmann-Straub credibility model with a tree structure of four or more levels. This project aims to incorporate hierarchical credibility theory, which is used in property and casualty insurance, into modelling the dependence among multi-population mortality rates. The forecasting performance of the three-, four-, and five-level hierarchical credibility models is compared with that of the classical Lee-Carter model and its three extensions for multiple populations (the joint-k, cointegrated, and augmented common factor Lee-Carter models). Numerical illustrations based on mortality data for both genders of the US, the UK, and Japan, with a series of fitting-year spans and three forecasting periods, show that the hierarchical credibility approach yields more accurate forecasts as measured by the AMAPE (average of mean absolute percentage errors). The proposed model is convenient to implement and can be further applied to projecting a mortality index for pricing mortality-indexed securities.
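
The AMAPE criterion above has a simple form. The sketch below computes it in Python, assuming errors are averaged within each population first and then across populations; the exact averaging convention used in the project may differ.

```python
import numpy as np

def amape(actual, forecast):
    """Average of mean absolute percentage errors across populations.

    `actual` and `forecast` are arrays of shape (n_populations, n_years)
    holding observed and forecasted mortality rates.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    mape = np.mean(np.abs((forecast - actual) / actual), axis=1)  # MAPE per population
    return mape.mean()                                            # average over populations
```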

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Cary Chi-Liang Tsai
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Understanding multicollinearity in Bayesian model averaging with BIC approximation

Author: 
Date created: 
2018-04-23
Abstract: 

Bayesian model averaging (BMA) is a widely used method for model and variable selection. In particular, BMA with the Bayesian Information Criterion (BIC) approximation is a frequentist view of model averaging that saves a massive amount of computation compared to the fully Bayesian approach. However, BMA with BIC approximation may give misleading results in linear regression models when multicollinearity is present. In this article, we explore the relationship between the performance of BMA with BIC approximation and the true regression parameters and correlations among explanatory variables. Specifically, we derive approximate formulae in the context of a known regression model to predict BMA behaviour in three aspects: model selection, variable importance, and coefficient estimation. We use simulations to verify the accuracy of the approximations. Through mathematical analysis, we demonstrate that BMA may not identify the correct model as the highest-probability model if the coefficient and correlation parameters combine to minimize the residual sum of squares of a wrong model. We find that if the regression parameters of important variables are relatively large, BMA is generally successful in model and variable selection. On the other hand, if the regression parameters of important variables are relatively small, BMA can be unreliable in identifying the best model or important variables, especially when the full-model correlation matrix is close to singular. The simulation studies suggest that our formulae are over-optimistic in predicting posterior probabilities of the true models and important variables. However, these formulae still provide insight into the effect of collinearity on BMA.
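
The BIC approximation weights each candidate model by exp(-BIC/2), normalized across models. As an illustration of that mechanism (not the authors' code), the sketch below enumerates all subsets of predictors in a linear regression and returns posterior model probabilities and variable-inclusion probabilities.

```python
import itertools
import numpy as np

def bma_bic(X, y):
    """BIC-approximated BMA over all subsets of the columns of X.

    Each model gets weight exp(-BIC/2); a variable's inclusion
    probability is the total weight of models containing it.
    """
    n, p = X.shape
    models, bics = [], []
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            Xm = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            beta, *_ = np.linalg.lstsq(Xm, y, rcond=None)
            rss = np.sum((y - Xm @ beta) ** 2)
            bic = n * np.log(rss / n) + Xm.shape[1] * np.log(n)  # Gaussian BIC, constants dropped
            models.append(subset)
            bics.append(bic)
    w = np.exp(-0.5 * (np.array(bics) - min(bics)))  # shift for numerical stability
    w /= w.sum()
    incl = np.array([sum(w[i] for i, m in enumerate(models) if j in m)
                     for j in range(p)])
    return models, w, incl
```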

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Thomas M. Loughin
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Decomposing the RV coefficient to identify genetic markers associated with changes in brain structure

Author: 
Date created: 
2018-04-13
Abstract: 

Alzheimer’s disease (AD) is a chronic neurodegenerative disease that causes memory loss and decline in cognitive abilities; it is the sixth leading cause of death in the United States, affecting an estimated 5 million Americans and 747,000 Canadians. A recent study of AD pathogenesis (Szefer et al., 2017) used the RV coefficient to measure linear association between multiple genetic variants and multiple measurements of structural changes in the brain, using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The authors decomposed the RV coefficient into contributions from individual variants and displayed these contributions graphically. In this project, we investigate the properties of such a “contribution plot” in terms of an underlying linear model, and discuss estimation of the components of the plot when the correlation signal may be sparse. The contribution plot is applied to genomic and brain imaging data from the ADNI-1 study, and to data simulated under various scenarios.
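
For column-centered matrices X (variants) and Y (imaging measures), the RV coefficient is tr(XX'YY') / sqrt(tr((XX')^2) tr((YY')^2)), and the numerator equals the squared Frobenius norm of X'Y, which splits as a sum over the columns of X. The sketch below uses that split as one plausible per-variant decomposition; the precise convention in Szefer et al. (2017) may differ.

```python
import numpy as np

def rv_contributions(X, Y):
    """RV coefficient between X and Y plus a per-column-of-X decomposition.

    The numerator ||X'Y||_F^2 is split across the columns of X, giving
    one contribution per genetic variant; the contributions sum to RV.
    """
    X = X - X.mean(axis=0)                        # column-center both blocks
    Y = Y - Y.mean(axis=0)
    cross = X.T @ Y                               # p x q cross-product matrix
    denom = np.sqrt(np.sum((X.T @ X) ** 2) * np.sum((Y.T @ Y) ** 2))
    contrib = np.sum(cross ** 2, axis=1) / denom  # contribution of each variant
    return contrib.sum(), contrib                 # RV coefficient and its pieces
```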

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Brad McNeney
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Bayesian methodology for latent function modeling in applied physics and engineering

Author: 
Date created: 
2017-12-20
Abstract: 

Computer simulators play a key role in modern science and engineering as a tool for understanding and exploring physical systems. Calibration and validation are important parts of the use of simulators. Calibration is a necessary part of assessing the predictive capability of a model with fully quantified sources of uncertainty. Field observations of physical systems often come in diverse types. New methodology for calibration with a generalized measurement error structure is proposed and applied to the parallel deterministic transport model for the Center for Exascale Radiation Transport at Texas A&M University. Validation of computer models is critical for building trust in a simulator. We propose a new methodology for model validation using goodness-of-fit hypothesis tests in a Bayesian model assessment framework. Lastly, the use of a hidden Markov model with a particle filter is proposed for detecting anomalies in time series for the purpose of identifying intrusions in cyber-physical networks.
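
As one standard instance of a Bayesian goodness-of-fit check (not necessarily the specific tests proposed in the thesis), the sketch below computes a posterior predictive p-value; `simulate` and `discrepancy` are hypothetical user-supplied functions.

```python
import numpy as np

def posterior_predictive_pvalue(y, posterior_draws, simulate, discrepancy, seed=0):
    """Posterior predictive p-value for a generic Bayesian model check.

    For each posterior draw theta, simulate replicate data y_rep and
    compare T(y_rep, theta) with T(y, theta); the p-value is the share
    of replicates whose discrepancy exceeds the observed one.
    """
    rng = np.random.default_rng(seed)
    exceed = 0
    for theta in posterior_draws:
        y_rep = simulate(theta, rng)
        exceed += discrepancy(y_rep, theta) >= discrepancy(y, theta)
    return exceed / len(posterior_draws)
```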

Document type: 
Thesis
File(s): 
Senior supervisor: 
Derek Bingham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Ranking and prediction for Cycling Canada

Author: 
Date created: 
2017-12-14
Abstract: 

In an effort to improve Canadian performance in the men's Elite UCI Mountain Bike World Cup, researchers from the Canadian Sport Institute Ontario (CSIO) presented us with a specific problem: they had a wealth of race data but were unsure how best to extract insights from it. We responded by building an interactive user interface in R Shiny to obtain rider rankings. Estimation was carried out via maximum likelihood using the Bradley-Terry model. Building on the existing literature, we proposed an exponentially weighted version of the model and determined an optimal weighting parameter through cross-validation against the results of future races; the proposed methods therefore provide forecasting capability. The tuned Bradley-Terry estimation outperformed the UCI point-based ranking in terms of predictive error. This implementation of the Bradley-Terry model with a user-friendly graphical interface gives a broader scientific audience easy access to Bradley-Terry ranking for prediction in racing sports.
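
A minimal sketch of an exponentially weighted Bradley-Terry fit is given below (illustrative, not the project's R Shiny code), assuming pairwise results encoded as (winner, loser) rider indices plus an age, in past periods, for each result; the decay `lam` would be tuned by cross-validation as described above.

```python
import numpy as np
from scipy.optimize import minimize

def fit_weighted_bradley_terry(winners, losers, ages, n_riders, lam=0.8):
    """Exponentially weighted Bradley-Terry abilities by maximum likelihood.

    Each result (winner beats loser) is down-weighted by lam**age, so
    older races count less toward the fitted abilities.
    """
    winners = np.asarray(winners)
    losers = np.asarray(losers)
    w = lam ** np.asarray(ages, dtype=float)

    def neg_log_lik(theta):
        d = theta[winners] - theta[losers]
        return np.sum(w * np.log1p(np.exp(-d)))   # -sum of w * log sigmoid(d)

    res = minimize(neg_log_lik, np.zeros(n_riders), method="BFGS")
    return res.x - res.x.mean()                   # center abilities for identifiability
```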

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Tim Swartz
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Bayesian Integration for Assessing the Quality of the Laplace Approximation

Author: 
Date created: 
2017-11-24
Abstract: 

Nuisance parameters increase in number as additional data are collected. In dynamic models, this typically results in more parameters than observations, making direct estimation intractable. The Laplace Approximation is the standard tool for approximating the high-dimensional integral required to marginalize over the nuisance parameters. However, the Laplace Approximation relies on asymptotic arguments that do not apply to nuisance parameters whose number grows with the data. Existing ways to assess the quality of the Laplace Approximation rely on much slower MCMC-based methods. In this work, a probabilistic integration approach is used to develop a diagnostic for the quality of the Laplace Approximation.
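
For a one-dimensional integrand, the Laplace Approximation replaces the integral of exp(log f) with a Gaussian integral around the mode. The sketch below shows the idea; comparing its output against a numerical quadrature of the same integrand is the crudest version of the quality check this project develops probabilistically.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_log_integral(log_f):
    """Laplace approximation to log( integral of exp(log_f(x)) dx ).

    Finds the mode x_hat, estimates the curvature there by a central
    finite difference, and plugs into the Gaussian-integral formula
    log_f(x_hat) + 0.5 * log(2*pi / |curvature|).
    """
    res = minimize_scalar(lambda x: -log_f(x))
    x_hat, h = res.x, 1e-4
    curv = (log_f(x_hat + h) - 2 * log_f(x_hat) + log_f(x_hat - h)) / h**2
    return log_f(x_hat) + 0.5 * np.log(2 * np.pi / abs(curv))

# Sanity check: a standard normal kernel integrates to sqrt(2*pi),
# and the approximation is exact for Gaussian integrands.
print(laplace_log_integral(lambda x: -0.5 * x**2))  # ~ 0.5*log(2*pi)
```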

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
David Alexander Campbell
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Forecasting Batting Averages in MLB

Author: 
Date created: 
2017-11-14
Abstract: 

We consider new baseball data from Statcast, which includes launch angle, launch velocity, and hit distance for batted balls in Major League Baseball during the 2015 and 2016 seasons. Using logistic regression, we train two models on 2015 data to obtain the probability that a player gets a hit on each of their 2015 at-bats. For each player, we sum these predictions and divide by their total at-bats to predict their 2016 batting average. We then fit a linear regression that expresses actual 2016 batting averages as a linear combination of the 2016 Statcast predictions and the 2016 PECOTA predictions. When using this procedure to obtain 2017 predictions, we find that the combined prediction performs better than PECOTA alone. This information may be used to make better predictions of batting averages in future seasons.
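
The per-player aggregation step can be sketched as follows, assuming hypothetical arrays of batted-ball features, hit indicators, and player identifiers; a full treatment would also handle at-bats with no batted ball (e.g. strikeouts), which is one reason the project trains two models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def forecast_next_season_avg(X, hit, player):
    """Fit a hit / no-hit logistic regression on one season's batted-ball
    features (launch angle, launch velocity, hit distance), then forecast
    each player's next-season batting average as the mean predicted hit
    probability over this season's at-bats (sum of predictions / at-bats).
    """
    model = LogisticRegression(max_iter=1000).fit(X, hit)
    p_hit = model.predict_proba(X)[:, 1]
    return {pl: p_hit[player == pl].mean() for pl in np.unique(player)}
```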

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Timothy Swartz
Jason Loeppky
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Estimating the conditional intensity function of a neural spike train by particle Markov chain Monte Carlo and smoothing

Author: 
Date created: 
2017-08-14
Abstract: 

Understanding neural activity is fundamental and challenging in decoding how the brain processes information. An essential part of the problem is to define a meaningful and quantitative characterization of neural activity when it is represented by a sequence of action potentials, or a neural spike train. This thesis represents a neural spike train as a point process, a representation that provides a conditional intensity function (CIF) to describe the neural activity. The estimation procedure for the CIF, involving particle Markov chain Monte Carlo (PMCMC) and smoothing, is introduced and applied to a real data set. From the CIF of a neural spike train and its derivative, we successfully observe adaptation behaviour. A simulation study verifies that the estimation procedure provides reliable estimates of the CIF. This framework gives a definite quantification of neural activity and facilitates further investigation of the brain from a neurological perspective.
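
The particle-filter building block of PMCMC can be sketched for a toy model in which the log of the CIF follows a random walk and binned spike counts are Poisson; the thesis's actual state-space model and the surrounding MCMC step over static parameters are more involved.

```python
import numpy as np

def bootstrap_particle_filter(counts, n_particles=1000, sigma=0.1, seed=0):
    """Bootstrap particle filter for a latent log-intensity random walk.

    Toy model: x_t = x_{t-1} + N(0, sigma^2); the spike count in bin t
    is Poisson with mean exp(x_t). Returns filtered means of x_t, an
    estimate of the log conditional intensity function over time.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, n_particles)
    means = np.empty(len(counts))
    for t, k in enumerate(counts):
        x = x + rng.normal(0.0, sigma, n_particles)   # propagate particles
        logw = k * x - np.exp(x)                      # Poisson log-weights (constants dropped)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means[t] = np.sum(w * x)                      # filtered posterior mean
        x = rng.choice(x, size=n_particles, p=w)      # multinomial resampling
    return means
```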

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Jiguo Cao
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Adjusting for Scorekeeper Bias in NBA Box Scores

Date created: 
2017-06-01
Abstract: 

Box score statistics in the National Basketball Association are used to measure and evaluate player performance. Some of these statistics are subjective in nature, and since box score statistics are recorded by scorekeepers hired by the home team for each game, there is potential for inconsistency and bias. These inconsistencies can have far-reaching consequences, particularly with the rise in popularity of daily fantasy sports. Using box score data, we estimate models that quantify both the bias and the generosity of each scorekeeper for two of the most subjective statistics: assists and blocks. We then use optical player tracking data for the 2015-2016 season to improve the assist model by including contextual spatio-temporal variables such as time of possession, player locations, and distance traveled. From this model, we present results measuring the impact of the scorekeeper and of the other contextual variables on the probability of a pass being recorded as an assist. We also present results for adjusting season assist totals to remove scorekeeper influence.
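
One simple way to encode scorekeeper effects alongside context is a logistic regression with an indicator per scorekeeper, sketched below with hypothetical inputs; the models estimated in the project are more elaborate than this.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_assist_model(context, scorekeeper, recorded_assist):
    """Illustrative scorekeeper-adjusted assist model: probability that a
    pass is recorded as an assist, given contextual features (e.g. time
    of possession, distance traveled) and per-scorekeeper indicators.
    """
    keepers, idx = np.unique(scorekeeper, return_inverse=True)
    sk = np.eye(len(keepers))[idx][:, 1:]    # one-hot, first keeper as baseline
    X = np.column_stack([context, sk])
    model = LogisticRegression(max_iter=1000).fit(X, recorded_assist)
    return model, keepers  # scorekeeper coefficients reflect relative generosity
```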

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Luke Bornn
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.