Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays

The use of submodels as a basis for efficient estimation of complex models

Author: 
Date created: 
2017-11-08
Abstract: 

In this thesis, we consider problems where the true underlying models are complex and obtaining the maximum likelihood estimator (MLE) of the true model is challenging or time-consuming. In our first paper, we investigate a general class of parameter-driven models for time series of counts. Depending on the distribution of the latent variables, these models can be highly complex. We consider a set of simple models within this class as a basis for estimating the regression coefficients in the more complex models. We also derive standard errors (SEs) for these new estimators. We conduct a comprehensive simulation study to evaluate the accuracy and efficiency of our estimators and their SEs. Our results show that, except in extreme cases, the maximizer of the Poisson generalized linear model likelihood (the simplest estimator in our context) is an efficient, consistent, and robust estimator with a well-behaved standard error. In our second paper, we work in the context of display advertising, where the goal is to estimate the probability of conversion (a pre-defined action such as making a purchase) after a user clicks on an ad. In addition to accuracy, in this context, the speed with which the estimate can be computed is critical. Again, computing the MLE of the true model for the observed conversion statuses (which depends on the distribution of the delays in observing conversions) is challenging, in this case because of the huge size of the data set. We use a logistic regression model as a basis for estimation, and then adjust this estimate for its bias. We show that our estimation algorithm leads to accurate estimators and requires far less computation time than does the MLE of the true model. Our third paper also concerns the conversion probability estimation problem in display advertising. We consider a more complicated setting where users may visit an ad multiple times prior to taking the desired action (e.g., making a purchase). We extend the estimator that we developed in our second paper to incorporate information from such visits. We show that this new estimator, the DV-estimator (which accounts for the distributions of both the conversion delay times and the inter-visit times), is more accurate and leads to better confidence intervals than the estimator that accounts only for delay times (the D-estimator). In addition, the time required to compute the DV-estimate for a given data set is only moderately greater than that required to compute the D-estimate, and is substantially less than that required to compute the MLE. In summary, in a variety of settings, we show that estimators based on simple, misspecified models can lead us to accurate, precise, and computationally efficient estimates of both the key model parameters and their standard deviations.
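
For readers unfamiliar with this class, a standard log-linear parameter-driven specification (notation illustrative; the thesis may use a more general formulation) takes the form

\[
Y_t \mid \alpha_t \sim \mathrm{Poisson}(\mu_t), \qquad \log \mu_t = x_t^\top \beta + \alpha_t ,
\]

where \(\{\alpha_t\}\) is a latent, possibly serially correlated, process. The simple working estimator referred to above would then correspond to maximizing the ordinary Poisson GLM log-likelihood, \(\sum_t \{ y_t\, x_t^\top \beta - \exp(x_t^\top \beta) \}\), which ignores \(\alpha_t\) entirely.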

Document type: 
Thesis
Senior supervisor: 
Rachel Altman
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Multivariate CACE analysis with an application to Arthritis Health Journal Study

Author: 
Date created: 
2018-05-07
Abstract: 

Treatment noncompliance is a common issue in randomized controlled trials: it can undermine the randomization and bias the estimation of treatment effects. The complier-average causal effect (CACE) model has become popular for estimating method effectiveness under noncompliance. Performing separate univariate CACE analyses fails to capture the potential correlations among multivariate outcomes, which can lead to biased estimates and a substantial loss of power for detecting the actual treatment effect. Motivated by the Arthritis Health Journal Study, we propose a multivariate CACE model to better account for the correlations among outcomes. In our simulation study, the global likelihood ratio test used to evaluate the treatment effect fails to control the type I error rate for moderate sample sizes, so we also implement a parametric bootstrap test to address this issue. Our simulation results suggest that the multivariate CACE model outperforms separate univariate CACE models in estimation precision and statistical power when the multivariate outcomes are correlated.
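
For context, in the simplest single-outcome, binary-treatment setting with randomized assignment \(Z\), and under the usual monotonicity and exclusion-restriction assumptions, the complier-average causal effect reduces to

\[
\mathrm{CACE} \;=\; \frac{\mathbb{E}[Y \mid Z=1] - \mathbb{E}[Y \mid Z=0]}{\Pr(\text{complier})},
\qquad \Pr(\text{complier}) = \mathbb{E}[D \mid Z=1] - \mathbb{E}[D \mid Z=0],
\]

where \(D\) indicates the treatment actually received. The multivariate model proposed here extends this idea to a vector of correlated outcomes rather than a single \(Y\).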

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Hui Xie
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

A hierarchical credibility approach to modelling mortality rates for multiple populations

Author: 
Date created: 
2018-05-08
Abstract: 

A hierarchical credibility model is a generalization of the Bühlmann credibility model and the Bühlmann-Straub credibility model, with a tree structure of four or more levels. This project applies hierarchical credibility theory, which is used in property and casualty insurance, to model the dependence among multi-population mortality rates. The forecasting performances of the three-, four- and five-level hierarchical credibility models are compared with those of the classical Lee-Carter model and its three extensions for multiple populations (the joint-k, cointegrated and augmented common factor Lee-Carter models). Numerical illustrations based on mortality data for both genders of the US, the UK and Japan, with a series of fitting-year spans and three forecasting periods, show that the hierarchical credibility approach produces more accurate forecasts as measured by the AMAPE (average of mean absolute percentage errors). The proposed model is straightforward to implement and can be further applied to projecting a mortality index for pricing mortality-indexed securities.
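
For context, the Bühlmann credibility estimator that these hierarchical models generalize has the familiar weighted-average form (notation illustrative)

\[
\hat{\mu}_i \;=\; Z \bar{X}_i + (1 - Z)\,\bar{X}, \qquad Z = \frac{n}{n + k}, \qquad k = \frac{\text{expected process variance}}{\text{variance of hypothetical means}},
\]

where \(\bar{X}_i\) is an individual (here, population-specific) mean and \(\bar{X}\) is the collective mean. A hierarchical credibility model applies such weighted averages recursively at each level of the tree, for example sex within country within the pooled collective (the exact tree structures used are described in the project).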

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Cary Chi-Liang Tsai
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Understanding multicollinearity in Bayesian model averaging with BIC approximation

Author: 
Date created: 
2018-04-23
Abstract: 

Bayesian model averaging (BMA) is a widely used method for model and variable selection. In particular, BMA with the Bayesian Information Criterion (BIC) approximation is a frequentist view of model averaging that saves a massive amount of computation compared to the fully Bayesian approach. However, BMA with the BIC approximation may give misleading results in linear regression models when multicollinearity is present. In this project, we explore the relationship between the performance of BMA with the BIC approximation and the true regression parameters and correlations among explanatory variables. Specifically, we derive approximate formulae in the context of a known regression model to predict BMA behaviour in three respects: model selection, variable importance, and coefficient estimation. We use simulations to verify the accuracy of the approximations. Through mathematical analysis, we demonstrate that BMA may not identify the correct model as the highest-probability model if the coefficient and correlation parameters combine to minimize the residual sum of squares of a wrong model. We find that if the regression parameters of the important variables are relatively large, BMA is generally successful in model and variable selection. On the other hand, if the regression parameters of the important variables are relatively small, BMA can be unreliable in identifying the best model or the important variables, especially when the full-model correlation matrix is close to singular. The simulation studies suggest that our formulae are over-optimistic in predicting the posterior probabilities of the true models and important variables. However, these formulae still provide insight into the effect of collinearity on BMA.
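
For reference, the BIC approximation assigns each candidate model \(M_k\) an approximate posterior probability (assuming equal prior model probabilities)

\[
\Pr(M_k \mid \text{data}) \;\approx\; \frac{\exp(-\tfrac{1}{2}\,\mathrm{BIC}_k)}{\sum_j \exp(-\tfrac{1}{2}\,\mathrm{BIC}_j)},
\qquad \mathrm{BIC}_k = -2 \log \hat{L}_k + p_k \log n,
\]

and the importance of a variable is the sum of these weights over all models that include it. The approximate formulae derived in the project predict how these quantities behave as functions of the true coefficients and the correlations among the explanatory variables.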

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Thomas M. Loughin
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Decomposing the RV coefficient to identify genetic markers associated with changes in brain structure

Author: 
Date created: 
2018-04-13
Abstract: 

Alzheimer’s disease (AD) is a chronic neurodegenerative disease that causes memory loss and decline in cognitive abilities; it is the sixth leading cause of death in the United States, affecting an estimated 5 million Americans and 747,000 Canadians. A recent study of AD pathogenesis (Szefer et al., 2017) used the RV coefficient to measure linear association between multiple genetic variants and multiple measurements of structural changes in the brain, using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The authors decomposed the RV coefficient into contributions from individual variants and displayed these contributions graphically. In this project, we investigate the properties of such a “contribution plot” in terms of an underlying linear model, and discuss estimation of the components of the plot when the correlation signal may be sparse. The contribution plot is applied to genomic and brain imaging data from the ADNI-1 study, and to data simulated under various scenarios.
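
For reference, the RV coefficient between two column-centred data matrices \(X\) (genetic variants) and \(Y\) (imaging measures) is

\[
\mathrm{RV}(X, Y) \;=\; \frac{\operatorname{tr}\!\left(X X^\top Y Y^\top\right)}{\sqrt{\operatorname{tr}\!\left\{ (X X^\top)^2 \right\} \operatorname{tr}\!\left\{ (Y Y^\top)^2 \right\}}} .
\]

Since the numerator can be written as \(\sum_j \sum_k (x_j^\top y_k)^2\) over columns \(x_j\) of \(X\) and \(y_k\) of \(Y\), one natural per-variant contribution is \(\sum_k (x_j^\top y_k)^2\); this is offered only as an illustration of the idea behind a contribution plot, with the precise decomposition given in Szefer et al. (2017) and in the project.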

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Brad McNeney
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Bayesian methodology for latent function modeling in applied physics and engineering

Author: 
Date created: 
2017-12-20
Abstract: 

Computer simulators play a key role in modern science and engineering as tools for understanding and exploring physical systems. Calibration and validation are important components of the use of simulators. Calibration is necessary for assessing the predictive capability of the model with fully quantified sources of uncertainty. Field observations of physical systems often come in diverse types. New methodology for calibration with a generalized measurement-error structure is proposed and applied to the parallel deterministic transport model for the Center for Exascale Radiation Transport at Texas A&M University. Validation of computer models is critical for building trust in a simulator. We propose a new methodology for model validation using goodness-of-fit hypothesis tests in a Bayesian model assessment framework. Lastly, the use of a hidden Markov model with a particle filter is proposed for detecting anomalies in time series, with the aim of identifying intrusions in cyber-physical networks.
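
As background for the calibration component, the canonical formulation (in the spirit of Kennedy and O'Hagan; the thesis develops a generalized measurement-error structure that modifies this standard setup) links a field observation at input \(x\) to the simulator \(\eta\) run at the unknown calibration parameter \(\theta\):

\[
y(x) \;=\; \eta(x, \theta) + \delta(x) + \varepsilon,
\]

where \(\delta(x)\) is a model-discrepancy term, typically given a Gaussian process prior, and \(\varepsilon\) is measurement error.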

Document type: 
Thesis
File(s): 
Senior supervisor: 
Derek Bingham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Ranking and prediction for Cycling Canada

Author: 
Date created: 
2017-12-14
Abstract: 

In an effort to improve Canadian performance in the men's Elite UCI Mountain Bike World Cup, researchers from the Canadian Sport Institute Ontario (CSIO) presented us with a specific problem: they had a wealth of race data but were unsure how best to extract insights from it. We responded to their request by building an interactive user interface with R Shiny that produces rider rankings. Estimation was carried out via maximum likelihood using the Bradley-Terry model. We improved on the existing literature by proposing an exponentially weighted version of the model and determining an optimal weighting parameter through cross-validation based on the outcomes of future races; the proposed methods therefore provide forecasting capability. The tuned Bradley-Terry estimator performed better than the UCI point-based ranking in terms of predictive error. This implementation of the Bradley-Terry model with a user-friendly graphical interface gives a broader scientific audience easy access to Bradley-Terry ranking for prediction in racing sports.
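
For readers unfamiliar with the model, the Bradley-Terry probability that rider \(i\) finishes ahead of rider \(j\) is

\[
\Pr(i \text{ beats } j) \;=\; \frac{\exp(\lambda_i)}{\exp(\lambda_i) + \exp(\lambda_j)},
\]

and an exponentially weighted version can be obtained by multiplying each race's contribution to the log-likelihood by a weight that decays with the age of the race, e.g. \(w_r = \theta^{\,t_{\mathrm{now}} - t_r}\) with \(0 < \theta \le 1\) (this weight form is shown only for illustration; the report defines the actual weighting, with the parameter tuned by cross-validation on subsequent races).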

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Tim Swartz
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Bayesian Integration for Assessing the Quality of the Laplace Approximation

Author: 
Date created: 
2017-11-24
Abstract: 

The number of nuisance parameters grows as additional data are collected. In dynamic models, this typically results in more parameters than observations, making direct estimation intractable. The Laplace Approximation is the standard tool for approximating the high-dimensional integral required to marginalize over the nuisance parameters. However, the Laplace Approximation relies on asymptotic arguments that are unavailable for nuisance parameters. The usual way to assess the quality of the Laplace Approximation relies on much slower MCMC-based methods. In this work, a probabilistic integration approach is used to develop a diagnostic for the quality of the Laplace Approximation.
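
For concreteness, writing \(u\) for the nuisance parameters and \(g(u; \theta)\) for the log of the joint density, the marginal likelihood and its Laplace Approximation are

\[
L(\theta) = \int \exp\{ g(u; \theta) \}\, du
\;\approx\; (2\pi)^{d/2} \left| -g''(\hat{u}; \theta) \right|^{-1/2} \exp\{ g(\hat{u}; \theta) \},
\qquad \hat{u} = \arg\max_u g(u; \theta),
\]

where \(d\) is the dimension of \(u\) and \(g''\) is the Hessian in \(u\). The probabilistic integration diagnostic developed here targets the gap between this Gaussian substitution and the true integral.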

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
David Alexander Campbell
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Forecasting Batting Averages in MLB

Author: 
Date created: 
2017-11-14
Abstract: 

We consider new baseball data from Statcast, which includes launch angle, launch velocity, and hit distance for batted balls in Major League Baseball during the 2015 and 2016 seasons. Using logistic regression, we train two models on 2015 data to obtain the probability that a player gets a hit on each of their 2015 at-bats. For each player, we sum these predictions and divide by their total at-bats to predict their 2016 batting average. We then use linear regression to express actual 2016 batting averages as a linear combination of the 2016 Statcast predictions and the 2016 PECOTA predictions. When using this procedure to obtain 2017 predictions, we find that the combined prediction performs better than PECOTA. This information may be used to make better predictions of batting averages in future seasons.
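
The combination step described above amounts to a simple linear blend of the two forecasts; schematically (coefficients estimated on the 2016 season and then applied to later seasons),

\[
\widehat{\mathrm{BA}}^{\,\mathrm{combined}}_i \;=\; \hat{\beta}_0 + \hat{\beta}_1\, \widehat{\mathrm{BA}}^{\,\mathrm{Statcast}}_i + \hat{\beta}_2\, \widehat{\mathrm{BA}}^{\,\mathrm{PECOTA}}_i .
\]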

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Timothy Swartz
Jason Loeppky
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Estimating the conditional intensity function of a neural spike train by particle Markov chain Monte Carlo and smoothing

Author: 
Date created: 
2017-08-14
Abstract: 

Understanding neural activity is fundamental and challenging in decoding how the brain processes information. An essential part of the problem is to define a meaningful and quantitative characterization of neural activity when it is represented by a sequence of action potentials, or a neural spike train. We use a point process to represent a neural spike train; this representation provides a conditional intensity function (CIF) that describes the neural activity. The estimation procedure for the CIF, comprising particle Markov chain Monte Carlo (PMCMC) and smoothing, is introduced and applied to a real data set. From the estimated CIF of a neural spike train and its derivative, we can successfully observe adaptation behavior. A simulation study verifies that the estimation procedure provides reliable estimates of the CIF. This framework provides a definite quantification of neural activity and facilitates further investigation of the brain from a neurological perspective.
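
For reference, the conditional intensity function of a point process \(N(t)\) with history \(H_t\) is defined by

\[
\lambda(t \mid H_t) \;=\; \lim_{\Delta \to 0} \frac{\Pr\{ N(t + \Delta) - N(t) = 1 \mid H_t \}}{\Delta},
\]

so that \(\lambda(t \mid H_t)\,\Delta\) is approximately the probability of a spike in \([t, t + \Delta)\); the PMCMC-and-smoothing procedure estimates this function, and its derivative, from the observed spike times.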

Document type: 
Graduating extended essay / Research project
File(s): 
Senior supervisor: 
Jiguo Cao
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.