# Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays

## An analysis of loan prepayment using competing risks random forests

Author:
Date created:
2019-11-27
Abstract:

Loan prepayment is a large cause of loss to financial institutions when they issue installment loans, and has not been well studied with respect to predicting it for individual borrowers. Using a dataset of competing risks times for loan termination, competing risks random forests were used as a non-parametric approach for identifying useful predictors, and for finding a tuned model that demonstrated that loan prepayment can be predicted on an individual borrower basis. In addition, a new software package we developed, largeRCRF, is introduced and evaluated for the purpose of training competing risks random forests on large scale datasets. This research is a firm first step for financial institutions to reduce their prepayment rates and increase their margins.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Jiguo Cao
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## Ain’t played nobody: Building an optimal schedule to secure an NCAA tournament berth

Author:
Date created:
2019-08-12
Abstract:

American men’s college basketball teams compete annually for the National Collegiate Athletic Association (NCAA) national championship, determined through a 68-team, single-elimination tournament known as “March Madness”. Tournament participants either qualify automatically, through their conferences’ year-end tournaments, or are chosen by a selection committee based on various measures of regular season success. When selecting teams, the committee reportedly values a team's quality of, and performance against, opponents outside of their conference. Since teams have some freedom in selecting nonconference games, we seek to develop an approach to optimizing this choice. Using historical data, we find the committee's most valued criteria for selecting tournament teams. Additionally, we use prior seasons’ success and projected returning players to forecast every team’s strength for the upcoming season. Using the selection criteria and these projections, we develop a tool to help teams build the optimal nonconference schedule to increase their NCAA tournament selection probability.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Thomas Loughin
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## Advanced Monte Carlo methods and applications

Author:
Date created:
2019-08-16
Abstract:

Monte Carlo methods have emerged as standard tools to do Bayesian statistical inference for sophisticated models. Sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC) are two main classes of methods to sample from high dimensional probability distributions. This thesis develops methodologies within these classes to address problems in different research areas. Phylogenetic tree reconstruction is a main task in evolutionary biology. Traditional MCMC methods may suffer from the curse of dimensionality and the local-trap problem. Firstly, we introduce a new combinatorial SMC method, with a novel and efficient proposal distribution. We also explore combining SMC and Gibbs sampling to jointly estimate the phylogenetic trees and evolutionary parameter of genetic data sets. Secondly, we propose an “embarrassingly parallel” method for Bayesian phylogenetic inference, annealed SMC, based on recent advances in the SMC literature such as adaptive determination of annealing parameters. Another application of the methods presented in this thesis is in genome-wide association studies. Linear mixed models (LMMs) are powerful methods for controlling confounding caused by population structure. We develop a Bayesian hierarchical model to jointly estimate LMM parameters and the genetic similarity matrix using genetic sequences and phenotypes. We develop an SMC method to jointly approximate the posterior distributions of the LMM and phylogenetic trees. We also consider parameter estimation for nonlinear differential equation (DE) systems from noisy measurements of dynamic systems. We develop a fully Bayesian framework for nonlinear DE systems. A flexible nonparametric function is used to represent the dynamic process such that expensive numerical solvers can be avoided. We derive an SMC method to sample from multi-modal DE posterior distributions. In addition, we consider Bayesian computing problems related to importance sampling and misclassification in multinomial data.
Lastly, motivated by a personalized recommender system with dynamic preference changes, we develop a new hidden Markov model (HMM) and propose an efficient online SMC algorithm by hybridizing with the EM algorithm for the HMM model.

Document type:
Thesis
File(s):
Supervisor(s):
Liangliang Wang
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Thesis) Ph.D.

## Shrinkage parameter estimation for penalized logistic regression analysis of case-control data

Author:
Date created:
2019-08-15
Abstract:

In genetic epidemiology, rare variant case-control studies aim to investigate the association between rare genetic variants and human diseases. Rare genetic variants lead to sparse covariates that are predominately zeros and this sparseness leads to estimators of log-odds-ratio parameters that are biased away from their null value of zero. Different penalized-likelihood methods have been developed to mitigate this sparse-data bias for case-control studies. In this project, we study penalized logistic regression using a class of log-F priors indexed by a shrinkage parameter m to shrink the biased MLE towards zero. We propose a simple method to select the value of m based on a marginal likelihood. The marginal likelihood is maximized by the Monte Carlo EM algorithm. Properties of the proposed method are evaluated in a simulation study, and the method is applied to a real dataset from the ADNI-1 study.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## A multi-state model for pricing critical illness insurance products

Author:
Date created:
2019-08-21
Abstract:

Due to increasing cases of cancer and other severe illnesses, there is a great demand for critical illness insurance products. This project introduces a Markovian multi-state model based on popular critical illness plans to describe the policyholder's health condition over time, which includes being diagnosed with certain dread diseases such as cancer, stroke and heart attack. Critical illness insurance products with life insurance or other optional riders are considered. Following the idea of Baione and Levantesi (Insurance: Mathematics and Economics, 58: 174-184, 2014), we focus on the method of modelling mortality rates, estimating transition probabilities with Canadian prevalence rates and incidence rates of covered illnesses, and calculating premium rates based on the multi-state model. A comparison of transition intensities under various mortality models and premium rates for critical illness policies under several graduation approaches are also illustrated.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Yi Lu
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## A moneyness-adapting fee structure for guaranteed benefits embedded in variable annuities: Pricing and valuation

Author:
Date created:
2019-08-21
Abstract:

Guaranteed minimum death benefit (GMDB) and guaranteed minimum maturity benefit (GMMB) are two common guarantee riders embedded in variable annuities. To cover the financial risks arising from the guarantees, fees are charged based on the underlying fund value, where a traditional approach funds the guarantees at a constant fee rate over the accumulation phase. This fee structure, however, potentially encourages surrendering when the options are out-of-the-money. To prevent these adverse incentives, Bernard et al. (2014) introduced a state-dependent fee, where fees are charged only when the guarantees are in-the-money or close to being in-the-money. This project proposes a moneyness-adapting fee structure, aiming to further reduce the insurer’s reserve. Following the estimation of the fee rate charged for GMDB and/or GMMB under three pricing principles, the performances of three fee structures are compared with numerical illustrations, based on the measures of value-at-risk and conditional-tail-expectation.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Cary Chi-Liang Tsai
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## Supersaturated designs for screening experiments and strong orthogonal arrays for computer experiments

Author:
Date created:
2019-08-23
Abstract:

This dissertation centers on supersaturated designs and strong orthogonal arrays, which provide useful plans for screening experiments and computer experiments, respectively. Supersaturated designs are a good choice for screening experiments. In order to use such designs, a common assumption that all interactions are negligible is made. In this dissertation, this assumption is dropped for the use of supersaturated designs. We propose and study a new class of supersaturated designs, namely foldover supersaturated designs, which allow the active main effects to be identified without making the assumption that two-factor interactions are absent. The E(s²)-optimal foldover supersaturated designs are constructed, and further optimization is also considered for these E(s²)-optimal supersaturated designs. Strong orthogonal arrays were recently introduced and studied as a class of space-filling designs for computer experiments. This dissertation tackles two important problems that so far have not been addressed in the literature. The first problem is how to develop concrete constructions for strong orthogonal arrays of strength 3. We provide a systematic and comprehensive study on the construction of these arrays, with the aim of better space-filling properties. Besides various characterizing results, three families of arrays of strength 3 are presented. The other important problem is that of design selection for strong orthogonal arrays. We conduct a systematic investigation into this problem with the focus on strong orthogonal arrays of strength 2+ and 2. We first select arrays of strength 2+ by examining their 3-dimensional projections, and then formulate a general framework for the selection of arrays of strength 2 by looking at their 2-dimensional projections. Both theoretical and computational results for arrays are presented.

Document type:
Thesis
File(s):
Supervisor(s):
Boxin Tang
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Thesis) Ph.D.

## A statistical investigation of data from the NHL Combine

Author:
Date created:
2019-05-22
Abstract:

This project seeks to discover useful information from the NHL Combine results by comparing NHL Central Scouting Service rankings, NHL Draft results and measures of player evaluation. Data management is central to this project and we describe the details of handling datasets including the large and proprietary Combine dataset. Many data management decisions are made based on knowledge from the sport of hockey. The investigation of three questions of interest is carried out utilizing modern machine learning techniques such as random forests. Investigation 1 determines whether the Combine serves any purpose in terms of modifying the opinion of Central Scouting. Investigation 2 focuses on which test results of the Combine are important in predicting prospects’ future development. Investigation 3 considers how the Combine results revise Central Scouting’s beliefs.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Tim Swartz
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## Covariance-adjusted, sparse, reduced-rank regression with application to imaging-genetics data

Author:
Date created:
2019-05-31
Abstract:

Alzheimer's disease (AD) is one of the most challenging diseases in the world and it is crucial for researchers to explore the relationship between AD and genes. In this project, we analyze data from 179 cognitively normal individuals that contain magnetic resonance imaging measures in 56 brain regions of interest and alternate allele counts of 510 single nucleotide polymorphisms (SNPs) obtained from 33 candidate genes for AD, provided by the AD Neuroimaging Initiative (ADNI). Our objectives are to explore the data structure and prioritize interesting SNPs. Using standard linear regression models is inappropriate in this research context, because they cannot account for sparsity in the SNP effects and the spatial correlations between brain regions. Thus, we review and apply the method of covariance-adjusted, sparse, reduced-rank regression (Cov-SRRR) that simultaneously performs variable selection and covariance estimation to the data of interest. In our findings, SNP rs16871157 has the highest variable importance probability (VIP) in bootstrapping. Also, the estimated coefficient values corresponding to the thickness measures of the temporal lobe area have the largest absolute values and are negative, which is consistent with current AD research.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Jinko Graham
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.

## Selecting baseline two-level designs using optimality and aberration criteria when some two-factor interactions are important

Author:
Date created:
2019-06-14
Abstract:

The baseline parameterization is less commonly used in factorial designs than the orthogonal parameterization. However, the former is more natural than the latter when there exists a default or preferred setting for each factor in an experiment. The current method selects optimal baseline designs for estimating a main effect model. In this project, we consider the selection of optimal baseline designs when estimates of both main effects and some two-factor interactions are wanted. Any other potentially active effect causes bias in estimation of the important effects. To minimize the contamination of these potentially active effects, we propose a new minimum aberration criterion. Moreover, an optimality criterion is used to minimize the variances of the estimates. Finally, we develop a search algorithm for selecting optimal baseline designs based on these criteria and present some optimal designs of 16 and 20 runs for models with up to three important two-factor interactions.

Document type:
Graduating extended essay / Research project
File(s):
Supervisor(s):
Boxin Tang
Department:
Science: Department of Statistics and Actuarial Science
Thesis type:
(Project) M.Sc.