Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays

Using computer model uncertainty to inform the design of physical experiments: An application in glaciology

Author: 
Date created: 
2016-08-09
Abstract: 

Computer models are used as surrogates for physical experiments in many areas of science. They allow researchers to gain a better understanding of the processes of interest in situations where it would be overly costly or time-consuming to obtain sufficient physical data. In this project, we present an approach for using a computer model to obtain designs for a physical experiment. The designs are optimal for modelling the spatial distribution of the response across the region of interest. An additional consideration is the presence of several tuning parameters in the computer model, which represent physical aspects of the process but whose values are not precisely known. In obtaining the optimal designs, we account for this uncertainty in the parameters governing the system. The project is motivated by an application in glaciology, where computer models are often used to model the melt of snow and ice across a glacier surface. The methodology is applied to obtain optimal networks of stakes, which researchers use to obtain measurements of summer mass balance (the difference between the amount of snow/ice before and after the melt season).
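
The abstract does not spell out the design criterion, so the following is only a rough Python sketch of the general idea: place stakes where a surrogate melt model is most sensitive to its uncertain tuning parameters while keeping good spatial coverage. The melt_model function, the squared-exponential kernel, the variance-weighted criterion and all numerical values are illustrative assumptions, not the method developed in the project.

```python
import numpy as np

rng = np.random.default_rng(1)

# Candidate stake locations on a grid over a unit-square "glacier" (toy stand-in).
g = np.linspace(0.0, 1.0, 15)
cand = np.array([(x, y) for x in g for y in g])

# Hypothetical computer model of summer melt; theta holds uncertain tuning parameters.
def melt_model(x, theta):
    a, b = theta
    return a * x[:, 0] + np.sin(b * np.pi * x[:, 1])

# Propagate tuning-parameter uncertainty: spread of model output across prior draws.
thetas = rng.uniform([0.5, 1.0], [1.5, 3.0], size=(200, 2))
runs = np.array([melt_model(cand, th) for th in thetas])
model_sd = runs.std(axis=0)              # where the model is most parameter-sensitive

def sq_exp(a, b, ls=0.2):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def weighted_ipv(design_idx):
    """Integrated GP predictive variance over candidates, weighted by model_sd."""
    D = cand[design_idx]
    K = sq_exp(D, D) + 1e-6 * np.eye(len(D))
    k = sq_exp(D, cand)
    var = 1.0 - np.einsum('ij,ij->j', k, np.linalg.solve(K, k))
    return np.sum(model_sd ** 2 * var)

# Greedy construction of a 10-stake network.
design = []
for _ in range(10):
    choices = [j for j in range(len(cand)) if j not in design]
    scores = [weighted_ipv(design + [j]) for j in choices]
    design.append(choices[int(np.argmin(scores))])

print("chosen stake locations:\n", cand[design])
```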

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Derek Bingham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Analysis of Data in Network and Natural Language Formats

Date created: 
2016-06-06
Abstract: 

The work herein describes a predictive model for cricket matches, a method of evaluating cricket players, and a method to infer properties of a network from a link-traced sample. In Chapter 2, player characteristics are estimated using a frequency count of the outcomes that occur when that player is batting or bowling. These characteristics are weighted against the relative propensity of each outcome in each of 200 game situations (10 wickets times 20 overs), and incorporate prior information using a Metropolis-Hastings algorithm. The characteristics of players in selected team rosters are then fed into a system we developed to simulate outcomes of whole games. The winning probabilities of each team are shown to perform similarly to competitive betting lines during the 2014 Cricket World Cup. In Chapter 3, the simulator is used to estimate the effect, in terms of expected number of runs, of each player. The effect of the player is reported as expected runs scored or allowed per innings above an average player in the same batting or bowling position. Chapter 4 proposes a method based on approximate Bayesian computation (ABC) to make inferences on hidden parameters of a network graph. Network inference using ABC is a very new field; this is the first work, to the author's knowledge, of ABC-based inference using only a sample of a network rather than the entire network. Summary statistics are taken from the sample of the network of interest, networks and samples are then simulated using hidden parameters from a prior distribution, and a posterior of the parameters is found by a kernel density estimate conditioned on the summary statistics. Chapter 5 describes an application of the method proposed in Chapter 4 to real data. A network of precedence citations between legal documents, centered around cases overseen by the Supreme Court of Canada, is observed. The features of certain cases that lead to their frequent citation are inferred, and their effects estimated by ABC. Future work and extensions are briefly discussed in Chapter 6.
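
As a rough illustration of the ABC scheme described for Chapter 4 (summary statistics from a network sample, networks and samples simulated from prior draws, and a kernel-weighted posterior), here is a minimal Python sketch. The Erdős–Rényi model, the node-induced sampling, the two summary statistics and the kernel bandwidth are toy stand-ins, not the thesis's network model or link-traced sampling design.

```python
import numpy as np

rng = np.random.default_rng(7)

def er_graph(n, p):
    """Adjacency matrix of an Erdos-Renyi graph (toy stand-in for the network model)."""
    A = (rng.random((n, n)) < p).astype(float)
    A = np.triu(A, 1)
    return A + A.T

def induced_sample(A, m):
    """Node-induced subgraph on m randomly chosen nodes (a simplified 'sample')."""
    idx = rng.choice(A.shape[0], size=m, replace=False)
    return A[np.ix_(idx, idx)]

def summaries(A):
    n = A.shape[0]
    density = A.sum() / (n * (n - 1))
    deg_sd = A.sum(axis=1).std()
    return np.array([density, deg_sd])

# "Observed" sample of a network whose hidden parameter we pretend not to know.
true_p = 0.08
s_obs = summaries(induced_sample(er_graph(200, true_p), 50))

# ABC: draw parameters from the prior, simulate network + sample, keep summaries.
draws = rng.uniform(0.01, 0.30, size=2000)
sims = np.array([summaries(induced_sample(er_graph(200, p), 50)) for p in draws])

# Kernel-weighted posterior: Gaussian kernel on the scaled summary distance.
scale = sims.std(axis=0)
dist = np.sqrt((((sims - s_obs) / scale) ** 2).sum(axis=1))
w = np.exp(-0.5 * (dist / 0.2) ** 2)

post_mean = np.sum(w * draws) / np.sum(w)
print(f"true p = {true_p:.3f}, ABC posterior mean = {post_mean:.3f}")
```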

Document type: 
Thesis
File(s): 
Supervisor(s): 
Steven Thompson
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

A goodness-of-fit test for semi-parametric copula models of right-censored bivariate survival times

Author: 
Date created: 
2016-06-09
Abstract: 

In multivariate survival analyses, understanding and quantifying the association between survival times is of importance. Copulas, such as Archimedean copulas and Gaussian copulas, provide a flexible approach to modeling and estimating the dependence structure among survival times separately from the marginal distributions (Sklar, 1959). However, misspecification of the parametric form of the copula function will directly lead to incorrect estimation of the joint distribution of the bivariate survival times and other model-based quantities. The objectives of this project are twofold. First, I reviewed the basic definitions and properties of commonly used survival copula models. In this project, I focused on semi-parametric copula models where the marginal distributions are unspecified but the copula function belongs to a parametric copula family. Various estimation procedures for the dependence parameter associated with the copula function were also reviewed. Second, I extended the pseudo in-and-out-of-sample (PIOS) likelihood ratio test proposed in Zhang et al. (2016) to testing semi-parametric copula models for right-censored bivariate survival times. The PIOS test is constructed by comparing two forms of pseudo likelihood: the "in-sample" pseudo likelihood, which is the full pseudo likelihood, and the "out-of-sample" pseudo likelihood, which is a cross-validated pseudo likelihood obtained by means of the jackknife. The finite-sample performance of the PIOS test was investigated via a simulation study. In addition, two real data examples were analyzed for illustrative purposes.
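
To make the structure of the PIOS statistic concrete, here is a minimal Python sketch for an uncensored sample under a Clayton copula with rank-based pseudo-observations. It ignores right-censoring, fixes the copula family, and does not compute the test's reference distribution, so it only illustrates the in-sample versus jackknife out-of-sample comparison, not the procedure developed in the project.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import rankdata

rng = np.random.default_rng(3)

def clayton_loglik(theta, u, v):
    """Clayton copula log-density summed over the sample."""
    s = u ** (-theta) + v ** (-theta) - 1.0
    return np.sum(np.log1p(theta) - (1 + theta) * (np.log(u) + np.log(v))
                  - (2 + 1 / theta) * np.log(s))

def fit_theta(u, v):
    res = minimize_scalar(lambda t: -clayton_loglik(t, u, v),
                          bounds=(0.01, 10.0), method="bounded")
    return res.x

# Simulate uncensored Clayton data via the conditional method, then take
# rank-based pseudo-observations (the nonparametric margins of the model).
n, theta0 = 200, 2.0
u = rng.random(n)
w = rng.random(n)
v = ((w ** (-theta0 / (1 + theta0)) - 1) * u ** (-theta0) + 1) ** (-1 / theta0)
pu, pv = rankdata(u) / (n + 1), rankdata(v) / (n + 1)

# In-sample pseudo log-likelihood: fit on all n points, evaluate on all n points.
theta_hat = fit_theta(pu, pv)
in_sample = clayton_loglik(theta_hat, pu, pv)

# Out-of-sample (jackknife) pseudo log-likelihood: for each i, fit without i
# and evaluate the left-out pair under that fit.
out_sample = 0.0
for i in range(n):
    keep = np.arange(n) != i
    th_i = fit_theta(pu[keep], pv[keep])
    out_sample += clayton_loglik(th_i, pu[i:i + 1], pv[i:i + 1])

pios = in_sample - out_sample
print(f"theta_hat = {theta_hat:.2f}, PIOS statistic = {pios:.2f}")
```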

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Qian (Michelle) Zhou
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

On Supervised and Unsupervised Discrimination

Date created: 
2016-07-29
Abstract: 

Discrimination is a supervised problem in statistics and machine learning that begins with data from a finite number of groups. The goal is to partition the data-space into some number of regions and assign a group to each region so that observations there are most likely to belong to the assigned group. The most popular tool for discrimination is called discriminant analysis. Unsupervised discrimination, commonly known as clustering, also begins with data from groups, but now we do not necessarily know how many groups there are, nor do we get to know which group each observation belongs to. Our goal when doing clustering is still to partition the data-space into regions and assign groups to those regions; however, we do not have any a priori information with which to assign these groups. Common tools for clustering include the k-means algorithm and model-based clustering using either the expectation maximization (EM) or classification expectation maximization (CEM) algorithm (of which k-means is a special case). Tools designed for clustering can also be used to do discrimination. We investigate this possibility, along with a method proposed by Yang (2013) for smoothing the transition between these problems. We use two simulations to investigate the performance of discriminant analysis and both versions of model-based clustering with various parameter settings across a range of datasets. These settings include using Yang's method for modifying clustering tools to handle discrimination. Results are presented along with recommendations for data analysis when doing discrimination or clustering. Specifically, we investigate what assumptions to make about the groups' sizes and shapes, as well as which method to use (discriminant analysis or the EM or CEM algorithm) and whether or not to apply Yang's pre-processing procedure.
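
A small illustrative comparison of the supervised route and the "clustering tools for discrimination" route, using scikit-learn on simulated data. The data-generating settings and the majority-vote mapping from fitted components to group labels are assumptions for illustration; they are not the project's simulation design or Yang's (2013) method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Two groups of different spread; a toy stand-in for the simulated datasets.
X, y = make_blobs(n_samples=1000, centers=2, cluster_std=[1.0, 2.5], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Supervised: linear discriminant analysis.
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
print(f"LDA accuracy:     {lda.score(X_te, y_te):.3f}")

# "Clustering tools for discrimination": fit a Gaussian mixture by EM, then map
# each fitted component to the training label it mostly contains.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_tr)
comp = gmm.predict(X_tr)
label_of = {c: np.bincount(y_tr[comp == c]).argmax() for c in range(2)}
pred = np.array([label_of[c] for c in gmm.predict(X_te)])
print(f"EM/GMM accuracy:  {(pred == y_te).mean():.3f}")

# k-means (a special case of CEM with spherical, equally weighted components).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_tr)
label_of = {c: np.bincount(y_tr[km.labels_ == c]).argmax() for c in range(2)}
pred = np.array([label_of[c] for c in km.predict(X_te)])
print(f"k-means accuracy: {(pred == y_te).mean():.3f}")
```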

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Tom Loughin
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Analysis of universal life insurance cash flows with stochastic asset models

Author: 
Date created: 
2016-06-02
Abstract: 

Universal life insurance is a flexible product which provides the policyholder with life insurance protection as well as savings build-up. The performance of the policy is difficult to evaluate accurately with deterministic asset models, especially when the fund is placed in accounts that track the performance of equities. This project aims to investigate factors that affect the savings (account value) and insurance coverage (death benefit) under a stochastic framework. Time series models are built to capture the complex dynamics of returns from two commonly offered investment options, T-bills and the S&P 500 index, with and without an interdependence assumption. Cash flows of account value, cost of insurance, and death benefit are projected for sample policies with common product features under multiple investment strategies. The comparison reveals the impact of asset models and fund allocation on the projected cash flows.
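
A minimal Python sketch of the kind of projection described: simulated monthly T-bill and equity returns feed a simplified universal life account-value recursion. The AR(1) and lognormal return models, the cross-correlation, the cost-of-insurance formula and every numerical input are illustrative assumptions, not the models fitted or the product features used in the project.

```python
import numpy as np

rng = np.random.default_rng(11)

# --- Illustrative asset models (monthly), not the project's fitted models ---
def simulate_returns(n_months, rho=0.3):
    # AR(1) for the T-bill yield; i.i.d. normal log-returns for the equity index,
    # with a simple cross-correlation between the two innovation series.
    z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n_months)
    tbill = np.empty(n_months)
    r = 0.002                       # starting monthly T-bill yield
    for t in range(n_months):
        r = 0.002 + 0.9 * (r - 0.002) + 0.0004 * z[t, 0]
        tbill[t] = max(r, 0.0)
    equity = np.exp(0.005 + 0.04 * z[:, 1]) - 1.0
    return tbill, equity

# --- Project one universal life policy path (simplified mechanics) ---
def project_policy(n_years=20, face=250_000, premium=300, alloc_equity=0.6,
                   coi_annual=0.004):
    months = 12 * n_years
    tbill, equity = simulate_returns(months)
    av, path = 0.0, []
    for t in range(months):
        credited = alloc_equity * equity[t] + (1 - alloc_equity) * tbill[t]
        coi = coi_annual / 12 * max(face - av, 0.0)   # cost of insurance on net amount at risk
        av = max((av + premium - coi) * (1 + credited), 0.0)
        path.append(av)
    return np.array(path)

paths = np.array([project_policy() for _ in range(500)])
print("median account value at year 20:", np.median(paths[:, -1]).round())
print("5th percentile at year 20:      ", np.percentile(paths[:, -1], 5).round())
```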

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Yi Lu
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Cricket Analytics

Date created: 
2015-12-16
Abstract: 

This thesis consists of a compilation of three research papers and a non-statistical essay. Chapter 2 considers the decision problem of when to declare during the third innings of a test cricket match. There are various factors that affect the decision of the declaring team including the target score, the number of overs remaining, the relative desire to win versus draw, and the scoring characteristics of the particular match. Decision rules are developed and these are assessed against historical matches. We observe that there are discrepancies between the optimal time to declare and what takes place in practice. Chapter 3 considers the determination of optimal team lineups in Twenty20 cricket where a lineup consists of three components: team selection, batting order and bowling order. Via match simulation, we estimate the expected runs scored minus the expected runs allowed for a given lineup. The lineup is then optimized over a vast combinatorial space via simulated annealing. We observe that the composition of an optimal Twenty20 lineup sometimes results in nontraditional roles for players. As a by-product of the methodology, we obtain an “all-star” lineup selected from international Twenty20 cricket. Chapter 4 is a first attempt to investigate the importance of fielding in cricket. We introduce the metric of expected runs saved due to fielding, which is both interpretable and directly relevant to winning matches. The metric is assigned to individual players and is based on a textual analysis of match commentaries using random forest methodology. We observe that the best fielders save on average 1.2 runs per match compared to a typical fielder. Chapter 5 is a non-statistical essay on two cricketing greats from Sri Lanka who established numerous world records and recently retired from the game. Though their record-breaking performances are now part of cricketing statistics, this chapter is not a contribution which adds to the statistical literature, and should not be regarded as a component of the thesis in terms of analytics.
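
The lineup optimization in Chapter 3 can be illustrated with a bare-bones simulated annealing loop over batting orders. The SKILL ratings and the expected_net_runs objective below are made-up stand-ins for the thesis's match simulator, and the move here is a simple position swap, whereas the thesis optimizes team selection, batting order and bowling order jointly.

```python
import math
import random

random.seed(42)

# Placeholder evaluator: the thesis uses a full match simulator to estimate
# expected runs scored minus allowed; here a made-up scoring function stands in.
SKILL = {p: random.uniform(20, 45) for p in range(11)}   # hypothetical player ratings

def expected_net_runs(order):
    # Earlier batting positions face more balls, so weight early slots more heavily.
    weights = [1.0 - 0.07 * i for i in range(11)]
    return sum(w * SKILL[p] for w, p in zip(weights, order))

def simulated_annealing(order, n_iter=20_000, t0=5.0, cooling=0.9995):
    current, best = list(order), list(order)
    f_cur = f_best = expected_net_runs(current)
    temp = t0
    for _ in range(n_iter):
        i, j = random.sample(range(11), 2)           # propose swapping two positions
        cand = current[:]
        cand[i], cand[j] = cand[j], cand[i]
        f_cand = expected_net_runs(cand)
        if f_cand > f_cur or random.random() < math.exp((f_cand - f_cur) / temp):
            current, f_cur = cand, f_cand
            if f_cur > f_best:
                best, f_best = current[:], f_cur
        temp *= cooling
    return best, f_best

best_order, value = simulated_annealing(list(range(11)))
print("best batting order:", best_order)
print("objective value:   ", round(value, 1))
```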

Document type: 
Thesis
File(s): 
Supervisor(s): 
Tim Swartz
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Bayesian profile regression with evaluation on simulated data

Author: 
Date created: 
2016-01-06
Abstract: 

Using regression analysis to make inferences from data sets that contain a large number of potentially correlated covariates can be difficult. Such large numbers of covariates have become more common in clinical observational studies due to the dramatic improvement in information-capturing technology for clinical databases. For instance, in disease diagnosis and treatment, obtaining a number of indicators of patients’ organ function is much easier than before, and these indicators can be highly correlated. We discuss Bayesian profile regression, an approach that deals with large numbers of correlated covariates of the binary type commonly recorded in clinical databases. Clusters of patients with similar covariate profiles are formed through the application of a Dirichlet process prior and then associated with outcomes via a regression model. Methods for evaluating the clustering and making inferences are then described. We use simulated data to compare the performance of Bayesian profile regression to the LASSO, a popular alternative for data sets with a large number of predictors. To make these comparisons, we apply the recently developed R package PReMiuM to fit the Bayesian profile regression.
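
Since PReMiuM is an R package, the following Python sketch only mimics the comparison at a high level: an L1-penalised (LASSO) logistic regression as the comparator, and a truncated Dirichlet-process Gaussian mixture as a very rough analogue of the profile-clustering step (profile regression models the binary profiles and the outcome jointly, which this does not). The simulated data and all settings are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(5)

# Simulate correlated binary covariates (e.g., organ-function indicators) and a
# binary outcome driven by a latent covariate profile.
n, p = 600, 20
latent = rng.integers(0, 3, size=n)                    # three hidden profiles
probs = rng.uniform(0.1, 0.9, size=(3, p))             # profile-specific covariate rates
X = (rng.random((n, p)) < probs[latent]).astype(float)
y = (rng.random(n) < np.array([0.15, 0.45, 0.75])[latent]).astype(int)

# Comparator: L1-penalised (LASSO) logistic regression with cross-validated penalty.
lasso = LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, cv=5,
                             max_iter=5000).fit(X, y)
print(f"LASSO in-sample accuracy: {lasso.score(X, y):.3f}")

# Rough analogue of the profile step: a truncated Dirichlet-process mixture
# clusters the covariate profiles; outcome risk is then summarised per cluster.
dpm = BayesianGaussianMixture(n_components=10,
                              weight_concentration_prior_type="dirichlet_process",
                              random_state=0).fit(X)
cluster = dpm.predict(X)
for c in np.unique(cluster):
    print(f"cluster {c}: size {np.sum(cluster == c):3d}, "
          f"outcome rate {y[cluster == c].mean():.2f}")
```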

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Jinko Graham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Data integration methods for studying animal population dynamics

Author: 
Date created: 
2015-12-22
Abstract: 

In this thesis, we develop new data integration methods to better understand animal population dynamics. In the first project, we study the problem of integrating aerial and access data from aerial-access creel surveys to estimate angling effort, catch and harvest. We propose new estimation methods, study their statistical properties theoretically and conduct a simulation study to compare their performance. We apply our methods to data from an annual Kootenay Lake (Canada) survey. In the second project, we present a new Bayesian modeling approach to integrate capture-recapture data with other sources of data without relying on the usual independence assumption. We use a simulation study to compare, under various scenarios, our approach with the usual approach of simply multiplying likelihoods. In the simulation study, the Monte Carlo RMSEs and expected posterior standard deviations obtained with our approach are always smaller than or equal to those obtained with the usual approach. Finally, we compare the performance of the two approaches using real data from a colony of greater horseshoe bats (Rhinolophus ferrumequinum) in the Valais, Switzerland. In the third project, we develop an explicit integrated population model to combine capture-recapture survey data, dead recovery survey data and snorkel survey data to better understand the movement from the ocean to spawning grounds of Chinook salmon (Oncorhynchus tshawytscha) on the West Coast of Vancouver Island, Canada. In addition to providing spawning escapement estimates, the model provides estimates of stream residence time and snorkel survey observer efficiency, which are crucial inputs to the area-under-the-curve method currently used to estimate escapement on the West Coast of Vancouver Island but are presently lacking.
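
The "usual approach of simply multiplying likelihoods" that the second project compares against can be illustrated with a toy abundance example: a capture-recapture likelihood and an independent count-survey likelihood are added on the log scale and explored with a random-walk Metropolis sampler. The model, prior and all numbers are invented for illustration; the thesis's dependence-aware alternative is not reproduced here.

```python
import numpy as np
from scipy.stats import hypergeom, poisson

rng = np.random.default_rng(9)

# Toy data: a two-sample capture-recapture study and an independent count survey.
n1, n2, m2 = 120, 150, 18        # marked, captured on occasion 2, recaptures
count, detect = 310, 0.35        # survey count and an assumed known detection rate

def log_post(N):
    """'Usual' integration: simply multiply (add on the log scale) the two likelihoods."""
    if N < max(n1, n2):
        return -np.inf
    lp_cr = hypergeom.logpmf(m2, N, n1, n2)        # capture-recapture likelihood
    lp_ct = poisson.logpmf(count, detect * N)      # count-survey likelihood
    lp_prior = -np.log(N)                          # vague 1/N prior on abundance
    return lp_cr + lp_ct + lp_prior

# Random-walk Metropolis over the integer abundance N.
N_cur, lp_cur, draws = 800, log_post(800), []
for _ in range(20_000):
    N_prop = N_cur + rng.integers(-25, 26)
    lp_prop = log_post(N_prop)
    if np.log(rng.random()) < lp_prop - lp_cur:
        N_cur, lp_cur = N_prop, lp_prop
    draws.append(N_cur)

draws = np.array(draws[5_000:])
print(f"posterior mean N = {draws.mean():.0f}, 95% interval = "
      f"({np.percentile(draws, 2.5):.0f}, {np.percentile(draws, 97.5):.0f})")
```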

Document type: 
Thesis
File(s): 
Supervisor(s): 
Richard Lockhart
Carl Schwarz
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Statistical Inference under Latent Class Models, with Application to Risk Assessment in Cancer Survivorship Studies

Author: 
Date created: 
2015-11-12
Abstract: 

Motivated by a cancer survivorship program, this PhD thesis aims to develop methodology for risk assessment, classification, and prediction. We formulate the primary data collected from a cohort with two underlying categories, the at-risk and not-at-risk classes, using latent class models, and we conduct both cross-sectional and longitudinal analyses. We begin with a maximum pseudo-likelihood estimator (pseudo-MLE) as an alternative to the maximum likelihood estimator (MLE) under a mixture Poisson distribution with event counts. The pseudo-MLE utilizes supplementary information on the not-at-risk class from a different population. It reduces the computational intensity and potentially increases the estimation efficiency. To obtain statistical methods that are more robust than likelihood-based methods to distribution misspecification, we adapt the well-established generalized estimating equations (GEE) approach under the mean-variance model corresponding to the mixture Poisson distribution. The inherent computing and efficiency issues in the application of GEEs motivate two sets of extended GEEs, using the primary data supplemented by information from the second population alone or together with the available information on individuals in the cohort who are deemed to belong to the at-risk class. We derive asymptotic properties of the proposed pseudo-MLE and the estimators from the extended GEEs, and we estimate their variances by extended Huber sandwich estimators. We use simulation to examine the finite-sample properties of the estimators in terms of both efficiency and robustness. The simulation studies verify the consistency of the proposed parameter estimators and their variance estimators. They also show that the pseudo-MLE has efficiency comparable to that of the MLE, and the extended GEE estimators are robust to distribution misspecification while maintaining satisfactory efficiency. Further, we present an extension of the favourable extended GEE estimator to longitudinal settings by adjusting for within-subject correlation. The proposed methodology is illustrated with physician claims from the cancer program. We fit different latent class models for the counts and costs of the physician visits by applying the proposed estimators. We use the parameter estimates to identify the risk of subsequent and ongoing problems arising from the subjects’ initial cancer diagnoses. We perform risk classification and prediction using the fitted latent class models.
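
A minimal sketch of the pseudo-MLE idea for a two-class mixture Poisson: the not-at-risk rate is fixed at an externally supplied estimate and the likelihood is maximised over the remaining parameters only. The simulated counts, the external estimate lam0_external and the parameterisation are illustrative assumptions; the thesis's actual models, GEE extensions and variance estimators are not reproduced.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

rng = np.random.default_rng(13)

# Simulate event counts from a two-class mixture Poisson:
# a not-at-risk class (low rate) and an at-risk class (high rate).
n, pi_true, lam0_true, lam1_true = 2000, 0.3, 0.5, 4.0
at_risk = rng.random(n) < pi_true
counts = rng.poisson(np.where(at_risk, lam1_true, lam0_true))

def poisson_logpmf(k, lam):
    return k * np.log(lam) - lam - gammaln(k + 1)

def neg_loglik(params, lam0):
    """Mixture Poisson log-likelihood with the not-at-risk rate lam0 held fixed."""
    pi, lam1 = expit(params[0]), np.exp(params[1])
    mix = pi * np.exp(poisson_logpmf(counts, lam1)) \
        + (1 - pi) * np.exp(poisson_logpmf(counts, lam0))
    return -np.sum(np.log(mix))

# Pseudo-MLE: plug in an externally estimated not-at-risk rate (supplementary
# information from a different population), then maximise over (pi, lam1) only.
lam0_external = 0.52                     # hypothetical external estimate
fit = minimize(neg_loglik, x0=[0.0, np.log(2.0)], args=(lam0_external,))
pi_hat, lam1_hat = expit(fit.x[0]), np.exp(fit.x[1])
print(f"pseudo-MLE: pi = {pi_hat:.3f} (true {pi_true}), "
      f"lambda_at_risk = {lam1_hat:.2f} (true {lam1_true})")
```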

Document type: 
Thesis
File(s): 
Supervisor(s): 
X. Joan Hu
John J. Spinelli
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Application of Relational Models in Mortality Immunization

Author: 
Date created: 
2015-07-29
Abstract: 

The prediction of future mortality rates by any existing mortality projection model is unlikely to be exact, which exposes life insurance companies to mortality and longevity risks. Since a change in mortality rates has opposite impacts on the surpluses of life insurance and annuity products, hedging strategies for mortality and longevity risks can be implemented by creating an insurance portfolio of both life insurance and annuity products. In this project, we develop a framework for implementing non-size-free matching strategies to hedge against mortality and longevity risks. We apply relational models to capture the mortality movements by assuming that the simulated mortality sequence is a proportional and/or a constant change of the expected one, and that the amount of the change varies with the length of the sequence. With the magnitude of the proportional and/or constant changes, we determine the optimal weights for allocating the life insurance and annuity products in a portfolio for mortality immunization according to each of the proposed matching strategies. Comparing the hedging performance of non-size-free matching strategies with the size-free ones proposed by Lin and Tsai (2014), we demonstrate that non-size-free matching strategies can hedge against mortality and longevity risks more effectively than the corresponding size-free ones.
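
As a rough numerical illustration of mortality immunization under a proportional mortality change, the sketch below shocks a toy Gompertz mortality curve, revalues a unit whole-life insurance and a unit life annuity, and picks the annuity weight that minimises the variance of the combined change in value. The mortality rates, discount rate and variance-minimising criterion are assumptions for illustration and do not reproduce the relational models or the matching strategies of Lin and Tsai (2014).

```python
import numpy as np

rng = np.random.default_rng(17)

# Toy Gompertz mortality for a cohort aged 65 (illustrative, not fitted rates).
ages = np.arange(65, 111)
qx = np.minimum(0.0005 * np.exp(0.1 * (ages - 30)), 1.0)
v = 1.0 / 1.03                                  # discount factor at 3%

def pv_life_and_annuity(q):
    """Present values of a unit whole-life insurance and a unit life annuity-due."""
    p = np.concatenate(([1.0], np.cumprod(1 - q)[:-1]))   # survival to start of each year
    t = np.arange(len(q))
    pv_ins = np.sum(v ** (t + 1) * p * q)                  # death benefit paid at year end
    pv_ann = np.sum(v ** t * p)                            # annuity-due payments while alive
    return pv_ins, pv_ann

base_ins, base_ann = pv_life_and_annuity(qx)

# Simulate proportional mortality shocks q -> q * (1 + eps), in the spirit of a
# relational model with a proportional change, and record each liability's change.
eps = rng.normal(0.0, 0.10, size=5000)
d_ins, d_ann = [], []
for e in eps:
    ins, ann = pv_life_and_annuity(np.minimum(qx * (1 + e), 1.0))
    d_ins.append(ins - base_ins)
    d_ann.append(ann - base_ann)
d_ins, d_ann = np.array(d_ins), np.array(d_ann)

# Annuity weight per unit of life insurance that minimises the variance of the
# combined change in portfolio value.
C = np.cov(d_ins, d_ann)
w_star = -C[0, 1] / C[1, 1]
combined = d_ins + w_star * d_ann
print(f"optimal annuity weight = {w_star:.3f}")
print(f"sd of change: insurance only {d_ins.std():.4f}, hedged {combined.std():.4f}")
```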

Document type: 
Thesis
File(s): 
Supervisor(s): 
Cary Tsai
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) M.Sc.