Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays


Predicting ovarian cancer survival times: Feature selection and performance of parametric, semi-parametric, and random survival forest methods

Author: 
Date created: 
2019-04-23
Abstract: 

Survival time predictions have far-reaching implications. For example, such predictions can be influential in constructing a personalized treatment plan that is of benefit to both physicians and patients. Advantages include planning the best course of treatment considering the allocation of health care services and resources, as well as the patient's overall health or personal wishes. Predictions also play an important role in providing realistic expectations and subsequently managing quality of life for the patient's residual lifetime. Unfortunately, survival data can be highly variable, making precise predictions difficult or impossible. This project explores methods of predicting time to death for ovarian cancer patients. The dataset consists of a multitude of predictors, including some that may be unimportant. The performances of various prediction methods that allow for feature selection (the Weibull model, Cox proportional hazards model, and the random survival forest) are evaluated. Prediction errors are assessed using Harrell's concordance index and a version of the expected integrated Brier score. We find that the Weibull and Cox models provide the best predictions of survival distributions in this context. Moreover, we are able to identify subsets of predictors that lead to reduced prediction error and are clinically meaningful.
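For readers unfamiliar with Harrell's concordance index, the following is a minimal sketch of how it can be computed for right-censored data. It is not taken from the project: the function name, argument names, and the handling of ties (pairs with tied observed times are skipped, tied risk scores count one half) are assumptions made for illustration.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Illustrative Harrell's C for right-censored data (hypothetical helper).

    times       -- observed follow-up times
    events      -- 1 if death was observed, 0 if the time is censored
    risk_scores -- higher score means higher predicted risk (shorter survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification: skip tied times. A pair is comparable only when
        # the shorter time is an observed death, not a censoring time.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # earlier death had the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1 indicates that predicted risks order every comparable pair correctly, and 0.5 corresponds to random ordering.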

Document type: 
Graduating extended essay / Research project
Supervisor(s): 
Rachel Altman
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

An efficient statistical method of detecting introgressive events from big genomic data

Author: 
Date created: 
2019-04-09
Abstract: 

Introgressive hybridization, also called introgression, is the gene flow from one species to another due to mating between species. The genetic signals of introgression are not always obvious. Current methods of detecting introgressive events rely on the analysis of orthologous markers, and therefore do not consider gene duplication and gene loss. Since introgression leaves a phylogenetic signal similar to horizontal gene transfer, introgression events can be detected under a gene tree-species tree reconciliation framework, which simultaneously accounts for evolutionary mechanisms including gene duplication, gene loss, and gene transfer. In this work, the reconciliation-based method is applied to a large dataset of Anopheles mosquito genomes. We recover extensive introgression in the gambiae complex, a group of African mosquitoes, although with some variations compared to previous reports. Our results also suggest a possible ancient introgression between the Asian and African mosquitoes.

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Liangliang Wang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Fast emulation and calibration of large computer experiments with multivariate output

Author: 
Date created: 
2019-04-17
Abstract: 

Scientific investigations are often expensive and the ability to quickly perform analysis of data on-location at experimental facilities can save valuable resources. Further, computer models that leverage scientific knowledge can be used to gain insight into complex processes and reduce the need for costly physical experiments, but in turn may be computationally expensive to run. We compare multiple statistical surrogates or emulators based on Gaussian processes for expensive computer models, with the goal of producing predictions quickly given large training sets. We then present a modularised approach for finding the values of inputs that allow for the surrogate model to match reality, or field observations. This process is model calibration. We then extend the emulator of choice and calibration procedure for use with multivariate response and demonstrate the speed and efficacy of such emulators on datasets from a series of transmission impact experiments.

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Derek Bingham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Rao-Blackwellizing field-goal percentage

Date created: 
2019-03-29
Abstract: 

Shooting skill in the NBA is typically measured by field goal percentage (FG%) - the number of makes out of the total number of shots. Even more advanced metrics like true shooting percentage are calculated by counting each player’s 2-point, 3-point, and free throw makes and misses, ignoring the spatiotemporal data now available (Kubatko et al. 2007). In this paper we aim to better characterize player shooting skill by introducing a new estimator based on post-shot release shot-make probabilities. Via the Rao-Blackwell theorem, we propose a shot-make probability model that conditions probability estimates on shot trajectory information, thereby reducing the variance of the new estimator relative to standard FG%. We obtain shooting information by using optical tracking data to estimate three factors for each shot: entry angle, shot depth, and left-right accuracy. Next, we use these factors to model shot-make probabilities for all shots in the 2014-15 season, and use these probabilities to produce a Rao-Blackwellized FG% estimator (RB-FG%) for each player. We present a variety of results derived from this shot trajectory data, as well as demonstrate that RB-FG% is better than raw FG% at predicting 3-point shooting and true-shooting percentages. Overall, we find that conditioning shot-make probabilities on spatial trajectory information stabilizes inference of FG%, creating the potential to estimate shooting statistics and related metrics earlier in a season than was previously possible.
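One way to read the variance-reduction claim is through the law of total variance. The sketch below is illustrative only and is not taken from the paper: the symbols $Y_i$ (make indicator for shot $i$), $X_i$ (trajectory information for shot $i$), and $p_i$ are notation introduced here, and it assumes the conditional make probabilities are known exactly, whereas in practice they are estimated.

```latex
% Assumed notation: Y_i \in \{0,1\} is the outcome of shot i,
% X_i is its trajectory information, and p_i = E[Y_i \mid X_i].
\[
  \widehat{\mathrm{FG\%}} = \frac{1}{n}\sum_{i=1}^{n} Y_i,
  \qquad
  \widehat{\mathrm{RB\text{-}FG\%}} = \frac{1}{n}\sum_{i=1}^{n} p_i .
\]
% Both have the same expectation, since E[p_i] = E[Y_i].
% By the law of total variance, for each shot:
\[
  \operatorname{Var}(Y_i)
  = \operatorname{Var}\bigl(\mathrm{E}[Y_i \mid X_i]\bigr)
  + \mathrm{E}\bigl[\operatorname{Var}(Y_i \mid X_i)\bigr]
  = \operatorname{Var}(p_i) + \mathrm{E}\bigl[p_i(1 - p_i)\bigr]
  \;\ge\; \operatorname{Var}(p_i).
\]
```

Under these assumptions, averaging the conditional probabilities removes the per-shot Bernoulli noise term $\mathrm{E}[p_i(1 - p_i)]$, which is consistent with the abstract's statement that conditioning on trajectory information stabilizes inference of FG%.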

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Luke Bornn
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Unsupervised learning on functional data with an application to the analysis of U.S. temperature prediction accuracy

Author: 
Date created: 
2019-02-07
Abstract: 

Unsupervised learning techniques are widely applied in exploratory analysis to motivate further analysis. In functional data analysis, two typical topics of unsupervised learning are functional principal component analysis and functional data clustering. In this study, besides reviewing existing unsupervised learning techniques, we extend the unsupervised random forest clustering method to functional data and assess its strengths and weaknesses through comparisons with other clustering methods in simulation studies. Finally, both the proposed method and the existing unsupervised learning techniques are applied to real data: an analysis of the accuracy of U.S. temperature predictions from 2014 to 2017.

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Jiguo Cao
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Development of functional principal components analysis and estimating the time-varying gene regulation network

Author: 
Date created: 
2018-09-27
Abstract: 

Functional data analysis (FDA) addresses the analysis of information on curves or functions. Examples of such curves or functions include time-course gene expression measurements, electroencephalography (EEG) data monitoring brain activity, the emission rate of automobiles after acceleration, and children's growth curves of body fat percentage measured over a growth period. The primary interests in the underlying curves or functions vary across fields. In this thesis, new methodology for constructing a time-varying network based on functional observations is proposed. Several variations of Functional Principal Component Analysis (FPCA) are developed in the context of the functional regression model. Lastly, new uses of FPCA are explored in terms of recovering trajectory functions and estimating derivatives.

Document type: 
Thesis
File(s): 
Supervisor(s): 
Jiguo Cao
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Construction of orthogonal designs and baseline designs

Date created: 
2018-07-23
Abstract: 

In this thesis, we study the construction of designs for computer experiments and for screening experiments. We consider the existence and construction of orthogonal designs, which are a useful class of designs for computer experiments. We first establish a non-existence result on orthogonal designs, generalizing an early result on orthogonal Latin hypercubes, and then present some construction results. By computer search, we obtain a collection of orthogonal designs with small run sizes. Using these results and existing methods in the literature, we create a comprehensive catalogue of orthogonal designs for up to 100 runs. In the rest of the thesis, we study designs for screening experiments. We propose two classes of compromise designs for estimation of main effects using two-level fractional factorial designs under baseline parameterization. Previous work in the area indicates that orthogonal arrays are more efficient than one-factor-at-a-time designs whereas the latter are better than the former in terms of minimizing the bias due to non-negligible interactions. Using efficiency criteria, we examine a class of compromise designs, which are obtained by adding runs to one-factor-at-a-time designs. A theoretical result is established for the case of adding one run. For adding two or more runs, we develop a complete search algorithm for finding optimal compromise designs. We also investigate another class of compromise designs, which are constructed from orthogonal arrays by changing some ones to zeros in design matrices. We then use a method of complete search for small run sizes to obtain optimal compromise designs. When the complete search is not feasible, we propose an efficient, though incomplete, search algorithm.

Document type: 
Thesis
File(s): 
Supervisor(s): 
Boxin Tang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Quantifying inter-generational equity under different target benefit plan designs

Author: 
Peer reviewed: 
No, item is not peer reviewed.
Date created: 
2018-06-20
Abstract: 

In this research, we investigate the value of inter-generational transfers under various target benefit plan designs. The contingent retirement benefits are decomposed into embedded options, and the risk-adjusted values of these options are calculated and compared across generations. For this purpose, an economic scenario generator is implemented: the economic variables’ dynamics are generated by a model that combines the first-order vector autoregressive model and the generalized autoregressive conditional heteroscedasticity process. A corresponding risk-neutral model is derived and estimated using the prices of financial assets; the latter is used to price the embedded options. We study four target benefit plans with different design elements. We find that inter-generational value transfers arise simply from joining the collective pension scheme, even without the inclusion of any intertemporal benefit smoothing designs. Without an additional source of funding, we show that benefit security and stability can be achieved by adopting plan designs that allow temporary inter-generational subsidization, e.g., plan designs with a no-action range. We show that adding a symmetric no-action range can reduce the volatility of retirement benefits without triggering significant value transfers, at least under the assumption of a stationary demographic profile and when the simulation of economic scenarios starts from its long-term equilibrium level.

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Barbara Sanders
Jean-François Bégin
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Statistical inference using large administrative data on multiple event times, with application to cancer survivorship research

Author: 
Date created: 
2018-12-20
Abstract: 

Motivated by the breast cancer survivorship research program at BC Cancer Agency, this dissertation develops statistical approaches to analyzing right-censored multivariate event time data. Following the background and motivation of the research, we introduce the framework of the dissertation, and provide a literature review and a list of the research questions. A description of the motivating study data is then given together with a preliminary analysis before presenting the proposed approaches as follows. We first consider estimation of the joint survivor function of multiple event times when the observations are subject to informative censoring due to a terminating event. We formulate the potential dependence of the multiple event times on the time to the terminating event using Archimedean copulas. This may account for the informative censoring and, at the same time, allows us to adapt the commonly used two-step procedure for estimating the joint distribution of the multiple event times under a copula model. We propose an easy-to-implement pseudo-likelihood based estimation procedure under the model, which reduces computational intensity compared to its MLE counterpart. A more flexible approach is then proposed for handling informative censoring, with particular attention to observations on a bivariate event time potentially censored by a terminating event. We formulate the correlation of the bivariate event time with the censoring time by embedding the bivariate event time distribution in a bivariate copula model. This yields the convenience of inference under the conventional copula model. At the same time, the proposed model is more flexible, and thus potentially more appropriate in many practical situations than modeling the event times and the associated censoring time jointly by a single multivariate copula.
Adapting the commonly used two-stage estimation procedure under a copula model, we develop an easy-to-implement estimator for the joint survivor function of the two event times. A by-product of the proposed approaches is an estimator for the marginal distribution of a single event time with semicompeting-risks data. Further, we extend the approach to regression settings to explore covariate effects in either parametric or nonparametric forms. In particular, adjusting for some covariates, we compare two populations based on an event time with observations subject to informative censoring. We conduct both asymptotic and simulation studies to examine the consistency, efficiency, and robustness of the proposed approaches. The breast cancer program that motivated this research is employed to illustrate the methodological development throughout the dissertation.

Document type: 
Thesis
File(s): 
Supervisor(s): 
X. Joan Hu
John J. Spinelli
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Masquerade detection: A topic model based approach

Date created: 
2018-12-19
Abstract: 

The goal of masquerade detection is to "detect" when an intruder has infiltrated a computer system by looking for evidence of malicious behaviour. In this project, I use a topic model based intrusion detection system to search for intruders within the SEA and Greenberg datasets of Unix computer commands. Using LDA topic modeling, I was able to find, for each user, a probability distribution of the topics both over a block of commands and over each command. Using these two probability distributions and building on previous detection techniques, I was able to create five different detection techniques. I describe how I created the five LDA based models and combined them to obtain a sixth model. All of these techniques performed on par with their non-LDA counterparts. Therefore, combined with the reduction in dimensionality afforded by the LDA topic model, I conclude that my methods perform better than the current models.

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Derek Bingham
David Campbell
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.