Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays


Covariance-adjusted, sparse, reduced-rank regression with application to imaging-genetics data

Author: 
Date created: 
2019-05-31
Abstract: 

Alzheimer's disease (AD) is among the most challenging diseases in the world, and it is crucial for researchers to explore the relationship between AD and genes. In this project, we analyze data from 179 cognitively normal individuals that contain magnetic resonance imaging measures in 56 brain regions of interest and alternate allele counts of 510 single nucleotide polymorphisms (SNPs) obtained from 33 candidate genes for AD, provided by the AD Neuroimaging Initiative (ADNI). Our objectives are to explore the data structure and prioritize interesting SNPs. Standard linear regression models are inappropriate in this research context because they cannot account for sparsity in the SNP effects or the spatial correlations between brain regions. Thus, we review and apply the method of covariance-adjusted, sparse, reduced-rank regression (Cov-SRRR), which simultaneously performs variable selection and covariance estimation, to the data of interest. In our findings, SNP rs16871157 has the highest variable importance probability (VIP) in bootstrapping. Also, the estimated coefficients corresponding to the thickness measures of the temporal lobe area have the largest absolute values and are negative, which is consistent with current AD research.
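The bootstrap variable importance probability (VIP) can be sketched independently of Cov-SRRR. The sketch below substitutes an ordinary lasso for the full Cov-SRRR fit and uses simulated stand-in data (all sizes, names, and effects are illustrative, not the ADNI data): VIP is the fraction of bootstrap resamples in which a SNP receives a nonzero coefficient.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Toy stand-in: n subjects, p SNP allele counts (0/1/2), one imaging
# outcome driven by the first two SNPs only.
n, p = 179, 50
X = rng.integers(0, 3, size=(n, p)).astype(float)
y = 0.8 * X[:, 0] - 0.6 * X[:, 1] + rng.normal(scale=0.5, size=n)

def vip(X, y, n_boot=200, alpha=0.1):
    """Variable importance probability: the fraction of bootstrap
    resamples in which each predictor gets a nonzero coefficient."""
    n = X.shape[0]
    hits = np.zeros(X.shape[1])
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                 # resample subjects
        coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
        hits += coef != 0
    return hits / n_boot

scores = vip(X, y)
```

With this setup the two causal SNPs should receive VIPs near 1, while noise SNPs are selected far less often.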

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Jinko Graham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Selecting baseline two-level designs using optimality and aberration criteria when some two-factor interactions are important

Author: 
Date created: 
2019-06-14
Abstract: 

The baseline parameterization is less commonly used in factorial designs than the orthogonal parameterization. However, the former is more natural than the latter when there is a default or preferred setting for each factor in an experiment. Existing methods select optimal baseline designs for estimating a main-effects model. In this project, we consider the selection of optimal baseline designs when estimates of both main effects and some two-factor interactions are wanted. Any other potentially active effect causes bias in the estimation of the important effects. To minimize the contamination from these potentially active effects, we propose a new minimum aberration criterion. Moreover, an optimality criterion is used to minimize the variances of the estimates. Finally, we develop a search algorithm for selecting optimal baseline designs based on these criteria and present some optimal designs of 16 and 20 runs for models with up to three important two-factor interactions.
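The difference between the two parameterizations can be seen in the model matrices of a small design; a minimal sketch, using the 2^2 full factorial with one two-factor interaction (the specific coding conventions here are the standard ones, not necessarily those of this project):

```python
import numpy as np

# Full 2^2 factorial.  Under the baseline parameterization the default
# level is coded 0 and the alternative 1, so effects are contrasts
# against the baseline run; the orthogonal parameterization codes the
# levels as -1/+1.
levels = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

def model_matrix(z, interactions=((0, 1),)):
    """Intercept, main-effect, and two-factor-interaction columns
    for a two-level design, given factor-level columns z."""
    cols = [np.ones(len(z))] + [z[:, j] for j in range(z.shape[1])]
    cols += [z[:, a] * z[:, b] for a, b in interactions]
    return np.column_stack(cols)

X_base = model_matrix(levels)             # baseline (0/1) coding
X_orth = model_matrix(2 * levels - 1)     # orthogonal (-1/+1) coding

G = X_orth.T @ X_orth                     # diagonal: columns orthogonal
```

Orthogonal-coding columns are mutually orthogonal (here G = 4I), whereas baseline-coding columns are not, which is why variance and aberration criteria must be re-derived for baseline designs.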

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Boxin Tang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Joint modeling of longitudinal and time-to-event data with the application to kidney transplant data

Author: 
Date created: 
2018-12-13
Abstract: 

This thesis develops novel statistical methodology to address problems arising in kidney transplantation. First, we use functional principal component analysis (FPCA) through conditional expectation to explore the major sources of variation in GFR curves. The estimated FPC scores can be used to cluster GFR curves, and ordering the FPC scores can detect abnormal curves; FPCA also effectively estimates missing GFR values and predicts future GFR values. Second, we propose new joint models with mixed-effects and accelerated failure time (AFT) submodels, in which a piecewise linear function is used to calculate the non-proportional dynamic hazard ratio curve of a time-dependent side event. The finite-sample performance of the proposed method is investigated in simulation studies, and the method is demonstrated by fitting the joint model to clinical kidney data. Third, we develop a joint model combining FPCA with a multi-state model to fit longitudinal and multiple time-to-event outcomes together. FPCA efficiently reduces the dimension of the longitudinal trajectories, while the multi-state submodel describes the dynamic process of multiple time-to-event outcomes. The relationships between the longitudinal and time-to-event outcomes can be assessed through the shared latent features. In the application example, the latent FPC scores are significantly related to the time-to-event outcomes, and the Cox model may be biased for multiple time-to-event outcomes compared with the multi-state model. Fourth, we develop a flexible class of joint models with generalized linear latent variables for multivariate responses, built on underlying Gaussian latent processes. The model accommodates any mixture of outcomes from the exponential family. A Monte Carlo EM algorithm is proposed for estimating the parameters and the variance components of the latent processes. We demonstrate this methodology with kidney transplant studies.
Finally, in many social and health studies, measurements of some covariates are available only at the level of groups of subjects rather than for individuals. Such measures are referred to as aggregate average exposures. Current methods fail to evaluate high-order or nonlinear effects of aggregated exposures, so we develop a nonparametric method based on local linear fitting to overcome this difficulty. We demonstrate this methodology with kidney transplant studies.
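The local linear fitting used for nonlinear exposure effects can be illustrated in a minimal form. This sketch uses a toy one-dimensional example rather than the kidney transplant data; the Gaussian kernel and bandwidth are illustrative choices:

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear estimate of E[y | x = x0]: weighted least squares
    of y on (1, x - x0) with Gaussian kernel weights of bandwidth h."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    X = np.column_stack([np.ones_like(x), x - x0])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0]                      # intercept = fitted value at x0

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 400)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=400)  # nonlinear truth

fit = np.array([local_linear(x0, x, y, h=0.05) for x0 in (0.25, 0.5, 0.75)])
```

Unlike a global linear fit, the local fit tracks the nonlinear curve, which is the property the proposed method exploits for aggregated exposures.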

Document type: 
Thesis
File(s): 
Supervisor(s): 
Jiguo Cao
Liangliang Wang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Multidimensional scaling for phylogenetics

Author: 
Date created: 
2019-04-11
Abstract: 

We study a novel approach to determining the phylogenetic tree based on multidimensional scaling and the Euclidean Steiner minimum tree. A pairwise sequence alignment method is implemented to align the objects, such as DNA sequences, and evolutionary models are then applied to obtain the estimated distance matrix. Given the distance matrix, multidimensional scaling is widely used to reconstruct a map giving the coordinates of the data points in a lower-dimensional space while preserving the distances. We employ both classical multidimensional scaling and Bayesian multidimensional scaling on the distance matrix to obtain the coordinates of the objects. Based on the coordinates, the Euclidean Steiner minimum tree can be constructed and serve as a candidate for the phylogenetic tree. Results from a simulation study indicate that using the Euclidean Steiner minimum tree as a phylogenetic tree is feasible.
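The classical multidimensional scaling step can be sketched directly: double-centre the squared distance matrix and use the leading eigenpairs as coordinates. A minimal example on exact Euclidean distances, where the configuration is recovered up to rotation:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: double-centre the squared distance
    matrix and take the top-k eigenpairs as coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J            # Gram matrix of the configuration
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]     # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))

# Points in the plane: their Euclidean distances should be reproduced.
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
X = classical_mds(D, k=2)
D_hat = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
```

For distance matrices estimated from sequences the fit is only approximate, which motivates the Bayesian variant also considered in the project.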

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Liangliang Wang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Predicting ovarian cancer survival times: Feature selection and performance of parametric, semi-parametric, and random survival forest methods

Author: 
Date created: 
2019-04-23
Abstract: 

Survival time predictions have far-reaching implications. For example, such predictions can be influential in constructing a personalized treatment plan that is of benefit to both physicians and patients. Advantages include planning the best course of treatment considering the allocation of health care services and resources, as well as the patient's overall health or personal wishes. Predictions also play an important role in providing realistic expectations and subsequently managing quality of life for the patient's residual lifetime. Unfortunately, survival data can be highly variable, making precise predictions difficult or impossible. This project explores methods of predicting time to death for ovarian cancer patients. The dataset consists of a multitude of predictors, including some that may be unimportant. The performances of various prediction methods that allow for feature selection (the Weibull model, Cox proportional hazards model, and the random survival forest) are evaluated. Prediction errors are assessed using Harrell's concordance index and a version of the expected integrated Brier score. We find that the Weibull and Cox models provide the best predictions of survival distributions in this context. Moreover, we are able to identify subsets of predictors that lead to reduced prediction error and are clinically meaningful.
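Harrell's concordance index can be computed with a simple pairwise count; a minimal O(n^2) sketch on toy data (the times, events, and risk scores below are illustrative):

```python
import numpy as np

def concordance_index(time, event, risk):
    """Harrell's C: among usable pairs (the earlier time is an observed
    event), the proportion in which the shorter-lived subject has the
    higher risk score; ties in risk count 1/2."""
    num = den = 0.0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if event[i] == 1 and time[i] < time[j]:   # usable pair
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den

time = np.array([2.0, 4.0, 5.0, 7.0])
event = np.array([1, 1, 0, 1])          # third subject is censored
risk = np.array([4.0, 3.0, 2.0, 1.0])   # perfectly anti-ordered with time
c = concordance_index(time, event, risk)
```

A perfectly concordant risk ordering gives C = 1, a reversed ordering gives C = 0, and a non-informative score gives C near 0.5.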

Document type: 
Graduating extended essay / Research project
Supervisor(s): 
Rachel Altman
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

An efficient statistical method of detecting introgressive events from big genomic data

Author: 
Date created: 
2019-04-09
Abstract: 

Introgressive hybridization, also called introgression, is the flow of genes from one species to another due to mating between species. The genetic signals of introgression are not always readily observed. Current methods of detecting introgressive events rely on the analysis of orthologous markers and therefore do not consider gene duplication and gene loss. Since introgression leaves a phylogenetic signal similar to horizontal gene transfer, introgression events can be detected under a gene tree-species tree reconciliation framework, which simultaneously accounts for evolutionary mechanisms including gene duplication, gene loss, and gene transfer. In this work, we apply the reconciliation-based method to a large dataset of Anopheles mosquito genomes. We recover extensive introgression within the gambiae complex, a group of African mosquitoes, although with some variations compared to previous reports. Our results also suggest a possible ancient introgression between the Asian and African mosquitoes.
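The reconciliation framework itself is too involved for a short example, but the orthologous-marker style of test it is contrasted with is often summarized by Patterson's D-statistic (the ABBA-BABA test) — to be clear, this is not the method of this project. A toy sketch with synthetic 0/1 allele patterns:

```python
import numpy as np

def d_statistic(p1, p2, p3, out):
    """Patterson's D (ABBA-BABA): an excess of ABBA over BABA site
    patterns suggests gene flow involving P3 and P2 (D > 0) or P3 and
    P1 (D < 0).  Inputs are 0/1 allele arrays, one entry per site,
    with the outgroup carrying the ancestral allele."""
    abba = np.sum((p1 == out) & (p2 == p3) & (p2 != out))
    baba = np.sum((p2 == out) & (p1 == p3) & (p1 != out))
    return (abba - baba) / (abba + baba)

rng = np.random.default_rng(2)
n = 1000
out = np.zeros(n, dtype=int)
p3 = rng.integers(0, 2, n)
# P2 shares P3's allele at 30% of sites (simulated introgression); P1 does not.
share = rng.random(n) < 0.3
p2 = np.where(share, p3, rng.integers(0, 2, n))
p1 = rng.integers(0, 2, n)
D = d_statistic(p1, p2, p3, out)
```

Because the simulated gene flow links P2 and P3, the ABBA pattern is enriched and D comes out positive.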

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Liangliang Wang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Fast emulation and calibration of large computer experiments with multivariate output

Author: 
Date created: 
2019-04-17
Abstract: 

Scientific investigations are often expensive, and the ability to quickly analyze data on location at experimental facilities can save valuable resources. Further, computer models that leverage scientific knowledge can be used to gain insight into complex processes and reduce the need for costly physical experiments, but may in turn be computationally expensive to run. We compare multiple statistical surrogates, or emulators, based on Gaussian processes for expensive computer models, with the goal of producing predictions quickly given large training sets. We then present a modularised approach for finding the values of inputs that allow the surrogate model to match reality, or field observations; this process is known as model calibration. We then extend the chosen emulator and calibration procedure to multivariate responses and demonstrate the speed and efficacy of such emulators on datasets from a series of transmission impact experiments.
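The posterior-mean prediction of a Gaussian-process emulator can be sketched in a few lines. This minimal version assumes a zero-mean GP with a squared-exponential kernel and fixed, illustrative hyperparameters; it is a generic sketch, not one of the emulators compared in the project:

```python
import numpy as np

def sqexp(A, B, ls=0.2, var=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = A[:, None] - B[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def gp_predict(x_train, y_train, x_new, noise=1e-6):
    """Posterior mean of a zero-mean GP emulator at new inputs."""
    K = sqexp(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = sqexp(x_new, x_train)
    return Ks @ np.linalg.solve(K, y_train)

# Emulate an 'expensive' simulator from 15 training runs.
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train)       # stand-in simulator output
x_new = np.array([0.33, 0.71])
y_hat = gp_predict(x_train, y_train, x_new)
```

The training-set size drives the cubic cost of the linear solve, which is exactly the bottleneck the fast emulators in this project are designed to avoid.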

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Derek Bingham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Rao-Blackwellizing field-goal percentage

Date created: 
2019-03-29
Abstract: 

Shooting skill in the NBA is typically measured by field goal percentage (FG%) - the number of makes out of the total number of shots. Even more advanced metrics like true shooting percentage are calculated by counting each player’s 2-point, 3-point, and free throw makes and misses, ignoring the spatiotemporal data now available (Kubatko et al. 2007). In this paper we aim to better characterize player shooting skill by introducing a new estimator based on post-shot release shot-make probabilities. Via the Rao-Blackwell theorem, we propose a shot-make probability model that conditions probability estimates on shot trajectory information, thereby reducing the variance of the new estimator relative to standard FG%. We obtain shooting information by using optical tracking data to estimate three factors for each shot: entry angle, shot depth, and left-right accuracy. Next, we use these factors to model shot-make probabilities for all shots in the 2014-15 season, and use these probabilities to produce a Rao-Blackwellized FG% estimator (RB-FG%) for each player. We present a variety of results derived from this shot trajectory data, as well as demonstrate that RB-FG% is better than raw FG% at predicting 3-point shooting and true-shooting percentages. Overall, we find that conditioning shot-make probabilities on spatial trajectory information stabilizes inference of FG%, creating the potential to estimate shooting statistics and related metrics earlier in a season than was previously possible.
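The Rao-Blackwellization idea can be sketched with simulated shot data: instead of averaging the 0/1 make outcomes (raw FG%), average the trajectory-conditional make probabilities. All probabilities below are synthetic stand-ins, not estimates from the optical tracking model:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-shot make probabilities, as if inferred from
# trajectory features (entry angle, depth, left-right accuracy).
n_shots = 100
p_make = rng.uniform(0.2, 0.8, n_shots)    # trajectory-based P(make)
makes = rng.random(n_shots) < p_make       # observed 0/1 outcomes

raw_fg = makes.mean()    # standard FG%: average of noisy outcomes
rb_fg = p_make.mean()    # RB-FG%: average of conditional probabilities

# Variance comparison: resampling outcomes given fixed trajectories
# shows the raw estimator's extra Bernoulli noise, which conditioning
# on the trajectory removes.
sims = rng.random((2000, n_shots)) < p_make
var_raw = sims.mean(axis=1).var()
```

Both estimators target the same expected make rate, but RB-FG% inherits none of the outcome noise, which is why it stabilizes inference earlier in a season.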

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Luke Bornn
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Unsupervised learning on functional data with an application to the analysis of U.S. temperature prediction accuracy

Author: 
Date created: 
2019-02-07
Abstract: 

Unsupervised learning techniques are widely applied in exploratory analysis to motivate further analysis. In functional data analysis, two typical unsupervised learning topics are functional principal component analysis and functional clustering. In this study, besides reviewing established unsupervised learning techniques, we extend the unsupervised random forest clustering method to functional data and examine its strengths and weaknesses through comparisons with other clustering methods in simulation studies. Finally, both the proposed method and established unsupervised learning techniques are applied to a real data example: an analysis of the accuracy of U.S. temperature predictions from 2014 to 2017.
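A minimal sketch of the general unsupervised random forest clustering recipe (Breiman-style, not necessarily the functional extension proposed here): classify real observations against a column-permuted synthetic copy, derive proximities from shared terminal nodes, and cluster the resulting dissimilarities. Simulated discretized curves stand in for functional data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(4)

# Two groups of discretized curves (toy stand-in for functional data).
t = np.linspace(0, 1, 20)
curves = np.vstack([
    np.sin(2 * np.pi * t) + rng.normal(scale=0.1, size=(25, 20)),
    np.cos(2 * np.pi * t) + rng.normal(scale=0.1, size=(25, 20)),
])
truth = np.repeat([0, 1], 25)

# Synthetic copy: permute each column independently, destroying the
# dependence structure, then train a forest to tell real from synthetic.
synth = np.column_stack([rng.permutation(col) for col in curves.T])
X = np.vstack([curves, synth])
y = np.repeat([1, 0], len(curves))
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Proximity = fraction of trees in which two curves share a leaf.
leaves = rf.apply(curves)                   # (n_curves, n_trees) leaf ids
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=-1)
Z = linkage(squareform(1 - prox, checks=False), method='average')
labels = fcluster(Z, t=2, criterion='maxclust')
```

With well-separated groups, the proximity-based clustering should recover the two curve shapes almost perfectly (up to label swapping).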

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Jiguo Cao
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Development of functional principal components analysis and estimating the time-varying gene regulation network

Author: 
Date created: 
2018-09-27
Abstract: 

Functional data analysis (FDA) addresses the analysis of information on curves or functions. Examples of such curves or functions include time-course gene expression measurements, electroencephalography (EEG) data monitoring brain activity, the emission rate of automobiles after acceleration, and children's growth curves of body fat percentage measured over a period of growth. The primary interest in the underlying curves or functions varies across fields. In this thesis, new methodology for constructing a time-varying network from functional observations is proposed. Several variations of functional principal component analysis (FPCA) are developed in the context of the functional regression model. Lastly, new uses of FPCA are explored for recovering trajectory functions and estimating derivatives.
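On a dense common grid, basic FPCA reduces to PCA of the discretized curve matrix: the right singular vectors approximate the eigenfunctions and the projections give the FPC scores. A minimal sketch with simulated curves (the basis functions and score variances are illustrative, not from this thesis):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated functional data: each row is one curve on a common grid,
# built from two orthonormal basis functions plus small noise.
t = np.linspace(0, 1, 50)
n = 100
scores_true = rng.normal(size=(n, 2)) * np.array([2.0, 0.5])
phi = np.vstack([np.sqrt(2) * np.sin(np.pi * t),
                 np.sqrt(2) * np.sin(2 * np.pi * t)])
curves = scores_true @ phi + rng.normal(scale=0.05, size=(n, len(t)))

# FPCA via SVD of the centred curve matrix.
centred = curves - curves.mean(axis=0)
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
var_explained = s ** 2 / np.sum(s ** 2)     # proportion per component
fpc_scores = centred @ Vt[:2].T             # projections onto the top 2
```

For sparsely or irregularly observed curves this simple SVD route breaks down, which is one motivation for the FPCA variants developed in the thesis.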

Document type: 
Thesis
File(s): 
Supervisor(s): 
Jiguo Cao
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.