Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays

Sequence clustering for genetic mapping of binary traits

Date created: 
2021-08-24
Abstract: 

Sequence relatedness has potential application to fine-mapping genetic variants contributing to inherited traits. We investigate the utility of genealogical tree-based approaches to fine-map causal variants in three different projects. In the first project, through coalescent simulation, we compare the ability of several popular methods of association mapping to localize causal variants in a sub-region of a candidate genomic region. We consider four broad classes of association methods, which we describe as single-variant, pooled-variant, joint-modelling and tree-based, under an additive genetic-risk model. We also investigate whether differentiating case sequences based on their carrier status for a causal variant can improve fine-mapping. Our results lend support to the potential of tree-based methods for genetic fine-mapping of disease. In the second project, we develop an R package to dynamically cluster a set of single-nucleotide variant sequences. The resulting partition structures provide important insight into sequence relatedness. In the third project, we investigate the ability of methods based on sequence relatedness to fine-map rare causal variants and compare them to genotypic association methods. Since the true gene genealogy is unknown in practice, we apply the methods developed in the second project to estimate sequence relatedness. We also pursue reclassifying case sequences by their carrier status using genealogical nearest neighbours. We find that methods based on sequence relatedness are competitive for fine-mapping rare causal variants. We propose some general recommendations for fine-mapping rare variants in case-control association studies.
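
A minimal illustration of the sequence-clustering idea, in Python rather than R and not reproducing the package's actual algorithm: it partitions a simulated 0/1 haplotype matrix by average-linkage hierarchical clustering on pairwise Hamming distances. The simulated matrix, the number of clusters, and the distance measure are assumptions made for the sketch.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(7)
    # toy haplotype matrix: rows are sequences, columns are single-nucleotide variants (0/1)
    haplotypes = rng.integers(0, 2, size=(30, 50))

    # pairwise Hamming distance as a simple proxy for sequence relatedness
    dist = pdist(haplotypes, metric="hamming")
    tree = linkage(dist, method="average")
    partition = fcluster(tree, t=4, criterion="maxclust")   # cut into four clusters
    print(partition)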

Document type: 
Thesis
File(s): 
Supervisor(s): 
Jinko Graham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Post-selection inference

Author: 
Date created: 
2021-04-21
Abstract: 

Forward stepwise selection is a widely used model selection algorithm. It is, however, hard to do inference for a model that has already been cherry-picked. We investigate a post-selection inference method called selective inference. Beginning with very simple examples and working towards more complex ones, we evaluate the method's performance in terms of its power and coverage probability through a simulation study. We investigate the target of inference and the impact of the amount of information used to construct conditional confidence intervals. To achieve the same level of coverage probability, the more conditions we use, the wider the confidence interval is; the effect can be extreme. Moreover, we investigate the impact of multiple conditioning, as well as the importance of the normality assumption on which the underlying theory is based. For models with not very many parameters (p << n), we find that normality is not crucial for the coverage probability of the test.
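
The coverage problem that motivates selective inference can be seen in a small simulation, sketched below under a global null: after one step of forward stepwise selection, a naive 95% confidence interval for the selected coefficient under-covers the true value of zero. This is an illustrative toy, not the essay's selective-inference procedure.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, reps, cover = 100, 5, 2000, 0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)              # global null: every true coefficient is 0
        j = np.argmax(np.abs(X.T @ y))          # one forward-stepwise step: most correlated predictor
        xj = X[:, j]
        bhat = xj @ y / (xj @ xj)
        se = np.sqrt(np.sum((y - bhat * xj) ** 2) / (n - 1)) / np.sqrt(xj @ xj)
        cover += (bhat - 1.96 * se) <= 0 <= (bhat + 1.96 * se)
    print("naive post-selection coverage:", cover / reps)   # falls below the nominal 0.95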

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Richard Lockhart
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

An efficient approach to pruning regression trees using a modified Bayesian information criterion

Author: 
Date created: 
2021-04-14
Abstract: 

By identifying relationships between regression tree construction and change-point detection, we show that it is possible to prune a regression tree efficiently using properly modified information criteria. We prove that one of the proposed pruning approaches that uses a modified Bayesian information criterion consistently recovers the true tree structure provided that the true regression function can be represented as a subtree of a full tree. In practice, we obtain simplified trees that can have prediction accuracy comparable to trees obtained using standard cost-complexity pruning. We briefly discuss an extension to random forests that prunes trees adaptively in order to prevent excessive variance, building upon the work of other authors.
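
A rough sketch of BIC-style pruning is given below, using scikit-learn's cost-complexity path to generate candidate subtrees rather than the modified criterion developed in the project; the data-generating tree and the leaf-count penalty are assumptions for the example.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    n = 500
    X = rng.uniform(size=(n, 2))
    y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + 0.3 * rng.standard_normal(n)  # true tree: a single split

    full = DecisionTreeRegressor(min_samples_leaf=10).fit(X, y)
    best_bic, best_tree = np.inf, full
    for alpha in full.cost_complexity_pruning_path(X, y).ccp_alphas:
        tree = DecisionTreeRegressor(min_samples_leaf=10, ccp_alpha=alpha).fit(X, y)
        rss = np.sum((y - tree.predict(X)) ** 2)
        bic = n * np.log(rss / n) + tree.get_n_leaves() * np.log(n)  # BIC-style penalty on leaf count
        if bic < best_bic:
            best_bic, best_tree = bic, tree
    print("leaves in BIC-selected tree:", best_tree.get_n_leaves())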

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Thomas Loughin
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Some new methods and models in functional data analysis

Author: 
Date created: 
2020-06-19
Abstract: 

With new developments in modern technology, data are recorded continuously on a large scale over finer and finer grids. Such data push forward the development of functional data analysis (FDA), which analyzes information on curves or functions. Analyzing functional data is intrinsically an infinite-dimensional problem. The functional partial least squares method is a useful tool for dimension reduction. In this thesis, we propose a sparse version of the functional partial least squares method that is easy to interpret. Another problem of interest in FDA is the functional linear regression model, which extends the linear regression model to the functional context. We propose a new method to study the truncated functional linear regression model, which assumes that the functional predictor no longer influences the response after time passes a certain cutoff point. Motivated by a recent study of instantaneous in-game win probabilities for the National Rugby League, we develop novel FDA techniques to determine the distributions in a Bayesian model.
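
For the truncated functional linear regression model, a toy Python sketch of the basic idea follows: the response depends on the functional predictor only up to an unknown cutoff, which is estimated by scanning candidate cutoffs and comparing cross-validated fits. The simulated curves, the ridge penalty, and the grid of cutoffs are assumptions for the illustration, not the estimator proposed in the thesis.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(2)
    n, m = 200, 100
    t = np.linspace(0, 1, m)
    X = np.cumsum(rng.standard_normal((n, m)), axis=1) / np.sqrt(m)     # rough random curves
    beta = np.where(t < 0.4, np.sin(2 * np.pi * t), 0.0)                # true cutoff at 0.4
    y = X @ beta / m + 0.05 * rng.standard_normal(n)

    scores = {}
    for delta in np.linspace(0.1, 1.0, 10):                             # candidate cutoff points
        keep = t <= delta
        scores[delta] = cross_val_score(Ridge(alpha=0.01), X[:, keep] / m, y, cv=5).mean()
    print("estimated cutoff:", max(scores, key=scores.get))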

Document type: 
Thesis
File(s): 
Supervisor(s): 
Jiguo Cao
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Thesis) Ph.D.

Systematic comparison of designs and emulators for computer experiments using a library of test functions

Author: 
Date created: 
2020-12-16
Abstract: 

As computational resources have become faster and more economical, scientific research has transitioned from using only physical experiments to using simulation-based exploration. A body of literature has since grown aimed at the design and analysis of so-called computer experiments. While this literature is large and active, little work has focused on comparing methods. This project presents ways of comparing and evaluating both design and emulation methods for computer experiments. Using a suite of test functions (in this work we introduce the Virtual Library of Computer Experiments), we establish a procedure that can provide guidance on how to proceed in simulation problems. An illustrative comparison is performed for each context, putting three emulators and then four experimental designs up against each other, while also highlighting possible considerations for test function choice.
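
The flavour of such a comparison can be sketched in Python: fit the same Gaussian-process emulator on two designs of equal size and compare prediction error on a standard test function. The Branin function, the design size, and the Matern kernel are assumptions for the illustration and are not necessarily the emulators, designs, or library functions studied in the project.

    import numpy as np
    from scipy.stats import qmc
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def branin(x):                                  # classic 2-d test function, rescaled to [0, 1]^2
        x1, x2 = 15 * x[:, 0] - 5, 15 * x[:, 1]
        return ((x2 - 5.1 / (4 * np.pi**2) * x1**2 + 5 / np.pi * x1 - 6) ** 2
                + 10 * (1 - 1 / (8 * np.pi)) * np.cos(x1) + 10)

    rng = np.random.default_rng(3)
    test = rng.uniform(size=(2000, 2))
    designs = {"random": rng.uniform(size=(40, 2)),
               "Latin hypercube": qmc.LatinHypercube(d=2, seed=3).random(40)}
    for name, D in designs.items():
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(D, branin(D))
        rmse = np.sqrt(np.mean((gp.predict(test) - branin(test)) ** 2))
        print(f"{name} design: emulator RMSE = {rmse:.2f}")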

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Derek Bingham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Efficient Bayesian parameter inference for COVID-19 transmission models

Author: 
Date created: 
2020-12-17
Abstract: 

Many transmission models have been proposed and adapted to reflect changes in policy for mitigating the spread of COVID-19. Often these models are applied without any formal comparison with previously existing models. In this project, we use an annealed sequential Monte Carlo (ASMC) algorithm to estimate the parameters of these transmission models. We also use Bayesian model selection to provide a framework through which the relative performance of transmission models can be compared in a statistically rigorous manner. The ASMC algorithm provides an unbiased estimate of the marginal likelihood at no additional computational cost. This offers a significant advantage over MCMC methods, which require expensive post hoc computation to estimate the marginal likelihood. We find that ASMC can produce results comparable to MCMC in a fraction of the time.
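
As a toy illustration of how ASMC yields a marginal-likelihood estimate as a by-product, the sketch below anneals from a normal prior to the posterior of a one-parameter Gaussian model; the COVID-19 transmission models in the project are far more involved, and the temperature schedule, particle count, and Metropolis move are assumptions for the example.

    import numpy as np

    rng = np.random.default_rng(4)
    y = rng.normal(1.0, 1.0, size=20)                         # toy data: y_i ~ N(theta, 1)

    def loglik(theta):
        return -0.5 * np.sum((y[None, :] - theta[:, None]) ** 2, axis=1) \
               - 0.5 * len(y) * np.log(2 * np.pi)

    N, K = 2000, 50
    temps = np.linspace(0, 1, K + 1)                          # annealing temperatures
    theta = rng.normal(0, 10, size=N)                         # particles from the N(0, 10^2) prior
    log_Z = 0.0                                               # running log marginal likelihood
    for k in range(1, K + 1):
        logw = (temps[k] - temps[k - 1]) * loglik(theta)      # incremental importance weights
        log_Z += np.log(np.mean(np.exp(logw - logw.max()))) + logw.max()
        w = np.exp(logw - logw.max()); w /= w.sum()
        theta = theta[rng.choice(N, size=N, p=w)]             # multinomial resampling
        log_target = lambda th: temps[k] * loglik(th) - 0.5 * th**2 / 100
        prop = theta + 0.5 * rng.standard_normal(N)           # one random-walk Metropolis move
        accept = np.log(rng.uniform(size=N)) < log_target(prop) - log_target(theta)
        theta = np.where(accept, prop, theta)
    print("ASMC estimate of the log marginal likelihood:", log_Z)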

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Liangliang Wang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Contextual batting and bowling in limited overs cricket

Author: 
Date created: 
2020-11-25
Abstract: 

Cricket is a sport for which many batting and bowling statistics have been proposed. However, a feature of cricket is that the level of aggressiveness adopted by batsmen depends on match circumstances. It is therefore relevant to consider these circumstances when evaluating batting and bowling performances. This project considers batting performance in the second innings of limited overs cricket when a target has been set. The runs required, the number of overs completed and the wickets taken are all relevant in assessing batting performance. We produce a visualization for second-innings batting which describes how a batsman performs under different circumstances. The visualization is then reduced to a single statistic, “clutch batting”, which can be used to compare batsmen. An analogous analysis is then provided for bowlers, based on the symmetry between batting and bowling, and we define a corresponding statistic, “clutch bowling”.
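
One simple way to make circumstance-adjusted batting concrete (purely an illustration; the project's visualization and clutch statistic are defined differently) is to bucket balls by match pressure and compare each batsman's scoring rate to the overall rate in the same bucket. The column names and pressure bins below are hypothetical.

    import pandas as pd

    # hypothetical ball-by-ball records from second-innings chases
    balls = pd.DataFrame({
        "batsman":      ["A", "A", "B", "B", "A", "B"],
        "runs_scored":  [1, 4, 0, 6, 2, 1],
        "required_rr":  [6.2, 9.8, 5.1, 10.3, 8.7, 4.9],   # required run rate when the ball was bowled
    })
    balls["pressure"] = pd.cut(balls["required_rr"], bins=[0, 6, 9, 36],
                               labels=["low", "medium", "high"])

    overall = (balls.groupby("pressure", observed=True)["runs_scored"]
                    .mean().rename("overall").reset_index())
    per_bat = (balls.groupby(["batsman", "pressure"], observed=True)["runs_scored"]
                    .mean().rename("rate").reset_index())
    merged = per_bat.merge(overall, on="pressure")
    clutch = (merged["rate"] - merged["overall"]).groupby(merged["batsman"]).mean()
    print(clutch)            # positive values: above-typical scoring under the same circumstances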

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Harsha Perera
Timothy Swartz
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

A flexible group benefits framework for pricing deposit rates

Author: 
Date created: 
2020-11-16
Abstract: 

Currently, most flexible group benefit plans are designed and priced based on deterministic assumptions about the plan members’ option selections. This can cause an adverse selection spiral, threatening the sustainability of the plan. We therefore propose a comprehensive framework with a novel pricing formula that incorporates both a model for claims and a model for plan members’ enrollment decisions to prevent adverse selection. We find through simulation that our proposed pricing formula outperforms traditional pricing practice by keeping flex plans sustainable over time. In addition to preventing the adverse selection spiral through pricing, our framework also serves as a tool to evaluate the impact of other factors, such as changes in plan design, health costs, and member decisions.
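
The adverse selection spiral that the proposed framework is designed to prevent can be reproduced with a very small simulation, sketched below under traditional deterministic pricing; the coverage levels, claim distribution, and decision rule are invented for the illustration and are not the framework's models or pricing formula.

    import numpy as np

    rng = np.random.default_rng(5)
    expected_claims = rng.gamma(shape=2.0, scale=500.0, size=5000)      # heterogeneous member costs
    prem_hi, prem_lo = 0.9 * expected_claims.mean(), 0.6 * expected_claims.mean()
    for year in range(10):
        # each member picks the option minimizing expected premium plus out-of-pocket cost
        pick_hi = prem_hi + 0.1 * expected_claims < prem_lo + 0.4 * expected_claims
        # next year's rates are set deterministically from this year's average covered claims
        if pick_hi.any():
            prem_hi = 0.9 * expected_claims[pick_hi].mean()
        if (~pick_hi).any():
            prem_lo = 0.6 * expected_claims[~pick_hi].mean()
        print(f"year {year}: high-option share {pick_hi.mean():.2f}, high premium {prem_hi:.0f}")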

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Jean-François Bégin
Barbara Sanders
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

New perspectives on non-negative matrix factorization for grouped topic models

Author: 
Date created: 
2020-08-18
Abstract: 

Probabilistic topic models (PTMs) have become a ubiquitous approach for finding a set of latent themes ("topics") in collections of unstructured text. A simpler, linear algebraic technique for the same problem is non-negative matrix factorization (NMF): we are given a matrix with non-negative entries and asked to find a pair of low-rank matrices, also non-negative, whose product is approximately the original matrix. A drawback of NMF is the non-convex nature of the optimization problem it poses. Recent work by the theoretical computer science community addresses this issue, utilizing NMF's inherent structure to find conditions under which the objective function admits convexity. With convexity comes tractability, and the central theme of this thesis is the exploitation of this tractability to ally NMF with resampling-based nonparametrics. Our motivating example is one in which a document collection exhibits some kind of partitioning according to a discrete, indexical covariate, and the goal is to assess the influence of this partitioning on document content; we call this scenario a grouped topic model. Computation relies on several well-studied tools from numerical linear algebra and convex programming which are especially well suited for synthesis with permutation tests and the bootstrap. The result is a set of simple, fast, and easily implementable methodologies for performing inference in grouped topic models. This is in contrast to parallel developments in PTMs, where ever-more cumbersome inference schemes are required to fit complex graphical models.
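
A stripped-down sketch of allying NMF with a permutation test in a grouped setting is given below; the tiny corpus, the group labels, the two topics, and the test statistic are all placeholders and do not reproduce the thesis's methodology.

    import numpy as np
    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["stock market trading returns", "bond yields and interest rates",
            "market volatility and bond prices", "goals scored in the final match",
            "the striker scored a late goal", "the match ended in a draw"]
    groups = np.array([0, 0, 0, 1, 1, 1])                     # hypothetical grouping covariate

    X = CountVectorizer().fit_transform(docs)
    W = NMF(n_components=2, init="nndsvda", random_state=0).fit_transform(X)
    W = W / (W.sum(axis=1, keepdims=True) + 1e-12)            # per-document topic proportions

    # permutation test: do mean topic proportions differ between the two groups?
    stat = lambda g: np.abs(W[g == 0].mean(0) - W[g == 1].mean(0)).max()
    obs = stat(groups)
    rng = np.random.default_rng(0)
    perm = [stat(rng.permutation(groups)) for _ in range(999)]
    print("permutation p-value:", (1 + sum(p >= obs for p in perm)) / 1000)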

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
David Campbell
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Bayesian logistic regression with the local bouncy particle sampler for COVID-19

Author: 
Date created: 
2020-08-24
Abstract: 

A novel coronavirus, SARS-CoV-2, has caused the COVID-19 pandemic. The global economy and people’s health and lives have been facing a tremendous threat from COVID-19. This project aims to determine some important factors in COVID-19 severity, based on 137 Tianjin patients who had been exposed to COVID-19 since January 5, 2020. We fit a logistic regression model and estimate the parameters using standard Markov chain Monte Carlo (MCMC) methods. Due to the weaknesses and limitations of standard MCMC methods, we then perform model estimation with a special example of a piecewise deterministic Markov process, the Bouncy Particle Sampler (BPS). This method is also known as a rejection-free and irreversible MCMC method, and it can draw samples from our target distribution efficiently. One variant of the BPS algorithm, the Local Bouncy Particle Sampler (LBPS), has advantages in computational efficiency. We apply the standard MCMC method and the LBPS to our dataset. We conclude that age and Wuhan-related exposure (i.e., having lived in or travelled from Wuhan) are two important factors in COVID-19 severity.
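
For context, a bare-bones Python sketch of the standard-MCMC baseline (random-walk Metropolis for Bayesian logistic regression) is shown below on synthetic data standing in for the patient records; the covariates, prior, and tuning constants are assumptions, and the LBPS itself is beyond this sketch.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 137                                                    # same sample size as the Tianjin data
    X = np.column_stack([np.ones(n), rng.normal(5.0, 1.5, n), rng.integers(0, 2, n)])
    true_beta = np.array([-2.0, 0.3, 1.0])                     # intercept, age/10, Wuhan exposure
    y = rng.uniform(size=n) < 1 / (1 + np.exp(-X @ true_beta))

    def log_post(beta):                                        # logistic likelihood + N(0, 10^2) prior
        eta = X @ beta
        return np.sum(y * eta - np.log1p(np.exp(eta))) - 0.5 * beta @ beta / 100

    beta, draws = np.zeros(3), []
    for _ in range(20000):
        prop = beta + 0.1 * rng.standard_normal(3)             # random-walk Metropolis proposal
        if np.log(rng.uniform()) < log_post(prop) - log_post(beta):
            beta = prop
        draws.append(beta.copy())
    print("posterior means:", np.mean(draws[5000:], axis=0))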

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Liangliang Wang
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.