Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays


Classification based on supervised clustering with application to juvenile idiopathic arthritis

Author: 
Date created: 
2013-08-16
Abstract: 

Juvenile Idiopathic Arthritis (JIA) is the most common rheumatic disease of childhood. Our objective is to predict remission outcomes so that children who are likely to fare poorly can benefit from early aggressive treatment. Many classification techniques provide either a binary prediction or an estimated probability of remission. However, parents would like to know more specifically about the remission outcomes of children similar to their own. In this project, we propose a supervised clustering method that provides this information. Inspired by the basic idea of supervised principal component analysis, we perform supervision by selecting and/or weighting explanatory variables according to their associations with the class response. Our supervised clustering method is applied to the JIA data and to data simulated with known properties. In terms of out-of-sample misclassification rates, our method is shown to be competitive with an existing supervised clustering method, classification trees and random forests.
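The supervision step described in the abstract can be sketched in a few lines. The weighting scheme below (a standardized mean-difference per feature) and the toy data are illustrative assumptions, not the weighting actually used in the project:

```python
import random
from statistics import mean, pstdev

def supervision_weights(X, y):
    """Weight each feature by a two-sample t-like statistic: features more
    associated with the binary class response get larger weights."""
    weights = []
    for j in range(len(X[0])):
        g0 = [x[j] for x, c in zip(X, y) if c == 0]
        g1 = [x[j] for x, c in zip(X, y) if c == 1]
        spread = pstdev(g0 + g1) or 1.0
        weights.append(abs(mean(g1) - mean(g0)) / spread)
    return weights

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means on the (possibly weighted) feature vectors."""
    rng = random.Random(seed)
    centers = rng.sample(X, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in X:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])))
            clusters[i].append(x)
        centers = [[mean(col) for col in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

# toy data: feature 0 separates the classes, feature 1 is noise
X = [[0.1, 5.0], [0.2, -3.0], [0.3, 4.0], [5.1, -4.0], [5.2, 3.5], [5.0, 0.0]]
y = [0, 0, 0, 1, 1, 1]

w = supervision_weights(X, y)
Xw = [[wj * xj for wj, xj in zip(w, x)] for x in X]  # down-weights the noise feature
centers = kmeans(Xw, k=2)
```

Clustering the re-weighted data makes the resulting groups align with the class response, which is the sense in which the clustering is "supervised".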

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Thomas M. Loughin
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Modeling Mortality Rates with the Linear Logarithm Hazard Transform Approaches

Author: 
Peer reviewed: 
No, item is not peer reviewed.
Date created: 
2013-06-24
Abstract: 

In this project, two approaches to modeling mortality rates based on the linear logarithm hazard transform (LLHT) are proposed. Empirical observations show a linear relationship between the sequences of logarithms of the forces of mortality (hazard rates of the future lifetime) for two years. The two estimated parameters of this linear relationship can be used to forecast mortality rates. Deterministic and stochastic mortality rates are predicted with the LLHT, Lee-Carter and CBD models, and the corresponding forecast errors are calculated to compare forecasting performance. Finally, applications to pricing some mortality-linked securities based on the forecasted mortality rates are presented for illustration.
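The linear relationship the abstract describes can be illustrated directly: regress one year's log forces of mortality on the previous year's across ages, then apply the fitted transform to roll the curve forward. The mortality values below are made up for illustration:

```python
from math import log, exp

def fit_llht(mu_base, mu_next):
    """Least-squares fit of log mu_next(x) = a + b * log mu_base(x) across
    ages x, following the assumed linear-logarithm relationship."""
    xs = [log(m) for m in mu_base]
    ys = [log(m) for m in mu_next]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return a, b

def forecast(mu, a, b):
    """Apply the fitted transform once to roll mortality forward one year."""
    return [exp(a + b * log(m)) for m in mu]

# illustrative forces of mortality at ages 60..64 for two consecutive years
mu_2010 = [0.010, 0.012, 0.014, 0.017, 0.020]
mu_2011 = [0.0095, 0.0114, 0.0133, 0.0162, 0.0191]

a, b = fit_llht(mu_2010, mu_2011)
mu_2012 = forecast(mu_2011, a, b)
```

With mortality improving about 5% a year in the toy data, the fit gives a slope near 1 and a small negative intercept, and the forecast continues the improvement.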

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Cary Chi-Liang Tsai
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Analysis of clustered event times with right-censoring

Author: 
Date created: 
2013-08-27
Abstract: 

Motivated by an infectious disease study at the BC Centre for Disease Control, this project is concerned with clustered event times where observation is subject to right-censoring and the cluster size is random. We formulate the dependence of the event times within each cluster with a copula model, and assume a parametric survival model for the margins. Inference on the model parameters is made via maximum likelihood estimation (MLE). In addition, we explore patterns in the cluster sizes and their association with the individuals who define the clusters. The motivating infectious disease study is used throughout the project to illustrate the research.
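A minimal sketch of the kind of copula likelihood involved, assuming a Clayton copula with exponential margins and, for simplicity, ignoring right-censoring and random cluster size (the project's actual likelihood accounts for both); the data pairs and the grid search are purely illustrative:

```python
from math import log, exp

def clayton_logpdf(u, v, theta):
    """Log density of the bivariate Clayton copula (theta > 0)."""
    return (log(1 + theta)
            - (1 + theta) * (log(u) + log(v))
            - (2 + 1 / theta) * log(u ** -theta + v ** -theta - 1))

def loglik(data, theta, rate=1.0):
    """Joint log-likelihood under exponential(rate) margins linked by a
    Clayton copula; censoring is ignored in this simplified sketch."""
    total = 0.0
    for t1, t2 in data:
        u, v = 1 - exp(-rate * t1), 1 - exp(-rate * t2)   # marginal CDFs
        total += clayton_logpdf(u, v, theta)
        total += log(rate) - rate * t1 + log(rate) - rate * t2  # marginal densities
    return total

# toy positively dependent event-time pairs (clusters of size 2)
pairs = [(0.2, 0.3), (1.1, 0.9), (0.5, 0.4), (2.0, 1.8), (0.7, 0.6)]
grid = [0.1 * k for k in range(1, 51)]
theta_hat = max(grid, key=lambda th: loglik(pairs, th))
```

The copula parameter theta captures the within-cluster dependence; a grid search stands in for the numerical maximization a real MLE would use.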

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Joan Hu
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Gene-environment interactions in non-Hodgkin lymphoma: a statistical analysis

Date created: 
2013-07-31
Abstract: 

An emerging focus of cancer epidemiology is the role of the environment together with genes in determining risk, often referred to as gene-environment interaction. For non-Hodgkin lymphoma (NHL), environmental exposures such as organochlorines are important risk factors. On the other hand, familial clustering of NHL suggests that genetics also plays a role. In this project, we analyze data from a BC population-based case-control study of NHL, to evaluate gene-environment interactions between the organochlorine oxychlordane and single-nucleotide polymorphisms (SNPs) that tag genes involved in the elimination of foreign compounds from the body. A statistically significant interaction between oxychlordane and an intronic SNP within the ABCC4 gene was identified at false-discovery rate level 10%. The same intronic region of ABCC4 produced the four most significant interactions. These results may be viewed in the context of recent work connecting intronic SNPs to regulation of gene expression and the development of cancer.
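False-discovery rate control of the kind cited above is commonly done with the Benjamini-Hochberg step-up procedure; a minimal sketch with made-up p-values (the procedure, not these numbers, is what the study would have used):

```python
def benjamini_hochberg(pvals, q=0.10):
    """Benjamini-Hochberg step-up: return indices of hypotheses rejected
    at false-discovery rate level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:   # compare k-th smallest p to q*k/m
            k = rank                   # remember the largest passing rank
    return sorted(order[:k])           # reject everything up to that rank

# toy p-values: one strong interaction among ten tests
p = [0.001, 0.30, 0.45, 0.52, 0.61, 0.70, 0.78, 0.83, 0.91, 0.97]
hits = benjamini_hochberg(p, q=0.10)
```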

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Jinko Graham
John Spinelli
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Statistical modelling of temporary stream flow in Canadian prairie provinces

Author: 
Date created: 
2012-08-20
Abstract: 

Accurate forecasting of stream flow is of vital importance in semi-arid regions in order to meet the needs of humans, such as agriculture, and of wildlife. It is also of considerable interest for predicting stream flow in ungauged basins and for detecting change due to land use or climate variations. Daily stream flows in semi-arid and arid regions are characterized by zero-inflation, seasonality, autoregression and extreme events such as floods and droughts. Analyses at the level of daily data for intermittent streams are problematic because of the preponderance of zero flows. Basic modelling approaches are often inappropriate when many zero-flow events are present; approaches need to be modified to allow greater flexibility in incorporating zeros than is possible with traditional methods. This project discusses the utility of spline compartment models for the analysis of data from intermittent streams, whereby the log-odds of the probability of a non-zero flow day, as well as the logarithm of the non-zero flow rate, can be studied. These models can handle large numbers of zero-flow days, and the use of splines and other smoothers has the benefit of permitting a wide range of distributional shapes to be fitted. The models are illustrated for ten streams in the Canadian Prairie Provinces.
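The two compartments (occurrence of flow, modelled on the log-odds scale, and intensity of flow, modelled on the log scale) can be illustrated with a simulated two-part model. The parameters below and the absence of spline smoothing are simplifications for the sketch:

```python
import random
from math import log, exp
from statistics import mean

random.seed(1)

def simulate_day(p_wet, mu_log, sigma_log):
    """Two-part daily flow: zero with probability 1 - p_wet,
    otherwise a lognormal positive flow."""
    if random.random() > p_wet:
        return 0.0
    return exp(random.gauss(mu_log, sigma_log))

# simulate a wet season and a dry season of daily flows
wet = [simulate_day(0.8, 1.0, 0.5) for _ in range(1000)]
dry = [simulate_day(0.1, 0.0, 0.5) for _ in range(1000)]

def two_part_fit(flows):
    """Estimate the occurrence and intensity components separately."""
    positive = [f for f in flows if f > 0]
    p_hat = len(positive) / len(flows)        # P(non-zero flow day)
    mu_hat = mean(log(f) for f in positive)   # mean log flow when flowing
    return p_hat, mu_hat

p_wet_hat, mu_wet_hat = two_part_fit(wet)
p_dry_hat, mu_dry_hat = two_part_fit(dry)
```

In the actual models, both components vary smoothly over the season via splines rather than being constant within a season as here.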

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Charmaine Dean
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Investigating the use of the accelerated hazards model for survival analysis

Author: 
Date created: 
2010-12-09
Abstract: 

This project contrasts the Proportional Hazards, Accelerated Failure Time and Accelerated Hazards (AH) models in the analysis of time-to-event data. The AH model handles data that exhibit crossing of the survival and hazard curves, unlike the other two models considered. The three models are illustrated on five contrasting data sets. A simulation study is conducted to assess the small-sample performance of the AH model by quantifying the mean squared error of the predicted survivor curves under scenarios of crossing and non-crossing survivor curves. The results show that the AH model can perform poorly under model misspecification for models with a crossing hazard. Problems with variance estimation of parameters in the AH model are observed for small sample sizes, and a bootstrap approach is offered as an alternative method of quantifying the precision of estimates.
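The AH model's defining feature is that covariates rescale the time axis of the baseline hazard, lambda(t|z) = lambda0(t * exp(beta * z)), which permits hazard crossing. A sketch with a hypothetical nonmonotone baseline hazard (rising then falling), for which the crossing point of the z = 0 and z = 1 hazards is analytically ln 2:

```python
from math import exp, log

def baseline_hazard(t):
    """Hypothetical nonmonotone baseline hazard: rises then falls."""
    return t * exp(-t)

def ah_hazard(t, z, beta):
    """Accelerated hazards model: covariates rescale the time axis,
    lambda(t | z) = lambda0(t * exp(beta * z))."""
    return baseline_hazard(t * exp(beta * z))

# locate the crossing of the z=0 and z=1 hazard curves by bisection
beta = log(2.0)   # acceleration factor c = 2 for z = 1
lo, hi = 0.1, 5.0
for _ in range(60):
    mid = (lo + hi) / 2
    if ah_hazard(mid, 1, beta) > ah_hazard(mid, 0, beta):
        lo = mid   # z=1 hazard still above: crossing is to the right
    else:
        hi = mid
t_cross = (lo + hi) / 2
```

Under a proportional hazards model the two curves could never cross, which is why the AH model is the one suited to such data.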

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Charmaine Dean
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Some perspectives of smooth and locally sparse estimators

Author: 
Date created: 
2013-07-15
Abstract: 

In this thesis we develop new techniques for computing smooth and, at the same time, locally sparse (i.e., zero on some sub-regions) estimators of functional principal components (FPCs) in functional principal component analysis (FPCA) and of coefficient functions in functional linear regression (FLR). Like sparse models in ordinary data analysis, locally sparse estimators in functional data analysis enjoy less variability and better interpretability. In the first part of the thesis, we develop smooth and locally sparse estimators of FPCs. The non-null sub-regions of our estimated FPCs coincide with the sub-regions where the corresponding FPC has significant magnitude, which makes the estimates easier to interpret: those non-null sub-regions are where the sample curves have their major variations. An efficient algorithm based on projection deflation is designed to compute our estimators, which are strongly consistent and asymptotically normal under mild conditions. Simulation studies show that FPCs estimated by our method explain variations of the sample curves similar to those explained by FPCs estimated with other methods. In the second part of the thesis, we develop a new regularization technique called "functional SCAD" (fSCAD), a functional generalization of the well-known SCAD (smoothly clipped absolute deviation) regularization, and apply it to derive a smooth and locally sparse estimator of the coefficient function in FLR. The fSCAD enables us to identify the null sub-regions of the coefficient function without over-shrinking the non-zero values, while the smoothness of the estimator is regularized by a roughness penalty. We also develop an efficient algorithm to compute the estimator in practice via B-spline expansion. An asymptotic analysis shows that our estimator enjoys the oracle property, i.e., it performs as well as if we knew the true null sub-regions of the coefficient function in advance. Simulation studies show that our estimator has superior numerical performance.
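The scalar SCAD penalty that fSCAD generalizes has a simple closed form (with Fan and Li's usual default a = 3.7): linear like the lasso near zero, quadratic in between, and constant for large coefficients, so big effects are not over-shrunk. A sketch:

```python
def scad(beta, lam, a=3.7):
    """SCAD penalty: lasso-like near zero, constant for large |beta|,
    so large coefficients are penalized but not over-shrunk."""
    b = abs(beta)
    if b <= lam:                      # lasso region
        return lam * b
    if b <= a * lam:                  # quadratic transition region
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2    # flat region: no further penalty growth
```

The flat tail is exactly what lets fSCAD zero out null sub-regions of a coefficient function while leaving its large values essentially untouched.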

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Jiguo Cao
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

A tutorial on the inheritance procedure for multiple testing of tree-structured hypotheses

Author: 
Date created: 
2013-07-25
Abstract: 

In a candidate gene association study the goal is to find associations between a trait of interest and genetic variation at markers, such as single-nucleotide polymorphisms, or SNPs. SNPs are grouped within candidate genes thought to influence the trait. Such grouping imposes a tree structure on the hypotheses, with hypotheses about single-SNP associations nested within gene-based associations. In this project we give a tutorial on the inheritance procedure, a powerful new method for testing tree-structured hypotheses. We define sequentially rejective procedures and show that the inheritance procedure is a sequentially rejective procedure that strongly controls the family-wise error rate under so-called monotonicity and single step conditions. We also show how to further improve power by taking advantage of the logical implications among the nested hypotheses. The resulting testing strategy enables more powerful detection of gene- and SNP-based associations, while controlling the chance of incorrectly claiming that such associations exist.
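A sequentially rejective procedure in the sense used here is easiest to see in Holm's classic step-down method, which frameworks like the inheritance procedure generalize to tree-structured hypotheses; a minimal sketch with made-up p-values:

```python
def holm(pvals, alpha=0.05):
    """Holm's step-down method, a classic sequentially rejective procedure:
    test sorted p-values against progressively looser thresholds, stopping
    at the first acceptance. Controls the family-wise error rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = []
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):  # alpha/m, alpha/(m-1), ...
            rejected.append(i)
        else:
            break                            # sequential: stop at first failure
    return sorted(rejected)

# toy p-values for four hypotheses
p = [0.001, 0.01, 0.04, 0.20]
rej = holm(p)
```

The inheritance procedure works in the same sequential spirit, but the alpha freed by a rejection is passed down the tree to the rejected hypothesis's descendants rather than spread over all remaining hypotheses.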

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Brad McNeney
Jinko Graham
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

Use of genetic algorithms for optimal investment strategies

Author: 
Date created: 
2013-04-17
Abstract: 

In this project, a genetic algorithm (GA) is used to develop investment strategies: to decide the optimal asset allocations backing a portfolio of term insurance contracts, and the re-balancing strategy that responds to changing financial markets, such as changes in interest rates and mortality experience. The objective function maximized by the GA accommodates three objectives of interest to the management of insurance companies: maximizing the total value of wealth at the end of the period, minimizing the variance of the total wealth across the simulated interest rate scenarios, and achieving consistent returns on the portfolio from year to year. One objective may conflict with another, and the GA searches the large space of candidate solutions for one that favours a particular objective as specified by the user while not worsening the other objectives too much. Duration matching, a popular approach to managing the risks underlying traditional life insurance portfolios, is used as a benchmark to examine the effectiveness of the strategies obtained with the GA. Experiments compare the performance of the GA-proposed investment strategy to the duration matching strategy in terms of the different objectives under the testing scenarios. The results illustrate that, with the help of the GA, we can find a strategy very similar to that from duration matching, as well as other strategies that outperform duration matching on some of the desired objectives and are robust in the tested changing environment of interest rates and mortality.
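A toy version of the GA loop (truncation selection, crossover, mutation over normalized asset weights). The asset assumptions and the single mean-variance objective are hypothetical stand-ins for the project's multi-objective insurance setup:

```python
import random

random.seed(42)

# hypothetical assets: expected return and variance (assumed independent)
EXP_RET = [0.02, 0.05, 0.08]
VAR = [0.0001, 0.0040, 0.0160]
RISK_AVERSION = 5.0

def fitness(w):
    """Mean-variance objective: reward expected return, penalize variance
    (a stand-in for the multi-objective function described above)."""
    ret = sum(wi * r for wi, r in zip(w, EXP_RET))
    var = sum(wi * wi * v for wi, v in zip(w, VAR))
    return ret - RISK_AVERSION * var

def normalize(w):
    s = sum(w)
    return [wi / s for wi in w]

def mutate(w, rate=0.1):
    """Perturb weights, keep them positive, renormalize to sum to one."""
    return normalize([max(1e-6, wi + random.gauss(0, rate)) for wi in w])

def crossover(w1, w2):
    """Uniform crossover: each weight inherited from either parent."""
    return normalize([random.choice(pair) for pair in zip(w1, w2)])

def ga(pop_size=30, generations=50):
    pop = [normalize([random.random() for _ in range(3)]) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection (elitist)
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best = ga()
```

The real application replaces this scalar objective with the three insurance objectives evaluated over simulated interest-rate and mortality scenarios, but the select-crossover-mutate loop is the same.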

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Gary Parker
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.

The hot hand in golf

Date created: 
2013-04-26
Abstract: 

In this project, an analysis is made to determine whether the phenomenon known as the "hot hand" exists in golf. Data from a particular golf tournament in 2012 are studied in order to find out whether this proposition seems true. For this tournament, the scores for each golfer are split into the number of strokes and the number of putts required to complete the course. The key idea in this project is the substitution of the number of putts with the expected number of putts. The rationale is that putting is a highly stochastic element of golf and that its randomness conceals evidence of the hot hand. This expected value is based on the distance to the pin once the ball is on the green, obtained from the ShotLink website. New scores for all golfers are calculated as the sum of the number of strokes plus the expected number of putts required to complete the course. The association between these scores in the first round and the corresponding scores in the second round is then measured. The results point to the conclusion that there is no hot hand in golf.
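The final association step amounts to correlating adjusted scores across rounds. The scores below are simulated under independence across rounds (i.e., no hot hand), so the correlation should be near zero; all numbers are hypothetical:

```python
import random
from statistics import mean, pstdev

random.seed(7)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

# hypothetical adjusted scores (strokes + expected putts) for 100 golfers,
# generated independently across rounds, i.e. under "no hot hand"
round1 = [70 + random.gauss(0, 2) for _ in range(100)]
round2 = [70 + random.gauss(0, 2) for _ in range(100)]

r = pearson(round1, round2)
```

A genuine hot hand would show up as a clearly positive r: golfers adjusted-scoring well in round one would tend to do so again in round two.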

Document type: 
Graduating extended essay / Research project
File(s): 
Supervisor(s): 
Tim Swartz
Department: 
Science: Department of Statistics and Actuarial Science
Thesis type: 
(Project) M.Sc.