# Statistics and Actuarial Science - Theses, Dissertations, and other Required Graduate Degree Essays

## Sports analytics

This thesis consists of a compilation of four research papers. Chapter 2 investigates the powerplay in one-day cricket. The form of the analysis takes a “what if” approach where powerplay outcomes are substituted with what might have happened had there been no powerplay. This leads to a paired comparisons setting consisting of actual matches and hypothetical parallel matches where outcomes are imputed during the powerplay period. We also investigate individual batsmen and bowlers and their performances during the powerplay. Chapter 3 considers the problem of determining optimal substitution times in soccer. An analysis is presented based on Bayesian logistic regression. We find that with evenly matched teams, there is a goal scoring advantage to the trailing team during the second half of a match. We observe that there is no discernible time during the second half when there is a benefit due to substitution. Chapter 4 explores two avenues for the modification of tactics in Twenty20 cricket. The first idea is based on the realization that wickets are of less importance in Twenty20 cricket than in other formats of cricket (e.g. one-day cricket and Test cricket). The second idea may be applicable when there exists a sizeable mismatch between two competing teams. In this case, the weaker team may be able to improve its win probability by increasing the variance of run differential. A specific variance inflation technique which we consider is increased aggressiveness in batting. Chapter 5 explores new definitions for pace of play in ice hockey. Using detailed event data from the 2015-2016 regular season of the National Hockey League (NHL), the distance of puck movement with possession is the proposed criterion in determining the pace of a game. Although intuitive, this notion of pace does not correlate with expected and familiar quantities such as goals scored and shots taken.
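The variance-inflation idea in Chapter 4 can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that the run differential (weaker team minus stronger team) is normally distributed. The normal model, the function name `win_probability`, and all numbers are assumptions and are not taken from the thesis.

```python
import math


def win_probability(mean_diff: float, sd_diff: float) -> float:
    """P(D > 0) when the run differential D ~ Normal(mean_diff, sd_diff**2)."""
    z = mean_diff / sd_diff
    # Standard normal CDF evaluated at z, written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical mismatch: the weaker team trails by 10 runs on average.
baseline = win_probability(mean_diff=-10.0, sd_diff=20.0)
# Same expected deficit, but more aggressive batting widens the spread.
inflated = win_probability(mean_diff=-10.0, sd_diff=30.0)

print(f"baseline win probability: {baseline:.3f}")
print(f"inflated win probability: {inflated:.3f}")
```

Under this toy model a negative mean differential divided by a larger standard deviation moves the z-score toward zero, so the weaker team's win probability rises toward one half. The same change would hurt the favoured team.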

## Pricing Defaultable Catastrophe Bonds with Compound Doubly Stochastic Poisson Losses and Liquidity Risk

A catastrophe bond (CAT bond) is a modern financial instrument for transferring the risk of natural disasters to capital markets. In this project, we provide a structure of payoffs for a zero-coupon CAT bond in which the premature default of the issuer is also considered. The defaultable CAT bond price is computed by Monte Carlo simulations under the Vasicek interest rate model with losses generated from a compound doubly stochastic Poisson process. In the underlying Poisson process, the intensity of occurrence is assumed to follow a geometric Brownian motion. Moreover, the issuer’s daily total asset value is modelled by the approach proposed in Duan et al. (1995), and the liquidity process is incorporated to capture the additional return of investors. Finally, a sensitivity analysis is carried out to explore the effects of key parameters on the CAT bond price.
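A heavily simplified sketch of the simulation idea follows. It keeps only three ingredients named in the abstract (a Vasicek short rate, a geometric Brownian motion intensity, and compound Poisson losses) and leaves out issuer default and the liquidity process entirely. The payoff rule (full face value unless aggregate losses exceed a trigger), the exponential loss sizes, the one-event-per-step approximation, and every parameter value are assumptions made for illustration, not the project's specification.

```python
import math
import random


def simulate_cat_bond_price(n_paths=500, T=1.0, steps=250, seed=1,
                            r0=0.03, kappa=0.5, theta=0.04, sigma_r=0.01,
                            lam0=2.0, mu_lam=0.0, sigma_lam=0.3,
                            mean_loss=1.0, trigger=4.0,
                            face=100.0, recovery=0.5):
    """Monte Carlo price of a toy zero-coupon CAT bond (all inputs hypothetical)."""
    rng = random.Random(seed)
    dt = T / steps
    total = 0.0
    for _ in range(n_paths):
        r, lam = r0, lam0
        integral_r = 0.0   # accumulates the integral of the short rate
        agg_loss = 0.0     # aggregate catastrophe loss so far
        for _ in range(steps):
            integral_r += r * dt
            # Approximation: at most one event per step, with probability lam * dt.
            if rng.random() < lam * dt:
                agg_loss += rng.expovariate(1.0 / mean_loss)
            # Euler step for the Vasicek short rate.
            r += kappa * (theta - r) * dt + sigma_r * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            # Exact log-normal step for the GBM intensity (keeps lam positive).
            lam *= math.exp((mu_lam - 0.5 * sigma_lam ** 2) * dt
                            + sigma_lam * math.sqrt(dt) * rng.gauss(0.0, 1.0))
        payoff = face if agg_loss <= trigger else recovery * face
        total += math.exp(-integral_r) * payoff
    return total / n_paths


price = simulate_cat_bond_price()
print(f"toy CAT bond price: {price:.2f}")
```

Re-running the function with a different `trigger` or `sigma_lam` while holding the seed fixed gives a crude version of the sensitivity analysis the abstract mentions.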

## Sparse Multivariate Reduced-Rank Regression with Covariance Estimation

Multivariate multiple linear regression is multiple linear regression, but with multiple responses. Standard approaches assume that observations from different subjects are uncorrelated and so estimates of the regression parameters can be obtained through separate univariate regressions, regardless of whether the responses are correlated within subjects. There are three main extensions to the simplest model. The first assumes a low rank structure on the coefficient matrix that arises from a latent factor model linking predictors to responses. The second reduces the number of parameters through variable selection. The third allows for correlations between response variables in the low rank model. Chen and Huang propose a new model that falls under the reduced-rank regression framework, employs variable selection, and estimates correlations among error terms. This project reviews their model, describes its implementation, and reports the results of a simulation study evaluating its performance. The project concludes with ideas for further research.

## A multi-state model for a life insurance product with integrated health rewards program

With the prevalence of chronic diseases that account for a significant portion of deaths, a new approach to life insurance has emerged to address this issue. The new approach integrates health rewards programs with life insurance products; the insureds are classified by fitness status according to their level of participation and receive premium reductions at the higher statuses. We introduce a Markov chain process to model the dynamic transition of the fitness statuses, which are linked to corresponding levels of mortality risk reduction. We then embed this transition process into a stochastic multi-state model to describe the new life insurance product. Formulas are given for calculating its benefit, premium, reserve and surplus. These results are compared with those of traditional life insurance. Numerical examples are given for illustration.
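The Markov chain for fitness statuses can be pictured with a small example. The sketch below propagates a distribution over three statuses through a one-year transition matrix. The status names, the matrix entries, and the helper names are hypothetical and chosen only to show the mechanics; mortality and the surrounding multi-state model are not represented.

```python
STATUSES = ["bronze", "silver", "gold"]  # hypothetical fitness statuses

# Hypothetical one-year transition probabilities; each row sums to 1.
P = [
    [0.7, 0.3, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
]


def step(dist, matrix):
    """One transition: new_dist[j] = sum_i dist[i] * matrix[i][j]."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def distribution_after(dist, matrix, years):
    """Status distribution after the given number of yearly transitions."""
    for _ in range(years):
        dist = step(dist, matrix)
    return dist


start = [1.0, 0.0, 0.0]  # everyone begins in the lowest status
after_five = distribution_after(start, P, 5)
for name, prob in zip(STATUSES, after_five):
    print(f"{name}: {prob:.3f}")
```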

## Penalized Logistic Regression in Case-control Studies

Likelihood-based inference of odds ratios in logistic regression models is problematic for small samples. For example, maximum-likelihood estimators may be seriously biased or even non-existent due to separation. Firth proposed a penalized likelihood approach which avoids these problems. However, his approach is based on a prospective sampling design and its application to case-control data has not yet been fully justified. To address the shortcomings of standard likelihood-based inference, we describe: i) naive application of Firth logistic regression, which ignores the case-control sampling design, and ii) an extension of Firth's method to case-control data proposed by Zhang. We present a simulation study evaluating the empirical performance of the two approaches in small to moderate case-control samples. Our simulation results suggest that even though there is no formal justification for applying Firth logistic regression to case-control data, it performs as well as Zhang logistic regression which is justified for case-control data.
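Separation is easiest to see in a single 2x2 table. The sketch below covers only the saturated case of one binary exposure, where maximizing Firth's penalized likelihood amounts to adding 0.5 to each cell before computing the odds ratio. It ignores the case-control sampling design, so it corresponds to the naive application described above rather than to Zhang's extension. The function names and the table counts are made up for illustration.

```python
import math


def log_odds_ratio_ml(a, b, c, d):
    """Maximum-likelihood log odds ratio for the table
    [[a, b], [c, d]] = [[exposed cases, exposed controls],
                        [unexposed cases, unexposed controls]]."""
    if a * d == 0 and b * c == 0:
        return math.nan   # odds ratio undefined
    if b * c == 0:
        return math.inf   # separation: estimate diverges upward
    if a * d == 0:
        return -math.inf  # separation: estimate diverges downward
    return math.log((a * d) / (b * c))


def log_odds_ratio_firth_2x2(a, b, c, d):
    """Penalized estimate for the saturated 2x2 case: add 0.5 to every cell."""
    return math.log(((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)))


# Hypothetical small sample with no exposed controls (b = 0).
table = (5, 0, 2, 6)
ml_estimate = log_odds_ratio_ml(*table)
firth_estimate = log_odds_ratio_firth_2x2(*table)
print("ML log odds ratio:   ", ml_estimate)
print("Firth log odds ratio:", round(firth_estimate, 3))
```

With covariates beyond a single binary exposure there is no such closed form, and the penalized likelihood has to be maximized iteratively.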

## Statistical Learning Tools for Heteroskedastic Data

Many regression procedures are affected by heteroskedasticity, or non-constant variance. A classic solution is to transform the response y and model h(y) instead. Common functions h require a direct relationship between the variance and the mean. Unless the transformation is known in advance, it can be found by applying a model for the variance to the squared residuals from a regression fit. Unfortunately, this approach additionally requires the strong assumption that the regression model for the mean is 'correct', whereas many regression problems involve model uncertainty. Consequently it is undesirable to make the assumption that the mean model can be correctly specified at the outset. An alternative is to model the mean and variance simultaneously, where it is possible to try different mean models and variance models together in different combinations, and to assess the fit of each combination using a single criterion. We demonstrate this approach in three different problems: unreplicated factorials, regression trees, and random forests. For the unreplicated factorial problem, we develop a model for joint identification of mean and variance effects that can reliably identify active effects of both types. The joint model is estimated using maximum likelihood, and effect selection is done using a specially derived information criterion (IC). Our method is capable of identifying sensible location-dispersion models that are not considered by methods that rely on sequential estimation of location and dispersion effects. We take a similar approach to modeling variances in regression trees. We develop an alternative likelihood-based split-selection criterion that has the capacity to account for local variance in the regression in an unstructured manner, and the tree is built using a specially derived IC. Our IC explicitly accounts for the split-selection parameter and our IC also leads to a faster pruning algorithm that does not require crossvalidation. 
We show that the new approach performs better for mean estimation under heteroskedasticity. Finally we use these likelihood-based trees as base learners in an ensemble much like a random forest, and improve the random forest procedure itself. First, we show that typical random forests are inefficient at fitting flat mean functions. Our first improvement is the novel alpha-pruning algorithm, which adaptively changes the number of observations in the terminal nodes of the regression trees depending on the flatness. Second, we show that random forests are inefficient at estimating means when the data are heteroskedastic, which we address by using our likelihood-based regression trees as a base learner. This allows explicit variance estimation and improved mean estimation under heteroskedasticity. Our unifying and novel contribution to these three problems is the specially derived IC. Our solution is to simulate values of the IC for several models and to store these values in a lookup table. With the lookup table, models can be evaluated and compared without needing either crossvalidation or a holdout set. We call this approach the Corrected Heteroskedastic Information Criterion (CHIC) paradigm and we demonstrate that applying the CHIC paradigm is a principled way to model variance in finite sample sizes.

## Using computer model uncertainty to inform the design of physical experiments: An application in glaciology

Computer models are used as surrogates for physical experiments in many areas of science. They can allow the researchers to gain a better understanding of the processes of interest, in situations where it would be overly costly or time-consuming to obtain sufficient physical data. In this project, we give an approach for using a computer model to obtain designs for a physical experiment. The designs are optimal for modelling the spatial distribution of the response across the region of interest. An additional consideration is the presence of several tuning parameters to the computer model, which represent physical aspects of the process but whose values are not precisely known. In obtaining the optimal designs, we account for this uncertainty in the parameters governing the system. The project is motivated by an application in glaciology, where computer models are often used to model the melt of snow and ice across a glacier surface. The methodology is applied to obtain optimal networks of stakes, which researchers use to obtain measurements of summer mass balance (the difference between the amount of snow/ice before and after the melt season).

## Analysis of Data in Network and Natural Language Formats

The work herein describes a predictive model for cricket matches, a method of evaluating cricket players, and a method to infer properties of a network from a link-traced sample. In Chapter 2, player characteristics are estimated using a frequency count of the outcomes that occur when that player is batting or bowling. These characteristics are weighted against the relative propensity of each outcome in each of 200 game situations (10 wickets times 20 overs), and incorporate prior information using a Metropolis-Hastings algorithm. The characteristics of players in selected team rosters are then fed into a system we developed to simulate outcomes of whole games. The winning probabilities of each team are shown to perform similarly to competitive betting lines during the 2014 Cricket World Cup. In Chapter 3 the simulator is used to estimate the effect, in terms of expected number of runs, of each player. The effect of the player is reported as expected runs scored or allowed per innings above an average player in the same batting or bowling position. Chapter 4 proposes a method based on approximate Bayesian computation (ABC) to make inferences on hidden parameters of a network graph. Network inference using ABC is a very new field. This is the first work, to the author’s knowledge, of an ABC-based inference using only a sample of a network, rather than the entire network. Summary statistics are taken from the sample of the network of interest, networks and samples are then simulated using hidden parameters from a prior distribution, and a posterior of the parameters is found by a kernel density estimate conditioned on the summary statistics. Chapter 5 describes an application of the method proposed in Chapter 4 to real data. A network of precedence citations between legal documents, centered around cases overseen by the Supreme Court of Canada, is observed. 
The features of certain cases that lead to their frequent citation are inferred, and their effects estimated by ABC. Future work and extensions are briefly discussed in Chapter 6.
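The general shape of ABC on a network sample can be sketched with a toy problem. The example below is a deliberately simpler variant than the method in the thesis: it uses plain rejection ABC instead of a kernel density estimate, an Erdős–Rényi graph instead of a citation network, and a simple node sample instead of link tracing. The function names, the uniform prior, the tolerance, and the observed statistic are all assumptions.

```python
import random


def sample_mean_degree(p, n_nodes, n_sampled, rng):
    """Simulate an Erdos-Renyi graph G(n_nodes, p) and return the mean
    degree of the first n_sampled nodes (a simple node sample)."""
    degree = [0] * n_nodes
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < p:   # edge i-j is present with probability p
                degree[i] += 1
                degree[j] += 1
    return sum(degree[:n_sampled]) / n_sampled


def abc_rejection(observed_stat, n_nodes, n_sampled, n_draws, tol, seed=0):
    """Keep prior draws whose simulated summary lands within tol of the observed one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        p = rng.random()           # draw from a Uniform(0, 1) prior
        stat = sample_mean_degree(p, n_nodes, n_sampled, rng)
        if abs(stat - observed_stat) <= tol:
            accepted.append(p)
    return accepted


# Hypothetical observation: sampled nodes have mean degree 8.7 in a 30-node network.
posterior_draws = abc_rejection(observed_stat=8.7, n_nodes=30, n_sampled=10,
                                n_draws=500, tol=1.0)
posterior_mean = sum(posterior_draws) / len(posterior_draws)
print(f"accepted {len(posterior_draws)} draws, posterior mean of p = {posterior_mean:.3f}")
```

The accepted draws approximate the posterior of the edge probability. Shrinking `tol` tightens the approximation at the cost of fewer accepted draws.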

## A goodness-of-fit test for semi-parametric copula models of right-censored bivariate survival times

In multivariate survival analyses, understanding and quantifying the association between survival times is of importance. Copulas, such as Archimedean copulas and Gaussian copulas, provide a flexible approach to modeling and estimating the dependence structure among survival times separately from the marginal distributions (Sklar, 1959). However, misspecification in the parametric form of the copula function will directly lead to incorrect estimation of the joint distribution of the bivariate survival times and other model-based quantities. The objectives of this project are twofold. First, I reviewed the basic definitions and properties of commonly used survival copula models. In this project, I focused on semi-parametric copula models where the marginal distributions are unspecified but the copula function belongs to a parametric copula family. Various estimation procedures for the dependence parameter associated with the copula function were also reviewed. Secondly, I extended the pseudo in-and-out-of-sample (PIOS) likelihood ratio test proposed in Zhang et al. (2016) to testing the semi-parametric copula models for right-censored bivariate survival times. The PIOS test is constructed by comparing two forms of pseudo likelihoods: one is the "in-sample" pseudo likelihood, which is the full pseudo likelihood, and the other is the "out-of-sample" pseudo likelihood, which is a cross-validated pseudo likelihood obtained by means of the jackknife. The finite sample performance of the PIOS test was investigated via a simulation study. In addition, two real data examples were analyzed for illustrative purposes.

## On Supervised and Unsupervised Discrimination

Discrimination is a supervised problem in statistics and machine learning that begins with data from a finite number of groups. The goal is to partition the data-space into some number of regions, and assign a group to each region so that observations there are most likely to belong to the assigned group. The most popular tool for discrimination is called discriminant analysis. Unsupervised discrimination, commonly known as clustering, also begins with data from groups, but now we do not necessarily know how many groups there are, nor do we get to know which group each observation belongs to. Our goal when doing clustering is still to partition the data-space into regions and assign groups to those regions; however, we do not have any a priori information with which to assign these groups. Common tools for clustering include the k-means algorithm and model-based clustering using either the expectation maximization (EM) or classification expectation maximization (CEM) algorithms (of which k-means is a special case). Tools designed for clustering can also be used to do discrimination. We investigate this possibility, along with a method proposed by Yang (2013) for smoothing the transition between these problems. We use two simulations to investigate the performance of discriminant analysis and both versions of model-based clustering with various parameter settings across various datasets. These settings include using Yang’s method for modifying clustering tools to handle discrimination. Results are presented along with recommendations for data analysis when doing discrimination or clustering. Specifically, we investigate what assumptions to make about the groups’ sizes and shapes, as well as which method to use (discriminant analysis or the EM or CEM algorithms) and whether or not to apply Yang’s pre-processing procedure.
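The link between k-means and CEM can be made concrete with a one-dimensional example. The sketch below alternates a hard classification step with a mean update, which is what CEM reduces to under the usual assumption of equal mixing proportions and a common spherical covariance. The data, the starting centers, and the function name `kmeans_1d` are invented for illustration, and Yang's method is not represented.

```python
def kmeans_1d(xs, init_centers, n_iter=20):
    """Lloyd-style k-means on scalars; returns (centers, labels)."""
    centers = list(init_centers)   # copy so the caller's list is not modified
    labels = [0] * len(xs)
    for _ in range(n_iter):
        # Classification step: assign each point to its nearest center.
        labels = [min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
                  for x in xs]
        # Maximization step: each center becomes the mean of its members.
        for k in range(len(centers)):
            members = [x for x, lab in zip(xs, labels) if lab == k]
            if members:            # leave an empty cluster's center unchanged
                centers[k] = sum(members) / len(members)
    return centers, labels


data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]   # two well-separated hypothetical groups
centers, labels = kmeans_1d(data, init_centers=[0.0, 6.0])
print("centers:", [round(c, 3) for c in centers])
print("labels: ", labels)
```

Replacing the hard assignment with posterior membership weights would turn the same loop into an EM iteration for a Gaussian mixture.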