Many survey questions include a phrase such as “Choose all that apply,” which lets respondents choose any number of options from a predefined list of items. Responses to these questions result in multiple-response categorical variables (MRCVs). This thesis focuses on analyzing and modeling three MRCVs. There are 232 possible models representing different combinations of associations. Parameters are estimated using generalized estimating equations generated by a pseudo-likelihood, and variances of the estimators are corrected using sandwich methods. Due to the large number of possible models, model comparisons based on nested models would be inappropriate. As an alternative, model averaging is proposed as a model comparison tool as well as a way to account for model selection uncertainty. Further, the calculations required for computing the variance of the estimators can exceed 32-bit machine capacity even for a moderately large number of items. This issue is addressed by reducing the dimensions of the matrices.
We explore two regression models for creating an adjusted plus-minus statistic for the NHL. We compare an OLS regression model and a penalized gamma-lasso regression model. The traditional plus-minus metric is a simple marginal statistic that allocates a +1 to players for scoring a goal and a -1 for allowing a goal according to whether they were on the ice. This is a very noisy and uninformative statistic since it does not take into account the quality of the other players on the ice with an individual. We build on previous research to create a more informative statistic that takes into account all of the players on the ice. This previous research has focused on goals to build an adjusted plus-minus, which is information deficient because there are only approximately 5 goals scored per game. We improve upon this by instead using shots, which provide us with ten times as much information per game. We use shot location data from 2007 to 2013 to create a smoothed probability map for the probability of scoring a goal from all locations in the offensive zone. We then model the shots from the 2014-2015 season to get player estimates. Two models are compared, an OLS regression and a penalized regression (lasso). Finally, we compare our adjusted plus-minus to the traditional plus-minus and complete a salary analysis to determine if teams are properly valuing players for the quality of shots they are taking and allowing.
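To make the contrast concrete, the following is a minimal sketch of the traditional plus-minus tally, together with the kind of indicator row an adjusted plus-minus regression could use. The player labels, the toy goal list, and the +1/-1/0 row encoding are illustrative assumptions and are not taken from the abstract.

```python
from collections import defaultdict


def traditional_plus_minus(goals):
    """Tally +1 for every on-ice player of the scoring team and
    -1 for every on-ice player of the conceding team.

    goals: list of (scoring_players_on_ice, conceding_players_on_ice).
    """
    totals = defaultdict(int)
    for scoring, conceding in goals:
        for player in scoring:
            totals[player] += 1
        for player in conceding:
            totals[player] -= 1
    return dict(totals)


def design_row(players, home_on_ice, away_on_ice):
    """Hypothetical regression encoding: +1 if the player is on the ice
    for the home team, -1 if on the ice for the away team, 0 otherwise."""
    home, away = set(home_on_ice), set(away_on_ice)
    return [1 if p in home else -1 if p in away else 0 for p in players]


# Toy data (made up for illustration only).
goals = [
    (["A", "B"], ["X", "Y"]),
    (["A", "C"], ["X", "Z"]),
    (["X", "Y"], ["B", "C"]),
]
pm = traditional_plus_minus(goals)
row = design_row(["A", "B", "C", "X", "Y", "Z"], ["A", "B"], ["X", "Y"])
```

The tally credits every on-ice player equally regardless of teammates and opponents. A regression on rows like `row` would instead estimate each player's contribution while holding the other on-ice players fixed.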
This project studies the reserving problem for incurred but not reported (IBNR) claims in non-life insurance. Based on an idea presented in Kremer (1995), we propose a new Poisson INAR (integer-valued autoregressive) model for the unclosed claim counts, which are the number of reported but not enough reported claims. The properties and the prediction of the proposed Poisson INAR model are discussed. We modify the estimation methods proposed in Silva et al. (2005) for replicated INAR(1) processes so that they can be applied to our model, and we introduce new algorithms for estimating the model parameters. The performance of the three estimation methods used in this project is compared, and the impact of the sample size on the accuracy of the estimates is examined in a simulation study. To illustrate, we also present the prediction results of our proposed model using a generated sample.
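For orientation, here is a minimal sketch of the standard Poisson INAR(1) recursion, X_t = alpha ∘ X_{t-1} + e_t, where ∘ denotes binomial thinning and e_t is a Poisson innovation. This is the textbook process, not the modified model proposed in the project. The function names and parameter values are hypothetical.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method
    (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def binomial_thinning(alpha, count, rng):
    """Each of `count` existing units survives independently with prob. alpha."""
    return sum(1 for _ in range(count) if rng.random() < alpha)


def simulate_inar1(alpha, lam, n_steps, x0=0, seed=0):
    """Simulate X_t = alpha o X_{t-1} + e_t with e_t ~ Poisson(lam)."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n_steps):
        survivors = binomial_thinning(alpha, path[-1], rng)
        arrivals = poisson_draw(lam, rng)
        path.append(survivors + arrivals)
    return path


def one_step_forecast(alpha, lam, x_last):
    """Conditional mean E[X_{t+1} | X_t = x_last] = alpha * x_last + lam."""
    return alpha * x_last + lam


path = simulate_inar1(alpha=0.5, lam=2.0, n_steps=20000)
sample_mean = sum(path) / len(path)  # stationary mean is lam / (1 - alpha)
```

In a claims setting, the thinning term could be read as claims that remain open from one period to the next and the innovation as newly arriving claims, although that interpretation is an assumption here.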
This thesis consists of a compilation of four research papers. Chapter 2 investigates the powerplay in one-day cricket. The form of the analysis takes a “what if” approach where powerplay outcomes are substituted with what might have happened had there been no powerplay. This leads to a paired comparisons setting consisting of actual matches and hypothetical parallel matches where outcomes are imputed during the powerplay period. We also investigate individual batsmen and bowlers and their performances during the powerplay. Chapter 3 considers the problem of determining optimal substitution times in soccer. An analysis is presented based on Bayesian logistic regression. We find that with evenly matched teams, there is a goal scoring advantage to the trailing team during the second half of a match. We observe that there is no discernible time during the second half when there is a benefit due to substitution. Chapter 4 explores two avenues for the modification of tactics in Twenty20 cricket. The first idea is based on the realization that wickets are of less importance in Twenty20 cricket than in other formats of cricket (e.g. one-day cricket and Test cricket). The second idea may be applicable when there exists a sizeable mismatch between two competing teams. In this case, the weaker team may be able to improve its win probability by increasing the variance of run differential. A specific variance inflation technique which we consider is increased aggressiveness in batting. Chapter 5 explores new definitions for pace of play in ice hockey. Using detailed event data from the 2015-2016 regular season of the National Hockey League (NHL), the distance of puck movement with possession is the proposed criterion in determining the pace of a game. Although intuitive, this notion of pace does not correlate with expected and familiar quantities such as goals scored and shots taken.
The catastrophe bond (CAT bond) is one of the modern financial instruments used to transfer the risk of natural disasters to capital markets. In this project, we provide a structure of payoffs for a zero-coupon CAT bond in which the premature default of the issuer is also considered. The defaultable CAT bond price is computed by Monte Carlo simulations under the Vasicek interest rate model with losses generated from a compound doubly stochastic Poisson process. In the underlying Poisson process, the intensity of occurrence is assumed to follow a geometric Brownian motion. Moreover, the issuer’s daily total asset value is modelled by the approach proposed in Duan et al. (1995), and the liquidity process is incorporated to capture the additional return of investors. Finally, a sensitivity analysis is carried out to explore the effects of key parameters on the CAT bond price.
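The stochastic ingredients named above can be written in their standard textbook forms as follows. The symbols (a, b, \sigma_r, \mu_\lambda, \sigma_\lambda, Y_i) are notation assumed for this sketch and may differ from the project's own. The payoff structure, asset-value model, and liquidity process are not reproduced.

```latex
\begin{aligned}
dr_t &= a\,(b - r_t)\,dt + \sigma_r\,dW_t
  && \text{(Vasicek short rate)}\\
d\lambda_t &= \mu_\lambda \lambda_t\,dt + \sigma_\lambda \lambda_t\,dB_t
  && \text{(geometric Brownian motion intensity)}\\
L_t &= \sum_{i=1}^{N_t} Y_i,
  \qquad N_t \mid (\lambda_s)_{s \le t} \sim \mathrm{Poisson}\!\Big(\int_0^t \lambda_s\,ds\Big)
  && \text{(compound doubly stochastic Poisson losses)}
\end{aligned}
```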
Multivariate multiple linear regression is multiple linear regression, but with multiple responses. Standard approaches assume that observations from different subjects are uncorrelated and so estimates of the regression parameters can be obtained through separate univariate regressions, regardless of whether the responses are correlated within subjects. There are three main extensions to the simplest model. The first assumes a low rank structure on the coefficient matrix that arises from a latent factor model linking predictors to responses. The second reduces the number of parameters through variable selection. The third allows for correlations between response variables in the low rank model. Chen and Huang propose a new model that falls under the reduced-rank regression framework, employs variable selection, and estimates correlations among error terms. This project reviews their model, describes its implementation, and reports the results of a simulation study evaluating its performance. The project concludes with ideas for further research.
With the prevalence of chronic diseases that account for a significant portion of deaths, a new approach to life insurance has emerged to address this issue. The new approach integrates health rewards programs with life insurance products; the insureds are classified by fitness statuses according to their level of participation and receive premium reductions at the superior statuses. We introduce a Markov chain process to model the dynamic transition of the fitness statuses, which are linked to corresponding levels of mortality risk reduction. We then embed this transition process into a stochastic multi-state model to describe the new life insurance product. Formulas are given for calculating its benefit, premium, reserve and surplus. These results are compared with those of traditional life insurance. Numerical examples are given for illustration.
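The status dynamics can be sketched as a discrete-time Markov chain, assuming annual transitions. The status names, the transition probabilities, and the mortality factors below are all hypothetical values chosen for illustration. The full multi-state insurance model (death state, benefits, reserves) is not reproduced.

```python
def step(dist, P):
    """One transition: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def status_distribution(initial, P, years):
    """Distribution over fitness statuses after `years` annual transitions."""
    dist = list(initial)
    for _ in range(years):
        dist = step(dist, P)
    return dist


def expected_mortality_factor(dist, factors):
    """Average multiplier applied to baseline mortality under `dist`."""
    return sum(d * f for d, f in zip(dist, factors))


# Hypothetical statuses and numbers (not from the source).
STATUSES = ["bronze", "silver", "gold"]
P = [
    [0.6, 0.3, 0.1],  # from bronze
    [0.2, 0.5, 0.3],  # from silver
    [0.1, 0.2, 0.7],  # from gold
]
MORTALITY_FACTOR = [1.00, 0.95, 0.90]  # assumed risk reduction per status

after_one = status_distribution([1.0, 0.0, 0.0], P, years=1)
after_ten = status_distribution([1.0, 0.0, 0.0], P, years=10)
factor_one = expected_mortality_factor(after_one, MORTALITY_FACTOR)
```

Each row of `P` sums to one, so the status distribution remains a probability vector at every horizon. The expected factor shows how status occupancy could translate into an average mortality adjustment.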
Likelihood-based inference of odds ratios in logistic regression models is problematic for small samples. For example, maximum-likelihood estimators may be seriously biased or even non-existent due to separation. Firth proposed a penalized likelihood approach which avoids these problems. However, his approach is based on a prospective sampling design and its application to case-control data has not yet been fully justified. To address the shortcomings of standard likelihood-based inference, we describe: i) naive application of Firth logistic regression, which ignores the case-control sampling design, and ii) an extension of Firth's method to case-control data proposed by Zhang. We present a simulation study evaluating the empirical performance of the two approaches in small to moderate case-control samples. Our simulation results suggest that even though there is no formal justification for applying Firth logistic regression to case-control data, it performs as well as Zhang logistic regression which is justified for case-control data.
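For reference, Firth's penalty in its standard form for logistic regression is given below. The notation (\beta, X, \pi_i, h_i) is assumed here rather than taken from the project, and Zhang's case-control extension is not shown.

```latex
\begin{aligned}
\ell^{*}(\beta) &= \ell(\beta) + \tfrac{1}{2}\,\log\det I(\beta),
  \qquad I(\beta) = X^{\top} W X,\quad W = \operatorname{diag}\{\pi_i(1-\pi_i)\},\\
U^{*}_r(\beta) &= \sum_{i=1}^{n} \Big\{ y_i - \pi_i + h_i\big(\tfrac{1}{2} - \pi_i\big) \Big\}\, x_{ir},
  \qquad h_i = \Big[ W^{1/2} X \,(X^{\top} W X)^{-1} X^{\top} W^{1/2} \Big]_{ii}.
\end{aligned}
```

The penalty term keeps the estimate finite under separation, because \det I(\beta) tends to zero as fitted probabilities approach 0 or 1.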
Many regression procedures are affected by heteroskedasticity, or non-constant variance. A classic solution is to transform the response y and model h(y) instead. Common functions h require a direct relationship between the variance and the mean. Unless the transformation is known in advance, it can be found by applying a model for the variance to the squared residuals from a regression fit. Unfortunately, this approach additionally requires the strong assumption that the regression model for the mean is 'correct', whereas many regression problems involve model uncertainty. Consequently, it is undesirable to assume that the mean model can be correctly specified at the outset. An alternative is to model the mean and variance simultaneously, where it is possible to try different mean models and variance models together in different combinations, and to assess the fit of each combination using a single criterion. We demonstrate this approach in three different problems: unreplicated factorials, regression trees, and random forests. For the unreplicated factorial problem, we develop a model for joint identification of mean and variance effects that can reliably identify active effects of both types. The joint model is estimated using maximum likelihood, and effect selection is done using a specially derived information criterion (IC). Our method is capable of identifying sensible location-dispersion models that are not considered by methods that rely on sequential estimation of location and dispersion effects. We take a similar approach to modeling variances in regression trees. We develop an alternative likelihood-based split-selection criterion that has the capacity to account for local variance in the regression in an unstructured manner, and the tree is built using a specially derived IC. Our IC explicitly accounts for the split-selection parameter and also leads to a faster pruning algorithm that does not require crossvalidation.
We show that the new approach performs better for mean estimation under heteroskedasticity. Finally we use these likelihood-based trees as base learners in an ensemble much like a random forest, and improve the random forest procedure itself. First, we show that typical random forests are inefficient at fitting flat mean functions. Our first improvement is the novel alpha-pruning algorithm, which adaptively changes the number of observations in the terminal nodes of the regression trees depending on the flatness. Second, we show that random forests are inefficient at estimating means when the data are heteroskedastic, which we address by using our likelihood-based regression trees as a base learner. This allows explicit variance estimation and improved mean estimation under heteroskedasticity. Our unifying and novel contribution to these three problems is the specially derived IC. Our solution is to simulate values of the IC for several models and to store these values in a lookup table. With the lookup table, models can be evaluated and compared without needing either crossvalidation or a holdout set. We call this approach the Corrected Heteroskedastic Information Criterion (CHIC) paradigm and we demonstrate that applying the CHIC paradigm is a principled way to model variance in finite sample sizes.
Computer models are used as surrogates for physical experiments in many areas of science. They can allow the researchers to gain a better understanding of the processes of interest, in situations where it would be overly costly or time-consuming to obtain sufficient physical data. In this project, we give an approach for using a computer model to obtain designs for a physical experiment. The designs are optimal for modelling the spatial distribution of the response across the region of interest. An additional consideration is the presence of several tuning parameters to the computer model, which represent physical aspects of the process but whose values are not precisely known. In obtaining the optimal designs, we account for this uncertainty in the parameters governing the system. The project is motivated by an application in glaciology, where computer models are often used to model the melt of snow and ice across a glacier surface. The methodology is applied to obtain optimal networks of stakes, which researchers use to obtain measurements of summer mass balance (the difference between the amount of snow/ice before and after the melt season).