Markov chain Monte Carlo sampling of gene genealogies conditional on observed genetic data

Date created: 
Gene genealogy
Coalescent model
Markov chain Monte Carlo
Genetic association studies
Population genetics

The gene genealogy is a tree describing the ancestral relationships among genes sampled from unrelated individuals. Knowledge of the tree is useful for inference of population-genetic parameters such as the mutation or recombination rate. It also has potential application in genomic mapping, as individuals with similar trait values will tend to be more closely related genetically at the location of a trait-influencing mutation. One way to incorporate genealogical trees in genetic applications is to sample them conditional on genetic data observed at present. In this thesis, we describe our Markov chain Monte Carlo (MCMC) based genealogy sampler. First, we describe the sampler that conditions on haplotype data. Our implementation is based on the sampler described in Zöllner and Pritchard (2005), with several changes made to increase the efficiency of sampling. We illustrate the use of our sampler on haplotype data from a publicly available dataset, where we examine statistics summarizing the degree to which case haplotypes are more related to each other than to control haplotypes. Most genealogy samplers assume that haplotype data are available for the present-day sequences. However, commonly used genotyping technology measures genotypes at single loci rather than haplotypes, so the haplotypes must be imputed. To avoid relying on a single imputation, we then describe how we extended the original sampler to handle the case where only genotype data are available. We apply the sampler to simulated data to evaluate how well it estimates genetic parameters and predicts haplotypes. Adequate mixing of the sampler was a concern for some of the test datasets. We attribute the mixing difficulties to substantial dependence between the tree structure and the latent variables introduced to facilitate sampling of the trees. We describe our experiences with using simulated tempering to improve the mixing of the sampler.
Our heated distributions were chosen so that the dependencies between the latent variables and the tree structure were gradually reduced. We apply this approach to a simulated dataset to illustrate how simulated tempering can improve mixing over the haplotype configurations.
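As a minimal sketch of the general simulated tempering idea only, the Python example below runs the technique on a toy one-dimensional bimodal target. It is not the thesis's sampler: the target, the function names (log_target, simulated_tempering), the temperature ladder and the pseudo-prior weights are all assumptions made for illustration. It also heats the target by raising it to a power, whereas the heated distributions in the thesis were chosen to weaken the dependence between the latent variables and the tree structure.

```python
import math
import random


def log_target(x):
    """Log density, up to a constant, of a toy bimodal target:
    an equal mixture of unit-variance normals centred at -4 and +4."""
    a = -0.5 * (x + 4.0) ** 2
    b = -0.5 * (x - 4.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def simulated_tempering(n_iter, betas, log_weights, step=1.0, seed=1):
    """Sample (x, k) jointly, where k indexes the inverse temperature betas[k].
    Returns the x values recorded while the chain is at the cold level k == 0."""
    rng = random.Random(seed)
    x, k = -4.0, 0
    cold_samples = []
    for _ in range(n_iter):
        # Random-walk Metropolis update of x under the heated target pi(x)^beta_k.
        proposal = x + rng.gauss(0.0, step)
        log_ratio = betas[k] * (log_target(proposal) - log_target(x))
        if math.log(1.0 - rng.random()) < log_ratio:
            x = proposal
        # Propose a move to a neighbouring temperature level; moves off the
        # ends of the ladder are rejected, which keeps the proposal symmetric.
        j = k + rng.choice((-1, 1))
        if 0 <= j < len(betas):
            log_ratio = ((betas[j] - betas[k]) * log_target(x)
                         + log_weights[j] - log_weights[k])
            if math.log(1.0 - rng.random()) < log_ratio:
                k = j
        if k == 0:
            cold_samples.append(x)
    return cold_samples


# Assumed ladder of inverse temperatures; beta = 1 is the target of interest.
betas = [1.0, 0.5, 0.25, 0.1]
# Assumed pseudo-prior weights: a rough Gaussian approximation to minus the
# log normalising constant of each heated distribution, so that the chain
# spends a comparable amount of time at every level.
log_weights = [0.5 * math.log(b) for b in betas]

samples = simulated_tempering(100000, betas, log_weights)
fraction_right = sum(s > 0 for s in samples) / len(samples)
print(len(samples), round(fraction_right, 2))
```

In this toy setting, a chain run only at the cold level with the same step size would rarely cross between the two modes. The heated levels flatten the barrier between them, so the samples kept at the cold level can cover both modes.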

Document type: 
Copyright remains with the author. The author granted permission for the file to be printed and for the text to be copied and pasted.
Senior supervisor: 
Jinko Graham
Brad McNeney
Science: Department of Statistics and Actuarial Science
Thesis type: (Thesis) Ph.D.