Variable-Number Tandem Repeats (VNTRs) are genomic regions where a short DNA sequence is repeated back to back, with no intervening sequence between copies. While a fixed set of VNTRs is typically identified for a given species, the copy number at each VNTR varies between individuals within that species. Although VNTRs are found in both prokaryotic and eukaryotic genomes, multi-locus VNTR analysis (MLVA) is most widely used to distinguish different strains of bacteria, to cluster strains that might be epidemiologically related, and to investigate evolutionary rates. This thesis introduces PRINCE (Processing Reads to Infer the Number of Copies Efficiently), an algorithm that accurately estimates the copy number of a VNTR given the sequence of a single repeat unit, two short flanking sequences, and a set of short reads from a whole-genome sequencing (WGS) experiment. This is a challenging problem, especially when the repeat region is longer than the expected read length. The proposed method computes a statistical approximation of the local coverage inside the repeat region. This approximation is then mapped to the copy number using a linear function whose parameters are fitted to simulated data. PRINCE was tested on two datasets of Mycobacterium tuberculosis genomes and was shown to be more accurate than two previous methods. An implementation of PRINCE in the Python language is freely available at https://github.com/WGS-TB/PythonPRINCE.
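The idea of mapping a coverage-like signal to a copy number with a fitted linear function can be illustrated with a toy sketch. This is not the PRINCE implementation: the k-mer match count used as a coverage proxy, the function names, the parameter values (K, READ_LEN), and the error-free tiled reads are all assumptions made for illustration only.

```python
import random
import numpy as np

K = 8          # k-mer length (assumed; must not exceed the repeat unit length)
READ_LEN = 50  # simulated read length (assumed)

def unit_kmers(repeat_unit, k=K):
    """All k-mers of the repeat unit, including those spanning two adjacent copies."""
    doubled = repeat_unit + repeat_unit
    return {doubled[i:i + k] for i in range(len(doubled) - k + 1)}

def repeat_signal(reads, repeat_unit, k=K):
    """Coverage proxy (hypothetical): number of read k-mers matching the repeat unit."""
    targets = unit_kmers(repeat_unit, k)
    return sum(read[i:i + k] in targets
               for read in reads
               for i in range(len(read) - k + 1))

def simulate_reads(left, repeat_unit, right, copies, read_len=READ_LEN):
    """Error-free reads tiling a toy genome: one read starting at every position."""
    genome = left + repeat_unit * copies + right
    return [genome[s:s + read_len] for s in range(len(genome) - read_len + 1)]

# Toy locus: random flanks and a made-up 10 bp repeat unit.
rng = random.Random(0)
left = "".join(rng.choices("ACGT", k=60))
right = "".join(rng.choices("ACGT", k=60))
unit = "ACGTTGCAAG"

# Fit the linear map (signal -> copy number) on simulated data with known copy numbers.
train_copies = [2, 3, 4, 5]
train_signals = [repeat_signal(simulate_reads(left, unit, right, c), unit)
                 for c in train_copies]
slope, intercept = np.polyfit(train_signals, train_copies, 1)

# Estimate the copy number for reads from a locus with 7 copies (70 bp > READ_LEN).
query_reads = simulate_reads(left, unit, right, 7)
estimate = slope * repeat_signal(query_reads, unit) + intercept
print(round(estimate))
```

In this toy setting the reads are uniform and error-free, so the signal grows exactly linearly with the copy number. With real WGS reads the coverage is noisy and varies between samples, so the signal would need to be normalised against the sample's overall coverage before the linear map is applied.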
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Thesis advisor: Chindelevitch, Leonid