
PRINCE: Accurate approximation of the copy number of tandem repeats

Resource type
Thesis type
(Thesis) M.Sc.
Date created
Variable-Number Tandem Repeats (VNTRs) are genomic regions where a short sequence of DNA is repeated with no space between repeats. While a fixed set of VNTRs is typically identified for a given species, the copy number at each VNTR varies between individuals within that species. Although VNTRs are found in both prokaryotic and eukaryotic genomes, the methodology called multi-locus VNTR analysis (MLVA) is widely used to distinguish different strains of bacteria, to cluster strains that might be epidemiologically related, and to investigate evolutionary rates. This thesis introduces PRINCE (Processing Reads to Infer the Number of Copies Efficiently), an algorithm that accurately estimates the copy number of a VNTR given the sequence of a single repeat unit, two short flanking sequences, and a set of short reads from a whole-genome sequencing (WGS) experiment. This is a challenging problem, especially when the repeat region is longer than the expected read length. The proposed method computes a statistical approximation of the local coverage inside the repeat region. This approximation is then mapped to the copy number using a linear function whose parameters are fitted to simulated data. PRINCE was tested on the genomes of two datasets of Mycobacterium tuberculosis strains and was shown to be more accurate than two previous methods. An implementation of PRINCE in the Python language is freely available at
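The idea of mapping a coverage-like signal to a copy number through a fitted linear function can be illustrated with a small sketch. This is not the PRINCE implementation: the k-mer counting signal, the noise-free sliding-window read simulation, the toy sequences, and every function and constant name below are assumptions made for illustration only, and the flanking sequences are used here only to build the simulated genomes.

```python
# Illustrative sketch only; NOT the PRINCE code. All names are hypothetical.

def unit_kmers(unit, k):
    """Set of all k-mers contained in a single repeat unit."""
    return {unit[i:i + k] for i in range(len(unit) - k + 1)}


def repeat_signal(reads, unit, k, depth):
    """Count read k-mers that match a repeat-unit k-mer, scaled by depth.

    `depth` is an assumed, known average per-base coverage.
    """
    kmers = unit_kmers(unit, k)
    hits = sum(
        1
        for read in reads
        for i in range(len(read) - k + 1)
        if read[i:i + k] in kmers
    )
    return hits / depth


def simulate_reads(left, unit, copies, right, read_len):
    """Noise-free reads: every window of length read_len over the locus."""
    genome = left + unit * copies + right
    return [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy sequences (made up for this example).
LEFT = "ACGTTGCAAGGCTTAACCGGATCGATTGCA"
UNIT = "GATTACAGC"
RIGHT = "TTGGCCAATCGGATACCGTTAGGCATCGAA"
READ_LEN, K = 20, 5
DEPTH = READ_LEN  # each interior base lies in READ_LEN sliding windows

# Fit the linear map on simulated loci with known copy numbers.
train_copies = [1, 2, 3, 4, 5]
train_signal = [
    repeat_signal(simulate_reads(LEFT, UNIT, c, RIGHT, READ_LEN), UNIT, K, DEPTH)
    for c in train_copies
]
slope, intercept = fit_line(train_signal, train_copies)

# Apply the fitted map to a locus that was not in the training set.
test_reads = simulate_reads(LEFT, UNIT, 7, RIGHT, READ_LEN)
estimate = slope * repeat_signal(test_reads, UNIT, K, DEPTH) + intercept
print(round(estimate))
```

In this idealized setting the signal grows exactly linearly with the copy number, so the fitted line extrapolates cleanly. Real sequencing data has uneven coverage and read errors, which is why a statistical approximation of local coverage, as described in the abstract, is needed in practice.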
Copyright statement
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Chindelevitch, Leonid
Member of collection
