Accurate alignment-free inference of microbial phylogenies

Resource type
Thesis type
(Thesis) M.Sc.
Date created
Classifying or clustering a set of genomes, or building a phylogeny on them, without the expensive computation of sequence alignment requires calculating pairwise distances under an appropriate metric. One such metric is the normalized compression distance (NCD), an approximation of the true information distance between two objects. Despite its universal applicability, the NCD has seen few applications in bioinformatics and, to the best of our knowledge, no existing tool applies it to whole-genome datasets. We introduce Sequence Non-Alignment Compression and Comparison (snacc), a pipeline specifically tailored for computing pairwise distances between genomic sequences. snacc employs the NCD with a variety of compression algorithms, alongside an integer linear programming approach for choosing between each sequence and its reverse complement. We investigate the use of snacc with five common compression algorithms and apply it to several bacterial and viral datasets with varying properties. Our results show that snacc achieves accuracy comparable to that of other metrics, improves substantially on previous NCD implementations, and can be used successfully to reconstruct microbial phylogenies. In addition, snacc is flexible enough to incorporate almost any compression algorithm in a simple manner. snacc is an open-source tool and is available at
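To make the metric in the abstract concrete, the following is a minimal sketch of the NCD in its standard form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed size of s. It is not snacc's code or interface: the names `COMPRESSORS`, `compressed_size`, `ncd` and `pairwise_ncd` are hypothetical, the three compressors are simply those in the Python standard library, and the reverse-complement selection step is left out entirely.

```python
import bz2
import lzma
import zlib
from itertools import combinations
from typing import Callable, Dict, List

Compressor = Callable[[bytes], bytes]

# Illustrative choice of compressors from the Python standard library;
# this is an assumption, not the set of algorithms evaluated in the thesis.
COMPRESSORS: Dict[str, Compressor] = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lzma.compress,
}


def compressed_size(data: bytes, compress: Compressor) -> int:
    """Return C(data): the length in bytes of the compressed input."""
    return len(compress(data))


def ncd(x: bytes, y: bytes, compress: Compressor) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    """
    cx = compressed_size(x, compress)
    cy = compressed_size(y, compress)
    cxy = compressed_size(x + y, compress)
    return (cxy - min(cx, cy)) / max(cx, cy)


def pairwise_ncd(seqs: List[bytes], compress: Compressor) -> List[List[float]]:
    """Symmetric matrix of NCD values, with the diagonal fixed at 0.0.

    Each unordered pair is compressed once, as x + y. A real compressor
    may give a slightly different value for y + x; this sketch ignores that.
    """
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = ncd(seqs[i], seqs[j], compress)
        dist[i][j] = dist[j][i] = d
    return dist
```

A matrix produced this way could be passed to any distance-based tree-building method. Because practical compressors have limited windows and block sizes, a sketch like this only behaves sensibly on inputs that fit within those limits, so whole genomes would need more care than is shown here.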
Copyright statement
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Chindelevitch, Leonid
Member of collection