Classifying, clustering, or building a phylogeny for a set of genomes without computationally expensive sequence alignment requires calculating pairwise distances under an appropriate metric. One such metric is the normalized compression distance (NCD), an approximation of the true information distance between two objects. Despite NCD's universal applicability, it has seen few applications in bioinformatics and, to the best of our knowledge, no existing tool applies NCD to whole-genome datasets. We introduce Sequence Non-Alignment Compression and Comparison (snacc), a pipeline specifically tailored for computing pairwise distances between genomic sequences. snacc employs the NCD with a variety of compression algorithms, alongside an integer linear programming approach for selecting a sequence's reverse complement. We investigate the use of snacc with five common compression algorithms and apply it to several bacterial and viral datasets with varying properties. Our results show that snacc achieves accuracy comparable to that of other metrics, demonstrates a large improvement over previous NCD implementations, and can be successfully used to reconstruct microbial phylogenies. In addition, snacc is flexible enough to incorporate almost any compression algorithm in a simple manner. snacc is an open-source tool and is available at https://github.com/SweetiePi/snacc/.
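For readers unfamiliar with the metric, the NCD of two sequences x and y under a compressor C is commonly written as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed size of s and xy is the concatenation. The following is a minimal sketch of that formula, not snacc's implementation. The function names, the choice of Python standard-library compressors (which may differ from the five algorithms studied here), and the pairwise orientation heuristic are all assumptions made for illustration. In particular, the heuristic stands in for, and is much simpler than, the integer linear programming approach mentioned above.

```python
import bz2
import lzma
import zlib

# Stand-in compressors from the standard library (an assumption; the
# thesis may use a different set of algorithms).
COMPRESSORS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

# Translation table mapping each unambiguous DNA base to its complement.
_COMPLEMENT = bytes.maketrans(b"ACGT", b"TGCA")


def reverse_complement(seq: bytes) -> bytes:
    """Return the reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def ncd(x: bytes, y: bytes, compress=zlib.compress) -> float:
    """Normalized compression distance between x and y under `compress`."""
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_best_orientation(x: bytes, y: bytes, compress=zlib.compress) -> float:
    """Hypothetical pairwise heuristic: try y in both orientations and keep
    the smaller distance. This is NOT the ILP-based selection in the text."""
    return min(ncd(x, y, compress), ncd(x, reverse_complement(y), compress))
```

A call such as `ncd(seq_a, seq_b, COMPRESSORS["lzma"])` swaps in a different compressor without changing the rest of the code, which mirrors the flexibility the abstract describes. Real compressors only approximate the ideal, so the distance of a sequence to itself is typically small but not exactly zero.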
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Chindelevitch, Leonid