Scalable mapping and compression of high throughput genome sequencing data

Date created: 
Sequence Mapping
Sequence Compression
High Throughput Sequencing
Genomics, Sequence Alignment

High-throughput sequencing (HTS) platforms generate unprecedented amounts of data, which introduces challenges for processing, downstream analysis and computational infrastructure. HTS has become an invaluable technology for many applications, such as the detection of single-nucleotide polymorphisms and structural variations. In most of these applications, mapping sequenced "reads" to their potential genomic origin is the first fundamental step for subsequent analyses. Many tools have been developed to address this problem. Because of the large amount of HTS data available, much emphasis has been placed on speed and memory usage. In fact, as HTS data grow in size, data management and storage are becoming major logistical obstacles to adopting HTS platforms. The requirement for ever-increasing monetary investment almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information, which holds most of the sequence data generated worldwide. One way to reduce the storage requirements of HTS data is compression. Currently, most HTS data is compressed with general-purpose algorithms such as gzip, which are not specifically designed for data generated by HTS platforms. Recently, a number of fast and efficient compression algorithms have been designed specifically for HTS data to address some of the issues in data management, storage and communication. In this thesis, we study both of these computational problems, sequence mapping and sequence compression, extensively. We introduce two novel methods, mrsFAST and drFAST, to map HTS short reads to the reference genome. These methods are cache-oblivious and guarantee perfect sensitivity. Both are specifically designed to address the bottleneck of multi-mapping for the purpose of structural variation detection. In addition, we present Dissect for mapping the whole transcriptome to the genome while considering structural alterations in the transcriptome. Dissect is designed specifically to map HTS long reads as well as assembled contigs. Finally, we address the storage and communication problems in HTS data by introducing SCALCE, a "boosting" scheme based on the Locally Consistent Parsing technique. SCALCE reorders the data to increase the locality of reference and thereby improves the performance of well-known compression methods in terms of both speed and space.
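
The reordering idea behind the last point can be illustrated with a minimal sketch. This is not SCALCE's implementation: Locally Consistent Parsing is replaced here by a much simpler stand-in (grouping reads by their lexicographically smallest k-mer), the reads are simulated from a random string, and all function names and parameters are hypothetical. It only shows that placing similar reads next to each other can help a general-purpose compressor such as zlib, whose match window is limited.

```python
import random
import zlib


def min_kmer(read, k=8):
    """Lexicographically smallest k-mer of a read.

    Assumption: a simple stand-in for the "core" substrings that
    Locally Consistent Parsing would select; it is NOT that technique.
    """
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def reorder_reads(reads, k=8):
    """Sort reads so that reads sharing a smallest k-mer become adjacent."""
    return sorted(reads, key=lambda r: (min_kmer(r, k), r))


def compressed_size(reads):
    """Size in bytes of the newline-joined reads after zlib compression."""
    data = "\n".join(reads).encode("ascii")
    return len(zlib.compress(data, 9))


def simulate_reads(genome_len=50000, read_len=50, n_reads=10000, seed=0):
    """Sample fixed-length reads uniformly from a random 'genome' (toy data)."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    starts = (rng.randrange(genome_len - read_len + 1) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


reads = simulate_reads()
reordered = reorder_reads(reads)

# Same reads, different order: only the layout seen by the compressor changes.
print("original order :", compressed_size(reads), "bytes")
print("reordered      :", compressed_size(reordered), "bytes")
```

In this toy setting, overlapping reads tend to share a smallest k-mer and therefore end up close together, so the compressor finds long repeated substrings inside its window. The original read order is lost, which is acceptable only when the order carries no information.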

Document type: 
Copyright remains with the author. The author granted permission for the file to be printed and for the text to be copied and pasted.
Senior supervisor: 
S. Cenk Sahinalp
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.