Resource type
Thesis type
(Thesis) M.Sc.
Date created
2013-03-19
Authors/Contributors
Author: Numanagic, Ibrahim
Abstract
High-throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for the computational infrastructure. Currently, most HTS data is compressed through general-purpose algorithms such as gzip. These algorithms are not designed for compressing data generated by HTS platforms, as they do not take advantage of the specific nature of genomic sequence data. Here we present SCALCE, a "boosting" scheme based on the Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome. Our tests indicate that SCALCE significantly improves the compression rate and running time of gzip. We also show that the reordering problem can be considered an instance of the set-cover problem, and that Locally Consistent Parsing is, in practice, as good as the best known approximation for the set-cover problem.
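The reordering idea in the abstract can be illustrated with a minimal Python sketch that places reads sharing a common substring next to each other before gzip is applied. This is not the SCALCE implementation: the `core` function below uses the lexicographically smallest k-mer as a simple stand-in for the cores that Locally Consistent Parsing would produce, and the function names and the parameter `k` are hypothetical choices made for illustration only.

```python
import gzip
from collections import defaultdict


def core(read: str, k: int = 4) -> str:
    """Hypothetical stand-in for an LCP core: the smallest k-mer of the read."""
    if len(read) < k:
        return read
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def reorder(reads, k: int = 4):
    """Group reads by their core so that similar reads become adjacent."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[core(read, k)].append(read)
    # Emit buckets in sorted core order; reads keep input order within a bucket.
    return [read for c in sorted(buckets) for read in buckets[c]]


def gzip_size(reads) -> int:
    """Size in bytes of the gzip-compressed, newline-joined reads."""
    return len(gzip.compress("\n".join(reads).encode("ascii")))


if __name__ == "__main__":
    reads = ["GGGGAAAATTTT", "TTTTCCCCGGGG", "CCCCAAAAGGGG"]
    reordered = reorder(reads)
    print(reordered)
    print(gzip_size(reads), gzip_size(reordered))
```

On real data sets, one would compare `gzip_size(reads)` with `gzip_size(reorder(reads))`; whether and by how much the reordered input compresses better depends on the data and on the choice of core, so no particular gain is claimed for this simplified sketch.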
Document
Identifier
etd7769
Copyright statement
Copyright is held by the author.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Sahinalp, Suleyman Cenk
Member of collection
Download file | Size |
---|---|
etd7769_INumanagic.pdf | 829.04 KB |