Computational methods for analysis of single molecule sequencing data

Date created: 
Computational biology
Single-molecule sequencing
Oxford Nanopore
Long read mapping
Hybrid error correction
Hybrid assembly

Next-generation sequencing (NGS) technologies paved the way for a significant increase in the number of sequenced genomes, both prokaryotic and eukaryotic. This increase provided an opportunity for considerable advancement in genomics and precision medicine. Although NGS technologies have proven their power in many applications, such as de novo genome assembly and variation discovery, computational analysis of the data they generate is still far from perfect. The main limitation of NGS technologies is their short read length relative to the lengths of common genomic repeats. Today, newer sequencing technologies, known as single-molecule sequencing (SMS), such as those from Pacific Biosciences and Oxford Nanopore, produce significantly longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. For instance, a complete human chromosome was fully assembled for the first time using ultra-long reads generated by Oxford Nanopore. Unfortunately, long reads generated by SMS technologies are characterized by a high error rate, which prevents their direct use in many standard downstream analysis pipelines and poses new computational challenges. This motivates the development of new computational tools specifically designed for SMS long reads. In this thesis, we present three computational methods tailored for SMS long reads. First, we present lordFAST, a fast and sensitive tool for mapping noisy long reads to a reference genome. Mapping sequenced reads to their potential genomic origin is the fundamental first step in many computational biology tasks. As an example, we show that lordFAST can be successfully employed in structural variation discovery. Next, we present CoLoRMap, which tackles the high rate of base-level errors in SMS long reads by providing a means to correct them using a complementary set of NGS short reads.
This integrative use of SMS and NGS data is known as a hybrid technique. Finally, we introduce HASLR, an ultra-fast hybrid assembler that uses reads generated by both technologies to efficiently generate accurate genome assemblies. We demonstrate that HASLR is not only the fastest of the tested assemblers but also the one with the lowest number of misassemblies on all samples. Furthermore, in terms of contiguity and accuracy, the generated assemblies are on par with those of the other tools on most samples.

Document type: 
This thesis may be printed or downloaded for non-commercial research and scholarly purposes. Copyright remains with the author.
Senior supervisor: 
Binay Bhattacharya
S. Cenk Sahinalp; Cedric Chauve; Faraz Hach
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.