The rapid development of high-throughput sequencing (HTS) technologies has made a considerable impact on clinical and genomics research. These technologies offer a time-efficient and cost-effective means of genotyping many pharmaceutical genes that affect drug response (also known as ADMER genes), which makes HTS a good candidate for assisting drug treatment and dosage decisions. However, challenges such as data storage and transfer, and accurate genotype inference in the presence of various structural variations, still prevent the wider integration of HTS platforms into clinical environments. To address these challenges, this thesis presents fast and efficient methods for HTS data compression and accurate ADMER genotyping.
First, we propose a novel compression technique for reference-aligned HTS data, which uses local assembly to assemble the donor genome and eliminate the redundant information about the donor present in the HTS data. Our results show that we can achieve significantly better compression rates than currently used methods, while providing fast compression speeds and random access to the compressed archives. We also present a companion benchmarking framework for evaluating the performance of different HTS compression tools in a fair and reproducible manner. In the second part, we investigate the genotyping of the CYP2D6 gene. Although this gene is involved in the metabolism of 20–25% of all clinically prescribed drugs, accurate genotype inference of CYP2D6 presents a significant challenge for various genotyping platforms due to the presence of structural rearrangements within its region. We therefore introduce the first computational tool able to accurately infer a CYP2D6 genotype from HTS data, by formulating the problem as an instance of integer linear programming.
Finally, we show how to extend the proposed algorithm to other genes that harbour similar structural rearrangements, such as CYP2A6, and to other sequencing platforms, such as PGRNseq. We demonstrate the accuracy and effectiveness of the proposed algorithms on a large set of simulated and real data samples sequenced by both Illumina and PGRNseq platforms.
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Sahinalp, S. Cenk