Computational Discovery of Splicing Events from High-Throughput Omics Data

Date created: 
Transcriptome Reconstruction
Alternative Splicing
Genomic Aberrations
High-Throughput Sequencing

Splicing, the process that forms mature messenger RNA (mRNA) by removing introns and concatenating exons, is an essential step in gene expression. It allows a single gene to produce multiple RNA isoforms, which may encode different proteins. In addition, aberrant transcripts generated by non-canonical splicing events (e.g., gene fusions) are believed to be potential drivers of many tumor types and human diseases. Thus, identification and quantification of expressed RNAs from RNA-Seq data have become fundamental steps in many clinical studies. For that reason, a number of methods have been developed. The most popular computational methods designed for these high-throughput omics data start by analyzing the datasets based on existing gene annotations. However, these tools (i) do not detect novel RNA isoforms or low-abundance transcripts; (ii) do not incorporate multi-mapping reads into their read-counting strategies for quantification; and (iii) are sensitive to sequencing artifacts. In this thesis, we address these computational problems in analyzing splicing events from high-throughput omics data. For identification and quantification of expressed RNAs from RNA-Seq data, we introduce CLIIQ, a unified framework that solves these two problems simultaneously. This framework also supports data from multiple samples to improve accuracy. To better incorporate multi-mapping reads into the framework, we design ORMAN, a combinatorial optimization formulation that resolves their mapping ambiguity by assigning a single best location to each read. For aberrant transcript detection, we present ProTIE, a computational strategy for integrative analysis of proteomic and transcriptomic data from the same individual. This strategy provides proteome-level evidence for aberrant transcripts, which can be used to eliminate false positives reported solely on the basis of sequencing data.

Document type: 
This thesis may be printed or downloaded for non-commercial research and scholarly purposes. Copyright remains with the author.
Senior supervisor: 
Cenk Sahinalp
Martin Ester
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.