Fast and accurate gene prediction by protein homology

Thesis type: (Thesis) Ph.D.
Author: She, Rong
The rapid development of genome sequencing technologies has provided scientists with an enormous and exponentially growing volume of DNA sequences. Analyzing these sequences and deducing useful knowledge from them remains challenging. One of the most important steps towards understanding genomes is gene prediction: determining the positions of genes and their components (including exons and introns) on the DNA sequence. There have been many attempts at computational gene prediction, and the methods fall into two main categories: ab initio methods and homology-based methods. Ab initio methods are usually sensitive in finding genes in novel genomes but often produce many false positives. Homology-based methods, on the other hand, usually have higher specificity but are limited to finding genes that have homologous partners. With the accumulation of genome sequences of related species, there has been a growing demand for better and faster homology-based gene prediction programs. In this thesis, I present a homology-based gene prediction framework that uses protein homology to determine the positions of protein-coding genes. A protein sequence (the product of a gene) is used as a query to find genes that are homologous to the query protein. The framework consists of two major components. First, local alignments between the query protein and the genome are assembled into gene regions where potential homologous genes are located. Next, each potential gene region is examined for gene signals, and gene models are resolved using the alignment information provided by the local alignments. Experiments on the genomes of two closely related species, Caenorhabditis elegans and Caenorhabditis briggsae, demonstrated that this method is both accurate and efficient. In particular, it runs hundreds of times faster than GeneWise, a popular homology-based gene prediction program, while being competitive in accuracy.
Experiments were also carried out on the human genome, which is much larger than the genomes of C. elegans and C. briggsae, and showed that genBlast performs similarly on it.
Copyright statement
Copyright is held by the author.
The author has not granted permission for the file to be printed nor for the text to be copied and pasted. If you would like a printable copy of this thesis, please contact
Download file Size
etd5865.pdf 1.98 MB
