Background: Orthologs (genes that have diverged after a speciation event) tend to have similar function, and so their prediction has become an important component of comparative genomics and genome annotation. The gold standard phylogenetic analysis approach of comparing available organismal phylogeny to gene phylogeny is not easily automated for genome-wide analysis; therefore, ortholog prediction for large genome-scale datasets is typically performed using a reciprocal-best-BLAST-hits (RBH) approach. One problem with RBH is that it will incorrectly predict a paralog as an ortholog when incomplete genome sequences or gene loss is involved. In addition, there is an increasing interest in identifying orthologs most likely to have retained similar function.

Results: To address these issues, we present here a high-throughput computational method named Ortholuge that further evaluates previously predicted orthologs (including those predicted using an RBH-based approach) – identifying which orthologs most closely reflect species divergence and may more likely have similar function. Ortholuge analyzes phylogenetic distance ratios involving two comparison species and an outgroup species, noting cases where relative gene divergence is atypical. It also identifies some cases of gene duplication after species divergence. Through simulations of incomplete genome data/gene loss, we show that the vast majority of genes falsely predicted as orthologs by an RBH-based method can be identified. Ortholuge was then used to estimate the number of false-positives (predominantly paralogs) in selected RBH-predicted ortholog datasets, identifying approximately 10% paralogs in a eukaryotic data set (mouse-rat comparison) and 5% in a bacterial data set (Pseudomonas putida – Pseudomonas syringae species comparison). Higher quality (more precise) datasets of orthologs, which we term "ssd-orthologs" (supporting-species-divergence-orthologs), were also constructed. These datasets, as well as Ortholuge software that may be used to characterize other species' datasets, are available at http://www.pathogenomics.ca/ortholuge/ (software under GNU General Public License).

Conclusion: The Ortholuge method reported here appears to significantly improve the specificity (precision) of high-throughput ortholog prediction for both bacterial and eukaryotic species. This method, and its associated software, will aid those performing various comparative genomics-based analyses, such as the prediction of conserved regulatory elements upstream of orthologous genes.
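The distance-ratio idea summarized in the Results can be illustrated with a minimal Python sketch. It is not the Ortholuge implementation. The abstract does not give the exact ratio definitions or the cutoff procedure, so the sketch assumes two ratios, each dividing the distance between the two comparison-species genes by the distance from one of them to the outgroup gene. It also assumes a simple stand-in rule that flags a gene when either ratio exceeds a chosen multiple of the dataset median. The function names, the `factor` parameter, and the example distances are all hypothetical.

```python
from statistics import median


def distance_ratios(d_in1_in2, d_in1_out, d_in2_out):
    """Return (ratio1, ratio2) for one putative ortholog pair.

    Assumed definitions (illustrative only):
      ratio1 = d(ingroup1, ingroup2) / d(ingroup1, outgroup)
      ratio2 = d(ingroup1, ingroup2) / d(ingroup2, outgroup)
    """
    if d_in1_out <= 0 or d_in2_out <= 0:
        raise ValueError("outgroup distances must be positive")
    return d_in1_in2 / d_in1_out, d_in1_in2 / d_in2_out


def flag_atypical(distances, factor=2.0):
    """Flag genes whose relative divergence looks atypical.

    distances: dict mapping gene id -> (d_in1_in2, d_in1_out, d_in2_out).
    A gene is flagged when either ratio exceeds `factor` times the median
    of that ratio across the dataset. This threshold is a hypothetical
    stand-in, not the cutoff procedure used by Ortholuge.
    """
    ratios = {gene: distance_ratios(*d) for gene, d in distances.items()}
    median1 = median(r[0] for r in ratios.values())
    median2 = median(r[1] for r in ratios.values())
    return {
        gene: (r[0] > factor * median1 or r[1] > factor * median2)
        for gene, r in ratios.items()
    }


# Hypothetical phylogenetic distances for four RBH-predicted pairs.
example = {
    "geneA": (0.10, 0.50, 0.52),
    "geneB": (0.12, 0.55, 0.60),
    "geneC": (0.09, 0.45, 0.47),
    "geneD": (0.60, 0.62, 0.58),  # ingroup genes nearly as far apart as the outgroup
}
flags = flag_atypical(example)
```

In this made-up data, `geneD` is the only pair whose comparison-species genes are about as diverged from each other as they are from the outgroup, which is the pattern one might expect when a paralog has been picked up in place of the true ortholog.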
BMC Bioinformatics 2006, 7:270 doi:10.1186/1471-2105-7-270
Improving the Specificity of High-Throughput Ortholog Prediction
Copyright is held by the author(s).