Iterative gene prediction and pseudogene removal improves genome annotation

Marijke J. van Baren; Michael R. Brent

doi:10.1101/gr.4766206

Laboratory for Computational Genomics, Department of Computer Science, Washington University, Saint Louis, Missouri 63130, USA

Abstract

Correct gene prediction is impaired by the presence of processed pseudogenes: nonfunctional, intronless copies of real genes found elsewhere in the genome. Gene prediction programs frequently mistake processed pseudogenes for real genes or exons, leading to biologically irrelevant gene predictions. While methods exist to identify processed pseudogenes in genomes, no attempt has been made to integrate pseudogene removal with gene prediction, or even to provide a freestanding tool that identifies such erroneous gene predictions. We have created PPFINDER (for Processed Pseudogene finder), a program that integrates several methods of processed pseudogene finding in mammalian gene annotations. We used PPFINDER to remove pseudogenes from N-SCAN gene predictions, and show that gene prediction improves substantially when gene prediction and pseudogene masking are interleaved. In addition, we used PPFINDER with gene predictions as a parent database, eliminating the need for libraries of known genes. This allows us to run the gene prediction/PPFINDER procedure on newly sequenced genomes for which few genes are known.
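The interleaving of gene prediction and pseudogene masking described above can be pictured as a simple loop. The following is only a minimal sketch under stated assumptions, not the N-SCAN or PPFINDER implementation: `predict_genes` and `find_pseudogenes` are hypothetical callables standing in for the two programs, predictions are reduced to coordinate intervals, masking is done by overwriting bases with `N`, and the stopping rule (stop when no new pseudogenes are reported, or after `max_rounds`) is an assumption that the abstract does not specify.

```python
from typing import Callable, List, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates on one sequence


def mask_intervals(sequence: str, intervals: List[Interval], mask_char: str = "N") -> str:
    """Return a copy of `sequence` with every interval overwritten by `mask_char`."""
    chars = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


def iterate_prediction(
    sequence: str,
    predict_genes: Callable[[str], List[Interval]],            # hypothetical stand-in for a gene predictor
    find_pseudogenes: Callable[[List[Interval]], List[Interval]],  # hypothetical stand-in for a pseudogene finder
    max_rounds: int = 10,                                       # assumed safety limit, not from the paper
) -> Tuple[List[Interval], Set[Interval]]:
    """Alternate prediction and pseudogene masking until no new pseudogenes appear."""
    masked: Set[Interval] = set()
    current = sequence
    genes = predict_genes(current)
    for _ in range(max_rounds):
        # Keep only pseudogene calls that have not been masked in an earlier round.
        new = [iv for iv in find_pseudogenes(genes) if iv not in masked]
        if not new:
            break
        masked.update(new)
        current = mask_intervals(current, new)
        # Re-predict on the masked sequence so the predictor can no longer use those regions.
        genes = predict_genes(current)
    # If max_rounds is exhausted, the last prediction set is returned unfiltered.
    return genes, masked
```

Passing the two programs in as callables keeps the loop independent of how either tool is actually invoked, which is left unspecified here.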

Footnotes

¹ Corresponding author. E-mail brent@cse.wustl.edu.
[Supplemental material is available online at www.genome.org. N-SCAN and PPFINDER are open source software and may be obtained from http://genes.cse.wustl.edu/.]
Article is online at http://www.genome.org/cgi/doi/10.1101/gr.4766206
- Received October 28, 2004.
- Accepted March 13, 2006.
Freely available online through the Genome Research Open Access option.
Cold Spring Harbor Laboratory Press