Filling the gap in gene prediction


Genome sequencing tells us the order of the As, Cs, Gs and Ts in a genome, but annotation programs make sense of it all by telling us where genes are and what they look like. Gene-finding programs are reasonably good at identifying protein-coding regions, but are less proficient at finding other potentially important sequences, such as cis-regulatory regions and non-coding exons, that lie upstream of the translational start site. Now, Davuluri and colleagues have filled this technical gap by developing a program that accurately recognizes promoters and first exons. Although the program was developed to annotate the human genome, the authors believe it will also prove useful for the genomes of other species.


The starting point in constructing any sequence prediction program involves 'training' the algorithm to recognize the type of sequence you want. Because most sequence annotations do not contain information about 5′ untranslated regions, the authors constructed their own data set of more than 2,000 genes for which first exons and promoters had been experimentally validated. Using these sequences, the algorithm 'learned' to recognize features ∼500 bp on either side of the first exon, defined here as the region between a promoter and the first splice-donor site. The program, called first-exon finder or FirstEF, operates by finding every potential promoter and splice-donor site and then calculating the probability that the intervening sequence is a first exon. The power of FirstEF lies in its ability to identify first exons that are associated with either CpG-rich or CpG-poor promoters, and to predict both coding and non-coding first exons. Two tests confirm the accuracy of FirstEF. When the algorithm was trained on 90% of the gene data set and then tested on the remaining 10%, it correctly predicted 84% of first exons. Its performance on the annotated genomic sequences of human chromosomes 21 and 22 (from the public consortium) was also quite impressive, whether it was asked to confirm experimentally validated first exons or to localize promoters upstream of annotated genes. FirstEF is the first, and at present the only, computational tool that can predict first exons, especially non-coding ones.
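The search strategy described above (pair every potential promoter with every potential splice-donor site, then score the sequence in between) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `donor_sites`, `candidate_first_exons`, `gc_fraction` and the `max_len` limit are hypothetical, promoter positions are assumed to be supplied by some upstream step, and GC fraction is only a placeholder for the trained probability model that FirstEF actually uses.

```python
import re
from typing import Callable, Iterable, List, NamedTuple


class Candidate(NamedTuple):
    start: int    # putative promoter / transcription start position (assumed input)
    end: int      # position of the GT donor dinucleotide; exon is seq[start:end]
    score: float  # placeholder score, not a real first-exon probability


def donor_sites(seq: str) -> List[int]:
    """Return the position of every GT dinucleotide (canonical splice-donor signal)."""
    return [m.start() for m in re.finditer(r"(?=GT)", seq)]


def gc_fraction(segment: str) -> float:
    """Toy scoring function: fraction of G and C bases in the segment."""
    return (segment.count("G") + segment.count("C")) / len(segment)


def candidate_first_exons(
    seq: str,
    promoter_positions: Iterable[int],
    score_fn: Callable[[str], float] = gc_fraction,
    max_len: int = 1000,  # hypothetical cap on first-exon length
) -> List[Candidate]:
    """Pair each promoter position with each downstream donor site and score the span."""
    seq = seq.upper()
    donors = donor_sites(seq)
    candidates = []
    for p in promoter_positions:
        for d in donors:
            if p < d <= p + max_len:
                candidates.append(Candidate(p, d, score_fn(seq[p:d])))
    # Highest-scoring candidates first
    return sorted(candidates, key=lambda c: c.score, reverse=True)


ranked = candidate_first_exons("AACGCGGTAAGT", promoter_positions=[2])
best = ranked[0]
```

In a real predictor the scoring function would be replaced by a model trained on validated first exons; the exhaustive pairing step is the only part taken directly from the description above.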


The effort of annotating the human genome is likely to continue for many more years, but FirstEF has brought bioinformatics one step closer to its goal of defining the 5′ boundaries and non-coding regions of genes. Notably, FirstEF has estimated the percentage of CpG-related first exons to be 70%, and not 50% as was previously believed. And, if you like a challenge, the authors have made FirstEF's predictions (all 68,645 of them) from the working draft of the human genome available for scrutiny.
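To make the idea of a 'CpG-related' sequence concrete, here is a small sketch of one common way to flag CpG-rich sequence: GC content above 50% together with an observed/expected CpG ratio above 0.6. These thresholds are widely used CpG-island criteria and are an assumption here; the article does not say how FirstEF itself defines CpG-related first exons. The function names `cpg_stats` and `is_cpg_related` are hypothetical, and the example sequences are made up.

```python
from typing import Tuple


def cpg_stats(seq: str) -> Tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides cannot overlap, so count() is exact
    gc = (c + g) / n
    # Expected CpG count under independence is (c * g) / n
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc, obs_exp


def is_cpg_related(seq: str, min_gc: float = 0.5, min_obs_exp: float = 0.6) -> bool:
    """Flag a sequence as CpG-rich under the assumed thresholds."""
    gc, obs_exp = cpg_stats(seq)
    return gc > min_gc and obs_exp > min_obs_exp


# Made-up sequences standing in for regions around predicted first exons
examples = ["CGCGCGCG", "ATATATAT", "CCCCGGGG"]
flags = [is_cpg_related(s) for s in examples]
fraction_cpg_related = sum(flags) / len(flags)
```

Applied to a full set of predictions, a tally of this kind is how a genome-wide percentage of CpG-related first exons could be estimated, though the exact figure depends on the definition and window size chosen.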

