
Open Access

Research Article

Global Discriminative Learning for Higher-Accuracy Computational Gene Prediction

Author Summary<p>We describe a new approach to statistical learning for sequence data that is broadly applicable to computational biology problems and that has experimentally demonstrated advantages over current hidden Markov model (HMM)-based methods for sequence analysis. The methods we describe in this paper, implemented in the CRAIG program, allow researchers to modularly specify and train sequence analysis models that combine a wide range of weakly informative features into globally optimal predictions. Our results for the gene prediction problem show significant improvements over existing ab initio gene predictors on a variety of tests, including the specially challenging ENCODE regions. Such improved predictions, particularly on initial and single exons, could benefit researchers who are seeking more accurate means of recognizing such important features as signal peptides and regulatory regions. More generally, we believe that our method, by combining the structure-describing capabilities of HMMs with the accuracy of margin-based classification methods, provides a general tool for statistical learning in biological sequences that will replace HMMs in any sequence modeling task for which there is annotated training data.</p></sec></div> <span property="dc:date" content="2007-03-16" datatype="xsd:date" rel="dc:identifier" href="http://dx.doi.org/10.1371/journal.pcbi.0030054"></span> <span property="dc:subject" content="Computational Biology"></span> <span property="dc:subject" content="Genetics and Genomics"></span> <form action=""> <input type="hidden" name="journalDisplayName" id="journalDisplayName" value="PLoS Computational Biology" /> <input type="hidden" name="crossRefPageURL" id="crossRefPageURL" value="/article/crossref/info%3Adoi%2F10.1371%2Fjournal.pcbi.0030054" /> <input type="hidden" name="metricsTabURL" id="metricsTabURL" value="/article/metrics/info%3Adoi%2F10.1371%2Fjournal.pcbi.0030054" /> <input type="hidden" name="doi" id="doi" 
value="info:doi/10.1371/journal.pcbi.0030054" /> <input type="hidden" name="articleTitleUnformatted" id="articleTitleUnformatted" value="Global%20Discriminative%20Learning%20for%20Higher-Accuracy%20Computational%20Gene%20Prediction" /> <input type="hidden" name="articlePubDate" id="articlePubDate" value="1174028400000" /> </form>
<p xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:aml="http://topazproject.org/aml/" class="authors" xpathLocation="noSelect"><span property="dc:creator">Axel Bernal</span><sup><a href="#aff1">1</a></sup><sup><a href="#cor1" class="fnoteref">*</a></sup>, <span property="dc:creator">Koby Crammer</span><sup><a href="#aff1">1</a></sup>, <span property="dc:creator">Artemis Hatzigeorgiou</span><sup><a href="#aff2">2</a></sup>, <span property="dc:creator">Fernando Pereira</span><sup><a href="#aff1">1</a></sup></p><p xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:aml="http://topazproject.org/aml/" class="affiliations" xpathLocation="noSelect"><a name="aff1" id="aff1"></a><strong>1</strong> Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America, <a name="aff2" id="aff2"></a><strong>2</strong> Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America</p><div xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:aml="http://topazproject.org/aml/" class="abstract" xpathLocation="/article[1]/front[1]/article-meta[1]/abstract[1]"><a id="abstract0" name="abstract0" toc="abstract0" title="Abstract"></a><h2 xpathLocation="noSelect">Abstract <a 
href="#top">Top</a></h2><p xpathLocation="/article[1]/front[1]/article-meta[1]/abstract[1]/p[1]">Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty in accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVM) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. 
Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns.</p> </div><div xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:aml="http://topazproject.org/aml/" class="abstract" xpathLocation="/article[1]/front[1]/article-meta[1]/abstract[2]"><a id="abstract1" name="abstract1" toc="abstract1" title="Author Summary"></a> <h2 xpathLocation="noSelect">Author Summary <a href="#top">Top</a></h2> <p xpathLocation="/article[1]/front[1]/article-meta[1]/abstract[2]/sec[1]/p[1]">We describe a new approach to statistical learning for sequence data that is broadly applicable to computational biology problems and that has experimentally demonstrated advantages over current hidden Markov model (HMM)-based methods for sequence analysis. The methods we describe in this paper, implemented in the CRAIG program, allow researchers to modularly specify and train sequence analysis models that combine a wide range of weakly informative features into globally optimal predictions. Our results for the gene prediction problem show significant improvements over existing ab initio gene predictors on a variety of tests, including the specially challenging ENCODE regions. Such improved predictions, particularly on initial and single exons, could benefit researchers who are seeking more accurate means of recognizing such important features as signal peptides and regulatory regions. 
More generally, we believe that our method, by combining the structure-describing capabilities of HMMs with the accuracy of margin-based classification methods, provides a general tool for statistical learning in biological sequences that will replace HMMs in any sequence modeling task for which there is annotated training data.</p> </div> <div xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:aml="http://topazproject.org/aml/" class="articleinfo" xpathLocation="noSelect"><p><strong>Citation: </strong>Bernal A, Crammer K, Hatzigeorgiou A, Pereira F (2007) Global Discriminative Learning for Higher-Accuracy Computational Gene Prediction. PLoS Comput Biol 3(3): e54. doi:10.1371/journal.pcbi.0030054</p><p><strong>Editor: </strong>David Haussler, University of California Santa Cruz, United States of America</p><p></p><p><strong>Received:</strong> August 16, 2006; <strong>Accepted:</strong> February 1, 2007; <strong>Published:</strong> March 16, 2007</p><p><strong>Copyright:</strong> © 2007 Bernal et al. 
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</p><p><strong>Funding:</strong> This material is based on work funded by the US National Science Foundation under ITR grants EIA 0205456 and IIS 0428193 and Career grant 0238295.</p><p><strong>Competing interests:</strong> The authors have declared that no competing interests exist.</p><p><strong>Abbreviations: </strong>CRAIG, CRF-based ab initio genefinder; CRF, conditional random fields; HMM, hidden Markov model; MIRA, Margin Infused Relaxed Algorithm; PWM, position weight matrices; SVM, support vector machines; TIS, translation initiation site; WAM, weight array model; WWAM, windowed weight array model</p><p><a name="cor1"></a>* To whom correspondence should be addressed. E-mail: <a href="mailto:abernal@seas.upenn.edu">abernal@seas.upenn.edu</a></p></div> <div xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:aml="http://topazproject.org/aml/" id="section1" xpathLocation="/article[1]/body[1]/sec[1]"><a id="s1" name="s1" toc="s1" title="Introduction"></a><h3 xpathLocation="noSelect">Introduction <a href="#top">Top</a></h3><p xpathLocation="/article[1]/body[1]/sec[1]/p[1]">Prediction of protein-coding genes in eukaryotes involves correctly identifying splice sites and translation initiation and stop signals in DNA sequences. There are two main gene prediction methods. Ab initio methods rely exclusively on intrinsic structural features of genes, such as frequent motifs in splice sites and content statistics in coding regions. 
Notable ab initio predictors include GenScan [<a href="#pcbi-0030054-b001">1</a>], Augustus [<a href="#pcbi-0030054-b002">2</a>], TigrScan/Genezilla [<a href="#pcbi-0030054-b003">3</a>], HMMGene [<a href="#pcbi-0030054-b004">4</a>], GRAPE [<a href="#pcbi-0030054-b005">5</a>], MZEF [<a href="#pcbi-0030054-b006">6</a>], and Genie [<a href="#pcbi-0030054-b007">7</a>]. Homology-based methods exploit extrinsic features derived by comparative analysis. For instance, ProCrustes [<a href="#pcbi-0030054-b008">8</a>], GeneWise, and GenomeWise [<a href="#pcbi-0030054-b009">9</a>] exploit protein or cDNA alignments, while TwinScan [<a href="#pcbi-0030054-b010">10</a>], DoubleScan [<a href="#pcbi-0030054-b011">11</a>], and NScan [<a href="#pcbi-0030054-b012">12</a>] rely on genomic DNA from related informant organisms. Extrinsic features improve the accuracy of predictions for genes with close homologs in related organisms. Krogh [<a href="#pcbi-0030054-b013">13</a>] and Mathe et al. [<a href="#pcbi-0030054-b014">14</a>] review current gene prediction methods.</p> <p xpathLocation="/article[1]/body[1]/sec[1]/p[2]">GenScan was the first gene predictor to achieve about 80% exon sensitivity and specificity in several single-gene benchmark test sets. More recent predictors have improved on GenScan's results by focusing on specific aspects of gene prediction. For example, GenScan++ improves specificity for internal exons and Augustus improves prediction accuracy on very long DNA sequences. Despite these advances, overall accuracy on chromosomal DNA, particularly in regions with low gene density (low GC content), is not yet satisfactory [<a href="#pcbi-0030054-b015">15</a>]. 
Gene-level accuracy, which is especially important for applications, is a major challenge.</p> <p xpathLocation="/article[1]/body[1]/sec[1]/p[3]">Improvements at the gene level could have a positive impact on detecting gene-related biological features such as signal peptide regions, promoters, and even 3′ UTR microRNA targets. Genes with very long introns and intergenic regions represent more than 95% of the total number of genes in most vertebrate genomes, and even a small improvement on those could be significant in practice.</p> <p xpathLocation="/article[1]/body[1]/sec[1]/p[4]">With the exception of MZEF, which uses a quadratic discriminant function to identify internal coding exons, all of the ab initio predictors mentioned above use hidden Markov models (HMMs) to combine sequence content and signal classifiers into a consistent gene structure. HMM parameters are relatively easy to interpret and to learn. Content and signal classifiers can be built effectively using a variety of machine learning and statistical sequence modeling methods. However, the combination of content and signal classifiers with the HMM gene structure model is not itself trained to maximize prediction accuracy, and the overall model does not fully account for the statistical dependencies among the features used by the various classifiers. Moreover, recent work on machine learning for structured prediction problems [<a href="#pcbi-0030054-b016">16</a>,<a href="#pcbi-0030054-b017">17</a>] suggests that global optimization of model parameters to minimize a suitable training criterion can achieve better results than separate training of the various components of a structured predictor.</p> <p xpathLocation="/article[1]/body[1]/sec[1]/p[5]">To overcome the shortcomings outlined above, our gene predictor uses a linear structure model based on conditional random fields (CRFs) [<a href="#pcbi-0030054-b017">17</a>], hence the name CRAIG (CRF-based ab initio genefinder). 
CRFs are discriminatively trained Markovian state models that learn how to combine many diverse, statistically correlated features of the input to achieve high accuracy in sequence tagging and segmentation problems. Our models are semi-Markov [<a href="#pcbi-0030054-b018">18</a>] to model more accurately the length distributions of genomic regions. For training, instead of the original conditional maximum-likelihood training objective of CRFs, we use the online large-margin MIRA (Margin Infused Relaxed Algorithm) method [<a href="#pcbi-0030054-b019">19</a>], allowing us to extend to gene prediction the advantages of large-margin learning methods such as support vector machines (SVMs) while efficiently handling very long training sequences. <a href="#pcbi-0030054-g001">Figure 1</a> presents schematically the differences in the learning process between our method and the most common generative approach for gene prediction.</p> <div class="figure" xpathLocation="/article[1]/body[1]/sec[1]/fig[1]"><a name="pcbi-0030054-g001" id="pcbi-0030054-g001" title="Click for larger image " href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.g001" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><img xpathLocation="noSelect" border="1" src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.g001&representation=PNG_S" align="left" alt="thumbnail" class="thumbnail"></a><p><strong xpathLocation="/article[1]/body[1]/sec[1]/fig[1]/label[1]"><a href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.g001" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><span 
xpathLocation="/article[1]/body[1]/sec[1]/fig[1]/label[1]">Figure 1. </span></a> <span xpathLocation="/article[1]/body[1]/sec[1]/fig[1]/caption[1]/title[1]">Learning Methods: Discriminative versus Generative</span></strong></p><p xpathLocation="/article[1]/body[1]/sec[1]/fig[1]/caption[1]/p[1]">Schematic comparison of discriminative (A) and generative (B) learning methods. In the discriminative case, all model parameters were estimated simultaneously to predict a segmentation as similar as possible to the annotation. In contrast, for generative HMM models, signal features and state features were assumed to be independent and trained separately.</p> <span xpathLocation="noSelect">doi:10.1371/journal.pcbi.0030054.g001</span><div class="clearer"></div></div><p xpathLocation="/article[1]/body[1]/sec[1]/p[6]">Our model and training method allow us to combine a rich variety of possibly overlapping genomic features and to find a global tradeoff among feature contributions that maximizes annotation accuracy. In particular, we model different types of introns according to their length, which would have been difficult to integrate in previous models. We were also able to include rich features for start and stop signals and globally balance their weights against the weights of all other model features. These advances led to significant overall improvements over the current best predictions for the most used benchmark test sets: sensitivity and specificity of initial and single exon predictions showed a relative mean increase [<a href="#pcbi-0030054-b002">2</a>] of 25.5% and 19.6%, respectively; at the gene level, the relative mean improvement was 33.9%; the relative F-score improvement on the ENCODE regions was 16.05% at the exon level. 
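</p> <p>The online large-margin training mentioned above can be illustrated with a minimal sketch of a single-constraint MIRA-style update. It assumes sparse feature vectors stored as dictionaries, a precomputed loss between the annotated and the predicted segmentation, and an optional cap on the step size. The names <code>mira_update</code>, <code>dot</code>, and <code>max_step</code> are hypothetical, and CRAIG's actual features, loss function, and semi-Markov decoder are not reproduced here.</p>

```python
from typing import Dict

FeatureVector = Dict[str, float]


def dot(weights: FeatureVector, feats: FeatureVector) -> float:
    """Sparse dot product between a weight vector and a feature vector."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def mira_update(
    weights: FeatureVector,
    gold_feats: FeatureVector,
    pred_feats: FeatureVector,
    loss: float,
    max_step: float = float("inf"),
) -> FeatureVector:
    """Return new weights after one single-constraint MIRA-style step.

    Uncapped, the step is the smallest change to `weights` that makes the
    annotated segmentation outscore the predicted one by at least `loss`.
    `max_step` is an assumed knob that limits how far one example can move
    the weights.
    """
    names = set(gold_feats) | set(pred_feats)
    # Feature difference between the annotation and the current prediction.
    delta = {n: gold_feats.get(n, 0.0) - pred_feats.get(n, 0.0) for n in names}
    sq_norm = sum(v * v for v in delta.values())
    if sq_norm == 0.0:
        # Identical feature vectors: nothing to learn from this example.
        return dict(weights)
    # How far the current margin falls short of the required loss.
    violation = loss - dot(weights, delta)
    step = min(max_step, max(0.0, violation / sq_norm))
    updated = dict(weights)
    for n, v in delta.items():
        updated[n] = updated.get(n, 0.0) + step * v
    return updated
```

<p>In this sketch the weights change only when the margin is violated, and features shared by the annotation and the prediction cancel out in the difference vector, so only the features that distinguish the two segmentations are adjusted.</p> <p>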
These improvements were in good part due to the different treatment of intronic states within the model, which in turn increased structure prediction accuracy, particularly on genes with long introns.</p> <p xpathLocation="/article[1]/body[1]/sec[1]/p[7]">Some previous gene predictors have used discriminative training to some extent. HMMGene uses a nongeneralized HMM model for gene structure, which does not include features associated with biological signals, but it is trained with the discriminative conditional maximum likelihood criterion [<a href="#pcbi-0030054-b020">20</a>]. However, conditional maximum likelihood is more difficult to optimize than our training criterion because it is required to respect conditional independence and normalization for the underlying HMM. GRAPE takes a hybrid approach for learning. It first trains parameters of a generalized HMM (GHMM) to maximize generative likelihood, and then it selects a small set of parameters that are trained to maximize the percentage of correctly predicted nucleotides, exons, and whole genes used as surrogates of the conditional likelihood. This approach is commonly used when training data is limited, and it usually provides superior results only in those cases [<a href="#pcbi-0030054-b021">21</a>]. However, the GRAPE learning method does not globally optimize the training criterion.</p> </div> <div xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:aml="http://topazproject.org/aml/" id="section2" xpathLocation="/article[1]/body[1]/sec[2]"><a id="s2" name="s2" toc="s2" title="Results"></a><h3 xpathLocation="noSelect">Results <a href="#top">Top</a></h3> <h4 xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/title[1]">Datasets</h4> <p xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/p[1]">All the experiments reported in this paper use a gene model trained on a nonredundant set of 3,038 single-gene sequences. 
We built this set by combining the Augustus training set [<a href="#pcbi-0030054-b002">2</a>], the GenScan training set, and 1,500 high-confidence CDSs from EnsMart Plus [<a href="#pcbi-0030054-b022">22</a>], which are part of the Genezilla training set (<a href="http://www.tigr.org/software/traindata.shtml">http://www.tigr.org/software/traindata.shtml</a>). We then appended simulated intergenic material to both ends of each training sequence to make up for the lack of realistic intergenic regions in the training material, as described in more detail in Methods.</p> <p xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/p[2]">We compared CRAIG with GenScan, TwinScan 2.03 (without homology features, also known as GenScan++), Genezilla (formerly known as TigrScan), and Augustus on several benchmark test sets. We also ran predictions with HMMGene, the only other publicly available genefinder to use a discriminative structure training method; we present some prediction results with it in Methods. All the programs we compare against are based on similar GHMM models with similar sequence features. Augustus uses two types of length distributions for introns: short intron lengths are modeled with an explicit distribution, but other introns use the default geometric distribution. This difference made Augustus run many times slower than the other programs in all our experiments.</p> <p xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/p[3]">We evaluated the programs on the following benchmark test sets.</p> <h5 xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/sec[1]/title[1]">BGHM953.</h5><p xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/sec[1]/p[1]">This test set combines most of the available single-gene test sets into a single set. It includes the GeneParser I (27 genes) and II (34 genes) datasets [<a href="#pcbi-0030054-b023">23</a>], 570 vertebrate sequences from Burset and Guigo [<a href="#pcbi-0030054-b024">24</a>], 178 human sequences from Guigo et al. 
[<a href="#pcbi-0030054-b025">25</a>], and 195 human, rat, and mouse sequences from Rogic et al. [<a href="#pcbi-0030054-b026">26</a>]. Repeated entries were removed. We combined different sets to obtain more reliable evaluation statistics by smoothing out possible overfitting to particular sequence types.</p> <h5 xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/sec[2]/title[1]">TIGR251.</h5><p xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/sec[2]/p[1]">This test set consists of 251 single-gene sequences, which are part of the TIGR human test dataset (<a href="http://www.tigr.org/software/traindata.shtml">http://www.tigr.org/software/traindata.shtml</a>), and it is composed mostly of long-intron genes.</p> <h5 xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/sec[3]/title[1]">ENCODE294.</h5><p xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/sec[3]/p[1]">This test set consists of 31 test regions from the ENCODE project [<a href="#pcbi-0030054-b027">27</a>,<a href="#pcbi-0030054-b028">28</a>], for a total of 21M bases, containing 294 carefully annotated alternatively spliced genes and 667 transcripts, after eliminating repeated entries and partial entries with coordinates outside the region's bounds. This is the only test set that was masked using RepeatMasker (<a href="http://www.repeatmasker.org/">http://www.repeatmasker.org/</a>) before performing gene prediction. 
<a href="#pcbi-0030054-t001">Table 1</a> gives summary statistics for the training set and the three test sets.</p> <div class="figure" xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/sec[3]/table-wrap[1]"><a name="pcbi-0030054-t001" id="pcbi-0030054-t001" title="Click for larger image " href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.t001" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><img xpathLocation="noSelect" border="1" src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.t001&representation=PNG_S" align="left" alt="thumbnail" class="thumbnail"></a><p><strong xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/sec[3]/table-wrap[1]/label[1]"><a href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.t001" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><span xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/sec[3]/table-wrap[1]/label[1]">Table 1. </span></a></strong></p><p xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/sec[3]/table-wrap[1]/caption[1]/p[1]">Dataset Statistics</p> <span xpathLocation="noSelect">doi:10.1371/journal.pcbi.0030054.t001</span><div class="clearer"></div></div><p xpathLocation="/article[1]/body[1]/sec[2]/sec[1]/sec[3]/p[2]">Predictions on all tests and for all programs—including CRAIG—allow partial genes, multiple genes per region, and genes on both strands. Alternative splicing and genes embedded within genes were not evaluated in this work. Any other program parameters were left at their default values. For each program, we used the human/vertebrate gene models provided with the software distributions. 
In all tests, sequences with noncanonical splice sites were filtered out. Accuracy numbers were computed with the eval package [<a href="#pcbi-0030054-b029">29</a>], a standard and reliable way to compare different gene predictions.</p> <h4 xpathLocation="/article[1]/body[1]/sec[2]/sec[2]/title[1]">Prediction in Single-Gene Sequences</h4> <p xpathLocation="/article[1]/body[1]/sec[2]/sec[2]/p[1]"><a href="#pcbi-0030054-t002">Table 2</a> shows prediction results for all programs on <b>BGHM953</b>. CRAIG achieved better sensitivity and specificity than the other programs at all levels, with one exception: its base sensitivity was somewhat lower than GenScan's, although its base specificity was much higher. The relative F-score improvement for initial and single exons over Genezilla, the second-best program overall for this set, was 14.6% and 5.8%, respectively. Single-exon genes were more difficult to predict for all programs, with specificity barely exceeding 50% for the best program, but CRAIG's relative improvement in sensitivity was nearly 25% over runner-up Genezilla. Terminal exon predictions were also improved over the nearest competitors, but less markedly so. The improved gene-level accuracy follows from these gains at the exon level. GenScan++ and Augustus predicted internal exons with similar accuracy and their F-scores were only slightly worse than CRAIG's, but the overall gene-level accuracy for GenScan++ looks much worse because it missed many terminal and single exons. 
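</p> <p>The accuracy measures quoted in these comparisons can be sketched as follows. This is a minimal illustration, not the eval package's implementation: it assumes exons are compared as exact coordinate tuples, uses the gene-prediction convention in which specificity is the fraction of predicted items that are correct, and takes relative improvement to be the gain divided by the baseline value, which is an assumption about how the percentages in the text were derived. The names <code>exon_accuracy</code> and <code>relative_improvement</code> are hypothetical.</p>

```python
from typing import Iterable, Tuple

# (sequence id, start, end, strand); an exon counts as correct only if
# all four fields match an annotated exon exactly.
Exon = Tuple[str, int, int, str]


def exon_accuracy(
    annotated: Iterable[Exon], predicted: Iterable[Exon]
) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, F-score) at the exon level."""
    ann, pred = set(annotated), set(predicted)
    true_pos = len(ann & pred)
    # Sensitivity: fraction of annotated exons that were predicted exactly.
    sn = true_pos / len(ann) if ann else 0.0
    # Specificity (gene-prediction usage): fraction of predictions that are right.
    sp = true_pos / len(pred) if pred else 0.0
    # F-score: harmonic mean of sensitivity and specificity.
    f = 2 * sn * sp / (sn + sp) if sn + sp > 0 else 0.0
    return sn, sp, f


def relative_improvement(new: float, baseline: float) -> float:
    """Assumed definition: gain expressed as a fraction of the baseline."""
    return (new - baseline) / baseline
```

<p>For example, with four annotated exons and four predicted exons of which two coincide exactly, this sketch gives a sensitivity, specificity, and F-score of 0.5 each.</p> <p>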
GenScan also did well in this set, but overall performance was somewhat worse than the other programs.</p> <div class="figure" xpathLocation="/article[1]/body[1]/sec[2]/sec[2]/table-wrap[1]"><a name="pcbi-0030054-t002" id="pcbi-0030054-t002" title="Click for larger image " href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.t002" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><img xpathLocation="noSelect" border="1" src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.t002&representation=PNG_S" align="left" alt="thumbnail" class="thumbnail"></a><p><strong xpathLocation="/article[1]/body[1]/sec[2]/sec[2]/table-wrap[1]/label[1]"><a href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.t002" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><span xpathLocation="/article[1]/body[1]/sec[2]/sec[2]/table-wrap[1]/label[1]">Table 2. </span></a></strong></p><p xpathLocation="/article[1]/body[1]/sec[2]/sec[2]/table-wrap[1]/caption[1]/p[1]">Accuracy Results for <b>BGHM953</b></p> <span xpathLocation="noSelect">doi:10.1371/journal.pcbi.0030054.t002</span><div class="clearer"></div></div><p xpathLocation="/article[1]/body[1]/sec[2]/sec[2]/p[2]">Most of the genes in this set have short introns and the intergenic regions are truncated, so prediction was relatively easy and all programs did relatively well. 
The next section compares performance on datasets with long-intron genes and very long intergenic regions.</p> <h4 xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/title[1]">Prediction in Long DNA Sequences</h4> <p xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/p[1]">As previously noted, <b>TIGR251</b> has many genes with very long introns, so it is expected to be harder to predict accurately. This was confirmed by the results in <a href="#pcbi-0030054-t003">Table 3</a>. Performance was worse for all programs and levels when compared with the first set. However, CRAIG consistently outperformed the other programs with an even wider performance gap than in the first experiment. Here, base and internal-exon accuracies were also substantially improved. CRAIG's relative F-score improvement for bases and internal exons over Genezilla, the second-best program in both categories, was 5.4% and 7.1%, respectively, compared with approximately 1% for <b>BGHM953</b>. Other types of exons also improved, as in the first experiment. 
Because of these better base and exon-level predictions, the relative F-score improvement over runner-up Genezilla at the gene level was about 57%.</p> <div class="figure" xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/table-wrap[1]"><a name="pcbi-0030054-t003" id="pcbi-0030054-t003" title="Click for larger image " href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.t003" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><img xpathLocation="noSelect" border="1" src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.t003&representation=PNG_S" align="left" alt="thumbnail" class="thumbnail"></a><p><strong xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/table-wrap[1]/label[1]"><a href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.t003" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><span xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/table-wrap[1]/label[1]">Table 3. </span></a></strong></p><p xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/table-wrap[1]/caption[1]/p[1]">Accuracy Results for <b>TIGR251</b></p> <span xpathLocation="noSelect">doi:10.1371/journal.pcbi.0030054.t003</span><div class="clearer"></div></div><p xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/p[2]">Our final set of experiments was on <b>ENCODE294</b>. The results are shown in <a href="#pcbi-0030054-t004">Table 4</a>. As previously mentioned, all sequences in this set were masked for low-complexity regions and repeated elements. 
Unlike previous sets, in which masking did not affect results significantly, prediction on unmasked sequences in this set was worse for all programs (unpublished data). In particular, exon and base specificity decreased an average of 8%.</p> <div class="figure" xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/table-wrap[2]"><a name="pcbi-0030054-t004" id="pcbi-0030054-t004" title="Click for larger image " href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.t004" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><img xpathLocation="noSelect" border="1" src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.t004&representation=PNG_S" align="left" alt="thumbnail" class="thumbnail"></a><p><strong xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/table-wrap[2]/label[1]"><a href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.t004" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><span xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/table-wrap[2]/label[1]">Table 4. </span></a></strong></p><p xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/table-wrap[2]/caption[1]/p[1]">Accuracy Results for <b>ENCODE294</b></p> <span xpathLocation="noSelect">doi:10.1371/journal.pcbi.0030054.t004</span><div class="clearer"></div></div><p xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/p[3]">We added a transcript-level prediction category to <a href="#pcbi-0030054-t004">Table 4</a> to better evaluate predictions on alternatively spliced genes. We closely followed the evaluation guidelines and definitions by Guigo and Reese [<a href="#pcbi-0030054-b028">28</a>]. 
There, transcript and gene-level predictions that are consistent with annotated incomplete transcripts are counted as correct, even in cases where the predictions include additional exons. We relaxed this policy to also mark as correct those predictions that contained incomplete transcripts whose first (last) exon did not begin (end) with an acceptor (donor). The reason for this change is that no program can exactly predict both ends of such transcripts. We developed our own programs to evaluate single-exon, transcript, and gene-level predictions for incomplete transcripts. Evaluations for other categories and for complete transcripts were handled directly with eval.</p> <p xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/p[4]">To ensure consistency in the evaluation, we obtained all of the programs except for Genezilla from their authors and ran them on the test set in our lab. Genezilla predictions for this set were obtained directly from the supplementary material provided by [<a href="#pcbi-0030054-b028">28</a>] so that we could measure the potential differences between our evaluation method and that reported in [<a href="#pcbi-0030054-b028">28</a>], particularly at the transcript and gene level, for which we expected different results.</p> <p xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/p[5]">Overall, our results for all programs agree with those of Guigo and Reese [<a href="#pcbi-0030054-b028">28</a>]. Genezilla's base and exon-level results using our evaluation program closely matched the published values. Transcript and gene-level results computed by our method were 1% better than the published numbers, a difference that roughly matches the percentage of incomplete annotated transcripts with no splice signals on either end. 
Computed predictions for GenScan and Augustus were also somewhat different, but not substantially so, from those reported by Guigo and Reese [<a href="#pcbi-0030054-b028">28</a>], presumably because of differences in program version and operating parameters.</p> <p xpathLocation="/article[1]/body[1]/sec[2]/sec[3]/p[6]">Improvements in this set were similar to those obtained in our second experiment. The relative F-score improvements for individual bases and internal exons were 6% over GenScan++ and 15.4% over Augustus, the runners-up in their respective categories. Improvement in prediction accuracy on single, initial, and terminal exons was similar to that for the other test sets. Transcript and gene-level accuracies were, respectively, 30% and 30.6% better than Augustus, the second-best program overall. This means that the accuracy gains obtained on the first two single-gene sequence sets carry over to chromosomal regions with multiple, alternatively spliced genes.</p> <h4 xpathLocation="/article[1]/body[1]/sec[2]/sec[4]/title[1]">Significance Testing</h4> <p xpathLocation="/article[1]/body[1]/sec[2]/sec[4]/p[1]">In all tests and at all levels, CRAIG achieved greater improvements in specificity than in sensitivity. We investigated whether the improvements in exon sensitivity achieved by CRAIG could be explained by chance. Each exon belonging to a particular test set is associated with two dependent Bernoulli random variables indicating whether it was correctly predicted by CRAIG and by the other program. We computed <i>p</i>-values with McNemar's test for dependent, paired samples from CRAIG and each of the other programs over the three test sets, as shown in <a href="#pcbi-0030054-t005">Table 5</a>. The null hypothesis was that CRAIG's advantage in exon predictions is due to chance. 
The <i>p</i>-values were <0.05 for all entries, except for the <b>TIGR251</b> experiments against Genezilla and the <b>ENCODE294</b> experiments against Genezilla and GenScan; in general, these two genefinders proved to be very sensitive at the cost of predicting many more false positives. <i>p</i>-Values for the combined test sets were all below 0.001, showing that CRAIG's advantage was extremely unlikely to be a chance event.</p> <div class="figure" xpathLocation="/article[1]/body[1]/sec[2]/sec[4]/table-wrap[1]"><a name="pcbi-0030054-t005" id="pcbi-0030054-t005" title="Click for larger image " href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.t005" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><img xpathLocation="noSelect" border="1" src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.t005&representation=PNG_S" align="left" alt="thumbnail" class="thumbnail"></a><p><strong xpathLocation="/article[1]/body[1]/sec[2]/sec[4]/table-wrap[1]/label[1]"><a href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.t005" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><span xpathLocation="/article[1]/body[1]/sec[2]/sec[4]/table-wrap[1]/label[1]">Table 5. 
</span></a></strong></p><p xpathLocation="/article[1]/body[1]/sec[2]/sec[4]/table-wrap[1]/caption[1]/p[1]">Significance Testing</p> <span xpathLocation="noSelect">doi:10.1371/journal.pcbi.0030054.t005</span><div class="clearer"></div></div><p xpathLocation="/article[1]/body[1]/sec[2]/sec[4]/p[2]">We also trained and tested an additional variant of CRAIG, in which we did not distinguish between short and long introns; this configuration corresponds closely to the state model representation used in most previous works. Following Stanke and Waack [<a href="#pcbi-0030054-b002">2</a>], we used the relative mean improvement: <br><a name="pcbi-0030054-e001" id="pcbi-0030054-e001"></a><span class="equation"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.e001&representation=PNG"></span><br>as the measure of differences in prediction accuracy between the CRAIG variant and CRAIG itself. The term Δ<i>Sn<sub>exon</sub></i> denotes the mean increase in exon sensitivity and is defined as <br><a name="pcbi-0030054-eq001" id="pcbi-0030054-eq001"></a><span class="equation"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.eq001&representation=PNG"></span><br>where <i>n<sub>t</sub></i> is the number of annotated genes in dataset <i>t, T</i> = {<b>BGHM953, TIGR251, ENCODE294</b>}, and Δ<i>Sn<sup>t</sup><sub>exon</sub></i> is the difference in exon sensitivity between CRAIG and the CRAIG variant on dataset <i>t</i>. The other terms are defined similarly. The improvement obtained by CRAIG with respect to the variant was <i>r</i> = 3.6. This result was as expected: there was an improvement in accuracy from including the extra intron state in the gene model, but even the simpler variant was more than competitive with the best current genefinders. 
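</p>
<p>The paired exon-level test described in this section can be illustrated with a minimal sketch. It assumes the exact (binomial) form of McNemar's test; the text does not state whether the exact or the chi-square form was used, and the function names and any counts passed to them are hypothetical.</p>

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(correct_a: Sequence[bool],
                      correct_b: Sequence[bool]) -> Tuple[int, int]:
    """Count exons that exactly one of the two programs predicted correctly.

    correct_a[k] / correct_b[k] say whether annotated exon k was predicted
    correctly by program A / program B (paired observations).
    """
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    return only_a, only_b


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    Exons that both programs got right, or both got wrong, carry no
    information about which program is better and are ignored.
    """
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Under the null hypothesis each discordant exon favors either program
    # with probability 1/2, so the smaller count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

<p>A call such as <code>mcnemar_exact(*discordant_counts(craig_ok, other_ok))</code>, with one Boolean per annotated exon for each program, would return the <i>p</i>-value for one program pair on one test set; the variable names here are placeholders.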
</p> </div> <div xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:aml="http://topazproject.org/aml/" id="section3" xpathLocation="/article[1]/body[1]/sec[3]"><a id="s3" name="s3" toc="s3" title="Discussion"></a><h3 xpathLocation="noSelect">Discussion</h3><p xpathLocation="/article[1]/body[1]/sec[3]/p[1]">It is well known that more gene prediction errors occur in regions with low GC content, which have higher intron and intergenic region density [<a href="#pcbi-0030054-b015">15</a>]. This behavior can also be observed in our combined results, as shown in <a href="#pcbi-0030054-g002">Figure 2</a>A. CRAIG also had the best F-score for all intron lengths. The F-scores of all predictors other than CRAIG and HMMGene were very close at all lengths. CRAIG's advantage over its nearest competitors became more apparent as introns increased in length. However, all genefinders experienced a significant drop in accuracy, of at least 25%, between 1,000 bp and 16,000 bp. 
For introns shorter than 1,000 bp, Augustus performs almost as well as CRAIG, in part because of its more complex, time-consuming model for short intron lengths.</p> <div class="figure" xpathLocation="/article[1]/body[1]/sec[3]/fig[1]"><a name="pcbi-0030054-g002" id="pcbi-0030054-g002" title="Click for larger image " href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.g002" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><img xpathLocation="noSelect" border="1" src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.g002&representation=PNG_S" align="left" alt="thumbnail" class="thumbnail"></a><p><strong xpathLocation="/article[1]/body[1]/sec[3]/fig[1]/label[1]"><a href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.g002" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><span xpathLocation="/article[1]/body[1]/sec[3]/fig[1]/label[1]">Figure 2. </span></a> <span xpathLocation="/article[1]/body[1]/sec[3]/fig[1]/caption[1]/title[1]">F-Score as a Function of Intron Length</span></strong></p><p xpathLocation="/article[1]/body[1]/sec[3]/fig[1]/caption[1]/p[1]">Results for all sets combined (A) and for individual test sets shown in subfigures (B–D). The boxed number appearing directly above each marker represents the total number of introns associated with the marker's length. 
For example, there were 1,475 introns with lengths between 1,000 and 2,000 base pairs for all sets combined (A).</p> <span xpathLocation="noSelect">doi:10.1371/journal.pcbi.0030054.g002</span><div class="clearer"></div></div><p xpathLocation="/article[1]/body[1]/sec[3]/p[2]">Intron analysis of individual test sets, as shown in <a href="#pcbi-0030054-g002">Figure 2</a>B–<a href="#pcbi-0030054-g002">2</a>D, reveals that, except for <b>ENCODE294</b>, CRAIG consistently achieved an intron F-score above 75%, even for lengths more than 30,000 bp; in contrast, the F-scores of all other programs fell to lower than 65%, even for introns as short as 8,000 bp. The results show that CRAIG predicts genes with long introns much better than the other programs. This hypothesis was also confirmed with experiments on an edited version of <b>ENCODE294</b> in which the original 31 regions were split into 271 contig sequences and all of the intergenic material was deleted except for 2,000 bp on both sides of each gene. This edited version was further subdivided into subsets with—<b>ALT_ENCODE155</b>—and without—<b>NOALT_ENCODE139</b>—alternative splicing. <a href="#pcbi-0030054-g003">Figure 3</a> shows intron prediction results for this arrangement. It can be observed that intron prediction on <b>NOALT_ENCODE139</b>, a subset of 139 genes, has the same characteristics as either <b>TIGR251</b> or <b>BGHM953</b>, that is, a rather flat F-score curve as intron length increases. 
The same cannot be said about complementary subset <b>ALT_ENCODE155</b>, whose significant drop in accuracy for long introns can be explained by the presence of alternative splicing.</p> <div class="figure" xpathLocation="/article[1]/body[1]/sec[3]/fig[2]"><a name="pcbi-0030054-g003" id="pcbi-0030054-g003" title="Click for larger image " href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.g003" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><img xpathLocation="noSelect" border="1" src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.g003&representation=PNG_S" align="left" alt="thumbnail" class="thumbnail"></a><p><strong xpathLocation="/article[1]/body[1]/sec[3]/fig[2]/label[1]"><a href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.g003" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><span xpathLocation="/article[1]/body[1]/sec[3]/fig[2]/label[1]">Figure 3. 
</span></a> <span xpathLocation="/article[1]/body[1]/sec[3]/fig[2]/caption[1]/title[1]">F-Score versus Intron Length for the ENCODE Test Set</span></strong></p><p xpathLocation="/article[1]/body[1]/sec[3]/fig[2]/caption[1]/p[1]">Results in subfigures (A) and (B) correspond to the subset of alternatively spliced genes and its complementary subset, respectively.</p> <span xpathLocation="noSelect">doi:10.1371/journal.pcbi.0030054.g003</span><div class="clearer"></div></div><p xpathLocation="/article[1]/body[1]/sec[3]/p[3]">We claimed in the Introduction that a key aspect of our model and training method is the ability to combine various genomic features and to find a global tradeoff among their contributions so that accuracy is maximized. Being able to identify introns longer than 30,000 bp with prediction accuracy comparable to that achieved on shorter introns is evidence that our program does a better job of combining features to recognize structure. Another way to see how well features have been integrated into the structure model is to examine signal predictions. It is well known that translation initiation sites (TIS) are surrounded by relatively poorly conserved sequences and are harder to predict than the highly conserved splice signals. Also, stop signals present almost no sequence conservation at all, and their prediction depends solely upon how well the last acceptor (in multi-exon genes) or the TIS (in single-exon genes) was predicted. A simple splice site classifier can therefore perform fairly well using only local sequence information, whereas TIS and stop signal classifiers are known to be much less accurate. Given these observations, we expected CRAIG to improve the most on TIS signal prediction accuracy, as all other programs examined in this work use individual classifiers for signal prediction, whereas CRAIG uses global training to compute each signal's net contribution to the gene structure. 
<a href="#pcbi-0030054-g004">Figure 4</a> shows the improvement in signal prediction accuracy for CRAIG when compared with the second-best program in each case. CRAIG shows improvement for all types of signals, but the improvement was most marked for TIS, especially in specificity. It can also be observed that the improvement on stop signals follows from the co-occurring improvement on both acceptor and TIS signals. The final outcome is that CRAIG makes fewer mistakes in deciding where to start translation and stop translation, which is one of the main reasons for its significant improvement at the gene level.</p> <div class="figure" xpathLocation="/article[1]/body[1]/sec[3]/fig[3]"><a name="pcbi-0030054-g004" id="pcbi-0030054-g004" title="Click for larger image " href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.g004" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><img xpathLocation="noSelect" border="1" src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.g004&representation=PNG_S" align="left" alt="thumbnail" class="thumbnail"></a><p><strong xpathLocation="/article[1]/body[1]/sec[3]/fig[3]/label[1]"><a href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.g004" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><span xpathLocation="/article[1]/body[1]/sec[3]/fig[3]/label[1]">Figure 4. 
</span></a> <span xpathLocation="/article[1]/body[1]/sec[3]/fig[3]/caption[1]/title[1]">Signal Accuracy Improvements</span></strong></p><p xpathLocation="/article[1]/body[1]/sec[3]/fig[3]/caption[1]/p[1]">CRAIG's relative improvements in prediction specificity (orange bar) and sensitivity (blue bar) by signal type. In each case, the second-best program was used for the comparison: Genezilla for starts, Augustus for stops, and GenScan++ for splice sites.</p> <span xpathLocation="noSelect">doi:10.1371/journal.pcbi.0030054.g004</span><div class="clearer"></div></div><p xpathLocation="/article[1]/body[1]/sec[3]/p[4]">There is great potential for including additional informative features into the model without algorithm changes, for instance, features derived from comparative genomics. To facilitate such extensions, we designed CRAIG to allow model changes without recompiling the C++ training and test code. The finite-state model, the features, and their relationships to states and transitions are all specified in a configuration file that can be changed without recompiling the program. This flexibility could be useful for learning gene models on organisms that may require a different finite-state model or a different set of features.</p> </div> <div xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:aml="http://topazproject.org/aml/" id="section4" xpathLocation="/article[1]/body[1]/sec[4]"><a id="s4" name="s4" toc="s4" title="Materials and Methods"></a><h3 xpathLocation="noSelect">Materials and Methods <a href="#top">Top</a></h3> <h4 xpathLocation="/article[1]/body[1]/sec[4]/sec[1]/title[1]">Gene structures.</h4> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[1]/p[1]">In what follows, a gene structure consists of either a single exon or a succession of alternating exons and introns, trimmed from both ends at the TIS and stop signals. 
We distinguish two different types of introns: short—980 bp or less—and long—more than 980 bp. <a href="#pcbi-0030054-g005">Figure 5</a> shows a gene finite-state model that implements these distinctions.</p> <div class="figure" xpathLocation="/article[1]/body[1]/sec[4]/sec[1]/fig[1]"><a name="pcbi-0030054-g005" id="pcbi-0030054-g005" title="Click for larger image " href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.g005" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><img xpathLocation="noSelect" border="1" src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.g005&representation=PNG_S" align="left" alt="thumbnail" class="thumbnail"></a><p><strong xpathLocation="/article[1]/body[1]/sec[4]/sec[1]/fig[1]/label[1]"><a href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.g005" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><span xpathLocation="/article[1]/body[1]/sec[4]/sec[1]/fig[1]/label[1]">Figure 5. </span></a> <span xpathLocation="/article[1]/body[1]/sec[4]/sec[1]/fig[1]/caption[1]/title[1]">Finite-State Model for Eukaryotic Genes</span></strong></p><p xpathLocation="/article[1]/body[1]/sec[4]/sec[1]/fig[1]/caption[1]/p[1]">Variable-length genomic regions are represented by states, and biological signals are represented by transitions between states. 
Short and long introns are denoted by I<sup>S</sup> and I<sup>L</sup>, respectively.</p> <span xpathLocation="noSelect">doi:10.1371/journal.pcbi.0030054.g005</span><div class="clearer"></div></div> <h4 xpathLocation="/article[1]/body[1]/sec[4]/sec[2]/title[1]">Linear structure models.</h4> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[2]/p[1]">In what follows, <b><i>x</i></b> <i>= x</i><sub>1</sub>…<i>x<sub>P</sub></i> is a sequence and <b><i>s</i></b> <i>= s<sub>1</sub>…s<sub>Q</sub></i> is a segmentation of <b><i>x</i></b>, where each segment <i>s<sub>j</sub> =</i> 〈<i>p<sub>j</sub>,l<sub>j</sub>,y<sub>j</sub></i>〉 starts at position pos(<i>s<sub>j</sub></i>) = <i>p<sub>j</sub></i>, has length len(<i>s<sub>j</sub></i>) = <i>l<sub>j</sub></i>, and has state label lab(<i>s<sub>j</sub></i>) = <i>y<sub>j</sub></i>, with <i>p<sub>j+1</sub></i> = <i>p<sub>j</sub></i> + <i>l<sub>j</sub></i> ≤ <i>P</i> and 1 ≤ <i>l<sub>j</sub></i> ≤ <i>B</i> for some empirically determined upper bound <i>B</i>. The training data <span class="capture-id" id="pcbi-0030054-ex001"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.ex001&representation=PNG" border="0"></span> consists of pairs of a sequence and its preferred segmentation. For DNA sequences, <i>x<sub>i</sub></i> ∈ Σ<sub>DNA</sub> = {A, T, G, C}, and each label lab(<i>s<sub>j</sub></i>) is one of the states of the model (<a href="#pcbi-0030054-g005">Figure 5</a>). A segment is also referred to as a genomic region; that is, an exon, an intron, or an intergenic region. </p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[2]/p[2]">A first-order Markovian linear structure model computes the score of a candidate segmentation <b><i>s</i></b> <i>= s<sub>1</sub>…s<sub>Q</sub></i> of a given input sequence <b><i>x</i></b> as a linear combination of terms for individual features of a candidate segment, the label of its predecessor, and the input sequence. 
More precisely, each proposed segment <i>s<sub>j</sub></i> is represented by a feature vector <b><i>f</i></b>(<i>s<sub>j</sub>,</i>lab(<i>s<sub>j−1</sub></i>),<b><i>x</i></b><i>)</i> ∈ ℜ<i><sup>D</sup></i> computed from the segment, the label of the previous segment, and the input sequence around position pos(<i>s<sub>j</sub></i>). A weight vector <b><i>w</i></b> ∈ ℜ<i><sup>D</sup></i>, to be learned, represents the relative weights of the features. Then, the score of candidate segmentation <b><i>s</i></b> for sequence <b><i>x</i></b> is given by <br><a name="pcbi-0030054-e002" id="pcbi-0030054-e002"></a><span class="equation"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.e002&representation=PNG"></span><br> </p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[2]/p[3]">For gene prediction, we need to answer three basic questions. First, given a sequence, <b><i>x</i></b><i>,</i> we need to efficiently find its best-scoring segmentation. Second, given a training set <span class="capture-id" id="pcbi-0030054-ex002"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.ex002&representation=PNG" border="0"></span>, we need to learn weights <b><i>w</i></b> such that the best-scoring segmentation of <b><i>x</i></b><sup>(<i>t</i>)</sup> is close to <b><i>s</i></b><sup>(<i>t</i>)</sup>. Finally, we need to select a feature function <b><i>f</i></b> that is suitable for answering the first two questions while providing good generalization to unseen test sequences. The next three subsections answer these questions. </p> <h4 xpathLocation="/article[1]/body[1]/sec[4]/sec[3]/title[1]">Inference for gene prediction.</h4> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[3]/p[1]">Let <b>GEN(<i>x</i>)</b> be the set of all possible segmentations of <b><i>x</i></b>. 
The best segmentation of <b><i>x</i></b> for weight vector <b><i>w</i></b> is given by: <br><a name="pcbi-0030054-e003" id="pcbi-0030054-e003"></a><span class="equation"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.e003&representation=PNG"></span><br> </p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[3]/p[2]">We can compute <b>ŝ</b> efficiently from <b><i>x</i></b> using the following Viterbi-like recurrence: <br><a name="pcbi-0030054-e004" id="pcbi-0030054-e004"></a><span class="equation"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.e004&representation=PNG"></span><br> </p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[3]/p[3]">It is easy to see that <span class="capture-id" id="pcbi-0030054-ex003"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.ex003&representation=PNG" border="0"></span> where GEN<i><sub>i,y</sub></i>(<b><i>x</i></b>) is the set of all segmentations of <i>x<sub>1</sub>…x<sub>i</sub></i> that end with label <i>y</i>. Therefore, <i>S<b><sub>w</sub></b></i>(<b><i>x</i></b>,<b><i>ŝ</i></b>) <i>= M</i>(<i>P</i> + 1,END), where END is a special synchronization state inserted at position <i>P</i> + 1. The actual segmentations are easily obtained by keeping back-pointers from each state-position pair (<i>y,i</i>) to its optimal predecessor (<i>y′,i−l</i>). The complexity of this algorithm is <i>O</i>(<i>PBm</i><sup>2</sup>), where <i>m</i> is the number of distinct states and <i>B</i> is the upper bound on the segment length, because the runtime of <b>w</b> · <b><i>f</i></b> is independent of <i>P, B,</i> or <i>m</i>. To reduce the constant factor from these dot product computations, most <b>w</b> ·<b> <i>f</i></b> values are precomputed and cached. 
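</p>
<p>The Viterbi-like recurrence and the back-pointer recovery described above can be illustrated with a minimal sketch. It is not CRAIG's implementation: the names <code>best_segmentation</code> and <code>seg_score</code> are hypothetical, <code>seg_score</code> stands in for the dot product <b>w</b> · <b><i>f</i></b>, positions are 0-based, and the END synchronization state is replaced by a maximum over the labels of the last segment. The three nested loops over positions, lengths, and label pairs give the <i>O</i>(<i>PBm</i><sup>2</sup>) cost.</p>

```python
import math
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[int, int, str]  # (0-based start, length, label)


def best_segmentation(
    x: Sequence[str],
    labels: List[str],
    max_len: int,
    seg_score: Callable[[int, int, str, str, Sequence[str]], float],
    start_label: str = "START",
) -> Tuple[float, List[Segment]]:
    """Best-scoring segmentation of x under a first-order segment model.

    seg_score(start, length, label, prev_label, x) is an assumed callback
    playing the role of w . f for one candidate segment.
    """
    P = len(x)
    neg_inf = -math.inf
    # M[i][y]: best score of a segmentation of x[:i] whose last label is y.
    M = [{y: neg_inf for y in labels} for _ in range(P + 1)]
    back = [{y: None for y in labels} for _ in range(P + 1)]
    for i in range(1, P + 1):
        for y in labels:
            for length in range(1, min(max_len, i) + 1):
                p = i - length  # start of the candidate last segment
                prev_labels = [start_label] if p == 0 else labels
                for y_prev in prev_labels:
                    base = 0.0 if p == 0 else M[p][y_prev]
                    if base == neg_inf:
                        continue
                    cand = base + seg_score(p, length, y, y_prev, x)
                    if cand > M[i][y]:
                        M[i][y] = cand
                        back[i][y] = (p, y_prev)
    y = max(labels, key=lambda lab: M[P][lab])
    best = M[P][y]
    if best == neg_inf:
        raise ValueError("no valid segmentation found")
    # Follow back-pointers from the end of the sequence to position 0.
    segments: List[Segment] = []
    i = P
    while i > 0:
        p, y_prev = back[i][y]
        segments.append((p, i - p, y))
        i, y = p, y_prev
    segments.reverse()
    return best, segments


def toy_score(start, length, label, prev_label, x):
    """Toy scorer: +1 per character matching the label, -1 otherwise,
    and a fixed cost of 0.5 per segment."""
    chunk = x[start:start + length]
    return sum(1.0 if ch == label.lower() else -1.0 for ch in chunk) - 0.5


score, segs = best_segmentation("aabbb", ["A", "B"], 3, toy_score)
```

<p>On the toy input, the sketch recovers one segment labeled A covering the first two characters followed by one segment labeled B covering the last three.</p>
<p>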
For introns and intergenic regions, the feature function <b><i>f</i></b> is a sum of per-nucleotide contributions, so the dynamic program in Equation 4 needs only to look at position <i>i</i> − 1 when <i>y</i> corresponds to such regions. Therefore, <i>B</i> needs to be only the upper bound for exon lengths, which was chosen following Stanke and Waack [<a href="#pcbi-0030054-b002">2</a>]. For long sequences, the complexity of the inference algorithm is therefore dominated by the sequence length <i>P</i>. </p> <h4 xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/title[1]">Online large-margin training.</h4> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/p[1]">Online learning is a simple, scalable, and flexible framework for training linear structured models. Online algorithms process one training example at a time, updating the model weights to improve the model's accuracy on that example. Large-margin classifiers, such as the well-known SVMs, provide strong theoretical classification error bounds that hold well in practice for many learning tasks. MIRA [<a href="#pcbi-0030054-b030">30</a>] is an online method for training large-margin classifiers that is easily extended to structured problems [<a href="#pcbi-0030054-b019">19</a>]. Algorithm 1 shows the pseudocode for the MIRA-based training algorithm we used for our models. For each training sequence, <b><i>x</i></b><sup>(<i>t</i>)</sup>, the algorithm seeks to establish a margin between the score of the correct segmentation and the score of the best segmentation according to the current weight vector that is proportional to the mismatch between the candidate segmentation and the correct one. 
MIRA keeps the norm of the change in weight vector as small as possible while giving the current example (<b><i>x</i></b><sup>(<i>t</i>)</sup>,<b><i>s</i></b><sup>(<i>t</i>)</sup>) a score that exceeds that of the best-scoring incorrect segmentation by a margin given by the mismatch between the correct segmentation and the incorrect one. The quadratic program in line 5 of Algorithm 1 formalizes that objective, and has a straightforward closed-form solution for this version of the algorithm. Line 11 of the algorithm computes <b><i>w</i></b> as an average of the weight vectors obtained at each iteration, which has been shown to reduce weight overfitting [<a href="#pcbi-0030054-b031">31</a>]. The training parameter <i>N</i> is determined empirically using an auxiliary development set.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/p[2]"><b>Algorithm 1.</b> Online Training Algorithm.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/p[3]">Training data <span class="capture-id" id="pcbi-0030054-ex004"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.ex004&representation=PNG" border="0"></span> . <i>L</i>(<b><i>s</i></b><i><sup>(t</sup></i><sup>)</sup>,<b><i>ŝ</i></b>) is some nonnegative real-valued function that measures the mismatch between segmentation <b><i>ŝ</i></b> and the correct segmentation <b><i>s</i></b><i><sup>(t</sup></i><sup>)</sup><b><i>.</i></b> The number of rounds <i>N</i> is determined using a small development set. 
</p> <ol class="simple"> <li><p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/list[1]/list-item[1]/p[1]">1: <b><i>w</i></b><sup>(0)</sup> = <b>0</b>; <b><i>v</i></b> = <b>0</b>; <i>i =</i> 0</p> </li> <li><p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/list[1]/list-item[2]/p[1]">2: <b>for</b> round = 1 to <i>N</i> <b>do</b></p> </li> <li><p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/list[1]/list-item[3]/p[1]">3: <b>for</b> <i>t</i> = 1 to <i>T</i> <b>do</b></p> </li> <li><p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/list[1]/list-item[4]/p[1]">4: <span class="capture-id" id="pcbi-0030054-ex005"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.ex005&representation=PNG" border="0"></span> </p> </li> <li><p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/list[1]/list-item[5]/p[1]">5: <span class="capture-id" id="pcbi-0030054-ex006"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.ex006&representation=PNG" border="0"></span> </p> </li> <li><p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/list[1]/list-item[6]/p[1]">subject to <i>S<b><sub>w′</sub></b></i>(<b><i>x</i></b><sup>(<i>t</i>)</sup><i>,</i><b><i>s</i></b><sup>(<i>t</i>)</sup>) − <i>S<b><sub>w′</sub></b></i>(<b><i>x</i></b><sup>(<i>t</i>)</sup><i>,</i><b><i>ŝ</i></b>) ≥ <i>L</i>(<b><i>s</i></b><sup>(<i>t</i>)</sup><i>,</i><b><i>ŝ</i></b>)</p> </li> <li><p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/list[1]/list-item[7]/p[1]">6: <b><i>w</i></b><sup>(<i>i</i> + 1)</sup> ← <b><i>ŵ</i></b></p> </li> <li><p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/list[1]/list-item[8]/p[1]">7: <b><i>v</i></b> ← <b><i>v</i></b> + <b><i>w</i></b><sup>(<i>i</i> + 1)</sup></p> </li> <li><p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/list[1]/list-item[9]/p[1]">8: <i>i</i> ← <i>i + 1</i></p> </li> <li><p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/list[1]/list-item[10]/p[1]">9: <b>end for</b></p> </li> <li><p 
xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/list[1]/list-item[11]/p[1]">10: <b>end for</b></p> </li> <li><p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/list[1]/list-item[12]/p[1]">11: <b><i>w</i></b> = <b><i>v</i></b> / (<i>N*T</i>)</p> </li> </ol><p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/p[4]">Successful discriminative learning depends on having training data with statistics similar to the intended test data. However, this is not the case for gene training data. The main distribution mismatch is that reliable gene annotations available for training are for the most part for single-gene sequences with very small flanking intergenic regions.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/p[5]">To address this problem, we created long training sequences composed of actual genes separated by synthetic intergenic regions as follows. For each training sequence, we generated two extra intergenic regions and appended them to both sequence ends, making sure that the total length of both flanking intergenic regions followed geometric distributions with means 5,000, 10,000, 60,000, and 150,000 bp for each of four GC content classes, respectively [<a href="#pcbi-0030054-b003">3</a>,<a href="#pcbi-0030054-b010">10</a>]. The synthetic intergenic regions were generated by sampling from GC-dependent, fourth-order interpolated Markov models (IMMs), with the same form as the models we used to score the intergenic state.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[4]/p[6]">Algorithm 1 also requires a loss function, <i>L,</i> and a small development set on which to estimate the number of rounds, <i>N</i>. As loss function, we used the correlation coefficient at the base level [<a href="#pcbi-0030054-b024">24</a>], since it combines specificity and sensitivity into a single measure. 
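</p>
<p>The update in lines 4 to 7 of Algorithm 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the CRAIG implementation: it assumes that the score is linear in the features, <code>S_w(x, s) = w · f(x, s)</code>, that <b><i>ŝ</i></b> in line 4 is the best-scoring segmentation under the current weights, and that line 5 carries only the single margin constraint listed under it. Under those assumptions the quadratic program has the standard passive-aggressive closed form. The names <code>mira_update</code>, <code>train_averaged</code>, <code>decode</code>, <code>features</code>, and <code>loss_fn</code> are hypothetical.</p>

```python
import numpy as np


def mira_update(w, feat_gold, feat_pred, loss):
    """One single-constraint MIRA step in closed form.

    Returns the weight vector closest to w for which the gold
    segmentation outscores the predicted one by at least `loss`.
    """
    delta = feat_gold - feat_pred        # f(x, s) - f(x, s_hat)
    margin = float(w @ delta)            # S_w(x, s) - S_w(x, s_hat)
    norm_sq = float(delta @ delta)
    if norm_sq == 0.0:
        return w.copy()                  # identical features: nothing to change
    tau = max(0.0, (loss - margin) / norm_sq)
    return w + tau * delta


def train_averaged(examples, decode, features, loss_fn, dim, n_rounds):
    """Averaged online training loop in the shape of Algorithm 1.

    decode, features, and loss_fn are caller-supplied (hypothetical) hooks.
    """
    w = np.zeros(dim)
    v = np.zeros(dim)
    steps = 0
    for _ in range(n_rounds):
        for x, s in examples:
            s_hat = decode(w, x)         # assumed: best-scoring segmentation
            w = mira_update(w, features(x, s), features(x, s_hat),
                            loss_fn(s, s_hat))
            v += w                       # running sum for the final average
            steps += 1
    return v / max(steps, 1)
```

<p>With a single constraint, the step size is <code>tau = max(0, (L - margin) / ||f(x, s) - f(x, ŝ)||²)</code>, so the weights change only when the correct segmentation fails to beat the prediction by the required loss.</p>
<p>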
The development set consisted of the 65 genes previously used in GenScan [<a href="#pcbi-0030054-b001">1</a>] to cross-validate splice signal detectors.</p> <h4 xpathLocation="/article[1]/body[1]/sec[4]/sec[5]/title[1]">Features.</h4> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[5]/p[1]">The final ingredient of the CRAIG model is the feature function <b><i>f</i></b> used to score candidate segments based on properties of the input sequence. A typical feature relates a proposed segment to some property of the input around that segment, and possibly to the label of the previous segment.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[5]/p[2]"><i>Properties.</i> We first introduce the basic sequence properties on which features are based. These properties are real-valued functions of the input sequence around a particular position. Some properties represent tests, taking the binary values 1 for true and 0 for false. For any test <i>P</i>, ‖<i>P</i>‖ denotes the function with value 1 if the test is true, 0 otherwise.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[5]/p[3]">The tests <br><a name="pcbi-0030054-eq002" id="pcbi-0030054-eq002"></a><span class="equation"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.eq002&representation=PNG"></span><br>check whether substring <b><i>u</i></b> occurs at position <i>i</i> ∈ <b><i>x</i></b>. For example, <b><i>x</i></b> = ATGGCGGA would have sub<sub>A</sub>(1,<b><i>x</i></b>) = 1, sub<sub>TA</sub>(2,<b><i>x</i></b>) = 0, and sub<sub>GGC</sub>(3,<b><i>x</i></b>) = 1. </p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[5]/p[4]">The property score<i><sub>y</sub></i>(<i>i,</i><b><i>x</i></b>) computes the score of a content model for state <i>y</i> at position <i>i</i>. 
This score is the probability that nucleotide <i>i</i> has label <i>y</i> according to a <i>k</i>th-order interpolated Markov model [<a href="#pcbi-0030054-b032">32</a>], where <i>k</i> = 8 for coding states and <i>k</i> = 4 for noncoding states.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[5]/p[5]">The property gcc(<i>i,</i><b><i>x</i></b>) calculates the GC composition for the region containing position <i>i</i>, averaged over a 10,000-bp window around position <i>i</i>.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[5]/p[6]">Each feature associates a property with a particular model state or state transition.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[5]/p[7]"><i>Binning.</i> Properties with multimodal or sparse distributions, such as segment length, cannot be used directly in a linear model, because their predictive effect is typically a nonlinear function of their value. To address this problem, we binned each property by splitting its range into disjoint intervals or bins, and converting the property into a set of tests that checked whether the value of the property belonged to the corresponding interval. The effect of this transformation was to pass the property through a stepwise constant nonlinearity, each step corresponding to a bin, where the height of each step was learned as the weight of a binary feature associated with the appropriate test.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[5]/p[8]">For example, following GenScan [<a href="#pcbi-0030054-b001">1</a>], we mapped the GC content property gcc to four bins: &lt;43%, 43–51%, 51–57%, and &gt;57%. For other properties, we used regular bins with a property-specific bin width. For instance, exon length was mapped to 90 bp–wide bins.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[5]/p[9]"><i>Test and feature combinations.</i> We used Boolean combinations of tests and binary features to model complex dependencies on the input. 
Conjunctions can model nucleotide correlations, for example donors of the form G<sub>−1</sub>G<sub>5</sub>, that is, donors with G at positions −1 and 5. Likewise, disjunctions were used to model consensus sequences, for example, donors of the form U<sub>3</sub>, that is, donors with either an A or a G at position 3.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[5]/p[10]">In general, for two binary functions <i>f</i> and <i>g</i>, we denoted their conjunction by <i>f</i> ∧ <i>g</i> and their disjunction by <i>f</i> ∨ <i>g</i>.</p> <h4 xpathLocation="/article[1]/body[1]/sec[4]/sec[6]/title[1]">State features.</h4> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[6]/p[1]">State features encode the content properties of the genomic regions associated to states: exons, introns, and intergenic regions. State features do not depend on the previous state, so we omitted the previous state argument in these feature definitions.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[6]/p[2]"><i>Coding/noncoding potential.</i> This feature corresponds to the log of the probability assigned to the region by the content scoring model: <br><a name="pcbi-0030054-eq003" id="pcbi-0030054-eq003"></a><span class="equation"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.eq003&representation=PNG"></span><br>where <i>μ<sub>y</sub></i> is the arithmetic mean of the distribution of log score<i><sub>y</sub></i> on the training data. For coding regions, the sum is computed over codon scores instead of base scores. Other features related to log score<i><sub>y</sub></i> also included in <b><i>f</i></b> are the coding differential and the score log-ratios between intronic and intergenic regions. </p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[6]/p[3]"><i>Phase biases.</i> Biases in intron and exon phase distributions have been found and analyzed by Fedorov et al. [<a href="#pcbi-0030054-b033">33</a>]. 
We represented possible biases with the straightforward functions <br><a name="pcbi-0030054-eq004" id="pcbi-0030054-eq004"></a><span class="equation"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.eq004&representation=PNG"></span><br>where <i>p</i> = 0,1,2 is a phase and I<i><sub>p</sub></i> is the corresponding intronic state. </p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[6]/p[4]"><i>Length distributions.</i> The length distributions of exons and introns have been extensively studied. Raw exon lengths were binned to allow our linear model to learn the length histogram from the training data. For long introns, with length >980, we used 980/len(<i>s</i>) as the length feature, whereas shorter introns used max {245/len(<i>s</i>),1}.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[6]/p[5]">For each genomic region type, we also provided length-dependent default features whose weights expressed a bias for or against regions of that length and type. The value of these features is len(s)/<i>λ<sub>y</sub></i>, where <i>λ<sub>y</sub></i> is the average length of all <i>y</i>-labeled segments. For introns and intergenic regions, we used separate, always-on default features for the four classes of GC content discussed above.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[6]/p[6]"><i>Coding composition.</i> In addition to coding potential scores, which give broad, smoothed statistics for different genomic region types, we also defined count features for each 3-gram (codon) and 6-gram (bicodon) in an exon, and similar count features for the first 15 bases (five codons) of an initial exon. The 3-gram features were further split by GC content class. 
The general form of such a feature is <br><a name="pcbi-0030054-eq005" id="pcbi-0030054-eq005"></a><span class="equation"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.eq005&representation=PNG"></span><br>where <i>p</i> = 0,1,2 is the phase, <i>u</i> is the n-gram, and <i>m</i> is the window size, which is len(<i>s</i>) for a general exon count, and min{len(<i>s</i>),15} for special initial exon features, which attempt to capture composition regularities right after the TIS. </p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[6]/p[7]"><i>Masking.</i> We represented the presence of tandem repeats and other low complexity regions in exonic segments by the function: <br><a name="pcbi-0030054-eq006" id="pcbi-0030054-eq006"></a><span class="equation"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.eq006&representation=PNG"></span><br>After training, this feature effectively penalizes any exon whose fraction of <i>N</i> occurrences exceeds 50% of its total length. 
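</p>
<p>The count and masking features described above can be sketched as follows. The equations for these features appear only as images, so the exact forms below are assumptions rather than the authors' definitions: <code>ngram_count</code> steps through the segment in codon-sized strides starting at offset <code>phase</code>, and <code>mask_feature</code> is written as a binary test at the 50% threshold mentioned above. Both function names are hypothetical.</p>

```python
def ngram_count(segment, u, phase=0, window=None):
    """Count codon-aligned occurrences of n-gram u in the first `window` bases.

    `phase` is taken here to be the offset of the first complete codon
    in the segment (an assumed convention); positions advance by 3.
    """
    m = len(segment) if window is None else min(len(segment), window)
    region = segment[:m]
    return sum(
        1
        for i in range(phase, m - len(u) + 1, 3)
        if region[i:i + len(u)] == u
    )


def mask_feature(segment, threshold=0.5):
    """Return 1 if more than `threshold` of the segment is masked (N), else 0."""
    if not segment:
        return 0
    return int(segment.count("N") / len(segment) > threshold)
```

<p>For the special initial-exon features, the window would be capped at 15 bases, for example <code>ngram_count(exon, "ATG", window=15)</code>, which matches the min{len(<i>s</i>),15} window described above.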
</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[6]/p[8]"><a href="#pcbi-0030054-t006">Table 6</a> shows all the state features associated with each segment label.</p> <div class="figure" xpathLocation="/article[1]/body[1]/sec[4]/sec[6]/table-wrap[1]"><a name="pcbi-0030054-t006" id="pcbi-0030054-t006" title="Click for larger image " href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.t006" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><img xpathLocation="noSelect" border="1" src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.t006&representation=PNG_S" align="left" alt="thumbnail" class="thumbnail"></a><p><strong xpathLocation="/article[1]/body[1]/sec[4]/sec[6]/table-wrap[1]/label[1]"><a href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.t006" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><span xpathLocation="/article[1]/body[1]/sec[4]/sec[6]/table-wrap[1]/label[1]">Table 6. </span></a></strong></p><p xpathLocation="/article[1]/body[1]/sec[4]/sec[6]/table-wrap[1]/caption[1]/p[1]">State Features for Each Segment Label</p> <span xpathLocation="noSelect">doi:10.1371/journal.pcbi.0030054.t006</span><div class="clearer"></div></div> <h4 xpathLocation="/article[1]/body[1]/sec[4]/sec[7]/title[1]">Transition features.</h4> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[7]/p[1]">Transition features look at biological signals that indicate a switch in genomic region type. Features testing for those signals looked for combinations of particular motifs within a window centered at a given offset from the position where the transition occurs. 
Features of the following form, which test for motif occurrence, are the building blocks for the transition features: <br><a name="pcbi-0030054-eq007" id="pcbi-0030054-eq007"></a><span class="equation"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.eq007&representation=PNG"></span><br>where <i>p</i> is the offset, <i>w</i> is the window width, and <i>u</i> is the motif. This feature counts the number of occurrences of <i>u</i> within <i>p</i> ± <i>w</i>/2 bases of the start of segments. </p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[7]/p[2]">In principle, all sequence positions are potential signal occurrences, but in practice one might filter out unlikely sites, using a sensitivity threshold proportional to level of signal conservation, thus decreasing decoding time.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[7]/p[3]">Burge and Karlin [<a href="#pcbi-0030054-b001">1</a>] model positional biases within signals with combinations of position weight matrices (PWMs) and their generalizations, weight array models (WAMs) and windowed weight array models (WWAMs), with very good results. It is straightforward to define these models as sets of features based on our motif<i><sub>p,u,w</sub></i> feature, as shown here in the WWAM case: <br><a name="pcbi-0030054-e005" id="pcbi-0030054-e005"></a><span class="equation"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.e005&representation=PNG"></span><br> </p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[7]/p[4]">PWMs and WAMs are special cases of WWAMs and can thus be defined by PWM<i><sub>q,r</sub></i> = WWAM<sub>1,1,<i>q,r</i></sub> and WAM<i><sub>q,r</sub></i> = WWAM<sub>1,2,<i>q,r</i></sub>. 
This means that we can use all of these techniques to model biological signals in CRAIG with the added advantage of having all signal model parameters trained as part of the gene structure.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[7]/p[5]">Correlations between two positions within a signal are captured by conjunctions of motif features. For example, the feature conjunction <br><a name="pcbi-0030054-eq008" id="pcbi-0030054-eq008"></a><span class="equation"><img src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.eq008&representation=PNG"></span><br>would be 1 whenever there is an A in position −3 and a T in position 2 relative to a donor signal occurring at position 156 in <b><i>x</i></b>, and 0 otherwise. </p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[7]/p[6]">We can also extend the feature conjunction operator to sets of features: if <i>A</i> and <i>B</i> are sets of features, such as the WWAM defined in Equation 5, we can define the set of features <i>A</i> ∧ <i>B</i> = {<i>f</i> ∧ <i>g</i> : <i>f</i> ∈ <i>A</i>, <i>g</i> ∈ <i>B</i>}.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[7]/p[7]">In Equation 5, if we use the amino acid alphabet Σ<sub>AA</sub> instead of Σ<sub>DNA</sub>, and work with codons instead of single nucleotides, we can model signal peptide regions. If we sum over disjunctions of motif features, we can easily model consensus sequences.</p> <p xpathLocation="/article[1]/body[1]/sec[4]/sec[7]/p[8]"><a href="#pcbi-0030054-t007">Table 7</a> shows the motif feature sets for each biological signal. 
The parameters required by each feature type were either taken from the literature [<a href="#pcbi-0030054-b001">1</a>] or found by search on the development set.</p> <div class="figure" xpathLocation="/article[1]/body[1]/sec[4]/sec[7]/table-wrap[1]"><a name="pcbi-0030054-t007" id="pcbi-0030054-t007" title="Click for larger image " href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.t007" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><img xpathLocation="noSelect" border="1" src="/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.0030054.t007&representation=PNG_S" align="left" alt="thumbnail" class="thumbnail"></a><p><strong xpathLocation="/article[1]/body[1]/sec[4]/sec[7]/table-wrap[1]/label[1]"><a href="/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.0030054&imageURI=info:doi/10.1371/journal.pcbi.0030054.t007" onclick="window.open(this.href,'plosSlideshow','directories=no,location=no,menubar=no,resizable=yes,status=no,scrollbars=yes,toolbar=no,height=600,width=850');return false;"><span xpathLocation="/article[1]/body[1]/sec[4]/sec[7]/table-wrap[1]/label[1]">Table 7. </span></a></strong></p><p xpathLocation="/article[1]/body[1]/sec[4]/sec[7]/table-wrap[1]/caption[1]/p[1]">Transition Features per Signal Type</p> <span xpathLocation="noSelect">doi:10.1371/journal.pcbi.0030054.t007</span><div class="clearer"></div></div><p xpathLocation="/article[1]/body[1]/sec[4]/sec[7]/p[9]">We included an additional feature set, motivated by previous work [<a href="#pcbi-0030054-b002">2</a>], to learn splice site information from sequences that contain only intron annotations. For any donor (acceptor), we first counted the number of similar donors (acceptors) in a given list of introns. 
A signal was considered to be similar to another if the Hamming distance between them was at most 1. The features were induced by a logarithmic binning function applied over the total number of similarity counts for Hamming distances 0 and 1.</p> </div> <div xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:aml="http://topazproject.org/aml/" xpathLocation="noSelect"><a id="ack" name="ack" toc="ack" title="Acknowledgments"></a><h3 xpathLocation="noSelect">Acknowledgments <a href="#top">Top</a></h3> <p xpathLocation="/article[1]/back[1]/ack[1]/p[1]">We thank Aaron Mackey for advice on evaluation methods, datasets, and software.</p> </div><div xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:aml="http://topazproject.org/aml/" class="contributions"><a id="authcontrib" name="authcontrib" toc="authcontrib" title="Author Contributions"></a><h3 xpathLocation="noSelect">Author Contributions <a href="#top">Top</a></h3><p xpathLocation="noSelect"><span class="capture-id">AB, AH, and FP conceived and designed the experiments. AB performed the experiments and analyzed the data. AB and FP wrote the paper. AB, KC, and FP contributed ideas to the model and algorithms and refined and implemented the algorithms. 
FP proposed the initial idea.</span></p></div><div xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:aml="http://topazproject.org/aml/" xpathLocation="noSelect"><a id="references" name="references" toc="references" title="References"></a><h3 xpathLocation="noSelect">References <a href="#top">Top</a></h3><ol class="references" xpathLocation="noSelect"><li xpathLocation="noSelect"><a name="pcbi-0030054-b001" id="pcbi-0030054-b001"></a><span class="authors">Burge CB, Karlin S</span> (1998) Finding the genes in genomic DNA. Curr Opin Struct Biol 8: 346–354. <a class="find" href="/article/findArticle.action?author=Burge&title=Finding the genes in genomic DNA."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b002" id="pcbi-0030054-b002"></a><span class="authors">Stanke M, Waack S</span> (2003) Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19(Supplement 2): II215–II225. <a class="find" href="/article/findArticle.action?author=Stanke&title=Gene prediction with a hidden Markov model and a new intron submodel."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b003" id="pcbi-0030054-b003"></a><span class="authors">Majoros WH, Pertea M, Salzberg SL</span> (2004) TigrScan and GlimmerHMM: Two open source ab initio eukaryotic genefinders. Bioinformatics 20: 2878–2879. <a class="find" href="/article/findArticle.action?author=Majoros&title=TigrScan and GlimmerHMM: Two open source ab initio eukaryotic genefinders."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b004" id="pcbi-0030054-b004"></a><span class="authors">Krogh A</span> (1997) Two methods for improving performance of an HMM and their application for gene finding. Proc Int Conf Intell Syst Mol Biol 5: 179–186. 
<a class="find" href="/article/findArticle.action?author=Krogh&title=Two methods for improving performance of an HMM and their application for gene finding."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b005" id="pcbi-0030054-b005"></a><span class="authors">Majoros WH, Salzberg SL</span> (2004) An empirical analysis of training protocols for probabilistic genefinders. BMC Bioinformatics 5: 206. <a class="find" href="/article/findArticle.action?author=Majoros&title=An empirical analysis of training protocols for probabilistic genefinders."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b006" id="pcbi-0030054-b006"></a><span class="authors">Zhang MQ</span> (1997) Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc Natl Acad Sci U S A 94: 565–568. <a class="find" href="/article/findArticle.action?author=Zhang&title=Identification of protein coding regions in the human genome by quadratic discriminant analysis."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b007" id="pcbi-0030054-b007"></a><span class="authors">Kulp D, Haussler D, Reese MG, Eeckman FH</span> (1996) A generalized hidden Markov model for the recognition of human genes in DNA. Proc Int Conf Intell Syst Mol Biol 4: 134–142. <a class="find" href="/article/findArticle.action?author=Kulp&title=A generalized hidden Markov model for the recognition of human genes in DNA."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b008" id="pcbi-0030054-b008"></a><span class="authors">Gelfand MS, Mironov AA, Pevzner PA</span> (1996) Gene recognition via spliced sequence alignment. Proc Natl Acad Sci U S A 93: 9061–9066. 
<a class="find" href="/article/findArticle.action?author=Gelfand&title=Gene recognition via spliced sequence alignment."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b009" id="pcbi-0030054-b009"></a><span class="authors">Birney E, Clamp M, Durbin R</span> (2004) Genewise and genome wise. Genome Res 14: 988–995. <a class="find" href="/article/findArticle.action?author=Birney&title=Genewise and genome wise."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b010" id="pcbi-0030054-b010"></a><span class="authors">Korf I, Flicek P, Duan D, Brent MR</span> (2001) Integrating genomic homology into gene structure. Bioinformatics 17(Supplement 1): S140–S148. <a class="find" href="/article/findArticle.action?author=Korf&title=Integrating genomic homology into gene structure."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b011" id="pcbi-0030054-b011"></a><span class="authors">Meyer IM, Durbin R</span> (2002) Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics 18: 1309–1318. <a class="find" href="/article/findArticle.action?author=Meyer&title=Comparative ab initio prediction of gene structures using pair HMMs."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b012" id="pcbi-0030054-b012"></a><span class="authors">Gross SS, Brent MR</span> (2005) Using multiple alignments to improve gene prediction. J Comput Biol 13: 379–393. <a class="find" href="/article/findArticle.action?author=Gross&title=Using multiple alignments to improve gene prediction."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b013" id="pcbi-0030054-b013"></a><span class="authors">Krogh A</span> (1998) Gene finding: Putting the parts together. In: Bishop M, editor. Guide to human genome computing. San Diego: Academic Press. pp. 261–274. 
</li><li xpathLocation="noSelect"><a name="pcbi-0030054-b014" id="pcbi-0030054-b014"></a><span class="authors">Mathe C, Sagot MF, Schiex T, Rouze P</span> (2002) Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 30: 4103–4117. <a class="find" href="/article/findArticle.action?author=Mathe&title=Current methods of gene prediction, their strengths and weaknesses."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b015" id="pcbi-0030054-b015"></a><span class="authors">Flicek P, Keibler E, Hu P, Korf I, Brent MR</span> (2003) Leveraging the mouse genome for gene prediction in human: From whole-genome shotgun reads to a global synteny map. Genome Res 13: 46–54. <a class="find" href="/article/findArticle.action?author=Flicek&title=Leveraging the mouse genome for gene prediction in human: From whole-genome shotgun reads to a global synteny map."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b016" id="pcbi-0030054-b016"></a><span class="authors">Rätsch G, Sonnenburg S, Srinivasan J, Witte H, Müller KR, et al. </span> (2007) Improving the <span class="genus-species">C. elegans</span> genome annotation using machine learning. PLoS Comput Biol 3: e20. <a class="find" href="/article/findArticle.action?author=R%C3%A4tsch&title=Improving the C. elegans genome annotation using machine learning."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b017" id="pcbi-0030054-b017"></a><span class="authors">Lafferty J, McCallum A, Pereira F</span> (2001) Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Danyluk A, editor. Proceedings of the Eighteenth International Conference on Machine Learning. San Francisco: Morgan Kaufmann. pp. 282–289. 
</li><li xpathLocation="noSelect"><a name="pcbi-0030054-b018" id="pcbi-0030054-b018"></a><span class="authors">Sarawagi S, Cohen WW</span> (2005) Semi-Markov conditional random fields for information extraction. In: Saul LK, Weiss Y, Bottou L, editors. Adv in Neur Inf Proc Syst 17. Cambridge (Massachusetts): MIT Press. pp. 1185–1192. </li><li xpathLocation="noSelect"><a name="pcbi-0030054-b019" id="pcbi-0030054-b019"></a><span class="authors">Crammer K, Dekel O, Keshet J, Shalev-Shwartz S, Singer Y</span> (2006) Online passive–aggressive algorithms. J Machine Learning Res 7: 551–585. <a class="find" href="/article/findArticle.action?author=Crammer&title=Online passive%E2%80%93aggressive algorithms."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b020" id="pcbi-0030054-b020"></a><span class="authors">Juang B, Rabiner L</span> (1990) Hidden Markov models for speech recognition. Technometrics 33: 251–272. <a class="find" href="/article/findArticle.action?author=Juang&title=Hidden Markov models for speech recognition."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b021" id="pcbi-0030054-b021"></a><span class="authors">Raina R, Shen Y, Ng AY, McCallum A</span> (2004) Classification with hybrid generative/discriminative models. In: Thrun S, Saul LK, Schölkopf B, editors. Adv in Neur Inf Proc Syst 16. Cambridge (Massachusetts): MIT Press. pp. 545–552. </li><li xpathLocation="noSelect"><a name="pcbi-0030054-b022" id="pcbi-0030054-b022"></a><span class="authors">Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, et al. </span> (2004) Ensmart: A generic system for fast and flexible access to biological data. Genome Res 14: 160–169. 
<a class="find" href="/article/findArticle.action?author=Kasprzyk&title=Ensmart: A generic system for fast and flexible access to biological data."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b023" id="pcbi-0030054-b023"></a><span class="authors">Snyder EE, Stormo GD</span> (1995) Identification of protein coding regions in genomic DNA. J Mol Biol 248: 1–18. <a class="find" href="/article/findArticle.action?author=Snyder&title=Identification of protein coding regions in genomic DNA."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b024" id="pcbi-0030054-b024"></a><span class="authors">Burset M, Guigo R</span> (1996) Evaluation of gene structure prediction programs. Genomics 34: 353–357. <a class="find" href="/article/findArticle.action?author=Burset&title=Evaluation of gene structure prediction programs."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b025" id="pcbi-0030054-b025"></a><span class="authors">Guigo R, Agarwal P, Abril JF, Burset M, Fickett JW</span> (2000) An assessment of gene prediction accuracy in large DNA sequences. Genome Res 10: 1631–1642. <a class="find" href="/article/findArticle.action?author=Guigo&title=An assessment of gene prediction accuracy in large DNA sequences."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b026" id="pcbi-0030054-b026"></a><span class="authors">Rogic S, Mackworth AK, Ouellette FB</span> (2001) Evaluation of gene-finding programs on mammalian sequences. Genome Res 11: 817–832. <a class="find" href="/article/findArticle.action?author=Rogic&title=Evaluation of gene-finding programs on mammalian sequences."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b027" id="pcbi-0030054-b027"></a><span class="authors">ENCODE Project Consortium</span> (2004) The ENCODE (Encyclopedia of DNA Elements) project. Science 306: 636–640. 
<a class="find" href="/article/findArticle.action?author=&title=The ENCODE (Encyclopedia of DNA Elements) project."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b028" id="pcbi-0030054-b028"></a><span class="authors"> Guigo R, Reese MG, editors. </span> (2006) Egasp '05: Encode genome annotation assessment project. Genome Biology. 7 Supplement 1 </li><li xpathLocation="noSelect"><a name="pcbi-0030054-b029" id="pcbi-0030054-b029"></a><span class="authors">Keibler E, Brent MR</span> (2003) Eval: A software package for analysis of genome annotations. BMC Bioinformatics 4: 50. <a class="find" href="/article/findArticle.action?author=Keibler&title=Eval: A software package for analysis of genome annotations."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b030" id="pcbi-0030054-b030"></a><span class="authors">Crammer K</span> (2004) Online learning of complex categorical problems. Jerusalem: Hebrew University. </li><li xpathLocation="noSelect"><a name="pcbi-0030054-b031" id="pcbi-0030054-b031"></a><span class="authors">Collins M</span> (2002) Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. Proceedings of Conference on Empirical Methods in Natural Language Processing. EMNLP 2002. pp. 1–8. </li><li xpathLocation="noSelect"><a name="pcbi-0030054-b032" id="pcbi-0030054-b032"></a><span class="authors">Salzberg SL, Delcher A, Kasif S, White O</span> (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res 26: 544–548. 
<a class="find" href="/article/findArticle.action?author=Salzberg&title=Microbial gene identification using interpolated Markov models."> Find this article online </a></li><li xpathLocation="noSelect"><a name="pcbi-0030054-b033" id="pcbi-0030054-b033"></a><span class="authors">Fedorov A, Fedorova L, Starshenko V, Filatov V, Grigor'ev E</span> (1998) Influence of exon duplication on intron and exon phase distribution. J Mol Evol 46: 263–271. <a class="find" href="/article/findArticle.action?author=Fedorov&title=Influence of exon duplication on intron and exon phase distribution."> Find this article online </a></li></ol></div> </div> </div> <div style="display:none"> <div dojoType="ambra.widget.ContextAction" id="ContextActionDialog" class="contextActionDialog"> <div class="dialog context"> <div class="tipu" id="caTipu"></div> <div class="contextActionContent"> 
The following must be avoided: <ul> <li>Remarks that could be interpreted as allegations of misconduct</li> <li>Unsupported assertions or statements</li> <li>Inflammatory or insulting language</li> </ul> <form name="contextActionForm" id="contextActionForm" class="clearfix buttons" method="post" action=""> <input type="button" name="Continue" value="Continue" id="ContextActionDialogContinueButton" onmouseup="ambra.displayAnnotationContext.startComment(event);" title="Add a note to this text" class="primary"/> <input type="button" name="Cancel" value="Cancel" id="ContextActionDialogCancelButton" onclick="return false;" onmouseup="ambra.displayAnnotationContext.cancelContext(event);" title="Close this Window"/> </form> </div> <div class="tip" id="caTip"></div> </div> </div> <div dojoType="ambra.widget.ContextAction" id="ContextActionDialogNotLogged" class="contextActionDialog"> <div class="dialog context"> <div class="tipu" id="canlTipu"></div> <div class="contextActionContent"> <h5><img src="/images/tooltip_addannotation.gif" /> Add a note to this text.</h5> You must be logged in to add a note to an article. You may log in by <a onmousedown="ambra.displayAnnotationContext.disconnect(event);" href="/user/secure/secureRedirect.action?goTo=%2Farticle%2Finfo%3Adoi%2F10.1371%2Fjournal.pcbi.0030054">clicking here</a> or <a href="#" onclick="return false;" onmouseup="ambra.displayAnnotationContext.cancelContext(event);">cancel this note</a>. </div> <div class="tip" id="canlTip"></div> </div> </div> <div dojoType="ambra.widget.ContextAction" id="ContextActionDialogBadSelection" class="contextActionDialog"> <div class="dialog context"> <div class="tipu" id="canBDTipu"></div> <div class="contextActionContent"> <h5 class="annotation icon"><img src="/images/tooltip_addannotation.gif" /> Add a note to this text.</h5> You cannot annotate this area of the document. 
<a href="#" onclick="return false;" onmouseup="ambra.displayAnnotationContext.cancelContext(event);">Close</a> </div> <div class="tip" id="canBDTip"></div> </div> </div> <div dojoType="ambra.widget.ContextAction" id="ContextActionDialogBadRangeSelection" class="contextActionDialog"> <div class="dialog context"> <div class="tipu" id="canbrTipu"></div> <div class="contextActionContent"> <h5><img src="/images/tooltip_addannotation.gif" /> Add a note to this text.</h5> You cannot create an annotation that spans different sections of the document; please adjust your selection.<br/> <a href="#" onclick="return false;" onmouseup="ambra.displayAnnotationContext.cancelContext(event);">Close</a> </div> <div class="tip" id="canbrTip"></div> </div> </div> <div dojoType="ambra.widget.RegionalDialog" id="CommentDialog" style="padding:0;margin:0;"> <div class="dialog preview"> <div class="tipu" id="cTipu"></div> <div class="btn close" id="btn_close" title="Click to close"><a title="Click to close">Close</a></div> <div id="cmtContainer" class="comment"> <h6 id="viewCmtTitle"></h6> <div class="detail" id="viewCmtDetail"></div> <div class="contentwrap" id="viewComment"></div> <div class="contentwrap" id="viewCIStatement"></div> <div class="detail" id="viewLink"> <!--<a href="#" class="commentary icon" title="Click to view full thread and respond">View all responses</a> <a href="#" class="respond tooltip" title="Click to respond to this posting">Respond to this</a>--> </div> </div> <div class="tip" id="cTip"></div> </div> </div> <div dojoType="ambra.widget.RegionalDialog" id="CommentDialogMultiple" style="padding:0;margin:0;"> <div class="dialog multiple preview"> <div class="tipu" id="mTipu"></div> <div class="btn close" id="btn_close_multi" title="Click to close"><a title="Click to close">Close</a></div> <ol id="multilist"></ol> <br/> <div id="multidetail"></div> <div class="tip" id="mTip"></div> </div> </div> <div dojoType="dijit.Dialog" id="Rating"> <div class="dialog annotate"> 
<div class="tipu" id="dTipu"></div> <div class="comment"> <h5><span class="commentPublic">Rate This Article</span></h5> <div class="instructions">Please follow our <a href="/static/ratingGuidelines.action">guidelines for rating</a> and review our <a href="/static/competing.action">competing interests policy</a>. Comments that do not conform to our guidelines will be promptly removed and the user account disabled. The following must be avoided: <ol> <li>Remarks that could be interpreted as allegations of misconduct</li> <li>Unsupported assertions or statements</li> <li>Inflammatory or insulting language</li> </ol> </div> <div class="posting pane"> <form name="ratingForm" id="ratingForm" method="post" action=""> <input type="hidden" name="articleURI" value="info:doi/10.1371/journal.pcbi.0030054" /> <input type="hidden" name="commentTitle" id="commentTitle" value="" /> <input type="hidden" name="comment" id="commentArea" value="" /> <input type="hidden" name="ciStatement" id="statementArea" value="" /> <input type="hidden" name="isCompetingInterest" id="isCompetingInterest" value="" /> <fieldset> <legend>Compose Your Annotation</legend> <span id="submitRatingMsg" class="error" style="display:none;"></span> <table class="layout"> <tr> <td rowspan="2"> <label for="insight">Insight</label> <ul class="star-rating rating edit" title="Rate insight" id="rateInsight"> <li class="current-rating pct0"></li> <li><a href="javascript:void(0);" title="Bland" class="one-star" onclick="ambra.rating.setRatingCategory(this, 'insight', 1);">1</a></li> <li><a href="javascript:void(0);" title="" class="two-stars" onclick="ambra.rating.setRatingCategory(this, 'insight', 2);">2</a></li> <li><a href="javascript:void(0);" title="" class="three-stars" onclick="ambra.rating.setRatingCategory(this, 'insight', 3);">3</a></li> <li><a href="javascript:void(0);" title="" class="four-stars" onclick="ambra.rating.setRatingCategory(this, 'insight', 4);">4</a></li> <li><a href="javascript:void(0);" 
title="Profound" class="five-stars" onclick="ambra.rating.setRatingCategory(this, 'insight', 5);">5</a></li> </ul> <input type="hidden" name="insight" title="insight" value="" /> <label for="reliability">Reliability</label> <ul class="star-rating rating edit" title="Rate reliability" id="rateReliability"> <li class="current-rating pct0"></li> <li><a href="javascript:void(0);" title="Tenuous" class="one-star" onclick="ambra.rating.setRatingCategory(this, 'reliability', 1);">1</a></li> <li><a href="javascript:void(0);" title="" class="two-stars" onclick="ambra.rating.setRatingCategory(this, 'reliability', 2);">2</a></li> <li><a href="javascript:void(0);" title="" class="three-stars" onclick="ambra.rating.setRatingCategory(this, 'reliability', 3);">3</a></li> <li><a href="javascript:void(0);" title="" class="four-stars" onclick="ambra.rating.setRatingCategory(this, 'reliability', 4);">4</a></li> <li><a href="javascript:void(0);" title="Unassailable" class="five-stars" onclick="ambra.rating.setRatingCategory(this, 'reliability', 5);">5</a></li> </ul> <input type="hidden" name="reliability" title="reliability" value="" /> <label for="style">Style</label> <ul class="star-rating rating edit" title="Rate style" id="rateStyle"> <li class="current-rating pct0"></li> <li><a href="javascript:void(0);" title="Crude" class="one-star" onclick="ambra.rating.setRatingCategory(this, 'style', 1);">1</a></li> <li><a href="javascript:void(0);" title="" class="two-stars" onclick="ambra.rating.setRatingCategory(this, 'style', 2);">2</a></li> <li><a href="javascript:void(0);" title="" class="three-stars" onclick="ambra.rating.setRatingCategory(this, 'style', 3);">3</a></li> <li><a href="javascript:void(0);" title="" class="four-stars" onclick="ambra.rating.setRatingCategory(this, 'style', 4);">4</a></li> <li><a href="javascript:void(0);" title="Elegant" class="five-stars" onclick="ambra.rating.setRatingCategory(this, 'style', 5);">5</a></li> </ul> <input type="hidden" name="style" 
title="style" value="" /> <label for="cTitle" class="commentPublic"><span class="none">Enter your comment title</span><!-- error message text <em>A title is required for all public annotations</em>--></label> <input type="text" name="cTitle" id="cTitle" value="Enter your comment title..." class="title commentPublic" alt="Enter your comment title..." /> <label for="cArea"><span class="none">Enter your comment</span><!-- error message text <em>Please enter your annotation</em>--></label> <textarea name="cArea" id="cArea" value="Enter your comment..." alt="Enter your comment...">Enter your comment...</textarea> </td> <td rowspan="2"> </td> <td class="coi"> <fieldset> <legend>Declare any competing interests.</legend> <ul> <li><label><input id="isCompetingInterestNo" type="radio" name="competingInterest" value="false" /> No, I don't have any competing interests to declare.</label></li> <li><label><input id="isCompetingInterestYes" type="radio" name="competingInterest" value="true" /> Yes, I have competing interests to declare (enter below):</label></li> </ul> <textarea name="ciStatementArea" id="ciStatementArea" disabled value="Enter your competing interests..." title="Enter your competing interests...">Enter your competing interests...</textarea> </fieldset> </td> </tr> <tr> <td class="buttons"> <input type="button" value="Cancel" title="Click to close and cancel" id="btn_cancel_rating"/> <input type="button" value="Submit" title="Click to post your annotation publicly" id="btn_post_rating" class="primary"/> </td> </tr> </table> </fieldset> </form> </div> </div> </div> </div> <div dojoType="ambra.widget.LoadingCycle" id="LoadingCycle" class="loadingCycler"> <img src="/images/loading.gif" width="58" height="58" title="Loading..." 
/> </div> </div> </div> <!-- end : main contents --> </div> <!-- end : container --> <!-- begin : footer --> <div id="ftr"> <p><span>All site content, except where otherwise noted, is licensed under a <a href="http://creativecommons.org/licenses/by/2.5/" title="Creative Commons Attribution License 2.5" tabindex="200">Creative Commons Attribution License</a>.</span></p> <ul> <li><a href="/static/privacy.action" title="PLoS Privacy Statement" tabindex="501">Privacy Statement</a></li> <li><a href="/static/terms.action" title="PLoS Terms of Use" tabindex="502">Terms of Use</a></li> <li><a href="http://www.plos.org/advertise/" title="Advertise With PLoS" tabindex="503">Advertise</a></li> <li><a href="http://www.plos.org/journals/embargopolicy.html" title="PLoS Embargo Policy" tabindex="504">Media Inquiries</a></li> <li><a href="http://www.plos.org/journals/print.html" title="PLoS in Print" tabindex="505">PLoS in Print</a></li> <li><a href="/static/sitemap.action" title="Site Map" tabindex="506">Site Map</a></li> <li><a href="http://www.plos.org" title="PLoS.org" tabindex="507">PLoS.org</a></li> </ul> <div class="powered"> <ul> <li><a href="/static/releaseNotes.action" title="Ambra | Release Notes">Ambra 0.9.4 beta</a></li> <li>Managed Colocation provided by <a href="http://www.unitedlayer.com/" title="UnitedLayer: Built on IP Services">UnitedLayer</a>.</li> </ul> </div> </div> <!-- end : footer --> <script type="text/javascript"> var _namespace=""; var loggedIn = false; var almHost = "http://alm.plos.org"; // Safari v3.1.1 "console.debug" issue (http://trac.dojotoolkit.org/ticket/6849) workaround if (/3[\.0-9]+ Safari/.test(navigator.appVersion)) { window.console = { origConsole: window.console, log: function(s){ this.origConsole.log(s); }, info: function(s){ this.origConsole.info(s); }, error: function(s){ this.origConsole.error(s); }, warn: function(s){ this.origConsole.warn(s); } }; } var djConfig = { // don't debug for IE - as dojo's firebug lite module is error 
prone in IE isDebug: false, parseOnLoad: true }; </script> <script type="text/javascript" src="/javascript/dojo/dojo/dojo.js"></script> <script type="text/javascript" src="/javascript/dojo/dojo/ambra.js"></script> <script type="text/javascript" src="/javascript/init_global.js"></script> <script type="text/javascript" src="/javascript/init_article.js"></script> <script type="text/javascript" src="/javascript/init_ratings.js"></script> <script type="text/javascript" src="/javascript/init_article_body.js"></script> <script type="text/javascript" src="/javascript/init_article_rhc.js"></script> <script type="text/javascript" src="/javascript/alm.js"></script> <script type="text/javascript" src="/javascript/reporting/articleViewsCumulative.js"></script> <script type="text/javascript"> var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www."); document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E")); </script> <script type="text/javascript"> var pageTracker = _gat._getTracker("UA-338393-1"); pageTracker._trackPageview(); pageTracker._setDomainName("www.ploscompbiol.org"); </script> </body> </html>