TY  - JOUR
T1  - Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning
A1  - Rätsch, Gunnar
A1  - Sonnenburg, Sören
A1  - Srinivasan, Jagan
A1  - Witte, Hanh
A1  - Müller, Klaus-R
A1  - Sommer, Ralf-J
A1  - Schölkopf, Bernhard
Y1  - 2007/02/23
N2  - Author Summary: Eukaryotic genes contain introns, which are intervening sequences that are excised from a gene transcript with the concomitant ligation of flanking segments called exons. The process of removing introns is called splicing. It involves biochemical mechanisms that to date are too complex to be modeled comprehensively and accurately. However, abundant sequencing results can serve as a blueprint database exemplifying what this process accomplishes. Using this database, we employ discriminative machine learning techniques to predict the mature mRNA given the unspliced pre-mRNA. Our method utilizes support vector machines and recent advances in label sequence learning, originally developed for natural language processing. The system, called mSplicer, was trained and evaluated on the genome of the nematode C. elegans, a well-studied model organism. We were able to show that mSplicer correctly predicts the splice form in most cases. Surprisingly, our predictions on currently unconfirmed genes deviate considerably from the public genome annotation. It is hypothesized that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation and additional sequencing results show the superiority of mSplicer's predictions. It is concluded that the annotation of nematode and other genomes can be greatly enhanced using modern machine learning.
JF  - PLOS Computational Biology
JA  - PLOS Computational Biology
VL  - 3
IS  - 2
UR  - https://doi.org/10.1371/journal.pcbi.0030020
SP  - e20
EP  - 
PB  - Public Library of Science
M3  - doi:10.1371/journal.pcbi.0030020
ER  - 