Conceived and designed the experiments: JD MEW. Performed the experiments: SR JDC MEW. Analyzed the data: SR JDC MEW. Contributed reagents/materials/analysis tools: SR JDC JD MEW. Wrote the paper: MEW SR JDC JD.
The authors have declared that no competing interests exist.
Recent studies have noted extensive inconsistencies in gene start sites among orthologous genes in related microbial genomes. Here we provide the first documented evidence that imposing gene start consistency improves the accuracy of gene start-site prediction. We applied an algorithm using a genome majority vote (GMV) scheme to increase the consistency of gene starts among orthologs. We used a set of validated
The genetic code tells us precisely how a DNA sequence will be translated into a protein. However, it is more difficult to identify where translation will start and stop in the entire length of an organism's genome sequence. Computer software can predict where the start sites are, and this is successful most of the time; however, errors do occur. We hypothesized that some errors might be corrected by comparing predictions for the genome sequences of closely related organisms. This correction scheme seems especially appropriate for bacterial genomes: not only is protein production in bacteria simpler than in higher organisms, but hundreds of bacterial DNA sequences are now available, and many of these are closely related. To test the hypothesis, we developed a method to detect whether a gene's start site is inconsistent with the majority of equivalent genes in a set of related bacterial genomes. The method then modifies the start if it can be made consistent with the majority of genomes. Our tests show this majority vote method improves the accuracy of gene start sites. Application of the method to existing bacterial genomes should eliminate many inconsistencies and correct a large number of errors.
All of genomics depends on accurate identification of coding regions. Most gene boundaries are predicted using computational methods, and only a tiny fraction have been verified experimentally. Unfortunately, the accuracy of current gene-finding algorithms is not perfect. Error rates for the most common algorithms—Glimmer3
Gene-prediction error is a well-recognized problem
Here, we test this idea using a set of validated
Gene-finding programs need to evaluate several possible start sites for each gene. The programs occasionally make mistakes and pick the wrong start site. If mistakes for orthologous genes in different genomes are uncorrelated (an unrealistic assumption, but useful as a reference point for algorithm development), then, as long as fewer than half of the predicted starts are wrong, the errors can in principle be corrected by a majority vote.
To formulate the majority vote idea mathematically, consider a set of
The probability of finding at least one error among all of the orthologs is
For example, let the probability of predicting an incorrect start site be
The above model illustrates that typical gene prediction error rates can lead to double-digit inconsistencies in ortholog sets (e.g. in the above case, a 5% error rate led to a 22.6% inconsistency rate). It also illustrates the ability of a majority vote to decrease errors and thereby increase accuracy through increasing consistency. The increase in accuracy requires that the error rate for a single gene start be less than 50%. This prerequisite is satisfied by modern gene calling software, for which reported error rates range from 1.5% to 17.6%
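The reference model can be checked numerically. The sketch below is a minimal illustration, assuming independent errors with the same probability `p` in each of `n` orthologs; the function names are ours, not part of the GMV software. With `p = 0.05` and `n = 5` it reproduces the 22.6% inconsistency rate quoted above. The second function treats every wrong prediction as a vote against the correct start, which is a simplification.

```python
from math import comb


def p_inconsistent(p: float, n: int) -> float:
    """Probability that at least one of n independent start predictions is wrong."""
    return 1.0 - (1.0 - p) ** n


def p_majority_wrong(p: float, n: int) -> float:
    """Probability that more than half of n independent predictions are wrong.

    Simplifying assumption: all wrong predictions count against the correct
    start, whether or not they agree with one another.
    """
    k_min = n // 2 + 1  # smallest number of errors that forms a strict majority
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(k_min, n + 1))


if __name__ == "__main__":
    # Per-gene error rate of 5% in a set of 5 orthologs.
    print(f"inconsistency rate: {p_inconsistent(0.05, 5):.3f}")
    print(f"majority wrong:     {p_majority_wrong(0.05, 5):.5f}")
```

Under these assumptions, a per-gene error rate of 5% leaves a majority of wrong predictions in only about 0.1% of 5-genome ortholog sets, which is the sense in which a vote can raise accuracy.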
Although the above model gives clear and quantitative insight into how comparative genomics might improve the accuracy of gene maps, it is merely a reference point and does not consider the important effects of real biological variation and correlated errors. In our previous study of gene start site consistency in the
The GMV algorithm works as follows. For a given set of orthologous genes, if the positions of the start sites already coincide in a multiple sequence alignment, they are accepted. If they do not coincide, a start position is sought which is consistent for the majority of the genes and for which there is a reasonable alternative start site for the remaining genes in the set. If such a position is found, it is accepted, and the predictions are changed for the outlying genes. Otherwise, no start site prediction is made for the ortholog set.
We implemented GMV in the pipeline illustrated in
Individual steps A–E are explained in the text (
Any gene-calling software can in principle be used as a front end to provide input gene calls to GMV. Here we used Prodigal
Because Prodigal gene maps are likely to be more accurate than most of the gene maps currently in GenBank
To evaluate the performance of the algorithm, we created eight genome test sets that varied in size and diversity (Supplementary
Among the eight test sets, 5.9% to 61.8% of the ortholog sets had inconsistencies. The majority vote rule improved consistency for 13.2% to 51.9% of the ortholog sets (
Genome set size | 5 genomes | 5 genomes | 5 genomes | 5 genomes | 10 genomes | 10 genomes | 10 genomes | 10 genomes |
Diversity | Low | Medium | High | Very High | Low | Medium | High | Very High |
Total # of ortholog sets generated in the pipeline | 3633 | 2446 | 1414 | 988 | 3271 | 2133 | 1317 | 380 |
# of ortholog sets for which Prodigal starts were initially inconsistent | 213 (5.9%) | 536 (21.9%) | 574 (40.6%) | 547 (55.4%) | 251 (7.7%) | 614 (28.8%) | 634 (48.1%) | 235 (61.8%) |
# of ortholog sets for which Prodigal starts were already consistent | 3420 (94.1%) | 1910 (78.1%) | 840 (59.4%) | 441 (44.6%) | 3020 (92.3%) | 1519 (71.2%) | 683 (51.9%) | 145 (38.2%) |
# of inconsistent ortholog sets that were made consistent by GMV | 74 (34.7%) | 278 (51.9%) | 204 (35.5%) | 74 (16.8%) | 89 (35.5%) | 286 (46.6%) | 227 (35.8%) | 31 (13.2%) |
# of ortholog sets with consistent starts after GMV | 3494 (96.2%) | 2188 (89.5%) | 1044 (73.8%) | 515 (52.1%) | 3109 (95.0%) | 1805 (84.6%) | 910 (69.1%) | 176 (46.3%) |
# of ortholog sets with at least one consistent start | 3626 (99.8%) | 2428 (99.3%) | 1326 (93.8%) | 863 (87.3%) | 3269 (99.9%) | 2098 (98.4%) | 1215 (92.3%) | 310 (81.6%) |
The genomes in each set are listed in Supplementary
Percentage is with respect to total # of ortholog sets generated in the pipeline.
Percentage is with respect to # of ortholog sets for which Prodigal starts were initially inconsistent.
The maximum level of consistency that could theoretically be imposed ranged from 81.6% to 99.9% (
Among the ortholog sets revised by GMV,
Genome set size | | 5 genomes | 5 genomes | 10 genomes | 10 genomes |
Codon before change | Codon after change | Medium | High | Medium | High |
| | 243 | 184 | 354 | 249 |
| | 47 | 26 | 66 | 41 |
| | 16 | 10 | 33 | 26 |
| | 31 | 15 | 42 | 22 |
| | 8 | 5 | 5 | 7 |
| | 0 | 1 | 0 | 1 |
| | 9 | 10 | 14 | 11 |
| | 3 | 1 | 7 | 5 |
| | 0 | 0 | 1 | 1 |
Total changes | | 357 | 252 | 522 | 363 |
Same codon | | 251 | 189 | 360 | 257 |
Different codon | | 106 | 63 | 162 | 106 |
Before applying GMV, we first note evidence of an association between consistency and accuracy of gene start sites among orthologs. Among ortholog sets with consistent start sites, the
Genome set size | 5 genomes | 5 genomes | 5 genomes | 5 genomes | 10 genomes | 10 genomes | 10 genomes | 10 genomes |
Diversity | Low | Medium | High | Very High | Low | Medium | High | Very High |
# of ortholog sets for which | 833 | 683 | 457 | 274 | 800 | 618 | 414 | 129 |
# of ortholog sets for which | 825 (99.0%) | 613 (89.8%) | 382 (83.6%) | 245 (89.4%) | 787 (98.4%) | 546 (88.3%) | 329 (79.5%) | 107 (82.9%) |
# of ortholog sets for which | 8 (0.96%) | 70 (10.2%) | 75 (16.4%) | 29 (10.6%) | 13 (1.63%) | 72 (11.7%) | 85 (20.5%) | 22 (17.1%) |
# of ortholog sets with start sites matching a validated | 799 (95.9%) | 664 (97.2%) | 444 (97.2%) | 271 (98.9%) | 769 (96.1%) | 602 (97.4%) | 406 (98.1%) | 126 (97.7%) |
# of ortholog sets with start sites matching a validated | 792 (96.0%) | 609 (99.3%) | 381 (99.7%) | 245 (100%) | 760 (96.6%) | 544 (99.6%) | 328 (99.7%) | 107 (100%) |
# of ortholog sets with start sites matching a validated | 7 (87.5%) | 55 (78.6%) | 63 (84.0%) | 26 (89.7%) | 9 (69.2%) | 58 (80.6%) | 78 (91.8%) | 19 (86.3%) |
Percentage is with respect to total # of ortholog sets.
Percentage is with respect to # of ortholog sets for which
Percentage is with respect to # of ortholog sets for which
The GMV pipeline corrected the most errors when applied to the high and medium diversity test sets (
Number of correct and incorrect changes are estimated using validated starts in
Genome set size | 5 genomes | 5 genomes | 5 genomes | 5 genomes | 10 genomes | 10 genomes | 10 genomes | 10 genomes |
Diversity | Low | Medium | High | Very High | Low | Medium | High | Very High |
# of ortholog sets with an incorrect | 34 | 19 | 13 | 3 | 31 | 16 | 8 | 3 |
# of corrected validated starts in | 0 | 9 | 11 | 3 | 1 | 8 | 7 | 3 |
# of | 1 | 2 | 2 | 1 | 0 | 0 | 0 | 0 |
Error Rate ( | 1.00 | 0.182 | 0.154 | 0.25 | 0.5 | 0.111 | 0.125 | 0.25 |
Sensitivity ( | 0 | 0.474 | 0.846 | 1.0 | 0.032 | 0.5 | 0.875 | 1.00 |
Total # of changes in | 13 | 51 | 41 | 12 | 9 | 38 | 21 | 4 |
Total # of changes in all genomes | 92 | 357 | 252 | 88 | 169 | 522 | 363 | 40 |
Total # of changes that agree with a validated start | 9 | 76 | 82 | 31 | 20 | 114 | 126 | 28 |
Total # of changes that disagree with a validated start | 4 | 7 | 6 | 1 | 4 | 15 | 0 | 0 |
Estimated by adding one additional false positive to obtain a nonzero value.
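The Error Rate and Sensitivity rows of the table above can be recomputed from its first three rows. The sketch below is an illustration only: the row labels are incomplete here, so the two definitions are inferred from the tabulated values (they reproduce all eight columns), and the third row is assumed to count incorrect changes. The function and variable names are ours.

```python
# Values copied from the table above: four 5-genome sets followed by four
# 10-genome sets, each ordered Low, Medium, High, Very High diversity.
initial_errors = [34, 19, 13, 3, 31, 16, 8, 3]    # incorrect starts before GMV
corrected = [0, 9, 11, 3, 1, 8, 7, 3]             # starts corrected by GMV
incorrect_changes = [1, 2, 2, 1, 0, 0, 0, 0]      # assumed meaning of the third row


def error_rate(n_corrected: int, n_incorrect: int) -> float:
    """Fraction of GMV changes that were wrong (inferred definition).

    Following the table footnote, a zero count of incorrect changes is
    replaced by one so that the estimate is nonzero.
    """
    n_incorrect = max(n_incorrect, 1)
    return n_incorrect / (n_corrected + n_incorrect)


def sensitivity(n_corrected: int, n_initial_errors: int) -> float:
    """Fraction of initially incorrect starts that GMV corrected (inferred definition)."""
    return n_corrected / n_initial_errors


error_rates = [round(error_rate(c, w), 3) for c, w in zip(corrected, incorrect_changes)]
sensitivities = [round(sensitivity(c, e), 3) for c, e in zip(corrected, initial_errors)]

if __name__ == "__main__":
    print(error_rates)
    print(sensitivities)
```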
To estimate the broader impact of GMV, we calculated the increase in consistency for the medium and high diversity genome test sets with either 5 or 10 genomes and then applied these rates to 39 genera. For each test set we obtained the number of genes
Genome set size | 5 genomes | 5 genomes | 10 genomes | 10 genomes |
Diversity | Medium | High | Medium | High |
Maximum possible # ortholog sets, | 4282 | 4332 | 4151 | 3710 |
Ortholog set yield, | 57.1% | 32.6% | 51.3% | 35.4% |
Increase in consistency after applying GMV, | 11.4% | 14.4% | 13.4% | 17.2% |
Calculated as percentage of
Calculated as a percentage of the actual number of ortholog sets by subtracting the third row from the fifth row of
We identified suitable target genomes from a list of finished microbial genomes from the Integrated Microbial Genomes resource (
Although the precise values of
The impact of GMV estimated with Prodigal gene maps is likely to be conservative because Prodigal gene predictions (for orthologs) tend to be more consistent than extant GenBank data. By “GenBank data”, we mean the owner-approved or “curated” maps that are accessed by default in GenBank. As described previously
Comparison | Quantity | 5 genomes, medium diversity | 5 genomes, high diversity |
Prodigal vs. GenBank | # of shared ortholog sets | 2289 | 1234 |
Prodigal vs. GenBank | # of ortholog sets for which Prodigal starts were initially inconsistent | 455 | 413 |
Prodigal vs. GenBank | # of ortholog sets for which GenBank starts were initially inconsistent | 925 | 311 |
Prodigal vs. GenBank | # made consistent by Prodigal | 552 | 50 |
Prodigal vs. GenBank | # made consistent by GMV | 194 | 47 |
Prodigal vs. Glimmer3 | # of shared ortholog sets | 2427 | 1398 |
Prodigal vs. Glimmer3 | # of ortholog sets for which Prodigal starts were initially inconsistent | 532 | 566 |
Prodigal vs. Glimmer3 | # of ortholog sets for which GenBank starts were initially inconsistent | 869 | 767 |
Prodigal vs. Glimmer3 | # made consistent by Prodigal | 432 | 248 |
Prodigal vs. Glimmer3 | # made consistent by GMV | 193 | 155 |
The above conservative estimate suggests applying our pipeline could significantly increase consistency of GenBank gene maps, with Prodigal accounting for ¾ and GMV accounting for ¼ of the total impact. We also obtained an alternative, less conservative estimate of the impact by modifying the GMV algorithm to preferentially use the gene calls already in GenBank as opposed to new Prodigal gene calls. In the modified algorithm, if the GenBank start sites already coincide in a multiple sequence alignment, or if a majority of these start sites do not align, nothing is done. Otherwise, if a majority of the GenBank start sites coincide, an alternative, consistent Prodigal start site is sought in the minority genomes. If one is found, then in the minority genomes the GenBank start sites are replaced with the consistent Prodigal start sites. Applying this algorithm to the same 2,289 ortholog sets that were common to GenBank maps and Prodigal maps in the 5-genome, medium diversity
To project the broader impact of the GMV method on the accuracy of Prodigal gene maps, we first calculated a correction rate for the medium and high diversity genome test sets with either 5 or 10 genomes and then applied this rate to 39 suitable genera. The correction rate
The projection above applies to Prodigal gene maps; an assessment for existing GenBank gene maps is also desired. Accuracy can be directly measured for organisms, like
It is conceivable that the impact will be lower for new gene maps obtained from recent improvements in annotation pipelines. Newer maps may include information from servers such as MaGe
Ours is one of several approaches to leveraging multiple genomes for improving gene predictions. Numerous methods have used conservation patterns in pairwise sequence alignments to distinguish coding from non-coding regions in eukaryotic (SLAM
RAST
N-SCAN
A shared feature of prior approaches is that the multiple genomes are input at the front end and are used to develop a tightly integrated gene prediction model. By contrast, the GMV algorithm is run as a post-processing step. The main disadvantage of this design is the additional compute time required to refine gene calls: running GMV on a 5-genome set takes about half a day on a single-processor machine. The compute time is dominated by the BLAST step, which scales as the square of the number of genomes; however, the BLAST step (and all other steps of the pipeline) can be sped up substantially by parallel processing. A major advantage of GMV is that it can be coupled to any gene prediction software, so long as a list of alternative start sites is provided. Besides giving great flexibility in applications, the modular nature of GMV allowed us to treat it as an error correction method, which provided a well-controlled means of evaluating its performance.
The GMV algorithm dramatically decreases inconsistencies in the location of predicted gene start sites, and is projected to eliminate thousands of inconsistencies in currently sequenced microbial genomes, facilitating comparative genomics studies. At the same time, it is capable of correcting hundreds of errors in sets of 5–10 genomes and is potentially capable of correcting more than 10,000 errors in microbial gene maps. Moreover, GMV provides a straightforward solution to the challenging problem of improving gene start site predictions using more than two genomes. Overall, GMV is a simple and logical solution that resolves inconsistencies and increases the accuracy of gene maps.
Genome sets were selected with the aid of a bacterial phylogenetic tree (Benjamin McMahon, personal communication). The tree was derived from over 400 bacterial genomes, which were downloaded from NCBI (completed) and JGI (draft) in June 2009. The amino acid sequences of the β and β′ subunits of RNA polymerase were extracted from each genome and concatenated. An initial multiple sequence alignment was calculated using MUSCLE
The genome sets used for testing GMV are listed in Supplementary
A) 5-genome test set; B) 10-genome test set.
# Genomes, Diversity | Median Sequence Identity |
5, Low | 99.3% |
5, Medium | 85.2% |
5, High | 71.6% |
5, Very High | 64.4% |
10, Low | 98.8% |
10, Medium | 82.5% |
10, High | 69.5% |
10, Very High | 51.3% |
The sequence identity used is the minimum value among all gene pairs in each ortholog set. The percentage value is normalized using sequence length information (
The GMV algorithm was implemented in an automated pipeline to predict consistent start sites, illustrated in
In this step, gene predictions are made for each genome in the set using Prodigal (we used version 1.10 here, which is no longer available; the distributed pipeline uses Prodigal versions 2.00–2.50)
In this step, alternative start sites for each gene in each genome are obtained from the Prodigal output files.
In this step, gene predictions from Step B are used to derive ortholog sets by a pan-reciprocal best hit approach using BLASTP (version 2.2.20) with default settings
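One way to picture the ortholog-set step is sketched below. It is an illustration under stated assumptions, not the pipeline's code: we assume that "pan-reciprocal" means every pair of genes in a set must be reciprocal best hits, and that the BLASTP output has already been reduced to a best-hit lookup. The `BestHits` structure and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical input: best[(query_genome, target_genome)][query_gene] is the
# best-scoring hit of query_gene in target_genome, precomputed from BLASTP.
BestHits = Dict[Tuple[str, str], Dict[str, str]]


def is_rbh(best: BestHits, ga: str, a: str, gb: str, b: str) -> bool:
    """True if gene a (genome ga) and gene b (genome gb) are each other's best hit."""
    return best.get((ga, gb), {}).get(a) == b and best.get((gb, ga), {}).get(b) == a


def pan_rbh_sets(genomes: List[str], best: BestHits) -> List[Dict[str, str]]:
    """Return ortholog sets with one gene per genome, all pairs reciprocal best hits."""
    ref, others = genomes[0], genomes[1:]
    ref_genes = sorted({gene for other in others for gene in best.get((ref, other), {})})
    ortholog_sets = []
    for gene in ref_genes:
        members = {ref: gene}
        for other in others:
            hit = best.get((ref, other), {}).get(gene)
            if hit is None:
                break  # no hit in this genome, so no complete set
            members[other] = hit
        else:
            # Seeded from the reference genome; now require every pair to agree.
            if all(
                is_rbh(best, ga, members[ga], gb, members[gb])
                for ga, gb in combinations(genomes, 2)
            ):
                ortholog_sets.append(members)
    return ortholog_sets
```

Seeding from one reference genome keeps the sketch short; because every pair is checked afterwards, the choice of reference does not change which complete sets pass.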
Multiple sequence alignment is performed for each ortholog set using MUSCLE (version 3.7) with default settings
This is the final step in the GMV pipeline, in which consistent start sites are predicted. If the positions of all of the original start sites coincide in the multiple sequence alignment, the predictions are accepted as is. Otherwise, the algorithm looks for a position where the original start sites coincide for a majority of genomes and where an alternative start site coincides in each of the remaining genomes. If such a position exists, the alternative sites are used as modified predictions for the remaining genomes. If no consistent start site obeys the majority rule, the prediction is flagged as inconsistent.
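The majority vote rule described in this step can be sketched as follows. This is a minimal illustration rather than the distributed implementation: it assumes each start site has already been mapped to a column of the multiple sequence alignment, and the function name `gmv_vote` and its input dictionaries are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Set


def gmv_vote(
    predicted: Dict[str, int],
    alternatives: Dict[str, Set[int]],
) -> Optional[Dict[str, int]]:
    """Apply the majority vote rule to one ortholog set.

    predicted    maps genome -> alignment column of the original start site.
    alternatives maps genome -> alignment columns of its alternative start sites.
    Returns the (possibly revised) starts, or None if the set stays inconsistent.
    """
    if not predicted:
        return {}
    column, votes = Counter(predicted.values()).most_common(1)[0]
    if votes == len(predicted):
        return dict(predicted)      # already consistent: accept as is
    if 2 * votes <= len(predicted):
        return None                 # no strict majority: flag as inconsistent
    revised = {}
    for genome, start in predicted.items():
        if start == column:
            revised[genome] = start
        elif column in alternatives.get(genome, set()):
            revised[genome] = column  # move the outlier to its alternative start
        else:
            return None             # an outlier has no alternative at the majority column
    return revised


if __name__ == "__main__":
    starts = {"g1": 10, "g2": 10, "g3": 10, "g4": 7, "g5": 10}
    print(gmv_vote(starts, {"g4": {10, 22}}))  # g4 is moved to column 10
    print(gmv_vote(starts, {"g4": {22}}))      # None: g4 cannot be made consistent
```

Requiring a strict majority means that a tie, for example a 2-2-1 split among five genomes, is left unresolved rather than decided arbitrarily.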
It is important to note that the GMV pipeline is not restricted to using Prodigal for gene prediction and MUSCLE for multiple sequence alignment. GMV can be made to work with any gene prediction software that can output alternative start sites in Step A. The current requirements for input to GMV are described in the manual included in the package, distributed at
The GMV algorithm pipeline was developed using Java (JDK 1.6) and Perl 5.8. The software has been tested on both Linux and MacOS X operating systems. Software is available under the New BSD open source license and is freely available at
Histogram of mean (top) and minimum (bottom) identity score between genes in ortholog sets derived from the low diversity, 5 genome set.
(PDF)
Histogram of mean (top) and minimum (bottom) identity score between genes in ortholog sets derived from the medium diversity, 5 genome set.
(PDF)
Histogram of mean (top) and minimum (bottom) identity score between genes in ortholog sets derived from the high diversity, 5 genome set.
(PDF)
Histogram of mean (top) and minimum (bottom) identity score between genes in ortholog sets derived from the very high diversity, 5 genome set.
(PDF)
Histogram of mean (top) and minimum (bottom) identity score between genes in ortholog sets derived from the low diversity, 10 genome set.
(PDF)
Histogram of mean (top) and minimum (bottom) identity score between genes in ortholog sets derived from the medium diversity, 10 genome set.
(PDF)
Histogram of mean (top) and minimum (bottom) identity score between genes in ortholog sets derived from the high diversity, 10 genome set.
(PDF)
Histogram of mean (top) and minimum (bottom) identity score between genes in ortholog sets derived from the very high diversity, 10 genome set.
(PDF)
Bacterial phylogenetic tree. The tree is based on aligning the beta and beta-prime subunits of the RNA polymerase and was generated using a maximum likelihood method
(PDF)
List of genomes in each genome set. The FASTA files were downloaded June–July 2010.
(PDF)
List of genomes used to estimate projected impact of GMV on consistency and error rates in gene predictions. The 467 genomes were organized into 39 genera for the estimate. The genome list was obtained from the Integrated Microbial Genomes resource at the DOE Joint Genome Institute (
(PDF)
Source files for GenBank default and Glimmer3 gene start sites for 5 genome sets of medium and high diversity.
(PDF)
We are grateful to Benjamin McMahon for providing the bacterial phylogenetic tree (Supplementary