Conceived and designed the experiments: AMA MRR CD. Performed the experiments: AMA RAS. Analyzed the data: AMA MRR CD. Contributed reagents/materials/analysis tools: AMA RAS MRR CD. Wrote the paper: AMA MRR CD.
The authors have declared that no competing interests exist.
The function of most proteins is not determined experimentally, but is extrapolated from homologs. According to the “ortholog conjecture”, or standard model of phylogenomics, protein function changes rapidly after duplication, leading to paralogs with different functions, while orthologs retain the ancestral function. We report here that a comparison of experimentally supported functional annotations among homologs from 13 genomes mostly supports this model. We show that to analyze GO annotation effectively, several confounding factors need to be controlled: authorship bias, variation of GO term frequency among species, variation of background similarity among species pairs, and propagated annotation bias. After controlling for these biases, we observe that orthologs have generally more similar functional annotations than paralogs. This is especially strong for sub-cellular localization. We observe only a weak decrease in functional similarity with increasing sequence divergence. These findings hold over a large diversity of species; notably orthologs from model organisms such as
To infer the function of an unknown gene, possibly the most effective way is to identify a well-characterized evolutionarily related gene, and assume that they have both kept their ancestral function. If several such homologs are available, all else being equal, it has long been assumed that those that diverged by speciation (“orthologs”) are functionally closer than those that diverged by duplication (“paralogs”); thus function is more reliably inferred from the former. But despite its prevalence, this model mostly rests on first principles, as for the longest time we have not had sufficient data to test it empirically. Recently, some studies began investigating this question and have cast doubt on the validity of this model. Here, we show that by considering a wide range of organisms and data, and, crucially, by correcting for several easily overlooked biases affecting functional annotations, the standard model is corroborated by the presently available experimental data.
Understanding the relation between gene evolution and function is perhaps our only hope of bringing functional annotation in line with the furious pace of genomic sequencing. Indeed, despite developments in high-throughput experimental techniques, propagation of functional knowledge from evolutionarily related genes remains the procedure that scales best and appears most dependable
Yet, large-scale studies corroborating this standard model are surprisingly scarce
In the present study, we investigated the functional similarity of 395,328 pairs of orthologs and paralogs with experimental GO annotations
GO annotations—even restricting to experimentally supported ones—are heterogeneous in many ways, such as type of function described, level of specificity, applicable species, method of investigation, or curation practices
The average GO similarity of genes annotated based on the same scientific article is higher than that of genes annotated based on different articles (
Typical measures of function similarity do not account for variation of GO term frequency among species. This is the case for measures defined on the ontology graph alone, such as term overlap measures (e.g., Jaccard index
Even if we account for variation in GO term frequency among species, the average similarity of random pairs of genes (which we call “background similarity”) is not equal for all genome pairs (
Experimentally backed GO annotations (evidence code EXP and children), which constitute less than 1% of all annotations, are undisputedly considered the most reliable
Correcting for the biases described above, we first restricted our comparison to experimental annotations with no common investigator from the two yeast species,
The first observation is that at similar levels of sequence divergence, one-to-one orthologs do have significantly more similar experimental GO annotations than paralogs, and that one-to-many and many-to-many orthologs (referred to as “other orthologs” in the remainder of the text) are somewhat intermediary (
Only pairs of annotations derived from different publications, which do not share any common author, were used. (
On the other hand, the difference between orthologs and paralogs is not as important as might have been expected under a naive interpretation of the ortholog conjecture: orthologs are far from having almost the same function. This might stem in part from the differences between experiments performed by different investigators. Most surprising, the decrease in annotation similarity with protein divergence is very weak (Spearman correlation between sequence identity and GO similarity over all homologs: ρ = −0.019,
We have verified these results with a number of additional controls: using different metrics of GO annotation similarity (
The GO is composed of three orthogonal ontologies, which we have analyzed separately for the two yeasts. The Cellular Component ontology shows the most marked pattern, with a very clear excess of similarity between one-to-one orthologs, relative to all other homologs (
One inherent limitation of two-species analyses is that all pairs of orthologs started diverging at the same time (the speciation event between the two species), with almost all paralogs being either older (the “out-paralogs”) or younger (the “in-paralogs”) than the orthologs. By considering sequences from many different gene families (some faster evolving, others slower evolving), we can compare orthologs and paralogs that have similar levels of sequence divergence. Inevitably, however, slow-evolving orthologs will tend to be compared with in-paralogs, while fast-evolving orthologs will tend to be compared with out-paralogs. To avoid the potential bias that this might introduce, we need to look at data from multiple species.
We performed the same comparisons between all possible pairs of the 13 species with sufficient experimental GO annotations. Results are widely consistent with the yeast only study (
Only pairs of annotations derived from different publications, which do not share any common author, were used. (
As in the yeast study, there is little correlation between functional similarity and protein sequence identity (Spearman ρ = −0.023,
The outgroups are (
Enzyme commission (EC) numbers are an alternative source of functional annotations. The relation between EC numbers and sequence divergence has already been studied extensively (e.g.,
The distinction between orthologs and paralogs has been a central concept of phylogenomics
Directly comparing functional annotations is complicated, because they are derived from a variety of sources and by a variety of procedures. The best-known bias is that computationally derived annotations (IEA code) are generally believed to be less reliable than experimentally derived annotations. The computational annotations reflect the algorithms used to propagate annotations
Even limiting ourselves to experimentally derived annotations, there remains a great deal of complexity and bias in the data of functional annotation.
First, different model organisms are studied by different scientific communities, for different purposes, which biases the types of experiments conducted and reported. Moreover, each organism is predominantly annotated by one Model Organism Database team, which differs from others in its data curation and annotation practices. Indeed, we observe significant differences in background functional similarity, depending on the species compared. While part of this variation might be due to biological differences among the species, these differences appear to be mostly due to the artifacts outlined above. Here, we have compared 13 organisms spanning the tree of life (
Second, each experiment is performed and reported by a given team of investigators, who have a scientific focus and a manner of reporting which are specific to them. This induces a strong bias towards similar annotations derived from the same paper, which mostly affects same-species paralogs. Importantly, there is a bias towards similar annotations even when considering different papers which share at least one co-author. Unless accounted for, this confounding factor leads to a large spurious excess of similarity between same-species paralogs
While GO annotations are complex and biased, it nevertheless appears possible to identify and correct these biases, and to detect biologically significant signal. We feel that the use of 13 different species, with diverse annotation levels and evolutionary distances, contributes to the robustness of our results.
Once the biases identified above are accounted for, the signal which emerges can be summarized in three major points: (i) Consistent with the “ortholog conjecture”, or “standard model of phylogenomics”, overall functional similarity is highest between one-to-one orthologs, lowest between paralogs, and intermediate between other orthologs. (ii) There is at best a very weak relation between protein sequence similarity and functional similarity. (iii) The difference between orthologs and paralogs, although consistent with the ortholog conjecture, is weaker than expected under a naive understanding of that model; this is especially true when Molecular Function and Biological Process are considered separately.
The standard model of higher functional similarity among orthologs than paralogs at similar levels of sequence divergence could not be supported until it was explicitly tested
An intriguing pattern in our results is that we find strong conservation of Cellular Component annotations among orthologs. Contrary to the two other ontologies, sub-cellular localization is an aspect of function which leaves little room for divergent interpretation. Moreover, experimental results are easier to report in similar terms in different species. These factors might allow better detection of the excess conservation of orthologs. Thus, of the 3 ontologies, our results on cellular components are arguably the most conclusive.
As for the two other aspects of protein function captured by the Gene Ontology—Molecular Function and Biological Process—they have more subtle patterns. Molecular Function shows an excess of conservation between orthologs which is weaker than for Cellular Component, but which is strongly significant over all 13 genomes analyzed. This is the aspect of function for which there was previously the most evidence for the “uniform model” of no significant difference between orthologs and paralogs; with the available data, this can now be rejected. This is also the aspect of function for which the absolute value of excess similarity (i.e., excess similarity of homologs over random pairs) is strongest—for both orthologs and paralogs. Thus, Molecular Function appears to be strongly conserved between even distant homologs, which supports the received wisdom of predicting this type of annotation on the basis of conserved protein domains.
Biological Process also has a significant excess of function conservation among orthologs, although weaker than for the Cellular Component. This is surprising, given the wide differences in biology between the species compared. Indeed, throughout the entire range of sequence divergence, orthologs are considerably more similar in function than even same-species paralogs. Of note, the biases which amplify apparent similarity between paralogs are strongest for this aspect of function: not correcting for the sampling bias of orthologs or paralogs detected between species can lead to a spurious excess of conservation of same-species paralogs. Our results contradict the concept of the evolution of cellular context set forth by Nehrt et al. to explain the apparent higher similarity of function of in-paralogs between human and mouse
This concept was also related to the weak relation between protein sequence divergence and functional divergence. Nehrt et al.
The low impact of evolutionary time on average protein function conservation is also apparent if we compare humans to model organisms with very different divergence times. Indeed, the extent of functional similarity of one-to-one orthologs is similar between human and
In conclusion, our analyses corroborate the central tenet of the standard model of phylogenomics: that at similar levels of sequence divergence, orthologs are in general more similar in function than paralogs. But although significant, the difference is modest, and is uneven among different aspects of function (among different ontologies). Furthermore, our results expose other trends unexplained by the standard model, such as differences among subtypes of orthology and paralogy (also observed in other contexts, such as intron conservation
We selected 13 genomes with highest coverage in GO annotations backed by experimental data (evidence codes EXP, IDA, IEP, IGI, IMP, and IPI). The annotations were retrieved from the GOA database
We used the EC number assignments of the ENZYME database, maintained by Swiss-Prot
We used orthologs and paralogs induced by Ensembl Compara gene trees (version 65)
The comparison of gene annotations requires a measure of semantic similarity. In recent years, several measures have been proposed (for review,
A first similarity measure is Resnik's information content metric
This measure is directly related to the information content of the most specific common parent of the two terms. The higher this value, the more specific the commonality of the annotations. Note that the probabilities for all terms are commonly estimated from their frequency of occurrence in the database. A natural extension is Lin's
Therefore, Lin's similarity is bounded between 0 (related only through the root ontology term) and 1 (identical annotations).
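The two term-level measures described above can be illustrated with a minimal sketch. It assumes the commonly used definitions: Resnik's similarity is the information content, −log p(t), of the most informative common ancestor, and Lin's similarity divides twice that value by the summed information content of the two terms. The toy ontology, the term names and the probabilities below are hypothetical and serve only as an illustration; they are not taken from the dataset analyzed here.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}

# Hypothetical probabilities p(t): fraction of annotations at t or below it.
PROB = {
    "root": 1.0,
    "binding": 0.5,
    "catalysis": 0.5,
    "dna_binding": 0.2,
    "rna_binding": 0.25,
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """Information content log(1 / p(term)); zero for the root term."""
    return math.log(1.0 / PROB[term])


def sim_resnik(t1, t2):
    """Information content of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(information_content(t) for t in common)


def sim_lin(t1, t2):
    """Resnik similarity normalized by the information content of both terms."""
    denom = information_content(t1) + information_content(t2)
    if denom == 0.0:
        # Both terms are the root; returning 0.0 is a convention of this sketch.
        return 0.0
    return 2.0 * sim_resnik(t1, t2) / denom
```

In this toy example, two terms that share only the root obtain a Lin similarity of 0, and a term compared with itself obtains 1, in line with the bounds stated above.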
Genes are often annotated with more than one term, which raises the question of how to compute the overall similarity between two genes. Two common approaches consist in computing the similarity for all pairs of GO terms between the two genes, and reporting either the maximum or the average among them. To overcome problems with these measures
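The maximum and average approaches can be sketched as follows. The function names are our own, and the term-level similarity is passed in as an argument; the toy similarity table is hypothetical and stands in for any term-level measure.

```python
from itertools import product


def pairwise_scores(terms_a, terms_b, term_sim):
    """Similarity of every pair of terms, one from each gene."""
    return [term_sim(a, b) for a, b in product(terms_a, terms_b)]


def gene_sim_max(terms_a, terms_b, term_sim):
    """Gene-level similarity as the maximum over all term pairs."""
    return max(pairwise_scores(terms_a, terms_b, term_sim))


def gene_sim_avg(terms_a, terms_b, term_sim):
    """Gene-level similarity as the average over all term pairs."""
    scores = pairwise_scores(terms_a, terms_b, term_sim)
    return sum(scores) / len(scores)


# Hypothetical term-level similarities, for illustration only.
TOY_TABLE = {frozenset(("kinase activity", "ATP binding")): 0.4}


def toy_term_sim(a, b):
    """Look up a toy similarity; identical terms score 1, unknown pairs 0."""
    if a == b:
        return 1.0
    return TOY_TABLE.get(frozenset((a, b)), 0.0)


gene_x = ["kinase activity", "ATP binding"]
gene_y = ["kinase activity"]
max_score = gene_sim_max(gene_x, gene_y, toy_term_sim)
avg_score = gene_sim_avg(gene_x, gene_y, toy_term_sim)
```

The example illustrates how the two summaries can diverge: a single shared term drives the maximum to its upper bound, whereas the average is lowered by every additional, less similar term.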
As an alternative to these information-theoretically motivated similarity measures, measures based solely on the ontology graph (e.g., term-overlap measures) have also been proposed and applied, e.g., the Jaccard index
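As an illustration of a term-overlap measure, the sketch below computes a Jaccard index between two genes. It assumes that each gene's annotation set is first extended with all ancestor terms, which is a common choice but an assumption here; the toy ontology and the function names are hypothetical.

```python
def with_ancestors(terms, parents):
    """Extend a set of terms with all of their ancestors in the ontology."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, []))
    return closed


def jaccard(terms_a, terms_b, parents):
    """Size of the intersection over size of the union of the extended sets."""
    a = with_ancestors(terms_a, parents)
    b = with_ancestors(terms_b, parents)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy ontology: term -> direct parents.
TOY_PARENTS = {
    "nucleus": ["organelle"],
    "mitochondrion": ["organelle"],
    "organelle": ["cell"],
    "cell": [],
}

different = jaccard({"nucleus"}, {"mitochondrion"}, TOY_PARENTS)
identical = jaccard({"nucleus"}, {"nucleus"}, TOY_PARENTS)
```

Note that such a measure depends only on the graph: it gives the same value regardless of how frequently the terms are used in any particular species.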
In order to compare our results with the findings of Nehrt et al.
As not all genomes are annotated by the same people for the same purpose, there can be substantial differences in annotation structure and frequency across genomes. We normalize the similarity measures in two respects. First, contrary to the common practice of computing frequencies for each GO term across the entire annotation corpus
The second normalization step is motivated by the observation that the average similarity of random pairs of genes (the background similarity) is not equal for all genome pairs and subtypes of homology (
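The second normalization step can be sketched as follows, under two assumptions that we state explicitly: the background similarity is estimated as the mean similarity of randomly drawn gene pairs, and the excess similarity is obtained by subtracting this background from the observed similarity. The function names, the toy gene sets and the toy gene-level similarity are hypothetical.

```python
import random


def background_similarity(genes_a, genes_b, gene_sim, n_pairs=10_000, seed=0):
    """Mean similarity of n_pairs random gene pairs, one gene from each list."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    total = 0.0
    for _ in range(n_pairs):
        total += gene_sim(rng.choice(genes_a), rng.choice(genes_b))
    return total / n_pairs


def excess_similarity(observed, background):
    """Observed similarity minus the background (assumed to be a difference)."""
    return observed - background


def toy_gene_sim(terms_a, terms_b):
    """Hypothetical gene-level similarity: Jaccard index of the raw term sets."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


# Hypothetical annotated genes from two genomes, each a set of GO term ids.
genome_a = [{"GO:1"}, {"GO:1", "GO:2"}, {"GO:3"}]
genome_b = [{"GO:1"}, {"GO:3"}]

bg = background_similarity(genome_a, genome_b, toy_gene_sim, n_pairs=1000, seed=1)
excess = excess_similarity(toy_gene_sim({"GO:1"}, {"GO:1"}), bg)
```

In an actual analysis, one background value would be estimated per genome pair, ontology and homology type, and the random genes would be restricted to those that have experimental annotations and at least one homolog of the relevant type.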
For each GO annotation, an evidence code and a reference identifier are recorded. In the case of experimental annotations (EXP, IDA, IEP, IGI, IMP and IPI), this reference id is usually a PubMed identifier or a reference id from a model organism database (MOD). We extract the authors associated with a given GO annotation by first mapping non-PubMed reference ids to PubMed ids using publicly available mapping files from the MODs. Second, for each PubMed id we extract the authors of that publication from the PubMed webpage.
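Once authors have been retrieved, pairs of annotations can be sorted into the three authorship categories used in this study. The sketch below assumes that the reference-to-PubMed mapping and the author lists are already available as in-memory dictionaries; the identifiers, author names and function names are hypothetical, and author names are compared as plain strings without any disambiguation.

```python
def to_pmid(reference_id, mod_to_pmid):
    """Map a MOD reference id to a PubMed id; PubMed ids pass through as-is."""
    return mod_to_pmid.get(reference_id, reference_id)


def classify_annotation_pair(ref_a, ref_b, mod_to_pmid, authors_by_pmid):
    """Return the authorship category of a pair of annotations."""
    pmid_a = to_pmid(ref_a, mod_to_pmid)
    pmid_b = to_pmid(ref_b, mod_to_pmid)
    if pmid_a == pmid_b:
        return "same publication"
    authors_a = authors_by_pmid.get(pmid_a, set())
    authors_b = authors_by_pmid.get(pmid_b, set())
    if authors_a & authors_b:
        return "common author"
    return "different authors"


# Hypothetical lookup tables, for illustration only.
MOD_TO_PMID = {"MOD_REF:0001": "111"}
AUTHORS_BY_PMID = {
    "111": {"Smith J", "Lee K"},
    "222": {"Lee K", "Wong A"},
    "333": {"Garcia M"},
}

same = classify_annotation_pair("MOD_REF:0001", "111", MOD_TO_PMID, AUTHORS_BY_PMID)
shared = classify_annotation_pair("111", "222", MOD_TO_PMID, AUTHORS_BY_PMID)
distinct = classify_annotation_pair("111", "333", MOD_TO_PMID, AUTHORS_BY_PMID)
```

Only pairs falling in the last category ("different authors") would be retained for a dataset free of common authors.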
Tab-separated text file with all raw data of the main dataset (experimentally-supported GO annotations without common authors).
(GZ)
Contrasting excess Schlicker-like similarity of homologs with experimental annotations reported in A) the same publication, B) different publications involving at least one common author and C) publications with different authors only.
(PDF)
Estimated background similarity per genome pair for each ontology and homolog relation type. For within-species homologs, entries along one column correspond to the background similarity within the species on the x-axis with respect to the speciation event with the species on the y-axis. The background similarities for each genome pair and homology type have been computed between 10,000 random gene pairs, where both genes have (i) at least one recorded homologous match of that type and (ii) are annotated with experimental GO annotations.
(PDF)
Different measures of GO term similarity among various types of homologs. The seven panels are A) maximum simResnik, B) average simResnik, C) maximum simLin, D) average simLin, E) Maryland-bridge term overlap measure, F) simSchlicker (giving the same weight to each annotation) and G) simSchlicker as originally defined in Schlicker et al. (2006) (giving the same weight to each gene product). All similarities are measured from the YEAST/SCHPO comparison with GO annotations backed by experimental evidence without common authors.
(PDF)
Contrasting different measures of divergence as independent variables: A) Percent sequence identity and B) PAM estimates of sequence divergence, both derived from a Smith-Waterman alignment over the full protein lengths. All function similarities are in Excess Schlicker-like Similarity and have been measured from the dataset with only GO annotations backed by experimental evidence originating from publications sharing no common authors.
(PDF)
Average excess Schlicker-like Similarity measured from homologous gene pairs with GO annotations backed by experimental evidence from publications with no common authors. The sampled gene pairs form quartets with an ancient duplication and subsequent speciations. The quartets are sampled from A) the two yeast species only and B) from all 13 analyzed species.
(PDF)
Difference in average Excess Schlicker Function Similarity between all types of Orthologs and all types of Paralogs from the YEAST/SCHPO genome pair, on the dataset of pairs backed by experimental annotations from studies without common authors. The different panels report the difference for the different GO ontologies. The data points indicate the difference of the means, and the gray area is a linear interpolation of the bin-wise 95% confidence interval for the difference of the means. The confidence interval is computed for each bin with a Mann-Whitney test. P-values are provided in
(PDF)
Average excess Schlicker-like similarity for any pair of analyzed species, measured on the dataset restricted to experimental annotations from publications without common authors. Reported is the average excess similarity over all three GO ontologies. A mapping of the species abbreviations to scientific names is provided in
(PDF)
Difference in average Excess Schlicker Function Similarity between all types of Orthologs and all types of Paralogs from all 13 analyzed genomes, on the dataset of pairs backed by experimental annotations from studies without common authors. The different panels report the difference for the different GO ontologies. The data points indicate the difference of the means, and the gray area is a linear interpolation of the bin-wise 95% confidence interval for the difference of the means. The confidence interval is computed for each bin with a Mann-Whitney test. P-values for the statistical test of whether the difference is different from 0 are available in
(PDF)
Different bin-widths (columns) for evolutionary divergence categories: the results are robust with respect to the choice of bin width. The analysis is done on the gene pairs with experimental GO annotations without common author between all 13 genomes.
(PDF)
Different measures of GO term similarity among various types of homologs. The seven panels are A) maximum simResnik, B) average simResnik, C) maximum simLin, D) average simLin, E) Maryland-bridge term overlap measure, F) simSchlicker (giving the same weight to each annotation) and G) simSchlicker as originally defined in Schlicker et al. (2006) (giving the same weight to each gene product). All similarities are measured from the gene pairs from all 13 analyzed genomes with GO annotations backed by experimental evidence without common authors.
(PDF)
Test of over-representation of a single species pair. We applied the following re-sampling strategy to the dataset of gene pairs with experimental GO annotations without common authors. First, we partition the dataset into independent sub-datasets, each composed of all the gene pairs of a given homology type and species pair. From each sub-dataset, we then randomly sample gene pairs with replacement, drawing as many pairs as the sub-dataset contains, up to a maximum number of allowed pairs. This maximum has been set to 2000 gene pairs per species pair and homology type. This way we ensure that no species pair can influence the results by more than 1.5%. We then compute the average similarity per homology type and distance category from the combined sub-datasets. This whole procedure is repeated 100 times in order to obtain the necessary quantiles for the box-plots.
(PDF)
Test for over-representation of large gene families in the OMA homologs. We applied the following re-sampling strategy to the dataset of gene pairs with experimental GO annotations without common authors. First, we partition the dataset into independent sub-datasets, each composed of all the gene pairs from a given gene family. From each sub-dataset, we then randomly sample gene pairs with replacement, drawing as many pairs as the sub-dataset contains, up to a maximum number of allowed pairs. This maximum has been set to 100 gene pairs per gene family. This way we ensure that no single family can influence the results by more than 1%. We then compute the average similarity per homology type and distance category from the combined sub-datasets. This whole procedure is repeated 100 times in order to obtain the necessary quantiles for the box-plots. Shown are box-plots for all 100 bootstrap samples.
(PDF)
Orthology/Paralogy relations inferred from Ensembl Gene Trees (version 65). To control for a potential bias in the orthology/paralogy inference method we repeated the analysis on homologs induced by the labeled Ensembl gene trees. Note that this analysis is limited to the following 6 species: HUMAN, MOUSE, RATNO, DROME, CAEEL and YEAST. Shown are the excess Schlicker similarities. In all ontologies, orthologs are significantly more similar in function than paralogs. The figures show the similarities of A) the average over all gene ontologies (t-test: p<2.2E−16), B) the molecular function ontology (t-test: p<2.2E−16), C) the biological process ontology (t-test: p = 2.19E−6) and D) the cellular component ontology (t-test: p<2.2E−16). All similarities have been computed on the dataset with experimental annotations without common authors from GOA 2012-01-21.
(PDF)
Contrasting different measures of divergence as independent variables: A) Percent sequence identity, B) PAM estimates of sequence divergence and C) Time estimates. Time estimates have been extracted from TimeTree (
(PDF)
Average excess Schlicker-like similarity of the various types of homologs with EC number annotations, with sequence divergence in percent identity as independent variable.
(PDF)
Effect of sequence divergence on functional similarity after correcting for several biases, for the A) biological process, B) cellular component and C) molecular function GO ontologies. Homologs are taken from Nehrt
(PDF)
The 13 species used in the analysis and their phylogenetic relations among each other according to the NCBI taxonomy.
(PDF)
Authorship bias: the fraction of homologs with experimental GO annotations from the same publication, different publication but common author and different authors varies strongly. All homologs have at least 50% sequence identity.
(PDF)
Authorship bias: equivalent to
(PDF)
Significance test for difference of mean excess Schlicker-like similarity between orthologs and paralogs. P-values have been computed for each distance bin separately using a Mann-Whitney test. Values are shown for the dataset covering all 13 genomes (middle column) as well as the yeast-only dataset (rightmost column). The corresponding graphs are provided in
(PDF)
Species information: source and release date for all 13 analyzed species. Their phylogenetic relation is depicted in
(PDF)
The authors thank Maria Anisimova, Steven A. Benner, Brigitte Boeckmann, Pascale Gaudet, Gaston H. Gonnet, and Adrian Schneider for discussions and helpful suggestions.