Conceived and designed the experiments: PR MWH. Performed the experiments: NLN WTC. Analyzed the data: NLN WTC PR MWH. Wrote the paper: PR MWH.
The authors have declared that no competing interests exist.
A common assumption in comparative genomics is that orthologous genes share greater functional similarity than do paralogous genes (the “ortholog conjecture”). Many methods used to computationally predict protein function are based on this assumption, even though it is largely untested. Here we present the first large-scale test of the ortholog conjecture using comparative functional genomic data from human and mouse. We use the experimentally derived functions of more than 8,900 genes, as well as an independent microarray dataset, to directly assess our ability to predict function using both orthologs and paralogs. Both datasets show that paralogs are often a much better predictor of function than are orthologs, even at lower sequence identities. Among paralogs, those found within the same species are consistently more functionally similar than those found in a different species. We also find that paralogous pairs residing on the same chromosome are more functionally similar than those on different chromosomes, perhaps due to higher levels of interlocus gene conversion between these pairs. In addition to offering implications for the computational prediction of protein function, our results shed light on the relationship between sequence divergence and functional divergence. We conclude that the most important factor in the evolution of function is not amino acid sequence, but rather the cellular context in which proteins act.
The use of model organisms in biological research rests upon the assumption that gene and protein functions discovered in one organism are likely to be the same or similar in another organism. This is why experiments in mouse are assumed to tell us about the function of genes in humans. A guiding principle in the assignment of function from one organism to another is that single-copy genes (“orthologs”) are statistically more likely to provide functional information than are multi-copy genes, whether in the same organism or different organisms. Here we have tested this idea by examining genes with known functions in human and mouse. Surprisingly, we find that multi-copy genes are as likely as, or more likely than, single-copy genes to provide accurate functional information. Our results suggest that the organism itself plays at least as large a role in determining the function of genes as does the particular sequence of the gene alone. This insight will benefit the assignment of function to genes whose roles are not yet known by widening the pool of appropriate genes from which function can be inferred.
The potential for gene duplication to generate evolutionary novelty was first noted in 1918 by Calvin Bridges (cited in
As the first protein-sequence data became available, Zuckerkandl and Pauling
We refer to the hypothesis that orthologs are more likely to be functionally similar than are paralogs as the “ortholog conjecture” (cf.
A large number of methods have been developed to identify orthologous relationships among proteins. These methods range from simple pairwise comparisons, to standard phylogenetic tree-building, to probabilistic assignment using Bayesian analyses
In this paper we directly test the ortholog conjecture using comparative functional genomic data. We use experimentally derived functional assignments of more than 8,900 genes from mouse and human, as well as a microarray dataset that includes 25 tissues in both mouse and human, to directly assess our ability to predict function using orthologs and paralogs. We use this pair of species both because they are two of the best-studied and best-annotated organisms and because homologous relationships are easy to identify due to their relatively recent divergence time. Because paralogs are almost always either more- or less-related to a focal gene than an ortholog (for inparalogs or outparalogs, respectively), it is meaningless to compare the predictive power of all orthologs to all paralogs; it seems obvious that closely related orthologs will be more similar in function than distantly related paralogs, and vice versa. Instead, we focus on the predictive power of both orthologs and paralogs as a function of protein sequence divergence. Our results demonstrate that paralogous genes from the same species are often a much better predictor of functional divergence than are orthologs or paralogs from different species, even at lower sequence identities.
Functional similarity was calculated between all pairs of homologous proteins (i.e. those in the same gene family) in human and mouse for which there is experimentally defined function for both members of the pair. These pairs include 2,579 one-to-one orthologs between human and mouse and 21,771 paralogous comparisons of any type. The experiments used to annotate these genes come from 12,204 unique published papers whose results are collected in the Gene Ontology (GO) database; in a later section we carry out an independent analysis using microarray data to measure functional similarity.
Standard error bars are shown.
Functional similarity can be measured for both the Biological Process and Molecular Function categories defined in the GO database. For the Biological Process category,
Contrary to a common assumption (the “ortholog conjecture”), the functional similarity between paralogs is significantly higher than that between orthologs for high sequence identities (≥70% for Biological Process;
While the ortholog data can be easily understood from
Standard error bars are shown.
The functional similarity curves show a clear difference between subtypes of paralogs. Inparalogs appear to be most functionally similar to one another, and their functional similarity is positively correlated with sequence identity in both ontologies. Within-species outparalogs have a slightly steeper decline than inparalogs, but are significantly more functionally similar than either between-species outparalogs or orthologs. The between-species outparalogs show trends most similar to orthologs. In fact, in the Biological Process category, these two curves are nearly identical. However, in the Molecular Function category, the more sequence-similar outparalogs have slightly higher functional similarity than do orthologs, while the less-similar outparalogs have lower functional similarity than do orthologs. In the
In addition to a large-scale view of functional similarity, it is also useful to take a family-based view in order to compare the predictive power of paralogs and orthologs within the same family. We asked, for a given family, first whether an ortholog or a paralog was more similar at the sequence level, and then whether an ortholog or the particular paralog was more similar at the functional level.
The counts for the groups were obtained as follows: for each family, a single functionally annotated target protein was selected uniformly at random from all proteins with at least one ortholog and at least one paralog in the family, and all its functionally annotated homologs were collected. We then asked whether at least one of the paralogs had higher sequence similarity than the ortholog, and then whether it had higher or lower functional similarity. This analysis required functionally annotated triples within gene families (i.e. the target gene, an ortholog, and a paralog of any type); thus 1-to-many and many-to-many orthologous relationships were included in this analysis. In cases where multiple genes were co-orthologous to the target, the ortholog having the highest sequence identity with the selected target protein was used for comparison. Note that each gene family was counted only once in this analysis, preventing families with large numbers of lineage-specific duplications from biasing the results. Finally, to ensure that the choice of target protein did not unduly affect the results, we repeated the analysis 100 times, each time choosing a new target protein in each of the 1,145 unique families containing experimentally annotated triples (685 families with Biological Process and 711 families with Molecular Function annotation).
Sequence identity | Biological Process: paralog has higher functional similarity | Biological Process: ortholog has higher functional similarity | Molecular Function: paralog has higher functional similarity | Molecular Function: ortholog has higher functional similarity |
Paralog has higher sequence identity | 17.4±0.2 | 3.6±0.1 | 17.7±0.2 | 7.2±0.1 |
Ortholog has higher sequence identity | 442.4±0.8 | 221.6±0.8 | 346.8±0.9 | 339.3±0.9 |
Each field shows the average number of protein families (±standard error), out of 100 runs with randomly selected target proteins, in which the row and column conditions were satisfied.
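The target selection and tallying procedure described before the table can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `Homolog` record, the dictionary layout of a family, and the handling of ties (a tie counts against the homolog named in the row) are all assumptions made for the example.

```python
import random
from collections import Counter
from typing import NamedTuple

class Homolog(NamedTuple):
    # Hypothetical record; field names are assumptions for this sketch.
    kind: str        # "ortholog" or "paralog"
    identity: float  # sequence identity to the target protein
    func_sim: float  # functional similarity to the target protein

def classify_target(homologs):
    """Return the (row, column) cell of the 2x2 table for one target."""
    orthologs = [h for h in homologs if h.kind == "ortholog"]
    paralogs = [h for h in homologs if h.kind == "paralog"]
    # With several co-orthologs, use the one with the highest identity.
    ortholog = max(orthologs, key=lambda h: h.identity)
    best_paralog = max(paralogs, key=lambda h: h.identity)
    if best_paralog.identity > ortholog.identity:
        row = "paralog"
        col = "paralog" if best_paralog.func_sim > ortholog.func_sim else "ortholog"
    else:
        row = "ortholog"
        # The ortholog "wins" only if it beats every annotated paralog.
        beats_all = all(ortholog.func_sim > p.func_sim for p in paralogs)
        col = "ortholog" if beats_all else "paralog"
    return row, col

def tally(families, rng):
    """Count each family once, using one randomly chosen eligible target."""
    counts = Counter()
    for family in families:  # family: dict mapping target id -> list of Homolog
        eligible = [
            target for target, hs in family.items()
            if any(h.kind == "ortholog" for h in hs)
            and any(h.kind == "paralog" for h in hs)
        ]
        if not eligible:
            continue
        target = rng.choice(sorted(eligible))
        counts[classify_target(family[target])] += 1
    return counts

# Toy data with made-up values, for illustration only.
toy_families = [
    {"A": [Homolog("ortholog", 0.9, 0.5), Homolog("paralog", 0.6, 0.8)]},
    {"B": [Homolog("ortholog", 0.5, 0.3), Homolog("paralog", 0.7, 0.6)]},
]
toy_counts = tally(toy_families, random.Random(0))
```

Repeating `tally` with different random seeds and averaging the four counts would mirror the 100 repetitions described in the text.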
The family-based analysis showed similar trends to those observed in previous sections. In the Biological Process category, if the orthologous sequence was more similar to the target protein, the ortholog had higher functional similarity to the target protein than all of its paralogs in only 33.4±0.1% of the cases (mean ± standard error). In contrast, in 82.9±0.4% of protein families in which a paralogous sequence was most similar to the target protein, it was also functionally most similar. In the Molecular Function category, the observed difference between orthologs and paralogs was similar: an ortholog had higher functional similarity to the target protein than all of its paralogs in only 49.5±0.1% of the cases. On the other hand, if the most similar sequence to a target protein was a paralog, the paralog was functionally most similar to the target protein in 71.1±0.5% of families.
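As a quick arithmetic check, the percentages quoted above can be approximately recovered from the mean counts in the table. The sketch below assumes that the first two numeric columns of the table belong to Biological Process and the last two to Molecular Function; it also takes the ratio of mean counts, which need not equal the mean of per-run ratios reported in the text, so agreement is only expected to within rounding.

```python
# Mean family counts copied from the table (100 random-target runs).
# Tuple order: (paralog functionally closer, ortholog functionally closer).
counts = {
    "Biological Process": {
        "paralog_seq_closer": (17.4, 3.6),
        "ortholog_seq_closer": (442.4, 221.6),
    },
    "Molecular Function": {
        "paralog_seq_closer": (17.7, 7.2),
        "ortholog_seq_closer": (346.8, 339.3),
    },
}

def summarize(rows):
    """Return (% paralog wins when paralog is sequence-closer,
    % ortholog wins when ortholog is sequence-closer, total families)."""
    p_par, p_orth = rows["paralog_seq_closer"]
    o_par, o_orth = rows["ortholog_seq_closer"]
    paralog_pct = round(100.0 * p_par / (p_par + p_orth), 1)
    ortholog_pct = round(100.0 * o_orth / (o_par + o_orth), 1)
    total = round(p_par + p_orth + o_par + o_orth)
    return paralog_pct, ortholog_pct, total

for ontology, rows in counts.items():
    print(ontology, summarize(rows))
```

Under these assumptions the row totals also match the 685 and 711 families reported for the two ontologies.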
It is known that paralogous sequences residing on the same chromosome are more likely to undergo non-allelic gene conversion in mammals
Standard error bars are shown.
We examined two families in further detail. (1) We compared the functional similarity of orthologs and paralogs in the full set of nuclear receptors in human and mouse, a well-studied group of proteins. Out of the 48 and 49 nuclear receptors identified in human and mouse, respectively
(2) Another example of a violation of the ortholog conjecture is found in the mitogen-activated protein kinase kinase kinase kinase 2 (MAP4K2) family. MAP4K2 is a serine/threonine protein kinase, expressed in lymph nodes, but also in other tissues such as lung, brain, and placenta
We analyzed multiple potential biases in the data that could impact the conclusions of this work. They included: 1) Functional annotation that is organism-specific, i.e. certain functions may be studied only in humans while others may be studied only in mice. To address this possibility we repeated our analysis using only the subset of functions studied in both human and mouse; there was no significant difference in the shape of the functional similarity curves relative to that shown in
Because all of the above analyses are based on user-reported or curator-based determinations of function, they may still be affected by individual researcher biases that we cannot control for. The only way to avoid this potential problem is to obtain a measure of function that is not dependent on an individual's interpretation of experiments. Therefore, we conducted a parallel analysis of the relationship between protein similarity and functional similarity using microarray data from 25 homologous tissues in human and mouse
We used the correlation in levels of normalized gene expression across tissues as our measure of functional similarity (see
Standard error bars are shown.
The microarray data used here have also been utilized in a number of previous evolutionary studies, though these studies largely focused only on paralogs
Because there is no interpretation or assignment of functional terms needed to obtain these results, we believe they strongly support all of our previous analyses. It should also be noted that very few of the above GO-based analyses used expression evidence: in particular, there were only a total of 310 annotations that used the IEP evidence code for either the Molecular Function or Biological Process categories. Therefore, these two datasets are largely non-overlapping and provide independent support for the results.
The accelerating pace of whole-genome sequencing coupled with the rapid—but relatively slower—pace of functional genomics projects has required commensurately fast methods for computational annotation of genes and proteins. Because functional studies are disproportionately concentrated in only a handful of model organisms, the working model for computational annotation has been transfer-by-similarity
Our results strongly suggest that the ortholog conjecture is not correct between human and mouse: given equivalent levels of protein divergence (or even slightly higher divergence), paralogous genes from the same species (either human or mouse) are better predictors of function than are orthologs from the other species. A similar result was previously obtained among yeast, fly, and worm when comparing conserved protein-protein interactions between homologs within the same species and homologs from different species (although this study did not distinguish among orthologs, inparalogs, and outparalogs
In addition to a general lack of support for the ortholog conjecture, our analyses revealed several surprising patterns. One of the most surprising is the lack of any discernible relationship between protein similarity and functional similarity for orthologs, whether considering Biological Process or Molecular Function annotations (
The importance of cellular and organismal context in defining protein function may go a long way toward explaining many aspects of our results, including the lack of a relationship between functional and sequence similarity for orthologs, the presence of this relationship for paralogs, and the differences between different types of paralogs (in-/outparalogs). We propose that the key to understanding the rate at which protein function evolves is not how quickly the protein sequence itself evolves, but rather the rate at which its cellular context—including directly and indirectly interacting molecules—evolves. To further explain this hypothesis, note that all of the orthologous pairs studied here are the same age: that is, they all share a last common ancestor at the split between the human and mouse lineages, regardless of their level of sequence identity. Unlike orthologs, the paralogs studied here shared common ancestors at many different times in the past, with some paralogs having split only a few million years ago while others split >100 million years ago. We propose that this difference in divergence times is the key to understanding the difference in relationships between functional and sequence similarity. The orthologs all share the same age—and therefore the same average functional similarity—but the paralogous pairs are of many different ages—and therefore different functional similarities.
Why should proteins of the same age share the same level of functional similarity? While there is no direct role for “time” in evolution that is not tied to mutation, we suggest that what time represents here is the evolution of the cellular context: the sum of the evolutionary changes over all of the directly and indirectly interacting molecules. If this context evolves at a steady rate (i.e. the average amount of functional change among all of the interacting molecules remains relatively constant), then protein function will appear to evolve at a steady rate, a rate largely disconnected from the level of an individual protein's sequence divergence. Several pieces of evidence support this conjecture. First, our results above show that even orthologous proteins that are 100% identical have different functions. Since it is obvious that the proteins themselves have not changed, the change must be due to regulation or downstream effects of these molecules. For example, Liao and Zhang
Some researchers may be concerned that the function being measured here is not independent of the organism, and is therefore not appropriate for testing the ortholog conjecture. Of course it is possible that if measured in a common
The results of our study suggest that neither sequence similarity nor identification of orthologous assignments alone can be considered an accurate predictor of protein function. We find that orthologous proteins between human and mouse share a constant level of functional similarity over a wide range of (global) sequence identities, while the functional similarity between paralogs is dependent on the type of paralogy, level of sequence identity, relative chromosomal location of duplicated genes, and organismal context. We find that sequence identity thresholds as a means of function transfer are generally applicable only to within-species paralogs. Moreover, these thresholds depend on the type of paralogy and on the specific duplication event, with inparalogs typically having lower thresholds for similarly accurate functional transfer than outparalogs. On the other hand, in the absence of within-species paralogs, our data indicate that orthologs and between-species outparalogs are similarly accurate in predicting protein function. In general, however, such relationships cannot be deemed ideal for function transfer of GO terms, as the average accuracy of predictions using orthologs and between-species outparalogs was consistently lower than 0.70 (
Functional annotation of genes with unknown function is often carried out by researchers working on particular proteins. In these cases—far from being an automated process of ortholog identification and functional transfer—individual researchers may examine the function of many closely related homologs before making decisions about functional annotations, or even before designing experiments. If they are available, researchers may be using the functions of both orthologs and paralogs to guide their own functional annotations. When inparalogs are available and happen to have the highest sequence identity, these genes may actually be the ones having the largest influence on the functional annotations in common databases; such a process of individual functional inference would create a pattern much like the one we observe. While our analysis of microarray data is consistent with the high functional similarity of within-species paralogs and is free from individual researcher or curator bias, we cannot rule out the possibility that such bias exists in widely used databases. However, such biases are likely to only apply to organisms already being studied by a large community of researchers in molecular biology. Many new genomes are being sequenced solely for the evolutionary or environmental importance of a species, and are therefore unlikely to have much prior data on gene and protein function. In these cases, our results suggest that functional transfer need not be dependent on the identification of orthologous genes in a model organism.
There are 31,479 proteins in the Swiss-Prot database with experimentally characterized function and 40,951 proteins in the Gene Ontology database (data as of February 1, 2010). The functions of this relatively small group of proteins have been transferred to a much larger number of homologous proteins and propagated across biological databases, often with gross inaccuracies
Finally, it must be mentioned again that our study has only addressed protein functions in two organisms, human and mouse. A fuller picture of the accuracy of protein function prediction would include many pairs of species from across the tree of life (see
Ensembl Compara (release 49, March 2008) gene trees were used to identify all homologous human-human, mouse-mouse, and human-mouse gene pairs. Though there are many methods and databases available for identifying homologous relationships, they provide qualitatively similar results
Biological Process and Molecular Function protein function information was retrieved from the Gene Ontology (GO) database. Only the curated GO term annotations were used in the analysis. These include all experimentally inferred annotations: inferred from direct assay (IDA), expression pattern (IEP), genetic interaction (IGI), mutant phenotype (IMP), and physical interaction (IPI) evidence codes. We also included the traceable author statement (TAS) and inferred by curator (IC) evidence codes. Since both the Biological Process and Molecular Function ontologies are represented by directed acyclic graphs (DAGs), the original functional terms were propagated towards the root of each DAG (with the root node excluded) thus producing a complete set of terms for each protein. The GO seqdblite database (release 2009-01-18) was used for term propagation. In total, 4,854 human and 4,089 mouse proteins had functional annotation in at least one GO DAG. This reduced the number of gene trees with at least two functionally annotated genes to 2,448; the total number of ortholog pairs is 2,579, inparalog pairs is 597, within-species outparalogs is 11,334, and between-species outparalogs is 9,840 (
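The propagation of annotated terms toward the root of a GO DAG, with the root itself excluded, can be sketched as a simple graph traversal. The sketch below is illustrative only: the `parents` mapping and the term names are hypothetical stand-ins for the ontology structure, and it does not reproduce how the GO seqdblite database was actually queried.

```python
def propagate(terms, parents, root):
    """Return the given terms plus all of their ancestors, excluding the root.

    terms   -- iterable of directly annotated term ids
    parents -- dict mapping a term id to its parent term ids (the DAG edges)
    root    -- id of the root node, which is left out of the result
    """
    complete = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term == root or term in complete:
            continue
        complete.add(term)
        stack.extend(parents.get(term, ()))
    return complete

# Hypothetical toy DAG: "c" has two parents, both children of the root.
toy_parents = {"c": ["a", "b"], "a": ["root"], "b": ["root"]}
toy_terms = propagate({"c"}, toy_parents, "root")
```

Because a term in a DAG can have several parents, the traversal tracks visited terms so that shared ancestors are added only once.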
Microarray data presented in Su et al.
Expression data were normalized within each platform individually. Expression values were first normalized within each individual tissue using the
In total, we were able to obtain expression data for 15,907 human genes and 15,552 mouse genes. This reduced the number of gene trees with at least two functionally annotated genes to 7,495; the total number of data pairs used for orthologs is 10,863, for inparalog pairs is 2,014, for within-species outparalogs is 10,396, and for between-species outparalogs is 9,370 (
We calculated protein sequence identity by using Needleman-Wunsch alignments of protein sequences with the B
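For readers unfamiliar with global alignment, the following is a minimal sketch of Needleman-Wunsch alignment followed by an identity calculation. It deliberately swaps in a simple match/mismatch/linear-gap scoring scheme rather than the substitution matrix used in the study, and it defines identity as identical columns divided by alignment length; both choices are assumptions of this example, not the parameters reported here.

```python
def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with unit scores (an assumption),
    returning identical columns / total alignment columns."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back one optimal alignment, counting columns and identities.
    i, j = n, m
    identical = columns = 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                identical += a[i - 1] == b[j - 1]
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return identical / columns if columns else 0.0
```

For example, under these unit scores `global_identity("ACDE", "ACE")` aligns the shorter sequence with a single gap and gives 3 identical columns out of 4.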
To calculate functional similarity for the GO data, let
This formula can be interpreted as the average of the fraction of correctly predicted functional terms in
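One score consistent with the interpretation given here is the mean of the two directional overlap fractions between the propagated term sets of a protein pair. The sketch below assumes that form; the formula itself is not reproduced in the text above, so the function should be read as an illustration of the idea rather than as the exact measure used in the study.

```python
def functional_similarity(terms_a, terms_b):
    """Assumed form: average of |A & B| / |A| and |A & B| / |B|.

    terms_a, terms_b -- non-empty sets of (propagated) GO term ids
    """
    if not terms_a or not terms_b:
        raise ValueError("both proteins need at least one annotated term")
    shared = len(terms_a & terms_b)
    return 0.5 * (shared / len(terms_a) + shared / len(terms_b))
```

With this form, identical term sets score 1, disjoint sets score 0, and a protein whose terms are a strict subset of its partner's terms scores between 0.5 and 1.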
To calculate functional similarity for the microarray data, we used the Pearson correlation coefficient (the Euclidean distance provided similar results). The correlation coefficient
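The Pearson correlation between two expression profiles can be computed as in the following minimal sketch. It assumes each gene is represented by a list of normalized expression values, one per tissue and in the same tissue order for both genes; the variable names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)
```

A pair of genes whose expression rises and falls together across tissues scores close to 1, while opposite profiles score close to -1.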
Different types of homology relationships among genes.
(PDF)
The relationship between functional similarity and
(PDF)
The phylogenetic relationships between functionally annotated members of the MAP4K family, and counts of overlapping and non-overlapping GO terms for the target protein human MAP4K2 (red circles) and each of its homologs (blue circles). Tree branch lengths are not drawn to scale.
(PDF)
The relationship between functional similarity and sequence identity using only the subset of GO terms assigned to at least one human and at least one mouse protein.
(PDF)
The relationship between functional similarity and sequence identity using a constant GO term annotation depth for all members of the gene family. For each family, the maximum depth of annotation (measured as the distance from the root node) for each protein was calculated, and then the minimum of the individual maximum annotation depths was found. All GO terms below this minimum were removed for all proteins in the family.
(PDF)
The relationship between functional similarity and sequence identity excluding all GO term annotations derived from the same publication (based on PubMed ID) for both members of the homologous protein pair. During annotation, the same GO term can be assigned to a protein by two or more distinct PubMed IDs. In these cases, GO term annotations were not considered to have come from the same publication if different PubMed IDs could be assigned to the annotations for each member of the pair.
(PDF)
The relationship between functional similarity and sequence identity using only protein pairs annotated with GO terms assigned by the same evidence code. All experimental (IDA, IEP, IGI, IMP, IPI), curator inferred (IC), and traceable author statement (TAS) evidence codes are included.
(PDF)
The relationship between functional similarity and
(PDF)
Functional similarity within the nuclear receptor family in human and mouse. Of the total number of annotated proteins with both an ortholog and a paralog, the counts show the number in each category. Paralogs with higher functional similarity are further distinguished by whether the within-species or between-species outparalog was most similar.
(DOC)
Measures of functional similarity, sequence similarity, and homology relationships between proteins, as well as GO codes associated with each protein used in the study.
(TXT)
Correlation in gene expression profiles between proteins, tissues used from the human and mouse array experiments, mappings of probesets to genes, as well as normalized expression values for each gene.
(TXT)
We thank Arcady Mushegian and several anonymous reviewers for comments that improved the manuscript.