DB and MP conceived and designed the experiments, performed the experiments, analysed the data, contributed reagents/materials/analysis tools, and wrote the paper.
The authors have declared that no competing interests exist.
An important element of the developing field of proteomics is to understand protein-protein interactions and other functional links amongst genes. Across-species correlation methods for detecting functional links work on the premise that functionally linked proteins will tend to show a common pattern of presence and absence across a range of genomes. We describe a maximum likelihood statistical model for predicting functional gene linkages. The method detects independent instances of the correlated gain or loss of pairs of proteins on phylogenetic trees, reducing the high rates of false positives observed in conventional across-species methods that do not explicitly incorporate a phylogeny. We show, in a dataset of 10,551 protein pairs, that the phylogenetic method improves by up to 35% on across-species analyses at identifying known functionally linked proteins. The method shows that protein pairs with at least two to three correlated events of gain or loss are almost certainly functionally linked. Contingent evolution, in which one gene's presence or absence depends upon the presence of another, can also be detected phylogenetically, and may identify genes whose functional significance depends upon their interaction with other genes. Incorporating phylogenetic information improves the prediction of functional linkages. The improvement derives from having a lower rate of false positives and from detecting trends that across-species analyses miss. Phylogenetic methods can easily be incorporated into the screening of large-scale bioinformatics datasets to identify sets of protein links and to characterise gene networks.
A typical fully sequenced genome from a bacterial species contains several thousand genes, and those from multicellular animals may contain many thousands of genes. Understanding the function of these genes is one of the key goals of the developing fields of bioinformatics and proteomics, and the results are of interest to life scientists. The authors describe a computational statistical method that can identify pairs of genes whose functions may be linked, in the sense of participating in a common metabolic pathway or of interacting physically. The method is applied to phylogenetic trees of related organisms and identifies instances in which a pair of genes is either gained or lost together during evolution. They find that genes that have co-evolved like this on two or more occasions during their evolutionary history are almost certainly functionally linked. These methods can be applied in an automated way to large numbers of species for which fully annotated genomes are available to identify candidate sets of functionally linked genes, and to characterize gene networks.
Evidence that two or more traits co-evolve across a range of species can be used to test hypotheses about the common selective pressures acting on the traits, and about the functional or adaptive relationship between them. Correlated evolution is increasingly being applied at the genetic level on the premise that genes that are gained and lost together [
Genes and their expression patterns evolve in a phylogenetic context such that functional links of adaptive value tend to be conserved and inherited by descendant species. Among closely related species, shared phylogenetic inheritance can also produce correlated gene profiles for genes that are not linked. Two or more genes might arise independently in a common ancestor and be retained in evolutionary descendants owing to their individual adaptive functions.
The figure shows a hypothetical phylogeny of eight species. Assume all four genes were present in the common ancestor. Only the top (blue) pair provides statistical evidence for correlated evolution. The apparent correlation in the bottom (red) pair arises from shared inheritance of the loss (state “0”) of both genes in the ancestor to the four species on the right of the diagram. Although the two genes were lost at the same time, it may have been for unrelated reasons. By comparison, the correlation in the top pair rests upon four separate events of the correlated loss of both genes. Both genes are retained until near the tips of the tree, at which point both are lost in each of four separate species. It is unlikely that two genes would be simultaneously lost on four independent occasions, unless the two genes were functionally linked. A simple across-species correlation does not discriminate between these two scenarios, whereas one that accounts for phylogeny does. This is an extreme scenario but many others are possible.
Our interest is to evaluate whether incorporating phylogenetic information improves the identification of functional gene links. The need to take account of phylogenetic relationships in comparative studies has long been appreciated in evolutionary biology [
Our dataset consists of a phylogeny of 15 eukaryote species for which complete or nearly complete sequenced genomes are available. There is no limit to the number of species that can be used, but it is important to use fully sequenced and well-annotated genomes to ensure that genes determined to be “absent” are in fact not in the genome. We compare the phylogenetic method's predictions to predictions derived from across-species correlations; the latter have been used in bioinformatics investigations to predict functional gene links [
All nodes of the tree received 100% posterior support in an MCMC analysis (see
We calculated the likelihood ratio statistic (see
Critical
To assign
Of the MIPS pairs whose patterns both vary across species, 609 (11%) have LRs that exceed the
We wish to know whether the LR method gets better at detecting “known” interactions the more extreme its statistical result. We combined the 8,102 LRs corresponding to all non-negative MIPS relationships, with the 6,838 LRs obtained from the randomly generated pairs in which the relationship is also non-negative. We then assigned the combined data to
The main graph shows the percentage of the predicted links at or below a given p-value, that correspond to annotated functionally linked pairs in the MIPS database, separately for the two methods. At a
The inset graph in
If false positives cause the across-species correlations to classify a lower percentage of the true pairs correctly, this should be apparent from comparing the two methods in the random-pairs data.
The across-species
(A) Higher rates of probable false positives for the across-species correlation. The horizontal dashed line defines the region in which the across-species method declares pairs significant (
(B) Same relationship as in (A) but for the MIPS pairs of annotated links. The across-species correlation returns a functional link for
The LR method's improvement over the across-species correlation seems principally to derive from correctly excluding spurious functional links that arise from shared phylogenetic inheritance, but also from correctly identifying some patterns of co-evolution that the across-species correlation misses. The two pairs of proteins shown alongside the phylogeny in
Contingent relationships between a pair of genes describe cases in which one gene is more likely to be gained or lost depending upon the state of the other. One example of this might be cases in which two genes are paralogues, and so one of the pair gets lost in each species owing to its redundant function. Other cases might identify instances in which one gene's function depends upon the presence of a second gene, but the second gene performs functions even in the absence of the first. Such contingent linkages may describe and explain many of the large number of cases in which two genes are functionally linked in one species, but they do not exclusively appear together across species. They can be detected by estimating the transition rate parameters of the dependent model (see
Three cytoplasmic ribosomal large subunit proteins may provide an example of contingent evolution. Protein L30 is significantly linked to proteins L43A and L43B: both LRs = 9.73,
Protein L30 is significantly linked across species to L43A and L43B (both LRs = 9.73,
Incorporating phylogenetic information improved predictions of functional gene links by between 18% and 35% over predictions derived from across-species correlations, and increasingly so for pairs of genes with greater evidence of correlated evolution on the phylogeny. The phylogeny makes it possible to discriminate across-species patterns that arise by chance through common ancestry from those that indicate multiple independent instances of the correlated gain or loss of a pair of genes. This has implications for methods such as “phylogenetic profiling” [
We find that pairs of genes that have been gained or lost together on at least two to three occasions are almost certainly functionally linked. To our knowledge, this is the first phylogenetic demonstration that correlated evolutionary events strongly imply functional linkage, and it underscores the importance of analysing events of protein evolution on phylogenetic trees. As the number of fully sequenced genomes increases, phylogenetic approaches can be used with increasing sensitivity to detect multiple events of correlated gene evolution, and by inference, pairs of genes with a high probability of being functionally linked.
We studied functional links on only a single phylogenetic tree rather than on a sample of trees, because we wished to compare results to the across-species correlation, which has no way of making use of the phylogenies. But it is straightforward to implement our approach in a Bayesian framework such that functional links are estimated across a sample of trees. Elsewhere we describe how to derive Bayesian posterior probability distributions of the parameters of the continuous-time Markov model of trait evolution, estimated over the posterior probability distribution of phylogenetic trees [
A surprising number of gene pairs that are annotated as functionally linked in yeast do not appear to be linked in other, often closely related, species. Some of these may arise because a gene characterised as “absent” has simply gone unnoticed. We think this is only a small part of the explanation here, as we restricted ourselves to well-annotated, fully sequenced genomes. More likely is that the set of across-species functional links is far smaller than the set of all known links within any given species, and this raises the question of just what an across-species functional link measures. One distinct possibility is that a fundamental set or “backbone” of conserved protein interactions exists, in what might be called the “correlated evolution network.” This set of links is distinctive, in that the pairs of genes tend either to be both present or both absent. If so, their identification should be given a high priority, as they may reveal general organismic “rules of assembly.”
The highly specific nature of functional links also has implications for using model organisms to make predictions about other species, such as humans. Our data suggest that such predictions will often be wrong: Many genes whose functions and links have been identified from in-depth study in a model species may adopt different functions in other species. A phylogenetic method routinely applied to large numbers of species could distinguish the subset of genes whose functions can be reliably assumed to generalise from those whose functions cannot. Used in combination with low-throughput single-species studies, such a method may yield a more sophisticated picture.
In any analyses relying on identification of orthologues across species, multigene families may cause particular headaches. Assuming that the functionally conserved orthologue of a given gene will be under similar selection pressures and therefore have the greatest sequence similarity on average, reciprocal sequence similarity procedures such as we have used (see
A large number of genes remain uncharacterised. Identifying functional linkages from phylogenetic events of co-evolution with other genes seems a promising way to understand function, and is an approach that can yield insights from currently poorly understood genomes. It is encouraging that we are able to detect functional links with reasonable sensitivity and specificity in a comparatively small number of species. Larger datasets will not only improve the ability to detect correlations; they will also make it possible to link events of correlated evolution to background organismic and ecological variables, and to identify clusters of genes that tend to appear together. Our approach can also be easily modified to use continuously varying data. Such data are increasingly becoming available from sequence similarity searches [
The method requires a phylogeny of the organisms to be investigated, plus data on the presence and absence of homologous genes.
All of our analyses were conducted on a phylogenetic tree of 15 eukaryote species for which whole-genome data were available in 2003, including the 13 fungal species
We used gene sequences for EF-1 alpha and EF-2 to infer the phylogenetic tree. We obtained proteins and their corresponding nucleotide sequences for each species, and aligned the data at the protein level using Clustal-X [
We reconstructed the phylogeny using a general time-reversible model of sequence evolution and allowing for gamma-distributed rate-heterogeneity (GTR+Γ). We found a tree for the 15 species using a heuristic search for the maximum likelihood (ML) tree in PAUP, and we refer to this as the ML tree. For comparison, we sampled the posterior probability distribution of phylogenies using the same model of evolution in a Bayesian MCMC framework (as described in [
We used the single ML tree in all of our analyses of correlated evolution, rather than calculating correlated evolution across the Bayesian sample [
The MIPS database lists 260 known
We modelled correlated gene presence/absence on a phylogeny using a continuous-time Markov model [
We describe the method in some detail below, as our presentation only partially overlaps with that in [
The model of correlated evolution does not place any restrictions on the parameters, using a maximum of eight parameters to describe the data. The correlated evolution model will improve on the independent model when the distribution of the traits across the species of the phylogeny implies that some of the pairs of transition rates constrained in the independent model to be equal to each other, in fact differ. Information that pairs of coefficients differ arises not from the number of species that come to inherit a particular set of outcomes, but from the implied number of times the events represented by the rate coefficients have occurred on the tree. This is how the likelihood approach discriminates between the two scenarios of
The method is formally described by a rate matrix
where we use the
The model does allow both traits to change over a longer interval
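The relationship between the two models and the transition probabilities over a branch can be illustrated with a minimal sketch in plain Python. The state ordering, the function names, and the rate values are assumptions made for illustration only, and the matrix exponential is a simple scaled Taylor series with repeated squaring, not the routine used in the authors' software.

```python
import math

# Assumed state order for this sketch: index 0 = (0,0), 1 = (0,1),
# 2 = (1,0), 3 = (1,1), where each pair is (gene A present, gene B present).

def dependent_q(q12, q13, q21, q24, q31, q34, q42, q43):
    """Eight-rate matrix; the instantaneous rate of a dual change is zero."""
    q = [
        [0.0, q12, q13, 0.0],
        [q21, 0.0, 0.0, q24],
        [q31, 0.0, 0.0, q34],
        [0.0, q42, q43, 0.0],
    ]
    for i in range(4):
        q[i][i] = -sum(q[i])  # each row of a rate matrix sums to zero
    return q

def independent_q(gain_a, loss_a, gain_b, loss_b):
    """Four-rate special case: each gene's rates ignore the other gene."""
    return dependent_q(q12=gain_b, q13=gain_a, q21=loss_b, q24=gain_a,
                       q31=loss_a, q34=gain_b, q42=loss_a, q43=loss_b)

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transition_probs(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by a scaled Taylor series plus squaring."""
    m = [[x * t / 2 ** squarings for x in row] for row in q]
    result = [[float(i == j) for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, m)]
        result = [[r + s for r, s in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

# Hypothetical rates: both genes are lost faster once the partner is absent.
q_dep = dependent_q(q12=0.1, q13=0.1, q21=0.5, q24=0.5,
                    q31=0.5, q34=0.5, q42=0.1, q43=0.1)
p = transition_probs(q_dep, 1.0)
print("instantaneous rate (0,0)->(1,1):", q_dep[0][3])
print("probability (0,0)->(1,1) over t=1:", p[0][3])
```

In this sketch the instantaneous rate of changing both genes at once is zero, yet the probability of moving from (0,0) to (1,1) over a branch of length 1 is positive, because the two changes can occur in sequence along the branch.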
Combining these probabilities over all branches of the tree yields the likelihood of the data,
The likelihood is summed over all possible ancestral state reconstructions [
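Summing over all ancestral state reconstructions does not require enumerating them: a post-order pass that carries four conditional likelihoods per node gives the same total. The sketch below illustrates this on a hypothetical three-species tree. The tree encoding, the rate values, the tip states, and the equal weighting of the four root states are all assumptions of this example, not details taken from the Discrete software.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transition_probs(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by a scaled Taylor series plus squaring."""
    m = [[x * t / 2 ** squarings for x in row] for row in q]
    result = [[float(i == j) for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, m)]
        result = [[r + s for r, s in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

def conditional_likelihoods(node, tip_states, q):
    """Probability of the tip data below `node`, for each state at `node`."""
    content, _branch_length = node
    if isinstance(content, str):  # a tip: its observed state is certain
        return [1.0 if s == tip_states[content] else 0.0 for s in range(4)]
    partial = [1.0] * 4
    for child in content:
        p = transition_probs(q, child[1])
        below = conditional_likelihoods(child, tip_states, q)
        for s in range(4):
            partial[s] *= sum(p[s][x] * below[x] for x in range(4))
    return partial

def log_likelihood(tree, tip_states, q, root_prior=(0.25, 0.25, 0.25, 0.25)):
    """Log-likelihood with an assumed prior over the four root states."""
    cond = conditional_likelihoods(tree, tip_states, q)
    return math.log(sum(w * c for w, c in zip(root_prior, cond)))

# Hypothetical inputs. States: 0=(0,0), 1=(0,1), 2=(1,0), 3=(1,1).
Q_EXAMPLE = [
    [-0.2, 0.1, 0.1, 0.0],
    [0.5, -1.0, 0.0, 0.5],
    [0.5, 0.0, -1.0, 0.5],
    [0.0, 0.1, 0.1, -0.2],
]
# A node is (tip name or list of child nodes, branch length above the node).
TREE = ([([("sp_a", 0.5), ("sp_b", 0.5)], 0.3), ("sp_c", 0.8)], 0.0)
TIPS = {"sp_a": 3, "sp_b": 3, "sp_c": 0}
print(log_likelihood(TREE, TIPS, Q_EXAMPLE))
```

In a full analysis the rates in the matrix would be varied by an optimiser to maximise this quantity, once under the independent constraints and once without them.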
The Discrete method can be used with a single tree and is now implemented in a Bayesian framework in the program BayesDiscrete, to account for uncertainty in both the estimates of the phylogeny and in the parameters of the model of correlated evolution. It is available from M. Pagel and A. Meade at
When the independent and dependent models are estimated by maximum likelihood, their goodness of fit is compared using the LR statistic: LR = −2 log
When the phylogeny contains a small number of species or rates of evolution are low, the LR statistic as defined above is often distributed with fewer than four degrees of freedom [
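As a rough illustration of the model comparison, the sketch below computes the LR from two log-likelihoods and converts it to a p-value using a chi-square reference distribution. It assumes the nominal four degrees of freedom (eight parameters minus four) by default and uses the closed-form chi-square tail that holds for even degrees of freedom. The function names and the log-likelihood values are hypothetical, and the resulting p-value is only the chi-square approximation, not a value reported in the paper.

```python
import math

def likelihood_ratio(loglik_independent, loglik_dependent):
    """LR = -2 log[L(independent) / L(dependent)], from log-likelihoods."""
    return 2.0 * (loglik_dependent - loglik_independent)

def chi2_sf_even_df(x, df=4):
    """Upper-tail chi-square probability; exact closed form for even df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

# Hypothetical log-likelihoods chosen so that the LR equals 9.73.
lr = likelihood_ratio(loglik_independent=-20.0, loglik_dependent=-15.135)
print("LR:", round(lr, 2))
print("p (4 df):", chi2_sf_even_df(lr, df=4))  # roughly 0.045
print("p (2 df):", chi2_sf_even_df(lr, df=2))  # smaller under fewer df
```

Because the same LR yields a smaller p-value under fewer degrees of freedom, referring the statistic to the four-degree-of-freedom distribution is the conservative choice when the effective number is lower.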
We used Fisher's exact test (e.g., [
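For comparison, an across-species test of this kind can be sketched in a few lines of plain Python. The example below builds the 2×2 presence/absence table for a pair of genes and computes a two-sided Fisher's exact p-value by summing the hypergeometric probabilities of all tables no more likely than the observed one. The function names and the two sets of eight-species profiles are hypothetical. They mimic a case in which one shared ancestral loss and four independent losses produce the same table.

```python
from math import comb

def contingency(profile_a, profile_b):
    """Counts of (both present, A only, B only, both absent) across species."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    a_only = sum(1 for a, b in zip(profile_a, profile_b) if a and not b)
    b_only = sum(1 for a, b in zip(profile_a, profile_b) if b and not a)
    neither = len(profile_a) - both - a_only - b_only
    return both, a_only, b_only, neither

def fisher_exact_two_sided(both, a_only, b_only, neither):
    """Two-sided Fisher's exact p-value for a 2x2 table."""
    row1, row2 = both + a_only, b_only + neither
    col1, n = both + b_only, both + a_only + b_only + neither

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(both)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical profiles over eight species (1 = present, 0 = absent).
shared_loss = ([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0])
separate_losses = ([1, 0, 1, 0, 1, 0, 1, 0], [1, 0, 1, 0, 1, 0, 1, 0])

for label, (gene_a, gene_b) in [("shared loss", shared_loss),
                                ("separate losses", separate_losses)]:
    table = contingency(gene_a, gene_b)
    print(label, table, fisher_exact_two_sided(*table))
```

Both pairs give the same table and therefore the same p-value, since the test sees only the counts and not where on the tree the losses occurred.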
Swiss-Prot (
We acknowledge the support of the Biotechnology and Biological Sciences Research Council, UK (Grants 19848 and 14980 to MP). Software to implement the methods described here is available from M. Pagel and A. Meade at
LR: likelihood ratio
MCMC: Markov chain Monte Carlo
MIPS: Munich Information Center for Protein Sequences
ML: maximum likelihood