CHY and DH conceived and designed the experiments. CHY performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, and wrote the paper.
The authors have declared that no competing interests exist.
Correlated changes of nucleic or amino acids have provided strong information about the structures and interactions of molecules. Despite the rich literature on coevolutionary sequence analysis, previous methods often have to trade off between generality, simplicity, phylogenetic information, and specific knowledge about interactions. Furthermore, despite the evidence of coevolution in selected protein families, a comprehensive screening of coevolution among all protein domains is still lacking. We propose an augmented continuous-time Markov process model for sequence coevolution. The model can handle different types of interactions, incorporates phylogenetic information and sequence substitution, has only one extra free parameter, and requires no knowledge about interaction rules. We apply this model in large-scale screenings of the entire protein domain database (Pfam). Strikingly, with 0.1 trillion tests executed, the majority of the inferred coevolving protein domains are functionally related, and the coevolving amino acid residues are spatially coupled. Moreover, many of the coevolving positions are located at functionally important sites of proteins/protein complexes, such as the subunit linkers of superoxide dismutase, the tRNA binding sites of ribosomes, the DNA binding region of RNA polymerase, and the active and ligand binding sites of various enzymes. The results suggest that sequence coevolution manifests structural and functional constraints of proteins. The intricate relations between sequence coevolution and various selective constraints are worth pursuing at a deeper level.
The sequences of different components within and across genes often undergo coordinated changes in order to maintain the structures or functions of the genes. Identifying these coordinated changes (the “coevolution” of those components) in the context of evolution is important in predicting the structures, interactions, and functions of genes. The authors conduct a large-scale screening of all the known protein sequences and build a compendium of the coevolving relations among all protein domains, the subunits of proteins. The majority of the coevolving protein domains either belong to the same proteins, appear in the same protein complexes, or share the same functional annotations. Furthermore, coevolving positions in the same proteins or protein complexes are spatially coupled, as they tend to be closer than random positions in the 3-D structures of the proteins/protein complexes. More strikingly, many coevolving positions are located at functionally important sites of the molecules. The results provide useful insights about the relations between sequence evolution and protein structures and functions.
Coevolution is prevalent at species, organismic, and molecular levels. At the molecular level, selective constraints operate on the entire system, which often requires coordinated changes of its components. The most well-known example is the compensatory substitution of nucleic acid pairs in RNA secondary structures [
Coordinated changes of amino acid residues have also been investigated. Typically these studies acquired one (or two) family(ies) of aligned sequences and examined covariation between aligned positions or of the entire sequences. Some of these have applied different covariation metrics including correlation coefficients [
A major drawback of many covariation metrics is the lack of phylogenetic information. The sequences manifesting the same level of covariation may arise from either a few independent substitutions in early ancestors or correlated changes along multiple lineages [
All the previous studies of detecting protein coevolution target a few proteins or protein domains, such as myoglobin [
We propose a general coevolutionary CTMP model which requires neither simplification of states nor prior knowledge about interactions, and has only one extra free parameter. Sequence substitution of the two sites is modeled by a continuous-time Markov process. The null (independent) model hypothesizes that two sites evolve independently. The alternative (coevolutionary) model is obtained from the null model by reweighting the independent substitution rate matrix to favor double over single changes. We apply this model to all the inter- and intra-domain position pairs in all the known protein domain families in the Pfam database [
We extend the CTMP sequence substitution to model coevolution of amino acid position pairs. The state transitions of a CTMP at an infinitesimal time interval follow a matrix differential equation (
(Top row) The independent rate matrix
(Second row) Suppose two protein domains
(Third row) We acquire the joint phylogenetic tree of the two families of sequences. For each pair of positions, we place the joint sequences on the leaves of the tree as the observed states of the CTMP. The conditional probability of interval
(Fourth row) The joint likelihood of a CTMP along a tree is the product of prior and conditional probabilities. The marginal likelihood of each pair of aligned positions is obtained by summing over all possible states of internal nodes. It can be efficiently evaluated by dynamic programming.
(Bottom row) The
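The dynamic-programming evaluation of the marginal likelihood mentioned above can be illustrated with a minimal sketch of the pruning recursion on a tree. Everything concrete here is an assumption made for illustration: a symmetric two-state process with a closed-form transition probability stands in for the joint amino acid pair states, and the tree `TREE`, its branch lengths, and the uniform prior are hypothetical.

```python
import math
from itertools import product

def transition_prob(same: bool, t: float, rate: float = 1.0) -> float:
    """Closed-form transition probability of a symmetric two-state CTMP (toy model)."""
    decay = math.exp(-2.0 * rate * t)
    return 0.5 + 0.5 * decay if same else 0.5 - 0.5 * decay

def partial_likelihood(node, leaf_states):
    """Return [L(0), L(1)]: likelihood of the leaves below `node` given the node's state."""
    if isinstance(node, str):  # leaf: the observed state is fixed
        return [1.0 if s == leaf_states[node] else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, branch_length in node:
        child_l = partial_likelihood(child, leaf_states)
        for s in (0, 1):
            # Sum over the child's state, weighted by the transition probability
            result[s] *= sum(
                transition_prob(s == c, branch_length) * child_l[c] for c in (0, 1)
            )
    return result

def marginal_likelihood(tree, leaf_states, prior=(0.5, 0.5)):
    """Sum over all internal-node states by combining the prior with the root partials."""
    root = partial_likelihood(tree, leaf_states)
    return sum(prior[s] * root[s] for s in (0, 1))

# Hypothetical tree: an internal node is a list of (child, branch_length); a leaf is a name.
TREE = [([("A", 0.1), ("B", 0.2)], 0.3), ("C", 0.5)]

if __name__ == "__main__":
    print(marginal_likelihood(TREE, {"A": 0, "B": 0, "C": 0}))
```

Each node is visited once, so the cost grows linearly with the number of nodes rather than exponentially with the number of internal nodes.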
Very often there are multiple coevolving positions between two domains (or within one single domain). To assess the likelihood score of the entire domain pair, we employ a probabilistic graphical model with variables corresponding to specific positions of the protein domains in an ancestral or contemporary species. Using a spanning tree approximation, we evaluate the joint likelihood score in terms of the pairwise and singlet likelihoods (
The entire Pfam database of aligned protein domain sequences was downloaded [
We considered the 3,722,468 domain family pairs (12% of all family pairs) that co-appeared in at least 20 species. Of these, 179,117 (4.81%) co-appear in the same proteins or share the same GO annotations (bottom level in the GO hierarchy) in more than half of the member proteins that have GO annotations. Within each domain family pair, we considered all position pairs. In total there were 1.171 × 10^11 all-versus-all inter-domain position pairs.
We calculated the
Solid blue: coevolving positions. Dotted red: background.
With a threshold 9.0, we obtained 3,953 position pairs distributed over 582 domain family pairs. We then ranked the 582 inferred domain pairs according to the
The coevolving protein domains are highly enriched with functionally coupled domain pairs. Of the 582 domain family pairs, 257 (44.16%) share proteins or bottom-level GO annotations in more than half of their members. This enrichment of functionally coupled domain pairs is more than a 9-fold increase over the entire dataset (4.81%). The hypergeometric
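The enrichment figures above can be checked with a short calculation. The sketch below computes the fold enrichment and a hypergeometric upper-tail probability in log space, using the counts quoted in the text (3,722,468 screened pairs, 179,117 functionally coupled, 582 inferred, 257 of them coupled). It is an illustration only: the function names are hypothetical, and the test setup is an assumption that may differ from the one used in the study.

```python
import math

def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_hypergeom_tail(total: int, marked: int, drawn: int, observed: int) -> float:
    """Natural log of P(X >= observed) for X ~ Hypergeometric(total, marked, drawn)."""
    upper = min(marked, drawn)
    log_terms = [
        log_comb(marked, k) + log_comb(total - marked, drawn - k) - log_comb(total, drawn)
        for k in range(observed, upper + 1)
        if drawn - k <= total - marked
    ]
    if not log_terms:
        return float("-inf")
    peak = max(log_terms)  # log-sum-exp keeps tiny probabilities from underflowing
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))

def fold_enrichment(total: int, marked: int, drawn: int, observed: int) -> float:
    """Ratio of the coupled fraction among inferred pairs to the background fraction."""
    return (observed / drawn) / (marked / total)

if __name__ == "__main__":
    counts = (3_722_468, 179_117, 582, 257)
    print("fold enrichment:", fold_enrichment(*counts))
    print("log10 p-value:", log_hypergeom_tail(*counts) / math.log(10))
```

Working in log space matters here because the tail probability is far below the smallest positive double-precision number representable by a direct product of binomial coefficients.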
Functional Categorization of Coevolving Domain Family Pairs That Are Functionally Coupled
Sequence covariation without phylogenetic information can be captured by mutual information. To demonstrate the importance of phylogenetic information, we applied the same inter-domain large-scale screening using pairwise mutual information (see
Solid blue: coevolutionary scores. Dotted red: mutual information.
Beyond the functional coupling of coevolving domains, a natural question is whether the coevolving amino acids are also spatially coupled. Of the 582 coevolving domain family pairs, 156 contain domain pairs co-crystallized in the same proteins or protein complexes. We extracted the 196 protein/protein complex structures of the 156 coevolving domain family pairs from the Protein Data Bank [
A remarkable example of the spatially coupled coevolving pair is between position 157 of the alpha-hairpin domain (accession number PF00081) and position 61 of the C-terminal domain (accession number PF02777) in iron/manganese superoxide dismutase. This domain pair ranks 82nd on the list (see
The amino acids at positions PF00081–157/PF02777–61 exhibit strong covariation between NF and FQ (N: asparagine, F: phenylalanine, Q: glutamine, see
Left: cyanobacteria. Right: human. Coevolving positions from the two domains are marked by red and blue, respectively.
Unlike PF00081–157/PF02777–61, the majority of the coevolving positions are not in direct contact: only 4.2% (203 out of 4,849) of the coevolving position pairs are less than 8 Å apart. Sequence covariation tends to occur between multiple distant sites of two domains. In large proteins or protein complexes comprising multiple domains (e.g., RNA polymerase or ribosome), sequence covariation between positions from multiple domains also occurs. This multi-way covariation reflects structural or functional constraints beyond direct pairwise interactions such as hydrogen bonds.
The spatially distant coevolving positions may reflect certain structural or functional constraints of the entire proteins/protein complexes (e.g., [
Functional Sites Overlapped/Near Inter-Domain Coevolving Positions
We use four examples to illustrate the spatial relations between inter-domain coevolving positions and functional sites of proteins.
There are 43 coevolving positions from ten protein domains in the 30S ribosomal subunit. Ribosomes synthesize proteins by binding tRNAs at three sites: the P (donor) site, the A (acceptor) site, and the E (exit) site.
Colored spheres: coevolving positions from different domains. Red ribbon: P-site. Cyan ribbon: A-site. Magenta ribbon: E-site.
There are 151 coevolving positions from ten protein domains in RNA polymerase.
There are eight coevolving positions from two protein domains in phosphoglucomutase, an enzyme that transfers the phosphoryl group of glucose or mannose from position 6 to position 1.
Red and blue spheres: coevolving positions from the two domains. Cyan ribbon: active site. Magenta ribbon: sugar binding loop. Brown ribbon: metal binding loop. Green ribbon: phosphate binding site.
There are 16 coevolving positions from two proteins in aspartate/ornithine carbamoyltransferase, an enzyme of the amino acid synthesis pathway [
Other functional sites overlapped with, or close to coevolving positions, include ADP binding sites in carbamoyl-phosphate synthase [
Each protein domain family has a different phylogenetic tree due to its distinct history of duplication and deletion. The coevolutionary model, however, requires a joint phylogenetic tree of the two families. To calculate the likelihood score, we have to extract a common subtree of the two phylogenetic trees that correspond to the coevolving part along the lineages of the two families. This problem is difficult due to the huge number of possible choices. A common approach to compare two distinct domain (gene) trees is to reconcile them with a common species tree: mapping each node in a gene tree to a node in the species tree. There are likely multiple paralogous domains mapped to the same species. Since domains belonging to different species are unlikely to coevolve, we only need to consider the domains in the same species as candidates of the coevolving partners. For simplicity, we also hypothesize that there is at most one pair of coevolving partners from each (ancient and contemporary) species. The problem of building a joint phylogenetic tree then becomes the problem of choosing the coevolving partners in each node of the species tree.
This problem is still difficult since there are many possible combinations of coevolving partners. We employed a heuristic to construct a joint tree of two domain families and to identify the coevolving partners in each species. The goal of this heuristic is to make the joint tree respect the phylogenetic trees of individual domain families and the species where they reside, to maximize the coverage of the species in the joint tree, and to reduce the spurious covariation from paralogous members. The heuristic is described in
Despite the advantages of the heuristic, certain covariation from early divergence is amplified when the topology of the domain tree does not conform with the species tree. A typical example is the position pairs between many RNA polymerase and ribosomal proteins (
To further reduce this type of covariation, we trimmed the part of the domain tree which mismatched the topology of the species tree at kingdom level. The enrichment of functionally coupled domain pairs is similar to the untreated version: 219 out of 642 inferred position pairs and 82 out of the top 100 inferred pairs were functionally coupled. Most pairs between RNA polymerase domains and between RNA polymerase and ribosomal proteins were absent from the inferred pairs. Although covariation between these domain pairs does not recur, it is still important. It is attributed to early divergence of life, and as described previously, maintains the structurally conserved region in RNA polymerase. The domain pairs inferred after removing covariation from early divergence are reported in
Our model can also detect the coevolving positions within the same protein domains. Unlike inter-domain screening, the two amino acid residues share a common phylogenetic tree. Hence spurious covariation arising from selection of paralogous proteins does not happen.
We calculated the
Two questions arising from inter-domain screening also need to be answered in intra-domain analysis. First, are coevolving positions within the same domains spatially coupled? Second, do these coevolving positions overlap with, or lie close to, functionally important sites of proteins? We extracted 401 protein structures of the 110 protein domains from the Protein Data Bank and calculated the distances between intra-domain coevolving positions. As a comparison we also calculated the distances between all position pairs in the same domain families.
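The distance comparison described above amounts to computing pairwise distances for the coevolving pairs and for all position pairs, then comparing the two distributions. A minimal sketch follows. The coordinates are made up, the choice of one representative atom per residue is an assumption (the text does not say which atoms were used), and the 8 Å cutoff is borrowed from the inter-domain analysis.

```python
import math
from itertools import combinations

def pair_distances(coords, pairs):
    """Euclidean distance (in the units of `coords`) for each (i, j) position pair."""
    return [math.dist(coords[i], coords[j]) for i, j in pairs]

def fraction_within(distances, cutoff=8.0):
    """Fraction of distances strictly below `cutoff`."""
    return sum(d < cutoff for d in distances) / len(distances)

# Hypothetical coordinates (angstroms) of one representative atom per aligned position.
coords = {1: (0.0, 0.0, 0.0), 2: (3.0, 4.0, 0.0), 3: (20.0, 0.0, 0.0), 4: (0.0, 0.0, 30.0)}
coevolving_pairs = [(1, 2)]                          # hypothetical inferred pairs
background_pairs = list(combinations(sorted(coords), 2))  # all position pairs

if __name__ == "__main__":
    print(fraction_within(pair_distances(coords, coevolving_pairs)))
    print(fraction_within(pair_distances(coords, background_pairs)))
```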
Solid blue: coevolving positions. Dotted red: background.
To check the functional importance of coevolution, we examined the intra-domain coevolving positions from the 38 domain families that contain the position pairs with
Functional Sites Overlapped/Near Intra-Domain Coevolving Positions
Two remarkable instances are the domains delta-aminolevulinic acid dehydratase (PF00490) and photosynthetic reaction centre protein (PF00124). In PF00490, there are five coevolving positions. All of them are physically close (<10 Å) in all three protein structures of the domain family. These positions partially coincide with the active sites and Mn2+-binding sites of
Analysis in the preceding sections suggests that coevolving domains are likely to be functionally coupled, and coevolving position pairs tend to be spatially coupled and located at functionally important sites. Yet the question in the reverse direction, whether physically interacting amino acid residues coevolve, is still not answered. Since the majority of the coevolving positions are not in direct contact, we expect the overlap between physical interactions and coevolving positions to be small. We extracted 223,392 physical interactions from Pfam. Interactions corresponding to the same aligned positions in the domain families were collapsed together. To reduce computational time we only considered the interactions where covarying amino acid pairs (sequences that are distinct at both positions, for example, NF and FQ) comprise more than half of the members in the domain families. Only about 20% of the interactions (45,007 out of 223,392) passed this filtering criterion. We evaluated the
In this study we propose a probabilistic graphical model to detect coevolution of amino acid residues and conduct large-scale screenings on all the inter-domain position pairs, intra-domain position pairs, and known domain residue interactions. Despite the large number of pairwise comparisons executed, the inferred results strongly suggest that coevolving domains and positions are functionally and spatially coupled. The majority of coevolving protein domains appear in the same proteins or share the same functional categorization. Coevolving positions between and within protein domains are substantially closer than the background distribution. Moreover, the coevolving positions in many proteins coincide with functionally important sites such as the subunit linkers of superoxide dismutase, tRNA-binding sites of ribosomes, and active sites of phosphoglucomutase.
Most top-ranking coevolving domain pairs are involved in fundamental functions of life: ribosomal proteins, RNA polymerase, carbon metabolism, vitamin B12 dependent enzymes, and so on. This is probably because these ancient proteins have strict structural constraints. Our model implicitly favors the case where covarying sequences maintain the structural constraints. In addition, the stringent filtering criteria of sequence covariation and a wide coverage of species required for significant scores may also exclude the lineage-specific coevolution. To detect coevolution in these variable families (such as transcription factors, receptors, and signal transduction proteins), a targeted search on more extensive sampling of a specific clade and relaxed criteria for covariation are probably required.
Since simultaneous changes of multiple nucleic or amino acids are unlikely, there must exist “transition states” between optimal configurations during evolution. These transition states may disappear in contemporary species due to their deleterious effects. In RNAs, however, we do observe non-pairing or wobbling bases in a stem. Transition states also appear in the coevolving protein domains. For example, although position pair PF00081–157/PF02777–61 in superoxide dismutase is dominated by NF and FQ pairs, there are also a few other states including FF, FE, FP, and FR. FF can serve as a transition state between NF and FQ. Intriguingly, the distance between an FR pair is 9.46 Å (PDB id 1coj), indicating the two residues are not in contact. This suggests the transition states of amino acids may be accommodated by structural variation.
Our inferred results, in agreement with previous studies of protein coevolution, reveal a fundamental difference between protein and RNA coevolution. Typically RNA coevolution occurs in disjoint nucleic acid pairs that form hydrogen bonds and are in direct contact in the 3-D structure. In contrast, there are often multiple coevolving amino acid residues in a protein, and some of them are distant in the 3-D structure. Coevolution of multiple and distant amino acid residues probably results from multiple selective constraints. Some possible explanations include the coupling of binding energy via pathways in the protein, interactions with intermediate molecules such as water, and the global constraints pertaining to the conformation of a region in a protein.
The diverse causes of protein coevolution also make validation of computational methods problematic. Unlike RNAs, there is no gold standard for a coevolutionary protein dataset. We validated the findings with indirect evidence such as the enrichment of functionally coupled domains defined by GO categories, distance distribution in protein structures, and annotations of the functions of the coevolving sites. More appropriate validation procedures and datasets may become available as our understanding of protein coevolution improves.
The existence of paralogous genes adds difficulty to the analysis of coevolution. When there are multiple paralogous domains in a family, we have to assign coevolving partners from all possible combinations. Our heuristic method reduces, yet cannot eliminate, spurious covariation from paralogous families. A better algorithm for dealing with paralogous genes is needed.
To facilitate large-scale screening we applied several simplifying assumptions and procedures. First, we applied the same sequence substitution rate matrix (the Dayhoff matrix) to all the domain families. Rate variation across domains or different sites within the same domains may create spurious covariation [
The distribution of
Coevolution probably only occurs in a small fraction of physical interactions. Nevertheless, we also demonstrate that coevolution manifests spatial and functional constraints other than direct interactions. Hence, the complex relations between coevolution and selective constraints are worth pursuing at a deeper level.
The sequence substitution of a single amino acid is modeled by a CTMP [
Define
A true coevolutionary model should reward transitions into the sequence states of selective advantages and penalize the transitions of opposite directions. Due to the difficulty of finding this true model, we constructed a simplified model by reweighting the entries of the independent rate matrix to penalize single transitions and to reward double transitions:
Transitions of single amino acids are penalized by multiplying a fixed number
Its value forces the diagonal entries in
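The reweighting idea can be made concrete with a small sketch. It builds the independent joint rate matrix of two sites as the Kronecker sum of a single-site rate matrix with itself (so double changes have zero rate), scales single changes by a factor `eps`, and assigns a positive rate to double changes. The names `eps` and `double_rate`, the constant double-change rate, and the toy three-state matrix `Q1` are assumptions for illustration; the exact reweighting formula of the model is not reproduced in this text. The diagonal is reset so that every row sums to zero, as any rate matrix requires.

```python
def independent_joint_rates(q1):
    """Joint rate matrix of two independently evolving sites (Kronecker sum of q1 with itself)."""
    n = len(q1)
    states = [(a, b) for a in range(n) for b in range(n)]
    q2 = [[0.0] * len(states) for _ in states]
    for i, (a, b) in enumerate(states):
        for j, (c, d) in enumerate(states):
            if a == c and b == d:
                q2[i][j] = q1[a][a] + q1[b][b]
            elif b == d:            # only the first site changes
                q2[i][j] = q1[a][c]
            elif a == c:            # only the second site changes
                q2[i][j] = q1[b][d]
            # double changes keep rate 0 under independence
    return states, q2

def coevolved_rates(q1, eps, double_rate):
    """Reweight the independent matrix: single changes * eps, double changes = double_rate."""
    states, q2 = independent_joint_rates(q1)
    qc = [[0.0] * len(states) for _ in states]
    for i, (a, b) in enumerate(states):
        for j, (c, d) in enumerate(states):
            if i == j:
                continue
            changes = (a != c) + (b != d)
            qc[i][j] = eps * q2[i][j] if changes == 1 else double_rate
        qc[i][i] = -sum(qc[i])      # diagonal makes the row sum to zero
    return states, qc

# Toy three-state single-site rate matrix (hypothetical; stands in for a 20-state matrix).
Q1 = [[-2.0, 1.0, 1.0], [1.0, -2.0, 1.0], [1.0, 1.0, -2.0]]

if __name__ == "__main__":
    states, qc = coevolved_rates(Q1, eps=0.5, double_rate=0.2)
    print(states[0], "->", states[4], "rate", qc[0][4])
```

Setting `eps=1.0` and `double_rate=0.0` recovers the independent model, which is why the null model is nested in the alternative.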
To rank the coevolving domain pairs (or single domains), we need to assess the likelihood scores which take all the coevolving positions between the two domains into account. We treated the model of all coevolving positions as a probabilistic graphical model in both space and time (
(Top) A space-time model of three positions in three species. There are three pairwise interactions (1 2), (2 3), (3 1) in each species.
(Middle) First approximation of
(Bottom) Second approximation of
It is in general difficult to evaluate the marginal likelihood of this network. We simplified the problem by adopting two approximations. First we approximated the spatial dependency network by its maximum spanning tree (
We assumed the conditional probability from the coevolving positions in a parent species to the same set of positions in a child species followed a form similar to
The first approximation is still intractable since it has to sum over all possible states of all coevolving positions. To further simplify the problem, we performed marginalization for each singlet and pairwise term separately and combined these terms using
are the pairwise and singlet marginal likelihood evaluated by dynamic programming.
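The two approximations can be sketched in code. The first function extracts a maximum spanning tree from pairwise scores using Kruskal's algorithm. The second combines pairwise and singlet log marginal likelihoods with the standard tree factorization (sum of pairwise terms over tree edges, minus each singlet term counted degree minus one times). That factorization is an assumption standing in for the combining formula, which is not reproduced in this text, and all numbers below are invented.

```python
def maximum_spanning_tree(n_nodes, weighted_edges):
    """Kruskal's algorithm over (weight, i, j) edges taken in decreasing weight order."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for _weight, i, j in sorted(weighted_edges, reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:               # adding the edge creates no cycle
            parent[root_i] = root_j
            tree.append((i, j))
    return tree

def tree_log_likelihood(tree_edges, log_pair, log_single):
    """Tree factorization: sum of pairwise logs minus (degree - 1) * singlet logs."""
    degree = {}
    total = 0.0
    for i, j in tree_edges:
        total += log_pair[(i, j)]
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    for node, deg in degree.items():
        total -= (deg - 1) * log_single[node]
    return total

# Three positions with pairwise interactions (0 1), (1 2), (2 0); scores are hypothetical.
edges = [(5.0, 0, 1), (3.0, 1, 2), (1.0, 2, 0)]
log_pair = {(0, 1): -10.0, (1, 2): -12.0, (2, 0): -15.0}
log_single = {0: -6.0, 1: -7.0, 2: -8.0}

if __name__ == "__main__":
    tree = maximum_spanning_tree(3, edges)
    print(tree, tree_log_likelihood(tree, log_pair, log_single))
```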
The marginal likelihood of the independent model is the product of the marginal likelihood for each position and can be exactly evaluated. The likelihood ratio in the bottom row of
It is costly to evaluate the coevolutionary likelihood scores. Hence we applied three filtering criteria to all 1.171 × 10^11 inter-domain position pairs and all 8.29 × 10^7 intra-domain position pairs. First, we discarded the sequences that contained gaps in more than half of their members. Second, we discarded the conserved sequences where one amino acid pair occurred in more than 75% of the members. Third, for each of the remaining position pairs, we identified a maximal set of covarying amino acid pairs (amino acid pairs which are distinct at both positions, e.g., NF and FQ), and counted the number of occurrences for each amino acid pair. We only considered the sequences where the maximal set of covarying amino acid pairs constituted more than 80% of the members. The first two criteria filtered out the position pairs dictated by gaps and conserved amino acid pairs. The third criterion filtered out the sequences which were expected to have low
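The three filters can be sketched as a single predicate over a pair of alignment columns. Several details are assumptions: gaps are encoded as `-`, all fractions are taken over the total number of members, and the maximal covarying set is approximated greedily by taking pairs in decreasing frequency as long as they differ from every chosen pair at both positions. The procedure actually used to find the maximal set is not specified in this text.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol

def greedy_covarying_set(pair_counts):
    """Greedily pick frequent pairs that differ from every chosen pair at both positions."""
    chosen, used_first, used_second = [], set(), set()
    ordered = sorted(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for (a, b), _count in ordered:
        if a not in used_first and b not in used_second:
            chosen.append((a, b))
            used_first.add(a)
            used_second.add(b)
    return chosen

def passes_filters(column_a, column_b, max_gap=0.5, max_conserved=0.75, min_covarying=0.8):
    """Apply the gap, conservation, and covariation filters to one position pair."""
    pairs = list(zip(column_a, column_b))
    n = len(pairs)
    gapped = sum(GAP in p for p in pairs)
    if gapped > max_gap * n:                       # filter 1: too many gaps
        return False
    counts = Counter(p for p in pairs if GAP not in p)
    if not counts:
        return False
    if max(counts.values()) > max_conserved * n:   # filter 2: one pair dominates
        return False
    covered = sum(counts[p] for p in greedy_covarying_set(counts))
    return covered > min_covarying * n             # filter 3: covarying pairs dominate

if __name__ == "__main__":
    print(passes_filters("NNNNNFFFFF", "FFFFFQQQQQ"))
```

For example, a column pair made of five NF and five FQ members passes all three filters, while a pair with nine NF members and one FQ member is rejected as conserved.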
To further reduce computation time and error, we applied the Padé polynomial approximation for matrix exponentiation [
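For readers unfamiliar with Padé-based matrix exponentiation, the following is a minimal pure-Python sketch of a diagonal Padé approximant combined with scaling and squaring. The approximation order, the scaling rule, and the helper names are assumptions chosen for illustration, not the settings of the implementation used in this study. Production code would normally call a tested linear algebra library instead.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mat_add(a, b, scale=1.0):
    """Return a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def identity(n):
    return [[float(i == j) for j in range(n)] for i in range(n)]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def expm_pade(a, order=6):
    """exp(a) via a diagonal Pade approximant with scaling and squaring."""
    n = len(a)
    norm = max(sum(abs(v) for v in row) for row in a)
    # Scale so that the matrix norm is at most about 0.5 before approximating
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scaled = [[v / 2 ** squarings for v in row] for row in a]
    numer, denom, power = identity(n), identity(n), identity(n)
    coeff = 1.0
    for k in range(1, order + 1):
        coeff *= (order - k + 1) / (k * (2 * order - k + 1))
        power = mat_mul(power, scaled)
        numer = mat_add(numer, power, coeff)              # sum of c_k * A^k
        denom = mat_add(denom, power, coeff * (-1) ** k)  # sum of c_k * (-A)^k
    result = solve(denom, numer)
    for _ in range(squarings):                            # undo the scaling
        result = mat_mul(result, result)
    return result

if __name__ == "__main__":
    t = 0.7
    print(expm_pade([[-t, t], [t, -t]]))
```

The two-state rate matrix in the usage example has a closed-form exponential, which makes it a convenient correctness check.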
To evaluate the coevolutionary likelihood of an inter-domain position pair, a joint phylogenetic tree and representatives from each species in each domain are needed. We selected the species that contained both domains and built a binary species tree on selected species by extracting the hierarchy from the National Center for Biotechnology Information taxonomy [
The procedures of building a joint tree and selecting representatives are described in
As a comparison we calculated mutual information between the 3,379,517 inter-domain position pairs that passed the filtering criteria. Denote
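As a point of reference, pairwise mutual information between two alignment columns can be computed from empirical frequencies alone, with no tree involved. The sketch below uses natural logarithms and treats every symbol, including any gap character, as an ordinary state. Both choices are assumptions rather than details taken from this study.

```python
import math
from collections import Counter

def mutual_information(column_a, column_b):
    """Empirical mutual information (in nats) between two aligned columns."""
    n = len(column_a)
    joint = Counter(zip(column_a, column_b))
    count_a = Counter(column_a)
    count_b = Counter(column_b)
    # p_ab * log(p_ab / (p_a * p_b)), written with raw counts
    return sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )

if __name__ == "__main__":
    print(mutual_information("NNFF", "FFQQ"))  # perfectly covarying columns
    print(mutual_information("NNFF", "FQFQ"))  # statistically independent columns
```

Because the score depends only on the column counts, two alignments with identical counts receive the same value regardless of how the changes are distributed over the phylogeny, which is the limitation the comparison in the text is meant to expose.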
The large-scale screenings, including filtering position pairs by sequence covariation, building the joint phylogenetic tree for domain family pairs, calculating pairwise coevolutionary scores, and evaluating the joint likelihood scores of the entire domains/domain pairs, were implemented in C programs and executed on a Rackable Linux cluster (2048 AMD Opteron processors, 2.2 GHz). The total CPU time was 24,000 h for inter-domain screening, 1,600 h for intra-domain screening, and 300 h for evaluating the
The
The false discovery rate of coevolving position pairs: We evaluated the false discovery rate of multi-hypotheses testing using the approximation procedures in [
The total number of position pairs
The
We downloaded 196 protein structures for the 582 inter-domain family pairs and 401 protein structures for the 110 intra-domain families from the Protein Data Bank [
The accession numbers listed in this paper from the Protein Data Bank (
We thank Tom Pringle for comments about the manuscript and Robert Baertsch for technical help with the
A-site: ribosome acceptor site
CTMP: continuous-time Markov process
E-site: ribosome exit site
F: phenylalanine
N: asparagine
P-site: ribosome donor site
PDB: Protein Data Bank
Q: glutamine