The authors have declared that no competing interests exist.
Conceived and designed the experiments: CMZ AG. Performed the experiments: CMZ. Analyzed the data: CMZ AG. Contributed reagents/materials/analysis tools: CMZ AG. Wrote the paper: CMZ AG.
Evolutionary innovation in eukaryotes and especially animals is at least partially driven by genome rearrangements and the resulting emergence of proteins with new domain combinations, and thus potentially novel functionality. Given the random nature of such rearrangements, one could expect that proteins with particularly useful multidomain combinations may have been rediscovered multiple times by parallel evolution. However, existing reports suggest a minimal role of this phenomenon in the overall evolution of eukaryotic proteomes. We assembled a collection of 172 complete eukaryotic genomes that is not only the largest, but also the most phylogenetically complete set of genomes analyzed so far. By employing a maximum parsimony approach to compare repertoires of Pfam domains and their combinations, we show that independent evolution of domain combinations is significantly more prevalent than previously thought. Our results indicate that about 25% of all currently observed domain combinations have evolved multiple times. Interestingly, this percentage is even higher for sets of domain combinations in individual species, with, for instance, 70% of the domain combinations found in the human genome having evolved independently at least once in other species. We also show that previous, much lower estimates of this rate are most likely due to the small number and biased phylogenetic distribution of the genomes analyzed. The independent emergence of identical domain combinations is widespread and is not limited to domains in specific functional categories. Besides data from large-scale analyses, we also present individual examples of independent domain combination evolution. The surprisingly large contribution of parallel evolution to the development of the domain combination repertoire in extant genomes has profound consequences for our understanding of the evolution of pathways and cellular processes in eukaryotes and for comparative functional genomics.
Most proteins in eukaryotes are composed of two or more domains, evolutionarily independent units with (often) their own individual functions. The specific repertoire of multidomain proteins in a given species defines the topology of pathways and networks that carry out its metabolic and regulatory processes. When proteins with new domain combinations emerge by gene fusion and fission, this directly affects the topology of cellular networks in the organism. To better understand the evolution of such networks, we analyzed a large set of eukaryotic genomes for the evolutionary history of known domain combinations. Our analysis shows that 70% of all domain combinations present in the human genome independently appeared in at least one other eukaryotic genome. Overall, more than 25% of all known multidomain architectures emerged independently several times in the history of life. The difference between the global and the species-specific picture can be explained by the existence of a core set of domain combinations that keeps reemerging in different species, accompanied by a smaller number of unique domain combinations that do not appear anywhere else.
Most eukaryotic proteins are composed of multiple domains, units with their own evolutionary history and, often, specific and conserved functions. The ordered arrangement of all domains in a given protein constitutes its architecture. Protein architecture can also be described in a simplified way as a list of binary domain combinations. While not completely equivalent, both views provide similar insights, and in this manuscript we will predominantly use the latter. Many domains can combine with different partner domains and, as a result, form a wide variety of domain combinations, often even within the same species
The domain repertoires of most eukaryotes are remarkably similar, both in size and in content
However, one can expect that the process of domain shuffling could lead not only to the emergence of completely new domain combinations, but also to the independent emergence of domain combinations already present in other, even distantly related, organisms. In fact, it has been shown that certain domain combinations observed in proteins involved in innate immunity have evolved independently several times
We have collected complete sets of predicted proteins for 172 eukaryotic genomes representing five out of six eukaryotic supergroups
The six “supergroups”—Opisthokonta, Amoebozoa, Archaeplastida, Chromalveolata, Rhizaria, and Excavata—are shown (the placement of Excavata is under debate)
All proteins from all genomes were analyzed for the presence of protein domains, as defined by the Pfam database (version 25.0)
Cutoff | Matched proteins (%) [average] | Multidomain proteins (%) | Number of domains that only appear in single-domain proteins | Number of domains that only appear in multidomain proteins | Number of domains that appear in both single- and multidomain proteins | Independent domain combination evolution (%) | Independent domain combination evolution on Pfam-clan level (%) |
 | 40–94 [76] | 32 | 1,523 | 576 | 6,095 | | |
 | 40–93 [76] | 32 | 1,457 | 525 | 5,963 | | |
 | 45–96 [81] | 50 | 652 | 2,171 | 7,340 | | |
 | 38–94 [75] | 48 | 694 | 1,316 | 6,337 | | |
 | 33–91 [69] | 46 | 786 | 1,034 | 6,100 | | |
Overall, 34,778 distinct domain combinations were found in the 172 genomes analyzed here. A total of 22,241 of these appear in just one genome, and only 33 (listed in
In order to analyze the genomic domain combination content, we described each multidomain protein as a set of directed binary domain combinations. For example, a protein composed of domains A, B, and C (listed in the direction from the N-terminus to the C-terminus) is described as a set of the three binary combinations A∼B, B∼C, and A∼C. We retained information about the domain order, i.e., domain combination A∼B is not the same as B∼A. Combinations between the same domains were not included in the analysis (e.g., a protein with the architecture A-B-B would be decomposed into only one binary combination, namely A∼B). The rationale for this is that combinations between the same domains can be a result of local duplication, ancestral descent, or domain fusion and the only approach to distinguish between these would be by explicit phylogenetic analysis of each domain.
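The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `binary_combinations` and the representation of an architecture as a list of domain names ordered from N-terminus to C-terminus are hypothetical choices made for this example.

```python
from itertools import combinations

def binary_combinations(architecture):
    """Return the set of directed binary domain combinations of one protein.

    `architecture` is a list of domain names ordered from the N-terminus to
    the C-terminus (hypothetical input format). Order is kept, so ("A", "B")
    is distinct from ("B", "A"); pairs of identical domains are skipped.
    """
    # itertools.combinations preserves input order, so the first element of
    # each pair always lies N-terminal to the second one.
    return {(a, b) for a, b in combinations(architecture, 2) if a != b}

print(binary_combinations(["A", "B", "C"]))  # the three pairs A~B, B~C, A~C
print(binary_combinations(["A", "B", "B"]))  # only A~B; B~B is excluded
```

Using a set means that a combination occurring several times in the same protein is recorded once, which matches the A-B-B example in the text.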
First, we simply counted how many distinct domains and domain combinations each of the 172 genomes contains. The average number of domains per protein is 1.7 for all the genomes analyzed here; however, for animal genomes this number is higher (2.0). The distribution of domains in multidomain proteins is not uniform, with 1,448 domains appearing exclusively in single-domain proteins and 535 appearing only in multidomain proteins (see
The number of distinct domains per genome shows little variance between different organisms (with the exception of some parasitic species) and, when including inferred sets for the ancestral species, generally displays a decreasing trend in going from the last common eukaryotic ancestor to large multicellular organisms
The colors used correspond to the colors in
Standard deviations are shown as error bars. The asterisk is used to indicate the results for Deuterostomia under exclusion of the amphioxus
As mentioned above, in our analysis of 172 genomes, 22,241 (out of 34,778) domain combinations appear only once and thus are specific to a single species. These species-specific domain combinations are relatively evenly distributed, with 95 out of the 172 analyzed genomes having between 10 and 100 domain combinations that are specific to the individual species (17 species have fewer than 10 specific domain combinations, and 60 have more than 100), with a median value of 57 (see
This figure shows the numbers of clade-specific domain combinations (black numbers after the slash) and core domain combinations (black numbers before the slash) for select clades. Below these are the numbers of clade-specific domains (gray numbers after the slash) and core domains (gray numbers before the slash). Numbers in brackets refer to domain combination counts under exclusion of the amphioxus
We also investigated the numbers and types of domain combinations that are not only exclusive to a given clade, but also appear in
Pfam domains | Description |
Disintegrin∼ADAM_CR | Found in |
Exostosin∼Glyco_transf_64 | Found in Exostosin-like 1 proteins that are transmembrane glycosyltransferases of the endoplasmic reticulum and are involved in the biosynthesis of heparan sulfate proteoglycans. |
FG-GAP∼Integrin_alpha2 | Found in Integrin alpha-1, a receptor for laminin and collagen, and is believed to function in cell-matrix adhesion and integrin-mediated signaling pathway. |
I-set∼fn3, fn3∼I-set | Found in a variety of proteins, including axon-associated and neural cell adhesion molecules (NCAMs, Fasciclins, Contactins), glycoproteins expressed on the surface of neurons, glia, skeletal muscle, and natural killer cells that have been implicated as having a role in cell–cell adhesion, neurite outgrowth, synaptic plasticity, and learning and memory. |
MH1∼MH2 | Found in SMAD transcription factors. |
PAX∼Homeobox | Found in paired box (PAX) transcriptional regulators. |
Pou∼Homeobox | Found in POU domain transcription factors. |
zf-C4∼Hormone_recep | Found in a wide variety of transcription factors, including nuclear hormone receptor family members. |
The 9 domain combinations exclusively found in all 48 animal genomes analyzed. For a complete list, see
Next, we investigated the evolutionary history of domain combinations: do they tend to appear once and then be inherited by the descendants (as, for example, is the case for the nine domain combinations listed in
The complete diagram on which this simplified version is based is available in the supplementary materials.
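One way to ask whether a domain combination arose once or several times is to reconstruct its presence or absence on the species tree by parsimony and count the inferred gains. The sketch below uses Fitch parsimony on a toy tree. This is an assumption made for illustration: this section does not state which parsimony variant was used, and the tree, the species names A to E, and the rule of resolving an ambiguous root toward "absent" are all hypothetical.

```python
# Toy species tree as nested tuples; leaves are species names (hypothetical).
TREE = (("A", "B"), ("C", ("D", "E")))

def fitch_sets(node, present):
    """Bottom-up Fitch pass: return (possible_states, annotated_children)."""
    if isinstance(node, str):  # leaf: the state is observed (1 = present)
        return ({1} if node in present else {0}, [])
    kids = [fitch_sets(child, present) for child in node]
    common = set.intersection(*(k[0] for k in kids))
    # Fitch rule: intersection of the child sets if non-empty, else their union.
    states = common or set.union(*(k[0] for k in kids))
    return (states, kids)

def count_gains(annotated, parent_state=0):
    """Top-down pass: keep the parent's state when allowed, count 0 -> 1 steps."""
    states, kids = annotated
    state = parent_state if parent_state in states else next(iter(states))
    gained = 1 if (parent_state == 0 and state == 1) else 0
    return gained + sum(count_gains(k, state) for k in kids)

def independent_gains(present):
    """Number of inferred origins of a combination found in the `present` species."""
    return count_gains(fitch_sets(TREE, present))

print(independent_gains({"A", "B"}))       # 1: a single origin in the (A, B) ancestor
print(independent_gains({"A", "B", "D"}))  # 2: a second, independent origin in D
```

A combination with more than one inferred gain would be counted as having evolved independently. Other tie-breaking rules or parsimony variants (for example, ones that weight gains and losses differently) can give different counts for the same data.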
The results of analyzing each domain combination in this manner show that a significant number of domain combinations emerged independently multiple times (
The histogram in
Number of reappearances | Domain combination | Description of domains (Pfam clans are in square brackets) | Comment |
32 | zf-MYND∼ | MYND finger [TRASH (CL0175)]; SET domain | Present in N-lysine methyltransferases. |
29 | IMS∼IMS_HHH; IMS_HHH∼IMS_C | IMS∼IMS_HHH: impB/mucB/samB family; IMS family HHH motif [HHH (CL0198)]. IMS_HHH∼IMS_C: IMS family HHH motif [HHH (CL0198)]; impB/mucB/samB family C-terminal | Domain architecture IMS - IMS_HHH - IMS_C is present in DNA polymerases (i.e. kappa, IV) involved in DNA repair and present in species ranging from bacteria to humans. |
28 | | ankyrin repeat [Ank (CL0465)]; DHHC zinc finger domain [Zn_Beta_Ribbon (CL0167)] | Present in palmitoyltransferases. |
 | | ankyrin repeat [Ank (CL0465)]; Regulator of chromosome condensation (RCC1) repeat [Beta_propeller (CL0186)] | In vertebrates, the architecture Ank_2 - RCC1 - BTB is present in inhibitor of Bruton tyrosine kinase proteins. |
 | GST_C∼tRNA-synt_1c_C | Glutathione S-transferase (C-term.) [GST_C (CL0497)]; tRNA synthetases class I (E and Q), anti-codon binding domain | Domain architecture GST_C - tRNA-synt_1c - tRNA-synt_1c_C is present in Glutamyl-tRNA synthetases. |
27 | GST_C∼tRNA-synt_1c | Glutathione S-transferase (C-term.) [GST_C (CL0497)]; Aminoacyl tRNA synthetases, class I [tRNA_synt_I (CL0038)] | See above. |
 | Hexapep∼W2 | Bacterial transferase hexapeptide; eIF4-gamma/eIF5/eIF2-epsilon [TPR (CL0020)] | Present in translation initiation factor eIF-2B subunit epsilon. |
 | SMC_hinge∼SMC_N | SMC proteins Flexible Hinge Domain; RecF/RecN/SMC N terminal domain [P-loop_NTPase (CL0023)] | Found in structural maintenance of chromosomes proteins, and bacterial chromosome partition proteins. |
26 | IBN_N∼HEAT | Importin-beta N-terminal domain [TPR (CL0020)]; HEAT repeat domain (related to armadillo/beta-catenin-like repeats) [TPR (CL0020)] | Found in Importin-4, Importin subunits beta-1 and beta-4, and in Transportin (which also includes HEAT-like repeats). |
 | | Zinc finger, C3HC4 type (RING finger) [RING (CL0229)]; IBR (In Between Ring fingers) domain | Found in proteins that have been suggested to accept ubiquitin from specific E2 ubiquitin-conjugating enzymes and then transferring it to various substrates |
Domains described as promiscuous in
To test whether longer proteins are more likely to contain reoccurring domain combinations than shorter ones, we compared the average lengths of proteins that contain reoccurring domain combinations to those that do not. Proteins containing reappearing domain combinations have an average length of 782 residues (median: 589), slightly longer than proteins with non-reappearing domain combinations, which average 712 residues (median: 534). On the other hand, the average number of domains in these two groups is almost identical (∼3.3). We also compared the average lengths of the domains themselves in the two groups. The results support the observation that repeated domain combinations tend to appear in longer multidomain proteins, but the preference is not very strong.
We also investigated how the specific choices of parameters affect these numbers. In particular, we tested a range of cutoff E-values (from 1e−3 to 1e−15), as well as domain-specific “trusted” and “noise” cutoff scores from the Pfam database (instead of gathering cutoff values)
One can argue that an unexpectedly high rate of parallel evolution events is due to potential false negatives caused by highly divergent domains. In this scenario, the ancestral domain combination would be wrongly counted as having been lost and replaced by an apparently independently evolving domain combination, while in fact the “new domain” would be simply a divergent version of the old domain. To test this hypothesis, we performed an analysis analogous to the one described above, but on the level of Pfam-clans (groups of domains that are believed to have originated from a common ancestor, but at a much earlier point in evolution
Next, we investigated whether parallel domain combination evolution is equally prevalent in all subtrees of the eukaryotic tree of life or whether some branches differ in their propensity for parallel domain combination evolution. Related to this issue is the question of how large the evolutionary distances between pairs of independently evolved domain combinations are. For this purpose, we calculated the last common ancestor (LCA) for each pair of independently evolved domain combinations and then counted, for each internal node of the eukaryotic tree of life, the number of such pairs for which that node is the LCA. These counts were then normalized by the number of species descending from each node. The results for the nodes with the highest rates are summarized in
Normalized (by the number of genomes) sums of independently evolved domain combinations across major splits on the eukaryotic tree of life are shown. “Opistho” stands for Opisthokonta and “Choano” stands for Choanoflagellata. Ambulacraria is a clade that includes echinoderms and hemichordates.
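The LCA-based counting and normalization can be sketched as follows. This is a minimal illustration on a toy tree, not the authors' code: the `PARENT` map, the species and combination names, and all function names are hypothetical, and each origin is simply given as the tree node at which a combination is assumed to have arisen.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy tree given as a child -> parent map.
PARENT = {
    "human": "Metazoa", "fly": "Metazoa",
    "yeast": "Fungi", "mushroom": "Fungi",
    "Metazoa": "Opisthokonta", "Fungi": "Opisthokonta",
    "Opisthokonta": "root", "alga": "root",
}

def path_to_root(node):
    """List of nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lca(a, b):
    """Last common ancestor of two nodes."""
    ancestors = set(path_to_root(a))
    return next(n for n in path_to_root(b) if n in ancestors)

def leaf_count(node):
    """Number of species (leaves) descending from `node`."""
    children = [c for c, p in PARENT.items() if p == node]
    return 1 if not children else sum(leaf_count(c) for c in children)

def normalized_lca_counts(origins_per_combination):
    """Count, per node, the pairs of independent origins it is the LCA of,
    then divide by the number of species below that node."""
    counts = Counter()
    for origins in origins_per_combination.values():
        for a, b in combinations(origins, 2):
            counts[lca(a, b)] += 1
    return {node: n / leaf_count(node) for node, n in counts.items()}

# Each (hypothetical) combination maps to the nodes where it arose independently.
origins = {"X~Y": ["human", "yeast"], "P~Q": ["fly", "alga"], "M~N": ["Metazoa", "mushroom"]}
print(normalized_lca_counts(origins))  # {'Opisthokonta': 0.5, 'root': 0.2}
```

In this toy example two pairs of origins meet at the Opisthokonta node (four species) and one pair meets at the root (five species), which yields the normalized values 0.5 and 0.2.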
In the following, we present some examples of parallel evolution of domain combinations between animals and fungi, Amoebozoa, and green plants (see
The complete diagram on which this simplified version is based is available in the supplementary materials (which explains that both major groups of fungi, Basidiomycota and Ascomycota, have one independent domain fusion event each).
The complete diagram on which this simplified version is based is available in the supplementary materials.
An example of independent domain combination evolution between animals (Neoptera, winged insects, in this particular case) and Dikarya (the subkingdom of fungi that includes the two major phyla Ascomycota and Basidiomycota) is shown in
Another example of parallel evolution is the Amidohydrolase∼Aspartate/ornithine carbamoyltransferase combination that evolved independently in Metazoa and in
A more-distant parallel evolution example is the evolution of the K Homology (KH)∼DEAD/DEAH box helicase combination that appeared independently in Bilateria and in Micromonas (a group of green algae) (
All the values presented here depend critically on the number and phylogenetic distribution of genomes analyzed, the size of the domain database, and the significance thresholds (and sensitivity) for domain assignments. We evaluated the effects of these factors on our results, especially because many earlier papers reporting somewhat similar analyses reached conclusions that contradict ours. To understand these apparent discrepancies, we repeated our analyses using a reduced number of genomes, different domain recognition thresholds, and smaller domain databases, mimicking the approaches used in earlier works. The results from these analyses confirm that the differences between the earlier and the current results stem mostly from the increase in the number of analyzed genomes and in the size of the domain databases. For instance, using only five genomes resulted in a reemergence percentage similar to the estimates presented in (
Similarly, we can show that other differences between our results and those of previous analyses are mostly due to changes in the number of genomes and the size of the domain database. For instance, analysis of five eukaryotic genomes and domain definitions from the SCOP 1.53 database
Unfortunately, we cannot completely exclude the effects of erroneous gene models. To partially address this problem, we performed our analysis both with and without the one genome with the most unusual domain combinations (that of the amphioxus
There are several simplifications made in our model that likely lead to underestimating the number of independently emerging domain combinations. First, since our analysis is not based on domain trees (evolutionary trees built for specific domains), our results do not take into account parallel domain combination evolution within a genome, i.e., between paralogs in large protein families (this is in contrast to the study performed in
Finally, we would like to point out that while the tree of life shown in
Our analysis shows that the number of distinct domain combinations per genome varies greatly between different groups of species and increases systematically with their complexity. This increase matches the intuitive meaning of “complexity” as related to differentiation between cell types in an organism, which typically results from the interactions between multidomain regulatory processes.
The main result presented in this paper, namely the fact that at least 25% of all known and 75% of all recurring domain combinations have evolved independently, is less intuitive. On one hand, it is an obvious effect of the plasticity of eukaryotic genomes, with genome rearrangements constantly reshuffling existing domain combinations. On the other hand, it is interesting that this apparently random process leads to repeated reemergence of the same domain arrangements. The genomes analyzed in this work contain a total of 8,023 distinct domains, which would allow the formation of about 64 × 10^6 distinct directed domain combinations. And yet in the genomes analyzed here, we observed a total of only 34,778 domain combinations, which corresponds to only about 0.05% of the theoretical maximum. Therefore, we can speculate that the process of domain recombination is not entirely random and that organisms evolved some mechanisms that constrain the process of domain recombination in such a way that the chances of harmful, nonsensical arrangements are decreased. Here, we can only speculate about possible mechanisms to implement such constraints, but, for example, this could be achieved via the specific distribution of transposable elements and/or chromosomal locations of preferred recombination “hot spots.”
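The size of the theoretical combination space can be checked with a short calculation. The sketch below assumes that the theoretical maximum counts ordered pairs of two different domains, in line with the directed combinations and the exclusion of same-domain pairs described earlier; the variable names are illustrative.

```python
n_domains = 8023   # distinct domains in the analyzed genomes (from the text)
observed = 34778   # distinct domain combinations observed (from the text)

# Ordered pairs of two different domains: A~B and B~A count separately,
# and A~A is excluded (assumption stated in the lead-in).
max_directed = n_domains * (n_domains - 1)
fraction = observed / max_directed

print(f"{max_directed:,}")  # 64,360,506, i.e. about 64 million
print(f"{fraction:.2%}")    # 0.05%
```

Allowing same-domain pairs as well would give 8,023 squared, or 64,368,529, which does not change the rounded figures.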
The extent to which domain combinations emerged independently is even more striking when viewed from the perspective of individual species. Over 70% of the domain combinations present in the human genome, and about 70% for all vertebrates, have evolved independently in other species at least once. This apparent discrepancy between the global and per-species averages is caused by a large number (over 22,000) of unique, species-specific domain combinations, which, while rare in individual species (about 130 on average, with a median of 57), add up to a large percentage over all species. One can argue that we are seeing two types of domain combinations: “universal, reemerging domain combinations” and “clade-specific, non-reemerging domain combinations.”
One might speculate that domains that tend to appear in independently evolved domain combinations could be functionally different from those that make up combinations that only appeared once. This seems not to be the case, though: preliminary studies using a variety of methods and tools (such as Gene Ontology term enrichment analysis) indicate that there is no significant correlation between domain function and the tendency of domains to appear in independently evolved domain combinations. Similarly, strong correlation between domain “promiscuity”
The observations presented in this paper have important consequences for interpreting similarities and differences between genomes of distantly related organisms. Usually, the discovery of a protein with a known domain architecture in a newly studied species is taken as an argument for evolutionary conservation of the function of that protein. This is of particular importance when attempting to transfer protein function from distantly related model organisms, such as from the ecdysozoans
Besides estimating the rate of independent domain evolution, we also assessed the number of clade-specific domains and domain combinations. All branches of life (at all levels) have unique domain combinations (combinations not shared with other branches). Due to unequal sampling, it is difficult to compare these numbers. Nevertheless, some issues are worth mentioning. While, as expected, animals have the largest number of unique domain combinations (∼12,800, based on 48 genomes, compared to ∼4,800 in fungi based on 61 genomes and ∼3,700 in green plants based on 33 genomes), within animals there appears to be little-to-no correlation between the number of unique domain combinations and morphological complexity. For example, mammals have ∼400 unique domain combinations from 10 genomes, whereas Arthropoda have more than three times that number (∼1,500 from 12 genomes). Clearly, the number of unique domain combinations does not explain the complexity of mammals. In this context, we introduced the concept of clade core domain combinations, combinations exclusively found in each genome of a given clade. It can be argued that such clade core domain combinations provide fundamental and distinguishing functionality for the organisms of a clade. For example, animal core domain combinations are all involved in extracellular matrix/cell–cell adhesion functions and in transcription regulation functions and are thus strongly correlated with the development of multicellular organisms.
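The two notions used here, clade-specific combinations (found only inside a clade) and clade core combinations (clade-specific and present in every genome of the clade), reduce to simple set operations. The sketch below is an illustration on made-up data: the genome names, the combination labels, and the function names are hypothetical.

```python
def clade_specific(genomes, clade):
    """Combinations found in at least one clade member and in no genome outside it."""
    inside = set().union(*(genomes[s] for s in clade))
    outside = set().union(*(c for s, c in genomes.items() if s not in clade))
    return inside - outside

def clade_core(genomes, clade):
    """Clade-specific combinations that are present in every member of the clade."""
    in_every_member = set.intersection(*(set(genomes[s]) for s in clade))
    return in_every_member & clade_specific(genomes, clade)

# Hypothetical toy data: genome name -> set of domain combinations.
genomes = {
    "animal_1": {"A~B", "C~D", "E~F"},
    "animal_2": {"A~B", "C~D", "G~H"},
    "fungus_1": {"A~B", "I~J"},
}
animals = ["animal_1", "animal_2"]

print(sorted(clade_specific(genomes, animals)))  # ['C~D', 'E~F', 'G~H']
print(sorted(clade_core(genomes, animals)))      # ['C~D']
```

In the toy data, A~B occurs in every animal genome but also in the fungus, so it is neither clade-specific nor core; C~D is the only combination that is both exclusive to the clade and shared by all of its members.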
In summary, our results stress a recurring theme, namely that evolution is an exceedingly dynamic, and seemingly random, process. New domain combinations are being created and recreated throughout evolution. Each group of organisms (and probably even each organism) has its own solution, based on a partially shared set of building blocks (domains), to solve shared biochemical and regulatory needs.
As more and more genomes are being sequenced, we expect the percentage of independent domain combination evolution to grow even more. In fact, we expect that, with sufficient data available, the following paradigm of evolution at the domain level will emerge. Major clades (such as animals) have a relatively small set of distinguishing core domain combinations that are essential and defining for members of that clade (such as developmental programs and cell–cell adhesion for animals). Outside of these hierarchical sets of core domain combinations (such as for eukaryotes, animals, and vertebrates), all domains are randomly undergoing reshuffling, and the vast majority keep reemerging and disappearing both over species space and over time, with the exception of various small sets of core domain combinations.
Protein predictions for 172 completed eukaryotic genomes were downloaded from a variety of sources (for details, see
The authors acknowledge the sequencing centers listed in