ÅKB and AE conceived and designed the experiments. ÅKB performed the experiments. ÅKB and DE analyzed the data. ÅKB, DE, and AE contributed reagents/materials/analysis tools. ÅKB and AE wrote the paper.
The authors have declared that no competing interests exist.
Many proteins, especially in eukaryotes, contain tandem repeats of several domains from the same family. These repeats have a variety of binding properties and are involved in protein–protein interactions as well as binding to other ligands such as DNA and RNA. The rapid expansion of protein domain repeats is assumed to have occurred through internal tandem duplications. However, the exact mechanisms behind these tandem duplications are not well understood. Here, we have studied the evolution, function, protein structure, gene structure, and phylogenetic distribution of domain repeats. For this purpose we have assigned Pfam-A domain families to 24 proteomes with more sensitive domain assignments in the repeat regions. These assignments confirmed previous findings that eukaryotes, and in particular vertebrates, contain a much higher fraction of proteins with repeats compared with prokaryotes. The internal sequence similarity in each protein revealed that the domain repeats are often expanded through duplications of several domains at a time, while the duplication of one domain is less common. Many of the repeats appear to have been duplicated in the middle of the repeat region. This is in strong contrast to the evolution of other proteins, which mainly proceeds through additions of single domains at either terminus. Further, we found that some domain families show distinct duplication patterns, e.g., nebulin domains have mainly been expanded with a unit of seven domains at a time, while duplications of other domain families involve varying numbers of domains. Finally, no common mechanism for the expansion of all repeats could be detected. We found that the duplication patterns show no dependence on the size of the domains. Further, repeat expansion in some families can possibly be explained by shuffling of exons. However, exon shuffling could not have created all repeats.
The building blocks that create proteins are called domains, and domains are often combined to create multidomain proteins. In many vertebrate proteins, repeats with several adjacent domains from the same family can be found. The authors have investigated how these repeats may have evolved. It is believed that the repeats are created through internal duplications where the duplicated region is inserted next to its origin. Therefore, the pairwise sequence similarity between all repeated domains in a protein was used to identify recent duplications, and a method based on autocorrelation vectors was employed to distinguish patterns of duplication. The authors found that repeat regions are often created from the duplication of several domains at a time while duplication of one domain is less common. Further, the internal duplications often occur in the middle of the repeats. This is in contrast to the evolution of nonrepeating, multidomain proteins, which are thought to evolve by the addition of a single domain at the N-termini or C-termini. A preference for duplication of a certain number of domains was found for some of the domain families. Finally, the authors discuss some of the possible mechanisms for repeat expansion. However, the exact mechanism remains to be discovered.
Proteins are composed of domains, recurrent protein fragments with distinct structure, function, and evolutionary history. Protein domains may occur alone, but are more frequently found in combination with other domains in multidomain proteins. While the creation of new multidomain architectures through shuffling of protein domains has been studied extensively during the last few years [
Repeating domains are often short, such as the leucine rich repeat (LRR) family with a repeating unit of 30 residues. Some repeated domain families are mainly found in repeats, e.g., LRR and C2H2 zinc fingers, while other families are also frequently found as a single unit. The repeats may form regular structures, such as antiparallel β-sheets or solenoids, while others form filaments or are only structured upon binding to their ligands [
Domain repeats are often involved in interactions with proteins or other ligands such as DNA or RNA. Even if the repeated domains have a well-defined and conserved structure, the sequence conservation is often low, with only a few conserved residues required for the correct fold. Their variable sequences and the variation in number of domains provide flexible binding to multiple binding partners. Hence, repeats are found in proteins with highly diverse functions such as the tetratrico peptide repeats (TPR) that are involved in cell-cycle regulation, transcriptional regulation, protein transport, and assisting protein folding [
The domain repeats are found in all kingdoms of life, and long repeats, containing several domains in tandem, have been observed to be particularly common in multicellular species [
Domain repeats are thought to arise via tandem duplications within a gene [
In addition to internal duplications, frequent duplications of repeat-containing genes have occurred in the mammalian genomes [
It has been demonstrated that protein domain repeats are particularly abundant in multicellular organisms [
The initial domain assignments (D) using an E-value cutoff at 0.1 detected 51 nebulin domains. With a less strict cutoff, we were able to assign 15 additional domains. Still, there are four gaps (regions with no domain assignment), which are likely to contain domains that cannot be detected with the current HMMs. Below the domain assignments, the exon structure (E) is shown, with a box for each of the 44 exons. It is evident that a block of four exons (a long one in black, two short ones in white, and one of intermediate size in gray) corresponds to a block of seven domains, even though all exon borders are found within the domains.
The different patterns indicate the length of the repeat, i.e., whether it contains 2, 3, 4 domains, etc. The eukaryotic species are labeled with the abbreviations of species names such as Hsa for
Summary of Repeat Distribution in the Different Species
As many proteins with repeats of more than two domains are found in vertebrates, they should provide functions that are required in complex organisms. Consistently, the proteins with repeats mainly have important binding functions in protein–protein interactions and complex assembly as demonstrated for the largest domain families in
Repeat Statistics for the Domain Families
The repeated domains are more abundant than nonrepeated domains. In fact, nearly half of the assigned domains in the vertebrates are found in repeats (
Further evidence of the frequent duplication in repeats is that orthologs appear to have expanded independently [
Expansion of repeats through internal duplication is not unique to eukaryotes since some prokaryote-specific repeats can be found, e.g., the bacterial immunoglobulin (IG)–like domain and haemagglutinin repeats. Other prokaryotic repeats may be explained by horizontal transfer [
The formation of repeats is not well understood; we therefore aim to understand some of the underlying mechanisms of repeat expansion by studying the number of domains that are duplicated each time. Since domain repeats are assumed to be created through internal duplications [
(A) In a protein with five domains, a unit of three N-terminal domains has been duplicated in tandem.
(B) To identify this evolutionary event, alignment of all domain pairs in the protein is performed.
(C) The alignment scores between the domains are displayed in a matrix with increasing color intensity for higher scores. The diagonal shows alignment scores for each domain to itself, while square 1,2 gives the score between the first and the second domain. A pattern where domain pairs 3–6, 4–7, and 5–8 have the highest alignment scores can be seen.
(D) From the alignment scores, an ACV is calculated as the mean alignment score at each distance normalized around zero. The distance between the domains is defined as one for neighbouring domains, while domain pairs with one domain between them have distance two, etc. In this example a peak at distance three can be seen. Hence, we assume that this protein has evolved through the duplication of three domains.
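The ACV calculation described in (D) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the toy score values, and the choice to centre the vector by subtracting the mean over all distinct domain pairs are assumptions made for the example.

```python
def autocorrelation_vector(scores, max_distance=None):
    """Mean pairwise alignment score at each domain distance, centred on zero.

    scores[i][j] is the alignment score between domain i and domain j
    (0-based, N-to-C order). Only pairs with i < j are used.
    """
    n = len(scores)
    if max_distance is None:
        max_distance = n - 1
    # Assumed normalization: subtract the mean over all distinct domain pairs.
    pairs = [scores[i][j] for i in range(n) for j in range(i + 1, n)]
    overall_mean = sum(pairs) / len(pairs)
    acv = []
    for d in range(1, max_distance + 1):
        at_distance = [scores[i][i + d] for i in range(n - d)]
        acv.append(sum(at_distance) / len(at_distance) - overall_mean)
    return acv


# Hypothetical eight-domain protein: pairs 3-6, 4-7, and 5-8 (1-based)
# score high, all other pairs score low.
n = 8
toy = [[10.0] * n for _ in range(n)]
for i, j in [(2, 5), (3, 6), (4, 7)]:  # the same pairs, 0-based
    toy[i][j] = toy[j][i] = 50.0

acv = autocorrelation_vector(toy)
peak_distance = acv.index(max(acv)) + 1  # distances are 1-based
```

With these toy scores the highest value of the vector falls at distance three, which is the signal interpreted in the legend as duplication of a three-domain unit.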
Distinct patterns of repetition could often be distinguished, and in many proteins, units containing multiple domains have been duplicated in tandem. For instance, in the human zinc finger protein found in
(A) ENSP00000319007.
(B) ENSP00000303696.
The intensity of the squares reflects the alignment score with darker color for higher scores. The numbers at each axis indicate the domains in N-to-C terminal orientation within the repeat. In these two examples, patterns of duplication of six domains (A) and two domains (B) can be seen.
For many proteins, however, no clear pattern was seen since all domain pairs had similar alignment scores. In other proteins, there were mixed patterns within the protein as distinct parts of the protein have been expanded with duplication units of different sizes. Therefore, autocorrelation vectors (ACVs) were used to get a general view of the relative frequency of duplication units of different sizes in each protein. We have defined ACV as the average alignment score between domains at each distance, i.e., the alignment score between neighboring domains, domains at distance two, three, etc. (
The most common duplication pattern for a domain family can be elucidated when the average ACV for all repeats containing the family is calculated. As an example, the chicken nebulin protein (
(A) The intensity of the squares is related to alignment scores, and the numbers on both axes indicate the domains in N-to-C terminal orientation. As there were gaps in the repeat sequence (
(B) ACV calculated from the alignment scores in (A) with the average similarity to domains at distance 1, 2, 3, etc. The ACV is normalized around zero, hence the dotted line at zero is the mean score between all domains in the protein. The ACV was calculated before introducing the gaps as domains (dashed line) and after (solid line). When the regions with no domain assignments were regarded as domains, the pattern of seven repeating units became much clearer, indicating that the gaps are also domains.
Solid lines show ACVs for proteins with repeats of eight different domain families. In the bottom right diagram, the ACV for all proteins with repeats is displayed. The ACV for each family was normalized around zero, hence the dashed line at zero is the mean bit score between all domains in the family. The
Such clear patterns could not be found for all domain families, as can be seen in
The ACVs show that duplication units of a few different sizes are dominant in each family. However, duplications of many different unit sizes may occur within a family. To get a view of how the patterns are distributed among the domain families, hierarchical clustering of the ACVs from all proteins was performed (
(A) Dendrogram of the 20 clusters. Each cluster is indicated by a cluster number followed by the number of proteins in the cluster.
(B) The average ACV for each cluster with red color for values below the average and green for values above.
(C) Distribution of the ten largest domain families, as well as nebulin, in the different clusters. The expected number of proteins from a domain family in each cluster was calculated using random shuffling, and Z-scores for overrepresentation (green) and underrepresentation (red) in the cluster were calculated. The number after each domain family name is the number of repeats of the family.
The number of proteins in each cluster is indicated after the cluster number.
In conclusion, the domain repeats are most often created from the duplication of several domains at a time, while duplication of one domain appears to be less common. Further, the number of domains involved in each duplication event differs considerably within the domain families. However, for some domain families, there may be selection for duplication of a certain number of domains due to some functional or structural constraint, as is likely in the case of the nebulin domain. In addition, the most commonly repeated domains, the C2H2 zinc fingers, show the most diverse distribution of duplication patterns.
To determine if duplication at either end of a protein is preferred, the most recent duplications were identified and their positions were determined, revealing that a large proportion of the repeats have been expanded in the middle of the protein. The fraction of duplications we observe in the middle is slightly, but significantly, higher than expected by chance (
Repeated domain families are on average shorter than nonrepeated domain families [
Another possibility is that there is a preference for duplication of certain sizes due to functional constraints, where a fixed number of domains is required for function. In that case, short repeats of that particular length may also be common. This seems to be true for cadherin domains, which have a peak in the ACV at distance five and are also abundant in five-domain repeats (
Exon shuffling, i.e., nonhomologous recombination in the intron regions, can create new exon combinations and new proteins. As a consequence, exon shuffling is responsible for many new domain combinations, and it has been demonstrated that exon-bordering domains often combine with other domains [
To verify if the exon junctions are enriched in repeated domains or in linkers between the domains, simulations with random positioning of the junctions were performed. As a result, it was evident that more exon junctions are located in linkers than is expected at random (
Our results are consistent with findings that extracellular domains, such as IG and EGF, are often recombined through exon shuffling [
Interestingly, the exon structures revealed that 30% of the repeats with ten or more domains are located within one large exon, excluding the possibility of exon shuffling as the mechanism for their expansion. This was especially evident for human C2H2 zinc finger proteins, where 78% of the long repeats were found within one exon. The corresponding number of one-exon zinc finger repeats was lower in the other species, e.g., 11% in zebrafish. Also, LRR had many repeats in one exon, while other domain families always have the repeats spread over several exons (
In conclusion, exon shuffling may be responsible for the expansion of some domain repeats, especially the extracellular ones that are often expanded one domain at a time. However, not all repeat duplications can have been created by exon shuffling.
A complication in this analysis is deletions within proteins, since our method does not detect domain deletions. However, protein evolution tends to generate longer proteins, and it has been shown that proteins are more often extended by fusion than truncated by fission [
Wright and coworkers recently published a study on protein aggregation where they found that neighboring domains, in repeats of IG and fibronectin domains, have lower sequence identity compared with more distant domains, and suggest that this may prevent protein aggregation [
Whether repeat expansion is a random process or a controlled mechanism, where specific segments are selectively duplicated, remains to be discovered. Internal duplications may take place in all proteins, but it is likely that such duplications are lost if the protein does not contain domains that have a repeat-forming characteristic. On the other hand, an increase in the number of repeated domains might not alter the protein structure drastically and can actually promote protein stability [
Short protein repeats may be created from DNA hairpin formation and strand slippage, while the hypermutability of minisatellite loci (repeating units of more than ten nucleotides) is thought to be due to recombination events [
Identification of such hotspots would require exact identification of the gene segments that have been duplicated, which is difficult in most cases. Further, a method that can distinguish overrepresented DNA motifs at their flanks is needed. Finally, detection of such motifs would require that the motifs are conserved after the duplication has occurred. Many challenges thus remain before the tandem duplication of protein domains can be fully understood.
In this work, we show that repeat regions are most often created from the duplication of several domains at a time, while duplication of one domain is less common. Further, we found that the internal duplications often occur in the middle of the repeats. Hence, the internal duplications in repeats evolve differently from other domain recombinations, which mainly involve the addition of a single domain at either terminus. Preference for duplication of a certain number of domains could be seen for some of the domain families. However, most domain families show a broad distribution of duplication patterns and can be expanded with different numbers of domains, even if certain duplication sizes are more common. The exact mechanism behind these duplications is not well understood. We found no correlation between the size of each duplicated fragment and the domain sizes. For some domain families, however, selection for functional units containing a certain number of domains may favor the duplication of that unit. In addition, exon shuffling could partly explain the duplications of some domain families, especially the extracellular domains. However, many repeats are found within one large exon, hence it is highly unlikely that they have evolved via exon shuffling.
We have analyzed the proteomes of 24 species; ten eukaryotes:
The microbial sequences have been collected from the National Center for Biotechnology Information (NCBI) (
Exon and intron information for the seven metazoan species was extracted from Ensembl (
Pfam-A domains were assigned to the prokaryotic proteomes, the yeast species, and
In addition, many repeats with alternating domains from related domain families were found, e.g., the different Pfam families of TPR or the IG-like domains. Such related domains are grouped together in the Pfam Clans (
Throughout this study, a protein is regarded as a repeat-protein if it has at least two adjacent domains from the same family and no more than 100 unassigned residues between the domains.
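The repeat definition above can be illustrated with a short Python sketch. It is an assumption-laden example rather than the pipeline used in the study: the `Domain` tuple, the function name, and the coordinates are hypothetical, and the gap is assumed to be the count of unassigned residues between the end of one domain and the start of the next.

```python
from collections import namedtuple

# Hypothetical record for one domain assignment (1-based, inclusive residues).
Domain = namedtuple("Domain", "family start end")


def find_repeats(domains, max_gap=100, min_domains=2):
    """Group adjacent same-family domains separated by at most max_gap residues."""
    ordered = sorted(domains, key=lambda d: d.start)
    repeats, current = [], []
    for dom in ordered:
        if current:
            prev = current[-1]
            gap = dom.start - prev.end - 1  # unassigned residues in between
            if dom.family == prev.family and gap <= max_gap:
                current.append(dom)
                continue
            # The run is broken: keep it only if it is long enough.
            if len(current) >= min_domains:
                repeats.append(current)
        current = [dom]
    if len(current) >= min_domains:
        repeats.append(current)
    return repeats


# Made-up coordinates: two zinc fingers close together, a third one 139
# residues further along, then a domain from a different family.
domains = [
    Domain("zf-C2H2", 10, 32),
    Domain("zf-C2H2", 38, 60),
    Domain("zf-C2H2", 200, 222),
    Domain("KRAB", 230, 270),
]
repeats = find_repeats(domains)
```

Here only the first two domains form a repeat; the third zinc finger is excluded because more than 100 unassigned residues separate it from its neighbour. Setting `min_domains=10` would mimic the stricter cutoff used for the evolutionary pattern analysis.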
Analysis of evolutionary patterns was performed on proteins with repeat length ten or more, i.e., at least ten domains in tandem. The sequences of the repeating domains were extracted and aligned to each other using the Smith-Waterman alignment tool in the EMBOSS package [
Our analysis is based on the assumption that the most recently duplicated domains have the highest sequence similarity to their originating domains. To quantify the duplication patterns, an
The ACVs presented in
ACVs of length nine were created for all proteins with ten or more repeats. As longer vectors cannot be created for proteins with repeat length ten, we used this cutoff to be able to compare the whole dataset. Hierarchical clustering of the ACVs was performed using the Ward incremental sum of squares distance measure, in Matlab (The MathWorks, Natick, Massachusetts, United States), to measure similarity between the vectors. The distance is defined as
In addition, the domain families were also clustered with the same method using the distribution of domain families in the 20 ACV clusters.
The position of latest duplication was determined for all proteins with repeats of five or more domains. To identify where in the repeat the most recent duplication took place, a matrix was created similar to the one in
To estimate the statistical significance of our results, Z-scores were calculated from randomization in 10,000 iterations. The Z-score was calculated as
In the simulation of ACVs, the positions of the domains in a protein were shuffled while maintaining their individual alignment scores. In each iteration, an ACV for proteins with each domain family was calculated, and finally the Z-score for each position of the vector was calculated from these randomized values.
In the case of enrichment of exon boundaries in linker regions, the domain and linker positions in each protein were kept constant. The number of exon boundaries in each protein was also conserved, but they were positioned randomly along the protein sequence. In each iteration, the fraction of linkers that contained exon boundaries was calculated.
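The exon-boundary randomization can be sketched as follows for a single protein. This is a simplified illustration under stated assumptions: the function names and coordinates are hypothetical, boundaries are placed uniformly at random along the sequence, and the Z-score is taken to be the usual (observed minus simulated mean) divided by the simulated standard deviation. The study pooled all proteins and used 10,000 iterations; the sketch uses fewer to stay fast.

```python
import random
import statistics


def fraction_linkers_with_boundary(linkers, boundaries):
    """Fraction of linker regions (start, end) holding at least one exon boundary."""
    hits = sum(any(s <= b <= e for b in boundaries) for s, e in linkers)
    return hits / len(linkers)


def linker_enrichment_z(protein_length, linkers, boundaries,
                        iterations=1000, seed=0):
    """Z-score of the observed linker fraction against random boundary placement."""
    rng = random.Random(seed)
    observed = fraction_linkers_with_boundary(linkers, boundaries)
    simulated = []
    for _ in range(iterations):
        # Keep the number of boundaries, but draw their positions at random.
        shuffled = [rng.randint(1, protein_length) for _ in boundaries]
        simulated.append(fraction_linkers_with_boundary(linkers, shuffled))
    mean = statistics.mean(simulated)
    sd = statistics.pstdev(simulated)
    # Assumed standard Z-score; undefined if the simulation shows no variation.
    return (observed - mean) / sd if sd > 0 else float("nan")


# Made-up protein of 1,000 residues with four short linkers, each of which
# contains one exon boundary.
linkers = [(101, 110), (211, 220), (321, 330), (431, 440)]
boundaries = [105, 215, 325, 435]
z = linker_enrichment_z(1000, linkers, boundaries)
```

A large positive Z-score, as in this constructed example, indicates that more linkers contain exon boundaries than expected from random placement.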
The enrichment of the domain families in each cluster in
For estimation of the position of latest duplication, the domain order was shuffled in each protein while maintaining individual alignment scores. In each iteration, the fraction at each position was estimated, as described in the previous section. Finally, the Z-scores for fraction at N/C-terminal or middle were calculated.
A repeated domain family is defined as a family found in a repeat of at least three domains, and nonrepeated families are never found as repeated. The copy numbers for repeated domains have been calculated as the total number of copies (Rep. Copies) or counting each protein with the repeat only once (Rep. Compressed).
The domain family name is followed by the number of proteins (nP) and number of domains (nD) used in the calculations. The autocorrelation for each family was normalized around zero, hence the dashed line at zero is the mean bit score between all domains in the family. The
(A) The alignment scores between all domains in a human zinc finger protein with darker color for higher scores.
(B) All scores that are more than one standard deviation above the mean score are set to one (gray). Then the longest diagonal of “ones” is identified (black) and the position of that diagonal is determined. In this case the latest duplication is estimated to have occurred at the end of the repeat.
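The thresholding and diagonal search described in (A) and (B) can be sketched as below. This is a minimal example with assumptions: the function name and toy scores are hypothetical, the mean and standard deviation are computed over the distinct domain pairs only, and the population standard deviation is used. How the diagonal position is then classified as terminal or middle is not reproduced here.

```python
import statistics


def longest_high_scoring_diagonal(scores):
    """Return (length, i, j) for the longest run of above-threshold cells
    (i, j), (i+1, j+1), ... in the upper triangle of a score matrix."""
    n = len(scores)
    upper = [scores[i][j] for i in range(n) for j in range(i + 1, n)]
    # Cells more than one standard deviation above the mean count as "ones".
    threshold = statistics.mean(upper) + statistics.pstdev(upper)
    best = (0, None, None)
    for offset in range(1, n):  # distance between the paired domains
        run = 0
        for i in range(n - offset):
            if scores[i][i + offset] > threshold:
                run += 1
                if run > best[0]:
                    start = i - run + 1
                    best = (run, start, start + offset)
            else:
                run = 0
    return best


# Hypothetical ten-domain repeat in which the last two domains (8 and 9,
# 0-based) are recent copies of domains 6 and 7.
n = 10
toy = [[10.0] * n for _ in range(n)]
for i, j in [(6, 8), (7, 9)]:
    toy[i][j] = toy[j][i] = 60.0

length, row, col = longest_high_scoring_diagonal(toy)
```

For the toy matrix the longest diagonal has length two and starts at cell (6, 8), placing the most recent duplication at the C-terminal end of the repeat.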
The fraction of different regions that contain disordered regions or different secondary structures. The first bar shows the distribution in all of the proteins followed by repeated domains (RepDom), non-repeated domains (NRDom), and regions without domain assignments (Unass).
The size of most duplicated units, i.e., the number of domains involved in most duplications, was determined from the highest peak in the ACVs (
The connectivity is displayed for proteins with no repeat (repeat length 1), two-domain repeats, etc., up to repeats of length nine or more. The networks for three eukaryotic species,
We would like to thank Sara Light and Janusz Bujnicki for helpful comments. Further, we are thankful for the extensive comments made by one of the referees, which provided great improvements to the manuscript.
ACV, autocorrelation vector
EGF, epidermal growth factor
IG, immunoglobulin
LRR, leucine rich repeat
TPR, tetratrico peptide repeat