Conceived and designed the experiments: ÁRO UB. Performed the experiments: APG. Analyzed the data: APG UB. Contributed reagents/materials/analysis tools: DA. Wrote the paper: UB.
The authors have declared that no competing interests exist.
Structural classifications of proteins assume the existence of the fold, which is an intrinsic equivalence class of protein domains. Here, we test in which conditions such an equivalence class is compatible with objective similarity measures. We base our analysis on the transitive property of the equivalence relationship, requiring that similarity of A with B and B with C implies that A and C are also similar. Divergent gene evolution leads us to expect that the transitive property should approximately hold. However, if protein domains are a combination of recurrent short polypeptide fragments, as proposed by several authors, then similarity of partial fragments may violate the transitive property, favouring the continuous view of the protein structure space. We propose a measure to quantify the violations of the transitive property when a clustering algorithm joins elements into clusters, and we find out that such violations present a well defined and detectable cross-over point, from an approximately transitive regime at high structure similarity to a regime with large transitivity violations and large differences in length at low similarity. We argue that protein structure space is discrete and hierarchic classification is justified up to this cross-over point, whereas at lower similarities the structure space is continuous and it should be represented as a network. We have tested the qualitative behaviour of this measure, varying all the choices involved in the automatic classification procedure, i.e., domain decomposition, alignment algorithm, similarity score, and clustering algorithm, and we have found out that this behaviour is quite robust. The final classification depends on the chosen algorithms. We used the values of the clustering coefficient and the transitivity violations to select the optimal choices among those that we tested. Interestingly, this criterion also favours the agreement between automatic and expert classifications. As a domain set, we have selected a consensus set of 2,890 domains decomposed very similarly in SCOP and CATH. As an alignment algorithm, we used a global version of MAMMOTH developed in our group, which is both rapid and accurate. As a similarity measure, we used the size-normalized contact overlap, and as a clustering algorithm, we used average linkage. The resulting automatic classification at the cross-over point was more consistent than expert ones with respect to the structure similarity measure, with 86% of the clusters corresponding to subsets of either SCOP or CATH superfamilies and fewer than 5% containing domains in distinct folds according to both SCOP and CATH. Almost 15% of SCOP superfamilies and 10% of CATH superfamilies were split, consistent with the notion of fold change in protein evolution. These results were qualitatively robust for all choices that we tested, although we did not try to use alignment algorithms developed by other groups. Folds defined in SCOP and CATH would be completely joined in the regime of large transitivity violations where clustering is more arbitrary. Consistently, the agreement between SCOP and CATH at fold level was lower than their agreement with the automatic classification obtained using as a clustering algorithm, respectively, average linkage (for SCOP) or single linkage (for CATH). The networks representing significant evolutionary and structural relationships between clusters beyond the cross-over point may allow us to perform evolutionary, structural, or functional analyses beyond the limits of classification schemes. These networks and the underlying clusters are available at
Making order of the fast-growing information on proteins is essential for gaining evolutionary and functional knowledge. The most successful approaches to this task are based on classifications of protein structures, such as SCOP and CATH, which assume a discrete view of the protein structure space as a collection of separated equivalence classes (folds). However, several authors proposed that protein domains should be regarded as assemblies of polypeptide fragments, which implies that the protein–structure space is continuous. Here, we assess these views of domain space through the concept of transitivity; i.e., we test whether structure similarity of A with B and B with C implies that A and C are similar, as required for consistent classification. We find that the domain space is approximately transitive and discrete at high similarity and continuous at low similarity, where transitivity is severely violated. Comparing our classification at the cross-over similarity with CATH and SCOP, we find that they join proteins at low similarity where classification is inconsistent. Part of this discrepancy is due to structural divergence of homologous domains, which are forced to be in a single cluster in CATH and SCOP. Structural and evolutionary relationships between consistent clusters are represented as a network in our approach, going beyond current protein classification schemes. We conjecture that our results are related to a change of evolutionary regime, from uniparental divergent evolution for highly related domains to assembly of large fragments for which the classical tree representation is unsuitable.
Structural genomics projects
This dramatic growth of the number of known protein structures calls upon automatic classification methods that are objective and based only on structural information. The most used structural classifications of proteins, such as SCOP
However, this goal raises the question of whether, and up to which point, the classification of protein structures is justified. This question is addressed in this paper, where we ask whether an automatic classification based on an objective similarity measure can be uniquely defined.
Several authors studied the agreement between SCOP and CATH classifications
The other side of the coin is that several structures classified in different folds present a significant structural similarity due to the presence of common substructures, a fact noted for instance by the group of Orengo and later by other groups
The first and most successful automatic classification of protein domains is the database FSSP
Some of the above difficulties are related with the very essence of protein classification schemes, which assume that it exists an intrinsic level of structure similarity for defining equivalence classes of protein structures. In SCOP, such an equivalence class is called
The difficulties presented above have led several authors to propose that the space of protein structures is continous
A similar spirit is present in the approaches of Efimov
Another basic assumption of CATH and SCOP is that evolutionary relationships at the superfamily level imply structure similarity at the fold level. Although this assumption is most of the times correct, it was observed already in Ref.
Given the above, one can ask whether protein classifications entirely based on a quantitative measure of structure similarity are possible at all, and if so to which extent.
In formal terms, a protein fold is an equivalence class of protein structures. Mathematically, an equivalence relationship must possess the three property of symmetry, reflexivity and transitivity. Whereas symmetry and reflexivity are automatically fulfilled by any relationship based on a similarity measure, transitivity is not. For transitivity to hold, every time that
The importance of gene duplication for protein evolution
Gene duplication is not the only possible mechanism for the evolution of protein domains. Complex proteins are formed from a combination of individual domains with independent evolutionary history. For this reason, the domain and not the complete protein is the basic unit for protein classification. However, there is increasing evidence that globular domains may be formed by combining fragments below the domain level
If
We expect that the validity of the transitive property strongly depends on structure similarity. Domain pairs with high similarity share most of their structure, and we expect that transitivity approximately holds for them, so that at high similarity the structure space is made of discrete clusters. However, less stringent similarities may be due to partial substructures, and we expect that the transitive property will be violated, and the clustering will strongly depend on the algorithm used.
We propose here a measure to quantify the violation of the transitive property at each step of a hierarchical clustering algorithm. In this way, we aim at detecting the minimum similarity at which transitivity still holds and clustering is justified. At lower similarity, the space should be regarded as continuous, and the significant similarities between clusters should be represented as a network rather than a tree.
Let us consider three elements or clusters
Notice that, by definition, Eq. (1) is comprised between zero and one because
Another way to interpret this formula is the following. Because of transitivity, only five clustering configurations of the elements
Yet a third way to look at the above equation is the following. Most hierarchical clustering algorithms join at each step
Finally,
Now let us consider the step
The main result obtained in this study is the existence of a cross-over in the behavior of transitivity violations. This cross-over point determines an intrinsic condition for stopping the hierarchical clustering algorithm. We call the classification obtained at this point “automatic classification”.
The results that we present here are based on a set of 2890 domains that are decomposed very similarly in the SCOP and CATH databases (see
We plot in
We also plot the mean quadratic error of the two-piece linear fit, whose minimum identifies the cross-over point, plotted as a vertical line;
In order to test the robustness of our method, we repeated the numerical experiments changing all the relevant choices: The alignment algorithm, the similarity measure and its normalization, the clustering algorithm and the set of domains. In all cases, we observed a clear cross-over in the behavior of the transitivity violations, and the cross-over point could be automatically located through our algorithm. Moreover, the cross-over point did not vary very much for different choices (see
Set | Ali | Score | Norm | Cl. Al. | N.Clu. | Clus.co. | T.V. | WKSS | WKSF | WKCS | WKCF |
SCOP 2890 | MM | Cont. | Gauss | AL | |||||||
SCOP 2890 | MM | AL | 740 | 0.87 | 0.101 | 0.59 | 0.60 | 0.55 | 0.22 | ||
SCOP 2890 | MM | EV | AL | 768 | 0.88 | 0.088 | 0.51 | 0.57 | 0.51 | 0.24 | |
SCOP 2890 | MM | EV | AL | 855 | 0.87 | 0.113 | 0.54 | 0.58 | 0.52 | 0.27 | |
SCOP 2890 | MM | EV | AL | 788 | 0.88 | 0.084 | 0.49 | 0.60 | 0.48 | 0.26 | |
SCOP 2890 | MM | Cont. | AL | 883 | 0.88 | 0.069 | 0.57 | 0.50 | 0.53 | 0.27 | |
SCOP 2890 | Cont. | AL | 950 | 0.86 | 0.070 | 0.51 | 0.54 | 0.53 | 0.23 | ||
SCOP 2890 | EV | AL | 797 | 0.77 | 0.089 | 0.47 | 0.44 | 0.49 | 0.19 | ||
SCOP 2890 | EV | AL | 758 | 0.88 | 0.085 | 0.51 | 0.54 | 0.51 | 0.25 | ||
SCOP 2890 | MM | Cont. | Gauss | 876 | 0.90 | 0.167 | 0.24 | 0.48 | 0.54 | 0.69 | |
SCOP 2890 | MM | Cont. | Gauss | 730 | 0.90 | 0.080 | 0.26 | 0.47 | 0.43 | 0.10 | |
CATH 2890 | MM | Cont. | Gauss | 776 | 0.90 | 0.079 | 0.50 | 0.71 | 0.54 | 0.36 | |
MM | Cont. | Gauss | AL | 1353 | 0.92 | 0.063 | 0.61 | 0.52 | - | - | |
MM | Cont. | Gauss | AL | 2287 | 0.91 | 0.068 | - | - | 0.51 | 0.14 |
The qualitative features of the classification at the cross-over point are robust with respect to different methodological choices. First column, set of domains at less than 40 percent sequence identity: either 2890 domains from SCOP, or the corresponding 2890 domains from CATH, or 5041 domains from SCOP, or 7073 domains from CATH. The number of superfamilies and folds is, respectively: SCOP 2890: 779, 466; CATH 2890: 873, 473; SCOP 5041: 1094, 660; CATH 7073: 995, 1852. 2nd column, alignment algorithm: either the multiple structure alignment algorithm MAMMOTH multiple (MM) or its pairwise version (MP), faster but much less accurate. 3rd column, similarity measures: either Contact Overlap (Cont.) or TM score (TM) or percentage of structure identity (PSI). This can have a tolerance of either 4Å or 6Å , and it can be normalized either with respect the length of the shortest domain, PSI partial (PSI-p), or with respect to the geometric average, PSI total (PSI-t). 4th column, normalization with respect to length: either none, or Gaussian statistics (Gauss) or extreme value statistics (EV) 5th column, clustering algorithms: either average linkage (AL), or single linkage (SL) or complete linkage (CL). The results presented are the following. Number of clusters at the cross-over point (6th column), clustering coefficient averaged until the cross-over similarity (7th column), mean transitivity violations(8th column) and weighted kappa with respect to SCOP superfamilies (9th column), SCOP folds (10th column), CATH superfamilies (11th column) and CATH topologies (12th column), The first line in bold face refers to the selected choices, used in the presented results. In the following lines we evidence in bold face the variables that have changed with respect to the reference.
In order to choose the best options, we measured the transitivity violations, the clustering coefficient, which is the network analogous of the transitive property (see
1. As
2. We used several different measures of
In order to remove the dependence on protein length for unrelated proteins, we normalized the PSI and the overlap as in Eq. (8). The parameters used in this expression were determined by fitting mean and standard deviation of the similarity of unrelated structures with respect to the length used to normalize the PSI, using either Gaussian statistics Eq. (9), or extreme value statistics, Eq. (10), as in the original Mammoth paper.
The best similarity score was selected based on the value of transitivity violations and the clustering coefficient evaluated up to the automatic cross-over point (see
The normalization with respect to domain size did not modify the clustering coefficient considerably. However, measures that omit the normalization yield much lower agreement with expert classifications, and their cross-over points are rather distinct, whereas all the normalized scores have almost the same cross-over points. Therefore, normalized scores were used as the standard.
3. As
The plot of transitivity violations for the three algorithms is shown as
Remarkably, single linkage clustering agrees much better than average linkage with the CATH classification at topology (fold) level. This is not surprising, since CATH uses single linkage clustering, but it is an interesting observation, since it illustrate that one basic difference between CATH and SCOP arises from their reliance on different clustering procedures. However, superfamilies agree much better with the average linkage classification for both CATH and SCOP. More important, the transitivity violation is an intrinsic criterion, not based on any reference classification, which clearly favors the average linkage algorithm (see also the
4. As
The number of domains per fold as defined by SCOP (1.67, 2.05) and CATH (1.64, 2.30) increases with the size of the set, as we would expect from the fact that the cluster size is power law distributed, so that smaller samples are more likely to have smaller averages. The same happens at the level of superfamily. In contrast, the number of domains per cluster does not increase for larger samples, being 3.71 and 3.73 for SCOP domains and 3.71 and 3.09 for CATH domains. This indicates that our method tends to stop the clustering process relatively earlier for larger samples. In fact, larger samples are more likely to contain proteins that evidence transitivity violations. The plots of transitivity violations are qualitatively very similar, and are represented in
At each clustering step, we measure the difference between the average domain length of the two joined clusters
One can see from
The cross-over of transitivity violations is depicted as a vertical line. One can see that length differences are significantly larger after the cross-over. To improve the representation, we performed running averages with window size of 30 steps.
At the cross-over point, we find a broad distribution of the number of domains per cluster, with power-law probability density,
Therefore, the exponent of the distribution of the number of domains per cluster agrees reasonably between the SCOP and the automatic classification. Nevertheless, this agreement is not an evidence of the consistency between classifications, since the same size distribution can be found also for clusters obtained from random networks with the same statistical properties
We compared the automatic classification with SCOP and CATH measuring their weighted kappa, which is plotted in
Notice that the cross-over point, depicted as a vertical line, lies between the maximum agreement with superfamilies and the maximum agreement with folds.
The cross over point is located before the maximum weighted kappa for folds, indicating that many clustering steps that join clusters containing domains in the same fold imply large transitivity violations. This suggests that these fold relationships are more compatible with a network than with a classification. The difference between the automatic classification and the classification at the step where the kappa for folds is maximum becomes larger when more domains are added to the set, which makes it more likely to find transitivity violations that prevents clusters from being joined.
These results are robust with respect to the different choices mentioned above. In the following, we analyze in more detail the instances of disagreement between the automatic and the expert classifications.
At the cross-over point, the great majority of the clusters only contain domains in the same SCOP or CATH superfamily. Their number is 632 for CATH superfamilies, 664 for SCOP superfamilies, and 673 over 779 (more than 86 percent) for either SCOP or CATH superfamilies (see
Reference classification | Num. clust. | Homogeneity | Joining probability |
SCOP SF | 779 | 85.2 | 68.0 |
CATH SF | 885 | 81.1 | 66.4 |
SCOP or CATH SF | - | 86.3 | 69.1 |
SCOP folds | 466 | 92.0 | 44.5 |
CATH folds | 473 | 91.4 | 10.7 |
SCOP or CATH folds | - | 95.4 | 45.0 |
First column: reference classification. Second column: Number of clusters in the reference classification. Third column: Percentage of the 779 clusters in the automatic classification that are pure with respect to the reference classification (in case of CATH or SCOP, it is the fraction of clusters that are pure with respect to either CATH or SCOP). Fourth column: Percentage of the pairs joined in the reference classification that are joined in the automatic classification. In the case of folds, only pairs in different superfamilies are counted.
Several superfamilies are splitted in various clusters of the automatic classification. This is one of the most common disagreement between the automatic and the expert classifications. This is however not surprising, since it is well known that evolutionarily related proteins may diverge structurally. The number of splitted superfamilies is 115 over 779 (almost 15%) for SCOP and 87 over 885 (less than 10%) for CATH, which splits several superfamilies that are unique in SCOP.
To analyse these splittings, we measured the distribution of structure similarity between each pair of domains in the same SCOP superfamily, distinguishing split superfamilies from superfamilies contained in just one cluster of the automatic classification. The two distributions are shown in
(A) Distribution of the normalized total similarity score, Eq. (6) and (8), for domain pairs in the same superfamily. The grey bars are obtained for superfamilies that are not split, whereas the white bars are obtained for splitted superfamilies. One can see that splitted superfamilies present a bimodal distribution, with a peak with very small structure similarity. (B) Distribution of the mean intracluster similarity in the automatic classification, Eq. (11). The white bars are obtained for domains in clusters that contain only proteins of the same SCOP fold. The orange bars are obtained for minority domains in clusters containing domains that are mostly of a different SCOP fold.
For some cases, the difference between domains in the same superfamily appears to be due to large insertions or deletions of secondary structures, which may produce fold changes in protein evolution
Above: Two domains classified in SCOP in the metalloproteases superfamily, but splitted in CATH. Their codes are 1c7ka_ (A) and 1e1h.1 (B), with lengths of 132 and 399 residues respectively. Most of the secondary structure elements in the long protein are not matched in the short one. Below: Lambda repressor-like DNA-binding domains 1lmb3_ and 1r69__ (C) and 1d1la_ (D), which represent a well studied example of possible secondary structure switch in evolution.
Another example is the superfamily lambda repressor-like DNA-binding domains (47413). We separate this superfamily in two clusters, one containing the domains with ASTRAL id. 1lmb3_ and 1r69__ and another one containing domain 1d1la_. This is consistent with the CATH classification, which separates them in two different topologies, and even two different secondary structure classes (all alpha and alpha+beta). Domains 1lmb3_ and 1d1la_ constitute possibly a very interesting example of evolutionary secondary structure switch between proteins that could be demonstrated to be homologues
A number of splittings is due to the limited ability of the similarity score to assign significant similarity to short proteins In fact, the average overlap or PSI of unrelated structures is larger for short proteins, and therefore a larger overlap or PSI is required to judge it as significant (see Eq. (8)). As a consequence, there is a bias to split superfamilies with small domains: The mean length of splitted superfamilies is 165 residues versus 180 residues for superfamilies that are not splitted. We show one such example in
These clusters would be joined with
The automatic classification disagrees with CATH or SCOP when two domains in the same cluster belong to different folds. This kind of disagreement is rather rare. Only 142 domains over 2890, i.e., less than 5 percent, are contained in clusters where the majority of domains is from another SCOP fold, and they are distributed in only 63 clusters, so that 92 percent of the clusters contains only domains from the same fold. Similarly, 124 CATH domains over 2890 are minority domains, distributed in 67 clusters. However, these do not coincide with the 62 homogeneous clusters according to SCOP. Only 36 clusters (less than 5 percent) are not homogeneous according to both SCOP and CATH, indicating a very high agreement in cluster composition with the expert classifications (see
For analyzing these disagreements, we computed the mean similarity score of each domain with the other domains in the same cluster, distinguishing domains in homogeneous clusters from minority domains in clusters with a majority of domains of a different fold. As one can see in
Some examples of fold unification are represented in
(A) Domain 1o4wa_ from SCOP fold PIN domain-like and domain 1jmva_ from fold Adenine Nucleotide alpha Hydrolase-like. They have a nearly identical description in the SCOP database in terms of secondary structure elements. (B) The 7-stranded barrel with code 1m65a_ is unified to a cluster with 12 TIM barrel, one representative of which, with code 1j6oa_, is shown for comparison. (C) Unification of two domains from the SCOP folds STAT-like (PDB 1lvfa) and spectrin repeat-like (PDB 2e2aa).
In another example, the automatic classification joins domains from the SCOP folds Spectrin repeat-like (46965, corresponding to CATH topology 12058) and STAT-like (47654, corresponding to CATH topology 1201050) in three different clusters. However CATH classifies domain 1lvfa_, which is STAT-like according to SCOP, in the Spectrin repeat-like fold, while a paper of the SCOP team reports that the SCOP release 1.53 changed the classification of domain 1br0 from spectrin repeat to STAT-like, showing that even experts can confound these two folds
The third example corresponds to two domains from SCOP folds PIN domain-like (PDB code 1o4wa_) and Adenine Nucleotide alpha Hydrolase-like (PDB 1jmva_), which are automatically classified in the same cluster. Besides a very high structure similarity, these folds have an almost identical description in the SCOP database (beta-sheet of 5 strands, order 32145).
Another possible disagreement happens when superfamilies that are joined together in the same SCOP fold or CATH topology are splitted in different clusters. This is very frequent: 55.5 percent of the domain pairs in the same SCOP fold but distinct superfamilies are separated. For CATH, this percentage raises to 89.2%. This is not likely to be an artifact of the automatic classification, since the automatic classification agrees with SCOP or CATH at the fold level better than they agree with each other, as discussed in next section. The transitivity analysis suggests that this happens because SCOP and CATH join superfamilies into folds at a similarity level for which transitivity violations are rather large, so that clustering is not justified and unique. At this similarity level different clustering algorithms yield radically different classifications. In contrast, the pairs of domains of the same superfamily that are separated in the automatic classification is significantly smaller, 32% for SCOP and 34% for CATH.
The expert classification schemes CATH and SCOP split proteins into domains differently. Domains in the CATH classification are typically smaller than those in the SCOP classification, with an average of 155 residues compared to 179 residues for SCOP domains (the standard deviations are 88 and 120 respectively). Comparison with a set of expert curated domain decompositions
Superfam. | Folds | |
SCOP vs. CATH | 0.84 | 0.48 |
Automatic (AL) vs. SCOP | 0.54 | 0.69 |
Automatic (AL) vs. CATH | 0.58 | 0.32 |
AL (max) vs. SCOP | 0.65 | 0.79 |
AL (max) vs. CATH | 0.64 | 0.63 |
Automatic (SL) vs. SCOP | 0.24 | 0.48 |
Automatic (SL) vs. CATH | 0.28 | 0.70 |
SL (max) vs. SCOP | 0.51 | 0.67 |
SL (max) vs. CATH | 0.51 | 0.80 |
The agreement is evaluated through the weighted kappa parameter, Eq. (19). The first line compares superfamilies and folds from SCOP and CATH. In the two following lines, the automatic classification at the stop point obtained with average linkage (AL) is compared with SCOP and CATH, respectively, at the levels of superfamilies and folds. The two following lines compare the expert classifications with the AL classification at the points where their weighted kappa is maximum. The four last line are the same, but using as clustering algorithm single linkage (SL), which gives a much stronger agreement with CATH than with SCOP at the fold level, consistent with the fact that CATH uses single linkage.
There is rather good agreement,
The agreement with the average linkage clustering is significantly weaker. Around 68 percent and 66 percent of pairs in the same SCOP and CATH superfamily are in the same automatic cluster, since many superfamilies are split in the automatic classification.
In contrast, the agreement between CATH and SCOP at fold level is much poorer, with
Interestingly, at the fold level the similarity based clustering agrees with the two manual classifications better than they agree with each other, with maximum agreement
If we perform the clustering using single linkage instead of average linkage, the agreement between the automatic clustering and CATH becomes much better (
If we compare the average linkage with the single linkage clustering as a function of the clustering step, we find that the single linkage joins many more pairs than the average linkage for the same number of clusters, as expected from the fact that it does not penalize the overunification. The weighted kappa between the two algorithms decreases as the clustering proceeds, as shown in Supporting
These findings shed light on the comparison between CATH and SCOP. Despite their good agreement at the level of superfamily, CATH and SCOP use different criteria for clustering superfamilies. They would nevertheless agree better if the clustering would be stopped at large similarity, where transitivity is approximately fulfilled. Therefore, the discrepancy between CATH and SCOP at fold level has two roots (besides the different in domain decompositions): (1) They use different clustering methods, a procedure effectively similar to average linkage for SCOP and single linkage for CATH. which yields a much larger number of pairs classified as the same fold, despite the number of folds is practically the same. (2) They push the clustering up to a low similarity level at which the two clustering methods diverge considerably.
Another possible source of subjectivity in the definition of the fold is the amount of biological knowledge that the expert curators use. To test the influence of this factor, we analyzed how SCOP folds and superfamilies changed through time. We labelled the age of a SCOP fold or superfamily through its SCOP index. Since the SCOP index depends on the secondary structure class, we normalized separately the index for different secondary structure classes, so that a value of 1 means that the index lies within the first 10% of its class and so on. We measured the mean similarity score for pairs of proteins in the same fold or superfamily. The MAMMOTH similarity score of related domains depends on their length. For superfamilies, we find that the average score depends on the average length of the superfamily,
One can see from
Older folds appear to be significantly more structurally diverse, as assessed both through the MAMMOTH score and their length difference.
To distinguish between these two interpretations, we measured structure similarity within superfamilies, see
Summarizing, the structure similarity within SCOP superfamilies remained stable through time, whereas the similarity of superfamilies classified into the same fold tends to be lower for ancient folds.
The cross-over point of transitivity violations determines an intrinsic threshold beyond which protein similarity is better represented as a network rather than as a tree. Protein similarities have been previously represented as a network by other authors. Dokholyan et al.
In contrast to these previous approaches, the graph presented here is not a preliminary step for clustering, but it represents the significant similarity relationships for which clustering is not justified. These relationships not only allow to recover relationships present in expert classifications, such as splitted superfamilies and folds, but also allow to treat on the same ground the cross-fold relationships discussed by several authors, which go beyond expert classifications.
We construct the similarity network by connecting the clusters of the automatic classification that have significant structural similarity. As the similarity threshold is decreased, more and more clusters are connected. Pairs of clusters containing structures from a superfamily splitted in the automatic classification get unified in the network. We measured the probability that a pair of domains is joined in the network as a function of the similarity threshold, distinguishing pairs of domains from the same superfamily, from the same fold, or from different folds. (see
In (A), we see that, for
A visual representation of such a network is shown in
(A) High similarity clusters (
In the context of network analysis, the transitive property studied in this paper is analogous to the clustering coefficient (see
Interestingly, the network allows not only to recover similarity relationships at the superfamily and fold level that are below the threshold for clustering, but it may also help to discover new evolutionary or functional relationships that are not contained in SCOP or CATH. For instance, in a recent paper Xie and Bourne proposed a new method to detect remote evolutionary relationships based on the structure similarity of the active site
In order to complement structure information with sequence information, we constructed the network connecting clusters that have members belonging to the same superfamily. The networks based on sequence and structure similarity can be accessed at the url
To investigate protein modularity, we studied the triangles that violate transitivity for a specific threshold
They are joined after the cross-over point in the network built using similarity threshold
The distribution of the fragment overlap
The peaks of the distribution at
Thus, beyond the cross-over point it is likely to find severe violations of transitivity in which two significant matches
As for all problems for which hierarchical clustering algorithms are applied, for clustering protein structures it is of key importance to determine up to which point the clustering is justified. We propose to test the internal consistency of a clustering method based on a similarity measure by testing the transitive property, which requires that whenever
Transitivity violations as defined here occur either when a pair of domains is joined below the similarity threshold, or when a pair is separated above the same threshold. Another definition, common in the context of sequence comparisons, considers that transitivity is violated only when pairs are separated above threshold. This definition is motivated by the fact that significant sequence similarity demonstrates almost certainly an evolutionary relationship, whereas the lack of similarity does not exclude it. With this definition, the single linkage algorithm does not produce any transitivity violation, since it joins all pairs above threshold. In fact, the term transitivity is often used as a synonymous of single linkage clustering.
Nevertheless, several reasons make the definition of transitivity adopted here more suitable in the context of structure classification. The first reason also applies to sequence comparisons, and it is based on protein modularity. If a domain
We have observed that the transitivity violations grow while the clustering algorithm joins protein domains into clusters. Interestingly, in all instances that we studied we have found a cross-over between two regimes of slow and fast increase of transitivity violations.
At high similarity, transitivity violations grow slowly as the clustering algorithm proceeds, and domain size does not vary very much within a cluster. Clusters in this regime mostly correspond to subsets of SCOP superfamilies. Therefore, most domains in the same cluster are related through gene duplication and subsequent divergence, which justifies to classify related domains on a tree.
At low similarity, transitivity violations grow rapidly as the clustering algorithm proceeds, and domains in the same cluster differ substantially in size. Many pairs in the same cluster are related through partial substructures.
We propose that the cross-over in transitivity violations is an intrinsic point to stop the automatic classification. Lower similarity relationships should be represented as a network rather than a tree.
The method that we presented requires several arbitrary choices. In order to test its robustness, and the influence of the parameters, we have studied at least two alternatives for each of these choices. Qualitatively similar results were obtained for several similarity scores computed on two different alignments obtained with a local and a global version of the MAMMOTH algorithm. Both alignment algorithms were developed at our group. We did not test whether alignments obtained with algorithms developed by other groups, such as DALI, yield different conclusions, as they might do.
In all cases that we tested, we have observed a cross-over in transitivity violations, finding that most of the clusters at the cross-over point correspond to subsets of SCOP or CATH superfamilies. However, the exact location of the cross-over point and the quality of the clustering, as assessed through the clustering coefficient and through the mean value of the transitivity violations, varies for different choices.
Although we do not aim at reproducing SCOP or CATH, which we believe is impossible, we recognize that these expert classifications have important merits. It is therefore noteworthy that the highest clustering coefficients and lowest transitivity violations tend to be associated with scores that are better compatible with SCOP or CATH classifications.
The first important choice is the structure alignment algorithm. Computationally, structure alignment is an NP-complete problem, and even if it were exactly solved different algorithms would differ, since they optimize different scores. We used two versions of the algorithm MAMMOTH that are quite different, since one optimizes local superimmposition of heptamers whereas the second one, MAMMOTH-mult, otpimizes the global structure superimposition, achieving alignments with better PSI and contact overlap. Despite this important difference, the results obtained with the two methods are rather similar.
The similarity measure used is probably the most relevant choice, and we tried several of them. We obtained better results with the contact overlap than with measures that score the optimal spatial superimposition of the two structures, which are used in the standard MAMMOTH score. We conjecture that the contact overlap is a better measure than the PSI for clustering protein structures because of three reasons: (1) It does not assume that there is an optimal rigid body superimposition between the two structures. In doing so, it implicitly allows for flexible superimpositions, which might be better suited for detecting evolutionary relationships
Similarity scores based on structure superimposition typically need a tolerance threshold to decide whether two residues superimpose. We tested the TM score
All measures, except the TM score, must be normalized in order to make them independent of the length of the aligned proteins. We implemented this through a length dependent Z score, as in the original MAMMOTH score. The drawback of the Z score is that not only it makes the similarity of unrelated proteins almost independent of length, but at the same time it reduces the similarity of related proteins with short length. In this way, the similarity of related proteins depend on their length and not on their evolutionary divergence, which makes the Z score an unsuitable measure for evolutionary analysis. This drawback does not occurr with the TM score, although this does not necessarily imply that it is a suitable measure for evolutionary analysis.
Last, we have to decide which clustering algorithm we use. If we adopt the definition of transitivity proposed in the present work, the average linkage algorithm has to be preferred over both single linkage and complete linkage. In fact, average linkage reduces the combination of splitting and overunification errors, whereas single linkage only eliminates splitting errors, since it joins all pairs above the similarity threshold, and the complete linkage eliminates overunification errors, since it separates all structures that are below the similarity threshold. Interestingly, from our analysis it turns out that the main difference between SCOP and CATH is that the latter uses single linkage, while the former uses some procedure effectively similar to average linkage.
As a last remark, we note that there is some analogy between our method, which uses transitivity violations to detect the point at which hierarchical clustering is not justified, and the bootstrap method that scores the significance of each cluster in a tree. Nevertheless, there are also important differences. Besides the fact that bootstrap is computationally much more cumbersome than our method, for obtaining a classification with the bootstrap method we would have to fix a threshold bootstrap probability to accept one cluster, whereas the cross-over that we obtain with our method arises in a natural way without fixing an arbitrary threshold.
The existence of two regimes of transitivity violations, and the fact that the automatic classification at the cross-over point mostly consists of sets of SCOP or CATH superfamilies are the main results of this work. They are robust with respect to changes in the clustering algorithm, the similarity measure, the set of protein domains that we automatically classify, and the accuracy of the alignment algorithm. These results suggest that it is possible to automatically and objectively define an equivalence class for protein domains up to the similarity corresponding to the cross-over point.
Clusters in the automatic classification are structurally more consistent than SCOP folds or CATH topologies, mainly because of two reasons. (1) In the automatic classification, almost 15 percent of superfamilies are split into structurally divergent clusters, indicating that there can be important structural changes in protein evolution
An indication that the fold defined in expert classification may not correspond to an intrinsic similarity level is that CATH and SCOP neatly agree at the level of superfamily, as assessed through the weighted kappa measure, but they disagree between each other at the level of fold even more than they disagree with the automatic classification, when the proper clustering algorithm is used. Indeed, the main difference between SCOP and CATH at fold level is that SCOP uses a procedure effectively similar to the average linkage algorithm, whereas CATH uses the single linkage algorithm, which does not penalize the joining of structurally distinct domains, resulting in clusters that are structurally very diverse.
Furthermore, we have shown that the structural diversity within a SCOP fold is larger if the fold was defined since longer time, suggesting that the criteria underlying the definition of fold may change through time. Classifications are very useful, but the present analysis supports the view that the low similarities at the fold level are better represented as a network rather than as a tree.
The comparison between the automatic and the expert classifications also indicates that the automatic classification can be improved along three lines.
First, in the present study we considered protein domains as defined in the SCOP and CATH classifications. However, proteins are split into domains in the two schemes in a rather different way. In particular, some domains defined in the SCOP classification appear by visual inspection to consist of more than one domain. An incomplete domain partition can be an important source of transitivity violations and consequent errors in an automatic classification of protein structures. We are developing a new automatic method for decomposing proteins into domains based on their recurrence in a database of unrelated structures, similar to the method proposed by Holm and Sanders
Secondly, our method tends to split superfamilies constituted of short domains. Some of these splitting appear to be due to the dependency of the similarity score on the protein length. The raw similarity score, either PSI or contact overlap, is transformed into a Z score in order to reduce as much as possible the dependency of the score of unrelated structures on their size. Our results show that the classification deteriorates if this normalization is not properly performed. However, due to this normalization the similarity score corresponding to identical structures decreases for decreasing domain size, which makes it more difficult to cluster together short proteins. In order to overcome this problem, it would be very helpful to define a similarity score that is independent of domain size both for unrelated and for closely related structures. This will be presented in a forthcoming work.
Third, we found 63 over 779 clusters that contain protein domains defined by SCOP curators as different folds (although 27 of these clusters are homogeneous in terms of CATH topologies). The distribution of structure similarity suggests that several of the foreign domains appearing in clusters that are mostly from another fold are characterized by low mean similarity, and that it could be possible to “clean” the clusters of the automatic classification. Preliminary results indicates that this strategy is promising.
Significant sequence or structure similarity below the threshold for clustering
As a concluding remark, we note that the two regimes of transitivity violations that we found can be related with two modes of protein domain evolution. In the regime of large structure similarity, transitivity violations are small, related domains are similar in size, and 95 percent of them contain domains from a single CATH or SCOP fold, whereas 86 percent contain evolutionarily related domains from the same superfamily. These results indicate that most of the domains with structure similarity above the cross-over are evolutionarily related through gene duplication and divergent evolution. Moreover, domains in different superfamilies but same fold can not be excluded to be evolutionarily related, and some careful studies have been able to demonstrate this common origin also in the absence of a clear signal from sequence similarity, as in the case of the study of TIM-barrels conducted by Nagano et al.
On the other hand, similarities below the cross-over of transitivity violations are often due to partial substructures, and the typical size difference between related domains raises from 20 to 40 residues, indicating the occurrence of large insertions and deletions when the related domains belong to the same superfamily. These are clues of multi-parental evolution, proceeding through the assembly of new polypeptide fragments. This hypothetical mechanism has been proposed by Lupas et al. for the evolution of early protein domains through assembly of small peptide fragments
These considerations parallel recent considerations about the classification of organisms on the tree of life
We have used two non redundant sets of protein domains. The first set was obtained from the ASTRAL 40 database, in which no pair has sequence similarity larger than 40%. We used the SCOP version 1.65 and selected only domains from the four main SCOP classes, all
In order to select a set of domains consistently defined in SCOP and CATH, we aligned with BLAST
We performed pairwise structure alignments using either the program MAMMOTH
The MAMMOTH similarity score is based on the number of aligned residues that are closer than 4Å after optimal spatial superimposition of structures
Third, we adopted the contact overlap, which counts the fraction of contacts in common between two aligned structures
The main qualitative difference between the contact overlap and the PSI is that in the contact overlap superimposed residues in the core of the protein, which form many contacts, receive a larger weight.
It is crucial for protein structure classification that the distribution of the similarity score used is almost independent of the length for comparisons of unrelated proteins. The MAMMOTH score takes care of this by normalizing the PSI in such a way that the distribution of the normalized PSI is almost independent of size for unrelated pairs:
Score | Normalization | Alignment | ||||
PSI partial | EV | Pair | 5.97 | 0.720 | 0.920 | 0.634 |
PSI partial | EV | Mult | 5.73 | 0.714 | 0.860 | 0.622 |
PSI total | EV | Pair | 6.48 | 0.722 | 0.972 | 0.662 |
PSI total | EV | Mult | 5.62 | 0.729 | 0.961 | 0.659 |
Overlap | Gauss | Pair | 0.375 | 0.535 | 1.340 | 0.676 |
Overlap | Gauss | Mult | 0.752 | 0.576 | 1.874 | 0.773 |
The reported parameters were used to normalize the raw scores according to Eq. (8).
Using Gaussian statistics, we fit
The domain similarity score of domain
We programmed and tested three hierarchical clustering algorithms: average linkage
With
With
With
An ultrametric set is a set
A concept related to transitivity in the context of networks is the clustering coefficient, which can be computed through the formula
We have computed the clustering coefficient for the network obtained by joining domains with similarity
For detecting the cross-over point of transitivity violations (TV), we first measure TV at each step of the clustering algorithm using Eq. (2). We then perform two-pieces exponential fits of TV versus the step
The last points in the
We assessed the agreement of two classifications through the weighted kappa measure
We compare this number to the observed number of pairs that agree,
A value of zero means that two classifications are as related as independent classifications, one means that the two classifications coincide. Using the weighted kappa, we have compared the classification obtained at every step of the clustering algorithm with the manual classifications of CATH and SCOP at the superfamily and the fold level.
Notice that the weighted kappa can be decomposed into the contributions of related and unrelated pairs as follows:
For the sake of illustration, we have represented two domain similarity networks obtained before and beyond the stopping point of the automatic classification.
Two networks were constructed by considering each cluster as a node, and connecting nodes with
To visualize spatial superimpositions, we used the multiple structure allignments program MAMMOTHmult
Clustering coefficient for three different similarity measures. The clustering coefficient is computed for networks in which domains with similarity above S0 are connected, and it is plotted as a function of the number of clusters obtained with single linkage clustering of the same network.
(0.02 MB PDF)
Transitivity violations versus the step of the clustering algorithm for three different clustering algorithms. The smallest violations are obtained with the average linkage algorithm.
(0.19 MB PDF)
Agreement between the classifications obtained with different clustering algorithms at the same step. The best agreement is between single linkage and complete linkage.
(0.05 MB PDF)
Network of protein clusters joining superfamilies NTH and PCK. Xie and Bourne confirmed a previously proposed evolutionary relationship between a member of SCOP superamily Phosphoenolpyruvate carboxykinase (PCK), with code 1ayl_1, and the P loop containing nucleotide triphosphate hydrolase (NTH) superfamily. PCK domain 1ayl_1 is joined in the automatic classification with domains 1knxa2 and 1ko7a2, which are classified in SCOP in the PCK superfamily but are classified in CATH in the NTH superfamily. The automatic classification supports the CATH classification. This cluster has a single significant structural link, with average similarity
(0.02 MB PDF)
This work is dedicated to the memory of Ángel Ramírez Ortiz, who proposed and directed its starting phase. We deeply mourn his loss as an advisor and a friend. We thank Ian Sillitoe and Christine Orengo for providing the data necessary to compare our classification with the CATH database. Useful discussions with Florian Teichert and Markus Porto are gratefully acknowledged. We also thank Enrique García de Bustos, who participated in a previous phase of this project.