PW and DF conceived and designed the experiments. PW performed the experiments. PW and DF analyzed the data. PW contributed reagents/materials/analysis tools. PW wrote the paper. DF provided helpful advice throughout this work.
The authors have declared that no competing interests exist.
Fold designability has been estimated by the number of families contained in that fold. Here, we show that among orthologous proteins, sequence divergence is higher for folds with greater numbers of families. Folds with greater numbers of families also tend to have families that appear more often in the proteome and greater promiscuity (the number of unique “partner” folds that the fold is found with within the same protein). We also find that many disease-related proteins have folds with relatively few families. In particular, a number of these proteins are associated with diseases occurring at high frequency. These results suggest that family counts reflect how certain structures are distributed in nature and are an important characteristic associated with many human diseases.
Most proteins are composed of structural domains that can be classified into “folds.” Domains with the same fold type share overall structural similarity. The number of amino acid sequences that encode a fold is termed the “designability” of the fold. Folds that have higher designability are thought to be more robust to stresses and mutations. Such features may also allow the fold to appear in a greater variety of contexts. Here, the authors show that proteins with folds estimated to be of higher designability are more widespread amongst proteins in human, mouse, and yeast, consistent with this hypothesis. The authors also find that many hereditary disease-associated proteins have folds estimated to be of low designability. A number of these diseases occur at a relatively high frequency. These results suggest that the estimate of designability employed reflects how certain structures are distributed in nature and is an important characteristic associated with many human diseases.
Different proteins exhibit a wide range of abilities to functionally withstand the effects of environmental stress or mutation. One property that has been proposed to contribute to protein functional robustness is “designability,” the number of sequences that encode a protein's structure. Using simple lattice models in which proteins are modeled as chains of hydrophobic and hydrophilic residues on lattices, Li et al. [
It has been hypothesized that protein structures of higher designability tend to be more fit because such structures would allow a greater amount of sequence changes associated with a greater diversity of function [
Four levels of SCOP are shown: fold, superfamily, family, and sequence (dark blue rectangles). The number of sequences is equal to or greater than the number of families, which is equal to or greater than the number of superfamilies, which in turn is equal to or greater than the number of folds.
Because mutation or environmental change can disrupt and/or create aberrant function in proteins, and given that a large proportion of mutations seem to affect protein structure [
In this work, we continue to investigate the concept of fold designability and its connection to hereditary diseases. We estimate protein designability based on the average family counts of all folds in a protein and, subsequently, find that many disease proteins contain folds with relatively few families. In particular, disease proteins were again estimated to be less designable than non-disease proteins, using this measure. We also provide evidence using a database of disease properties that proteins predicted to be less designable are associated with diseases occurring with greater frequency. Taken together, this work provides further evidence that designability is a factor important to our understanding of many of our diseases.
A potential problem of estimating designability using family counts is that relatively young folds may not have had enough time to establish families, even though the fold may be encoded by large numbers of sequences. Subsequent investigation revealed that relatively ancient folds appearing in both prokaryotes and eukaryotes (see
We first compared the sequence divergence among ancient SCOP folds in human proteins against the number of families they contain using protein orthologs in mouse and yeast. Because orthologs were compared, the domains being compared belong to the same family (see
Ancient SCOP folds found in human proteins were compared to those in mouse and yeast orthologs (see
A possible consequence of higher designability is that of greater fitness. We reason that more designable folds would be more robust to sequence changes associated with a greater diversity of functionality, and thus would be found more often in a proteome in a greater variety of functional contexts. To test if our measure of designability correlates with fold fitness in the eukaryotic proteome, ancient folds from human, mouse, and yeast were binned according to the number of families they contain. For folds in each bin, the number of proteins each fold appeared in was counted and averaged. It was found that ancient folds with greater numbers of families appeared in greater numbers of different proteins (
Ancient SCOP folds were divided into a number of bins according to the number of families that they contain (
A related measure of fold fitness is that of fold promiscuity. We define “fold promiscuity” as the number of unique “partner” folds in the entire proteome, that a particular fold had appeared with in the context of the same protein. We found that ancient folds with more families also tended to be more promiscuous (
Ancient SCOP folds were divided into a number of bins according to the number of families that they contain (x-axis). SCOP folds are connected to other “partner” folds in the same protein. The mean promiscuities (the number of unique partner folds a SCOP fold has) of folds in human, mouse and yeast are plotted. As the number of families in a SCOP fold increases, its promiscuity tends to increase. The differences in fold promiscuity between human, mouse, and yeast are larger for folds with larger numbers of families. All promiscuity differences described here between folds with one family and folds with more than one family are significant (MW-test, KS-test:
A third measure of fold fitness is the number of times that a fold is reused in a given protein. A fold is considered here to be duplicated within the same protein if more than one instance of it exists in that protein. Duplication of folds within a protein allows either for amplification of existing functions associated with such folds or creation of new functions. Folds with different functionality are likely encoded by different sequences. Sequence dissimilarity may also be selected for in folds cooperating to amplify a single function because individual folds must function in different spatial contexts. Indeed, of the 3,468 human proteins that have duplicate folds, less than 7% (230/3,468) have such folds detected with the same BLASTP E-value (see
Ancient SCOP folds were divided into a number of bins according to the number of families that they contain (
Observations across multiple genomes allow one to compare the occurrence of associated phenomena within the timeframe delineated by the divergence of these genomes. Three different proteomes belonging to modern human, mouse, and yeast have emerged from the time of the common ancestor of these organisms. Within this fixed timeframe, the proteomes leading up to modern mammals have expanded considerably compared to those leading up to the yeast proteome. In particular, we found that the magnitude of differences between yeast and mammals in terms of fold occurrence, promiscuity, and duplication is much higher for ancient folds with larger numbers of families (
Although our investigation has focused thus far on the divergence and proliferation of ancient folds, we also detect similar trends when we examined all folds and families within these folds (
Thus far, older folds were found to have more families, and folds with more families were found to be more divergent and widespread. Thus, a possible explanation as to why folds with more families are more divergent and widespread within genomes is that these folds tend to be older. To test this hypothesis we compared ancient folds and young folds found only in human and mouse (see
The number of sequences that encode a fold has been hypothesized to be related to the length of the fold [
Using the SCOP hierarchy, the designability of proteins was estimated in two ways: (1) as the number of families in the fold predicted to be least designable [
Designability and Disease Frequency
Although biased toward the populations assessed and limited in quantity, data pertaining to disease frequency [
In this and a previous work [
To ensure adequate time was available for family procreation among folds to be examined, we concentrated our analysis on relatively ancient folds found both in eukaryotes and prokaryotes. Protein families belonging to ancient folds containing more families were found in larger numbers of proteins. Ancient folds with greater numbers of families were also found in partnership with a more diverse set of other folds, and were duplicated more often within the same protein. In particular, the expansion of families belonging to ancient folds with more families was found to be greater since the time of the yeast–mouse–human common ancestor. These results are also consistent with the hypothesis that folds with more families are more designable. More designable folds would allow for a larger number of sequence changes in a fold, allowing for greater diversity of function. This line of thought concerning protein folds is analogous to recent findings that designability correlates with contact density, which correlates with the mean functional flexibility score of gene families [
Fold designability defines a limit to the divergence associated with folds. Interestingly, we find that sequences belonging to folds with greater numbers of families were more divergent in orthologous proteins. Clearly, selection would affect the divergence of folds. However, if family counts capture the designability of a fold, these results also suggest that designability may have contributed significantly in limiting the divergence of folds. It would be highly interesting to tease apart structural and selective influences on divergence in the future.
Because older folds have more families, and folds with more families are more widespread and divergent, one might presume that folds are more widespread and divergent simply because they are older. We found that ancient folds were not necessarily more widespread than young folds in terms of occurrence and duplication, but were found to be more promiscuous, perhaps due to greater opportunity for recombination. Ancient folds were also found not to be more divergent than young folds. Abeln and Deane [
Our inability to find a relationship between length of folds and the number of families contained within folds may be the result of the limited number of structures known. It also raises speculation that increasing the length of folds does not necessarily increase their designability, perhaps due to an increase in the potential for aberrant misfolding and aggregation [
It is worth noting that the associations between family counts, divergence, and fold proliferation in genomes were statistically weakened when we considered all folds instead of just ancient folds. This phenomenon is consistent with the idea that relatively young folds may not have had enough time to procreate families, thus obscuring trends between these attributes.
Early structural characterization of the human proteome indicated significant differences in SCOP superfamily composition between disease and non-disease proteins [
Interestingly, we found that within a database of disease properties, more frequently occurring diseases were associated with proteins containing folds with fewer families. Theoretically, proteins with less-designable folds would be less robust to mutation. Such a characteristic suggests two reasons why proteins predicted to be less designable have been associated with more common diseases. First, proteins with lower structural robustness would be more likely to receive disease-associated mutations. Our results invite speculation that an increased chance of deleterious mutations in proteins predicted to be less designable has contributed to their association with more frequently occurring diseases. Second, one would also expect the diversity in terms of structure and stability of less-robust proteins to be greater in a population. Such diversity in structure or stability is likely correlated with functional diversity because proteins of different structures usually perform different functions, and proteins of different stabilities would likely have different cellular lifetimes. Such diversity would facilitate the survival of a population in rapidly changing environments because certain members are more likely to contain a mutation adapted to the new environment. These mutations, however, may cause disease directly or increase susceptibility to diseases. For example, over 100 mutations in
Analogous to a decrease in designability, an increase in length has been proposed to increase the likelihood of a protein receiving disease-causing mutations [
What has not been considered so far is the propensity for precursor molecules encoding proteins to undergo disease causing alterations. For example, certain genomic contexts or hotspots have been identified that predispose DNA sequences for mutation [
To what extent intrinsic susceptibility to mutation and selection on populations have contributed to the association of diseases with proteins with few families is not known. The former mechanism suggests that a larger number of disease alleles exists for more common diseases. The latter mechanism suggests that a small number of common disease alleles in the population account for the high frequency of occurrence in human diseases [
We must emphasize, however, that fewer than 300 proteins with SCOP folds detected from our Ensembl database have been mapped to disease frequency categories. Disease frequency is influenced by many factors, including environmental effects and underlying genotypes. Thus, the disease frequency data we use may not reflect certain populations. Although we have provided evidence for an association between disease and SCOP family counts, the nature of this association is not known for many proteins. The extent to which our explanations hold as to why diseases are associated with proteins predicted to be less designable remains to be assessed. Establishment of principles based on our work would require further investigation on larger datasets accounting for multiple factors that influence disease propensity.
Throughout this work, we have extensively used family counts as an estimate of fold designability. How many sequences can encode a protein fold depends not only on the intrinsic constraints imposed by the geometry of the fold, but also on the external environment. For example, high temperatures could restrict the sequence space occupied by folds [
A major disadvantage of using family counts is that it is an imprecise measure. It is likely that different proteins with the same folds, and hence the same family count scores, can have vastly different designabilities. Moreover, if the fold is relatively young, the number of families contained in that fold may be too small to reflect its designability. Although residue contacts [
In summary, we have provided evidence that family counts can capture characteristics of fold designability. By estimating fold designability, we suggest explanations regarding how folds are distributed in proteomes and their potential for evolution. We have also provided evidence that our measure of protein designability is associated with properties of diseases. The designability concept remains immature [
A total of 34,111 proteins predicted to be encoded in the human genome were obtained from the Ensembl human v23.34e.1 database [
Protein SCOP [
Folds identified in human and more than six other genomes
To test whether the number of families found within folds correlated with their divergence, the number of families associated with each SCOP fold was compared with the sequence divergence of that fold. Individual SCOP domains from human proteins were aligned with the corresponding domains found within corresponding mouse and yeast protein orthologs, and the sequence identity was recorded. Protein orthologs between human, yeast, and mouse were identified as bidirectional best BLASTP hits with exactly the same SCOP domains. Note that the strict ortholog definition in use ensures that the SCOP domains being compared belong to the same SCOP family. Sequence identity between the SCOP families was computed using ClustalW [
The average divergence of the fold is calculated as follows: Let one genome encode the proteins A, B, and C and another genome encode the proteins A1, B1 and C1. Let domain F be the hypothetical SCOP fold x.1 in proteins A, A1, B, B1, C, and C1. (A, A1), (B, B1) and (C, C1) are orthologous pairs. Fold x.1 contains only two families, denoted x.1.1.1 and x.1.1.2. The domains (that belong to the same family) are aligned. The average divergence for families x.1.1.1 and x.1.1.2 is (40% + 80%)/2 = 60% and (50% + 50%)/2 = 50%, respectively. The average divergence for fold F would be taken as the mean of the average divergence of all its families, namely (60% + 50%)/2 = 55%.
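The two-level averaging in this example (first within each family, then across families of the fold) can be sketched in a few lines of Python. This is only an illustration of the arithmetic above, not the code used in the study; the function name `fold_divergence` and the input layout (a list of family identifier and per-alignment percentage pairs) are assumptions made for the sketch.

```python
from collections import defaultdict
from statistics import mean


def fold_divergence(alignments):
    """Average per-alignment values within each family, then across families.

    `alignments` is an iterable of (family_id, percent_value) pairs, one pair
    per aligned orthologous domain pair. This input layout is hypothetical.
    """
    per_family = defaultdict(list)
    for family_id, value in alignments:
        per_family[family_id].append(value)
    # Step 1: mean within each family.
    family_means = {fam: mean(vals) for fam, vals in per_family.items()}
    # Step 2: mean of the family means gives the fold-level value.
    fold_mean = mean(list(family_means.values()))
    return fold_mean, family_means


# Values taken from the worked example for the hypothetical fold x.1.
example = [
    ("x.1.1.1", 40), ("x.1.1.1", 80),
    ("x.1.1.2", 50), ("x.1.1.2", 50),
]
fold_mean, family_means = fold_divergence(example)
```

Averaging per family first keeps a family with many aligned domain pairs from dominating the fold-level value.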
For human and mouse comparisons, we also defined orthologs as those genes encoding proteins with the same SCOP families and that are bidirectional best hits with at least one nearby gene being a bidirectional best hit to a gene nearby its ortholog. We define nearby genes of gene A as those genes within five genes of A. Using this orthology definition, we obtained similar results (unpublished data).
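The two orthology criteria can be illustrated with a small Python sketch. It assumes the best BLASTP hit in each direction has already been tabulated; the dictionary layouts, the gene identifiers, and the function names are all hypothetical, and representing "the same SCOP domains" as equal tuples is one possible reading of the criterion rather than the authors' implementation.

```python
def reciprocal_best_hits(best_ab, best_ba):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


def same_domains(pairs, domains_a, domains_b):
    """Keep pairs whose annotated SCOP domains are identical (assumed tuples)."""
    return sorted(
        (a, b) for a, b in pairs
        if a in domains_a and domains_a[a] == domains_b.get(b)
    )


def nearby(positions, g1, g2, window=5):
    """True if g1 and g2 lie on the same chromosome within `window` genes."""
    if g1 not in positions or g2 not in positions:
        return False
    (chrom1, idx1), (chrom2, idx2) = positions[g1], positions[g2]
    return chrom1 == chrom2 and 0 < abs(idx1 - idx2) <= window


def has_neighbor_support(pair, rbh_pairs, pos_a, pos_b, window=5):
    """True if some other reciprocal-best-hit pair is nearby in both genomes."""
    a, b = pair
    return any(
        nearby(pos_a, a, a2, window) and nearby(pos_b, b, b2, window)
        for a2, b2 in rbh_pairs
        if (a2, b2) != pair
    )


# Hypothetical toy data: h* are human genes, m* are mouse genes.
best_ab = {"h1": "m1", "h2": "m2", "h3": "m9"}
best_ba = {"m1": "h1", "m2": "h2", "m9": "h4"}
rbh = reciprocal_best_hits(best_ab, best_ba)

domains_h = {"h1": ("a.1.1.1",), "h2": ("b.1.1.1",)}
domains_m = {"m1": ("a.1.1.1",), "m2": ("b.1.1.2",)}
strict = same_domains(rbh, domains_h, domains_m)

pos_h = {"h1": ("chr1", 10), "h2": ("chr1", 12)}
pos_m = {"m1": ("chr4", 3), "m2": ("chr4", 7)}
supported = has_neighbor_support(("h1", "m1"), rbh, pos_h, pos_m)
```

The `window=5` default mirrors the "within five genes" definition of nearby genes; gene order indices are assumed to count genes along a chromosome.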
Protein designability was measured as done in Wong et al. [
We also assessed protein designability using a measure that ensures most residues that take part in structural domains in the protein chain contribute to the designability score. This was done by measuring designability by the mean number of families across all folds in each protein. In our example, the score for this measure would be (8 + 3 + 7)/3 = 6. The SCOP folds detected in this work, however, do not necessarily cover the entire protein, simply because either no such structure has been solved yet or the non-covered regions may be intrinsically disordered. Thus, we apply this measure to only those proteins that do not have regions longer than 70 amino acids that have not been covered by our SCOP detection methods.
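As a rough illustration, the mean-based score and the 70-residue coverage filter might be computed as below. The function names, the 1-based inclusive domain coordinates, and the protein lengths are assumptions for this sketch. The minimum-based score is included on the reading that the first measure takes the family count of the least designable fold in the protein.

```python
from statistics import mean


def longest_uncovered(length, domains):
    """Longest stretch of residues not covered by any detected domain.

    `domains` holds (start, end) pairs, 1-based and inclusive (assumed layout).
    """
    longest = 0
    pos = 1  # first residue not yet known to be covered
    for start, end in sorted(domains):
        if start > pos:
            longest = max(longest, start - pos)
        pos = max(pos, end + 1)
    return max(longest, length - pos + 1)


def designability_scores(length, domains, family_counts, max_gap=70):
    """Return (min_score, mean_score); mean_score is None if coverage is poor."""
    min_score = min(family_counts)
    if longest_uncovered(length, domains) > max_gap:
        return min_score, None
    return min_score, mean(family_counts)


# Family counts from the example in the text; coordinates are hypothetical.
counts = [8, 3, 7]
domains = [(1, 100), (120, 200), (250, 300)]
well_covered = designability_scores(300, domains, counts)
poorly_covered = designability_scores(400, domains, counts)
```

In the second call the last 100 residues are uncovered, so the mean-based score is withheld while the minimum-based score is still reported.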
Because of substantial scattering within plots, analysis was conducted using four bins to emphasize trends: Folds containing only one family, more than one family, more than five families and more than ten families. The last three bins overlap with each other.
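The overlapping binning scheme can be sketched as follows. The fold identifiers and the per-fold values are invented for illustration, and the helper name `bin_means` is hypothetical. Any per-fold quantity (occurrence, promiscuity, duplication, or divergence) could take the place of the toy values.

```python
from statistics import mean

# Bin definitions as described in the text; the last three bins overlap.
BINS = {
    "1 family": lambda n: n == 1,
    ">1 family": lambda n: n > 1,
    ">5 families": lambda n: n > 5,
    ">10 families": lambda n: n > 10,
}


def bin_means(folds):
    """Mean per-fold value in each bin.

    `folds` maps a fold identifier to (number_of_families, value).
    Empty bins are omitted from the result.
    """
    result = {}
    for label, in_bin in BINS.items():
        values = [value for n, value in folds.values() if in_bin(n)]
        if values:
            result[label] = mean(values)
    return result


# Hypothetical folds with made-up family counts and per-fold values.
toy_folds = {
    "a.1": (1, 2.0),
    "b.1": (3, 4.0),
    "c.1": (7, 6.0),
    "d.1": (12, 10.0),
}
means = bin_means(toy_folds)
```

Because a fold with 12 families contributes to three bins, the bin means are not independent of one another, which is worth keeping in mind when comparing them.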
All supporting information is available to download as a combined file called Combined Supporting Information.
The mean number of families found in all, ancient (see
(22 KB PDF)
SCOP domains found in Ensembl human proteins likely to be mammalian in origin were compared to orthologous domains in mouse, and the average divergence was recorded (see
(25 KB PDF)
SCOP folds found in Ensembl human proteins were compared to those in mouse and yeast orthologs, and the average divergence was recorded for each fold (see
(27 KB PDF)
Ancient SCOP folds were divided into a number of bins according to the number of families that they contain (
(24 KB PDF)
Ancient SCOP folds were divided into a number of bins according to the number of families that they contain (
(24 KB PDF)
SCOP folds were divided into a number of bins according to the number of families that they contain (
(25 KB PDF)
SCOP folds were divided into a number of bins according to the number of families that they contain (
(24 KB PDF)
SCOP folds were divided into a number of bins according to the number of families that they contain (
(24 KB PDF)
SCOP folds were divided into a number of bins according to the number of families that they contain (
(23 KB PDF)
SCOP folds were divided into a number of bins according to the number of families that they contain (
(25 KB PDF)
SCOP folds were divided into a number of bins according to the number of families that they contain (
(23 KB PDF)
SCOP folds were divided into a number of bins according to the number of families that they contain (
(24 KB PDF)
Ancient SCOP folds found in
(23 KB PDF)
Ancient SCOP folds were divided into a number of bins according to the number of families that they contain (
(24 KB PDF)
SCOP folds were divided into a number of bins according to the number of families that they contain (
(23 KB PDF)
SCOP folds were divided into a number of bins according to the number of families that they contain (
(22 KB PDF)
SCOP folds were divided into a number of bins according to the number of families that they contain (
(901 KB PDF)
SCOP folds in human (
(23 KB PDF)
The occurrence of ancient SCOP domains (see
(106 KB PDF)
The number of times ancient SCOP domains (see
(28 KB PDF)
The promiscuity of ancient SCOP domains (see
(23 KB PDF)
The number of families in ancient SCOP folds detected in human proteins were plotted against their length. Little correlation (
(24 KB PDF)
SCOP folds were divided into a number of bins according to the number of families that they contain (
(26 KB PDF)
SCOP folds and families were searched for (see
(23 KB PDF)
For more information on the occurrence of human families and fold class, see
The Ensembl database (
We thank the anonymous reviewers, Vladimir Uversky for reading a version of the manuscript, David Liberles, members of BFAM for insightful interaction helpful to this work, and Louise Gregory for helpful comments and PEDANT database setup for this work.
KS-test: Kolmogorov-Smirnov test
MW-test: Mann-Whitney test
SCOP: Structural Classification of Proteins