The authors have declared that no competing interests exist.
Conceived and designed the experiments: VV LMI. Performed the experiments: VV PRLM CJO XZ CH. Analyzed the data: VV LMI. Contributed reagents/materials/analysis tools: PRLM CJO. Wrote the paper: VV PRLM VNU LMI.
Current address: Google Inc, Mountain View, California, United States of America
Current address: WaterSmart Software, San Francisco, California, United States of America
The effects of disease mutations on protein structure and function have been extensively investigated, and many predictors of the functional impact of single amino acid substitutions are publicly available. The majority of these predictors are based on protein structure and evolutionary conservation, following the assumption that disease mutations predominantly affect folded and conserved protein regions. However, the prevalence of the intrinsically disordered proteins (IDPs) and regions (IDRs) in the human proteome together with their lack of fixed structure and low sequence conservation raise a question about the impact of disease mutations in IDRs. Here, we investigate annotated missense disease mutations and show that 21.7% of them are located within such intrinsically disordered regions. We further demonstrate that 20% of disease mutations in IDRs cause local disorder-to-order transitions, which represents a 1.7–2.7 fold increase compared to annotated polymorphisms and neutral evolutionary substitutions, respectively. Secondary structure predictions show elevated rates of transition from helices and strands into loops and
Intrinsically unstructured or disordered proteins have been implicated in the etiology of a wide spectrum of diseases. However, the molecular mechanisms that relate mutations in intrinsically disordered regions (IDRs) to disease pathogenesis have not been investigated. Disordered proteins do not conform to the prevailing view of deleterious mutations which equates function, structure and evolutionary conservation – intrinsically disordered regions are functional, but lack a fixed three-dimensional structure and in general have low sequence conservation. Here we demonstrate that >20% of disease-associated missense mutations affect IDRs and interfere with their functions. We further show that 20% of deleterious mutations in IDRs induce predicted disorder-to-order transitions. Our predictions are supported by accelerated molecular dynamics simulations that show an increase in helical propensity of the region harboring a disease disorder-to-order transition mutation of tumor protein p63. Our results refine the traditional structure-centric view of disease mutations and offer a new perspective on the role of non-synonymous mutations in disease. Our findings have broad implications for improving predictors of the functional impact of missense mutations, and for interpretation of novel variants identified in large genome sequencing projects that aim to provide a better understanding of human genetic variation and its relevance to common diseases.
Recent years have seen significant advancements in cataloging the genetic variation in humans and relating it to disease susceptibility. In particular, missense mutations, which introduce changes in the amino acid sequence of proteins, have been the subject of considerable attention due to the large number of ongoing exome sequencing studies. As a result, numerous computational models that classify amino acid substitutions as damaging or benign are currently available (reviewed in
Intrinsically disordered proteins were first identified as a distinct class of proteins more than a decade ago
Here, we offer a new perspective on disease mutations that accounts for mutations in disordered regions. We investigate disease-associated mutations located in ordered and disordered regions, and compare them to missense mutations from two control datasets, single amino acid polymorphisms and neutral evolutionary substitutions. We demonstrate that deleterious missense mutations may affect disordered regions, thereby disrupting the disorder-based type of structure. Our results suggest that disease mutations in ordered regions (ORs) and IDRs differ substantially in frequency, properties, and functional impact. We find that disease mutations in disordered regions more frequently cause predicted disorder-to-order transitions and influence predicted disordered binding regions (MoRFs) compared to mutations from the control datasets. IDR mutations are also enriched in DNA-binding and transmembrane domains, and in sites of posttranslational modifications. Accelerated molecular dynamics simulations performed on a deleterious disorder-to-order transition mutation that affects the DNA-binding domain of tumor protein p63 support our disorder predictions. We further show that two widely used predictors of functional impact of single nucleotide variants, PolyPhen-2 and SIFT, exhibit a >10% decrease in sensitivity when predicting the effect of annotated disease mutations located in IDRs compared to mutations located in ORs. Our findings have broad implications for improving predictors of the functional impact of missense mutations and therefore may significantly influence the interpretation of novel variants identified in large genome sequencing projects.
We examined the frequency of annotated disease mutations (DM) from the UniProt database in predicted ordered and disordered regions and compared them to the distributions of putatively functionally neutral mutations from two control datasets, annotated polymorphisms from UniProt (Poly) and neutral evolutionary substitutions (NES) (
Dataset | Proteins | Mutations | IDR n | IDR % | OR n | OR % | Fold |
DM | 2,194 | 15,459 | 3,356 | 21.7 | 12,103 | 78.3 | - |
Poly | 8,489 | 24,220 | 9,790 | 40.4 | 14,430 | 59.6 | 0.54 |
NES | 1,998 | 60,299 | 26,927 | 44.7 | 33,372 | 55.3 | 0.49 |
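The percentages and fold values in the table above can be rechecked from the raw counts. The following is a minimal sketch, assuming that the Fold column is the ratio of the IDR fraction in DM to the IDR fraction in the control dataset; that reading reproduces the tabulated values, but it is an interpretation and the function names are illustrative only.

```python
# Counts copied from the table above: (mutations in IDRs, mutations in ORs).
COUNTS = {
    "DM": (3356, 12103),
    "Poly": (9790, 14430),
    "NES": (26927, 33372),
}


def idr_fraction(dataset):
    """Fraction of a dataset's mutations that fall in predicted IDRs."""
    idr, ordered = COUNTS[dataset]
    return idr / (idr + ordered)


def fold_vs_control(control, case="DM"):
    """Assumed definition of Fold: IDR fraction in case / IDR fraction in control."""
    return idr_fraction(case) / idr_fraction(control)


for name in ("Poly", "NES"):
    print(name, round(fold_vs_control(name), 2))
```

A fold below 1 means that disease mutations fall in IDRs less often than the control mutations do.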
The enrichment of disease mutations in ORs cannot be explained by the overall lower disorder content of the proteins containing these mutations. Although proteins that carry disease-associated mutations are on average slightly less disordered than proteins from the Poly dataset (mean±SD 32.7±17.9%
Dataset | D→O n | D→O % | D→D n | D→D % | Fold | P-value | O→D n | O→D % | O→O n | O→O % | Fold | P-value |
DM | 670 | 20.0 | 2,686 | 80.0 | - | - | 590 | 4.9 | 11,513 | 95.1 | - | - |
Poly | 1,125 | 11.5 | 8,665 | 88.5 | 1.7 | 1.06×10^−32 | 710 | 4.9 | 13,720 | 95.1 | 1.0 | 0.89 |
NES | 1,971 | 7.3 | 24,956 | 92.7 | 2.7 | 5.47×10^−105 | 1,870 | 5.6 | 31,502 | 94.4 | 0.9 | 0.0023 |
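As a rough check of the transition table above, the fold values can be recomputed from the counts, and a simple significance test can be run on each 2×2 comparison. The sketch below uses a Pearson chi-square test with one degree of freedom purely as an illustration; the statistical test behind the tabulated values is not stated in this section, so the resulting p-values should not be expected to match the table exactly. All names are hypothetical.

```python
import math

# Counts copied from the table above. Keys: DO = D→O, DD = D→D, OD = O→D, OO = O→O.
TRANSITIONS = {
    "DM": {"DO": 670, "DD": 2686, "OD": 590, "OO": 11513},
    "Poly": {"DO": 1125, "DD": 8665, "OD": 710, "OO": 13720},
    "NES": {"DO": 1971, "DD": 24956, "OD": 1870, "OO": 31502},
}


def transition_fold(control, changed, unchanged, case="DM"):
    """Ratio of transition rates (case over control) within one region type."""
    def rate(dataset):
        row = TRANSITIONS[dataset]
        return row[changed] / (row[changed] + row[unchanged])
    return rate(case) / rate(control)


def chi_square_p(control, changed, unchanged, case="DM"):
    """Pearson chi-square p-value (1 d.f., no continuity correction) for a 2x2 table.

    Illustrative choice of test only; not necessarily the test used in the study.
    """
    a, b = TRANSITIONS[case][changed], TRANSITIONS[case][unchanged]
    c, d = TRANSITIONS[control][changed], TRANSITIONS[control][unchanged]
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 d.f., the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))
```

For example, `transition_fold("Poly", "DO", "DD")` compares the D→O rate among IDR mutations in DM with the same rate in Poly.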
Despite the prevalence of disease mutations in ordered regions, 21.7% of DMs are mapped to the predicted disordered regions. We have investigated these mutations in greater detail, as discussed below, and mutations in IDRs form the main focus of the remainder of this study.
Based on the predicted disorder probability score, a residue can be classified as ordered or disordered depending on whether its score is below or above a threshold of 0.5. When analyzed from an order/disorder perspective, any missense mutation can have two different outcomes: (i) it can change the prediction score sufficiently to cross the 0.5 threshold, which would result in a conversion of the prediction from disorder to order, or from order to disorder; or (ii) it can preserve the order/disorder assignment. Thus, the effect of missense mutations can be classified as D
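The threshold rule described above can be sketched in a few lines. This is an illustration only: it assumes that the disorder scores of the mutated position in the wild-type and mutant sequences are already available from a predictor, and it treats a score of exactly 0.5 as disordered, which the text does not specify. The function names and the ASCII labels are hypothetical.

```python
THRESHOLD = 0.5  # order/disorder cutoff on the predicted disorder probability


def state(score, threshold=THRESHOLD):
    """Return 'D' (disordered) or 'O' (ordered) for one residue score.

    Assumption: a score exactly at the threshold counts as disordered.
    """
    return "D" if score >= threshold else "O"


def classify_mutation(wt_score, mutant_score):
    """Label a mutation by the predicted state before and after the substitution.

    Returns one of 'D->O', 'D->D', 'O->D', 'O->O'.
    """
    return f"{state(wt_score)}->{state(mutant_score)}"


print(classify_mutation(0.72, 0.31))  # score crosses the threshold downward
print(classify_mutation(0.72, 0.64))  # assignment is preserved
```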
Disease mutations mapped to disordered regions cause D
To better understand how disease mutations influence protein secondary structure, we applied the secondary structure predictor PHD
Transitions from helix or strand to loop and
Despite the lack of stable secondary and tertiary structure in disordered regions, the dynamic behavior of IDRs does not preclude formation of short transient secondary structure elements. These short transient elements, or Molecular Recognition Features (MoRFs)
Molecular recognition features (MoRFs) are short order-prone segments within longer disordered regions that fold upon binding to their interaction partners
IDR mutations lead to gain or loss of predicted α-MoRFs 2.2 to 5.1 times more frequently than OR mutations, independent of the dataset used (
Y-axes show fractions of all D→O and O→D mutations that cause loss or gain of the predicted α-MoRFs, and error bars correspond to one standard deviation.
We also examined the influence of disease and control mutations on Eukaryotic Linear Motifs (ELMs), short (3 to 11 residues) conserved sequence motifs that play roles in mediating cell signaling, controlling protein turnover and directing protein localization
To characterize the functional impact of missense mutations, we examined UniProt region/residue feature annotations associated with each mutation (
Top row (A, B) contains level 1 and the bottom row (C, D) level 2 features. Error bars are one standard error of fold difference. Categories are sorted by decreasing fold difference in DM compared to controls.
In order to investigate mutations that contribute to the observed D
In the heatplots (panels A, B, C, D) wild-type residues are on the Y-axes and the mutant residues on the X-axes. The residues are arranged according to the Vihinen flexibility scale
The heat plots in
To verify that this result is not an artifact of our analysis, for example due to general enrichment of R→W mutations in disordered regions, or the choice of control datasets, we have compared the frequencies of R→W substitutions from this study to the matrices constructed based on the alignments of completely disordered sequences
Another category of amino-acid substitutions in DM, albeit not significantly enriched as a group, involves order-to-disorder mutations, such as L→P, C→R, G→R, W→R and others (
In summary, our analysis shows that a limited set of mutations accounts for a large fraction of all D→O and O→D transitions in the DM dataset. The top five disorder-to-order transition mutations (R→W, R→C, E→K, R→H and R→Q) collectively account for 44.0% of all D→O disease mutations, and the top five order-to-disorder transition mutations (L→P, C→R, G→R, W→R and F→S) collectively account for 32.2% of all O→D disease mutations (
We next compared the frequencies of wild-type and mutant residues in all datasets to the frequencies of typical human proteins from the UniProt database (
High mutability of arginine, also observed in earlier studies
A recent study demonstrated that SIFT has a higher error rate when predicting the impact of SNVs in solvent accessible and disordered protein regions
Both predictors show a drop in sensitivity for disease mutations in IDR and D
We observed 670 mutations in UniProt predicted to cause D→O transitions, and 590 mutations predicted to cause O→D transitions (
(A) PONDR VLXT disorder predictions for wild-type and mutant p63. The R243W mutation causes a drop in the disorder score of the 235–245 region. (B) Differential free energy weighted φ/ψ propensity plots for residues 235–243 obtained from AMD simulations performed on the wild-type and R243W mutant p63 DBD systems. The red dots represent the regions of the Ramachandran plot more heavily sampled by the mutant, and the black dots represent the regions with greater propensity in the wild-type system.
Tumor protein p63 (TP63) is a transcription factor involved in development and morphogenesis of epithelial tissues
DNA-binding domains of transcription factors tend to be predicted as fully or partially disordered
AMD is an efficient and versatile enhanced conformational space sampling algorithm that has previously been successfully applied to the study of the conformational behavior of IDPs
It is interesting to note that in both the experimental NMR structure and the AMD simulations for wt-p63 the side-chain of R243 forms a strong salt-bridge with E252. One may postulate that in the wild-type system the strong electrostatic interaction between R243 and E252 introduces tensile stress in the extended loop region K232-R243, which exhibits conformational exchange on slow time-scales between local extended β-sheet/PPII and α-helical constructs. By contrast, the introduction of the R243W mutation removes the tensile strain from the loop facilitating the formation of a stable α-helix.
The widely accepted structure-centric view of deleterious mutations asserts that a disease may be caused by mutations disrupting protein activity, stability, oligomerization and other structure-based properties. Here, we further extend this concept by introducing a disorder-centric view of disease mutations, according to which a disease may arise due to a disruption of the disorder-based protein properties
There are many ways in which mutations in IDR may increase disease risk or cause a disease. For example, D→O mutations have a potential to alter interactions with DNA, RNA, proteins or ligands. Both, our results and those of a recent study by Dan
Our results show that across all three datasets, mutations in IDR are more likely to cause a predicted D→O transition than mutations in ORs are to cause a predicted O→D transition (
Our findings have wide implications for large genome sequencing projects that aim to provide a better understanding of human genetic variation and its relevance to complex diseases
A broader issue raised by our results is that caution should be exercised when interpreting the relationship between structure, function and conservation. A study by Yue and Moult found that human disease-relevant mutations in some cases could correspond to the wild-type variants in the mouse
Choosing an appropriate control for the analysis of disease mutations is an issue which deserves close attention
In summary, our results refine the traditional structure-centric view of disease mutations, and suggest new avenues for research in the area of protein disorder. With the recent explosion of exome and whole genome sequencing efforts, interpretation of the identified variants will require highly accurate predictors for the functional impact of SNVs in order to make reliable conclusions about their health risks. Our results offer help in narrowing down the gamut of disease mutations that dramatically influence protein structure and disorder. We hope that they will also facilitate predictions of the influence of mutations on protein function, which is currently a formidable task. The importance of mutations in disordered regions should not be overlooked in an attempt to construct better predictors.
A list of single amino acid substitutions annotated with the keyword “disease” was extracted from the UniProt/SwissProt database
The initial set of mutations was filtered as follows: proteins that carry disease mutations and have ≥40% pairwise sequence identity were clustered using hierarchical clustering with single linkage, and one representative protein was selected at random from each cluster. We further removed four proteins with an unusually high number of annotated disease mutations (
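The redundancy filter described above can be illustrated with a small sketch. Cutting a single-linkage hierarchy at a fixed identity cutoff is equivalent to taking connected components of the graph in which two proteins are linked when their pairwise identity is at least 40%, so a union-find structure is sufficient. The input format (a dictionary of pairwise percent identities), the seed, and all names are assumptions made for the example, not details taken from the study.

```python
import random


def single_linkage_clusters(protein_ids, identity, cutoff=40.0):
    """Group proteins connected by any chain of pairs with identity >= cutoff."""
    parent = {p: p for p in protein_ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), percent in identity.items():
        if percent >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for p in protein_ids:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in clusters.values())


def pick_representatives(clusters, seed=0):
    """Choose one protein at random from each cluster (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.choice(members) for members in clusters]


# Hypothetical identities: A-B and B-C exceed the cutoff, A-C does not.
example_ids = ["A", "B", "C", "D"]
example_identity = {("A", "B"): 55.0, ("B", "C"): 42.0, ("A", "C"): 20.0, ("C", "D"): 10.0}
example_clusters = single_linkage_clusters(example_ids, example_identity)
example_reps = pick_representatives(example_clusters)
```

In the example, A and C end up in the same cluster through B even though their direct identity is below the cutoff, which is the chaining behavior characteristic of single linkage.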
We assembled two control datasets: (1) annotated single amino acid polymorphisms from UniProt (Poly)
The second control dataset (NES) (
Protein disorder was predicted using VLXT
As a second comparison of disorder predictors, we examined the distributions of the difference between disorder prediction scores on WT and mutated sequences, defined as Δps = ps(WT residue)−ps(mutated). The three predictors have different observed dynamic ranges for Δps: [−0.91, 0.85] for VLXT, [−0.34, 0.39] for VSL2B and [−0.28, 0.27] for IUPRED, consistent with the fact that VLXT is more sensitive to small changes in amino acid sequence. Distribution of Δps is more platykurtic in DM compared to Poly and NES for all three predictors (higher % of disease-associated mutations in the tails), indicating that disease mutations tend to cause stronger differences in prediction scores.
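The Δps comparison above can be sketched as follows. With the sign convention given in the text, a positive Δps means that the mutation lowers the disorder score. The tail cutoff used here is an arbitrary illustrative value, not one reported in the study, and the function names are hypothetical.

```python
def delta_ps(wt_score, mutant_score):
    """Δps = ps(WT residue) − ps(mutated residue), as defined in the text."""
    return wt_score - mutant_score


def tail_fraction(deltas, cutoff):
    """Fraction of mutations whose |Δps| is at least `cutoff`.

    The cutoff is a free parameter of this sketch; a larger fraction
    indicates more mass in the tails of the Δps distribution.
    """
    if not deltas:
        raise ValueError("need at least one Δps value")
    return sum(1 for d in deltas if abs(d) >= cutoff) / len(deltas)


# Hypothetical (wild-type, mutant) score pairs for four mutations.
pairs = [(0.80, 0.30), (0.55, 0.50), (0.20, 0.85), (0.40, 0.38)]
deltas = [delta_ps(wt, mut) for wt, mut in pairs]
print(tail_fraction(deltas, cutoff=0.3))
```

Computing this fraction separately for DM, Poly and NES, and for each predictor, gives a simple way to compare how often mutations produce large score changes.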
Secondary structure was predicted from sequence using PHDsec
α-MoRFs were predicted from sequence using a two stage stacked prediction method
Residues were functionally annotated using the UniProt/SwissProt feature table (FT) at two levels of granularity, the FT keywords only (level 1) and concatenations of the FT keyword and description (level 2). Features marked as “Potential”, ”Probable” or “By similarity” were removed. The “Description” field was normalized by removing prefixes such as “For”, “Required for”, “Sufficient for”, “Essential for”, “Essential to”, “Important for”, “Critical for”, “Necessary for”, “Involved in”, “Mediates”, etc. Finally, all features that occurred <5 times in DM were removed. After this process, 22 level 1 and 782 level 2 features remained. We removed all disease keywords from this analysis, since they would be trivially enriched in the DM dataset.
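The annotation procedure above can be sketched as a small filtering pipeline. The record layout (dictionaries with `keyword`, `description` and `qualifier` fields), the `": "` separator used to concatenate keyword and description, and the function names are assumptions made for this illustration; only the prefix list, the excluded qualifiers and the minimum count of 5 come from the text.

```python
from collections import Counter

# Prefixes listed in the text; "etc." there means this list is not exhaustive.
PREFIXES = (
    "Required for", "Sufficient for", "Essential for", "Essential to",
    "Important for", "Critical for", "Necessary for", "Involved in",
    "Mediates", "For",
)
EXCLUDED_QUALIFIERS = ("Potential", "Probable", "By similarity")


def normalize_description(description):
    """Strip one leading prefix such as 'Required for' from a description."""
    text = description.strip()
    for prefix in PREFIXES:
        # Require a following space so that e.g. 'Formation' is not cut at 'For'.
        if text.lower().startswith(prefix.lower() + " "):
            return text[len(prefix):].strip()
    return text


def count_features(records, min_count=5):
    """Return (level 1, level 2) feature counts after filtering.

    Level 1 uses the keyword only; level 2 uses keyword plus normalized description.
    """
    kept = [r for r in records if r.get("qualifier") not in EXCLUDED_QUALIFIERS]
    level1 = Counter(r["keyword"] for r in kept)
    level2 = Counter(
        f'{r["keyword"]}: {normalize_description(r["description"])}' for r in kept
    )

    def frequent(counts):
        return {k: v for k, v in counts.items() if v >= min_count}

    return frequent(level1), frequent(level2)
```

Here `min_count=5` mirrors the removal of features that occur fewer than 5 times in DM.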
Standard classical and accelerated molecular dynamics simulations were performed on both wild-type and R243W p63 mutant using an in-house modified version of the AMBER-10 simulation suite
Histograms of the distribution of proteins in DM and Poly datasets with x% of residues predicted to be disordered by VLXT. The lower mode and shorter right tail of the DM distribution indicate that on average proteins carrying disease-associated mutations (DM) are less disordered than proteins carrying polymorphisms (Poly) (mean±SD 32.7±17.9% disorder
(TIF)
Summary of the effect of mutations in DM, Poly, NES on predicted molecular recognition features (α-MoRFs). Disease D→O transition mutations lead to a loss, while O→D transition mutations lead to a gain of predicted MoRFs, significantly more frequently than control mutations (marked with an asterisk, and reproduced in
(TIF)
Frequencies of mutated residues across all proteins (A, B); in ordered regions (C, D), and in disordered regions (E, F). In panel (A) frequencies of amino acids across whole proteins were normalized by frequencies in human proteins from UniProt; (C) frequencies in ORs normalized with frequencies from PDBS25 (sequences of proteins with solved crystal structures from PDB, filtered at 25% pairwise sequence identity); and (E) frequencies in IDR with frequencies in experimentally confirmed disordered regions from the DisProt database, as described in (Vacic
(TIF)
Frequencies of residues mutated into (A, B) across all proteins, (C, D) in ordered regions and in (E, F) disordered regions only. Normalization was performed as in
(TIF)
Frequencies of mutations from (A,B) and into (C,D) arginine in DM, Poly and NES mutation datasets. (A) and (C) were normalized by the frequencies of amino acids from human proteins in UniProt, as described in (Vacic
(TIF)
Histogram of PolyPhen-2 scores for (A) disease mutations shows a drop in sensitivity for mutations in IDRs and specifically for D→D mutations, while scores for (B) neutral polymorphisms and (C) neutral evolutionary substitutions show a drop in specificity for D→O mutations. High scores indicate deleterious mutations.
(TIF)
Histogram of SIFT scores for (A) disease mutations shows a drop in sensitivity for mutations in IDRs and specifically for D→D mutations, while scores for (B) neutral polymorphisms and (C) neutral evolutionary substitutions show a drop in specificity for D→O mutations. Scores ≤0.05 indicate damaging mutations.
(TIF)
Scatter plots of the number of mutations per protein against the rank of the protein for (A) DM, (B) Poly and (C) NES. Disease mutation (DM) plot (A) identifies four proteins which have an unusually high number of annotated disease mutations: tumor suppressor p53 (P04637), coagulation factor VIII (P00451), androgen receptor (P10275), and Stargardt disease protein (P78363). Taken together, these 4 proteins account for a total of 12.4% of all disease mutations, and were removed from subsequent analysis. The protein with most mutations in plot (B) is titin (Q8WZ42), the longest protein in the human proteome as annotated in UniProt, which has been removed from the Poly dataset.
(TIF)
Disease mutations have higher frequencies in ordered regions independently of the choice of predictor (VLXT, VSL2B, and IUPRED).
(XLS)
Comparison of mutation rates in amino acid substitutions per residue (mean ± standard deviation) in disordered (IDR) and ordered (OR) regions in three studied datasets, disease mutations (DM), polymorphisms (Poly) and neutral evolutionary substitutions (NES).
(XLS)
Disorder-to-order transition mutations are significantly enriched in DM independently of the choice of predictor. Order-to-disorder transition mutations are significantly depleted in disease when compared to NES but not when compared to Poly (after multiple testing correction).
(XLS)
PHD secondary structure predictions (E strand, H helix, L loop) for disease mutations (DM), polymorphisms (Poly) and neutral evolutionary substitutions (NES) show an enrichment of predicted helices (H) and strands (E) in DM, and a corresponding depletion of loops (L).
(XLS)
Changes in PHD secondary structure predictions (E strand, H helix, L loop) upon mutation in the disease mutations (DM), polymorphisms (Poly) and neutral evolutionary substitutions (NES) datasets.
(XLS)
Number of mutations in disease (DM), polymorphisms (Poly) and neutral substitutions (NES) datasets mapped to human instances of the eukaryotic linear motifs (ELM). Compared to controls, IDR disease mutations are enriched in ELM regions. D→O mutations in DM are significantly enriched in ELM regions compared to NES.
(XLS)
Fold difference for Swiss Prot FT Level 1 features between DM and Poly (first two rows), and DM and NES (last three rows) for D→O and O→D transitions. Only features with Bonferroni-corrected
(XLS)
Fold difference for Swiss Prot FT Level 2 features between DM and Poly (first five rows), and DM and NES (the remaining rows) for D→O and O→D transitions. Since no features in DM/Poly and only two features in DM/NES passed the Bonferroni-corrected
(XLS)
PolyPhen-2 and SIFT calls for all mutations in DM, Poly and NES show a drop in sensitivity for calling disease-associated mutations in IDRs (bold font in row DM IDR) and a drop in specificity for D→O (bold font in rows Poly D→O and NES D→O).
(XLS)
670 disease mutations from UniProt predicted to result in a D→O transition.
(XLS)
590 disease mutations from UniProt predicted to result in an O→D transition.
(XLS)
Summary of the AMD simulations as secondary structure propensities in the DNA-binding domain of tumor protein p63 in the wild-type p63 (Before mutation) and in the R243W mutant (After mutation). “Difference” displays the differences between the wild-type and the R243W mutant and demonstrates that upon mutation the propensity towards α-helical conformation increases, leading to a decrease in entropy of the sampled populations for all but one residue (K242). Abbreviations are as follows: β, β-sheet (−180<φ<−100, ψ>120); ppII, poly-proline II (−100<φ<0, ψ>120); α, α-helix (−100<φ<0, −75<ψ<−25); Frust. α, “frustrated” α-helix (−159<φ<−100, −75<ψ<−25); Entropy, Shannon's entropy of the residue propensities.
(XLS)
Details of α-MoRF predictions.
(DOC)
Details of accelerated molecular dynamics (AMD) simulations carried out on the wild-type and R243W mutant of p63 DNA-binding domain.
(DOC)
We are indebted to Keith Dunker and Andrew McCammon for fruitful and stimulating discussions and suggestions. We would like to thank Celeste Brown for providing us with disorder-based alignment matrices and Molecular Kinetics Inc. for the access to the PONDR VLXT predictor.