AS and BR conceived and designed the experiments, analyzed the data, and wrote the paper. AS and JL performed the experiments. All authors contributed reagents/materials/analysis tools.
The authors have declared that no competing interests exist.
Natively unstructured or disordered protein regions may increase the functional complexity of an organism; they are particularly abundant in eukaryotes and often evade structure determination. Many computational methods predict unstructured regions by training on outliers in otherwise well-ordered structures. Here, we introduce an approach that uses a neural network in a very different and novel way. We hypothesize that very long contiguous segments with nonregular secondary structure (NORS regions) differ significantly from regular, well-structured loops, and that a method detecting such features could predict natively unstructured regions. Training our new method, NORSnet, on predicted information rather than on experimental data yielded three major advantages: it removed the overlap between testing and training, it systematically covered entire proteomes, and it explicitly focused on one particular aspect of unstructured regions with a simple structural interpretation, namely that they are loops. Our hypothesis was correct: well-structured and unstructured loops differ so substantially that NORSnet succeeded in their distinction. Benchmarks on previously used and new experimental data of unstructured regions revealed that NORSnet performed very well. Although it was not the best single prediction method, NORSnet was sufficiently accurate to flag unstructured regions in proteins that were previously not annotated. In one application, NORSnet revealed previously undetected unstructured regions in putative targets for structural genomics and may thereby contribute to increasing structural coverage of large eukaryotic families. NORSnet found unstructured regions more often in domain boundaries than expected at random. In another application, we estimated that 50%–70% of all worm proteins observed to have more than seven protein–protein interaction partners have unstructured regions. 
The comparative analysis between NORSnet and DISOPRED2 suggested that long unstructured loops are a major part of unstructured regions in molecular networks.
The details of protein structures are important for function. Regions that do not adopt any regular structure in isolation (natively unstructured or disordered regions) initially appeared as a curious exception to this structure–function paradigm. It has become increasingly clear that unstructured regions are fundamental to many roles and that they are particularly important for multicellular organisms. Structural biology is just beginning to apprehend the stunning diversity of these roles. Here, we focused on unstructured regions dominated by a particular type of loop, namely the natively unstructured one. We developed a method that succeeded in the distinction between well-structured and natively unstructured loops. For the development, we did not use any experimental data for unstructured regions; when tested on experimental data, the method performed surprisingly well. Due to its different premises, the method captured very different aspects of unstructured regions than other methods that we tested. We applied the new method to two different problems. The first was the identification of proteins that may be difficult targets for structure determination. The second was the identification of worm proteins that have many interaction partners (more than seven) and unstructured regions. Surprisingly, we found unstructured regions of the loopy type in more than 50% of all the promiscuous worm proteins.
One central paradigm of structural biology is that the intricate details of 3-D protein structures determine protein function [
Proteins with unstructured regions are likely to occupy large portions of sequence space [
Methods that predict unstructured regions from sequence are mushrooming. Fast methods identify regions with high net charge and low hydrophobicity [
Our group identified long regions with no regular secondary structure (NORS), which are stretches of 70 or more sequence-consecutive surface residues with few or no predicted helices and strands [
NORS regions capture only one particular aspect of unstructured regions (
One goal of structural genomics is the determination of a 3-D structure representative for every protein family [
Our first hypothesis was that NORS regions share commonalities that distinguish such long unstructured loops from well-structured loops. If so, we should be able to somehow distinguish between the two types of loops at least in the sense that all loops predicted to be unstructured by our method ought to have different average features from other loops. We assumed that the neural network would pick up local correlations in amino-acid preferences for the different structural states. Our second hypothesis was that what distinguishes NORS regions from regular loops is exactly what makes regions become unstructured. If so, our method for the identification of NORS regions would also accurately predict unstructured regions.
Here, we describe NORSnet, a new method that extends our NORS concept to also detect shorter (30–70 residues) NORS-like regions. The method was developed without ever using proteins with experimentally known unstructured regions. Instead, it was optimized to distinguish predicted NORS from all other regions. This unique approach, unprecedented in any machine learning method competing in a real-life application with other methods, has three important advantages. First, the data used for development and testing do not overlap. Since NORS regions were predicted from sequence, we can identify thousands of such regions. Our dataset was “dirty” in the sense that it contained many false negatives (all residues in PDB were considered to be well-structured during training) as well as some false positives (incorrect NORS predictions). This was the second major advantage: the positives (unstructured regions) sampled entirely sequenced organisms without any major bias with respect to this particular flavor of unstructured regions. Thereby, we identified unstructured regions that were missed by methods trained on more specialized datasets. The third advantage was that the resulting method explicitly focused on one feature of unstructured regions with a structural interpretation, namely that they are loops. Although we could have assessed NORSnet on any existing dataset due to the lack of overlap, we added a new set with experimental data about unstructured regions different from existing data. Note that both sets differed from each other as well as from the set used for development.
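The NORS concept that NORSnet extends can be illustrated with a minimal sketch that scans a predicted secondary structure string for long windows containing few helix or strand residues. The function name, the `H`/`E` string encoding, and the 12% cutoff for "few" are assumptions made for illustration; the text above specifies only the window lengths (70 or more residues, extended to 30–70 by NORSnet). The sketch also ignores the surface-exposure criterion and does not reproduce the neural network itself.

```python
def nors_like_regions(ss, window=70, max_regular=0.12):
    """Return merged (start, end) half-open spans of low-regularity windows.

    ss          -- predicted secondary structure, one character per residue;
                   'H' (helix) and 'E' (strand) count as regular (assumed encoding)
    window      -- minimal region length (70 for classic NORS; try 30 for shorter ones)
    max_regular -- largest tolerated fraction of H/E residues (illustrative value)
    """
    n = len(ss)
    if n < window:
        return []
    regular = [1 if c in "HE" else 0 for c in ss]
    count = sum(regular[:window])  # H/E residues in the first window
    regions = []
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window by one residue: add the new one, drop the old one.
            count += regular[start + window - 1] - regular[start - 1]
        if count / window <= max_regular:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous span: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


if __name__ == "__main__":
    toy = "H" * 20 + "L" * 80 + "E" * 20
    print(nors_like_regions(toy))
```

Calling the function with `window=30` mimics the shorter NORS-like regions discussed above, again only in the sense of this simplified criterion.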
Our three major results confirmed our hypotheses: (1) training on predictions succeeded in developing a powerful prediction method; (2) long loops are a major component of what is picked up by existing methods predicting unstructured regions; and (3) well-ordered and unstructured loops differ. In conjunction with existing methods, the one that we introduce here will make it possible to focus on particular structural aspects.
We trained our system on NORS regions that had been predicted by our previous high-accuracy/low-coverage method [
First, we established success by predicting
Very long NORS regions differ statistically from regularly structured or well-ordered loops [
We compared the amino acid compositions between four different subsets representing four types of “loops” (nonhelix/nonstrand): loops from regular, well-ordered structures; i.e., from proteins without natively unstructured regions (states T, S, L from the Dictionary of Secondary Structure of Proteins; in blue); unstructured loops as predicted by NORSnet (in green); “flexible loops” from regular structures (TSL states with normalized B-factors ≥1 [
NORSnet precisely distinguished between unstructured and well-structured loops. Although the amino acid composition of unstructured loops was similar to that in long disordered regions [
About 30%–60% of all eukaryotic proteins have been estimated to contain unstructured regions [
To assess the accuracy of NORSnet and to estimate to what extent unstructured loops dominate our current identification of unstructured regions, we investigated two different datasets. The first was built around the DisProt database used previously in the literature; the second originated from careful NMR measurements and has not been used in many previous analyses.
The first set included proteins with unstructured regions from DisProt as positives and 173 PDB structures from EVA (a server for assessing protein structure prediction servers) as negatives (see
(A) ROC-like curve for NORSnet (green), DISOPRED2 (orange), and their combination (through arithmetic average; gray). While the performance of NORSnet and DISOPRED2 was similar, the combined method seemed to outperform both methods. In particular, at accuracy = 100% (inset), the combined method covers significantly more sequences than either method individually. IUPred (purple) outperformed all other methods on this dataset. Note that IUPred was optimized on a set similar to the one used in this study. In contrast, NORSnet and DISOPRED2 were optimized on different sets defining disorder differently.
(B) Venn diagram of overlap between very accurate predictions by NORSnet, DISOPRED2, and the combined method. The numbers in the circles are mutually exclusive; for instance, two proteins were identified only by DISOPRED2 to have an unstructured region, and 17 proteins were identified by both NORSnet and by the combined method to have an unstructured region.
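The two operations behind this figure, combining two predictors by arithmetic average and counting mutually exclusive Venn regions, can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that both methods emit per-residue scores on a comparable scale, and the function names and toy protein identifiers are hypothetical.

```python
from collections import Counter


def combine_by_average(scores_a, scores_b):
    """Arithmetic average of two per-residue score lists of equal length."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must cover the same residues")
    return [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]


def exclusive_venn_counts(flagged):
    """Count proteins per mutually exclusive Venn region.

    flagged -- dict mapping a method name to the set of protein ids it flags.
    Returns a Counter keyed by the frozenset of methods that flag a protein,
    so every protein is counted in exactly one region.
    """
    counts = Counter()
    all_ids = set().union(*flagged.values())
    for pid in all_ids:
        region = frozenset(m for m, ids in flagged.items() if pid in ids)
        counts[region] += 1
    return counts


if __name__ == "__main__":
    print(combine_by_average([0.25, 0.75], [0.5, 0.25]))
    toy = {"NORSnet": {"p1", "p2"}, "DISOPRED2": {"p2", "p3"}}
    for region, n in exclusive_venn_counts(toy).items():
        print(sorted(region), n)
```

Keying each region by the exact set of methods that flag a protein is what makes the counts mutually exclusive, as in the caption above.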
NORSnet predictions were not superior to those from DISOPRED2. However, the performance of these two methods was surprisingly similar despite the fact that NORSnet was not trained on a single experimentally verified unstructured region. Did the similarity in performance indicate that both methods picked up the same signal, i.e., that DISOPRED2 largely captured unstructured loops?
If two prediction methods are based on very different information, their combination typically improves performance over any one of them [
The combination of DISOPRED2 and NORSnet by averaging their outputs was better than either method alone. This did not work with IUPred and either of the two methods. This might suggest that IUPred covers the same aspects as the other two. However, this notion proved to be incorrect: IUPred missed proteins in the NESG dataset that the others captured (
(A) The NESG set contains many proteins with unstructured regions that are not in DisProt and have never been used for method optimization. We compared NORSnet (in green), DISOPRED2 (in orange), their combined method (in gray), and IUPred (in purple) on these proteins. While DISOPRED2 performed better than all other methods in the low accuracy/high coverage region (top left), the combined method, NORSnet, and IUPred individually excelled in the high accuracy/low coverage region (lower right).
(B) Venn diagram of overlap between very accurate predictions by NORSnet, DISOPRED2, and IUPred. The numbers in the circles are mutually exclusive. Note that five proteins were identified only by NORSnet to have an unstructured region.
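The accuracy/coverage trade-off shown in such ROC-like curves can be reproduced in outline with the sketch below. It assumes one score per protein, and it assumes that accuracy means the fraction of predicted proteins that are true positives and coverage means the fraction of all positives that are predicted. These definitions and the function names are assumptions for illustration, since this section does not define them.

```python
def accuracy_coverage(pos_scores, neg_scores, threshold):
    """Accuracy and coverage when every protein scoring >= threshold is predicted.

    pos_scores -- scores of proteins with experimentally unstructured regions
    neg_scores -- scores of well-structured proteins
    Returns (accuracy, coverage); accuracy is None if nothing is predicted.
    """
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    predicted = tp + fp
    accuracy = tp / predicted if predicted else None
    coverage = tp / len(pos_scores) if pos_scores else None
    return accuracy, coverage


def coverage_at_full_accuracy(pos_scores, neg_scores):
    """Coverage at the most permissive cutoff that excludes every negative."""
    cutoff = max(neg_scores) if neg_scores else float("-inf")
    tp = sum(s > cutoff for s in pos_scores)  # strictly above the best negative
    return tp / len(pos_scores)


if __name__ == "__main__":
    pos = [0.9, 0.8, 0.4]
    neg = [0.5, 0.1]
    for t in (0.3, 0.45, 0.6):
        print(t, accuracy_coverage(pos, neg, t))
    print(coverage_at_full_accuracy(pos, neg))
```

Sweeping the threshold traces the curve; the second function corresponds to the "accuracy = 100%" operating point used for the Venn diagrams.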
Many prediction methods were optimized or benchmarked on datasets overlapping with DisProt. In contrast, the dataset from the NESG contained proteins with unstructured regions that have not been used for training existing methods yet. The NESG set was collected with a unified definition of unstructured regions based on 2-D NMR experiments [
As demonstrated above, NORSnet and other predictors give similar predictions with some exceptions. For instance, we applied NORSnet and two other prediction tools (DISOPRED2 and FoldIndex) on the Kappa-casein precursor protein that is found in milk and stabilizes micelle formation by preventing casein precipitation. Raman optical activity and thermal stability experiments revealed the protein as entirely unstructured in isolation [
Kappa-casein precursor has been shown to be unstructured by different experiments [
This example reveals that NORSnet and DISOPRED2 outputs are rather correlated. However, the signal from NORSnet clearly indicated unstructured regions, while the one from DISOPRED2 did not. One reason for this drastic difference may have been that NORSnet correctly captured some global feature from its global input units (see
Although NORSnet was designed to identify all regions in any PDB structure as well-structured, the editor of this manuscript, Phil Bourne, suspected that NORSnet predictions of disorder might more often be in domain boundaries than expected at random and than expected for loop residues in general. To address this, we started with a sequence-unique subset of all PDB proteins considered to be multidomain by SCOP [
The DNA fragmentation factor (DFF) 45 must bind to DFF40 so that DFF40 can execute its catalytic function required for the onset of caspase-mediated apoptosis [
DFF45 (white, yellow, and red) becomes structured upon complex formation with DFF40 (purple; [
Secondary structure-prediction methods, such as PSIPRED and PROFsec, usually predict the secondary structure of these regions the way they appear in substrate-bound form. Therefore, methods that use this type of information might be fooled by the rigidity and stability that are associated with regular secondary structure segments and identify these regions as well-structured. Since NORSnet uses secondary structure predictions as input, it may mispredict unstructured regions that become helices and strands upon binding. However, despite the fact that DFF45 NTD is enriched in regular secondary structure (
The unstructured regions in DFF45 are correctly identified by many prediction methods. NORSnet, DISOPRED2, and FoldIndex are only three of those. This example was one of 24 proteins with unstructured regions that become structured upon binding and were extensively analyzed in a recent study [
The structural plasticity of proteins with unstructured regions may enable their binding to many proteins, i.e., may typify a protein–protein interaction hub (a protein with many binding partners in an interaction network) [
We addressed this point by correlating sustained large-scale datasets of physical protein–protein interactions (see
We ran both NORSnet and DISOPRED2 on worm proteins that are involved in protein–protein interactions (as identified by yeast two-hybrid [
(A) These graphs were compiled with the reliability threshold at which each method achieved 100% accuracy on the NESG data (
(B) NORSnet (filled, dark green) predicted many more unstructured regions in proteins with seven or more interaction partners than did DISOPRED2 (hatched, light green).
(C) For the thresholds at which both methods achieved 100% accuracy on the DisProt dataset, DISOPRED2 identified more proteins with unstructured regions than did NORSnet. In contrast to the situation for the NESG set (A), the difference was not as significant for promiscuous proteins (ten or more partners).
(D) Although NORSnet (filled, dark green) predicted as many unstructured as structured regions in hubs (seven or more), this ratio was significantly smaller than the one for proteins with a single interaction partner. In other words, even on this dataset NORSnet picked up a much stronger overrepresentation of unstructured regions in hubs than did DISOPRED2 (hatched, light green).
We chose the cutoff that yielded the highest number of unstructured regions (NORSnet, 1,279; DISOPRED2, 1,282) for each method and checked whether the two methods predicted unstructured regions in the same hub proteins. Both methods predicted unstructured regions in most (74) of the proteins observed with more than ten partners (140). DISOPRED2 predicted unstructured regions in another 13 of the promiscuous proteins, and NORSnet in another 21 proteins. If the reliable predictions of both methods are correct, 77% of all promiscuous proteins in the worm (74 + 13 + 21 = 108 of 140) have unstructured regions. While these data do not suffice to identify hubs from sequence, our results showed that methods such as NORSnet and DISOPRED2 have some capability in the identification of unstructured regions that will adopt 3-D structures upon binding. While this finding was not new, our particular perspective was that the differences between DISOPRED2 and NORSnet resulted from the difference in the focus of the two. NORSnet focuses more on loopy regions than DISOPRED2, and it also identified more hub proteins. Similar results were obtained when we compared NORSnet and IUPred predictions on the same dataset. Again, IUPred identified the hub signal but much less clearly than did NORSnet (
While NORSnet has some ability to identify unstructured regions that are often involved in binding (
Alternatively, these long connecting loops might function as extremely flexible connecting linkers that facilitate the modules to adopt different orientations, thereby allowing the binding of different targets or binding similar targets in different fashion. Each of these alternatives could be at the heart of a different function. These two hypotheses may explain some of the regulatory characteristics of hub proteins.
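The hub overlap reported earlier in this section (74 proteins flagged by both methods, 13 only by DISOPRED2, 21 only by NORSnet, out of 140 promiscuous worm proteins) can be checked with a short worked calculation. The function name and dictionary keys are hypothetical; the counts are those given in the text, and the per-method totals are simple sums derived from them.

```python
def hub_summary(both, disopred2_only, norsnet_only, total):
    """Summarize how many hub proteins each method flags as unstructured."""
    union = both + disopred2_only + norsnet_only
    return {
        "either_method": union,
        "norsnet_total": both + norsnet_only,
        "disopred2_total": both + disopred2_only,
        "fraction_either": union / total,
        "fraction_norsnet": (both + norsnet_only) / total,
        "fraction_disopred2": (both + disopred2_only) / total,
    }


if __name__ == "__main__":
    s = hub_summary(both=74, disopred2_only=13, norsnet_only=21, total=140)
    print(f"either method: {s['either_method']} of 140 = {s['fraction_either']:.0%}")
    print(f"NORSnet:       {s['norsnet_total']} of 140 = {s['fraction_norsnet']:.0%}")
    print(f"DISOPRED2:     {s['disopred2_total']} of 140 = {s['fraction_disopred2']:.0%}")
```

The union gives 108 of 140 (77%), matching the figure quoted above; taken individually, NORSnet flags 95 and DISOPRED2 flags 87 of the 140 proteins.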
Most likely, unstructured regions and NORS regions occupy slightly different parts in sequence space (
One goal of structural genomics projects is to contribute considerably to the increase in the fraction of proteins with known 3-D structures. To achieve this goal, 3-D structures are experimentally determined for representatives of as many large families as possible [
Membrane proteins and proteins with unstructured regions are the two major types of proteins that are not only avoided by conventional small-scale structural biology but also by structural genomics efforts. Due to the fact that proteins with unstructured regions are much more abundant in eukaryotic organisms, consortia that focus on eukaryotes, such as NESG and CESG, have to carefully avoid such difficult targets. In the last six years, thousands of proteins have been cloned, expressed, and purified by NESG. Although the NESG target selection filtered out many domains with strong predictions for the presence of unstructured regions [
We applied NORSnet to 11,587 putative NESG targets that had already passed our previous and cruder NORS filter (
NORSnet Predictions for Structural Genomics Targets
The intricate details of protein 3-D structures are crucial for their functional role; i.e., structure determines function. Natively unstructured regions do not necessarily contradict this structure–function paradigm. Nevertheless, a variety of proteins require unstructured regions in order to function as domain linkers, filling material, and detergents. For other proteins with unstructured regions, changes in the environment (e.g., pH change, presence of target) or posttranslational modifications can trigger the formation of a regular 3-D structure that will then again determine function. In an evolutionary sense, the required changes/modifications constitute an integral part of the function and are therefore likely to be somehow encoded in the sequence of such proteins. The unusual aspect is that the key structural feature of these proteins is to keep regions natively unstructured or adaptable. The experimental and in silico identification and characterization of proteins with unstructured regions is evolving into an increasingly important challenge for structural biology. In facing this challenge, it becomes increasingly clear that the term “unstructured” describes a rather mixed bag of phenomena from regions that alter between different conformations to those that remain molten globule-like, and from regions that adopt regular helices and strands to those that remain intrinsically loopy.
Here, we present NORSnet, a neural network–based method that revisited the task of identifying unstructured regions from a different angle than that taken by other methods. It focuses on the distinction between unstructured and well-structured loops. The success in this undertaking confirmed our initial hypothesis, namely that short unstructured loops resemble very long unstructured loops (NORS regions). Our application of machine learning was rather unconventional in two ways. First, we trained on positives (predicted NORS) that did not contain the feature we sought to predict (short unstructured loops) and on negatives (all regions in the PDB) that contained regions that we wanted the method to predict as positives; i.e., we implicitly hoped that our development would fail for many cases. Second, we did not optimize any parameters on the dataset used for assessing the performance of our method. Due to the difference in our approach, NORSnet complemented existing methods that optimize on previous datasets of unstructured regions. Consequently, NORSnet will enable the application of additional filters for structural genomics. Last, through a comparison between our new and other prediction methods, we confirmed the importance of unstructured regions for protein–protein interactions. Moreover, we specifically touched on the importance of unstructured loops for network complexity.
We created our dataset of residues in natively unstructured regions (“positives”) in the following way. We grouped all proteins from 62 entirely sequenced proteomes into domain-like families using CHOP and CLUP [
To optimize the parameters of the method, we trained the network on 90% of the sequences and tested it on the remaining 10%. Note that these data were only used for the development of the method. We never reported the performance of the method on these data. The datasets on which we
After optimizing our method to predict NORS regions (as described below in the prediction method section), we assessed NORSnet performance on different sets without any further optimization. In the first benchmark, we used DisProt proteins that have unstructured regions longer than 30 residues as positives and a sequence-unique subset of 173 PDB X-ray structures as negatives. The latter subset was taken from the EVA server [
To further validate the method, we tested it on a set of proteins from the NESG consortium. The positive set included 30 proteins that were identified to have unstructured regions (“NESG unfolded”), and the negative set included 170 recently determined protein structures. Both sets were identified as such by the NESG consortium. The definition of “unstructured region” was as follows: (1) HSQC (heteronuclear single quantum correlation) spectra showed high signal to noise and very low dispersion; and (2) hetNOE (heteronuclear nuclear Overhauser effect) data were cleanly negative (G. T. Montelione, personal communication). Using this set helped to remove two types of bias present in DisProt and similar databases. (1) Structure determination method: the negative set was almost equally divided between X-ray and NMR structures. (2) Length bias: while sequences selected for NMR structure determination are usually shorter than those selected for X-ray determination, the NESG consortium reduced this artifact by using both methods in parallel to determine the structures of the same sequences. Thus, the length distribution of the NESG unfolded set is similar to that of the folded set, in contrast to the DisProt database, which contains some much longer sequences (see
For the large-scale predictions of proteins that are involved in protein–protein interactions, we used the IntAct database (
We used a standard feed-forward neural network described elsewhere in more detail [
We downloaded the DISOPRED2 package from
The protein is predicted to have several long loops (residues 24–42, 89–125, and 130–171). Note that the location of the loops is correlated with high scores predicted by NORSnet and DISOPRED2, which use this information.
(7.1 MB TIF)
Although the N-terminal domain of DFF45 is unstructured, PSIPRED predicts secondary structure elements within that region.
(5.0 MB TIF)
Similarly to
(7.0 MB TIF)
The domain boundaries of 524 multidomain proteins were marked in a procedure described in Liu and Rost [
(5.2 MB TIF)
(624 KB DOC)
(A) NESG id refers to identifiers given by the NESG consortium.
(B) Disorder signal refers to the strength of the signal from NMR experiments indicating that a protein is unstructured.
(63.5 KB DOC)
(74.5 KB DOC)
The Protein Data Bank (
The DisProt (
Thanks to Dariusz Przybylski and Guy Yachdav (Columbia University, United States) for providing preliminary information and programs, to Andrew Kernytsky and Marco Punta (Columbia University) for valuable discussions, and to Kazimierz Wrzeszczynski and Henry Bigelow (Columbia University) for helpful comments on the manuscript. Thanks to Jonathan Ward and David Jones (University College London, United Kingdom) for making DISOPRED2 and PSIPRED available, to Jaime Prilusky and Joel Sussman (Weizmann Institute, Rehovot, Israel) for making FoldIndex available, and to Zsuzsanna Dosztányi and István Simon (Institute of Enzymology, Hungary) for making IUPred available. Particular thanks to Guy Montelione and colleagues (Rutgers University, United States) for creating and providing the NESG datasets. Thanks to two anonymous reviewers and to the editor, Phil Bourne, for their constructive criticism. Last but not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases, in particular to Keith Dunker and his colleagues for the maintenance of DisProt. The work of BR was also supported partially by grant U54-GM072980 from the US National Institutes of Health.
DFF, DNA fragmentation factor
NMR, nuclear magnetic resonance spectroscopy
NORS, no regular secondary structure
NTD, N-terminal domain
PDB, Protein Data Bank
PSI, Protein Structure Initiative