JG, MG, and PEB conceived and designed the experiments. JG performed the experiments. JG analyzed the data. JG contributed reagents/materials/analysis tools. JG, MG, and PEB wrote the paper.
Philip E. Bourne is the Editor-in-Chief of
The Wiggle series are support vector machine–based predictors that identify regions of functional flexibility using only protein sequence information. Functionally flexible regions are defined as regions that can adopt different conformational states and are assumed to be necessary for bioactivity. Many advances have been made in understanding the relationship between protein sequence and structure. This work contributes to those efforts by making strides to understand the relationship between protein sequence and flexibility. A coarse-grained protein dynamic modeling approach was used to generate the dataset required for support vector machine training. We define our regions of interest based on the participation of residues in correlated large-scale fluctuations. Even with this structure-based approach to computationally define regions of functional flexibility, predictors successfully extract sequence-flexibility relationships that have been experimentally confirmed to be functionally important. Thus, a sequence-based tool to identify flexible regions important for protein function has been created. The ability to identify functional flexibility using a sequence-based approach complements structure-based definitions and will be especially useful for the large majority of proteins with unknown structures. The methodology offers promise to identify structural genomics targets amenable to crystallization and the possibility to engineer more flexible or rigid regions within proteins to modify their bioactivity.
Proteins are not static entities in biology and are constantly changing their shape and form to perform their necessary biological roles. While we are intuitively aware of their constantly changing nature, we have little understanding of how their flexibility is encoded in the protein sequence. To address this knowledge gap, predictors were created to identify sequence patterns that dictate local regions to be flexible and serve a functional purpose. By combining protein dynamic modeling and machine learning techniques, the Wiggle predictor series were able to generalize the sequence-flexibility relationship for all proteins. With these predictors we are able to identify flexible regions of functional importance such as hinges, recognition loops, and catalytic loops using only sequence information. This work has important contributions to our understanding of the sequence-flexibility relationship and paves the road to identifying local sequence modulations that impact protein function without necessarily changing the structure.
Protein structures are not rigid bodies, as suggested by time-independent solid-state crystal structures. Rather, proteins are selected by nature to balance between stability and flexibility in order to traverse the funnels of the protein energy landscape that characterize the conformational states needed to achieve a specific bioactivity. In part because of the way protein structures are traditionally represented and visualized in the crystallographic structure, the dynamics of protein motion is poorly conveyed and often neglected as the protein is treated as a static entity, although intuitively we know otherwise. Furthermore, protein sequence-structure relationships have been heavily focused on creating the most stable structure that may not necessarily be optimal for the execution or regulation of protein function. If the sequence is deterministic of the adopted protein fold, then the flexibility and dynamics of proteins should also be encoded by the sequence. Support for this notion comes from the previous demonstration that large amplitude fluctuations are mostly related to the overall protein shape [
The flexibility of proteins is a necessary property to allow for conformational changes observed in allosteric interactions. The classic definition of allostery is the regulation of enzymes through the binding of effector molecules. This definition is now expanded to define allostery as the consequence of the redistribution of conformational states in the protein in response to a given external stimulus [
The Cooper-Dryden model of allostery is a theory that addresses the contribution of entropy to the allosteric free energy. In extreme cases, this theory suggests that allostery can be achieved in the absence of structural change by simply shifting the internal vibrational modes when reacting to an external stimulus such as ligand binding [
Entropy compensation has been observed using both computational and experimental approaches in many different proteins. Here we consider five examples to make the point. First, molecular dynamics (MD) simulations of lysozyme show differences between the dynamics of substrate-bound and free states. When lysozyme is in complex with the substrate, a distant loop (residues 67 to 88) increases in fluctuation to compensate for the decreasing fluctuation observed for the substrate-contacting loop (residues 101 to 107) [
In each of these cases, local regions of protein structure serve to accommodate the redistribution of vibrational modes and provide an energy reserve of allosteric free energy as proposed by the Cooper-Dryden model. The relaxation and tensing of regions of local structure is a transition from an ordered to disordered state and vice versa. Such local regions include hinges, recognition loops, and certain catalytic loops whose vibrational states change in the presence of an external stimulus such as substrate binding. While structurally dissimilar, hinges, recognition loops, and catalytic loops all exhibit characteristic fluctuations that differ from the mean fluctuation. Hinges are relatively immobile at the hinge point compared to surrounding fluctuations about the hinge, whereas recognition loops and, in certain examples, catalytic loops show minimal fluctuations at the extremities and maximal fluctuations at the center of the loop. We attempt to identify these regions based on the scale and cooperativeness of fluctuations that often define protein function and refer to them as
In this paper, we begin with a structure-based definition of an FFR to obtain our training dataset and describe prediction tools created as a result to identify these regions using only protein sequence information. With the growth of protein structures fueled by structural genomics [
The long-term goal of work such as this is to provide a generalized relationship between sequence and FF for all proteins. An immediate benefit would be in facilitating the structure solution process such that proteins less tractable for crystallization could be identified. Further, by our definition, FFRs border on forming an ordered structure; therefore, if such regions can be identified, it may be possible to introduce a few mutations to stabilize local regions that are not located on the ends of the polypeptide chain. This strategy has been utilized to successfully create a soluble analog of erythropoietin [
We define FFRs to have the property of coordinated participation in large amplitude fluctuations that are different from the mean vibrational fluctuation of the protein. The Gaussian network model (GNM) [
There are two reasons for using protein dynamic modeling results instead of experimental temperature factors to define our target regions. First, by using protein dynamic modeling simulation, we are able to investigate protein flexibility with the added dimensionality of having functional importance. Second, by using modes of motion to define our target regions, we are able to focus specifically on large-amplitude fluctuations without including contributions from higher frequency fluctuations. These two features are the distinguishing qualities that set our predictors apart from other disorder predictors. The advantages of using this approach will be highlighted in the subsequent discussion and reflected in comparisons made to other disorder predictors.
To identify FFRs, we focus on the first two vibrational modes of protein fluctuation because these modes have been shown to sufficiently describe important contributions to global fluctuations necessary for protein function [
Correlation values were used to weight mode information to create an FF score and empirically define a threshold to objectively identify FFRs (see
The FF score is used for definition purposes only. With this definition procedure we are able to obtain an objectively defined dataset needed for SVM training. The dominant motions in the lowest frequency modes correspond to rigid domain motions [
The FF score was first tested on HIV protease. While the recognition loop (residues 36 to 42) is identified without incorporating correlated movement information to weight normalized GNM fluctuations, the flap region (residues 46 to 56) important for dimerization was not identified because fluctuation is suppressed in the dimerized state (
(A) Comparison of temperature factor (dashed line) and weighted average of the two slowest modes (solid line) obtained with GNM. The HIV protease is modeled as a dimer; however, the plot shows results for a single chain.
(B) Gradient plot ranging from correlated (red) to anticorrelated (blue) movement for each residue in the dimer.
(C) Comparison of normalized scores for unweighted (dashed line) and correlation-weighted (solid line) modes for a single chain. Correlation-weighted modes define the FF score. Regions are identified as FFR when values exceed thresholds (red lines) greater than 1.5 and less than −1.5. The flap region (residues 46 to 56) exceeds the threshold after including correlated movement information (solid line).
(D) Structural mapping of FF score with gradient from negative (blue) to positive (red), (PDB ID: 1HIV).
Improvements in defining FFRs using the FF score were also observed for calmodulin and bovine pancreatic trypsin inhibitor (BPTI) (
Comparison of unweighted (dashed line) and weighted (solid line) FF scores for BPTI (top, PDB ID: 5PTI) and calmodulin (bottom, PDB ID: 1CLL). FF scores are mapped with the same gradient coloring from negative (blue) to positive (red) as the scale shown in
The binding affinity of BPTI is influenced by mutations in the active loops (residues 11 to 19 and 35 to 42) that are inserted into the active site of the proteolytic enzymes. Mutations Y35G [
Based on the FF score, each residue in a nonredundant training set was classified as FFR or non-FFR. Residues were separated into a binary classification with FFRs assigned a value of 1 and non-FFRs were assigned a value of −1. Examining the distribution of residues in the two classes shows that an FFR averages 9 ± 11 residues in length and comprises about 20% of all residues. Residues identified as hinges comprise about 0.75% of all residues in the training set. The average maximum length of an FFR for each protein increases with increasing protein length (
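The binary labeling described here can be illustrated with a minimal sketch. It assumes that normalized FF scores are already available as a list of floats and uses the thresholds of 1.5 and −1.5 quoted for the HIV protease example. The function names `label_residues` and `segment_lengths` are hypothetical and are not taken from the authors' implementation.

```python
from typing import List


def label_residues(ff_scores: List[float], threshold: float = 1.5) -> List[int]:
    """Return 1 (FFR) where the normalized FF score lies beyond +/- threshold, else -1."""
    return [1 if abs(score) > threshold else -1 for score in ff_scores]


def segment_lengths(labels: List[int]) -> List[int]:
    """Return the length of every contiguous run of residues labeled 1."""
    lengths: List[int] = []
    run = 0
    for label in labels:
        if label == 1:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:  # close a run that reaches the C-terminus
        lengths.append(run)
    return lengths
```

Applied to every chain in a training set, the output of `segment_lengths` would give the kind of FFR length distribution summarized in this paragraph.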
(A) The average of all maximal FFR lengths plotted against overall protein length.
(B) The number of different sequence patterns observed for a given window size. Shown are the pattern counts for regions classified as FFR (dashed line), non-FFR (thin line), and irrespective of classification (thick line). FFR regions sample a smaller sequence space compared to non-FFR regions. Patterns overlapping boundaries of FFR and non-FFR are excluded from these counts.
We examined the classification preference for each amino acid and secondary structure type using the same assignment values (1 and −1) (
FFR Classification Preference for Secondary Structures
FFR Classification Preference for Amino Acids
Window scanning for particular patterns reveals that FFRs occupy a smaller sequence space than their non-FFR counterparts (
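A window scan of this kind might look like the following sketch. It assumes per-residue labels of 1 (FFR) and −1 (non-FFR) and drops windows that straddle a class boundary from the per-class counts, as the figure legend describes. Whether such windows were also dropped from the overall count is not stated, so keeping them there is an assumption. The name `count_patterns` is hypothetical.

```python
from typing import Dict, List, Set


def count_patterns(seq: str, labels: List[int], window: int) -> Dict[str, int]:
    """Count distinct sequence patterns of a given window size per class."""
    ffr: Set[str] = set()
    non_ffr: Set[str] = set()
    overall: Set[str] = set()
    for i in range(len(seq) - window + 1):
        pattern = seq[i:i + window]
        classes = set(labels[i:i + window])
        overall.add(pattern)  # assumption: every window counts toward the overall total
        if classes == {1}:
            ffr.add(pattern)
        elif classes == {-1}:
            non_ffr.add(pattern)
        # windows with mixed labels overlap a boundary and join neither class
    return {"ffr": len(ffr), "non_ffr": len(non_ffr), "all": len(overall)}
```

Running this for a range of window sizes over all training chains would produce curves comparable to those in the pattern-count figure.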
Certain tripeptide sequences can be overrepresented in FFRs when compared to non-FFRs. We attempt to identify these tripeptides by using a modified bootstrapping approach to calculate
Results from this analysis suggest that there are sequence patterns associated with these regions that may be detected using machine learning techniques and these findings have been instrumental in improving the prediction quality of our SVM-based predictors when incorporated. The rationale behind the modified bootstrapping was to identify tripeptide sequence patterns associated with FFRs and to use this information to help SVMs distinguish between FFRs and non-FFRs. This finding of context dependence supports previous work that has shown that the Flory isolated-pair hypothesis does not hold true [
While many successful structure predictors use multiple sequence alignments or position specific scoring matrices, we chose to use hidden Markov models (HMMs) because they additionally capture insertion and deletion probabilities that may occur within the sequence [
Exploring the performance of various SVM architectures has shown that a two-layered architecture yields the best-performing predictor to identify residues in FFRs. The first-layer SVM makes an initial classification based on sequence and evolutionary information contained in the HMM states. The second-layer SVM serves to smooth the prediction from the first-layer SVM and uses results obtained from the modified bootstrap analysis to make better predictions. Incorporating information regarding tripeptide classification preferences was instrumental to improving the performance of our final predictor despite having a weak statistical value. Compared to a predictor that does not include tripeptide classification preferences, the SVM showed an additional 5% increase in accuracy and precision and an additional 3% improvement in recall.
The predictive performance of the SVMs was found to be a function of protein length. High false-positive rates were observed for shorter proteins (
(A) Effect of sequence length on false-positive (thick line) and false-negative (thin line) error rates. Shorter sequences tend to have higher false-positive identification of FFRs when trained on a nonpartitioned dataset.
(B) Comparison of SVM prediction results trained on a nonpartitioned dataset (dashed lines) and a partitioned dataset containing proteins up to 200 residues (solid lines). Improvements were seen in both the false-positive (black) and -negative (red) rates.
(C) Comparison of SVM prediction results trained on a nonpartitioned dataset (dashed lines) and a partitioned dataset containing proteins larger than 200 residues (solid lines). Minor improvements were observed in false-positive (black) and -negative (red) rates.
To account for protein length, the original training set was partitioned into two sets: A, 760 proteins up to 200 residues in length; and B, 574 proteins longer than 200 residues. SVMs trained on the partitioned training sets both showed an improvement in performance (
Our final predictors, Wiggle and Wiggle200, use the radial basis kernel function in the first layer and a linear kernel in the second layer. Wiggle is the product of training on all proteins and Wiggle200 was trained on subset A containing proteins up to 200 residues. Since only minor improvements were observed for the predictor trained on the subset containing larger proteins, we use Wiggle to conduct our predictions for those proteins. In the following discussion, we will first revisit the dependency of the predictors on protein size in regard to domain boundary detection. Then we will discuss the performance of the predictors on three examples with experimentally verified FFRs.
Flexible linkers between domains, sometimes acting as a hinge, are examples of FFRs and we evaluate the performance of Wiggle and Wiggle200 in the detection of these regions. We use a comprehensive domain boundary benchmark set (BENCH) that was curated to reflect the consensus of experts (CATH, SCOP, and authors of the protein structures) (T. Holland, S. Veretnik, I. N. Shindyalov, and P. E. Bourne, unpublished data). Because the boundary is defined between two residue positions, we expand the definition up to a window size of 15 residues, with the boundary in the center, to evaluate the performance of the predictors. We also partitioned BENCH based on protein size into BENCHA (200 residues or fewer) and BENCHB (more than 200 residues).
The general trend in predictor performance for Wiggle and Wiggle200 observed for all datasets (BENCH, BENCHA, BENCHB) is that precision increases with the size of domain boundary expansion, whereas recall increases up to window size 5 and begins to decline afterward (
Wiggle predictors were evaluated for domain boundary predictions on (A) a benchmark dataset containing domain boundary consensus between experts (BENCH), (B) a partitioned BENCH with proteins up to and including 200 residues (BENCHA), and (C) a partitioned BENCH with proteins longer than 200 residues (BENCHB). Definitions of domain boundaries were expanded up to a window size of 15 (win15) with the boundary in the center.
For BENCH, we find that Wiggle outperforms Wiggle200 in recall by +12.99%, with little improvement in precision (+0.31%) and a decrease in accuracy (−6.44%). Wiggle identifies domain boundaries in BENCH at an accuracy of 62.55% with a precision of 6% and recall of 54.15%. We are not surprised to see a poor precision value since both predictors will identify other flexible regions that are not linkers between domains. However, the results show that our predictors do identify linkers between domains, which may serve a functional purpose, for example as hinges.
For the partitioned benchmark sets (BENCHA and BENCHB), we find that Wiggle again outperforms Wiggle200 in domain boundary recall, by +14.34% and +12.51%, respectively. Again, Wiggle showed minor improvements in precision (BENCHA: +0.13%; BENCHB: +0.31%) and a slight decrease in accuracy (BENCHA: −7.68%; BENCHB: −6.19%) compared to Wiggle200. On BENCHA, Wiggle predicts boundaries with 59.08% accuracy, 9.39% precision, and 60.66% recall; on BENCHB, the corresponding values are 63.24%, 5.21%, and 51.85%. This indicates that Wiggle, trained on the entire training dataset, which includes larger multidomain proteins, has picked up sequence patterns associated with linker regions and is the better predictor for domain boundaries compared to Wiggle200.
Although the GNM provides a fast approach to identifying FFRs, there are limitations to the model. Dynamic modeling results are largely dependent on protein conformation, particularly that defined by bound and unbound conformations as discussed earlier for the observed higher false-positive error rate for smaller proteins. Therefore, the FF score does not always correctly define the regions of interest. We examined a few case studies where residues were largely misclassified by the FF score and compared the results to our SVM predictions. While it is ideal to have a precisely classified training dataset, we concluded that the classification made by the FF score provides a sufficient training set for the SVM to detect correct signals in sequence patterns for FFRs. In short, SVMs are powerful enough to generalize the relationship between protein sequence and FFRs as illustrated in the following examples.
The arc repressor is stable as a dimer, unfolded as a monomer [
Structurally, several flexible regions having important roles for protein function have been detected in the arc repressor using various experimental techniques. Despite being highly disordered in solution, according to an NMR structure determination [
Wiggle identifies residues 5 to 8, 23 to 35, 38, and 40 to 53 as FFRs, and Wiggle200 identifies residues 5, 23 to 29, and 43 to 53. The FF score only identifies residues 45 to 53 located at the C-terminus (
(A) The dimer conformation of the Arc repressor was used to model global fluctuation. Using the FFR definition, the plot for a single chain is shown on the left with structural mapping of values onto a dimer on the right. FF scores are mapped with the gradient code from negative (blue) to positive (red). Only the C-terminal tail exceeds threshold lines (red) and is defined as an FFR while the rest of the protein is not. (PDB ID: 1BAZ)
(B) The hinge between the two helices is identified by predictors as well as N-terminal residues important for DNA recognition. Predictions from Wiggle (solid line) are mapped in green on the structure and Wiggle200 (dashed line) are mapped in orange.
PVUII endonucleases (156 amino acids) are homodimerizing proteins that catalyze highly specific DNA cleavage. No regions of flexibility were identified with the FF score (
(A) Plot of FF scores and mapping of values in a gradient code from negative (blue) to positive (red) onto the structure of PVUII endonuclease in complex with DNA (yellow). The following structural features are labeled: (1) minor groove binding loop, (2) catalytic loop, (3) potential hinge for DNA binding, (4) tyrosine 94 for Mg++ ion coordination, and (5) major groove binding loop. (PDB ID: 3PVI).
(B) Wiggle predictions (solid line) are mapped in green and Wiggle200 predictions (dashed line) are mapped in orange onto the structure.
Y94 coordinates Mg++ ions needed for endonuclease activity in this restriction enzyme [
FFRs identified in erythropoietin contain examples where local flexible regions are stabilized by mutations or glycosylations, both of which are sequence modifications that result in a shift from a disordered to ordered state. No regions of flexibility were identified using the FF score (
(A) FF score plotted against residue number with thresholds shown in red. Erythropoietin is modeled by the GNM in the complexed form with the corresponding receptor (not shown). All residues have below mean fluctuation (colored blue), but none of the residues are defined as FFRs since they do not exceed the definition threshold. The four glycosylation sites (S126 and lysine substituted K24, K38, and K83) along with G151 are labeled. (PDB ID: 1EER)
(B) FFRs correspond to positive values as predicted by Wiggle (solid line) and Wiggle200 (dashed line), which are structurally mapped onto erythropoietin (green and orange, respectively). Not all loops are identified by the predictors to be functionally flexible, thus showing that discrimination is not based on structural features.
Overlaps were found between prediction results (Wiggle: residues 1, 16 to 40, 85 to 89, 113 to 121, 123, 124, 149 to 155, and 160 to 166; Wiggle200: residues 19 to 40, 50 to 57, 86 to 90, 92, 111 to 124, 126 to 128, 139, 150 to 152, 154, 155, 157, and 162 to 166) and correspond to mutations introduced for the creation of a soluble analog [
Glycosylation of erythropoietin is necessary for its biosynthesis and bioactivity and plays a critical role in its stability [
G151 plays an important structural role by introducing a kink in the αD helix. This enables K152 to come in contact with residues in the protein core to form one of the two interaction sites for erythropoietin receptors. Alanine replacement in either position 151 or 152 resulted in a substantial loss of bioactivity [
Mutagenesis performed to identify erythropoietin receptor binding sites revealed four regions (residues 11 to 15, 44 to 51, 100 to 108, and 147 to 151) important for the activation of receptor signaling [
Several protein disorder predictors were compared to Wiggle and Wiggle200 predictions (
Comparison of prediction results from Wiggle (red) to various disorder predictors (blue).
Some overlaps are expected with disorder predictions because FFRs may be disordered depending on the conformational state of the protein. Otherwise, we expect little correlation since disorder predictors generally aim to identify structural disorder and regions with a low propensity to form an ordered unit. Potential functional roles were not considered in their design, although these regions are suggested to be important for protein-protein recognition after examining positively classified sequences [
For the arc repressor (1BAZ), disorder predictors positively classified the terminal ends, although some failed to identify them altogether. The hinge region connecting the two helices is not fully identified by most disorder predictors. While the Wiggle predictors did not identify all residues involved in recognition at the major groove for PVUII endonuclease (3PVI), they identified the minor groove recognition loop, catalytic loop, and magnesium ion coordinating residues. Most current disorder-predicting tools failed to identify these regions. Disorder predictors that successfully identified at least one of these regions are based on an index separating hydrophobicity and net charge (FoldIndex and GlobPlot) or the use of homology information (RONN).
Most disorder predictors failed to identify all glycosylation sites on erythropoietin (1EER) with the exception of DisEMBL, having the most overlap in predictions with Wiggle. The structure of erythropoietin is entirely helical, and DisEMBL has been designed to predict coils with high B factors. The glycine kink was also missed by most disorder predictors except for DisEMBL and FoldIndex.
We also compare the performance of predictors in identifying FFRs as defined by the FF score (
Comparison of Predictors Using TEST200 and TESTALL
We report the performance of Wiggle on TESTALL and Wiggle200 on TEST200. Wiggle predictors outperformed the other disorder predictors in overall performance for both test sets when comparing precision and recall values (
The motivation for this work is to advance our understanding of protein sequence and FF through easily applied in silico methods. Protein fold and disorder properties are encoded in the amino acid sequence. We believe that functionally important protein flexibility is also encoded in the primary sequence and have successfully created tools to identify these regions. We created two predictors: one specialized for proteins up to 200 residues in length and another for all proteins regardless of size. Between the two predictors, we correctly identified flexible regions of functional importance in several test cases where structure-based classification had difficulties. Our targets include hinges, recognition loops, and localized regions that may serve to accommodate entropy dislocation necessary for allostery.
We focused on regional motion important for protein function based on residue participation in correlated low-frequency fluctuations that correspond to large global changes as modeled by the GNM. Our predictors differ from other predictors by including an additional functional consideration in our targets used for training our SVMs. Secondary structure predictors are trained against well-ordered regions of proteins to identify regular secondary structural elements and disorder predictors have been trained using various definitions that include regions missing electron density in X-ray structures or have high temperature factors. Both focus on a subset of sequence space important for structural features but do not address patterns involved in modulated protein flexibility that switch between ordered and disordered states.
With the Wiggle predictors, we were able to show detection of domain boundary and experimentally confirmed FFR in specific examples. Comparison to disorder predictors shows that, while there are expected overlaps, different regions are identified. The difference between predictors is that Wiggle predictors are trained to select for residues participating in the two largest modes of global motion, whereas disorder predictors were trained on the propensity to form ordered structures or lack thereof.
While false prediction error rates are approximately 30%, this may largely be attributed to the difficulties of defining our regions of interest with misclassifications occurring in both directions when using the FF score. SVMs trained on partitioned datasets showed improved performance, suggesting that the characteristics of FFRs are related to protein size. The Wiggle predictors are especially useful for proteins where no structural data are available. Localizing regions of FF in the absence of structural information will help identify mutational hot spots that may modulate bioactivity and these regions can be targeted in protein engineering experiments. The identification of FFRs by sequence-based methods complements and reduces the limitations in structure-based definitions of flexible regions.
A nonredundant training set of protein chains with percent sequence identity of less than or equal to 10%, resolution better than 2.0 Å, and an R-factor less than 0.30 was retrieved from the PDB [
The final training set contained 1,277 sequences with 56.6% of the chains existing in the monomeric state. Multiple copies of a protein found in the asymmetric unit were eliminated. Complexes were manually inspected using the protein quaternary structure file server (PQS) [
SAM-2tk [
The GNM [
The equilibrium-correlated fluctuations between two sites can be obtained by finding the inverse of the Kirchhoff matrix and is represented as:
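For reference, the standard GNM expression for this quantity is given here. The symbols are assumed from the usual GNM convention rather than taken from this text: $\Gamma$ is the Kirchhoff matrix, $\gamma$ the uniform spring constant, $k_B$ the Boltzmann constant, $T$ the absolute temperature, and $\lambda_k$, $\mathbf{u}_k$ the eigenvalues and eigenvectors of $\Gamma$. Because $\Gamma$ has one zero eigenvalue, the inverse is understood as a pseudo-inverse taken over the nonzero modes.

```latex
\langle \Delta \mathbf{R}_i \cdot \Delta \mathbf{R}_j \rangle
  = \frac{3 k_B T}{\gamma} \left[ \Gamma^{-1} \right]_{ij},
\qquad
\Gamma^{-1} = \sum_{k=2}^{N} \lambda_k^{-1}\, \mathbf{u}_k \mathbf{u}_k^{T}
```

In this decomposition the slowest modes carry the smallest nonzero $\lambda_k$ and therefore contribute the largest amplitudes, which is consistent with restricting attention to the first two modes.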
Cross-correlated fluctuations between residues
Participation in correlated movements was used to define flexible regions that are functionally important. Readers are referred to the original papers for details.
Operationally, FFRs are defined using normalized FF scores. For each residue
FF scores are normalized for each protein after removing outliers using a median-based approach [
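The exact median-based procedure is given in the cited reference and is not reproduced in this text. As an illustration only, the sketch below assumes one common choice: the modified z-score based on the median absolute deviation, with a cutoff of 3.5, followed by z-score normalization using the mean and standard deviation of the retained values. The function name `normalize_ff` and the cutoff are assumptions.

```python
from statistics import mean, median, pstdev
from typing import List


def normalize_ff(raw: List[float], cutoff: float = 3.5) -> List[float]:
    """Z-score raw FF scores using statistics computed without median-based outliers."""
    med = median(raw)
    mad = median([abs(x - med) for x in raw])  # median absolute deviation
    if mad == 0:
        kept = list(raw)  # no spread around the median: keep everything
    else:
        # modified z-score; 0.6745 rescales the MAD to match a normal SD
        kept = [x for x in raw if 0.6745 * abs(x - med) / mad <= cutoff]
    mu, sigma = mean(kept), pstdev(kept)
    if sigma == 0:
        return [0.0 for _ in raw]
    # outliers are excluded from the statistics but still receive a score
    return [(x - mu) / sigma for x in raw]
```

With this choice, a single extreme value does not inflate the standard deviation and so does not compress the scores of the remaining residues toward zero.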
FFRs are defined to contain amino acids with
A modified bootstrap approach was used to identify sequence preferences for FFRs defined by the FF score. The aim of this analysis is to use these findings as additional input features for SVM-based classification. Protein sequences in the dataset were window scanned to pool triplets found in the training set. These pooled triplets were analyzed to identify sequence pattern distributions most correlated with FFR and non-FFR classifications. Two null models were created, one for FFRs and another for non-FFRs, by randomly selecting from the pooled triplets with replacement. Samples were drawn to be the same size as observed for FFR and non-FFR classes.
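The resampling step can be sketched as follows. This is not the authors' code: the function names are hypothetical, skipping windows with mixed labels is an assumption, and summarizing the null distribution by its mean and standard deviation is only one possible choice of statistic.

```python
import random
from statistics import mean, pstdev
from typing import List, Tuple


def triplets(seq: str, labels: List[int]) -> List[Tuple[str, int]]:
    """Window-scan a sequence into (tripeptide, class) pairs.

    Assumption: windows whose three residues do not share one label are skipped.
    """
    pairs: List[Tuple[str, int]] = []
    for i in range(len(seq) - 2):
        if len(set(labels[i:i + 3])) == 1:
            pairs.append((seq[i:i + 3], labels[i]))
    return pairs


def null_counts(pool: List[str], sample_size: int, target: str,
                n_resamples: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Mean and SD of the count of `target` over resamples drawn with replacement."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_resamples):
        sample = rng.choices(pool, k=sample_size)  # sampling with replacement
        counts.append(sample.count(target))
    return mean(counts), pstdev(counts)
```

To build the FFR null model, `sample_size` would be set to the number of triplets observed in the FFR class, and likewise for the non-FFR class. The observed count of a tripeptide in each class can then be compared against the corresponding null distribution.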
All training schemes were performed with 5-fold cross-validation using SVM
The predictor architecture for both Wiggle and Wiggle200 contains two layers. Input features for the first layer SVM include the nine HMM transition states and 20 match states. In HMM models, the match state probabilities give the probability of observing an amino acid at a particular position. The transition state probability is the probability of changing from one state (deletion, insertion, or match) to another from the previous state. For a window size of 9, a total of 261 (9 × 29) input features were used for each residue. Values are set to 0 when the window extends beyond terminal ends.
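The input encoding can be sketched as below, assuming the HMM profile is available as one list of 29 probabilities per residue (nine transition values followed by 20 match values). Centering the window on the residue being classified is an assumption, and the name `window_features` is hypothetical.

```python
from typing import List

N_PER_POSITION = 29  # 9 transition + 20 match probabilities per residue
WINDOW = 9


def window_features(profile: List[List[float]], center: int,
                    window: int = WINDOW) -> List[float]:
    """Concatenate per-residue HMM values over a window centered on `center`."""
    half = window // 2
    features: List[float] = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            features.extend(profile[pos])
        else:
            # the window extends beyond a terminus: pad with zeros
            features.extend([0.0] * N_PER_POSITION)
    return features
```

For the default window of 9, every residue yields a vector of 9 × 29 = 261 values regardless of its distance from the chain ends.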
The prediction results from this first-layer SVM are then included along with calculated
With this two-layer architecture and optimized parameters, two different predictors were developed defined by their training sets. Wiggle was trained on the entire training set, while Wiggle200 is a more specialized predictor trained on proteins up to 200 amino acids in length.
Wiggle prediction results were compared to a benchmark dataset (BENCH) reflecting the consensus of domain boundaries among CATH, SCOP, and authors of the three-dimensional structures (T. Holland, S. Veretnik, I. N. Shindyalov, and P. E. Bourne, unpublished data).
This dataset contains 312 chains, of which 66% are multidomain proteins, covering 30 distinct architectures and 211 distinct topologies as defined by CATH.
The prediction performance was measured based on accuracy, precision, and recall values. Domain boundaries in the dataset were defined between two adjacent positions. We therefore investigated the performance of predictors for a variety of window sizes, up to 15 residues, with the boundary resting in the middle of the expanse. Performance evaluations were also tested on a partitioned benchmark set based on protein sizes up to 200 residues (BENCHA) and longer (BENCHB).
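A minimal sketch of this evaluation is given below. It assumes 0-based residue indices, represents a boundary `b` as lying between residues `b - 1` and `b`, and uses the standard definitions of accuracy, precision, and recall. For odd window sizes the window cannot be exactly symmetric about a boundary, so placing the extra residue on the C-terminal side is an assumption. Function names are hypothetical.

```python
from typing import List, Tuple


def expand_boundaries(length: int, boundaries: List[int], window: int) -> List[int]:
    """Label residues inside a window around each boundary as 1, all others as -1."""
    labels = [-1] * length
    for b in boundaries:
        start = b - window // 2
        for i in range(start, start + window):
            if 0 <= i < length:
                labels[i] = 1
    return labels


def scores(truth: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Return (accuracy, precision, recall) for labels coded as 1 and -1."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(truth, pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == -1)
    accuracy = (tp + tn) / len(truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

Because boundary residues are a small fraction of any chain, precision is expected to stay low whenever a predictor also flags flexible regions that are not domain linkers.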
To compare the residue classification of Wiggle predictors to that of different disorder predictors for the three specific protein comparisons, we set the VSL1 version of PONDR to predict with a 10% false-positive rate, and DisEMBL to predict hot coils, defined as coils with high B factors. For the remaining predictors, recommended defaults were used, with a window size of 9 when one was requested.
We also compare the performances of disorder predictors with two different test sets (TEST200 and TESTALL) containing randomly selected chains used during the training of Wiggle predictors. TEST200 contains 144 chains up to 200 residues and TESTALL contains 256 chains regardless of length. For disorder predictors, we used the same default values and settings as the specific case example comparisons with the exception of PONDR. The default predictor for PONDR (VLXT) was used to accommodate larger proteins in the test sets. Wiggle was used for TESTALL and Wiggle200 for TEST200.
The authors would like to thank Apostal Gramada and Ellen Kats for their critical review of this manuscript. We also thank the laboratory of Ivet Bahar for providing the source code for the GNM.
BPTI, bovine pancreatic trypsin inhibitor
FF, functional flexibility
FFR, functionally flexible region
GNM, Gaussian network model
HMM, hidden Markov model
MD, molecular dynamics
SVM, support vector machine
The reference cited in the text as (T. Holland, S. Veretnik, I. N. Shindyalov, and P. E. Bourne, unpublished data) is now in press:
Holland TA, Veretnik S, Shindyalov IN, Bourne PE (2006) Partitioning protein structures into domains: Why is it so difficult? J Mol Biol. In press.