Conceived and designed the experiments: ONAD JCM. Performed the experiments: ONAD. Analyzed the data: ONAD. Contributed reagents/materials/analysis tools: ONAD MDD. Wrote the paper: ONAD MDD JCM.
The authors have declared that no competing interests exist.
In allostery, a binding event at one site in a protein modulates the behavior of a distant site. Identifying residues that relay the signal between sites remains a challenge. We have developed predictive models using support-vector machines, a widely used machine-learning method. The training data set consisted of residues classified as either hotspots or non-hotspots based on experimental characterization of point mutations from a diverse set of allosteric proteins. Each residue had an associated set of calculated features. Two sets of features were used, one consisting of dynamical, structural, network, and informatic measures, and another of structural measures defined by Daily and Gray
Allostery is the process whereby a molecule binds to one site in a protein and alters the function of a distant site. This phenomenon is ubiquitous, as proteins frequently must adapt their behavior to changes in the cellular milieu. The mechanism(s) underlying allostery remains incompletely understood. In particular, predictive models are needed that distinguish amino-acid residues that are critical to allostery, or “hotspots”, from non-hotspots. Here we have used data-mining approaches to infer rules that distinguish hotspots from non-hotspots. Starting with a training data set of known hotspot and non-hotspot residues from a diverse set of allosteric proteins, we applied machine learning to these data to “learn” models, or sets of rules, for distinguishing hotspots from non-hotspots by inferring associations between the classification (hotspot or non-hotspot) and a set of calculated attributes. Many models that showed the highest predictive power on the training data also exhibited high accuracy and sensitivity when applied to an independent data set. Moreover, the pattern of predicted hotspots in the proteins we studied was consistent with known structure/function relationships and with previous work suggesting that a network of essential residues mediates the allosteric transition.
Allostery is the process whereby an effector molecule binds to one site of a protein and concomitantly modulates the function of a distant site. “Allostery” is derived from the Greek
Over the past 40 years, much has been added to our knowledge of allostery. In particular, the concept of allostery, originally characterized in multimeric proteins, has been extended to single-subunit proteins. In the early 1970s, shortly after the MWC and KNF models were expounded, Neet and coworkers
Since allostery relies on the communication of binding information from one site to another in a protein, much experimental work has been targeted at elucidating the network of coupled interactions among residues that mediate the allosteric transition. Di Cera and coworkers showed that, upon effector binding to an allosteric site in thrombin, a network of structural changes forms that connects the allosteric site to a distant site adjacent to the active site, inducing a key conformational change that enables the active site to bind substrate
In addition to these experimental studies, computational methods for elucidating the network(s) of coupled interactions among residues have been developed. Lockless and Ranganathan
Despite providing insight into allosteric regulation, some of these methods have drawbacks. Computational power constraints limit MD-based methods to small systems. While COREX provides significant insights, it uses a reduced model for the degrees of conformational freedom available to a residue, as each residue exists in either a folded or unfolded state. SCA's drawback is that evolutionary co-conservation of residues is not necessarily a property specific to allosterically coupled residues. Finally, the methods of Daily et al. and del Sol et al. rely on single static structures of a protein, and thus lack dynamical measures.
By contrast, a computationally inexpensive meta-method that incorporates a number of parameters putatively implicated in allostery may overcome the drawbacks of individual approaches. In this work, we seek to develop such a method that predicts “hotspot” residues important to allostery for large systems with high sensitivity and specificity. First, we assembled a dataset of residues that were classified as hotspots or non-hotspots (mutations known not to perturb allostery) based on the results of published mutagenesis experiments on allosteric proteins. Then, support vector models were trained to distinguish the known hotspots from the known non-hotspots in this dataset based on several calculated structural, dynamical, network, and informatic features. Support-vector machines are polynomial functions of the calculated features that separate the feature spaces of the hotspots and non-hotspots, thus discriminating between the two classes
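To make the classification step concrete, the sketch below trains a polynomial-kernel support-vector classifier on synthetic "residue feature" data. This is an illustration of the technique, not the authors' actual pipeline: the feature values, labels, and kernel parameters (`coef0=1.0`) are assumptions for the example.

```python
# Minimal sketch of training a polynomial-kernel SVM classifier to
# separate "hotspot" (1) from "non-hotspot" (0) residues.
# All data here are synthetic, for illustration only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# 40 residues x 3 hypothetical features (e.g., fluctuation, H-bond change, entropy)
X = rng.normal(size=(40, 3))
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)

# Degree-2 polynomial kernel, one of the two kernel degrees used in this work
clf = SVC(kernel="poly", degree=2, coef0=1.0)
clf.fit(X, y)
preds = clf.predict(X)
```

In practice each residue's feature vector would be built from the calculated structural, dynamical, network, and informatic measures described below.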
The training data set consisted of point mutants of allosteric enzymes, transcription factors, and signal transduction proteins (
Protein (functional class) | PDB of effector ligand-unbound (inactive state) | PDB of effector ligand-bound (active state)
CheY (signal transduction) | 3chy | 1fqw |
PurR repressor (transcription factor) | 1dbq | 1wet |
Tet repressor (transcription factor) | 2trt | 1qpi |
Hemoglobin (carrier protein/enzyme) | 4hhb | 1hho |
Phosphofructokinase (enzyme) | 6pfk | 4pfk |
Phosphoglycerate dehydrogenase (enzyme) | 1psd | 1yba
Fructose-1,6-bisphosphatase (enzyme) | 1eyj | 1eyi
Aspartate transcarbamoylase (enzyme) | 1rac | 1d09 |
RhoA (signal transduction) | 1ftn | 1a2b |
CDC-42 (signal transduction) | 1an0 | 1nf3 |
Glycogen phosphorylase (enzyme) | 1gpb | 7gpb
Glucokinase (enzyme) | 1v4t | 1v4s
Glutamate dehydrogenase (enzyme) | 1nr7 | 1hwz
lac repressor (transcription factor) | 1tlf | 1efa
Myosin II (motor protein/enzyme) | 1vom | 1fmw
Thrombin (enzyme) | 1sgi | 1sg8
Since the Michaelis constant,
The Cheng-Prusoff equation relates the
Substituting (6) for
By substituting (7) in (5), we thus are able to establish a relationship between coupling free energy and
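For reference, the standard form of the Cheng–Prusoff relation for a competitive inhibitor, which converts a measured IC50 into an inhibition constant, is:

```latex
K_i = \frac{\mathrm{IC}_{50}}{1 + [\mathrm{S}]/K_m}
```

where $[\mathrm{S}]$ is the substrate concentration and $K_m$ is the Michaelis constant.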
In the case of allosteric transcription factors, inducibility of the effector was measured using
In this study, care was taken not to include mutations in effector sites, as perturbations in allosteric properties resulting from such mutations could be attributed to altered binding free energy. Also, no mutations were included that completely abolished the protein's function, as such a case could be attributed to perturbed folding of the protein. Hence, our training set was chosen to represent residues that mediate coupling between sites using criteria that are reasonable proxies of energetic coupling. The training data set comprised 44 hotspots and 50 non-hotspots (See
Support-vector machine models for predicting allosteric hotspots were initially developed using two sets of features. Feature Set 1 (
Feature Set 1 | Abbreviation |
Deformation Energy of the inactive state | def-energ-i |
Mean Squared Fluctuation of the inactive state | msf-i |
Mean Squared Fluctuation of the active state | msf-a |
Difference in Mean Squared Fluctuation between inactive and active states | diff-msf |
Mutual Information in the inactive state | mut-info-i |
B-factor of the inactive state | bfac-i |
B-factor of the active state | bfac-a |
Difference in B-factor between the inactive and active states | diff-bfac |
No. Potential Hydrogen Bonds in the active state | hbond-a |
No. Potential Hydrogen Bonds in the inactive state | hbond-i |
Difference in No. of Potential Hyd. Bonds between the inactive and active states | diff-hbond |
Average Local Atomic Density in the inactive state | at-dens-i |
Average Local Atomic Density in the active state | at-dens-a |
Difference in Atomic Density between the inactive and active states | diff-at-dens |
Node degree in inactive state | node-deg-i |
Perturbation in Clustering Coefficient upon Node Removal in inactive state | pert-clust-coef-i |
Evolutionary Conservation | cons |
Local Structural Entropy | lse |
Feature Set 2 | Abbreviation |
Alpha-carbon Displacement | Ca-disp |
Side-Chain RMS Distance between inactive and active states | sc-rms |
Rotation of Alpha Carbon-Beta Carbon bond from the inactive to active state | sc-flip |
Difference in Phi Angle between inactive state and active states | dphi |
Difference in Psi Angle between inactive state and active states | dpsi |
Maximum of dphi and dpsi | maxdihed |
Difference in Chi1 Angle between inactive state and active states | dchi1 |
Difference in Chi2 Angle between inactive state and active states | dchi2 |
Maximum of dchi1 and dchi2 | maxdchi |
Fractional Change in Contact Environment relative to inactive state | fI |
Fractional Change in Contact Environment relative to active state | fA |
Maximum of fI and fA | fmax |
All-Atom Solvent-Accessible Surface Area in inactive state | asa1 |
All-Atom Solvent-Accessible Surface Area in active state | asa2 |
Average of asa1 and asa2 | asaavg |
Side-Chain Solvent-Accessible Surface Area in inactive state | asasc1 |
Side-Chain Solvent-Accessible Surface Area in active state | asasc2 |
Average of asasc1 and asasc2 | asascavg |
Backbone Solvent-Accessible Surface Area in inactive state | asabb1 |
Backbone Solvent-Accessible Surface Area in active state | asabb2 |
Average of asabb1 and asabb2 | asabbavg |
Both second- and third-degree polynomial kernels were used in the training. In the context of SVMs, the kernel is the following expression:
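A polynomial kernel of degree d has the standard form K(x, z) = (γ⟨x, z⟩ + c)^d. The sketch below evaluates it by hand (assuming γ = 1 and c = 1; the parameters actually used in this work are not specified here) and checks the result against scikit-learn's implementation:

```python
# Evaluate a degree-2 polynomial kernel manually and compare with
# scikit-learn's polynomial_kernel. gamma=1 and coef0=1 are assumptions
# for this illustration.
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

x = np.array([[1.0, 2.0, 3.0]])
z = np.array([[0.5, -1.0, 2.0]])
d = 2

manual = ((x @ z.T) + 1.0) ** d                    # (<x, z> + 1)^d
library = polynomial_kernel(x, z, degree=d, gamma=1.0, coef0=1.0)
```

Here ⟨x, z⟩ = 0.5 − 2 + 6 = 4.5, so both computations give (4.5 + 1)² = 30.25.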
Each feature/kernel degree combination tested in the training was evaluated for predictive performance. For each combination, a nine-fold cross-validation was performed, in which a model was trained on eight portions of the training data and tested on the ninth. Each portion consisted of one protein's hotspots and non-hotspots, except in cases where a protein contributed only hotspots or only non-hotspots to the data set; in those cases, the hotspots of a protein lacking non-hotspots were grouped with the non-hotspots of a protein lacking hotspots. Each feature/kernel degree combination thus yielded nine support-vector-machine models. This procedure guards against overtraining, or over-fitting, of the support-vector machine parameters, which results from training on all the data at once and yields inflated performance measures. Precision, recall, and F1 were calculated for each feature/kernel combination by pooling the true positives, false positives, true negatives, and false negatives from the nine folds (models). Each feature/kernel degree combination was then ranked by F1. The top 20 feature/kernel degree combinations by F1 using Feature Set 1 are given in
F1 | Precision | Recall | Feature Combination | Kernel Degree |
0.68 | 0.62 | 0.75 | def-energ-i, msf-i, diff-msf, at-dens-a, diff-at-dens, diff-bfac, lse | 3 |
0.68 | 0.58 | 0.82 | msf-i, msf-a, diff-hbond, bfac-a, node-deg-i, lse | 2 |
0.68 | 0.54 | 0.91 | msf-i, msf-a, diff-hbond | 3 |
0.67 | 0.63 | 0.73 | def-energ-i, msf-i, diff-msf, at-dens-i, at-dens-a, diff-at-dens, diff-bfac, lse | 3 |
0.67 | 0.61 | 0.75 | msf-a, diff-hbond, diff-at-dens, bfac-a, lse | 2 |
0.67 | 0.60 | 0.77 | msf-i, msf-a, mut-info-i, diff-hbond, diff-at-dens, bfac-a, lse | 2 |
0.67 | 0.57 | 0.82 | msf-i, diff-hbond, node-deg-i, lse | 3 |
0.67 | 0.57 | 0.82 | msf-i, msf-a, hbond-i, diff-hbond, bfac-a, lse | 2 |
0.67 | 0.57 | 0.82 | def-energ-i, msf-i, diff-hbond, lse | 3 |
0.67 | 0.62 | 0.73 | msf-a, diff-msf, diff-hbond, at-dens-a, diff-at-dens, bfac-a, lse | 2 |
0.67 | 0.56 | 0.82 | msf-i, msf-a, diff-hbond, diff-bfac, lse | 3 |
0.66 | 0.56 | 0.80 | def-energ-i, msf-i, diff-hbond, diff-at-dens, diff-bfac, lse | 3 |
0.66 | 0.56 | 0.80 | def-energ-i, msf-i, msf-a, diff-msf, diff-hbond, diff-at-dens, lse | 3 |
0.66 | 0.58 | 0.77 | msf-i, hbond-i, diff-hbond, node-deg-i, lse | 2 |
0.66 | 0.58 | 0.77 | def-energ-i, msf-i, diff-hbond, lse | 2 |
0.66 | 0.59 | 0.75 | def-energ-i, msf-a, diff-hbond, diff-at-dens, diff-bfac, lse | 3 |
0.66 | 0.60 | 0.73 | def-energ-i, msf-a, diff-hbond, diff-at-dens, bfac-a, node-deg-i, lse | 2 |
0.66 | 0.62 | 0.70 | def-energ-i, msf-a, diff-msf, diff-hbond, diff-at-dens, diff-bfac, node-deg-i, lse | 3 |
0.66 | 0.62 | 0.70 | def-energ-i, msf-i, diff-msf, diff-hbond, at-dens-a, diff-at-dens, diff-bfac, lse | 3 |
0.66 | 0.64 | 0.68 | def-energ-i, msf-a, diff-hbond, diff-at-dens, diff-bfac, node-deg-i, lse | 3 |
Precision, recall, and F1 scores calculated from the results of the nine-fold cross-validation on the training set. Refer to
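The leave-one-protein-out ("nine-fold") scheme with pooled metrics can be sketched as follows. This is an illustrative reimplementation on synthetic data, not the authors' code; the group sizes, features, and labels are assumptions.

```python
# Sketch of nine-fold, leave-one-protein-out cross-validation: each fold
# holds out one "protein", and predictions are pooled across folds before
# computing F1. All data are synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X = rng.normal(size=(90, 4))
y = (X[:, 0] > 0).astype(int)
groups = np.repeat(np.arange(9), 10)   # 9 "proteins", 10 residues each

pooled_true, pooled_pred = [], []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    model = SVC(kernel="poly", degree=2, coef0=1.0).fit(X[train_idx], y[train_idx])
    pooled_true.extend(y[test_idx].tolist())
    pooled_pred.extend(model.predict(X[test_idx]).tolist())

pooled_f1 = f1_score(pooled_true, pooled_pred)
```

Pooling the fold-level predictions before scoring, rather than averaging per-fold scores, matches the description of how precision, recall, and F1 were computed above.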
Feature Set | Range of F1 of top 300 models for training data set | No. of models of top 300 w/F1>0.60 on ind. data set | F1 of top model on ind. data set |
Feature Set 1 | 0.63–0.68 | 22 | 0.73 |
Feature Set 2 | 0.68–0.71 | 293 | 0.68 |
Aug. Feature Set 1 | 0.60–0.71 | 31 | 0.68 |
Hybrid Feature Set | 0.63–0.73 | 26,113 | 0.73
The 80,000 feature/kernel degree combinations using the Hybrid Feature Set had F1 scores in the range 0.63–0.73 on the training data set, and all of them were tested on the independent data set; 26,113 of the 80,000 models had an F1 greater than 0.60 on the independent data set. Abbreviation: ind. = independent.
For the top 300 feature/kernel degree combinations using Set 2, precision was lower compared with Feature Set 1 (0.53–0.61; p = 4.2e-10), but recall (0.80–0.95; p<2.2e-16) and F1 (0.68–0.71; p<2.2e-16) were higher (
F1 | Precision | Recall | Feature Combination | Kernel Degree |
0.71 | 0.58 | 0.93 | Ca-disp, sc-flip, asa1, asa2, asasc1 | 3 |
0.71 | 0.58 | 0.93 | Ca-disp, sc-flip, asa1, asa2, asasc1 | 3 |
0.71 | 0.58 | 0.91 | dpsi, asaavg, asascavg, asabbavg | 3 |
0.71 | 0.56 | 0.95 | Ca-disp, sc-flip, asa1, asa2, asasc1, asascavg | 3 |
0.70 | 0.61 | 0.84 | dpsi, dchi1, asascavg | 2 |
0.70 | 0.57 | 0.91 | maxdchi, asa2, asasc1, asascavg, asabb1, asabbavg | 2 |
0.70 | 0.57 | 0.91 | maxdchi, asa2, asaavg, asasc1, asabb1, asabbavg | 2 |
0.70 | 0.57 | 0.91 | maxdchi, asa1, asaavg, asasc2, asabbavg | 2 |
0.70 | 0.57 | 0.91 | maxdchi, asa1, asa2, asasc1, asascavg, asabbavg | 2 |
0.70 | 0.57 | 0.91 | maxdchi, asa1, asa2, asasc1, asascavg, asabb1, asabbavg | 2 |
0.70 | 0.57 | 0.91 | maxdchi, asa1, asa2, asasc1, asasc2, asabbavg | 2 |
0.70 | 0.57 | 0.91 | maxdchi, asa1, asa2, asasc1, asasc2, asascavg, asabbavg | 2 |
0.70 | 0.56 | 0.93 | sc-flip, asa2, asasc1, asascavg, asabb1, asabbavg | 3 |
0.70 | 0.56 | 0.93 | asa2, asaavg, asasc2, asabb1 | 3 |
0.70 | 0.56 | 0.93 | Ca-disp, sc-flip, dchi2, asa1, asa2, asaavg | 3 |
0.70 | 0.56 | 0.93 | Ca-disp, sc-flip, asa2, asaavg, asasc1 | 2 |
0.70 | 0.56 | 0.93 | Ca-disp, sc-flip, asa1, asa2, asaavg, asascavg | 3 |
0.70 | 0.55 | 0.95 | Ca-disp, sc-flip, asa2, asaavg, asasc1 | 3 |
0.70 | 0.55 | 0.95 | Ca-disp, sc-flip, asa2, asaavg, asasc1 | 3 |
0.70 | 0.58 | 0.86 | dpsi, dchi1, asa1, asasc2, asabb1 | 2 |
Precision, recall, and F1 scores calculated from the results of the nine-fold cross-validation on the training set. Refer to
Identifying the features that were used most frequently in the top 300 feature/kernel degree combinations can yield insights into properties that may, when taken together, indicate signatures of an allosteric hotspot residue. Dominant features in the top 300 feature combinations of Set 1 were mean squared fluctuation in the inactive and active conformers; difference in atomic density between inactive and active conformers; deformation energy of the inactive state; difference in the number of hydrogen bonds between inactive and active states; B-factor in the active state; difference in B-factor between the inactive and active states; and local structural entropy (
For each feature, the number of models (frequency) in the top 300, as ranked by F1 performance on the training data, that used that particular feature was tabulated.
For each feature, the number of models (frequency) in the top 300, as ranked by F1 performance on the training data, that used that particular feature was tabulated.
Since many allosteric proteins have only a single solved structure on which to base hotspot predictions, we identified feature/kernel degree combinations in the top 300 from the Set 1 analysis consisting solely of features calculated from a single structure (either the inactive or active state) or a single structure plus sequence-based features. Seventeen such combinations exist for which precision, recall, and F1 ranges were 0.53–0.56, 0.73–0.80, and 0.63–0.65, respectively (
F1 | Precision | Recall | Feature Combination | Kernel Degree |
0.65 | 0.56 | 0.80 | msf-i, lse | 3 |
0.65 | 0.55 | 0.80 | msf-i, lse, mut-info-i | 3 |
0.65 | 0.55 | 0.80 | msf-i, hbond-i, lse | 3 |
0.65 | 0.56 | 0.77 | msf-i, lse, mut-info-i | 2 |
0.65 | 0.56 | 0.77 | msf-i, lse, node-deg-i | 3 |
0.64 | 0.56 | 0.75 | msf-i, bfac-i, lse, node-deg-i | 2 |
0.64 | 0.56 | 0.75 | def-energ-i, msf-i, lse, mut-info-i | 2 |
0.63 | 0.55 | 0.75 | msf-i, lse, node-deg-i, mut-info-i | 3 |
0.63 | 0.55 | 0.75 | msf-i, lse | 2 |
0.63 | 0.55 | 0.75 | msf-i, bfac-i, hbond-i, lse | 2 |
0.63 | 0.56 | 0.73 | msf-i, hbond-i, lse, node-deg-i | 3 |
0.63 | 0.56 | 0.73 | def-energ-i, msf-i, lse | 3 |
0.63 | 0.56 | 0.73 | def-energ-i, msf-i, lse | 2 |
0.63 | 0.53 | 0.77 | msf-i, bfac-i, lse, mut-info-i | 3 |
0.63 | 0.53 | 0.77 | msf-i, bfac-i, mut-info-i | 3 |
0.63 | 0.54 | 0.75 | msf-i, bfac-i, lse, mut-info-i | 2 |
0.63 | 0.54 | 0.75 | def-energ-i, msf-i, hbond-i, lse | 3 |
Precision, recall, and F1 scores calculated from the results of the nine-fold cross-validation on the training set.
To ascertain whether there are general discrepancies in the predictive behavior of models generated with Feature Sets 1 and 2, we assessed the overlap in predictions between the two feature sets. Specifically, for each pair of feature/kernel degree combinations in which one combination was based on Feature Set 1 and the other on Feature Set 2, we counted how many predictions agreed and how many differed. All pair-wise combinations of models were tested; on average, the models agreed on 61.4% of their predictions and disagreed on the remaining 38.6%.
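The agreement statistic for one pair of models is simply the fraction of residues on which the two prediction vectors coincide. A minimal sketch (the prediction arrays here are made up for illustration):

```python
# Fraction of residues on which two models' hotspot predictions agree.
# The prediction vectors are illustrative, not actual model outputs.
import numpy as np

preds_set1 = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
preds_set2 = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])

agreement = float(np.mean(preds_set1 == preds_set2))
```

Averaging this quantity over all Feature Set 1 / Feature Set 2 model pairs gives the overall agreement figure reported above.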
We hypothesized that residues important for allostery may reside in hinge regions that undergo a change in their deformation properties upon binding an allosteric effector. To test this, we assessed whether adding features related to active state deformations would result in more accurate models. We augmented the top 8 features as ranked by their frequency in the top 300 feature/kernel combinations (deformation energy in the inactive state; mean-squared fluctuation in the inactive and active states; difference in the number of H-bonds between inactive and active states; local structural entropy; difference in atomic density between inactive and active states; B-factor in the active state; and the difference in B-factor between the inactive and active states) with the deformation energy of the active state and the difference in the deformation energy between inactive and active states. We then evaluated all possible combinations of those features using kernel degree 2 or 3 and cross-validation on the training data set. The top 20 scoring feature/kernel degree combinations using this feature set (referred to henceforth as Augmented Feature Set 1) scored about 2 points higher in F1 than Feature Set 1 (
F1 | Precision | Recall | Feature Combination | Kernel Degree |
0.71 | 0.64 | 0.80 | msf-i, diff-hbond, lse, diff-at-dens, msf-a, bfac-a, def-energ-a | 2 |
0.70 | 0.66 | 0.75 | diff-hbond, lse, diff-at-dens, msf-a, def-energ-a | 3 |
0.70 | 0.64 | 0.77 | msf-i, diff-hbond, lse, diff-at-dens, diff-bfac, def-energ-a | 3 |
0.69 | 0.63 | 0.77 | diff-hbond, lse, diff-at-dens, msf-a, bfac-a, def-energ-a | 3 |
0.69 | 0.61 | 0.80 | msf-i, diff-hbond, lse, diff-at-dens, msf-a, bfac-a, diff-bfac, def-energ-a | 2 |
0.69 | 0.63 | 0.75 | def-energ-i, diff-hbond, lse, diff-at-dens, msf-a, bfac-a, def-energ-a | 2 |
0.69 | 0.63 | 0.75 | def-energ-i, diff-hbond, lse, diff-at-dens, msf-a, bfac-a, diff-def-energ | 2 |
0.69 | 0.59 | 0.82 | diff-hbond, lse, msf-a, diff-def-energ | 3 |
0.69 | 0.58 | 0.84 | def-energ-i, msf-i, lse, diff-def-energ | 3 |
0.68 | 0.64 | 0.73 | diff-hbond, lse, diff-at-dens, msf-a, bfac-a, def-energ-a, diff-def-energ | 2 |
0.68 | 0.62 | 0.75 | diff-hbond, lse, diff-at-dens, msf-a, bfac-a, def-energ-a | 2 |
0.68 | 0.62 | 0.75 | def-energ-i, diff-hbond, lse, diff-at-dens, msf-a, bfac-a | 2 |
0.68 | 0.62 | 0.75 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, msf-a, diff-bfac, def-energ-a | 3 |
0.68 | 0.61 | 0.77 | msf-i, diff-hbond, lse, diff-at-dens, msf-a, diff-bfac, def-energ-a | 3 |
0.68 | 0.54 | 0.91 | msf-i, diff-hbond, msf-a | 3 |
0.67 | 0.63 | 0.73 | def-energ-i, diff-hbond, lse, diff-at-dens, msf-a, diff-def-energ | 3 |
0.67 | 0.63 | 0.73 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, msf-a, bfac-a, def-energ-a | 2 |
0.67 | 0.61 | 0.75 | diff-hbond, lse, diff-at-dens, msf-a, bfac-a | 2 |
0.67 | 0.61 | 0.75 | def-energ-i, diff-hbond, lse, diff-at-dens, msf-a, diff-bfac | 3 |
Precision, recall, and F1 scores calculated from the results of the nine-fold cross-validation on the training set.
To further assess predictive performance, we tested our top models on an independent data set, because a useful predictive model should perform well on data unseen during training. The top 300 feature/kernel degree combinations of Sets 1 and 2 were used to train support-vector models. Here, for each combination, we created a single support-vector model by training on the entire training data set at once (rather than performing cross-validation as in the previous section, where each fold generated a model) and then tested this model on the independent data set. The independent data set consisted of 87 experimentally determined hotspots and non-hotspots from five allosteric proteins (
F1 | Precision | Recall | Feature Combination | Kernel Degree |
0.73 | 0.67 | 0.81 | msf-i, diff-hbond, mut-info-i, msf-a, diff-msf | 3 |
0.68 | 0.60 | 0.78 | msf-i, mut-info-i, msf-a | 3 |
0.68 | 0.59 | 0.81 | msf-i, diff-hbond, mut-info-i, msf-a | 3 |
0.67 | 0.61 | 0.76 | msf-i, diff-hbond, msf-a, diff-msf | 3 |
0.67 | 0.61 | 0.73 | msf-i, hbond-a, msf-a | 3 |
0.67 | 0.58 | 0.78 | msf-i, diff-hbond, diff-at-dens, mut-info-i, msf-a | 3 |
0.66 | 0.58 | 0.76 | msf-i, diff-hbond, msf-a | 3 |
0.66 | 0.58 | 0.76 | msf-i, bfac-i, msf-a | 3 |
0.66 | 0.62 | 0.70 | msf-i, msf-a, diff-msf | 3 |
0.66 | 0.64 | 0.68 | msf-i, diff-hbond, msf-a | 2 |
0.66 | 0.64 | 0.68 | msf-i, diff-hbond, diff-at-dens, msf-a | 2 |
0.66 | 0.70 | 0.62 | hbond-a, hbond-i, lse, mut-info-i, msf-a, diff-msf | 3 |
0.65 | 0.63 | 0.68 | msf-i, bfac-i, diff-hbond, msf-a | 3 |
0.64 | 0.57 | 0.73 | msf-i, diff-hbond, diff-at-dens, msf-a | 3 |
0.64 | 0.61 | 0.68 | def-energ-i, msf-i, diff-hbond, msf-a | 3
0.63 | 0.67 | 0.59 | hbond-a, hbond-i, lse, msf-a, diff-msf | 3 |
0.62 | 0.68 | 0.57 | hbond-a, hbond-i, lse, diff-at-dens, msf-a, diff-msf | 3 |
0.61 | 0.61 | 0.62 | hbond-i, lse, diff-at-dens, msf-a, diff-msf | 3 |
0.60 | 0.61 | 0.59 | diff-hbond, lse, diff-at-dens, pert-clust-coef-i, msf-a, bfac-a | 2
0.60 | 0.61 | 0.59 | msf-i, diff-hbond, lse, at-dens-a, diff-at-dens, pert-clust-coef-i, msf-a, bfac-a | 2
Each of the top 300 feature/kernel degree combinations (as determined by the nine-fold cross-validation on the training set) was used to train a model on the entire training data set. The resulting models were tested on the independent data set.
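The train-on-everything, test-on-independent-set protocol can be sketched as below. The data are synthetic (only the set sizes, 94 training and 87 independent residues, follow the text); the kernel parameters are assumptions.

```python
# Sketch of the independent-set evaluation: fit one SVM on the full
# training set, then score precision/recall/F1 on held-out data.
# All feature values and labels are synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, f1_score

rng = np.random.default_rng(3)
X_train = rng.normal(size=(94, 4))            # 94 residues, as in the training set
y_train = (X_train[:, 0] > 0).astype(int)
X_indep = rng.normal(size=(87, 4))            # 87 residues, as in the independent set
y_indep = (X_indep[:, 0] > 0).astype(int)

model = SVC(kernel="poly", degree=3, coef0=1.0).fit(X_train, y_train)
pred = model.predict(X_indep)

precision = precision_score(y_indep, pred)
recall = recall_score(y_indep, pred)
f1 = f1_score(y_indep, pred)
```

Unlike the cross-validation of the previous section, each feature/kernel combination yields exactly one model here.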
F1 | Precision | Recall | Feature Combination | Kernel Degree |
0.69 | 0.56 | 0.89 | Ca-disp, dchi2, asa1, asaavg, asasc1, asabb1 | 3 |
0.69 | 0.56 | 0.89 | Ca-disp, dpsi, dchi2, asaavg, asascavg, asabb1 | 3 |
0.69 | 0.56 | 0.89 | Ca-disp, asa1, asaavg, asasc1, asabb1, asabbavg | 3 |
0.69 | 0.55 | 0.92 | dpsi, asaavg, asascavg | 3 |
0.68 | 0.59 | 0.81 | sc-flip, asa2, asaavg, asasc2, asascavg, asabb1 | 3 |
0.67 | 0.58 | 0.81 | Ca-disp, sc-flip, dchi1, dchi2, asasc2, asascavg, asabb1 | 2 |
0.67 | 0.55 | 0.86 | dpsi, dchi1, asascavg | 2 |
0.67 | 0.55 | 0.86 | Ca-disp, sc-flip, dchi2, asa1, asaavg, asasc2 | 3 |
0.67 | 0.55 | 0.86 | Ca-disp, asa1, asasc1, asascavg, asabbavg | 2 |
0.67 | 0.55 | 0.86 | Ca-disp, fI, asaavg, asasc1, asascavg, asabbavg | 2 |
0.67 | 0.54 | 0.89 | Ca-disp, dpsi, asa1, asa2 | 3 |
0.67 | 0.54 | 0.89 | Ca-disp, asa1, asaavg, asasc1 | 2 |
0.67 | 0.57 | 0.81 | dchi1, dchi2, asasc2, asascavg, asabb1 | 2 |
0.67 | 0.55 | 0.84 | dpsi, dchi2, asasc1, asascavg, asabb1, asabb2 | 3 |
0.67 | 0.54 | 0.86 | dpsi, dchi2, asaavg | 3 |
0.67 | 0.54 | 0.86 | dpsi, asaavg, asascavg, asabbavg | 3 |
0.67 | 0.54 | 0.86 | dpsi, asa2, asaavg, asascavg, asabb1, asabb2 | 3 |
0.67 | 0.54 | 0.86 | dpsi, asa1, asa2, asaavg, asasc1, asabb1 | 3 |
0.67 | 0.54 | 0.86 | Ca-disp, dchi2, asaavg, asascavg | 2 |
0.67 | 0.54 | 0.86 | Ca-disp, dpsi, dchi2, asa2, asascavg, asabb1 | 3 |
Each of the top 300 feature/kernel degree combinations (as determined by the nine-fold cross-validation on the training set) was used to train a model on the entire training data set. The resulting models were tested on the independent data set; the top 20 models are given above.
We also evaluated the performance of the top 300 inactive state- and/or sequence-based (i.e., single structure-based) Feature Set 1 models on the independent data (
F1 | Precision | Recall | Feature Combination | Kernel Degree |
0.56 | 0.54 | 0.59 | msf-i, lse | 3 |
0.56 | 0.54 | 0.59 | def-energ-i, msf-i, lse | 3
0.55 | 0.54 | 0.57 | msf-i, hbond-i, lse | 3 |
0.55 | 0.56 | 0.54 | msf-i, lse, node-deg-i, mut-info-i | 3 |
0.55 | 0.56 | 0.54 | msf-i, bfac-i, hbond-i, lse | 2 |
0.55 | 0.53 | 0.57 | msf-i, lse, mut-info-i | 3 |
0.55 | 0.53 | 0.57 | msf-i, bfac-i, lse | 3 |
0.54 | 0.54 | 0.54 | msf-i, lse, node-deg-i | 3 |
0.54 | 0.54 | 0.54 | msf-i, lse | 2 |
0.53 | 0.53 | 0.54 | msf-i, bfac-i, lse, mut-info-i | 3 |
0.53 | 0.51 | 0.54 | def-energ-i, msf-i, hbond-i, lse | 3
0.52 | 0.53 | 0.51 | msf-i, lse, mut-info-i | 2 |
0.52 | 0.53 | 0.51 | msf-i, bfac-i, lse, mut-info-i | 2 |
0.51 | 0.57 | 0.46 | def-energ-i, msf-i, lse, mut-info-i | 2
0.50 | 0.55 | 0.46 | msf-i, bfac-i, lse, node-deg-i | 2 |
0.49 | 0.53 | 0.46 | def-energ-i, msf-i, lse | 2
0.49 | 0.57 | 0.43 | msf-i, hbond-i, lse, node-deg-i | 3 |
We also evaluated the performance of the top 300 models using Augmented Feature Set 1 on the independent data. The performance of the top 20 highest-scoring models is given in
F1 | Precision | Recall | Feature Combination | Kernel Degree |
0.68 | 0.67 | 0.70 | def-energ-i, msf-i, diff-at-dens, msf-a, def-energ-a | 2 |
0.68 | 0.68 | 0.68 | def-energ-i, msf-i, diff-hbond, msf-a | 2 |
0.68 | 0.63 | 0.73 | msf-i, diff-hbond, msf-a, diff-def-energ | 3 |
0.67 | 0.66 | 0.68 | msf-i, diff-at-dens, msf-a, diff-def-energ | 2 |
0.67 | 0.66 | 0.68 | msf-i, diff-hbond, msf-a, def-energ-a | 2 |
0.67 | 0.63 | 0.70 | msf-i, msf-a, def-energ-a | 2 |
0.66 | 0.58 | 0.76 | msf-i, diff-hbond, msf-a | 3 |
0.66 | 0.62 | 0.70 | msf-i, diff-hbond, msf-a, def-energ-a | 3 |
0.66 | 0.64 | 0.68 | msf-i, diff-hbond, msf-a | 2 |
0.66 | 0.64 | 0.68 | msf-i, diff-hbond, diff-at-dens, msf-a | 2 |
0.66 | 0.64 | 0.68 | msf-i, diff-hbond, diff-at-dens, msf-a, def-energ-a | 2 |
0.66 | 0.67 | 0.65 | def-energ-i, msf-i, diff-hbond, diff-at-dens, msf-a | 2 |
0.65 | 0.65 | 0.65 | msf-i, msf-a, diff-def-energ | 2 |
0.65 | 0.65 | 0.65 | msf-i, diff-at-dens, msf-a, def-energ-a, diff-def-energ | 2 |
0.64 | 0.57 | 0.73 | msf-i, diff-hbond, diff-at-dens, msf-a | 3 |
0.64 | 0.66 | 0.62 | def-energ-i, lse, msf-a, def-energ-a, diff-def-energ | 3 |
0.62 | 0.62 | 0.62 | msf-i, diff-hbond, lse, msf-a, def-energ-a | 3 |
0.62 | 0.59 | 0.65 | def-energ-i, msf-i, diff-hbond, msf-a | 3 |
0.61 | 0.61 | 0.62 | def-energ-i, msf-i, lse, diff-at-dens, diff-def-energ | 3 |
0.61 | 0.63 | 0.59 | msf-i, diff-hbond, diff-at-dens, msf-a, diff-def-energ | 2 |
Precision, recall, and F1 scores were calculated from the results on the independent data set.
To assess how much each feature contributes to the predictive ability of a given feature/kernel degree combination, we considered a feature combination from the top 300 that also performed well on the independent data set and analyzed the effect of successive feature addition. In this analysis, the starting point is one feature contained in a top-300 feature/kernel degree combination, followed by a 2-feature model, etc. (
The bar on the far right represents a feature combination from the top 10 models. Preceding bars represent feature combinations where each bar contains one feature fewer than the bar to its right.
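The successive-feature-addition analysis can be sketched as follows: start from one feature of a top combination, add one feature at a time in a fixed order, and record cross-validated F1 at each step. This is an illustration on synthetic data; the feature order and 5-fold CV are assumptions of the example, not the authors' exact procedure.

```python
# Sketch of successive feature addition: grow a model one feature at a
# time and track cross-validated F1. Data, features, and the feature
# order are synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))           # columns = 4 candidate features
y = ((X[:, 0] + X[:, 2]) > 0).astype(int)
order = [0, 2, 1, 3]                   # feature order from a hypothetical top combination

scores = []
for k in range(1, len(order) + 1):
    cols = order[:k]
    pred = cross_val_predict(SVC(kernel="poly", degree=2, coef0=1.0),
                             X[:, cols], y, cv=5)
    scores.append(round(f1_score(y, pred), 2))
```

Plotting `scores` against the number of included features reproduces the kind of bar chart described above, showing how much each added feature contributes.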
Naturally, a parsimonious model that makes accurate predictions with few parameters (or, in our case, features) is preferable to one that requires many. Having fewer features reduces the number of calculations required for test cases and lowers the propensity for overfitting. We therefore investigated whether any of the top 300 feature/kernel degree combinations consisted of just 2 or 3 features. Twenty-three such feature/kernel degree combinations were found within the top 300. Feature usage in these combinations reflected that of the top 300 feature/kernel degree combinations, with mean-squared fluctuation and local structural entropy predominating (
F1 | Precision | Recall | Feature Combination | Kernel Degree |
0.68 | 0.54 | 0.91 | msf-i, diff-hbond, msf-a | 3 |
0.65 | 0.55 | 0.82 | msf-i, diff-hbond, msf-a | 2 |
0.65 | 0.56 | 0.80 | msf-i, lse | 3 |
0.65 | 0.54 | 0.82 | msf-i, msf-a, diff-msf | 3 |
0.65 | 0.55 | 0.80 | msf-i, lse, diff-msf | 3 |
0.65 | 0.55 | 0.80 | msf-i, lse, mut-info-i | 3 |
0.65 | 0.55 | 0.80 | msf-i, diff-hbond, lse | 3 |
0.65 | 0.55 | 0.80 | msf-i, hbond-i, lse | 3 |
0.65 | 0.56 | 0.77 | msf-i, lse, mut-info-i | 2 |
0.65 | 0.56 | 0.77 | msf-i, lse, node-deg-i | 3 |
0.64 | 0.54 | 0.80 | msf-i, hbond-a, msf-a | 3 |
0.64 | 0.55 | 0.77 | msf-i, lse, diff-bfac | 3 |
0.64 | 0.55 | 0.77 | msf-i, diff-hbond, lse | 2 |
0.64 | 0.55 | 0.77 | msf-i, hbond-a, lse | 3 |
0.64 | 0.52 | 0.82 | msf-i, mut-info-i, msf-a | 3 |
0.63 | 0.55 | 0.75 | msf-i, lse | 2 |
0.63 | 0.56 | 0.73 | msf-i, lse, at-dens-a | 3 |
0.63 | 0.56 | 0.73 | def-energ-i, msf-i, lse | 3 |
0.63 | 0.56 | 0.73 | def-energ-i, msf-i, lse | 2 |
0.63 | 0.53 | 0.77 | msf-i, lse, diff-at-dens | 3 |
0.63 | 0.53 | 0.77 | msf-i, bfac-i, lse | 3 |
0.63 | 0.54 | 0.75 | msf-i, hbond-i, msf-a | 2 |
0.63 | 0.54 | 0.75 | msf-i, bfac-i, msf-a | 3 |
Precision, recall, and F1 scores calculated from the results of the nine-fold cross-validation on the training set.
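Enumerating all small feature subsets, as in the search for parsimonious 2- and 3-feature models, is a straightforward combinatorial sweep. The sketch below uses a reduced feature pool for illustration (feature names follow the paper's abbreviations; the scoring step is omitted):

```python
# Enumerate every 2- and 3-feature subset of a feature pool; each subset
# would then be scored by cross-validated F1. Pool reduced for brevity.
from itertools import combinations

features = ["msf-i", "msf-a", "lse", "diff-hbond", "mut-info-i", "node-deg-i"]
subsets = [c for k in (2, 3) for c in combinations(features, k)]
# For 6 features: C(6,2) + C(6,3) = 15 + 20 = 35 candidate subsets
```

With the full feature sets, the same sweep over larger subset sizes produces the tens of thousands of feature/kernel degree combinations evaluated in this work.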
Because each feature set had its unique strengths in terms of predictive power, and there was limited consensus of predictions between models using the two feature sets, we formed a hybrid feature set consisting of the features of Set 1 and 2 that were most prevalent in top models. Specifically, we pooled the top 8 features from Set 1 as ranked by frequency in the top 300 feature/kernel degree combinations trained solely on this feature set (deformation energy in the inactive state; mean-squared fluctuation in the inactive and active states; difference in the number of H-bonds between inactive and active states; local structural entropy; difference in atomic density between inactive and active states; B-factor in the active state; and the difference in B-factor between the inactive and active states – see
F1 | Precision | Recall | Feature Combination | Kernel Degree |
0.73 | 0.65 | 0.84 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, bfac-a, diff-bfac, asa2, asaavg, asasc1, asascavg, asabb1 | 3 |
0.73 | 0.65 | 0.84 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, bfac-a, diff-bfac, asa2, asaavg, asasc1, asascavg, asabb1 | 3 |
0.72 | 0.68 | 0.77 | def-energ-i, diff-hbond, lse, bfac-a, Ca-disp, asasc1, asasc2, asascavg, asabb1, asabbavg | 2 |
0.72 | 0.68 | 0.77 | def-energ-i, diff-hbond, lse, bfac-a, Ca-disp, asa2, asaavg, asasc2 | 2 |
0.72 | 0.66 | 0.80 | def-energ-i, diff-hbond, lse, diff-at-dens, diff-bfac, asascavg | 3 |
0.72 | 0.66 | 0.80 | def-energ-i, msf-i, lse, diff-at-dens, bfac-a, diff-bfac, asa1, asa2, asasc1, asasc2, asabb1 | 3 |
0.72 | 0.64 | 0.82 | def-energ-i, msf-i, lse, diff-at-dens, bfac-a, diff-bfac, asa1, asa2, asasc1, asascavg, asabb1 | 3 |
0.72 | 0.64 | 0.82 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, bfac-a, diff-bfac, asa2, asaavg, asasc1, asasc2, asascavg, asabb1 | 3 |
0.72 | 0.64 | 0.82 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, bfac-a, diff-bfac, asa1, asaavg, asasc1, asasc2, asabbavg | 3 |
0.72 | 0.64 | 0.82 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, bfac-a, diff-bfac, asa1, asa2, asasc1, asascavg, asabbavg | 3 |
0.72 | 0.69 | 0.75 | def-energ-i, diff-hbond, lse, bfac-a, asasc1, asasc2, asascavg | 2 |
0.72 | 0.61 | 0.86 | def-energ-i, lse, Ca-disp, asasc2, asabb1, asabbavg | 3 |
0.72 | 0.61 | 0.86 | def-energ-i, diff-hbond, lse, asasc2 | 3 |
0.72 | 0.67 | 0.77 | def-energ-i, diff-hbond, lse, bfac-a, Ca-disp, asa1, asa2, asasc1, asasc2, asabbavg | 2 |
0.72 | 0.67 | 0.77 | def-energ-i, diff-hbond, lse, bfac-a, Ca-disp, asa1, asa2, asaavg, asasc2, asascavg, asabbavg | 2 |
0.72 | 0.67 | 0.77 | def-energ-i, diff-hbond, lse, bfac-a, diff-bfac, asa2, asaavg, asasc2 | 2 |
0.72 | 0.67 | 0.77 | def-energ-i, diff-hbond, lse, bfac-a, asasc2, asascavg, asabbavg | 2 |
Precision, recall, and F1 scores were calculated from the results of the nine-fold cross-validation on the training set.
The models that scored highest on the independent data set are listed in
F1 | Precision | Recall | Feature Combination | Kernel Degree |
0.73 | 0.67 | 0.78 | msf-i, diff-at-dens, msf-a, asaavg, asascavg, asabb1, asabbavg | 2 |
0.73 | 0.67 | 0.78 | msf-i, diff-hbond, msf-a, Ca-disp, asasc1, asabbavg | 2 |
0.73 | 0.67 | 0.78 | msf-i, diff-hbond, msf-a, Ca-disp, asasc1, asabb1 | 2 |
0.73 | 0.67 | 0.78 | msf-i, diff-hbond, msf-a, Ca-disp, asa1, asabbavg | 2 |
0.73 | 0.67 | 0.78 | msf-i, diff-hbond, msf-a, asaavg, asasc1, asabb1 | 2 |
0.72 | 0.71 | 0.73 | diff-hbond, msf-a, Ca-disp, asaavg, asasc2, asascavg, asabbavg | 3 |
0.72 | 0.68 | 0.76 | diff-hbond, msf-a, Ca-disp, asa1, asa2, asaavg, asasc1, asascavg, asabbavg | 3 |
0.72 | 0.66 | 0.78 | msf-i, diff-hbond, msf-a, Ca-disp, asa1, asabb1 | 2 |
0.71 | 0.64 | 0.81 | msf-i, diff-at-dens, asaavg, asascavg, asabb1, asabbavg | 3 |
0.71 | 0.64 | 0.81 | msf-i, diff-hbond, msf-a, Ca-disp, asa1, asabb1, asabbavg | 3 |
0.71 | 0.64 | 0.81 | msf-i, diff-hbond, msf-a, asa1, asabbavg | 3 |
0.71 | 0.64 | 0.81 | msf-i, diff-hbond, msf-a, asa1, asasc1, asabbavg | 3 |
0.71 | 0.64 | 0.81 | msf-i, diff-hbond, diff-at-dens, msf-a, asa1, asascavg, asabb1, asabbavg | 3 |
0.71 | 0.62 | 0.84 | msf-i, diff-at-dens, msf-a, asa2, asaavg, asasc2, asascavg, asabb1 | 3 |
0.71 | 0.67 | 0.76 | msf-i, diff-hbond, msf-a, Ca-disp, asasc1, asabb1, asabbavg | 2 |
0.71 | 0.67 | 0.76 | msf-i, diff-hbond, msf-a, Ca-disp, asaavg, asasc1, asabb1 | 2 |
0.71 | 0.67 | 0.76 | msf-i, diff-hbond, msf-a, Ca-disp, asa1 | 2 |
0.71 | 0.67 | 0.76 | msf-i, diff-hbond, msf-a, Ca-disp, asa1, asasc1, asabbavg | 2 |
0.71 | 0.67 | 0.76 | msf-i, diff-hbond, msf-a, Ca-disp, asa1, asasc1, asabb1, asabbavg | 2 |
0.71 | 0.67 | 0.76 | msf-i, diff-hbond, msf-a, asa1, asabbavg | 2 |
Precision, recall, and F1 scores were calculated from the results on the independent data set. Listed are the top scoring feature/kernel degree combinations as ranked by F1 on the independent data set.
To investigate the topology of predicted allosteric hotspots, we considered the predictions made by the top 9 highest-precision Hybrid Feature Set models for each residue of each protein in the independent data set (
F1 train | P train | R train | F1 ind | P ind | R ind | Feature Combination | Kernel Degree |
0.65 | 0.54 | 0.84 | 0.70 | 0.75 | 0.65 | msf-i, diff-hbond, msf-a, Ca-disp, asa2, asaavg, asasc1, asasc2, asascavg, asabbavg | 3 |
0.65 | 0.55 | 0.80 | 0.70 | 0.74 | 0.68 | msf-i, diff-at-dens, Ca-disp, asaavg, asabb1, asabbavg | 2 |
0.64 | 0.57 | 0.73 | 0.69 | 0.73 | 0.65 | msf-i, diff-hbond, bfac-a, Ca-disp, asa2, asasc1, asabb1, asabbavg | 2 |
0.63 | 0.56 | 0.70 | 0.69 | 0.73 | 0.65 | msf-i, diff-hbond, diff-at-dens, msf-a, bfac-a, diff-bfac, asa1, asa2, asaavg, asasc2, asabbavg | 2 |
0.63 | 0.52 | 0.80 | 0.69 | 0.71 | 0.68 | msf-i, diff-at-dens, Ca-disp, asa1, asaavg, asabbavg | 2 |
0.65 | 0.55 | 0.80 | 0.69 | 0.71 | 0.68 | msf-i, diff-at-dens, msf-a, asa1, asa2, asaavg, asasc2, asabbavg | 3 |
0.64 | 0.57 | 0.73 | 0.69 | 0.71 | 0.68 | def-energ-i, msf-i, diff-hbond, bfac-a, diff-bfac, asa1, asa2, asasc1, asascavg, asabbavg | 2 |
0.64 | 0.56 | 0.75 | 0.69 | 0.71 | 0.68 | def-energ-i, msf-i, diff-hbond, msf-a, Ca-disp, asa2, asasc1, asasc2, asascavg, asabb1 | 3 |
0.64 | 0.57 | 0.73 | 0.69 | 0.71 | 0.68 | def-energ-i, msf-i, diff-hbond, diff-at-dens, bfac-a, diff-bfac, asa2, asaavg, asasc2, asabb1 | 2 |
Performance on both the training (abbreviated train) and independent (abbreviated ind) data sets is given: F1, precision (P), and recall (R) are reported for each model on each data set.
In doing this analysis, we assessed whether predicted hotspots form a network pattern in the protein structure, in light of previous work showing the existence of networks of contiguous residues connecting effector and substrate sites in allosteric proteins
Furthermore, the locations of predicted hotspots and non-hotspots in the protein structure and the known functions of the structural elements of each protein system gave insight into the functional significance of the predictions. For
(A) Predictions made by the top 9 highest-precision Hybrid Feature Set models according to the voting scheme for lac repressor, mapped onto the inactive state structure (1tlf). Experimentally tested residues are rendered as van der Waals spheres, with known non-hotspots in small spheres and known hotspots in larger ones. For other residues, the prediction is shown along the backbone trace, but no experimental data are available to test the prediction. Each residue in the structure is colored according to a blue→green→red heat map, where the extremes are as follows: red represents residues predicted to be hotspots by 9/9 models, and blue represents residues predicted to be hotspots by 0/9 models (i.e., predicted non-hotspots by 9/9 models). (Refer to the color bar above for the exact mapping of the number of hotspot votes to color.) For ease of viewing, only one dimer (chains A and B) is shown. His 74 and Asp 278, residues that are not in the independent data set but were studied experimentally and found to be allosterically active, are rendered in van der Waals mode as well
The top-precision Hybrid Feature Set models predicted many residues with known functional significance to be hotspots in the myosin II motor domain (
Predictions made by the top 9 highest-precision Hybrid Feature Set models according to the voting scheme for myosin II motor domain mapped onto the inactive state structure (1vom). Refer to
Glucokinase is an enzyme that plays a role in regulating blood glucose levels through its function as a glucose sensor. Congenital mutations in this protein are associated with maturity-onset diabetes of the young
Glutamate dehydrogenase is an enzyme that plays an important role in nitrogen/carbon metabolism, oxidatively deaminating glutamate to 2-oxoglutarate, which is supplied to the TCA cycle
Thrombin is a serine protease that plays key roles in both promoting and preventing clotting
In addition, Tyr 225 and Tyr 184A, two residues designated as part of the allosteric core
Since
First, we counted the number of residues associated with IS. 113 residues out of the 329 that were studied by Markiewicz et al.
We compared the performance of our best models with Statistical Coupling Analysis (SCA;
In this work, we assembled a data set of residues that have been found experimentally to either perturb allostery (hotspots) or not (non-hotspots). We then calculated features for each data point, i.e., mutation site, to train machine-learning models that can predict a mutation's impact on allostery. We compared the performance of models based on structural, dynamic, network, and informatic features (Feature Set 1 and Augmented Feature Set 1) with ones trained on structural features requiring both inactive and active state structures (Feature Set 2). An advantage of our approach is that the models make automatic predictions about whether a residue is a hotspot or non-hotspot, avoiding the need for qualitative assessment or manual data analysis, and make use of a broad range of residue-level attributes implicated in allostery. Furthermore, our methods do not require long simulations or free energy calculations, which are difficult to perform when screening a large number of residues.
After testing all possible combinations of features on the training data set, we evaluated feature usage by the top-scoring models to provide insights into what may be residue-level signatures of allostery. In top-scoring models using Feature Set 1, deformation energy, mean-squared fluctuation, B-factor, atomic density, hydrogen bonding, and local structural entropy were predominant. In Feature Set 2, α-carbon displacement and solvent-accessible surface area measures predominated. We then combined the features that were predominant in top scoring models (based on the training data set) of each of these two feature sets and trained models using this feature set (Hybrid Feature Set). It was this hybrid set that performed best on the training and independent data sets.
Features that predominate in high scoring models should be examined individually in the context of other work on allostery. Our examination of feature usage suggests that deformation energy is an important residue property in allostery. Deformation energy reflects a residue's participation in a protein hinge, and one can envision that hinge regions would coincide with residues of allosteric relevance. Others have applied residue-level constraints and analyzed their effects on the protein structure to define domains within a protein
Measures related directly to solvent-accessible surface area (SASA), and those that correlate with SASA, were also found to be important for describing allostery. In Feature Set 1, the differences between normalized B-factors and atomic densities in the active and inactive states, along with the magnitudes of mean-squared fluctuations in the two states, were predominant features in the top models. Mean-squared fluctuation and B-factor indirectly reflect the degree of surface exposure, while atomic density relates directly to solvent exposure. In addition, SASA-related features were especially dominant in the top-scoring models created using Feature Set 2.
The observed prevalence of these features in top-scoring models was confirmed by inspecting the average values of these measures for hotspots and non-hotspots. Mean-squared fluctuation in the inactive state, atomic density in the inactive and active states, and most SASA measures were all significantly lower for hotspots than for non-hotspots, suggesting that hotspots tend to be buried (
Consistent with the importance of B-factor and mean-squared fluctuations in our models is the fact that residue fluctuations and correlations in fluctuations have been found computationally to yield putative allosteric networks of communication, with confirmation by experiment in some cases
The observed prevalence of another feature related to changes in solvent-accessible surface area in Feature Set 1, the difference in atomic density between active and inactive states, can also be related to important work in allostery. In particular, this finding is consistent with work showing the ability of networks of changes in residue contacts to identify putative allosteric communication and experimental hotspots
A striking result from our analysis was the prevalence of local structural entropy, which is essentially a measure of the potential variability in protein secondary structure. The importance of the variability of secondary structure can be related to work using COREX
Our analysis further revealed the importance of differences in hydrogen bonding between the inactive and active states, underscoring the role of this feature in governing processes that require microenvironmental specificity. We found that hotspots undergo greater changes in their hydrogen-bonding network in the allosteric transition than non-hotspots (
We were surprised by the low occurrence in the top models of the two network-related properties, node degree and the perturbation of the clustering coefficient upon node removal (both computed on the inactive state), given the demonstration that proteins are small-world networks
We examined our top scoring models from Feature Set 1 to determine if any of them required only a single structure, since in many systems, the crystal structure for only one conformation has been solved. Models that required the inactive state structure alone were found among the top 300 models, but none required only the active state structure. This suggests that the inactive state encodes a greater amount of relevant functional information than the active state. This is consistent with the observation that the inactive state is predisposed toward adopting functionally relevant conformations and can undergo the allosteric transition in the absence of effector
Because neither Feature Set 1 nor Feature Set 2 appeared to be absolutely superior in performance, we created an optimal “hybrid” feature set by combining the top features of each. The hybrid set outperformed either Set 1 or Set 2 individually. Specifically, top Hybrid Feature Set models achieved the highest F1 scores on both the training and independent data sets, with a statistically significant (p&lt;2.2e-16 for both data sets) improvement over the non-mixture feature set that scored best on the training data, Set 2. This result suggests that optimal predictions of allosteric functional properties from protein structure and sequence must account both for dynamic properties of the protein structure and for structural differences between the end-states. Moreover, empirical structural observations can work synergistically with dynamical properties derived from a simple mechanical model, i.e., the elastic network model used for the normal-mode calculations.
Top-scoring SVM models trained using Feature Set 1, Set 2, Augmented Set 1, and the Hybrid Feature Set outperformed SCA in sensitivity and accuracy of class prediction. The difference in performance could be due to two reasons. First, SCA is based strictly on sequence, whereas our methods rely on sequence, structural, dynamical, and network features. Second, SCA was not originally developed as an allosteric hotspot-prediction method
To shed light on the pattern of hotspots in the structures, we applied a voting scheme to the Hybrid Feature Set models with the highest precision on the independent data set to make predictions for every residue of each protein in the independent data set. This voting among models was adopted to avoid the limitations of any single model in predictive power. Furthermore, this scheme yields a continuum of predictions based on how many models predict a hotspot or non-hotspot for each residue. That is, this method not only predicts which residues are strong hotspots or non-hotspots, in which cases the models cast a unanimous or nearly unanimous vote for hotspot or non-hotspot, but it also uncovers residues with intermediate relevance to allostery, where there is not a large majority of models predicting either class.
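The voting scheme described above can be sketched as follows (a minimal illustration; the function and variable names are ours, not from the original implementation):

```python
def hotspot_votes(model_predictions):
    """model_predictions: one dict per model, mapping residue -> 1
    (predicted hotspot) or 0 (predicted non-hotspot).  Returns the
    number of models (here 0-9) voting 'hotspot' for each residue;
    unanimous or near-unanimous tallies flag strong hotspots or
    non-hotspots, while intermediate tallies flag residues of
    intermediate relevance to allostery."""
    tallies = {}
    for preds in model_predictions:
        for residue, vote in preds.items():
            tallies[residue] = tallies.get(residue, 0) + vote
    return tallies
```

The tally maps directly onto the continuum (blue→green→red) used in the structure figures.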
Predicted hotspots tended to occur at highest densities in the interior of the structures, while non-hotspots tended to be found in the periphery of the proteins studied, consistent with the work of others demonstrating the importance of internal networks of residues connecting distant sites in allosteric proteins
We examined the topology of all residues whose substitution with any residue has been found experimentally to cause IS in
One notable observation was that in one of the proteins, glucokinase, residues that made contacts with the synthetic allosteric activator, Compound A, were all predicted hotspots and some of these were known hotspots included in the independent data set. Compound A enhances the activity of glucokinase and has been considered as a therapy for diabetes, as glucokinase acts as a glucose sensor that plays a role in the regulation of serum glucose levels. This result suggests that predicted solvent-accessible hotspots might be candidates for binding sites of small-molecule effectors that can rescue the behavior of mutant proteins. Liu and Nussinov
We have demonstrated that machine-learning models using dynamical, structural, informatic, and network features can discriminate between allosteric hotspots and non-hotspots with high sensitivity and accuracy, that the patterns of predictions form a network of residues within the structures, and that hotspots correlate with regions of known functional relevance. In our structural analysis, we exploited the exhaustive nature of an experimental mutagenesis study of
We hope our methods can help experimentalists identify residues that contribute to mechanisms of allostery in proteins of interest. Typically, residues thought to participate in the allosteric transition are those that undergo significant structural alterations between the inactive and active states or those that interact at subunit interfaces. Thus, site-directed mutagenesis studies probing the allosteric transition tend to target these residues. However, other residues may play key roles in the transition yet are not targeted, since they do not undergo obvious structural rearrangements. The observed importance of dynamics in addition to structure suggests that traditional structure-based approaches to selecting candidate residues for mutagenesis may not give a complete picture of allosterically relevant residues. Our methods overcome this shortcoming by including dynamical as well as structural features. Predictions made by our methods may be used to guide experimentalists in their choice of residues to target in mutagenesis studies, in particular, residues that would not be considered relevant to allostery based on structural methods alone.
An important test of our methods will be whether predicted hotspot residues correlate with those whose mutations result in significant perturbation of the allosteric coupling free energy between sites,
The fact that one of the proteins we studied, glucokinase, exhibited extensive contacts of hotspot residues with a drug that shifts the protein to an active state suggests that hotspot residues could be candidates for drug targets. There exist enzymes in the drug discovery field for which finding active site inhibitors has been difficult
An advantage of our techniques over other computational methods is that they are “meta-methods” that incorporate a variety of features. In contrast, many computational methods for inferring allosteric coupling derive their predictions from measurements of only single features. However, allostery is arguably a complex phenomenon that requires a more detailed model. Here, we have taken into account a number of features putatively relevant to allostery and combined them using a machine-learning algorithm to determine their relative importance in discriminating hotspots from non-hotspots. An advantage of these features is that most of them, with the exception of mean-squared fluctuation, deformation energy, and mutual information, can be calculated directly from the structures or sequences without the use of calculations that require heavily parameterized force fields or expensive simulations. Even the features that do rely on a parameterized model are calculated using the elastic network model, which has only two adjustable parameters. Thus, in creating a complex model for allosteric communication, we have striven to keep the individual features of the model as simple as possible.
In the case of multimeric proteins, allosteric function is considered perturbed if, upon mutation: the Hill coefficient, a measure of cooperativity, is significantly altered; the protein is locked in either an inactive or active state (Hill coefficient of 1) even in the presence of effector; the concentration of allosteric inhibitor required to cause 50% inhibition is increased; binding or activity curves are altered from sigmoidal (characteristic of multimeric allosteric enzymes) to hyperbolic; or if inducibility is altered as measured by expression of a reporter gene
Since a classification model must distinguish between positive and negative data, mutations that have no effect on allostery are included in the training data as controls. An additional criterion for inclusion in the training data set is that the mutation not be located in an effector or substrate-binding site. Naturally, it is possible for mutations that perturb binding to perturb the allosteric transition. In this study, the aim is to predict mutations that disrupt or alter the communication between effector and substrate sites (in the case of heterotropic cooperativity) or between substrate sites (in the case of homotropic cooperativity). Our training data set is a subset of those allosteric proteins compiled by Daily and Gray
Eighteen attributes were computed for each protein in the training and independent data sets (Feature Set 1). Dynamical attributes were calculated with the program DIAGRTB, which computes all-atom elastic network model (ENM) normal modes with rotation-translation blocking
Atomic mean square fluctuations were calculated for both inactive and active conformers using the following formula:
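The formula itself appears to have been lost in conversion; for an elastic network model, the mean-square fluctuation of atom \(i\) is conventionally obtained from the normal modes as (our reconstruction of the presumably intended expression):

```latex
\left\langle \Delta \mathbf{r}_i^{\,2} \right\rangle
  = k_B T \sum_{k=7}^{3N} \frac{\left| \mathbf{a}_{ik} \right|^{2}}{\lambda_k},
```

where \(\lambda_k\) is the eigenvalue of mode \(k\), \(\mathbf{a}_{ik}\) is the three-component block of eigenvector \(k\) belonging to atom \(i\), and the six zero-eigenvalue rigid-body modes are omitted.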
Since the actual numerical values of the mean-square fluctuations are only meaningful within a protein and not across proteins, a method to determine the relative degree of fluctuation was required. To this end, the atoms were ranked according to the magnitudes of their fluctuations. The decile rank was determined for each atom of each of the mutated residues in the data set, and the score for a residue was taken to be the average of the decile ranks of its atoms. Both the individual scores for the inactive and active states and the difference between them were recorded.
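The decile-rank scoring can be sketched as follows (a minimal illustration; tie-breaking and naming conventions are ours, not from the original implementation):

```python
import math

def decile_ranks(values):
    """Decile rank (1-10) of each value, by its position in the
    sorted order of all values within one protein."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = {atom: r for r, atom in enumerate(order)}  # 0-based rank
    n = len(values)
    return [math.ceil((rank[i] + 1) * 10 / n) for i in range(n)]

def residue_scores(atom_values, atom_to_residue):
    """Score each residue as the average decile rank of its atoms."""
    deciles = decile_ranks(atom_values)
    grouped = {}
    for d, res in zip(deciles, atom_to_residue):
        grouped.setdefault(res, []).append(d)
    return {res: sum(ds) / len(ds) for res, ds in grouped.items()}
```

The same normalization is applied below to deformation energies and B-factors.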
Deformation energy was calculated for the inactive and active conformers as follows
Mutual entropy, or mutual information, between two coordinates
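The defining expression appears to have been lost in conversion; in standard information-theoretic notation, the mutual information between coordinates \(x_i\) and \(x_j\) (presumably the form intended here) is

```latex
I(x_i, x_j) = H(x_i) + H(x_j) - H(x_i, x_j),
```

where \(H(\cdot)\) denotes the entropy of the corresponding marginal or joint distribution of the atomic fluctuations.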
In this work, an approximation for estimating
For each residue, a mutual information score was taken as the number of instances a given residue (represented by its alpha carbon) had an off-diagonal
In addition to dynamic information based on normal modes, the following static-structure attributes were calculated:
Mutation sites were ranked according to their B-factors in the same manner as for mean-square fluctuation and deformation energy; that is, a decile-rank score was used to normalize for variability in global protein flexibility. This was performed for both the active- and inactive-state structures. Both the individual scores for the two states and the difference between them were recorded.
An average atomic density was determined for each residue in both the active and inactive states using FADE (Fast Atomic Density Evaluator;
Potential hydrogen bonds for residues in both active- and inactive-state structures were determined using the WHAT IF program
A number of network-based features were calculated for the inactive-state structure:
Node degree was taken to be the total number of residues that contain at least one heavy atom within 5.0 Å of the residue (node) of interest.
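A minimal sketch of this degree calculation, assuming heavy-atom coordinates have already been extracted from the structure (all names are ours):

```python
from itertools import combinations

def node_degrees(heavy_atoms, cutoff=5.0):
    """heavy_atoms: list of (residue_id, (x, y, z)) pairs.
    The degree of a residue (node) is the number of other residues
    with at least one heavy atom within `cutoff` angstroms of any
    of its own heavy atoms."""
    cut2 = cutoff ** 2
    neighbors = {res: set() for res, _ in heavy_atoms}
    for (r1, p1), (r2, p2) in combinations(heavy_atoms, 2):
        if r1 != r2 and sum((a - b) ** 2 for a, b in zip(p1, p2)) <= cut2:
            neighbors[r1].add(r2)
            neighbors[r2].add(r1)
    return {res: len(s) for res, s in neighbors.items()}
```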
The clustering coefficient is defined as follows
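The definition referenced here appears to have been dropped in conversion; the standard (Watts–Strogatz) clustering coefficient of node \(i\) is

```latex
C_i = \frac{2 E_i}{k_i (k_i - 1)},
```

where \(k_i\) is the degree of node \(i\) and \(E_i\) is the number of edges among its neighbors.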
Finally, a number of informatics features were calculated:
Local structural entropy is a measure of the propensity for variability in secondary structure within a given 4-residue site
The ConSurf web server was used to compute each residue's conservation score from a multiple sequence alignment
Calculations of features related to the change in average structure between active- and inactive-state conformations (Feature Set 2) were originally performed by Daily and Gray
Support-vector machine learning was implemented using the Weka machine-learning package
The feature/kernel degree combinations that performed best in the training set were tested on the independent data set. Here, a single model was trained on the entire training data set using each of these highest performing feature/kernel degree combinations, and this model was subsequently tested on the independent data set.
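The models themselves were trained in Weka; purely as an illustration, the train-on-all/evaluate-on-independent protocol can be sketched with scikit-learn's polynomial-kernel SVM (a substitute library; all names and parameter choices below are ours, not the paper's):

```python
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate_on_independent(X_train, y_train, X_ind, y_ind, degree=2):
    """Train a single polynomial-kernel SVM on the full training set,
    then score it on the independent set (labels: 1 = hotspot)."""
    model = SVC(kernel="poly", degree=degree, coef0=1.0)
    model.fit(X_train, y_train)
    pred = model.predict(X_ind)
    return (precision_score(y_ind, pred),
            recall_score(y_ind, pred),
            f1_score(y_ind, pred))
```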
We used position-specific iterated BLAST (PSI-BLAST)
For the case of myosin II, we used the results of the SCA analysis published by Yu et al. (
Precision, recall and F1 were calculated for each feature set and polynomial kernel combination used in the support vector machine learning, using a nine-fold cross validation for each combination. These same measures were calculated when evaluating the performance of models on the independent data set and when evaluating SCA on the training or independent data sets.
The feature/kernel degree combinations were ranked according to F1. For the calculation of these measures in evaluating the results of the cross-validated training, we pooled the
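A minimal sketch of these measures, computed over predictions pooled across the cross-validation folds (function and variable names are ours; the labels use 1 = hotspot, 0 = non-hotspot):

```python
def pooled_scores(predictions, labels):
    """Compute precision, recall, and F1 over predictions pooled
    from all cross-validation folds."""
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```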
To measure the statistical significance of differences between the performance measures of sets of models, a one-tailed, unpaired Student's t-test was used.
Predictions made by the top 9 highest-precision Hybrid Feature Set models according to the voting scheme for glucokinase, mapped onto the active state structure (1v4s). Each residue in the structure is colored according to a blue→green→red heat map, where the extremes are as follows: red represents residues predicted to be hotspots by 9/9 models, and blue represents residues predicted to be hotspots by 0/9 models (i.e., predicted non-hotspots by 9/9 models). Experimentally determined hotspots and non-hotspots included in the independent set are rendered as van der Waals spheres (non-hotspots in small spheres). For other residues, the prediction is shown along the backbone trace, but no experimental data are available to test the prediction. Correct true positive (hotspot) and true negative (non-hotspot) predictions are colored according to the heat map, while false negatives and false positives are colored gray. Glucose, the effector and substrate for this enzyme, is rendered in sticks and colored by element. Some correctly predicted true hotspots depicted as spheres in the figure (Met 210, Tyr 214, Val 452, and Val 455), along with two predicted hotspots not in the independent data set (Arg 63 and Tyr 215), also contact the allosteric drug Compound A (rendered in sticks and colored by element), which enhances the activity of the enzyme.
(0.21 MB PDF)
Predictions made by the top 9 highest-precision Hybrid Feature Set models according to the voting scheme for glutamate dehydrogenase, mapped onto the inactive state structure (1nr7). Each residue in the structure is colored according to a blue→green→red heat map, where the extremes are as follows: red represents residues predicted to be hotspots by 9/9 models, and blue represents residues predicted to be hotspots by 0/9 models (i.e., predicted non-hotspots by 9/9 models). Experimentally determined hotspots and non-hotspots included in the independent set are rendered as van der Waals spheres (non-hotspots in small spheres). For other residues, the prediction is shown along the backbone trace, but no experimental data are available to test the prediction. Correct true positive (hotspot) and true negative (non-hotspot) predictions are colored according to the heat map, while false negatives and false positives are colored gray.
(7.20 MB PDF)
Predictions made by the top 9 highest-precision Hybrid Feature Set models according to the voting scheme for thrombin, mapped onto the structure of the slow form (1sgi). Each residue in the structure is colored according to a blue→green→red heat map, where the extremes are as follows: red represents residues predicted to be hotspots by 9/9 models, and blue represents residues predicted to be hotspots by 0/9 models (i.e., predicted non-hotspots by 9/9 models). Experimentally determined hotspots and non-hotspots included in the independent set are rendered as van der Waals spheres (non-hotspots in small spheres), along with two additional residues that are part of the allosteric core, Tyr 225 and Tyr 184A, which did not meet the criteria for inclusion in the independent data set. For other residues, the prediction is shown along the backbone trace, but no experimental data are available to test the prediction. Correct true positive (hotspot) and true negative (non-hotspot) predictions are colored according to the heat map, while false negatives and false positives are colored gray.
(5.63 MB PDF)
SCA data. Results for SCA are presented for each protein from the training and independent data sets, except for myosin II, where we relied on the previously published analysis by Yu et al. [F1]. a. Hierarchically clustered matrix of ΔΔG values and dendrogram in which terminal branches correspond to residue indices of the protein sequence. Branches of the dendrogram corresponding to regions in the matrix containing clusters of high ΔΔG (regions with a high fraction of points greater than or equal to 1.6 kT) are highlighted. The color scale is displayed once, for CheY, and applies to the subsequent protein systems. b. Magnification of the ends of the highlighted branches to display the residue indices, which are based on the numbering in the corresponding PDB file (except for thrombin, where negative numbers denote residues cleaved from prothrombin chain B and thrombin residues start at 1).
(1.78 MB PDF)
Training data set. Given are the protein name, the PDB ID of the inactive state, the PDB ID of the active state, the residue that was mutated, the reference(s) where the effect(s) of the mutation is (are) described, and, in the final column, details of the experiment(s) in which the mutation was characterized. In the final column, the point mutation(s) is (are) given first, followed by a brief synopsis of the experimental results. Abbreviations used: wt = wild type; coef. = coefficient; repr. = repression.
(0.11 MB RTF)
Independent data set. Given are the protein name, the PDB ID of the inactive state, the PDB ID of the active state, the residue that was mutated, the reference(s) where the effect(s) of the mutation is (are) described, and, in the final column, details of the experiment(s) in which the mutation was characterized. In the final column, the point mutation(s) is (are) given first, followed by a brief synopsis of the experimental results, except for lac repressor, where at least 12 amino acid substitutions were made for each residue (the reader may refer to Markiewicz et al. [T44] and Suckow et al. [T45] for details). Abbreviations used: wt = wild type; coef. = coefficient; repr. = repression; IS = not responsive to inducer (allolactose or isopropyl-β-D-thiogalactoside); I- = abolished DNA binding or misfolded.
(0.09 MB RTF)
Classification of residues in the independent data set according to the voting scheme of the top 9 highest-precision Hybrid Feature Set models that was used in
(0.16 MB RTF)
Classification of residues whose mutation caused the IS phenotype for at least one amino-acid substitution. The voting scheme of the top 9 highest-precision Hybrid Feature Set models that was used in Structural Analysis of Predicted Hotspots was used for this classification. The numbers in the columns to the right of the residue index give the number of models, out of nine, that predicted a hotspot for each residue.
(0.07 MB RTF)
Average values of features of interest for hotspots and non-hotspots, along with the p-value (unpaired Student's T-test) signifying the statistical significance of the difference in the average value of each feature between hotspots and non-hotspots. Values with a strongly statistically significant difference (p<0.05) between the two classes are indicated by ** and in bold, and those with a moderate statistical significance are indicated by * and in bold italic. For Feature Set 1, dotted lines separate features that are based on dynamic structural features, local contact geometry, network-based features and conservation.
(0.02 MB RTF)
We would like to thank David Page, Steven Darnell, Qiang Cui, and Jeffrey Gray for helpful discussions. We also thank the Ranganathan lab for providing us with the code for running SCA and Ryan Bannen for providing code for calculating local structural entropy. Finally, we thank Madeline Fisher for her guidance on the paper revisions.