Conceived and designed the experiments: WT MJO RJW. Performed the experiments: WT YW LFM. Analyzed the data: WT YW LFM MJO RJW. Contributed reagents/materials/analysis tools: MJO RJW. Wrote the paper: WT MJO RJW.
The authors have declared that no competing interests exist.
A new monotonicity-constrained maximum likelihood approach, called Partial Order Optimum Likelihood (POOL), is presented and applied to the problem of functional site prediction in protein 3D structures, an important current challenge in genomics. The input consists of electrostatic and geometric properties derived from the 3D structure of the query protein alone. Sequence-based conservation information, where available, may also be incorporated. Electrostatics features from THEMATICS are combined with multidimensional isotonic regression to form maximum likelihood estimates of probabilities that specific residues belong to an active site. This allows likelihood ranking of all ionizable residues in a given protein based on THEMATICS features. The corresponding ROC curves and statistical significance tests demonstrate that this method outperforms prior THEMATICS-based methods, which in turn have been shown previously to outperform other 3D-structure-based methods for identifying active site residues. Then it is shown that the addition of one simple geometric property, the size rank of the cleft in which a given residue is contained, yields improved performance. Extension of the method to include predictions of non-ionizable residues is achieved through the introduction of environment variables. This extension results in even better performance than THEMATICS alone and constitutes to date the best functional site predictor based on 3D structure only, achieving nearly the same level of performance as methods that use both 3D structure and sequence alignment data. Finally, the method also easily incorporates such sequence alignment data, and when this information is included, the resulting method is shown to outperform the best current methods using any combination of sequence alignments and 3D structures. Included is an analysis demonstrating that when THEMATICS features, cleft size rank, and alignment-based conservation scores are used individually or in combination THEMATICS features represent the single most important component of such classifiers.
Genome sequencing has revealed the codes for thousands of previously unknown proteins for humans and for hundreds of other species. Many of these proteins are of unknown or unclear function. The information contained in the genome sequences holds tremendous potential benefit to humankind, including new approaches to the diagnosis and treatment of disease. In order to realize these benefits, a key step is to understand the functions of the proteins for which these genes hold the code. A first step in understanding the function of a protein is to identify the functional site, the local area on the surface of a protein where it affects its functional activity. This paper reports on a new computational methodology to predict protein functional sites from protein 3D structures. A new machine learning approach called Partial Order Optimum Likelihood (POOL) is introduced here. It is shown that POOL outperforms previous methods for the prediction of protein functional sites from 3D structures.
Development of function prediction capabilities is a major challenge in genomics. Structural genomics projects are determining the 3D structures of expressed proteins on a high throughput basis. However, the determination of function from 3D structure has proved to be a challenging task; the functions of most of these structural genomics proteins remain unknown. Computationally based predictive methods can help to guide and accelerate functional annotation. The first step toward the prediction of the function of a protein from its 3D structure is to determine its local site of interaction where catalysis and/or ligand recognition occurs. Such capabilities have many important practical implications for biology and medicine.
We have reported on THEMATICS
THEMATICS utilizes only the 3D structure of the query protein as input; neither sequence alignments nor structural comparisons are used. Recently it was shown
The purpose of the present paper is five-fold: (1) We present a monotonicity-constrained maximum likelihood approach, called Partial Order Optimum Likelihood (POOL), to improve performance and expand the capabilities of active site prediction. (2) Then it is shown that POOL, with THEMATICS input data alone, outperforms previous statistical
In prior implementations of THEMATICS for the identification of active-site residues from the 3D structure of the query protein
In this study, we use the same features μ3 and μ4 used in the Ko
Geometric features, such as the relative sizes of the clefts on the surface of the protein structure, have been shown to correlate with active site location
The monotonicity-constrained maximum-likelihood approach underlying the POOL method described below is built on certain assumptions relating features used for classification to the probability that an instance having those features belongs to the positive class. Here we describe in detail the form these assumptions take when relating the THEMATICS features listed above to the probability that a residue is an active-site residue. Later we will also note that similar assumptions are reasonable when considering cleft rank and sequence conservation scores and apply them to those features as well. These THEMATICS feature-based monotonicity assumptions are as follows:
Given two ionizable residues in a single protein, the one having the more perturbed titration curve is more likely to be an active-site residue, all other things being equal.
Given two residues in a single protein, the one having a greater degree of overall titration curve perturbation among the ionizable residues in its spatial vicinity is more likely to be an active-site residue, all other things being equal.
More precisely, for the first assumption, we treat μ3 and μ4 as measures of degree of perturbation, and for the second we treat μ3env and μ4env as measures of overall perturbation within the spatial vicinity. These assumptions then become: Given two residues in the same protein, let their corresponding 4-dimensional feature vectors be
Finally, there is one additional subtlety that all implementations of THEMATICS have had to address, and the current approach is no exception: the need for some kind of normalization across proteins. In Ko's approach
Note that this rank normalization transformation does not affect the within-protein partial order used in the assumptions. That is,
After the normalization is performed, the labeled dataset may be regarded as a collection of
This is a convex optimization problem with linear constraints. We have shown
This latter optimization problem is a special case of the general
The kth constraint in the primal problem is active iff λk>0. Furthermore, by rescaling coordinates in the primal problem, its contours become circular and the negative gradient at any point points toward a single point, the unconstrained optimum. (Thus another formulation of the primal problem is to find the point in the feasible region closest to the unconstrained optimum.) As a consequence, the active set so determined at any feasible point is exactly the same as the active set at the solution point. But the active set simply represents equivalence classes of data points for which equality of the estimates must hold. Since equality-constrained maximum-likelihood estimates have the form (number of positives)/(total number of points), identifying which constraints are active at the solution leads immediately to the solution itself. Full details of this algorithm as well as the proof that the minimum sum-of-squared-errors solution is also the maximum-likelihood solution can be found in Tong's dissertation
We call our algorithm for solving this maximum-likelihood problem the POOL algorithm. POOL is both an acronym for Partial Order Optimal Likelihood as well as an accurate characterization of the way the method first identifies the active constraints and then simply combines the corresponding data values into “pools” to be assigned probability estimates according to the proportion of positives in that pool.
The use of the POOL method with this 4-dimensional THEMATICS feature vector is denoted POOL(T4) in the
Previous studies have shown that active site residues tend to be located in one of the largest clefts in a protein structure
An interesting alternative to simply concatenating all features into a single vector and applying a single classifier or probability estimator to such vectors is to compute two separate probability estimates and then combine them. Consider the general problem of estimating the class probability P(c|
This then gives another approach, which we have dubbed
Non-ionizable residues do not have titration curves and thus THEMATICS does not predict them directly. Nevertheless, the non-ionizable residues in interaction sites tend to have ionizable residues in their immediate vicinity and these ionizable residues generally have perturbed titration curves
Note that every non-ionizable residue has the environment features μ3env and μ4env; these serve as measures of the overall amount of titration curve perturbation in their spatial neighborhood. Thus we posit an extension to the THEMATICS monotonicity assumptions, namely: All other things being equal, a non-ionizable residue having more titration curve perturbation in its neighborhood is more likely to be an active-site residue. Thus we can apply the POOL method to non-ionizable residues separately by applying coordinate-wise monotonicity constraints to the probability estimates for the 2-dimensional feature vectors
Furthermore, we have the same options described above for incorporating cleft or other information for these non-ionizable residues. Finally, for any given protein, we can start with separate ordered lists of probability estimates for the ionizables and the non-ionizables, however computed, and then merge these into a single ordered list. This list then gives an estimated probability, and hence a ranking, for all residues.
Yet another feature that is generally taken to be predictive of functional activity in a monotonic fashion is the extent to which a given residue is found to be conserved across sequence homologues: The more conserved the residue, the more likely that residue is to be functionally important in the protein. Here we also examine combining such conservation information with THEMATICS and cleft size information. In particular, we use ConSurf
In the
As described in more detail in the
The results presented here are based on several standard measures of performance. For a standard classification problem, performance is typically measured by
Since our method outputs a ranked list (actually a list of probabilities) for all residues within a given protein and not a binary classification, considering the ROC curves is an especially useful way to characterize the behavior of
One disadvantage of using ROC curves alone, however, is that unless the curve for one method dominates (i.e., lies completely above and to the left of) that of another, there may be no simple metric to compare these two methods. For this reason, a single number that is sometimes used as a reliable measure for comparing systems in the machine learning literature is the
In order to generate ROC curves, we need to be able to calculate recall and false-positive rate values, which come from classification problems. In the POOL method, the result for each protein is a ranked list based on the probability of a residue being in the active site. A natural way to draw a ROC curve for every protein is to move the cutoff one residue at a time from the top to the bottom of the list. The resulting ROC curve has a staircase shape: only recall increases when an active site residue is encountered and only false positive rate increases when a non-active-site residue is encountered.
We define
To visually compare the performance from different methods, we also generate the
From these average ROC curves we can get a strong sense of the apparent relative performance of different systems, but it is also important to be able to verify that such apparent differences are in fact statistically significant. To test the significance of the observed differences, we also perform the Wilcoxon signed-rank test
Here we evaluate the ability of POOL with the four THEMATICS features, denoted
Shown in the plot are the averaged ROC curves, recall as a function of false positive rate, for POOL(T4) (solid curve) and Wei's statistical analysis (dashed curve) along with Tong's SVM (point X). Predictions all use THEMATICS features on ionizable residues only; performance is measured using annotated active site ionizable residues. POOL(T4) outperforms both the SVM and Wei's method.
Averaged ROC curves comparing different methods of predicting ionizable active site residues using a combination of THEMATICS and geometric features of ionizable residues only. The POOL(T4)xPOOL(G) method using chaining to combine both THEMATICS features and geometric information (dashed curve) performs better than POOL with THEMATICS features alone (solid curve), POOL on a 5D concatenated feature space (□), and an SVM on a 5D feature space (triangles).
Next we evaluate three different ways of combining THEMATICS features with cleft size information.
To compare the averaged ROC curves from
Method | SVM(T4,G) | POOL(T4,G) | POOL(T4) |
POOL(T4)xPOOL(G) | <0.0001 (53) | <0.0001 (59) | <0.0001 (46) |
POOL(T4) | 0.0002 (40) | 0.0006 (41) | |
POOL(T4,G) | 0.038 (37) |
The first number in each cell is the Wilcoxon p value, the probability that the method in the corresponding row does not outperform the method in the corresponding column. The number in parentheses is the number of proteins out of 64 for which the method in the row outperforms the method in the column.
The better performance of this chained method POOL(T4)xPOOL(G) over POOL(T4) alone is consistent throughout the ROC curve. For recall rates greater than 0.50, the recall for the chained method is better than that of POOL(T4) by roughly 10% for a given FPR. This qualitative trend is apparent from visual inspection of the ranked lists from the two methods. For a typical protein, these two ranked lists tend to be very similar, with annotated positive residues generally ranking a little higher, on average, in the list resulting from chaining.
We believe that the observation that chaining the two four- and one- dimensional estimators gives better results than applying POOL directly to the single, five-dimensional concatenated feature vector is probably an overfitting issue. There may be too much flexibility when POOL is used with a high-dimensional input space, especially when the data are sparse.
So far only predictions for ionizable residues have been described. The THEMATICS environment variables are now used to incorporate predictions for non-ionizable residues in the active site.
Recall rate for all annotated active site residues is plotted as a function of the false positive rate for all residues in the 64 protein test set.
The
This shows that this extension of the POOL method to non-ionizable residues gives a satisfactory result. From now on, all residues are included in the study and we further simplify our feature set naming convention to use
So far we have only considered 3D-structure-based active-site residue prediction. This is important because such methods are applicable to cases where sequence-based methods may not apply. For many structural genomics proteins, the number of homologues is too small to obtain meaningful sequence-based conservation information. Nevertheless, since it is generally true that most active site residues tend to be more conserved than other residues, it is obviously valuable to be able to include sequence conservation information when it is available. Here we examine to what extent adding sequence comparison information can improve active-site residue prediction within the POOL framework.
The method using chaining to combine THEMATICS, geometric and sequence conservation features has the best performance.
As pointed out earlier, not all proteins have enough homologues to perform reliable sequence conservation analysis. In this study, ConSurf was used to do the sequence analysis. However we only used this ConSurf result for proteins having more than 10 homologues. For those not having enough homologues (28 out of 160 proteins in the test set), a common nonzero value was assigned as the active-site probability estimate based on that feature alone. This has the same effect as ignoring this feature for these cases.
The
Method | POOL(T) | POOL(T)xPOOL(G) | POOL(T)xPOOL(C) |
POOL(T)xPOOL(G)xPOOL(C) | <0.0001 (115) | <0.0001 (95) | <0.0001 (103) |
POOL(T)xPOOL(C) | <0.0001 (101) | 0.0008 (89) | |
POOL(T)xPOOL(G) | <0.0001 (101) |
Numbers in parentheses give the actual number of proteins out of 160 for which the method in that row outperforms the method in that column in the AUC measure.
The results reported so far are all in the form of
Note that neither axis of a ROC curve involves a directly user-controllable parameter. Neither recall nor false positive rate is under the direct control of a user who does not already know the correct classifications. Assuming the user wishes to select the highest-ranking values in the list, down to a certain fixed proportion, a more useful curve would be a
Here results are compared for our best structure-only method, POOL(T)xPOOL(G), and for our best structure-plus-sequence method, POOL(T)xPOOL(G)xPOOL(C), with the results from some other top performing active site prediction methods, particularly, Petrova's method
The authors of the three methods report their performance results using a variety of different measures, often different from what we have reported here. Therefore we simply compute corresponding results, using their form of analysis, for POOL(T)xPOOL(G) and POOL(T)xPOOL(G)xPOOL(C) on our 160 protein test set and compare our numbers with theirs. Because the performance measures are not achieved from the same dataset, results are not strictly comparable, but qualitatively, we believe the comparisons below give a good idea of the relative performance.
In order to compare our results with theirs at a similar recall level, we used a 4% filtration ratio cutoff in the POOL method to compare with Youn's method, and a variable filtration ratio cutoff to compare with Petrova's method. Note that while our test set consists of proteins with a wide variety of different folds and functions, Youn's results are reported for sets of proteins with common fold or with similar structure and function. Performance on the more varied set is a much more realistic test of predictive capability on proteins of unknown function, particularly novel folds. Performance on a set of structurally or functionally related proteins is also substantially better than performance on a diverse set, as one would expect and as has been demonstrated by Petrova and Wu
Youn's method
Method/Dataset | Sensitivity (%) | Precision (%) | AUC |
Youn/Family | 57.02 | 18.51 | 0.9290 |
Youn/Superfamily | 53.93 | 16.90 | 0.9135 |
Youn/Fold | 51.11 | 17.13 | 0.9144 |
POOL(T)xPOOL(G)xPOOL(C)/all protein | 64.68 | 19.07 | 0.925 |
POOL(T)xPOOL(G)/all protein | 61.74 | 18.06 | 0.907 |
Petrova and Wu
POOL results are for a 72 protein test set.
POOL with 3D structure input information only, employed as POOL(T)xPOOL(G), predicts active site residues without any sequence alignment information and performs nearly as well as the very best methods to date that do use sequence alignment information.
In the method of Xie and Bourne
Method | Recall ≥ | False Positive Rate < | Achieved For |
Xie | 50% | 20% | 85% |
POOL(T)xPOOL(G)xPOOL(C) | 50% | 20% | 97% |
POOL(T)xPOOL(G)xPOOL(C) | 80% | 20% | 84% |
POOL(T)xPOOL(G)xPOOL(C) | 60% | 10% | 85% |
POOL(T)xPOOL(G) | 50% | 20% | 96% |
POOL(T)xPOOL(G) | 80% | 20% | 77% |
POOL(T)xPOOL(G) | 60% | 10% | 81% |
Each method achieves at least the specified recall rate with a false positive rate less than specified for the percentage of proteins in the last column.
The results in the tables clearly show that POOL(T)xPOOL(G), which only uses 3D structural information of proteins, achieves about as good or even better performance than that of these best performing current active site prediction methods. When additional sequence conservation information is available, still better performance is achieveable with POOL(T)xPOOL(G)xPOOL(C).
Another interesting result of our approach is one that is only obtainable from methods that generate a ranked list: the rank of the first annotated true positive in the list. This metric is useful for users who are interested in finding a few of the active site residue candidates and who do not necessarily need to know all of the active site residues. For instance, users could use the list from the POOL method to guide their site directed mutagenesis experiments by going down the ranked list one by one. A histogram giving the rank of the first active site residue found by POOL(T)xPOOL(G)xPOOL(C) on the 160 protein set is shown in
Top: Percentage of all proteins with specified rank of the first annotated active site residue in the ordered list from POOL(T)xPOOL(G)xPOOL(C) on the 160 protein set. Bottom: Cumulative distribution of the first annotated active site residue in the ranked list from POOL(T)xPOOL(G)xPOOL(C) on the 160 protein set.
To identify the proteins for which POOL performs poorly, we shall set a filtration ratio cutoff of 8.0% for this purpose and use the CSA-annotated residues as the reference. The top 8% of POOL-ranked residues contain one or more CSA-annotated residues for 156 (97.5%) of the 160 proteins in the test set. It is useful to examine the four other cases where the CSA-annotated active residues rank low. These are considered failure cases and consist of: Phenol hydroxylase from
Phenol hydroxylase (ΨOH) uses the cofactor flavin adenine dinucleotide (FAD) to hydroxylate phenols
The structure of Acylphosphatase contains only 98 residues and two sulfate ions; presumably the sulfate ions indicate phosphate binding sites. Neither POOL nor the statistical version of THEMATICS is able to identify the two CSA annotated residues R23 and N41. However Wei's statistical method does correctly identify two sulfate-contact residues, K32 and H60. These two residues both have low POOL rankings. Adenine-N6-DNA-methyltransferase and Serine carboxypeptidase II both have a relatively large number of residues involved in binding and recognition; POOL returns low rankings for the annotated residues. We note that POOL does well for other cases with relatively large numbers of residues involved in the site of interaction. Indeed at this time no pattern is discernable that distinguishes the small set of failure cases from the large group of successful cases for POOL.
Looking at
This demonstrates the relative contributions made to the combined system by these different features. Clearly there is a domination order: THEMATICS alone, then the combination of conservation and geometric information, then conservation alone, then geometric alone.
This bears out some widely recognized observations: that cleft size information is useful, but suffers from an inordinately large false positive rate and that conservation information is more useful
In this paper, we presented the application of the POOL method using THEMATICS plus some other features for protein active site prediction.
We started with the application of the POOL method just on THEMATICS features, with features similar to those used before in the SVM method
We also tested different ways of incorporating additional features into the learning system. Not surprisingly, the results show that in order to improve performance, we have to incorporate the right features in the right way. Even with features that were found to be helpful in improving the performance, how they are incorporated matters. The data show that chaining the results from separate POOL estimates is better than simply combining all the available features into a big POOL estimator over a higher-dimensional feature space. As mentioned earlier, the reason behind this might be overfitting, since combining features into a POOL table with high dimension causes the number of probabilities needed for estimation to grow exponentially, while the training data can only increase linearly in most cases. In other words, the high dimensionality makes the table too sparse and less accurate for probability estimates.
We also extended the application of THEMATICS to all residues, not just ionizable residues, in a natural way and showed that it is effective. Although the performance for non-ionizable residues is not as good as the performance for ionizable ones, this extension does provide a way to combine features from THEMATICS, which by itself can only be applied to ionizable residues directly, with some other features. The inclusion of the non-ionizable residues results in better overall performance and also makes performance comparison with other methods more accurate and fair.
The incorporation of sequence conservation information does improve the predictions when there are enough homologues with appropriate diversity. The POOL method gives us a means for easily utilizing this information when it is available, while not affecting the training and classification when it is not.
When comparing with other methods, especially if the other methods use binary classification instead of a ranked list, we have to commit to a specific cutoff value and turn our system into a binary classification system. The results in this paper clearly show that the POOL method using THEMATICS and geometric features achieves equivalent or better performance than the other methods in comparison, even in cases where their methods are tested on very special groups of proteins. This makes this method more widely applicable to proteins with few or no sequence homologues, such as some Structural Genomics proteins, than other methods that use sequence alignments from homologues. Performances of the previous best methods, those of Youn and of Petrova, will degrade significantly when sequence conservation information is not available. However with THEMATICS data the approach developed here is still robust in the absence of sequence conservation information. In effect, for those proteins having an insufficient number of sequence homologues, the POOL(T)xPOOL(G)xPOOL(C) method reduces to the still highly effective structure-only POOL(T)xPOOL(G) method.
Interestingly enough, when comparing the performance of POOL(T)xPOOL(G) and POOL(T)xPOOL(G)xPOOL(C) in
When looking at the recall and false positive rates of the results from all the protein active site prediction methods, one must keep in mind that the annotation of the catalytic residues in the protein dataset is never perfect. Since most of the labeling comes from experimental evidence, some active site residues are not labeled as positive simply because no experiment was ever carried out to verify the role of that specific residue. Since we have used the CatRes/CSA annotations as the sole criteria to evaluate the performance in order to keep the comparisons consistent, the reported false positive rate is probably higher than in reality. There is evidence available to support the functional importance of some residues that are not labeled as active in the CatRes/CSA database
Although we evaluated the POOL method performance using filtration ratio values as a cutoff, it is just for the purpose of comparing with other protein active site prediction methods that use a binary classification scheme. The ranked list of residues based on their probability of being in the active site contains much more information than traditional binary classification. The rank of the first annotated positive residue analysis in this paper shows just one application of the extra information contained in a ranked list rather than a traditional binary label. There are many possible measurements of performance depending on the actual application by users, and in turn many possible applications that can benefit from using a ranked list form. It is noteworthy that P-Cats
The POOL approach is amenable to the addition of other properties for the prediction of active sites
In conclusion, we have established that applying the POOL method, with THEMATICS and other features, appears to yield the best protein active site prediction system yet found and it provides more information than other active site prediction methods.
The three-dimensional coordinate files for the protein structures used for training and testing were downloaded from the Protein Data Bank (
For the
For the
The results reported here are based on eight-fold cross-validation on a set of 64 proteins or 10-fold cross-validation on a set of 160 proteins, both taken from the Catalytic Site Atlas (CSA) database
For the eight-fold cross-validation procedure, the 64-protein set was randomly divided into eight folds of eight proteins each, with seven of the eight folds (56 proteins) used for training and the remaining fold (8 proteins) used for testing. This was repeated eight times, once for each of the eight folds. Likewise, for the ten-fold cross-validation procedure, the 160-protein set was randomly divided into ten folds of sixteen proteins each, with nine of these (144 proteins) used for training the remaining fold (16 proteins) used for testing, and this was repeated a total of ten times, once for each fold.
Training was performed by applying the POOL method to obtain a function
An additional detail is that for training we quantize the multi-dimensional data points. For example, for POOL(T4), each rank-normalized feature falls into one of 20 bins whose sizes vary depending on their distance from 0.0. In particular, the lowest ranked bins cover the half-open intervals, [0.0, 0. 2), [0. 2, 0.4), [0.4, 0.6), [0.6, 0.7), and there are 16 more bins of width 0.02 above that, with one special bin for 1.0. Thus the lowest-ranking data are quantized more coarsely than the remaining data. This is appropriate since these data tend to have very low average probability of being in the active site anyway, because the vast majority of residues are negatives. Thus the inability to make fine distinctions among these low-probability candidates does not degrade the overall quality of the results. It does, however, improve the efficiency of the training procedure significantly, so this is an important component of the analysis. This is especially helpful in the 10-fold cross-validation on the 160-protein set. The typical training set of 144 proteins from this set contains about 14500 ionizable residues, which fall into more than 6000 quantized bins in the 4-dimensional space used for POOL(T4). The number of corresponding inequality constraints is about 35,000–40,000.
One final detail is that the probability estimates generated by the POOL method as described here tend to have numerous ties as well as some places where there is no well-defined value. The latter places occur because the method only assigns values to existing data points (or bins containing data in the case of our use of quantization). The locally constant regions occur both because of the quantization applied to the training data at the outset and because the data pools created by the algorithm acquire a single value. In cells where no value is defined, the interpolation scheme used is to simply assign a value linearly interpolated based on the Manhattan distance between the least upper bound and the greatest lower bound for that cell based on the monotonicity constraint. Finally, since both the data pooling performed by the algorithm and this interpolation scheme tend to lead to ties, the Manhattan distance from the origin of the four THEMATICS features is used as a tie-breaker for any residues whose probability estimates are identical. This simply imposes a slight bias toward strict monotonicity even though the mathematical formulation used to determine these probabilities is based on a non-strict monotonicity assumption, making it possible to obtain well-defined rankings for all the residues in a protein.
The 64 protein test set used for POOL
(0.08 MB DOC)
The 160 protein test set used for POOL
(0.18 MB DOC)