Conceived and designed the experiments: NN MI YS. Performed the experiments: NN TS YM KT HK. Analyzed the data: NN. Wrote the paper: NN YS.
The authors have declared that no competing interests exist.
Predictions of interactions between target proteins and potential leads are of great benefit in the drug discovery process. We present a comprehensively applicable statistical prediction method for interactions between any proteins and chemical compounds, which requires only protein sequence data and chemical structure data and utilizes the statistical learning method of support vector machines. In order to realize reasonable comprehensive predictions which can involve many false positives, we propose two approaches for reduction of false positives: (i) efficient use of multiple statistical prediction models in the framework of two-layer SVM and (ii) reasonable design of the negative data to construct statistical prediction models. In two-layer SVM, outputs produced by the first-layer SVM models, which are constructed with different negative samples and reflect different aspects of classifications, are utilized as inputs to the second-layer SVM. In order to design negative data which produce fewer false positive predictions, we iteratively construct SVM models or classification boundaries from positive and tentative negative samples and select additional negative sample candidates according to pre-determined rules. Moreover, in order to fully utilize the advantages of statistical learning methods, we propose a strategy to effectively feed experimental results back into computational predictions with consideration of biological effects of interest. We show the usefulness of our approach in predicting potential ligands binding to human androgen receptors from more than 19 million chemical compounds and verifying these predictions by in vitro binding. Moreover, we utilize this experimental validation as feedback to enhance subsequent computational predictions, and experimentally validate these predictions again. This efficient procedure of the iteration of the
This work describes a statistical method that identifies chemical compounds binding to a target protein given the sequence of the target, or distinguishes proteins to which a small molecule binds given the chemical structure of the molecule. As our method can be utilized for virtual screening that seeks lead compounds in drug discovery, we showed its usefulness by applying it to the comprehensive prediction of ligands binding to human androgen receptors and by verifying its predictions experimentally in vitro. In contrast to most previous virtual screening studies, which predict chemical compounds of interest mainly with 3D structure-based methods and experimentally verify them, we proposed a strategy to effectively feed experimental results back into subsequent predictions and applied the strategy to the second predictions followed by the second experimental verification. This feedback strategy makes full use of statistical learning methods and, in practical terms, gave a ligand candidate of interest that structurally differs from known drugs. We hope that this paper will encourage reevaluation of statistical learning methods in virtual screening and that the utilization of statistical methods with efficient feedback strategies will contribute to the acceleration of drug discovery.
In the early stages of the drug discovery process, prediction of the binding of a chemical compound to a specific protein can be of great benefit in the identification of lead compounds (candidates for a new drug). Moreover, the effective screening of potential drug candidates at an early stage generates large cost savings at a later stage of the overall drug discovery process.
In the field of virtual screening for drug discovery, docking analyses and molecular dynamics simulations have been the principal methods used for elucidating the interactions between proteins and small molecules
To achieve more comprehensive and faster protein-chemical interaction predictions in the post-genome era producing a vast number of protein sequences whose structural information is not available, it is essential to be able to utilize more readily available biological data and more generally applicable methods which do not require 3D structural data
Although the method yielded a relatively high prediction performance (more than 80% accuracy) in cross-validation and usefulness in the comprehensive prediction of target proteins for a given chemical compound with tens of thousands of prediction targets
In this paper, we describe two strategies, namely two-layer SVM and reasonable negative data design, which are used to reduce the number of false positives and to improve the applicability of our method to comprehensive prediction. In the two-layer SVM, outputs produced by the first-layer SVM models are utilized as inputs to the second-layer SVM. In order to design negative data which produce fewer false positives, we iteratively constructed SVM models or classification boundaries and selected negative sample candidates according to pre-determined rules. By using these two strategies, the number of predicted candidates was reduced to around 100 (
dataset | neg. | 1sts | P10275 | P11229 | P35367 | rec0.5 (%) | rec0.95 (%) | evaluation |
(A) | | | | | | | | |
 | 16 | – | 714 | 1408 | 1187 | 100 | 98.97 | 82.50 |
 | 16 | – | 1869.3(±136.1) | 10503.3(±1250.7) | 9305.3(±517.8) | 100 | 99.66(±1.09) | 69.45(±0.32) |
(B) | | | | | | | | |
 | 14 | 10 | 177 | 535 | 451 | 96.91 | 93.81 | 75.56 |
 | 14 | 10 | 848.3(±345.0) | 1531.7(±628.9) | 988.0(±411.4) | 96.56(±2.89) | 81.10(±19.44) | 66.44(±7.82) |
(C) | | | | | | | | |
 | 16 | 9 | 28 | 231 | 129 | 100 | 97.94 | 82.92 |
 | 16 | 9 | 74.7(±42.6) | 255.3(±32.2) | 146.7(±8.3) | 100 | 100 | 80.67(±0.93) |
(D) | | | | | | | | |
– | – | – | 640 | 1791 | 838 | 86.60 | 71.13 | 59.66 |
(E) | | | | | | | | |
– | – | – | 1869 | 1816 | 1580 | – | – | – |
(A) One-layer SVM. (B) Two-layer SVM with the first-layer SVM models based on the
SVM model which only classifies chemical compounds (not pairs) according to the binding property to the target proteins. Chemical compounds binding to each target protein were treated as positives, and all other compounds in the DrugBank dataset were regarded as negatives.
A chemical compound
refers to negative data expansion rules (details are provided in
the number of negatives ( = 1,750×
the number of first-layer SVM models utilized to construct the second-layer SVM model.
target proteins whose ligands were predicted from 109,841 compounds. The number of predicted ligands is shown.
rec
With the aim of validating the usefulness of our method, our proposed prediction model with fewer false positives was applied to the PubChem Compound database in order to predict the potential ligands for the “androgen receptor”, which is one of the genes responsible for prostate cancer. We verified some of these predictions by measuring the IC50 values in an in vitro assay.
Biological experiments, conducted to verify the computational predictions based on statistical methods, docking methods or molecular dynamics methods, typically involve success as well as failure. In addition to fast calculation and wide applicability, one of the merits of using statistical methods that involve training with known data is that results obtained by verification experiments can be efficiently utilized as feedback to produce new and more reliable predictions. Most previous work on virtual screening has focused on the computational prediction and listing of dozens or hundreds of candidates, followed by their experimental verification. However, only on rare occasions have these experimental results been utilized for the further improvement of computational predictions and experiments. Moreover, even without verification experiments, additional data acquired from, for example, relevant literature can be used for enhancing the prediction reliability.
Therefore, we propose a strategy based on the effective combination of computational prediction and experimental verification. Our second computational prediction utilizing feedback from the first experimental verification successfully discovered novel ligands (
(A) 500 predictions without feedback data. (B) 527 predictions with feedback from the first experimental verification. (C) 213 predictions based on the feedback strategy where pairs of chemical compounds with steroid structures and the androgen receptor were regarded as negatives.
(A) Results of the first in vitro binding assay. (B) Results of the second in vitro binding assay. (C) The chemical space based on E-Dragon
In the following section, we first describe the real application of our method involving the computational prediction, the experimental verification and the feedback, and then explain the computational experiments conducted to verify the usefulness of our computational prediction method in comprehensive prediction.
We set the human androgen receptor (AR) as the target protein, whose binding ligands were predicted by using the PubChem database. Here, AR is a steroid hormone receptor and a transcription factor belonging to the nuclear receptor superfamily. In pathology, AR is one of the genes responsible for prostate cancer, which is the most frequently diagnosed cancer in men in the United States according to the American Cancer Society Statistics for 2008. The two-layer SVM model with an additional model for the androgen receptor, which constitutes a prediction model trained on the basis of supplementary information obtained from the relevant literature or databases as well as feedback from experimental verifications, was applied to the screening for human androgen receptor binding ligands from 19,171,127 chemical compounds in the PubChem Compound database. As a result, 500 chemical compounds (compounds with the same connectivity were counted only once) were predicted (
Out of 500 computationally predicted candidates, an in vitro binding assay was applied to 18 purchasable chemical compounds (details are provided in
For the 12 predictions other than the 6 known ligands, applying a threshold of IC50 = 100 µM, which was chosen because the IC50 of flutamide was more than 50 µM, gave a precision of 67% (4/6) and an accuracy of 67% (8/12) (
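The thresholding step described above can be sketched as follows. This is a minimal illustration only: the compound names, IC50 values and prediction flags are invented, and the convention that a compound counts as a binder when its IC50 is strictly below the threshold is an assumption, not something stated in the text.

```python
THRESHOLD_UM = 100.0  # IC50 threshold from the text, in µM


def is_binder(ic50_um, threshold_um=THRESHOLD_UM):
    """Assumption: a compound counts as a binder when its IC50 is below the threshold.

    None stands for "50% displacement was never reached in the assay".
    """
    return ic50_um is not None and ic50_um < threshold_um


def precision_and_accuracy(results, threshold_um=THRESHOLD_UM):
    """results: iterable of (name, ic50_um or None, predicted_as_binder)."""
    tp = fp = tn = fn = 0
    for _name, ic50_um, predicted in results:
        observed = is_binder(ic50_um, threshold_um)
        if predicted and observed:
            tp += 1
        elif predicted:
            fp += 1
        elif observed:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    accuracy = (tp + tn) / total if total else float("nan")
    return precision, accuracy


# Hypothetical measurements, invented purely for illustration.
EXAMPLE_RESULTS = [
    ("compound_a", 5.0, True),     # binds, predicted binder       -> true positive
    ("compound_b", 80.0, True),    # binds, predicted binder       -> true positive
    ("compound_c", None, True),    # no binding, predicted binder  -> false positive
    ("compound_d", 250.0, False),  # weak, predicted non-binder    -> true negative
]

precision, accuracy = precision_and_accuracy(EXAMPLE_RESULTS)
```

With these made-up values, two of the three predicted binders pass the threshold and three of the four calls are correct.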
By utilizing the results of the first experimental verification, the prediction model was reconstructed. Although the first computational prediction and experimental verification involved many compounds with steroid skeletons, binding of steroid-like compounds to the androgen receptor, which is a steroid-hormone receptor, is relatively obvious. Moreover, since steroid-like compounds are expected to act as agonists of the androgen receptor, antagonists are given preference in terms of search for chemical compounds with potential therapeutic effects for human prostate cancer, which involves activation of the androgen receptor. Thus, the prediction model in which pairs of the androgen receptor and steroid-like chemical compounds were regarded as negatives was also constructed in order to search for antagonists of the androgen receptor. The prediction coverage of these two models (
Among the second predictions, experimental verification was performed with respect to 5 purchasable candidates, which were predicted with the two models reconstructed with feedback data and different strategies, as described in the previous section, and which were selected from predictions specific to each model, including 49 compounds marked as purchasable in ChemCupid in July 2008 (details are provided in
As shown in
The third computational prediction, which utilized the results of the second experimental verification, further extended the predictions (details are provided in
In bioinformatics, statistical approaches extract rules from numerical data corresponding to biological properties. Here, it is not guaranteed that the extracted rules are biologically valid, and furthermore it is possible to utilize statistical methods to obtain general rules from any kind of numerical data which are meaningless and irrelevant to biological properties. The biological relevance of our approach can be verified as follows on the basis of supporting evidence which indicates that our method can extract significant rules only if biologically valid and relevant data is given.
First, high prediction performances on diverse datasets might support the validity of our approach. In several datasets consisting of known pairs of proteins, including nuclear receptors, GPCRs, ion channels and enzymes, and drugs and random protein-drug pairs, our statistical approach with SVM showed high prediction performances (details are provided in
Second, we showed the biological relevance of these high prediction performances by calculating the prediction performances using biologically meaningless artificial datasets as positives. Several datasets which contained fractions of valid samples found in the DrugBank dataset, and which comprised artificial pseudo-positive samples of protein-chemical pairs produced by shuffling with the same frequency of chemical compounds and proteins as that in the DrugBank dataset, were generated. Our method was applied to these shuffled artificial datasets (
The average of 10 datasets produced by shuffling pairs corresponded to each content rate (e.g., 50%) of pairs comprising a protein and a chemical compound in the original dataset. A usual SVM training, which is referred to as the first-layer SVM in the
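One way to build such shuffled pseudo-positive datasets is sketched below. It is an assumption about the procedure rather than the authors' code: a fraction of the original pairs (the content rate) is kept unchanged, and the compounds of the remaining pairs are permuted among themselves, which leaves the frequency of every protein and every compound identical to that of the original dataset. The function names and the toy pairs are hypothetical.

```python
import random
from collections import Counter


def shuffle_pairs(pairs, content_rate, seed=0):
    """Return a dataset in which roughly `content_rate` of the pairs are original.

    The compounds of the remaining pairs are permuted, so each protein and each
    compound occurs exactly as often as in the input. A permuted pair may still
    coincide with a genuine pair by chance; this sketch does not filter those.
    """
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)                       # choose at random which pairs stay intact
    n_keep = int(content_rate * len(pairs))
    kept, rest = pairs[:n_keep], pairs[n_keep:]
    proteins = [protein for protein, _ in rest]
    compounds = [compound for _, compound in rest]
    rng.shuffle(compounds)                   # break the protein-compound association
    return kept + list(zip(proteins, compounds))


def marginal_counts(pairs):
    """Frequencies of proteins and of compounds, used to check the shuffle."""
    return Counter(p for p, _ in pairs), Counter(c for _, c in pairs)


# Toy pairs with made-up identifiers.
TOY_PAIRS = [("P1", "c1"), ("P1", "c2"), ("P2", "c3"),
             ("P3", "c1"), ("P3", "c4"), ("P4", "c5")]
half_shuffled = shuffle_pairs(TOY_PAIRS, content_rate=0.5, seed=1)
```

Repeating the call with different seeds would give the several datasets per content rate over which an average can be taken.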
As shown in
It is often observed that although statistical learning approaches achieve very high prediction performances in given datasets, statistical prediction models suffer from the problem of generating vast prediction sets including many false positives when applied to a huge dataset, such as the PubChem database. In our approach, SVM models based on feature vectors directly representing amino acid sequences, chemical structures, and random protein-compound pairs as negatives also produced many predictions and inevitably yielded many false positives (
Upon the introduction of the two-layer SVM and the negatives designed to overcome this drawback, the prediction precision, or the confidence of positive prediction, was significantly improved in computational experiments based on the DrugBank dataset (
Model type | prec. | sens. | acc. | prec. | sens. | acc. |
(A) | | | | | | |
one-layer(designed) | 71.76 | 42.99 | 95.11 | 64.66 | 50.59 | 95.00 |
one-layer(random) | 82.38(±0.64) | 38.22(±0.95) | 95.38(±0.06) | 40.68(±1.19) | 50.00(±1.87) | 92.02(±0.28) |
(B) | | | | | | |
 | 97.11 | 92.57 | 99.33 | 82.81 | 31.18 | 95.11 |
 | 95.66(±0.32) | 78.33(±1.60) | 98.33(±0.10) | 78.76(±2.86) | 25.59(±1.09) | 94.71(±0.09) |
 | - | - | - | 8.89 | 57.06 | 59.27 |
 | 95.98 | 93.21 | 99.29 | 75.81 | 27.65 | 94.73 |
 | 70.69 | 54.39 | 95.49 | 34.52 | 17.06 | 92.52 |
(C) | | | | | | |
 | 99.68 | 100.00 | 99.98 | 100.00 | 10.59 | 94.20 |
 | - | - | - | 90.70 | 22.94 | 94.85 |
one-layer( | - | - | - | 86.67 | 15.29 | 94.35 |
(A) Effect of rational negative design. (B) Effect of the second-layer SVM with designed negatives. (C) Improvement of precision with the two-layer SVM and the type of the first-layer SVM models.
“Model type” indicates whether the one-layer SVM model or the second-layer SVM model, which is specified by the type of the 11 first-layer SVM models, was utilized. Here,
• (designed) means that the rationally designed negatives were used to construct the SVM model.
• (random) means that three sets of 22,050 randomly chosen pairs of proteins and chemical compounds were used to construct the SVM model. The 95% confidence intervals are shown.
• (r.f.) means that twenty sets of 11 randomly chosen first-layer SVM models were used to construct the second-layer SVM model.
precision (prec.) =
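For reference, the abbreviations prec., sens. and acc. are assumed here to carry their standard meanings in terms of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The symbols below are the conventional ones and are not taken from the original footnotes.

```latex
\begin{align}
\text{precision (prec.)}   &= \frac{TP}{TP + FP} \\
\text{sensitivity (sens.)} &= \frac{TP}{TP + FN} \\
\text{accuracy (acc.)}     &= \frac{TP + TN}{TP + TN + FP + FN}
\end{align}
```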
Following these results on given datasets, our approaches were evaluated with respect to comprehensive binding ligand prediction. For three proteins (UniProt ID P10275 (androgen receptor), P11229 (muscarinic acetylcholine receptor M1) and P35367 (histamine H1 receptor)), their binding ligands were predicted from PubChem Compound 0000001–00125000 which contains 109,841 compounds (
As shown in
These results suggest that our prediction models select a reasonable number of ligand candidates from all chemical compounds in large databases and encourage the comprehensive binding ligand prediction for the target protein.
The experimental verification of the computational predictions produces feedback data or samples which are not included in the given training datasets. The efficient utilization of these data can contribute to the fast identification of compounds with the desired properties and can be of advantage to statistical learning approaches.
We compared several strategies for utilizing feedback data as follows. For three proteins (UniProt ID P10275 (androgen receptor), P11229 (muscarinic acetylcholine receptor M1) and P35367 (histamine H1 receptor)), ligand data which were not included in the DrugBank dataset were collected from relevant literature
As shown in
(A) 1: st1; a strategy where additional data, or pairs comprising a chemical compound and a protein, were simply added to the training samples in constructing a prediction model. st2; a strategy where additional data were first used for the construction of an additional first-layer SVM model and subsequently added to the training samples in the construction of a second-layer SVM model. 2: target proteins whose ligands were predicted from 109,841 compounds. The number of predicted ligands is shown. 3: one-layer SVM using the
With this efficient strategy for utilizing feedback data, computational prediction and experimental verification improve each other to enable faster search toward the identification of useful small molecules.
We proposed a comprehensively applicable computational method for predicting the interactions between proteins and chemical compounds, in which the number of false positives was reduced in comparison to other methods. Furthermore, we proposed a strategy for the efficient utilization of experimental feedback and the integration of computational prediction and experimental verification.
The application of our method to the androgen receptor resulted in 67% (4/6) prediction precision according to in vitro experimental verification in the first computational prediction and 60% (3/5) in the second prediction, which included the feedback of the first experimental verification. However, these relatively low precision values do not represent the true statistical significance of the method.
This 60–70% precision can also be evaluated by using the following
These prediction performances are as good as or better than several previous virtual screening studies based mainly on docking analyses
The blue circles denote known compounds and the red triangles denote other tested compounds.
From another perspective, the re-evaluation of statistical prediction approaches by using 23 chemical compounds experimentally verified in this study showed that our proposed methods, which utilized information on both protein sequences and chemical structures, were superior to a conventional LBVS (Ligand Based Virtual Screening) method where only structures of specific chemical compounds were considered (
(A) Evaluation by recall rate with 10,000 chemical compounds. Here, the recall rate at the rank
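Because the definition of the recall rate at a given rank is cut off above, the sketch below uses the usual reading: the fraction of known actives that appear among the top-k compounds of a ranked list. The function name and the toy identifiers are hypothetical, and the exact definition used for the figure may differ.

```python
def recall_at_rank(ranked_ids, active_ids, k):
    """Fraction of the known actives found within the first k ranked compounds."""
    actives = set(active_ids)
    if not actives:
        raise ValueError("at least one known active is required")
    found = sum(1 for compound_id in ranked_ids[:k] if compound_id in actives)
    return found / len(actives)


# Toy ranking, best-scored compound first; identifiers are made up.
RANKED = ["a", "x", "b", "y", "c"]
ACTIVES = {"a", "b", "c"}
recall_top3 = recall_at_rank(RANKED, ACTIVES, k=3)
```

Plotting this value for increasing k gives a recall curve that can be compared between prediction methods.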
Furthermore, the fact that the second computational prediction, or the use of feedback data, contributed to the discovery of novel ligands (
Regarding the computational prediction method used in this paper, we made the method available to the public as a web-based service named COPICAT (COmprehensive Predictor of Interactions between Chemical compounds And Target proteins;
The DrugBank dataset was constructed from Approved DrugCards data, which were downloaded in February, 2007 from the DrugBank database
Given
In the first-layer SVM, a pair comprising a protein and a small molecule, which constitutes a sample, is mapped onto an
We generated 100 first-layer SVM models with different random combinations of proteins and chemical compounds as negatives. The SVM parameters were chosen to give the best accuracy in a 10-fold cross validation in one set of positives and negatives.
We prepared two sets of first-layer SVM models, each of which consists of 100 models. One set
The second-layer SVM directly utilizes the outputs of the first-layer SVM models as inputs. The second-layer SVM model was constructed from the whole DrugBank dataset and reasonably designed negatives, which are described in detail later, on the basis of the RBF kernel
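The data flow between the two layers can be sketched as follows. This is not the authors' implementation and does not train an SVM: the first-layer models are replaced by hypothetical stand-in linear decision functions, and only two steps are shown, namely assembling the second-layer input vector from the first-layer outputs and evaluating the standard RBF kernel, K(u, v) = exp(-gamma * ||u - v||^2), on such vectors. The value of gamma is arbitrary.

```python
import math


def make_linear_model(weights, bias):
    """Hypothetical stand-in for a trained first-layer model (returns a decision value)."""
    def decision_value(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return decision_value


def second_layer_features(sample, first_layer_models):
    """Second-layer input: one first-layer output per model, in a fixed order."""
    return [model(sample) for model in first_layer_models]


def rbf_kernel(u, v, gamma):
    """Standard RBF kernel between two second-layer input vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


# Two stand-in first-layer models and two toy protein-compound feature vectors.
MODELS = [make_linear_model([1.0, 0.0], 0.0), make_linear_model([0.0, 1.0], -1.0)]
features_a = second_layer_features([2.0, 3.0], MODELS)
features_b = second_layer_features([1.0, 2.0], MODELS)
similarity = rbf_kernel(features_a, features_b, gamma=0.5)
```

The dimensionality of the second-layer input equals the number of first-layer models used, which is why the number of such models drives the computational cost.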
The number of first-layer SVM models whose output is used in the second-layer SVM models mainly determines the computation time and the workload of the two-layer SVM methods. Therefore, in order to practically realize comprehensive protein-chemical interaction predictions, fewer first-layer models achieving high prediction accuracy are given preference.
We applied the recursive feature elimination (RFE) method
We followed and modified the method described in Wang
min: Top
max: Top
mle: Top
mlt: Top
where
Unless otherwise specified, all solvents and reagents were obtained from commercial suppliers.
In the plasmid preparation, pTriAR, a construct in which Androgen receptor (AR) cDNA is subcloned into the pTriEX-3 Neo vector, was provided by Taiho Pharmaceutical.
In the in vitro binding assay, dihydrotestosterone (DHT), flutamide, nilutamide, spironolactone and cortexolone were purchased from Sigma. Testosterone and bicalutamide were purchased from Wako Pure Chemical Industries. ZINC 04369595, MDPI 944, MDPI 1011, NSC 6129, MDPI 10314, 3-epiuzarigenin, ZINC 04026296, methandriol, vitamin D3, ZINC 03849821, P712100 and fluanisone were purchased from Namiki Shoji.
The gene sequences corresponding to the ligand-binding domain (609th a.a.–919th a.a.) of androgen receptor C-termini (ARC) were subcloned into pMALc-2x vector digested with
It has been reported that an in vitro binding assay with ARC produces almost the same results as that with full-length AR
50 µg/ml MBP-ARC, 2 nM [3H]-DHT, and the indicated amount of test compounds were incubated for three hours. Then, the radioactivity of [3H]-DHT bound to MBP-ARC was measured with a scintillation counter. Details are provided in
The concentration of the test compound at which the measured radioactivity of bound [3H]-DHT was 50% of that measured without the test compound was regarded as the IC50 of the test compound.
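A simple way to read an IC50 off a dilution series according to this definition is sketched below. The interpolation scheme (linear in log10 concentration between the two measurements that bracket 50%) is an assumption made for illustration, since the text does not say how the 50% point was determined, and the example values are invented.

```python
import math


def estimate_ic50(concentrations_um, bound_fraction):
    """Estimate the IC50 (µM) from a dilution series.

    concentrations_um: ascending, strictly positive test-compound concentrations.
    bound_fraction: measured radioactivity divided by the radioactivity without
        test compound (1.0 means no displacement of [3H]-DHT).
    Returns None when the series never crosses 50%.
    """
    points = list(zip(concentrations_um, bound_fraction))
    for (c_lo, f_lo), (c_hi, f_hi) in zip(points, points[1:]):
        if f_lo >= 0.5 >= f_hi and f_lo != f_hi:
            # Position of the 50% crossing between the two bracketing points.
            t = (f_lo - 0.5) / (f_lo - f_hi)
            log_ic50 = math.log10(c_lo) + t * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None


# Invented dilution series: 1, 10 and 100 µM of a test compound.
example_ic50 = estimate_ic50([1.0, 10.0, 100.0], [0.9, 0.5, 0.1])
```

A compound whose series stays above 50% at every tested concentration returns None here, which corresponds to an IC50 above the highest tested concentration.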
Given
AutoDock 4
Supplementary Methods and Supplementary Results are provided.
(0.13 MB PDF)
Schematic illustration of the two-layer SVM system.
(0.02 MB PDF)
Protein-drug interaction network for several datasets.
(0.38 MB PDF)
Effects of feature selection on two-layer SVM model.
(0.02 MB PDF)
Results of in vitro binding assay. Results of in vitro binding assay for each compound.
(0.62 MB PDF)
The scope of the third computational prediction.
(0.01 MB PDF)
Prediction performances in several datasets
(0.03 MB PDF)
Effects of integrating different types of protein-chemical interactions
(0.02 MB PDF)
Prediction performances on different designed negatives
(0.02 MB PDF)
Evaluation of our prediction method on an external test set
(0.06 MB PDF)
Evaluation of our method with respect to comprehensive interaction prediction
(0.01 MB PDF)
Utilization of one-class SVM in the selection of negative samples
(0.03 MB PDF)
Overlaps of predictions between prediction models in
(0.03 MB PDF)