Conceived and designed the experiments: JDF FMC. Performed the experiments: JDF. Analyzed the data: JDF. Wrote the paper: JDF FMC.
The authors have declared that no competing interests exist.
With the increasing amount of data made available in the chemical field, there is a strong need for systems capable of comparing and classifying chemical compounds in an efficient and effective way. The best approaches existing today are based on the structure-activity relationship premise, which states that the biological activity of a molecule is strongly related to its structural or physicochemical properties. This work presents a novel approach to the automatic classification of chemical compounds by integrating semantic similarity with existing structural comparison methods. Our approach was assessed using the Matthews Correlation Coefficient of its predictions, and achieved values of 0.810 for blood-brain barrier permeability, 0.694 for P-glycoprotein substrate, and 0.673 for estrogen receptor binding activity. These results show a significant improvement over the currently existing methods, whose best performances were 0.628, 0.591, and 0.647 respectively. It was demonstrated that the integration of semantic similarity is a feasible and effective way to improve existing chemical compound classification systems. Among other possible uses, this tool can help the study of the evolution of metabolic pathways, the study of the correlation of metabolic networks with properties of those networks, or the improvement of ontologies that represent chemical information.
Among the existing systems capable of computationally comparing chemical compounds, the majority use only structural and physicochemical properties. However, with the emergence of ChEBI and other chemical compound databases, it has become feasible to create a system that can use the relevance of compounds in a biological context as well. This setting enables the distinction of molecules with different roles in nature but similar structures, or similar roles and different structures. ChEBI is organized as an ontology that classifies chemical compounds, which we use to derive a semantic similarity measure that reflects the biological relevance of molecules. In an effort to use as much information as possible, we introduce Chym, a system that integrates structural and semantic information in a single hybrid metric, and we show the accuracy of the system in three distinct classification problems, which consist of deciding whether a compound crosses the blood-brain barrier, is a P-glycoprotein substrate, or is an estrogen receptor ligand. Chym outperforms the previous attempts to solve these three problems, with a maximum accuracy of 90.0%.
The recent publication of large-scale chemical information, made available by PubChem, ChEMBL and ChEBI, for instance, increased the focus of the scientific community on the problem of chemical comparison. With the amount of chemical data being published and produced today, it has become increasingly necessary to devise automatic systems capable of handling this information. The creation of an effective and accurate system that can compare and classify chemical compounds is useful in a number of different applications. For instance, it can help the understanding of the evolution of metabolic pathways,
The best approaches existing today are based on the structure-activity relationship premise (SAR), which states that biological activity of a molecule is strongly related to its structural or physicochemical properties. While the existing methods prove that this assumption generally holds, it is not always true. For instance, while L-amino acids are used to synthesize proteins, their stereo-isomers, D-amino acids, are much less frequent in nature and their role is totally different
The two represented molecular structures, clavulanic acid (A) and 3-carboxyphenyl phenylacetamidomethylphosphonate (B), are different, and yet they both inhibit
Most automatic classification methods implemented currently use either (i) the chemical structure as the foundation of the comparison
One of the main advantages of approach (i) is its ability to compare two or more molecules
There have been attempts to use graph comparison algorithms applied to the chemical structure of two molecules. One way of doing this is to restrict the similarity problem to the search for the maximum common sub-graph
More often, though, structural similarity is calculated with the aid of fingerprints. A fingerprint, in this context, is a bitstring, a sequence of 0's and 1's, where each bit represents the presence or absence of a given feature or substructure. There are several ways to construct the fingerprint. For instance, for Daylight fingerprints, all the distinct linear fragments, up to a certain size, are identified from the graph and then converted into numbers
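As an illustration of the fragment-based idea only, the following toy sketch enumerates linear fragments of a small molecular graph and hashes each one to a bit position. It is not the Daylight algorithm: the function names, the 64-bit length, the fragment labels and the use of CRC32 as the hash are all assumptions made for this example.

```python
import zlib

def linear_fragments(atoms, bonds, max_atoms=4):
    """Enumerate simple paths of up to max_atoms atoms as label strings."""
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    fragments = set()

    def walk(path):
        forward = "-".join(atoms[i] for i in path)
        backward = "-".join(atoms[i] for i in reversed(path))
        # A path and its reverse are the same fragment; keep one canonical label.
        fragments.add(min(forward, backward))
        if len(path) < max_atoms:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return fragments

def fingerprint(atoms, bonds, n_bits=64, max_atoms=4):
    """Set one bit per fragment (toy hashing scheme: CRC32 modulo n_bits)."""
    bits = 0
    for frag in linear_fragments(atoms, bonds, max_atoms):
        bits |= 1 << (zlib.crc32(frag.encode()) % n_bits)
    return bits

# Heavy atoms of ethanol, C-C-O, with bonds given as index pairs.
ethanol = fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
print(format(ethanol, "064b"))
```

Two different fragments can hash to the same bit, so a set bit indicates that at least one of several fragments is present; this is a general property of hashed fingerprints.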
For approach (ii), one has to compute the describing properties (if possible), to gather them from literature or to conduct experiments to obtain them.
For example, in
In
The work of
Random forests also use decision trees as their basis, as shown by
These previous works (as well as the present study) validate their approaches by using the comparison algorithms as classification systems and consistently report performance as the fraction of correctly classified compounds:
Dataset | Classification system | Accuracy | Reference |
BBB | Artificial Neural Networks | 75.7% | |
BBB | Random Forest | 80.9% | |
BBB | Support Vector Machines | 81.5% | |
P-gp | Four-point Pharmacophore | 62.7% | |
P-gp | Support Vector Machines | 79.4% | |
P-gp | Random Forest | 80.6% | |
estrogen | Decision Forest | | |
estrogen | Random Forest | 82.8% | |
This table summarizes the performance of several classification methods used on the BBB, P-gp and estrogen problems.
The semantic information of an object, i.e., its meaning in a predetermined context, is not easily handled by computers, mainly because meaning is mostly described in terms of natural language. For this reason, comparing the semantics of two objects (in this case, two chemical compounds) is not a straightforward task, and is only possible if the semantics of both objects are described under a common schema
An ontology is a representation of terms and the relationship between them, and is usually visualized as a directed graph where nodes are the terms and the directed edges are the relationships
In this work, we used both the ontology as a graph and a concept known as information content. The information content is an abstract concept that reflects the
To validate the effectiveness of Chym as a classification tool, we tested it on the sets presented in
In the three sets retrieved from the previous works presented in the introduction, the compounds were listed by name only, with no information on structure. The first step in the assessment of Chym was, therefore, to translate that list of names into ChEBI identifiers. The task of getting the identifiers was accomplished by string matching techniques, since there was no structural information to make the search. We split the names into bags of words, where a word is a sequence consisting of only letters or only numbers, to determine whether two names refer to the same chemical entity. We used not only the preferred names of the compounds but also the synonyms stored in the ChEBI database. Only compounds present in the ontology and with a described molecular structure in the ChEBI database were considered. Because ChEBI is continually growing, we estimate that older compounds in the ontology are usually more correctly annotated and tend to have lower identifiers. So, in case of more than one possibility, we chose the lowest ChEBI id.
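The name-matching step can be sketched as follows. This is a minimal illustration, assuming that two names match when their bags of words are identical and that matching is case-insensitive; the text does not state the exact matching criterion. The `chebi_names` dictionary is a hypothetical in-memory stand-in for the ChEBI names and synonyms, and the identifier 99999 is invented to show the tie-breaking rule.

```python
import re

def bag_of_words(name):
    """Split a name into runs consisting of only letters or only digits."""
    # ASSUMPTION: matching is case-insensitive, hence the lower().
    return frozenset(re.findall(r"[a-z]+|[0-9]+", name.lower()))

def match_name(query, chebi_names):
    """Return the lowest ChEBI id with a name or synonym matching the query.

    chebi_names maps an integer ChEBI id to a list of names and synonyms.
    Returns None when no entry matches.
    """
    target = bag_of_words(query)
    hits = [cid for cid, names in chebi_names.items()
            if any(bag_of_words(n) == target for n in names)]
    # When several entries match, prefer the lowest identifier.
    return min(hits) if hits else None

# Toy data: 8069 and 49575 are the ids quoted in the text; 99999 is invented.
chebi_names = {
    49575: ["diazepam"],
    8069: ["phenobarbital"],
    99999: ["Phenobarbital"],
}
print(match_name("phenobarbital", chebi_names))
```

With this representation, punctuation and word order are ignored, so a query such as "acid, acetic" would match a stored name "acetic acid".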
Since the ontology does not contain all the possible molecules, we were not able to get a full mapping between names and ChEBI compounds, which means that our sets were shorter versions of the original ones. We refer to our smaller sets as purged versions and denote them as BBB
Testing set | Active (found/total) | Inactive (found/total) | Overall ChEBI coverage |
BBB | 74/180 | 79/144 | 47.2% |
P-gp | 57/109 | 24/87 | 41.3% |
estrogen | 42/132 | 59/101 | 43.3% |
Fraction of names found in the ChEBI ontology for each set of molecules. Coverage for active and inactive compounds is detailed.
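The overall coverage values follow directly from the active and inactive counts. The short check below recomputes them from the numbers in the table; the variable and function names are chosen for this example only.

```python
# (found, total) pairs for active and inactive compounds, copied from the table.
coverage = {
    "BBB": ((74, 180), (79, 144)),
    "P-gp": ((57, 109), (24, 87)),
    "estrogen": ((42, 132), (59, 101)),
}

def overall(active, inactive):
    """Return (compounds found, original set size, coverage in percent)."""
    found = active[0] + inactive[0]
    total = active[1] + inactive[1]
    return found, total, 100.0 * found / total

for name, (active, inactive) in coverage.items():
    found, total, pct = overall(active, inactive)
    print(f"{name}: {found}/{total} = {pct:.1f}%")
```

The same counts give the sizes of the purged sets: 153 compounds for BBB, 81 for P-gp and 101 for estrogen.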
The results of this table show a significant reduction in the size of all three sets after converting the names into ChEBI identifiers. Given these values, we chose to directly compare our results only to the ones obtained with the blood-brain barrier, because (i) it is the set with the highest percentage of ChEBI coverage, (ii) after purging, it remains the biggest set, and as such is better suited to being split into testing and training sets without losing too much information, and (iii) it is the set with the most balanced distribution of active vs. inactive compounds. We will also apply Chym to the two other sets, but the analysis will not be as deep.
The BBB set is first described in
In order to make an unbiased comparison between Chym and SVMs, we addressed the validation process in three steps, which were devised so that only one specification of the process changed in each step:
The SVM model described in
The same SVM model was used in our purged set, BBB
Finally, we replaced the SVM model with our Chym approach.
It must be mentioned here that Chym is actually a collection of 24 metrics, each having a real parameter,
For the SVM approach, we retrieved the compounds' properties from the article as 9-dimensional vectors and used the SVMlight
Moreover, to decrease the potential bias in our analysis, we implemented two different validation methods. The first one is a leave-multiple-out process, described in
1. 25 active compounds and 25 inactive compounds are randomly removed from the set. They now form the testing set;
2. The remaining set is used to train the model;
3. The compounds in the testing set are classified according to the model learned in the previous step. Performance (as MCC and accuracy) is recorded;
4. Steps 1–3 are repeated 30 times, and an average of the performance indicators is recorded.
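The leave-multiple-out procedure and the two performance indicators can be sketched as below. This is an illustrative outline, not the code used in the study: `fit` and `predict` are hypothetical callables standing in for any classifier (SVM or Chym), and the fixed random seed is an assumption added for reproducibility.

```python
import math
import random

def confusion(y_true, y_pred):
    """Count true positives, true negatives, false positives, false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    """Fraction of correctly classified compounds."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient; 0.0 when the denominator vanishes."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def leave_multiple_out(actives, inactives, fit, predict, k=25, rounds=30, seed=0):
    """Average accuracy and MCC over `rounds` random splits with k + k test compounds."""
    rng = random.Random(seed)
    accs, mccs = [], []
    for _ in range(rounds):
        test_act = rng.sample(actives, k)        # step 1: remove k actives...
        test_inact = rng.sample(inactives, k)    # ...and k inactives
        train_act = [c for c in actives if c not in test_act]
        train_inact = [c for c in inactives if c not in test_inact]
        model = fit(train_act, train_inact)      # step 2: train on the rest
        y_true = [True] * k + [False] * k
        y_pred = [bool(predict(model, c)) for c in test_act + test_inact]  # step 3
        accs.append(accuracy(y_true, y_pred))
        mccs.append(mcc(y_true, y_pred))
    return sum(accs) / rounds, sum(mccs) / rounds  # step 4: average
```

Reporting MCC alongside accuracy is useful because MCC stays informative when the active and inactive classes differ in size.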
The second validation approach is
The last step in the assessment of Chym was to predict some new active compounds in each of the three sets. We calculated an activity coefficient for all compounds in the ChEBI ontology annotated with a structure, based on the active compounds in the respective purged sets, and the best metric for each problem, and retrieved the ones whose coefficient was higher. For a discussion about the methods used to calculate this value, refer to section
Set | Approach | Validation method | Accuracy | MCC |
BBB | SVM | LMO25 | 81.3% | 0.630 |
BBB (purged) | SVM | LMO25 | 73.8% | 0.484 |
BBB (purged) | Chym | LMO25 | 89.6% | 0.800 |
BBB | SVM | 10-fold | 81.2% | 0.625 |
BBB (purged) | SVM | 10-fold | 74.1% | 0.492 |
BBB (purged) | Chym | 10-fold | 90.0% | 0.810 |
For the LMO25 method, the accuracy values are the mean of 30 experiments, as explained in the previous section. The Chym results were obtained for FP3 fingerprint format, simGIC semantic method using the entire ontology, and
In its second part,
Set | Chym parameters | Chym MCC | Chym accuracy | Best previous approach | Previous MCC | Previous accuracy |
BBB | FP3, simGIC, all, 0.29 | 0.810 | 90.0% | SVM | 0.628 | 81.5% |
P-gp | FP4, simUI, role, 0.72 | 0.694 | 87.3% | Random Forests | 0.591 | 80.6% |
estrogen | FP4, simGIC, role, 0.45 | 0.673 | 82.6% | Random Forests | 0.647 | 82.8% |
Chym parameters are “fingerprint format, semantic method, branch of the ontology used,
For each dataset, the best metric was stripped of the
Alpha | BBB | P-gp | estrogen |
0.0 | 0.66837 | 0.47723 | 0.26418 |
0.1 | 0.74508 | 0.54799 | 0.33957 |
0.2 | 0.78206 | 0.54634 | 0.42900 |
0.3 | 0.63492 | 0.50817 | |
0.4 | 0.75904 | 0.60167 | |
0.5 | 0.73267 | 0.61939 | 0.63670 |
0.6 | 0.68652 | 0.60764 | 0.66318 |
0.7 | 0.64528 | 0.57530 | |
0.8 | 0.57281 | 0.54896 | 0.60161 |
0.9 | 0.52186 | 0.49979 | 0.64252 |
1.0 | 0.51764 | 0.48429 | 0.61333 |
The Chym parameters used are the ones in
Finally,
Set | Rank | ChEBI ID | Name | Coefficient | Ref. |
BBB | 1 | 50931 | (Z)-chlorprothixene | 0.289 | |
BBB | 2 | 51137 | mianserin | 0.280 | |
BBB | 3 | 251412 | adinazolam | 0.279 | |
P-gp | 7 | 53290 | (S)-donepezil | 0.373 | |
P-gp | 15 | 31181 | aklavinone | 0.368 | |
P-gp | 16 | 48723 | (-)-lobeline | 0.366 | |
estrogen | 2 | 27917 | luteone | 0.277 | |
estrogen | 4 | 5262 | galangin | 0.274 | |
estrogen | 5 | 50399 | 3′,4′,7-trihydroxyisoflavone | 0.274 | |
For each compound, a reference showing that the compound is indeed active is given. The thresholds for each problem, as determined by the algorithm detailed in the Methodology section, are 0.243 (BBB), 0.272 (P-gp) and 0.231 (estrogen).
The work presented in this paper shows compelling evidence that using semantic information in chemical classification algorithms improves their performance. To show that, we used three sets of compounds previously described and used as input in other classification methods. On those sets, Chym achieves higher performance for class prediction when compared to previously existing methods, with a Matthews Correlation Coefficient as high as 0.810, corresponding to an accuracy of 90.0%. In addition, we showed that a hybrid metric that uses both structural and semantic information is better suited to this kind of problem than a system that uses only one of these types of information. Some issues should, however, be discussed in order to complete the analysis of this tool.
The properties that are relevant to decide whether a molecule should be classified as active or inactive depend obviously on the problem being solved. As such, the best metric for a problem is not necessarily the same for other problems. Thus, selecting the best metric is not much different than selecting the appropriate descriptors for SVM, random forest or the other approaches presented before. While it may be argued that the value of
On another note, our high performance could be due to a possible term in the ontology that classified compounds as able to cross the blood-brain barrier, as substrates to the P-glycoprotein or as estrogen receptor ligands. Admittedly, if there were such terms in the ontology, Chym would be biased and would report high accuracy values because it would be using the information it was trying to validate as a means to prove its effectiveness. As it turns out, no term in the ontology refers to the words “brain”, “barrier”, “P-glycoprotein” or “permeability” (the meaning of the P in P-glycoprotein). “Estrogen receptor” appears twice, in “estrogen receptor modulator” and “estrogen receptor antagonist”, but these two terms have only a total of 5 descendants in the ontology, and none of them is present in the set estrogen
The reason for this fact is that, although the information to solve the classification problem is not explicitly stated in the ontology, the proximity of terms in the ontology (their semantic similarity) is a good indicator that they should behave similarly. For instance, both the compounds ChEBI:8069, phenobarbital, and ChEBI:49575, diazepam, cross the blood-brain barrier. Moreover, they share many of their ancestors. Their semantic similarity, as measured with a simGIC method in the whole ontology, is 0.324, and their structural similarity, as measured with the FP3 format, is 0.667. With an
Still in respect to the results presented in
As discussed in the Methodology, the ChEBI ontology contains three partially overlapping branches. One concern raised by this fact is that the molecular structure more or less reproduces the structural information used in the first part of the metric. Although the information being used is indeed the same, the ontology explores the structural properties from a totally new perspective (namely, a semantic perspective), that would be otherwise unusable in a similarity measure: purely structural comparison methods are probably unable to use the fact that both glucose and fructose are monosaccharides to compare them. So, even if there seems to be a duplication of information, the different approaches used yield similarity values that can be combined to produce a more robust score (as Chym does).
Another concern raised about the use of ChEBI ontology is the subatomic branch. This branch was never chosen by itself as the best branch of the ontology, which is not surprising, for two reasons. First, it is not much richer than the molecular structure or role branches, since only 35 ChEBI terms are unique to this branch of the ontology. Secondly, each of these 35 terms is either an ancestor to all chemical compounds used in the input set (as happens with electron, for instance, which is part of the atom, which is part of every molecular structure) or ancestor to none of the chemical compounds (photon, for instance). This means that this branch does not offer any kind of resolution.
However, like any other classification algorithm, Chym has its limitations. The most important drawback of this method is that it can only compare structures that are annotated in the ChEBI ontology. Of course, any chemist or other scientist wishing to use Chym may annotate the compound they are trying to study in ChEBI by creating a “non-official” node. There is, however, a large number of classes, which could make it difficult to select the most appropriate position for the compound; this annotation is also unfeasible for a large number of compounds. This severely impairs applications such as drug discovery or toxicology analysis.
In spite of this limitation, Chym introduces the comparison of chemical compounds through their semantics, which is an important technique that can be used in projects where comparison and/or classification of known chemical compounds is needed. One instance of such a project is the search for a possible correlation between strains of bacteria and their virulence. One could be interested in determining differences in the metabolic networks of said strains and comparing those differences with the degree of virulence of each strain; the comparison of metabolic networks would benefit from the metrics explored here. Other applications include the comparison of models, for instance models of diseases containing references to molecules responsible for the disease or to drugs known to improve the condition of patients. On the other hand, the semantic similarity applied to ChEBI (developed and explored in this work) can also be useful in ontology managing, as happens in GO
In the future, it would be interesting to try other hybrid metrics, especially other structural comparison algorithms. For instance, since SVM and random forests seem to perform well, perhaps a system where the structural part of the comparison is done through one of these methods would outperform the current version of Chym.
In order to develop and validate our hybrid similarity for chemical compounds, the
To calculate the structural similarity between two molecules, we need a representation of their structures. Because ChEBI contains a list of structures in SMILES, MDL and InChI chemical file formats, these are the formats used. For each distinct molecule, we prefer a SMILES representation of the structure. If one does not exist, we use MDL. The rationale for this choice is the wide use of SMILES over MDL. InChI was not used since every molecule with a structure in this format had at least one of the other formats as well.
For each structure, three fingerprints were calculated. These formats were computed with the OpenBabel software
Given two molecules and the corresponding fingerprints
From equation 1, it can be seen that the structural similarity will run from 0, when no bit is 1 for both molecules (total disparity), to 1, when the 1-bits in the two molecules are the same (equal fingerprints).
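The behaviour described for equation 1 (0 when the fingerprints share no 1-bit, 1 when their 1-bits coincide) is consistent with the Tanimoto coefficient, which is what the sketch below assumes. Fingerprints are represented as Python integers used as bitstrings, and the convention of returning 0.0 for two all-zero fingerprints is also an assumption of this example.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as integer bitstrings."""
    common = bin(fp_a & fp_b).count("1")   # bits set in both fingerprints
    either = bin(fp_a | fp_b).count("1")   # bits set in at least one fingerprint
    # ASSUMPTION: two empty fingerprints are given similarity 0.0.
    return common / either if either else 0.0

print(tanimoto(0b1100, 0b0011))  # no shared bits
print(tanimoto(0b1010, 0b1010))  # identical fingerprints
print(tanimoto(0b1110, 0b0111))  # two shared bits out of four set overall
```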
Following the application of semantic measures for the GO
It is known, however, that for ontologies where term specificity is not well correlated with term depth, methods based on information content (IC) are preferable
It is worth underlining here that the concept of information content is just a method to give weight to the compounds in the ontology. If two compounds share many ancestors, simUI will attribute a high similarity between them, but, for example, if most of those ancestors are unspecific, the similarity should be lowered accordingly; by weighting the ancestors, simGIC achieves this effect. For example, compounds ChEBI:17802, pseudouridine, and ChEBI:31747, kanosamine, share 30 of their 37 ancestors, but the most specific of those is ChEBI:23008, carbohydrate, already a very abstract term in the ontology. simGIC takes this fact into account. Considering the similarity values between all pairs of compounds that appear in the corpus at least once, the mean similarity measured with simUI is 0.431 and the mean similarity with simGIC is 0.048. Those two compounds share a simUI similarity value of 0.811, about twice the mean value, but by weighting the ancestors, simGIC assigns a similarity of 0.023, about half of the mean value.
For both metrics, the similarity value is between 0 and 1 because an intersection of two sets is always a subset of their union.
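Both measures can be sketched on a toy ontology as follows, using the usual definitions: simUI is the number of shared ancestors divided by the number of ancestors in the union, and simGIC is the same ratio with each ancestor weighted by its information content. The four-term ontology, the information content values, and the choice to count a term among its own ancestors are assumptions made for illustration.

```python
def ancestors(term, parents):
    """Ancestors of a term in a DAG given as {term: [direct parents]}.

    ASSUMPTION: the term itself is counted among its ancestors.
    """
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def sim_ui(a, b, parents):
    """Shared ancestors divided by all ancestors of either term."""
    anc_a, anc_b = ancestors(a, parents), ancestors(b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)

def sim_gic(a, b, parents, ic):
    """Same ratio as sim_ui, with every ancestor weighted by its information content."""
    anc_a, anc_b = ancestors(a, parents), ancestors(b, parents)
    union = sum(ic[t] for t in anc_a | anc_b)
    return sum(ic[t] for t in anc_a & anc_b) / union if union else 0.0

# Toy ontology and invented information content values.
parents = {
    "glucose": ["monosaccharide"],
    "fructose": ["monosaccharide"],
    "monosaccharide": ["carbohydrate"],
    "carbohydrate": [],
}
ic = {"carbohydrate": 0.5, "monosaccharide": 1.5, "glucose": 3.0, "fructose": 3.0}

print(sim_ui("glucose", "fructose", parents))       # 2 shared / 4 in union
print(sim_gic("glucose", "fructose", parents, ic))  # 2.0 shared IC / 8.0 total IC
```

In the toy data the shared ancestors are the least specific terms, so simGIC is lower than simUI. The pseudouridine and kanosamine example shows the same effect at a larger scale: 30 shared ancestors out of 37 give the simUI value of 0.811 quoted in the text, while the weighted measure is far lower.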
Until this point, we presented two orthogonal metrics to measure the similarity between two chemical compounds. Our intent, however, is to join them together to produce a hybrid metric that takes into account both structural and semantic information.
Since both measures explained above always fall in the closed interval
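One simple way to combine two similarities that both lie between 0 and 1 is a weighted average controlled by a single parameter. The sketch below assumes this form, and further assumes that alpha multiplies the structural term and (1 - alpha) the semantic term; the opposite convention is equally possible, and neither is confirmed by the text available here.

```python
def hybrid_similarity(structural, semantic, alpha):
    """Weighted average of a structural and a semantic similarity.

    ASSUMPTION: alpha weights the structural term, (1 - alpha) the semantic one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in the interval [0, 1]")
    return alpha * structural + (1.0 - alpha) * semantic

# Inputs taken from the phenobarbital/diazepam example (FP3: 0.667, simGIC: 0.324)
# and the alpha value listed for the BBB problem; the weighting is hypothetical.
print(hybrid_similarity(structural=0.667, semantic=0.324, alpha=0.29))
```

Because the result is a convex combination, it always stays between the two input similarities and therefore also between 0 and 1.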
One of the possible uses of Chym is the application of this similarity metric to classify compounds. Ideally, we want to be able to get a set of chemical compounds that possess a common property as input, and then determine whether other chemical compounds also possess that property. This is also the approach used in SVM and random forests, for example, where the input serves as a training set that is used to create a classification model. In Chym, the model consists of a threshold that is used to decide whether a compound is active or inactive.
Given a training set of compounds, some sharing a common property (which we call
Within the training set, compare each compound with all active compounds. The comparison of an active compound with itself is excluded, since this value (which is always 1) could introduce a bias into the rest of the algorithm.
For each of the compounds in the training set, determine its
Determine the
For all compounds in the validation set, Chym calculates their activity coefficient as the average of similarities between the compound and all active compounds in the training set, and classifies it as active if the activity coefficient is greater than or equal to the threshold of activity
From the algorithm above, it can be seen that the inactive compounds are only used to adjust the value of the threshold, while the active compounds are used both in the adjustment of that value and in the determination of the activity coefficient of the validation compounds.
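The classification scheme can be outlined as below. The activity coefficient follows the description above (mean similarity to the active training compounds, with self-comparison excluded). The rule used to pick the threshold is an assumption of this sketch: it selects the candidate value that classifies the most training compounds correctly, which may differ from the procedure actually used. The `similarity` argument is any function returning a value between 0 and 1, and the numeric toy compounds are invented.

```python
def mean(values):
    """Arithmetic mean, or 0.0 for an empty sequence."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def activity_coefficient(compound, actives, similarity):
    """Average similarity to the active training compounds, skipping self-comparison."""
    return mean(similarity(compound, a) for a in actives if a != compound)

def fit_threshold(actives, inactives, similarity):
    """ASSUMPTION: choose the threshold with the most correct training labels."""
    scored = [(activity_coefficient(c, actives, similarity), True) for c in actives]
    scored += [(activity_coefficient(c, actives, similarity), False) for c in inactives]
    candidates = sorted({score for score, _ in scored})

    def correct(threshold):
        return sum((score >= threshold) == label for score, label in scored)

    return max(candidates, key=correct)

def classify(compound, actives, threshold, similarity):
    """Active when the activity coefficient reaches the threshold."""
    return activity_coefficient(compound, actives, similarity) >= threshold

# Toy example: compounds are numbers and similarity decays with their distance.
toy_similarity = lambda x, y: 1.0 / (1.0 + abs(x - y))
actives, inactives = [1.0, 1.2, 0.9], [5.0, 6.0]
threshold = fit_threshold(actives, inactives, toy_similarity)
print(classify(1.1, actives, threshold, toy_similarity))
print(classify(5.5, actives, threshold, toy_similarity))
```

As in the description, the inactive compounds influence only the threshold, while the active compounds enter both the threshold and every activity coefficient.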
The ontology also includes classes of molecular entities and partial molecular entities, enabling ChEBI to be organized as an ontology, structuring molecular entities into classes and defining the relations between them. Several relationship types exist in ChEBI, with a number of them reciprocal in nature. The ontology is subdivided into three separate sub-ontologies:
As of the time of the computations (January 2010, release 64), the graph of this ontology contained 23,545 nodes representing chemical compounds, which represents approximately 4% of the whole ChEBI database. As stated above, some terms are not chemical compounds but parts of compounds, such as functional groups, that make the ontology structure possible. Also, for each individual chemical compound, there may be several identifiers, which come from different annotations that were later identified as the same compound.
ChEBI's branches are partially overlapping. For instance, the term glucose is classified as a molecular structure, as having the role of macronutrient and as having part electron, which means that it is present in all three branches. Including glucose, 21,676 nodes (92%) are part of the three branches.
Besides the ontology, the ChEBI database is enriched with an extensive list of synonyms and manually curated cross-references to other non-proprietary databases, as well as a list of chemical structures.
One of the main components of
The methods used to structurally compare compounds are implemented by the OpenBabel software
The semantic similarity was not as straightforward. As in
To calculate the IC-based metric (simGIC), we had to find a corpus where the compounds are referenced. We chose
Since there are 3 fingerprint formats, and semantic similarity can be calculated based on 4 different DAGs and with 2 different methods, the approach we are presenting here is able to use
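The number of metrics follows from combining every fingerprint format with every semantic method and every DAG. The enumeration below illustrates the count; the labels are assumptions for this example (FP3, FP4, simUI, simGIC, "all" and "role" appear in the results tables, while "FP2" and the remaining branch names are filled in for illustration).

```python
from itertools import product

FINGERPRINTS = ["FP2", "FP3", "FP4"]   # FP2 is an assumed third format
METHODS = ["simUI", "simGIC"]
DAGS = ["all", "molecular structure", "role", "subatomic particle"]

# One metric per (fingerprint, semantic method, DAG) combination.
metrics = list(product(FINGERPRINTS, METHODS, DAGS))
print(len(metrics))
```

Each of these combinations still has the real-valued parameter to tune, so model selection amounts to choosing one tuple and one parameter value per problem.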
Performance indicators for every metric used by Chym, when solving the BBB problem. The table is sorted so that the metric with the highest Matthews Correlation Coefficient appears first in the list.
(0.19 MB TXT)
Performance indicators for every metric used by Chym, when solving the P-gp problem. The table is sorted so that the metric with the highest Matthews Correlation Coefficient appears first in the list.
(0.19 MB TXT)
Performance indicators for every metric used by Chym, when solving the estrogen problem. The table is sorted so that the metric with the highest Matthews Correlation Coefficient appears first in the list.
(0.19 MB TXT)
Activity coefficient of every compound in the ChEBI ontology, when the dataset from the BBB problem is used to train Chym. Only compounds with a structure were considered.
(0.66 MB TXT)
Activity coefficient of every compound in the ChEBI ontology, when the dataset from the P-gp problem is used to train Chym. Only compounds with a structure were considered.
(0.66 MB TXT)
Activity coefficient of every compound in the ChEBI ontology, when the dataset from the estrogen problem is used to train Chym. Only compounds with a structure were considered.
(0.66 MB TXT)