Conceived and designed the experiments: DT UL. Performed the experiments: DT PT PP. Analyzed the data: DT PT PP. Wrote the paper: DT PT JH UL.
The authors have declared that no competing interests exist.
The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein–protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed: convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data. Nevertheless, our study shows that three kernels are clearly superior to the other methods.
The most important way of conveying new findings in biomedical research is scientific publication. In turn, the most recent and most important findings can only be found by carefully reading the scientific literature, which becomes more and more of a problem because of the enormous number of published articles. This situation has led to the development of various computational approaches to the automatic extraction of important facts from articles, mostly concentrating on the recognition of protein names and on interactions between proteins (PPI). However, so far there is little agreement on which methods perform best for which task. Our paper reports on an extensive comparison of nine recent PPI extraction tools. We studied their performance in various settings on a set of five different text collections containing articles describing PPIs, which for the first time allows for an unbiased comparison of their respective effectiveness. Our results show that the tools' performance depends largely on the collection they are trained on and the collection they are then evaluated on, which means that extrapolating their measured performance to arbitrary text is still highly problematic. We also show that certain classes of methods for extracting PPIs are clearly superior to other classes.
Protein-protein interactions (PPIs) are integral to virtually all cellular processes, such as metabolism, signaling, regulation, and proliferation. Collecting data on individual interactions is crucial for understanding these processes at a systems biology level
Several approaches are in use to study interactions in large- or small-scale experiments. Among the techniques most often used are two-hybrid screens, mass spectrometry, and tandem affinity purification
Taking into account the great wealth of PPI data that was published before the advent of PPI databases, it becomes clear that much valuable data is still available only in text. Turning this information into a structured form is a costly task that has to be performed by human experts
Several techniques for extracting protein-protein interactions from text have been proposed (cf. Related Work). Unfortunately, the reported results differ widely. While early works reported fabulous results of over 90% precision and recall
In this paper, we give an unbiased and comprehensive benchmark of a large set of PPI extraction methods. We concentrate on a fairly recent class of algorithms which usually is summarized with the term
The syntax tree parse of the example sentence
The dependency tree parse of the example sentence
Central to the learning and the classification phases is a so-called kernel function. Simply speaking, a kernel function is a function that takes the representation of two sentences and computes their similarity. Kernel-based approaches to PPI extraction—and especially those working with
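To make this notion concrete, the following minimal sketch (our illustration, not code from any of the evaluated systems) computes a similarity between two sentences from their bag-of-words representations; a real convolution kernel instead compares structured objects such as syntax trees or dependency graphs.

```python
from collections import Counter

def bow_kernel(sentence_a: str, sentence_b: str) -> float:
    """Toy kernel: dot product of bag-of-words count vectors,
    i.e., two sentences are the more similar the more words they share."""
    counts_a = Counter(sentence_a.lower().split())
    counts_b = Counter(sentence_b.lower().split())
    return float(sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys()))

print(bow_kernel("ENTITY1 binds ENTITY2", "ENTITY1 strongly binds ENTITY2"))  # 3.0
```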
In this paper, we provide a comprehensive benchmark of nine kernel-based methods for relationship extraction from natural text (all substantially different approaches that were available as programs from a list of around 20 methods we considered). We tested each method in various scenarios on five different corpora. The transformation of the sentences in the corpora was performed using state-of-the-art parser software, in particular, the latest release of the Charniak–Lease parser for constituent trees and the Stanford Parser for dependency graphs. We show how publicly available kernels compare to each other in three scenarios: document-level 10-fold cross-validation (CV), cross-learning (CL), and cross-corpus (CC) evaluation. We also introduce a new and very fast kernel, kBSPS, and demonstrate that it is highly competitive.
We see our work as a continuation of similar benchmarks that have recently shed some light on the state-of-the-art of selected phases in the PPI extraction pipeline; in particular, these are the work on the performance of different constituent and dependency parsers
A number of different techniques have been proposed to solve the problem of extracting interactions between proteins in natural language text. These can be roughly sorted into one of three classes: co-occurrence, pattern matching, and machine learning. We briefly review these methods here for completeness; see
A common baseline method for relationship extraction is to assume a relationship between each pair of entities that co-occur in the same piece of text (e.g.,
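As a minimal sketch (our own; the entity names are purely illustrative), this baseline simply enumerates all unordered pairs of distinct entities in a sentence:

```python
from itertools import combinations

def cooccurrence_pairs(entities):
    """Co-occurrence baseline: predict an interaction for every
    unordered pair of distinct entities found in the same sentence."""
    return list(combinations(sorted(set(entities)), 2))

print(cooccurrence_pairs(["SigK", "GerE", "SigK"]))  # [('GerE', 'SigK')]
```

Because every candidate pair is predicted as interacting, this baseline attains 100% recall by construction, and its precision equals the fraction of positive pairs in the corpus (cf. the co-occ. row in the results tables).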
The second common approach is pattern matching. SUISEKI was one of the first systems to use hand-crafted regular expressions to encode phrases that typically express protein-protein interactions, using part-of-speech and word lists
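The flavor of such rules can be conveyed by a deliberately simplified sketch (ours, in the spirit of SUISEKI-style patterns but much cruder; real systems also exploit part-of-speech tags and curated word lists, and the placeholder names are assumptions):

```python
import re

# One hypothetical hand-crafted pattern: two blinded proteins connected
# by an interaction verb, allowing a short gap of up to three words.
INTERACTION = re.compile(
    r"\bENTITY_A\b\W+(?:\w+\W+){0,3}?"
    r"(?:interacts? with|binds?(?: to)?|activates?|inhibits?)"
    r"\W+(?:\w+\W+){0,3}?\bENTITY_B\b",
    re.IGNORECASE,
)

print(bool(INTERACTION.search("ENTITY_A directly binds to ENTITY_B in vivo")))  # True
```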
In this section, we describe in detail the kernels we evaluated, the corpora and how we used them as gold standards, the measures we computed, and the parameter settings we used and how they were obtained. We believe that such a level of detail is necessary to compare different methods in a fair and unbiased manner. Note that our evaluation often produces results that are far from those published by other authors (see
The effect of using different parsers and parse representations for the task of extracting protein-protein interactions has been investigated in
For our experiments we selected the Charniak–Lease re-ranking parser (
A support vector machine (SVM) is a classifier that, given a set of training examples, finds the linear (hyper)plane that separates positive and negative examples with the largest possible margin
In our experiments we make use of SVM implementations where the training is performed by a convex quadratic programming (QP) task. Additionally, as proposed by its authors
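The following sketch shows, assuming a scikit-learn-style SVM for illustration (this is not the training code of any of the benchmarked systems), how an arbitrary kernel function can be plugged into such a learner via a precomputed Gram matrix:

```python
import numpy as np
from sklearn.svm import SVC

def train_with_precomputed_kernel(kernel_fn, X_train, y_train, X_test):
    """Train an SVM on a Gram matrix computed by an arbitrary
    kernel function over sentence representations."""
    gram_train = np.array([[kernel_fn(a, b) for b in X_train] for a in X_train])
    clf = SVC(kernel="precomputed", C=1.0)  # C: regularization parameter
    clf.fit(gram_train, y_train)
    # Test Gram matrix: rows are test instances, columns are training instances.
    gram_test = np.array([[kernel_fn(a, b) for b in X_train] for a in X_test])
    return clf.predict(gram_test)
```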
The kernels introduced in this section are mostly convolution kernels
Most of these kernels have been specifically designed to extract PPIs from text or have been successfully applied to this task. Exceptions are the subtree, partial tree, and spectrum tree kernels, which, to our knowledge, have not been tested for PPI extraction before. Next, we will very briefly introduce their underlying principles (see also
From all kernels we tested, this is the only one that exclusively uses shallow parsing information
The next four kernels use the syntax tree representation of sentences (see
The
The
The
In
L, E mismatches are tolerated (
In
The drawback of the cosine similarity for textual data is its order-independence. The
The
In the literature, several further kernel-based approaches to relationship extraction were proposed. We give a brief survey of them below. Note that most of these kernels are either unavailable as programs or very similar to at least one of those we selected for our benchmark (see also
In
In
The general sparse subsequence kernel for relation extraction
A mixture of previous approaches was proposed in
In
In
There is no widely accepted definition of the concept of PPI, i.e., what should be annotated as a PPI in text; therefore, methods evaluated on different PPI-annotated corpora are difficult to compare. In
Corpus | Sentences | Positive pairs | Negative pairs |
AIMed | 1955 | 1000 | 4834 |
BioInfer | 1100 | 2534 | 7132 |
HPRD50 | 145 | 163 | 270 |
IEPA | 486 | 335 | 482 |
LLL | 77 | 164 | 166 |
Pairs are checked for (orderless) uniqueness; self-interacting proteins are excluded.
In the same study, an XML-based format was also defined for annotating PPIs, called
The presence or absence of a relation is marked on the level of named entity pairs, not on the level of sentences (cf. attribute
We use various performance measures to evaluate kernel-based classifiers for PPI extraction. On one hand, we report on the standard evaluation measures: precision, recall, and F
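For concreteness, a small sketch (assuming scikit-learn; the labels and scores are made up) of how these measures are computed from a classifier's decisions and confidence scores:

```python
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # gold labels of candidate pairs
y_score = [0.9, 0.4, 0.7, 0.2, 0.1, 0.6, 0.8, 0.3]  # classifier confidences
y_pred = [int(s >= 0.5) for s in y_score]           # thresholded decisions

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
auc = roc_auc_score(y_true, y_score)                # threshold-free ranking quality
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f} AUC={auc:.2f}")  # P=0.75 R=0.75 F1=0.75 AUC=0.81
```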
In this setting, we train and test each kernel on the same corpus using document-level 10-fold cross-validation. We refrain from using the also frequently used instance-level splitting, in which a sentence containing more than two protein names may appear, albeit as differently labeled instances, in both the training and the test sets. This is a clear case of information leakage and compromises the evaluation results. Its impact on PPI results is higher than in many other domains, since sentences in PPI corpora very often contain more than two protein names. We employ the document-level splits that were used by Airola and many others, which allow a direct comparison of results. We indicate the standard deviation of the averaged 10-fold cross-validation values.
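The grouping principle can be sketched as follows (using scikit-learn's GroupKFold for illustration; the actual published splits are fixed files, so this only illustrates how document-level splitting prevents leakage, not the exact folds):

```python
from sklearn.model_selection import GroupKFold

def document_level_folds(features, labels, doc_ids, n_folds=10):
    """Yield train/test index splits such that all candidate pairs
    originating from the same document end up in the same fold."""
    splitter = GroupKFold(n_splits=n_folds)
    for train_idx, test_idx in splitter.split(features, labels, groups=doc_ids):
        yield train_idx, test_idx
```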
Although document-level 10-fold cross-validation has become the de facto standard of PPI relation extraction evaluation, it is also somewhat biased, because the training and the test data sets have very similar corpus characteristics. It was shown
Finally, in CC experiments, we train the model on one corpus and then test on the other four corpora.
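Schematically, the two protocols generate the following train/test combinations (a sketch of the protocols themselves, not of any particular implementation):

```python
CORPORA = ["AIMed", "BioInfer", "HPRD50", "IEPA", "LLL"]

def cross_learning_splits():
    """CL: train on four corpora pooled together, test on the fifth."""
    for test in CORPORA:
        yield [c for c in CORPORA if c != test], test

def cross_corpus_splits():
    """CC: train on a single corpus, test on each of the other four."""
    for train in CORPORA:
        for test in CORPORA:
            if test != train:
                yield train, test
```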
Apart from measuring the quality of the extractions, we also looked at the time it takes to classify the corpora. Whenever the texts to be analyzed are large, classification time may be the decisive factor in choosing a method. However, we did not take particular measures to obtain perfect run times (eliminating all concurrent processes on the machines), so our times should only be considered rough estimates. We should also mention that all tested systems are prototypes whose implementation efficiency may differ significantly. Nevertheless, these figures should be good indicators of what can be expected when using the kernels out-of-the-box. Note that all methods we analyzed also require extra time (in addition to classification) to parse sentences.
All corpora we use for evaluation have all entities readily annotated. This means that our results only measure the performance of PPI extraction and are not influenced by problems of named entity recognition. However, to produce the right format for the kernel methods, we apply entity blinding, that is, we replace named entity occurrences with a generic string. Entity blinding is usually applied in relation extraction systems to (1) inform the classifier about the location of the NEs and (2) ensure the generality of the learned model, since classifiers should work for any entity in the given context. Before doing so, we had to resolve the entity–token mismatch problem.
Syntax and dependency parsers operate on a token-based representation of the sentence text, i.e., the output of tokenization, which is also encoded in the learning format. Entities, however, may not directly match contiguous token sequences; this mismatch has to be resolved to enable entity-based referencing of PPIs. Practically all combinations of entailment and overlap occur in text: one entity may spread over several tokens or correspond to merely a part of a token, and several named entities may occur within one token. We depict some examples of the entity–token mismatch phenomenon in
(1) Named entities may overlap. The string Arp2/3 contains two named entities, namely Arp2 and Arp3. (2) An entity may spread over multiple noncontiguous text ranges. The entity Arp3 paragraph spreads over two ranges [0–2] and [5–5]. (3) Such noncontiguous and overlapping entities may constitute a relation, such as in The Arp2/3 complex….
In order to overcome these difficulties and adopt a clear entity–token mapping concept, we apply the following strategy: every token that at least partly overlaps with an entity is marked as an entity. Entity blinding is performed as follows: A sentence with
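A minimal sketch of this strategy (our illustration; the concrete placeholder strings, and the convention of distinguishing the two proteins of the candidate pair from all other proteins, are assumptions and may differ between kernel implementations):

```python
def blind_entities(tokens, token_spans, entity_spans, pair):
    """Replace entity tokens by generic strings. `token_spans` and
    `entity_spans` hold (start, end) character offsets; `pair` holds the
    indices of the two entities forming the candidate PPI pair."""
    def overlaps(a, b):  # half-open character intervals intersect
        return a[0] < b[1] and b[0] < a[1]

    blinded = []
    for token, span in zip(tokens, token_spans):
        replacement = token
        for i, ent in enumerate(entity_spans):
            if overlaps(span, ent):  # marked as entity on any partial overlap
                if i == pair[0]:
                    replacement = "ENTITY_A"
                elif i == pair[1]:
                    replacement = "ENTITY_B"
                else:
                    replacement = "ENTITY_X"  # other protein in the sentence
                break
        blinded.append(replacement)
    return blinded
```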
Since some of the selected kernel methods, namely the ST, SST, PT, and SpT kernels, are defined on syntax trees, we injected the syntax tree parses into the learning format. The terminal symbols of the syntax tree parses (i.e., tokens) were mapped to the character offsets of the original sentence text. This was necessary for the entity blinding in the constituent tree parse. Finally, the parses were formatted so that they comply with the expectations of the given kernel's implementation (the extended corpus files are available at our web site).
All evaluated methods have several parameters whose settings have a significant impact on performance. To achieve best results, authors often apply an exhaustive systematic parameter search, i.e., a multidimensional fine-grained grid search over myriads of parameter combinations, for each corpus they evaluate on. However, results obtained in this way cannot be expected to carry over to other corpora or to new texts, where such an optimization is not possible. In this study, we take the role of an end-user who has a completely new text and wants to choose a PPI extraction method to extract all PPIs from it. Which parameters should this user apply?
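For reference, the exhaustive optimization mentioned above is, in essence, the following loop (a sketch; the parameter names and value grids are purely illustrative and not those of any particular kernel):

```python
from itertools import product

def grid_search(evaluate, c_values=(0.1, 1, 10), lambda_values=(0.25, 0.5, 0.75)):
    """`evaluate(params)` is assumed to return, e.g., the CV F-score
    obtained on the training corpus with one parameter combination."""
    best_params, best_score = None, float("-inf")
    for c, lam in product(c_values, lambda_values):
        params = {"C": c, "lambda": lam}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```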
Ideally, one could simply use the default parameters of the kernels, leaving the choice of best settings to the authors of the kernels. This was our initial idea, which we had to abandon for two reasons: (1) for some syntax-tree-based kernels (ST, SST, PT), the default regularization parameter of the learner,
Also in CC evaluation, optimization geared towards the test corpus may improve the performance. As shown in
We performed a thorough evaluation of nine different methods for extracting protein-protein interactions from text on five different, publicly available and manually annotated corpora. All methods we studied classify each pair of proteins in a sentence using a kernel function. The methods differ widely in their individual definition of this kernel function (comparing all subtrees, all subsets, all paths,…), use different classifiers, and make use of different types of information (shallow linguistic information, syntax trees or dependency graphs).
We report results in three different scenarios. In
Kernel | AIMed | BioInfer | HPRD50 | IEPA | LLL | |||||||||||||||
AUC | P | R | F | AUC | P | R | F | AUC | P | R | F | AUC | P | R | F | AUC | P | R | F | |
SL | 83.5 | 47.5 | 54.5 | 55.1 | 66.5 | 64.4 | 67.0 | 64.2 | 81.1 | 69.5 | 71.2 | 69.3 | 81.2 | 69.0 | 85.3 | 74.5 | ||||
ST | 68.9 | 40.3 | 25.5 | 30.9 | 74.2 | 46.8 | 60.0 | 52.2 | 63.3 | 49.7 | 67.8 | 54.5 | 75.8 | 59.4 | 75.6 | 65.9 | 69.0 | 55.9 | 70.3 | |
SST | 68.9 | 42.6 | 19.4 | 26.2 | 73.6 | 47.0 | 54.3 | 50.1 | 62.2 | 48.1 | 63.8 | 52.2 | 72.4 | 54.8 | 76.9 | 63.4 | 63.8 | 55.9 | 70.3 | |
PT | 68.5 | 39.2 | 31.9 | 34.6 | 73.8 | 45.3 | 58.1 | 50.5 | 65.2 | 54.9 | 56.7 | 52.4 | 73.1 | 63.1 | 66.3 | 63.8 | 66.7 | 56.2 | 97.3 | 69.3 |
SpT | 66.1 | 33.0 | 25.5 | 27.3 | 74.1 | 44.0 | 68.2 | 53.4 | 65.7 | 49.3 | 71.7 | 56.4 | 75.9 | 54.5 | 81.8 | 64.7 | 50.0 | 55.9 | 70.3 |
kBSPS | 75.1 | 50.1 | 41.4 | 44.6 | 75.2 | 49.9 | 61.8 | 55.1 | 79.3 | 62.2 | 58.8 | 70.5 | 84.3 | 69.3 | 93.2 | |||||
cosine | 70.5 | 43.6 | 39.4 | 40.9 | 66.1 | 44.8 | 44.0 | 44.1 | 74.8 | 59.0 | 67.2 | 61.2 | 75.5 | 61.3 | 68.4 | 64.1 | 75.2 | 70.2 | 81.7 | 73.8 |
edit | 75.2 | 68.8 | 27.7 | 39.0 | 67.4 | 50.4 | 39.2 | 43.8 | 79.2 | 71.3 | 45.2 | 53.3 | 80.2 | 60.2 | 67.1 | 68.0 | 98.0 |
APG | 59.9 | 53.6 | 61.3 | 69.8 | 67.8 | 66.6 | 82.6 | 83.5 | 91 | |||||||||||
APG (with SVM) | 71.2 | 62.9 | 48.9 | 54.7 | 73.9 | 63.4 | 74.1 | 65.4 | 72.5 | 67.5 | 76.2 | 71.0 | 75.1 | 72.1 | 74.9 | 95.4 | ||||
SL | 60.9 | 57.2 | 59.0 |
kBSPS | 67.2 | 49.4 | 44.7 | 46.1 | 76.9 | 66.7 | 80.2 | 70.9 | 75.8 | 70.4 | 73.0 | 70.8 | 78.5 | 76.8 | 91.8 | 82.2 |
cosine | 62.0 | 55.0 | 58.1 |
edit | 77.5 | 43.5 | 55.6 |
APG | 84.8 | 52.9 | 61.8 | 56.4 | 81.9 | 56.7 | 67.2 | 61.3 | 79.7 | 64.3 | 65.8 | 63.4 | 85.1 | 69.6 | 82.7 | 75.1 | 83.4 | 72.5 | 82.2 | 76.8 |
rich-feature-based | 49.0 | 44.0 | 46.0 | 60.0 | 51.0 | 55.0 | 64.0 | 70.0 | 67.0 | 72.0 | 73.0 | 73.0 |
hybrid | 86.8 | 55.0 | 68.8 | 60.8 | 85.9 | 65.7 | 71.1 | 68.1 | 82.2 | 68.5 | 76.1 | 70.9 | 84.4 | 67.5 | 78.6 | 71.7 | 86.3 | 77.6 | 86.0 | 80.1 |
co-occ. | 17.8 | 100.0 | 30.1 | 26.6 | 100.0 | 41.7 | 38.9 | 100.0 | 55.4 | 40.8 | 100.0 | 57.6 | 55.9 | 100.0 | 70.3 |
RelEx | 40.0 | 50.0 | 44.0 | 39.0 | 45.0 | 41.0 | 76.0 | 64.0 | 69.0 | 74.0 | 61.0 | 67.0 | 82.0 | 72.0 | 77.0 |
The first two blocks contain the results of our evaluation, the third block contains corresponding results of kernel approaches from the literature, and the fourth block shows some non-kernel-based baselines. Bold typeface shows our best results for a particular corpus (differences under 1 base point are ignored). AUC, precision, recall, and F
† instance-level CV.
In the case of the AIMed corpus, there are different interpretations regarding the number of interacting and non-interacting pairs
In the case of the cosine and edit kernels, the figures reported in the original paper were achieved with instance-level CV (personal communication; not mentioned in the original paper). As noted earlier in the literature
We attribute smaller differences in F-score to the fact that we used a different parameter optimization than the original works. This is, for instance, the case for kBSPS (our own implementation) and APG. However, recall that parameter tuning always carries the danger of overfitting to the training data. The relative performance of different kernels in our results should be fairly robust due to the use of the same tuning strategy for all kernels, while better results can be achieved by performing further corpus-specific tuning. Interestingly, for APG, we obtained better F-score and AUC values than the published ones for two of the five corpora.
Based on the results in
Kernel | AIMed | BioInfer | HPRD50 | IEPA | LLL | |||||||||||||||
AUC | P | R | F | AUC | P | R | F | AUC | P | R | F | AUC | P | R | F | AUC | P | R | F | |
SL | 28.3 | 42.6 | 62.8 | 36.5 | 46.2 | 78.0 | 56.9 | 68.7 | 62.2 | 75.6 | 71.0 | 52.5 | 60.4 | 79.5 | 79.0 | 57.3 | 66.4 | |||
SpT | 56.8 | 20.3 | 48.4 | 28.6 | 64.2 | 38.9 | 43.0 | 60.4 | 44.7 | 56.6 | 54.2 | 41.6 | 19.6 | 15.5 | 50.5 | 48.2 | 61.2 | |||
kBSPS | 72.1 | 28.6 | 68.0 | 40.3 | 73.3 | 62.2 | 38.5 | 78.3 | 61.7 | 74.2 | 67.4 | 81.0 | 72.8 | 83.7 | 75.0 | |||||
cosine | 65.4 | 27.5 | 59.1 | 37.6 | 61.3 | 42.1 | 32.2 | 36.5 | 71.2 | 63.0 | 56.4 | 59.6 | 57.0 | 46.3 | 31.6 | 37.6 | 66.9 | 80.3 | 37.2 | 50.8 |
edit | 62.8 | 26.8 | 59.7 | 37.0 | 61.0 | 53.0 | 22.7 | 31.7 | 60.7 | 58.1 | 55.2 | 56.6 | 62.1 | 58.1 | 45.1 | 50.8 | 57.6 | 68.1 | 48.2 | 56.4 |
APG | 77.5 | 69.6 | 58.1 | 29.4 | 39.1 | 76.1 | 48.1 | 59.6 | 62.2 | 72.3 | ||||||||||
Fayruzov | 72.0 | 40.0 | 70.0 | 31.0 | 75.0 | 56.0 | 68.0 | 29.0 | 74.0 | 39.0 |
Classifiers are trained on the ensemble of four corpora and tested on the fifth one. Rows correspond to test corpora. Best results are typeset in bold (differences under 1 base point are ignored). We show for reference the results with the combined full kernel of
The overall trend from CV to CL confirms our expectation: performance drops significantly, sometimes by more than 15 points. The most stable is the kBSPS kernel (average drop in AUC: 1.12, F: 2.84); in a few cases its CL results even outperform the CV ones (also seen with APG on HPRD50). The SL and APG kernels show a modest drop in AUC (4.5 and 2.82), which is larger for the F-score (9.28 and 10.22). Cosine and edit suffer the most significant drops.
We can form two groups of kernels based on their CL performance. The first consists of SpT, cosine, and edit; presumably the other syntax-tree-based kernels belong here as well. SpT is clearly the worst in this comparison. Two outlier corpora are BioInfer and IEPA: on the former, SpT is on par with the other kernels, while on the latter it achieves a very low value due to its extremely low recall. Cosine and edit are just somewhat better than SpT, particularly on AIMed and IEPA. Their AUC scores are mostly just above 60%, and their F-scores outperform the co-occurrence method only on AIMed. On IEPA and LLL, all three F-scores are inferior to the co-occurrence baseline.
The other group consists of the SL, kBSPS, and APG kernels. The SL kernel produced the least divergent values across the five corpora in terms of both major evaluation measures. It shows performance comparable with the best kernels on the two larger corpora, but is somewhat inferior on the three smaller ones. The AUC values of our kBSPS kernel improve with decreasing test corpus size, and are comparable on most corpora with those of the SL and APG kernels, except for AIMed (
Finally, it is interesting to compare the performance of the better group with RelEx, the rule-based baseline (which requires no learning at all). We can see that on most corpora, only the best kernel-based method is comparable with RelEx, and except on BioInfer, the difference is a mere few points.
Kernel | Training corpus | AIMed | BioInfer | HPRD50 | IEPA | LLL | |||||||||||||||
AUC | P | R | F | AUC | P | R | F | AUC | P | R | F | AUC | P | R | F | AUC | P | R | F | ||
SL | AIMed | (83.5) | (47.5) | (65.5) | (54.5) | 66.8 | 72.9 | 61.7 | 56.4 | 59.0 | 68.8 | 66.3 | 15.8 | 25.5 | 72.6 | 86.4 | 23.2 | 36.5 | |||
BioInfer | 27.2 | (81.1) | (55.1) | (66.5) | (60.0) | 74.8 | 51.0 | 78.5 | 61.8 | 76.6 | 63.3 | 64.8 | 64.0 | 80.5 | 71.5 | 78.0 | 74.6 | ||||
SpT | AIMed | (66.1) | (33.0) | (25.5) | (27.3) | 69.5 | 48.4 | 16.3 | 24.3 | 60.0 | 47.1 | 39.9 | 43.2 | 67.9 | 59.7 | 11.0 | 18.6 | 57.0 | 72.7 | 29.8 | 17.2 |
BioInfer | 65.3 | 22.3 | 77.8 | 34.7 | (74.1) | (44.0) | (68.2) | (53.4) | 57.2 | 41.4 | 67.5 | 51.3 | 69.9 | 61.2 | 52.2 | 56.4 | 55.7 | 54.2 | 62.8 | 58.2 | |
kBSPS | AIMed | (75.1) | (50.1) | (41.4) | (44.6) | 69.9 | 71.6 | 15.0 | 24.8 | 76.8 | 77.5 | 38.0 | 51.0 | 73.6 | 66.7 | 25.4 | 29.9 | 75.1 | 85.7 | 27.3 | 13.5 |
BioInfer | 71.8 | 65.6 | 40.3 | (75.2) | (49.9) | (61.8) | (55.1) | 61.0 | 81.6 | 67.4 | 78.2 | 76.8 | |||||||||
edit | AIMed | (75.2) | (68.8) | (27.7) | (39.0) | 67.5 | 28.8 | 15.9 | 24.5 | 38.3 | 71.1 | 23.9 | 27.5 | 73.2 | 75.0 | 21.8 | 3.6 | ||||
BioInfer | 66.9 | 58.4 | 39.6 | (67.4) | (50.4) | (39.2) | (43.8) | 72.7 | 59.4 | 65.6 | 62.4 | 69.3 | 61.1 | 55.8 | 58.4 | 66.9 | 69.0 | 54.3 | 60.8 | ||
APG | AIMed | (84.6) | (59.9) | (53.6) | (56.2) | 66.0 | 56.5 | 14.0 | 22.5 | 74.1 | 52.8 | 61.6 | 73.1 | 69.2 | 13.4 | 22.5 | 82.7 | 29.8 | 17.6 | ||
BioInfer | 71.2 | 24.7 | 81.8 | 37.9 | (81.5) | (60.2) | (61.3) | (60.7) | 76.0 | 49.3 | 62.1 | 61.7 | 70.7 | 82.0 | 69.0 | 76.3 |
CC results trained on the 3 smaller corpora are shown in the Supplement,
Our expectation that average CC performance would be worse than CL performance because of the smaller training data size was, overall, not confirmed. On the one hand, the average performance measured across all four possible training corpora drops for the SL, kBSPS, and APG kernels (the magnitude of the drop increases in this order), while it increases for SpT and edit, so the difference between the performance of these groups shrinks. On the other hand, the average CC F-score belonging to the best training corpus is somewhat better than the average CL F-score also for SL, kBSPS, and APG, while AUC decreases slightly.
The CC results show large performance differences for most kernels depending on the training corpus. From cross-corpus evaluation, we can estimate which corpus is the best resource from a generalization perspective. We rank each training corpus for each kernel and average these numbers to obtain an overall rank (
The rankings are calculated as the average of the rankings of the 5 selected kernels (see
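The averaging itself is straightforward; a sketch under the assumption that per-kernel CC scores (F-score or AUC, averaged over test corpora) per training corpus are available:

```python
from statistics import mean

def average_rank(scores_by_kernel):
    """scores_by_kernel: {kernel: {training_corpus: average CC score}}.
    Rank training corpora per kernel (1 = best), then average the ranks."""
    corpora = sorted(next(iter(scores_by_kernel.values())))
    ranks = {c: [] for c in corpora}
    for scores in scores_by_kernel.values():
        for rank, corpus in enumerate(sorted(corpora, key=lambda c: -scores[c]), start=1):
            ranks[corpus].append(rank)
    return {c: mean(r) for c, r in ranks.items()}
```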
We performed a systematic benchmark of nine different methods for the extraction of protein-protein interactions from text using five different evaluation corpora. All figures we report were produced using locally installed, trained, and tuned systems (the packages are available in the online appendix). In almost all cases, our results in cross-validation are well in line with those published in the respective original papers; only in some cases did we observe differences larger than 2%, and those could be attributed to different evaluation methods and different tuning procedures (see
Taking all our results into account (summarized in
AIMed | BioInfer | |||
Kernel | AUC | F | AUC | F |
CV/CL/CC | CV/CL/CC | CV/CL/CC | CV/CL/CC | |
SL | 83.5/ | 54.5/42.6/ | | |
SpT | 66.1/56.8/65.3 | 27.3/28.6/34.7 | 74.1/64.2/69.5 | 53.4/43.0/24.3 |
kBSPS | 75.1/72.1/71.8 | 44.6/40.3/40.3 | 75.2/73.3/69.9 | 55.1/ |
edit | 75.2/62.8/66.9 | 39.0/37.0/39.6 | 67.4/61.0/67.5 | 43.8/31.7/15.9 |
APG |
CC results for AIMed (resp. BioInfer) are obtained with a classifier trained on BioInfer (resp. AIMed).
The performance of the other six kernels is clearly weaker. Kernels using syntax trees are on par with simple co-occurrence for CV, and their performance decreases drastically at CL evaluation. Cosine and edit kernels are slightly better than co-occurrence in CV, but their performance also drops significantly in CL evaluation.
AUC and F-score values on AIMed and BioInfer. CC values are obtained by training on the other large corpus, though in some cases training on a smaller corpus may yield better results.
The performance of machine-learning methods for PPI extraction largely depends on the specific relationship between the training data and the data the method is later used on (test data). If these two data sets exhibit large differences, then evaluation results obtained using only the training data will differ greatly from those obtained when applying the trained model to the test data. Differences can lie, among other things, in the style of writing, the frequency of certain linguistic phenomena, or the level of technical detail in the texts. For the case of PPI, important differences are the ratio between sentences containing a PPI and those that do not, or the implicit understanding of what actually is a PPI; this might, for instance, include or exclude temporary interactions, protein transport, or functional associations that only hint at, but do not prove, a physical contact; see
Our experiments in the CL and CC settings show, in accordance with results obtained by others
However, the other corpora are not homogeneous either. This becomes especially clear when comparing CV results with those from CL and CC evaluations. As explained before, in CV all characteristics of the test corpus are also present in the training corpus and are thus learned by the algorithms; in contrast, in CL and CC this is not the case. The relatively large differences in the obtained performance measures indicate that different corpora have notably different characteristics. As any new text that PPI extraction algorithms would be applied on has unknown characteristics, we conclude that only the performance we measured for CL and CC can be expected on such texts.
We evaluated kernels based on shallow linguistic features, syntax tree, dependency graph, and mixtures of these three. Our results clearly show that syntax trees are less useful than the other representations. Recall that syntax trees contain no explicit information about the semantic relations of connected nodes, which apparently is crucial in a relation extraction task. For the other types of data, the picture is less clear.
Several authors claimed that using more types of information yields better performance
In contrast to the APG and kBSPS kernels, the SL kernel does not use any deep parse information. Nevertheless, it produces results comparable with APG and better than kBSPS in cross-validation. Its superiority over kBSPS vanishes in cross-learning, however. This change may be attributed to the decreasing usefulness of shallow linguistic features, including word sequences, when the model is trained on a more heterogeneous corpus.
Our results also show that the descriptive power of dependency graph parses can only be exploited when combined with an appropriate kernel. The cosine and edit kernels are unable to efficiently capture the features of dependency graphs. In the case of the former, the shortcoming may be attributed to the fact that cosine similarity does not take word order into account. The handicap of the latter can be explained by the weighting scheme applied in the path distance calculation: its uniform, grammar-independent weighting disregards grammatical rules and structures, and thus the semantics of the underlying text.
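The order-independence is easy to demonstrate (our example): for cosine similarity over token count vectors, swapping the two proteins, and hence the direction of the relation, leaves the similarity unchanged.

```python
import math
from collections import Counter

def cosine_sim(tokens_a, tokens_b):
    """Cosine similarity of token count vectors (ignores word order)."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# Agent and target are indistinguishable: both orderings yield similarity 1.0.
print(cosine_sim("ENTITY1 phosphorylates ENTITY2".split(),
                 "ENTITY2 phosphorylates ENTITY1".split()))
```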
Recently, some authors criticized the F-score as performance measure, because it is very sensitive to the ratio of positive/negative pairs in the corpus
On the other hand, one must keep in mind that AUC is a statement about the general capabilities of a PPI extraction method that must not be confused with its expected performance on a concrete problem. For a concrete task, a concrete set of parameters has to be chosen, while AUC expresses a measure over a range of parameter settings. When a user wants to analyze a set of documents, one can probably safely advise her to prefer kernels with a higher average AUC, but the achieved performance will depend very much on the concrete parameters chosen. We also show via the APG-SVM experiment that the AUC score depends very much on the learning algorithm of the classifier, and only partially on the kernel. Therefore, the (less stable) F-score actually gives a better picture of the expected performance on new texts.
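The differing sensitivity of the two measures to the class ratio can be reproduced with a small simulation (ours, with made-up score distributions): keeping the classifier's per-class score distributions fixed and changing only the positive/negative ratio lowers the F-score while leaving AUC essentially unchanged.

```python
import random
from sklearn.metrics import f1_score, roc_auc_score

random.seed(0)

def sample_scores(n_pos, n_neg):
    """Identical per-class score distributions, independent of the ratio."""
    pos = [random.gauss(0.65, 0.15) for _ in range(n_pos)]
    neg = [random.gauss(0.35, 0.15) for _ in range(n_neg)]
    return [1] * n_pos + [0] * n_neg, pos + neg

for n_neg in (1000, 5000):  # only the positive/negative ratio changes
    y, scores = sample_scores(1000, n_neg)
    y_pred = [int(s >= 0.5) for s in scores]
    print(n_neg, round(f1_score(y, y_pred), 2), round(roc_auc_score(y, scores), 2))
# F-score drops as negatives dominate (precision falls); AUC stays near 0.92.
```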
We investigated the robustness of the different kernels against parameter settings. To this end, we performed exhaustive, fine-grained parameter optimization for selected tasks and measured the difference to the parameter setting used in the benchmark. The resulting picture is quite heterogeneous.
The SL kernel in principle has a number of parameters, but the implementation we were provided with by the authors always uses a default setting (which yields sound results). Therefore, we could not test the robustness of SL in terms of parameter settings.
When using task-specific parameter tuning in CV for syntax-tree-based kernels, an improvement of 3 (5) points can be achieved in AUC (F-score). The magnitude of improvement is larger in CL, but the figures remain low. On the other hand, with improper parameter settings, the F-score may drop drastically, even to 0. Overall, syntax-tree-based kernels are very sensitive to parameter settings.
Fine-grained parameter tuning improves kBSPS results only insignificantly (1–3 points of improvement in both AUC and F-score). A similarly small drop can be observed in CV evaluation if the parameters are selected improperly, while in CL evaluation the drop gets larger and reaches 10–15 F-score points. Consequently, we can state that kBSPS is fairly robust to parameter selection.
Cosine and edit show significantly better (high 60s/low 70s) AUC values with task-specific parameter tuning in CL evaluation, but those settings cause a dramatic F-score decrease (cosine: 20–25, edit: 6–12 points). In CV evaluation, the trend is similar, but the extent of the changes is smaller. In summary, cosine and edit should also be considered sensitive to parameter settings.
The performance of APG hardly changes (1–2 points) if the parameters are set differently in CV; the F-score drop is somewhat larger in CL. On the other hand, a major F-score drop can be observed when the threshold parameter is not optimized. When trained with an SVM, APG becomes even more sensitive to the right selection of parameters.
The runtime of a kernel-based method has two main components. First, the linguistic structures have to be generated. Previous experiments show
We give an overview of the theoretical complexity of each kernel in the Supporting Information (
Runtime is a strong argument when it comes to the application of a PPI extraction method to large corpora. Consider the top-3 kernels APG, SL, and kBSPS. When applied to all of Medline with its approximately 120M sentences, one would expect runtimes of 45, 141, and 4 days, respectively, on a single processor and I/O stream. Taking also into account the computation of shallow parses and dependency trees (on average 4 ms and 130 ms per sentence, respectively), these times change to 226, 147, and 185 days; thus the formerly large differences almost vanish. Clearly, the exact times depend on the hardware that is used, but the ratios should stay roughly the same. The figures imply that an application of kernel-based methods to PPI extraction on large corpora is only possible if considerable computational resources are available.
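For transparency, the arithmetic behind these estimates (the numbers are taken from the text above; the script itself is ours):

```python
SENTENCES = 120e6                   # approximate number of Medline sentences
DAY = 86400.0                       # seconds per day

def parse_days(seconds_per_sentence):
    return SENTENCES * seconds_per_sentence / DAY

shallow, dependency = 0.004, 0.130  # average parse time per sentence (s)
for kernel, classify_days, parse_s in (("APG", 45, dependency),
                                       ("SL", 141, shallow),
                                       ("kBSPS", 4, dependency)):
    print(f"{kernel}: ~{classify_days + parse_days(parse_s):.0f} days incl. parsing")
# APG: ~226, SL: ~147, kBSPS: ~185 days, matching the estimates above.
```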
The SL kernel uses only shallow linguistic information plus the usual bag-of-words features. Taking parse time into account, this kernel is the fastest among all we tested. Despite its simplicity, its performance is remarkable. It is on par with the best kernels in most of the evaluation settings, and yields particularly good results in CV, all with default parameter settings. Furthermore, its performance is the most robust on the two larger corpora across CV, CL, and CC evaluation in terms of both AUC and F-score.
Syntax-tree-based kernels (ST, SST, PT, SpT) fail to achieve competitive performance. They hardly reach the baseline even in CV evaluation. They are also very sensitive to parameter settings and have a long runtime. Their results depend heavily on the particular training/test corpora and therefore cannot be extrapolated safely to new texts.
The kBSPS kernel achieves an overall very good performance, particularly in the more important CL and CC evaluations. Its performance decreases the least when CL evaluation is used instead of CV. It is very robust against parameter settings and achieves very good results with default parameters. Furthermore, it is by far the fastest kernel among all that use rich linguistic information.
The cosine and edit kernels, though using dependency trees, show significantly worse performance than the top-3 kernels. They are also very sensitive to parameter settings. Their runtime is double that of the other dependency tree kernels. In
APG shows the best performance in the CV setting, but its superiority vanishes in the more important CL and CC settings. It uses a different learner than the other kernels, one that optimizes for AUC. Consequently, its AUC results are the best, but its F-score values are also good (in CV, and partly in CL). Recall that when APG is trained with an SVM, its AUC performance drops significantly compared to APG-RLS. This reflects the fact that RLS specifically optimizes for AUC; in turn, one can expect other kernels to also obtain better results if RLS learning were applied. APG is rather sensitive to the evaluation setting, where it exhibits the largest drop among the top-3 kernels. It is robust to parameter settings except for the threshold of the RLS procedure, but becomes very sensitive to parameters when trained with an SVM. The classification is quite fast, but with the necessary preprocessing, it becomes the slowest of the top-3 kernels.
We investigated nine kernel-based methods for the extraction of PPIs from scientific texts. We studied how these methods behave in different evaluation settings, using different parameter sets, and on different gold standard corpora. We showed that even the best performing kernels, requiring extensive parameter optimization and large training corpora, cannot be considered significantly better than a simple rule-based method which needs no training at all and has essentially no parameters to tune. We also showed that the characteristic features of PPIs can be extracted much more efficiently by kernels based on dependency tree parses than by those based on syntax tree parses. Interestingly, the SL kernel, using only shallow linguistic analysis, is almost as good as the best dependency-based kernels. We pointed out that the advantage of the APG kernel, which uses multiple representations as features, vanishes in a realistic evaluation scenario when compared to the simpler kBSPS and SL kernels.
The ultimate goal of this study was to select the best PPI extraction method for real applications and to generate performance estimates for this method (and others) on new text. We state that this goal was not achieved, for mostly two reasons. First, the performance of the methods we studied is very sensitive to parameter settings, evaluation method, and evaluation corpus. Best scores are only achieved when settings are optimized against a gold standard, something that is not possible on unseen text. Our results reveal that some methods apparently are better than others, but a clear-cut winner is not detectable given the spread of results. Second, the heterogeneity between corpora leads to extremely heterogeneous evaluation results, showing that all methods strongly adapt to the training set, and that, in turn, the existing training corpora are not large or not general enough to capture the characteristics of the respective other corpora. This implies that any extrapolation of the observed scores (AUC or F-score) to unseen texts is questionable.
We believe that these findings call for a number of actions. First, there is a strong need to create larger and better characterized evaluation corpora. Second, we think that there is also a need to complement the currently predominant approach, treating all interactions as equally important, with more specific extraction tasks. To this end, it is important to create specialized corpora, such as those for the extraction of regulation events or for protein complex formation. The more specific a question is, the simpler it is to create representative corpora, leading to better models, often higher extraction performance and better comparability of methods. For instance, works like
Overview of the evaluated kernels. Overview of the nine kernels evaluated in the paper.
(0.07 MB PDF)
Other kernels considered. Overview of other kernel-based methods in the literature that we did not test in the paper.
(0.06 MB PDF)
Overview of the usability of the different kernels. Some details on the nine evaluated kernels: availability of the algorithm and documentation, type of learning software.
(0.06 MB PDF)
Overview of our parameter selection strategy. Overview of our parameter selection strategy used in the paper. We provide a coarse description of parameter ranges and best parameters for each kernel and evaluation setting.
(0.07 MB PDF)
Cross-corpus results. Full table of cross-corpus results trained on all 5 corpora and evaluated on all nine kernels.
(0.08 MB PDF)
Ranking of corpora in CC evaluation based on their AUC and F-score values. We ranked the corpora from the generality perspective, i.e., how well systems trained on a specific corpus generalize. The evaluation is based on the AUC and F-score values in CC evaluation.
(0.07 MB PDF)
Cross-learning experiments with selected kernels performed on 4 corpora (all but AIMed). Classifiers are trained on the ensemble of three corpora and tested on the fourth one.
(0.07 MB PDF)
CV results with transductive SVM for kBSPS, edit, cosine kernels. Results with the transductive learning strategy for some selected kernels.
(0.06 MB PDF)
Average runtime of training and test processes, and runtime estimates on entire Medline. Average runtime of training and test processes per corpus measured over all cross-validation experiments for each kernel (not including the parsing time at pre-processing), and rough runtime estimates on the entire Medline.
(0.06 MB PDF)
Similarity function in kBSPS kernel. Definition of similarity score used in kBSPS kernel.
(0.07 MB PDF)
Additional experiments. We provide here details of two additional experiments: (1) cross-learning (CL) without AIMed, i.e., systems are trained on 3 corpora and tested on the fourth one; (2) models trained with transductive SVM.
(0.05 MB PDF)
Theoretical complexity of kernels. We provide here details on the computational complexity of kernels.
(0.04 MB PDF)
We thank all authors of the kernel methods we discuss for providing code and numerous hints on how to install and use the systems. We particularly thank Antti Airola for intensive discussions on benchmarking PPI extraction in general and in numerous special cases. We also thank the anonymous reviewers for their valuable comments.