Conceived and designed the experiments: JAGR. Performed the experiments: JAGR IM. Analyzed the data: JAGR IM FSJ CO. Contributed reagents/materials/analysis tools: JAGR IM JGL AJR CY ABC. Wrote the paper: JAGR IM JGL AJR CY ABC FSJ CO.
The authors have declared that no competing interests exist.
Accurate modelling of biological systems requires a deeper and more complete knowledge about the molecular components and their functional associations than we currently have. Traditionally, new knowledge on protein associations generated by experiments has played a central role in systems modelling, in contrast to generally less trusted bio-computational predictions. However, we will not achieve realistic modelling of complex molecular systems if the current experimental designs lead to biased screenings of real protein networks and leave large, functionally important areas poorly characterised. To assess the likelihood of this, we have built comprehensive network models of the yeast and human proteomes by using a meta-statistical integration of diverse computationally predicted protein association datasets. We have compared these predicted networks against combined experimental datasets from seven biological resources at different levels of statistical significance. These eukaryotic predicted networks resemble all the topological and noise features of the experimentally inferred networks in both species, and we also show that this observation is not due to random behaviour. In addition, the topology of the predicted networks contains information on true protein associations, beyond the constitutive first order binary predictions. We also observe that most of the reliable predicted protein associations are experimentally uncharacterised in our models, constituting the hidden or “dark matter” of networks by analogy to astronomical systems. Some of this dark matter shows enrichment of particular functions and contains key functional elements of protein networks, such as hubs associated with important functional areas like the regulation of Ras protein signal transduction in human cells. Thus, characterising this large and functionally important dark matter, elusive to established experimental designs, may be crucial for modelling biological systems.
In any case, these predictions provide a valuable guide to these experimentally elusive regions.
To model accurate protein networks we need to extend our knowledge of protein associations in molecular systems much further. Biologists believe that high-throughput experiments will fill the gaps in our knowledge. However, if these approaches perform biased screenings, leaving important areas poorly characterized, success in modelling protein networks will require additional approaches to explore these ‘dark’ areas. We assess the value of integrating bio-computational approaches to build accurate and comprehensive network models for human and yeast proteomes and compare these models with models derived by combining multiple experimental datasets. We show that the predicted networks resemble the topological and error features of the experimental networks, and contain information on true protein associations within and beyond their constitutive first order binary predictions. We suggest that the majority of predicted network space is dark matter containing important functional areas, elusive to current experimental designs. Until novel experimental designs emerge as effective tools to screen these hidden regions, computational predictions will be a valuable approach for exploring them.
Many features of biological systems cannot be inferred from a simple sum of their components but rather emerge as network properties
The scarce knowledge of biological systems is further compounded by experimental error. It is common for different high-throughput experimental approaches, applied to the same biological system, to yield different outcomes, resulting in protein networks with different topological and biological properties
There has been a great deal of work analysing biological networks across different species, giving insights into how networks evolve. However, many of these publications have yielded disparate and sometimes contradictory conclusions. Observation of poor overlap in protein networks across species
Increasing the accuracy of networks by integrating different protein interaction data relies on the intuitive principle that combining multiple independent sources of evidence gives greater confidence than a single source. For any genome-wide computational analysis, we expect the prediction errors to be randomly distributed amongst a large sample of true negative interactions (i.e. the universe of protein-protein interactions that do not take place). Hence, it is unlikely that two independent prediction methods will both identify the same false positive data in large interactomes like yeast or human. In general, we expect the precision to increase proportionally to the number of independent approaches supporting the same evidence.
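The independence argument can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the example numbers (one million non-interacting pairs and a 1% false positive rate per method) are assumptions chosen for the example, not values from this study.

```python
def expected_shared_false_positives(n_negative_pairs, fpr_a, fpr_b):
    """Expected number of non-interacting pairs wrongly predicted by BOTH methods,
    assuming the two methods make their errors independently of each other."""
    return n_negative_pairs * fpr_a * fpr_b


# Hypothetical numbers, for illustration only.
n_negatives = 1_000_000
single = n_negatives * 0.01
shared = expected_shared_false_positives(n_negatives, 0.01, 0.01)
print(f"false positives from one method alone: {single:.0f}")
print(f"false positives shared by two independent methods: {shared:.0f}")
```

Under these assumptions, requiring agreement between two independent methods reduces the expected number of false positives by a factor equal to the inverse of the second method's false positive rate.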
From the available list of well-known integration methods specifically designed to integrate diverse protein-protein interaction -PPI- datasets (e.g. Naïve-Bayes; SVM; etc.
In this work, we significantly increase the prediction power of binary protein functional associations in yeast and human proteomes by integrating different individual prediction methods using the Fisher integration method. Three different untrained methods are implemented: GECO (Gene Expression COmparison); hiPPI (homology inherited Protein-Protein Interactions); and CODA (Co-Occurrence Domain Analysis) run with two protein domain classifications, CATH
Protein pairs identified by significant Fisher integration p-values were used to build a protein network model for yeast and human proteomes referred to as the Predictogram (PG). Additionally, all the protein-protein associations from several major biological databases, including Reactome
There have been frequent observations of low overlaps between different experimental high-throughput approaches
By analogy
The results are divided into four main sections in which the predicted and experimental PPI models of human and yeast are compared. The first section analyses the performance of the single and integrated methods in predicting protein associations and determines the correlation between the prediction scores and the degree of accuracy and noise in the predictions. The second section compares the topological network features of the predicted and experimental PPI models at equivalent levels of accuracy and noise. The third section searches for functional differences between the predicted and experimental models, looking for specific functional areas which appear to be illuminated by the prediction methods but elusive to the experimental approaches. The fourth and final section explores whether the predicted PPI network graphs contain additional context-based information on protein associations beyond the sets of predicted protein pairs used to build the networks.
The different methods for predicting protein-protein interactions and functional associations were run on the whole yeast and human proteomes, generating four prediction datasets for each organism, GECO, CODAcath, CODApfam and hiPPI (see the section: Running the PG methods on the human and yeast proteomes and section 1 in
Benchmark datasets for each organism, comprising reliable protein pairs based on Gene Ontology Semantic Similarity scores (referred to as the Goss refined – Gossr datasets; see the section: The GO Semantic Similarity refined dataset (Gossr) used for validating the prediction methods), were used to assess performance (note that the performance measured will depend on the quality of the validation dataset; see section 2 in
A and C plots are from Yeast datasets and B and D are for Human results. A and B plots show precision versus p-values and C and D graphs show precision versus recall. Inset to the C plot shows an enlargement to visualize the improvements obtained by using the Fisher integration in yeast.
The mutual information scores demonstrate the independence of the 4 different prediction datasets (see section 3 in
Precision (TP/(TP+FP)) versus Recall (considered here as the number of predicted hits) is plotted for yeast and human Gossr validation (
We have performed additional validation of the Fisher method using a set of physical interacting pairs as gold standards in yeast and human (see section 4 in
Fisher was also implemented to integrate similar datasets of individual STRING
In all cases (yeast and human) Fisher integration of the GECO, CODAcath, CODApfam and hiPPI predictions was shown to be a powerful combination which significantly increases the prediction power without using any KG trained or supervised algorithms. This premise is crucial if we aim to detect genuine similarities between the PG and KG models, unbiased by overlap between supervised predictions and their training sets (as would occur by using a Bayes integration). Because of this the Fisher weighted predictions were chosen for generating the PG network models used in subsequent analyses of the networks.
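For readers unfamiliar with the approach, the classical (unweighted) Fisher combined probability test can be sketched as follows. This is a minimal illustration and not the Fisher_W implementation used in this work: the weighting scheme is not reproduced, the function name is hypothetical, and the example p-values are invented.

```python
import math


def fisher_combine(p_values):
    """Unweighted Fisher combination of k independent p-values.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null hypothesis. For an even number of
    degrees of freedom the upper tail has a closed form, so only the
    standard library is needed.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    # P(chi2 with 2k d.o.f. > X) = exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


# Hypothetical p-values for one protein pair from four prediction methods.
pair_p_values = [0.04, 0.20, 0.03, 0.50]
print(fisher_combine(pair_p_values))
```

The combined p-value is small only when several methods agree, which is the property exploited when independent predictors are integrated without training on known associations.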
To test whether the PG networks based on the binary predictions share features with networks built on reliable KG evidences, different topological parameters (see section 7 in
Different PG networks were constructed from the binary predictions by varying the link (edge) p-value cut-off. This was done for a range of p-values from p-value≤0.001 (PG0.001) to p-value≤1.0 (PG1.0). KG network models were also tested at different levels of confidence based on the number of KG evidences supporting the same protein-protein associations. Mutual information calculation on the KG data showed broad independence except for the Goss and Foss (FunCat semantic similarity) datasets, therefore Goss and Foss evidences were summed and considered as a single dataset of KG evidences. Different KG networks were constructed by varying the minimum number of independent evidences required to form an edge/link. Random models were also generated for all the PG and KG networks as described in the section: Network randomisation. The PG, KG and their corresponding randomised networks, built at different significance levels, provide comparable frameworks for examining the topological properties of biological networks.
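The construction of networks at different reliability levels can be sketched as a simple thresholding step. The code below is an illustrative sketch: the function names, the dictionary layout and the toy protein identifiers are assumptions, not the data structures used in this study.

```python
def pg_network(predictions, p_cutoff):
    """Edges of a PG network: predicted pairs whose integrated p-value is <= p_cutoff."""
    return {frozenset(pair) for pair, p in predictions.items() if p <= p_cutoff}


def kg_network(evidence_counts, min_evidences):
    """Edges of a KG network: pairs supported by at least min_evidences resources."""
    return {frozenset(pair) for pair, n in evidence_counts.items() if n >= min_evidences}


# Toy data with hypothetical protein identifiers.
predictions = {("P1", "P2"): 0.0005, ("P1", "P3"): 0.008, ("P2", "P4"): 0.3}
evidence_counts = {("P1", "P2"): 3, ("P2", "P4"): 1}

for cutoff in (0.001, 0.01, 1.0):
    print(f"PG{cutoff}: {len(pg_network(predictions, cutoff))} edges")
for minimum in (1, 2):
    print(f"KG>={minimum} evidences: {len(kg_network(evidence_counts, minimum))} edges")
```

Storing each edge as a `frozenset` makes the pair unordered, so (A, B) and (B, A) count as the same association.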
Real biological networks have been shown to have a scale-free topology with a high degree of clustering
Panels A and B correspond to the KG and PG networks, respectively; the legends of these panels show the correlation coefficients and exponents from the linear regression fit of the data. The corresponding randomised networks are shown below for the KG (panels C, E) and PG (panels D, F) networks. Panels C and D are from network randomisations by the adjacency method (see the section: Network randomisation). Panels E and F are from randomisations by evidence shuffling and p-value shuffling, respectively.
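The exponents and correlation coefficients reported for degree distributions of this kind come from a linear regression on log-log axes. A minimal sketch of such a fit is given below, assuming an ordinary least-squares line through (log10 k, log10 frequency of k); the function names and the synthetic degree list are illustrative, and this simple estimator is not necessarily the exact procedure used to produce the figures.

```python
import math
from collections import Counter


def node_degrees(edges):
    """Degree (ki) of every node in an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def fit_power_law(degrees):
    """Least-squares fit of log10(frequency) against log10(k).

    Returns (slope, r): the slope estimates the power-law exponent and r is
    the Pearson correlation coefficient of the log-log points.
    """
    counts = Counter(k for k in degrees if k > 0)
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c) for c in counts.values()]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two distinct degree values")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("degenerate distribution, cannot fit a line")
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Synthetic degree list whose frequencies follow k^-2 (illustration only).
synthetic = [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10
slope, r = fit_power_law(synthetic)
print(f"exponent estimate: {slope:.2f}, correlation: {r:.2f}")
```

A randomised network would be expected to give a poor linear fit on these axes, which is the contrast the panels are designed to show.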
Yeast and human KG and PG models show non-random distributions of their degree (ki) frequencies for all levels of network reliability tested, except for the lowest level (
Power-law degree distribution is a necessary but not a sufficient characteristic of scale free networks. Therefore, other topological features of the KG and PG networks were measured in order to give more support to the hypothesis of scale free tendency for our models. These included: average clustering coefficient; assortativity; or network hierarchy amongst other parameters described in the section: Network topology structure characterisation and the section 7 in
The trend of increasing average clustering coefficient with increasing network reliability (KG and PG network models built at more highly significant p-values and # evidences levels) lends further support to the scale-free organization of the KG and PG networks in yeast and human (see Figure S6d–Figure S8d in
Network hierarchy is another topological feature that can be considered by using the logarithmic distribution of the clustering parameter
KG models represent the known (experimentally determined) protein associations while PG models represent sets of associations predicted by
We used the most reliable (precision≥80%) PG models (PG0.01 in yeast, about 90,000 pairs, and PG0.014 in human, about 10^6 pairs; see section 15 and Table S5 in
Type | Networks | # Edges | # Nodes | #Ed./#Nod. | %PGe |
KG/PG0.01 | 17,373 | 2,293 | 7.6 | 18.22 |
KG/PG0.01 | 1,280 | 2,707 | 0.5 | 1.34 |
KG/PG0.01 | 4,062 | 4,279 | 0.9 | 4.26 |
KG/PG0.014 | 14,048 | 3,958 | 3.5 | 1.34 |
KG/PG0.014 | 898 | 1,111 | 0.8 | 0.08 |
KG/PG0.014 | 3,073 | 5,633 | 0.5 | 0.29 |
From left to right: the
The percentage of edge predictions backed by experiments drops considerably for human: only 1.34% of the predicted protein associations (PG) were also present in the KG model (
The density of the overlap between PG and KG in yeast (#Ed./#Nod. = 7.6 in
These statistical analyses of the PG and KG intersections indicate that about 82% of predicted protein associations in yeast and 98% in human are not backed by experimental evidence in the KG model, giving an estimate of the dark matter in the yeast and human protein networks. Only 2.4% of the PG proteins in yeast are dark nodes (proteins without any experimental association in the KG model), whilst dark nodes constitute 71% in human.
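The dark matter estimates follow from set operations on the two edge lists. The sketch below shows the calculation on toy data; the function name, the returned keys and the protein identifiers are hypothetical.

```python
def dark_matter_stats(pg_edges, kg_edges):
    """Percentage of PG edges backed by KG, and of dark edges and dark nodes.

    A dark edge is a predicted (PG) association absent from KG; a dark node is
    a PG protein with no association at all in KG.
    """
    pg = {frozenset(e) for e in pg_edges}
    kg = {frozenset(e) for e in kg_edges}
    if not pg:
        raise ValueError("the PG network has no edges")
    pg_nodes = set().union(*pg)
    kg_nodes = set().union(*kg) if kg else set()
    shared = pg & kg
    dark_nodes = pg_nodes - kg_nodes
    return {
        "pct_pg_edges_in_kg": 100.0 * len(shared) / len(pg),
        "pct_dark_edges": 100.0 * len(pg - kg) / len(pg),
        "pct_dark_nodes": 100.0 * len(dark_nodes) / len(pg_nodes),
    }


# Toy networks with hypothetical protein identifiers.
pg_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
kg_edges = [("B", "A"), ("C", "X")]
print(dark_matter_stats(pg_edges, kg_edges))
```

Applied to the real networks, the first value corresponds to the %PGe column of the intersection table and its complement to the dark edge estimate.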
Although the PG and KG models explore significantly different regions of protein binary association space, it is interesting that, given the small intersections and density values, the PG and KG overlap is still significantly larger than expected by chance (
Enrichment of the degree of a node in the PG model (PGki_er) was calculated in order to measure the difference in the connectivity (ki) values for a protein in the PG and KG networks (see the section: Calculating the PGki enrichment ratio and the PG functional enrichment). A high PGki_er value indicates the presence of a dark (experimentally hidden) hub, a protein with many predicted associated proteins in the PG model and few, if any, experimentally validated KG associations. Proteins in the yeast and human PG models were ranked using their PGki_er value, retrieving the top 10 ranked proteins for both organisms (see
Yeast Prot. Acc. N. | KG ki | PG ki | PG ki_er | R. | Gene name | Uniprot descriptions |
Q07928 | 0 | 213 | 213 | 1 | GAT3 | GATA-type zinc finger: transcription factor activity (Inferred from electronic annotation). Unknown function. |
P47055 | 0 | 201 | 201 | 2 | LOH1 | Multi-pass membrane protein. Possibly involved in maintaining genome integrity |
A6ZR40 | 0 | 189 | 189 | 3 | SCY_1587 | Predicted protein, unknown function. |
Q12079 | 0 | 188 | 188 | 4 | YPR027C | Multi-pass membrane protein. Uncharacterized membrane protein YPR027C |
P53964 | 0 | 187 | 187 | 5 | YNL033W | Single-pass membrane protein. Uncharacterized membrane protein YNL033W |
P47056 | 0 | 172 | 172 | 6 | YJL037W | Multi-pass membrane protein. Uncharacterized protein YJL037W |
P09937 | 0 | 171 | 171 | 7 | SPS4 | Sporulation-specific protein 4. Not essential for sporulation. Might be a component of the cell wall. |
P32643 | 0 | 163 | 163 | 8 | TMT1 | Trans-aconitate 3-methyltransferase. Induced during amino acid starvation. |
A6ZV06 | 0 | 155 | 155 | 9 | SCY_2239 | Predicted protein with alpha/beta hydrolase fold, unknown function. |
A6ZP11 | 0 | 151 | 151 | 10 | SCY_5229 | Predicted protein with a Nucleotide binding domain potentially found in RNases, unknown function. |
Human Prot. Acc. N. | KG ki | PG ki | PG ki_er | R. | Gene name | Uniprot descriptions |
Q6ZP81 | 0 | 2597 | 2597 | 1 | - | Highly similar to Homo sapiens titin (TTN), with Fibronectin type III domain. Unknown function. |
Q9UM08 | 0 | 2214 | 2214 | 2 | HGC6.3 | Unknown function |
Q9BZ69 | 0 | 1860 | 1860 | 3 | P143 | Predicted membrane protein with histidine kinase domain, two-component sensor activity (Inferred from electronic annotation). Unknown function. |
Q0VAC6 | 0 | 1828 | 1828 | 4 | SUMF1 | Unknown function |
Q5JRQ2 | 0 | 1796 | 1796 | 5 | NT5E | 5′-nucleotidase, ecto (CD73). Hydrolase activity; nucleotide catabolic process. |
Q96F04 | 0 | 1759 | 1759 | 6 | MMP28 | Matrix metallopeptidase 28. Predicted protein with a putative peptidoglycan binding domain. |
Q3KR05 | 0 | 1702 | 1702 | 7 | NEU4 | Sialidase 4. Unknown function. |
Q32MK0 | 0 | 1660 | 1660 | 8 | MYLK3 | Putative myosin light chain kinase 3. May play a role in smooth muscle contraction. |
Q4ZG20 | 0 | 1657 | 1657 | 9 | TTN | Putative uncharacterized protein TTN. Unknown function. |
Q96CP1 | 0 | 1602 | 1602 | 10 | RELA | Predicted protein with a transcription factor Rel homology domain (RHD) |
From left to right:
A common interesting feature of dark hubs, shown in
We analysed the top 10 dark hubs in the yeast PG network using functional annotation inferred by homology. These proteins correspond mainly to membrane-embedded proteins, although there are also proteins related to other disparate functions, such as transcription factors, RNase (probably involved in siRNA degradation processes), sporulation, and various enzymes (see
In order to study possible bias in the functional niches highlighted by the PG predictions but absent in the KGs, functional enrichment in the yeast and human PGki_er ranked lists was estimated using the GOrilla server
Biological process GO term name | GO code | P-value | N | B | n | b | E. |
Protein amino acid phosphorylation | GO:0006468 | 7.63E-22 | 12769 | 508 | 991 | 111 | 3 |
Regulation of small GTPase mediated signal transduction | GO:0051056 | 1.64E-13 | 12769 | 124 | 982 | 39 | 4 |
>Regulation of Ras protein signal transduction | GO:0046578 | 2.27E-10 | 12769 | 85 | 982 | 28 | 4 |
Molecular function GO term name | GO code | P-value | N | B | n | b | E. |
Protein kinase activity | GO:0004672 | 3.62E-21 | 12769 | 480 | 991 | 10 | 3 |
>Protein serine/threonine kinase activity | GO:0004674 | 4.83E-14 | 12769 | 349 | 984 | 75 | 3 |
>Protein tyrosine kinase activity | GO:0004713 | 2.73E-10 | 12769 | 146 | 972 | 37 | 3 |
GTPase regulator activity | GO:0030695 | 7.06E-13 | 12769 | 319 | 908 | 63 | 3 |
>Guanyl-nucleotide exchange factor activity | GO:0005085 | 1.92E-11 | 12769 | 123 | 982 | 36 | 4 |
>Small GTPase regulator activity | GO:0005083 | 1.59E-10 | 12769 | 211 | 982 | 47 | 3 |
ATP binding | GO:0005524 | 4.61E-25 | 12769 | 1097 | 991 | 190 | 2 |
For the
GOrilla did not find any significant functional enrichment bias (P-value>E-9) in the yeast ranked list, but detected enrichment of some GO terms in the human ranked list associated with particular biological processes and molecular function categories in GO (see
If the reliable PG0.01&0.014 pairwise predictions capture a significant percentage of true functional relationships and the PG0.01&0.014 networks show most of the topological properties of KG networks, it is reasonable to expect that the topology associations in these PG0.01&0.014 networks will resemble real biological networks. In other words, we should be able to exploit information on the context of a protein (i.e. connections in the network) to predict associations it has with other proteins sharing a similar context.
In order to test this hypothesis, functional predictions were generated for additional protein pairs, by comparing the interactions of the respective proteins in these pairs, in the PG networks. The results were then validated using the gold standard KG protein pairs' datasets.
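The profile comparison can be sketched with the bits and specific bits scores as they are defined in the methods of this paper. The code below is a minimal illustration under stated assumptions: the natural logarithm is used because the base is not specified, the non-shared count b2 is taken as the symmetric difference of the two neighbour sets, and the function names and toy identifiers are hypothetical.

```python
import math


def neighbours(edges):
    """Map each protein to the set of proteins it is linked to."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    return nbrs


def bits(n1, n2):
    """b1: number of interacting proteins shared by the two profiles."""
    return len(n1 & n2)


def specific_bits(n1, n2):
    """s = b1 * (-log(b1 / (b1 + b2))), with b2 the non-shared interactors.

    Assumptions: natural logarithm; b2 counted as the symmetric difference;
    pairs with no shared interactor score 0.
    """
    b1 = len(n1 & n2)
    b2 = len(n1 ^ n2)
    if b1 == 0:
        return 0.0
    return b1 * -math.log(b1 / (b1 + b2))


# Toy network: p1 and p2 are not linked directly but share two neighbours.
edges = [("p1", "a"), ("p1", "b"), ("p1", "c"), ("p2", "b"), ("p2", "c"), ("p2", "d")]
nbrs = neighbours(edges)
print(bits(nbrs["p1"], nbrs["p2"]), specific_bits(nbrs["p1"], nbrs["p2"]))
```

Pairs ranked by such scores are the second order predictions that are then checked against the KG gold standards.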
This context analysis of the PG networks
Comparison of the association profiles identified 1,668,584 protein pairs in yeast and 49,117,115 protein pairs in human sharing at least one third of their interacting proteins in the PG0.01&0.014 network matrices. The similarity scores of the profiles were validated using the different KG datasets i.e. Int, Kegg, Goss, Foss, Reactome, and Reactome_int (see Figure S14, Figure S15, Figure S16, Figure S17, Figure S18, Figure S19, and Figure S20 in
These plots present the precision value (y-axis) versus the specific bits similarity score between the interaction profiles of the protein pairs (x-axis in plots A and C) and versus Recall (# of pairs predicted, x-axis in plots B and D) in the yeast (plots A and B) and human (plots C and D) PG0.01&0.014 networks. The gold standard dataset used, KG≥2 evidences, is described in the section: Validation of the second order predictions for the PG networks.
Bits and specific bits scores show very similar behaviour in all the KG datasets most probably due to the large set of potential random interactions in both PG matrices that make it very unlikely that two proteins would share a significant number of interactions by chance (see section 18 in
First order predictions based on Fisher scores yielded about 90,000 predictions in yeast with a precision≥80% (see
The scoring functions of the three
While the KG network models contain much of the current knowledge on protein functional associations provided by disparate experimental resources, in yeast and human, the PG models represent sets of predictions inferred by the integration of different
Different data integration methods are applied for reducing noise (error) in the KG and PG models, thereby generating analogous frameworks for the KG and PG models built at different reliability levels. In the KG models the associated error is inversely correlated to the number of evidences supporting a given protein-protein association. Reducing error by summing evidences is analogous to the repetition of experiments carried out in standard experimental protocols
Since one of the prediction methods, hiPPI, exploits available experimental data by inheriting experimentally validated interactions between homologous proteins there may be some concern that the dependency of the hiPPI predictions on some of the KG datasets could bias the PG network models so that the features resemble those of experimental KG networks. Addressing this possibility we repeated the main analyses of this work excluding the hiPPI predictions and demonstrated that the similarity of the PG and KG models remained and is therefore not due to any circular information or bias. This confirms our previous observations and conclusions of our work (see section 21 in
Coverage of reliable PG0.01&0.014 predictions by KG datasets appears much higher in yeast (18%) than in human (1.34%) for all the analysed cases (
For human, the top ranked dark hub dataset (see
Dark matter may even be more extensive than suggested by the initial comparison of PG and KG models. KG and PG models both show a non-hierarchical structure, as shown by the clustering parameter distribution (Figure S11 in
Since much of the PG network is dark matter containing hubs and other important functional regions not easily reached by current experimental designs (especially in more complex organisms like human), and since the PG models show the most important properties of real, biological networks, resembling the properties observed in the KG models, we can conclude that the yeast and the human PG networks are valuable models, akin to the currently more accepted KG models, for investigating the properties of real biological networks, complementing and completing experimental studies in Systems Biology.
The GECO, hiPPI, CODAcath and CODApfam methods were run against all sequences in the human (
A score for the cumulative frequency distributions was calculated for each of the four prediction datasets (GECO, hiPPI, CODAcath and CODApfam) using the curvefit tool from MATLAB. The Probability Density Function (PDF) associated with the score distribution of each of the four methods was calculated in order to translate the scores into p-values. Right-tailed Z-tests were performed to check that the PDFs of the PG datasets fit random Gaussian distributions with different means μ (the null hypothesis), rejecting the null hypothesis at the 5% significance level (see section 20 in
Mutual information was calculated between the prediction datasets, to detect potential dependencies. The small values calculated for the mutual information (or conditional dependency) between pairs of predictors, indicated that the datasets were largely conditionally independent (Table S1 in
The p-values from each method were integrated using two methods: Simple integration, and Fisher weighted (Fisher_W)
The weights for the Fisher_W method were calculated using a MATLAB script. This consisted of simultaneously running a
We benchmarked our predictions using the highest quality annotations of yeast and human proteomes in the Gene Ontology (GO) database
The GO terms' Semantic Similarity (Goss) scores were calculated for all versus all protein pairs in human and yeast proteomes as described by Lord et al. 2003
Precision was calculated as the cumulative ratio TP/(TP+FP) at different prediction p-values, where TP (True Positives) is the rate of hits predicted within the validation dataset of true protein binary associations (e.g. Gossr, see section above), and FP (False Positives) is the average rate of hits predicted from 1000 random models of the same validation dataset.
The FPs are the randomly selected PPIs above different scoring thresholds (i.e. prediction p-values). The FPs are calculated as an average of 1000 random validation iterations to estimate the errors (deviations) associated with the calculation. We then compare the relative differences in the TP and FP rates in the ranked prediction list, obtained by using our predictor and a random approach. For example, a precision ≥90% associated with a p-value≤0.001 means we find 9 times more TPs in the set of predictions with p-value≤0.001 than a random predictor does by chance. In our analyses the precision (i.e. TP/(TP+FP)) of a predictor that performs no better than random will always tend to 50%, because we select the same number of FPs from our random predictors as given by the integrated prediction method.
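The random validation model can be sketched as follows. This is one possible reading of the procedure, stated as an assumption: for a given p-value threshold, the FP count is estimated as the average number of gold standard hits among the same number of pairs drawn at random from all candidate pairs. The function names, the seed and the toy universe are hypothetical.

```python
import random


def precision(tp, fp):
    """Precision as TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) > 0 else float("nan")


def benchmark(predicted_pairs, gold_pairs, universe, n_random=1000, seed=0):
    """Return (TP, mean random FP, precision) for one prediction threshold.

    predicted_pairs: pairs passing the threshold; gold_pairs: validation set;
    universe: list of all candidate pairs from which random models are drawn.
    """
    rng = random.Random(seed)
    gold = set(gold_pairs)
    n = len(predicted_pairs)
    tp = sum(1 for pair in predicted_pairs if pair in gold)
    random_hits = [
        sum(1 for pair in rng.sample(universe, n) if pair in gold)
        for _ in range(n_random)
    ]
    fp = sum(random_hits) / n_random
    return tp, fp, precision(tp, fp)


# Toy universe: all unordered pairs among six hypothetical proteins.
universe = [(i, j) for i in range(6) for j in range(i + 1, 6)]

# A predictor no better than random: every pair is "gold", so precision is 50%.
print(benchmark(universe[:5], universe, universe, n_random=100))

# An informative predictor: it recovers exactly the three gold pairs.
print(benchmark(universe[:3], universe[:3], universe, n_random=200))
```

The first call illustrates why precision tends to 50% for an uninformative predictor under this scheme, and the second why values well above 50% indicate enrichment over chance.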
When a random model is used for benchmarking, a randomly selected PPI could be a known TP by chance, although the probability is expected to be very low since the space of known PPIs (TPs) is much smaller than the space of random PPI pairs considering all possible combinations. It is also likely that none of the gold standard datasets, or combinations of them, contains all the true PPIs taking place in nature. Therefore it is not possible to correctly estimate FPs in the ranked predictions based on pairs absent from the validating datasets (i.e. many of these FPs may be currently uncharacterised TPs). In any case, the consequence of considering TPs as FPs in the random validation model used in this work is conservative, giving an underestimate of the performance of our predictor (see section 2 in
Although recall is usually defined as the TP/(TP+FN) ratio, we cannot reliably estimate the FN rates because not all the true PPIs are known in our validation model. Therefore, in this work Recall is calculated as the accumulated number of hits predicted by a given method at different p-value levels.
Yeast and human PG protein networks were built based on the binary protein prediction data selected at different discrete Fisher_W p-value statistical significance levels. Fisher_W predictions were chosen because these gave the best results from the benchmarking. Various PG networks were generated over a range of predicted p-value cut-offs. The p-value cut-off used to generate a given PG network is specified in the subscript of its name. For example if a p-value cut-off ≤0.01 was used the PG network was termed PG0.01.
The construction of KG protein networks for human and yeast proteomes was based on the existence of protein functional links. For the interaction databases HPRD
A cumulative score was associated with each edge (functional link) to represent the number of independent resources with evidence of the functional link between the two proteins. The KG models statistics are shown in Table S5 in
Two different randomisation procedures were implemented. The first method randomised the p-values associated with edges in the PG network and the # of evidences associated with edges in the KG models, whilst keeping the same pairs of connected nodes in the matrices. These models are referred to as
In order to compare the PG/KG networks generated by this study several different network statistical features were calculated. Topological parameters included the node degree connection (ki)
In order to determine whether some nodes had elevated degree connections in the PG, the relative enrichment of the node degree connection (ki) for nodes in the PG network compared to the KG network was calculated for all the nodes (proteins) using the following formula:
For each protein pair, the vectors of interacting proteins, within the PG0.01 in yeast and the PG0.014 in human network matrices (the 0.01 and 0.014 cut-offs relate to 80% precision in yeast and human respectively), were compared using different similarity measures, such as bits, specific bits and congruence. These similarity scores,
The bits score formula is b(p1,p2) = b1, where p1 and p2 are the two proteins compared and b1 is the number of shared interacting proteins between the two proteins' interaction vectors in a given PG network matrix. The specific bits score was calculated using the following formula: s(p1,p2) = b1·[−log(b1/(b1+b2))], where p1 and p2 are the two proteins compared, b1 is the number of shared interacting proteins, and b2 is the number of non-shared interacting proteins between the two compared proteins in the PG networks. Congruence is a similarity measure between pairs of protein interacting vectors that was calculated as described in Lehner
Second order predictions were ranked based on the different similarity score values (see section above) from the most significant to the least significant. Validation was performed using as true positives (TP) protein pairs from the KG matrices in yeast and human respectively (Int, Goss, Foss and Kegg in yeast and Goss, Foss, Kegg, Int, Reactome_int, and Reactome in human; see the section above: Knowledgegram (KG) construction) mapped to pairs in the ranked lists. An extra gold standard dataset of mapped true positive hits was built using those pairs present in two or more KG datasets (KG≥2). False positive (FP) sets were obtained by mapping the same KG gold standard datasets onto randomised ranked lists of second order predictions, with 1,000 random iterations in yeast and 500 in human (fewer in human to balance the sample size against computational cost).
Precision and recall parameters were calculated as described above; the precision mean and error (standard deviation) values were calculated based on the TP and the different accumulated random FP distributions. In order to present representative results, values with standard deviations of more than 1/3 of the mean were ignored, as they were due to the small size of the TP and FP samples at the beginning of the accumulated distributions (for further details see section 18 in
Supporting information.
(4.01 MB PDF)
We thank University of Malaga for data processing through the Picasso supercomputer, Miguel A. Medina for reviewing the manuscript and Raul Montanez for his help at the beginning of the project.