Conceived and designed the experiments: A. Ma'ayan, A. Mazloom, A. Boran, A. Grigoryan. Performed the experiments: A. Ma'ayan, A. Mazloom, A. Malovannaya, R. Dannenfelser, J. Bond, K. Linder, A. Boran, A. Grigoryan. Analyzed the data: A. Ma'ayan, A. Mazloom, A. Malovannaya, N. Clark, A. Boran, A. Grigoryan. Contributed reagents/materials/analysis tools: A. Malovannaya, R. Lanz, T. Cardozo, R. Iyengar. Wrote the paper: A. Ma'ayan, A. Mazloom, A. Malovannaya, R. Lanz, N. Clark.
The authors have declared that no competing interests exist.
Coregulator proteins (CoRegs) are part of multi-protein complexes that transiently assemble with transcription factors and chromatin modifiers to regulate gene expression. In this study we analyzed data from 3,290 immuno-precipitations (IP) followed by mass spectrometry (MS) applied to human cell lines aimed at identifying CoRegs complexes. Using the semi-quantitative spectral counts, we scored binary protein-protein and domain-domain associations with several equations. Unlike previous applications, our methods scored prey-prey protein-protein interactions regardless of the baits used. We also predicted domain-domain interactions underlying predicted protein-protein interactions. The quality of predicted protein-protein and domain-domain interactions was evaluated using known binary interactions from the literature, whereas one protein-protein interaction, between STRN and CTTNBP2NL, was validated experimentally; and one domain-domain interaction, between the HEAT domain of PPP2R1A and the Pkinase domain of STK25, was validated using molecular docking simulations. The scoring schemes presented here recovered known, and predicted many new, complexes, protein-protein, and domain-domain interactions. The networks that resulted from the predictions are provided as a web-based interactive application at
In response to various extracellular stimuli, protein complexes are transiently assembled within the nucleus of cells to regulate gene transcription in a context dependent manner. Here we analyzed data from 3,290 proteomics experiments that used as bait different member proteins from regulatory complexes with different antibodies. Such proteomics experiments attempt to characterize complex membership for other proteins that associate with bait proteins. However, the experiments are noisy and aggregation of the data from many pull-down experiments is computationally challenging. To this end we developed and evaluated several equations that score pair-wise interactions based on co-occurrence in different but related pull-down experiments. We compared and evaluated the scoring methods and combined them to recover known, and discover new, complexes and protein-protein interactions. We also applied the same equations to predict domain-domain interactions that might underlie the protein interactions and complex formation. As a proof of concept, we experimentally validated one predicted protein-protein interaction and one predicted domain-domain interaction using different methods. Such rich information about binary interactions between proteins and domains should advance our knowledge of transcriptional regulation by CoRegs in normal and diseased human cells.
CoRegs are members of multi-protein complexes transiently assembled for regulation of gene expression
To accelerate research in the area of CoRegs signaling, the Nuclear Receptor Signaling Atlas (NURSA)
The aforementioned work, and other similar prior studies, ranked predicted associations and provided probabilities for interactions between baits and preys, building on the explicit nature of bait-prey relationship in epitope-based purifications. However, due to secondary cross-reacting proteins, bait-prey relationships are rarely explicit in IPs carried out with primary antibodies. Hence, here we developed and compared different ways, coded into mathematical functions, to score prey-prey interactions from a large, recently published, HT-IP/MS dataset. The equations predict direct protein-protein interactions between prey proteins without considering the specific baits. We also used the same equations to predict domain-domain interactions underlying the protein-protein interactions. We evaluated the performance of these equations using known protein-protein and domain-domain interactions from the literature and validated one protein-protein interaction experimentally, and one domain-domain interaction using computational docking. By combining the data from the 3,290 IP-MS experiments collected by NURSA we predicted binary interactions between prey proteins and their domains. We offer a global view of CoRegs complexes in human cells, and provide the predicted networks for exploration on the web through a web-based application with downloadable tables freely available at
A detailed description of the IP-MS procedure can be found in references
To score prey-prey interactions from the HT-IP/MS data table, containing the ranks of proteins from the 3,290 IP-MS experiments, we evaluated existing and developed new equations implemented as algorithms in MATLAB and Java.
The Sørensen similarity coefficient (Sor) provides a symmetric similarity coefficient for comparing two finite sets. The coefficient ranges between 0 and 1, where 0 denotes no similarity and 1 denotes identical sets. The Sørensen coefficient is calculated as the ratio of twice the cardinality of the shared members of two sets to the sum of the cardinalities of the same sets.
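As an illustration, the coefficient can be sketched in a few lines of Python using its standard form, 2|A∩B|/(|A|+|B|). This sketch assumes that each protein is represented by the set of experiments in which it was detected; the function name `sorensen`, the example sets, and the convention for two empty sets are hypothetical choices made for the example, not the authors' implementation.

```python
def sorensen(a, b):
    """Sørensen similarity between two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention chosen for this sketch (an assumption): two empty sets score 0.
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))


# Hypothetical example: indices of the experiments in which each protein was detected.
experiments_with_a = {1, 2, 3, 5}
experiments_with_b = {2, 3, 5, 8}
score = sorensen(experiments_with_a, experiments_with_b)  # 2 * 3 / (4 + 4)
```

Because the coefficient depends only on co-occurrence, it ignores the spectral counts themselves, which is why it is natural to combine it with count-based scores.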
Pearson's correlation coefficient (Pr) characterizes the linear dependency between two variables. Here we used Pearson's correlation coefficient to quantify the correlation between the SPC scores of two proteins across all IP/MS experiments.
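A minimal sketch of this score, assuming each protein is described by a vector of SPC values with one entry per experiment (zero where the protein was not detected), is given below. The function name `pearson`, the example vectors, and the choice to return 0 for a constant vector are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length SPC vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        # Assumption for this sketch: a constant vector is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical SPC values of two prey proteins across four experiments.
spc_a = [10, 0, 4, 7]
spc_b = [8, 1, 3, 6]
pr = pearson(spc_a, spc_b)
```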
Equation 3 (E3) was developed through an intuitive manual symbolic search for functions that perform well, based on benchmarking, using known protein-protein interactions. E3 calculates a ratio between the sum of the SPC scores in experiment
The AB correlation was also developed through an intuitive manual symbolic search for functions that perform well based on benchmarking using known protein-protein interactions. The AB correlation computes the mean of the product of SPC scores normalized by dividing by the sum of mean SPC scores across all experiments.
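One literal reading of this verbal description is sketched below: the mean of the per-experiment products of the two SPC vectors, divided by the sum of the two mean SPC values. This is an interpretation of the prose only, and the published equation may differ in its details; the function name `ab_score` and the handling of an all-zero denominator are assumptions.

```python
def ab_score(spc_a, spc_b):
    """Mean of per-experiment SPC products divided by the sum of the two mean SPCs."""
    n = len(spc_a)
    if n == 0 or n != len(spc_b):
        raise ValueError("vectors must be non-empty and of equal length")
    mean_product = sum(a * b for a, b in zip(spc_a, spc_b)) / n
    denominator = sum(spc_a) / n + sum(spc_b) / n
    if denominator == 0:
        # Assumption for this sketch: proteins never detected score 0.
        return 0.0
    return mean_product / denominator


# Hypothetical SPC values across three experiments.
example = ab_score([2, 0, 4], [1, 3, 2])  # (10 / 3) / (2 + 2)
```

Under this reading, the score grows when both proteins have high counts in the same experiments, while the denominator damps the advantage of proteins that are abundant everywhere.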
To evaluate the predicted prey-prey protein interactions using the four equations, we used an updated version of the human literature-based protein-protein interactome we developed for the program Genes2Networks
To identify domains for proteins, we used the Pfam domain database release 24.0. The file ‘Pfam-A.full.gz’ was downloaded from:
Domain-domain interactions (DDI) were obtained from the Domine database
Polyclonal rabbit antibodies against STRN (also called Striatin) were purchased from Millipore Corp. Antibodies against CTTNBP2NL were purchased from GeneTex. MCF-7 cells were lysed in immunoprecipitation buffer containing HEPES (50 mM, pH 7.4), NaCl (150 mM), EDTA (1 mM), Tween-20 (0.1%), glycerol (10%) and protease inhibitors. The lysates were pre-cleared in the presence of rabbit IgG and protein A beads. The input sample was collected after pre-clearing. Samples were rotated overnight with IgG or Striatin antibody and subsequently incubated for two hours with protein A beads. The washed protein-containing beads were denatured and analyzed by Western blot.
The MolSoft ICM software was used to perform the domain-domain docking simulation. ICM uses a two-step method: pseudo-Brownian rigid-body docking followed by biased probability Monte Carlo minimization of the ligand side-chains, to sample conformational space in order to identify the global energy minimum for a given interaction
We analyzed the experimental data from 3,290 IP-MS experiments targeting 1,083 antigens (bait proteins) using 1,796 different antibodies. These experiments detected 11,485 non-redundant proteins (
IP-MS proteomics profiling has several known experimental challenges that need to be considered when applying functional global analyses to such data. First, it is well established that the proteins identified in such experiments are enriched for highly abundant and “sticky” proteins. This results in numerous proteins appearing frequently in almost all pull-downs regardless of the cell type, cellular fraction or experimental conditions. To address this we used a list of “non-specific” proteins to filter protein identifications that appear frequently in many pull-downs (
Our aim in this study is to assign confidence scores to binary prey-prey protein-protein and domain-domain interactions by integrating information from the 3,290 IP-MS experiments. The rationale for this approach is that the experiments, reporting lists of ∼30–200 proteins for each pull-down, taken together, provide enough information to reconstruct high-fidelity, small-sized complexes and potentially enough to recover direct physical interactions between pairs of proteins and domains. We reasoned that if we use all the information across all experiments to score each pair of proteins for potential direct interaction, we will be able to identify novel associations in addition to recovering known interactions better than by chance. In contrast with most prior methods that focused on scoring bait-prey interactions, our equations predict interactions between prey proteins that commonly reappear together in different pull-downs. Although the data collected for this study was aimed at the recovery of interactions between the intended antigens (baits) and other proteins, the majority of primary antibodies cross-react with multiple secondary antigens and those antigens interact with other proteins. This complicates bait-prey scoring of HT-IP/MS data. Yet, logically, if two proteins reappear together at the top of lists in many different pull-downs, we can guess that they may physically interact regardless of which baits were used to pull them down, making it possible to predict likely binary interactions by utilizing the spectral counts, not just co-occurrence. To encode such logic into mathematical functions we devised four scoring schemes, each attempting to address the problem in a slightly different way. To evaluate the performance of the four scoring schemes we used known PPIs we consolidated from online databases
To compare the performance of the different scoring methods we visualized the results as either a receiver operating characteristic (ROC) curve (
Next, we used ball-and-stick diagrams to visualize the results across all experiments. We first visualized all overlapping interactions listed in the top 10% of predicted protein-protein interactions by each method (AB, E3 and Pr combined with Sor). This resulted in a network made of 2,509 proteins (nodes) and 28,886 interactions (edges) (
Yellow nodes are prey only and blue nodes were used as bait at least once. Edges are colored according to the following criteria: Blue edges are predicted interactions that do not have reported direct or indirect interaction in the literature; Green edges are predicted interactions that have one or more reported indirect interaction (one intermediate); Red edges are recalled direct interactions.
Complexes are assembled by selecting and visualizing cliques formed by predicted protein-protein interactions ranked in the top 1% by all three methods. The resulting network is composed of 63 protein complexes containing 3 to 25 proteins. Yellow nodes are prey and blue nodes are bait proteins. Edges are colored according to the following criteria: White edges are predicted interactions that do not have reported direct or indirect interaction in the literature; Green edges are predicted interactions that have one or more reported indirect interaction; Red edges are recalled direct interactions.
(A) Selected complex from
Before analyzing all of the 3,290 IP-MS experiments published by Malovannaya et al
Since PPIs are often the result of interactions between the structural domains of the interacting proteins, and since we know most of those domains for all pulled prey proteins based on their amino-acid sequences, we can use the scores for PPIs to also score and rank domain-domain interactions (DDIs). The scoring of domain interactions is slightly more complex since most proteins have several different domains and the domains can appear more than once within the same protein. To resolve this we used the score for PPIs containing domains between all possible domain pairs from each side of the PPI and normalized the score across all the domains (see
The plots show that we can reliably recover known and predicted DDIs. In addition to the four equations used to score PPIs, we introduced another scoring scheme, λ, for scoring DDIs. λ is an index that counts the number of times two predicted interacting prey proteins have a domain on each side of the PPI. Such an index improves DDI predictions. In addition to the type of analysis we did for PPIs, we also attempted to further combine different prediction methods to optimize DDI predictions. Finally we visualize our predicted DDIs with known DDIs as a network diagram to visually explore interactions among all domains (
(A) Network showing the predicted DDIs for the predicted STRN protein complex. The network was constructed by importing domains for each protein from the Pfam database, associating protein domains to each of the proteins in the STRN complex, and using top predicted DDIs to connect the domains. In the network yellow octagons are domains and circles are proteins. Domains are connected to proteins using red, solid-black and dashed-blue edges. Black edges signify true positives and dashed-blue edges signify predicted DDIs. In the complex, PPIs that did not have a matching predicted DDI were eliminated. (B) Validation of a DDI using molecular docking. The lowest energy conformation predicted by the docking simulations of STK25 to PPP2R1A. The interaction of the Pkinase domain with the HEAT domain is shown. (C) Binding energy landscape of all generated docking scores between STK25 and PPP2R1A. (D) Histogram of generated docking scores.
In this study we combined results from 3,290 experiments that identified nuclear protein complexes in human cells using IP-MS. We implemented and evaluated four different equations, assessing their ability to predict direct physical PPIs from the aggregated proteomics data using known PPIs from the literature. The highest scoring predictions were visualized as networks with many densely connected clusters that are likely made of real protein complexes. The prediction scores for potential interactions could be considered as surrogates for real affinity constants. However, since we do not know the exact quantities of proteins, it is not possible to compute exact dissociation constants. Such binding constants can be useful for dynamical simulations where we could stochastically trace the transient dynamics of CoRegs complex formation in silico. Scoring PPIs by only using the prey measurements may become more robust as more IP-MS experiments are published. However, careful attention should be given to weighting the repetitiveness of experiments so interactions from similar pull-downs, if repeated, are not mistakenly given higher scores. Regardless of possible limitations, the ability to recover direct PPIs based on such a massive dataset is an important step toward utilizing HT-IP/MS datasets for reconstructing networks and generating hypotheses. In addition, we show that the equations can be extended to predict interactions between structural domains. We also demonstrated two ways to further validate predicted PPIs and DDIs, using experimental and computational approaches. In summary, our analyses explored new methodologies for scoring PPIs and DDIs using data from related IP-MS experiments, providing many hypotheses about mammalian CoRegs complex formation, and allowing users to explore novel complexes, PPIs and DDIs online at
Information on each IP/MS experiment, quantity of proteins purified in each IP/MS experiment, size of protein lists purified in each IP/MS experiment, list of sticky proteins.
(XLSX)
Scores for top 1% predicted PPIs by each method.
(XLS)
Scores for top 1% predicted DDIs by each method.
(XLSX)
Each node in the network represents a list of proteins identified in one of the 3,290 IP-MS experiments color coded according to the bait protein targeted by an antibody in a single experiment. An edge represents the similarity between two lists using the Jaccard distance. A node is preserved if it has at least one edge with Jaccard distance <0.7. The network contains 491 nodes and 2233 edges. The diameter of a node represents the size of a list from a specific experiment.
(EPS)
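The construction of such a similarity network can be sketched as follows, assuming that each experiment is reduced to the set of proteins it identified and that an edge is drawn only when the Jaccard distance between two lists is below the 0.7 cutoff. The function names, the dictionary layout, and the example lists are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty lists are treated as identical
    return 1.0 - len(a & b) / len(union)


def similarity_network(protein_lists, threshold=0.7):
    """Return (nodes, edges); a node is kept only if it has at least one edge."""
    edges = []
    for (exp_i, list_i), (exp_j, list_j) in combinations(protein_lists.items(), 2):
        distance = jaccard_distance(list_i, list_j)
        if distance < threshold:
            edges.append((exp_i, exp_j, distance))
    nodes = sorted({exp for i, j, _ in edges for exp in (i, j)})
    return nodes, edges


# Hypothetical protein lists from three pull-downs.
lists = {"ip1": {"A", "B", "C"}, "ip2": {"A", "B", "D"}, "ip3": {"X", "Y"}}
nodes, edges = similarity_network(lists)
```

In this toy input, only the two overlapping pull-downs remain in the network; the third has no edge below the cutoff and is dropped.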
(A) Histogram of Jaccard distances between pairs of 3,290 experiments. (B) Histogram of the size of pull-down lists from all IP-MS experiments.
(EPS)
(A) Receiver operating characteristic (ROC) curve of the recovery of known interactions using the different scoring methods. Recall rate of known PPIs (y-axis) is computed and displayed as a ratio between ranked predicted PPIs by each scoring method and known PPIs. (B) Area under the curve (AUC) computed for each method.
(EPS)
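A recall curve of this kind and its AUC can be sketched as below. The sketch assumes that predictions are already sorted from best to worst score and flagged as known or unknown, that the x-axis is the fraction of ranked predictions examined (rather than a false-positive rate), and that the area is computed with the trapezoidal rule; the function names and the toy ranking are hypothetical.

```python
def recall_curve(is_known):
    """Recall of known interactions as a function of the fraction of ranks examined."""
    total_known = sum(is_known)
    total_ranked = len(is_known)
    if total_known == 0:
        raise ValueError("at least one known interaction is required")
    xs, ys = [0.0], [0.0]
    hits = 0
    for rank, hit in enumerate(is_known, start=1):
        hits += 1 if hit else 0
        xs.append(rank / total_ranked)
        ys.append(hits / total_known)
    return xs, ys


def auc(xs, ys):
    """Area under the curve by the trapezoidal rule."""
    return sum((xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2.0
               for k in range(1, len(xs)))


# Toy ranking: the two known interactions are ranked first.
xs, ys = recall_curve([True, True, False, False])
area = auc(xs, ys)
```

A random ranking gives an area close to 0.5 on these axes, so values well above 0.5 indicate that known interactions are concentrated near the top of the list.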
Running-sum of the top 1,563,309 predicted PPIs, predicted with the equations: (A) E3, (B) AB, and (C) Pr. The running-sum increases by √((u−t)/t) units if it encounters a known PPI, and decreases by √(t/(u−t)) units otherwise. The magenta line in each chart shows the walk when incorporating the Sørensen similarity. u and t are counts of predicted and known interactions in the current dataset respectively. The running-sum for a random sample of scrambled ranks of the same set of interactions along with the mean of running-sums of 1000 random samples are also included in each chart.
(EPS)
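The running-sum walk described in this caption can be sketched as follows, assuming the predictions are sorted from best to worst and that u and t are the number of ranked predictions and the number of known interactions among them. With these step sizes the walk returns to zero at the last rank, so an early, high peak indicates that known PPIs are enriched at the top of the ranking. The function name and the toy ranking are hypothetical.

```python
import math


def running_sum(is_known):
    """Walk that steps up at known interactions and down otherwise."""
    u = len(is_known)   # number of ranked predicted interactions
    t = sum(is_known)   # number of known interactions among them
    if t == 0 or t == u:
        raise ValueError("need both known and unknown interactions")
    step_up = math.sqrt((u - t) / t)
    step_down = math.sqrt(t / (u - t))
    walk, total = [], 0.0
    for hit in is_known:
        total += step_up if hit else -step_down
        walk.append(total)
    return walk


# Toy ranking of eight predictions, three of which are known.
walk = running_sum([True, True, False, False, False, True, False, False])
```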
Moving average over a window of 2,000 ranked predicted PPIs, visualized as a line graph. Sørensen similarity between pairs of proteins is combined with the other scoring schemes. The inset in each chart shows the recall for PPIs with evidence of indirect interaction, i.e., one intermediate. (A) E3, (B) AB, and (C) Pr.
(EPS)
(A) Venn diagram showing the overlaps between the three different scoring methods for the top 10% of predicted interactions. (B) Overlaps of known PPIs from predicted interactions represented in (Fig. 7A).
(EPS)
Similarity graph created from a subset of 114 IP-MS experiments. Nodes represent baits and links represent similarity using the Jaccard index. Nodes are colored based on the bait. Most experiments used Estrogen Receptor α (ESR1) and nuclear receptor co-activator 3 (NCOA3), also called SRC3, as baits under different conditions.
(EPS)
(A) Hierarchical clustering of the quantities of identified proteins from the subset of 114 experiments. Only proteins that were present in three or more IP experiments were included. (B) Network of predicted complexes. Complexes are formed by visualizing predicted protein-protein associations ranked in the top 1000 by all three scoring schemes. All nodes with connectivity of one were removed. Edges are colored according to the following criteria: Light blue edges are predicted interactions that do not have reported direct or indirect interaction in the literature; Green edges are predicted interactions that have one or more reported indirect interaction; Red edges are recalled direct interactions. Dotted gray edges are direct interactions that were not ranked in the selected range by the methods but are present in the literature. Nodes with a pink circle around them represent members of previously characterized complexes from the CORUM database; Blue nodes represent proteins that were also used as baits in at least one of the experiments.
(EPS)
Heatmap of the percent overlap between the five complexes predicted from the subset of 114 experiments (columns) and complexes from the CORUM database (rows).
(EPS)
Left: Hierarchical clustering of the quantities of identified proteins from the subset of 114 experiments (same as Fig. 12A). Right: Zooming into two clusters to visualize the segregation of two complexes pulled by two different antibodies targeting the same bait.
(EPS)
(A) Recall rate for previously reported DDIs from DOMINE (y-axis) as a function of the ratio of predicted DDIs ranked by one or a combination of the scoring schemes. (B) Area under the curve (AUC) for the ∼728 K ranked DDIs (left y-axis, dark bars) and AUC for the top 7 K predicted DDIs (right y-axis, light bars).
(EPS)
A comparative chart of running-sums, as described for
(EPS)
Network representation of the top 10% of predicted DDIs where nodes having 50 or more predicted interactions were removed for visualization clarity. The network contains 357 domains (octagons) and 773 edges (red and blue lines). Node sizes are proportional to their connectivity. Predicted and recalled DDIs are colored in light blue and red respectively.
(EPS)