RA conceived of the idea and performed research. AS, MAMR, and BO provided scientific guidance. JB contributed sections to the manuscript. RA, MAMR, JB, and BO analyzed results; RA, AS, MAMR, and BO wrote the paper.
The authors have declared that no competing interests exist.
The characterization of protein interactions is essential for understanding biological systems. While genome-scale methods are available for identifying interacting proteins, they do not pinpoint the interacting motifs (e.g., a domain, sequence segments, a binding site, or a set of residues). Here, we develop and apply a method for delineating the interacting motifs of hub proteins (i.e., highly connected proteins). The method relies on the observation that proteins with common interaction partners tend to interact with these partners through a common interacting motif. The sole input for the method is a set of binary protein interactions; neither sequence nor structure information is needed. The approach is evaluated by comparing the inferred interacting motifs with domain families defined for 368 proteins in the Structural Classification of Proteins (SCOP). The positive predictive value of the method for detecting proteins with common SCOP families is 75% at a sensitivity of 10%. Most of the inferred interacting motifs were significantly associated with sequence patterns, which could be responsible for the common interactions. We find that yeast hubs with multiple interacting motifs are more likely to be essential than hubs with one or two interacting motifs, thus rationalizing the previously observed correlation between essentiality and the number of interacting partners of a protein. We also find that yeast hubs with multiple interacting motifs evolve more slowly than the average protein, in contrast to hubs with one or two interacting motifs. The proposed method will help us discover unknown interacting motifs and provide biological insights about protein hubs and their roles in interaction networks.
Recent advances in experimental methods have produced a deluge of protein–protein interaction data. However, these methods do not supply information on which specific protein regions are physically in contact during the interactions. Identifying these regions (interfaces) is fundamental for scientific disciplines that require detailed characterizations of protein interactions. In this work, we present a computational method that identifies groups of proteins with similar interfaces. This is achieved by relying on the observation that proteins with common interaction partners tend to interact through similar interfaces. The proposed method retrieves protein interactions from public data repositories and groups proteins that share a sufficient number of interacting partners. Proteins within the same group are then labeled with the same “interacting motif” identifier (iMotif). The evaluation performed using known protein domains and structural binding sites suggests that the method is better suited for proteins with multiple interacting partners (hubs). Using yeast data, we show that the cellular essentiality of a gene correlates better with the number of interacting motifs than with the absolute number of interactions.
Protein–protein interactions play a central role in many cellular processes, ranging from signal transduction to formation of cellular macrostructures and cell cycle control [
Proteins interact through a limited set of interface types [
Traditionally, the description of protein interactions in terms of the interacting components has been based on protein structural domains [
Recently, several methods [
Here, our basic assumption is that proteins with overlapping sets of interacting partners tend to interact with the common partners through the same interacting motif, such as a domain, sequence segments, a binding site, or a set of residues. A similar assumption has been previously used to annotate protein sequences [
Two main objectives have been addressed in this work. The first objective was to determine whether protein interactions alone can be used to infer interacting motifs. The positive predictive value of our method in detecting proteins with common SCOP families was 75% at a sensitivity of 10%, and the Spearman correlation coefficient between the number of iMotifs assigned to proteins and the number of interfaces found by Kim et al. [
The basic assumption behind this work is that proteins with overlapping sets of interaction partners tend to interact with those partners through a common interacting motif. The validity of this assumption was tested on a nonredundant set of 368 proteins with known SCOP domains (Materials and Methods). Although SCOP does not classify proteins by their interfaces, SCOP domains were used as surrogates for iMotifs because protein interaction types can be defined by the domains in the interacting proteins [
We found the number of common interaction partners (
It is worth noting that our assumption relies on the binary nature of the input interactions. Two proteins tend to have a common interacting motif only if they share direct physical interactions with the same partner(s). However, the likelihood of two proteins sharing a SCOP domain was lower when using only yeast two-hybrid experimental data, a detection method that is more likely to yield binary protein interactions than other experimental methods [
Based on the observation that highly connected proteins with common interaction partners tend to interact with them through a common interacting motif, we have developed a method that groups proteins with similar interacting motifs (
First, the protein interaction network is built. Second, a cluster interaction network is created by placing each protein in a different cluster. Third, clustering is performed until the similarity score drops below a certain threshold. Fourth, an iMotif label is assigned to each cluster with more than one protein, and iMotif assignments and interactions are derived.
The definition of an iMotif depends on the minimum number of common partners required to consider the given binary protein interactions as mediated through a common interacting motif.
(A) From the protein interaction network perspective, proteins with common partners (two in the example provided) are considered to interact with these partners through a similar feature, and, therefore, are classified as being of the same iMotif.
(B) The same process is shown from a structural perspective: proteins interacting through a similar feature (regardless of the feature being two structural domains or a single binding site) are considered to have a common iMotif. To further illustrate the method, we also describe a sample iMotif assignment for
The definition of iMotifs depends on a similarity metric and its threshold. Thus, different thresholds or metrics produce different iMotifs, corresponding to different levels of resolution in the description of protein–protein interactions. For example, the method can be applied at the resolution of domains from SCOP [
We evaluated the ability of the method to detect proteins with a domain in the same SCOP family (Methods). Using an
The positive predictive value, sensitivity, and applicability (Methods) are plotted as a function of the number of common interaction partners threshold (
Domain–domain interactions can be predicted from the iMotif–iMotif interactions found by the method (Materials and Methods). We evaluated the accuracy of these predictions with respect to domain interactions in the PDB. Our method achieves a positive predictive value of ∼65% for ∼5% of the proteins in the test set (
Kim et al. used protein 3-D structures and binary protein interactions to make inferences about the number of binding interfaces of proteins [
Each point corresponds to a protein from the test set for which a number of binding interfaces was assigned by Kim et al. [
Using an
Similarly to Kim and co-workers' results, [
Protein Essentiality and Predicted iMotifs
Proteins from PIANA were binned according to their number of iMotifs (A) and to their number of interactions (B), and the fraction of essential proteins was calculated for each bin. Bins with only one protein were not considered for calculating the correlations.
(A) Correlation between the number of iMotifs assigned to yeast hub proteins (≥20 interactions) in PIANA and the fraction of essential proteins (
(B) Correlation between the number of interactions of yeast hub proteins in
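The binning described in this legend can be illustrated with a minimal sketch. It assumes a hypothetical mapping `counts` from each protein to its number of iMotifs (or interactions) and a hypothetical set `essential` of essential proteins; neither name comes from the original analysis. Bins holding a single protein are dropped, as stated above.

```python
from collections import defaultdict


def essential_fraction_by_bin(counts, essential):
    """Fraction of essential proteins per bin, skipping bins with one protein.

    counts    -- dict: protein -> number of iMotifs (or interactions); hypothetical input
    essential -- set of proteins considered essential; hypothetical input
    """
    bins = defaultdict(list)
    for protein, n in counts.items():
        bins[n].append(protein in essential)
    # Keep only bins that hold more than one protein.
    return {
        n: sum(flags) / len(flags)
        for n, flags in sorted(bins.items())
        if len(flags) > 1
    }


fractions = essential_fraction_by_bin(
    {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2, "f": 3},
    {"a", "c", "d"},
)
```

The resulting bin values and fractions are the two series on which a rank correlation would then be computed.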
A common measure of evolutionary rate is the dN/dS ratio (the ratio of nonsynonymous to synonymous substitutions) [
Protein Evolutionary Rate and Predicted iMotifs
Sequence patterns for each iMotif were generated using the PRATT program [
Relationship between the percentage of iMotifs for which a significant sequence pattern was found and the percentage of proteins within the iMotif that contained the pattern. Three different significance cutoffs were used for associating sequence patterns to iMotifs: 10⁻⁵ (long dashed line), 10⁻⁸ (solid line), and 10⁻¹⁰ (short dashed line).
As indicated above, incompleteness in interaction data may result in artificially high numbers of iMotifs. This overestimation can be reduced by merging iMotifs with a common sequence pattern (Materials and Methods). Fusing iMotifs based on sequence pattern similarity decreased the average number of iMotifs per hub from 8.6 to 4.2. This reduction, in turn, increased the correlation between the number of binding sites from Kim et al. [
We described, implemented, and evaluated a method that relies solely on binary protein interactions to identify interacting motifs (iMotifs) and their interactions. Our approach obtained a high positive predictive value for identifying proteins with domains from the same SCOP family and for predicting domain–domain interactions. We also analyzed hub proteins and their properties based on the number of iMotifs assigned to them, obtaining findings similar to those of an independent approach that relies on protein structure information [
Recent estimates suggested that only one-fifth of interaction types are known [
Relying solely on experimentally detected interactions affects the accuracy of our method. It has been shown that high-throughput experiments have limited reliability and that many of the detected interactions are probably not direct (i.e., they are carried out through a third protein) or do not even exist (i.e., false positives) [
The combination of iMotif assignments with sequence search methods identified specific sequence patterns in iMotifs. We found that most iMotifs had a significant sequence pattern that was contained in most of the iMotif proteins. These patterns, which could be responsible for the iMotif proteins' common interactions, could then be used to: (i) localize the iMotif in the protein sequence, (ii) assign iMotifs to proteins for which no interaction data are yet available, and (iii) predict interactions between proteins that contain patterns assigned to two interacting iMotifs.
Our iMotif assignments are similar to those obtained using an independent approach, which relies not only on known protein–protein interactions, but also on protein structure information [
Our results extend the findings and conclusions of Kim and co-workers [
Protein–protein interactions from DIP 2006.01.16 [
Protein domain assignments and classification were obtained from the SCOP release 1.69 [
A list of ORFs essential for the survival of the yeast cell was obtained from the
The assignment of iMotifs to a set of proteins is carried out in a four-step procedure (
First, build the protein interaction network.
Second, initialize a cluster interaction network (i.e., nodes are clusters that contain one or more proteins, and edges are interactions between clusters) by assigning each protein of the protein interaction network to a different cluster. Each cluster (containing one protein
Third, iteratively create new clusters by fusing the most similar clusters until the similarity score drops below a predefined threshold. Two clusters are similar if they share a minimum number of common interacting partners (
Fourth, each cluster with more than one protein is labeled with a different interacting motif identifier (iMotif), and that iMotif is assigned to all proteins within that cluster. iMotif–iMotif interactions are then derived from interactions in the cluster interaction network where both sides of the interaction have been labeled with an iMotif identifier.
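The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the similarity score is taken to be the number of common interaction partners, a fused cluster is assumed to keep only the partners shared by all of its members, and the function name, the `min_common` parameter, and the `iMotif1`, `iMotif2`, … labels are hypothetical. The derivation of iMotif–iMotif interactions is omitted.

```python
from itertools import combinations


def assign_imotifs(interactions, min_common):
    """Group proteins that share at least `min_common` interaction partners."""
    # Step 1: build the protein interaction network as partner sets.
    partners = {}
    for a, b in interactions:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)

    # Step 2: one cluster per protein. Each cluster is a pair
    # (members, partners common to all members).
    clusters = [(frozenset([p]), set(ps)) for p, ps in sorted(partners.items())]

    # Step 3: fuse the most similar pair until the score drops below the threshold.
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            score = len(clusters[i][1] & clusters[j][1])
            if best is None or score > best[0]:
                best = (score, i, j)
        if best is None or best[0] < min_common:
            break
        _, i, j = best
        # Assumption: the fused cluster keeps only the shared partners.
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] & clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

    # Step 4: label every cluster that holds more than one protein.
    multi = [members for members, _ in clusters if len(members) > 1]
    return {f"iMotif{n}": sorted(members) for n, members in enumerate(multi, 1)}


toy_network = [
    ("A", "X"), ("A", "Y"), ("A", "Z"),
    ("B", "X"), ("B", "Y"), ("B", "Z"),
    ("C", "X"),
]
groups = assign_imotifs(toy_network, min_common=2)
```

In this toy network, A and B share three partners and are grouped together, while C shares only one partner with them and remains unlabeled. The pairwise search is quadratic per iteration, which is acceptable for a sketch but would need an index for genome-scale data.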
For example (
We have evaluated the method on a test set created by selecting proteins (i) with at least five experimentally detected interactions, (ii) with at least 80% of their sequence covered by the domains defined in SCOP, and (iii) that did not introduce a redundancy bias in the evaluation (i.e., if any two sequences had a sequence identity greater than 30%, a BLAST e-value smaller than 10⁻⁵, and the alignment had at least 30 residues, the shorter member of the pair was not selected). The final set contained 368 sequences (
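The redundancy criterion (iii) can be illustrated with a short sketch. It assumes that pairwise alignment results are already available as tuples of (first sequence, second sequence, percent identity, e-value, alignment length) and that pairs are processed in the order given; the processing order, the function name, and the input layout are assumptions, since the text above specifies only the thresholds.

```python
def remove_redundancy(lengths, hits):
    """Drop the shorter sequence of every redundant pair.

    lengths -- dict: sequence id -> sequence length
    hits    -- iterable of (id1, id2, identity_pct, evalue, aln_len); hypothetical layout
    """
    removed = set()
    for a, b, identity, evalue, aln_len in hits:
        if a in removed or b in removed:
            continue  # one member of this pair is already excluded
        if identity > 30 and evalue < 1e-5 and aln_len >= 30:
            removed.add(a if lengths[a] < lengths[b] else b)
    return sorted(set(lengths) - removed)


kept = remove_redundancy(
    {"s1": 300, "s2": 200, "s3": 250},
    [("s1", "s2", 45.0, 1e-20, 150), ("s1", "s3", 25.0, 1e-8, 100)],
)
```

Here `s2` is dropped because it is the shorter member of a pair meeting all three thresholds, whereas `s3` is kept because its identity to `s1` does not exceed 30%.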
The SCOP family assignment was evaluated by considering as positive assignments those proteins found by the method to have a common iMotif with the query protein. Among these positives, we define as true positives those proteins that have a common SCOP family code with the query protein. Moreover, we define as false negatives the proteins that have the same SCOP family code as the query protein but were not found by the method to share an iMotif.
iMotif–iMotif interaction predictions were evaluated against interacting SCOP families obtained from the PDB. Two SCOP domains were considered to interact if they were co-crystallized and had at least two atoms within 5 Å distance. Because we are interested in domain interactions at the protein–protein interaction level, we excluded intrachain interactions from this set. Our method creates a list of putative domain–domain interactions for each predicted iMotif–iMotif interaction by assuming that all domains of the query protein with one iMotif interact with all domains of proteins with the other iMotif. In this context, we define as positive any iMotif–iMotif interaction where the query protein is involved. A positive is then considered a true prediction if at least one of its putative domain–domain interactions is observed in the PDB. Finally, false negatives are interactions observed in the PDB for SCOP families of the query protein that do not appear in any list of putative SCOP family interactions.
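As an illustration of the evaluation just described, the sketch below builds the list of putative domain–domain interactions for one predicted iMotif–iMotif interaction and checks it against a set of interacting family pairs. The function names, the `domains_of` mapping, and the family labels are hypothetical placeholders, and the set of PDB-derived pairs is assumed to be precomputed.

```python
from itertools import product


def putative_domain_pairs(query_domains, partner_proteins, domains_of):
    """All domains of the query paired with all domains of proteins carrying the other iMotif."""
    pairs = set()
    for partner in partner_proteins:
        for d1, d2 in product(query_domains, domains_of.get(partner, ())):
            # Unordered pair; a single-element frozenset denotes a same-family pair.
            pairs.add(frozenset((d1, d2)))
    return pairs


def is_true_prediction(pairs, pdb_pairs):
    """A positive counts as true if at least one putative pair is observed in the PDB set."""
    return any(pair in pdb_pairs for pair in pairs)


domains_of = {"Q": ["famA"], "P1": ["famB", "famC"], "P2": ["famA"]}
pairs = putative_domain_pairs(domains_of["Q"], ["P1", "P2"], domains_of)
observed = {frozenset(("famA", "famC"))}
```

With these toy inputs, the prediction is counted as true because the pair famA–famC appears among the observed interactions.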
To avoid biases in the evaluation, only proteins from the test set (before removing redundancy) and their SCOP families were considered when counting positives and negatives. The positive predictive value is defined as the number of true positives over the total number of positives, and sensitivity is the number of true positives over the sum of true positives and false negatives. The positive predictive value and sensitivity were calculated with respect to the similarity score threshold used for stopping the clustering. We also define the applicability of the method as the percentage of proteins with at least one positive under a given threshold.
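The three measures defined above can be computed as in the following sketch for the SCOP family evaluation. It assumes that positives and negatives are counted per query protein over all other proteins in the set, so each pair is visited once from each side; this counting scheme and all names in the code are assumptions made for illustration.

```python
def evaluate_family_assignment(imotifs, families):
    """Return (positive predictive value, sensitivity, applicability in %).

    imotifs  -- dict: protein -> set of iMotif labels (may omit unlabeled proteins)
    families -- dict: protein -> set of SCOP family codes
    """
    tp = fp = fn = 0
    with_positive = 0
    proteins = sorted(families)
    for query in proteins:
        has_positive = False
        for other in proteins:
            if other == query:
                continue
            same_imotif = bool(imotifs.get(query, set()) & imotifs.get(other, set()))
            same_family = bool(families[query] & families[other])
            if same_imotif:
                has_positive = True
                if same_family:
                    tp += 1  # positive sharing a SCOP family with the query
                else:
                    fp += 1
            elif same_family:
                fn += 1  # same family, but no common iMotif found
        with_positive += has_positive
    ppv = tp / (tp + fp) if tp + fp else None
    sensitivity = tp / (tp + fn) if tp + fn else None
    applicability = 100.0 * with_positive / len(proteins) if proteins else None
    return ppv, sensitivity, applicability


ppv, sensitivity, applicability = evaluate_family_assignment(
    {"P1": {"m1"}, "P2": {"m1"}, "P3": {"m1"}},
    {"P1": {"a"}, "P2": {"a"}, "P3": {"b"}, "P4": {"a"}},
)
```

Repeating the call for the iMotif assignments obtained at each clustering threshold would give the curves of each measure as a function of the threshold.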
For each group of protein sequences with a given iMotif, sequence signatures were generated using the PRATT program [
The significance (i.e.,
In this work, the best sequence pattern for each iMotif (
Interacting motifs were merged by means of an agglomerative hierarchical clustering. Two iMotifs were considered to be similar if they had a common sequence pattern when applying a
Hidden Markov Models from the Pfam-A database [
All correlations were measured using the Spearman rank correlation coefficient (
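For reference, the Spearman rank correlation coefficient is the Pearson correlation computed on ranks, with tied values receiving the average of their ranks. The sketch below is a plain-Python illustration of that definition and not the code used in this work; the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with ties replaced by the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

The significance of a coefficient is not computed here; a statistics package would normally be used for that.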
To measure the likelihood of two proteins
(94 KB TIF)
(A) Proteins that have more than 70 interactions are ignored when performing the analysis.
(B) Only interactions from yeast two-hybrid (Y2H) experiments are used.
(94 KB TIF)
(A) Superposition of the prothrombin and the pancreatic trypsin inhibitor structures (PDB IDs 1BTH and 2HPQ) shows an interaction through the SCOP family domain Eukaryotic proteases (in red).
(B) The structure of the anionic trypsin II interaction with the pancreatic trypsin inhibitor (PDB ID 1BRB) also shows an interaction through the SCOP family domain Eukaryotic proteases (in red).
(233 KB TIF)
The positive predictive value, sensitivity, and applicability are plotted as a function of the number of common interacting partners threshold used for the clustering. The positive predictive value and sensitivity using a trivial approach are also shown (thin lines). The applicability for the trivial approach is ∼70%. Sensitivity is highly dependent on the group of proteins for which an iMotif–iMotif interaction can be predicted at a given threshold: if a protein with a prediction has a SCOP code with multiple interactions in the PDB, the sensitivity obtained can vary greatly from one threshold to another. Moreover, we compared our method with the trivial approach of creating putative lists of domain–domain interactions by assuming that all domain families of proteins in the test set interact with all domain families of their interaction partners. The positive predictive value for this trivial approach was 33%, which is below that of our method for thresholds higher than 15.
(93 KB TIF)
Due to the limited number of iMotif assignments with stringent N thresholds, correlations become nonsignificant (i.e.,
(35 KB TIF)
Three different significance cutoffs were used for finding Pfams associated with iMotifs: 10⁻⁵ (long dashed line), 10⁻⁸ (solid line), and 10⁻¹⁰ (short dashed line).
(87 KB TIF)
The procedure followed to remove redundancy was the same as the one used for creating the evaluation set. We observe a significant improvement for all metrics with respect to
(95 KB TIF)
(345 KB TIF)
In parentheses, the number of pairs with at least one domain within the same SCOP family is indicated. We observe that metrics such as Rmin outperform N at detecting a higher number of protein pairs with a domain within the same SCOP family, but this is done at the expense of decreasing the accuracy of the method.
(63 KB PDF)
The number of common interaction partners (N) was set to 15. The first column is the iMotif identifier. The second column is the number of proteins within the iMotif. Subsequent columns are the proteins within the iMotif. Proteins are identified using UniProt entry names and NCBI GI identifiers. In parentheses, “yes”, “no”, and “-” indicate whether the proteins had a domain within the same SCOP family.
(251 KB TXT)
The number of common interaction partners (N) was set to 20. See legend on
(2.4 MB TXT)
(38 KB TXT)
First column is the iMotif identifier. Second column is the pattern in PROSITE format. Third column is the fraction of proteins within the iMotif that have the pattern. Fourth column is the
(704 KB TXT)
Proteins from the Test Set, using UniProt Accession Numbers
(68 KB PDF)
We thank P. M. Kim for providing the data for
iMotif, interacting motif
PDB, Protein Data Bank
PIANA, Protein Interactions and Network Analysis
SCOP, Structural Classification of Proteins