Conceived and designed the experiments: WKK EMM. Performed the experiments: WKK. Analyzed the data: WKK EMM. Contributed reagents/materials/analysis tools: WKK. Wrote the paper: WKK EMM.
The authors have declared that no competing interests exist.
Proteins interact in complex protein–protein interaction (PPI) networks whose topological properties—such as scale-free topology, hierarchical modularity, and dissortativity—have suggested models of network evolution. Currently preferred models invoke preferential attachment or gene duplication and divergence to produce networks whose topology matches that observed for real PPIs, thus supporting these as likely models for network evolution. Here, we show that the interaction density and homodimeric frequency are highly protein age–dependent in real PPI networks in a manner which does not agree with these canonical models. In light of these results, we propose an alternative stochastic model, which adds each protein sequentially to a growing network in a manner analogous to protein crystal growth (CG) in solution. The key ideas are (1) interaction probability increases with availability of unoccupied interaction surface, thus following an anti-preferential attachment rule, (2) as a network grows, highly connected sub-networks emerge into protein modules or complexes, and (3) once a new protein is committed to a module, further connections tend to be localized within that module. The CG model produces PPI networks consistent in both topology and age distributions with real PPI networks and is well supported by the spatial arrangement of protein complexes of known 3-D structure, suggesting a plausible physical mechanism for network evolution.
Proteins function together forming stable protein complexes or transient interactions in various cellular processes, such as gene regulation and signaling. Here, we address the basic question of how these networks of interacting proteins evolve. This is an important problem, as the structures of such networks underlie important features of biological systems, such as functional modularity, error-tolerance, and stability. It is not yet known how these network architectures originate or what driving forces underlie the observed network structure. Several models have been proposed over the past decade—in particular, a “rich get richer” model (preferential attachment) and a model based upon gene duplication and divergence—often based only on network topologies. Here, we show that real yeast protein interaction networks show a unique age distribution among interacting proteins, which rules out these canonical models. In light of these results, we developed a simple, alternative model based on well-established physical principles, analogous to the process of growing protein crystals in solution. The model better explains many features of real PPI networks, including the network topologies, their characteristic age distributions, and the spatial distribution of subunits of differing ages within protein complexes, suggesting a plausible physical mechanism of network evolution.
Life is highly organized at all levels of molecules, cells, tissues, and organisms, and such relationships among biological entities are often represented as networks, with vertices representing e.g. genes or proteins, and edges representing e.g. physical protein interactions, transcriptional regulation, or metabolic reactions. The topology of biological networks shows many interesting characteristics, such as scale-free topology (power-law or broad degree distribution) and hierarchical modularity (reviewed in
One important question is thus how these important network architectures originate, and what driving forces underlie the observed networks. It has not been clear whether network architecture results from the mosaic sum of each gene or protein's inherent properties, such as
Initially, Barabási and Albert proposed a preferential attachment rule as a general mechanism to generate scale-free networks
Although the DD model generates scale-free and modular networks, it has drawbacks that must be noted if it is to be considered a main mechanism for PPI network evolution. Primarily, only a small fraction of duplicate genes effectively contribute to the overall network topology. The key feature of the DD model originates from the fact that duplicate genes share a certain number of interaction partners. However, the interaction patterns of duplicate genes diverge rapidly
To better understand the evolution of PPI networks, we analyzed a non-topological property—the age of each protein as estimated based upon the taxonomic distribution of its constituent domains
First, we introduce the basic attachment rules of protein-protein interactions. The interaction densities, Dm,n, between two protein age groups (m,n) show unique patterns depending upon the attachment rule. Three basic rules are considered—random attachment (RA), preferential attachment (PA) by Barabási and Albert
In the RA model, a new node is randomly connected to existing nodes with equal probabilities. Initially, at time t = 1, the first age group, G1, makes only intra-group connections. Then a new group, G2, is introduced and connected randomly either to G1 (inter-group) or within G2 (intra-group). In the RA model, the expected interaction density, D, is the same between D1,2 and D2,2. Similarly, G3 connects to G1, G2, and within G3, showing the pattern of D1,3 = D2,3 = D3,3. More generally, the RA model shows a pattern of Dm,n = Dm+1,n (m<n) (
The protein age groups G1, G2, and G3 emerge at times t = 1, 2, and 3, respectively. In all cases, the first age group, G1, makes intra-group connections at t = 1. (A) In the random attachment (RA) model, G2 makes connections to G1 and within G2 with an equal probability at t = 2, showing that D1,1 = D1,2. Similarly, G3 makes connections to G1, G2, and within G3 (D1,3 = D2,3 = D3,3). The interaction densities between protein age groups are shown in the right panel. (B) In the preferential attachment (PA) model, G2 attaches more frequently to G1 than within G2 because, on average, G1 is more connected (D1,2>D2,2). At t = 3, G3 is preferentially connected to older groups in the order of G1>G2>G3 (D1,3>D2,3>D3,3). (C) In anti-preferential attachment (AP), the interaction density shows the reverse pattern to PA. Because a new node prefers less-connected nodes or younger groups, the density pattern shows D1,2<D2,2 and D1,3<D2,3<D3,3. Therefore, the interaction density (D) decreases in AP but increases in PA from top to bottom in the right panel.
As a measure of age-dependency of interaction density, ΔD is defined as the average value of Dm+1,n - Dm,n (m<n) (see
We collected two independent sets of yeast PPIs - literature curated (LC) and high-throughput (HTP) PPIs, using the method of Batada
The PPIyeast recapitulates known topological features such as a scale-free degree distribution, hierarchical modularity, and degree-dissortative mixing property
None of the canonical models (PA, DD, and AP) were compatible with the real PPIyeast in terms of both topology and the age-dependency of interaction density. Only the CG model shows similar characteristics to the PPIyeast for all the network properties tested. The plots in each row, I-IV, indicate (I) the degree distribution P(k), (II) the clustering coefficient C(k), (III) the average degree of nearest neighbors <knn>(k), and (IV) the interaction density pattern (ΔD) between protein age groups. In the yeast PPI, the network shows a scale-free degree distribution, hierarchical modularity, and dissortative mixing properties (negative correlation in rows I-III, respectively). In row IV, the interaction density tends to be dense within the same group (diagonal) and sparse between different age groups (off-diagonal) in each column with positive ΔD, similar in pattern to the anti-preferential attachment (AP) in
Surprisingly, the interaction density of PPIyeast is also highly age-dependent. Yeast proteins were assigned to one of the age groups ABE, AE/BE, E and F depending on the taxonomic distribution of constituent domains among archaea (A), bacteria (B), eukaryote (E) and fungi (F) (see
We next simulated PPI network evolution using the three canonical models—PA (preferential attachment), DD (duplication and divergence), and AP (anti-preferential attachment) and tested compatibility with PPIyeast in terms of both topology and age-dependency. In all three models, the network starts from a small number, N0 = 4 of seed nodes and a new node is added until the total number of nodes reaches N = 3,000, which is comparable to the PPIyeast (LC) with 3,268 nodes and 12,058 edges. In the PA and AP models, a fixed number of edges (ΔE = 4) are added for each new node, which makes the final network size similar to the PPIyeast. The link probability (P) is proportional to the degree in the PA model (P ∼ k) and inversely proportional in the AP model (P ∼ k−1). For the DD model, we employ one of the simplest models by Vázquez
Surprisingly, none of the three models satisfied all of the characteristics of PPIyeast (the 2nd, 3rd and 4th columns in
While additional variants of each model might be considered
To better address both topological and age properties of real networks, we developed an alternative model for PPI network evolution called the crystal growth model (CG), in which we view the growth of a PPI network as analogous to incorporating new proteins into crystals grown in solution (
(A) The CG model mimics sequential incorporation of new proteins to crystals grown in solution. In stage I, the initial set of proteins (red) form seeds of new crystals. In stage II, a new protein is added, which either forms a new seed crystal (n) or attaches to an existing crystal (e). In the latter case, the protein e attaches to one protein in the crystal (solid arrow) and then further interacts with nearby proteins (dotted arrow). In stages III and IV, the second- (orange) and third- (yellow) generation proteins repeat the process of stage II, with the result that the early generation tends to be located at the core of each crystal and the late generation at the periphery. (B) Similarly, the CG model starts with a small number of seed nodes (N0). In each cycle, modules are defined and a new node is added that makes a fixed number of connections (ΔE). A new node creates a new module at a probability Pnew and makes connections to any other node in accordance with the AP rule. Otherwise, one module (crystal) is randomly selected and the new node is connected exclusively to the nodes in the selected module. After ΔE connections are made, modules are redefined and the cycle is repeated.
The procedure of the CG model is illustrated in
The CG model introduces two parameters, how to define the network modules and how frequently a new module is created (Pnew). A network module is generally defined as a densely connected sub-network, and there are various ways to partition a network into modules. Most stringently, modules can be defined as complete subgraphs or cliques, and more loosely they can be defined as k-cores, triangularly connected components (TCC) and so on. We tested two different module definitions, one by Newman
Networks generated by the CG model show a remarkable similarity to real PPI networks for all tested network properties. A typical result of the CG model is shown in the 5th column in
The canonical models were shown to significantly deviate from the PPIyeast, but the CG model shows a good agreement not only qualitatively but also quantitatively (
(A) PPIyeast, (B) the PA and AP models, (C) the DD model at p = 0.1, q = 0.5∼0.7, and (D) the CG model. In (B), the scale-free index, γ, of the AP model is not shown because the resulting network is not scale-free. The properties of the CG model are more similar to PPIyeast than those of the PA, AP, and DD models. Index values are normalized so that the average indexes of LC and HTP are zero, calculated as Inorm = (Iraw−Iyeast)/Irange, where Inorm is the normalized index and Iraw is the index value of each model. Iyeast is the average index between LC and HTP except for <k>, where Iyeast is set to the average degree of LC because <k>LC is similar to <k> = 8.0 in the PA, AP, and CG models. The denominator Irange is set to max(Iraw) observed in LC, HTP, and the models, except for δ and ΔD showing both negative and positive values. In the case of ΔD and –δ, the denominator Irange is set to max(Iraw)−min(Iraw) because these indexes range from negative to positive values in LC, HTP, and the models. The sign of δ is reversed to −δ to give the index positive values for LC and HTP.
DD and PA show an inverse age-dependency of PPIyeast and much less modularity in terms of clustering coefficient and triangle density although they show scale-free degree distributions (
In the CG model, homodimers would be more frequent in older groups because there are simply fewer proteins with which to make connections in earlier stages. The age distribution of homodimeric interactions was exactly in the order of ABE>AE/BE>E>Fu among the 166 homodimeric yeast proteins collected from UniProt
The ratio between the observed (Obs.) and the expected (Exp.) number of homodimers is plotted for each age group, calculating for each age group the fraction of homodimeric proteins divided by the fraction of total yeast proteins accounted for by that age group.
Within the sub-networks of known complexes from MIPS, protein subunits tend to be either more likely to be connected among similar age groups in agreement with the general tendency of positive ΔD in the full yeast PPI networks (
We further validated the CG model by inspecting the 3-D subunit arrangement of protein complexes according to age. Obviously, a protein subunit of a stable complex interacts mostly with the subunits of its participating complex. When a subunit is in contact with multiple other subunits in a protein complex, it is most likely that the partner subunits are spatially close, often interacting among themselves as well. For transient interactions, the member proteins can interact with fewer spatial constraints but the interactions are much denser within each biological module, e.g. as for a MAP kinase signaling pathway or transcription initiation complex. Therefore, a protein tends to interact in a highly “localized” manner within the biological modules it belongs to. None of the canonical models has such a module-oriented mechanism as the CG model. In the CG model, older subunits of protein complexes would tend to be more centrally located than younger ones because each protein is attached in the order of its age. Therefore, it is more likely that older subunits are aggregated centrally and younger subunits are scattered at the periphery in a protein complex.
To examine this trend among known protein complexes, we collected protein complexes from the Protein Databank (PDB) which consisted of at least 3 protein chains, with at least 2 age groups represented; these are stringent criteria that strongly limit the number of available complexes. After removing inappropriate complexes, such as non-protein structures, viral proteins, antibodies and small peptides, a non-redundant set of 12 multi-protein complexes was collected that met these criteria (detailed descriptions are in
In general, older subunits tend to be aggregated centrally (red tone), while younger ones are separated peripherally (green and blue) (
Subunits of all 12 known multi-protein complexes with at least three proteins and two-age groups are colored according to their age groups: The most ancient group, ABE, is colored in red tones (yellow, pink, magenta, orange, red). The AB, AE, or BE groups are in green tone, and the most recent A, B, and E groups are in blue. For visual clarity, the older group(s) is presented in cartoon models and the youngest group in space-filling models in each complex. The age group assignments in (K) and (L) were ambiguous because the chains assigned AE could be assigned to ABE if the BLAST hit cut-off was slightly relaxed to 25% instead of 30% sequence identity (the “twilight zone” for homology detection). Therefore, (K) and (L) may, in fact, consist of the subunits of the ABE group only. The subunits are in various configurations. In (A) and (L), the younger subunits are spatially separated, but the older subunits are aggregated. In (B–E), two old subunits (three in (E)) and one young subunit are linearly connected. In all four cases, the older subunits are all connected without insertion of the younger subunit in the middle. In (F–H), the subunits form a trans-membrane helix bundle, where young subunits are always located at the periphery while old subunits are at the center. In (I) and (J), all the subunits are in contact with each other. In the case of (K), there are two modules—the clamp (upper homo-trimeric ring) and the clamp loader (lower hetero-pentameric ring). Considering the clamp loader alone, both younger and older subunits are separated. (A) APRIL and TACI (TNF receptor) complex (Protein databank code 1xu2). (B) Urokinase receptor, urokinase, and vitronectin complex (3bt1). (C) Factor Xa/NAP5 complex (2p3f). (D) Thrombin-PAR4 complex (2pv9). (E) Complexin/SNARE complex. (F) Cytochrome b6f complex (2e76). (G) Cyanobacterial photosystem I (1jb0). (H) Photosynthetic Oxygen-Evolving Center. A cross-section of the trans-membrane helix bundle is shown (1s5l). (I) APPBP1-UBA3-NEDD8 complex (1r4m). (J) Cytochrome ba3 Oxidase (1xme). (K) DNA clamp–clamp loader complex (1sxj). (L) Cythochrome bc complex (1ezv).
It is notable that the total degree of PPIyeast is underestimated relative to the actual degree due to homomeric interactions and subunit stoichiometry. For example, the APRIL-TACI complex (
The validity of network evolution models have been measured mainly by the resulting network topology, such as a power-law degree distribution, hierarchical modularity and dissortativity as observed in real PPI networks. Accordingly, the DD model has been thought of as the principal mechanism for PPI network evolution. Here, we dissect the history of PPI network evolution by inspecting several protein age-dependent patterns such as interaction density, homodimeric frequency, and the 3-D spatial arrangement of subunits within multiprotein complexes. The age-dependencies are shown to be very effective in discriminating the validity of different models as summarized in
PPI | PA | DD |
AP | CG | ||
Scale-free | Yes | Yes | Yes | No | Yes | |
Modularity | Yes | No | Yes | No | Yes | |
Q | High | Low | High | Low | High | |
C(k) | High | Low | Medium | Low | High | |
C(k) ∼ k shape | - | Different | Similar | Different | Similar | |
Triangle density | High | Low | Low∼Medium** | Low | High | |
Dissortativity (δ) | Yes | Yes | Yes/No** | No | Yes | |
Age-dependency | ||||||
Interaction density (ΔD) | Yes | No | No | Yes | Yes | |
Homodimeric frequency | Yes | Yes | No | Yes | Yes | |
3-D subunit arrangement in protein complexes | - | Not explicitly modeled | Non-supportive | Not explicitly modeled | Supportive |
Results for the DD model were collected at typical values (p = 0.1, q = 0.6). Results for scale-freeness and age-dependency are robust to changes in these parameters. However, aspects of modularity and dissortativity of the DD model vary with these parameters and the specific choice of DD model, indicated with **.
In the CG model, we view the PPI network as sparse and dynamic protein crystals
The CG model exploits two keys ideas, the first being that the chance of new connection is proportional to the availability of free surface, which is a feature readily recognized by a new protein molecule; this results in an anti-preferential attachment (AP) rule. Although the same surface of a protein can be involved in multiple interactions with different partners through spatial and temporal differentiation, such a factor uniformly increases the capacity of interactions in any protein. Therefore, the connection probability is still positively correlated with the available surface area. These results agree with those of Kim et al.
At the basis of the crystal growth model is the notion that new interactions form preferentially within existing physical complexes (enforcing modularity), and thus are limited by available protein surface area (the AP rule). Thus modularity & the AP rule both arise due to simple physical constraints of which proteins are most accessible to each other. Recently, Levy and colleagues has shown that the successive steps of homo-oligomeric assembly mimics the evolutionary pathway
Given that the CG model follows an AP rule, how does it generate scale-freeness or “the rich get richer” connectivity? In the CG model, the network grows by
Our result suggests that the CG model is a more plausible mechanism for PPI network evolution than the DD model. First, all the age-dependent aspects tested agree well with the CG model but disagree with the DD model. Second, the CG model is more comprehensive than the DD model in that the CG model can accommodate both gene duplication and horizontal gene transfer as the origins of new nodes (genes). Practically, the DD model may be applicable only to ∼20% of the yeast proteome having identifiable duplicates
The age-dependency of interaction density also sheds light on a more fundamental question regarding the mechanism of PPI network evolution. It has been hypothesized that inherent features of proteins, such as stickiness and hydrophobicity are dominant factors in shaping the global network structure
Power-law distributions have been commonly observed in various types of networks, such as the Internet, social networks, and biological networks. However, the growth of a PPI network poses unique constraints compared to other types of networks. For example, in an airline or railroad network, each new connection is made by considering the context of global network topology (e.g., to minimize average path length), which seems intuitively unlikely to be the case in PPI networks. The CG model follows two simple constraints of available free surface and localized connection, which are physically plausible and depend only on local context but not global topology. With these minimal assumptions analogous to growing protein crystals, the CG model recapitulates remarkably well the age-dependencies as well as the network topologies of the yeast PPI networks.
Two independent sets of yeast protein-protein interaction data were collected using a method essentially identical to that described by Batada et al.
Pfam domains were assigned for yeast proteins using BioMart (
Interaction density Dm,n measures the normalized interaction density between two age groups m, n (m<n). ΔD measures the interaction preference of a new node by the age differences. A positive value of ΔD indicates that a new node makes connections more frequently with close age groups than with distant ones.
First, the normalized interaction density Dm,n between two age groups m,n (m<n) is calculated as
The modularity of a network is measured by the modularity index Q by Newman
The list of PDB entries and 3-D coordinates were obtained from PQS (Protein Quaternary Structure Server,
The protein chain clusters at 30% sequence identity cut-off were downloaded from PDB (Protein Data Bank,
For NR30 entries, the age group of each PDB chain was assigned using BLAST against NR90 set of archaea, bacteria and eukaryote sequences from UNIPROT (
LC dataset
(0.20 MB TXT)
HTP dataset
(0.11 MB TDS)
The age group assignment of yeast genes
(0.08 MB TDS)
The list of homodimeric proteins and their age group assignment
(0.01 MB TDS)
The protein ratio of different age groups in yeast PPI networks. LC: literature-curated, HTP: high-throughput, LC+HTP: the union of LC and HTP.
(0.08 MB PDF)
The network properties of the HTP, LC+HTP, and Y2H-union dataset. The plots in each row, I-IV, indicate (I) The degree distribution P(k), (II) the clustering coefficient C(k), (III) the average degree of nearest neighbors <knn>(k), and (IV) the interaction density pattern (ΔD) between protein age groups. HTP, LC+HTP, and Y2H-union set show similar characteristics as LC dataset.
(0.29 MB PDF)
The network properties by the CG model, where the network modules were defined by TCC (triangularly connected components) instead of the Newman's method. The network structure is still similar to the yeast PPI networks, showing scale-free, hierarchical modular, degree-dissortative characteristics and an interaction density pattern of DD>0. (A) The degree distribution P(k), (B) the clustering coefficient C(k), (C) the average degree of nearest neighbors <knn>(k), (D) the interaction density pattern between protein age groups.
(0.09 MB PDF)
Age-dependent interaction patterns of several MIPS complexes in the LC+HTP set. In mRNA splicing (A) and replication (B) complexes, the subunits of the same age group are more likely to be connected. In RNA polymerase I & III (C and D), most subunits are densely connected to each other, therefore age-dependency is not evident. In the case of actin-associated proteins, most subunits are of the same age group (E), reflecting a relatively recently emerged module.
(0.52 MB PDF)
The network characteristics of the yeast PPI data.
(0.06 MB PDF)
The network characteristics of the network growth models
(0.13 MB PDF)