Conceived and designed the experiments: PVM LZ BCR JSL HG. Performed the experiments: KL. Analyzed the data: PVM GZ HG. Contributed reagents/materials/analysis tools: PVM. Wrote the paper: PVM HG. Responsible for all computational work, designed and implemented an independent version of the algorithm that is introduced in the paper: PVM. Responsible for all experimental work that is presented in the paper: KL. Originally discussed the idea, designed another independent version of the algorithm (not used in the paper): LZ BCR. Provided computational analysis on muscle-specific motif modules that is used in the paper: GZ.
The authors have declared that no competing interests exist.
Recent studies of cellular networks have revealed modular organizations of genes
and proteins. For example, in interactome networks, a module refers to a group
of interacting proteins that form molecular complexes and/or biochemical
pathways and together mediate a biological process. However, it is still poorly
understood how biological information is transmitted between different modules.
We have developed
Protein–protein interactions mediate numerous biological processes. In the last decade, there have been efforts to comprehensively map protein–protein interactions occurring in an organism. The interaction data generated from these high-throughput projects can be represented as interconnected networks. It has been found that knockouts of proteins residing in topologically central positions in the networks are more likely to result in lethality of the organism than knockouts of peripheral proteins. However, it is difficult to accurately define topologically central proteins because high-throughput data are error-prone and some interactions are not as reliable as others. In addition, the architecture of interaction networks varies across tissues in multi-cellular organisms. To address these issues, we present a novel computational approach to identify central proteins while considering the confidence of data and gene expression in tissues. Moreover, our approach takes into account multiple alternative paths in interaction networks. We apply our method to yeast and nematode interaction networks. We find that the likelihood of observing lethality and pleiotropy when a given protein is eliminated correlates better with our centrality score for that protein than with its scores based on traditional centrality metrics. Finally, we set up a framework to identify central proteins in tissue-specific interaction networks.
In the last decade, several high-throughput experimental techniques have allowed
systematic mapping of protein-protein interaction networks, or interactome networks,
for model organisms
Work on global topology of interactome networks has led to a conclusion that these
networks are
In an interactome network, the ‘central’ proteins, which
topologically connect many different neighborhoods of the network, are likely to
mediate crucial biological functions. The most straightforward way of quantifying
the centrality of a protein in the context of interactome networks is to examine the
protein's degree, i.e. the number of binding partners interacting with the
protein of interest. Perturbations of high-degree proteins (hubs) are more likely to
result in lethality than mutations in other proteins
Both degree and betweenness are graph metrics that are not specifically tailored to
describe biological networks. Degree measures a protein's local
connectivity and does not consider the protein's position in the network
globally. Betweenness is a better measure for centrality in that it takes into
account paths through the whole network, but it still has the disadvantage of only
considering the shortest paths and ignoring alternative pathways of protein
interactions. More importantly, interactome networks can be error-prone and some
interactions in the same network are not as reliable as others. Many studies have
been conducted to categorize interaction data into different confidence levels
For a multi-cellular organism, not all interactions have the same propensity to occur
in every tissue. However, the current network metrics usually treat interactome
networks as a whole, disregarding the possibility that some interactions may not
occur at all in certain types of tissues. To address this, we developed a framework
for studying tissue-specific networks using the information flow model. We
constructed an interactome network for muscle enriched genes in
We modeled an interactome network as an electrical circuit, where interactions
were represented as resistors and proteins as interconnecting nodes (
We model an interactome network as an electrical circuit, where a node represents a protein and a resistor represents an interaction. The resistance value of a resistor is inversely proportional to the confidence score of the corresponding interaction.
Unlike degree, which considers only direct interactions, or betweenness, which
scores only proteins along the shortest paths (interpreted as the dominant paths), the
information flow model weighs proteins along all possible paths. Therefore,
the information flow model is able to rank “runner-up”
proteins participating in many paths of information transmission, instead of
only the seemingly prominent ones. This aspect of the information flow model
reflects the properties of biological pathways more faithfully: multiple
pathways have frequently been observed to act in parallel to achieve a
specific biological function
We applied the information flow model to two publicly available interactome
networks: a
Although the information flow score is a very different network metric from
betweenness or degree, there might be relationships between the information flow
score and these two topological metrics. We obtained scatter plots for the ranks
of information flow scores versus the ranks of betweenness or degree for both
the yeast interactome and the worm interactome (
Overall, ranks of information flow and betweenness are correlated, but a given betweenness usually corresponds to a wide range of information flow scores. Ranks of information flow and degree are less correlated. Low degree can correspond to low, medium or high information flow, but high degree usually corresponds to high information flow.
We propose that the information flow model is able to identify proteins central
to the transmission of biological information in an interactome network. If this
model works, eliminating the proteins of high information flow scores should be
deleterious. The perturbation of information flow and the disintegration of
functional modules are likely to result in lethality or multiple phenotypes
(pleiotropy). To test our hypothesis, we performed a correlation analysis
between the percentages of essential proteins or pleiotropic proteins and the
percentiles of information flow scores (see
The higher a protein's information flow score is, the higher the
probability of observing lethality (Panel A) or pleiotropy (Panel B)
when the protein is deleted from
In contrast, betweenness is a poorer predictor for both essentiality and
pleiotropy. For
To determine the statistical significance of the correlation, we generated
randomized datasets by shuffling genes among the percentile ranges while keeping
the number of genes in each range fixed. Next we obtained the percentage of
essential or pleiotropic genes for each range and performed correlation analysis
for each randomized dataset. We found that the correlation between essentiality
or pleiotropy and information flow scores is generally stronger in the actual
datasets than in the randomized datasets
(
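The shuffling procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes genes are supplied as a list of essentiality (or pleiotropy) flags ordered by information flow score, that each percentile range is a consecutive bin of fixed size, and that significance is summarized as the fraction of shuffled datasets whose correlation is at least as strong as the observed one. The function names and the toy data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def bin_percentages(flags: Sequence[bool], bin_sizes: Sequence[int]) -> List[float]:
    """Percentage of flagged genes in each consecutive bin (percentile range)."""
    percentages, start = [], 0
    for size in bin_sizes:
        chunk = flags[start:start + size]
        percentages.append(100.0 * sum(chunk) / size)
        start += size
    return percentages


def shuffle_test(flags: Sequence[bool], bin_sizes: Sequence[int],
                 n_shuffles: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Return the observed correlation and the fraction of shuffles that match or beat it.

    Shuffling the flags reassigns genes to bins while keeping every bin size fixed.
    """
    rng = random.Random(seed)
    bin_index = list(range(len(bin_sizes)))
    observed = pearson(bin_index, bin_percentages(flags, bin_sizes))
    shuffled = list(flags)
    at_least_as_strong = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if pearson(bin_index, bin_percentages(shuffled, bin_sizes)) >= observed:
            at_least_as_strong += 1
    return observed, at_least_as_strong / n_shuffles


# Hypothetical toy data: 10 bins of 20 genes, with more essential genes in higher bins.
toy_sizes = [20] * 10
toy_flags: List[bool] = []
for k in range(10):
    toy_flags += [True] * (2 * k) + [False] * (20 - 2 * k)
observed_r, fraction = shuffle_test(toy_flags, toy_sizes, n_shuffles=200)
```

Keeping the bin sizes fixed means that only the assignment of genes to percentile ranges is randomized, so any difference between the observed and shuffled correlations reflects the ordering by score rather than the binning itself.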
Proteins with similar betweenness in an interactome can differ significantly in
terms of information flow scores (
Even among those proteins that rank in the lower 30% in terms
of betweenness, a protein's information flow score is still a
good indicator for the probability of observing lethality (Panel A) or
pleiotropy (Panel B) when the protein is deleted from
What properties make some proteins low in betweenness but high in information
flow scores? From the information flow model, we can expect two typical
situations: one situation is that a protein lies on alternative paths that are
slightly longer than the shortest paths; the other situation is that a protein
has a limited number of high-confidence interactions. Betweenness does not take
any alternative, longer paths into consideration in the first situation, and
betweenness does not give “extra credit” to high-confidence
interactions in the second situation. We illustrated the above two situations
with example “toy” networks, and analyzed how the
information flow model scores nodes that may be important but not recovered by
betweenness (
Every interaction in the yeast interactome has a socio-affinity index that
measures the likelihood of a true interaction
The
The interactions in the
Taken together, the information flow model is effective in identifying proteins that are central in interactome networks. Even in cases where betweenness ranks are relatively low, the information flow score serves as a strong predictor for essential or pleiotropic proteins.
As more high-throughput datasets become available, new interactions are added
into the networks. High-throughput experiments are error-prone and false
positives can be problematic
In order to analyze how well the information flow model tolerates the addition of
a large amount of noisy data, we simulated a growing yeast interactome network
by adding low-confidence interactions. Higher socio-affinity indices indicate
higher confidence of interactions. In total, there are 9,290 interactions with
socio-affinity indices of 4.5 or higher, or 17,159 interactions with
socio-affinity indices of 3.5 or higher, or 39,099 interactions with
socio-affinity indices of 2 or higher. We ranked all the proteins by information
flow score and by betweenness in each of the three versions of the
interactome. The ranks of information flow scores were more
consistent than those of betweenness when low-confidence interactions were added
to the interactome (
The Y-axis represents the rank of information flow scores (Panel A and C) or the rank of betweenness (Panel B and D) in a yeast interactome that includes high-confidence interactions only (socio-affinity scores of 4.5 or higher). In Panel A and Panel B, the X-axis represents the rank of information flow scores or the rank of betweenness in a yeast interactome that includes interactions at lower confidence levels (socio-affinity scores of 3.5 or higher). The PCCs for the ranks of information flow scores (Panel A) and the ranks of betweenness (Panel B) are 0.83 and 0.71, respectively. In Panel C and Panel D, the X-axis represents the rank of information flow scores or the rank of betweenness in a yeast interactome that includes interactions at still lower confidence levels (socio-affinity scores of 2.5 or higher). The PCCs for the ranks of information flow scores (Panel C) and the ranks of betweenness (Panel D) are 0.54 and 0.38, respectively.
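The rank comparison summarized in this figure can be illustrated with a short sketch. It assumes that scores for two versions of the network are available as dictionaries keyed by protein, that the comparison is restricted to proteins present in both versions, and that tied scores share their mean rank; none of these choices is specified in the text, and the names `average_ranks` and `rank_pcc` are hypothetical.

```python
from typing import Dict, List


def average_ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank proteins from 1 (highest score); tied proteins share their mean rank."""
    ordered: List[str] = sorted(scores, key=lambda p: -scores[p])
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = mean_rank
        i = j + 1
    return ranks


def rank_pcc(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Pearson correlation between the ranks of proteins shared by both networks."""
    shared = sorted(set(scores_a) & set(scores_b))
    ranks_a = average_ranks({p: scores_a[p] for p in shared})
    ranks_b = average_ranks({p: scores_b[p] for p in shared})
    x = [ranks_a[p] for p in shared]
    y = [ranks_b[p] for p in shared]
    n = len(shared)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


# Hypothetical scores in a high-confidence network and in a larger, noisier one.
strict_scores = {"p1": 5.0, "p2": 3.0, "p3": 1.0}
relaxed_scores = {"p1": 9.0, "p2": 4.0, "p3": 2.0, "p4": 7.0}
consistency = rank_pcc(strict_scores, relaxed_scores)
```

A value close to 1 indicates that adding lower-confidence interactions leaves the ordering of the shared proteins largely unchanged.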
In multi-cellular organisms such as
We tested our hypothesis in an interactome network for muscle-enriched genes.
From a SAGE (Serial Analysis of Gene Expression) dataset of 12
Each row represents a gene and each column represents a tissue or cell
type. The normalized values of gene expression are represented in a
color scale. Genes are sorted by probability scores (Pi)
which indicate expression enrichment in muscle as compared to other
tissues. Altogether 310 muscle enriched genes (Pi≥0.5)
were identified. In this plot, the 310 muscle enriched genes, 155
randomly selected genes, and 155 genes with the lowest Pi are
shown. The list of genes can be found in
From the interactome dataset, we identified direct interacting partners of the muscle-enriched genes. We discarded the interacting genes that, according to the SAGE data, are not expressed in muscle cells. The muscle-enriched genes and their interacting partners that are expressed in muscle form a network of 332 genes and 638 interactions. We defined the weight of an interaction (g12) in the muscle interactome network as the product of the probability scores for the two interacting genes (g12 = P1 × P2). In other words, the more enriched a given gene's expression is in muscle, the higher its propensity is to interact with other enriched genes in muscle cells.
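As a small illustration of this weighting, the sketch below assigns each interaction the product of the probability scores of its two partners. The gene names and scores are hypothetical, and the function name `muscle_edge_weights` is ours.

```python
from typing import Dict, Iterable, List, Tuple


def muscle_edge_weights(
    interactions: Iterable[Tuple[str, str]],
    enrichment: Dict[str, float],
) -> List[Tuple[str, str, float]]:
    """Weight each interaction by the product of its partners' probability scores."""
    weighted = []
    for gene1, gene2 in interactions:
        # g12 = P1 * P2
        weighted.append((gene1, gene2, enrichment[gene1] * enrichment[gene2]))
    return weighted


# Hypothetical gene names and muscle-enrichment probability scores.
example_scores = {"geneA": 0.9, "geneB": 0.5, "geneC": 0.2}
example_edges = [("geneA", "geneB"), ("geneB", "geneC")]
weights = muscle_edge_weights(example_edges, example_scores)
```

Because both scores lie between 0 and 1, an interaction receives a high weight only when both partners are strongly enriched in muscle.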
We applied the information flow model to the muscle interactome network, taking
the weights of interactions into account. We ranked all the genes in the muscle
interactome network by their information flow scores in the muscle interactome
network and by their information flow scores in the entire interactome network,
respectively. We found that genes of high information flow in the muscle
interactome network and genes of high information flow in the entire network did
not completely overlap (
We identified direct interacting partners for the muscle-enriched genes
from the
We obtained the percentiles of genes in terms of information flow scores in the
muscle network and the percentiles of genes in the entire network, calculated
the differences between these two percentiles, and ranked the genes by the
differences. A
Gene name | % in the entire interactome network | % in the muscle interactome network | % difference | Motility rate of RNAi-treated worms (thrashes per minute) (mean±s.d.) |
| 73 | 14 | 59 | Maternal sterility, unable to score |
| 72 | 14 | 58 | 103±19 |
| 77 | 21 | 56 | 20±14* |
| 69 | 14 | 55 | Maternal sterility, unable to score |
F37B4.7 | 72 | 21 | 51 | 95±30 |
| 64 | 13 | 51 | 104±22 |
F41C3.5 | 66 | 17 | 49 | 105±18 |
| 58 | 9 | 49 | 108±10 |
| 68 | 25 | 43 | 93±26 |
D2063.1 | 52 | 10 | 42 | 104±22 |
Y11D7A.12 | 45 | 6 | 39 | 113±9 |
| 67 | 29 | 38 | 100±11 |
| 68 | 32 | 36 | 106±13 |
| 59 | 25 | 34 | 111±19 |
Y62E10A.13 | 77 | 45 | 32 | 93±10 |
| 34 | 3 | 31 | 16±18* |
| 35 | 4 | 31 | 12±8* |
Y39A1A.3 | 42 | 11 | 31 | 99±14 |
| 36 | 5 | 31 | 65±26* |
| 70 | 40 | 30 | 102±5 |
| 48 | 18 | 30 | 103±11 |
| 63 | 33 | 30 | 39±30* |
| 74 | 45 | 29 | 4±9* |
| 78 | 49 | 29 | 98±10 |
R07G3.8 | 73 | 45 | 28 | 93±12 |
| 51 | 100 | −49 | 102±11 |
| 11 | 63 | −52 | 48±47* |
| 47 | 100 | −53 | 110±9 |
M05D6.2 | 11 | 63 | −52 | 105±13 |
| 45 | 100 | −55 | 110±8 |
F14E5.2 | 44 | 100 | −56 | Maternal sterility, unable to score |
| 43 | 100 | −57 | 104±11 |
| 40 | 100 | −60 | 104±6 |
F11D5.1 | 39 | 100 | −61 | 111±12 |
| 36 | 100 | −64 | 105±13 |
| 30 | 100 | −70 | 100±12 |
F31E3.2 | 30 | 100 | −70 | 115±8 |
| 16 | 100 | −84 | 97±15 |
T18D3.7 | 15 | 100 | −85 | 111±7 |
| 12 | 100 | −88 | 114±12 |
| 12 | 100 | −88 | 114±9 |
The normal motility of the
It is plausible that the genes showing higher information flow scores in the
muscle network than in the entire network can also be distinguished by conventional
methods such as betweenness. To test this, we obtained the percentiles of
genes in terms of betweenness in the muscle network and those of genes in the
entire network, and ranked the genes by the differences between the two
percentiles (
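The percentile comparison used here, for either information flow or betweenness, can be sketched as follows. The sketch makes several assumptions that are inferred rather than stated: a percentile is read as a "top x%" position (a smaller value means a higher rank), tied scores share the position of the last tied gene, only genes present in the muscle network are ranked, and the difference is the value in the entire network minus the value in the muscle network, matching the column order of the table above. The function names and the toy scores are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def top_percent(scores: Dict[str, float]) -> Dict[str, int]:
    """Percent position from the top: 1 means the top 1%; ties share the lowest position."""
    values = list(scores.values())
    n = len(values)
    result = {}
    for gene, score in scores.items():
        at_least = sum(1 for v in values if v >= score)
        result[gene] = math.ceil(100 * at_least / n)
    return result


def percentile_shift(
    entire: Dict[str, float], muscle: Dict[str, float]
) -> List[Tuple[str, int, int, int]]:
    """Rows of (gene, % in entire network, % in muscle network, difference)."""
    shared = [gene for gene in muscle if gene in entire]
    pct_entire = top_percent({g: entire[g] for g in shared})
    pct_muscle = top_percent({g: muscle[g] for g in shared})
    rows = [(g, pct_entire[g], pct_muscle[g], pct_entire[g] - pct_muscle[g]) for g in shared]
    # Largest positive difference first: genes that rank higher in the muscle network.
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical scores; gene "e" is absent from the muscle network and is ignored.
entire_scores = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 10.0}
muscle_scores = {"a": 4.0, "b": 1.0, "c": 1.0, "d": 3.0}
shift_table = percentile_shift(entire_scores, muscle_scores)
```

Genes at the top of the resulting list rank much higher in the tissue-specific network than in the entire network, and genes at the bottom show the opposite pattern.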
We model interactome networks as large electrical circuits of interconnecting
junctions (proteins) and resistors (interactions). Our model identifies candidate
proteins that make significant contributions to the transfer of biological
information between various modules. Compared to degree and betweenness, our model
has two major advantages: first, it incorporates the confidence scores of
protein-protein interactions; second, it considers all possible paths of information
transfer. When a protein that mediates information exchange between modules is
knocked down, the disintegration of multiple modules is very likely to result in
lethality. Even if the organism is still viable, pleiotropy may be observed because
multiple phenotypes imply the breakdown of multiple modules. In support of our
model, we find that the information flow score of a protein is well correlated with
the likelihood of observing lethality or pleiotropy when the protein is eliminated.
Even among proteins of low or medium betweenness, the information flow model is
predictive of a protein's essentiality or pleiotropy. Compared to
betweenness, the information flow model is not only more effective but also more
robust in the face of a large amount of low-confidence data. Our method is accessible to
the public. The MATLAB implementation of the information flow algorithm, along with
the information flow scores for proteins in the yeast interactome network and
proteins in the worm interactome network, can be downloaded at
The information flow model identifies central proteins in interactome networks, and
these proteins are likely to connect different functional modules. We developed an
algorithm that decomposes interactome networks into subnetworks by removing proteins
of high information flow in a recursive manner (
Panel (A) shows our procedure for network partition, and Panel (B) shows a “toy” example.
It was previously observed in a yeast interactome network that ‘date
hubs’, which connect different modules, are more likely to participate in
genetic interactions than randomly sampled proteins, because elimination of date
hubs may make the organism more sensitive to any further genetic perturbations
Another possible feature of “between-module” proteins is related
to the expression dynamics of these proteins and their interacting partners. In
general, interacting proteins are likely to share similar expression profiles
The transmission of biological signals is directional while at present interactome
networks often reflect the formation of protein complexes
In the future, with more information integrated into interactome networks, we should be able to improve the performance of the information flow model. In addition, interactome networks can vary at different times or in different spatial locations. We still have a very limited understanding of how biological information flows through cellular networks; most likely, it does not flow exactly as electrical current does. As more knowledge is accumulated, we should be able to modify the information flow model according to the design principles of cellular networks and to reflect their dynamic nature.
All of the data used in our study comes from openly available databases and
published high-throughput datasets. We obtained a list of essential genes for
Betweenness is a centrality measure of a node in a network graph. The betweenness
of a particular node is determined by how often it appears on the shortest paths
between the pairs of remaining nodes
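For reference, shortest-path betweenness on an unweighted, undirected network can be computed with Brandes' algorithm, sketched below. This is a generic illustration rather than the implementation used in the study: the counts are left unnormalized, every interaction is treated as having unit length, and the adjacency-list input format and the function name are assumptions.

```python
from collections import deque
from typing import Dict, List


def betweenness(adj: Dict[str, List[str]]) -> Dict[str, float]:
    """Unnormalized shortest-path betweenness for an undirected, unweighted graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order: List[str] = []
        predecessors: Dict[str, List[str]] = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        # Accumulate pair dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: value / 2.0 for v, value in score.items()}


# Toy chain a-b-c-d: only the two inner nodes lie on shortest paths between other nodes.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
chain_betweenness = betweenness(chain)
```

In the chain, node b lies on the shortest paths for the pairs (a, c) and (a, d), and node c on those for (a, d) and (b, d), so each inner node receives a count of 2 while the end nodes receive 0.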
We model an interactome network as a resistor network, where proteins are represented as nodes and interactions are represented as resistors. The conductance of each resistor is directly proportional to the confidence score of the corresponding interaction. In cases where the confidence levels of interactions are not known, we assume that all resistors have unit conductance.
In order to estimate the importance of node
For a given pair of source node and ground node, the standard way of computing
resistor currents of a circuit is using
Initialize an
For every resistor in the circuit:
  Insert the off-diagonal element g
  Add the value (1/R
Remove the row and column of
The right-hand-side of the equation (3) is a vector of currents, which is
zero except for the source node
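The construction described above corresponds to standard nodal analysis, which can be sketched for a single source and ground pair as follows. The sketch assumes integer node indices, resistors given as (node, node, conductance) triples, a unit current injected at the source, and a connected network. It uses plain Gaussian elimination on a dense matrix for self-containment, which is a simplification suited only to small examples; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Resistor = Tuple[int, int, float]  # (node i, node j, conductance = 1/R)


def solve_linear(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting (inputs are not modified)."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; the network may be disconnected")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def node_voltages(n_nodes: int, resistors: Sequence[Resistor],
                  source: int, ground: int, current: float = 1.0) -> List[float]:
    """Voltages at every node when `current` enters at `source` and leaves at `ground`."""
    g = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j, conductance in resistors:
        g[i][i] += conductance      # add 1/R to both diagonal entries
        g[j][j] += conductance
        g[i][j] -= conductance      # subtract 1/R from the off-diagonal entries
        g[j][i] -= conductance
    keep = [k for k in range(n_nodes) if k != ground]   # drop the ground row and column
    reduced = [[g[r][c] for c in keep] for r in keep]
    rhs = [current if k == source else 0.0 for k in keep]
    voltages = [0.0] * n_nodes      # the ground node stays at 0
    for k, v in zip(keep, solve_linear(reduced, rhs)):
        voltages[k] = v
    return voltages


# Toy series circuit 0 -(g=1)- 1 -(g=2)- 2 with source 0 and ground 2.
toy_voltages = node_voltages(3, [(0, 1, 1.0), (1, 2, 2.0)], source=0, ground=2)
```

Once the node voltages are known, the current through each resistor follows from Ohm's law as the conductance multiplied by the voltage difference between its two end nodes.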
Below we outline the resulting algorithm for calculating information flow of a given circuit.
Assemble the
Initialize the absolute sum of currents for each node to be the zero vector
Iterate over the ground node
  Get matrix
  Compute the LU decomposition of matrix
  Iterate over the source node
    Set the right-hand-side vector
    Solve for node voltages
    Compute the absolute sum of all currents for each node and add them to the entries of
Using (1), compute the information flow for each node.
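The outline above can be turned into a small runnable sketch. Because the formula referred to as (1) is not reproduced here, the sketch makes an explicit assumption: the score of a node is half of the accumulated absolute current on its incident resistors, summed over all unordered source and ground pairs with a unit current injected at the source. Swapping source and ground only reverses the sign of every current, so unordered pairs suffice. For brevity the sketch re-solves the linear system for each pair instead of reusing an LU factorization per ground node, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Resistor = Tuple[int, int, float]  # (node i, node j, conductance)


def solve_linear(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting (inputs are not modified)."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; the network may be disconnected")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def information_flow(n_nodes: int, resistors: Sequence[Resistor]) -> List[float]:
    """Assumed score: half the absolute incident current, summed over all node pairs."""
    g = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j, conductance in resistors:
        g[i][i] += conductance
        g[j][j] += conductance
        g[i][j] -= conductance
        g[j][i] -= conductance
    abs_current = [0.0] * n_nodes
    for ground in range(n_nodes):
        keep = [k for k in range(n_nodes) if k != ground]
        reduced = [[g[r][c] for c in keep] for r in keep]
        for source in range(ground):  # each unordered pair is visited once
            rhs = [1.0 if k == source else 0.0 for k in keep]
            voltages = [0.0] * n_nodes
            for k, v in zip(keep, solve_linear(reduced, rhs)):
                voltages[k] = v
            for i, j, conductance in resistors:
                current = abs(conductance * (voltages[i] - voltages[j]))
                abs_current[i] += current
                abs_current[j] += current
    return [total / 2.0 for total in abs_current]


# Toy network with two routes between nodes 0 and 3: a high-confidence route
# through node 1 and a low-confidence route through node 2.
toy_flow = information_flow(4, [(0, 1, 1.0), (1, 3, 1.0), (0, 2, 0.25), (2, 3, 0.25)])
```

In the toy network, node 2 still receives a nonzero score because part of the current takes the lower-confidence route, whereas a measure restricted to a single best path would ignore it entirely. For networks of realistic size, a sparse factorization reused across all sources for a given ground node, as in the outline above, is far more efficient than this dense version.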
Our information flow model identifies central proteins in interactome networks. Proteins with high information flow scores very likely represent connecting points between functional modules. To test this hypothesis, we designed an algorithm that recursively removes the highest-flow proteins and releases subnetworks from a large interactome network component. In the algorithm described below, a ‘core module’ refers to a subnetwork composed of 15 to 50 proteins.
Initialize: core module set, core module size limits,
Iterate while Given Initialize nodes to be removed from Iterate over the set of modules If number of genes in If Append Add genes in Remove nodes present in Initialize high flow node(s) to be removed at this
iteration, Iterate while Remove next highest flow protein(s) from
Set Append
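A simplified version of this recursive partition can be sketched as follows. The sketch is not a reconstruction of the exact procedure: it assumes that information flow scores are computed once and supplied as a dictionary, that one protein is removed at a time, that connected components within the size limits are kept as core modules, and that components below the lower limit are discarded. The function names and the toy network are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def connected_components(adj: Dict[str, List[str]], nodes: Set[str]) -> List[Set[str]]:
    """Connected components of the subgraph induced by `nodes`."""
    remaining = set(nodes)
    components = []
    while remaining:
        start = remaining.pop()
        component = {start}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w in remaining:
                    remaining.remove(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components


def extract_core_modules(adj: Dict[str, List[str]], flow: Dict[str, float],
                         min_size: int = 15, max_size: int = 50) -> List[Set[str]]:
    """Remove the highest-flow node from oversized components until modules are released."""
    pending = connected_components(adj, set(adj))
    modules = []
    while pending:
        nodes = pending.pop()
        if len(nodes) < min_size:
            continue                      # too small to count as a core module
        if len(nodes) <= max_size:
            modules.append(nodes)         # within limits: release as a core module
            continue
        # Oversized: remove the highest-flow protein (ties broken by name) and re-split.
        top = max(sorted(nodes), key=lambda n: flow[n])
        pending.extend(connected_components(adj, nodes - {top}))
    return modules


# Toy network: two triangles joined through a high-flow connector "h".
toy_adj = {
    "a1": ["a2", "a3", "h"], "a2": ["a1", "a3"], "a3": ["a1", "a2"],
    "h": ["a1", "b1"],
    "b1": ["h", "b2", "b3"], "b2": ["b1", "b3"], "b3": ["b1", "b2"],
}
toy_scores = {"h": 9.0, "a1": 5.0, "b1": 5.0, "a2": 1.0, "a3": 1.0, "b2": 1.0, "b3": 1.0}
toy_modules = extract_core_modules(toy_adj, toy_scores, min_size=2, max_size=3)
```

In the toy network, removing the connector releases the two triangles as separate modules, which mirrors the idea that proteins of high information flow sit between functional modules.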
To evaluate the performance of information flow in signaling networks, we
combined a phosphorylation dataset for
We performed RNA interference (RNAi) experiments by feeding L4 worms, following
protocols from the WormBook
Correlation between degrees and loss-of-function phenotypes. The higher a
protein's degree is, the higher the probability of observing
lethality (Panel C) or pleiotropy (Panel D) when the protein is deleted from
(0.08 MB DOC)
Correlation between information flow scores and loss-of-function phenotypes
among proteins of low or medium degrees. Even among proteins of low or
medium degrees, a protein's information flow score is still a good
indicator for the probability of observing lethality (Panel A) or pleiotropy
(Panel B) when the protein is deleted from
(0.10 MB DOC)
Kirchhoff's Current Law: the basis for calculating information flow scores.
(0.10 MB EPS)
Genes in the
(0.02 MB DOC)
Training examples in the semi-supervised analysis of genes expressed in
(0.02 MB XLS)
A list of genes expressed in
(0.71 MB XLS)
A list of muscle-enriched genes for which promoter::GFP strains are available.
(0.03 MB XLS)
A list of muscle-enriched genes identified by the semi-supervised analysis. We scored whether the promoters of these genes contain cis-regulatory modules that indicate gene expression in muscle.
(0.04 MB XLS)
Genes showing significant difference in betweenness scores in the muscle interactome network versus in the entire interactome network.
(0.04 MB XLS)
Subnetworks revealed by recursive removal of genes of high information flow
from the
(0.07 MB DOC)
Information flow scores, lethality, and pleiotropy scores of proteins which
are part of a signaling network for
(0.03 MB XLS)
A list of genes shown in
(0.04 MB XLS)
In order to better illustrate the properties of information flow which are not exhibited by betweenness, we analyze two toy examples of possible network topologies using either of the two methods.
(0.05 MB DOC)
We executed the module extraction routines while varying the maximum and the minimum number of proteins allowed in a single subnetwork in order to determine the best size range.
(0.06 MB DOC)
We thank T. Jaakkola and G. Stormo for supporting this work and reading the manuscript critically.