Conceived and designed the experiments: CJV CHY NHL JMS. Performed the experiments: CJV CH TL BF NHL JMS. Analyzed the data: CJV CHY NHL JMS. Contributed reagents/materials/analysis tools: CJV CH TL BF NHL JMS. Wrote the paper: CJV CHY NHL JMS.
The authors have declared that no competing interests exist.
Complex phenotypes such as the transformation of a normal population of cells into cancerous tissue result from a series of molecular triggers gone awry. We describe a method that searches for a genetic network consistent with expression changes observed under the knock-down of a set of genes that share a common role in the cell, such as a disease phenotype. The method extends the
Biological processes are the result of the actions and interactions of many genes and the proteins that they encode. Our knowledge of interactions for many biological processes is limited, especially for cancer where genomic alterations may create entirely novel pathways not present in normal tissue. Perturbing gene expression (for example, by deleting a gene) has long been used as a tool in molecular biology to elucidate interactions but is very expensive and labor intensive. The search for new genes that may participate can be a daunting “fishing expedition.” We have devised a tool that automatically infers interactions using high-throughput gene expression data. When a gene is silenced, it causes other genes to be switched on or off, providing clues about the pathway(s) in which the gene acts. Our method uses the genomewide on/off states as a fingerprint to detect interactions among a set of silenced genes. We were able to elucidate a network of interactions for several genes implicated in metastatic colon cancer. Genes newly connected to the network were found to operate in cancer cell invasion in human cells, validating the approach. Thus, the method enables an efficient discovery of the networks that underlie biological processes such as carcinogenesis.
Carcinogenesis involves a host of cell-cell communication breakdowns that include the loss of contact inhibition, an increased potential to proliferate, and the ability to invade and spread into foreign tissue
Our goal is to identify the genetic mechanisms underlying a phenotype, such as cancer cell deregulation. We take a network-based approach to the problem, starting with a set of
Previous approaches for pathway expansion have used methods based on expression correlations to a phenotype of interest. These methods search for genes with expression profiles that are highly correlated with a particular phenotype or disease state and have led to promising results
More recently, several approaches have demonstrated learning a structured model from perturbation experiments
The current NEM approach uses binary set membership relations to identify a network and thus the exact nature of interaction between S-genes (e.g. activation or inhibition) is not determined. However, an appreciable extent of inhibition occurs in real genetic networks. To estimate the amount of inhibition present in living cells, we estimated the proportion of genes up-regulated in deletion mutants relative to wild-type from a yeast knock-out compendium
To address this limitation, we developed a generalization of the NEM approach using a probabilistic graphical model called a factor graph that allows a broader set of S-gene interactions to be recovered from the secondary effects of E-gene expression. This paper offers three methodological contributions. First, we present a factor graph formulation called FG-NEM that allows for an efficient search over all possible NEM structures for a high-scoring model. Second, we show how FG-NEMs extend the NEM approach for expanding the network beyond the current set of S-genes. Third, we show that FG-NEMs can model a more general class of S-gene interactions than NEMs, which increases the accuracy of network identification over an approach that considers a more restricted set of interactions.
We demonstrate the usefulness of FG-NEMs on both simulated and biologically relevant signaling networks that contain both inhibition and activation. We apply FG-NEMs to identify novel genes not previously implicated in colon cancer cell invasiveness. Finally, we experimentally test FG-NEM predictions and report that knock-downs of the top-scoring genes lead to a loss-of-invasion phenotype, validating the approach. Source code is available as an R library from our website:
We first describe the Nested Effects Model, derive a
Our goal is to automatically identify genetic interactions among a set of
In addition to identifying
(A) Hypothetical example with four S-genes,
The E-gene expression changes are available in a data matrix
Nested effects models include two sets of parameters. The parameter set
To allow for both stimulatory and inhibitory interactions in our formulation,
Plotting the response of E-genes under Δ
Note that two genes are equivalent if their knock-downs lead to significantly similar expression changes, which may predict, for example, that they form a complex.
Our goal is to find a structure among the S-genes that provides a compact description of
As in previous NEM formulations, we assume that each E-gene is attached to a single S-gene and that each E-gene observation vector across the knock-downs is independent of other E-gene observations. The maximization function can then be written:
Previous approaches decompose
Substituting
The prior over interactions,
In order to preserve the transitivity of identified interaction modes, the prior is decomposed over interaction configurations into transitivity constraints on all triples of S-genes; i.e.:
While network structures are constrained to reflect more intuitive models, the decomposition introduces interdependencies among the interactions, adding complexity to the search for high-scoring networks. Importantly, max-sum message passing in a factor graph
The formulation above provides a definition of the objective function to be maximized but says nothing about how to search for a good network. The search space of networks is very large, making exhaustive search
Here, we introduce the use of a graphical model called a factor graph to represent all possible NEM structures simultaneously. The parameters that determine the S-gene interactions,
The factor graph consists of three classes of variables (circles) and three classes of factors (squares).
The
Here, the message passing schedule performs inference in two steps. In the first step, messages from observation nodes
Once a signaling network is identified using the message passing inference procedure above, the network can be used to search for new genes that may be part of the pathway. The NEM and FG-NEM frameworks predict new members that act in the pathway by “attaching” E-genes to S-genes in the network, or leaving them detached if their expression data do not fit the model. Attaching E-gene,
To gain a global picture for where
We calculate a log-likelihood ratio that measures the degree to which
To validate the involvement of predicted invasiveness frontier genes, HT29 colon cancer cells were resuspended in DMEM medium containing 0.1% FBS and seeded into the top wells (2×10^5 per well) containing individual Matrigel inserts (BD Biosciences, San Jose, CA) according to manufacturer's protocol. The lower wells were filled with 800 µl medium with 10% fetal bovine serum as chemoattractant. Six to ten hours following seeding, the cells in the upper wells were transfected with the appropriate shRNA-expressing pSuper constructs
We evaluated the ability of FG-NEMs to recover artificial networks from simulated data. Data were generated by propagating signals in networks containing simulated knock-downs and then sampling expression data from activated, inhibited, or unaffected expression change distributions (see
To make the comparison of FG-NEM to uFG-NEM fair, we measured network recovery in two ways. 1) We calculated a measure of
We tested the ability of FG-NEMs and uFG-NEMs to recover the structure of networks simulated with varying fractions of inhibition, 0≤
(A) Influence of inhibition on network recovery. AUC (
We repeated the experiment of varying inhibition to match our expectations for application to the cancer invasion network discussed subsequently. In the invasion network the known S-genes were recovered in such a way that only activating S-gene connections were identified. To simulate this situation, we created networks containing only activating S-gene interactions but varied the proportion of inhibiting E-gene attachments. Even in this situation where all of the known S-genes have activating interactions, FG-NEM's performance begins to significantly surpass uFG-NEM's performance when 40–60% of the E-genes are connected with inhibitory attachments (see
Because our goal is to elucidate the network of genes involved in the colon cancer invasiveness pathway, we measured the ability of our method to expand the network to new genes involved in the pathway compared to a correlation-based method we refer to as Template Matching (TM) used by Irby et al. (2005)
We hypothesized that an estimate of genetic pathway structure based on modeling observed expression changes could facilitate the identification of new pathway members. To test this, we evaluated the ability of FG-NEMs, uFG-NEMs, and TM to identify genes involved in a diverse set of pathways in
In each deletion strain, gene expression changes with a
The factor graph approach allows prior information to be incorporated. We tested a supervised variant of FG-NEMs (sFG-NEM) in which additional factors were incorporated to reward models that included known interactions. Three classes of physical data were downloaded for use as interaction priors: protein-DNA interactions, phosphorylation target data, and protein-protein interactions (PPI). Protein-DNA interactions with a p-value less than 0.001 were selected from the study of Lee et al. (2002)
The accuracy of FG-NEMs for expanding each pathway to include new genes was measured. The likelihood of attachment ratio (LAR) score for each gene in the genome was calculated and the area under the precision-recall curve (AUC) was computed (see
(A) Precision/recall comparison. Each method's ability to expand a pathway was compared. Thick lines indicate mean precision and shaded regions represent standard error of the mean calculated over the networks with the five highest AUCs from any of the tested methods. (B) Network expansion comparison. Networks were predicted for a non-redundant set of GO categories containing four or more S-genes in the Hughes et al. (2000) compendium and used to predict held-out genes from the same category (see
Except for ribosome biogenesis, FG-NEMs performed comparably or better than uFG-NEMs and TM (
Incorporating physical interaction priors showed little effect on network expansion performance. For most of the pathways, the performance of sFG-NEMs was indistinguishable from its unsupervised counterpart. A slight improvement was seen for the nitrogen metabolism pathway. Incorporation of structural priors adds activation from GLN3 to YEA4, and from ARG80 to ARG5,6, and slightly boosts the predictive power of the network. Thus, FG-NEM can usually identify new pathway genes in the unsupervised setting as well as when known interactions are provided.
Interestingly, the largest change in performance resulting from the use of prior information was a small drop observed for predicting genes involved in the sexual reproduction pathway. We investigated this decrease and found that using protein-DNA priors forced the placement of a transcription factor STE12 to the top of the pathway, whereas placement toward the bottom seemed to better fit the expression changes. Consequently, FG-NEM ranks the sexual reproduction E-genes higher than sFG-NEM.
On average, physical interaction priors increase the compatibility of FG-NEM predictions with high-throughput physical data. A leave-one-out analysis was used to test the ability of physical interaction data to improve pair-wise interaction predictions. To compare improvement in network structure prediction, we calculated the
Of the 163 physical interactions, 104 (63%) have higher while 43 (26%) have lower MOC in sFG-NEM than FG-NEM. Of these 43, 33 have positive MOCs for both approaches (i.e. both agree with the physical evidence). Notably, of the 93 that achieved higher compatibilities in sFG-NEM, 38 (23%) became compatible only when the physical evidence was included. One example is the interaction between CDC42 and FAR1 in the sexual reproduction pathway. FAR1 acts downstream of CDC42 in the pheromone response signal cascade. The FAR1 gene deletion shows little expression change and is not placed downstream of CDC42 even though CDC42 is placed at the top of the signaling cascade by FG-NEM. With the inclusion of other structural priors, FAR1 is correctly placed downstream of CDC42. Thus, incorporating known interactions, even from possibly noisy high-throughput sources, can increase the likelihood of finding other interactions. However, the caveat is that such information may force a poorer fit to the observed expression data which could decrease the accuracy of frontier expansion.
FG-NEMs achieved significant improvement over the unsigned variant on the ion homeostasis pathway. To gain insights into the structural predictions underlying the difference in performance of the methods, we compared the predicted S-gene networks of the FG-NEM and uFG-NEM methods for this pathway (
Both the FG-NEM and uFG-NEM correctly predicted the equivalence of CKA2 and CKB2 which together form a complex. Of the top fifteen frontier genes predicted by FG-NEM, eight are annotated by GO as involved in ion homeostasis (
We applied the FG-NEM approach to a human colon cancer invasiveness network elucidated by Irby et al. (2005)
We applied FG-NEMs to the five S-genes from the second tier of Irby et al. (2005). These five human genes are cytokeratin 20 (KRT20), transcription factor Dp-1 (TFDP1), DEAH (Asp-Glu-Ala-His) box polypeptide 32 (DHX32), ribosomal protein L32 (RPL32), and glutaminase (GLS). Knock-down of each second-tier S-gene has been demonstrated to significantly reduce the invasion phenotype of HT29 colon cancer cells (Irby et al., 2005). KRT20 has historically served as a diagnostic marker for colorectal carcinoma
We applied FG-NEMs to recover a network for the second-tier genes. We included E-genes that demonstrate a robust and significant effect under at least two of the knock-downs included in the Irby et al. (2005) study. We selected genes whose log2 ratios differed by less than 0.5 in replicate arrays and that had an absolute log2 expression change at least equal to the mean absolute level of the activated distribution (1.75) in at least two arrays. Using these criteria, we identified 185 E-genes to use for model inference.
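The two selection criteria above can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the function name `select_e_genes`, the array layout, and the `replicate_pairs` argument are hypothetical, and the toy matrix is made up. Only the thresholds 0.5 and 1.75 come from the text.

```python
import numpy as np

ACTIVATED_MEAN = 1.75     # mean absolute log2 level of the activated distribution (from the text)
MAX_REPLICATE_GAP = 0.5   # replicate log2 ratios must differ by less than this (from the text)

def select_e_genes(log2_ratios, replicate_pairs, min_arrays=2):
    """Return a boolean mask over genes (rows) that pass both filters.

    log2_ratios: array of shape (n_genes, n_arrays); the layout is an assumption.
    replicate_pairs: list of (i, j) column indices that are replicate arrays (hypothetical).
    """
    x = np.asarray(log2_ratios, dtype=float)
    # Filter 1: replicate arrays must agree to within MAX_REPLICATE_GAP.
    consistent = np.ones(x.shape[0], dtype=bool)
    for i, j in replicate_pairs:
        consistent &= np.abs(x[:, i] - x[:, j]) < MAX_REPLICATE_GAP
    # Filter 2: a large absolute change in at least `min_arrays` arrays.
    strong = (np.abs(x) >= ACTIVATED_MEAN).sum(axis=1) >= min_arrays
    return consistent & strong

# Toy data: columns 0 and 1 are treated as replicates of one knock-down.
data = np.array([
    [-2.0, -1.9, 0.1, 2.2],   # replicates agree, strong in three arrays
    [-2.0, -0.5, 0.0, 2.0],   # replicates disagree
    [-1.8, -1.9, 0.2, 0.3],   # replicates agree, strong in two arrays
    [ 0.2,  0.1, 1.9, 0.0],   # strong in only one array
])
mask = select_e_genes(data, replicate_pairs=[(0, 1)])
print(mask)
```

In this sketch, replicate arrays are counted separately toward the "at least two arrays" requirement; whether the original analysis counted them that way is not stated in the text.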
(A) Expression changes of selected E-genes following targeted S-gene knock-downs in HT29 colon cancer cells. Gene expression was measured in HT29 cells treated with a shRNA specifically targeting an S-gene (column of the matrix) relative to cells treated with a scrambled control shRNA (Irby et al., 2005). Colors indicate putatively inhibited E-genes (rows of the matrix) with up-regulated levels relative to control (red), activated E-genes with down-regulated levels relative to control (green), and unaffected E-genes with expression levels not significantly different from control (black). Biological replicates were available for KRT20, TFDP1, and GLS knock-downs. Genes were sorted by their attachment point and then by their LAR scores. (B) Cancer invasion network predicted by FG-NEM. For each pair of S-genes, the most likely interaction mode is shown. The same conventions used for illustrating interactions predicted for the yeast networks were used here. Some interactions were found to be significant at the 0.05 level (*) or 0.01 level (**) using a permutation test (see
FG-NEM recovered the network shown in
The FG-NEM model predicts that TFDP1 is at the bottom of the signaling cascade, which may reflect its role as part of the E2F transcriptional complex in targeting the expression of downstream genes that promote cell proliferation and invasion
Because the number of S-genes in the second tier is small, we compared the heuristic pair-wise search employed by FG-NEM to an exhaustive model search. If the heuristic approach is reasonable, it should identify network models that are among the highest scoring models identified by brute-force enumeration. To perform a brute-force search, we generated 1000 random networks among the five second-tier genes. For each network, we calculated the data likelihood using message passing. Out of the 1000 randomly enumerated networks, the recovered network for the second-tier genes had a likelihood higher than 997 of the random networks. Interestingly, all three of the random networks with higher scores had identical structures to the network recovered by FG-NEM except that all three networks differed in their attachment of DHX32 and GLS. This result demonstrates that the pair-wise heuristic search employed by FG-NEM successfully identifies high-scoring networks in the space of all networks. While we need to test the trend for increasing network sizes, these results are promising for scaling up to larger networks in which exhaustive search will not be feasible.
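The comparison above amounts to ranking the recovered network's likelihood against the likelihoods of the randomly generated networks. The sketch below shows only that ranking step under stated assumptions: the scores are toy numbers, the function name `rank_against_random` is hypothetical, and the message-passing likelihood computation itself is not reproduced here.

```python
def rank_against_random(recovered_score, random_scores):
    """Count the random-network scores that the recovered score strictly beats.

    Returns (number beaten, number not beaten).
    """
    beaten = sum(1 for s in random_scores if recovered_score > s)
    return beaten, len(random_scores) - beaten

# Toy log-likelihoods for 1000 random networks (made-up values, not real results).
random_scores = [-1000.0 + i for i in range(1000)]   # -1000.0, -999.0, ..., -1.0
recovered_score = -3.5                               # hypothetical score of the recovered network

beaten, not_beaten = rank_against_random(recovered_score, random_scores)
print(f"higher than {beaten} of {len(random_scores)} random networks")
```

With real data, `random_scores` would hold the data likelihood of each randomly generated network and `recovered_score` the likelihood of the network found by the pair-wise heuristic.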
We used the highest-scoring model recovered by the FG-NEM to search for additional genes involved in colon cancer invasiveness by sorting each gene by its LAR score (see
LAR | E-Gene | S-Gene | E-Gene Description |
18.79 | CHORDC1 | GLS | Cysteine and histidine-rich domain-containing 1 |
11.35 | RNF32 | GLS | Ring finger protein 32 |
10.93 | TSP50 | TFDP1 | Testes-specific protease 50 |
10.02 | HS3ST1 | KRT20 | Heparan sulfate (glucosamine) 3-O-Sulfotransferase 1 |
6.85 | CHMP4C | TFDP1 | Chromatin modifying protein 4C |
6.76 | ADAM19 | KRT20 | ADAM metallopeptidase domain 19 (meltrin beta) |
6.34 | CYP3A43 | KRT20 | Cytochrome P450, family 3, subfamily A, Polypeptide 43 |
5.97 | SPTLC3 | TFDP1 | Serine palmitoyltransferase, long chain base subunit 3 |
5.25 | PLEKHM3 | KRT20 | Pleckstrin domain containing |
4.92 | KRT13 | TFDP1 | Keratin 13 |
4.28 | CAPN12 | KRT20 | Calpain 12 |
3.87 | C1orf34 | KRT20 | Hypothetical |
3.54 | ZNF350 | KRT20 | Zinc finger protein 350 |
3.53 | ADAM9 | TFDP1 | ADAM metallopeptidase domain 9 |
2.75 | SLC2A1 | KRT20 | Solute carrier family 2 |
2.38 | TCTEX1D1 | TFDP1 | Tctex1 domain containing 1 |
2.23 | STK24 | KRT20 | Serine/threonine kinase 24 |
2.05 | DDX58 | KRT20 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 58 |
2.01 | GFAP | KRT20 | Glial fibrillary acidic protein |
Natural logarithm of likelihood of attachment score (see
EST is inside an intron of this gene.
EST is on the 3′ end of this gene.
EST is on the 5′ end of this gene.
Many of the genes in
The E-genes with positive LAR scores constitute the network “frontier” of the cancer invasiveness pathway in that they are predicted to directly interact with the second-tier genes. From among the 38 genes with positive and significant LAR scores, two were arbitrarily selected to test for a loss-of-invasiveness phenotype in HT29 cells as defined by invasion in Matrigel. We selected CAPN12 and expressed sequence tag AA099748 from
The factor graph nested effects model (FG-NEM) provides a general methodology for inferring networks from knock-down phenotypes. Our results extend the nested effects models in three significant ways: 1) we provide a means for efficiently searching for large S-gene networks using inference on a factor graph that can also incorporate prior information; 2) our method distinguishes activating from inhibiting interactions; and 3) we show that NEM attachment can be used successfully to expand the network to new pathway members. Our results on simulated and yeast networks suggest that explicitly modeling inhibition and activation, rather than treating them as generic interactions or effects, leads to higher accuracies for recovering known interaction networks and identifying members of a pathway.
Applying FG-NEM predictions to a series of follow-up experiments in an HT29 colon cancer cell line model has identified new gene members of the tumor invasiveness pathway. Specifically, shRNA-mediated knock-down of two genes predicted to be connected to the original rudimentary network of Irby et al.
We envision applying the FG-NEM approach within an iterative computational-experimental framework. As a network is expanded, the frontier genes of one round of investigation can be included as S-genes in subsequent rounds. Iteration will therefore provide larger sets of S-genes on which to infer networks. While the primary data used for such network expansion is based on gene expression data, it will be intriguing to investigate whether a variety of transcriptional and non-transcriptional interactions can be recovered with this approach. There are many examples of coupling between transcription and non-transcriptional interactions in biological systems. An E-gene
Several aspects of the method could be improved upon in the future. The method could be extended to use over-expression of S-genes in addition to knock-downs. Over-expression of an S-gene would be expected to have an opposite effect on downstream E-genes compared to the E-gene effects observed under the S-gene's knock-down. Thus, the E-gene responses could be compared to an expanded list of interaction modes, derived by flipping the scatter-plots in
In this study of the colon cancer invasiveness pathway, S-gene interaction configurations were forced to reflect transitive connections but did not incorporate any external biological information. Additional knowledge, such as gene coexpression groups, or protein-protein interaction potentials, could be incorporated into the prior for making inferences about the cancer invasiveness pathway. For example, several gene expression experiments on invasive colon cancer cell lines are available in GEO
We modeled transitivity using deterministic factors. While this provides an intuitive interpretation of such constraints and increases the speed of convergence of message passing, relaxing these constraints to general belief potentials could allow a broader exploration of the search space. Imposing transitivity in the current framework disallows cycles of inhibitory links. However, it is possible to extend our method to incorporate such cycles, in which new interaction modes are introduced. For example, the cycle A→B⊣C→A would imply B⊣A, which could be modeled using a new type of interaction mode capturing A's activation on B and B's inhibition on A.
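The cycle example above can be checked with a small sketch. It assumes that the net effect along a chain of interactions is the product of the signs of its links (activation as +1, inhibition as -1). This is an illustrative simplification, not the deterministic factors used in the model, and the names `ACTIVATION`, `INHIBITION`, and `compose` are hypothetical.

```python
ACTIVATION = +1   # written A→B in the text
INHIBITION = -1   # written A⊣B in the text

def compose(path_signs):
    """Net sign of a chain of interactions, assuming signs multiply along the path."""
    net = ACTIVATION
    for sign in path_signs:
        net *= sign
    return net

# The cycle from the text: A→B⊣C→A
edges = {
    ("A", "B"): ACTIVATION,
    ("B", "C"): INHIBITION,
    ("C", "A"): ACTIVATION,
}

# Following B⊣C→A gives the implied effect of B on A.
b_on_a = compose([edges[("B", "C")], edges[("C", "A")]])
# Following A→B⊣C gives the implied effect of A on C.
a_on_c = compose([edges[("A", "B")], edges[("B", "C")]])

print("B on A:", "inhibition" if b_on_a == INHIBITION else "activation")
print("A on C:", "inhibition" if a_on_c == INHIBITION else "activation")
```

Under this assumption, A activates B while the path B⊣C→A implies that B inhibits A, which is the pair of opposing effects that a new interaction mode would need to capture.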
The methods could be extended to incorporate richer information such as degrading signals and higher-order knock-downs (single, double, triple, etc.) as in Carter et al. (2007)
In our network expansion approach, we assumed genes whose expression levels are well-explained by the model are of more interest for subsequent rounds of experimentation, although there are other ways to approach this question from an experimental design perspective. For example, it would be conceivable to test whether selecting genes based on reducing a measure of uncertainty across models leads to better gene selection as in
Finally, the approach could be applied to the unsupervised discovery of regulatory interactions among E-genes rather than S-genes. In recent work, Sahoo et al. (2008)
We applied FG-NEMs to discover a human signaling network among genes involved in colon cancer cell invasiveness. The method formalizes and extends analysis of genetic interactions using high-dimensional quantitative phenotype data in the form of gene expression changes observed under specific perturbations. It makes explicit use of the knock-downs of known members of a pathway to identify how the members interact with one another and to identify new members. The method predicts several genes with new roles in the cancer invasiveness process, two of which were verified to act in the pathway based on an
Colon cancer invasion data for SAM-selected E-genes. Sheet 1. Selected E-genes and their expression. Sheet 2. Input to SAM for determining parameters of the Gaussian mixture in the Expression Factors. Sheet 3. SAM results used for determining the parameters of the Gaussian mixtures in the Expression Factors.
(6.09 MB XLS)
Observed inhibitory effects and signaling in yeast compendiums. Evidence for inhibition from measured responses of knockdown, and from annotation in curated pathways.
(0.09 MB PDF)
Comparison of uFG-NEM and exhaustive NEM model search for structure recovery.
(0.07 MB PDF)
Estimating difference in gene expression between activation and inhibition.
(0.08 MB PDF)
Accuracy of network recovery as a function of S-gene knowledge and number of microarray replicates, and E-gene inhibition.
(0.10 MB PDF)
Yeast Knockout Compendium Pathway AUC. AUC and AUC-ratios for expansion of Yeast pathways.
(0.03 MB XLS)
Ion Homeostasis Frontier Genes. Genes most likely to be attached to the ion homeostasis network for both the FG-NEM and uFG-NEM methods. Genes are sorted by LAR.
(2.75 MB XLS)
Invasiveness E-gene LAR Scores. Connection point, connection strength, and connection significance of E-genes in colon cancer network.
(0.10 MB XLS)
Supplemental methods and results.
(0.12 MB DOC)
We would like to thank M.T. Weirauch for helpful discussions regarding early drafts of the manuscript.