What Can Causal Networks Tell Us about Metabolic Pathways?

Rachael Hageman Blair; Daniel J. Kliebenstein; Gary A. Churchill

doi:10.1371/journal.pcbi.1002458

Abstract

Graphical models describe the linear correlation structure of data and have been used to establish causal relationships among phenotypes in genetic mapping populations. Data are typically collected at a single point in time. Biological processes, on the other hand, are often non-linear and display time-varying dynamics. The extent to which graphical models can recapitulate the architecture of an underlying biological process is not well understood. We consider metabolic networks with known stoichiometry to address the fundamental question: “What can causal networks tell us about metabolic pathways?” Using data from an Arabidopsis BaySha population and simulated data from dynamic models of pathway motifs, we assess our ability to reconstruct metabolic pathways using graphical models. Our results highlight the necessity of non-genetic residual biological variation for reliable inference. Recovery of the ordering within a pathway is possible, but should not be expected. Causal inference is sensitive to subtle patterns in the correlation structure that may be driven by a variety of factors, which may not emphasize the substrate-product relationship. We illustrate the effects of metabolic pathway architecture, epistasis and stochastic variation on correlation structure and graphical model-derived networks. We conclude that graphical models should be interpreted cautiously, especially if the implied causal relationships are to be used in the design of intervention strategies.

Author Summary

High-throughput profiling data are pervasive in modern genetic studies. The large-scale nature of the data can make interpretation challenging. Methods that estimate networks or graphs have become popular tools for proposing causal relationships among traits. However, it is not obvious that these methods are able to capture causal biological mechanisms. Here we address the power and limitations of causal inference methods in biological systems. We examine metabolic data from simulation and from a well-characterized metabolic pathway in plants. We show that variation has to propagate through the pathway for reliable network inference. While it is possible for causal inference methods to recover the ordering of the biological pathway, it should not be expected. Causal relationships create subtle patterns in correlation, which may be dominated by other biological factors that do not reflect the ordering of the underlying pathway. Our results shape expectations about these methods and explain some of the successes and failures of causal graphical models for network inference.

Citation: Blair RH, Kliebenstein DJ, Churchill GA (2012) What Can Causal Networks Tell Us about Metabolic Pathways? PLoS Comput Biol 8(4): e1002458. https://doi.org/10.1371/journal.pcbi.1002458

Editor: Andrew G. Clark, Cornell University, United States of America

Received: August 18, 2011; Accepted: February 20, 2012; Published: April 5, 2012

Copyright: © 2012 Blair et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by the National Institute of General Medical Sciences (grant GM076468 to GAC); the National Heart, Lung, and Blood Institute (NRSA fellowship 1F32 HL095240 to RHB); and the National Science Foundation (awards DBI 0820580 and DBI 064281 to DJK). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Understanding the nature of cause and effect is fundamental to all fields of scientific investigation, but the concept of causality can present special difficulties in biology [1]. Experiments that utilize controlled interventions represent the most widely used approach to establishing causality. However, in his seminal work on experimental design, RA Fisher proposed that causation can be inferred from multi-factorial experiments performed with randomization [2]. An extension of this principle provides the foundation for computational approaches to network reconstruction in experimental genetic crosses, such as the recombinant inbred strain panel used in this study. Natural allelic variation is randomized during meiosis to generate a multi-factorial perturbation affecting multiple phenotypic outcomes. This meiotic randomization allows for the inference of quantitative trait loci (QTL) that are causal to phenotype [3].

Recent advances in high-throughput phenotyping technologies have made large-scale measurements of molecular traits possible. Expression QTL (eQTL), metabolic QTL (mQTL) and protein QTL (pQTL) can be used to link thousands of molecular phenotypes to genetic loci, as well as to clinical phenotypes [4]. A typical xQTL study involves cross-sectional sampling of a genetically variable population at a single time point. It is not immediately obvious that such data could provide insight into causal biological mechanisms, which derive from non-linear dynamic processes of gene expression and metabolism. However, a rich body of literature supports the idea that correlation structure in static data can provide insights into causal relationships among the measured variables [5], [6].

The interpretation of a directed edge from node A to node B in a graphical model is that intervention on A will alter B, but intervention on B will not alter A. In a metabolic reaction, intervention on the substrate concentration will alter the product concentration. Reaction stoichiometry is often well understood [7]. Substrate molecules are converted by known enzymes into products, which in turn act as substrates for subsequent reactions. Reactions are organized into pathways, which may converge, branch or intersect to form elaborate networks. More complex pathways involving feedback through allosteric interactions between enzymes and metabolites may also be present. It is not clear to what extent graphical models inferred from mQTL data capture these types of interactions.

Several algorithms have been proposed for the inference of causal relationships among phenotypes using genetic data [8]–[14]. These methods employ linear statistical models to infer the relationships between QTL and phenotypes, as well as relationships among phenotypes [15]. Causal edge detection is sensitive to subtle correlation patterns in the data. Inferences have been shown to be subject to a large proportion of false positive edges and can be skewed by environmental and experimental design factors that are not accounted for in the model [16], [17]. Agreement between the graphical model and the true underlying biology is a central goal of systems biology. The topology of networks inferred from xQTL data is often interpreted as a reflection of the underlying biological process, which may be metabolic or regulatory in nature, may be nonlinear, and may involve the dynamic interaction of molecules within cells and tissues. However, the extent to which graphical models derived from static data capture these processes is not well understood, which makes the interpretation of edges challenging.

Deterministic models of cellular metabolism can be defined by ordinary differential equations (ODEs) derived from simple laws of mass-balance [18]–[21]. The reaction rates are modeled as non-linear processes, e.g., Michaelis-Menten kinetics and Hill functions, which depend on kinetic rate parameters [22]. Models of this type are powerful because of their ability to make in silico predictions of the response of a system to perturbations. We present a simulation study in which we generate synthetic mQTL data from dynamical models of pathway motifs with two sources of perturbation. We vary the rate parameters in a manner that mimics a genetic cross, and we drive the simulation models with an input function that includes stochastic noise.
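To make the idea of synthetic mQTL data concrete, the following is a minimal sketch, not the simulation code used in this study. It assumes a two-step linear chain with Michaelis-Menten rates, forward-Euler integration, and an input flux that is redrawn at fixed intervals. The function names, the allele multipliers (0.75 and 1.25), and all kinetic constants are hypothetical choices made for illustration.

```python
import random

def mm_flux(vmax, km, s):
    """Michaelis-Menten rate for substrate concentration s."""
    return vmax * s / (km + s)

def simulate_individual(g1, g2, rng, steps=2000, dt=0.01, hold=100,
                        u0=1.0, sigma=0.2, vmax=4.0, km=1.0):
    """Forward-Euler integration of input -> M1 -> M2 for one individual.

    g1 and g2 are hypothetical genetic multipliers that scale the Vmax of
    the enzymes consuming M1 and M2. The input flux is redrawn every
    `hold` steps to mimic a stochastic input (an assumption of this sketch).
    """
    m1, m2 = 0.5, 0.5
    u = u0
    for step in range(steps):
        if step % hold == 0:
            u = max(0.0, u0 * (1.0 + sigma * rng.gauss(0.0, 1.0)))
        v1 = mm_flux(g1 * vmax, km, m1)  # flux M1 -> M2
        v2 = mm_flux(g2 * vmax, km, m2)  # flux draining M2
        m1 += dt * (u - v1)
        m2 += dt * (v1 - v2)
    return m1, m2

def simulate_population(n, seed=0, alleles=(0.75, 1.25)):
    """Draw a random allele per enzyme for each individual, as in a cross."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g1, g2 = rng.choice(alleles), rng.choice(alleles)
        m1, m2 = simulate_individual(g1, g2, rng)
        rows.append((g1, g2, m1, m2))
    return rows

# One row per individual: (g1, g2, M1, M2) sampled at a single time point.
rows = simulate_population(20)
```

Each row plays the role of one line in a mapping population: genotype multipliers plus metabolite levels sampled at a single time point. In this sketch the two sources of perturbation are the allele multipliers and the redrawn input flux.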

Glucosinolates are secondary metabolites that influence the interaction of plant and pest and have a wide range of important functions in human health [23]–[25]. The economic importance of glucosinolates has led to significant progress in understanding the biochemical pathways and genetics [26], [27]. Glucosinolate biosynthesis occurs in three well-understood stages (Figure 1), in which amino acids undergo (1) chain elongation, (2) formation of the glucone moiety, and (3) side-chain modification. In this work, we examine mQTL data from a class of aliphatic glucosinolates in a highly replicated Arabidopsis BaySha recombinant inbred population [28]. The metabolites under investigation participate in side-chain reactions. Genetic analysis reveals shared QTL and widespread epistasis in the pathway [29].

Figure 1. Biosynthesis of aliphatic glucosinolates.

The aliphatic glucosinolate biosynthetic pathway occurs in three stages: (1) side-chain elongation, (2) formation of the glucone moiety and (3) side-chain modification. The metabolites that are measured in the BaySha RIL population are indicated together with the facilitating enzymes.

https://doi.org/10.1371/journal.pcbi.1002458.g001

To address these questions, we have inferred causal networks from mQTL data using simulated metabolic models of common pathway motifs and real data from a well-characterized metabolic network. We demonstrate that correlation structure can be shaped by a variety of factors, including genetic variation, pathway architecture, position in the pathway and feedback. Our results highlight the necessity of biological variation outside of the variation contributed by genetic factors for reliable network inference. Substrate-product relationships are not always reflected in the correlation structure of the system, and recovery of the biochemical ordering of species should not be expected. Substrate inhibition, which is pervasive in metabolic pathways, can diminish or mask these relationships and lead to missing edges in network inference. An accurate genetic model is also critical to the inference process, especially when epistasis is involved. Our findings should temper expectations and provide new insights into the interpretation of causal genotype-phenotype networks.

Results

Pathway motifs were constructed using ODEs (Figure 2). Flux rates were described with Michaelis-Menten kinetics. Simulations were performed under genetic perturbations with stochastic input (Figure S1). The aliphatic glucosinolate biosynthetic pathway from an Arabidopsis BaySha population was also investigated (Figure 1). For each pathway, we carried out a three-step analysis: (1) We performed QTL mapping for the metabolites in the pathway to identify the relevant genetic factors. (2) We calculated metabolite correlations with and without conditioning on genetic factors. Correlation after conditioning represents the association between metabolites that is driven by sources outside of the genetic factors, e.g., propagation of random input fluctuations through the pathway. Correlation that disappears after conditioning implies an independent relationship between metabolites. We interpret the presence of correlation after conditioning as being indicative of either a causal or a reactive relationship. (3) We generated multiple causal networks from their posterior distribution, using a previously described MCMC algorithm [14], and summarized results across the ten top-scoring networks.
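Step (2) can be illustrated with a short sketch. It assumes a single genotype variable and removes its effect by simple least-squares regression before correlating the residuals; the analysis in this study conditions on the mapped QTL, which may be several. The function names and the toy data are hypothetical.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residualize(y, q):
    """Residuals of y after least-squares regression on one genotype variable q."""
    mq, my = mean(q), mean(y)
    slope = (sum((a - mq) * (b - my) for a, b in zip(q, y))
             / sum((a - mq) ** 2 for a in q))
    return [b - my - slope * (a - mq) for a, b in zip(q, y)]

def conditional_corr(x, y, q):
    """Correlation between x and y after conditioning both on genotype q."""
    return pearson(residualize(x, q), residualize(y, q))

# Toy data (hypothetical): a and b share only the genotype effect,
# whereas c also inherits the non-genetic fluctuation in a.
rng = random.Random(0)
q = [rng.choice((0.0, 1.0)) for _ in range(500)]
a = [g + rng.gauss(0.0, 0.5) for g in q]
b = [g + rng.gauss(0.0, 0.5) for g in q]
c = [0.8 * v + rng.gauss(0.0, 0.3) for v in a]

r_ab_raw = pearson(a, b)               # driven by the shared genotype
r_ab_cond = conditional_corr(a, b, q)  # should shrink toward zero
r_ac_cond = conditional_corr(a, c, q)  # should persist after conditioning
```

In this toy setting the a-b correlation is produced entirely by the shared genotype and shrinks toward zero after conditioning, while the a-c correlation persists because non-genetic variation in a propagates to c.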

Figure 2. Simulated pathway motifs.

(A) Linear, (B) merging pathway via metabolic reaction, (C) merging pathway via independent paths, (D) branching pathway, (E) branching pathway with inhibition, (F) branching pathway with epistasis. Each motif is drawn with a constant pool of metabolite that is taken up at a constant flux rate subject to a stochastic perturbation. The remaining symbols denote the flux rates, the genetic perturbations, and an upstream signal that affects the pathway.

https://doi.org/10.1371/journal.pcbi.1002458.g002

Simulated Pathway Motifs

QTL detection.

Correlation between a genotype variable and a metabolite is considered evidence for a QTL, with the sign and magnitude indicating the direction and size of the effect (Figure 3). A similar QTL pattern is observed across pathways that contain linear chains of reactions. Specifically, the QTL for a substrate metabolite in a linear chain is the genetic factor facilitating the downstream flux (e.g., Figure 3A). In the merging pathway via metabolic reaction, there are no QTL for the bi-substrate reaction that occurs at the merge point (Figure 3B). However, when the merging pathway is formed through two independent paths, the QTL mimic the linear pathway pattern (Figure 3C). The QTL effect pattern in the branching pathway illustrates the activation of the lower and upper branches (Figure 3D). When the flux through the upper branch is dominant, production of the upper-branch product demands the branch-point substrate, which is then less available for production of the lower-branch product. This scenario is reflected in the QTL effects, which are positive for the upper-branch product and negative for the branch-point substrate and the lower-branch product. An analogous story plays out for the lower branch and is seen in the corresponding QTL relationships. Substrate inhibition in the branching pathway results in the loss of the QTL at the genetic factor that facilitates the inhibited flux (Figure 3E). In the branching pathway with epistasis, the branch-point metabolite and both metabolites that reside on the branches share a QTL (Figure 3F). The direction of the effect is a reflection of the metabolite position in the pathway. Epistasis has the strongest effects on the two metabolites that are immediately downstream of the interacting signal and enzyme, respectively.
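The competition between branches can be checked with a small worked example. It is a sketch under stated assumptions rather than the model used for Figure 3: a constant influx feeds the branch-point substrate, both branch enzymes follow Michaelis-Menten kinetics with the same Km, and each product is cleared by a Michaelis-Menten drain. The function name and all parameter values are hypothetical. Under these assumptions the steady state has a closed form, so no integration is needed.

```python
def branch_steady_state(g3, g4, influx=1.0, vmax=2.0, km=1.0, vdrain=3.0):
    """Steady state of a branch point: substrate S -> P3 (upper), S -> P4 (lower).

    g3 and g4 are hypothetical genetic multipliers on the two branch enzymes.
    Mass balance at S: influx = (g3 + g4) * vmax * S / (km + S).
    """
    total = (g3 + g4) * vmax
    if total <= influx:
        raise ValueError("branch capacity must exceed influx for a steady state")
    s = km * influx / (total - influx)
    # With a shared Km, the influx splits in proportion to g3 : g4.
    j3 = influx * g3 / (g3 + g4)
    j4 = influx * g4 / (g3 + g4)
    # Product levels follow from vdrain * P / (km + P) = branch flux.
    p3 = km * j3 / (vdrain - j3)
    p4 = km * j4 / (vdrain - j4)
    return s, p3, p4

# Raising the upper-branch multiplier while holding the lower one fixed.
low_upper = branch_steady_state(g3=0.75, g4=1.0)
high_upper = branch_steady_state(g3=1.25, g4=1.0)
```

Comparing the two tuples shows the pattern described above for this sketch: a larger upper-branch multiplier raises the upper product and lowers both the branch-point substrate and the lower product.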

Figure 3. Simulation results.

Left: The correlation between metabolites and genetic multipliers. Correlation indicates evidence of a QTL; the sign and magnitude indicate the direction and size of the effect, respectively. Center: Metabolite correlation after conditioning on QTL. Right: The inferred causal graphical model estimated from the top ten graphs from MCMC. Edge weights indicate regression coefficients.

https://doi.org/10.1371/journal.pcbi.1002458.g003

Metabolite correlations.

In most cases, the correlation between metabolites after conditioning on genotype variables was enhanced (Figure 3). Substrates in the linear pathway are uniformly correlated both before and after conditioning on QTL (Figure 3A). In the merging pathway via metabolic reaction, a high correlation between and suggests that they must be coordinated to form a product (Figure 3B). In the merging pathway via independent paths and are uncorrelated, and are highly correlated to each other, and to a lesser degree with and (Figure 3C). In the branching pathway and are highly correlated and relationships involving and become more pronounced after conditioning (Figure 3D). Substrate inhibition is observed in the negative correlation of with the other metabolites in the pathway (Figure 3E). The correlation in this pathway was the most sensitive to conditioning on QTL. After conditioning there was almost a total loss of correlation between and metabolites on the upper branch, and (Figure 3E). In the branching pathway with epistasis, and are negatively correlated reflecting the accumulation of when there is an allelic combination that results in the loss of function of (Figure 3F). The strongest correlation is between and .

Network reconstructions.

The linear and merging pathway reconstructions did not mimic the ordering in the metabolic pathway (Figure 3A–C). A causal edge occurred in the linear pathway in the ten best scoring models (Figure 3A), but faded when larger subsets of models were considered (Text S1). In the merging pathway via metabolic reaction a causal edge and an undirected edge between and were identified, with no link between the two pathway segments (Figure 3B). When and form from merging independent pathways, is predicted as a hub metabolite that affects both upstream and downstream neighbors. It is reasonable that , the merging point, controls the influx and efflux of the pathway and dominates the overall correlation structure (Figure 3C). The graphical model for the branching pathway captures the biochemistry exactly but does not include the genetic factors (Figure 3D). When substrate inhibition occurs in the branching pathway, the graphical model identifies the top and bottom branches, but does not link them together (Figure 3E). In the network reconstruction of the branching pathway with epistasis, the lower branch of the pathway is captured exactly and the epistasis term was found to affect and independently (Figure 3F).