Open Access

Research Article

Uncovering a Macrophage Transcriptional Program by Integrating Evidence from Motif Scanning and Expression Dynamics

Stephen A. Ramsey*, Sandy L. Klemm, Daniel E. Zak, Kathleen A. Kennedy, Vesteinn Thorsson, Bin Li, Mark Gilchrist, Elizabeth S. Gold, Carrie D. Johnson, Vladimir Litvak, Garnet Navarro, Jared C. Roach, Carrie M. Rosenberger, Alistair G. Rust, Natalya Yudkovsky, Alan Aderem*, Ilya Shmulevich*

Institute for Systems Biology, Seattle, Washington, United States of America

Abstract

Macrophages are versatile immune cells that can detect a variety of pathogen-associated molecular patterns through their Toll-like receptors (TLRs). In response to microbial challenge, the TLR-stimulated macrophage undergoes an activation program controlled by a dynamically inducible transcriptional regulatory network. Mapping a complex mammalian transcriptional network poses significant challenges and requires the integration of multiple experimental data types. In this work, we inferred a transcriptional network underlying TLR-stimulated murine macrophage activation. Microarray-based expression profiling and transcription factor binding site motif scanning were used to infer a network of associations between transcription factor genes and clusters of co-expressed target genes. The time-lagged correlation was used to analyze temporal expression data in order to identify potential causal influences in the network. A novel statistical test was developed to assess the significance of the time-lagged correlation. Several associations in the resulting inferred network were validated using targeted ChIP-on-chip experiments. The network incorporates known regulators and gives insight into the transcriptional control of macrophage activation. Our analysis identified a novel regulator (TGIF1) that may have a role in macrophage activation.

Author Summary

Macrophages play a vital role in host defense against infection by recognizing pathogens through pattern recognition receptors, such as the Toll-like receptors (TLRs), and mounting an immune response. Stimulation of TLRs initiates a complex transcriptional program in which induced transcription factor genes dynamically regulate downstream genes. Microarray-based transcriptional profiling has proved useful for mapping such transcriptional programs in simpler model organisms; however, mammalian systems present difficulties such as post-translational regulation of transcription factors, combinatorial gene regulation, and a paucity of available gene-knockout expression data. Additional evidence sources, such as DNA sequence-based identification of transcription factor binding sites, are needed. In this work, we computationally inferred a transcriptional network for TLR-stimulated murine macrophages. Our approach combined sequence scanning with time-course expression data in a probabilistic framework. Expression data were analyzed using the time-lagged correlation. A novel, unbiased method was developed to assess the significance of the time-lagged correlation. The inferred network of associations between transcription factor genes and co-expressed gene clusters was validated with targeted ChIP-on-chip experiments, and yielded insights into the macrophage activation program, including a potential novel regulator. Our general approach could be used to analyze other complex mammalian systems for which time-course expression data are available.

Introduction

Dynamic cellular processes, such as the response to a signaling event, are governed by complex transcriptional regulatory networks. These networks typically involve a large number of transcription factors (TFs) that are activated in different combinations in order to produce a particular cellular response. The macrophage, a vital cell type of the mammalian immune system, marshals a variety of phenotypic responses to pathogenic challenge, such as secretion of pro-inflammatory mediators, phagocytosis and antigen presentation, stimulation of mucus production, and adherence. In the innate immune system, the first line of defense against infection, the macrophage's Toll-like receptors (TLRs) play a crucial role by recognizing distinct pathogen-associated molecular patterns (PAMPs), such as flagellin, lipopeptides, or double-stranded RNA [1],[2]. TLR signals are first channeled through adapter molecules (e.g., TICAM1/TRIF [3],[4] and MyD88 [5]) and then through parallel cross-talking signal pathways. These activated pathways initiate a transcriptional program in which over 1,000 genes [6] and hundreds of TF genes [7] can be differentially expressed, and which is tailored to the type of infection [8],[9]. The transcriptional network underlying macrophage activation can exhibit many distinct steady-states which are associated with tissue- and infection-specific macrophage functions [10]. The transcriptional response is also dynamic and characterized by temporal waves of gene activation [6],[7],[9], each enriched for distinct sets of gene functions [7],[9] and likely to be controlled by different combinations of transcriptional regulators [6],[7]. Long-term, elucidating the transcriptional network underlying TLR-stimulated macrophage activation, and identifying key regulators and their functions, would greatly enhance our understanding of the innate immune response to infection and potentially yield new ideas for vaccine development.

Computational analysis of high-throughput experimental data is proving increasingly useful in the inference of transcriptional regulatory interaction networks [11]–[15] and in the identification and prioritization of potential regulators for targeted experimental validation [6],[7]. Time-course microarray expression measurements have been used to infer dynamic transcriptional networks in yeast [14],[15] and static “influence” networks in mammalian cell lines [11]. In the context of primary macrophages, expression-based computational reconstruction of the transcriptional control logic underlying the activation program is not straightforward and progress is difficult to measure, for several reasons. First, transcriptional control within mammalian networks in general [16], and for key TLR-responsive genes in particular [7], is combinatorial. Second, many induced TFs are subject to post-translational activation [17] and dynamic control of nuclear localization [6]. Third, targeted genetic perturbations are presently infeasible to perform on a large scale in a mammalian animal model, and expression knockdown (RNAi) is difficult in macrophages due to the tendency of the vector to stimulate TLRs. Finally, the few transcriptional regulatory interactions that have been validated through targeted experiments in TLR-stimulated primary macrophages are not available in a single “gold standard” dataset. Therefore, in the context of transcriptional regulation in the mammalian macrophage, with presently accessible expression data sets, large-scale computational inference is primarily useful for statistically identifying potential regulatory interactions, rather than as an inference tool for predicting the transcriptional control logic for specific target genes.

For the reasons described above, in order to computationally infer transcriptional regulatory interactions in a mammalian system, it is necessary to include additional sources of evidence (beyond expression data) to constrain or inform the transcriptional network model selection. Computational scanning of the promoter sequences of clusters of co-expressed genes for known transcription factor binding site (TFBS) motifs has proved particularly valuable when combined with global expression data [6],[18],[19]. Recently, Nilsson et al. [7] used a combination of expression clustering and promoter sequence scanning for TFBS motifs to construct an initial transcriptional network of the macrophage stimulated with the TLR4 stimulus lipopolysaccharide (LPS). Their work identified two novel regulators, but the clustering was based on an expression dataset with a single stimulus, limited biological replicates, and few time points. Moreover, TFBS motif scanning of co-expressed clusters, without utilizing expression dynamics, provides only a limited and static picture of the underlying transcriptional network. Many TFBS motifs are often recognized by multiple TFs, making difficult the unambiguous identification of the regulating TF from TFBS enrichment alone. Furthermore, because of the tendency of TFBS motifs to co-occur [20], it is difficult to determine from among a set of co-occurring motifs which associated TF is the most relevant to the condition-specific regulation of the target cluster. In the TLR-stimulated macrophage, core transcription factors already expressed in the cell (e.g., NFkB, AP1, and CREB) are rapidly activated and initiate transcriptional regulation of “second wave” TF genes [6]. Such transcriptionally regulated TF genes are key candidates for an integrated analysis combining TF-specific dynamic expression data and sequence-based motif scanning data.

This work is concerned with using computational data integration to identify a set of core differentially expressed transcriptional regulators in the TLR-stimulated macrophage and, in the form of statistical associations, the clusters of co-expressed genes that they may regulate. The clusters are differentiated based on temporal and stimulus-specific activation, and in this sense, the inferred associations constitute a preliminary dynamic transcriptional network for the TLR-stimulated macrophage. To achieve this, we used a novel computational approach incorporating TFBS motif scanning and statistical inference based on time-course expression data across a diverse array of stimuli. Our approach involved four steps. (i) A set of genes was identified that were differentially expressed by wild-type macrophages under at least one TLR stimulation experiment. (ii) These genes were clustered based on their expression profiles across a wide range of conditions and strains, grouping genes based on the similarity of the timing and stimulus-dependence of their induction. Gene Ontology annotations were used to identify functional categories enriched within the gene clusters. (iii) Promoter sequences upstream of the genes within each cluster were scanned for a library of TFBS motifs, each recognized by at least one differentially expressed TF, to identify possible associations between TFs and gene clusters. (iv) Across eleven different time-course studies, dynamic expression profiles of TF genes and target genes were compared in order to identify possible causal influences between differentially expressed TF genes and clusters.

Several techniques have been developed specifically for model inference from time-course expression data, notably dynamic Bayesian networks (DBN) [21] and ODE-based model selection [12]. However, the parametric complexity of these model classes makes it difficult to apply them to infer a network underlying a specific cellular perturbation (e.g., TLR activation in the macrophage) with a limited expression dataset. Here, potential transcriptional regulatory influence is inferred from time-course expression data using the time-lagged correlation (TLC) statistic, which has been used to infer biochemical interaction networks [22] as well as transcriptional networks [23]–[29]. The TLC has the advantage that it accounts for the time delay between differential expression of an induced TF and differential expression of a target gene. In contrast to standard correlation-based methods that identify co-expressed genes, the TLC method uses temporal ordering of expression to determine whether the time lag between two correlated genes is consistent with a causal interaction. We developed a novel method to identify the optimal time lag for each gene pair, and used a prior probability distribution of transcriptional time delays to score possible interactions.

By combining the promoter scanning-based evidence with the evidence obtained by the time-lagged correlation analysis of the expression data, we were able to identify a network of statistically significant associations between 36 TF genes and 27 co-expressed clusters. Overall, 63% of differentially expressed genes are included in the network. The network provided insights into the temporal organization of the transcriptional response and into combinations of TFs that may act as key regulators of macrophage activation. Finally, the analysis identified a potential transcriptional regulator, TGIF1 (Tgif1), which was not previously known to play a role in macrophage activation. As a targeted experimental validation of the inferred network, two transcriptional regulators, p50 (a component of NFkB) and IRF1, were assayed for binding to cis-regulatory elements in LPS-stimulated macrophages using ChIP-on-chip, and were confirmed to bind the promoters of genes in four out of five predicted target clusters at significantly higher proportions than expected for a random set of TLR-responsive genes.

Results

Gene selection and clustering

To probe a diverse set of transcriptional responses of Toll-like receptor (TLR)-stimulated macrophages, primary bone marrow-derived macrophages (BMMs) from five mouse strains (wild-type and four mutant strains; see Table S1) were stimulated with six purified TLR agonists representing various pathogen-associated molecular patterns (PAMPs). The TLR agonists include bacterial-associated (lipopolysaccharide, Pam2CSK4, Pam3CSK4, CpG), viral-associated (poly I:C), and anti-viral (R848) stimuli, and are listed in Table S2. The mutant strains, which were included to increase the diversity of the TLR-stimulated gene expression dataset and to increase the number of time-course measurements used, consisted of null mutations of the two key adapter molecules for the TLR signaling pathway (TRIF [3] and MyD88 [5]) and two TFs predicted to be involved in TLR activation (ATF3 [30] and CREM [31]). Genome-wide expression measurements of 45,037 probesets, representing 23,259 annotated genes, were made for time courses of up to 48 hours post-stimulation, using oligonucleotide microarrays (see Materials and Methods). In all, expression measurements were made for 95 distinct combinations of strain, stimulus, and elapsed time (hereafter, “experiments”; see Table S3). Using a spline-based multivariate regression method specifically adapted for significance testing of temporal expression datasets [32], annotated probesets were analyzed for differential expression across seven TLR-stimulated wild-type expression time-courses. After filtering for minimum absolute expression intensity and differential expression under at least one TLR-stimulation experiment (see Materials and Methods), 1,960 probesets were identified as significantly differentially expressed, with each probeset mapped to a unique gene (see Table S4). Of these, 44% were found to be upregulated in LPS-stimulated wild-type macrophages. 
Additionally, a set of 80 TF genes (for which corresponding position-weight matrices are available in the TRANSFAC database [33]) were found to be differentially expressed in the TLR-stimulated wild-type macrophage (Table S5). Those of TF families with established relevance in macrophage activation included two NFkB [34] component genes (Rel, Nfkb1), three AP1 [35] components (Jun, Junb, Fos), two ATF family genes [6] (Atf1, Atf3), six IRF family TF genes (Irf1/2/3/5/7/9) [17], and four STAT family TF genes [36] (Stat1/3/4/5a). The 80 TF genes were taken to constitute the set of potential regulators in the TLR-stimulated macrophage network.

Clustering was used to identify cohorts of genes that were co-expressed across the diverse set of TLR-stimulation experiments, based on the assumption that genes within a cluster are likely to share common cis-regulatory elements such as TF binding sites [18]. In order to focus on TF control of the timing and stimulus specificity of gene expression, genes were clustered based on the normalized profile of expression, rather than based on the fold-change. Expression measurements were transformed based on a single universal reference experiment (wild-type unstimulated macrophages) so that the transformed measurements would all lie between −1 and 1, with zero indicating the intensity in the reference experiment. This technique, which we call the signed difference ratio (SDR), has previously proved useful in clustering genes based on temporal expression in a mammalian system [37]. Each log2 intensity measurement ypj for probeset p and non-reference experiment j, was transformed to an SDR value xpj by
$$x_{pj} = \frac{y_{pj} - y_{pj_R}}{\max_{j'} \left| y_{pj'} - y_{pj_R} \right|} \qquad (1)$$
where jR is the index of the global reference experiment (j′ has the same range of values as j). By construction, −1≤xpj≤1 for all p and j. A positive SDR value indicates higher expression than in the reference experiment, and a negative value indicates lower expression. The SDR-transformed log2 intensities of all 1,960 target genes across all 94 non-reference experiments were clustered using an unsupervised algorithm (K-means with Euclidean distance), with the number of clusters chosen using the Bayesian information criterion (BIC) [38] (see Materials and Methods, and Figure S1). The target genes were partitioned into 32 clusters (see Table S4, column 5). The differences in temporal and stimulus-specific expression between the clusters are clearly visible in a heat-map representation of the SDR-transformed expression data (hereafter, “expression data”) (Figure 1; see also Figure S2 for the cluster-median expression heat-map).
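The SDR transformation can be illustrated with a short sketch. It assumes that the normalization divides each difference from the reference by the largest absolute difference observed for that probeset, which is one way to guarantee the stated bounds of −1 and 1. The function name, the argument names, and the example intensities are hypothetical and are not taken from the study.

```python
def signed_difference_ratio(log2_intensities, ref_index):
    """Transform one probeset's log2 intensities into SDR values.

    log2_intensities: log2 intensities, one per experiment.
    ref_index: position of the reference experiment in that list.
    Returns SDR values for the non-reference experiments, in order.
    """
    ref = log2_intensities[ref_index]
    # Signed difference of every non-reference experiment from the reference.
    diffs = [y - ref for j, y in enumerate(log2_intensities) if j != ref_index]
    # Assumed normalization: largest absolute difference for this probeset.
    scale = max(abs(d) for d in diffs)
    if scale == 0.0:
        # Flat profile: no differential expression anywhere.
        return [0.0 for _ in diffs]
    return [d / scale for d in diffs]


# Hypothetical probeset: reference first, then three stimulated experiments.
profile = [8.0, 9.0, 11.0, 7.0]
sdr = signed_difference_ratio(profile, ref_index=0)
print(sdr)
```

Because every profile is rescaled by its own extreme, two genes with the same timing but different fold-changes map to similar SDR profiles, which is the property the clustering step relies on.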

Figure 1. Global heat-map of differential gene expression in TLR-stimulated murine macrophages, organized by clusters of co-expressed genes.

Each row is one of the 1,960 genes that are differentially expressed in macrophages under TLR stimulation, and each column is a replicate-combined microarray experiment. Red/green coloring indicates the differential expression level (SDR-normalized, see Equation 1). Red indicates upregulation relative to wild-type unstimulated macrophages. Green indicates downregulation relative to wild-type unstimulated macrophages. Genotypes are indicated along the bottom edge. Clusters are indicated along the left edge. Stimuli are indicated along the top edge, with the color scheme given in the lower right corner. Clusters have been ordered based on pairwise similarity, as described in Materials and Methods, Expression Clustering.

doi:10.1371/journal.pcbi.1000021.g001

The clusters (Table S6), which ranged in size from 18 to 113 genes, exhibit a significant diversity of timing and TLR-specificity of response. The wild-type LPS time-course was used to characterize the time scale for each cluster to respond transcriptionally (see Materials and Methods, and Table S6 columns 3–4). A small number of clusters reach peak induction within the first hour (C28, C27, C25, C26), but the majority of clusters (representing 55% of genes) respond between 2–4 hours. The temporal profiles of the clusters in wild-type BMMs under stimulation by LPS, Pam3CSK4, poly I:C, and R848 are shown in Figure S3, Figure S4, Figure S5, and Figure S6, respectively. The clusters exhibit distinct temporal profiles of transcriptional activation and repression that vary in the time of initial response and the duration of differential expression. Across all four stimuli, cluster C28 is induced first (and has sustained induction), followed by cluster C27 (which undergoes transient (2–3 h) upregulation), and then by induction of C25 and C26. Induction of C27 and C28 is delayed approximately 1 h under poly I:C stimulation, while C26 fails to fully induce under poly I:C. A comparison of the responses of clusters within 8 hours post-stimulation with four TLR agonists (LPS, Pam3CSK4, poly I:C, and R848) enabled the segregation of these clusters based on the signal transduction pathway through which they are likely primarily regulated (Figure 2). Groups include those primarily induced (C11, C12, C15, C17, C21, C26) and downregulated (C7, C29) by the MyD88-dependent pathway, and those primarily induced (C6, C8, C22, C24) and downregulated (C4, C5, C10, C20) by the TRIF-dependent pathway. Although “core early response” clusters C27 and C28 appear to be inducible through either signaling pathway, a comparison of the wild-type LPS vs. poly I:C response and of the wild-type vs. Ticam1(Lps2/Lps2) and Myd88(−/−) responses under LPS (see Table S7) together indicate that the MyD88-dependent pathway is responsible for the early response (within the first hour), and the TRIF-dependent pathway is responsible for sustaining the induction of these clusters. Early induced TF genes (Egr1/2/3, Junb, Rel, Irf1) also appear to be inducible through either pathway, from analysis of the LPS response in Ticam1(Lps2/Lps2) and Myd88(−/−) macrophages.

Figure 2. Hierarchical organization of differentially expressed gene clusters from TLR-stimulated macrophages reveals pathway-specific transcriptional responses.

The color of a rectangle in the heat-map shows the cluster-median differential expression (relative to wild-type unstimulated macrophages) under stimulation with the TLR agonist indicated by the column label (bottom of figure), for the cluster indicated by the row label (right-hand side). The column label Pam3 denotes the TLR agonist Pam3CSK4. The differential gene expression level is computed using the signed difference ratio (SDR, see Equation 1). Clusters (rows) have been ordered for display based on similarity of overall transcriptional response to the four indicated TLR agonists (see Materials and Methods, Expression Clustering). In the heat-map, green indicates downregulation, and red indicates upregulation. Colored subtrees of the dendrogram indicate specific inferences that can be made about the likely signaling pathway (MyD88-dependent, TRIF-dependent, or a combination) on which the transcriptional regulation of the cluster depends. The legend in the lower-left corner explains the color scheme for denoting the inferred signaling pathway-dependence of the clusters. Clusters without a color bar on the right appear to respond through either signaling pathway. The regulation of clusters C7, C11, C12, C15, C17, C21, C26, and C29 appears to be primarily MyD88-dependent; regulation of clusters C4, C5, C6, C8, C10, C20, C22, and C24 appears to be primarily TRIF-dependent; and clusters C23, C30, and C32 appear to be regulated oppositely by the two signaling pathways. This plot shows only the extremal differential response to TLR agonists; the clusters also differ in temporal expression (see Figure S3, Figure S4, Figure S5, and Figure S6).

doi:10.1371/journal.pcbi.1000021.g002

To characterize the functional role of each gene cluster in macrophage activation, gene ontology (GO) information was used to identify GO term enrichments within the gene clusters (see Materials and Methods). The 460 GO term enrichments identified within the 32 gene clusters are listed in Table S8. Many of the downregulated gene clusters are enriched for cell cycle related genes (C1, C3, C7). Clusters C15, C25, and C28 appear to be enriched for cytokines: C28 includes the pro-inflammatory cytokine Tnf (TNFa) as well as Ccl3, Ccl4, Cxcl1, and Cxcl2; C25 includes the cytokines Cxcl10 and Il10; and C15 includes the interleukin cytokine genes Il1b, Il6, and Il12b. Cluster C24, enriched for signal transduction genes, also includes the important cytokine Ifnb1 (IFNb). The early-upregulated clusters, C24–28, show a high proportion of induced TFs and are enriched for TFs relative to the genome (see Table S6 and Materials and Methods). Across clusters, the fraction of TFs was generally found to decrease with increasing induction time (Figure 3). Subsequent analysis focused on identifying statistically significant associations between the 80 differentially expressed TF genes and the 32 co-expressed clusters.

Figure 3. Early induced gene clusters are enriched for transcription factors.

Each circular data point indicates a cluster. The horizontal axis is the estimated time scale for the differential expression level of the genes within the cluster to reach 25% of the maximum absolute differential expression (the “response time”). The response time was computed under LPS stimulation of wild-type macrophages (see Materials and Methods, Expression Clustering). The horizontal dashed line indicates the average fraction of genes that are known transcription factors, among all annotated genes in the mouse genome (0.053, see Materials and Methods, Selection of Transcription Factors). The slope of the best-fit line to the scatter plot is −3.84 (Pearson's R = −0.74).

doi:10.1371/journal.pcbi.1000021.g003

Expression dynamics analysis

Noting the high proportion of induced TFs in early-upregulated clusters, we chose a signal processing technique, the time-lagged correlation (TLC), to assess potential transcriptional regulatory interactions using the time-course expression data [22],[23],[25]–[28]. The approach is based on the observation that when an induced TF affects a target gene's expression through its own differentially regulated mRNA level (and through its own differential protein expression), the induction of the target gene's mRNA expression will occur with a time lag relative to the induction of the regulator [39]–[42]. This time lag is due to the combined effects of the translation, folding, nuclear translocation, and turnover time-scales for the regulatory protein, and the time scale for elongation of the target gene mRNA. In our application of the TLC method, both the correlation magnitude and the time lag are used to assess significance, as we describe below.

Let g1 denote a differentially expressed TF gene, and let g2 denote a differentially expressed gene. We wish to estimate our degree of confidence in the null hypothesis, that g1 does not transcriptionally regulate g2, given time-course expression data for both genes. In the simplest case, the alternative hypothesis could be that g1 codes for a TF protein that binds the promoter of g2, thereby regulating its transcriptional activity. Let τ be a fixed time lag for which the TLC between g1 and g2 is to be computed. Let T denote a set of discrete time points at which gene expression is measured, and let T′ denote the set of time points T+τ. Let XT(g1) denote the vector of expression measurements of g1 at the time points T, and let XT′(g2) denote the measurements of g2 at times T′ (which can be estimated by interpolation). The time-lagged correlation (TLC) coefficient between g1 and g2 with time lag τ is defined as
$$\rho_\tau(g_1, g_2) = \frac{\mathrm{cov}\left(X_T(g_1), X_{T'}(g_2)\right)}{\sqrt{\mathrm{cov}\left(X_T(g_1), X_T(g_1)\right)\,\mathrm{cov}\left(X_{T'}(g_2), X_{T'}(g_2)\right)}} \qquad (2)$$
where “cov” is the standard covariance. As with the standard correlation, a TLC that is close to 1 represents a perfect correlation, and a TLC that is close to −1 represents a perfect anti-correlation. This definition is easily extended to multiple time-courses (see Materials and Methods). We note that although Equation 2 is defined in terms of g1 being a TF, it can be applied to any gene pair, for example, to obtain a background distribution of TLC coefficients of gene pairs satisfying the null hypothesis. Two examples of a TF exhibiting a high time-lagged correlation with a target gene are shown in Figure 4. Both interactions (Rel→Nfkb1 [43] and Irf7→Stat1 [44]) correspond to known transcriptional regulatory interactions, and in both cases, the correlation with zero time lag is poorer than the correlation obtained with a time lag.
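A minimal sketch of the time-lagged correlation for a single time course is given below. It assumes linear interpolation of the target gene's expression at the shifted time points and drops any shifted point that would require extrapolation; the interpolation scheme used in the study may differ. All function names and the toy expression profiles are hypothetical, and the extension to multiple time courses is not shown.

```python
import math


def interpolate(times, values, t):
    """Linearly interpolate values (sampled at sorted times) at time t."""
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        if t0 <= t <= t1:
            v0, v1 = values[i], values[i + 1]
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t lies outside the measured time range")


def pearson(x, y):
    """Standard Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def time_lagged_correlation(times, x1, x2, lag):
    """Correlate x1 at times T with x2 at times T + lag (no extrapolation)."""
    usable = [i for i, t in enumerate(times) if t + lag <= times[-1]]
    x1_at_t = [x1[i] for i in usable]
    x2_shifted = [interpolate(times, x2, times[i] + lag) for i in usable]
    return pearson(x1_at_t, x2_shifted)


# Toy profiles: the target copies the TF with a 20-minute delay.
times_min = [0, 20, 40, 60, 80, 100]
tf_expr = [0.0, 0.5, 1.0, 0.8, 0.6, 0.4]
target_expr = [0.0, 0.0, 0.5, 1.0, 0.8, 0.6]

r_zero = time_lagged_correlation(times_min, tf_expr, target_expr, lag=0)
r_lagged = time_lagged_correlation(times_min, tf_expr, target_expr, lag=20)
print(r_zero, r_lagged)
```

In this toy case the correlation at a 20-minute lag is higher than the correlation at zero lag, mirroring the qualitative behavior described for the two validated interactions.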

Figure 4. Two validated transcriptional regulatory interactions exhibiting high time-lagged correlations.

(A) Rel and Nfkb1. The solid line shows the expression of Rel (c-REL), and the dotted line shows the expression of Nfkb1 (p50/p105) in LPS-stimulated wild-type macrophages, over eight hours. The genes exhibit a high time-lagged correlation with a time delay of 60 minutes (across the eleven time-course experiments listed in Table S9, ρτ = 0.91 and P = 0.011; see Materials and Methods, Time-lagged Correlation, for an explanation of the statistical test). The NFκB heterodimers c-REL-p50 and c-REL-p65 are known to regulate expression of Nfkb1 [43]. The correlation at zero time lag is 0.81. (B) Irf7 and Stat1. The solid line shows the expression of Irf7 (IRF7) and the dotted line shows the expression of Stat1 (STAT1) in LPS-stimulated Atf3(−/−) macrophages. The genes exhibit a high time-lagged correlation with a time delay of 20 minutes (across the ten experiments, ρτ = 0.96 and P = 0.002). The transcription factor IRF7 has been shown to regulate Stat1 gene expression in the innate immune response to viral infection [44]. The correlation at zero time lag is 0.95. (C) Time-lagged correlation coefficient and time-lagged correlation significance measure (see Equation 4) as a function of the time lag τ, for Irf7 and Stat1. The peak value of ρτ² occurs at τ = 10 min, but the peak significance value (taking into account the lag-specific null distribution) occurs at τ = 20 min.

doi:10.1371/journal.pcbi.1000021.g004

Assessing the significance of an observed sequence of time-lagged correlations between two genes (as a function of the time lag) as an indicator of possible transcriptional regulation necessitates formulating our prior expectation (i.e., prior probability distribution) for the time lag of a true transcriptional regulatory interaction. For a TF gene g1 and a target gene g2, the overall transcriptional regulatory time delay τc (where “c” stands for the combined gene-gene delay) can be decomposed as a sum of two contributions, one for translation of the TF and post-translational processing/translocation (~10.5±4 min [41],[45]), and one for transcription and post-transcriptional processing of the target gene (~20–40 min [41],[42]). The total delay τc was modeled using the gamma distribution with a mean value of 45 min and a variance of ~250 min² (see Text S1, Section 3). Because it is conditioned on the existence of a transcriptional regulatory interaction (TRI) between g1 and g2, we denote this probability distribution by P(τc|0) (the symbol 0 means that the null hypothesis, i.e., that there is no TRI, is false). This distribution was discretized to the set of time lags for which the TLC was computed, to obtain an estimate of the discrete probability for observing a given optimal time lag, P(τ|0) (see Figure S7). These probabilities were then combined with the P value for the squared time-lagged correlation coefficient, ρτ²(g1, g2), whose derivation we describe next.
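The following sketch shows one way such a gamma prior could be discretized. The shape and scale are obtained from the stated mean (45 min) and variance (about 250 min²) by moment matching. The binning is an assumption made here for illustration: each lag receives the probability mass within ±5 min of it, the mass beyond the last bin is discarded, and the result is renormalized. The discretization actually used in the study (Text S1) may differ, and the variable and function names are hypothetical.

```python
import math

MEAN_DELAY_MIN = 45.0
VARIANCE_MIN2 = 250.0
LAGS_MIN = [0, 10, 20, 30, 40, 50, 60, 70, 80]

# Moment matching for a gamma distribution: mean = k * s, variance = k * s^2.
shape = MEAN_DELAY_MIN ** 2 / VARIANCE_MIN2   # k, about 8.1
scale = VARIANCE_MIN2 / MEAN_DELAY_MIN        # s, about 5.56 min


def gamma_pdf(x):
    """Gamma probability density with the shape and scale defined above."""
    if x <= 0.0:
        return 0.0
    log_pdf = ((shape - 1.0) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)


def bin_mass(lo, hi, steps=2000):
    """Midpoint-rule integral of the density over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(gamma_pdf(lo + (i + 0.5) * width) for i in range(steps)) * width


# Assumed binning: mass within +/- 5 min of each lag, clipped at zero.
raw = {lag: bin_mass(max(lag - 5.0, 0.0), lag + 5.0) for lag in LAGS_MIN}
total = sum(raw.values())
prior = {lag: mass / total for lag, mass in raw.items()}

for lag in LAGS_MIN:
    print(lag, round(prior[lag], 4))
```

Under these assumptions most of the prior mass falls on the middle of the lag range, so an optimal lag of zero or of 80 min receives little support from the prior.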

For each pair (g1,g2) for which the TLC approach was to be applied, an “optimal time lag” θ(g1,g2) was selected, so that a single representative TLC could be obtained for the pair. The set of time lags and the set of time-course experiments to use were selected according to a constraint (imposed to minimize interpolation error) that the target gene expression at maximum time lag must be interpolated from at least three measurements. Based on this constraint, and taking into account the expected precision at which the optimal time lag can be estimated (±5 min, based on the replicate variability in the expression data; see Materials and Methods), the set of time lags was chosen to be τ ∈ {0, 10, 20, 30, 40, 50, 60, 70, 80 min}. Eleven time-course experiments satisfied the criteria (combining six stimuli and three genotypes, see Table S9). The TLC ρτ²(g1, g2) was computed for each of the τ values, for each pair of genes, using data from all eleven time-course experiments combined (see Materials and Methods). The next step was to determine the optimal time lag for (g1,g2) from the squared TLC coefficient ρτ²(g1, g2). It is not ideal to simply select the τ at which ρτ²(g1, g2) is maximal, as some studies have done [23],[26],[46], because of two competing bias effects, as we now explain. Consider a pair of genes (h1,h2) satisfying the null hypothesis, and let tmax ≡ max(T), where T is the set of time points for a single time-course. In practice the expression of h2 cannot be extrapolated beyond tmax, so the effective number of data points for computing the TLC ρτ²(h1, h2) is limited to the number of time points within T that are less than tmax − τ. Thus, the number of measurements that can be used to compute the TLC is τ-dependent, and the distribution of TLCs for pairs of genes satisfying the null hypothesis depends on τ. Therefore, one will more frequently observe (by chance) a TLC exceeding a given value (say, 0.9) by selecting the largest possible τ. In addition, the high degree of synchronization within the transcriptional response, as well as the fact that all the SDR-transformed expression levels are zero at the initial time point, results in a second bias towards zero time lag. This effect is strengthened as the number of time points in the data set (per time-course) decreases. Therefore, selecting the optimal τ to maximize ρτ²(g1, g2) introduces an unwanted bias towards the smallest and largest τ values investigated (see Figure S8), and against τ values in the middle of the range of time lags (i.e., 20–60 min).
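The dependence of the usable sample size on the time lag can be made concrete with a small sketch of a squared time-lagged correlation for a single time-course. The linear interpolation, the single-time-course restriction, the toy expression values, and the function names are assumptions for illustration; the analysis described above pooled eleven time-courses.

```python
def interpolate(times, values, t):
    """Linearly interpolate values (sampled at increasing times) at time t."""
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            return values[i] + w * (values[i + 1] - values[i])
    raise ValueError("t lies outside the sampled time range")

def usable_points(times, lag):
    """Time points t for which t + lag does not exceed the last measurement."""
    return [t for t in times if t + lag <= times[-1]]

def squared_tlc(times, tf_expr, target_expr, lag):
    """Squared Pearson correlation between TF(t) and target(t + lag)."""
    pts = usable_points(times, lag)
    x = [tf_expr[times.index(t)] for t in pts]
    y = [interpolate(times, target_expr, t + lag) for t in pts]
    n = len(pts)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy * sxy / (sxx * syy)

# Toy data (hypothetical): the target follows the TF with a 20 min delay.
times = [0, 20, 40, 60, 80, 100, 120, 140, 160]
tf = [0.0, 1.0, 3.0, 2.5, 1.5, 1.0, 0.5, 0.2, 0.1]
target = [0.0] + tf[:-1]
```

In this toy series nine time points are usable at zero lag but only five at a lag of 80 min, which illustrates why the null distribution of the correlation broadens as the lag grows.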

To avoid the above-described bias, a background cumulative distribution of squared time-lagged correlation coefficient values, denoted by F_τ(ρ_τ) (where ρ_τ is the squared correlation ρτ²), was computed separately for each time lag τ, using a large set H of gene pairs such that there is no direct transcriptional regulatory interaction (TRI) for each gene pair in the set (see Materials and Methods). The functions F_τ were used to select the optimal time lag θ(g1,g2),
$$\theta(g_1,g_2) = \underset{\tau}{\arg\max}\; F_\tau\!\left(\rho_\tau^2(g_1,g_2)\right) \qquad (3)$$
and the fractional significance of the lag-specific squared correlation coefficient ξ(g1,g2),
$$\xi(g_1,g_2) = 1 - F_\theta\!\left(\rho_\theta^2(g_1,g_2)\right) \qquad (4)$$
Making use of the discretized distribution P(τ|0) defined above, a probability ratio R(τ) was computed as the ratio of the probability of the null hypothesis H0 (that there is no direct TRI between g1 and g2) given the measured optimal time lag, to the marginal probability of the null hypothesis,
$$R(\tau) = \frac{P(H_0 \mid \tau)}{P(H_0)} \qquad (5)$$
It should be noted that the uncertainty in θ due to the discretization of time lags (a practical necessity in the context of microarray-derived expression data) leads to uncertainty in the estimation of R(τ). However, the effect of this uncertainty on the cluster-combined P value (see Equation 10 below) is small, because time lag estimation errors for genes in a cluster are not strongly correlated. The marginal probability P(τ) was estimated from the optimal time lags of all gene pairs, and the marginal probability P(H0) was estimated from data in the literature (see Materials and Methods). Using this probability ratio, and in analogy with Fisher's method, a combined score for the gene pair (g1,g2) was constructed, taking into account both the optimal time lag θ(g1,g2) and the fractional lag-specific significance ξ(g1,g2),
(6)
Using the cumulative distribution of s scores for gene pairs satisfying the null hypothesis, the significance of the association between g1 and g2 based on expression data can be computed,
(7)
This formula was applied for all pairs (g1,g2) where g1 ranged over the set of 80 TFs, g2 ranged over the set of all 1,960 differentially expressed genes, and g1 ≠ g2 (see Materials and Methods). The expression data for the TFs are provided in Table S10 and the expression data for all 1,960 differentially expressed genes are provided in Table S4.
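The lag-selection step can be illustrated with a short sketch. It assumes that the optimal lag is the one at which the observed squared correlation reaches the highest percentile of its lag-specific background distribution, and that the fractional significance is the corresponding upper-tail fraction; the empirical-CDF estimator and all names are hypothetical choices made here.

```python
from bisect import bisect_right

def ecdf(sorted_null, value):
    """Fraction of background values less than or equal to value."""
    return bisect_right(sorted_null, value) / len(sorted_null)

def select_optimal_lag(rho2_by_lag, background_by_lag):
    """Return (theta, xi) for one gene pair.

    rho2_by_lag:       {lag: observed squared TLC for the pair}
    background_by_lag: {lag: squared TLCs of gene pairs with no direct TRI}
    """
    percentiles = {
        lag: ecdf(sorted(background_by_lag[lag]), r2)
        for lag, r2 in rho2_by_lag.items()
    }
    # Compare lags on the background-percentile scale, not on raw rho^2.
    theta = max(percentiles, key=percentiles.get)
    xi = 1.0 - percentiles[theta]
    return theta, xi
```

With a background that is inflated at zero lag, an observed value of 0.8 at lag 0 can rank lower than an observed 0.7 at lag 40, so the percentile-based choice differs from simply maximizing the raw squared correlation.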

To estimate the overall significance (based on time-course expression data) of the association between a TF gene f and a cluster C, the P values Ptlc(f,g) were combined into a P value for the cluster, Pexp(f,C). For each pair (f,C), a Fisher score Fexp(f,C) was computed,
$$F_{\mathrm{exp}}(f,C) = -2 \sum_{g \in C \setminus \{f\}} \ln P_{\mathrm{tlc}}(f,g) \qquad (8)$$
where C\{f} means that if the TF gene f was a member of cluster C, the self-association Ptlc(f,f) was excluded. For each cluster C, the number of degrees of freedom, denoted by d(C), was estimated using K-means clustering (see Materials and Methods). The d(C) values were used to obtain a TF-to-cluster P value, Pexp(f,C), using a χ² test (see Materials and Methods). A total of 23 pairs had Pexp(f,C) ≤ 10⁻³. The differential expression levels for the strongest (TF, cluster) pairs in wild-type time-courses following stimulation by LPS (one of the time-courses used for the TLC analysis; see Table S9) are shown in Figure S9 and reveal a high degree of correlation between the TF gene and target cluster. The distribution of Pexp(f,C) over all TF-to-cluster pairs, and the estimated false discovery rate (FDR), are shown in Figure S10.

Promoter scanning of co-expressed gene clusters

To provide an independent source of evidence for association between a differentially expressed TF gene and a co-expressed gene cluster, the promoters of differentially expressed genes were scanned using position-weight matrices (PWMs) representing motifs recognized by murine TFs. A motif was selected if it is recognized by at least one TF of which at least one component protein was differentially expressed in the expression dataset, ensuring that the TF had at least one expression profile that could be compared with (potential) target genes using the TLC. For each PWM, the fraction of genes with at least one above-threshold match within the promoter was computed, within a reference set of all genes detected as expressed within the TLR-stimulated macrophage, and within each co-expressed gene cluster. A total of 150 position-weight matrices were selected from the TRANSFAC database [33] for motif scanning, corresponding to the 80 differentially expressed murine TF genes (see Table S5, and Materials and Methods). Promoter sequences 2 kbp upstream of the transcription start site were obtained for 1,713 out of the 1,960 differentially expressed genes, and for 7,492 out of 8,788 expressed genes (used as a reference set; see Materials and Methods) from the UCSC genome annotation database [47]. For each PWM, a minimum match score was determined at which the PWM had a match on average once per 10 kb, within a set of 7,503 promoter sequences for genes not detectably expressed in the macrophage (to avoid biasing the match score threshold calculation with true TF targets; see Materials and Methods). Using these PWM match score thresholds, promoters were scanned within the reference set of genes, and within each co-expressed cluster of genes. The distribution of distances of matches from the transcription start site (Figure S11) has a median of 537 bp.
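The calibration of a match-score threshold against a background promoter set can be sketched as follows. This is an illustration under stated assumptions: the PWM is represented as a list of per-position score dictionaries, only the forward strand is scanned, and the cutoff is taken as the smallest score that yields at most one match per `bp_per_match` bases. None of these details are confirmed by the text above, and the function names are hypothetical.

```python
from collections import Counter

def window_scores(seq, pwm):
    """Score every window of len(pwm) along seq (forward strand only)."""
    width = len(pwm)
    return [sum(pwm[j][seq[i + j]] for j in range(width))
            for i in range(len(seq) - width + 1)]

def threshold_for_rate(background_seqs, pwm, bp_per_match=10000):
    """Smallest cutoff giving at most one match per bp_per_match bases.

    Returns infinity if even the top score occurs too often.
    """
    counts = Counter()
    total_bp = 0
    for seq in background_seqs:
        counts.update(window_scores(seq, pwm))
        total_bp += len(seq)
    allowed = total_bp // bp_per_match
    cutoff = float("inf")
    n_at_or_above = 0
    for score in sorted(counts, reverse=True):
        n_at_or_above += counts[score]
        if n_at_or_above > allowed:
            break
        cutoff = score
    return cutoff

def count_matches(seq, pwm, cutoff):
    """Number of windows in seq scoring at or above the cutoff."""
    return sum(1 for s in window_scores(seq, pwm) if s >= cutoff)
```

Calibrating on promoters of genes that are not detectably expressed, as described above, keeps true targets from inflating the threshold.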

As a next step towards inferring a transcriptional network, enrichments of TFBS motifs were computed for individual gene clusters. For each cluster C and position-weight matrix m, enrichment statistics were computed based on the fraction of genes in C possessing at least one match for m. For each pair (m,C) for which the fraction of genes containing a match for m within the cluster C was greater than in the reference set of genes, a P value was computed using Fisher's exact test (see Materials and Methods, and [48]) and denoted by Pscan(m,C). This P value represented the significance of the enrichment of matrix m within the promoters of cluster C, relative to the reference set of promoters (expressed genes). A matrix representation of the strongest motif enrichments (56 associations with Pscan(m,C) ≤ 10⁻²) with the clusters grouped by expression similarity (Figure 5) reveals several associations between TF motifs and patterns of differential expression. First, NFκB and IRF recognition elements are associated with upregulated clusters, while E2F and MYCMAX elements are associated with downregulated clusters. The IRF element was strongly associated with TRIF-dependent cluster C6 and STAT1 was strongly associated with C22. Many TF motifs were associated with the core early response cluster C27, including AP1, CREB/ATF, EGR, PEBP, and PPARA. The quantitative results of the cluster-wise statistical tests (numbers of matches and P values) are provided in Table S11.
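A one-sided Fisher's exact test for motif enrichment reduces to a hypergeometric tail probability. The sketch below assumes that the cluster is treated as a subset drawn from the reference set; how the 2×2 table was actually laid out is not stated above, and the function and argument names are hypothetical.

```python
from math import comb

def enrichment_p_value(n_ref, k_ref, n_cluster, k_cluster):
    """One-sided P(X >= k_cluster) for X ~ Hypergeometric.

    n_ref:     number of genes in the reference set
    k_ref:     reference genes with at least one motif match
    n_cluster: number of genes in the cluster (assumed drawn from the reference)
    k_cluster: cluster genes with at least one motif match
    """
    upper = min(k_ref, n_cluster)
    total = comb(n_ref, n_cluster)
    tail = sum(comb(k_ref, k) * comb(n_ref - k_ref, n_cluster - k)
               for k in range(k_cluster, upper + 1))
    return tail / total
```

For example, with four reference genes of which two carry the motif, a two-gene cluster in which both genes carry the motif gives a P value of 1/6.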


Figure 5. Patterns of high-confidence motif enrichments within promoters of target clusters reveal associations between regulatory elements and expression patterns.

Each row in the matrix represents a TF binding element, and each column represents a cluster of differentially expressed genes. Clusters are ordered as in Figure 2, and thus are grouped hierarchically by similarity of their extremal expression fold-change under the four TLR agonists LPS, Pam3CSK4, poly I:C, and R848. Each motif (row) is associated with one or more position-weight matrices (the V$ prefix and numeric suffixes are omitted, and results for multiple position-weight matrices representing the same motif were combined for each column by taking the matrix with the maximum number of matches within the indicated cluster). Each colored block in the matrix indicates a pair of a motif and target cluster for which the fraction of genes in the cluster with a motif match is enriched relative to the overall fraction of genes expressed in the macrophage that possess the motif (P ≤ 10⁻², Fisher's exact test). The color of each matrix element (block) in the interior of the figure indicates the fraction of scanned genes within the cluster containing at least one match for the indicated motif. The number of scanned genes within the cluster that contained a match for the indicated motif is shown in yellow typeface. The red/green colored blocks above the top horizontal axis show whether each cluster is upregulated (red) or downregulated (green) at its most extremal fold-change under stimulation with the aforementioned TLR agonists. The hatched green/red pattern indicates a cluster whose extremal fold-change direction (up/down) is stimulus-dependent (see Figure 2). The colored (blue, cyan, orange, yellow, purple) blocks above the top of the matrix indicate the likely pathway through which the cluster is differentially expressed; the color scheme corresponds to that shown in the dendrogram in Figure 2.

doi:10.1371/journal.pcbi.1000021.g005

To enable integration of the promoter scanning evidence with the time-lagged correlation evidence, PWMs that were enriched for matches within gene clusters were mapped to differentially expressed TF genes as follows. For each PWM m, a list of genes coding for TFs (or TF components) that bind the motif corresponding to m was obtained from a TRANSFAC-derived mapping (see Materials and Methods). For each TF gene f and cluster C, a P value for the association between f and C based on promoter scanning evidence, Pscan(f,C), was defined as the minimum over all Pscan(m,C) for all matrices m that are associated with the TF gene f. The distribution of the resulting P values and the false discovery rate (as a function of P value) are shown in Figure S12. A total of 31 factor-to-cluster associations were identified with Pscan(f,C) ≤ 10⁻³, indicating a statistical power that is slightly higher than that of the TLC-based evidence.
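The mapping from matrix-level to TF-level P values is a minimum over the matrices associated with each TF gene. The sketch below uses hypothetical dictionary layouts and names; pairs (m,C) without a computed P value (for example, because the motif was not enriched in the cluster) are simply skipped.

```python
def tf_cluster_scan_p(tf_to_matrices, p_scan_matrix, cluster):
    """Return {tf_gene: min over its matrices of Pscan(m, cluster)}.

    tf_to_matrices: {tf_gene: [matrix identifiers]} (hypothetical mapping)
    p_scan_matrix:  {(matrix, cluster): P value} for tested pairs only
    """
    result = {}
    for tf, matrices in tf_to_matrices.items():
        p_values = [p_scan_matrix[(m, cluster)]
                    for m in matrices if (m, cluster) in p_scan_matrix]
        if p_values:
            result[tf] = min(p_values)
    return result
```

Because several TF genes can share a matrix, this step alone assigns identical scanning P values to all of them; the expression evidence is what distinguishes family members.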

Data integration and network extraction

To identify the set of all possible TF gene-to-target interactions consistent with motif scanning evidence, for each TFBS motif match within the promoter of a target gene, the time-lagged correlation was computed for all possible TF genes that map to the TFBS motif. The resulting list of 54,253 pairs (f,g) of TF gene f and target gene g, provided as Table S12, shows that many known transcriptional regulatory interactions rank highly based on time-lagged correlation; for example, NFκB/Rel associated with Icam1 [49] and Cebpd associated with Il6 [50]. Although the TLC-ranked list of motif targets has some potential utility for identifying specific transcriptional regulatory interactions, even the high-ranking elements of the list will contain many false positives (and will miss many true transcriptional regulatory interactions) due to the uncertainty in motif PWMs and the prevalence of post-translational regulation that may obscure the time-lagged correlation. Therefore, further data reduction is necessary to gain insight into the global transcriptional program of the TLR-stimulated macrophage. By using a statistical test that compares the frequency of motif occurrence within a cluster with that in a background set of genes, a more reliable estimate of TF association with a co-expressed cluster can be obtained.

To construct a combined transcriptional network of the TLR-stimulated macrophage, P values for associations between TF genes and co-expressed gene clusters based on expression dynamics and promoter scanning were combined. For each pair (f,C) where f is one of 80 TF genes and C is one of 32 gene clusters, a combined P value Pcomb(f,C) was computed from the P values for the scanning and expression evidence, Pscan(f,C) and Pexp(f,C). The P values were combined using Fisher's method (see Materials and Methods), a standard tool for meta-analysis of independent tests of a hypothesis [51]. TF-cluster pairs were then ordered by increasing P value Pcomb(f,C), and a cutoff was selected so that the estimated false discovery rate did not exceed 0.025 (resulting in a cutoff Pcomb(f,C) ≤ 0.0248). Additionally, two filtering criteria were imposed: (i) Pscan(f,C) ≤ 0.05, to ensure that there is a minimal enrichment of TFBS; and (ii) a cluster-average optimal time lag between f and C of at least 10 min, i.e., ⟨θ⟩f,C ≥ 10 min (see Materials and Methods). A scatter plot of the P values for the two types of evidence is shown in Figure S13, and indicates that for the data points that were rejected based on the Pcomb(f,C) cutoff, no dependency between the two is evident. A total of 90 interactions involving 36 TF genes and 27 clusters (comprising 86% of the differentially expressed genes) were accepted based on the above criteria (see Table 1). If the TLC P values were not included, and if the same rate of false discovery were imposed, the network would be significantly less parsimonious (~150 interactions), due to the large number of TF gene families that map to a common motif. Overall network coverage was estimated by taking the fraction of differentially expressed genes that (i) are members of the 27 clusters in the network; and (ii) possess a match for a motif recognized by one or more of the TFs associated with the cluster. From this estimate the network contains 1,232 genes, or 63% of the 1,960 genes that are differentially expressed under TLR stimulation.
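For two independent P values, Fisher's method has a closed form: the statistic −2 ln(p1·p2) follows a χ² distribution with four degrees of freedom, whose upper tail at that value equals x(1 − ln x) with x = p1·p2. The sketch below applies this together with the cutoff and the two filters stated above; the function names and the way the filters are packaged are assumptions made for illustration.

```python
import math

def fisher_combine(p_scan, p_exp):
    """Fisher's method for two independent P values (chi-square, 4 d.o.f.)."""
    x = p_scan * p_exp
    if x == 0.0:
        return 0.0
    # Survival function of chi-square(4) at -2*ln(x) is x * (1 - ln x).
    return x * (1.0 - math.log(x))

def accept(p_scan, p_exp, mean_lag_min,
           p_comb_cutoff=0.0248, p_scan_max=0.05, min_lag=10.0):
    """Apply the combined-P cutoff and the two filters quoted in the text."""
    return (fisher_combine(p_scan, p_exp) <= p_comb_cutoff
            and p_scan <= p_scan_max
            and mean_lag_min >= min_lag)
```

For instance, two P values of 0.05 combine to about 0.0175, which passes the 0.0248 cutoff even though neither value alone would.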


Table 1. Network of inferred transcription factor–cluster associations

doi:10.1371/journal.pcbi.1000021.t001

The distribution of the number of targets regulated by TFs, the so-called out-degree distribution of the transcriptional network, is one key measure of the network's interconnectedness [52]. For each TF that was included in the transcriptional network, the number of targets was estimated using the promoter scanning data (see Materials and Methods). The out-degree varied approximately 20-fold over the set of 36 TF genes (Figure S14). The transcription factor MYC (which is involved in development and cellular differentiation [53]) was found to be the most highly connected in the network (consistent with the high out-degree for MYC found in [11]), followed by members of the E2F family of TFs (believed to play a role in cell cycle regulation [54]). Other highly connected TFs include NFYC (a repressor in the TGFβ signaling pathway [55] and member of a TF family involved in monocyte differentiation [56]) and RXRA (a component of heterodimeric TFs that regulate inflammatory signaling and cholesterol metabolism [57]). Also strongly connected in the network are the NFκB TF family members cREL and NFKB1/p50 (key early regulators of the immune response [58]); the IRF family members IRF1, IRF3, IRF5, IRF7, and IRF9 (regulators of interferon-induced immune response [17]); and STAT1 (a key regulator of apoptosis and mediator of interferon signaling [59]). Both the IRF and E2F family TFs had strong P values for association with target clusters (Figure S14). The out-degree distribution appears to be scale-free, consistent with previous reports for mammalian networks [11],[60]. The number of TF genes associated with each cluster (in-degree) ranged from 1 to 9, with an average in-degree of 3.3.
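The reported average in-degree is consistent with the network size: 90 associations spread over 27 clusters gives 90/27 ≈ 3.3. Degree counts at the TF-to-cluster level can be tallied from an edge list as in the sketch below. Note that the out-degree discussed above counts target genes estimated from promoter scanning, whereas this illustration counts clusters; the function name and edge representation are hypothetical.

```python
from collections import Counter

def degree_summary(edges):
    """Return (out_degree, in_degree, mean_in_degree) for TF-to-cluster edges.

    edges: iterable of (tf_gene, cluster) pairs, each pair listed once.
    """
    edges = list(edges)
    out_degree = Counter(tf for tf, _ in edges)
    in_degree = Counter(cluster for _, cluster in edges)
    mean_in = len(edges) / len(in_degree)
    return out_degree, in_degree, mean_in
```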

To reveal patterns among TFs that may regulate multiple clusters, the connections between the 36 TFs and the 27 clusters in the inferred network were arranged in a matrix in which each row represents an induced TF and each column represents a cluster of differentially expressed genes (Figure 6). Both the TFs and clusters were divided into subsets that are induced or repressed under LPS stimulation, and ordered within these subsets based on the time of 25% differential expression under LPS (see Materials and Methods). Thus, the matrix is divided into quadrants; for example, the upper left quadrant contains connections between induced TF genes and induced clusters, and the lower-right quadrant contains connections between downregulated TF genes and downregulated clusters. The upper left and lower right quadrants contain primarily positive correlations, with most anti-correlated connections found in the upper right and lower left quadrants. In the upper left quadrant, the connections generally fall along an arc indicating the temporal sequence of TF gene activation. The anti-correlated “off arc” connections within this quadrant generally indicate the association between the falling edge of a transiently induced TF gene and the rising edge of a late-induced gene cluster. The only correlated “off arc” connections within this quadrant (Nfkb1→C28, and Junb→C11) have weak time-lagged correlation evidence, but a very significant motif scanning P value. In contrast, the downregulated gene clusters and TF genes are not as stratified as the upregulated clusters in terms of the time of differential expression, and thus associations appear throughout the lower-right quadrant.
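The ordering by time of 25% differential expression can be sketched as a first-crossing calculation. This version assumes linear interpolation between sampled time points and uses the absolute value of the differential expression so that repressed profiles are handled in the same way; both choices, and the function name, are assumptions made here.

```python
def response_time(times, values, fraction=0.25):
    """First time at which |value| reaches fraction * max(|value|)."""
    magnitudes = [abs(v) for v in values]
    target = fraction * max(magnitudes)
    if magnitudes[0] >= target:
        return times[0]
    for i in range(1, len(times)):
        if magnitudes[i] >= target:
            m0, m1 = magnitudes[i - 1], magnitudes[i]
            # Linear interpolation between the bracketing time points.
            w = (target - m0) / (m1 - m0)
            return times[i - 1] + w * (times[i] - times[i - 1])
    return times[-1]
```

Sorting TF genes and clusters by this value within the induced and repressed subsets yields the quadrant layout described above.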


Figure 6. Transcription factor genes associated with clusters in the inferred transcriptional network.

(A) The matrix shows associations between transcription factor genes and co-expressed gene clusters. Each column represents one of the 27 clusters within the inferred network, and each row represents one of the 36 transcription factor genes in the network. Clusters are ordered based on the LPS response time, defined as the time (under LPS stimulation) at which the cluster-median differential expression level reaches 25% of the maximum differential expression (see Materials and Methods, Expression Clustering). Transcription factor genes are ordered based on the LPS response time. The vertical gray line separates upregulated clusters (left half) from downregulated clusters (right half). The horizontal gray line separates upregulated transcription factors (top) from downregulated transcription factors (bottom). An orange or blue square indicates a statistically significant association between the transcription factor gene and the cluster, based on both promoter scanning and expression dynamics. An orange solid rectangle represents a positive average time-lagged correlation with genes in the cluster; a blue solid rectangle represents a negative average time-lagged correlation. (B) The red-green matrix is a heat-map showing transcription factor gene expression. The color indicates the normalized differential expression of the indicated transcription factor gene (over time), in LPS-stimulated wild-type macrophages (SDR, see Equation 1). Red indicates upregulation relative to unstimulated macrophages and green indicates downregulation. A diamond symbol indicates the transcription factor response time. (C) Each column of the red-green matrix indicates the median normalized differential expression of the genes in the indicated cluster (over time), in LPS-stimulated wild-type macrophages. The diamond indicates the average LPS response time of the genes within the cluster.

doi:10.1371/journal.pcbi.1000021.g006

The network of associations between TF genes and clusters (based on combined scanning and expression evidence) directly leads to hypotheses regarding TF regulation of clusters. For example, a statistical association between any of the TF genes Jun, Junb, or Fos and a cluster would suggest a hypothesis that the TF AP1 regulates that cluster. The network also recapitulates several known transcriptional regulatory interactions. First, the NFκB component Rel is associated with C15, which is enriched for cytokines and contains many NFκB targets including Nfkb1 [43], Il6, and Il12b [6]. Second, Jun, a component of AP1 (a regulator of stress response such as response to ultraviolet radiation or pathogenic insult [61]), is associated with C27, an early-upregulated cluster that is enriched for cell cycle-related genes and genes involved in the DNA damage response. Furthermore, C27 contains Egr1, which is a known target of AP1 under genotoxic stress conditions [61]. Third, IRF1 is strongly associated with the antiviral cluster C13, which contains the validated IRF1 target gene, Ccl5 [62]. The network also includes the TF genes Egr1 (a key regulator of LPS-induced cytokine signaling [63]) and Egr2 (implicated in adhesion and phagocytosis [64] as well as cell proliferation [65]) as regulators of C27. Finally, the TF gene Sfpi1 (PU.1) is associated with C17, an induced gene cluster enriched for endosome-associated genes (PU.1 over-expression is known to block viral escape from the endosome [66]).

Several interactions in the network were detected only through the integration of expression data with promoter scanning evidence. For example, based on scanning evidence alone, with an FDR of 0.1 (Pscan ≤ 0.0033), the association between Nfkb1 and C17 would not have been detected, but by including the effect of the strong TLCs between Nfkb1 and C17 genes, an association between Nfkb1 and C17 was detected. As a second example, the network includes an association between the TF gene Irf1 and cluster C25; based on promoter scanning evidence alone, only a general association of the IRF family with the cluster would have been possible (see Table 1).

In order to investigate the possible co-operative regulation of clusters by TFs in the network, protein interactions were obtained for human orthologs of protein units associated with the 36 TF genes shown in