The authors have declared that no competing interests exist.
Conceived and designed the experiments: JEM HS KJ RT AS. Analyzed the data: JEM ML KBV SLS MPSP. Contributed reagents/materials/analysis tools: KBV SLS MPSP. Wrote the paper: JEM HS KJ KBV.
The ability to examine the behavior of biological systems
Computational modeling aims to use mathematical and algorithmic principles to link the components of biological systems and predict system behavior. In the past, such models have described either a small set of carefully studied molecular interactions (e.g., proteins in signal transduction pathways) or larger abstract components (e.g., cell types or functional processes in the immune system). In this study we use data from global transcriptional analysis of neuroprotection in a mouse model of stroke to generate functional modules, groups of genes that act coherently to accomplish functions. We then derive equations relating the expression of these modules to one another, treat these equations as a closed system, and demonstrate that the model can simulate the gene expression of the system over time. Our work is novel in its use of global transcriptomic data to develop dynamic models of expression in an animal model. We believe that the models developed here will aid in understanding the complex dynamics of neuroprotection and provide ways to predict outcomes in terms of neuroprotection or injury. This approach will be broadly applicable to other problems and provides a way to build dynamic models from the bottom up.
Within the last decade, there has been slow but steady growth in the application of dynamic modeling to biological systems, including metabolic networks, regulatory networks, and signal transduction pathways. Mandel et al. (2004) provide an exemplary discussion of candidate techniques for modeling dynamic biological processes, with reference to an idealized representation of the lac operon.
Our approach introduces several novel elements to the dynamic modeling research described above. First, we derive our model structure and parameters from high-throughput global transcriptional data using a network inference method coupled with an optimization process. Second, we work with a set of regulatory relationships that are representative of an entire system. Finally, the computational model describes stroke and neuroprotection in a higher eukaryote, the mouse. Our resulting ODE-based model has several advantages over other modeling approaches including the inclusion of feedback loops, which are important for many biological processes, and the ability to predict the temporal patterns of gene expression. Our goal in adopting this approach is to model changes in expression levels of functional modules over time, and to assess the interactions between these functional modules as regulatory influences, in order to facilitate interactive simulations of neuroprotection during cerebral ischemia.
In this paper we describe an approach to generate dynamic systems models of networks of functional modules using predicted causal influences from temporal transcriptomics data (
1) Inferelator 1.1 infers a parsimonious set of potential regulatory influences whose expression maximally explains the expression of the target cluster, but does so independently for each cluster. 2) The actual structure of the inferred network takes into account that the regulators are themselves members of clusters and that the network structure is complex and cyclical. 3) The regulatory influence model can be represented by a regulatory influence matrix and used to simulate the closed system of ordinary differential equations over time. 4) The optimization process (see text) is used to improve the ability of the model to simulate the system over time (i.e. to calibrate it to temporal data). 5) The resulting optimized model retains much of the structure of the initial model.
Gene expression data obtained from microarrays (Affymetrix) run on mouse blood was utilized for this study (see
To establish the actors in the model with predictable expression we first defined functional modules from the set of significantly changing probes (see
To ascertain a reasonable number of clusters to consider in our model abstraction, we calculated the normalized functional modularity (Y axis) for varying numbers of clusters (X axis) from the same hierarchical tree derived from the expression data. Functional modularity was defined as the number of genes annotated with a biological process gene ontology category that was functionally enriched in the gene's parent cluster with a p-value less than the indicated threshold (colored lines). The results show that 25 clusters provides a peak of functional modularity, especially for the more coherent functional categories with lower p-values.
To learn the parameters in a model system of ODEs that relate the expression levels of clusters to one another we applied a modification of the original Inferelator
The initial steady-state model can be used to predict the expression of target clusters given the known expression levels of the input regulatory influences (in this case the regulatory influences are other clusters). Although the Inferelator is capable of incorporating time course data explicitly in its inference process, we chose to treat data from each time point as a steady-state measurement since the time steps were relatively long and of variable length. We assessed the performance of the model using a cross-validation approach in which multiple models are trained on subsets of the data, by leaving out the data corresponding to a treatment time course, and then evaluated on the excluded time course data as previously described
We were interested in determining if the initial steady-state model generated by the Inferelator could be used as the basis for dynamic simulation to describe how the expression levels of all clusters in the model change over time. Accordingly we transformed the model into a matrix of coefficients in which the rows refer to regulatory influences and the columns refer to their targets. This matrix is the basis for a linear system of ODEs that can be solved in closed form for specific parameter choices, and more generally using standard ODE solvers. The resulting dynamic model was used to simulate the expression levels of the target clusters in the system given only one known input variable, which was the initialization state from the first time point in each preconditioning treatment. The simulated expression profile was then compared to the observed expression profile for that preconditioning treatment for each cluster at multiple times using correlation as a basis of comparison as above. We used a correlation measure for optimization, as opposed to a more standard measure such as the root mean square deviation (RMSD), because we are more interested in capturing the pattern of expression rather than the magnitude of the fold-change in expression. We found that this dynamic model using only the Inferelator-derived structure and parameters yielded a correlation with the observed data of only 0.36 for LPS, and performed similarly poorly for the other time courses (
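As a concrete sketch of this simulation step, the following assumes an influence matrix W with regulators on rows and targets on columns, as described in the text; forward-Euler integration stands in for a standard ODE solver, and the two-cluster system is a toy example:

```python
import numpy as np

def simulate(W, x0, n_steps, dt=0.1):
    """Forward-Euler integration of the linear system dx/dt = W^T x,
    where rows of W are regulators and columns are their targets."""
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(n_steps):
        x = x + dt * (W.T @ x)
        traj.append(x.copy())
    return np.array(traj)

def pattern_correlation(simulated, observed):
    """Pearson correlation of the flattened profiles; compares the
    pattern of expression rather than its magnitude."""
    return float(np.corrcoef(np.ravel(simulated), np.ravel(observed))[0, 1])

# Toy two-cluster system: cluster 0 activates cluster 1; cluster 1 represses cluster 0.
W = np.array([[0.0, 0.5],
              [-0.3, 0.0]])
traj = simulate(W, x0=[1.0, 0.0], n_steps=50)
```

Given only the initialization state x0, the full trajectory of all clusters is produced and can then be scored against observed profiles by correlation.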
Pretreatment | Inferelator | Dynamic | Optimized | BestCross
LPS | 0.78 | 0.36 | 0.71 | 0.48
CpG | 0.80 | 0.26 | 0.80 | 0.56
Preconditioning | 0.82 | −0.32 | 0.67 | 0.24
Saline | 0.92 | 0.66 | 0.79 | 0.51
Sham | 0.81 | −0.34 | 0.71 | 0.10
Inferelator: performance of the Inferelator-based model in steady-state prediction.
Dynamic: performance of the Inferelator-based model in dynamic prediction.
Optimized: performance of the best optimized model for that pretreatment.
BestCross: performance of the best model optimized to another pretreatment.
Though the performance of the dynamic model using parameters derived by Inferelator 1.1 was moderate, we were interested in determining if the model contained useful information, in the form of its coefficients and/or structure that could be used as a basis for further dynamic model development. Two aspects of the model are critical in determining how accurately it simulated temporal profiles: its structure in terms of the pattern of regulatory influences between clusters, and the coefficients of each of these regulatory influences that form the system of ODEs. We investigated the information content of the coefficients of the model as well as the structure of the model by performing several different randomizations of the existing model. We first preserved the structure of the model but perturbed the coefficients for each edge (nonzero values) by randomly resampling all nonzero values from the initial matrix. We calculated the mean (0.087) and standard deviation (0.167) of the correlation measures for 100 such resampled matrices and found that the performance of the initial matrix (0.36) was significantly different from the performance of the model using the randomized matrices (p-value<0.05), indicating that the model contains significant predictive value. To examine this result further we resampled coefficients in the model from a uniform distribution for nonzero values, which produced similar results. We next investigated the structure of the network by randomly permuting all values in the matrix. This process creates a new structure of regulatory influences between the clusters, but preserves the overall number of such influences between clusters as well as the distribution of coefficients. 
The mean (0.029) and standard deviation (0.149) of the correlation measures using 100 such scrambled matrices indicate that the model using the initial matrix had significantly different performance than models using scrambled matrices (p-value<0.02), suggesting that the initial model derived using the ODEs from Inferelator contains significant value in the form of its structure as well. These results are summarized in
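The two randomization schemes and the empirical significance test can be sketched as follows; the matrix values are toy data, and the pseudo-count convention in the p-value is an assumption on our part:

```python
import numpy as np

def resample_nonzero(W, rng):
    """Preserve the edge structure; resample each nonzero coefficient
    from the pool of nonzero values in the original matrix."""
    W2 = W.copy()
    nz = W2 != 0
    W2[nz] = rng.choice(W[nz], size=int(nz.sum()), replace=True)
    return W2

def scramble(W, rng):
    """Destroy the structure; randomly reassign all matrix entries while
    preserving the overall distribution of coefficients."""
    flat = W.ravel().copy()
    rng.shuffle(flat)
    return flat.reshape(W.shape)

def empirical_p(observed_score, null_scores):
    """One-sided empirical p-value with a +1 pseudo-count."""
    null = np.asarray(null_scores)
    return (np.sum(null >= observed_score) + 1) / (len(null) + 1)

rng = np.random.default_rng(0)
W = np.array([[0.0, 0.4, 0.0],
              [-0.2, 0.0, 0.7],
              [0.0, -0.5, 0.0]])
```

In practice each randomized matrix would be simulated as above and scored by correlation, giving the null distribution passed to `empirical_p`.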
The Inferelator-based influence model was treated as a system of ODEs and simulated over time. Expression levels of the simulation were compared with observed values of expression by correlation (Y axis) for the initial model (red dot) or for 100 randomized matrices. The matrices were randomized by replacing all non-zero values with other non-zero values (resampled) or from a uniform distribution (uniform), or the locations of all values in the matrix were reassigned (scramble). The results (as a box and whiskers plot) show that the Inferelator-based initial matrix produces simulation over time with a performance that is significantly better than that using random permutations of the matrix.
To improve the performance of the dynamic model we employed simulated annealing to optimize the initial matrix against observed patterns of expression from the data. Similar regulatory network model parameter estimation has been accomplished using a Metropolis-Hastings Monte Carlo approach in a Bayesian formulation
We optimized the initial matrix against each of the time courses from individual pretreatments separately, and show the results from the LPS-optimized matrix here (all results are presented in
The two-stage simulated annealing (SA) process described was applied using the Inferelator-based model as the starting matrix. The performance (Y axis) of each of the 25 models in each stage is shown over the steps (X axis) in the SA process. The results show that the optimization process can dramatically improve the performance of the initial model.
We examined both the significance of the performance obtained in the optimized model and the sensitivity of the optimization result to the initial matrix for model optimization to each pretreatment. For the former we randomized the optimized model as described above for the initial model. Results indicate how significant the performance of the optimized model is relative to models with randomized coefficients and randomized structures. We first examined the performance of the best model relative to scrambled and randomized versions of the same matrix (as above), resampling coefficients but preserving model structure (resampled and uniform random in
The best optimized model was simulated over time to provide predictions of expression levels for clusters. Correlation (Y axis) of the simulated versus observed data is shown for the best optimized model (red dot) and for 100 randomized matrices (boxes). The matrices were randomized by replacing all non-zero values with other non-zero values (resampled) or from a uniform random distribution (uniform), or the locations of all values in the matrix were reassigned (scramble). The results (as a box and whiskers plot) show that the optimized model is capable of simulation over time with a performance that is significantly better than randomized versions.
The expression patterns predicted by the LPS-optimized model are shown in
The LPS-pretreatment observed (black lines) and predicted (green lines) expression patterns for clusters containing more than five genes are shown. Expression patterns from a randomized consensus model (red lines) are also shown. The Y axes indicate the log2 fold-change in expression for the observed pattern; the predicted and random expression patterns were scaled to this range. The X axis shows time from 0 to 100 hours post-pretreatment.
Our approach generates an ensemble of 25 models. We wanted to compare these models and determine if a combined model might give better performance than individual models. Accordingly, we assessed the variability in the model structures in the ensembles by counting the numbers of times each edge is represented in the ensemble. We found that there is a large degree of concordance between the models in terms of their structure, but that some edges seem to be present in all models whereas other edges are represented in only a subset of models. The distribution of edge counts is shown in
To examine the performance of combined models we took edge weights of the combined model as either the mean or median of weights from the ensemble. Either of these approaches resulted in models with performance that was representative of the performances in the ensemble for the time course used in the optimization, but not better than the best model in the ensemble. We examined performance of these models both on the time course used for optimization as well as the other time courses (cross-validation) with similar results (see
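A consensus model of this kind can be sketched as follows; here the combined weight averages over all models in the ensemble, including zeros from models lacking the edge, which is one of several reasonable conventions:

```python
import numpy as np

def consensus(matrices, min_fraction=0.5, combine=np.mean):
    """Combine an ensemble of influence matrices into a single model.

    An edge is kept only if it is nonzero in at least min_fraction of
    the ensemble; its weight is combine() over all models."""
    stack = np.stack(matrices)            # shape (n_models, n, n)
    counts = (stack != 0).sum(axis=0)     # per-edge occurrence counts
    keep = counts >= min_fraction * len(stack)
    return np.where(keep, combine(stack, axis=0), 0.0), counts

# Toy two-model ensemble: one shared edge, one edge present in a single model.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0, 3.0], [0.5, 0.0]])
model, counts = consensus([A, B])
```

Passing `combine=np.median` gives the median-weight variant discussed in the text.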
To test for sensitivity to the initial matrix, we optimized using randomly selected initial models. This was to determine how important the initial model provided by the Inferelator is to the final optimized performance. We repeated the optimization process starting with 25 matrices with the same structure as the original matrix, but with the weights randomized. After the two rounds of optimization we found that the mean performance of the model using the 25 final matrices was slightly lower than those produced by optimization using the original initial matrix (mean 0.60 versus 0.66, respectively, p-value 2e-7 by t-test) indicating that the initial weights are important to the final outcome of the optimization. This shows that the optimization works well even when starting with a randomized model, but performs significantly better when the initial model is provided by the Inferelator.
We next examined the question of whether the structure of the initial model was important to the outcome of the optimization process. We randomized the structures 25 times by randomly permuting the initial weights in the matrix and proceeded with the optimization process. After the second round of optimization we found that the performance of the 25 final matrices was significantly worse than those produced by optimization of the original matrix (0.62 versus 0.66, p-value 5e-3) indicating that the initial structure provided by the Inferelator is very important to the outcome of the process. Though the optimization process itself allows restructuring of the model, it is unlikely that large-scale restructuring will take place since individual changes (addition or removal of an edge) are evaluated individually. Thus modifications to the initial structure of the model are expected to be conservative (see Conclusions).
Thus far, to evaluate the optimized dynamic model, we applied it to the same data that had been used for the optimization process to determine whether it was successfully predictive. However, this can result in overstatement of results due to overfitting data. To more rigorously evaluate the performance of the dynamic model we determined the performance of the best model optimized against the LPS preconditioning time course on the dataset for each of the other conditions. The results of this analysis indicate that the model can provide reasonable predictions of behavior in the LPS (correlation 0.74), CpG-ODN (0.41, p-value 0.01) and saline (0.39, p-value 0.004) time courses but that it fails to accurately predict behavior under the brief ischemia preconditioning treatment (correlation 0.01) and sham treatment. To assess the similarity between responses in each treatment we calculated the correlation between gene expression ratios from all differentially expressed genes at each comparable time point from different treatments, and report the results as the mean correlation between treatments (
Treatment | LPS | CpG | Pre | Saline | Sham
LPS | — | 0.617 | −0.039 | 0.394 | 0.700 |
CpG | 0.617 | — | 0.113 | 0.592 | 0.754 |
Preconditioning | −0.039 | 0.113 | — | 0.063 | 0.049 |
Saline | 0.394 | 0.592 | 0.063 | — | 0.689 |
Sham | 0.700 | 0.754 | 0.049 | 0.689 | — |
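The treatment-similarity values above were obtained by correlating expression ratios at comparable time points and averaging. A minimal sketch of that computation, assuming each treatment is an array of log ratios with shape (genes, time points); the array names and toy data are illustrative:

```python
import numpy as np

def mean_timepoint_correlation(ratios_a, ratios_b):
    """Mean Pearson correlation between two treatments, taken over
    matched time points across all differentially expressed genes.

    ratios_a, ratios_b: arrays of shape (n_genes, n_timepoints)
    holding log expression ratios for the same genes."""
    cors = [np.corrcoef(ratios_a[:, t], ratios_b[:, t])[0, 1]
            for t in range(ratios_a.shape[1])]
    return float(np.mean(cors))

rng = np.random.default_rng(0)
a = rng.normal(size=(50, 5))  # toy "treatment": 50 genes, 5 time points
```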
Simulation Performance
Cluster | Functional label | Gene Count | LPS | CpG | IP | Sham | Saline
cluster_1 | apoptosis | 587 | 0.57 | 0.49 | −0.49 | 0.50 | −0.43 |
cluster_2 | hemopoiesis | 275 | 0.93 | 0.71 | 0.09 | −0.49 | −0.47 |
cluster_3 | cell migration/blood coagulation | 1324 | 0.80 | −0.76 | −0.38 | −0.42 | 0.63 |
cluster_4 | cell division/defense response | 934 | 0.61 | 0.22 | −0.41 | 0.79 | −0.80 |
cluster_5 | metabolic process | 824 | 0.79 | 0.15 | 0.85 | −0.57 | 0.39 |
cluster_6 | cell differentiation | 280 | 0.77 | −0.33 | 0.70 | −0.17 | −0.61 |
cluster_7 | inflammatory response | 58 | 0.75 | 0.81 | 0.87 | −0.59 | 0.34 |
cluster_8 | cell differentiation | 84 | 0.82 | 0.64 | −0.47 | 0.90 | −0.39 |
cluster_9 | NK cell/leukocyte mediated immunity | 140 | 0.08 | −0.16 | 0.19 | 0.82 | 0.08 |
cluster_10 | blood coagulation | 410 | 0.88 | 0.91 | 0.32 | 0.80 | −0.55 |
cluster_11 | inflammatory response | 637 | 0.79 | 0.75 | 0.54 | −0.09 | 0.43 |
cluster_12 | mitosis | 510 | 0.88 | 0.74 | 0.08 | 0.70 | −0.54 |
cluster_13 | innate immune response | 474 | 0.10 | 0.27 | −0.54 | 0.47 | −0.35 |
cluster_14 | inflammatory response | 129 | 0.97 | 0.86 | 0.73 | −0.56 | 0.50 |
cluster_15 | apoptosis | 288 | 0.97 | 0.99 | 0.64 | 0.50 | −0.32 |
cluster_16 | cell differentiation | 84 | 0.04 | 0.09 | 0.43 | −0.48 | 0.40 |
cluster_17 | development | 226 | 0.75 | −0.18 | 0.96 | 0.67 | 0.47 |
cluster_18 | response to stimulus | 52 | 0.57 | 0.43 | 0.51 | −0.45 | 0.40 |
cluster_19 | immune response | 22 | 0.79 | 0.81 | −0.52 | 0.75 | −0.45 |
cluster_20 | - | 4 | 0.32 | 0.61 | −0.44 | 0.86 | −0.08 |
cluster_21 | - | 5 | 0.97 | 0.65 | 0.47 | −0.08 | 0.53 |
cluster_22 | - | 1 | 0.53 | −0.25 | 0.47 | −0.75 | 0.30 |
cluster_23 | - | 1 | 0.95 | 0.38 | 0.59 | −0.42 | 0.53 |
cluster_24 | - | 1 | 1.00 | 0.74 | 0.02 | 0.63 | −0.19 |
cluster_25 | - | 1 | 0.45 | 0.68 | −0.37 | 0.93 | −0.24 |
IP, ischemic preconditioning.
The simulated and observed expression values from the LPS-optimized model are shown in
Relative expression levels (log2 fold change expression) are plotted over time (X axis) for the predicted (green line) and observed (black line) expression levels for the indicated cluster. Several representative clusters are shown and the remaining plots are included as
Although our inferred and optimized model seems to be consistent with existing data, at least within the limitations of the results presented above, we were concerned that this could be due to overfitting of the data and/or to our model being one of many possible consistent models that is not necessarily biologically relevant. Our derived model does not make specific predictions of gene-to-gene interactions or influences, but rather relates the general expression patterns of clusters of genes. Therefore, to examine the consistency of this model with data from external sources we used the following strategy. We used interaction data from four independent data sources: regulatory binding site interactions from chromatin immunoprecipitation (ChIP) experiments
Dataset | Accuracy |
ChIP | 60.9%
PPI | 66.8% |
Regulatory | 64.3% |
Functional | 64.3% |
Any | 82.0% |
These results show that each independent dataset has a modest correspondence with our inferred model ranging from 61–67%. Each of these different sources of interaction data is limited, either by the coverage it provides or by the amount of accuracy it might have. Therefore we evaluated the maximum accuracy obtainable by combining results from each interaction dataset by simply counting a match as a true positive or true negative if it was validated by any interaction dataset. While this method would not be appropriate to evaluate the prediction accuracy since it would be impossible to choose
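The union-based accuracy (the "Any" row of the table above) can be sketched as follows; the edge and dataset names here are hypothetical toy data:

```python
def union_support(edge_support):
    """Fraction of predicted edges corroborated by at least one
    interaction dataset (the union across datasets)."""
    supported = sum(1 for datasets in edge_support.values() if datasets)
    return supported / len(edge_support)

# Hypothetical toy data: three predicted edges, two corroborated.
edges = {
    ("cluster_7", "cluster_11"): {"PPI", "ChIP"},
    ("cluster_2", "cluster_5"): {"Functional"},
    ("cluster_3", "cluster_9"): set(),
}
```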
We also found that the number of protein-protein interactions linking genes inside a cluster was highly significant, with most clusters having a p-value of 0 (out of the 1000 random counts; see
A key question in neuroprotection and stroke concerns the regulatory differences between injurious conditions (no pretreatment) and non-injurious conditions (neuroprotective pretreatment). By grouping like conditions together we can gain insight into these differences. We did this by assessing the concordance of models generated for each set of injurious conditions (saline and sham treatment) and models generated for each set of non-injurious conditions (LPS, CpG or brief ischemic pretreatment). An edge was considered to be present in the final injurious or non-injurious model if it was present in 50% or more of the models from the member conditions of either group (that is, one or two models in the injurious set and two or three in the non-injurious set). The weight of the final edge was taken as the mean of weights from the sets, with the primary goal of assessing differences in the weights between injurious and non-injurious models, either in terms of presence/absence of an edge or a reversal in function from activation to repression or vice versa. A comparison of the resulting models is shown in
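Identifying edges that differ between two consensus models, either absent in one or opposite in sign, can be sketched as follows (the matrices are toy examples):

```python
import numpy as np

def differing_edges(w_injurious, w_protective):
    """Edges that differ between two consensus models: present in only
    one model, or present in both with opposite sign."""
    in_a, in_b = w_injurious != 0, w_protective != 0
    only_one = in_a ^ in_b
    sign_flip = in_a & in_b & (np.sign(w_injurious) != np.sign(w_protective))
    rows, cols = np.where(only_one | sign_flip)
    return set(zip(rows.tolist(), cols.tolist()))

# Toy matrices: edge (0,1) flips sign; edge (1,0) is present in only one model.
w_inj = np.array([[0.0, 0.4], [-0.2, 0.0]])
w_pro = np.array([[0.0, -0.3], [0.0, 0.0]])
diffs = differing_edges(w_inj, w_pro)
```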
Clusters containing more than 5 genes are shown (green squares) as determined by our functional clustering approach. The influences between clusters are shown as directed edges with red arrows indicating a positive influence (activation) and blue T lines indicating a negative influence (repression). Dashed lines indicate relationships that are significantly different between injurious and non-injurious conditions, either absent in one or opposite sign. General functional categories chosen from statistically enriched functions are indicated in grey boxes.
In this paper we demonstrate how a steady-state model derived from high-throughput data that describes inferred relationships between regulatory influences and their targets can be used to produce dynamic simulation models capable of making accurate predictions over extended periods of time. We explored the ability of these models to predict expression under conditions not used for optimization and found the predictive ability, while limited, to be significant. This work represents an advancement in the application of dynamic models that are based on high-throughput transcriptional data sampled at low temporal resolution. The model depends on considering groups of coexpressed and functionally related genes as both the targets of regulatory influences and as the originators of these influences. This has the advantage of creating a simple closed model for which the transcriptional levels of all regulatory influences are predicted and allows the model to function as a system of ODEs that can be solved using standard tools. The steady-state model produced by the original Inferelator requires input of observed expression levels of the regulatory influences to make predictions, limiting its utility for making novel predictions. That is, at a given time or in a given condition, the regulators must be measured in order for the gene expression values of the functional modules to be predicted by the steady-state model. Although our approach requires using the observed data for initial model inference, the resulting dynamic model is capable of simulating other treatment time courses given only the initial state of the system, or extrapolating to further time points not measured. Additionally, the dynamic model couples ODEs used in the initial steady-state model, allowing explicit temporal evolution of the system. 
Calibrating the dynamic model against observed expression levels produces significant results, indicating that the Inferelator-derived model provides a good starting point for further optimization toward a better model.
Using our optimized dynamic model, we found that the relationships between regulatory influences and their putative target clusters can be used to simulate the system over time with statistically significant performance. Our modeling approach represents a way to derive dynamic models from steady-state static models that represent an abstraction of the system from high-throughput transcriptional data. Previous efforts similar to this have focused on very detailed models (for example
One caveat of our results is that the optimized model does not perform as well on data that was not used for the optimization. In general the predicted patterns of expression for the evaluation time courses (see, for example,
We chose to use correlation to compare the predicted expression patterns with observed patterns. Previously, the root mean square deviation (RMSD) or similar error measures have been used to evaluate performance on similar problems. Unlike measures of error, correlation does not account for similarity in the magnitude of the two vectors being compared; rather, it simply compares the relative patterns of expression. For this study we were more interested in getting the overall pattern of expression correct than in matching absolute magnitude, which in this case corresponds to the fold-change in expression relative to baseline. Models optimized using RMSD as the fitness criterion reached a mean normalized RMSD (error as a percentage of the expression range) of approximately 30%, which is not exceptional. Evaluating these RMSD-optimized models using correlation revealed that they had essentially the same performance as the un-optimized model, and many clusters showed trivial behavior, monotonically increasing or decreasing (data not shown). For some practical applications of these models a correlation-based approach would be insufficient. For example, when predicting expression levels that are associated with a phenotype that manifests as differences in magnitude between outputs under different conditions, an error measure would need to be used, potentially combined with correlation to ensure that patterns of expression are accurate.
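The distinction between the two measures is easy to demonstrate: a prediction with the correct pattern but the wrong magnitude has perfect correlation yet a large normalized RMSD. A minimal sketch with toy values:

```python
import numpy as np

def normalized_rmsd(predicted, observed):
    """RMSD expressed as a fraction of the observed expression range."""
    rmsd = np.sqrt(np.mean((predicted - observed) ** 2))
    return float(rmsd / (observed.max() - observed.min()))

def correlation(predicted, observed):
    """Pearson correlation: sensitive to pattern, blind to magnitude."""
    return float(np.corrcoef(predicted, observed)[0, 1])

observed = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
predicted = 5.0 * observed  # right pattern, wrong fold-change magnitude
```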
An important limitation of the current model is that the experiments all include a significant disruption of the system in the middle of the time course in the form of the ischemic stroke induced at hour 72. Our dynamic model does not explicitly incorporate this event, which dramatically alters the regulation of gene expression. The effect of the stroke on gene expression is captured in this experiment, but is only poorly understood. Therefore, using the current dataset including stroke as an input to the model is not possible. An implicit assumption in our modeling approach is that the relationships between regulators and targets are fixed. The Inferelator approach and our dynamic ODE optimization process both aim to define these relationships based on existing data. The optimization process therefore learns the stroke stimulus from the data. We can predict the effects of changing early expression of particular clusters on the eventual output (ischemic injury) given our current model, but further experiments would be needed to validate these predictions. An important next step for model construction will be to consider the disruption in the model explicitly.
The time course experiments used in this analysis all had five time points. This is a limited number of points to parameterize the relationships between 25 clusters. Combining networks to create injurious and protected networks alleviates this limitation to a certain extent in terms of confident regulatory relationships, but the resulting models should be considered to be underdetermined. Including more data points in these models will be necessary to strengthen these models. Finally, a significant limitation of our models is that the clusters used as functional modules are still quite large and are likely to perform multiple functions. This fact prevents inference of mechanistic relationships between regulators and functional processes and pathways that would be desirable for this kind of approach. However, our validation results indicate that the functional modules defined are fairly coherent and supported by external data sources. We believe that the inclusion of more data points, which would allow parameterization of larger models, will allow the use of smaller, more focused clusters in the model. Additionally, development of methods to better delineate functionally coherent modules
The specific successes and failures of the dynamic model may provide important information on the role of certain gene clusters in neuroprotection and stroke. For example, the model shows strong predictive value for gene expression in clusters 14 and 15 for LPS, CpG-ODN, and brief ischemic preconditioning but has low predictive values for saline and sham (
Comparison of models derived from the neuroprotective and injurious states revealed that a number of predicted edges seemed to be different, either in presence and absence of an edge or in the sign of the weight of the edge (
We have shown how transcriptomic data can be used to construct a dynamic model of gene expression at the level of functional modules in a largely automated way without the use of any prior knowledge about the system. The model is capable of accurately predicting the expression levels of component modules given only an initial starting state for key functional clusters. Our study also identifies several caveats and limitations that remain to be addressed, either through addition of more detailed expression data to our existing process, or by refinement of our methods. To our knowledge this is the first application of a modeling approach like this to data from a complex disease process in a vertebrate organism, specifically a mouse model of ischemia. Our approach provides a way to formulate prototype dynamic models from high-throughput data that can provide valuable insight into disease processes.
Microarray data were obtained from a transcriptional study of a mouse model of neuroprotection during stroke
Hierarchical clustering was used to define functional modules from the filtered transcriptional data using the hclust command in the R statistical software (
A functional coherence score was calculated by counting the number of genes appearing in at least one functional category that was enriched at a given threshold of significance over all clusters in that set. The pseudocode for the algorithm is as follows:
Use hierarchical clustering to construct a tree of gene expression profiles
For each clustering division:
    For each cluster present at this division:
        Calculate functional enrichment in this cluster versus all other genes
        Identify functional labels that are significantly enriched in the cluster versus all other genes at a specified level of significance
        Count the number of genes annotated with at least one significant label (Gc)
    Functional coherence for this division is the sum of Gc over all clusters in that division
This metric provides a measure of functional modularity that is easy to interpret and can be adjusted by varying the significance threshold.
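As a runnable illustration of the pseudocode above, the coherence score for one clustering division can be computed as follows. The hypergeometric enrichment test is an assumption (the passage does not name a specific test), and the clusters and annotations are toy stand-ins for real functional categories:

```python
from scipy.stats import hypergeom

def coherence_score(clusters, annotations, alpha=0.05):
    """Functional coherence for one division: over all clusters, count genes
    annotated with at least one label significantly enriched in their cluster.

    clusters    -- dict mapping cluster id -> set of gene names
    annotations -- dict mapping functional label -> set of gene names
    """
    all_genes = set().union(*clusters.values())
    total = 0
    for genes in clusters.values():
        enriched = set()
        for label, members in annotations.items():
            k = len(genes & members)  # genes in this cluster carrying the label
            # Hypergeometric upper tail: P(X >= k) given cluster and label sizes
            p = hypergeom.sf(k - 1, len(all_genes), len(members), len(genes))
            if p < alpha:
                enriched |= members
        total += len(genes & enriched)  # Gc for this cluster
    return total

# Toy example: one cluster exactly matching a label, one unannotated cluster.
clusters = {1: {"g1", "g2", "g3"},
            2: {"g4", "g5", "g6", "g7", "g8", "g9", "g10"}}
annotations = {"lipid metabolism": {"g1", "g2", "g3"}}
print(coherence_score(clusters, annotations))  # -> 3
```

Raising the significance threshold `alpha` admits more labels as "enriched" and inflates the score, which is the adjustability the text refers to.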
The Inferelator
To derive an initial model for dynamic simulations, the coefficients for each regulatory influence were averaged over the five evaluation models, and influences inferred in fewer than three models were excluded. Regulatory influences with coefficients having an absolute value less than 0.1 were also excluded. This eliminates regulatory influences with low impact on the final model and limits the final number of regulatory influences in the model.
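This consensus-and-filter step can be sketched with NumPy; the five matrices below are random stand-ins for the Inferelator evaluation models, and averaging over all five models (zeros included) is our reading of the text:

```python
import numpy as np

def consensus_matrix(models, min_support=3, min_weight=0.1):
    """Average regulatory coefficients over evaluation models, keeping only
    influences inferred in at least `min_support` models and whose mean
    absolute weight is at least `min_weight`."""
    stacked = np.stack(models)                   # shape: (n_models, n, n)
    support = np.count_nonzero(stacked, axis=0)  # models inferring each edge
    mean = stacked.mean(axis=0)
    mean[support < min_support] = 0.0
    mean[np.abs(mean) < min_weight] = 0.0
    return mean

models = [np.array([[0.0, 0.5], [0.2, 0.0]]),
          np.array([[0.0, 0.6], [0.0, 0.0]]),
          np.array([[0.0, 0.4], [0.0, 0.0]]),
          np.array([[0.0, 0.5], [0.3, 0.0]]),
          np.array([[0.0, 0.0], [0.0, 0.0]])]
W = consensus_matrix(models)
# Edge (0,1) is inferred in 4 models with mean weight 0.4 -> kept;
# edge (1,0) is inferred in only 2 models -> dropped.
print(W)
```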
Our main model for the time evolution of gene cluster expression can be described as follows. Suppose we have
Initial values for the system were taken to be the expression value of each cluster at the 3 h time point of the corresponding time course.
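The model equation itself is elided in this passage, but a simulation of this kind can be sketched as a first-order linear update in which the matrix of cluster-to-cluster regulatory weights advances cluster expression between time points. The weights, step size, and initial state below are illustrative assumptions, not the paper's fitted values:

```python
import numpy as np

def simulate(W, x0, n_steps, dt=0.1):
    """Iterate x(t + dt) = x(t) + dt * W @ x(t), recording each state.

    W  -- (n, n) matrix of cluster-to-cluster regulatory weights
    x0 -- initial cluster expression values (e.g. the first time point)
    """
    traj = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        x = traj[-1]
        traj.append(x + dt * W @ x)
    return np.stack(traj)

# Two clusters: cluster 0 decays on its own, cluster 1 is driven by cluster 0.
W = np.array([[-0.5,  0.0],
              [ 0.8, -0.2]])
traj = simulate(W, x0=[1.0, 0.0], n_steps=50)
print(traj[-1])
```

Given only an initial state and the weight matrix, the loop produces the full simulated trajectory that is compared against the observed expression time course.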
To optimize the initial matrix for dynamic simulation over time, we employed a standard simulated annealing approach carried out in two stages. In the first stage, 25 independent simulated annealing runs of 5000 steps each were performed using the initial matrix
The second stage of optimization was performed in the same manner as the first to generate 25 additional models, but was allowed to proceed for 25000 steps. The initial probability of accepting deleterious perturbations was set to 10%, with all other parameters remaining the same as in the first stage. This stage allowed the simulation to explore the variable space around the best performing matrix from the first stage in order to refine the model. The best performing matrix from this stage was chosen as the final model. A third stage of optimization was found to provide no improvement to the best models and is not included in the results (data not shown).
For simulated annealing optimization the fitness function was the performance of the simulation, defined as the correlation between the simulated and observed expression levels, for each pretreatment time course. Initial values for the simulation were taken as the 3 h time point from the corresponding time course. The final model was then evaluated for its performance on the other time courses using the 3 h time point from the other time courses as initial model values.
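A minimal version of this annealing loop is sketched below: a single run with far fewer steps than the study's 5000, a correlation-based fitness as described above, and a Gaussian single-entry perturbation with a linearly decaying acceptance probability (the perturbation scheme and cooling schedule are simplifying assumptions):

```python
import numpy as np

def fitness(W, observed, dt=0.1):
    """Correlation between simulated and observed trajectories, starting
    the simulation from the first observed time point."""
    x = observed[0].astype(float)
    sim = [x]
    for _ in range(len(observed) - 1):
        x = x + dt * W @ x
        sim.append(x)
    return np.corrcoef(np.stack(sim).ravel(), observed.ravel())[0, 1]

def anneal(W, observed, steps=2000, p_accept0=0.25, seed=0):
    """Simulated annealing over the non-zero matrix entries; deleterious
    moves are accepted with a probability that decays over the run."""
    rng = np.random.default_rng(seed)
    best, best_fit = W.copy(), fitness(W, observed)
    cur, cur_fit = best.copy(), best_fit
    nz = np.argwhere(W != 0)                 # preserve model structure
    for step in range(steps):
        cand = cur.copy()
        i, j = nz[rng.integers(len(nz))]
        cand[i, j] += rng.normal(scale=0.05)  # perturb one weight
        f = fitness(cand, observed)
        p = p_accept0 * (1 - step / steps)    # linear cooling (assumption)
        if f > cur_fit or rng.random() < p:
            cur, cur_fit = cand, f
            if f > best_fit:
                best, best_fit = cand.copy(), f
    return best, best_fit

# Toy target generated by a known matrix, then recovered from a perturbed start.
W_true = np.array([[-0.4, 0.0], [0.6, -0.3]])
obs = np.stack([np.linalg.matrix_power(np.eye(2) + 0.1 * W_true, k) @ [1.0, 0.2]
                for k in range(6)])
W0 = W_true + np.array([[0.3, 0.0], [-0.3, 0.2]])
W_opt, f_opt = anneal(W0, obs)
print(round(f_opt, 3))
```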
To evaluate the significance of the results obtained by optimization, we compared the performance of the optimized model to that of models built from randomly perturbed matrices, as was done with the initial matrix prior to optimization. To study the dependence of the optimized model on the initial matrix, the matrix was perturbed randomly and the simulated annealing process described above was repeated on the perturbed matrices. To assess the dependence of performance on the initial Inferelator-derived values for the matrix, non-zero values in the matrix were randomly permuted (referred to here as resampled) such that the structure of the model was preserved. This was repeated 100 times and a p-value was calculated from the distribution of performance values for the resampled matrices. A second test, in which random numbers drawn from a uniform distribution between −2 and 2 were assigned to the non-zero values in the matrix (uniform), was also used to assess the impact of the initial values. Finally, the dependence of performance on the structure of the model was assessed by randomly permuting all values in the matrix (scramble). These approaches were also used to evaluate the significance of the performance of the final optimized matrix.
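The resampling test can be sketched as follows: non-zero weights are permuted among the non-zero positions so that the model structure is preserved, and a one-sided empirical p-value is computed from the resulting performance distribution. The toy performance function below (closeness to a reference matrix) is a stand-in for the simulation-correlation fitness:

```python
import numpy as np

def resample_pvalue(W, performance, n_perm=100, seed=0):
    """Empirical p-value: fraction of structure-preserving permutations of
    the non-zero weights that perform at least as well as the real matrix."""
    rng = np.random.default_rng(seed)
    nz = W != 0
    real = performance(W)
    hits = 0
    for _ in range(n_perm):
        perm = W.copy()
        perm[nz] = rng.permutation(W[nz])  # shuffle values, keep structure
        if performance(perm) >= real:
            hits += 1
    return (hits + 1) / (n_perm + 1)       # add-one empirical p-value

# Toy performance: negative total deviation from a reference matrix, so the
# reference itself scores best and shuffled versions almost always score worse.
W_ref = np.array([[0.0,  0.7, -0.5],
                  [0.3,  0.0,  0.9],
                  [-0.2, 0.4,  0.0]])
score = lambda W: -np.abs(W - W_ref).sum()
p = resample_pvalue(W_ref, score)
print(p)
```

The uniform and scramble tests from the text differ only in the null model: drawing fresh values from U(−2, 2) for the non-zero positions, or permuting all matrix entries including zeros.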
To validate the cluster-to-cluster relationships in our model we used edges from four different sources. Regulatory interactions (regulator-to-target relationships) derived from ChIP experiments were obtained from the ChEA database
We counted the number of interactions between clusters in our model and compared that number to counts from analyses with randomly rewired edges (the same genes and number of edges, with randomized gene labels), repeated 1000 times to obtain p-values for the count. For undirected edges (HPRD and MouseNet), relationships were counted in both directions. For relationships present in our model (optimized weight between two clusters is non-zero), a true positive (TP) was counted if the p-value for the interaction count was less than 0.05; otherwise a false negative (FN) was counted. Likewise, for relationships not present in our model, a true negative (TN) was counted if the p-value was greater than 0.05; otherwise a false positive (FP) was counted. Accuracy was calculated as (TP+TN)/(TP+TN+FP+FN).
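The rewiring test and the accuracy formula can be sketched as follows; the edge list and cluster assignments are toy data, whereas real inputs would come from ChEA, HPRD, or MouseNet:

```python
import numpy as np

def cluster_edge_pvalue(edges, cluster_of, pair, n_perm=1000, seed=0):
    """P-value for the number of gene-gene edges linking a cluster pair,
    against a null in which gene cluster labels are randomly rewired."""
    rng = np.random.default_rng(seed)
    genes = sorted(cluster_of)
    labels = [cluster_of[g] for g in genes]

    def count(mapping):
        return sum((mapping[a], mapping[b]) == pair for a, b in edges)

    real = count(cluster_of)
    hits = 0
    for _ in range(n_perm):
        shuffled = dict(zip(genes, rng.permutation(labels)))
        if count(shuffled) >= real:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Toy data: genes g0..g3 in cluster 1, g4..g7 in cluster 2, with every
# cluster-1 gene regulating one cluster-2 gene.
cluster_of = {f"g{i}": 1 if i < 4 else 2 for i in range(8)}
edges = [(f"g{i}", f"g{i + 4}") for i in range(4)]
p = cluster_edge_pvalue(edges, cluster_of, pair=(1, 2))
print(p < 0.05, accuracy(3, 5, 1, 1))
```

For an undirected edge source, `count` would additionally test the reversed pair, matching the both-directions counting described above.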