VV and DA conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, and wrote the paper.
The authors have declared that no competing interests exist.
Computational analysis of gene expression data from microarrays has been useful for medical diagnosis and prognosis. The ability to analyze such data at the level of biological modules, rather than individual genes, has been recognized as important for improving our understanding of disease-related pathways. It has proved difficult, however, to infer pathways from microarray data by deriving modules of multiple synergistically interrelated genes, rather than individual genes. Here we propose a systems-based approach called Entropy Minimization and Boolean Parsimony (EMBP) that identifies, directly from gene expression data, modules of genes that are jointly associated with disease. Furthermore, the technique provides insight into the underlying biomolecular logic by inferring a logic function connecting the joint expression levels in a gene module with the outcome of disease. Coupled with biological knowledge, this information can be useful for identifying disease-related pathways, suggesting potential therapeutic approaches for interfering with the functions of such pathways. We present an example providing such gene modules associated with prostate cancer from publicly available gene expression data, and we successfully validate the results on additional independently derived data. Our results indicate a link between prostate cancer and cellular damage from oxidative stress combined with inhibition of apoptotic mechanisms normally triggered by such damage.
Diseases such as cancer are often associated with malfunctioning pathways involving several genes. Identifying modules of such genes and how the genes in each module interact with each other is helpful toward understanding the nature of these diseases. Here the authors provide a novel computational method for discovering such modules of genes merely from two sets of gene expression data, one from healthy tissues and one from tissues suffering from a particular disease. The method is based on the concept of identifying sets of genes whose joint expression state predicts the presence or absence of a particular disease with minimum uncertainty. Once such gene sets have been identified, we can then further use the microarray data to determine the “logic” that connects the genes' individual expression states related to the outcome of the disease. In turn, this logic may give us valuable insight into the nature of the pathways and how we may target some elements of these pathways for therapeutic purposes. The authors apply this methodology in a particular example and conclude that prostate cancer is often associated with cellular damage from oxidative stress combined with the inhibition of the apoptotic mechanisms normally activated by such damage.
The expression levels of thousands of genes, measured simultaneously using DNA microarrays, provide information useful for medical diagnosis and prognosis [
Gene selection techniques from microarray analysis are often based on individual gene ranking depending on a numerical score measuring the correlation of each gene with particular disease types. The expression levels of the highest-ranked genes tend to be either consistently higher in the presence of disease and lower in the absence of disease, or vice versa. Such genes usually have the property that their joint expression levels corresponding to diseased tissues and the joint expression levels corresponding to healthy tissues can be cleanly separated into two distinct clusters. These techniques are therefore convenient and powerful for classification purposes between disease and health, or between different disease types, but they are incompatible with a systems biology viewpoint, because they do not identify systems of synergistically interacting genes, whose joint expression state predicts disease. Rather, microarray clustering techniques tend to produce clusters of co-regulated genes.
Sophisticated machine learning classification approaches, in which a hypersurface on a high-dimensional space serves as a classification boundary separating the gene expression points into classes of tissues, have also been successfully used. Certain nonlinear transformations are typically used to define the shape of the hypersurface, but the performance of the algorithms is limited by the ability to identify and use the optimum such transformation. Furthermore, the set of selected genes is typically not combined with an easily interpretable interrelationship among its members, which could otherwise provide biological insight about the combined role of these genes.
To address such problems, several efforts have recently been made to analyze expression data at the level of biological modules, rather than individual genes [
In this paper, we present a systems-based approach (Entropy Minimization and Boolean Parsimony [EMBP]) that identifies modules of genes jointly associated with disease from gene expression data. The technique also produces a simple logic function connecting the combined expression levels in each gene module with the presence of disease. Roughly speaking, the goals of EMBP analysis are, first, to identify the smallest module of genes whose joint expression levels can predict the presence of disease with high accuracy, and, second, to identify the simplest logic function connecting these genes to achieve this prediction. We applied EMBP analysis on a prostate cancer dataset [
We first binarize microarray expression data into two levels. Although the EMBP methodology can be generalized to account for multiple expression levels, binarization of expression data simplifies the presentation of the concepts in this paper and provides simple logical functions connecting the genes within the found modules.
Rather than independently binarizing each gene's expression level, which would be more appropriate for an individual gene ranking approach, we chose to use single thresholds for all genes. This approach is consistent with the fact that we seek to find global interrelationships among genes and that the microarray data have already been normalized across all tissues and all genes. Therefore, a choice of high threshold will identify the genes that are “strongly” expressed, while a choice of a low threshold will identify the genes that expressed even “weakly.” We performed EMBP analysis across several thresholds and we focused on the threshold choices that provided best performance, as described in the following sections.
Following binarization, each gene is assumed to be either expressed or not expressed in a particular tissue, and we also assume that there are two types of tissues, either healthy ones or tissues suffering from a particular disease. The latter assumption can also be generalized to include more than two types of tissues, or modified to be used for classification among several disease types.
Thus, given
Two Examples of State-Count Tables and the Corresponding Normalized Conditional Entropies
We first address the following problem:
Given a number
We refer to this problem as the “entropy minimization” problem, because we quantify the uncertainty with the information-theoretic measure known as conditional entropy [
In the following, we define the conditional entropy and explain in what sense it measures uncertainty. Given a discrete random variable satisfying a probability distribution {
Specifically, using the counts
If we know the expression state
Finally, to ensure that the range of possible values for the conditional entropy extends from 0 to 1, we normalize by dividing by
The conditional entropy, as defined above, depends on the counts
The entropy minimization problem consists of identifying the gene set with the minimum conditional entropy, as defined above, among all subsets of size
If the conditional entropy for a particular gene set is found to be exactly 0, this implies that the joint expression levels of the members of that gene module determine the existence of disease with absolute certainty under the assumption of the probabilistic model derived from the relative frequencies. This happens whenever, for all 2
Calculating the conditional entropy of a particular set of genes of length
Once a gene module has been identified, and the expression states for that module that are predominantly associated with disease have been determined as described above, we then address the following problem: Given the gene expression states associated with disease, find the simplest logical rule that connects the expression levels in the gene module with the presence of disease.
We refer to this problem as the “Boolean parsimony” problem, because the logical rule will be identified by the “most parsimonious Boolean function.” Our definition for this logic function is one containing the operators AND, OR and NOT, which minimizes a “cost,” defined as the total number of logic variables appearing in the expression. In Boolean algebra [
The reason for the need of Boolean parsimony is that the biological role of each gene becomes more immediately clear if the Boolean expression contains the corresponding logic variable either once or only a few times. We selected the above definition of Boolean parsimony because the logic functions AND, OR, and NOT often have straightforward potential biological interpretations.
The problem is easily solved manually when the size of the gene set is less than 5, as in the examples of this paper, using Karnaugh map logic design methodology [
The state-count table for a set of three genes is used to create the 4 × 2 table shown on the right, known as a “Karnaugh map.” The binary coordinates in the rows and columns of Karnaugh maps are arranged according to the “Gray code” (consecutive coordinates differ by one bit only) for easier derivation of the logical function. Each entry of the Karnaugh map corresponds to one of the states and the numbers shown are equal to the counts
The computational cost of Boolean Parsimony is insignificant compared to that of Entropy Minimization. For example, Espresso [
We used two different prostate cancer datasets. The first prostate cancer microarray expression data [
Each threshold choice for binarizing the continuous-valued data of the EMBP dataset will produce potentially different results. The “optimum” threshold can be defined and evaluated as the one that yields the minimum overall entropy. In our case, we found that this would be equal to 60. However, EMBP analysis is not meant to produce a unique gene set as its output. Rather, it can reveal a rich content of information from the microarray data by using different threshold choices. The combined information resulting from several such gene sets can further help provide insight into related pathways.
We used several thresholds ranging from −30 to 225 in steps of 15 and estimated the minimum entropy gene modules for each of them. For each threshold value we considered gene modules of size
(A) Estimates of the minimum entropy values for gene modules of size
(B) Minimum entropy values for subset of thresholds (colored lines) along with the estimated means of the entropies over randomly permuted data for
List of the Minimum Entropy Gene Modules for Different Thresholds
List of Genes That Were Included in the Estimated Gene Modules along with Their Accession Numbers and Gene Description
To evaluate the significance of these minimum entropy values, we performed entropy minimization over ten random permutations of the tissue class labels. In other words, while keeping the number of healthy and cancerous tissues constant to 50 and 52, respectively, we randomly assigned healthy (0) and cancerous (1) labels to the individual tissue profiles. The entropy minimization algorithm was performed on the randomly permuted data, and the average minimum entropies for
We then estimated the most parsimonious Boolean functions for these five gene modules, which were selected because the corresponding thresholds are close to each other and spread around the “optimum” one, equal to 60, as mentioned above.
Shown are the Karnaugh maps for the minimum entropy three-gene modules for five choices of binarization threshold, the parsimonious Boolean functions derived from the Karnaugh maps, and the classification accuracy of these Boolean functions over the EMBP dataset. Furthermore, we found that gene
The genes mentioned in
(1) Absence of
(2) Presence of
(3) Presence of
(4) If either
(5) In the absence of
Although the classification performance of the Boolean functions from EMBP analysis is high over the dataset upon which the results were derived (
Because of the simplicity of the Boolean functions shown in
We used a simple transformation of the form
Classification Accuracy of EMBP Analysis Results over the Validation Dataset
Remarkably, the classification accuracy of the simple three-gene (two-gene in one case) Boolean functions in the validation dataset were consistently high, exceeding 90% in most cases, indicating that EMBP analysis accurately extracted universally valid prostate cancer–related features. By comparison, the validation results of the
As mentioned earlier, because of overfitting, the classification accuracy on the validation data would not increase had we used four-gene modules for the threshold choices in
The genes in the modules resulting from EMBP analysis are not co-regulated, because, if they were, then each of them alone would provide much of the information that all of them provide; therefore a different gene would be a more appropriate partner, as it would provide complementary information. Nevertheless, these genes are typically related by a shared common “theme” in which they are playing synergistic roles. For example, two genes may appear because they are both required for the activation of a particular cancer-causing pathway. Of course, the cause-and-effect relationship connecting disease and the presence of particular genes in a gene module is not clear from the results of quantitative analysis alone, and the Boolean functions can only be seen as approximations when they are based on a relatively small set of input data, as in our case.
Coupled with additional biological knowledge, however, the results of EMBP analysis can help infer disease-related pathways, which, in turn can help develop therapeutic interventions. This methodology uses the clues provided by the results to create speculative assumptions involving additional genes. Assuming that each gene module has a “story” to say, we can then attempt to combine all these “stories” into an integrated scenario combining many genes. In this section, we provide two examples of this methodology.
We first focus on the three-gene module with the lowest overall conditional entropy (0.1159) (
Interestingly, both of the other members of the module
Taken together, the above observations suggest a speculative scenario consistent with the Boolean function of
There are many more gene modules that are revealed by EMBP analysis in addition to those indicated in
This module resulted from a threshold choice of 15. As seen in
Cancerous tissues are associated with the absence of both
The meaning of this function is the following: Cancerous tissues are associated with the absence of both
The common theme in this cluster is easily detected by the simultaneous presence of two genes, which have the following products: On the one hand, CYP1B1 is an oxidative enzyme induced by, and metabolizing, various substances, several of which are toxins, and on the other hand GSTP1 is another enzyme whose role is to detoxify by catalyzing conjugation of glutathione. Products of oxidative metabolism are “natural” substrates for the glutathione transferases [
A third molecule in the module of
More specifically, maspin was found to interact with three molecules: The two molecules in addition to glutathione S-transferase were two “stress proteins” (heat shock proteins HSP70 and HSP90), suggesting that intracellular maspin may be primarily involved in cellular response to stress stimuli [
From the observations so far, it follows that GSTP1 and maspin may work synergistically to counteract oxidative stress, perhaps by maspin inducing apoptosis whenever GSTP1-induced detoxification is not successful. Interestingly, maspin, a serine protease inhibitor, and hepsin, a serine protease that we previously speculated to play an apoptosis-inhibitive role, were found to be inversely expressed in prostate cancer [
The only remaining gene in that module is
There are two identified splice variants of
Together with the fact, noted earlier, that maspin interacts with
Therefore, a plausible speculative scenario for this module is the following:
The most notable feature of the EMBP method is that it is systems-based, in the sense that it considers the synergistic contributions of sets of genes, rather than individual genes, or sets of co-regulated genes as in microarray clustering techniques. As a result, the optimal gene module of size
Discrimination between healthy and diseased tissues, or between different types of diseases, is not the main purpose of EMBP analysis, although it can be used for this task. If we wanted to focus on classification, then we would use many genes, connect their expression levels in sophisticated ways, rather than through one extremely simple Boolean function, and use their continuous expression levels instead of binarizing them, so that we can extract additional information. Nevertheless, we still validated our results for classification accuracy using the binarized data from sets of three genes connected by simple Boolean functions, applied on totally new microarray data, independently generated in a different lab. The resulting high classification accuracy of typically more than 90% (
We tested the effect of sample size on the results by performing EMBP analysis on several randomly selected subsets of the training data using equal numbers (10, 20, 30, and 40) of samples from normal and cancerous tissues. The minimum entropy results are shown in
The minimum entropy values increase with sample size for any given
As mentioned earlier, the genes in the modules from EMBP analysis are not co-regulated, but they are co-operative. This is in sharp contrast to the gene modules resulting from clustering or bi-clustering approaches [
For example, in reference [
We believe that EMBP analysis provides an opportunity for fruitful cross-disciplinary collaboration in which biologists use the “clues” resulting from the computational results to infer potential pathways, which they can validate with genetic experiments, as well as suggest further computational experiments. For example, if we wish to identify which genes play synergistic roles with another particular gene in terms of causing disease, we could “freeze” the presence of that gene and identify the other genes in a module minimizing the entropy. Furthermore, our systems-based approach can immediately suggest novel potential therapeutic methods that would not be possible with traditional individual-gene approaches, for example, targeting simultaneously two genes that appear in the same Boolean function by combining two already existing drugs targeting each of these genes.
We hope that EMBP analysis will prove to be a significant new tool for medical research working synergistically with the future efforts of diseased tissue genome sequencing. To illustrate with an example, if a Boolean function a′b′c′ is found when analyzing expression data of a particular cancer, this would suggest the possibility that the three genes “gene a,” “gene b,” and “gene c” may cause cancer when all are inactivated, perhaps due to their mutation or to hypermethylation of their promoter, as we previously discussed regarding
We used a combination of two heuristic optimization techniques to search for minimum entropy gene sets, allowing sufficient time for each of them to converge. The first technique is the following: Starting from a randomly chosen gene set of size
To reduce the chance that the found solution corresponds to a local, rather than global minimum, we also used simulated annealing [
The parameters of the simulated annealing algorithm that we used were:
The “cooling scheme” defines the rate at which the temperature falls over the length of the algorithm. We adopted the exponential cooling scheme defined as
An average run of the simulated annealing algorithm for estimating the minimum conditional entropy gene-set for
The National Center for Biotechnology Information (NCBI) (
Entropy Minimization and Boolean Parsimony