Open Access
Research Article
Geometric Interpretation of Gene Coexpression Network Analysis
Department of Human Genetics, David Geffen School of Medicine, and Department of Biostatistics, School of Public Health, University of California, Los Angeles, California, United States of America
Abstract
The merging of network theory and microarray data analysis techniques has spawned a new field: gene coexpression network analysis. While network methods are increasingly used in biology, the network vocabulary of computational biologists tends to be far more limited than that of, say, social network theorists. Here we review and propose several potentially useful network concepts. We take advantage of the relationship between network theory and the field of microarray data analysis to clarify the meaning of and the relationship among network concepts in gene coexpression networks. Network theory offers a wealth of intuitive concepts for describing the pairwise relationships among genes, which are depicted in cluster trees and heat maps. Conversely, microarray data analysis techniques (singular value decomposition, tests of differential expression) can also be used to address difficult problems in network theory. We describe conditions when a close relationship exists between network analysis and microarray data analysis techniques, and provide a rough dictionary for translating between the two fields. Using the angular interpretation of correlations, we provide a geometric interpretation of network theoretic concepts and derive unexpected relationships among them. We use the singular value decomposition of module expression data to characterize approximately factorizable gene coexpression networks, i.e., adjacency matrices that factor into node specific contributions. High and low level views of coexpression networks allow us to study the relationships among modules and among module genes, respectively. We characterize coexpression networks where hub genes are significant with respect to a microarray sample trait and show that the network concept of intramodular connectivity can be interpreted as a fuzzy measure of module membership. We illustrate our results using human, mouse, and yeast microarray gene expression data. 
The unification of coexpression network methods with traditional data mining methods can inform the application and development of systems biologic methods.
Author Summary
Similar to natural languages, network language is ever evolving. While some network terms (concepts) are widely used in gene coexpression network analysis, others still need to be developed to meet the ever increasing demand for describing the system of gene transcripts. There is a need to provide an intuitive geometric explanation of network concepts and to study their relationships. For example, we show that certain seemingly disparate network concepts turn out to be synonyms in the context of coexpression modules. We show how coexpression network language affects our understanding of biology. For example, there are geometric reasons why highly connected hub genes in important coexpression modules tend to be important, and why hub genes in one module cannot be hubs in another distinct module. We provide a short dictionary for translating between microarray data analysis language and network theory language to facilitate communication between the two fields. We describe several examples that illustrate how the two data analysis fields can inform each other.
Citation: Horvath S, Dong J (2008) Geometric Interpretation of Gene Coexpression Network Analysis. PLoS Comput Biol 4(8): e1000117. doi:10.1371/journal.pcbi.1000117
Editor: Satoru Miyano, University of Tokyo, Japan
Received: October 12, 2007; Accepted: June 9, 2008; Published: August 15, 2008
Copyright: © 2008 Horvath, Dong. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: We acknowledge grant support from 1U19AI063603-01, P50CA092131, 1U24NS043562-01, 5P30CA016042-28, and HL28481.
Competing interests: The authors have declared that no competing interests exist.
* E-mail: shorvath@mednet.ucla.edu
Introduction
Many biological networks share topological properties. Common global properties include modular organization [1],[2], the presence of highly connected hub nodes, and approximate ‘scale free topology’ [3],[4]. Common local topological properties include the presence of recurring patterns of interconnections (‘network motifs’) in regulation networks [5]–[7].
One goal of this article is to describe existing and novel network concepts (also known as network statistics or indices [8]) that can be used to describe local and global network properties. For example, the clustering coefficient [9] is a network concept that measures the cohesiveness of the neighborhood of a node. We are particularly interested in network concepts that are defined with regard to a ‘gene significance measure’. Gene significance measures are of great practical importance since they allow one to incorporate external gene information into the network analysis. In functional enrichment analysis, a gene significance measure could indicate pathway membership. In gene knock-out experiments, gene significance could indicate knock-out essentiality. We study gene significance measures since a microarray sample trait (e.g., case control status) gives rise to a statistical measure of gene significance. For example, the Student t-test of differential expression leads to a gene significance measure. Many traditional microarray data analysis methods focus on the relationship between the microarray sample trait and the gene expression data. For example, gene filtering methods aim to find a list of (differentially expressed) genes that are significantly associated with the microarray sample trait; other examples are microarray-based prediction methods that aim to accurately predict the sample trait on the basis of the gene expression data.
Gene expression profiles across microarray samples can be highly correlated and it is natural to describe their pairwise relations using network language. Genes with similar expression patterns may form complexes, pathways, or participate in regulatory and signaling circuits [10]–[12]. Gene coexpression networks have been used to describe the transcriptome in many organisms, e.g., yeast, flies, worms, plants, mice, and humans [13]–[23]. Gene coexpression network methods have also been used for typical microarray data analysis tasks such as gene filtering [19], [24]–[26] and outcome prediction [27],[28].
While the utility of network methods for analyzing microarray data has been demonstrated in numerous publications, the utility of microarray data analysis techniques for solving network theoretic problems has not yet been fully appreciated. One goal of this article is to show that simple geometric arguments can be used to derive network theoretic results if the networks are defined on the basis of a correlation matrix.
Definition of Gene Coexpression Networks
Although many of our network concepts will be useful for general networks, we are particularly interested in gene coexpression networks (also known as association-, influence-, relevance-, or correlation networks). Gene coexpression networks are built on the basis of a gene coexpression measure. The network nodes correspond to genes, or more precisely to gene expression profiles. The ith gene expression profile xi is a vector whose components report the gene expression values across m microarrays. We define the coexpression similarity sij between genes i and j as the absolute value of the correlation coefficient between their expression profiles:
$$s_{ij} = |\mathrm{cor}(x_i, x_j)|$$
Using a thresholding procedure, this coexpression similarity is transformed into a measure of connection strength (adjacency). An unweighted network adjacency aij between gene expression profiles xi and xj can be defined by hard thresholding the coexpression similarity sij as follows:
$$a_{ij} = \begin{cases} 1 & \text{if } s_{ij} \ge \tau \\ 0 & \text{otherwise} \end{cases} \tag{1}$$
where τ is the “hard” threshold parameter. Thus, two genes are linked (aij = 1) if the absolute correlation between their expression profiles reaches the (hard) threshold τ. Hard thresholding of the correlation leads to simple network concepts (e.g., the gene connectivity equals the number of direct neighbors) but it may lead to a loss of information: if τ has been set to 0.8, there will be no link between two genes if their correlation equals 0.799. To preserve the continuous nature of the coexpression information, one could simply define a weighted adjacency matrix as the absolute value of the gene expression correlation matrix, i.e., [aij] = [sij]. However, since microarray data can be noisy and the number of samples is often small, we and others have found it useful to emphasize strong correlations and to penalize weak correlations. It is natural to define the adjacency between two genes as a power of the absolute value of the correlation coefficient [19],[24]:
$$a_{ij} = s_{ij}^{\beta} = |\mathrm{cor}(x_i, x_j)|^{\beta} \tag{2}$$
with β≥1. This soft thresholding approach leads to a weighted gene coexpression network. We present empirical results for weighted and unweighted networks in the main text, Text S1, Text S2, and Text S3.
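The two thresholding schemes can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' software: the function names, the toy expression matrix, and the choice β = 6 are assumptions made for the example, and the hard threshold is taken to be inclusive (sij ≥ τ).

```python
import numpy as np

def coexpression_similarity(X):
    """Absolute Pearson correlation between the rows (genes) of X."""
    return np.abs(np.corrcoef(X))

def hard_threshold_adjacency(S, tau):
    """Unweighted adjacency: 1 where the similarity reaches tau, else 0."""
    A = (np.asarray(S, dtype=float) >= tau).astype(float)
    np.fill_diagonal(A, 1.0)  # keep the diagonal at 1 by convention
    return A

def soft_threshold_adjacency(S, beta):
    """Weighted adjacency: similarity raised to the power beta (beta >= 1)."""
    return np.asarray(S, dtype=float) ** beta

# Toy data (made up for illustration): 5 genes measured on 20 arrays.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))
S = coexpression_similarity(X)
A_hard = hard_threshold_adjacency(S, tau=0.8)
A_soft = soft_threshold_adjacency(S, beta=6)

# Information loss under hard thresholding: a correlation of 0.799
# yields no link at tau = 0.8, while the soft threshold keeps a weight.
S_pair = np.array([[1.0, 0.799], [0.799, 1.0]])
print(hard_threshold_adjacency(S_pair, tau=0.8)[0, 1])
print(soft_threshold_adjacency(S_pair, beta=6)[0, 1])
```

The soft-thresholded matrix keeps every pairwise weight, so downstream concepts such as connectivity become sums of weights rather than neighbor counts.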
Social Network Analogy: Affection Network
Since humans are organized into social networks, social network analogies should be intuitive to many readers. Therefore, we will refer to the following ‘affection network’ throughout this article. Assume that n individuals filled out an interest questionnaire, which was used to define a pairwise similarity score sij. For convenience, we assume that the similarity measure takes on values between 0 and 1. Our definition of the affection network is based on the following assumption: the more similar the interests between two individuals, the more affection they feel for each other. More specifically, we assume that the affection (adjacency) aij between two individuals is proportional to their similarity on a logarithmic scale, i.e.,
$$\log(a_{ij}) = \beta \times \log(s_{ij}) \tag{3}$$
This is equivalent to our soft thresholding approach aij = sij^β (Equation 2). A soft threshold β = 2 implies that the affection aij equals 0.25 if the similarity sij equals 0.5.
Results
Gene Significance Based on a Microarray Sample Trait
Many network applications use at least one gene significance measure. Abstractly speaking, we define a gene significance measure as a function GS that assigns a nonnegative number to each gene; the higher GSi, the more biologically significant gene i is. We assume that the minimum gene significance is 0. For example, if a statistical significance level (p-value) is available for each gene, the gene significance of the ith gene can be defined as minus log of the p-value, i.e., GSi = −log(pi). In this article, we are particularly interested in gene significance measures that are based on a microarray sample trait, e.g., a clinical outcome. The microarray sample trait T = (T1,…,Tm) may be quantitative (e.g., body weight) or binary (e.g., case control status). Since our goal is to provide a simple geometric interpretation of coexpression network analysis, we define the trait-based gene significance measure by raising the absolute correlation between the ith gene expression profile xi and the clinical trait T to a power β:
$$GS_i = |\mathrm{cor}(x_i, T)|^{\beta} \tag{4}$$
Although any power β could be used in Equation 4, we use the same power as in Equation 2 to facilitate a simple geometric interpretation.
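As a rough sketch of Equation 4, the trait-based gene significance can be computed for all genes at once. The function name and the tiny data set are hypothetical; the code assumes that no expression profile is constant across samples, since the correlation is undefined in that case.

```python
import numpy as np

def trait_gene_significance(X, T, beta=1):
    """GS_i = |cor(x_i, T)| ** beta for each row x_i of X (genes x samples)."""
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    Xc = X - X.mean(axis=1, keepdims=True)  # center each expression profile
    Tc = T - T.mean()                       # center the sample trait
    cor = (Xc @ Tc) / (np.linalg.norm(Xc, axis=1) * np.linalg.norm(Tc))
    return np.abs(cor) ** beta

# Made-up example: three genes on four samples and a quantitative trait.
X = [[1, 2, 3, 4],
     [4, 3, 2, 1],
     [1, -1, 1, -1]]
T = [1, 2, 3, 4]
GS = trait_gene_significance(X, T, beta=2)
print(GS)
```

The first two genes are perfectly correlated and perfectly anti-correlated with the trait, so both receive the maximal significance; only the absolute correlation matters.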
Geometric Interpretation Using a Hypersphere
We find it convenient to express network quantities in terms of correlation coefficients since the correlation between two vectors can be interpreted as the cosine of the angle between them (measured in radians) if the vectors are centered to have a mean of 0. Since the correlation is invariant under shifts and positive rescaling, i.e., cor(axi+b, cxj+d) = cor(xi,xj) for a, c > 0, we can assume without loss of generality that the vectors xi have a mean of 0 and are of the same length. In other words, they correspond to points on a hypersphere.
The network adjacency aij is a monotonically decreasing function of the angle θij between the two scaled expression profiles if 0≤θij≤π/2. When the angle θij equals 0 or π/2, the adjacency equals 1 or 0, respectively. The network adjacency is a monotonically decreasing function of the length of the shortest path (geodesic) between the two points on the hypersphere. Soft thresholding methods (Equation 2) preserve the continuous nature of these distances. The higher the soft threshold β, the more weight is assigned to short geodesic distances compared to large distances.
Since the trait-based gene significance measure GSi = |cor(xi,T)|^β (Equation 4) is scale-invariant, the sample trait T can also be considered a point on the hypersphere. Analogous to the network adjacency, the smaller the geodesic distance between the ith gene expression profile and the trait T, the higher the gene significance of the ith gene. In other words, the smaller the angle between the sample trait and the expression profile, the more significant is the gene.
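The angular interpretation can be checked numerically. In the sketch below (helper names and the two example vectors are assumptions), each profile is centered and scaled to unit length, after which the dot product of two profiles equals their Pearson correlation, and the soft-thresholded adjacency becomes a decreasing function of the angle on [0, π/2].

```python
import numpy as np

def to_hypersphere(x):
    """Center a profile to mean 0 and scale it to unit length."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    return xc / np.linalg.norm(xc)

def adjacency_from_angle(theta, beta):
    """Soft-threshold adjacency as a function of the angle theta (radians)."""
    return np.abs(np.cos(theta)) ** beta

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]
u, v = to_hypersphere(x), to_hypersphere(y)

cos_angle = float(u @ v)                      # cosine of the angle between u and v
theta = np.arccos(np.clip(cos_angle, -1, 1))  # geodesic distance on the unit sphere
pearson = np.corrcoef(x, y)[0, 1]
print(cos_angle, pearson, theta)

# Adjacency falls from 1 at theta = 0 to 0 at theta = pi/2.
thetas = np.linspace(0.0, np.pi / 2, 50)
adj = adjacency_from_angle(thetas, beta=6)
```

On the unit hypersphere the geodesic distance between two points equals the angle between them, which is why the adjacency can be read as a function of either quantity.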
A Motivational Example
As a motivational example, we study the pairwise correlations among 498 genes that had previously been found to form a sub-network related to mouse body weight. The microarray data measure the expression levels in multiple tissue samples (liver, adipose, brain, muscle) from male and female mice of an F2 intercross. Approximately 100 tissue samples are available for each gender/tissue combination. The biological significance of this subnetwork is described in [23],[26]. Here we focus on the mathematical and topological properties of the pairwise absolute correlations aij = |cor(xi,xj)| between the genes. For each gender and tissue type Figure 1A depicts a hierarchical cluster tree of the genes. Figure 1B shows the corresponding heat maps, which color-code the absolute pairwise correlations aij. As can be seen from the color bar underneath the heat maps, red and green in the heat map indicate high and low absolute correlation, respectively. The genes in the rows and columns of each heat map are sorted by the corresponding cluster tree.
Figure 1. This motivational example explores the pairwise absolute correlations aij = |cor(xi,xj)| among 498 genes in different mouse tissues.
The biological significance of this network is described in [23],[26]. Each figure panel contains 8 subfigures for different genders and tissue types (liver, adipose, brain, muscle). (A) An average linkage hierarchical cluster tree of the genes. (B) The corresponding heat maps, which color-code the absolute pairwise correlations aij: red and green in the heat map indicate high and low absolute correlation, respectively. The genes in the rows and columns of each heat map are sorted by the corresponding cluster tree. (C) The relationship between gene significance GS (y-axis) and connectivity (x-axis). The gene significance of the ith gene was defined as the absolute correlation between the ith gene expression profile and mouse body weight. The hub gene significance HGS (Equation 13) is defined as the slope of the red line, which results from a regression model without an intercept term.
doi:10.1371/journal.pcbi.1000117.g001
It is visually obvious that the heat maps and the cluster trees of different gender/tissue combinations can look quite different. Network theory offers a wealth of intuitive concepts for describing the pairwise relationships among genes that are depicted in cluster trees and heat maps. To illustrate this point, we describe several such concepts in the following. By visual inspection of Figure 1B, genes appear to be more highly correlated in liver than in adipose (a lot of red versus green color in the corresponding heat maps). This property can be captured by the concept of network density (defined below). The density of the female liver network is 0.39 while it is only 0.23 for the female adipose network. Another example for the use of network concepts is to quantify the extent of cluster (module) structure. In this example, branches of a cluster tree (Figure 1A) correspond to modules in the corresponding network. The cluster structure is also reflected in the corresponding heat maps: modules correspond to large red squares along the diagonal. Network theory provides a concept for quantifying the extent of module structure in a network: the mean clustering coefficient (defined below). The female liver, male liver and female brain networks have high mean clustering coefficients (mean ClusterCoef = 0.42, 0.43, 0.41, respectively). In contrast, the female adipose, male adipose, and male brain networks have lower mean clustering coefficients (mean ClusterCoef = 0.27, 0.27, 0.25, respectively). Differences in module structure may reflect true biological differences, or they may reflect noise (e.g., technical artifacts or tissue contamination).
As another example for the use of network concepts, compare the cluster tree of the female brain network with that of the male brain network. The cluster tree of the female network appears to be comprised of a single large branch, i.e., a highly connected hub gene at the tip of the branch forms the center in this network. In contrast, the cluster tree corresponding to the male brain network appears to split into multiple smaller branches, i.e., no single gene forms the center. To measure whether a highly connected hub gene forms the center in a network, one can use the concept of centralization (defined below). The female brain and male brain networks have centralization 0.34 and 0.21, respectively.
These examples illustrate that graph theory contains a wealth of network concepts that can be used to describe microarray data. But we will argue that microarray data analysis techniques can also be used to derive network theoretic results. For example, network theorists have long studied the relationship between gene significance and connectivity. Several network articles have pointed out that highly connected hub nodes are central to the network architecture [17], [29]–[32] but hub genes may not always be biologically significant [33]. To define a sample trait based gene significance measure (Equation 4), we define the gene significance of gene i as the absolute correlation between the gene expression profile xi and body weight T, i.e., GSi = |cor(xi,T)|. Figure 1C shows the relationship between this gene significance measure and connectivity in the different gender/tissue type networks. We find a strong positive relationship between gene significance and connectivity in the female and the male mouse liver networks. The positive relationship between gene significance and connectivity suggests that both variables could be used to implicate genes related to body weight. For example, we used connectivity as a variable in a systems biologic gene screening method [26]. While most network theorists would agree that connectivity is an important variable for finding important genes in a network [17],[19], the statistical advantages of combining gene significance and connectivity are not clear. Below, we use the geometric interpretation of coexpression network analysis to argue that intramodular connectivity can be interpreted as a fuzzy measure of module membership. Thus, a systems biologic gene screening method that combines a gene significance measure with intramodular connectivity amounts to a pathway based gene screening method. Empirical evidence shows that the resulting systems biologic gene screening methods can lead to important biological insights [23]–[26]. 
Before combining gene significance and connectivity in a systems biologic gene screening approach, it is important to study their relationship. Toward this end, we propose a measure of hub gene significance HGS, defined as the slope of a regression line (through the origin) between gene significance and scaled connectivity. As can be seen from Figure 1C, the hub gene significance is high in liver and adipose tissues but it is low in brain and muscle tissues. Below, we use the geometric interpretation of coexpression networks to characterize coexpression networks that have high hub gene significance if the gene significance measure is based on a microarray sample trait T.
Network Concepts
Abstract definition of network concepts.
We define network concepts for (weighted) undirected networks that can be represented by a symmetric adjacency matrix A = [aij], where 1≤i,j≤n. We assume that the pairwise adjacency (connection strength) aij takes on values in the unit interval, i.e., 0≤aij≤1. For notational convenience, we set the diagonal elements to 1. In the Methods section, we define a network concept NCF(A,GS) by evaluating a network concept function NCF(·,·) on the adjacency matrix A and/or a corresponding gene significance measure GS. This abstract definition will be useful in defining intramodular network concepts (e.g., Equation 17) and eigengene-based analogs of network concepts (e.g., Equation 30). In the following, we describe several network concepts including the connectivity, the maximum adjacency ratio, the density, and the centralization.
Connectivity and related concepts.
The connectivity (also known as degree) of the ith gene is defined by
$$k_i = \sum_{j \ne i} a_{ij} \tag{5}$$
In unweighted networks, the connectivity ki equals the number of genes that are directly linked to gene i. In weighted networks, the connectivity equals the sum of connection weights between gene i and the other genes.
The maximum connectivity is defined as
$$k_{\max} = \max_j (k_j) \tag{6}$$
The scaled connectivity Ki of the ith gene is defined by
$$K_i = \frac{k_i}{k_{\max}} \tag{7}$$
By definition, 0≤Ki≤1. Note that we distinguish the scaled from the unscaled connectivity by using an upper case “K” and a lower case “k”, respectively.
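A minimal sketch of Equations 5 to 7 follows; the function names and the 3×3 adjacency matrix are made up for illustration. Because the diagonal is set to 1 by convention, it is subtracted from the row sums so that only off-diagonal adjacencies contribute.

```python
import numpy as np

def connectivity(A):
    """k_i: sum of the off-diagonal adjacencies in row i."""
    A = np.asarray(A, dtype=float)
    return A.sum(axis=1) - np.diag(A)

def scaled_connectivity(A):
    """K_i = k_i / k_max, which lies between 0 and 1."""
    k = connectivity(A)
    return k / k.max()

# Made-up weighted network with three nodes (diagonal set to 1).
A = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.1],
              [0.2, 0.1, 1.0]])
k = connectivity(A)
K = scaled_connectivity(A)
print(k, K)
```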
Social Network Interpretation of the Connectivity: For the aforementioned affection network (Equation 3), assume that the affection (adjacency) aij equals 1 if two individuals strongly like each other; it equals 0.5 if they are neutral towards each other, and it equals 0 if they strongly dislike each other. Then the scaled connectivity Ki is a measure of relative popularity: high values of Ki indicate that the ith person is well liked by many others.
Potential Uses of the Connectivity: The connectivity is the most widely used concept for distinguishing the nodes of a network. As described in the motivational example and detailed below, intramodular connectivity can be used to define a systems biologic gene screening strategy that keeps track of module membership information [24].
Maximum adjacency ratio.
For weighted networks, we define the maximum adjacency ratio of gene i as follows:
$$MAR_i = \frac{\sum_{j \ne i} (a_{ij})^2}{\sum_{j \ne i} a_{ij}} \tag{8}$$
which is defined if ki = Σj≠i aij>0. One can easily verify that 0≤aij≤1 implies 0≤MARi≤1. Note that MARi = 1 if all nonzero adjacencies take on their maximum value of 1, which justifies the name “maximum adjacency ratio.” By contrast, if all nonzero adjacencies take on a small (but constant) value aij = ε, then MARi = ε will be small.
Social Network Interpretation of the Maximum Adjacency Ratio: MARi = 1 suggests that the ith individual does not form neutral relationships; this individual either strongly likes or dislikes others. In contrast, MARi = 0.5 suggests the ith individual forms less intense relationships with others.
Potential Uses of the Maximum Adjacency Ratio: Since MARi = 1 for all genes in an unweighted network, the maximum adjacency ratio is only useful for weighted networks. The MAR can be used to determine whether a hub gene forms moderate relationships with a lot of genes or very strong relationships with relatively few genes. To illustrate this point, we show in the following simple example that the MAR can be used to distinguish nodes that have the same connectivity. Assume a network (labeled by I) for which the adjacency between node 1 and every other node equals a1,j(I) = 1/(n−1). Then k1(I) = (n−1)/(n−1) = 1 and MAR1(I) = 1/(n−1). For a different network (labeled by II) where a1,2(II) = 1 and a1,j(II) = 0 for j≥3, the connectivity k1(II) still equals 1 but MAR1(II) = 1.
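The two-network example above can be reproduced numerically. The sketch assumes that MARi is the sum of squared off-diagonal adjacencies divided by the connectivity, which matches the values quoted for networks I and II; the function name, the choice n = 5, and setting all unspecified adjacencies to 0 are assumptions of the illustration. Nodes with zero connectivity receive NaN because the ratio is undefined there.

```python
import numpy as np

def max_adjacency_ratio(A):
    """MAR_i = sum_j a_ij^2 / sum_j a_ij over j != i (NaN where k_i = 0)."""
    A = np.asarray(A, dtype=float).copy()
    np.fill_diagonal(A, 0.0)            # drop the diagonal from both sums
    k = A.sum(axis=1)
    s2 = (A ** 2).sum(axis=1)
    mar = np.full(k.shape, np.nan)
    np.divide(s2, k, out=mar, where=k > 0)
    return mar

n = 5

# Network I: node 1 is weakly connected to every other node.
A_I = np.eye(n)
A_I[0, 1:] = A_I[1:, 0] = 1.0 / (n - 1)

# Network II: node 1 is strongly connected to node 2 only.
A_II = np.eye(n)
A_II[0, 1] = A_II[1, 0] = 1.0

k1_I = A_I[0].sum() - 1.0    # connectivity of node 1 in network I
k1_II = A_II[0].sum() - 1.0  # connectivity of node 1 in network II
MAR_I = max_adjacency_ratio(A_I)
MAR_II = max_adjacency_ratio(A_II)
print(k1_I, MAR_I[0])
print(k1_II, MAR_II[0])
```

Both networks give node 1 the same connectivity, but the maximum adjacency ratio separates the many-weak-links pattern from the one-strong-link pattern.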
In weighted coexpression networks, we find empirically that MARi is often highly correlated with the connectivity Ki (see also Equation 36). As we demonstrate in Figure 2, the MARi is sometimes (but not always) superior to Ki when it comes to identifying biologically important intramodular hub genes. As an aside, we mention that a directed network analog of MARi has been used in the analysis of metabolic fluxes [34].
Figure 2. Relationships among maximum adjacency ratio, scaled connectivity, and gene significance.
(A) The relationship between MARi (y-axis) and scaled connectivity Ki using the female mouse muscle tissue network described in the motivational example. The genes are colored red or black depending on whether they are significantly (p-value<0.05) related to mouse body weight. (B) Boxplots and a Kruskal-Wallis test p-value (p = 0.00072) for studying whether MARi differs between significant (red) and non-significant (black) genes. (C) The analogous boxplots and p-value for the scaled connectivity Ki. In this female muscle tissue application, MARi is more significantly (p = 0.00072) related to GSi than is Ki (p = 0.051). (D,E,F) The analogous relationships for male muscle. Here, the MARi is more significantly (p = 0.00014) related to GSi than is Ki (p = 0.0034). (G,H,I) The analogous relationships for the brown module of the brain cancer application. Here, the MARi is slightly more significantly (p = 1.6E-8) related to GSi than is Ki (p = 2.6E-7). As a caveat, we mention that in other applications (e.g., the yeast network), we have found that Ki is more significantly related to GSi than MARi.
doi:10.1371/journal.pcbi.1000117.g002
Network density.
The network density (also known as line density [35]) is defined as the mean off-diagonal adjacency and is closely related to the mean connectivity:
$$\mathrm{Density} = \frac{\sum_i \sum_{j \ne i} a_{ij}}{n(n-1)} = \frac{S_1(k)}{n(n-1)} = \frac{\mathrm{mean}(k)}{n-1} \tag{9}$$
where k = (k1,…,kn) denotes the vector of connectivities and the function Sp(·) of a vector v is defined by Sp(v) = Σi vi^p.
Social Network Interpretation of the Density: The density measures the overall affection among individuals. A density close to 1 indicates that all individuals strongly like each other while a density of 0.5 suggests the presence of more ambiguous relationships.
Potential Uses of the Density: The density of genes in a subnetwork (e.g., a pathway) can be used to measure whether this sub-network is tight or cohesive. In our motivational mouse tissue example, we find that a network of genes has high density in liver tissue but low density in adipose tissue. The goal of many module detection methods is to find clusters of genes with high density.
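The density can be computed either as the mean off-diagonal adjacency or from the mean connectivity; the sketch below (function name and toy matrix are assumptions) checks that the two routes agree.

```python
import numpy as np

def density(A):
    """Mean off-diagonal adjacency of a symmetric adjacency matrix."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    off_diagonal_sum = A.sum() - np.trace(A)
    return off_diagonal_sum / (n * (n - 1))

# Made-up weighted network with three nodes (diagonal set to 1).
A = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.1],
              [0.2, 0.1, 1.0]])
k = A.sum(axis=1) - np.diag(A)          # connectivities
d_direct = density(A)
d_from_k = k.mean() / (A.shape[0] - 1)  # mean(k) / (n - 1)
print(d_direct, d_from_k)
```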
Network centralization.
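The network centralization (also known as degree centralization [36]) is given by
$$\mathrm{Centralization} = \frac{n}{n-2}\left(\frac{\max(k)}{n-1} - \frac{\mathrm{mean}(k)}{n-1}\right) = \frac{n}{n-2}\left(\frac{\max(k)}{n-1} - \mathrm{Density}\right) \tag{10}$$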
The centralization is 1 for a network with star topology; by contrast, it is 0 for a network where each node has the same connectivity. A regular grid network such as a square has centralization 0.
Social Network Interpretation of the Centralization: The centralization of the affection network is close to 1, if one individual has loving relationships with all others who in turn strongly dislike each other. In contrast, a centralization of 0 indicates that all individuals are equally popular.
Potential Uses of the Centralization: While the centralization is a widely used measure in social network studies, it has only rarely been used to describe structural differences of metabolic networks [37]. As described in our motivational example, the centralization can be used to describe properties of cluster trees, see also [8].
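The two limiting cases mentioned above (star topology and equal connectivities) make a convenient numerical check. The function below assumes the degree-centralization normalization n/(n−2) · (max(k)/(n−1) − Density), which requires n > 2; the function name and the two toy networks are hypothetical.

```python
import numpy as np

def centralization(A):
    """Degree centralization: n/(n-2) * (max(k)/(n-1) - density), n > 2."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    k = A.sum(axis=1) - np.diag(A)   # connectivities without the diagonal
    dens = k.mean() / (n - 1)
    return n / (n - 2) * (k.max() / (n - 1) - dens)

# Star network: node 0 is linked to all others, which are not linked to each other.
n = 6
star = np.eye(n)
star[0, 1:] = star[1:, 0] = 1.0

# Square (4-cycle): every node has exactly two neighbors.
square = np.eye(4)
for i in range(4):
    j = (i + 1) % 4
    square[i, j] = square[j, i] = 1.0

c_star = centralization(star)
c_square = centralization(square)
print(c_star, c_square)
```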
Network heterogeneity.
The network heterogeneity measure is based on the variance of the connectivity. Authors differ on how to scale the variance [35]. We define it as the coefficient of variation of the connectivity distribution, i.e.,
$$\mathrm{Heterogeneity} = \frac{\sqrt{\mathrm{variance}(k)}}{\mathrm{mean}(k)} \tag{11}$$
This heterogeneity measure is invariant with respect to multiplying the connectivity by a scalar.
Social Network Interpretation of the Heterogeneity: The heterogeneity can be used to measure the variation of popularity (connectivity) across the individuals.
Potential Uses of the Heterogeneity: Describing the reasons for and the meaning of the heterogeneity of complex networks has been the focus of considerable research in recent years [29],[38]. Many complex networks have been found to exhibit an approximate scale-free topology, which implies that these networks are very heterogeneous [3].
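A short sketch of the heterogeneity as a coefficient of variation is given below. Whether the variance uses the population (divide by n) or sample (divide by n−1) convention is not settled by the definition above; the code assumes the population variance, and the function name and example vectors are made up.

```python
import numpy as np

def heterogeneity(k):
    """Coefficient of variation of the connectivity vector k.

    Assumes the population variance (NumPy's default, ddof=0).
    """
    k = np.asarray(k, dtype=float)
    return np.std(k) / np.mean(k)

k = np.array([1.0, 3.0])
h = heterogeneity(k)
h_rescaled = heterogeneity(10.0 * k)  # unchanged under scalar multiplication
h_regular = heterogeneity([2.0, 2.0, 2.0, 2.0])
print(h, h_rescaled, h_regular)
```

Because both the standard deviation and the mean scale linearly with the connectivity, the ratio is the same for the scaled and unscaled connectivity.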
Clustering coefficient.
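The clustering coefficient of gene i is a density measure of local connections, or “cliquishness” [9]. Specifically,
$$\mathrm{ClusterCoef}_i = \frac{\sum_{j \ne i} \sum_{k \ne i,j} a_{ij} a_{jk} a_{ki}}{\left(\sum_{j \ne i} a_{ij}\right)^2 - \sum_{j \ne i} (a_{ij})^2} \tag{12}$$
In unweighted networks, ClusterCoefi equals 1 if and only if all neighbors of i are also linked to each other. For weighted networks, 0≤aij≤1 implies that 0≤ClusterCoefi≤1 [19].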
Social Network Interpretation of the Clustering Coefficient: The higher the clustering coefficient of an individual, the higher is the affection among his friends. The clustering coefficient is zero if all of his friends strongly dislike each other.
Potential Uses of the Clustering Coefficient: As described in our motivational example, the mean clustering coefficient has been used to measure the extent of module structure present in a network. The relationship between the clustering coefficient and connectivity has been used to describe structural (hierarchical) properties of networks [1].
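The clustering coefficient can be sketched with matrix products. The code assumes the weighted form in which the numerator sums aij·ajk·aki over distinct j and k and the denominator is the squared connectivity minus the sum of squared adjacencies; with the diagonal zeroed, the numerator is simply the ith diagonal entry of A³. Function name and test networks are assumptions, and nodes with a zero denominator (fewer than two neighbors) receive NaN.

```python
import numpy as np

def clustering_coefficient(A):
    """Weighted clustering coefficient of every node (NaN if undefined)."""
    A = np.asarray(A, dtype=float).copy()
    np.fill_diagonal(A, 0.0)            # exclude self-adjacencies
    numer = np.diag(A @ A @ A)          # sum over j, k of a_ij * a_jk * a_ki
    k = A.sum(axis=1)
    denom = k ** 2 - (A ** 2).sum(axis=1)
    cc = np.full(k.shape, np.nan)
    np.divide(numer, denom, out=cc, where=denom > 0)
    return cc

# Unweighted triangle: all neighbors of each node are linked to each other.
triangle = np.ones((3, 3))

# Unweighted star: the neighbors of the center are not linked to each other.
star = np.eye(4)
star[0, 1:] = star[1:, 0] = 1.0

cc_triangle = clustering_coefficient(triangle)
cc_star = clustering_coefficient(star)
print(cc_triangle, cc_star)
```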
Hub gene significance.
To measure the association between connectivity and gene significance, we propose the following measure of hub gene significance:
$$\mathrm{HubGeneSignif} = \frac{\sum_i GS_i K_i}{\sum_i (K_i)^2} \tag{13}$$
When GSi is proportional to the scaled connectivity (GSi = cKi), the hub gene significance equals the constant of proportionality: HubGeneSignif = c. The hub gene significance equals the slope of the regression line between GSi and Ki if the intercept term is set to 0 (Figure 3D and 3E).
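The slope interpretation can be verified directly: the least-squares slope of a regression through the origin of GS on K equals Σ GSi·Ki / Σ Ki². The sketch below uses made-up values for K and GS and a hypothetical function name, and compares the closed form with `numpy.linalg.lstsq` fitted without an intercept column.

```python
import numpy as np

def hub_gene_significance(GS, K):
    """Slope of the no-intercept regression of GS on scaled connectivity K."""
    GS = np.asarray(GS, dtype=float)
    K = np.asarray(K, dtype=float)
    return float((GS * K).sum() / (K ** 2).sum())

# Made-up scaled connectivities and gene significances.
K = np.array([1.0, 0.8, 0.5, 0.2])
GS = np.array([0.7, 0.5, 0.4, 0.1])

hgs = hub_gene_significance(GS, K)

# Same slope from a least-squares fit with no intercept term.
slope = np.linalg.lstsq(K[:, None], GS, rcond=None)[0][0]

# If GS is exactly proportional to K, the constant of proportionality is recovered.
c = 0.6
hgs_proportional = hub_gene_significance(c * K, K)
print(hgs, slope, hgs_proportional)
```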
Figure 3. Overview and an example application of gene coexpression network analysis.
(A) Outline of an analysis flow chart. Gene coexpression network analysis aims to identify pathways (modules) and their key drivers (e.g., intramodular hub genes). (B) The hierarchical cluster tree of genes in the brain cancer network. Modules correspond to branches of the tree. The branches and module genes are assigned a color as can be seen from the color-bands underneath the tree. Grey denotes genes outside of proper modules. A functional enrichment analysis of these modules can be found in Horvath et al. (2006). (C) The module significance (average gene significance) of the modules. The underlying gene significance is defined with respect to the patient survival time (Equation 4). (D,E) Scatter plots of gene significance GS (y-axis) versus scaled connectivity K (x-axis) in the brown and blue module, respectively. The hub gene significance (Equation 13) is defined as the slope of the red line, which results from a regression model without an intercept term.
doi:10.1371/journal.pcbi.1000117.g003
Social Network Interpretation of the Hub Gene Significance: Assume that the node significance measures the grade point average of the ith individual. Then the hub node significance can be used to assess whether there is a relationship between popularity (connectivity) and grade point average.
Potential Uses of the Hub Gene Significance: Several studies have shown that the relationship between connectivity and gene significance (i.e., the hub gene significance) carries important biological information. For example, in the analysis of yeast networks, highly connected hub genes were found to be essential for yeast survival and there is evidence that hub genes are preserved across species [17], [25], [29]–[32]. A detailed analysis shows that the positive relationship between connectivity and knockout essentiality cannot always be observed [33], i.e., the hub gene significance can be close to 0.
Network significance measure.
We define the network significance measure as the average gene significance of the genes:
$$\mathrm{NetworkSignificance} = \frac{1}{n}\sum_{i=1}^{n} GS_i \qquad (14)$$
Social Network Interpretation of the Network Significance: The network significance simply measures the average grade point average among the individuals.
Potential Uses of the Network Significance: We refer to the network significance of a module network as “module significance.” The module significance measure can be used to address a major goal of gene network analysis: the identification of biologically significant subnetworks or pathways.
Centroid significance and centroid conformity.
We define the centroid significance as the gene significance of a suitably chosen representative node (centroid) in the network:
$$\mathrm{CentroidSignificance} = GS_{i.\mathrm{centroid}} \qquad (15)$$
where i.centroid denotes the index associated with the centroid. A centroid can be defined in many different ways, e.g., based on connectivity or other centrality measures. In our applications, we define the centroid as the most highly connected gene in the network. If multiple genes attain the maximum connectivity, we define the centroid significance by their average gene significance.
We define the centroid conformity of the ith gene as the adjacency between the centroid and the ith gene:
$$\mathrm{CentroidConformity}_i = a_{i,\,i.\mathrm{centroid}} \qquad (16)$$
If multiple genes attain the maximum connectivity, we define the centroid conformity as their average adjacency with the ith gene.
Social Network Interpretation of the Centroid Conformity: In our affection network, we choose the most popular individual as centroid; then his or her grade point average is the centroid significance. The centroid conformity of the ith individual equals his or her affection (connection strength) with the most popular individual.
Potential Uses of the Centroid Conformity: Below, we will characterize coexpression networks for which the adjacency aij can be approximated by a product of the centroid conformities: aij≈CentroidConformityi CentroidConformityj. We will use this insight to derive relationships among seemingly disparate network concepts. For example, the mean clustering coefficient (Equation 12), the density (Equation 9), and the heterogeneity (Equation 11) measure different network properties but we show that they satisfy a simple relationship (Equation 31) in coexpression modules. Further, we will use the centroid significance to derive a simple relationship (Equation 37) between module significance (Equation 14) and hub gene significance (Equation 13).
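The centroid-based concepts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the adjacency matrix is symmetric with diagonal entries equal to 1, connectivity is the off-diagonal row sum, ties for the maximum connectivity are detected by exact equality, and all function names and toy values are hypothetical. The last helper measures how well the product of centroid conformities reproduces the off-diagonal adjacencies.

```python
def connectivity(adjacency):
    """Off-diagonal row sums of the adjacency matrix."""
    n = len(adjacency)
    return [sum(adjacency[i][j] for j in range(n) if j != i) for i in range(n)]


def centroid_indices(adjacency):
    """Indices of all genes that attain the maximum connectivity."""
    k = connectivity(adjacency)
    k_max = max(k)
    return [i for i, value in enumerate(k) if value == k_max]


def centroid_significance(adjacency, gene_significance):
    """Average gene significance of the centroid(s)."""
    centroids = centroid_indices(adjacency)
    return sum(gene_significance[c] for c in centroids) / len(centroids)


def centroid_conformity(adjacency):
    """Average adjacency of each gene with the centroid(s)."""
    centroids = centroid_indices(adjacency)
    return [sum(row[c] for c in centroids) / len(centroids) for row in adjacency]


def factorization_error(adjacency):
    """Largest off-diagonal gap between a_ij and the product of conformities."""
    cc = centroid_conformity(adjacency)
    n = len(adjacency)
    return max(abs(adjacency[i][j] - cc[i] * cc[j])
               for i in range(n) for j in range(n) if i != j)


# Hypothetical toy network; gene 0 is the most highly connected gene.
adjacency = [[1.0, 0.9, 0.8],
             [0.9, 1.0, 0.72],
             [0.8, 0.72, 1.0]]
gene_significance = [0.6, 0.3, 0.2]
print(centroid_indices(adjacency), centroid_conformity(adjacency))
print(centroid_significance(adjacency, gene_significance))
```

In this toy network the off-diagonal adjacencies are exact products of the centroid conformities, so `factorization_error` is zero up to rounding; real coexpression modules would satisfy the relationship only approximately.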
Overview of Weighted Gene Coexpression Network Analysis
One of the many biological applications of gene coexpression networks is the identification of pathways (modules) and centrally located genes (referred to as module centroids). In our applications, we define highly connected intramodular hub genes as module centroids. Weighted gene coexpression network analysis (WGCNA, [19],[24]) can be considered a step-wise microarray data reduction technique, which starts from the level of thousands of genes, identifies clinically interesting gene modules, and finally represents the modules by their centroids. The module-centric analysis alleviates the multiple testing problem inherent in microarray data analysis. Instead of relating thousands of genes to a sample trait, it focuses on the relationship between a few (usually fewer than 10) modules and the sample trait.
An outline of WGCNA is presented in Figure 3A. The module definition does not make use of a priori defined gene sets. Instead, modules are constructed from the expression data by using a tight clustering procedure. Although it is advisable to relate the resulting modules to gene ontology information to assess their biological plausibility, it is not required. Because the modules may correspond to biological pathways, focusing the analysis on modules (and corresponding centroids) amounts to a biologically motivated data reduction method. Intramodular hub genes are centrally located in the module and thus lend themselves as candidates for biomarkers. Examples of biological studies that show the importance of intramodular hub genes are reported in [23]–[25],[33],[39]. Because the expression profiles of intramodular hub genes are highly correlated (in our data, r>0.90), typically dozens of candidates result. Although these candidates are statistically equivalent, they may differ in terms of biological plausibility or clinical utility.
Network Modules
Roughly speaking, we define network modules as groups of highly interconnected genes. As detailed in Text S1, Text S2, Text S3, and in our online R tutorials, we use a hierarchical clustering procedure to identify modules (clusters) as branches of the resulting cluster tree. A common but inflexible branch cutting method uses a constant height cutoff value. Alternatively, dynamic branch cutting adaptively chooses cutting values depending on the shape of the branch [40]. Each module is assigned a unique color label (Figure 3B). Our branch cutting algorithm only assigns module colors to branches whose size exceeds a user-specified threshold parameter. In practice, it is advisable to vary the minimum module size and other branch cutting parameters to determine how the results are affected by different parameter choices. An iterative approach for choosing the parameters could be defined by optimizing the module significance. This module detection approach has led to biologically meaningful modules in several applications [1], [8], [23]–[25], [33], [39]–[43], but our theoretical results transcend this particular module detection method. Any module detection method that results in clusters of highly correlated gene expression profiles could be used.
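To make the constant height cutoff concrete, the following Python sketch runs average linkage hierarchical clustering and stops merging once the next merge would occur above the cutoff, which is equivalent to cutting the tree at that height because average linkage merge heights never decrease. It is not the dynamic branch cutting method of [40], and it is not the procedure of Text S1 to Text S3. The dissimilarity 1 − a_ij, the function name, the label names, and the toy values are assumptions made for illustration only.

```python
def constant_height_modules(dissimilarity, cut_height, min_module_size):
    """Average linkage clustering cut at a constant height.

    Clusters smaller than min_module_size are labeled "grey".
    """
    n = len(dissimilarity)
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        # Average pairwise dissimilarity between two clusters.
        return sum(dissimilarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d > cut_height:
            break  # every remaining merge lies above the cutoff
        merged = clusters[p] + clusters[q]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)

    labels = ["grey"] * n
    module_id = 0
    for cluster in clusters:
        if len(cluster) >= min_module_size:
            module_id += 1
            for gene in cluster:
                labels[gene] = "module" + str(module_id)
    return labels


# Hypothetical adjacency: genes 0-2 are tightly connected, genes 3-4 form a pair.
adjacency = [[1.0, 0.9, 0.8, 0.1, 0.1],
             [0.9, 1.0, 0.85, 0.1, 0.1],
             [0.8, 0.85, 1.0, 0.1, 0.1],
             [0.1, 0.1, 0.1, 1.0, 0.9],
             [0.1, 0.1, 0.1, 0.9, 1.0]]
dissimilarity = [[1.0 - a for a in row] for row in adjacency]
print(constant_height_modules(dissimilarity, cut_height=0.5, min_module_size=3))
```

Rerunning the sketch with a different `min_module_size` or `cut_height` shows how sensitive the module assignment is to these parameters, which is the reason for varying them in practice.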
Intramodular Network Concepts
In the following, we assume that a module detection method (e.g., a clustering procedure) has found Q modules. We denote the adjacency matrix of the genes inside the qth module by A(q). Thus, A(q) represents a subnetwork consisting of the genes in the qth module. Analogously, we define GS(q) as the gene significance measure restricted to the module genes. Denote by n(q) the number of genes inside the qth module. Throughout the manuscript, we use the superscript (q) to denote quantities associated with the qth module, but for notational convenience we sometimes omit (q) when the context is clear.
We define an intramodular network concept NCF(A(q),GS(q)) by evaluating a network concept function NCF(·,·) on the adjacency matrix A(q) and/or a corresponding gene significance measure GS(q).
For example, the intramodular connectivity is defined by
$$k_i^{(q)} = \sum_{j \neq i} a_{ij}^{(q)} \qquad (17)$$
where the j indexes the genes in the qth module. Intramodular connectivity has been found to be an important complementary gene screening variable for finding biologically important genes [24],[25],[39].
We refer to the network significance (Equation 14) of a module network simply as the module significance measure, i.e., the module significance is the average gene significance of the module genes:
$$\mathrm{ModuleSignificance}^{(q)} = \frac{1}{n^{(q)}}\sum_{i=1}^{n^{(q)}} GS_i^{(q)} \qquad (18)$$
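As a minimal sketch of these two intramodular concepts, the Python functions below restrict a whole-network adjacency matrix and gene significance vector to the genes of one module. The function names, the representation of a module as a list of gene indices, and the toy values are hypothetical.

```python
def intramodular_connectivity(adjacency, module_genes):
    """Sum of adjacencies between each module gene and the other module genes."""
    return {i: sum(adjacency[i][j] for j in module_genes if j != i)
            for i in module_genes}


def module_significance(gene_significance, module_genes):
    """Average gene significance of the module genes."""
    return sum(gene_significance[i] for i in module_genes) / len(module_genes)


# Hypothetical network of four genes; the module contains genes 0, 1, and 2.
adjacency = [[1.0, 0.5, 0.25, 0.125],
             [0.5, 1.0, 0.5, 0.0],
             [0.25, 0.5, 1.0, 0.0],
             [0.125, 0.0, 0.0, 1.0]]
gene_significance = [0.5, 0.25, 0.75, 0.1]
module_genes = [0, 1, 2]
print(intramodular_connectivity(adjacency, module_genes))
print(module_significance(gene_significance, module_genes))
```

Gene 3 lies outside the module, so its adjacency with gene 0 contributes to the whole-network connectivity of gene 0 but not to its intramodular connectivity.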
Data Reduction Methods for Microarray Data
The high dimensionality of gene expression data has inspired two broad categories of data reduction techniques. The first category, often used by network theorists, is to reduce the gene coexpression networks into modules. Each module can be represented by a centroid, e.g., an intramodular hub gene. The second category, often used by microarray data analysts, reduces the gene expression data to a small number of components that capture the essential behavior of the expression profiles [27], [44]–[51]. One of our goals is to understand how the two categories of data reduction methods relate to each other. Here we use the singular value decomposition [44],[45],[48] since this will allow us to define a simple measure of factorizability (Equation 24).
Singular value decomposition.
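For the qth module, denote by X(q) the n(q)×m matrix of n(q) gene expression profiles across m microarrays:
$$X^{(q)} = \left[x_{ij}^{(q)}\right] = \left(x_1^{(q)}, x_2^{(q)}, \ldots, x_{n^{(q)}}^{(q)}\right)^T \qquad (19)$$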
where xi denotes the gene expression vector of the ith gene.
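The singular value decomposition (SVD) of X(q) is given by X(q) = U(q)D(q)(V(q))T, where U(q) is an n(q)×m matrix with orthonormal columns, V(q) is an m×m orthogonal matrix, and D(q) is an m×m diagonal matrix of the singular values {|dl(q)|}. Specifically, V(q) and D(q) are given by
$$V^{(q)} = \left(v_1^{(q)}\; v_2^{(q)}\; \cdots\; v_m^{(q)}\right), \qquad D^{(q)} = \mathrm{diag}\left(|d_1^{(q)}|, |d_2^{(q)}|, \ldots, |d_m^{(q)}|\right) \qquad (20)$$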
The singular value decomposition of X(q) is closely related to the principal component analysis of the correlation matrix COR = [cor(xi(q),xj(q))] whose entries correspond to the pairwise correlations between the rows (genes) of X(q). For example, the eigenvalues of the correlation matrix COR are squares of corresponding singular values |dl(q)|.
We assume that the singular values |dl(q)| are arranged in decreasing order. Adapting terminology from [44], we refer to the first column of V(q), denoted v1(q), as the Module Eigengene:
$$E^{(q)} = v_1^{(q)} \qquad (21)$$
For brevity, we sometimes drop the superscript (q) and simply refer to E as the eigengene. The module eigengene can be used to summarize and represent the expression profiles of the module genes, see Figure 4B. The proportion of variance explained by the module eigengene E(q) is defined as
$$\mathrm{VarExplained}\left(E^{(q)}\right) = \frac{|d_1^{(q)}|^2}{\sum_{l=1}^{m} |d_l^{(q)}|^2} \qquad (22)$$
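The eigengene and the proportion of variance it explains can be approximated without a full SVD routine. The Python sketch below uses power iteration on X^T X, a swapped-in numerical technique rather than the SVD software used for the analyses in this article. It relies on two standard facts: the iteration converges to the first right singular vector when the largest singular value is strictly larger than the others, and the sum of all squared singular values equals the sum of squared entries of X. The function name, the iteration count, and the toy matrices are hypothetical, and any centering or scaling of the expression profiles is left to the caller.

```python
import math


def module_eigengene(x, iterations=500):
    """Approximate the first right singular vector of x (genes x samples)
    and the proportion of variance it explains, using power iteration.

    Assumptions: x is a non-zero list of equal-length rows and its largest
    singular value is not tied with the second largest.
    """
    n, m = len(x), len(x[0])
    # Start from the expression profile with the largest norm, so that the
    # start vector cannot be orthogonal to the row space of x.
    start = max(x, key=lambda row: sum(c * c for c in row))
    norm = math.sqrt(sum(c * c for c in start))
    v = [c / norm for c in start]
    for _ in range(iterations):
        xv = [sum(x[i][s] * v[s] for s in range(m)) for i in range(n)]  # X v
        w = [sum(x[i][s] * xv[i] for i in range(n)) for s in range(m)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    xv = [sum(x[i][s] * v[s] for s in range(m)) for i in range(n)]
    d1_squared = sum(c * c for c in xv)                # squared first singular value
    total = sum(c * c for row in x for c in row)       # sum of all squared singular values
    return v, d1_squared / total


# Hypothetical module in which both genes follow the same profile.
eigengene, var_explained = module_eigengene([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
print(eigengene, var_explained)
```

The sign of the returned vector is arbitrary, as it is for any singular vector. A proportion of variance explained close to 1 indicates that a single profile summarizes the module well.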
