AET, PAA, RS, and CC conceived and designed the experiments. AET and MJ analyzed the data. AET wrote the paper.
The authors have declared that no competing interests exist.
The quantity of mRNA transcripts in a cell is determined by a complex interplay of cooperative and counteracting biological processes. Independent Component Analysis (ICA) is one of a small number of unsupervised algorithms that have been applied to microarray gene expression data in an attempt to understand phenotype differences in terms of changes in the activation/inhibition patterns of biological pathways. While the ICA model has been shown to outperform other linear representations of the data such as Principal Components Analysis (PCA), a validation using explicit pathway and regulatory element information has not yet been performed. We apply a range of popular ICA algorithms to six of the largest microarray cancer datasets and use pathway-knowledge and regulatory-element databases for validation. We show that ICA outperforms PCA and clustering-based methods in that ICA components map closer to known cancer-related pathways, regulatory modules, and cancer phenotypes. Furthermore, we identify cancer signalling and oncogenic pathways and regulatory modules that play a prominent role in breast cancer and relate the differential activation patterns of these to breast cancer phenotypes. Importantly, we find novel associations linking immune response and epithelial–mesenchymal transition pathways with estrogen receptor status and histological grade, respectively. In addition, we find associations linking the activity levels of biological pathways and transcription factors (NF1 and NFAT) with clinical outcome in breast cancer. ICA provides a framework for a more biologically relevant interpretation of genome-wide transcriptomic data. Adopting ICA as the analysis tool of choice will help understand the phenotype–pathway relationship and thus help elucidate the molecular taxonomy of heterogeneous cancers and of other complex genetic diseases.
The amount of a given transcript or protein in a cell is determined by a balance of expression and repression in a complex network of biological processes. This delicate balance is compromised in complex genetic diseases such as cancer by alterations in the activation patterns of functionally important biological processes known as pathways. In recent years, a large number of microarray experiments profiling the expression levels of more than 20,000 human genes in hundreds of tumor samples have shown that most cancer types are heterogeneous diseases, each characterized by many different expression subtypes. The biological and clinical goal is to explain the observed tumor and clinical heterogeneity in terms of specific patterns of altered pathways. The bioinformatic challenge is therefore to devise mathematical tools that explicitly attempt to infer these altered pathways. To this end, we applied a signal processing tool in a meta-analysis of breast cancer, encompassing more than 800 tumor specimens derived from four different patient cohorts, and showed that this algorithm significantly outperforms popular standard bioinformatics tools in identifying altered pathways underlying breast cancer. These results show that the same tool could be applied to other complex human genetic diseases to better elucidate the underlying altered pathways.
Microarray technology is enabling genetic diseases like cancer to be studied in unprecedented detail, at both transcriptomic and genomic levels. A significant challenge that needs to be overcome to further our understanding of the relation between the quantitative transcriptome of a sample/cell and its phenotype is to unravel the complex mechanism that gives rise to the measured mRNA levels. The amount of a given mRNA transcript in a normal sample/cell is determined by a whole range of biological processes, some of which (e.g., transcription repression and degradation) act to reduce this number, while others (e.g., transcription factor induction) act to increase it. Therefore, it is natural to model the level of a given mRNA transcript as the net sum of a complex superposition of cooperating and counteracting biological processes, and, furthermore, to assume that disease is caused by aberrations in the activation patterns of these biological processes that upset the delicate balance between expression and repression in otherwise healthy tissue. Many distinct biological mechanisms that underlie the aberrations observed in human cancer have been identified, most notably copy-number changes [
While several studies have recently characterised the altered functional pathways and transcriptional regulatory programs in human cancer, they have done so either by interrogating the expression data directly with previously characterised pathways, regulatory modules [
A necessary property of such an algorithm is that it allows “gene-sharing,” so that a specific gene can be part of multiple distinct pathways. In this regard, it is worth noting that popular approaches for analysing transcriptomic data, such as hierarchical or k-means clustering, do not allow for genes to be shared by multiple biological processes, since they place a gene in a single cluster [
Algorithms that allow genes to be part of multiple processes/clusters have also been extensively applied [
Schematic depiction of the ICA model for gene expression.
(A) Measured gene expression variations are caused by alterations in the activation levels of biological pathways. In the ICA model, the gene expression matrix is decomposed into the product of a “source” matrix
(B)
Many studies have shown the value of ICA in the gene expression context as a dimensional reduction and gene-functional discovery tool [
In this work we apply various popular ICA algorithms to six of the largest available microarray cancer datasets. We focus on breast cancer for two reasons. First, for this type of cancer many large patient cohorts that have been profiled with microarrays are available. Second, breast cancer is a highly heterogeneous disease and hence it provides a more challenging (and hence suitable) arena in which to compare and evaluate different methodologies. We also use two large microarray datasets from two other cancer types to show that our results are valid more generally. The aim of our work is twofold. The first aim is to test the ICA paradigm by showing that it significantly outperforms both a gene-sharing method that does not use the statistical independence criterion (PCA) and a traditional (“non–gene-sharing”) clustering method (k-means). We achieve this by using a pathway and regulatory module–based framework for validation. The second aim is to find the most frequently altered pathways and regulatory modules in human breast cancer and to explore their relationship to breast cancer phenotypes.
The main modelling hypothesis underlying the application of ICA to gene expression data is that the expression level of a gene is determined by a linear superposition of biological processes, some of which try to express it, while other contending processes try to suppress it (
To test the modeling hypothesis of ICA for expression data, we first asked how well the inferred components mapped to known pathways, as curated in the MSigDB pathway database [
Breast Cancer Cohorts
The PEI for each of the seven methods (“PCA”, “MVG-KM”, “PCA-KM”, “fastICA”, “JointDiag”, “KernelICA”, and “Radical”) and the four largest breast cancer sets (“Vijver”, “Wang”, “Naderi”, “JRH-2”) are shown in
(A) For each cohort and method, we give the pathway enrichment index, PEI, defined by the fraction of biological pathways (536 in total) found enriched in at least one component.
(B) For each cohort and method, we give the fraction of cancer-signalling and oncogenic pathways (14 in total) successfully mapped by the inferred components.
(C) For each cohort and method, we give the fraction of motif-regulatory gene sets (173 in total) captured by the inferred components.
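The index in panel (A) reduces to a simple count once enrichment calls are available. The following is a minimal sketch, assuming the calls are stored as a boolean pathways-by-components matrix; the function name and the toy matrix are illustrative and are not taken from the original analysis.

```python
import numpy as np


def pathway_enrichment_index(enriched):
    """Fraction of pathways enriched in at least one component.

    enriched: boolean array of shape (n_pathways, n_components), where
    entry [p, c] is True when pathway p was called enriched in component c.
    """
    enriched = np.asarray(enriched, dtype=bool)
    if enriched.ndim != 2 or enriched.shape[0] == 0:
        raise ValueError("expected a non-empty 2-D pathways x components array")
    # A pathway counts once, however many components it is enriched in.
    return float(enriched.any(axis=1).mean())


# Hypothetical toy input: 4 pathways, 3 components.
toy = np.array([
    [True, False, False],   # enriched in one component
    [False, False, False],  # never enriched
    [False, True, True],    # enriched in two components, still counted once
    [False, False, False],  # never enriched
])
pei = pathway_enrichment_index(toy)  # 2 of 4 pathways
```

The same function applies unchanged to the fractions in panels (B) and (C) if the rows are restricted to the 14 cancer-signalling and oncogenic pathways or to the 173 motif-regulatory gene sets.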
To investigate this further, we next compared the algorithms on the subset of nine cancer-signalling pathways from the curated resource NETPATH (
As a further validation that ICA outperforms PCA, we investigated the relation of the derived components with regulatory modules. Specifically, we tested the selected gene sets from each component for enrichment of genes with common regulatory motifs in their promoters and 3′ UTRs [
The results above show that ICA provided a more biologically meaningful decomposition of breast cancer expression data than PCA or KM-based methods. This suggested to us that similar results would hold in other types of cancer. To check this, we analysed two additional large microarray datasets, one profiling 221 lymphomas [
To investigate the robustness of the algorithms, we next compared the ability of the algorithms to identify pathways and regulatory modules that were differentially activated independent of the breast cancer cohort used. Two important observations that were independent of the ICA algorithm and cohort used could be derived from the heatmaps of differential activation of pathways and regulatory modules (
(A) For each method, we compare the number of pathways that were consistently mapped to components across the four major breast cancer studies.
(B) Twenty of the most frequently mapped pathways by ICA. The scores give the average number of ICA components in which the pathway was mapped.
(C) For each method, we give the number of motif-regulatory gene sets consistently mapped to components across the four major breast cancer cohorts.
(D) The 20 most frequently mapped transcription factors/regulatory motifs by ICA. The scores give the average number of ICA components in which the regulatory module of the motif was mapped.
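The two summaries used in these panels, consistency across cohorts and the average number of components per pathway, can be sketched as follows. This is an illustration only: it assumes that "consistently mapped" means mapped to at least one component in every cohort and that the average is taken across cohorts, and all names and counts below are hypothetical.

```python
from statistics import mean


def consistent_pathways(counts_by_cohort):
    """Pathways mapped to at least one component in every cohort.

    counts_by_cohort: dict cohort -> dict pathway -> number of components
    in which that pathway was found enriched.
    """
    mapped_sets = [
        {pathway for pathway, n in counts.items() if n > 0}
        for counts in counts_by_cohort.values()
    ]
    if not mapped_sets:
        return set()
    return set.intersection(*mapped_sets)


def mean_component_count(counts_by_cohort, pathway):
    """Average number of components per cohort in which the pathway was mapped."""
    return mean(counts.get(pathway, 0) for counts in counts_by_cohort.values())


# Hypothetical toy counts (not results from the paper).
toy_counts = {
    "cohort_1": {"pathway_A": 3, "pathway_B": 1},
    "cohort_2": {"pathway_A": 2, "pathway_B": 0},
    "cohort_3": {"pathway_A": 2, "pathway_B": 1},
    "cohort_4": {"pathway_A": 1},
}
consistent = consistent_pathways(toy_counts)            # only pathway_A
score_a = mean_component_count(toy_counts, "pathway_A")  # (3 + 2 + 2 + 1) / 4
```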
Second, we also observed that ICA outperformed PCA, MVG-KM, and PCA-KM in identifying regulatory modules that were consistently differentially activated across cohorts (
We next asked whether components mapping into the various pathways/modules were associated with breast cancer phenotypes. Specifically, we considered three categorical phenotypes: estrogen receptor (ER) status (0,1), histological grade (1,2,3), and outcome (0,1). To evaluate statistical significance of any association between a component
This revealed a complex pattern of significant associations with several components differentiating breast tumours according to ER status and histological grade (
Since we characterised each component in terms of the differential activation pattern of cancer-related pathways and regulatory modules, for those components associated with a phenotype we were able to link the corresponding pathways and regulatory motifs with the phenotype (
For three phenotypes (ER, Grade, Outcome), we show heatmaps of association between phenotypes and selected pathways (A) and selected regulatory motifs (B), as revealed by the four ICA algorithms across the four major breast cancer cohorts. For phenotypes, we used a
(A) For each major breast cancer cohort, we give the heatmap of component expression values for the component enriched for the immune-response pathway characterised in [
(B) For each major breast cancer cohort, we give the heatmap of expression values for the same set of genes as in (A). Thus, the heatmap matrix shown is
Association of Immune Response with Estrogen Receptor Status
(A) For each major breast cancer cohort where grade information was available, we give the heatmap of component expression values for the component enriched for the EMT pathway characterised in [
(B) For each major breast cancer cohort, we give the heatmap of expression values for the same set of genes as in (A). Thus, the heatmap matrix shown is
Association of Epithelial–Mesenchymal Transition with Grade
The parallel analysis for regulatory motifs and breast cancer phenotypes provided direct links between the associated transcription factors and clinical variables (
It is important to point out that ICA facilitated the identification of many of the biological associations in comparison with PCA, MVG-KM, and PCA-KM (
The ability of the various methods to capture novel biological associations between pathways/regulatory modules and phenotypes is represented as a binary heatmap across methods and cohorts. (A) Immune response pathway and ER status, (B) EMT-pathway and grade, (C) IRF and ER status, (D) Neurofibromin-1 and clinical outcome. Black denotes a statistically significant association between a pathway/regulatory module and the phenotype in question, white means no evidence of an association.
Finally, we verified that in many cases the identified associations were independent, in the sense that the component(s) or genes linking a pathway with a phenotype could be different from the one(s) linking another pathway with the same phenotype. For example, we noted that this was the case for the associations of the cell-adhesion and estrogen-signalling pathways with grade (see
Networks are a useful tool for graphically representing relational structures between many layers of organisation. In our application, we sought to construct a network of associations, linking breast cancer phenotypes, pathways, and regulatory modules with each other as the nodes in the network. To represent only the most salient and robust features, we focused attention on those pathways and regulatory modules with most phenotypic associations (
Average association networks shown for ER status (A) and clinical outcome (B). Only edges between phenotypes, pathways, and transcription factors are shown (for the sake of clarity, edges between any two pathways, transcription factors, or phenotypes are not shown). An edge between two nodes was defined if the association between the two nodes was present in at least three out of the four studies, as predicted by the KernelICA algorithm. The diagrams are colour-coded as follows: phenotype (red), pathways (green), and transcription factors/binding motifs (blue).
INFLR, inflammatory response; TM, tyrosine metabolism.
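The edge rule stated in the caption (an association must be present in at least three of the four studies) amounts to a simple vote over per-study edge lists. A minimal sketch, assuming each study's significant associations are already available as pairs of node names; the function name and the toy edges are hypothetical.

```python
from collections import Counter


def consensus_edges(edges_by_study, min_studies=3):
    """Undirected edges present in at least `min_studies` studies.

    edges_by_study: dict study -> iterable of (node_a, node_b) pairs.
    """
    counts = Counter()
    for edges in edges_by_study.values():
        # frozenset ignores edge direction; the set comprehension ensures
        # that a study votes at most once for a given edge.
        counts.update({frozenset(edge) for edge in edges})
    return {edge for edge, n in counts.items() if n >= min_studies}


# Hypothetical toy associations (not results from the paper).
toy_edges = {
    "study_1": [("phenotype", "pathway_A"), ("phenotype", "motif_B")],
    "study_2": [("pathway_A", "phenotype")],
    "study_3": [("phenotype", "pathway_A"), ("phenotype", "motif_B")],
    "study_4": [("phenotype", "pathway_C")],
}
network = consensus_edges(toy_edges)  # keeps only phenotype / pathway_A
```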
In our view, it is most natural to analyse gene expression data in the context of a generative model, however approximate this model is to the true underlying mechanism that gives rise to the measured expression levels. ICA provides such a generative model since it explicitly recognises how the data was generated in the first place. By comparing ICA with PCA and clustering-based methods, we have shown that a more realistic representation of the data is obtained by allowing “gene-sharing” and using the statistical independence criterion (non-linear decorrelation) in the inference process (ICA), as opposed to not allowing gene-sharing (MVG-KM, PCA-KM) and only using a linear decorrelation criterion (PCA). We showed this on a total of six cancer microarray datasets, using existing pathway knowledge and gene regulatory module databases for evaluation. Specifically, we found that ICA components mapped closer to cancer-related pathways as well as to gene modules that are under the control of a common regulatory motif. It is worth pointing out though that the improvement of ICA over KM methods was less marked in the case of regulatory motifs, as we would expect, since a clustering method is partially tailored to finding co-regulatory structure. Importantly, when comparing the results across cohorts, we found that ICA algorithms were much more robust than PCA or KM-based methods, in the sense that pathways that were found to be differentially activated through ICA in one cohort were also consistently differentially activated in the other cohorts. A similar observation could also be made for the regulatory motifs and their regulatees. For example, using PCA or PCA-KM, no regulatory module was found to be differentially activated across all four major breast cancer studies, while the ICA algorithms found an average of four modules. 
The most likely explanation for the relatively smaller number of regulatory modules found in common across the four studies, as compared with pathways, is that many regulatory modules important to breast cancer have yet to be elucidated.
Of note, we also performed the enrichment analysis of the independent components for chromosomal bands (using the MSigDB database), which confirmed that the independent components were not capturing transcriptional programs localised to specific chromosomal regions. Instead, we believe that the inferred independent components encapsulate “net” transcriptional programs that act globally and downstream of the epigenetic and genetic modifications underlying cancer.
We also found that ICA components were associated more often with known breast cancer phenotypes, including clinical outcome, and that these associations were also much stronger for ICA than for PCA. While this result is to be expected, since ICA components map closer to pathways that have been characterised using phenotypic information, one should also bear in mind that these pathways were derived from independent experiments; hence, the stronger associations between components, pathways, and phenotypes as revealed by ICA provide a validation, not only of the algorithm itself, but also of the characterised pathways.
Another important observation was the presence of multiple components showing an association with a particular pathway, regulatory module, or phenotype. This suggests that a significant proportion of pathways are part of multiple biological processes. Alternatively, the presence of multiple components enriched for a given pathway may reflect distinct gene subset selection, which in turn suggests that the pathways in MSigDB and NETPATH may need to be refined further. In the context of phenotypes, the presence of multiple components correlating with ER status, grade, or outcome, is suggestive of tumour heterogeneity, since, more often than not, the differential distribution of the phenotype across samples is dependent on the precise component. Hence, the fingerprint patterns of pathway activation derived from ICA could potentially form the basis for further clinically relevant definitions of breast cancer subtypes.
In an exploratory analysis, ICA revealed many interesting associations between pathways and phenotypes that can form the basis for future investigations. While all methods were able to identify the expected relationships of the estrogen-signalling pathway with ER status and cell-cycle pathway with histological grade, ICA clearly outperformed PCA and KM-clustering in identifying many other biologically relevant associations (
It could be argued that both IR- and cell-adhesion pathways are differentially activated across tumours merely as a result of lymphocytic or stromal contamination, respectively. However, microarray studies profiling breast cancer cell lines (BCL) have shown that genes associated with IR- and cell-adhesion functions are also differentially regulated across cell lines [
Generally, we found that genes selected in the same independent component showed a relatively strong co-expression pattern (
On the other hand, ICA also found “non-trivial” associations, such as the association of the EMT pathway with grade (
In summary, this work is the first to our knowledge to validate the ICA paradigm using a framework based on existing pathway-knowledge and regulatory-module databases. Moreover, it confirms the added value of ICA over PCA and clustering-based methods in identifying novel associations of known pathways and regulatory modules with breast cancer phenotypes. Our results also indicate that larger datasets may be required before a more complete understanding of the ICA model in the gene expression context can be obtained, as well as to understand to what degree ICA can help in defining a more clinically relevant molecular taxonomy of breast cancer.
To test the ICA model, we first generated a comprehensive list of pathways, most of which are known to be directly or indirectly involved in cancer biology. To compile this list, we used the Molecular Signatures Database MSigDB [
We used the sequence-derived regulatory motifs in human promoters and 3′ UTRs from [
Briefly, we review the ICA model [
PCA consists of identifying an orthonormal matrix
A quantitative measure of independence between measurements of random variables, in this case the columns of
The estimation of the number of sources in ICA is a hard outstanding problem. While approaches to estimating the number of sources exist, for example, the Bayesian Information Criterion (BIC) in a maximum likelihood framework [
For each component that is inferred, ICA and PCA yield a corresponding list of genes and signed weights. The ICA model is based on the premise that ICA modes selectively pick out a small percentage of genes (∼1%) that are strongly activated or repressed in response to the deregulation of a particular pathway, while the great majority of genes are unaffected. Mathematically, the distribution of inferred weights must be non-gaussian, and in the gene expression context they must be supergaussian (or leptokurtic), since most of the genes in a mode belong to a gaussian component centred at zero. Thus, to find the genes that are differentially activated, it is conventional to set a threshold, typically two or three standard deviations from the mean, and to pick out those genes whose absolute weights exceed this threshold. Although a more elegant method for determining an appropriate threshold, based on measuring the deviation from normality of the weight distributions, is available [
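The conventional thresholding rule described above can be sketched as follows. This is an illustration under stated assumptions: weights are standardised per component using the sample standard deviation, the default cutoff of three standard deviations is one of the two typical values mentioned, and the function name and toy weights are hypothetical.

```python
import numpy as np


def select_genes(weights, n_sd=3.0):
    """Indices of genes whose weight lies more than n_sd SDs from the mean.

    weights: 1-D array of signed gene weights for one component.
    """
    weights = np.asarray(weights, dtype=float)
    sd = weights.std(ddof=1)
    if sd == 0:
        # All weights identical: nothing stands out.
        return np.array([], dtype=int)
    z = (weights - weights.mean()) / sd
    # Both strongly positive and strongly negative weights are selected.
    return np.flatnonzero(np.abs(z) > n_sd)


# Hypothetical toy component: 99 unaffected genes and one strongly weighted gene.
toy_weights = np.array([0.0] * 99 + [10.0])
selected = select_genes(toy_weights)  # only the last gene (index 99)
```

Lowering `n_sd` to 2.0 selects more genes per component, which is the trade-off implied by the choice between two and three standard deviations.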
To provide an objective comparison of ICA/PCA with clustering methods, the clustering step was preceded by a feature selection step which ensured that all methods selected an approximately equal number of genes. This feature selection step was performed in two different ways. For a given cohort, genes were first ranked according to their expression variance across samples. In the most-variable-genes (MVG) method, the top 15% variable genes were then selected. In the second method, we used all the distinct genes selected through PCA using the 3 sigma threshold. Since this number is less than the total number (i.e., not distinct) of features selected from the PCA components, the remaining distinct genes were selected from the ranked MVG list. Having selected the features via one of the above methods, clustering was then performed using a robust version of k-means clustering, known as partitioning around medoids [
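The most-variable-genes (MVG) filter described above can be sketched as follows, assuming a genes-by-samples expression matrix and the sample variance; the function name, the rounding of the gene count, and the toy matrix are assumptions made for illustration.

```python
import numpy as np


def most_variable_genes(expr, top_fraction=0.15):
    """Row indices of the most variable genes across samples.

    expr: array of shape (n_genes, n_samples).
    top_fraction: fraction of genes to keep (0.15 for the top 15%).
    """
    expr = np.asarray(expr, dtype=float)
    n_keep = max(1, int(round(top_fraction * expr.shape[0])))
    variances = expr.var(axis=1, ddof=1)
    # Sort genes by decreasing variance and keep the leading n_keep.
    order = np.argsort(variances)[::-1]
    return np.sort(order[:n_keep])


# Hypothetical toy matrix: 20 genes, 4 samples, three genes made highly variable.
toy_expr = np.tile([0.0, 0.1, 0.0, 0.1], (20, 1))
toy_expr[[3, 7, 11]] *= 100.0
mvg = most_variable_genes(toy_expr)  # 15% of 20 genes = 3 genes: rows 3, 7, 11
```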
For the genes selected in an ICA or PCA component or for the genes in a given cluster derived from either MVG-KM or PCA-KM, enrichment analysis evaluates whether there is statistically significant enrichment of genes from a given pathway or regulatory module. For a given study
The pathway enrichment index,
(2 KB PDF)
For the breast cancer cohort “Vijver”, we provide a heatmap of association between components/clusters and the most commonly enriched pathways and regulatory modules, as well as the association heatmap between components/clusters and breast cancer phenotypes. The strength of association between a component/cluster and a phenotype is colour-coded as follows:
(171 KB PDF)
For the breast cancer cohort “Wang”, we provide a heatmap of association between components/clusters and the most commonly enriched pathways and regulatory modules, as well as the association heatmap between components/clusters and breast cancer phenotypes. The strength of association between a component/cluster and a phenotype is colour coded as follows:
(162 KB PDF)
For the breast cancer cohort “Naderi”, we provide heatmaps of association between components/clusters and the most commonly enriched pathways and regulatory modules, as well as the association heatmap between components/clusters and breast cancer phenotypes. The strength of association between a component/cluster and a phenotype is colour coded as follows:
(162 KB PDF)
For the breast cancer cohort “JRH-2”, we provide heatmaps of association between components/clusters and the most commonly enriched pathways and regulatory modules, as well as the association heatmap between components/clusters and breast cancer phenotypes. The strength of association between a component/cluster and a phenotype is colour coded as follows:
(168 KB PDF)
Boxplot showing the distribution of weights from an independent component enriched for immune response genes across the basal, luminal, and mesenchymal cell–line subtypes, as defined in [
(3 KB PDF)
(A) For each major breast cancer cohort, we give the heatmap of component expression values for a component enriched for the estrogen-signalling pathway, i.e., the heatmap matrix shown is
(B) For each major breast cancer cohort, we give the heatmap of expression values for the same set of genes as in (A). Thus, the heatmap matrix shown is
CL, cluster labels from 2-means clustering; ER, ER status (black, ER−; grey, ER+). Red denotes relative overexpression, green underexpression.
(235 KB PDF)
(98 KB PDF)
(26 KB TXT)
(0 KB TXT)
This research was supported by a grant from Cancer Research UK (AET, CC) and a grant from the Isaac Newton Trust to Simon Tavare (AET). MJ is a research fellow of the Belgian National Fund for Scientific Research (FNRS). PAA was supported by Microsoft Research through a Microsoft Research Fellowship at Peterhouse, University of Cambridge. This paper presents research results of the Belgian Programme on Interuniversity Attraction Poles, initiated by the Belgian Federal Science Policy Office. The scientific responsibility rests with its authors. We would like to thank Jason Carroll for useful discussions.
CR, cancer related
EMT, epithelial–mesenchymal transition
ER, estrogen receptor
ICA, Independent Component Analysis
IR, immune response
IRF, interferon regulatory factor
MMP, matrix metalloproteinases
MVG, most variable genes
PCA, Principal Components Analysis
SVD, Singular Value Decomposition