The authors have declared that no competing interests exist.
Conceived and designed the experiments: WYC THOY DA. Performed the experiments: WYC THOY DA. Analyzed the data: WYC THOY DA. Contributed reagents/materials/analysis tools: WYC THOY DA. Wrote the paper: WYC DA.
Mining gene expression profiles has proven valuable for identifying signatures serving as surrogates of cancer phenotypes. However, the similarities of such signatures across different cancer types have not been strong enough to conclude that they represent a universal biological mechanism shared among multiple cancer types. Here we present a computational method for generating signatures using an iterative process that converges to one of several precise attractors defining signatures representing biomolecular events, such as cell transdifferentiation or the presence of an amplicon. By analyzing rich gene expression datasets from different cancer types, we identified several such biomolecular events, some of which are universally present in all tested cancer types in nearly identical form. Although the method is unsupervised, we show that it often leads to attractors with strong phenotypic associations. We present several such multi-cancer attractors, focusing on three that are prominent and sharply defined in all cases: a mesenchymal transition attractor strongly associated with tumor stage, a mitotic chromosomal instability attractor strongly associated with tumor grade, and a lymphocyte-specific attractor.
Cancer is known to be characterized by several unifying biological capabilities or “hallmarks.” However, attempts to computationally identify patterns, such as gene expression signatures, shared across many different cancer types have been largely unsuccessful. A typical approach has been to classify samples into mutually exclusive subtypes, each of which is characterized by a particular gene signature. Although occasional similarities of such signatures in different cancer types exist, these similarities have not been sufficiently strong to conclude that they reflect the same biological event. By contrast, we have developed a computational methodology that has identified some signatures of co-expressed genes exhibiting remarkable similarity across many different cancer types. These signatures appear as stable “attractors” of an iterative computational procedure that tends to collect mutually associated genes, so that its convergence can point to the core (“heart”) of the underlying biological co-expression mechanism. One of these “pan-cancer” attractors corresponds to a transdifferentiation of cancer cells empowering them with invasiveness and motility. Another represents a mitotic chromosomal instability of cancer cells. A third attractor is lymphocyte-specific.
Despite their type-specific features, cancers share some common traits, or “hallmarks,” related to, e.g., the abilities of some cancer cells to divide uncontrollably or to invade surrounding tissues.
Gene signatures may occasionally be found to exhibit similarities across multiple cancer types. However, to our knowledge no algorithm has ever produced a set of nearly identical signatures after independently and separately analyzing datasets from different cancer types.
There are various ways by which modules of co-expressed genes can be identified from rich datasets, some of which may be within the context of regulatory network discovery.
The main objective addressed by techniques such as nonnegative matrix factorization (NMF) is to reduce dimensionality by identifying a number of metagenes jointly representing the gene expression dataset as accurately as possible, in lieu of the whole set of individual genes. Each metagene in NMF is defined as a positive linear combination of the individual genes, so that its expression level is an accordingly weighted average of the expression levels of the individual genes. The identity of each resulting metagene is influenced by the presence of the other metagenes, because all of them are jointly optimized toward the objective of overall dimensionality reduction.
By contrast, if the aim is exclusively to identify metagenes as surrogates of biomolecular events, then a fully unconstrained algorithm should be devised, without any effort to achieve dimensionality reduction, classification, mutual exclusivity, orthogonality, regulatory interaction inference, etc.
We can consider, for example, a hypothetical case in which we have found a cluster consisting of a number of co-expressed genes in a rich gene expression dataset. We may wish to scrutinize and “sharpen” this co-expression by trying to identify the “heart” (core) of the genes that are most strongly co-expressed in that case. In the absence of a defining phenotype, we can continue applying an unsupervised methodology, as follows: We can first define a consensus metagene from the average expression levels of all genes in the cluster, and rank all the individual genes in terms of their association (defined numerically by some form of correlation) with that metagene. We can then replace the member genes of the cluster with an equal number of the top-ranked genes. Some of the original genes may naturally remain as members of the cluster, but some may be replaced, as this process will “attract” some other genes that are more strongly correlated with the cluster. We can now define a new metagene as the average of the expression levels of the genes in the newly defined cluster, and re-rank all the individual genes in terms of their association with that new metagene; and so on. It is intuitively reasonable to expect that this iterative process will eventually converge to a cluster that contains precisely the genes that are most associated with the metagene of the same cluster, so that any other individual genes will be less strongly associated with the metagene. We can think of this particular cluster defined by the convergence of this iterative process as an “attractor,” i.e., a module of co-expressed genes to which many other gene sets with close but not identical membership will converge using the same computational methodology.
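The simplified procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sharpen_cluster`, the use of Pearson correlation as the association measure, and the iteration cap are choices made here and are not taken from the paper.

```python
import numpy as np

def sharpen_cluster(expr, seed_idx, max_iter=100):
    """Iteratively replace a gene cluster by the genes most correlated
    with its own average (a simplified, fixed-size illustration).

    expr     : genes x samples expression matrix (assumed: no constant genes)
    seed_idx : indices of the genes in the initial cluster
    Returns (sorted member indices, converged flag).
    """
    expr = np.asarray(expr, dtype=float)
    members = np.sort(np.asarray(seed_idx))
    k = len(members)
    n_samples = expr.shape[1]
    # Standardize every gene once so Pearson correlation becomes a dot product.
    z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Consensus metagene: plain average of the current member genes.
        metagene = expr[members].mean(axis=0)
        m = (metagene - metagene.mean()) / metagene.std()
        corr = z @ m / n_samples
        # Keep the k genes most associated with the metagene.
        new_members = np.sort(np.argsort(-corr)[:k])
        if np.array_equal(new_members, members):
            return members, True  # the cluster reproduces itself
        members = new_members
    return members, False  # cap reached without a stable cluster
```

Starting from a seed that mixes genuinely co-expressed genes with unrelated ones, the unrelated genes are expected to be replaced by the remaining co-expressed genes within a few iterations. The iteration cap guards against the possibility that the cluster membership oscillates instead of settling.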
The above description represents a simplified conceptual introduction to the computational methodology presented in this paper. Rather than using the average of the expression values in gene clusters of a particular size, the “attractors” are metagenes defined as weighted averages of all genes where each individual gene has a nonnegative weight, just like the metagenes defined using NMF
This methodology is totally unsupervised, as it does not make use of any phenotypic association. As we show in this paper, however, once identified, a metagene attractor is likely to be found associated with a phenotype.
We found that several attractor metagenes are present in nearly identical form in multiple cancer types. This provides an additional opportunity to combine the power of a large number of rich datasets to focus, at an even sharper level, on the core genes of the underlying mechanism. For example, this methodology can narrow the causal (driver) oncogenes within amplicons down to very few candidate genes. Importantly, this can be done from rich gene expression data, which already exist in abundance, without making any use of sequencing data.
We identified several attractors, each of which has the potential to lead to corresponding testable biological hypotheses after scrutinizing their top-ranked genes and finding a putative underlying mechanism. For the purposes of this paper we present the general methodology for the benefit of the research community together with a listing of the attractors in six datasets from three cancer types (ovarian, colon, breast).
Here, we focus on a few interesting cancer-associated attractors that we found present in multiple cancer types. Particular emphasis is given to what we consider to be three key “bioinformatic hallmarks” of cancer, related to the ability of cancer cells to invade surrounding tissues, the ability of cancer cells to divide uncontrollably, and the ability of the organism to recruit the immune system to fight cancer: a tumor stage-associated mesenchymal transition attractor, a tumor grade-associated mitotic chromosomal instability (mitotic CIN) attractor, and a lymphocyte-specific attractor.
Given a nonnegative measure
According to this definition, the genes with the highest weights in an attractor metagene will have the highest association with the metagene (and, by implication, they will tend to be highly associated among themselves) and so they will often represent a biomolecular event reflected by the co-expression of these top genes. This can happen, e.g., when a biological mechanism is activated, or when a copy number variation (CNV), such as an amplicon, is present, in some of the samples included in the expression matrix. In the following we use the term “attractor” for simplicity to refer to an attractor metagene, and the term “top genes” to refer to the genes with the highest weights in the attractor. The definition of an attractor metagene can readily be generalized to include features other than gene expression, such as methylation values. It can also be used in datasets of any objects (not necessarily genes) characterized by any type of feature vectors, with applications in other disciplines, such as social and economic sciences.
The computational problem of identifying attractor metagenes given an expression matrix can be addressed heuristically using a simple iterative process: Starting from a particular seed (or “attractee”) metagene
This algorithmic behavior with nice convergence properties is not surprising, because if a metagene represents co-expressed genes, then the next iteration will naturally “attract” other similarly co-expressed genes, and so forth, until there are no other genes more associated with the top genes than those genes themselves. Furthermore, the set of the few genes with the highest weights is likely to represent the “heart” (core) of the underlying biomolecular event. In support of this concept, the association of any of the top-ranked individual genes with the attractor metagene is consistently and significantly higher than the pairwise association between any of these genes, suggesting that the set of these top genes jointly comprises a proxy representing a biomolecular event better than each of the individual genes would.
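The weighted version of the iteration can be illustrated with the following minimal sketch. Several elements are assumptions made here for illustration only: the names `association` and `find_attractor`, the use of Pearson correlation clipped at zero as a stand-in for the nonnegative association measure, the single-gene seed, the exponent value of 5, and the convergence tolerance.

```python
import numpy as np

def association(metagene, expr):
    """Nonnegative association of every gene with the metagene.
    ASSUMPTION: Pearson correlation clipped at zero is used here as a
    simple stand-in for the paper's association measure."""
    z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
    m = (metagene - metagene.mean()) / metagene.std()
    return np.clip(z @ m / expr.shape[1], 0.0, None)

def find_attractor(expr, seed_gene, exponent=5.0, tol=1e-7, max_iter=200):
    """Iterate 'weights -> metagene -> new weights' until the weights stop changing.

    expr      : genes x samples matrix (assumed: no constant genes)
    seed_gene : index of the gene used as the starting point
    exponent  : illustrative value; larger values sharpen the weights
    Returns (weight vector, converged flag).
    """
    expr = np.asarray(expr, dtype=float)
    weights = np.zeros(expr.shape[0])
    weights[seed_gene] = 1.0  # seed: all weight on a single gene
    for _ in range(max_iter):
        # Metagene: weighted average of all genes.
        metagene = weights @ expr / weights.sum()
        # New weights: a power of each gene's association with the metagene.
        new_weights = association(metagene, expr) ** exponent
        if np.max(np.abs(new_weights - weights)) < tol:
            return new_weights, True
        weights = new_weights
    return weights, False
```

Sorting the genes by the returned weights gives the ranked list of top genes. In this sketch, seeds that belong to the same co-expressed module are expected to lead to essentially the same weight vector.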
Indeed, related versions of the signatures identified by attractors in this paper have been previously identified in various contexts in individual cancer types, often intermingled with additional genes. However, the contribution of our work is that these signatures are found as pan-cancer biomolecular events, sharply pointing to the underlying mechanism. Therefore the top genes of the attractors should be appropriate for use as biomarkers or for understanding the underlying biology. For example, one of the attractors that we identified (the “mitotic chromosomal instability” attractor, described below) has previously been found in approximate forms among sets of genes described generally
A reasonable implementation of an “exhaustive” search of attractor metagenes is to start from each individual gene as a seed (“attractee”) assigning a weight of 1 to that gene, and 0 to all the other genes. Each gene participating in a particular co-expression event will then lead to the same attractor when used as attractee. The computational implementation of the algorithm is described in
We analyzed six datasets, two from ovarian cancer, two from breast cancer and two from colon cancer (Supplementary
This attractor contains mostly epithelial-mesenchymal transition (EMT)-associated genes.
Rank | Gene Symbol | Avg MI | Rank | Gene Symbol | Avg MI |
1 | COL5A2 | 0.814 | 51 | SULF1 | 0.505 |
2 | VCAN | 0.775 | 52 | LOXL1 | 0.502 |
3 | SPARC | 0.766 | 53 | PRRX1 | 0.502 |
4 | THBS2 | 0.758 | 54 | PPAPDC1A | 0.499 |
5 | FBN1 | 0.749 | 55 | COL10A1 | 0.498 |
6 | COL1A2 | 0.749 | 56 | ITGA11 | 0.495 |
7 | COL5A1 | 0.747 | 57 | NTM | 0.494 |
8 | FAP | 0.734 | 58 | MXRA8 | 0.494 |
9 | AEBP1 | 0.711 | 59 | FIBIN | 0.493 |
10 | CTSK | 0.709 | 60 | WISP1 | 0.483 |
11 | COL3A1 | 0.688 | 61 | RCN3 | 0.483 |
12 | COL1A1 | 0.683 | 62 | TNFAIP6 | 0.481 |
13 | SERPINF1 | 0.674 | 63 | ECM2 | 0.480 |
14 | COL6A3 | 0.669 | 64 | HTRA1 | 0.480 |
15 | CDH11 | 0.663 | 65 | EFEMP2 | 0.478 |
16 | GLT8D2 | 0.658 | 66 | MXRA5 | 0.474 |
17 | LUM | 0.654 | 67 | ACTA2 | 0.472 |
18 | MMP2 | 0.654 | 68 | LOX | 0.470 |
19 | DCN | 0.650 | 69 | ITGBL1 | 0.466 |
20 | CCDC80 | 0.637 | 70 | PMP22 | 0.465 |
21 | POSTN | 0.631 | 71 | P4HA3 | 0.464 |
22 | CTHRC1 | 0.616 | 72 | PTRF | 0.463 |
23 | ADAM12 | 0.613 | 73 | CALD1 | 0.460 |
24 | COL6A2 | 0.608 | 74 | HEG1 | 0.458 |
25 | MSRB3 | 0.608 | 75 | NEXN | 0.455 |
26 | OLFML2B | 0.607 | 76 | NID2 | 0.455 |
27 | INHBA | 0.600 | 77 | TAGLN | 0.455 |
28 | FSTL1 | 0.600 | 78 | FAM26E | 0.452 |
29 | SFRP2 | 0.596 | 79 | ZNF521 | 0.452 |
30 | SNAI2 | 0.577 | 80 | SFRP4 | 0.451 |
31 | CRISPLD2 | 0.574 | 81 | PALLD | 0.450 |
32 | PCOLCE | 0.571 | 82 | OLFML1 | 0.447 |
33 | PDGFRB | 0.567 | 83 | FILIP1L | 0.447 |
34 | BGN | 0.565 | 84 | TIMP3 | 0.445 |
35 | COL12A1 | 0.560 | 85 | SPON2 | 0.443 |
36 | ANGPTL2 | 0.555 | 86 | SPOCK1 | 0.443 |
37 | COPZ2 | 0.553 | 87 | COL8A2 | 0.441 |
38 | CMTM3 | 0.549 | 88 | GPC6 | 0.438 |
39 | ASPN | 0.547 | 89 | PDPN | 0.437 |
40 | FN1 | 0.545 | 90 | GFPT2 | 0.436 |
41 | CNRIP1 | 0.540 | 91 | LHFP | 0.436 |
42 | FNDC1 | 0.538 | 92 | GREM1 | 0.436 |
43 | LRRC15 | 0.533 | 93 | TGFB1I1 | 0.435 |
44 | COL11A1 | 0.529 | 94 | C1S | 0.433 |
45 | ANTXR1 | 0.528 | 95 | EDNRA | 0.432 |
46 | RAB31 | 0.527 | 96 | GAS1 | 0.431 |
47 | FRMD6 | 0.524 | 97 | NOX4 | 0.431 |
48 | TSHZ3 | 0.520 | 98 | FBLN2 | 0.428 |
49 | THY1 | 0.519 | 99 | TCF4 | 0.428 |
50 | NNMT | 0.519 | 100 | NUAK1 | 0.427 |
The consistency of the attractor is established by the fact (Supplementary
This is a stage-associated attractor, in which the signature is significantly present only when a particular level of invasive stage, specific to each cancer type, has been reached. Supplementary
This attractor has been previously identified with remarkable accuracy as representing a particular kind of mesenchymal transition of cancer cells present in all types of solid cancers tested leading to a published list of top 64 genes
Although similar signatures are often labeled as “stromal,” because they contain many stromal markers such as α-SMA and fibroblast activation protein, the fact that most of the genes of the signature were expressed by xenografted cancer cells
The only EMT-inducing transcription factor found upregulated in the xenograft model
The expression of the mesenchymal transition attractor indicates that the tumor is actively invasive at the specific sample site, so its prognostic value is cancer type and stage specific. As an example, we analyzed an oral squamous cell carcinoma dataset deposited in the Gene Expression Omnibus (GEO) under accession number GSE25104. The corresponding Kaplan-Meier survival curve (
Gene expression data from 57 patients (GSE25104) were divided into two groups, high and low mesenchymal transition metagene level, depending on whether the metagene expression value exceeded the mean of the 57 patients. The
This attractor contains mostly kinetochore-associated genes.
Rank | Gene Symbol | Avg MI | Rank | Gene Symbol | Avg MI |
1 | CENPA | 0.720 | 51 | CDCA8 | 0.532 |
2 | DLGAP5 | 0.693 | 52 | CDC45 | 0.528 |
3 | MELK | 0.684 | 53 | KIF18A | 0.524 |
4 | BUB1 | 0.674 | 54 | HMMR | 0.506 |
5 | KIF2C | 0.660 | 55 | TOP2A | 0.505 |
6 | KIF20A | 0.658 | 56 | CENPF | 0.503 |
7 | KIF4A | 0.656 | 57 | ZWINT | 0.503 |
8 | CCNA2 | 0.654 | 58 | PLK1 | 0.501 |
9 | CCNB2 | 0.652 | 59 | RAD51AP1 | 0.501 |
10 | NCAPG | 0.649 | 60 | FAM83D | 0.498 |
11 | TTK | 0.642 | 61 | E2F8 | 0.497 |
12 | CEP55 | 0.638 | 62 | CENPE | 0.497 |
13 | CCNB1 | 0.632 | 63 | MKI67 | 0.492 |
14 | CDK1 | 0.629 | 64 | CENPN | 0.491 |
15 | HJURP | 0.626 | 65 | MAD2L1 | 0.489 |
16 | CDC20 | 0.624 | 66 | CHEK1 | 0.486 |
17 | CDCA5 | 0.615 | 67 | GTSE1 | 0.477 |
18 | NCAPH | 0.615 | 68 | RAD51 | 0.475 |
19 | BUB1B | 0.609 | 69 | SGOL2 | 0.474 |
20 | KIF23 | 0.592 | 70 | PARPBP | 0.469 |
21 | KIF11 | 0.591 | 71 | TRIP13 | 0.467 |
22 | BIRC5 | 0.589 | 72 | SHCBP1 | 0.465 |
23 | NUF2 | 0.587 | 73 | DTL | 0.465 |
24 | TPX2 | 0.586 | 74 | CENPL | 0.462 |
25 | AURKB | 0.582 | 75 | FEN1 | 0.461 |
26 | RACGAP1 | 0.580 | 76 | FANCI | 0.461 |
27 | NUSAP1 | 0.580 | 77 | FBXO5 | 0.459 |
28 | ASPM | 0.579 | 78 | ECT2 | 0.457 |
29 | MCM10 | 0.579 | 79 | MND1 | 0.456 |
30 | PRC1 | 0.576 | 80 | CDC25C | 0.456 |
31 | DEPDC1B | 0.572 | 81 | PBK | 0.456 |
32 | UBE2C | 0.569 | 82 | KPNA2 | 0.452 |
33 | UBE2T | 0.567 | 83 | RAD54L | 0.452 |
34 | NEK2 | 0.566 | 84 | ESPL1 | 0.447 |
35 | FOXM1 | 0.565 | 85 | CDCA2 | 0.446 |
36 | NDC80 | 0.556 | 86 | FAM64A | 0.440 |
37 | CDCA3 | 0.556 | 87 | CENPK | 0.436 |
38 | FAM54A | 0.553 | 88 | MYBL2 | 0.435 |
39 | ANLN | 0.551 | 89 | SPAG5 | 0.434 |
40 | KIF15 | 0.548 | 90 | EZH2 | 0.431 |
41 | STIL | 0.547 | 91 | SMC4 | 0.430 |
42 | EXO1 | 0.542 | 92 | TACC3 | 0.428 |
43 | AURKA | 0.540 | 93 | C11orf82 | 0.427 |
44 | PTTG1 | 0.539 | 94 | MASTL | 0.426 |
45 | OIP5 | 0.539 | 95 | ASF1B | 0.426 |
46 | RRM2 | 0.539 | 96 | PTTG3P | 0.425 |
47 | DEPDC1 | 0.539 | 97 | CENPW | 0.424 |
48 | CDKN3 | 0.538 | 98 | ORC1 | 0.424 |
49 | KIF14 | 0.537 | 99 | HELLS | 0.422 |
50 | SPC25 | 0.534 | 100 | TK1 | 0.421 |
The consistency of the attractor is established by the fact (Supplementary
Contrary to the stage-associated mesenchymal transition attractor, this is a grade-associated attractor, in which the signature is significantly present only when an intermediate level of tumor grade is reached. Supplementary
This attractor is associated with chromosomal instability (CIN), as evidenced from the fact that another similar gene set comprising a “signature of chromosomal instability”
The attractor is characterized by overexpression of kinetochore-associated genes, which is known
Among transcription factors, we found
Inactivation of the retinoblastoma (RB) tumor suppressor promotes CIN
This attractor consists mainly of lymphocyte-specific genes with prominent presence of
Rank | Gene Symbol | Avg MI | Rank | Gene Symbol | Avg MI |
1 | PTPRC | 0.782 | 51 | NCF1 | 0.560 |
2 | CD53 | 0.768 | 52 | CCL5 | 0.557 |
3 | LCP2 | 0.739 | 53 | LST1 | 0.557 |
4 | LAPTM5 | 0.708 | 54 | CD3D | 0.553 |
5 | DOCK2 | 0.699 | 55 | RCSD1 | 0.548 |
6 | IL10RA | 0.699 | 56 | FGL2 | 0.538 |
7 | CYBB | 0.698 | 57 | HCST | 0.538 |
8 | CD48 | 0.691 | 58 | MARCH1 | 0.538 |
9 | ITGB2 | 0.679 | 59 | FERMT3 | 0.536 |
10 | EVI2B | 0.675 | 60 | FCGR2B | 0.533 |
11 | MS4A6A | 0.673 | 61 | GIMAP5 | 0.530 |
12 | TFEC | 0.659 | 62 | MYO1F | 0.530 |
13 | SLA | 0.657 | 63 | KLHL6 | 0.530 |
14 | TNFSF13B | 0.657 | 64 | GIMAP1 | 0.527 |
15 | C1orf162 | 0.656 | 65 | CD163 | 0.524 |
16 | SAMSN1 | 0.652 | 66 | CLEC7A | 0.522 |
17 | PLEK | 0.649 | 67 | CCR1 | 0.518 |
18 | GMFG | 0.647 | 68 | GBP5 | 0.517 |
19 | GIMAP4 | 0.647 | 69 | NCF2 | 0.516 |
20 | SASH3 | 0.645 | 70 | HLA-DPA1 | 0.516 |
21 | EVI2A | 0.638 | 71 | RNASE6 | 0.515 |
22 | SRGN | 0.638 | 72 | CD14 | 0.515 |
23 | AIF1 | 0.636 | 73 | FAM26F | 0.511 |
24 | LAIR1 | 0.627 | 74 | CD4 | 0.510 |
25 | FYB | 0.625 | 75 | FCGR1A | 0.506 |
26 | FCER1G | 0.623 | 76 | GZMA | 0.506 |
27 | MPEG1 | 0.621 | 77 | GPR183 | 0.505 |
28 | CD86 | 0.621 | 78 | CD84 | 0.505 |
29 | C3AR1 | 0.611 | 79 | NKG7 | 0.504 |
30 | C1QB | 0.608 | 80 | C1QA | 0.502 |
31 | CD2 | 0.606 | 81 | CD300LF | 0.500 |
32 | HCLS1 | 0.599 | 82 | FPR3 | 0.499 |
33 | HCK | 0.592 | 83 | PARVG | 0.496 |
34 | MNDA | 0.587 | 84 | TRAF3IP3 | 0.494 |
35 | CD37 | 0.587 | 85 | TYROBP | 0.492 |
36 | LY96 | 0.585 | 86 | LPXN | 0.492 |
37 | CCR5 | 0.585 | 87 | GIMAP8 | 0.492 |
38 | ARHGAP9 | 0.580 | 88 | MS4A7 | 0.490 |
39 | CD52 | 0.580 | 89 | IL2RB | 0.489 |
40 | GPR65 | 0.580 | 90 | CD300A | 0.488 |
41 | GIMAP6 | 0.578 | 91 | IGSF6 | 0.488 |
42 | SLAMF8 | 0.577 | 92 | SELPLG | 0.488 |
43 | WIPF1 | 0.577 | 93 | FCGR2A | 0.487 |
44 | MS4A4A | 0.574 | 94 | NCKAP1L | 0.483 |
45 | ARHGAP15 | 0.573 | 95 | DOK2 | 0.483 |
46 | HAVCR2 | 0.567 | 96 | CD247 | 0.481 |
47 | ARHGAP30 | 0.566 | 97 | SELL | 0.480 |
48 | CLEC4A | 0.566 | 98 | GZMK | 0.479 |
49 | TAGAP | 0.564 | 99 | CCR2 | 0.479 |
50 | CYTIP | 0.563 | 100 | LY86 | 0.479 |
The gene membership of the attractor provides hints about the underlying immune mechanism, which could be valuable towards generating hypotheses for potential immunotherapies such as adoptive transfer of lymphocytes. For example, the presence of the signal-transducing
By analyzing the METABRIC discovery breast cancer dataset, we found that each of the above three main attractors is, under particular conditions, highly prognostic in breast cancer
In breast cancer, the mesenchymal transition attractor is expressed very early, as cancer becomes invasive. The presence of the attractor in a particular sample of high-stage tumor is not as informative, because of heterogeneity. On the other hand, we found that the presence of the attractor in early-stage tumors is highly prognostic, consistent with the hypothesis that it indicates increased invasiveness. As shown in
The mesenchymal transition attractor metagene is most prominent in the early stage of breast cancer. The survival curve of the full dataset is insignificant (left). However, when the samples are restricted to only those at early stage (with no positive lymph nodes and tumor size less than 30 mm), the association between the mesenchymal transition attractor and the survival becomes significant (right), with
The expression of the mitotic CIN attractor indicates that the tumor is dividing uncontrollably and therefore, in all cases, the attractor is highly prognostic for survival. The corresponding Kaplan-Meier 15-year survival curve (
To evaluate the association between the mitotic CIN metagene expression and the 15-year survival, patients were divided into two groups: high mitotic CIN and low mitotic CIN. This binary expression level was determined by whether the mitotic CIN metagene expression value exceeded the mean of the patients. The
Rank | Gene Symbol | Concordance Index | Rank | Gene Symbol | Concordance Index |
1 |  | 0.670 | 51 | PRR11 | 0.639 |
2 |  | 0.663 | 52 | LOC651816 | 0.638 |
3 |  | 0.662 | 53 | KRT80 | 0.638 |
4 | TROAP | 0.661 | 54 | C15orf42 | 0.637 |
5 |  | 0.659 | 55 | SGOL1 | 0.637 |
6 |  | 0.658 | 56 | GPI | 0.637 |
7 |  | 0.657 | 57 |  | 0.637 |
8 | SHMT2 | 0.655 | 58 |  | 0.636 |
9 |  | 0.655 | 59 | PKMYT1 | 0.635 |
10 |  | 0.653 | 60 |  | 0.635 |
11 |  | 0.653 | 61 | C20orf24 | 0.635 |
12 |  | 0.653 | 62 | SPC24 | 0.635 |
13 | ORC6 | 0.653 | 63 | RIPK4 | 0.635 |
14 |  | 0.653 | 64 | TOMM40 | 0.634 |
15 | C1orf106 | 0.652 | 65 |  | 0.634 |
16 |  | 0.652 | 66 | ADRM1 | 0.634 |
17 |  | 0.651 | 67 |  | 0.633 |
18 | STIP1 | 0.651 | 68 |  | 0.633 |
19 |  | 0.649 | 69 | AIF1L | 0.633 |
20 |  | 0.649 | 70 | MRPS5 | 0.633 |
21 | GARS | 0.649 | 71 | GPR56 | 0.633 |
22 |  | 0.649 | 72 | PEX13 | 0.633 |
23 | UCK2 | 0.648 | 73 | ENO1 | 0.633 |
24 |  | 0.648 | 74 | NUTF2 | 0.633 |
25 |  | 0.647 | 75 | MEMO1 | 0.632 |
26 | CBX2 | 0.646 | 76 | TXNRD1 | 0.632 |
27 | CCNE1 | 0.646 | 77 | SLC7A5 | 0.631 |
28 |  | 0.646 | 78 |  | 0.631 |
29 |  | 0.645 | 79 |  | 0.631 |
30 |  | 0.645 | 80 | PPP1R14B | 0.631 |
31 | GMPSP1 | 0.645 | 81 |  | 0.630 |
32 |  | 0.645 | 82 | C20orf24 | 0.630 |
33 |  | 0.644 | 83 | SGOL1 | 0.630 |
34 |  | 0.643 | 84 | NUP93 | 0.630 |
35 |  | 0.643 | 85 | ZNF695 | 0.630 |
36 |  | 0.643 | 86 |  | 0.630 |
37 | LOC731049 | 0.642 | 87 |  | 0.630 |
38 | POLQ | 0.642 | 88 | SOX11 | 0.630 |
39 | GSK3B | 0.642 | 89 |  | 0.629 |
40 | CCNE1 | 0.642 | 90 | SLC52A2 | 0.629 |
41 |  | 0.641 | 91 | AIF1L | 0.629 |
42 |  | 0.641 | 92 |  | 0.629 |
43 | LAD1 | 0.641 | 93 | CDC25A | 0.629 |
44 |  | 0.641 | 94 |  | 0.628 |
45 | SAPCD2 | 0.641 | 95 | TMEM132A | 0.628 |
46 |  | 0.641 | 96 |  | 0.628 |
47 | POLR2D | 0.641 | 97 | NACC2 | 0.628 |
48 | CKAP2L | 0.640 | 98 |  | 0.628 |
49 |  | 0.640 | 99 | SNRPA1 | 0.628 |
50 | ECE2 | 0.639 | 100 | MMP15 | 0.628 |
The 47 underlined genes are also among the top 100 genes of the mitotic CIN attractor (
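The table above ranks genes by concordance index. As a minimal sketch of how such an index can be computed from survival data, the following function assumes Harrell's definition for right-censored observations. The function name and the handling of ties are choices made here for illustration and may differ from the computation actually used for the table.

```python
from itertools import combinations

def concordance_index(times, events, scores):
    """Fraction of comparable patient pairs that a score orders correctly.

    times  : observed follow-up times
    events : 1/True if the event (death) was observed, 0/False if censored
    scores : predictor values; a higher score is taken to mean higher risk
    ASSUMPTIONS: pairs with identical times are skipped, and tied scores
    count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Reorder so that patient i has the shorter observed time.
        if times[j] < times[i]:
            i, j = j, i
        # A pair is comparable only if the shorter time ended in an event.
        if times[i] == times[j] or not events[i]:
            continue
        comparable += 1
        if scores[i] > scores[j]:
            concordant += 1.0
        elif scores[i] == scores[j]:
            concordant += 0.5
    return concordant / comparable  # raises ZeroDivisionError if nothing is comparable
```

A value of 0.5 corresponds to a predictor that carries no ordering information, and values further from 0.5 indicate a stronger association with survival time.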
We found the attractor to be strongly protective in ER-negative breast cancers. As shown in
For ER-negative patients, the expression of the attractor is highly protective (high expression implies longer survival, left). However, when multiple lymph nodes are already affected, the expression of the attractor has a reversed effect on survival. When we restrict the samples to those with more than five positive lymph nodes, higher expression of the lymphocyte-specific attractor implies shorter survival (right), although the association is not significant due to the limited number of samples (76).
Amplification in chr8q24 is often considered to be associated with cancer because of the presence of the
We found, however, that the core of the amplified genes occurs at location 8q24.3 and this is, in fact, our most prominent multi-cancer amplicon attractor. Core genes of the attractor are
The top ten genes of the chr8q24.3 attractor, ranked by the average of the highest five values of mutual information (
Rank | Gene Symbol | Avg MI of Top 4 Datasets | Rank | Gene Symbol | Avg MI of Top 4 Datasets |
1 | EXOSC4 | 0.716 | 1 | PGAP3 | 0.794 |
2 | PUF60 | 0.659 | 2 | ERBB2 | 0.793 |
3 | BOP1 | 0.653 | 3 | STARD3 | 0.768 |
4 | SLC52A2 | 0.639 | 4 | MIEN1 | 0.764 |
5 | SHARPIN | 0.634 | 5 | GRB7 | 0.718 |
6 | HSF1 | 0.616 | 6 | PSMD3 | 0.602 |
7 | FBXL6 | 0.615 | 7 | GSDMB | 0.539 |
8 | CYC1 | 0.608 | 8 | ORMDL3 | 0.498 |
9 | SCRIB | 0.552 | 9 | MED24 | 0.414 |
10 | GPAA1 | 0.551 | 10 | MED1 | 0.400 |
Furthermore, prognostic associations involving the 8q24.3 amplicon have recently been recognized in various cancers
This amplicon is prominent in breast cancer
In addition to the narrow
We found this attractor clearly present only in breast cancer, and therefore we derived it using six breast cancer datasets (GSE2034, GSE3494, GSE31448, GSE32646, GSE36771, breast TCGA).
Rank | Gene Symbol | Avg MI | Rank | Gene Symbol | Avg MI |
1 | AGR3 | 0.847 | 26 | ERBB4 | 0.393 |
2 | CA12 | 0.616 | 27 | AR | 0.383 |
3 | FOXA1 | 0.613 | 28 | P4HTM | 0.383 |
4 | GATA3 | 0.585 | 29 | SLC44A4 | 0.380 |
5 | MLPH | 0.580 | 30 | KDM4B | 0.375 |
6 | AGR2 | 0.570 | 31 | GFRA1 | 0.374 |
7 | ESR1 | 0.543 | 32 | MAPT | 0.370 |
8 | TBC1D9 | 0.540 | 33 | MYB | 0.364 |
9 | XBP1 | 0.460 | 34 | DACH1 | 0.359 |
10 | ANXA9 | 0.456 | 35 | SLC7A8 | 0.359 |
11 | PRR15 | 0.452 | 36 | MAGED2 | 0.358 |
12 | SCUBE2 | 0.444 | 37 | FBP1 | 0.357 |
13 | FSIP1 | 0.438 | 38 | SLC22A5 | 0.355 |
14 | TFF3 | 0.429 | 39 | CMBL | 0.346 |
15 | SPDEF | 0.429 | 40 | DYNLRB2 | 0.346 |
16 | NAT1 | 0.428 | 41 | C6orf211 | 0.342 |
17 | ABAT | 0.423 | 42 | GREB1 | 0.342 |
18 | CCDC170 | 0.422 | 43 | SIDT1 | 0.338 |
19 | DNALI1 | 0.418 | 44 | TTC39A | 0.330 |
20 | DEGS2 | 0.415 | 45 | FAM214A | 0.326 |
21 | DNAJC12 | 0.411 | 46 | IL6ST | 0.324 |
22 | SLC39A6 | 0.406 | 47 | CXXC5 | 0.323 |
23 | CAPN8 | 0.399 | 48 | ACADSB | 0.323 |
24 | TFF1 | 0.397 | 49 | CELSR1 | 0.322 |
25 | THSD4 | 0.395 | 50 | CLSTN2 | 0.322 |
The scope of the algorithm identifying attractor metagenes is different from that of other unsupervised methods, which are usually aimed at identifying subtypes or mutually exclusive clusters. Nevertheless, it is of interest to examine to what extent other algorithms can produce multiple cancer signatures, each of which appears in nearly identical form across different types. We applied three widely used methods (k-means clustering, principal component analysis, and hierarchical clustering) to the six cancer datasets used in this paper. In all cases, we listed the top fifty genes in each cluster and applied the same clustering algorithm as in the main text to find common genes among them and group them together. The results are shown in Supplementary
A biomolecular event, whether it is present in multiple cancer types or it is cancer specific, can be represented by a “consensus attractor metagene” after analyzing multiple datasets. To generate such consensus attractors, we use genes that were profiled by at least three of the six datasets, then rank individual genes in terms of their average mutual information (
For example,
The two metagenes were defined to be “consensus attractors” after ranking individual genes in terms of their average mutual information with the corresponding attractor metagenes, across all datasets, and selecting the genes having average mutual information greater than 0.5. These criteria led to 59 genes in the consensus mitotic CIN attractor (the top 59 genes in
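The selection rule for consensus attractors can be sketched as follows. The function name `consensus_attractor` and the input layout (a genes-by-datasets array with NaN marking genes that a dataset did not profile) are assumptions made for illustration. Averaging only over the datasets in which a gene was profiled is also an assumption.

```python
import numpy as np

def consensus_attractor(mi, genes, min_datasets=3, threshold=0.5):
    """Rank genes by average mutual information with the attractor metagene.

    mi    : genes x datasets array; NaN where a gene was not profiled
    genes : gene symbols, one per row of `mi`
    Returns [(gene, average MI), ...] sorted by decreasing average MI.
    """
    mi = np.asarray(mi, dtype=float)
    # Keep only genes profiled by enough datasets.
    profiled = np.sum(~np.isnan(mi), axis=1)
    keep = profiled >= min_datasets
    avg = np.full(mi.shape[0], np.nan)
    # ASSUMPTION: the average is taken over the datasets that profiled the gene.
    avg[keep] = np.nanmean(mi[keep], axis=1)
    # argsort places NaN entries last; they are also excluded by `keep`.
    order = [i for i in np.argsort(-avg) if keep[i] and avg[i] > threshold]
    return [(genes[i], float(avg[i])) for i in order]
```

With the default arguments the function mirrors the criteria stated in the text: at least three of the six datasets and an average mutual information greater than 0.5.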
Gene expression analysis has resulted in several cancer types being further classified into subtypes labeled, e.g. as “mesenchymal” or “proliferative.” Such characterizations, however, may sometimes simply reflect the presence of the mesenchymal transition attractor or the mitotic chromosomal instability attractor, respectively, in some of the analyzed samples. Similar subtype characterizations across cancer types often share several common genes, but the consistency of these similarities has not been significantly high.
By contrast, using an unconstrained algorithm independent of subtype classification or dimensionality reduction, we identified several attractors exhibiting remarkable consistency across many cancer types, suggesting that each of them represents a precise biological phenomenon present in multiple cancers.
We found that the mesenchymal transition attractor is significantly present only in samples whose stage designation has exceeded a threshold, but not in all such samples. On the other hand, the absence of the mesenchymal transition attractor in a profiled high-stage sample (or the absence of the mitotic chromosomal instability attractor in a profiled high-grade sample) does not necessarily mean that the attractor is not present in other locations of the same tumor. Indeed, it is increasingly appreciated
Existing molecular marker products make use of multigene assays that have been derived from phenotypic associations in particular cancer types. For breast cancer, biomarkers such as Oncotype DX
We envision, instead, a multi-cancer biomarker product that will include detection of the level of expression of each of the key attractor metagenes. These levels would need to be combined in different ways in different cancer types, but each of the metagenes would indicate the same attribute and the contributions of each component will be cleanly distinguished. Even though molecular marker genes in some existing products are already separated into groups that are related to our attractor designation, any improvement in diagnostic, prognostic, or predictive accuracy resulting from better such group designation and better choice of genes in each group would be highly desirable. We hope that the identification of the attractors of cancer, as presented here, will be valuable in that regard.
The full code of the attractor finding algorithm is publicly available in the Sage Bionetworks Synapse platform at
We chose the association measure
At one extreme, if
We empirically found that an appropriate choice of
As mentioned in the
Identified attractors can be ranked in various ways. The “strength of an attractor” can be defined as the mutual information between the
The top genes of many among the found attractors are genomically localized. In that case the biomolecular event that they represent is often the presence of a particular copy number variation. In the cancer datasets that we tried, this phenomenon almost always corresponds to a local amplification event known as an amplicon. We therefore also devised a related amplicon-finding algorithm, custom-designed to identify localized amplicon-representing attractor metagenes, described below.
To identify genomically localized attractors – almost always amplicons – we use the same algorithm but for each seed gene we restrict the set of candidate attractor genes to only include those in the local genomic neighbourhood of the gene, and we optimize the exponent a so that the strength of the attractor is maximized. Specifically, we sort the genes in each chromosome in terms of their genomic location and we only consider the genes within a window of size 51, i.e., with 25 genes on each side of the seed gene. We further optimize the choice of the exponent
Because the set of allowed genes is different for each seed, the attractors will be different from each other, but “neighbouring” attractors will usually be very similar to each other. Therefore, following exhaustive attractor finding from each seed gene in a chromosome, we apply a filtering algorithm to only select the highest-strength attractor in each local genomic region, as follows: For each attractor, we rank all the genes in terms of their mutual information with the corresponding attractor metagene and we define the range of the attractor to be the chromosomal range of its top 15 genes. If there is any other attractor with overlapping range and higher strength, then the former attractor is filtered out. This filtering is done in parallel, so elimination of attractors occurs simultaneously.
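The filtering step can be illustrated with a short sketch. The dictionary layout (`start`, `end`, `strength`) and the function name are hypothetical. The range of each attractor is assumed to have been computed beforehand from its top 15 genes, and all attractors are assumed to lie on the same chromosome.

```python
def filter_attractors(attractors):
    """Keep an attractor only if no overlapping attractor is stronger.

    attractors : list of dicts with keys
        'start', 'end' : chromosomal range spanned by the top 15 genes
        'strength'     : strength of the attractor
    Every decision is made against the full original list, so all
    eliminations happen simultaneously ("in parallel").
    """
    def overlaps(a, b):
        return a["start"] <= b["end"] and b["start"] <= a["end"]

    return [
        a for a in attractors
        if not any(
            b is not a and overlaps(a, b) and b["strength"] > a["strength"]
            for b in attractors
        )
    ]
```

The parallel rule matters for chains of overlapping attractors. If A overlaps B and B overlaps C, with strengths increasing from A to C, then A is removed because of B even though B itself is removed because of C, and only C survives. A sequential procedure that removed B first could instead retain A. In this sketch, attractors with exactly equal strength do not eliminate each other.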
Assuming that the continuous expression levels of two genes
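Mutual information between two continuous expression profiles has to be estimated from a finite number of samples. The following minimal sketch uses a plain equal-width histogram estimator, which is swapped in here for simplicity and is not claimed to be the estimator used in this work. The function name, the number of bins, and the normalization by the larger of the two marginal entropies are all assumptions.

```python
import numpy as np

def normalized_mutual_information(x, y, bins=6):
    """Histogram-based mutual information between two profiles, scaled to [0, 1].

    ASSUMPTIONS: equal-width bins; normalization by max(H(x), H(y)), which for
    this estimator equals the larger of the two self-informations; neither
    profile is constant (otherwise the normalizer is zero).
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()   # joint probability table
    px = pxy.sum(axis=1)        # marginal distribution of x
    py = pxy.sum(axis=0)        # marginal distribution of y

    def entropy(p):
        p = p[p > 0]            # empty cells contribute nothing
        return -np.sum(p * np.log(p))

    hx, hy, hxy = entropy(px), entropy(py), entropy(pxy.ravel())
    mi = hx + hy - hxy          # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return max(mi, 0.0) / max(hx, hy)
```

Under this normalization a profile compared with itself scores 1, and independent profiles score close to 0, up to the small positive bias of histogram estimators.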
We used Level 3 data when directly available, and imputed missing values using a k-nearest-neighbour algorithm with k = 10, as implemented in R
To investigate the associations between the attractor metagene expression and the tumor stage and grade, we used the following annotated gene expression datasets. For stage association: Breast (GSE3893), TCGA Ovarian, Colon (GSE14333). For grade association: Breast (GSE3494), TCGA Ovarian, Bladder (GSE13507). For Breast GSE3494 we used only the samples profiled by U133A arrays. For Breast GSE3893 we combined two platforms by taking the intersection of the probes in the U133A and the U133Plus 2.0 arrays. All datasets profiled on Affymetrix platforms were normalized using the RMA algorithm. For Bladder GSE13507 we used the normalized data as provided in the GEO.
The significance of the consistency of the mesenchymal transition and mitotic CIN attractors was evaluated as follows: Supplementary
General attractors identified from the six datasets.
(XLS)
Genomically localized attractors identified from the six datasets.
(XLS)
Association of mesenchymal transition attractor with tumor stage.
(XLS)
Association of mitotic CIN attractor with tumor grade.
(XLS)
Common clusters from the six datasets using k-means.
(XLS)
Common clusters from the six datasets using principal component analysis.
(XLS)
Common clusters from the six datasets using hierarchical clustering.
(XLS)
Datasets and methods used to derive Supplementary
(DOCX)
Comparison with other unsupervised algorithms.
(DOCX)
Pseudo-code for attractor metagene finding algorithm.
(DOCX)
We thank the Chang Gung Memorial Hospital-Linkou and Chang Gung University, Taoyuan, Taiwan, R.O.C. and in particular their Head and Neck Oncology Group and Dr. Tzu-Chen Yen who served as our contact point, for providing us with survival data for dataset GSE25104. This study makes use of data generated by the Molecular Taxonomy of Breast Cancer International Consortium. Funding for that project was provided by Cancer Research UK and the British Columbia Cancer Agency Branch. Our results using these data (the METABRIC 996-sample discovery dataset) were produced during our participation in the Sage Bionetworks/DREAM breast cancer prognosis Challenge.