Conceived and designed the experiments: VD. Performed the experiments: DV VD. Analyzed the data: DV VD. Wrote the paper: DV JED VD.
The authors have declared that no competing interests exist.
Bridging the gap between animal or
Proving that research findings from
Ethics limits experimental investigation on human subjects. Hence, most experimental biomedical research is performed on animal and/or
Hundreds of studies in oncology have suggested the biological relevance to humans of putative cancer-driving mechanisms with the following three steps: 1) characterize the mechanism in a model system, 2) derive from the model system a marker whose expression changes when the mechanism is altered, and 3) show that marker expression correlates with disease outcome in patients. The last figure of such papers is typically a Kaplan-Meier plot illustrating this correlation.
Breast cancer has been a test bed in oncogenomics. Several landmark studies (reviewed in ref.
Beyond clinical utility, many signatures were derived as markers of specific mechanisms and/or biological states, and their association with outcome was evaluated in studies structured along the three steps outlined above. These include signatures of stem cells
This raises a question: are all these mechanisms major independent drivers of breast cancer progression, or is step #3 inconclusive because of a basic confounding variable problem? To take an example of a complex system outside oncology, let us suppose we are trying to discover which socioeconomic variables drive people's health. We may find that the number of TV sets per household is positively correlated with life expectancy. This, of course, does not imply that TV sets improve health. Life expectancy and TV sets per household are both correlated with the gross national product per capita of nations, as are many other causes or byproducts of wealth such as energy consumption or education. So, is the significant association of, say, a stem cell signature with human breast cancer outcome informative about the relevance of stem cells to human breast cancer?
Resolving this issue has become more pressing recently. Several large cohorts with genome-wide tumoral expression profiles and patient follow-ups are available in the public domain. Servers resting on these data
Few studies using the outcome-association argument present negative controls to check whether their signature of interest is indeed more strongly related to outcome than signatures with no underlying oncological rationale. In statistical terms, these studies typically rest on a null hypothesis (H0) that assumes a background of no association with outcome. The negative controls we present here prove this assumption wrong: a random signature is more likely to be correlated with breast cancer outcome than not. The statistical explanation for this phenomenon lies in the correlation of a large fraction of the breast transcriptome with one variable, which we call meta-PCNA, that integrates most of the prognostic information available in current breast cancer gene expression data.
In order to assess whether association with outcome was specific, we tested the association with breast cancer outcome of three signatures whose rationale does not suggest any connection with cancer: a signature of the effect of postprandial laughter on peripheral blood mononuclear cells
In panels A–C, the NKI cohort was split into two groups using a signature of postprandial laughter (panel A), localization of skin fibroblasts (panel B), or social defeat in mice (panel C). The fraction of patients alive (overall survival, OS) is shown as a function of time for both groups. Hazard ratios (HR) between groups and their associated p-values are given in the bottom-left corners. Panel D depicts p-values for association with outcome for all MSigDB c2 signatures and for random signatures of identical size to the MSigDB c2 signatures.
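The two-group comparison described in this caption can be sketched as follows. This is a minimal illustration in Python rather than the R used for the original analyses. It assumes a hypothetical per-patient signature score (for instance, a first principal component), a median split, and a standard two-group log-rank test; the function names are illustrative and are not taken from the study's code.

```python
import math
import statistics


def median_split(scores):
    """Assign each patient to group 1 (score above the median) or group 0."""
    cutoff = statistics.median(scores)
    return [1 if s > cutoff else 0 for s in scores]


def logrank_p_value(times, events, groups):
    """Two-group log-rank test.

    times  : follow-up time of each patient
    events : 1 if the event (e.g. death) was observed, 0 if censored
    groups : 0 or 1 for each patient
    """
    observed_1 = expected_1 = variance = 0.0
    # Loop over the distinct times at which at least one event occurred.
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = [g for ti, g in zip(times, groups) if ti >= t]
        n, n1 = len(at_risk), sum(at_risk)
        dead = [g for ti, e, g in zip(times, events, groups) if ti == t and e]
        d, d1 = len(dead), sum(dead)
        observed_1 += d1
        expected_1 += d * n1 / n
        if n > 1:  # hypergeometric variance of the group-1 event count
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = (observed_1 - expected_1) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Toy data: patients with high scores die early, the others late.
scores = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
times = [1, 2, 3, 4, 5, 10, 11, 12, 13, 14]
events = [1] * 10
groups = median_split(scores)
p_separated = logrank_p_value(times, events, groups)
```

On real cohorts a dedicated survival library would be preferable; the sketch only makes the logic of the split and of the test explicit.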
To check that these were not anecdotal observations, we downloaded all signatures from MSigDB database
Cancer is a major subject matter of biomedical research, thus MSigDB c2 may be enriched for cancer-related signatures. To rule out the potential effect of a cancer bias, we generated for each signature in MSigDB c2 a signature of identical size, with genes selected at random from the human genome. Although they are completely devoid of any biological rationale, 77% of these signatures were associated with outcome at p<0.05, and 30% at p<10^-5 (
Thus, nominal p-values should not be used directly: a signature associated with outcome at a significance of 10^-5, and even more so at 0.05, is no more related to outcome than a random set of genes.
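The size-matched negative controls described above can be sketched as follows, in Python rather than the R used for the original analyses. The gene universe, the signature names and the function name are hypothetical placeholders; the only point illustrated is that each random signature has the same number of genes as the signature it mirrors and is drawn without regard to biology.

```python
import random


def random_signatures_like(signatures, gene_universe, seed=0):
    """Return one random gene set per input signature, matched in size.

    signatures    : dict mapping a signature name to its list of genes
    gene_universe : list of all genes that may be drawn
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    matched = {}
    for name, genes in signatures.items():
        size = min(len(set(genes)), len(gene_universe))
        # Sample without replacement so that no gene appears twice.
        matched[name] = rng.sample(gene_universe, size)
    return matched


# Toy example with a made-up universe of 1000 gene identifiers.
universe = [f"GENE{i}" for i in range(1000)]
published = {
    "sig_a": ["GENE1", "GENE2", "GENE3"],
    "sig_b": ["GENE10", "GENE11"],
}
controls = random_signatures_like(published, universe)
```

Each control would then be passed through the same outcome-association procedure as the signature it mirrors.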
Although most random signatures are significantly associated with breast cancer outcome, the association could be much stronger for published breast cancer signatures and provide valid statistical support for their relevance.
We compiled 47 signatures from the literature. Association with outcome has been reported for most of them (Supporting Information,
The x-axis denotes the p-value of association with overall survival. Red dots stand for published signatures, yellow shapes depict the distribution of p-values for 1000 random signatures of identical size, with the lower 5% quantiles shaded in green and the median shown as a black line. Signatures are ordered by increasing size.
At the other end of the size spectrum, we found that 26% of individual genes printed on the NKI arrays were associated with outcome at p<0.05. Thus, a single-gene study has 26 chances in 100 of yielding a significant association. When we applied a q-value correction
Proliferation is a well-known breast cancer prognostic marker
The proliferating cell nuclear antigen, PCNA, is a ring-shaped protein that encircles DNA and regulates several processes leading to DNA replication
We next compared for each one of the 47 published signatures its association with outcome in the original NKI data set and after adjustment of expression levels for the meta-PCNA index (
Hazard ratios for overall survival association of 48 signatures in the original dataset (blue) and the meta-PCNA-adjusted dataset (red). Box sizes are inversely related to the size of the confidence intervals. Related Kaplan-Meier plots are available in the Supporting Information (
We plotted the hazard ratios of the 47 signatures against the absolute correlation of their first principal component with the meta-PCNA index. The more a signature was correlated with meta-PCNA, the higher its hazard ratio (R^2 = 0.9,
A) Each point denotes a signature. The x-axis depicts the absolute value of the correlation of the first principal component of the signatures with meta-PCNA, the y-axis depicts the hazard ratio for outcome association. Details of the analysis for each data point are available in the Supporting Information (
Since only a limited set of genes is included in the 47 signatures, we plotted the distribution of correlations with the meta-PCNA index of all genes significantly associated with outcome and, as a negative control, of all genes printed on the microarrays (
The potential confounding effect of proliferation has been recognized by a number of authors who attempted to rule it out by removing known proliferation genes from expression data
Following Ben Porath et al.
Distribution of the correlations with meta-PCNA of genes in the Embryonic Stem Cell Module (blue, ref.
Moreover, 58% of the genes printed on the array were significantly correlated with the meta-PCNA index in the NKI cohort. Thus, the correlations with meta-PCNA extend far beyond cell cycle genes. Removing these genes fails to rule out the confounding effect of proliferation. Similarly, a signature does not have to be enriched with known cell cycle genes to convey a strong cell proliferation signal.
Previous sections rested on the NKI data set and the overall survival end-point. Are our observations specific to this popular, but not universal, setup? We reran the analyses using recurrence-free survival, and on another cohort
We calculated hazard ratios for the 47 published signatures using all combinations of end-points and cohorts. Correlation between hazard ratios among the different cohorts/end-points was ≥0.97 (
Each dot represents a published signature. A) Hazard ratios. B) Log rank p-values. Lower panels give correlation coefficients for corresponding scatter plots in the symmetric upper panels. OS, overall survival; RFS, recurrence-free survival. NKI, data from ref.
 | NKI OS | NKI RFS | LOI OS | LOI RFS |
Fraction of patients experiencing an event | 79/295 | 101/295 | 96/380 | 139/393 |
% MSigDB c2 with p<0.05 | 67% | 56% | 52% | 45% |
% of all genes with p<0.05 | 17% | 9% | 8% | 5% |
% BC signatures better than 5% best random signatures of same size | 40% | 35% | 29% | 31% |
Correlation of BC signatures HR with their association with meta-PCNA | 0.9 | 0.9 | 0.9 | 0.9 |
There are many ways to estimate association between the expression of a multi-gene marker and disease outcome, and different studies have taken different routes. Our goal to compare signatures and assess them against negative controls, however, implied a uniform statistical framework. We present a comparison of a number of such methods in the Supporting Information (
The main message of this paper is that, if the purpose of a study is to assert the biological relevance of a signature to human cancer, the association between this signature and outcome cannot rest on nominal p-values, such as those obtained here on breast cancer with the Cox analysis. This follows from the elevated likelihood that random sets of genes are related to outcome. Thus, investigators finding that their signature is associated with outcome at a significance of 10^-5, and even more so at 0.05, gain no specific information, because sets of random genes would likely yield similar, or better, results. Nominal p-values do not answer the appropriate statistical question: the question is not whether a given set of genes is related to survival, but whether it is more related to survival than random sets of genes.
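The comparison against random sets of genes can be made concrete with an empirical p-value. The sketch below is in Python rather than the R used for the original analyses, and it assumes that nominal p-values have already been computed for the signature of interest and for a collection of size-matched random signatures. The function names and the add-one correction are illustrative choices, not a description of the study's code.

```python
def empirical_p_value(observed_p, random_p_values):
    """Estimate how often a random signature does at least as well.

    Returns the fraction of random signatures whose nominal p-value is
    less than or equal to the observed one. One is added to numerator and
    denominator so that the estimate is never exactly zero.
    """
    as_good = sum(1 for p in random_p_values if p <= observed_p)
    return (as_good + 1) / (len(random_p_values) + 1)


def beats_best_random(observed_p, random_p_values, fraction=0.05):
    """True if observed_p is smaller than the lower `fraction` quantile."""
    ranked = sorted(random_p_values)
    cutoff_index = max(int(fraction * len(ranked)) - 1, 0)
    return observed_p < ranked[cutoff_index]


# Toy background: 1000 made-up nominal p-values spread evenly over (0, 1].
background = [i / 1000 for i in range(1, 1001)]
strong = empirical_p_value(0.01, background)
weak = empirical_p_value(0.2, background)
```

With a background as skewed toward small p-values as the one reported here, a nominal p-value of 0.05 would translate into a large empirical p-value.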
This problem extends to single-gene markers and therefore to many studies published in the pre-genomic era. Claims similar to those concerning signatures have been made, that single genes, important in a model system, are relevant for human cancer progression based on differential expression between short- and long-survival groups. As 26% of the genes are related to survival at p<0.05 (17% at q<0.05), much tighter p-values than commonly used should be imposed to demonstrate such a relation.
Several studies in the panel of 47 we investigated developed arguments independent of outcome association. For example, Hu et al.
The present study addresses purely correlative association between gene-expression and disease outcome. We have shown that proliferation integrates most of the prognostic information contained in the breast cancer transcriptome. Yet—we cannot stress this enough—we have
Our study questions the biological interpretation of the prognostic value of published breast cancer signatures, but has no bearing on their usefulness in the clinic: a marker may be accurate without yielding interesting biological insight regarding the mechanism of disease progression. Nevertheless, the prominence of proliferation should be taken into account in future clinical research. Are there transcriptional signals in breast cancer that are prognostic, but independent of proliferation? Is there any hope of performing better than the 70-gene NKI signature
In conclusion, we have shown that 1) random single- and multiple-gene expression markers have a high probability of being associated with breast cancer outcome; 2) most published signatures are not significantly more associated with outcome than random predictors; 3) the meta-PCNA metagene integrates most of the outcome-related information contained in the breast cancer transcriptome; 4) this information is present in over 50% of the transcriptome and cannot be removed by purging known cell-cycle genes from a signature.
All analyses were run with R 2.9.0
The code and data underlying the results and figures of this study are available as a Bzip2-compressed tar bundle from the
All the data were available from public sources:
Ge
Loi
The NKI, a.k.a. van de Vijver
Probes mapping to the same genes were averaged in each one of the three datasets.
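Probe averaging can be illustrated with a short sketch, written in Python rather than the R used for the original analyses. The probe identifiers, the annotation dictionary and the function name are hypothetical; each expression profile is assumed to be a list with one value per patient.

```python
from collections import defaultdict


def average_probes_by_gene(probe_to_gene, probe_expression):
    """Collapse probe-level profiles into gene-level profiles by averaging."""
    grouped = defaultdict(list)
    for probe, profile in probe_expression.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # probes without a gene annotation are skipped
            grouped[gene].append(profile)
    # zip(*profiles) walks through the patients; each tuple holds the values
    # of all probes of one gene for one patient.
    return {
        gene: [sum(values) / len(values) for values in zip(*profiles)]
        for gene, profiles in grouped.items()
    }


annotation = {"p1": "PCNA", "p2": "PCNA", "p3": "TP53"}
expression = {
    "p1": [1.0, 3.0],
    "p2": [3.0, 5.0],
    "p3": [2.0, 2.0],
    "p4": [9.0, 9.0],  # no annotation, ignored
}
gene_level = average_probes_by_gene(annotation, expression)
```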
Whenever possible, the signatures were compiled from the publications' online supplementary tables. When these were not available, the gene symbols were automatically read with an optical character recognition system from the papers' tables and figures. In rare instances, signatures were encoded manually and double-checked. Because gene names and symbols change over time, the gene symbols of all signature genes were updated to match the HUGO nomenclature and therefore maximize the match with microarray gene annotations. HUGO gene symbols and their older aliases were obtained from the file gene_info as available on May 9th 2007 from the NCBI ftp server.
MSigDB 2.0 c2 signatures were downloaded as a *.gmt file from the Broad Institute page
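The symbol update can be sketched as a lookup from older aliases to current symbols. The example is in Python rather than the R used for the original analyses, and the small alias table is a toy stand-in for one built from an annotation file. How the study resolved ambiguous or unknown aliases is not stated, so passing unknown symbols through unchanged is an assumption of this sketch.

```python
def update_symbols(signature, alias_to_official, official_symbols):
    """Map outdated gene symbols to current ones.

    Order is preserved and duplicates created by the mapping are dropped.
    """
    updated = []
    for symbol in signature:
        if symbol in official_symbols:
            current = symbol  # already an official symbol
        else:
            # Assumption: symbols with no known alias pass through unchanged.
            current = alias_to_official.get(symbol, symbol)
        if current not in updated:
            updated.append(current)
    return updated


# Toy alias table with two well-known aliases.
aliases = {"HER2": "ERBB2", "P53": "TP53"}
official = {"ERBB2", "TP53", "PCNA"}
cleaned = update_symbols(
    ["HER2", "ERBB2", "P53", "PCNA", "UNKNOWN1"], aliases, official
)
```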
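A *.gmt file is a tab-delimited text format in which each line holds a gene set name, a description field, and then the genes of the set. A minimal reader might look like the sketch below, written in Python rather than the R used for the original analyses. The function name and the sample lines are made up for illustration.

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into a dict mapping set name to gene list."""
    signatures = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # blank line or a set without any gene
        name, _description, *genes = fields
        signatures[name] = [g for g in genes if g]  # drop empty trailing cells
    return signatures


# Made-up content standing in for a downloaded file.
sample_lines = [
    "SIG_ONE\tsome description\tPCNA\tTP53\n",
    "SIG_TWO\tna\tERBB2\n",
    "\n",
]
gene_sets = parse_gmt(sample_lines)
```

Reading from disk would only require passing an open file object in place of the list, since a file iterates over its lines.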
We computed the Pearson correlation between PCNA and all the genes in the Ge et al.
To adjust a data set for meta-PCNA, the expression of each gene was fitted against the meta-PCNA index with R's ‘lm’ function, and each expression measurement was substituted by the sum of its residual and the gene's mean expression across the cohort.
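The adjustment amounts to removing the linear component explained by a covariate while preserving each gene's mean. The sketch below reproduces that logic in Python with an ordinary least-squares fit on a single predictor. It assumes that the covariate is the meta-PCNA index, as the surrounding text suggests, and it is not the study's own R code.

```python
def adjust_for_covariate(expression, covariate):
    """Return residuals of expression ~ covariate, shifted by the mean expression.

    expression : expression values of one gene, one per patient
    covariate  : covariate values (assumed here: meta-PCNA), one per patient
    """
    n = len(expression)
    mean_x = sum(covariate) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, expression))
    slope = sxy / sxx if sxx else 0.0  # constant covariate: nothing to remove
    intercept = mean_y - slope * mean_x
    # residual + mean expression, as described in the text
    return [y - (intercept + slope * x) + mean_y
            for x, y in zip(covariate, expression)]


meta = [0.0, 1.0, 2.0, 3.0]
gene = [1.0, 3.0, 5.0, 7.0]  # perfectly explained by the covariate
adjusted = adjust_for_covariate(gene, meta)
```

A gene that is a perfect linear function of the covariate becomes flat at its mean after adjustment, whereas a gene uncorrelated with the covariate is left essentially unchanged.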
In order to systematically compare the published signatures to random signatures and evaluate the relation between outcome association and meta-PCNA, we needed an outcome association estimation procedure that is robust and fully automated. We systematically compared three procedures and selected among them the most sensitive and stable one. This is described in Supporting Information (
This work rests almost entirely on open source software and data. Contributors are gratefully acknowledged.