JPM conceived the review, contributed sections, and coordinated and assembled the complete manuscript. GDB, SP, AMC, CS, SJB, and BM discussed and contributed sections to the paper. BST and MA contributed sections to the paper.
Jomol P. Mathew is with the Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America. Barry S. Taylor is with the Department of Physiology and Biophysics, Weill Medical College of Cornell University, New York, New York, United States of America, and the Computational Biology Center, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America. Gary D. Bader is with Banting and Best Department of Medical Research, Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada. Saiju Pyarajan and Steven J. Burakoff are with the Skirball Institute of Biomolecular Medicine, New York University Cancer Institute, and the Department of Pathology at the New York University School of Medicine, New York, New York, United States of America. Marco Antoniotti is with the Dipartimento di Informatica Sistemistica e Comunicazione, Università di Milano-Bicocca, Milan, Italy. Arul M. Chinnaiyan is with the Departments of Pathology and Urology, and the Bioinformatics Program at the University of Michigan Medical School, Ann Arbor, Michigan, United States of America. Chris Sander is with the Computational Biology Center at Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America. Bud Mishra is with the Courant Institute at New York University and the Department of Cell Biology at New York University School of Medicine, New York, New York, United States of America. 
At the time of preparation of this manuscript, Jomol Mathew was with the Department of Environmental Medicine, New York University School of Medicine, New York, New York, United States of America; Barry Taylor was at the Department of Pathology and Bioinformatics Program, University of Michigan Medical School, Ann Arbor, Michigan, United States of America; Gary Bader was at the Computational Biology Center, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America; and Marco Antoniotti was at the New York University Bioinformatics Group, Courant Institute, New York University, New York, New York, United States of America.
The authors have declared that no competing interests exist.
Major advances in genome science and molecular technologies provide new opportunities at the interface between basic biological research and medical practice. The unprecedented completeness, accuracy, and volume of genomic and molecular data necessitate a new kind of computational biology for translational research. Key challenges are standardization of data capture and communication, organization of easily accessible repositories, and algorithms for integrated analysis based on heterogeneous sources of information. Also required are new ways of using complementary clinical and biological data, such as computational methods for predicting disease phenotype from molecular and genetic profiling. New combined experimental and computational methods hold the promise of more accurate diagnosis and prognosis as well as more effective prevention and therapy.
Over the last two decades, our knowledge of cancer and its causes has increased greatly. However, we still have few examples of cures. This underscores the need for a clearer understanding of the alterations in the biological circuitry that lead to tumor development and growth. Sequencing of the human genome and biotechnological advances have led to the generation of large volumes of genome-scale data. Combining this genome-scale molecular data with clinical information provides new opportunities to discover how perturbations in biological processes lead to disease. This knowledge can be used to improve disease diagnosis, prognosis, prevention, and therapy. However, the large scale and diversity of both experimental and clinical data necessitate that they be well-organized and computationally accessible to research scientists for analysis and interpretation. This review focuses on the challenges and opportunities to combine clinical and genome-scale molecular data, using computational approaches, to better understand cancer biology and to translate this knowledge into improved disease prevention and therapy.
Cancer is a complex disease, involving multiple and specific changes at the DNA level that can be inherited or induced by environmental factors. There are many different types and subtypes of cancer, each marked by specific sets of molecular changes. Most current cancer treatment relies on surgery as a curative treatment, with radiation and/or toxic drugs (chemotherapy) to induce remission. Few patients are candidates for curative surgery, and radiation and chemotherapy lack target specificity, leading to serious side effects. Identifying cancer-specific molecular changes and discovering how they can be used to increase therapeutic specificity will lead to higher success rates and fewer side effects.
Translational cancer research seeks to identify and understand the cause and effect of cancer-specific molecular defects and to translate this “bench” knowledge to the clinic to improve disease prevention and therapy. Examples of research questions include, from a clinical perspective: what are the molecular subtypes of cancer? What reliable molecular markers are available for early cancer detection (diagnostic) and for predicting the course of disease (prognostic)? How do we find better drugs and optimize therapy (development of more specific drugs with lower toxicity) to suit an individual patient's molecular profile? From a molecular biology perspective: can we accurately predict vulnerable point(s) in molecular pathways that are potential therapeutic targets? What specific drug or drug combination can target these vulnerable points in the pathway? Can genotype and pathway information be combined to predict the effect of a mutation on disease or therapy?
Advances in our understanding of cancer-specific molecular defects have led to improved cancer treatments. For example, the protein kinase inhibitor imatinib (Gleevec) was designed to treat chronic myelogenous leukemia (CML) based on knowledge of the causative molecular defect—the translocation that produces the dysregulated BCR–ABL kinase. Protein kinase inhibitors such as gefitinib (Iressa) and erlotinib (Tarceva) are showing therapeutic promise by targeting known molecular abnormalities of non-small cell lung cancer (NSCLC). Similarly, antibody therapies are promising, such as rituximab (Rituxan), an anti-CD20 monoclonal antibody for non-Hodgkin lymphoma; cetuximab (Erbitux), an epidermal growth factor receptor (EGFR)-binding antibody for colorectal and head and neck cancer; trastuzumab (Herceptin), a monoclonal antibody enabling targeted therapy in HER2-positive breast cancer; and bevacizumab (Avastin), a recombinant humanized antibody against vascular endothelial growth factor (VEGF) for metastatic colorectal cancer.
While these treatments based on molecular knowledge of the cancer show promise, major challenges remain. For instance, the development of resistance-conferring mutations limits the use of Gleevec, while humanization and effective delivery of antibodies are difficult [
Cancer cells can now be profiled on a genome scale using new experimental techniques. We thus have an unprecedented opportunity to comprehensively study cancer-specific molecular processes. This study requires computational tools to handle the large volume and diversity of available information. Collection, standard organization, aggregation, storage, integration, and analysis of diverse genome-scale molecular data along with patient data collected in the clinic will broaden our understanding of how cancer-specific molecular defects affect clinical outcome and will lead to improved disease prevention and therapy (
Archiving clinical and molecular data in easily retrievable, standardized formats, followed by aggregation, integration, and analysis, will provide opportunities for next-generation biomedical discoveries that can impact cancer research and treatment.
Cancer is a genetic disease involving point mutations, translocations, segmental amplifications, and deletions in the genome that alter specific vulnerable molecular points in cellular regulatory pathways. Analysis of chromosomal changes by fluorescence in situ hybridization (FISH)-based cytogenetic approaches, including comparative genomic hybridization (CGH), spectral karyotyping (SKY), and multiplex-FISH (M-FISH) [
Epigenetic changes such as DNA methylation, histone modification, and RNA silencing are involved in regulating many cellular processes, including development, via gene silencing (chromatin structure and transcription regulation) and genomic imprinting. Specific DNA methylation alterations have been identified in various neoplasms. For example, aberrant promoter methylation associated with transcriptional downregulation of tumor suppressor genes has been found in basal cell carcinoma (BCC), cutaneous squamous cell carcinoma (SCC), melanoma, and cutaneous lymphoma [
Genomic Variation Repositories
Several projects attempt to comprehensively study genomic variation in cancer. The Cancer Genome Atlas (
Global gene expression profiling with DNA microarrays [
Gene Expression Repositories
Not all available cancer microarray data are generated from gold-standard tissues or primary cell culture. Often, immortalized cell line models of neoplastic disease are studied because they are easier to access than histopathologically characterized human tumors [
Mass spectrometric instruments and protein chip technology allow large-scale analysis of proteins, their quantitative expression, interactions, post-translational modifications, and localization [
Generation of protein profiles using mass spectrometry is an example of an experimental technique that produces massive amounts of data that are difficult to interpret without computational and statistical algorithms. For instance, comparison of disease versus control sample profiles can lead to identification of disease-specific protein expression signatures, which could be used as diagnostic or prognostic markers. Aggregation of such data from multiple sources and pooled analysis requires proper annotation of sample source, sample handling, and experiment information.
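As a minimal sketch of such a disease-versus-control comparison, assuming peak intensities have already been aligned and normalized (the peak identifiers and values below are hypothetical), a per-peak test statistic can flag candidate signature peaks:

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def candidate_peaks(disease, control, threshold=4.0):
    """Flag m/z peaks whose intensity differs strongly between groups.

    disease/control: dict mapping peak id -> list of sample intensities.
    Real workflows would also compute p-values and correct for testing
    many peaks; this only ranks by the raw statistic.
    """
    return [peak for peak in disease
            if abs(welch_t(disease[peak], control[peak])) >= threshold]

# Toy intensities for two peaks across four samples per group.
disease = {"mz_1024": [9.1, 8.7, 9.4, 8.9], "mz_2310": [3.1, 2.9, 3.2, 3.0]}
control = {"mz_1024": [4.0, 4.3, 3.8, 4.1], "mz_2310": [3.0, 3.1, 2.8, 3.2]}
flagged = candidate_peaks(disease, control)
```

Here only the first peak shows a group difference large relative to its within-group variability, so only it is flagged.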
Proteomics Repositories
Small RNAs add an additional layer of complexity to gene regulation [
mRNA targets for several hundred miRNAs that are expressed in humans have been computationally predicted, though few targets have been experimentally confirmed ([
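Most of these prediction methods hinge on complementarity to the miRNA "seed" region (nucleotides 2–8). A minimal sketch of seed-site matching, using the well-known let-7a sequence against a toy 3′ UTR, might look like the following; real predictors layer conservation and context scoring on top of this core step:

```python
def revcomp(seq):
    """Reverse complement of an RNA sequence (A/C/G/U)."""
    comp = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def seed_sites(mirna, utr):
    """Return start positions in a 3' UTR matching the miRNA 7-mer seed.

    The seed is miRNA nucleotides 2-8; a candidate target site on the
    mRNA is the reverse complement of that seed. Both sequences are
    RNA strings; the UTR below is purely illustrative.
    """
    site = revcomp(mirna[1:8])
    return [i for i in range(len(utr) - 6) if utr[i:i + 7] == site]

LET7A = "UGAGGUAGUAGGUUGUAUAGUU"   # human let-7a miRNA
hits = seed_sites(LET7A, "AAACUACCUCAAA")   # toy 3' UTR
```

The let-7 seed complement is CUACCUC, so the toy UTR contains exactly one candidate site.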
Pathway information is vital for understanding biological processes and how they are disrupted or reprogrammed in disease. However, collecting complex pathway information in a usable form from diverse and heterogeneous sources, including more than 220 pathway databases (
Human-Focused Pathway Repositories
Metabolomics involves measurement of metabolite concentrations and fluxes in cells and tissues [
Advances in optics, digital detectors, and automation have significantly improved biological imaging technology and have led to a large increase in the quantitative information extracted from digital images. Fluorescence and confocal microscopy, and whole-body imaging of model organisms [
Clinical data is information about patients that is collected using surveys, during doctors' office visits, through administration of standard treatment procedures, or during clinical trials. Typical cancer clinical trials are conducted to determine the safety and efficacy of a drug in humans and depend on detailed patient information for accurate interpretation of results. Clinical trials range from pilot studies for feasibility assessment of the trial to more involved Phase I to IV trials [
The Oncomine cancer microarray database is an integrated meta-analysis platform that overcame diverse data integration and normalization challenges to enable comprehensive analysis of complex multistudy disease datasets [
Software platforms such as Oncomine are important for discovery and algorithm development. When mined using appropriate algorithms, such as the cancer outlier profile analysis (COPA) method, they can supplement experiments to make fundamental contributions to cancer genetics [
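The core COPA transformation can be sketched in a few lines: each gene's expression profile is median-centered and scaled by its median absolute deviation, and genes are then scored by a high percentile of the transformed profile, so that genes strongly overexpressed in only a subset of tumors stand out even when the mean is unremarkable. The gene names and values below are illustrative toy data:

```python
import statistics

def copa_scores(expr, pct=0.90):
    """Simplified cancer outlier profile analysis (COPA).

    expr: dict mapping gene -> list of expression values across tumors.
    Median-center, scale by the median absolute deviation (MAD), then
    score each gene by a high percentile of the transformed values.
    """
    scores = {}
    for gene, vals in expr.items():
        med = statistics.median(vals)
        mad = statistics.median(abs(v - med) for v in vals)
        transformed = sorted((v - med) / mad for v in vals)
        scores[gene] = transformed[int(pct * (len(transformed) - 1))]
    return scores

expr = {
    "GENE_A": [1.0, 1.1, 0.9, 1.0, 8.0, 9.0],  # outliers in a tumor subset
    "GENE_B": [5.0, 5.1, 4.9, 5.0, 5.1, 4.9],  # uniformly expressed
}
scores = copa_scores(expr)
```

GENE_A receives a far higher outlier score than GENE_B despite GENE_B's higher mean expression, which is the behavior that let COPA surface subset-specific oncogenes.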
To effectively use genome-scale molecular information, it must be collected, organized in a standard way, aggregated, and stored so that it is widely accessible to the research community. Aggregation is the pooling of data from multiple experiments of the same type. It can increase sample size, improving statistical power for comparisons, and it can improve coverage, for instance, across more cell types, different parts of the tumor, or different populations. Public biomedical data repositories that organize, aggregate, and store data from genome-scale molecular experiments are increasingly available for diverse data types (
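The statistical benefit of aggregation can be illustrated with a toy calculation, assuming measurements of the same quantity from two studies have already been normalized to a common scale (the values below are hypothetical): pooling shrinks the standard error of the estimate, which is what improved statistical power reflects.

```python
import math

def standard_error(values):
    """Standard error of the mean (sample variance / n, square-rooted)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return math.sqrt(var / len(values))

# Two hypothetical studies measuring the same gene's fold change.
study_a = [2.1, 1.8, 2.4, 2.0, 1.9]
study_b = [2.2, 1.7, 2.3, 2.1, 1.8, 2.0]

# Aggregation: pooling samples of the same data type.
pooled = study_a + study_b
```

In practice the hard part is the normalization that makes this concatenation legitimate; without it, study-specific batch effects can dominate the pooled estimate.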
Data from these repositories support comprehensive molecular analysis of tumors. For instance, commonly activated gene signatures [
However, data aggregation is difficult unless standard methods for data collection, organization, archiving, and exchange are developed and followed.
Data Standardization Efforts in Biomedical Research
In
Data integration is the combination of heterogeneous biological data encoded with different semantics. Integration of heterogeneous data is useful not only to validate and improve confidence in experimental results but also to develop more complete models of biological systems. For instance, real-time quantitative RT–PCR data are routinely used to validate cDNA array experiment results. Integration of gene expression and proteomics data, for example, could be used to identify post-transcriptional or post-translational modifications. It could also provide insights into the advantages and shortcomings of particular experimental methods.
The integration of diverse experimental data to build models of biological processes, or pathways, will boost our ability to identify clinical markers and therapeutic targets and to interpret genotype information. For instance, a marker such as prostate specific antigen (PSA) may be widely known and used clinically without much knowledge about its biology. Knowing the pathway involving the marker gene allows other pathway components, or the entire pathway, to serve as a more specialized marker.
Clinical data, securely and ethically accessed, can be integrated with molecular data from basic research to gain insight into disease state and lead to better treatments. Molecular and clinical data have been integrated for identification of clinically relevant subtypes of leukemia with 100% sensitivity and specificity [
Effective integration of heterogeneous data is difficult because information necessary to decipher data semantics, such as context, may be missing. A human can often infer this information from prior knowledge, but such inference has proven very difficult to encode in a model or rule-based computational procedure. Thus, missing information can lead to errors during integration. For example, a query may identify relevant datasets labeled with the term
In
Data integration was applied to identify one of the genes responsible for Leigh syndrome [
Garraway et al. used an integrative approach to identify
The volume of the data generated by modern biomedical studies is too large to be processed by the human brain alone. Data storage, querying, and presentation software systems and computational algorithms are required for effective interpretation of large-scale experimental data. Automated methods are now available to find genes or pathways that are significantly differentially expressed using molecular profiles [
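For pathways, a common automated approach is an over-representation test: given a list of differentially expressed genes, ask whether a pathway's genes overlap it more than chance would predict. A minimal sketch using the standard hypergeometric tail probability follows; real tools also correct for testing many pathways at once.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k pathway genes in the list.

    N: genes measured on the platform
    K: genes annotated to the pathway
    n: differentially expressed genes
    k: overlap between the pathway and the differential list
    """
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / comb(N, n)
```

For instance, with 20 measured genes, a 5-gene pathway, and a 5-gene differential list, a perfect overlap of 5 has probability 1/C(20,5) = 1/15504 under the null, a strong enrichment signal even at this toy scale.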
Pathway simulations, requiring detailed cellular models, have been used in model organisms, such as
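At their simplest, such simulations numerically integrate ordinary differential equations describing the model's reactions. A minimal sketch, using Euler integration of a single gene's transcription and translation with purely illustrative rate constants:

```python
def simulate(k_txn=1.0, k_deg_m=0.5, k_tln=2.0, k_deg_p=0.2,
             dt=0.01, steps=5000):
    """Euler integration of a minimal gene-expression ODE model.

    dm/dt = k_txn - k_deg_m * m      (mRNA synthesis and decay)
    dp/dt = k_tln * m - k_deg_p * p  (translation and protein decay)
    Rate constants are illustrative, not measured values; real cellular
    models couple many such equations and use adaptive ODE solvers.
    """
    m = p = 0.0
    for _ in range(steps):
        m += (k_txn - k_deg_m * m) * dt
        p += (k_tln * m - k_deg_p * p) * dt
    return m, p
```

With these parameters the system relaxes to its analytical steady state (mRNA = k_txn/k_deg_m = 2, protein = k_tln·2/k_deg_p = 20), a useful sanity check before simulating perturbations such as a drug inhibiting one rate.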
Development of software systems that integrate diverse biomedical research data types promises to support the study of disease biology and development. For instance, the REMBRANDT (Repository for Molecular Brain Neoplasia Data) framework attempts to integrate clinical and molecular data from the Glioma Molecular Diagnostic Initiative (GMDI)—a collaborative effort of the NCI and the US National Institute of Neurological Disorders and Stroke (NINDS) (
Computational prediction of the biological effects of drugs based on structure–function relationships across many targets can help increase the success rate of clinical trials and may forewarn of possible adverse events associated with the small-molecule therapy. This strategy could also be used to develop drug combinations to target multiple vulnerable points to shut down tumor growth. ADME/Tox (absorption, distribution, metabolism, excretion, and toxicity) prediction based on molecular profiles can help eliminate candidate drugs that have unacceptable toxicity early in the drug discovery process and thus reduce the cost of drug development.
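A classic early filter of this kind is Lipinski's rule of five, which flags compounds unlikely to be orally bioavailable. A minimal sketch, assuming molecular descriptors (molecular weight, logP, hydrogen-bond donor and acceptor counts) have already been computed by a chemistry toolkit:

```python
def lipinski_violations(mw, logp, h_donors, h_acceptors):
    """Count Lipinski rule-of-five violations for a candidate compound.

    Rules: molecular weight <= 500, logP <= 5, H-bond donors <= 5,
    H-bond acceptors <= 10. Compounds violating more than one rule are
    considered less likely to be orally bioavailable. Descriptor values
    are inputs here; the numbers used below are toy examples.
    """
    return sum([mw > 500, logp > 5, h_donors > 5, h_acceptors > 10])

def passes_rule_of_five(mw, logp, h_donors, h_acceptors):
    """True if the compound has at most one rule-of-five violation."""
    return lipinski_violations(mw, logp, h_donors, h_acceptors) <= 1
```

Such cheap in silico screens discard clearly unpromising candidates before any synthesis or assay cost is incurred, which is exactly the economic argument made above for computational ADME/Tox prediction.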
Aggregating, integrating, and analyzing experimental data from multiple sources must overcome social as well as technical challenges. Critically, while archives of datasets from molecular studies are often publicly available, a public clinical counterpart remains largely unavailable due to patient privacy concerns. Securely providing de-identified patient data obtained with adequate patient consent, for example, as per the US Health Insurance Portability and Accountability Act (HIPAA) guidelines (
Data collected from biological samples must be clearly annotated using standard representations, including descriptions of the sample and experimental conditions. Without such information, data integration is significantly more difficult, inefficient, and error-prone. Effort must be spent to make data publicly available, to agree on and use community standards, and, most importantly, to make computational tools easy for biologists to use; these steps will significantly improve the effectiveness of translational cancer research. Computing infrastructure for facilitating data aggregation/integration can use either centralized systems, wherein an investigator accesses a central computer system that holds all the data, or federated systems, where an investigator sends a query and the system assembles pertinent information from wherever it exists. Two examples of research computer systems for data integration are caBIG and BIRN's cyberinfrastructure. The Cancer Biomedical Informatics Grid (caBIG) is a network enabling the sharing of data and software tools across individuals and cancer research institutions to improve the pace of innovation in cancer prevention and treatment (
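In a federated design, a query is fanned out to each participating site and the results merged centrally. A minimal sketch of the fan-out-and-merge pattern follows; the site names and records are hypothetical, and a real grid such as caBIG adds authentication, common data elements, and service discovery on top of this idea:

```python
def federated_query(sites, predicate):
    """Fan a query out over per-site data stores and merge the results.

    sites: dict mapping site name -> list of record dicts (stand-ins
    for remote repositories reachable over the network).
    predicate: function deciding whether a record matches the query.
    Each returned record is tagged with its originating site.
    """
    results = []
    for name, records in sites.items():
        results.extend({**r, "site": name} for r in records if predicate(r))
    return results

# Hypothetical de-identified records held at two institutions.
sites = {
    "site_A": [{"dx": "glioma", "age": 54}, {"dx": "NSCLC", "age": 61}],
    "site_B": [{"dx": "glioma", "age": 47}],
}
matches = federated_query(sites, lambda r: r["dx"] == "glioma")
```

The federated layout leaves each dataset under its home institution's control, which is often what makes sharing clinically derived data acceptable in the first place.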
Computational biology is pivotal for effectively using large and diverse data resources to provide insights into disease biology and to optimize treatment. Modeling and simulation techniques, standards, and software systems must be enhanced to deal with expanding molecular and clinical information. Making well-organized experimental datasets widely accessible will spur algorithm development, testing, and comparison, leading to the development of better computational methods. These new computational tools will allow us to effectively interpret available genome-scale datasets to improve disease diagnosis, prognosis, therapy, and prevention.
We apologize to those whose publications we were not able to refer to in this review because of space limitations.
Cancer Biomedical Informatics Grid
comparative genomic hybridization
fluorescence in situ hybridization
messenger RNA
microRNA
microphthalmia-associated transcription factor
National Cancer Institute
single nucleotide polymorphism