The authors have declared that no competing interests exist.
Pathway analysis has become the first choice for gaining insight into the underlying biology of differentially expressed genes and proteins, as it reduces complexity and has increased explanatory power. We discuss the evolution of knowledge base–driven pathway analysis over its first decade, distinctly divided into three generations. We also discuss the limitations that are specific to each generation, and how they are addressed by successive generations of methods. We identify a number of annotation challenges that must be addressed to enable development of the next generation of pathway analysis methods. Furthermore, we identify a number of methodological challenges that the next generation of methods must tackle to take advantage of the technological advances in genomics and proteomics in order to improve specificity, sensitivity, and relevance of pathway analysis.
High-throughput sequencing and gene/protein profiling techniques have transformed biological research by enabling comprehensive monitoring of a biological system. Irrespective of the technology used, analysis of high-throughput data typically yields a list of differentially expressed genes or proteins. This list is extremely useful in identifying genes that may have roles in a given phenomenon or phenotype. However, for many investigators, this list often fails to provide mechanistic insights into the underlying biology of the condition being studied. In this way, the advent of high-throughput profiling technologies presents a new challenge, that of extracting meaning from a long list of differentially expressed genes and proteins.
One approach to this challenge has been to simplify analysis by grouping long lists of individual genes into smaller sets of related genes or proteins. This approach reduces the complexity of analysis. Researchers have developed a large number of knowledge bases to help with this task. The knowledge bases describe biological processes, components, or structures in which individual genes and proteins are known to be involved, as well as how and where gene products interact with each other. One example of this idea is to identify groups of genes that function in the same pathways.
Analyzing high-throughput molecular measurements at the functional level is very appealing for two reasons. First, grouping thousands of genes, proteins, and/or other biological molecules by the pathways they are involved in reduces the complexity to just several hundred pathways for the experiment. Second, identifying active pathways that differ between two conditions can have more explanatory power than a simple list of different genes or proteins.
The goals of this review are to i) describe the existing knowledge base–driven pathway analysis methods, ii) discuss limitations of each class of methods, and iii) describe the challenges not yet addressed by any method.
The term “pathway analysis” has been used in very broad contexts in the literature.
It is beyond the scope of this review to discuss the large number of analytic methods covered by such a broad application of the term “pathway analysis.” Therefore, this review focuses on methods that exploit pathway knowledge in public repositories such as GO or Kyoto Encyclopedia of Genes and Genomes (KEGG), rather than on methods that infer pathways from molecular measurements. We call this approach
Instead of individually reviewing a large number of pathway analysis approaches, our goal here is to group approaches by the type of analysis they perform and discuss their relative merits. However, for those desiring specific information about individual tools,
Virtually all of the approaches and tools discussed here are independent of both the high-throughput technology that generated the data (including next-generation sequencing) and the knowledge bases used for pathway annotations. In this review, we use gene expression measurements as example data for discussing and explaining various approaches.
The immediate need for functional analysis of microarray gene expression data and the emergence of GO during that period gave rise to over-representation analysis (ORA), which statistically evaluates the fraction of genes in a particular pathway found among the set of genes showing changes in expression (
Note that this overview is equally applicable to molecular measurements using proteomics, and any other high-throughput technologies. The data generated by an experiment using a high-throughput technology (e.g., microarray, proteomics, metabolomics), along with functional annotations (pathway database) of the corresponding genome, are input to virtually all pathway analysis methods. While ORA methods require that the input is a list of differentially expressed genes, FCS methods use the entire data matrix as input. In addition to functional annotations of a genome, PT-based methods utilize the number and type of interactions between gene products, which may or may not be a part of a pathway database. The result of every pathway analysis method is a list of significant pathways in the condition under study. DE, differentially expressed.
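To make the ORA input and output concrete, the following is a minimal sketch of an over-representation test based on the hypergeometric distribution, one of the statistics commonly used by ORA tools. The function names (`ora_p_value`, `ora`) and the toy gene identifiers are assumptions made for illustration; they are not taken from any of the tools discussed in this review.

```python
from math import comb


def ora_p_value(n_universe, n_pathway, n_de, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe: number of genes measured in the experiment
    n_pathway:  number of measured genes annotated to the pathway
    n_de:       number of differentially expressed (DE) genes
    n_overlap:  number of DE genes annotated to the pathway
    """
    max_k = min(n_pathway, n_de)
    total = comb(n_universe, n_de)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_de - k)
        for k in range(n_overlap, max_k + 1)
    )
    return tail / total


def ora(de_genes, pathways, universe):
    """Return a p-value per pathway for a list of DE genes."""
    universe = set(universe)
    de = set(de_genes) & universe
    results = {}
    for name, members in pathways.items():
        members = set(members) & universe
        overlap = de & members
        results[name] = ora_p_value(
            len(universe), len(members), len(de), len(overlap)
        )
    return results


if __name__ == "__main__":
    # Hypothetical toy data: 10 measured genes, 3 of them DE.
    universe = [f"g{i}" for i in range(10)]
    pathways = {"pathway_A": ["g0", "g1", "g2", "g3"],
                "pathway_B": ["g6", "g7", "g8", "g9"]}
    print(ora(["g0", "g1", "g2"], pathways, universe))
```

Note that the test sees only gene counts: the measured expression values never enter the calculation, which is the first ORA limitation discussed in this review.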
Name | Availability
ORA tools
Onto-Express | Web
GenMAPP | Standalone
GoMiner | Standalone, Web
FatiGO | Web
GOstat | Web
FuncAssociate | Web
GOToolBox | Web
GeneMerge | Standalone, Web
GOEAST | Web
ClueGO | Standalone
FunSpec | Web
GARBAN | Web
GO:TermFinder | Standalone
WebGestalt | Web
agriGO | Web
GOFFA | Standalone, Web
WEGO | Web
FCS tools
GSEA | Standalone
sigPathway | Standalone (BioConductor)
Category | Standalone (BioConductor)
SAFE | Standalone (BioConductor)
GlobalTest | Standalone (BioConductor)
PCOT2 | Standalone (BioConductor)
SAM-GS | Standalone
Catmap | Standalone
T-profiler | Web
FunCluster | Standalone
GeneTrail | Web
GAzer | Web
PT-based tools
ScorePAGE | No implementation available
Pathway-Express | Web
SPIA | Standalone (BioConductor)
NetGSA | No implementation available
Despite the availability of a large number of tools and their widespread usage, ORA has a number of limitations. First, the statistical tests used by ORA (e.g., those based on the hypergeometric, binomial, or chi-square distribution) are independent of the measured changes. This means that these tests consider the number of genes alone and ignore any values associated with them, such as probe intensities. By discarding these data, ORA treats each gene equally. However, information about the extent of regulation (e.g., fold-changes or the significance of a change) can be useful in assigning different weights to input genes, as well as to the pathways they are involved in, which in turn can provide more information than current ORA approaches.
Second, ORA typically uses only the most significant genes and discards the others. For instance, the input list of genes from a microarray experiment is usually obtained using an arbitrary threshold (e.g., genes with fold-change
Third, by treating each gene equally, ORA assumes that each gene is independent of the other genes. However, biology is a complex web of interactions between gene products that constitute different pathways. One goal of gene expression analysis might be to gain insights into
Fourth, ORA assumes that each pathway is independent of other pathways, which is erroneous. For instance, GO defines a biological process as a series of events accomplished by one or more
The hypothesis of functional class scoring (FCS) is that although large changes in individual genes can have significant effects on pathways, weaker but coordinated changes in sets of functionally related genes (i.e., pathways) can also have significant effects. With few exceptions
Second, the gene-level statistics for all genes in a pathway are aggregated into a single pathway-level statistic. This statistic can be multivariate
The final step in FCS is assessing the statistical significance of the pathway-level statistic. When computing statistical significance, the null hypothesis tested by current pathway analysis approaches can be broadly divided into two categories: i) competitive null hypothesis and ii) self-contained null hypothesis.
FCS methods address three limitations of ORA. First, they do not require an arbitrary threshold for dividing expression data into significant and non-significant pools. Rather, FCS methods use all available molecular measurements for pathway analysis. Second, while ORA completely ignores molecular measurements when identifying significant pathways, FCS methods use this information in order to detect coordinated changes in the expression of genes in the same pathway. Finally, by considering the coordinated changes in gene expression, FCS methods account for dependence between genes in a pathway, which ORA does not.
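The three FCS steps described above (a gene-level statistic, aggregation into a pathway-level statistic, and assessment of significance) can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of any specific tool: the gene-level statistic is a Welch t-statistic, the pathway-level statistic is the mean of the gene-level statistics, and significance is assessed by permuting sample labels, which corresponds to a self-contained null hypothesis. All function names and the toy data are hypothetical.

```python
import random
from statistics import mean, variance


def t_stat(a, b):
    """Welch t-statistic between two groups of measurements for one gene."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def pathway_score(expr, labels, genes):
    """Pathway-level statistic: mean of the gene-level t-statistics.

    expr:   dict gene -> list of measurements, one per sample
    labels: list of 0 (control) / 1 (case), one per sample
    genes:  genes annotated to the pathway
    """
    scores = []
    for g in genes:
        case = [x for x, lab in zip(expr[g], labels) if lab == 1]
        ctrl = [x for x, lab in zip(expr[g], labels) if lab == 0]
        scores.append(t_stat(case, ctrl))
    return mean(scores)


def fcs_p_value(expr, labels, genes, n_perm=1000, seed=0):
    """Permutation p-value obtained by shuffling sample labels.

    Each group needs at least two samples so that a variance is defined.
    """
    rng = random.Random(seed)
    observed = abs(pathway_score(expr, labels, genes))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pathway_score(expr, shuffled, genes)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical toy data: 4 control samples followed by 4 case samples.
    expr = {
        "g1": [1.0, 1.1, 0.9, 1.2, 2.0, 2.1, 1.9, 2.2],
        "g2": [3.0, 3.2, 2.9, 3.1, 4.0, 4.1, 4.2, 3.9],
    }
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    print(fcs_p_value(expr, labels, ["g1", "g2"]))
```

No threshold is applied to select genes, and the measured values of every pathway gene contribute to the score. A competitive null hypothesis would instead permute gene labels and compare the pathway genes against the genes outside the pathway.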
Although FCS is an improvement over ORA
Second, many FCS methods use changes in gene expression to rank genes in a given pathway, and discard the changes from further analysis. For instance, assume that two genes in a pathway, A and B, are changing by 2-fold and 20-fold, respectively. As long as they both have the same respective ranks in comparison with other genes in the pathway, most FCS methods will treat them equally, although the gene with the higher fold-change should probably get more weight. Importantly, however, considering only the ranks of genes is also advantageous, as it is more robust to outliers. A notable exception to this scenario is approaches that use gene-level statistics (e.g., t-statistic) to compute pathway-level scores. For example, an FCS method that computes a pathway-level statistic as a sum or mean of the gene-level statistic accounts for a relative difference in measurements (e.g., Category, SAFE in
A large number of publicly available pathway knowledge bases provide information beyond simple lists of genes for each pathway. Unlike GO and the Molecular Signatures Database (MSigDB), these knowledge bases also provide information about gene products that interact with each other in a given pathway, how they interact (e.g., activation, inhibition, etc.), and where they interact (e.g., cytoplasm, nucleus, etc.). These knowledge bases include KEGG
ORA and FCS methods consider only the number of genes in a pathway or gene coexpression to identify significant pathways, and ignore the additional information available from these knowledge bases. Hence, even if the pathways are completely redrawn with new links between the genes, as long as they contain the same set of genes, ORA and FCS will produce the same results. Pathway topology (PT)-based methods (
Rahnenfuhrer et al. proposed ScorePAGE, which computes similarity between each pair of genes in a pathway (e.g., correlation or covariance).
An impact factor (IF) analytic approach was recently proposed to analyze signaling pathways. IF considers the structure and dynamics of an entire pathway by incorporating a number of important biological factors, including changes in gene expression, types of interactions, and the positions of genes in a pathway.
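To illustrate why topology matters, the sketch below propagates measured changes downstream through signed interactions and sums the resulting perturbations. It is a simplified illustration only and should not be read as the published IF model: the propagation rule (each gene passes its perturbation to its downstream targets, divided by its number of targets), the function names, and the toy pathway are all assumptions. The graph is assumed to be acyclic.

```python
def perturbation(delta, edges):
    """Propagate measured changes downstream through signed interactions.

    delta: dict gene -> measured change (e.g., log fold-change)
    edges: list of (upstream, downstream, sign) with sign +1 for
           activation and -1 for inhibition; assumed to form a DAG
    """
    parents = {g: [] for g in delta}
    out_degree = {g: 0 for g in delta}
    for u, v, sign in edges:
        parents[v].append((u, sign))
        out_degree[u] += 1

    cache = {}

    def pf(g):
        # Perturbation of g = its own change + signed share from each parent.
        if g not in cache:
            cache[g] = delta[g] + sum(
                sign * pf(u) / out_degree[u] for u, sign in parents[g]
            )
        return cache[g]

    return {g: pf(g) for g in delta}


def pathway_perturbation(delta, edges):
    """Total absolute perturbation accumulated over all pathway genes."""
    return sum(abs(v) for v in perturbation(delta, edges).values())


if __name__ == "__main__":
    # Hypothetical pathway with the same three genes wired in two ways.
    delta = {"A": 2.0, "B": 0.0, "C": 0.0}
    chain = [("A", "B", +1), ("B", "C", +1)]     # A is upstream of B and C
    rewired = [("C", "B", +1), ("B", "A", +1)]   # A is now the most downstream
    print(pathway_perturbation(delta, chain))    # 6.0
    print(pathway_perturbation(delta, rewired))  # 2.0
```

Both wirings contain the same gene set and the same measurements, so ORA and FCS would score them identically, whereas a topology-aware score distinguishes a change at the top of the pathway from the same change at the bottom.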
FCS methods that use correlations among genes
Shojaie et al. proposed a method, called NetGSA, that accounts for the change in correlation as well as the change in network structure as experimental conditions change.
Although PT-based methods are difficult to generalize, they have several common limitations. One obvious problem is that the true pathway topology depends on the cell type, because of cell-specific gene expression profiles, and on the condition being studied. However, this information is rarely available and is fragmented across knowledge bases, even when it is fully understood.
The current challenges in pathway analysis can be divided into two broad categories: i) annotation challenges and ii) methodological challenges. We believe that development of the next generation of pathway analytic approaches will require improvement of the existing annotations. It is necessary to create accurate, high resolution knowledge bases with detailed condition-, tissue-, and cell-specific functions of each gene. These knowledge bases will allow investigators to model an organism's biology as a dynamic system, and will help predict changes in the system due to factors such as mutations or environmental changes.
Recent technological advances in genomics and proteomics are generating data at unprecedented high resolution. As a result, there is a need for correspondingly high resolution annotation knowledge bases. For instance, RNA-seq studies estimate that more than 90% of human genes are alternatively spliced. Multiple transcripts from the same gene may have related, distinct, or even opposing functions.
Green arrows represent abundantly available information, and red arrows represent missing and/or incomplete information. The ultimate goal of pathway analysis is to analyze a biological system as a large, single network. However, the links between smaller individual pathways are not yet well known. Furthermore, the effects of a SNP on a given pathway are also missing from current knowledge bases. While some pathways are known to be related to a few diseases, it is not clear whether the changes in pathways are the cause for those diseases or the downstream effects of the diseases.
Therefore, before pathway analysis can exploit current and future technological advances in biotechnology, it is critically important to annotate exact transcripts and SNPs that participate in a given pathway. While new approaches are being developed in this regard, they may not yet be adequate. For example, Braun et al. proposed a method for analyzing SNP data from a GWAS
Despite the enormous number of annotations available in the public domain, a surprisingly large number of genes are still not annotated. For instance, the November 2009 release of GO contained entries for 18,587 human genes annotated with at least one GO term (
The estimated number of known genes in the human genome was adjusted between January 2003 and December 2003, and annotation practices were modified between December 2004 and December 2005 and again between October 2008 and November 2009. Mainly as a result of these changes, the number of annotated genes and the number of annotations have decreased. One can nevertheless argue that the quality of annotations is improving, as demonstrated by the steady increase in non-IEA annotations and in the number of genes with non-IEA annotations. However, the increase in the number of genes with non-IEA annotations is very slow. In almost 7 years, between January 2003 and November 2009, only 2,039 new genes received non-IEA annotations. At the same time, the number of non-IEA annotations increased from 35,925 to 65,741, indicating a strong research bias toward a small number of genes.
In addition to incomplete annotations, many of the existing annotations are of low quality and may be inaccurate. For instance, >95% of the annotations in the October 2007 release of GO had the evidence code “inferred from electronic annotation” (IEA). These annotations are the only ones in GO that are not curated manually.
It is very likely that the reduced number of annotations and annotated genes since January 2003 is an indicator of improving quality. This is due in part to the fact that the number of genes in a genome is continuously being adjusted and the functional annotation algorithms are being improved. Indeed, the number of non-IEA annotations is continuously increasing (
Manual curation of the entire genome is expected to take a very long time (∼13–25 years).
Most pathway knowledge bases are built by curating experiments performed in different cell types at different time points under different conditions. However, these details are typically not available in the knowledge bases. One effect of this omission is that multiple independent genes are annotated to participate in the same interaction in a pathway. This effect is so widespread that many pathway knowledge bases represent a set of distinct genes as a single node in a pathway, and this representation is part of the standard BioPAX format. An example of this problem is the
However, this contextual information is typically not available from most of the existing knowledge bases. A standard functional annotation format discussed above would make this information available to curators and developers. For instance, the recently proposed Biological Connection Markup Language (BCML) allows pathway representation to specify the cell or organism in which each pathway interaction occurs
Existing knowledge bases do not describe the effects of an abnormal condition on a pathway (
Although multivariate pathway-level statistics outperform univariate statistics on simulated data, univariate statistics are equal to or better than multivariate statistics on real biological data.
Using simulated data
A number of well-studied biological data sets can be used for this purpose.
While information missing from pathway knowledge bases limits analysis from a systems biology perspective, no existing approach can collectively model and analyze high-throughput data as a single dynamic system. Current approaches are designed to analyze a snapshot of a biological system by assuming that each pathway is independent of the others at a given time. A typical approach for analyzing dynamic response at the pathway level is to measure expression changes at multiple time points, and analyze each time point individually to see which pathways are significant at each time point.
For example, existing approaches for pathway analysis of gene expression profiles obtained from transplanted organ biopsies on day 1 would identify
Topology-based analysis approaches can potentially model and analyze dynamic responses. For example, IF analysis models each pathway as a linear system and propagates changes in gene expression as perturbations in the system via interactions between gene products.
Gene set–based approaches often consider only genes and their products, and completely ignore the effects of other molecules participating in a pathway, such as those involved in the rate-limiting step of a multi-step pathway. For instance, the amount/strength of Ca2+ causes different transcription factors to be activated.
In the last decade, pathway analysis has become the first choice for extracting and explaining the biology underlying high-throughput molecular measurements. Today, virtually every bioinformatics study looks for statistically significant pathways as either biological interpretation or validation of computationally derived results. This paper discusses the evolution of pathway analysis methods for high-throughput molecular measurements in the last decade, distinctly divided into three generations based on the type of analysis they perform. Although widely adopted, the first generation of pathway analysis methods, ORA methods, decouple molecular measurements from functional analysis and assume that genes and pathways are independent of each other. The second-generation FCS methods address these limitations. PT-based methods further improve FCS methods by considering the number and type of interactions between genes, which FCS methods ignore.
Despite these efforts, outstanding annotation and methodological challenges remain. First, low resolution knowledge bases, missing condition- and cell-specific information, and incomplete annotations restrict development of the next generation of pathway analysis methods. Second, the inability to integrate the dynamic nature of a biological system into analysis limits the utility of existing methods. Nevertheless, as the number and type of functional annotations increase, and as technological advances and analysis methods provide better guidance for planning subsequent biological experiments, the utility of pathway analysis and confidence in its results will likely improve. The community must address these challenges collectively to move pathway analysis into a next generation that can exploit the new high-throughput technologies, both to better understand large biological systems and to increase the specificity, sensitivity, and relevance of pathway analysis.
Description of the linear model used by IF analysis.
(PDF)
Feature comparison of a few existing pathway analysis tools in each generation.
(PDF)
Comparison of 11 ORA pathway analysis tools and analysis features available in them.
(PDF)
Comparison of seven FCS pathway analysis tools and analysis features available in them.
(PDF)
Comparison of three PT-based pathway analysis tools and analysis features available in them.
(PDF)
NCBI Entrez Gene statistics for the types of genes annotated for humans.
(PDF)
We thank Valmik Desai, Richard Hayden Jones, Nigam Shah, and Shai Shen-Orr for their useful comments.