Current address: Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
Conceived and designed the experiments: SZ. Performed the experiments: HC WHW. Analyzed the data: JC DX ZF SZ. Contributed reagents/materials/analysis tools: JC DX ZF JM WHW SZ. Wrote the paper: SZ.
The authors have declared that no competing interests exist.
Complex interactions between genes or proteins contribute substantially to phenotypic evolution. We present a probabilistic model and a maximum likelihood approach for cross-species clustering analysis and for identification of conserved as well as species-specific co-expression modules. This model enables a “soft” cross-species clustering (SCSC) approach by encouraging but not enforcing orthologous genes to be grouped into the same cluster. SCSC is therefore robust to obscure orthologous relationships and can reflect different functional roles of orthologous genes in different species. We generated a time-course gene expression dataset for differentiating mouse embryonic stem (ES) cells, and compiled a dataset of published gene expression data on differentiating human ES cells. Applying SCSC to analyze these datasets, we identified conserved and species-specific gene regulatory modules. Together with protein-DNA binding data, an SCSC cluster specifically induced in murine ES cells indicated that the KLF2/4/5 transcription factors, although critical to maintaining the pluripotent phenotype in mouse ES cells, were decoupled from the OCT4/SOX2/NANOG regulatory module in human ES cells. Two of the target genes of murine KLF2/4/5,
A major goal in biology is to understand the evolution of complex traits, such as the development of multicellular body plans. To a certain extent, complex traits are governed by regulated gene expression. Comparing expression data between species requires more care than comparing sequences, because gene expression is not static and the level of expression is influenced by external conditions. Considering that co-expression patterns are often comparable across species, we developed a statistical model for cross-species clustering analysis. The model allows each species to create its own clusters of genes but also encourages the species to borrow strength from each other's clusters of orthologous genes. The result is a pairing of clusters, one from each species, where the paired clusters share many but not necessarily all orthologous genes. The model-based approach not only reduces subjective influence but also enables effective use of evolutionary dependence. Applying this model to analyze human and mouse embryonic stem (ES) cell data, we identified the transcription factors and the signaling proteins that are specifically expressed in either human or mouse ES cells. These results suggest that the pluripotent cell identity can be established and maintained through more than one gene regulatory network.
A major goal in biology is to understand the evolution of complex traits, such as the development of multicellular body plans or an organism's physical state as it ages
Cross-species comparative analyses have made fundamental contributions to biology, most remarkably exhibited by comparative analysis of genomic sequences
A major challenge in comparing expression data between organisms is that gene expression is not static and the level of expression is influenced by external conditions. This difficulty was circumvented in the special cases in which identical perturbations could be applied across species, as in comparisons of the sexes across species
Automatic clustering algorithms, such as K-means and hierarchical clustering, have been widely used in gene expression data analysis to discover co-expression patterns that can be translated to biological knowledge or new hypotheses
We have developed a statistical model for cross-species clustering analysis. The model allows each species to create its own clusters of genes but also encourages the species to borrow strength from each other's clusters of orthologous genes. The result is a pairing of clusters, one from each species, where the paired clusters share many but not necessarily all orthologous genes. The clustering and degree of overlap are chosen by the data through maximum likelihood estimation. The model-based approach not only reduces subjective influence but also enables effective use of evolutionary dependence.
A model-based Soft Cross-Species Clustering (SCSC) method was developed. The rationale of this model stems from the following observations and intuitions. First, clusters of co-expressed genes may be conserved across a large evolutionary distance, in the sense that the orthologous genes also exhibit correlated expression
We formulated the above observations and intuitions into a probabilistic model, with certain simplifications that made the model mathematically tractable. First, we assumed that in every species there are a certain number of clusters that can be mapped one-to-one (called orthologous clusters), with each cluster corresponding to an essential regulatory program. However, the mean profiles of orthologous clusters were assumed to be independent. Second, the expression of a gene in a cluster was assumed to be a sample from a Gaussian distribution centered on the characteristic (mean) profile of this cluster. This assumption is commonly made in model-based clustering analysis
1. G1–G8 are eight genes in Species 1, which have orthologues G1′–G8′ in Species 2. 2. G4 and G9′ do not have orthologues, but they participate in the clustering analysis. 3. The shapes of the lines represent gene expression patterns. For example, G1 has an increasing pattern and G6′ has a first decreasing and then increasing pattern. 4. The genes with the same color, except for the black color, are clustered together. Genes in black are “scattered” genes, which form a singleton cluster each.
The performance of SCSC was compared with that of DCA, K-means, hierarchical clustering, MCLUST, WGCNA and CLICK clustering
To mimic errors in the orthology map or the scenario where some orthologous genes have divergent functional roles in two species, we permuted a proportion (10%–30%) of the orthologous relations into wrong matches in the first synthetic dataset. SCSC, DCA, and K-means were executed on these datasets with orthology errors (
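The perturbation described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the function name `corrupt_orthology`, the gene labels, and the use of a cyclic shift to create the wrong matches are all assumptions made for the example.

```python
import random

def corrupt_orthology(pairs, error_rate, seed=0):
    """Return a copy of `pairs` in which a fraction `error_rate` of the
    species-2 partners has been reassigned to a wrong gene.

    pairs: list of (species-1 gene, species-2 gene) tuples with unique genes.
    """
    rng = random.Random(seed)
    pairs = list(pairs)
    n_wrong = round(error_rate * len(pairs))
    if n_wrong < 2:
        return pairs  # a single pair cannot be mismatched on its own
    # Randomly choose which orthologous relations to break
    chosen = rng.sample(range(len(pairs)), n_wrong)
    partners = [pairs[i][1] for i in chosen]
    # A cyclic shift by one guarantees that every chosen gene receives a
    # partner different from its true orthologue (assumed scheme)
    shifted = partners[1:] + partners[:1]
    for idx, new_partner in zip(chosen, shifted):
        pairs[idx] = (pairs[idx][0], new_partner)
    return pairs

# Hypothetical orthology map of 100 gene pairs, 20% of which are corrupted
true_map = [(f"G{i}", f"G{i}'") for i in range(1, 101)]
noisy_map = corrupt_orthology(true_map, error_rate=0.2)
n_changed = sum(a != b for a, b in zip(true_map, noisy_map))
```

Because the wrong partners are drawn from the same gene set, the corrupted map keeps every gene exactly once, so any drop in clustering performance can be attributed to the mismatches rather than to missing genes.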
The biological process that inspired the SCSC model is cellular differentiation, a fundamental process occurring universally in multicellular organisms. Embryonic stem (ES) cells were used as a tool to study this process. ES cells are characterized by the ability to self-renew and differentiate into every cell type found in the mature organism. We are interested in determining the extent to which molecular circuits that underlie ES cell phenotypes and the processes of commitment and differentiation are conserved across species.
Human and mouse ES (hES and mES) cells share the critical properties of ES cells but do not employ the identical set of transcription factors. For example, transcription factor FOXD3 is required for mES cell self-renewal
We generated detailed time-course microarray data during a differentiation process of mES cells (GEO accession number:
Representative transcription regulators are listed in each cluster. Thick lines enclose clusters with upregulation in either mouse or human ES cells. Dotted lines enclose the conserved clusters with upregulation in ES cells of both species. Detailed expression patterns of every cluster and sample information are given in
Clusters (2, *) and (3, *) were upregulated in mES cells, and Cluster (*, 3) was upregulated in hES cells (
To explore which signaling pathways and what components of these signaling pathways are induced in hES and mES cells, we mapped the genes that were induced in either hES or mES cells, i.e., Clusters (2, *), (3, *) and (*, 3), onto all known signaling pathways documented in the KEGG pathway database
The genes induced in either hES or mES cells, i.e., Clusters (2, *), (3, *) and (*, 3), are mapped onto all the signaling pathways documented in the KEGG database
Among 1,113 genes involved in transcriptional regulation (GO: 0003700) and included in this analysis, 448 clustered in either mES or hES upregulated clusters ((2, *), (3, *) and (*, 3)), indicating that a very large proportion (40%) of the transcriptional regulators were utilized in ES cells. Among these 448 transcription regulators, 85 (19%) exhibited conserved upregulation in mES and hES cells (in clusters (2, 3) and (3, 3)), representing a core set of regulators with higher expression in undifferentiated than differentiated ES cells (
KLF2, KLF4 and KLF5 belong to the Krüppel-like factor (KLF) family of evolutionarily conserved zinc finger transcription factors that regulate numerous biological processes, including proliferation, differentiation, development and apoptosis
Nodes represent upregulated genes in ES cells in a conserved (blue, upregulated in both hES and mES cells) or species-specific (red, upregulated in mES cells only) manner. Edges represent positive regulatory relationships (activation) that are validated by ChIP-chip and RNAi data in both species (dark blue), in mouse only (red), or in human only (light blue). As the KLF module appears to have lost its regulatory function in hES cells, its target genes
The mES cell expression of the three KLF factors was not conserved in humans. Human
If the KLF2/4/5 module was mouse-specific, it should
Another group of KLF target genes in mice exhibited conserved upregulation in hES cells. ChIP-chip and RNAi data
In summary, the mouse-specific KLF2/4/5 regulatory module upregulated a set of key mES cell regulators. This module was not conserved in humans and therefore represented a peripheral component of the pluripotency maintaining regulatory networks.
To what extent do gene clusters reflect functionally related gene groups? Although we do not expect a generic answer to this question, well-deliberated quantitative analyses may provide useful empirical data. Two sets of co-regulated genes were derived from an independent functional analysis, where seven regulatory proteins were knocked down by RNAi in mES cells
The applications of clustering analyses of expression data are limited by strong noise in the results. Some genes known to be involved in a particular pathway are invariably missed, whereas other apparently unrelated genes exhibit expression profiles that are strikingly similar to bona fide pathway components
Similar to KLF4,
Compared to differentiated cells, relatively few signal transduction factors were produced in ES cells. Comparing within the clusters that were upregulated in either mES or hES cells, i.e., among Clusters (2, *), (3, *) and (*, 3), genes involved in NOTCH, WNT, TGFβ, JAK-STAT and MAPK pathways were all depleted in the conserved clusters ((2, 3) and (3, 3), p-value < 4×10^−5). The lack of shared signal transduction factors in the conserved clusters suggests that these signaling pathways are either absent from one of the two ES cell types or implemented differently in them (
Pathway | Components | Mouse specific: Clusters (2, *), (3, *) but not (2, 3), (3, 3) | Human specific: Clusters (*, 3), but not (2, 3), (3, 3) | Shared: Clusters (2, 3), (3, 3) | Comment |
JAK-STAT | Extracellular & membrane | LIF, IL2R, IL4R, IL6R | mES specific | ||
Downstream factors | JAK3, TK2#, PTPRC, PTPRF, PTPRN, SOCS3, PINK1* | ||||
Transcription regulators | STAT3#, STAT4, STAT5A, STAT5B, STAT6 | ||||
NOTCH | Extracellular & membrane | NOTCH4, JAGGED2, MFNG | |||
Downstream factors | |||||
Transcription regulators | NCOA1* | ||||
TGFβ | Extracellular & membrane | TGFβ1, TGFβR1 | BMP4, ACVR2B | LEFTY, BMPR1 | Alternatively implemented |
Downstream factors | MAPK3, PINK1* | PPP1CC*, SAR1A | |||
Transcription regulators | [SMAD7]&, SKIL&, NCOA1* | [SMAD2]& (interact with TGIF1) | TGIF1& | ||
WNT | Extracellular & membrane | [FRIZZLED9], {LRP5} | [FRIZZLED3], {LRP6} | [FRIZZLED7] | |
Downstream factors | RHOA | CSNK2A1, CSNK1D | CSNK2B | ||
Transcription regulators | [TLE4]&, LEF1, MYC | [TLE1]&, TCF7L2 | CTBP2& | ||
MAPK | Extracellular & membrane | FGF4, MET*, EGFR, GRB2 | FGFR2 | {FGF2}, TDGF1 | |
Downstream factors | [ARAF], {PTPRC}, {PTPRF}, {PTPRN}, [MAP3K6], [MAP2K7], [MAPK3] | [KRAS], {PTPRK}, {PTPRG}, {PTPN11}, [MAP3K7] | [RAF1] | ||
Transcription regulators | ATF4 | ||||
VEGF | Extracellular & membrane | Unlikely to be expressed in ES cells | |||
Downstream factors | PINK1*, PLCD1, PLA2 | PTK2, PPP1CC*, CLK2 | |||
Transcription regulators |
Genes in the same family are enclosed in the same type of brackets. Genes with a * are involved in multiple pathways. Genes with a & are transcriptional repressors or co-repressors. Genes with a # have abundant transcripts in mES cells, but they do not show obvious up- or downregulation during differentiation of mES cells.
JAK-STAT and NOTCH were present in mES cells, but no typical signaling transducers of these pathways appeared to be present in hES cells. It has long been known that mES cells remain undifferentiated in the presence of Leukemia Inhibitory Factor (LIF), and the activation of Signal Transducer and Activator of Transcription 3 (STAT3) via LIF-JAK signaling appears sufficient for maintenance of pluripotency of mES cells. However, LIF is unable to maintain the pluripotent state of hES cells
TGFβ, WNT and MAPK pathways appeared to be present in both mES and hES cells. However, our data suggest that mouse and human ES cells do not always use orthologous factors in these pathways. The non-orthologous components of these signaling pathways appeared to share two common features. First, paralogous members of the same gene family could serve as surrogates of an orthologous component. Using the WNT pathway as an example, growth factors FRIZZLED9 (mES), FRIZZLED3 (hES), receptors LRP5 (mES) and LRP6 (hES), and transcription regulators TLE4 (mES) and TLE1 (hES) were alternative members of the same gene family that appeared to assume orthologous functions in mES and hES cells (
mES and hES cells are similar in the sense that they are both derived from the inner cell mass of blastocyst embryos, and are both pluripotent. Besides mES cells, pluripotent stem cells were also derived from the late epiblast layer of post-implantation mouse embryos (mEpiS cells)
Total RNA for transcriptional profiling was obtained from B6 mES cells at various stages of differentiation: undifferentiated (day 0) and after 4, 8, 12, 21 and 31 days of differentiation. Six biological samples were analyzed at each time point. B6 mES cells were cultured on mouse embryonic feeders (MEFs) using standard methods as previously described
Total RNA was extracted from the different samples using the RNeasy kit (Qiagen) and amplified using a two-round linear amplification strategy as previously described
The expression value of an orthologous gene pair is denoted as (gi, gi′), where i and i′ index two orthologous genes. The goal of SCSC is to assign a cluster label ci,i′ to every orthologous gene pair (i, i′). The range of ci,i′ goes from (1, 1) to (K, L), where K and L are the maximum numbers of clusters allowed in the two species. Without loss of generality, we assume there are no more than K clusters in either of the two species; i.e.,
SCSC takes a model-based approach. The cluster labels are assumed to be generated according to probabilities
Given the cluster indicator of a gene pair, for example
Synthetic data.
(0.01 MB PDF)
Selecting genes for SCSC analysis of mouse and human ES cells.
(0.01 MB PDF)
An iterative maximization algorithm for SCSC model.
(0.03 MB PDF)
SCSC algorithm.
(0.02 MB PDF)
Performance evaluation on synthetic datasets. Average performance scores from 20 independent runs of each algorithm are listed. Dataset numbers correspond to the datasets listed in
(0.06 MB PDF)
SCSC clusters of mouse and human ES cell differentiation. (A) Sample information for human ES cells. (B) The number of orthologous probe sets in each result cluster, and (C) the corresponding expression patterns of mouse and human clusters. Each dot represents the mean expression of a cluster in a biological replicate.
(0.04 MB PDF)
Performance evaluation with independent experimental data. The consistency of a clustering result with a set of co-regulated gene groups is measured by the biological homogeneity index (BHI). K-means clustering was run 10 times with different initial values. The red bars and their error bars represent the average BHI and the standard deviation for these 10 runs. The best performance out of the 10 runs is also reported (yellow bar). Two test sets of co-regulated gene groups were defined as follows. Set 1: Co-regulated genes were defined as the genes whose expression levels were affected in all seven RNAi knockdown experiments, each targeting one of seven regulatory proteins (OCT4, SOX2, NANOG, ESRRB, TBX3, TCL1, DPPA4) that maintain ES cell identity. Each RNAi experiment provided a list of genes whose expression was affected. Taking the intersection of the seven gene lists, a total of 60 genes were identified as regulated by all seven ES cell regulators. These 60 genes were used as the first test set. Set 2: Co-regulated genes were defined as the genes whose expression was affected by RNAi knockdown of all four of the transcription factors NANOG, OCT4, SOX2 and ESRRB, and for which the direction of expression change was the same. These four transcription factors physically interact and synergistically regulate gene expression in ES cells. Two groups of co-regulated genes were identified. Group 1 contained 107 genes that were consistently induced by the RNAi of each of the four factors, whereas group 2 contained 48 genes that were repressed by all four RNAi treatments. These two co-regulated gene groups constitute the second test set.
(0.03 MB PDF)
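The BHI used in the evaluation above can be illustrated with a simplified sketch. It assumes the commonly used definition in which, within each cluster, the score is the fraction of annotated gene pairs that share at least one functional group, averaged over clusters. The function name `bhi`, the toy gene names, and the choice to skip clusters with fewer than two annotated genes are assumptions of this example, not details taken from the paper.

```python
from itertools import combinations

def bhi(cluster_of, groups_of):
    """Simplified biological homogeneity index.

    cluster_of: dict mapping gene -> cluster label
    groups_of:  dict mapping gene -> set of co-regulated group labels
                (genes missing from this dict count as unannotated)
    """
    # Collect the annotated genes of every cluster
    members = {}
    for gene, label in cluster_of.items():
        if groups_of.get(gene):
            members.setdefault(label, []).append(gene)
    scores = []
    for genes in members.values():
        if len(genes) < 2:
            continue  # no gene pair to score (assumed convention)
        pairs = list(combinations(genes, 2))
        # A pair agrees when the two genes share at least one group
        agree = sum(1 for x, y in pairs if groups_of[x] & groups_of[y])
        scores.append(agree / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0

# Toy example with hypothetical genes; "f" carries no annotation
cluster_of = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
groups_of = {"a": {"induced"}, "b": {"induced"}, "c": {"repressed"},
             "d": {"repressed"}, "e": {"repressed"}}
score = bhi(cluster_of, groups_of)
```

In the toy example, cluster 1 scores 1/3 (one agreeing pair out of three) and cluster 2 scores 1, so the index is 2/3. A value near 1 means that clusters rarely mix genes from different co-regulated groups.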
Scheme of computational implementation of the SCSC method. The scheme mimics an EM algorithm for clustering one-species data under a Gaussian-mixture model. (
(0.02 MB PDF)
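For orientation, the single-species EM algorithm that the scheme above is said to mimic can be sketched as follows. This is not the SCSC algorithm: the cross-species coupling through orthologous gene pairs is omitted, and the shared spherical variance, the fixed iteration count, and the function name `em_gmm` are simplifying assumptions.

```python
import math

def em_gmm(data, init_means, n_iter=50):
    """EM for a mixture of spherical Gaussians with one shared variance.

    data:       list of expression profiles (lists of floats)
    init_means: initial cluster mean profiles
    Returns (means, weights, variance, responsibilities).
    """
    n, d, k = len(data), len(data[0]), len(init_means)
    means = [list(m) for m in init_means]
    weights = [1.0 / k] * k
    var = 1.0
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each gene.
        # The Gaussian normalizing constant is shared and cancels out.
        resp = []
        for x in data:
            logp = []
            for j in range(k):
                sq = sum((xi - mi) ** 2 for xi, mi in zip(x, means[j]))
                logp.append(math.log(weights[j]) - 0.5 * sq / var)
            top = max(logp)  # subtract the maximum for numerical stability
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate mixing weights, mean profiles and variance
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)  # guard empty clusters
            weights[j] = max(nj / n, 1e-12)
            means[j] = [sum(r[j] * x[t] for r, x in zip(resp, data)) / nj
                        for t in range(d)]
        var = sum(r[j] * sum((xi - mi) ** 2 for xi, mi in zip(x, means[j]))
                  for r, x in zip(resp, data) for j in range(k)) / (n * d)
        var = max(var, 1e-6)  # keep the variance strictly positive
    return means, weights, var, resp

# Toy data: three increasing and three decreasing expression profiles
data = [[0.0, 1.0, 2.0], [0.2, 1.1, 2.1], [-0.1, 0.9, 1.8],
        [2.0, 1.0, 0.0], [2.1, 0.9, 0.2], [1.9, 1.2, -0.1]]
means, weights, var, resp = em_gmm(data, init_means=[data[0], data[3]])
labels = [max(range(len(r)), key=lambda j: r[j]) for r in resp]
```

The responsibilities in `resp` are the "soft" assignments; taking the most probable cluster per gene, as in the last line, turns them into hard labels.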
Synthetic datasets. Cluster number is the number of clusters in the two species. For example, means 10 clusters in both species. Dimension is the number of samples in each species. “# of data points in each cluster” is the number of orthologous genes in each cluster. “# of scatter data points” is the number of randomly distributed gene pairs that do not belong to any cluster. They represent intrinsic deviation of the transcriptome from a clustering model. The cluster means of datasets 1–5 are randomly generated between 0 and 10. The cluster means of dataset 6 are generated between 0 and 13. Cluster variation shows the standard deviation used to generate each cluster, with the two numbers representing the standard deviations for the two species.
(0.02 MB PDF)
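A dataset with the properties listed in this caption could be generated along the following lines. This is a sketch under stated assumptions: the function name `make_synthetic` and the parameter values in the usage example are illustrative, cluster means are drawn uniformly from the stated range, and scatter points are drawn uniformly from the same range, which the caption does not specify.

```python
import random

def make_synthetic(n_clusters, dim, n_per_cluster, n_scatter,
                   sd=(1.0, 1.0), mean_range=(0.0, 10.0), seed=0):
    """Generate orthologous gene pairs for two species.

    Returns (pairs, labels): each pair holds one expression profile per
    species; the label is the cluster index, or -1 for a scatter pair.
    """
    rng = random.Random(seed)
    lo, hi = mean_range
    # Mean profiles are drawn independently for the two species
    means = [[[rng.uniform(lo, hi) for _ in range(dim)]
              for _ in range(n_clusters)] for _ in range(2)]
    pairs, labels = [], []
    for c in range(n_clusters):
        for _ in range(n_per_cluster):
            # Gaussian noise around the cluster mean, one sd per species
            pair = tuple([rng.gauss(m, sd[s]) for m in means[s][c]]
                         for s in range(2))
            pairs.append(pair)
            labels.append(c)
    for _ in range(n_scatter):
        # Scatter pairs follow no cluster (uniform draw is an assumption)
        pair = tuple([rng.uniform(lo, hi) for _ in range(dim)]
                     for _ in range(2))
        pairs.append(pair)
        labels.append(-1)
    return pairs, labels

# Illustrative sizes only; they are not taken from the table
pairs, labels = make_synthetic(n_clusters=10, dim=8,
                               n_per_cluster=20, n_scatter=30)
```

Keeping the true labels alongside the data makes it possible to score any clustering result against the known cluster membership.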
Performance evaluation with errors in ortholog map. 10%–30% of the ortholog mapping in synthetic dataset 1 (
(0.01 MB PDF)
SCSC clusters. (A) with conserved upregulation in hES and mES cells, (B) specifically upregulated in mES cells, and (C) specifically upregulated in hES cells.
(0.15 MB PDF)
Conserved transcription regulators in human and mouse ES cells. Genes with a & may act as transcriptional repressors or corepressors.
(0.01 MB PDF)
We thank Seth Ament, Xin He, Yue Lu, Drs. Peter Bickel, Rex Gaskins, Kim Hughes, Douglas Melton and Lisa Stubbs for useful discussion and suggestions.