Conceived and designed the experiments: TK PW RB. Performed the experiments: TK PW. Analyzed the data: TK PW. Wrote the paper: TK PE RB. Designed the CMMR and BiclusterCards, collected and standardized the datasets analyzed, generated results visualizations: TK. Implemented and tested the method: PW. Aided in testing the resource: AB. Oversaw all biological aspects of the project, contributed to the validation and visualization of the results: PE. Oversaw all aspects of the project: RB.
The authors have declared that no competing interests exist.
The increasing abundance of large-scale, high-throughput datasets for many closely related organisms provides opportunities for comparative analysis via the simultaneous biclustering of datasets from multiple species. These analyses require rethinking how multi-species datasets are organized and how the results of comparative genomics analyses are visualized. Recently, we developed a method, multi-species cMonkey, which integrates heterogeneous high-throughput datatypes from multiple species to identify conserved regulatory modules. Here we present an integrated data visualization system, built upon the Gaggle, enabling exploration of our method's results (available at
Advancing high-throughput experimental technologies are providing access to genome-wide measurements for multiple related species on multiple information levels (e.g. mRNA, protein, interactions, functional assays, etc.). We present a biclustering algorithm and an associated visualization system for generating and exploring regulatory modules derived from analysis of integrated multi-species genomics datasets. We use multi-species-cMonkey, an algorithm of our own construction that can integrate diverse systems-biology datatypes from multiple species to form biclusters, or condition-dependent regulatory modules, both those conserved across the species analyzed and those specific to subsets of the processed species. Our resource is an integrated web- and Java-based system that allows biologists to explore both conserved and species-specific biclusters in the context of the data, associated networks for both species, and existing annotations for both species. Our focus in this work is on the use of the integrated system with examples drawn from exploring modules associated with nitrogen metabolism in two Gram-negative bacteria,
It is now routine to have genomics data for multiple organisms of interest. For example, data may be available both for an organism of primary relevance to a specific study and for related species. Tools and algorithms for comparative analysis of multi-species datasets are therefore in high demand. Comparative analysis of gene sequences is a mainstay in computational biology
A number of tools are being developed for interpreting and exploring large-scale biological networks, such as: PathSys
Several recent studies have shown that comparative genomics analysis improves our ability to learn regulatory interactions, co-regulated groups, and to delineate the conserved components of fundamental pathways and modules
The analysis of multiple species datasets presents several challenges not encountered when analyzing single species datasets. In addition to the display and exploration of multiple datatypes, such as interaction networks, cis-regulatory sequences, transcriptome and proteome data, we add the challenge of tracking connections between orthologous groups of genes. In this work we focus on exploring sets of multi-species biclusters generated with MScM. A typical multi-species biclustering (set of biclusters) will consist of:
The source data used to:
Compute the biclustering. For each species, its protein association networks, upstream sequences and expression data
Perform post-analytic evaluations, such as enrichment of ontology terms, i.e. GO functions and KEGG pathways
A set of conserved biclusters. Biclusters composed of pairs of orthologous genes spanning both species
Species-specific elaborations of the conserved biclusters. Following the initial generation of the conserved core of the biclusters, genes added to conserved biclusters based on evidence in a single species – including genes lacking putative orthologs in the other species
Species-specific biclusters. Biclusters composed entirely of genes lacking detectable orthology relationships between the two species
Our system for navigating this analysis enables exploration of conserved biclusters in the context of both species, and of species-specific additions to conserved biclusters in the context of each individual species dataset. It also illustrates general strategies for building loosely coupled systems for exploring other multi-species genomics analyses.
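The components of a two-species biclustering listed above can be pictured as a small data structure. The following is a minimal sketch for illustration only: the class name, field names, and gene identifiers are hypothetical and are not taken from the MScM code.

```python
from dataclasses import dataclass, field


@dataclass
class ConservedBicluster:
    """Hypothetical container for one conserved bicluster in a two-species analysis."""

    bicluster_id: str
    species: tuple[str, str]
    # Conserved core: pairs of orthologous genes, ordered as in `species`.
    core_pairs: list[tuple[str, str]]
    # Species-specific elaborations: species name -> genes added in that species only.
    elaborations: dict[str, set[str]] = field(default_factory=dict)

    def genes_for(self, species: str) -> set[str]:
        """Return core plus elaborated genes for one species."""
        idx = self.species.index(species)
        core = {pair[idx] for pair in self.core_pairs}
        return core | self.elaborations.get(species, set())


# Illustrative identifiers only; these are not real loci.
example = ConservedBicluster(
    bicluster_id="example_1",
    species=("speciesA", "speciesB"),
    core_pairs=[("a1", "b1"), ("a2", "b2")],
    elaborations={"speciesA": {"a9"}},
)
```

Here `example.genes_for("speciesA")` combines the conserved core with the single elaborated gene, while `example.genes_for("speciesB")` returns the core alone because no genes were added for that species.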
High-throughput data exists for many microbial organisms on multiple information levels (e.g. genome sequences, transcriptomics, proteomics, metabolomics, networks of pathways and interactions). Collecting and integrating diverse and heterogeneous datasets from disparate databases is not trivial and poses a number of barriers to automating the process. One of the most significant barriers to automating data import is the inconsistency among the naming schemes for loci, mRNA and protein products employed by the major public repositories such as NCBI, Uniprot and EMBL. Versioning can also be an issue if a given data source is slow to update its annotations. Our resource integrates diverse data from microarray experiments, genomic sequences, and various functional associations. It uses a database for translating gene names across datatypes and disparate resources and ortholog names across species, and is linked to the Gaggle. We will focus our examples on two closely related γ-Proteobacteria:
Clustering and biclustering are typically used to identify groups of co-expressed genes that, ideally, represent true regulatory modules and co-functional groups such as pathways and complexes. Biclustering groups genes into condition-specific gene clusters, and can allow genes to participate in more than one bicluster. Many biclustering methods have been previously described, for example, SAMBA
To enable exploration of a multi-species integrative biclustering result, we have constructed a system using the Gaggle and MScM (
The CMMR consists of an integrated suite of web components for visualizing the diverse aspects of the multi-species, multi-datatype analysis and for facilitating access to each organism's dataset. (A) Written descriptions of the individual components for hypothetical Organism 1. (B) The corresponding graphics of each component goose displaying example data, for hypothetical Organism 2. Each of the components fetches information from the data compendium (MScM results and raw data). (C) The CMMR integrative components: the FireGoose allows transfer of data between web pages and gaggled software, the Gaggle Boss acts as a hub for passing communications among the geese, and the Global Synonym/Ortholog Translator converts among gene annotations and accessions and translates orthologous genes between organisms. The arrows represent information flow between tools, primarily as broadcasts between tools and the Gaggle Boss.
We present an overview of the MScM algorithm, and the system we have constructed for visualizing the resulting multiple-species biclusters. Further methodological detail, additional validation of our method, and a full description of the dataset used to demonstrate our resource can be found in the supplemental section (
Microarray data was acquired from several large, public repositories such as the Gene Expression Omnibus (GEO)
The MScM algorithm consists of four main steps. Beginning with step 1, putative orthologous relationships between genes in each species are identified using InParanoid
We have made the cMonkey and MScM code available including tools for automating many of the data acquisition and processing steps required for assembling an integrated dataset
We created a database containing the MScM biclustering analysis data compendium for a number of microbial species. Our pipeline begins with several post-processing steps to convert cMonkey output to Gaggle-compatible formats. Enrichment of functional annotations is determined for each bicluster, and the bicluster is assigned any significant annotations (p-value < 0.05). A score is computed from the statistical components of each bicluster (e.g. residual, functional enrichment significance values). Specifically, the bicluster score is computed using Stouffer's z-score method for meta-analysis from a collection of bicluster statistics. Data files are generated for the complete bicluster network and the subnetwork of related biclusters before the website for a result is generated. Lists of orthologous genes between each species are generated as part of the analysis and loaded into the synonym/ortholog database.
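As an illustration of the scoring step, the sketch below applies the unweighted form of Stouffer's method to a list of one-sided p-values. It is not the pipeline's code: which bicluster statistics enter the score, how each is converted to a p-value, and whether weights are used are not specified here, so the function names and inputs are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_z(p_values):
    """Combine one-sided p-values (each strictly between 0 and 1) into one z-score.

    Unweighted Stouffer: Z = sum(z_i) / sqrt(k), with z_i = Phi^-1(1 - p_i).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    std_normal = NormalDist()
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    return sum(z_scores) / sqrt(len(z_scores))


def combined_p(p_values):
    """Convert the combined z-score back to a one-sided p-value."""
    return 1.0 - NormalDist().cdf(stouffer_z(p_values))
```

A larger combined z-score indicates that the individual statistics agree more strongly, so under this sketch biclusters could be ranked by `stouffer_z` applied to their per-statistic p-values.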
To mirror selections simultaneously in several tools that visualize different aspects of the data, the results, and the comparison between species, we utilize the Gaggle, a loosely coupled system of web applications (geese)
A web interface was implemented to facilitate exploration of the multi-species biclusters. The starting page allows users to create several types of queries. It contains a text box for entering a gene name or group of genes, select boxes for choosing bicluster sets from single-species, core, or elaborated MScM analyses, and a submit button that starts the search for biclusters containing the gene or genes of interest in the selected biclustering analyses (
Gaggle tools: Embedded links to integrated software tools
Statistics: The number of genes and conditions in the bicluster, score, residual, mean motif p-value, motif E-values
Enrichment Summary: Based on the most significant annotations from COG, KEGG and GO enrichment analysis
Core Genes: Genes table for conserved core members of the bicluster, including GO, KEGG, and COG gene annotations
Elaborated Genes: Same as above, but for elaborated members of the bicluster
Experiments: Table with links to the meta-data and primary articles
Bicluster Motifs: if any motifs were found, the sequence logo is displayed here along with matches to any known motifs
Enrichment Analysis: Tables for GO, KEGG, and COG annotation enrichment – with description and significance values
Related Biclusters: Table with links to biclusters with similar functional/pathway annotations, similar motifs, or overlapping gene members
Plots: Bicluster plots for gene expression profiles, mean gene expression, and expression heatmap
Each element of the bicluster card is generated automatically by our system, is compatible with outputs from other widely used biclustering tools, and provides links to descriptions/tutorials for using the linked tools or databases.
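The significance values reported in the enrichment tables come from testing whether an annotation term is over-represented among a bicluster's genes. The statistical test is not named in this section; the sketch below assumes a one-sided hypergeometric test, which is a common choice for this kind of analysis, and all parameter names are illustrative.

```python
from math import comb


def hypergeom_enrichment_p(genome_size, annotated_in_genome,
                           bicluster_size, annotated_in_bicluster):
    """Upper-tail hypergeometric p-value, P(X >= annotated_in_bicluster).

    X is the number of annotated genes seen when `bicluster_size` genes are
    drawn without replacement from a genome of `genome_size` genes, of which
    `annotated_in_genome` carry the annotation term.
    """
    total = comb(genome_size, bicluster_size)
    upper = min(annotated_in_genome, bicluster_size)
    tail = sum(
        comb(annotated_in_genome, k)
        * comb(genome_size - annotated_in_genome, bicluster_size - k)
        for k in range(annotated_in_bicluster, upper + 1)
    )
    return tail / total
```

Under this assumption, a term would be assigned to a bicluster when the returned value falls below the chosen threshold; a real analysis would usually also correct for testing many terms.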
The CMMR web interface allows users to search for biclusters of interest, with each resulting bicluster displayed in a BiclusterCard format. (A) The CMMR search page showing the title links to the CMMR wiki, query form button, upload form button, and input fields. Shown is the query form with an example search for
Visualizing the entire multiple-species dataset and integrative biclustering analysis at once, in a single view or tool, is cumbersome and ineffective at conveying biologically useful information because of the scale of the data and the multitude of different relationships in the data and analysis. Therefore, a main goal of our resource is to provide an interface that gives access to the MScM results and the collected data compendium via multiple queries (e.g. query by pathway, gene, network neighborhood, bicluster or ontology term). Although multiple queries are possible, we envision that a user will typically begin by querying for a gene or group of genes and then browse MScM gene modules. A user can then begin exploring relationships between datasets for individual genes, subnetworks of genes, among modules, or among modules with particular shared attributes, such as functional annotation. The system also allows high-level manipulation of queries, i.e. queries and operations on the results of past queries, via Sungear. Examining the intersections, complements, and unions of module gene memberships, or identifying common promoter elements among genes in a module or among modules, can be performed using Sungear following several broadcasts of gene lists. Gene lists are typically the results of queries, neighbors in a network loaded into the Cytoscape goose, or the members of biclusters. These are just a few examples of how a user can use the resource. Moreover, all of this functionality is automatically performed (mirrored) across multiple species datasets.
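The set operations on broadcast gene lists can be illustrated with a few lines of code. This is a simplified sketch and not Sungear's implementation: it groups genes by the exact combination of lists that contain them, which is one way to think about intersections among several lists. The function name, list names, and gene identifiers are hypothetical.

```python
from collections import defaultdict


def group_by_membership(gene_lists):
    """Group genes by the exact combination of named lists that contain them.

    `gene_lists` maps a list name (e.g. a bicluster) to a set of gene names.
    Returns a dict mapping frozenset(list names) -> set of genes.
    """
    membership = defaultdict(set)
    for name, genes in gene_lists.items():
        for gene in genes:
            membership[gene].add(name)
    grouped = defaultdict(set)
    for gene, names in membership.items():
        grouped[frozenset(names)].add(gene)
    return dict(grouped)


# Illustrative gene lists; names are placeholders, not real bicluster contents.
lists = {
    "bicluster_a": {"g1", "g2", "g3"},
    "bicluster_b": {"g2", "g3", "g4"},
}
shared = lists["bicluster_a"] & lists["bicluster_b"]   # intersection
either = lists["bicluster_a"] | lists["bicluster_b"]   # union
only_a = lists["bicluster_a"] - lists["bicluster_b"]   # relative complement
groups = group_by_membership(lists)
```

Grouping by exact membership keeps genes shared by two lists separate from genes unique to either one, which is the distinction a user relies on when selecting intersections for a follow-up query.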
To demonstrate our resource's capabilities, we explore nitrogen metabolism associated multi-species biclusters with the specific biological goal of identifying new genes functionally associated with nitrogen metabolism in
Nitrogen is an essential input into several metabolic pathways including amino acid and nucleotide biosynthesis, and can act as a terminal electron acceptor in dissimilatory nitrate reactions
We begin our exploration of identifying conserved biclusters containing
The BiclusterCard is a summary of the information supporting a bicluster, including links to online tools and source data. Shown in the figure are the expanded tabs for: statistics, enrichment summary from COG, GO and KEGG enrichment analysis, KEGG pathway enrichment, and core gene table for multi-species bicluster
Then, looking at the gene GO, KEGG and COG annotations by expanding the ‘Core Genes’ tab we see many genes have the same or similar annotations and some have either none or different annotations such as
Shown in the figure are the expanded tab for Plots displaying a gene expression heatmap, the expanded tab for Bicluster Motifs, and an example of the upstream motif patterns for multi-species bicluster
Expanding the ‘Bicluster Motifs’ tab displays the motifs detected in the bicluster. Two of the detected motifs for eco57 show similarity to known nitrate/nitrite response transcriptional regulator binding motifs (
Among the core gene list for this bicluster,
Expanding the Gaggle tools tab on the BiclusterCard for multi-species bicluster
Another possible use of our system is the exploration of collections of biclusters to identify novel interactions among modules. In the context of this example we can extract the subnetwork of biclusters related to the nar bicluster described above from a network that displays associations among biclusters by broadcasting the list of biclusters related to the orthologous core from the BiclusterCard to the Bicluster Network Viewer (
We can further explore nitrogen metabolism in the context of
The Sungear goose is a visualization tool capable of displaying set relationships and operations (intersections, complements, unions). In this case, sets are gene lists from a Gaggle broadcast. (A) Four biclusters were broadcast to Sungear: eco57, eco83, eco12, and eco90. Each bicluster is represented as a vertex, or anchor, on the square, and the circles, called vessels, represent the intersections of elements, in this case bicluster gene members (bottom center window). Four circles are selected (filled circles), representing the intersections of gene members for bicluster 57 with the other three biclusters, 83, 12, and 90. The list of genes from the selected sets is shown in the gene list window (left window). Manipulation of the sets is done through the control window (top center window). Over-representation of GO terms is shown in the GO term window (right window). (B) The list of 39
Using the CMMR, much knowledge was uncovered from the search of just a single gene,
We have developed a publicly accessible web resource for comparative genomics studies of several prokaryotic organisms, with plans to expand this resource over time. As described above, in our example with coupled
The CMMR wiki is intended to be a platform for information exchange, encouraging the contributions of researchers who use the resource, whether via curation or suggestions of new tools. Improvements to the resource could be made 1) in method development, for example, further optimization of the MScM algorithm and inclusion of additional analysis methods, 2) by increasing the number of included species as datasets become available, and 3) through further development of intuitive visualization and exploration tools. This effort could also serve as a framework for applications to comparative biclustering of eukaryotic organisms.