Conceived and designed the experiments: MDC OGT. Performed the experiments: MDC. Analyzed the data: MDC. Wrote the paper: MDC OGT.
The authors have declared that no competing interests exist.
Correctly evaluating functional similarities among homologous proteins is necessary for accurate transfer of experimental knowledge from one organism to another, and is of particular importance for the development of animal models of human disease. While the fact that sequence similarity implies functional similarity is a fundamental paradigm of molecular biology, sequence comparison does not directly assess the extent to which two proteins participate in the same biological processes, and has limited utility for analyzing families with several paralogous members. Nevertheless, we show that it is possible to provide a cross-organism functional similarity measure in an unbiased way through the exclusive use of high-throughput gene-expression data. Our methodology is based on probabilistic cross-species mapping of functionally analogous proteins based on Bayesian integrative analysis of gene expression compendia. We demonstrate that even among closely related genes, our method is able to predict functionally analogous homolog pairs better than relying on sequence comparison alone. We also demonstrate that the landscape of functional similarity is often complex and that definitive “functional orthologs” do not always exist. Even in these cases, our method and the online interface we provide are designed to allow detailed exploration of sources of inferred functional similarity that can be evaluated by the user.
Common ancestry is a central tenet of modern biology, as genes from different species often show a high degree of sequence similarity, making it possible to study analogous processes across model organisms. However, many genes belong to large families with several duplicates and the relationship between genes from different species is often not one-to-one, complicating the transfer of experimental knowledge. We present a method that uses a large compendium of high-throughput expression data, which covers many genes that have not been analyzed in any other way, to systematically predict which genes are most likely to participate in the same biological process and thus have analogous function in different organisms. We show that our method agrees well with current experimental knowledge and we use it to investigate several families of genes that demonstrate the complexity of functional analogy.
The idea that protein sequence similarity implies shared function is a central paradigm in modern biology, allowing experimental knowledge obtained from model organisms such as yeast or mouse to be applied to our understanding of human diseases, or to be transferred via functional annotations to newly sequenced genomes. When no clear one-to-one homology relationship exists, however, and proteins of interest belong to families with several paralogous members (which may have arisen from post-divergence duplications), our ability to correctly transfer functional annotations based on sequence-similarity is fundamentally limited.
Such difficulties regularly arise in model organism studies of disease, where it is essential to identify which proteins and pathways are functionally analogous to the mammalian protein of interest. Attempts to understand human laminopathies through studies in Drosophila, for example, are limited by the lack of knowledge of the relationships among the human and fly lamin families; no one-to-one orthology exists in this case, and the most promising approach seems to be one based on functional information, rather than sequence. Frequently, however, not enough directed experimental information is available to make an accurate comparison among all homologs, as it is often the case that some members of homologous families are much better studied than others (see
Prior efforts at identifying functional orthologs (i.e. proteins that not only share sequence ancestry but also perform the same function) have investigated the technical aspects of global network alignments, largely focusing on large-scale protein-protein (physical) interaction networks
We address these issues by developing a new local approach to alignment that leverages large collections of diverse gene expression data to identify functionally analogous homologs whose relevance to a particular research context can be easily interpreted. Microarray data is a complementary source of high throughput functional information that is in many cases as accurate as large-scale protein-protein interactions for predicting function
Identifying functional orthologs based on integrative analysis of large microarray compendia presents new challenges, including dense, hard-to-align networks, the need for robust integration that is comparable across organisms, and robust identification of similarly functioning proteins. While we focus our discussion on microarray data because it provides the most unbiased coverage, our Bayesian integration method can readily combine different data types. In fact, in our online interface we provide both microarray-only predictions and predictions that incorporate PPI information. Furthermore, while in this study we focus on microarray data because of its excellent coverage in model organisms, our approach can easily integrate other expression data such as RNA-seq as it becomes widely available
We employ a local alignment to provide a robust measurement of functional similarity among homologous proteins. This is in contrast to earlier work
Microarray studies provide a global view of the transcription state and are informative of the biological processes required by the tissue, cell-type or a particular perturbation under study. For this reason, individual microarray experiments have been used extensively to predict gene function
We provide a general approach that requires no direct alignment and seamlessly integrates the entirety of microarray data available for each species. This approach extends our previous work in large-scale microarray integrations to a multi-species comparison, allowing us to combine the breadth of microarray coverage with a network alignment approach to provide a global view of functional similarity across organisms.
Our method (
Species-specific functional networks are derived by Bayesian integration of microarray data. For each intra-species pair of genes, the networks associate a probability of functional relationship based on their pattern of correlation. For a single gene, the set of genes with high probabilities of being functionally related to it defines a functional neighborhood. To make functional neighborhoods comparable across organisms, neighbors are grouped into meta-genes according to their Treefam families. The network similarity score is then defined as the hypergeometric probability of the overlap obtained from intersecting the sets of species-independent Treefam families present in each species-specific functional neighborhood. Such intersection analysis enables identification of specific biological processes responsible for network similarity scores. We have taken a comparison between the mouse and fly Snap25 genes as the basis for the schematic figure. The overlap meta-genes are a selection and the complete overlap can be viewed online using our webserver.
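The projection of functional neighborhoods onto species-independent meta-genes can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name, the gene identifiers, and the family labels (`TF_A`, `TF_B`, `TF_C`) are hypothetical, and dropping genes that have no family assignment is an assumption.

```python
def to_meta_genes(neighborhood, gene_to_family):
    """Project a species-specific neighborhood onto family-level meta-genes.

    neighborhood: iterable of gene identifiers from one species
    gene_to_family: dict mapping gene identifier -> family label
    Genes without a family assignment are dropped (an assumption).
    """
    return {gene_to_family[g] for g in neighborhood if g in gene_to_family}


# Hypothetical neighborhoods of a mouse gene and a fly gene.
mouse_families = to_meta_genes(
    {"m1", "m2", "m3", "m4"},
    {"m1": "TF_A", "m2": "TF_A", "m3": "TF_B"},
)
fly_families = to_meta_genes(
    {"f1", "f2"},
    {"f1": "TF_B", "f2": "TF_C"},
)

# Meta-genes present in both neighborhoods; this overlap is what gets scored.
shared = mouse_families & fly_families
```

Because several paralogs collapse into one meta-gene, the overlap is counted in families rather than in genes, which is what makes neighborhoods from different species comparable.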
While it has been shown that microarray data can be used to accurately distinguish functionally interacting gene pairs from unrelated ones, it is significantly more difficult to demonstrate that an integration successfully detects the subtle differences in the functional relationships between homologous proteins. Solving this evaluation problem is a prerequisite to providing functional analogy predictions that can be trusted in cases where a sequence comparison is ambiguous, as there is no reason to believe
While sequence based analyses are indispensable for identification of molecular functions or domain architecture of proteins, organisms often possess several genes that belong to a cohesive family whose members are predicted to have the same structural or enzymatic features, which are nevertheless involved in quite different biological processes. Consider, for example, the mouse genes Snap25 (discussed briefly above) and its paralog Snap23. The two proteins have the same structural features
We hypothesize that our method, based on genomic datasets, is especially useful for the differentiation of such genes, as it provides a complementary characterization of function by specifically probing biological responses that may discriminate genes that otherwise appear similar at the sequence level. This approach is most valuable for closely related genes, which are very hard to distinguish functionally based on sequence (distantly related genes can be distinguished based on sequence alone). We therefore focus our analysis on homologs that belong to the same TreeFam family.
Our first evaluation method is based on the tissue-expression pattern of genes. This evaluation is motivated by the fact that cross-species homologs that perform the same function are expected to express in similar tissues, reflecting the specific molecular requirements of different cell types. Correctly identifying homologous genes with similar tissue expression patterns is also a worthwhile goal in its own right, as tissue-specific expression is a critical facet of complex human disorders such as cancer and diabetes
Although the anatomies of worm, fly and mouse are quite different, and many tissues in one organism have no obvious analog in another, all three organisms possess a nervous system. We thus use fly and mouse genes annotated with brain expression and worm genes annotated with neuronal expression (all based on small-scale experiments such as in situ or GFP tagging) to define a nervous system standard. (Standard data is available in the download section of the website.) We then evaluate our predictions by assessing whether nervous system expressed genes are predicted by our method to have greater similarity to their homologs that are also present in the nervous system than to those which are not. Our method is significantly better than chance at matching pairs of nervous-system-expressed homologs. Moreover, when we subject a sequence derived measure to the same evaluation, we find that our network similarity score (based solely on the microarray-based network similarity) outperforms sequence in all comparisons (
We consider single query genes that are known to express in the nervous system and have multiple homologs in another organism (according to Treefam family co-membership), with at least one of the homologs also expressed in the nervous system (“correct” functional homolog), and another whose expression has been evaluated but was not detected in the nervous system (“incorrect” functional homolog in this evaluation). We then evaluate how well the various metrics rank the homologs consistent with their nervous system expression by computing the AUCs of homolog rankings (normalized per query gene). Numbers below the bars represent the p-value that corresponds to the AUC score.
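The per-query ranking evaluation can be sketched as below. This is an illustrative reading of the caption, not the authors' code: the function names are hypothetical, scores are assumed to be oriented so that higher means more similar (a p-value based score would first be transformed, for example to its negative logarithm), and "normalized per query gene" is interpreted here as averaging one AUC per query.

```python
def query_auc(scores, labels):
    """AUC for one query gene.

    scores: similarity of each candidate homolog to the query (higher = more similar)
    labels: True for "correct" homologs, False for "incorrect" ones
    Returns the fraction of (correct, incorrect) pairs ranked in the right
    order, counting ties as half; None if either class is missing.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(queries):
    """Average the per-query AUCs so that every query gene counts equally."""
    aucs = [query_auc(scores, labels) for scores, labels in queries]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs) if aucs else None
```

Averaging per query keeps large families, which contribute many homolog pairs, from dominating the result.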
It is important to note that our approach is complementary to, and can be used side by side with, sequence-alignment based methods. In fact, as our network similarity score and sequence-based scores provide orthogonal information, we find that a simple combined rank score can improve performance further in cases where both scores perform comparably well (as evident in the fly/worm nervous system comparison,
In addition to finding that our method can accurately pair homologs that are expressed in the same tissues, we aim to assess whether we are able to correctly identify cross-species homologs that play the same biological role in a cell. Thus, we also design an evaluation based on the Gene Ontology annotations, using a standard in which homologs that share “biologically specific” (i.e. sufficient for follow-up experiments, see
This evaluation is performed identically to the nervous system evaluation with the variation that “correct” functional homologs are those that are co-annotated with the query to a specific GO term while “incorrect” ones are those that are annotated to a specific term but do not share annotations with the query. Numbers below the bars represent the p-value that corresponds to the AUC score. A. Evaluation performed with all of specific biological process annotations with experimental evidence codes. B. Evaluations performed by considering co-annotation to “cell cycle” only. C. Evaluations performed with co-annotation to “mitochondria”.
Detailed analysis reveals that there are areas of biological annotations where our network similarity score is especially accurate (as compared to sequence similarity) at identifying functionally analogous homologs. For example, if we use a standard based just on the “cell-cycle” GO annotation, the NS score performs well for all comparisons involving
To support this notion, we perform an evaluation using mitochondrial localization annotations, focusing on the homologs in mouse and yeast as these are the only two organisms in which genome-wide screens for mitochondrial localization and function have been performed
We have shown that the network similarity score provides reliable functional information that is complementary to sequence-based comparisons, correctly differentiating homologs with shared tissue-specific expression and playing similar roles in biological processes. We now illustrate how our method may be used to gain insight into the functional landscape of protein families with complex evolutionary histories.
We first consider the family represented by the mouse gene Snap25, a SNARE protein that participates in the regulation of synaptic vesicle exocytosis
A. The sequence-derived family tree (TreeFam) indicates the presence of two lineage-specific duplications, so that the fly and mouse family members are collectively co-orthologous. B. Using our method we have clustered members of the Snap25 family with respect to functional similarity. Family members cluster into neuronal and non-neuronal functional groups in a manner that is independent of their evolutionary history. Though the mouse and
In particular, mouse Snap25 shows strong functional similarity to fly Snap25 but has no significant similarity to mouse Snap23 or fly Snap24, which are nevertheless similar to each other. This is particularly unexpected since multiple alignment analysis shows that the four genes arose from a single ancestor with subsequent lineage-specific duplications (
The distribution of yeast and
Since the fly and mouse homologs are predicted to have arisen by lineage-specific duplications, it appears that the independent emergence of neuronal and non-neuronal genes is an instance of convergent evolution. Interestingly, the development of both these types of Snap25 homologs may also be evident in the two
Protein families with several lineage-specific duplications present a particular challenge for the transfer of disease models between organisms, since sequence similarity produces ambiguous mappings in such cases. For example, there has been significant interest in generating Drosophila models of vertebrate laminopathies that has been complicated by the lack of one-to-one orthologs. Vertebrate lamins have been classified into two types, type-A and type-B. Mutations in type-A cause a large class of diseases collectively termed laminopathies, such as muscular dystrophy and premature aging, while viable type-B mutations are extremely rare (the two B-type genes together are required for cellular viability while type-A lamin is not). Two main attributes are associated with the type-A/type-B distinction. First, B-type lamins are expressed ubiquitously while A-type lamins have a dynamic developmental expression profile. Second, type-B lamins possess a CaaX box that is prenylated and anchors the protein to the nuclear envelope, while mature type-A proteins do not
Unlike many other invertebrates that have a single lamin gene, the
Using our NS score to cluster mouse,
A. The sequence derived family tree (Treefam) for the lamin genes being considered. B. The patterns of functional similarity among members of the lamin family. Lamins can be broadly classified as type-A and type-B based on pattern of expression and structural features, with type-A lamin mutations causing a diverse set of human diseases. While
The similarity between Lmna and
While we have shown that the network-based score can aid in finding homologs that perform the same function and have similar phenotypes, the nature of functional conservation is complex and may not be easily summarized with a single score, as demonstrated by the lamin example. Our method was in fact designed with this challenge in mind. Once genes from different organisms are made comparable by projecting their correlation neighborhoods onto organism-independent meta-genes, our network similarity score is generated using the neighborhood overlap metric (see
To enable biologists to easily perform such analysis, we have made our method accessible through an interactive web interface that not only provides network similarity scores, but also allows the user to explore the source of inferred similarities. The user can identify precisely which meta-genes connections are shared by a pair of homologous genes and evaluate whether the overlap is representative of the particular biological functions that the user is interested in.
As an example we consider the gene SOD1 in
A Venn diagram of shared meta-gene neighbors is shown. While both
In contrast, the overlap between yeast SOD1 and worm
We have developed a method to leverage a large compendium of gene expression data to provide a measure of functional similarity of homologs across organisms. Our measure reliably predicts gene pairs that share tissue expression patterns or participate in the same biological process even for closely related genes and can therefore serve as a useful tool for identifying homologs with analogous function, as well as a way of examining more general questions about the landscape of functional similarity.
By leveraging a large compendium of expression data, our method yields both good gene coverage and extensive functional coverage by combining datasets from many tissues and perturbations. As expression datasets provide information for many genes that have not been studied in any other way, our method is, for many homologous gene-pairs, the only way currently available to explore functional relationships. Our method is also designed to allow detailed examination of the sources of our functional predictions through the provided web interface (available at
Microarray data for
Standards for Bayesian integration were constructed using a custom set of Gene Ontology terms described in
Functional interaction networks were used to define gene neighborhoods. A “hard-cutoff” neighborhood of a query gene is defined as all genes connected to the query with a reasonable probability. While the hard-cutoff of 0.5 is a natural probability cutoff (and was used by us in this paper), this too can be adjusted by the user in the online interface. As not all genes have a sufficient number of neighbors above this threshold, we have also set the minimal size of the neighborhood at 50, termed “soft-cutoff”. While using the soft-cutoff method slightly decreases our evaluation performance, we believe that it is nevertheless important to give scores for as many gene pairs as possible and we use this method throughout the paper. The soft-cutoff is the default option in our web interface, though it can be turned off by the user. For the purpose of all evaluations and figures a hard-cutoff of 0.5 and a soft-cutoff of 50 are used. Based on our calculations, a hard-cutoff of 0.5 performed best overall, though it may not be optimal for all queries.
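The interplay of the two cutoffs can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and signature are hypothetical, the threshold comparison is taken to be inclusive, and the soft-cutoff is assumed to fill the neighborhood with the next most probable neighbors until the minimal size is reached.

```python
def neighborhood(probs, hard_cutoff=0.5, min_size=50):
    """Return the functional neighborhood of one query gene.

    probs: dict mapping neighbor gene -> probability of a functional
           relationship with the query
    hard_cutoff: keep every neighbor at or above this probability
    min_size: soft cutoff; if fewer neighbors pass the hard cutoff,
              return the min_size most probable neighbors instead
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    hard = [gene for gene, p in ranked if p >= hard_cutoff]
    if len(hard) >= min_size:
        return set(hard)
    # Soft cutoff (assumed behavior): pad with the next strongest neighbors.
    return {gene for gene, _ in ranked[:min_size]}
```

Setting `min_size=0` in this sketch reproduces a pure hard-cutoff neighborhood, which corresponds to turning the soft-cutoff off.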
After gene neighborhoods are defined based on our functional interaction networks, TreeFam B families
To determine the functional similarity of genes from different organisms, we compute the hypergeometric p-value of their meta-gene neighborhood overlap. The background set of TreeFam families used for the p-value computation is specific to the organism pair considered and is defined as all TreeFam families that contained at least one gene from each organism such that the gene is also present in our microarray compendium. Likewise for the purpose of the p-value calculation the size of each gene's TreeFam neighborhood is considered to be the set of those TreeFam families that are both present in the gene's neighborhood and in the organism-pair-specific background set.
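A minimal sketch of this computation is given below. It is an illustration rather than the authors' code: the function name is hypothetical, and the p-value is assumed to be the upper tail of the hypergeometric distribution, that is, the probability of observing an overlap at least as large as the one seen.

```python
from math import comb


def overlap_pvalue(families_a, families_b, background):
    """Hypergeometric p-value for the overlap of two meta-gene neighborhoods.

    families_a, families_b: family labels in each gene's neighborhood
    background: families with at least one gene from each organism that is
                also present in the expression compendium
    """
    bg = set(background)
    # Neighborhood sizes count only families that are in the background set.
    a = set(families_a) & bg
    b = set(families_b) & bg
    N, K, n = len(bg), len(a), len(b)
    k = len(a & b)
    # P(X >= k) when drawing n families from N, of which K are "marked".
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)
```

For example, with a background of 10 families and two neighborhoods that each contain the same 3 families, the only outcome at least as extreme is complete overlap, so the p-value is 1/120.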
Our evaluation methodology is motivated by how we believe our system is likely to be used by biology researchers. In particular, given a query gene we would like to evaluate if our network similarity score produces a ranking of potential homologs that is consistent with what is known about the genes experimentally. We expect that homologs expressed in the same tissue or those that show the same phenotype as the query should be ranked above those that do not share these functional attributes. In order to evaluate this we define various standards for homolog pairings.
In the nervous system standard, homolog pairs in which both genes are expressed in the nervous system are considered positive, while homolog pairs whose expression has been studied but which were not co-expressed in the nervous system are considered negative. For Gene Ontology based evaluations we used a set of specific GO terms with experimental evidence codes (the same set that is used for standard construction). Homolog pairs that shared at least one such annotation were considered positive, while homolog pairs that have been experimentally annotated (to this set of specific GO terms) but did not have annotations in common were considered negative (See
To compute GO enrichments for Treefam families we consider a family to be annotated to a particular term if any of the member genes have an experimental annotation for that term. GO enrichment is computed as hypergeometric p-values with the background count taken from the organism-pair-specific background families defined above. Thus, while the annotations are not organism-specific, the enrichment computation does depend on the organism pair being considered. All p-values are cut off at an FDR of 0.05.
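The family-level annotation rule and the FDR cutoff can be sketched as below. This is only an illustration: the function names are hypothetical, and the text does not state which FDR procedure was used, so the Benjamini-Hochberg step-up procedure shown here is an assumption.

```python
def family_terms(family_members, gene_annotations):
    """A family carries a term if any member gene is annotated to it.

    family_members: dict mapping family label -> iterable of member genes
    gene_annotations: dict mapping gene -> set of experimentally supported terms
    """
    return {
        family: set().union(*(gene_annotations.get(g, set()) for g in genes))
        for family, genes in family_members.items()
    }


def benjamini_hochberg(pvalues, fdr=0.05):
    """Indices of p-values retained at the given FDR.

    Benjamini-Hochberg step-up procedure; the choice of procedure is an
    assumption, as the source only gives the 0.05 threshold.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    last_passing_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= fdr * rank / m:
            last_passing_rank = rank
    return set(order[:last_passing_rank])
```

The enrichment p-values passed to `benjamini_hochberg` would be hypergeometric p-values computed against the organism-pair-specific background, so the same term can pass the cutoff for one organism pair and fail for another.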
Binary evaluation standards and annotation sources.
(4.51 MB TAR)
Average fraction of family wide experimental GO annotations that belong to the single most annotated family member. Experimental annotations may often show bias with respect to close homologs as some members of homologous families are studied and annotated more thoroughly than others.
(0.01 MB EPS)
Average fraction of family wide protein-protein interactions (as compiled by BioGRID) that belong to the single most connected family member. While closely related homologs would be expected to have similar numbers of interactions, due to various study biases the number of reported PPIs varies widely.
(0.01 MB EPS)
The results of all evaluations performed, including networks with added PPI information and process-specific evaluations.
(0.03 MB XLS)
A list of all datasets included in the integration and their sources.
(0.05 MB DOC)
We would like to acknowledge Kara Dolinski for helpful discussions.