The authors have declared that no competing interests exist.
Conceived and designed the experiments: Y. Guan, M. Burmeister, C. Bult, M. Hibbs, O. Troyanskaya. Performed the experiments: M. Burmeister, J. Schimenti, M. Handel. Analyzed the data: Y. Guan, M. Burmeister, M. Hibbs, O. Troyanskaya. Wrote the paper: Y. Guan, M. Burmeister, C. Bult, M. Hibbs, O. Troyanskaya. Developed web interface for network comparative viewing: D. Gorenshteyn, A. Wong.
Integrated analyses of functional genomics data have enormous potential for identifying phenotype-associated genes. Tissue-specificity is an important aspect of many genetic diseases, reflecting the potentially different roles of proteins and pathways in diverse cell lineages. Accounting for tissue specificity in global integration of functional genomics data is challenging, as “functionality” and “functional relationships” are often not resolved for specific tissue types. We address this challenge by generating tissue-specific functional networks, which can effectively represent the diversity of protein function for more accurate identification of phenotype-associated genes in the laboratory mouse. Specifically, we created 107 tissue-specific functional relationship networks through integration of genomic data utilizing knowledge of tissue-specific gene expression patterns. Cross-network comparison revealed significantly changed genes enriched for functions related to specific tissue development. We then utilized these tissue-specific networks to predict genes associated with different phenotypes. Our results demonstrate that prediction performance is significantly improved through using the tissue-specific networks as compared to the global functional network. We used a testis-specific functional relationship network to predict genes associated with male fertility and spermatogenesis phenotypes, and experimentally confirmed one top prediction,
Tissue specificity is an important aspect of many genetic diseases, reflecting the potentially different roles of proteins and pathways in diverse cell lineages. We propose an effective strategy to model tissue-specific functional relationship networks in the laboratory mouse. We integrated large scale genomics datasets as well as low-throughput tissue-specific expression profiles to estimate the probability that two proteins are co-functioning in the tissue under study. These networks can accurately reflect the diversity of protein functions across different organs and tissue compartments. By computationally exploring the tissue-specific networks, we can accurately predict novel phenotype-related gene candidates. We experimentally confirmed a top candidate gene,
Phenotypes caused by mutations in genes often show tissue-specific pathology, despite organism-wide presence of the same mutation
Functional relationship networks, representing the likelihood that two proteins participate in the same biological process, provide invaluable information for phenotype gene discovery, pathway analysis, and drug discovery
Current approaches to create functional relationship networks are difficult to apply in a tissue-specific manner. Typically, networks are constructed by integrating data sources that vary in terms of measurement accuracy as well as biological relevance for predicting protein functions. Machine learning methods, such as Bayesian networks, learn the relative accuracy and relevance of datasets when given a ‘gold standard’ training set, which consists of gene pairs that are known to work in the same biological process. Then probabilistic models are constructed to weigh and integrate diverse datasets based on how accurately they recover the ‘gold standard’ set. The networks generated by this approach lack tissue-specificity information, because systematic collections of large-scale data or ‘gold standard’ pairs with quantitative tissue-specific information are often not available.
Here, we address the tissue-specificity challenge by simulating the natural biological mechanism that defines tissue-specificity: co-functionality in most cases would require the presence of both proteins in the same tissue. Inspired by our previous efforts to establish biological process-specific networks, such as networks specifically related to the cell cycle or to mitochondrial biogenesis
In addition to generating the first tissue-specific networks for the laboratory mouse, we also explicitly tested the potential of using such networks to predict phenotype-associated genes. To do so, we mapped diverse phenotypes to their respective tissues in the laboratory mouse, according to the terminology and description of the phenotypes. We show, through computational analyses and through experimentally confirmed predictions of novel fertility-related genes and visualization of their local networks, that tissue-specific functional relationship networks improve prediction accuracy for phenotype-associated genes compared to a single global functional relationship network. We further identified candidate genes specifically predicted by the cerebellum network to be related to ataxia, which are supported by both literature and experimental evidence. Our networks are publicly available at
In this study, we develop and apply a novel algorithm that generates tissue-specific functional networks in the laboratory mouse by integrating diverse functional genomic data, and we demonstrate that our tissue-specific networks are more accurate in predicting phenotype-related genes than a single global functional network. In the following sections, we first outline the strategy used to generate tissue-specific networks by interrogating gene expression profiles across tissues and integrating different data sources using Bayesian statistics. Second, we describe a cross-network comparison metric for identifying genes whose connectivity changes significantly across networks; these genes are enriched in tissue specification and development functions. Third, we quantitatively demonstrate that combining our tissue-specific networks with a state-of-the-art machine learning algorithm can produce improved predictions of genotype-phenotype relationships compared to previous single global networks
A common mechanism resulting in tissue-specific protein functionality is the modulation of gene expression levels between tissues
Diverse functional genomic datasets such as expression, protein-protein interactions and phenotype information were integrated in a Bayesian framework to generate tissue-specific networks. Input datasets were probabilistically “weighted” based on how informative they were in reflecting known co-functional proteins that are both expressed in a given tissue. To account for overlap in information in multiple datasets (especially the large number of gene expression microarray datasets), mutual information-based regularization was used to down-weight datasets showing significant overlap with each other. These networks were then used as input into a Support Vector Machine classifier to predict phenotype related genes. Finally, we implemented a web interface that allows network comparison between tissues.
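The integration step can be illustrated with the textbook odds form of a naive Bayes classifier. This is a minimal sketch, not the study's own formulas (which are not reproduced in this text): the function name `posterior_functional`, the prior, and the per-dataset likelihood ratios are all hypothetical, and the mutual information-based regularization is deliberately left out.

```python
from math import prod


def posterior_functional(prior, likelihood_ratios):
    """Posterior probability of a functional relationship (FR) for one gene pair.

    prior: P(FR) before any dataset is consulted (hypothetical value).
    likelihood_ratios: one entry per dataset, P(e_k | FR) / P(e_k | not FR),
        where e_k is the evidence that dataset k reports for this pair.
    Naive Bayes assumes the datasets are conditionally independent,
    so their likelihood ratios simply multiply.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


# Three hypothetical datasets: two support the relationship, one argues against it.
print(posterior_functional(0.1, [3.0, 2.0, 0.5]))
```

In a tissue-specific setting, the likelihood ratios would be learned against the gold standard for that tissue, so the same dataset can carry a different weight in each network.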
In the global (non-tissue-specific) sense, following previous definitions
For tissue-specific expression information, our gold standards rely on the Gene Expression Database (GXD) of the Mouse Genome Informatics group (MGI). GXD provides an extensive, hierarchically structured dictionary of anatomical expression results for mouse to allow us to carry out our analysis
We pursue two main goals in this study: First, we generate tissue-specific networks that synthesize as much data as possible and provide these networks to the public through an online visualization interface at
One key application of tissue-specific networks is to identify novel genes and relationships between genes that may be specific to a particular tissue. To computationally evaluate our ability to identify novel relationships, we used cross-validation to test whether our tissue-specific Bayesian scheme is more accurate than the global network. Cross-validation was used to assess predictions by evaluating the accuracy of recovering subsets of known annotations withheld during the training process. Specifically, we performed 3-fold cross-validation, by holding out one third of the tissue-specific ‘gold standard’ pairs in each of the three iterations. We learned the parameters in the Bayesian networks,
Compared to a single global functional relationship network, our approach significantly improved our ability to predict tissue-specific functional linkages. The mean AUC (area under the receiver operating characteristic curve, which represents the accuracy in recovering tissue-specific functional relationships) for the global network estimated through three-fold cross-validation was 0.68. Tissue-specific networks achieved a median AUC of 0.72. Measured against a random baseline AUC of 0.5, this represents a ∼20% improvement of the tissue-specific networks over the predictive power of the global network. This improvement is consistent over all 12 major organ systems defined by GXD
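The improvement figure is measured relative to the margin above random performance rather than on the raw AUC values. A short sketch of that arithmetic follows, assuming the improvement is defined as the ratio of the two margins over the 0.5 baseline; the helper name `gain_over_baseline` is illustrative only.

```python
def gain_over_baseline(auc_new, auc_ref, baseline=0.5):
    """Relative gain of auc_new over auc_ref, both measured above a random baseline."""
    return (auc_new - baseline) / (auc_ref - baseline) - 1.0


# Tissue-specific (0.72) versus global (0.68) AUC, as reported in the text.
print(round(gain_over_baseline(0.72, 0.68), 3))  # 0.222, i.e. roughly 20%
```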
One important application of our tissue-specific networks is to identify functional relationships between genes that change significantly across tissues. This provides a platform for analyzing tissue-specific molecular interactions, as well as tissue-specific roles for genes that are ubiquitously expressed but play different roles in different tissues. For example, Wnt10b (wingless related MMTV integration site 10b) is expressed in many tissues throughout development and participates in many biological processes including bone trabecular formation
In A, blue-highlighted genes are directly involved in skeletal muscle development. In B, blue-highlighted genes are involved in bone mineralization or bone structure formation. The enrichment of genes involved in the above processes reflects the differential roles of Wnt10b in skeletal muscle and bone.
To quantify gene connectivity changes across networks, we developed a metric that captures how much the edges involving a gene differ across networks (see
GO:0048856 | anatomical structure development | 3.40E−09 |
GO:0007417 | central nervous system development | 4.27E−09 |
GO:0048513 | organ development | 2.61E−08 |
GO:0021536 | diencephalon development | 5.25E−07 |
GO:0021984 | adenohypophysis development | 8.94E−07 |
GO:0007420 | brain development | 2.18E−06 |
GO:0048732 | gland development | 1.48E−05 |
GO:0032502 | developmental process | 1.62E−05 |
GO:0030900 | forebrain development | 2.10E−05 |
GO:0007399 | nervous system development | 2.27E−05 |
A key hypothesis in this study is that analyzing tissue-specific networks may improve our ability to identify phenotype-related genes. To test this hypothesis, we regenerated tissue-specific networks using the same Bayesian approach as above, but excluded all phenotype and disease data as inputs to avoid circularity in our cross-validations. Then, we mapped 451 phenotypes to their most related tissue in the laboratory mouse according to the terminology and description of these phenotypes in the Mammalian Phenotype ontology
To test whether our tissue-specific networks are more capable of identifying phenotype-associated genes than the global network, we used bootstrap bagging
By mapping phenotypes to different tissues according to their terminology and description, we are able to compare the performance of tissue-specific networks and the global network in predicting phenotype-related genes. Candle-stick plots (minimum, 25%, median, 75% and maximum) show the distribution of percentage AUC improvement when predicting phenotype-related genes.
Performance improvements were consistent across phenotypes of different sizes (
Performance improvements were also consistent across different major organ systems. Phenotypes involved in the endo/exocrine system achieved the largest improvement in AUC (+35%, compared to global networks against a baseline of 0.5), and those in the cardiovascular system achieved a 21.8% improvement in AUC. Prediction accuracy was improved across all major systems, with the smallest improvement, 5.9%, in renal/urinary phenotypes. Phenotypes related to musculoskeletal systems achieved the highest AUC of 0.82; the group with the lowest AUC was the digestive system, which still achieved an average of 0.78. The consistency in improvements across different organ systems demonstrates the robustness of our modeling framework to predict phenotype-related genes in a tissue-specific manner.
We focused on two cases to illustrate how our tissue-specific networks can facilitate disease gene discovery. These two phenotypes represent two extremes of the phenotype/disease-associated gene prediction problem. The first, reduced male fertility, is a broadly defined, common phenotype with many causative genes already known. The second, ataxia, is a rare neurological disorder affecting ∼3–10/100,000 of the general population
First, we used male fertility related phenotypes to test the performance of tissue-specific networks to predict phenotype-related genes. To do so, we utilized a recent, nearly comprehensive literature review of genes involved in mammalian spermatogenesis and male fertility phenotypes
We selected
In addition to the well-studied phenotype of male infertility, we also examined a less well-understood disease, ataxia, to investigate whether our tissue-specific networks can identify genes related to phenotypes or diseases with limited prior knowledge. Gene identification through genetic approaches, such as pedigree analyses, has had a major impact on our understanding of ataxia (over 40 candidate genes identified so far). Genetic testing is now an integral part of assessment: routinely, a blood sample from any new ataxia case is mailed in for laboratory evaluation. However, the majority of sporadic cases, as well as many familial cases, remain unexplained. We curated the list of known genes (43 in total) related to human ataxia, mapped these genes to their mouse orthologs, and used this list as seeds to predict additional candidate genes using the network specific to the cerebellum, the major tissue affected by ataxia.
Our cerebellum-specific network reveals connections of ataxia-related genes not shown in the global network. A key, known ataxia gene is
Edges with weight greater than 0.9 are shown. In the cerebellum network (
In addition to identifying these novel, likely correct edges, we also identified novel candidates using our SVM-based approach described above. Out of our top 10 novel candidates, we found strong evidence in the literature for 4 of these genes to be associated to ataxia (
Gene | Rank in cerebellum network | Rank in global network | Evidence |
PDK2 | 1 | 71 | None |
RBFOX1 | 2 | 9 | Physical interaction with ATXN2 |
HLF | 3 | 302 | None |
APBB1 | 4 | 241 | None |
PLCB4 | 5 | 84 | Double knockout confirmed in mice |
LRRC2 | 6 | 1778 | None |
TXLNB | 7 | 356 | None |
SORBS1 | 8 | 743 | Physical interaction with ATXN7 |
CYP2D6 | 9 | 476 | None |
PLP1 | 10 | 87 | homologue of PMP22, implicated in ataxia-related Spastic paraplegia-2 and Pelizaeus-Merzbacher disease |
Genetic diseases often manifest tissue-specific pathologies
Our approach addresses the twin challenges of incomplete systematic knowledge of tissue-specific protein functions and of limited availability and coverage of tissue-specific high-throughput functional data. Due to this lack of systematically defined tissue-specific genomic data, our approach uses highly reliable, low-throughput measures of gene expression to constrain our gold standard examples into tissue-specific sets. As more tissue-specific protein functions are defined systematically, perhaps with the help of hypotheses generated by approaches such as this, tissue-specific functional interactions will be directly used for experimental testing. Many genomic datasets, especially physical interaction studies, such as yeast 2-hybrid screens, and large-scale genetic screens, utilize artificial or
While our current study focuses on predicting genotype-phenotype associations using tissue-specific functional relationship networks, the potential application of tissue-specific networks extends far beyond predicting phenotype-associated genes. For example, just as perturbations of the same gene may lead to different phenotypic outcomes across different tissues, treatments with bioactive chemicals or drugs may manifest differential effects across different tissues. Our broad conceptual framework of utilizing tissue-specific expression to refine a global network could be extended to application domains such as drug target identification.
Our framework to generate both global and tissue-specific functional networks is based on naïve Bayesian network data integration coupled with mutual information-based regularization. Genomic datasets are weighted differentially based on how well they recover gold standard positive pairs in either global or tissue-specific training sets (see the section below for construction of tissue-specific gold standards). Specifically, we computed the posterior probability of a functional relationship given all available evidence following the scheme in
The assumption of conditional independence is a major factor penalizing the performance of naïve Bayesian integration, given that many biological datasets share information. We used multiple strategies to minimize the impact of information overlap based on the nature of the data integrated. For physical and genetic interaction datasets, we combined different data sources as described in the next section. For microarray gene expression data, which are the major data sources contributing to overlapping information, we regularize the contribution of each microarray dataset according to:
To learn the parameters in this Bayesian framework, we first established a gold standard that approximates a true set of functionally related proteins. We obtained Gene Ontology (GO) biological process branch annotations from the Mouse Genome Informatics database (MGI)
For each tissue-specific Bayesian framework, we created a tissue-specific gold standard restricted to protein pairs that are both expressed in the tissue of interest, which approximates a set of functionally related proteins within a specific tissue context. We utilized the mouse Gene eXpression Database (GXD)
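The restriction of the gold standard to a tissue can be sketched as a simple filter. This is an illustration under stated assumptions: gene pairs are represented as `frozenset` objects, the set of genes expressed in the tissue is taken as given, and the function name `tissue_gold_standard` is hypothetical. How negative examples are handled is not covered here.

```python
def tissue_gold_standard(positive_pairs, expressed_in_tissue):
    """Keep only co-functional pairs whose two genes are both expressed in the tissue.

    positive_pairs: iterable of frozensets, each holding two gene identifiers.
    expressed_in_tissue: set of gene identifiers with expression evidence in the tissue.
    """
    return {
        pair
        for pair in positive_pairs
        if all(gene in expressed_in_tissue for gene in pair)
    }


# Toy example with placeholder gene names.
global_pairs = {frozenset({"GeneA", "GeneB"}), frozenset({"GeneA", "GeneC"})}
print(tissue_gold_standard(global_pairs, {"GeneA", "GeneB"}))
```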
We collected diverse functional genomics data to use as input for the integration. All data used in phenotype analysis were acquired as of Jan 2011. All data were processed into pair-wise similarity scores
Protein physical interactions: We acquired protein-protein physical interaction data from MiMI (Michigan Molecular Interactions)
Expression data: To utilize the signals represented by diverse microarray data, we acquired mouse microarray datasets from GEO (977 datasets, 960 of which have three or more samples, totaling 13632 arrays)
Homologous functional relationship predictions: Previous analysis indicates that homologous functional relationships in simpler model organisms are a good indicator of functional relationship in higher model organisms
Phenotype and disease: We acquired data from MGI
Phenotype and disease data are included in the networks displayed on our web interface, but were excluded from the networks used to predict phenotype-related genes to prevent circularity.
The above data are integrated together using the Bayesian framework (formulas 1–4) to generate both global and tissue-specific networks. The evaluation of each of the input datasets against each tissue-specific gold standard is included in
We used three-fold cross-validation to evaluate the performance of our tissue-specific networks and the global network. Each gold standard set was randomly partitioned into three subsamples. Each subsample was retained as a validation set while the other two were used for training the Bayesian networks. The AUC of the tissue-specific networks in recovering the held-out set was compared against that of the global network.
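The evaluation loop can be sketched in plain Python. This is an illustrative outline rather than the study's code: the fold assignment, the seed, and both function names are assumptions, and the training of the Bayesian networks on each training split is omitted.

```python
import random


def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def three_fold_splits(items, seed=0):
    """Yield (train, held_out) lists; every item is held out exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::3] for i in range(3)]
    for i in range(3):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


# Two positives and two negatives; three of the four pairs are ordered correctly.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

For each split, a network would be trained on `train`, used to score the pairs in `held_out`, and the resulting AUC compared with that of the global network on the same held-out pairs.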
Phenotypes were mapped to tissues based on sub-string matches between phenotype and tissue descriptions. For example, the phenotype thyroid gland hyperplasia (MP:0003498) can be mapped to the tissue thyroid gland (MA:0000129). This resulted in 451 phenotype-tissue matches. In the rare case where multiple tissues mapped to a single phenotype, the network with the highest cross-validation performance was selected for evaluation.
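A minimal sketch of this matching step is given here, assuming that a match means the tissue name occurs as a case-insensitive sub-string of the phenotype name; the text does not specify the direction or the normalization, so both are assumptions. The function name and the second, unmatched phenotype entry are hypothetical.

```python
def map_phenotypes_to_tissues(phenotypes, tissues):
    """Map each phenotype ID to the tissue IDs whose name occurs in the phenotype name.

    phenotypes, tissues: dicts of identifier -> descriptive name.
    Phenotypes without any matching tissue are omitted from the result.
    """
    mapping = {}
    for p_id, p_name in phenotypes.items():
        hits = [
            t_id
            for t_id, t_name in tissues.items()
            if t_name.lower() in p_name.lower()
        ]
        if hits:
            mapping[p_id] = hits
    return mapping


phenotypes = {
    "MP:0003498": "thyroid gland hyperplasia",
    "PHENO:example": "abnormal gait",  # hypothetical entry with no tissue match
}
tissues = {"MA:0000129": "thyroid gland"}
print(map_phenotypes_to_tissues(phenotypes, tissues))
```

When a phenotype matches several tissues, the returned list has more than one entry, and a tie-breaking rule such as the cross-validation criterion described above would be applied afterwards.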
We downloaded the mammalian phenotype (MP) ontology and annotations for mouse from MGI on May 4, 2011, including 196190 entries for 13438 genes in total. All annotations were propagated along the ontology hierarchy. If any allele of a gene was annotated to the phenotype under consideration or to a descendant of this phenotype term, we associated that gene with this phenotype. We then adopted the network-based candidate gene prediction scheme from
We applied a unified scheme for evaluation and prediction based on bootstrapping, where examples were randomly sampled with replacement (0.632 bootstrap, that is, the expected fraction of selected data points is 0.632)
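The 0.632 figure follows from sampling n examples with replacement: each example is missed with probability (1 - 1/n)^n, which approaches 1/e, so about 63.2% of distinct examples are selected. The sketch below illustrates this. Using the unsampled (out-of-bag) examples for evaluation is an assumption that the text here does not state, and both function names are hypothetical.

```python
import random


def expected_unique_fraction(n):
    """Expected fraction of distinct examples drawn in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def bootstrap_split(n, rng):
    """Draw n indices with replacement; indices never drawn form the out-of-bag set."""
    sampled = [rng.randrange(n) for _ in range(n)]
    in_bag = set(sampled)
    out_of_bag = [i for i in range(n) if i not in in_bag]
    return sampled, out_of_bag


print(round(expected_unique_fraction(1000), 3))  # 0.632
sampled, out_of_bag = bootstrap_split(1000, random.Random(0))
print(len(set(sampled)) / 1000)  # close to 0.632 for any seed
```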
For each pair of networks, we quantify how much each gene has changed in the network relative to its neighbors. Suppose in network
To examine the role of Myb1 in spermatogenesis, as described in detail elsewhere
We curated 43 genes causing ataxia that have been confirmed in human pedigree studies. These 43 genes were mapped to mouse one-to-one orthologs using the orthology defined by MGI
To allow dynamic visualization and cross-network comparison of our integration results, we developed the mouseMAP software (
Our public, web-based system features cross-comparison of different networks that highlights connections in the newly queried network
List of genomics datasets used in integration.
(XLS)
Precision-recall figures for each tissue-specific network (red) versus the global (blue) network.
(ZIP)
Functional enrichment of top 100 changed genes for each tissue-specific network.
(ZIP)
Expert-created ontology of spermatogenesis-related phenotype terms.
(XLSX)
The AUC of each expression dataset evaluated against each tissue-specific gold standard.
(ZIP)