Designed the study: EP LB. Developed the code for the SBR and SBMR analyses: LB. Performed all statistical analyses: EP LB SRL. Performed all PCR based experiments: CMR RS SAC. Contributed reagents and materials for the study: MH MP NH TJA SAC. Wrote the paper: EP LB SR.
The authors have declared that no competing interests exist.
The majority of expression quantitative trait locus (eQTL) studies have been carried out in single tissues or cell types, using methods that ignore information shared across tissues. Although global analysis of RNA expression in multiple tissues is now feasible, few integrated statistical frameworks for joint analysis of gene expression across tissues combined with simultaneous analysis of multiple genetic variants have been developed to date. Here, we propose Sparse Bayesian Regression models for mapping eQTLs within individual tissues and simultaneously across tissues. Testing these on a set of 2,000 genes in four tissues, we demonstrate that our methods are more powerful than traditional approaches in revealing the true complexity of the eQTL landscape at the systems-level. Highlighting the power of our method, we identified a two-eQTL model (
Integrated analysis of genome-wide genetic polymorphisms and gene expression profiles from different tissues or cell types has been highly successful in identifying genes modulating complex phenotypes in animal models and humans. However, an important limitation of current approaches is that they are applied only to individual tissues, thus ignoring information shared across different tissues. To uncover complex genetic regulatory mechanisms controlling gene expression at the level of the whole organism, it is essential to develop appropriate analytical methods for the analysis of genome-wide genetic polymorphisms and gene expression profiles simultaneously in multiple tissues. This paper presents a novel, fully integrated Bayesian approach for mapping the genetic components of gene expression within and across multiple tissues. In addition to increased power and enhanced mapping resolution when compared with traditional approaches, our model directly provides information on potential systemic effects on transcriptional profiles and co-existing local (
A number of integrated transcriptional profiling and linkage mapping studies have been published to date
Identification of
Although global analysis of mRNA expression in multiple tissues is now feasible
In this paper we have implemented a new Bayesian variable selection method for multivariate mapping of single or multiple outcomes, and show an application to uncover simultaneous
We used Sparse Bayesian Regression (SBR) and Sparse Bayesian Multiple Regression (SBMR) models to identify genetic control points of gene expression, which are common across or specific to four rat tissues. To demonstrate the power of this approach, we selected a subset of 2,000 probe sets that show the highest variation in gene expression in the BXH/HXB RI strains
We first investigated the distribution of the size of the eQTL lists associated with the best SBR model visited, for the transcripts that were below the 5% FDR using Jeffreys' scale of evidence (see
Analysis | no eQTL* | 1 eQTL | 2 eQTLs | ≥3 eQTLs |
SBR in fat | 1634 (81.7%) | 311 (15.6%) | 43 (2.2%) | 12 (0.6%) |
SBR in kidney | 1627 (81.4%) | 323 (16.2%) | 38 (1.9%) | 12 (0.6%) |
SBR in adrenal | 1649 (82.5%) | 301 (15.1%) | 40 (2.0%) | 10 (0.5%) |
SBR in heart | 1607 (80.4%) | 345 (17.3%) | 39 (2.0%) | 9 (0.5%) |
SBMR in all tissues | 1469 (73.5%) | 275 (13.8%) | 93 (4.7%) | 163 (8.2%) |
*We used “no eQTL” to identify probe sets whose best model visited was the null model (i.e., no evidence of genetic control) or when the best model visited with genetic control was not significant at FDR <5% (see
For the probe sets that are under genetic control, the number of probe sets with one, two or at least three distinct eQTLs is indicated. Polygenic models (2 eQTLs and ≥3 eQTLs) are indicative of two or more distinct eQTLs (for the same probe set) that are located at least 10 cM apart. Percentages were calculated with respect to the set of 2,000 transcripts considered in this study.
The SBR outperformed both QTL Reaper and SSM approaches in detecting complex genetic regulation by two or more eQTLs. While QTL Reaper and SSM found no polygenic control in any tissue at 5% FDR, the SBR model revealed that ∼12% of the probe sets that were found to be under genetic control across tissues mapped to two or more distinct eQTLs, delineating a set of 140 polygenic expression traits (
Thresholding the Jeffreys' scale of evidence to control the FDR at 5% level, the SBMR model identified 531 transcripts (∼27% of the total) under common genetic regulation in all tissues. We showed evidence of polygenic control by two or more distinct eQTLs for a significant proportion of probe sets (13%) (
A key aspect of the SBMR approach is that it exploits additional information provided by the covariance structure between tissues to find a set of parsimonious models that jointly predict gene expression levels in all tissues. For illustration,
(A, E) For each gene, the set of markers associated with high marginal posterior probability of inclusion corresponds to the filtered best model found by SBMR, showing monogenic control for
The proposed SBMR model directly provides information on potential systemic effects of the eQTL(s). To assess the extent to which the detected common eQTLs explain the correlation in gene expression across tissues, we calculated the raw empirical correlation matrix and the posterior mean of the residual correlation matrix given the putative eQTL markers (see
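The comparison between raw and residual correlation can be illustrated with a minimal sketch. It is not the SBMR computation: ordinary least-squares residuals stand in for the posterior mean of the residual correlation matrix, and the data, the function name `raw_and_residual_correlation` and all parameter values are hypothetical.

```python
import numpy as np

def raw_and_residual_correlation(Y, X_eqtl):
    """Y: (n, q) expression of one transcript in q tissues.
    X_eqtl: (n, p) genotypes at the putative eQTL markers.
    Returns the raw and the residual (after regressing on the markers)
    correlation matrices. Least squares is an assumption of this sketch."""
    n = Y.shape[0]
    raw = np.corrcoef(Y, rowvar=False)
    design = np.column_stack([np.ones(n), X_eqtl])      # intercept + markers
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)   # one fit per tissue
    resid = Y - design @ coef
    residual = np.corrcoef(resid, rowvar=False)
    return raw, residual

# Toy data: one marker with the same effect in four tissues, independent noise.
rng = np.random.default_rng(0)
n = 500
x = rng.integers(0, 2, size=n).astype(float)
Y = 2.0 * x[:, None] + rng.normal(size=(n, 4))
raw, residual = raw_and_residual_correlation(Y, x[:, None])
```

In this toy setting the between-tissue correlation is induced entirely by the shared marker, so it largely disappears once the marker is accounted for. A residual correlation that remains high would instead point to sources of co-variation not captured by the detected eQTLs.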
We investigated whether the common eQTLs mapped within each tissue by the SBR model were detected in the SBMR analysis. Ninety-three transcripts showed genetic regulation by the same eQTL that was independently detected in all tissues by SBR (FDR <5%) (
Both SBMR and Hotelling's
In addition, we carried out a simulation study to investigate the power of our approach as compared with the Hotelling's
Log-scale Receiver Operating Characteristic (ROC) curves of SBMR (blue), Hotelling's
To validate eQTL linkages detected by microarray using Bayesian model approaches, we measured mRNA abundance in the BXH/HXB RI strains by quantitative RT-PCR (qRT-PCR) for
As a further example to highlight the power of our method when compared to other approaches, we also validated polygenic regulation for the
The filtered best model for the regulation of
We have shown that our Sparse Bayesian Regression models coupled with an efficient computational algorithm (Evolutionary Stochastic Search, ESS hereafter) provide significant advantages over other methods in eQTL mapping within and across multiple tissues. A key feature of the proposed approach is its ability to uncover polygenic regulation of gene expression, with greater power to identify secondary
We extended the SBR model to accommodate multiple phenotypic responses, such as expression profiles in multiple tissues, and showed increased power to discover pleiotropic genetic regulation of gene expression that was not detected by single-tissue analyses or other multivariate approaches. We showed that the SBMR model yielded a >5-fold increase in the number of common eQTLs when compared with the SBR model. We identified a set of 277
An additional major advantage of the SBMR approach is its ability to assess systemic genetic effects, as illustrated for the
For detection of common polygenic and
Computationally, our ESS algorithm implemented for SBR and SBMR is more efficient than other Bayesian variable selection methods since we sample just the vectors of selection indicators (see
Our approach is quite flexible and the underlying linear regression model as well as the model search could be extended to handle more complex scenarios, including human data and other genetic study designs. This versatility is currently being implemented in our software, enabling data from different sources to be analysed, for example with applications to gene expression and epigenetic profiles, or to deal with binary outcomes and quantitative predictors in a similar manner, as well as extending the search space to include epistatic interactions within the predictor subsets. One important additional benefit of our Bayesian variable selection approach is that, besides providing a best visited model with a list of eQTLs, it also addresses the inherent uncertainty in finding best predictor subsets. Looking marginally at the role of each marker, we can average over a set of well supported models to assess the overall marginal contribution of each eQTL to explain gene expression variability. Moreover, we can use the same set of models to perform further post-processing analysis, for example to focus on eQTLs with noticeable biological effects in all tissues (see
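The idea of averaging over well supported models to obtain the marginal contribution of each marker can be sketched as follows. This is an illustrative sketch under stated assumptions: the list of visited models and their unnormalised log posterior values are invented, and `marginal_inclusion` is a hypothetical helper, not part of the authors' software.

```python
import numpy as np

def marginal_inclusion(models, log_post, n_markers):
    """models: list of tuples of marker indices (one tuple per unique model).
    log_post: unnormalised log posterior of each model.
    Returns the renormalised model probabilities and, for each marker, the
    sum of the probabilities of the models that include it."""
    log_post = np.asarray(log_post, dtype=float)
    weights = np.exp(log_post - log_post.max())  # subtract max for stability
    weights /= weights.sum()                     # renormalise over visited models
    mppi = np.zeros(n_markers)
    for model, w in zip(models, weights):
        for j in model:
            mppi[j] += w
    return weights, mppi

# Hypothetical visited models: null, marker 3, markers 3 and 7, marker 7.
models = [(), (3,), (3, 7), (7,)]
log_post = [-10.0, -2.0, -1.0, -6.0]
weights, mppi = marginal_inclusion(models, log_post, n_markers=10)
```

Here marker 3 receives a higher marginal probability of inclusion than marker 7 because it appears in both of the best supported models, even though the single best model contains both markers.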
In conclusion, we have shown that the SBR and SBMR approaches have distinctive features and perform significantly better than the existing eQTL mapping methods tested. The proposed modelling approaches provide a general and powerful framework for investigating complex genetic regulatory mechanisms controlling gene expression at the systems-level.
Additional technical details on the implementation of the Bayesian model, detailed comparison between methods, illustrative examples and simulations are given in
Here we used data previously described by Petretto
To show the benefit of the proposed statistical method, in this pilot study we analyzed a subset of 2,000 probe sets from the original set of 15,923 that are common in the four tissues. In particular we chose a set of 2,000 probe sets that have the largest variation across tissues, measured as
Hotelling's
Here we use a Bayesian variable selection (BVS) approach. BVS methods for mapping multiple quantitative loci have been implemented for single trait
Besides the difference in computational schemes associated with BVS, an important extension is the simultaneous analysis of multiple traits. Banerjee
Here we report the likelihood specification for the linear regression model when multiple outcomes are taken into account as well as when a single response is considered. In the former case, the
In order to induce sparsity and find a parsimonious model which predicts the multiple outcomes using only a few predictors, we place ourselves in the Bayesian variable selection framework
From a Bayesian point of view, uncertainty about the parameters in (1) is introduced by specifying a suitable prior distribution for all the unknowns
The specification of the hypermatrix
The coefficient
The exchangeable prior on each predictor,
Bearing in mind the likelihood and the prior specification of the parameters, the joint distribution of all variables can be written as
Here we highlight the main features of the algorithm, namely Evolutionary Stochastic Search, ESS hereafter, while interested readers are referred to Bottolo, L. and Richardson S. (2010) Evolutionary Stochastic Search for Bayesian model exploration (
One of the key features of the ESS algorithm applied to SBR or SBMR models is the automatic set-up and tuning of most of the hyperparameters during the burn-in (in particular the temperature ladder in the population-based MCMC). The only discretionary setting necessary for both SBR and SBMR is the specification of
Details of the running of ESS (number of sweeps and burn-in) are given in
For each unique model visited, we define the posterior model probability as the renormalized version of the posterior probability
simulating for each probe set the null model through a reshuffle of the order of the observations;
running ESS for SBR and SBMR for the reshuffled transcripts;
calculating the Bayes Factor of the best model visited with respect to the null model;
selecting the level of the Jeffreys' scale above which the best model visited is considered decisively different from the null model, for a fixed level of the FDR.
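The four steps above can be sketched in code once the Bayes Factors are available. The sketch assumes one simple empirical FDR estimator (expected number of null exceedances divided by the number of observed exceedances) and returns the smallest cut-off that meets the target; the exact estimator is not specified here, and the function name and the log10 Bayes Factor values below are invented for illustration.

```python
import numpy as np

def bf_cutoff_for_fdr(log10_bf_obs, log10_bf_null, fdr=0.05):
    """log10_bf_obs: log10 Bayes Factor (best model vs null) per probe set.
    log10_bf_null: the same quantity after reshuffling the observations.
    Returns the smallest cut-off whose estimated FDR is <= fdr, or None."""
    obs = np.asarray(log10_bf_obs, dtype=float)
    null = np.asarray(log10_bf_null, dtype=float)
    for t in np.unique(np.concatenate([obs, null])):  # sorted candidates
        n_called = np.sum(obs > t)
        if n_called == 0:
            break
        # Scale null exceedances to the number of observed probe sets.
        expected_false = np.sum(null > t) * len(obs) / len(null)
        if expected_false / n_called <= fdr:
            return float(t)
    return None

# Hypothetical values for ten probe sets and their reshuffled counterparts.
obs = [0.1, 0.2, 0.3, 0.5, 3.0, 4.0, 5.0, 6.0, 0.4, 7.0]
null = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.2, 0.1, 0.3, 1.2]
cutoff = bf_cutoff_for_fdr(obs, null, fdr=0.05)
n_significant = int(np.sum(np.asarray(obs) > cutoff))
```

Probe sets whose best model has a log10 Bayes Factor above `cutoff` would then be declared under genetic control at the chosen FDR level.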
In an ideal situation, after the reshuffle, which weakens the genotype–phenotype association, the best model visited and the null model should coincide,
As described before for the non-Bayesian mapping, for the probe sets whose Jeffreys' scale is above the 5% FDR cut-off we investigated the position of the putative eQTLs in the best model visited and collapsed markers found within a 5 cM window, giving rise to a more easily interpretable list of genetic control points. We refer to this refined list of markers as the filtered best model. Although this was done as a post-processing step for ease of interpretation and comparison with other non-Bayesian mapping approaches, ESS takes full advantage, during the model search, of sets of non-redundant closely linked markers in order to better explain the responses' variability (see
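The collapsing of closely linked markers into control points could look like the sketch below. It assumes that a window is anchored at the first marker of each group on a chromosome, which is one possible reading of "within a 5 cM window"; the marker names, positions and the helper `collapse_markers` are hypothetical.

```python
def collapse_markers(markers, window_cm=5.0):
    """markers: list of (chromosome, position_cM, name) for one best model.
    Groups markers on the same chromosome that lie within window_cm of the
    first marker of the current group (an assumption of this sketch)."""
    groups = []
    for chrom, pos, name in sorted(markers):
        last = groups[-1] if groups else None
        if last and last["chrom"] == chrom and pos - last["start"] <= window_cm:
            last["end"] = pos
            last["markers"].append(name)
        else:
            groups.append({"chrom": chrom, "start": pos, "end": pos,
                           "markers": [name]})
    return groups

# Hypothetical best model with four markers on two chromosomes.
best_model = [("1", 10.0, "m1"), ("1", 12.5, "m2"),
              ("1", 30.0, "m3"), ("4", 11.0, "m4")]
filtered = collapse_markers(best_model)
n_control_points = len(filtered)
```

In this example the four markers collapse to three control points, since `m1` and `m2` lie 2.5 cM apart on the same chromosome.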
While the posterior density of the regression coefficients can be simulated for each predictor
for a marker
we define a marker in the filtered best model as having a noticeable effect if this fraction is larger than
we simulate the regression coefficients (effect sizes) conditionally on
Supplementary Information.
(1.58 MB PDF)
Correlation structure for the 2,000 transcripts that have the largest variation across tissues. Only 18 probe set pairs with Pearson's correlation above 0.5 are common to the four tissues, while 102,932, 134,690, 82,508 and 161,341 probe set pairs have Pearson's correlation above 0.5 in adrenal, fat, heart and kidney, respectively. This shows that the increase in positive pairwise Pearson's correlation does not involve the same set of transcripts in the four tissues.
(0.82 MB TIF)
Overview of the Sparse Bayesian Regression (SBR) and Sparse Bayesian Multiple Regression (SBMR) approaches. In the SBR, mRNA levels ($y_{gh}$, with $g$ for the $g$th probe set and $h$ for the $h$th tissue, respectively) are modelled at the level of each tissue, $y_{gh} \sim N_n(X\beta, \sigma^2)$, and the resulting eQTL lists are then compared to find common eQTLs across tissues. In the SBMR approach, mRNA levels of the same transcript measured in four tissues ($Y_g = [y_{g1}, y_{g2}, y_{g3}, y_{g4}]$) are modelled jointly, $Y_g - XB \sim N(I_n, \Sigma)$, and mapped to the genome to identify pleiotropic genetic control points of gene expression in all tissues. In the multiple tissues analysis the search for a set of markers that jointly predict the level of gene expression is complicated by the fact that marginally each tissue can potentially be associated with a different group of covariates (mainly
(3.00 MB TIF)
Distribution of log10 Bayes Factor for the best model visited for each transcript (y-axes) versus the number of distinct control points (x-axes) identified in each model after merging closely linked markers (see
(3.00 MB TIF)
Genome-wide eQTL linkage results for
(3.25 MB TIF)
Marginal posterior probability of inclusion obtained from the SBMR and from the SBR analysis within individual tissues. We report the marginal posterior probability for all models visited (top panels) and for the filtered models (bottom panels) whose log10 Bayes Factor is above the selected cut-off (see
(3.25 MB TIF)
Validation of microarray gene expression linkages by RT-PCR. We replicated
(3.00 MB TIF)
Validation of small-effect
(3.00 MB TIF)
Summary statistics of heritability of mRNA levels for the 2,000 transcripts considered in this study.
(0.03 MB DOC)
Number of probe sets found to be under genetic control in the SBR and SBMR analyses (FDR 1% and 0.5%).
(0.05 MB DOC)
Comparison between SBR, SSM and QTL Reaper results.
(0.06 MB DOC)
Polygenic models that have been detected in at least one tissue by the SBR model (FDR <5%).
(0.10 MB PDF)
eQTLs that were detected in common to all tissues by the SBR model (FDR <5%).
(0.09 MB PDF)
Cis-regulated transcripts found by both SBMR and the Hotelling's T2-test at 5% FDR.
(0.09 MB PDF)