A Bayesian method for detecting pairwise associations in compositional data

Emma Schwager; Himel Mallick; Steffen Ventz; Curtis Huttenhower

doi:10.1371/journal.pcbi.1005852

Abstract

Compositional data consist of vectors of proportions normalized to a constant sum from a basis of unobserved counts. The sum constraint makes inference on correlations between unconstrained features challenging due to the information loss from normalization. However, such correlations are of long-standing interest in fields including ecology. We propose a novel Bayesian framework (BAnOCC: Bayesian Analysis of Compositional Covariance) to estimate a sparse precision matrix through a LASSO prior. The resulting posterior, generated by MCMC sampling, allows uncertainty quantification of any function of the precision matrix, including the correlation matrix. We also use a first-order Taylor expansion to approximate the transformation from the unobserved counts to the composition in order to investigate what characteristics of the unobserved counts can make the correlations more or less difficult to infer. On simulated datasets, we show that BAnOCC infers the true network as well as previous methods while offering the advantage of posterior inference. Larger and more realistic simulated datasets further showed that BAnOCC performs well as measured by type I and type II error rates. Finally, we apply BAnOCC to a microbial ecology dataset from the Human Microbiome Project, which in addition to reproducing established ecological results revealed unique, competition-based roles for Proteobacteria in multiple distinct habitats.

Author summary

Data from many fields are available primarily in the form of proportions, also referred to as compositions, which impose mathematical constraints on identifying interactions among components in the underlying systems. In particular, correlations cannot be calculated directly from proportions or from count data that give rise to them. Methods that work around this difficulty generally do so by imposing strong assumptions about the distribution of underlying data or associated correlations, and these in turn often prevent quantifying uncertainty in the resulting estimates of correlation. We developed a statistical model (BAnOCC: Bayesian Analysis of Compositional Covariance) that both estimates correlations between counts or proportions and provides a posterior distribution for each correlation that quantifies how uncertain the estimate is. BAnOCC does well at controlling the number of false positives in simulated data and can be practically applied to a wide range of proportional data types.

Citation: Schwager E, Mallick H, Ventz S, Huttenhower C (2017) A Bayesian method for detecting pairwise associations in compositional data. PLoS Comput Biol 13(11): e1005852. https://doi.org/10.1371/journal.pcbi.1005852

Editor: Ran Blekhman, University of Minnesota, UNITED STATES

Received: May 24, 2017; Accepted: October 25, 2017; Published: November 15, 2017

Copyright: © 2017 Schwager et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: This work was supported by the National Institutes of Health grants U54DE023798 (CH) and R01HG005220 (CH), National Science Foundation grants ATD-1042785 (SV) and DBI-1053486 (CH), and Army Research Office grant W911NF-11-1-0473 (CH). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

A long-standing goal of applied statistics in many fields has been identifying features associated significantly by a measure such as correlation [1,2]. When the features to be associated form a composition, inference of the correlation matrix is subject to the well-known problem of spurious correlation [3–6]. Compositional data in particular are vectors of proportions that sum to a fixed constant (typically one); they are usually thought of as the result of sum-normalizing an unobserved (or unrecorded) and unconstrained basis, following the terminology of [6]. The resulting sum-constraint of the compositional data means that any pairwise correlation measured using such data can be non-zero even if all the pairwise correlations on the unobserved count scale are zero, a phenomenon called spurious correlation [3]. The fact that all the features sum to one also makes the correlation matrix on the unobserved counts (that is, the basis correlation matrix) non-identifiable without untestable, though perhaps not unreasonable, assumptions [7–10]. Any method thus offers at best a partial reconstruction of the unobserved count correlation matrix, and the interest in characterizing such correlations in fields from geology to ecology has led to a variety of approaches.

In the context of microbial ecology, several methods have been proposed to identify significant ecological relationships from compositions; virtually all rely on some form of sparsity assumption and infer quantities relating to the log-transformed unobserved counts (hereafter referred to as the log-basis). The only technique that does not rely on a sparsity assumption is ReBoot [7], which estimates a “compositionally-corrected” correlation matrix using a permutation-based method. Friedman and Alm [8] proposed SparCC, which estimates the log-basis correlation matrix under the assumption that the correlations are on average small in magnitude. Fang et al. [9] noted that the resulting estimate is guaranteed neither to be positive definite nor to have elements inside [−1, 1], and proposed CCLasso to estimate the log-basis correlation matrix using a LASSO penalty on the off-diagonal elements of the variance-covariance matrix. Ban et al. [10] similarly proposed REBACCA to estimate the log-basis correlation matrix; they use the same LASSO penalty function but a different likelihood function. Kurtz et al. [11] proposed SPIEC-EASI to estimate the log-basis precision matrix when the number of features is large by using sparse graph estimation techniques.

These approaches have difficulty quantifying uncertainty in the estimates, cannot incorporate uncertainty from the choice of tuning parameter, and are not flexible in the quantities they estimate. Friedman and Alm [8] proposed an inferential procedure based on the bootstrap, but offered no theoretical justification. Fang et al. [9] and Kurtz et al. [11] focused solely on estimation, while Ban et al. [10] used a subsampling method from Shah and Samworth [12] to stabilize the selection error rate. The LASSO-based methods [9–11] typically choose a shrinkage parameter and subsequently infer the log-basis covariance or precision matrix. Friedman and Alm [8], Fang et al. [9], and Ban et al. [10] all use the log-basis covariance matrix for network construction, while Kurtz et al. [11] use the log-basis precision matrix. This means that investigators typically must choose whether a precision or correlation matrix is best, and often use the resulting estimate with little guidance as to its uncertainty.

We address these issues by providing a flexible, fully Bayesian approach to identify correlations in compositional data. It is able to quantify uncertainty through the associated posterior and estimates both the log-basis correlation and precision matrix by modeling the composition directly. The graphical LASSO prior of [13] is used to estimate a sparse log-basis precision matrix (and hence a sparse log-basis correlation matrix) through a LASSO penalty, mitigating the non-identifiable nature of the unobserved count correlation matrix. We have implemented the resulting method as BAnOCC (Bayesian Analysis of Compositional Covariance). In this study, we also use a first-order Taylor expansion to approximate the compositional covariance as a function of the mean and variance of the unobserved counts. While not necessary to the development of our method, this expansion helps us explore the situations in which a naïve approach (ignoring the sum-constraint) might work. This approximation shows not only that the spurious correlation between two features can take any value in [−1,1] even if none of the features are correlated on the unobserved count scale, but also that both the variances and means of the unobserved counts control the magnitude and direction of the spurious correlation. Thus, we provide a novel characterization of the surprisingly broad circumstances under which compositionality can impede straightforward identification of the correlation matrix, and we provide the BAnOCC model to overcome this in datasets where it is possible.

Methods

Per-subject basis (unobserved and unconstrained count) and composition notation

The model assumes that a single subject’s composition, C_i = (C_i,1,…,C_i,p)^T, is generated by the normalization of that subject’s unobserved and unconstrained counts, X_i = (X_i,1,…,X_i,p)^T. That is, C_i,j = X_i,j / Σ_k X_i,k for j = 1,…,p. We also assume that the unobserved counts for all subjects are independent and identically distributed (iid); this implies that the compositions are iid as well because the transformation is per-subject.
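The normalization step can be illustrated with a minimal Python sketch. The function name `to_composition` and the example counts are hypothetical, chosen only for illustration; they are not part of the BAnOCC software.

```python
def to_composition(counts):
    """Sum-normalize one subject's positive counts X_i into proportions C_i."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive total")
    return [x / total for x in counts]


# Two hypothetical subjects whose counts differ only by a scale factor.
x_a = [10.0, 30.0, 60.0]
x_b = [1.0, 3.0, 6.0]
c_a = to_composition(x_a)
c_b = to_composition(x_b)
```

Both subjects map to the same composition, which is a small instance of the information lost when the totals are discarded.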

Feature correlations and covariances in composition and unobserved counts

We also introduce notation for the covariance and correlation among the features. The covariance matrix of the unobserved counts is denoted by Σ_X = [σ_X,jk], to be inferred from C₁,…,C_n. Similarly, the covariance matrix of the composition is denoted by Σ_C = [σ_C,jk]. To construct the network of feature interactions, the relevant null hypotheses (one for each feature pair j and k) are that features j and k have a covariance of zero (σ_X,jk = 0); this is equivalent to testing if they are uncorrelated (ρ_X,jk = 0). We then define the unobserved count and compositional correlation matrices as R_X = [ρ_X,jk] and R_C = [ρ_C,jk], respectively.

BAnOCC: Bayesian analysis of compositional covariance

BAnOCC assumes that the unobserved counts follow a log-normal distribution and that their correlation matrix is sparse; it is parametrized with the log-basis precision matrix and the log-basis mean (Fig 1). Posteriors for the parameters of the model (and thus functions of them which are of interest) are inferred using MCMC sampling. This fully Bayesian treatment of the problem gives several advantages: a full posterior distribution to quantify the uncertainty in the estimates, the ability to place a prior on the sparsity parameter, and estimates of any function of the log-basis precision matrix, including the log-basis covariance and correlation matrices.

Fig 1. BAnOCC infers log-basis correlation and precision matrices from compositions by modeling unobserved and unconstrained counts.

In the BAnOCC model, the observed compositions, C_i, are derived by normalizing the unobserved counts X_i. The BAnOCC model assumes that the X_i follow a log-normal distribution, parametrized by the log-basis mean m and covariance S. It places a normal prior on m, the GLASSO prior of [13] on the log-basis precision matrix O, and a hyperprior on the GLASSO shrinkage parameter λ (see Methods).

https://doi.org/10.1371/journal.pcbi.1005852.g001

BAnOCC models the unobserved and unconstrained counts using a log-normal distribution with parameters based on the moments of the log-basis: X ~ LN(m, S), such that m = E[log{X}] and S = Var(log{X}). This continuous approximation of the underlying unobserved count data is expected to perform well when the underlying counts have a large dynamic range. In ecology, for example, the log-normal distribution is used to model the (discrete) abundance across species [14,15]. In microbial ecology specifically, the logistic normal is sometimes assumed to be the generating distribution of the composition [10,11]; further, the (discrete) read counts are often simulated using a log-normal distribution [16,17]. The log-normal distribution also allows the totals to be easily integrated out of the likelihood.

Parametrization of the likelihood

The likelihood is parametrized by the log-basis precision matrix O = S⁻¹ and the log-basis mean m, and other parameters of interest like the log-basis covariance matrix S are sampled as transformations of these. By parametrizing using O, we are able to leverage a graphical LASSO prior to enforce sparsity on O and by extension S. Conveniently, the assumption of the log-normal distribution obviates the need to sample the covariance of the unobserved counts to determine the existence and direction of an association between two features on the unobserved count scale. This results because when some element of S, s_jk, is zero, then the corresponding element of Σ_X will also be zero; further, the non-zero elements of S and Σ_X will have the same sign (though not the same magnitude).
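The sign relationship between the log-basis covariance and the count-scale covariance follows from the standard multivariate log-normal moment formula, Cov(X_j, X_k) = exp{m_j + m_k + (s_jj + s_kk)/2}(exp{s_jk} − 1). The sketch below evaluates that formula; the function name and the numeric values are illustrative assumptions.

```python
import math


def lognormal_cov(m_j, m_k, s_jj, s_kk, s_jk):
    """Covariance of X_j and X_k when log X is multivariate normal.

    m_j, m_k are log-basis means, s_jj and s_kk log-basis variances,
    and s_jk the log-basis covariance.
    """
    return math.exp(m_j + m_k + 0.5 * (s_jj + s_kk)) * (math.exp(s_jk) - 1.0)


# Hypothetical values: only the log-basis covariance s_jk changes.
cov_zero = lognormal_cov(1.0, 2.0, 0.5, 0.8, 0.0)
cov_pos = lognormal_cov(1.0, 2.0, 0.5, 0.8, 0.3)
cov_neg = lognormal_cov(1.0, 2.0, 0.5, 0.8, -0.3)
```

Because the leading exponential factor is always positive, the sign of the count-scale covariance is the sign of exp{s_jk} − 1, which is the sign of s_jk.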

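Under the log-normal assumption, the complete likelihood of the observed composition c_i and the latent total t_i is given by

p(c_i, t_i | m, O) = (2π)^(−p/2) |O|^(1/2) t_i^(−1) (∏_j c_i,j)^(−1) exp{−(1/2) (log c_i + 1 log t_i − m)^T O (log c_i + 1 log t_i − m)} (1)

where t_i = Σ_j x_i,j, so that x_i = t_i c_i. A detailed derivation can be found in S1 Text. Fitting this likelihood directly is computationally expensive, as the presence of the latent totals necessitates exploring a space whose dimension depends on both n and p. However, (1) factors into two portions: a part dependent on the compositions c_i, and the kernel of a log-normal distribution for the totals with parameters 1^T O (m − log c_i) / (1^T O 1) and (1^T O 1)^(−1) (where 1 is a vector of 1’s). Integrating over the totals in (1) (S1 Text) gives the more computationally tractable marginal likelihood

p(c_i | m, O) = (2π)^(−(p−1)/2) |O|^(1/2) (1^T O 1)^(−1/2) (∏_j c_i,j)^(−1) exp{−(1/2) [a_i^T O a_i − (1^T O a_i)^2 / (1^T O 1)]}, with a_i = log c_i − m.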
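As a rough check on this factorization, the sketch below evaluates a complete log-likelihood and a marginal log-likelihood for one composition. Both formulas are derived here from the stated log-normal assumption (change of variables x = t·c with Jacobian t^(p−1), then completing the square in log t); the notation may differ from S1 Text, and all function names are hypothetical.

```python
import math


def cholesky_logdet(a):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    p = len(a)
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(acc)
            else:
                low[i][j] = acc / low[j][j]
    return 2.0 * sum(math.log(low[i][i]) for i in range(p))


def quad_terms(prec, vec):
    """Return (v'Ov, 1'Ov, 1'O1) for a symmetric precision matrix O."""
    p = len(vec)
    row_dot = [sum(prec[j][k] * vec[k] for k in range(p)) for j in range(p)]
    vov = sum(vec[j] * row_dot[j] for j in range(p))
    return vov, sum(row_dot), sum(sum(row) for row in prec)


def complete_loglik(comp, total, mean, prec):
    """log p(c, t | m, O): log-normal density of x = t * c, Jacobian t^(p-1)."""
    p = len(comp)
    dev = [math.log(total * c) - mu for c, mu in zip(comp, mean)]
    vov, _, _ = quad_terms(prec, dev)
    return (-0.5 * p * math.log(2.0 * math.pi) + 0.5 * cholesky_logdet(prec)
            - math.log(total) - sum(math.log(c) for c in comp) - 0.5 * vov)


def marginal_loglik(comp, mean, prec):
    """log p(c | m, O) after integrating the total out analytically."""
    p = len(comp)
    dev = [math.log(c) - mu for c, mu in zip(comp, mean)]
    vov, one_ov, one_o_one = quad_terms(prec, dev)
    return (-0.5 * (p - 1) * math.log(2.0 * math.pi)
            + 0.5 * cholesky_logdet(prec) - 0.5 * math.log(one_o_one)
            - sum(math.log(c) for c in comp)
            - 0.5 * (vov - one_ov ** 2 / one_o_one))
```

The marginal form depends only on the p-dimensional composition, so no per-subject total needs to be sampled.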

Prior distributions

In order to mitigate the non-identifiability of the precision matrix O, BAnOCC uses a shrinkage prior to conservatively estimate the sparsest O consistent with the observed relative abundance data. This is the graphical LASSO prior of [13]:

p(O | λ) = C^(−1) ∏_{j<k} Laplace(o_jk | λ) ∏_j Exp(o_jj | λ/2) 1_{O ∈ M⁺}

where 1_{O ∈ M⁺} is an indicator function that O is positive definite, Exp(x|λ) has the exponential density of the form p(x) = λe^(−λx) 1_{x>0}, and Laplace(x|λ) has the Laplace density of the form p(x) = (λ/2)e^(−λ|x|). In comparison to variable selection priors such as spike-and-slab [18], the graphical LASSO prior is more scalable to high dimensions at the cost of being unable to generate estimates that are exactly zero [19]. We deal with this by using the resulting posterior samples to conclude whether a correlation is likely to be zero or not. The choice of λ is key to the degree of shrinkage imposed by this prior. We placed a gamma prior on λ in lieu of specifying it a priori; this is possible because [13] showed that the normalizing constant C does not depend on λ. The prior for m is the conditionally-conjugate normal prior with mean n and covariance matrix L. Hyperparameter choice for the two priors (on m and λ) is discussed in more detail below.
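To make the shrinkage behaviour concrete, the following sketch evaluates an unnormalized log density for a graphical LASSO prior: Laplace terms on the off-diagonal elements, exponential terms on the diagonal, and a positive-definiteness check. The use of rate λ/2 for the diagonal is an assumption taken from the usual Bayesian graphical LASSO formulation, and the function names are hypothetical; this is not the BAnOCC Stan code.

```python
import math


def is_positive_definite(a):
    """Attempt a Cholesky factorization; succeed only for positive definite input."""
    p = len(a)
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    return False
                low[i][j] = math.sqrt(acc)
            else:
                low[i][j] = acc / low[j][j]
    return True


def log_glasso_prior(prec, lam):
    """Unnormalized log prior of a precision matrix O given shrinkage lam."""
    if lam <= 0.0 or not is_positive_definite(prec):
        return float("-inf")
    p = len(prec)
    logp = 0.0
    for j in range(p):
        # Exponential density with rate lam / 2 on the diagonal element.
        logp += math.log(lam / 2.0) - (lam / 2.0) * prec[j][j]
        for k in range(j + 1, p):
            # Laplace density with rate lam on each off-diagonal element.
            logp += math.log(lam / 2.0) - lam * abs(prec[j][k])
    return logp


sparse = [[1.0, 0.0], [0.0, 1.0]]
dense = [[1.0, 0.5], [0.5, 1.0]]
gain = log_glasso_prior(sparse, 2.0) - log_glasso_prior(dense, 2.0)
```

With λ = 2, setting the single off-diagonal element from 0.5 to 0 raises the log prior by λ × 0.5 = 1, which is how the prior favours sparse precision matrices without ever forcing exact zeros.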

Implementation and inference

BAnOCC samples the posterior using Stan’s C++ implementation and R interface [20]. Multiple quantities can be estimated from BAnOCC, including the log-basis precision, covariance, and correlation matrices. In our simulations and application, we estimated the log-basis correlation R_logX because it is interpretable and nicely scaled; we used the posterior median as the point estimate and the 95% credible interval for each element w_jk of R_logX to determine whether the correlation between features j and k was non-zero.
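The decision rule described here (posterior median plus a 95% credible interval that either covers or excludes zero) can be sketched for a single correlation as follows. The posterior draws are simulated stand-ins rather than output from Stan, the equal-tailed interval is one common choice of credible interval, and the function name is hypothetical.

```python
import random
import statistics


def summarize_posterior(samples, level=0.95):
    """Posterior median, equal-tailed credible interval, and non-zero call."""
    ordered = sorted(samples)
    n = len(ordered)

    def quantile(q):
        # Linear interpolation between neighbouring order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

    tail = (1.0 - level) / 2.0
    lower, upper = quantile(tail), quantile(1.0 - tail)
    return {
        "median": statistics.median(ordered),
        "lower": lower,
        "upper": upper,
        "nonzero": lower > 0.0 or upper < 0.0,
    }


rng = random.Random(0)
# Stand-in draws for one correlation w_jk; real draws would come from MCMC.
edge = summarize_posterior([rng.gauss(0.3, 0.05) for _ in range(4000)])
null = summarize_posterior([rng.gauss(0.0, 0.05) for _ in range(4000)])
```

The same summary can be applied to any function of the sampled precision matrix, since each posterior draw yields one value of that function.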

Choosing hyperparameters

The interpretation of the prior parameters on m is relatively straightforward, while that of the shrinkage parameter λ is less clear. Because log-basis means m have a normal distribution, e^m represents the median unobserved counts, which conveniently have a log-normal distribution with parameters n and L. Therefore, we could parametrize the prior on m by the expected median unobserved counts n_LN = exp{n + 0.5diag(L)} and the uncertainty of the median unobserved counts, given elementwise by the log-normal variance (exp{L_jj} − 1)exp{2n_j + L_jj}. The prior on the shrinkage parameter λ has a shape parameter a that determines how much prior probability mass is placed on λ values close to zero, and a rate parameter b that determines how the probability mass is spread across the entire domain. In particular, a ≤ 1 forces an asymptote at zero, while a > 1 does not.

When little or no prior data is available, weakly informative priors can be used. Any prior on λ should have high probability mass close to zero and so should have a ≤ 1. Larger values of a will “soften” the asymptotic behavior at zero (S1 Fig). The value of the rate parameter b should be chosen so that most prior probability mass is on sensible values for λ. The degree of shrinkage implied by λ does not appreciably change for λ > 1 (S2 Fig), and so a b of around 5 will give a reasonable uninformative prior distribution for λ. For the log-basis means, n = 0 and L = lI can be used, with l a large value such as 100. An overlarge value for l can make computation less efficient and put prior mass on grossly implausible values of e^m, so an l of 500 or less is reasonable.
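One way to inspect a candidate gamma prior on λ is to sample from it and measure how much mass falls below one. The sketch below does this for a = 0.5 and b = 5; the function name, seed, and number of draws are arbitrary illustrative choices.

```python
import random


def sample_lambda_prior(shape_a, rate_b, n_draws, seed=0):
    """Draw shrinkage values from a Gamma(shape a, rate b) prior."""
    rng = random.Random(seed)
    # random.gammavariate is parametrized by shape and scale, so scale = 1 / rate.
    return [rng.gammavariate(shape_a, 1.0 / rate_b) for _ in range(n_draws)]


draws = sample_lambda_prior(0.5, 5.0, 20000)
mass_below_one = sum(d < 1.0 for d in draws) / len(draws)
prior_mean = sum(draws) / len(draws)
```

For these values the prior mean is a/b = 0.1 and nearly all of the mass lies below λ = 1, the range in which the degree of shrinkage still changes appreciably.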

Prior subject-matter information can be incorporated into the priors for both λ and m, but most easily into the prior on m. If the data have few features, a smaller shape hyperparameter a should be employed to upweight values of λ that yield high shrinkage. The implied prior on the median unobserved counts e^m could be sampled to provide an empirical distribution of the total counts Σ_j e^{m_j}; this could be assessed for gross deviations from what might be considered reasonable, or agreement with known ranges if such data are available.

Software

The implementation of BAnOCC is publicly available with source code, documentation, and tutorial data as an R/Bioconductor package at http://huttenhower.sph.harvard.edu/banocc.

Results

Unobserved count mean and covariance determine spurious correlation sign and magnitude

We first aimed to identify what characteristics of compositional data impede or facilitate the accurate estimation of the unobserved count correlation matrices in general. Such characteristics should delineate when BAnOCC or any other technique for estimating the unobserved count correlation would perform well. A first-order Taylor expansion approximates the compositional covariance as a function of the mean and covariance of the unobserved counts. Because the compositional correlation is a function of the compositional covariance, the resulting approximation also explains how the correlation behaves. Letting X represent the unobserved counts and C the composition, with the mean of X denoted by μ_X = (μ_X,j)^T and the approximate average proportions by ω = μ_X / (1^Tμ_X), the Taylor expansion yields

Σ_C ≈ (1^Tμ_X)^(−2) (I − ω1^T) Σ_X (I − ω1^T)^T (2)

Here I is the p × p identity matrix, and 1 is a p-dimensional vector of 1’s. Eq (2) allows us to approximate the behavior of the compositional covariance from the parameters of the unobserved counts that generate it. For a detailed derivation, see S1 Text.
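The first-order approximation can be evaluated numerically with a short sketch. It applies the delta method to the map x → x / Σ_k x_k, whose Jacobian at μ_X is (I − ω1^T) / (1^Tμ_X) with ω = μ_X / (1^Tμ_X); the function name and example values are hypothetical.

```python
def approx_comp_cov(mu, sigma):
    """First-order approximation of Cov(C) from the mean and covariance of X."""
    p = len(mu)
    total = sum(mu)
    omega = [m / total for m in mu]
    # Jacobian of x -> x / sum(x) at mu, up to the factor 1 / total.
    jac = [[(1.0 if j == k else 0.0) - omega[j] for k in range(p)]
           for j in range(p)]
    # jac * sigma
    js = [[sum(jac[j][l] * sigma[l][k] for l in range(p)) for k in range(p)]
          for j in range(p)]
    # (jac * sigma * jac^T) / total^2
    return [[sum(js[j][l] * jac[k][l] for l in range(p)) / total ** 2
             for k in range(p)] for j in range(p)]


# Hypothetical uncorrelated counts: two abundant, variable features and one rare one.
mu_x = [100.0, 100.0, 1.0]
sigma_x = [[400.0, 0.0, 0.0], [0.0, 400.0, 0.0], [0.0, 0.0, 0.01]]
sigma_c = approx_comp_cov(mu_x, sigma_x)
```

Every row of the result sums to zero, as it must for proportions that sum to one, and the off-diagonal element for the two abundant features is negative even though the counts are uncorrelated.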

Spurious correlation can take any value between -1 and 1

Surprisingly, when no features are correlated on the unobserved count scale, the spurious correlation can take any value in [−1,1] depending on the properties of the unobserved counts (Fig 2). This is suggested by considering Eq (2) when σ_X,jk = 0 for all j ≠ k; then, for j ≠ k,

σ_C,jk ≈ (1^Tμ_X)^(−2) (ω_jω_k Σ_l σ_X,ll − ω_j σ_X,kk − ω_k σ_X,jj) (3)

The weights ω_j and the variances σ_X,ll can be configured arbitrarily to force σ_C,jk either to the extreme positive or extreme negative end of the spectrum. In particular, we see three types of strong spurious correlations (Fig 2B–2D): “negative dominant”, “positive dominant”, and “negative mixed”. These three types of correlations are thus representative of a range of expected real-world behaviors, and we included them in subsequent simulation studies of BAnOCC and previous models.
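A small sketch shows how the weights and variances push the approximate spurious correlation toward either extreme when the counts are uncorrelated. The diagonal and off-diagonal terms are obtained by setting Σ_X to a diagonal matrix in the first-order approximation; the function name and the two parameter sets are illustrative assumptions.

```python
import math


def spurious_correlation(mu, var, j, k):
    """Approximate Corr(C_j, C_k) for uncorrelated counts with means mu, variances var."""
    total = sum(mu)
    omega = [m / total for m in mu]
    v_sum = sum(var)
    # The common factor 1 / total^2 cancels in the correlation.
    cov_jk = omega[j] * omega[k] * v_sum - omega[j] * var[k] - omega[k] * var[j]
    var_j = var[j] * (1.0 - 2.0 * omega[j]) + omega[j] ** 2 * v_sum
    var_k = var[k] * (1.0 - 2.0 * omega[k]) + omega[k] ** 2 * v_sum
    return cov_jk / math.sqrt(var_j * var_k)


# Features 0 and 1 dominate both the total mean and the total variance.
negative_dominant = spurious_correlation([100.0, 100.0, 1.0], [400.0, 400.0, 0.01], 0, 1)
# Features 0 and 1 have high mean but low variance; feature 2 is highly variable.
positive_dominant = spurious_correlation([100.0, 100.0, 50.0], [1.0, 1.0, 2500.0], 0, 1)
```

Under these assumed values the first configuration gives a correlation close to −1 and the second a correlation close to +1, with no correlation at all among the counts.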

Fig 2. Spurious correlation is not constrained as a function of feature count, mean, and variance.

A The approximate compositional correlation (based on Eq (3)) between features j and k when σ_X,jk = 0, as a function of the proportion of the total mean and proportion of total variability they contribute. B-D Examples of compositions that display positive (B) and negative (C-D) compositional correlations; in each, the top panel shows the correlation of the unconstrained and unobserved abundances across samples, while the bottom panel shows the correlation of the relative abundances across samples. The spurious correlation can be positive or negative, and of arbitrary magnitude, depending on the characteristics of the unobserved abundances.

https://doi.org/10.1371/journal.pcbi.1005852.g002

“Negative dominant” spurious correlation (Fig 2B) occurs when features j and k in the unobserved counts have (1) high mean and (2) high variability compared to the remaining (l ≠ j,k) features. Intuitively, the remaining features must contribute minimally to the total mean or total variance in the unobserved counts. When normalized, the sum-constraint thus forces a negative correlation between features j and k because they behave as if they were the only two features in the composition.

In the “positive dominant” spurious correlation type (Fig 2C), features j and k in the unobserved counts have (1) small variability and (2) high mean relative to the remaining (l ≠ j,k) features. The positive correlation in the composition results because the variability in the sum of the remaining feature abundances causes the compositions for features j and k to be shrunk or stretched in the same direction when the data are normalized.

Finally, “negative mixed” spurious correlations are the result of “positive dominant” type bases where feature k and the remaining features have switched roles (Fig 2D). After normalization, the variability in feature k forces feature j to move in the opposite direction to accommodate the remaining features.
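The three types can also be reproduced by direct simulation rather than by the Taylor approximation. The sketch below draws independent log-normal counts, normalizes them, and computes the Pearson correlation between the first two proportions; the medians, log-scale standard deviations, seed, and function names are hypothetical choices made only to illustrate each type.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def composition_corr(params, n_samples=2000, seed=1):
    """Correlation of proportions 0 and 1 for independent log-normal counts.

    params is a list of (median, log-scale standard deviation) per feature.
    """
    rng = random.Random(seed)
    c0, c1 = [], []
    for _ in range(n_samples):
        counts = [rng.lognormvariate(math.log(med), sd) for med, sd in params]
        total = sum(counts)
        c0.append(counts[0] / total)
        c1.append(counts[1] / total)
    return pearson(c0, c1)


negative_dominant = composition_corr([(100.0, 0.3), (100.0, 0.3), (1.0, 0.1)])
positive_dominant = composition_corr([(100.0, 0.01), (100.0, 0.01), (50.0, 1.0)])
negative_mixed = composition_corr([(100.0, 0.01), (50.0, 1.0), (100.0, 0.01)])
```

In all three cases the counts are generated independently, so any correlation between the proportions is induced by the normalization alone.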

Extending and improving current assumptions about compositional correlation

Eq (3) also offers an alternative explanation for the negative covariance between features in a Dirichlet distribution. A Dirichlet distribution with parameters α₁…,α_p results when each feature is independent on the unobserved count scale and has a Gamma(α_j,β) distribution. The mean and variance of a Gamma distribution with shape α_j and rate β are α_j/β and α_j/β^2, respectively, implying that in the unobserved counts, a feature with high mean will also have high variance, and vice versa. This captures “negative dominant” correlations well, but fails to capture “positive dominant” or “negative mixed” correlations, which result when at least one feature has high mean but low variance in the unobserved counts.
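The Dirichlet case can be checked by simulation: normalizing independent gamma counts with a shared rate gives proportions whose pairwise correlation is negative and matches the known Dirichlet value −sqrt{α_jα_k / ((α_0 − α_j)(α_0 − α_k))}, where α_0 = Σ_l α_l. The α values, seed, and function names below are illustrative assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def dirichlet_corr_by_simulation(alphas, j, k, n_samples=20000, seed=2):
    """Normalize independent Gamma(alpha_l, rate 1) counts and correlate C_j, C_k."""
    rng = random.Random(seed)
    cj, ck = [], []
    for _ in range(n_samples):
        counts = [rng.gammavariate(a, 1.0) for a in alphas]
        total = sum(counts)
        cj.append(counts[j] / total)
        ck.append(counts[k] / total)
    return pearson(cj, ck)


def dirichlet_corr_exact(alphas, j, k):
    """Closed-form correlation between two Dirichlet components."""
    a0 = sum(alphas)
    return -math.sqrt(alphas[j] * alphas[k] / ((a0 - alphas[j]) * (a0 - alphas[k])))


alphas = [2.0, 3.0, 5.0]
simulated = dirichlet_corr_by_simulation(alphas, 0, 1)
exact = dirichlet_corr_exact(alphas, 0, 1)
```

The correlation is negative for every pair and every choice of α, which is why a Dirichlet model cannot represent the positive spurious correlations described above.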

Eqs (2) and (3) further suggest that the overall effect of normalization on the correlation estimate as the number of features p increases depends on the characteristics of μ_X and Σ_X. In ecological applications, it is often assumed that if p is large and the compositional means are similar across the p features, then the correlation estimates based on the composition and unobserved counts are not likely to be very different [8,10]. Part of the appeal of this reasoning is that it does not rely on information about the unobserved and unconstrained counts. Expanding Eq (2), we can see that Σ_C ∝ Σ_X − ω1^TΣ_X − Σ_X1ω^T + ω1^TΣ_X1ω^T. If the means are very similar to each other, this affects only the weights ω given to the offset −ω1^TΣ_X − Σ_X1ω^T + ω1^TΣ_X1ω^T. Small weights render the offset negligible only in the case where the variance of the unobserved counts, Σ_X, is not too large: the behavior of the offset as the number of features increases depends on the similarity of the means (through ω) and on the variances of the additional features in the unobserved counts (through Σ_X).

Thus when analyzing compositional data, one cannot know with certainty in which data the correlations are strongly affected by the normalization, much less the magnitude and direction of the change in correlation structure induced by normalizing. The information loss due to normalization implies that Σ_X is non-identifiable without assumptions about its structure. However, knowing how the mean and variance of the unobserved counts affect the spurious correlation allows simulation of datasets that have specific types of spurious correlation for testing the performance of estimation methods in these cases.

Simulation studies

Data generation methods

Using the information from this theoretical analysis, we tested BAnOCC on two types of datasets. The first comprised small datasets generated using the model itself but designed to be challenging by incorporating negative dominant correlations. Second, we also simulated larger, more realistic datasets using an independent model specific to microbial community structure, SparseDOSSA [21].

For the former, four small datasets with 1,000 samples and nine features each were generated according to four scenarios. The “simple” scenario had no true correlations and no negative dominant correlation; the “high spurious” scenario had no true correlations but the presence of a negative dominant correlation; the “retained spike” scenario had several true correlations and no negative dominant correlation; and the “reversed spike” scenario had several true correlations and a negative dominant correlation between two features that are positively correlated in the unobserved abundances (see details in S2 Text and data in S1 Data). On these data, we used hyperparameters n_j = 0, L = 1000I, a = 0.5 and b = 5 (S3 Fig).
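Generating data "using the model itself" amounts to drawing the log-basis from a multivariate normal distribution, exponentiating, and normalizing. The sketch below shows that pipeline for a hypothetical three-feature example; the mean vector, covariance matrix, seed, and function names are assumptions and do not reproduce the scenarios in S2 Text.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    p = len(a)
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(acc)
            else:
                low[i][j] = acc / low[j][j]
    return low


def simulate_compositions(mean, cov, n_samples, seed=3):
    """Draw log X ~ N(mean, cov), exponentiate, and sum-normalize each sample."""
    rng = random.Random(seed)
    low = cholesky(cov)
    p = len(mean)
    compositions = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        log_x = [mean[j] + sum(low[j][k] * z[k] for k in range(j + 1))
                 for j in range(p)]
        x = [math.exp(v) for v in log_x]
        total = sum(x)
        compositions.append([v / total for v in x])
    return compositions


# Hypothetical log-basis parameters with one true positive association.
log_mean = [4.0, 4.0, 0.0]
log_cov = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 0.1]]
sim = simulate_compositions(log_mean, log_cov, 1000)
```

Because the true log-basis covariance is known, the output of any estimation method applied to `sim` can be compared against it element by element.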

Realistic data were generated using the SparseDOSSA model [21], which generates each feature from a zero-inflated, truncated log-normal distribution with subsequent rounding and estimates the feature-specific parameters by fitting to a given real-world template dataset. We induced correlations between features by using a multivariate distribution with a log-basis correlation that had off-diagonal elements set to one of four different correlation strengths ({−0.7,−0.3,0.3,0.7}). To ensure that strong compositional effects were present, we used a template with low-diversity community structure [22] with 14 pseudomicrobial features. The correlations were set so that the non-zero elements of the log-basis precision matrix and the log-basis covariance matrix would be the same; we used seven correlations (see details in S2 Text and data in S2 Data). We used hyperparameters a = 0.5, b = 5, n_j = 3, and L = 30I (S4 Fig).

BAnOCC and CCLasso perform comparably in difficult scenarios

Using our first set of simulated data for evaluation, we compared the estimation and inference from BAnOCC with that from CCLasso [9], a frequentist LASSO-based method that chooses the shrinkage parameter using K-fold cross validation (Fig 3). BAnOCC had much lower false positive rates than CCLasso, resulting from the model’s ability to use the posterior distribution to account for estimate uncertainty while CCLasso, being LASSO-based, used a non-zero point estimate to determine significance of an effect.

Fig 3. BAnOCC infers the correct unobserved abundance correlation matrix in four scenarios simulated to be challenging.

Each column represents one of four datasets simulated to evaluate methods for identification of correlations from compositional data: “simple”, with no true correlations and no negative dominant correlation; “high spurious”, with no true correlations and the presence of a negative dominant correlation; “retained spike” with several true correlations and no negative dominant correlation; and “reversed spike” with several true correlations and a negative dominant correlation between two positively correlated features. The top row shows the true correlation matrix. The second row shows the uncorrected compositional correlations as estimated using the 1,000 samples in the simulated data. Each of the subsequent rows shows the log-basis correlation estimate and the associated inference using the compositional data for Pearson correlation, BAnOCC, and CCLasso, respectively.

https://doi.org/10.1371/journal.pcbi.1005852.g003

BAnOCC and CCLasso both estimate the log-basis correlation matrix accurately, and both are a substantial improvement on a naïve approach (row 2 of Fig 3). In particular, both BAnOCC and CCLasso have much lower false positive rates than Pearson correlation. Over all the null associations, Pearson correlation had a staggering false positive rate of 82%; CCLasso had almost 14% false positives as a result of many small but non-zero estimates; BAnOCC, because it uses the posterior credible intervals to evaluate uncertainty, had a false positive rate of about 3%. BAnOCC cannot estimate the log-basis correlations w_jk to be exactly zero because of the continuous prior, but the null associations whose 95% credible intervals cover zero have very small estimates (all are less than 0.15, 75% are less than 0.05).

The association between features 1 and 5 in the “reversed spike” dataset was difficult for both BAnOCC and CCLasso. Both gave a small, negative estimate (-0.001 for BAnOCC and -0.113 for CCLasso). BAnOCC displays a slight bias toward positive correlations instead of the moderate negative correlation that was present in the underlying unobserved abundances, as shown by several false positive associations in this dataset. This behavior is common among many methods, including SparCC and SPIEC-EASI (S5 Fig). It results from the fact that when a negative-dominant structure is present, positive correlations become much more likely to be real than negative ones, an interesting observation to consider when interpreting real-world results from any of these methods.

BAnOCC and CCLasso agree well with the true magnitude and direction of the non-zero associations that both methods conclude are significant. For these associations, the relative difference with the true value is less than 15% for both methods. When the associations were rejected, the 95% credible interval from BAnOCC covered the true value, indicating its utility for evaluating the uncertainty of the estimate. The false negative rates were 25% for BAnOCC and 0% for CCLasso, a direct result of the higher tolerance for false positives CCLasso exhibits. In practice, this has the expected effect of dramatically lowering BAnOCC’s false positive rate in recovering true correlations from compositional data.

Comparison of type I and type II error rates

We compared BAnOCC’s performance as measured by type I and type II error rates to a range of previous methods (Fig 4): simplicial variation [23], SparCC [8], CCLasso [9], SPIEC-EASI [11], ReBoot [7], and Spearman correlation (directly on the composition as a negative control). Of the two frequentist LASSO-based methods (CCLasso and REBACCA [10]), CCLasso alone had an R package interface; because they employ highly similar approaches, they should yield similar results. For a positive control, we also applied Spearman correlation to the unconstrained (and usually unobserved) counts (Table 1 and S3 Text).

Fig 4. The BAnOCC model controls type I error while maintaining power.

Results on simulated data comprising SparseDOSSA-derived compositions modeled on a low-diversity dataset with 14 features. The type I error rate is controlled at the 0.05 level for BAnOCC and approximately so for SparCC, CCLasso, and SPIEC-EASI (MB), but not for simplicial variation or Spearman correlation (on the composition, a negative control). BAnOCC maintains good power across all true correlation values, but as expected has better power for stronger true correlation values. Type I and type II error rates are determined by correct or incorrect rejection of H₀ based on inference (simplicial variation, SparCC, Spearman correlation, and BAnOCC) or estimation (CCLasso and SPIEC-EASI). * = rejection of H₀ based on estimation; ** = rejection of H₀ based on inference from credible intervals; all others, rejection of H₀ based on inference from p-values. See also S6 Fig and S7 Fig.

https://doi.org/10.1371/journal.pcbi.1005852.g004

Table 1. Methods included in an evaluation on simulated data.

Type I and type II error rates were determined for these methods by the correct or incorrect rejection of H₀; for CCLasso and SPIEC-EASI, no inferential methodology was provided and so the correct or incorrect estimation of w_jk as zero was used. Note that although SPIEC-EASI infers the precision matrix, construction of the true correlation matrix in the simulated data guarantees that the same elements will be non-zero in the precision and covariance matrix.

https://doi.org/10.1371/journal.pcbi.1005852.t001

Overall, BAnOCC controlled the type I error rate for all correlation strengths (Fig 4A) while maintaining power comparable to that of other recent methods (Fig 4B). These results held true in a more even community with more features, in which BAnOCC was the sole method to fully control the type I error rate (S8 Fig). As the number of samples increased, all methods increased in power (S9 Fig), while the type I error rates remained fairly constant (S10 Fig).
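The error rates summarized here are tallies of correct and incorrect rejections of H₀ over feature pairs. The following is a minimal sketch of that bookkeeping, not the evaluation code used in the paper; the function name `error_rates` and the toy inputs are hypothetical.

```python
def error_rates(true_nonzero, rejected):
    """Return (type I error rate, power) from paired boolean lists.

    true_nonzero[i] is True when feature pair i has a non-zero true correlation.
    rejected[i] is True when the method rejected H0 for pair i (or, for
    estimation-only methods, estimated a non-zero association).
    """
    false_pos = sum(r and not t for t, r in zip(true_nonzero, rejected))
    true_pos = sum(r and t for t, r in zip(true_nonzero, rejected))
    n_null = sum(not t for t in true_nonzero)
    n_alt = sum(t for t in true_nonzero)
    type1 = false_pos / n_null if n_null else float("nan")
    power = true_pos / n_alt if n_alt else float("nan")
    return type1, power


# Hypothetical toy input: six feature pairs, two of them truly correlated.
truth = [True, True, False, False, False, False]
rejected = [True, False, True, False, False, False]
type1, power = error_rates(truth, rejected)
print(type1, power)  # the type II error rate is 1 - power
```

In this toy case one of four null pairs is rejected and one of two true associations is detected.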

Only BAnOCC and SparCC controlled type I error while maintaining high power for all correlation strengths (see also AUC boxplots in S6 Fig). Both behaved similarly to Spearman correlation applied to the unconstrained abundances, which represents the best possible performance because it uses the unconstrained data rather than the composition (this is impossible in practice, when only the composition is available). SparCC’s type I error rate was slightly inflated in a larger dataset with more features, while BAnOCC continued to control the type I error rate at the nominal level (S8 Fig). As other authors have noted, SparCC does not guarantee that its log-basis correlation estimate has bounded elements nor that it is positive definite [9]. By contrast, BAnOCC not only estimates a positive definite correlation matrix with bounded elements, but can also infer network edges based on the precision matrix.

Several methods controlled the type I error rate poorly: Spearman correlation exemplifies this as a negative control, but simplicial variation, SPIEC-EASI using GLASSO and, to a lesser extent, CCLasso were comparable. ReBoot, by design, attenuates the type I error rate of Spearman correlation, but does not control it perfectly. The high type I error rates are also somewhat expected in simplicial variation, but SPIEC-EASI using GLASSO may not be performing as expected, especially since, in contrast, the Meinshausen-Bühlmann neighborhood selection method did control type I error. This may be because neighborhood selection infers each element of the matrix one at a time, while GLASSO infers the matrix all at once; this makes the GLASSO optimization a more difficult problem.

Feature 5 in the template dataset has a large mean and variance, while feature 3 has a small mean and variance. This results in a strong negative spurious correlation in the composition, which gives rise to interesting behavior of essentially all methods when detecting this association. When the true association is negative, many compositionally-appropriate methods such as BAnOCC, SparCC, and SPIEC-EASI (MB) do poorly at detecting the true correlation (Fig 4B) because the negative correlation is difficult to attribute to the unobserved counts rather than spurious correlation. Conversely, more naïve methods such as simplicial variation and Spearman correlation do very well at detecting a weak negative correlation between these two features because this becomes a strong negative correlation in the composition. This simulated example thus provides some insight into the form of sensitivity / specificity tradeoff that applies in the constrained, information-loss setting of identifying true correlations from compositions.
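The mechanism behind this spurious correlation can be illustrated with a small simulation. The sketch below is not the SparseDOSSA simulation used in the paper: it assumes five independent log-normal features with hypothetical log-scale means and standard deviations, in which the fifth feature has a large mean and variance, and compares the Spearman correlation between features 3 and 5 before and after normalization to proportions.

```python
import random


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks (no ties expected)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


random.seed(0)
# Hypothetical parameters: independent features, the last one dominant.
log_mean = [0.0, 0.0, 0.0, 0.0, 3.0]
log_sd = [0.2, 0.2, 0.2, 0.2, 1.0]

counts = [[random.lognormvariate(m, s) for m, s in zip(log_mean, log_sd)]
          for _ in range(2000)]
# Closure: divide each sample by its total to obtain a composition.
comps = [[x / sum(row) for x in row] for row in counts]

rho_counts = spearman([r[2] for r in counts], [r[4] for r in counts])
rho_comp = spearman([r[2] for r in comps], [r[4] for r in comps])
print(round(rho_counts, 3), round(rho_comp, 3))
```

Under these assumptions the unconstrained counts are essentially uncorrelated, whereas the two proportions are strongly negatively correlated, because the dominant feature drives the sample total.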

A microbial interaction network from the Human Microbiome Project

As an example application, we inferred a correlation network among microbial taxa profiled using ecological data from the Human Microbiome Project [22] (Fig 5). Microbial community sequencing generates compositions by assigning sequencing reads to microorganisms; since nucleotide sequencing depth is arbitrary, the resulting counts are not informative regarding the unobserved and unconstrained counts and are often normalized to relative abundances. Co-variation patterns in such data are of interest because they suggest ecological interactions, such as mutualism (positive correlation) or predation (negative correlation) [7].

Fig 5. BAnOCC association networks from the Human Microbiome Project.

The association networks inferred from three HMP body sites: stool (A), buccal mucosa (B), and posterior fornix (C). Using four chains with a minimum of 5000 iterations, we ran BAnOCC until convergence (see details in S3 Text). Only significant correlations stronger than 0.15 are shown (see S1 Table, S2 Table and S3 Table). The GLASSO prior results in sparse networks for these datasets, highlighting individual associations between taxa.

https://doi.org/10.1371/journal.pcbi.1005852.g005

The microbial taxonomic relative abundance data used here consisted of 523 microbial features measured across 700 total samples using MetaPhlAn2 v2.0_beta1 [24] in July of 2014 (available in S3 Data), further excluding from all networks markers removed in the subsequent version’s database (v2.0_beta2). These samples were in turn drawn from 127 individuals at six distinct body sites. Microbial ecology differs at each body site [22], providing examples for BAnOCC analysis that ranged from diverse, relatively even communities (such as stool) to less diverse, highly skewed ecologies (such as the vaginal posterior fornix). For each of three representative body sites (stool, posterior fornix, and buccal mucosa), we selected the first time point from each subject, collapsed taxonomic information to the genus level, and then removed features with relative abundance less than 0.0001 in at least 50% of samples. With too few features, little to nothing can be concluded about the true correlations; so if fewer than 10 features remained we lowered the prevalence cutoff until 10 features were retained.
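The feature-filtering rule described above can be sketched as follows. This is an illustrative reading of the text rather than the authors' preprocessing code: the function name `filter_features` and its parameters are hypothetical, and the fallback (relaxing the prevalence cutoff just enough to retain the required number of features) is an assumption about how the cutoff was lowered.

```python
def filter_features(abundance, min_abundance=1e-4, prevalence=0.5, min_features=10):
    """Return the indices of features to keep.

    abundance: list of samples, each a list of per-feature relative abundances.
    A feature is kept when its abundance is >= min_abundance in more than
    `prevalence` of the samples.
    """
    n_samples = len(abundance)
    n_features = len(abundance[0])
    prev = [sum(row[j] >= min_abundance for row in abundance) / n_samples
            for j in range(n_features)]
    keep = [j for j in range(n_features) if prev[j] > prevalence]
    if len(keep) < min_features:
        # Assumed fallback: lower the cutoff to the prevalence of the
        # min_features-th most prevalent feature (ties are all kept).
        cutoff = sorted(prev, reverse=True)[min(min_features, n_features) - 1]
        keep = [j for j in range(n_features) if prev[j] >= cutoff]
    return keep


# Hypothetical toy data: four samples, three features.
toy = [
    [0.70000, 0.29995, 0.00005],
    [0.60000, 0.39000, 0.01000],
    [0.80000, 0.19995, 0.00005],
    [0.99995, 0.00003, 0.00002],
]
kept_two = filter_features(toy, min_features=2)
kept_three = filter_features(toy, min_features=3)
print(kept_two, kept_three)
```

With a minimum of two features the rarely detected third feature is dropped; requiring three features relaxes the cutoff until it is retained.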

The hyperparameters for the gamma prior on λ were a = 0.5 and b = 5 for all body sites, ensuring that we gave substantial weight to sparser precision matrices. For all body sites, we used the prior variability of the log-basis means L = 30I; each body site, however, had a different n_j so that the distribution of the sums of medians were similar across different body sites (see S11–S14 Figs). We further compared BAnOCC’s inferred network using the log-basis correlation matrix with that from CCLasso, and BAnOCC’s inferred network using the log-basis precision matrix with SPIEC-EASI. There is broad agreement between the methods as to which edges are significant, with very few edges discrepant between the methods (S15 Fig).
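To see how much weight these hyperparameters place on small λ, the prior can be evaluated directly. The sketch below assumes the shape-rate parameterization of the gamma distribution (shape a, rate b, as in the S1 Fig caption) and uses the closed form of the gamma CDF that holds for shape 1/2; the helper name `gamma_half_cdf` is hypothetical.

```python
import math


def gamma_half_cdf(x, rate):
    """CDF at x >= 0 of a gamma distribution with shape 0.5 and the given rate.

    For shape 1/2 the CDF reduces to erf(sqrt(rate * x)).
    """
    return math.erf(math.sqrt(rate * x))


a, b = 0.5, 5.0                 # hyperparameters stated in the text
prior_mean = a / b              # mean of a gamma(shape a, rate b) distribution
p_below_one = gamma_half_cdf(1.0, b)
p_below_tenth = gamma_half_cdf(0.1, b)
print(prior_mean, round(p_below_one, 4), round(p_below_tenth, 4))
```

Under this parameterization nearly all prior mass lies below λ = 1, and roughly two thirds lies below the prior mean of 0.1.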

In stool, BAnOCC inferred several positive associations between genera within the order Bacteroidales, in particular Bacteroides, Odoribacter, Parabacteroides and Alistipes (Fig 5A). Until recently, these genera were classified as part of the same genus [25]. This supports the common observation that closely (but not too closely) related taxa tend to have positive ecological associations [26]. Additionally, positive associations in the buccal mucosa (Fig 5B) connect taxa that are known to physically co-aggregate; in particular, Fusobacterium interactions with species from the Porphyromonas and Capnocytophaga genera (among others) are crucial in biofilm formation [27] and have been previously recovered from 16S-based ecological analyses [7]. Lastly, we can see the well-documented negative association between the Lactobacillus genus in the posterior fornix with several genera associated with dysbiosis such as Gardnerella and Prevotella [28] (Fig 5C).

Two interactions newly suggested by this analysis involved the Proteobacteria across multiple body sites, and specifically in stool and the oral cavity (buccal mucosa). The genera Escherichia and Haemophilus represent the two major proteobacterial residents in these habitats, respectively, and both were involved in predominantly negative interactions with more typical, abundant members of these communities (e.g. Faecalibacterium and Eubacterium in the gut, Leptotrichia or Corynebacterium in the mouth). These clades are highly phylogenetically diverged and tend to carry larger, more generalized genomes and pan-genomes [29,30]; this suggests that they will overgrow in these habitats only in unusual situations, exemplified by E. coli’s abundance in the gut primarily during inflammation [31]. Further details may be provided by future analyses using BAnOCC or related methods on species or strain-level ecological profiling.

Discussion

Here, we describe BAnOCC, a Bayesian method for inferring the log-basis correlation structure from compositional data. Assuming a log-normal distribution on the unobserved and unconstrained counts, the model estimates the log-basis correlations using a sparsity-inducing shrinkage prior on the log-basis precision matrix. It is part of a family of several recently proposed LASSO-based methods [9–11] which provide a more rigorous approach to correcting for compositional effects than earlier methods [7,8]. Unlike the other LASSO-based correlation-inference methods that summarize pairwise associations using a single point estimate, BAnOCC yields uncertainty estimates of the precision, covariance, and correlation parameters. Simulation results show that BAnOCC performs as well as or better than existing methods in controlling type I error while maintaining power for network edge detection from compositional data. Finally, we applied the method to assess microbial relationships in the human microbiome, confirming established interactions and suggesting novel ones for future validation.

Analysis using a Taylor series approximation provided one of the first characterizations of properties that make true correlations “difficult” to recover from compositions, or conversely “easy” to miss as false negatives. In particular, this depends not only on the more intuitive number and evenness of feature means, but also on the distribution of their variance. This allowed us to simulate designedly difficult test cases for BAnOCC and a variety of published methods, in contrast to previous simulation studies that relied primarily on relatively simple synthetic data [7–10]. In most studies, spurious correlation is noted to be commonly present and of varying magnitudes and directions [11]. However, the possible sensitivity of methods to the type of spurious correlation encountered has not been explored and is an important contribution to the characterization of existing and future methods.

We anticipate several computational and statistical refinements that may further improve BAnOCC’s performance. While BAnOCC uses 95% credible intervals for inference, these can be overly conservative [32]. Alternative thresholding methods may improve on this, such as the scaled neighborhood criterion [32] or the partial-correlation based approach of [33] and [13]. A discrete-continuous mixture prior such as the G-Wishart prior [34] or the covariance selection prior [35] on the log-basis correlation matrix would further allow the posterior probability that w_jk = 0 to be nonzero, and this quantity could be used as a threshold.
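As a concrete illustration of inference from credible intervals, the sketch below computes a 95% highest posterior density interval from posterior draws of a single correlation and calls the edge significant when the interval excludes zero. It is a generic sample-based construction, not BAnOCC's own implementation; the function names and the simulated draws (stand-ins for MCMC output) are hypothetical.

```python
import math
import random


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing a fraction `mass` of the posterior samples."""
    s = sorted(samples)
    n = len(s)
    k = max(1, math.ceil(mass * n))          # number of samples inside the interval
    widths = [s[i + k - 1] - s[i] for i in range(n - k + 1)]
    start = widths.index(min(widths))
    return s[start], s[start + k - 1]


def is_significant(samples, mass=0.95):
    """An edge is called significant when the interval excludes zero."""
    lo, hi = hpd_interval(samples, mass)
    return lo > 0 or hi < 0


random.seed(1)
# Hypothetical posterior draws for two correlations.
strong = [random.gauss(0.4, 0.05) for _ in range(4000)]
weak = [random.gauss(0.02, 0.05) for _ in range(4000)]
print(is_significant(strong), is_significant(weak))
```

Because the decision depends only on whether zero falls inside the interval, a wide posterior can mask a real but weak association, which is one way such intervals can be conservative.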

For applications specifically on count data, such as microbial compositions, the data could be modeled more accurately by adding a hierarchical layer. This would generate measurement counts conditional on the unobserved and unconstrained counts, making the observed compositions a function of normalized measurement counts. The degree of zero-inflation observed in ecological data could also be modeled directly using a hurdle or mixture model, or a multinomial distribution for the measurement counts. This would provide a particularly targeted approach for microbial ecology, in which more detailed data (at the species or strain level [24]) could be further incorporated. We thus hope to refine both the accuracy of compositional correlation inference and the applications to microbial community data in future studies.

Supporting information

S1 Text. Detailed mathematical derivations.

Beginning from initial definitions, a step-by-step derivation of the likelihood in Eq (1), the marginal likelihood for the composition, and the Taylor Series approximation in Eq (2).

https://doi.org/10.1371/journal.pcbi.1005852.s001

(DOCX)

S2 Text. Detailed description of simulated datasets.

Descriptions of how the datasets were generated for both the challenging scenarios case and for the realistic data case.

https://doi.org/10.1371/journal.pcbi.1005852.s002

(DOCX)

S3 Text. Implementation of methods compared.

Details on how each of the methods compared in the Results section were implemented, run on the simulated data, and evaluated for type I and type II errors.

https://doi.org/10.1371/journal.pcbi.1005852.s003

(DOCX)

S1 Data. Simulated data for difficult scenarios.

The simulated data for each of four difficult simulation scenarios described in the Results section. For details on how these were generated, see S2 Text.

https://doi.org/10.1371/journal.pcbi.1005852.s004

(ZIP)

S2 Data. Realistic simulated data.

All simulated datasets from sparseDOSSA, as well as the template dataset used. For details on how these were generated, see S2 Text.

https://doi.org/10.1371/journal.pcbi.1005852.s005

(ZIP)

S3 Data. HMP taxonomic profiles.

The taxonomic profiles from the Human Microbiome Project data as processed with MetaPhlAn version 2.0_beta1 [24].

https://doi.org/10.1371/journal.pcbi.1005852.s006

(ZIP)

S1 Fig. A relatively informative prior on λ is effective.

The densities of different priors on λ for different ranges of λ values. The shape parameter a determines how quickly the prior density decreases, while the rate parameter b determines how much prior weight is placed on small λ values rather than large λ values.

https://doi.org/10.1371/journal.pcbi.1005852.s007

(TIF)

S2 Fig. Shrinkage increases for smaller λ.

A The shape of the prior on o_jk and o_jj for several values of λ. Smaller λ results in greater shrinkage towards zero. B-C The prior probability in the interval (−0.001,0.001) for each off-diagonal element o_jk| λ∼Laplace(λ) across small (B) or large (C) values of λ. Small values (<0.1) of λ show the greatest shrinkage, while beyond λ = 1 the shrinkage becomes negligible, as shown by the maximal shrinkage for λ > 0.2 being <0.005.

https://doi.org/10.1371/journal.pcbi.1005852.s008

(TIF)
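The prior probabilities described in the S2 Fig caption can be reproduced in closed form. The sketch below assumes that λ is the scale parameter of a Laplace distribution centered at zero, which is consistent with smaller λ giving greater shrinkage; the function name is hypothetical.

```python
import math


def laplace_mass_near_zero(lam, eps=0.001):
    """P(-eps < o_jk < eps) for o_jk ~ Laplace(location 0, scale lam)."""
    return 1.0 - math.exp(-eps / lam)


for lam in (0.01, 0.1, 0.2, 1.0):
    print(lam, round(laplace_mass_near_zero(lam), 5))
```

Under this assumption the probability at λ = 0.2 is just below 0.005 and decreases further as λ grows.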

S3 Fig. Prior distributions for test cases.

The prior distributions for the test cases used a prior on m that was very uninformative, being centered at 0 and with a large variance (A). The prior on λ put most prior weight on λ values less than one and had narrow tails to encourage shrinkage of the correlation estimates (B).

https://doi.org/10.1371/journal.pcbi.1005852.s009

(TIF)

S4 Fig. Prior distributions for realistic simulated data.

We used a prior for m that gave reasonable behavior for the sum of the unobserved count medians (A). The prior on λ put most prior weight on λ values less than one and had narrow tails to encourage shrinkage of the correlation estimates (B). (See also S14 Fig).

https://doi.org/10.1371/journal.pcbi.1005852.s010

(TIF)

S5 Fig. Additional results for difficult scenarios.

The estimates and significance of several methods on the four scenarios (columns): simple, with no true correlations and no negative dominant spurious correlation; high spurious, with no true correlations and a negative dominant spurious correlation; retained spike, with several true correlations and no negative dominant spurious correlation; and reversed spike, with several true correlations and a negative dominant spurious correlation. The top row is data-derived, with the bottom triangle indicating the true log-basis correlation and the top triangle the compositional correlation calculated using the 1,000 samples from the data. BAnOCC evaluates significance using 95% credible intervals. CCLasso and SPIEC-EASI (MB) are significant if they are non-zero. SPIEC-EASI (MB) colors indicate the sign rather than the magnitude of the estimated correlations as the estimates are not possible to compute. SparCC evaluates significance using a bootstrap-based method. All the methods do poorly at detecting and correctly estimating the negative correlation between features 1 and 5 in the reversed spike scenario, and instead tend to falsely detect several positive correlations.

https://doi.org/10.1371/journal.pcbi.1005852.s011

(TIF)

S6 Fig. AUC boxplots of method performance on “realistic” simulated datasets.

For each given correlation strength and template dataset, AUCs were calculated for each of 105 simulated datasets comprising SparseDOSSA-derived compositions with 100 samples modeled on a low-diversity dataset with 14 features. The ROCs used to measure the AUCs are based on p-values (Spearman correlation, simplicial variation, SparCC), credible intervals (BAnOCC), correlation estimate (CCLasso) or stability score (SPIEC-EASI). Thus each boxplot consists of 105 points. Each of the 105 AUCs is measured over seven true correlations, and all of the methods do better than expected by chance (red line), although BAnOCC has overall the highest average AUC.

https://doi.org/10.1371/journal.pcbi.1005852.s012

(TIF)

S7 Fig. Average ROC curves of method performance on “realistic” simulated datasets.

For a given correlation strength, each ROC is calculated over the aggregation of all 735 true associations in 105 simulated datasets comprising SparseDOSSA-derived compositions with 100 samples modeled on a low-diversity dataset with 14 features. The cutoffs used are based on p-values (Spearman correlation, simplicial variation, SparCC), credible interval width (BAnOCC), correlation estimate (CCLasso) or stability score (SPIEC-EASI).

https://doi.org/10.1371/journal.pcbi.1005852.s013

(TIF)

S8 Fig. Type I error rates and power in large datasets.

Results on simulated data comprising 100 SparseDOSSA-derived compositions modeled on a high-diversity dataset with 89 features. A Type I error rates are controlled across all correlation values only by BAnOCC. B Power is comparable between BAnOCC and other modern methods across spiked correlation strengths, with BAnOCC and others correctly controlling error rates and only BAnOCC providing full inference and probability distributions on the resulting microbial interaction networks. * = rejection of H₀ based on estimation; ** = rejection of H₀ based on inference from credible intervals; all others, rejection of H₀ based on inference from p-values. (See S16 Fig for the priors used.)

https://doi.org/10.1371/journal.pcbi.1005852.s014

(TIF)

S9 Fig. Power across multiple sample sizes and numbers of features.

Power on simulated data comprising SparseDOSSA-derived compositions modeled on a low-diversity dataset with 14 features (small template) or a high-diversity dataset with 89 features (large template). See S2 Text for simulation details. The rows correspond to the number of samples (50, 100, or 150) simulated. BAnOCC controls the type I error rate in all scenarios, and the type I error rate behavior for most methods does not change with increasing sample size. * = rejection of H₀ based on estimation; ** = rejection of H₀ based on inference from credible intervals; all others, rejection of H₀ based on inference from p-values.

https://doi.org/10.1371/journal.pcbi.1005852.s015

(TIF)

S10 Fig. Type I error rates across multiple sample sizes and numbers of features.

Type I error rates on simulated data comprising SparseDOSSA-derived compositions modeled on a low-diversity dataset with 14 features (small template) or a high-diversity dataset with 89 features (large template). See S2 Text for simulation details. The rows correspond to the number of samples (50, 100, or 150) simulated. BAnOCC controls the type I error rate in all scenarios, and the type I error rate behavior for most methods does not change with increasing sample size. * = rejection of H₀ based on estimation; ** = rejection of H₀ based on inference from credible intervals; all others, rejection of H₀ based on inference from p-values.

https://doi.org/10.1371/journal.pcbi.1005852.s016

(TIF)

S11 Fig. Prior distributions for the stool body site.

We used a prior for m that gave reasonable behavior for the sum of the unobserved count medians (A). The prior on λ put most prior weight on λ values less than one and had narrow tails to encourage shrinkage of the correlation estimates (B). (See also S14 Fig).

https://doi.org/10.1371/journal.pcbi.1005852.s017

(TIF)

S12 Fig. Prior distributions for the buccal mucosa body site.

We used a prior for m that gave reasonable behavior for the sum of the unobserved count medians (A). The prior on λ put most prior weight on λ values less than one and had narrow tails to encourage shrinkage of the correlation estimates (B). (See also S14 Fig).

https://doi.org/10.1371/journal.pcbi.1005852.s018

(TIF)

S13 Fig. Prior distributions for the posterior fornix body site.

We used a prior for m that gave reasonable behavior for the sum of the unobserved count medians (A). The prior on λ put most prior weight on λ values less than one and had narrow tails to encourage shrinkage of the correlation estimates (B). (See also S14 Fig).

https://doi.org/10.1371/journal.pcbi.1005852.s019

(TIF)

S14 Fig. Implied priors on median unobserved counts.

The implied priors on the median unobserved counts (top panel) and the sum of the median unobserved counts (bottom panel) for the SparseDOSSA simulated data and the body sites from the application. Each distribution is estimated using 100,000 random samples. The mean of m_j was selected such that the sum of the median unobserved counts approximately shared the same average.

https://doi.org/10.1371/journal.pcbi.1005852.s020

(TIF)

S15 Fig. Comparison of inferred networks on HMP data.

The number of edges significant in both methods, neither method, or only one method, stratified by body site and whether the methods use the log-basis precision or correlation matrix. Most edges are concordantly significant (or not) between both methods; few are significant by only one method. Further, most of the edges significant in CCLasso but not BAnOCC are small in magnitude (BAnOCC not sig, CCLasso magnitude < 0.1).

https://doi.org/10.1371/journal.pcbi.1005852.s021

(TIF)

S16 Fig. Prior distributions for large datasets.

For our larger datasets simulated based on a stool dataset with 89 features, we used a prior for m that gave reasonable behavior for the sum of the basis medians (A). The prior on λ put most prior weight on λ values less than one and had narrow tails to encourage shrinkage of the correlation estimates (B). (See also S14 Fig).

https://doi.org/10.1371/journal.pcbi.1005852.s022

(TIF)

S1 Table. BAnOCC stool network.

The significant edges from running BAnOCC on the stool body site with 5,500 warmup iterations and 12,000 total iterations. Edges are ordered by posterior median correlation magnitude. “hpd.95.ci” indicates the highest posterior density 95% credible intervals.

https://doi.org/10.1371/journal.pcbi.1005852.s023

(XLSX)

S2 Table. BAnOCC buccal mucosa network.

The significant edges from running BAnOCC on the buccal mucosa body site with 5,500 warmup iterations and 12,000 total iterations. Edges are ordered by posterior median correlation magnitude. “hpd.95.ci” indicates the highest posterior density 95% credible intervals.

https://doi.org/10.1371/journal.pcbi.1005852.s024

(XLSX)

S3 Table. BAnOCC posterior fornix network.

The significant edges from running BAnOCC on the posterior fornix body site with 1,500 warmup iterations and 5,000 total iterations. Edges are ordered by posterior median correlation magnitude. “hpd.95.ci” indicates the highest posterior density 95% credible intervals.

https://doi.org/10.1371/journal.pcbi.1005852.s025

(XLSX)

References

1. Pearson K. Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philos Trans A Math Phys Eng Sci. 1896;187: 253–318.
2. Spearman C. The proof and measurement of association between two things. The American Journal of Psychology. 1904;15: 72–101.
3. Pearson K. Mathematical contributions to the theory of evolution.–On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc R Soc Lond. 1897;60: 489–498.
4. Chayes FA. On correlation between variables of constant sum. J Geophys Res. 1960;65: 4185–4193.
5. Chayes FA, Kruskal W. Approximate statistical test for correlations between proportions. The Journal of Geology. 1966;74: 692–702.
6. Aitchison J. A new approach to null correlations of proportions. Math Geol. 1981;13: 175–189.
7. Faust K, Sathirapongsasuti F, Izard J, Segata N, Gevers D, Raes J, Huttenhower C. Microbial co-occurrence relationships in the human microbiome. PLoS Comput Biol. 2012;8: e1002606. pmid:22807668
8. Friedman J, Alm EJ. Inferring correlation networks from genomic survey data. PLoS Comput Biol. 2012;8: e1002687. pmid:23028285
9. Fang H, Huang C, Zhao H, Deng M. CCLasso: Correlation inference for compositional data through lasso. Bioinformatics. 2015;31: 3172–3180. pmid:26048598
10. Ban Y, An L, Jiang H. Investigating microbial co-occurrence patterns based on metagenomic compositional data. Bioinformatics. 2015;31: 3322–3329. pmid:26079350
11. Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol. 2015;11: 1–25.
12. Shah RD, Samworth RJ. Variable selection with error control: Another look at stability selection. J R Stat Soc Series B Stat Methodol. 2013;75: 55–80.
13. Wang H. Bayesian graphical lasso models and efficient posterior computation. Bayesian Anal. 2012;7: 867–886.
14. Preston FW. The commonness, and rarity, of species. Ecology. 1948;29: 254–283.
15. Magurran AE, Henderson PA. Explaining the excess of rare species in natural species abundance distributions. Nature. 2003;422: 714–716. pmid:12700760
16. Koren O, Knights D, Gonzalez A, Waldron L, Segata N, Knight R, et al. A guide to enterotypes across the human body: Meta-analysis of microbial community structures in human microbiome datasets. PLoS Comput Biol. 2013;9: 1–16.
17. Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for microbial marker-gene surveys. Nat Meth. 2013;10: 1200–1202.
18. Mitchell TJ, Beauchamp JJ. Bayesian variable selection in linear regression. J Am Stat Assoc. 1988;83: 1023–1032.
19. Mallick H, Yi N. Bayesian methods for high dimensional linear models. J Biom Biostat. 2013;1: 005. pmid:24511433
20. Stan Development Team. RStan: the R interface to Stan, version 2.6.0. 2014. Available: http://mc-stan.org/rstan.html
21. Ren B, Schwager E, Tickle TL, Huttenhower C. SparseDOSSA: Sparse data observations for simulating synthetic abundance. 2016. Available: https://huttenhower.sph.harvard.edu/sparsedossa
22. The Human Microbiome Consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486: 207–214. pmid:22699609
23. Aitchison J. A concise guide to compositional data analysis. 2nd compositional data analysis workshop. 2003.
24. Truong DT, Franzosa EA, Tickle TL, Scholz M, Weingart G, Pasolli E, et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Meth. 2015;12: 902–903.
25. Rajilić-Stojanović M, de Vos WM. The first 1000 cultured species of the human gastrointestinal microbiota. FEMS Microbiol Rev. 2014;38: 996–1047. pmid:24861948
26. Barberán A, Bates ST, Casamayor EO, Fierer N. Using network analysis to explore co-occurrence patterns in soil microbial communities. ISME J. 2012;6: 343–351. pmid:21900968
27. Kolenbrander PE, Palmer RJ Jr, Periasamy S, Jakubovics NS. Oral multispecies biofilm development and the key role of cell–cell distance. Nat Rev Micro. 2010;8: 471–480.
28. Gajer P, Brotman RM, Bai G, Sakamoto J, Schütte UME, Zhong X, et al. Temporal dynamics of the human vaginal microbiota. Sci Transl Med. 2012;4: 132ra52. pmid:22553250
29. Rasko DA, Rosovitz MJ, Myers GSA, Mongodin EF, Fricke WF, Gajer P, et al. The pangenome structure of Escherichia coli: Comparative genomic analysis of E. coli commensal and pathogenic isolates. J Bacteriol. 2008;190: 6881–6893. pmid:18676672
30. Hogg JS, Hu FZ, Janto B, Boissy R, Hayes J, Keefe R, et al. Characterization and modeling of the Haemophilus influenzae core and supragenomes based on the complete genomic sequences of Rd and 12 clinical nontypeable strains. Genome Biol. 2007;8: R103. pmid:17550610
31. Arthur JC, Perez-Chanona E, Mühlbauer M, Tomkovich S, Uronis JM, Fan T-J, et al. Intestinal inflammation targets cancer-inducing activity of the microbiota. Science. 2012;338: 120–123. pmid:22903521
32. Li Q, Lin N. The Bayesian elastic net. Bayesian Anal. 2010;5: 151–170.
33. Carvalho CM, Polson NG, Scott JG. The horseshoe estimator for sparse signals. Biometrika. 2010;97: 465–480.
34. Dawid AP, Lauritzen SL. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics. 1993;21: 1272–1317.
35. Wong F, Carter CK, Kohn R. Efficient estimation of covariance selection models. Biometrika. 2003;90: 809–830.


