Conceived and designed the experiments: TJS SJR SWK JL JPOD JLG JAE KSP. Performed the experiments: TJS. Analyzed the data: TJS SJR SWK JL JPOD JLG JAE KSP. Contributed reagents/materials/analysis tools: TJS SJR SWK JL. Wrote the paper: TJS SJR SWK JL JPOD JLG JAE KSP.
The authors have declared that no competing interests exist.
Microbial diversity is typically characterized by clustering ribosomal RNA (SSU-rRNA) sequences into operational taxonomic units (OTUs). Targeted sequencing of environmental SSU-rRNA markers via PCR may fail to detect OTUs due to biases in priming and amplification. Analysis of shotgun sequenced environmental DNA, known as metagenomics, avoids amplification bias but generates fragmentary, non-overlapping sequence reads that cannot be clustered by existing OTU-finding methods. To circumvent these limitations, we developed
Microorganisms comprise the majority of the biodiversity on the planet. Because the overwhelming majority of microbes are not readily cultured in the laboratory, researchers often rely on PCR-based investigations of genomic sequence to characterize microbial diversity. These analyses have dramatically expanded our understanding of biodiversity, but due to methodological biases PCR-based approaches may only reveal part of the microbial biosphere. Shotgun sequencing of environmental DNA, known as metagenomics, avoids the biases associated with targeted amplification of genomic sequence and can provide insight into the diversity hidden from traditional investigations. However, the fragmentary, non-overlapping nature of shotgun sequence data makes it intractable to analyze with existing tools. Here, we present
A central goal of ecology and evolution is to understand the forces that shape biodiversity - the variety of life on Earth. It is becoming increasingly clear that global biodiversity is mostly microbial. It is estimated that there are millions of microbial species on the planet, relatively few of which have been isolated in culture
Biodiversity science has traditionally focused on comparing species richness across space, time and environments. Out of necessity, microbial diversity studies usually examine the richness (i.e. number) of operational taxonomic units (OTUs), where OTUs are sequence similarity based surrogates for microbial taxa, which can be difficult to define. In addition to richness, OTUs have been used to characterize the abundance, range, and distribution of microbes, thereby improving our understanding of both natural ecosystems and human health
The SSU-rRNA sequences for OTU identification are traditionally amplified from a sample via polymerase chain reaction (PCR) using universal primers. Each PCR product is then individually sequenced. One of the biggest drawbacks of this targeted sequencing approach is that it leverages PCR, which has been shown to exhibit sequence-based biases at the level of priming and extension
Because of the fragmentary nature of shotgun sequencing, metagenomic reads frequently exhibit minimal, if any, sequence overlap. PID-based evaluations using metagenomic data are thus restricted to the subset of reads that mutually overlap and can therefore be aligned to one another (e.g.,
We present
Traditionally, OTUs are identified from a PCR-generated targeted sequence library by aligning all pairs of sequences, calculating each pair's PID-based distance, and using this distance to group sequences using agglomerative hierarchical clustering. Due to the fragmentary nature of shotgun metagenomic reads, this traditional approach is limited to the subset of overlapping sequences; non-overlapping reads cannot be directly aligned to one another. Even when reads can be aligned (e.g., to full-length reference sequences), one still cannot calculate PID for sequences that do not overlap. To overcome these limitations, we designed
The general strategy
Computational processes are represented as squares and databases are represented as cylinders in this generalized workflow of
First, probabilistic profiles that encode the evolutionary diversity and secondary structure of the SSU-rRNA sequence from Bacteria and Archaea
We use this high-quality alignment of metagenomic reads and reference sequences to construct a fully resolved phylogenetic tree and hence determine the evolutionary relationships between the reads. Reference sequences are included in this stage of the analysis to guide the phylogenetic assignment of the relatively short metagenomic reads. While the software can be easily extended to incorporate a number of different phylogenetic tools capable of analyzing metagenomic data (e.g.,
To evaluate the performance of
We sought to identify how PD-based clustering compares to commonly employed PID-based clustering methods by applying the two methods to the same set of sequences. Both PID-based clustering and
To statistically evaluate the similarity of cluster composition between each pair of clustering results, we used two summary statistics that together capture the frequency with which sequences are co-clustered in both analyses: true conjunction rate (i.e., the proportion of pairs of sequences derived from the same cluster in the first analysis that also are clustered together in the second analysis) and true disjunction rate (i.e., the proportion of pairs of sequences derived from different clusters in the first analysis that also are not clustered together in the second analysis) (see
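The two summary statistics defined above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the implementation used in the study: the function name `conjunction_disjunction_rates`, the dictionary representation of a clustering, and the toy cluster labels are all hypothetical.

```python
from itertools import combinations


def conjunction_disjunction_rates(first, second):
    """Return (TCR, TDR) of clustering `second` relative to clustering `first`.

    Each argument maps a sequence identifier to a cluster label; both are
    assumed to cover the same set of sequences.
    """
    same_first = same_both = 0   # pairs co-clustered in `first` / in both
    diff_first = diff_both = 0   # pairs separated in `first` / in both
    for a, b in combinations(sorted(first), 2):
        together_first = first[a] == first[b]
        together_second = second[a] == second[b]
        if together_first:
            same_first += 1
            same_both += together_second
        else:
            diff_first += 1
            diff_both += not together_second
    tcr = same_both / same_first if same_first else float("nan")
    tdr = diff_both / diff_first if diff_first else float("nan")
    return tcr, tdr


# Toy labels (hypothetical): the second clustering merges s3 into the s1/s2 cluster.
pid_clusters = {"s1": 0, "s2": 0, "s3": 1, "s4": 1, "s5": 2}
pd_clusters = {"s1": 0, "s2": 0, "s3": 0, "s4": 1, "s5": 2}
tcr, tdr = conjunction_disjunction_rates(pid_clusters, pd_clusters)
print(tcr, tdr)  # 0.5 0.75
```

In the toy case, one of the two pairs that share a cluster in the first clustering stays together in the second (TCR = 0.5), and six of the eight separated pairs stay separated (TDR = 0.75).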
On the other hand, when applying the same clustering threshold to both distance matrices, PID-based clustering produces a higher richness estimate (i.e., total number of OTUs) than PD-based clustering (
We subsequently investigated whether we could maintain the accuracy of PD-based clustering while obtaining richness estimates more similar to PID-based results, which are thought to approximately correspond to the number of distinct microbial taxa in an environmental sample. First, we considered changing the hierarchical clustering algorithm. It has been shown that the choice of nearest-neighbor, average, or furthest-neighbor linkage in hierarchical clustering algorithms results in substantially different estimates of taxonomic richness, with average-linkage clustering performing the best for PID-based approaches
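To illustrate why the linkage definition can change a richness estimate, the following sketch clusters a toy distance matrix at a fixed threshold under each of the three linkage rules. It is a plain-Python illustration, not the clustering code used in the study; the function `cluster`, the linkage names, and the four-sequence matrix are assumptions made for the example.

```python
def cluster(dist, threshold, linkage="average"):
    """Agglomerative clustering of items 0..n-1 from a symmetric distance matrix.

    Clusters are merged closest-first until the closest remaining pair of
    clusters is farther apart than `threshold`.
    """
    combine = {
        "nearest": min,                                   # single linkage
        "furthest": max,                                  # complete linkage
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    clusters = [[i] for i in range(len(dist))]

    def between(a, b):
        return combine([dist[i][j] for i in a for j in b])

    while len(clusters) > 1:
        d, x, y = min(
            (between(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return sorted(sorted(c) for c in clusters)


# Hypothetical distances for four sequences that form a "chain".
dist = [
    [0.00, 0.02, 0.04, 0.06],
    [0.02, 0.00, 0.02, 0.04],
    [0.04, 0.02, 0.00, 0.02],
    [0.06, 0.04, 0.02, 0.00],
]
for rule in ("nearest", "average", "furthest"):
    print(rule, cluster(dist, 0.03, rule))
```

With this toy matrix and a threshold of 0.03, nearest-neighbor linkage chains all four sequences into one OTU, whereas average and furthest-neighbor linkage each return two OTUs.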
PID | PD Full-length Sequence | PD Shotgun Read |
0.03 | 0.03 | 0.15 |
0.06 | 0.06 | 0.17 |
Overall, our results imply that
Next, we looked at how well
To investigate the performance of
Comparing the PD matrices from metagenomic and full-length data sets, we observe a strong correlation between the pairwise distances computed on reads and full-length sequences. For each of the 25 simulated samples, the read and corresponding full-length-sequence distance matrices show a positive and significant correlation (Mantel test, p<0.05;
Next, we compared the OTUs produced from metagenomic and full-length sequences, using
Each read data set was clustered into OTUs at various thresholds and compared to the corresponding full-length data set, which was clustered at several fixed PD thresholds (shown here are full-length sequence cutoffs of 0.01, 0.03, 0.05 and 0.1). For each full-length sequence threshold, the true conjunction and false conjunction rates of the read OTUs were calculated as a function of the read threshold. Solid lines represent the median value of the true and false conjunction rates across simulations. Dashed lines represent the median value of the true and false conjunction rates derived from comparisons of randomly permuted clusters relative to the source sequence clusters.
We then determined whether the performance of
Given this insight into the accuracy with which
To demonstrate the utility of PD-based clustering of metagenomic data, we analyzed the pooled Global Ocean Survey (GOS) metagenomic read library
The GOS project also generated 6,413 full-length SSU-rRNA sequences via targeted sequencing of PCR products from six of the 73 geographical sites surveyed
Rarefaction curves are shown for OTUs from PCR (blue) and metagenomic (red) sequencing libraries. Two different sequence similarity cutoffs are used (solid = 0.03, dashed = 0.15). Curves represent the average number of OTUs per sequence from 100 random draws of subsets of sequences from each data set.
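A rarefaction curve of the kind described in this caption can be approximated by repeatedly subsampling OTU assignments and counting the distinct OTUs observed. The sketch below assumes that each sequence has already been assigned an OTU label and reports the mean number of distinct OTUs at each sampling depth; the function name, the depths, and the toy label list are hypothetical.

```python
import random


def rarefaction_curve(otu_labels, depths, draws=100, seed=0):
    """Mean number of distinct OTUs seen when `depth` sequences are drawn
    without replacement, averaged over `draws` random subsamples."""
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        total = 0
        for _ in range(draws):
            subsample = rng.sample(otu_labels, depth)
            total += len(set(subsample))
        curve.append(total / draws)
    return curve


# Toy library (hypothetical): 100 sequences spread unevenly over four OTUs.
labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 5
depths = [1, 10, 50, 100]
curve = rarefaction_curve(labels, depths)
for depth, mean_otus in zip(depths, curve):
    print(depth, mean_otus)
```

Running the same function on OTU labels obtained at two clustering thresholds, and on each sequencing library separately, would give the four curves that such a figure compares.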
Evaluating the intersection of OTUs identified by the two libraries when they were pooled together and processed by
We compared the sequences from the novel OTUs identified from metagenomic reads to the Greengenes SSU-rRNA sequence database to determine if any other PCR-based study revealed the existence of these taxa
We have developed a novel method that enables comparison of non-overlapping metagenomic SSU-rRNA reads and their assignment into OTUs. This is the first automated procedure that identifies OTUs directly from non-overlapping metagenomic reads, which facilitates the identification of taxa potentially overlooked by targeted sequencing studies and leverages the vast quantities of shotgun sequencing data currently being produced by environmental and microbiome studies. The key innovation allowing us to compare non-overlapping reads is our use of phylogenetic distance (PD) to cluster reads into OTUs in place of PID. Building a phylogenetic tree requires that at least some of the sequences within the input alignment overlap. Thus, we incorporate high-quality, full-length reference sequences into the SSU-rRNA sequence alignment to guide the phylogenetic placement of metagenomic reads. The accuracy of this approach is constrained, at least in part, by the phylogenetic diversity of the reference sequences and the means by which the phylogenetic algorithm processes missing data. For example, it is challenging to assess distances between non-overlapping shotgun reads derived from a similar place in the phylogeny, even via comparison to full-length reference sequences. We determined the robustness of our method by evaluating the OTU assignment accuracy of simulated metagenomic reads relative to their full-length sources, finding that the relative PD between a pair of reads is on average highly consistent with the relative PD between full-length sources. This result indicates that metagenomic reads can be assigned to OTUs with high accuracy by simply scaling the clustering threshold.
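The idea that two non-overlapping reads can still be compared once both are placed on a tree can be illustrated with a toy example. The sketch below assumes that PD between two sequences is taken as the sum of branch lengths along the path connecting their leaves (a patristic distance); the tree, its branch lengths, and the function names are hypothetical and are not drawn from the study.

```python
def distances_to_ancestors(node, parent):
    """Map each ancestor of `node` (and the node itself) to its path length."""
    out = {node: 0.0}
    total = 0.0
    while node in parent:
        up, length = parent[node]
        total += length
        out[up] = total
        node = up
    return out


def patristic_distance(a, b, parent):
    """Sum of branch lengths on the path between leaves `a` and `b`."""
    from_a = distances_to_ancestors(a, parent)
    from_b = distances_to_ancestors(b, parent)
    # With non-negative branch lengths, the most recent common ancestor
    # minimises the summed path length.
    return min(from_a[n] + from_b[n] for n in from_a if n in from_b)


# Toy tree (hypothetical): child -> (parent, branch length).
# read1 and read2 need not overlap each other; each is placed relative to
# the full-length references refA and refB.
parent = {
    "n1": ("root", 0.10),
    "refB": ("root", 0.20),
    "refA": ("n1", 0.02),
    "n2": ("n1", 0.01),
    "read1": ("n2", 0.01),
    "read2": ("n2", 0.02),
}
print(round(patristic_distance("read1", "read2", parent), 6))  # 0.03
print(round(patristic_distance("read1", "refB", parent), 6))   # 0.32
```

A matrix of such pairwise distances can then be passed to a hierarchical clustering routine in place of a PID-based matrix.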
We also tested whether clustering based on PD could accurately recapitulate clustering based on PID for full-length reads where both methods may be applied. Processing 508 full-length reference sequences via both algorithms reveals that PD accurately assigns sequences into OTUs when compared to the PID OTUs. However, this analysis also reveals that PD results in lower richness estimates relative to PID. This phenomenon appears to be due to a difference in the relative distances between sequences. Specifically, the phylogenetic approach appears to shorten the estimated distance between closely related sequences, relative to the PID approach. This is likely due to the fact that the PD approach employs a weighted substitution model when calculating distances, while the PID approach treats all substitutions with equal weight. Thus, while the hierarchical structure of the clusters is generally consistent between the two methods, as revealed by the cluster composition accuracy analysis, sister OTUs in the PID analysis tend to be merged together via the PD approach. For this reason, it may be necessary to take into account this systematic difference in order to compare the diversity results from a PD-based study with a PID-based study.
A similar pattern is observed when the PD-based and PID-based OTUs are compared to OTUs constructed from GenBank taxonomy terms. Specifically, both methods accurately cluster the 508 full-length reference sequences at the species and genus level. Both methods also tend to underestimate the richness, though PID produces an estimate more in line with the taxonomy-guided clusters. Though this analysis serves as a useful benchmark, a more thorough investigation of richness estimation may be warranted in future work for several reasons. First, GenBank taxonomy terms do not necessarily recapitulate the true taxonomic signal or correspond to monophyletic clades. Second, there are known errors in taxonomic assignment and annotation of GenBank sequences
Having demonstrated the accuracy with which sequences, both full-length and shotgun, are clustered into OTUs using PD, we applied
Metagenomic sequencing is an increasingly common means of investigating microbial communities. We expect methods, such as
Metagenomic reads were identified as SSU-rRNA homologs and classified into their appropriate phylogenetic domain in the following manner. First, each read was compared to the full-length SSU-rRNA sequences found in the Bacteria and Archaea
Evolutionary profiles based on stochastic context-free grammars of the SSU-rRNA gene were produced for both the Bacteria and Archaea using
Quality controlled alignments were subjected to phylogenetic analysis via
The 508 full-length RDP Bacterial reference SSU-rRNA sequences served as the pool for simulated data. Simulated reference data sets were generated by randomly sampling without replacement 423 of the RDP sequences. This was done five distinct times to create five different simulated reference data sets, or batches. For each batch, the remaining 85 full-length sequences were appropriated as source sequences. Simulated reads were generated from each batch of source sequences five times, resulting in five simulation sets for each of the five batches (a total of 25 simulation data sets). To account for reads that run off the ends of the SSU-rRNA gene, we also created simulations, with the same settings, in which source sequences were concatenated to a 500-bp poly-N sequence pad at the 5′ and 3′ ends, before generating reads. We additionally created simulations from 1000 full-length Bacterial SSU-rRNA sequences drawn at random from the SILVA database
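The partitioning scheme described above can be sketched as follows. This is an illustration of the bookkeeping only (five batches, 423 reference and 85 source sequences per batch, five simulation sets per batch); it does not simulate reads, and the identifiers, function names, and random seed are hypothetical.

```python
import random


def make_designs(sequence_ids, n_reference=423, n_batches=5,
                 sets_per_batch=5, seed=1):
    """Split the pool into reference and source sequences for each batch and
    enumerate the simulation sets derived from each batch."""
    rng = random.Random(seed)
    designs = []
    for batch in range(n_batches):
        # Sample the reference set without replacement; the rest are sources.
        reference = set(rng.sample(sequence_ids, n_reference))
        sources = [s for s in sequence_ids if s not in reference]
        for replicate in range(sets_per_batch):
            designs.append({
                "batch": batch,
                "replicate": replicate,
                "reference": reference,
                "sources": sources,
            })
    return designs


def pad_source(sequence, pad=500):
    """Add a poly-N pad to both ends so that simulated reads may run off the
    5' and 3' ends of the gene."""
    return "N" * pad + sequence + "N" * pad


ids = [f"rdp_{i:03d}" for i in range(508)]  # placeholder identifiers
designs = make_designs(ids)
print(len(designs))                          # 25
print(len(designs[0]["sources"]))            # 85
print(len(pad_source("ACGT")))               # 1004
```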
For a single simulation set, source sequences were used to simulate an average of 5 reads per source via
Each set of simulated reads and their corresponding full-length source sequences were independently partitioned into OTUs using
and the rate of false disjunctions is given by
Note that
We checked whether the
Given a threshold value, partition reads into OTUs using the
Randomly permute the OTU assignments of the reads.
We expected that this algorithm would have much higher error rates than the
The weighted path difference metric
where
The normalization implies that
Example showing how false conjunction and disjunction rates are calculated. In the example, two samples of reads are possible, one consisting of reads A, B, C, D, and E, and another consisting of reads C, D, F, G, and H. In reality, the set of possible samples and reads would be much larger, but for simplicity, we have chosen this smaller set. For the purposes of this example, we suppose that the probabilities of observing samples I and II are 0.3 and 0.7, respectively. At the top of each panel in grey, the partitions into OTUs based on the full length sequences are shown, while the partitions based on the simulated reads are shown in blue. The conjoined and disjoined columns give the pairs of reads placed in the same OTU and different OTUs, respectively, for the partition based on the full-length sequences. The pairs highlighted in blue are correctly conjoined or disjoined in the sample partitions. The rates of false conjunction for Samples I and II are 3/4 and 1/2, respectively, while rates of false disjunction are 2/6 and 0, respectively. Because the probabilities of the samples are 0.3 and 0.7, the average rate of false conjunction is (0.3)(3/4)+(0.7)(1/2) = 0.575, and the average rate of false disjunction is (0.3)(2/6)+(0.7)(0) = 0.1. The average rates provide a useful characterization of a clustering algorithm under a given sampling scenario.
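The averaging step in this example is a probability-weighted mean of the per-sample rates, which can be checked with exact fractions. The per-sample rates and probabilities below are the ones quoted in the example; the variable names are ours.

```python
from fractions import Fraction

# Per-sample probabilities and error rates as quoted in the example.
samples = [
    {"prob": Fraction(3, 10), "false_conj": Fraction(3, 4), "false_disj": Fraction(2, 6)},
    {"prob": Fraction(7, 10), "false_conj": Fraction(1, 2), "false_disj": Fraction(0)},
]

avg_false_conj = sum(s["prob"] * s["false_conj"] for s in samples)
avg_false_disj = sum(s["prob"] * s["false_disj"] for s in samples)

print(float(avg_false_conj))  # 0.575
print(float(avg_false_disj))  # 0.1
```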
(0.40 MB PDF)
PD-based clustering is accurate relative to PID-based clustering. This graph illustrates the change in the True Conjunction Rate (solid line) and the True Disjunction Rate (dashed line) of full-length sequences clustered using PhylOTU and PD relative to the clustering obtained when the same sequences are clustered via percent identity (shown here is the PID cutoff of 0.03). Average-neighbor hierarchical clustering, the default PhylOTU setting, was used to generate these results.
(0.09 MB PDF)
Distribution of corresponding PD and PID pairwise distances. This figure illustrates the relationship between PD and PID distances (less than 0.2) calculated for all pairs of the 508 reference sequences used in our study. The red line indicates identical distances estimated by PD and PID methods. The blue and green lines identify the clustering threshold distance of 0.03 for PD and PID, respectively. The mass of points above the red line for PD distances less than 0.03 indicates that among those pairs that are closely related as per the PD calculation, the PID method tends to estimate a larger corresponding distance (notably the points above the green line and to the left of the blue line). This observation could account for the difference in the estimated richness between the two methods. Conversely, when estimated PD distances are larger (e.g., distances greater than 0.1), the corresponding PID distance tends to be smaller. These observations are likely the result of how distances are calculated in the two approaches: PD leverages a weighted substitution model that down-weights similar substitutions and corrects for multiple substitutions, while the PID method weights all substitutions equally.
(0.26 MB PDF)
PhylOTU clustering of full-length sources and simulated reads is positively correlated. For each of the 25 RDP reference library-based simulations, we compared the source and simulation PD distance matrices produced by PhylOTU using a Mantel test. All 25 tests reveal a significant (p<0.05) and positive correlation coefficient. The above histogram reveals the distribution of the correlation coefficients identified through this analysis.
(0.09 MB PDF)
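A Mantel test of the kind referenced in this caption compares two distance matrices by correlating their off-diagonal entries and assessing significance by permuting the rows and columns of one matrix together. The following is a minimal plain-Python sketch under stated assumptions (Pearson correlation, a one-sided test for positive correlation, 999 permutations); the toy matrices and function names are hypothetical, and the study may have used a different implementation.

```python
import random
from math import sqrt


def upper(matrix, order):
    """Upper-triangle entries of `matrix` after relabelling items by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def mantel(m1, m2, permutations=999, seed=0):
    """Return (correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(m1)))
    x = upper(m1, identity)
    observed = pearson(x, upper(m2, identity))
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute rows and columns of m2 together
        if pearson(x, upper(m2, order)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy matrices (hypothetical): read distances are a rescaling of source distances.
positions = [0.0, 0.05, 0.12, 0.2, 0.33, 0.5, 0.72, 1.0]
full_length = [[abs(a - b) for b in positions] for a in positions]
reads = [[1.5 * d for d in row] for row in full_length]
r, p = mantel(full_length, reads)
print(r, p)
```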
Derivation of methodological accuracy. Accuracy plots, which capture the change in the true conjunction and true disjunction rates as the simulated shotgun read clustering threshold increases relative to a fixed full-length sequence clustering threshold, were generated for several full-length sequence thresholds. We show here the accuracy plot of the source threshold of 0.03 as an example. The median true conjunction and true disjunction rates are represented by the solid and dashed black lines, respectively. The red line indicates the minimum tolerated accuracy, which we designate to be 80%. The most accurate read threshold is indicated by the solid blue line, which represents the point where the true conjunction rate is controlled at the minimum accuracy and the true disjunction rate is maximized. At least three interpretations of an optimal threshold could be identified from this analysis, contingent upon the application: 1) the threshold where the true conjunction rate (TCR) is fixed at a controlled minimum accuracy and the true disjunction rate (TDR) is maximized, 2) the threshold where the TCR and TDR intersect, and 3) the threshold where the TDR is controlled and the TCR is maximized. The standard hypothesis testing approach is to control the type I error, which corresponds to controlling the TCR in this analysis (
(0.11 MB PDF)
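The first interpretation of an optimal threshold listed in this caption (control the TCR at a minimum accuracy, then maximize the TDR) can be expressed as a simple selection rule. In the sketch below, the function name and the accuracy values are invented for illustration and are not results from the study.

```python
def pick_read_threshold(curve, min_tcr=0.80):
    """Among read thresholds whose TCR meets `min_tcr`, return the one with
    the highest TDR, or None if no threshold qualifies.

    `curve` is a list of (read_threshold, TCR, TDR) tuples.
    """
    eligible = [row for row in curve if row[1] >= min_tcr]
    if not eligible:
        return None
    return max(eligible, key=lambda row: row[2])[0]


# Hypothetical accuracy curve: TCR rises and TDR falls as the read
# clustering threshold is relaxed.
curve = [
    (0.05, 0.40, 0.99),
    (0.10, 0.65, 0.97),
    (0.15, 0.82, 0.93),
    (0.20, 0.90, 0.85),
    (0.25, 0.95, 0.70),
]
best = pick_read_threshold(curve)
print(best)  # 0.15
```

The other two interpretations would replace the selection rule, for example by choosing the threshold at which the TCR and TDR curves cross.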
Relationship between full-length sequence clustering threshold and adjusted shotgun read clustering threshold. The relationship between full-length clustering thresholds and the corresponding shotgun read clustering threshold that maximizes the True Disjunction Rate while controlling the True Conjunction Rate is plotted in black. Specifically, this curve represents a loess smoothing of the most accurate thresholds we identified from the reference library-based simulation analyses across a series of full-length sequence clustering thresholds using the procedure described in
(0.08 MB PDF)
Change in accuracy as the full-length sequence clustering threshold increases. The accuracy of clustering reads via an adjusted threshold (black line) remains high, even at relatively large full-length sequence clustering thresholds. The solid red line represents the minimum tolerated True Conjunction Rate (TCR) of 80%.
(0.07 MB PDF)
Accuracy with which PhylOTU clusters shotgun sequences simulated from the SILVA database relative to a full-length cutoff of 0.03. We used the SILVA-based simulation data to construct accuracy plots as described in
(0.11 MB PDF)
The clustering threshold affects the rate of discovery of unique and non-unique OTUs per sequence. We randomly sampled 1000 sequences from the PCR (blue lines) and shotgun (red lines) sequence libraries and counted the total number of distinct OTUs (solid lines) as well as the number of OTUs unique to each sequence library (dashed lines) identified by the sample across clustering thresholds (100 bootstraps). With regard to the total number of OTUs identified by each library, this analysis reveals that the number of OTUs discovered per sequence declines at similar rates in both libraries as the threshold increases. Conversely, we find that the rate of change of unique OTU discovery is not consistent between libraries: more unique OTUs per sequence are discovered in the PCR-generated sequence library at thresholds below 0.05, while the inverse is true at thresholds greater than or equal to 0.05. Notably, the number of unique OTUs per PCR sequence declines as the threshold increases at a rate similar to the total number of OTUs discovered per PCR sequence. This is not the case with shotgun data, where the slope of the unique OTUs discovered per read is much flatter. This may be the result of the increased phylogenetic diversity discovered in the shotgun library and suggests that PCR sequences tend to contribute less unique phylogenetic branch length than shotgun reads.
(0.09 MB PDF)
Partial alignment of shotgun sequence from uniquely metagenomic OTUs that overlap a universal SSU-rRNA primer site. Of those sequences that cluster into OTUs that are uniquely identified via analysis of shotgun sequence data (clustering threshold of 0.15), 18 overlap a universal SSU-rRNA primer site in the alignment. Here, we show the result of aligning those 18 sequences as well as the 8F and 27F primers to the INFERNAL SSU-rRNA model used in PhylOTU. We find that two sequences contain a shared C->T substitution that differentiates them from all other sequences in the alignment (red column) directly adjacent to the degenerate site in the 27F primer sequence (blue column). Incorporation of a degenerate base at this position in the universal primer sequence may enable more rigorous characterization of those lineages that harbor this C->T transition.
(0.02 MB PDF)
Hierarchical clustering algorithms affect both clustering accuracy and richness estimates. Full-length SSU-rRNA sequences were clustered via both PID and PD approaches using three different linkage definitions in the hierarchical clustering algorithm: nearest-neighbor (nn), average (avg), and furthest-neighbor (fn). Here, we show the results of comparing each of the PD clusters (threshold of 0.03) to each of the PID clusters (threshold of 0.03) using the True Conjunction Rate (TCR) and True Disjunction Rate (TDR) calculations described in the
(0.02 MB DOC)
Both PID and PD clustering accurately recapitulate taxonomy-guided clusters. Full-length reference sequences were clustered based on their GenBank taxonomy at the species level. These same sequences were then clustered using both the PID and PD methods by employing a distance threshold of 0.03 across three hierarchical clustering algorithms: nearest-neighbor (nn), average-linkage (avg), and furthest-neighbor (fn). Each of the PID and PD clusters was compared to the taxonomy-guided clusters using the True Conjunction Rate (TCR) and True Disjunction Rate (TDR) calculations described in the
(0.02 MB DOC)
Accuracy of adjusted shotgun read clustering cutoff relative to full-length clusters when controlling the true conjunction rate and maximizing the true disjunction rate (TDR). Data was obtained from the RDP reference library-based simulations.
(0.02 MB DOC)
Accuracy of adjusted shotgun read clustering cutoff relative to full-length clusters when controlling the true conjunction rate and maximizing the true disjunction rate (TDR). Data was obtained from the SILVA-based simulations.
(0.02 MB DOC)
Taxonomic distribution of OTUs identified via comparison of GOS PCR-generated and shotgun-generated SSU-rRNA sequences. This table documents the frequencies at which major Bacterial taxonomic clades (approximately Bacterial phyla) were represented by sequences clustered into three different sets of OTUs identified by PhylOTU: those unique to PCR-generated SSU-rRNA (PCR), those unique to shotgun sequenced SSU-rRNA (WGS), and those revealed by both sequence libraries. The RDP taxonomy classifier was used to determine the classification of each sequence in question. Major Bacterial clades were excluded from this table if their frequency was not greater than or equal to 0.1% in at least one of the three sets of OTUs.
(0.02 MB DOC)
Alignment quality control filters.
(0.01 MB DOC)
MetaSim run-time settings used during the PhylOTU simulation analysis.
(0.01 MB DOC)
We would like to thank the RDP for providing access to their curated SSU-rRNA multiple sequence alignments and CAMERA for enabling access to the Global Ocean Survey Data. We would also like to thank the authors and developers of the various software packages we've leveraged, especially the BioPerl and R developer communities. Finally, we thank Dr. Aaron Darling, Dr. Dongying Wu and Dr. Rebecca Lamb for their insight and the reviewers of this manuscript for their critical and informative suggestions.