Posted by pschloss on 23 Nov 2010 at 20:00 GMT
Pyrosequencing of PCR-amplified fragments that target variable regions within the 16S rRNA gene has quickly become a powerful method for analyzing the membership and structure of microbial communities. This approach has revealed and introduced questions that were not fully appreciated by those carrying out traditional Sanger sequencing-based methods. These include the effects of alignment quality, the best method of calculating pairwise genetic distances for 16S rRNA genes, whether it is appropriate to filter variable regions, and how the choice of variable region relates to the genetic diversity observed in full-length sequences. I used a diverse collection of 13,501 high-quality full-length sequences to assess each of these questions. First, alignment quality had a significant impact on distance values and downstream analyses. Specifically, the greengenes alignment, which does a poor job of aligning variable regions, predicted higher genetic diversity, richness, and phylogenetic diversity than the SILVA and RDP-based alignments. Second, the effect of different gap treatments in determining pairwise genetic distances was strongly affected by the variation in sequence length for a region; however, the effect of different calculation methods was subtle when determining the sample's richness or phylogenetic diversity for a region. Third, applying a sequence mask to remove variable positions had a profound impact on genetic distances by muting the observed richness and phylogenetic diversity. Finally, the genetic distances calculated for each of the variable regions did a poor job of correlating with the full-length gene. Thus, while it is tempting to apply traditional cutoff levels derived for full-length sequences to these shorter sequences, it is not advisable. Analysis of β-diversity metrics showed that each of these factors can have a significant impact on the comparison of community membership and structure. Taken together, these results urge caution in the design and interpretation of analyses using pyrosequencing data.
Microbial communities are notoriously difficult to analyze because of their inaccessibility via culturing and high diversity. Next generation sequencing technologies have made it possible to obtain deep sampling coverage of the 16S rRNA gene; however, interpretation of the resulting data is complicated by the inability to relate sequences from variable regions within the gene to the full-length gene and ultimately, the parent genome. Here, I present a comprehensive analysis quantifying the effects of varying sequence alignment quality, pairwise distances calculation methods, sequence filtering, and regions within the 16S rRNA gene on downstream analysis using OTU- and phylogeny-based methods. This analysis indicates that each factor can have a significant effect on descriptions of α- and β-diversity. Because it is not possible to relate pyrotags to full-length 16S rRNA gene sequences directly, I encourage scientists to view pyrotags as markers within a microbiome in an analogous fashion to how geneticists view single nucleotide polymorphisms as markers within genomes.
Citation: Schloss PD (2010) The Effects of Alignment Quality, Distance Calculation Method, Sequence Filtering, and Region on the Analysis of 16S rRNA Gene-Based Studies. PLoS Comput Biol 6(7): e1000844. doi:10.1371/journal.pcbi.1000844
Editor: Jonathan A. Eisen, University of California Davis, United States of America
Received: February 25, 2010; Accepted: June 1, 2010; Published: July 8, 2010
Copyright: © 2010 Patrick D. Schloss. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This project was funded by the School of Medicine at the University of Michigan and a grant from the National Science Foundation (award #0743432). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The author has declared that no competing interests exist.
The recent advent of massively-parallelized pyrosequencing platforms has enabled a renaissance in the field of microbial ecology , . Pyrosequencing has engendered much enthusiasm since it is now possible to obtain nearly 100-times as many sequences by pyrosequencing for the same cost as using traditional Sanger sequencing technology. Although pyrosequencing is capable of generating 105–106 sequences per run, the sequences are between 100 and 400 bp in length. This method has become widely used among microbial ecologists to sequence PCR amplicons from variable regions within the ca. 1,500-bp 16S rRNA gene.
These massive datasets have been analyzed through the generation of phylogenetic trees [e.g. 3], assignment of sequences to operational taxonomic units (OTUs) for based on distance thresholds [e.g. 4], and classification of sequences to phylogenentic bins based on similarity to reference sequences [e.g. 5]. Each approach has received some level of evaluation using pyrotag sequencing. Liu et al.  asserted that phylogenies generated using pyrotags were as good as full-length sequences based on similarity of UniFrac test statistics. Several studies have evaluated various regions and methods for assigning sequences to phylotypes –. Finally, a recent study emphasized differences in α-diversity metrics using different regions within the 16S rRNA gene and OTU definitions .
Each of these studies have focused on a limited range of phylogenetic groups found in a particular environment (e.g. soil, mouse cecum, human feces) and have glossed over more fundamental questions related to how alignment quality, methods of calculating pairwise genetic distances, sequence filtering, and region affects downstream analysis and their relationship to full-length sequences. Alignment quality is expected to significantly affect pairwise distances. Investigators have either used reference alignments to align sequences that implicitly incorporate the secondary structure of the 16S rRNA molecule – or they have used methods that do not consider the secondary structure , . Previous results have shown that the manually-curated SILVA reference alignment provides superior complementary base-pairing within the secondary structure compared to the greengenes alignment, which appears haphazard; the RDP alignment does not align the variable regions . Considering the focus of these studies is on the variable region, there is the added complication that these areas are difficult to align accurately. To overcome limitations in alignment of variable regions, many studies have employed the use of masks to filter the troublesome regions [e.g. 3]. Yet, these filters remove a considerable amount of information from already information-sparse data (Table 1). The actual method of calculating distances is also typically taken for granted. Practically every 16S rRNA survey has made use of substitution models that assume that an alignment gap represents missing data instead of a mutation [e.g. 17]. The decision to use such a model seems motivated more by a sense of phylogenetic guilt than by biology. It is also unknown how distances calculated between partial sequences predict distances between full-length sequences. To make data analysis more tractable, some have employed heuristics based on correlations between kmer- and sequence-based pairwise distances to select which pairs of sequences to align and group within OTUs . It is unclear how these correlations vary across regions within the 16S rRNA gene or what the level of risk is for falsely ignoring pairs of similar sequences. Finally, most studies make the implicit assumption that distances between partial sequences are not significantly different from those of full-length sequences; however, this is a questionable assumption as it is well-established that the 16S rRNA gene does not evolve uniformly along its length. This is apparent in the choice a 3% distance cutoff, which is used as a proxy species definition for full-length sequences, to define species using sequences from variable regions [e.g. 2], . Each of these factors is expected to have a significant effect on the analysis, interpretation, and generalizability of 16S rRNA gene surveys.
Table 1. Summary of the 13 regions within the 16S rRNA gene that were used in this study.doi:10.1371/journal.pcbi.1000844.t001
Here, I used a collection of full-length 16S rRNA gene sequences representing 43 bacterial phyla to quantify how alignment quality, distance calculation methods, masking, and region within the 16S rRNA gene affect out ability to assess α- and β-diversity. The results of these analyses urge greater caution in how surveys are designed and interpreted.
For each of the 13 regions I used various alignment methods to calculate 91,131,750 pairwise distances assuming that a series of consecutive gaps represented one insertion or deletion. The SILVA, greengenes, and RDP alignments represent a gradation in the level of attention given to aligning the variable regions and are each guided by the secondary structure of the 16S rRNA gene. In contrast, the MUSCLE and pairwise alignments are attempts to optimize the alignment between sequences based on a limited number of parameters that are set a priori. To compare the pairwise distances calculated for the same pairs of sequences across alignments, I calculated the regression coefficients describing the relationship between the distances for the greengenes, RDP, MUSCLE, and pairwise alignments and the SILVA alignment for each region (Table 2). Distance calculations for this analysis assumed that consecutive gap positions were the product of a single insertion or deletion mutation (i.e. one gap). With the exception of the V3 and V4 regions, the RDP alignment for each of the regions predicted greater genetic diversity than that of the SILVA alignment. Interestingly, the greengenes alignment, which does a poor job of aligning the variable regions, predicted between 9 and 33% more genetic diversity for each region than the RDP alignment, which does not attempt to align the variable regions. Visual inspection of the greengenes alignment suggests that in many instances the variable region alignments are somewhat random . I observed that the MUSCLE-generated alignments described considerably greater genetic diversity than any of the other methods for the V3, V6, and V9 regions (Table 2); however, the use of pairwise alignments yielded smaller distances than those calculated with the other alignment methods because pairwise alignment methods optimize the alignment without the constraint of preserving positional homology across multiple sequences (Table 2). Perhaps most worrisome is the observation that with the exception of the distances calculated from pairwise alignments, regressions of the other alignment methods to the SILVA-based alignment typically did a poor job of accounting for the variation in the distances (Table 2). These data make it clear that variation in alignment quality can have a significant impact on the genetic diversity that is calculated between the same pairs of sequences.
Table 2. Slope coefficients and R2 values for the comparison of one gap distances calculated for SINA-aligned sequences extracted from different regions within the 16S rRNA gene sequence to one gap distances calculated using sequences aligned by different methods.doi:10.1371/journal.pcbi.1000844.t002
Considering the poor correlation between the distances generated from the five alignment methods, it was necessary to determine the effect of this variation on the ability to accurately describe and compare communities. As expected based on the genetic distance analysis, the number of OTUs observed using the greengenes alignment was routinely higher than that observed using the other alignment methods and the number of OTUs observed using the pairwise alignment method was routinely the lowest (Fig. 1). Inspection of these lineage through time plots identified a stair-like appearance for many of the regions. This was due to the loss of information as sequence length decreased. The most extreme example of this phenomenon was for the V6 region that had an average sequence length of 60 bp. Each difference between a pair of V6 sequences changed the distance by approximately 0.0167 units, which is the step-length observed for the V6 data in Fig. 1. When the phylogenetic diversity of the datasets was calculated, the greengenes aligned sequences had the highest phylogenetic diversity and the pairwise aligned sequences had the lowest (Fig. 2). One limitation of the phylogenetic diversity metric is that it is difficult to interpret the statistic and so it is unclear how biologically meaningful the level of variation observed is in Fig. 2.
Figure 1. The number of OTUs observed as a function of genetic distance for various regions within the 16S rRNA gene when using different sequence alignments.doi:10.1371/journal.pcbi.1000844.g001
Figure 2. The phylogenetic diversity observed for different regions within the 16S rRNA gene when using different alignments.
Phylogenetic diversity was measured by calculating the total branch length for a phylogenetic tree.doi:10.1371/journal.pcbi.1000844.g002
To describe β-diversity, I used two OTU-based metrics (Figs. 3 and 4) and two phylogenetic-based metrics (Fig. 5) to measure the sensitivity of the metrics to alignment quality. Sequences were partitioned so that they would represent two samplings of communities whose Jaccard similarity index was 0.80, but whose Morisita-Horn similarity index was 0.60 with a cutoff of 0.05 when defining OTUs with full-length sequences. Because the sampling of the two simulated communities was limited (ca. 6,750 sequences per community), the Jaccard and unweighted UniFrac statistics did not equal the expected values. Within this simulation framework, the effect of alignment was generally highly statistically significant across metrics of β-diversity (p≪0.001); however it is unclear how biologically meaningful the observed differences were.
Figure 3. The Jaccard coefficient calculated between two mock communities (described in Materials & Methods) for different OTU definitions and alignments.
Each bar represents the average coefficient value for 100 randomized partitionings of the data. Within the same OTU cutoff, alignment strategies with the same symbol and regions with the same letter were not significantly different from each other.doi:10.1371/journal.pcbi.1000844.g003
Figure 4. The Morisita-Horn coefficient calculated between two mock communities (described in Materials & Methods) using different OTU definitions and alignments.
Each bar represents the average coefficient value for 100 randomized partitionings of the data. Within the same OTU cutoff, alignment strategies with the same symbol and regions with the same letter were not significantly different from each other.doi:10.1371/journal.pcbi.1000844.g004
Figure 5. Unweighted and weighted UniFrac similarity values calculated between two mock communities (described in Materials & Methods) using different alignments.
Each bar represents the average coefficient value for 100 randomized partitionings of the data. Within the same UniFrac approach, alignment strategies with the same symbol and regions with the same letter were not significantly different from each other.doi:10.1371/journal.pcbi.1000844.g005
Using the same SILVA-aligned sequences that I analyzed above, I investigated the effect of different distance calculation methods on downstream analyses. Specifically, I considered the one gap calculator (i.e. a gap of any length between two sequences represents a single mutation) and each gap (i.e. gaps length n, represent n mutations) and ignore gap calculators (i.e. gapped characters are not considered in calculating a distance; Table 3). The slope of lines forced through the origin indicated that the each gap calculator calculated between 0 (V4) and 9% (V3) more genetic diversity than the one gap calculator. With the exception of the V3 region (69%), the regression between the each gap and one gap calculators accounted for more than 87% of the variation in the distances. The differences in the explanatory power of the regression were a function of frequency of gaps longer than 1 nucleotide. The ignore gap calculator calculated between 2 (V4) and 7% (V9) less genetic diversity than the one gap calculator. The regression between the ignore gap and one gap calculators accounted for more than 94% of the variation in the data. Until there is a more well-developed theoretical basis for selecting a method for treating gaps in sequence alignments, these results suggest that treating gaps of any length as a single mutation is a middle ground between ignoring them and treating each of them as a separate evolutionary event.
Table 3. Slope and R2 values for the regression of one gap pairwise distances onto each gap, ignore gap, Lane mask-filtered one gap, and kmer distances using SINA-aligned sequences over the same region.doi:10.1371/journal.pcbi.1000844.t003
Pairwise kmer distances were much larger than the alignment-based calculators and their regression onto the one gap calculated distances accounted for between 83 and 97% of the variation observed between the distances. In order to have no risk of falsely ignoring true one gap pairwise distances smaller than 0.10, it was necessary to keep kmer distances smaller than 0.45 (V19) to 0.73 (V6). This would result in needing to calculate between 3.3- and 9.1-fold more distances than would be needed by alignment-based methods.
Lacking a theoretical basis for treating gaps as a single evolutionary event, I was curious how much measures of α- and β-diversity are affected by the choice of a distance calculator. I used an OTU-based approach to determine the effect of distance calculation methods on the richness of OTUs within the dataset (Fig. 6) and a phylogeny-based approach using total branch length to measure phylogenetic diversity (Fig. 7). As would be predicted, the number of observed OTUs at any genetic distance was greatest with the each gap and least with the ignore gap calculators; the one gap and each gap calculators generated comparable numbers of OTUs. When I analyzed the effect of region and distance calculation method on the phylogenetic diversity of the datasets, there were qualitative trends between methods and regions that could have been predicted from the regression analysis in Table 2 (Fig. 7). These analyses suggest that the difference observed in α-diversity when using either the one gap or each gap calculator is unlikely to be biologically meaningful.
Figure 6. The number of OTUs observed as a function of genetic distance for various regions within the 16S rRNA gene when using different methods of calculating distances and masking sequences.doi:10.1371/journal.pcbi.1000844.g006
Figure 7. The phylogenetic diversity observed for different regions within the 16S rRNA gene when using different methods of calculating distances and masking sequences.
Phylogenetic diversity was measured by calculating the total branch length for a phylogenetic tree.doi:10.1371/journal.pcbi.1000844.g007
I next investigated what effect each calculator method had on two OTU-based (Figs. 8 and 9) and two phylogeny-based β-diversity measures (Fig. 10). For the OTU-based metrics, ignoring gaps resulted in an over-estimate of the similarity between the two communities and counting each gap resulted in an under-estimate. Increasing and decreasing the cutoff used to define the OTUs had a parallel effect on the Jaccard and Morisita-Horn indices (Figs. 8 and 9). These results occurred because ignoring gaps and increasing the threshold each dampen the differences between sequences and pull more sequences into an OTU so that more OTUs are likely to be shared; the same phenomenon was observed when sequences were filtered using the Lane mask (see below). Penalizing each gap or making the OTU definition more stringent had the opposite effect. The calculated Jaccard coefficients were not significantly different between the one gap and each gap distance calculation methods when using the 0.03 and 0.05 OTU cutoffs (Fig. 8); all four distance calculation methods yielded statistically significant differences in Morisita-Horn coefficients, regardless of the OTU cutoff. For the phylogeny-based methods, the observed differences between each of the distance calculation methods were statistically significant (Fig 10). Although the differences between distance calculation methods were highly statistically significant (p≪0.001), it is unclear how biologically meaningful the differences were.
Figure 8. Jaccard similarity values calculated between two mock communities (described in Materials & Methods) for different OTU definitions, methods of calculating distances, and masking sequences.
Each bar represents the average coefficient value for 100 randomized partitionings of the data. Within the same OTU cutoff, regions with the same letter were not significantly different from each other; for each OTU cutoff all distance calculation methods were significantly different from each other.doi:10.1371/journal.pcbi.1000844.g008
Figure 9. The Morisita-Horn coefficient calculated between two mock communities (described in Materials & Methods) using different OTU definitions, methods of calculating distances, and masking sequences.
Each bar represents the average coefficient value for 100 randomized partitionings of the data. Within the same OTU cutoff, distance calculation methods with the same symbol and regions with the same letter were not significantly different from each other.doi:10.1371/journal.pcbi.1000844.g009
Figure 10. Unweighted and weighted UniFrac similarity values calculated between two mock communities (described in Materials & Methods) using different methods of calculating distances and masking sequences.
Each bar represents the average coefficient value for 100 randomized partitionings of the data. Within the same UniFrac method, regions with the same letter were not significantly different from each other; for both UniFrac methods the distance calculation methods were all significantly different from each other.doi:10.1371/journal.pcbi.1000844.g010
To circumvent alignment quality problems, the Lane mask has been used to filter variable regions from 16S rRNA genes. Results of analyses using filtered sequences aligned by any method or when distances were calculated by any method did not vary to a meaningful degree. Comparison of distances calculated using filtered sequences to those calculated using unfiltered sequences showed that filtering significantly reduced the genetic diversity observed between sequences (Table 3). With the exception of the V4 and V6 regions, masking removed between 15 and 45% of the genetic diversity. The V4 region is largely unaffected by the Lane mask and the average length of V6 sequences following the Lane mask treatment was only 27 bp, which made the resulting pairwise distances of dubious value (Table 1). As would be expected, the number of OTUs and phylogenetic diversity observed using Lane mask-filtered sequences was significantly lower than those calculated with the unfiltered sequences. For the four β-diversity measures, when the Lane mask-filtered sequences were analyzed, the communities appeared more similar than for non-filtered SILVA-aligned sequences (Figs. 8–10). One explanation for this observation is that because filtering makes sequences more similar to each other, it also makes communities appear more similar to each other. Although useful for broad-scale phylogenetic analysis at the level of a kingdom or phylum, filters remove the sequence information necessary to differentiate populations within a community. Ultimately, application of such filters is troublesome because it mutes the signals that differentiate communities.
I compared the one gap distances calculated for each of the 12 regions from each alignment to the one gap distances calculated from the full-length SILVA alignments (Table 4). The regression of pairwise distances calculated from a sub-region onto distances calculated from full-length sequences was rarely near 1.00. The most extreme case was the V6 region for which distances were nearly 3-fold higher than distances calculated using full-length sequences. Conversely, sequences from the V9 region were 33% less diverse than their full-length counterparts. In general, genetic diversity decreased along the length of the 16S rRNA gene. Although one could use these regression coefficients to relate data collected from one region to that from full-length sequences, the ability of the regression to explain the variation observed between sub-region and full-length sequences was quite poor. As expected, longer regions did the best job of relating the variation between sub-regions and full-length sequences. For example, when using the SILVA alignments, the regression of the V14, V35, and V69 distances onto the full-length distances accounted for 87, 77, and 77% of the variation in distances. Shorter regions such as the V3, V6, and V9 accounted for 26, 36, and 46% of the variation (Table 4). This analysis revealed that all sub-regions are limited in their capacity to serve as surrogates for full-length 16S rRNA gene sequences.
Table 4. Regression coefficients and R2 values for the comparison of one gap distances calculated for different regions within the 16S rRNA gene sequence and aligned by different methods to the one gap distances calculated using SINA aligned full-length sequences.doi:10.1371/journal.pcbi.1000844.t004
The distance-based analysis clearly showed significant differences between distances calculated from sub-regions and full-length sequences. The OTU-based analysis in Fig. 1 demonstrates that there was a clear difference in the number of OTUs observed across regions for a given genetic distance as well as the level of curvature observe observed in the lineage-through-time plots (Figs. 1 and 6). In the phylogenetic-based analysis those regions that described more genetic diversity than the full-length sequences had greater phylogenetic diversity than the phylogenetic diversity calculated for the full-length sequences whereas the regions that described less genetic diversity yielded greater phylogenetic diversity (Figs. 2 and 7).
Using pyrotag data introduces several complexities to β-diversity analyses. Moving across regions, but using the same OTU definition could lead one to overestimate community similarity. For example, the average Morisita-Horn similarity for full-length SILVA-aligned sequences with one gap distances was 0.56. Using similarly treated sequences from the V12, V13, V14, and V23 regions I calculated Morisita-Horn values between 0.57 and 0.60; however those from the other 8 regions yielded values between 0.64 (V2) and 0.79 (V9). For a single region, changing the OTU cutoff also had a significant effect on the Morisita-Horn index. For instance, full-length SILVA-aligned sequences yielded 0.52, 0.56, and 0.86 for cutoffs of 0.03, 0.05, and 0.10. This spread in Morisita-Horn values between the 0.03 and 0.10 OTU cutoffs (0.34) was the largest of any region. The narrowest spread was observed for the V6 region (0.06). In contrast to the Morisita-Horn values, there was little variation in the unweighted or weighted UniFrac statistic when comparing sequences analyzed by the same alignment and distance calculation method. With the exception of the V6 region (0.33), the average unweighted UniFrac values varied between 0.24 (V13, V14, V19) and 0.30 (V9) and with the exception of the V12 region (0.69), the average weighted UniFrac values varied between 0.80 (V13) and 0.87 (V9); the value for the full-length sequence was 0.82. Similar to the α-diversity measure of phylogenetic diversity, an added complication of phylogeny-based methods is the complexity of interpreting the proportion of branch length that is shared between or unique to two communities and how such proportions relate to classical β-diversity measures. Thus, it is difficult to interpret the biological significance of such variation. Regardless, the results of the OTU- and phylogeny-based analyses demonstrate that caution must be taken in extrapolating results from one region to another.
The ability to define OTUs and reconstruct phylogenies allows an investigator to approach their problem using the data as they present themselves without being confined to an a priori taxonomy. Regardless, the analysis I have presented indicates that comparing results obtained by sequencing one region of the 16S rRNA gene can not be easily compared to those obtained using full-length sequences. Ultimately, the fact that the 16S rRNA gene does not evolve uniformly across its length complicates its analysis. Technical limitations require investigators to select a region based on the availability of conserved PCR primers, fragment length, and the ability to generate high quality sequence. Analytical limitations require investigators to select a region based on the availability of database sequences for that region, the ability to accurately classify sequences, and the level of genetic diversity found in the region. Until there is a standardized approach, individual investigators will continue to select different regions for their analysis. Studies such as this are necessary to inform investigators about the strengths and weaknesses of the various regions within the 16S rRNA gene. Based on this analysis, it is clear that regardless of the region, longer reads will improve one's ability to relate their analysis to full-length sequences. As sequence lengths increase to the point that pyrosequencing full-length 16S rRNA genes is possible, this discussion will be unnecessary. Ultimately, all pyrotag regions represent a marker of a marker of genomic diversity. Even if full-length 16S rRNA gene sequencing is possible, it is still just a marker of genomic diversity. Correlations between the complete genome sequence and full-length 16S rRNA gene sequences are probably just as poor as correlations between full-length 16S rRNA gene sequences and their sub-regions [e.g. 18]. Although one may endeavor to characterize and compare the composition of multiple communities, any cutoffs that are employed are at best empirical and hopefully have some biological meaning.
I have shown that alignment quality has a significant impact on downstream data analysis. Because the 16S rRNA gene sequence follows a well-determined secondary structure, it is possible to objectively state that one alignment is better than another. Furthermore, pairwise and multiple sequence alignments that ignore the secondary structure are unadvisable on theoretical grounds. Such methods are also unadvisable on technical grounds as the time and memory required to complete them typically scales in excess of the number of sequences squared; the time required to perform a profile-based alignment scales linearly with the number of sequences.
A significant factor in the analysis of DNA sequences is the calculation of pairwise distances. The rich literature developed for protein-coding sequences has generated the Jukes-Cantor, Kimura, Hasegawa-Kishino-Yano and other substitution models [reviewed in 19]. Yet these models ignore gapped positions, which I have shown to have a significant impact on downstream analyses. Substitution models for structural RNA molecules such as the 16S rRNA gene are not well developed or widely used , . It is underappreciated that use of short sequence or filtering methods such as the Lane mask reduces the precision and information represented by a distance. For instance, if there are fewer than 200 bases being considered, then it is difficult to place much confidence in an OTU threshold of 0.03 (i.e. 6 differences) when one considers the potential impact of PCR, sequencing, and alignment artifacts. Furthermore, reducing the information content of a 1,500 bp molecule to a 200-bp sequence read will affect the confidence placed in the generation of phylogenetic trees and OTU assignments. Althoguh removing non-informative positions can be helpful for reconstructing broad phylogenies, the α- and β-diversity analyses described here are adversely affected by removing this fine level sequence diversity. These are clearly issues that warrant further attention.
This study has ramifications on how analyses are performed. Since it is clear that the 16S rRNA gene does not evolve uniformly across its length, it is critical that sequences fully overlap before they are compared. For example, consider an analysis that includes sequences from the V2 region and those from the V12 region. The V12 sequences will have higher pairwise distances amongst each other than compared to the V2 region because the V1 region is evolving at a faster rate. Thus, the comparison of short and long sequence reads will add artifacts into the analysis, which will overstate the richness within the community. Although not explored here, it is likely that similar problems will be encountered in analyses where a taxonomy hierarchy is used to assign sequences to bins. Thus it is critical that sequences are trimmed to start and end at the same sequence-based landmarks. Because pyrosequencing does not yield a uniform length sequence read, this introduces a conundrum of whether to favor fewer long reads or many short reads. Because it is impossible to compare pyrotags to the full-length sequences accurately, it seems appropriate to increase the power of other statistical analyses by sacrificing sequence length in favor of having more sequence reads.
Next generation sequence analysis of 16S rRNA genes offers the first opportunity to replicate analyses, develop more complex experimental designs, and to increase sampling depth and breadth. The results of this study encourage one to see pyrotags as markers within a metagenome and suggest a different way of considering microbial community analysis. Just as single nucleotide polymorphisms (SNPs) have been used as markers of disease in genome-wide association studies (GWAS), which may have no direct effect on a genes phenotype, pyrotags no doubt will serve as a useful analog to SNPs for the nascent field of metagenome-wide association studies (MWAS).
I obtained the SSURef 16S rRNA gene sequence database from the SILVA project (version 98; http://www.arb-silva.de) . From this collection of sequences longer than 1,200 bp, I identified bacterial sequences that had an alignment quality score (ARB database field “align_quality_slv”) of 100 and were not chloroplasts, mitochondria, or suspected of being chimeric. The collection was further screened to remove sequences that had more than 5 ambiguous base positions and did not start by E. coli position 28 or end after position 1491. Of the remaining sequences, 13,501 sequences were unique and shared between the SILVA , greengenes , and RDP sequence collections . I then generated 12 datasets from the full-length sequences using the SILVA, greengenes, and RDP alignments by extracting sub-regions of various lengths (Table 1). These regions were selected because they had already been used in publications or are amenable to the available sequencing platforms. Lane masks were generated by mapping the original mask onto the E. coli reference sequence and then it was applied to each of the three reference alignments . In addition to the three reference alignments I generated pairwise alignments between all pairs of sequences using the Needleman-Wunsch algorithm  and multiple sequence alignments using MUSCLE with two iterations (maxiters = 2) and the diags option .
I implemented three sequence-based methods for calculating pairwise distances and a kmer-based distance metric. The first sequence-based method ignored any site that contained a gap; this method is implemented in the commonly used DNADIST program from the PHYLIP package . The second sequence-based method counted gaps as a fifth character so that any comparison between a gap and a base was penalized as a mismatch; comparisons between two gaps were ignored. This approach asserts that every gap represents a distinct mutation. The third sequence-based method calculated distances by only penalizing a string of gaps as one mismatch . This approach asserts that a gap, of any length, represents a single mutation. Distances were not corrected for multiple substitutions to simplify analysis of the data. Furthermore, some distances were so large that when they were corrected, they yielded undefined values. Distances were calculated as implemented in the mothur software package with precision to 0.0001 . Finally, kmer-based distances were calculated between pairs of unaligned sequences based on their 7-base kmer profiles .
Pairwise distances were compared using a custom C++-coded program that calculated the linear regression coefficient using the origin as the intercept and the Pearson product-moment correlation coefficient . Because several of the datasets did not demonstrate a linear correlation with the V19 region when the V19 pairwise distances were larger than 0.10, all regression and correlation coefficients are presented for V19 distances smaller than 0.10. Assessments of how much genetic diversity was either gained or lost represent the deviation from a slope of 1.0. The square of the Pearson product-moment correlation coefficient (i.e. R2) was used to quantify the fraction of the variation that was accounted for by the linear regression.
OTU- and phylogeny-based analyses were performed to assess the intra-sample biodiversity. Sequences were assigned to OTUs using the mothur implementation of the furthest-neighbor clustering algorithm ; although parallel analyses using the nearest and average neighbor algorithms yielded different α- and β-diversity values, the overall relationships observed with furthest neighbor algorithm were observed. The observed richness (i.e. the number of OTUs in a sample) of the dataset was calculated using every possible cutoff that the data could describe. Traditional neighbor-joining trees were generated using the clearcut software program and the distance matrices that were used in the OTU-based analyses ; however, the relaxed neighbor-joining algorithm was not used. The phylogenetic diversity of the data was calculated by summing the branch length for the entire tree . Both analyses were replicated 50 times to assess the effects of randomization on α-diversity.
The OTU assignments and neighbor-joining trees created to study α-diversity were used to evaluate the effects of each variable on the ability to calculate β-diversity. Towards this end, I segregated the sequences to create two mock communities that shared 80% of their membership but had different structures. To create the mock communities full-length SILVA-aligned sequences were first assigned to OTUs using a furthest neighbor clustering of one gap distances with a cutoff of 0.05. Second, OTUs were randomly ordered. Third, 10% of the OTUs were assigned exclusively to the first community, another 10% were assigned exclusively to the second community, and the remaining OTUs were shared. For half of the shared OTUs, the probability of a sequence being from the first community was 0.375 and for the other half of the shared OTUs, the probability was 0.625. These probabilities were selected to simulate sampling two communities that had a Jaccard similarity index of 0.80 and Morisita-Horn Index value of 0.60. This process was repeated to create 100 simulated communities. Because the mock communities were not exhaustively sampled, it was unlikely that the measures would actually equal 0.80 and 0.60 for the Jaccard and Morisita-Horn indices. All β-diversity calculations were made using the mothur software package . The same 100 partitions were used to analyze all distance calculation methods, alignments, regions, and β-diversity measures. I analyzed the effects of region and the alignment or distance calculation methods using a two-way analysis of variance. Each factor was highly significant (p≪0.001) and so I used the Tukey's honestly significant difference test for pairwise comparisons. Only those differences, which were non-significant (p>0.05) are indicated in figures. All test were performed within an OTU cutoff or UniFrac method.
Conceived and designed the experiments: PDS. Performed the experiments: PDS. Analyzed the data: PDS. Contributed reagents/materials/analysis tools: PDS. Wrote the paper: PDS.