Conceived and designed the experiments: PDS. Performed the experiments: PDS. Analyzed the data: PDS. Contributed reagents/materials/analysis tools: PDS. Wrote the paper: PDS.
The author has declared that no competing interests exist.
Pyrosequencing of PCR-amplified fragments that target variable regions within the 16S rRNA gene has quickly become a powerful method for analyzing the membership and structure of microbial communities. This approach has revealed and introduced questions that were not fully appreciated by those carrying out traditional Sanger sequencing-based methods. These include the effects of alignment quality, the best method of calculating pairwise genetic distances for 16S rRNA genes, whether it is appropriate to filter variable regions, and how the choice of variable region relates to the genetic diversity observed in full-length sequences. I used a diverse collection of 13,501 high-quality full-length sequences to assess each of these questions. First, alignment quality had a significant impact on distance values and downstream analyses. Specifically, the greengenes alignment, which does a poor job of aligning variable regions, predicted higher genetic diversity, richness, and phylogenetic diversity than the SILVA and RDP-based alignments. Second, the effect of different gap treatments in determining pairwise genetic distances was strongly affected by the variation in sequence length for a region; however, the effect of different calculation methods was subtle when determining the sample's richness or phylogenetic diversity for a region. Third, applying a sequence mask to remove variable positions had a profound impact on genetic distances by muting the observed richness and phylogenetic diversity. Finally, the genetic distances calculated for each of the variable regions did a poor job of correlating with the full-length gene. Thus, while it is tempting to apply traditional cutoff levels derived for full-length sequences to these shorter sequences, it is not advisable. Analysis of β-diversity metrics showed that each of these factors can have a significant impact on the comparison of community membership and structure. 
Taken together, these results urge caution in the design and interpretation of analyses using pyrosequencing data.
Microbial communities are notoriously difficult to analyze because of their inaccessibility via culturing and high diversity. Next generation sequencing technologies have made it possible to obtain deep sampling coverage of the 16S rRNA gene; however, interpretation of the resulting data is complicated by the inability to relate sequences from variable regions within the gene to the full-length gene and, ultimately, the parent genome. Here, I present a comprehensive analysis quantifying the effects of varying sequence alignment quality, pairwise distance calculation methods, sequence filtering, and regions within the 16S rRNA gene on downstream analysis using OTU- and phylogeny-based methods. This analysis indicates that each factor can have a significant effect on descriptions of α- and β-diversity. Because it is not possible to relate pyrotags to full-length 16S rRNA gene sequences directly, I encourage scientists to view pyrotags as markers within a microbiome in an analogous fashion to how geneticists view single nucleotide polymorphisms as markers within genomes.
The recent advent of massively-parallelized pyrosequencing platforms has enabled a renaissance in the field of microbial ecology
These massive datasets have been analyzed through the generation of phylogenetic trees
Each of these studies has focused on a limited range of phylogenetic groups found in a particular environment (e.g. soil, mouse cecum, human feces) and has glossed over more fundamental questions related to how alignment quality, methods of calculating pairwise genetic distances, sequence filtering, and region affect downstream analyses and their relationship to full-length sequences. Alignment quality is expected to significantly affect pairwise distances. Investigators have either used reference alignments to align sequences that implicitly incorporate the secondary structure of the 16S rRNA molecule
Region | Platform | Example Ref. | Masking | Average number of bases |
2–1491 | Sanger |  | None | 1454 (1399–1490) |
2–1491 | Sanger |  | Lane | 1255 (1244–1256) |
28–337 | Titanium |  | None | 306 (276–332) |
28–337 | Titanium |  | Lane | 239 (238–239) |
28–514 | Titanium | HMP | None | 480 (428–508) |
28–514 | Titanium | HMP | Lane | 386 (384–386) |
28–784 | Sanger |  | None | 750 (698–779) |
28–784 | Sanger |  | Lane | 656 (653–656) |
100–337 | FLX |  | None | 240 (223–257) |
100–337 | FLX |  | Lane | 198 (197–198) |
100–514 | Titanium |  | None | 415 (378–437) |
100–514 | Titanium |  | Lane | 345 (343–345) |
357–514 | FLX/Illumina |  | None | 158 (135–161) |
357–514 | FLX/Illumina |  | Lane | 128 (127–128) |
357–906 | Titanium | HMP | None | 546 (523–552) |
357–906 | Titanium | HMP | Lane | 507 (504–507) |
578–784 | FLX |  | None | 207 (206–208) |
578–784 | FLX |  | Lane | 207 (206–207) |
986–1045 | FLX/Illumina |  | None | 60 (57–66) |
986–1045 | FLX/Illumina |  | Lane | 27 (27–27) |
986–1491 | Titanium | HMP | None | 507 (489–516) |
986–1491 | Titanium | HMP | Lane | 411 (407–412) |
1100–1491 | Titanium |  | None | 392 (373–403) |
1100–1491 | Titanium |  | Lane | 330 (326–331) |
1300–1491 | FLX |  | None | 192 (182–197) |
1300–1491 | FLX |  | Lane | 170 (146–147) |
Each sub-region was generated from the sequences in the V19 database.
The numbers in the parentheses represent the 95% confidence interval.
Sanger reads are estimated to be up to 800 bp and can be used to sequence the same molecule multiple times.
GS FLX Titanium reads average ∼400 bp (amplicon kit released 11/2009).
GS FLX reads average ∼240 bp.
Illumina reads average ∼200 bp.
The NIH Human Microbiome Project is considering these regions for their cross-sectional studies.
Here, I used a collection of full-length 16S rRNA gene sequences representing 43 bacterial phyla to quantify how alignment quality, distance calculation methods, masking, and region within the 16S rRNA gene affect our ability to assess α- and β-diversity. The results of these analyses urge greater caution in how surveys are designed and interpreted.
For each of the 13 regions I used various alignment methods to calculate 91,131,750 pairwise distances assuming that a series of consecutive gaps represented one insertion or deletion. The SILVA, greengenes, and RDP alignments represent a gradation in the level of attention given to aligning the variable regions and are each guided by the secondary structure of the 16S rRNA gene. In contrast, the MUSCLE and pairwise alignments are attempts to optimize the alignment between sequences based on a limited number of parameters that are set
Region | Statistic | RDP | greengenes | MUSCLE | Needleman |
Slope | 1.06 | 1.17 | 1.04 | 0.93 | |
R2 | 0.97 | 0.77 | 0.98 | 0.99 | |
Slope | 1.13 | 1.25 | 1.11 | 0.93 | |
R2 | 0.80 | 0.52 | 0.77 | 0.91 | |
Slope | 1.08 | 1.20 | 1.07 | 0.93 | |
R2 | 0.88 | 0.62 | 0.92 | 0.93 | |
Slope | 1.06 | 1.16 | 1.05 | 0.94 | |
R2 | 0.94 | 0.74 | 0.96 | 0.97 | |
Slope | 1.04 | 1.21 | 1.16 | 0.97 | |
R2 | 0.94 | 0.67 | 0.64 | 0.95 | |
Slope | 1.04 | 1.18 | 1.09 | 0.95 | |
R2 | 0.96 | 0.74 | 0.92 | 0.96 | |
Slope | 1.00 | 1.11 | 2.07 | 1.02 | |
R2 | 0.91 | 0.67 | 0.14 | 0.96 | |
Slope | 1.01 | 1.12 | 1.04 | 0.95 | |
R2 | 0.98 | 0.83 | 0.97 | 0.98 | |
Slope | 1.00 | 1.09 | 1.08 | 0.98 | |
R2 | 0.99 | 0.77 | 0.87 | 0.98 | |
Slope | 1.09 | 1.42 | 3.04 | 0.98 | |
R2 | 0.66 | 0.14 | 0.30 | 0.97 | |
Slope | 1.07 | 1.21 | 1.10 | 0.94 | |
R2 | 0.92 | 0.70 | 0.91 | 0.97 | |
Slope | 1.05 | 1.18 | 1.12 | 0.96 | |
R2 | 0.94 | 0.78 | 0.86 | 0.98 | |
Slope | 1.08 | 1.19 | 1.58 | 0.96 | |
R2 | 0.80 | 0.67 | 0.36 | 0.96 |
Considering the poor correlation between the distances generated from the five alignment methods, it was necessary to determine the effect of this variation on the ability to accurately describe and compare communities. As expected based on the genetic distance analysis, the number of OTUs observed using the greengenes alignment was routinely higher than that observed using the other alignment methods and the number of OTUs observed using the pairwise alignment method was routinely the lowest (
Phylogenetic diversity was measured by calculating the total branch length for a phylogenetic tree.
To describe β-diversity, I used two OTU-based metrics (
Each bar represents the average coefficient value for 100 randomized partitionings of the data. Within the same OTU cutoff, alignment strategies with the same symbol and regions with the same letter were not significantly different from each other.
Each bar represents the average coefficient value for 100 randomized partitionings of the data. Within the same OTU cutoff, alignment strategies with the same symbol and regions with the same letter were not significantly different from each other.
Each bar represents the average coefficient value for 100 randomized partitionings of the data. Within the same UniFrac approach, alignment strategies with the same symbol and regions with the same letter were not significantly different from each other.
Using the same SILVA-aligned sequences that I analyzed above, I investigated the effect of different distance calculation methods on downstream analyses. Specifically, I considered the one gap calculator (i.e. a gap of any length between two sequences represents a single mutation) and each gap (i.e. gaps length
Region | Statistic | Each gap | Ignore gaps | Filtered | kmer |
Slope | 1.02 | 0.94 | 0.66 | 3.73 | |
R2 | 0.99 | 0.99 | 0.89 | 0.97 | |
Slope | 1.02 | 0.94 | 0.56 | 3.91 | |
R2 | 0.87 | 0.97 | 0.47 | 0.89 | |
Slope | 1.02 | 0.94 | 0.55 | 3.80 | |
R2 | 0.92 | 0.97 | 0.56 | 0.91 | |
Slope | 1.02 | 0.95 | 0.67 | 3.87 | |
R2 | 0.96 | 0.98 | 0.75 | 0.95 | |
Slope | 1.01 | 0.97 | 0.71 | 4.31 | |
R2 | 0.90 | 0.99 | 0.58 | 0.92 | |
Slope | 1.01 | 0.96 | 0.64 | 3.99 | |
R2 | 0.92 | 0.98 | 0.66 | 0.93 | |
Slope | 1.09 | 0.95 | 0.77 | 4.69 | |
R2 | 0.69 | 0.94 | 0.52 | 0.83 | |
Slope | 1.01 | 0.97 | 0.79 | 4.24 | |
R2 | 0.99 | 0.98 | 0.84 | 0.94 | |
Slope | 1.00 | 0.98 | 0.99 | 4.70 | |
R2 | 1.00 | 0.98 | 0.99 | 0.93 | |
Slope | 1.00 | 0.96 | 1.17 | 4.82 | |
R2 | 1.00 | 0.95 | 0.38 | 0.86 | |
Slope | 1.02 | 0.94 | 0.73 | 3.94 | |
R2 | 0.98 | 0.98 | 0.79 | 0.94 | |
Slope | 1.03 | 0.93 | 0.85 | 4.35 | |
R2 | 0.94 | 0.97 | 0.86 | 0.95 | |
Slope | 1.01 | 0.93 | 0.76 | 4.60 | |
R2 | 0.95 | 0.94 | 0.69 | 0.91 |
Pairwise kmer distances were much larger than those from the alignment-based calculators, and their regression onto the one gap distances accounted for between 83 and 97% of the variation observed between the distances. To have no risk of falsely ignoring true one gap pairwise distances smaller than 0.10, it was necessary to keep kmer distances smaller than 0.45 (V19) to 0.73 (V6). This would require calculating between 3.3- and 9.1-fold more distances than would be needed by alignment-based methods.
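The reasoning behind these kmer cutoffs can be illustrated with a minimal Python sketch. It assumes that paired one gap and kmer distances are available for the same sequence pairs; the function name, the input layout, and the inclusive comparison are illustrative assumptions and not the code used in this study.

```python
def kmer_cutoff_for_full_recall(pairs, onegap_cutoff=0.10):
    """Find the kmer cutoff that keeps every pair below the one gap cutoff.

    pairs: iterable of (onegap_distance, kmer_distance) tuples, one per
           sequence pair (hypothetical input layout).
    Returns (kmer_cutoff, fold_increase), where fold_increase is the number
    of pairs retained by the kmer cutoff divided by the number of pairs that
    are truly below the one gap cutoff.
    """
    pairs = list(pairs)
    wanted = [kmer for onegap, kmer in pairs if onegap < onegap_cutoff]
    if not wanted:
        raise ValueError("no pair falls below the one gap cutoff")
    # The largest kmer distance among the wanted pairs is the smallest
    # cutoff that loses none of them.
    kmer_cutoff = max(wanted)
    retained = sum(1 for _, kmer in pairs if kmer <= kmer_cutoff)
    return kmer_cutoff, retained / len(wanted)


# Toy data, invented for illustration only.
toy_pairs = [(0.02, 0.10), (0.08, 0.45), (0.15, 0.30), (0.20, 0.60), (0.30, 0.44)]
cutoff, fold = kmer_cutoff_for_full_recall(toy_pairs)
```

In this toy example, two pairs lie below the one gap cutoff, but four pairs must be retained to guarantee that neither is lost, which is the kind of fold increase described above.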
Lacking a theoretical basis for treating gaps as a single evolutionary event, I was curious how much measures of α- and β-diversity are affected by the choice of a distance calculator. I used an OTU-based approach to determine the effect of distance calculation methods on the richness of OTUs within the dataset (
Phylogenetic diversity was measured by calculating the total branch length for a phylogenetic tree.
I next investigated what effect each calculator method had on two OTU-based (
Each bar represents the average coefficient value for 100 randomized partitionings of the data. Within the same OTU cutoff, regions with the same letter were not significantly different from each other; for each OTU cutoff all distance calculation methods were significantly different from each other.
Each bar represents the average coefficient value for 100 randomized partitionings of the data. Within the same OTU cutoff, distance calculation methods with the same symbol and regions with the same letter were not significantly different from each other.
Each bar represents the average coefficient value for 100 randomized partitionings of the data. Within the same UniFrac method, regions with the same letter were not significantly different from each other; for both UniFrac methods the distance calculation methods were all significantly different from each other.
To circumvent alignment quality problems, the Lane mask has been used to filter variable regions from 16S rRNA genes. Results of analyses using filtered sequences did not vary to a meaningful degree, regardless of the alignment method or distance calculation method used. Comparison of distances calculated using filtered sequences to those calculated using unfiltered sequences showed that filtering significantly reduced the genetic diversity observed between sequences (
I compared the one gap distances calculated for each of the 12 regions from each alignment to the one gap distances calculated from the full-length SILVA alignments (
Region | Statistic | SILVA | greengenes | RDP | MUSCLE | Needleman |
Slope | NA | 1.17 | 1.06 | 1.04 | 0.93 | |
R2 | NA | 0.77 | 0.97 | 0.98 | 0.99 | |
Slope | 1.50 | 1.79 | 1.65 | 1.74 | 1.36 | |
R2 | 0.70 | 0.55 | 0.64 | 0.57 | 0.70 | |
Slope | 1.31 | 1.52 | 1.40 | 1.40 | 1.19 | |
R2 | 0.73 | 0.60 | 0.66 | 0.74 | 0.72 | |
Slope | 1.13 | 1.29 | 1.19 | 1.18 | 1.05 | |
R2 | 0.87 | 0.72 | 0.83 | 0.88 | 0.87 | |
Slope | 1.39 | 1.61 | 1.43 | 1.57 | 1.31 | |
R2 | 0.70 | 0.59 | 0.69 | 0.62 | 0.70 | |
Slope | 1.21 | 1.38 | 1.26 | 1.29 | 1.13 | |
R2 | 0.73 | 0.64 | 0.71 | 0.72 | 0.74 | |
Slope | 1.05 | 1.15 | 1.06 | 1.63 | 0.97 | |
R2 | 0.26 | 0.26 | 0.24 | 0.23 | 0.27 | |
Slope | 0.90 | 0.99 | 0.90 | 0.93 | 0.85 | |
R2 | 0.77 | 0.68 | 0.76 | 0.75 | 0.76 | |
Slope | 0.97 | 1.05 | 0.97 | 1.05 | 0.94 | |
R2 | 0.57 | 0.50 | 0.56 | 0.55 | 0.57 | |
Slope | 2.98 | 3.52 | 3.30 | 4.99 | 2.62 | |
R2 | 0.36 | 0.31 | 0.34 | 0.24 | 0.37 | |
Slope | 0.98 | 1.16 | 1.05 | 1.04 | 0.90 | |
R2 | 0.77 | 0.62 | 0.72 | 0.77 | 0.78 | |
Slope | 0.79 | 0.93 | 0.84 | 0.86 | 0.75 | |
R2 | 0.70 | 0.62 | 0.67 | 0.68 | 0.70 | |
Slope | 0.67 | 0.83 | 0.76 | 0.85 | 0.63 | |
R2 | 0.46 | 0.43 | 0.42 | 0.39 | 0.46 |
The distance-based analysis clearly showed significant differences between distances calculated from sub-regions and full-length sequences. The OTU-based analysis in
Using pyrotag data introduces several complexities to β-diversity analyses. Moving across regions while using the same OTU definition could lead one to overestimate community similarity. For example, the average Morisita-Horn similarity for full-length SILVA-aligned sequences with one gap distances was 0.56. Using similarly treated sequences from the V12, V13, V14, and V23 regions, I calculated Morisita-Horn values between 0.57 and 0.60; however, those from the other 8 regions yielded values between 0.64 (V2) and 0.79 (V9). For a single region, changing the OTU cutoff also had a significant effect on the Morisita-Horn index. For instance, full-length SILVA-aligned sequences yielded 0.52, 0.56, and 0.86 for cutoffs of 0.03, 0.05, and 0.10, respectively. This spread in Morisita-Horn values between the 0.03 and 0.10 OTU cutoffs (0.34) was the largest of any region. The narrowest spread was observed for the V6 region (0.06). In contrast to the Morisita-Horn values, there was little variation in the unweighted or weighted UniFrac statistic when comparing sequences analyzed by the same alignment and distance calculation method. With the exception of the V6 region (0.33), the average unweighted UniFrac values varied between 0.24 (V13, V14, V19) and 0.30 (V9), and with the exception of the V12 region (0.69), the average weighted UniFrac values varied between 0.80 (V13) and 0.87 (V9); the value for the full-length sequence was 0.82. Similar to the α-diversity measure of phylogenetic diversity, an added complication of phylogeny-based methods is the complexity of interpreting the proportion of branch length that is shared between or unique to two communities and how such proportions relate to classical β-diversity measures. Thus, it is difficult to interpret the biological significance of such variation. Regardless, the results of the OTU- and phylogeny-based analyses demonstrate that caution must be taken in extrapolating results from one region to another.
The ability to define OTUs and reconstruct phylogenies allows an investigator to approach their problem using the data as they present themselves without being confined to an
I have shown that alignment quality has a significant impact on downstream data analysis. Because the 16S rRNA gene sequence follows a well-determined secondary structure, it is possible to objectively state that one alignment is better than another. Furthermore, pairwise and multiple sequence alignments that ignore the secondary structure are inadvisable on theoretical grounds. Such methods are also inadvisable on technical grounds, as the time and memory required to complete them typically scale in excess of the number of sequences squared, whereas the time required to perform a profile-based alignment scales linearly with the number of sequences.
A significant factor in the analysis of DNA sequences is the calculation of pairwise distances. The rich literature developed for protein-coding sequences has generated the Jukes-Cantor, Kimura, Hasegawa-Kishino-Yano and other substitution models
This study has ramifications for how analyses are performed. Since it is clear that the 16S rRNA gene does not evolve uniformly across its length, it is critical that sequences fully overlap before they are compared. For example, consider an analysis that includes sequences from the V2 region and those from the V12 region. The V12 sequences will have higher pairwise distances among themselves than the V2 sequences because the V1 region is evolving at a faster rate. Thus, the comparison of short and long sequence reads will introduce artifacts into the analysis, which will overstate the richness within the community. Although not explored here, it is likely that similar problems will be encountered in analyses where a taxonomy hierarchy is used to assign sequences to bins. Thus, it is critical that sequences are trimmed to start and end at the same sequence-based landmarks. Because pyrosequencing does not yield a uniform length sequence read, this introduces a conundrum of whether to favor fewer long reads or many short reads. Because it is impossible to compare pyrotags to the full-length sequences accurately, it seems appropriate to increase the power of other statistical analyses by sacrificing sequence length in favor of having more sequence reads.
Next generation sequence analysis of 16S rRNA genes offers the first opportunity to replicate analyses, develop more complex experimental designs, and increase sampling depth and breadth. The results of this study encourage one to see pyrotags as markers within a metagenome and suggest a different way of considering microbial community analysis. Just as single nucleotide polymorphisms (SNPs), which may have no direct effect on a gene's phenotype, have been used as markers of disease in genome-wide association studies (GWAS), pyrotags will no doubt serve as a useful analog to SNPs for the nascent field of metagenome-wide association studies (MWAS).
I obtained the SSURef 16S rRNA gene sequence database from the SILVA project (version 98;
I implemented three sequence-based methods for calculating pairwise distances and a kmer-based distance metric. The first sequence-based method ignored any site that contained a gap; this method is implemented in the commonly used DNADIST program from the PHYLIP package
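The three gap treatments can be illustrated with a minimal Python sketch for two aligned sequences of equal length. The function name, the mode labels, the choice of denominator (the number of columns or gap events actually compared), and the handling of columns that are gaps in both sequences are assumptions made for illustration; they are not taken from the implementation used in this study, and terminal gaps receive no special treatment here.

```python
GAP_CHARS = "-."


def pairwise_distance(seq_a, seq_b, gap_mode="onegap"):
    """Uncorrected pairwise distance between two aligned sequences.

    gap_mode:
      "ignore" - skip every column in which either sequence has a gap
      "each"   - count every gapped column as one difference
      "onegap" - count a run of consecutive gaps in one sequence as a
                 single difference
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if gap_mode not in ("ignore", "each", "onegap"):
        raise ValueError("unknown gap_mode: " + gap_mode)

    differences = 0
    compared = 0
    run_owner = None  # which sequence owns the gap run currently open
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        x_gap, y_gap = x in GAP_CHARS, y in GAP_CHARS
        if x_gap and y_gap:
            continue  # a gap in both sequences carries no information
        if x_gap or y_gap:
            owner = "a" if x_gap else "b"
            if gap_mode == "ignore":
                continue
            if gap_mode == "onegap" and owner == run_owner:
                continue  # same gap run, already counted once
            run_owner = owner
            differences += 1
            compared += 1
        else:
            run_owner = None
            compared += 1
            if x != y:
                differences += 1
    return differences / compared if compared else 0.0


# Toy alignment: one 3-base gap run and one substitution.
a = "ACGT---ACGT"
b = "ACGTTTTACGA"
ignore_d = pairwise_distance(a, b, "ignore")  # 1 difference over 8 columns
each_d = pairwise_distance(a, b, "each")      # 4 differences over 11 columns
onegap_d = pairwise_distance(a, b, "onegap")  # 2 differences over 9 events
```

The toy alignment shows why the treatments diverge most where sequence length varies: a single 3-base indel contributes nothing, three differences, or one difference depending on the calculator.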
Pairwise distances were compared using a custom C++-coded program that calculated the linear regression coefficient using the origin as the intercept and the Pearson product-moment correlation coefficient
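The two statistics can be sketched in a few lines of Python. This is an illustration and not the C++ program itself; it assumes that the reference distances are the predictor, that the other method's distances are the response, and that the R2 values in the tables correspond to the squared Pearson coefficient. The function names are hypothetical.

```python
import math


def slope_through_origin(reference, other):
    """Least-squares slope of other ~ slope * reference with intercept 0."""
    sxy = sum(x * y for x, y in zip(reference, other))
    sxx = sum(x * x for x in reference)
    return sxy / sxx


def pearson_r(reference, other):
    """Pearson product-moment correlation coefficient."""
    n = len(reference)
    mean_x = sum(reference) / n
    mean_y = sum(other) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reference, other))
    sxx = sum((x - mean_x) ** 2 for x in reference)
    syy = sum((y - mean_y) ** 2 for y in other)
    return sxy / math.sqrt(sxx * syy)


# Toy distances, invented for illustration only.
reference = [0.01, 0.02, 0.04, 0.08]
other = [2 * d for d in reference]
slope = slope_through_origin(reference, other)
r_squared = pearson_r(reference, other) ** 2
```

A slope above 1 indicates that a method yields larger distances than the reference on average, and a low R2 indicates that the two sets of distances are poorly related even if the slope is close to 1.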
OTU- and phylogeny-based analyses were performed to assess the intra-sample biodiversity. Sequences were assigned to OTUs using the mothur implementation of the furthest-neighbor clustering algorithm
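For readers unfamiliar with furthest-neighbor (complete linkage) clustering, the following naive Python sketch shows the principle: two clusters are merged only if every sequence in one is within the cutoff of every sequence in the other. It is not the mothur implementation, which differs in efficiency and in details such as distance rounding; the function name and the input layout are assumptions.

```python
from itertools import combinations


def furthest_neighbor_otus(distances, n_seqs, cutoff):
    """Cluster sequences 0..n_seqs-1 into OTUs by complete linkage.

    distances: dict mapping (i, j) with i < j to a pairwise distance
               (hypothetical input layout).
    """
    clusters = [{i} for i in range(n_seqs)]

    def linkage(c1, c2):
        # Furthest neighbor: the largest distance between any two members.
        return max(distances[(min(i, j), max(i, j))] for i in c1 for j in c2)

    while len(clusters) > 1:
        d, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p, q in combinations(range(len(clusters)), 2)
        )
        if d > cutoff:
            break  # no remaining pair of clusters satisfies the cutoff
        clusters[p] |= clusters[q]
        del clusters[q]
    return clusters


# Toy distances, invented for illustration only.
toy = {(0, 1): 0.01, (0, 2): 0.04, (0, 3): 0.20,
       (1, 2): 0.04, (1, 3): 0.04, (2, 3): 0.02}
otus = furthest_neighbor_otus(toy, 4, cutoff=0.05)
```

In the toy example, most distances are below 0.05, yet the two clusters are not merged because sequences 0 and 3 are 0.20 apart.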
The OTU assignments and neighbor-joining trees created to study α-diversity were used to evaluate the effects of each variable on the ability to calculate β-diversity. Towards this end, I segregated the sequences to create two mock communities that shared 80% of their membership but had different structures. To create the mock communities full-length SILVA-aligned sequences were first assigned to OTUs using a furthest neighbor clustering of one gap distances with a cutoff of 0.05. Second, OTUs were randomly ordered. Third, 10% of the OTUs were assigned exclusively to the first community, another 10% were assigned exclusively to the second community, and the remaining OTUs were shared. For half of the shared OTUs, the probability of a sequence being from the first community was 0.375 and for the other half of the shared OTUs, the probability was 0.625. These probabilities were selected to simulate sampling two communities that had a Jaccard similarity index of 0.80 and Morisita-Horn Index value of 0.60. This process was repeated to create 100 simulated communities. Because the mock communities were not exhaustively sampled, it was unlikely that the measures would actually equal 0.80 and 0.60 for the Jaccard and Morisita-Horn indices. All β-diversity calculations were made using the mothur software package
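The partitioning scheme and the two target index values can be checked with a minimal Python sketch. The function names, the seeding, and the use of OTU sizes as input are assumptions for illustration; the actual calculations were made with mothur.

```python
import random


def make_mock_communities(otu_sizes, seed=0):
    """Split the sequences of each OTU between two mock communities."""
    rng = random.Random(seed)
    order = list(range(len(otu_sizes)))
    rng.shuffle(order)                       # randomly order the OTUs
    n_exclusive = len(order) // 10           # 10% exclusive to each community
    shared = order[2 * n_exclusive:]
    half = len(shared) // 2

    p_first = {}                             # P(sequence goes to community 1)
    for i in order[:n_exclusive]:
        p_first[i] = 1.0
    for i in order[n_exclusive:2 * n_exclusive]:
        p_first[i] = 0.0
    for i in shared[:half]:
        p_first[i] = 0.375
    for i in shared[half:]:
        p_first[i] = 0.625

    first = [0] * len(otu_sizes)
    second = [0] * len(otu_sizes)
    for i, size in enumerate(otu_sizes):
        for _ in range(size):
            if rng.random() < p_first[i]:
                first[i] += 1
            else:
                second[i] += 1
    return first, second


def jaccard(x, y):
    """Shared OTUs divided by the OTUs observed in either community."""
    shared = sum(1 for a, b in zip(x, y) if a > 0 and b > 0)
    total = sum(1 for a, b in zip(x, y) if a > 0 or b > 0)
    return shared / total


def morisita_horn(x, y):
    """Morisita-Horn similarity from per-OTU abundances."""
    x_total, y_total = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    dx = sum(a * a for a in x) / x_total ** 2
    dy = sum(b * b for b in y) / y_total ** 2
    return 2 * sxy / ((dx + dy) * x_total * y_total)


# Expected abundances for 20 equally sized OTUs under the scheme above.
expected_first = [1, 1, 0, 0] + [0.375] * 8 + [0.625] * 8
expected_second = [0, 0, 1, 1] + [0.625] * 8 + [0.375] * 8
j = jaccard(expected_first, expected_second)
mh = morisita_horn(expected_first, expected_second)
```

For equally sized OTUs, the expected abundances give a Jaccard index of 0.80 and a Morisita-Horn index of 0.60, which matches the targets stated above. A single random partitioning from `make_mock_communities` will generally deviate from these values because of sampling.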