Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities

Kevin Chen; Lior Pachter

doi:10.1371/journal.pcbi.0010024

Abstract

The application of whole-genome shotgun sequencing to microbial communities represents a major development in metagenomics, the study of uncultured microbes via the tools of modern genomic analysis. In the past year, whole-genome shotgun sequencing projects of prokaryotic communities from an acid mine biofilm, the Sargasso Sea, Minnesota farm soil, three deep-sea whale falls, and deep-sea sediments have been reported, adding to previously published work on viral communities from marine and fecal samples. The interpretation of this new kind of data poses a wide variety of exciting and difficult bioinformatics problems. The aim of this review is to introduce the bioinformatics community to this emerging field by surveying existing techniques and promising new approaches for several of the most interesting of these computational problems.

Citation: Chen K, Pachter L (2005) Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities. PLoS Comput Biol 1(2): e24. https://doi.org/10.1371/journal.pcbi.0010024

Published: July 12, 2005

Copyright: © 2005 Chen and Pachter. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: HMM, hidden Markov model; MSA, multiple sequence alignment; WGS, whole-genome shotgun

Introduction

Metagenomics is the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species [1–6]. The field has its roots in the culture-independent retrieval of 16S rRNA genes, pioneered by Pace and colleagues two decades ago [7]. Since then, metagenomics has revolutionized microbiology by shifting focus away from clonal isolates towards the estimated 99% of microbial species that cannot currently be cultivated [8,9].

A typical metagenomics project begins with the construction of a clone library from DNA sequence retrieved from an environmental sample. Clones are then selected for sequencing using either functional or sequence-based screens. In the functional approach, genes retrieved from the environment are heterologously expressed in a host, such as Escherichia coli, and sophisticated functional screens are employed to detect clones expressing functions of interest [10–12]. This approach has produced many exciting discoveries and spawned several companies aiming to retrieve marketable natural products from the environment (e.g., Diversa [http://www.diversa.com] and Cubist Pharmaceuticals [http://www.cubist.com]). In the sequence-based approach, clones are selected for sequencing based on the presence of either phylogenetically informative genes, such as 16S, or other genes of biological interest [13–17]. The most prominent result of this approach thus far is the discovery of the proteorhodopsin gene from a marine community [14].

Recently, facilitated by the increasing capacity of sequencing centers, whole-genome shotgun (WGS) sequencing of the entire clone library has emerged as a third approach to metagenomics. Unlike previous approaches, which typically study a single gene or individual genomes, this approach offers a more global view of the community, allowing us to better assess levels of phylogenetic diversity and intraspecies polymorphism, study the full gene complement and metabolic pathways in the community, and in some cases, reconstruct near-complete genome sequences. WGS also has the potential to discover new genes that are too diverged from currently known genes to be amplified with PCR, or heterologously expressed in common hosts, and is especially important in the case of viral communities because of the lack of a universal gene analogous to 16S.

Nine shotgun sequencing projects of various communities have been completed to date (Table 1). The biological insights from these studies have been well reviewed elsewhere [3,6]. Here, we highlight just two studies that exemplify the exciting possibilities of the approach. The acid mine biofilm community [18] is an extremely simple model system, consisting of only four dominant species, so a relatively minuscule amount of shotgun sequencing (75 Mbp) was enough to produce two near-complete genome sequences and detailed information about metabolic pathways and strain-level polymorphism. At the other end of the spectrum, the Sargasso Sea community is extremely complex, containing more than 1,800 species [19,20]. Nonetheless, with an enormous amount of sequencing (1.6 Gbp), vast amounts of previously unknown diversity were discovered, including over 1.2 million new genes, 148 new species, and numerous new rhodopsin genes. These results were especially surprising given how well the community had been studied previously, and suggest that equally large amounts of biological diversity await future discovery.

Table 1.

Published Microbial Community Shotgun Sequencing Projects

https://doi.org/10.1371/journal.pcbi.0010024.t001

In this review, we survey several of the most interesting computational problems that arise from WGS sequencing of communities. Traditional approaches to classic bioinformatics problems such as assembly, gene finding, and phylogeny need to be reconsidered in light of this new kind of data, while new problems need to be addressed, including how to compare communities, how to separate sequence from different organisms in silico, and how to model population structures using WGS assembly statistics. We discuss all these problems and their connections to other areas of bioinformatics, such as the assembly of highly polymorphic genomes, gene expression analysis, and supertree methods for phylogenetic reconstruction.

Although we have chosen to focus on the shotgun sequencing approach, we stress that this is only one piece of the exciting field of metagenomics, and that the integration of other techniques such as large-insert clone sequencing, microarray analysis, and proteomics will be vital to achieve a comprehensive view of microbial communities.

Assembling Communities

The retrieval of nearly complete genomes from the environment without prior lab cultivation is one of the most spectacular results of metagenomics to date. A fundamental limit on the WGS approach is that we can only expect to assemble genomes that constitute a significant fraction of the community [21]. Filtration and normalization techniques that enrich the library for certain low-abundance species, a common technique in the sequencing of symbionts, are thus of vital importance when genome assembly is a primary goal [22,23].

When a closely related, fully sequenced genome is available, comparative assembly can easily be performed by extracting the homologous sequence and assembling it with either a comparative assembler [24] or an alignment program that can handle draft sequence [25,26]. This approach is standard and has been used many times for mixed sequence from multiple species ([19,27]; E. Allen, unpublished data).

In the absence of an appropriate template genome, traditional overlap–layout–consensus assembly [28] can be done, augmented by an additional binning step, in which scaffolds (contiguous sequence with gaps of approximately known size) are separated into species-specific “bins.” The first issue that needs to be overcome is the increased amount of polymorphism, since each read will typically be sampled from a different individual in the population. Second, highly conserved sequence shared between different species can seed contigs and cause false overlaps. In some communities, even phylogenetically distant genomes can share a large number of genes [29]. Careful study of the optimal overlap parameters for separating out sequences at different phylogenetic distances is important, and has been carried out for viral communities [30], but not yet for prokaryotes.

The assembly of communities has strong similarities to the assembly of highly polymorphic diploid eukaryotes, such as Ciona savignyi [26] and Candida albicans [31], if we view prokaryotic strains as analogous to eukaryotic haplotypes. The main difference is that in a microbial community, the number of strains is unknown and potentially large, and their relative abundance is also unknown and potentially skewed, while in most eukaryotes we know a priori the number of haplotypes and their relative abundance. This disadvantage is mitigated somewhat by the small size and relative lack of repetitive sequence in prokaryotic and viral genomes, so that the issue of distinguishing alleles from paralogs and polymorphism from repetitive sequence is less acute.

Thus far, both community assembly and polymorphic eukaryotic assembly have been performed by running a single-genome assembler, such as the Celera assembler [32] or Jazz [33], and then manually post-processing the resulting scaffolds to correct assembly errors. Contigs erroneously split apart because of polymorphism are reconnected, and contigs based on false overlaps are broken apart. Not surprisingly, ad hoc heuristics must be employed to adapt programs optimized for single-genome assembly: the Celera assembler, for instance, treats high-depth contigs associated with abundant species as repetitive sequence.

A promising direction for both these problems is co-assembly, in which two very closely related genomes (or even two assemblies of the same genome) are assembled concurrently, using alignment information to complement mate-pair information in ordering scaffolds and correcting assembly errors in a structured, automated way. Thus far, the only published work on this problem is that of Sundararajan et al. [26], and even then, only for two genomes. For three or more genomes, even the multiple alignment problem for draft sequence is not solved. Large-insert clone sequence will also be very useful since the entire clone comes from a single strain or haplotype [22,34].

After scaffolds have been constructed, the next step is to bin the scaffolds according to species or phylogenetic clade. The gold standard for binning is the presence of a phylogenetically informative gene. 16S rRNA, though universal, is decidedly not single copy, so it is important to also consider other genes, such as RecA, EFG, EFTu, and HSP70 [19]. In the absence of one of these genes, genome signatures such as dinucleotide frequencies, codon bias, and GC-content, developed by Karlin and others in a long series of papers [35–38], can be used. These signatures appear to work for scaffolds on the order of 50 kbp in length, and, importantly, they seem to correlate only with phylogenetic relatedness and not with the environment [36]. There is a web server, Tetra, that computes tetranucleotide frequencies for metagenomics projects [39,40].
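As a rough illustration of the genome signatures mentioned above, the following sketch computes GC-content and k-mer (for example, dinucleotide or tetranucleotide) frequency vectors for a scaffold and compares two scaffolds by Euclidean distance. The function names and the choice of distance are assumptions made for illustration; this is not the statistic used by Tetra or in the Karlin papers, and it ignores reverse complements for brevity.

```python
from collections import Counter
from itertools import product
from math import sqrt


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_frequencies(seq, k=4):
    """Relative frequency of each of the 4**k k-mers over ACGT.

    Windows containing any other symbol (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def signature_distance(seq_a, seq_b, k=4):
    """Euclidean distance between two k-mer frequency vectors
    (an illustrative choice of distance, not a published one)."""
    fa, fb = kmer_frequencies(seq_a, k), kmer_frequencies(seq_b, k)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
```

Scaffolds with small pairwise distances would be candidates for the same bin; in practice such signatures are only expected to be informative for long scaffolds, as noted above.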

An additional source of evidence unique to WGS data is scaffold read depth, which is expected to be proportional to species abundance and thus can be used to separate high-abundance from low-abundance species. Subtleties can arise, however, since a variable polymorphism rate across a genome can cause conserved regions to be covered at high depth and variable regions to be covered at low depth.

For some applications, completely accurate binning may not be required. For example, gene finders based on hidden Markov models (HMMs) require training data from closely related species. The accuracy of the gene finder might be improved by additional training data, even if it is not from exactly the same species. One could even imagine running the following iterative algorithm: find a set of putative genes, construct gene trees with them, use the trees to crudely bin the scaffolds, retrain the gene finder, and repeat.

To conclude our discussion of assembly, we consider the important question of determining how much to sequence in order to assemble genomes. When sequencing a single genome, the Lander–Waterman model, based on the assumptions of independent and random reads, implies that the coverage of each base is distributed according to a Poisson distribution with parameter c (the coverage). Defining n_k to be the number of bases covered exactly k times and G to be the genome size, we have

$$n_k = G \, \frac{c^k e^{-c}}{k!} \qquad (1)$$

First consider the problem of assembling the most abundant genome at, say, 8× coverage. In the worst case, all species are present in equal abundance. The Lander–Waterman equation holds with G replaced by the sum of the sizes of all genomes of species in the community (sometimes called the metagenome). For the soil community, we have n₂ = 300,000 and G = 10⁸/c, so the equation implies a coverage of 0.006 and a total of 133 Gbp of sequence needed to assemble the most abundant genome at 8× coverage, disregarding the problem of binning. The total metagenome size predicted is G = 16.7 Gbp, corresponding to 2,800 E. coli–sized genomes, which is consistent with previous estimates of soil microbial diversity and the 16S survey.
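The worst-case soil calculation can be reproduced numerically. The sketch below assumes the Poisson form of the Lander–Waterman model, n_k = G c^k e^{-c} / k!, with G = S/c for S sequenced bases, and solves for c by bisection. The function names are hypothetical, and the 10^8 bp of sequence is taken from the relation G = 10⁸/c in the text.

```python
from math import exp, factorial


def expected_bases_at_depth(G, c, k):
    """Expected number of bases covered exactly k times (Poisson model)."""
    return G * c ** k * exp(-c) / factorial(k)


def solve_coverage(n_k, k, sequenced_bp):
    """Solve n_k = (sequenced_bp / c) * c**k * exp(-c) / k! for c.

    Uses bisection on (0, k - 1), where the left-hand side is increasing
    in c, so it returns the low-coverage solution. Requires k >= 2.
    """
    if k < 2:
        raise ValueError("k must be at least 2")

    def f(c):
        return expected_bases_at_depth(sequenced_bp / c, c, k) - n_k

    lo, hi = 1e-12, float(k - 1)
    if f(hi) < 0:
        raise ValueError("no low-coverage solution for these inputs")
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


c = solve_coverage(n_k=300_000, k=2, sequenced_bp=1e8)
metagenome_bp = 1e8 / c           # total metagenome size G
needed_bp = 8 * metagenome_bp     # sequence needed for 8x coverage
```

Under these assumptions c comes out close to 0.006, giving a metagenome of roughly 16.6 Gbp and about 133 Gbp of sequence for 8× coverage, in line with the figures quoted above.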

For the lower bound, we make the additional assumptions that all genomes have length 6 Mbp and that a single dominant species contributes all the overlaps in the assembly. The same equation implies that 2 Gbp of additional sequence is required for assembly at 8× coverage. This number is about twice that calculated from the 16S survey, but this might be explained by preferential amplification bias in PCR.

We performed similar calculations for the three whale fall communities. In addition, we considered the problem of assembling all genomes in these communities. Since the 16S survey indicated that three dominant species constitute approximately half the total abundance and all other species have roughly equal abundance, the Lander–Waterman model implies that the expected coverage should be distributed as the mixture of two Poissons with equal weight. The results of these calculations are summarized in Table 2. Similar results were obtained by Venter et al. [19] and Breitbart et al. [30], and there is also software for performing such calculations (http://phage.sdsu.edu/phaccs) [41].

Table 2.

Bounds on Amount of Sequence Needed to Assemble Genomes (in Mbp)

https://doi.org/10.1371/journal.pcbi.0010024.t002

Comparative Metagenomics

Gene finding is a fundamental goal in virtually all metagenomics projects, regardless of whether complete genome sequences can be assembled or not. If large scaffolds can be retrieved and binned, excellent HMM-based microbial gene finders such as FGENESB (http://www.softberry.com) and GLIMMER [42,43] can be used, in combination with expectation-maximization (EM) techniques for unsupervised training of the HMM parameters [44,45]. At the other extreme, we have unassembled reads of roughly 700 bp. These make up 50% of the total reads in the Sargasso Sea dataset and 100% in soil. Since prokaryotic genes are typically short, lack introns, and occur at high density (roughly one in 1,000 bp), each read is likely to contain a significant portion of a gene. For these reads, HMM techniques are unlikely to be successful, leaving BLAST search against a protein database or the community itself as the only realistic alternative.

There have been two simulation studies verifying the accuracy of BLAST for gene finding with single reads [21,46], though it is difficult to make this kind of experiment convincing, since the accuracy of the method is almost entirely dependent on the availability of closely related sequences in the database. We are not aware of any studies on the accuracy of HMM-based techniques on sequences significantly shorter than a whole genome, so we undertook a simple experiment ourselves. We sampled simulated “contigs” of length 10 kb from the complete genome sequence of Thermoplasma volcanium [47]. For each, we predicted genes using GLIMMER trained only on long open reading frames in the contig, and compared these to the GLIMMER predictions when trained on long open reading frames from the entire genome. We found that the results were surprisingly good. Of 92 genes completely contained in the ten simulated contigs, 86 were predicted exactly correctly. There were 16 genes that crossed the boundaries of the contigs, and GLIMMER was able to find truncated genes for seven of these. On the other hand, five of the completely spurious predictions all came from the same contig, which suggests that HMM accuracy may not be uniform over the length of the genome. More detailed studies on this problem are needed to relate the length of assembled contigs to the accuracy of the gene finder. An interesting direction is to attempt to recover more partial genes that overlap contig boundaries, firstly, by making the gene finder aware that genes on the boundary may be truncated and, secondly, by taking advantage of base quality scores for lower quality sequence at the ends of contigs. Another interesting research problem is to fine-tune gene finders for viral genomes.

The gene complement of a microbial community can be used as a fingerprint of a community, allowing us to compare different communities in a gene-centric, as opposed to genome-centric, fashion [21]. In this method, predicted genes are blasted against the COGs [48] or KEGG [49,50] databases and each community is assigned a fingerprint vector with entries corresponding to the number of hits to each COGs or KEGG category. It is also possible to cluster the COGs hits by function in order to compare the communities at a higher level.
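A fingerprint vector of this kind is simple to construct once each predicted gene has been assigned a category by its best database hit. The sketch below is a minimal illustration; the category labels are hypothetical placeholders, and the optional normalization to relative frequencies is an assumption added here so that communities with different sequencing depth can be compared.

```python
from collections import Counter


def fingerprint(hit_categories, categories, normalize=True):
    """Build a fingerprint vector from per-gene category assignments.

    hit_categories: one category label per predicted gene with a hit.
    categories:     fixed, ordered list of all categories (vector axes).
    normalize:      if True, return relative frequencies rather than counts.
    """
    counts = Counter(hit_categories)
    vec = [counts[c] for c in categories]
    if normalize:
        total = sum(vec)
        vec = [v / total for v in vec] if total else [0.0] * len(vec)
    return vec


# Hypothetical category labels, for illustration only.
axes = ["cat_A", "cat_B", "cat_C"]
soil_like = fingerprint(["cat_A", "cat_B", "cat_A"], axes, normalize=False)
```

Using the same ordered list of categories for every community keeps the vectors directly comparable entry by entry.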

Fingerprint vectors are analogous to gene-expression-level vectors in microarray analysis and any of the standard gene expression clustering methods can be used [51]. We first replicated the result of [21] by directly applying the popular off-the-shelf gene expression tools, CLUSTER and TreeView [52], to perform single-linkage hierarchical clustering on the KEGG vectors from several communities (Figure 1).
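For readers unfamiliar with hierarchical clustering, the following is a small self-contained sketch of single-linkage agglomerative clustering on fingerprint vectors using Euclidean distance. It is not the CLUSTER implementation; the sample names and toy vectors are invented for illustration, and the quadratic search is only suitable for a handful of communities.

```python
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(vectors):
    """Agglomerative single-linkage clustering.

    vectors: dict mapping sample name -> list of numbers.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are sorted tuples of sample names.
    """
    clusters = {(name,): [vec] for name, vec in vectors.items()}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                # Single linkage: distance between the closest members.
                d = min(dist(x, y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[tuple(sorted(a + b))] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges


# Toy two-dimensional "fingerprints" with invented sample names.
toy = {"s1": [0, 0], "s2": [0, 1], "w1": [5, 5], "w2": [5, 6.5]}
merges = single_linkage(toy)
```

In the toy example the two "s" samples merge first, then the two "w" samples, and the two groups are joined last.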

Figure 1. Blue-Yellow Microarray Figure Applied to KEGG Vectors for Four Metagenomics Projects

The whale-fall and Sargasso Sea data are partitioned into three different samples each. The rows correspond to the different datasets and the columns to the 137 KEGG categories. Blue corresponds to underrepresentation and yellow to overrepresentation. Note that some branch lengths have been adjusted for visualization purposes and do not correspond to a meaningful distance.

https://doi.org/10.1371/journal.pcbi.0010024.g001

Although the neat tree structure of the blue-yellow microarray figure (Figure 1) looks appealing, it can also be misleading at times because of the properties of UPGMA (unweighted pair group method with arithmetic mean) clustering. To check this, we applied principal components analysis to the fingerprint vectors (Figure 2). While the high-level result is similar, the principal components analysis shows that the clustering of the communities is somewhat more ambiguous than Figure 1 might suggest. For instance, note the surprising proximity of whale-fall sample 1 to the soil sample.

Figure 2. Projection of the KEGG Vectors on the First Two Principal Components

https://doi.org/10.1371/journal.pcbi.0010024.g002

Beyond clustering, principal components analysis has the advantage that dimensions of the principal components with high magnitude may correspond to COGs or KEGG sequences of interest, and the principal components themselves may correspond to interesting pathways or functions. This has not yet been fully explored and could potentially be a source of new functional pathways in communities.
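The projection and the inspection of high-magnitude dimensions can be sketched without external libraries. The code below is an illustrative implementation using power iteration with deflation, which is one of several ways to obtain the leading components; the function name and toy data are assumptions, and a numerical library would normally be preferred for real fingerprint matrices.

```python
from math import sqrt


def _project(X, v):
    """Dot product of every row of X with the vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in X]


def principal_components(rows, n_components=2, iters=200):
    """Leading principal components by power iteration with deflation.

    rows: list of equal-length numeric vectors (one per community).
    Returns (components, scores): unit-length loading vectors and the
    coordinates of each centered row on those components.
    Degenerate data with fewer directions of variance than n_components
    is not handled specially.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    X = [row[:] for row in centered]
    components = []
    for _ in range(n_components):
        # Deterministic, normalized starting vector.
        v = [1.0 / (j + 1) for j in range(d)]
        norm0 = sqrt(sum(c * c for c in v))
        v = [c / norm0 for c in v]
        for _step in range(iters):
            proj = _project(X, v)
            w = [sum(proj[i] * X[i][j] for i in range(n)) for j in range(d)]
            norm = sqrt(sum(c * c for c in w))
            if norm < 1e-12:
                break
            v = [c / norm for c in w]
        components.append(v)
        # Deflate: remove the variance explained by this component.
        proj = _project(X, v)
        X = [[X[i][j] - proj[i] * v[j] for j in range(d)] for i in range(n)]
    scores = [[sum(a * b for a, b in zip(row, comp)) for comp in components]
              for row in centered]
    return components, scores


toy_rows = [[2, 0], [-2, 0], [0, 1], [0, -1]]
components, scores = principal_components(toy_rows)
```

The entries of each loading vector with the largest absolute value indicate which categories drive that component, which is the sense in which high-magnitude dimensions may point to categories of interest.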

Finally, since fingerprinting has been advocated as an alternative to genome assembly when the amount of sequence required for assembly is very high [21], an important issue that needs to be discussed is how much sequence is required to fingerprint. In the same spirit as our Lander–Waterman calculations (equation 1), we estimate this quantity using the observation that the number of genes per shotgun read is very close to one [21,46]. Assuming a uniform species abundance distribution, we get the classic coupon collector's problem [53], in which the expected number of reads needed to collect a fraction f of the N genes in the community is, for large N, approximately

$$N \ln \frac{1}{1-f} \qquad (2)$$

Applying equation 2 to the soil community, if we assume 4,000 genes per genome and 3,000 genomes, then sampling half the genes would require 6 Gbp of sequencing, comparable to the lower bound on the amount of sequence needed to assemble the dominant genome (Table 2).
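The soil estimate can be checked with a few lines of arithmetic. The sketch below assumes the standard large-N coupon collector approximation N ln(1/(1 − f)), one gene per read, and reads of roughly 700 bp (the read length quoted earlier for unassembled reads); the function names are hypothetical. An exact finite-N sum is included for comparison.

```python
from math import log


def approx_reads_to_collect(N, f):
    """Large-N approximation to the expected number of draws needed to
    see a fraction f of N equally likely items: N * ln(1 / (1 - f))."""
    if not 0 <= f < 1:
        raise ValueError("f must satisfy 0 <= f < 1")
    return N * log(1 / (1 - f))


def exact_reads_to_collect(N, f):
    """Exact expectation: sum of N / (N - i) over the first floor(f*N)
    distinct items collected."""
    m = int(f * N)
    return sum(N / (N - i) for i in range(m))


genes = 4_000 * 3_000                       # genes per genome x genomes
reads = approx_reads_to_collect(genes, 0.5)
sequence_bp = reads * 700                   # assumed read length in bp
```

With these inputs the estimate is about 8.3 million reads, or a little under 6 Gbp of sequence, consistent with the figure given above.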

Based on these observations, it seems that it may be too early to conclude that fingerprinting is a powerful way of comparing communities. We also note that fingerprinting is difficult for viruses, since 65% of predicted genes from the viral community sequencing projects have no homolog in the databases [6]. However, similar techniques have been used to compare the species, as opposed to their gene complements, across different viral communities [54].

Phylogeny and Community Diversity

If complete gene sequences can be recovered from the community, classic multiple sequence alignment (MSA) [55] and phylogeny algorithms [56] can be applied. If only partial genes are available, phylogenetic reconstruction is still reasonably straightforward if there is already a database of nearly complete sequences, as with 16S [57] or RecA (http://www.tigr.org/_jeisen/RecA/RecA.html). The partial sequences can then be aligned against the complete ones, and the phylogenetic assignment performed by finding the closest sequences in the database [58]. Even for such genes, however, it is plausible to imagine a future in which the majority of genes in the database are in fact partial environmental sequences—at one point, for instance, the Sargasso Sea dataset made up 5% of the total genes in GenBank and a large number of these were unassembled reads. Alternatively, metagenomics projects may discover a highly diverged group of species that may not align well to existing sequences. In these scenarios, it will be necessary to have good MSA and phylogeny tools for partial sequences, even for these “universal” genes.

The case of viral phylogeny is more complex, firstly, because it is not clear that all viruses are related by a tree, and, secondly, because viral taxonomy has traditionally not been based on molecular sequence data, though the Phage Proteomic Tree [59] represents a step in the direction of sequence-based taxonomy. Viral taxonomy is at a very early stage of development, and there is no doubt that culture-independent methods will play an important role in the growth of the field.

Partial sequences are the crux of the phylogeny problem in the context of metagenomics. We are particularly interested in methods for such sequences because they will also be applicable for low-coverage sequencing projects of vertebrates and other species [46,60]. We are not aware of any MSA tools and phylogeny programs that are able to cope with short partial gene fragments, any two of which may fail to have significant overlap. At the alignment stage, we require a semi-global multiple alignment (i.e., terminal gaps are not penalized). The most widely used alignment tools are based on global or local alignments and do not correctly handle partial sequences (an exception is MAP [61]). Since most MSA tools are based on progressive alignment according to a guide tree, it is also important to construct this tree based on pairwise semi-global alignments and conserved terminal k-mers, as opposed to the pairwise global or local alignments currently used.
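To make the notion of semi-global alignment concrete, the following sketch scores a pairwise alignment in which terminal gaps in either sequence are free, as would be needed when building a guide tree from partial fragments. It covers only the pairwise case, not the multiple alignment problem discussed above, and the scoring parameters are arbitrary illustrative choices.

```python
def semiglobal_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b with free terminal gaps.

    The first row and column of the dynamic programming table are zero
    (leading gaps are free), and the answer is the maximum over the last
    row and last column (trailing gaps are free).
    """
    prev = [0] * (len(b) + 1)
    best = 0  # score of the empty overlap
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        # b is exhausted here; the rest of a is a free trailing gap.
        best = max(best, cur[-1])
        prev = cur
    # a is exhausted; the rest of b is a free trailing gap.
    return max(best, max(prev))


# The suffix "TACGT" of the first fragment overlaps the prefix of the second.
overlap_score = semiglobal_score("ACGTACGT", "TACGTTTT")
```

Unlike a global score, the result is not penalized for the non-overlapping ends of the two fragments, and a pair with no shared region scores zero rather than strongly negative.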

We studied 40 phosphoglycerate kinase genes from the soil study and aligned them with MUSCLE [62]. Though not optimized for partial sequences, MUSCLE did a reasonable job, as ascertained by several criteria: the number of internal gaps was small, sequences shorter than the read length had either no beginning gaps or no ending gaps (since the gene length is greater than the read length), and the total length was comparable to related proteins.

Of the 780 pairs of sequences, 95 pairs had overlap of less than 50 amino acids, and of these, 48 pairs had no overlap at all. Thus, we have an extreme instance of the missing data problem, which has been extensively discussed in the phylogenetics literature [63,64]. However, this literature has mostly studied consensus tree methods, and the effect of adding incomplete taxa and/or characters on the accuracy of traditional methods, like maximum likelihood. Relatively little effort has gone into actually finding better methods for tree reconstruction with this kind of data. Supertree methods [65], which attempt to construct trees from multiple subtrees, present one such alternative. One reason these methods have not been widely used in the past in the context of molecular data is the relative lack of maturity of the field as compared with parsimony or likelihood methods. However, encouraging new algorithmic results and software in this area [66–68] should spur renewed work on these types of methods. Supertree methods have also been criticized because incomplete data matrices (e.g., from fossil data) usually do not fit a random and independent missing data model. On the other hand, shotgun sequencing does fit this model and thus would seem an ideal setting for supertree methods. While the data might be too limited to provide completely resolved phylogenies, as previously discussed in the context of binning, even crude trees may be sufficient for certain applications, such as training HMMs.
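The pair counts above come from comparing every pair among the 40 sequences (40 × 39 / 2 = 780). A tally of this kind can be sketched as follows, assuming each fragment is summarized by the half-open interval of alignment columns it covers; the interval representation, function names, and toy data are assumptions for illustration and do not reproduce the actual soil data.

```python
from itertools import combinations


def overlap_length(span_a, span_b):
    """Number of shared columns between two half-open intervals."""
    return max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))


def overlap_summary(spans, threshold=50):
    """Count pairs overall, pairs overlapping by fewer than `threshold`
    columns, and pairs with no overlap at all."""
    lengths = [overlap_length(a, b) for a, b in combinations(spans, 2)]
    return {
        "pairs": len(lengths),
        "below_threshold": sum(length < threshold for length in lengths),
        "no_overlap": sum(length == 0 for length in lengths),
    }


# Toy spans: (start, end) alignment columns covered by each fragment.
toy_spans = [(0, 100), (80, 200), (300, 400)]
summary = overlap_summary(toy_spans)
```

Pairs counted under "no_overlap" are exactly those for which no pairwise distance can be computed directly, which is what makes the missing data problem acute for tree reconstruction.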

Finally, with regard to community diversity, one of the advantages of the WGS approach is that it is less biased than PCR, which is known to suffer from a host of problems [69]. Community modeling based on analysis of assembly data within the Lander–Waterman model is beginning to show that species abundance curves are not lognormal as previously thought [41,70], so new methods that take into account these naturally occurring distributions are needed.

Conclusion

The number of new community shotgun sequencing projects continues to grow, promising to provide vast quantities of sequence data for analysis. Samples are being drawn from macroscopic environments such as the sea and air, as well as from more contained communities such as the human mouth (Table 3). Exciting advances in our understanding of ecosystems, environments, and communities will require creative solutions to numerous new bioinformatics problems. We have briefly mentioned some of these: assembly (can co-assembly techniques be used to assemble polymorphic genomes and complex communities?), binning (what is the best way to combine diverse sources of information to bin scaffolds?), gene finding (how should gene finding programs, which were designed for complete genes and genomes, be adapted for low-coverage sequence?), fingerprinting (which clustering techniques are best suited for discovering novel pathways and functional groups that allow communities to adapt to their environments?), and MSA and phylogeny (how can we best construct trees and alignments from fragmented data?).

Table 3.

Examples of Ongoing Community WGS Sequencing Projects

https://doi.org/10.1371/journal.pcbi.0010024.t003

Countless more challenges will likely emerge as WGS sequencing approaches are used to tackle increasingly complex communities. The reward for computational biologists who work on these problems will be the satisfaction of contributing to the grand enterprise of understanding the total diversity of life on our planet.

Acknowledgments

We thank Eric Allen, Jill Banfield, Susannah Tringe, and Gene Tyson for introducing us to the field of metagenomics and for helpful discussions while preparing the manuscript. We also thank Richard Karp and Satish Rao for useful discussions on bioinformatics issues, and the anonymous reviewers for their comments on an earlier version of this paper. Some of the data we have used were provided by JGI and EMBL. KC was supported by National Science Foundation (NSF) grant EF 03–31494. LP was supported by a Sloan Research Fellowship, NSF grant CCF 03–47992, and National Institutes of Health grant R01-HG02362–03.

References

1. DeLong EF (2002) Microbial population genomics and ecology. Curr Opin Microbiol 5: 520–524.
2. Handelsman J (2004) Metagenomics: Application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 68: 669–684.
3. Riesenfeld CS, Schloss P, Handelsman J (2004) Metagenomics: Genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.
4. Rodriguez-Valera F (2004) Environmental genomics, the big picture? FEMS Microbiol Lett 231: 153–158.
5. Streit WR, Schmitz RA (2004) Metagenomics—The key to the uncultured microbes. Curr Opin Microbiol 7: 492–498.
6. Edwards RA, Rohwer F (2005) Viral metagenomics. Nat Rev Microbiol 3: 504–510.
7. Olsen GJ, Lane DJ, Giovannoni SJ, Pace NR, Stahl DA (1986) Microbial ecology and evolution: A ribosomal RNA approach. Annu Rev Microbiol 40: 337–365.
8. Hugenholtz P (2002) Exploring prokaryotic diversity in the genomic era. Genome Biol 3: REVIEWS0003.
9. Rappe M, Giovannoni S (2003) The uncultured microbial majority. Annu Rev Microbiol 57: 369–394.
10. Courtois S, Cappellano CM, Ball M, Francou F, Normand P, et al. (2003) Recombinant environmental libraries provide access to microbial diversity for drug discovery from natural products. Appl Environ Microbiol 69: 49–55.
11. Riesenfeld CS, Goodman RM, Handelsman J (2004) Uncultured soil bacteria are a reservoir of new antibiotic resistance genes. Environ Microbiol 6: 981–989.
12. Uchiyama T, Abe T, Ikemura T, Watanabe K (2005) Substrate-induced gene-expression screening of environmental metagenomic libraries for isolation of catabolic genes. Nat Biotechnol 23: 88–93.
13. Stein JL, March TL, Wu KY, Shizuya H, DeLong EF (1996) Characterization of uncultivated prokaryotes: Isolation and analysis of a 40-kilobase-pair genome fragment from a planktonic marine archaeon. J Bacteriol 178: 591–599.
14. Beja O, Aravind L, Koonin EV, Suzuki MT, Hadd A, et al. (2000) Bacterial rhodopsin: Evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.
15. Liles MR, Manske BF, Bintrim SB, Handelsman J, Goodman RM (2003) A census of rRNA genes and linked genomic sequences within a soil metagenomic library. Appl Environ Microbiol 69: 2684–2691.
16. Beja O (2004) To BAC or not to BAC: Marine ecogenomics. Curr Opin Biotechnol 15: 187–190.
17. Sabehi G, Beja O, Suzuki MT, Preston CM, DeLong EF (2004) Different SAR86 subgroups harbour divergent proteorhodopsins. Environ Microbiol 6: 903–910.
18. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, et al. (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37–43.
19. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74.
20. Acinas SG, Klepac-Ceraj V, Hunt DE, Pharino C, Ceraj I, et al. (2004) Finescale phylogenetic architecture of a complex bacterial community. Nature 430: 551–554.
21. Tringe S, von Mering C, Kobayashi A, Salamov A, Chen K, et al. (2005) Comparative metagenomics of microbial communities. Science 308: 554–557.
22. Hallam SJ, Putnam N, Preston CM, Detter JC, Rokhsar D, et al. (2004) Reverse methanogenesis: Testing the hypothesis with environmental genomics. Science 305: 1457–1462.
23. Dale C, Dunbar H, Moran NA, Ochman H (2005) Extracting single genomes from heterogenous DNA samples: A test case with Carsonella ruddii, the bacterial symbiont of psyllids (Insecta). J Insect Sci 5: 3.
24. Pop M, Philippy A, Delcher AL, Salzberg SL (2004) Comparative genome assembly. Brief Bioinform 5: 237–248.
25. Bray N, Pachter L (2004) MAVID: Constrained ancestral alignment of multiple sequences. Genome Res 14: 693–699.
26. Sundararajan M, Brudno M, Small K, Sidow A, Batzoglou S (2004) Chaining algorithms for alignment of draft sequence. Fourth Workshop on Algorithms in Bioinformatics; 2004 25–27 May; Bergen, Norway. Available: http://ai.stanford.edu/~serafim/wabi_finalSerafim.pdf. Accessed 7 July 2005.
27. Salzberg S, Hotopp J, Delcher A, Pop M, Smith D, et al. (2005) Serendipitous discovery of Wolbachia genomes in multiple Drosophila species. Genome Biol 6: R23.
28. Batzoglou S (2005) Algorithmic challenges in mammalian genome sequence assembly. In: Dunn M, Jorde L, Little P, Subramaniam S, editors. Encyclopedia of genomics, proteomics and bioinformatics. Hoboken (New Jersey): John Wiley and Sons. In press.
29. Ruepp A, Graml W, Santos-Martinez M, Koretke KK, Volker C, et al. (2000) The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum. Nature 407: 508–513.
30. Breitbart M, Salamon P, Andresen B, Mahaffy J, Segal A, et al. (2002) Genomic analysis of an uncultured marine viral community. Proc Natl Acad Sci U S A 99: 14250–14255.
31. Jones T, Federspiel NA, Chibana H, Dungan J, Kalman S, et al. (2004) The diploid genome sequence of Candida albicans. Proc Natl Acad Sci U S A 101: 7329–7334.
32. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, et al. (2000) A whole-genome assembly of Drosophila. Science 287: 2196–2204.
33. Aparicio S, Chapman J, Stupka E, Putnam N, Chia J, et al. (2002) Whole genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297: 1301–1310.
34. DeLong EF (2005) Microbial community genomics in the ocean. Nat Rev Microbiol 3: 459–469.
35. Abe T, Kanaya S, Kinouchi M, Ichiba Y, Kozuki T, et al. (2003) Informatics for unveiling hidden genome signatures. Genome Res 13: 693–702.
36. Campbell A, Mrazek J, Karlin S (1999) Genome signature comparisons among prokaryote, plasmid and mitochondrial DNA. Proc Natl Acad Sci U S A 96: 9184–9189.
37. Deschavanne PJ, Giron A, Vilain K, Fagot G, Fertil B (1999) Genomic signature: Characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol 16: 1391–1399.
38. Karlin S, Campbell AM, Mrazek J (1998) Comparative DNA analysis across diverse genomes. Annu Rev Genet 32: 185–225.
39. Teeling H, Meyerdierks A, Bauer M, Amann R, Glöckner FO (2004) Application of tetranucleotide frequencies for the assignment of genomic fragments. Environ Microbiol 6: 938–947.
40. Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO (2004) TETRA: A web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 5: 163.
41. Angly F, Rodriguez-Brito B, Bangor D, McNairnie P, Breitbart M, et al. (2005) PHACCS, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information. BMC Bioinformatics 6: 41.
42. Salzberg S, Delcher A, Kasif S, White O (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res 26: 544–548.
43. Delcher AL, Harmon D, Kasif S, White O, Salzberg SL (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636–4641.
44. Audic S, Claverie J (1998) Self-identification of protein-coding regions in microbial genomes. Proc Natl Acad Sci U S A 95: 10026–10031.
45. Hayes W, Borodovsky M (1998) How to interpret an anonymous bacterial genome: Machine learning approach to gene identification. Genome Res 8: 1154–1171.
46. Goo Y, Roach J, Glusman G, Baliga N, Deutsch K, et al. (2004) Low-pass sequencing for microbial comparative genomics. BMC Genomics 5: 3.
47. Kawashima T, Amano N, Koike H, Makino S, Higuchi S, et al. (2000) Archaeal adaptation to higher temperatures revealed by genomic sequence of Thermoplasma volcanium. Proc Natl Acad Sci U S A 97: 14257–14262.
48. Tatusov R, Koonin E, Lipman D (1997) A genomic perspective on protein families. Science 278: 631–637.
49. Kanehisa M (1997) A database for post-genome analysis. Trends Genet 13: 375–376.
50. Kanehisa M, Goto S (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28: 27–30.
51. Quackenbush J (2001) Computational analysis of microarray data. Nat Rev Genet 2: 418–427.
52. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95: 14863–14868.
53. Feller W (1968) An introduction to probability theory and its applications, Volume 1. Hoboken (New Jersey): John Wiley and Sons. 528 p.
54. Breitbart M, Hewson I, Felts B, Mahaffy J, Nulton J, et al. (2003) Metagenomic analysis of an uncultured viral community from human feces. J Bacteriol 185: 6220–6223.
55. Durbin R, Eddy SR, Krogh A, Mitchison G (2004) Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press. 368 p.
56. Felsenstein J (2004) Inferring phylogenies. Sunderland (Massachusetts): Sinauer Associates. 664 p.
57. Cole JR, Chai B, Farris RJ, Wang Q, Kulam SA, et al. (2005) The Ribosomal Database Project (RDP-II): Sequences and tools for high-throughput rRNA analysis. Nucleic Acids Res 33: D294–D296.
58. Ludwig W, Strunk O, Westram R, Richter L, Meier H (2004) ARB: A software environment for sequence data. Nucleic Acids Res 32: 1363–1371.
59. Rohwer F, Edwards R (2002) The phage proteomic tree: A genome-based taxonomy for phage. J Bacteriol 184: 4529–4535.
60. Margulies EH, Vinson JP, Miller W, Jaffe DB, Lindblad-Toh K, et al. (2005) An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing. Proc Natl Acad Sci U S A 102: 4795–4800.
61. Huang X (1994) On global sequence alignment. Comput Appl Biosci 10: 227–235.
62. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792–1797.
63. Wiens JJ (2003) Incomplete taxa, incomplete characters and phylogenetic accuracy: Is there a missing data problem? J Vertebr Paleontol 23: 297–310.
64. Kearney M (2002) Fragmentary taxa, missing data, and ambiguity: Mistaken assumptions and conclusions. Syst Biol 51: 369–381.
65. Bininda-Emonds ORP, editor (2004) Phylogenetic supertrees: Combining information to reveal the tree of life. New York: Springer. 550 p.
66. Chen D, Eulenstein O, Fernandez-Baca D (2004) Rainbow: A toolbox for phylogenetic supertree construction and analysis. Bioinformatics 20: 2872–2873.
67. Pachter L, Speyer D (2004) Reconstructing trees from subtree weights. Appl Math Lett 17: 615–621.
68. Pachter L, Sturmfels B, editors (2005) Algebraic statistics for computational biology. Cambridge: Cambridge University Press. In press.
69. Forney L, Zhou X, Brown C (2004) Molecular microbial ecology: Land of the one-eyed king. Curr Opin Microbiol 7: 210–220.
70. Curtis TP, Sloan WT, Scannell JW (2002) Modelling prokaryotic diversity and its limits. Proc Natl Acad Sci U S A 99: 10494–10499.
71. Breitbart M, Felts B, Kelley S, Mahaffy JM, Nulton J, et al. (2004) Diversity and population structure of a near-shore marine-sediment viral community. Proc Biol Sci 271: 565–574.
72. Cann AJ, Fandrich SE, Heaphy S (2005) Analysis of the virus population present in equine faeces indicates the presence of hundreds of uncharacterized virus genomes. Virus Genes 30: 151–156.

