Figure 1.
Phylogenetic distribution of organisms selected for the metagenomic simulation.
A phylogenetic tree of the three domains with representative groups is shown. Organisms used in this study are indicated by *. The organisms used represent all known domains of life and include four bacterial phyla and a variety of genome sizes, GC compositions, and cell wall types. Large font size indicates clades where multiple isolate genomes have been collapsed into a single leaf node.
Table 1.
Characteristics of organisms in the simulated metagenome.
Figure 2.
Outline of the steps involved in the creation and sequencing of the simulated metagenomic samples.
Table 2.
Mapped reads.
Table 3.
Sequence statistics by library.
Figure 3.
Predicted and observed frequencies of sequence reads from each organism.
The fraction of reads assigned to each organism for each sample preparation method is shown at top. The fraction expected from the measured quantities of mixed DNA from each organism, assuming unbiased library preparation and sequencing, is given as “DNA quantification”, and the fraction of reads predicted from cell count and genome size is given as “cc*gs prediction.” Sampling error was estimated assuming a multinomial distribution (not shown) and indicated that estimates of relative abundance are accurate to within ±5% for dominant organisms given the number of Sanger reads obtained, and to within ±1% for pyrosequencing reads. Note that the top two bars, labeled Enz+Pyrosequencing and Enz+Sanger, compare Sanger and pyrosequencing technology on the same extracted DNA.
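As a rough illustration of the two quantities named in this caption, the sketch below computes a “cc*gs”-style prediction and a multinomial sampling error. It assumes the prediction is the product of cell count and genome size, normalized across organisms, and it approximates the sampling error by the standard error of a single multinomial category, sqrt(p(1 − p)/N). Both formulas, the function names, and all input values are assumptions made for illustration; they are not taken from the study.

```python
import math


def predicted_read_fractions(cell_counts, genome_sizes):
    """Expected read fraction per organism, assuming reads are drawn in
    proportion to total DNA (cell count * genome size). Assumed formula."""
    dna = {org: cell_counts[org] * genome_sizes[org] for org in cell_counts}
    total = sum(dna.values())
    return {org: amount / total for org, amount in dna.items()}


def multinomial_standard_error(p, n_reads):
    """Standard error of one category's observed fraction when n_reads
    reads are sampled from a multinomial with category probability p."""
    return math.sqrt(p * (1.0 - p) / n_reads)


# Hypothetical inputs, not values from the study.
cell_counts = {"org_a": 1.0e8, "org_b": 1.0e8, "org_c": 2.0e8}   # cells
genome_sizes = {"org_a": 2.0e6, "org_b": 6.0e6, "org_c": 1.0e6}  # base pairs

fractions = predicted_read_fractions(cell_counts, genome_sizes)
se_org_a = multinomial_standard_error(fractions["org_a"], n_reads=1000)

for org, frac in fractions.items():
    print(f"{org}: predicted fraction {frac:.3f}")
print(f"org_a standard error with 1000 reads: {se_org_a:.4f}")
```

Because the standard error shrinks with the square root of the read count, a library with many more reads yields tighter abundance estimates, which is consistent with the narrower error quoted for the pyrosequencing reads.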
Figure 4.
Additional sequence data for three of the simulated metagenomes.
Bars represent the observed frequency of organisms in sequenced metagenomes. We constructed and sequenced metagenomes according to the Enz, EnzBB, and DNeasy protocols using the long-term frozen isolate culture stocks with glycerol and without glycerol. Reads were mapped to reference genomes as described in Methods. The additional metagenomes differ somewhat from each of the original libraries. Such differences might be caused by variation across DNA preparations and sequencing runs, age of the frozen samples, or other factors. The libraries constructed using the DNeasy Kit produced the most consistent results.