Manuscript preparation: BWB AME MF DG GG HG BJH SKH KL BAM KM ES DVW GW SY YZ. Principal Investigator: BWB RAG SKH BAM KEN JFP GW RW. Protocol development: BWB DMC LF RF GG JAG BJH SKH KL BAM EM VM JFP XQ MCR YAR ES DT DVW GW BPY. Data production: EA LA SA YA LC DMC CF LC RLE GG LF RF JAG SGH SKH OH VJ CLK LL NJL BAM DMM MO CP JFP XQ MCR YAR ES YS DT DVW GW KCW BPY. Data processing: CA BWB EB MB TB LC DD DG CK KL SL BAM CP XQ NS PDS JRW KCW SY. Data analysis: BWB MB TZD AME ES MF BJH DG HG SKH TAH KL BAM KM JFP XQ NS PDS YS DVW GW KCW BPY SY YZ. Mock community development and validation: YA MB DMC TZD GG LH SKH KL BAM VM JFP XQ MCR ES PDS YS DT DVW GW BPY SY. Data submission: CA LA TB TZD AME DG BJH SGH JC CK DK NJL KR NS DVW JRW KW.
¶ Membership of the Jumpstart Consortium Human Microbiome Project Data Generation Working Group is provided in the Acknowledgments.
The authors have declared that no competing interests exist.
The Human Microbiome Project will establish a reference data set for analysis of the microbiome of healthy adults by surveying multiple body sites from 300 people and generating data from over 12,000 samples. To characterize these samples, the participating sequencing centers evaluated and adopted 16S rDNA community profiling protocols for ABI 3730 and 454 FLX Titanium sequencing. In the course of establishing protocols, we examined the performance and error characteristics of each technology, and the relationship of sequence error to the utility of 16S rDNA regions for classification- and OTU-based analysis of community structure. The data production protocols used for this work are those used by the participating centers to produce 16S rDNA sequence for the Human Microbiome Project. Thus, these results can be informative for interpreting the large body of clinical 16S rDNA data produced for this project.
The human body is host to an abundant and complex diversity of microbial life
The NIH Roadmap Human Microbiome Project (HMP) has undertaken a large-scale, culture-independent census of the microbiota of healthy adults that will describe the members of human-associated communities and establish the extent to which these communities, or their constituents, are shared between individuals and body sites
For the HMP, the prolonged period over which the samples were collected and sequenced, and the participation of multiple sequencing centers, created an unprecedented need for standardization and benchmarking of 16S rDNA (16S) profiling methods. In the course of preparing for the data production phase of the project, the sequencing centers generated abundant sequence data from a synthetic microbial ('mock') community (MC), as well as from a set of clinical samples from several body regions. Data generated from the MC were invaluable for the development of ChimeraSlayer, a tool for detecting chimeric 16S rRNA reads
When the HMP was initiated, ABI 3730 and 454 FLX Titanium platforms were both in use at the participating centers. Thus, the analysis herein frequently compares the two data types. In the process of establishing molecular and analytic workflows, the centers constructed a synthetic, or ‘mock’ community (MC) composed of 21 archaeal/bacterial species representing 18 genera (Materials and Methods). All MC members have finished reference genomes and represent a range of %GC content and phylogenetic diversity. This MC provided a defined standard to benchmark the accuracy of our data with respect to community composition. In addition, comparison of our 16S data to the reference sequences allowed us to directly assess sequence quality. All centers sequenced the MC in duplicate (3730) or in triplicate (454). Multiple amplicons were targeted for sequencing, spanning different regions of the 16S rDNA (
On a schematic representation of the 16S rDNA gene, the known variable regions and the primers used in this study are indicated. Positions and numbering are based on the
We first examined the observed relative abundance of each community member by using BLAST to compare all reads against a reference set of 16S sequences that was derived from deeply sequenced 16S clone libraries prepared from each organism in the MC and which captured the sequence diversity at all 16S loci. We were able to reliably detect all MC members in all data sets except for the sole archaeal member,
The left panel shows classification based on BLASTn against reference sequences of the MC members. A sequence is classified if it has >95% global sequence identity with one of the reference sequences and >90% of the read is contained in the alignable region. Results are shown as a heatmap depicting the frequency values, using a binary logarithm scale. The middle heatmap illustrates frequency values of taxa identified using the RDP classification tool, applying an 80% confidence cutoff. The right panel shows the difference between RDP- and BLASTn-based classification, with a heatmap representing the ratio of observed genus-level frequency (RDP) over expected genus-level frequency (BLASTn) for each of the MC members, using a binary logarithm scale.
The MC was sequenced by different centers on both 3730 and 454 platforms. Each sequencing trial is represented as a column. For 3730 sequencing of the V1–V9 window, amplicons derived from a common amplification protocol were sequenced with short capillaries (1), long capillaries (2), and three reads per clone (3). 454 sequencing was performed by four centers (A, B, C, and D) with three 16S windows (V1–V3, V3–V5, and V6–V9). (A) The observed genus-level frequency data over expected genus-level frequency ratio for each of the MC members is shown as a heatmap using a binary logarithm scale. The expected frequency ratio is based on the whole genome coverages inferred from mapped Illumina WGS reads to the MC reference genome sequences. Genera with observed frequencies differing more than four-fold from expected are marked with + or – for over- or under-representation, respectively. (B) The fraction of misclassified (0.1% of the total combined data set) and unclassified (4.6% of the total combined data set) sequences displayed as a frequency heatmap. The frequency values are depicted as a binary logarithm scale.
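The observed-over-expected comparison described above can be illustrated with a minimal sketch. It assumes that genus-level counts (observed) and coverage-derived abundances (expected) are available as plain dictionaries. The function name `log2_ratios` and the input layout are hypothetical and are not taken from the pipeline used in this work; only the binary logarithm scale and the four-fold threshold come from the text.

```python
import math


def log2_ratios(observed, expected, fold_cutoff=4.0):
    """Return {genus: (log2(observed/expected), flag)}.

    observed, expected: dicts mapping genus -> count or abundance
    (hypothetical input layout). Both are normalized to frequencies first.
    Expected values are assumed to be strictly positive.
    flag is '+' for over-representation beyond fold_cutoff, '-' for
    under-representation beyond fold_cutoff, and '' otherwise.
    """
    obs_total = sum(observed.values())
    exp_total = sum(expected.values())
    limit = math.log2(fold_cutoff)  # four-fold corresponds to 2 on a log2 scale
    result = {}
    for genus, exp_value in expected.items():
        exp_freq = exp_value / exp_total
        obs_freq = observed.get(genus, 0) / obs_total
        if obs_freq == 0:
            # Genus not detected at all: treat as maximally under-represented.
            result[genus] = (float("-inf"), "-")
            continue
        ratio = math.log2(obs_freq / exp_freq)
        if ratio > limit:
            flag = "+"
        elif ratio < -limit:
            flag = "-"
        else:
            flag = ""
        result[genus] = (ratio, flag)
    return result


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not data from this study.
    demo = log2_ratios({"A": 80, "B": 16, "C": 4}, {"A": 1, "B": 1, "C": 2})
    for genus, (ratio, flag) in sorted(demo.items()):
        print(genus, round(ratio, 2), flag)
```

Working on the log2 scale makes over- and under-representation symmetric around zero, which is why a four-fold deviation in either direction maps to the same absolute threshold.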
We reexamined the MC data using a naive Bayesian classification-based regime appropriate for samples of unknown community composition (Materials and Methods). For this analysis, reads that were greater than 200 nucleotides in length and classified with 80% confidence to a genus using the RDP classifier
We then sought to establish the quantitative accuracy of the observed community compositions across each technology. While preparing this analysis, we observed concerning disparities between the calculated expected abundances of the members of the MC previously described
Data from all MC members, across all experiments, both 3730 and 454, exhibited 1.4-fold difference from expected and only 1.07-fold (
While overall accuracy was generally very good, there were notable differences from expected compositions among members of the MC. There were differences in MC representation that correlated with sequencing center or with window (
The 20 bacterial organisms of the Mock Community are represented by corresponding genus (n = 18) along the bottom of the figure, and across the four panels (DNA from
Further exploration of reads classified as
The phylogenetic trees were created starting from the full-length reference sequences that were used to train RDP’s taxonomic scheme version 5 for
Lack of accurate PCR amplification explained the majority of the unclassified 16S reads. Approximately 80% of the unclassified 3730 and 454 data (
We then examined the cumulative error frequency distribution and frequency of error types for the six read types we produced (
(A) For all the quality- and chimera-filtered 3730 and 454 sequences generated for the MC sample, an alignment-based estimation of errors, including insertions, deletions, and substitutions, was performed. For each of the different sequencing approaches, the cumulative frequency distribution of the percent error per sequence is shown for assembled 3730 sequences generated with short capillaries (green), long capillaries (red), and three reads per clone (yellow), and 454 reads spanning the variable regions V1–V3 (light blue), V3–V5 (dark blue), and V6–V9 (fuchsia). A vertical line at 1% was added as a visual aid for the upper limit of an acceptable error threshold. (B) Boxplots show the average percentage of errors per read, per sequencing approach and per error type, including substitutions, insertions, and deletions. Outliers are not shown.
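As an illustration of the quantities plotted in this figure, the following sketch computes a per-read percent error from counts of substitutions, insertions, and deletions, and then the cumulative fraction of reads at or below a set of error thresholds. Normalizing by aligned length is an assumption made here for illustration, since the exact denominator is not stated in this section. The function names are hypothetical.

```python
def percent_error(substitutions, insertions, deletions, aligned_length):
    """Percent error for one read, assuming errors are normalized by the
    aligned length (an assumption; the denominator is not specified here)."""
    return 100.0 * (substitutions + insertions + deletions) / aligned_length


def cumulative_fraction(error_percents, thresholds):
    """For each threshold, return the fraction of reads whose percent error
    is less than or equal to that threshold."""
    n = len(error_percents)
    return [sum(1 for e in error_percents if e <= t) / n for t in thresholds]


if __name__ == "__main__":
    # Toy per-read error counts: (substitutions, insertions, deletions, length).
    reads = [(0, 0, 0, 400), (1, 1, 0, 400), (2, 1, 1, 400), (5, 2, 1, 400)]
    errors = [percent_error(*r) for r in reads]
    print(cumulative_fraction(errors, [0.5, 1.0]))
```

Reading the cumulative curve at the 1% threshold gives the fraction of reads that would pass the acceptable-error limit marked by the vertical line.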
To visualize where sequencing errors were concentrated along the length of the 16S sequence for each sequencing technology, a root mean square deviation (RMSD) plot was generated for (A) 3730 sequence and (B) 454 read data. The RMSD plot is a graphical representation of the differences in nucleotide distribution between a reference sequence and the samples of interest, for each position along the length of the reference. This figure shows the results for
Unfortunately, error rates cannot be used to pre-filter inaccurate reads unless the parent reference sequences are known. We attempted to determine simple read quality characteristics that could be used to identify inaccurate sequences without relying on more advanced read filtering or denoising approaches. We point to other groups actively advancing methods for data filtering
We explored, first, how different data types, differentiated by technology or 16S window, impacted our ability to classify data for the MC and, second, how this compared to clinical samples taken from four body regions: gut, oral cavity, skin and vagina. In the process of removing detectable chimeras from all data sets prior to taxonomic analysis, we observed that the proportion of chimeras varied markedly between different samples and sequencing platforms (
Observed chimera content (%)
Samples | ABI 3730, V1–V9 | 454 FLX Titanium, V1–V3 | 454 FLX Titanium, V3–V5 | 454 FLX Titanium, V6–V9 |
MC | 5.99±3.07 | 14.26±10.34 | 14.75±9.45 | 13.49±8.52 |
gut | 7.71±6.46 | 22.90±8.56 | 16.03±2.86 | 17.76±3.76 |
oral | 7.22±6.35 | 20.55±11.73 | 10.98±4.01 | 9.10±5.02 |
skin | 3.49±5.77 | 11.15±1.36 | 7.51±2.49 | 5.73±1.69 |
vaginal | 6.31±6.64 | 12.60±6.70 | 6.62±3.51 | 3.00±1.65 |
Values are averages ± STDEV calculated from multiple replicates of MC, and from replicates of multiple clinical samples originating from different body sites.
We compared the relative taxonomic “classifiability” of 16S data from the MC and each clinical sample and, consistent with what we observed previously, all non-chimeric data from the MC exhibited >95% genus level classifiability and 100% at the order level. Among these data, the 3730 sequences and the 454 reads from V1–V3 exhibited the greatest classifiability and the 454 V6–V9 reads the lowest (
The fraction of successfully classified 3730 and 454 sequences obtained from the MC (A) and clinical samples representing four major body regions (B) is plotted at different taxonomic levels from genus to phylum. Classification was performed on quality and chimera-filtered sequences and considered to be successful if the RDP Classifier result had a confidence score above 80%. In panel B, 454 results include only window V3–V5.
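The classification-success measure described above can be sketched as follows. The sketch assumes that each read's classifier output has already been parsed into a dictionary mapping rank to a (taxon, confidence) pair, with confidence expressed as a fraction. That input layout and the function name `fraction_classified` are hypothetical and do not reproduce the RDP Classifier's own output format.

```python
RANKS = ("phylum", "class", "order", "family", "genus")


def fraction_classified(assignments, cutoff=0.8):
    """Fraction of reads classified at each rank.

    assignments: list of dicts, one per read, mapping rank -> (taxon, confidence).
    A read counts as classified at a rank when its confidence at that rank
    is strictly above the cutoff (0.8 corresponds to the 80% score in the text).
    """
    total = len(assignments)
    counts = dict.fromkeys(RANKS, 0)
    for per_read in assignments:
        for rank in RANKS:
            _, confidence = per_read.get(rank, (None, 0.0))
            if confidence > cutoff:
                counts[rank] += 1
    return {rank: counts[rank] / total for rank in RANKS}


if __name__ == "__main__":
    # Toy assignments for illustration only; they are not data from this study.
    reads = [
        {"phylum": ("P1", 1.0), "class": ("C1", 1.0), "order": ("O1", 1.0),
         "family": ("F1", 1.0), "genus": ("G1", 0.9)},
        {"phylum": ("P1", 1.0), "class": ("C1", 1.0), "order": ("O1", 1.0),
         "family": ("F1", 0.85), "genus": ("G2", 0.5)},
    ]
    print(fraction_classified(reads))
```

Because classifier confidence generally decreases toward finer ranks, the fractions computed this way are expected to fall from phylum to genus.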
A phylogenetic tree constructed with 16S sequences from RDP’s training set (light blue, n = 34), publicly available genomes from human isolates (green, n = 26), publicly available HMP genomes (dark blue, n = 44), and sequences from aggregate stool samples that could be classified at the genus level (dark grey, n = 63) and that remained unclassified at the genus level (light grey, n = 408).
Although 3730-derived sequences from our clinical samples were generally more successfully classified than the shorter 454 reads, the difference was modest for stool, skin and vaginal samples, corresponding to just a few percent. The exception, however, was the data from oral samples where the 3730 sequences demonstrated 10% greater classifiability than the shorter 454 windows. Previously, it was shown that smaller 16S windows generally negatively impacted classification success
For each of the HMP body regions, the average frequency of a given bacterial family (y-axis) is plotted against the contribution of that family to the unclassified sequences (x-axis) for (B) 3730 and (C) 454. Only window V3–V5 is presented in the 454 results. Classification was performed on quality- and chimera-filtered sequences, and classifications were assigned only if the RDP Classifier result had a confidence score above 80%.
Classification-based methods can oversimplify or miss diversity not represented in the reference taxonomy. Alternatively, evaluation of diversity within a sample by clustering sequences into operational taxonomic units (OTUs) that are defined by sequence similarity thresholds can provide greater resolution. We calculated OTUs at 97% similarity in the MC data in which 21 species were expected to cluster into 18 OTUs (
The number of observed OTUs in the MC is shown as the function of the number of 3730 (A) and 454 (B) sequences, before filtering (black), after quality filtering (green, 454 only), and after combined quality and chimera filtering (red). Rarefaction curves were generated using mothur, with an OTU defined at 97% similarity. (A) For 3730, separate lines show the rarefaction curves for the three different sequencing approaches. (B) For 454, rarefaction curves for the three 16S windows spanning the variable regions V1–V3, V3–V5, and V6–V9 are shown separately, and analysis was performed on a random 10,000-sequence subset from each sample.
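The idea behind a rarefaction curve can be sketched as repeated random subsampling without replacement. The sketch assumes that each sequence has already been assigned to an OTU (for example, at 97% similarity) and that the assignments are available as a list of labels. It is a simplified illustration and is not mothur's implementation. The function name and parameters are hypothetical.

```python
import random


def rarefaction_curve(otu_labels, step=100, iterations=100, seed=0):
    """Average number of distinct OTUs observed at increasing sampling depths.

    otu_labels: list with one OTU label per sequence (hypothetical input).
    Returns a list of (depth, mean_observed_otus) pairs.
    """
    rng = random.Random(seed)
    n = len(otu_labels)
    depths = list(range(step, n + 1, step))
    if not depths or depths[-1] != n:
        depths.append(n)  # always include the full data set as the last point
    totals = [0] * len(depths)
    for _ in range(iterations):
        shuffled = list(otu_labels)
        rng.shuffle(shuffled)  # one random ordering = nested subsamples
        seen = set()
        idx = 0
        for i, label in enumerate(shuffled, start=1):
            seen.add(label)
            if i == depths[idx]:
                totals[idx] += len(seen)
                idx += 1
    return [(d, t / iterations) for d, t in zip(depths, totals)]


if __name__ == "__main__":
    # Toy community: 3 OTUs with uneven abundance, 1,000 sequences.
    labels = ["otu1"] * 900 + ["otu2"] * 90 + ["otu3"] * 10
    for depth, mean_otus in rarefaction_curve(labels, step=250):
        print(depth, round(mean_otus, 2))
```

A curve that keeps rising with depth, rather than reaching a plateau near the expected number of OTUs, is the pattern that signals spurious OTUs in a community of known composition.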
The 3730 sequencer capillary length had a profound effect on estimates of diversity. Longer capillaries resulted in more accurate, but still greatly inflated, diversity estimates. The three 454 windows yielded similarly inflated diversity estimates (
The scope of the HMP, to profile the microbiome of 300 individuals at numerous body sites over a prolonged period of time and at multiple research sites, imposed a significant obligation to define aspects of data production and data quality that contribute to the consistency, accuracy and utility of the data generated. In the course of establishing protocols (
Here, we generated data using the conventional, long-read 3730 platform and the shorter-read, higher-throughput 454 platform. Unlike reference genome sequencing, where assembly of individual reads produces high-quality consensus sequence, each individual 454 read or assembled 3730 read pair stands separately without the benefit of error correction or removal of anomalous reads by consensus methods. A key facet of the work presented was using a known control, the MC, which allowed us to directly characterize the features contributing to erroneous interpretation of sequence data, and explore simple filters that could in turn be applied to clinical samples of unknown composition.
The primary goal of the HMP, however, is to compare communities within different samples and both 3730 and 454 sequencing will suffice for this purpose. The tremendous cost advantage of 454, because it permits characterization of more samples at greater read depth, cemented its selection as the platform of choice for the 16S production phase of the HMP. Although Illumina sequencing platforms were not a viable option at the time these data were generated, similar advantages of cost and depth are currently driving rapid adoption of Illumina-based approaches. Recent work has reported on the applicability of Illumina sequencing to 16S rDNA studies
We noted the quality of 3730 data varied. The highest quality data were generated using longer capillaries, with assembled, overlapping reads from each end of the near full-length amplicon. For the shorter reads generated by 454 sequencing, the highest quality data were generated using the V1–V3 and V3–V5 windows. When applied to the MC, the V6–V9 window performed poorest, producing the greatest diversity overestimations in the OTU analysis, lower classifiability and higher error rates. V6–V9 also performed less consistently in inter-center comparisons. When applied to identical clinical samples, the V1–V3 and V3–V5 windows produced different representations of the communities and varied in their sensitivity to different organisms. For example, V1–V3 failed to adequately amplify members of the
We clearly illustrate that sequence artifacts can result in mis-classification of reads. It was recently demonstrated that identical chimeric artifacts can be reproduced across independent experiments and were abundant in data from both 3730 and 454 sequencing
Sequencing MC 16S rDNA demonstrated that both 3730- and 454-produced data overestimate species richness to a similar extent. After filtering sequences with an excessive number of low-quality bases and chimeric sequences, the near full-length, assembled 3730 sequences produced data that accurately reflected species richness while the shorter 454 reads still yielded spurious OTUs.
The informatic processing of read data is a significant component of 16S rDNA work. We applied simple filtering metrics in combination with recently developed chimera detection algorithms
The results presented, along with the works of others, demonstrate that all facets of data production and data processing can generate artifacts that bias the representation of community membership. This presents an opportunity for the research community to investigate how to better consolidate and advance approaches for metagenomic studies. As additional sequencing technologies are applied to community metagenomics, it will be critical to benchmark and standardize against defined references so that the research community can leverage the combined data sets efficiently and effectively to obtain greater insights.
The organisms for the mock community (MC) include a variety of different genera commonly found on or within the human body. The MC composition has been described elsewhere
Clinical samples were collected non-invasively at Baylor College of Medicine in Houston, TX and Washington University in St. Louis, MO. IRB approval for the clinical samples used in this study was granted by Baylor College of Medicine (IRB Approval #22895) and The Human Research Protection Office of Washington University in St. Louis (IRB Approval #08-0754). The collecting institutions obtained written, informed consent from all participants. DNA from clinical samples was provided to the sequencing centers. Information describing the collection and extraction of DNA from clinical samples, documents representing the consent forms used, and supplemental study information are available on the HMP Data Analysis and Coordination Center website (
Samples were amplified and sequenced according to the “HMP 3730 16S Protocol Version 1.1” (
Samples were amplified and sequenced according to the “HMP 454 16S Protocol Version 4.2” (Protocol S2). The protocol is available on the HMP Data Analysis and Coordination Center website (
Default processing of 3730 16S rRNA sequences: Sequences derived from a single clone were assembled using AmosCmp16Spipeline
Default processing of 454 16S rRNA sequences: Sequences were processed using mothur v.1.6.0
For detection of chimeric sequences, all 16S rDNA sequences were aligned using NAST-iEr
Naïve Bayesian classification: RDP classifier (v2.2) software was used to classify the sequences according to the taxonomy proposed by Garrity et al.
BLAST-based classification: The identities of the 16S sequences were determined by creating a BLAST database of the genomes representing all organisms included in the mock community and then performing a BLASTn alignment (97% identity and 90% coverage) of the 16S sequences to the database. These results were parsed to obtain the top hit for each sequence and the top hits were counted to obtain the number of sequences matching each genome.
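The parsing step described above can be sketched as follows, assuming the BLASTn results were written in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). Several choices here are assumptions for illustration and are not the authors' script: read lengths are supplied separately as a dictionary, coverage is computed over the query, both thresholds are inclusive, and the top hit is the passing hit with the highest bit score.

```python
from collections import Counter


def count_top_hits(blast_lines, read_lengths, min_identity=97.0, min_coverage=90.0):
    """Count reads per subject using each read's best passing hit.

    blast_lines: iterable of tab-separated lines in 12-column tabular format.
    read_lengths: dict mapping query ID -> read length (hypothetical input).
    """
    best = {}  # query ID -> (bit score, subject ID)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        bitscore = float(fields[11])
        # Percent of the read covered by the alignment.
        coverage = 100.0 * (abs(qend - qstart) + 1) / read_lengths[query]
        if identity < min_identity or coverage < min_coverage:
            continue
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    return Counter(subject for _, subject in best.values())


if __name__ == "__main__":
    # Toy hits for illustration only; IDs and values are made up.
    hits = [
        "r1\tgenomeA\t99.0\t250\t2\t0\t1\t250\t100\t349\t1e-100\t450",
        "r1\tgenomeB\t97.5\t250\t6\t0\t1\t250\t5\t254\t1e-90\t400",
        "r2\tgenomeB\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-80\t350",
    ]
    print(count_top_hits(hits, {"r1": 250, "r2": 200}))
```

Keeping only one hit per read ensures that each read contributes exactly once to the per-genome counts, so the counts can be converted directly into relative abundances.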
The MC was subjected to WGS sequencing on the Illumina platform to generate 240,935,824 101-nt reads. The Burrows-Wheeler Aligner (BWA)
PCoA was performed on the frequencies of identified genera, with unclassified reads excluded from the analysis. The covariance matrix of the data was used to construct the eigenvectors
Rarefaction curves were generated with mothur
For each of the organisms in the mock community, the available 16S reads were subjected to AMOScmp
Data generated for this work can be accessed at
The NCBI bioproject ID numbers corresponding to figures within this work are as follows.
HMP 3730 16S Protocol Version 1.1.
(PDF)
HMP 454 16S Protocol Version 4.3.
(PDF)
Read Counts for 454 data in
(DOCX)
Read Counts for 3730 data in
(DOCX)
Broad Barcoded Oligos (V1–V3).
(DOCX)
Broad Barcoded Oligos (V3–V5).
(DOCX)
Broad Barcoded Oligos (V6–V9).
(DOCX)
We gratefully thank Peter Turnbaugh and Jeff Gordon for providing stool sample DNAs TS1, TS4 and TS25; Elizabeth Hansen and Jeff Gordon for providing
Doyle V. Ward1, Dirk Gevers1, Georgia Giannoukos1, Ashlee M. Earl1, Barbara A. Methé6, Erica Sodergren2, Michael Feldgarden1, Dawn M. Ciulla1, Diana Tabbaa1, Cesar Arze11, Elizabeth Appelbaum2, Leigh Aird1, Scott Anderson1, Tulin Ayvaz3, Edward Belter2, Monika Bihan5, Toby Bloom1, Jonathan Crabtree11, Laura Courtney2, Lynn Carmichael2, David Dooling2, Rachel L. Erlich1, Candace Farmer2, Lucinda Fulton2, Robert Fulton2, Hongyu Gao2, John A. Gill8, Brian J. Haas1, Lisa Hemphill4, Otis Hall2, Susanna G. Hamilton1, Theresa A. Hepburn1, Niall J. Lennon1, Vandita Joshi4, Cristyn Kells1, Christie L. Kovar4, Divya Kalra4, Kelvin Li5, Lora Lewis4, Shawn Leonard2, Donna M. Muzny4, Elaine Mardis2, Kathie Mihindukulasuriya2, Vincent Magrini2, Michelle O’Laughlin2, Craig Pohl2, Xiang Qin4, Keenan Ross1, Matthew C. Ross3, Yu-Hui A. Rogers8, Navjeet Singh10, Yue Shang4, Katarzyna Wilczek-Boney4, Jennifer R. Wortman11, Kim C. Worley4, Bonnie P. Youmans3, Shibu Yooseph7, Yanjiao Zhou2, Patrick D. Schloss9, Richard Wilson2, Richard A. Gibbs4, Karen E. Nelson6, George Weinstock2, Todd Z. DeSantis10, Joseph F. Petrosino3,4, Sarah K. Highlander3,4, Bruce W. Birren1
1 Broad Institute, Cambridge, Massachusetts, United States of America
2 The Genome Institute, Washington University School of Medicine, St. Louis, Missouri, United States of America
3 Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, Texas, United States of America
4 Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
5 J. Craig Venter Institute, Rockville, Maryland, United States of America
6 Human Genomic Medicine, J. Craig Venter Institute, Rockville, Maryland, United States of America
7 J. Craig Venter Institute, San Diego, California, United States of America
8 Sequencing, J. Craig Venter Institute, Rockville, Maryland, United States of America
9 Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, United States of America
10 Ecology Department, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
11 Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland, United States of America