Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Microbial resolution of whole genome shotgun and 16S amplicon metagenomic sequencing using publicly available NEON data

  • Kyle D. Brumfield,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Maryland Pathogen Research Institute, University of Maryland, College Park, Maryland, United States of America, University of Maryland Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland, United States of America

  • Anwar Huq,

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – review & editing

    Affiliation Maryland Pathogen Research Institute, University of Maryland, College Park, Maryland, United States of America

  • Rita R. Colwell,

    Roles Conceptualization, Methodology, Project administration, Resources, Supervision, Writing – review & editing

    Affiliations Maryland Pathogen Research Institute, University of Maryland, College Park, Maryland, United States of America, University of Maryland Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland, United States of America, CosmosID Inc., Rockville, MD, United States of America

  • James L. Olds,

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing – review & editing

    Affiliation Schar School, George Mason University, Arlington, Virginia, United States of America

  • Menu B. Leddy

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Supervision, Validation, Writing – review & editing

    menuleddy@gmail.com

    Affiliation Essential Environmental and Engineering Systems, Huntington Beach, California, United States of America

Microbial resolution of whole genome shotgun and 16S amplicon metagenomic sequencing using publicly available NEON data

  • Kyle D. Brumfield, 
  • Anwar Huq, 
  • Rita R. Colwell, 
  • James L. Olds, 
  • Menu B. Leddy
PLOS
x

Abstract

Microorganisms are ubiquitous in the biosphere, playing a crucial role in both biogeochemistry of the planet and human health. However, identifying these microorganisms and defining their function are challenging. Widely used approaches in comparative metagenomics, 16S amplicon sequencing and whole genome shotgun sequencing (WGS), have provided access to DNA sequencing analysis to identify microorganisms and evaluate diversity and abundance in various environments. However, advances in parallel high-throughput DNA sequencing in the past decade have introduced major hurdles, namely standardization of methods, data storage, reproducible interoperability of results, and data sharing. The National Ecological Observatory Network (NEON), established by the National Science Foundation, enables all researchers to address queries on a regional to continental scale around a variety of environmental challenges and provide high-quality, integrated, and standardized data from field sites across the U.S. As the amount of metagenomic data continues to grow, standardized procedures that allow results across projects to be assessed and compared is becoming increasingly important in the field of metagenomics. We demonstrate the feasibility of using publicly available NEON soil metagenomic sequencing datasets in combination with open access Metagenomics Rapid Annotation using the Subsystem Technology (MG-RAST) server to illustrate advantages of WGS compared to 16S amplicon sequencing. Four WGS and four 16S amplicon sequence datasets, from surface soil samples prepared by NEON investigators, were selected for comparison, using standardized protocols collected at the same locations in Colorado between April-July 2014. The dominant bacterial phyla detected across samples agreed between sequencing methodologies. However, WGS yielded greater microbial resolution, increased accuracy, and allowed identification of more genera of bacteria, archaea, viruses, and eukaryota, and putative functional genes that would have gone undetected using 16S amplicon sequencing. NEON open data will be useful for future studies characterizing and quantifying complex ecological processes associated with changing aquatic and terrestrial ecosystems.

Introduction

Over the past decade, interest in total microbial community composition and dynamics of complex environments has increased significantly. This is because the estimated total number of microbial cells in the earth’s biosphere exceeds 1030 [1], and the microbes themselves harbor potentially up to an additional 1031 phages [2]. All of which have helped shape the planet and its biosphere [3]. Diverse microbial communities flourish in a wide spectrum of complex environments ranging from the human gut [4], rhizosphere [5], and conventionally inhospitable habitats, such as geothermal hot springs [6] and Antarctic volcano mineral soils [7]. Furthermore, microbial activities play a critical role in the biogeochemistry of the planet [8,9] and wellbeing of macroorganisms [10].

Traditionally, microbial communities have been defined using culture dependent methods to detect and enumerate microorganisms. However, it is estimated that the vast majority of prokaryotic genospecies remain uncultured [11], and genomes of uncultured microorganisms encode a largely untapped reservoir of novel metabolites and metabolic processes [12]. Accordingly, the field of metagenomics has developed rapidly and effectively obviates the need to isolate and culture microorganisms by utilizing the genetic material of a sample to identify accurately the functional gene composition [12,13]. That major accomplishment has allowed in depth comparison and exploration of microbial ecology [14,15], including the metabolic profile of complex microbial ecosystems [16,17].

Since the emergence of metagenomics, where DNA is sequenced directly from environmental samples, sequencing for microbial identification has evolved to include a variety of approaches. The polymerase chain reaction (PCR) is one of the fundamental methods currently being used for taxonomic identification, commonly employing amplification of variant regions in macromolecules conserved across species [18]. PCR-based metagenomics is now routine in gene prospecting by direct amplification of specific genes [19] or colony PCR to screen metagenomic libraries [20]. Moreover, PCR amplification of specific genes is used to evaluate microbial species diversity based on sequence composition. The use of 16S ribosomal RNA (rRNA) genes—that occur in one or more copies in most bacterial and archaeal genomes [21] and also present in mitochondrial genomes [22]—is widely recognized as the ‘Gold Standard’ for prokaryotic identification. The 16S rRNA gene sequencing method generally employs universal PCR primers to amplify hypervariable regions of the 16S rRNA gene that infer taxonomic identification by bioinformatic alignment against various rRNA sequence databases [2326], such as the Ribosomal Database Project (RDP) [27], SILVA Ribosomal RNA Gene Database Project [28], or Greengenes [29] databases.

With advances being made in DNA sequencing technology, the cost of sequencing has decreased and whole genome shotgun metagenomic sequencing is attractive for many laboratories to study all of the genes in all organisms present in uncultured microbial communities in complex samples [30]. Instead of targeting specific genomic markers, total DNA is extracted and sheared into fragments that are independently sequenced and aligned, for taxonomic identification to genomic databases, such as the Reference Sequence (RefSeq) [31], GenBank [32], or Pathosystems Resource Integration Center (PATRIC) [33] databases. DNA whole genome shotgun metagenomics has also been complemented with metatranscriptomic or metaproteomic approaches to describe microbial function [34,35]. Accordingly, curated databases for genome annotation sequences, e.g., Subsystems ontology [36], and protein, e.g., SwissProt [37], have been established.

Development of the high-throughput analytical strategies has changed data handling and processing for microbiology. A single biological sample can now be processed in parallel to generate high-throughput data composed of genome sequences, gene and protein expression patterns, or metabolite fluxes, which each require unique postgenomic computer manipulation for analysis [38]. As a result, numerous efforts [3840], specially the Science Commons Protocol for Implementing Open Access Data [41], have surfaced that strive to initiate the regulation of open-access protocols for data management and sharing. However, standardization of laboratory procedures to ensure quality and interoperability of ‘Big Data’ produced through sequencing, notably amplicon and shotgun metagenomics, remains critical [42,43]. The number of metagenomic datasets has increased dramatically, creating a need for standardized operating procedures for sample collection, processing, and data storage. It is now very important to be able to compare sample sequences to a known database, thereby providing information for subsequent analyses, including taxonomic identification and comparison [13].

One of the goals of the National Ecological Observatory Network (NEON) is to enable researchers to ask questions on a regional to continental scale that involve a variety of environmental challenges and provide high-quality, integrated, and standardized data derived from standard field sites across the United States. The NEON soil microbe metagenome and marker gene sequences projects are derived from soil microbial sampling, contain quality-controlled metadata and results for the NEON shotgun metagenomic and 16S marker gene sequences, respectively [44]. In this study, we demonstrate effective use of publicly available NEON soil metagenomic sequencing datasets hosted on the open access Metagenomics Rapid Annotation using Subsystem Technology (MG-RAST) server [45] to assess the feasibility of employing openly sourced NEON data. We compare and contrast whole genome shotgun metagenomic sequencing and 16S amplicon sequencing for application in environmental metagenomics.

Materials and methods

Data type and sample selection

NEON provides open access to information obtained from soil and freshwater (surface and benthic) samples on their microbial content. All microbe metagenome (Neon Data Product ID = DP1.10107.001) and marker gene (NEON Data Product ID = DP1.10108.001) sequence data were previously uploaded to the open-submission MG-RAST data portal [45] for processing and analysis by NEON investigators using standardized protocols [44]. As of August 22, 2019, the MG-RAST server, version 4.0.3, hosted 66,454 public and 390,819 total metagenomes containing 1,498 billion DNA sequences (209.08 tera base pairs).

To select samples for the investigation reported here, preliminary searches were performed directly on the MG-RAST server using key words “NEON” and “National Ecological Observatory Network”. The initial search results returned 1,304 samples hosted by MG-RAST. These were further refined by ensuring that both “Amplicon” with a “target_gene” of “16S” and “WGS” sequencing methods were readily available under “sequence_type” for direct comparison. Additionally, the following criteria had to match across each of the retrieved sequences: 1) “collection_date” was from the same day; 2) “biome”, “biome_id”, and “feature” matched for each sample; and 3) the samples were collected from the same “location”. A total of 97 samples met these criteria, including 25 WGS samples and 51 amplicon samples. We further narrowed these search results by selecting four WGS and four amplicon sequencing method samples with the greatest number of sequencing reads from each year of collection and identified the area that contained the greatest number of entries. The refined samples selected for this study and their relevant metadata, including MG-RAST ID, NEON Data Product ID, NCBI BioProject ID, sequencing method, collection date, and collection location, are detailed in Table 1.

thumbnail
Table 1. Whole genome and 16S amplicon metagenomic datasets examined in this study.

https://doi.org/10.1371/journal.pone.0228899.t001

All whole genome and 16S amplicon metagenomic samples are a part of the NEON soil microbial metagenomic sequencing and NEON soil microbe marker gene sequencing projects (National Science Foundation, Grant #1638694, Grant ID MREFC), respectively, collected with the overall goal of tracking changes in the diversity, composition, and functional potential of microbiota in soil ecosystems through time and space. NEON collects surface soil samples to 30 cm depth. All samples were collected in a temperate grassland biome (biome_id = ENVO:01000193) with features of graminoid or herbaceous vegetation from Central Plains Experimental Range, Colorado, USA [44]. MG-RAST ID, NEON Data Product ID, NCBI BioProject ID, sequencing method, collection date, and location of the collection site are given. Metagenomic sequences were generated on the Illumina HiSeq and MiSeq instruments for WGS and 16S amplicon sequencing methods, respectively, and all samples are publicly available in the MG-RAST server and NCBI.

Identification employing metagenomic sequencing reads

Sample collection, DNA preparation, and sequencing were performed by NEON investigators using standardized operating procedures. Briefly, NEON samples were collected from surface soil down to 30 cm in depth, frozen on dry ice, and shipped to a NEON analytical facility for DNA extraction, sample preparation, and sequencing on Illumina HiSeq and MiSeq instruments for WGS and 16S amplicon sequencing methods, respectively. Table 2 details general analysis statistics, including the number of sequencing reads, mean sequence length, identified protein and rRNA features, and taxonomic hits distribution. Quality control (QC) thresholds were maintained internally through the MG-RAST automated processing pipeline, and no further QC cutoffs were applied to NEON sequences. The number of identified protein and rRNA features is a result of the contig lowest common ancestor (contigLCA) algorithm used by the MG-RAST automated pipeline to find a single consensus taxonomic entry for all features on each individual sequence, with default cutoffs for alignment length, e-value, and percent identity of the raw sequencing reads against the nonredundant M5NR database [46] that contains sequences and annotations from multiple publicly available sources to maintain two databases for protein and ribosomal sequence data. The MG-RAST annotation pipeline has the potential to map one read to multiple annotations and map one annotation to multiple reads. Therefore, “hits” are an estimate of the number of sequences that contain a given annotation, found by multiplying each database hit by the number of representatives in each cluster. Accordingly, “hits” refers to the number of unique database sequences that were found following a similarity search and not the number of reads. Therefore, the number of identified features can be smaller than the number of reads due to clustering or larger due to double counting. Protein database sources include GO, IMG, KEGG, NCBI (RefSeq and GenBank), SEED, UniProt, eggnog, and PATRIC and ribosomal database sources include RDP, SILVA, and Greengenes. Total taxonomic hits, and taxonomic hits for archaea, bacteria, eukaryota, and viruses, were determined using the contigLCA algorithm against the M5NR database for samples analyzed via WGS (MG-RAST metagenome identification numbers = mgm4637825.3, mgm4637821.3, mgm4637831.3, and mgm4637826.3). Analogously, total taxonomic hits, and taxonomic hits for archaea, bacteria, and eukaryota were determined for samples analyzed using 16S amplicon sequencing (MG-RAST metagenome identification numbers = mgm4783766.3, mgm4783759.3, mgm4778732.3, and mgm4778744.3).

thumbnail
Table 2. Sequence breakdown of quality predicted protein features, and total taxonomic hits of WGS and 16S amplicon sequencing samples included in this study.

https://doi.org/10.1371/journal.pone.0228899.t002

WGS and 16S amplicon sequencing samples were sequenced on Illumina HiSeq and MiSeq sequencing platforms, respectively. Quality control and predicted protein features are a result of the automated analysis generated by the MG-RAST pipeline. Total taxonomic hits were determined using a contigLCA algorithm to find a single consensus taxonomic entry for all features on each individual sequence with the default cutoffs for alignment length, e-value, and percent identity against the M5NR database for NEON samples sequenced via WGS (mgm4637825.3, mgm4637821.3, mgm4637831.3, and mgm4637826.3) and 16S amplicon (mgm4783766.3, mgm4783759.3, mgm4778732.3, and mgm4778744.3) methods, respectively. Database hits refer to the number of unique database sequences that were found following a similarity search employed by the MG-RAST annotation pipeline and not the number of reads. Therefore, the number of identified features can be smaller than the number of reads due to clustering or larger due to double counting. MG-RAST ID, sequence count, mean sequence length, identified protein features, identified rRNA features, total taxonomic hits, and the taxonomic hits distribution of archaea, bacteria, eukaryota, viruses, and unclassified reads are given.

The contigLCA was also used to determine the individual source hits distributions of WGS NEON soil metagenomes (mgm4637825.3, mgm4637821.3, mgm4637831.3, and mgm4637826.3) against RefSeq, subsystems of SEED level-one functions, GenBank SwissProt, PATRIC, RDP, and SILVA SSU databases (Fig 1). The source hits distribution of 16S amplicon NEON Soil Marker Gene Sequences (mgm4778732.3 and mgm4778744.3) was determined using the contigLCA algorithm against RDP, SILVA SSU, and Greengenes databases (Fig 2). Data for the source hits distribution of mgm4789766.3 and mgm4783759.3 was not available on the MG-RAST server at the time of analysis.

thumbnail
Fig 1. Source hits distribution of WGS metagenomic samples showing the number of annotated read hits across different databases.

Databases include protein databases with functional hierarchy information, and ribosomal RNA databases. Colored bars represent annotated reads colored by the observed e-value range. Databases differ in the number of hits, but also have different types of annotation data.

https://doi.org/10.1371/journal.pone.0228899.g001

thumbnail
Fig 2. Source hits distribution of 16S amplicon sequencing metagenomic samples showing the number of annotated read hits across different databases.

Shown are ribosomal RNA databases, RDP, SILVA SSU, and Greengenes. Colored bars represent annotated reads colored by the observed e-value range. Databases differ in the number of hits, but also have different types of annotation data. Source hit distribution information was not available for mgm4783766.3 and mgm4783759.3 at the time of analysis.

https://doi.org/10.1371/journal.pone.0228899.g002

The MG-RAST metagenomics analysis server provides rarefaction curves as the total number of distinct species annotations, a function of the number of sequencing reads. However, MG-RAST recommends against using shotgun sequence data to infer taxonomic information below genus for direct analysis. Therefore, to examine organism diversity, a rarefaction curve was created independently to examine genus richness (Fig 3). Briefly, unassembled metagenomic sequencing reads were first analyzed on the MG-RAST server using the contigLCA algorithm to map the raw sequencing reads directly to the RefSeq and RDP databases for the WGS and 16S amplicon sequencing samples, respectively. The corresponding read abundance values were used to create sample-size-based rarefaction (interpolation) and extrapolation (prediction) curves with an endpoint of 20,000 reads and 1,000 bootstrap repetitions using the R software package ‘iNEXT’ [47]. Hill numbers, i.e., the effective number of genera, were used to quantify the taxonomic diversity of each assemblage, that is, the sampling curve plots diversity estimates with respect to the number of sampling units, i.e., the number of reads [4850]. The curve represents the estimated number of different genus annotations for subsamples of each complete dataset and predicts accurately up to double the reference sample size [47]. Therefore, as the curve becomes flatter towards the distal part of the x-axis, additional sampling is likely to yield fewer new genera identifications. Sample-size-based rarefaction curves extrapolated to twice the sample size of WGS samples examined in this study are provided in the supporting information (S1 Fig)

thumbnail
Fig 3. Genus richness estimates for rarefied and extrapolated samples of classified bacteria, archaea, eukaryota, and viruses from WGS show greater genus diversity compared to 16S amplicon sequencing, with respect to read number.

Whole-genome sequencing and 16S amplicon sequencing methods are visualized as genetic diversity compared to number of sequencing reads for each metagenomic sample set mapped against RefSeq and RDP databases, respectively, using contig LCA algorithm on the MG-RAST server [45]. Shaded areas represent 95% confidence intervals with 1,000 bootstrap repetitions, interpolation (solid lines) and extrapolation (dotted lines) curves were generated using the R software package ‘iNEXT’ [47]. Rarefaction curves are separated by color for each sample. Solid line interpolated; dashed line, extrapolated.

https://doi.org/10.1371/journal.pone.0228899.g003

Boxplots were used to summarize alpha diversity of annotated genera in each sample (Fig 4). Alpha diversity is shown as the total number of annotated genera in each sample (Fig 4A) and the genus richness calculated using Shannon’s index (Fig 4B) from the corresponding read abundance values matrix obtained from MG-RAST for WGS and 16S amplicon sequencing samples, as previously mentioned. Shannon’s index alpha diversity was calculated using the R software package ‘iNEXT’ [47], and 95 percent confidence intervals (95% CI) were calculated using the Z statistic for WGS and 16S amplicon sequencing.

thumbnail
Fig 4. Boxplots of the alpha diversity observed at the genus level.

(A) the observed number of annotated genera and (B) Shannon index alpha-diversity values of WGS (mgm4637825.3; mgm4637821.3; mgm4637831.3; mgm4637826.3) and 16S amplicon sequencing (mgm4783766.3; mgm4783759.3; mgm4778732.3; mgm4778744.3) samples included in this study. Shannon’s index alpha diversity was calculated using the R software package ‘iNEXT’ [47]. The 95 percent confidence intervals (95% CI) were calculated using the Z statistic for WGS and 16S amplicon sequencing indicated for each sample set. Y-axis is shared between panels. Boxplots are separated by color for WGS and 16S samples.

https://doi.org/10.1371/journal.pone.0228899.g004

Certain bioinformatics software packages, for example, CosmosID, Inc. (CosmosID, Inc., Rockville, MD, USA), Kraken2 [51,52], MetaPhlAn [53], and MetaMaps [54], are able to achieve bacterial identification to species, subspecies, and/or strain level using unassembled metagenomic shotgun [5558] and long sequencing reads [54]. MG-RAST metagenomics analysis server recommends against using shotgun sequence data to infer taxonomic information below the genus level for direct analysis. Accordingly, using the contigLCA algorithm on the MG-RAST server, genus-level taxonomic categories from each metagenome were determined by mapping the raw sequencing reads directly to the RefSeq and RDP databases for WGS and 16S amplicon sequencing, respectively. Sunburst visualizations of organism specific k-mer relative abundance (percentage) for each sample were generated using Krona [59] (Fig 5 and S2S7 Figs). To compare the overall predicted relative abundance of bacteria, eukaryota, archaea, viruses, and total unclassified sequences detected in each sample, a heatmap of phylum specific k-mer relative abundance (percentage) observed in each sample was created using Morpheus with hierarchical row clustering and One Minus Pearson Correlation [60] (Fig 6). Fungal internal transcribed spacer (ITS) genomic marker regions were not evaluated in this study.

thumbnail
Fig 5. Krona plots of the microbiome detected in representative metagenomic datasets.

(A) Whole-genome sequencing sample mgm4637831.3 and (B) 16S amplicon sequencing sample mgm4778732.3 are visualized as genetic diversity identified at the genus level. Taxonomic nodes are nested sectors arranged from the top level of the hierarchy at the center and progressing outward. Krona plots simultaneously display relative abundance and hierarchy using a radial space-filling display. Genera composition percentages are displayed as the normalized proportion of organism specific k-mers observed in representative datasets. Taxonomic domains are separated by color for bacteria, eukaryota, archaea, viruses, and unclassified reads, respectively.

https://doi.org/10.1371/journal.pone.0228899.g005

thumbnail
Fig 6. Heatmap of relative abundance of microbial phyla detected in WGS and 16S amplicon sequencing samples.

Composition percentages are displayed as the normalized proportion of the phylum specific k-mers observed in each sample relative to the total microbial phyla diversity of the sample. Color gradient key displays the scale of relative abundance percentages for WGS (mgm4637825.3, mgm4637821.3, mgm4637831.3, and mgm4637826.3) and 16S amplicon sequencing (mgm4783766.3, mgm4783759.3, mgm4778732.3, and mgm4778744.3). Hierarchical row clustering was generated using one minus Pearson correlation.

https://doi.org/10.1371/journal.pone.0228899.g006

Results

NEON samples identified for metagenomic sequencing analyses

A total of four whole genome shotgun (mgm4637825.3, mgm4637821.3, mgm4637831.3, mgm4637826.3) and four 16S amplicon (mgm4783766.3, mgm4783759.3, mgm4778732.3, mgm4778744.3) metagenomic samples were selected after performing searches on the MG-RAST server with predefined filtering criteria and identifying samples with the greatest number of sequencing reads collected from the same location. Prior to sequencing, all samples had been collected by NEON researchers from surface soil at 30 cm in depth in a temperate grassland biome (biome_id = ENVO:01000193) with features of graminoid or herbaceous vegetation from Central Plains Experimental Range, Colorado, USA between April-July 2014 (Table 1).

Taxonomic hits distribution against MG-RAST M5NR database

The number of predicted protein features, identified rRNA features, and total taxonomic hits resulting from the automated analysis generated by the MG-RAST pipeline by mapping raw sequencing reads against the M5NR database using the contigLCA algorithm are detailed in Table 2. Samples analyzed by the WGS method employing Illumina HiSeq chemistry produced between 5,66,108 and 11,627,943 (average = 8,525,007) sequencing reads with mean sequence read lengths between 144 bp and 177 bp, compared to the 16S amplicon sequencing method employing Illumina MiSeq chemistry, which produced between 15,799 and 37,106 (average = 22,886) sequencing reads with mean sequence read lengths of between 250 bp and 256 bp. Raw sequencing reads were mapped against the M5NR database and the 16S amplicon sequencing contained, on average, a greater number of identified rRNA features (average = 3,550.75) compared to the WGS sequencing method (average = 3,205). Additionally, the number of total taxonomic hits of raw sequencing reads against the M5NR database was greater for WGS, ranging between 1,621,138 and 3,349,527 (average = 2,484,057), compared to total taxonomic hits of the 16S amplicon sequencing which ranged between 9,728 and 23,807 (average = 15,648). Across all samples, relative abundance of taxonomic hits was greater for archaea and eukaryota in 16S amplicon sequencing (average = 4.70% and 6.00%, respectively) compared to the WGS sequencing method (average = 0.74% and 0.81%, respectively). Conversely, relative abundance of bacteria was slightly decreased in 16S amplicon sequencing (average = 86.11%) compared to the WGS method (average = 98.27%). No viruses were detected in the 16S amplicon sequencing samples while between 0.01% and 0.02% of the reads mapped to viruses in the shotgun metagenomic samples. 16S amplicon sequencing targets DNA sequences encoding the RNA component of the 30S subunit of prokaryotic ribosomes and is not useful in identifying functional protein features, while between 1,823,429 and 3,637,507 (average = 2,696,100) protein features were identified in the WGS samples. Additionally, the proportion of ‘other and unclassified’ sequencing reads was greater in the 16S amplicon sequencing samples (average = 3.20%) compared to the shotgun metagenomic sequencing samples (average = 0.16%).

Source hits distribution against other widely used databases

The source hits distribution was determined by using the contigLCA algorithm to map raw sequencing reads against individual databases for the WGS sequencing method (Fig 1) and the 16S amplicon sequencing (Fig 2). The average total number of raw WGS reads mapped against whole genome taxonomic databases, RefSeq, GenBank, and PATRIC, was 2,950,350.75, 2,860,207, and 2,842,863.5, respectively. The average total of raw WGS reads mapped against genome annotations and protein databases, Subsystems ontology and SwissProt, was 2,860,207 and 407,185, respectively. All WGS samples, and two 16S amplicon samples (mgm4778732.3 and mgm4778744.3) were mapped against rRNA databases RDP and Silva SSU. The 16S amplicon sequencing samples demonstrated a greater number of hits against the RDP database, 23,442 and 13,072 hits for mgm4778732.3 and mgm4778744.3, respectively (average = 18,257) compared to the number of WGS sample reads mapped against the RDP database, which ranged between 1,563 and 3,505 (average = 2,416.5). However, WGS metagenomic samples contained a larger number of hits against the Silva SSU database, which ranged between 69,560 and 142,953 (average = 106180.25), compared to the 16S amplicon sequencing methods, which demonstrated 13,662 and 22,876 hits for mgm4778732.3 and mgm4778744.3, respectively (average = 20,188). The 16S amplicon sequencing samples mapped against the Greengenes database showed 22,876 and 12,633 hits for mgm4778732.3 and mgm4778744.3, respectively (average = 20,188). By default, the MG-RAST analysis server does not provide the source hits distribution of WGS reads mapped against the Greengenes database. Therefore, no comparison was made between WGS samples and the Greengenes database.

Rarefaction and Shannon’s index alpha diversity

Overall genus richness, i.e., the total number of phylotypes, of each sample microbiome sequenced by WGS and 16S amplicon sequencing was compared by construction of rarefaction curves (Fig 3). Across all samples, genera richness in WGS samples was consistently greater than for samples sequenced by the 16S amplicon sequencing method. Furthermore, rarefaction analysis indicated that coverage was sufficient using 16S amplicon sequencing samples, as the asymptote of the curve towards the end of the graph is nearly zero, whereas, for the WGS shotgun metagenomic samples, increasing coverage would likely result in an increased number of samples being identified, as the asymptote of the curve had not reached maximum genus richness, supported by the extrapolated rarefaction sampling curves of WGS samples (S1 Fig). The total number of annotated genera observed in each sample is shown in in Fig 4A. The number of annotated genera observed in WGS samples (mgm4637825.3, 991; mgm4637821.3, 969; mgm4637831.3, 923; mgm4637826.3, 912; average = 948.75, 95% CI [913.04, 986.46]) was significantly greater, compared to 16S amplicon sequencing samples (mgm4783766.3, 204; mgm4783759.3, 196; mgm4778732.3, 281; mgm4778744.3, 217; average = 224.5; 95% CI = [187.63, 263.37]) selected in this study.

Similarly, alpha diversity patterns were calculated at the genus level using Shannon’s index for NEON soil metagenomes analyzed via WGS andNEON Soil Marker Gene Sequences analyzed via 16S amplicon sequencing selected in this study (Fig 4B). The observed Shannon index alpha diversity values were greater for WGS samples (mgm4637825.3, 221.56; mgm4637821.3, 219.93; mgm4637831.3, 215.74; mgm4637826.3, 213.94) compared to 16S samples (mgm4783766.3, 15.57; mgm4783759.3, 17.21; mgm4778732.3, 18.37; mgm4778744.3, 14.59) selected in this study. Across NEON soil metagenomes obtained by WGS, Shannon index alpha diversity was significantly greater (average = 217.79, 95% CI = [214.27, 221.00]) than alpha diversity of all NEON Soil Marker Gene Sequences (average = 16.44; 95% CI = [14.79, 18.09]) obtained by 16S amplicon sequencing.

Microbial resolution of WGS and 16S amplicon sequencing

The average total number of taxonomic hits for WGS samples against the RefSeq database was 4,780,201.75 (mgm4637825.3 = 6,445,937; mgm4637821.3 = 6,347,609; mgm4637831.3 = 3,102,947; mgm4637826.3 = 3,224,314). The average total number of taxonomic hits for 16S amplicon sequencing against the RDP database was 15,785.75 (mgm4783766.3 = 13,751; mgm4783759.3 = 10,237; mgm4778732.3 = 25,470; mgm4778744.3 = 13,685). To display and compare information for the most abundant taxa across samples, characterization of DNA coding for bacteria, archaea, eukaryota, and viruses are shown as Krona plots representing the relative abundance of microbial genera detected in the representative WGS sample mgm4637831.3 (Fig 5A) and 16S amplicon sequencing representative sample mgm4778732.3 (Fig 5B). Krona plots for other WGS samples (mgm4637825.3, mgm4637821.3, and mgm4637826.3) and 16S amplicon sequencing samples (mgm4783766.3, mgm4783759.3, and mgm4778744.3) are provided in supporting information (S2, S3 and S4 Figs) and (S5, S6 and S7 Figs), respectively. Samples analyzed by WGS had a much lower proportion of unclassified reads compared to samples analyzed by 16S amplicon sequencing. For example, mgm4637831.3 (WGS) contained 67,939 reads unclassified to genus corresponding to approximately 2.19% of the total number of processed reads. In mgm4637831.3, approximately 923 genera were identified. In the 16S amplicon sequencing sample mgm4778732.3, 10,119 reads were unclassified at the genus level, corresponding to approximately 39.73% of the total number of processed reads. A total of 281 genera were identified in mgm4778732.3.

To compare dominant taxonomic groups detected by each sequencing method, relative abundance of microbial phyla detected in all WGS and 16S amplicon sequencing samples are shown in a heat map (Fig 6). Overall, dominant phyla detected in all samples agreed across sequencing methodology. With the exception of Eustigmatophyceae that was identified in mgm4783759.3, all phyla detected by 16S sequencing were also detected using WGS. Predominant bacterial phyla detected by both sequencing methods in all samples include Acidobacteria, Actinobacteria, Proteobacteria, Bacteroidetes, Firmicutes, and Verrucomicrobia. For all samples, WGS yielded more phyla compared to 16S amplicon sequencing. The detected diversity of eukaryota was greater across all samples using WGS compared to 16S, for which only Basidiomycota was detected. Similarly, five archaeal phyla were detected using WGS, while only Crenarchaeota and Thaumarchaeota were detected by 16S amplicon sequencing. WGS was also able to detect viruses at up to roughly 0.2% relative abundance in each sample examined, while no viruses were detected by 16S amplicon sequencing.

Discussion

WGS metagenomics offers finer resolution for microbial community structure and dynamics compared to 16S amplicon sequencing

Two most commonly employed methods of sequencing used to study the microbiome of complex environments are 16S rRNA sequencing and whole genome shotgun metagenomics. While, it is debatable as to which approach is superior, there may be a place for both in microbiome studies, depending on the investigation. Tessler and colleagues found that 16S sequencing identified a larger number of phyla than WGS for water samples collected across remote locations of Brazil, suggesting amplicon sequencing may outperform WGS in areas not well studied and comprising only a limited number of sequenced genomes [61]. WGS was concluded to be preferable to 16S amplicon sequencing in the human microbiome, including enhanced detection of bacterial species, increased detection of diversity, and increased prediction of genes [62]. Our findings support the latter, namely that WGS provides greater resolution, i.e., identifies greater microorganism diversity, and for microbial communities may provide greater insight into biochemical processes.

In general, amplicon sequencing may be more practical and less expensive than WGS, which may require more extensive data analysis [6365]. Here we demonstrate that WGS offers insight into the total microbial community, and 16S amplicon sequencing identifies only more dominant organisms in a biological sample (Fig 6). Both amplicon sequencing and 16S targeting multiple loci, are useful in exploring biodiversity that includes bacterial, archaeal, and eukaryotic microbial communities in the same sequencing run [66]. A recent study demonstrated that MinIONTM technology can be employed to identify and differentiate both bacterial and viral species within a biological sample via amplicon sequencing [67]. However, WGS covers the entire community of genomes, capturing sequences of all organisms, including viruses and fungi, which cannot be captured by 16S amplicon sequencing. Additionally, 16S and WGS methods usually require different databases for classification of taxa [62].

In mining sequencing data from MG-RAST, we were unable to identify viral sequences using 16S amplicon sequencing, while whole genome metagenomic sequencing showed definitively that viral DNA comprised between 0.1%-0.2% of the total sequencing reads (Table 2 and Fig 6). Furthermore, 16S amplicon sequencing identified only very specific regions of the genome, insufficient to assess the functional genomics of microbial communities (Figs 1 and 2). Phylogenetic reconstruction has been employed by some investigators to infer biological function encoded in a genome containing a particular 16S sequence [68]. However, accuracy of such inferences relies heavily on how well the genomic diversity is represented by genomes available in the database [69]. Moreover, yet-to-be-discovered taxa whose rRNA sequences are not represented in the database would not be detected [70]. In this study, comparing 16S amplicon sequencing and WGS, we showed WGS metagenomics can be used effectively to identify predicted protein features whereas 16S amplicon sequencing could not reliably detect protein features (Table 2).

Identification of novel and highly complex organisms is difficult using 16S amplicon sequencing because the method is restricted to identifying those organisms whose specific genomic regions can be readily amplified. Furthermore, horizontal transfer of the 16S locus between distantly related taxa is possible [21]. Others have reported an overestimation of population diversity using 16S amplicon sequencing because many organisms in the environment contain multiple rRNA operons [21,71]. In this study, a larger proportion of taxonomic hits for archaea and eukaryota was obtained using 16S amplicon sequencing compared to WGS. Conversely, the distribution of hits for bacteria was slightly lower with 16S amplicon sequencing compared to WGS (Table 2 and Fig 6).

WGS shotgun metagenomics offers a deeper analysis of microbial diversity (Figs 3 and 4A). With rarefaction curves (Figs 3 and S1), we were able to demonstrate increased resolution of the taxonomic classification of microbial genera was not simply a result of number of reads generated in each sample, as shown by flattened curves in the rarefaction of 16S amplicon sequencing. Comparison of 16S amplicon sequencing with WGS metagenomic sequencing by rarefying reads from depths of 500 to 100,000 repeatedly, determined 16S amplicon sequencing can yield significant primer bias [72].

Shannon alpha genus diversity was found to be significantly greater for WGS samples compared to 16S amplicon metagenomic samples (Fig 4B), a finding in agreement with previous studies [62,73]. WGS also yielded enhanced detection of microbial diversity and accuracy. In the representative datasets, WGS identified 923 genera in sample mgm4637831.3 (Fig 5A), compared to 281 genera by 16S amplicon sequencing of sample mgm4778732.3 (Fig 5B).

With respect to cost, it was recently demonstrated that shallow WGS metagenomics can be used to obtain species-level taxonomic and functional data at a fraction of the cost of deep WGS and may serve as an alternative to 16S amplicon sequencing for large-scale microbiome studies [73].

NEON data are promising for collaborative metagenomics and open source datamining

A major challenge of collaboration amongst researchers in metagenomics using existing open-source data for broad-scale analyses is the myriad of sequence databases that are available [74]. In these databases, metagenomics data can be stored in a variety of formats on distinct hardware and software platforms that are often isolated and independent from each other, with no standards established for data collection and communication. Each database is likely to require unique approaches and algorithms for data analysis, which can introduce additional variable interoperability [75]. Therefore, widely adopted standards would help investigators better utilize, share, and archive the ever-expanding volume of metagenomic data [76].

The NEON open source data portal is distinct because it has established a standard for data collection at sites in terrestrial and aquatic ecosystems that employ technical working groups to design protocols for data collection infrastructure, including sensor installation and configuration and supporting measurements, and for observations from samples collected at field sites [77]. NEON closely monitors all aspects of metagenomic analysis, including sample collection procedures, DNA preparation, and communication of observations and results. Variation of DNA preparation procedures have been shown to influence taxonomical classification using downstream sequencing reads [78,79] and can likely alter the identified microbial diversity profile. Currently, there is no uniform standard for ensuring complex metagenomic datasets are accompanied with relevant metadata across projects. However, NEON overcomes this by standardizing the protocols of sample collection and processing. In this study, we successfully demonstrated use of NEON metagenomic datasets (Table 1) with MG-RAST, an open-source standard for data analysis that provides support for automated phylogenetic and functional analysis of metagenome data [45]. NEON and MG-RAST collectively provide a means of standardized metagenomic data collection, processing, storage, analysis, and quality assurance which could be implemented in an array biodiversity studies related to use of long-term ecological data on a continental-scale.

Limitations of applying standardized open source data

While implementation of the use of standardized open-source data is very appealing, it is not without limitation. The MG-RAST analytical server provides a standardized and reproducible platform for metagenomic analysis, but the documentation recommends against taxonomic classification below genus. Further, the MG-RAST annotation pipeline has potential to provide annotations for each submitted fragment of DNA such that the number of identified features may be smaller than the number of reads due to clustering or larger due to double counting. For many studies, species, and even strain level of taxonomic resolution can be essential to uncover fully all organisms present, including pathogenic strains, as well as identify dominant gene pathways that may be present in a sample. Accordingly, other curated databases (e.g., GenBook®) and analytical software platforms with strain resolution can be accessed and is appealing for studies where the goal is to identify microbial diversity and richness and to assign taxonomic or functional hierarchies.

Another limitation of using available open-source data is that investigators are limited to studying only the sequencing data that is readily available, therefore, designing projects that revolve around datasets that are available for a limited number of representative sites that may or may not provide the extent of diversity required for comparative metagenomics. In the case of NEON, data are released as available for regional to continental scale data collected and archived from 81 field sites across 20 ecoclimatic domains covering the contiguous 48 US states, Alaska, Hawaii, and Puerto Rico, and [80]. NEON provides a seamless integration with MG-RAST, which provides public access to calibrated meta and genomic data using standardized methods. As a result, the datasets can be used to formulate sampling sites, determine sampling frequency, and compare the metagenomic diversity and richness between samples.

Conclusions

The National Ecological Observatory Network (NEON) provides regional to continental scale data gathered using standardized protocols and methods for sample collection, pre-processing, post-processing, and quality control. These data can be easily coupled with other standardized bioinformatic software (e.g., MG-RAST) for metagenomic analysis with a reproducible interoperability of results. In this study, we demonstrated the feasibility of using NEON metagenomic datasets to establish the resolution of microbial community structure and diversity. 16S amplicon sequencing is currently used to identify dominant organisms present in a biological sample. However, WGS has been shown to detect and identify more genera of bacteria, archaea, viruses, and eukaryota compared to 16S amplicon sequencing. Furthermore, the identification of putative functional genes in microbial communities provided significantly more effective using WGS than 16S amplicon sequencing. It is concluded that NEON open data are useful for characterizing and quantifying complex ecological processes associated with changing aquatic and terrestrial ecosystems. Other analytical software, in addition to MG-RAST, may be required to resolve taxonomic decisions below genus, i.e., species, strain, and sub-strain.

Supporting information

S1 Fig. Rarefaction curve of whole genome sequencing samples examined in this study.

https://doi.org/10.1371/journal.pone.0228899.s001

(TIF)

S2 Fig. Krona plot of the microbiome detected in mgm4637826.3 whole genome sequencing dataset.

https://doi.org/10.1371/journal.pone.0228899.s002

(TIF)

S3 Fig. Krona plot of the microbiome detected in mgm4637825.3 whole genome sequencing dataset.

https://doi.org/10.1371/journal.pone.0228899.s003

(TIF)

S4 Fig. Krona plot of the microbiome detected inmgm4637821.3 whole genome sequencing dataset.

https://doi.org/10.1371/journal.pone.0228899.s004

(TIF)

S5 Fig. Krona plot of the microbiome detected in mgm4778744.3 16S amplicon sequencing dataset.

https://doi.org/10.1371/journal.pone.0228899.s005

(TIF)

S6 Fig. Krona plot of the microbiome detected in mgm4783759.3 16S amplicon sequencing dataset.

https://doi.org/10.1371/journal.pone.0228899.s006

(TIF)

S7 Fig. Krona plot of the microbiome detected in mgm4783766.3 16S amplicon sequencing dataset.

https://doi.org/10.1371/journal.pone.0228899.s007

(TIF)

Acknowledgments

The National Ecological Observatory Network is a program sponsored by the National Science Foundation and operated under cooperative agreement by Battelle Memorial Institute. This data used in this study is based in part upon work supported by the National Science Foundation through the NEON Program.

References

  1. 1. Turnbaugh PJ, Gordon JI. An Invitation to the marriage of metagenomics and metabolomics. Cell. 2008;134: 708–713. pmid:18775300
  2. 2. Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, et al. Functional metagenomic profiling of nine biomes. Nature. 2008;452: 629–632. pmid:18337718
  3. 3. Blaser MJ, Cardon ZG, Cho MK, Dangl JL, Donohue TJ, Green JL, et al. Toward a Predictive Understanding of Earth’s Microbiomes to Address 21st Century Challenges. MBio. 2016;7: e00714–16. pmid:27178263
  4. 4. Walter J, Ley R. The human gut microbiome: ecology and recent evolutionary changes. Annu Rev Microbiol. 2011;65: 411–429. pmid:21682646
  5. 5. Philippot L, Raaijmakers JM, Lemanceau P, van der Putten WH. Going back to the roots: the microbial ecology of the rhizosphere. Nat Rev Microbiol. 2013;11: 789–799. pmid:24056930
  6. 6. Sharp CE, Brady AL, Sharp GH, Grasby SE, Stott MB, Dunfield PF. Humboldt’s spa: microbial diversity is controlled by temperature in geothermal environments. ISME J. 2014;8: 1166–1174. pmid:24430481
  7. 7. Soo RM, Wood SA, Grzymski JJ, McDonald IR, Cary SC. Microbial biodiversity of thermophilic communities in hot mineral soils of Tramway Ridge, Mount Erebus, Antarctica. Environ Microbiol. 2009;11: 715–728. pmid:19278453
  8. 8. Newman DK, Banfield JF. Geomicrobiology: how molecular-scale interactions underpin biogeochemical systems. Science. 2002;296: 1071–1077. pmid:12004119
  9. 9. Prosser JI, Bohannan BJM, Curtis TP, Ellis RJ, Firestone MK, Freckleton RP, et al. The role of ecological theory in microbial ecology. Nat Rev Microbiol. 2007;5: 384–392. pmid:17435792
  10. 10. Ley RE, Turnbaugh PJ, Klein S, Gordon JI. Microbial ecology: human gut microbes associated with obesity. Nature. 2006;444: 1022–1023. pmid:17183309
  11. 11. Sleator RD, Shortall C, Hill C. Metagenomics. Lett Appl Microbiol. 2008;47: 361–366. pmid:19146522
  12. 12. Simon C, Daniel R. Metagenomic analyses: past and future trends. Appl Environ Microbiol. 2010/12/17. 2011;77: 1153–1161. pmid:21169428
  13. 13. Thomas T, Jack G, Meyer F. Metagenomics-A guide from sampling to data analysis Aquarium Microbiome Project View project MG-RAST View project. Microb Inform Exp. 2012;2: 12. pmid:22587947
  14. 14. Cotruvo J, Colwell RR, Heberling C, Hasan NA, Leddy MB, Subramanian P. Characterization of Microbial Signatures From Advanced Treated Wastewater Biofilms. J Am Water Works Assoc. 2017;109: E503–E512.
  15. 15. Biddle JF, Fitz-Gibbon S, Schuster SC, Brenchley JE, House CH. Metagenomic signatures of the Peru Margin subseafloor biosphere show a genetically distinct environment. Proc Natl Acad Sci U S A. 2008;105: 10583–10588. pmid:18650394
  16. 16. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard N-U, et al. Community genomics among stratified microbial assemblages in the ocean’s interior. Science. 2006;311: 496–503. pmid:16439655
  17. 17. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, et al. Comparative metagenomics of microbial communities. Science. 2005;308: 554–557. pmid:15845853
  18. 18. Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin ML, Pace NR. Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proc Natl Acad Sci. 1985;82: 6955 LP– 6959. pmid:2413450
  19. 19. Kotik M. Novel genes retrieved from environmental DNA by polymerase chain reaction: Current genome-walking techniques for future metagenome applications. J Biotechnol. 2009;144: 75–82. pmid:19712711
  20. 20. Itoh N, Isotani K, Makino Y, Kato M, Kitayama K, Ishimota T. PCR-based amplification and heterologous expression of Pseudomonas alcohol dehydrogenase genes from the soil metagenome for biocatalysis. Enzyme Microb Technol. 2014;55: 140–150. pmid:24411457
  21. 21. Acinas SG, Marcelino LA, Klepac-Ceraj V, Polz MF. Divergence and Redundancy of 16S rRNA Sequences in Genomes with Multiple <em>rrn</em> Operons. J Bacteriol. 2004;186: 2629 LP– 2635.
  22. 22. Gray MW, Burger G, Lang BF. Mitochondrial Evolution. Science (80-). 1999;283: 1476 LP– 1481. pmid:10066161
  23. 23. Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci. 2011;108: 4516 LP– 4522. pmid:20534432
  24. 24. Bosshard PP, Abels S, Zbinden R, Böttger EC, Altwegg M. Ribosomal DNA Sequencing for Identification of Aerobic Gram-Positive Rods in the Clinical Laboratory (an 18-Month Evaluation). J Clin Microbiol. 2003;41: 4134 LP– 4140.
  25. 25. Liu Z, Lozupone C, Hamady M, Bushman FD, Knight R. Short pyrosequencing reads suffice for accurate microbial community analysis. Nucleic Acids Res. 2007;35: e120–e120. pmid:17881377
  26. 26. Bartram AK, Lynch MDJ, Stearns JC, Moreno-Hagelsieb G, Neufeld JD. Generation of Multimillion-Sequence 16S rRNA Gene Libraries from Complex Microbial Communities by Assembling Paired-End Illumina Reads. Appl Environ Microbiol. 2011;77: 3846 LP– 3852. pmid:21460107
  27. 27. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, et al. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2008;37: D141–D145. pmid:19004872
  28. 28. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2012/11/28. 2013;41: D590–D596. pmid:23193283
  29. 29. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, et al. Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Appl Environ Microbiol. 2006;72: 5069 LP– 5072. pmid:16820507
  30. 30. Mardis ER. A decade’s perspective on DNA sequencing technology. Nature. 2011;470: 198–203. pmid:21307932
  31. 31. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2015/11/08. 2016;44: D733–D745. pmid:26553804
  32. 32. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Ostell J, Pruitt KD, et al. GenBank. Nucleic Acids Res. 2018;46: D41–D47. pmid:29140468
  33. 33. Wattam AR, Abraham D, Dalay O, Disz TL, Driscoll T, Gabbard JL, et al. PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res. 2013/11/12. 2014;42: D581–D591. pmid:24225323
  34. 34. Gilbert JA, Field D, Huang Y, Edwards R, Li W, Gilna P, et al. Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS One. 2008;3: e3042. pmid:18725995
  35. 35. Wilmes P, Bond PL. Metaproteomics: studying functional gene expression in microbial ecosystems. Trends Microbiol. 2006;14: 92–97. pmid:16406790
  36. 36. Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res. 2014;42: 206–214. pmid:24293654
  37. 37. Bairoch A, Boeckmann B. The SWISS-PROT protein sequence data bank: current status. Nucleic Acids Res. 1994;22: 3578–3580. Available: https://www.ncbi.nlm.nih.gov/pubmed/7937062 pmid:7937062
  38. 38. Field D, Sansone S-A, Collis A, Booth T, Dukes P, Gregurick SK, et al. 'Omics Data Sharing. Science (80-). 2009;326: 234 LP– 236. pmid:19815759
  39. 39. Jones AR, Miller M, Aebersold R, Apweiler R, Ball CA, Brazma A, et al. The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nat Biotechnol. 2007;25: 1127–1133. pmid:17921998
  40. 40. Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol. 2007;25: 1251–1255. pmid:17989687
  41. 41. Science Commons. Protocol for Implementing Open Access Data. [cited 14 Nov 2019]. Available: https://creativecommons.org/about/program-areas/open-science
  42. 42. Holmes C, McDonald F, Jones M, Ozdemir V, Graham JE. Standardization and omics science: technical and social dimensions are inseparable and demand symmetrical study. OMICS. 2010;14: 327–332. pmid:20455752
  43. 43. Gomez-Cabrero D, Abugessaisa I, Maier D, Teschendorff A, Merkenschlager M, Gisel A, et al. Data integration in the era of omics: current and future challenges. BMC Syst Biol. 2014/03/13. 2014;8 Suppl 2: I1–I1. pmid:25032990
  44. 44. NEON. Data Products DP1.10107.001 and DP1.10108.001. In: 2019.
  45. 45. Wilke A, Bischof J, Gerlach W, Glass E, Harrison T, Keegan KP, et al. The MG-RAST metagenomics database and portal in 2015. Nucleic Acids Res. 2015/12/09. 2016;44: D590–D594. pmid:26656948
  46. 46. Wilke A, Harrison T, Wilkening J, Field D, Glass EM, Kyrpides N, et al. The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools. BMC Bioinformatics. 2012;13: 141. pmid:22720753
  47. 47. Hsieh TC, Ma KH, Chao A. iNEXT: an R package for rarefaction and extrapolation of species diversity (Hill numbers). Methods Ecol Evol. 2016;7: 1451–1456.
  48. 48. Chao A, Gotelli NJ, Hsieh T., Sande EL, Ma KH, Colwell RK, et al. Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species diversity studies. Ecol Monogr. 2014;84: 45–67.
  49. 49. Colwell RK, Chao A, Gotelli NJ, Lin S-Y, Mao CX, Chazdon RL, et al. Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. J Plant Ecol. 2012;5: 3–21.
  50. 50. Chao A, Jost L. Coverage-based rarefaction and extrapolation: standardizing samples by completeness rather than size. Ecology. 2012;93: 2533–2547. pmid:23431585
  51. 51. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. bioRxiv. 2019; 762302.
  52. 52. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15: R46. pmid:24580807
  53. 53. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 2012;9: 811–814. pmid:22688413
  54. 54. Dilthey AT, Jain C, Koren S, Phillippy AM. Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps. Nat Commun. 2019;10: 3066. pmid:31296857
  55. 55. Connelly S, Fanelli B, Hasan NA, Kaleko M, Colwell RR. Oral Metallo-Beta-Lactamase Protects the Gut Microbiome From Carbapenem-Mediated Damage and Reduces Propagation of Antibiotic Resistance in Pigs. Front Microbiol. 2019;10: 1–12. pmid:30728808
  56. 56. Roy MA, Arnaud JM, Jasmin PM, Hamner S, Hasan NA, Colwell RR, et al. A Metagenomic Approach to Evaluating Surface Water Quality in Haiti. Int J Environ Res Public Health. 2018;15: 2211. pmid:30309013
  57. 57. Ponnusamy D, Kozlova E V, Sha J, Erova TE, Azar SR, Fitts EC. Cross-talk among flesh-eating Aeromonas hydrophila strains in mixed infection leading to necrotizing fasciitis. Proc Natl Acad Sci. 2016;113: 722–727. pmid:26733683
  58. 58. Hourigan SK, Subramanian P, Hasan NA, Ta A, Klein E, Chettout N, et al. Comparison of infant gut and skin microbiota, resistome and virulome between neonatal intensive care unit (NICU) environments. Front Microbiol. 2018;9. pmid:29988506
  59. 59. Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011;12: 385. pmid:21961884
  60. 60. Broad Institute. Morpheus. 2019 [cited 2 Oct 2019]. Available: https://software.broadinstitute.org/morpheus/
  61. 61. Tessler M, Neumann JS, Afshinnekoo E, Pineda M, Hersch R, Velho LFM, et al. Large-scale differences in microbial biodiversity discovery between 16S amplicon and shotgun sequencing. Sci Rep. 2017;7: 6589. pmid:28761145
  62. 62. Ranjan R, Rani A, Metwally A, McGee HS, Perkins DL. Analysis of the microbiome: Advantages of whole genome shotgun versus 16S amplicon sequencing. Biochem Biophys Res Commun. 2015/12/22. 2016;469: 967–977. pmid:26718401
  63. 63. van Nimwegen KJM, van Soest RA, Veltman JA, Nelen MR, van der Wilt GJ, Vissers LELM, et al. Is the $1000 Genome as Near as We Think? A Cost Analysis of Next-Generation Sequencing. Clin Chem. 2016;62: 1458–1464. pmid:27630156
  64. 64. Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet. 2014;15: 121–132. pmid:24434847
  65. 65. Kuczynski J, Lauber CL, Walters WA, Parfrey LW, Clemente JC, Gevers D, et al. Experimental and analytical tools for studying the human microbiome. Nat Rev Genet. 2011;13: 47–58. pmid:22179717
  66. 66. Kittelmann S, Seedorf H, Walters WA, Clemente JC, Knight R, Gordon JI, et al. Simultaneous Amplicon Sequencing to Explore Co-Occurrence Patterns of Bacterial, Archaeal and Eukaryotic Microorganisms in Rumen Microbial Communities. PLoS One. 2013;8: e47879. Available: pmid:23408926
  67. 67. Kilianski A, Haas JL, Corriveau EJ, Liem AT, Willis KL, Kadavy DR, et al. Bacterial and viral identification and differentiation by amplicon sequencing on the MinION nanopore sequencer. Gigascience. 2015;4: 12. pmid:25815165
  68. 68. Langille MGI, Zaneveld J, Caporaso JG, McDonald D, Knights D, Reyes JA, et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat Biotechnol. 2013/08/25. 2013;31: 814–821. pmid:23975157
  69. 69. Sharpton TJ. An introduction to the analysis of shotgun metagenomic data. 2014;5: 1–14. pmid:24982662
  70. 70. Rosselli R, Romoli O, Vitulo N, Vezzi A, Campanaro S, Pascale F De, et al. Direct 16S rRNA-seq from bacterial communities: a PCR-independent approach to simultaneously assess microbial diversity and functional activity potential of each taxon. Nat Publ Gr. 2016; 1–12. pmid:27577787
  71. 71. Rintala A, Pietilä S, Munukka E, Eerola E, Pursiheimo J-P, Laiho A, et al. Gut Microbiota Analysis Results Are Highly Dependent on the 16S rRNA Gene Target Region, Whereas the Impact of DNA Extraction Is Minor. J Biomol Tech. 2017/02/28. 2017;28: 19–30. pmid:28260999
  72. 72. Jovel J, Patterson J, Wang W, Hotte N, O’Keefe S, Mitchel T, et al. Characterization of the Gut Microbiome Using 16S or Shotgun Metagenomics. Frontiers in Microbiology. 2016. p. 459. Available: https://www.frontiersin.org/article/10.3389/fmicb.2016.00459 pmid:27148170
  73. 73. Hillmann B, Al-Ghalith GA, Shields-Cutler RR, Zhu Q, Gohl DM, Beckman KB, et al. Evaluating the Information Content of Shallow Shotgun Metagenomics. Rawls JF, editor. mSystems. 2018;3: e00069–18. pmid:30443602
  74. 74. Bietz MJ, Lee CP. Collaboration in Metagenomics: Sequence Databases and the Organization of Scientific Work BT—ECSCW 2009. In: Wagner I, Tellioğlu H, Balka E, Simone C, Ciolfi L, editors. London: Springer London; 2009. pp. 243–262.
  75. 75. Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, Giovanni R, et al. Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS One. 2012;7: e29715. Available: pmid:22238640
  76. 76. Chervitz SA, Deutsch EW, Field D, Parkinson H, Quackenbush J, Rocca-Serra P, et al. Data Standards for Omics Data: The Basis of Data Sharing and Reuse BT—Bioinformatics for Omics Data: Methods and Protocols. In: Mayer B, editor. Totowa, NJ: Humana Press; 2011. pp. 31–69. https://doi.org/10.1007/978-1-61779-027-0_2
  77. 77. NEON. About NEON’s science designs and standardized protocols. 2019 [cited 12 Sep 2019]. Available: https://www.neonscience.org/data-collection/protocols-standardized-methods
  78. 78. Albertsen M, Karst SM, Ziegler AS, Kirkegaard RH, Nielsen PH. Back to Basics–The Influence of DNA Extraction and Primer Choice on Phylogenetic Analysis of Activated Sludge Communities. PLoS One. 2015;10: e0132783. Available: pmid:26182345
  79. 79. Teng F, Darveekaran Nair SS, Zhu P, Li S, Huang S, Li X, et al. Impact of DNA extraction method and targeted 16S-rRNA hypervariable region on oral microbiota profiling. Sci Rep. 2018;8: 16321. pmid:30397210
  80. 80. NEON. Frequently Asked Questions. 2019 [cited 12 Sep 2019]. Available: https://www.neonscience.org/about/faq