Baseline human gut microbiota profile in healthy people and standard reporting template

A comprehensive knowledge of the types and ratios of microbes that inhabit the healthy human gut is necessary before any kind of pre-clinical or clinical study can be performed that attempts to alter the microbiome to treat a condition or improve therapy outcome. To address this need we present an innovative scalable comprehensive analysis workflow, a healthy human reference microbiome list and abundance profile (GutFeelingKB), and a novel Fecal Biome Population Report (FecalBiome) with clinical applicability. GutFeelingKB provides a list of 157 organisms (8 phyla, 18 classes, 23 orders, 38 families, 59 genera and 109 species) that forms the baseline biome and therefore can be used as healthy controls for studies related to dysbiosis. This list can be expanded to 863 organisms if closely related proteomes are considered. The incorporation of microbiome science into routine clinical practice necessitates a standard report for comparison of an individual’s microbiome to the growing knowledgebase of “normal” microbiome data. The FecalBiome and the underlying technology of GutFeelingKB address this need. The knowledgebase can be useful to regulatory agencies for the assessment of fecal transplant and other microbiome products, as it contains a list of organisms from healthy individuals. In addition to the list of organisms and their abundances, this study also generated a collection of assembled contiguous sequences (contigs) of metagenomics dark matter. In this study, metagenomic dark matter represents sequences that cannot be mapped to any known sequence but can be assembled into contigs of 10,000 nucleotides or higher. These sequences can be used to create primers to study potential novel organisms. All data is freely available from https://hive.biochemistry.gwu.edu/gfkb and NCBI’s Short Read Archive.

Introduction
While humanity has only begun to influence planetary-level events in the last few hundred years [1], microorganisms have shaped our planet since time immemorial [2]. It has been shown that the microbes of the ocean are as important for influencing planetary climate as the microbes of gastrointestinal (GI) tracts of cattle [3]; furthermore, new functions are continuously found for the human microbiome [4][5][6]. However, since the advent of germ theory and the antimicrobial revolution, microbes have been viewed as insurgents bound for eradication [7]. Hence, we have created GutFeelingKB to provide a reference for the metagenomic analysis of the human gut microbiome.
In 2001, some sixty years into the antibiotic era, Joshua Lederberg coined the term 'microbiome' as the pendulum of opinion began to swing back to a more microbe-tolerant position [8,9]. In 2008, the US National Institutes of Health launched the Human Microbiome Project (HMP) to better understand the makeup of the community of microbes in cohabitation with humans [10,11]. This population of microorganisms brings with it a vast, diverse, and modifiable set of genomes which have proven to influence human health and disease [12,13]. Together, these organisms' genomes comprise the metagenome, a highly versatile pool of genetic elements which now serves as a target for medical research [14]. Microbiome characterization through various analysis pipelines has advanced progressively since HMP and this development process has catalyzed the understanding of certain roles of these microbial communities [15,16].
Although microbiomes of all body sites are important, the gut microbiome, with hundreds of prevalent species is of major interest to a large and diverse number of researchers [17,18]. The healthy gut microbiome data and analysis is crucial for all studies of disease with relation to the human gut. A Nature Microbiology issue in 2016 contained a consensus statement which outlined all federally-funded microbiome research over a three-year period [19]. The authors, on behalf of the federal government's FastTrack Action Committee on Mapping Microbiomes (FTAC-MM), defined a microbiome as a multi-species community of microorganisms in any environment: host, habitat, or ecosystem. One of the conclusions reached by the authors was a "priority need" for higher-throughput, more accurate data acquisition, better pipelines for data analyses, and a greater ability to organize, store, access, and share/integrate data sets. At present, most studies leverage study specific control groups and reporting mechanisms. The studies that are successful at creating clinically relevant results, such as the work by uBiome [20], are based on marker genes, and so they do not shed light on the origin of the "microbial dark matter", and are not able to be integrated with whole genome shotgun sequencing studies (WGS). These problems are compounded by the fact that different bioinformatics pipelines produce different results largely because all current pipelines use a limited number of ad hoc reference organisms to determine abundance. It has also been shown that database growth influences the accuracy of relatively faster k-mer-based species identification [21]. The final understanding of the baseline healthy microbiome therefore can be flawed because the methods are uniquely applied in each study. As such, there is a need for aggregation, validation for interoperability, and eventual standardization of methods and reporting.
Currently, metagenomic analyses use nucleotide sequences from a limited set of predetermined microorganisms or genes as a reference database, and, as such, these reference lists are not truly comprehensive. The use of limited sets of sequence data is prevalent because it is computationally challenging to perform pairwise read alignment against the entire NCBI non-redundant nucleotide database (NCBI-nt) [22]. Algorithms have been developed that allow the use of the complete NCBI-nt and it has been shown that using the NCBI-nt permits accurate analysis of the data with significantly fewer errors in microorganism abundance quantification [23]. To leverage this prior work on metagenomic analysis algorithms, samples from a healthy cohort of participants were collected and sequenced to specifically target healthy control data. To ensure that the samples were abundant and accurate enough to build a healthy reference list, we also retrieved sequences of healthy people from HMP. Furthermore, we developed an approach that generates a collection of assembled contiguous sequences (contigs) that cannot be aligned to any known sequence in NCBI-nt but are present in healthy individual fecal samples and are ideal for healthy-disease-microbiome correlation analysis and novel primer design. For the purposes of this study, these sequences are defined as metagenomic dark matter: sequences that cannot be mapped to any known sequence but can be assembled into contigs of 10,000 nucleotides or longer. Together, these data form our Gut Feeling Knowledge Base (GutFeelingKB). The contig nucleotide length threshold is expected to reduce the number of contigs in GutFeelingKB that are not of biological origin. Our definition is much stricter than previous definitions of the metagenomic dark matter, which accept remote homology to known sequences [24]. The need to include metagenomic dark matter in comprehensive analyses of the gut microbiome matches the arguments presented by Bernard et al. in their recent manuscript on microbial dark matter, where they opine that "unraveling the microbial dark matter should be identified as a central priority for biologists" [25].
The primary aim in creating GutFeelingKB is to provide a reference knowledgebase for the metagenomic analysis of the human gut microbiome. All the organisms which were confidently observed in a healthy human gut are included. Using this knowledgebase, we designed a standard reporting template of individual microbiome data for direct comparison to GutFeelingKB. This type of report can be useful to any scientist, clinician, or patient and can enhance comparison of results from different studies.

Metagenomic sampling and participant statistics
Healthy cohort selection and nutritional information. Participants for this study were recruited from the George Washington University (GW) Foggy Bottom campus area through the use of flyers and emails to GW-affiliated organizations (selection criteria are included in S1 Table). Samples and anthropometric measurements (included in S1 Table) were collected from healthy participants at GW according to a protocol approved by the George Washington Institutional Review Board (IRB#011605). At the baseline visit, participants received extensive instructions on how to record their dietary intake (including type, brand, and portion size of every food and beverage consumed on each day throughout the study period) and the time of consumption for each item. Participants then recorded their dietary intake using a seven-day food journal throughout the length of the study. Each participant provided three samples. The food journal was collected at the submission of the final sample, after which the reported 7-day dietary intakes for each subject were entered into the Nutrition Data System for Research (NDSR) [26]. NDSR produces a tabular daily nutrient profile for each day of dietary intake for each individual, which was then added as metadata to the abundance matrices (S2 Table). All participants self-reported as 'healthy' (participant does not have an obvious or self-declared disease state) at the start of the study and remained healthy throughout.
Sampling and sequencing. Fecal samples were collected from healthy volunteers using sterile commode containers at the Milken Institute School of Public Health at the George Washington University (GWSPH). Immediately following collection in ethanol, the fecal samples were stored in a -20 °C freezer for a period of up to two weeks, after which aliquots were placed in longer-term storage in a -80 °C ultra-freezer. Samples were subsequently transported to the sequencing center on dry ice. DNA was extracted using the MoBio PowerFecal DNA Isolation kit. Double-stranded DNA (dsDNA) concentration and quality were assessed using NanoDrop and the Qubit dsDNA Broad Range (BR) DNA Assay Kit, respectively. DNA was diluted for library preparation using the Illumina Nextera XT Library Prep Kit, and 1 ng from each sample was fragmented and amplified using Illumina Nextera XT Index Kit primers. Amplified DNA was then cleaned using Agencourt AMPure XP beads, resuspended in buffer, and tested again for concentration, quality, and fragment size distribution on a Bioanalyzer using the Agilent High Sensitivity DNA Kit. DNA libraries were brought to the same nM concentration, pooled, and denatured with 0.2 N NaOH prior to loading on an Illumina MiSeq Reagent Kit v3 and sequencing on the Illumina MiSeq platform. Sequence data FASTQ files were uploaded to BaseSpace (https://basespace.illumina.com/home/index) for sharing and further analysis.
Sequence quality assurance. All sequence data were uploaded to the GW High-performance Integrated Virtual Environment (HIVE) [27,28]. Upon initial upload into the system, HIVE automatically conducts a series of quality assurance (QA) computations for each sequence read file and generates figures to display the results. S1 Fig is a compilation of the quality assurance computations done on one read file.
Upon completion of the initial upload for each read file, the resulting quality assurance figures were inspected to ensure that the read file was of adequate quality and did not have any unusual characteristics (such as low-quality score or disproportionate distribution of nucleotides). Reads that had an average Phred quality score of 20 or less were discarded. The nucleotide base distribution was also examined to ensure that no read files had an unusual distribution of bases or a positional quality score below the threshold of 20. S2 Fig is an aggregate of the computations across all samples.
Healthy cohort from Human Microbiome Project. In addition to the data generated from sequencing described above, additional data were downloaded and analyzed from the Human Microbiome Project (HMP) [29]. HMP sequence data and metadata are available through NCBI SRA and dbGaP. Fifty fecal metagenomic samples were randomly chosen from HMP Phase I (S1 Table) to approximately match the number of samples collected in our study. HMP subjects were screened based on the stringent criteria listed in the HMP publication, and the individuals who passed the screening were considered "healthy" subjects [11].
GW and HMP combined data. Sequence and metadata from this study are publicly available through GutFeelingKB (https://hive.biochemistry.gwu.edu/gfkb), and are also available from two NCBI-SRA BioProjects: Healthy Human Gut Metagenomics (PRJNA428202) and Effects of non-nutritive sweeteners on the composition of the human gut microbiome (PRJNA487305). For PRJNA487305, only the samples donated prior to intake of non-nutritive sweeteners were used in this study. HMP data were downloaded from the NIH Human Microbiome Project (HMP) Roadmap Project (PRJNA43021).
A total of 48 samples from 16 individuals were sequenced in the GW cohort. Each sample resulted in two paired-end read files (for details see S3 Table). Sequence data from these 48 samples along with 50 samples from HMP passed sequence quality checks and were used to develop the baseline microbiota profile. For GW samples, 55.55% (± 13.46%) of the reads could not be mapped to any known sequence, while for HMP samples the figure was 48.29% (± 18.54%). There was no need for any computational filtering of human DNA in the GW samples because the MoBio PowerFecal DNA Isolation kit, which biochemically removes host DNA, was used. For the HMP data, all human DNA had been computationally removed before the samples were deposited in dbGaP [11]. Sample and participant information can be seen in Table 1.
Using a curated blacklist file of taxonomy IDs, Filtered-nt was generated based on terms that are contained in the lineage of each taxonomy entry. Taxonomy nodes with terms such as 'unclassified', 'unidentified', 'uncultured', 'unspecified', 'unknown', 'vector', 'environmental sample', 'artificial sequence', 'other sequence' were blacklisted. Child nodes are also automatically removed. The filtered taxonomy list was then used to filter the NCBI-nt sequence file. Filtered-nt and the blacklisted taxonomy IDs along with node names are available for download at https://hive.biochemistry.gwu.edu/filterednt.
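The blacklist construction described above can be illustrated with a minimal sketch. It assumes the taxonomy is available as two plain dictionaries (`names` and `parents`, both hypothetical structures invented for this example) and that a node is blacklisted when any name in its lineage contains one of the listed terms; the actual Filtered-nt tooling may differ.

```python
# Terms quoted in the text; matching is a simple case-insensitive substring test (assumption).
BLACKLIST_TERMS = (
    "unclassified", "unidentified", "uncultured", "unspecified", "unknown",
    "vector", "environmental sample", "artificial sequence", "other sequence",
)


def blacklisted_taxids(names, parents):
    """Return taxonomy IDs whose own name or any ancestor name hits a blacklist term.

    names:   dict taxid -> scientific name (hypothetical input format)
    parents: dict taxid -> parent taxid; the root maps to itself
    """
    flagged = {
        taxid for taxid, name in names.items()
        if any(term in name.lower() for term in BLACKLIST_TERMS)
    }
    result = set()
    for taxid in names:
        node, seen = taxid, set()
        # Walk up the lineage until the root (or a cycle) is reached.
        while node not in seen:
            if node in flagged:
                result.add(taxid)  # child nodes inherit the blacklist status
                break
            seen.add(node)
            node = parents.get(node, node)
    return result


def filter_sequences(seq_to_taxid, blacklist):
    """Keep only sequence accessions whose taxonomy ID is not blacklisted."""
    return [acc for acc, taxid in seq_to_taxid.items() if taxid not in blacklist]


# Toy taxonomy (illustrative only).
names = {1: "root", 2: "Bacteria", 3: "unclassified Bacteria",
         4: "Candidatus example", 5: "Escherichia coli", 6: "cloning vector pX"}
parents = {1: 1, 2: 1, 3: 2, 4: 3, 5: 2, 6: 1}
blacklist = blacklisted_taxids(names, parents)
kept = filter_sequences({"seqA": 5, "seqB": 4, "seqC": 6}, blacklist)
```

In this toy example node 4 is removed only because its parent (node 3) is blacklisted, which mirrors the automatic removal of child nodes.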

Metagenomic analysis pipeline
The innovative metagenomic analysis pipeline developed includes three software tools and one sequence database (Filtered-nt), organized in a fashion to produce a workflow that ensures an efficient and comprehensive analysis of a large sequence space. The tools are CensuScope [30], HIVE-Hexagon [31], and IDBA-UD [32]. All software tools are integrated in the HIVE platform [27,28] and allow end-to-end analysis of metagenomic sequences.
Healthy Human gut microbiome list (GutFeelingKB). CensuScope [30] is a taxonomic profiling software that randomly extracts a user-defined number of reads and maps them to any size sequence database using BLAST [33]. CensuScope is rapid, accurate, and is not hindered by the size of the reference sequence database. With the non-redundant sequence database's almost constant exponential increase, CensuScope offers a scalable approach for estimating taxonomic composition of a microbial population. A list of organisms, taxonomy identifiers, and BLAST alignments are provided as the output by CensuScope. A manual evaluation of the CensuScope results for each of the identified organisms was performed to verify that the "hit" represented an authentic match. "Manual evaluation" included the following criteria:
1. Inspection of the match count. The number of matched alignments over the entire computation (over all iterations) had to be five or more out of the total threshold of 12,500 alignments set by CensuScope. Five was chosen so that there were enough individual alignments to appraise the authenticity of the matches.
2. Confirmation of a justifiable taxonomy assignment. Hits to sequences that lacked a clear taxonomic lineage were excluded and marked for removal from Filtered-nt.
3. Completeness of sequence in GutFeelingKB. Partial sequences, single proteins, or unassembled contiguous sequences were mapped to complete genomes to be included in GutFeelingKB. This is the only way to keep partial sequences from skewing organism abundance results.
4. Organism verification. In order to have confidence in the results, it was necessary to independently verify the biological accuracy of each "hit". Metadata about the organism was reviewed to verify appropriateness of its presence in the human gut.
Any reference sequence and organism that satisfied these criteria was added to GutFeelingKB. To extend the usability of this list, available online databases and reference texts were used to annotate the organisms [22,34-36]. The NCBI accession numbers from the true positive CensuScope hitlist results were used to obtain the NCBI accession, the RefSeq accession, the NCBI taxonomy ID, the organism name (Scientific Name), and the genome assembly IDs. Using the taxonomy ID, the lineage and taxonomic name were retrieved from the NCBI taxonomy database.
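The first criterion (a minimum of five matched alignments over all iterations) is the only one that is purely mechanical, and it can be sketched as follows. The input format, a list of per-iteration hit lists, and the function name are assumptions made for illustration; the remaining criteria require manual review and are not modeled here.

```python
from collections import Counter

MIN_MATCHES = 5  # threshold stated in the text


def candidate_organisms(iterations, min_matches=MIN_MATCHES):
    """Pool hits over all iterations and keep organisms with enough alignments.

    iterations: list of lists, each inner list holding one organism
                identifier per aligned read (hypothetical input format)
    """
    counts = Counter()
    for hits in iterations:
        counts.update(hits)
    return {organism: n for organism, n in counts.items() if n >= min_matches}


# Toy example: organism "A" is hit five times in total, "B" twice, "C" once.
toy_iterations = [["A", "A", "B"], ["A", "A", "C"], ["A", "B"]]
candidates = candidate_organisms(toy_iterations)
```

Organisms that pass this filter would still need the taxonomy, completeness, and biological plausibility checks described in criteria 2 to 4.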
Genome to proteome mapping was guided by Representative Proteome Groups (RPGs), a dataset that clusters similar proteomes (https://proteininformationresource.org/rps/). The RPG clusters are calculated based on co-membership in UniRef50 clusters [34] (supplementary tables S4 and S8 Tables). Using the taxonomy ID and the RPG, the corresponding proteome in https://www.uniprot.org/proteomes was identified. From the proteome entry, verification of the Genome Assembly ID match between UniProt, RPG, and NCBI was performed.
In most instances the proteome entry contained some descriptive text about the organism taken from a publication, as well as citations. Such information was added as organism annotation. Additional fields (Resistance to Antibiotic, Susceptibility to Antibiotic, Physical Characteristics) were populated from other sources [36]. Finally, all of the associated DOI and PMIDs for the metadata were added to the final column. It is important to note that many bacteria are closely related and hence have large homologous regions. This can lead to species level misidentification. Although the concept of pan-genome or pan-proteome for closely related bacteria is well accepted [35], it is important to avoid such misidentification for known pathogens. To avoid such false positives of well-known pathogens (S5 Table), they are included only if their abundance is 1% or higher and their alignments have been manually evaluated.
Bacterial abundance profile. Fig 1 provides a schematic representation of the workflow. The first step uses CensuScope (a subsampling BLAST algorithm) to identify organisms that are present in the sample. To generate a rapid and accurate taxonomic profile, 2,500 reads are used in each iteration [30] (up to five iterations). The identified organisms are added to GutFeelingKB if they are not already present. Next, HIVE-hexagon, a highly specific and sensitive short-read aligner [37], is used to map all of the reads in each sample to GutFeelingKB (created through the use of CensuScope) to obtain the final abundance profiles. It is important to note that the HIVE-hexagon best match parameter was used. When a read matches more than one reference, this parameter maps it to the reference with the greatest number of matches.
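The best-match rule and the resulting relative abundance percentages can be illustrated with a short sketch. This is not HIVE-hexagon code: the per-read hit dictionaries, the alphabetical tie-break, and the choice to express abundance as a percentage of mapped reads are all assumptions made for the example.

```python
from collections import Counter


def best_reference(hits):
    """Pick the reference with the greatest number of matches for one read.

    hits: dict reference name -> match count (hypothetical input format).
    Ties are broken alphabetically so the result is deterministic (assumption).
    """
    return max(sorted(hits), key=lambda ref: hits[ref])


def relative_abundance(read_hits):
    """Assign each mapped read to its best reference and return percentages."""
    counts = Counter(best_reference(hits) for hits in read_hits if hits)
    total = sum(counts.values())
    return {ref: 100.0 * n / total for ref, n in counts.items()}


# Toy example: four reads, one of which is unaligned (empty dict).
toy_reads = [
    {"X": 90, "Y": 95},   # best match is Y
    {"X": 100},           # only X
    {"Y": 80, "X": 80},   # tie, resolved alphabetically to X
    {},                   # unaligned read, ignored here
]
abundance = relative_abundance(toy_reads)
```

Unaligned reads are skipped in this sketch; in the workflow described above they are passed on to assembly as candidate metagenomic dark matter.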
Metagenomic dark matter. The unaligned reads of each sample were assembled using IDBA-UD [32] and considered as metagenomic dark matter. Only the assembled contiguous sequences (contigs) longer than 10,000 nucleotides were investigated in this experiment. Such a large length threshold was used to help ensure that the metagenomic dark matter contigs were truly of biological origin. The gut microbiome of a sample can be represented as the sum of known organisms and organisms represented by the metagenomic dark matter sequences. More specifically, the contigs that were over 10,000 nucleotides in length were tagged with the sample ID and numbered, and metadata about the participant was added to the header.
These contigs are available as a download at (https://hive.biochemistry.gwu.edu/gfkb) for further analysis and novel primer design.
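The contig selection and header tagging step can be sketched as below. The FASTA parser is a plain-Python stand-in for the assembler output handling, and the header layout (`<sample>_contig<n> len=<length> <metadata>`) is a hypothetical format chosen for illustration; the strict "longer than 10,000" cutoff follows the wording in this section.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def tag_dark_matter(records, sample_id, metadata, min_len=10_000):
    """Keep contigs longer than min_len, number them long to short, retag headers."""
    kept = sorted(
        (seq for _, seq in records if len(seq) > min_len),
        key=len,
        reverse=True,
    )
    return [
        (f"{sample_id}_contig{i} len={len(seq)} {metadata}", seq)
        for i, seq in enumerate(kept, start=1)
    ]


# Toy assembly with contigs of 12,000, 9,000 and 15,000 nucleotides.
toy_fasta = ">a\n" + "A" * 12000 + "\n>b\n" + "C" * 9000 + "\n>c\n" + "G" * 15000 + "\n"
tagged = tag_dark_matter(parse_fasta(toy_fasta), "S01", "age=30")
```

Only the 15,000 and 12,000 nucleotide contigs survive, and they are numbered 1 and 2 in order of decreasing length.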

Analysis of nutritional metadata and microbial abundance
Fig 1. Step 1: CensuScope is run for each read file against Filtered-nt. Each aligned organism approved by manual check is added to GutFeelingKB, which is versioned. Step 2: For the final analysis the raw read files are mapped against GutFeelingKB organism sequences using HIVE-hexagon. Outputs are tabulated as relative abundance percentages. Unaligned reads from each sample were assembled using IDBA-UD. Contigs that were over 10,000 nucleotides long had their headers modified to include the following: sample ID, a number assigned according to length (long to short), and additional metadata about the participant. These contigs are available as a download at https://hive.biochemistry.gwu.edu/gfkb.
https://doi.org/10.1371/journal.pone.0206484.g001
MaAsLin, an R package that employs a "multivariate statistical framework that finds associations between clinical metadata and microbial community abundance or function" [38], was used to find correlations between bacterial abundance and diet. Intra-host variability was analyzed by evaluating the standard deviation of multiple measurements for every patient, averaged over all patients. Inter-host variability was computed as the standard deviation of the means of per-host abundance values. To estimate the degree of stability of measurements for bacterial populations in patient samples, the ratio of intra-host to inter-host variability was computed.
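The intra-host and inter-host variability measures described above can be sketched for a single organism as follows. The use of the sample standard deviation (n - 1 denominator) and the input layout are assumptions; the text does not state which estimator was used.

```python
from statistics import mean, stdev


def variability_ratio(per_host):
    """Compute intra-host variability, inter-host variability and their ratio.

    per_host: dict host id -> list of abundance measurements of one organism
              (hypothetical layout; each host needs at least two measurements)
    """
    # Intra-host: standard deviation within each host, averaged over hosts.
    intra = mean([stdev(values) for values in per_host.values()])
    # Inter-host: standard deviation of the per-host mean abundances.
    inter = stdev([mean(values) for values in per_host.values()])
    return intra, inter, intra / inter


# Toy example: three hosts with three samples each.
toy_abundance = {
    "host1": [10.0, 12.0, 14.0],
    "host2": [20.0, 22.0, 24.0],
    "host3": [30.0, 32.0, 34.0],
}
intra, inter, ratio = variability_ratio(toy_abundance)
```

A ratio well below 1, as in this toy data, would suggest that repeated measurements from the same host are more similar to each other than to measurements from other hosts.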
Nutrition to organism abundance correlation was also computed by using a Cosine Similarity Coefficient. The matrix of bacterial strain abundances was variance scaled and zero centered to create comparable distributions of equal variability. Categorical data (such as gender) were turned into numerical values. More specifically, in order to define correlation metrics between features and bacterial composition for the set of individuals, we used Cosine Similarity Coefficient as defined in Formula 1. Cosine Similarity Coefficient of correlation between bacteria (j) and feature (k) is computed as the sum product of j th Bacteria (Bj) abundance for patient i and k th Feature (Fk) of patient i.
A Cosine Similarity near 1 indicates a strong correlation, a value near -1 indicates a strong anticorrelation, and 0 indicates no correlation; 0.7 is considered the marginal threshold for evidence of some degree of correlation [39,40].
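A minimal sketch of this computation for one bacterium and one feature is given below. It assumes the standard definition of cosine similarity (the sum product divided by the product of the two vector norms) and uses the sample standard deviation for variance scaling; the normalization is an assumption, since the text describes only the sum product.

```python
import math


def zscore(values):
    """Zero-center and variance-scale a list of numbers (sample standard deviation)."""
    n = len(values)
    mu = sum(values) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))
    return [(v - mu) / sd for v in values]


def cosine_similarity(b, f):
    """Sum product of b and f over patients, divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(b, f))
    norm = math.sqrt(sum(x * x for x in b)) * math.sqrt(sum(y * y for y in f))
    return dot / norm


# Toy data: abundance of one bacterium and one dietary feature in four patients.
abundance = zscore([1.0, 2.0, 3.0, 4.0])
feature_up = zscore([2.0, 4.0, 6.0, 8.0])     # rises with abundance
feature_down = zscore([8.0, 6.0, 4.0, 2.0])   # falls with abundance

similar = cosine_similarity(abundance, feature_up)
opposite = cosine_similarity(abundance, feature_down)
```

Because both vectors are zero centered before the comparison, the cosine similarity computed this way coincides with the Pearson correlation coefficient of the original values.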

Filtered NCBI-nt (Filtered-nt)
NCBI nucleotide sequence collection (NCBI-nt) is the most comprehensive collection of DNA sequences [22], but many sequences present in NCBI-nt do not provide enough relevant information or correct metadata (e.g. sequences with taxonomy placement such as environmental, unclassified, synthetic sequences, unidentified sequences etc.). A great number of the sequences available in NCBI-nt are also artificial. Reads mapped to such sequences do not provide any valuable biological information in a clinical setting and hence are not useful in understanding the microbial composition of a sample. The version of NCBI-nt used to create our Filtered-nt (v5.0) initially contained 42,439,338 sequences and the taxonomy file contained 1,601,859 scientific names. After removal of 250,610 blacklisted taxonomy IDs (supplementary table S6 Table) pertaining to 7,499,592 sequences the Filtered-nt contained 34,939,806 sequences. The Filtered-nt is ideal for comprehensive metagenomic analysis that relies on a best sequence hit.
Most studies use genomes from known gut bacteria as a truncated reference database [18,30,41,42] and hence would not be able to detect organisms that are not present in their reference database. The use of our Filtered-nt provides assurance that the entirety of the known sequence space is covered while excluding the in-silico sequence space and the uncultured/unclassified sequence space.

Healthy fecal microbiome
GutFeelingKB: a reference list for healthy human gut organisms. GutFeelingKB is a compilation of highly curated data and metadata associated with organisms identified as present in the samples we analyzed. GutFeelingKB consists of 157 organisms which fall into sixty distinct genera, as seen in S2 Table, which is arranged by species. The full table can be downloaded at https://hive.biochemistry.gwu.edu/gfkb. Members of the Firmicutes and Bacteroidetes phyla make up a majority of the bacterial species that were present in the human intestinal microbiota. A total of 155 bacterial and 2 archaeal organisms were identified in healthy samples. In summary, the healthy human gut microbiome consists of 8 phyla, 18 classes, 23 orders, 38 families, 59 genera and 109 species. 63 (40%), 32 (20%) and 31 (19.7%) members belong to Firmicutes, Actinobacteria and Bacteroidetes, respectively, which make up a majority of the bacterial species. More than half of Firmicutes are members of the Clostridia (20.3%) class, which is the most abundant class, followed by Bacteroidia (18.5%), Bifidobacteriales (16.6%), Enterobacterales (14%) and Lactobacillales (14%). All members of Clostridia in the samples are members of the Clostridiales order and all Bacteroidia belong to Bacteroidales; these two are the most abundant orders. There are 27 organisms which are members of the Bifidobacteriaceae family, and 26 of them belong to Bifidobacterium longum, which is the most abundant species.
With respect to the core species concept, 84 out of 109 organisms are present in all of the samples (Table 2). These 84 could feasibly be a core organisms list for the human gut, but for this paper the focus is on creating a comprehensive list of organisms found in healthy individuals. A supplementary table provides a list of 129 organism clusters that are similar to the organisms in GutFeelingKB (similarity based on computational clustering of proteomes at 75% co-membership threshold [34]). This could serve as a supplement to GutFeelingKB to avoid the misidentification of highly similar organisms. All these 863 organisms comprise an expanded set of microbes that can be present in a healthy human gut.
Several researchers have focused on the reference genes of the gut microbiome rather than organisms [43,44], but organisms have their own clinical significance in treatment. When Yatsunenko et al. analyzed 531 healthy samples from Venezuela, rural Malawi and US metropolitan areas and mapped their reads to 126 microbial species, they found Fusobacteria that were not mapped to our list. On the other hand, Spirochaetes and Planctomycetes identified in this study were not shown in their list [45]. Of the organisms reported in their study, forty genera map to our list at the species level. Unmapped species include organisms such as Actinomyces odontolyticus, Bacteroides capillosus, and Bacteroides uniformis. Nishijima et al. identified 26 major genera in healthy Japanese individuals [46]. Twenty of the 26 genera they listed mapped to the list from this study; the unmapped genera belong to existing GutFeelingKB families and are Dorea, Dialister, Succinatimonas, Butyrivibrio, Coriobacteriaceae, and Phascolarctobacterium. Qin et al. grouped 66 clusters representing cognate bacterial species for healthy and liver cirrhosis patients [47], and the lowest taxonomy level of cluster in this study is strain. Thirty-six of these clusters map to GutFeelingKB in the taxonomy levels higher than species and all of them map to existing GutFeelingKB families. These studies of healthy microbiome diversity from around the world suggest there is significant regional heterogeneity in the healthy gut microbiome at species/strain level, but reasonable consistency at higher taxonomic levels.
In a study conducted to demonstrate the feasibility of accurate detection of clinically relevant prokaryotic targets [20], Almonacid et al. showed that it was practical to identify 28 specific targets (14 species and 14 genera) based on sequencing of the 16S rRNA marker gene, which is an important clinical application when considering the cost of a test. Adapting one of their supplementary files (https://doi.org/10.1371/journal.pone.0176555.s003), we were able to determine that 75 of the organisms listed in GutFeelingKB can be mapped to Almonacid et al.'s clinical targets. The mapping was done by identifying UniProt Proteome IDs [35] and the RPG [34] cluster that best matched the organism described. If the organism was not present in GutFeelingKB, we consulted our supplementary file S7 Table to see if the organism was present in one of the RPG clusters that are represented by the organisms in GutFeelingKB. The results of our mapping against this study are included as a supplementary table (S11 Table).
Twelve of the 28 clinical targets could not be resolved with GutFeelingKB. Only two of these were genera, with the rest being species-level classifications. Both of the genera are positively associated with abnormal GI states (Salmonella with diarrhea [48] and Fusobacterium with irritable bowel syndrome [49]). Of the ten species, five were positively associated with diarrhea or IBD (Vibrio cholerae, Salmonella enterica, Streptococcus sanguinis, Desulfovibrio piger, and Anaerotruncus colihominis). Only one of the species listed, Collinsella aerofaciens, did not have a reference proteome (https://www.uniprot.org/help/reference_proteome).
It is expected that while other studies will find additional organisms, GutFeelingKB can provide a reference list and abundance information that can provide a starting point for comparative analysis of samples from healthy individuals from around the world and can also help better understand observed differences due to disease, therapy, and diet.
Organism abundance in individual samples. Data interoperability is a perennial challenge in bioinformatics [50]. This problem is further magnified when considerations are made for data from samples collected in distant locations at different times. In the case of HMP, sampling was done in Houston, TX and St. Louis, MO during 2008-2012. All GW samples were collected from the DC Metro Area in 2016. One way to test the compatibility of these data sets was to run a Between-Class Analysis (BCA) on all samples from each of the projects. Data from our three separate projects fell into the expected three classic enterotypes [51] instead of clustering by project set (S4 Fig). Had the data clustered by project, sampling location, or year, they may not have been compatible for inclusion in the same database. However, we believe that these data do not show a sampling bias and can be leveraged for joint analysis. The sample and participant information are presented in Table 1. Many studies have focused on higher taxonomy nodes, providing little abundance information about specific species or strains. Fig 2 shows the abundance of phyla to highlight how baseline gut microbiome results from this study can be used to compare results from past studies. An abundance sheet with the lowest taxonomy node broken down to the strain level, where applicable, is provided so that other scientists can use the results for comparison purposes. Average abundance, standard deviation, and maximal and minimal abundance, excluding organisms with 0% abundance, are given in S9 Table. It has been shown that Bacteroides is the most abundant genus in Spain, China, Sweden, US, Denmark and France in samples collected from healthy individuals [46]. Bacteroides maintain a generally beneficial relationship with the host when retained in the gut but can also be opportunistic pathogens.
When Bacteroides escape the gut environment, they can cause significant pathology, including bacteremia and abscess formation in multiple body sites [52]. Bacteroides fragilis protects animals from colitis induced by Helicobacter hepaticus, a commensal bacterium with pathogenic potential [53]. A large proportion of the B. fragilis genome is responsible for carbohydrate metabolism, including the degradation of dietary polysaccharides [54]. Bifidobacterium has been reported to be present in almost all healthy human fecal samples. Members of Bifidobacterium are among the first microbes to colonize the human gastrointestinal tract and are believed to exert positive health benefits on their host [55]. Many species of Bifidobacterium are commonly used as probiotics due to their health-promoting properties [56]. Certain Bifidobacterium longum strains have been used as probiotics against enterohemorrhagic Escherichia coli infection due to the production of acetate, a short-chain fatty acid that upregulates a barrier function of the host gut epithelium [57]. In general, Bifidobacteria are able to survive in particular ecological niches due to competitive adaptations and metabolic abilities through colonization of specific appendages. There are 12 strains of the species Bifidobacterium longum. One strain, BBMN68, was isolated from the feces of a healthy centenarian living in BaMa, Guangxi, China, an area known for longevity [58]. Another strain of Bifidobacterium, BGN4, was shown to prevent CD4(+) CD45RB(high) T-cell mediated inflammatory bowel disease by inhibition of disordered T cell activation in BGN4-fed mice [59]. Despite these well-established health benefits, the molecular mechanisms responsible for these traits remain to be elucidated.
Some potentially pathogenic species appear in healthy samples both in this study and in the samples collected by Yatsunenko et al. [45]. Streptococcus mitis, a species that can cause severe clinical symptoms in cancer patients [60], was also identified. It is likely that organisms such as S. mitis are opportunistic pathogens. Several strains of Escherichia coli were also found; the majority of E. coli strains are generally considered harmless intestinal inhabitants. E. coli is one of the first bacteria to colonize human infants and is a lifelong colonizer of adults [61], although pathogenic strains of E. coli have been implicated in the etiology of health problems such as Crohn's disease and ulcerative colitis [62].

Dietary data and nutrient correlative analysis
In comparing bacterial species to nutrient data using MaAsLin, several interesting patterns were observed. Bifidobacterium was positively correlated with dietary protein intake (Fig 3a), specifically vegetable protein, as well as with dietary fiber, specifically soluble fiber, which is present in vegetables such as broccoli, Brussels sprouts, beans, peas and asparagus that also contain vegetable protein. Akkermansia (Fig 3b) was positively associated with saturated fat intake and negatively correlated with total polyunsaturated fatty acids (PUFA). Not surprisingly, it was also positively correlated with linoleic acid, as this particular omega-6 PUFA is found abundantly in oils (e.g. soybean oil, vegetable oil) used in processed food. Bacteroides ovatus was positively correlated with daily calorie intake (Fig 3c), as well as with body weight (Fig 3d) and waist circumference. The table of results (see supplementary file S10 Table) demonstrates the range of correlation for the features that were measured. Cosine Similarity Coefficient analysis (see supplementary file S11 Table) identified correlations between features and organisms that were similar to the MaAsLin observations. For example, characteristics such as fat intake and BMI correlate with members of Akkermansia. Similarly, vitamin A and beta-carotene intake show a positive correlation across all the Bifidobacterium (Fig 4).
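The cosine similarity coefficient mentioned above can be illustrated with a minimal Python sketch. It computes the standard cosine of the angle between a dietary feature vector and an organism abundance vector measured over the same participants. The variable names and the five values are hypothetical and are not data from this study, and whether the original analysis centered or scaled the vectors beforehand is not stated here, so the sketch uses the raw values.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Convention chosen for this sketch: an all-zero vector has no direction.
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical values for five participants (illustration only).
fat_intake_g = [60.0, 85.0, 70.0, 110.0, 95.0]   # daily fat intake in grams
akkermansia_pct = [0.8, 1.9, 1.1, 3.2, 2.4]      # relative abundance in percent

score = cosine_similarity(fat_intake_g, akkermansia_pct)
print(f"cosine similarity: {score:.3f}")
```

For non-negative inputs such as intakes and abundances the coefficient lies between 0 and 1. If both vectors are mean-centered first, the same formula gives the Pearson correlation coefficient, which can take negative values.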
As microbiome science moves closer to the clinic, it will be imperative to have tools both for analysis and for the quick understanding of a microbial population. It is envisioned that such analyses will provide the foundation for clinical reporting. While each organism in an entire microbiome sample is not immediately actionable, such a profile does allow for both the close tracking of microbial modulation and a better understanding of how the microbiome tracks with health states and therapy. This will be further applicable as evidence-based medicine approaches microbiome science, and microbiome science becomes as important to clinical treatment as genomic medicine. Preliminary microbiome analyses are increasingly yielding interesting results in complex diseases such as cancer. For example, in colorectal cancer patients, the abundance of carcinoma-enriched bacteria (B. massiliensis, B. dorei, B. vulgatus, Parabacteroides merdae, A. finegoldii and B. wadsworthia) is positively correlated with red meat consumption and negatively correlated with fruit and vegetable consumption [63]. It is expected that as the number and size of these studies increase, a baseline human gut microbial profile in healthy people and a standard reporting template will become essential.
Contigs from unaligned reads (microbial dark matter). On average, 50% of the reads from an individual sample could not be aligned to any sequence in Filtered-nt. These unaligned reads were assembled into contigs. Previous work has shown that creation of contigs from unaligned short reads can be used to better understand the actual sequence space represented in metagenomics samples [64]. This "microbial dark matter" remains to be elucidated. A BLAST search against NCBI-nt sequences did not yield any significant matches. Given that the average protein-coding density of bacterial genomes is 87% with a typical range of 85-90% [65], and that the organisms in our reference list range in size from 1.89 to 6.17 Mb, contigs shorter than 10 kb were excluded. With this threshold, any single contig covers at least 0.16% of an organism's genome, or 0.19% of an organism's coding region. The goal here was to reduce the number of false positive contigs. Using this approach, unaligned reads were assembled into 1,467,129 contigs, of which 46,095 have a length greater than 10 kb. After building the contigs, sequences greater than 10,000 nucleotides were saved into the same file, and each header was formatted to indicate the sample number, gender, age, and ethnicity of the source. The file is available for download at https://hive.biochemistry.gwu.edu/prd/gfkb//content/unalignedContigsGFKB-v2.0.fasta. These contigs are ideal for new primer design for detailed analysis of the gut microbiome.
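The arithmetic behind the 10 kb threshold, and the length filter itself, can be sketched in a few lines of Python. This is an illustrative sketch and not the pipeline used in the study: the function names are hypothetical, the lower bound on coverage is computed against the largest genome in the reference list (6.17 Mb), and the cutoff is assumed to be inclusive (contigs of exactly 10,000 nucleotides are kept).

```python
MIN_CONTIG_LEN = 10_000       # nucleotides, threshold stated in the text
LARGEST_GENOME = 6_170_000    # 6.17 Mb, largest genome in the reference list
CODING_DENSITY = 0.87         # average protein-coding density [65]

# Smallest fraction of a genome (or of its coding region) that a 10 kb contig covers.
genome_fraction = MIN_CONTIG_LEN / LARGEST_GENOME
coding_fraction = MIN_CONTIG_LEN / (LARGEST_GENOME * CODING_DENSITY)
print(f"{genome_fraction:.2%} of the genome, {coding_fraction:.2%} of the coding region")

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_long_contigs(records, min_len=MIN_CONTIG_LEN):
    """Keep contigs whose length is at least min_len (inclusive by assumption)."""
    return [(h, s) for h, s in records if len(s) >= min_len]

# Toy input: one contig at the threshold and one short contig.
demo = [">contig_a", "A" * 10_000, ">contig_b", "ACGT"]
kept = filter_long_contigs(read_fasta(demo))
print([h for h, _ in kept])
```

The two fractions round to the 0.16% and 0.19% quoted above. For smaller genomes in the list the same contig covers a larger share, which is why the largest genome gives the lower bound.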
The unaligned reads were used for contig assembly post-alignment to minimize the risk of losing informative contigs to consensus sequences that may map partially to the organisms in GutFeelingKB. This was confirmed in an experiment where all the reads were assembled first and the contigs were then mapped to the genomes from GutFeelingKB. This step resulted in a much smaller number of contigs over 10 kb. Most likely some of the dark matter contigs are from bacteriophages. Using the pre-assembly method, one could potentially identify novel bacteriophages and associate each phage with its host organism.
The report does not include an explanation of what a particular result means, as it is premature to tie a specific microbe to a phenotype in cases other than infectious disease, and the interpretation of any result falls within the purview of the requesting physician. With more information on the role of the microbiome and its constituent microbes, it will become more feasible to show where a sample from an individual lies within the spectrum of healthy or dysbiotic microbial abundance.

FecalBiome Reporting Template
All relative abundances were calculated for the individual datasets before quantifying the relative min, relative max, mean, median, and standard deviation (Fig 3). These statistics were then transformed into one cohesive report that merged the range, mean, median, and standard deviation. The statistics were further collapsed by family to generate an overall report that models a complete metabolic profile. The most abundant families (Akkermansiaceae, Bacteroidaceae, Enterobacteriaceae, Rikenellaceae, and Ruminococcaceae) had a relative max of 8.03, 12.13, 10.99, 6.89, and 6.31 percent of relative abundance, respectively. This is not surprising considering that the Rikenellaceae family is indicative of good gastrointestinal health [68]. Akkermansiaceae is linked to lower rates of obesity and associated metabolic disorders [69]. Bacteroidaceae and Enterobacteriaceae can be linked to acute infective processes but are otherwise symbionts [70,71], and Ruminococcaceae is known to break down complex carbohydrates, especially in people with carbohydrate-heavy diets [72]. FecalBiome and the underlying GutFeelingKB can have high value to clinicians who hope to assess the gut microbial status of their patients. The goal of the database and report is to connect lab results with outcomes. At present, most known microbiome disease associations are a type of severe dysbiosis caused by potentially pathogenic bacteria, namely canonical infectious pathogens such as Helicobacter pylori, Vibrio cholerae and others. By determining which species or strains correlate with good or bad outcomes, this type of research could aid clinicians in developing strategies for valuable evidence-based treatments.
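The summary statistics described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind the FecalBiome report: relative abundance is taken as the percentage of mapped reads per sample, zero abundances are excluded from the statistics (as described for S9 Table), the sample standard deviation is used, and the read counts, function names and family mapping are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, stdev

def relative_abundance(read_counts):
    """Convert {organism: mapped reads} into {organism: percent of mapped reads}."""
    total = sum(read_counts.values())
    if total == 0:
        return {org: 0.0 for org in read_counts}
    return {org: 100.0 * n / total for org, n in read_counts.items()}

def collapse_by_family(abundance, family_of):
    """Sum organism-level percentages into family-level percentages."""
    out = defaultdict(float)
    for org, pct in abundance.items():
        out[family_of[org]] += pct
    return dict(out)

def summarize(per_sample):
    """Per-taxon min, max, mean, median and SD across samples, ignoring zeros."""
    values = defaultdict(list)
    for sample in per_sample:
        for taxon, pct in sample.items():
            if pct > 0:
                values[taxon].append(pct)
    return {
        taxon: {
            "min": min(v), "max": max(v), "mean": mean(v), "median": median(v),
            "sd": stdev(v) if len(v) > 1 else 0.0,
        }
        for taxon, v in values.items()
    }

# Hypothetical read counts for two samples (illustration only).
family_of = {
    "Akkermansia muciniphila": "Akkermansiaceae",
    "Bacteroides ovatus": "Bacteroidaceae",
    "Bacteroides fragilis": "Bacteroidaceae",
}
samples = [
    {"Akkermansia muciniphila": 50, "Bacteroides ovatus": 150, "Bacteroides fragilis": 300},
    {"Akkermansia muciniphila": 0, "Bacteroides ovatus": 100, "Bacteroides fragilis": 100},
]
per_family = [collapse_by_family(relative_abundance(s), family_of) for s in samples]
report = summarize(per_family)
print(report["Bacteroidaceae"])
```

Computing relative abundance per sample before aggregating, as the text describes, keeps samples with different sequencing depths on the same scale.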

Conclusion
The metagenomic analysis workflow described in this study involves a sub-sampling-based method followed by comprehensive mapping of all of the reads to accurately determine the abundance of microorganisms. The workflow provides a comprehensive snapshot of the microbial abundance and can easily be used with any state-of-the-art NGS read mapping and assembly algorithms. The list of baseline organisms identified in the normal human gut has clinical applicability as microbiome research moves closer to the bedside. The methods, tools and data from this project can also be used by regulatory scientists to evaluate workflows related to fecal transplant.
In addition to the workflow, this work lays the foundation for an expansive and modular database that can aggregate publicly available data as well as data from contributors to push towards an understanding of the baseline human microbiome. This database can serve as a reference in studies of dysbiosis and of the microbiome associated with diseases. The user-friendly FecalBiome report, which contains absolute and relative abundance information about a given sample compared to an average across the entire database, allows scientists, clinicians, and eventually patients to obtain an overview of the gut microbiome. This work has the potential to have a significant impact on regulatory science (e.g., FDA) and standards organization (e.g., NIST) research efforts in this area. For example, GutFeelingKB can potentially allow for rapid assessment of the content of human GI replacement products and, ideally, allow for more expedient review of products. Future studies to advance evidence-based microbiome medicine can be conducted in which potential patients identify the outcomes of interest (such as depression, bloating, epilepsy, frequency of common colds, cancer, etc.). For example, Apte et al. [20] identified 28 disease-related organisms that can be targeted to evaluate the health status of an individual and used to detect disease, while the FecalBiome report can be used to communicate the microbiome-related health status of an individual between a clinician and patient. The patient-identified outcomes will become endpoints in clinical trials or observational studies that demonstrate the effects of various bacteria on the human gut. This type of methodology would tie raw numbers to health states that are meaningful for the general population, ensuring that the data gathered are relevant to the patient, and therefore to the clinician. This could bring a new, patient-centric perspective to microbiome data use and allow for a greater scope of health data to sit atop metagenomic sequence data.
If everyone uses the same set of clinically relevant endpoints, research will be easily comparable across studies and meta-analyses will become interoperable.

S9 Table. Abundance tables are presented as 7 tables, each representing a different taxonomy node: phylum, class, order, family, genus, species, and strain. Average abundance, standard deviation, and maximal and minimal abundance are provided, excluding organisms with 0% abundance. The table titled HitList provides the actual number of reads that were mapped. (XLSX)
S10 Table. Associations between clinical metadata and microbial community abundance. (XLSX)
S11 Table. Cosine similarity coefficient of correlation. This table demonstrates the range of correlation for features that have been measured, and the organisms that have been detected. (XLSX)