Reproducibility, stability, and accuracy of microbial profiles by fecal sample collection method in three distinct populations

The gut microbiome likely plays a role in the etiology of multiple health conditions, especially those affecting the gastrointestinal tract. Little consensus exists as to the best, standard methods to collect fecal samples for future microbiome analysis. We evaluated three distinct populations (N = 132 participants) using 16S rRNA gene amplicon sequencing data to investigate the reproducibility, stability, and accuracy of microbial profiles in fecal samples collected and stored via fecal occult blood test (FOBT) or Flinders Technology Associates (FTA) cards, fecal immunochemical tests (FIT) tubes, 70% and 95% ethanol, RNAlater, or with no solution. For each collection method, based on relative abundance of select phyla and genera, two alpha diversity metrics, and four beta diversity metrics, we calculated intraclass correlation coefficients (ICCs) to estimate reproducibility and stability, and Spearman correlation coefficients (SCCs) to estimate accuracy of the fecal microbial profile. Comparing duplicate samples, reproducibility ICCs for all collection methods were excellent (ICCs ≥75%). After 4–7 days at ambient temperature, ICCs for microbial profile stability were excellent (≥75%) for most collection methods, except those collected via no-solution and 70% ethanol. SCCs comparing each collection method to immediately-frozen no-solution samples ranged from fair to excellent for most methods; however, accuracy of genus-level relative abundances differed by collection method. Our findings, taken together with previous studies and feasibility considerations, indicated that FOBT/FTA cards, FIT tubes, 95% ethanol, and RNAlater are excellent choices for fecal sample collection methods in future microbiome studies. Furthermore, establishing standard collection methods across studies is highly desirable.


Introduction
The human colon is host to trillions of bacteria comprising the gut microbiota. There is strong biological plausibility for the role of the gut microbiota and its metabolites in human health, particularly for diseases of the gastrointestinal tract, such as colorectal cancer [1] and inflammatory bowel conditions, [2] and other metabolic and neurological diseases. [3,4] Providing strong evidence for the role of the gut microbiome in human health requires fecal sample collection in large population-based, prospective epidemiologic studies; however, little consensus exists regarding the best, standard methodology for collection and storage of these samples, coordination of which is important for conducting pooled analyses of microbial data.
Currently, an array of different collection methods for fecal samples are being used in studies of the microbiome, each with advantages and disadvantages relating to feasibility for implementation and preservation of microbial profiles. While immediately freezing samples at -80˚C is widely viewed as optimal, [5][6][7][8][9] for large epidemiologic studies this may be highly infeasible. For example, previous large microbiome studies, such as the American Gut Project, [10,11] relied on samples collected by participants in the comfort of their home, and as a result, fecal samples spent several hours to days stored in home freezers and/or at room temperature during shipping. Ideally, the sampling method of choice should be one that preserves the microbial profile, especially under suboptimal conditions that are typical of field studies, and furthermore, that can be used for other -omics, such as transcriptomics and metabolomics studies.
Previously, small studies used 16S rRNA gene sequencing to characterize the influence of the sample collection and storage on the microbial composition of fecal samples; however, larger studies, comprising diverse study population settings, are needed to understand the best methodology for collecting and storing fecal samples in large studies of the microbiome. Herein, we meta-analyzed microbial data from four previous studies (N = 132) [12][13][14][15] spanning two countries, the United States (US) and Bangladesh, and assessed technical reproducibility, stability over 4-7 days at ambient temperature, and accuracy of the microbial profiles in fecal samples collected using six methods: no-solution, fecal immunochemical tests (FIT), fecal occult blood test (FOBT) cards or Flinders Technology Associates (FTA) cards, 70% and 95% ethanol, and RNAlater. These analyses were based on ten microbiome metrics and the most dominant genera, providing an extensive characterization of overall microbial composition by each collection method.

Study population
This analysis included samples from a total of 132 healthy volunteers, drawn from four previously published studies, [12][13][14][15] that contributed fecal samples for analyses. As previously described, 20 samples were collected in the Mayo 1 study in Rochester, Minnesota, US and analyzed at both the Knight laboratory and the Mayo laboratory, [12] 52 samples were collected in the Mayo 2 study, [14] 50 samples were collected in the Bangladesh Health Effects of Arsenic Longitudinal Study (HEALS), [15] and 10 samples were collected at the University of Colorado. [13] Each participant provided written informed consent, and each study was granted ethics approval from the relevant institutional review boards as described in the original publications. The approval number from the NCI Office of Human Subjects Research for the Mayo 1 and 2 studies is 12189 and for the Bangladesh study is 12741. The Colorado study was done under University of Colorado Boulder IRB Protocol #0409.13.

Fecal sample collection methods
The fecal sample collection methods for each study were described in detail previously, [12][13][14][15] and the number of replicates aliquoted for each collection method and day of freezing are listed in Table 1. Briefly, each participant collected stool and delivered it to the study coordinator for immediate processing. The fecal specimens were mixed manually using a spatula, and aliquoted to each of the six different collection methods (no-solution, FIT tubes, FOBT/FTA cards, 70% and 95% ethanol, and RNAlater) in a random order for each participant. Approximately 1-2 grams of feces (about a full scoop) were placed in a Sarstedt feces tube (Numbrecht, Germany) containing no-solution, 2.5 mL of RNAlater Stabilization Solution, or 2.5 mL of 70% or 95% ethanol (Sigma-Aldrich, St. Louis, Missouri). A portion of the mixed fecal specimen was also smeared thinly onto Whatman FTA cards and onto the Triple-slide Hemoccult II Elite Dispensapak Plus for FOBT (Beckman Coulter, Brea, California) per manufacturer instructions. For the FIT tube (Polymedco, Inc., Cortlandt Manor, New York) samples, the FIT probe was dipped into the fecal specimen, placed into the tube, and the tube shaken.
To assess reproducibility, duplicate aliquots of each specimen were created for each participant within each collection method and frozen at the same timepoint. To assess stability, duplicate aliquots of each specimen were frozen at −80˚C immediately (day-0), and the remaining samples were left at ambient temperature for four days and then frozen (day-4), except in the Colorado study where samples were stored at ambient temperature for 7 days and then extracted and sequenced without further storage (day-7). For the accuracy assessment, replicates of no-solution samples were frozen immediately after collection and were considered the gold standard for comparison to immediately-frozen samples collected via other collection methods.

DNA extraction and sequencing
The Knight lab, located initially at the University of Colorado Boulder (through 2014) and then at the University of California San Diego, performed DNA extractions and sequencing for most samples in this analysis, except for the twenty samples in the Mayo 1 study that were analyzed at both the Knight lab and the Mayo Clinic. We previously found differences in microbial composition of samples from the two labs, [12] as the DNA extraction and amplification procedures differed slightly as described below. Thus, the two sets of samples were considered separately in our analyses.
Knight lab. Methods for DNA extraction, polymerase chain reaction (PCR) amplification, and sequencing were previously described in detail. [12,14,15,17] Briefly, DNA extraction, PCR amplification of the V4 region of the 16S rRNA gene, and amplicon preparation were performed as described by Caporaso et al. [17], using the universal bacterial primer set 515F/806R; [17,18] the protocols can be found on the Earth Microbiome Project website (http://www.earthmicrobiome.org/emp-standard-protocols/dna-extraction-protocol/). All DNA extraction and PCR amplification included no-template controls. All barcoded amplicons were pooled in equal concentrations for sequencing on Illumina's MiSeq for Mayo 1 (University of Colorado, Boulder, USA) and HiSeq for the Mayo 2, Bangladesh, and Colorado studies (University of California San Diego's Institute for Genomic Medicine; 150bp). The average coverage was 30,000-37,000 reads per sample.
Mayo lab. Methods for DNA extraction, PCR amplification, and sequencing were described previously. [12] Briefly, DNA was extracted using the PowerSoil DNA isolation kit (MoBio Laboratories), and the V3-V5 region (357F/926R) of the 16S rRNA gene was amplified. The samples were sequenced using the Illumina MiSeq (San Diego, CA) sequencing platform (2x250bp). The average coverage was ~70,000 reads per sample.

Bioinformatic data processing
Reads were demultiplexed and quality filtered using QIIME 1.9 at the default setting, which was a Phred quality score of >3. [19] Each sample was independently cleaned by removing all candidate read-errors using deblur. [20] The cleaned read files were joined to make a single biom table, with each operational taxonomic unit (OTU) representing a unique 150-base pair sequence for Knight laboratory samples and a unique 250-base pair sequence for Mayo laboratory samples. All data were rarefied to 10,000 reads per sample. While the primer sets used for amplification in both the Knight and Mayo labs measure both bacteria and archaea, we focused solely on bacteria, and assigned taxonomy using the QIIME assign_taxonomy command with the rdp method and Greengenes 13.8 at 97% similarity. [13] Next, using the R phyloseq package, [21] we calculated two alpha diversity metrics: the number of OTUs present in a sample, which reflects species richness, and the Shannon diversity index, which reflects both richness and evenness. We calculated four beta diversity metrics using the R vegan package [22] to reflect the shared diversity between bacterial populations in terms of ecological distance within each study population (i.e., a distance matrix was calculated for each study population): unweighted, generalized (calculated using the R GUniFrac package [23]) and weighted UniFrac distances, and the Bray-Curtis distance.
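As a rough illustration of the non-phylogenetic metrics named above, the following Python sketch computes the number of observed OTUs, the Shannon index, and the Bray-Curtis distance from count vectors. It is not the phyloseq or vegan implementation used in the analysis; the function names and the toy counts are hypothetical, and the natural logarithm for the Shannon index is an assumption.

```python
import math

def observed_otus(counts):
    """Richness: number of OTUs with a nonzero count in one sample."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon index from raw counts (natural log assumed)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    return numerator / (sum(a) + sum(b))

# Toy count vectors (hypothetical), already at equal sequencing depth.
sample_a = [50, 30, 20, 0]
sample_b = [25, 25, 25, 25]

richness_a = observed_otus(sample_a)
shannon_b = shannon(sample_b)
bc_ab = bray_curtis(sample_a, sample_b)
```

The UniFrac variants additionally require a phylogenetic tree and are not covered by this sketch.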

Statistical analysis
We identified suspicious samples that were possibly contaminated or mis-labeled by calculating an outlier index based on unweighted UniFrac distance, as samples' unweighted UniFrac values tend to cluster more tightly by subject as compared to generalized UniFrac, weighted UniFrac, and Bray-Curtis distance. Specifically, the outlier index for a given sample K from subject S was calculated as the ratio between (a) the average distance from sample K to other samples from subject S and (b) the median within-subject S distance. We flagged all replicate samples with an outlier index larger than 1.4 as suspicious outlier candidates. Because the threshold was quite lenient, some normal samples could also be flagged, so we further investigated flagged samples with principal coordinate analysis (PCoA) plots and genus-level abundance bar plots.
Replicate samples that did not cluster with the rest of the samples or with a highly different genus profile were excluded.
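The outlier index described above can be sketched as follows. This is a minimal Python illustration, assuming that the "median within-subject distance" is taken over all pairwise distances among one subject's samples; the function name, the data layout, and the toy distances are hypothetical.

```python
from itertools import combinations
from statistics import mean, median

THRESHOLD = 1.4  # flagging threshold stated in the text

def outlier_indices(dist, samples):
    """Outlier index for every sample of one subject.

    dist maps frozenset({i, j}) to the distance between samples i and j;
    samples lists the sample ids belonging to a single subject.
    """
    within = [dist[frozenset(pair)] for pair in combinations(samples, 2)]
    med = median(within)  # assumed: median over all within-subject pairs
    return {
        k: mean(dist[frozenset((k, other))] for other in samples if other != k) / med
        for k in samples
    }

# Toy example (hypothetical): replicates a, b, c agree; d is far from all three.
samples = ["a", "b", "c", "d"]
dist = {frozenset(p): 0.2 for p in combinations(["a", "b", "c"], 2)}
dist.update({frozenset(("d", s)): 0.6 for s in ["a", "b", "c"]})

indices = outlier_indices(dist, samples)
flagged = [k for k, v in indices.items() if v > THRESHOLD]
```

In this toy case only sample d exceeds the threshold; in the analysis, flagged samples were then inspected with PCoA and genus-level bar plots before any exclusion.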
A distance-based coefficient of determination, R², was used to estimate the percentage of microbiota variability explained by subject, collection method, and storage time ('adonis' function in the R 'vegan' package [12]) using unweighted, generalized, and weighted UniFrac and Bray-Curtis distances.
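For intuition, a distance-based R² for a single grouping factor can be computed directly from a distance matrix by partitioning sums of squared distances. The Python sketch below is a simplified, single-factor illustration and not the 'adonis' implementation, which handles several terms and permutation tests; the function name and the toy matrix are hypothetical.

```python
from itertools import combinations

def distance_r2(dist, labels):
    """Share of total squared-distance variation explained by one grouping factor.

    dist is a square symmetric matrix (list of lists); labels gives the
    group (e.g., subject) of each row.
    """
    n = len(labels)
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for group in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == group]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(idx, 2)) / len(idx)
    return 1.0 - ss_within / ss_total

# Toy matrix (hypothetical): two subjects with two samples each;
# within-subject distance 0.1, between-subject distance 0.9.
labels = ["s1", "s1", "s2", "s2"]
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
r2_subject = distance_r2(dist, labels)
```

A value near 1 means that nearly all variation in the distance matrix lies between groups rather than within them.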
To assess technical reproducibility, for each collection method, we calculated intraclass correlation coefficients (ICCs) using a mixed effects model for duplicate fecal samples frozen day-0. To assess stability at ambient temperature, for each collection method, we calculated ICCs using a mixed effects model for one randomly selected replicate of samples frozen after 4 or 7 days at ambient temperature compared to one replicate of samples frozen day-0. To assess accuracy, we calculated Spearman correlation coefficients (SCCs) to investigate whether the rank order of microbial metrics was preserved for each randomly selected replicate for each collection method compared to the "gold standard" (i.e., fecal samples frozen day-0 with no solution). The ICCs/SCCs for stability and accuracy were averaged over 100 random samplings of replicates. For all ICC values, except distance-based ICCs (described below), we calculated 95% confidence intervals (95% CIs) using the R 'ICC' package (CI = 'Smith'). For distance-based ICCs and SCCs, we calculated 95% CIs using 1,000 bootstrap samples. We interpreted ICC/SCC values <40% as poor, 40%-59.99% as fair, 60%-74.99% as good, and ≥75% as excellent. [24] To encompass changes in the entire microbial community, we calculated ICCs/SCCs based on ten microbial metrics: the relative abundance of the top four most abundant phyla (Actinobacteria, Bacteroidetes, Firmicutes, and Proteobacteria), two alpha diversity metrics (observed OTUs and Shannon diversity), and four beta diversity metrics (unweighted UniFrac, generalized UniFrac, weighted UniFrac, and the Bray-Curtis distance). We also calculated ICCs/SCCs for fecal genera with a prevalence in the population >80% and a mean relative abundance >0.2%.
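To make the ICC and its interpretation bands concrete, the Python sketch below uses the classical one-way ANOVA estimator for balanced replicates. This is a stand-in chosen for simplicity, not the mixed effects model fit used in the analysis; the function names and the toy duplicate values are hypothetical.

```python
from statistics import mean

def icc_oneway(groups):
    """One-way random-effects ICC for balanced data.

    groups is a list with one entry per subject, each holding the same
    number of replicate measurements.
    """
    n = len(groups)
    k = len(groups[0])
    grand = mean(v for g in groups for v in g)
    ms_between = k * sum((mean(g) - grand) ** 2 for g in groups) / (n - 1)
    ms_within = sum((v - mean(g)) ** 2 for g in groups for v in g) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def interpret(value):
    """Interpretation bands from the text, on a 0-1 scale."""
    if value < 0.40:
        return "poor"
    if value < 0.60:
        return "fair"
    if value < 0.75:
        return "good"
    return "excellent"

# Toy duplicates (hypothetical): one metric measured twice in four subjects.
duplicates = [[0.30, 0.32], [0.50, 0.47], [0.70, 0.71], [0.20, 0.22]]
icc = icc_oneway(duplicates)
label = interpret(icc)
```

Because the between-subject spread in the toy data is much larger than the duplicate-to-duplicate spread, the ICC falls in the "excellent" band.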
For the four beta diversity metrics, to reflect the preservation of the inter-sample relationships, we used a distance-based ICC, for which the within-subject squared distances and between-subject squared distances were used to calculate the biological and technical variance. [15] SCCs were calculated using all pairwise distances. For the relative abundance metrics, ICCs were calculated based on square root transformed relative abundance to reduce the influence of extremely high abundances and to make the data roughly meet the normality assumption under the mixed effects model for ICCs. To address the compositional nature of microbial data, we also calculated relative abundance based on the centered log-ratio transformation. [25] Finally, we performed differential abundance analyses on the phylum-, class-, and genus-level taxa with a prevalence > 50% and a mean count > 10 reads (except for Fusobacteria, which was of biological interest and a strong risk factor for aggression and progression of colorectal cancer [26]) by comparing the abundance of each taxon for each collection method on day-4 vs. day-0 (stability), and for each collection method compared to the gold standard (accuracy). We fitted a generalized linear mixed effects model (GLMM, R 'lme4' package v1.1.14, 'glmer' function with 'Poisson' family and log link) to the taxa count data, accounting for within-subject correlation and over-dispersion. [27] To address the compositional nature of the data, we used the log geometric mean of the counts as the sample-specific offset in the model after adding a pseudocount of 1, which essentially assumes a linear link between the centered log-ratio transformed taxa abundance and the covariate. The coefficient of the GLMM is interpreted as the log-fold change of the abundance between conditions.
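The centered log-ratio transformation and the log geometric mean offset mentioned above are closely related, as the following Python sketch shows. The pseudocount of 1 is stated in the text for the offset; applying the same pseudocount to the transformation itself is an assumption here, and the function names and toy counts are hypothetical.

```python
import math

def log_geometric_mean(counts, pseudocount=1.0):
    """Log of the geometric mean of the pseudocount-adjusted counts in one sample."""
    logs = [math.log(c + pseudocount) for c in counts]
    return sum(logs) / len(logs)

def clr(counts, pseudocount=1.0):
    """Centered log-ratio: log count minus the sample's log geometric mean."""
    offset = log_geometric_mean(counts, pseudocount)
    return [math.log(c + pseudocount) - offset for c in counts]

# Toy sample (hypothetical counts for four taxa).
sample = [120, 30, 0, 850]
transformed = clr(sample)
offset = log_geometric_mean(sample)
```

Using `offset` as a per-sample offset in a log-link count model means that each coefficient acts on the count relative to the sample's geometric mean, which is the connection to the centered log-ratio scale described in the text.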
To synthesize ICC/SCC estimates and log-fold changes across data sets, we used a meta-analytic random effects model with restricted maximum-likelihood estimation of the variance components ('rma' in R 'metafor' package). Individual estimates and their standard errors were supplied as the input. [28] The between-study variance was quantified using 'tau2' from 'rma' (estimated variance of the random effects), and the within-study variance was quantified based on the standard errors of individual ICC/SCC or log-fold change estimates. [29] Forest plots were used to visualize the results.
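The pooling step can be illustrated with a short Python sketch of a random effects meta-analysis. Note the substitution: for brevity it uses the closed-form DerSimonian-Laird estimator of the between-study variance rather than the restricted maximum-likelihood estimation reported above, so its tau2 will generally differ from that of 'rma'. The function name and the toy inputs are hypothetical.

```python
def random_effects_dl(estimates, std_errors):
    """Pool study estimates with a DerSimonian-Laird random effects model.

    Returns (pooled estimate, standard error of the pooled estimate, tau2).
    """
    variances = [se ** 2 for se in std_errors]
    weights = [1.0 / v for v in variances]
    fixed = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (len(estimates) - 1)) / c)
    # Random effects weights add the between-study variance to each study variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / sum(re_weights)
    pooled_se = (1.0 / sum(re_weights)) ** 0.5
    return pooled, pooled_se, tau2

# Toy inputs (hypothetical): per-study estimates with their standard errors.
homogeneous = random_effects_dl([0.8, 0.8, 0.8], [0.05, 0.05, 0.05])
heterogeneous = random_effects_dl([0.5, 0.9], [0.01, 0.01])
```

When the study estimates agree, tau2 is zero and the result matches a fixed-effect average; when they disagree by more than their standard errors allow, tau2 becomes positive and the pooled standard error widens.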
All statistical analyses were conducted using R (3.1.2).
Microbial variability was primarily explained by interindividual differences, which explained between 61% and 79% of variability in unweighted UniFrac distance, between 68% and 78% in generalized UniFrac distance, between 63% and 78% in weighted UniFrac distance, and between 77% and 86% in Bray-Curtis distance. Collection method and time at ambient temperature explained substantially lower variability (<10% and <5%, respectively; S1 Fig).

Technical reproducibility
The meta-analyzed ICCs for technical reproducibility comparing duplicate aliquots for each collection method frozen at day-0 for ten microbial composition metrics are shown in Fig 1A (see Table A in S1 File for exact ICCs and 95% CIs). For each collection method, reproducibility was excellent for virtually all metrics (all ICCs > 75%), and the variance within and between studies was generally minimal (S2 Fig). The ICCs for technical reproducibility of select genera are shown in Fig 1B. There were no major differences between the collection methods in reproducibility ICCs, and all were excellent, except for Blautia and Faecalibacterium (both

Stability
Comparisons of mean OTU abundance between samples frozen on day-0 and on day-4 (Mayo 1, Mayo 2, and Bangladesh studies) or day-7 (Colorado study) are shown for each collection method in Fig 2. For all collection methods, most mean OTU abundances were similar between day-0 and day-4/7, with two exceptions: mean abundance was less strongly correlated between days of freezing in 70% ethanol, and no-solution samples had approximately 60-fold more abundant Proteobacteria Escherichia on day-4/7 compared to day-0. As shown by the stability ICCs in Fig 3A (see Table B in S1 File for exact ICCs and 95% CIs and S4 Fig for forest plot of ICCs), stability was excellent for all microbiome metrics in samples collected via FOBT/FTA cards and RNAlater (ICCs ≥75%). Samples collected via FIT tubes and 95% ethanol generally had excellent stability for alpha diversity and most beta diversity measures; however, stability of phyla and select genera ranged more widely from fair to excellent (ICCs ranging from 0.50-0.91 for FIT tubes and 0.67-0.89 for 95% ethanol). For no-solution and 70% ethanol samples, the ICCs for all metrics were generally lower than the other collection methods, with fair to good stability for the alpha/beta diversity metrics, but poor stability (all ICCs <0.25) for relative abundance of the four most abundant phyla in 70% ethanol samples and for the Proteobacteria phylum in no-solution samples (ICC = 0). There were similar patterns in stability across collection methods for select genera (Fig 3B; e.g., good to excellent ICCs for FOBT/FTA cards, FIT tubes, 95% ethanol, and RNAlater, and generally poor to fair ICCs for 70% ethanol samples and no-solution samples). As shown in S5 Fig, the ICCs for stability of the centered-log ratio transformed top four phyla and select genera were similar to their square root transformed counterparts. For example, ICCs for no-solution samples and 70% ethanol samples were similarly poor to fair for centered-log ratio transformed abundances.
The log-fold change of selected phyla, classes, and genera prevalent in at least 50% of the population with a mean count of > 10 reads (except for Fusobacteria, which was of biological interest) from day-0 to day-4/7 is shown in Fig 3C (see Table C in S1 File for exact log-fold changes, standard errors, and p-values). For 70% ethanol samples frozen on day-4/7 compared to 70% ethanol samples frozen day-0, 62% of taxa were statistically significantly higher or lower in abundance, whereas for FOBT/FTA cards frozen on day-4/7 vs. frozen day-0, only 7% of taxa were statistically significantly higher or lower. For no-solution samples frozen on day-4/7 vs. frozen day-0, there was more than 30-fold higher Gammaproteobacteria abundance (p = 6.02 × 10^-38). There were no statistically significant differences in Fusobacteria abundance after time spent at ambient temperature for any of the collection methods.

Accuracy
To evaluate the accuracy of the fecal sample collection methods, we compared microbiome metrics in samples frozen on day-0 for each collection method to no-solution samples frozen on day-0 (the putative gold standard). As shown in Fig 4, the mean abundance of OTUs was consistently relatively concordant with the gold standard for all collection methods. We found that for most collection methods, SCCs for concordance with the gold standard were generally good to excellent for alpha and beta diversity estimates (Fig 5A and 5B; see Table D in S1 File for exact SCCs and 95% CIs and S6 Fig for forest plot of SCCs), but that weighted UniFrac SCCs were generally lower (fair to good SCCs). Furthermore, for most collection methods, the relative abundances of dominant phyla were generally poorly to fairly concordant with the gold standard, except for Actinobacteria, which was excellently concordant for all collection methods. As shown in S7 Fig, the SCCs for accuracy of the centered-log ratio transformed top four phyla and select genera were similar to those of their square root transformed counterparts.
The log-fold change of selected phyla, classes, and genera prevalent in at least 50% of the population with a mean count of > 10 reads for each collection method compared to samples with no-solution frozen day-0 (the gold standard) is shown in Fig 5C. Compared to no-solution samples frozen day-0, 31%, 28%, 24%, 24%, and 10% of taxa in 70% ethanol, RNAlater, FIT tubes, 95% ethanol, and FOBT/FTA card samples, respectively, were statistically significantly more or less abundant (see Table E in S1 File for exact log-fold changes, standard errors, and p-values). The only collection method with a statistically significantly lower abundance in Fusobacteria than the gold standard was 95% ethanol, which had 66% lower Fusobacteria abundance.
The relative abundance of genera detected in the fecal samples collected from 132 study participants in the three distinct populations (Mayo, Bangladesh, and Colorado) by collection method on day-0 of freezing is presented in Fig 6. There were marked differences in the distribution of genera between the Bangladesh and US study populations; for example, fecal samples in the Bangladesh study were more abundant in Prevotella (average relative abundance = 38.6%) compared to US populations (average relative abundance = 0.09%). Within each study population, the general distribution of abundance of each genus was relatively inconsistent across sample collection methods.

Discussion
In the largest summary study of bacterial profile reproducibility, stability, and accuracy of six fecal collection methods to date, our results support that, for ten microbiome metrics and select genera, which taken together comprehensively characterize the microbial profile and are measures routinely used in microbiome-exposure/disease analyses: 1) all six of the fecal sample collection methods had excellent technical reproducibility; 2) all collection methods, except for no solution and 70% ethanol, had good to excellent stability; and, 3) compared to the no-solution, immediately-frozen samples, all collection methods generally had fair to excellent accuracy. Below we discuss these findings in context with their implications for the collection of fecal samples to study the microbiome in large-scale studies, and the need for standardization of collection methods across studies to facilitate pooling of microbial data.

Fig 5. Meta-analyzed SCCs and log-fold changes based on random effects model for accuracy of each collection method compared to no-solution samples frozen on day-0 (the gold standard) among 132 participants in five studies (Mayo 1-Knight lab, Mayo 1-Mayo lab, Mayo 2, Bangladesh, and Colorado) with fecal samples collected using six different methods. SCCs are based on ten microbial composition metrics (panel A; square root transformed abundance of four phyla, two alpha diversity metrics [number of observed OTUs and Shannon index] and four beta diversity metrics [unweighted UniFrac, generalized UniFrac, weighted UniFrac, and Bray-Curtis distance]), and square root transformed select bacterial genera (panel B) with prevalence in the population >80% and a mean relative abundance >0.2%. Log-fold changes in relative abundance from day-0 (panel C) are based on select taxa with prevalence > 50% and a mean read count > 10. All error bars represent 95% CIs.
Figure caption (partial): OTU counts were merged across samples by study population and collection method, and relative abundance was calculated for the merged samples. OTUs that could not be assigned to a specific genus were combined into "Unclassified_Genus". The 'Other' group comprised all genera with mean abundance less than 0.5%.

We conducted an extensive literature search and summarized the rationale for use, feasibility considerations, and previous findings related to microbial reproducibility, stability, and accuracy for each collection method in S1 Table. Each collection method has advantages and disadvantages pertaining to the stabilization of DNA, prevention of bacterial growth, and preservation of a microbial profile comparable to the immediately-frozen, no-solution gold standard. As demonstrated by our findings and previous findings, no solution and 70% ethanol are less stable collection methods compared to others when stored at ambient temperature, likely explained by the total lack of (or, in 70% ethanol, the dilution of) DNA-stabilizing and anti-microbial properties. For example, while some studies found microbiome metrics were stable in samples stored in no solution up to three days at ambient temperature, [30,31] others found alterations in relative abundances of major taxa, [5,9,32] lower alpha diversity and lower bacterial counts after 8-24 hours [33,34] or three days, [5] and greater weighted UniFrac, unweighted UniFrac, and Bray-Curtis distances from immediately-frozen samples. [32,35] Furthermore, we observed a drastic growth of Gammaproteobacteria in no-solution samples after 4/7 days at ambient temperature, which may be concerning in gut health analyses because of the association of Gammaproteobacteria with inflammatory bowel disease. [36] However, as described previously, these taxa tend to grow well at room temperature and may result from contaminated storage conditions, but importantly can be filtered using Deblur. [36]
Fecal samples stored in FIT tubes were previously found to be moderately to excellently well-preserved at ambient temperature and comparable to immediately-frozen, no-solution samples. [37,38] However, some studies found differences in relative abundances of phyla and genera, [37] lower Shannon diversity with longer storage time at ambient temperature, [38] and compared to the gold standard, lower numbers of observed OTUs and differences in overall composition based on Bray-Curtis distance. [39] Multiple studies found that microbial profiles of fecal samples from FOBT/FTA cards were moderately to excellently stable at ambient temperature over the course of multiple days [7,[39][40][41][42] and comparable to the gold standard; [43] although, one study found a lower DNA yield among fecal samples from monkeys collected on FTA cards after 8 weeks. [44] Fewer studies investigated the stability/accuracy of human fecal samples stored in 70% and 95% ethanol. A study of tissue samples stored in 95% ethanol and another of gorilla fecal samples stored in 96% ethanol both found a lower DNA yield than fresh samples; [45,46] whereas, another study of monkey fecal samples found that 100% ethanol preserved bacterial composition and diversity well compared to fresh fecal samples. [38,44] Previous findings were mixed for RNAlater stability and accuracy. [5,7,13,40,43,45,[47][48][49][50] For example, Flores et al. [51] found that microbial profiles of fecal samples collected in RNAlater were generally stable up to 7 days at ambient temperature prior to freezing; however, other studies found relatively large bacterial community shifts after just 3 days at ambient temperature [50] or lower alpha diversity and DNA purity after 3 days to a week at ambient temperature. [5,7,49]
Given the gut microbiome's highly plausible role in health, particularly gut health, powerful epidemiologic studies of the microbial profile in relation to health outcomes will likely require that microbial data be pooled from multiple studies to observe differences, particularly at the genus level. The variability we observed in the genus-level relative abundances across collection methods indicates that it is optimal for researchers to coordinate the use of at least one collection method commonly used in other studies; however, we also observed that inter-individual differences explained a much higher percentage of microbial variability than collection method. Pooling microbial data from fecal samples collected via different methods could theoretically be acceptable with careful methodological consideration (e.g., when appropriate, ensuring case and control samples are collected via the same collection methods, adjusting for collection method in multivariable regression models, etc.); however, the power to detect microbiome-disease associations, and their interpretation, may be affected since effect sizes of microbiome-disease and microbiome-collection method associations may be similarly moderate. For example, Shah et al. meta-analyzed microbial data from multiple studies to identify microbial markers associated with colorectal cancer and found that samples clustered primarily by their original studies rather than colorectal cancer case/control status due to differences between studies in sample collection and DNA extraction methods/16S rRNA sequencing region, reducing their ability to detect certain microbiome-colorectal cancer associations. [52]
When conceptualizing the implications of findings from this study, it is important to not only consider the collective findings described above, but also place them in context with cost/feasibility for implementation in large-scale studies and use for other -omics (outlined in S1 Table).
For example, in this study, the fecal samples in RNAlater were collected in 2.5 mL of solution, which some researchers previously indicated may not be adequate volume for microbial stability and accuracy; [53] however, microbial profiles of RNAlater samples were generally excellently stable and accurate, and compared to larger volumes, 2.5 mL is likely more feasible for both cost and storage requirements in large cohort studies. [54] In addition, FOBT/FTA cards are widely used for colorectal cancer screening, are easily transportable and storable, and are less than half of the cost of RNAlater. In terms of use for other -omics, RNAlater cannot be used for metabolomics, [55] but can be used for metagenomic and metatranscriptomic studies as it was previously found to maintain similar metagenomic and metatranscriptomic profiles to gold standard samples, along with 95% ethanol samples. [12,47,49] Both 95% ethanol and FOBT/FTA cards were found to be reproducible, stable, and accurate in metabolomics studies, [43,55] more so than FIT tubes. [55]
This study has several limitations. First, the microbial metrics discussed herein were based on characterizations of 16S rRNA gene sequences of bacteria and use of shotgun metagenomics is becoming more widespread; however, as of now, 16S rRNA sequencing is an affordable, widely-used method to characterize the bacterial microbiome, and studies using other assays may find these results helpful in selecting fecal collection methods. Second, the Mayo and Knight laboratories used different DNA extraction and amplification protocols, which produced variability in microbial metrics; however, we conducted stratified analyses by lab and found no meaningful differences in conclusions for each sample collection method. Third, overall there was large variance between studies, especially for phylum- and genus-level analyses; however, the conclusion for each sampling method remained consistent across study populations.
This analysis also has several important strengths. First, with 132 participants from diverse populations, there was greater statistical power to detect differences between collection methods than in any previous study comparing fecal collection methods. Second, this analysis included a wide variety of collection methods ranging in cost and ease of implementation, and included FIT, which is gaining popularity for colorectal cancer screening. Third, our study population was heterogeneous and included samples gathered in a low-to-middle income country, Bangladesh, where relevant conditions may differ from those present in the US; nevertheless, the results in that population were similar to those in the US populations.

Conclusions
These findings, taken together with previous literature and feasibility considerations, indicate that FOBT/FTA cards, FIT tubes, RNAlater, and 95% ethanol may be appropriate choices for collecting fecal samples for microbiome measurement in future studies, as these options are reproducible, stable, and relatively accurate. As the gut microbiome becomes increasingly recognized for its role in gut and systemic health, it is imperative to characterize microbial profiles using the best available methodologies, and to move toward standardization of fecal sample collection across study populations. Future studies should further investigate the long-term stability of these collection methods (over the course of years, as for samples stored in biobanks), and continue to explore the stability, accuracy, and reproducibility of each fecal collection method for other -omics, such as shotgun metagenomics.

Supporting information
S1 Fig. Averaged distance- (TIF)
S1 File. Table A: Meta-analyzed ICCs (intraclass correlation coefficients) based on a random effects model for technical reproducibility comparing duplicate samples frozen day-0 from fecal samples collected using six different methods among 132 participants in five studies (Mayo 1-Knight lab, Mayo 1-Mayo lab, Mayo 2, Bangladesh, and Colorado);
Table B: Meta-analyzed ICCs (intraclass correlation coefficients) based on random effects models for microbiome stability comparing fecal samples frozen on day-4/7 to those frozen at day-0 for six fecal sample collection methods among 132 participants in five studies (Mayo 1-Knight lab, Mayo 1-Mayo lab, Mayo 2, Bangladesh, and Colorado);
Table C: Meta-analyzed log-fold changes based on random effects models for microbiome stability comparing fecal samples frozen on day-4/7 to those frozen at day-0 for six fecal sample collection methods among 132 participants in five studies (Mayo 1-Knight lab, Mayo 1-Mayo lab, Mayo 2, Bangladesh, and Colorado). Log-fold changes in relative abundance from day-0 are based on select taxa with prevalence > 50% and a mean read count > 10. Abbreviations: FIT, fecal immunochemical tests tubes; FOBT, fecal occult blood test cards; FTA, Flinders Technology Associates cards;
Table D: Meta-analyzed SCCs (Spearman correlation coefficients) based on random effects model for accuracy of each collection method compared to no-solution samples frozen on day-0 (the gold standard) among 132 participants in five studies (Mayo 1-Knight lab, Mayo 1-Mayo lab, Mayo 2, Bangladesh, and Colorado) with fecal samples collected using six different methods;
Table E: Meta-analyzed log-fold changes based on random effects models for accuracy of each collection method compared to no-solution samples frozen on day-0 (the gold standard) among 132 participants in five studies (Mayo 1-Knight lab, Mayo 1-Mayo lab, Mayo 2, Bangladesh, and Colorado). Log-fold changes in relative abundance from day-0 are based on select taxa with prevalence > 50% and a mean read count > 10. Abbreviations: FIT, fecal immunochemical tests tubes; FOBT, fecal occult blood test cards; FTA, Flinders Technology Associates cards. (XLSX)
S1 Table. Rationale for use, feasibility considerations, and previous findings for reproducibility, stability, and accuracy of six fecal sample collection methods for microbiome studies. (DOCX)
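The taxon selection rule stated in the notes to Tables C and E (prevalence > 50% and a mean read count > 10) can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that prevalence means the fraction of samples in which a taxon has at least one read, and that the mean read count is taken over all samples; the function name `select_taxa` and the read counts are hypothetical.

```python
def select_taxa(counts, min_prevalence=0.5, min_mean_count=10):
    """Return taxa passing both filters (strict inequalities).

    counts: dict mapping taxon name -> list of read counts, one per sample.
    Assumption: prevalence = share of samples with a nonzero read count.
    """
    kept = []
    for taxon, reads in counts.items():
        n_samples = len(reads)
        if n_samples == 0:
            continue  # no samples, so prevalence is undefined; skip the taxon
        prevalence = sum(1 for r in reads if r > 0) / n_samples
        mean_count = sum(reads) / n_samples
        if prevalence > min_prevalence and mean_count > min_mean_count:
            kept.append(taxon)
    return kept


# Hypothetical read counts for four samples (illustrative values only).
example_counts = {
    "Bacteroides": [120, 80, 0, 45],  # prevalence 0.75, mean 61.25
    "Prevotella": [400, 0, 0, 0],     # prevalence 0.25, mean 100
    "Akkermansia": [3, 5, 2, 4],      # prevalence 1.00, mean 3.5
    "Blautia": [30, 30, 0, 0],        # prevalence 0.50, mean 15
}

selected = select_taxa(example_counts)
print(selected)
```

In this toy example only Bacteroides passes both thresholds. Blautia is excluded because the rule is a strict inequality and its prevalence is exactly 50%.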