Correcting for batch effects in case-control microbiome studies

High-throughput data generation platforms, like mass-spectrometry, microarrays, and second-generation sequencing are susceptible to batch effects due to run-to-run variation in reagents, equipment, protocols, or personnel. Currently, batch correction methods are not commonly applied to microbiome sequencing datasets. In this paper, we compare different batch-correction methods applied to microbiome case-control studies. We introduce a model-free normalization procedure where features (i.e. bacterial taxa) in case samples are converted to percentiles of the equivalent features in control samples within a study prior to pooling data across studies. We look at how this percentile-normalization method compares to traditional meta-analysis methods for combining independent p-values and to limma and ComBat, widely used batch-correction models developed for RNA microarray data. Overall, we show that percentile-normalization is a simple, non-parametric approach for correcting batch effects and improving sensitivity in case-control meta-analyses.


Introduction
Data generated by high throughput methods like mass-spectrometry, second-generation sequencing, or microarrays are sensitive to experimental and computational processing [1,2]. This sensitivity gives rise to 'batch effects' between independent runs of an experiment. Even when different research groups adhere to the same methodologies, these effects can arise due to slight differences in hardware, reagents, or personnel [3]. Thus, it is inappropriate to make direct, quantitative comparisons of uncorrected data across studies.
Several tools for reducing batch effects in RNA microarray data have been developed. For example, surrogate variable analysis (SVA) estimates a set of inferred variables (eigenvectors) that explain variance associated with putative batch effects [4]. These inferred variables are then incorporated into a linear model to correct downstream significance tests. The limma package employs a similar linear correction to account for batch effects prior to statistical analysis [5]. SVA and limma are part of a family of linear batch-correction methods that use different varieties of factor analysis, singular value decomposition, or regression [4][5][6][7]. The most relied upon method to date [8], called ComBat, uses a Bayesian approach to estimate location and scale parameters for each feature within a batch [9]. All of these models are most effective when batch effects are not conflated with the true biological effects [1]. Furthermore, most batch correction methods make certain parametric assumptions.
Unfortunately, models that often work well for many types of 'omics data may not be appropriate for microbiome datasets. In microbiome studies, batch effects are often diffuse and conflated with biological signals [10][11][12]. The microbiome field has also struggled with finding appropriate parametric models for bacterial abundance distributions and for dealing with zeros. This is especially true for low-biomass samples in microbiome sequencing studies, like samples taken from the built environment [13], where populations are under-sampled, the biological signal is relatively weak, and batch effects can be quite large [14]. One way to get around this issue is to calculate statistics within a given batch, and then compare significant features across batches using classic meta-analysis techniques for combining p-values, like Fisher's and Stouffer's methods [15,16]. These meta-analysis techniques are robust to batch effects across independent studies, but have less statistical power and ability to detect subtle differences than directly pooling data across studies.
Here, we describe a model-free data-normalization procedure for controlling batch effects in case-control microbiome studies that enables pooling data across studies. Case-control studies include a built-in population of control samples (e.g. healthy subjects) that can be used to normalize the case samples (e.g. diseased subjects). For every feature (i.e. bacterial taxon), the case abundance distributions can be converted to percentiles of the equivalent control abundance distributions (Fig 1). Study-specific batch effects present in the case samples will also be present in the control samples, and by converting the case data into percentiles of the control distribution these effects are mitigated. Upon conversion to percentiles of the within-study controls, percentilenormalized samples from multiple studies with similar case-control definitions can be more appropriately pooled for statistical testing (Fig 1). We show that this approach effectively controls batch effects in microbiome case-control studies and we compare this method to pooling ComBat-or limma-corrected data, and to Fisher's and Stouffer's methods for combining independent p-values.

Sequence data processing
To perform OTU-level analyses across the CRC studies, we downloaded the raw data from all of the MicrobiomeHD datasets that sequenced the V4 region of the 16S gene. We quality filtered and length trimmed each V4 dataset as described in Duvallet et al. (2017) and concatenated these raw, trimmed FASTQ files into one file. We removed any unique sequences that did not appear more than 20 times and clustered the remaining reads with USEARCH [37] at 97% similarity. We assigned these OTUs taxonomic identifiers using the RDP classifier [38] with a cutoff of 0.5.
For genus-level analyses, OTU tables and metadata were acquired from the MicrobiomeHD database (https://doi.org/10.5281/zenodo.569601). Raw data were downloaded from the original studies and processed through our in-house 16S-processing pipeline (https://github.com/ thomasgurry/amplicon_sequencing_pipeline) as described previously [17]. Each study's OTU table was converted to relative abundance by dividing each sample by its total number of reads and collapsed to genus level by summing all OTUs with the same genus-level annotation.

Percentile normalization
In this procedure, control feature distributions are percentile-normalized against themselves (resulting in a uniform distribution between 0 and 100) and case feature distributions are Percentile-normalization procedure converts case and control values into percentiles of the control distribution, which allows for pooling of normalized data across studies. Conceptual plot shows theoretical feature (OTU 1) abundance distributions for control samples and case samples from two independent studies. Converting a control distribution into percentiles of itself naturally gives rise to a uniform distribution (represented by flat blue distributions in central panels), while converting the case distribution into percentiles of the control distribution produces a non-uniform distribution when these two distributions differ (represented by skewed orange distributions in central panels). The right-most panel shows the result of pooling percentile distributions from study 1 and study 2. Percentilenormalization places data from separate studies onto a standardized axis that allows for cross-study comparison. Each simulated case and control distribution was produced by randomly sampling 100 times from a lognormal distribution. Study 1 control parameters: μ = 0.1 and σ = 0.7. Study 1 case parameters: μ = 0.8 and σ = 0.5. Study 2 control parameters: μ = 1.5 and σ = 0. converted into percentiles of their equivalent control features. Treating our controls as nullhypotheses is motivated by the idea that healthy patients should be treated as similar across datasets, even though we understand that they will differ due to biological as well as technical batch effects. Relative abundance distributions were converted to percentiles using the SciPy v 0.19.0 [40] stats.percentileofscore method (kind = 'mean'). In order to avoid rank pile-ups due to the presence of many zeros, we replaced zeros with pseudo relative abundances drawn from a uniform distribution between 0.0 and 10 −9 (i.e. a set of random values smaller than the lowest possible relative abundance in any dataset). Due to the zero-replacement step, p-values can shift slightly upon re-analysis with a different random draw, which can lead to the loss or gain of significance for features very near the significance threshold. Within each study, control distributions for each individual OTU or genus were converted into percentiles of themselves and case distributions were converted into percentiles of their corresponding control distribution. We have written a python script that performs percentile-normalization given an OTU table, a list of case sample IDs, and a list of control sample IDs as inputs (https://github.com/ seangibbons/percentile_normalization). A QIIME 2 (https://qiime2.org) plugin for running percentile-normalization is also available (https://github.com/cduvallet/q2-perc-norm).

ComBat
For each disease, we applied ComBat [8] to the case-control datasets analyzed in this study. Relative abundances (OTUs in the CRC analysis or OTUs collapsed to the genus level in the genus-level analysis) were log-transformed prior to running ComBat (default settings), adding a pseudo relative abundance of half the minimal frequency (across the entire feature table) to replace zeros. ComBat-corrected data were then transformed back from log-space (i.e. exponential transformation) prior to downstream analyses.

limma
In addition to ComBat, we applied a linear batch correction method from the limma package in R [5]. Relative abundances (zeros replaced with pseudo relative abundances equal to half the minimal frequency across the entire feature table) were log-transformed as described above and then a linear model was fit to subtract batch effects using the removeBatchEffect function (default settings). The limma-corrected data were then transformed back from log-space (i.e. exponential transformation) prior to downstream analysis.

Statistical analysis
To calculate statistical significance, we restricted our statistical tests to OTUs/genera that occurred in at least one third of control or one third of case samples in order to reduce our multi-test correction penalty. We used the Wilcoxon rank-sum test, as implemented in SciPy v0. 19.0 (sicipy.stats.ranksums) [40], to determine significant differences between independent groups of samples. Wilcoxon tests were run either within or across studies. In order to calculate statistics across studies, normalized case and control samples from multiple studies of the same disease were combined together into the same OTU table. Hereafter, combining datasets is referred to as 'pooling.' P-values were multiple-test corrected using the Benjamini-Hochberg False Discovery Rate (FDR) procedure, as implemented in StatsModels v 0.8.0 (statsmodels. sandbox.stats.multicomp.multipletests) [41]. Differences in overall community structure were assessed using the Permutational Multivariate Analysis of Variance (PERMANOVA) test in R's vegan package [42] as implemented in scikit-bio (skbio.stats.distance.permanova). Fisher's and Stouffer's methods for combining p-values were performed using SciPy v0. 19.0 (scipy. stats.combine_pvalues; method = 'fisher' or method = 'stouffer'). For Stouffer's method, weights for each study were calculated as the square root of the number of cases plus the number of controls. OTUs/genera with significant responses in opposing directions across studies were excluded from Fisher and Stouffer analyses.

in silico experiments
We ran an in silico titration experiment using the OTU-level data to simulate pooling of control samples from different datasets before calculating significant differences. Healthy samples from one study were mixed with healthy samples from another study at different proportions prior to calculating significant differences in OTU frequencies between cases and controls. Case and control groups were subsampled to 40 samples each. Control samples were substituted by randomly selected samples from another study along a fractional gradient (0-100% control samples from another study). We calculated significant differences between case and control groups using the Wilcoxon rank-sum test and applied an FDR correction. OTUs with q-values 0.05 were considered significant. The titration experiment was rerun 20 times, and the results were averaged.
Similar to the titration experiment, we ran an OTU-level analysis of how batch-correction methods might impact false-positive rates by randomly selecting 40 control samples from the (2014) study as artificial 'cases' (across 20 iterations) for each data type (i.e. raw, percentilenormalized, limma-corrected, and ComBat-corrected). We then calculated significant differences between these artificial 'case' and 'control' groups as outlined above to generate p-values for each OTU.

Batch effects at OTU-level resolution
To minimize possible biases across data sets, we identified three colorectal cancer (CRC) studies that sequenced the same region of the 16S gene (V4). We reprocessed the raw sequence data from each study in the same quality filtering and OTU picking pipeline to obtain bioinformatically-standardized results. OTUs that occurred in at least one third of case or one third of control samples (i.e. within individual studies) were retained for all downstream statistical analyses. Despite standardizing the bioinformatic processing of these data, we saw significant batch effects in healthy patients across studies (PERMANOVA p < 0.001; Fig 2). The similarity between samples from the Baxter et al. (2016) and Zackular et al. (2014) studies is due to the fact that they were sourced from the same patient cohort (although samples were processed separately), making this comparison a good pseudo-negative control for batch effects [18,20]. There was an apparent reduction in the batch effect after applying ComBat, although differences between batches remained weakly significant (PERMANOVA p = 0.008, Fig 2) [8]. Due to the non-independence between the Baxter and Zackular patient cohorts, we removed the smaller of the two studies (Zackular) from all downstream analyses. Out of a total of 1,021 OTUs that passed our abundance filter, 681 differed significantly in uncorrected relative abundance between the Baxter and Zeller healthy controls (FDR q 0.05).
We ran an in silico titration experiment to simulate pooling of control samples from different datasets before calculating significant differences. Healthy samples from one study were mixed with healthy samples from another study at different proportions prior to calculating significant differences in OTU frequencies between cases and controls (see conceptual outline in Fig 3). For non-normalized data, the number of significant OTUs greatly increased due to batch effects as more control samples were substituted in from another study. This result highlights the danger of pooling raw data across batches. ComBat-and limma-corrected data performed better than uncorrected data, but still showed many spurious results as the proportion of control samples from another study increased (Fig 3). Percentile-normalization showed

Fig 2. Batch effects between healthy controls from different studies can be reduced by ComBat and percentile-normalization.
Non-metric multidimensional scaling (NMDS) plot showing the distribution of healthy controls from three colorectal cancer studies in ordination space (Bray-Curtis distances of relative abundance OTU-level data). Despite standardized bioinformatic processing, healthy patients differed significantly in their gut microbiomes across studies (PERMANOVA p < 0.001; batch accounts for 6.342% of the total variance). Studies were still significantly different after applying ComBat, an established batchcorrection method (PERMANOVA p < 0.01). However, percentile-normalization did a better job of stabilizing the variance across studies and removed any apparent batch effect (PERMANOVA p > 0.5).
https://doi.org/10.1371/journal.pcbi.1006102.g002 The control group from one study is gradually substituted with randomly chosen control samples from another study (non-normalized, percentile-normalized, limma-corrected, and ComBat-corrected), keeping the total number of case and control samples fixed at n = 40 (see conceptual illustration on the left). Mixing in non-normalized control samples from another study gave rise to spurious results due to batch effects (blue lines). ComBat-and limma-corrected data showed fewer spurious associations (green and red lines). Percentilenormalization showed no increase in spurious results along the titration gradient (orange lines).
https://doi.org/10.1371/journal.pcbi.1006102.g003 no increase in spurious results over the titration gradient (Fig 3). Although we do see several significant CRC-associated OTUs in the full dataset (see below), these were not detected in the titration experiment due to the loss of statistical power when reducing case and control groups to only 40 samples each.
We ran a second in silico experiment to determine whether the false-positive rate was impacted by our different batch-correction methods. We randomly selected 40 samples from the Baxter healthy controls as artificial 'controls' and 40 samples from the Zeller healthy controls as artificial 'cases' for each data type (i.e. non-normalized, percentile-normalized, limmacorrected, and ComBat-corrected) and calculated significant OTU-level differences between these groups. We repeated this process twenty times to generate a set of p-value distributions. We found that the fraction of p-values 0.05 can be as high as~70% for the non-normalized data (Fig 4). This result matches with our finding that 681 out of the 1,021 OTUs in this dataset differed significantly across Baxter and Zeller controls (q 0.05). Each normalization technique drastically reduced the number of false positives, but percentile-normalization gave the best results (Fig 4). When low-abundance OTUs were included in the analysis, ComBat and limma showed highly skewed p-value distributions, giving rise to a larger number of false positives than the non-normalized data (S1 Fig). We next assessed the performance each cross-study analysis method by comparing OTUlevel results across two independent CRC datasets. In the Baxter   Butyricimonas, Fusobacterium, Closridium XIVa, Streptococcus, Parabacteroides, Alistipes, Anaerostipes, Parvimonas, Peptostreptococcus, Blautia, Dialister, and Bacteroides genera) that differed significantly across cases and controls (FDR q 0.05).
In the absence of batch effects, pooling data across datasets of the same disease should increase sensitivity to detect significant cross-study associations. We pooled percentile-normalized, limma-corrected, and ComBat-corrected data, respectively, across Baxter and Zeller studies to look for OTUs that differed significantly across cases and controls. These pooled results were then compared to classic methods for combining p-values from each dataset's individual results (above). For the percentile-normalized data, we found 39 OTUs that differed significantly between cases and controls (FDR q 0.05), 21 of which overlapped with the within-study results. The pooled limma-corrected and ComBat-corrected data resulted in 37 and 36 significant OTUs, respectively. 35 of the OTUs identified as significant by ComBat were also significant in the limma results. 30 of the limma results and 29 of the ComBat results were also significant in the percentile-normalization results, respectively. Fisher's method identified seven significant OTUs from Clostridium XIVa, Streptococcus, Fusobacterium, Parvimonas, Peptostreptococcus, and Anaerostipes genera, which were also found in the percentilenormalized results. Stouffer's method identified the same seven OTUs found using Fisher's method. Overall, the pooling methods improve statistical power to detect significant OTUs over traditional meta-analysis methods. For example, particular OTUs from Desulfovibrio and Parabacteroides genera were identified as significantly enriched in CRC patients in the pooled results (ComBat, limma, and percentile-normalized), but not in the within-study results or in the Fisher and Stouffer results. Pooled analysis of percentile-normalized data also identified on Enterobacter OTU enriched in cancer patients, two OTUs from the Lachnospiraceae family that were enriched in controls and one Lachnospiraceae that was enriched in cases, which were missed by the within-study analyses. In all, 18 OTUs were identified in the pooled, percentilenormalized results that were missed by the within-study analyses (Fig 5). These additional taxonomic associations (e.g. Desulfovibrio, Costridium XIVa, and Lachnospiraceae) are consistent with prior meta-analyses of CRC microbiome studies [17,43]. It is important to visualize the data being fed into statistical tests to determine whether significant associations are being driven by outlier studies or by other artifacts. The associations identified in Fig 5 appear to be biologically meaningful due to the overall consistency of the effect directions across studies.

Batch effects at genus-level resolution across multiple diseases
In order to assess the performance of different meta-analysis techniques across a larger set of studies and diseases, we summarized OTU abundances at the genus level for five diseases across 18 studies-Clostridium difficile induced diarrhea (CDI), Crohn's disease (CD), ulcerative colitis (UC), obesity (OB), and CRC. There were a total of 306 unique genera detected across studies. There were two CDI case-control studies: Schubert [21].
Most genera that differed significantly within a given study were not significant in other studies of the same disease. For example, of the 36 unique genera that showed significant differences within any given OB study, none were found to be significant in all studies (I = 0; Table 1). Indeed, there were no genera that were significant across all studies in the majority of diseases studied (I = 0; Table 1). CDI only had two studies, and of the 38 significant results, only six were shared across both datasets. Overall, few genera were significant within two or more studies (2N 6; Table 1).
The number of genera that differed significantly across pooled cases and controls changed depending on how the data were batch-corrected (Table 1 and S1 Table). For every disease, percentile-normalization yielded the largest number of significant genera when compared to other methods. Overall, ComBat-and limma-corrections resulted in many fewer significant genera, especially for CD, UC, and CRC (Table 1 and S1 Table). Half of the IBD (CD and UC) studies included non-IBD patients with inflammatory symptoms as controls rather than clinically healthy patients. These biologically relevant differences in inflammatory symptoms between control cohorts were conflated with batches and were likely smoothed out by ComBat and limma corrections. Fisher's and Stouffer's methods consistently identified fewer Pooling data provides greater statistical power to detect subtle, yet consistent differences in OTU abundances across sample groups. 18 OTUs are labeled by their most resolved taxonomic annotation. Each OTU in this plot was not found to be significant within either Baxter or Zeller studies, but became significant after pooling the percentile-normalized datasets (q 0.05).
https://doi.org/10.1371/journal.pcbi.1006102.g005 significant associations than percentile-normalization (Table 1 and S1 Table). Pooling data prior to running a statistical test is a more sensitive technique than combining independently calculated p-values [46]. Thus, percentile-normalization increases the statistical power to detect differences across studies while controlling for false positives and batch effects.
To better assess how percentile normalization impacted the pooled results, we looked at genera that were significant within a single-study but not across studies after pooling. There were 12 genera that were significant within a subset of CRC studies, but not after pooling ( Fig  6). Gemmiger, Bacteroides, and Roseburia showed variable responses across studies, sometimes enriched in controls and other times enriched in cases. The remaining genera showed weak  The 'pooled' column shows the total number of genera that were found to be significantly different across pooled studies for a given disease. The 'within' column shows the total number of unique (non-redundant) genera that were identified as significantly different within each study (U = union), the number of genera that were significant in all individual studies (I = intersection), and the number of significant genera that were consistently significant in at least two studies (2N; 2N = = I for CDI). https://doi.org/10.1371/journal.pcbi.1006102.t001 Batch effects in microbiome studies associations within one or two studies, but did not differ significantly across studies (i.e. q 0.05). These genera that show weak or inconsistent responses across batches may not be reliable disease biomarkers. However, by including larger numbers of CRC studies in future meta-analyses it is likely that some of these genera could pass the significance threshold. Two genera-Lactobacillus and Desulfovibrio-were not significantly different between cases and controls within an individual study, but became significant after pooling (Fig 7). These genera showed weak, but largely consistent enrichment in cancer patients and demonstrate the utility of pooling datasets to detect subtle differences.
While prior work has suggested that there may not be consistent associations between the gut microbiome and obesity [35], we observed six genera in a recent meta-analysis that differed significantly across two or more (out of five) independent obesity studies [17]. Of these six genera, four (Roseburia, Clostridium IV, Oscillobacter, and Pseudoflavonifractor) were also found to be significant in the pooled, percentile-normalized results (S1 Table; S2 Fig). The two remaining genera not found to be significant in the percentile-normalized analysis (Mogibacterium and Anaerovorax) showed highly irregular responses across the 11 obesity studies analyzed in this study (S3 Fig). Despite the irregular behavior of these genera, Fisher's and Stouffer's methods both identified Mogibacterium as significantly associated with obesity (S1 Table).

Discussion
Batch effects are unavoidable when working with high-throughput data generation platforms. The RNA microarray community has been proactive in the development of tools for dealing with these effects [1,8]. However, these tools are not as effective when batch effects are confounded with biological signals or when parametric assumptions do not apply, which is often the case in microbiome case-control studies. Therefore, model-free methods are needed for correcting batch effects across microbiome datasets. Fortunately, case-control studies can be internally normalized by their own control samples. Any study-specific batch effects in the case samples will be present in the control samples, and by converting the case data into percentiles of the control distribution these effects are attenuated without making parametric assumptions.
Relative abundance, limma-corrected, and ComBat-corrected data-but not percentilenormalized data-quickly yielded a large number of spurious results when cases from one study were tested against controls from another (Fig 3). Additionally, when control populations from different batches were compared to one another, non-normalized data yielded a much larger number of false positives than batch-corrected data (Fig 4). Our percentile-normalization approach was much more effective than limma and ComBat in controlling false positives (Fig 4), especially in the presence of low-abundance taxa (S1 Fig). Because pooling datasets increases statistical power, it is tempting to pool these data even in the absence of suitable batch-correction methods. Consequently, pooling non-normalized data from different batches has been common practice in the microbiome field [27,29,35,[47][48][49][50][51][52][53][54][55]. In this paper, we demonstrate why this practice is highly inadvisable. Pooling batch-corrected data from multiple studies allowed us to detect significant differences that were not found within a given study (Figs 5 and 7), while removing associations that were weak or inconsistent across studies (Figs 6 and S3). Percentile-normalized results often identified significant differences between cases and controls that were missed by other normalization methods (Table 1). For CDI, percentile-normalized results identified about the same number of significant hits as the other batch-correction methods (Table 1), which was likely due to the fact that the biological signal associated with diarrhea is very strong [17]. In cases where the biological signal is strong, results should be robust to the types of analyses employed. For UC and CD studies (IBD), percentile-normalization identified several significant genera that limma and ComBat did not ( Table 1). The reduced number of significant hits from limma-and ComBat-corrected data for IBD was likely due to heterogeneous control cohorts across these studies (i.e. healthy patients vs. non-IBD patients), which likely smoothed-out inflammation-associated signals. This result highlights the importance of having consistent definitions for case and control cohorts across studies.
We compared percentile-normalization and pooling to Fisher's and Stouffer's methods for combining independent p-values. Stouffer's method is similar to Fisher's, but includes weights for each p-value based on the number of samples in a study. Percentile-normalization Genera that show a significant difference between CRC cases and controls within a given study, but not after pooling. 12 genera showed significant differences between cases and controls within a study (q 0.05), but not after pooling across CRC studies.
https://doi.org/10.1371/journal.pcbi.1006102.g006 Genera that do not show a significant difference between CRC cases and controls within a given study, but do after pooling. Two genera did not show significant differences between cases and controls within a study, but became significant after pooling across CRC studies (q 0.05).
https://doi.org/10.1371/journal.pcbi.1006102.g007 consistently identified a larger number of significant hits than Stouffer's and Fisher's methods, confirming that pooling data increases sensitivity (i.e. reducing putative false negatives). Methods for combining p-values from independent studies are quite robust and should probably be considered as a safe alternative to pooling (i.e. lower chance of false positives). However, in the case of our obesity analysis, Fisher's and Stouffer's methods identified Mogibacterium as significant despite its apparent inconsistency across studies (S3 Fig). In conclusion, we present a robust, model-free procedure for transforming each feature in a microbiome case-control dataset into percentiles of its control distribution (Fig 1). The main conditions for applying this method are that 1) each batch must have a sizeable number of control samples (i.e. the density of the control distribution limits the resolution of the percentiletransformation of the case samples), and 2) case and control populations should be consistently defined across batches (i.e. same definition of 'healthy' or 'diseased' groups). Given these caveats, percentile-normalized features can be pooled across studies for univariate statistical testing (whichever test a researcher prefers-ideally non-parametric), alleviating the batch effect problem. This model-free procedure could also be applied to other types of 'omics datasets with consistently defined internal controls. We find that this procedure allows us to identify differences between cases and controls that are often missed by more conservative metaanalysis techniques. Methods developed for batch-correction in microarray data, like limma and ComBat, can partially reduce batch effects in microbiome studies (Figs 2-4), but appear to obscure real patterns if batch effects are not independent of biological signals or if the parametric assumptions of these models are not valid. We suggest that methods like limma and Com-Bat are useful for studies lacking case and control groups. However, when studies have consistently defined internal controls, percentile-normalization should be the preferred batch correction approach. Future work should focus on developing parametric models specifically for batch correction in microbiome datasets, which could further improve sensitivity to detect subtle biological differences across studies.
Supporting information S1 Table. Excel file containing tabs for each disease analyzed in this study: each tab contains information on significant genera from Table 1