Skip to main content
Advertisement
  • Loading metrics

RSim: A reference-based normalization method via rank similarity

Abstract

Microbiome sequencing data normalization is crucial for eliminating technical bias and ensuring accurate downstream analysis. However, this process can be challenging due to the high frequency of zero counts in microbiome data. We propose a novel reference-based normalization method called normalization via rank similarity (RSim) that corrects sample-specific biases, even in the presence of many zero counts. Unlike other normalization methods, RSim does not require additional assumptions or treatments for the high prevalence of zero counts. This makes it robust and minimizes potential bias resulting from procedures that address zero counts, such as pseudo-counts. Our numerical experiments demonstrate that RSim reduces false discoveries, improves detection power, and reveals true biological signals in downstream tasks such as PCoA plotting, association analysis, and differential abundance analysis.

Author summary

Technical factors, such as sequencing depth, can introduce biases in analyzing and interpreting microbiome sequencing data. Data normalization is critical to mitigate these biases and enable reliable downstream analysis. However, numerous zero counts in the data pose significant challenges in developing effective normalization methods. Inspired by the experiments with spike-in bacteria, we propose a novel computational framework to address the bias resulting from technical factors. Our method identifies a set of non-differentially abundant taxa based on pairwise rank similarity of observed counts and subsequently scales the counts to ensure the same coverage within this set. By adopting this innovative approach, our method robustly handles zero counts and avoids introducing new biases associated with zero handling. Comprehensive evaluations demonstrate that our method helps downstream analysis effectively detect subtle yet biologically important signals and leads to a reduction in false discoveries. The new method offers promising prospects for advancing our understanding of microbial communities.

Introduction

High-throughput sequencing technology has revolutionized the study of microbiome communities, providing biologists with a powerful tool to investigate and understand biological events and mechanisms. However, analyzing and interpreting the data generated by high-throughput sequencing can be challenging due to technical factors that can confound results [13]. One major limitation of high-throughput sequencing is that the observed sequencing count data can only reflect the relative abundance of taxa rather than their absolute abundance, as the observed sequencing depth can vary significantly across samples and is unrelated to the absolute abundance [46]. In mathematical terms, the observed sequencing count data can be expressed as: where Ni,j and Ai,j are the observed count and absolute abundance of taxon j in the ith sample, and ci is the unobserved sampling fraction of the ith sample. The unobserved sampling fraction is typically sample-specific and can vary due to technical factors such as sequencing depth and capture efficiency. Because of this unobserved sampling fraction, applying classical statistical methods to the observed count data can result in false-positive scientific discoveries and invalid analysis results [2, 7]. This paper refers to the bias resulting from the unobserved sampling fraction as compositional bias.

Normalizing the observed sequencing count data is a critical step in removing compositional bias and ensuring accurate and reliable downstream analysis. To this end, many normalization methods have been proposed for different types of sequencing data sets [4, 816]. These methods can be broadly classified into three computational frameworks: rarefying, scaling, and log-ratio based methods [2, 3]. The rarefying method subsamples the taxa of each sample to ensure that all samples have the same sequencing depth. Although this method is popular in practice, it may lead to a loss of statistical power in downstream analysis and does not correct compositional bias [17]. Besides rarefying method, the scaling method is another widely used normalization strategy that estimates the unobserved sampling fraction and scales the observed count by this estimated sampling fraction. Scaling methods include Cumulative-Sum Scaling (CSS) [12], Median (MED) [18], Upper Quartile (UQ) [10], Trimmed Mean of M-values (TMM) [4], Geometric Mean of Pairwise Ratios (GMPR) [19] and Total-Sum Scaling (TSS) normalization. However, accurately estimating the sampling fraction can be challenging when prevalent zero counts exist in the microbiome data [2]. Finally, log-ratio based methods, which are motivated by classical compositional data analysis [20, 21], have been proposed. Although log-ratio transformation can alleviate the compositional effect, it is still unclear how to apply log-ratio transformation when zero counts are present and how to interpret the results [2224]. These challenges lead us to question whether a new normalization method can be developed that is both robust to the prevalent zero counts and corrects the compositional bias.

Here, we introduce a novel normalization method, which we call RSim (normalization via Rank Similarity), to correct the compositional bias in the sequencing data set. The RSim normalization is a scaling method motivated by the normalization method in the experiment with spike-in bacteria. Instead of estimating sampling fraction directly, RSim first identifies a set of non-differential abundant taxa via the pairwise rank similarity of taxa and then scales the counts to ensure that the total sum of coverage in this estimated set is the same across samples. To accurately identify non-differential abundant taxa, RSim employs a new empirical Bayes approach to control the misclassification rate. Unlike existing methods, RSim does not need any assumption or extra treatment to the prevalent zero counts because the Spearman’s rank correlation coefficient used for measuring rank similarity is robust to zero counts. Besides being robust to zero, RSim outperforms existing methods in estimating the sampling fraction and correcting the compositional bias. We demonstrate the efficacy of RSim by comparing it with several state-of-the-art methods using synthetic and real data sets. Our results show that RSim can help reduce false discoveries, improve detection power, and reveal true biological signals in various downstream analyses, such as PCoA plotting, association analysis, and differential abundance analysis. RSim normalization is implemented in an R package, freely available at https://github.com/BoYuan07/RSimNorm.

Results

Overview of RSim normalization

We present a concise summary of the RSim normalization method, with a more detailed explanation provided in the Method section. While we focus on the microbiome data in this paper, it is worth noting that RSim may be applicable to other sequencing data, such as bulk RNA-seq and single-cell RNA-seq [1]. The RSim method is inspired by the normalization approach used in experiments with spike-in bacteria [2528]. When spike-in bacteria are available, the count of each taxon is rescaled by the reciprocal of the count of the spike-in taxa, as follows:

Here, Ni,j represents the observed count of taxon j in the ith sample, and is the set of spike-in taxa. We refer to this method as reference-based normalization since it treats as a reference set. The reference-based method can correct compositional bias when spike-in bacteria are available [25]. The efficacy of the reference-based method is contingent on the assumption that the absolute abundance of spike-in taxa is identical across samples, as expressed by the equation:

Here, Ai,j represents the absolute abundance of taxon j in the ith sample. Given the success of the spike-in based normalization, one might wonder whether it is feasible to identify a reference set of taxa that satisfies the above equality and use it to correct compositional bias without employing spike-in bacteria. Our paper shows that this is possible when we can identify a set of non-differential abundant taxa, denoted by , and replace the spike-in taxa with these estimated non-differential abundant taxa.

The RSim normalization method is primarily aimed at identifying a set of non-differential abundant taxa in microbiome data, even in the presence of zero counts. This identification process has two steps: first, we construct statistics for the differential abundance level of each taxon by using pairwise rank similarity of taxa; second, we use a new empirical Bayes method to identify non-differential abundant taxa based on the statistics obtained in the first step (see Fig 1). To explain the intuition behind the first step, we note that the count of a non-differential abundant taxon is approximately proportional to the unknown sampling fraction, whereas the count of a differential abundant taxon lacks this correspondence to the sampling fraction. Thus, the rank correlation between two non-differential abundant taxa should be higher than between a non-differential abundant taxon and a differential abundant taxon. Assuming that the majority of taxa are non-differential abundant, we use the median of rank correlation coefficients between a taxon and other taxa as the statistics for the level of differential abundance. In the second step, we use an empirical Bayes method to identify non-differential abundant taxa based on the statistics obtained in the first step. The new empirical Bayes method allows choosing a threshold to control the misclassification error. Since most identified taxa are non-differential abundant, they can serve as the reference set for the reference-based normalization in RSim. Notably, the RSim procedure treats zero entries the same way as non-zero entries, allowing it to work consistently with zero counts. In the following sections, we demonstrate the effectiveness of RSim normalization in correcting compositional bias, even in the presence of many zeros in the data set.

thumbnail
Fig 1. Illustration demonstrating the procedure of RSim normalization.

Step 1: median of pairwise rank similarity of taxa is evaluated to construct the statistics for the differential abundance level of each taxon. Step 2: a new empirical Bayes method provides misclassification rate control in identifying non-differential abundant taxa. Estimated non-differential abundant taxa are used as the reference set in reference-based normalization.

https://doi.org/10.1371/journal.pcbi.1011447.g001

Correcting compositional bias via RSim normalization

This section presents a series of numerical experiments to assess the ability of RSim normalization to correct compositional bias. We generate synthetic data using a microbiome dataset collected in [29], where 97% of entries are zeros. We first investigate whether the taxa in the estimated reference set of RSim normalization are mostly non-differentially abundant. Specifically, we design several numerical experiments to determine if the empirical Bayes method in RSim can control the misclassification rate in the estimated reference set at a desired level. S1 Fig shows that RSim successfully controls the empirical misclassification rate at the target level for different levels of misclassification rate. We further evaluate the robustness of the estimated reference set by varying the signal strength of differential abundant taxa, the balance of group size in differential abundant taxa, the proportion of differential abundant taxa, and sample size change (S3 Fig). Our experiments demonstrate that RSim can robustly identify a reference set that consists mostly of non-differential abundant taxa.

The next set of numerical experiments aims to investigate whether RSim normalization can recover the sampling fraction of each sample via reference-based normalization. To this end, we compare RSim normalization with six state-of-art normalization methods, including TSS, UQ implemented in edgeR, CSS implemented in metagenomeSeq, MED implemented in DESeq2, TMM implemented in edgeR, and GMPR implemented in GMPR. We also include an oracle reference-based normalization where the reference set consists of true non-differential abundant taxa. Using synthetic data generated from a microbiome dataset collected in [29], we randomly divided the samples into two groups and inserted signals to the differential abundant taxa of one group. We then estimated the sampling fraction of each sample using the seven normalization methods and compared their performance. Fig 2 presents the results of these experiments. We found that when the signal strength of differential abundant taxa is weak, most normalization methods can recover the sampling fraction well and do not exhibit significant bias in their estimates. However, in the presence of strong differential abundant taxa, existing normalization methods suffer from a systematic bias in sampling fraction estimation, while reference-based normalization is robust to this bias. Notably, RSim normalization performs similarly to the oracle method, indicating that the reference set selected by RSim contains mostly non-differential abundant taxa and can effectively normalize the data. Overall, our numerical experiments demonstrate that RSim normalization corrects the sample-specific bias resulting from technical variations in the sequencing process.

thumbnail
Fig 2. Comparisons of normalization methods in estimating sampling fraction.

The numerical experiments are performed when the signal strength of differential abundant taxa is (a) weak, (b) moderate, and (c) strong. In (a), (b), and (c), the x-axis represents true sampling fractions, while the y-axis represents the estimated sampling fraction from normalization methods. We scale the estimated sampling fractions so that their average is the same as the average of true sampling fractions. The black line in these figures represents equality between the estimated and true sampling fractions and the color of points represent which group the differential abundant taxa belong to. The bias in sampling fraction estimation by different normalization methods is compared in (d) when the signal strength and proportion (p = 0.1, 0.2, 0.3) of differential abundant taxa vary. It is clear that the reference-based method can better correct the compositional bias than existing methods, especially when there is a large proportion of strong differential abundant taxa.

https://doi.org/10.1371/journal.pcbi.1011447.g002

We also compare these computation-based normalization methods with the spike-in-based normalization method when spike-in bacteria are available. Specifically, we consider a dataset collected in [25], where a fixed quantity of Salinibacter ruber is introduced into gut microbiome samples to calibrate unknown sampling fractions. Since spike-in bacteria are available, we treat the results of spike-in-based normalization as the ground truth and compare the seven computation-based normalization methods. The discrepancies in the estimated sampling fractions are reported in S4 Fig. We observed that the sampling fractions estimated by RSim are more similar to the spike-in-based normalization method than the other six methods, confirming again that RSim normalization can better correct the sample-specific bias. In the next three sections, we investigate how RSim normalization improves the performance of commonly used downstream analyses, including PCoA plotting, association analysis, and differential abundance analysis.

RSim normalization reveals biological pattern in PCoA plot

This section aims to investigate the effects of different normalization methods on the PCoA plots. Specifically, we compare the PCoA plots on the normalized data after applying normalization methods, namely TSS, UQ, CSS, MED, TMM, GMPR, rarefaction and RSim, to a microbiome data set collected in [29]. The compositional bias can create false clusters or patterns in the PCoA plot if the count data is not appropriately normalized. We randomly split the samples into two groups, and no cluster structure is observed in the PCoA plots, regardless of the normalization method used (Fig 3a). However, when we rarefy the count data of one group of samples through subsampling, the difference in the sequencing depth results in two clusters in some of the PCoA plots (Fig 3b). In particular, RSim, TMM, GMPR, rarefaction and TSS can remove such false clusters through normalization, while CSS, MED, and UQ cannot. We also conduct a similar numerical experiment on another dataset collected in [30]. The samples in the KarenThai category are divided into two groups based on the sequencing depth (>10000 belongs to the first group, and <5000 belongs to the second group). The two clusters separated by sequencing depth are present in the PCoA plots of all normalization methods except RSim normalization (Fig 3c). Through these two examples, we conclude that RSim normalization is more effective in mitigating the issue of false clusters or patterns in PCoA plots than existing normalization methods.

thumbnail
Fig 3. Compositional bias can create false clusters in PCoA plots.

In (a) and (b), samples are randomly divided into two groups. No modification is applied to (a), while the count data in group 1 is rarefied in (b). In (c), samples are divided into two groups based on the sequencing depth (>10000 belongs to the first group, and <5000 belongs to the second group). In these figures, RSim normalization can help remove the false clusters resulting from compositional bias. Euclidean distance with log transformation is used in all PCoA plots.

https://doi.org/10.1371/journal.pcbi.1011447.g003

The presence of false patterns resulting from compositional bias can lead to erroneous interpretations of data, highlighting the importance of proper normalization. S6a Fig shows the PCoA plots of right palm samples in a data set collected in [31], colored by days since the experiment started. The PCoA plot exhibits a clear time-related pattern in the raw data, implying a possible shift in microbial abundance during the 15-month study period. Similar patterns are also observed in the PCoA plots after applying all normalization methods except RSim. However, further examination reveals a strong correlation between time and sequencing depth (S6c Fig), and a similar pattern is also present in the PCoA plots colored by sequencing depth (S6b Fig). This suggests that sequencing depth is a confounding factor and is likely responsible for the observed time-related pattern. RSim normalization can effectively remove this false pattern, and as a result, the pattern of time and sequencing depth in the PCoA plots is no longer apparent, demonstrating the effectiveness of RSim normalization in removing confounding effects.

Appropriate normalization not only helps avoid false clusters but also aids in detecting true biological patterns resulting from microbial abundance shifts. Following a similar approach as in the previous section, we generated data with differentially abundant taxa from a data set in [29]. The absolute abundance of these taxa depends on a latent variable characterizing the biological structure. When this latent variable is binary, a two-cluster structure is expected in the PCoA plot, but compositional bias confounds such a structure (S5a Fig). After applying normalization methods, only RSim normalization helps detect a two-cluster structure in the PCoA plot. We observed a similar phenomenon in the numerical experiment when the latent variable is continuous (S5b Fig). These examples suggest that compositional bias can obscure biological signals of interest, and RSim normalization can help reveal the true biological pattern in the data set.

RSim normalization increases efficiency of association analysis

This section investigates the impact of normalization on association analysis, which aims to detect an association between microbiome data and a specific outcome, such as age or BMI. To compare the performance of different normalization methods, we consider two commonly used association analysis methods, PERMANOVA [32, 33] and MiRKAT [34]. Similar to the previous sections, we generate synthetic data from the microbiome data set in [29]. In the first set of experiments, we examine the effect of normalization on the type I error of association analysis. We randomly divide samples into two groups and rarefy the first group via subsampling. The type I error is highly inflated when we directly apply association analysis to the unnormalized count data due to the difference in sequencing depth. After applying eight different normalization methods, only TSS, rarefaction and RSim normalization can effectively control the type I error. We also apply PERMANOVA to samples of Karen individuals living in Thailand, as collected in [30], using the same experiment settings as in the previous section. The P-values are reported in Fig 3c. PERMANOVA finds a significant association between the microbiome data and the group defined by the sequencing depth when the count data is normalized by the existing normalization method. However, the association is no longer significant when we apply RSim normalization, indicating that RSim normalization can correct the compositional bias resulting from the confounding sequencing depth. These results further confirm that RSim normalization can reduce false discoveries in association analysis.

The second set of numerical experiments investigates the effect of different normalization methods on the power of association analysis. As in the previous two sections, we generated synthetic data with differential abundant taxa from the microbiome data set in [29]. Applying PERMANOVA and MiRKAT directly to the unnormalized data resulted in power loss, while RSim normalization improved their power more effectively than existing methods in most settings (see Fig 4b and S7 Fig). In addition to the synthetic data, we also compared different normalization methods using the data set collected in [30]. Specifically, we applied PERMANOVA and MiRKAT to examine the global association between BMI and the human gut microbiome in Karen individuals living in Thailand. When the microbiome data was normalized using RSim and GMPR, we observed a significant association with P-values smaller than 0.05. However, when other existing normalization methods were used, no significant discovery was reported (see Table 1). This discovery aligns with previous literature that shows the significant impact of gut microbiota on nutrient metabolism and energy expenditure [35]. These findings highlight the importance of appropriate normalization in association analysis to avoid false discoveries and improve power, and RSim normalization is a superior choice to existing methods.

thumbnail
Fig 4. Normalization can reduce false discovery and improve the power of association analysis.

In (a), the samples are randomly divided into two groups, and the count data in the first group is rarefied. In (b), the synthetic data include differential abundant taxa. The significance level is 0.05 in both (a) and (b). Normalization is an essential step to avoid false discovery and improve power.

https://doi.org/10.1371/journal.pcbi.1011447.g004

thumbnail
Table 1. Normalization can make more scientific discoveries through improving the power of association analysis.

https://doi.org/10.1371/journal.pcbi.1011447.t001

RSim normalization improves accuracy of differential abundance analysis

This section focuses on the effect of normalization on differential abundance analysis, which aims to identify taxa with different abundances across conditions. Classical tests, such as the two-sample t-test and Pearson correlation test, are commonly used for this analysis, but applying them directly to unnormalized count data can lead to inflated false discoveries. Proper normalization is, therefore, essential to mitigate this issue. We conducted experiments on synthetic data generated from the dataset in [29] to study how the normalization impacts differential abundance analysis. Specifically, we apply seven normalization methods (TSS, UQ, CSS, MED, TMM, GMPR and RSim) and conduct differential abundance analysis using the two-sample t-test for binary outcomes and Pearson correlation test for continuous outcomes. The results in Fig 5 showed that inappropriate normalization could introduce bias, resulting in an inflated false discovery rate (FDR) and reduced power. However, RSim normalization was effective in controlling FDR and maintaining sufficient power, thereby mitigating compositional bias. These results confirm the importance of appropriate normalization for differential abundance analysis and suggest that RSim normalization is a reliable choice.

thumbnail
Fig 5. Comparison of different normalization methods’ effect on the differential abundance analysis.

(a) and (b) are the FDR and sensitivity plots of the t-test after applying seven normalization methods. (c) and (d) are the FDR and sensitivity plots of the Pearson correlation test after applying seven normalization methods. The x-axis is the signal strength of differential abundant taxa. RSim can help t-test and Pearson correlation test control FDR and maintain detection power.

https://doi.org/10.1371/journal.pcbi.1011447.g005

In addition to the synthetic data, we also applied RSim normalization to real datasets to further elucidate the effect of normalization on differential abundance analysis. First, we used the dataset from [31] to compare the seven normalization methods. The samples were divided into two groups based on sequencing depth, and we applied the two-sample t-test equipped with seven normalization methods as well as four state-of-the-art differential abundance tests designed for compositional data: ANCOM [36], edgeR [37], LinDA [38], and RDB [24]. The results, as summarized in Fig 6, showed that inappropriate normalization could lead to inflated FDR when the sequencing count data is analyzed. However, RSim normalization successfully corrected for compositional bias and improved the two-sample t-test to control FDR and detect significant differences.

thumbnail
Fig 6. RSim normalization helps two-sample t-test control false discovery.

Samples are divided into two groups based on the sequencing depth (<10000 belongs to the first group, and >20000 belongs to the second group), and the FDR is shown when the different significance levels are used. In (a), seven normalization methods are compared. In (b), a two-sample t-test equipped with RSim normalization is compared with state-of-art differential abundance tests.

https://doi.org/10.1371/journal.pcbi.1011447.g006

We also applied RSim normalization and the two-sample t-test to a gut microbiome data set from an immigration effect study [30]. This analysis compared two groups: Karen (Karen female individuals living in Thailand) vs. Karen1st (Karen female individuals born in Southeast Asia and moved to the US). We detected six significant phyla in the comparison (S1 Table). It is noteworthy that the RDB test, which is designed to remove compositional bias and has the best false discovery control in the previous experiment, also detected these six phyla. However, applying the two-sample t-test to the unnormalized data led to the detection of three different phyla. This discovery is consistent with previous results indicating that Bacteroidota, Firmicutes, Actinobacteriota, and Fusobacteriota are associated with obesity and that the obesity rate is significantly higher for immigrants than for people living in Thailand [3941]. Furthermore, Desulfobacterota is shown to be related to inflammatory bowel diseases [42], and the incidence of these diseases is much higher in Western countries compared to Asian countries, especially Thailand [43], which is also consistent with our findings. These results again suggest that RSim normalization can improve the ability of differential abundance tests to control false discoveries and detect significant differences more effectively than existing normalization methods.

Discussion

In this study, we present RSim, a novel normalization method that corrects sample-specific bias in microbiome data with many zeros. RSim normalization is robust to the prevalent zeros because each step can work with zeros without making extra assumptions or treatments. RSim first identifies a set of non-differential abundant taxa by evaluating the pairwise rank similarity of taxa, then uses the estimated set as a reference set in reference-based normalization. This approach effectively corrects the compositional bias, even when microbiome data consists of many zero counts. Furthermore, while our discussion primarily focused on microbiome sequencing data, the ideas in this algorithm could potentially be applied to single-cell RNA sequencing data, where one major obstacle in normalization is also the zero counts problem.

Our comprehensive investigation into how normalization results can affect downstream analysis shows that the unobserved sampling fraction is a common confounder in high throughput sequencing data analysis. Compositional bias may confound the results of almost all types of downstream analysis, ranging from data visualization to statistical testing. This confounding factor creates false clusters or discoveries and obscure signals of interest in data analysis and interpretation. Our numerical experiments demonstrate that RSim normalization can eliminate compositional bias better than existing methods, reducing false discovery and increasing detection power in downstream analysis, including PCoA plotting, association analysis, and differential abundance analysis. We hope this new normalization method can improve the current data analysis pipeline and enable biological researchers to make more scientific discoveries.

One major assumption of RSim normalization is that more than half of the taxa are non-differential abundant, which is also used in developing differential abundance analysis in compositional data. This assumption may appear strong, but it is necessary for model identification when the sampling fraction is not observed [24]. In other words, the set of non-differential abundant taxa cannot be determined from the observed sequencing count when less than half are non-differential abundant. We recommend applying RSim normalization on high-resolution data, such as at ASV or OTU level, to satisfy this assumption. When ASVs/OTUs are aggregated into taxa at a higher taxonomic level, like class or order level, there could be much fewer non-differential abundant taxa because aggregation of non-differential and differential abundant ASVs results in differential abundant taxa [44].

Finally, the development of RSim normalization shows that reference-based normalization can successfully correct compositional bias when identifying a set of non-differential abundant taxa. While RSim normalization only suggests one way to detect a set of non-differential abundant taxa, there could be alternative approaches to achieve the same goal. For example, Spearman’s rank correlation coefficient can be replaced by other correlation coefficients, such as the Pearson and Kendall rank correlation coefficients. It would also be interesting to explore if there is a better way to control the misclassification rate than our empirical Bayes method.

Materials and methods

Reference-based normalization

In order to correct for compositional bias in sequencing data, various methods have been proposed in the literature for experiments both with and without spike-in bacteria. One of the most commonly used approaches is to calibrate the count data using the count of a control sequence, in cases where spike-in bacteria are available [2528]. In experiments with spike-in bacteria, exogenous taxa of known concentration are introduced to each sample in equal amounts, and the count data is then rescaled using the count of these exogenous taxa. Specifically, suppose we have n samples and each sample has d taxa. Let Ai,j be the true absolute abundance of taxon j of the ith sample, and let Ni,j be the corresponding observed sequence counts. If we denote the collection of spike-in taxa as , then we can rescale the count data as follows:

The aim of this scaling is to convert relative abundance to absolute abundance by ensuring that the rescaled count of the spike-in taxa is the same across samples. Experiments have shown that this scaling can successfully recover absolute abundance, with the error in recovery reduced by using multiple spike-in taxa (i.e., a larger ) [25]. However, using spike-in bacteria can be limited by the availability of reliable taxa, as well as potential amplification biases [45, 46]. Given these challenges, it is natural to ask whether this idea can be generalized to experiments without spike-in bacteria.

A new computational normalization method can be inferred from the way of scaling in experiments with spike-in taxa: first, we identify a data-driven reference set whose absolute abundance remains stable across samples, and then normalize the count data with respect to this set:

In the experiment with spike-in bacteria, the reference set is simply the set of spike-in taxa with the same absolute abundance across different samples. This normalization method is referred to as reference-based normalization in this paper. A reference-based approach is also widely employed in compositional data analysis [20, 21]. For instance, the additive log-ratio transformation uses the last taxon as the reference set, while the centered log-ratio transformation uses the geometric mean of all taxa as the reference set. The reference-based hypothesis is also employed in differential abundance analysis of compositional data [23, 24, 44]. Unlike standard compositional data analysis, we utilize the sum of abundance in a set as the reference.

To perform efficient normalization in the absence of prior knowledge of spike-in taxa, we need to select an appropriate reference set, denoted by . In the presence of spike-in taxa, the reference set is simply the set of taxa with known absolute abundance that is constant across samples. However, for experiments without spike-in taxa, we can instead use a large set of non-differentially abundant taxa as the reference set. We assume that there exists a set of non-differentially abundant taxa called , such that their absolute abundance is similar across samples. When this is the case, the sum of the absolute abundance of these taxa is also similar across samples making the sum of abundance a suitable reference for normalization. Moreover, the sum of abundance of many taxa is generally more stable than that of a single taxon, due to the concentration of measure phenomenon [47, 48]. This observation suggests that normalization based on the set of non-differential abundant taxa can be effective in recovering the absolute abundance. In the next section, we will discuss how we estimate the reference set from the data.

Reference set identification by rank similarity

In the previous section, we proposed using reference-based normalization to convert relative abundance to absolute abundance by identifying a large set of non-differential abundant taxa. In this section, we introduce a new method for detecting this set by comparing the count similarity between pairs of taxa. Before we present the method, we introduce some notation and assumptions. We partition the taxa into two groups based on absolute abundance: the collection of differential abundant taxa, denoted by , and the collection of non-differential abundant taxa, denoted by . To simplify the analysis, we assume that the absolute abundance of non-differential abundant taxa is the same across samples while the absolute abundance of differential abundant taxa varies between samples

This model is only for illustrative purposes, but the method we introduce here can work in a more general setting, as long as the variance of absolute abundance for non-differential abundant taxa is much smaller than that for differential abundant taxa. In practice, we observe the count of each taxon, which only reflects relative abundance, and assume that it is drawn from a multinomial distribution where is the total sequence number in the ith sample. A similar model is also considered in [23, 44]. Equivalently, we assume that the observed count of taxa is approximately equal to the absolute abundance multiplied by some unobserved sampling fraction ci

To make the model identifiable, we assume that , where d is the number of taxa. See more discussion on model identification in [24]. After introducing these notations and assumptions, we present a two-step method for identifying the reference set.

Step 1: Differential abundance level statistics.

In the first step, we use the pairwise similarity of taxa to construct the statistics for the differential abundance level of each taxon (i.e., belonging to or ). The key observation we use in this step is that the observed count of two non-differential abundant taxa is much more similar than that of a non-differential abundant taxon and a differential abundant taxon. We represent the count of the jth taxon in all samples as and the sampling fraction of all samples as . Since the absolute abundance of non-differential abundant taxa is stable across samples, we can expect that the count vectors of two non-differential abundant taxa, and , are almost proportional to the sampling fraction vector , so the correlation between and is close to 1. However, since the absolute abundance of differential abundant taxa varies across samples, we can expect the correlation between and to be much smaller than 1 when j1 is a non-differential abundant taxon and j2 is a differential abundant taxon. If we use Spearman’s rank correlation coefficient to measure correlation, then where r(⋅, ⋅) is Spearman’s rank correlation coefficient. How do we use the difference in the pairwise similarity to distinguish non-differential and differential abundant taxon? Since we assume that more than half of the taxa are non-differentially abundant, we can look at the median of the rank correlation coefficients between a taxon and other taxa. More concretely, if we denote the median as we can expect

This observation suggests that the median Mj can be used to distinguish non-differential and differential abundant taxa.

Step 2: Taxa classification.

The second step of our method uses Mj to classify each taxon based on the empirical Bayes framework. The first step suggests that Mj is larger in non-differential abundant taxa than in differential abundant taxa. A natural classification rule is that we can choose a threshold T such that all taxa with Mj > T are classified as non-differential abundant taxa, i.e., is the estimator for . The threshold T should help ensure that most taxa in are non-differential abundant, since our goal is to find a reference set that satisfies the condition

To achieve this, we choose the threshold T to control the misclassification error rate in , i.e., where η > 0 is the target misclassification rate that users select (e.g., η = 0.01). We estimate T using the empirical Bayes framework [49, 50]. To facilitate this, we write F0 and F1 as the cumulative distribution functions of Mj when j is the non-differential and differential abundant taxa, respectively, and F as the cumulative distribution functions of Mj, i.e., where π0 is the proportion of non-differential abundant taxa. Following these notations, we can rewrite the misclassification error in as

The idea in the empirical Bayes framework is to estimate π0, F0, and F from observed Mj, j = 1, …, d, and then we can estimate the misclassification error by plugging in these estimators. The cumulative distribution function F can be naturally estimated by its empirical version where is an indicator function. Before estimating F0 and π0, we choose γ > 0 such that , indicating that j is likely a non-differentially abundant taxon when Mj > γ. After choosing γ, we adopt a resampling method to estimate F0. 1) Find all taxa with Mj > γ and define , which is a subset of non-differential abundant taxa with high probability. 2) Repeat subsampling the taxa from and recalculate the median on the subsampled data as in Step 1. In other words, the subsampled dataset only includes taxa from . When all the taxa in the subsampled dataset are non-differential abundant, we can expect the median of the Spearman correlation coefficient to be approximately drawn from F0. 3) The empirical cumulative distribution function of these resampled medians is our estimator . After finding and , we can estimate π0 by

Here, we use the fact that 1 − F(t) ≈ π0(1 − F0(t)) when t > γ. With the estimators , , and in hand, we choose the threshold and estimated reference set as

Choices of tuning parameters.

RSim normalization has two main parameters: the target misclassification rate η and the threshold for differential abundance level statistics γ. The choice of η affects the empirical misclassification rate and the size of the estimated reference set, which in turn affects the performance of downstream analysis and sampling fraction recovery. A smaller η leads to a less significant bias but higher variance in sampling fraction recovery. Therefore, we recommend a smaller η for downstream analysis that is sensitive to sampling fraction recovery bias, such as differential abundance analysis. Conversely, a larger η is suitable for downstream analysis that requires an estimated sampling fraction with inflated bias and low variation, such as PCoA plotting. S1a Fig shows that the empirical misclassification rate is well-controlled when varying η.

The threshold γ should ideally be at the lowest level where statistics of differential abundant taxa cannot achieve. The choice of γ depends on the microbiome data characteristics, such as taxonomic rank and proportion of differential abundant taxa. Our experience suggests that at an ASV or OTU level, the statistics of at least 90% of non-differential abundant taxa are greater than 0.8. Therefore, we recommend using γ = 0.8 and use it in all experiments. As discussed, it is better to apply RSim normalization at an ASV/OTU level and then convert the data into higher taxonomic ranks, such as genus and family. S1b Fig shows the empirical misclassification rate is always under-controlled when γ is varied from 0.5 to 0.95.

We also conduct numerical experiments to investigate the impact of parameter choices on downstream analysis, including association analysis and differential abundance analysis. The results are presented in S2 Fig. Through S2 Fig, we can observe that the performance of downstream analysis is more sensitive to η than γ, and having appropriate choices of parameters is crucial to achieving high power and low false discovery.

Computation complexity.

The computation complexity of RSim normalization is O(d2n), where d is the number of taxa, and n is the sample size. We perform a numerical experiment to compare the computation complexity of popular normalization methods. The results are shown in S2 Table. As expected, RSim normalization requires more computational time than most normalization methods because searching for non-differential abundant taxa is time-consuming. S2 Table also suggests that the computation time of different normalization methods is comparable when microbiome data sets have relatively moderate values of d (less than 5000) and n (less than 500).

Supporting information

S1 Text. Detailed description of simulation.

Descriptions of how the datasets were generated in simulations.

https://doi.org/10.1371/journal.pcbi.1011447.s001

(PDF)

S1 Fig. Misclassification rate control when the target misclassification rate η and parameter γ vary.

In Fig (a), the x-axis is the target misclassification rate, while the y-axis represents the empirical misclassification rate of the estimated reference set. In all settings, the misclassification error rate of the estimated reference set can be well controlled. In Fig (b), we vary the value of γ from 0.5 to 0.95. All three settings are the same for both figures. Setting 1: 10% taxa are randomly selected as differential abundant taxa, and the latent variable of differential abundant taxa is binary; Setting 2: the differential abundant taxa are top 10% most abundant taxa, and the latent variable of differential abundant taxa is binary; Setting 3: the differential abundant taxa are top 10% most abundant taxa, and the latent variable of differential abundant taxa is continuous.

https://doi.org/10.1371/journal.pcbi.1011447.s002

(PNG)

S2 Fig. Sensitivity analysis.

(a) and (b) show how the tuning parameter choices influence association analysis. Small η and large γ will lead to a higher power. (c) and (d) show how the tuning parameter choices will influence the Pearson correlation test and t-test, respectively. FDR of Pearson correlation test and t-test results has inflation if η value is large.

https://doi.org/10.1371/journal.pcbi.1011447.s003

(PNG)

S3 Fig. Misclassification rate control when the signal strength of differential abundant taxa, the balance of group size, proportion of differential abundant taxa, and sample size are different.

The empirical misclassification rate in RSim is well controlled despite the choices of the signal strength of differential abundant taxa, the balance of group size, proportion of differential abundant taxa, and sample size.

https://doi.org/10.1371/journal.pcbi.1011447.s004

(PNG)

S4 Fig. Comparison of sampling fraction estimations with the results of spike-in-based normalization as the ground truth.

The discrepancies from the spike-in-based normalization method are compared, and it is observed that RSim exhibited the closest results to the spike-in-based normalization method.

https://doi.org/10.1371/journal.pcbi.1011447.s005

(PNG)

S5 Fig. Normalization can reveal biological pattern in PCoA plots.

In (a), samples are randomly divided into two groups, and the top 10% most abundant taxa are differential abundant taxa with a binary latent variable. In (b), the top 10% most abundant taxa are differential abundant taxa with a continuous latent variable. In these figures, RSim normalization can reveal the structure of the latent variable. Euclidean distance with log transformation is used in all PCoA plots.

https://doi.org/10.1371/journal.pcbi.1011447.s006

(PNG)

S6 Fig. False pattern caused by compositional bias leads to a misleading conclusion.

(a) shows the PCoA plots colored by days after the experiment started. (b) presents the PCoA plots colored by sequencing depth. (c) show the relationship between time and sequencing depth. The pattern of time in PCoA plots is highly overlapped with pattern of the sequencing depth, which can be explained by the deterministic relationship between time and sequencing depth.

https://doi.org/10.1371/journal.pcbi.1011447.s007

(PNG)

S7 Fig. Normalization can improve the power of association analysis.

In (a) and (b), samples are randomly divided into two groups, and the top 25% most abundant taxa are differential abundant taxa with a binary or continuous latent variable. The significance level is 0.05. RSim can improve the power of association analysis.

https://doi.org/10.1371/journal.pcbi.1011447.s008

(PNG)

S1 Table. Differential abundant phyla detected by different differential abundance analysis methods.

Three methods are considered: t-test on unnormalized data, t-test on data normalized by RSim, and RDB test on unnormalized data.

https://doi.org/10.1371/journal.pcbi.1011447.s009

(PDF)

S2 Table. Computational time of different normalization methods (in seconds).

d is the number of taxa, n is the sample size. All the experiments are conducted in iMac M1/8GB. Data are subsampled from the dataset collected in [30].

https://doi.org/10.1371/journal.pcbi.1011447.s010

(PDF)

References

  1. 1. Vallejos CA, Risso D, Scialdone A, Dudoit S, Marioni JC. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nature Methods. 2017;14(6):565–571. pmid:28504683
  2. 2. Weiss S, Xu ZZ, Peddada S, Amir A, Bittinger K, Gonzalez A, et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome. 2017;5(1):27. pmid:28253908
  3. 3. Lin H, Peddada SD. Analysis of microbial compositions: a review of normalization and differential abundance analysis. NPJ Biofilms and Microbiomes. 2020;6(1):1–13. pmid:33268781
  4. 4. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology. 2010;11(3):1–9. pmid:20196867
  5. 5. Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology. 2010;11(2):1–12. pmid:20132535
  6. 6. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, et al. A survey of best practices for RNA-seq data analysis. Genome Biology. 2016;17(1):1–19.
  7. 7. Vandeputte D, Kathagen G, D’hoe K, Vieira-Silva S, Valles-Colomer M, Sabino J, et al. Quantitative microbiome profiling links gut community variation to microbial load. Nature. 2017;551(7681):507–511. pmid:29143816
  8. 8. Hughes JB, Hellmann JJ. The application of rarefaction techniques to molecular inventories of microbial diversity. Methods in Enzymology. 2005;397:292–308. pmid:16260298
  9. 9. Anders S, Huber W. Differential expression analysis for sequence count data. Nature Precedings. 2010; p. 1–1. pmid:20979621
  10. 10. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010;11(1):1–13. pmid:20167110
  11. 11. Dillies M, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics. 2013;14(6):671–683. pmid:22988256
  12. 12. Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for microbial marker-gene surveys. Nature Methods. 2013;10(12):1200–1202. pmid:24076764
  13. 13. Lun ATL, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biology. 2016;17(1):1–14. pmid:27122128
  14. 14. Bacher R, Chu L, Leng N, Gasch AP, Thomson JA, Stewart RM, et al. SCnorm: robust normalization of single-cell RNA-seq data. Nature Methods. 2017;14(6):584–586. pmid:28418000
  15. 15. Kumar MS, Slud EV, Okrah K, Hicks SC, Hannenhalli S, Corrada Bravo H. Analysis and correction of compositional bias in sparse sequencing count data. BMC Genomics. 2018;19:1–23. pmid:30400812
  16. 16. Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biology. 2019;20(1):296. pmid:31870423
  17. 17. McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Computational Biology. 2014;10(4):e1003531. pmid:24699258
  18. 18. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014;15(12):1–21. pmid:25516281
  19. 19. Chen L, Reeve J, Zhang L, Huang S, Wang X, Chen J. GMPR: A robust normalization method for zero-inflated count data with application to microbiome sequencing data. PeerJ. 2018;6:e4600. pmid:29629248
  20. 20. Aitchison J. The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological). 1982;44(2):139–160.
  21. 21. Pawlowsky-Glahn V, Buccianti A. Compositional data analysis. Wiley Online Library; 2011.
  22. 22. Greenacre M. Compositional data analysis. Annual Review of Statistics and its Application. 2021;8:271–299.
  23. 23. Brill B, Amir A, Heller R. Testing for differential abundance in compositional counts data, with application to microbiome studies. The Annals of Applied Statistics. 2022;16(4):2648–2671.
  24. 24. Wang S. Robust differential abundance test in compositional data. Biometrika. 2023;110(1):169–185.
  25. 25. Stämmler F, Gläsner J, Hiergeist A, Holler E, Weber D, Oefner PJ, et al. Adjusting microbiome profiles for differences in microbial load by spike-in bacteria. Microbiome. 2016;4:1–13. pmid:27329048
  26. 26. Tourlousse DM, Yoshiike S, Ohashi A, Matsukura S, Noda N, Sekiguchi Y. Synthetic spike-in standards for high-throughput 16S rRNA gene amplicon sequencing. Nucleic Acids Research. 2017;45(4):e23–e23. pmid:27980100
  27. 27. Tkacz A, Hortala M, Poole PS. Absolute quantitation of microbiota abundance in environmental samples. Microbiome. 2018;6:1–13. pmid:29921326
  28. 28. Hardwick SA, Chen WY, Wong T, Kanakamedala BS, Deveson IW, Ongley SE, et al. Synthetic microbe communities provide internal reference standards for metagenome sequencing and analysis. Nature Communications. 2018;9(1):3096. pmid:30082706
  29. 29. He Y, Wu W, Zheng H, Li P, McDonald D, Sheng H, et al. Regional variation limits applications of healthy gut microbiome reference ranges and disease models. Nature Medicine. 2018;24(10):1532–1535. pmid:30150716
  30. 30. Vangay P, Johnson AJ, Ward TL, Al-Ghalith GA, Shields-Cutler RR, Hillmann BM, et al. US immigration westernizes the human gut microbiome. Cell. 2018;175(4):962–972. pmid:30388453
  31. 31. Caporaso JG, Lauber CL, Costello EK, Berg-Lyons D, Gonzalez A, Stombaugh J, et al. Moving pictures of the human microbiome. Genome Biology. 2011;12(5):1–8. pmid:21624126
  32. 32. McArdle BH, Anderson MJ. Fitting multivariate models to community data: a comment on distance-based redundancy analysis. Ecology. 2001;82(1):290–297.
  33. 33. Wang S, Cai TT, Li H. Hypothesis testing for phylogenetic composition: a minimum-cost flow perspective. Biometrika. 2021;108(1):17–36. pmid:33716568
  34. 34. Zhao N, Chen J, Carroll IM, Ringel-Kulka T, Epstein MP, Zhou H, et al. Testing in microbiome-profiling studies with MiRKAT, the microbiome regression-based kernel association test. The American Journal of Human Genetics. 2015;96(5):797–807. pmid:25957468
  35. 35. Aoun A, Darwish F, Hamod N. The influence of the gut microbiome on obesity in adults and the role of probiotics, prebiotics, and synbiotics for weight loss. Preventive Nutrition and Food Science. 2020;25(2):113. pmid:32676461
  36. 36. Mandal S, Van Treuren W, White RA, Eggesbø M, Knight R, Peddada SD. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microbial Ecology in Health and Disease. 2015;26(1):27663. pmid:26028277
  37. 37. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–140. pmid:19910308
  38. 38. Zhou H, He K, Chen J, Zhang X. LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome Biology. 2022;23(1):1–23. pmid:35421994
  39. 39. Ley RE, Turnbaugh PJ, Klein S, Gordon JI. Human gut microbes associated with obesity. Nature. 2006;444(7122):1022–1023. pmid:17183309
  40. 40. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457(7228):480–484. pmid:19043404
  41. 41. Andoh A, Nishida A, Takahashi K, Inatomi O, Imaeda H, Bamba S, et al. Comparison of the gut microbial community between obese and lean peoples using 16S gene sequencing in a Japanese population. Journal of Clinical Biochemistry and Nutrition. 2016;59(1):65–70. pmid:27499582
  42. 42. Loubinoux J, Bronowicki JP, Pereira IAC, Mougenel JL, Le Faou AE. Sulfate-reducing bacteria in human feces and their association with inflammatory bowel diseases. FEMS Microbiology Ecology. 2002;40(2):107–112. pmid:19709217
  43. 43. Riansuwan W, Limsrivilai J. Current status of IBD and surgery of Crohn’s disease in Thailand. Annals of Gastroenterological Surgery. 2021;5(5):597–603. pmid:34585044
  44. 44. Wang S. Multiscale adaptive differential abundance analysis in microbial compositional data. Bioinformatics. 2023;39(4):btad178. pmid:37018137
  45. 45. Suzuki MT, Giovannoni SJ. Bias caused by template annealing in the amplification of mixtures of 16S rRNA genes by PCR. Applied and Environmental Microbiology. 1996;62(2):625–630. pmid:8593063
  46. 46. Brankatschk R, Bodenhausen N, Zeyer J, Bürgmann H. Simple absolute quantification method correcting for quantitative PCR efficiency variations for microbial community samples. Applied and Environmental Microbiology. 2012;78(12):4481–4489. pmid:22492459
  47. 47. Talagrand M. A new look at independence. The Annals of Probability. 1996; p. 1–34.
  48. 48. Boucheron S, Lugosi G, Massart P. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press; 2013.
  49. 49. Robbins H. Asymptotically subminimax solutions of compound statistical decision problems. In: Proceedings of the second Berkeley symposium on mathematical statistics and probability. vol. 2; 1951. p. 131–149.
  50. 50. Efron B. Large-scale inference: empirical Bayes methods for estimation, testing, and prediction. vol. 1. Cambridge University Press; 2012.