RSim: A reference-based normalization method via rank similarity

Microbiome sequencing data normalization is crucial for eliminating technical bias and ensuring accurate downstream analysis. However, this process can be challenging due to the high frequency of zero counts in microbiome data. We propose a novel reference-based normalization method called normalization via rank similarity (RSim) that corrects sample-specific biases, even in the presence of many zero counts. Unlike other normalization methods, RSim does not require additional assumptions or treatments for the high prevalence of zero counts. This makes it robust and minimizes potential bias resulting from procedures that address zero counts, such as pseudo-counts. Our numerical experiments demonstrate that RSim reduces false discoveries, improves detection power, and reveals true biological signals in downstream tasks such as PCoA plotting, association analysis, and differential abundance analysis.


Setting for Fig 2:
In subfigure (a), (b), and (c), n = 500 samples are randomly selected from the data set and randomly divided into two groups I 1 and I 2 with equal size.10% taxa are randomly selected to be J 1 .The idea of our simulation experiments is to treat the raw count data as the population of absolute abundance.Specifically, absolute abundance is generated in the following way: For i ∈ I 1 : A i,j = X i,j + Poisson(λ), j ∈ J 1 , X i,j , j ∈ J 0 For i ∈ I 2 : A i,j = X i,j , j = 1, ..., d Given the simulated absolute abundance, observed count in the simulation experiment is generated in the following way: N i,j ∼ Binomial(A i,j , c i ), i = 1, .., n, j = 1, ..., d, where c i ∼ Unif[0, 1] is the sampling fraction of each sample.λ is 1 for the weak signal (a), 10 for the moderate signal (b), and 500 for the strong signal (c).
The setting for (d) is the same as (a), (b) and (c), except that we consider the proportion of differential abundant taxa as p = 0.1, 0.2, 0.3.The bias is evaluated in the following way: Here | • | represents the cardinality of a set.The two terms in above equation represent the average log difference between true and estimated sampling fraction for two groups respectively.If sampling fractions are correctly estimated for both two groups, the absolute difference between these two terms should be close to zero.On the other hand, if the differential abundant taxa lead to a systematic bias in estimated sampling fraction, the bias can be large.In all above experiments, we choose η = 0.2.

Setting for Fig S4:
The result of spike-in-based normalization serves as the ground truth for the sampling fraction, denoted as c i for the ith sample.Let ĉi denote the sampling fraction estimated by the computation-based normalization method.The discrepancy is evaluated in the following quantity: where n is the sample size.
Simulation study to assess the misclassification rate control Similar to the last set of simulation experiments, we still conduct the experiments on data set collected in He et al. (2018).Each set of experiment is repeated 500 times and we use the average misclassification rate as measure.
Setting 1 for Fig S1a : n = 500 samples are randomly selected from the data set and divided into two groups I 1 and I 2 with equal size.10% taxa are randomly selected to be J 1 .Absolute abundance is generated in the following way: For i ∈ I 2 : A i,j = 10 • X i,j , j = 1, ..., d λ is the signal strength and set to be 2 in this experiment.Observed count is generated in the following way: where c i ∼ Unif[0, 1] is the sampling fraction of each sample.
Setting 2 for Fig S1a: n = 500 samples are randomly selected from the data set and divided into two groups I 1 and I 2 with equal size.The top 10% abundant taxa are selected to be J 1 .Absolute abundance is generated in the following way: Observed count is generated in the following way: where c i ∼ Unif[0, 1] is the sampling fraction of each sample.
Observed count is generated in the following way: where c i ∼ Unif[0, 1] is the sampling fraction of each sample.

Simulation study for PCoA
The Euclidean distance is used in all the PCoA plots.

Setting for Fig 3a and Fig 3b:
The experiment is conducted on data set collected in He et al. (2018).We only include samples with total counts larger than 30000 in this experiment.Fig 3a is the result for raw data.For 3b, half of the samples are subsampled to 1/10 of the original sequencing depth, while the other half of the samples remain the same.η = 0.2 in these experiments.

Setting for Fig 3c:
PCoA performance on the data set collected in Vangay et al. (2018).
The setup of this experiment can be found in PCoA section.

Setting for Fig S5a:
All the samples of the data set in He et al. (2018) are included in this experiment.Samples are divided into two groups I 1 and I 2 .The top 25% abundant taxa are selected to be J 1 .Absolute abundance is generated in the following way: Observed count is generated in the following way: The sampling fraction of I 1 is around ten times of I 2 .
Setting for Fig S5b : All the samples are included in this experiment(n = 539).The top 1% abundant taxa are selected to be J 1 .First we generate a random variable Y = [Y 1 , .., Y n ] and Y i 's are drawn from Unif[5, 500] independently.Absolute abundance is generated in the following way: For i = 1, .., n : Observed abundance is generated in the following way: where 1/c i ∼ N (Y i , 1) and min(c i , 1) is the sampling fraction for sample i.

Simulation study for association analysis
Simulation experiments is still based on the data set in He et al. (2018).All the experiments for the association analysis are repeated 500 times.
Setting for Fig 4a: n = 500 samples are divided into two groups with equal size.Samples in the first group are subsampled to 1/c of the original sequencing depth, while the second group keeps the same as the original data set.c increases from 1 to 3. We choose η = 0.07.

Setting for Fig 4b:
We consider sample size n = 100, 300, 500.Samples are divided into two groups I 1 and I 2 with equal size.The top 10% abundant taxa are selected to be J 1 .Absolute abundance is generated in the following way: Here we choose λ = 2. Observed count is generated in the following way: The sampling fraction of I 1 is around two times of I 2 .We choose η = 0.05.S1.We use Benjamini-Hochberg procedure to adjust the effect of multiple testing.

Setting for Fig
For the first set of the simulation study, to reduce the computational load, we select the subtree below Node 351 as our target, which contains 1081 taxa in total.We repeat each experiment 500 times and consider the average sensitivity rates and false discovery rates.

Setting for Fig 5a and 5b:
We consider sample size n = 100, 200, 300.Samples are divided into two groups I 1 and I 2 with equal size.10% taxa are randomly selected to be J 1 .Absolute abundance is generated in the following way: Observed count is generated in the following way: λ is chosen to be 50,100,150.Two-sample t-test is applied on the normalized counts.The significance level is 0.1.We choose η = 0.01.
Then we generate the sampling fraction in the following way: Observed abundance is generated in the following way: λ is chosen to be 50, 100, and 150.Pearson correlation test is applied on normalized counts.The significance level is 0.1 and we choose η = 0.01.

Setting for Fig 6:
Samples of subject M3's right palm are selected from data set in Caporaso et al. (2011).In total there are 352 samples and 24333 ASVs.Samples with a sequencing depth of less than 10000 are selected to be the first group, and samples with a sequencing depth of more than 20000 are another group.ANCOM, EdgeR, LinDA, RDB and t-test with different normalization methods are considered in the experiment.We use the detected 0.9 for ANCOM, which is the most conservative setting, and the default setting for EdgeR, RDB and LinDA.
Setting for Table S1: Differential abundant test is conducted for group KareThai and group Karen1st from data set in Vangay et al. (2018).For t-test with RSim, data is first normalized at the ASV level, then aggregated to the phylum level.The FDR control level is 0.1.
Fig S1b: Three settings are the same as Fig S1a.We set η = 0.1 and change γ from 0.5 to 0.95.Setting for Fig S2a and S2b: The setting of power comparison is the same as Fig S7a with n = 500, the setting of type I error comparison is the same as Fig 4a with c = 3. Setting for Fig S2c and S2d: The setting is the same as Fig 5.For Pearson correlation test, we choose n = 200 and λ = 50.For t-test, we choose n = 100 and λ = 50.Setting for Fig S3: All the four subfigures share the same data structure with the first setting of Fig S1a.Signal strength λ increases from 1 to 10; the ratio between the group sizes increases from 0.1 to 0.4; the proportion of differential abundant taxa increases from 0.1 to 0.5; sample size increases from 100 to 500.We choose η = 0.01 in the experiments.
S7a: The setting of Fig S7a is the same as Fig 4b except that λ = 5.Setting for Fig S7b: We consider sample size n = 100, 300, 500.The setting is the same with Fig S1c.Simulation study for differential abundance test There are three parts in this section.The first set of experiments is based on the data set in He et al. (2018), of which results are shown in Fig 5; the second set is based on the data set in Caporaso et al. (2011), of which result is shown on Fig 6; the third set is based on the data set in Vangay et al. (2018), of which result is shown on Table Setting 3 for Fig S1a: n = 500 samples are randomly selected from the data set.The top 10% abundant taxa are selected to be J 1 .First we generate a random variable Y = [Y 1 , .., Y n ] and Y i 's are drawn from Unif[1, 100] independently.And absolute abundance is generated in the following way: Setting for Fig 5c and 5d: We consider sample size n = 200, 350, 500.The top 10% abundant taxa are selected as J 1 .First we generate a random vector Y = [Y 1 , .., Y n ] and Y i are draw from Unif[1, λ].The absolute abundance is generated in the following way: