RSim: A reference-based normalization method via rank similarity

doi:10.1371/journal.pcbi.1011447

Fig 1.

Illustration of the RSim normalization procedure.

Step 1: the median of the pairwise rank similarities between taxa is evaluated to construct a statistic for the differential abundance level of each taxon. Step 2: a new empirical Bayes method provides misclassification rate control when identifying non-differentially abundant taxa. The estimated non-differentially abundant taxa are then used as the reference set in reference-based normalization.
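
The two steps can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes that rank similarity is the correlation between the across-sample ranks of two taxa (a Spearman-type measure), and it replaces the empirical Bayes step with a simple "keep the top n taxa" rule as a stand-in. The function names and the `n_reference` parameter are hypothetical. Counts are arranged as taxa by samples.

```python
import numpy as np

def rank_rows(x):
    """Rank each taxon's counts across samples (0 = smallest).
    Ties are broken by sample position, a simplification; average
    ranks would be more appropriate for sparse count data."""
    order = np.argsort(x, axis=1, kind="stable")
    return np.argsort(order, axis=1).astype(float)

def median_rank_similarity(counts):
    """Step 1 (sketch): for each taxon, the median correlation between
    its ranks and the ranks of every other taxon."""
    sim = np.corrcoef(rank_rows(np.asarray(counts, dtype=float)))
    np.fill_diagonal(sim, np.nan)        # ignore self-similarity
    return np.nanmedian(sim, axis=1)

def select_reference(stat, n_reference):
    """Step 2 (stand-in, NOT the empirical Bayes method): keep the
    n_reference taxa with the highest statistic."""
    return np.argsort(stat)[-n_reference:]

def reference_size_factors(counts, reference):
    """Per-sample size factor: total count over the reference taxa."""
    return np.asarray(counts, dtype=float)[reference].sum(axis=0)

def reference_normalize(counts, reference):
    """Divide every sample by its reference-based size factor."""
    return np.asarray(counts, dtype=float) / reference_size_factors(counts, reference)
```

The intuition behind the stand-in rule is that taxa whose abundance does not differ between conditions all scale with the sampling fraction, so their ranks across samples tend to agree with each other.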

Fig 2.

Comparisons of normalization methods in estimating sampling fraction.

The numerical experiments are performed when the signal strength of the differentially abundant taxa is (a) weak, (b) moderate, and (c) strong. In (a), (b), and (c), the x-axis shows the true sampling fractions and the y-axis shows the sampling fractions estimated by each normalization method. The estimated sampling fractions are scaled so that their average equals the average of the true sampling fractions. The black line marks equality between the estimated and true sampling fractions, and the color of each point indicates which group the differentially abundant taxa belong to. Panel (d) compares the bias in sampling fraction estimation across normalization methods as the signal strength and the proportion (p = 0.1, 0.2, 0.3) of differentially abundant taxa vary. The reference-based method corrects the compositional bias better than existing methods, especially when there is a large proportion of strongly differentially abundant taxa.
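
The scaling step mentioned in the caption can be written in a few lines. This is a sketch under the assumption that the scaling is a single multiplicative constant applied to all estimates; the function name is hypothetical.

```python
import numpy as np

def rescale_to_true_mean(estimated, true):
    """Multiply the estimates by one constant so that their mean
    equals the mean of the true sampling fractions."""
    estimated = np.asarray(estimated, dtype=float)
    true = np.asarray(true, dtype=float)
    return estimated * true.mean() / estimated.mean()
```

Because size factors are only identifiable up to a common constant, such a rescaling puts the estimates and the true values on the same scale before they are compared against the equality line.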

Fig 3.

Compositional bias can create false clusters in PCoA plots.

In (a) and (b), samples are randomly divided into two groups. The data in (a) are left unmodified, while the count data in group 1 are rarefied in (b). In (c), samples are divided into two groups by sequencing depth (samples with depth >10000 form the first group, and samples with depth <5000 form the second). In all three panels, RSim normalization helps remove the false clusters that result from compositional bias. All PCoA plots use Euclidean distance on log-transformed data.
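
For readers who want to reproduce this kind of plot, the following is a minimal sketch of PCoA (classical multidimensional scaling) with Euclidean distance on log-transformed counts. The pseudocount added before taking the logarithm is an assumption, since the caption does not say how zeros are handled, and the function name is hypothetical. Counts are arranged as taxa by samples.

```python
import numpy as np

def pcoa_log_euclidean(counts, n_axes=2, pseudocount=1.0):
    """Return sample coordinates (samples x n_axes) from PCoA on
    Euclidean distances between log-transformed samples."""
    # One row per sample after the transpose; pseudocount is an assumption.
    x = np.log(np.asarray(counts, dtype=float).T + pseudocount)
    # Squared Euclidean distances between all pairs of samples.
    diff = x[:, None, :] - x[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    # Double-center the squared distances (Gower centering).
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ d2 @ centering
    # Leading eigenvectors, scaled by the square root of the eigenvalues.
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:n_axes]
    vals = np.clip(vals[order], 0.0, None)   # guard against tiny negatives
    return vecs[:, order] * np.sqrt(vals)
```

Applying the function to raw counts and to counts divided by a size factor gives the two sets of coordinates that such a comparison would plot side by side.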

Fig 4.

Normalization can reduce false discovery and improve the power of association analysis.

In (a), the samples are randomly divided into two groups, and the count data in the first group are rarefied. In (b), the synthetic data include differentially abundant taxa. The significance level is 0.05 in both (a) and (b). Normalization is an essential step for avoiding false discoveries and improving power.
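
Rarefying a sample means drawing a fixed number of reads from it without replacement. A minimal sketch is given below; the target depth is a free parameter here because the caption does not state the value used, and the function name is hypothetical.

```python
import numpy as np

def rarefy(sample_counts, depth, rng):
    """Subsample one sample's taxon counts to exactly `depth` reads,
    drawing reads without replacement."""
    sample_counts = np.asarray(sample_counts, dtype=np.int64)
    if depth > sample_counts.sum():
        raise ValueError("depth exceeds the total number of reads")
    # Drawing without replacement from taxon 'urns' is a multivariate
    # hypergeometric draw.
    return rng.multivariate_hypergeometric(sample_counts, depth)

rng = np.random.default_rng(0)
rarefied = rarefy([50, 30, 20], depth=40, rng=rng)
```

Rarefying only one group lowers its sequencing depth without changing the underlying composition, which is why any difference a test reports between the groups in this setting is a false discovery.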

Table 1.

Normalization can lead to more scientific discoveries by improving the power of association analysis.

Fig 5.

Effect of different normalization methods on differential abundance analysis.

(a) and (b) show the FDR and sensitivity of the t-test after applying seven normalization methods. (c) and (d) show the FDR and sensitivity of the Pearson correlation test after applying the same seven normalization methods. The x-axis is the signal strength of the differentially abundant taxa. RSim helps the t-test and the Pearson correlation test control the FDR and maintain detection power.
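
As a reminder of how the two plotted quantities are obtained in a simulation, the sketch below computes the false discovery proportion and the sensitivity for a single run; averaging the former over repeated runs estimates the FDR. The function name and the boolean-mask inputs are assumptions made for illustration.

```python
import numpy as np

def fdp_and_sensitivity(rejected, truly_differential):
    """rejected / truly_differential: boolean masks over taxa.
    Returns (false discovery proportion, sensitivity) for one run."""
    rejected = np.asarray(rejected, dtype=bool)
    truth = np.asarray(truly_differential, dtype=bool)
    n_rejected = rejected.sum()
    n_true = truth.sum()
    false_discoveries = (rejected & ~truth).sum()
    true_discoveries = (rejected & truth).sum()
    # Conventions for empty sets: no rejections gives FDP 0,
    # no true signals gives sensitivity 0.
    fdp = false_discoveries / n_rejected if n_rejected > 0 else 0.0
    sensitivity = true_discoveries / n_true if n_true > 0 else 0.0
    return float(fdp), float(sensitivity)
```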

Fig 6.

RSim normalization helps the two-sample t-test control false discoveries.

Samples are divided into two groups by sequencing depth (samples with depth <10000 form the first group, and samples with depth >20000 form the second), and the FDR is shown at different significance levels. In (a), seven normalization methods are compared. In (b), a two-sample t-test combined with RSim normalization is compared with state-of-the-art differential abundance tests.
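
The grouping rule can be expressed compactly. The sketch below assumes that sequencing depth is the total count of a sample and that samples falling between the two thresholds are left out; the function name is hypothetical. Counts are arranged as taxa by samples.

```python
import numpy as np

def split_by_depth(counts, low, high):
    """Return the sample indices with depth < low and with depth > high.
    Samples with depth between the thresholds are assigned to neither group."""
    depth = np.asarray(counts).sum(axis=0)   # total reads per sample
    first_group = np.flatnonzero(depth < low)
    second_group = np.flatnonzero(depth > high)
    return first_group, second_group
```

Since the groups differ only in sequencing depth, every taxon reported as differentially abundant between them counts as a false discovery.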
