Simple and flexible sign and rank-based methods for testing for differential abundance in microbiome studies

Microbiome data obtained with amplicon sequencing are considered as compositional data. It has been argued that these data can be analysed after appropriate transformation to log-ratios, but ratios and logarithms cause problems with the many zeroes in typical microbiome experiments. We demonstrate that some well chosen sign and rank transformations also allow for valid inference with compositional data, and we show how logistic regression and probabilistic index models can be used for testing for differential abundance, while inheriting the flexibility of a statistical modelling framework. The results of a simulation study demonstrate that the new methods perform better than most other methods, and that it is comparable with ANCOM-BC. These methods are implemented in an R-package ‘signtrans’ and can be installed from Github (https://github.com/lucp9827/signtrans).


SPsimSeq simulation scenarios
Negative Binomial simulation scenarios This simulation framework was based on work from Hawinkel and colleagues (2019).In the scenarios with 10% differential abundance, the fold change was introduced by using a fraction ( 1 F C+1 ) of the 10% differentially abundant taxa, of which the relative abundances sum to a.The parameters for mean relative abundances of this fraction were multiplied by the fold change factor (1.5 and 3).The remaining fraction ( F C F C+1 ) of the 10% differentially abundant taxa, of which the relative abundances sum to b, were multiplied by a b (1 − F C) + 1.This made sure that the sum of all mean relative abundances was equal to one again, and the mean relative abundances of the remaining taxa were not affected.Alternatively, for the scenarios with 70% differential abundance, the parameter for the mean relative abundances of 70% differentially abundant taxa (that were randomly selected), was multiplied by a fold change in one group.Subsequently, the parameters of all taxa were changed to make the sum of the mean relative abundances of all taxa equal to 1.This introduces DA in the remaining 30% of the taxa.

Diagnostic plots
To compare the simulated SPsimSeq data with the source data the following comparison metrics were used: • the distribution of mean, variance and coefficients of variation (CV) of taxa abundance levels, • the relationship between the mean and variance and the mean and CV of taxa abundance levels, • the distribution of the fraction of zero counts per taxon and its relationship with the mean abundance level, • the distribution of the pairwise correlation coefficients between the taxa and samples.
Setting A

Figure 1 :
Figure 1: Main simulation design.Each unique combination of log-fold change threshold and sample size is considered as simulation scenario for SPSimSeq.

*Figure 2 :Figure 3 :
Figure2: The distribution of the library sizes of the source data (HGM), the pre-processed data (Source A and Source B) and for simulated data of SPsimSeq (setting A SP, setting B SP) and the negative binomial distribution (setting A NB, setting B NB).The distribution of the library sizes for simulated data is based on a random selected dataset.

Figure 4 :
Figure 4: The distributions of the library size, mean and variance of taxa abundance from the pre-processed data (Source A) and for simulated data of SPsimSeq (scenario 2.2).

Figure 5 :
Figure 5: The relationship between the mean and coefficient of variation (CV) of taxa abundance levels (log-CPM) from the pre-processed data (Source A) and for simulated data of SPsimSeq (scenario 2.2).

Figure 6 :
Figure 6: The fraction of zero counts per gene as a function of the mean taxa abundance levels (log-CPM) from the pre-processed data (Source A) and for simulated data of SPsimSeq (scenario 2.2).

Figure 7 :Figure 8 :
Figure 7: The distributions of the pairwise Pearson correlation-coefficients between taxa from the pre-processed data (Source A) and for simulated data of SPsimSeq (scenario 2.2).

Figure 9 :
Figure 9: The relationship between the mean and coefficient of variation (CV) of taxa abundance levels (log-CPM) from the pre-processed data (Source B) and for simulated data of SPsimSeq (scenario 2.2).

Figure 10 :
Figure 10: The fraction of zero counts per gene as a function of the mean taxa abundance levels (log-CPM) from the pre-processed data (Source B) and for simulated data of SPsimSeq (scenario 2.2).

Figure 11 :
Figure 11: The distributions of the pairwise Pearson correlation-coefficients between taxa from the pre-processed data (Source B) and for simulated data of SPsimSeq (scenario 2.2).

Table 3 :
Simulation scenarios SPsimseq.One hundred data sets per scenario were simulated with the following characteristics.The number of taxa in total was fixed on 250.

Table 4 :
Simulation scenarios NB-framework.One hundred data sets per scenario were simulated with the following characteristics.