Investigating differential abundance methods in microbiome data: A benchmark study

The development of increasingly efficient and cost-effective high-throughput DNA sequencing techniques has enhanced the possibility of studying complex microbial systems. Recently, researchers have shown great interest in studying the microorganisms that characterise different ecological niches. Differential abundance analysis aims to find differences in the abundance of each taxon between two classes of subjects or samples, assigning a significance value to each comparison. Several bioinformatic methods have been specifically developed, taking into account the challenges of microbiome data, such as sparsity, the different sequencing depths between samples, and compositionality. Differential abundance analysis has led to important conclusions in different fields, from health to the environment. However, the lack of a known biological truth makes it difficult to validate the results obtained. In this work we exploit metaSPARSim, a microbial sequencing count data simulator, to simulate data with differential abundance features between experimental groups. We perform a complete comparison of recently developed and established methods on a common benchmark, with particular attention to the reliability of both the simulated scenarios and the evaluation metrics. The performance overview covers numerous scenarios, studying the effect on methods' results of the main covariates, such as sample size, percentage of differentially abundant features, sequencing depth, feature variability, normalisation approach and ecological niche. Mainly, we find that methods show good control of the type I error and, generally, also of the false discovery rate at high sample sizes, while recall seems to depend on the dataset and sample size.

• Add information on taxon bias not considered by the tools and update Table 1.
• Clarify throughout the manuscript the term "relative", specifying with respect to what.
• Update the classification of MaAsLin2 as a relative abundance-based method when TSS normalisation is involved.
• Add experimental covariates investigation as a perspective in Discussion.
We think the Reviewer's suggestions have served to improve the clarity of some key points of our work. We revised the manuscript to eliminate possible misunderstandings and hope it is now publishable in PLOS Computational Biology.
The details of our response are attached below.
Best regards,

Reviewer #2:
I believe that the authors should at least address or clarify within Table 1 that these methods still do not account for specific taxon biases (highlighted in Clausen et al., 2022, and McLaren et al., 2019). While I do not think the manuscript needs to go in depth on this point, I believe that it is important to note these biases when estimating absolute abundances and discussing what these tools are trying to measure within the samples they are provided.
We thank the Reviewer for the suggestion. We clearly stated that only a few methods implement strategies to account for taxon-specific biases, and that none of these strategies is included in the default settings (and therefore not in our comparison). We also updated the description of the DA methods in the supplementary materials (S1 File, Section 2) to include this information.
We added these sentences in the paragraph "Materials and Methods - Assessed differential abundance methods" and we updated Table 1 accordingly.
16S and metagenomics data are affected by experimental bias throughout the entire experimental workflow, given that the DNA extraction, PCR, and sequencing steps preferentially measure and amplify certain taxa over others, thereby distorting the measurements with taxon- and protocol-specific biases [33,34]. Usually, this aspect is only partially addressed by normalisation methods, which primarily aim at making comparisons across samples within the same taxon and experiment. However, comparisons across different taxa, or using data from different protocols, are not possible without addressing this issue. Although this aspect goes beyond the scope of our work, we report in Table 1 a column specifying whether the different tools can address taxon bias (see S1 File, Section 2 for more details).
While I think the authors have a point that ALDEx2 and eBays may not be measuring absolute abundance directly, it is also incorrect to say they are measuring relative abundances, at least in the traditional sense (measured as proportions of the whole). The centred log-ratio transformation that they apply does not only project data into Euclidean space, as indicated by the authors, but also provides a reference for how a taxon increases or decreases in comparison to the geometric mean abundance. While this could be thought of as a type of relative abundance (as you are comparing the abundance of a taxon relative to something else), it is not similar to what is generally referenced as relative abundances in microbiome data (i.e., measuring taxa as proportions of the whole). This latter approach is commonly used in conjunction with Wilcoxon or t-tests for differential abundance screening. I believe this difference should be highlighted within the table and made clear to the reader, because it can have major consequences for how the results of these tools should be interpreted. That is, knowing that a taxon is increased across samples within a group compared to the geometric mean abundance of all taxa within that group is different than knowing that the proportion of that taxon is increased across the samples within that group. Overall, I think this really highlights that a clearer definition of what relative abundances mean in the context of this table should be discussed.
We see the Reviewer's point and agree that when we state "relative" we should specify with respect to what.
We corrected Table 1 following the Reviewer's suggestion and updated both the main text and the S1 File, Section 2 accordingly.

It is still unclear to me how MaAsLin2 is identifying absolute abundances when it models TSS normalized data. I do believe TSS normalization inherently turns the data into proportions or relative abundances. If the authors do believe that this tool is measuring absolute abundances, a longer comment on their stance would be useful.
We thank the Reviewer for highlighting this point about TSS normalisation.
MaAsLin2 accepts different kinds of data as input (count and non-count), and it supports different data normalisations and transformations. Therefore, it can be used to assess differences in relative or absolute abundances depending on the adopted combination of input data, normalisation and transformation. We clarified this point in the description of the tool in the supplementary materials (S1 File, Section 2).
We agree with the Reviewer that MaAsLin2 measures relative abundances when input data are counts and TSS normalisation is used. We updated the classification of MaAsLin2 to a relative abundance-based method when TSS normalisation is involved.
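To make the distinction discussed above concrete, the following minimal sketch (toy counts for a single sample; values are purely illustrative and not part of any tool's implementation) contrasts proportion-based relative abundance (TSS) with the centred log-ratio (CLR) transform:

```python
import numpy as np

# Toy count vector for one sample across five taxa (illustrative values only).
counts = np.array([120.0, 30.0, 5.0, 600.0, 45.0])

# TSS (total-sum scaling): each taxon as a proportion of the whole sample.
tss = counts / counts.sum()

# CLR (centred log-ratio): each taxon's log abundance relative to the
# geometric mean of all taxa in the sample. Real microbiome data contain
# zeros, so in practice tools add a pseudocount or use a Bayesian estimate
# of the underlying proportions before taking logs.
log_counts = np.log(counts)
clr = log_counts - log_counts.mean()
```

TSS values sum to 1 and are interpreted as proportions of the whole, whereas CLR values sum to 0 and express each taxon relative to the sample's geometric mean abundance, which is exactly the interpretation difference the Reviewer describes.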
In the manuscript it is concluded that the performance of tools does not depend on ecological niche, as tool performance tended to cluster independently of the non-human vs. human grouping. While I find this interesting, tool performance does differ based on the underlying datasets in some way. Is it possible to compare some characteristics of these two clusterings to try and identify what is driving performance differences? However, I do understand that this analysis may not be possible due to the smaller number of datasets included within these groupings and may be out of scope of this manuscript.
As the Reviewer rightly observed, the performance of tools does not depend on ecological niche. Indeed, we observed two groups of datasets independently of the non-human vs. human-associated split. The fact that different datasets result in different tools' performance is something that has also been observed in other benchmarking studies [1][2]. However, we found that the relative ranking seems quite stable. We agree that it would be interesting to dig deeper into the data characteristics that drive the performance differences. However, the number of datasets under consideration does not allow us to draw robust conclusions. Looking at the dataset metadata shown in the manuscript, the different grouping may depend on the sequencing platform used, since it is the only covariate along which the datasets cluster. In the literature, some studies confirm this behaviour for bacterial composition and diversity analyses [3][4]. However, the targeted region sequenced [5][6][7] and the bioinformatics pipeline used could also influence the analysis results [8][9][10].
We decided to answer this and the next comment by adding a paragraph in "Discussion" highlighting the inspection of experimental covariates as a future study (see answer to the next comment).
Finally, while the authors do a great job at addressing the limitations within their study, I believe they should include a sentence or two about the limitations of their simulations in being reflective of all 16S rRNA datasets. It is clear that, even within this manuscript, DA tools tend to perform better based on the underlying datasets that simulated data are parameterized on (as indicated by the two clusters they found). It is highly possible that further differing results between DA tools could be obtained if more datasets were explored that came from various types of sample preparations, libraries, sequencing platforms, or bioinformatic processing.
As said in the previous answer, the investigation of experimental covariates is out of the scope of our work, but it is an interesting suggestion for future work. We added the following sentences in "Discussion": Finally, as pointed out by the comparison between different microbiota niches, although the relative ranking of different DA methods seems quite stable, results might depend on the type of dataset and on characteristics arising from different types of sample preparations, libraries, sequencing platforms, and read preprocessing. Although a more extended comparison addressing experimental covariates is out of the scope of this paper, an interesting direction for future studies might be to analyse how DA methods cluster based on their performance and on the experimental characteristics of the analysed datasets.