Fig 1.
The urn represents the real microbial population, in which different bacteria are represented by spheres of different colours. The sequencing process can be described as sampling without replacement with a limited number of extractions (i.e. the sequencing depth).
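The urn analogy can be illustrated with a minimal sketch, which is not the simulation code used in the study. The taxon names, population counts and the function name `sequence_urn` are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def sequence_urn(population, depth, seed=None):
    """Draw `depth` spheres without replacement from an urn.

    `population` maps each taxon (colour) to its number of spheres;
    `depth` plays the role of the sequencing depth.
    """
    rng = random.Random(seed)
    # Lay out the urn explicitly: one entry per sphere.
    urn = [taxon for taxon, n in population.items() for _ in range(n)]
    if depth > len(urn):
        raise ValueError("depth cannot exceed the population size")
    # random.sample draws without replacement.
    return Counter(rng.sample(urn, depth))


# Hypothetical population of 1000 cells across three taxa.
population = {"red": 500, "blue": 300, "green": 200}
counts = sequence_urn(population, depth=100, seed=0)
```

The resulting counts always sum to the chosen depth, and no taxon can be observed more often than it occurs in the urn, which is what distinguishes this scheme from sampling with replacement.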
Table 1.
Summary of the methods included in this benchmark study in alphabetical order.
Table 2.
Overview of the datasets used to estimate the parameters of condition A.
Fig 2.
Distribution of mean abundance in log scale for the tooth dataset.
The dotted lines mark the abundance-level limits used to sample the DA features: low (black), medium (blue) and high (red).
Fig 3.
False Positive Rate (FPR) of each differential abundance method for each dataset considered in the comparison in the scenario without simulated DA features.
In each set of boxes corresponding to the dataset, tools are on rows, while different sample size (SS) values are on columns. The FPR values are averaged over the 50 simulations and the bars show the standard error. The ANCOM** label refers to the method run without performing the underlying FDR adjustment.
Fig 4.
False Discovery Rate (FDR) of each differential abundance method for each dataset considered in the comparison in the main scenario with simulated DA features.
In each set of boxes corresponding to the dataset, different percentages (P) of simulated DA features are on rows, while different sample size (SS) values are on columns. The FDR values are averaged over the 50 simulations and the bars show the standard error. The number of runs that provide a defined value of FDR is shown at the beginning of the bars.
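As a rough sketch of how such per-run metrics could be summarised, the example below assumes the standard definitions FDR = FP / (TP + FP) and Recall = TP / (TP + FN). FDR is left undefined when a run reports no discoveries, and only defined runs enter the mean and standard error. The function names and the toy runs are hypothetical and are not taken from the study.

```python
import math
import statistics


def fdr_and_recall(called, truth):
    """Return (FDR, recall) for one run; None marks an undefined value."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)
    fp = len(called - truth)
    fdr = fp / len(called) if called else None  # no discoveries: undefined
    recall = tp / len(truth) if truth else None
    return fdr, recall


def mean_and_se(values):
    """Mean, standard error and number of defined values."""
    defined = [v for v in values if v is not None]
    n = len(defined)
    if n == 0:
        return None, None, 0
    mean = statistics.fmean(defined)
    se = statistics.stdev(defined) / math.sqrt(n) if n > 1 else None
    return mean, se, n


# Three hypothetical runs: (features called DA, truly DA features).
runs = [({"a", "b"}, {"a", "c"}), (set(), {"a"}), ({"a"}, {"a"})]
fdrs = [fdr_and_recall(called, truth)[0] for called, truth in runs]
mean_fdr, se_fdr, n_defined = mean_and_se(fdrs)
```

Here the second run makes no calls, so its FDR is undefined and `n_defined` is 2 rather than 3.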
Fig 5.
Mean Recall (on y axis) and FDR (on x axis) of each differential abundance method for each dataset considered in the comparison in the main scenario with simulated DA features.
In each set of boxes corresponding to the dataset, different percentages (P) of simulated DA features are on rows, while different sample size (SS) values are on columns. The recall values are averaged over the 50 simulations and the bars show the standard error.
Fig 6.
Area Under Precision-Recall curves (AUPR) of each differential abundance method for each dataset considered in the comparison in the main scenario with simulated DA features.
In each set of boxes corresponding to the dataset, different percentages (P) of simulated DA features are on rows, while different sample size (SS) values are on columns. The AUPR values are averaged over the 50 simulations and the bars show the standard error.
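One common estimator of the area under the precision-recall curve is average precision, sketched below. This is an assumption made for illustration: the study may compute AUPR differently (for example by trapezoidal integration), and the function name, scores and labels are hypothetical. Ties between scores are not handled specially.

```python
def average_precision(scores, is_da):
    """Average precision of a ranking; higher score = stronger DA evidence."""
    n_pos = sum(is_da)
    if n_pos == 0:
        return None  # undefined when there are no true DA features
    # Rank features from strongest to weakest evidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if is_da[i]:
            tp += 1
            total += tp / rank  # precision at each true positive
    return total / n_pos


# Hypothetical scores (e.g. 1 - p-value) and ground-truth DA labels.
scores = [0.9, 0.8, 0.7, 0.6]
is_da = [True, False, True, False]
ap = average_precision(scores, is_da)
```

In this toy ranking the true DA features sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2.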
Fig 7.
Recall of each differential abundance method for each dataset considered in the comparison in simulations with reduced variability.
In each set of boxes corresponding to the dataset, different percentages (P) of simulated DA features are on rows, while different sample size (SS) values are on columns. The recall values are averaged over the 50 simulations and the bars show the standard error.
Fig 8.
FDR of each differential abundance method for each dataset considered in the comparison in simulations with reduced variability.
In each set of boxes corresponding to the dataset, different percentages (P) of simulated DA features are on rows, while different sample size (SS) values are on columns. The FDR values are averaged over the 50 simulations and the bars show the standard error. The difference in the mean FDR between the scenario with reduced variability and the main scenario with simulated DA features is shown at the beginning of the bars.
Fig 9.
Recall of each differential abundance method for each dataset considered in the comparison in the scenario with simulated DA features and θ = 0.
In each set of boxes corresponding to the dataset, different percentages (P) of simulated DA features are on rows, while different sample size (SS) values are on columns. The recall values are averaged over the 50 simulations and the bars show the standard error.
Fig 10.
Mean FDR difference [%] between each differential abundance method and its GMPR normalised version for the tooth dataset in the scenario with simulated DA features.
Different percentages (P) of simulated DA features are on rows, while different sample size (SS) values are on columns. Numbers at the beginning of each row correspond to the FDR values obtained with default normalisation, while the symbol (*) indicates that the Wilcoxon unpaired statistical test is significant.
Fig 11.
Mean Recall difference [%] between each differential abundance method and its GMPR normalised version for the tooth dataset in the scenario with simulated DA features.
Different percentages (P) of simulated DA features are on rows, while different sample size (SS) values are on columns. Numbers at the beginning of each row correspond to the Recall values obtained with default normalisation, while the symbol (*) indicates that the Wilcoxon paired statistical test is significant.
Fig 12.
Overall performance of each DA method.
In each set of boxes corresponding to different sample size (SS) values, Precision, NA_perc (percentage of available precision), Recall and pAUPR scores are shown for each dataset in columns. Methods (on rows) are ranked based on Precision values across all the SS scenarios and then based on Recall in case of ties. The legend below the boxes explains the threshold used to assign the overall score for each metric.