The accuracy of absolute differential abundance analysis from relative count data

Concerns have been raised about the use of relative abundance data derived from next generation sequencing as a proxy for absolute abundances. For example, in the differential abundance setting, compositional effects in relative abundance data may give rise to spurious differences (false positives) when considered from the absolute perspective. In practice however, relative abundances are often transformed by renormalization strategies intended to compensate for these effects and the scope of the practical problem remains unclear. We used simulated data to explore the consistency of differential abundance calling on renormalized relative abundances versus absolute abundances and find that, while overall consistency is high, with a median sensitivity (true positive rates) of 0.91 and specificity (1—false positive rates) of 0.89, consistency can be much lower where there is widespread change in the abundance of features across conditions. We confirm these findings on a large number of real data sets drawn from 16S metabarcoding, expression array, bulk RNA-seq, and single-cell RNA-seq experiments, where data sets with the greatest change between experimental conditions are also those with the highest false positive rates. Finally, we evaluate the predictive utility of summary features of relative abundance data themselves. Estimates of sparsity and the prevalence of feature-level change in relative abundance data give reasonable predictions of discrepancy in differential abundance calling in simulated data and can provide useful bounds for worst-case outcomes in real data.


Introduction
Warnings about the consequences of compositional effects in sequence count data have been published repeatedly in the decades since the technology's advent and its application to a host of biological problems. The issue relates to a loss of scale information during sample processing, which renders counts of genes, transcripts, or bacterial species as relative abundances. The technical basis for this belief is summarized in Box 1. No consensus solution for this problem exists. In this work, we use prominent differential abundance testing methods on simulated and real data to quantify the discrepancy between differential abundance estimates made on relative versus "absolute" abundances. Our simulations show that methods which heuristically rescale sample abundances are often reasonably consistent in their estimates of change across relative and absolute counts but that, in a sizeable number of cases characterized by substantial change across simulated contiditions, most or all methods differential abundance testing methods evaluated yielded inaccurate estimates. Further, we show that data sets which are Box 1: Measuring relative abundances Sequence counting has become widespread as a means of census-taking in microscopic biological systems. Genomic material, typically RNA, is captured and quantified at the component level. Sampled cells are lysed, messenger RNA is captured and fragmented, transcribed into cDNA, sequenced, classified, and quantified. The results are relative abundances of gene products in the cell (in the case of single-cell RNA-seq) or tissue (in bulk RNA-seq). In another instance, whole bacterial communities are profiled by barcoding of the 16S subunit of the ribosome. Ribosomal RNA associated with this piece of translation machinery is ubiquitously present across the bacterial kingdom but variations in the genetic sequence of this component can uniquely identify bacteria to the species or strain level in well-characterized systems, allowing a researcher to profile bacterial community composition. Absent measurements of microbial load or transcriptome size, however, the observed sequence counts in all these cases represent relative abundances.
Sequence count data is compositional due to steps in sample processing. Across domains, samples are typically normalized to some optimal total amount of genetic material prior to sequencing in accordance with manufacturer recommendations for best performance. This step removes variation in total abundance across samples. especially susceptible to distortion by compositional effects can sometimes be predicted on the basis of "signatures" of this distortion.

Compositionality in sequence count data
Compositionality refers to the nature of sequence count data as containing relative abundance information only. In the differential abundance setting, several authors [1][2][3] have described the problem this poses: whereas researchers would like to interpret change in absolute abundances, compositional effects mean using change in relative abundances as a proxy can lead to false discoveries. A few authors have cited instances of these false discoveries in real data. Coate and Doyle [4,5] discussed the issue of transcriptome size variation in plants and other systems and the impact of this on accurate transcriptome profiling. Nie et al and Lin et al [6,7] documented the phenomenon of widespread "transcription amplification" by the transcription factor c-Myc and Lovén et al [8] used c-Myc data and parallel RNA quantification assays to show that substantial differences in total abundance between control and elevated c-Myc conditions resulted in very different interpretations of apparent differential expression. Common to these studies of transcriptomes is a recommendation that, where feasible, researchers leverage RNA spike-ins as controls against which changes in observed abundance can be scaled [9,10]. But this practice has fallen short of widespread adoption. While several papers have expressed confidence in the utility of spike-ins [11][12][13][14][15], the doubt cast by reports of widespread batch effects [16] and technical noise [17] have had the effect of reducing researcher confidence in their use. Further, the introduction of spike-ins is not practical on all platforms.
Where approaches that rely on spike-ins are undesirable or infeasible, sample rescaling procedures have proliferated. These methods typically assume the existence of a stable set of features and attempt to normalize compositions in such a way as to recover this stable set across samples. In fact, in transcriptomics, these methods predominate.
In the setting of microbial community profiling, the prevailing assumption is that typical compositions are too simple for rescaling methods to work well (although results in benchmarking studies have been mixed [19,20]). Competing approaches have been developed for dealing with compositionality in microbial sequence count data. Quantitative microbiome profiling [21] and similar approaches combine relative abundances with complementary measurements of microbial load to reconstruct absolute abundances. In contrast, so-called compositional methods are also utilized. These involve log relative representations which can give approximate log-normality, such that workhorse statistical methods for continuous data may Saturation of sequencing has been cited [18] as another mechanism by which abundances are rendered relative: a finite amount of reagent means there is an upper limit on biological material which can be captured; rare components can be forced out by a "competition" to be sampled. These factors withstanding, observed total abundances would likely still be noisy. Repeated subsampling of small amounts of material and variation in the efficiency of library preparation steps can distort observed totals.
In transcriptomics and in microbial community profiling, residual variation in observed total abundances across samples is generally regarded as a source of technical bias and most analytics pipelines involve steps to rescale observed abundances. The simplest of these is the counts per million (CPM) transformation which converts observed counts to relative abundances, then scales by 1 million. be applied. However, interpretation of these quantities can be challenging (e.g. as with the isometric logratio [22]).
Though there is evidence from simulated and real data that scale-i.e. increasing complexity of composition in terms of numbers of genes, transcripts, or bacterial sequence variantsmitigates the problem of compositionality [2,20], it remains unclear whether there are instances where it is reasonable to substitute relative abundances for absolute abundances and several fields could benefit from clarity on this issue. In this work, we quantify the discrepancy in differential abundance calling on simulated and real data sets representative of 16S metabarcoding, bulk RNA-seq, and single-cell RNA-seq experiments.
In particular, we evaluate a set of methods which rescale observed total abundances and show that all such rescaling strategies outperform a simple library size normalization in our simulated data. The methods selected are a diverse set meant to be representative but not exhaustive and draw from the fields of transcriptomic and metagenomic analysis: ALDEx2 [23], ANCOM-BC [24], DESeq2 [25], edgeR [26], and scran [27]. Each of these methods rescales the observed counts against a reference quantity, implicit or explicit-typically, a subset of putatively stable features. Where such a stable reference exists, differences in these rescaled counts should approximate differences in absolute counts. In all cases, baseline or "true" differential abundance for each feature in the absolute count data was determined by the use of a simple generalized linear model. This yielded a common standard against which to compare differential abundance calls made by ALDEx2, ANCOM-BC, DESeq2, edgeR, and scran. Details on the simulation procedure and these differential abundance calling algorithms are given in Methods.
In simulations exploring a broad range of differential abundance scenarios, we find median false positive rates of differential abundance calls made on absolute versus relative counts are 11.5% and that false positive counts increase with the proportion of differentially abundant features. The number of features (e.g. genes or bacterial sequence variants) plays little role in observed outcomes. An exploration of sequencing data collected from twelve external studies reveals similar trends. Further, we show that summaries of sparsity and the prevalence of apparent feature-level change can provide reasonable predictions of the amount of discrepancy in differential abundance calls in simulated data and may be useful for bounding expectations of discrepancy in real data sets.

Results
We simulated differentially abundant count data in paired sets of absolute and relative abundances. In the absolute abundances, differentially abundant features experienced either an increase or decrease in abundance-often large-between each of two simulated conditions. Large numbers of differentially abundant features frequently had the effect of changing the overall total abundance, potentially resulting in several-fold changes in scale between conditions. In each simulated data set a subset of randomly selected features received a random increase or decrease in average abundance between simulated conditions, contributing to an overall increase or decrease in total abundance which might be large or small. These changes in total abundance were purposely removed by a fixed-depth resampling to give a set of relative abundances. The proportion of differentially abundant features simulated and other key characteristics of the simulated data are summarized in S1 and S2 Figs. Compositional effects might be present in this relative count data as relative change exhibited by non-differentially abundant features.
We explored ranges of feature number and in the amount of differential abundance, grouping simulations into three partially overlapping settings: a Microbial setting, characterized by low feature number and high differential abundance; a Bulk Transcriptomic setting having high feature number and low differential abundance; and an intermediate Cell Transcriptomic setting. The full results from 5625 simulations are summarized in S1 Table and visualized in Fig 1, where agreement between differential abundance calls on absolute and relative abundances for each data set are summarized by means of sensitivity and specificity statistics, as described below. The same results are labeled in terms of increasing simulated fold change in We report outcomes in terms of sensitivity (true positive rate) and specificity (1-false positive rate). Perfect concordance of differential abundance calls made on observed versus absolute counts would yield a sensitivity of 1.0 and a specificity of 1.0. Sensitivity drops as more features deemed differentially abundant from the perspective of the absolute counts fail to appear significantly different in the observed data and specificity drops as an increasing number of features appear differentially abundant in the observed counts alone. We highlight key observations made on simulated data below.

Simulated data
All rescaling methods have higher accuracy than edgeR with library size normalization. Library size-or total count-normalization represented an anticipated worst case for accuracy in differential abundance testing. In order to test this, we employed edgeR, which allows for either total count or trimmed mean of M-values (TMM) normalization [26]. The difference in accuracy outcomes with and without TMM normalization was substantial. Median specificity was only 0.35 when the observed per-sample totals were normalized by total counts, versus 0.88 when using TMM normalization on the same collection of simulated data sets. False positive counts were over four times higher in the absence of TMM normalization, demonstrating that, for edgeR, a very widely used differential abundance calling method, accuracy is greatly improved by this heuristic per-sample rescaling. These results are illustrated in Fig 2. Moderate specificity was typical but cases of low specificity were observed in all settings for all methods. Median specificity was moderately high at 0.885-lowest for ANCOM-BC in the Cell Transcriptomic setting at 0.835 and highest for ALDEx2 in the same setting at 0.937 (see Table 1). However, a minority of simulated data sets yielding very large false positive rates were observed in every setting. Data sets with very low specificity (< 0.5) made up less than 10% of simulated cases in the Transcriptomic settings and more than 14% in the Microbial setting.
Increasing feature number improves specificity but the effect is modest. Median specificity improved slightly for ALDEx2 and scran in the highest-feature Bulk Transcriptomic setting relative to the lowest-feature Microbial setting. The change in false positive counts in particular is visualized in Fig 2. For all other methods, accuracy remained similar across settings.
Though scran was the top performer, outcomes were similar across methods. Scran had the highest overall accuracy with a median sensitivity of 0.93 and specificity of 0.91. Sensitivity was lower in ALDEx2 (median = 0.78) and specificity was lower in DESeq2 (median = 0.86). All in all, methods exhibited similar performance in terms of their distributions of sensitivity and specificity across settings. See Figs 1 and 2 and Tables 1 and 2. Next, we explored the characteristics of relative abundances associated-either positively or negatively-with observed sensitivities and specificities in simulated data. It is possible to imagine characteristics which might indicate the presence of distortion by compositional effects, for example, an increase in the percent of rare features from one condition to the next  (in effect, dropouts). While we might not expect any single such characteristic to predict the sensitivity or specificity of calls made on relative versus absolute abundances, composites of such characteristics might be informative. We describe the strongest associations we observed below.
Sensitivity is anti-correlated with estimates of sparsity. Large proportions of zero-and one-counts in the simulated relative abundances correlated with low sensitivity (Spearman's correlation between proportion zeros and sensitivity, ρ = −0.58). In effect, this echoes similar findings that decreasing sequencing depth decreases power in genomics studies [14] and reflects a lesser expected statistical confidence in the observed change of low-count features. Table 1. Median and thresholded specificity by simulated setting, FDR � 0.05. Low specificity is a common outcome in simulated data. A majority of simulated data sets have false positive rates in excess of 5%. A minority of data sets have false positive rates in excess of 50%.

Setting
Method Median specificity Percent of data sets below 95% specificity Percent of data sets below 50% specificity Interestingly, in our simulations, evidence of extremes in terms of the apparent correlation of relative abundances (in fact, the skew of that distribution of correlations) was also inversely associated with sensitivity (ρ = −0.50). Specificity is strongly anti-correlated with the estimated proportion of differential features. The proportions of features undergoing large fold decreases or large fold increases in abundance relative to the mean were highly informative with respect to specificity (ρ = −0.61 and ρ = −0.43 respectively). The standard deviation of the change in log counts between simulated conditions was also anti-correlated with specificity (ρ = −0.56). In each case, these characteristics supplied evidence of the existence (or lack) of widespread change in composition. Methods which rescale observed abundances rely on a pool of stable reference features against which to estimate per-sample "scaling factors." The larger the number of components in the composition apparently changing, the greater the extent to which this rescaling is impaired.
We utilized these and over 50 additional features derived from relative abundance data to train per-method models of both sensitivity and specificity, with the aim of predicting the discrepancy in relative versus absolute differential abundance calls from features of observed data alone, reasoning that these "signatures" might be highly informative. All model features are outlined in S2 Table and feature importance is explored in S3 and S4 Tables. Predictive models were trained on 80% of our simulated data and their performance was evaluated on the heldout 20% of simulated data sets. Predictions of specificity were more accurate than predictions of sensitivity (mean specificity R 2 = 0.70; mean sensitivity R 2 = 0.59). Prediction of outcomes was most successful for ALDEx2, where the R 2 values for sensitivities and specificities on heldout data were 0.74 and 0.72, respectively. Full results are given in Table 3. These results suggested it might be reasonable to predict discrepancy between differential abundance calls made on absolute versus relative abundances using characteristics of the relative abundance data alone. However, the study of some representative real data sets will suggest prediction may be much more difficult in this setting.

Real data
Next, we examined a variety of real data sets across many experimental settings in order to explore outcomes in real data. We collected publicly available data from twelve studies [28][29][30][31][32][33][34][35][36][37][38][39] and attempted to reconstruct absolute abundances by normalizing observed total sample abundances against reference quantities provided in the same published materials. In most cases, these reference quantities were external RNA spike-in sequences. In others, reconstructed absolute abundances had already been estimated, as in [28] through quantitative microbiome profiling (QMP [21]). In one case [38], we normalized against the housekeeping gene gapdh [40] and in another [33], against paired measurements of cell mass. Visual and textual summaries of these data sets are available in Figs 3 and 4 and Table 4. While not exhaustive, these studies illustrate a wide range of experimental conditions. We acknowledge the  difficulty in reconstructing absolute abundances and caution that these estimates are likely noisy. Rather than call them absolute abundances, we will adopt the helpful terminology of [9] and call these reconstructions "nominal abundances" to distinguish them from some theoretical ground truth. While these quantities are an abstraction, since they derive from real experiments, we believe they capture some of the real variation in composition and scale we could expect to see in data from typical experiments and should serve to ground expectations. As with simulated data, we proceed by comparing differential abundances called on the relative count data (via ALDEx2, ANCOM-BC, DESeq2, edgeR, or scran) with differential abundances called on the nominal abundances-our proxy for "true" differential abundance.
We have endeavored to include among the real data sets some of the most challenging possible cases for differential abundance calling from relative count data. Many of the data sets we have selected involve a large amount of absolute and/or compositional change across experimental or biological conditions. These data sets help us answer the questions: What is the scale of discrepancy in the worst cases? How useful are summaries derived from relative abundances in these cases? Could a researcher reasonably predict error from relative abundance data alone?
Accuracy outcomes for these real data sets are given in Figs 5 and 6 and S5 and S6 Tables. These outcomes can be broadly classified on the basis of sensitivity into low, moderate, and high sensitivity cases, described below.
Low sensitivity, high specificity cases. Three data sets exhibited a combination of very low sensitivity (median = 0.06) and high specificity (median = 0.99). These were the data of    (Figs 5 and 6). These studies consisted of two transcriptomic experiments-in human and mouse cells-and a single gut microbial data set. The range of differentially abundant features, as estimated from nominal count data, spanned 32% to 74% in these studies (see Table 4). This differential abundance was often small in absolute terms and demonstrably difficult to detect relative to the scale of sample-sample variation. Features which were significantly differentially abundant in the nominal count data often fell below the threshold of significance in the observed counts. An representative example of this from Kimmerling  Sensitivity prediction was poor again. Only 24% of observed sensitivities were within the 50% interval of prediction on sensitivity but 79% of specificities were within the same 50% predictive interval.
High sensitivity, low specificity cases. A final group contained the data sets of Owens et al. and Klein et al., each of which had an overwhelming majority of differentially abundant features, as estimated from differences in nominal counts. These majorities of differentially abundant features contributed to similarly large, several-fold change in total abundance across conditions for each set of nominal counts. Experimentally, both involved studies of tissue differentiation-one in zebrafish, one in mouse. Though sensitivity was at its highest in this data, specificity was overall very low. We note that this is at least partially attributable to a reduction in the number of stably abundant features in these data sets, as false positive rates are a function of the ratio of "positive" and "negative" calls. Accuracy of prediction was lower for sensitivity than for specificity in these data sets. The observed, high sensitivity fell within the 50% predictive interval in 90% of cases. However, none of the per-method specificity predictions for the data of Owens et al. or Klein et al. were within the 50% predictive interval.
Summarizing results for all twelve data sets, we note that sensitivity was greatly reduced in these real data sets relative to simulation and was more difficult to predict as well. Observed sensitivities were within the 50% interval of model-prediction only 29% of the time overall, versus 71% for specificity. In this small sample, sensitivity and specificity constitute a tradeoff with an association coefficient of -0.54 (p < 0.0004). The lowest specificities (or highest false positive rates) are associated with those data sets encoding the largest amount of change, both in terms of the number of changing features and the scale of the resulting change in total abundance. However, as in   Table 4) can yield very different outcomes on the basis of differences in sample-sample variation.

Mitigation strategies
We pursued two mitigation strategies which we believed could improve the accuracy of differential abundance estimates from relative counts: 1) increased stringency of differential abundance testing and 2) normalization against stable reference features with DESeq2.
In the first case we both increased the significance threshold for differentially abundant features (FDR � 0.01) and enforced a fold change (effect size) threshold for differential features. In essence, we now ask about the agreement between absolute and rescaled relative data in terms of big, unambiguous differences. In this setting, median specificity was improved from 0.90 to 0.95 and from 0.63 to 0.85 in the subset of data sets with a majority of differentially abundant features. For full results, see Table 2 and compare Figs 1 and 2 with S7 and S8 Figs respectively.
In a second analysis, we briefly explored the effect of normalization by stable reference features. DESeq2 allows a user to specify a set of reference features against which samples will be normalized using a "control_genes" option. We utilized this feature in three settings: 1) normalizing simulated data sets against low-variance features identified in absolute counts, 2) normalizing real data sets against low-variance features identified in absolute counts, and 3) normalizing real data against a set of 13 "housekeeping genes" pulled from the literature including gapdh and hprt1 [40][41][42][43].
Results derived from low-variance gene normalization show mostly modest improvements in specificity in both simulated and real data (S7 and S8 Tables; S9 Fig). For example, median specificity was increased by over 15% in data sets with an initial "high" FPR of greater than 10%. This improvement was progressive: real data sets with the worst specificities showed the greatest improvement, from 0.653 to 0.902 (Song et Table). The housekeeping gene set was less effective and even increased discrepancy in two data sets, suggesting that, although normalization by control features can improve accuracy substantially, caution should be employed here: strong biological prior knowledge or independent validating measurements may be required if researchers wish to propose stable genes for sample normalization.

Discussion
While the potential for compositional effects to drive differential abundance has repeatedly been described in the literature, uncertainty remains about the scope of this problem. Previous studies have shown that rates of false positive differential abundance calls can be high in certain settings [2,19,20]. Our results indicate that the problem is at least partially a function of the amount of change in the system under study and that differential abundance estimates from experiments characterizing extreme change across observed conditions are likely to be distorted. Both simulated and published experimental data contributed to this picture. Data sets with low fold change across conditions and a minority of differentially abundant features had high specificities (low false positive rates) and low false positive counts.
We found prediction of outcomes was possible in a limited sense. In simulated data, especially discrepant outcomes were roughly predictable by a few summary statistics derived from their relative abundances-low sensitivity from features which captured information about sparsity in the data set, and low specificity from estimates of the number of differentially abundant features.
Accuracy outcomes were less predictable in real data where poor predictive performance generally took the form of overprediction. Extreme cases were more difficult to predict: where sensitivity was extremely low, per-method models tended to overpredict it and where there was a tremendous number of differentially abundant features, specificity was overpredicted.
Some simple strategies offer avenues for mitigating the discrepancy between absolute and relative estimates of differential abundance. In our analysis, increasing the stringency of differential abundance calling through stricter significance and effect size thresholds improved accuracy significantly. That is, absolute and relative data is more often in agreement about features exhibiting large, unambiguous changes across conditions. Secondly, we observe that normalization by known stable features modestly improves accuracy when performed by DESeq2. Where these stable features are known or can be experimentally discovered, analyses would likely benefit from their use. Similarly, the use of external reference features like spike-ins has been validated in plate-based single-cell assays [13]. Though their application remains challenging on droplet-based platforms, they may still offer the best experimental means of reconciling the relative and absolute perspectives of change.
Lastly, it should be noted that the concerns motivating this and similar studies may be moot for some types of sequence count data. In particular, some single-cell platforms generate library sizes (i.e. total per-sample observed abundances) which are already roughly proportional to absolute abundances. This is in line a view that abundances in deeply sequenced UMI-barcoded single cells are likely to be a good proxy for absolute abundances [12,14]. The effect of compositionality may be a minor concern under these circumstances.

Materials & methods
We simulated general purpose molecular count data. These counts are interpretable as a variety of biological quantities, for example, transcript abundance in a cell or bacterial strain abundance in a microbial community. The simulated abundances undergo a sampling step intended to loosely replicate the process of measurement itself and, crucially, the normalization of total abundance across samples, giving a second set of count data. We refer to the first set of count data as "absolute" counts and the second, resampled set as relative or observed counts and explored the degree to which this loss of information about changes in total abundance alters the results of a mock differential abundance analysis by simulating a wide range of settings in our data, where key characteristics like complexity of composition (e.g. gene number) and fold change across simulated conditions varied widely.

Simulation model
We designed a simulation framework to generate count data corresponding to two arbitrarily different conditions, denoted by superscripts in the equations below. First, for p = 1, . . ., P features in the first condition, a set of log mean abundances was drawn as where hyperparameters m and S tune the mean and standard deviation of baseline log abundances. A correlation matrix was drawn as where scale matrix Q was supplied as either the identity matrix (for a minority of simulations) or a dense correlation matrix with net positive elements. The matrix O is subsequently rescaled to a correlation matrix and used to draw correlated feature perturbations in a second condition as where the hyperparameter a exists in order to tune the overall scale of the correlated log perturbations. Mean abundances on the scale of sequence counts for each condition are calculated as Differentially abundant features in some desired proportion, c, are obtained as follows: features are selected as differentially abundant with probability c. For those selected features only, the perturbed g ð2Þ p serves as the mean abundance in the second condition; for all other features, the mean abundance in both the first and second conditions is given by g ð1Þ p . Let these new vectors be m ð1Þ p ; m ð2Þ p . These represent mean the abundances of P features in two conditions, some of which differ across conditions, others of which are identical. Replicates i = 1, . . ., 10 are then generated for each condition as follows. A fixed dispersion parameter for absolute counts is defined as d abs = 1000 and those counts are drawn as where d � maxð0:1; Nð1; gÞÞ ð6Þ (Note that the dispersion parameter has been chosen such that the resulting counts are only barely overdispersed with respect to a Poisson.) The purpose of the truncated, per-sample multiplier δ is to re-scale all abundances within a given sample by some factor of approximately 1 but by increasing the scale of hyperparameter g, increasing replicate noise can be added within a condition. This process is repeated for the second condition to give a set of absolute counts y ð2Þ p .
A new average observed total abundance (or library size) is randomly drawn as Finally, observed abundances z are generated through a multinomial resampling procedure similar to that of [10,20,23] which gives relative abundances as counts per million. Where i and k index different samples prior to resampling, i 0 indexes the sample i after resampling, and total counts for sample i prior to resampling are given by n i = ∑ y i , we have where superscripts have been suppressed as this procedure is identical across simulated "conditions." The resulting P-length vector of counts for a given sample contains relative but not absolute abundance information. These vectors are collapsed into a P × 20 count matrix containing 10 replicate samples for each of two simulated conditions. In order to evaluate the discrepancy of differential abundance calling on observed versus absolute counts, we apply differential abundance methods to count matrices Z and Y respectively and score the differences.

Breadth of simulations
In order to generate simulations with a wide variety of characteristics, we swept in a grid over all our hyperparameters. Feature number P was stepped through values 100, 1000, and 5000. A maximum feature number of 5000 was chosen as these simulations were computationally intensive and major trends had become apparent at that scale. The degree of feature correlation was encoded in five realizations of scale matrices Q, encoding fully independent features at one extreme and 50% strongly positively correlated features at the other extreme. Log mean abundance (m) and the log variance (S) were independently incremented through low to high values. Likewise, the average log perturbation size (a) was swept from low to high in five steps, as a proportion of log mean abundance. Replicate noise g varied from low to high in three steps. And finally, the proportion of differentially abundant features ranged across 20%, 30%, 50%, 70%, and 90%. Note that because many "perturbations" were very small, detectable differential abundance was generally only a fraction of the parameterized amount and most data sets contain a minority of differentially abundant features. Overall this 5625 simulated data sets were generated with almost continuous variation characteristics of interest.
We suggest that ranges of these parameter settings approximately represent different data modalities. We term the Microbial setting that with low to moderate feature number (P � 1000) and largest average perturbation, in accordance with a belief that bacterial communities are often simple (in terms of sequence variants with more than negligible abundance) and that they are highly variable even at short time scales [44].
We designate the Bulk Transcriptomic setting as that with the largest feature number (P = 5000) and having a lower average perturbation, the rationale being that transcriptomes sampled in aggregate over many cells are complex but largely stable compositions. Similarly, we define the intermediate Cell Transcriptomic setting, approximately representative of single-cell RNA-seq data, to comprise simulations with moderate to large feature numbers (P � 1000) and moderate perturbation sizes. These categories are intended as rough outlines and we note that within these settings the realized data varies in terms of 1) degree of feature correlation, 2) overall abundance, 3) (un)evenness of composition, and 4) within-condition variation in totals.

Calling differential abundance
Five differential abundance calling methods were used in this study. These were ALDEx2 v1.22.0 [23], ANCOM-BC v1.0.5 [24], DESeq2 v1.30.1 [25], edgeR v3.32.1 [26], and scran v1.18.5 [27]. Each of these methods relies upon the use of a reference quantity to renomalize sample total abundances. ALDEx2, DESeq2, and edgeR rely on various (trimmed) estimates of the per-sample mean or median abundance for this reference quantity. ANCOM-BC uses a model-derived estimate of the "sampling fraction" of total abundance represented in an observed sample. scran's procedure learns and applies a condition-or cell type-specific normalization factor. For simplicity, we omit from consideration models which lack a rescaling. We also omit zero-inflation models. Although these are popular in single-cell mRNA sequencing data, the debate continues about whether these models are appropriate for these data [45,46].
Finally, differential abundance calls made on observed counts must be evaluated against a reference in order to calculate discrepancy. For this reference, we used calls made by a negative binomial generalized linear model on absolute abundances as a pseudo-gold standard or "oracle" in all cases. Features with significantly different means according to the GLM were considered "true" instances of differential abundance. One caveat is that some of the discrepancy in calls between absolute and relative abundances will be due to differences in sensitivity between the models applied to the reference and "observed" data sets-i.e. between a stock NB GLM and DESeq2. In all settings, true positive, true negative, false positive, and false negative calls were manually spot checked to verify disagreements between methods were generally unambiguous. Accuracy (in terms of sensitivity and specificity) was calculated between differential abundance calls made on absolute abundances and observed abundances. Unadjusted p-values were collected from all methods and multiple test correction applied via p.adjust in R using Benjamini-Hochberg method.

Predictive modeling
In total, we trained 10 random forest models over 61 summary features, for each combination differential abundance calling method (ALDEx2, ANCOM-BC, DESeq2, edgeR, or scran) and accuracy measure (sensitivity or specificity). All such predictive models were fit with the ran-domForest package in R [47]. A random forest is an ensemble of decision trees and this treebased approach was chosen because, while extensive, our simulations were not exhaustive. We anticipated that learning sets of decision rules might generalize well to unseen conditions, in particular feature numbers larger than those we explored in simulation. Feature importance was measured as "gain," or the relative increase in predictive accuracy achieved by the inclusion in the model of a given feature, as computed by the caret package in R [48].
Predictive models built from these summary features attempted to estimate sensitivity and specificity values explicitly. Details on these models are given in the Methods. All models were trained on 80% of the simulated data and their predictive accuracy was assessed on the reserved 20%.

Real data processing
Publicly available data was downloaded from sources provided in the published materials for the studies cited. In general, relative abundances were converted to counts per million and sequences for external spike-ins extracted from these counts. Simple per-sample scaling factors were calculated from mean spike-in abundances and applied to the relative data to give "nominal" abundances.
In the inhibited and control cells of Klein et al., the expression of housekeeping gene Gapdh was used to normalize per-sample total abundance. In the 16S metagenomic data of Barlow et al. and Vieira-Silva et al., nominal abundances had already been estimated by the authors of those studies using quantitative PCR-based methods. In the single-cell expression data of Yu et al., log 2 observed total abundances correlated well with log 2 spike-in abundances and the observed data themselves were treated as true abundances. Relative abundances were derived by normalizing per-sample library sizes and scaling to give counts per million. Finally, in the coupled cell mass and expression measurements of Kimmerling et al., the estimated cell mass was used as scaling factor for observed expression to give nominal abundances.
Nominal and relative abundances were then filtered to exclude features (genes or bacterial sequence variants) present at an average abundance of less than a single count in either the nominal or relative count tables. This both reduced the size of the largest data sets-making them more computationally manageable-and reduced sparsity in the most extreme cases.
Supporting information S1 Text. Additional details on simulated data sets and mitigation strategies. This file contains supplemental methods and results detailing the characteristics of simulated data sets, describes variable importance determination in predictive models, and gives further details on the effects of mitigation strategies like control genes. (PDF) S1  Table. Real data sensitivity with control features. DESeq2 sensitivity when using a variety of features as controls (via the control_genes argument). Sensitivity is improved when using a random set of low-variance features as references against which to rescale observed abundances. A selection of "housekeeping" genes used for the same purpose gives less improvement over baseline. (PDF) S8 Table. Real data specificity with control features. DESeq2 specificity when using a variety of features as controls (via the control_genes argument). Specificity is improved when using a random set of low-variance features as references against which to rescale observed abundances. A selection of "housekeeping" genes used for the same purpose gives less improvement or worsens specificity in the data sets of Monaco