CBEA: Competitive balances for taxonomic enrichment analysis

Research in human-associated microbiomes often involves the analysis of taxonomic count tables generated via high-throughput sequencing. Applying standard statistical tools to these tables is difficult because the data are high-dimensional, sparse, and compositional. An approachable way to alleviate high dimensionality and sparsity is to aggregate variables into pre-defined sets. Set-based analysis is ubiquitous in the genomics literature and demonstrably improves the interpretability and power of downstream analyses. Unfortunately, there is a lack of sophisticated set-based analysis methods specific to microbiome taxonomic data, where current practice often employs abundance summation as a technique for aggregation. This approach prevents comparison across sets of different sizes, does not preserve inter-sample distances, and amplifies protocol bias. Here, we attempt to fill this gap with a new single-sample taxon enrichment method that uses a novel log-ratio formulation based on the competitive null hypothesis commonly used in the enrichment analysis literature. Our approach, termed competitive balances for taxonomic enrichment analysis (CBEA), generates sample-specific enrichment scores as the scaled log-ratio of the subcomposition defined by taxa within a set and the subcomposition defined by its complement. We provide sample-level significance testing by estimating an empirical null distribution of our test statistic, yielding valid p-values. Using both real data applications and simulations, we demonstrate that CBEA controls type I error even under high sparsity and high inter-taxa correlation. Additionally, CBEA provides informative scores that can serve as inputs to downstream analyses such as prediction tasks.
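The scaled log-ratio at the heart of the method can be sketched in a few lines. This is an illustrative simplification, not the CBEA package's implementation: the function name is ours, and we assume zeros have already been replaced by a pseudocount upstream so that logarithms are defined.

```python
import numpy as np

def cbea_raw_score(props, set_mask):
    """Unadjusted score for one sample: the scaled log-ratio (balance)
    between the subcomposition of taxa in a set and the subcomposition
    of its complement.

    props    : 1-D array of relative abundances (all strictly positive)
    set_mask : boolean array marking taxa that belong to the set
    """
    set_mask = np.asarray(set_mask, dtype=bool)
    p, q = set_mask.sum(), (~set_mask).sum()   # set and complement sizes
    scale = np.sqrt(p * q / (p + q))           # ILR-style normalizing constant
    log_gmean_in = np.mean(np.log(props[set_mask]))    # log geometric mean inside the set
    log_gmean_out = np.mean(np.log(props[~set_mask]))  # log geometric mean of the complement
    return scale * (log_gmean_in - log_gmean_out)
```

When set and complement taxa share the same geometric mean the score is zero; positive scores indicate relative enrichment of the set within that sample.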


4) We reorganized the results section of the manuscript and added further analyses to align our evaluation strategy with existing standards for benchmarking gene set testing methods (as suggested by Reviewer 2). Our evaluation mainly involves analyzing real data sets, so we moved all our simulation analyses into the supplemental section. We still discuss these results where relevant [3].
a. We agree with Reviewer 1 in comment #4B that differential abundance using set features is equivalent to enrichment analysis at the population level. As such, we have re-labeled "single-sample enrichment testing" as "Inference at the sample level", while the differential abundance section is now titled "Inference at the population level". Both sections are now grouped together under the general enrichment analysis umbrella.
b. We have re-organized the results section into three evaluation tasks: statistical inference (type I error evaluations and other analyses performed under the null), phenotype relevance (power and sample ranking analyses), and downstream analyses (prediction analyses). As stated in part a), each of the sections on statistical testing is subdivided into inference at the population level and inference at the sample level.
c. We have also added evaluations based on standards set by the reference [3] suggested in comment #2 from Reviewer 2. For "inference at the sample level", our type I error evaluation on real data uses random set analyses, which correspond to the stated null scenario of CBEA. This change is in response to Reviewer 1's comment #3B. For "inference at the population level", we added random set analyses in addition to the label permutation analyses already performed in the original manuscript.
d. We provided more explicit language to describe the run-time analyses conducted in the supplementary section of our manuscript.
5) We have revised the discussion section to include interpretations of results from the newly added analyses.
a. We provided a discussion of situations where CBEA would perform best, based on results obtained under different evaluation scenarios. This provides the additional context necessary to evaluate our model while also helping potential users apply our approach to their own data.
b. We condensed the existing explanation sections to avoid repeating information while also being more precise regarding the performance of our approach (in response to comment #3A from Reviewer 1).
c. We added new discussion in light of the additional analyses performed as a response to comment #2 from Reviewer 2.
d. We added more clarification on the limitations of our evaluation and method, including:
i. Highlighting the difficulty of evaluating the power/phenotypic relevance of enrichment methods, and clarifying how our evaluation data set can address those goals, in response to comment #2 from Reviewer 2 and comment #3B from Reviewer 1.
ii. Limited ability to address data sparsity, in response to comment #2A from Reviewer 1.
We believe these changes in response to reviewer comments have made the manuscript more focused and better aligned with previous benchmarking standards, while also providing the needed clarity and precision of language around complex statistical terminology. Our revisions also highlight the main contribution of the manuscript, which is fundamentally a single-sample set enrichment analysis approach (in the same vein as ssGSEA). In the following section we provide explicit responses to the reviewers' comments. Reviewer comments are italicized while our responses are in unformatted text, and they are organized according to the original text to the best of our ability.

We would like to thank the reviewer for the comments with respect to the clarity of our claims around the compositional nature of the data. We agree that the terminology used in the manuscript was confusing and that it was not clear how we reached the assumption that microbiome data is compositional. We adjusted the introduction section to provide more context behind this motivation and have avoided confusing terms such as "strictly compositional". Here, we provide a discussion on this issue that was also included in the manuscript (see paragraph 4 of the introduction section). We feel that this also responds to the reviewer's comment below (#1D), since both comments (#1A, #1D) refer to this issue of compositionality.
We agree with the reviewer that we cannot get absolute abundances of features measured via sequencing data, and that count data alone is not compositional according to Aitchison's definition [4]. However, despite efforts to sequence equimolar amounts of DNA, as the reviewer noted, the total number of sequences per sample (i.e., library size) still varies significantly across samples [5,6], and normalization approaches must be applied to ensure that abundances can be compared across samples [7]. The most commonly used normalization approach is to transform counts into proportions using the total library size per sample as the denominator. Even though some researchers suggest using existing methods from the gene expression literature [8], some of the assumptions that underlie these approaches might not hold for microbiome data. For example, DESeq2's median-of-ratios method (in the function estimateSizeFactors) assumes that the majority of genes do not differ in expression levels across samples. Other studies have empirically compared different normalization methods and found that transformation to proportions is usually the best choice [9]. As such, we mistakenly used the term "strictly compositional" to refer to the fact that microbiome sequencing data, unlike other sequencing data sets, is usually transformed to proportions prior to analysis (as stated above, we have removed this terminology from the manuscript). Once researchers transform count data into proportions, the data becomes compositional because a sum constraint has been imposed. Even though the presence of zeroes does not fit Aitchison's definition, the imposed sum-to-one constraint still induces spurious negative correlation between the variables, for which log-ratio based methods are well motivated as solutions [5]. Our approach is conceptually a log-ratio based method for aggregating compositional variables.
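This closure-induced spurious correlation is easy to reproduce in a small simulation (our own illustration, not an analysis from the manuscript): counts that are generated independently become negatively correlated once each sample is divided by its library size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent Poisson counts for 3 taxa across 2000 samples:
# by construction there is no true correlation between taxa.
counts = rng.poisson(lam=50, size=(2000, 3))

# Closure: divide each sample by its library size, imposing
# a sum-to-one constraint on every row.
props = counts / counts.sum(axis=1, keepdims=True)

# Correlation between taxa 0 and 1, before and after closure.
r_counts = np.corrcoef(counts[:, 0], counts[:, 1])[0, 1]  # near zero
r_props = np.corrcoef(props[:, 0], props[:, 1])[0, 1]     # clearly negative
```

With three exchangeable components the proportions end up with a correlation of roughly -0.5 even though the underlying counts are independent; this is exactly the artifact that log-ratio methods are designed to avoid.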
This concept is not novel, as it was advanced previously by the original authors of the ILR transformation [10], where it was termed a balance between groups of parts. Our contribution is to specify the "groups" such that the resulting balance has an interpretation similar to that of the competitive null hypothesis in the gene set testing literature.

B. What the authors propose is not an ILR transform. Unless I am mistaken, there is no constraint on the matrix A such that the coordinate system is cartesian with an orthonormal basis. In fact, if k does not equal p-1 then it cannot possibly be isomorphic, let alone isometric, with respect to the Aitchison metric. Unless I am mistaken, the authors should change the name of their method and modify their discussion to be more accurate. I would note that while this method is not an ILR transform, it is very similar to phylofactor, which takes a similar approach (in phylofactor, set membership is dictated by the topology of the tree).
We agree with the reviewer that the method itself is not an ILR transform (as it does not propose a novel binary partition such as PhILR [1]) and was therefore mislabeled. Our approach still leverages the concept of balances between groups of compositional parts related to the ILR transformation, as advanced by the original authors [10]. As such, our approach has been renamed "competitive balances for taxonomic enrichment analysis" (CBEA). We revamped the first section of "Properties of CBEA" to reflect this distinction and to provide more discussion of CBEA's influences. We hope that this new name more clearly reflects the specific advances our method proposes.
C. The authors state that the "data is zero-inflated"; this is another cliche that I would encourage the authors to remove. Zero-inflation is a particular family of models for these zeros, not an objective characteristic of the data. Simply saying there are many zeros would likely suffice in this article.

They could argue that the data generating mechanism is well represented by a zero-inflation process, but this has been called into question (see Silverman et al. Naught all zeros in sequence count data are the same.)
We agree with the reviewer that the term "zero-inflated" should be used in reference to the specific class of models rather than as a catch-all characterization of the data. Since we are agnostic to the mechanism generating the zeroes, we have amended the article to use "zero-abundant" or "sparse".

We agree with the reviewer that the terminology is confusing and have added additional clarity to the manuscript. We have provided a more comprehensive response on the issue of the compositional nature of microbiome data in the above section (response to comment A), which we believe also addresses the issues raised in this comment.

E. The GSEA method cited on line 51 is not a random-walk-like statistic. I think it may be a Brownian bridge, but it is constrained to be zero at either end, not a random walk.
We thank the reviewer for the clarification. The "random-walk like statistic" phrase has been clarified and amended in the introduction section.
F. Between lines 73 and 85 the authors do not properly motivate the multiplicative rather than additive amalgamation. They mention the downsides of the "naive summation-based method" but this is unclear. From later in the manuscript I gather that this statement reflects the perturbation invariance of multiplicative amalgamation, given that some have argued that measurement bias can be modeled as a constant compositional perturbation. This needs to be made explicit. There is no inherent downside to summation (i.e., additive amalgamation); it's a modeling choice, and it is not "naive".
We have reworked the introduction section (see paragraph 5) to highlight the differences more clearly between product and sum-based aggregations and provide a robust justification for our approach. We have also removed the characterization of sum-based amalgamations as "naïve" as suggested.
G. The authors mention "adjusting for correlation" multiple times throughout the manuscript yet the motivation is not properly clarified. The best I can guess is that they are saying that they need to modify the null hypothesis to account for a trivial case where something looks differentially expressed or set-enriched when really it's just due to the correlation structure between taxa.

We thank the reviewer for pointing out that the motivation behind "adjusting for correlation" was not clearly communicated in the manuscript. We have amended the "Statistical properties" part of the manuscript and introduced a separate section to provide more clarity on this concept.
Additionally, we agree with the reviewer that there are situations where highly correlated sets are biologically relevant. As such, we have provided additional discussion of that issue and have left the decision of whether to adjust for correlation to the user. This also supports the notion (as also recommended by Reviewer 2) that set-based analysis is usually exploratory rather than confirmatory, so an inflated type I error may be acceptable in exchange for higher power.

We agree with the reviewer that the source on line 187 did not discuss the distributional properties of CBEA. We have amended the citation with the source from Aitchison and Shen [11], which provides details on the logistic normal distribution for compositional data, and the original source from Egozcue [12], which discusses the relationship between ILR coordinates and the ALR coordinates that motivated the logistic normal distribution mentioned above.

J. On line 536 the authors mention "inflated counts", I have no idea what this means.
We have updated the simulation design section (now in the supplementary analyses) to detail the meaning of "inflated counts". We have also amended this confusing language and now refer to sets that have a higher mean abundance as "enriched". In essence, "inflated counts" refers to sets (or individual taxa) that show a fold-change increase in absolute counts in a certain condition (e.g. IBD) compared to control. This is equivalent terminology for taxa that are differentially abundant across conditions, but it also refers specifically to the mechanism of the abundance difference (a fold change).

We would like to thank the reviewer for bringing this to our attention. We have tweaked the manuscript (see discussion section) to hopefully add further clarity to this issue. Here, we provide the discussion that is included in our revision.
The comments referenced by the reviewer were referring to the difference in performance of corncob and DESeq2 between the simulations and the real data analysis. In the simulation analysis, these methods show low type I error and low power, while conversely in the real data analyses (i.e., the permutation analyses) they show high type I error and high power (when compared against CBEA). We did not expect this result, which indicates that there are systematic differences between our assumed data generating process in the simulation design and data from real-world sources. In this section we specifically hypothesize about the mechanism behind the differences between simulated and real data sets and deduce that it might be due to taxon-specific biases. According to McLaren et al. [13], sum-based aggregations are particularly sensitive to this type of bias. We agree with the reviewer that in permutation analyses using real data this bias would be preserved, which explains the high type I error observed when applying DESeq2 and corncob. However, this is something we did not consider in our simulation, hence the differences in performance.

We thank the reviewer for the comments. Our strategy for addressing sparsity in microbiome data is to use pseudocounts to ensure the validity of the log-ratio transformations. As such, we have stated this assumption more clearly in the methods formulation section of the revised manuscript.
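As a concrete illustration of this pseudocount strategy (our own sketch; the default value of 0.5 and the function name are assumptions, not necessarily the manuscript's exact choices), zeros are offset before closure so that the subsequent log-ratios are defined:

```python
import numpy as np

def add_pseudocount(counts, pc=0.5):
    """Add a small pseudocount to every entry, then re-close to
    proportions, so that no zero remains and log-ratios are defined."""
    padded = np.asarray(counts, dtype=float) + pc
    return padded / padded.sum(axis=-1, keepdims=True)
```

After this step every proportion is strictly positive, so np.log can be applied safely to any subcomposition.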

We also acknowledge in our discussion section the limitations of this approach and mention alternative methods that users can apply prior to running CBEA. However, according to our experimental results, the performance of our approach is not particularly sensitive to the level of sparsity in the data.

Modeling Choices
As clarified above, we do not model the count data directly but rather model the relative proportions that result from a total sum normalization of the counts. Our approach is to perform an ILR-like transformation to the proportions corresponding to the set annotations and perform inference through empirical modeling of the test statistic under the null. Our simulation studies have demonstrated that the empirical distribution of our test statistic is well approximated by a normal distribution. Furthermore, real data analyses also show that a normal approximation generates good performance values for all considered situations.
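The inference step described above can be sketched as follows. This is an illustrative simplification under the assumption that the empirical null is summarized by a fitted mean and standard deviation; the function name is ours, not the package's.

```python
import math
import numpy as np

def empirical_null_pvalues(observed, null_scores):
    """Fit a normal distribution to scores computed under the null
    (e.g., scores for randomly drawn sets) and return upper-tail
    p-values for the observed enrichment scores."""
    mu = float(np.mean(null_scores))
    sd = float(np.std(null_scores, ddof=1))
    z = (np.asarray(observed, dtype=float) - mu) / sd
    # P(Z >= z) for a standard normal, via the complementary error function.
    return np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in z])
```

Scores far in the right tail of the fitted null receive small p-values, flagging sets enriched within that sample.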
As the reviewer noted, multinomial logistic normal models are useful for modeling count data directly, and it may be feasible to apply the multinomial logistic normal model to perform set-based enrichment analysis. Although we are not aware of any existing approaches that utilize this distribution for set-based testing, this is an interesting idea that we hope to explore in future research. We have added this to our "Limitations and future directions" section.

Unsubstantiated Claims
There are a number of unsubstantiated claims where the language needs to be altered to be more precise. A. Line 493: "These results demonstrate that cILR generated scores are informative features in disease prediction tasks." No. These results demonstrate that cILR COULD be informative features in disease prediction tasks. I am not convinced that these are even useful for the case studies shown in this manuscript, let alone other tasks. Moreover, the comparison methods ssGSEA and GSVA seem like odd choices. Are the authors only using methods that can take set-based features? This does not account for the possibility that the chosen sets are not informative. The latter seems like an important case to establish the motivation of the current work.
We agree with the reviewer that the predictive capacity of set-based features would be low if the chosen sets are non-informative or not interesting. We have amended the statement above in the manuscript as suggested by the reviewer. We would like to clarify that in this manuscript we are agnostic as to how the sets are constructed and whether there is a performance increase using sets compared to using the basic features. What we demonstrate in the manuscript is the relative performance of the different approaches to aggregation in instances where the researcher decides aggregation is of interest. As such, our claim that "CBEA generated scores are informative features" refers to the fact that given the same sets of microbes, scores constructed by CBEA can be informative towards prediction compared to similar approaches, suggesting that it is valid to use set-based features generated using CBEA for prediction purposes. However, as stated above, we agree with the reviewer that this is a strong statement and have adjusted it accordingly. We also added the context provided in this response to the results interpretation in the manuscript (refer to "Downstream analysis using prediction models"/ "Disease prediction" sections).
In terms of the comparison methods GSVA and ssGSEA, we were motivated in this section by the fact that CBEA generates scores at the sample level, acting as a transformation of an n x p matrix of samples and taxa into a matrix of samples and sets. ssGSEA and GSVA are approaches that also calculate enrichment scores per sample, using the original matrix and set annotation as inputs. As such, we felt it was appropriate to use these models as comparison points. The goal is that for predictive analyses, we can fit a model (in our case, a simple random forest) to these scores to perform prediction using set-based features. In other words, ssGSEA, GSVA, and CBEA are all "feature engineering" approaches rather than outcome prediction models.

We thank the reviewer for raising this issue. For the simulation results (now in the supplementary materials), we updated the manuscript to specify the precise design of our simulation experiments and to clarify that these scenarios represent only a subset of the data sets that users may encounter in practice. As such, we have modified our performance claims in the supplementary analyses where this is relevant.
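The "feature engineering" role described above can be sketched as follows. This is our own illustration, using the raw balance score as a stand-in for the full CBEA score; the function name is hypothetical, and we assume zeros were already handled upstream. A classifier such as a random forest would then be fit on the returned matrix instead of the raw taxa.

```python
import numpy as np

def set_score_matrix(props, set_masks):
    """Turn an (n_samples x n_taxa) matrix of proportions into an
    (n_samples x n_sets) matrix of balance scores, one column per set.
    Downstream prediction models are fit on these set-level features."""
    scores = np.empty((props.shape[0], len(set_masks)))
    logp = np.log(props)
    for j, mask in enumerate(set_masks):
        mask = np.asarray(mask, dtype=bool)
        p, q = mask.sum(), (~mask).sum()
        scale = np.sqrt(p * q / (p + q))  # ILR-style normalizing constant
        scores[:, j] = scale * (logp[:, mask].mean(axis=1)
                                - logp[:, ~mask].mean(axis=1))
    return scores
```

The output has one row per sample and one column per set, so it plugs directly into any standard classifier API.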
In terms of our real data evaluation, we have adjusted our type I error analyses for sample-level inference to use randomly generated sets instead (as suggested by the source referenced by Reviewer 2). We believe this better represents the null distribution stated in CBEA and addresses the reviewer's concern over what is random and non-random in the data set. For power analyses, we agree that the aerobic vs. anaerobic hypothesis is not strong enough to serve as ground truth.
As a result, we have clarified and provided further discussion of this aspect of the evaluation, including the lack of standardized data sets for evaluating enrichment testing. Despite these limitations, our analyses based on the aerobic vs. anaerobic label can still provide good insight into model performance, since the hypothesis has a clear and straightforward biological interpretation (i.e., based on easy-to-determine natural characteristics of the microbes) and has been used in prior manuscripts that attempt to validate differential abundance analyses for microbiome data [14]. We have adjusted the "Evaluation" section of the manuscript to improve readability and fill in missing gaps in our evaluation methodology. The simulation section in the revised supplementary analyses also has clearer descriptions of the relevant simulation design. With regard to the specific examples provided by the reviewer, we provide some clarification as follows:

Other Comments on Clarity
1) The parameters of the negative binomial distribution were chosen by fitting it (using a maximum likelihood approach with the fitdistrplus package [15]) to the non-zero entries of each taxon in the human microbiome 16S data set. The median values across all per-taxon estimates were chosen as the final estimates for the simulation procedure.
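For concreteness, a rough Python analogue of this fitting step is sketched below. It uses method-of-moments estimates rather than the maximum-likelihood fit done with fitdistrplus, and the function name is ours; it is shown only to make the per-taxon estimation procedure concrete.

```python
import numpy as np

def nb_moment_fit(x):
    """Method-of-moments estimates for a negative binomial with
    size (dispersion) r and success probability p, fit to a vector
    of counts. Requires over-dispersion (variance > mean)."""
    x = np.asarray(x, dtype=float)
    m, v = x.mean(), x.var(ddof=1)
    if v <= m:  # under-dispersed sample: NB degenerates toward Poisson
        raise ValueError("sample variance must exceed the mean")
    r = m * m / (v - m)   # size parameter: var - mean = mean^2 / r
    p = m / v             # success probability: mean = r(1-p)/p
    return r, p
```

In the manuscript's procedure this fit would be repeated per taxon, with the medians of the per-taxon estimates used for simulation.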
2) 10,000 samples refers to the number of samples (i.e. the number of biological replicates). Since we are attempting to perform inference per sample (assigning a p-value to each sample), this is equivalent to the number of hypotheses tested in our enrichment analysis procedure.

The reviewer also noted that the distinction between single-sample enrichment and differential abundance was unclear from the introduction, and that this notation is non-standard: typically enrichment (e.g., as used in GSEA) refers to essentially differential expression but for sets of genes (i.e., it is typically a comparison between groups), which makes the "single-sample enrichment" terminology confusing.
To address structural confusion, we have included an overview of the manuscript in the introduction. We also restructured the manuscript into different evaluation criteria instead of analysis tasks (refer to the preamble of this response letter). We also adjusted figure labels and captions for clarity across different sections.
We agree with the reviewer that enrichment testing is essentially differential abundance for set-based features. As such, we have combined "single-sample" and "differential abundance" into the same type of analysis (now broken down by evaluation type, such as type I error and power). "Single-sample" analysis is now referred to as "sample-level inference", where we test for enriched sets within a sample (significantly enriched compared to background taxa for that sample only); this is an unsupervised test (i.e., it does not use sample labels), similar to VAM [16]. "Differential abundance" is now referred to as "population-level inference", where we test for differences in enrichment scores across case/control status; this is a supervised approach, similar to GSEA [17].

We would like to thank Reviewer 2 for the detailed and encouraging responses. We have made structural changes in the manuscript (as described in the preamble section of this response) to acknowledge the standards set out by Geistlinger et al. 2021 [3]. Additionally, we have added a discussion on adjusting for correlation, and have also fixed the R package.