Supervised learning and model analysis with compositional data

Supervised learning, such as regression and classification, is an essential tool for analyzing modern high-throughput sequencing data, for example in microbiome research. However, due to the compositionality and sparsity of such data, existing techniques are often inadequate. Either they rely on extensions of the linear log-contrast model (which adjust for compositionality but cannot account for complex signals or sparsity) or they are based on black-box machine learning methods (which may capture useful signals but lack interpretability due to the compositionality). We propose KernelBiome, a kernel-based nonparametric regression and classification framework for compositional data. It is tailored to sparse compositional data and is able to incorporate prior knowledge, such as phylogenetic structure. KernelBiome captures complex signals, including in the zero-structure, while automatically adapting model complexity. We demonstrate on-par or improved predictive performance compared with state-of-the-art machine learning methods on 33 publicly available microbiome datasets. Additionally, our framework provides two key advantages: (i) We propose two novel quantities to interpret contributions of individual components and prove that they consistently estimate average perturbation effects of the conditional mean, extending the interpretability of linear log-contrast coefficients to nonparametric models. (ii) We show that the connection between kernels and distances aids interpretability and provides a data-driven embedding that can augment further analysis. KernelBiome is available as an open-source Python package on PyPI and at https://github.com/shimenghuang/KernelBiome.


Introduction
Compositional data, that is, measurements of parts of a whole, are common in many scientific disciplines. Examples include mineral compositions in geology (Buccianti et al., 2006), element concentrations in chemistry (Pesenson et al., 2015), species compositions in ecology (Jackson, 1997) and, more recently, high-throughput sequencing reads in microbiome science (Li, 2015).
Mathematically, any p-dimensional composition can, after appropriate normalization, be represented as a point on the simplex S^{p-1} := {x ∈ [0, 1]^p : Σ_{j=1}^p x_j = 1}. This complicates the statistical analysis, because the sum-to-one constraint of the simplex induces non-trivial dependencies between the components that may lead to false conclusions if not appropriately taken into account.
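As a small illustration of this normalization step (the count table below is made up for the example), the closure operation maps raw counts to relative abundances on the simplex:

```python
import numpy as np

def closure(counts):
    """Normalize non-negative count vectors (rows) to the simplex S^{p-1}."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

# Hypothetical count table: 3 samples x 4 taxa.
raw_counts = np.array([[10, 0, 5, 85],
                       [3, 7, 0, 90],
                       [25, 25, 25, 25]])
X = closure(raw_counts)
print(X.sum(axis=1))  # each row sums to 1
```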
The statistics community has developed a substantial collection of parametric analysis techniques to account for the simplex structure. The most basic is the family of Dirichlet distributions. However, as pointed out already by Aitchison (1982), Dirichlet distributions cannot capture non-trivial dependence structures between the composition components and are thus too restrictive. Aitchison (1982) therefore introduced the log-ratio approach. It generates a family of distributions by projecting multivariate normal distributions into S^{p-1} via an appropriate log-ratio transformation (e.g., the additive log-ratio, centered log-ratio (Aitchison, 1982), or isometric log-ratio (Egozcue et al., 2003)). This yields parametric models on the simplex that are rich enough to capture non-trivial dependencies between the components (i.e., beyond those induced by the sum-to-one constraint). The log-ratio approach has been extended and adapted to a range of statistical problems (e.g., Aitchison, 1985; Tsagris et al., 2011; Aitchison, 1983; Aitchison and Greenacre, 2002; Friedman and Alm, 2012).
For supervised learning tasks, the log-ratio approach leads to the log-contrast model (Aitchison and Bacon-Shone, 1984). An attractive property of the log-contrast model is that its coefficients quantify the effect of a multiplicative perturbation (i.e., fractionally increasing one component while adjusting the others) on the response. While several extensions of the log-contrast model exist (e.g., Lin et al., 2014; Shi et al., 2016; Combettes and Müller, 2020; Simpson et al., 2021; Ailer et al., 2021), its parametric approach to supervised learning has two major shortcomings that become particularly severe when applied to high-dimensional and zero-inflated high-throughput sequencing data (Tsilimigras and Fodor, 2016; Gloor et al., 2017). Firstly, since the logarithm is not defined at zero, the log-contrast model cannot be directly applied. A common fix is to add so-called pseudo-counts, a small non-zero constant, to all (zero) entries (Kaul et al., 2017; Lin and Peddada, 2020). More sophisticated replacement schemes exist as well (e.g., Martín-Fernández et al., 2003; Fernandes et al., 2013; De La ..., 2018); however, they often rely on knowing the nature of the zeros (e.g., whether they are structural or random), which is typically not available in practice and difficult to estimate. In any case, the downstream analysis will strongly depend on the selected zero-imputation scheme (Park et al., 2022). Secondly, the relationships between individual components (e.g., species) and the response are generally complex. For example, in human microbiome settings, a health outcome may depend on interactions or on the presence or absence of species. Neither can be captured by the linear structure of the log-contrast model.

Figure 1: Overview of KernelBiome. We start from a paired dataset with a compositional predictor X and a response Y and optional prior knowledge on the relation between components in the compositions (e.g., via a phylogenetic tree). We then select a model among a large class of kernels which best fits the data. This results in an estimated model f̂ and embedding k̂. Finally, these can be analyzed while accounting for the compositional structure.
We propose to solve the supervised learning task using a nonparametric kernel approach, which is able to handle complex signals and avoid arbitrary zero-imputation. To be of use in biological applications, a supervised analysis has two components: (i) estimating a predictive model that accurately captures signals in the data and (ii) extracting meaningful and interpretable information from the estimated model. For (i), it has been shown that modern machine learning methods are capable of creating highly predictive models by using microbiome data as covariates and phenotypes as responses (e.g., Knight et al., 2018; Zhou and Gallins, 2019; Cammarota et al., 2020). In particular, several approaches have been proposed where kernels are used to incorporate prior information (Chen and Li, 2013; Randolph et al., 2018) or as a way to utilize the compositional structure (Ramon et al., 2021; Di Marzio et al., 2015; Tsagris and Athineou, 2021). Recently, Park et al. (2022) and Li and Ahn (2022) used the radial transformation to argue that kernels on the sphere provide a natural way of analyzing compositions with zeros and, similarly to our work, suggest using the kernel embeddings in a subsequent analysis. Part (ii) is related to the fields of explainable AI (Samek et al., 2019) and interpretable machine learning (Molnar, 2020), which focus on extracting information from predictive models. These types of approaches have also received growing attention in the context of microbiome data (Topçuoglu et al., 2020; Gou et al., 2021; Ruaud et al., 2022). However, to the best of our knowledge, none of these procedures have been adjusted to account for the compositional structure. As we show in Sec. 2.1, not accounting for the compositionality may invalidate the results.
KernelBiome, see Fig. 1, addresses both (i) by providing a regression and classification procedure based on kernels targeted to the simplex and (ii) by providing a principled way of analyzing the estimated models. Our contributions are fourfold: (1) We develop a theoretical framework for using kernels on compositional data. While using kernels to analyze various aspects of compositional data is not a new idea, a comprehensive analysis and its connection to existing approaches have been missing. In this work, we provide a range of kernels that each capture different aspects of the simplex structure, many of which have not been previously applied to compositional data. For all kernels, we derive novel, positive-definite weighted versions that allow incorporating prior information between the components. Additionally, we show that the distance associated with each kernel can be used to define a kernel-based scalar summary statistic. (2) We propose a theoretically justified analysis of kernel-based models that accounts for compositionality. Firstly, we introduce two novel quantities for measuring the effects of individual features that explicitly take the compositionality into account and prove that these can be consistently estimated. Secondly, we build on known connections between kernels and distance measures to advocate for using the kernel embedding from the estimated model to create visualizations and perform follow-up distance-based analyses that respect the compositionality. (3) We draw connections between KernelBiome and log-contrast-based analysis techniques. More specifically, we connect the Aitchison kernel to the log-contrast model, prove that the proposed compositional feature influence in this case reduces to the log-contrast coefficients, and show that our proposed weighted Aitchison kernel is related to the recently proposed tree-aggregation method of log-contrast coefficients (Bien et al., 2021). Importantly, these connections ensure that KernelBiome reduces to standard log-contrast analysis techniques whenever simple log-contrast models are capable of capturing most of the signal. This is also illustrated by our experimental results. (4) We propose a data-adaptive selection framework that allows comparing different kernels in a principled fashion.
The paper is structured as follows. In Sec. 2, we introduce the supervised learning task, define two quantities for analyzing individual components (Sec. 2.1), give a short introduction to kernel methods and how to apply our methodology (Sec. 2.2), and present the full KernelBiome framework (Sec. 2.3). Finally, we illustrate the advantages of KernelBiome in the experiments in Sec. 3.

Methods
We consider the setting in which we observe n independent and identically distributed (i.i.d.) observations (X_1, Y_1), ..., (X_n, Y_n) of a random variable (X, Y) with X ∈ S^{p-1} a compositional predictor and Y ∈ R a real-valued response variable (by which we include categorical responses). Supervised learning attempts to learn a relationship between the response Y and the dependent predictors X. In this work, we focus on conditional mean relationships. More specifically, we are interested in estimating (a version of) the conditional mean of Y, that is, the function

f*(x) := E[Y | X = x].   (1)

We assume that f* ∈ F ⊆ {f | f : S^{p-1} → R}, where F is a function class determined by the regression (or classification) procedure.
While estimating and analyzing the conditional mean is well established for predictors in Euclidean space, there are two factors that complicate the analysis when the predictors are compositional. (i) While it is possible to directly apply most standard regression procedures designed for X ∈ R^p also to X ∈ S^{p-1}, it turns out that many approaches are ill-suited to approximating functions on the simplex. (ii) Even if one accurately estimates the conditional mean function f*, the simplex constraint complicates any direct assessment of the influence and importance of individual components of the compositional predictor. In this work, we address both issues and propose a nonparametric framework for regression and classification analysis of compositional data.

Interpreting individual features
Our goal when estimating the conditional mean f* given in (1) is to gain insight into the relationship between the response Y and predictors X. For example, when fitting a log-contrast model (see Example 2.1), the estimated coefficients provide a useful tool to generate hypotheses about which features affect the response and thereby inform follow-up experiments. For more complex models, such as the nonparametric methods proposed in this work, direct interpretation of a fitted model f̂ is difficult. Two widely applicable measures due to Friedman (2001) are the following: (i) relative influence, which assigns each coordinate j a scalar influence value given by the expected partial derivative E[d/dx_j f̂(X)], and (ii) partial dependence plots, which are constructed by plotting, for each coordinate j, the function z ↦ E[f̂(X_1, ..., X_{j-1}, z, X_{j+1}, ..., X_p)]. However, directly applying these measures on the simplex is not possible, as we illustrate in App. D.2. The intuition is that both measures evaluate the function f̂ outside the simplex. An adaptation of the relative influence (or elasticity in the econometrics literature) to compositions based on the Aitchison geometry has recently been proposed by Morais and Thomas-Agnan (2021). We adapt the relative influence without relying on the log-ratio transform and hence allow for more general function classes.
Our approach is based on two coordinate-wise perturbations on the simplex. For any j ∈ {1, ..., p} and x ∈ S^{p-1}, define (i) for c ∈ [0, ∞) the function ψ_j(x, c) ∈ S^{p-1} to be the composition resulting from multiplying the j-th component by c and then scaling the entire vector back into the simplex, and (ii) for c ∈ [0, 1] the function φ_j(x, c) ∈ S^{p-1} to be the composition that consists of fixing the j-th coordinate to c and then rescaling all remaining coordinates such that the resulting vector lies in the simplex. Each perturbation can be seen as a different way of intervening on a single coordinate while preserving the simplex structure. More details are given in App. A. Based on the first perturbation, we define the compositional feature influence (CFI) of component j ∈ {1, ..., p} for any differentiable function f : S^{p-1} → R by

I_j^f := E[ d/dc f(ψ_j(X, c)) |_{c=1} ].

Similarly, we adapt partial dependence plots using the second perturbation. Define the compositional partial dependence (CPD) of component j ∈ {1, ..., p} for any function f : S^{p-1} → R and z ∈ [0, 1] by

S_j^f(z) := E[ f(φ_j(X, z)) − f(X) ].

In practice, we can compute Monte Carlo estimates of both quantities by replacing expectations with empirical means. We denote the corresponding estimators by Î_j^f and Ŝ_j^f, respectively (see App. A for details).
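A minimal JAX sketch of the Monte Carlo estimators described above (illustrative only; function names and the actual KernelBiome API may differ). The CFI uses automatic differentiation with respect to the perturbation strength c, evaluated at c = 1:

```python
import jax
import jax.numpy as jnp

def psi(x, j, c):
    """Multiplicative perturbation: scale component j by c and renormalize to the simplex."""
    y = x.at[j].multiply(c)
    return y / y.sum()

def phi(x, j, c):
    """Fixed-coordinate perturbation: set component j to c, rescale the rest to sum to 1 - c."""
    rest = x.sum() - x[j]  # assumes x[j] < 1
    y = x * (1.0 - c) / rest
    return y.at[j].set(c)

def cfi(f, X, j):
    """Monte Carlo estimate of the CFI: mean of d/dc f(psi_j(X_i, c)) at c = 1."""
    dfdc = jax.vmap(lambda xi: jax.grad(lambda c: f(psi(xi, j, c)))(1.0))(X)
    return dfdc.mean()

def cpd(f, X, j, z):
    """Monte Carlo estimate of the CPD: mean of f(phi_j(X_i, z)) - f(X_i)."""
    return jax.vmap(lambda xi: f(phi(xi, j, z)) - f(xi))(X).mean()

# Example: log-contrast function f(x) = beta^T log(x) with sum(beta) = 0.
beta = jnp.array([1.0, -1.0, 0.5, -0.5])
f = lambda x: jnp.dot(beta, jnp.log(x))
X = jax.random.dirichlet(jax.random.PRNGKey(0), jnp.ones(4), shape=(100,))
print(cfi(f, X, j=0))  # approximately beta[0] = 1.0 (cf. Proposition 2.1)
```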
The following proposition connects the CFI and CPD to the coefficients in a log-contrast function.
Proposition 2.1 (CFI and CPD in the log-contrast model). Let f : x ↦ β^⊤ log(x) with Σ_{j=1}^p β_j = 0. Then the CFI and CPD are given by

I_j^f = β_j   and   S_j^f(z) = β_j log(z / (1 − z)) + c,

respectively, where c ∈ R is a constant depending on the distribution of X but not on z and satisfies c = 0 if β_j = 0.
A proof is given in App. F.1. The proposition shows that the CFI and CPD are generalizations of the β-coefficients in the log-contrast model. The following example provides further intuition.
Example 2.1 (CFI and CPD in a log-contrast model). Consider a log-contrast model Y = β^⊤ log(X) + ε with Σ_{j=1}^p β_j = 0 and, in this example, β_4 = 0. The CFI and CPD for the true function f, estimated based on n = 100 i.i.d. samples (X_1, Y_1), ..., (X_n, Y_n) with X_i compositional log-normal, are shown in Fig. 2.
The following theorem highlights the usefulness of the CFI and CPD by establishing when they can be consistently estimated from data.
Theorem 2.1 (Consistency). Assume f̂_n is an estimator of the conditional mean f* given in (1) based on i.i.d. observations (X_1, Y_1), ..., (X_n, Y_n). (i) If the derivative of f̂_n converges uniformly in probability to the derivative of f* on the support of X, then the estimated CFI Î_j^{f̂_n} converges in probability to I_j^{f*}. (ii) If f̂_n converges uniformly in probability to f* on the support of X, and the perturbed points φ_j(X, z) remain within the support, then the estimated CPD Ŝ_j^{f̂_n}(z) converges in probability to S_j^{f*}(z).

Figure 2: Visualization of the CPD (left) and CFI (right) based on n = 100 samples and the true function f. Since β_4 = 0 in this example, the 4-th component has no effect on the value of f, resulting in a CFI of zero and a flat CPD. Since we are not estimating f, the CFI values exactly correspond to the β-coefficients in this example.
A proof is given in App. F.2 and the result is demonstrated on simulated data in App. D.1.
The theorem shows that the CFI is consistently estimated as long as the derivative of f * is consistently estimated, which can be ensured for example for the kernel methods discussed in Sec. 2.2. In contrast, the CPD only requires the function f * itself to be consistently estimated.
The additional assumption on the support ensures that the perturbation φ j used in the CPD remains within the support. If this assumption is not satisfied one needs to ensure that the estimated function extrapolates beyond the sample support. Interpreting the CPD therefore requires caution.

Kernel methods for compositional data analysis
Before presenting our proposed weighted and unweighted kernels, we briefly review the necessary background on kernels and their connection to distances. Kernel methods are a powerful class of nonparametric statistical methods that are particularly useful for data from non-standard (i.e., non-Euclidean) domains X . The starting point is a symmetric, positive definite function k : X × X → R, called kernel. Kernels encode similarities between points in X , i.e., large values of k correspond to points that are similar and small values to points that are less similar. Instead of directly analyzing the data on X , kernel methods map it into a well-behaved feature space H k ⊆ {f | f : X → R} called reproducing kernel Hilbert space (RKHS), whose inner product preserves the kernel induced similarity.
Here, we consider kernels on the simplex, that is, X = S^{p-1}. The conditional mean function f* given in (1) can then be estimated by optimizing a loss over H_k, for an appropriate kernel k for which H_k is sufficiently rich, i.e., f* ∈ H_k. The representer theorem (e.g., Schölkopf et al., 2002) states that such an optimization over H_k can be performed efficiently. Formally, it states that the minimizer over H_k of an objective of the form Σ_{i=1}^n L(Y_i, f(X_i)) + λ ‖f‖²_{H_k}, with L an arbitrary convex loss function and λ > 0 a penalty parameter, has the form f̂(·) = Σ_{i=1}^n α̂_i k(X_i, ·) for some α̂ ∈ R^n. This means that instead of optimizing over a potentially infinite-dimensional space H_k, it is sufficient to optimize over the n-dimensional parameter α̂. Depending on the loss function, this allows one to construct efficient regression and classification procedures, such as kernel ridge regression and support vector machines (e.g., Schölkopf et al., 2002).
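To make the representer theorem concrete, the following minimal numpy sketch (not the KernelBiome implementation) fits kernel ridge regression, i.e., the squared loss with an RKHS-norm penalty, for which the coefficients have the closed form α̂ = (K + λ I)^{-1} y:

```python
import numpy as np

def gram(k, X):
    """Kernel Gram matrix K with K[i, j] = k(X[i], X[j])."""
    return np.array([[k(xi, xj) for xj in X] for xi in X])

def kernel_ridge_fit(K, y, lam):
    """Representer-theorem coefficients for the squared loss: alpha = (K + lam*I)^{-1} y."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(k, X_train, alpha, X_new):
    """f_hat(x) = sum_i alpha_i k(X_i, x)."""
    return np.array([sum(a * k(xi, x) for a, xi in zip(alpha, X_train)) for x in X_new])

# Example with the linear kernel restricted to the simplex.
linear_kernel = lambda x, y: float(np.dot(x, y))
```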
The performance of the resulting prediction model depends on the choice of kernel as this determines the function space H k . A useful way of thinking about kernels is via their connection to distances. In short, any kernel k induces a unique semi-metric d k and vice versa. More details are given in App. B.1.1. This connection has two important implications.
Firstly, it provides a natural way of constructing kernels based on established distances on the simplex. The intuition is that a distance which is large for observations with vastly different responses and small otherwise leads to an informative feature space H_k. Secondly, it motivates using the kernel-induced distance, see Sec. 2.3.2.

Kernels on the simplex
We consider four types of kernels on the simplex, each related to different types of distances.
A full list with all kernels and induced distances is provided in App. G. While most kernels have previously appeared in the literature, we have adapted many of the kernels to fit into the framework provided here, e.g., added zero-imputation for Aitchison kernels and updated the parametrization for the probability distribution kernels.
Euclidean: These are kernels that are constructed by restricting kernels on R^p to the simplex. Any such restriction immediately guarantees that the restricted kernel is again a kernel. However, the induced distances are not targeted to the simplex and can therefore be unnatural choices. In KernelBiome, we have included the linear kernel and the radial basis function (RBF) kernel. The RBF kernel is L_p-universal (e.g., Sriperumbudur et al., 2011), which means that it can approximate any integrable function (in the large sample limit). However, this does not necessarily imply good performance for finite sample sizes.
Aitchison geometry: One way of incorporating the simplex structure is to use the Aitchison geometry. Essentially, this corresponds to mapping points from the interior of the simplex via the centered log-ratio transform into R^p and then using the Euclidean geometry. This results in the Aitchison kernel, for which the induced RKHS is equal to the class of log-contrast functions. In particular, applying kernel ridge regression with an Aitchison kernel corresponds to fitting a log-contrast model with a penalty on the coefficients. As the centered log-ratio transform is only defined for interior points of the simplex, we add a hyperparameter c to the kernels that shifts the data away from zero. From this perspective, the pseudo-count constant commonly added to all components becomes a tunable hyperparameter of our method, rather than a fixed ad-hoc choice during data pre-processing. Thereby, our modified Aitchison kernel respects the fact that current approaches to zero-replacement or imputation are often not biologically justified, yet may impact predictive performance. Our proposed zero-imputed Aitchison kernel comes with two advantages over standard log-contrast modelling: (1) a principled adjustment for zeros and (2) an efficient form of high-dimensional regularization that performs well across a large range of our experiments. In KernelBiome, we include the Aitchison kernel and the Aitchison-RBF kernel, which combines the Aitchison and RBF kernels.
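For illustration, a minimal sketch of an Aitchison-type kernel with the zero-shift hyperparameter c described above (the exact form and default parameters used in the package may differ):

```python
import numpy as np

def clr(x, c):
    """Centered log-ratio transform of x + c (c > 0 shifts zeros away from the boundary)."""
    v = np.log(x + c)
    return v - v.mean()

def aitchison_kernel(x, y, c=1e-5):
    """Inner product of the clr-transformed (zero-shifted) compositions."""
    return float(np.dot(clr(x, c), clr(y, c)))

def aitchison_rbf_kernel(x, y, c=1e-5, gamma=1.0):
    """RBF kernel applied in the Aitchison (clr) geometry."""
    d = clr(x, c) - clr(y, c)
    return float(np.exp(-gamma * np.dot(d, d)))
```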
Probability distributions: Another approach to incorporate the simplex structure into the kernel is to view points in the simplex as discrete probability distributions. This allows us to make use of the extensive literature on distances between probability distributions to construct kernels. In KernelBiome, we have adapted two classes of such kernels: (1) A parametric class based on generalized Jensen-Shannon distances due to Topsøe (2003), which we call generalized-JS, and (2) a parametric class based on the work by Hein and Bousquet (2005), which we call Hilbertian. Together they contain many well-established distances such as the total variation, Hellinger, Chi-squared, and Jensen-Shannon distance. All resulting kernels allow for zeros in the components of compositions.
Riemannian manifold: Finally, the simplex structure can be incorporated by viewing points in the simplex as parameters of multinomial distributions. Lafferty et al. (2005) show that the geometry of multinomial statistical models can be exploited by using kernels based on the heat equation on a Riemannian manifold. The resulting kernel is known as the heat-diffusion kernel and has been observed to work well with sparse data.

Including prior information into kernels
All kernels introduced in the previous section (and described in detail in App. G) are invariant under permutations of the compositional components. They therefore do not take into account any relation between the components. In many applications, one may however have prior knowledge about the relation between the components. For example, if the compositional predictor consists of relative abundances of microbial species, information about the genetic relation between different species encoded in a phylogenetic tree may be available. We therefore provide the following way to incorporate such relations. Assume the prior information has been expressed as a positive semi-definite weight matrix W ∈ R^{p×p} with non-negative entries (e.g., using the UniFrac distance (Lozupone and Knight, 2005) as shown in App. B.3), where the ij-th entry corresponds to the strength of the relation between components i and j. We can then incorporate W directly into our kernels. To see how this works, consider the special case where the kernel k can be written as k(x, y) = Σ_{i=1}^p k_0(x_i, y_i) for a positive definite kernel k_0 : [0, 1] × [0, 1] → R. Then, the weighted kernel k_W(x, y) := Σ_{i,j=1}^p W_{i,j} k_0(x_i, y_j) is positive definite and incorporates the prior information in a natural way. If two components i and j are known to be related (corresponding to large values of W_{i,j}), the kernel k_W takes the similarity across these components into account. In App. B.2, we show that the probability distribution kernels and the linear kernel can be expressed in this way and propose similar weighted versions for the remaining kernels.
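The weighted construction can be sketched as follows (illustrative; the one-dimensional kernel k_0 below is the linear kernel centered at 1/p, cf. App. B.2.1, and the weight matrix is made up for the example):

```python
import numpy as np

def weighted_sum_kernel(x, y, W, k0):
    """Weighted kernel k_W(x, y) = sum_{i,j} W[i, j] * k0(x_i, y_j)."""
    p = len(x)
    return float(sum(W[i, j] * k0(x[i], y[j]) for i in range(p) for j in range(p)))

# One-dimensional linear kernel centered at the reference value 1/p (here p = 4).
k0_linear = lambda a, b: (a - 0.25) * (b - 0.25)

# Block-structured weight matrix treating components {0, 1} and {2, 3} as related.
W = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.5],
              [0.0, 0.0, 0.5, 1.0]])
x = np.array([0.1, 0.2, 0.3, 0.4])
y = np.array([0.25, 0.25, 0.25, 0.25])
# For this k0, the weighted kernel equals (x - u)^T W (y - u) with u = (1/p, ..., 1/p).
print(weighted_sum_kernel(x, y, W, k0_linear))
```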
An advantage of our framework is that it defaults to the log-contrast model when more complex models fail to improve the prediction. In particular, Proposition 2.2 shows that for a block-structured weight matrix W based on a partition P_1, ..., P_m of the components, the RKHS of the weighted Aitchison kernel consists of log-contrast functions with coefficient vector β ∈ R^p satisfying (1) Σ_{j=1}^p β_j = 0 and (2) for all ℓ ∈ {1, ..., m} and i, j ∈ P_ℓ it holds that β_i = β_j. A proof is given in App. F.3. Combined with Proposition 2.1, this implies that the CFI values are equal across the equally weighted blocks P_1, ..., P_m, which is demonstrated on data in Sec. 3.3.
The KernelBiome framework

The KernelBiome framework proceeds in two steps. First, a kernel k̂ and a corresponding prediction model f̂ are selected in a data-driven way (Sec. 2.3.1). Then, the feature influence measures (CFI, CPD) and the embedding induced by k̂ are analyzed in a way that respects compositionality (Sec. 2.3.2).

Model selection
We propose the following two-step data-driven selection procedure.

1. Select the best kernel k̂ with the following hierarchical CV:
• Fix a candidate kernel k, i.e., a type of kernel and its kernel parameters.
• Split the sample into N_out random (or stratified) folds.
• For each fold, use all other folds to perform an N_in-fold CV to select the best hyperparameter λ̂ and compute a CV score based on k and λ̂ on the left-out fold.
• Select the kernel k̂ with the best average CV score.

2. Select the best hyperparameter λ̂ for k̂ using an N_in-fold CV on the full data. The final estimator f̂ is then given by the kernel predictor based on k̂ and λ̂.
We provide default parameter grids for the kernel parameters of each kernel type (see Table 1 in App. B). The kernel and the hyperparameters are treated separately, because the kernel is itself of interest and is used in subsequent analyses as discussed in the next section.
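The following simplified sketch (illustrative, not the package implementation) shows the hierarchical CV for scoring a single candidate kernel with kernel ridge regression from scikit-learn, using a precomputed Gram matrix K and R² as the CV score; the kernel with the best average outer score would then be selected and λ̂ re-selected on the full data:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold

def cv_score_for_kernel(K, y, lambdas, n_out=10, n_in=5, seed=0):
    """Hierarchical CV: an inner CV picks lambda, the outer folds give the kernel's score."""
    outer_scores = []
    for train, test in KFold(n_out, shuffle=True, random_state=seed).split(y):
        # Inner CV on the training folds to select the penalty parameter.
        inner_scores = {lam: [] for lam in lambdas}
        for itr, ite in KFold(n_in, shuffle=True, random_state=seed).split(train):
            tr, te = train[itr], train[ite]
            for lam in lambdas:
                m = KernelRidge(alpha=lam, kernel="precomputed")
                m.fit(K[np.ix_(tr, tr)], y[tr])
                inner_scores[lam].append(m.score(K[np.ix_(te, tr)], y[te]))
        best_lam = max(lambdas, key=lambda lam: np.mean(inner_scores[lam]))
        # Score the left-out outer fold with the selected lambda.
        m = KernelRidge(alpha=best_lam, kernel="precomputed")
        m.fit(K[np.ix_(train, train)], y[train])
        outer_scores.append(m.score(K[np.ix_(test, train)], y[test]))
    return float(np.mean(outer_scores))
```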

Model analysis
Firstly, as discussed in Sec. 2.1, we propose to analyze the fitted model f̂ with the CFI and CPD. Other methods developed for functions on R^p do not account for compositionality and can be misleading. Secondly, the kernel embedding k̂ can be used for the following two types of analyses.
Distance-based analysis: A key advantage of using kernels is that the fitted kernel k̂ is itself helpful in the analysis. As discussed in Sec. 2.2, k̂ induces a distance on the simplex that is well-suited to separate observations with different response values. We therefore suggest utilizing this distance to investigate the data further. Essentially, any statistical method based on distances can be applied. We specifically suggest using kernel PCA to project the samples into a two-dimensional space. As we illustrate in Sec. 3.2, such a projection can be used to detect specific groups or outliers in the samples and can also help understand how the predictors are used by the prediction model f̂. As we are working with compositional data, we need to be careful when looking at how individual components contribute to each principal component (see App. E.2).

Data-driven scalar summary statistics: Practitioners often work with scalar summaries of the data as these are easy to communicate. A commonly used summary statistic in ecology is α-diversity, which measures the variation within a community. The connection between kernels and distances provides a useful tool to construct informative scalar summary statistics by considering distances to a reference point u in the simplex. Formally, for a fixed reference point u ∈ S^{p-1}, define for all x ∈ S^{p-1} a corresponding closeness measure by D_k(x) := −d_k(x, u)², where d_k is the distance induced by the kernel k. This provides an easily interpretable scalar quantity. For example, if Y is a binary indicator taking values healthy and sick, we could select u to be the geometric median of all X_i with Y_i = healthy. Then, D_k corresponds to a very simple health score (see Sec. 3.2 for a concrete example). A further example is given by selecting u = (1/p, ..., 1/p) and considering points on the simplex as communities. Then, u can be interpreted as the most diverse point in the simplex and D_k corresponds to a data-adaptive α-diversity measure. While such a definition of diversity does not necessarily satisfy all desirable properties for diversities (see e.g., Leinster and Cobbold, 2012), it (1) is symmetric with respect to switching of coordinates, (2) has an intuitive interpretation and (3) is well-behaved when combined with weighted kernels. Connections to established diversities also exist; for example, with the linear kernel D_k corresponds to a shifted version of the Gini-Simpson diversity (i.e., Gini-Simpson(x) = 1 − Σ_j x_j²).
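A minimal sketch of the kernel-induced distance and the closeness measure described above, using the linear kernel for illustration:

```python
import numpy as np

def kernel_distance(k, x, y):
    """Distance induced by a kernel: d_k(x, y) = sqrt(k(x,x) + k(y,y) - 2 k(x,y))."""
    return np.sqrt(max(k(x, x) + k(y, y) - 2.0 * k(x, y), 0.0))

def closeness(k, x, u):
    """Kernel-based closeness D_k(x) = -d_k(x, u)^2 to a reference composition u."""
    return -(kernel_distance(k, x, u) ** 2)

# With the linear kernel and u = (1/p, ..., 1/p), D_k is a shifted Gini-Simpson diversity.
linear_kernel = lambda a, b: float(np.dot(a, b))
x = np.array([0.7, 0.2, 0.05, 0.05])
u = np.full(4, 0.25)
print(closeness(linear_kernel, x, u) + 1 - 1 / 4)  # equals 1 - sum(x**2)
print(1 - np.sum(x ** 2))                          # Gini-Simpson diversity
```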

Implementation
KernelBiome is implemented as a Python (van Rossum and Drake, 2009) package that takes advantage of the high-performance JAX (Bradbury et al., 2018) and scikit-learn (Pedregosa et al., 2011) libraries. All kernels introduced are implemented with JAX's just-in-time compilation and automatically leverage accelerators such as GPU and TPU whenever available.
KernelBiome provides fast computation of all kernels and distance metrics as well as easy-to-use procedures for model selection and comparison, and procedures to estimate CPD and CFI, compute kernel PCA, and estimate scalar summary statistics.

Results
We evaluated KernelBiome on a total of 33 microbiome datasets. All datasets have been previously published, and a full list including details on the pre-processing, prediction tasks and references is provided in Table 2.

State-of-the-art prediction performance
We compare the predictive performance of KernelBiome on all datasets with the following competitors: (i) Baseline, a naive baseline that predicts the majority class for classification and the mean for regression; (ii) SVM-RBF, a support vector machine with the RBF kernel; (iii) Lin/Log-L1, a linear/logistic regression with ℓ1-penalty; (iv) LogCont-L1, a log-contrast regression with ℓ1-penalty, where half of the minimum non-zero relative abundance is added as a pseudo-count to remove zeros; and (v) RF, a random forest with 500 trees. For SVM-RBF, Lin/Log-L1 and RF, we use the scikit-learn implementations (Pedregosa et al., 2011) and choose the hyperparameters (bandwidth, max depth and all penalty parameters) based on a 5-fold CV. For LogCont-L1, we use the c-lasso package (Simpson et al., 2021) and the default CV scheme to choose the penalty parameter. We apply two versions of KernelBiome: (1) the standard version that adaptively chooses the kernel using N_in = 5, N_out = 10 (denoted KernelBiome), and (2) a version restricted to the Aitchison kernel (denoted KB-Aitchison). Two observations stand out: (i) On datasets where the top kernel is selected consistently (e.g., americangut, hiv and tara), KernelBiome generally performs very well and in these cases strongly outperforms both log-contrast-based methods, KB-Aitchison and LogCont-L1. (ii) The predictive performance differs substantially between KB-Aitchison and LogCont-L1, which we see as an indication that the regularization is crucial in microbiome datasets.

Model analysis with KernelBiome
As shown in the previous section, KernelBiome results in fitted models with state-of-the-art prediction performance. This is useful because supervised learning procedures can be used in two types of applications: (1) to learn a prediction model that has a direct application, e.g., as a diagnostic tool, or (2) to learn a predictive model as an intermediate step of an exploratory analysis to find out which factors could be driving the response. As discussed above, (2) requires us to take the compositional nature of the predictors into account to avoid misleading conclusions. We show how the KernelBiome framework can be used to achieve this based on two datasets: (i) cirrhosis, based on a study analyzing the differences in microbial compositions between n = 130 healthy and cirrhotic patients (Qin et al., 2014), and (ii) centralpark, based on a study analyzing the pH concentration using microbial compositions from n = 580 soil samples (Ramirez et al., 2014). Our aim is not to draw novel biological conclusions, but rather to showcase how KernelBiome can be used in this type of analysis.

[Figure caption excerpt] Prediction results (classification and regression tasks) based on a 10-fold train/test split. The kernel selected most often by KernelBiome and its selection frequency: 58% Aitchison for rmp, 50% Aitchison-RBF for camp, 73% Aitchison-RBF for cirrhosis, 67% Aitchison-RBF for genomemed, 45% generalized-JS for centralpark, 100% Aitchison-RBF for americangut, 97% Aitchison-RBF for hiv and 93% Aitchison-RBF for tara.
To reduce the complexity, we screen the data to only include the 50 taxa with the highest absolute CFIs obtained from KernelBiome with the Aitchison kernel (see App. C.3). A further useful quantity is the CFI, which for cirrhosis is shown in Fig. 4 (b, left) (for centralpark see Fig. 10 in App. C.3). CFIs explicitly take the compositional structure into account and have an easy interpretation. For example, "Prevotella timonensis" has a CFI of 0.07, which implies that on average solely increasing "Prevotella timonensis" will lead to a larger predicted response. We therefore believe that CFIs are more trustworthy than relying on, for example, the Gini importance for random forests, which does not have a clear interpretation due to the compositional constraint.
Lastly, one can also use the connection between kernels and distances to construct useful scalar summary statistics. In Fig. 4 (b, right), we use the kernel distance to the geometric median in the healthy subpopulation as a scalar indicator for the healthiness of the microbiome. In comparison with more standard scalar summary statistics such as the Simpson diversity, it is targeted at distinguishing the two groups.

Incorporating prior information
In many applications, in particular in biology, prior information about a system is available and should be incorporated into the data analysis. As we show in Sec. 2.2.2, KernelBiome allows for incorporating prior knowledge on the relation between individual components (e.g., taxa). To illustrate how this works in practice, we again consider the screened cirrhosis dataset. We apply KernelBiome with an Aitchison kernel and c equal to half the minimum non-zero relative abundance without weighting, with a phylum-weighting and with a UniFrac-weighting. The resulting scaled CFI values for each are visualized in Fig. 4 (c). The phylum-weighting corresponds to giving all taxa within the same phylum the same weights, and the UniFrac-weighting is a weighting that incorporates the phylogenetic structure based on the UniFrac distance and is described in App. B.3. As can be seen in Fig. 4 (c), the phylum-weighting assigns approximately the same CFI to each variable in the same phylum; this is expected given that the phylum-weighting has exactly the structure given in Proposition 2.2. Moreover, the UniFrac-weighting leads to CFI values that lie in-between the unweighted and phylum-weighted versions. Similar effects are seen for different kernels as well. The same plots for the generalized-JS kernel are provided in App. C.1.

Discussion and conclusions
In this work, we propose the KernelBiome framework for supervised learning with compositional covariates consisting of two main ingredients: data-driven model selection and model interpretation. Our approach is based on a flexible family of kernels targeting the structure of microbiome data, and is able to work with different kernel-based algorithms such as SVM and kernel ridge regression. It is also possible to incorporate prior knowledge, which is crucial in microbiome data analysis. We compare KernelBiome with other state-of-the-art approaches on 33 microbiome datasets and show that KernelBiome achieves improved or comparable results. Moreover, KernelBiome provides multiple ways to extract interpretable information from the fitted model. Two novel measures, CFI and CPD, can be used to analyze how each component affects the response. We prove the consistency of these two measures and demonstrate them via simulated and real datasets. KernelBiome also leverages the connection between kernels and distances to conduct distance-based analysis in a lower-dimensional space.

Appendices
The supplementary material is divided into the following sections.

A. Details on CFI and CPD
Formal definitions of perturbations and estimators related to CFI and CPD.

B Details on kernels included in KernelBiome
Overview of different kernel types, details on how they connect to distances and description of weighted kernels.

C Details and additional results for experiments in Sec. 3
Datasets pre-processing, parameter setup, construction of the weighting matrices with UniFrac-distance and further experiment results based on the cirrhosis and centralpark datasets.

D Additional experiments with simulated data
Consistency of CFI and CPD and comparison of CFI and CPD with their non-simplex counterparts.

E Background on kernels
Mathematical background on kernels and details on dimensionality and visualization with kernels.

F Proofs

G List of kernels implemented in KernelBiome
A Details on CFI and CPD

A.1 Perturbations
Formally, for any j ∈ {1, ..., p}, x ∈ S^{p-1} and c ∈ [0, ∞), the multiplicative perturbation ψ_j is defined by

ψ_j(x, c) := (x_1, ..., x_{j−1}, c x_j, x_{j+1}, ..., x_p) / (Σ_{ℓ≠j} x_ℓ + c x_j),

and, for c ∈ [0, 1], the fixed coordinate perturbation φ_j is defined by

φ_j(x, c) := ( (1 − c) x_1 / Σ_{ℓ≠j} x_ℓ, ..., (1 − c) x_{j−1} / Σ_{ℓ≠j} x_ℓ, c, (1 − c) x_{j+1} / Σ_{ℓ≠j} x_ℓ, ..., (1 − c) x_p / Σ_{ℓ≠j} x_ℓ ).

A.2 Estimators
We propose to estimate the CFI and CPD with the following two estimators:

Î_j^f := (1/n) Σ_{i=1}^n d/dc f(ψ_j(X_i, c)) |_{c=1}   and   Ŝ_j^f(z) := (1/n) Σ_{i=1}^n ( f(φ_j(X_i, z)) − f(X_i) ).

B Details on kernels included in KernelBiome

B.1 Overview of kernels
In this section, we give additional details on the kernels used in KernelBiome. A full list of all kernels and their corresponding metrics, together with a visualization on S², is given in App. G. As discussed in the main paper, we consider four types of kernels.
• Euclidean These are kernels that are used on Euclidean space but restricted to the simplex. This includes the linear kernel and the RBF kernel.
• Probability distribution These are kernels that are constructed from metrics between probability distributions. KernelBiome includes two parametric classes of kernels, the Hilbertian kernel and the generalized-JS kernel. These kernels correspond to multiple well-known metrics on probabilities such as the chi-squared metric, the total-variation metric, the Hellinger metric and the Jensen-Shannon metric.
• Aitchison geometry These are kernels that are constructed by using the centered logratio transform to project data on the simplex into Euclidean space and then combining it with a Euclidean kernel. KernelBiome includes the Aitchison kernel and the Aitchison RBF kernel. In order to allow for zeros, a small positive number c is added to each coordinate for all observations before applying the centered log-ratio transformation.
• Riemannian manifold These kernels are connected to the simplex via multinomial distributions and have been shown to empirically perform well on sparse text data mapped into the simplex. KernelBiome contains the heat-diffusion kernels.
For each type of kernel there are multiple parameter settings. Although users of the KernelBiome package can freely change the parameters, the default settings for KernelBiome for each type of kernel are provided by the package and are given in Table 1.

B.1.1 Connecting positive definite kernels to metrics
Any fixed kernel k on X induces a semi-metric d_k on X defined for all x, y ∈ X by

d_k(x, y) := sqrt( k(x, x) + k(y, y) − 2 k(x, y) ).

This holds for all positive-definite kernels by Theorem E.2. In particular, this corresponds to the distance between the embedded points in the RKHS H_k, that is,

d_k(x, y) = ‖ k(x, ·) − k(y, ·) ‖_{H_k}.

The feature embedding x ↦ k(x, ·) induced by a kernel therefore preserves the distances d_k.
A useful aspect of kernel methods is that they allow for a post-analysis based on the embedded features, see also App. E.2.
Conversely, given a Hilbertian semi-metric d on X, one can construct a positive definite kernel defined for all x, y ∈ X by

k(x, y) := 1/2 ( d²(x, x_0) + d²(y, x_0) − d²(x, y) ),

where x_0 ∈ X is an arbitrary reference point, such that the distance in the corresponding RKHS H_k is d.
Kernels can be shifted in such a way that the origin in the induced RKHS changes but the metric in (5) remains fixed (see Lemma E.1). A natural origin in the simplex is given by the point u = (1/p, ..., 1/p); we have therefore shifted all kernels such that k(u, ·) ≡ 0 and hence u corresponds to the origin in H_k. In App. E.1, we provide a short overview of the mathematical results that connect kernels and metrics.
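To make the metric-to-kernel construction concrete, the following sketch (illustrative only) builds a kernel from a Hilbertian metric, here the Hellinger distance between compositions, with the reference point u = (1/p, ..., 1/p); by construction k(u, ·) ≡ 0, matching the shift described above:

```python
import numpy as np

def kernel_from_metric(d, x0):
    """Kernel from a Hilbertian semi-metric d and reference point x0 (Theorem E.2):
    k(x, y) = 0.5 * (d(x, x0)^2 + d(y, x0)^2 - d(x, y)^2)."""
    return lambda x, y: 0.5 * (d(x, x0) ** 2 + d(y, x0) ** 2 - d(x, y) ** 2)

# Hellinger distance between compositions (viewed as discrete probability vectors).
hellinger = lambda x, y: float(np.sqrt(0.5 * np.sum((np.sqrt(x) - np.sqrt(y)) ** 2)))

p = 4
u = np.full(p, 1.0 / p)                  # barycenter of the simplex as reference point
k_hell = kernel_from_metric(hellinger, u)
x = np.array([0.7, 0.1, 0.1, 0.1])
print(k_hell(x, u))  # 0.0: the reference point u is mapped to the origin of the RKHS
```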

B.2 Weighted kernels -including prior information
In this section, we discuss how to include prior knowledge, e.g. phylogenetic information, into the simplex kernels. We assume the information is encoded in a matrix W ∈ R p×p where each element corresponds to a measure of similarity between components. That is, W i,j is large if components i and j are similar (or related) and small otherwise. We assume that W is symmetric, positive semi-definite and all entries in W are non-negative.
The linear kernel and all kernels based on probability distributions have the form

k(x, y) = Σ_{i=1}^p k_0(x_i, y_i)

for a one-dimensional positive definite kernel k_0, and we therefore define the weighted version by

k_W(x, y) := Σ_{i,j=1}^p W_{i,j} k_0(x_i, y_j).   (7)

The weighted versions of the remaining kernels are defined individually. A full list of the weighted kernels is given in App. G.1.

B.2.1 Validity of weighted kernels
In order to use the proposed weighted kernels, we need to ensure that they are indeed positive definite. In the following, we prove this for the weighted versions of the linear kernel, the Hilbertian kernel, the Generalized-JS kernel, the RBF kernel and the Aitchison kernel. We do not prove it for the Aitchison RBF kernel and the Heat Diffusion kernel and only note that they appear to be positive definite from our empirical evaluations.
We begin by showing that the kernel defined in (7) is positive definite whenever k_0 : [0, 1] × [0, 1] → R is a positive definite kernel and W has non-negative entries. To see this, fix x_1, ..., x_n ∈ S^{p-1} and α ∈ R^n and denote by K_W ∈ R^{n×n} the kernel Gram matrix based on x_1, ..., x_n and the kernel k_W. Then,

α^⊤ K_W α = Σ_{i,r=1}^n α_i α_r Σ_{j,ℓ=1}^p W_{j,ℓ} k_0(x_i^j, x_r^ℓ) = Σ_{j,ℓ=1}^p W_{j,ℓ} Σ_{i,r=1}^n α_i α_r k_0(x_i^j, x_r^ℓ).

Since k_0 is positive definite, it holds that Σ_{i,r} α_i α_r k_0(x_i^j, x_r^ℓ) ≥ 0 and hence α^⊤ K_W α ≥ 0, since all entries in W are non-negative.
We now go over the individual weighted kernels and argue that they are positive definite.
• Linear kernel: Since R is a Hilbert space with the inner product xy, which induces the metric |x − y|, it follows that the squared distance d²_Linear(x, y) := (x − y)² is Hilbertian as well. Applying Theorem E.1, we know that the distance is of negative type. Thus, based on the one-dimensional squared linear distance d²_Linear, we apply Theorem E.2 with x_0 = 1/p to construct the following positive definite kernel k_0, defined for all x, y ∈ [0, 1] by

k_0(x, y) = 1/2 ( (x − 1/p)² + (y − 1/p)² − (x − y)² ) = (x − 1/p)(y − 1/p).

Comparing this with our weighted linear kernel in App. G.1, we see that the weighted linear kernel has the form (7) and is therefore positive definite by the above argument.
• Hilbertian kernel: As shown by Hein and Bousquet (2005), the distance d_Hilbert (see App. G for its parametric form) is a Hilbertian metric on R_+. Applying Theorem E.2 with x_0 = 1/p results in a positive definite kernel k_0 that, when combined as in (7), results in the proposed weighted Hilbertian kernels in App. G.1. Therefore, we have shown that the weighted Hilbertian kernels are positive definite as long as W has non-negative entries.
• Generalized-JS kernel: Similarly, the weighted generalized-JS kernels in App. G.1 can all be decomposed as in (7) with one-dimensional kernels k_0 on [0, 1]. Topsøe (2003) shows that all these k_0 can be generated using Theorem E.2 with x_0 = 1/p based on Hilbertian metrics. Hence, all weighted generalized-JS kernels are positive definite as long as W has non-negative entries.
• Aitchison kernel: To show that the weighted Aitchison kernel (defined in App. G.1) is positive definite, we first define the mapping Φ : S^{p-1} → R^p by Φ(x) := log((x + c) / g(x + c)), where g denotes the geometric mean. Then, the weighted Aitchison kernel is given by

k(x, y) = Φ(x)^⊤ W Φ(y).

Since W is symmetric and positive semi-definite, there exists M ∈ R^{p×p} such that W = M^⊤ M. Therefore, for any α ∈ R^n and x_1, ..., x_n ∈ S^{p-1} it holds that

Σ_{i,r=1}^n α_i α_r k(x_i, x_r) = Σ_{i,r=1}^n α_i α_r (M Φ(x_i))^⊤ (M Φ(x_r)) = ‖ Σ_{i=1}^n α_i M Φ(x_i) ‖² ≥ 0.

Hence, k is positive definite.
• RBF kernel: Using the symmetry of W, the weighted RBF kernel can be expressed as

k_W(x, y) = exp(−γ (x − y)^⊤ W (x − y)) = exp(−γ x^⊤ W x) exp(−γ y^⊤ W y) · exp(2γ x^⊤ W y) =: A(x, y) · B(x, y).

The function A is a positive definite kernel since it is the inner product of a (one-dimensional) feature mapping. The function B can be shown to be a kernel by considering the Taylor expansion of the exponential function and using that sums and limits of positive definite kernels are again positive definite, together with the fact that W is positive semi-definite.
Therefore, the weighted RBF kernel is positive definite.

B.3 UniFrac-Weighting
In this section, we show how prior information based on the UniFrac distance (Lozupone and Knight, 2005) can be encoded into a weight matrix W ∈ R^{p×p}. Depending on the application at hand, different distances can be used in a similar way. The UniFrac distance is a β-diversity measure that uses phylogenetic information to compare two compositional samples x, y ∈ S^{p-1}. Each element of the sample is hereby placed on a phylogenetic tree, from which a similarity matrix M* ∈ R^{p×p} between the components is derived. We then construct the weight matrix W_UniFrac ∈ R^{p×p} by scaling M* such that the diagonal entries are one, that is,

W_UniFrac := D M* D,

where D = diag(σ_1, ..., σ_p), with σ_i = 1 / √(M*_{i,i}). Since by construction the matrix M* has its largest values on the diagonal, this weight matrix takes values in [0, 1]. Moreover, by construction it remains symmetric and positive semi-definite.
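A small sketch of the rescaling step (the similarity matrix below is made up; in practice it would be derived from the UniFrac distances on the phylogenetic tree):

```python
import numpy as np

def scale_to_unit_diagonal(M):
    """Rescale a symmetric PSD similarity matrix so that its diagonal entries equal one:
    W = D M D with D = diag(1 / sqrt(M_ii))."""
    d = 1.0 / np.sqrt(np.diag(M))
    return M * np.outer(d, d)

# Hypothetical similarity matrix between p = 3 components (values made up for illustration).
M_star = np.array([[4.0, 1.0, 0.5],
                   [1.0, 2.0, 0.3],
                   [0.5, 0.3, 1.0]])
W = scale_to_unit_diagonal(M_star)
print(np.diag(W))  # all ones
```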

C.2 Weighting matrix for weighted KernelBiome
The weight matrices W_UniFrac for the cirrhosis dataset (Qin et al., 2014) and the centralpark dataset (Ramirez et al., 2014) are presented as heatmaps in Fig. 5. Using our proposed weighted kernels (see App. G.1) with the UniFrac-based weight matrix W_UniFrac is different from incorporating the UniFrac distance via kernel convolution as proposed by Zhao et al. (2015).

C.3.2 Model analysis
Here we include the remaining model analysis results for the cirrhosis and centralpark datasets.
As in the main paper, we screened both datasets to only include the 50 taxa with the highest absolute CFIs obtained by KernelBiome with the Aitchison kernel. Kernel PCA plots for the cirrhosis dataset are given in Fig. 9 and the CFI values for the centralpark dataset are given in Fig. 10.
Furthermore, we also provide the missing circle plots here. The extended version of Fig. 4 (c) in the main text with long labels is given in Fig. 11. Fig. 12 is the circle plot for the cirrhosis dataset based on the generalized-JS kernel. Fig. 13 is the circle plot for the centralpark dataset based on the Aitchison kernel.

D Additional experiments with simulated data

D.1 Consistency of CFI and CPD

For each sample size n, we generate datasets (X_1, Y_1), ..., (X_n, Y_n) based on the following two-step generative model.

• Step 1: Generate a random variable X̃ = (X̃_1, ..., X̃_9) such that the three blocks (X̃_1, X̃_2, X̃_3), (X̃_4, X̃_5, X̃_6), and (X̃_7, X̃_8, X̃_9) each exhibit within-block dependence. Then, X_i is constructed by normalizing X̃, that is, X_i = X̃ / Σ_{j=1}^9 X̃_j. The block structure adds a non-trivial correlation structure between the compositional components.
• Step 2: Generate Y_i based on X_i from a model with a known regression function f.
Based on one such dataset, we estimate the CFI and CPD for a fitted KernelBiome estimator (using kernel ridge regression and default settings), and compare the estimates against the population CFI and CPD calculated from the true function f. In Fig. 14, we report the mean squared deviations (MSD) for both CFI and CPD based on 100 such datasets for each sample size.

D.2 Comparing CFI and CPD with permutation importance and partial dependence plots
Two common approaches to assess the importance of individual features are permutation importance (PI) and partial dependence plots (PDP). The PI of the j-th feature is defined as the mean difference between the baseline mean squared error of a fitted model and the average mean squared error obtained after randomly permuting the values of the j-th feature. We compare these measures with the CFI and CPD on two functions f_1 and f_2 (defined in (8)). For f_1, changes in all coordinates affect the function value due to the simplex constraint. For f_2, only changes in x_1 and x_2 affect the function value but not changes in x_3. This is because on the simplex f_2(x) = x_1 / (x_1 + x_2). An importance measure should therefore associate a non-zero value to x_3 for f_1 and zero to x_3 for f_2.

We sample X_i from a compositional log-normal distribution, i.e., X̃_i ∼ LogNormal(0, Id_3) normalized to the simplex, and compute PI, PDP, CFI and CPD for each of the two functions. The results are given in Table 3 and Fig. 15. As expected, the CFI and also the CPD correctly capture the behavior of the two functions.
However, PI and PDP are incorrect in both cases: for f_1 the variable x_3 shows no effect with either PI or PDP, and for f_2 the variable x_3 is falsely assigned a strong negative effect even though it does not affect the function value at all. In Table 3, we additionally report the relative influence (RI) given by E[d/dx_j f̂(X)] due to Friedman (2001). It has the same problems as PI, as it does not take into account the simplex structure.

Table 3: CFI, RI and PI for the two functions f_1 and f_2 defined in (8). Only CFI correctly attributes the effect of x_3 (marked in bold).

E Background on kernels

E.1 Connection between metrics and kernels
Definition E.1 (Metric, semi-metric, and quasi-metric). A function d : X × X → [0, ∞) is called a metric if it satisfies (a) d(x, x) = 0 for all x ∈ X, (b) d(x, y) = d(y, x) for all x, y ∈ X, (c) d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ X, and (d) d(x, y) = 0 implies x = y. It is called a semi-metric if it satisfies (a)-(c), and a quasi-metric if it satisfies (a)-(b).
Definition E.2 (Function of negative type and Hilbertian metric). A quasi-metric d : X × X → R is called of negative type if for all n ∈ N, all x_1, ..., x_n ∈ X, and all c_1, ..., c_n ∈ R with Σ_{i=1}^n c_i = 0, it holds that

Σ_{i,j=1}^n c_i c_j d(x_i, x_j)² ≤ 0.

If d is a (semi-)metric, then d is also called Hilbertian.

Definition E.3 ((Conditionally) positive definite kernels). A symmetric function k : X × X → R (i.e., ∀x, y ∈ X, k(x, y) = k(y, x)) is called a positive definite kernel if and only if for all n ∈ N, all x_1, ..., x_n ∈ X, and all c_1, ..., c_n ∈ R, it holds that

Σ_{i,j=1}^n c_i c_j k(x_i, x_j) ≥ 0.   (10)

It is called a conditionally positive definite kernel if, instead of for all c_1, ..., c_n ∈ R, condition (10) only holds for all c_1, ..., c_n ∈ R with Σ_{i=1}^n c_i = 0.
Lemma E.1. Let X be a non-empty set, fix x_0 ∈ X and let k, k̃ : X × X → R be symmetric functions satisfying for all x, y ∈ X that

k(x, y) = k̃(x, y) − k̃(x, x_0) − k̃(y, x_0) + k̃(x_0, x_0).

Then k is positive definite if and only if k̃ is conditionally positive definite.
Proof. Fix n ∈ N, c_1, ..., c_n ∈ R, and x_0, x_1, ..., x_n ∈ X. Let c_0 = −Σ_{i=1}^n c_i; then we have

Σ_{i,j=0}^n c_i c_j k̃(x_i, x_j) = Σ_{i,j=1}^n c_i c_j k(x_i, x_j).   (12)

Now, if k̃ is conditionally positive definite, then (12) implies that Σ_{i,j=1}^n c_i c_j k(x_i, x_j) ≥ 0, so k is positive definite; if k is positive definite, (12) implies that Σ_{i,j=0}^n c_i c_j k̃(x_i, x_j) ≥ 0, so k̃ is conditionally positive definite. This completes the proof of Lemma E.1.

Lemma E.2 (Shifted conditionally positive definite). Let X be a non-empty set and let k : X × X → R be a positive definite kernel. Then

k̃(x, y) := k(x, y) + f(x) + f(y)

is a conditionally positive definite kernel for all f : X → R.
Proof. The proof follows the exact same argument as the proof of Lemma E.1.
Theorem E.2. Let k : X × X → R and d : X × X → [0, ∞) be functions. If k is a positive definite kernel and d satisfies d²(x, y) = k(x, x) + k(y, y) − 2k(x, y), then d is a Hilbertian semi-metric. On the other hand, for any x_0 ∈ X, if d is a Hilbertian semi-metric and k satisfies

k(x, y) = 1/2 ( d²(x, x_0) + d²(y, x_0) − d²(x, y) ),

then k is a positive definite kernel. The result is due to Schoenberg (1938).
Proof. We start with the first part. Assume that k is a positive definite kernel and d satisfies d²(x, y) = k(x, x) + k(y, y) − 2k(x, y). Then, d is indeed a semi-metric by the following arguments: (a) d(x, x)² = 2k(x, x) − 2k(x, x) = 0; moreover, since k is positive definite, choosing c_1 = 1, c_2 = −1, x_1 = x and x_2 = y in (10) gives k(x, x) + k(y, y) − 2k(x, y) ≥ 0, so d(x, y) ≥ 0 is well-defined. (b) d(x, y) = d(y, x), since k(x, y) = k(y, x). (c) Since k is a positive definite kernel, there exists a feature map φ_k from X to an RKHS H_k, and we have

d(x, y) = ‖ φ_k(x) − φ_k(y) ‖_{H_k}.

Therefore, d(x, z) ≤ d(x, y) + d(y, z) follows from the triangle inequality of a norm.
To show that d is also Hilbertian, take any n ∈ N, any x_1, ..., x_n ∈ X, and any c_1, ..., c_n ∈ R with Σ_{i=1}^n c_i = 0; expanding d(x_i, x_j)² = k(x_i, x_i) + k(x_j, x_j) − 2k(x_i, x_j) and using the positive definiteness of k yields the required inequality. This proves the first part of the theorem.
For the second part, note that d being Hilbertian implies that k̃ := −d²/2 is conditionally positive definite, and that k as defined in the theorem satisfies k(x, y) = k̃(x, y) − k̃(x, x_0) − k̃(y, x_0) + k̃(x_0, x_0). Hence, by Lemma E.1, k is indeed positive definite. This completes the proof of Theorem E.2.

E.2 Dimensionality reduction and visualization with kernels
One important benefit of using the kernel approach is that we can leverage the kernels for dimensionality reduction and visualization, so that one can identify outliers in the data and further investigate them. In this section, we provide a short introduction on how to use kernels for multi-dimensional scaling and connect it to kernel PCA (Schölkopf et al., 2002).
Kernel methods project the compositional data into a (potentially) high-dimensional RKHS H_k, which we now want to project into a low-dimensional Euclidean space R^ℓ (with ℓ ≪ p) such that the lower-dimensional representation preserves information that helps separate the observations of different traits in the RKHS. That is, given observations x_1, ..., x_n ∈ S^{p-1} and a kernel k, we would like to define a map Φ : H_k → R^ℓ such that

Σ_{i,j=1}^n ( ⟨k(x_i, ·), k(x_j, ·)⟩_{H_k} − ⟨Φ(k(x_i, ·)), Φ(k(x_j, ·))⟩_{R^ℓ} )²

is minimized. In matrix notation, this corresponds to solving

argmin_{Z ∈ R^{n×ℓ}} ‖ K − Z Z^⊤ ‖_F²,

where the rows of Z are z_i = Φ(k(x_i, ·)) ∈ R^ℓ for all i ∈ {1, ..., n} and K ∈ R^{n×n} is the kernel Gram matrix. This is similar to classical multidimensional scaling (MDS) but measures the similarity in the RKHS instead of in Euclidean space. By the Eckart-Young theorem (Eckart and Young, 1936), this minimization problem can be solved via the eigendecomposition of the matrix K = V Σ V^⊤, and the optimal solution is Z_opt = (V_1, ..., V_ℓ)(Σ_{:ℓ})^{1/2}, where V_1, ..., V_ℓ are the first ℓ columns of V and Σ_{:ℓ} is the upper-left (ℓ × ℓ)-submatrix of Σ.

The optimal projection Φ_opt is then given for all f ∈ H_k by

Φ_opt(f) = (Σ_{:ℓ})^{−1/2} (V_1, ..., V_ℓ)^⊤ (f(x_1), ..., f(x_n))^⊤.   (13)

This in particular allows projecting a new observation w ∈ S^{p-1} with the same projection, that is, w ↦ Φ_opt(k(w, ·)).

The projection in (13) depends on the origin of the RKHS H_k. To remove this dependence, it may be desirable to consider a centered version of the optimal projection. This can be achieved by considering the RKHS H̃_k consisting of the functions f̃(·) = f(·) − (1/n) Σ_{i=1}^n k(x_i, ·) with f ∈ H_k. To compute the optimal centered projection (13) for the RKHS H̃_k, we only need to perform double centering on the kernel matrix K, i.e., K̃ = H K H, where H = I − (1/n) 1 1^⊤, and replace k(x, ·) by k̃(x, ·) = k(x, ·) − (1/n) Σ_{i=1}^n k(x_i, ·). With the centering step, this procedure is equivalent to kernel PCA (Schölkopf et al., 2002). The steps to obtain the lower-dimensional representation in matrix form are given in Algorithm 1.
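A compact numpy sketch of the centered projection (kernel PCA) described above, starting from a precomputed Gram matrix K (illustrative; Algorithm 1 gives the exact steps used in the paper):

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Lower-dimensional embedding from a kernel Gram matrix via double centering
    and eigendecomposition: returns Z with rows z_i in R^ell."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K_centered = H @ K @ H
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    order = np.argsort(eigvals)[::-1][:n_components]
    V, S = eigvecs[:, order], np.clip(eigvals[order], 0.0, None)
    return V * np.sqrt(S)  # Z_opt = V_ell * Sigma_ell^{1/2}

# Example with a random PSD Gram matrix (illustration only).
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))
K = A @ A.T
Z = kernel_pca(K, n_components=2)
print(Z.shape)  # (10, 2)
```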
Our goal is now to understand how each principal component is affected by changes in the different components of its arguments. For this, fix a principal component r ∈ {1, ..., ℓ} and consider for each j ∈ {1, ..., p} the quantities E[F_r(ψ_j(X, c)) − F_r(X)], where c ∈ (0, 1) and ψ_j is the perturbation defined in App. A. This is very similar in spirit to the CFI but with the derivative replaced by a difference, and it measures how much a perturbation of size c in the j-th component affects the value of the r-th principal component. It is easily estimated by (1/n) Σ_{i=1}^n (F_r(ψ_j(X_i, c)) − F_r(X_i)).

F Proofs

F.1 Proof of Proposition 2.1

Proof. We start with the CFI. Fix j ∈ {1, ..., p} and x ∈ S^{p-1}. We can compute the derivative using the chain rule and the explicit form of the perturbation ψ_j as follows:

d/dc f(ψ_j(x, c)) = ⟨ ∇f(ψ_j(x, c)), d/dc ψ_j(x, c) ⟩
= ⟨ ∇f(ψ_j(x, c)), d/dc [ (1 / (Σ_{ℓ≠j} x_ℓ + c x_j)) (x_1, ..., x_{j−1}, c x_j, x_{j+1}, ..., x_p) ] ⟩
= ⟨ ∇f(ψ_j(x, c)), (−x_j / (Σ_{ℓ≠j} x_ℓ + c x_j)²) (x_1, ..., x_{j−1}, c x_j − (Σ_{ℓ≠j} x_ℓ + c x_j), x_{j+1}, ..., x_p) ⟩.

Evaluating the derivative at c = 1 leads to

d/dc f(ψ_j(x, c))|_{c=1} = ⟨ ∇f(x), −x_j (x_1, ..., x_{j−1}, x_j − 1, x_{j+1}, ..., x_p) ⟩,   (14)

where we used that ψ_j(x, 1) = x and Σ_{ℓ=1}^p x_ℓ = 1. Moreover, the gradient of f in the case of the log-contrast model is given by

∇f(x) = (β_1 / x_1, ..., β_p / x_p).   (15)

Combining (14) and (15) together with the constraint Σ_{k=1}^p β_k = 0 implies that

d/dc f(ψ_j(x, c))|_{c=1} = −x_j Σ_{k≠j} β_k + β_j (1 − x_j) = β_j.
Hence, taking the expectation leads to I_j^f = E[ d/dc f(ψ_j(X, c))|_{c=1} ] = β_j.

F.2 Proof of Theorem 2.1

For (ii), by the triangle inequality it holds that

|Ŝ_j^{f̂_n}(z) − S_j^{f*}(z)| ≤ C_n + D_n,

where C_n := |Ŝ_j^{f̂_n}(z) − Ŝ_j^{f*}(z)| and D_n := |Ŝ_j^{f*}(z) − S_j^{f*}(z)|. We now consider the two terms C_n and D_n separately. First, we bound the C_n term: applying the triangle inequality and a supremum bound together with (17), C_n is bounded (up to a constant) by sup_{x ∈ supp(X)} |f̂_n(x) − f*(x)|. Hence, using the assumption that sup_{x ∈ supp(X)} |f̂_n(x) − f*(x)| → 0 in probability as n → ∞, we get that C_n → 0 in probability as n → ∞. Similarly, for the D_n term: since X_1, ..., X_n and hence φ_j(X_1, z), ..., φ_j(X_n, z) are i.i.d. and bounded, we can apply the weak law of large numbers to get that D_n → 0 in probability as n → ∞.
Finally, combining the convergence of C n and D n with (19) proves (ii) and hence completes the proof of Theorem 2.1.

F.3 Proof of Proposition 2.2
Proof. For this proof, we denote by S^{p-1} the open instead of the closed simplex.