
## Abstract

Advances in next-generation sequencing techniques have allowed researchers to conduct comprehensive research on the microbiome and human diseases, with recent studies identifying associations between the human microbiome and health outcomes for a number of chronic conditions. However, microbiome data structure, characterized by sparsity and skewness, presents challenges to building effective classifiers. To address this, we present an innovative approach for distance-based classification using mixture distributions (DCMD). The method aims to improve classification performance using microbiome community data, where the predictors are composed of sparse and heterogeneous count data. This approach models the inherent uncertainty in sparse counts by estimating a mixture distribution for the sample data and representing each observation as a distribution, conditional on observed counts and the estimated mixture, which are then used as inputs for distance-based classification. The method is implemented within *k*-means and *k*-nearest neighbours frameworks. We develop two distance metrics that produce optimal results. The performance of the model is assessed using simulated and human microbiome study data, with results compared against a number of existing machine learning and distance-based classification approaches. The proposed method is competitive when compared to the other machine learning approaches, and shows a clear improvement over commonly used distance-based classifiers, underscoring the importance of modelling sparsity for achieving optimal results. The range of applicability and robustness make the proposed method a viable alternative for classification using sparse microbiome count data. The source code is available at https://github.com/kshestop/DCMD for academic use.

## Author summary

The uneven performance of conventional distance-based classifiers when using microbiome profiles to predict disease status has motivated us to develop a novel distance-based method that accounts for uncertainty when modeling sparse counts. We propose a classification algorithm that uses mixture distributions to measure normed distances between microbiome distributions, which better models the underlying structure by handling excess zeros and sparsity inherent in microbial abundance counts. Applications of DCMD have shown improved classification performance and robustness, making the proposed method an improved alternative for classification using microbiome data.

**Citation:** Shestopaloff K, Dong M, Gao F, Xu W (2021) DCMD: Distance-based classification using mixture distributions on microbiome data. PLoS Comput Biol 17(3): e1008799. https://doi.org/10.1371/journal.pcbi.1008799

**Editor:** Elhanan Borenstein, University of Washington, UNITED STATES

**Received:** December 10, 2020; **Accepted:** February 15, 2021; **Published:** March 12, 2021

**Copyright:** © 2021 Shestopaloff et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** The datasets analysed in this study are publicly available in the MLRepo repository. The colorectal cancer dataset is at https://knights-lab.github.io/MLRepo/docs/kostic_healthy_tumor.html and the Crohn’s disease dataset is at https://knights-lab.github.io/MLRepo/docs/gevers_control_cd_ileum.html.

**Funding:** W.X. was funded by Natural Sciences and Engineering Research Council of Canada (NSERC Grant RGPIN-2017-06672), Crohn’s and Colitis Canada (CCC Grant CCC-GEMIII), and Helmsley Charitable Trust. K.S. was supported by CCC Grant CCC-GEMIII. M.D. was supported by NSERC Grant RGPIN-2017-06672 and CCC Grant CCC-GEMIII. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests:** The authors have declared that no competing interests exist.

This is a *PLOS Computational Biology* Methods paper.

## Introduction

The increasing accessibility of high-throughput technology has generated a wide array of data types for analysis. One type of data that has recently gained popularity is the microbiome community data, which is composed of site-specific counts for identified bacteria. There is a steadily growing number of studies that have demonstrated associations between the human microbiome and health outcomes, such as inflammatory bowel disease [1], type 2 diabetes [2], and cardiovascular disease [3], making it an important topic of research. However, the presence of sparsity and skewness, which characterizes this type of data, brings a number of challenges to statistical modelling. These challenges have motivated methodological developments that expand the existing algorithms, particularly for classification tasks related to disease risks.

One popular class of approaches often used with microbiome data is distance-based methods, which differentiate and classify samples using distances derived from multivariate measures. Ubiquitous methods include *k*-means [4] and *k*-Nearest Neighbours (*k*-NN) [5], which have been adapted to such data with variable transformations using Euclidean distance, Manhattan distance, and several other measures [6–8]. Other adaptations include the distance-based nearest shrunken centroid (NSC) classifier, which was developed for use with microarray data [9]. NSC takes the average of the relative abundances for each class as class centroids [10,11] and then calculates standardized squared distances between new samples and class centroids.

A number of linear and additive machine learning classifiers, such as LASSO, ridge regression (RR), random forest (RF), gradient boosting (GB), and support vector machines (SVM), are also commonly used for high-throughput data [7,11–13]. Some methods rely on penalization (LASSO and RR) in logistic models [14,15], typically with log-transformed adjusted counts or relative abundances of operational taxonomic units (OTUs) to address skewness [16]. The RF and GB algorithms rely on sequentially constructed classifiers and automatically incorporate feature selection [17,18]. Other recent methodological developments for microbiome data include regression models with a phylogenetic tree-guided penalty term [19] and inverse regression to deal with the over-dispersion of zeros in count data [20]. However, the tree-guided method can be overly influenced by tree information [19], and the phylogenetic tree is not always available. The existing methods incorporate observed count data or relative abundance directly when computing distances or defining covariates, with some kind of transformation of OTUs to account for skewness. None of the methods explicitly accounts for and models the underlying uncertainty inherent in sparse count data.

This paper aims to address these problems in a classification framework, where predictors are sparse and heterogeneous count data. Shestopaloff et al. [21] proposed representing count data using a mixture distribution to analyze the differences between microbiome communities. We extend the method to distance-based classification using mixture distributions (DCMD) that specifically addresses the uncertainty in sparse and low-count data. DCMD measures the distance between the sample-specific distributions of OTUs rather than between counts or relative abundances, which better models the structure of microbiome data for the distance measure. DCMD is also able to handle excess zero counts, which can potentially improve the predictive accuracy when using sparse OTUs. In this paper, we use two simulation studies to show the advantage of DCMD for classification over existing distance metrics and compare it against common machine learning methods. We provide a comprehensive comparison of distance-based classification methods (*k*-means, *k*-NN, and NSC) and machine learning methods (RF, GB, LASSO, RR, and SVM) in different simulation settings, which to our knowledge has not been studied before. We also illustrate the effectiveness of DCMD on two human microbiome studies [22,23]. The paper concludes with a discussion of the merits, drawbacks and the scope of applicability of the proposed methodology.

## Method

In this section we outline the framework of DCMD. The main steps of the model include mixture distribution specification and parameter estimation for modelling observed data, calculation of conditional distributions for each sample, and calculating distances between samples and cluster centres to use in distance-based classification methods. The mixture model and conditional distribution estimation are described in Shestopaloff et al. [21]. It is proposed to model the underlying population rate structure of the observed count data using a mixture distribution with Poisson-Gamma components, then conditioning on observed sample counts and resolution to obtain sample-specific distributions. In the next step, we use the sample-specific distributions for classification by calculating the distances between distributions.

### Model specification and estimation

Microbiome data typically consists of OTU counts, as illustrated in Table 1. The notations used in our method formulation are as follows:

*n*_{ij}, *i* = 1,…,*I* for *j* = 1,…,*J*, the count of the *j*th OTU of the *i*th sample.

*N*_{i}, the total number of aligned reads of sample *i*, *N*_{i} = ∑_{j}*n*_{ij}

Without loss of generality, we focus on a specific OTU and omit the *j*th subscript for subsequent notation. Assume that the observed counts, *n*_{i}, are Poisson distributed with rate *r*_{i} = *q*_{i}*N*_{i}, *i* = 1,…,*I* for sample *i*, where *q*_{i} is the individual-specific relative abundance and is sampled from some general OTU relative abundance distribution *G*_{q}. Then we have,

$$r_i = q_i N_i = (q_i \bar{N})\,\frac{N_i}{\bar{N}} = r_i^{*}\, t_i,$$

where $\bar{N} = \sum_i N_i / I$, and $t_i = N_i / \bar{N}$, with $r_i^{*}$ sampled from $G$, which is the rate normalized to the average sample reads to make sure that the counts are treated on the same scale. Thus, the observed count for a specific site of OTU is

$$n_i \sim \mathrm{Pois}(r_i^{*}\, t_i).$$

Since the distribution of OTU is zero-inflated, skewed, and heavy tailed, we propose a mixture distribution to approximate *G*. For positive rates on a given interval, we specify a set of Gamma components, Γ(*α*, *β*), with shape *α* and rate *β*, to cover the range of the data. To separate structural zeros from low-rate and undetected samples, we include a zero-point mass, *G*_{z} = **1**(*n*_{i} = 0), where *P*(*n*_{i} = 0) = 1. Additionally, for sparse high rates, we define a high-count point mass, *G*_{C+} = **1**(*n*_{i}>*C*), where *P*(*n*_{i}>*C*) = 1, *C* is the truncation point and **1**(∙) is the indicator function. The full set of mixture components is **Ω** = (*G*_{z}, *G*_{1}, *G*_{2},…,*G*_{M}, *G*_{C+}) where *G*_{z} is a zero-point mass, *G*_{m}, *m* = 1,2,…,*M*, is a set of Gamma components Γ(*α*_{m}, *β*_{m}), and *G*_{C+} is a high-count point mass. The process for defining the mixture model components is described in detail in the Simulation section.

Define the weight of each component as

$$\mathbf{w} = (w_z, w_1, w_2, \ldots, w_M, w_{C+}),$$

where *w*_{z} is the weight of the zero-point mass, *w*_{m}, *m* = 1,2,…,*M*, is the weight for the *m*th corresponding Gamma component, and *w*_{C+} is the weight of the high-count point mass. Define $y_x$ as the number of species observed *x* times across all samples for *x* = *z*, 0,1,2,…,*C*,*C*+. Then our goal is to minimize $\sum_x (y_x - \hat{y}_x)^2$, where $\hat{y}_x$ is the expected aggregate counts of *y*_{x}. Note that given Γ(*α*_{m}, *β*_{m}), sample counts conditional on *t*_{i} are distributed as a negative binomial *NB*[*α*_{m}, *β*_{m}/(*t*_{i}+*β*_{m})] [21]. Define

$$p_{xmi} = P(n_i = x \mid G_m, t_i)$$

as the probability of observing count *x* from the *m*th mixture component conditional on the resolution *t*_{i}. Then $\hat{y}_x = I \sum_m w_m p_{xm}$, where *p*_{xm} = ∑_{i}*p*_{xmi}/*I*. Thus, we have the objective function:

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_x \Big( y_x - I \sum_m w_m p_{xm} \Big)^2, \quad \text{subject to } \sum_m w_m = 1,\; w_m \ge 0. \tag{1}$$

The estimate, $\hat{\mathbf{w}}$, is obtained by optimizing the least-squares objective function (1), using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [24] with the augmented Lagrangian method [25] for the constraints.
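As a rough illustration of the least-squares criterion, the sketch below evaluates the objective for a candidate weight vector over the Gamma components only. The function names (`nb_pmf`, `expected_counts`, `ls_objective`) are hypothetical, the two point masses are omitted for brevity, and the constrained optimizer itself is not reproduced; this is a sketch under those assumptions rather than the published implementation.

```python
import math

def nb_pmf(x, alpha, beta, t=1.0):
    """P(n = x) when the rate is Gamma(alpha, rate=beta) and the resolution is t.

    This is the negative binomial NB[alpha, beta / (t + beta)] quoted in the text.
    """
    p = beta / (t + beta)
    log_pmf = (math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1)
               + alpha * math.log(p) + x * math.log(1.0 - p))
    return math.exp(log_pmf)

def expected_counts(weights, components, resolutions, max_count):
    """Expected number of samples with count x = 0..max_count under the mixture."""
    expected = []
    for x in range(max_count + 1):
        total = 0.0
        for w, (alpha, beta) in zip(weights, components):
            total += w * sum(nb_pmf(x, alpha, beta, t) for t in resolutions)
        expected.append(total)
    return expected

def ls_objective(weights, components, counts, resolutions, max_count):
    """Sum of squared differences between observed and expected frequencies."""
    observed = [sum(1 for n in counts if n == x) for x in range(max_count + 1)]
    expected = expected_counts(weights, components, resolutions, max_count)
    return sum((o - e) ** 2 for o, e in zip(observed, expected))
```

A candidate `weights` vector that lowers `ls_objective` fits the observed count frequencies better; any optimizer that respects the sum-to-one and non-negativity constraints could be wrapped around it.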

Due to the sparse nature of the data, we only optimize the weights and fix the Gamma parameters. Attempting to model the low-rate structure by optimizing both weights and Gamma parameters (*α*_{m}, *β*_{m}) via expectation-maximization (EM) results in biased structural zero estimates and a poor overall fit of the low counts [26]. In this context, EM is also prone to numerical issues and convergence to local minima, and can often be too slow computationally for this type of application [26]. BFGS, on the other hand, provides a much faster and more robust alternative.

### Weighted mixture distribution

To address the uncertainty around specifying components for the mixture model, particularly for the low rates where sparsity is often an issue, we define a set of nested models Φ_{l}, *l* = 1,…,*L*, with varying components for modelling the rate structure around zero. We estimate the joint mixture model using a nonparametric bootstrap algorithm. As stated in Shestopaloff et al. [21], we can obtain the weight *v*(*l*) of each candidate model, which is the proportion of times each model is selected as optimal relative to the observed data, and calculate the weights for the joint mixture distribution. Let *w*_{l} be the estimated weights for each candidate model, Φ_{l}, with zeros assigned to the weights of components not included in a specific model, then the weights of the joint model are **w** = ∑_{l} *v*(*l*) *w*_{l}.
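The weighted combination of nested models can be sketched in a few lines. The function name `joint_weights` and the example numbers are hypothetical; the selection proportions *v*(*l*) are assumed to come from a bootstrap that has already been run.

```python
def joint_weights(selection_props, model_weights):
    """Combine per-model weight vectors: w = sum_l v(l) * w_l.

    selection_props: v(l), bootstrap selection proportion of each candidate model.
    model_weights:   w_l, one vector per model over the full component set,
                     with zeros for components a model does not include.
    """
    n_components = len(model_weights[0])
    joint = [0.0] * n_components
    for v, w in zip(selection_props, model_weights):
        for m in range(n_components):
            joint[m] += v * w[m]
    return joint

# Hypothetical example: two nested models over three components.
v = [0.7, 0.3]
w_models = [[0.2, 0.5, 0.3],   # full model
            [0.0, 0.6, 0.4]]   # first component excluded
w_joint = joint_weights(v, w_models)
```

Because each *w*_{l} sums to one and the selection proportions sum to one, the joint weights also sum to one.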

### Sample-specific distribution

Once we have a distribution for the OTU, we can estimate sample-specific distributions by conditioning on the observed count *n*_{i}, estimated mixture weights **w**, and resolution *t*_{i}. We can obtain the probability that sample *i* is sampled from a specific component as:

$$p_{im} = P(i \in G_m \mid n_i, t_i, \mathbf{w}) = \frac{w_m\, P(n_i \mid G_m, t_i)}{\sum_{m'} w_{m'}\, P(n_i \mid G_{m'}, t_i)}. \tag{2}$$
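A minimal sketch of this conditioning step, assuming Bayes' rule over the Gamma components with the negative binomial likelihood *NB*[*α*_{m}, *β*_{m}/(*t*_{i}+*β*_{m})] and leaving the two point masses aside. The names `nb_pmf` and `component_posteriors` are hypothetical.

```python
import math

def nb_pmf(x, alpha, beta, t=1.0):
    """Negative binomial probability NB[alpha, beta / (t + beta)] evaluated at x."""
    p = beta / (t + beta)
    log_pmf = (math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1)
               + alpha * math.log(p) + x * math.log(1.0 - p))
    return math.exp(log_pmf)

def component_posteriors(n_i, t_i, weights, components):
    """Probability that sample i came from each Gamma component, given n_i and t_i."""
    joint = [w * nb_pmf(n_i, alpha, beta, t_i)
             for w, (alpha, beta) in zip(weights, components)]
    total = sum(joint)
    return [value / total for value in joint]
```

For a zero count, for example, most of the posterior mass falls on the low-rate components, which is how the resolution and observed count jointly shape the sample-specific weights.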

The probability of being assigned to the zero-point mass is *P*(*i*∈*G*_{0}) = **1**(*n*_{i} = 0) and to the high-count point mass is *P*(*i*∈*G*_{C+}) = **1**(*n*_{i}>*C*). Define the sample-specific mixture weights as
where

Since the sample-specific weights have been adjusted for the individual resolutions *t*_{i} through the *p*_{im} probabilities, the Poisson-Gamma mixture probabilities are *NB*(*α*, *β*/(1+*β*)). Also note that we have differentiated the zeros in our mixture distribution, which are defined as structural zeros, *x = z*, and observed zeros, *x =* 0. Given the underlying rate distribution from the joint mixture model, we can then calculate the probability of observing count *x* = *z*, 0,1,…,*C*, *C*+ from each mixture component *G*_{m} as

$$P(X = x \mid G_m) = NB\big(x;\ \alpha_m,\ \beta_m/(1+\beta_m)\big).$$

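For the point masses we have *P*(*X* = *x*|*G*_{z}) = **1**(*n*_{i} = 0) and *P*(*X* = *x*|*G*_{C+}) = **1**(*n*_{i}>*C*), respectively. To simplify the representation of the distribution, define a vector of probabilities

$$\mathbf{P}(x) = \big(P(X = x \mid G_z),\ P(X = x \mid G_1),\ \ldots,\ P(X = x \mid G_M),\ P(X = x \mid G_{C+})\big)$$

for *x* = *z*, 0,1,…,*C*, *C*+. Then we can define the discrete probability density for sample *i* as

$$f_i(x) = \mathbf{w}_i^{\top}\, \mathbf{P}(x),$$

where $\mathbf{P} = [\mathbf{P}(z), \mathbf{P}(0), \mathbf{P}(1), \ldots, \mathbf{P}(C), \mathbf{P}(C+)]$.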

The **P**(*x*) vectors in the matrix **P** are the vectors giving the probability of observing *x* from each mixture component, which can be pre-calculated for distance calculations. The overview of how to obtain sample-specific mixture distributions given a set of mixture distribution components is shown in Fig 1.

Fig 1. Workflow for obtaining a sample-specific mixture distribution for each sample *i* in OTU *j*: 1) Specify a set of nested candidate mixture distributions using a specific set of components; 2) Apply bootstrap to the set of nested models and calculate the weights of each candidate mixture model, then calculate the weights of the joint mixture distribution; 3) Estimate sample-specific distributions conditional on *n*_{i}, *t*_{i}, and the joint mixture distribution.
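The last step of this workflow can be sketched as a weighted sum of pre-computed component probabilities. The sketch below assumes the sample-specific weights are already available and covers the Gamma components only; it also assumes that any mass above the truncation point *C* is collected into a single tail entry. The function names are hypothetical.

```python
import math

def nb_pmf(x, alpha, beta):
    """Negative binomial probability NB(alpha, beta / (1 + beta)) evaluated at x."""
    p = beta / (1.0 + beta)
    log_pmf = (math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1)
               + alpha * math.log(p) + x * math.log(1.0 - p))
    return math.exp(log_pmf)

def probability_matrix(components, max_count):
    """One row per component: P(X = x) for x = 0..C, then the tail mass for x > C."""
    rows = []
    for alpha, beta in components:
        row = [nb_pmf(x, alpha, beta) for x in range(max_count + 1)]
        row.append(max(0.0, 1.0 - sum(row)))  # assumed handling of the C+ tail
        rows.append(row)
    return rows

def sample_distribution(sample_weights, prob_matrix):
    """Discrete density of one sample: weighted sum of the component rows."""
    n_cols = len(prob_matrix[0])
    return [sum(w * row[x] for w, row in zip(sample_weights, prob_matrix))
            for x in range(n_cols)]
```

Since the matrix depends only on the components, it can be computed once per OTU and reused for every sample.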

### Classification

Once the distribution for each sample has been computed, we use *k-*means and *k-*Nearest Neighbours (*k*-NN) algorithms for classification. In this section, we outline how to apply these algorithms using two distance measures, discrete *L*^{2} (D-*L*^{2}) norm and continuous cumulative *L*^{2} (CC-*L*^{2}) norm.

#### Distance measures.

Given posterior probability *f*_{i}, cumulative posterior probability *F*_{i}, and an estimated set of weights *w*_{i} for sample *i*, the distance metrics are:

**D-*L*^{2} Norm**:
(3)
where *x* = *z*, 0,1,…,*C*, *C*+. Note that we include the structural zero component, *z*, separately and that the distances only depend on the weights. For multiple predictors, *j* = 1,…,*J*, the total distance between samples *i*_{1} and *i*_{2} is the sum across all predictors.

**CC-*L*^{2} Norm**: (4) where is a matrix with the (*m*_{1}, *m*_{2}) entry set to for each of the continuous mixture components. Details of the derivation can be found in Shestopaloff [26].
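To make the discrete case concrete, the following sketch computes an *L*^{2}-type distance between two samples, assuming the distance for one predictor is the sum of squared differences between the two discrete probability vectors and that the total is the sum over predictors. The exact normalization of Eq (3) and the continuous cumulative form of Eq (4) are not reproduced; the function names are hypothetical.

```python
def discrete_l2(f1, f2):
    """Sum of squared differences between two discrete distributions (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(f1, f2))

def total_distance(dists_1, dists_2):
    """Sum the per-predictor distances across all OTUs for two samples."""
    return sum(discrete_l2(f1, f2) for f1, f2 in zip(dists_1, dists_2))

# Hypothetical example with two OTUs and three support points each.
sample_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
sample_b = [[0.5, 0.3, 0.2], [0.2, 0.4, 0.4]]
d_ab = total_distance(sample_a, sample_b)
```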

#### Distance-based classification.

We use the distances calculated in Eqs (3) and (4) in a *k*-means and *k*-NN framework. In *k*-means, the mean of each class is calculated from the training data and points are classified to the nearest class. In *k*-NN, samples are classified as the mode of the labels from *k* closest neighbours of the training set. The steps of *k*-means and *k*-NN algorithms are as follows:

***K*-means**: To adapt the *k*-means algorithm, we estimate the mean distribution for each class by minimizing the distributional distances between it and the class samples, conditional on a specified distance. Since distances are *L*^{2} norms and only depend on the weights, as shown in Eqs (3) and (4), the mean of the weights for each class gives the optimum. The algorithm is implemented as follows:

- Step 1: Determine the mean of the weights for the *j*th predictor in class *k*, $\bar{\mathbf{w}}_{k,j} = \frac{1}{|N_{k,j}|} \sum_{i \in N_{k,j}} \mathbf{w}_{i,j}$, where |*N*_{k,j}| is the number of samples in class *k* of predictor *j*, *k* = 1,…,*K* and *j* = 1,…,*J*;
- Step 2: Compute the distance to the mean for sample *i* across all predictors;
- Step 3: Predict the label of sample *i* as the closest mean.
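The three steps above can be sketched as a nearest-class-mean classifier on the sample-specific weight vectors. Squared Euclidean distance on the weights is used here purely as a stand-in for the distances in Eqs (3) and (4), and all function names are hypothetical.

```python
def squared_euclidean(w1, w2):
    """Stand-in distance between two weight vectors (assumption, not Eq (3) or (4))."""
    return sum((a - b) ** 2 for a, b in zip(w1, w2))

def class_mean_weights(weight_vectors, labels):
    """Step 1: mean weight vector per class for a single predictor."""
    sums, counts = {}, {}
    for w, y in zip(weight_vectors, labels):
        if y not in sums:
            sums[y] = [0.0] * len(w)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], w)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict_nearest_mean(sample, class_means, distance=squared_euclidean):
    """Steps 2 and 3: sum distances over predictors, then pick the closest class.

    sample:      list of weight vectors, one per predictor.
    class_means: list of dicts (one per predictor) mapping class -> mean weights.
    """
    classes = list(class_means[0].keys())
    totals = {k: sum(distance(w, means[k]) for w, means in zip(sample, class_means))
              for k in classes}
    return min(totals, key=totals.get)
```

With a single predictor, class means of `[0.9, 0.1]` and `[0.1, 0.9]` would assign a new sample with weights `[0.7, 0.3]` to the first class.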

***K*-NN:** After computing the pairwise distances between samples and summing across predictors, these can be used directly to identify the nearest neighbours for classification. The algorithm for *k*-NN is as follows:

- Step 1: Compute the pairwise distance of sample *i*_{1} and *i*_{2}, *i*_{1}, *i*_{2} = 1,…,*I*, *i*_{1} ≠ *i*_{2};
- Step 2: For sample *i*, pick the *k* samples with smallest distance to sample *i*; the optimal *k* can be determined using cross-validation (CV) in the training set or existing heuristics;
- Step 3: Tally the labels of the *k* nearest neighbours; sample *i* is then predicted as the mode of the *k* labels.
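Once the distances from a new sample to every training sample are available, the neighbour selection and vote can be sketched as follows. The function name `knn_predict` is hypothetical, and the distances are assumed to have been computed and summed across predictors beforehand.

```python
from collections import Counter

def knn_predict(distances_to_train, train_labels, k):
    """Predict a label as the mode of the k nearest training labels.

    distances_to_train: distance from the new sample to each training sample.
    Ties in the vote go to the label that appears first among the neighbours.
    """
    order = sorted(range(len(train_labels)), key=lambda i: distances_to_train[i])
    nearest = [train_labels[i] for i in order[:k]]
    return Counter(nearest).most_common(1)[0][0]
```

Choosing an odd *k* for a two-class problem avoids tied votes altogether.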

The overall workflow of DCMD within the *k*-means and *k*-NN frameworks is presented in Fig 2.

Fig 2. For *k*-means (top panel), the distance between the new sample and the mean of class A is smaller than that to the mean of class B, hence the new sample is predicted as class A. For *k*-NN (bottom panel), using 3 nearest neighbours, the new sample is predicted as class A.

### Predictive metrics

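Let $\hat{c}_i$ be the predicted class and $c_i$ the true class for sample *i*. The classification accuracy is defined as the proportion of correctly predicted cases: $\frac{1}{I} \sum_{i=1}^{I} \mathbf{1}(\hat{c}_i = c_i)$. For binary outcomes we also include precision, recall, and F1 score as metrics to measure predictive performance. We count the number of true positives (*TP*), false positives (*FP*), and false negatives (*FN*) and define these metrics as follows [27]:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$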
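For reference, the standard definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), F1 as their harmonic mean) can be computed as in the sketch below. The function name `binary_metrics` and the convention of returning 0 when a denominator is empty are assumptions made for this illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary outcome."""
    tp = fp = fn = correct = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            correct += 1
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive and truth != positive:
            fp += 1
        elif pred != positive and truth == positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}
```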

## Simulation

### Data generation

To evaluate the performance of the DCMD method, we design simulation studies that mimic microbiome community count data and assess classification performance. We simulate a separate mixture distribution for each class, individual sample rates, and resolutions to generate observed counts. For the mixture distribution, the number of components, *M*, is sampled from *Unif*(5, 15). The number of samples to be taken from each mixture component is set by binning samples from a *Beta*(*α*_{b}, *β*_{b}) at uniform intervals, with *α*_{b} varied to give different class means and levels of sparsity and with *β*_{b}~*Unif*(2, 6.5) to control dispersion. The observed counts for each sample are then generated as $n_i \sim \mathrm{Pois}(r_i^{*} t_i)$, where $r_i^{*}$ is the sampled rate and resolution *t*_{i}~*Unif*(2/3, 5/4).
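A simplified version of this generating process is sketched below for a single OTU. It assumes the mixture components and their weights are given directly (the Beta-binning step is not reproduced), draws a rate from the chosen Gamma component, draws a resolution from *Unif*(2/3, 5/4), and samples a Poisson count. The function names and the example components are hypothetical.

```python
import math
import random

def poisson_sample(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def simulate_otu_counts(n_samples, components, weights, rng):
    """Draw counts as Poisson(rate * resolution) with Gamma-mixture rates."""
    counts = []
    for _ in range(n_samples):
        alpha, beta = rng.choices(components, weights=weights, k=1)[0]
        rate = rng.gammavariate(alpha, 1.0 / beta)  # gammavariate takes shape, scale
        t_i = rng.uniform(2 / 3, 5 / 4)             # sample resolution
        counts.append(poisson_sample(rate * t_i, rng))
    return counts

rng = random.Random(0)
counts = simulate_otu_counts(400, [(1, 2), (2, 1), (8, 1)], [0.5, 0.3, 0.2], rng)
```

Shifting weight toward the low-rate components increases the proportion of zeros, which is the lever the scenarios use to vary sparsity.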

We consider two- and three-class outcomes with several simulation scenarios for each case. For the two-class outcome, parameter settings and summary statistics for each scenario are shown in Table 2. Scenarios 1 and 2 have low sparsity data, and scenarios 3 and 4 are highly sparse. Scenarios 1 and 3 have weakly differentiated classes (small difference in *α*_{b}), while scenarios 2 and 4 have strongly differentiated classes (large difference in *α*_{b}). The sample size is *I* = 800, with 400 samples per class and *J* = 25 OTUs. For the three-class outcome, parameter settings and summary statistics are shown in S1 Table. Scenarios 1–3 have strongly differentiated classes, with varying levels of sparsity. The sample size is *I* = 1200, with 400 samples in each class and *J* = 25 OTUs. A null case scenario is also generated by permuting class labels, and performance metrics for each outcome and scenario are computed over 100 simulation replicates.

### Mixture model specification

The specification of the mixture model components should be data-driven, and the main requirement is that the Gammas allow for appropriate coverage of the observed data. We split the count data into five intervals and apply different strategies to specify components on each interval. Modelling of the zeros and low-rate structures is based on [28], and modelling of the higher counts is based on [21].

- Structural zeros: For data with observed zeros, a zero-point mass *P*(*X* = 0) = 1 is included to model zero inflation.
- Low counts (*x*∈[0,1,2,3]): We specify components as Poisson rate posteriors with uniform priors for each of the counts, which is *Γ*(*x*+1, 1). Hence, we include *Γ*(1,1), *Γ*(2,1), *Γ*(3,1) and *Γ*(4,1). The cut-off is set to *x* = 3 because rate posteriors for higher values have a low probability of observing zero, and we want to differentiate the distributions relevant to modelling zero inflation. We also want to examine whether more mass close to zero improves the fit for the low rates. Therefore, we add exponentials with a higher rate, *β*, to have more mass near zero. In this case, we include *Γ*(1,2) in the candidate models. We can potentially include *Γ*(1,3) or other terms in the model, then apply the procedure described above to select the optimal mix. A fuller discussion, drawn from modelling sparse counts for total species estimation, can be found in [28].
- Integer counts (*x*∈[4,5,6,7]): Components in this range are also specified as the rate posterior *Γ*(*x*+1, 1) at integer intervals. This block exists as a buffer to ensure no gaps in the coverage after the low-count distributions, as this can potentially bias the structural zero and low-rate estimates. This is specified until the last integer component has little overlap with the previous low-rate component. In our formulation, we use an upper limit of *x* = 7 as simulations showed negligible differences between *x* = 7 and *x* = 8. The integer components include *Γ*(5,1),…,*Γ*(8,1).
- High counts (*x*∈[8,…,*C*]): The higher counts tend to have a large range, and it’s not practical to specify them on integer intervals. In this case, we set the number of components based on the range of the data and specify *α*_{m} at uniform intervals on a linear-log scale from 8 to *C* = *q*_{p}, a set quantile of the data. Using between 10 and 15 components worked well in past applications [21]. For our modelling, *p* = 0.85 is an effective threshold, which means *C* is the 85% quantile of the sample.
- Extreme high counts (*x*>*C*): These counts are truncated to a point mass *P*(*X*>*C*) = 1, in part because of the low density in this range and the uncertainty in modelling them and in part to decrease computation time.

The mixture model specification heuristics described above are primarily for modelling low-abundance OTUs, which carry most of the information in the microbiome data. Higher abundance OTUs can be modelled by restricting the component specification to higher counts and increasing *p*.

The full model we use for our data includes [*Γ*(1,2), *Γ*(1,1),…,*Γ*(7,1), *Γ*(8,1)], along with varying high-count components. Nested models are generated by progressively excluding *Γ*(1,2), [*Γ*(1,2), *Γ*(1,1)],… for a total of five models. A sample model specification for one of the OTU is presented in S2 Table.
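The low-rate part of this specification can be written down compactly. The sketch below lists the components as (shape, rate) pairs and builds nested candidates by dropping leading low-rate components one at a time; it assumes the five models consist of the full model plus four progressively reduced ones, and it leaves out the data-dependent high-count components and point masses. The function names are hypothetical.

```python
def low_rate_components():
    """Gamma(shape, rate) components: Gamma(1,2) followed by Gamma(1,1)..Gamma(8,1)."""
    return [(1, 2)] + [(shape, 1) for shape in range(1, 9)]

def nested_models(components, n_models=5):
    """Progressively drop leading low-rate components to form nested candidates."""
    return [components[drop:] for drop in range(n_models)]

models = nested_models(low_rate_components())
```

Each candidate would then be fitted to the bootstrap samples, and the selection proportions used to weight the candidates in the joint mixture.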

### Model fitting and comparison methods

The proposed method is compared with *k*-means and *k*-NN using Euclidean and Manhattan distances of relative abundances, distance-based NSC, as well as LASSO, RR, RF, GB, and SVM classifiers [29]. Models are trained using a 60/40 training and test set split [30], with the training set remaining the same for all classifiers within each replicate. For the machine learning methods, we use existing packages and tune the hyper-parameters using cross-validation when appropriate; see details in S1 Text.

### Simulation results

For the two-class outcome, classification accuracy for each model and scenario is presented in Fig 3. The orange boxplots are the results for the proposed DCMD method in a *k*-means and *k*-NN framework. The blue boxplots are the other distance-based methods, including *k*-means and *k*-NN with Euclidean and Manhattan distance and NSC. The green boxplots give results for the machine learning methods, including RF, GB, LASSO, RR, and SVM. The dashed red line gives the average accuracy of the best method in each scenario. The results show that in Scenarios 1 and 2, when sparsity is low, *k*-means with *CC-L*^{2} norm performs best, followed by *k*-means with D-*L*^{2} norm, while in Scenarios 3 and 4, when sparsity is high, *k*-means with D-*L*^{2} norm gives the best performance, followed by *k*-means with *CC-L*^{2} norm. Overall, DCMD in a *k*-means framework with *L*^{2} norms outperforms the other classification methods for all types of signals and data structures for the two-class outcome. Differences in accuracy within the distance-based methods are also progressively more pronounced in favour of DCMD, with *k*-means outperforming *k*-NN. The specialized NSC approach performs similarly to DCMD within the *k*-NN framework. However, NSC generally falls short of *k*-means DCMD and the machine learning methods.

Fig 3. The proposed DCMD method is shown in orange for *k*-means and *k*-NN with D-*L*^{2} and CC-*L*^{2} distances. The other distance-based methods are shown in blue, including *k*-means and *k*-NN with Euclidean and Manhattan distances and NSC. Machine learning methods are shown in green, including random forest (RF), gradient boosting (GB), LASSO, ridge regression (RR), and support vector machine (SVM). The dashed red line gives the average accuracy of the best method in each scenario.

Table 3 shows the summary statistics of the F1 Score over 100 replicates for the two-class outcome. The top results in each scenario are highlighted. Similar to accuracy, DCMD with *L*^{2} norms produces the highest F1 Scores (F1 Score (SD) = 0.68 (0.034), 0.92 (0.017), 0.77 (0.028), 0.95 (0.014) in Scenarios 1–4, respectively), which are better than those of the best machine learning method (GB: 0.64 (0.033), RF and RR: 0.89 (0.019), LASSO: 0.75 (0.030), GB: 0.94 (0.016) in Scenarios 1–4, respectively). DCMD shows consistently good performance in each scenario compared with the other methods.

The classification accuracy of each model and scenario for a three-class outcome is presented in Fig 4. For Scenarios 1–3, the classes are differentiated under varying levels of sparsity, and we observe that DCMD is competitive with the optimal machine learning methods. Although RR has similar predictive accuracy to DCMD in Scenarios 1 and 3, and GB has similar predictive accuracy in Scenario 2, DCMD consistently improves on the best comparison method. None of the models is systematically over-fit, as predictive accuracy in the null case (Scenario 4) is near the baseline accuracy of 0.33.

Fig 4. The proposed DCMD method is shown in orange for *k*-means and *k*-NN with D-*L*^{2} and CC-*L*^{2} distances. The other distance-based methods are shown in blue, including *k*-means and *k*-NN with Euclidean and Manhattan distances and NSC. Machine learning methods are shown in green, including random forest (RF), gradient boosting (GB), LASSO, ridge regression (RR), and support vector machine (SVM). The dashed red line gives the average accuracy of the best method in each scenario.

## Application

### Data description

We test our method on data from two microbiome studies. The first is a study on colorectal cancer reported by [22]. A total of 190 samples (95 pairs) were collected from 95 patients in Vall d’Hebron University Hospital in Barcelona and Genomics Collaborative. The study aimed to identify associations between tumor microbiome and colorectal carcinoma. Both the colorectal adenocarcinoma tissue and adjacent non-affected tissues were collected. The OTU count table generated by 16S amplification was obtained from the Microbiome Learning Repo [12]. Prior to model training, eighteen samples with total reads less than 100 were dropped from the dataset, and we also excluded OTUs with mean relative abundance less than 0.001, resulting in 149 OTUs and 172 samples (86 pairs) used to differentiate tumour and normal tissue. The second study is a case-control study of Crohn’s disease (CD) from a multi-center cohort, which was designed to examine how microbiota contributes to CD pathogenesis [23]. The profiles were obtained using Illumina 16S rRNA sequencing. The dataset was downloaded from the Microbiome Learning Repo [12] and consisted of 140 ileal tissue biopsy samples. Minimal sample depth is set at 100, and OTUs are restricted to less than 90% zero proportion, leaving 140 samples (78 cases and 62 controls) and 31 OTUs for analysis.

### Model fitting and evaluation

For both datasets, we compare our proposed *L*^{2}-norm based *k*-means and *k*-NN classifiers with five other distance-based classifiers (*k*-means-Euclidean, *k*-means-Manhattan, *k*-NN-Euclidean, *k*-NN-Manhattan, NSC) and five machine learning methods (RF, GB, LASSO, RR, SVM). We assess model performance using 10-fold CV. In each iteration, one fold of the data is treated as the test set, and the remaining nine are used for training. The specification of DCMD and other classifiers is the same as that in the simulations (see S1 Text). We calculate accuracy, precision, recall, and F1 score as metrics for comparison.
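The fold construction can be sketched with the standard library alone. The generator below shuffles the sample indices once and yields a (train, test) pair per fold; the function name `kfold_indices` and the fixed seed are assumptions for illustration, and the sketch does not address how paired samples are allocated to folds.

```python
import random

def kfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists for each of n_folds cross-validation folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]  # near-equal fold sizes
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, test
```

Any feature screening or hyper-parameter tuning would be carried out on the `train` indices only, so that the held-out fold stays untouched until evaluation.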

For the colorectal cancer data, we reduce the predictor space for distance-based classifiers by the univariate screening of OTUs with a nonparametric Mann-Whitney U test on the training set. To adjust for multiple comparisons, we use q-values obtained by the Benjamini–Hochberg (BH) method [31] and retain OTUs with q-values less than 0.05 in each training set. The mean number of OTUs selected from each training set is 42 (range: 13–57). For the machine learning approaches, we include all 149 OTUs. For the CD dataset, 31 OTUs are included for all methods.
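The multiple-comparison adjustment in this screening step can be illustrated as follows. The sketch assumes the per-OTU p-values from the Mann-Whitney U test have already been computed on the training set and shows only the Benjamini–Hochberg adjustment and the 0.05 cut-off; the function name `bh_qvalues` and the example p-values are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):              # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Hypothetical p-values for four OTUs.
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = bh_qvalues(pvals)
selected = [j for j, q in enumerate(qvals) if q < 0.05]
```

Running the screening inside each training fold, rather than on the full dataset, keeps the test fold from influencing which OTUs are retained.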

### Applied results

The predictive performance of each classifier for the colorectal cancer and CD studies is presented in Tables 4 and 5, respectively. The accuracy of *k*-means with *D-L*^{2} norm is 0.67 for colorectal cancer and 0.73 for CD, making it the best method for colorectal cancer and the second-best for CD. The F1 scores are also among the highest for both datasets, indicating that DCMD performs consistently well and improves on the other classifiers. Predictive accuracy of *k*-means with *CC-L*^{2} norm is slightly worse, likely due to high zero proportions in the predictors, which is consistent with the simulation results. Similarly, DCMD outperforms Euclidean and Manhattan distances within *k*-means, and *k*-means outperforms *k*-NN overall. Within *k*-NN, accuracy and F1 score indicate that DCMD has predictive performance comparable to Euclidean or Manhattan distances. The NSC approach has an accuracy of 0.67 and a precision of 0.72 for colorectal cancer, with a recall of 0.56 and an F1 score of 0.63, notably lower than that of the *k*-means classifiers. Its performance is unstable in the CD data.

Compared to the machine learning methods, DCMD with *k*-means is superior to RF, GB, LASSO, RR, and SVM in the first dataset. When the analysis is repeated with all classifiers using the OTUs selected by the Mann–Whitney U test (S3 Table), the machine learning methods show improved performance, except for GB. In the second dataset, LASSO and SVM are the best methods with accuracies of 0.74, slightly above the 0.73 achieved by *k*-means with the *D-L*^{2} norm. Otherwise, DCMD *k*-means with the *D-L*^{2} and *CC-L*^{2} norms either matches or outperforms the machine learning approaches.

## Discussion

The results of our simulation studies and microbiome applications indicate that the proposed DCMD method performs well over a range of scenarios, achieving good classification performance when using sparse data as predictors. The predictive accuracy is consistently improved compared to other distances within distance-based classifiers, and the method is either advantageous or competitive compared to a number of machine learning methods under a wide range of scenarios. The improved performance of DCMD on sparse data results from the use of mixture distributions to represent the observed count data: the mixture distributions not only model the underlying uncertainty in the observed sample counts but also account for zero inflation. The improvement is particularly pronounced in comparison to other distances within the regular *k*-means and *k*-NN classifiers.

The performance differences between the *D-L*^{2} and *CC-L*^{2} norms can be attributed to the data structure. In less sparse scenarios, the data structure is better modelled by a continuous rate structure, resulting in a slight advantage for the *CC-L*^{2} metric. In the higher-ZP and low-count scenarios, by contrast, the *D-L*^{2} norm differentiates zeros into structural and non-structural and models expected counts directly, which better captures the structure of the predictors used for differentiation between classes.
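To make the idea of representing an observed count as a distribution more concrete, the following sketch computes a discrete *L*^{2} distance between two such representations. It is an illustration under stated assumptions and not the authors' definition of the *D-L*^{2} or *CC-L*^{2} norm: the mixture is assumed to be a finite Poisson mixture in which a component with rate 0 stands for structural zeros, each observation is represented by its posterior predictive distribution over counts truncated at `max_count`, and the weights and rates used in the usage note are hypothetical values rather than fitted estimates.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of count k; a rate of 0 is a point mass at zero."""
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def posterior_weights(count, weights, rates):
    """P(component | observed count) under the assumed Poisson mixture."""
    joint = [w * poisson_pmf(count, lam) for w, lam in zip(weights, rates)]
    total = sum(joint)
    return [v / total for v in joint]

def as_distribution(count, weights, rates, max_count=60):
    """Represent one observed count as a distribution over counts 0..max_count."""
    post = posterior_weights(count, weights, rates)
    return [sum(p * poisson_pmf(k, lam) for p, lam in zip(post, rates))
            for k in range(max_count + 1)]

def l2_distance(p, q):
    """Discrete L2 distance between two distributions on the same support."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

With hypothetical weights `[0.3, 0.4, 0.3]` and rates `[0.0, 1.0, 20.0]`, an observed zero is split between the structural-zero component and the low-rate component, so `l2_distance(as_distribution(0, w, r), as_distribution(1, w, r))` is smaller than the distance between the representations of 0 and 25. Unlike a Euclidean distance on raw counts, the distance is bounded, so a single very large count cannot dominate the sum over predictors.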

As the DCMD method derives its major improvement from a focus on modelling lower count data and the associated uncertainty, it is necessary to accurately specify an underlying set of mixture components for the low rates. The mixture also has to model low- and high-count data on the same scale, where the density of the latter is often low because high counts are observed over wide, sparsely populated intervals. Moreover, it is not feasible to apply a transformation to make the data denser, due to the abundant zeros and the discrete nature of the low counts. However, the weighting of nested candidate models and the suggested heuristic of specifying higher count distributions on a log-linear scale have worked well in our simulations, as the latter partially mimics the log-transformation commonly applied to such data.
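One way to read the log-linear heuristic is that the rates of the higher-count components are placed at equal steps on the log scale, while the low rates are listed explicitly. The sketch below illustrates that reading only; the function name `candidate_rates` and all numerical values are hypothetical and are not taken from the nested models used in the paper.

```python
def candidate_rates(low_rates, high_min, high_max, n_high):
    """Explicit low rates followed by n_high rates equally spaced on the log scale."""
    if n_high < 2:
        raise ValueError("n_high must be at least 2")
    ratio = (high_max / high_min) ** (1 / (n_high - 1))   # constant multiplicative step
    return list(low_rates) + [high_min * ratio ** i for i in range(n_high)]

# Hypothetical grid: a structural-zero rate, three low rates, then 4, 8, ..., 256.
rates = candidate_rates([0.0, 0.5, 1.0, 2.0], high_min=4.0, high_max=256.0, n_high=7)
```

A constant multiplicative step keeps the grid dense where counts are low and coarse where they are high, which is the same effect a log-transformation would have on the data.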

The proposed DCMD method is formulated in a distance-based framework, so it does not include specific mechanisms for variable selection. Although different sets of predictors can be included in the distance sum, the process is not automated. In our case, we used a simple nonparametric Mann–Whitney U test for feature selection, which worked well in the study data. However, more advanced and specialized methods for feature selection can be applied separately for other applications. Additionally, we note that the model is specified to use microbiome site counts; continuous covariates need to be modelled separately using continuous distributions, while categorical covariates can only be included as dummy variables. These variables will also be treated on the same scale in the distance metric unless specified otherwise.

Despite these drawbacks, we believe that our core contribution, the representation of observations as distributions to reflect uncertainty and the use of distributional distance metrics, will be valuable to anyone analyzing sparse data. This formulation can compensate for some of the disadvantages inherent in distance-based methods to such an extent that it achieves performance competitive with more sophisticated classifiers, as well as with specially designed approaches like NSC. When data are expected to be sparse, particularly within a distance-based framework, the techniques that make DCMD advantageous for classification should be considered for improving model performance.

## Conclusion

In this paper, we present a distance-based classification method for microbiome count data. The DCMD approach models the observed data using mixture distributions and calculates *L*^{2}-norms for distance-based classification algorithms. The method is specifically designed to accurately model low-count structures, addressing the inherent sparsity by representing each observed count as a distribution, and is shown to improve performance in simulation studies and two microbiome applications. The importance of accounting for uncertainty in sparse data is emphasized, and the resulting improvements in classification accuracy when using distributions are demonstrated. The performance of the proposed DCMD is competitive with a number of machine learning methods and significantly better than that of other common metrics in distance-based classification models. The consistent and improved performance across a variety of data structures makes this approach a viable alternative for modelling and classification of microbiome count data, particularly within a distance-based framework.

## Supporting information

### S1 Text. Supporting information for model specification of classifiers.

https://doi.org/10.1371/journal.pcbi.1008799.s001

(DOCX)

### S1 Table. Three-class outcome: the parameter setting for each scenario and the corresponding ZP and mean count for each class over 100 replicates.

https://doi.org/10.1371/journal.pcbi.1008799.s002

(DOCX)

### S2 Table. The nested models of mixture distribution components used in fitting one of the simulated datasets.

https://doi.org/10.1371/journal.pcbi.1008799.s003

(DOCX)

### S3 Table. Dataset 1, Colorectal Cancer: the predictive performance of the 14 classifiers using the OTUs selected by the Mann–Whitney U test on the colorectal cancer data.

https://doi.org/10.1371/journal.pcbi.1008799.s004

(DOCX)

## Acknowledgments

The authors acknowledge and are grateful for the support of the Tomcyzk AI and Microbiome Working Group.

## References

- 1. Morgan XC, Tickle TL, Sokol H, Gevers D, Devaney KL, Ward DV, et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012;13:R79. pmid:23013615
- 2. Karlsson FH, Tremaroli V, Nookaew I, Bergström G, Behre CJ, Fagerberg B, et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature. 2013;498:99–103. pmid:23719380
- 3. Shreiner AB, Kao JY, Young VB. The gut microbiome in health and in disease. Curr Opin Gastroenterol. 2015;31:69–75. pmid:25394236
- 4. Cam LML, Neyman J. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability: Biology and problems of health. University of California Press; 1967;281–297.
- 5. Zhang Z. Introduction to machine learning: k-nearest neighbors. Ann Transl Med. 2016;4. pmid:27386492
- 6. Liu Z, Hsiao W, Cantarel BL, Drábek EF, Fraser-Liggett C. Sparse distance-based learning for simultaneous multiclass classification and feature selection of metagenomic data. Bioinformatics. 2011;27:3242–9. pmid:21984758
- 7. Statnikov A, Henaff M, Narendra V, Konganti K, Li Z, Yang L, et al. A comprehensive evaluation of multicategory classification methods for microbiomic data. Microbiome. 2013;1:11. pmid:24456583
- 8. Rosenthal M, Aiello AE, Chenoweth C, Goldberg D, Larson E, Gloor G, et al. Impact of Technical Sources of Variation on the Hand Microbiome Dynamics of Healthcare Workers. PLoS One. 2014;9. pmid:24551205
- 9. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences. 2002;99:6567–72. pmid:12011421
- 10. Zhang X, Zhao Y, Xu J, Xue Z, Zhang M, Pang X, et al. Modulation of gut microbiota by berberine and metformin during the treatment of high-fat diet-induced obesity in rats. Scientific reports. 2015;5:14405. pmid:26396057
- 11. Knights D, Costello EK, Knight R. Supervised classification of human microbiota. FEMS Microbiology reviews. 2011;35:343–59. pmid:21039646
- 12. Vangay P, Hillmann BM, Knights D. Microbiome Learning Repo (ML Repo): A public repository of microbiome regression and classification tasks. Gigascience. 2019;8. pmid:31042284
- 13. Galkin F, Aliper A, Putin E, Kuznetsov I, Gladyshev VN, Zhavoronkov A. Human microbiome aging clocks based on deep learning and tandem of permutation feature importance and accumulated local effects. preprint. Bioinformatics; 2018. https://doi.org/10.1101/507780
- 14. Tibshirani R. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996;58:267–88.
- 15. Hoerl AE, Kennard RW. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics. 1970;12:55–67.
- 16. Weiss S, Xu ZZ, Peddada S, Amir A, Bittinger K, Gonzalez A, et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome. 2017;5:27. pmid:28253908
- 17. Breiman L. Random Forests. Machine Learning. 2001;45:5–32.
- 18. Friedman JH. Greedy Function Approximation: A Gradient Boosting Machine. Annals of statistics. 2001;29:1189–232.
- 19. Wang T, Zhao H. Constructing Predictive Microbial Signatures at Multiple Taxonomic Levels. Journal of the American Statistical Association. 2017;112:1022–31.
- 20. Wang T, Yang C, Zhao H. Prediction analysis for microbiome sequencing data. Biometrics. 2019;75:875–84. pmid:30994187
- 21. Shestopaloff K, Escobar MD, Xu W. Analyzing differences between microbiome communities using mixture distributions: Analyzing Differences Between Microbiome Communities. Statistics in Medicine. 2018;37:4036–53. pmid:30039541
- 22. Kostic AD, Gevers D, Pedamallu CS, Michaud M, Duke F, Earl AM, et al. Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Genome research. 2012;22:292–98. pmid:22009990
- 23. Gevers D, Kugathasan S, Denson LA, et al. The treatment-naive microbiome in new-onset Crohn’s disease. Cell host & microbe. 2014;15:382–92. pmid:24629344
- 24. Nocedal J. Updating Quasi-Newton Matrices with Limited Storage. Mathematics of Computation. 1980;35:773–82.
- 25. Conn AR, Gould NIM, Toint PL. A Globally Convergent Augmented Lagrangian Algorithm for Optimization with General Constraints and Simple Bounds. SIAM Journal on Numerical Analysis. 1991;28:545–72.
- 26. Shestopaloff K. Analysis of Ecological Communities Using Mixture Models [PhD thesis]. Toronto, Canada: University of Toronto. 2017.
- 27. Goutte C, Gaussier E. A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. In: Losada DE, Fernández-Luna JM, editors. Advances in Information Retrieval. Berlin, Heidelberg: Springer. 2005;345–59. https://doi.org/10.1016/j.ijmedinf.2004.04.017 pmid:15694638
- 28. Shestopaloff K, Xu W, Escobar MD. Estimating total species using a weighted combination of expected mixture distribution component counts. Environmental and Ecological Statistics. 2020;27:447–65.
- 29. Suykens JA, Vandewalle J. Least squares support vector machine classifiers. Neural processing letters. 1999;9:293–300.
- 30. Pedrycz W, Skowron A, Kreinovich V. Handbook of granular computing. John Wiley & Sons. 2008;133–36.
- 31. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Methodological). 1995;57:289–300.