DCMD: Distance-based classification using mixture distributions on microbiome data

Recent advances in next-generation sequencing techniques have allowed researchers to conduct comprehensive research on the microbiome and human disease, with recent studies identifying associations between the human microbiome and health outcomes for a number of chronic conditions. However, the structure of microbiome data, characterized by sparsity and skewness, presents challenges to building effective classifiers. To address this, we present an innovative approach for distance-based classification using mixture distributions (DCMD). The method aims to improve classification performance when using microbiome community data, where the predictors are composed of sparse and heterogeneous count data. The approach models the inherent uncertainty in sparse counts by estimating a mixture distribution for the sample data and representing each observation as a distribution, conditional on the observed count and the estimated mixture; these distributions are then used as inputs for distance-based classification. The method is implemented within k-means and k-nearest neighbours classification frameworks, and we develop two distance metrics for comparing the resulting distributional representations. The performance of the model is assessed using simulated and human microbiome study data, with results compared against a number of existing machine learning and distance-based classification approaches. The proposed method is competitive with the other machine learning approaches and shows a clear improvement over commonly used distance-based classifiers, underscoring the importance of modelling sparsity for achieving optimal results. Its range of applicability and robustness make the proposed method a viable alternative for classification using sparse microbiome count data. The source code is available at https://github.com/kshestop/DCMD for academic use.


Introduction
The increasing accessibility of high-throughput technology has generated a wide array of data types for analysis. One type that has recently gained popularity is microbiome community data, composed of site-specific counts of identified bacteria. A steadily growing number of studies have demonstrated associations between the human microbiome and health outcomes, such as inflammatory bowel disease [1], type 2 diabetes [2], and cardiovascular disease [3], making it an important topic of research. However, the sparsity and skewness that characterize this type of data bring a number of challenges to statistical modelling. These challenges have motivated methodological developments that extend existing algorithms, particularly for classification tasks related to disease risk.
One popular class of approaches often used with microbiome data is distance-based methods, which differentiate and classify samples using distances derived from multivariate measures. Ubiquitous examples include k-means [4] and k-Nearest Neighbours (k-NN) [5], which have been adapted to such data with variable transformations using Euclidean, Manhattan, and several other distance measures [6-8]. Another adaptation is the distance-based nearest shrunken centroid (NSC) classifier, originally developed for microarray data [9]. NSC takes the average of the relative abundances in each class as the class centroid [10,11] and then calculates standardized squared distances between new samples and the class centroids.
A number of linear and additive machine learning classifiers, such as LASSO, ridge regression (RR), random forest (RF), gradient boosting (GB), and support vector machines (SVM), are also commonly used for high-throughput data [7,11-13]. Some methods rely on penalization (LASSO and RR) in logistic models [14,15], typically with log-transformed adjusted counts or relative abundances of operational taxonomic units (OTUs) to address skewness [16]. The RF and GB algorithms rely on ensembles of classification trees and automatically incorporate feature selection [17,18]. Other recent methodological developments for microbiome data include regression models with a phylogenetic tree-guided penalty term [19] and inverse regression to deal with the over-dispersion of zeros in count data [20]. However, the tree-guided method can be overly influenced by the tree information [19], and a phylogenetic tree is not always available. The existing methods incorporate observed count data or relative abundances directly when computing distances or defining covariates, applying some form of transformation to the OTUs to account for skewness. None of these methods explicitly accounts for and models the underlying uncertainty inherent in sparse count data.
This paper aims to address these problems in a classification framework, where predictors are sparse and heterogeneous count data. Shestopaloff et al. [21] proposed representing count data using a mixture distribution to analyze the differences between microbiome communities. We extend the method to distance-based classification using mixture distributions (DCMD) that specifically addresses the uncertainty in sparse and low-count data. DCMD measures the distance between the sample-specific distributions of OTUs rather than between counts or relative abundances, which better models the structure of microbiome data for the distance measure. DCMD is also able to handle excess zero counts, which can potentially improve the predictive accuracy when using sparse OTUs. In this paper, we use two simulation studies to show the advantage of DCMD for classification over existing distance metrics and compare it against common machine learning methods. We provide a comprehensive comparison of distance-based classification methods (k-means, k-NN, and NSC) and machine learning methods (RF, GB, LASSO, RR, and SVM) in different simulation settings, which to our knowledge has not been studied before. We also illustrate the effectiveness of DCMD on two human microbiome studies [22,23]. The paper concludes with a discussion of the merits, drawbacks and the scope of applicability of the proposed methodology.

Method
In this section we outline the DCMD framework. The main steps of the method are: specifying a mixture distribution and estimating its parameters to model the observed data; calculating a conditional distribution for each sample; and computing distances between samples and cluster centres for use in distance-based classification. The mixture model and conditional distribution estimation are described in Shestopaloff et al. [21]. The underlying population rate structure of the observed count data is modelled using a mixture distribution with Poisson-Gamma components; conditioning on the observed sample counts and resolutions then yields sample-specific distributions. In the final step, we use the sample-specific distributions for classification by calculating distances between distributions.

Model specification and estimation
Microbiome data typically consist of OTU counts, as illustrated in Table 1. The notation used in our method formulation is as follows: n_ij, i = 1,...,I, j = 1,...,J, is the count of the jth OTU in the ith sample, and N_i = Σ_j n_ij is the total number of aligned reads for sample i. Without loss of generality, we focus on a specific OTU and omit the subscript j in subsequent notation. Assume that the observed counts n_i are Poisson distributed with rate r_i = q_i N_i, i = 1,...,I, where q_i is the individual-specific relative abundance, sampled from some general OTU relative abundance distribution G_q. Then we have n_i ~ Poisson(r*_i t_i), where t_i = N_i / N̄ is the sample resolution, N̄ is the average number of reads per sample, and r*_i = q_i N̄ is the rate normalized to the average sample reads, ensuring that the counts are treated on the same scale across samples. Let G denote the distribution of the normalized rates r*_i. Since the distribution of an OTU is zero-inflated, skewed, and heavy-tailed, we propose a mixture distribution to approximate G. For positive rates on a given interval, we specify a set of Gamma components, Γ(α, β), with shape α and rate β, chosen to cover the range of the data. To separate structural zeros from low-rate and undetected samples, we include a zero-point mass, where P(n_i = 0) = 1. Additionally, for sparse high rates, we define a high-count point mass, where P(n_i > C) = 1, C is the truncation point, and 1(·) is the indicator function. The full set of mixture components is Ω = (G_z, G_1, G_2,...,G_M, G_C+), where G_z is the zero-point mass, G_m, m = 1,2,...,M, is a set of Gamma components Γ(α_m, β_m), and G_C+ is the high-count point mass. The process for defining the mixture model components is described in detail in the Simulation section.
Define the weights of the mixture components as w = (w_z, w_1,...,w_M, w_C+), subject to Σ_x w_x = 1 and w_x ≥ 0, where w_z is the weight of the zero-point mass, w_m, m = 1,2,...,M, is the weight of the mth Gamma component, and w_C+ is the weight of the high-count point mass. Let y_x denote the number of samples observed with count x, aggregated across all samples, for x = z, 0, 1, 2,...,C, C+. Our goal is then to minimize the least-squares objective Σ_{x=0}^{C+} [y_x − y_x^E]^2 (1), where y_x^E is the expected aggregate count corresponding to y_x under the mixture. Note that, given Γ(α_m, β_m), sample counts conditional on t_i are distributed as a negative binomial, NB[α_m, β_m/(t_i + β_m)] [21], so y_x^E can be computed from the component weights, the negative binomial probabilities, and the point masses. The estimate, ŵ, is obtained by optimizing the least-squares objective function (1) using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [24] with the augmented Lagrangian method [25] for the constraints.
Due to the sparse nature of the data, we optimize only the weights and fix the Gamma parameters. Attempting to model the low-rate structure by optimizing both the weights and the Gamma parameters (α_m, β_m) via expectation-maximization (EM) results in biased structural-zero estimates and a poor overall fit for the low counts [26]. In this context, EM is also prone to numerical issues and convergence to local minima, and can be too slow computationally for this type of application [26]. BFGS, on the other hand, provides a much faster and more robust alternative.
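As an illustration of the estimation step, the following sketch fits the mixture weights by constrained least squares. It substitutes SciPy's SLSQP solver for the BFGS/augmented-Lagrangian routine described above, and the component parameters and data are hypothetical:

```python
import numpy as np
from scipy.stats import nbinom
from scipy.optimize import minimize

def fit_mixture_weights(counts, t, alphas, betas, C):
    """Least-squares fit of mixture weights: (zero mass, M Gamma components, high-count mass)."""
    xs = np.arange(0, C + 1)
    # observed aggregate counts y_x for x = 0..C, plus the truncated tail x > C
    y = np.array([np.sum(counts == x) for x in xs] + [np.sum(counts > C)], dtype=float)
    M = len(alphas)
    # p[i, m, x]: P(n_i = x | Gamma_m, t_i) via the negative binomial NB[a_m, b_m/(t_i + b_m)]
    p = np.array([[nbinom.pmf(xs, a, b / (b + ti)) for a, b in zip(alphas, betas)]
                  for ti in t])
    tail = 1.0 - p.sum(axis=2)            # P(n_i > C | Gamma_m, t_i)

    def objective(w):
        wz, wm, wc = w[0], w[1:M + 1], w[M + 1]
        yE = (p * wm[None, :, None]).sum(axis=(0, 1))   # expected counts for x = 0..C
        yE[0] += wz * len(t)                            # zero mass contributes only to x = 0
        yE_tail = (tail * wm[None, :]).sum() + wc * len(t)
        return np.sum((y - np.append(yE, yE_tail)) ** 2)

    w0 = np.full(M + 2, 1.0 / (M + 2))
    return minimize(objective, w0, method="SLSQP", bounds=[(0, 1)] * (M + 2),
                    constraints={"type": "eq", "fun": lambda w: w.sum() - 1}).x
```

Here structural and observed zeros are pooled into a single aggregate bin for fitting, which is a simplification of the paper's treatment.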

Weighted mixture distribution
To address the uncertainty in specifying components for the mixture model, particularly for the low rates where sparsity is often an issue, we define a set of nested models F_l, l = 1,...,L, with varying components for modelling the rate structure around zero. We estimate the joint mixture model using a nonparametric bootstrap algorithm. As described in Shestopaloff et al. [21], we obtain a weight v(l) for each candidate model, defined as the proportion of bootstrap replicates in which that model is selected as optimal for the observed data, and use these to calculate the weights of the joint mixture distribution. Let w_l be the estimated weight vector for candidate model F_l, with zeros assigned to the weights of components not included in that model; the weights of the joint model are then w = Σ_l v(l) w_l.

Sample-specific distribution
Once we have a distribution for the OTU, we can estimate sample-specific distributions by conditioning on the observed count n_i, the estimated mixture weights w, and the resolution t_i. The probability that sample i was drawn from Gamma component G_m is p_im = w_m NB[n_i; α_m, β_m/(t_i + β_m)] / Σ_m' w_m' NB[n_i; α_m', β_m'/(t_i + β_m')]. The probability of being assigned to the zero-point mass is P(i ∈ G_z) = 1(n_i = 0), and to the high-count point mass, P(i ∈ G_C+) = 1(n_i > C). Define the sample-specific mixture weights as w_i = (w_i,z, p_i,1,...,p_i,M, w_i,C+). Since the sample-specific weights have been adjusted for the individual resolutions t_i through the p_im probabilities, the Poisson-Gamma mixture probabilities are evaluated at unit resolution as NB[α_m, β_m/(1 + β_m)]. Also note that we differentiate the zeros in our mixture distribution into structural zeros, x = z, and observed zeros, x = 0. Given the underlying rate distribution from the joint mixture model, we can then calculate the probability of observing count x = z, 0, 1,...,C, C+ from each Gamma component G_m as P(X = x | G_m) = NB[x; α_m, β_m/(1 + β_m)]. For the point masses, we have P(X = z | G_z) = 1 and P(X = C+ | G_C+) = 1, respectively. To simplify the representation of the distribution, define a vector of probabilities P(x) = [P(x | G_z), P(x | G_1),...,P(x | G_M), P(x | G_C+)] for x = z, 0, 1,...,C, C+. The discrete probability density for sample i is then f_i(x) = w_i^T P(x), so that, for example, f_i(z) = w_i,z. The P(x) vectors form the columns of a matrix P, giving the probability of observing x from each mixture component, which can be pre-calculated for distance calculations. An overview of how to obtain sample-specific mixture distributions given a set of mixture distribution components is shown in Fig 1.
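As a sketch, the conditioning step can be written in Python. The handling of the structural-zero mass below (normalizing it against the Gamma components when the observed count is zero) is one plausible reading of the method, and all parameter values are hypothetical:

```python
import numpy as np
from scipy.stats import nbinom

def sample_weights(n_i, t_i, w, alphas, betas, C):
    """Sample-specific mixture weights: posterior component probabilities
    given the observed count n_i, resolution t_i, and fitted weights w."""
    M = len(alphas)
    out = np.zeros(M + 2)
    if n_i > C:
        out[-1] = 1.0                                  # assigned to the high-count mass
        return out
    lik = nbinom.pmf(n_i, alphas, betas / (betas + t_i))
    out[1:-1] = w[1:-1] * lik                          # Gamma components
    if n_i == 0:
        out[0] = w[0]                                  # structural-zero mass (likelihood 1 at zero)
    return out / out.sum()
```

The resulting weight vector can then be multiplied by the pre-calculated matrix P to obtain the sample-specific density f_i.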

Classification
Once the distribution for each sample has been computed, we use the k-means and k-Nearest Neighbours (k-NN) algorithms for classification. In this section, we outline how to apply these algorithms using two distance measures, the discrete L2 (D-L2) norm and the continuous cumulative L2 (CC-L2) norm.
Distance measures. Given the discrete density f_i, the cumulative distribution F_i, and the estimated sample-specific weights w_i for sample i, the two distance metrics are defined as follows.

D-L2 norm: d(i_1, i_2) = Σ_x [f_i1(x) − f_i2(x)]^2 (3), where x = z, 0, 1,...,C, C+. Note that we include the structural zero component, z, separately, and that since f_i(x) = w_i^T P(x), the distances depend only on the weights. For multiple predictors, j = 1,...,J, the total distance between samples i_1 and i_2 is the sum across all predictors, d(i_1, i_2) = Σ_j d_j(i_1, i_2).

CC-L2 norm: d(i_1, i_2) = (w_i1 − w_i2)^T G (w_i1 − w_i2) (4), where G is the matrix with (m_1, m_2) entry ∫ G_m1(x) G_m2(x) dx for each pair of continuous mixture components. Details of the derivation can be found in Shestopaloff [26].
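A minimal sketch of these distance computations, assuming the component probability matrix P is evaluated at unit resolution and the Gram matrix for the CC-L2 form is precomputed (both the helper names and the grid are assumptions):

```python
import numpy as np
from scipy.stats import nbinom

def component_pmf_matrix(alphas, betas, C):
    """Columns: P(x | component) for x = z, 0..C, C+; components: G_z, Gammas, G_C+."""
    xs = np.arange(0, C + 1)
    M = len(alphas)
    P = np.zeros((C + 3, M + 2))
    P[0, 0] = 1.0                                  # structural-zero mass puts all mass on x = z
    for m, (a, b) in enumerate(zip(alphas, betas), start=1):
        pm = nbinom.pmf(xs, a, b / (b + 1.0))      # unit-resolution NB probabilities
        P[1:C + 2, m] = pm
        P[C + 2, m] = 1.0 - pm.sum()               # tail mass x > C
    P[C + 2, M + 1] = 1.0                          # high-count mass
    return P

def d_l2(w1, w2, P):
    """Discrete L2 distance between the sample densities f = P @ w."""
    d = P @ (w1 - w2)
    return float(d @ d)

def cc_l2(w1, w2, G):
    """Quadratic-form distance using a precomputed Gram matrix of component inner products."""
    d = w1 - w2
    return float(d @ G @ d)
```

For multiple predictors, per-OTU distances from either function would simply be summed.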
Distance-based classification. We use the distances calculated in Eqs (3) and (4) in k-means and k-NN frameworks. In k-means, the mean of each class is calculated from the training data and points are classified to the nearest class mean. In k-NN, samples are classified as the mode of the labels of the k closest neighbours in the training set. The steps of the two algorithms are as follows.

K-means: To adapt the k-means algorithm, we estimate the mean distribution for each class by minimizing the distributional distances between it and the class samples, conditional on a specified distance. Since the distances are L2 norms and depend only on the weights, as shown in Eqs (3) and (4), the mean of the weights within each class gives the optimum. The algorithm is implemented as follows:
Step 1: Compute the mean of the weights for the jth predictor in class k, w̄_k,j = (1/|N_k,j|) Σ_{i ∈ N_k,j} w_i,j, where |N_k,j| is the number of samples in class k for predictor j, k = 1,...,K and j = 1,...,J.
Step 2: Compute the distance from sample i to each class mean, summed across all predictors, d(i, k) = Σ_j d_j(w_i,j, w̄_k,j).
Step 3: Predict the label of sample i as the closest class mean, ŷ_i = argmin_k d(i, k).

K-NN: After computing the pairwise distances between samples and summing across predictors, these can be used directly to identify the nearest neighbours for classification. The algorithm is as follows:
Step 1: Compute the pairwise distance between samples i_1 and i_2, d(i_1, i_2) = Σ_j d_j(i_1, i_2), for i_1, i_2 = 1,...,I, i_1 ≠ i_2.
Step 2: For sample i, pick the k training samples with the smallest distance to sample i; the optimal k can be determined using cross-validation (CV) on the training set or existing heuristics.
Step 3: Tally the labels of the k nearest neighbours; sample i is predicted as the mode of the k labels.
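The steps above can be sketched as follows, with each sample represented by its weight vector for a single predictor (summing distances across multiple predictors is omitted for brevity, and the function names are illustrative):

```python
import numpy as np
from collections import Counter

def kmeans_classify(w_new, train_w, train_y, dist):
    """Assign to the class whose mean weight vector is closest (means are optimal for L2 norms)."""
    classes = sorted(set(train_y))
    means = {k: np.mean([w for w, y in zip(train_w, train_y) if y == k], axis=0)
             for k in classes}
    return min(classes, key=lambda k: dist(w_new, means[k]))

def knn_classify(w_new, train_w, train_y, dist, k=3):
    """Assign the modal label among the k nearest training samples."""
    nearest = sorted(range(len(train_w)), key=lambda i: dist(w_new, train_w[i]))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]
```

Any of the distances defined above can be passed in as `dist`.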
The overall workflow of DCMD within the k-means and k-NN frameworks is presented in Fig 2.

Predictive metrics
Let ŷ_i be the predicted class for sample i. The classification accuracy is defined as the proportion of correctly predicted cases: Accuracy = (1/I) Σ_{i=1}^{I} 1(ŷ_i = y_i). For binary outcomes we also include precision, recall, and F1 score as metrics of predictive performance. We count the numbers of true positives (TP), false positives (FP), and false negatives (FN), and define these metrics as follows [27]: Precision = TP/(TP + FP), Recall = TP/(TP + FN), and F1 = 2 × Precision × Recall/(Precision + Recall).
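These standard metrics can be computed directly from the predicted and true labels, for example:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for a binary outcome (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```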

Data generation
To evaluate the performance of the DCMD method, we design simulation studies that mimic microbiome community count data and assess classification performance. We simulate a separate mixture distribution for each class, along with individual sample rates and resolutions, to generate the observed counts. For the mixture distribution, the number of components, M, is sampled from Unif(5, 15). The number of samples drawn from each mixture component is set by binning samples from a Beta(α_b, β_b) distribution at uniform intervals, with α_b varied to give different class means and levels of sparsity, and with β_b ~ Unif(2, 6.5) to control dispersion. The observed counts for each sample are then generated as n_i ~ Poisson(r*_i t_i), where r*_i is the sampled rate and the resolution t_i ~ Unif(2/3, 5/4).
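A simplified generator along these lines is sketched below; it replaces the Beta-binning step with direct sampling of component memberships (a hypothetical simplification), keeping the Poisson count and resolution structure described above:

```python
import numpy as np

rng = np.random.default_rng(2024)

def simulate_otu(I, alphas, betas, comp_probs):
    """Simulate one OTU: draw a Gamma mixture component per sample,
    a rate r*_i from it, a resolution t_i, then n_i ~ Poisson(r*_i * t_i)."""
    m = rng.choice(len(alphas), size=I, p=comp_probs)
    r = rng.gamma(shape=alphas[m], scale=1.0 / betas[m])   # Gamma(alpha, rate beta)
    t = rng.uniform(2 / 3, 5 / 4, size=I)
    return rng.poisson(r * t), t
```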
We consider two- and three-class outcomes with several simulation scenarios for each case. For the two-class outcome, parameter settings and summary statistics for each scenario are shown in Table 2. Scenarios 1 and 2 have low-sparsity data, while Scenarios 3 and 4 are highly sparse. Scenarios 1 and 3 have weakly differentiated classes (small difference in α_b), while Scenarios 2 and 4 have strongly differentiated classes (large difference in α_b). The sample size is

Mixture model specification
The specification of the mixture model components should be data-driven, and the main requirement is that the Gammas allow for appropriate coverage of the observed data. We split the count data into five intervals and apply different strategies to specify components on each interval. Modelling of the zeros and low-rate structures is based on [28], and modelling of the higher counts is based on [21].
1. Structural zeros: For data with observed zeros, a zero-point mass P(X = 0) = 1 is included to model zero inflation.
2. Low counts (x ∈ {0, 1, 2, 3}): We specify components as Poisson rate posteriors with uniform priors for each of the counts, i.e., Γ(x+1, 1); hence we include Γ(1,1), Γ(2,1), Γ(3,1), and Γ(4,1). The cut-off is set at x = 3 because rate posteriors for higher values have a low probability of generating zeros, and we want to differentiate the distributions relevant to modelling zero inflation. We also want to examine whether more mass close to zero improves the fit for the low rates; therefore, we add exponentials with a higher rate β to place more mass near zero, including Γ(1,2) among the candidate models. We could additionally include Γ(1,3) or other terms in the model and then apply the procedure described above to select the optimal mix. A fuller discussion, drawn from modelling sparse counts for total species estimation, can be found in [28].
3. Integer counts (x ∈ {4, 5, 6, 7}): Components in this range are also specified as rate posteriors Γ(x+1, 1) at integer intervals. This block acts as a buffer to ensure there are no gaps in coverage after the low-count distributions, as gaps can potentially bias the structural zero and low-rate estimates. Components are specified until the last integer component has little overlap with the previous low-rate component. In our formulation, we use an upper limit of x = 7, as simulations showed negligible differences between x = 7 and x = 8. The integer components include Γ(4,1),...,Γ(7,1).

4. Higher counts (x ∈ {8,...,C}): The higher counts tend to have a large range, and it is not practical to specify components at integer intervals. Instead, we set the number of components based on the range of the data and specify the α_m at uniform intervals on a linear-log scale from 8 to C = q_p, a set quantile of the data. Using between 10 and 15 components has worked well in past applications [21]. For our modelling, p = 0.85 is an effective threshold, meaning C is the 85% quantile of the sample.

5. Extreme high counts (x > C): These counts are truncated to a point mass with P(X > C) = 1, partly because of the low density in this range and the uncertainty in modelling it, and partly to decrease computation time.

The mixture model specification heuristics described above are primarily for modelling low-abundance OTUs, which carry most of the information in microbiome data. Higher-abundance OTUs can be modelled by restricting the component specification to higher counts and increasing p.
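The component-specification heuristic can be sketched as follows; the exact grid is an assumption based on the description above, and it presumes the p-quantile of the data exceeds 8:

```python
import numpy as np

def specify_components(counts, p=0.85, n_high=12):
    """Heuristic component grid: Gamma(x+1, 1) posteriors for the low and integer
    blocks, an extra Gamma(1, 2) near zero, and higher-count shapes spaced
    uniformly on a log scale from 8 up to the truncation point C."""
    C = float(np.quantile(counts, p))              # assumes C > 8
    alphas = [x + 1.0 for x in range(7)]           # Gamma(1,1) .. Gamma(7,1)
    betas = [1.0] * 7
    alphas.append(1.0); betas.append(2.0)          # candidate extra mass near zero
    high = np.exp(np.linspace(np.log(8.0), np.log(C), n_high))
    return np.array(alphas + list(high)), np.array(betas + [1.0] * n_high), C
```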

Model fitting and comparison methods
The proposed method is compared with k-means and k-NN using Euclidean and Manhattan distances on relative abundances, the distance-based NSC, and the LASSO, RR, RF, GB, and SVM classifiers [29]. Models are trained using a 60/40 training and test set split [30], with the training set kept the same for all classifiers within each replicate. For the machine learning methods, we use existing packages and tune the hyper-parameters using cross-validation where appropriate; see details in S1 Text.

Simulation results
For the two-class outcome, classification accuracy for each model and scenario is presented in Fig 3. The orange boxplots show the results for the proposed DCMD method in the k-means and k-NN frameworks. The blue boxplots show the other distance-based methods, including k-means and k-NN with Euclidean and Manhattan distances, and NSC. The green boxplots show the results for the machine learning methods, including RF, GB, LASSO, RR, and SVM. The dashed red line gives the average accuracy of the best method in each scenario. The results show that in Scenarios 1 and 2, when sparsity is low, k-means with the CC-L2 norm performs best, followed by k-means with the D-L2 norm, while in Scenarios 3 and 4, when sparsity is high, k-means with the D-L2 norm gives the best performance, followed by k-means with the CC-L2 norm. Overall, DCMD in a k-means framework with L2 norms outperforms the other classification methods for all types of signals and data structures for the two-class outcome. Differences in accuracy within the distance-based methods are also progressively more pronounced in favour of DCMD, with k-means outperforming k-NN. The specialized NSC approach performed similarly to DCMD within the k-NN framework; however, NSC generally falls short of k-means DCMD and the machine learning methods. Table 3 shows the summary statistics of the F1 score over 100 replicates for the two-class outcome, with the top results in each scenario highlighted. Consistent with the accuracy results, DCMD with L2 norms produces the highest F1 scores.

The classification accuracy of each model and scenario for the three-class outcome is presented in Fig 4. For Scenarios 1-3, the classes are differentiated under varying levels of sparsity, and we observe that DCMD is competitive with the best machine learning methods. Although RR has similar predictive accuracy to DCMD in Scenarios 1 and 3, and GB has similar predictive accuracy in Scenario 2, DCMD is consistently improved over the optimal

Data description
We test our method on data from two microbiome studies. The first is a study of colorectal cancer reported in [22]. A total of 190 samples (95 pairs) were collected from 95 patients at Vall d'Hebron University Hospital in Barcelona and from Genomics Collaborative. The study aimed to identify associations between the tumour microbiome and colorectal carcinoma; both colorectal adenocarcinoma tissue and adjacent non-affected tissue were collected. The OTU count table generated by 16S amplification was obtained from the Microbiome Learning Repo [12]. Prior to model training, eighteen samples with total reads below 100 were dropped from the dataset, and we also excluded OTUs with mean relative abundance below 0.001, leaving 149 OTUs and 172 samples (86 pairs) for differentiating tumour and normal tissue. The second study is a case-control study of Crohn's disease (CD) from a multi-centre cohort, designed to examine how the microbiota contributes to CD pathogenesis [23]. The profiles were obtained using Illumina 16S rRNA sequencing. The dataset, downloaded from the Microbiome Learning Repo [12], consists of 140 ileal tissue biopsy samples. The minimal sample depth is set at 100, and OTUs are restricted to less than 90% zero proportion, leaving 140 samples (78 cases and 62 controls) and 31 OTUs for analysis.

Model fitting and evaluation
For both datasets, we compare our proposed L2-norm-based k-means and k-NN classifiers with five other distance-based classifiers (k-means-Euclidean, k-means-Manhattan, k-NN-Euclidean, k-NN-Manhattan, and NSC) and five machine learning methods (RF, GB, LASSO, RR, and SVM). We assess model performance using 10-fold CV: in each iteration, one fold of the data is treated as the test set, and the remaining nine are used for training. The specification of DCMD and the other classifiers is the same as in the simulations (see S1 Text). We calculate accuracy, precision, recall, and F1 score as comparison metrics.

Table 3. Two-class outcome: the summary of F1 Scores for each model over 100 replicates.

For the colorectal cancer data, we reduce the predictor space for the distance-based classifiers by univariate screening of OTUs with a nonparametric Mann-Whitney U test on the training set. To adjust for multiple comparisons, we use q-values obtained by the Benjamini-Hochberg (BH) method [31] and retain OTUs with q-values less than 0.05 in each training set. The mean number of OTUs selected per training set is 42 (range: 13-57). For the machine learning approaches, we include all 149 OTUs. For the CD dataset, all 31 OTUs are included for all methods.
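This screening step can be sketched as follows; the `bh_screen` helper is hypothetical (statsmodels' `multipletests` would be a standard alternative for the adjustment):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def bh_screen(X, y, alpha=0.05):
    """Per-OTU Mann-Whitney U screen with Benjamini-Hochberg adjustment;
    returns indices of OTUs with q-value below alpha."""
    pvals = np.array([mannwhitneyu(X[y == 0, j], X[y == 1, j]).pvalue
                      for j in range(X.shape[1])])
    J = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order] * J / np.arange(1, J + 1)
    q = np.empty(J)
    q[order] = np.minimum.accumulate(ranked[::-1])[::-1]   # step-up adjustment
    return np.flatnonzero(q < alpha)
```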

Applied results
The predictive performance of each classifier for the colorectal cancer and CD studies is presented in Tables 4 and 5, respectively. The accuracy of k-means with the D-L2 norm is 0.67 for colorectal cancer and 0.73 for CD, making it the best method for colorectal cancer and the second-best for CD. Its F1 scores are also among the highest for both datasets, indicating that DCMD performs consistently well and improves over the other classifiers. The predictive accuracy of k-means with the CC-L2 norm is slightly worse, likely due to high zero proportions in the predictors, which is consistent with the simulation results. Similarly, DCMD outperforms the Euclidean and Manhattan distances within k-means, and k-means outperforms k-NN overall. Within k-NN, accuracy and F1 score indicate that DCMD has predictive performance comparable to the Euclidean and Manhattan distances. The NSC approach has an accuracy of 0.67 and a precision of 0.72 for colorectal cancer, but a recall of 0.56 and an F1 score of 0.63, notably lower than those of the k-means classifiers; its performance is unstable on the CD data.
Compared to the machine learning methods, DCMD with k-means is superior to RF, GB, LASSO, RR, and SVM on the first dataset. When the results are replicated controlling for the distance-based classifiers' variable selection (S3 Table), the machine learning methods have improved performance, except for GB. On the second dataset, LASSO and SVM are the best methods with accuracies of 0.74, slightly outperforming the accuracy of 0.73 for k-means with the D-L2 norm. Otherwise, DCMD k-means with the D-L2 and CC-L2 norms either matches or outperforms the machine learning approaches.

Discussion
The results of our simulation studies and microbiome applications indicate that the proposed DCMD method performs well over a range of scenarios, achieving good classification performance when using sparse data as predictors. The predictive accuracy is consistently improved compared to other distances within distance-based classifiers. It is either advantageous or competitive compared to a number of machine learning methods under a wide range of scenarios. The improved performance of DCMD on sparse data results from the use of mixture distributions to represent the observed count data because the mixture distributions can not only model the underlying uncertainty in the observed sample counts but also account for zero inflation. The improvement is particularly significant in comparison to other distances within the regular k-means and k-NN classifiers.
The performance differences between the D-L2 and CC-L2 norms can be attributed to the data structure. In less sparse scenarios, the data are better modelled by a continuous rate structure, giving a slight advantage to the CC-L2 metric. In scenarios with higher zero proportions and low counts, the D-L2 norm differentiates zeros into structural and non-structural components and models expected counts directly, which better captures the structure of the predictors used for differentiation.
As the DCMD method derives its main improvement from modelling lower-count data and the associated uncertainty, it is necessary to accurately specify an underlying set of mixture components for the low rates. The mixture also has to model low- and high-count data on the same scale, where the density of the latter is often sparse due to wide observation intervals. Moreover, it is not feasible to apply a transformation to make the data denser, owing to the abundant zeros and the discrete nature of the low counts. However, the weighting of nested candidate models and the suggested heuristic of specifying higher-count distributions on a linear-log scale have worked well in our simulations, as they partially mimic the log-transformation commonly applied to such data.
The proposed DCMD method is formulated in a distance-based framework, so it does not include specific mechanisms for variable selection. While different predictors can alternatively be included in the distance sum, the process is not automated. In our case, we used a simple nonparametric Mann-Whitney U test for feature selection, which worked well in the study data. However, more advanced and specialized methods for feature selection can be applied separately for other applications. Additionally, we note that the model is specified to use microbiome site counts, and continuous covariates need to be modelled separately using continuous distributions, while categorical covariates can only be included as dummy variables.
These variables will also be treated on the same scale in the distance metric unless specified otherwise.
Despite these drawbacks, we believe that our core contribution, the representation of observations as distributions to reflect uncertainty and the use of distributional distance metrics, will be valuable to anyone analyzing sparse data. This formulation can compensate for some of the disadvantages inherent in distance-based methods to such an extent that it achieved competitive performance with more sophisticated classifiers, as well as specially designed approaches like NSC. The techniques that made DCMD advantageous for classification when data is expected to be sparse, particularly within a distance-based framework, should be considered for improving model performance.

Conclusion
In this paper, we present a distance-based classification method for microbiome count data. The DCMD approach models the observed data using mixture distributions and calculates L2 norms for distance-based classification algorithms. The method is specifically designed to accurately model low-count structures, addressing the inherent sparsity by representing each observed count as a distribution, and is demonstrated to improve performance in simulation studies and two microbiome applications. The results emphasize the importance of accounting for uncertainty in sparse data and demonstrate the classification gains achieved by working with distributions. The proposed DCMD is competitive with a number of machine learning methods and significantly outperforms other common metrics in distance-based classification models. Its consistent performance across a variety of data structures makes this approach a viable alternative for modelling and classifying microbiome count data, particularly within a distance-based framework.