Clustering compositional data using Dirichlet mixture model

A model-based clustering method for compositional data is explored in this article. Most methods for compositional data analysis require some kind of transformation. The proposed method builds a mixture model using Dirichlet distribution which works with the unit sum constraint. The mixture model uses a hard EM algorithm with some modification to overcome the problem of fast convergence with empty clusters. This work includes a rigorous simulation study to evaluate the performance of the proposed method over varied dimensions, number of clusters, and overlap. The performance of the model is also compared with other popular clustering algorithms often used for compositional data analysis (e.g. KMeans, Gaussian mixture model (GMM) Gaussian Mixture Model with Hard EM (Hard GMM), partition around medoids (PAM), Clustering Large Applications based on Randomized Search (CLARANS), Density-Based Spatial Clustering of Applications with Noise (DBSCAN) etc.) for simulated data as well as two real data problems coming from the business and marketing domain and physical science domain, respectively. The study has shown promising results exploiting different distributional patterns of compositional data.


Introduction
In statistics, compositional data are quantitative descriptions of the parts of some whole, which means that it consists of relative information [1]. Mathematically, compositional data follows the Aitchison geometry on the simplex [2]. Measurements including probabilities, proportions, percentages, and ppm can all be thought of as compositional data. In general, compositional data is written as, S D ¼ x ¼ ½x 1 ; x 2 ; :::; x D � 2 R D jx i > 0; i ¼ 1; 2; :::; D; In other words, compositional data is a D dimensional real vector, x = [x 1 , x 2 , . . ., x D ] of positive components on R D such that the sum of all components is c. Often, we observe the sum of all components to be 1; if not, all the components are divided by the sum of all components, such that P D i¼1 x i ¼ 1. Analysis of such data is widely used in the fields of geochemistry [3,4], biology [5][6][7], ecology [8,9], finance and business studies [10][11][12] emerged in the literature long before. [13] identified the problem of 'spurious correlation' between ratios of variables and [14] later extended the work and showed that some of the correlations between components of the composition must be negative because of the unit sum constraint. Many transformations have been proposed over the years (e.g. log transformation [15], log ratio transformation [16]) to overcome the unit sum constraint, but still it is argued when it comes to choosing the best transformation [17]. Another issue with compositional data refers to the dealing with zero values as both ratios as well as logarithms are operations that require non-zero elements in the data matrix. Many researchers have tried different approaches to deal with zero values (see [18][19][20][21]), but it remains as an open problem even today; mostly because, zero values occur in compositional data for different reasons. Often, the "zero problem" is linked with the missing data problem. Missing data are generally classified into three categories [22], namely: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR). In compositional data analysis, the rounded zeros are considered a NMAR case, where data cannot be observed because their values are below a known value �. Zero values can also occur when the count of an element is zero (known as count zero) and when zero signifies some property or relevant information (known as essential zeros). For our study of compositional data in cluster analysis, we have encountered round zeros and we have used a method proposed by [20], where we replace the zeros with a small quantity and adjust others in a multiplicative way which does not affect the covariance structure of the data. The adjusted values xr ij can be written as where c i is usually the sum constraint. For row i and component j, the above adjustment in Eq 2 replaces the component x ij by a very small quantity δ ij if x ij = 0, else it multiplies a term with x ij to maintain the unit sum constraint. Here, x ik 's are the zero components in row i. The multiplicative term is a fraction by which the non-zero terms to be reduced in order to accommodate the added values of δ ik 's and keep the sum of rows fixed at c i . For clustering compositional data there exists many methods in the literature [23,24]. We generally see two kinds of approaches, namely; model based methods, e.g. mixture models [25] and methods based on dissimilarity distances (e.g. hierarchical clustering [26], KMeans [27]. But most of the time researchers go for Gaussian mixture model or KMeans for clustering purposes [28].
For estimating the parameters of mixture models, the EM algorithm [29][30][31] is widely used. In many applications of mixture models, e.g. in image matching [32], and audio and video scene analysis [33], the EM algorithm is being used regularly. But the EM algorithm is often not very convenient to apply for other than normal distributions, because it needs to be modified and adapted for each case. Sometimes, updating the parameters in the M step becomes impossible for some distributions [34].
The main objectives of our study are to • develop a clustering method without the need of transformation of compositional data, • build a mixture model with distribution other than normal, • evaluate the performance of the method in different situations (different dimensions, different number of clusters and varied overlap).
We are going to propose a model based clustering method without transformation of compositional data. We have used a Hard EM [35] with some modifications, to build mixture models using Dirichlet distribution. For that purpose we need some point estimates of the parent distributions. It is very convenient as it works with both, likelihood based and Bayesian estimates. But the problem with hard assignment of cluster is that it ignores cluster membership probabilities of less probable clusters. As a result, often the algorithm converges too quickly with one or more clusters being empty. In our study we have also proposed a way to deal with that problem. We have done rigorous simulation study to evaluate the performance of the proposed method over varying dimension, number of clusters and overlap. We have also used two real dataset from business and physical science domain to illustrate the method.

Methodology
Let X 1 , X 2 , . . ., X N denote a random sample of size N, where X i is a p dimensional random vector with probability density function f(x i ) on R p . We can write where π = (π 1 , . . ., π k ) contains the corresponding mixture proportions with is the density component of mixture j and α j , j = 1, 2, . . ., k, are vectors of component specific parameters for each density. Then α = (α 1 , . . ., α k ) denotes the vector of all parameters of the model. The log likelihood of the model for a sample of size N is then given by The parameters can be estimated using the EM algorithm with some modifications. For that purpose, let us introduce latent variables Z i , which are categorical variables taking on values 1, . . ., k with probabilities π 1 , . . ., π k such that Pr(X i |Z i = j) = f j (x i ), j = 1, . . ., k. Further, probabilities γ ij are introduced (conditional on the observed data X = x and the parameters α): Eq 5 can be seen as a cluster membership probability of data point i for cluster j. For an EM algorithm, we try to optimize the function where t is the current iteration number. It is nothing but the expected complete data log likelihood. It can also be shown that (see [36]), where the expected complete data log likelihood is expressed as sum of two parts. At the M step, we optimize Q with respect to π and α. π j is estimated in the usual way by N j N , where, N j ¼ P N i¼1 g ij and for estimating α, we look at the part in Q (Eq 7) which depends on α, which is given by, Now, we choose α j such that a t j ¼ argmax a j lða j Þ, which is obtained by the process of assigning data points to respective clusters, given by argmax j g ij , and estimate α j by some estimation method based on the assigned observations to that cluster. It can be seen as a Bayesian concept (although not strictly Bayesian) for learning where Eq 5 provides the cluster membership probability. The idea of choosing the cluster based on maximum probability is the same as choosing the MAP estimate, the mode of the distribution of Pr(Z i = j|X, α).
To run the algorithm, at first, some trial values of the distribution parameters α and mixture proportions π are initialized. Then the initial value of the log likelihood is evaluated. For different distributions, different techniques can be used to choose suitable initial values. For example, in the case of a GMM, the centroids of KMeans can be used as initial values of μ and the empirical covariance matrix of each cluster can be taken as an initial value of S j . On the other hand, for a Dirichlet Mixture Model, centroids of KMeans can be multiplied with a scalar c (for our study we have used c = 60) to get the initial values of the α parameters. Please recall that the mean vector of a Dirichlet distribution consists of the ratios of α parameters and the sum of all α parameters. Here, the scalar c acts as the sum of α parameter values. The initial values of π can be obtained by generating a random number from a Dirichlet (1,1,1,. . .,1) distribution. The empirical ratios of the number of cluster members in the KMeans algorithm and total observations can also be used as the initial values of π. For our study, we have used the KMeans initialization technique mentioned above for all our experiments.
At the E step, the values of the probabilities γ ij are evaluated using the current parameter values. For an usual EM algorithm (e.g. in a GMM), at the M step, a weighted mean and a weighted covariance matrix are calculated using the γ ij values. But for other distributions, where the model parameters are not mean and (co)variance, this technique can not be used. So, for different distributions, different techniques needs to be used. And also, for such Hard EM, sometimes the algorithm converges with one or more clusters being empty. Hence, one might have to force the algorithm to re-iterate if one or more clusters are found to be empty at each M step. To introduce a flexible, yet convenient solution, we propose a different technique in our algorithm, where at the M step each data point is assigned to a cluster depending on the probability of that data point belonging to each cluster. That cluster is assigned for which the probability is maximum. Now, if one or more clusters are found empty then the initial value of the parameter α j for empty cluster j is used. And for the non empty clusters, point estimates of the parameters of each parent distributions are obtained using only the data points available in each cluster. For faster convergence and convenience, maximum likelihood estimates can usually be recommended. The mixture component probabilities π j are estimated as mentioned above by N j N . The newly set of estimated values of the parameters is then used as an update over the previous one. After this step, the log likelihood is evaluated again using the updated parameter values. The process is then continued until convergence. The convergence properties of this algorithm follow the properties of the usual EM algorithm, which has been explained in detail by [37,38].
Algorithm 1: Clustering algorithm for mixture of Dirichlet distributions with provision for empty clusters (Hard DMM 1) Replace zero values in the data, if any, using Eq 2; Initialize the model parameters, α and π. Evaluate the initial value of the log likelihood from Eq 4; while log likelihood difference � � do Evaluate γ ij from Eq 5, using the parameter values and data; If we make a finite mixture with k components, the model is given by Eq 3 and subsequently, the log likelihood is given by Eq 4.
The model parameters, can be easily estimated using our generalized approach. For that we need a good point estimate of the parameters of a Dirichlet distribution to be used in the M step of our algorithm. [39] has discussed a way to find out the maximum likelihood estimates of a Dirichlet distribution, where he proposed to perform a fixed point iteration, given an initial value of the α parameters. The equation is given by At each iteration, for an old value of the parameter a old jm , a new value a new jm is obtained. This iteration in the algorithm requires inverting C, which is a digamma function. A suitable initial value and inversion algorithm is also discussed by [39].

A special provision of Bayesian estimates for clusters with fewer data points
It is possible to add a further step in algorithm 1 to consider the case when there are very few data points in a cluster due to hard assignment. In this situation, Bayesian estimates can be very useful as they can use some prior information about the model parameters. Also, for fewer data points, maximum likelihood estimates are known to be less accurate. However, Bayesian estimation of Dirichlet parameters is tricky due to several reasons. Even though the Dirichlet distribution has a conjugate prior for being a member of the exponential family, the posterior distribution is difficult to use in practical problems and not analytically tractable. Few authors have proposed some approximation to the posterior distribution of Dirichlet parameters (e.g. [40] have used multivariate Gaussian distribution), but no method seems to yield satisfactory results. Also, using some Markov Chain Monte Carlo (MCMC) algorithm at each iteration step of the clustering algorithm makes it too time-consuming, which is not practically feasible. Considering all these challenges, we are going to propose a suitable solution that can be adopted in our clustering algorithm.
Let us recall that, if (X 1 , X 2 , . . ., X p ) follows a Dirichlet distribution, with parameters (α 1 , α 2 , . . ., α p ) then the marginal distribution of X i follows a Beta distribution with parameters ða i ; P p j¼1 ða j À a i ÞÞ. Now, if we choose the prior distribution of α i as Gamma (a, b), then under certain assumptions, the posterior distribution of α i can be obtained in closed form. It can be shown that (see [41]), posterior distribution of α i follows a Gamma distribution with parameters (a + n) and Thus, for our clustering problem, the Bayesian estimates of α jm , m = 1, 2, . . .p for cluster j can be obtained by the posterior mean, which is given by, For our experiment, we have chosen the values of a and b to be 1. The extended algorithm with Bayesian estimates for clusters with fewer data points (Hard DMM 2) is explained below.
Algorithm 2: Clustering algorithm for mixture of Dirichlet distributions with special provision for clusters with fewer data points (Hard DMM 2) Replace zero values in the data, if any, using Eq 2; Initialize the model parameters, α and π. Evaluate the initial value of the log likelihood from Eq 4; while log likelihood difference � � do Evaluate γ ij from Eq 5, using the parameter values and data;

Comparison with other clustering algorithms
We have done simulation study to check the efficiency of the proposed technique. For a Hard DMM, algorithm 1 and algorithm 2 can be used without any alteration. The objective of our simulation study is to compare the performance of Hard DMM 1 and Hard DMM 2 with other popular clustering algorithms which researchers often use for clustering compositional data. For our study we have considered hierarchical agglomerative clustering with linkage criteria ward, single, average and complete respectively [42]. We have also used partition around medoids (PAM) [43], Clustering Large Applications based on Randomized Search (CLAR-ANS) [44], Fuzzy CMean [45], Kmeans, Gaussian Mixture Model (GMM), Gaussian Mixture Model with hard EM (Hard GMM), spectral clustering [46] and DBSCAN [47] for comparison. We have checked three measures to evaluate the performance. A detail description of all the measures can be found in [48]. In this section, we have generated data under two schemes.
We have generated data under two schemes mentioned above and used different clustering algorithms to find patterns. We have measured the performance of algorithms in terms of accuracy, precision and recall. Fig 1 shows the data generated under scheme 1 with true clusters. And Fig 2 shows how different algorithms finds pattern on the data. We see that Hard DMM 1, Hard DMM 2, KMeans, GMM and Hard GMM recognize pattern in the data somewhat similar to the original pattern. Other algorithms fail to recognize the true patterns. The detailed result can be seen in Table 1. The data generated under scheme 2 is shown in Fig 3. We can see that the data has more complex patterns than data under scheme 1. From Fig 4, we see that only Hard DMM 1 and Hard DMM 2 are able to find pattern similar to the true patterns. All other algorithms fail to understand the true patterns of the generated data. The corresponding results in detail can be seen in Table 2.  Fig 6 also, we see that Hard DMM 1 and 2 work better than all other algorithms we considered when it comes to the data generated under scheme 2. Hard DMM 1 and Hard DMM 2 have shown best performance in terms of accuracy, precision and recall on our generated dataset. On this note, it is important to understand the interpretation of precision and recall in classification. Precision gives the ratio of number of elements correctly classified under a certain class (or cluster) and all number of elements which actually belong to that class (or cluster). On the other hand, recall gives the ratio of number of elements correctly classified under a certain class (or cluster) and total number of elements classified under that class both correctly and incorrectly. Even though both of these measures are very important, unfortunately we can not maximize both at the same time. When we have different algorithms with same accuracy, we must choose the model with better precision and recall. In our simulation study,both versions of Hard DMM have the best accuracy on both the dataset along with very good precision and recall. Scheme 2 produces a dataset with a complex, non spherical patterns which makes it very difficult to cluster with generic algorithms or mixture model with Gaussian distributions. When compositional data shows asymmetric and non spherical patterns, mixture model using Dirichlet distribution is expected to give better results as Dirichlet distribution can adopt both symmetric and asymmetric shapes. Our simulation study confirms that Hard DMM can be a suitable choice for both spherical and non spherical data when it comes to clustering.

Performance testing
We wanted to test the performance of our proposed model (Hard DMM 1) for varied dimensions, varied number of clusters and varied overlap. We have simulated data with dimensions 5, 10, 20, 35, 50, 70 and 100 with number of clusters 2 to 6. For each dimension and number of clusters, we have generated data 100 times and used our proposed model to check accuracy. For clusters 2 to 5, we have set very less to none overlap in the data. And for 6 clusters we have introduced overlap in the data with increasing dimension. Let us recall that p denotes the dimension, k denotes the number of cluster and α j = (α j1 , . . ., α jp ) denotes the parameters of Dirichlet distribution from mixture component j. The data generating schemes are mentioned below.
• k = 2: 800 random samples from a Dirichlet distribution, with p parameters drawn randomly from a range 1 and 110, sorted in ascending order. 500 random samples from a Dirichlet distribution, with p parameters drawn randomly from a range 1 and 110, sorted in descending order.
• k = 3: 500 random samples from a Dirichlet distribution, with p parameters drawn randomly from a range 110 and 500, sorted in ascending order. 400 random samples from a Dirichlet distribution, with p parameters drawn randomly from a range 1 and 110, sorted in descending order. 300 random samples from a Dirichlet distribution, with all p parameters equal to 50.
• k = 4: 400 random samples from a Dirichlet distribution, with p parameters drawn randomly from a range 1 and 100, sorted in ascending order. 300 random samples from a Dirichlet distribution, with p parameters drawn randomly from a range 1 and 100, sorted in descending   • k = 5: 500 random samples from a Dirichlet distribution, with p parameters drawn randomly from a range 110 and 500, sorted in ascending order. 100 random samples from a Dirichlet distribution, with p parameters drawn randomly from a range 110 and 500, sorted in descending order. 300 random samples from a Dirichlet distribution, with p parameters

PLOS ONE
Clustering compositional data drawn randomly from a range 1 and 110, sorted in ascending order. 400 random samples from a Dirichlet distribution, with p parameters drawn randomly from a range 1 and 110, sorted in descending order. 300 random samples from a Dirichlet distribution, with all p parameters equal to 50.
• k = 6: 500 random samples from a Dirichlet distribution, with p parameters drawn randomly from a range 110 and 500, sorted in ascending order. 100 random samples from a Dirichlet distribution, with p parameters drawn randomly from a range 110 and 500, sorted in descending order. 300 random samples from a Dirichlet distribution, with p parameters drawn randomly from a range 1 and 110, sorted in ascending order. 400 random samples from a Dirichlet distribution, with p parameters drawn randomly from a range 1 and 110, sorted in descending order. 300 random samples from a Dirichlet distribution, with all p parameters equal to 50. 500 random samples from a Dirichlet distribution with l parameters equals to 110 and rest p − l parameters drawn randomly from Uniform (1,5) distribution, sorted in ascending order. l = (2,3,5,8,12,15,18) for p = (5,10,20,35,50,70,100) respectively.
The T-SNE [53,54] plots in Figs 7 and 8 show that with p = 5 there is some overlap in one cluster and with p = 100 one cluster has completely been overlapped on another. The performances of the model for varied dimension, number of clusters and overlap is shown in Figs 9-13 respectively.
From results in Table 3, we see that increasing dimension and increasing number of clusters do not have much impact on the accuracy of Hard DMM. But increasing overlap has significant impact on the accuracy of Hard DMM. It is to be noted that many algorithms suffer from "Curse of Dimensionality" with increasing dimension in the data. For example, in case of GMM, a p dimensional mean vector and p × p symmetric covariance matrix need to be estimated for each k clusters. In other words for a GMM with k clusters and p dimensions, (k − 1) + kp(1 + (p/2 + 1/2)) number of parameters need to be estimated. On the other hand, in case of a DMM with k clusters and p dimensions, we need only (k − 1) + kp parameters to estimate. So for 10 clusters and 100 dimensional data GMM estimates 51509 parameters, whereas DMM estimates only 1009 parameters on the same situation. So, DMM has an added advantage over GMM when it comes to high dimensionality. That is why we have noticed in our study that increasing dimension has very little to none impact on the performance of Hard DMM. Also with increasing number of cluster, the number of parameters increase linearly. And with a good starting value in the EM algorithm, the model converges soon with satisfactory results. On the contrary, overlap in the data leads to misclassification, which in turn decreases the performance of Hard DMM significantly.

Real data applications
We have applied the proposed methods on two real data problems. Our main idea was to check how our model works for the given data and not to provide an optimum solution for the problems. We have checked three measures to evaluate the performance, namely: accuracy, precision and recall. All measures have been compared with hierarchical agglomerative clustering with linkage criteria ward, single, average and complete, PAM, CLARANS, Fuzzy Cmean, KMeans GMM, Hard GMM, Spectral clustering and DBSCAN algorithm. At first we have checked for missing data. In our study there was no missing data. The class labels were subsequently label encoded in order to make it compatible with python. Using L1 normalization [55] the data was then converted into compositional data and zero values were treated using multiplicative replacement which is available in scikit-bio [56], a python library.

Wholesale customers data
We have used a data from Marketing and Management domain for our first experiment. [57] has used this data for logical discriminant models. The dataset can be downloaded from UCI Machine Learning Repository. The data refers to 440 customers of a wholesale distributor, where 298 customers are from the Horeca (Hotel/Restaurant/Café) channel and the rest 142 customers are from the Retail channel. The wholesale customers are grouped in above two classes according to frequency spending degrees of four types: • low frequency-low spending; • high frequency-low spending; • regular frequency-regular spending; • high frequency-high spending. The first two spending patterns are captured by the class Horeca and the second two patterns are captured by the class Retail. The wholesale data concerning the customers consists of annual spending in monetary units (m.u.) on different product categories, namely: fresh products, milk products, grocery, frozen products, detergents and paper products and delicatessen. The summary statistics of the data is shown in Table 4. The aim of the analysis is to find if there are different spending patterns for the two groups of customers and if so, maybe the wholesale could differentiate its marketing actions directed to these groups. For our study we will restrict ourselves to clustering analysis. In this case we have p = 6, k = 2 and N = 440. The T-SNE plots in Fig 14 shows complex distributional patterns. We have used different algorithms for clustering and checked their performances. The corresponding results are shown in Table 5.
From Fig 15 we see that both versions of Hard DMM work better than all other algorithms under consideration in terms of Accuracy. Precision and recall are found to be little better in GMM and Hard GMM than Hard DMM. As discussed before, when there are different algorithms with same level of accuracy, it is better to choose the algorithm with more precision and recall. In this experiment Hard DMM has the highest accuracy with comparatively good precision and recall value. So, Hard DMM can still be considered as a suitable choice in this situation.

Wine data
The dataset we heve chosen for our second experiment is from physical science domain. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. The 13 components are namely: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenol, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines and Proline. This data has been used by many researchers, (see [58][59][60]). This dataset can also be downloaded from UCI Machine Learning Repository. The aim of the analysis is to identify the different types of wines based on its components. But the attributes color intensity and hue do not constitute chemical components of wine. Hence, we have dropped those two variables for our compositional data analysis. We have used it for clustering purpose. In this case, we have p = 11, k = 3 and N = 178. The summary statistics of the data is shown in Table 6. Fig 16 displays the T-SNE plot of the data which shows difficult cluster patterns. Like before, we have performed clustering with different clustering algorithms. The results in detail is shown in Table 7.
From Fig 17, we see that, both versions of Hard DMM work better than all other clustering algorithms in terms of accuracy, precision and recall. For complex distributional pattern of compositional data, Hard DMM work better than other models as, compositional data can be naturally modelled using Dirichlet distribution.

Conclusion
In this paper we have shown a convenient way to build Dirichlet mixture model to cluster compositional data. The model can be used without any transformation. In this case, Dirichlet distribution is a natural choice as it works with the unit sum constraint. Researchers generally use generic algorithms for clustering compositional data, whereas Hard DMM offers an exclusive solution specially for compositional data considering both the spherical and non spherical cluster patterns. Dirichlet distribution is well known for modelling symmetric and asymmetric data. This advantage can be exploited using Hard DMM. We wanted to use a distribution other than normal in the mixture model and check whether it works as par with predominantly used methods such as GMM and KMeans. From the simulation study and two real data problems we see that when there is a pattern in the composition (proportions), both versions of Hard DMM are able to identify the clusters with quite a satisfactory result. For clustering purpose we had to be cautious while using data used for classification, as not always classification and clustering done on same ground. For example, if images are classified based on presence of a dog in it and the data contains values of red, green and blue channel, clustering algorithm tries to find completely a different pattern. We have also done an extensive simulation study to evaluate the performance of our proposed method. We see that increasing number of dimensions (upto 100) and increasing number of clusters do not seem to have much effect on the performance. But increasing overlap makes the accuracy decrease accordingly. DMM can also be advantageous in high dimensional setting as it requires relatively less number of parameters to be estimated when compared to other mixture model. Due to the complexity of maximum likelihood (ML) estimation for Dirichlet parameters, Dirichlet distribution has long been ignored for clustering purpose. In our study, we have shown a novel way to use ML estimates of dirichlet parameters conveniently in mixture set up which can be used to cluster compositional data.
In our study, we have considered the number of clusters to be known in advance. But in reality, sometimes we need to estimate k before we can start clustering. We have also not   explored the situation when the data is very high dimensional and sparse. Both the issues would require further research and we keep that for our future work. Compositional data is getting very popular in biology domain. We hope to see more future applications of Dirichlet distribution and DMM in compositional data analysis.