What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm

The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm, which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means, with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross-validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.

In the generalized MAP-DP algorithm (Algorithm 3), the computation of the variables $d_{i,k}$ and $d_{i,K+1}$ (algorithm lines 8, 9) requires the collapsed prior predictive distribution $f(x|\theta_0)$, and also the collapsed posterior predictive distribution $f\left(x|\theta_k^{-i}\right)$. This predictive distribution requires the updated cluster posterior hyper parameters $\theta_k^{-i}$ (algorithm line 7). These updates depend upon the distribution, and the data type, of each data point $x_i$. When the distribution is from the exponential family, the prior distribution over the parameters can be chosen to be conjugate: the prior over the parameters of the data distribution and the posterior have the same form of distribution. This simplifies the hyper parameter updates, and, furthermore, the form of the prior and posterior predictive distributions is the same and is available in closed form. The table below lists some possible data types and distributions, their conjugate prior/posterior distribution, the names given to the hyper parameters and the corresponding name of the predictive distributions. We discuss each case in more detail in the subsequent sections.

| Data type | Distribution | Conjugate prior/posterior | Hyper parameters | Predictive distribution |
|---|---|---|---|---|
| Continuous, spherical (known variance) | Spherical normal | Spherical normal | $\mu_0, \sigma_0^2$ | Spherical normal |
| Continuous (known covariance) | Multivariate normal | Multivariate normal | $\mu_0, \Sigma_0$ | Multivariate normal |
| Continuous | Multivariate normal | Normal-Wishart | $m_0, c_0, B_0, a_0$ | Multivariate Student-t |
| Non-negative continuous | Exponential | Gamma | $\alpha_0, \beta_0$ | Lomax |
| Categorical | Categorical | Dirichlet | $\alpha_{0,1}, \ldots, \alpha_{0,D}$ | Dirichlet-multinomial |
| Bounded counts | Binomial | Beta | $\alpha_0, \beta_0$ | Beta-binomial |
| Counts | Poisson | Gamma | $\alpha_0, \beta_0$ | Negative binomial |
| Counts | Geometric | Beta | $\alpha_0, \beta_0$ | Beta-geometric |
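To make the role of these quantities concrete, the following Python sketch shows one possible implementation of the assignment step of the generalized algorithm. The callables `update_hypers` and `neg_log_pred` stand in for the distribution-specific hyper parameter updates and negative log predictive densities derived in the sections below; these function names are our illustrative choices, not part of the paper's released code.

```python
# Illustrative sketch of the generalized MAP-DP assignment step (Algorithm 3,
# lines 7-10). update_hypers and neg_log_pred are placeholders for the
# distribution-specific formulas derived in the sections that follow.
import numpy as np

def assign_point(i, x, z, K, theta0, N0, update_hypers, neg_log_pred):
    """Reassign point i given the other assignments z (collapsed MAP update)."""
    d = np.empty(K + 1)
    n = np.empty(K + 1)
    for k in range(K):
        members = [x[j] for j in range(len(x)) if z[j] == k and j != i]
        n[k] = len(members)  # N_k^{-i}; an emptied cluster gets -ln 0 = +inf below
        theta_k = update_hypers(members, theta0)   # line 7: theta_k^{-i}
        d[k] = neg_log_pred(x[i], theta_k)         # line 8: d_{i,k}
    d[K] = neg_log_pred(x[i], theta0)              # line 9: d_{i,K+1}
    n[K] = N0                                      # concentration parameter
    return int(np.argmin(d - np.log(n)))           # existing cluster or new one
```

Sweeping this reassignment over all points until no assignment changes yields one pass of the algorithm; the sections below supply concrete `update_hypers`/`neg_log_pred` pairs for each conjugate family.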

Spherical normal data with known variance
This is the variant of MAP-DP described in Algorithm 2. When each data point $x \in \mathbb{R}^D$ is assumed to be spherical Gaussian with known variance $\sigma^2$ shared across dimensions, the conjugate prior distribution of the Gaussian mean vector parameter $\mu \in \mathbb{R}^D$ is also spherical normal with hyper parameters $\theta_0 = \left(\mu_0, \sigma_0^2\right)$. Then the posterior distribution for each cluster is also spherical normal with hyper parameters $\theta_k^{-i} = \left(\mu_k^{-i}, \left(\sigma_k^{-i}\right)^2\right)$. The hyper parameter updates (Algorithm 3, line 7) for each cluster are:

$$\left(\sigma_k^{-i}\right)^2 = \left(\frac{1}{\sigma_0^2} + \frac{N_k^{-i}}{\sigma^2}\right)^{-1}, \qquad \mu_k^{-i} = \left(\sigma_k^{-i}\right)^2 \left(\frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2} \sum_{j \neq i\,:\,z_j = k} x_j\right)$$

The predictive distributions $f(x|\theta_0)$ and $f\left(x|\theta_k^{-i}\right)$ are D-dimensional spherical normal distributions, whose negative logs are:

$$-\ln f(x|\theta) = \frac{1}{2\left(\sigma^2 + \sigma_k^2\right)} \left\|x - \mu_k\right\|_2^2 + \frac{D}{2} \ln\left(\sigma^2 + \sigma_k^2\right) + \frac{D}{2} \ln(2\pi)$$

Note that since the normalization term $\frac{D}{2}\ln(2\pi)$ is common to both predictive distributions, it can be omitted when computing $d_{i,k}$ and $d_{i,K+1}$ in the algorithm.
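As a concrete illustration, here is a minimal Python sketch of this update/predictive pair under the stated assumptions (shared known variance `sigma2`); the function names are ours, and `members` is assumed to hold the cluster's points with $x_i$ already removed.

```python
# Spherical normal with known variance: hyper parameter update and
# negative log predictive, following the formulas above.
import numpy as np

def update_hypers_spherical(members, mu0, sigma02, sigma2):
    Nk = len(members)
    sigmak2 = 1.0 / (1.0 / sigma02 + Nk / sigma2)  # posterior variance
    muk = sigmak2 * (mu0 / sigma02 + np.sum(members, axis=0) / sigma2)
    return muk, sigmak2

def neg_log_pred_spherical(x, muk, sigmak2, sigma2):
    D = x.shape[0]
    s2 = sigma2 + sigmak2  # predictive variance per dimension
    return np.sum((x - muk) ** 2) / (2.0 * s2) + 0.5 * D * np.log(2.0 * np.pi * s2)
```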

Multivariate normal data with known covariance
For data points $x \in \mathbb{R}^D$ assumed to be multivariate Gaussian with known covariance matrix $\Sigma$, the conjugate prior distribution of the Gaussian mean vector parameter is also multivariate normal with hyper parameters $\theta_0 = (\mu_0, \Sigma_0)$. The posterior distribution for each cluster is also multivariate normal with hyper parameters $\theta_k^{-i} = \left(\mu_k^{-i}, \Sigma_k^{-i}\right)$. The hyper parameter updates are:

$$\Sigma_k^{-i} = \left(\Sigma_0^{-1} + N_k^{-i}\,\Sigma^{-1}\right)^{-1}, \qquad \mu_k^{-i} = \Sigma_k^{-i}\left(\Sigma_0^{-1}\mu_0 + \Sigma^{-1} \sum_{j \neq i\,:\,z_j = k} x_j\right)$$

The predictive distributions $f(x|\theta_0)$ and $f\left(x|\theta_k^{-i}\right)$ are D-dimensional normal distributions, whose negative logs are:

$$-\ln f(x|\theta) = \frac{1}{2}\left(x - \mu_k\right)^{\mathsf{T}} \left(\Sigma + \Sigma_k\right)^{-1} \left(x - \mu_k\right) + \frac{1}{2}\ln\det\left(\Sigma + \Sigma_k\right) + \frac{D}{2}\ln(2\pi)$$

Since the normalization term $\frac{D}{2}\ln(2\pi)$ is common to both predictive distributions, it can be omitted when computing $d_{i,k}$ and $d_{i,K+1}$ in the algorithm.
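A corresponding Python sketch for this case follows, under the same conventions as before (illustrative names; `members` excludes $x_i$ and is assumed non-empty):

```python
# Multivariate normal with known covariance Sigma: hyper parameter update
# and negative log predictive, following the formulas above.
import numpy as np

def update_hypers_mvn(members, mu0, Sigma0, Sigma):
    Nk = len(members)
    Lam0, Lam = np.linalg.inv(Sigma0), np.linalg.inv(Sigma)
    Sigmak = np.linalg.inv(Lam0 + Nk * Lam)
    sx = np.sum(members, axis=0) if Nk > 0 else np.zeros_like(mu0)
    muk = Sigmak @ (Lam0 @ mu0 + Lam @ sx)
    return muk, Sigmak

def neg_log_pred_mvn(x, muk, Sigmak, Sigma):
    S = Sigma + Sigmak                 # predictive covariance
    diff = x - muk
    _, logdet = np.linalg.slogdet(S)
    return 0.5 * (diff @ np.linalg.solve(S, diff) + logdet + len(x) * np.log(2.0 * np.pi))
```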

Multivariate Gaussian data
When each data point $x \in \mathbb{R}^D$ is assumed to be multivariate Gaussian with unknown mean vector and covariance matrix, the conjugate prior distribution of the Gaussian parameters is Normal-Wishart, with hyper parameters $\theta_0 = (m_0, c_0, B_0, a_0)$. Then, the posterior distribution for each cluster is also Normal-Wishart, with hyper parameters $\theta_k^{-i} = \left(m_k^{-i}, c_k^{-i}, B_k^{-i}, a_k^{-i}\right)$. These are updated for each cluster according to:

$$c_k^{-i} = c_0 + N_k^{-i}, \qquad a_k^{-i} = a_0 + N_k^{-i}$$
$$m_k^{-i} = \frac{1}{c_k^{-i}}\left(c_0 m_0 + \sum_{j \neq i\,:\,z_j = k} x_j\right)$$
$$B_k^{-i} = \left(B_0^{-1} + \sum_{j \neq i\,:\,z_j = k} x_j x_j^{\mathsf{T}} + c_0\, m_0 m_0^{\mathsf{T}} - c_k^{-i}\, m_k^{-i} \left(m_k^{-i}\right)^{\mathsf{T}}\right)^{-1}$$

The predictive distributions $f(x|\theta_0)$ and $f\left(x|\theta_k^{-i}\right)$ are D-dimensional multivariate Student-t distributions, whose negative log, written in terms of the parameters $(\mu, \Lambda, \nu)$, is:

$$-\ln f(x|\mu, \Lambda, \nu) = \frac{\nu + D}{2}\ln\left(1 + \frac{1}{\nu}\left(x - \mu\right)^{\mathsf{T}} \Lambda \left(x - \mu\right)\right) - \frac{1}{2}\ln\det\Lambda + \frac{D}{2}\ln(\nu\pi) + \ln\Gamma\left(\frac{\nu}{2}\right) - \ln\Gamma\left(\frac{\nu + D}{2}\right)$$

where the Student-t parameters $(\mu, \Lambda, \nu)$ are given in terms of the Normal-Wishart parameters by $\mu = m$, $\nu = a - D + 1$ and $\Lambda = \frac{c\nu}{c + 1}B$. We note that fast incremental updates of all these parameters are possible when including and then removing a single data point from a cluster; see Raykov et al. [1] for further details.
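A hedged Python sketch of this case follows; the $B$ update is written for the Wishart-on-precision parameterization implied by the mapping $\Lambda = \frac{c\nu}{c+1}B$ above, and the names are ours:

```python
# Normal-Wishart case: hyper parameter updates and the Student-t negative
# log predictive, using the parameter mapping given above.
import numpy as np
from scipy.special import gammaln

def update_hypers_nw(members, m0, c0, B0, a0):
    X = np.asarray(members).reshape(len(members), -1)  # (Nk, D); assumes Nk >= 1
    Nk = X.shape[0]
    ck, ak = c0 + Nk, a0 + Nk
    mk = (c0 * m0 + X.sum(axis=0)) / ck
    Bk = np.linalg.inv(np.linalg.inv(B0) + X.T @ X
                       + c0 * np.outer(m0, m0) - ck * np.outer(mk, mk))
    return mk, ck, Bk, ak

def neg_log_pred_student_t(x, m, c, B, a):
    D = x.shape[0]
    nu = a - D + 1                      # degrees of freedom
    Lam = (c * nu / (c + 1.0)) * B      # Student-t precision matrix
    diff = x - m
    _, logdet = np.linalg.slogdet(Lam)
    quad = diff @ Lam @ diff
    return (0.5 * (nu + D) * np.log1p(quad / nu) - 0.5 * logdet
            + 0.5 * D * np.log(nu * np.pi) + gammaln(0.5 * nu) - gammaln(0.5 * (nu + D)))
```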

Exponential data
Given data points $x \in \mathbb{R}$, $x \geq 0$, assumed to be exponentially distributed, the conjugate prior over the exponential rate parameter is the gamma distribution, with hyper parameters $\theta_0 = (\alpha_0, \beta_0)$ (shape and rate, respectively). So, the posterior distribution of the rate parameter is also gamma, and the cluster hyper parameters $\theta_k^{-i} = \left(\alpha_k^{-i}, \beta_k^{-i}\right)$ are updated using:

$$\alpha_k^{-i} = \alpha_0 + N_k^{-i}, \qquad \beta_k^{-i} = \beta_0 + \sum_{j \neq i\,:\,z_j = k} x_j$$

The predictive distributions $f(x|\theta_0)$ and $f\left(x|\theta_k^{-i}\right)$ are the so-called Lomax distribution, with negative log:

$$-\ln f(x|\theta) = -\ln\alpha - \alpha\ln\beta + (\alpha + 1)\ln(\beta + x)$$
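In code, this case reduces to two one-line functions; the following sketch (illustrative names, `members` excluding $x_i$) mirrors the formulas above:

```python
# Exponential data: gamma hyper parameter update and Lomax negative log
# predictive, following the formulas above.
import numpy as np

def update_hypers_exponential(members, alpha0, beta0):
    return alpha0 + len(members), beta0 + float(np.sum(members))

def neg_log_pred_lomax(x, alpha, beta):
    # f(x) = alpha * beta^alpha / (beta + x)^(alpha + 1)
    return -np.log(alpha) - alpha * np.log(beta) + (alpha + 1.0) * np.log(beta + x)
```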

Categorical data
For categorical data which can take on one of $D > 1$ possible values, $x \in \{1, 2, \ldots, D\}$, the conjugate prior over the D outcome probability parameters of this distribution is the Dirichlet distribution, with hyper parameters $\theta_0 = (\alpha_{0,1}, \ldots, \alpha_{0,D})$. So, the posterior outcome probability parameters for each cluster are also Dirichlet, and for each cluster the D entries of the cluster hyper parameter $\theta_k^{-i} = \alpha_k^{-i}$ are updated using:

$$\alpha_{k,d}^{-i} = \alpha_{0,d} + \sum_{j \neq i\,:\,z_j = k} \delta\left(x_j, d\right), \qquad d = 1, \ldots, D$$

where $\delta(x, y) = 1$ if $x = y$ and 0 otherwise. The predictive distributions $f(x|\theta_0)$ and $f\left(x|\theta_k^{-i}\right)$ are special cases of the Dirichlet-multinomial distribution, with negative log:

$$-\ln f(x|\theta) = -\ln\alpha_x + \ln\sum_{d=1}^{D}\alpha_d$$
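A short Python sketch of this case follows; note that, purely for indexing convenience, the categories are coded 0 to D-1 here rather than 1 to D as in the text, and the function names are ours:

```python
# Categorical data: Dirichlet hyper parameter update and the single-trial
# Dirichlet-multinomial negative log predictive.
import numpy as np

def update_hypers_categorical(members, alpha0):
    alpha = np.array(alpha0, dtype=float)
    for xj in members:
        alpha[xj] += 1.0   # add one count to the observed category
    return alpha

def neg_log_pred_categorical(x, alpha):
    # f(x = d) = alpha_d / sum_d' alpha_d'
    return -np.log(alpha[x]) + np.log(alpha.sum())
```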

Binomial data
In the case of binomial data, where the data can take on values $x \in \{0, 1, \ldots, n\}$ for $n > 0$, the conjugate prior over the binomial success probability parameter is the beta distribution, with hyper parameters $\theta_0 = (\alpha_0, \beta_0)$. By conjugacy, the posterior cluster parameters are also beta distributed with hyper parameters $\theta_k^{-i} = \left(\alpha_k^{-i}, \beta_k^{-i}\right)$, and are updated according to:

$$\alpha_k^{-i} = \alpha_0 + \sum_{j \neq i\,:\,z_j = k} x_j, \qquad \beta_k^{-i} = \beta_0 + \sum_{j \neq i\,:\,z_j = k} \left(n - x_j\right)$$

For such binomial data, the predictive distributions $f(x|\theta_0)$ and $f\left(x|\theta_k^{-i}\right)$ are beta-binomial, with negative log:

$$-\ln f(x|\theta) = -\ln\binom{n}{x} - \ln B\left(x + \alpha, n - x + \beta\right) + \ln B(\alpha, \beta)$$

where $B(\cdot, \cdot)$ is the beta function:

$$B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}$$
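A minimal Python sketch of this case, computing the log binomial coefficient via `gammaln` for numerical stability (names are ours):

```python
# Binomial data with n trials: beta hyper parameter update and beta-binomial
# negative log predictive, following the formulas above.
import numpy as np
from scipy.special import betaln, gammaln

def update_hypers_binomial(members, n, alpha0, beta0):
    s = float(np.sum(members))  # total successes in the cluster
    return alpha0 + s, beta0 + n * len(members) - s

def neg_log_pred_beta_binomial(x, n, alpha, beta):
    log_binom = gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)  # ln C(n, x)
    return -log_binom - betaln(x + alpha, n - x + beta) + betaln(alpha, beta)
```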

Poisson data
For non-negative integer Poisson count data $x \in \mathbb{Z}$, $x \geq 0$, the conjugate prior over the single rate parameter is the gamma distribution, with hyper parameters $\theta_0 = (\alpha_0, \beta_0)$ (shape and rate, respectively). The posterior cluster parameters are similarly gamma distributed with hyper parameters $\theta_k^{-i} = \left(\alpha_k^{-i}, \beta_k^{-i}\right)$. The updates for these hyper parameters are:

$$\alpha_k^{-i} = \alpha_0 + \sum_{j \neq i\,:\,z_j = k} x_j, \qquad \beta_k^{-i} = \beta_0 + N_k^{-i}$$

For Poisson count data, the predictive distributions $f(x|\theta_0)$ and $f\left(x|\theta_k^{-i}\right)$ are negative binomial distributed, with negative log:

$$-\ln f(x|\theta) = -\ln\Gamma(x + \alpha) + \ln\Gamma(\alpha) + \ln(x!) - \alpha\ln\beta + (x + \alpha)\ln(\beta + 1)$$
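A corresponding Python sketch, with $\ln(x!)$ evaluated as `gammaln(x + 1)` (names are ours):

```python
# Poisson count data: gamma hyper parameter update and negative binomial
# negative log predictive, following the formulas above.
import numpy as np
from scipy.special import gammaln

def update_hypers_poisson(members, alpha0, beta0):
    return alpha0 + float(np.sum(members)), beta0 + len(members)

def neg_log_pred_neg_binomial(x, alpha, beta):
    return (-gammaln(x + alpha) + gammaln(alpha) + gammaln(x + 1.0)
            - alpha * np.log(beta) + (x + alpha) * np.log(beta + 1.0))
```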

Geometric data
In the case of non-negative integer data $x \in \mathbb{Z}$, $x \geq 0$, which is assumed to be geometrically distributed, the conjugate prior over the single success probability parameter is the beta distribution with hyper parameters $\theta_0 = (\alpha_0, \beta_0)$. The posterior cluster parameters are similarly beta distributed with hyper parameters $\theta_k^{-i} = \left(\alpha_k^{-i}, \beta_k^{-i}\right)$. The updates for these hyper parameters are:

$$\alpha_k^{-i} = \alpha_0 + N_k^{-i}, \qquad \beta_k^{-i} = \beta_0 + \sum_{j \neq i\,:\,z_j = k} x_j$$

For geometric data, the predictive distributions $f(x|\theta_0)$ and $f\left(x|\theta_k^{-i}\right)$ have negative log:

$$-\ln f(x|\theta) = -\ln B(\alpha + 1, \beta + x) + \ln B(\alpha, \beta)$$

where $B(\cdot, \cdot)$ is the beta function described above.
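Finally, a Python sketch of the geometric case, evaluating the negative log predictive with scipy's `betaln` (names are ours):

```python
# Geometric data: beta hyper parameter update and the negative log predictive
# given above.
import numpy as np
from scipy.special import betaln

def update_hypers_geometric(members, alpha0, beta0):
    return alpha0 + len(members), beta0 + float(np.sum(members))

def neg_log_pred_geometric(x, alpha, beta):
    return -betaln(alpha + 1.0, beta + x) + betaln(alpha, beta)
```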