Efficient similarity-based data clustering by optimal object to cluster reallocation

Mathias Rossignol; Mathieu Lagrange; Arshia Cont

doi:10.1371/journal.pone.0197450

Abstract

We present an iterative flat hard clustering algorithm designed to operate on arbitrary similarity matrices, with the only constraint that these matrices be symmetrical. Although functionally very close to kernel k-means, our proposal performs a maximization of average intra-class similarity, instead of a squared distance minimization, in order to remain closer to the semantics of similarities. We show that this approach permits the relaxing of some conditions on usable affinity matrices like semi-positiveness, as well as opening possibilities for computational optimization required for large datasets. Systematic evaluation on a variety of data sets shows that compared with kernel k-means and the spectral clustering methods, the proposed approach gives equivalent or better performance, while running much faster. Most notably, it significantly reduces memory access, which makes it a good choice for large data collections. Material enabling the reproducibility of the results is made available online.

Citation: Rossignol M, Lagrange M, Cont A (2018) Efficient similarity-based data clustering by optimal object to cluster reallocation. PLoS ONE 13(6): e0197450. https://doi.org/10.1371/journal.pone.0197450

Editor: Sergio Gómez, Universitat Rovira i Virgili, SPAIN

Received: June 22, 2017; Accepted: May 2, 2018; Published: June 1, 2018

Copyright: © 2018 Rossignol et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The code of the experimental protocol as well as the data to reproduce the results presented in the paper is available here: https://github.com/mathieulagrange/kaveragePaper.

Funding: This work was supported by Agence Nationale de la Recherche (FR grants ANR-11-JS03-005-01 and ANR-16-CE22-0012 to ML.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Clustering collections of objects into classes that bring together similar ones is probably the most common and intuitive tool used both by human cognition and artificial data analysis in an attempt to make that data organized, understandable, manageable. When the studied objects lend themselves to this kind of analysis, it is a powerful way to expose underlying organizations and approximate the data in such a way that the relationships between its members can be statistically understood and modeled. Given a description of objects, we first attempt to quantify which ones are “similar” from a given point of view, then group those n objects into C clusters, so that the similarity between objects within the same cluster is maximized. Finding the actual best possible partition of objects into clusters is, however, an NP-complete problem, intractable for useful datasets sizes. Many approaches have been proposed to yield an approximate solution: analytic, iterative, flat or hierarchical, agglomerative or divisive, soft or hard clustering algorithms, etc., each with their strengths and weaknesses [1], performing better on some classes of problems than others [2, 3].

Iterative divisive hard clustering algorithms, usually perform well to identify high-level organization in large data collections in reasonable running time. For that reason, they are a sensible choice in many data mining situations, and constitute our focus in this paper. If the data lies in a vector space, i.e. an object can be described by a m-dimensional feature vector without significant loss of information, the seminal k-means algorithm [4] is probably the most efficient approach, since the explicit computation of the cluster centroids ensure both computational efficiency and scalability. This algorithm is based on the centroid model, and minimizes the intra cluster Euclidean distance. As shown by [5], any kind of Bregman divergence, such as the KL-divergence [6] or the Itakura-Saito divergence [7], may also be considered to develop such efficient clustering algorithms.

However, for many types of data, the projection of a representational problem into an vector space cannot be done without significant loss of descriptive efficiency. To reduce this loss, specifically tailored measures of similarity are considered. As a result, the input data for clustering is no longer a n × m matrix storing the m-dimensional vectors describing the objects, but a (usually symmetric) square matrix S of size n ×n which numerically encodes some sort of relationship between the objects. In this case, one has to resort to clustering algorithms based on connectivity models, since the cluster centroids cannot be explicitly computed.

Early attempts to solve this issue considered the k-medoids problem, where the goal is to find the k objects that maximize the average similarity with the other objects of their respective clusters, or medoids. The Partition Around Medoids (PAM) algorithm [8] solves the k-medoids problem but with a complexity of O(k(n − k)²)i, n being the number of objects and i number of iterations. Due to the high complexity and the low convergence rate, this algorithm cannot be applied to decent size datasets. In order to scale the approach, the Clustering LARge Applications (CLARA) algorithm [8] draws a sample of objects before running the PAM algorithm. This sampling operation is repeated several times and the most satisfying set of medoids is retained. In contrast, CLARANS [9] preserves the whole set of objects but cuts complexity by drawing a sample of neighbors in each search for the medoids.

Another classical approach to the issue is to run a variant of k-means that considers average intra-cluster similarity as the guiding criterion for object-to-class reallocation [10, Chapter 10.7]. This straightforward technique performs well in many cases, but the complexity of computing anew intra-cluster similarities at each iteration makes it impractical for large datasets.

Following work on kernel projection [11], that is, the fact that a nonlinear data transformation into some high dimensional feature space increases the probability of the linear separability of patterns within the transformed space, [12] introduced a kernel version of the K-means algorithm, whose input is a kernel matrix that must be a Gram matrix, i.e. semi definite positive. [13] linked a weighted version of the kernel K-means objective to the popular spectral clustering [14], introducing an efficient way of solving the normalized cut objective [15].

The kernel k-means algorithm proves to be equally useful when considering arbitrary similarity problems if special care is taken to ensure definite positiveness of the input matrix [16]. This follows original algorithmic considerations where vector space data is projected into high dimensional spaces using a carefully chosen kernel function.

Despite such improvements, kernel k-means cannot be easily applied to large scale datasets without special treatments because of high algorithmic and memory access costs. [17] considered sampling of the input data, [18] considered block storing of the input matrix, and a pre-clustering approach [19] is considered by [20] with a coarsening and refining phases as respectively a pre- and post-treatment of the actual clustering phase.

We show in this paper that by using a semantically equivalent variant of the average intra-cluster similarity presented in [10, Chapter 10.7], it becomes possible to perform a computationally efficient greedy optimization [10, Chapter 10.8] with guaranteed convergence for arbitrary similarity measures. Although both of those techniques are well known, combining them is non trivial, and lead to a clustering algorithm, which we call k-averages with the following properties:

input data can be arbitrary symmetric similarity matrices,
it has fast and guaranteed convergence, with a number of object to clusters reallocations experimentally found to be roughly equal to the number of objects,
it provides good scalability thanks to a reduced need for memory access, and
on a collection of synthetic and natural test data, its results are equivalent to those of kernel k-means, and obtained in a fraction of its computing time, particularly when paged memory is required.

To summarize, the main contribution of the paper is to present a clustering algorithm:

that can handle arbitrary affinity matrices, i.e. semi-positiveness is not a mandatory requirement for guaranteed convergence
thanks to a carefully designed strategy to update the membership of objects to cluster, the algorithm is fast and memory efficient.

The remaining of the paper is organized as follows: Section 1 presents the kernel k-means objective function and the basic algorithm that minimizes this function, and Section 2 introduces the concepts behind the k-averages algorithm, followed by a detailed algorithmic description in Section 3. The complexity of the two algorithms in terms of arithmetic operations and memory access is then studied in Section 5. The above presented properties of the proposed k-averages algorithm are then validated on synthetic controlled data in Section 6 and on 43 datasets of time series issued from various sources in Section 7.

1 Kernel k-means

Since its introduction by [12], kernel k-means has been an algorithm of choice for flat data clustering with known number of clusters [16, 20]. It makes use of a mathematical technique known as the “kernel trick” to extend the classical k-means clustering algorithm [4] to criteria beyond simple euclidean distance proximity. Since it constitutes the closest point of comparison with our own work, we dedicate this section to its detailed presentation.

In the case of kernel k-means, the kernel trick allows us to consider that the k-means algorithm is operating in an unspecified, possibly very high-dimensional Euclidean space; but instead of specifying the properties of that space and the coordinates of objects, the equations governing the algorithm are modified so that everything can be computed knowing only the scalar products between points. The symmetrical matrix containing those scalar products is known as a kernel, noted .

1.1 Kernel k-means objective function

In this section and the following, we shall adopt the following convention: N is the number of objects to cluster and C the number of clusters; N_c is the number of objects in cluster c, and μ_c is the centroid of that cluster. z_cn is the membership function, whose value is 1 if object o_n is in class c, 0 otherwise. In the folowing equations, μ_c and o_n are assumed to be column vectors of equal dimension.

Starting from the objective function minimized by the k-means algorithm, expressing the sum of squared distances of points to the centroids of their respective clusters:

And using the definition of centroids as:

S can be developed and rewritten in a way that does not explicitly refer to the centroid positions, since those cannot be computed: where (1)

Since the sum of K_nn over all points remains constant, and the sum of squared centroid norms (third, quadratic, term of Eq 1) is mostly bounded by the general geometry of the cloud of objects, we can see that minimizing this value implies maximizing the sum of the central terms, which are the average scalar products of points with other points belonging to the same class. Therefore, given a matrix gathering similarities between objects, if that matrix possesses the necessary properties to be considered as a kernel (positive semidefinitness), then the kernel k-means algorithm can be applied to it in order to create clusters that locally maximize the average intra-cluster similarity.

1.2 Algorithm

Finding the configuration that globally minimizes S (Eq 1.1) is an NP-complete problem. However, several approaches allow finding an acceptable approximation. We shall only focus here on the fastest and most popular, an iterative assignment / update procedure commonly referred to as the “k-means algorithm” [4], or as a discrete version of Lloyd’s algorithm, detailed in Algorithm 1.

Algorithm 1: Lloyd’s algorithm applied to minimizing the kernel k-means objective.

Data: number of objects N, number of classes C, kernel matrix

Result: label vector L defining a partition of the objects into C classes

1 Initialization: fill L with random values in [1..C];

2 while L is modified do

3 for n ← 1 to N do

4 for c ← 1 to C do

5 Compute Y_cn following Eq 1 (note: z_cn = (L_n == c)?1: 0)

6 end

7 L_n = argmin_c(Y_cn);

8 end

9 end

The version given here is the most direct algorithmic translation of the mathematical foundations developed above, and as we shall see in section 5, it can easily become more efficient. Before that, we introduce our proposed k-averages algorithm.

2 Foundations of the k-averages algorithm

In our proposal, we adopt an alternative objective function which, unlike kernel k-means, does not rely on a geometric interpretation but an explicit account of the similarity matrix. The goal is to maximize the average intra-cluster similarity between points, a commonly used metric to evaluate clustering quality, and one whose computation is direct—linear in time.

Due to its simplicity, however, the objective function cannot be simply “plugged into” the standard kernel k-means algorithm: it lacks the geometric requisites to ensure convergence. We must therefore propose a specifically tailored algorithmic framework to exploit it: first, we show here that it is possible to easily compute the impact on the global objective function of moving a single point from one class to another; this allows us to develop a greedy optimization algorithm taking advantage of that formula.

2.1 Conventions and objective function

In addition to the notations presented above, we index here the set of elements belonging to a given cluster c_k as . For simplicity, we omit the first index and note when considering a single class.

The similarity between objects shall be written s(o_i, o_j). We extend the notation s to the similarity of an object to a class defined as the average similarity of an object with all objects of the class. s(o, c) accepts two definitions, depending on whether or not o is a member of c:

If o ∉ c, (2)

If o ∈ c, then necessarily ∃i ∣ o = o_i (3)

Let us call the “quality” of a class the average intra-class object-to-object similarity, and write it : (4)

In our framework, we do not explicitly refer to class centroids, preferring to directly consider averages of similarity values between individuals within clusters. Using the notations above, we define our objective function as the average class quality, normalized with class sizes: (5)

Since, informally, our goal is to bring together objects that share high similarity, a first idea would be to simply repeatedly move each object to the class with whose members it has the highest average similarity. This is what we call the “naive k-averages” algorithm.

2.2 Naive k-averages algorithm

Algorithm 2 presents a method that simply moves each object to the class with which it has the highest average similarity, until convergence is reached. The algorithm is straightforward and simple; however, experiments show that while it can often produce interesting results, it also sometimes cannot reach convergence because the decision to move an object to a different cluster is taken without considering the impact of the move on the quality of the source cluster.

Algorithm 2: The naive k-averages algorithm.

Data: number of objects N, number of classes C, similarity matrix

Result: label vector L defining a partition of the objects into C classes

1 Initialization: Fill L with random values in [1..C];

2 Compute initial object-class similarities S following Eq 3 or Eq 2;

3 while L is modified do

4 for i ← 1 to N do

5 previousClass ← L_i;

6 nextClass ← argmin_k S(i, k) if nextClass ≠ previousClass then

7 L_i ← nextClass;

8 for j ← 1 to N do

9 Update S(j, nextClass) and S(j, previousClass)

10 end

11 end

12 end

13 end

To ensure convergence, we need to compute the impact on the objective function of moving one object from one class to another. Using such formulation and performing only reallocation that have a positive impact, the convergence of such an iterative algorithm is guaranteed.

2.3 Impact of object reallocation on class quality

Considering a class c, let us develop the expression of into a more useful form. Since all objects are in c, we use the formula in (Eq 3) to get:

Using the assumption that the similarity matrix is symmetrical, we can reach: (6)

For future use and given the importance of the above transformation, we define:

Thus:

2.3.1 Removing an object from a class.

Assuming that o ∈ c, necessarily ∃i ∣ o = o_i. Since the numbering of objects is arbitrary, we can first simplify the following equation by considering that , in order to reach a formula that is independent from that numbering.

The quality of a class after removal of an object is thus: (7)

And the change in quality from its previous value: (8)

2.3.2 Adding an object to a class.

Assuming that o ∉ c, we can similarly to what has been done previously (numbering is arbitrary) consider for the sake of simplicity that o becomes o_{N_c+1} in the modified class c. Following a path similar to above, we get:

The quality of a class c after adding an object o is thus: (9)

And the change in quality from its previous value: (10)

2.4 Impact of object reallocation on the global objective function

When moving an object o from class c_s (“source”), to whom it belongs, to a distinct class c_t (“target”), (N_s − 1) objects are affected by the variation in (8), and N_t are affected by that in (10); in addition, one object o moves from a class whose quality is to one whose quality is , as expressed by Eq 9, which leads to an impact of moving object o from class c_s to class c_t wich can be computed as follows: (11)

As can be seen, computing this impact is a fixed-cost operation. We can therefore use the formula as the basis for an efficient iterative algorithm.

3 K-averages algorithm

Algorithm 3: The K-averages algorithm.

Data: number of objects N, number of classes C, similarity matrix

Result: label vector L defining a partition of the objects into C classes

1 Initialization: Fill L with random values in [1..C];

2 Compute initial object-class similarities S following Eq 3 or Eq 2;

3 Compute initial class qualities following Eq 6;

4 while L is modified do

5 for i ← 1 to N do

6 previousClass ← L_i;

7 nextClass ← argmax_k δ_i(previousClass, k) (following the definition of δ in Eq 11);

8 if nextClass ≠ previousClass then

9 L_i ← nextClass;

10 Update following Eq 7;

11 Update following Eq 9;

12 for j ← 1 to N do

13 Update S(j, nextClass) and S(j, previousClass)

14 following Eq 12;

15 end

16 end

17 end

18 end

Our approach does not allow us to benefit, like kernel k-means, from the convergence guarantee brought by the geometric foundation of k-means. In consequence, we cannot apply a “batch” approach where at each iteration all elements are moved to their new class, and all distances (or similarities) are computed at once. To guarantee convergence, we must update the class properties for the two modified classes (source and destination), as well as recompute the average class-object similarities for them for each considered object, after finding its ideal new class. This is the principle of the “greedy” k-means algorithm [10, Chapter10.8], but whereas for k-means that approach increases complexity (and even more so for kernel k-means), in this case it leads to a much improved computational performance.

Indeed, at a first glance, dynamically updating objectives as a result of object reallocation might seem to have negative performance impact. However, our simple non-quadratic updates make such dynamic changes easily tractable. New class qualities are thus given by Eqs 7 and 9, and new object-class similarities can be computed by: (12) where i is any object index, n is the recently reallocated object, c_s the “source” class that object i was removed from, and c_t the “target” class that object n was added to.

The full description of k-averages is given in Algorithm 3.

4 Convergence

The kernel k-means algorithm ensures convergence if the similarity matrix is semi-definite positive. The k-averages algorithm relaxes this constraint by only requiring symmetricity of the similarity matrix to ensure convergence.

An algorithm is guaranteed to converge if its successive iterations can be tied to a strictly monotonous and bounded quantity. For the k-averages algorithm, this quantity is the objective function itself, as we now show.

Thanks to the rewriting of the class quality function done in Eq 6, which only requires the similarity matrix to be symmetrical, we can directly define the allocation decision criterion δ (Eq 11) to be the change in the global objective function (Eq 5) implied by reallocating an object to a new class. It follows that, as long as reallocations are only performed when δ > 0, is strictly increasing throughout the execution of the algorithm.

Moreover, , defined as a weighted average of the average intra-class similarities for the produced clusters, can be proven to never exceed the maximal similarity between two objects.

Indeed, the average similarity of an object to other members of its class, expressed as s(o, c) in Eq 2, is an average of similarities, and therefore lower than or equal to their maximum value. Similarly, the quality of a class, defined as the average of object to class similarities (Eq 4), is inferior to the maximum value of s(o, c), and therefore to the maximum similarity between objects. Finally, , a weighted average of values, is, once again, inferior to their maximum value.

The objective function is thus upper-bounded, and increases at each iteration of the outer loop of Algorithm 3; which guarantees its convergence.

5 Complexity analysis

In this section, we study the complexity of the two approaches presented above, first from the point of view of raw complexity, second by focusing on memory access.

5.1 Computational complexity

5.1.1 Kernel k-means.

As can be seen in Algorithm 1, the operation on line 5 is the most costly part of the algorithm: for each object n and class c, at each iteration, it is necessary to compute Y_cn from Eq 1—an O(N²) operation in itself, per object. The impossibility of simply computing the distances to a known centroid as done in the k-means algorithm leads to a much higher complexity for the kernel k-means algorithm, globally O(N³) per iteration, independent of how many objects are moved for that iteration.

It is however possible to improve the performance of kernel k-means by noting than in Eq 1, the third term of the equation, which has the highest complexity, is only dependent on class definitions and not on the considered object. We can therefore rewrite Eq 1 as: (13) where (14)

Algorithm 1 thus becomes Algorithm 4, where the values of M_c are computed once at the beginning of each loop (line 4) then reused on line 8, thus reducing the overall complexity to O(n²) per iteration. This optimized version of kernel k-means is the one we shall consider for performance comparison in the remainder of this article.

Algorithm 4: Lloyd’s algorithm applied to minimizing the kernel k-means objective, optimized version.

Data: number of objects N, number of classes C, kernel matrix

Result: label vector L defining a partition of the objects into C classes

1 Initialization: fill L with random values in [1..C];

2 while L is modified do

3 for c ← 1 to C do

4 Compute M_c following Eq 14

5 end

6 for n ← 1 to N do

7 for c ← 1 to C do

8 Compute Y_cn following Eq 13 (note: z_cn = (L_n == c)?1: 0)

9 end

10 L_n = argmin_c(Y_cn);

11 end

12 end

5.1.2 K-averages.

For the k-averages method presented as Algorithm 3, the complexity of each iteration is

O(NC) corresponding to the best class search at line 7
O(NM) corresponding to the object-to-class similarity update at line 13, where M is the number of objects moved at a given iteration.

In the worst case scenario, M = N, and the complexity for one iteration of the algorithm remains the same as for the optimized kernel k-means algorithm, O(N²). In practice, however, as can be seen on Fig 1, the number of objects moving from one class to another decreases sharply after the first iteration, meaning that the complexity of one iteration becomes quickly much lower than O(N²). Thus, while the first iteration of k-averages has a similar complexity with kernel k-means, the overall cost of a typical run of the algorithm (from 10 to 50 iterations) is much lower.

Download:

Fig 1. Number of moved objects per iteration when clustering a variety of datasets with the k-averages algorithm, normalized by the total number of objects to cluster.

The datasets used to create this figure are the real-life time series data that we employ for experimental validation evaluated under the Dynamic Time Warping (DTW) similarity measure, cf. Section 7.

https://doi.org/10.1371/journal.pone.0197450.g001

To go further in this analysis, we display on Fig 2 the total number of object reallocation over a full run of the k-averages algorithm for several datatsets. The datasets used to create this figure are the real-life time series data that we employ for experimental validation evaluated under the Dynamic Time Warping (DTW) similarity measure, cf. Section 7. As can be seen, the correlation is roughly linear with the number of objects to cluster. In fact, the number of reallocations is roughly equal to the number of objects to cluster, which allows us to reach for k-averages a (statistical) total complexity of O(N²), instead of O(N²) per iteration.

Download:

Fig 2. Total number of object reallocations over a run of the k-averages algorithm, plotted against the number of objects to be clustered.

The datasets used to create this figure are the real-life time series data that we employ for experimental validation, cf. Section 7.

https://doi.org/10.1371/journal.pone.0197450.g002

5.2 Memory access

The lowered computational costs is also accompanied by a decrease in memory access: as can be seen from Eq 12, in order to compute the new object-to-class similarities after moving an object n, only line n of the similarity matrix needs to be read. For the remaining of the algorithm, only the (much smaller) object-to-class similarity matrix is used. By contrast, in the case of kernel k-means, the computation of M_c values at each iteration require that the whole similarity matrix be read, which can be a serious performance bottleneck in the case of large object collections.

Moreover, the similarity update function of k-averages, by reading one line of the matrix at a time, presents good data locality properties, which make it play well with standard memory paging strategies.

To illustrate and confirm the theoretical complexity computed here, the next section proposes some performance figures measured on controlled datasets.

6 Validation

In order to reliably compare the clustering quality and execution speed between the two approaches, we have written plain C implementations of Algorithms 4 and 3, with minimal operational overhead: reading the similarity matrix from a binary file where all matrix values are stored sequentially in standard reading order, line by line, and writing out the result of the clustering as a label text file. Both implementations use reasonably efficient code, but without advanced optimizations or parallel processing.

The figures presented in this section were obtained on synthetic datasets, created in order to give precise control on the features of the analyzed data: for n points split between C classes, C centroids are generated at random in two dimensional space, and point coordinates are generated following a Gaussian distribution around class centroids. In addition to the numbers of objects and classes, the variance of Gaussian distributions are adjusted to modulate how clearly separable clusters are. Similarities are computed as inverse Euclidean distances between points.

6.1 Reproducibility

In order to ease reproducibility of the results, the data is taken from a public repository of several benchmark datasets used in academia [21] and the code of the proposed method as well as the experimental code used for generated the figures is publicly available: https://github.com/mathieulagrange/kaveragePaper.

6.2 Clustering performance

Several metrics are available to evaluate the performance of a clustering algorithm. The one closest to the actual target application is the raw accuracy, that is the average number of items labeled correctly after an alignment phase of the estimated labeling with the reference [22].

Another metric of choice is the Normalized Mutual Information (NMI) criterion. Based on information theoretic principles, it measures the amount of statistical information shared by the random variables representing the predicted cluster distribution and the reference class distribution of the data points. If P is the random variable denoting the cluster assignments of the points, and C is the random variable denoting the underlying class labels on the points then the NMI measure is defined as: where I(X; Y) = H(X) − H(X|Y) is the mutual information between the random variables X and Y, H(X) is the Shannon entropy of X, and H(X|Y) is the conditional entropy of X given Y. Thanks to the normalization, the metric stays between 0 and 1, 1 indicating a perfect match, and can be used to compare clustering with different numbers of clusters. Interestingly, random prediction gives an NMI close to 0, whereas the accuracy of a random prediction on a balanced bi-class problem is as high as 50%.

In this paper, for simplicity sake, only the NMI is considered for validations. However, we found that most statements hereafter in terms of performance ranking of the different algorithms still hold while considering the accuracy metric as reference.

Fig 3 presents the quality of clusterings obtained using kernel k-means and k-averages on two series of datasets: one featuring 5 classes, the other 40 classes. On the x-axis is the variance of the Gaussian distribution used to generate the point cloud for each class: the higher that value, the more the classes are spread out and overlap each other, thus making the clustering harder.

Download:

Fig 3. NMI of kernel k-means (*) and k-averages (o) clustering relative to ground truth as a function of the “spread” of those classes for synthetic data sets of 5 and 40 classes, displayed in dashed and solid lines, respectively.

https://doi.org/10.1371/journal.pone.0197450.g003

The question of choosing the proper number of clusters for a given dataset without a priori is a well known and hard problem, and beyond the scope of this article. Therefore, for the purpose of evaluation, clustering is done by requesting a number of clusters equal to the actual number of classes in the dataset. In order to obtain stable and reliable figures, clustering is repeated 500 times with varying initial conditions, i.e. the initial assignment of points to clusters is randomly determined, and only the average performance is given. For fairness of comparison, each algorithm is run with the exact same initial assignments.

As can be seen on the figure, in the case of a 5-class problem, k-averages outperforms kernel k-means in the “easy” cases (low class spread), before converging to equivalent results. For the more complex 40-class datasets, k-averages consistently yields a better result than kernel k-means, especially for higher values of the variance. The lower values of NMI for 5-class experiments is in fact an artifact introduced by the normalization of NMI, and is not important here; we only focus, for each series of experiments, on the relative performances of kernel k-means and k-averages.

6.3 Time efficiency

Fig 4 shows the average time spent by kernel k-means and k-averages to cluster synthetic datasets or varying sizes. As previously, initial conditions on each run are identical for both algorithms. The reported run time is the one measured on a 64 bits Intel^® Core^™ i7 runnning at 3.6 GHz with 32 Gb of RAM and standard Hard Disk Drive (HDD) operated by standard Linux distribution. For results with 2 GB of RAM, the same machine is used with a memory limitation specified to the kernel at boot time.

Download:

Fig 4. Average computation time of the kernel k-means (*) and kaverages (o) algorithms on computers with 2 GB (solid line) and 32 GB (dashed line) of RAM, respectively.

The “running time” axis follows a logarithmic scale.

https://doi.org/10.1371/journal.pone.0197450.g004

These figures confirm the theoretical complexity analysis presented in Section 5: k-averages runs at least 20 times faster on average than kernel k-means in ordinary conditions, when available memory is not an issue. When the matrix size exceeds what can be stored in RAM and the system has to resort to paged memory, as in the presented experiments when the matrix reaches about 1500MB, both algorithms suffer from a clear performance hit; however, kernel k-means is much more affected, and the difference becomes even more important: with a 2000MB similarity matrix on a memory-limited computer, k-averages runs about 100 times faster than kernel k-means.

Having established the interest of our proposed method relative to kernel k-means on synthetic object collections, we now proceed to a thorough evaluation on real data.

7 Experiments

In order to demonstrate the usefulness of k-averages when dealing with real data, we have chosen to focus on the clustering of time series as the evaluation task. Time series, even though represented as vectors and therefore suitable for any kinds of norm-based clustering, are best compared with elastic measures [23, 24], partly due to their varying length. The Dynamic Time Warping (DTW) measure is an elastic measure widely used in many areas since its introduction for spoken word detection [25] and has never been challenged for time series mining [26, 27].

Effective clustering of time series using the DTW measure requires similarity based algorithms such as the k-averages algorithm. With some care, kernel based algorithm can also be considered provided that the resulting similarity matrix is converted into a kernel, i.e. the matrix is forced to be semi definite positive, i.e. to be a Gram matrix [28] in order to guarantee convergence.

7.1 Datasets

To compare quality of clusterings obtained by the considered algorithms, we consider a large collection of 43 time series datasets made publicly available by many laboratories worldwide and compiled by Prof. Keogh. Thus, while all the experiments presented here are performed on time series (chosen for being a good example of a data type requiring similarity-based clustering, as opposed to a simple Euclidean approach), the great variety in the sources and semantics of said series (bio-informatics, linguistics, astronomy, gesture modeling, chemistry…) gives this validation a wide foundation. Statistics about the morphology of those datasets can be found in Table 1 and summarized in Table 2.

Download:

Table 1. Description of the time series datasets used for evaluation.

https://doi.org/10.1371/journal.pone.0197450.t001

Download:

Table 2. Statistics of the datasets.

The length of the times series is expressed in samples.

https://doi.org/10.1371/journal.pone.0197450.t002

7.2 Methods

Three algorithms are considered: the spectral clustering [14] approach as a high complexity reference, the kernel k-means algorithm implemented as described in Section 1 and the proposed k-averages algorithm. The spectral clustering algorithm tested here uses the normalization proposed by Jordan and Weiss [29]. This normalization is chosen over no normalization and the Shi and Malik one [15] as it is found to be the best performing in terms of average NMI over all the datasets. The implementation is done using the Matlab programming language. Even though a C implementation would probably be more efficient, we believe that the gain would be low as the main computational load is the diagonalization of the similarity matrix and the k-means clustering of the eigenvectors which are both efficient builtins Matlab functions. The kernel k-means is implemented both in Matlab using the implementation provided by Mo Chen (vailable at: https://fr.mathworks.com/matlabcentral/fileexchange/26182-kernel-kmeans) and in the C programming language following Algorithm 1. The k-averages method is implemented in C following Algorithm 3.

7.3 Evaluation protocol

For each dataset, since we perform clustering, and not supervised learning, the training and testing data are joined together. DTW similarities are computed using the implementation provided by Prof. Ellis (available at: http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw) with default parameters.

As in our previous experiments with synthetic data, we choose here the normalized mutual information (NMI) as the measure of clustering quality; clustering is done by requesting a number of clusters equal to the actual number of classes in the dataset, and repeated 200 times with varying initial conditions, each algorithm being run with the exact same initial assignments. For the 200 clusterings thus produced, we compute the NMI between them and the ground truth clustering. Average and standard deviation statistics are then computed.

7.4 Clustering performance

For ease of readability and comparison, the presented results are split into 3 tables. Table 3 lists the results obtained on bi-class datasets, i.e. the datasets annotated in terms of presence or absence of a given property; Table 4 concerns the datasets with a small number of classes (from 3 to 7); and Table 5 focuses the datasets with a larger number of classes (from 8 to 50).

Download:

Table 3. NMI (in percents) of clusterings by kernel k-means and k-averages for bi-class datasets.

https://doi.org/10.1371/journal.pone.0197450.t003

Download:

Table 4. NMI (in percents) of clusterings by kernel k-means and k-averages for datasets of 3 to 7 classes.

https://doi.org/10.1371/journal.pone.0197450.t004

Download:

Table 5. NMI (in percents) of clusterings by kernel k-means and k-averages for datasets of 8 to 50 classes.

https://doi.org/10.1371/journal.pone.0197450.t005

For each experiment, the result of the best performing method is marked in bold. The Matlab and C implementations of the kernel k-means algorithm give exactly the same results in terms of NMI, thus only one column is used to display their performance.

A first observation is that the spectral clustering algorithm only performs favorably for 2 of the 43 dataset. Also, most of the bi-class problems (Table 3) do not seem to lend themselves well to this kind of approach: kernel k-means and k-averages produce quasi-identical results, poor in most cases. Concerning the medium numbers of classes (Table 4), k-averages performs best for 8 datasets out of 17. For the larger numbers of classes (Table 5), k-averages performs best for 11 datasets out of 14.

Considering the standard deviation over the several runs of the algorithm with different initialization is helpful to study the sensitivity of the algorithm to its initialization and thus its tendency to be stuck into local minima. For most datasets, the standard deviation of the k-averages algorithm is smaller than the one of the kernel k-means one and thus seems experimentally more robust.

7.5 Efficiency

Computation time displayed in Tables 6, 7, and 8 is the average duration over 100 runs on a single core with no parallelization capabilities. For every datasets, the spectral clustering approach is the more time consuming due to the diagonalization of the matrix which is of O(N³). For the kernel k-means algorithm, the C implementation is most of the time more efficient than the Matlab implementation. The k-average algorithm is more efficient for 46 datasets out of the 47 by close to an order of magnitude for the larger datasets.

Download:

Table 6. Computation time (in seconds) for several implementations of the evaluated clustering algorithms for datasets of 2 classes.

https://doi.org/10.1371/journal.pone.0197450.t006

Download:

Table 7. Computation time (in seconds) for several implementations of the evaluated clustering algorithms for datasets of 3 to 7 classes.

https://doi.org/10.1371/journal.pone.0197450.t007

Download:

Table 8. Computation time (in seconds) for several implementations of the evaluated clustering algorithms for datasets of 8 to 50 classes.

https://doi.org/10.1371/journal.pone.0197450.t008

7.6 Overall results

To conclude on the performance of the evaluated algorithms on real datasets, Table 9 displays the NMI and computation time averaged over the 47 datasets. The k-averages method marginally improve the clustering accuracy compared to the kernel k-means approach by using less time to compute.

Download:

Table 9. Performances averaged over the 47 datasets.

The NMI is expressed in percents and the computation time in seconds.

https://doi.org/10.1371/journal.pone.0197450.t009

8 Conclusion

We have presented k-averages, an iterative flat clustering algorithm that operates on arbitrary similarity matrices by explicitly and directly aiming to optimize the average intra-class similarity. Having established the mathematical foundation of our proposal, including guaranteed convergence, we have thoroughly compared it with widely used standard methods: the kernel k-means and the spectral clustering techniques. We show that the k-averages algorithm converges much faster (20 times faster under ordinary conditions) and leads to equivalent or better clustering results for the task of clustering both synthetic data and realistic times series taken from a wide variety of sources, while also being more computationally efficient and more sparing in memory use.

Acknowledgments

The authors would like to acknowledge support for this project from ANR project Houle (grant ANR-11-JS03-005-01) and ANR project Cense (grant ANR-16-CE22-0012).

References

1. Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognition Letters. 2010;31:651–666.
- View Article
- Google Scholar
2. Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. In: KDD workshop on text mining. vol. 400. Boston; 2000. p. 525–526.
3. Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC. Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics. 2006;22:2405–2412. pmid:16882653
- View Article
- PubMed/NCBI
- Google Scholar
4. MacQueen JB. Some Methods for classification and Analysis of Multivariate Observations. In: Proc. of Berkeley Symposium on Mathematical Statistics and Probability; 1967.
5. Banerjee A, Merugu S, Dhillon IS, Ghosh J. Clustering with Bregman Divergences. J Mach Learn Res. 2005;6:1705–1749.
- View Article
- Google Scholar
6. Dhillon IS, Mallela S, Kumar R. A Divisive Information Theoretic Feature Clustering Algorithm for Text Classification. J Mach Learn Res. 2003;3:1265–1287.
- View Article
- Google Scholar
7. Linde Y, Buzo A, Gray RM. An Algorithm for Vector Quantizer Design. IEEE Transactions on Communications. 1980;28:84–95.
- View Article
- Google Scholar
8. Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis. New York: John Wiley and Sons; 1990.
9. Ng RT, Han J. Efficient and Effective Clustering Methods for Spatial Data Mining. In: Proceedings of the 20th International Conference on Very Large Data Bases. VLDB’94. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 1994. p. 144–155.
10. Duda RO, Hart PE, Stork DG. Pattern Classification (2nd Ed). Wiley; 2001.
11. Vapnik VN. The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag New York, Inc.; 1995.
12. Girolami M. Mercer Kernel-based Clustering in Feature Space. IEEE Transactions on Neural Network. 2002;13(3):780–784.
- View Article
- Google Scholar
13. Dhillon IS, Guan Y, Kulis B. Weighted Graph Cuts Without Eigenvectors A Multilevel Approach. IEEE Trans Pattern Anal Mach Intell. 2007;29(11):1944–1957. pmid:17848776
- View Article
- PubMed/NCBI
- Google Scholar
14. Von Luxburg U. A tutorial on spectral clustering. Statistics and computing. 2007;17(4):395–416.
- View Article
- Google Scholar
15. Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence. 2000;22(8):888–905.
- View Article
- Google Scholar
16. Roth V, Laub J, Kawanabe M, Buhmann JM. Optimal Cluster Preserving Embedding of Nonmetric Proximity Data. IEEE Trans Pattern Anal Mach Intell. 2003;25(12):1540–1551.
- View Article
- Google Scholar
17. Chitta R, Jin R, Havens TC, Jain AK. Approximate Kernel K-means: Solution to Large Scale Kernel Clustering. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD’11. New York, NY, USA: ACM; 2011. p. 895–903.
18. Zhang R, Rudnicky AI. A large scale clustering scheme for kernel K-Means. In: Pattern Recognition, 2002. Proceedings. 16th International Conference on. vol. 4; 2002. p. 289–292 vol.4.
19. Bradley PS, Fayyad UM, Reina C. Scaling Clustering Algorithms to Large Databases. In: Knowledge Discovery and Data Mining; 1998. p. 9–15.
20. Kulis B, Basu S, Dhillon I, Mooney R. Semi-supervised graph clustering: a kernel approach. Machine Learning. 2008;74(1):1–22.
- View Article
- Google Scholar
21. Chen Y, Keogh E, Hu B, Begum N, Bagnall A, Mueen A, et al. The UCR Time Series Classification Archive; 2015.
22. Kuhn HW. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly. 1955;2(1–2):83–97.
- View Article
- Google Scholar
23. Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh E. Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures. Proc VLDB Endowment. 2008;1(2):1542–1552.
- View Article
- Google Scholar
24. Wang X, Mueen A, Ding H, Trajcevski G, Scheuermann P, Keogh E. Experimental Comparison of Representation Methods and Distance Measures for Time Series Data. Data Mining Knowledge Discovery. 2013;26(2):275–309.
- View Article
- Google Scholar
25. Sakoe H, Chiba S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing. 1978;26(1):43–49.
- View Article
- Google Scholar
26. Berndt DJ, Clifford J. Using Dynamic Time Warping to Find Patterns in Time Series. In: Fayyad UM, Uthurusamy R, editors. KDD Workshop. AAAI Press; 1994. p. 359–370.
27. Rakthanmanon T, Campana B, Mueen A, Batista G, Westover B, Zhu Q, et al. Addressing Big Data Time Series: Mining Trillions of Time Series Subsequences Under Dynamic Time Warping. ACM Trans Knowl Discov Data. 2013;7(3):10:1–10:31.
- View Article
- Google Scholar
28. Lanckriet GRG, Cristianini N, Bartlett P, Ghaoui LE, Jordan MI. Learning the Kernel Matrix with Semidefinite Programming. J Mach Learn Res. 2004;5:27–72.
- View Article
- Google Scholar
29. Ng AY, Jordan MI, Weiss Y. On spectral clustering: Analysis and an algorithm. In: Advances in neural information processing systems; 2002. p. 849–856.

[ref1] 1. Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognition Letters. 2010;31:651–666.
View Article
Google Scholar

[2] View Article

[3] Google Scholar

[ref2] 2. Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. In: KDD workshop on text mining. vol. 400. Boston; 2000. p. 525–526.

[ref3] 3. Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC. Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics. 2006;22:2405–2412. pmid:16882653
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref4] 4. MacQueen JB. Some Methods for classification and Analysis of Multivariate Observations. In: Proc. of Berkeley Symposium on Mathematical Statistics and Probability; 1967.

[ref5] 5. Banerjee A, Merugu S, Dhillon IS, Ghosh J. Clustering with Bregman Divergences. J Mach Learn Res. 2005;6:1705–1749.
View Article
Google Scholar

[11] View Article

[12] Google Scholar

[ref6] 6. Dhillon IS, Mallela S, Kumar R. A Divisive Information Theoretic Feature Clustering Algorithm for Text Classification. J Mach Learn Res. 2003;3:1265–1287.
View Article
Google Scholar

[14] View Article

[15] Google Scholar

[ref7] 7. Linde Y, Buzo A, Gray RM. An Algorithm for Vector Quantizer Design. IEEE Transactions on Communications. 1980;28:84–95.
View Article
Google Scholar

[17] View Article

[18] Google Scholar

[ref8] 8. Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis. New York: John Wiley and Sons; 1990.

[ref9] 9. Ng RT, Han J. Efficient and Effective Clustering Methods for Spatial Data Mining. In: Proceedings of the 20th International Conference on Very Large Data Bases. VLDB’94. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 1994. p. 144–155.

[ref10] 10. Duda RO, Hart PE, Stork DG. Pattern Classification (2nd Ed). Wiley; 2001.

[ref11] 11. Vapnik VN. The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag New York, Inc.; 1995.

[ref12] 12. Girolami M. Mercer Kernel-based Clustering in Feature Space. IEEE Transactions on Neural Network. 2002;13(3):780–784.
View Article
Google Scholar

[24] View Article

[25] Google Scholar

[ref13] 13. Dhillon IS, Guan Y, Kulis B. Weighted Graph Cuts Without Eigenvectors A Multilevel Approach. IEEE Trans Pattern Anal Mach Intell. 2007;29(11):1944–1957. pmid:17848776
View Article
PubMed/NCBI
Google Scholar

[27] View Article

[28] PubMed/NCBI

[29] Google Scholar

[ref14] 14. Von Luxburg U. A tutorial on spectral clustering. Statistics and computing. 2007;17(4):395–416.
View Article
Google Scholar

[31] View Article

[32] Google Scholar

[ref15] 15. Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence. 2000;22(8):888–905.
View Article
Google Scholar

[34] View Article

[35] Google Scholar

[ref16] 16. Roth V, Laub J, Kawanabe M, Buhmann JM. Optimal Cluster Preserving Embedding of Nonmetric Proximity Data. IEEE Trans Pattern Anal Mach Intell. 2003;25(12):1540–1551.
View Article
Google Scholar

[37] View Article

[38] Google Scholar

[ref17] 17. Chitta R, Jin R, Havens TC, Jain AK. Approximate Kernel K-means: Solution to Large Scale Kernel Clustering. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD’11. New York, NY, USA: ACM; 2011. p. 895–903.

[ref18] 18. Zhang R, Rudnicky AI. A large scale clustering scheme for kernel K-Means. In: Pattern Recognition, 2002. Proceedings. 16th International Conference on. vol. 4; 2002. p. 289–292 vol.4.

[ref19] 19. Bradley PS, Fayyad UM, Reina C. Scaling Clustering Algorithms to Large Databases. In: Knowledge Discovery and Data Mining; 1998. p. 9–15.

[ref20] 20. Kulis B, Basu S, Dhillon I, Mooney R. Semi-supervised graph clustering: a kernel approach. Machine Learning. 2008;74(1):1–22.
View Article
Google Scholar

[43] View Article

[44] Google Scholar

[ref21] 21. Chen Y, Keogh E, Hu B, Begum N, Bagnall A, Mueen A, et al. The UCR Time Series Classification Archive; 2015.

[ref22] 22. Kuhn HW. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly. 1955;2(1–2):83–97.
View Article
Google Scholar

[47] View Article

[48] Google Scholar

[ref23] 23. Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh E. Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures. Proc VLDB Endowment. 2008;1(2):1542–1552.
View Article
Google Scholar

[50] View Article

[51] Google Scholar

[ref24] 24. Wang X, Mueen A, Ding H, Trajcevski G, Scheuermann P, Keogh E. Experimental Comparison of Representation Methods and Distance Measures for Time Series Data. Data Mining Knowledge Discovery. 2013;26(2):275–309.
View Article
Google Scholar

[53] View Article

[54] Google Scholar

[ref25] 25. Sakoe H, Chiba S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing. 1978;26(1):43–49.
View Article
Google Scholar

[56] View Article

[57] Google Scholar

[ref26] 26. Berndt DJ, Clifford J. Using Dynamic Time Warping to Find Patterns in Time Series. In: Fayyad UM, Uthurusamy R, editors. KDD Workshop. AAAI Press; 1994. p. 359–370.

[ref27] 27. Rakthanmanon T, Campana B, Mueen A, Batista G, Westover B, Zhu Q, et al. Addressing Big Data Time Series: Mining Trillions of Time Series Subsequences Under Dynamic Time Warping. ACM Trans Knowl Discov Data. 2013;7(3):10:1–10:31.
View Article
Google Scholar

[60] View Article

[61] Google Scholar

[ref28] 28. Lanckriet GRG, Cristianini N, Bartlett P, Ghaoui LE, Jordan MI. Learning the Kernel Matrix with Semidefinite Programming. J Mach Learn Res. 2004;5:27–72.
View Article
Google Scholar

[63] View Article

[64] Google Scholar

[ref29] 29. Ng AY, Jordan MI, Weiss Y. On spectral clustering: Analysis and an algorithm. In: Advances in neural information processing systems; 2002. p. 849–856.

Figures

Abstract

Introduction

1 Kernel k-means

1.1 Kernel k-means objective function

1.2 Algorithm

2 Foundations of the k-averages algorithm

2.1 Conventions and objective function

2.2 Naive k-averages algorithm

2.3 Impact of object reallocation on class quality

2.3.1 Removing an object from a class.

2.3.2 Adding an object to a class.

2.4 Impact of object reallocation on the global objective function

3 K-averages algorithm

4 Convergence

5 Complexity analysis

5.1 Computational complexity

5.1.1 Kernel k-means.

5.1.2 K-averages.

5.2 Memory access

6 Validation

6.1 Reproducibility

6.2 Clustering performance

6.3 Time efficiency

7 Experiments

7.1 Datasets

7.2 Methods

7.3 Evaluation protocol

7.4 Clustering performance

7.5 Efficiency

7.6 Overall results

8 Conclusion

Acknowledgments

References