Efficient similarity-based data clustering by optimal object to cluster reallocation

We present an iterative flat hard clustering algorithm designed to operate on arbitrary similarity matrices, with the only constraint that these matrices be symmetrical. Although functionally very close to kernel k-means, our proposal performs a maximization of average intra-class similarity, instead of a squared distance minimization, in order to remain closer to the semantics of similarities. We show that this approach permits the relaxing of some conditions on usable affinity matrices like semi-positiveness, as well as opening possibilities for computational optimization required for large datasets. Systematic evaluation on a variety of data sets shows that compared with kernel k-means and the spectral clustering methods, the proposed approach gives equivalent or better performance, while running much faster. Most notably, it significantly reduces memory access, which makes it a good choice for large data collections. Material enabling the reproducibility of the results is made available online.


Introduction
Clustering collections of objects into classes that bring together similar ones is probably the most common and intuitive tool used both by human cognition and artificial data analysis in an attempt to make that data organized, understandable, manageable. When the studied objects lend themselves to this kind of analysis, it is a powerful way to expose underlying organizations and approximate the data in such a way that the relationships between its members can be statistically understood and modeled. Given a description of objects, we first attempt to quantify which ones are "similar" from a given point of view, then group those n objects into C clusters, so that the similarity between objects within the same cluster is maximized. Finding the actual best possible partition of objects into clusters is, however, an NP-complete problem, intractable for useful datasets sizes. Many approaches have been proposed to yield an approximate solution: analytic, iterative, flat or hierarchical, agglomerative or divisive, soft or hard clustering algorithms, etc., each with their strengths and weaknesses [1], performing better on some classes of problems than others [2,3].
Iterative divisive hard clustering algorithms, usually perform well to identify high-level organization in large data collections in reasonable running time. For that reason, they are a sensible choice in many data mining situations, and constitute our focus in this paper. If the a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 computationally efficient greedy optimization [10,Chapter 10.8] with guaranteed convergence for arbitrary similarity measures. Although both of those techniques are well known, combining them is non trivial, and lead to a clustering algorithm, which we call k-averages with the following properties: • input data can be arbitrary symmetric similarity matrices, • it has fast and guaranteed convergence, with a number of object to clusters reallocations experimentally found to be roughly equal to the number of objects, • it provides good scalability thanks to a reduced need for memory access, and • on a collection of synthetic and natural test data, its results are equivalent to those of kernel k-means, and obtained in a fraction of its computing time, particularly when paged memory is required.
To summarize, the main contribution of the paper is to present a clustering algorithm: • that can handle arbitrary affinity matrices, i.e. semi-positiveness is not a mandatory requirement for guaranteed convergence • thanks to a carefully designed strategy to update the membership of objects to cluster, the algorithm is fast and memory efficient.
The remaining of the paper is organized as follows: Section 1 presents the kernel k-means objective function and the basic algorithm that minimizes this function, and Section 2 introduces the concepts behind the k-averages algorithm, followed by a detailed algorithmic description in Section 3. The complexity of the two algorithms in terms of arithmetic operations and memory access is then studied in Section 5. The above presented properties of the proposed k-averages algorithm are then validated on synthetic controlled data in Section 6 and on 43 datasets of time series issued from various sources in Section 7.

Kernel k-means
Since its introduction by [12], kernel k-means has been an algorithm of choice for flat data clustering with known number of clusters [16,20]. It makes use of a mathematical technique known as the "kernel trick" to extend the classical k-means clustering algorithm [4] to criteria beyond simple euclidean distance proximity. Since it constitutes the closest point of comparison with our own work, we dedicate this section to its detailed presentation.
In the case of kernel k-means, the kernel trick allows us to consider that the k-means algorithm is operating in an unspecified, possibly very high-dimensional Euclidean space; but instead of specifying the properties of that space and the coordinates of objects, the equations governing the algorithm are modified so that everything can be computed knowing only the scalar products between points. The symmetrical matrix containing those scalar products is known as a kernel, noted K.

Kernel k-means objective function
In this section and the following, we shall adopt the following convention: N is the number of objects to cluster and C the number of clusters; N c is the number of objects in cluster c, and μ c is the centroid of that cluster. z cn is the membership function, whose value is 1 if object o n is in class c, 0 otherwise. In the folowing equations, μ c and o n are assumed to be column vectors of equal dimension.
Starting from the objective function minimized by the k-means algorithm, expressing the sum of squared distances of points to the centroids of their respective clusters: And using the definition of centroids as: z cn o n S can be developed and rewritten in a way that does not explicitly refer to the centroid positions, since those cannot be computed: Since the sum of K nn over all points remains constant, and the sum of squared centroid norms (third, quadratic, term of Eq 1) is mostly bounded by the general geometry of the cloud of objects, we can see that minimizing this value implies maximizing the sum of the central terms, which are the average scalar products of points with other points belonging to the same class. Therefore, given a matrix gathering similarities between objects, if that matrix possesses the necessary properties to be considered as a kernel (positive semidefinitness), then the kernel k-means algorithm can be applied to it in order to create clusters that locally maximize the average intra-cluster similarity.

Algorithm
Finding the configuration that globally minimizes S (Eq 1.1) is an NP-complete problem. However, several approaches allow finding an acceptable approximation. We shall only focus here on the fastest and most popular, an iterative assignment / update procedure commonly referred to as the "k-means algorithm" [4], or as a discrete version of Lloyd's algorithm, detailed in Algorithm 1.
Algorithm 1: Lloyd's algorithm applied to minimizing the kernel k-means objective.
Compute Y cn following Eq 1 (note: z cn = (L n == c)?1: 0) 6 end 7 L n = argmin c (Y cn ); 8 end 9 end The version given here is the most direct algorithmic translation of the mathematical foundations developed above, and as we shall see in section 5, it can easily become more efficient. Before that, we introduce our proposed k-averages algorithm.

Foundations of the k-averages algorithm
In our proposal, we adopt an alternative objective function which, unlike kernel k-means, does not rely on a geometric interpretation but an explicit account of the similarity matrix. The goal is to maximize the average intra-cluster similarity between points, a commonly used metric to evaluate clustering quality, and one whose computation is direct-linear in time.
Due to its simplicity, however, the objective function cannot be simply "plugged into" the standard kernel k-means algorithm: it lacks the geometric requisites to ensure convergence. We must therefore propose a specifically tailored algorithmic framework to exploit it: first, we show here that it is possible to easily compute the impact on the global objective function of moving a single point from one class to another; this allows us to develop a greedy optimization algorithm taking advantage of that formula.

Conventions and objective function
In addition to the notations presented above, we index here the set of elements belonging to a given cluster c k as c k ¼ fo k1 ; . . . ; o kN k g. For simplicity, we omit the first index and note c ¼ fo 1 ; . . . ; o N c g when considering a single class.
The similarity between objects shall be written s(o i , o j ). We extend the notation s to the similarity of an object to a class defined as the average similarity of an object with all objects of the class. s(o, c) accepts two definitions, depending on whether or not o is a member of c: If o 2 c, then necessarily 9i Let us call the "quality" of a class the average intra-class object-to-object similarity, and write it Q: In our framework, we do not explicitly refer to class centroids, preferring to directly consider averages of similarity values between individuals within clusters. Using the notations above, we define our objective function as the average class quality, normalized with class sizes: Since, informally, our goal is to bring together objects that share high similarity, a first idea would be to simply repeatedly move each object to the class with whose members it has the highest average similarity. This is what we call the "naive k-averages" algorithm.

Naive k-averages algorithm
Algorithm 2 presents a method that simply moves each object to the class with which it has the highest average similarity, until convergence is reached. The algorithm is straightforward and simple; however, experiments show that while it can often produce interesting results, it also sometimes cannot reach convergence because the decision to move an object to a different cluster is taken without considering the impact of the move on the quality of the source cluster. To ensure convergence, we need to compute the impact on the objective function of moving one object from one class to another. Using such formulation and performing only reallocation that have a positive impact, the convergence of such an iterative algorithm is guaranteed.

Impact of object reallocation on class quality
Considering a class c, let us develop the expression of QðcÞ into a more useful form. Since all objects are in c, we use the formula in (Eq 3) to get: Using the assumption that the similarity matrix is symmetrical, we can reach: For future use and given the importance of the above transformation, we define: SðcÞ and SðcÞ ¼ N c ðN c À 1ÞQðcÞ 2

Removing an object from a class.
Assuming that o 2 c, necessarily 9i j o = o i . Since the numbering of objects is arbitrary, we can first simplify the following equation by considering that o ¼ o N c , in order to reach a formula that is independent from that numbering.
The quality of a class after removal of an object is thus: And the change in quality from its previous value:

Adding an object to a class.
Assuming that o = 2 c, we can similarly to what has been done previously (numbering is arbitrary) consider for the sake of simplicity that o becomes o N c +1 in the modified class c. Following a path similar to above, we get: The quality of a class c after adding an object o is thus: And the change in quality from its previous value:

Impact of object reallocation on the global objective function
When moving an object o from class c s ("source"), to whom it belongs, to a distinct class c t ("target"), (N s − 1) objects are affected by the variation in (8), and N t are affected by that in (10); in addition, one object o moves from a class whose quality is Qðc s Þ to one whose quality is Qðc t [ oÞ, as expressed by Eq 9, which leads to an impact of moving object o from class c s to class c t wich can be computed as follows: As can be seen, computing this impact is a fixed-cost operation. We can therefore use the formula as the basis for an efficient iterative algorithm. Our approach does not allow us to benefit, like kernel k-means, from the convergence guarantee brought by the geometric foundation of k-means. In consequence, we cannot apply a "batch" approach where at each iteration all elements are moved to their new class, and all distances (or similarities) are computed at once. To guarantee convergence, we must update the class properties for the two modified classes (source and destination), as well as recompute the average class-object similarities for them for each considered object, after finding its ideal new class. This is the principle of the "greedy" k-means algorithm [10, Chapter10.8], but whereas for k-means that approach increases complexity (and even more so for kernel k-means), in this case it leads to a much improved computational performance.

K-averages algorithm
Indeed, at a first glance, dynamically updating objectives as a result of object reallocation might seem to have negative performance impact. However, our simple non-quadratic updates make such dynamic changes easily tractable. New class qualities are thus given by Eqs 7 and 9, and new object-class similarities can be computed by: where i is any object index, n is the recently reallocated object, c s the "source" class that object i was removed from, and c t the "target" class that object n was added to. The full description of k-averages is given in Algorithm 3.

Convergence
The kernel k-means algorithm ensures convergence if the similarity matrix is semi-definite positive. The k-averages algorithm relaxes this constraint by only requiring symmetricity of the similarity matrix to ensure convergence. An algorithm is guaranteed to converge if its successive iterations can be tied to a strictly monotonous and bounded quantity. For the k-averages algorithm, this quantity is the objective function itself, as we now show.
Thanks to the rewriting of the class quality function done in Eq 6, which only requires the similarity matrix to be symmetrical, we can directly define the allocation decision criterion δ (Eq 11) to be the change in the global objective function O (Eq 5) implied by reallocating an object to a new class. It follows that, as long as reallocations are only performed when δ > 0, O is strictly increasing throughout the execution of the algorithm.
Moreover, O, defined as a weighted average of the average intra-class similarities for the produced clusters, can be proven to never exceed the maximal similarity between two objects.
Indeed, the average similarity of an object to other members of its class, expressed as s(o, c) in Eq 2, is an average of similarities, and therefore lower than or equal to their maximum value. Similarly, the quality Q of a class, defined as the average of object to class similarities (Eq 4), is inferior to the maximum value of s(o, c), and therefore to the maximum similarity between objects. Finally, O, a weighted average of Q values, is, once again, inferior to their maximum value.
The objective function O is thus upper-bounded, and increases at each iteration of the outer loop of Algorithm 3; which guarantees its convergence.

Complexity analysis
In this section, we study the complexity of the two approaches presented above, first from the point of view of raw complexity, second by focusing on memory access.

Computational complexity
5.1.1 Kernel k-means. As can be seen in Algorithm 1, the operation on line 5 is the most costly part of the algorithm: for each object n and class c, at each iteration, it is necessary to compute Y cn from Eq 1-an O(N 2 ) operation in itself, per object. The impossibility of simply computing the distances to a known centroid as done in the k-means algorithm leads to a much higher complexity for the kernel k-means algorithm, globally O(N 3 ) per iteration, independent of how many objects are moved for that iteration.
It is however possible to improve the performance of kernel k-means by noting than in Eq 1, the third term of the equation, which has the highest complexity, is only dependent on class definitions and not on the considered object. We can therefore rewrite Eq 1 as: where In the worst case scenario, M = N, and the complexity for one iteration of the algorithm remains the same as for the optimized kernel k-means algorithm, O(N 2 ). In practice, however, as can be seen on Fig 1, the number of objects moving from one class to another decreases sharply after the first iteration, meaning that the complexity of one iteration becomes quickly much lower than O(N 2 ). Thus, while the first iteration of k-averages has a similar complexity with kernel k-means, the overall cost of a typical run of the algorithm (from 10 to 50 iterations) is much lower.
To go further in this analysis, we display on Fig 2 the total number of object reallocation over a full run of the k-averages algorithm for several datatsets. The datasets used to create this figure are the real-life time series data that we employ for experimental validation evaluated under the Dynamic Time Warping (DTW) similarity measure, cf. Section 7. As can be seen, the correlation is roughly linear with the number of objects to cluster. In fact, the number of reallocations is roughly equal to the number of objects to cluster, which allows us to reach for k-averages a (statistical) total complexity of O(N 2 ), instead of O(N 2 ) per iteration.

Memory access
The lowered computational costs is also accompanied by a decrease in memory access: as can be seen from Eq 12, in order to compute the new object-to-class similarities after moving an object n, only line n of the similarity matrix needs to be read. For the remaining of the algorithm, only the (much smaller) object-to-class similarity matrix is used. By contrast, in the case of kernel k-means, the computation of M c values at each iteration require that the whole Efficient similarity-based data clustering by optimal object to cluster reallocation similarity matrix be read, which can be a serious performance bottleneck in the case of large object collections.
Moreover, the similarity update function of k-averages, by reading one line of the matrix at a time, presents good data locality properties, which make it play well with standard memory paging strategies.
To illustrate and confirm the theoretical complexity computed here, the next section proposes some performance figures measured on controlled datasets.

Validation
In order to reliably compare the clustering quality and execution speed between the two approaches, we have written plain C implementations of Algorithms 4 and 3, with minimal operational overhead: reading the similarity matrix from a binary file where all matrix values are stored sequentially in standard reading order, line by line, and writing out the result of the clustering as a label text file. Both implementations use reasonably efficient code, but without advanced optimizations or parallel processing.
The figures presented in this section were obtained on synthetic datasets, created in order to give precise control on the features of the analyzed data: for n points split between C classes, C centroids are generated at random in two dimensional space, and point coordinates are generated following a Gaussian distribution around class centroids. In addition to the numbers of Efficient similarity-based data clustering by optimal object to cluster reallocation objects and classes, the variance of Gaussian distributions are adjusted to modulate how clearly separable clusters are. Similarities are computed as inverse Euclidean distances between points.

Reproducibility
In order to ease reproducibility of the results, the data is taken from a public repository of several benchmark datasets used in academia [21] and the code of the proposed method as well as the experimental code used for generated the figures is publicly available: https://github.com/ mathieulagrange/kaveragePaper.

Clustering performance
Several metrics are available to evaluate the performance of a clustering algorithm. The one closest to the actual target application is the raw accuracy, that is the average number of items labeled correctly after an alignment phase of the estimated labeling with the reference [22].
Another metric of choice is the Normalized Mutual Information (NMI) criterion. Based on information theoretic principles, it measures the amount of statistical information shared by the random variables representing the predicted cluster distribution and the reference class distribution of the data points. If P is the random variable denoting the cluster assignments of the points, and C is the random variable denoting the underlying class labels on the points then the NMI measure is defined as:

where I(X; Y) = H(X) − H(X|Y) is the mutual information between the random variables X and Y, H(X) is the Shannon entropy of X, and H(X|Y) is the conditional entropy of X given Y.
Thanks to the normalization, the metric stays between 0 and 1, 1 indicating a perfect match, and can be used to compare clustering with different numbers of clusters. Interestingly, random prediction gives an NMI close to 0, whereas the accuracy of a random prediction on a balanced bi-class problem is as high as 50%.
In this paper, for simplicity sake, only the NMI is considered for validations. However, we found that most statements hereafter in terms of performance ranking of the different algorithms still hold while considering the accuracy metric as reference. Fig 3 presents the quality of clusterings obtained using kernel k-means and k-averages on two series of datasets: one featuring 5 classes, the other 40 classes. On the x-axis is the variance of the Gaussian distribution used to generate the point cloud for each class: the higher that value, the more the classes are spread out and overlap each other, thus making the clustering harder.
The question of choosing the proper number of clusters for a given dataset without a priori is a well known and hard problem, and beyond the scope of this article. Therefore, for the purpose of evaluation, clustering is done by requesting a number of clusters equal to the actual number of classes in the dataset. In order to obtain stable and reliable figures, clustering is repeated 500 times with varying initial conditions, i.e. the initial assignment of points to clusters is randomly determined, and only the average performance is given. For fairness of comparison, each algorithm is run with the exact same initial assignments.
As can be seen on the figure, in the case of a 5-class problem, k-averages outperforms kernel k-means in the "easy" cases (low class spread), before converging to equivalent results. For the more complex 40-class datasets, k-averages consistently yields a better result than kernel kmeans, especially for higher values of the variance. The lower values of NMI for 5-class experiments is in fact an artifact introduced by the normalization of NMI, and is not important here; we only focus, for each series of experiments, on the relative performances of kernel k-means and k-averages. Fig 4 shows the average time spent by kernel k-means and k-averages to cluster synthetic datasets or varying sizes. As previously, initial conditions on each run are identical for both algorithms. The reported run time is the one measured on a 64 bits Intel 1 Core™ i7 runnning at 3.6 GHz with 32 Gb of RAM and standard Hard Disk Drive (HDD) operated by standard Linux distribution. For results with 2 GB of RAM, the same machine is used with a memory limitation specified to the kernel at boot time.

Time efficiency
These figures confirm the theoretical complexity analysis presented in Section 5: k-averages runs at least 20 times faster on average than kernel k-means in ordinary conditions, when available memory is not an issue. When the matrix size exceeds what can be stored in RAM and the system has to resort to paged memory, as in the presented experiments when the matrix reaches about 1500MB, both algorithms suffer from a clear performance hit; however, kernel k-means is much more affected, and the difference becomes even more important: with a 2000MB similarity matrix on a memory-limited computer, k-averages runs about 100 times faster than kernel k-means.
Having established the interest of our proposed method relative to kernel k-means on synthetic object collections, we now proceed to a thorough evaluation on real data.

Experiments
In order to demonstrate the usefulness of k-averages when dealing with real data, we have chosen to focus on the clustering of time series as the evaluation task. Time series, even though represented as vectors and therefore suitable for any kinds of norm-based clustering, are best compared with elastic measures [23,24], partly due to their varying length. The Dynamic Time Warping (DTW) measure is an elastic measure widely used in many areas since its introduction for spoken word detection [25] and has never been challenged for time series mining [26,27].
Effective clustering of time series using the DTW measure requires similarity based algorithms such as the k-averages algorithm. With some care, kernel based algorithm can also be considered provided that the resulting similarity matrix is converted into a kernel, i.e. the matrix is forced to be semi definite positive, i.e. to be a Gram matrix [28] in order to guarantee convergence.

Datasets
To compare quality of clusterings obtained by the considered algorithms, we consider a large collection of 43 time series datasets made publicly available by many laboratories worldwide and compiled by Prof. Keogh. Thus, while all the experiments presented here are performed on time series (chosen for being a good example of a data type requiring similarity-based clustering, as opposed to a simple Euclidean approach), the great variety in the sources and semantics of said series (bio-informatics, linguistics, astronomy, gesture modeling, chemistry. . .) gives this validation a wide foundation. Statistics about the morphology of those datasets can be found in Table 1 and summarized in Table 2.

Methods
Three algorithms are considered: the spectral clustering [14] approach as a high complexity reference, the kernel k-means algorithm implemented as described in Section 1 and the language. Even though a C implementation would probably be more efficient, we believe that the gain would be low as the main computational load is the diagonalization of the similarity matrix and the k-means clustering of the eigenvectors which are both efficient builtins Matlab functions. The kernel k-means is implemented both in Matlab using the implementation provided by Mo Chen (vailable at: https://fr.mathworks.com/matlabcentral/fileexchange/ 26182-kernel-kmeans) and in the C programming language following Algorithm 1. The kaverages method is implemented in C following Algorithm 3.

Evaluation protocol
For each dataset, since we perform clustering, and not supervised learning, the training and testing data are joined together. DTW similarities are computed using the implementation provided by Prof. Ellis (available at: http://www.ee.columbia.edu/~dpwe/resources/matlab/ dtw) with default parameters.
As in our previous experiments with synthetic data, we choose here the normalized mutual information (NMI) as the measure of clustering quality; clustering is done by requesting a number of clusters equal to the actual number of classes in the dataset, and repeated 200 times with varying initial conditions, each algorithm being run with the exact same initial assignments. For the 200 clusterings thus produced, we compute the NMI between them and the ground truth clustering. Average and standard deviation statistics are then computed.

Clustering performance
For ease of readability and comparison, the presented results are split into 3 tables. Table 3 lists the results obtained on bi-class datasets, i.e. the datasets annotated in terms of presence or absence of a given property; Table 4 concerns the datasets with a small number of classes (from 3 to 7); and Table 5 focuses the datasets with a larger number of classes (from 8 to 50).  For each experiment, the result of the best performing method is marked in bold. The Matlab and C implementations of the kernel k-means algorithm give exactly the same results in terms of NMI, thus only one column is used to display their performance.
A first observation is that the spectral clustering algorithm only performs favorably for 2 of the 43 dataset. Also, most of the bi-class problems (Table 3) do not seem to lend themselves well to this kind of approach: kernel k-means and k-averages produce quasi-identical results, poor in most cases. Concerning the medium numbers of classes (Table 4), k-averages performs best for 8 datasets out of 17. For the larger numbers of classes (Table 5), k-averages performs best for 11 datasets out of 14.
Considering the standard deviation over the several runs of the algorithm with different initialization is helpful to study the sensitivity of the algorithm to its initialization and thus its Efficient similarity-based data clustering by optimal object to cluster reallocation tendency to be stuck into local minima. For most datasets, the standard deviation of the kaverages algorithm is smaller than the one of the kernel k-means one and thus seems experimentally more robust.

Efficiency
Computation time displayed in Tables 6, 7, and 8 is the average duration over 100 runs on a single core with no parallelization capabilities. For every datasets, the spectral clustering approach is the more time consuming due to the diagonalization of the matrix which is of O (N 3 ). For the kernel k-means algorithm, the C implementation is most of the time more efficient than the Matlab implementation. The k-average algorithm is more efficient for 46 datasets out of the 47 by close to an order of magnitude for the larger datasets.

Overall results
To conclude on the performance of the evaluated algorithms on real datasets, Table 9 displays the NMI and computation time averaged over the 47 datasets. The k-averages method marginally improve the clustering accuracy compared to the kernel k-means approach by using less time to compute.

Conclusion
We have presented k-averages, an iterative flat clustering algorithm that operates on arbitrary similarity matrices by explicitly and directly aiming to optimize the average intra-class similarity. Having established the mathematical foundation of our proposal, including guaranteed convergence, we have thoroughly compared it with widely used standard methods: the kernel k-means and the spectral clustering techniques. We show that the k-averages algorithm converges much faster (20 times faster under ordinary conditions) and leads to equivalent or better clustering results for the task of clustering both synthetic data and realistic times series taken from a wide variety of sources, while also being more computationally efficient and more sparing in memory use.