Balancing effort and benefit of K-means clustering algorithms in Big Data realms

In this paper we propose a criterion to balance the processing time and the solution quality of k-means clustering algorithms when applied to instances where the number n of objects is large. The majority of the known strategies aimed at improving the performance of k-means algorithms concern the initialization or classification steps. In contrast, our criterion applies in the convergence step; namely, the process stops whenever the number of objects that change their assigned cluster at any iteration is lower than a given threshold. Through computer experimentation with synthetic and real instances, we found that a threshold close to 0.03n reduces the computing time to about 4/100 of the original, yielding solutions whose quality decreases by less than two percent. These findings naturally suggest the usefulness of our criterion in Big Data realms.


Introduction
The skyrocketing technological advances of our time are enabling a substantial increase in the amount of generated and stored data [1][2][3][4], both in public and private institutions, in science, engineering, medicine, finance, education, and transportation, among others. Hence, there is a well justified interest in the quest of useful knowledge that can be extracted from large masses of data, allowing better decision making or improving our understanding of the universe and life.
However, the handling or interrogation of large and complex masses of data with standard tools (a setting termed Big Data) is generally limited by the available computer resources [1,5]. In this regard, our contribution here is to provide a strategy to deal with the problem of clustering objects according to their attributes (characteristics, properties, etc.) in Big Data realms.
The clustering problem has long been studied. Its usefulness is undeniable in many areas of human activity: science, business, machine learning [6], data mining and knowledge discovery [7,8], and pattern recognition [9], to name but a few. Clustering consists in partitioning a set of n objects into k ≥ 2 non-empty subsets (called clusters) in such a way that the objects in any cluster have similar attributes and, at the same time, differ from the objects in any other cluster. In this paper we assume that the objects' attributes are measurable; hence, we sometimes refer to objects as points, according to the context. Let N = {x_1, …, x_n} denote the set of n points to be grouped by a closeness criterion, where x_i ∈ ℝ^d for i = 1, …, n, and d ≥ 1 is the number of dimensions (the objects' attributes). Further, let k ≥ 2 be an integer and K = {1, …, k}. For a k-partition P = {G(1), …, G(k)} of N, denote by μ_j the centroid of group (cluster) G(j), for j ∈ K, and let M = {μ_1, …, μ_k}.
Thus, the clustering problem can be posed as a constrained optimization problem (see, for instance, [10]):

P: minimize z(W, M) = Σ_{i=1}^{n} Σ_{j=1}^{k} w_{ij} d(x_i, μ_j),     (1)

where w_{ij} = 1 if and only if point x_i belongs to cluster G(j) (and w_{ij} = 0 otherwise), and d(x_i, μ_j) denotes the Euclidean distance between x_i and μ_j, for i = 1, …, n and j = 1, …, k.
Since the pioneering studies by Steinhaus [11], Lloyd [12], and Jancey [13], many investigations have been devoted to finding a k-partition of N that solves P above. It has been proved that this problem belongs to the so-called NP-hard class for k ≥ 2 or d ≥ 2 [14,15]; thus, obtaining an optimal solution of even a moderate-size instance is in general intractable.
Therefore, a variety of heuristic algorithms have been proposed to approach the optimal solution of P, the most conspicuous being those generally designated as k-means [16], with straightforward implementations. It must be said that establishing useful gaps between the optimal solution of problem P and the solution obtained by k-means remains an open problem. The computational complexity of k-means is O(nkdr), where r stands for the number of iterations [5,17], limiting its use on large instances since, in general, at every iteration all distances from objects to cluster centroids must be considered. Thus, numerous strategies to stop iterating have been investigated where, usually, increasing the computational effort entails a reduction of the objective function.
Our aim here is to propose a sound criterion to balance the processing time and the solution quality of k-means clustering algorithms in Big Data realms. Throughout, we consider the algorithm K-MEANS (intentionally denoted with capitals to distinguish it from the k-means family to which it belongs), greatly inspired by the seminal procedures of Jancey [13] and Lloyd [12], the latter being among the most widely used because of its simplicity and elegance [18]. Naturally, our proposal could just as well be applied to most procedures of the k-means family.
The rest of the paper proceeds as follows. Section Related work briefly reviews the most relevant investigations known to the authors aimed at improving the efficiency of k-means algorithms. Then, the algorithm K-MEANS is described and its behavior analysed in Section Behavior of K-MEANS, so as to highlight an observed strong correlation between the decreasing value of the objective function (1) and the number of objects changing cluster at every iteration. From this observation was born our idea to stop the algorithm in the convergence step, namely, as soon as the number of objects that change cluster is smaller than a given threshold; aiming to balance processing time and solution quality, Section Determining threshold values proposes a procedure to determine this threshold. Next, Section Proposal validation deals with a validation of our proposal: we computationally tested it on several synthetic and real instances. Finally, Section Conclusion contains our conclusions.

Related work
To date, few investigations report theoretical analyses of k-means. Our proposed methodology, like the vast majority of the published improvements for k-means, is supported by computer experimentation. The most relevant of these improvements are briefly reviewed below, grouped according to the four stages of the algorithm, with some added remarks on the choice of the appropriate number of clusters. Undeniably, although some improvements appear to dominate others, by the No Free Lunch theorem [19] the only way one strategy can outperform another is if it is specialized to the structure of the specific problem under consideration.

Initialization step
To improve the quality of solutions and reduce the number of iterations, Arthur & Vassilvitskii [20] proposed to generate the initial k centroids at random as follows. The first centroid is chosen uniformly at random from N; then, for j = 2, …, k, the probability of a point of N being chosen as the j-th centroid is proportional to the square of its minimal distance to the j − 1 already selected centroids.
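As a minimal illustration of this squared-distance sampling scheme, the following Python sketch may help; the function name and interface are ours, not tied to any implementation discussed in the paper:

```python
import random

def kmeanspp_init(points, k, rng=random):
    """D^2 sampling: the first centroid is uniform over the points; each
    subsequent one is drawn with probability proportional to the squared
    distance to its nearest already-chosen centroid."""
    centroids = [rng.choice(points)]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centroid.
        d2 = [min(sum((p[s] - c[s]) ** 2 for s in range(len(p)))
                  for c in centroids)
              for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:          # roulette-wheel selection
                centroids.append(p)
                break
    return centroids
```

Already-chosen points get weight zero and are thus (almost surely) not re-selected, which is what spreads the initial centroids apart.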
Zhanguo et al. [21] strategically label the initial centroids, so as to guide the subsequent classification; the maximal distance principle is proposed to advantageously distribute the initial centroids in the solution space. In the paper by Salman et al. [22] the initial centroids are those obtained as final by k-means when applied on a 0.1n-size, random subset of the n objects. Time savings of around 66 percent are reported.
On their part, El Agha & Ashour [23] claim that the following initialization strategy yields improved results. For s = 1, …, d and i = 1, …, n, let x_i(s) be the s-th coordinate of point x_i, ν̄_s = max_i x_i(s), and ν_s = min_i x_i(s). Also, let η_s = (ν̄_s − ν_s)/k. The initial centroids are randomly generated on the main diagonal of the rectangular grid defined by coordinates ν_s, ν_s + η_s, ν_s + 2η_s, …, ν̄_s, for s = 1, …, d.
Finally, Tzortzis & Aristidis [24] employ maximum and minimum values to improve the clustering quality at each iteration, which is particularly useful when clusters with similar numbers of objects are sought.

Classification step
Regarding the classification step, Fahim et al. [25] pose that if the distance of an object, say x_i, to the centroid of its assigned cluster, say G(j), decreases at some iteration, then x_i stays in G(j), and there is no need to compute the distances from x_i to the remaining centroids.
Among the various works proposing to apply the triangle inequality to reduce the number of times the distance from objects to centroids is computed, the most conspicuous are those by Elkan [18] and Hamerly [26]. By maintaining lower and upper bounds on the distances, the former proposes that, in every iteration, the distance from an object to a centroid need not be re-calculated if the bounds carried over from the previous iteration already decide the comparison; it is claimed that with this strategy the efficiency of k-means improves as k and n increase. In [26] a variant of the Elkan [18] approach is proposed by establishing new bounds on the distances, asserting to improve Elkan's results for instances of low dimension.
Pérez and collaborators [27,28] discuss heuristics to simplify the computation of the distance between objects and centroids. In [28], noting that when objects migrate they normally do so towards a nearby cluster, only the distances from x 2 N to the centroids of the w clusters closest to x are recalculated. Convenient values were experimentally found for w, depending on the dimension d; so, for instance, w = 4, 6 for d = 2, 4, respectively.
Two heuristics are considered in [27]; the first one permanently assigns object x 2 N to cluster G(j) as soon as the distance from x to μ j is lower than a given threshold, excluding x from subsequent distance computations. In the second heuristic a cluster is labelled 'stable' if it has no object exchanges in two successive iterations; objects in stable clusters no longer migrate. A similar heuristic is studied by Lai [29].
Kanungo et al. [30] have theoretically shown that the efficiency of k-means is enhanced when a kd-tree data structure is used. For Chiang et al. [31], objects very close to their nearest centroid are considered to have negligible probability of changing cluster; hence they are excluded from subsequent distance computations.

Convergence step
The most common convergence criterion of k-means is to stop as soon as no change in the clustering is observed. However, it seems advisable to set an upper bound on the number of iterations as a concomitant convergence criterion, as offered by software packages such as SPSS, WEKA, and R.
Denote by z_r the objective function value at iteration r. Pérez et al. [32] propose to stop the algorithm whenever z_r > z_{r−1}, while in [33] the procedure stops if |(z_{r−1}² − z_r²)/z_r²| < 0.001 along ten successive iterations.
Mexicano et al. [34] compute the largest centroid displacement found in the second iteration (denote it ψ). Then, they assume that the k-means algorithm has converged if in two successive iterations the largest change of centroids position is lower than 0.05ψ.

Selecting the number k of clusters
In the k-means realm, the choice of an appropriate k value depends on the instances considered, and is a rather difficult task; this situation is usually addressed by trial and error. However, several investigations have been carried out to automatically determine the number k of clusters; see, for example, those reported in [35,36], and the more recent approaches [37][38][39] that employ a Bayesian nonparametric view.

Behavior of K-MEANS
We made intensive computational experiments to assess the behavior of the algorithm K-MEANS below under different conditions.

Algorithm K-MEANS
Step 1 Initialization. Produce points μ_1, …, μ_k as a random subset of N.
Step 2 Classification. For all x ∈ N and j ∈ K, compute the Euclidean distance d(x, μ_j) between points x and μ_j. Then, point (object) x ∈ N is assigned to cluster G(ĵ) if d(x, μ_ĵ) ≤ d(x, μ_j), for ĵ, j ∈ K.
Step 3 Centroids. Determine the centroid μ_j of cluster G(j), for j ∈ K.
Step 4 Convergence. If the set of centroids M does not change in two successive iterations, stop the algorithm; otherwise perform another iteration starting from Step 2.
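The four steps can be sketched as follows; this is a minimal Python illustration with names of our choosing (the paper's experiments actually used a C implementation):

```python
import random

def k_means(points, k, rng=random):
    """Sketch of K-MEANS: random initialization, nearest-centroid
    classification, centroid update, stop when no centroid moves."""
    centroids = rng.sample(points, k)              # Step 1: initialization
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                           # Step 2: classification
            j = min(range(k),
                    key=lambda j: sum((p[s] - centroids[j][s]) ** 2
                                      for s in range(len(p))))
            clusters[j].append(p)
        new_centroids = [                          # Step 3: centroids
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(clusters)]
        if new_centroids == centroids:             # Step 4: convergence
            return new_centroids, clusters
        centroids = new_centroids
```

An empty cluster keeps its previous centroid here, which is one of several common conventions and is not prescribed by the paper.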
In general, a correlation was observed between the number of objects changing cluster at each iteration, and the corresponding z value of the objective function (1).
As an example of what we found, Table 1 displays some results yielded by a run of K-MEANS on a synthetic instance with n = 2·10⁶ points (objects) randomly generated in the unit square (uniform distribution), and k = 200.
The algorithm stops at iteration 612 because no points change cluster. Each row of Table 1 contains information corresponding to iteration r: the objective function value z_r (we denote by z* the lowest value found by the algorithm; thus z* = z_612 = 53721), the indicator γ_r = 100(υ_r/n), where υ_r is the number of points changing cluster, and δ_r = 100(z_r/z* − 1). The behavior displayed in Table 1 clearly suggests a strong correlation between the indexes γ and δ, and between z and γ. Hence, letting ℓ be the total number of iterations and denoting by γ̄ and δ̄ the means of the series γ_r and δ_r, the correlation coefficient between γ and δ is calculated with Eq (2), thus verifying our assumptions.
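Assuming Eq (2) is the usual sample (Pearson) correlation coefficient between the per-iteration series γ_r and δ_r, it can be computed as in this sketch (the function name is ours):

```python
import math

def pearson(xs, ys):
    """Sample correlation coefficient between two equal-length series,
    e.g. the per-iteration indicators gamma_r and delta_r."""
    l = len(xs)
    mx, my = sum(xs) / l, sum(ys) / l
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Values close to 1, as in Table 2, indicate that the two series rise and fall together.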
A similar experiment was made with a set of real and synthetic instances -described in Section Proposal validation-obtaining the correlation coefficients as shown in Table 2. Note that no value is below 0.9, and most lie in the range [0.975, 0.998].
The aforementioned considerations led us to our proposal of a criterion to balance the computational effort and the solution quality for k-means in Big Data realms. Section Determining threshold values, describes the path we followed to determine threshold values to be used in the convergence step so as to judiciously stop the algorithm.

Determining threshold values
Although the computational effort at each iteration of k-means is rather constant, the corresponding improvement of z is not. Thus the question: when is it worthwhile to keep iterating? It is well known that the so-called Pareto principle can help in determining an optimal relationship between effort and benefit [40]. Thus, we relied on this principle to provide a sound answer to the above question.
Continuing with the example of Section Behavior of K-MEANS, we computed Table 3 where, for r ≥ 2, A_r = 100(r/ℓ), B_r = 100(z_{r−1} − z_r)/(z_1 − z*), C_r = C_{r−1} + B_r, and D_r is the Euclidean distance between points (A_r, C_r) and (0, 100). Fig 3 shows a partial plot of the Pareto diagram for the points (A_r, C_r) extracted from Table 3. Note that D_30 ≤ D_r for any r, namely, (A_30, C_30) is the closest point to (0, 100). Hence, we could stop K-MEANS at iteration r = 30 = 0.0490ℓ, avoiding about 95% of the iterations, to get a solution with cost z_30 = 1.0091z* (δ_30 = 0.91 comes from Table 1), namely, less than one percent worse than z*. Also, observe in Table 1 that at iteration 30 as few as 0.0072n objects migrate.
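The knee-finding step of this example (choose the iteration whose Pareto point (A_r, C_r) lies closest to the ideal point (0, 100)) can be sketched as follows; the function name and list-based interface are ours:

```python
import math

def pareto_stop_iteration(zs):
    """zs[r-1] holds the objective value z_r at iteration r = 1, ..., l.
    Returns the iteration r minimizing the distance D_r from the point
    (A_r, C_r) to (0, 100), with A_r, B_r, C_r defined as in the text."""
    l = len(zs)
    z1, zstar = zs[0], zs[-1]
    best_r, best_d, C = 1, float("inf"), 0.0
    for r in range(2, l + 1):
        A = 100.0 * r / l                                    # effort spent so far
        C += 100.0 * (zs[r - 2] - zs[r - 1]) / (z1 - zstar)  # cumulative benefit
        d = math.hypot(A, C - 100.0)                         # D_r
        if d < best_d:
            best_r, best_d = r, d
    return best_r
```

On a run whose objective drops quickly and then plateaus, the returned iteration sits just past the steep part of the Pareto curve.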
Following the ideas and concepts discussed over the previous example, we undertook intensive computer experimentation applying the Pareto principle to a variety of instances. As a result we obtained a set of threshold values U such that, depending on the available computer time, the instance size, and the user's needs, it seems reasonable to stop K-MEANS at iteration r whenever γ_r ≤ U. Recall that γ_r = 100(υ_r/n), where υ_r is the number of migrating points at iteration r. The algorithm below, which we call O-K-MEANS (optimized K-MEANS), results when our criterion is incorporated into K-MEANS.
Step 2 Classification. For all x ∈ N and j ∈ K, compute the Euclidean distance d(x, μ_j) between points x and μ_j. Then, point (object) x ∈ N is assigned to cluster G(ĵ) if d(x, μ_ĵ) ≤ d(x, μ_j), for ĵ, j ∈ K.
Step 3 Centroids. Determine the centroid μ_j of cluster G(j), for j ∈ K.
Step 4 Convergence. If γ_r ≤ U, stop the algorithm; otherwise perform another iteration starting from Step 2.
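A sketch of O-K-MEANS follows, with the same caveats as before: an illustrative Python rendering with names of our choosing, not the authors' C code.

```python
import random

def o_k_means(points, k, U, rng=random):
    """K-MEANS with the modified Step 4: stop as soon as
    gamma_r = 100 * (migrating points) / n is at most U."""
    n = len(points)
    centroids = rng.sample(points, k)              # Step 1: initialization
    assign = [-1] * n                              # previous cluster of each point
    while True:
        moved = 0
        clusters = [[] for _ in range(k)]
        for i, p in enumerate(points):             # Step 2: classification
            j = min(range(k),
                    key=lambda j: sum((p[s] - centroids[j][s]) ** 2
                                      for s in range(len(p))))
            if j != assign[i]:
                moved += 1
                assign[i] = j
            clusters[j].append(p)
        centroids = [                              # Step 3: centroids
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(clusters)]
        if 100.0 * moved / n <= U:                 # Step 4: gamma_r <= U
            return centroids, clusters
```

With U = 0 this reduces to stopping when no point changes cluster; a positive U trades a small quality loss for fewer iterations.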

Proposal validation
Using the C language and the GCC 4.9.2 compiler, both the K-MEANS and O-K-MEANS algorithms were implemented on a Mac mini computer with OS X Yosemite 10.10, a Core i5 processor at 2.8 GHz, and 16 GB of RAM. The implementation of our stopping strategy requires negligible additional memory resources.
For the design of the computational experiments and the analysis of our algorithms, we used the methodology proposed by McGeoch [41].
As mentioned in Section Introduction, the complexity of K-MEANS is O(nkdr). Therefore, the complexity of O-K-MEANS can be expressed as O(nkdrα), where α denotes the quotient between the number of O-K-MEANS iterations and the number of K-MEANS iterations. From our experiments with large synthetic and real instances we obtained an average of α = 0.0389.
Our codes were tested on sets of real and synthetic instances. In each instance the initial centroids were the same for both algorithms. In what follows we denote by t, t_o, z*, and z*_o, respectively, the time needed by K-MEANS, the time needed by O-K-MEANS, the best objective function value found by K-MEANS, and the best objective function value obtained by O-K-MEANS. Further, to measure the time saving and the quality loss we use, respectively, the formulas 100(1 − t_o/t) and 100(z*_o/z* − 1). For each instance we chose the product nkd as an indicator of its complexity level. The computer experiments described in sections Synthetic instances and Real instances used different U values according to the Pareto optimality principle applied to each instance. Consideration of different thresholds for the same instance is dealt with in Section Using other threshold values.
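Assuming the two performance measures take the standard percentage forms, 100(1 − t_o/t) for time saving and 100(z*_o/z* − 1) for quality loss (an assumption on our part; the helper names below are illustrative), they amount to:

```python
def time_saving(t, t_o):
    """Percentage of computing time saved by O-K-MEANS w.r.t. K-MEANS,
    assuming the formula 100 * (1 - t_o / t)."""
    return 100.0 * (1.0 - t_o / t)

def quality_loss(z_star, z_star_o):
    """Percentage by which the O-K-MEANS objective exceeds the K-MEANS one,
    assuming the formula 100 * (z*_o / z* - 1)."""
    return 100.0 * (z_star_o / z_star - 1.0)
```

For example, t = 1.00 h against t_o = 0.04 h gives a 96% time saving, of the same magnitude as the averages reported for the experiments.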

Synthetic instances
Fourteen synthetic instances were produced for different values of n and numbers of dimensions d, as shown in Table 4.
All points in each synthetic instance were randomly generated: while in instances 1 to 13 they followed a uniform distribution with coordinates belonging to (0, 1) d , in instance 14 a normal distribution was used with mean 40 and standard deviation 20. Distinct values for the number of clusters k were considered. Table 5 shows the results of an execution of each algorithm; the computing times t and t o are reported in hours rounded to hundredths. Notice that the instances are sorted according to the nkd product.
When compared to K-MEANS, algorithm O-K-MEANS exhibits good performance on these instances: on average, solutions are obtained in less than four percent of the calculation time, with a quality decrease of only about 0.5%. It is worth mentioning the relatively low standard deviation (1.17) as compared to the average (96.10).

Real instances
We carried out computer experiments with five real instances taken from the University of California repository, UCI [42]; see Table 6. Note the relatively small size of instances 1-3 when compared to the large instances 4 and 5. For distinct k values, each real instance was solved with K-MEANS and O-K-MEANS using the same initial centroids; results are shown in Table 7. In general, as nkd increases, the efficiency of O-K-MEANS increases; note the average time reduction for the large instances (96.12%) versus the small ones (86.47%), with similar quality loss. As in the case of the synthetic instances, low standard deviations are observed.

Using other threshold values
In sections Synthetic instances and Real instances, we showed results for thresholds determined following the Pareto optimality principle; however, other threshold values could as well be selected according to the need of solution quality and time availability.
Thus, each instance of Table 4 was solved with K-MEANS and O-K-MEANS using the same initial centroids and the threshold values U shown in Table 8. In this table, r̄ and d̄ denote the averaged reduction of time and solution quality, respectively. As expected, each row decreases monotonically from left to right; namely, increasing U leads to greater reductions of both computing time and solution quality. Thus, there is a trade-off between these concepts. We find remarkable the closeness between the reductions with U = 1 and those that arise when applying the Pareto principle (see Table 5).

Combining our convergence criterion with other criteria
To assess the benefit of combining our convergence criterion with other algorithms for speeding up k-means, we considered two efficient classification strategies: one proposed by Fahim et al. [25], call it F, and another due to Pérez et al. [43], call it P. With this aim we chose 10 synthetic instances described in Table 4, to be solved for distinct k values. For each instance the developed codes used the same initial centroids; Table 9 shows the results. Our computational experiments lead us to pose that it can be advantageous to combine O-K-MEANS with other strategies, both in terms of computing time reduction and solution quality.
Also, to assess the algorithms OK, FOK, and POK when using as initial centroids those generated by the algorithm K++ proposed by Arthur & Vassilvitskii [20], we selected eight instances; see Table 10. Instance A is the synthetic one shown in Fig 4; instances B and C correspond to the real instance Letters described in Table 6; instance D was produced by randomly selecting 30 000 objects from instance 10 of Table 4; to produce instances E and F, 40 000 points were randomly generated (uniform distribution) in a bounded space; finally, instances G and H correspond to instance 10 of Table 4. Table 10 shows the time and quality reduction, relative to the algorithm K++, obtained by algorithms OK, FOK, and POK when using as initial centroids those generated by K++. These results support our belief in the usefulness of our proposal.

Data clustered around specific centers
To assess the performance of our stopping criterion when the data are heavily clustered around a few specific centers, we constructed a 2-dimensional synthetic instance (d = 2) in which n = 11 000 points form compact groups; see Fig 4. Each group was formed by arbitrarily selecting a center and a standard deviation, and points were randomly generated with a normal distribution around the centers. This instance was solved for k = 5, 10, 15, 20. Table 11 shows the average results of 30 runs of K-MEANS and O-K-MEANS with threshold U = 3.14, as well as the average reduction of time and quality achieved by the latter. Note that the best results are obtained as k increases, with a time reduction of up to 86.44% and an average quality reduction as low as 2.86%.

Clustering larger data sets
We tested the performance of O-K-MEANS on four still larger instances, selected from the repository UCI [42]. Table 12 shows their relevant data as well as the average computational results, presented as for the real instances of Table 6. These results confirm the suitability of our proposal in Big Data realms.

Conclusion
We have presented a sound criterion to balance effort and benefit of k-means clustering algorithms in Big Data realms. Its advantages were demonstrated by applying it in the convergence step of one of the most widely used procedures of the k-means family, which here we called K-MEANS. Guided by the Pareto principle, our criterion consists in stopping the iterations as soon as the number of objects that change cluster at any iteration is lower than a prescribed threshold. The novelty of our methodology comes from two facts. First, regarding the stopping criterion, the authors are not aware of any previous proposal directly based on the number of objects changing group at every iteration. Second, to date the Pareto principle has not been used to determine a threshold leading to an adequate compromise between the quality of a solution and the time needed to obtain it. From intensive computer experimentation on synthetic and real instances we found that, in general, our criterion significantly reduces the number of iterations with a relatively small decrement in the quality of the yielded solutions. Furthermore, the best results tend to correspond to the largest instances considered, namely, those where the product nkd is high. Thus, this behavior is an indicator of the usefulness of applying the Pareto principle in the convergence step when dealing with large k-means instances.
It is well known that some strategies to improve the performance of k-means are sensitive to the number of dimensions. This is not our case, since our proposal aims to reduce the number of iterations made, and the time complexity per iteration nkd is taken as a constant.
An important characteristic of our stopping strategy is that its implementation requires negligible additional memory resources; in this regard, it appears to have an advantage over other proposed criteria.
Last, but not least, our proposed convergence criterion is not incompatible with any improvement related to the initialization or classification steps of k-means, as we have shown in relation to the procedure that generates the initial centroids of the K++ algorithm. As future work we find it appealing to deepen these investigations in the realm of parallel and distributed computing paradigms. It is foreseeable that our stopping criterion can be successfully used under such paradigms since it only requires the number of objects changing group at each iteration.