
Balancing effort and benefit of K-means clustering algorithms in Big Data realms

  • Joaquín Pérez-Ortega ,

    Contributed equally to this work with: Joaquín Pérez-Ortega, Nelva Nely Almanza-Ortega, David Romero

    Roles Conceptualization, Investigation, Writing – original draft, Writing – review & editing

    jpo_cenidet@yahoo.com.mx

    Affiliation Departamento de Ciencias Computacionales/Centro Nacional de Investigación y Desarrollo Tecnológico, Tecnológico Nacional de México, Cuernavaca, Morelos, Mexico

  • Nelva Nely Almanza-Ortega ,

    Contributed equally to this work with: Joaquín Pérez-Ortega, Nelva Nely Almanza-Ortega, David Romero

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Departamento de Ciencias Computacionales/Centro Nacional de Investigación y Desarrollo Tecnológico, Tecnológico Nacional de México, Cuernavaca, Morelos, Mexico

  • David Romero

    Contributed equally to this work with: Joaquín Pérez-Ortega, Nelva Nely Almanza-Ortega, David Romero

    Roles Conceptualization, Formal analysis, Methodology, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliation Instituto de Matemáticas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico

Abstract

In this paper we propose a criterion to balance the processing time and the solution quality of k-means clustering algorithms when applied to instances where the number n of objects is large. The majority of the known strategies aimed at improving the performance of k-means algorithms are related to the initialization or classification steps. In contrast, our criterion applies in the convergence step, namely, the process stops whenever the number of objects that change their assigned cluster at any iteration is lower than a given threshold. Through computer experimentation with synthetic and real instances, we found that a threshold close to 0.03n reduces the computing time to about 4/100 of the original, yielding solutions whose quality decreases by less than two percent. These findings naturally suggest the usefulness of our criterion in Big Data realms.

Introduction

The skyrocketing technological advances of our time are enabling a substantial increase in the amount of generated and stored data [1–4], both in public and private institutions, in science, engineering, medicine, finance, education, and transportation, among others. Hence, there is a well justified interest in the quest for useful knowledge that can be extracted from large masses of data, allowing better decision making or improving our understanding of the universe and life.

However, the handling or interrogation of large and complex masses of data with standard tools —termed Big Data— is generally limited by the available computer resources [1, 5]. In this regard, our contribution here is to provide a strategy to deal with the problem of clustering objects according to their attributes (characteristics, properties, etc.) in Big Data realms.

The clustering problem has long been studied. Its usefulness is undeniable in many areas of human activity, in science, business, machine learning [6], data mining and knowledge discovery [7, 8], and pattern recognition [9], to name but a few. Clustering consists in partitioning a set of n objects into k ≥ 2 non-empty subsets (called clusters) in such a way that the objects in any cluster have similar attributes and, at the same time, differ from the objects in any other cluster.

In this paper we assume that the objects’ attributes are measurable. Hence, sometimes we refer to objects as points, according to the context.

Let N = {x1, …, xn} denote the set of n points to be grouped by a closeness criterion, where xi ∈ ℜd for i = 1, …, n, and d ≥ 1 is the number of dimensions (the objects’ attributes). Further, let k ≥ 2 be an integer and K = {1, …, k}. For a k-partition of N, denote by μj the centroid of group (cluster) G(j), for j ∈ K, and let M = {μ1, …, μk}.

Thus, the clustering problem can be posed as the constrained optimization problem P (see, for instance, [10]):

(1)  minimize z = ∑i=1,…,n ∑j=1,…,k wij d(xi, μj), subject to ∑j=1,…,k wij = 1 for i = 1, …, n, and wij ∈ {0, 1},

where wij = 1 ⇔ point xi belongs to cluster G(j), and d(xi, μj) denotes the Euclidean distance between xi and μj, for i = 1, …, n, and j = 1, …, k.
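For concreteness, evaluating (1) for a given assignment amounts to one pass over the objects. The following C sketch is illustrative only (names are ours); it stores the assignment as an index array w, where w[i] = j means wij = 1.

#include <stddef.h>
#include <math.h>

/* Euclidean distance between two d-dimensional points. */
static double euclid(const double *a, const double *b, size_t d) {
    double s = 0.0;
    for (size_t t = 0; t < d; ++t) { double u = a[t] - b[t]; s += u * u; }
    return sqrt(s);
}

/* Objective function (1): each object contributes the distance to the
   centroid of the cluster it is assigned to.
   x: n*d coordinates (row-major); mu: k*d centroids. */
double objective_z(const double *x, const double *mu, const int *w,
                   size_t n, size_t d) {
    double z = 0.0;
    for (size_t i = 0; i < n; ++i)
        z += euclid(&x[i * d], &mu[(size_t)w[i] * d], d);
    return z;
}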

Since the pioneering studies by Steinhaus [11], Lloyd [12], and Jancey [13], many investigations have been devoted to finding a k-partition of N that solves P above. It has been proved that this problem belongs to the so-called NP-hard class for k ≥ 2 or d ≥ 2 [14, 15]; thus, obtaining an optimal solution even for a moderate-size instance is in general intractable.

Therefore, a variety of heuristic algorithms have been proposed to approach the optimal solution of P, the most conspicuous being those generally designated as k-means [16], of straightforward implementation. It must be said that the establishment of useful gaps between the optimal solution of problem P and the solution obtained by k-means remains an open problem. The computational complexity of k-means is O(nkdr), where r stands for the number of iterations [5, 17], limiting its use in large instances since, in general, at every iteration all distances from objects to cluster centroids must be considered. Thus, numerous strategies to decide when to stop iterating have been investigated where, usually, additional computational effort entails a further reduction of the objective function.

Our aim here is to propose a sound criterion to balance the processing time and the solution quality of k-means clustering algorithms in Big Data realms. Thus, we consider throughout the algorithm K-means—intentionally denoted with capitals to distinguish it from the k-means family to which it belongs—greatly inspired by the seminal algorithms of Jancey [13] and Lloyd [12], the latter being one of the most widely used because of its simplicity and elegance [18]. Obviously, our proposal could as well be applied to most procedures of the k-means family.

The rest of the paper proceeds as follows. Section Related work briefly reviews the most relevant investigations known to the authors to improve the efficiency of k-means algorithms. Then, the algorithm K-means is described and its behavior is analysed in Section Behavior of K-means, so as to highlight an observed strong correlation between the decreasing value of the objective function (1) and the number of objects changing cluster at every iteration. This observation gave rise to our idea of stopping the algorithm in the convergence step, namely, as soon as the number of objects that change cluster is smaller than a given threshold; aiming to balance processing time and solution quality, Section Determining threshold values proposes a procedure to determine this threshold. Next, Section Proposal validation deals with a validation of our proposal: we computationally tested it on several synthetic and real instances. Finally, Section Conclusion contains our conclusions.

Related work

To date, few investigations report theoretical analyses of k-means. Our proposed methodology, like the vast majority of the published improvements for k-means, is supported by computer experimentation. Among them, the most relevant are briefly reviewed below, grouped according to the steps of the algorithm, together with some works on the choice of the appropriate number of clusters. Undeniably, although some improvements appear to dominate others, by the No Free Lunch theorem [19] the only way one strategy can outperform another is if it is specialized to the structure of the specific problem under consideration.

Initialization step

Aiming to improve solution quality and reduce the number of iterations, Arthur & Vassilvitskii [20] proposed the following random generation of the initial k centroids. The first centroid is chosen uniformly at random from N; then, for j = 2, …, k, the probability of a point in N being chosen as the j-th centroid is proportional to the square of its minimal distance to the j − 1 already selected centroids.
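A compact C sketch of this seeding rule follows (our illustration only; a roulette-wheel draw realizes the distance-proportional selection, and duplicate draws are not guarded against).

#include <stdlib.h>
#include <string.h>

/* Squared Euclidean distance between two d-dimensional points. */
static double sq_dist(const double *a, const double *b, int d) {
    double s = 0.0;
    for (int t = 0; t < d; ++t) { double u = a[t] - b[t]; s += u * u; }
    return s;
}

/* Seeding in the spirit of [20]: the first centroid is drawn uniformly from
   the n points in x (row-major, n*d); each further centroid is drawn with
   probability proportional to the squared distance to the nearest centroid
   chosen so far.  mu must hold k*d doubles. */
void seed_kmeanspp(const double *x, int n, int d, int k, double *mu) {
    double *dist2 = malloc((size_t)n * sizeof *dist2);
    int first = rand() % n;                       /* uniform first choice */
    memcpy(mu, &x[(size_t)first * d], (size_t)d * sizeof *mu);
    for (int i = 0; i < n; ++i)
        dist2[i] = sq_dist(&x[(size_t)i * d], mu, d);
    for (int j = 1; j < k; ++j) {
        double total = 0.0;
        for (int i = 0; i < n; ++i) total += dist2[i];
        double pick = ((double)rand() / RAND_MAX) * total, acc = 0.0;
        int chosen = n - 1;
        for (int i = 0; i < n; ++i) {             /* roulette-wheel draw */
            acc += dist2[i];
            if (acc >= pick) { chosen = i; break; }
        }
        memcpy(&mu[(size_t)j * d], &x[(size_t)chosen * d],
               (size_t)d * sizeof *mu);
        for (int i = 0; i < n; ++i) {             /* refresh nearest-centroid
                                                     squared distances */
            double s = sq_dist(&x[(size_t)i * d], &mu[(size_t)j * d], d);
            if (s < dist2[i]) dist2[i] = s;
        }
    }
    free(dist2);
}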

Zhanguo et al. [21] strategically label the initial centroids so as to guide the subsequent classification; the maximal distance principle is proposed to advantageously distribute the initial centroids in the solution space. In the paper by Salman et al. [22], the initial centroids are the final centroids obtained by k-means applied to a random subset of 0.1n objects. Time savings of around 66 percent are reported.

On their part, El Agha & Ashour [23] claim that the following initialization strategy yields improved results. For s = 1, …, d and i = 1, …, n, let xi(s) be the s-th coordinate of point xi, and let Ms = max_i xi(s) and νs = min_i xi(s). The initial centroids are randomly generated on the main diagonal of the rectangular grid defined by the coordinates νs and Ms, for s = 1, …, d.

Finally, Tzortzis & Likas [24] employ maxima and minima values to improve the clustering quality at each iteration, which is particularly useful when clusters with a similar number of objects are sought.

Classification step

In regard to the classification step, Fahim et al. [25] propose that if the distance of an object, say xi, to the centroid of its assigned cluster, say G(j), decreases at any iteration, then xi stays in G(j), with no need to compute the distances from xi to the remaining centroids.
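As an illustration of this rule, the classification of a single object could be organized as follows (a sketch under our own naming; dprev holds the distance computed for the object in the previous iteration).

#include <stddef.h>
#include <math.h>

/* Euclidean distance between two d-dimensional points. */
static double euclid(const double *a, const double *b, int d) {
    double s = 0.0;
    for (int t = 0; t < d; ++t) { double u = a[t] - b[t]; s += u * u; }
    return sqrt(s);
}

/* Sketch of the rule in [25] for one object xi: if the distance to the
   (updated) centroid of its current cluster did not increase with respect
   to the previous iteration (*dprev), the assignment is kept and only one
   distance is computed; otherwise all k distances are evaluated.
   Returns the (possibly new) cluster index. */
static int classify_fahim(const double *xi, const double *mu, int k, int d,
                          int current, double *dprev) {
    double dcur = euclid(xi, &mu[(size_t)current * d], d);
    if (dcur <= *dprev) {            /* distance decreased: stay put */
        *dprev = dcur;
        return current;
    }
    int best = current;              /* otherwise, full search */
    double dbest = dcur;
    for (int j = 0; j < k; ++j) {
        double dj = euclid(xi, &mu[(size_t)j * d], d);
        if (dj < dbest) { dbest = dj; best = j; }
    }
    *dprev = dbest;
    return best;
}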

Among the various works proposing to apply the triangle inequality to reduce the number of times that the distance from objects to centroids is computed, the most conspicuous are those by Elkan [18] and Hamerly [26]. By maintaining lower and upper bounds on distances, the former proposes that, at every iteration, the distance from an object to a centroid should not be recalculated if, in the previous iteration, that distance was outside these bounds; it is claimed that with this strategy the efficiency of k-means grows as k and n increase. In [26] a variant of the Elkan [18] approach is proposed by establishing new bounds on the distances, claiming to improve on Elkan’s results for instances of low dimension.
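The underlying pruning argument can be stated as a one-line test; the snippet below is a simplified sketch of the triangle-inequality idea (not the full bookkeeping of [18] or [26]).

/* Simplified pruning test based on the triangle inequality:
   if d(mu_a, mu_b) >= 2 * d(x, mu_a) then d(x, mu_b) >= d(x, mu_a), hence
   centroid b cannot be closer to x than centroid a and d(x, mu_b) need not
   be computed.  Distances are passed in, since implementations precompute
   the centroid-to-centroid distances once per iteration.
   Returns 1 when the computation can be skipped. */
static int can_skip(double dist_x_to_a, double dist_a_to_b) {
    return dist_a_to_b >= 2.0 * dist_x_to_a;
}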

Pérez and collaborators [27, 28] discuss heuristics to simplify the computation of the distances between objects and centroids. In [28], noting that when objects migrate they normally do so towards a nearby cluster, only the distances from x ∈ N to the centroids of the w clusters closest to x are recalculated. Convenient values for w were found experimentally, depending on the dimension d; for instance, w = 4 and w = 6 for d = 2 and d = 4, respectively.

Two heuristics are considered in [27]; the first one permanently assigns object x ∈ N to cluster G(j) as soon as the distance from x to μj is lower than a given threshold, excluding x from subsequent distance computations. In the second heuristic, a cluster is labelled ‘stable’ if it has no object exchanges in two successive iterations; objects in stable clusters no longer migrate. A similar heuristic is studied by Lai et al. [29].

Kanungo et al. [30] have theoretically shown that the efficiency of k-means is enhanced when a kd-tree data structure is used. For Chiang et al. [31], objects very close to their nearest centroid are considered to have a negligible probability of changing cluster, hence they are excluded from subsequent distance computations.

Convergence step

The most common convergence criterion of k-means is to stop as soon as no change in the clustering is observed. However, it seems advisable to set an upper bound on the number of iterations as a concomitant convergence criterion, as offered by software packages such as SPSS, WEKA, and R.

Denote by zr the objective function value at iteration r. Pérez et al. [32] propose to stop the algorithm whenever zr > zr−1, while in [33] the procedure stops if no significant improvement of z is observed along ten successive iterations.

Mexicano et al. [34] compute the largest centroid displacement observed in the second iteration (denote it ψ). They then assume that the k-means algorithm has converged if, in two successive iterations, the largest change in centroid positions is lower than 0.05ψ.
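For reference, these two alternative criteria can be expressed as simple boolean tests (our reading; variable names are ours).

/* Criterion of [32]: stop as soon as the objective value increases. */
static int converged_by_cost(double z_prev, double z_curr) {
    return z_curr > z_prev;
}

/* Criterion in the spirit of [34]: psi is the largest centroid displacement
   observed at the second iteration; convergence is assumed when the largest
   displacement stays below 0.05 * psi in two successive iterations. */
static int converged_by_displacement(double max_disp_prev,
                                     double max_disp_curr, double psi) {
    return max_disp_prev < 0.05 * psi && max_disp_curr < 0.05 * psi;
}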

Selecting the number k of clusters

In the k-means realm, the choice of an appropriate value of k depends on the instances considered and is a rather difficult task; this situation is usually addressed by trial and error. However, several investigations have been carried out to automatically determine the number k of clusters; see, for example, those reported in [35, 36], and the more recent approaches [37–39] that employ a Bayesian nonparametric view.

Behavior of K-means

We made intensive computational experiments to assess the behavior of the algorithm K-means, described below, under different conditions.

Algorithm K-means

Step 1 Initialization. Produce points μ1,…,μk, as a random subset of N.

Step 2 Classification. For all x ∈ N and j ∈ K, compute the Euclidean distance d(x, μj) between points x and μj. Then, point (object) x ∈ N is assigned to cluster G(j) if d(x, μj) ≤ d(x, μl) for all l ∈ K.

Step 3 Centroids. Determine the centroid μj of cluster G(j), for jK.

Step 4 Convergence. If the set of centroids M does not change in two successive iterations, stop the algorithm; otherwise perform another iteration starting from Step 2.
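For concreteness, the four steps above can be sketched in C as follows. This is an illustrative, single-threaded sketch written for this description (function and variable names are ours), not the code used in the experiments reported below.

#include <stdlib.h>
#include <string.h>
#include <math.h>

/* Euclidean distance between two d-dimensional points. */
static double euclid(const double *a, const double *b, int d) {
    double s = 0.0;
    for (int t = 0; t < d; ++t) { double u = a[t] - b[t]; s += u * u; }
    return sqrt(s);
}

/* Sketch of Steps 1-4.  x: n*d object coordinates (row-major);
   mu: k*d centroids (output); assign: cluster index of each object (output).
   Returns the number of iterations performed. */
int kmeans(const double *x, int n, int d, int k, double *mu, int *assign) {
    double *sum = malloc((size_t)k * d * sizeof *sum);
    int *count = malloc((size_t)k * sizeof *count);

    /* Step 1: initialization -- centroids drawn at random from N
       (this sketch does not guard against duplicate draws). */
    for (int j = 0; j < k; ++j)
        memcpy(&mu[(size_t)j * d], &x[(size_t)(rand() % n) * d],
               (size_t)d * sizeof *mu);
    for (int i = 0; i < n; ++i) assign[i] = -1;

    int r = 0, changes;
    do {
        r++;
        /* Step 2: classification -- assign each object to its nearest centroid. */
        changes = 0;
        for (int i = 0; i < n; ++i) {
            int best = 0;
            double dbest = euclid(&x[(size_t)i * d], &mu[0], d);
            for (int j = 1; j < k; ++j) {
                double dj = euclid(&x[(size_t)i * d], &mu[(size_t)j * d], d);
                if (dj < dbest) { dbest = dj; best = j; }
            }
            if (best != assign[i]) { assign[i] = best; changes++; }
        }
        /* Step 3: centroids -- recompute each centroid as the mean of its
           cluster; empty clusters keep their previous centroid. */
        memset(sum, 0, (size_t)k * d * sizeof *sum);
        memset(count, 0, (size_t)k * sizeof *count);
        for (int i = 0; i < n; ++i) {
            count[assign[i]]++;
            for (int t = 0; t < d; ++t)
                sum[(size_t)assign[i] * d + t] += x[(size_t)i * d + t];
        }
        for (int j = 0; j < k; ++j)
            if (count[j] > 0)
                for (int t = 0; t < d; ++t)
                    mu[(size_t)j * d + t] = sum[(size_t)j * d + t] / count[j];
        /* Step 4: convergence -- no object changed cluster, hence the
           centroids remain unchanged. */
    } while (changes > 0);

    free(sum);
    free(count);
    return r;
}

The classification step dominates the cost, giving the O(nkd) work per iteration mentioned in the Introduction.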

In general, a correlation was observed between the number of objects changing cluster at each iteration, and the corresponding z value of the objective function (1).

As an example of what we found, Table 1 displays some results yielded by a run of K-means on a synthetic instance with n = 2 ⋅ 106 points (objects) randomly generated in the unit square (uniform distribution), and k = 200.

Table 1. Some results of the K-means algorithm with a randomly generated instance where n = 2 ⋅ 106, k = 200, and d = 2.

https://doi.org/10.1371/journal.pone.0201874.t001

The algorithm stops at iteration 612 because no points change cluster.

Each row of Table 1 contains information corresponding to iteration r: the objective function value zr (we denote by z* the lowest value found by the algorithm; thus z* = z612 = 53721), the indicator γr = 100(υr/n), where υr is the number of points changing cluster, and δr = 100(zr/z* − 1).

To grasp the algorithm behavior from Table 1, consider Fig 1 (for r = 11, …, 30, the points (γr, δr) lie on an almost straight line) as well as the striking similarity of the curves in Fig 2.

Fig 2. Graphics of γ and δ per iteration.

(A) Percentage of objects changing cluster per iteration. (B) Cost z per iteration.

https://doi.org/10.1371/journal.pone.0201874.g002

This clearly suggests a strong correlation between the indexes γ and δ, and between z and γ. Hence, letting r̂ be the total number of iterations, γ̄ = (1/r̂) ∑r=1,…,r̂ γr, and δ̄ = (1/r̂) ∑r=1,…,r̂ δr, the correlation coefficient between γ and δ is calculated with Eq (2), thus allowing us to verify our assumptions:

(2)  ρ(γ, δ) = ∑r=1,…,r̂ (γr − γ̄)(δr − δ̄) / √( ∑r=1,…,r̂ (γr − γ̄)² ∑r=1,…,r̂ (δr − δ̄)² )
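The coefficient in Eq (2) is computed directly from the per-iteration series; a C sketch (ours) is given below.

#include <math.h>

/* Correlation coefficient of Eq (2) between gamma[0..m-1] and delta[0..m-1],
   where m is the total number of iterations. */
double correlation(const double *gamma, const double *delta, int m) {
    double gbar = 0.0, dbar = 0.0;
    for (int r = 0; r < m; ++r) { gbar += gamma[r]; dbar += delta[r]; }
    gbar /= m; dbar /= m;
    double num = 0.0, sg = 0.0, sd = 0.0;
    for (int r = 0; r < m; ++r) {
        num += (gamma[r] - gbar) * (delta[r] - dbar);
        sg  += (gamma[r] - gbar) * (gamma[r] - gbar);
        sd  += (delta[r] - dbar) * (delta[r] - dbar);
    }
    return num / sqrt(sg * sd);
}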

A similar experiment was made with a set of real and synthetic instances —described in Section Proposal validation— obtaining the correlation coefficients as shown in Table 2. Note that no value is below 0.9, and most lie in the range [0.975, 0.998].

Table 2. Correlation coefficients of synthetic and real instances.

https://doi.org/10.1371/journal.pone.0201874.t002

The aforementioned considerations led us to our proposal of a criterion to balance the computational effort and the solution quality of k-means in Big Data realms. Section Determining threshold values describes the path we followed to determine the threshold values to be used in the convergence step so as to judiciously stop the algorithm.

Determining threshold values

Although the computational effort at each iteration of k-means is rather constant, the corresponding improvement of z is not. Thus the question: when is it worthwhile to keep iterating? It is well known that the so-called Pareto principle can help in determining an optimal relationship between effort and benefit [40]. Hence, we relied on this principle to provide a sound answer to the above question.

Continuing with the example of Section Behavior of K-means, we computed Table 3 where, for r ≥ 2, Ar = 100(r/r̂), Br = 100(zr−1 − zr)/(z1 − z*), Cr = Cr−1 + Br, and Dr is the Euclidean distance between points (Ar, Cr) and (0, 100).

Table 3. Information to seize the relationship between computational effort and solution quality.

https://doi.org/10.1371/journal.pone.0201874.t003

Fig 3 shows a partial plot of the Pareto diagram for the points (Ar, Cr) extracted from Table 3. Note that D30 ≤ Dr for all r, namely, (A30, C30) is the closest point to (0, 100). Hence, we could stop K-means at iteration r = 30 = 0.0490r̂, avoiding ≈95% of the iterations, to get a solution with cost z30 = 1.0091z* (δ30 = 0.91 comes from Table 1), namely, less than one percent worse than z*. Also, observe in Table 1 that at iteration 30 as few as 0.0072n objects migrate.
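The selection of the stopping iteration used in this analysis can be sketched as follows (our code; it assumes a full run of K-means is available, since the procedure is applied a posteriori to derive the thresholds).

#include <math.h>

/* Given the objective values z[1..rhat] of a full K-means run (z[0] unused),
   build the effort/benefit pairs (A_r, C_r) of Table 3 and return the
   iteration r minimizing the distance D_r to the ideal point (0, 100). */
int pareto_stop(const double *z, int rhat) {
    const double zstar = z[rhat];            /* best value found */
    double C = 0.0, best_D = 1e300;
    int best_r = 2;
    for (int r = 2; r <= rhat; ++r) {
        double A = 100.0 * r / rhat;                           /* effort %  */
        double B = 100.0 * (z[r - 1] - z[r]) / (z[1] - zstar);
        C += B;                                                /* benefit % */
        double D = sqrt(A * A + (100.0 - C) * (100.0 - C));
        if (D < best_D) { best_D = D; best_r = r; }
    }
    return best_r;
}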

Following the ideas and concepts discussed in the previous example, we undertook intensive computer experimentation applying the Pareto principle to a variety of instances. As a result we obtained a set of threshold values U such that, depending on the available computer time, the instance size, and the needs at hand, it seems reasonable to stop K-means at iteration r whenever γr ≤ U. Recall that γr = 100(υr/n), where υr is the number of migrating points at iteration r.

The algorithm below, which we call O-K-means (optimized K-means), results when our criterion is incorporated into K-means.

O-K-means

Step 1 Initialization. Produce points μ1,…,μk, as a random subset of N.

Step 2 Classification. For all x ∈ N and j ∈ K, compute the Euclidean distance d(x, μj) between points x and μj. Then, point (object) x ∈ N is assigned to cluster G(j) if d(x, μj) ≤ d(x, μl) for all l ∈ K.

Step 3 Centroids. Determine the centroid μj of cluster G(j), for jK.

Step 4 Convergence. If γr ≤ U, stop the algorithm; otherwise perform another iteration starting from Step 2.
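Relative to the K-means sketch given in Section Behavior of K-means, only the convergence test changes; a minimal illustration follows (our code, with γr computed from the count of migrating objects).

/* Step 4 of O-K-means as a boolean test: gamma_r = 100 * changes / n is the
   percentage of objects that changed cluster at the current iteration, and
   the algorithm stops as soon as gamma_r <= U. */
static int okmeans_converged(int changes, int n, double U) {
    double gamma_r = 100.0 * (double)changes / (double)n;
    return gamma_r <= U;
}

In that sketch, the loop condition while (changes > 0) would become while (!okmeans_converged(changes, n, U)).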

Proposal validation

Both the K-means and O-K-means algorithms were implemented in the C language (GCC 4.9.2 compiler) on a Mac mini computer with OS X Yosemite 10.10, a 2.8 GHz Core i5 processor, and 16 GB of RAM. The implementation of our stopping strategy requires negligible additional memory resources.

For the design of the computational experiments and the analysis of our algorithms, we used the methodology proposed by McGeoch [41].

As mentioned in Section Introduction, the complexity of K-means is O(nkdr). Therefore, the complexity of O-K-means can be expressed as O(nkdrα), where α denotes the quotient of the number of O-K-means iterations over the number of K-means iterations. From our experiments with large synthetic and real instances we obtained an average of α = 0.0389.

Our codes were tested on sets of real and synthetic instances. In each instance the initial centroids were the same for both algorithms. In what follows we denote by t, to, z*, and zo*, respectively, the time needed by K-means, the time needed by O-K-means, the best objective function value found by K-means, and the best objective function value obtained by O-K-means. Further, to measure time saving and quality loss we use, respectively, the formulas 100(t − to)/t and 100(zo*/z* − 1), both expressed as percentages.

For each instance we chose the product nkd as an indicator of its complexity level. The computer experiments described in sections Synthetic instances and Real instances used different U values according to the Pareto optimality principle applied to each instance. Consideration of different thresholds for the same instance is dealt with in Section Using other threshold values.

Synthetic instances

Fourteen synthetic instances were produced for different values of n and of the number of dimensions d, as shown in Table 4.

Table 4. Synthetic instances general data.

The n values are given in millions.

https://doi.org/10.1371/journal.pone.0201874.t004

All points in each synthetic instance were randomly generated: in instances 1 to 13 they follow a uniform distribution with coordinates in (0, 1)d, whereas in instance 14 a normal distribution with mean 40 and standard deviation 20 was used. Distinct values of the number of clusters k were considered. Table 5 shows the results of one execution of each algorithm; the computing times t and to are reported in hours, rounded to hundredths. Notice that the instances are sorted according to the product nkd.
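Instances of this kind can be reproduced along the following lines (a sketch; the random-number routines shown are our choice and not necessarily those used to build Table 4).

#include <stdlib.h>
#include <math.h>

/* Uniform coordinate in (0, 1), as used for instances 1 to 13. */
static double unif01(void) {
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
}

/* Normal deviate with given mean and standard deviation (Box-Muller),
   as used for instance 14 (mean 40, standard deviation 20). */
static double normal(double mean, double sd) {
    const double two_pi = 6.283185307179586;
    double u1 = unif01(), u2 = unif01();
    return mean + sd * sqrt(-2.0 * log(u1)) * cos(two_pi * u2);
}

/* Fill x (n*d, row-major) with a synthetic instance. */
static void generate_instance(double *x, long n, int d, int use_normal) {
    for (long i = 0; i < n; ++i)
        for (int t = 0; t < d; ++t)
            x[i * d + t] = use_normal ? normal(40.0, 20.0) : unif01();
}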

Table 5. Results of K-means and O-K-means on synthetic instances.

Computer times t and to are given in hours, the product nkd in millions.

https://doi.org/10.1371/journal.pone.0201874.t005

When compared to K-means, algorithm O-K-means exhibits good performance on these instances: on average, solutions are obtained in less than four percent of the computing time, with a quality decrease of only about 0.5%. It is worth mentioning the relatively low standard deviation (1.17) as compared to the average (96.10).

Real instances

We carried out computer experiments with five real instances taken from the University of California Irvine repository, UCI [42]; see Table 6. Note the relatively small size of instances 1–3 when compared to the large instances 4 and 5.

For distinct k values, each real instance was solved with K-means and O-K-means using the same initial centroids; the results are shown in Table 7. In general, as nkd increases, the efficiency of O-K-means increases —note the average time reduction for the large instances (96.12%) versus the small ones (86.47%), with similar quality loss—. As in the case of the synthetic instances, low standard deviations are observed.

Table 7. Comparative results for real instances.

The product nkd is given in millions, and z*, zo* have been rounded to integers. Computing times t, to are in seconds for the small instances, and in hours for the large ones.

https://doi.org/10.1371/journal.pone.0201874.t007

Using other threshold values

In sections Synthetic instances and Real instances we showed results for thresholds determined following the Pareto optimality principle; however, other threshold values could as well be selected, according to the required solution quality and the available time.

Thus, each instance of Table 4 was solved with K-means and O-K-means using the same initial centroids and the threshold values U shown in Table 8, which reports the averaged reduction of time and of solution quality, respectively. As expected, each row decreases monotonically from left to right, namely, increasing U leads to a greater reduction of both computing time and solution quality. Thus, there is a trade-off between these two concepts.

Table 8. Average reduction of computing time and solution quality obtained by O-K-means for distinct threshold values U.

https://doi.org/10.1371/journal.pone.0201874.t008

We find remarkable the closeness between the reductions obtained with U = 1 and those arising from the application of the Pareto principle (see Table 5).

Combining our convergence criterion with other criteria

To assess the benefit of combining our convergence criterion with other algorithms for speeding up k-means, we considered two efficient classification strategies: one proposed by Fahim et al. [25], call it F, and one due to Pérez et al. [43], call it P. With this aim we chose ten of the synthetic instances described in Table 4, to be solved for distinct k values. For each instance the developed codes used the same initial centroids.

Table 9 summarizes our results, in terms of time and quality reduction relative to K-means, where OK, FOK, and POK denote, respectively, O-K-means, the Fahim et al. algorithm combined with O-K-means, and the Pérez et al. algorithm combined with O-K-means.

Table 9. Tests on selected synthetic instances.

Time and quality reduction arising from combining F and P algorithms with O-K-means.

https://doi.org/10.1371/journal.pone.0201874.t009

The results of our computational experiments suggest that it can be advantageous to combine O-K-means with other strategies, in terms of both computing time reduction and solution quality.

Also, to assess the algorithms OK, FOK, and POK when using as initial centroids those generated by the algorithm K++ proposed by Arthur & Vassilvitskii [20], we selected eight instances; see Table 10. Instance A is the synthetic instance shown in Fig 4; instances B and C correspond to the real instance Letters described in Table 6; instance D was produced by randomly selecting 30 000 objects from instance 10 of Table 4; to produce instances E and F, 40 000 points were randomly generated (uniform distribution) in a bounded space; finally, instances G and H correspond to instance 10 of Table 4. Table 10 shows the time and quality reduction, relative to the algorithm K++, obtained by algorithms OK, FOK, and POK when using as initial centroids those generated by K++. These results support our belief in the usefulness of our proposal.

Table 10. Results of OK, FOK and POK when using as initial centroids those generated by K++.

https://doi.org/10.1371/journal.pone.0201874.t010

Data clustered around specific centers

To assess the performance of our stopping criterion when the data is heavily clustered around a few specific centers, we constructed a 2-dimensional synthetic instance (d = 2) whose n = 11 000 points form compact groups; see Fig 4. Each group was formed by arbitrarily selecting a center and a standard deviation, and its points were randomly generated with a normal distribution around that center.

This instance was solved for k = 5, 10, 15, 20. Table 11 shows the average results of 30 runs of K-means and O-K-means with threshold U = 3.14, as well as the average reduction of time and quality achieved by the latter.

Table 11. Results of O-K-means for a 2-dimensional, non uniform synthetic instance.

https://doi.org/10.1371/journal.pone.0201874.t011

Note that the best results are obtained as k increases, with a time reduction of up to 86.44%, and an average quality reduction as low as 2.86%.

Clustering larger data sets

We tested the performance of O-K-means on four still larger instances, selected from the repository UCI [42]. Table 12 shows their relevant data as well as the average computational results obtained from 30 runs of O-K-means, with the same threshold (3.20) used for the real instances of Table 6. These results confirm the suitability of our proposal in Big Data realms.

Table 12. Results of O-K-means for still larger instances.

https://doi.org/10.1371/journal.pone.0201874.t012

Conclusion

We have presented a sound criterion to balance effort and benefit of k-means clustering algorithms in Big Data realms. Its advantages were demonstrated by applying it in the convergence step of one of the most widely used procedures of the k-means family, here called K-means. Guided by the Pareto principle, our criterion consists in stopping the iterations as soon as the number of objects that change cluster at any iteration is lower than a prescribed threshold. The novelty of our methodology comes from two facts. First, in regard to the stopping criterion, the authors are not aware of any proposal directly related to the number of objects changing group at every iteration. Second, to date the Pareto principle had not been used to determine a threshold leading to an adequate compromise between the quality of a solution and the time needed to obtain it.

From intensive computer experimentation on synthetic and real instances we found that, in general, our criterion significantly reduces the number of iterations with relatively small decrement in the quality of the yielded solutions. Furthermore, the best results tend to correspond to the largest instances considered, namely, those where the product nkd is high. Thus, this behavior is an indicator of the usefulness of applying the Pareto principle in the convergence step when dealing with large k-means instances.

It is well known that some strategies to improve the performance of k-means are sensitive to the number of dimensions. This is not our case, since our proposal aims at reducing the number of iterations performed, while the cost per iteration, proportional to nkd, remains unchanged.

An important characteristic of our stopping strategy is that its implementation requires negligible additional memory resources; in this regard, it appears to have an advantage over other proposed criteria.

Last, but not least, our proposed convergence criterion is not incompatible with any improvement related to the initialization or classification steps of k-means, as we have shown in relation to the procedure that generates the initial centroids of the K++ algorithm. As future work we find it appealing to deepen these investigations in the realm of parallel and distributed computing paradigms. It is foreseeable that our stopping criterion can be successfully used under such paradigms since it only requires the number of objects changing group at each iteration.

References

  1. Kambatla K, Kollias G, Kumar V, Grama A. Trends in big data analytics. Journal of Parallel and Distributed Computing. 2014; 74(7):2561–2573.
  2. Laney D. 3D Data management: controlling data volume, velocity and variety. 2001 Feb 6. [cited 24 Aug 2017]. In: Gartner Blog Network [Internet]. Available from: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.
  3. Raykov YP, Boukouvalas A, Baig F, Little MA. What to do when k-means clustering fails: A simple yet principled alternative algorithm. PLoS One. 2016; 11(9):e0162259. pmid:27669525
  4. Chun-Wei T, Chin-Feng L, Han-Chieh C, Vasilakos AV. Big data analytics: a survey. Journal of Big Data. 2015; 2(1):21.
  5. Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognition Letters. 2010; 31(8):651–666.
  6. Fisher DH. Knowledge acquisition via incremental conceptual clustering. Machine Learning. 1987; 2(2):139–172.
  7. Fayyad UM, Piatetsky-Shapiro G, Smyth P. Knowledge discovery and data mining: Towards a unifying framework. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. AAAI Press; 1996. p. 82–88. Available from: https://www.aaai.org/Papers/KDD/1996/KDD96-014.pdf.
  8. Kantardzic M. Data Mining: Concepts, Models, Methods, and Algorithms. 2nd ed. John Wiley & Sons; 2011.
  9. Duda RO, Hart PE, Stork DG. Pattern Classification. 2nd ed. John Wiley & Sons; 2012.
  10. Selim SZ, Ismail MA. K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1984; (1):81–87. pmid:21869168
  11. Steinhaus H. Sur la division des corps matériels en parties. Bulletin de l’Académie Polonaise des Sciences. 1956; 4(12):801–804.
  12. Lloyd S. Least squares quantization in PCM. IEEE Transactions on Information Theory. 1982; 28(2):129–137.
  13. Jancey RC. Multidimensional group analysis. Australian Journal of Botany. 1966; 14(1):127–130.
  14. Aloise D, Deshpande A, Hansen P, Popat P. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning. 2009; 75(2):245–248.
  15. Mahajan M, Nimbhorkar P, Varadarajan K. The planar k-means problem is NP-hard. Theoretical Computer Science. 2012; 442:13–21.
  16. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press; 1967. Vol. 1, No. 14, p. 281–297.
  17. Hans-Hermann B. Origins and extensions of the k-means algorithm in cluster analysis. Journal Electronique d’Histoire des Probabilités et de la Statistique, Electronic Journal for History of Probability and Statistics. 2008; 4(2):1–7. Available from: http://www.emis.ams.org/journals/JEHPS/Decembre2008/Bock.pdf. Cited 24 Aug 2017.
  18. Elkan C. Using the triangle inequality to accelerate k-means. In: Proceedings of the 20th International Conference on Machine Learning. AAAI Press; 2003. p. 147–153.
  19. Wolpert DH, Macready WG. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation. 1997; 1(1):67–82.
  20. Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics; 2007. p. 1027–1035.
  21. Zhanguo X, Shiyu C, Wentao Z. An improved semi-supervised clustering algorithm based on initial center points. Journal of Convergence Information Technology. 2012; 7(5):317–324.
  22. Salman R, Kecman V, Li Q, Strack R, Test E. Two-stage clustering with k-means algorithm. Recent Trends in Wireless and Mobile Networks. Communications in Computer and Information Science. 2011; 162:110–122.
  23. El Agha M, Ashour WM. Efficient and fast initialization algorithm for k-means clustering. International Journal of Intelligent Systems and Applications. 2012; 4(1):21–31.
  24. Tzortzis G, Likas A. The MinMax k-Means clustering algorithm. Pattern Recognition. 2014; 47(7):2505–2516.
  25. Fahim AM, Salem AM, Torkey FA, Ramadan MA. An efficient enhanced k-means clustering algorithm. Journal of Zhejiang University-Science A. 2006; 7(10):1626–1633.
  26. Hamerly G. Making k-means even faster. In: Proceedings of the 2010 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics; 2010. p. 130–140.
  27. Pérez J, Pazos R, Hidalgo M, Almanza N, Díaz-Parra O, Santaolaya R, Caballero V. An improvement to the k-means algorithm oriented to big data. In: AIP Conference Proceedings. AIP Publishing; 2015. Vol. 1648, No. 1, p. 820002.
  28. Pérez J, Pazos R, Olivares V, Hidalgo M, Ruíz J, Martínez A, Almanza N, González M. Optimization of the K-means algorithm for the solution of high dimensional instances. In: AIP Conference Proceedings. AIP Publishing; 2016. Vol. 1738, No. 1, p. 310002.
  29. Lai JZ, Tsung-Jen H, Yi-Ching L. A fast k-means clustering algorithm using cluster center displacement. Pattern Recognition. 2009; 42(11):2551–2556.
  30. Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002; 24(7):881–892.
  31. Ming-Chao C, Chun-Wei T, Chu-Sing Y. A time-efficient pattern reduction algorithm for k-means clustering. Information Sciences. 2011; 181(4):716–731.
  32. Pérez J, Pazos R, Cruz L, Reyes G, Basave R, Fraire H. Improving the efficiency and efficacy of the k-means clustering algorithm through a new convergence condition. In: International Conference on Computational Science and Its Applications. Lecture Notes in Computer Science. Springer; 2007. Vol. 4707, p. 674–682.
  33. Lam YK, Tsang PW. eXploratory K-Means: a new simple and efficient algorithm for gene clustering. Applied Soft Computing. 2012; 12(3):1149–1157.
  34. Mexicano A, Rodríguez R, Cervantes S, Montes P, Jiménez M, Almanza N, Abrego A. The Early Stop Heuristic: a new convergence criterion for k-means. In: AIP Conference Proceedings. AIP Publishing; 2016. Vol. 1738, No. 1, p. 310003.
  35. Hamerly G, Elkan C. Learning the K in K-means. In: Proceedings of the Advances in Neural Information Processing Systems. 2003; p. 281–288.
  36. Pelleg D, Moore AW. X-means: extending K-means with efficient estimation of the number of clusters. In: Proceedings of ICML 2000. 2000; Vol. 1, p. 727–734.
  37. Kulis B, Jordan MI. Revisiting k-means: New algorithms via Bayesian nonparametrics. arXiv preprint arXiv:1111.0352. 2011.
  38. Broderick T, Kulis B, Jordan M. MAD-Bayes: MAP-based asymptotic derivations from Bayes. In: Proceedings of the International Conference on Machine Learning. 2013; p. 226–234.
  39. Raykov YP, Boukouvalas A, Little MA. Simple approximate MAP inference for Dirichlet processes mixtures. Electronic Journal of Statistics. 2016; 10(2):3548–3578.
  40. Konak A, Coit DW, Smith AE. Multi-objective optimization using genetic algorithms: a tutorial. Reliability Engineering & System Safety. 2006; 91(9):992–1007.
  41. McGeoch CC. A Guide to Experimental Algorithmics. Cambridge University Press; 2012.
  42. Lichman M. UCI Machine Learning Repository; 2013 [cited 24 Aug 2017]. Database: Repository of Machine Learning Databases [Internet]. Available from: http://archive.ics.uci.edu/ml.
  43. Pérez J, Pires CE, Balby L, Mexicano A, Hidalgo MA. Early classification: a new heuristic to improve the classification step of k-means. Journal of Informatics and Data Management. 2013; 4(2):94–103.