Reducing the Time Requirement of k-Means Algorithm

Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which have large datasets with large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space Rd and an integer k. The problem is to determine a set of k points in Rd, called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and the k-means clustering. We provided the correctness proof for this algorithm. Results obtained from testing the algorithm on three biological data and six non-biological data (three of these data are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm clusters against the clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARIHA). We found that when k is close to d, the quality is good (ARIHA>0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARIHA>0.9). In this paper, emphases are on the reduction of the time requirement of the k-means algorithm and its application to microarray data due to the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological data.


Introduction
Clustering is the unsupervised grouping of objects into classes without any a priori knowledge of the datasets to be analyzed. A clustering algorithm is either hierarchical or partitional. Hierarchical algorithms create successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. For the hierarchical variants, we have the agglomerative and divisive clustering. However, in partitional clustering, we have QT (Quality Threshold) clustering [1], Self Organising Map (SOM) [2] and Standard k-means [3,4], which have been evolving in recent years for high dimensional data analysis [5,6,7,8,9,10,11]. An overview on clustering algorithms for expression data can be found in Yona et al. [12]. Examples of variants of k-means algorithms that attempted to enhance the traditional k-means algorithm via improved initial centre can be found in Deelers and Auwatanamongkol [13], Nazeer and Sabastian [14] and Yedla et al [15]. Lastly, Kumar et al. [16] in a recent work, enhanced the k-means clustering algorithm using red black tree and min-heap.
The traditional k-means algorithm requires in expectation O(nkl) run time where l is the number of k-means iterations. This time was said to be reduced by Fahim et al. [17] to O(nk). 1=i to estimate the total number of data points for each iteration that changed their clusters during the number of kmeans iterations, l, thereby deducing that the cost of using their enhanced k-means algorithms is approximately O(nk). The kmeans algorithm described in Nazeer and Sebastian [14] also runs in O(nk) while the one in Yedla et al. [15] runs in O(nlogn). For efficient and effective analysis of microarray data, we developed a novel Pearson correlation-based Metric Matrices kmeans (MMk-means). We showed that the algorithm is correct. Experimental results show that it has a better run-time than the Traditional k-means and other variants of k-means algorithm like Overlapped and Enhanced k-means algorithms developed in [17]. Furthermore, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological data.

Methods
Notation DD:DD denotes the Euclidean norm of a vector. The trace of a matrix X, i.e., the sum of its diagonal elements, is denoted as trace (A). The Frobenius norm of a matrix DDX DD F~ffi ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi trace(X T X ) p . I n denotes identity matrix of order n.

Basic Definitions
Metric Matrix (MM). A kxk matrix encapsulates the correlation coefficient (r) between the centroids of the previous (pm j ) and current (m j ) iterations respectively, where 0,j#k.
Ding-He Interval. It is an interval, obtained by Ding and He [18,19], used in our new k-means algorithm to determine when a cluster must remain without further clustering or be subjected to further clustering.
MMk-means Iterations (MMI). The number of k-means iterations required before the Ding-He interval is applied in our new k-means algorithm. diff j . An absolute value obtained from the subtraction of the current iteration eigenvalues (e j ) from the previous iteration eigenvalues (pe j ) which serves as an indicator to terminate clustering for each cluster. Each eigenvalues set is obtained from the corresponding Metric Matrix.
Some Set Notations set[j]: 1#j#k is the set referring to cluster j. add[i]: Is a function to add data point into a cluster, where i is the index of the data.
set [j].n j : Is the size of cluster j, that is, number of data points in a cluster j.

Algorithm Design
Our MMk-means algorithm runs like the Traditional k-means algorithm except that it is equipped with a mechanism to determine when a cluster is stable, that is, its membership data points will always remain in the same cluster in each subsequent iteration. This is an improvement over the Overlapped and Enhanced variants of k-means algorithms introduced by Fahim et al. [17]. They equipped their algorithms with the ability to detect the stability of a data point but MMk-means is equipped with the mechanism to detect the stability of a cluster representing a whole bunch of data points.
We use the recently established relationship between principal component analysis and k-means clustering to design a mechanism for determining when the whole data points in a cluster are stable. We create a covariance matrix (MM of r's), a result of computing the Pearson product moment correlation coefficient between the k centroids of the previous and current iterations and then deduce k previous and current iterations eigenvalues. The difference of these eigenvalues for each cluster is computed and checked to see if it satisfies (that is, lies within) the Ding-He interval [18,19]. If it does, the corresponding cluster is considered stable and there is no need to compute its data point distances with the current centroid of the cluster or the rest k-1 centroids.
The mechanism explained above is prescribed in the subprocedure Compute_MM of Figure 1 and this function is being executed when the current total iterations number is greater than MMI-1. For any n dataset points, given the total number of kmeans iteration l required, we can actually set MMI = l/2, but note that l is unknown, until a traditional k-means algorithm is executed. We know that for a given clustering procedure, k-means algorithm aims to minimize the first Mean Squared Error (MSE 1 ), through a number of iterations, l, distributing all data points into clusters, to arrive at an optimal (minimized) Mean Squared Error (MSE l ). Therefore, we estimate the required MMk-means Iteration (MMI) to be bounded by 0,MMI#MSE 1 .k/MSE l . Note that for a given set of n data points, we can form the n-by-d matrix X = [x 1 ,…, x n ] and the first iteration of a traditional k-means algorithm can be used to determine MSE 1 in O(nk) time. Ding and He [18,19] provided tightly upper and lower bounds for the optimal MSE l . From these, we can compute MSE l in O(n) time. Empirical testing followed by personal communication with Ding, C. shows that the deduced technique does not hold for large k and data with high dimensional d. So we still cannot estimate MSE l for all k and d as we desire in our new k-means algorithm.
In a previous work, that led to the results in [18,19], Zha et al. [20] obtained that: Theorem 1. The optimal Mean Squared Error (MSE l ) is bounded from below by where s j (X ) is the j largest singular value of X and A is an arbitrary orthonormal matrix. Ding and He [18,19] indicated that the lower bound in theorem 1 is not asymptotically tight. They however provided (derivatively) empirical evidence in pages 35 and 500 respectively that as the number of cluster increases (that is k), the lower bound in theorem 1 becomes less tight, but still around 1.5% of the optimal values. This can also be shown when the dimension of the data increases and when both increase, but they do not provide logical argument to prove this. We provide this proof in the following observation.
An important result, that we shall see soon, how it relates centroid of each partition X j to an eigenvalue, is given by Fan [21] and stated below: Theorem 2. Let H be a symmetric matrix with eigenvalues, l 1 § l 2 §::: §l k and the corresponding eigenvector U = [u 1 ,…,u d ]. Then l 1 zl 2 z:::zl k~m ax A T A~I k trace(A T HA): Moreover, the optimal A * is given by A Ã~½u 1 , . . . ,u k Q with Q an arbitrary orthogonal matrix.
Observation 1. From theorem 1 above, we observed that although the equation does not correspondingly estimate MSE l for large k and high d, it possesses a better analytical distribution that mimic the series needed to estimate MSE l for large k and high d.
Proof. From theorem 1, we know that using theorem 2 above and the fact that the sum of the eigenvalues of any (square) matrix is equal to the trace of the matrix. We can see from equation (2) that since l 1 §l 2 §::: §l d from theorem 2, for large k (though still %d), very few largest eigenvalues are the ones subtracted. For larger k and higher d, the estimation in equation (2) is kept balance by the addition of eigenvalues at the tail and the subtraction of very few (in normal practical setting) at the rear of the series. This way, the lower bound in theorem 1 though may be less tight is still within the excellent reach of the optimal value. % Using this observation, in the following, we are able to estimate more accurately a multiplier, we called m, that is useful in the prediction of MSE l from MSE 1 and consequently determine MMI.
Observation 2. From equation (1), we can estimate where s 2 j (X ) is the j largest singular value of x, n j is the size of the data vectors in cluster j and m j is the mean vector of these data vectors. And consequently find MSE l , mMSE 1 . We encapsulate the computation of the multiplier m in an implicit subprocedure Compute_multiplier in line 1 of Figure 2.
Proof. Recalling what we stated earlier on in the body of the paper, it is known that for a given clustering procedure, k-means algorithm aims to minimize the first Mean Squared Error (MSE 1 ), through a number of iterations, l, distributing all data points into clusters, to arrive at an optimal (minimized) Mean Squared Error (MSE l ).
Zha et al. [20] in their attempt to prove theorem 1 (stated in the paper main body) showed that theoretically where A is an n x k orthonormal matrix given by e is a vector of appropriate dimension with all elements equal to one and n j , 1ƒjƒk are number of data points in each cluster. They also showed that So MSE 1 in equation (4) becomes Minimizing equation (4) and using theorem 2 and equation (6) above, Zha et al. [20] showed that MSE 1 § P d i~kz1 s 2 j X ð Þ as given in theorem 1 (of the main body of this paper). So we can estimate the factor m that relates MSE l and MSE l as follows: And therefore from observation 1, for large k and d, we can estimate MSE l from MSE 1 and consequently have MSE l *MSE 1 : % Observation 3. For each iteration, given MSE l *MSE 1 , we can determine MMI, by estimating the distance of the current iteration MSE away from the final and optimal MSE, MSE l and the first instance when(1{ mMSE 1 =current iteration MSE)ve, where ew0, MMI is equal to the current total iterations number. This is also implemented in lines 35-40 of Figure 2.
Proof. The most widely used convergence criteria for the kmeans algorithm is minimizing the MSE. Selim and Ismail [22] provided a rigorous proof of the finite convergence of the k-means type algorithm for any metric. We know that every convergent sequence (with limit s, say) is a Cauchy sequence, since, given any real number ew0, beyond some fixed point, every term of sequence is within distance e=2of s, so any two terms of the sequence are within distance e of each other. By definition, a Cauchy sequence is a sequence, a n such that for any ew0there exists an integer K$1 such that Da n {a m Dve for all n and m with mwn §K.
MMI that we seek in this observation is K, since DmMSE 1 {current iteration MSEDve following the Cauchy sequence definition. This can be rewritten as 1{mMSE 1 = current iteration MSEve. This completes the proof. % Using the devices enumerated above, our new k-means algorithm is presented in Figures 1 and 2. We now prove the correctness of this algorithm.

MMk-means algorithm correctness proof
To prove the correctness of our new and novel k-means algorithm, we will need the following definitions from Kumar et al. [23] and Kanungo et al. [24]. Definition 1. Given a set of k points K, which we also denote as centers, define the k-means cost of X, set of n points in d-dimensional space R d , with respect to K, D(X ,K)as where d(x, K) denotes the distance between x and the closest point to x in K. Let the centroids at each k-means iteration be m i 1 ,m i 2 ,:::,m i k , 1#i#l where l is the total number of k-means iterations. Now, we will also need the following lemma. The following lemma shows the mathematical correctness of our key sub program, Compu- Figure 2. Pseudocode of our main program for MMk-means. It runs similar to the traditional k-means except that it is equipped with a metric matrices based mechanism to determine when a cluster is stable (that is, its members will not move from this cluster in subsequent iteration). This mechanism is implemented in sub-procedure Compute_MM of Figure 1. We use the theory developed by Zha et al. [20] from the singular values of the matrix X of the input data points to determine when it is appropriate to execute Compute_MM during the k-means iterations. This is implemented in lines 34-40. doi:10.1371/journal.pone.0049946.g002 te_MM of Figure 1, which encapsulates the mechanism we used to identify stable partitions in our new MMk-means algorithm.
Lemma 1. For a partition X j at t{iterationƒl, let diff jD pe j {e j D, if Ding{HeH 1 vdiff j vDing{HeL 0 then D(X t j ,m t j )~D(X i j ,m i j ) for tviƒl: Proof. Note that the Metric Matrix, MM in sub-procedure Compute_MM of Figure 1, which is the key mechanism we used to identify stable partitions, is the k x k correlation coefficient matrix generated between the centroids of the previous and current iterations of the k-means algorithm. Note further that this matrix is a covariance matrix [25].
Note that the minimization of equation (4) is equivalent to max~trace(A T X T XA)DA of the form 5 ð Þ È É : Equations (4) and (6) relate minimized MSE to maximizing the sum of the centriods. Note also that equation (6) and theorem 2 above relate centriods of each partition to an eigenvalue. Iteratively, MM in Compute_MM relates centriods of previous and current iterations respectively and therefore from equation (6) and theorem 2, its eigenvalues characterize the iterative minimized MSE of each partition and diff j is an estimate of how close is the minimized MSE for a partition (in term of its centroid) to the optimal one. Since Ding and He [18,19] had shown an upper and lower bound to expect this, then if Ding{He H 1 vdiff j vDing{ He L 0 , the centroid of the corresponding partition X j virtually does not change in subsequent iterations. This translates to D(X t j ,m t j )~D(X i j ,m i j ) for tviƒl from definition 2. % The following is now the correctness proof for the MMk-means algorithm.
Theorem 3. Given a point set X, MMk-means returns a k-means solution on input X.
Proof. We should note that our algorithm maintains the following loop invariant: The set X is a subset of X 1 |X 2 |:::|X k .
It is straight forward to note that for i~0, the invariant holds. Now, let assume that the invariant holds for some fixed i~p, its remains to show that the invariant holds for i~pz1 as well, then we are done.
Based on our assumption, for i~p, For i~pz1, we have to show for every j that it is either or Note that for a partition X j , if D(X pz1 j ,m pz1 j )~D(X p j ,m p j ) from (1)   ). This completes the proof for (9) above.
It now remains to show that for each iteration i in our MMkmeans algorithm, the input set X is a subset of X i 1 |X i 2 |:::|X i k . This is actually straight forward. Note that for an iteration i, x[X , there exist only one closest m i j centriod to x from M for a particular partition X i j , such that x[X i j . This shows that all x[X belongs to a partition X i j for 1ƒjƒk and therefore X is a subset of X 1 |X 2 |:::|X k . %

Results and Discussion
Using C++, we implemented the three variants of k-means algorithms, namely, the Traditional, Overlapped and Enhanced kmeans following Fahim et al. [17] design. We also implemented the fourth one, our MMk-means algorithm using C++ and MATLAB. See the additional file S1 of this paper for the source codes of these programs. Ding and He [18,19] experimentally determined an interval: 0.5-1.5%, which indicates when a cluster is optimally equal to the expected ones. We used this in our Compute_MM of Figure 1, where we set L 0 = 0.5% and H 1 = 1.5%. Finally, Observation 3 is implemented with experimentally determined e = 0.007.
We tested the algorithms using normalized microarray expression data at varying timepoints for P. falciparum microarray experiment data from [26] and [27] as depicted in Table 1 The plots of minimized Mean Standard Error (MSE) versus k values help to measure clusters quality (that is, its effectiveness) and run time (in sec) versus k help to measure each algorithm's efficiency empirically. For the malaria microarray data from Bozdech et al. [26], these plots are shown in Figures 3 and 4. These results are similar to what was obtained from the other malaria microarray data as indicated in Table 1. It was observed that eigenvalues of the Metric Matrix, MM decrease along the diagonal matrix from top to bottom for all iterations except the last one and change interestingly at the last iteration by increasing from top to bottom. It should be noted that the stability condition for cluster as measured by diff j of line 7 in Figure 1 does not apply appropriately to negative gene expression values as we have in Le Roch et al. [27] data. The theoretical reason is given in [18,19]. We observed that, nevertheless, our new algorithm compared excellently to the Traditional k-means.
We found out that the algorithms of Fahim et al. [17] were slower than the Traditional k-means contrary to the claim of the authors. This observation made us to take a critical look at the design of the two algorithms of Fahim et al. [17] theoretically. Fahim et al. designed two new variants of k-means algorithms noting that if the distance between a data point and the current centroid (new center) of the cluster that it was assigned to in the previous iteration is less than or equal to the distance of the data point to its previous centroid (old centre), then the point remains in that cluster and there is no need to compute its distance to the other k-1 centers. To do this, they introduced two arrays, namely Clusterid and Pointdis to keep track of the centroid to which each Table 2. Hubert-Arabie Adjusted Rand Index (ARI HA ) Cluster Quality Computation Result for Biological and Non-biological data. For each data, Bozdech et al. 3D7 and HB3 strains [26] and Le Roch et al. [27], we used two values of k to demonstrate the effect of changing k values on the clusters quality of the clustering algorithms. We considered the structure of the Traditional k-means as the known structure and compare the clusters of MM, Enhanced and Overlapped k-means respectively with it. In a separate (last) column, we also compare the structure of the Enhanced k-means with that of Overlapped k-means. doi:10.1371/journal.pone.0049946.t002 point is assigned to and the distance between this point and its centroid. Fahim et al. [17], used n P l i~1 1=i to estimate the total number of data points for each iteration that moved from their clusters during the number of k-means iterations, l and showed that the cost of using an enhanced k-means algorithm is approximately O(nk). We observed that the total number of data points for each iteration that moved from their clusters during the k-means iterations is not strictly monotonically decreasing and thus their Overlapped and Enhanced k-means algorithm eventually still costs O(nkl) run time in expectation. From the foregone, the two algorithms have the same asymptotic run time as the Traditional k-means algorithm but in practise (from our experimental experience) slower than the traditional one. Whenever k (number of clusters) ,d (dimension or timepoints), effective clustering is achieved for the four algorithms and our MMk-means has the best empirical runtime. Overlapped and Enhanced k-means are the slowest in all cases. Empty clusters are created by all the algorithms if k.d as the clustering becomes irregular, similar to results for 15.k.25 by Le Roch et al. [27].
To further ascertain the quality of our new algorithm on the three microarray data of Table 1, we assessed the quality of its clusters and that of Enhanced and Overlapped k-means respectively, against the clusters from the Traditional k-means, using the For each data, Abalone, Wind and Letter as described in Table 4 below, we used two values of k to demonstrate the effect of changing k values on the clusters quality of the clustering algorithms. We considered the structure of the Traditional k-means as the known structure and compare the clusters of MM, Enhanced and Overlapped kmeans respectively with it. In a separate (last) column, we also compare the structure of the Enhanced k-means with that of Overlapped k-means. doi:10.1371/journal.pone.0049946.t003 Table 4. Non-Biological data used for testing our algorithm and the other three variants of k-means algorithm.  Hubert-Arabie Adjusted Rand index (ARI HA ) [28]. The result of this assessment is given in Table 2 for biological data and in Table 3 for non-biological data. Considering each biological data, Bozdech et al. 3D7 and HB3 strains [26] and Le Roch et al. [27], we used two values of k to demonstrate the effect of changing k values on the clusters quality of the clustering algorithms. In a separate column, we also compare the structure of the Enhanced k-means with that of Overlapped k-means. We found out that Enhanced and Overlapped k-means respectively produced similar clusters and their structures are similar to that of the Traditional k-means. For MMk-means, this is also the case and we found categorically that when k is close to d, the quality of its clusters is good (ARI HA .0.8) and when k is not close to d, the quality is excellent (ARI HA .0.9). In Osamor et al. [29], we demonstrated the biological characteristics of our new algorithm against other well known kmeans clustering algorithms. Interestingly, from this application, we discovered a new functional group for some set of genes of P. falciparum.
To test the behavior of our new algorithm on non-biological data, we used first, the data of Fahim et al. [17]. Details on these datasets are given in Table 4. The result of this exercise is given in Table 3. The quality of MMk-means clusters is similar to what we observed from that of the biological data. Next, we tested MMkmeans algorithm on three large simulated datasets of 50 dimensional size and with 10000, 30000, 50000 items respectively. The result is shown in Table 5. MMk-means is empirically efficient than all other three algorithms and the quality of MMk-means clusters is again similar to what we observed from that of the biological data.

Conclusion
To achieve efficient but also effective analysis of microarray data, we developed a novel Pearson correlation-based Metric Matrices k-means (MMk-means). We provided the correctness proof of this algorithm. Experimental results show that it has a better run-time than the Traditional k-means and other variants of k-means algorithm like Overlapped and Enhanced k-means algorithms developed in [17].
It must be pointed out that the results (extended theories and experimental) of this work provide additional toolkits to analyze successfully high dimensional datasets, which of recent, are of incredible growth [10,11,30]. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on three moderate size and three heavy non-biological data.

Supporting Information
File S1 Traditional, Overlapped, Enhanced and MMkmeans algorithms C++ codes. Kmeansprograms.zip contains Traditionalkmeans.cpp, Overlappedkmeans.cpp, Enhancedkmeans.cpp and Mmkmeansmmi.cpp. These programs are implemented using Borland C++ version 5.0 and MATLAB version 7.0. The steps on how to run the programs are stated at the beginning of the program files in the zip. (ZIP) File S2 The three microarray experimental data used in the testing of our algorithm and the other three variants of k-means algorithm. The files in this Data.zip are as follows. These files can best be viewed using MS Excel or Notepad.