Reducing the Time Requirement of k-Means Algorithm

doi:10.1371/journal.pone.0049946

Figure 1.

Pseudocode of our Compute_MM Sub-program for MMk-means.

We build a covariance matrix by computing the Pearson product-moment correlation coefficient between the k centroids of the previous and current iterations, and from it derive k eigenvalues for each of the two iterations. For each cluster, the difference between its previous and current eigenvalues is then computed and checked against the Ding-He interval.
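
The eigenvalue comparison described in this caption can be sketched roughly as follows. This is not the authors' Compute_MM: the function names, the choice to build one correlation matrix per iteration, the pairing of sorted eigenvalues with clusters, and the `lower`/`upper` bounds standing in for the Ding-He interval are all assumptions made for illustration.

```python
import numpy as np

def eigenvalue_differences(prev_centroids, curr_centroids):
    """Absolute eigenvalue differences between two sets of k centroids.

    Each argument is a (k, d) array with one centroid per row.
    Assumption: one Pearson correlation matrix is built per iteration.
    """
    # np.corrcoef treats each row as a variable, so the result is a
    # (k, k) matrix of Pearson correlations between centroids.
    prev_corr = np.corrcoef(prev_centroids)
    curr_corr = np.corrcoef(curr_centroids)
    # eigvalsh is meant for symmetric matrices and returns the
    # eigenvalues in ascending order.
    prev_eig = np.linalg.eigvalsh(prev_corr)
    curr_eig = np.linalg.eigvalsh(curr_corr)
    # Assumption: the i-th sorted eigenvalue is paired across iterations.
    return np.abs(curr_eig - prev_eig)

def within_interval(diffs, lower, upper):
    """Flag differences inside [lower, upper] (hypothetical bounds)."""
    diffs = np.asarray(diffs, dtype=float)
    return (diffs >= lower) & (diffs <= upper)
```

If the centroids have not moved between iterations, every difference is zero, so any interval that contains zero would flag all clusters.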

Figure 2.

Pseudocode of our main program for MMk-means.

It runs like traditional k-means, except that it is equipped with a mechanism based on metric matrices for determining when a cluster is stable (that is, when its members will not leave the cluster in subsequent iterations). This mechanism is implemented in the sub-procedure Compute_MM of Figure 1. We use the theory developed by Zha et al. [20], based on the singular values of the matrix X of input data points, to determine when it is appropriate to execute Compute_MM during the k-means iterations. This is implemented in lines 34–40.
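
The general idea of skipping work for stable clusters can be illustrated with the minimal sketch below. It is a plain Lloyd-style loop, not the authors' main program: `kmeans_with_frozen_clusters` and the `is_stable` callback are hypothetical names, the callback merely stands in for a stability test such as Compute_MM, and the singular-value criterion for deciding when to run that test is not modelled.

```python
import numpy as np

def kmeans_with_frozen_clusters(points, centroids, is_stable, max_iter=100):
    """Lloyd-style k-means that stops updating clusters flagged as stable.

    is_stable(prev_centroids, curr_centroids) is a hypothetical callback
    returning a boolean array of length k.
    """
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float).copy()
    k = len(centroids)
    frozen = np.zeros(k, dtype=bool)
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Members of a frozen cluster keep their current label.
        keep = (labels >= 0) & frozen[labels]
        new_labels[keep] = labels[keep]
        if np.array_equal(new_labels, labels):
            break  # no assignment changed, so the loop has converged
        labels = new_labels
        prev = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            # Frozen or empty clusters keep their previous centroid.
            if not frozen[j] and len(members) > 0:
                centroids[j] = members.mean(axis=0)
        frozen |= np.asarray(is_stable(prev, centroids), dtype=bool)
    return centroids, labels
```

With a callback that never reports stability, the loop reduces to ordinary k-means; any saving comes from clusters that are frozen early.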

Table 1.

Summary statistics for the three experimental microarray data sets used to test our algorithm and the other three variants of the k-means algorithm.

Figure 3.

Quality of Clusters (Bozdech et al., P.f 3D7 Microarray Dataset).

Cluster quality is similar across the four algorithms. The MSE decreases gradually as the number of clusters increases, except at k = 21, where it is higher than at k = 20.
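
For readers who want to reproduce this kind of quality curve, one common definition of the MSE of a clustering is the mean squared Euclidean distance from each point to its assigned centroid. The sketch below assumes that definition, which may differ from the one used for the figure; `clustering_mse` is an illustrative name.

```python
import numpy as np

def clustering_mse(points, centroids, labels):
    """Mean squared distance from each point to its assigned centroid."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(labels)
    # centroids[labels] picks the assigned centroid for every point.
    diffs = points - centroids[labels]
    return float(np.mean(np.sum(diffs ** 2, axis=1)))
```

Evaluating this function for each k in the tested range gives one point per k on an MSE-versus-k plot.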

Figure 4.

Execution Time (Bozdech et al., P.f 3D7 Microarray Dataset).

The plot shows that our MMk-means has the fastest run time for every tested number of clusters, 15 ≤ k ≤ 25. All four algorithms took longest at k = 20, which suggests that this run time is a function of the nature of the data under consideration.

Table 2.

Hubert-Arabie Adjusted Rand Index (ARI_HA) Cluster Quality Computation Result for Biological and Non-biological data.
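
The Hubert-Arabie adjusted Rand index compares two partitions of the same items and corrects the raw pair agreement for chance. A minimal sketch of the standard formula is shown below, using only the Python standard library; the function name and the handling of degenerate cases are assumptions, and this is not the code used to produce the table.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index of two labelings of the same items."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("labelings must have the same length")
    if n < 2:
        return 1.0  # assumption: no pairs to disagree on
    # Pairs of items placed together in both, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    max_index = (together_a + together_b) / 2
    if max_index == expected:
        return 1.0  # assumption: both partitions are trivial
    return (together_both - expected) / (max_index - expected)
```

The index is 1 for identical partitions regardless of how the clusters are named, is close to 0 for agreement at chance level, and can be negative.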

Table 3.

Hubert-Arabie Adjusted Rand Index (ARI_HA) Cluster Quality Computation Result for Non-biological data.

Table 4.

Non-biological data sets used to test our algorithm and the other three variants of the k-means algorithm.

Table 5.

Performance comparison of all the k-means algorithms considered, on very large data sets.
