The K-harmonic means clustering algorithm (KHM) is a clustering method that groups data such that the sum of the harmonic averages of the distances between each entity and all cluster centroids is minimized. Because it is less sensitive to initialization than K-means (KM), KHM has recently attracted many researchers. In this study, the proposed iSSO-KHM is based on an improved simplified swarm optimization (iSSO) and integrates a variable neighborhood search (VNS) for KHM clustering. As evidence of the utility of the proposed iSSO-KHM, extensive computational results on eight benchmark problems are presented. The computational results support the superiority of the proposed iSSO-KHM over previously developed algorithms in all experiments.
Clustering is perhaps the most well-known technique in data mining; it groups data based on certain criteria. Over the past decades, clustering has attracted much attention and is increasingly becoming an important tool, owing to its wide and valuable applications in data analysis in the natural sciences, psychology, medicine, engineering, economics, marketing and other fields [
Clustering is an NP-hard problem with computational effort growing exponentially with the problem size [
The most popular class of partition clustering is the centroid-based clustering algorithm. Among all clustering methods, K-means (KM), with an extensive history dating back to 1972, is one of the most well-known center-based partition clustering techniques [
As seen above, KM is relatively simple, even on large datasets. Hence, it is widely and effectively applied to various real-life problems, such as market segmentation, classification analysis, artificial intelligence, machine learning, image processing, and machine vision [
However, KM is a heuristic algorithm and has two serious drawbacks [
Its result depends on the initial random clusters, i.e., sensitivity to initial starting centroids;
It may be trapped in a local optimum; i.e., there is no guarantee that it will converge to the global optimum.
Therefore, the K-harmonic means (KHM) algorithm was proposed by Zhang [
Soft computing is able to help the traditional KHM methods escape from the local optimum trap and obtain better results [
In this paper, a new algorithm, iSSO-KHM, is proposed to help the KHM escape from local optima by installing a new update mechanism into the SSO and integrating it with the KHM. The rest of the paper is organized as follows: Section 2 provides a description of the KHM and an overview of SSO. The novel one-variable difference update mechanism and the survival-of-the-fittest policy, the two cores of the proposed iSSO-KHM, are introduced in Section 3. Section 4 compares the proposed iSSO-KHM with four recently introduced KHM-based algorithms on eight benchmark datasets adopted from the UCI database to demonstrate its performance. Finally, concluding remarks are summarized in Section 5.
The proposed iSSO-KHM is based on both SSO and KHM. Before the proposed iSSO-KHM is discussed, this section formally introduces the basic SSO and KHM algorithms and how the KHM clustering problem is solved.
SSO is a new population-based soft computing method that was introduced originally by Yeh for discrete-type optimization problems [
Let Nsol be the number of solutions that are initialized randomly, K be the number of variables and the number of centroids,
Analogous to all other soft computing techniques, SSO searches for optimal solutions by updating generations. In every generation of SSO, each variable
Moving toward
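For reference, the generic SSO stepwise update can be sketched as follows (a minimal C++ sketch based on the original SSO update rule, where Cg < Cp < Cw are the usual threshold parameters; the function and variable names are ours):

```cpp
#include <random>
#include <vector>

// Minimal sketch of the generic SSO stepwise update: each variable is
// replaced by the global best, the personal best, itself, or a random
// value, according to which interval a fresh uniform random number
// rho falls into. Cg < Cp < Cw are the usual SSO thresholds.
std::vector<double> ssoUpdate(const std::vector<double>& x,
                              const std::vector<double>& pBest,
                              const std::vector<double>& gBest,
                              double Cg, double Cp, double Cw,
                              double lo, double hi, std::mt19937& rng) {
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    std::uniform_real_distribution<double> uVar(lo, hi);
    std::vector<double> xNew(x);
    for (std::size_t j = 0; j < x.size(); ++j) {
        const double rho = u01(rng);
        if (rho < Cg)      xNew[j] = gBest[j];   // move to the global best
        else if (rho < Cp) xNew[j] = pBest[j];   // move to the personal best
        else if (rho < Cw) { /* keep the current value */ }
        else               xNew[j] = uVar(rng);  // random restart
    }
    return xNew;
}
```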
KHM is similar to KM [
STEP K2 calculates the fitness function
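To make the fitness calculated in STEP K2 concrete, the following is a minimal C++ sketch of the standard KHM performance function (assuming Euclidean distances and denoting the power parameter by p; the function name and the numerical guard are ours):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Minimal sketch of the KHM performance function: for each record, K
// divided by the sum of d^(-p) over all K centroids, summed over all
// records (i.e., the sum of K times the harmonic averages of distances).
double khmObjective(const std::vector<std::vector<double>>& X,   // records
                    const std::vector<std::vector<double>>& C,   // centroids
                    double p) {
    double total = 0.0;
    for (const auto& x : X) {
        double invSum = 0.0;
        for (const auto& c : C) {
            double d2 = 0.0;
            for (std::size_t j = 0; j < x.size(); ++j)
                d2 += (x[j] - c[j]) * (x[j] - c[j]);
            const double d = std::max(std::sqrt(d2), 1e-12);  // avoid 1/0
            invSum += std::pow(d, -p);
        }
        total += static_cast<double>(C.size()) / invSum;
    }
    return total;
}
```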
Based on the novel one-variable difference update mechanism and the policy of survival of the fittest, the proposed iSSO-KHM is able to find a good solution without needing to explore all possible combinations of solutions. These two parts, i.e., the novel one-variable difference update mechanism and the policy of survival of the fittest, are discussed in this section.
Each soft computing method has its own generic update mechanism, together with numerous revised update mechanisms for different applications in various situations. In most soft computing methods, the update mechanism is changed only slightly. For example, the update mechanism of PSO is a vector-based update mechanism using the following two equations, where
Note that all variables in the same solution share two random variables in PSO, i.e., ρ1 and ρ2, which are generated randomly from a uniform distribution within [0, 1] in
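The following is a minimal C++ sketch of this vector-based PSO update under the description above (w denotes the usual inertia weight and c1, c2 the acceleration coefficients; note the single pair (ρ1, ρ2) drawn per solution and shared by all of its variables):

```cpp
#include <random>
#include <vector>

// Minimal sketch of the vector-based PSO update: one pair (rho1, rho2)
// is drawn per solution and shared by every variable of that solution.
void psoUpdate(std::vector<double>& x, std::vector<double>& v,
               const std::vector<double>& pBest,
               const std::vector<double>& gBest,
               double w, double c1, double c2, std::mt19937& rng) {
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    const double rho1 = u01(rng), rho2 = u01(rng);  // shared by all variables
    for (std::size_t j = 0; j < x.size(); ++j) {
        v[j] = w * v[j] + c1 * rho1 * (pBest[j] - x[j])
                        + c2 * rho2 * (gBest[j] - x[j]);  // velocity update
        x[j] += v[j];                                     // position update
    }
}
```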
To reduce the number of random values and to change solutions gradually without breaking the trend and stability of the convergence, only one variable is updated in each solution in each iteration of the proposed iSSO-KHM. Another reason to adopt the one-variable update mechanism is that the KHM is essentially insensitive to the initial conditions and only needs to refine its solution [
The update mechanism listed in
For example, let
Case | ρc | original value | updated value | updated c3
---|---|---|---|---
1 | 0.12 | 6.7 | 6.7+0.3∙0.6∙(5.4−6.7) = 6.466 | (1.3, 4.5, 6.466)
2 | 0.45 | 6.7 | 6.7+0.3∙0.6∙(5.4−1.7) = 7.366 | (1.3, 4.5, 7.366)
3 | 0.99 | 6.7 | 6.7+0.3∙0.6∙(7.7−1.7) = 7.780 | (1.3, 4.5, 7.780)
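The arithmetic of the three cases above can be reproduced directly (a small C++ check; interpreting the shared factor 0.3∙0.6 as the step scaling is our assumption):

```cpp
#include <cstdio>

// Numerical check of the three cases in the table above.
int main() {
    const double xj = 6.7, scale = 0.3 * 0.6;
    std::printf("case 1: %.3f\n", xj + scale * (5.4 - 6.7));  // 6.466
    std::printf("case 2: %.3f\n", xj + scale * (5.4 - 1.7));  // 7.366
    std::printf("case 3: %.3f\n", xj + scale * (7.7 - 1.7));  // 7.780
    return 0;
}
```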
The policy of survival of the fittest, inspired by natural selection, is a strategy that selects the fittest solutions and eliminates unfit ones. In the traditional SSO, the updated solution must replace the old solution regardless of whether the updated solution is worse [
Unlike SSO, the proposed one-variable difference update mechanism updates only one variable and places more emphasis on the local search. Additionally, KHM is less sensitive to the updated solutions. Hence, the survival-of-the-fittest policy applies to both
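A minimal C++ sketch of this acceptance rule is given below (the fitness is minimized; accepting ties is our assumption):

```cpp
#include <utility>
#include <vector>

// Minimal sketch of the survival-of-the-fittest acceptance rule: the
// updated solution replaces the old one only if its fitness is no worse.
template <typename FitnessFn>
bool acceptIfFitter(std::vector<double>& x, double& f,
                    std::vector<double> xNew, FitnessFn fitness) {
    const double fNew = fitness(xNew);
    if (fNew > f) return false;   // worse: discard the update
    x = std::move(xNew);          // no worse: the update survives
    f = fNew;
    return true;
}
```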
As in the existing related KHM algorithms, the KHM procedure discussed in Section 2.2 for calculating the fitness of each solution is implemented in the iSSO-KHM and acts as a local search that further improves each updated solution heuristically. The complete pseudocode of the proposed iSSO-KHM is described as follows.
In the above, STEP 0 simply runs the KHM procedure for each randomly generated solution to calculate its fitness function and update the solution. STEP 1 finds the first
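To make the described steps concrete, the following high-level C++ sketch outlines the assumed structure of the main loop (this is our reconstruction from the step descriptions above, not the authors' exact pseudocode; the KHM refinement, the fitness function, and the one-variable difference update are passed in as callables):

```cpp
#include <functional>
#include <utility>
#include <vector>

using Sol = std::vector<double>;

// Assumed iSSO-KHM flow: STEP 0 runs one KHM pass on each random
// solution; every generation then applies the one-variable difference
// update, the KHM procedure as a local search, and the
// survival-of-the-fittest acceptance rule.
void issoKhm(std::vector<Sol>& sols, int Ngen,
             const std::function<void(Sol&)>& khmRefine,
             const std::function<double(const Sol&)>& fitness,
             const std::function<Sol(const Sol&, const Sol&)>& update) {
    std::vector<double> fit(sols.size());
    std::size_t gBest = 0;
    for (std::size_t i = 0; i < sols.size(); ++i) {   // STEP 0
        khmRefine(sols[i]);
        fit[i] = fitness(sols[i]);
        if (fit[i] < fit[gBest]) gBest = i;
    }
    for (int t = 0; t < Ngen; ++t) {                  // generations
        for (std::size_t i = 0; i < sols.size(); ++i) {
            Sol cand = update(sols[i], sols[gBest]);  // one-variable update
            khmRefine(cand);                          // KHM as local search
            const double fNew = fitness(cand);
            if (fNew <= fit[i]) {                     // survival of the fittest
                sols[i] = std::move(cand);
                fit[i] = fNew;
                if (fit[i] < fit[gBest]) gBest = i;   // track the best
            }
        }
    }
}
```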
In this section, we present the computational results of comparisons between the proposed algorithm and existing algorithms on eight benchmark datasets to test the performance of iSSO-KHM.
To evaluate the efficiency and effectiveness (i.e., the solution quality) of the proposed iSSO-KHM, eight benchmarks adopted from UCI are tested: Abalone (denoted by A, 4177 records and seven features), Breast-Cancer-Wisconsin (denoted by B, 699 records and nine features), Car (denoted by C, 1728 records and six features), Glass (denoted by G, 214 records and nine features), Iris (denoted by I, 150 records and four features), Segmentation (denoted by S, 2310 records and 19 features), Wine (denoted by W, 178 records and 13 features), and Yeast (denoted by Y, 1484 records and eight features).
Moreover, iSSO-KHM is compared to four KHM-related soft computing algorithms: CGS_KHM, MLS_KHM, PSO_KHM, and SSO_KHM. Note that CGS_KHM has better performance than tabu search and VNS for the Iris, Glass and Wine datasets.
The programming language used was C++ with default options for all five algorithms: CGS_KHM (denoted by CGS), iSSO-KHM (denoted by iSSO), MLS_KHM (denoted by MLS), PSO_KHM (denoted by PSO), and SSO_KHM (denoted by SSO). All codes were run under a 64-bit Windows 10 operating system on an Intel Core i7-5960X 3.00 GHz CPU with 16 GB of RAM.
In the experiments, all values of K are set to three; the
All required parameters for CGS, MLS, PSO, and SSO are taken directly from [
In all tables listed in
To compare the efficiency of the update mechanism of the proposed iSSO, the average number of fitness calculations (Navg) and the number of times each algorithm attains the best Navg (denoted by navg) are recorded. Note that for a fixed
To properly evaluate the clustering quality, the Fmeasure value is provided, together with the number of times each algorithm attains the best Fmeasure [
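For completeness, the clustering F-measure can be computed as follows (a minimal C++ sketch assuming the common definition: for each true class, the best harmonic mean of precision and recall over all clusters, weighted by class size):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal sketch of the standard clustering F-measure.
double fMeasure(const std::vector<int>& truth,  // class label per record
                const std::vector<int>& pred,   // cluster label per record
                int numClasses, int numClusters) {
    std::vector<std::vector<double>> cnt(numClasses,
                                         std::vector<double>(numClusters, 0.0));
    std::vector<double> classSize(numClasses, 0.0), clusterSize(numClusters, 0.0);
    for (std::size_t i = 0; i < truth.size(); ++i) {
        cnt[truth[i]][pred[i]] += 1.0;          // contingency table
        classSize[truth[i]] += 1.0;
        clusterSize[pred[i]] += 1.0;
    }
    double total = 0.0;
    for (int c = 0; c < numClasses; ++c) {
        double best = 0.0;
        for (int k = 0; k < numClusters; ++k) {
            if (cnt[c][k] == 0.0) continue;
            const double prec = cnt[c][k] / clusterSize[k];  // precision
            const double rec  = cnt[c][k] / classSize[c];    // recall
            best = std::max(best, 2.0 * prec * rec / (prec + rec));
        }
        total += classSize[c] / static_cast<double>(truth.size()) * best;
    }
    return total;
}
```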
All experimental results are listed in
All results in
(The three six-column metric groups correspond, from left to right, to the parameter values 0.1, 0.3, and 0.5; the leftmost column gives the second parameter value, 1.5, 2.0, or 2.5.)

 | Alg. | favg | fmin | fmax | fstd | navg | fmea | favg | fmin | fmax | fstd | navg | fmea | favg | fmin | fmax | fstd | navg | fmea
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1.5 | CGS | S | 0 | S | BS | 0 | 0 | S | 0 | S | S | 0 | W | 0 | 0 | 0 | 0 | 0 | 0
 | iSSO | | | | | | | | | | | | | | | | | |
 | MLS | 0 | S | 0 | 0 | A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | S
 | PSO | 0 | 0 | 0 | 0 | 0 | W | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
 | SSO | A | A | 0 | 0 | 0 | AS | 0 | A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2.0 | CGS | S | S | S | BS | 0 | 0 | S | 0 | S | S | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
 | iSSO | | | | | | | | | | | | | | | | | |
 | MLS | 0 | 0 | 0 | 0 | A | W | B | 0 | B | B | 0 | W | B | 0 | B | B | 0 | 0
 | PSO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | S | 0 | 0 | 0 | 0 | 0 | I
 | SSO | 0 | A | 0 | 0 | 0 | A | 0 | 0 | 0 | 0 | 0 | A | 0 | 0 | 0 | 0 | 0 | AY
2.5 | CGS | S | 0 | S | BS | 0 | 0 | 0 | 0 | S | S | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
 | iSSO | | | | | | | | | | | | | | | | | |
 | MLS | 0 | S | 0 | 0 | A | W | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | G
 | PSO | 0 | 0 | 0 | 0 | 0 | IS | 0 | 0 | 0 | 0 | 0 | BW | 0 | 0 | 0 | 0 | 0 | 0
 | SSO | 0 | A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | A | 0 | 0 | 0 | 0 | 0 | I
 | Alg. | favg | fmin | fmax | fstd | navg | fmea
---|---|---|---|---|---|---|---
0.1 | CGS | 3 (S) | 1 (S) | 3 (S) | 6 (3B,3S) | 0 | 0
 | iSSO | | | | | |
 | MLS | 0 | 2 (S) | 0 | 0 | 3 (A) | 2 (W)
 | PSO | 0 | 0 | 0 | 0 | 0 | 3 (I,S,W)
 | SSO | 1 (A) | 3 (A) | 0 | 0 | 0 | 2 (A,S)
0.3 | CGS | 2 (S) | 0 | 3 (S) | 3 (S) | 0 | 1 (W)
 | iSSO | | | | | |
 | MLS | 1 (B) | 0 | 1 (B) | 1 (B) | 0 | 1 (W)
 | PSO | 0 | 0 | 0 | 0 | 0 | 3 (B,S,W)
 | SSO | 0 | 1 (A) | 0 | 0 | 0 | 2 (A)
0.5 | CGS | 0 | 0 | 0 | 0 | 0 | 0
 | iSSO | | | | | |
 | MLS | 1 (B) | 0 | 1 (B) | 1 (B) | 0 | 2 (G,S)
 | PSO | 0 | 0 | 0 | 0 | 0 | 1 (I)
 | SSO | 0 | 0 | 0 | 0 | 0 | 3 (A,I,Y)
 | Alg. | favg | fmin | fmax | fstd | navg | fmea
---|---|---|---|---|---|---|---
1.5 | CGS | 2 (S) | 0 | 2 (S) | 3 (B,2S) | 0 | 1 (W)
 | iSSO | | | | | |
 | MLS | 0 | 1 (S) | 0 | 0 | 1 (A) | 1 (S)
 | PSO | 0 | 0 | 0 | 0 | 0 | 1 (W)
 | SSO | 1 (A) | 2 (A) | 0 | 0 | 0 | 2 (A,S)
2 | CGS | 2 (S) | 1 (S) | 2 (S) | 3 (B,2S) | 0 | 0
 | iSSO | | | | | |
 | MLS | 2 (B) | 0 | 2 (B) | 2 (B) | 1 (A) | 2 (W)
 | PSO | 0 | 0 | 0 | 0 | 0 | 2 (I,S)
 | SSO | 0 | 1 (A) | 0 | 0 | 0 | 4 (3A,Y)
2.5 | CGS | 1 (S) | 0 | 2 (S) | 3 (B,2S) | 0 | 0
 | iSSO | | | | | |
 | MLS | 0 | 1 (S) | 0 | 0 | 1 (A) | 2 (G,W)
 | PSO | 0 | 0 | 0 | 0 | 0 | 4 (B,I,S,W)
 | SSO | 0 | 1 (A) | 0 | 0 | 0 | 2 (A,I)
Alg. | favg | fmin | fmax | fstd | navg | fmea |
---|---|---|---|---|---|---|
CGS | 5 (S) | 0 | 6 (S) | 9 (3B,6S) | 0 | 1 (W) |
iSSO | ||||||
MLS | 2 (B) | 2 (S) | 2 (B) | 2 (B) | 3 (A) | 5 (G, 3W, S) |
PSO | 0 | 0 | 0 | 0 | 0 | 7 (B, 2I, 2S, 2W) |
SSO | 1 (A) | 4 (A) | 0 | 0 | 0 | 8 (5A, I, S, Y) |
From
In general, the results obtained using the proposed iSSO are better than those obtained using the other methods described
ID | | | Alg. | Favg | Fmin | Fmax | Fstd | Navg | Fmeasure
---|---|---|---|---|---|---|---|---|---
A | 0.5 | 2.5 | iSSO | 377.100 | 377.096 | 377.115 | 3.92E-03 | 254.26 | 57.70% |
0.3 | 2.5 | iSSO | 377.129 | 377.097 | 377.226 | 3.42E-02 | 152.28 | 58.64% | |
0.3 | 2.5 | SSO | 380.591 | 377.116 | 403.228 | 5.83E+00 | 146.32 | 58.67% | |
0.5 | 2.5 | SSO | 385.472 | 377.157 | 414.058 | 1.04E+01 | 241.32 | 57.69% | |
0.5 | 2.5 | PSO | 418.457 | 377.634 | 426.647 | 1.96E+01 | 238.2 | 57.59% | |
B | 0.5 | 1.5 | iSSO | 424.867 | 424.867 | 424.867 | 1.12E-06 | 1685.48 | 96.15% |
0.3 | 1.5 | iSSO | 424.867 | 424.867 | 424.867 | 1.71E-05 | 1023.24 | 96.12% | |
0.1 | 2.5 | iSSO | 424.886 | 424.867 | 424.972 | 2.66E-02 | 356.56 | 96.16% | |
0.1 | 2 | iSSO | 424.883 | 424.867 | 424.998 | 2.42E-02 | 357.28 | 96.14% | |
0.1 | 1.5 | iSSO | 424.888 | 424.868 | 424.967 | 2.71E-02 | 341.78 | 96.19% | |
C | 0.5 | 1.5 | iSSO | 7068.628 | 7068.628 | 7068.628 | 8.33E-06 | 722.9 | 39.28% |
0.3 | 1.5 | iSSO | 7068.630 | 7068.628 | 7068.632 | 1.08E-03 | 436.18 | 39.25% | |
0.1 | 2.5 | iSSO | 7069.794 | 7069.146 | 7070.432 | 3.35E-01 | 150.64 | 39.33% | |
0.1 | 2 | iSSO | 7070.008 | 7069.150 | 7070.889 | 4.04E-01 | 150.66 | 39.39% | |
0.1 | 1.5 | iSSO | 7069.977 | 7069.189 | 7070.878 | 4.25E-01 | 144.18 | 39.23% | |
G | 0.3 | 1.5 | iSSO | 1059.336 | 1059.336 | 1059.336 | 7.59E-09 | 5099.84 | 41.42% |
0.3 | 2 | iSSO | 1059.336 | 1059.336 | 1059.336 | 6.87E-09 | 5100.06 | 41.33% | |
0.1 | 2.5 | iSSO | 1059.336 | 1059.336 | 1059.337 | 6.21E-06 | 1780.36 | 41.39% | |
0.1 | 1.5 | iSSO | 1059.336 | 1059.336 | 1059.337 | 5.95E-06 | 1705.48 | 41.39% | |
0.1 | 2 | iSSO | 1059.336 | 1059.336 | 1059.337 | 9.23E-06 | 1779.64 | 41.50% | |
I | 0.3 | 1.5 | iSSO | 181.728 | 181.728 | 181.728 | 2.04E-11 | 9494.02 | 75.31% |
0.3 | 2 | iSSO | 181.728 | 181.728 | 181.728 | 1.83E-11 | 9495.64 | 75.36% | |
0.1 | 1.5 | iSSO | 181.728 | 181.728 | 181.728 | 1.08E-07 | 3185.04 | 75.25% | |
0.1 | 2 | iSSO | 181.728 | 181.728 | 181.728 | 7.83E-08 | 3313.24 | 75.41% | |
0.1 | 2.5 | iSSO | 181.728 | 181.728 | 181.728 | 6.70E-08 | 3316.62 | 75.26% | |
S | 0.5 | 1.5 | iSSO | 42852347.416 | 42852345.652 | 42852350.697 | 1.51E+00 | 412.40 | 53.18% |
0.5 | 2 | iSSO | 42852347.460 | 42852345.678 | 42852351.912 | 1.61E+00 | 412.80 | 53.10% | |
0.3 | 2 | iSSO | 42852454.042 | 42852348.975 | 42852724.919 | 9.39E+01 | 248.00 | 53.19% | |
0.3 | 1.5 | iSSO | 42852441.557 | 42852356.230 | 42852682.740 | 7.27E+01 | 247.90 | 53.07% | |
0.3 | 2 | CGS | 42852405.328 | 42852368.882 | 42852456.610 | 2.33E+01 | 184.10 | 53.14% | |
W | 0.3 | 1.5 | iSSO | 5388248.279 | 5388248.279 | 5388248.279 | 4.70E-09 | 5095.64 | 62.07% |
0.3 | 2 | iSSO | 5388248.279 | 5388248.279 | 5388248.279 | 4.70E-09 | 5095.72 | 62.22% | |
0.1 | 2 | iSSO | 5388248.279 | 5388248.279 | 5388248.279 | 4.69E-08 | 1776.94 | 62.15% | |
0.1 | 1.5 | iSSO | 5388248.279 | 5388248.279 | 5388248.279 | 1.21E-07 | 1704.06 | 62.20% | |
0.1 | 2.5 | iSSO | 5388248.279 | 5388248.279 | 5388248.279 | 6.11E-08 | 1779.74 | 62.08% | |
Y | 0.5 | 2.5 | iSSO | 72.833 | 72.833 | 72.836 | 9.49E-04 | 769.6 | 56.21% |
0.3 | 2.5 | iSSO | 72.837 | 72.833 | 72.857 | 5.41E-03 | 465.74 | 56.15% | |
0.3 | 2.5 | SSO | 74.094 | 72.952 | 75.856 | 8.53E-01 | 445.44 | 56.05% | |
0.5 | 2.5 | SSO | 73.942 | 73.038 | 75.630 | 6.96E-01 | 734.42 | 56.09% | |
0.3 | 2.5 | PSO | 74.343 | 73.185 | 75.418 | 6.40E-01 | 399.8 | 55.84% |
In
Additionally, the top five Fmin values are all obtained by the proposed iSSO for the B, C, G, I, and W datasets; for the S dataset, iSSO has the top four Fmin (CGS has the 5th best); and for the A and Y datasets, iSSO has only the top two Fmin, while SSO has the 3rd and 4th best and PSO the 5th best. It seems that the algorithm with the best Fmin also has the best Favg, Fmax, Fstd, and Navg in all datasets. However, the algorithm with the best Fmin does not necessarily have the best Fmeasure; this is the case for the A, B, C, G, I, and W datasets.
The following are some other observations for
Navg: The order of the best Navg for each dataset from large to small is 9494.02 (I) > 5099.84 (G) > 5095.64 (W) > 1685.48 (B) > 769.6 (Y) > 722.9 (C) > 412.40 (S) > 254.26 (A), where the letter inside parentheses denotes the related dataset. This order is exactly the reverse of the order of the number of records in each dataset from large to small, i.e., A (4177) > S (2310) > C (1728) > Y (1484) > B (699) > G (214) > W (178) > I (150), except that 5099.84 (G) > 5095.64 (W) in Navg. Hence, the smaller the dataset, the shorter the runtime of each fitness calculation and, consequently, the larger the number of fitness calculations performed.
Fstd: The order of the best Fstd for each dataset from small to large is 2.04E-11 (I) < 4.70E-09 (W) < 7.59E-09 (G) < 1.12E-06 (B) < 8.33E-06 (C) < 9.49E-04 (Y) < 3.92E-03 (A) < 1.51E+00 (S), where the letter inside parentheses denotes the related dataset. This order of datasets is similar to that of Navg because the more fitness calculations are performed, the lower the standard deviation is.
In this work, a new soft computing method called iSSO-KHM is proposed to solve the KHM clustering problem. The proposed iSSO-KHM adapts the fundamental concepts of both the traditional SSO and KHM by adding the novel one-variable difference update mechanism to update solutions and the survival-of-the-fittest policy to decide whether to accept newly updated solutions.
The computational experiments compare the proposed iSSO-KHM with CGS, MLS, PSO, and SSO on eight benchmark datasets: Abalone, Breast-Cancer-Wisconsin, Car, Glass, Iris, Segmentation, Wine, and Yeast with settings of
The experimental results show the superiority of iSSO-KHM over the other four algorithms on almost all eight benchmark datasets. Hence, iSSO-KHM achieves a trade-off between exploration and exploitation and generates good approximations within a limited computation time systematically, efficiently, effectively, and robustly.
However, as the experiments in Section 4 show, an improved Fmin value does not mean that the Fmeasure is also improved. Therefore, a potential area of exploration would be to include the Fmeasure in the fitness function to improve both Fmin and Fmeasure. Another limitation of the proposed algorithm is that
As some swarm-based clustering algorithms have been proposed recently, the proposed algorithm should be compared with other well-known swarm-based clustering algorithms in future work. In Section 4, "Experimental results", the choice of the parameter
I wish to thank the anonymous editor and the reviewers for their constructive comments and recommendations, which have significantly improved the presentation of this paper. This research was supported in part by the National Science Council of Taiwan, R.O.C., under grants NSC 101-2221-E-007-079-MY3 and NSC 102-2221-E-007-086-MY3.