
A New Soft Computing Method for K-Harmonic Means Clustering

  • Wei-Chang Yeh ,

    yeh@ieee.org

    Affiliation Integration and Collaboration Laboratory, Department of Industrial Engineering and Engineering Management, National Tsing Hua University, Hsinchu City, Taiwan

  • Yunzhi Jiang,

    Affiliation School of Automation, Guangdong Polytechnic Normal University, Guangdong, China

  • Yee-Fen Chen,

    Affiliation Department of Industrial Engineering and Management, Hsiuping University of Science and Technology, Taichung City, Taiwan

  • Zhe Chen

Affiliation School of Information Engineering, Chang’an University, Xi’an, China


Correction

3 Jan 2017: Yeh WC, Jiang Y, Chen YF, Chen Z (2017) Correction: A New Soft Computing Method for K-Harmonic Means Clustering. PLOS ONE 12(1): e0169707. https://doi.org/10.1371/journal.pone.0169707

Abstract

The K-harmonic means clustering algorithm (KHM) groups data such that the sum of the harmonic averages of the distances between each entity and all cluster centroids is minimized. Because it is less sensitive to initialization than K-means (KM), KHM has recently attracted many researchers. In this study, the proposed iSSO-KHM is based on an improved simplified swarm optimization (iSSO) and integrates a variable neighborhood search (VNS) for KHM clustering. As evidence of the utility of the proposed iSSO-KHM, extensive computational results on eight benchmark problems are presented. The computational results support the superiority of the proposed iSSO-KHM over previously developed algorithms in all experiments.

1. Introduction

Clustering is perhaps the most well-known technique in data mining for grouping data based on certain criteria. In past decades, clustering has attracted much attention and has become an increasingly important tool, owing to its wide and valuable applications in improving data analysis in various fields, such as the natural sciences, psychology, medicine, engineering, economics, and marketing [1–28].

Clustering is an NP-hard problem whose computational effort grows exponentially with the problem size [13]. Existing clustering algorithms fall into two categories: hierarchical clustering and partition clustering [3]. The former builds a hierarchy tree of the data by successively merging similar clusters, while the latter begins with a random partition and refines it iteratively [3].

The most popular class of partition clustering algorithms is centroid-based clustering. Among all clustering methods, with an extensive history dating back to 1972, K-means (KM) is one of the most well-known center-based partition clustering techniques [4–17]. KM is implemented by first randomly selecting K initial centroids and then heuristically minimizing the sum of the squares of the distances, e.g., the Euclidean distance, Manhattan distance, or Mahalanobis distance, between each data point and the centroids [4–16].

As seen above, KM is relatively simple, even on large datasets. Hence, it has been applied widely and effectively to various real-life problems, such as market segmentation, classification analysis, artificial intelligence, machine learning, image processing, and machine vision [4–16]. Moreover, KM is frequently used as a preprocessing stage that provides a starting configuration for other methodologies.
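For concreteness, the following minimal sketch shows the standard KM loop just described: random initial centroids, nearest-centroid assignment, and centroid recomputation. It assumes Euclidean distance and synthetic data and is not the implementation used in this paper; the random initialization in the first line of the loop is precisely the source of the sensitivity discussed next.

```python
# A minimal K-means sketch (illustrative only, not the authors' implementation).
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initial centroids: K distinct data points.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):   # converged to a (local) optimum
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: three Gaussian blobs in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0.0, 3.0, 6.0)])
centroids, labels = k_means(X, K=3)
```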

However, KM is a heuristic algorithm and has two serious drawbacks [7–16]:

  1. Its result depends on the initial random clusters, i.e., sensitivity to initial starting centroids;
  2. It may be trapped in a local optimum; i.e., there is no guarantee that it will converge to the global optimum.

Therefore, the K-harmonic means (KHM) algorithm was proposed by Zhang [7] in 1999 to solve the problem of sensitivity to the initial starting points. However, it may still converge to a local optimum. Hence, the main focus of KHM research has shifted to developing soft computing methods, such as the tabu K-harmonic means [9], the simulated annealing based KHM [10], the particle swarm optimization (PSO) KHM (PSO-KHM) [11], the hybrid data clustering algorithms based on ant colony optimization and KHM [12], a variable neighborhood search (VNS) for KHM clustering [10], the multi-start local search for KHM clustering (MLS) [13], the gravitational search algorithm based KHM [14], the candidate groups search combined with K-harmonic means (CGS-KHM) [15], the simplified swarm optimization based KHM (SSO-KHM) [16], the statistical feature extraction modeling KHM [29], the PSO hybrid with tabu search for KHM clustering [30], the firefly [31] and the enhanced firefly algorithm [32] for KHM clustering, the fish school search algorithm [33], and the genetic hybrid with gravitational search for KHM clustering [34], to avoid the local-optimum trap and reduce numerical difficulties.

Soft computing is able to help the traditional KHM methods escape from the local-optimum trap and obtain better results [7–16]. However, the update mechanisms of these soft computing methods are either too elaborate, requiring extra computational effort, or too weak in their local search, requiring more time to converge [16]. Thus, there is still a need for a better soft computing method for KHM clustering.

In this paper, a new algorithm, iSSO-KHM, is proposed to help KHM escape from local optima by installing a new update mechanism into SSO and integrating it with KHM. The rest of the paper is organized as follows: Section 2 provides a description of KHM and an overview of SSO. The novel one-variable difference update mechanism and the survival-of-the-fittest policy, which are the two cores of the proposed iSSO-KHM, are introduced in Section 3. Section 4 compares the proposed iSSO-KHM with four recently introduced KHM-based algorithms on eight benchmark datasets adopted from the UCI repository to demonstrate its performance. Finally, concluding remarks are summarized in Section 5.

2. Overview of SSO and KHM

The proposed iSSO-KHM is based on both SSO and KHM. Before the proposed iSSO-KHM is discussed, the basic SSO and KHM algorithms for solving the KHM clustering problem are introduced formally in this section.

2.1 The SSO

SSO is a population-based soft computing method that was introduced originally by Yeh for discrete-type optimization problems [17] and spans two active research topics in soft computing: swarm intelligence and evolutionary computing. In applications to various optimization problems, SSO has demonstrated simplicity, efficiency, and flexibility in exploring large and complex search spaces [16–28].

Let Nsol be the number of randomly initialized solutions, K be the number of variables (i.e., the number of centroids), ci = (ci,1, ci,2, …, ci,K) be the ith solution in the problem space with a fitness value F(ci) determined by the fitness function F to be optimized, the pBest Pi = (pi,1, pi,2, …, pi,K) be the best solution found so far in the history of the ith solution, and the gBest PgBest = (pgBest,1, pgBest,2, …, pgBest,K) be the solution with the best fitness value among all pBests, where i = 1, 2, …, Nsol and gBest ∈ {1, 2, …, Nsol}.

Analogous to all other soft computing techniques, SSO searches for optimal solutions by updating generations. In every generation of SSO, each variable cj,k is updated according to the following simple step function after Cw, Cp, and Cg are given:

$$c_{j,k}^{\,new}=\begin{cases}c_{j,k} & \text{if } \rho\in[0,C_w)\\ p_{j,k} & \text{if } \rho\in[C_w,C_p)\\ p_{gBest,k} & \text{if } \rho\in[C_p,C_g)\\ x & \text{if } \rho\in[C_g,1)\end{cases}\tag{1}$$

where j = 1, 2, …, Nsol; k = 1, 2, …, K; ρ is a random number generated from the uniform distribution on [0, 1]; x is a new randomly generated feasible value; and Cw, Cp−Cw, Cg−Cp, and 1−Cg are the predefined probabilities that cj,k keeps its current value (i.e., no change), is updated to pj,k of its pBest, is updated to pgBest,k of gBest, or is regenerated as a new randomly generated feasible value, respectively [16–28].

Moving toward pBest is a local search; moving toward gBest is a global search. Moving toward a randomly generated feasible value is also a global search to maintain population diversity and enhance the capacity of escaping from a local optimum. Thus, each solution is a compromise among the current solution, pBest, gBest, and a random movement; this process combines local search and global search, yielding high search efficiency [1628].
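For concreteness, the following is a minimal sketch of the generic SSO update in Eq (1); the bounds and the particular values of Cw, Cp, and Cg are illustrative assumptions, not the parameters used later in this paper.

```python
# Sketch of the SSO step-function update in Eq (1), assuming Cw < Cp < Cg
# act as cumulative thresholds on a single random number per variable.
import random

def sso_update(c, p_best, p_gbest, lower, upper, Cw=0.1, Cp=0.4, Cg=0.9):
    """Update every variable of one solution c (a list of feasible values)."""
    new_c = []
    for k, value in enumerate(c):
        rho = random.random()
        if rho < Cw:                 # keep the current value (no change)
            new_c.append(value)
        elif rho < Cp:               # move to its own pBest (local search)
            new_c.append(p_best[k])
        elif rho < Cg:               # move to gBest (global search)
            new_c.append(p_gbest[k])
        else:                        # random feasible value (population diversity)
            new_c.append(random.uniform(lower[k], upper[k]))
    return new_c
```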

2.2 The KHM

KHM is similar to KM [7–16]. It is also a center-based partition clustering method that randomly selects K initial centroids at the beginning. The major difference between KHM and KM is that KHM uses the harmonic averages of the distances from each data point to the centers as components of its performance function. The details of the KHM clustering algorithm are as follows [7–16]:

KHM PROCEDURE.

  1. STEP K1. Select K initial centroids c1, c2, …, cK randomly, where ck is the centroid of the kth cluster; let F* be a large number, and provide a tolerance ε.
  2. STEP K2. Calculate the fitness function
     $$F(c_1,c_2,\ldots,c_K)=\sum_{i=1}^{N}\frac{K}{\sum_{k=1}^{K}\|X_i-c_k\|^{-p}}\tag{2}$$
     where ‖Xi − ck‖ denotes the Manhattan distance between the data point Xi and the centroid ck, and p is the power to which the distance is raised.
  3. STEP K3. If (F*/F(c1, c2, …, cK) − 1 < ε), then stop iterating and go to STEP K7; else, let F* = F(c1, c2, …, cK).
  4. STEP K4. Calculate the membership of each data point Xi with respect to each centroid ck for i = 1, 2, …, N and k = 1, 2, …, K as
     $$M(c_k,X_i)=\frac{\|X_i-c_k\|^{-p-2}}{\sum_{j=1}^{K}\|X_i-c_j\|^{-p-2}}\tag{3}$$
  5. STEP K5. Calculate the weight of each data point Xi for i = 1, 2, …, N as
     $$W(X_i)=\frac{\sum_{k=1}^{K}\|X_i-c_k\|^{-p-2}}{\Bigl(\sum_{k=1}^{K}\|X_i-c_k\|^{-p}\Bigr)^{2}}\tag{4}$$
  6. STEP K6. Calculate the new centroid ck for k = 1, 2, …, K as
     $$c_k=\frac{\sum_{i=1}^{N}M(c_k,X_i)\,W(X_i)\,X_i}{\sum_{i=1}^{N}M(c_k,X_i)\,W(X_i)}\tag{5}$$
     and go to STEP K2.
  7. STEP K7. Assign data point Xi to cluster k if M(cj, Xi) ≤ M(ck, Xi) for all j = 1, 2, …, K.

STEP K2 calculates the fitness function F(c1, c2, …, cK) of KHM by summing the harmonic averages of the distances between each data point and all centroids. STEP K3 defines the stopping criterion for KHM. In STEP K4, KHM employs the membership function M(ck, Xi) to measure the influence of the centroid ck on the data point Xi; this membership function determines which cluster each data point belongs to in STEP K7. STEP K5 assigns a dynamic weight W(Xi) to each data point: points that are already close to some centroid receive smaller weights, whereas points far from every centroid receive larger weights, which reduces the chance that multiple centroids converge to the same region. STEP K6 updates the current centroids.
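The sketch below implements one KHM iteration (STEPs K2 through K6) with vectorized numpy operations, following the standard KHM_p equations given above. It is illustrative only, not the authors' code; the small eps guards against division by zero when a point coincides with a centroid.

```python
# One KHM iteration (STEPs K2-K6); X has shape (N, D), C has shape (K, D).
import numpy as np

def khm_step(X, C, p=2.0, eps=1e-12):
    K = C.shape[0]
    # Pairwise Manhattan distances d[i, k] between point i and centroid k.
    d = np.abs(X[:, None, :] - C[None, :, :]).sum(axis=2) + eps

    # STEP K2: sum over points of the harmonic average of the p-th powers.
    fitness = np.sum(K / np.sum(d ** (-p), axis=1))

    # STEP K4: soft membership of each point in each cluster.
    m = d ** (-p - 2)
    m /= m.sum(axis=1, keepdims=True)

    # STEP K5: dynamic weight (points already near a centroid count less).
    w = np.sum(d ** (-p - 2), axis=1) / np.sum(d ** (-p), axis=1) ** 2

    # STEP K6: weighted recomputation of the centroids.
    mw = m * w[:, None]
    C_new = (mw.T @ X) / mw.sum(axis=0)[:, None]
    return C_new, fitness
```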

3. The Proposed iSSO-KHM

Based on the novel one-variable difference update mechanism and the policy of survival of the fittest, the proposed iSSO-KHM is able to find a good solution without needing to explore all possible combinations of solutions. These two parts, i.e., the novel one-variable difference update mechanism and the policy of survival of the fittest, are discussed in this section.

3.1 The one-variable difference update mechanism

Each soft computing method has its own generic update mechanism, along with numerous revised update mechanisms for different applications in various situations. In most soft computing methods, the update mechanism is changed only slightly. For example, the update mechanism of PSO is considered a vector-based update mechanism that uses the following two equations, where vj,k is the velocity of the kth variable of the jth solution and c1 and c2 are two constants:

$$v_{j,k}^{\,new}=v_{j,k}+c_1\rho_1\,(p_{j,k}-c_{j,k})+c_2\rho_2\,(p_{gBest,k}-c_{j,k})\tag{6}$$

$$c_{j,k}^{\,new}=c_{j,k}+v_{j,k}^{\,new}\tag{7}$$

Note that all variables in the same solution share two random numbers in PSO, i.e., ρ1 and ρ2, which are generated randomly from a uniform distribution within [0, 1] in Eq 6. In the artificial bee colony algorithm (ABC), one randomly selected variable of each solution is updated. The update operators in the traditional genetic algorithm (GA) change either two variables via one-cut-point mutation or up to half of the variables via one-cut-point crossover. In the traditional SSO, however, all variables are updated simultaneously based on Eq 1.
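The contrast can be seen in a minimal sketch of the vector-based PSO update of Eqs (6) and (7): the same two random numbers ρ1 and ρ2 are shared by every variable of a solution. The values of c1 and c2 are illustrative assumptions.

```python
# Sketch of the vector-based PSO update in Eqs (6)-(7).
import random

def pso_update(c, v, p_best, p_gbest, c1=2.0, c2=2.0):
    rho1, rho2 = random.random(), random.random()   # shared by all variables k
    new_v = [v[k] + c1 * rho1 * (p_best[k] - c[k])
                  + c2 * rho2 * (p_gbest[k] - c[k]) for k in range(len(c))]
    new_c = [c[k] + new_v[k] for k in range(len(c))]
    return new_c, new_v
```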

To reduce the number of random values and to change solutions gradually without breaking the trend and stability of the convergence process, only one variable of each solution is updated in each iteration of the proposed iSSO-KHM. Another reason to adopt the one-variable update mechanism is that KHM is essentially insensitive to the initial conditions and only needs to refine its solution [7–16].

The update mechanism listed in Eq 1 is more suitable for discrete-type data, whereas each centroid variable in KHM is a floating-point value. Hence, the step function in Eq 1 is revised for floating-point data in the novel one-variable difference update mechanism of the proposed iSSO-KHM as follows: (8) where ρ1, ρ2, and ρc are random numbers generated from the uniform distribution on [0, 1]. Note that Cg = 0.4 and Cw = 0.6 in this study, the role of pBest is removed, and the comparison order is Cg first and then Cw in the step function of Eq 8, which differs from Eq 1.

For example, let c3 = (1.3, 4.5, 6.7, 8.9) be the current solution, cgBest = c6 = (2.7, 7.6, 5.4, 9.8) be the gBest, cx = c5 = (2.3, 5.5, 7.7, 9.9) and cy = c7 = (6.2, 8.5, 1.7, 4.9) be two randomly selected solutions, and the third variable (i.e., c3,3) be selected randomly to update. Assume that ρ1 = 0.3 and ρ2 = 0.6 are generated randomly. Table 1 shows the newly updated c3 for three different cases resulting from three different values of ρc:

3.2 Survival-of-the-fittest policy

The policy of survival of the fittest, inspired by natural selection, is a strategy that keeps the fittest solutions and eliminates the unfit ones. In the traditional SSO, the updated solution must replace the old solution regardless of whether the updated solution is worse [17–28]. However, gBest is based on the survival-of-the-fittest policy; i.e., only a solution that is better than gBest can replace gBest [17–28].

Unlike SSO, the proposed one-variable difference update mechanism updates only one variable and places more emphasis on local search. Additionally, KHM is less sensitive to the updated solutions. Hence, the survival-of-the-fittest policy is applied to both gBest and all updated solutions to reduce the evolution time.

3.3 The complete pseudocode of the proposed iSSO-KHM

As in existing KHM-related algorithms, the KHM procedure discussed in Section 2.2 is implemented in iSSO-KHM to calculate the fitness of each solution; it also acts as a local search that further improves each updated solution heuristically. The complete pseudocode of the proposed iSSO-KHM is described as follows.

iSSO-KHM PROCEDURE.

  1. STEP 0. Generate cj = (cj,1, cj,2, …, cj,K) randomly, update cj, and calculate its fitness using the KHM procedure discussed in Section 2.2 for all j = 1, 2, …, Nsol.
  2. STEP 1. Let gen = 1 and find gBest∈{1, 2, …, Nsol} such that F(cgBest)≤F(cj) for all j = 1, 2, …, Nsol.
  3. STEP 2. Let j = 1.
  4. STEP 3. Select a variable (i.e., a centroid) randomly from cj, say cj,k where k∈{1, 2, …, K}, and let c* = cj and F* = F(cj).
  5. STEP 4. Generate a random number ρC from the uniform distribution on [0, 1].
  6. STEP 5. If ρC<Cg, then let x = gBest, y = j, and go to STEP 8.
  7. STEP 6. If ρC<Cw, then let x = gBest, select y randomly from {1, 2, …, Nsol}, and go to STEP 8.
  8. STEP 7. Select two integers x and y randomly from {1, 2, …, Nsol}.
  9. STEP 8. Let cj,k = cj,k + ρ[0,1]∙(cx,k − cy,k), where ρ[0,1] is a random number generated from the uniform distribution on [0, 1]; then run the KHM procedure to update cj and calculate its fitness.
  10. STEP 9. If F(cj)>F*, then let cj = c* and F(cj) = F*, and go to STEP 11.
  11. STEP 10. If F(cj)<F(cgBest), then let gBest = j.
  12. STEP 11. If the runtime is less than the predefined T, then go to STEP 12; otherwise, cgBest is the final solution, and halt.
  13. STEP 12. If j<Nsol, let j = j+1, and go to STEP 3; otherwise, go to STEP 2.

In the above, STEP 0 simply runs the KHM procedure for each randomly generated solution to calculate its fitness function and update the solution. STEP 1 finds the first gBest from these initial populations after using the KHM procedure. STEPs 2–12 implement the proposed one-variable difference update mechanism; STEPs 9 and 10 are based on the survival-of-the-fittest policy to decide whether to accept the updated solution or replace gBest. Note that the stopping criterion in STEP 11 is the runtime T, and T = 0.1, 0.3, and 0.5 CPU seconds in the experiments tested in Section 4.
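To make the loop concrete, the following sketch applies STEPs 3 through 10 to one solution j. It is illustrative only: each variable is treated as a single floating-point value, khm_refine is a hypothetical placeholder for the KHM procedure of Section 2.2 (assumed to return the refined solution and its fitness), and the difference move in STEP 8 follows the reading given above rather than the authors' exact Eq (8).

```python
# Sketch of STEPs 3-10 for one solution j (not the authors' implementation).
import random

def isso_update_one_solution(c, fitness, j, g_best, khm_refine, Cg=0.4, Cw=0.6):
    """Update solution j in place and return the (possibly new) gBest index."""
    n_sol, K = len(c), len(c[0])
    k = random.randrange(K)                      # STEP 3: pick one variable
    c_star, f_star = list(c[j]), fitness[j]      # remember the old solution
    rho_c = random.random()                      # STEP 4
    if rho_c < Cg:                               # STEP 5: gBest and the solution itself
        x, y = g_best, j
    elif rho_c < Cw:                             # STEP 6: gBest and a random solution
        x, y = g_best, random.randrange(n_sol)
    else:                                        # STEP 7: two random solutions
        x, y = random.randrange(n_sol), random.randrange(n_sol)
    c[j][k] += random.random() * (c[x][k] - c[y][k])   # STEP 8 (assumed difference move)
    c[j], fitness[j] = khm_refine(c[j])          # refine and evaluate with the KHM procedure
    if fitness[j] > f_star:                      # STEP 9: survival of the fittest
        c[j], fitness[j] = c_star, f_star
    elif fitness[j] < fitness[g_best]:           # STEP 10: replace gBest if improved
        g_best = j
    return g_best
```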

4. Experimental Results

In this section, we present the computational results of the comparisons between the proposed algorithm and the existing algorithms on eight benchmark datasets to test the performance of iSSO-KHM.

4.1 The Experimental Setting

To evaluate the efficiency and effectiveness (i.e., the solution quality) of the proposed iSSO-KHM, eight benchmarks adopted from UCI are tested: Abalone (denoted by A, 4177 records and seven features), Breast-Cancer-Wisconsin (denoted by B, 699 records and nine features), Car (denoted by C, 1728 records and six features), Glass (denoted by G, 214 records and nine features), Iris (denoted by I, 150 records and four features), Segmentation (denoted by S, 2310 records and 19 features), Wine (denoted by W, 178 records and 13 features), and Yeast (denoted by Y, 1484 records and eight features).

Moreover, iSSO-KHM is compared to four KHM-related soft computing algorithms: CGS_KHM, MLS_KHM, PSO_KHM, and SSO_KHM. Note that CGS_KHM has better performance than tabu search and VNS for the Iris, Glass and Wine datasets.

The programming language used was C++ with default options for all five algorithms: CGS_KHM (denoted by CGS), iSSO-KHM (denoted by iSSO), MLS_KHM (denoted by MLS), PSO_KHM (denoted by PSO), and SSO_KHM (denoted by SSO). All codes were run on a 64-bit Windows 10 operating system with an Intel Core i7-5960X 3.00 GHz CPU and 16 GB of RAM.

In the experiments, all values of K are set to three; the power of the Manhattan distance is p = 1.5, 2.0, and 2.5; and the runtime limit (i.e., the stopping criterion) is T = 0.1, 0.3, and 0.5 CPU seconds. For each test and algorithm, the number of solutions is 15, i.e., Nsol = 15, the number of independent runs is 55, and only the best 50 results are recorded to remove possible outliers.

All required parameters for CGS, MLS, PSO, and SSO are taken directly from [15], [13], [11], and [20] for a fair comparison; two parameters, Cg = 0.4 and Cw = 0.6, are used in the proposed iSSO-KHM.

In all tables listed in S1 Appendix and in the following two subsections, the notations Favg, Fmin, Fmax, and Fstd denote the average, minimum (the best), maximum (the worst), and standard deviation of the fitness values obtained by the related algorithms. Additionally, the notations favg, fmin, fmax, and fstd represent the number of times Favg, Fmin, Fmax, and Fstd are the best among all algorithms under the same conditions, e.g., p, T, and/or dataset.

To compare the efficiency of the update mechanism of the proposed iSSO, the average number of fitness calculations (Navg) and the number of times Navg is the best (navg) are also recorded. Note that for a fixed T, a higher Navg means that the related update mechanism is more efficient, which increases the chance of finding an optimal solution.

To evaluate the clustering quality properly, the Fmeasure value is also provided, and the number of times Fmeasure [35,36] is the best is represented by fmea. The Fmeasure is one of the standard clustering validity measures and is based on the ideas of precision and recall from information retrieval [35,36]. Evidently, the larger the Fmeasure value is, the higher the quality of the clustering is.
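For reference, the sketch below computes the clustering Fmeasure in its usual form [35,36]: for each true class, the best F value over all clusters is taken and weighted by the class size. This is a generic sketch, not the evaluation code used in these experiments.

```python
# Standard clustering F-measure: weighted best F value per true class.
import numpy as np

def f_measure(true_labels, cluster_labels):
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = len(true_labels)
    total = 0.0
    for cls in np.unique(true_labels):
        in_cls = true_labels == cls
        n_i = in_cls.sum()
        best = 0.0
        for clu in np.unique(cluster_labels):
            in_clu = cluster_labels == clu
            n_ij = np.sum(in_cls & in_clu)
            if n_ij == 0:
                continue
            precision = n_ij / in_clu.sum()
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total   # larger values indicate better clustering

print(f_measure([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # 1.0 for a perfect match
```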

All experimental results are listed in S1 Appendix. S1 Appendix demonstrates that iSSO has achieved better solutions for each test problem with lower standard deviations and higher fitness computation numbers compared to the other methods.

4.2 General Observations for favg, fmin, fmax, fstd, and navg, and fmea

All results in S1 Appendix are ranked and discussed in this subsection. Tables 2–5 summarize these rankings based on different combinations of T and p, on T only, on p only, and on the algorithm only, respectively. The letter next to a number denotes the related dataset; e.g., B2S denotes one best value in dataset B and two best values in dataset S.

Table 3. The values of favg, fmin, fmax, fstd, navg, and fmea for T = 0.1, 0.3, and 0.5.

https://doi.org/10.1371/journal.pone.0164754.t003

Table 4. The values of favg, fmin, fmax, fstd, navg, and fmea for p = 1.5, 2.0, and 2.5.

https://doi.org/10.1371/journal.pone.0164754.t004

Table 5. The values of favg, fmin, fmax, fstd, navg, and fmea for algorithms.

https://doi.org/10.1371/journal.pone.0164754.t005

From Table 2, iSSO has higher values of favg, fmin, fmax, fstd, navg, and fmea than the other methods for different settings of T and p. Hence, iSSO is more efficient, effective, and robust than the other methods.

Table 3 summarizes the values of favg, fmin, fmax, fstd, navg, and fmea for T = 0.1, 0.3, and 0.5 separately. We can observe in Table 3 that the longer the runtime is, the better the solution quality obtained by iSSO. For example, fmin is increased from 18 to 23 from T = 0.1 to T = 0.2. PSO is the second best in fmea for both T = 0.1 and 0.3; SSO is the second best in fmin for T = 0.1 and 0.2, and in fmea for T = 0.3. Additionally, as seen from Table 3, iSSO's advantage over the other methods tends to grow with the runtime; e.g., there are six cases in which Fmin is better than that of iSSO for T = 0.1 but none for T = 0.3.

Table 4 sums up the values of favg, fmin, fmax, fstd, navg, and fmea for p = 1.5, 2.0, and 2.5 separately. It is evident that iSSO is still the best method compared to the others in all aspects. According to published results, other methods work more effectively when p = 2.0 [7–16]. However, given these results, iSSO retains its performance regardless of the value of p. For example, fmin is 21 for p = 1.5 and 22 for both p = 2.0 and 2.5. As in Table 3, some interesting observations remain: in general, CGS yields better results for the S dataset than for the other datasets, and PSO and SSO are the runners-up in fmea for p = 2.0 and p = 2.5, respectively.

Table 5 lists the overall values of favg, fmin, fmax, fstd, navg, and fmea for CGS, iSSO, MLS, PSO, and SSO separately. In general, iSSO is only slightly more powerful within dataset A, where fmin = 4 and fmea = 5 for SSO, and within dataset S, where fmin = 2 for MLS, fmea = 2 for PSO, and favg = 5 and fmax = fstd = 6 for CGS. This is similar to what is observed in Tables 2 and 3. However, the numbers of best values for iSSO in all statistical indexes are still more than 6.2 times those of the other methods. For example, for fstd, CGS produced nine best values (3 in the B dataset and 6 in the S dataset), whereas iSSO produced 64 best values. This trend is also found when iSSO is compared across all algorithms and thus demonstrates that iSSO outperforms the other algorithms in almost all aspects.

4.3 General Observations for Favg, Fmin, Fmax, Fstd, Navg, and Fmeasure

In general, each result obtained using the proposed iSSO is better than those obtained using the other methods, as described in S1 Appendix and Section 4.2. For a more detailed analysis, the top five values of Fmin for each dataset under all settings of T and p are summarized in Table 6 and discussed in this subsection.

In Table 6, the proposed iSSO has the largest number (33) of results among the top five values, and SSO, PSO, and CGS have four, two, and one results among the top five values of Fmin, respectively. Note that in most cases SSO yields better results than CGS and MLS, as seen in Table 6, but all of the values of favg, fmin, fmax, fstd are zero for SSO in Section 4.2.

Additionally, we can see that the top four Fmin values are all obtained by the proposed iSSO for all datasets, except that iSSO only has the top two Fmin values in the A and S datasets, for which SSO has the 3rd and 4th best Fmin and PSO has the 5th best Fmin. It seems that the algorithm with the best Fmin also has the best Favg, Fmax, Fstd, and Navg in all datasets. However, the algorithm with the best Fmin does not guarantee that its Fmeasure is also the best; this is the case for the A, B, C, G, I, and W datasets.

The following are some other observations for p, T, Navg, and Fstd:

  1. p: There are 14, 11, and 15 top-five Fmin values across all eight datasets for p = 1.5, 2.0, and 2.5 in Table 6, respectively. This contradicts the published literature [7–16], which indicates that p = 2.0 yields the best results. The same observation is also found in Table 4 of Section 4.2.
  2. T: T = 0.1, 0.3, and 0.5 have 15, 16, and 9 top-five Fmin values, respectively, across all eight datasets. Among the top-five Fmin values for each dataset, the best Fmin is obtained with a relatively large T, e.g., T = 0.3 in the G, I, and W datasets and T = 0.5 in the rest of the datasets. Hence, this result agrees with a basic expectation in soft computing: more runtime results in better solution quality.
  3. Navg: The order of the best Navg for each dataset from large to small is 9494.02 (I) > 5099.84 (G) > 5095.64 (W) > 1685.48 (B) > 769.6 (Y) > 722.9 (C) > 412.40 (S) > 254.26 (A), where the letter inside the parentheses denotes the related dataset. This order is exactly the reverse of the order of the number of records in each dataset, A (4177) > S (2310) > C (1728) > Y (1484) > B (699) > G (214) > W (178) > I (150), except that 5099.84 (G) > 5095.64 (W) in Navg. Hence, the smaller the dataset is, the less time each fitness calculation takes and the larger the number of fitness calculations is within the runtime limit.
  4. Fstd: The order of the best Fstd for each dataset from small to large is 2.04E-11 (I) < 4.70E-09 (W) < 7.59E-09 (G) < 1.12E-06 (B) < 8.33E-06 (C) < 9.49E-04 (Y) < 3.92E-03 (A) < 1.51E+00 (S), where the letter inside the parentheses denotes the related dataset. This order of datasets is similar to that of Navg because the more fitness calculations are performed, the lower the standard deviation is.

5. Conclusions

In this work, a new soft computing method called iSSO-KHM is proposed to solve the KHM clustering problem. The proposed iSSO-KHM adapts the fundamental concepts of both the traditional SSO and KHM by adding the novel one-variable difference update mechanism to update solutions and the survival-of-the-fittest policy to decide whether to accept the newly updated solutions.

The computational experiments compare the proposed iSSO-KHM with CGS, MLS, PSO, and SSO on eight benchmark datasets: Abalone, Breast-Cancer-Wisconsin, Car, Glass, Iris, Segmentation, Wine, and Yeast with settings of K = 3; p = 1.5, 2.0, and 2.5; and T = 0.1, 0.3, and 0.5.

The experimental results show the superiority of iSSO-KHM over the other four algorithms for almost all eight benchmark datasets. Hence, iSSO-KHM achieves a trade-off between exploration and exploitation and generates good approximations within a limited computation time systematically, efficiently, effectively, and robustly.

However, as the experiments in Section 4 show, an improved Fmin value does not mean that the Fmeasure is also improved. Therefore, a potential area of exploration would be to include the Fmeasure in the fitness function to improve both Fmin and Fmeasure. Another limitation of the proposed algorithm is that Cg and Cw in Eq 8 of the proposed update mechanism must be specified in advance; this raises the practical problem of developing a parameter-free version of the proposed algorithm in the future.

As several swarm-based clustering algorithms have been proposed recently, more comparisons between the proposed algorithm and other well-known swarm-based clustering algorithms are needed in the future. In Section 4, the parameter K is fixed to 3; the proposed approach should also be compared with other versions of KHM for different values of K (as was done for the p and T parameters).

Acknowledgments

I wish to thank the anonymous editor and the reviewers for their constructive comments and recommendations, which have significantly improved the presentation of this paper. This research was supported in part by the National Science Council of Taiwan, R.O.C. under grant NSC101-2221- E-007-079- MY3 and NSC 102-2221-E-007-086-MY3.

Author Contributions

  1. Conceptualization: W-CY.
  2. Data curation: W-CY Y-FC.
  3. Formal analysis: W-CY.
  4. Funding acquisition: W-CY.
  5. Investigation: W-CY.
  6. Methodology: W-CY YJ.
  7. Project administration: W-CY YJ.
  8. Resources: W-CY ZC.
  9. Software: W-CY Y-FC.
  10. Supervision: W-CY YJ.
  11. Validation: W-CY ZC.
  12. Visualization: W-CY.
  13. Writing – original draft: W-CY.
  14. Writing – review & editing: W-CY.

References

  1. Anderberg M.R., Cluster Analysis for Application, Academic Press, New York, 1973.
  2. Mirkin B., Clustering for Data Mining: A Data Recovery Approach, Taylor and Francis Group, FL, 2005.
  3. Jain A.K., Murty M.N., Flynn P. J., Data clustering: A review, ACM Computing Surveys 31(3) (1999) 264–323.
  4. Steinhaus H., Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci., 4 (1957) 801–804. MR 0090073. Zbl 0079.16403.
  5. Forgy E.W., Cluster analysis of multivariate data: efficiency versus interpretability of classifications, Biometrics 21 (3) (1965) 768–769.
  6. https://en.wikipedia.org/wiki/K-means_clustering.
  7. B. Zhang, M. Hsu, U. Dayal, K-harmonic means–a data clustering algorithm, Technical Report HPL-1999-124, Hewlett–Packard Laboratories, 1999.
  8. B. Zhang, Generalized k-harmonic means–boosting in unsupervised learning, Technical Report HPL-2000-137, Hewlett–Packard Laboratories, 2000.
  9. Gungor Z., Unler A., K-harmonic means data clustering with tabu-search method, Applied Mathematical Modelling 32 (2008) 1115–1125.
  10. Gungor Z., Unler A., K-harmonic means data clustering with simulated annealing heuristic, Applied Mathematics and Computation 184 (2007) 199–209.
  11. Yang F., Sun T., Zhang C., An efficient hybrid data clustering method based on K-harmonic means and Particle Swarm Optimization, Expert Systems with Applications 36 (2009) 9847–9852.
  12. Jiang H., Yi S., Li J., Yang F., Hu X., Ant clustering algorithm with K-harmonic means clustering, Expert Systems with Applications 37 (2010) 8679–8684.
  13. Alguwaizani A., Hansen P., Mladenovic N., Ngai E., Variable neighborhood search for harmonic means clustering, Applied Mathematical Modelling 35 (2011) 2688–2694.
  14. Yin M., Hu Y., Yang F., Li X., Gu W., A novel hybrid K-harmonic means and gravitational search algorithm approach for clustering, Expert Systems with Applications 38 (2011) 9319–9324.
  15. Hung C.H., Chiou H.M., Yang W.N., Candidate groups search for K-harmonic means data clustering, Applied Mathematical Modelling 37 (2013) 10123–10128.
  16. Yeh W. C., Lai C. M., Chang K. H., A novel hybrid clustering approach based on K-harmonic means using robust design, Neurocomputing 173 (2016) 1720–1732.
  17. W. C. Yeh, Study on quickest path networks with dependent components and apply to RAP, Report, NSC 97-2221-E-007-099-MY3, 2008–2011.
  18. Yeh W. C., A two-stage discrete particle swarm optimization for the problem of multiple multi-level redundancy allocation in series systems, Expert Systems with Applications 36 (2009) 9192–9200.
  19. Yeh W., Chang W. and Chung Y., A new hybrid approach for mining breast cancer pattern using discrete particle swarm optimization and statistical method, Expert Systems with Applications 36 (2009) 8204–8211.
  20. Yeh W. C., Simplified Swarm Optimization in Disassembly Sequencing Problems with Learning Effects, Computers & Operations Research 39 (2012) 2168–2177.
  21. Yeh W.C., Novel Swarm Optimization for Mining Classification Rules on Thyroid Gland Data, Information Sciences 197 (2012) 65–76.
  22. Chung Y. Y., Wahid N., A hybrid network intrusion detection system using simplified swarm optimization (SSO), Applied Soft Computing 12 (2012) 3014–3022.
  23. Yeh W. C., New Parameter-Free Simplified Swarm Optimization for Artificial Neural Network Training and Its Application in the Prediction of Time Series, IEEE Transactions on Neural Networks and Learning Systems 24 (2013) 661–665. pmid:24808385
  24. Azizipanah-Abarghooee R., A new hybrid bacterial foraging and simplified swarm optimization algorithm for practical optimal dynamic load dispatch, International Journal of Electrical Power & Energy Systems 49 (2013) 414–429.
  25. Yeh W. C., Orthogonal simplified swarm optimization for the series–parallel redundancy allocation problem with a mix of components, Knowledge-Based Systems 64 (2014) 1–12.
  26. Huang C.L., A particle-based simplified swarm optimization algorithm for reliability redundancy allocation problems, Reliability Engineering & System Safety 142 (2015) 221–230.
  27. Lee J. H., Yeh W. C., Chuang M. C., Web page classification based on a simplified swarm optimization, Applied Mathematics and Computation 270 (2015) 13–24.
  28. Yeh W.C., An improved simplified swarm optimization, Knowledge-Based Systems 82 (2015) 60–69.
  29. M. W. Ayech, D. Ziou, Terahertz image segmentation based on k-harmonic-means clustering and statistical feature extraction modeling, in: International Conference on Pattern Recognition, IEEE, Tsukuba, Japan, 2012, 222–225.
  30. Aghdasi T., Vahidi J., Motameni H. and Inallou M.M., K-harmonic means Data Clustering using Combination of Particle Swarm Optimization and Tabu Search, International Journal of Mechatronics, Electrical and Computer Technology 4(11) (2014) 485–501.
  31. Abshouri A.A., Bakhtiary A., A new clustering method based on firefly and KHM, Journal of Communication and Computer 9(4) (2012) 387–39.
  32. Zhou Z., Zhu S., Zhang D., A Novel K-harmonic Means Clustering Based on Enhanced Firefly Algorithm, Intelligence Science and Big Data Engineering. Big Data and Machine Learning Technique, Lecture Notes in Computer Science 9243 (2015) 140–149.
  33. Serapiao A.B.S., Correa G.S., Goncalves F.B., Carvalho V.O., Combining K-Means and K-Harmonic with Fish School Search Algorithm for data clustering task on graphics processing units, Applied Soft Computing 41 (2016) 290–304.
  34. Thakare A. D., Dhote C. A., Hanchate R. S., New Genetic Gravitational Search Approach for Data Clustering using K-Harmonic Means, International Journal of Computer Applications 99(13) (2014) 5–8.
  35. Handl J., Knowles J., and Dorigo M., On the performance of ant-based clustering, Design and Application of Hybrid Intelligent Systems, Frontiers in Artificial Intelligence and Applications 104 (2003) 204–213.
  36. A. Dalli, Adaptation of the F-measure to cluster-based Lexicon quality evaluation, in: EACL 2003, Budapest.