Accelerated Simplified Swarm Optimization with Exploitation Search Scheme for Data Clustering

Data clustering is commonly employed in many disciplines. The aim of clustering is to partition a set of data into clusters such that objects within the same cluster are similar to one another and dissimilar to objects that belong to different clusters. Over the past decade, evolutionary algorithms have been commonly used to solve clustering problems. This study presents a novel algorithm based on simplified swarm optimization, an emerging population-based stochastic optimization approach with the advantages of simplicity, efficiency, and flexibility. The approach combines variable vibrating search (VVS) and rapid centralized strategy (RCS) to deal with the clustering problem. VVS is an exploitation search scheme that refines the quality of solutions by searching the extreme points near the global best position. RCS accelerates the convergence rate of the algorithm by using the arithmetic average. To empirically evaluate the proposed algorithm, experiments are conducted on 12 benchmark datasets, and the results are compared with recent works. Statistical analysis indicates that the proposed algorithm is competitive in terms of solution quality.


Introduction
Cluster analysis is principally used to explore useful knowledge, particularly inherent structure, hidden in a dataset with scarce pre-existing information. This technique has been frequently applied to many complicated tasks, including pattern recognition, image analysis, facility location, and other fields of engineering [1][2]. Cluster analysis aims to categorize unlabeled data into different clusters on the basis of the similarity between data instances. Similarity is generally measured by a distance metric, and clustering can thus be treated as an optimization problem that requires an optimal assignment of objects to clusters by minimizing the sum of distances between each object and its cluster centroid [3].
K-means (KM) clustering, which uses distance metric to partition data into K clusters, is common and fundamental because of its simplicity and efficiency. However, the initial state may cause the algorithm to be trapped in local optima, thereby affecting the quality of the solution [4]. Recent studies have made significant progress in overcoming the drawback of KM clustering, particularly by using evolutionary algorithms, including genetic algorithm [5][6], tabu search approach [7], ant colony optimization [8][9], artificial bee colony algorithm [10], and particle swarm optimization [11][12][13]. Several algorithms inspired by physical phenomenon have also been employed. These algorithms include simulated annealing algorithm [14], big bang-big crunch optimization [15], gravitational search algorithm combined with KM (GSA-KM) [16], and black hole algorithm (BH) [3].
These approaches may take time to converge on clustering problems, particularly large ones, because the initial particles are randomly generated and the subsequent updates are probabilistic. To speed up the convergence rate, Krishna and Murty [5] employed an accelerated strategy called the KM operator (KMO), which merges the principles of KM into the clustering algorithm to efficiently find an effective solution. KMO uses the arithmetic average to determine new cluster centers in each generation. However, KMO may be disabled after a number of generations; in this case, the computation cost is wasted, and the diversity of the population is restricted by the arithmetic average. Some clustering algorithms have been combined with KM by using the result of KM as one of the initial solutions to speed up the convergence rate [9,13,16]. However, the outcome of KM heavily depends on the initial choice of cluster centers and may converge to a local optimum rather than the global optimum. As a result, these hybrid algorithms may start searching from a local optimum and obtain poor-quality solutions.
To overcome this problem, this study proposes an improved simplified swarm optimization (SSO) that combines variable vibrating search (VVS) and rapid centralized strategy (RCS) to solve clustering problems. VVS is an exploitation search scheme that searches near the global best position to refine the quality of the solution by using vibrated movement. It balances exploration and exploitation by introducing a function of time (t). RCS is modified from KMO and stochastically activated to reduce the computation cost and the loss of population diversity. To evaluate the proposed algorithm, 12 benchmark datasets are tested, and the performance is compared with state-of-the-art works. Encouraging results are found in terms of the efficiency and effectiveness of the proposed algorithm.
The remainder of this paper is organized as follows: Section 2 briefly describes the clustering problem and SSO. Section 3 introduces the proposed algorithm including VVS and RCS. Section 4 presents and discusses the computational results as well as the statistical analysis. Finally, conclusions are summarized in Section 5.

Related works

Clustering problem
The goal of clustering is to partition a given set of N objects Y = {Y_1, Y_2, ..., Y_N}, where each Y_i = {y_1, y_2, ..., y_D} ∈ R^D, into K groups, also called clusters, C = {(C_1, Z_1), (C_2, Z_2), ..., (C_K, Z_K)}, where C_k represents the kth cluster, Z_k represents the centroid of the kth cluster, and K ≤ N. The cluster structure satisfies the following conditions [2]:

1. C_k ≠ ∅ for k = 1, 2, ..., K;
2. C_1 ∪ C_2 ∪ ... ∪ C_K = Y;
3. C_i ∩ C_j = ∅ for i, j = 1, 2, ..., K and i ≠ j.

To split the objects into different clusters, many similarity criteria have been used. One of the most popular criteria is the Euclidean distance metric [3,7,9,11,16]. KM is an efficient and common clustering method that adopts this metric. The steps of this algorithm are as follows [17]:

Step 1: Randomly choose the K initial cluster centroids, Z = {Z_1, Z_2, ..., Z_K}.
Step 2: Assign each object Y_i to the cluster whose centroid is nearest.

Step 3: Evaluate the objective function, named the sum of intra-cluster distances (SICD), as follows:

SICD = Σ_{k=1}^{K} Σ_{Y_i ∈ C_k} ||Y_i − Z_k||    (4)

Step 4: Recalculate the new cluster centroids, as follows:

Z_new,k = (1 / |C_k|) Σ_{Y_i ∈ C_k} Y_i    (5)

Step 5: If Z_new,k = Z_k for all k, then halt. Otherwise, continue from Step 2.
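The five KM steps above can be sketched in Python (the paper's experiments use MATLAB; this is an illustrative sketch with assumed function names, and the empty-cluster handling is a common convention rather than something the paper specifies):

```python
import random

def euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def kmeans(Y, K, max_iter=100, seed=0):
    """Steps 1-5 of the KM procedure described above."""
    rng = random.Random(seed)
    Z = [list(y) for y in rng.sample(Y, K)]   # Step 1: random initial centroids
    sicd = 0.0
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest centroid
        clusters = [[] for _ in range(K)]
        for y in Y:
            k = min(range(K), key=lambda k: euclidean(y, Z[k]))
            clusters[k].append(y)
        # Step 3: SICD = sum of distances from objects to their cluster centroid
        sicd = sum(euclidean(y, Z[k]) for k in range(K) for y in clusters[k])
        # Step 4: recompute each centroid as the arithmetic mean of its cluster
        Z_new = [[sum(col) / len(c) for col in zip(*c)] if c else Z[k]
                 for k, c in enumerate(clusters)]
        # Step 5: halt when centroids no longer move
        if Z_new == Z:
            break
        Z = Z_new
    return Z, sicd
```

The local-optimum sensitivity discussed next is visible here: the result depends entirely on the random draw in Step 1.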
KM regards the clustering problem, which is an NP problem [18], as an optimization problem and aims at minimizing SICD, by assigning objects to the closest cluster centroid. KM is a well-known method in dealing with clustering problems because of its simplicity and efficiency. However, it suffers from the local optimum. This work introduces a novel SSO-based algorithm using the concept of KM to find optimal centroids by minimizing SICD without the initialization problem of KM.

Simplified swarm optimization
SSO is a population-based algorithm proposed by Yeh in 2009 [19] to compensate for the deficiencies of PSO in solving discrete problems. This algorithm has recently been applied in many research areas because of its simplicity, efficiency, and flexibility [20][21][22].
In SSO, each individual in the swarm, called a particle and representing a solution, is encoded as a finite-length string with a fitness value. Similar to many population-based algorithms, SSO improves the solution of a specified problem through the update mechanism (UM), which is the core of any evolutionary algorithm scheme. The UM of SSO is as follows:

x_ij^t = x_ij^{t−1} if ρ ∈ [0, C_w); p_ij if ρ ∈ [C_w, C_p); g_j if ρ ∈ [C_p, C_g); x if ρ ∈ [C_g, 1]

where x_ij^t is the position value in the ith particle with respect to the jth variable of the solution space at generation t. p_i = (p_i1, p_i2, ..., p_id), where d is the total number of variables in the problem domain, represents the best solution with the best fitness value in the particle's own history, known as pBest. The best solution with the best fitness value among all solutions is called gBest, which is denoted by g = (g_1, g_2, ..., g_d), and g_j denotes the jth variable in gBest. x is a new randomly generated value between the lower bound and the upper bound of the jth variable. ρ is a uniform random number between 0 and 1. C_w, C_p, and C_g are three predetermined parameters that form four interval probabilities. Thus, c_w = C_w, c_p = C_p − C_w, c_g = C_g − C_p, and c_r = 1 − C_g represent the probabilities that the new variable comes from each of four sources, namely, the current solution, pBest, gBest, and a random movement, respectively. The UM updates each particle as a compromise of these four sources; in particular, the random movement, which differs from the original PSO, maintains population diversity and enhances the capacity to escape from a local optimum.
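The UM can be sketched as a single per-variable draw (a minimal sketch; variable names are assumed, and bounds are passed explicitly since they are problem-specific):

```python
import random

def sso_update(x, p, g, Cw, Cp, Cg, lb, ub, rng=random):
    """One SSO update of particle x: each variable is taken from the current
    solution, pBest, gBest, or a fresh random value, according to the interval
    probabilities c_w = Cw, c_p = Cp - Cw, c_g = Cg - Cp, c_r = 1 - Cg."""
    new = []
    for j in range(len(x)):
        rho = rng.random()
        if rho < Cw:
            new.append(x[j])                                    # keep current value
        elif rho < Cp:
            new.append(p[j])                                    # copy from pBest
        elif rho < Cg:
            new.append(g[j])                                    # copy from gBest
        else:
            new.append(lb[j] + rng.random() * (ub[j] - lb[j]))  # random movement
    return new
```

Setting C_g < 1 guarantees a nonzero chance of the random movement, which is the diversity-preserving mechanism the paragraph above describes.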

Proposed methods
The proposed algorithm is based on the original SSO, and combines with variable vibrating search (VVS) and rapid centralized strategy (RCS). This section introduces VVS, RCS and overall procedure for the proposed algorithm to solve the clustering problem.

SSO clustering algorithm
Similar to many population-based algorithms, the SSO clustering algorithm randomly generates a population of particles, also called solutions. Encoding the solution is the critical first step in constructing a clustering algorithm. Assume that a clustering problem with D features is partitioned into K clusters; then Z = {Z_1, Z_2, ..., Z_K} represents the centroid vector and Z_k = {z_k1, z_k2, ..., z_kD}, where k = 1, 2, ..., K. Therefore, each solution string can be defined as X = {x_1, x_2, ..., x_{K×D}}, as illustrated in Fig 1. The primary steps of the SSO clustering algorithm are summarized as follows:

Step 1: Generate a population of particles that represent the centroids of each cluster, with random positions based on the given dataset.
Step 2: Evaluate the fitness value of each particle in the population according to Eq (4).

Step 3: Update each particle according to the UM.

Step 4: Update the pBest of each particle and the gBest of the population if better solutions are found.
Step 5: Stop the algorithm if the maximum number of iterations is met; otherwise, return to step 2.
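The encoding in Step 1 and the fitness evaluation in Step 2 can be sketched as follows (an illustrative sketch: the flat solution string X concatenates the K centroid vectors as in Fig 1, and the helper names are assumed):

```python
def decode(X, K, D):
    """Split a flat solution string X of length K*D into K centroid vectors."""
    return [X[k * D:(k + 1) * D] for k in range(K)]

def fitness(X, Y, K, D):
    """SICD of the clustering induced by the encoded centroids: each object
    is assigned to its nearest centroid and the distances are summed (Eq 4)."""
    Z = decode(X, K, D)
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    return sum(min(dist(y, z) for z in Z) for y in Y)
```

Because assignment is implicit in the fitness (nearest centroid wins), a particle only needs to carry the K×D centroid coordinates, not the cluster memberships.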

Variable vibrating search (VVS)
The UM makes SSO an algorithm with the advantages of simplicity and flexibility. However, the UM is a stochastic process with insufficient capability to find nearby extreme points, which can affect the efficiency and robustness of SSO, particularly on continuous problems. One way to improve the performance of SSO is to add a local search operation [23]. This modified SSO, called SSO-ELS, uses an exchange local search scheme to find a new pBest of a particle, or a new gBest, by exchanging attributes with two randomly chosen particles in the population. However, it consumes more time than the original SSO.
To overcome this exploitation problem of SSO, this work proposes an exploitation search scheme, called VVS, rather than a local search. It refines the quality of the solution by searching the extreme points near the global best position. A new variable in a solution after VVS is calculated as in Eq (7), where g_j denotes the jth variable in gBest, and Lb_j and Ub_j are the lower and upper bounds of the jth variable, respectively. The amplitude constant V is a function of time (t), given in Eq (8), where λ represents a random number uniformly distributed in [−1, 1], ν is a predetermined parameter that controls the amplitude of the variables, Niter is the total number of iterations, and iter is the number of the current iteration. As shown in Fig 2, as the number of iterations increases, V(t) oscillates like a vibration wave and decreases toward zero at the last iteration. Hence, the particles search extreme points around the gBest position.

The balance between exploration and exploitation is an important criterion that determines the performance of population-based optimization [24][25]. To address this, a function of time, c_r(t) in Eq (9), is introduced to replace c_r, which is a fixed constant in the original UM. This change leads the particles to stochastically explore the search space in the early iterations and gradually shift to exploiting the extreme points near the gBest position as the number of iterations increases.
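The exact forms of Eqs (7)-(9) are not reproduced in the text above, so the sketch below is an assumed illustration only: it vibrates each variable around gBest with a uniformly random λ in [−1, 1] scaled by a decaying envelope controlled by ν, matching the qualitative behavior described (oscillation that shrinks to zero at the last iteration) but not necessarily the paper's formula:

```python
import random

# NOTE: the amplitude schedule below is an assumption for illustration,
# not the paper's Eq (8); only its qualitative shape (decay to zero) is known.
def vvs(g, lb, ub, nu, it, n_iter, rng=random):
    """Vibrate each variable around gBest with an amplitude that decays to
    zero as the iteration count approaches n_iter (assumed decay shape)."""
    new = []
    for j in range(len(g)):
        lam = rng.uniform(-1.0, 1.0)             # lambda in [-1, 1]
        V = lam * (1.0 - it / n_iter) ** nu      # assumed decaying envelope
        xj = g[j] + V * (ub[j] - lb[j])          # search near gBest
        new.append(min(max(xj, lb[j]), ub[j]))   # clamp to the bounds
    return new
```

Whatever the exact envelope, the key property is the one the text states: early iterations take large excursions (exploration) and late iterations collapse onto gBest (exploitation).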
However, the modified UM is more complex and consumes more time than the original UM because of the additional VVS. According to the results of preliminary tests, shown in S1 Table, the pBest scheme can be discarded in the modified UM to maintain the simplicity and efficiency of the proposed algorithm without affecting its performance. Therefore, the UM of the proposed algorithm is modified as in Eq (11).

Rapid centralized strategy (RCS)

As mentioned in Section 1, the proposed algorithm may take more time to converge. Krishna and Murty [5] employed KMO to determine new cluster centers for all particles in each generation after the initial population procedure by using Eq (5). However, overusing the arithmetic average may disable KMO after a number of generations. In this case, particles may keep searching near the mean point rather than gBest, the computation cost is wasted, and the diversity of the population is restricted by the arithmetic average. Some clustering algorithms take advantage of combining with KM, using its result as one of the initial particles to speed up the convergence rate [9,13,16]. This strategy is named one-from-KM (OFK) in this work. However, these hybrid algorithms may start searching from a local optimum because of the initialization problem of KM.
In this work, a modified accelerated strategy, RCS, is proposed. RCS is inspired by KMO and can efficiently find better cluster centroids in the initial state while keeping the diversity of the population. The following two steps constitute RCS:

Step 1: In the initial state, the cluster centroids of all particles, after being generated randomly by the proposed algorithm, are recalculated according to Eq (5).
Step 2: In each iteration, only some particles are recalculated by RCS, depending on a random number.
Recalculating the centroids of all particles in the initial state escapes the initialization problem of KM through the diversity of particles and thus provides promising initial solutions. In Step 2, a random number, rand, in [0, 1] is generated for each particle after the particle is updated by the modified UM. If rand < β, then RCS is applied to recalculate the cluster centroids of the particle. β is a predetermined parameter in the interval [0, 1] that decides the proportion of particles whose centroids are recalculated.
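The two RCS steps can be sketched as follows (a sketch with assumed names: `recalc_centroids` applies Eq (5), i.e. one nearest-centroid assignment followed by the arithmetic mean of each induced cluster, and the empty-cluster fallback is a convention, not specified by the paper):

```python
import random

def recalc_centroids(X, Y, K, D):
    """Eq (5): assign objects to their nearest encoded centroid, then
    replace each centroid by the arithmetic mean of its cluster."""
    Z = [X[k * D:(k + 1) * D] for k in range(K)]
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    clusters = [[] for _ in range(K)]
    for y in Y:
        clusters[min(range(K), key=lambda k: dist(y, Z[k]))].append(y)
    new = []
    for k in range(K):
        if clusters[k]:
            new.extend(sum(col) / len(clusters[k]) for col in zip(*clusters[k]))
        else:
            new.extend(Z[k])   # keep an empty cluster's centroid unchanged
    return new

def rcs(population, Y, K, D, beta, initial, rng=random):
    """Step 1: recalculate every particle in the initial state.
    Step 2: afterwards, recalculate a particle only when rand < beta."""
    return [recalc_centroids(X, Y, K, D)
            if initial or rng.random() < beta else X
            for X in population]
```

With β = 0.1 (the setting used later), only about one particle in ten is recentered per iteration, which is how RCS keeps KMO's acceleration without its diversity loss.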
Based on the above description, the proposed algorithm is called VSSO-RCS, and its steps are shown in

Experiment Results and Discussion
The experiments are composed of two parts. In the first experiment, RCS is compared with other accelerated strategies, including KMO and OFK, to illustrate the effectiveness of the proposed accelerated strategy. In the second experiment, well-known and recently developed population-based algorithms are implemented to evaluate the performance of the proposed algorithm. These population-based algorithms include SSO [17], SSO-ELS [23], BH [3], KSRPSO [13], and GSA-KM [16]. All experiments are executed in MATLAB R2012b on a computer equipped with an Intel 2.4 GHz CPU and 12 GB of memory.
As mentioned in Section 2, this work uses SICD as a criterion to evaluate the performance of all clustering algorithms. SICD is the sum of the distances between objects in the same cluster, as defined in Eq (4). A smaller SICD results in a higher quality solution.

Data sets
According to the categorization of dataset size described by Kudo and Sklansky [26], problems can be divided into three categories in terms of the number of features: small with 0 < D ≤ 19, medium with 20 ≤ D ≤ 49, and large with D ≥ 50. Twelve datasets, taken from the UCI repository database [27], cover all categories and are used to test all approaches implemented in this work. The characteristics of these datasets are summarized in Table 1. These datasets have sizes ranging from hundreds to thousands of instances, and the feature size ranges from 3 to 60. All datasets contain numeric features with no missing data.

Parameter settings
The algorithmic parameters for all approaches are listed in Table 2. Parameter settings may influence the quality of results, and the settings of each approach follow the suggestions of previous studies. In KSRPSO, the cognition parameter c_1 and social parameter c_2 are set to 0.5 and 2.5, respectively. The inertia weight w is equal to 0.5×rand/2, where rand is a uniformly generated random number between 0 and 1. Three parameters, f, d, and a, are set to 0.2, 0.2, and 0.8, respectively, controlling the selective particle regeneration mechanism for local search [13]. The parameters C_w, C_p, and C_g in SSO, SSO-ELS, and SSO-RCS are all set to 0.1, 0.4, and 0.9, respectively [17,23]. In VSSO-RCS, C_w, C_g, ν, and β are set to 0.2, 0.9, 10, and 0.1, respectively. Based on preliminary tests, this parameter setting of VSSO-RCS provides a good chance of finding the globally optimal solution. Two parameters in GSA-KM, the initial gravitational constant G_0 and α, are set to 100 and 20, respectively [16,28]. BH has no parameters that need to be set [3]. For each run, the number of iterations and the population size of each approach depend on the number of features and clusters of the dataset: 10×C iterations and a population size of 3×C are used, where C = D×K [11,12].

As shown in Table 3, where the best average values are shown in bold, SSO without any accelerated strategy produces the worst solutions on all selected datasets in comparison with SSO-KMO, SSO-OFK, and SSO-RCS. This result confirms that the three accelerated strategies are functional and help SSO find better solutions. SSO-KMO consumes more time and obtains worse solutions than SSO-RCS. This result shows that RCS, modified from KMO, not only reduces the computational cost but also enhances the performance of SSO. As expected, SSO-OFK is faster than the other strategies. However, the solutions obtained by SSO-OFK are all worse than those obtained by SSO-RCS.
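The run-length rule above is simple enough to write out directly (a sketch; the function name is assumed):

```python
def run_sizes(D, K):
    """Per-run budget: iterations and population size both scale with
    C = D * K, the length of the encoded solution string."""
    C = D * K
    return 10 * C, 3 * C   # (number of iterations, population size)
```

For example, a dataset with D = 4 features clustered into K = 3 groups gets 120 iterations with a population of 36, so harder encodings automatically receive a larger search budget.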
The results reveal that SSO-RCS yields higher quality solutions than SSO-KMO and SSO-OFK on all selected datasets; RCS clearly outperforms the other strategies in this experiment. Fig 5 depicts the progress of the average gBest over 30 runs, providing insight into the convergence behavior of SSO-KMO, SSO-OFK, and SSO-RCS. For all datasets, SSO exhibits the worst convergence pattern. SSO-KMO gains no benefit from its initial solutions. Furthermore, it displays fast but premature convergence to a local optimum on the Glass, Ionosphere, and Sonar datasets, as shown in Fig 5(B), 5(C) and 5(D). This result is understandable because the global search capability depends on the diversity of the population. KMO recalculates the centroids of all particles at each iteration using the arithmetic average. After a number of iterations, the overuse of the arithmetic average may backfire and force a particle to become identical to its predecessor or neighbors. As a result, KMO may be disabled, and the population diversity diminished.

Results and discussion
OFK, which uses the result of KM as one of the initial particles, offers SSO-OFK promising initial solutions that let it converge faster than SSO-KMO at first. However, the initial solutions obtained from KM may suffer from local optima. In that case, gBest and pBest are initialized from a pitfall, which seriously affects the capability of global search, as shown in Fig 5(B), 5(C) and 5(D).
SSO-RCS demonstrates the most consistent performance among the considered accelerated strategies. RCS not only produces promising initial solutions but also helps SSO find higher quality solutions than SSO-KMO and SSO-OFK on all selected datasets. The results show that the proposed RCS is an ideal choice for accelerating convergence in clustering algorithms. This result means that the other SSO-based algorithms and KSRPSO are unable to achieve those values even once within 30 runs. Moreover, the standard deviations on the Iris, Cancer, CMC, WDBC, Ionosphere, and Sonar datasets are 0.00, which means that the proposed algorithm is more reliable than the other algorithms.
In terms of CPU time, the proposed algorithm is faster than the other SSO-based algorithms. This advantage is attributed to the modified UM of VSSO-RCS, which discards the pBest scheme. GSA-KM is time consuming because of its computational complexity. In general, BH and VSSO-RCS take less processing time than KSRPSO on small and medium datasets, whereas KSRPSO has the advantage on large datasets.
The statistical analysis. A nonparametric statistical analysis, the Friedman test, is conducted to confirm whether the proposed VSSO-RCS offers a significant improvement. If statistically significant differences exist among all algorithms, then the Holm's method is employed as a post hoc test to compare the proposed algorithm (control algorithm) and the other algorithms. The significance level is set to α = 0.05 to determine whether or not the hypothesis is rejected in all cases. These tests are detailed in [29]. Table 7 reports the average ranks computed through the Friedman test based on the average values of SICD. The table shows that the average rank of the proposed VSSO-RCS is the smallest among the algorithms. Therefore, the proposed method is the best performing one, followed by BH, SSO-RCS, GSA-KM, KSRPSO, SSO-ELS, and SSO, successively.
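The Friedman average ranks reported in Table 7 can be computed from the per-dataset SICD averages as follows (a self-contained sketch: each dataset ranks the algorithms from best to worst, ties share the mean of their ranks, and the ranks are averaged over datasets):

```python
def average_ranks(scores):
    """scores[d][a] = average SICD of algorithm a on dataset d (lower is
    better).  Returns the Friedman average rank of each algorithm."""
    n_alg = len(scores[0])
    totals = [0.0] * n_alg
    for row in scores:
        order = sorted(range(n_alg), key=lambda a: row[a])
        rank = [0.0] * n_alg
        i = 0
        while i < n_alg:
            j = i
            # extend over any run of tied values
            while j + 1 < n_alg and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2.0 + 1.0   # ranks are 1-based
            for t in range(i, j + 1):
                rank[order[t]] = mean_rank
            i = j + 1
        for a in range(n_alg):
            totals[a] += rank[a]
    return [t / len(scores) for t in totals]
```

The Friedman statistic and the Holm post hoc adjustment are then computed from these average ranks, as detailed in [29].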
The p-values computed through the Friedman test are given in Table 8, suggesting significant differences in the average values of SICD among the considered approaches.
To determine whether sufficient statistical differences exist between VSSO-RCS and the remaining algorithms, Holm's method is conducted as a post hoc test. Table 9 shows that all p-values are smaller than α = 0.05, which indicates that the control algorithm VSSO-RCS is statistically better than SSO, SSO-RCS, SSO-ELS, BH, KSRPSO, and GSA-KM in terms of SICD.
The same procedure is conducted to check whether significant differences in CPU time exist between the clustering algorithms. The results are shown in Tables 10-12. The Friedman test reveals that the proposed algorithm is ranked second behind BH and that statistically significant differences in average CPU time exist among the algorithms. Table 12 indicates that VSSO-RCS is more efficient than SSO-RCS, SSO-ELS, and GSA-KM. No significant difference is found among SSO, KSRPSO, BH, and VSSO-RCS. The empirical and statistical analyses thus show that VSSO-RCS offers no clear CPU-time advantage over SSO, BH, and KSRPSO; however, VSSO-RCS exhibits promising and effective clustering performance.
Empirical analysis of algorithm efficiency. The results from the previous subsection demonstrate that the proposed algorithm performs better than its competitors in terms of solution quality. It can also be observed that the problem size may affect the computation time. To better observe the effect of the numbers of data instances and features on the proposed method, two groups of artificial datasets are generated. As shown in Table 13, the first group varies the number of instances from 16000 to 2048000 at a fixed number of features (D = 6) to evaluate the effect of instance size. The second group varies the number of features from 400 to 51200 at a fixed number of instances (N = 1000) to observe the effect of feature size, as shown in Table 14. The number of clusters in both groups is fixed to K = 2. The proposed method performs 10 independent runs on each dataset with 100 iterations and a population size of 30. As reported in Tables 13 and 14, the results of the two analyses indicate that, in both cases, the ratio of CPU-time increase converges to 2 as the problem size doubles. Fig 6 and Fig 7 show log-log plots of CPU time versus instance size and versus feature size, respectively, both indicating that the computation time grows linearly and stably with instance and feature size.
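The "ratio converges to 2" observation is equivalent to a unit slope on the log-log plots: doubling the size doubles the time exactly when log(time) grows one-for-one with log(size). A small sketch of estimating that slope from (size, time) measurements (the numbers in the test are illustrative, not the paper's measured timings):

```python
import math

def growth_exponent(sizes, times):
    """Least-squares slope of log(time) vs log(size); a slope near 1 means
    the cost grows linearly, i.e. the CPU-time ratio converges to 2 when
    the problem size doubles."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
```

A slope of 2 would instead indicate quadratic growth (doubling the size quadruples the time), so this statistic directly tests the linearity claim.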

Conclusions
The proposed VSSO-RCS is a modified version of simplified swarm optimization that effectively and efficiently tackles clustering problems. VVS overcomes the exploitation problem in SSO and helps VSSO-RCS refine the quality of solutions. VSSO-RCS also converges quickly to an optimal solution by adopting RCS.
To assess the performance of VSSO-RCS, two experiments are conducted. First, RCS is compared with two other powerful accelerated strategies; RCS can not only obtain promising initial solutions but also converge more efficiently than KMO and OFK. Second, VSSO-RCS is compared with state-of-the-art population-based algorithms on 12 benchmark datasets. The results reveal that VSSO-RCS is superior to its competitors in terms of solution quality and is efficient in terms of the processing time required.
Our empirical analysis suggests that instance and feature size both affect the computational efficiency of the proposed algorithm. As the problem size grows, the computation time of the proposed method becomes more challenging, especially for very large problems (i.e., big data) [30]. In view of this, our future research will focus on the following:
1. Apply effective techniques [31][32][33][34][35] that can reduce the search space of the proposed algorithm to mitigate this issue.
2. Modify the proposed method in a parallel or distributed form [36][37][38] to improve the computation efficiency.
In addition, it is also worth exploring other potential applications, such as classification, to fully utilize VSSO-RCS.
Supporting Information

S1 Table.

Author Contributions
Conceived and designed the experiments: WCY CML. Performed the experiments: CML. Analyzed the data: CML. Contributed reagents/materials/analysis tools: WCY. Wrote the paper: CML.