A Novel Artificial Bee Colony Based Clustering Algorithm for Categorical Data

Data with categorical attributes are ubiquitous in the real world. However, existing partitional clustering algorithms for categorical data are prone to fall into local optima. To address this issue, in this paper we propose a novel clustering algorithm, ABC-K-Modes (Artificial Bee Colony clustering based on K-Modes), based on the traditional k-modes clustering algorithm and the artificial bee colony approach. In our approach, we first introduce a one-step k-modes procedure, and then integrate this procedure with the artificial bee colony approach to deal with categorical data. In the search process performed by scout bees, we adopt the multi-source search inspired by the idea of batch processing to accelerate the convergence of ABC-K-Modes. The performance of ABC-K-Modes is evaluated by a series of experiments in comparison with that of the other popular algorithms for categorical data.


Introduction
As an important technique in data mining, clustering analysis has been used in many fields [1,2], such as information retrieval [3], social media analysis [4], privacy preserving [5], image analysis [6], text analysis [7], and bioinformatics [8]. The aim of clustering is to group data objects with similar characteristics into the same cluster, and those with dissimilar characteristics into different clusters. Most existing clustering algorithms in the literature belong to one of two types: hierarchical and partitional. Hierarchical clustering algorithms organise a group of data objects into a dendrogram of nested partitions according to a divisive or agglomerative strategy [9], whereas partitional clustering algorithms partition a set of data objects into a pre-defined number of clusters by optimizing an objective cost function.
Center-based clustering algorithms are the most popular partitional clustering algorithms. The k-means algorithm is a widely used center-based partitional clustering algorithm due to its simplicity and high efficiency [10]. To account for the uncertainty of data objects, the fuzzy k-means algorithm [11] was also developed. The k-means algorithm and the fuzzy k-means algorithm can only deal with numeric data. However, categorical data are frequently encountered in real-world applications, especially in the emerging field of social media analysis. One example is clustering Twitter users based on profiles described by categorical attributes. For clustering categorical data, Huang extended these two classical algorithms and introduced the well-known k-modes algorithm and fuzzy k-modes algorithm [12][13][14]. However, one issue associated with the (fuzzy) k-means and (fuzzy) k-modes algorithms is that they may fall into local optima. To address this issue, many heuristic clustering algorithms, which adopt optimization procedures in the clustering process, have been proposed. By introducing genetic algorithms (GAs), the GA-based clustering approaches [15], including the genetic k-means algorithm [16], the fast genetic k-means algorithm [17], and the genetic k-modes algorithm [18], have been developed. Among these GA-based clustering algorithms, the genetic k-modes algorithm [18] is suitable for categorical data. In addition, the following heuristic clustering algorithms are used to cluster numeric data. Selim and Al-Sultan introduced a simulated annealing algorithm for the clustering problem [19]. Maulik and Mukhopadhyay introduced a novel fuzzy clustering approach by integrating the simulated annealing heuristic with artificial neural networks [20]. Sung and Jin presented a tabu search-based clustering approach by combining the packing and releasing procedures [21].
Over the last decade, a few approaches have been developed to model the intelligent foraging behavior of social animals, such as birds and ants, for optimization problems, and these approaches have been successfully applied to clustering. Shelokar, Jayaraman, and Kulkarni proposed an ant colony clustering algorithm which simulates the way real ants look for an optimal path from their nest to a food source [22]. Kao, Zahara, and Kao integrated the particle swarm optimization (PSO) approach, which mimics the way birds find the optimal food sources in search space, with the k-means procedure and Nelder-Mead simplex search method for improving the performance of clustering [23]. Unlike Kao's approach, Tunchan proposed a pure PSO approach for clustering [24]. Chuang, Hsiao, and Yang presented an accelerated chaotic map particle swarm optimization (ACPSO) for clustering by integrating the chaotic map particle swarm optimization (CPSO) with an accelerated convergence rate strategy [25]. Wan et al. introduced a clustering algorithm on the basis of the optimization property of bacterial foraging behavior [26].
In recent years, investigating the foraging behavior of honeybees, including the learning, memorising, and information sharing mechanism, has emerged as an interesting research direction in swarm intelligence [27]. Inspired by the foraging behavior of bee swarms in the real world, Lucic and Teodorović introduced the bee colony optimization heuristic [28], which has been used for solving various engineering and management problems. Karaboga and Basturk presented an artificial bee colony (ABC) algorithm [29] to deal with numerical optimization problems. By using the ABC optimization strategy, Karaboga and Ozturk proposed an artificial bee colony clustering approach [30]. Almost at the same time, Zhang, Ouyang and Ning also introduced an artificial bee colony clustering approach, in which Deb's rules were used to direct the search direction of each candidate food source [27]. However, most of these heuristic approaches are designed for numeric data, and therefore they are not suitable to deal with categorical data. Considering the ubiquity of categorical data in real-world applications, it is necessary to develop an ABC-based clustering algorithm for categorical data.
In this paper, we propose a novel artificial bee colony clustering approach for categorical data. In our approach, we first introduce the one-step k-modes procedure, and then integrate this procedure with the artificial bee colony heuristic to cluster categorical data. The time and space complexity of the proposed approach is analysed, and a comparison with the other popular approaches demonstrates the effectiveness of our approach.
The remainder of this paper is organised as follows: we first review some related work. This is followed by the presentation of our proposed method. Then, we report the experimental results, which demonstrate the advantages of the proposed method. Finally, we draw conclusions and explore future work.

Related Work
In this section, we first review the k-modes algorithm, and then describe the idea of artificial bee colony optimization.

The k-modes algorithm
The k-modes algorithm was first introduced by Huang in [31] for clustering categorical data. Let $X = \{x_1, x_2, \ldots, x_n\}$ denote a dataset consisting of $n$ data objects and $x_i$ ($1 \le i \le n$) be a data object characterised by $m$ categorical attributes $A_1, A_2, \ldots, A_m$. Each categorical attribute $A_j$ has a domain of values denoted by $Dom(A_j) = \{a_j^1, a_j^2, \ldots, a_j^t\}$, where $t$ is the number of categorical values for the attribute $A_j$. A data object $x_i$ is generally represented in the form of a vector $[x_{i1}, x_{i2}, \ldots, x_{im}]$. The aim of the k-modes algorithm is to divide a dataset $X$ into $k$ clusters by minimizing the following cost function:

$$F(U, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{il} \, dis(x_i, Q_l) \qquad (1)$$

Here $Q_l$ is the set of the most frequent values for each attribute in a cluster $l$, and it is called the mode of the cluster $l$; $u_{il}$ ($0 \le u_{il} \le 1$) is an element of the partition matrix $U_{n \times k}$; $k$ is the number of clusters, and $dis(x_i, Q_l)$ is the distance measure given below:

$$dis(x_i, Q_l) = \sum_{j=1}^{m} \alpha(x_{ij}, q_{lj}) \qquad (2)$$

In Eq (2), $\alpha(x_{ij}, q_{lj})$ is defined as:

$$\alpha(x_{ij}, q_{lj}) = \begin{cases} 0, & x_{ij} = q_{lj} \\ 1, & x_{ij} \ne q_{lj} \end{cases} \qquad (3)$$

where $q_{lj}$ is the most frequent value of the $j$th categorical attribute in the cluster $l$. The process of the k-modes algorithm is as follows:
Step 1. Randomly pick $k$ data objects from the dataset $X$ as the initial modes of clusters.
Step 2. Assign each data object in $X$ to the cluster whose mode is nearest to it in terms of Eq (2). After all data objects have been assigned to clusters, update the modes of all clusters.
Step 3. Re-evaluate the dissimilarity between the data objects and the current modes after all data objects have been assigned to clusters. If a data object's nearest mode belongs to a cluster other than its current one, reassign this data object to that cluster and update the modes of both clusters.
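The quantities used in Steps 1-3 can be sketched in a few lines of Python. This is a minimal illustration of the simple matching distance of Eq (2)-(3), the nearest-mode assignment, and the attribute-wise mode update, not the authors' Java implementation. The function names, the tie-breaking rule (lowest cluster index wins), and the handling of empty clusters (the old mode is kept) are assumptions made here for the example.

```python
from collections import Counter

def dissimilarity(x, q):
    """Simple matching distance (Eq 2-3): number of attributes where x and q differ."""
    return sum(1 for xj, qj in zip(x, q) if xj != qj)

def assign(data, modes):
    """For each object, return the index of its nearest mode (ties go to the lowest index)."""
    return [min(range(len(modes)), key=lambda l: dissimilarity(x, modes[l]))
            for x in data]

def update_modes(data, labels, modes):
    """Recompute each mode attribute-wise as the most frequent value in its cluster."""
    new_modes = []
    for l, old in enumerate(modes):
        members = [x for x, lab in zip(data, labels) if lab == l]
        if not members:
            new_modes.append(old)  # assumption: an empty cluster keeps its old mode
            continue
        new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                               for col in zip(*members)))
    return new_modes

def cost(data, labels, modes):
    """Cost of Eq (1) for a hard partition (u_il is 1 for the assigned cluster, else 0)."""
    return sum(dissimilarity(x, modes[lab]) for x, lab in zip(data, labels))
```

For example, with `data = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]` and `modes = [("a", "x"), ("b", "z")]`, `assign` places the first two objects in cluster 0 and the last two in cluster 1.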

The artificial bee colony algorithm
The artificial bee colony (ABC) algorithm proposed by Karaboga and Basturk [29] is well known for its simplicity and robustness in optimising numeric problems. In the ABC algorithm, the artificial bee swarm consists of three types of bees: employed bees, onlookers, and scouts. An employed bee takes a particular food source to exploit and shares the information about the food source with onlookers in the nest; a scout looks for a new food source in the search space; and an onlooker waits in the nest and finds a food source through the information shared by employed bees. The artificial bee colony has two parts: the first half are the employed bees and the second half are the onlookers. In the model of forage selection, three essential components (food sources, employed foragers, and unemployed foragers) and two modes of behavior (recruitment to a food source and abandonment of a food source) are given. The value of a food source is associated with many factors, such as its proximity to the nest, its nectar amount, and the ease of gathering this nectar. The unemployed foragers comprise two types of bees: scouts and onlookers. There is only one employed bee on a food source. Thus, the number of employed bees is equal to the number of food sources. Onlookers move onto a food source according to a probability-based selection strategy. When the nectar of a food source is exhausted, the corresponding employed bee becomes a scout. In the ABC algorithm, the exploitation and exploration processes are performed together. Specifically, the employed bees and onlookers implement the exploitation process, and the scouts execute the exploration process. The bee colony explores and exploits the food sources in a way that maximizes the nectar stored in the nest. For an optimisation problem, a food source represents a possible solution, the nectar amount of a food source measures the quality of the corresponding solution, and the goal is to obtain the optimal value of the objective function.
The procedure of the ABC algorithm is as follows:
Step 1. Initialize the population of food sources.
Step 2. Send the employed bees onto the food sources and evaluate the corresponding nectar amounts.
Step 3. Evaluate the probabilities of all food sources being chosen by the onlooker bees. The probability value of each food source is determined by its nectar amount (i.e., the quality of the corresponding solution): the bigger the nectar amount of the food source, the higher the probability value.
Step 4. Send the onlookers onto the food sources: each onlooker chooses its food source based on the probabilities calculated in Step 3, exploits its food source, evaluates the nectar amount of the obtained food source, and applies the greedy selection process.
Step 5. Terminate the exploitation process of an employed bee if its food source becomes exhausted; this employed bee becomes a scout bee.
Step 6. Send the scouts into the search space to find new food sources randomly.
Step 7. Memorise the best food source found so far.
Step 8. If the requirements are met, output the best food source; otherwise go to Step 2.
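Steps 3 and 4 can be illustrated with a roulette-wheel draw. The text only states that a larger nectar amount gives a higher probability; the sketch below assumes the common ABC choice of making the probability directly proportional to the nectar amount, and assumes nectar amounts are non-negative with a positive sum. The function names are hypothetical.

```python
import random

def selection_probabilities(nectar):
    """Probability of each food source, assumed proportional to its nectar amount."""
    total = sum(nectar)
    return [a / total for a in nectar]

def choose_food_source(nectar, rng=random):
    """Roulette-wheel draw: return the index of the food source picked by an onlooker."""
    probs = selection_probabilities(nectar)
    r = rng.random()          # uniform in [0, 1)
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1     # guard against floating-point round-off
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes the draws reproducible across runs.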

Our Proposed ABC Clustering Algorithm
In this section, we first describe our proposed ABC clustering approach, and then discuss the complexity and convergence of this approach.

The proposed approach
In this subsection, we propose a novel clustering algorithm on the basis of the artificial bee colony and the k-modes approach. As mentioned above, there are three types of artificial honeybees: employed bees, onlookers, and scouts. A food source corresponds to a possible solution of the problem to be optimised, and the nectar amount of a food source characterises the quality of the corresponding solution. In clustering, the results depend on the cluster centers: once the cluster centers are fixed, the clustering results are determined. Therefore, the clustering problem can be seen as the optimisation of the cluster centers, and a set of cluster centers corresponds to a possible solution. For categorical data clustering, a food source is thus a set of $k$ cluster modes, $f_i = \{Q_1, Q_2, \ldots, Q_k\}$, where the symbols have the same meaning as in Eq (1). The nectar amount of a food source $f_i$ then measures the quality of the clustering induced by these modes, judged by the cost function of Eq (1).
Similar to the ABC approach, the colony of artificial bees in our algorithm has two parts: the first half of the artificial bees are the employed bees, and the second half are the onlookers. There exists only one employed bee for a food source, and the number of employed bees is equal to the number of solutions in the population. Let $P_{fs} = \{f_1, f_2, \ldots, f_H\}$ denote the population of food sources, where $H$ is the number of food sources, and $f_i$ is the $i$th food source. The probability of the $i$th food source being picked by an onlooker is determined by its nectar amount relative to that of the other food sources in $P_{fs}$.
To derive a candidate food source from the current one in memory, we introduce the one-step k-modes procedure, called OKM, in our algorithm. The OKM procedure is essentially one iteration step in the search process of the k-modes algorithm, and it is used to search for a neighbor food source based on the current food source in the exploitation process performed by employed bees and onlookers. Let $f_i$ be the current food source; the OKM then consists of the following two steps:
1. Allocate each data object to the cluster with the nearest mode, and then form a partition matrix $U$; specifically, if the $i$th data object belongs to the $l$th cluster, $u_{il} = 1$; otherwise $u_{il} = 0$, where $u_{il}$ is one element of $U$.
2. Calculate the new modes on the basis of the partition matrix $U$, and thus form a candidate food source $f_i' = \{Q_1', Q_2', \ldots, Q_k'\}$.
For the colony of bees, an employed bee becomes a scout when its food source is exhausted. In our algorithm we adopt the parameter $L$, a predetermined number of trials, to control the abandonment of a food source. If a food source cannot be improved further through $L$ trials, this food source is abandoned, and the corresponding employed bee becomes a scout. Let the abandoned food source be $f_i$; the search operation of a scout finding a new food source is then given by:

$$f_i = Rand(Dom(X))$$

where $i \in \{1, 2, \ldots, H\}$, and $Rand(Dom(X))$ is the operation of randomly selecting $k$ data objects from the dataset $X$. In our algorithm, the multi-source search, which is inspired by the idea of batch processing [32], is adopted to accelerate the convergence of the proposed algorithm. The idea of the multi-source search is as follows: a scout bee searches $T$ candidate food sources at a time, and then picks the best one as the new food source.
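The two OKM steps and the multi-source scout search can be sketched as follows. This is an illustrative Python sketch under stated assumptions rather than the authors' implementation: candidate quality is compared here by the total distance of every object to its nearest mode (lower is better), an empty cluster keeps its previous mode, and all function names are hypothetical.

```python
import random
from collections import Counter

def dissimilarity(x, q):
    """Simple matching distance between an object and a mode."""
    return sum(1 for xj, qj in zip(x, q) if xj != qj)

def cost(data, modes):
    """Total distance from each object to its nearest mode (lower is better)."""
    return sum(min(dissimilarity(x, q) for q in modes) for x in data)

def okm(data, modes):
    """One-step k-modes: assign objects to nearest modes, then recompute the modes."""
    clusters = [[] for _ in modes]
    for x in data:
        nearest = min(range(len(modes)), key=lambda l: dissimilarity(x, modes[l]))
        clusters[nearest].append(x)
    candidate = []
    for members, old in zip(clusters, modes):
        if members:
            candidate.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members)))
        else:
            candidate.append(old)  # assumption: empty cluster keeps its previous mode
    return candidate

def scout_search(data, k, T, rng):
    """Multi-source search: draw T random food sources, keep the one with lowest cost."""
    candidates = [rng.sample(data, k) for _ in range(T)]
    return min(candidates, key=lambda modes: cost(data, modes))
```

A call such as `scout_search(data, k=2, T=5, rng=random.Random(0))` draws five random sets of two modes from a list `data` and returns the best of them, which is the batch idea described above.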
Having introduced the formulas for the relevant variables, the proposed ABC-K-Modes clustering algorithm for categorical data is given as follows:
Input: The size of the bee colony $N$, the maximum cycle number MCN, the number of clusters $k$, and $L$.
Output: The best food source.
1. Initialise the population of food sources $P_{fs} = \{f_1, f_2, \ldots, f_H\}$ randomly; specifically, for each food source, select $k$ data objects randomly from the dataset $X$ as the modes of clusters; set the exploitation numbers of food sources $En_1 = 0, En_2 = 0, \ldots, En_H = 0$.
10. If CN = MCN, terminate the algorithm and output the best food source; otherwise go to step 4.

Complexity analysis
In this subsection, we discuss the complexity of the proposed ABC-K-Modes algorithm. Here $N$ is the size of the population, and $S$ is the maximal number of generations. Generally, when $H, m, k \ll n$, the complexity of our algorithm is higher than that of the k-modes algorithm and lower than that of the genetic k-modes algorithm.

Convergence analysis
In this subsection, we discuss the convergence of the proposed approach. In our approach, the exploration and exploitation are both executed by ABC. For a categorical dataset, the number of different values for an attribute is finite, and the number of attributes is finite as well. A candidate solution is a set of cluster centers, and a cluster center is a set of attribute values. Therefore, the number of candidate solutions is finite. Specifically, the number of candidate solutions is at most $\left(\prod_{i=1}^{m} |A_i|\right)^{k}$, where $m$ is the number of attributes and $k$ is the number of clusters. Here, $|A_i|$ is the number of different categorical values for the attribute $A_i$. In the process of exploration or exploitation, the current solution is replaced by a new solution if the new one is better. Thus each possible solution appears at most once in the current solution list. If the value of MCN (maximum number of iterations) is large enough, the global optimal solution is very likely to be found; otherwise, the algorithm converges to a local optimum. In other words, the larger the value of MCN, the greater the probability that ABC-K-Modes converges. As MCN tends to infinity, the probability of convergence for our proposed approach approaches 100%. Therefore the convergence of our algorithm to a global/local optimal solution is guaranteed as long as MCN is big enough. However, due to the different characteristics of the search spaces to be explored, a different value of MCN may be required for each dataset for the algorithm to converge.
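A small worked count may make the finiteness argument concrete. The toy sizes below are assumptions chosen for illustration, not values from the experiments: $m = 2$ attributes with $|A_1| = 3$ and $|A_2| = 2$ values, and $k = 2$ clusters. A cluster center fixes one value per attribute, so there are $\prod_i |A_i|$ possible centers, and listing $k$ centers gives an upper bound on the number of candidate solutions (it overcounts, since reordering the centers yields the same solution).

```latex
\[
\prod_{i=1}^{m} |A_i| = 3 \times 2 = 6,
\qquad
\Bigl(\prod_{i=1}^{m} |A_i|\Bigr)^{k} = 6^{2} = 36 .
\]
```

The bound grows quickly with $m$ and $k$, which is consistent with the remark that different datasets may need different values of MCN.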

Experimental Results and Discussion
In this section, to evaluate the performance of our proposed clustering algorithm ABC-K-Modes, we run it on six real-world categorical datasets: Zoo, Breast Cancer, Soybean, Lung Cancer, Mushroom, and Dermatology, all of which can be downloaded from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html). We adopt Yang's accuracy measure [33] and the Rand Index [34] to assess the obtained clustering results. In Yang's method, the definitions of accuracy (AC), precision (PR), and recall (RE) are given as follows:

$$AC = \frac{\sum_{i=1}^{k} a_i}{n}, \qquad PR = \frac{\sum_{i=1}^{k} \frac{a_i}{a_i + b_i}}{k}, \qquad RE = \frac{\sum_{i=1}^{k} \frac{a_i}{a_i + c_i}}{k}$$

where $a_i$ is the number of data objects that are correctly allocated to class $C_i$, $b_i$ is the number of data objects that are incorrectly allocated to class $C_i$, $c_i$ is the number of data objects that are incorrectly denied from class $C_i$, $k$ is the total number of classes contained in a dataset, and $n$ is the total number of data objects in a dataset. In the above measures, AC has the same meaning as the clustering accuracy $r$ defined in [12]. Given a dataset $X = \{x_1, x_2, \ldots, x_n\}$ and two partitions of this dataset, $Y = \{y_1, y_2, \ldots, y_{t_1}\}$ and $Y' = \{y_1', y_2', \ldots, y_{t_2}'\}$, the Rand Index (RI) [34] is the fraction of the $\binom{n}{2}$ pairs of data objects on which the two partitions agree, that is, pairs placed in the same cluster in both partitions or in different clusters in both partitions. The RI is calculated by using the true clustering and the clustering obtained from a clustering algorithm. According to these measures, higher values of AC, PR, RE, and RI indicate a better clustering result. In the performance analysis, we run our proposed ABC-K-Modes algorithm, the k-modes algorithm, the fuzzy k-modes algorithm, and the genetic k-modes approach on the six datasets, with twenty trials per dataset. We then compare the clustering results of the proposed ABC-K-Modes algorithm with those of the other three well-known algorithms in terms of the best (Best), average (Avg.), and standard deviation (Std.) of AC, PR, RE, and RI, respectively.
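The four measures can be computed with a short Python sketch. It assumes the usual forms of Yang's measures, namely AC as the sum of the $a_i$ divided by $n$, and PR and RE as the class-wise averages of $a_i/(a_i+b_i)$ and $a_i/(a_i+c_i)$, and computes the Rand Index by counting pairs on which two labelings agree. The function names are hypothetical, and treating a ratio with a zero denominator as 0 is an assumption made here, not something stated in the text.

```python
from itertools import combinations

def _ratio(num, den):
    """Safe division; a zero denominator is treated as 0.0 (assumption)."""
    return num / den if den else 0.0

def yang_measures(a, b, c, n):
    """AC, PR, RE from per-class counts: a_i correct, b_i wrongly included, c_i wrongly denied."""
    k = len(a)
    ac = sum(a) / n
    pr = sum(_ratio(ai, ai + bi) for ai, bi in zip(a, b)) / k
    re = sum(_ratio(ai, ai + ci) for ai, ci in zip(a, c)) / k
    return ac, pr, re

def rand_index(labels_true, labels_pred):
    """Fraction of object pairs on which the two labelings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
        total += 1
    return agree / total
```

Because the Rand Index only compares pairs, it is unaffected by how cluster labels are numbered, so `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` scores the same as a labeling identical to the truth.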
All algorithms are implemented in Java and executed on a computer with an Intel(R) Core(TM) i7 3.4 GHz processor and 8 GB of RAM. In all experiments, the parameters of the proposed ABC-K-Modes algorithm are set as follows: N = 20 and MCN = 1000, which are the typical values used in the original ABC algorithm [30]; L = 5 and T = 5 are set by rule of thumb. The cluster number k in all four algorithms is set according to the number of classes provided by the class information of the dataset. Apart from the number of classes, no other class information is used in the clustering process. The other parameters of the k-modes algorithm, the fuzzy k-modes algorithm, and the genetic k-modes algorithm are set as stated in their original papers.
The Zoo dataset consists of 101 data objects, each of which has 17 Boolean-valued attributes. According to the class attribute, each data object belongs to one of seven classes. Tables 1-4 list the comparison of the clustering results of ABC-K-Modes, k-modes, fuzzy k-modes, and genetic k-modes on the Zoo dataset according to AC, PR, RE, and RI, respectively. The Breast Cancer dataset contains 699 data objects, each of which is described by 10 categorical attributes. According to the class attribute, each data object belongs to one of two classes: Benign and Malignant. Tables 5-8 summarise the comparison of the clustering results of ABC-K-Modes and the other three well-known algorithms on the Breast Cancer dataset according to AC, PR, RE, and RI, respectively.
The Soybean dataset is composed of 47 data objects, each of which has 36 categorical attributes. In terms of the class attribute, each data object belongs to one of four diseases: Diaporthe Stem Canker, Charcoal Rot, Rhizoctonia Root Rot, and Phytophthora Rot. Tables 9-12 list the comparison of the clustering results of ABC-K-Modes and the other three well-known algorithms on the Soybean dataset according to AC, PR, RE, and RI, respectively. The Lung Cancer dataset has 32 data objects, each of which is described by 57 categorical attributes. According to the class attribute, the dataset has three classes. Tables 13-16 summarise the comparison of the clustering results of the ABC-K-Modes algorithm and the other three well-known algorithms on the Lung Cancer dataset according to AC, PR, RE, and RI, respectively. The Mushroom dataset contains 8,124 data objects, each of which has 23 categorical attributes. According to the class attribute, each data object falls into one of two classes: edible and poisonous. Tables 17-20 list the comparison of the clustering results of ABC-K-Modes and the other three well-known algorithms on the Mushroom dataset according to AC, PR, RE, and RI, respectively.
The Dermatology dataset has 366 data objects, each of which is described by 34 categorical attributes. In terms of the class attribute, each data object belongs to one of six classes: psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. Tables 21-24 list the comparison of the clustering results of ABC-K-Modes and the other three well-known algorithms on the Dermatology dataset according to AC, PR, RE, and RI, respectively.
From Tables 1-24, we can see that our proposed ABC-K-Modes achieves higher Best and Avg. values and lower Std. values of AC, PR, RE, and RI in most cases, and therefore ABC-K-Modes in general outperforms the other three algorithms in terms of these measures. The success of ABC-K-Modes is due to its effective combination of global search (exploration) and local search (exploitation), which is achieved by adopting the OKM operator within the ABC optimisation framework. As a result, the proposed ABC-K-Modes can obtain optimal or near-optimal results. In Table 25, we list the average running time over twenty trials for the proposed ABC-K-Modes and the other three popular algorithms on the six datasets. The results in Table 25 show that the size and dimension of a dataset have a direct effect on the running time of these four algorithms. Specifically, the larger the size or dimension of the dataset, the more time these algorithms need to find a satisfactory solution. This is consistent with the analysis of time complexity in the complexity analysis section. Compared to the k-modes algorithm, ABC-K-Modes takes more time to execute due to the introduction of the ABC optimization strategy. However, we also notice that the running time difference between our ABC-K-Modes approach and the traditional k-modes approach decreases as the size and dimension of the dataset increase, which seems promising. For instance, on the Mushroom dataset, which contains the largest number of records, the running time of ABC-K-Modes is closer to that of k-modes than on any other dataset.
Finally, we will further explore the acceleration issue of the ABC-K-Modes in our future work.

Conclusions and Future Work
In real-world applications, data objects characterised by categorical attributes are frequently encountered. The k-modes type algorithms are well known for their high efficiency in clustering categorical data. However, it is acknowledged that this type of algorithm is prone to falling into local optima.
To address this issue, in this research we proposed a novel clustering algorithm, ABC-K-Modes, on the basis of the traditional k-modes algorithm and the ABC optimisation procedure. In our algorithm, the search process of employed bees and onlookers is implemented by a specific procedure named OKM, and the search process of scouts is performed by random exploration. To accelerate the convergence of ABC-K-Modes, we adopt the idea of multi-source search for the scout bees. Moreover, we analysed the time and space complexity of the proposed algorithm, and tested ABC-K-Modes on six real-world categorical datasets from the UCI Machine Learning Repository. The experimental results demonstrated that our proposed algorithm was superior to the other three well-known algorithms according to the evaluation measures AC, PR, RE, and RI.
In the near future, we will explore the acceleration issue of the ABC-K-Modes, and extend this approach to cluster mixed data containing both numeric and categorical attributes. We will investigate the potential of ABC-K-Modes when applied to social media data. Furthermore, we would also like to explore other swarm intelligent algorithms for clustering categorical data as well as mixed data.