Feature Selection via Chaotic Antlion Optimization

Background
Selecting a subset of relevant properties from a large set of features that describe a dataset is a challenging machine learning task. In biology, for instance, advances in the available technologies enable the generation of a very large number of biomarkers that describe the data. Choosing the most informative markers while also performing a high-accuracy classification over the data can be a daunting task, particularly if the data are high dimensional. An often adopted approach is to formulate the feature selection problem as a biobjective optimization problem, with the aim of maximizing the performance of the data analysis model (the quality of the fit to the training data) while minimizing the number of features used.

Results
We propose an optimization approach for the feature selection problem that considers a "chaotic" version of the antlion optimizer method, a nature-inspired algorithm that mimics the hunting mechanism of antlions in nature. The balance between exploration of the search space and exploitation of the best solutions is a challenge in multi-objective optimization. The exploration/exploitation rate is controlled by the parameter I that limits the random walk range of the ants/prey. This variable is increased iteratively in a quasi-linear manner to decrease the exploration rate as the optimization progresses. This quasi-linear schedule for I may lead to premature convergence in some cases and to trapping in local minima in others. The chaotic system proposed here attempts to improve the tradeoff between exploration and exploitation. The methodology is evaluated using different chaotic maps on a number of feature selection datasets. To ensure generality, we used ten biological datasets, but we also used other types of data from various sources. The results are compared with the particle swarm optimizer and with genetic algorithm variants for feature selection using a set of quality metrics.


Introduction
The large amounts of data generated today in biology offer more detailed and useful information, but they also make the analysis of these data more difficult because not all of the information is relevant. Selecting the relevant characteristics or attributes of the data is a complex problem. Feature selection (attribute reduction) is a technique for solving classification and regression problems, and it is employed to identify a subset of the features and remove the redundant ones. This mechanism is particularly useful when the number of attributes is large and not all of them are required for describing the data and for further exploring the data attributes in experiments. The basic assumption for employing feature selection is that a large number of features does not necessarily translate into high classification accuracy for many pattern classification problems [1]. Ideally, the selected feature subset will improve the classifier performance and provide a faster and more cost-effective classification, with comparable or even better classification or regression accuracy than using all the attributes [2]. In addition, feature selection improves the visualization and the comprehensibility of the induced concepts [3]. Using a tumor as a simple example, there are a large number of attributes that describe it: mitotic activity, tumor invasion, tumor shape and size, vascularization, and growth rate, to name just a few. All of these attributes require measurements and tests that are not always easy to perform. Thus, it would be ideal if the classification of a tumor as benign or malignant (and its stage) could be performed with fewer investigations. The selection of a subset of the features that are relevant enough to perform the classification would be of considerable benefit.
Many studies formulate the feature selection problem as a combinatorial optimization problem, in which the selected feature subset leads to the best data fitting [4]. In real world problems, feature selection is mandatory due to the abundance of noisy, irrelevant or misleading features [5]. These factors can have a negative impact on the classification performance during the learning and operation processes. Two main criteria are used to differentiate the feature selection methods:
1. Search strategy: the method employed to generate feature subsets or feature combinations.
2. Subset quality (fitness): the criteria used to judge the quality of a feature subset.
There are two main classes of feature selection methods: wrapper-based methods (apply machine learning algorithms) and filter-based methods (use statistical methods) [6].
The wrapper-based approach uses a machine learning technique as part of the evaluation function, which facilitates obtaining better results than the filter-based approach [7], but it has a risk of over-fitting the model and can be computationally expensive, and hence, a very intelligent search method is required to minimize the running time [8]. In contrast, the filter-based approach searches for a subset of features that optimize a given data-dependent criterion rather than classification-dependent criteria as in the wrapper methods [1].
In general, the feature selection problem is formulated as a multi-objective problem with two objectives: minimize the size of the selected feature set and maximize the classification accuracy. Typically, these two objectives are contradictory, and the optimal solution is a tradeoff between them.
The size of the search space exponentially increases with respect to the number of features in the dataset [8]. Therefore, an exhaustive search for obtaining the optimal solution is almost impossible in practice. A variety of search techniques have been employed, such as greedy search based on sequential forward selection (SFS) [9] and sequential backward selection (SBS) [10]. However, these feature selection approaches still suffer from stagnation in local optima and expensive computational time [11]. Evolutionary computing (EC) algorithms and other population-based algorithms adaptively search the feature space by employing a set of search agents that communicate in a social manner to reach a global solution [12]. Such methods include genetic algorithms (GAs) [13], particle swarm optimization (PSO) [14], and ant colony optimization (ACO) [3].
GAs and PSO are the most common population-based algorithms. GAs are inspired by the process of evolution via natural selection and survival of the fittest and have the ability to solve complex and non-linear problems; however, in many cases, if no additional mechanisms are employed, they can perform poorly and become trapped in local minima [15]. In PSO, each solution is considered as a particle that is defined by position, fitness, and a speed vector, which defines the moving direction of the particle [16].
The antlion optimization (ALO) algorithm [17] is a relatively recent algorithm that is computationally less expensive than other techniques. The chaotic optimization algorithm (COA) is a global optimization method whose main core contains two phases [18]. The first phase has four steps: 1. Produce a sequence of chaotic points; 2. Map the chaotic points to a sequence of design points in the design space; 3. Compute the fitness (objective function) values based on the design points; 4. Select the point that has the minimum fitness value as the current optimum point.
The second phase has two steps: 1. Assume that the current optimum point is located near the global optimum after a number of iterations; 2. Perform position alteration and search around the current optimum in the descent direction along with the axis directions.
These phases are repeated until a convergence (termination) criterion is met. Chaos is considered to be a deterministic dynamic process and is very responsive to its initial parameters and conditions. The nature of chaos is clearly random and unpredictable, but it also has an element of regularity [18].
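The four steps of the first COA phase can be illustrated with a minimal Python sketch. It is not the implementation of [18]: the logistic map as the chaotic generator, the per-dimension seeds, and the names `chaotic_first_phase`, `n_points`, and `x0` are assumptions made only for illustration.

```python
import numpy as np

def chaotic_first_phase(objective, lower, upper, n_points=200, x0=0.7, a=4.0):
    """First COA phase: chaotic points -> design points -> fitness -> best point."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    # Assumption: one chaotic state per dimension, started from slightly different seeds.
    state = (x0 + 0.013 * np.arange(dim)) % 1.0
    best_point, best_fitness = None, np.inf
    for _ in range(n_points):
        state = a * state * (1.0 - state)        # step 1: next chaotic point in [0, 1]
        point = lower + state * (upper - lower)  # step 2: map to the design space
        fitness = objective(point)               # step 3: evaluate the objective
        if fitness < best_fitness:               # step 4: keep the minimum so far
            best_point, best_fitness = point.copy(), fitness
    return best_point, best_fitness
```

The second phase (local search around the returned point) is omitted here; any descent-style refinement could be started from `best_point`.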
The aim of this paper is to enhance the performance of the antlion optimizer for feature selection by using chaos. We are particularly interested in applying our methods to data from biology and medicine, as these data possess a large number of attributes and generally have a small number of instances, which makes the feature selection process more complex.
The remainder of this paper is organized as follows. Subsection 1.1 surveys the existing related work. Section 2 provides background information about the antlion optimization algorithm and chaotic maps. The proposed chaotic version of the antlion optimization (CALO) is presented in Subsection 2.3. The experimental results with discussions are reported in Section 3. The conclusions of this research and directions for future work are presented in Section 4.

Related work
Nature-inspired heuristics, such as genetic algorithms, genetic programming, ant colony optimization, and particle swarm optimization, have been successfully used for feature selection. GA uses the accuracy of classification as a fitness (objective) function and removes or adds a feature according to the ranking information. A feature selection algorithm based on GA using a fuzzy set as the fitness function has been proposed in [19]. PSO with the same fitness function achieves better performance than the GA algorithm in [20]. A multi-objective algorithm for feature selection based on genetic programming has been proposed in [21].
An ACO-based wrapper feature selection algorithm has been applied in network intrusion detection [22]. ACO uses the Fisher discrimination rate as the heuristic information. A feature selection method based on ACO and rough set theory has been proposed in [23]. The logistic map is one of the maps used to generate chaotic behavior; its dynamics are bounded but unstable. The system proposed in [24] uses the K-nearest neighbor (KNN) classifier with leave-one-out cross-validation (LOOCV) to evaluate the classification performance.
The chaos genetic feature selection optimization method (CGFSO) is proposed in [18]. The method proposed in [25] for text categorization consists of some primary stages, such as feature extraction and feature selection. In the feature selection stage, the method applies feature selection algorithms to obtain a feature subset that can increase the classification accuracy and method performance and can reduce the learning complexity. CGFSO explores the search space with all possible combinations of a given dataset. In addition, each individual in the population represents a candidate solution, with the size of the feature subset being the same as the length of a chromosome [26].
Chaotic time series prediction with the EPNet algorithm is proposed in [27]. The authors present four different methods derived from the classical EPNet algorithm applied to three different chaotic series (Logistic, Lorenz, and Mackey-Glass). The tournament EPNet algorithm obtains the best results for all time series considered, and the network architectures remain of a comparatively limited size. The chaotic time series predictor requires a small network architecture, whereas the addition of neural components may degrade the performance during evolution and consequently provide more survival probabilities to smaller networks in the population [28].

Antlion optimization (ALO)
Antlion optimization (ALO) is a bio-inspired optimization algorithm proposed by Mirjalili [17]. The ALO algorithm mimics the hunting mechanism of antlions in nature. Antlions (doodlebugs) belong to the Myrmeleontidae family and Neuroptera order [17]. They primarily hunt in the larval stage, and the adulthood period is for reproduction. An antlion larva digs a cone-shaped hole in the sand by moving along a circular path and throwing out sand with its huge jaw. After digging the trap, the larva hides underneath the bottom of the cone and waits for insects/ants to become trapped in the hole. Once the antlion realizes that a prey is in the trap, it attempts to catch the prey. However, insects are typically not caught immediately and attempt to escape from the trap.
In this case, antlions intelligently throw sand toward the edge of the hole to cause the prey to slide to the bottom of the hole. When a prey is caught in the jaw of an antlion, it is pulled under the soil and consumed. After consuming the prey, antlions throw the leftovers outside the hole and prepare the hole for the next hunt.
Artificial antlion. Based on the above description of antlions, Mirjalili uses the following facts and assumptions in the artificial antlion optimization algorithm [17]:
• Prey (ants) move around the search space using different random walks;
• Random walks are affected by the traps of antlions;
• Antlions can build holes proportional to their fitness (the higher the fitness, the larger the hole);
• Antlions with larger holes have a higher probability of catching ants;
• Each ant can be caught by an antlion in each iteration;
• The range of random walks is decreased adaptively to simulate sliding ants toward antlions;
• If an ant becomes fitter than an antlion, this means that the ant is caught and pulled under the sand by the antlion;
• An antlion repositions to the most recently caught prey and builds a hole to improve its chance of catching another prey after each hunt.
Formally, the antlion optimization algorithm is given in Algorithm 1.
foreach Ant i do
• Select an antlion using Roulette wheel.
• Create a random walk for Ant i and normalize it, as shown in Eqs (4) and (5) for modeling trapping, Eq (6) for random walk, and Eq (8) for walk normalization.
end
6. Calculate the fitness of all ants.
7. Replace an antlion with its corresponding ant if the ant becomes fitter following Eq (1).
8. Update the elite if an antlion becomes fitter than the current elite.
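The roulette wheel selection used inside the loop of Algorithm 1 can be sketched as follows. This is only an illustration under stated assumptions: fitness is taken to be minimized, and the conversion of fitness values into selection weights (`fitness.max() - fitness`) is one possible choice, not necessarily the one used in [17]; the function name `roulette_wheel_select` is hypothetical.

```python
import numpy as np

def roulette_wheel_select(fitness, rng):
    """Return the index of one antlion, favoring lower (better) fitness values."""
    fitness = np.asarray(fitness, dtype=float)
    # Assumption: turn minimization fitness into positive selection weights.
    weights = fitness.max() - fitness + 1e-12
    probabilities = weights / weights.sum()
    return int(rng.choice(fitness.size, p=probabilities))
```

With this weighting the worst antlion is almost never selected; a softer scheme (for example, rank-based weights) would keep more diversity.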
The antlion optimizer applies the following steps to an individual antlion: 1. Building a trap: a roulette wheel is used to model the hunting capability of antlions. Ants are assumed to be trapped in only one selected antlion hole. The ALO algorithm requires a roulette wheel operator for selecting antlions based on their fitness during optimization. This mechanism provides high chances to the fitter antlions for catching prey or ants.

2. Catching prey and re-building the hole: this is the final stage in hunting, in which the antlion consumes the ant. It is assumed that prey catching occurs when the ant becomes fitter (goes inside the sand) than its corresponding antlion. The antlion then updates its position to the latest position of the hunted ant to increase its chance of catching new prey. Eq (1) reflects this process:
\[ \mathrm{Antlion}_j^t = \mathrm{Ant}_i^t \quad \text{if } f(\mathrm{Ant}_i^t) \text{ is better than } f(\mathrm{Antlion}_j^t) \tag{1} \]
where:
• $t$ is the current iteration;
• $\mathrm{Antlion}_j^t$ is the position of antlion $j$ at iteration $t$;
• $\mathrm{Ant}_i^t$ is the position of ant $i$ at iteration $t$.
The antlion optimizer applies the following four operations to an individual ant:
1. Sliding ants toward antlion: antlions shoot sand toward the center of the hole once they realize that an ant is in the trap. This behavior causes the trapped ant that is attempting to escape to slide down. To mathematically model this behavior, the radius of the ants' random walk hyper-sphere is decreased adaptively using Eqs (2) and (3):
\[ c^t = \frac{c^t}{I}, \qquad d^t = \frac{d^t}{I} \tag{2} \]
where:
• $c^t$ is the minimum of all variables at iteration $t$;
• $d^t$ is the maximum of all variables at iteration $t$;
• $I$ is a ratio, which is defined in Eq (3):
\[ I = 10^{w} \, \frac{t}{T} \tag{3} \]
where:
• $t$ is the current iteration;
• $T$ is the maximum number of iterations;
• $w$ is a constant defined based on the current iteration ($w = 2$ when $t > 0.1T$, $w = 3$ when $t > 0.5T$, $w = 4$ when $t > 0.75T$, $w = 5$ when $t > 0.9T$, and $w = 6$ when $t > 0.95T$). Basically, the constant $w$ can adjust the accuracy level of exploitation.
2. Trapping in the antlion holes: by modeling the sliding of prey toward the antlion, the ant is trapped in the antlion's hole. In other words, the walk of the ant becomes bounded by the position of the antlion, which can be modeled by changing the range of the ant random walk toward the antlion position as in Eqs (4) and (5):
\[ c_i^t = c^t + \mathrm{Antlion}_j^t \tag{4} \]
\[ d_i^t = d^t + \mathrm{Antlion}_j^t \tag{5} \]
where:
• $c^t$ is the minimum of all variables at iteration $t$;
• $d^t$ is the maximum of all variables at iteration $t$;
• $c_i^t$ is the minimum of all variables for ant $i$;
• $d_i^t$ is the maximum of all variables for ant $i$;
• $\mathrm{Antlion}_j^t$ represents the position of antlion $j$ at iteration $t$.
3. Random walks of ants: the ants move through the search space by random walks (Eq (6)). To keep the random walks inside the search space, they are normalized using Eq (8) (min-max normalization):
\[ X_i^t = \frac{(X_i^t - a_i)\,(d_i^t - c_i^t)}{b_i - a_i} + c_i^t \tag{8} \]
where:
• $X_i^t$ is the random-walk value of variable $i$ at iteration $t$;
• $a_i$ is the minimum random walk for variable $i$;
• $b_i$ is the maximum random walk for variable $i$;
• $c_i^t$ is the minimum of variable $i$ at iteration $t$;
• $d_i^t$ is the maximum of variable $i$ at iteration $t$.
4. Elitism: to maintain the best solution(s) across iterations, elitism has to be applied. In this work, we consider that the random walk of an ant is guided by the selected antlion and by the elite antlion, and hence, the repositioning of a given ant follows the average of both random walks, as shown in Eq (9):
\[ \mathrm{Ant}_i^t = \frac{R_A^t + R_E^t}{2} \tag{9} \]
where:
• $R_A^t$ is the random walk around the antlion selected using a roulette wheel;
• $R_E^t$ is the random walk around the elite antlion.
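The four ant operations above (shrinking the walk range, trapping around an antlion, normalized random walk, and elitism) can be combined in a short Python sketch. It is a sketch under assumptions rather than the reference implementation: the walk is taken to be a cumulative sum of ±1 steps as in the standard ALO description, I is assumed to be 1 before t > 0.1T (the text does not state w for that range), and the final clipping to the search bounds is an added safeguard. All function names are hypothetical.

```python
import numpy as np

def shrink_ratio(t, T):
    """Ratio I = 10^w * t / T, with w chosen from the thresholds given in the text."""
    for fraction, w in [(0.95, 6), (0.9, 5), (0.75, 4), (0.5, 3), (0.1, 2)]:
        if t > fraction * T:
            return (10 ** w) * t / T
    return 1.0  # assumption: no shrinking before t > 0.1 T

def random_walk(n_steps, rng):
    """Cumulative sum of +1/-1 steps, starting at 0."""
    steps = np.where(rng.random(n_steps) > 0.5, 1.0, -1.0)
    return np.concatenate(([0.0], np.cumsum(steps)))

def walk_around(antlion, lower, upper, t, T, rng):
    """Position of one ant at iteration t after a walk bounded around `antlion`."""
    I = shrink_ratio(t, T)
    c, d = lower / I, upper / I          # shrink the global bounds by I
    c_i, d_i = antlion + c, antlion + d  # center the bounds on the antlion
    position = np.empty_like(antlion)
    for j in range(antlion.size):
        walk = random_walk(T, rng)
        a, b = walk.min(), walk.max()
        # Min-max normalization of the walk into [c_i, d_i].
        normalized = (walk - a) * (d_i[j] - c_i[j]) / (b - a) + c_i[j]
        position[j] = normalized[t]
    return position

def update_ant(selected_antlion, elite, lower, upper, t, T, rng):
    """Average the walks around the roulette-selected antlion and the elite."""
    r_a = walk_around(selected_antlion, lower, upper, t, T, rng)
    r_e = walk_around(elite, lower, upper, t, T, rng)
    return np.clip((r_a + r_e) / 2.0, lower, upper)  # clipping: added safeguard
```

Late in the run the ratio I becomes very large, so `walk_around` returns positions almost identical to the antlion itself, which is the loss of exploration discussed later in the paper.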

Chaotic maps
Chaos means a condition or place of great disorder or confusion [29]. Chaotic systems are deterministic systems that exhibit irregular (or even random) behavior and a sensitive dependence on the initial conditions. Chaos is one of the most popular phenomena that exist in nonlinear systems, whose action is complex and similar to that of randomness [30]. Chaos theory studies the behavior of systems that follow deterministic laws but appear random and unpredictable, i.e., dynamical systems. To be referred to as chaotic, the dynamical system must satisfy the following chaotic properties [29]: 1. sensitive to initial conditions; 2. topologically mixing; 3. dense periodic orbits; 4. ergodic; 5. stochastically intrinsic.
Chaotic variables can go through all states in certain ranges according to their own regularity without repetition [30]. Due to the ergodic and dynamic properties of chaos variables, chaos search is more capable of hill-climbing and escaping from local optima than random search, and thus, it has been applied for optimization [30]. It is widely recognized that chaos is a fundamental mode of motion underlying almost all natural phenomena. A chaotic map is a map that exhibits some type of chaotic behavior [29]. The common chaotic maps in the literature are as follows:
1. Logistic map: this map is one of the simplest chaotic maps [31], as defined in Eq (10):
\[ x_{k+1} = a\, x_k (1 - x_k) \tag{10} \]
where:
• $x_k \in (0, 1)$ under the condition that $x_0 \in [0, 1]$, $0 < a \le 4$;
• $k$ is the iteration number.
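As a small illustration of the logistic map and of its sensitivity to initial conditions, the following Python sketch iterates the recurrence x_{k+1} = a x_k (1 - x_k). The function name `logistic_map` and the default a = 4 are choices made for this example only.

```python
def logistic_map(x0, n, a=4.0):
    """Return n successive values x_1..x_n of x_{k+1} = a * x_k * (1 - x_k)."""
    values = []
    x = x0
    for _ in range(n):
        x = a * x * (1.0 - x)
        values.append(x)
    return values

# Two nearly identical seeds produce trajectories that separate quickly.
seq_a = logistic_map(0.7, 100)
seq_b = logistic_map(0.7000001, 100)
largest_gap = max(abs(u - v) for u, v in zip(seq_a, seq_b))
```

Comparing `seq_a` and `seq_b` element by element shows the two trajectories agreeing at first and then drifting apart, which is the sensitive dependence on initial conditions listed among the chaotic properties.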

The novel chaotic antlion optimization (CALO)
In this section, we present our chaotic antlion optimization (CALO) algorithm based on k-nearest neighbor (KNN) for feature selection. Exploration can be defined as the acquisition of new information through searching [34]. Exploration is a main concern for all optimizers because it might lead to new search regions that might contain better solutions. Exploitation is defined as the application of known information. The good sites are exploited via the application of a local search. The selection process should be balanced between random selection and greedy selection to bias the search toward fitter candidate solutions (exploitation) while promoting useful diversity into the population (exploration) [34].
Parameter I controls the trade-off between exploration and exploitation in the original antlion optimization algorithm. This parameter is increased in a quasi-linear manner, which shrinks the random-walk range, to allow more exploration at the beginning of the optimization process, while exploitation becomes more important at the end of the optimization. Therefore, half of the optimization resources are consumed in exploration, whereas the remaining time is dedicated to exploitation, as shown in Fig 1.
Although the algorithm proved efficient for solving numerous optimization problems, it still possesses the following drawbacks: 1. Sub-optimal selection: at the beginning of the optimization process, I is small, which makes the random walk unbounded in the search space and allows an ant to apply random walk in almost the entire search space. This may cause the algorithm to select sub-optimal solutions.
2. Stagnation: once the algorithm approaches the end of the optimization process, it becomes difficult to escape local optima and find better solutions because its exploration capability is very limited; I becomes very large, thereby limiting the boundaries for the random walk. This causes the algorithm to continue enhancing solutions that have already been found, even if they are sub-optimal.
These problems motivate our work on adapting the parameter I to obtain successive periods of exploration and exploitation. Therefore, when reaching a solution, exploitation will be applied, followed by another exploration, which may jump to another promising area, followed by exploitation again to further enhance the solution found, and so on. Chaotic systems, with their interesting properties such as topological mixing, dense periodic orbits, ergodicity, and intrinsic stochasticity, can be used to adapt this parameter, allowing for the required mix between exploration and exploitation. Fig 2a presents an example of a chaos map for the values of I for 500 iterations, in which we can observe alternating regions of exploration and exploitation. A small variation in the period means exploitation, whereas a larger variation in the period means exploration. The tent map smoothly and periodically decrements the exploration rate, while the sinusoidal map abruptly switches between exploration and exploitation, which may cause loss of the optimal solution and lead to worse performance (as shown in Fig 2b and 2c).
The proposed CALO algorithm is schematically presented in (Fig 3). The search strategy of the wrapper-based approach explores the feature space to find a feature subset guided by the classification performance of individual feature subsets.
This approach may be slow because the classifier must be retrained on all the candidate subsets of the feature set and its performance must be measured. Therefore, an intelligent search of the feature space is required. The goals are to maximize the classification performance P and to minimize the number of selected features N_f. The fitness function, which is minimized, is given in Eq (16) [35]:
\[ \mathrm{Fitness} = \alpha\,(1 - P) + (1 - \alpha)\,\frac{N_f}{N_t} \tag{16} \]
where:
• $N_f$ is the size of the selected feature subset;
• $N_t$ is the total number of features in the dataset;
• $\alpha \in [0, 1]$ defines the weights of the sub-goals;
• $P$ is the classification performance measured as in Eq (17):
\[ P = \frac{N_c}{N} \tag{17} \]
where $N_c$ is the number of correctly classified data instances and $N$ is the total number of instances in the dataset. The number of dimensions in the optimization is the same as the number of features, with each feature related to a dimension and each variable limited to the range [0, 1]. To determine whether a feature will be selected at the evaluation stage, a static threshold of 0.5 is used, as shown in Eq (18):
\[ y_{ij} = \begin{cases} 1 & \text{if } x_{ij} \ge 0.5 \\ 0 & \text{otherwise} \end{cases} \tag{18} \]
where $y_{ij}$ is the discrete representation of solution vector $x$, and $x_{ij}$ is the continuous position of the search agent $i$ in dimension $j$.
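The thresholding step and the two-goal fitness can be sketched in a few lines of Python. The sketch assumes that the fitness is minimized and combines the classification error 1 - P with the fraction of selected features N_f / N_t, weighted by α; the default `alpha=0.99` and the function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def binarize(position, threshold=0.5):
    """Static threshold: a feature is selected when its position reaches 0.5."""
    return (np.asarray(position, dtype=float) >= threshold).astype(int)

def fitness(accuracy, mask, alpha=0.99):
    """Weighted sum of classification error and selected-feature fraction.

    Assumption: lower is better; alpha = 0.99 is an illustrative value only.
    """
    n_selected = int(np.sum(mask))
    n_total = len(mask)
    return alpha * (1.0 - accuracy) + (1.0 - alpha) * n_selected / n_total
```

For example, `fitness(0.9, binarize([0.8, 0.1, 0.6, 0.3]))` scores a subset of two out of four features at 90% accuracy; at equal accuracy, a smaller subset always receives a lower (better) value.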

Experimental setup
Datasets. Table 1 summarizes the 18 datasets used for the experiments. The datasets are taken from the UCI data repository [36]. We use ten biological datasets to validate the performance of our method and its potential applicability for data generated in biology. In addition, we use eight datasets from other areas to show the general adaptability of our method. Each dataset is divided into 3 equal parts for training, validation, and testing. The training set is used to train a classifier through optimization and at the final evaluation. The validation set is used to assess the performance of the classifier at the optimization time. The testing set is used to evaluate the selected features.
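The three-way split and the evaluation of a candidate feature subset with KNN can be sketched as below. This is a minimal NumPy illustration under assumptions: the split is a random permutation cut into three near-equal parts, class labels are non-negative integers, distances are Euclidean, and the names `three_way_split` and `knn_accuracy` are hypothetical.

```python
import numpy as np

def three_way_split(n_samples, rng):
    """Shuffle sample indices and cut them into three near-equal parts."""
    indices = rng.permutation(n_samples)
    return np.array_split(indices, 3)  # training, validation, testing

def knn_accuracy(X_train, y_train, X_eval, y_eval, mask, k=5):
    """Accuracy of a k-NN classifier restricted to the features where mask == 1."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0  # assumption: an empty subset is treated as the worst case
    A, B = X_train[:, cols], X_eval[:, cols]
    # Squared Euclidean distance between every evaluation and training sample.
    dists = ((B[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    predicted = np.array([np.bincount(row).argmax() for row in votes])
    return float(np.mean(predicted == y_eval))
```

During optimization the classifier would be fitted on the training part and scored on the validation part; the testing part is held back to score the final selected subset.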
Four different optimization methods are compared in this study: CALO with five different chaotic maps (logistic, singer, tent, piecewise, and sinusoidal); the original ALO; particle swarm optimization [14]; and genetic algorithms [13]. The parameter settings for all the algorithms are presented in Table 2.

Performance metrics
Each algorithm has been applied 20 times with random positioning of the search agents, except for the full-feature solution, which was forced to be the position of one of the search agents. Forcing the full-feature solution guarantees that all subsequent feature subsets, if selected as the global best solution, are fitter than it. The well-known KNN classifier with k = 5 is used to evaluate the final classification performance for individual algorithms [1]. Repeated runs of the optimization algorithms were used to test their convergence capability. The indicators (measures) used to compare the different algorithms are as follows:
• Statistical mean: the average performance of a stochastic optimization algorithm applied M times, given in Eq (19):
\[ \mathrm{Mean} = \frac{1}{M} \sum_{i=1}^{M} g_*^i \tag{19} \]
where $g_*^i$ is the optimal solution that resulted at the $i$-th application of the algorithm.
• Statistical best: the minimum fitness function value (or best value) obtained by an optimization algorithm in M independent applications, as shown in Eq (20):
\[ \mathrm{Best} = \min_{i=1,\dots,M} g_*^i \tag{20} \]
• Statistical worst: the maximum fitness function value (or worst value) obtained by an optimization algorithm in M independent applications, as in Eq (21):
\[ \mathrm{Worst} = \max_{i=1,\dots,M} g_*^i \tag{21} \]
• Statistical standard deviation (std): an indicator of the optimizer stability and robustness. When std is small, the optimizer converges to nearly the same solution in every run, whereas large values of std represent close to random results, as shown in Eq (22):
\[ \mathrm{Std} = \sqrt{\frac{1}{M-1} \sum_{i=1}^{M} \left( g_*^i - \mathrm{Mean} \right)^2} \tag{22} \]
• Average classification accuracy: describes how accurate the classifier is given the selected feature set, as shown in Eq (23):
\[ \mathrm{AvgPerf} = \frac{1}{M} \sum_{j=1}^{M} \frac{1}{N} \sum_{i=1}^{N} \mathrm{Match}(C_i, L_i) \tag{23} \]
where:
• $N$ is the number of instances in the test set;
• $C_i$ is the classifier output label for data instance $i$;
• $L_i$ is the reference class label for data instance $i$;
• Match is a function that outputs 1 when the two input labels are the same and outputs 0 otherwise.
• Average selection size (reduction): represents the fraction of selected features from all feature sets, as shown in Eq (24):
\[ \mathrm{AvgSelectionSize} = \frac{1}{M} \sum_{i=1}^{M} \frac{\mathrm{size}(g_*^i)}{N_t} \tag{24} \]
where $\mathrm{size}(g_*^i)$ is the number of features selected in $g_*^i$ and $N_t$ is the number of features in the original dataset.
• Average Fisher score (f-score): a measure that evaluates a feature subset such that in the data space spanned by the selected features, the distances between data instances in different classes are as large as possible, while the distances between data instances in the same class are as small as possible [4]. F-score in this work is calculated for individual features given the class labels and for M independent applications of an algorithm, as given in Eq (25):
\[ F_j = \frac{\sum_{k} n_k \left( \mu_j^k - \mu_j \right)^2}{(\sigma_j)^2} \tag{25} \]
where:
• $F_j$ is the Fisher score for feature $j$;
• $\mu_j$ is the mean of feature $j$ over the entire dataset;
• $(\sigma_j)^2$ is the variance of feature $j$ over the entire dataset;
• $n_k$ is the size of class $k$;
• $\mu_j^k$ is the mean of feature $j$ in class $k$.
Algorithms used for comparison: our comparisons include the following algorithms:
• ALO: the original antlion optimization
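The run-level indicators (statistical mean, best, worst, std, and average selection size) can be computed from the per-run results as in the following Python sketch. The sample standard deviation (M - 1 denominator) is an assumption, as are the function name `run_statistics` and the input layout: one best fitness value and one binary feature mask per independent run.

```python
import numpy as np

def run_statistics(best_fitness_per_run, masks_per_run):
    """Summarize M independent runs of a stochastic optimizer."""
    g = np.asarray(best_fitness_per_run, dtype=float)  # shape (M,)
    masks = np.asarray(masks_per_run)                  # shape (M, N_t), entries 0/1
    return {
        "mean": float(g.mean()),
        "best": float(g.min()),     # fitness is minimized, so best = minimum
        "worst": float(g.max()),
        "std": float(g.std(ddof=1)),  # assumption: sample standard deviation
        "selection_size": float(masks.sum(axis=1).mean() / masks.shape[1]),
    }
```

A small `std` across runs indicates that the optimizer repeatedly reaches solutions of similar quality, which is how stability is judged in the analysis that follows.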

Analysis and discussion
Fig 4 shows the average statistical mean fitness, best fitness, worst fitness, and the standard deviation for all the methods used and for all 18 datasets. The results for the biological datasets are presented in Fig 5, and those for the non-biological datasets are presented in Fig 6. We can observe that ALO and CALO generally perform better than GA and PSO. The search method adopted in ALO is more explorative than the one used in GA and PSO because ALO performs a local search around a roulette wheel selected solution, and in this way, other areas (apart from the area around the current best) are explored. Because of the balanced control of exploration and exploitation, the CALO algorithm outperforms the original ALO. The nonsystematic adaptation of the exploration rate in CALO allows successive local and global searching and helps the algorithm escape from local minima that commonly exist in the search space. The tent chaos map outperforms the other chaos maps, whereas the sinusoidal map provides the worst chaotic result.
To assess the stability of the stochastic algorithms and their ability to converge to the same optimal solution, we measure the statistical standard deviation (std) of the fitness values over different runs. The minimum for the std measure is obtained by CALO in almost all the datasets, which reaffirms that CALO is more stable and can converge to the same optimal solution regardless of its stochastic and chaotic manner. In addition, we can see that the tent map still performs better than the other maps in terms of its repeatability. The results for the classification accuracy presented in Table 3 show that CALO obtains the best results for 11 of the datasets, thus demonstrating the capability of CALO to find optimal feature combinations ensuring good test performance. Table 4 summarizes the results for the size of the selected feature subsets. We can see that CALO, while outperforming all the other methods in terms of classification performance, has comparable values with the other approaches for the number of features selected. Tables 5 and 6 show particular feature selection size (reduction) examples for the Breastcancer dataset, which has 9 input features, and for the HeartEW dataset, which has 13 input features. From the Breastcancer dataset, we can observe that CALO suggests that only four of the features are good enough to classify a tumor. It is particularly preferred in biology and medicine to consider a small number of biomarkers for a disease because this involves fewer experiments, which may sometimes be difficult to perform and have side effects for the patient. For the HeartEW dataset, our method suggests that five of the data attributes will assure the same precision in performing the classification as if we consider all the features. Such tools could be of real help in the future as they will lead to fewer patient investigations and can lower the costs involved.
Overall, while comparing CALO with GA and PSO, we observe that CALO almost always obtains better or very similar classification accuracy with a lower number of features selected. In the majority of the tests performed, on average, approximately 75% of the features selected by CALO are in common with the features selected by GA or PSO, but in most of the cases, the set of features selected by CALO is included in the set of features selected by GA and PSO.
F-score values are given in Table 7, where we can again observe that CALO using the tent map obtains the best results overall. Additionally, note that the worst performing map is the sinusoidal map.
Limitations. The main limitation of the methodology proposed in this paper is the non-exact repeatability of the optimization results. We observed that at different applications of the algorithm, the subset of features selected might differ. Although the resulting solutions are all good solutions, it may be confusing for the user to determine which subset to consider. The proposed algorithm follows the wrapper-based feature selection approach and uses KNN because it is a simple classifier. The running time may increase when switching to another classifier, such as support vector machine (SVM) or random forest (RF). Therefore, switching to a different classifier should be carefully handled, particularly if the algorithm is adopted in real-time applications.

Conclusions
In this paper, we address the feature selection problem by developing a chaos-based version of a recently proposed meta-heuristic algorithm, namely, antlion optimization (ALO). A parameter whose setting is crucial for the algorithm performance is adapted using chaos principles. The proposed chaotic antlion optimization (CALO) is applied to a common challenging optimization problem: feature selection in the wrapper mode. The feature selection is formulated as a multi-objective optimization task with a fitness function reflecting the classification performance and the reduction in the number of features. The proposed system is evaluated using 18 different datasets against a number of evaluation criteria. We developed this method with particular interest in datasets generated in biology, as these data typically have a large number of attributes and a low number of instances. CALO proves to be more efficient compared to ALO, PSO, and GA regarding the quality of the features selected. CALO is able to converge to the