Abstract
A hybrid feature selection algorithm combines different feature selection methods, aiming to overcome the limitations of any single method and to improve the effectiveness and performance of feature selection. In this paper, we propose a new hybrid feature selection algorithm named Tandem Maximum Kendall Minimum Chi-Square and ReliefF Improved Grey Wolf Optimization algorithm (TMKMCRIGWO). The algorithm consists of two stages: first, the original features are filtered and ranked using the bivariate filter algorithm Maximum Kendall Minimum Chi-Square (MKMC) to form a candidate feature subset S1; subsequently, the features in S1 are filtered and ranked in tandem by ReliefF to form a candidate feature subset S2, and finally S2 is passed to the wrapper algorithm to select the optimal subset. In particular, the wrapper algorithm is an Improved Grey Wolf Optimization (IGWO) algorithm based on random disturbance factors, whose parameters are adjusted to vary randomly so that the population variations are rich in diversity. Hybrid algorithms formed by combining filter algorithms with wrapper algorithms in tandem show better performance and results than single algorithms in solving complex problems. Three sets of comparison experiments were conducted to demonstrate the superiority of this algorithm over the others. The experimental results show that the average classification accuracy of the TMKMCRIGWO algorithm is at least 0.1% higher than that of the other algorithms on 20 datasets, and the average dimension reduction rate (DRR) reaches 24.76%. The DRR reached 41.04% on 12 low-dimensional datasets and 0.33% on 8 high-dimensional datasets. These results show that the algorithm improves the generalization ability and performance of the model.
Citation: Bai X, Zheng Y, Lu Y, Shi Y (2024) Chain hybrid feature selection algorithm based on improved Grey Wolf Optimization algorithm. PLoS ONE 19(10): e0311602. https://doi.org/10.1371/journal.pone.0311602
Editor: Seyedali Mirjalili, Torrens University Australia, AUSTRALIA
Received: July 24, 2024; Accepted: September 21, 2024; Published: October 8, 2024
Copyright: © 2024 Bai et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The datasets analyzed in this study are publicly available. Twelve low-dimensional datasets were obtained from the UCI Machine Learning Repository (http://archive.ics.uci.edu/datasets), including the Automobile, breastcancer, bupa liver, german, glass, heart, ionosphere, Parkinsons, sonar, SPECT Heart-SPECTF, thyroid, and zoo datasets. Additionally, eight high-dimensional datasets were sourced from the Gene Expression Model Selector and Microarray Data repositories, including DLBCL, Leukemias, Lukemia, SRBCT, Brain Tumor, Leu, GLI85 (as cited in references 26, 43-46), and GLA180 (https://jundongl.github.io/scikit-feature/OLD/datasets_old).
Funding: This work was supported by the Natural Science Foundation of Jilin Province under Grant 20210101176JC, in part by the Natural Science Foundation of Jilin Province under Grant YDZJ202301ZYTS157 and in part by the Natural Science Foundation of Jilin Province under Grant 20240304097SF.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Feature Selection (FS) is performed by identifying the most relevant and informative features from the original dataset and removing irrelevant and redundant features [1, 2]. It plays a crucial role in machine learning and data mining tasks [3–5]. The feature selection process aims to reduce the dimensionality of the data, improve model performance, and enhance interpretability. In recent years, the importance of feature selection has increased with the exponential growth of data in various domains. The main motivation behind it is to address the curse of dimensionality [6, 7]: high-dimensional data can lead to increased computational complexity, overfitting, and reduced model generalization. Feature selection not only improves the efficiency of the learning algorithm by selecting a subset of relevant features, but also helps to identify underlying patterns and relationships in the data.
Feature selection is an essential step in machine learning that can help improve the performance and efficiency of a model. The three most common families of methods are filter methods, wrapper methods, and embedded methods. Filter methods assess the relevance between features and target variables through statistical measures; common techniques include Pearson’s correlation coefficient [8] and Spearman’s correlation coefficient [9]. Wrapper methods, on the other hand, utilize specific learning algorithms to evaluate different subsets of features to find the best combination of features. Usually, heuristic algorithms [10–14] are combined with classifiers [15, 16] to form wrapper algorithms for feature selection. Embedded methods embed the feature selection process into model training, automatically selecting features and adjusting their weights during training. Common examples include penalty-based methods [17] and tree-model-based methods [18].
Zhou et al. [19] proposed a hybrid approach that integrates sample average approximation (SAA) and an improved multi-objective chaotic quantum-behaved particle swarm optimization (MOCQPSO) algorithm. Chance constrained programming is applied to formulate this stochastic problem. However, the computational complexity of this algorithm is high, resulting in a long running time on large-scale instances. Li et al. [20] proposed an improved binary quantum particle swarm optimization algorithm (IBQPSO). The binary PSO algorithm is a simple and efficient discrete optimization method because it guarantees a global search, and the improved algorithm solves the 0-1 knapsack problem effectively. Meanwhile, a diversity maintenance mechanism is established in IBQPSO to alleviate the local optimization problem. However, it lacks extensive experimental validation for knapsack problems of different sizes and features. Gong et al. [21] aimed to solve problems such as the poor search performance of traditional particle swarm optimization algorithms on complex high-dimensional optimization problems and their tendency to fall into local optima. A quantum PSO method based on diversity migration is proposed by introducing a new migration mechanism. Although subset migration strategies were adopted to enhance population diversity, they may be too conservative in exploring new search spaces. Bodha et al. [22] proposed a new quantum computing-based optimization algorithm designed to solve the multiple-objective mixed cost-effective emission dispatch (MEED) problem of electrical power systems. However, its computational overhead can be high, especially for large instances, and testing only on IEEE 14-bus and 30-bus systems failed to verify the algorithm’s performance on larger or more diverse systems.
Wu et al. [23] proposed an ensemble feature selection method based on an enhanced correlation matrix (ECM-EFS). The method enhances the covariance matrix, which not only considers the nonlinear relationships between different features, but also improves the interpretability and robustness of feature selection. However, its drawback is that for high-dimensional datasets, ECM-EFS may require longer computation time. Tijjani et al. [24] proposed an enhanced PSO algorithm for selecting the most informative subset of features from high-dimensional data. The algorithm introduces an improved position update mechanism based on Bayesian Optimization (BO), which enhances the exploration capability of the algorithm. However, it may not handle very high-dimensional datasets because of the high computational effort. Beheshti [25] proposed a novel fuzzy transfer function method based on the Mamdani fuzzy inference system for binary particle swarm optimization. By using the fuzzy transfer function, Fuzzy Binary Particle Swarm Optimization (FBPSO) enables optimal subset selection of features with high accuracy, but the algorithm incurs a large time cost. Li [26] proposed a binary version of the local Opposing learning Golden sine Grey Wolf Optimization algorithm (OGGWO). The algorithm enriches population diversity and improves the convergence rate by initializing the locations of individual grey wolves using local dyadic learning maps. The golden sine algorithm is then combined with the Grey Wolf Optimization algorithm, and the mean coefficient is used to improve the autonomous search ability of individual grey wolves and keep the algorithm from falling into a local optimum. However, the algorithm took a long time to run and could only target some specific datasets.
Although existing hybrid feature selection methods have made progress in dealing with high-dimensional datasets, they still face challenges when dealing with low-dimensional datasets. This is because low-dimensional datasets often contain a limited number of samples, which may make the data unrepresentative and susceptible to overfitting. In this context, selecting the most relevant and discriminative features becomes crucial to ensure that the models built are robust and generalize well. To this end, this paper focuses on the key role of feature selection and proposes a chain hybrid feature selection algorithm that addresses the challenges of feature selection on both high- and low-dimensional samples. The chain hybrid feature selection algorithm refers to the use of two filter algorithms in tandem, whose output is then used in the wrapper algorithm, as if it were a chain, with one link interlocking with the next. The proposed hybrid feature selection algorithm demonstrates its favorable results on both high- and low-dimensional datasets through three sets of comparative experiments. The algorithm not only improves model performance, but also reduces the runtime of the model.
The main contributions made in this paper are as follows:
- A new filter algorithm Maximum Kendall Minimum Chi-Square (MKMC) is proposed based on the maximum relevance minimum redundancy (mRMR) criterion. Maximum Kendall is used to measure the relevance between features and labels and Minimum Chi-Square value to measure the redundancy between features.
- Improved Grey Wolf Optimization Algorithm (IGWO): Based on the original Grey Wolf Optimization (GWO) Algorithm, a random perturbation vector is introduced to update its position. The original linearly decreasing variables are changed to random changes to avoid chance. Finally, the “Iterative oscillatory selection” method is used to replace the original random numbers in the selection of the corresponding positions.
- Tandem filter algorithm: The original dataset is used to generate a subset of candidate features using MKMC, which is then processed by ReliefF to form a new subset of candidate features for the wrapper Algorithm.
- In this paper, three sets of comparison experiments are used to verify that Tandem Maximum Kendall Minimum Chi-Square and ReliefF Improved Grey Wolf Optimization algorithm (TMKMCRIGWO) is superior to other algorithms. First, the IGWO is compared with other wrapper algorithms; then MKMC and ReliefF are used to combine IGWO respectively and then compared with IGWO; finally, the proposed algorithm is compared with other hybrid algorithms to prove that the proposed algorithm is superior to the comparison algorithms.
The paper is structured as follows: Section 2 describes the work related to the proposed algorithm in this paper; Section 3 describes the proposed hybrid feature selection algorithm in detail; Section 4 describes the experimental settings, and the analysis of the experimental results; and Section 5 summarizes the proposed algorithm as well as a future outlook.
2. Related works
This section focuses on the components underlying the proposed algorithm: A. Kendall’s correlation coefficient; B. Chi-Square Test; C. ReliefF; and D. Grey Wolf Optimization Algorithm. The TMKMCRIGWO algorithm is proposed by combining and improving these components.
A. Kendall’s correlation coefficient
Kendall’s coefficient, also known as Kendall’s concordance coefficient [27], was introduced by the British statistician Maurice Kendall to measure the strength of the relationship between two variables. The value of Kendall’s correlation coefficient, like Pearson’s [8] and Spearman’s correlation coefficients [9], ranges between -1 and 1. A Kendall’s correlation coefficient of 1 indicates a perfect positive correlation between the two variables; a value of -1 indicates a perfect negative correlation; and a value of 0 indicates no correlation between the two variables. The correlation calculation can be expressed as formula 1.
\tau = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}    (1)
Where nc denotes the number of concordant pairs (pairs ordered the same way in both variables); nd denotes the number of discordant pairs (pairs ordered oppositely in the two variables); n0 denotes the total number of sample pairs, n0 = n(n − 1)/2; n1 denotes the number of pairs tied on the first variable; and n2 denotes the number of pairs tied on the second variable.
Kendall’s tau is a nonparametric statistical method that is robust and widely applicable [28]. In contrast to other correlation coefficients, Kendall’s tau is unaffected by outliers and non-normally distributed data and is applicable to a wide range of data types. It can measure nonlinear relationships and is equivalent to Spearman’s correlation coefficient. In addition, it is computationally simple and efficient. Therefore, Kendall’s tau has significant advantages in analyzing data with nonlinear relationships.
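As an illustration of formula 1, Kendall’s tau between a feature column and class labels can be computed with SciPy’s `kendalltau` (which implements the tau-b variant of formula 1); the sample values below are invented for demonstration and are not from the paper’s datasets:

```python
# Kendall's tau between one feature column and binary class labels.
from scipy.stats import kendalltau

feature = [1.2, 2.4, 3.1, 4.8, 5.0, 6.7]   # a monotonically ordered feature
labels  = [0,   0,   1,   1,   1,   1]     # class labels with ties

# tau-b handles the tied label pairs via the (n0 - n1)(n0 - n2) denominator
tau, p_value = kendalltau(feature, labels)
print(f"tau = {tau:.4f}, p = {p_value:.4f}")  # tau ≈ 0.7303 for this data
```

Here all 8 cross-class pairs are concordant (nc = 8, nd = 0) and the 7 same-label pairs are ties on the label variable, giving tau = 8/√(15 · 8) ≈ 0.7303.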
B. Chi-Square Test
Chi-Square Test [29] is a statistical method used to test for the existence of correlation between categorical variables. In chi-square test, we compare the difference between the observed frequencies and the expected frequencies to determine whether there is a significant association between the two variables. The steps of chi-square test are as follows:
1. Establish null hypothesis (H0) and alternative hypothesis (H1).
H0: There is no significant association between the two variables;
H1: There is a significant association between the two variables.
2. Calculate the Chi-square statistic according to formula 2.
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}    (2)
Where Ei represents the expected frequency; Oi is the observed frequency; and ∑ denotes accumulation over all cells.
3. Determination of degree of freedom: The degree of freedom is calculated as shown in formula 3.
df = (r - 1)(c - 1)    (3)
Where r is the number of rows and c is the number of columns.
4. Based on the chi-square distribution table or computer software, find the critical value at the corresponding degree of freedom and significance level.
5. Compare the calculated chi-square statistic with the critical value. If the calculated chi-square statistic is greater than the critical value, the null hypothesis H0 is rejected, indicating a significant association between the two variables; otherwise, H0 is not rejected, indicating no significant association between the two variables.
C. ReliefF
ReliefF is a feature selection algorithm [30, 31] for selecting the most discriminating features from a dataset. It is an instance-based feature selection method that evaluates the importance of features for a classification task by calculating feature weights.
The ReliefF algorithm compares the feature differences between the target sample and its nearest neighbor samples to evaluate the importance of the features. The specific steps are shown in Fig 1:
The basic steps of the ReliefF algorithm.
Advantages of the ReliefF algorithm include greater robustness to noise and redundant features, and suitability for processing high-dimensional data and large-scale datasets.
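A minimal sketch of the weight-update idea behind ReliefF, written for a binary-class dataset; this is a simplified illustration of the steps in Fig 1 (single nearest-hit/nearest-miss averaging, Manhattan distance), not the paper’s exact implementation, and the toy data are invented:

```python
# Sketch of a ReliefF-style weight update: near-hits shrink a feature's
# weight, near-misses grow it, so discriminative features score highest.
import numpy as np

def relieff(X, y, n_neighbors=3, n_iters=None):
    rng = np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                      # avoid division by zero
    n_iters = n_iters or n
    for _ in range(n_iters):
        i = rng.integers(n)                    # random target sample
        diff = np.abs(X - X[i]) / span         # normalized per-feature distance
        dist = diff.sum(axis=1)
        dist[i] = np.inf                       # exclude the target itself
        same = np.where(y == y[i])[0]
        other = np.where(y != y[i])[0]
        hits = same[np.argsort(dist[same])[:n_neighbors]]
        misses = other[np.argsort(dist[other])[:n_neighbors]]
        w -= diff[hits].mean(axis=0) / n_iters   # penalize within-class spread
        w += diff[misses].mean(axis=0) / n_iters # reward between-class spread
    return w

# toy data: feature 0 separates the classes, feature 1 is noise
X = np.array([[0.0, 0.5], [0.1, 0.9], [0.2, 0.1],
              [1.0, 0.6], [0.9, 0.2], [1.1, 0.8]])
y = np.array([0, 0, 0, 1, 1, 1])
w = relieff(X, y, n_neighbors=2)
print(w)  # the informative feature 0 should receive the larger weight
```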
D. Grey Wolf Optimization Algorithm (GWO)
GWO [32] is an optimization algorithm based on the social behavioral characteristics of grey wolves, proposed in 2014 by Mirjalili et al., scholars from Griffith University, Australia. Inspired by the predatory behavior of grey wolf packs, the algorithm simulates the collaborative behavior among leaders, followers, and explorers in grey wolf packs in nature to solve optimization problems. The algorithm classifies grey wolves into four types, which are used to simulate the social hierarchy, and models the three main phases of hunting: searching for prey, encircling prey, and attacking prey. Grey wolves searching for prey gradually approach the prey and encircle it; the mathematical model for encircling prey is given in formulas 4 and 5.
X(t+1) = X_p(t) - A \cdot D    (4)
D = |C \cdot X_p(t) - X(t)|    (5)
Where X denotes the position of the grey wolf; t is the current iteration number; Xp denotes the position of the prey; D denotes the distance between the grey wolf and the prey, which is computed in formula 5. A and C are two collaborative coefficient vectors, whose computation is shown in formulas 6 and 7.
A = 2a \cdot r_1 - a    (6)
C = 2 \cdot r_2    (7)
In formula 6, the vector A is used to simulate the attack behavior of the grey wolf on its prey, and its value is affected by a. The convergence factor a is a key parameter that balances the GWO’s exploration and exploitation capabilities. Throughout the iterations, a is decreased from 2 to 0. r1 and r2 are random numbers from 0 to 1.
Grey wolves can recognize the location of potential prey (the optimal solution); once the prey is located, β and δ guide the pack to encircle it under the leadership of α. The mathematical model of individual grey wolves tracking the location of prey is described in formula 8.
D_\alpha = |C_1 \cdot X_\alpha - X|, \quad D_\beta = |C_2 \cdot X_\beta - X|, \quad D_\delta = |C_3 \cdot X_\delta - X|    (8)
Where Dα, Dβ and Dδ denote the distances between α, β, δ and the other individuals, respectively. Xα, Xβ and Xδ represent the current positions of α, β and δ, respectively; C1, C2 and C3 are random vectors. X is the current position of a grey wolf. When |A| > 1, the grey wolves spread out among regions to search for prey (exploration); when |A| < 1, the grey wolves concentrate their search on a certain region (exploitation).
X_1 = X_\alpha - A_1 \cdot D_\alpha, \quad X_2 = X_\beta - A_2 \cdot D_\beta, \quad X_3 = X_\delta - A_3 \cdot D_\delta    (9)
X(t+1) = \frac{X_1 + X_2 + X_3}{3}    (10)
Formula 9 defines the step length and direction of an individual ω in a wolf pack toward α, β and δ, respectively, and formula 10 defines the final position of ω.
The GWO algorithm has a better global search ability and convergence speed, which is suitable for solving various types of optimization problems, especially in continuous optimization and multimodal optimization problems.
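The encircling and tracking equations above (formulas 4–10) can be sketched as a compact GWO loop on a toy continuous objective; the sphere function, population size, and iteration budget below are illustrative defaults, not the paper’s experimental settings:

```python
# Compact sketch of the standard GWO update (formulas 6-10).
import numpy as np

def gwo(obj, dim=5, n_wolves=20, T=200, lb=-10.0, ub=10.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, (n_wolves, dim))
    for t in range(T):
        fit = np.apply_along_axis(obj, 1, X)
        order = np.argsort(fit)
        alpha, beta, delta = X[order[:3]]      # three best wolves lead the pack
        a = 2 - 2 * t / T                      # convergence factor: 2 -> 0
        for i in range(n_wolves):
            X_new = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2 * a * r1 - a             # formula 6
                C = 2 * r2                     # formula 7
                D = np.abs(C * leader - X[i])  # formula 8
                X_new += leader - A * D        # formula 9
            X[i] = np.clip(X_new / 3, lb, ub)  # formula 10
    fit = np.apply_along_axis(obj, 1, X)
    return X[np.argmin(fit)], fit.min()

best_x, best_f = gwo(lambda x: np.sum(x * x))  # sphere function, optimum at 0
print(best_f)
```

As |A| shrinks with a, the wolves shift from exploring the search space to converging on the average of the three leaders, which drives the sphere objective toward its minimum.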
To enable the algorithm to jump out of the local optimum and find the global optimum, this paper proposes a chained hybrid feature selection algorithm. The first layer of dimensionality reduction is achieved by connecting the bivariate filter algorithm in tandem with the univariate filter algorithm. Then the wrapper algorithm is used to continue the exploration on the first layer of dimensionality reduction to find the optimal subset that satisfies the conditions. These works are formed based on the combination and improvement of the above, and the proposed algorithm is described in detail in the third part of this paper.
3. The TMKMCRIGWO algorithm
This part introduces the proposed algorithm. First, the algorithm forms a subset of candidate features by connecting the proposed filter algorithm MKMC in tandem with the univariate filter algorithm ReliefF. Then the subset formed by the tandem filter algorithms is used in the wrapper algorithm. Finally, the optimal subset is obtained.
A. Maximum Kendall Minimum Chi-Square (MKMC)
Based on the criterion of Maximum Relevant Minimum Redundancy (mRMR) [33], the new filter algorithm Maximum Kendall Minimum Chi-Square (MKMC) is proposed.
1) Maximum Kendall (MK).
The Kendall correlation coefficients between features and labels can be calculated from the feature sets F1, F2, …, Fn and the label L. Each sample contains a feature vector (x1, x2, …, xn) and a label value y; formula 11 is computed based on formula 1.
\tau(F_i, F_j) = \frac{n_u - n_d}{n_t}    (11)
For two features Fi and Fj, remember that the number of pairs with different ordering between the two feature sequences is nd, the number of pairs with the same ordering is nu, the number of pairs of samples that cannot be compared is nc, and the total number of pairs of samples is nt. Then, Kendall correlation coefficient of the two features Fi and Fj can be expressed as formula 11.
The maximum Kendall correlation coefficient for all features in the feature set can be expressed as formula 12.
MK(P, L) = \max_{p \in P} \frac{1}{C_n^2} \sum_{i<j} \tau(F_{p(i)}, F_{p(j)})    (12)
Where P denotes the set of feature permutations and p is a permutation in P, with p = 1, 2, …, n; τ(Fp(i), Fp(j)) is the Kendall correlation coefficient between features Fp(i) and Fp(j), computed by formula 11; and C_n^2 = n(n − 1)/2 is the number of feature combinations.
In this paper, we explore the relevance of different features in a prediction task using the Kendall correlation coefficient as an evaluation metric for feature selection. Suppose we take a dataset containing 100 samples and 5 features as an example. We calculated the Kendall correlation coefficient between each pair of features and labels to measure the correlation between them. And a visualization matrix was generated to show their correlation. Finally, we labeled the features that correspond to the maximum Kendall correlation coefficient to show the maximum relevance. In Fig 2, the lighter the color, the higher the relevance between the feature and the label. Its horizontal and vertical coordinates represent the indexes of the features, and we use the vertical coordinates as the label columns.
Heat map of the maximum Kendall correlation coefficient.
By calculating the maximum Kendall correlation coefficient, we can evaluate the maximum relevance between features and labels, helping us to select the most representative features and perform feature selection.
2) Minimum Chi-Square (MC).
Following the introduction of the maximum Kendall correlation coefficient, for any two features Fi and Fj whose correlation coefficient is denoted τ(Fi, Fj), the corresponding chi-square value is denoted χ²(Fi, Fj).
For the association between features Fi and Fj, formula 13 can be used, where i = 1, 2…, n.
\chi^2(F_i, F_j) = \sum_i \frac{(O_i - E_i)^2}{E_i}    (13)
Where Ei represents the expected frequency, Oi is the observed frequency, and ∑ denotes accumulation over all cells.
Based on the calculation of observed frequency and expected frequency, chi-square values can be calculated to verify the degree of correlation between features.
If two different feature sets F and P are known, where the set F has a rows and b columns and the set P has m rows and n columns, with i = 1, 2, …, a, j = 1, 2, …, b, k = 1, 2, …, m, and l = 1, 2, …, n, then the samples can be analyzed in four cases, forming a contingency table:
1. Xij: the number of samples present in the ith row and jth column of set F and also present in the kth row and lth column of set P.
2. Yij: the number of samples present in the ith row and jth column of set F but not present in the kth row and lth column of set P.
3. Zkl: the number of samples not present in the ith row and jth column of set F but present in the kth row and lth column of set P.
4. Wkl: the number of samples present in neither the ith row and jth column of set F nor the kth row and lth column of set P.
Thus, we can express the calculation of the chi-square value as formula 14.
\chi^2 = \frac{N (X_{ij} W_{kl} - Y_{ij} Z_{kl})^2}{(X_{ij} + Y_{ij})(Z_{kl} + W_{kl})(X_{ij} + Z_{kl})(Y_{ij} + W_{kl})}    (14)
Where N = Xij + Yij + Zkl + Wkl is the total number of samples. This formula can be used to calculate the chi-square value between two feature sets of different scales.
Therefore, the minimum chi-square value we can express as formula 15.
MC = \min_{i \neq j} \chi^2(F_i, F_j)    (15)
Suppose there are two sets, features1 = {10, 20, 30, 40} and features2 = {15, 25, 35, 45}. The chi-square values between the two sets are calculated and shown in Fig 3, whose horizontal and vertical coordinates represent the indexes of the elements in features1 and features2, respectively.
Heat map of the minimum chi-square value.
In Fig 3, the color shade of each cell indicates the corresponding feature-to-feature chi-square value, with darker colors indicating smaller chi-square values and lighter colors indicating larger chi-square values. Therefore, by visualizing these values using a heat map and by looking at the horizontal and vertical coordinates and the color shades of each cell, we can intuitively find the minimum chi-square value between features and features. In Fig 3 we mark the minimum chi-square value.
3) Maximum Kendall Minimum Chi-Square (MKMC).
From the mRMR [33] algorithm, we know that in the process of feature selection, not only the relevance between features and labels but also the redundancy among features should be considered. Therefore, Kendall’s correlation coefficient and the chi-square test are used in this paper to measure relevance and redundancy, respectively.
Through the above two parts (Section A.1, Maximum Kendall, and Section A.2, Minimum Chi-Square), we obtain formula 16; that is, we can calculate the score of each feature and output the sequence of features in descending order of score.
MKMC = MK(P, L) - MC    (16)
Where MKMC stands for Maximum Kendall Minimum Chi-Square value. MK (P, L) and MC() are obtained via formulas 12 and 15, respectively.
Feature selection is achieved by continuously selecting features with high relevance to the labels and low redundancy among features: Kendall’s correlation coefficient measures the relevance between features and labels, and the chi-square test measures the redundancy between features.
In Fig 4, the input parameter F denotes the set of all features in the dataset, and the number of features in this set is gradually reduced to empty. L denotes the label vector; i denotes the number of features selected from F; and K denotes the number of features in the dataset.
This flowchart is an introduction to the proposed filter algorithm MKMC.
Algorithm 1 MKMC algorithm pseudo-code.
1 Input: F, L, i = 0, K
2 Output: Sequence of features
3 The maximum Kendall value is calculated by Formula 12
4 If only one feature is selected
5 The chi-square value is calculated by Formula 13
6 Else
7 The chi-square value is calculated by Formula 14
8 Endif
9 The minimum chi-square value is calculated by Formula 15
10 While i < K
11 The feature is selected through Formula 16 and removed from the set
12 i = i + 1
13 End While
14 Output Sequence of features
B. Improved Grey Wolf Optimization algorithm based on random disturbance factor (IGWO)
GWO is a heuristic optimization algorithm inspired by the behavior of grey wolf packs and is commonly used to solve optimization problems. By introducing an appropriate amount of random disturbance factors, the Grey Wolf Optimization algorithm can add a certain amount of randomness in the search process, which helps to improve the global search ability and convergence speed of the algorithm.
1) Position update.
In the traditional GWO, there are problems such as inadequate local search ability and slow convergence speed. To solve this problem, we introduce a random disturbance factor (Q). The search diversity of the algorithm is improved by introducing Q to avoid falling into local optima. Therefore, we change the position update of the traditional algorithm to formula 17, where the image of the change of Q with the number of iterations is shown in Fig 5.
X(t+1) = \frac{X_1 + X_2 + X_3}{3} + Q    (17)
This figure illustrates the trend of Q after the iteration count exceeds two-thirds of the maximum number of iterations.
Where Q is the random disturbance factor which varies according to the mean and standard deviation of the position and is calculated by formula 18.
(18)
Where R is a random number that varies with the number of iterations; σ is the standard deviation of the positions of the wolves; X̄ is the mean value of the position vector; R, σ and X̄ are calculated by formulas 19–21, respectively.
(19)
\sigma = \sqrt{\frac{1}{s} \sum_{i=1}^{s} (X_i - \bar{X})^2}    (20)
\bar{X} = \frac{1}{s} \sum_{i=1}^{s} X_i    (21)
Where rand() is a random number from 0 to 1; π is the mathematical constant pi (approximately 3.14159); t is the current iteration number; T is the maximum iteration number; Xi represents the ith position; X̄ represents the average of all the positions; and s is the number of positions.
In general, a smaller perturbation factor can help the algorithm avoid falling into local optimal solutions and improve the global search capability. It also does not introduce too much randomness, thus ensuring the stability and convergence of the algorithm. Therefore, in this paper, we adopt the change of introducing random perturbations at a time greater than two-thirds of the maximum number of iterations (denoted as: T2/3). At this point, the algorithm may be close to convergence, but there are still some local optimal solutions that have not been discovered. Therefore, by introducing random perturbations, it can help the algorithm to jump out of the influence of local optimal solutions and better explore the space of the global optimal solutions.
Also, introducing random disturbance factor at T2/3 helps to maintain the stability and convergence of the algorithm. Introducing randomness too early or too late may cause the algorithm to be too random or too deterministic, affecting the performance of the algorithm. Since the randomized disturbance factor is not involved in the change of each iteration, we show the change of Q as shown in Fig 5.
In addition, the position update is subject to changes in the collaborative vectors A and C, while A in turn depends on the convergence factor a. In the traditional method, a decreases linearly from 2 to 0. Instead, we introduce a random number so that a varies randomly in the range of 0 to 2, avoiding the effects of chance. The variation of a is shown in Fig 6.
a = 2 \cdot rand()    (22)
This figure shows the change of a before and after the change, where the red color represents the change after the change.
The pseudo-code of the entire improved Grey Wolf Optimization algorithm is shown below, and its flowchart is shown in Fig 7.
This flowchart is an introduction to the IGWO algorithm.
Algorithm 2 IGWO algorithm pseudo-code.
1 Input: population, a, A and C
2 Output: Selected features
3 The fitness values of individual grey wolves are calculated according to Formula 25, and the top 3 wolves with the best fitness α, β and δ are preserved
4 While t < T
5 Update the random disturbance factor Q according to Formula 18
6 If t > T2/3
7 Update the grey wolf position according to Formula 17
8 Else
9 Update the grey wolf position according to Formula 10
10 Endif
11 Update a, A, and C
12 Calculate the fitness value of all grey wolves
13 Update the fitness position of α, β and δ
14 t = t + 1
15 End While
16 Output Selected features
In Fig 7, we represent the original GWO algorithm using orange box lines. The blue and green box lines represent the changes we made to the original GWO. In addition, we have used a red judgment box line to indicate that the changed algorithm is using a random disturbance factor to update the position at T2/3.
2) Iterative oscillatory method for position selection.
Positions are often selected by comparing against a random number with a fixed threshold, usually taken to be 0.5. Such an approach has limitations: there may be cases where both values are greater than 0.5 or both are less than 0.5. To avoid this, we construct a parameter r that varies with the number of iterations and is affected by a random number. We denote it as formula 23:
(23)
Where rand() is a random number in [0, 1) and e is the natural constant.
Therefore, we define the position selection as follows:
(24)
Where position denotes the corresponding position of the feature: if it is 1, the feature at that position is selected; otherwise, it is not selected, as shown in Fig 8.
This figure is an explanation of formula 24.
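The iteration-varying threshold and the binarization rule of Formula 24 can be sketched as follows. The exact expression of Formula 23 is not reproduced in this text, so the sinusoidal shape with a random perturbation used for `oscillating_threshold` is an assumed stand-in.

```python
import math
import random

def oscillating_threshold(t, T):
    """Iteration-varying threshold r (Formula 23's exact form is not shown
    here; a sinusoidal term scaled by rand()/e is an assumed stand-in)."""
    return 0.5 + (random.random() / math.e) * math.sin(2 * math.pi * t / T)

def binarize(position_values, t, T):
    """Formula 24 in spirit: a feature is selected (1) when its continuous
    position value exceeds the oscillating threshold r, else unselected (0)."""
    r = oscillating_threshold(t, T)
    return [1 if v > r else 0 for v in position_values]

random.seed(0)
mask = binarize([0.1, 0.9, 0.5, 0.7], t=10, T=100)
```

Because r oscillates around 0.5 instead of sitting on it, repeated updates are less likely to produce a degenerate all-zeros or all-ones selection.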
3) Fitness function.
The fitness function is a function used in evolutionary algorithms to assess the strengths and weaknesses of an individual. In the framework of evolutionary algorithms such as evolutionary strategies, the fitness of an individual is a measure of the individual’s performance and adaptability in problem solving. The fitness function is not only a criterion for judging the strengths and weaknesses of individuals, but also an important basis for evolutionary algorithms to guide the search direction in the search space.
Conventionally, the average classification accuracy (ACC) alone is used as the evaluation criterion of the fitness function. In this paper, we use dual metrics [34, 35], namely the ACC and the feature subset length (LEN), to evaluate the merits of feature subsets more comprehensively. Adding LEN to the fitness function avoids selecting overly complex feature subsets, which reduces model complexity and improves generalization ability. Considering ACC and LEN jointly therefore evaluates the fitness of feature subsets more effectively and thus improves the results of feature selection. To balance the two terms, we introduce a weight value wf to regulate the proportion of ACC and LEN, as shown in formula 25. The trend of the weight value wf is shown in Fig 9.
(25)
The figure shows the trend of the proportion of ACC and LEN with the number of iterations.
Where fitness is the fitness value; ACC and LEN are the average classification accuracy and feature subset length, respectively.
In formula 25, wf is a function value that varies with the number of iterations, which we define as formula 26.
(26)
Where t is the current iteration number and T is the maximum number of iterations.
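The dual-metric fitness can be sketched as below. The exact expressions of Formulas 25 and 26 are not reproduced in this text, so the linear ramp for wf and the `(1 - LEN/N)` penalty form are assumed stand-ins that match the described behaviour (accuracy weighted against subset length, with the weight shifting over the iterations).

```python
def weight(t, T, w_min=0.8, w_max=0.99):
    """Iteration-dependent weight wf (Formula 26's exact form is not
    reproduced; a linear ramp from w_min to w_max is an assumed stand-in)."""
    return w_min + (w_max - w_min) * t / T

def fitness(acc, subset_len, n_features, t, T):
    """Dual-metric fitness in the spirit of Formula 25: reward classification
    accuracy, penalize long feature subsets."""
    wf = weight(t, T)
    return wf * acc + (1 - wf) * (1 - subset_len / n_features)

# same candidate subset scored early vs. late in the run
early = fitness(acc=0.95, subset_len=10, n_features=100, t=0, T=100)
late = fitness(acc=0.95, subset_len=10, n_features=100, t=100, T=100)
```

As the weight shifts toward ACC, the same subset scores closer to its raw accuracy, which is consistent with the observation later in the paper that the fitness value approaches the ACC.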
C. The proposed algorithm
This section introduces the framework of the proposed algorithm (TMKMCRIGWO), including its pseudo-code and flowchart. By combining the filter algorithms with the wrapper algorithm, the algorithm avoids falling into local optima.
1) Framework of the TMKMCRIGWO algorithm.
The key of the whole algorithm is to connect the two filter algorithms in tandem, as shown in Fig 10. First, the bivariate filter algorithm MKMC generates a candidate feature subset S1. Then S1 is used as the input dataset for the ReliefF algorithm, which generates the candidate feature subset S2 for the wrapper algorithm. By connecting the two filter algorithms in tandem, the features of the original dataset are incrementally reduced, providing a better candidate feature subset for the wrapper algorithm. This method gradually eliminates features irrelevant to the classification task, thus reducing the dimensionality of the feature space and improving the efficiency and accuracy of the algorithm.
This figure is an introduction to the proposed algorithm TMKMCRIGWO.
The MKMC and ReliefF algorithms can assess the relevance and importance of features more comprehensively and accurately, and thereby select a better-quality feature subset. Through the two-layer filtering, we obtain a more representative combination of features, which helps the wrapper algorithm select the optimal feature subset more accurately and efficiently.
The tandem filter stages also help the proposed algorithm jump out of local optima and find the global optimal solution. The final output of the algorithm is the optimal feature subset, which improves classification accuracy and reduces the feature subset length, thus improving the performance and interpretability of the proposed algorithm.
2) Number of selected features.
The size of the dataset directly determines the speed of the algorithm, and a fixed number of selected features is often used when dealing with large data. Such an approach has some limitations.
First, a fixed value may not fully account for the characteristics and complexity of the dataset, leading to inaccurate feature selection. Second, it may not suit datasets of different types or sizes and cannot flexibly respond to feature requirements in different situations. In addition, a fixed value may limit the flexibility and optimization space of feature selection and fail to exploit its full potential. A more intelligent and adjustable approach is therefore needed to ensure the accuracy and efficiency of the analysis. Accordingly, in this paper, datasets of different sizes are treated differently.
In the current era of big data, data often exhibit high dimensionality [37, 38]. For high-dimensional datasets, we use a max-min rule for selection: if the dimension of the dataset exceeds 1000, the number of selected features is computed by this rule; otherwise, it is set to the dimension of the dataset. The specific rule is shown in formula 27.
(27)
Where K is the number of selected features; e is a natural constant; dim denotes the original dimension of the dataset; min and max denote taking the minimum and maximum values respectively. ⌊⌋ denotes rounding down.
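The rule described above can be sketched as follows. The exact expression of Formula 27 is not reproduced in this text, so the log-based shrinkage and the bounds `k_min`/`k_max` are assumed stand-ins for the paper's min/max rule; only the branch on the 1000-feature threshold is taken from the text.

```python
import math

def num_selected_features(dim, k_min=50, k_max=300):
    """Number of candidate features K to keep (hedged reading of Formula 27).

    dim <= 1000: keep every feature, as the text states.
    dim >  1000: shrink by an assumed log-based rule, clamped to [k_min, k_max].
    """
    if dim <= 1000:
        return dim
    k = math.floor(dim / math.log(dim))  # e enters via the natural logarithm
    return min(max(k, k_min), k_max)     # the max/min clamp, rounded down
```

For example, a 500-feature dataset keeps all 500 features, while a 5000-feature dataset is cut to the upper bound.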
The finally selected features are obtained after both the filter and the wrapper stages. The first layer sorts the features of the original dataset through the filter algorithm to form a candidate feature subset; the second layer performs further dimensionality reduction on top of the first; the third layer applies the wrapper algorithm, IGWO, to the result of the first two layers to extract the finally selected features.
After these three layers of dimensionality reduction, we can effectively extract the most useful features from the original data, as shown in Fig 11.
This figure shows the process of selecting features from the original dataset.
3) Population initialization.
For a meta-heuristic based on the GWO algorithm, randomly generating the initial grey wolf positions helps the population achieve a uniform distribution in the search space and increases the exploration performance of the algorithm. In the GWO algorithm, the randomness of the initial positions encourages broader exploration in the initial stage and makes it more likely that the search reaches the global optimal solution.
In this paper, the randomized initialization method [38] is used, taking into account the feature ranking produced by the filter algorithms. Randomly generating the initial position vectors of the grey wolves effectively avoids local clustering of the population in the search space and guarantees a comprehensive search of the solution space. This uniform population distribution helps prevent the algorithm from falling into local optimal solutions, which in turn improves its global search ability. The steps of random initialization are shown in Fig 12.
This figure shows the steps of population initialization by Grey Wolf Optimization algorithm.
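A minimal sketch of the random initialization step, assuming each wolf is a 0/1 mask over the candidate features (the text does not fix the encoding, so the binary representation and the all-zero guard are assumptions):

```python
import numpy as np

def init_population(pop_size, n_features, seed=None):
    """Random initialization of grey wolf position vectors: each wolf is a
    0/1 mask over the candidate features, drawn uniformly so the pack is
    spread across the search space."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_features))
    # guard against the degenerate all-zero wolf (no feature selected)
    for wolf in pop:
        if wolf.sum() == 0:
            wolf[rng.integers(n_features)] = 1
    return pop

pop = init_population(30, 20, seed=1)  # 30 wolves over 20 candidate features
```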
4) Pseudo-code and flowchart.
The flowchart of the proposed algorithm TMKMCRIGWO is shown in Fig 13 and its pseudo-code is Algorithm 3.
This flowchart is an introduction to the proposed algorithm TMKMCRIGWO.
Algorithm 3 TMKMCRIGWO algorithm pseudo-code.
1 Input: population, a, A and C
2 Output: Selected features
3 The candidate feature subset S1 is formed using the MKMC filter algorithm (Algorithm 1)
4 Based on S1, the ReliefF algorithm is performed to form the candidate feature subset S2
5 While t < T
6 Using Algorithm 2 IGWO
7 t = t + 1
8 End While
9 Output Selected features
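Structurally, Algorithm 3 is a three-stage pipeline. The sketch below uses a toy correlation-based ranker and a keep-everything wrapper purely as placeholders for MKMC, ReliefF and IGWO, which are not reimplemented here; `k1` and `k2` are the stage-wise subset sizes.

```python
import numpy as np

def tandem_filter_then_wrap(X, y, rank1, rank2, wrap, k1, k2):
    """Structural sketch of TMKMCRIGWO: filter with rank1 (MKMC stand-in),
    re-filter with rank2 (ReliefF stand-in), then run the wrapper search."""
    s1 = rank1(X, y)[:k1]                           # candidate subset S1
    s2 = [s1[j] for j in rank2(X[:, s1], y)[:k2]]   # candidate subset S2
    mask = wrap(X[:, s2], y)                        # wrapper picks the final mask
    return [f for f, keep in zip(s2, mask) if keep]

# toy stand-ins: rank by absolute correlation with y; wrapper keeps everything
rng = np.random.default_rng(0)
X = rng.random((50, 10))
y = (X[:, 3] + X[:, 7] > 1).astype(int)
corr_rank = lambda X, y: list(np.argsort(-np.abs(np.corrcoef(X.T, y)[-1, :-1])))
selected = tandem_filter_then_wrap(
    X, y, corr_rank, corr_rank, lambda X, y: [1] * X.shape[1], k1=6, k2=3)
```

The point of the structure is that each stage only ever sees the survivors of the previous one, which is what keeps the wrapper's search space small.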
4. Experimental results and analysis
This section introduces the dataset used for the experiments, the parameter settings of the TMKMCRIGWO algorithm and the algorithms used for the three sets of comparison experiments, and the results and analysis of the experiments.
A. Datasets and parameter settings
1) Datasets.
To verify the superiority of the TMKMCRIGWO algorithm, 20 datasets were selected for a series of experiments. The details of these 20 datasets are presented through Table 1.
Among these 20 datasets, we selected 12 low-dimensional datasets from the UCI Machine Learning Repository (http://archive.ics.uci.edu/datasets): Automobile, breastcancer, bupa liver, german, glass, heart, ionosphere, Parkinsons, sonar, SPECT Heart-SPECTF, thyroid, and zoo. We also selected 8 high-dimensional datasets from the Gene Expression Model Selector and Microarray Data collections: DLBCL, Leukemias, Leukemia, SRBCT, Brain Tumor, GLA180 (https://jundongl.github.io/scikit-feature/OLD/datasets_old), GLI85, and Leu [26, 39–42]. We use 2000 dimensions as the dividing line: datasets with fewer than 2000 features are called low-dimensional, and the others high-dimensional. Table 1 gives the dimension of each dataset.
By conducting experiments and comparisons on these diverse datasets, we can evaluate the TMKMCRIGWO algorithm more comprehensively. Table 1 lists the number of samples, features, and classes of each dataset. For convenience, we denote the datasets by abbreviations. In our experiments, we analyze and validate these datasets using the TMKMCRIGWO algorithm and evaluate the performance of the proposed algorithm against the details listed in Table 1, in order to demonstrate that the TMKMCRIGWO algorithm outperforms the other algorithms.
2) Parameter settings.
For all 9 wrapper algorithms and 7 hybrid algorithms, we use a maximum of 100 iterations and a population size of 30 to ensure a fair comparison. Each dataset is tested 10 times, and the reported classification accuracy is the average of the 10 runs. Detailed parameter settings of all algorithms are shown in Table 2. Among the wrapper algorithms, the initial and actual parameter settings of the SA, BBA, GA and CS algorithms are taken from the literature [43–46].
In all these algorithms, we use Support Vector Machine as a classifier to measure the fitness function of the algorithm. The optimal feature subset is the subset of features with the highest classification accuracy based on the Support Vector Machine (SVM) classifier. Radial Basis Function (RBF) is used as the kernel function of the SVM model. And the penalty parameters and RBF parameters are selected by grid search method.
In addition, we measure the average classification accuracy by the ten-fold cross-validation technique: the dataset is divided into 10 groups, and in each of ten rounds, 1 group is used for testing and the remaining 9 for training. Each round produces a classification accuracy; the average of these accuracies, together with the length of the resulting feature subset, is used as the input to the fitness function.
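The evaluation protocol described above (RBF-kernel SVM, penalty and kernel parameters chosen by grid search, ten-fold cross-validation) can be sketched with scikit-learn. The dataset and the grid values below are illustrative, not the paper's.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# grid search over the SVM penalty C and the RBF parameter gamma
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                    cv=10)
grid.fit(X, y)

# ten-fold cross-validated accuracy of the tuned classifier
acc = cross_val_score(grid.best_estimator_, X, y, cv=10).mean()
```

In the paper's setting, `X` would be restricted to a candidate feature subset and `acc` would feed the ACC term of the fitness function.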
Our experiments were carried out in a CPU environment, and the data were analyzed and verified using simulation software. In the preprocessing stage, the first step is to handle missing values: we downloaded the original datasets and filled missing values with the average of neighbouring samples to ensure the completeness of the data. In addition, we scrutinized the outliers in each dataset and replaced them with the sample mean to ensure the reasonableness and consistency of the data.
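A minimal sketch of the imputation step. The paper's "average of neighbouring samples" rule is not fully specified, so a per-feature mean is used here as a stand-in; the toy frame is illustrative.

```python
import pandas as pd

# toy dataset with one missing value per feature
df = pd.DataFrame({"f1": [1.0, None, 3.0],
                   "f2": [4.0, 5.0, None]})

# fill each missing entry with its column (feature) mean
df_filled = df.fillna(df.mean())
```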
In our experiments, we use 20 datasets and 16 algorithms, and each algorithm is run 10 times on each dataset. All the data presented in Tables 3 to 5 are the mean values over the 10 runs of each algorithm on each dataset; using mean values gives a more accurate picture of algorithmic performance and mitigates the impact of outlier runs. Tables 3–6 report the three sets of experiments, showing the average classification accuracy and the corresponding feature subset length achieved by the 15 comparison algorithms and the TMKMCRIGWO algorithm on the 20 datasets. The results across different algorithms and datasets verify that the TMKMCRIGWO algorithm is superior to the others.
B. Experimental results and comparisons
The experiments in this paper are divided into three groups, which are (1) IGWO is compared with other wrapper algorithms, (2) IGWO, MKMC+IGWO, ReliefF+IGWO and the TMKMCRIGWO algorithm are compared, and (3) the TMKMCRIGWO algorithm is compared with other hybrid algorithms.
1) Experiment 1: IGWO with various types of wrapper algorithms.
In this set of experiments, we compare 8 wrapper algorithms with the improved Grey Wolf Optimization algorithm proposed in this paper. From Fig 14, we can see that the average classification accuracy of IGWO exceeds that of the other 8 wrapper algorithms on all datasets.
This figure shows the ACC of the first set of experimental algorithms.
As Table 3 shows, the IGWO algorithm obtains the shortest feature subset length on most of the datasets, and its average classification accuracy is much higher than that of the other algorithms. However, on the breastcancer, heart and thyroid datasets, MVO+SVM has the shortest feature subset length among all wrapper algorithms. On the eight high-dimensional datasets such as DLBCL and Leukemia, the feature subset length of IGWO is reduced to double digits, whereas for the other algorithms it remains very long, even approaching three digits. This confirms that the improved Grey Wolf Optimization algorithm is superior to the other algorithms and paves the way for the second and third sets of experiments.
2) Experiment 2: IGWO, MKMC+IGWO, ReliefF+IGWO and TMKMCRIGWO.
In this set of experiments, IGWO is our improved wrapper algorithm. We combine it with the filter algorithms MKMC and ReliefF respectively, and finally compare these three algorithms with the proposed algorithm.
On the DLBCL, Leukemias, Leukemia, SRBCT, and Leu datasets, the TMKMCRIGWO algorithm achieves an average classification accuracy (ACC) of 100%. The ACC on the breastcancer, glass, ionosphere, Parkinsons, sonar, SPECT Heart-SPECTF, thyroid, zoo, and GLI85 datasets also exceeds 95%. The ACC and feature subset length obtained on these datasets are better than those of the other three algorithms. However, on the Brain Tumor, GLA180 and GLI85 datasets, the ACC obtained by the ReliefF algorithm combined with the improved wrapper algorithm of this paper is higher than that of the proposed algorithm; even so, the feature subset length obtained by the TMKMCRIGWO algorithm is still the smallest. Table 4 lists the specific margins by which the feature subset lengths obtained by TMKMCRIGWO are shorter than the second shortest.
3) Experiment 3: TMKMCRIGWO with other hybrid algorithms.
In this set of experiments, we compare four hybrid algorithms with the TMKMCRIGWO algorithm. Table 5 shows clearly that the results obtained by the TMKMCRIGWO algorithm are better than those of the other hybrid algorithms. The ACC obtained by the TMKMCRIGWO algorithm is the highest on most of the datasets, while the LEN is the shortest and the dimension reduction rate (DRR) the smallest on all datasets. These results show that the TMKMCRIGWO algorithm achieves better performance in feature selection and dimensionality reduction, and has higher utility and reliability than the other hybrid algorithms.
The ACC of the mRMR+GWO algorithm is higher than that of the other algorithms, including TMKMCRIGWO, on the three datasets Brain Tumor, GLA180 and GLI85. However, this does not mean it performs well on all datasets, whereas the TMKMCRIGWO algorithm is effective across different datasets, showing better generalization ability and stability.
Through Table 6, we can also find that on the low-dimensional datasets the DRR is generally in the range of 24% to 66.67%, whereas on the high-dimensional datasets it is in the range of 0.04% to 1.03%. This indicates that the dimensionality reduction effect of the TMKMCRIGWO algorithm is more pronounced on the high-dimensional datasets. These results show that the TMKMCRIGWO algorithm achieves good dimensionality reduction on datasets of different dimensions.
Especially on high-dimensional datasets, the TMKMCRIGWO algorithm performs more prominently compared with other algorithms and has a lower dimension reduction rate, indicating that the TMKMCRIGWO algorithm has better adaptability and effectiveness. In addition, the superiority and effectiveness of the TMKMCRIGWO algorithm is further verified by comparison experiments with other algorithms. These experimental results further validate the superiority of the IGWO algorithm and the TMKMCRIGWO algorithm in feature selection and dimensionality reduction tasks. In the first set of experiments, the IGWO algorithm performs well and outperforms the other wrapper algorithms, indicating that it has better performance and results in the feature selection phase. And in the second and third set of experiments, the proposed algorithm performs better compared to other hybrid algorithms, indicating that the TMKMCRIGWO algorithm has higher ACC in the process of dimensionality reduction.
C. Experimental analysis
1) Dimension Reduction Rate (DRR).
In this paper, the DRR is the ratio of the number of finally selected features to the number of features in the original dataset: the smaller the ratio, the greater the degree of dimensionality reduction achieved by the algorithm; conversely, the larger the ratio, the smaller the degree of dimensionality reduction.
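The definition above is a single ratio; the example values below are illustrative, not taken from the paper's tables.

```python
def dimension_reduction_rate(selected_len, original_dim):
    """DRR as defined above: selected features over original features;
    smaller values mean stronger dimensionality reduction."""
    return selected_len / original_dim

drr_low = dimension_reduction_rate(8, 13)      # e.g. a low-dimensional dataset
drr_high = dimension_reduction_rate(25, 7129)  # e.g. a high-dimensional dataset
```

This is why high-dimensional datasets show tiny DRR values: even a modest selected-subset length divided by thousands of original features yields a ratio well under 1%.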
In Fig 15, we compare the results of the second set of experiments. The feature subset length and the dimensionality reduction rate for each dataset are shown as bar charts and line charts, respectively. The feature subset length of the TMKMCRIGWO algorithm is lower than that of the other three algorithms, and its DRR is also the lowest on all datasets. This indicates that the TMKMCRIGWO algorithm is superior to the other algorithms, and that its DRR is better regardless of whether the dataset is high-dimensional or low-dimensional. It also indirectly demonstrates that the TMKMCRIGWO algorithm applies to both high-dimensional and low-dimensional datasets.
This figure shows LEN and DRR for the second set of experimental algorithms on 20 datasets.
In Fig 16, we compare the ACC and LEN of the four algorithms in the second set of experiments. On the three datasets where ReliefF+IGWO achieves a higher average classification accuracy (Brain Tumor, GLA180 and GLI85), this is due to the ability of the ReliefF algorithm to effectively identify important features. When combined with the wrapper algorithm, the advantages of both can be fully exploited and the important features effectively retained, improving the performance and prediction accuracy of the classifier. However, even though the ACC obtained by that combination is high, the LEN of the proposed algorithm is still the shortest among the four.
This figure shows the ACC and LEN of the second set of experimental algorithms.
In Figs 17 and 18, we compare the dimension reduction rate in the third set of experiments. The dimension reduction rate of the TMKMCRIGWO algorithm is the lowest among all hybrid algorithms on all datasets. On the low-dimensional datasets, the dimension reduction rates of both the TMKMCRIGWO algorithm and the other algorithms are above 25%, whereas on the high-dimensional datasets they are basically in the range of 0.04% to 1%. Although the dimensionality reduction effect is more obvious on large datasets, the dimensionality reduction of the TMKMCRIGWO algorithm on the low-dimensional datasets is also better than that of the other algorithms. This directly proves that the TMKMCRIGWO algorithm handles dimensionality reduction better and achieves a stronger effect.
This figure shows the third set of experiments algorithm in 12 low-dimensional datasets on dimension reduction rate.
This figure shows the third set of experiments algorithm in 8 high-dimensional datasets on dimension reduction rate.
Figs 19 and 20 show the effect of the TMKMCRIGWO algorithm on the low-dimensional and high-dimensional datasets after dimensionality reduction, respectively. We plot the original number of features of each dataset and the length of the feature subset after dimensionality reduction as line graphs. It can be clearly seen that our algorithm reduces the dimensionality of the data on both the low-dimensional and the high-dimensional datasets, showing that it is applicable to both.
This figure shows the dimensionality reduction of the third set of experimental algorithms on 12 low-dimensional datasets.
This figure shows the dimensionality reduction of the third set of experimental algorithms on 8 high-dimensional datasets.
This figure shows the changes in Q and R on the low-dimensional dataset glass.
2) Variation of the random disturbance factor.
The role of the random disturbance factor (Q) is to help GWO explore the search space more extensively, thus improving its ability to find an optimum. It increases the diversity of the algorithm and helps avoid getting stuck in a local optimum and failing to find the global one. By introducing Q, the algorithm is thus more likely to find a better solution. In this paper, the variation of Q is determined by the mean and standard deviation of the grey wolf positions.
To make the random disturbance factor flexible, we add a random number to it so that it keeps changing, avoiding chance results. Since the random disturbance factor adjusts the position update, its trend directly affects the position adjustment strategy. Because of the large number of datasets used in this paper, we choose one representative dataset each from the low-dimensional and the high-dimensional datasets.
Figs 21 and 22 show the variation of the random disturbance factor under the influence of the random number R on the glass and DLBCL datasets, respectively. Since the random disturbance factor is applied only after 2T/3 of the iterations, the variations of Q and R shown fall in this range. On the glass dataset, the values of Q and R are close in most cases. On the DLBCL dataset, the Q values tend to lie above the R values, with a larger difference. This shows that the introduced random disturbance factor follows the random number, which enriches the diversity of the grey wolf position changes.
This figure shows the changes in Q and R on the high-dimensional dataset DLBCL.
3) Variation of fitness values.
In traditional algorithms, a fitness function that considers only the average classification accuracy may ignore the effect of the feature subset length on model performance, leading to the selection of an overly complex feature subset. Such a subset may increase the computational complexity of the model, reduce its generalization ability, and even lead to overfitting.
Adopting a weight to measure the percentage of the ACC and LEN can regulate the trade-off between the two more flexibly. By choosing the weights reasonably, the performance and complexity of the algorithm can be balanced according to specific application scenarios and requirements. For example, if more attention is paid to the simplicity and generalization ability of the model, higher weights can be given to the LEN; if more attention is paid to the accuracy of the model, higher weights can be given to the ACC.
In this paper, we plot the fitness values of the 20 datasets as they vary over the iterations as line graphs. Because the weight value changes with the number of iterations, the value of the fitness function changes as well. As Figs 23 and 24 show, the fitness value is neither static nor monotonically increasing (or decreasing); it is regulated by the weight value according to the relative importance of ACC and LEN.
This figure shows the trend of the fitness value over the number of iterations on the low-dimensional datasets.
The figure shows the trend of fitness values and ACC over the number of iterations on 8 high-dimensional datasets.
As shown in Figs 23 and 24, the fitness value tends to increase over the iterations. However, on the high-dimensional datasets, because of the large amount of data, the early iterations may yield long feature subsets; influenced by the weight value, the fitness value therefore decreases slightly before increasing with the number of iterations, as can be seen in Fig 24. The figures also show that the fitness value is strongly influenced by the ACC and is even close to the obtained average classification accuracy. A higher fitness value usually indicates that the individual is closer to the optimal solution of the problem, which shows that the fitness function helps the algorithm jump out of local optima. Therefore, adopting a weight value to balance the proportions of ACC and LEN regulates the trade-off in feature selection more flexibly, better balancing model performance and complexity. The quality of the fitness function thus indirectly determines the quality of the algorithm.
4) Variation of the convergence factor.
In traditional GWO, the convergence factor a decreases linearly from 2 to 0 during the iteration process. It is designed to let the wolves gradually converge to the neighbourhood of the optimal solution in the search space and to balance global and local search. In this paper, we change the convergence factor a to vary randomly in [0, 2], which brings the following benefits.
1. Avoid premature convergence: the traditional linear decreasing approach may cause the algorithm to converge prematurely to a local optimum solution, whereas random variation can slow down this convergence and help to better explore the search space.
2. Increase the robustness of the algorithm: randomly varying the parameter a can make the algorithm more robust and better adaptable to different problems, helping to improve the algorithm’s global search capability.
Randomly varying the parameter a increases the diversity of the algorithm, giving the wolves a greater ability to explore the search space and helping to avoid falling into local optima. Since the convergence factor varies randomly, its trajectory differs on each dataset. Among the 20 selected datasets, we therefore picked one representative dataset to plot the convergence factor. Fig 25 shows the trend of a on the Leukemia dataset: a changes with the number of iterations, no longer decreasing monotonically but fluctuating randomly between high and low values.
The figure shows the variation of the convergence factor a over the number of iterations on the Leukemia dataset.
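The two schedules for a can be contrasted in a few lines. The uniform draw from [0, 2] follows the description above; treating it as independent per iteration is an assumption.

```python
import random

def convergence_factor_linear(t, T):
    """Traditional GWO: a decreases linearly from 2 to 0 over T iterations."""
    return 2 * (1 - t / T)

def convergence_factor_random():
    """This paper's variant (in the spirit of Formula 22): a is drawn
    uniformly from [0, 2] at each iteration."""
    return 2 * random.random()

random.seed(0)
trace = [convergence_factor_random() for _ in range(100)]  # one run's trajectory
```

The random trace jumps between high (exploration-favouring) and low (exploitation-favouring) values throughout the run, instead of committing to exploitation only at the end.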
5) Variation of the iteratively oscillatory random number r.
In the traditional method, after the grey wolf position is updated, the value at each position is generally not exactly 0 or 1, and a fixed random number is used to assign it to 0 or 1. However, the accuracy of this assignment is not guaranteed, and results of all 0s or all 1s may even appear. To make the results more accurate and convincing, in this paper the fixed random number is replaced by an iterative oscillatory selection method.
Since the selection of the grey wolf position is based on r, we selected the representative Automobile dataset to present the trend of r. As shown in Fig 26, the value of r is no longer fixed at 0.5 but varies with the number of iterations. Such an approach can make the selected subsets richer in diversity.
The figure shows the variation of r over the number of iterations on the Automobile dataset.
The adjustment parameter uses a periodic sinusoidal function value that varies with the iteration count. It gives the data different distribution characteristics at different iteration numbers, thus increasing the diversity of the data. Meanwhile, the periodicity of this function makes the data change regularly across iterations, which helps to better understand how the data evolve. Therefore, changing the fixed random number to a parameter that changes with the number of iterations improves the credibility of the data and the accuracy of the results.
6) Analysis of time complexity.
Time complexity is an important concept in algorithm analysis, which is used to describe the increase in the execution time of an algorithm as the input size increases. Time complexity analysis can help us compare the performance of different algorithms and choose more efficient ones. In general, algorithms with lower time complexity are more efficient. In this paper, all the compared algorithms are compared with the proposed TMKMCRIGWO algorithm, and the time complexity is shown in Table 7.
Where T denotes the total number of iterations, K denotes the number of selected features, N denotes the number of features in the dataset, M denotes the number of nearest neighbors to be considered, and S denotes the time required to execute the SVM classifier.
The time complexity of a pure wrapper algorithm is a little lower because no additional computation is added. In a hybrid algorithm, the filter algorithm is combined with the wrapper algorithm, so its time complexity is a little higher than that of a single wrapper algorithm. In ReliefF+IGWO, the complexity of the ReliefF algorithm depends on the number of nearest neighbors, which is determined by the number of features: the more features, the larger M is; conversely, the smaller it is.
In the TMKMCRIGWO algorithm proposed in this paper, two layers of filtering are first used to select a better subset of candidate features. If the original data set has 100 features, there may be 50 left after the first filter and 20 left after the second filter. Then, these 20 features are used as the original feature subset of the wrapper algorithm, which is more conducive to selecting the most relevant and minimum redundant features. As the number of features decreases, so does the computation time, so the time complexity decreases accordingly.
In Table 7, the number of features finally selected is much smaller than the original number of features in the dataset; since K is much smaller than N, K × N is less than N². Therefore, this algorithm has lower time complexity than the other hybrid algorithms, which means the TMKMCRIGWO algorithm can process more data or perform more iterations in the same amount of time. This improves both the performance and the efficiency of the algorithm. In addition, the lower time complexity suggests that TMKMCRIGWO may be better suited to resource-limited settings, as it can complete computational tasks in a shorter period of time.
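To make the K × N versus N² comparison concrete (the numbers below are illustrative, not taken from Table 7):

```python
N = 100   # original number of features (illustrative)
K = 20    # features surviving the two filter stages (illustrative)

hybrid_cost = K * N   # per-iteration cost term after tandem filtering
naive_cost = N * N    # per-iteration cost term over all N features

# With K << N the filtered search is a fifth of the naive cost here.
print(f"K*N = {hybrid_cost}, N^2 = {naive_cost}, "
      f"ratio = {hybrid_cost / naive_cost:.2f}")  # ratio = 0.20
```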
5. Conclusion
In the feature selection process, hybrid algorithms can help identify and eliminate redundant features in both high- and low-dimensional data. In this paper we therefore propose the hybrid feature selection algorithm TMKMCRIGWO, formed by combining the bivariate filter algorithm MKMC in tandem with the univariate filter algorithm ReliefF, and then with the improved IGWO wrapper algorithm. Experiments verify that the algorithm obtains higher average classification accuracy with shorter feature subsets. The hybrid algorithm thus shows better performance and results than a single algorithm when solving complex problems: by combining the advantages of different algorithms, it overcomes the limitations of any single algorithm and improves both the accuracy and the efficiency of problem solving.
Although the proposed algorithm has achieved good results on some specific datasets and shown its advantages on specific problems, further optimization and improvement are still needed before it can be applied in a wider range of scenarios. Future work may include extending the algorithm's scope of application, optimizing its performance metrics, and carrying out theoretical analysis and model improvement. We believe that continuing these efforts will allow the algorithm to play an important role in more application areas and to bring new breakthroughs and contributions to related research and practice.