Abstract
Advances in data collection have resulted in an exponential growth of high-dimensional microarray datasets for binary classification in bioinformatics and medical diagnostics. These datasets generally possess many features but relatively few samples, resulting in challenges associated with the “curse of dimensionality”, such as feature redundancy and an elevated risk of overfitting. While traditional feature selection approaches, such as filter-based and wrapper-based approaches, can help to reduce dimensionality, they often struggle to capture feature interactions while adequately preserving model generalization. Therefore, this paper introduces the Adaptive Cluster-Guided Simple, Fast, and Efficient (ACG-SFE) feature selection model, a hybrid approach designed to address the challenges of high-dimensional microarray data in binary classification. ACG-SFE enhances the Simple, Fast, and Efficient (SFE) evolutionary feature selection model by integrating hierarchical clustering to dynamically group correlated features, with the optimal number of clusters determined by the Silhouette index, the Davies-Bouldin score, and the feature-to-observation ratio. Within each cluster, representative features are selected adaptively using mutual information, and the selection threshold is adjusted through a progress factor. This hybrid filter-wrapper approach captures feature interactions, effectively minimizing redundancy and overfitting while enhancing classification performance. The proposed model is assessed against four state-of-the-art evolutionary feature selection models on 11 high-dimensional microarray datasets. Experimental results indicate that ACG-SFE effectively selects a small yet pertinent feature subset, minimizing redundancy while attaining enhanced classification accuracy and F-measure. Furthermore, its reduced RMSE between train and test accuracy substantiates its capability to reduce overfitting, outperforming comparative models. These findings establish ACG-SFE as an effective feature selection model for handling high-dimensional microarray data in binary classification, enhancing classification accuracy while selecting minimal relevant features to reduce unnecessary complexity and the risk of overfitting.
Citation: Tye YW, Chew X, Yusof UK, Tulpar S (2025) ACG-SFE: Adaptive cluster-guided simple, fast, and efficient feature selection for high-dimensional microarray data in binary classification. PLoS One 20(9): e0331089. https://doi.org/10.1371/journal.pone.0331089
Editor: Ricardo Zavala Yoe, Tecnológico de Monterrey, MEXICO
Received: April 8, 2025; Accepted: August 8, 2025; Published: September 8, 2025
Copyright: © 2025 Tye et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: This paper was funded by Micron Memory Malaysia Sdn. Bhd. with Grant No. R504-LRGAL007-0006501300-M188 for the project entitled “Dimensionality Reduction for Smart Manufacturing”; and also by the Universiti Sains Malaysia (USM), Short-Term Grant [Grant Number: 304/PKOMP/6315616], Account Number: R501-LR-RND002-0006315616-0000, for the Project entitled “New Coefficient of Variation Control Charts based on Variable Charting Statistics in Industry 4.0 for the Quality Smart Manufacturing and Services”.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Microarray datasets, typically generated through advanced omics technologies, such as genomics and transcriptomics [1], enable simultaneous measurement of thousands of gene expressions, making them valuable for discovering biomarkers and improving medical diagnostics [2]. However, microarray data is extremely high-dimensional, typically containing thousands or even tens of thousands of features (genes) but only a few patient samples [3], which poses significant challenges for binary classification algorithms. Many measured features (genes) in these datasets are irrelevant or noisy, introducing redundancy and unnecessary model complexity, which increases computational cost, heightens the risk of overfitting, and reduces predictive performance in binary classification [4].
Consequently, dimensionality reduction techniques are vital preprocessing methods for addressing challenges in high-dimensional gene expression data, as they compress high-dimensional data into a lower-dimensional form while preserving important information [5]. Generally, dimensionality reduction techniques are categorized into feature extraction and feature selection [6]. Feature extraction transforms the original features into a lower-dimensional feature space by combining them, but this transformation often sacrifices the features’ interpretability [7], making it less suitable for medical diagnostics applications where understanding gene relationships is crucial. Conversely, feature selection preserves interpretability by directly selecting the most informative features (genes) while maintaining clear relationships with class labels [8], making it highly suitable for microarray data analysis. Selected genes from microarray data often represent known biomarkers or critical pathways involved in disease progression [9,10], thereby reinforcing the biological relevance and clinical interpretability of feature selection models. Furthermore, feature selection is more computationally efficient than feature extraction, as it reduces dimensionality without generating new features [11], making it a preferable approach for high-dimensional microarray data.
Feature selection (FS) techniques are typically categorized into filter, wrapper, and embedded approaches [12]. Filter approaches evaluate features independently using statistical measures, such as correlation and mutual information, offering computational efficiency but ignoring interactions between features and their combined effects with the classifier [12], potentially resulting in suboptimal selections. In contrast, wrapper approaches iteratively evaluate feature subsets based on predictive performance with a specific classifier [13]. Although wrapper approaches implicitly capture beneficial feature interdependencies through this evaluation process, they do not explicitly penalize feature redundancy, which can result in redundant feature subsets [14] and a heightened risk of overfitting, particularly in small-sample datasets [15,16]. Embedded approaches perform feature selection during the model training process, providing adaptive selection but limiting generalizability due to their dependency on specific classifier assumptions [17].
Considering these limitations and guided by the no free lunch theorem that no single model universally outperforms others [18], recent studies have increasingly adopted hybrid filter-wrapper feature selection approaches. These approaches initially apply filter approaches to reduce dimensionality efficiently and subsequently use wrapper approaches to refine feature subsets by capturing feature interactions, ultimately improving classification accuracy and providing more meaningful support for clinical decision-making [19].
Consequently, this paper introduces the Adaptive Cluster-Guided Simple, Fast, and Efficient Feature Selection (ACG-SFE) model, a hybrid feature selection approach designed to address the challenges of high-dimensional microarray data in binary classification. By integrating filter and wrapper approaches, ACG-SFE addresses the limitations of independent evaluation in filter approaches and significantly reduces the risks of redundancy and overfitting associated with wrapper approaches. The model effectively captures feature interactions, minimizes redundancy, and enhances generalization via hierarchical clustering and adaptive cluster regularization.
The key contributions of ACG-SFE are as follows:
- Dynamic hierarchical clustering-based feature grouping:
Designs a hierarchical clustering technique that dynamically determines the optimal number of feature clusters using Silhouette and Davies-Bouldin scores, along with the feature-to-observation ratio. This technique effectively groups correlated features, reduces redundancy, and enhances feature interactions.
- Adaptive mutual information-based intra-cluster regularization:
Introduces an adaptive regularization technique that utilizes mutual information to select informative features within each cluster and dynamically adjusts the selection threshold through a progress factor. This technique prevents excessive feature retention, reduces overfitting, and improves generalization performance.
- Hybrid filter-wrapper search optimization:
Integrates dynamic feature clustering and adaptive intra-cluster selection into the Simple, Fast, and Efficient (SFE) heuristic wrapper search. This integration effectively identifies an optimal feature subset, significantly enhancing model accuracy while minimizing feature redundancy.
The remainder of this paper is organized as follows: The related works section reviews related work on filter, wrapper, and hybrid feature selection techniques for high-dimensional microarray data. The research method section presents the ACG-SFE model, detailing its methodology and key components. The experiment design section describes the experimental configuration, including datasets, benchmark models, and evaluation metrics. The results and discussion section presents the results and discusses ACG-SFE’s effectiveness in reducing redundancy, mitigating overfitting, and optimizing feature selection. Finally, the conclusion summarizes key findings and suggests potential future research directions.
Related works
Filter feature selection.
Filter feature selection approaches evaluate each feature’s relevance to the target independently, using statistical or information-theoretic measures [20]. This characteristic makes them computationally efficient and less prone to overfitting in high-dimensional microarray datasets since they do not involve training a classifier during selection [21]. Common filter techniques rank genes by metrics such as t-test scores, chi-square values, correlation, or Mutual Information (MI), selecting the top-ranked genes as potential biomarkers [22]. Correlation-based filter approaches commonly employ Pearson’s correlation to rank features based on linear relationships [23], while Spearman’s correlation detects monotonic relationships [24]. However, strong correlations among selected features introduce redundancy, negatively impacting model performance and increasing the risk of overfitting. To address this redundancy, advanced correlation-based methods like Resampling Fast Correlation-Based Filter (RFCBF) simultaneously evaluate feature-class relevance and redundancy between features, using resampling to stabilize feature selection and improve accuracy [25].
Unlike correlation, MI measures the dependency between features based on information theory [26], effectively capturing both linear and nonlinear relationships between features and class labels [27]. For instance, Zhang et al. introduced a Maximum Conditional Mutual Information (MCMI)-based filter method, effectively reducing redundancy in high-dimensional microarray datasets by selecting genes with strong predictive power and minimal redundancy, thus enhancing classification accuracy and model simplicity [28]. Additionally, Morán-Fernández et al. proposed a low-precision MI-based feature selection approach optimized for devices with limited computational capabilities, demonstrating that reducing the numeric precision of calculations (to 16-bit or even 8-bit) can maintain high classification accuracy on microarray datasets while significantly decreasing computational requirements [29].
In addition to these metrics-based methods, clustering techniques have recently gained prominence for their effectiveness in addressing redundancy by grouping correlated features and selecting representative ones from each cluster [30,31]. For instance, Asghari et al. introduced the Best Clustering Normalized Mutual Information Quantile (BC-NMIQ) model, combining clustering with mutual information ranking and using the Incremental Association Markov Blanket (IAMB) algorithm to select optimal subsets, resulting in enhanced classification accuracy on high-dimensional medical datasets [32]. Similarly, Guo proposed the K-means Neighbourhood Component Feature Selection (KNCFS), which employs correlation-based clustering to group collinear features, subsequently constructing subspaces with lower inter-feature correlations to select robust and informative feature subsets in high-dimensional datasets effectively [33]. These filter and cluster-based approaches have the advantage of speed and can handle ultra-high dimensions, but they struggle to fully capture complex interactions between selected features and the classifier since they do not directly consider a classifier’s feedback.
Wrapper-based feature selection.
Wrapper methods approach feature selection as an optimization problem guided by a predictive model. Instead of ranking features by a metrics score, wrappers repeatedly evaluate candidate feature subsets using a classifier to directly assess subset quality in terms of model performance [13]. This strategy allows wrappers to consider feature interactions and their joint contribution to classification accuracy. Techniques like Recursive Feature Elimination (RFE) and Exhaustive Feature Selection (EFS) were early wrappers used for feature selection, often wrapping KNN, SVM, or other classifiers to search for the best subset [34,35]. By exploring combinations of features, wrapper methods can achieve higher predictive accuracy than filters; however, this advantage comes at a substantially increased computational cost [15,16].
Exhaustively searching all subsets is infeasible for tens of thousands of features in microarray data, prompting researchers to adopt Evolutionary Computation (EC)-based techniques, such as Genetic Algorithms (GA), Binary Particle Swarm Optimization (BPSO), Binary Differential Evolution (BDE), and single-agent heuristics like Simple, Fast, and Efficient (SFE) to effectively explore large search spaces and select informative features with strong predictive performance [36,37]. For instance, Krishna and Rajarajeswari proposed Mutual Fuzzy Swarm Optimization (MFSO), which integrates MI, fuzzy logic, and PSO to select informative and biologically meaningful features effectively, significantly enhancing diagnostic accuracy across multiple microarray datasets [38]. Similarly, Li et al. developed a two-stage hybrid biomarker selection approach combining ensemble filtering and BDE enhanced by binary African vultures optimization (EF-BDBA), demonstrating superior biomarker selection performance by balancing exploration and exploitation effectively during optimization [39].
Recent studies have introduced computationally efficient wrapper-based algorithms to address the computational intensity of traditional evolutionary methods. One prominent example is the Simple, Fast, and Efficient (SFE) algorithm, which utilizes a single-agent heuristic that quickly prunes redundant or irrelevant features through a non-selection operator and then identifies informative features through a selection operator, yielding smaller yet highly informative feature subsets than population-based methods [36]. An enhanced variant, SFE-PSO, further incorporates PSO strategies, balancing rapid convergence and comprehensive search capability [36]. However, despite these strengths, SFE and its variants still face challenges such as premature convergence and susceptibility to overfitting, particularly due to their reliance on validation accuracy and static selection criteria, which can negatively impact their generalization to new data.
Hybrid feature selection.
Given the complementary strengths of filters and wrappers, hybrid feature selection approaches have emerged as a popular strategy for high-dimensional microarray data [40]. Hybrid approaches typically follow a two-stage or multi-stage selection process where a fast filter approach first drastically reduces dimensionality by performing coarse screening of the features, followed by a wrapper approach that refines the selected subset by capturing feature interactions [40]. Recent studies have demonstrated that such hybrid approaches can achieve high accuracy with fewer features than pure filters or wrappers alone.
For instance, Song et al. introduced a three-phase hybrid feature selection approach called Correlation-guided clustering with PSO (HFS-C-P), which first employs a filter-based method to eliminate irrelevant features rapidly, then clusters correlated features efficiently using correlation-guided clustering, and finally refines the selected feature subsets through an improved BPSO that incorporates relevance-guided swarm initialization and adaptive disturbance strategies, achieving a strong balance between computational efficiency and classification accuracy [41]. Similarly, Anosh Babu et al. proposed a two-stage clustering-based hybrid feature selection model combining k-means clustering and a signal-to-noise ratio filter to eliminate redundancy and noise, subsequently applying Cellular learning automata combined with Ant Colony Optimization (CLACO) as a wrapper to pinpoint highly discriminative genes, resulting in improved performance on cancer microarray datasets [42]. Pirgazi et al. proposed a two-phase hybrid method integrating Relief filtering for initial feature weighting with the Shuffled Frog Leaping Algorithm (SFLA) and Incremental Wrapper Subset Selection with Replacement (IWSSr) in the wrapper phase, effectively identifying compact and highly predictive gene subsets from high-dimensional gene expression data [43].
While these hybrid feature selection methods effectively balance computational efficiency and classification accuracy, they often lack adaptability due to their sequential design. Once the features are removed in the filtering step, they are typically not reconsidered, potentially causing the premature discard of valuable features. Conversely, retaining redundant or irrelevant features heightens the risk of overfitting during wrapper-based optimization. Thus, there is a need for an adaptive hybrid model capable of dynamically coordinating filter and wrapper approaches to select optimal, non-redundant subsets, thereby reducing overfitting and ultimately improving generalization performance on high-dimensional microarray data.
Research method
This paper proposes the Adaptive Cluster-Guided Simple, Fast, and Efficient Feature Selection (ACG-SFE) model, which is an advanced feature selection model that builds upon the Simple, Fast, and Efficient (SFE) [36] feature selection model to address the challenges of existing feature selection models highlighted in the related works section. It enhances SFE by introducing dynamic clustering and adaptive regularization filter mechanisms into the wrapper SFE model to preserve valuable features, reduce redundancy, capture important feature interactions, and enhance generalization for high-dimensional microarray data. The following section outlines ACG-SFE and its key components in detail.
Overview of the ACG-SFE model
Fig 1 illustrates the proposed ACG-SFE model, highlighting the newly introduced components and enhancements over the baseline SFE model [36] in green. ACG-SFE integrates the two-phase search strategy of the SFE algorithm, initially performing rapid global exploration to remove redundant or irrelevant features, followed by targeted exploitation to reintroduce highly informative features, resulting in a refined initial feature subset [36].
Unlike the standard SFE, which would stop after these two phases, ACG-SFE further refines the feature subset by introducing a dynamic hierarchical clustering that groups correlated features in which its optimal number of clusters is determined using the Davies-Bouldin index and the Silhouette score and dynamically adjusted according to the feature-to-observation ratio. By incorporating feature clusters obtained from hierarchical clustering into the SFE wrapper’s iterative selection process to select representative features from each cluster, ACG-SFE reduces redundancy caused by highly correlated features and enhances interactions between the selected feature subset and the classifier, thus improving overall classification performance.
ACG-SFE also introduces an adaptive cluster regularization mechanism to reduce overfitting by implementing two complementary strategies. First, a mutual information-based selection selects representative and informative features within each cluster, effectively reducing redundancy. Second, an adaptive feature selection threshold, controlled by a progress factor, gradually limits the number of features as optimization progresses. Together, these strategies ensure a compact and generalizable feature subset, minimizing complexity and enhancing the model’s predictive performance on unseen data. The following sections detail each component of the ACG-SFE model.
Search mechanism
The search mechanism of ACG-SFE operates on a binary-coded feature space, where each feature is represented by a binary state, with 1 indicating a selected feature and 0 indicating an unselected one, as illustrated in Fig 2. Similar to SFE, the initial search process in ACG-SFE consists of two complementary phases: exploration and exploitation, guided by adaptive operators.
Exploration (non-selection operator).
In the exploration phase, the non-selection operator (lines 24–29 in Algorithm 1) globally searches the feature space, systematically switching less informative features from selected (1) to non-selected (0) status. The operator uses an adaptive non-selection rate (UR), starting high to remove redundant or irrelevant features aggressively and gradually decreasing as the search progresses [36]. This adaptive adjustment ensures a balanced exploration-exploitation process, refining the feature subset without prematurely discarding valuable features.
Exploitation (selection operator).
In the exploitation phase, the selection operator (lines 30–37 in Algorithm 1) locally refines the feature subset by reactivating previously excluded features that significantly enhance model performance. This operator ensures the feature subset never becomes empty, maintaining at least one selected feature to prevent premature convergence and further enhancing the exploitation capability of the model [36].
Algorithm 1 presents the pseudo-code for the proposed ACG-SFE, illustrating the exploration-exploitation operators’ process (lines 24–37 in Algorithm 1) used to rapidly produce an initial feature subset significantly smaller than the original feature set. However, since the features have been evaluated independently, the next section introduces and details the clustering-based approach that captures feature interrelationships to reduce redundancy, enhance generalization, and reduce overfitting.
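For illustration, the following minimal Python sketch applies the non-selection operator and the selection safeguard to a candidate binary mask. It assumes NumPy and uses illustrative names; it is a simplified sketch of lines 24–37 in Algorithm 1, not the authors’ released implementation.

import numpy as np

def apply_operators(x, ur, rng=np.random.default_rng()):
    # Non-selection (exploration): switch roughly UR * Nvar selected features to 0.
    x_new = x.copy()
    un = max(1, int(round(ur * x.size)))
    selected = np.flatnonzero(x_new == 1)
    if selected.size > 0:
        drop = rng.choice(selected, size=min(un, selected.size), replace=False)
        x_new[drop] = 0
    # Selection (exploitation) safeguard: if the subset becomes empty, restore X
    # and re-activate one previously non-selected feature (SN = 1).
    if x_new.sum() == 0:
        x_new = x.copy()
        non_selected = np.flatnonzero(x_new == 0)
        x_new[rng.choice(non_selected)] = 1
    return x_new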
Dynamic hierarchical clustering-based feature grouping technique
To effectively manage redundancy and feature interdependence, ACG-SFE employs hierarchical clustering using Ward’s method and Pearson correlation to group correlated features, enabling selection decisions at the cluster level rather than individually. Unlike existing clustering-based FS models, ACG-SFE dynamically determines the optimal number of clusters by evaluating multiple cluster sizes using the Davies-Bouldin and Silhouette indices, adjusted based on the dataset’s feature-to-observation ratio (Algorithm 2 and lines 14–15 in Algorithm 1). This dynamic clustering approach ensures that each cluster groups strongly correlated features while remaining clearly separated from other clusters, which simplifies the search space and prevents redundant evaluation of correlated features.
Algorithm 1: ACG-SFE feature selection algorithm
Input: Training dataset with original feature set, F = {f1, f2, f3, …, fD}
Output: Optimal feature subset, S = {s1, s2, s3, …, sd}
1 Begin:
2 Initialize variables:
3 𝐹𝐸𝑠max: Maximum number of function evaluations
4 Runmax: Maximum number of run
5 𝑈𝑅max: Initial value of non-selection operator rate
6 𝑈𝑅min: Final value of non-selection operator rate
7 Split data into training, validation, and testing sets:
8 Train_Input: Feature matrix for training data
9 Train_Target: Corresponding labels for model training data
10 Fold_Train_Ind: Indices for training data within each cross-validation fold
11 Fold_Test_Ind: Indices for validation data within each cross-validation fold
12 Test_Input: Feature matrix for test data
13 Test_Target: Corresponding labels for the test data
14 Calculate feature-to-observation ratio of Train_Input, feature_obs_ratio = Number of features/ Number of observations
15 Determine num_clusters, the optimal number of clusters, using Algorithm 2 based on feature_obs_ratio
16 feature_clusters = Feature cluster assignments using num_clusters based on Algorithm 3
17 While Run ≤ Runmax Do
18 Initialize feature subset 𝑋={𝑥1,𝑥2,…,𝑥D} randomly in the search space
19 Calculate the fitness of 𝑋 for validation data: fit(X)
20 Calculate the number of features in the dataset, 𝑁𝑣𝑎𝑟
21 𝐹𝐸𝑠 = 1
22 While 𝐹𝐸𝑠 ≤ 𝐹𝐸𝑠max Do
23 𝑋𝑁𝑒𝑤 = 𝑋
24 𝑈𝑅 = 𝑈𝑅𝑚𝑎𝑥
25 UN = [UR × Nvar] % The number of features to change to non-selected mode
26 U_Index = find the indexes of selected features in X
27 U = Generate UN random numbers between 1 and the number of selected features in X
28 K = U_Index(U)
29 Set XNew(K) = 0 % non-selection operation
30 If the number of selected features in XNew == 0
31 XNew = X
32 S_Index = find the indexes of non-selected features in X
33 SN = 1 % The number of features to change to selected mode
34 S = Generate SN random numbers between 1 and the number of non-selected features in X
35 K = S_Index(S)
36 Set XNew(K) = 1 % selection operation
37 End if
38 adaptive_base_features_per_cluster = max(1, ⌈((FEsmax − FEs)/ FEsmax) × 3⌉)
39 𝑋𝑁𝑒𝑤 = Apply adaptive MI-based intra-cluster regularization to 𝑋𝑁𝑒𝑤 based on Algorithm 4
40 Calculate the fitness of 𝑋𝑁𝑒𝑤 for validation data: fit(𝑋𝑁𝑒𝑤)
41 If fit(𝑋𝑁𝑒𝑤) ≥ fit(𝑋)
42 𝑋 = 𝑋𝑁𝑒𝑤
43 fit(𝑋) = fit(𝑋𝑁𝑒𝑤)
44 End if
45 UR = (URmax − URmin) × ((FEsmax − FEs)/ FEsmax) + URmin
46 𝐹𝐸𝑠 = 𝐹𝐸𝑠 + 1
47 End while
48 FNrun = Final number of selected features for this run
49 Run = Run + 1
50 End while
51 End
Selecting an appropriate number of clusters is crucial. If too few clusters are used, unrelated features can be clustered together, reducing the precision of feature selection; conversely, too many clusters can fragment related features, increasing redundancy and computational complexity. To address this, Algorithm 2 dynamically determines the clustering range (minimum and maximum cluster counts) based on the dataset’s feature-to-observation ratio (lines 2–9 in Algorithm 2). Specifically, for datasets with a low feature-to-observation ratio (less than 300), the divisor used to determine the maximum number of clusters is set higher, resulting in fewer clusters to ensure closely related features are grouped accurately without unnecessary splitting, which suits datasets where the numbers of features and observations are relatively balanced. Conversely, for datasets with higher ratios (300 or greater), the divisor is slightly decreased to yield more clusters, preventing unrelated features from being grouped excessively, thus effectively managing redundancy.
Once the clustering range is defined, each candidate number of clusters within this range is evaluated using the Davies-Bouldin and Silhouette indices (lines 13–23 in Algorithm 2). The Davies-Bouldin index assesses cluster compactness and separation, with lower scores indicating better-defined clusters, while the Silhouette score measures how well features fit within their clusters, with higher scores reflecting more meaningful cluster assignments. Both metrics are normalized via min-max scaling (lines 25–38 in Algorithm 2) to ensure a fair comparison across cluster sizes. The optimal number of clusters is selected based on the lowest combined normalized score, equally weighting both metrics (lines 39–43 in Algorithm 2). For clarity, Fig 3 illustrates this process using Algorithm 2 for training data with a feature-to-observation ratio of 500.
After determining the optimal number of clusters, hierarchical clustering using Ward’s method is applied to group strongly correlated features, minimizing within-cluster variance while maximizing between-cluster separation (Algorithm 3 and line 16 in Algorithm 1). By conducting feature selection at the cluster level rather than individually, ACG-SFE simplifies the search space, enhances computational efficiency, reduces redundant feature evaluations, and retains the most informative features. The following section describes how these feature clusters are integrated into an adaptive mutual information-based intra-cluster regularization mechanism of ACG-SFE to reduce overfitting.
Algorithm 2: Determining the optimal number of clusters
Input:
Train_Input: Feature matrix for training data
feature_obs_ratio: Ratio of the number of features to the number of observations in Train_Input
Output: num_clusters: Optimal number of clusters
1 Begin:
2 Set initial range for clusters:
3 min_clusters = 2
4 If feature_obs_ratio < 300
5 divisor = 10 × ⌈ feature_obs_ratio/ 100 ⌉
6 Else
7 divisor = 10 × ⌈ feature_obs_ratio/ 100 ⌉ − 10
8 End if
9 max_clusters = ⌊ feature_obs_ratio/ divisor ⌋
10 Initialize empty arrays to store Davies-Bouldin and Silhouette scores:
11 db_values = Empty array of size (max_clusters − min_clusters + 1) to store Davies-Bouldin scores
12 sil_values = Empty array of size (max_clusters − min_clusters + 1) to store Silhouette scores
13 For k = min_clusters to max_clusters Do
14 Try
15 Compute hierarchical clustering using Ward’s method:
16 Z = Hierarchical clustering linkage matrix using Ward’s method on Train_Input
17 cluster_ids = Cluster labels for each observation through forming k clusters from Z
18 db_valuesk = Davies-Bouldin index for k clusters to evaluate cluster compactness and separation
19 sil_valuesk = Mean Silhouette score for k clusters to measure how well each observation fits within its assigned cluster
20 Catch Exception
21 Set db_valuesk and sil_valuesk to NaN
22 End Try
23 End for
24 Remove NaN values from db_values and sil_values
25 Compute the minimum and maximum Davies-Bouldin and Silhouette scores for normalization:
26 DBmin = Minimum db_values, DBmax = Maximum db_values,
SILmin = Minimum sil_values, SILmax = Maximum sil_values
27 Initialize optimal cluster search:
28 Set best_score = ∞, optimal_clusters = min_clusters
29 For k = min_clusters to max_clusters Do
30 Try
31 Recompute hierarchical clustering to determine the best k by evaluating normalized scores:
32 Z = Hierarchical clustering linkage matrix using Ward’s method on Train_Input
33 cluster_ids = Cluster labels for each observation through forming k clusters from Z
34 db_index = Davies-Bouldin index to evaluate cluster compactness and separation
35 sil_score = Mean Silhouette score to measure how well each observation fits within its assigned cluster
36 Normalize Davies-Bouldin and Silhouette scores:
37 normalized_db = (db_index − DBmin)/ (DBmax − DBmin)
38 normalized_sil = (sil_score − SILmin)/ (SILmax − SILmin)
39 combined_score = 0.5 × normalized_db + 0.5 × (1 − normalized_sil)
40 If combined_score < best_score
41 best_score = combined_score
42 optimal_clusters = k
43 End if
44 Catch Exception
45 Skip k
46 End Try
47 End for
48 Set num_clusters = optimal_clusters
49 End
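As a rough illustration of Algorithm 2, the sketch below estimates the optimal number of clusters with SciPy and scikit-learn. It follows the same cluster range, Ward linkage, and combined normalized score, but computes the linkage once rather than twice; the function name and minor details are assumptions rather than the paper’s exact code.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import davies_bouldin_score, silhouette_score

def optimal_num_clusters(train_input):
    n_obs, n_feat = train_input.shape
    ratio = n_feat / n_obs                              # feature-to-observation ratio
    divisor = 10 * np.ceil(ratio / 100)                 # lines 4-8: cluster-range divisor
    if ratio >= 300:
        divisor -= 10                                   # higher ratios allow more clusters
    min_k, max_k = 2, int(ratio // divisor)

    Z = linkage(train_input, method="ward")             # Ward linkage on the observations
    ks, db, sil = [], [], []
    for k in range(min_k, max_k + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        if len(np.unique(labels)) < 2:                  # skip degenerate partitions
            continue
        ks.append(k)
        db.append(davies_bouldin_score(train_input, labels))
        sil.append(silhouette_score(train_input, labels))

    db, sil = np.array(db), np.array(sil)
    norm = lambda a: (a - a.min()) / (a.max() - a.min() + 1e-12)   # min-max scaling
    combined = 0.5 * norm(db) + 0.5 * (1 - norm(sil))              # lower is better
    return ks[int(np.argmin(combined))]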
Algorithm 3: Clustering features based on correlation matrix
Input:
Train_Input: Feature matrix for training data
num_clusters: Optimal number of clusters
Output: feature_clusters: Cluster assignments for features
1 Begin:
2 corr_matrix = Correlation matrix of Train_Input
3 Perform hierarchical clustering using Ward’s method:
4 Z = Hierarchical clustering linkage matrix using Ward’s method on corr_matrix
5 feature_clusters = Cluster labels for each feature through forming num_clusters clusters from Z
6 End
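A compact SciPy-based sketch of Algorithm 3 is given below; it assumes the Pearson correlation matrix is computed with NumPy and is intended only as an illustration of the steps above, not the released implementation.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_features(train_input, num_clusters):
    corr_matrix = np.corrcoef(train_input, rowvar=False)        # feature-by-feature correlations
    Z = linkage(corr_matrix, method="ward")                      # Ward linkage on correlation rows
    return fcluster(Z, t=num_clusters, criterion="maxclust")     # one cluster label per feature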
Adaptive mutual information-based intra-cluster regularization technique
To further optimize feature selection, ACG-SFE employs an adaptive Mutual Information (MI)-based intra-cluster regularization strategy (Algorithm 4) that refines feature selection within clusters through two complementary mechanisms: MI-based feature ranking and adaptive adjustment of the number of selected features.
Algorithm 4: Adaptive MI-based Intra-cluster Regularization
Input:
X: Feature selection mask representing features chosen through non-selection and selection operators
feature_clusters: Cluster assignments for features
Train_Input: Feature matrix for training data
Train_Target: Corresponding labels for model training data
adaptive_base_features_per_cluster: Base limit on the number of selected features per cluster
𝐹𝐸𝑠: Current number of function evaluations
𝐹𝐸𝑠max: Maximum number of function evaluations
Output: X_Adaptive: Updated feature selection mask
1 Begin:
2 Initialize X_Adaptive as a zero vector of the same size as X
3 unique_clusters = Distinct cluster labels from feature_clusters
4 progress_factor = 𝐹𝐸𝑠/ 𝐹𝐸𝑠max
5 For each cluster_id in unique_clusters Do
6 feature_indices_in_cluster = Indices of features belonging to cluster_id
7 mi_scores = Mutual information values between each feature in feature_indices_in_cluster and Train_Target using Algorithm 5
8 Intra-cluster selection based on mutual information and adaptive feature limit:
9 adaptive_feature_limit = ⌈adaptive_base_features_per_cluster × (1 + (1 − progress_factor) × 0.5 × random_value)⌉
10 selected_features = Top-ranked features from feature_indices_in_cluster based on mi_scores, selecting a number of features equal to the minimum of adaptive_feature_limit and the cluster size
11 Update X_Adaptive to include selected_features
12 End for
13 End
Mutual information-based feature ranking.
ACG-SFE ranks features within each cluster based on their MI with the target labels (lines 6–7 in Algorithm 4). MI quantifies the dependency between a feature and the target, enabling features to be ranked by their relevance. Algorithm 5 computes MI scores by first generating and normalizing a joint histogram of feature values and target labels, then summing over all feature-target probability pairs, which can also be expressed using the standard MI formulation [44,45]:

MI(X; y) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x) p(y)) )

where p(x, y) represents the joint probability of feature value x and target label y, while p(x) and p(y) are their respective marginal probabilities. Features with higher MI scores, indicating stronger associations with the target labels, are prioritized for selection. This approach reduces redundancy by eliminating each cluster’s less informative or highly correlated features.
Algorithm 5: Mutual Information Calculation
Input:
X: Feature values
y: Target labels
Output: mi: Mutual information score between X and y
1 Begin:
2 joint_hist = Joint histogram of X and y, normalized as probabilities
3 marg_x = Marginal probability distribution of X
4 marg_y = Marginal probability distribution of y
5 expected_joint = marg_x × marg_y
6 Identify valid entries where joint_hist > 0 to avoid log(0)
7 mi = ∑(joint_hist(valid) × log(joint_hist(valid)/ expected_joint(valid)))
8 End
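The histogram-based computation in Algorithm 5 can be sketched in NumPy as follows; the bin count and the exact discretization are assumptions, since the pseudocode does not specify them.

import numpy as np

def mutual_information(x, y, bins=16):
    joint_hist, _, _ = np.histogram2d(x, y, bins=bins)      # joint histogram of feature and target
    joint_p = joint_hist / joint_hist.sum()                  # normalize to joint probabilities
    marg_x = joint_p.sum(axis=1, keepdims=True)              # marginal probability of the feature
    marg_y = joint_p.sum(axis=0, keepdims=True)              # marginal probability of the target
    expected = marg_x * marg_y                               # product of marginals
    valid = joint_p > 0                                      # avoid log(0)
    return float(np.sum(joint_p[valid] * np.log(joint_p[valid] / expected[valid])))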
Adaptive feature limit adjustment.
To prevent overfitting and unnecessary complexity, ACG-SFE adaptively limits the maximum number of selected features per cluster using an evolving threshold (line 9 in Algorithm 4). Initially, the algorithm allows more features per cluster to promote broad exploration. As optimization progresses, the threshold decreases gradually to emphasize targeted refinement (exploitation).
Specifically, the adaptive feature limit is calculated using two components:
- Adaptive base feature limit per cluster (line 38 in Algorithm 1):
adaptive_base_features_per_cluster = max(1, ⌈((FEsmax − FEs)/ FEsmax) × 3⌉)
This formula systematically decreases the upper bound on the number of features selected per cluster as the number of function evaluations (FEs) progresses. Early on, a higher limit encourages extensive exploration, whereas later stages focus on refining the most impactful subset of features.
- A Scaling Factor (SF) introduces controlled randomness to balance exploration and exploitation:
SF = 1 + (1 − progress_factor) × 0.5 × random_value
Here, the progress factor represents the proportion of optimization completed (progress_factor = FEs/ FEsmax), and the random value is uniformly sampled from [0,1]. At the early optimization stages (low progress factor), the scaling factor is closer to its upper bound (≈1.5). This stage encourages exploration by allowing the selection of more features per cluster and maintaining diversity through controlled randomness, thus preventing the premature exclusion of potentially useful features. As optimization progresses (high progress factor), SF gradually approaches its lower bound (≈1), reducing the number of features selected per cluster. This shift focuses the algorithm on refining the most informative feature subset while preventing excessive feature selection, thus stabilizing feature selection, reducing redundancy, and effectively mitigating overfitting.
Finally, the adaptive feature limit per cluster is calculated by applying the scaling factor to the adaptive base feature limit:
adaptive_feature_limit = ⌈adaptive_base_features_per_cluster × SF⌉
Once the adaptive feature limit is computed, the number of selected features per cluster is constrained by the smaller value between the adaptive limit and the cluster size. If a cluster contains fewer features than the computed limit, all its features are retained; otherwise, only the top-ranked features (based on mutual information) up to the limit are selected. This adaptive mechanism ensures the selection of a compact, stable, and representative feature subset, effectively reducing redundancy and complexity and enhancing generalization.
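Putting these pieces together, the following sketch combines the adaptive base limit (line 38 in Algorithm 1) with the scaling factor and top-ranked selection (lines 9–10 in Algorithm 4); mi_scores is assumed to be precomputed, for example with the MI sketch above, and the function name is illustrative rather than taken from the paper’s code.

import numpy as np

def select_within_cluster(feature_idx, mi_scores, fes, fes_max, rng=np.random.default_rng()):
    base = max(1, int(np.ceil((fes_max - fes) / fes_max * 3)))     # adaptive base feature limit
    progress = fes / fes_max                                       # progress factor in [0, 1]
    sf = 1 + (1 - progress) * 0.5 * rng.random()                   # scaling factor in [1, 1.5)
    limit = min(int(np.ceil(base * sf)), len(feature_idx))         # never exceed the cluster size
    order = np.argsort(mi_scores)[::-1]                            # rank cluster features by MI
    return np.asarray(feature_idx)[order[:limit]]                  # indices of retained features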
Fitness function
After determining the feature subset through adaptive MI-based intra-cluster regularization, the selected features are evaluated using a fitness function to measure their classification performance (line 40 in Algorithm 1). The fitness function employed by ACG-SFE relies on validation accuracy, assessing how effectively the selected subset generalizes to unseen validation data. Specifically, the validation accuracy is calculated as:

Validation Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP (True Positive) and TN (True Negative) denote the number of correctly classified positive and negative samples, respectively, while FP (False Positives) and FN (False Negatives) represent misclassified positive and negative samples.
An iterative validation procedure (lines 41–44 in Algorithm 1) continuously evaluates and refines the feature subset. In each iteration, the updated feature subset (XNew) is assessed on validation data. If XNew achieves equal or higher validation accuracy than the previous best subset, it replaces the previous best subset. Otherwise, the previous best subset is retained to avoid performance degradation. This iterative approach ensures the best feature subset remains compact, stable, high-performing, and generalizable, optimizing feature reduction and classification performance across diverse datasets.
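A hedged sketch of this fitness evaluation is shown below, assuming a 1-NN classifier and stratified 5-fold splits built with scikit-learn; it mirrors the description above rather than reproducing the authors’ exact implementation.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, train_input, train_target, seed=42):
    cols = np.flatnonzero(mask)                          # indices of currently selected features
    if cols.size == 0:
        return 0.0
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    accs = []
    for tr, va in skf.split(train_input, train_target):
        knn = KNeighborsClassifier(n_neighbors=1)
        knn.fit(train_input[np.ix_(tr, cols)], train_target[tr])
        accs.append(knn.score(train_input[np.ix_(va, cols)], train_target[va]))
    return float(np.mean(accs))                          # mean validation accuracy across folds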
Experiment design
Datasets.
Experiments are conducted using 11 high-dimensional real-world microarray datasets, as listed in Table 1, to evaluate the performance and generalization ability of the proposed ACG-SFE algorithm. These datasets are characterized by a high dimensionality, with features ranging from 2,000 to 19,993 and observations from 20 to 253, resulting in feature-to-observation ratios between 32.26 and 631.35. Since this paper specifically addresses high-dimensional feature selection without significant class imbalance, the selected datasets have balanced to slightly imbalanced binary class distributions, with Imbalance Ratios (IR) ranging from 1 to 4.84. All datasets are publicly accessible and sourced from reputable repositories, including Mendeley Data [46], GitHub repositories [47,48], Shenzhen University’s repository [49], and the NCBI Gene Expression Omnibus (GEO) database [50]. These public datasets have been widely utilized in feature selection research, providing a reliable benchmark for fair comparisons with existing methods and reproducibility of the proposed ACG-SFE algorithm.
Comparative models and parameter settings
The performance of the proposed ACG-SFE algorithm is evaluated against four state-of-the-art feature selection models: BDE [51], BPSO [52,53], SFE [36], and SFE-PSO [36]. Additionally, a scenario Without Feature Selection (WFS) is included as a baseline to demonstrate the practical impact of applying feature selection. BDE and BPSO are included because they are well-known evolutionary algorithms that are widely used for feature selection in high-dimensional datasets and have previously been benchmarked against SFE. In addition, the SFE algorithm is selected because it is a computationally efficient, lightweight, state-of-the-art wrapper-based model that forms the foundation for the proposed ACG-SFE model. Furthermore, the SFE-PSO hybrid model, which combines SFE and PSO, is included to examine whether the proposed ACG-SFE achieves further improvements beyond existing SFE-based hybridizations.
Hyperparameter settings for all algorithms, summarized in Table 2, follow the baseline SFE paper to ensure consistency and fair comparisons throughout the experiments. The population size for population-based evolutionary algorithms, including BDE, BPSO, and the PSO component of SFE-PSO, is consistently set to 20 for fair comparisons. Specifically, BDE employs a Mutation Factor (F) of 0.8 and a Crossover Rate (CR) of 0.2 to balance exploration and exploitation. Meanwhile, BPSO uses an inertia weight (w) of 1, a cognitive coefficient (c1) of 2, and a social coefficient (c2) of 1.5 to achieve a balanced search strategy, combining each particle’s individual experiences with collective swarm intelligence.
Parameters: F, mutation factor; CR, crossover rate; b, population size; w, inertia weight; c1, cognitive coefficient; c2, social coefficient; 𝑈𝑅max, maximum non‑selection operator rate; 𝑈𝑅min, minimum non‑selection operator rate; SN, fixed selection number.
In contrast, SFE, the SFE component of SFE-PSO, and the proposed ACG-SFE utilize a single-agent heuristic approach, dynamically applying non-selection and selection operators. The non-selection operator employs a linearly decaying non-selection operator rate (UR) to remove redundant features, starting from a maximum of 0.3 to broadly explore feature subsets and gradually decreasing toward 0.001 to emphasize exploitation. Additionally, the selection operator ensures that at least one feature remains selected using a fixed selection number (SN = 1), preventing scenarios where no features are selected.
Experimental setup
Experiments employ the K-Nearest Neighbors (KNN) classifier with k = 1, selected due to its simplicity [54], non-parametric flexibility [55], and sensitivity to the selected features’ quality. Unlike parametric classifiers that rely on predefined data distributions, KNN makes no distributional assumptions, enabling flexible adaptation to diverse datasets [56]. Since KNN classifies data based on feature-space distances [57], it directly benefits from informative feature subsets and is adversely influenced by irrelevant or redundant features, making it highly effective for evaluating feature selection models.
The datasets are partitioned into training (80%) and testing (20%) sets to evaluate generalization performance. To ensure reproducibility and fair comparison across multiple feature selection models, all experiments utilize the same training-test split and identical stratified 5-fold cross-validation partitions within the training data, consistently generated using a fixed random seed of 42. Within the training set, the 5-fold cross-validation strategy iteratively evaluates validation accuracy at each of the 6,000 function evaluation steps, guiding feature selection toward an optimal subset. The final best-selected feature subset, obtained at the end of these evaluations, is then assessed on the test data to confirm the robustness and generalization capability.
Several metrics comprehensively measure feature selection effectiveness, including test accuracy, F-measure, Root Mean Square Error (RMSE), number of selected features, Feature Reduction Rate (FRR), and Jaccard similarity. Test accuracy measures overall predictive performance, while the F-measure evaluates the balance between precision and recall, capturing robustness across different class distributions. The RMSE between training and test accuracy is computed to quantify overfitting, with a lower RMSE indicating better generalization and reduced overfitting, calculated as:

RMSE = √((Accuracy_train − Accuracy_test)²)
The number of selected features and the FRR measure the degree of dimensionality reduction, reflecting how the model removes redundant or non-informative features while preserving essential information. FRR measures the percentage reduction in dimensionality achieved through feature selection, defined as:

FRR = ((D − d)/ D) × 100%

where D is the total number of original features and d is the number of selected features.
The Jaccard similarity metric evaluates feature selection stability across multiple runs by measuring the similarity between subsets of selected features. Given two index vectors A and B, representing two selected feature subsets, the Jaccard similarity is calculated as the ratio of the intersection to the union of these subsets [58], with a higher percentage indicating greater consistency and stability of selected features across independent runs, computed as:

J(A, B) = (|A ∩ B| / |A ∪ B|) × 100%
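For completeness, the two dimensionality and stability metrics can be computed as in the short Python sketch below; these are straightforward renderings of the definitions above, not code taken from the paper.

def feature_reduction_rate(n_selected, n_total):
    # Percentage of the original features removed by the selection model.
    return (1 - n_selected / n_total) * 100

def jaccard_similarity(subset_a, subset_b):
    # Percentage overlap between the feature subsets of two independent runs.
    a, b = set(subset_a), set(subset_b)
    return len(a & b) / len(a | b) * 100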
Each feature selection model is executed independently for 30 runs to assess the consistency and reliability of its performance. The performance metrics from these runs are then averaged to provide a stable and representative measure of overall performance. Statistical tests, such as the Wilcoxon signed-rank test and the Friedman test, were used to further validate the statistical significance of the results. The Wilcoxon signed-rank test evaluates statistical significance in pairwise comparisons between proposed ACG-SFE and other models, while the Friedman test ranks models across datasets, ensuring robust comparative analysis. The source code of the proposed ACG-SFE model can be accessed publicly at: https://github.com/yiwei9464/ACG-SFE.git.
All experiments are conducted on a desktop equipped with a 13th Gen Intel(R) Core(TM) i9-13900F processor (2.00 GHz) and 64 GB of RAM, running on a 64-bit Windows 11 Pro system to provide sufficient computational power for handling high-dimensional datasets effectively.
Results and discussion
This section evaluates the performance and selected features of the proposed ACG-SFE and comparative models using the metrics described in the Experimental setup section. The analysis covers classification accuracy, overfitting mitigation, feature redundancy reduction, F-measure, the consistency and clustering behavior of selected features, as well as stability. Computational efficiency, measured as the mean runtime across 30 independent runs, is also assessed to evaluate each model’s feasibility and scalability, as provided in S1 File. The statistical significance of all results is evaluated using the Wilcoxon signed-rank and Friedman tests.
Classification accuracy
Classification accuracy is a primary metric for evaluating feature selection performance. Table 3 presents the worst, best, mean, and standard deviation values of test classification accuracy across 30 independent runs. The Wilcoxon signed-rank test (significance level of 0.05) is used to assess statistical significance, with symbols “+” (significantly better), “−” (significantly worse), and “≈” (no significant difference) indicating comparative performance relative to benchmark models. Additionally, the mean Friedman test, averaging model ranks across all datasets, comprehensively evaluates the algorithms’ performance.
The proposed ACG-SFE model outperforms benchmark algorithms in 9 out of 11 datasets, demonstrating its superior capability to select highly informative features and effectively reduce overfitting. Specifically, ACG-SFE achieves perfect accuracy (100%) on the leukemia, prostate cancer, ovarian cancer, and lung cancer datasets. It also significantly improves accuracy on colon, DLBCL, ALLAML, CNS, and SMK_CAN_187 datasets, further highlighting its effectiveness in feature selection.
For the Prostate GE dataset, ACG-SFE achieves the same mean accuracy (85.00%) as the WFS scenario but significantly reduces dimensionality, selecting only 8 out of 5,966 features. In the Prostate Tumor dataset, ACG-SFE attains marginally lower mean accuracy (84.67%) than WFS (85.00%), but this difference is not statistically significant based on the Wilcoxon signed-rank test. Moreover, ACG-SFE significantly reduces dimensionality by selecting just 13 out of 10,509 features. These results demonstrate that ACG-SFE selects a more compact and highly relevant feature subset while maintaining comparable classification performance.
Examining the variability of the feature selection models, SFE and SFE-PSO exhibit higher fluctuations in test accuracy, with standard deviations exceeding 5% in 10 out of 11 datasets, indicating instability in feature selection. In contrast, BDE, BPSO, and ACG-SFE exhibit more consistent performance with a standard deviation of less than 5% across all datasets, demonstrating greater reliability and stability.
The Wilcoxon signed-rank and Friedman test results for test accuracy confirm ACG-SFE’s superior performance, consistently outperforming or performing on par with other comparative models. Additionally, ACG-SFE achieves the best average rank (1.59) in the Friedman test, significantly outperforming the other algorithms.
Fig 4 illustrates the average test accuracy across datasets, clearly highlighting ACG-SFE’s superior performance compared to all comparative models. Among the benchmarks, the WFS scenario achieves the highest accuracy, closely followed by BDE and BPSO, indicating their effectiveness in retaining informative features without substantial performance loss. Conversely, SFE and SFE-PSO exhibit lower average accuracy (as illustrated in Fig 4) and higher variability (as shown in Table 3), reflecting instability and potential overfitting.
Fig 5 demonstrates the convergence behaviors of all feature selection models across 6,000 function evaluations. ACG-SFE consistently achieves the highest accuracy across most datasets, including colon, DLBCL, Prostate GE, leukemia, ALLAML, CNS, prostate cancer, ovarian cancer, and lung cancer, demonstrating effective and stable convergence. Notably, on the SMK_CAN_187 dataset, ACG-SFE initially obtained the lowest accuracy before 4,000 evaluations but then sharply improved to attain and sustain the highest accuracy, highlighting its ability to refine feature selection progressively. In the prostate tumor dataset, although ACG-SFE’s accuracy is marginally lower than WFS, ACG-SFE significantly outperforms BDE, BPSO, SFE, and SFE-PSO.
Comparatively, SFE and SFE-PSO exhibit fluctuating convergence patterns, suggesting potential overfitting due to insufficient regularization, resulting in unstable performance frequently below WFS. BPSO maintains stable accuracy, closely mirroring WFS performance in 8 of 11 datasets, indicating limited improvements over the baseline. Similarly, BDE shows stable convergence but often underperforms relative to WFS in 6 of 11 datasets, suggesting its inadequacy in addressing overfitting effectively for high-dimensional data.
Overfitting analysis via RMSE of train and test accuracy
Overfitting is a significant challenge in high-dimensional datasets, affecting the model’s generalization capability to unseen data. Table 4 presents the worst, best, mean, and standard deviation of RMSE between train and test accuracy across 30 runs, assessing each model’s generalization capability.
ACG-SFE achieves the lowest RMSE on 9 of 11 datasets, demonstrating superior generalization. Notably, it attains an RMSE of 0.00% in leukemia, prostate cancer, ovarian cancer, and lung cancer datasets, indicating perfect generalization. Additionally, it achieves the lowest RMSE among all models for colon, DLBCL, ALLAML, CNS, and SMK_CAN_187 datasets. In Prostate GE and prostate tumor datasets, ACG-SFE maintains RMSE values comparable to WFS, effectively preserving generalization while substantially reducing dimensionality. Although ACG-SFE records its highest RMSE (37.84%) on the SMK_CAN_187 dataset due to its extremely high dimensionality (19,993 features), this RMSE remains the lowest among all tested algorithms.
In contrast, SFE and SFE-PSO exhibit the highest RMSE values across most datasets, with SFE highest in six and SFE-PSO highest in three, reflecting significant overfitting due to inadequate regularization and excessive reliance on validation accuracy. Meanwhile, BPSO and BDE maintain stable RMSE values but do not show significant improvement over WFS, highlighting their limited effectiveness in addressing overfitting.
Statistical analyses using the Wilcoxon signed-rank and Friedman tests confirm ACG-SFE’s significant superiority. ACG-SFE significantly reduces RMSE in 9 of 11 datasets, achieving the top mean Friedman rank (1.59). Fig 6 visually reinforces ACG-SFE’s consistent advantage in reducing overfitting by illustrating its lowest average RMSE across datasets.
Feature redundancy analysis: number of selected features and feature reduction rate
Effective feature selection aims to minimize redundant features while preserving those most informative for classification. Therefore, this section compares the proposed ACG-SFE with comparative models based on the number of selected features and the Feature Reduction Rate (FRR).
Table 5 shows that ACG-SFE selects the fewest features compared to other models in 10 of 11 datasets, consistently selecting fewer than 32 features without compromising classification accuracy. The only exception is the prostate cancer dataset, where ACG-SFE selects 43 features, more than SFE (21 features). However, ACG-SFE achieves 100% test accuracy compared to SFE’s 66.67% for the prostate cancer dataset, demonstrating the greater relevance of ACG-SFE’s selected features.
Further analysis using the FRR presented in Table 6 confirms that ACG-SFE consistently achieves average FRRs above 99% across 10 datasets, effectively eliminating redundant features. The Wilcoxon signed-rank test confirms that ACG-SFE significantly reduces more features than other models in all but the prostate cancer dataset. The Friedman test further supports these findings, ranking ACG-SFE first, with mean ranks of 1.21 for the number of features and 1.20 for FRR, underscoring its superior capability in dimensionality reduction and redundancy elimination.
Fig 7 illustrates the FRR comparison, highlighting ACG-SFE’s highest average reduction rate (99.79%). Although SFE and SFE-PSO achieve high FRRs at 99.52% and 99.06%, respectively, their inadequate regularization leads to unstable performance and overfitting. Conversely, BDE (50.01%) and BPSO (10.81%) exhibit significantly lower FRRs, indicating a limited ability to eliminate redundant features. These results highlight ACG-SFE’s balanced approach, achieving substantial feature reduction without compromising classification stability and generalization.
F-measure
The F-measure complements accuracy by evaluating the balance between precision and recall, making it particularly valuable to evaluate datasets with class imbalance. Table 7 presents the worst, best, mean, and standard deviation of the F-measure across 30 independent runs for all evaluated models.
ACG-SFE achieves the highest F-measure on all datasets except Prostate GE. For the Prostate GE dataset, ACG-SFE ties with WFS at 86.96%, consistent with their identical test accuracy at 85%. Notably, it achieves a perfect F-measure at 100% on leukemia, prostate cancer, ovarian cancer, and lung cancer, confirming its strong capability to select highly relevant features.
For datasets with higher imbalance (IR greater than 3), such as DLBCL and lung cancer, ACG-SFE significantly outperforms comparative models, highlighting its effectiveness in addressing class imbalance scenarios. Notably, although ACG-SFE’s test accuracy (84.67%) for Prostate Tumor is slightly lower than WFS (85.00%), it attains the highest F-measure among all models for this dataset, indicating better-balanced precision and recall and thus more reliable classification results. Statistical analyses further support ACG-SFE’s effectiveness, with the Wilcoxon test showing significant improvements across 10 datasets and the Friedman test ranking ACG-SFE as the best-performing model with a mean rank of 1.39.
Fig 8 highlights ACG-SFE's superior performance, showing the highest average F-measure (89.03%) among all comparative models. While WFS, BDE, and BPSO achieve similar average F-measures of around 76%, SFE and SFE-PSO perform notably lower at 68.81% and 72.43%, respectively. These results underline ACG-SFE's ability to optimize feature selection effectively while ensuring high predictive accuracy across datasets with balanced or moderate class imbalance.
Consistency and clustering behavior of selected features
Feature selection consistency and clustering behavior are essential indicators of a model’s robustness and generalizability across datasets. Fig 9 illustrates the overall distribution of feature selection frequencies for the ACG-SFE model across 30 independent runs, highlighting that most features selected by the model are consistently chosen in nearly every run, underscoring the stability and reliability of the model.
Fig 10 specifically shows the stable features, defined as those consistently selected in more than 15 of 30 runs. Across all datasets, ACG-SFE identifies a small set of consistently selected features, indicating that these features are predictively relevant rather than randomly chosen.
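A minimal sketch of how such stable features could be identified from repeated runs is shown below; the selected-feature sets are synthetic placeholders, and the more-than-15-of-30 threshold follows the definition above.

```python
# Minimal sketch: count how often each feature index is selected across 30
# runs and keep those chosen in more than 15 runs. Runs are synthetic here.
import random
from collections import Counter

random.seed(0)
core_genes = {12, 345, 678, 9001}            # hypothetical consistently chosen genes
gene_pool = list(range(10_000))              # hypothetical gene indices

# 30 hypothetical runs: the core genes plus a couple of run-specific extras.
selected_per_run = [core_genes | set(random.sample(gene_pool, 2)) for _ in range(30)]

counts = Counter(idx for run in selected_per_run for idx in run)
stable_features = {idx for idx, c in counts.items() if c > 15}   # >15 of 30 runs
print(sorted(stable_features))               # recovers the core genes in this toy case
```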
To further examine the predictive quality and clustering behavior of these stable features selected by the ACG-SFE model, PCA scatter plots are illustrated in Fig 11. Clear and distinct separations between classes are evident in datasets such as Leukemia, ALLAML, Prostate Cancer, Ovarian Cancer, and Lung Cancer, highlighting strong linear separability and biological relevance of these stable features. For other datasets, including Colon, DLBCL, Prostate GE, CNS, and Prostate Tumor, PCA plots reveal some overlap between classes but still show discernible class separation, indicating informative selected features but less obvious linear patterns.
Notably, for the SMK_CAN_187 dataset, the PCA visualization shows significant overlap between classes, indicating limited linear separability. In contrast, the t-SNE visualization for the SMK_CAN_187 dataset in Fig 12 demonstrates substantially improved class distinction, indicating that the selected features effectively capture nonlinear relationships and reduce class overlap. Similar improvements with t-SNE are observed in other datasets, including Colon, DLBCL, Prostate GE, and Prostate Tumor, where the nonlinear t-SNE embedding separates the classes and reduces class overlap even though PCA shows moderate overlap. However, the CNS dataset shows limited improvement with t-SNE compared to PCA, suggesting that differences between classes in this dataset are inherently subtle and difficult to capture visually. Collectively, these visualizations demonstrate that the stable features selected by the ACG-SFE model reliably capture informative, generalizable patterns for effective class discrimination across diverse datasets.
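The following sketch illustrates how comparable PCA and t-SNE projections of the samples, restricted to a set of stable features, could be produced with scikit-learn; the expression matrix and class labels are synthetic stand-ins rather than the study's data.

```python
# Minimal sketch: linear (PCA) versus nonlinear (t-SNE) 2-D embeddings of the
# samples restricted to stable features; the data below are synthetic stand-ins.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)                                 # two classes, 60 samples
X_stable = rng.normal(size=(60, 20)) + 1.5 * y[:, None]   # 20 "stable" genes

pca_emb = PCA(n_components=2).fit_transform(X_stable)
tsne_emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X_stable)

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, emb, title in zip(axes, (pca_emb, tsne_emb), ("PCA", "t-SNE")):
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap="coolwarm", s=25)
    ax.set_title(title)
plt.tight_layout()
plt.show()
```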
Stability analysis
Stability analysis assesses the consistency of the selected features and predictive performance across repeated independent runs. Table 8 summarizes the feature selection stability, measured by Jaccard similarity across 30 runs, for the proposed ACG-SFE and the comparative evolutionary algorithms. ACG-SFE demonstrates exceptional stability, achieving a perfect mean Jaccard similarity score (100%) in 6 of 11 datasets, including Colon, DLBCL, Prostate GE, CNS, Ovarian Cancer, and SMK_CAN_187. This perfect consistency, together with the highest test accuracy among all models, underscores the reliability of ACG-SFE in identifying biologically meaningful feature subsets.
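A brief sketch of the Jaccard-based stability calculation is given below, assuming stability is reported as the mean pairwise Jaccard similarity between the feature subsets selected in different runs; the subsets shown are hypothetical.

```python
# Minimal sketch: mean pairwise Jaccard similarity between the feature subsets
# selected in different runs; the three subsets shown are hypothetical.
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0

runs = [{5, 87, 1203}, {5, 87, 1203}, {5, 87, 990}]     # hypothetical selections
pairwise = [jaccard(a, b) for a, b in combinations(runs, 2)]
print(f"mean Jaccard similarity = {100.0 * sum(pairwise) / len(pairwise):.2f}%")
```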
For datasets such as Leukemia, ALLAML, Prostate Cancer, and Prostate Tumor, ACG-SFE maintains high feature selection stability, achieving mean Jaccard similarities ranging from 91.09% to 96.30%. Although minor variability arises from the randomness inherent in the search process, the stability remains considerably higher than that of the comparative models, reflecting the reliability of the proposed ACG-SFE. For the Lung Cancer dataset, the stability of ACG-SFE is comparatively lower, with a mean Jaccard similarity of 89.53%, slightly below that of BPSO (99.36%), suggesting the presence of multiple equally predictive feature subsets. Nonetheless, ACG-SFE consistently achieves perfect classification accuracy and F-measure (100%) for this dataset, confirming the selected features' strong predictive relevance despite the slight variations.
Comparatively, BDE, SFE, and SFE-PSO exhibit significantly lower stability, with mean Jaccard similarity scores generally below 52%, indicating substantial inconsistency. BPSO shows moderate stability with mean Jaccard similarity scores above 63%, but remains consistently lower than ACG-SFE, except for Lung Cancer. Statistical analyses using Wilcoxon signed-rank and Friedman tests further confirm ACG-SFE’s high stability, assigning it the best overall Friedman rank of 1.14 across all models.
Further evaluation of the predictive performance stability of the ACG-SFE model is provided by the control charts of test accuracy in Fig 13. ACG-SFE consistently achieves high accuracy with minimal variability across 10 of 11 datasets, highlighting robust and reliable performance. Only the Prostate Tumor dataset shows increased variability, with accuracy briefly dropping below the lower control limit in 2 of 30 runs. Such occasional fluctuations suggest minor sensitivity to differences in the selected feature subsets. Nevertheless, the overall mean accuracy of 84.57% remains the highest among the comparative evolutionary models, reinforcing the ACG-SFE model's overall effectiveness.
Similar stability patterns are evident in the control charts of RMSE between training and test accuracy (Fig 14) and the F-measure (Fig 15). For most datasets, the RMSE of ACG-SFE consistently remains low, highlighting strong generalization capability and minimal overfitting across runs. Likewise, the F-measure control charts demonstrate stable and balanced predictive performance, reflecting consistent precision and recall. The Prostate Tumor dataset again displays greater variability in both RMSE and F-measure, aligning with the observed fluctuations in test accuracy. This variability likely results from minor differences in selected feature subsets across runs. Nonetheless, the performance variations across 10 of the 11 datasets remain within acceptable and stable ranges, reinforcing the robustness and generalization capability of the ACG-SFE model.
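As an illustration of how such control charts could be constructed, the sketch below plots per-run test accuracy against a centre line at the mean and control limits at plus or minus three standard deviations, a standard individuals-chart convention; the accuracy values are simulated, not the study's results.

```python
# Minimal sketch of an individuals control chart for per-run test accuracy,
# with a centre line at the mean and limits at +/- 3 standard deviations;
# the accuracy values are simulated, not the study's results.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
test_accuracy = np.clip(rng.normal(0.85, 0.03, size=30), 0.0, 1.0)   # 30 runs

centre = test_accuracy.mean()
sigma = test_accuracy.std(ddof=1)
ucl, lcl = centre + 3 * sigma, centre - 3 * sigma

plt.plot(range(1, 31), test_accuracy, marker="o", label="test accuracy")
for level, style in ((centre, "-"), (ucl, "--"), (lcl, "--")):
    plt.axhline(level, linestyle=style, color="grey")
plt.xlabel("Run")
plt.ylabel("Test accuracy")
plt.legend()
plt.show()
```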
Collectively, these stability analyses, complemented by the Jaccard similarity measures and performance control charts, reinforce the consistency and generalization capabilities of the ACG-SFE model, highlighting its suitability for reliable and reproducible analysis of high-dimensional microarray datasets.
Conclusion
This paper proposes the Adaptive Cluster-Guided Simple, Fast, and Efficient (ACG-SFE) algorithm, a hybrid feature selection model designed for high-dimensional microarray datasets in binary classification. ACG-SFE effectively enhances classification by capturing feature interactions, reducing redundancy, and improving generalization through three main strategies: dynamic hierarchical clustering to group correlated features, adaptive mutual information-based intra-cluster regularization to select highly informative cluster features, and a hybrid filter-wrapper heuristic search for efficient feature optimization.
Experimental results demonstrate that the ACG-SFE model consistently outperforms state-of-the-art evolutionary feature selection models across multiple evaluation metrics, including classification accuracy, F-measure, feature reduction rate, RMSE between training and test accuracy, and Jaccard similarity. Statistical analyses using the Wilcoxon signed-rank and Friedman tests further confirm that ACG-SFE achieves higher classification accuracy and F-measure while effectively minimizing redundant features and overfitting compared to the comparative models. Thus, the proposed ACG-SFE provides a stable, high-performance, and generalizable solution for feature selection, effectively addressing critical challenges in high-dimensional microarray data analysis.
Despite achieving the lowest RMSE among comparative models, the relatively high RMSE (37.84%) on the extremely high-dimensional SMK_CAN_187 dataset indicates potential scalability limitations, likely due to the increased complexity of capturing feature interactions in ultra-high-dimensional spaces. Future research could enhance ACG-SFE by (1) developing incremental or online selection strategies for streaming data, (2) incorporating multi-objective optimization to balance accuracy, redundancy, and feature relevance, and (3) integrating reinforcement learning to manage severe class imbalance in high-dimensional datasets.