Abstract
High-dimensional gene expression data pose significant challenges for binary classification, particularly in the context of feature selection. Conventional methods, for example, the Proportional Overlap Score, Wilcoxon Rank-Sum Test, Weighted Signal-to-Noise Ratio, ensemble Minimum Redundancy and Maximum Relevance, Fisher Score, and Robust Weighted Score for unbalanced data, are impacted by key challenges such as class imbalance and redundancy. To mitigate these issues, customized feature selection methods are required to tackle the class imbalance problem.
This study proposes a more robust solution, the Margin Weighted Robust Discriminant Score (MW-RDS), for feature selection in high-dimensional imbalanced problems. MW-RDS integrates a minority amplification factor to preserve the impact of minority class observations during the feature ranking process. The amplification factor, along with class-specific stability weights obtained from a minority-focused robust discriminant score, is used to achieve maximum differential capability of genes/features. The score is weighted by margin weights extracted from support vectors to enhance the discriminative power of genes/features, thereby highlighting their potential for class separation. Finally, the top-ranked genes/features are constrained using L1-regularization to discard redundant genes while identifying the most significant ones.
The performance of the proposed method is tested on 9 openly accessible gene expression datasets, using Random Forest, Support Vector Machines, and Weighted k-Nearest Neighbors classifiers in terms of the performance metrics accuracy, sensitivity, specificity, F1-score, and precision. The results reveal that the proposed method outperforms the existing methods in most cases. Boxplots and stability plots are also generated to gain a deeper understanding of the results. To further assess the efficacy of the proposed method, the paper also gives a detailed simulation study.
Citation: Gul S, Muhammad Khan D, Aldahmani S, Khan Z (2025) Margin weighted robust discriminant score for feature selection in imbalanced gene expression classification. PLoS One 20(6): e0325147. https://doi.org/10.1371/journal.pone.0325147
Editor: Zeyneb Kurt, The University of Sheffield, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND
Received: July 18, 2024; Accepted: May 5, 2025; Published: June 10, 2025
Copyright: © 2025 Gul et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The datasets used are available as follows: Leukaemia dataset is available at: https://rdrr.io/cran/propOverlap/man/leukaemia.html. Kidney dataset is available at: https://www.openml.org/d/1147. Prostate dataset is available at: https://www.openml.org/search?type=data&status=any&id=1141. Arcena dataset is available at: https://www.openml.org/d/1458. Ovarian dataset is available at: https://openml.org/search?type=data&status=active&id=45098. Breast dataset is available at: https://openml.org/search?type=data&status=active&id=45085. Colon dataset is available at: https://openml.org/search?type=data&status=active&&id=45087. Endometrium Uterus is available at: https://www.openml.org/search?type=data&status=active&id=1164. Ova Endometrium is available at: https://www.openml.org/search?type=data&status=active&id=1142. Ovary Lung is available at: https://www.openml.org/search?type=data&status=active&id=1140.
Funding: This work was supported by the United Arab Emirates University under grant 12B041.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
The analysis of high-dimensional gene expression data has contributed significantly to the growth of biomedical research, particularly in biomarker discovery and the understanding of molecular mechanisms. Derived from technologies such as DNA microarrays and RNA sequencing, high-dimensional datasets capture the simultaneous behaviour of thousands of genes. However, extracting insights from these datasets is challenging due to the presence of many features that are noisy, redundant, or irrelevant. Consequently, detecting genes that regulate the target class is difficult [1,2]. Moreover, the skewed class distribution, i.e., fewer observations in one class than the other, in most gene expression datasets further complicates the identification of regulatory genes. This reduces the efficiency of models trying to identify hidden patterns in the minority class instances, which are the ones of interest in many clinical applications. Many classical machine learning methods [3] are tailored to yield high predictive performance for the majority classes while overlooking the minority classes, which carry significant importance in diagnostic and predictive analyses. The problems of high dimensionality and skewed class distribution are especially acute in computational biology, where models may become tuned to the class with the majority of observations and thereby poorly identify the class with fewer instances [4–7]. These issues highlight the need for algorithms that can effectively handle high dimensionality and class imbalance simultaneously. To this end, feature selection is often used to select genes with high regulatory power and discard noisy and redundant ones, increasing accuracy and interpretability while maintaining a reasonable computational cost [8,9].
Feature selection methods are generally classified into three categories, i.e., wrapper, filter, and embedded. Wrapper methods assess subsets of features using a specific algorithm to achieve optimized model performance. Examples of wrapper methods [10–12] are forward selection, backward elimination, and recursive feature elimination (RFE) methods. Embedded feature selection methods [13] combine feature selection with the model training process including LASSO, ridge regression, and decision tree. Filter methods [14] rank features, as a pre-processing step, based on statistical significance. Examples include Pearson correlation coefficient, Relief based algorithm, and Minimum Redundancy Maximum Relevance (MRMR) method. Most of these methods treat both minority and majority classes equally ignoring the adverse effect of the class imbalance problem [15]. Many of the traditional feature selection methods [16,17] give poor performance in class imbalance scenarios. Several efforts have been made in the literature to tailor these methods for class imbalance [18]. However, these methods have been found to struggle with achieving scalability and robustness [19–21].
Inspired by the above-mentioned notion, the current article suggests a feature selection method, the Margin Weighted Robust Discriminant Score (MW-RDS), tailored for high-dimensional gene expression datasets with the class imbalance problem. The method ranks genes based on their differential capability between two classes using a minority amplification factor that guarantees adequate representation of the minority class observations. The scores are weighted by margin weights obtained from support vectors to achieve maximum differential capability, leading to an overall minority-focused robust discriminant score (RDS). Top-ranked genes are further penalized via L1-regularization to eliminate redundant genes while selecting statistically and biologically relevant ones.
The proposed MW-RDS is assessed on a total of 9 benchmark gene expression datasets using Random Forest (RF), support vector machines (SVM), and weighted k-nearest neighbors (WkNN) classifiers. Classification accuracy, specificity, F1-score, sensitivity, and precision are used as performance metrics. The results are compared with those of the traditional methods such as Proportional Overlap Score (POS), Wilcoxon Rank-Sum Test (Wilcoxon), Weighted Signal-to-Noise Ratio (WSNR), ensemble Minimum Redundancy and Maximum Relevance (mRMRe), Fisher Score (Fisher) and robust weighted score for unbalanced data (ROWSU). A detailed simulation study, demonstrating class imbalance problem, is also given.
The remainder of this paper is arranged as follows: Sect 2 gives a thorough review of the related work. A complete description of the suggested MW-RDS method is given in Sect 3. The experimental design and results based on the benchmark and contrived datasets are given in Sect 4, while Sect 5 concludes the findings of this work.
2 Related work
In the literature, several methods have been proposed for feature/gene selection in high-dimensional gene expression datasets with the class imbalance problem. Due to this problem, patterns related to the minority class are often overlooked, in that feature selection and classification methods mostly learn from the patterns of the majority class observations. Feature selection therefore plays a significant role in identifying the genes/features that are most relevant to a specific classification task, resulting in improved models with minimum complexity [22–24]. Several of these methods prioritize overall data trends, often neglecting the class imbalance problem and redundancy. Addressing these problems requires customized feature selection methods that achieve equitable representation of the minority class and improve predictive accuracy [25–29].
Some existing methods, such as the robust masking technique [30], address this issue to some extent by minimizing noise and outliers. These methods are efficient for handling expression outliers; however, they perform inadequately in dealing with class imbalance, constraining their relevance in high-dimensional datasets. Methods like minimum redundancy maximum relevance (mRMR), along with its extension, the minimum redundancy maximum relevance ensemble (mRMRe), have shown improved results in high-dimensional problems. These methods achieve maximum relevance with the target variable while reducing redundancy [31–33]. However, they face computational challenges and lack a direct mechanism to prioritize the minority class [34].
Other statistical approaches, such as the Wilcoxon Rank-Sum Test [35] and Fisher Score [36], have been valuable for assessing feature relevance but are often limited by their assumption of balanced datasets. Weighted adaptations of these methods attempt to address the class imbalance [37], yet they frequently overlook feature interactions and correlations, resulting in suboptimal feature subsets [38]. Techniques such as Weighted Signal-to-Noise Ratio (WSNR) prioritize features based on their contribution to class distinctions [39], but their reliance on precise signal estimation makes them susceptible to noise [40]. Decision tree-based methods, like Boruta, offer effective feature ranking while addressing class imbalance [41], but their computational intensity can be a barrier for large-scale datasets [42]. Similarly, evolutionary algorithms and embedded approaches, such as Sparse Autoencoders, have shown promising results in addressing class imbalance but are often constrained by high computational costs due to their iterative nature [20,43,44]. While significant advancements have been made, existing methods still face challenges in fully addressing minority class, especially when dealing with noisy, high dimensional imbalance problems.
Considering the above challenges associated with high-dimensional class-imbalanced problems, a customized feature selection method is proposed. This method highlights the importance of the minority class in the feature scoring process and proposes a minority-focused robust discriminant score (RDS) with class-specific stability weights to focus on biologically significant features. The selected genes/features are further refined by separating the classes through margin weights obtained from support vectors and by removing redundancy via L1-regularization, resulting in a concise and discriminant feature set. The proposed method is effective in addressing class-imbalanced problems and thus offers a promising strategy where existing methods often struggle with skewed distributions.
3 Margin Weighted Robust Discriminant Score (MW-RDS)
Let the gene expression dataset be expressed as D = (X, y), where X is the feature matrix consisting of n samples and p features, and y is the binary class variable with y_i ∈ {0, 1}. The dataset is divided into two groups, X⁻ and X⁺, representing the feature matrices of the minority and majority class observations, respectively. The symbol p represents the number of features, and n⁻ and n⁺ represent the numbers of minority and majority class observations, respectively. To mitigate the effects of class imbalance, a minority amplification factor δ that quantifies the degree of imbalance is introduced in Eq 2.
Since the class distribution is highly imbalanced, i.e., n⁻ ≪ n⁺, δ acts as a minority amplification factor that balances the influence of the minority class during the feature scoring process. In particular, the contribution of the minority class feature matrix X⁻, compared to the majority class features, is amplified by the factor δ given in Eq 2. This amplification preserves the discrimination efficacy of genes relevant to the minority class within the feature scoring process and prevents the majority class from dominating. The proposed method adjusts the influence of the minority class, giving it slightly more weight through the factor (1 + δ). This ensures that class-specific features are effectively captured rather than lost to the skewed data. Using the amplification factor δ, a robust discriminant score (RDS) is introduced that incorporates δ and assigns greater importance to genes/features that effectively differentiate the minority class, improving the overall robustness of the model.
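As a concrete illustration, the amplification step can be sketched in code. The exact formula of Eq 2 is not reproduced in this text, so the degree-of-imbalance measure below is an illustrative assumption, not the paper's definition:

```python
import numpy as np

def minority_amplification(y):
    """Illustrative degree-of-imbalance factor (an assumed form, not Eq 2):
    0 when the classes are balanced, approaching 1 under extreme imbalance."""
    n_min = np.sum(y == 1)  # minority class coded as 1
    n_maj = np.sum(y == 0)
    return (n_maj - n_min) / (n_maj + n_min)

y = np.array([0] * 90 + [1] * 10)  # 9:1 imbalance, as in the experiments
delta = minority_amplification(y)
print(round(delta, 2))      # 0.8 for a 9:1 split
print(round(1 + delta, 2))  # minority weight (1 + delta) = 1.8
```

Whatever the exact form of δ, the key point is that minority-class contributions enter the scoring multiplied by (1 + δ) > 1.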
3.1 Minority-focused robust discriminant score
Once the factor δ is computed, it is used to fine-tune the robust discriminant score (RDS), enabling a tailored adjustment that specifically targets class imbalance. The minority-focused RDS uses class-specific stability weights to select genes that show high stability within the majority and minority classes. These weights, w_b, are inversely proportional to the variance of the bth gene and are given in Eq 3, where m_b⁻ and m_b⁺ signify the medians of the bth gene for the minority and majority classes, respectively. Using these weights, the robust discriminant score RDS_b is computed as in Eq 4, where M_b is the combined median of the bth gene/feature across all observations, and MAD_b⁻ and MAD_b⁺ indicate the mean absolute deviations of the bth gene/feature for the minority and majority class observations, respectively. Thus, the score given in Eq 4 is synthesized by combining two key terms: the minority amplification factor given in Eq 2 and the class-specific stability weights given in Eq 3. The resulting score vector RDS = (RDS_1, …, RDS_p), given in Eq 5, is a sequence of discriminant scores for ranking genes based on their differential capability between the classes.
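The ingredients named above (class-wise medians, mean absolute deviations about the combined median, variance-based stability weights, and the (1 + δ) amplification) can be combined in code. The exact combination in Eq 4 is not reproduced here, so the following is one plausible arrangement of those terms, offered as a sketch only:

```python
import numpy as np

def rds_scores(X, y, delta, eps=1e-8):
    """Sketch of a minority-focused robust discriminant score (an assumed
    combination of the paper's ingredients, not the literal Eq 4)."""
    X_min, X_maj = X[y == 1], X[y == 0]
    med_min = np.median(X_min, axis=0)
    med_maj = np.median(X_maj, axis=0)
    med_all = np.median(X, axis=0)                       # combined median
    mad_min = np.mean(np.abs(X_min - med_all), axis=0)   # MAD about combined median
    mad_maj = np.mean(np.abs(X_maj - med_all), axis=0)
    # stability weights: inversely proportional to class-wise variance
    w = 1.0 / (np.var(X_min, axis=0) + np.var(X_maj, axis=0) + eps)
    return (1 + delta) * w * np.abs(med_min - med_maj) / (mad_min + mad_maj + eps)

# toy data: feature 0 separates the classes, feature 1 is pure noise
rng = np.random.default_rng(0)
X_maj = rng.normal(size=(90, 2))
X_min = rng.normal(size=(10, 2))
X_min[:, 0] += 5.0
X = np.vstack([X_maj, X_min])
y = np.array([0] * 90 + [1] * 10)
scores = rds_scores(X, y, delta=0.8)
print(scores[0] > scores[1])  # True: the informative feature ranks higher
```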
3.2 Margin-weighted feature scoring
To further increase the differential capability of the above scores RDS_b, they are weighted by margin weights, say v_b, derived from the support vectors of a linear SVM. For any feature b, the margin weight quantifies how much the bth feature contributes to class distinction, facilitating the ranking of more discriminative features; the SVM itself maximizes the margin between the separating hyperplane and the data observations. Let x_a stand for the feature vector of the ath support vector, with a = 1, …, S indexing the support vectors and S representing their total number. Let y_a represent the class label of the ath observation, and let α_a denote the dual coefficient associated with the support vector x_a. For each feature, these coefficients are used to compute the absolute margin weight v_b = |Σ_a α_a y_a x_{ab}|, indicating the degree to which the feature affects the classification boundary through the support vectors. The vector of weights v = (v_1, …, v_p), shown in Eq 7, helps in identifying the genes/features that contribute to separating the classes. The final robust score for each gene/feature is then estimated as MW-RDS_b = v_b × RDS_b. The robust score of the bth gene/feature represents its combined strength to differentiate between the two classes.
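The margin weights can be read directly off a fitted linear SVM, since standard libraries expose the products α_a y_a and the support vectors. A small sketch using scikit-learn (an implementation choice for illustration, not prescribed by the paper):

```python
import numpy as np
from sklearn.svm import SVC

# toy data where only feature 0 carries the class signal
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

clf = SVC(kernel="linear").fit(X, y)
# dual_coef_ holds alpha_a * y_a for each support vector x_a, so the
# margin weight of feature b is |sum_a alpha_a y_a x_ab|:
w = np.abs(clf.dual_coef_ @ clf.support_vectors_).ravel()
# for a linear kernel this equals the absolute hyperplane coefficients:
assert np.allclose(w, np.abs(clf.coef_).ravel())
print(int(np.argmax(w)))  # the informative feature gets the largest weight
```

The equivalence with |coef_| is a convenient sanity check: summing α_a y_a x_a over the support vectors is exactly how the primal weight vector of a linear SVM is recovered from its dual solution.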
3.3 Feature ranking and redundancy elimination
While features are ranked based on their final robust scores MW-RDS_b, high-dimensional imbalanced problems often contain redundant features. To further refine the selection of informative genes/features, redundancy is addressed through the least absolute shrinkage and selection operator (LASSO) [45], an L1-regularization that promotes sparsity. The corresponding optimization problem is min_β L(β) + λ‖β‖₁, where L(·) denotes the loss function over the feature subset, β is the coefficient vector over the selected features, and λ regulates the degree of sparsity during optimization. The set S_d contains the d top-ranked features based on the scores computed by MW-RDS. For binary classification, the loss is defined as the negative log-likelihood of the logistic regression model. This regularization framework is thus embedded within logistic regression, optimizing the feature selection process by minimizing the penalized objective function in Eq 10, where σ denotes the logistic link function mapping linear combinations of features to probabilities. The parameter λ controls the trade-off between model fit and sparsity, while the term λ‖β‖₁ is the L1-regularization that shrinks the coefficients of redundant features towards zero.
Consequently, the final set of genes consists of the features in S_d whose estimated coefficients remain non-zero. This final feature set contains the top-ranked discriminative features; the combination of robustness, margin-based significance, and sparsity results in a final gene/feature set that effectively balances clarity and performance, especially in high-dimensional, imbalanced contexts like gene expression analysis.
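The redundancy-elimination step can be sketched with an off-the-shelf L1-penalized logistic regression; the toy data and the value of the sparsity parameter below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, d = 200, 10                                   # d = number of top-ranked features kept
X = rng.normal(size=(n, d))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)    # feature 1 is redundant with feature 0
y = (X[:, 0] > 0).astype(int)

# L1-penalized logistic regression; in scikit-learn, C is the inverse of
# the sparsity parameter lambda, so stronger shrinkage means smaller C.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_.ravel() != 0)
print(selected)  # redundant and noise features are shrunk to exactly zero
```

Because the L1 penalty drives coefficients exactly to zero, the surviving indices form the final gene set directly, without a separate thresholding step.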
Algorithm 1 contains the pseudo-code of the proposed method, MW-RDS, and its corresponding flowchart is provided in Fig 1. The proposed MW-RDS algorithm begins with the computation of a robust score, followed by margin-based weighting using support vectors, and then ranks the features while penalizing redundant ones via the L1 penalty.
Algorithm 1. Margin Weighted Robust Discriminant Score (MW-RDS).
4 Experiments and results
This section outlines the experimental design, analysis on benchmark and simulated dataset, and the evaluation metrics. Detailed explanations of the experimental setup and the study’s findings are provided in the following subsections.
4.1 Imbalanced gene expression datasets
To assess how well the proposed method performs compared to the existing methods, we used nine benchmark imbalanced gene expression datasets. A quick overview of these datasets is provided in Table 1. In the given table, the first column lists the dataset ID, followed by the dataset name in the second column. The third and fourth columns show the number of observations (n) and features (p), respectively. The fifth column gives the class-wise distribution (negative and positive classes), and the last column provides the data sources.
4.2 Experimental setup
As the focus of this study is high-dimensional imbalanced data classification, the given data were adjusted to maintain a 9:1 ratio, with 90% of the observations in the negative class and 10% in the positive class. This is done by randomly removing minority class observations until the 9:1 imbalance ratio is reached. For fair validation, each dataset is split into 70% training data, used for building models, and 30% testing data, used to evaluate the methods' performance. A total of 500 runs of the split-sample estimates are obtained to validate the findings and assess the performance of the proposed method against its competitors. The 500 split-sample runs ensure the robustness and reliability of the evaluation, allowing performance variability to be assessed under different data partitions. This extensive validation gives a comprehensive and fair comparison between the MW-RDS approach and the competing methods, controlling the influence of bias from random sampling. It is worth emphasizing that, in both feature selection and classifier application on the selected features, the same training and testing sets are used for all the methods in each run of the experiments.
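The downsampling step that creates the 9:1 ratio can be sketched as follows; the helper name and the class coding (1 = minority) are assumptions for illustration:

```python
import numpy as np

def make_imbalanced(X, y, ratio=9, seed=0):
    """Randomly drop minority-class rows (coded 1) until the
    negative:positive ratio is ratio:1."""
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == 0)
    mnr = np.flatnonzero(y == 1)
    keep = rng.choice(mnr, size=max(1, len(maj) // ratio), replace=False)
    idx = np.concatenate([maj, keep])
    return X[idx], y[idx]

X = np.zeros((130, 3))
y = np.array([0] * 90 + [1] * 40)
Xb, yb = make_imbalanced(X, y)
print(int(np.sum(yb == 0)), int(np.sum(yb == 1)))  # 90 10
```

Each of the 500 runs would then draw a fresh 70/30 train/test split of the downsampled data before feature selection and classification.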
Using the above setup, the top 10 genes are selected by the proposed MW-RDS method and the other feature selection methods, i.e., Proportional Overlap Score (POS), Wilcoxon Rank-Sum Test (Wilcoxon), Weighted Signal-to-Noise Ratio (WSNR), ensemble Minimum Redundancy and Maximum Relevance (mRMRe), Fisher Score (Fisher), and robust weighted score for unbalanced data (ROWSU). Classification models, i.e., random forest (RF), support vector machine (SVM), and weighted k-nearest neighbors (WkNN), were used to evaluate the efficacy of the selected features in terms of the performance metrics accuracy, sensitivity, specificity, F1-score, and precision. To guarantee accuracy and reproducibility, the experiments were conducted in the R programming language, and all three classifiers were applied with the default hyper-parameter values given in the corresponding R packages.
4.3 Results
This section provides the results of the proposed method, MW-RDS, and the competing feature selection methods, i.e., POS, Wilcoxon, WSNR, mRMRe, Fisher, and ROWSU, applied to the nine datasets ID1 through ID9.
Table 2 gives a detailed comparison of the feature selection methods, i.e., POS, Wilcoxon, WSNR, mRMRe, Fisher, and MW-RDS, using the RF, SVM, and WKNN classification models in terms of accuracy, sensitivity, specificity, F1-score, and precision on ID1. The proposed method performs efficiently throughout the analysis. MW-RDS, using RF, achieved the highest specificity and precision, 0.9877 and 0.9928 respectively, among all the feature selection methods. Its sensitivity under RF is also exceptionally high at 0.9960, and its accuracy and F1-score, 0.9766 and 0.9905 respectively, are the highest among all the methods, making MW-RDS a clear winner with the RF classifier. Methods like POS and WSNR significantly under-perform, with accuracy values of 0.7879 and 0.7736, respectively. While mRMRe comes close on some metrics, e.g., a precision of 0.9526, it does not achieve the same consistency across all the performance metrics as MW-RDS. SVM paired with MW-RDS also delivers excellent performance: its accuracy of 0.9962 and precision of 0.9942 are the highest across all feature selection methods, and its F1-score of 0.9900 further highlights its balanced performance. MW-RDS outperforms POS, WSNR, and Fisher, which show notably lower specificity, e.g., 0.1059 for POS and 0.1883 for WSNR. While mRMRe achieves slightly higher sensitivity at 0.9982, MW-RDS provides a more consistent balance across all metrics, making it the most reliable choice for SVM. For WKNN, MW-RDS shows the best performance with the highest specificity of 0.3648 among all the feature selection methods, significantly better than alternatives such as POS at 0.0246 and WSNR at 0.1933. Other methods, such as Fisher and Wilcoxon, perform poorly, with a specificity of 0.1085 and an accuracy of 0.7696, emphasizing the superiority of MW-RDS.
Table 3 gives the results for the ID2 dataset. MW-RDS using RF gives the highest sensitivity, specificity, F1-score, and precision values, i.e., 1, 0.9401, 0.9911, and 0.9916, respectively. However, its accuracy is slightly lower than that of mRMRe, which reaches 0.9959. Similarly, SVM paired with MW-RDS excels with the highest accuracy (0.9959), perfect sensitivity (1), near-perfect specificity (0.9995), and the highest F1-score (0.9972), making it the best-performing combination overall.
In the case of WKNN, MW-RDS demonstrates high accuracy (0.9868), F1-score (0.9938), and near-perfect sensitivity (0.9995), showcasing its effectiveness. While mRMRe slightly outperforms MW-RDS in accuracy (0.9988) and precision, MW-RDS offers a better balance across all metrics. Compared to the other feature selection methods, MW-RDS consistently achieves the best sensitivity, often reaching perfect values for RF and SVM, and performs exceptionally well in terms of F1-score owing to its balanced precision and sensitivity. Although methods like Wilcoxon sometimes achieve higher specificity under WKNN, with a value of 1, MW-RDS provides a more stable and consistent overall performance. Among the models, SVM paired with MW-RDS emerges as the best, delivering the top results across all the metrics, while WKNN is also competitive, particularly in sensitivity and F1-score. RF performs well but lags slightly in accuracy and precision. Overall, MW-RDS proves to be a reliable, high-performing feature selection method and a strong candidate for machine learning tasks in the presence of class imbalance.
Based on the results given in Tables 4, 5, 6, 7, 8, 9, and 10, similar conclusions can be drawn for the ID3 to ID9 datasets, where MW-RDS continues to demonstrate its strong and consistent performance across all the metrics.
For testing statistical significance, Table 11 presents the results of the Wilcoxon Rank-Sum test comparing MW-RDS against other methods. The findings reveal that the superiority of MW-RDS is also statistically significant.
The efficiency of the proposed method is further demonstrated through the boxplots given in Figs 2, 3, 4, 5, and 6 for the various performance metrics on ID1. As shown in the boxplots, the proposed method exhibits significantly higher performance metrics with reduced variability relative to the competing methods, underscoring its robustness and superior effectiveness.
Additionally, the metric plots given in Figs 7, 8, 9, 10, and 11 compare the proposed method with the other methods across various numbers of genes, i.e., 5, 10, 15, 20, 25, 50, 100, and 500. These plots highlight the consistent performance of MW-RDS relative to the other methods as the number of selected genes changes, indicating that the proposed method exhibits greater stability than the others even when the number of genes varies.
The primary aim of the study was to develop a gene selection method that enhances the classification performance of machine learning algorithms on imbalanced high-dimensional gene expression datasets. This study also aims to assist readers interested in further exploring the biological significance of the selected genes. For example, the indices of the selected genes for the ID1 dataset are G2599, G1036, G1495, G4303, G2722, G7747, G1830, G136, G3323, and G9247, while those for ID2 are G4731, G7721, G10459, G6254, G4209, G5347, G5744, G6121, G10477, and G10488, in order of their selection frequency over the 500 runs. The top 10 genes identified by MW-RDS gave impressive accuracies of 97.66% and 99.62% using the RF and SVM classifiers on the ID1 dataset, while the accuracy reached 99.59% with SVM on the ID2 dataset. For further reading on the biological significance of the selected genes, readers are advised to see the work done in [46,47].
4.4 Simulation
This section presents two simulation scenarios to demonstrate the applicability of the proposed method. The first scenario highlights the effectiveness of MW-RDS in addressing imbalanced datasets, while the second scenario explores a data-generating environment where the proposed method might struggle.
To evaluate the performance of feature selection methods on imbalanced datasets, we simulate datasets with skewed class distributions, similar to the characteristics described in the benchmark analysis. The dataset contains n samples and p features. An imbalance ratio of 9:1 signifies that 90% of the observations belong to the majority class and the remaining 10% represent the minority class. The exact numbers of majority and minority observations are computed from the imbalance ratio, which is defined by the proportions of the class distributions expressed in Eqs 11 and 13; the majority and minority classes consist of n⁺ = 0.9n and n⁻ = 0.1n observations of the total n used for simulation. Using the multivariate normal distribution, the feature matrix X is generated to represent continuous feature values as x_i ~ N(μ_c, Σ_c), where μ_c and Σ_c are the mean vector and covariance matrix for the target class c, with c indexing the minority and majority observations, respectively. The amplification factor δ adjusts the contributions of the minority (X⁻) and majority (X⁺) class features, ensuring that the minority class influence is not overshadowed by the majority class. This methodology provides a robust framework for testing the performance of feature selection methods, including the proposed MW-RDS, under challenging and realistic imbalanced data conditions.
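A minimal sketch of such a generator is shown below; the number of informative features, the size of the mean shift, and the identity covariance are illustrative assumptions not specified above:

```python
import numpy as np

def simulate_imbalanced(n=500, p=50, ratio=0.9, shift=1.0, seed=0):
    """Two multivariate-normal classes with a 9:1 split. The 5 informative
    features, the mean shift, and the identity covariance are assumptions."""
    rng = np.random.default_rng(seed)
    n_maj = int(ratio * n)
    n_min = n - n_maj
    X_maj = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n_maj)
    mu_min = np.zeros(p)
    mu_min[:5] = shift  # first 5 features carry the class signal
    X_min = rng.multivariate_normal(mu_min, np.eye(p), size=n_min)
    X = np.vstack([X_maj, X_min])
    y = np.array([0] * n_maj + [1] * n_min)
    return X, y

X, y = simulate_imbalanced()
print(X.shape, int(y.sum()))  # (500, 50) 50
```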
Fig 12 presents the comparison of the proposed and the other methods applied to the balanced simulated dataset, whereas Fig 13 presents the results on the imbalanced simulated data. While POS demonstrates strong performance under the balanced scenario, MW-RDS stands out in the imbalanced scenario given in Fig 13, achieving higher accuracy via SVM and WkNN, higher sensitivity via SVM, higher specificity via RF, SVM, and WkNN, higher F1-score via RF, SVM, and WkNN, and higher precision via SVM and WkNN, all of which are crucial for accurately identifying minority class instances. This capability highlights the suitability of MW-RDS for addressing the challenges posed by imbalanced datasets, where maintaining equitable performance can be difficult for other methods. These results reflect the adaptability and robustness of MW-RDS, making it a valuable choice for diverse data distributions.
Table 12 summarizes the average time (in milliseconds) taken by each feature selection method on each dataset. Methods such as POS, WSNR, and Fisher exhibit minimal computational cost due to their univariate design. In contrast, MW-RDS requires slightly more time due to its enhanced design. Despite requiring slightly more time than these basic filters, MW-RDS remains significantly faster than more elaborate methods such as Wilcoxon, mRMRe, and ROWSU. Although MW-RDS demands more computational cost to run, its notably improved classification performance makes this a worthwhile trade-off.
Furthermore, in terms of sensitivity to the choice of hyper-parameters, the number of selected features stayed almost the same when the level of the minority amplification factor was changed across various settings of the regularization parameter λ. Fig 14 gives a clear demonstration of this on the simulated data. This behavior shows that MW-RDS is consistent across different λ-values, meaning that minimal parameter fine-tuning may be sufficient, and its time complexity keeps it efficient for high-dimensional imbalanced problems. Table 13 summarizes the performance of the proposed method compared to the other feature selection methods on the top 50 features of the imbalanced simulated dataset. Classification was performed using the RF, SVM, and WKNN models, and the results for accuracy, sensitivity, specificity, F1-score, and precision are averaged over 500 iterations. MW-RDS consistently achieved superior results in most cases. The performance of the feature selection methods was further evaluated on the imbalanced simulated dataset using different numbers of selected features; Figs 15 and 16 demonstrate that MW-RDS outperforms the other methods in terms of classification accuracy and sensitivity.
5 Conclusion
This study presented the Margin Weighted Robust Discriminant Score (MW-RDS), an innovative feature selection method designed to tackle the challenges of high-dimensional and imbalanced datasets.
In contrast to existing methods, MW-RDS introduces a minority amplification factor, class-specific stability weights, and margin weights derived from support vectors to ensure that features remain significant for minority-class observations. Moreover, the use of regularization through the logistic function removed redundant features, resulting in a highly efficient feature set.
MW-RDS has been compared with several feature selection methods, including Proportional Overlap Score (POS), the Wilcoxon Rank-Sum Test (Wilcoxon), Weighted Signal-to-Noise Ratio (WSNR), ensemble Minimum Redundancy and Maximum Relevance (mRMRe), Fisher Score (Fisher), and the robust weighted score for unbalanced data (ROWSU) on several benchmark imbalanced gene expression problems. Three classification models, i.e., Random Forest (RF), Support Vector Machines (SVM), and Weighted k-Nearest Neighbors (WkNN), are evaluated in terms of accuracy, sensitivity, specificity, F1-score, and precision to assess the efficacy of the proposed method. MW-RDS consistently outperformed the existing methods, demonstrating its ability to handle class-imbalanced problems and achieve superior classification results.
Although MW-RDS involves a slightly higher computational cost than some of the other feature selection methods, this trade-off results in notably improved performance. Its robustness across various hyper-parameter settings further implies its effectiveness. Overall, MW-RDS provides a promising balance between efficiency and effectiveness for high-dimensional analysis with class imbalance.