Abstract
High-dimensional gene expression data pose significant challenges for binary classification, particularly in the context of feature selection. Conventional methods, for example, the Proportional Overlap Score, Wilcoxon Rank-Sum Test, Weighted Signal-to-Noise Ratio, ensemble Minimum Redundancy and Maximum Relevance, Fisher Score, and Robust Weighted Score for unbalanced data, are impacted by key challenges such as class imbalance and redundancy. To mitigate these issues, customized feature selection methods are required to tackle the class imbalance problem.
This study proposes a more robust solution, the Margin Weighted Robust Discriminant Score (MW-RDS), for feature selection in high-dimensional imbalanced problems. MW-RDS integrates a minority amplification factor to preserve the impact of minority class observations during the feature ranking process. The amplification factor, along with class-specific stability weights obtained from a minority-focused robust discriminant score, is used to achieve maximum differential capability of genes/features. The score is weighted by margin weights extracted from support vectors to enhance the discriminative power of genes/features, thereby highlighting their potential for class separation. Finally, the top-ranked genes/features are constrained using L1-regularization to discard redundant genes while identifying the most significant ones.
The performance of the proposed method is tested on 9 openly accessible gene expression datasets, using Random Forest, Support Vector Machines, and Weighted k-Nearest Neighbors classifiers in terms of the performance metrics accuracy, sensitivity, specificity, F1-score, and precision. The results reveal that the proposed method outperforms the existing methods in most cases. Boxplots and stability plots are also generated to gain a deeper understanding of the results. To further assess the efficacy of the proposed method, the paper also gives a detailed simulation study.
Citation: Gul S, Muhammad Khan D, Aldahmani S, Khan Z (2025) Margin weighted robust discriminant score for feature selection in imbalanced gene expression classification. PLoS One 20(6): e0325147. https://doi.org/10.1371/journal.pone.0325147
Editor: Zeyneb Kurt, The University of Sheffield, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND
Received: July 18, 2024; Accepted: May 5, 2025; Published: June 10, 2025
Copyright: © 2025 Gul et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The datasets used are available as follows: Leukaemia dataset is available at: https://rdrr.io/cran/propOverlap/man/leukaemia.html. Kidney dataset is available at: https://www.openml.org/d/1147. Prostate dataset is available at: https://www.openml.org/search?type=data&status=any&id=1141. Arcena dataset is available at: https://www.openml.org/d/1458. Ovarian dataset is available at: https://openml.org/search?type=data&status=active&id=45098. Breast dataset is available at: https://openml.org/search?type=data&status=active&id=45085. Colon dataset is available at: https://openml.org/search?type=data&status=active&&id=45087. Endometrium Uterus is available at: https://www.openml.org/search?type=data&status=active&id=1164. Ova Endometrium is available at: https://www.openml.org/search?type=data&status=active&id=1142. Ovary Lung is available at: https://www.openml.org/search?type=data&status=active&id=1140.
Funding: This work was supported by the United Arab Emirates University under grant 12B041.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
The analysis of high-dimensional gene expression data has contributed significantly to the growth of biomedical research, particularly in biomarker discovery and the understanding of molecular mechanisms. Derived from technologies such as DNA microarrays and RNA sequencing, high-dimensional datasets capture the simultaneous behaviour of thousands of genes. However, extracting insights from these datasets is challenging due to the presence of many features that are noisy, redundant, or irrelevant. Consequently, detecting genes that regulate the target class is difficult [1,2]. Moreover, the skewed class distribution, i.e., fewer observations in one class than the other, in most gene expression datasets further complicates the identification of regulatory genes. This reduces the efficiency of models trying to identify hidden patterns in the minority class instances, which are the ones of interest in many clinical applications. Many classical machine learning methods [3] are tailored to yield high predictive performance for the majority classes while overlooking the minority classes, which carry significant importance in diagnostic and predictive analyses. The problems of high dimensionality and skewed class distribution are especially acute in computational biology, where models may become tuned to the class with the majority of observations and thereby poorly identify the class with fewer instances [4–7]. These issues highlight the need for algorithms that can effectively handle high dimensionality and class imbalance simultaneously. To this end, feature selection is often used to select genes with high regulatory power and discard noisy and redundant ones, increasing accuracy and interpretability while maintaining a reasonable computational cost [8,9].
Feature selection methods are generally classified into three categories, i.e., wrapper, filter, and embedded. Wrapper methods assess subsets of features using a specific algorithm to achieve optimized model performance. Examples of wrapper methods [10–12] are forward selection, backward elimination, and recursive feature elimination (RFE) methods. Embedded feature selection methods [13] combine feature selection with the model training process including LASSO, ridge regression, and decision tree. Filter methods [14] rank features, as a pre-processing step, based on statistical significance. Examples include Pearson correlation coefficient, Relief based algorithm, and Minimum Redundancy Maximum Relevance (MRMR) method. Most of these methods treat both minority and majority classes equally ignoring the adverse effect of the class imbalance problem [15]. Many of the traditional feature selection methods [16,17] give poor performance in class imbalance scenarios. Several efforts have been made in the literature to tailor these methods for class imbalance [18]. However, these methods have been found to struggle with achieving scalability and robustness [19–21].
Inspired by the above-mentioned notion, the current article suggests a feature selection method, the Margin Weighted Robust Discriminant Score (MW-RDS), tailored for high-dimensional gene expression datasets with the class imbalance problem. The method ranks genes based on their differential capability between two classes using a minority amplification factor that guarantees adequate representation of the minority class observations. The scores are weighted by margin weights obtained from support vectors to achieve maximum differential capability, leading to an overall minority-focused robust discriminant score (RDS). Top-ranked genes are further penalized via L1-regularization to eliminate redundant genes while selecting statistically and biologically relevant ones.
The proposed MW-RDS is assessed on a total of 9 benchmark gene expression datasets using Random Forest (RF), support vector machines (SVM), and weighted k-nearest neighbors (WkNN) classifiers. Classification accuracy, specificity, F1-score, sensitivity, and precision are used as performance metrics. The results are compared with those of the traditional methods such as Proportional Overlap Score (POS), Wilcoxon Rank-Sum Test (Wilcoxon), Weighted Signal-to-Noise Ratio (WSNR), ensemble Minimum Redundancy and Maximum Relevance (mRMRe), Fisher Score (Fisher) and robust weighted score for unbalanced data (ROWSU). A detailed simulation study, demonstrating class imbalance problem, is also given.
The remainder of this paper is arranged as follows: Sect 2 gives a thorough review of the related work. A complete description of the suggested MW-RDS method is given in Sect 3. The experimental design and results based on the benchmark and contrived datasets are given in Sect 4, while Sect 5 concludes the findings of this work.
2 Related work
In the literature, several methods have been proposed for feature/gene selection in high-dimensional gene expression datasets with the class imbalance problem. Due to this problem, patterns related to the minority class are often overlooked, in that feature selection and classification methods mostly learn from the patterns of the majority class observations. Feature selection therefore plays a significant role in identifying the genes/features that are most relevant to a specific classification task, resulting in improved models with minimum complexity [22–24]. Several of these methods prioritize overall data trends, often neglecting the class imbalance problem and redundancy. Addressing these problems requires customized feature selection methods that achieve equitable representation of the minority class and improve predictive accuracy [25–29].
Some existing methods, such as the robust masking technique [30], address this issue to some extent by minimizing noise and outliers. These methods are efficient for handling expression outliers; however, they perform inadequately in dealing with class imbalance, constraining their relevance in high-dimensional datasets. Methods like minimum redundancy maximum relevance (mRMR), along with its extension, the minimum redundancy maximum relevance ensemble (mRMRe), have shown improved results in high-dimensional problems. These methods achieve maximum relevance with the target variable while reducing redundancy [31–33]. However, they face computational challenges and lack a direct mechanism to prioritize the minority class [34].
Other statistical approaches, such as the Wilcoxon Rank-Sum Test [35] and Fisher Score [36], have been valuable for assessing feature relevance but are often limited by their assumption of balanced datasets. Weighted adaptations of these methods attempt to address the class imbalance [37], yet they frequently overlook feature interactions and correlations, resulting in suboptimal feature subsets [38]. Techniques such as Weighted Signal-to-Noise Ratio (WSNR) prioritize features based on their contribution to class distinctions [39], but their reliance on precise signal estimation makes them susceptible to noise [40]. Decision tree-based methods, like Boruta, offer effective feature ranking while addressing class imbalance [41], but their computational intensity can be a barrier for large-scale datasets [42]. Similarly, evolutionary algorithms and embedded approaches, such as Sparse Autoencoders, have shown promising results in addressing class imbalance but are often constrained by high computational costs due to their iterative nature [20,43,44]. While significant advancements have been made, existing methods still face challenges in fully addressing minority class, especially when dealing with noisy, high dimensional imbalance problems.
Considering the above challenges associated with high-dimensional class-imbalanced problems, a customized feature selection method is proposed. This method highlights the importance of the minority class in the feature scoring process and proposes a minority-focused robust discriminant score (RDS) with class-specific stability weights to focus on biologically significant features. The selected genes/features are further refined by separating the classes through margin weights obtained from support vectors and by removing redundancy via L1-regularization, resulting in a concise and discriminant feature set. The proposed method is effective in addressing class-imbalanced problems and thus offers a promising strategy where existing methods often struggle with skewed distributions.
3 Margin Weighted Robust Discriminant Score (MW-RDS)
Let the gene expression dataset be expressed as D = (X, y), where X is the feature matrix consisting of n samples and p features, and y is the binary class variable with y_i ∈ {0, 1}. The dataset is divided into two groups, X⁻ and X⁺, representing the feature matrices of the minority and majority class observations, respectively. The symbol p represents the number of features, and n⁻ and n⁺ represent the numbers of minority and majority class observations, respectively. To mitigate the effects of class imbalance, a minority amplification factor δ that quantifies the degree of imbalance is introduced in Eq 2.
Since the class distribution is highly imbalanced, i.e., n⁻ ≪ n⁺, δ acts as a minority amplification factor that balances the influence of the minority class during the feature scoring process. In particular, the contribution of the minority class feature matrix X⁻, compared to the majority class features, is amplified by the factor δ given in Eq 2. This amplification preserves the discrimination efficacy of genes relevant to the minority class within the feature scoring process and prevents the majority class from dominating. The proposed method adjusts the influence of the minority class, giving it slightly more weight through the factor (1 + δ). This ensures that class-specific features are effectively captured rather than lost to the skewed data. Using the amplification factor δ, a robust discriminant score (RDS) is introduced that incorporates δ and assigns greater importance to genes/features that effectively differentiate the minority class, improving the overall robustness of the model.
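As a concrete illustration, the amplification step can be sketched in code. The exact formula of Eq 2 is not reproduced in this text, so the degree-of-imbalance measure below is an illustrative assumption, not the paper's definition:

```python
import numpy as np

def minority_amplification(y):
    """Illustrative degree-of-imbalance factor (an assumed form, not Eq 2):
    0 when the classes are balanced, approaching 1 under extreme imbalance."""
    n_min = np.sum(y == 1)  # minority class coded as 1
    n_maj = np.sum(y == 0)
    return (n_maj - n_min) / (n_maj + n_min)

y = np.array([0] * 90 + [1] * 10)  # 9:1 imbalance, as in the experiments
delta = minority_amplification(y)
print(round(delta, 2))      # 0.8 for a 9:1 split
print(round(1 + delta, 2))  # minority weight (1 + delta) = 1.8
```

Whatever the exact form of δ, the key point is that minority-class contributions enter the scoring multiplied by (1 + δ) > 1.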
3.1 Minority-focused robust discriminant score
Once the factor δ is computed, it is used to fine-tune the robust discriminant score (RDS), enabling a tailored adjustment that specifically targets class imbalance. The minority-focused RDS uses class-specific stability weights to select genes that show high stability within the majority and minority classes. These weights, w_b, are inversely proportional to the variance of the bth gene and are given in Eq 3, where m_b⁻ and m_b⁺ signify the medians of the bth gene for the minority and majority classes, respectively. Using these weights, the robust discriminant score RDS_b is computed as in Eq 4, where M_b is the combined median of the bth gene/feature across all observations, and MAD_b⁻ and MAD_b⁺ indicate the mean absolute deviations of the bth gene/feature for the minority and majority class observations, respectively. Thus, the score given in Eq 4 is synthesized by combining two key terms: the minority amplification factor given in Eq 2 and the class-specific stability weights given in Eq 3. The resulting score vector RDS = (RDS_1, …, RDS_p), given in Eq 5, is a sequence of discriminant scores for ranking genes based on their differential capability between the classes.
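The ingredients named above (class-wise medians, mean absolute deviations about the combined median, variance-based stability weights, and the (1 + δ) amplification) can be combined in code. The exact combination in Eq 4 is not reproduced here, so the following is one plausible arrangement of those terms, offered as a sketch only:

```python
import numpy as np

def rds_scores(X, y, delta, eps=1e-8):
    """Sketch of a minority-focused robust discriminant score (an assumed
    combination of the paper's ingredients, not the literal Eq 4)."""
    X_min, X_maj = X[y == 1], X[y == 0]
    med_min = np.median(X_min, axis=0)
    med_maj = np.median(X_maj, axis=0)
    med_all = np.median(X, axis=0)                       # combined median
    mad_min = np.mean(np.abs(X_min - med_all), axis=0)   # MAD about combined median
    mad_maj = np.mean(np.abs(X_maj - med_all), axis=0)
    # stability weights: inversely proportional to class-wise variance
    w = 1.0 / (np.var(X_min, axis=0) + np.var(X_maj, axis=0) + eps)
    return (1 + delta) * w * np.abs(med_min - med_maj) / (mad_min + mad_maj + eps)

# toy data: feature 0 separates the classes, feature 1 is pure noise
rng = np.random.default_rng(0)
X_maj = rng.normal(size=(90, 2))
X_min = rng.normal(size=(10, 2))
X_min[:, 0] += 5.0
X = np.vstack([X_maj, X_min])
y = np.array([0] * 90 + [1] * 10)
scores = rds_scores(X, y, delta=0.8)
print(scores[0] > scores[1])  # True: the informative feature ranks higher
```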
3.2 Margin-weighted feature scoring
To further increase the differential capability of the above scores RDS_b, they are weighted by margin weights, say v_b, derived from the support vectors of a linear SVM. For any feature b, the margin weight quantifies how much the bth feature contributes to class distinction, facilitating the ranking of more discriminative features; the SVM itself maximizes the margin between the separating hyperplane and the data observations. Let x_a stand for the feature vector of the ath support vector, with a = 1, …, S indexing the support vectors and S representing their total number. Let y_a represent the class label of the ath observation, and let α_a denote the dual coefficient associated with the support vector x_a. For each feature, these coefficients are used to compute the absolute margin weight v_b = |Σ_a α_a y_a x_{ab}|, indicating the degree to which the feature affects the classification boundary through the support vectors. The vector of weights v = (v_1, …, v_p), shown in Eq 7, helps in identifying the genes/features that contribute to separating the classes. The final robust score for each gene/feature is then estimated as MW-RDS_b = v_b × RDS_b. The robust score of the bth gene/feature represents its combined strength to differentiate between the two classes.
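The margin weights can be read directly off a fitted linear SVM, since standard libraries expose the products α_a y_a and the support vectors. A small sketch using scikit-learn (an implementation choice for illustration, not prescribed by the paper):

```python
import numpy as np
from sklearn.svm import SVC

# toy data where only feature 0 carries the class signal
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

clf = SVC(kernel="linear").fit(X, y)
# dual_coef_ holds alpha_a * y_a for each support vector x_a, so the
# margin weight of feature b is |sum_a alpha_a y_a x_ab|:
w = np.abs(clf.dual_coef_ @ clf.support_vectors_).ravel()
# for a linear kernel this equals the absolute hyperplane coefficients:
assert np.allclose(w, np.abs(clf.coef_).ravel())
print(int(np.argmax(w)))  # the informative feature gets the largest weight
```

The equivalence with |coef_| is a convenient sanity check: summing α_a y_a x_a over the support vectors is exactly how the primal weight vector of a linear SVM is recovered from its dual solution.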
3.3 Feature ranking and redundancy elimination
While features are ranked based on their final robust scores MW-RDS_b, high-dimensional imbalanced problems often contain redundant features. To further refine the selection of informative genes/features, redundancy is addressed through the least absolute shrinkage and selection operator (LASSO) [45], an L1-regularization that promotes sparsity. The corresponding optimization problem is min_β L(β) + λ‖β‖₁, where L(·) denotes the loss function over the feature subset, β is the coefficient vector over the selected features, and λ regulates the degree of sparsity during optimization. The set S_d contains the d top-ranked features based on the scores computed by MW-RDS. For binary classification, the loss is defined as the negative log-likelihood of the logistic regression model. This regularization framework is thus embedded within logistic regression, optimizing the feature selection process by minimizing the penalized objective function in Eq 10, where σ denotes the logistic link function mapping linear combinations of features to probabilities. The parameter λ controls the trade-off between model fit and sparsity, while the term λ‖β‖₁ is the L1-regularization that shrinks the coefficients of redundant features towards zero.
Consequently, the final set of genes consists of the features in S_d whose estimated coefficients remain non-zero. This final feature set contains the top-ranked discriminative features; the combination of robustness, margin-based significance, and sparsity results in a final gene/feature set that effectively balances clarity and performance, especially in high-dimensional, imbalanced contexts like gene expression analysis.
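The redundancy-elimination step can be sketched with an off-the-shelf L1-penalized logistic regression; the toy data and the value of the sparsity parameter below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, d = 200, 10                                   # d = number of top-ranked features kept
X = rng.normal(size=(n, d))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)    # feature 1 is redundant with feature 0
y = (X[:, 0] > 0).astype(int)

# L1-penalized logistic regression; in scikit-learn, C is the inverse of
# the sparsity parameter lambda, so stronger shrinkage means smaller C.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_.ravel() != 0)
print(selected)  # redundant and noise features are shrunk to exactly zero
```

Because the L1 penalty drives coefficients exactly to zero, the surviving indices form the final gene set directly, without a separate thresholding step.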
Algorithm 1 contains the pseudo-code of the proposed method, MW-RDS, and its corresponding flowchart is provided in Fig 1. The proposed MW-RDS algorithm begins with the computation of a robust score, followed by margin-based weighting using support vectors, and then ranks the features while penalizing redundant ones via the L1 penalty.
Algorithm 1. Margin Weighted Robust Discriminant Score (MW-RDS).
4 Experiments and results
This section outlines the experimental design, analysis on benchmark and simulated dataset, and the evaluation metrics. Detailed explanations of the experimental setup and the study’s findings are provided in the following subsections.
4.1 Imbalanced gene expression datasets
To assess how well the proposed method performs compared to the existing methods, we used nine benchmark imbalanced gene expression datasets. A quick overview of these datasets is provided in Table 1. In the given table, the first column lists the dataset ID, followed by the dataset name in the second column. The third and fourth columns show the number of observations (n) and features (p), respectively. The fifth column gives the class-wise distribution (negative and positive classes), and the last column provides the data sources.
4.2 Experimental setup
As the focus of this study is high-dimensional imbalanced data classification, the given data were adjusted to maintain a 9:1 ratio, with 90% of the observations in the negative class and 10% in the positive class. This is done by randomly removing minority class observations until the 9:1 imbalance ratio is reached. For fair validation, each dataset is split into 70% training data, used for building models, and 30% testing data, used to evaluate the methods' performance. A total of 500 runs of the split-sample estimates are obtained to validate the findings and assess the performance of the proposed method against its competitors. The 500 split-sample runs ensure the robustness and reliability of the evaluation, allowing performance variability to be assessed under different data partitions. This extensive validation gives a comprehensive and fair comparison between the MW-RDS approach and the competing methods, controlling the influence of bias from random sampling. It is worth emphasizing that, in both feature selection and classifier application on the selected features, the same training and testing sets are used for all the methods in each run of the experiments.
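The downsampling step that creates the 9:1 ratio can be sketched as follows; the helper name and the class coding (1 = minority) are assumptions for illustration:

```python
import numpy as np

def make_imbalanced(X, y, ratio=9, seed=0):
    """Randomly drop minority-class rows (coded 1) until the
    negative:positive ratio is ratio:1."""
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == 0)
    mnr = np.flatnonzero(y == 1)
    keep = rng.choice(mnr, size=max(1, len(maj) // ratio), replace=False)
    idx = np.concatenate([maj, keep])
    return X[idx], y[idx]

X = np.zeros((130, 3))
y = np.array([0] * 90 + [1] * 40)
Xb, yb = make_imbalanced(X, y)
print(int(np.sum(yb == 0)), int(np.sum(yb == 1)))  # 90 10
```

Each of the 500 runs would then draw a fresh 70/30 train/test split of the downsampled data before feature selection and classification.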
Using the above setup, the top 10 genes are selected by the proposed MW-RDS method and the other feature selection methods, i.e., Proportional Overlap Score (POS), Wilcoxon Rank-Sum Test (Wilcoxon), Weighted Signal-to-Noise Ratio (WSNR), ensemble Minimum Redundancy and Maximum Relevance (mRMRe), Fisher Score (Fisher), and robust weighted score for unbalanced data (ROWSU). Classification models, i.e., random forest (RF), support vector machine (SVM), and weighted k-nearest neighbors (WkNN), were used to evaluate the efficacy of the selected features in terms of the performance metrics accuracy, sensitivity, specificity, F1-score, and precision. To guarantee accuracy and reproducibility, the experiments were conducted in the R programming language, and all three classifiers were applied with the default hyper-parameter values given in the corresponding R packages.
4.3 Results
This section provides the results of the proposed method, MW-RDS, and the competing feature selection methods, i.e., POS, Wilcoxon, WSNR, mRMRe, Fisher, and ROWSU, applied to the nine datasets ID1 through ID9.
Table 2 gives a detailed comparison of the feature selection methods, i.e., POS, Wilcoxon, WSNR, mRMRe, Fisher, and MW-RDS, using the RF, SVM, and WKNN classification models in terms of accuracy, sensitivity, specificity, F1-score, and precision on ID1. The proposed method performs efficiently throughout the analysis. MW-RDS, using RF, achieved the highest specificity and precision, 0.9877 and 0.9928 respectively, among all the feature selection methods. Its sensitivity under RF is also exceptionally high at 0.9960, and its accuracy and F1-score, 0.9766 and 0.9905 respectively, are the highest among all the methods, making MW-RDS a clear winner with the RF classifier. Methods like POS and WSNR significantly under-perform, with accuracy values of 0.7879 and 0.7736, respectively. While mRMRe comes close on some metrics, e.g., a precision of 0.9526, it does not achieve the same consistency across all the performance metrics as MW-RDS. SVM paired with MW-RDS also delivers excellent performance: its accuracy of 0.9962 and precision of 0.9942 are the highest across all feature selection methods, and its F1-score of 0.9900 further highlights its balanced performance. MW-RDS outperforms POS, WSNR, and Fisher, which show notably lower specificity, e.g., 0.1059 for POS and 0.1883 for WSNR. While mRMRe achieves slightly higher sensitivity at 0.9982, MW-RDS provides a more consistent balance across all metrics, making it the most reliable choice for SVM. For WKNN, MW-RDS shows the best performance with the highest specificity of 0.3648 among all the feature selection methods, significantly better than alternatives such as POS at 0.0246 and WSNR at 0.1933. Other methods, such as Fisher and Wilcoxon, perform poorly, with a specificity of 0.1085 and an accuracy of 0.7696, emphasizing the superiority of MW-RDS.
Table 3 gives the results for the ID2 dataset. MW-RDS using RF gives the highest sensitivity, specificity, F1-score, and precision values, i.e., 1, 0.9401, 0.9911, and 0.9916, respectively. However, its accuracy is slightly lower than that of mRMRe, which reaches 0.9959. Similarly, SVM paired with MW-RDS excels with the highest accuracy (0.9959), perfect sensitivity (1), near-perfect specificity (0.9995), and the highest F1-score (0.9972), making it the best-performing combination overall.
In the case of WKNN, MW-RDS demonstrates high accuracy (0.9868), F1-score (0.9938), and near-perfect sensitivity (0.9995), showcasing its effectiveness. While mRMRe slightly outperforms MW-RDS in accuracy (0.9988) and precision, MW-RDS offers a better balance across all metrics. Compared to the other feature selection methods, MW-RDS consistently achieves the best sensitivity, often reaching perfect values for RF and SVM, and performs exceptionally well in terms of F1-score owing to its balanced precision and sensitivity. Although methods like Wilcoxon sometimes achieve higher specificity under WKNN, with a value of 1, MW-RDS provides a more stable and consistent overall performance. Among the models, SVM paired with MW-RDS emerges as the best, delivering the top results across all the metrics, while WKNN is also competitive, particularly in sensitivity and F1-score. RF performs well but lags slightly in accuracy and precision. Overall, MW-RDS proves to be a reliable, high-performing feature selection method and a strong candidate for machine learning tasks in the presence of class imbalance.
Based on the results given in Tables 4, 5, 6, 7, 8, 9, and 10, similar conclusions can be drawn for the ID3 to ID9 datasets, where MW-RDS continues to demonstrate its strong and consistent performance across all the metrics.
For testing statistical significance, Table 11 presents the results of the Wilcoxon Rank-Sum test comparing MW-RDS against other methods. The findings reveal that the superiority of MW-RDS is also statistically significant.
The efficiency of the proposed method is further demonstrated through the boxplots given in Figs 2, 3, 4, 5, and 6 for the various performance metrics on ID1. As shown in the boxplots, the proposed method exhibits significantly higher performance metrics with reduced variability relative to the competing methods, underscoring its robustness and superior effectiveness.
Additionally, the metric plots given in Figs 7, 8, 9, 10, and 11 compare the proposed method with the other methods across various numbers of genes, i.e., 5, 10, 15, 20, 25, 50, 100, and 500. These plots highlight the consistent performance of MW-RDS relative to the other methods as the number of selected genes changes, indicating that the proposed method exhibits greater stability than the others even when the number of genes varies.
The primary aim of the study was to develop a gene selection method that enhances the classification performance of machine learning algorithms on imbalanced high-dimensional gene expression datasets. This study also aims to assist readers interested in further exploring the biological significance of the selected genes. For example, the indices of the selected genes for the ID1 dataset are G2599, G1036, G1495, G4303, G2722, G7747, G1830, G136, G3323, and G9247, while those for ID2 are G4731, G7721, G10459, G6254, G4209, G5347, G5744, G6121, G10477, and G10488, in order of their selection frequency over the 500 runs. The top 10 genes identified by MW-RDS gave impressive accuracies of 97.66% and 99.62% using the RF and SVM classifiers on the ID1 dataset, while the accuracy reached 99.59% with SVM on the ID2 dataset. For further reading on the biological significance of the selected genes, readers are advised to see the work done in [46,47].
4.4 Simulation
This section presents two simulation scenarios to demonstrate the applicability of the proposed method. The first scenario highlights the effectiveness of MW-RDS in addressing imbalanced datasets, while the second scenario explores a data-generating environment where the proposed method might struggle.
To evaluate the performance of feature selection methods on imbalanced datasets, we simulate datasets with skewed class distributions, similar to the characteristics described in the benchmark analysis. The dataset contains n samples and p features. An imbalance ratio of 9:1 signifies that 90% of the observations belong to the majority class and the remaining 10% represent the minority class. The exact numbers of majority and minority observations are computed from the imbalance ratio, which is defined by the proportions of the class distributions expressed in Eqs 11 and 13; the majority and minority classes consist of n⁺ = 0.9n and n⁻ = 0.1n observations of the total n used for simulation. Using the multivariate normal distribution, the feature matrix X is generated to represent continuous feature values as x_i ~ N(μ_c, Σ_c), where μ_c and Σ_c are the mean vector and covariance matrix for the target class c, with c indexing the minority and majority observations, respectively. The amplification factor δ adjusts the contributions of the minority (X⁻) and majority (X⁺) class features, ensuring that the minority class influence is not overshadowed by the majority class. This methodology provides a robust framework for testing the performance of feature selection methods, including the proposed MW-RDS, under challenging and realistic imbalanced data conditions.
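A minimal sketch of such a generator is shown below; the number of informative features, the size of the mean shift, and the identity covariance are illustrative assumptions not specified above:

```python
import numpy as np

def simulate_imbalanced(n=500, p=50, ratio=0.9, shift=1.0, seed=0):
    """Two multivariate-normal classes with a 9:1 split. The 5 informative
    features, the mean shift, and the identity covariance are assumptions."""
    rng = np.random.default_rng(seed)
    n_maj = int(ratio * n)
    n_min = n - n_maj
    X_maj = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n_maj)
    mu_min = np.zeros(p)
    mu_min[:5] = shift  # first 5 features carry the class signal
    X_min = rng.multivariate_normal(mu_min, np.eye(p), size=n_min)
    X = np.vstack([X_maj, X_min])
    y = np.array([0] * n_maj + [1] * n_min)
    return X, y

X, y = simulate_imbalanced()
print(X.shape, int(y.sum()))  # (500, 50) 50
```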
Fig 12 presents the comparison of the proposed and the other methods applied to the balanced simulated dataset, whereas Fig 13 presents the results on the imbalanced simulated data. While POS demonstrates strong performance under the balanced scenario, MW-RDS stands out in the imbalanced scenario given in Fig 13, achieving higher accuracy via SVM and WkNN, higher sensitivity via SVM, higher specificity via RF, SVM, and WkNN, higher F1-score via RF, SVM, and WkNN, and higher precision via SVM and WkNN, all of which are crucial for accurately identifying minority class instances. This capability highlights the suitability of MW-RDS for addressing the challenges posed by imbalanced datasets, where maintaining equitable performance can be difficult for other methods. These results reflect the adaptability and robustness of MW-RDS, making it a valuable choice for diverse data distributions.
Table 12 summarizes the average time (in milliseconds) taken by each feature selection method on each dataset. Methods such as POS, WSNR, and Fisher exhibit minimal computational cost due to their univariate design. In contrast, MW-RDS requires slightly more time due to its enhanced design. Despite requiring slightly more time than these basic filters, MW-RDS remains significantly faster than more elaborate methods such as Wilcoxon, mRMRe, and ROWSU. Although MW-RDS demands more computational cost to run, its notably improved classification performance makes this a worthwhile trade-off.
Furthermore, in terms of sensitivity to the choice of hyper-parameters, the number of selected features stayed almost the same when the level of the minority amplification factor was changed across various settings of the regularization parameter λ. Fig 14 gives a clear demonstration of this on the simulated data. This behavior shows that MW-RDS is consistent across different λ-values, meaning that minimal parameter fine-tuning may be sufficient, and its time complexity keeps it efficient for high-dimensional imbalanced problems. Table 13 summarizes the performance of the proposed method compared to the other feature selection methods on the top 50 features of the imbalanced simulated dataset. Classification was performed using the RF, SVM, and WKNN models, and the results for accuracy, sensitivity, specificity, F1-score, and precision are averaged over 500 iterations. MW-RDS consistently achieved superior results in most cases. The performance of the feature selection methods was further evaluated on the imbalanced simulated dataset using different numbers of selected features; Figs 15 and 16 demonstrate that MW-RDS outperforms the other methods in terms of classification accuracy and sensitivity.
5 Conclusion
This study presented the Margin Weighted Robust Discriminant Score (MW-RDS), an innovative feature selection method designed to tackle the challenges of high-dimensional and imbalanced datasets.
In contrast to existing methods, MW-RDS introduces a minority amplification factor, class-specific stability weights, and margin weights derived from support vectors to ensure that features remain significant for minority-class observations. Moreover, the use of regularization through the logistic function removed redundant features, resulting in a highly efficient feature set.
MW-RDS has been compared with several feature selection methods, including Proportional Overlap Score (POS), the Wilcoxon Rank-Sum Test (Wilcoxon), Weighted Signal-to-Noise Ratio (WSNR), ensemble Minimum Redundancy and Maximum Relevance (mRMRe), Fisher Score (Fisher), and the robust weighted score for unbalanced data (ROWSU) on several benchmark imbalanced gene expression problems. Three classification models, i.e., Random Forest (RF), Support Vector Machines (SVM), and Weighted k-Nearest Neighbors (WkNN), are evaluated in terms of accuracy, sensitivity, specificity, F1-score, and precision to assess the efficacy of the proposed method. MW-RDS consistently outperformed the existing methods, demonstrating its ability to handle class-imbalanced problems and achieve superior classification results.
Although MW-RDS involves a slightly higher computational cost than some of the other feature selection methods, this trade-off results in notably improved performance. Its robustness across various hyper-parameter settings further implies its effectiveness. Overall, MW-RDS provides a promising balance between efficiency and effectiveness for high-dimensional analysis with class imbalance.