
Feature selection for high dimensional microarray gene expression data via weighted signal to noise ratio

  • Muhammad Hamraz ,

    Contributed equally to this work with: Muhammad Hamraz, Amjad Ali, Wali Khan Mashwani, Saeed Aldahmani, Zardad Khan

    Roles Conceptualization, Formal analysis, Methodology, Writing – original draft, Writing – review & editing

    Affiliation Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan

  • Amjad Ali ,

    Contributed equally to this work with: Muhammad Hamraz, Amjad Ali, Wali Khan Mashwani, Saeed Aldahmani, Zardad Khan

    Roles Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft

    Affiliation Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan

  • Wali Khan Mashwani ,

    Contributed equally to this work with: Muhammad Hamraz, Amjad Ali, Wali Khan Mashwani, Saeed Aldahmani, Zardad Khan

    Roles Conceptualization, Investigation, Methodology, Writing – review & editing

    Affiliation Institute of Numerical Sciences, Kohat University of Science and Technology, Kohat, Pakistan

  • Saeed Aldahmani ,

    Contributed equally to this work with: Muhammad Hamraz, Amjad Ali, Wali Khan Mashwani, Saeed Aldahmani, Zardad Khan

    Roles Conceptualization, Formal analysis, Software, Writing – original draft, Writing – review & editing

    Affiliation Department of Analytics in the Digital Era, United Arab Emirates University, Al Ain, UAE

  • Zardad Khan

    Contributed equally to this work with: Muhammad Hamraz, Amjad Ali, Wali Khan Mashwani, Saeed Aldahmani, Zardad Khan

    Roles Conceptualization, Formal analysis, Methodology, Software, Supervision

    zaar@uaeu.ac.ae

    Affiliation Department of Analytics in the Digital Era, United Arab Emirates University, Al Ain, UAE

Abstract

Feature selection in high dimensional gene expression datasets not only reduces the dimension of the data, but also the execution time and computational cost of the underlying classifier. The current study introduces a novel feature selection method called weighted signal to noise ratio (WSNR), which exploits feature weights based on support vectors together with the signal to noise ratio, with the objective of identifying the most informative genes in high dimensional classification problems. Combining these two state-of-the-art procedures enables the extraction of the most informative genes: the weights produced by the two procedures are multiplied and arranged in decreasing order, and a larger weight indicates a feature's discriminatory power in classifying tissue samples into their true classes. The method is validated on eight gene expression datasets, and its results are compared with those of four well known feature selection methods. WSNR outperforms the competing methods on six out of eight datasets. Box-plots and bar-plots of the results of the proposed method and all the other methods are also constructed. The proposed method is further assessed on simulated data; this analysis reveals that WSNR outperforms all the other methods included in the study.

1 Introduction

Feature/gene selection in micro-array gene expression datasets has gained great attention during recent decades [1–7]. High dimensional datasets usually contain noisy, redundant and non-informative features that increase the computational complexity as well as the execution time of the underlying model. Feature selection is therefore necessary to select the informative features and remove the unnecessary ones; this not only reduces execution or training time but also increases the accuracy of the model, on the basis of which one can categorize the samples in the data into their classes [8]. Feature selection is mainly carried out by three kinds of methods: wrapper, filter and embedded. The feature selection methods used in this paper fall under the category of filter methods, except sigF [9], which is a wrapper method. Feature or variable selection is used in a variety of tasks such as classification, regression and clustering [10]. Different types of biological data sets can also be analyzed using feature selection, for instance whole-genome sequencing data [11], protein mass spectra data [12], whole-genome expression data [13–15], and so on. Micro-array and other high throughput technologies are capable of measuring thousands of genes simultaneously, which has led to their widespread use in clinical settings. Recent years have witnessed many feature selection methods for micro-array data analysis. The authors in [16] introduced a "double feature selection method" that uses both global and intrinsic geometric information for the selection of informative features in the data. Similarly, the study in [17] introduced a method that handles semi-supervised feature selection tasks, combining the neighborhood discriminant index (NDI) and forward iterative Laplacian score (FILS) methods for the selection of discriminative features in high-dimensional data sets. A more efficient implementation of linear support vector machines that improves the recursive feature elimination strategy, combining the two to select informative genes, was proposed in [18]. The study in [2] proposed a technique that applies an ensemble of feature selection procedures to select genes highly correlated with lung adenocarcinoma (LUAD): using LUAD RNA-seq data from the Cancer Genome Atlas (TCGA), mutual information (MI) was employed, followed by recursive feature elimination (RFE), along with an SVM classifier. A Bi-dimensional Principal Feature Selection (BPFS) procedure for efficiently extracting critical genes from high dimensional gene expression datasets was proposed in [19]; it applies principal component analysis (PCA) on the sample and gene domains successively, in order to identify informative genes and reduce redundancy while losing less information. Further work on the selection of informative features and their importance in classification/regression can be found in [20–27]. The main focus of these methods is to enhance the classification accuracy of the underlying classifier with the help of the selected genes, while ignoring their biological relevance, which leads to inaccurate downstream data analysis [28–33]. It is therefore necessary to devise a feature selection method that not only increases classification accuracy but is also capable of identifying the biological significance of the selected genes in the tumor versus normal contrast [34, 35].

This paper proposes a new feature selection procedure that combines the information obtained from the well known signal to noise ratio (SNR) feature selection method [40] with the feature weights given by the support vector machine (SVM) [36]. To assess its performance, eight gene expression datasets, i.e., Leukemia, Colon, Srbct, DLBCL, Lungcancer, Breastcancer, TumorC and Prostate, have been used. Furthermore, the results of the proposed method are compared with four other well known feature selection methods: significant features (sigF) [9], minimum redundancy maximum relevance (mRmR) [37], the Wilcoxon rank sum test (Wilc) [38] and an ensemble method called SVM-mRMRe [39]. This comparison shows that the proposed WSNR stands apart in terms of classification error. Box-plots and bar-plots of the results are also constructed, and they likewise indicate that the proposed method performs better than the aforementioned feature selection methods. The rest of the paper is organized as follows.

Section 2 gives a detailed description of the datasets used in the paper, the support vector machine (SVM) classifier, the feature selection procedures "Significant Features" (sigF) [9] and Signal to Noise Ratio (SNR) [40], and the proposed method (WSNR) with its mathematical background and algorithm. Section 3 presents the experimental setup of the proposed method. Section 4 discusses the results of the proposed method, WSNR. The paper is concluded in Section 5.

2 Methods

2.1 Data sets

For the assessment of the proposed method, WSNR, eight benchmark problems are used. Their sources, along with the number of features, number of observations and class-wise distribution of samples, are given in Table 1.

Table 1. Brief description of the datasets along with the corresponding number of features, observations, class-wise distributions and sources.

https://doi.org/10.1371/journal.pone.0284619.t001

2.2 Support vector machine

Support vector machine (SVM) is a supervised learning technique that has been widely used in the literature for regression and classification problems. It has also been used for feature selection in several studies [32, 33, 48]. This classifier utilizes several kernel functions to perform classification effectively in linear and non-linear feature spaces. The SVM searches for an optimal linear or non-linear hyperplane (H) that divides the two groups of observations meaningfully [49]. This hyperplane is supposed to lie at maximum distance from both classes in high-dimensional space, so as to separate the two groups as much as possible. The hyperplane is represented by the equation given in Eq 1, which acts as a reference frame for identifying the position of each sample in high-dimensional space; the weighted inputs are summed to produce a discriminant score, which is then used to categorize an observation into one of the two classes:

y = \sum_{j=1}^{d} w_j z_j + b,    (1)

where y is the response, i.e., y ∈ {0, 1}, so that each sample in the data is classified into class 0 or class 1, z = (z_1, ⋯, z_d) is a d-dimensional input vector, the vector w = (w_1, ⋯, w_d) contains the coefficients of the hyperplane, and the term b is the intercept of the hyperplane.

2.2.1 Mathematical description behind SVM weights w.

The SVM algorithm uses the hyperplane (H) to classify the data points into their respective classes, i.e.,

\hat{y} = \begin{cases} 1, & \text{if } w \cdot \psi(z) + b \geq 0, \\ 0, & \text{otherwise.} \end{cases}

The distance between a given point ψ(z_0) and the hyperplane H is given by

d(\psi(z_0), H) = \frac{|w \cdot \psi(z_0) + b|}{\lVert w \rVert_2},    (2)

where ‖w‖_2 is the Euclidean norm,

\lVert w \rVert_2 = \sqrt{\sum_{j=1}^{d} w_j^2}.    (3)

The weight vector is the argument that maximizes the distance given in Eq 2, that is,

\hat{w} = \arg\max_{w} \, d(\psi(z_0), H).    (4)
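As a minimal illustration (not the authors' code; the toy data, variable names and the choice of the e1071 package are assumptions), the weight vector w of a linear SVM can be recovered in R from the fitted support vectors as follows:

library(e1071)

set.seed(1)
X <- matrix(rnorm(60 * 100), nrow = 60)    # toy data: 60 tissue samples, 100 genes
y <- rep(c(0, 1), each = 30)               # binary class labels

fit <- svm(X, factor(y), kernel = "linear", scale = TRUE)
w <- as.vector(t(fit$coefs) %*% fit$SV)    # linear-kernel weights: w = sum_i alpha_i y_i SV_i
head(abs(w))                               # |w_j| serves as the weight of gene j

For a linear kernel this reconstruction of w is exact, since the separating hyperplane is a linear combination of the support vectors.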

2.3 Significant feature selection (sigF)

The significant feature selection (sigF) method was introduced in [9]. In this method, significant features are identified with the help of the support vector machine and the t-test. First, the weight of each feature is computed via the support vector machine (SVM). In the second stage, a t-statistic is computed for each feature in the data as follows:

t_j = \frac{\bar{x}_j^{(0)} - \bar{x}_j^{(1)}}{\sqrt{\frac{(s_j^{(0)})^2}{n_0} + \frac{(s_j^{(1)})^2}{n_1}}},    (5)

where \bar{x}_j^{(0)}, \bar{x}_j^{(1)}, s_j^{(0)}, s_j^{(1)} and n_0, n_1 represent the means, standard deviations and numbers of samples in class 0 and class 1, respectively. In this way the t-statistic is computed for each feature in the data. Equivalently, p-values for all the features are computed from the t-test; a smaller p-value for a feature indicates greater discriminative ability. The weights computed via the SVM classifier are then multiplied by these p-values to obtain new weights for all the features:

\xi_j = w_j \times p_j,    (6)

where p_j = \Pr(|T_v| \geq |u|), with v the degrees of freedom of the corresponding reference distribution and u the observed value of the test statistic. A feature is considered informative if it possesses a smaller value of ξ.
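A hedged sketch of this sigF-style weighting (the published method is implemented in the sigFeature R package; this snippet only illustrates the idea, reusing X, y and w from the previous snippet):

pvals <- apply(X, 2, function(g) t.test(g[y == 0], g[y == 1])$p.value)
xi    <- abs(w) * pvals    # Eq 6: a smaller xi marks a more informative gene
head(order(xi), 10)        # ten most informative genes under this ranking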

2.4 The proposed method, WSNR

The proposed method selects the informative genes or features in high-dimensional gene expression data sets in a similar fashion to sigF [9]. The only difference is that the method in [9] computes a t-statistic for each feature, which is then multiplied by the weights computed via the support vector machine classifier, whereas the proposed method computes the signal to noise ratio [40] for each feature:

SNR_j = \frac{\bar{x}_j^{(0)} - \bar{x}_j^{(1)}}{s_j^{(0)} + s_j^{(1)}},    (7)

where \bar{x}_j^{(0)}, \bar{x}_j^{(1)} and s_j^{(0)}, s_j^{(1)} represent the means and standard deviations of class 0 and class 1, respectively. Features that carry a larger value of SNR are supposed to have greater discriminative ability. Similarly, the weights w_j of all the features in the data are computed via SVM. Since both weighting schemes assign larger weights to informative genes, their product also assigns larger weights to the informative features. The resultant weights of the proposed method are computed as

(WSNR)_j = w_j \times SNR_j,    (8)

where (WSNR)_j represents the weight of the jth feature in the data. The proposed method takes the following steps to identify the informative genes.

  • Compute the weights of all the features using support vectors and denote them by w_j.
  • Compute the signal to noise ratio of all the features in the training data and denote it by SNR_j.
  • Multiply the corresponding weights from steps 1 and 2 and arrange the products in descending order.
  • Select the top ranked K genes from step 3 for model construction.

The authors in [9] used the t-test rather than the signal to noise ratio for the selection of discriminative genes. The t-test requires the underlying distribution of each variable to be approximately normal, which is difficult to verify when the data contain tens of thousands of genes. The signal to noise ratio, on the other hand, does not require such an assumption. The pseudo code given in Algorithm 1 explains how the proposed method, WSNR, identifies the informative genes in high-dimensional gene expression data sets, followed by its flowchart in Fig 1.

Algorithm 1 Pseudo code of the proposed method, WSNR.

1: D ← Micro-array data with dimension n × (d + 1);

2: n ← Number of tissue samples in the data;

3: d ← Number of genes in the data;

4: Xn×d ← Total input feature space with n samples and d genes;

5: Y ← Target variable having n values.

6: K ← Number of genes to be selected.

7: w ← Weights vector of genes obtained via support vector;

8: wj ← Weight of jth gene obtained via support vector;

9: for j ← 1: d do

10:  SNRj ← Compute the signal to noise ratio of the jth gene using Eq 7;

11:  Perform (WSNR)j = wj * SNRj;

12: end for

13: Arrange the weights (WSNR)j in decreasing order;

14: Select the top K genes for model construction.
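A compact R sketch of Algorithm 1 is given below. It is a minimal illustration under the same assumptions as the earlier snippets (toy data, e1071 for the SVM weights), not the authors' implementation; taking absolute values so that genes are ranked by weight magnitude is also an assumption:

library(e1071)

wsnr_select <- function(X, y, K = 10) {
  # Step 1: gene weights from a linear support vector machine
  fit <- svm(X, factor(y), kernel = "linear", scale = TRUE)
  w   <- abs(as.vector(t(fit$coefs) %*% fit$SV))
  # Step 2: signal to noise ratio of every gene (Eq 7)
  snr <- apply(X, 2, function(g) {
    (mean(g[y == 0]) - mean(g[y == 1])) / (sd(g[y == 0]) + sd(g[y == 1]))
  })
  # Steps 3-4: multiply the two weights (Eq 8), sort in decreasing order,
  # and keep the indices of the top K genes
  order(w * abs(snr), decreasing = TRUE)[1:K]
}

top_genes <- wsnr_select(X, y, K = 15)    # indices of the selected genes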

3 Experiments

This section describes the experimental setup of the current paper. Eight high-dimensional gene expression benchmark problems are analyzed, where each benchmark problem is split into 70% training and 30% testing parts. This splitting is repeated 500 times for all feature selection procedures and for the classifiers used to assess their performance. Random forest (RF) and k-nearest neighbours (k-NN) classifiers are used to evaluate the performance of the different subsets of informative genes selected by the various feature selection techniques.

The feature selection method minimum redundancy maximum relevance (mRmR) is implemented in the R package mRMRe [50]. The Wilcoxon rank sum test (Wilc) and significant feature selection (sigF) are implemented using the R packages WilcoxCV [51] and sigFeature [9], respectively. Moreover, the R library randomForest [52] is used to fit the random forest algorithm with default parameters, i.e., ntree = 500 and nodesize = 1. Similarly, the R library caret [53] is used for the implementation of the k-nearest neighbours classifier, with parameter k = 5.

The training part of each benchmark problem is used to select different subsets of discriminative genes, i.e., K = 5, 10 and 15, by the different gene selection procedures, and the classifiers are trained on these subsets. The classification error rate is used as the performance metric to assess the classifiers on the selected sets of informative genes.
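The following R sketch illustrates this protocol for WSNR with the random forest classifier; it is an illustration only (fewer repetitions than the 500 used in the paper, and wsnr_select is the sketch from Section 2.4, not the authors' code):

library(randomForest)

errs <- replicate(50, {                          # the paper uses 500 repetitions
  idx  <- sample(nrow(X), floor(0.7 * nrow(X)))  # 70% training split
  top  <- wsnr_select(X[idx, ], y[idx], K = 10)  # select genes on training data only
  rf   <- randomForest(X[idx, top, drop = FALSE], factor(y[idx]),
                       ntree = 500, nodesize = 1)   # stated defaults
  pred <- predict(rf, X[-idx, top, drop = FALSE])
  mean(pred != y[-idx])                          # test classification error rate
})
mean(errs)                                       # average error over the repetitions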

4 Results and discussion

Table 2 provides the classification error rates produced by the proposed method, WSNR, and all the other competitors included in the study, for different subsets of informative genes. From Table 2, it is evident that for the "Leukemia" data set the proposed method outperforms all the other methods with both classifiers. In the case of the "Colon" data set, the proposed method outperforms the others with the random forest classifier for all subsets of discriminative genes, while with the k-nearest neighbour classifier the sigF method produces the minimum error for the subset of 5 informative genes; the proposed method, however, produces the minimum error rates for the subsets of 10 and 15 genes. Similarly, for the "Lungcancer" data set, the Wilc method yields the minimum error rates with the random forest classifier, while the proposed method outperforms all the other competitors with the k-NN classifier. For the "Srbct" data set, the proposed WSNR method outperforms all the other methods except for 5 informative genes, where sigF yields the minimum error rate with the k-NN classifier. The proposed method outperforms all the other methods with the random forest classifier on the "DLBCL" data set but shows poor performance with the k-NN classifier. Similarly, the WSNR method wins over all the other procedures in the majority of cases for the "Breast" data set but performs poorly on the "TumorC" data set, while it wins over all the other methods on the Prostate data set. Overall, WSNR produces the minimum error rates on six out of eight data sets and comparable results on one data set. To summarize these results, a win-loss summary is given in Table 3.

Table 2. Classification error rates produced by different methods on various subsets of genes.

https://doi.org/10.1371/journal.pone.0284619.t002

Table 3. Win-loss table of the methods used.

Total number of wins of the methods on the data sets is given in the last row of the table.

https://doi.org/10.1371/journal.pone.0284619.t003

The performance of the proposed method is also illustrated with bar-plots of the results, given in Figs 2–9. The plots show that for the "Leukemia" data set the bars corresponding to the proposed method, WSNR, are shorter than those of all the other procedures included in the study. For the "Lungcancer" data set, the Wilc method produces lower error rates than the rest of the gene selection procedures. For the "Srbct" and "DLBCL" data sets, the WSNR method produces the minimum classification error rates. For the remaining data sets, our method maintains a majority winning position except for the "TumorC" data set. Fig 2 is shown for a quick insight into the results of the various feature selection methods included in the study.

Fig 2. Bar-plots of error rates of the proposed and the other classical methods on various subsets for Leukemia dataset.

https://doi.org/10.1371/journal.pone.0284619.g002

Fig 3. Bar-plots of error rates of the proposed and the other classical methods on various subsets of genes for Colon dataset.

https://doi.org/10.1371/journal.pone.0284619.g003

Fig 4. Bar-plots of error rates of the proposed and the other classical methods on various subsets of genes for Lungcancer dataset.

https://doi.org/10.1371/journal.pone.0284619.g004

Fig 5. Bar-plots of error rates of the proposed and the other classical methods on various subsets of genes for Srbct dataset.

https://doi.org/10.1371/journal.pone.0284619.g005

Fig 6. Bar-plots of error rates of the proposed and the other classical methods on various subsets of genes for DLBCL dataset.

https://doi.org/10.1371/journal.pone.0284619.g006

Fig 7. Bar-plots of error rates of the proposed and the other classical methods on various subsets of genes for Breast dataset.

https://doi.org/10.1371/journal.pone.0284619.g007

Fig 8. Bar-plots of error rates of the proposed and the other classical methods on various subsets of genes for TumorC dataset.

https://doi.org/10.1371/journal.pone.0284619.g008

Fig 9. Bar-plots of error rates of the proposed and the other classical methods on various subsets of genes for Prostate dataset.

https://doi.org/10.1371/journal.pone.0284619.g009

Similarly, box-plots of the results produced by the WSNR method and all the other competitors, for 10 informative genes with the random forest classifier, are given in Figs 10–17. The box-plots also show that WSNR outperforms the others in the majority of cases.

Fig 10. Box-plots of the error rates produced by random forest, using top 10 features selected by different feature selection methods for Leukemia dataset.

https://doi.org/10.1371/journal.pone.0284619.g010

Fig 11. Box-plots of the error rates produced by random forest, using top 10 features selected by different feature selection methods for Colon dataset.

https://doi.org/10.1371/journal.pone.0284619.g011

Fig 12. Box-plots of the error rates produced by random forest, using top 10 features selected by different feature selection methods for Lungcancer dataset.

https://doi.org/10.1371/journal.pone.0284619.g012

Fig 13. Box-plots of the error rates produced by random forest, using top 10 features selected by different feature selection methods for Srbct dataset.

https://doi.org/10.1371/journal.pone.0284619.g013

Fig 14. Box-plots of the error rates produced by random forest, using top 10 features selected by different feature selection methods for DLBCL dataset.

https://doi.org/10.1371/journal.pone.0284619.g014

Fig 15. Box-plots of the error rates produced by random forest, using top 10 features selected by different feature selection methods for Breastcancer dataset.

https://doi.org/10.1371/journal.pone.0284619.g015

Fig 16. Box-plots of the error rates produced by random forest, using top 10 features selected by different feature selection methods for TumorC dataset.

https://doi.org/10.1371/journal.pone.0284619.g016

Fig 17. Box-plots of the error rates produced by random forest, using top 10 features selected by different feature selection methods for Prostate dataset.

https://doi.org/10.1371/journal.pone.0284619.g017

4.1 Simulation

This subsection describes two simulation scenarios for the proposed method. The first scenario (S1) is designed to mimic a situation where the proposed method is useful, whereas the second scenario (S2) reflects a data generation environment that might not favour it. For this purpose, two different models are designed, one for each scenario. The class probabilities of the Bernoulli response Y = Bernoulli(p), given the n × d dimensional matrix X of n iid observations from the Normal(0, 1) and Uniform(0, 1) distributions, are generated in each scenario through a link function with constants a and b, both fixed at 1.5 (Eq 9). A vector of coefficients β, generated from the Uniform(−5, 5) distribution, defines the linear predictor

\eta_i = \sum_{j=1}^{d} \beta_j x_{ij}.    (10)

The top five (K = 5) most important variables are identified from the above model on the basis of their coefficients β. To contaminate the data, outliers drawn from the Normal(20, 60) distribution are added to these top five variables. In addition, 20 noisy variables are added to the data from the Normal(5, 10) distribution. In this way a simulated data set with n = 100 observations and d = 120 variables is generated. For all the methods considered, the same experimental setup is used as for the benchmark data sets. The second model is constructed in a similar fashion; the difference between the two is that the former contains outliers and noisy variables, while the latter does not. A total of 500 realizations are made for estimating the performance metric values. The results of the simulation study for both scenarios are presented in Table 4.
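The following R sketch generates data in the spirit of scenario S1. It is only illustrative: the exact link of Eq 9 is not reproduced (a plain logistic link stands in for it), and the Normal/Uniform column split, the rows receiving outliers, and reading Normal(m, s) as mean m and standard deviation s are all assumptions:

set.seed(2)
n <- 100
X <- cbind(matrix(rnorm(n * 50), n, 50),            # Normal(0, 1) columns
           matrix(runif(n * 50), n, 50))            # Uniform(0, 1) columns
beta <- runif(100, -5, 5)                           # coefficients from Uniform(-5, 5)
eta  <- as.vector(X %*% beta)                       # linear predictor (Eq 10)
p    <- 1 / (1 + exp(-eta))                         # stand-in for the link in Eq 9
Y    <- rbinom(n, 1, p)                             # Bernoulli response

top5 <- order(abs(beta), decreasing = TRUE)[1:5]    # five most important variables
X[sample(n, 10), top5] <- rnorm(50, mean = 20, sd = 60)         # outliers
X <- cbind(X, matrix(rnorm(n * 20, mean = 5, sd = 10), n, 20))  # 20 noise variables; d = 120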

Table 4. Classification error rates produced by different methods on simulated data.

https://doi.org/10.1371/journal.pone.0284619.t004

From Table 4, it is evident that when there are noisy variables in the data, the proposed method, WSNR, performs better than the other competitors, whereas the Wilc method produces the minimum error rates when there are no noisy variables. Bar-plots of the error rates for different subsets of genes, with and without noisy genes in the simulated data, are given in Figs 18 and 19, respectively. The plots indicate that the proposed method, WSNR, produces the minimum error rates in the presence of noisy features in the data.

Fig 18. Bar-plots of errors produced by different feature selection methods on simulated data having outliers, for various subsets of genes.

https://doi.org/10.1371/journal.pone.0284619.g018

Fig 19. Bar-plots of errors produced by different feature selection methods on simulated data, having no outliers, for various subsets of genes.

https://doi.org/10.1371/journal.pone.0284619.g019

5 Conclusion

The current study has proposed a novel feature selection method that exploits feature weighting via support vectors and the signal to noise ratio (SNR). The proposed method first computes the weights of all genes using the support vector machine, followed by the computation of the signal to noise ratio for all genes on the training data. These two weights are then multiplied to obtain a new weight for each gene, the genes are arranged in decreasing order of their weights, and the top ranked genes are selected for model construction.

The proposed method is validated on eight benchmark problems and assessed against other methods in terms of classification error rates. The results are compared with those of four well known feature selection methods. Two state-of-the-art classifiers, random forest (RF) and k-NN, are used to evaluate the performance of the genes selected by the various feature selection methods. The analyses reveal that the proposed method, WSNR, outperforms all the other methods on 6 out of 8 data sets and produces comparable results on the remaining 2. For quick insight into the results of the proposed method and all the other methods, bar-plots and box-plots have also been constructed. Furthermore, the proposed method is evaluated on simulated data under two scenarios: first, a scenario that favours the proposed idea, where the data contain noisy features and outlier observations; second, a scenario with no noisy features or outlier observations, which does not favour the proposed method. From all the analyses, it is concluded that the proposed method can effectively be used in high dimensional settings where the underlying distribution of the observations is not known, as is the case with micro-array data.

For future work in the direction of the proposed study, one can extend it to unsupervised learning, where the features are first divided into clusters and the proposed method is then applied within each cluster; the top ranked genes in each cluster can then be selected for model construction. One can also use robust measures of location and dispersion in the conventional signal to noise ratio to mitigate the effect of outliers in gene expression values, as sketched below. In addition, the performance of the proposed method can be examined with various kernel functions in the SVM.
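One possible robustification of Eq 7 along these lines (an assumption for illustration, not part of the paper) replaces the means and standard deviations with medians and median absolute deviations:

robust_snr <- function(g, y) {
  # medians and MADs damp the influence of outlying expression values
  (median(g[y == 0]) - median(g[y == 1])) / (mad(g[y == 0]) + mad(g[y == 1]))
}
# e.g. apply(X, 2, robust_snr, y = y) ranks genes robustly before weighting by |w|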

References

  1. Akinola OA, Agushaka JO, Ezugwu AE. Binary dwarf mongoose optimizer for solving high-dimensional feature selection problems. PLoS ONE. 2022;17(10):e0274850. pmid:36201524
  2. Abdelwahab O, Awad N, Elserafy M, Badr E. A feature selection-based framework to identify biomarkers for cancer diagnosis: A focus on lung adenocarcinoma. PLoS ONE. 2022;17(9):e0269126. pmid:36067196
  3. Song J, Li Z, Yao G, Wei S, Li L, Wu H. Framework for feature selection of predicting the diagnosis and prognosis of necrotizing enterocolitis. PLoS ONE. 2022;17(8):e0273383. pmid:35984833
  4. Tahmouresi A, Rashedi E, Yaghoobi MM, Rezaei M. Gene selection using pyramid gravitational search algorithm. PLoS ONE. 2022;17(3):e0265351. pmid:35290401
  5. Taguchi Y, Turki T. Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools. PLoS ONE. 2022;17(9):e0275472. pmid:36173994
  6. Chen LP. Classification and prediction for multi-cancer data with ultrahigh-dimensional gene expressions. PLoS ONE. 2022;17(9):e0274440. pmid:36107929
  7. Ai H. GSEA–SDBE: A gene selection method for breast cancer classification based on GSEA and analyzing differences in performance metrics. PLoS ONE. 2022;17(4):e0263171. pmid:35472078
  8. James G, Witten D, Hastie T, Tibshirani R. Statistical learning. In: An introduction to statistical learning. Springer; 2021. p. 15–57.
  9. Das P, Roychowdhury A, Das S, Roychoudhury S, Tripathy S. sigFeature: novel significant feature selection method for classification of gene expression data using support vector machine and t statistic. Frontiers in Genetics. 2020;11:247. pmid:32346383
  10. Jović A, Brkić K, Bogunović N. A review of feature selection methods with applications. In: 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO); 2015. p. 1200–1205.
  11. Das R, Dimitrova N, Xuan Z, Rollins RA, Haghighi F, Edwards JR, et al. Computational prediction of methylation status in human genomic sequences. Proceedings of the National Academy of Sciences. 2006;103(28):10713–10716. pmid:16818882
  12. Hilario M, Kalousis A, Pellegrini C, Müller M. Processing and classification of protein mass spectra. Mass Spectrometry Reviews. 2006;25(3):409–449. pmid:16463283
  13. Zheng C, Li L, Haak M, Brors B, Frank O, Giehl M, et al. Gene expression profiling of CD34+ cells identifies a molecular signature of chronic myeloid leukemia blast crisis. Leukemia. 2006;20(6):1028–1034. pmid:16617318
  14. Frank O, Brors B, Fabarius A, Li L, Haak M, Merk S, et al. Gene expression signature of primary imatinib-resistant chronic myeloid leukemia patients. Leukemia. 2006;20(8):1400–1407. pmid:16728981
  15. Ben-Dor A, Bruhn L, Friedman N, Nachman I, Schummer M, Yakhini Z. Tissue classification with gene expression profiles. In: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology; 2000. p. 54–64.
  16. Shang R, Song J, Jiao L, Li Y. Double feature selection algorithm based on low-rank sparse non-negative matrix factorization. International Journal of Machine Learning and Cybernetics. 2020;11(8):1891–1908.
  17. Pang Q, Zhang L. A recursive feature retention method for semi-supervised feature selection. International Journal of Machine Learning and Cybernetics. 2021;12(9):2639–2657.
  18. Li Z, Xie W, Liu T. Efficient feature selection and classification for microarray data. PLoS ONE. 2018;13(8):e0202167. pmid:30125332
  19. Hou X, Hou J, Huang G. Bi-dimensional principal gene feature selection from big gene expression data. PLoS ONE. 2022;17(12):e0278583. pmid:36477666
  20. Bakhshandeh S, Azmi R, Teshnehlab M. Symmetric uncertainty class-feature association map for feature selection in microarray dataset. International Journal of Machine Learning and Cybernetics. 2020;11(1):15–32.
  21. Li Z, Du J, Nie B, Xiong W, Xu G, Luo J. A new two-stage hybrid feature selection algorithm and its application in Chinese medicine. International Journal of Machine Learning and Cybernetics. 2022;13(5):1243–1264.
  22. Nasfi R, Bouguila N. A novel feature selection method using generalized inverted Dirichlet-based HMMs for image categorization. International Journal of Machine Learning and Cybernetics. 2022; p. 1–17.
  23. Javidi MM. Feature selection schema based on game theory and biology migration algorithm for regression problems. International Journal of Machine Learning and Cybernetics. 2021;12(2):303–342.
  24. Hamraz M, Khan Z, Khan DM, Gul N, Ali A, Aldahmani S. Gene selection in binary classification problems within functional genomics experiments via robust Fisher score. IEEE Access. 2022;10:51682–51692.
  25. Hamraz M, Gul N, Raza M, Khan DM, Khalil U, Zubair S, et al. Robust proportional overlapping analysis for feature selection in binary classification within functional genomic experiments. PeerJ Computer Science. 2021;7:e562. pmid:34141889
  26. Hamraz M, Khan DM, Gul N, Ali A, Khan Z, Ahmad S, et al. Regulatory genes through robust-SNR for binary classification within functional genomics experiments. 2022.
  27. Ali A, Hamraz M, Kumam P, Khan DM, Khalil U, Sulaiman M, et al. A k-nearest neighbours based ensemble via optimal model selection for regression. IEEE Access. 2020;8:132095–132105.
  28. Ali F, El-Sappagh S, Islam SR, Kwak D, Ali A, Imran M, et al. A smart healthcare monitoring system for heart disease prediction based on ensemble deep learning and feature fusion. Information Fusion. 2020;63:208–222.
  29. Ali F, El-Sappagh S, Islam SR, Ali A, Attique M, Imran M, et al. An intelligent healthcare monitoring framework using wearable sensors and social networking data. Future Generation Computer Systems. 2021;114:23–43.
  30. Kumar Y, Koul A, Singla R, Ijaz MF. Artificial intelligence in disease diagnosis: a systematic literature review, synthesizing framework and future research agenda. Journal of Ambient Intelligence and Humanized Computing. 2022; p. 1–28. pmid:35039756
  31. Mandal M, Singh PK, Ijaz MF, Shafi J, Sarkar R. A tri-stage wrapper-filter feature selection framework for disease classification. Sensors. 2021;21(16):5571. pmid:34451013
  32. Li X, Peng S, Chen J, Lü B, Zhang H, Lai M. SVM–T-RFE: A novel gene selection algorithm for identifying metastasis-related genes in colorectal cancer using gene expression profiles. Biochemical and Biophysical Research Communications. 2012;419(2):148–153. pmid:22306013
  33. Mishra S, Mishra D. SVM-BT-RFE: An improved gene selection framework using Bayesian T-test embedded in support vector machine (recursive feature elimination) algorithm. Karbala International Journal of Modern Science. 2015;1(2):86–96.
  34. Galland F, Lacroix L, Saulnier P, Dessen P, Meduri G, Bernier M, et al. Differential gene expression profiles of invasive and non-invasive non-functioning pituitary adenomas based on microarray analysis. Endocrine-Related Cancer. 2010;17(2):361–371. pmid:20228124
  35. Jiang H, Martin V, Gomez-Manzano C, Johnson DG, Alonso M, White E, et al. The RB-E2F1 pathway regulates autophagy. Cancer Research. 2010;70(20):7882–7893. pmid:20807803
  36. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–297.
  37. Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology. 2005;3(2):185–205. pmid:15852500
  38. Lausen B, Hothorn T, Bretz F, Schumacher M. Assessment of optimal selected prognostic factors. Biometrical Journal. 2004;46(3):364–374.
  39. El Kafrawy P, Fathi H, Qaraad M, Kelany AK, Chen X. An efficient SVM-based feature selection model for cancer classification using high-dimensional microarray data. IEEE Access. 2021;9:155353–155369.
  40. Mishra D, Sahu B. Feature selection for cancer classification: a signal-to-noise ratio approach. International Journal of Scientific & Engineering Research. 2011;2(4):1–7.
  41. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences. 1999;96(12):6745–6750. pmid:10359783
  42. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–537. pmid:10521349
  43. Michiels S, Koscielny S, Hill C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. The Lancet. 2005;365(9458):488–492. pmid:15705458
  44. Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics. 2005;21(5):631–643. pmid:15374862
  45. Gordon GJ, Jensen RV, Hsiao LL, Gullans SR, Blumenstock JE, Ramaswamy S, et al. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research. 2002;62(17):4963–4967. pmid:12208747
  46. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, et al. Bioinformatics Laboratory; 2002. Available from: https://file.biolab.si/biolab/supp/bi-cancer/projections/info/DLBCL.html.
  47. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine. 2002;8(1):68–74. pmid:11786909
  48. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning. 2002;46(1):389–422.
  49. Butte A. The use and analysis of microarray data. Nature Reviews Drug Discovery. 2002;1(12):951–960. pmid:12461517
  50. De Jay N, Papillon-Cavanagh S, Olsen C, Bontempi G, Haibe-Kains B. mRMRe: an R package for parallelized mRMR ensemble feature selection. Bioinformatics. 2013;29(18):2365–2368.
  51. Boulesteix AL. WilcoxCV: Wilcoxon-based variable selection in cross-validation; 2012. Available from: https://CRAN.R-project.org/package=WilcoxCV.
  52. Liaw A, Wiener M. Classification and Regression by randomForest. R News. 2002;2(3):18–22.
  53. Kuhn M. caret: Classification and Regression Training; 2021. Available from: https://CRAN.R-project.org/package=caret.