Abstract
High dimensional biomedical data contain tens of thousands of features; accurate and effective identification of the core features in these data can assist the diagnosis of related diseases. However, biomedical data often contain a large number of irrelevant or redundant features, which seriously affect subsequent classification accuracy and machine learning efficiency. To solve this problem, this paper proposes a novel filter feature selection algorithm based on redundancy removal (FSBRR) to classify high dimensional biomedical data. First, two redundancy criteria are determined from vertical relevance (the relationship between a feature and the class attribute) and horizontal relevance (the relationship between features). Second, to quantify these redundancy criteria, an approximate redundancy feature framework based on mutual information (MI) is defined to remove redundant and irrelevant features. To evaluate the effectiveness of the proposed algorithm, controlled trials against typical feature selection algorithms are conducted using three different classifiers; the experimental results indicate that the FSBRR algorithm can effectively reduce the feature dimension and improve classification accuracy. In addition, an experiment on a small-sample dataset is designed and conducted in the discussion and analysis section to clarify the specific implementation process of the FSBRR algorithm.
Citation: Zhang B, Cao P (2019) Classification of high dimensional biomedical data based on feature selection using redundant removal. PLoS ONE 14(4): e0214406. https://doi.org/10.1371/journal.pone.0214406
Editor: Xiangtao Li, Northeast Normal University, CHINA
Received: November 10, 2018; Accepted: March 12, 2019; Published: April 9, 2019
Copyright: © 2019 Zhang, Cao. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript, Supporting Information files, and from the FIGSHARE database here: https://figshare.com/articles/Data-PONE-D-18-32374/7866167.
Funding: This work was supported by the Youth Science Foundation of Lanzhou Jiaotong University [2016004] to B. T. Zhang.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The analysis of high dimensional disease data [1–2] is a very important research field, especially for cancer [3] and mental diseases (e.g. depression [4–5]). Since completely curing these diseases is often unrealistic, early diagnosis or prevention plays an important role in their treatment. However, high dimensional biomedical data usually contain a large number of weakly relevant or irrelevant features. If all features are treated equally, the time complexity, space complexity and prediction accuracy can be seriously affected. Therefore, feature selection is considered an essential step in diagnosing related diseases from high dimensional biomedical data.
As stated in [6], feature selection is also referred to as feature subset selection. The main purpose of feature selection is to remove irrelevant and redundant features before classification, while retaining the most valuable information of the original data. In other words, the objective of feature selection is to select an optimal feature subset [7] from the original feature set, which lays a good foundation for subsequent classification or learning work. As an important part of knowledge discovery technology, feature selection [8] can effectively improve the computing speed of the subsequent prediction algorithm, enhance the compactness of the prediction model, increase the generalization ability of the corresponding model, and avoid overfitting.
Based on the above factors, feature selection has long been a hot research topic, and new results are constantly emerging. For example, in [9], a feature selection method based on multi-objective binary biogeography based optimization (MOBBBO) is proposed for gene selection, which combines the non-dominated sorting method and the crowding distance method into the binary biogeography based optimization (BBBO) framework. In [10], a novel feature selection method via chaotic optimization is developed to balance exploration of the search space with exploitation of the best solutions. In [11], Liu et al. used a discrete biogeography based optimization (DBBO) method, integrating a discrete migration model and a discrete mutation model, for feature selection in molecular signatures. In [12], Li et al. proposed a feature selection algorithm based on the multi-objective ranking binary artificial bee colony to select from the original high dimensional data a subset that satisfies the defined objectives. In addition, recent advances in feature selection can be found in [13–14]; in particular, [14] reviews the latest research on evolutionary computation (EC) approaches to feature selection and identifies the contributions of the different algorithms.
To better introduce the state of research on feature selection for high-throughput or high-dimensional data [1, 15], some representative studies from recent years are summarized below. Tan et al. [1] proposed a new minimax sparse logistic regression model for very high-dimensional feature selection, which can be efficiently solved by a cutting plane algorithm. To effectively identify chromosome-wide spatial clusters from high-throughput chromatin conformation capture data, a population based optimization algorithm that coordinates and guides non-negative matrix factorization toward global optima was proposed in [16]. In [17], a novel feature selection method based on high dimensional model representation (HDMR) was proposed for the hyperspectral image classification problem; its core idea is to rank the global sensitivity indices calculated via the HDMR to find the most relevant features. To explore and identify small clusters of spatially proximal genomic regions, Li et al. [18] proposed evolutionary computation methods to evolve and confirm functionally related genomic regions. Chen et al. [19] proposed a feature selection algorithm, genetic programming (GP) with permutation importance (GPPI), to select features for high-dimensional symbolic regression (SR) using GP. Based on two typical applications, microarray analysis and target detection, Augusto et al. [20] discussed feature selection for high-dimensional spatial data. To address high-dimensional data classification, Zhang [21] proposed an improved artificial bee colony (ABC) algorithm to select the optimal feature subset; to improve convergence, the modified ABC algorithm (named OGR-ABC) introduces three strategies: opposite initialization, global-optimum based search equations, and a ranking based selection mechanism.
Through the analysis of the above research, it is not difficult to see that the feature selection process mainly includes two steps [22]: a search strategy and an evaluation criterion. Based on whether the classifier itself is used as the feature evaluation index, evaluation criteria can be categorized into the wrapper method and the filter method. The wrapper method [23, 24] evaluates candidate feature subsets with a fixed classification algorithm, and the corresponding classification accuracy is adopted as the index to select the optimal feature subset. The feature subset selected by the wrapper method is therefore not universal: the feature selection process must be executed again whenever the classification algorithm changes. Consequently, its time complexity is high, especially for high dimensional data, and the execution time of the algorithm may be long. In the filter method [25–27], by contrast, the search of the feature space depends on the intrinsic correlation of the data itself rather than on a classification algorithm. The filter method is increasingly attractive because of its simplicity and speed, and is therefore more popular than the wrapper method.
Four intrinsic correlation metrics are often adopted by the filter method to evaluate feature subsets: MI [28], fractal dimension [29], dependency degree [30] and distance [31]. Among them, MI is considered the most acceptable criterion due to two major advantages [32]: (1) it can measure arbitrary relationships between nonlinear (random) variables; (2) it is invariant under invertible transformations of the high dimensional feature space.
According to the above analysis, a filter feature selection method based on redundancy removal is proposed for high dimensional biomedical data in this paper. First, we analyze the four boundary extremes of the correlation between a feature and the target class and the correlation between features; based on this, two redundancy criteria are proposed. Then, to quantify the redundancy criteria, the core module based on MI is proposed: the definition of the approximate redundancy feature. Finally, the algorithm implementation is given.
Mathematical symbols and basic concepts
Mathematical symbols
Many mathematical symbols are used in this study. To improve readability, we list these symbols and their meanings below.
- P: a probability measure.
- F: feature set, F = {F1,F2,…,Fi,…,Fn}.
- Fi: Fi = {fi,1, fi,2,…,fi,n}.
- Ai: Ai = F −{Fi}.
- C: class attribute, C = {C1,C2,…,Ci,…,Cm}.
- R: the relevance between any two variables.
- Ri,j: the relevance between any pair of feature Fi and Fj,i≠j.
- Ri,c: the relevance between any feature Fi and class attribute C.
- Rmax: the maximum value of Ri,c.
- R̄: the mean value of Ri,c, that is R̄ = (R1,c + R2,c + … + Rn,c)/n.
Basic concepts
To lay a foundation for further investigation, the three basic concepts [33] used in this study (strongly relevant feature, weakly relevant feature, and irrelevant feature) are listed as follows.
Strong relevance: Fi is a strongly relevant feature iff there exists an assignment of values with P(Fi, Ai) > 0 such that

P(C | Fi, Ai) ≠ P(C | Ai)    (1)

Weak relevance: Fi is a weakly relevant feature iff it is not strongly relevant (i.e. P(C | Fi, Ai) = P(C | Ai)) and there exists a subset A′i ⊂ Ai with P(Fi, A′i) > 0 such that

P(C | Fi, A′i) ≠ P(C | A′i)    (2)

Irrelevance: Fi is an irrelevant feature iff it is neither strongly relevant nor weakly relevant, that is, for all subsets A′i ⊆ Ai with P(Fi, A′i) > 0,

P(C | Fi, A′i) = P(C | A′i)    (3)
Strong relevance indicates that the feature is very important for classification accuracy, so it cannot be arbitrarily removed. Weak relevance indicates that the feature can sometimes contribute to improving prediction accuracy. Irrelevance indicates that the feature is useless for improving classification accuracy, so it can be deleted directly.
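To make these definitions concrete, the following sketch checks strong relevance on a hypothetical toy joint distribution (invented for illustration, not data from this study) by comparing P(C | F1, A1) with P(C | A1) over all positive-probability assignments:

```python
from itertools import product

# Toy joint distribution P(f1, f2, c) over binary variables.
# Here A1 = {F2}: the rest of the feature set besides F1.
P = {
    (0, 0, 0): 0.25, (0, 1, 0): 0.15,
    (1, 0, 1): 0.35, (1, 1, 1): 0.25,
}

def cond_prob_C(vars_idx, vals, c):
    """P(C = c | variables at positions vars_idx take vals), or None if
    the conditioning event has probability zero."""
    num = sum(p for k, p in P.items()
              if all(k[i] == v for i, v in zip(vars_idx, vals)) and k[2] == c)
    den = sum(p for k, p in P.items()
              if all(k[i] == v for i, v in zip(vars_idx, vals)))
    return num / den if den > 0 else None

# F1 is strongly relevant if P(C | F1, F2) != P(C | F2) for some
# assignment with positive probability (definition (1)).
strong = any(
    cond_prob_C((0, 1), (f1, f2), 1) is not None
    and cond_prob_C((1,), (f2,), 1) is not None
    and abs(cond_prob_C((0, 1), (f1, f2), 1)
            - cond_prob_C((1,), (f2,), 1)) > 1e-12
    for f1, f2 in product((0, 1), repeat=2)
)
print(strong)  # F1 determines C in this toy table, so True
```

In this toy table F1 and C always coincide, so conditioning on F1 changes the class distribution and F1 is strongly relevant.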
Method
Determination of redundancy criterion
It is difficult to determine the complete correlation between any pair of features in the actual calculation process, and hence to determine whether there is redundancy among features. To address this, a redundancy criterion based on correlation is proposed in this study to lay the foundation for further feature selection. Based on the three basic concepts, the redundancy of feature Fi is analyzed under different extreme values of Ri,c and Ri,j. The different extreme-value cases are shown in Table 1.
It is easy to draw the following four conclusions after analyzing the four cases:

Conclusion 1: A large Ri,c means that Fi contains more information about C, and a large Ri,j means that the correlation between Fi and Fj is strong. If Ri,j = 1, then Fi and Fj are completely correlated, hence Fi is redundant. If Ri,j ≠ 1, it is difficult to determine whether Fi is redundant.

Conclusion 2: A small Ri,j means that the correlation between Fi and Fj is weak, hence Fj cannot replace Fi. In other words, regardless of the size of Ri,c, the feature Fi is not redundant.

Conclusion 3: A small Ri,c means that Fi contains less information about C, and a large Ri,j means that the correlation between Fi and Fj is strong. In this case, the feature Fi is redundant with high probability, and this probability increases as Ri,j increases.

Conclusion 4: A small Ri,j means that the correlation between Fi and Fj is weak. Consistent with Conclusion 2, regardless of the size of Ri,c, the feature Fi is not redundant.
Based on the above four conclusions, two redundancy criteria can be obtained:

Criterion 1: when Ri,j is large, whether Fi is redundant is uncertain.

Criterion 2: when Ri,j is small, the feature Fi is not redundant regardless of the size of Ri,c.
Definition of the approximate redundancy feature
Based on the two redundancy criteria in the previous section, the approximate redundancy feature is proposed and defined in this section.
If the Ri,c of feature Fi is very close to Rmax, Fi contains a lot of information about the class attribute C. In this case, Fi can be considered an approximate redundancy feature only if the value of Ri,j is large enough; otherwise, it cannot be considered redundant. The reason is that Fi plays an important role in improving classification accuracy and cannot be easily removed as redundant. By contrast, if the Ri,c of feature Fi is not very close to Rmax, Fi contains relatively little information about C. In this case, Fi is considered an approximate redundancy feature as long as the value of Ri,j is relatively large, because Fi does not play a major role in improving classification accuracy. Based on the above analysis and discussion, the approximate redundancy feature is formally described in Definition 1.
Definition 1 (approximate redundancy feature): Let Fi and Fj be any pair of correlated features with Rj,c ≥ Ri,c.

(1) Fi is an approximate redundancy feature iff Rmax − Rj,c ≤ δ, 0.05 ≤ δ ≤ 0.13, such that

Ri,j ≥ Rj,c    (4)

(2) Fi is an approximate redundancy feature iff Rmax − Rj,c > δ and Rmax − Rj,c ≤ α, 0.05 ≤ δ ≤ 0.13, 0.60 ≤ α ≤ 0.66, such that

Ri,j > (R̄ + Ri,c)/2    (5)

Definition 1 indicates that Fj can be approximated as an alternative to Fi.
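The two conditions of Definition 1 can be expressed as a small predicate. The sketch below is an illustration consistent with the removal conditions used later in Algorithm 1; the function name and the example values are our own assumptions:

```python
def is_approx_redundant(Ri_c, Rj_c, Ri_j, R_max, R_mean,
                        delta=0.08, alpha=0.64):
    """Decide whether a candidate feature Fi is approximately redundant
    with respect to a retained feature Fj (Rj_c >= Ri_c). delta and alpha
    default to values inside the paper's recommended ranges
    ([0.05, 0.13] and [0.60, 0.66])."""
    if R_max - Rj_c <= delta:
        # Fj is almost as relevant as the best feature: require a very
        # strong Fi-Fj correlation before declaring Fi redundant (formula (4)).
        return Ri_j >= Rj_c
    if delta < R_max - Rj_c <= alpha:
        # Fj is only moderately relevant: a relatively large correlation
        # already suffices (formula (5)).
        return Ri_j > (R_mean + Ri_c) / 2
    return False

# Retained feature close to Rmax: Fi is redundant only when Ri_j is large.
print(is_approx_redundant(Ri_c=0.70, Rj_c=0.90, Ri_j=0.95,
                          R_max=0.92, R_mean=0.50))  # True
print(is_approx_redundant(Ri_c=0.70, Rj_c=0.90, Ri_j=0.60,
                          R_max=0.92, R_mean=0.50))  # False
```

The first call satisfies condition (1) of the definition (Rmax − Rj,c = 0.02 ≤ δ and Ri,j ≥ Rj,c); the second fails because the Fi–Fj correlation is too weak.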
Correlation calculation
In general, correlation measures may be linear or nonlinear. A nonlinear method based on MI is adopted in this study, because high dimensional biomedical data usually exhibit nonlinear relationships in the real world. The correlation between any pair of variables (X, Y) can be calculated by formula (6) or (7):

IG(X;Y) = H(X) − H(X|Y)    (6)

IG(X;Y) = H(X) + H(Y) − H(X,Y)    (7)

where H(X), H(X|Y) and H(X,Y) are calculated by formulas (8), (9) and (10):

H(X) = −∑x P(x) log2 P(x)    (8)

H(X|Y) = −∑y P(y) ∑x P(x|y) log2 P(x|y)    (9)

H(X,Y) = −∑x,y P(x,y) log2 P(x,y)    (10)

According to the above formulas, H(X|Y) (or H(X,Y)) is smaller when Y contains more information about X. In other words, the greater the value of IG(X;Y), the more relevant X and Y are.

To unify the scale of the data and reduce the effect of extreme values, each IG(X;Y) is normalized to the range [0, 1] using formula (11):

R(X,Y) = 2 · IG(X;Y) / (H(X) + H(Y))    (11)
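As a concrete illustration, the entropy, information gain, and normalized correlation described above can be sketched in a few lines of Python (an illustrative sketch with sample-estimated probabilities; the function names are our own):

```python
import math
from collections import Counter

def entropy(xs):
    """H(X) = -sum p(x) log2 p(x), estimated from a sample."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def info_gain(xs, ys):
    """IG(X;Y) = H(X) + H(Y) - H(X,Y)  (formula (7))."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def normalized_relevance(xs, ys):
    """Symmetric normalization 2*IG(X;Y)/(H(X)+H(Y)), in [0, 1]
    (formula (11))."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    return 2 * info_gain(xs, ys) / (hx + hy)

# Identical variables are maximally relevant; weakly related ones are not.
x = [0, 0, 1, 1, 0, 1, 0, 1]
print(normalized_relevance(x, x))                       # 1.0
print(round(normalized_relevance(x, [0, 1, 0, 1, 0, 1, 0, 1]), 4))
```

A variable compared with itself gives the maximum normalized relevance of 1.0, while the second pair shares only a small amount of information.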
Algorithm implementation
Based on the definition of the approximate redundancy feature and the correlation calculation method, a feature selection algorithm based on redundancy removal (FSBRR) for high dimensional biomedical data classification is given in Algorithm 1.
Algorithm 1: FSBRR

Input: Feature set: F = {F1, F2, …, Fi, …, Fn};
    Class label: C = {C1, C2, …, Ci, …, Cm};
    Parameters: τ, δ, α.
1. for i = 1:n
2.     Ri,c = 2 * IG(C;Fi)/(H(C) + H(Fi));
3.     if Ri,c ≥ τ  % a preset threshold; removes irrelevant features
4.         Addto(F′, Fi);
5.         Addto(R′, Ri,c);
6.     end
7. end
8. R̄ = mean(R′);  % the mean value of Ri,c, 1 ≤ i ≤ n
9. [X, I] = sort(R′, 'descend');
10. F′ = F′(I);  % order F′ by descending Ri,c value
11. for i = 1:size(I,2)-1
12.     Fi = F′(:,i);  % select the current first feature in each cycle
13.     for j = i+1:size(I,2)
14.         Fj = F′(:,j);  % select the next feature (or variable)
15.         Rj,c = 2 * IG(C;Fj)/(H(C) + H(Fj));
16.         Ri,j = 2 * IG(Fi;Fj)/(H(Fi) + H(Fj));
17.         if Fj ≠ Null
18.             if Ri,c − Rj,c ≤ δ && Ri,j ≥ Ri,c  % 0.05 ≤ δ ≤ 0.13
19.                 remove(F′, Fj);  % removing approximate redundancy features
20.             else if Ri,c − Rj,c > δ && Ri,c − Rj,c < α && Ri,j > (R̄ + Rj,c)/2  % 0.60 ≤ α ≤ 0.66
21.                 remove(F′, Fj);
22.             end
23.         end
24.     end
25. end
26. Foptimal = F′;

Output: Foptimal
The time consumption of the FSBRR algorithm is dominated by the calculation of Ri,c and Ri,j, so this calculation is its atomic operation. Assuming a dataset contains n features, the time complexity of removing irrelevant features (lines 1 to 7) is linear, O(n). For removing approximately redundant and approximately irrelevant features (lines 11 to 25): in the worst case, when no features are redundant, the time complexity is quadratic, O(n²); in the best case, when all features except the first are redundant, it is linear, O(n).
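The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative re-implementation under the conditions stated above, not the authors' Matlab code; the helper names, the index-based bookkeeping, and the tiny example data are our own assumptions:

```python
import math
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def relevance(xs, ys):
    """Normalized MI: 2*IG(X;Y)/(H(X)+H(Y))."""
    hx, hy = entropy(xs), entropy(ys)
    ig = hx + hy - entropy(list(zip(xs, ys)))
    return 2 * ig / (hx + hy) if hx + hy else 0.0

def fsbrr(features, labels, tau=0.0, delta=0.08, alpha=0.64):
    """features: one column (list of sample values) per feature.
    Returns the indices of the selected feature subset."""
    # Lines 1-7: drop irrelevant features (R_ic < tau).
    cand = [(i, relevance(f, labels)) for i, f in enumerate(features)]
    cand = [(i, r) for i, r in cand if r >= tau]
    r_mean = sum(r for _, r in cand) / len(cand)       # line 8
    cand.sort(key=lambda t: -t[1])                     # lines 9-10
    kept, removed = [], set()
    for k, (i, ri_c) in enumerate(cand):               # lines 11-25
        if i in removed:
            continue
        kept.append(i)
        for j, rj_c in cand[k + 1:]:
            if j in removed:
                continue
            ri_j = relevance(features[i], features[j])
            if ri_c - rj_c <= delta and ri_j >= ri_c:
                removed.add(j)                         # line 19
            elif delta < ri_c - rj_c < alpha and ri_j > (r_mean + rj_c) / 2:
                removed.add(j)                         # line 21
    return kept

# Feature 1 duplicates feature 0; feature 2 is weakly related noise.
f0 = [0, 0, 1, 1, 0, 1, 0, 1]
f1 = list(f0)                       # fully redundant copy
f2 = [0, 1, 0, 1, 1, 0, 1, 0]
y  = list(f0)
print(fsbrr([f0, f1, f2], y))       # [0, 2]: the duplicate is removed
```

With τ = 0 the weakly correlated feature f2 survives (consistent with the paper's parameter setting), while the exact duplicate f1 is removed as approximately redundant.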
Performance evaluation function
In this study, classification accuracy and the number of selected features are used to design the performance evaluation function [34–35], shown in formula (12):

Performance = w1 × Acc + w2 × (N − n)/N    (12)

where n is the number of selected features and N is the total number of features. w1 and w2 are predefined weight coefficients used to adjust the importance of the two indicators in the performance evaluation function. In this study, the values of w1 and w2 are set to 0.999 and 0.001 respectively. The main reasons for this setting are: (1) Under the prerequisite of dimensionality reduction, this study mainly focuses on classification accuracy as the metric of a feature selection algorithm. (2) If the number of selected features is significantly reduced but classification accuracy is not improved, such dimensionality reduction of high-dimensional data loses its original application value. (3) A performance evaluation function with a high weight on classification accuracy and a low weight on the number of selected features has been widely applied in many feature selection studies, such as [34]. In addition, Acc is the classification accuracy, defined in formula (13):

Acc = Cnum / (Cnum + Inum)    (13)

where Cnum and Inum are the numbers of correctly and incorrectly classified labels, respectively.
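The evaluation function can be sketched as follows. The weighted-sum form of formula (12) is assumed from the description above (accuracy dominating, with a small reward for smaller subsets, as in [34]-style studies); the example numbers are hypothetical:

```python
def accuracy(c_num, i_num):
    """Acc = Cnum / (Cnum + Inum)  (formula (13))."""
    return c_num / (c_num + i_num)

def fitness(acc, n_selected, n_total, w1=0.999, w2=0.001):
    """Performance = w1 * Acc + w2 * (N - n) / N  (formula (12)):
    accuracy dominates; a smaller subset adds a tiny bonus."""
    return w1 * acc + w2 * (n_total - n_selected) / n_total

# Hypothetical run: 92 of 100 test samples correct, 50 of 2000 features kept.
acc = accuracy(92, 8)                    # 0.92
print(round(fitness(acc, 50, 2000), 6))  # 0.920055
```

With w1 = 0.999 and w2 = 0.001, even a drastic reduction in subset size changes the score by at most 0.001, so accuracy remains the deciding factor.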
Experiments
Data description
Eight well-known biomedical datasets (Table 2) were used to evaluate the performance of the FSBRR algorithm. These datasets cover eight disease diagnosis tasks, with dimensions ranging from 319 to 21548. The first three datasets were taken from the Kent Ridge Biomedical repository [36]. p53 Mutants and Arcene were taken from the UCI repository [37]. Breast invasive carcinoma (BRCA), Glioblastoma multiforme (GBM), and the tumour sequencing project (TSP) were taken from TCGA [38].
Experimental design
To evaluate the performance of the FSBRR algorithm, the following experiments were designed and conducted under the same conditions: the eight high dimensional biomedical datasets were compared and analyzed with FSBRR, Relief (a filter method based on nearest neighbor distance) [39], maximum relevance and minimum redundancy (mRmR) [40], and a genetic algorithm (GA) [41], respectively. Here, "the same conditions" has two meanings: (1) Random forest (RF, numTrees = 10), K-nearest neighbor (KNN, k = 1), and Support Vector Machine (SVM, linear kernel) were adopted as classifiers to evaluate classification performance. (2) In the FSBRR algorithm, the parameter τ was set to 0 to avoid losing weakly correlated features in the absence of prior knowledge. In addition, after adaptive testing, the values of δ and α were set to 0.08 and 0.64, respectively.
To obtain an unbiased experimental result, 10-fold cross validation was adopted to evaluate classification performance. Each dataset was stratified into 10 folds, of which 9 folds were used as training samples and the remaining fold as the testing sample. Moreover, to obtain a statistically meaningful result, each experiment was executed 100 times independently, meaning the classification task was executed 1000 times in total, and the average value was taken as the final result. The experiments were implemented in Matlab 2017a; the hardware and software configuration is shown in Table 3.
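The stratified splitting step can be sketched as follows (a minimal illustration of one 10-fold repetition; in the paper this procedure is repeated 100 times with a classifier trained on each training split, which is omitted here):

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds so that every fold
    keeps roughly the overall class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds

# One repetition: each fold serves as the test set once, the rest train.
labels = [0] * 60 + [1] * 40
folds = stratified_folds(labels, k=10)
for test in folds:
    train = [i for f in folds if f is not test for i in f]
    assert len(test) == 10 and len(train) == 90
print([sum(labels[i] for i in f) for f in folds])  # 4 positives per fold
```

Because the 60/40 class split divides evenly by 10, every fold here contains exactly 6 negatives and 4 positives.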
Results
For the eight datasets, we conducted the experiments described in the section above. Six major statistical indicators were compared and analyzed, and the results are shown in Table 4.
From Table 4, we can observe the following: (1) The FSBRR algorithm obtained the highest Mean among the four feature selection algorithms for twenty-one out of twenty-four experimental results. The highest Means were 92.01%, 80.17%, 82.99%, 94.31%, 85.67%, 86.26%, 78.63%, 91.91%, 74.24%, 83.68%, 85.67%, 84.42%, 77.69%, 87.68%, 73.18%, 78.91%, 86.25%, 80.61%, 82.02%, 80.09%, and 72.35%, respectively. Meanwhile, the maximum Mean improvement of FSBRR over the full feature set was 19.84%. (2) The RF-based GA algorithm on the GBM dataset, and the KNN-based GA algorithm on the p53 Mutants and GBM datasets, obtained the highest Means of 82.03%, 89.90%, and 81.39%, respectively. However, the Std, MeanFN and RT of GA were significantly higher than those of FSBRR in these three experiments. (3) The Std obtained by FSBRR was smaller than that of the other three algorithms for eighteen out of twenty-four experimental results. (4) All four feature selection algorithms can effectively reduce the feature dimension, but the dimension reduction of the FSBRR algorithm was the most pronounced. Among FSBRR, Relief, mRmR and GA, only GA is a wrapper feature selection algorithm, so its feature subset sizes differed across the RF, KNN and SVM classifiers. (5) In most experiments, except for the running time index, RF performed significantly better than KNN and SVM on the same dataset. These results indicate that, for a specified dataset, a matching classification (or learning) algorithm must be found to obtain the best results.
Fig 1 was obtained by statistical analysis of Table 4. It shows four average attribute values (avg(Mean), avg(Std), avg(MeanFN), and avg(RT)) for the four feature selection algorithms. From Fig 1 we can observe that the performance, stability, optimal-subset feature number, and time complexity of FSBRR were superior to those of the other three algorithms. We believe the reasons for this result are: (1) The FSBRR algorithm uses Ri,j to explore the horizontal correlation between features, while the vertical correlation between features and classes is explored by Ri,c; based on basic relevance theory, the horizontal and vertical relevance are effectively combined. (2) The FSBRR algorithm removes not only the irrelevant features but also the approximately redundant features.
Discussion and analysis
To verify the effect of parameters δ and α on the performance of the FSBRR algorithm, the classification accuracy on the 8 datasets was tested with RF, KNN, and SVM, respectively. In the first experiment, with the parameter α = 0.6 held constant, the parameter δ was increased from 0 to 0.2 in steps of 0.01. In the second experiment, with the parameter δ = 0.1 held constant, the parameter α was increased from 0.5 to 0.7 in steps of 0.01. The other experimental procedures are described in the experimental design section. Moreover, to facilitate discussion and analysis, only classification accuracy is considered here. The results of the two experiments are shown in Figs 2 and 3. Statistical analysis of these results reveals that, when the highest classification accuracy was obtained for the eight datasets, the parameter values differed but their ranges overlapped. By selecting the results in the top 20% of classification accuracy, we can clearly observe the salient overlap of the parameter ranges, shown in Fig 4. The optimum range of parameter δ is [0.05, 0.13], and the optimum range of parameter α is [0.60, 0.66]; see the dotted-line marker ranges in Fig 4.
Each original dataset may have one or more approximately optimal feature subsets, and the purpose of feature selection is to find one of them. However, the dimensions of the optimal subsets differ across datasets, and the distributions of the correlation values (Ri,j and Ri,c) vary, which leads to differences in the parameter values (δ or α). According to the statistical analysis of the above two experiments, the parameters δ and α should be selected within the ranges [0.05, 0.13] and [0.60, 0.66], respectively, because the classification accuracy is likely to reach its highest value within these ranges.
For further analysis and interpretation of the proposed algorithm, the Breast Cancer Wisconsin (Diagnostic) dataset [42] from UCI is used in this subsection. This dataset contains two classes, malignant and benign, and is composed of 569 instances, each with 32 attributes. For the FSBRR algorithm, the relationship between relevance and redundant features was analyzed using the RF algorithm. The other experimental procedures are described in the experimental design section. As before, only classification accuracy is considered here.
The main results of this experiment are listed in Table 5. Two attributes in Table 5 need explanation: (1) "Feature set": for example, {5} is the subscript of the feature, i.e. the fifth feature. (2) "Change" is the accuracy change relative to the current feature set. From Table 5 we can observe the following: (1) With {5} as the current feature set, the accuracy increases after adding feature 7 (see the third row); by contrast, the accuracy decreases after adding feature 21 (see the fourth row). These results verify redundancy Criterion 1 (when Ri,j is large, whether Fi is redundant is uncertain); in this case, the approximate redundancy feature definition is needed to determine whether the current feature is redundant. (2) With {5, 7} as the current feature set, the accuracy increases after adding feature 18 (see the fifth row); with {8, 3} as the current feature set, the accuracy increases after adding feature 30 (see the seventh row). These results verify redundancy Criterion 2 (when Ri,j is small, the feature Fi is not redundant regardless of the size of Ri,c).
Conclusions
In this paper, the relationship between two kinds of correlation (Ri,c and Ri,j) is established, which effectively combines the correlation between features and classes with the correlation between features to eliminate redundant features. Because determining completely redundant features is difficult to realize in practice, we first analyze four kinds of boundary conditions between Ri,c and Ri,j, and then propose redundancy criteria. On this basis, approximate redundancy features are defined in this study. Finally, we propose a new feature selection algorithm based on approximate redundancy removal (FSBRR) for high dimensional biomedical data classification.
To verify the effectiveness of the FSBRR algorithm, three classification algorithms (RF, KNN and SVM) are used to compare FSBRR with three typical feature selection algorithms on eight high dimensional biomedical datasets. The experimental results show that the FSBRR algorithm can effectively remove redundant features and improve classification performance. Additionally, we designed a set of comparative experiments to discuss and analyze the effects of the parameters δ and α on the performance of the FSBRR algorithm.
References
- 1. Tan MK, Tsang IW, Wang L. Minimax sparse logistic regression for very high-dimensional feature selection. IEEE Transactions on Neural Networks and Learning Systems. 2013; 24(10): 1609–1622. pmid:24808598
- 2. Tamaresis JS, Irwin JC, Goldfien G, Jossph T, Rabban RO, Gamran N, et al. Molecular classification of endometriosis and disease stage using high-dimensional genomic data. Endocrinology. 2014; 155(12): 4986–4999. pmid:25243856
- 3. Lee K, Man Z, Wang D, Cao Z. Classification of bioinformatics dataset using finite impulse response extreme learning machine for cancer diagnosis. Neural Computing and Applications.2013; 22(3): 457–468.
- 4. Jiang H, Hu B, Liu Z, Yan L, Wang T, Liu F, et al. Investigation of different speech types and emotions for detecting depression using different classifiers. Speech Communication. 2017; 90(6): 39–46.
- 5. Li XW, Zhuang J, Hu B, Jing Z, Ning Z, Mi L, et al. A resting-state brain functional network study in MDD based on minimum spanning tree analysis and the hierarchical clustering. Complexity. 2017; 22(3): 1–11.
- 6. Chung IF, Chen YC, Pal N. Feature selection with controlled redundancy in a fuzzy rule based framework. IEEE Transactions on Fuzzy Systems. 2018; 26(2): 734–748.
- 7. Avci E. Selecting of the optimal feature subset and kernel parameters in digital modulation classification by using hybrid genetic algorithm-support vector machines: HGASVM. Expert Systems with Applications. 2009; 36(2): 1391–1402.
- 8. Liu H, Yu L. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering. 2005; 17(4): 491–502.
- 9. Li XT, Yin MH. Multi objective binary biogeography based optimization based feature selection for gene expression data. IEEE Transactions on NanoBioscience. 2013; 12 (4): 343–353. pmid:25003163
- 10. Zawbaa HM, Emary E, Grosan C. Feature selection via chaotic antlion optimization. Plos One. 2016; 11(3): e0150652. pmid:26963715
- 11. Liu B, Tian M, Zhang C, Li X. Discrete biogeography based optimization for feature selection in molecular signatures. Molecular Informatics. 2015; 34(4): 197–215. pmid:27490166
- 12. Li XT, Li MJ, Yin MH. Multiobjective Ranking Binary Artificial Bee Colony for Gene Selection Problems Using Microarray Datasets. IEEE/CAA Journal of Automatica Sinica. 2016;
- 13. Vergara JR, Estévez PA. A review of feature selection methods based on mutual information. Neural Computing and Applications. 2014; 24(1): 175–186.
- 14. Xue B, Zhang MJ, Browne WN, Yao X. A survey on evolutionary computation approaches to feature selection. IEEE Transactions on Evolutionary Computation. 2016; 20(4): 606–626.
- 15. Zhang BT, Lei T, Liu H, Cai HS. EEG-based automatic sleep staging using ontology and weighting feature analysis. Computational and Mathematical Methods in Medicine. 2018; 2018: 6534041. pmid:30254690
- 16. Li XT, Wong KC. A comparative study for identifying the chromosome-wide spatial clusters from high-throughput chromatin conformation capture data. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2018; 15(3):774–787. pmid:28333638
- 17. Taskin G, Kaya H, Bruzzone L. Feature selection based on high dimensional model representation for hyperspectral images. IEEE Transactions on Image Processing. 2017; 26(6): 2918–2928. pmid:28358688
- 18. Li XT, Ma SJ, Wong KC. Evolving spatial clusters of genomic regions from high-throughput chromatin conformation capture data. IEEE Transactions on NanoBioscience. 2017; 16(6): 400–407. pmid:28708563
- 19. Chen Q, Zhang MJ, Xue B. Feature selection to improve generalisation of genetic programming for high-dimensional symbolic regression. IEEE Transactions on Evolutionary Computation. 2017; 21(5): 792–806.
- 20. Augusto D, Sofia M, Christine DM, Alessandro V, Francesca O. Feature selection for high-dimensional data. Computational Management Science. 2009; 6(1): 25–40.
- 21. Zhang Y. A modified artificial bee colony algorithm-based feature selection for the classification of high-dimensional data. Journal of Computational and Theoretical Nanoscience. 2016; 13(7): 4088–4095.
- 22. Mafarja M, Mirjalili S. Whale optimization approaches for wrapper feature selection. Applied Soft Computing. 2018; 62(1): 441–453.
- 23. Kohavi R, John GH. Wrappers for feature subset selection. Artificial Intelligence. 1997; 97(1): 273–324.
- 24. Chrysostomou K, Chen SY, Liu X. Combining multiple classifiers for wrapper feature selection. International Journal of Data Mining Modelling and Management. 2017; 1(1): 91–102.
- 25. Zhang D, Chen S, Zhou ZH. Constraint score: A new filter method for feature selection with pairwise constraints. Pattern Recognition. 2008; 41(5): 1440–1451.
- 26. Hancer E, Xue B, Zhang MJ. Differential evolution for filter feature selection based on information theory and feature ranking. Knowledge-Based Systems. 2018; 140(10): 103–119.
- 27. Lei T, Jia XH, Zhang YN, He LF, Meng HY, Nandi AK. Significantly fast and robust fuzzy c-means clustering algorithm based on morphological reconstruction and membership filtering. IEEE Transactions on Fuzzy Systems. 2018; 26(5): 3027–3041.
- 28. Estevez PA, Tesmer M, Perez CA, Zurada JM. Normalized Mutual Information Feature Selection. IEEE Transactions on Neural Networks. 2009; 20(2):189–201. pmid:19150792
- 29. Zhang C, Ni ZW, Ni LP, Tang N. Feature selection method based on multi-fractal dimension and harmony search algorithm and its application. International Journal of Systems Science. 2016; 47(14): 3476–3486.
- 30. Maji P. A rough hypercuboid approach for feature selection in approximation spaces. IEEE Transactions on Knowledge and Data Engineering. 2014; 26(1): 16–29.
- 31. Ververidis D, Kotropoulos C. Information loss of the Mahalanobis distance in high dimensions: application to feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2009; 31(12): 2275–2281. pmid:19834146
- 32. Brown G, Pocock A, Zhao MJ, Luján M. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. Journal of Machine Learning Research. 2012; 13(1): 27–66.
- 33. John GH, Kohavi R, Pfleger K. Irrelevant features and the subset selection problem. Machine Learning Proceedings. 1994; 1994(7): 121–129.
- 34. Hu B, Dai YQ, Su Y, Moore P, Zhang XY, Mao CS, et al. Feature selection for optimized high-dimensional biomedical data using the improved shuffled frog leaping algorithm. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2018; 15(6): 1765–1773. pmid:28113635
- 35. Chuang YL, Chang HW, Tu CJ, Yang CH. Improved binary PSO for feature selection using gene expression data. Computational Biology and Chemistry. 2008; 32(1): 29–38. pmid:18023261
- 36. Li J, Liu H. Kent Ridge Biomedical Data Set Repository. School of Computer Engineering, Nanyang Technological University, Singapore. 2004; Available from: http://datam.i2r.astar.edu.sg/datasets/krbd/index.html.
- 37. UCI. 2018; Available from: http://archive.ics.uci.edu/ml/.
- 38. TCGA. 2019; Available from: http://tcga-data.nci.nih.gov.
- 39. Kononenko I. Estimating attributes: analysis and extensions of RELIEF. European Conference on Machine Learning. Springer-Verlag New York; 1994: 171–182. https://doi.org/10.1007/3-540-57868-4_57
- 40. Peng H, Long F, Ding C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005; 27(8): 1226–1238. pmid:16119262
- 41. Deb K, Pratap A, Agarwal S, Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation. 2002; 6(2): 182–197.
- 42. Mangasarian OL, Street WN, Wolberg WH. Breast cancer diagnosis and prognosis via linear programming. IEEE Computational Science and Engineering. 1995; 43(4): 570–577.