Abstract
High dimensional biomedical data contain tens of thousands of features; accurate and effective identification of the core features in these data can assist the diagnosis of related diseases. However, biomedical data often contain a large number of irrelevant or redundant features, which seriously affect subsequent classification accuracy and machine learning efficiency. To solve this problem, this paper proposes a novel filter feature selection algorithm based on redundancy removal (FSBRR) to classify high dimensional biomedical data. First, two redundancy criteria are determined from vertical relevance (the relationship between a feature and the class attribute) and horizontal relevance (the relationship between features). Second, to quantify these redundancy criteria, an approximate redundancy feature framework based on mutual information (MI) is defined to remove redundant and irrelevant features. To evaluate the effectiveness of the proposed algorithm, controlled trials against typical feature selection algorithms are conducted using three different classifiers; the experimental results indicate that the FSBRR algorithm can effectively reduce the feature dimension and improve classification accuracy. In addition, an experiment on a small-sample dataset is designed and conducted in the discussion and analysis section to clarify the specific implementation process of the FSBRR algorithm.
Citation: Zhang B, Cao P (2019) Classification of high dimensional biomedical data based on feature selection using redundant removal. PLoS ONE 14(4): e0214406. https://doi.org/10.1371/journal.pone.0214406
Editor: Xiangtao Li, Northeast Normal University, CHINA
Received: November 10, 2018; Accepted: March 12, 2019; Published: April 9, 2019
Copyright: © 2019 Zhang, Cao. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript, Supporting Information files, and from the FIGSHARE database here: https://figshare.com/articles/Data-PONE-D-18-32374/7866167.
Funding: This work was supported by the Youth Science Foundation of Lanzhou Jiaotong University [2016004] to B. T. Zhang.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The analysis of high dimensional disease data [1–2] is a very important research field, especially for cancer [3] and mental diseases (e.g. depression [4–5]). Since completely curing these diseases is often unrealistic, early diagnosis or prevention plays an important role in their treatment. However, high dimensional biomedical data usually contain a large number of weakly relevant or irrelevant features. If all features are treated equally, the time complexity, space complexity and prediction accuracy can be seriously affected. Therefore, feature selection is considered an essential step in diagnosing related diseases from high dimensional biomedical data.
As stated in [6], feature selection is also referred to as feature subset selection. The main purpose of feature selection is to remove irrelevant and redundant features before classification, while retaining the most valuable information of the original data. In other words, the objective of feature selection is to select an optimal feature subset [7] from the original feature set, which lays a good foundation for subsequent classification or learning work. As an important part of knowledge discovery technology, feature selection [8] can effectively improve the computing speed of the subsequent prediction algorithm, enhance the compactness of the prediction model, increase the generalization ability of the corresponding model, and avoid overfitting.
Based on the above factors, feature selection has long been a hot research topic, and new results are constantly emerging. For example, in [9], a feature selection method based on multi-objective binary biogeography based optimization (MOBBBO) is proposed for gene selection, which combines the non-dominated sorting method and the crowding distance method into the binary biogeography based optimization (BBBO) framework. In [10], a novel feature selection method via chaotic optimization is developed to balance exploration of the search space with exploitation of the best solutions. In [11], Liu et al. used a discrete biogeography based optimization (DBBO) method, integrating a discrete migration model and a discrete mutation model, for feature selection in molecular signatures. In [12], Li et al. proposed a feature selection algorithm based on the multi-objective ranking binary artificial bee colony to select from the original high dimensional data a subset that satisfies the defined objectives. In addition, recent advances in feature selection can be found in [13–14]; in particular, [14] reviews the latest research on evolutionary computation (EC) approaches to feature selection and identifies the contributions of the different algorithms.
To better introduce the state of research on feature selection for high-throughput or high-dimensional data [1, 15], some representative studies from recent years are summarized below. Tan et al. [1] proposed a new minimax sparse logistic regression model for very high-dimensional feature selection, which can be efficiently solved by a cutting plane algorithm. To effectively identify chromosome-wide spatial clusters from high-throughput chromatin conformation capture data, a population based optimization algorithm that coordinates and guides non-negative matrix factorization toward global optima was proposed in [16]. In [17], a novel feature selection method based on high dimensional model representation (HDMR) was proposed for the hyperspectral image classification problem; its core idea is to rank the global sensitivity indices calculated via the HDMR to find the most relevant features. To explore and identify small clusters of spatially proximal genomic regions, Li et al. [18] proposed evolutionary computation methods to evolve and confirm functionally related genomic regions. Chen et al. [19] proposed a feature selection algorithm, genetic programming (GP) with permutation importance (GPPI), to select features for high-dimensional symbolic regression (SR) using GP. Based on two typical applications, microarray analysis and target detection, Augusto et al. [20] discussed feature selection for high-dimensional spatial data. To address high-dimensional data classification, Zhang [21] proposed an improved artificial bee colony (ABC) algorithm to select the optimal feature subset; to improve convergence, the modified ABC algorithm (named OGR-ABC) introduces three strategies: opposite initialization, global-optimum based search equations, and a ranking based selection mechanism.
Through the analysis of the above research, it is not difficult to see that the feature selection process mainly includes two steps [22]: a search strategy and an evaluation criterion. Based on whether the classifier itself is used as the feature evaluation index, evaluation criteria can be categorized into the wrapper method and the filter method. The wrapper method [23, 24] evaluates candidate feature subsets with a fixed classification algorithm, and the corresponding classification accuracy is adopted as the index to select the optimal feature subset. The feature subset selected by the wrapper method is therefore not universal: the feature selection process must be executed again whenever the classification algorithm changes. Consequently, its time complexity is high, especially for high dimensional data, and the execution time of the algorithm may be long. In the filter method [25–27], by contrast, the search of the feature space depends on the intrinsic correlation of the data itself rather than on a classification algorithm. The filter method is increasingly attractive because of its simplicity and speed, and is therefore more popular than the wrapper method.
Four intrinsic correlation metrics are often adopted by the filter method to evaluate feature subsets: MI [28], fractal dimension [29], dependency degree [30] and distance [31]. Among them, MI is considered the most acceptable criterion due to two major advantages [32]: (1) it can measure arbitrary relationships between nonlinear (random) variables; (2) it is invariant under invertible transformations of the high dimensional feature space.
According to the above analysis, a filter feature selection method based on redundancy removal is proposed for high dimensional biomedical data in this paper. First, we analyze the four boundary extremes of the correlation between a feature and the target class and the correlation between features; based on this, two redundancy criteria are proposed. Then, to quantify the redundancy criteria, the core module based on MI is proposed: the definition of the approximate redundancy feature. Finally, the algorithm implementation is given.
Mathematical symbols and basic concepts
Mathematical symbols
Many mathematical symbols are used in this study. To improve readability, we list these symbols and their meanings below.
- P: a probability measure.
- F: feature set, F = {F1,F2,…,Fi,…,Fn}.
- Fi: Fi = {fi,1, fi,2,…,fi,n}.
- Ai: Ai = F −{Fi}.
- C: class attribute, C = {C1,C2,…,Ci,…,Cm}.
- R: the relevance between any two variables.
- Ri,j: the relevance between any pair of feature Fi and Fj,i≠j.
- Ri,c: the relevance between any feature Fi and class attribute C.
- Rmax: the maximum value of Ri,c.
- R̄: the mean value of Ri,c, that is R̄ = (R1,c + R2,c + … + Rn,c)/n.
Basic concepts
To lay a foundation for further investigation, the three basic concepts [33] used in this study (strongly relevant feature, weakly relevant feature, and irrelevant feature) are listed as follows.
Strong relevance: Fi is a strongly relevant feature iff there exists an assignment of values with P(Fi, Ai) > 0 such that

P(C | Fi, Ai) ≠ P(C | Ai)    (1)

Weak relevance: Fi is a weakly relevant feature iff it is not strongly relevant (i.e. P(C | Fi, Ai) = P(C | Ai)) and there exists a subset A′i ⊂ Ai with P(Fi, A′i) > 0 such that

P(C | Fi, A′i) ≠ P(C | A′i)    (2)

Irrelevance: Fi is an irrelevant feature iff it is neither strongly relevant nor weakly relevant, that is, for all subsets A′i ⊆ Ai with P(Fi, A′i) > 0,

P(C | Fi, A′i) = P(C | A′i)    (3)
Strong relevance indicates that the feature is very important for classification accuracy, so it cannot be arbitrarily removed. Weak relevance indicates that the feature can sometimes contribute to improving prediction accuracy. Irrelevance indicates that the feature is useless for improving classification accuracy, so it can be deleted directly.
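To make these definitions concrete, the following sketch checks strong relevance on a hypothetical toy joint distribution (invented for illustration, not data from this study) by comparing P(C | F1, A1) with P(C | A1) over all positive-probability assignments:

```python
from itertools import product

# Toy joint distribution P(f1, f2, c) over binary variables.
# Here A1 = {F2}: the rest of the feature set besides F1.
P = {
    (0, 0, 0): 0.25, (0, 1, 0): 0.15,
    (1, 0, 1): 0.35, (1, 1, 1): 0.25,
}

def cond_prob_C(vars_idx, vals, c):
    """P(C = c | variables at positions vars_idx take vals), or None if
    the conditioning event has probability zero."""
    num = sum(p for k, p in P.items()
              if all(k[i] == v for i, v in zip(vars_idx, vals)) and k[2] == c)
    den = sum(p for k, p in P.items()
              if all(k[i] == v for i, v in zip(vars_idx, vals)))
    return num / den if den > 0 else None

# F1 is strongly relevant if P(C | F1, F2) != P(C | F2) for some
# assignment with positive probability (definition (1)).
strong = any(
    cond_prob_C((0, 1), (f1, f2), 1) is not None
    and cond_prob_C((1,), (f2,), 1) is not None
    and abs(cond_prob_C((0, 1), (f1, f2), 1)
            - cond_prob_C((1,), (f2,), 1)) > 1e-12
    for f1, f2 in product((0, 1), repeat=2)
)
print(strong)  # F1 determines C in this toy table, so True
```

In this toy table F1 and C always coincide, so conditioning on F1 changes the class distribution and F1 is strongly relevant.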
Method
Determination of redundancy criterion
It is difficult to determine the complete correlation between any pair of features in the actual calculation process, and hence to determine whether there is redundancy among features. To address this, a redundancy criterion based on correlation is proposed in this study to lay the foundation for further feature selection. Based on the three basic concepts, the redundancy of feature Fi is analyzed under different extreme values of Ri,c and Ri,j. The different extreme-value cases are shown in Table 1.
It is easy to draw the following four conclusions after analyzing the four cases:

Conclusion 1: A large Ri,c means that Fi contains more information about C, and a large Ri,j means that the correlation between Fi and Fj is strong. If Ri,j = 1, then Fi and Fj are completely correlated, hence Fi is redundant. If Ri,j ≠ 1, it is difficult to determine whether Fi is redundant.

Conclusion 2: A small Ri,j means that the correlation between Fi and Fj is weak, hence Fj cannot replace Fi. In other words, regardless of the size of Ri,c, the feature Fi is not redundant.

Conclusion 3: A small Ri,c means that Fi contains less information about C, and a large Ri,j means that the correlation between Fi and Fj is strong. In this case, the feature Fi is redundant with high probability, and this probability increases as Ri,j increases.

Conclusion 4: A small Ri,j means that the correlation between Fi and Fj is weak. Consistent with Conclusion 2, regardless of the size of Ri,c, the feature Fi is not redundant.
Based on the above four conclusions, two redundancy criteria can be obtained:

Criterion 1: when Ri,j is large, whether Fi is redundant is uncertain.

Criterion 2: when Ri,j is small, the feature Fi is not redundant regardless of the size of Ri,c.
Definition of the approximate redundancy feature
Based on the two redundancy criteria in the previous section, the approximate redundancy feature is proposed and defined in this section.
If the Ri,c of feature Fi is very close to Rmax, Fi contains a lot of information about the class attribute C. In this case, Fi can be considered an approximate redundancy feature only if the value of Ri,j is large enough; otherwise, it cannot be considered redundant. The reason is that Fi plays an important role in improving classification accuracy and cannot be easily removed as redundant. By contrast, if the Ri,c of feature Fi is not very close to Rmax, Fi contains relatively little information about C. In this case, Fi is considered an approximate redundancy feature as long as the value of Ri,j is relatively large, because Fi does not play a major role in improving classification accuracy. Based on the above analysis and discussion, the approximate redundancy feature is formally described in Definition 1.
Definition 1 (approximate redundancy feature): Let Fi and Fj be any pair of correlated features with Rj,c ≥ Ri,c.

(1) Fi is an approximate redundancy feature iff Rmax − Rj,c ≤ δ, 0.05 ≤ δ ≤ 0.13, such that

Ri,j ≥ Rj,c    (4)

(2) Fi is an approximate redundancy feature iff Rmax − Rj,c > δ and Rmax − Rj,c ≤ α, 0.05 ≤ δ ≤ 0.13, 0.60 ≤ α ≤ 0.66, such that

Ri,j > (R̄ + Ri,c)/2    (5)

Definition 1 indicates that Fj can be approximated as an alternative to Fi.
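The two conditions of Definition 1 can be expressed as a small predicate. The sketch below is an illustration consistent with the removal conditions used later in Algorithm 1; the function name and the example values are our own assumptions:

```python
def is_approx_redundant(Ri_c, Rj_c, Ri_j, R_max, R_mean,
                        delta=0.08, alpha=0.64):
    """Decide whether a candidate feature Fi is approximately redundant
    with respect to a retained feature Fj (Rj_c >= Ri_c). delta and alpha
    default to values inside the paper's recommended ranges
    ([0.05, 0.13] and [0.60, 0.66])."""
    if R_max - Rj_c <= delta:
        # Fj is almost as relevant as the best feature: require a very
        # strong Fi-Fj correlation before declaring Fi redundant (formula (4)).
        return Ri_j >= Rj_c
    if delta < R_max - Rj_c <= alpha:
        # Fj is only moderately relevant: a relatively large correlation
        # already suffices (formula (5)).
        return Ri_j > (R_mean + Ri_c) / 2
    return False

# Retained feature close to Rmax: Fi is redundant only when Ri_j is large.
print(is_approx_redundant(Ri_c=0.70, Rj_c=0.90, Ri_j=0.95,
                          R_max=0.92, R_mean=0.50))  # True
print(is_approx_redundant(Ri_c=0.70, Rj_c=0.90, Ri_j=0.60,
                          R_max=0.92, R_mean=0.50))  # False
```

The first call satisfies condition (1) of the definition (Rmax − Rj,c = 0.02 ≤ δ and Ri,j ≥ Rj,c); the second fails because the Fi–Fj correlation is too weak.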
Correlation calculation
In general, correlation measures may be linear or nonlinear. A nonlinear method based on MI is adopted in this study, because high dimensional biomedical data usually exhibit nonlinear relationships in the real world. The correlation between any pair of variables (X, Y) can be calculated by formula (6) or (7):

IG(X;Y) = H(X) − H(X|Y)    (6)

IG(X;Y) = H(X) + H(Y) − H(X,Y)    (7)

where H(X), H(X|Y) and H(X,Y) are calculated by formulas (8), (9) and (10):

H(X) = −∑x P(x) log2 P(x)    (8)

H(X|Y) = −∑y P(y) ∑x P(x|y) log2 P(x|y)    (9)

H(X,Y) = −∑x,y P(x,y) log2 P(x,y)    (10)

According to the above formulas, H(X|Y) (or H(X,Y)) is smaller when Y contains more information about X. In other words, the greater the value of IG(X;Y), the more relevant X and Y are.

To unify the scale of the data and reduce the effect of extreme values, each IG(X;Y) is normalized to the range [0, 1] using formula (11):

R(X,Y) = 2 · IG(X;Y) / (H(X) + H(Y))    (11)
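As a concrete illustration, the entropy, information gain, and normalized correlation described above can be sketched in a few lines of Python (an illustrative sketch with sample-estimated probabilities; the function names are our own):

```python
import math
from collections import Counter

def entropy(xs):
    """H(X) = -sum p(x) log2 p(x), estimated from a sample."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def info_gain(xs, ys):
    """IG(X;Y) = H(X) + H(Y) - H(X,Y)  (formula (7))."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def normalized_relevance(xs, ys):
    """Symmetric normalization 2*IG(X;Y)/(H(X)+H(Y)), in [0, 1]
    (formula (11))."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    return 2 * info_gain(xs, ys) / (hx + hy)

# Identical variables are maximally relevant; weakly related ones are not.
x = [0, 0, 1, 1, 0, 1, 0, 1]
print(normalized_relevance(x, x))                       # 1.0
print(round(normalized_relevance(x, [0, 1, 0, 1, 0, 1, 0, 1]), 4))
```

A variable compared with itself gives the maximum normalized relevance of 1.0, while the second pair shares only a small amount of information.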
Algorithm implementation
Based on the definition of the approximate redundancy feature and the correlation calculation method, a feature selection algorithm based on redundancy removal (FSBRR) for high dimensional biomedical data classification is given in Algorithm 1.
Algorithm 1: FSBRR

Input: Feature set: F = {F1, F2, …, Fi, …, Fn};
    Class label: C = {C1, C2, …, Ci, …, Cm};
    Parameters: τ, δ, α.
1. for i = 1:n
2.     Ri,c = 2 * IG(C;Fi)/(H(C) + H(Fi));
3.     if Ri,c ≥ τ  % a preset threshold; removes irrelevant features
4.         Addto(F′, Fi);
5.         Addto(R′, Ri,c);
6.     end
7. end
8. R̄ = mean(R′);  % the mean value of Ri,c, 1 ≤ i ≤ n
9. [X, I] = sort(R′, 'descend');
10. F′ = F′(I);  % order F′ by descending Ri,c value
11. for i = 1:size(I,2)-1
12.     Fi = F′(:,i);  % select the current first feature in each cycle
13.     for j = i+1:size(I,2)
14.         Fj = F′(:,j);  % select the next feature (or variable)
15.         Rj,c = 2 * IG(C;Fj)/(H(C) + H(Fj));
16.         Ri,j = 2 * IG(Fi;Fj)/(H(Fi) + H(Fj));
17.         if Fj ≠ Null
18.             if Ri,c − Rj,c ≤ δ && Ri,j ≥ Ri,c  % 0.05 ≤ δ ≤ 0.13
19.                 remove(F′, Fj);  % removing approximate redundancy features
20.             else if Ri,c − Rj,c > δ && Ri,c − Rj,c < α && Ri,j > (R̄ + Rj,c)/2  % 0.60 ≤ α ≤ 0.66
21.                 remove(F′, Fj);
22.             end
23.         end
24.     end
25. end
26. Foptimal = F′;

Output: Foptimal
The time consumption of the FSBRR algorithm is dominated by the calculation of Ri,c and Ri,j, so this calculation is its atomic operation. Assuming a dataset contains n features, the time complexity of removing irrelevant features (lines 1 to 7) is linear, O(n). For removing approximately redundant and approximately irrelevant features (lines 11 to 25): in the worst case, when no features are redundant, the time complexity is quadratic, O(n²); in the best case, when all features except the first are redundant, it is linear, O(n).
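The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative re-implementation under the conditions stated above, not the authors' Matlab code; the helper names, the index-based bookkeeping, and the tiny example data are our own assumptions:

```python
import math
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def relevance(xs, ys):
    """Normalized MI: 2*IG(X;Y)/(H(X)+H(Y))."""
    hx, hy = entropy(xs), entropy(ys)
    ig = hx + hy - entropy(list(zip(xs, ys)))
    return 2 * ig / (hx + hy) if hx + hy else 0.0

def fsbrr(features, labels, tau=0.0, delta=0.08, alpha=0.64):
    """features: one column (list of sample values) per feature.
    Returns the indices of the selected feature subset."""
    # Lines 1-7: drop irrelevant features (R_ic < tau).
    cand = [(i, relevance(f, labels)) for i, f in enumerate(features)]
    cand = [(i, r) for i, r in cand if r >= tau]
    r_mean = sum(r for _, r in cand) / len(cand)       # line 8
    cand.sort(key=lambda t: -t[1])                     # lines 9-10
    kept, removed = [], set()
    for k, (i, ri_c) in enumerate(cand):               # lines 11-25
        if i in removed:
            continue
        kept.append(i)
        for j, rj_c in cand[k + 1:]:
            if j in removed:
                continue
            ri_j = relevance(features[i], features[j])
            if ri_c - rj_c <= delta and ri_j >= ri_c:
                removed.add(j)                         # line 19
            elif delta < ri_c - rj_c < alpha and ri_j > (r_mean + rj_c) / 2:
                removed.add(j)                         # line 21
    return kept

# Feature 1 duplicates feature 0; feature 2 is weakly related noise.
f0 = [0, 0, 1, 1, 0, 1, 0, 1]
f1 = list(f0)                       # fully redundant copy
f2 = [0, 1, 0, 1, 1, 0, 1, 0]
y  = list(f0)
print(fsbrr([f0, f1, f2], y))       # [0, 2]: the duplicate is removed
```

With τ = 0 the weakly correlated feature f2 survives (consistent with the paper's parameter setting), while the exact duplicate f1 is removed as approximately redundant.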
Performance evaluation function
In this study, classification accuracy and the number of selected features are used to design the performance evaluation function [34–35], shown in formula (12):

Performance = w1 × Acc + w2 × (N − n)/N    (12)

where n is the number of selected features and N is the total number of features. w1 and w2 are predefined weight coefficients used to adjust the importance of the two indicators in the performance evaluation function. In this study, the values of w1 and w2 are set to 0.999 and 0.001 respectively. The main reasons for this setting are: (1) Under the prerequisite of dimensionality reduction, this study mainly focuses on classification accuracy as the metric of a feature selection algorithm. (2) If the number of selected features is significantly reduced but classification accuracy is not improved, such dimensionality reduction of high-dimensional data loses its original application value. (3) A performance evaluation function with a high weight on classification accuracy and a low weight on the number of selected features has been widely applied in many feature selection studies, such as [34]. In addition, Acc is the classification accuracy, defined in formula (13):

Acc = Cnum / (Cnum + Inum)    (13)

where Cnum and Inum are the numbers of correctly and incorrectly classified labels, respectively.
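The evaluation function can be sketched as follows. The weighted-sum form of formula (12) is assumed from the description above (accuracy dominating, with a small reward for smaller subsets, as in [34]-style studies); the example numbers are hypothetical:

```python
def accuracy(c_num, i_num):
    """Acc = Cnum / (Cnum + Inum)  (formula (13))."""
    return c_num / (c_num + i_num)

def fitness(acc, n_selected, n_total, w1=0.999, w2=0.001):
    """Performance = w1 * Acc + w2 * (N - n) / N  (formula (12)):
    accuracy dominates; a smaller subset adds a tiny bonus."""
    return w1 * acc + w2 * (n_total - n_selected) / n_total

# Hypothetical run: 92 of 100 test samples correct, 50 of 2000 features kept.
acc = accuracy(92, 8)                    # 0.92
print(round(fitness(acc, 50, 2000), 6))  # 0.920055
```

With w1 = 0.999 and w2 = 0.001, even a drastic reduction in subset size changes the score by at most 0.001, so accuracy remains the deciding factor.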
Experiments
Data description
Eight well-known biomedical datasets (Table 2) were used to evaluate the performance of the FSBRR algorithm. These datasets cover eight disease diagnosis tasks, with dimensions ranging from 319 to 21548. The first three datasets were taken from the Kent Ridge Biomedical repository [36]. p53 Mutants and Arcene were taken from the UCI repository [37]. Breast invasive carcinoma (BRCA), Glioblastoma multiforme (GBM), and the tumour sequencing project (TSP) were taken from TCGA [38].
Experimental design
To evaluate the performance of the FSBRR algorithm, the following experiments were designed and conducted under the same conditions: the eight high dimensional biomedical datasets were compared and analyzed with FSBRR, Relief (a filter method based on nearest neighbor distance) [39], maximum relevance and minimum redundancy (mRmR) [40], and a genetic algorithm (GA) [41], respectively. Here, "the same conditions" has two meanings: (1) Random forest (RF, numTrees = 10), K-nearest neighbor (KNN, k = 1), and Support Vector Machine (SVM, linear kernel) were adopted as classifiers to evaluate classification performance. (2) In the FSBRR algorithm, the parameter τ was set to 0 to avoid losing weakly correlated features in the absence of prior knowledge. In addition, after adaptive testing, the values of δ and α were set to 0.08 and 0.64, respectively.
To obtain an unbiased experimental result, 10-fold cross validation was adopted to evaluate classification performance. Each dataset was stratified into 10 folds, of which 9 folds were used as training samples and the remaining fold as the testing sample. Moreover, to obtain a statistically meaningful result, each experiment was executed 100 times independently, meaning the classification task was executed 1000 times in total, and the average value was taken as the final result. The experiments were implemented in Matlab 2017a; the hardware and software configuration is shown in Table 3.
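The stratified splitting step can be sketched as follows (a minimal illustration of one 10-fold repetition; in the paper this procedure is repeated 100 times with a classifier trained on each training split, which is omitted here):

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds so that every fold
    keeps roughly the overall class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds

# One repetition: each fold serves as the test set once, the rest train.
labels = [0] * 60 + [1] * 40
folds = stratified_folds(labels, k=10)
for test in folds:
    train = [i for f in folds if f is not test for i in f]
    assert len(test) == 10 and len(train) == 90
print([sum(labels[i] for i in f) for f in folds])  # 4 positives per fold
```

Because the 60/40 class split divides evenly by 10, every fold here contains exactly 6 negatives and 4 positives.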
Results
For the eight datasets, we conducted the experiments described in the section above. Six major statistical indicators were compared and analyzed, and the results are shown in Table 4.
From Table 4, we can observe the following: (1) The FSBRR algorithm obtained the highest Mean among the four feature selection algorithms for twenty-one out of twenty-four experimental results. The highest Means were 92.01%, 80.17%, 82.99%, 94.31%, 85.67%, 86.26%, 78.63%, 91.91%, 74.24%, 83.68%, 85.67%, 84.42%, 77.69%, 87.68%, 73.18%, 78.91%, 86.25%, 80.61%, 82.02%, 80.09%, and 72.35%, respectively. Meanwhile, the maximum Mean improvement of FSBRR over the full feature set was 19.84%. (2) The RF-based GA algorithm on the GBM dataset, and the KNN-based GA algorithm on the p53 Mutants and GBM datasets, obtained the highest Means of 82.03%, 89.90%, and 81.39%, respectively. However, the Std, MeanFN and RT of GA were significantly higher than those of FSBRR in these three experiments. (3) The Std obtained by FSBRR was smaller than that of the other three algorithms for eighteen out of twenty-four experimental results. (4) All four feature selection algorithms can effectively reduce the feature dimension, but the dimension reduction of the FSBRR algorithm was the most pronounced. Among FSBRR, Relief, mRmR and GA, only GA is a wrapper feature selection algorithm, so its feature subset sizes differed across the RF, KNN and SVM classifiers. (5) In most experiments, except for the running time index, RF performed significantly better than KNN and SVM on the same dataset. These results indicate that, for a specified dataset, a matching classification (or learning) algorithm must be found to obtain the best results.
Fig 1 was obtained by statistical analysis of Table 4. It shows four average attribute values (avg(Mean), avg(Std), avg(MeanFN), and avg(RT)) for the four feature selection algorithms. From Fig 1 we can observe that the performance, stability, optimal-subset feature number, and time complexity of FSBRR were superior to those of the other three algorithms. We believe the reasons for this result are: (1) The FSBRR algorithm uses Ri,j to explore the horizontal correlation between features, while the vertical correlation between features and classes is explored by Ri,c; based on basic relevance theory, the horizontal and vertical relevance are effectively combined. (2) The FSBRR algorithm removes not only the irrelevant features but also the approximately redundant features.
Discussion and analysis
To verify the effect of parameters δ and α on the performance of the FSBRR algorithm, the classification accuracy on the 8 datasets was tested with RF, KNN, and SVM, respectively. In the first experiment, with the parameter α = 0.6 held constant, the parameter δ was increased from 0 to 0.2 in steps of 0.01. In the second experiment, with the parameter δ = 0.1 held constant, the parameter α was increased from 0.5 to 0.7 in steps of 0.01. The other experimental procedures are described in the experimental design section. Moreover, to facilitate discussion and analysis, only classification accuracy is considered here. The results of the two experiments are shown in Figs 2 and 3. Statistical analysis of these results reveals that, when the highest classification accuracy was obtained for the eight datasets, the parameter values differed but their ranges overlapped. By selecting the results in the top 20% of classification accuracy, we can clearly observe the salient overlap of the parameter ranges, shown in Fig 4. The optimum range of parameter δ is [0.05, 0.13], and the optimum range of parameter α is [0.60, 0.66]; see the dotted-line marker ranges in Fig 4.
Each original dataset may have one or more approximately optimal feature subsets, and the purpose of feature selection is to find one of them. However, the dimensions of the optimal subsets differ across datasets, and the distributions of the correlation values (Ri,j and Ri,c) vary, which leads to differences in the parameter values (δ or α). According to the statistical analysis of the above two experiments, the parameters δ and α should be selected within the ranges [0.05, 0.13] and [0.60, 0.66], respectively, because the classification accuracy is likely to reach its highest value within these ranges.
For further analysis and interpretation of the proposed algorithm, the Breast Cancer Wisconsin (Diagnostic) dataset [42] from UCI is used in this subsection. This dataset contains two classes, malignant and benign, and is composed of 569 instances, each with 32 attributes. For the FSBRR algorithm, the relationship between relevance and redundant features was analyzed using the RF algorithm. The other experimental procedures are described in the experimental design section. As before, only classification accuracy is considered here.
The main results of this experiment are listed in Table 5. Two attributes in Table 5 need explanation: (1) "Feature set": for example, {5} is the subscript of the feature, i.e. the fifth feature. (2) "Change" is the accuracy change relative to the current feature set. From Table 5 we can observe the following: (1) With {5} as the current feature set, the accuracy increases after adding feature 7 (see the third row); by contrast, the accuracy decreases after adding feature 21 (see the fourth row). These results verify redundancy Criterion 1 (when Ri,j is large, whether Fi is redundant is uncertain); in this case, the approximate redundancy feature definition is needed to determine whether the current feature is redundant. (2) With {5, 7} as the current feature set, the accuracy increases after adding feature 18 (see the fifth row); with {8, 3} as the current feature set, the accuracy increases after adding feature 30 (see the seventh row). These results verify redundancy Criterion 2 (when Ri,j is small, the feature Fi is not redundant regardless of the size of Ri,c).
Conclusions
In this paper, the relationship between two kinds of correlation (Ri,c and Ri,j) is established, which effectively combines the correlation between features and classes with the correlation between features to eliminate redundant features. Because determining completely redundant features is difficult to realize in practice, we first analyze four kinds of boundary conditions between Ri,c and Ri,j, and then propose redundancy criteria. On this basis, approximate redundancy features are defined in this study. Finally, we propose a new feature selection algorithm based on approximate redundancy removal (FSBRR) for high dimensional biomedical data classification.
To verify the effectiveness of the FSBRR algorithm, three classification algorithms (RF, KNN and SVM) are used to compare FSBRR with three typical feature selection algorithms on eight high dimensional biomedical datasets. The experimental results show that the FSBRR algorithm can effectively remove redundant features and improve classification performance. Additionally, we designed a set of comparative experiments to discuss and analyze the effects of the parameters δ and α on the performance of the FSBRR algorithm.
References
- 1. Tan MK, Tsang IW, Wang L. Minimax sparse logistic regression for very high-dimensional feature selection. IEEE Transactions on Neural Networks and Learning Systems. 2013; 24(10): 1609–1622. pmid:24808598
- 2. Tamaresis JS, Irwin JC, Goldfien G, Jossph T, Rabban RO, Gamran N, et al. Molecular classification of endometriosis and disease stage using high-dimensional genomic data. Endocrinology. 2014; 155(12): 4986–4999. pmid:25243856
- 3. Lee K, Man Z, Wang D, Cao Z. Classification of bioinformatics dataset using finite impulse response extreme learning machine for cancer diagnosis. Neural Computing and Applications.2013; 22(3): 457–468.
- 4. Jiang H, Hu B, Liu Z, Yan L, Wang T, Liu F, et al. Investigation of different speech types and emotions for detecting depression using different classifiers. Speech Communication. 2017; 90(6): 39–46.
- 5. Li XW, Zhuang J, Hu B, Jing Z, Ning Z, Mi L, et al. A resting-state brain functional network study in MDD based on minimum spanning tree analysis and the hierarchical clustering. Complexity. 2017; 22(3): 1–11.
- 6. Chung IF, Chen YC, Pal N. Feature selection with controlled redundancy in a fuzzy rule based framework. IEEE Transactions on Fuzzy Systems. 2018; 26(2): 734–748.
- 7. Avci E. Selecting of the optimal feature subset and kernel parameters in digital modulation classification by using hybrid genetic algorithm-support vector machines: HGASVM. Expert Systems with Applications. 2009; 36(2): 1391–1402.
- 8. Liu H, Yu L. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering. 2005; 17(4): 491–502.
- 9. Li XT, Yin MH. Multi objective binary biogeography based optimization based feature selection for gene expression data. IEEE Transactions on NanoBioscience. 2013; 12 (4): 343–353. pmid:25003163
- 10. Zawbaa HM, Emary E, Grosan C. Feature selection via chaotic antlion optimization. Plos One. 2016; 11(3): e0150652. pmid:26963715
- 11. Liu B, Tian M, Zhang C, Li X. Discrete biogeography based optimization for feature selection in molecular signatures. Molecular Informatics. 2015; 34(4): 197–215. pmid:27490166
- 12. Li XT, Li MJ, Yin MH. Multiobjective Ranking Binary Artificial Bee Colony for Gene Selection Problems Using Microarray Datasets. IEEE/CAA Journal of Automatica Sinica. 2016;
- 13. Vergara JR, Estévez PA. A review of feature selection methods based on mutual information. Neural Computing and Applications. 2014; 24(1): 175–186.
- 14. Xue B, Zhang MJ, Browne WN, Yao X. A survey on evolutionary computation approaches to feature selection. IEEE Transactions on Evolutionary Computation. 2016; 20(4): 606–626.
- 15. Zhang BT, Lei T, Liu H, Cai HS. EEG-based automatic sleep staging using ontology and weighting feature analysis. Computational and Mathematical Methods in Medicine. 2018; 2018: 6534041. pmid:30254690
- 16. Li XT, Wong KC. A comparative study for identifying the chromosome-wide spatial clusters from high-throughput chromatin conformation capture data. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2018; 15(3):774–787. pmid:28333638
- 17. Taskin G, Kaya H, Bruzzone L. Feature selection based on high dimensional model representation for hyperspectral images. IEEE Transactions on Image Processing. 2017; 26(6): 2918–2928. pmid:28358688
- 18. Li XT, Ma SJ, Wong KC. Evolving spatial clusters of genomic regions from high-throughput chromatin conformation capture data. IEEE Transactions on NanoBioscience. 2017; 16(6): 400–407. pmid:28708563
- 19. Chen Q, Zhang MJ, Xue B. Feature selection to improve generalisation of genetic programming for high-dimensional symbolic regression. IEEE Transactions on Evolutionary Computation. 2017; 21(5): 792–806.
- 20. Augusto D, Sofia M, Christine DM, Alessandro V, Francesca O. Feature selection for high-dimensional data. Computational Management Science. 2009; 6(1): 25–40.
- 21. Zhang Y. A modified artificial bee colony algorithm-based feature selection for the classification of high-dimensional data. Journal of Computational and Theoretical Nanoscience. 2016; 13(7): 4088–4095.
- 22. Mafarja M, Mirjalili S. Whale optimization approaches for wrapper feature selection. Applied Soft Computing. 2018; 62(1): 441–453.
- 23. Kohavi R, John GH. Wrappers for feature subset selection. Artificial Intelligence. 1997; 97(1): 273–324.
- 24. Chrysostomou K, Chen SY, Liu X. Combining multiple classifiers for wrapper feature selection. International Journal of Data Mining Modelling and Management. 2017; 1(1): 91–102.
- 25. Zhang D, Chen S, Zhou ZH. Constraint score: A new filter method for feature selection with pairwise constraints. Pattern Recognition. 2008; 41(5): 1440–1451.
- 26. Hancer E, Xue B, Zhang MJ. Differential evolution for filter feature selection based on information theory and feature ranking. Knowledge-Based Systems. 2018; 140(10): 103–119.
- 27. Lei T, Jia XH, Zhang YN, He LF, Meng HY, Nandi AK. Significantly fast and robust fuzzy c-means clustering algorithm based on morphological reconstruction and membership filtering. IEEE Transactions on Fuzzy Systems. 2018; 26(5): 3027–3041.
- 28. Estevez PA, Tesmer M, Perez CA, Zurada JM. Normalized Mutual Information Feature Selection. IEEE Transactions on Neural Networks. 2009; 20(2):189–201. pmid:19150792
- 29. Zhang C, Ni ZW, Ni LP, Tang N. Feature selection method based on multi-fractal dimension and harmony search algorithm and its application. International Journal of Systems Science. 2016; 47(14): 3476–3486.
- 30. Maji P. A rough hypercuboid approach for feature selection in approximation spaces. IEEE Transactions on Knowledge and Data Engineering. 2014; 26(1): 16–29.
- 31. Ververidis D, Kotropoulos C. Information loss of the Mahalanobis distance in high dimensions: application to feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2009; 31(12): 2275–2281. pmid:19834146
- 32. Brown G, Pocock A, Zhao MJ, Luján M. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. Journal of Machine Learning Research. 2012; 13(1): 27–66.
- 33. John GH, Kohavi R, Pfleger K. Irrelevant features and the subset selection problem. Machine Learning Proceedings. 1994; 1994(7): 121–129.
- 34. Hu B, Dai YQ, Su Y, Moore P, Zhang XY, Mao CS, et al. Feature selection for optimized high-dimensional biomedical data using the improved shuffled frog leaping algorithm. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2018; 15(6): 1765–1773. pmid:28113635
- 35. Chuang YL, Chang HW, Tu CJ, Yang CH. Improved binary PSO for feature selection using gene expression data. Computational Biology and Chemistry. 2008; 32(1): 29–38. pmid:18023261
- 36. Li J, Liu H. Kent Ridge Biomedical Data Set Repository. School of Computer Engineering, Nanyang Technological University, Singapore. 2004; Available from: http://datam.i2r.astar.edu.sg/datasets/krbd/index.html.
- 37. UCI. 2018; Available from: http://archive.ics.uci.edu/ml/.
- 38. TCGA. 2019; Available from: http://tcga-data.nci.nih.gov.
- 39. Kononenko I. Estimating attributes: analysis and extensions of RELIEF. European Conference on Machine Learning. Springer-Verlag New York; 1994: 171–182. https://doi.org/10.1007/3-540-57868-4_57
- 40. Peng H, Long F, Ding C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005; 27(8): 1226–1238. pmid:16119262
- 41. Deb K, Pratap A, Agarwal S, Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation. 2002; 6(2): 182–197.
- 42. Mangasarian OL, Street WN, Wolberg WH. Breast cancer diagnosis and prognosis via linear programming. IEEE Computational Science and Engineering. 1995; 43(4): 570–577.