Classification of high dimensional biomedical data based on feature selection using redundant removal

High dimensional biomedical data contain tens of thousands of features, and accurate and effective identification of the core features in these data can assist the diagnosis of related diseases. However, biomedical data often contain a large number of irrelevant or redundant features, which seriously affect subsequent classification accuracy and machine learning efficiency. To solve this problem, this paper proposes a novel filter feature selection algorithm based on redundant removal (FSBRR) to classify high dimensional biomedical data. First, two redundancy criteria are determined by vertical relevance (the relationship between a feature and the class attribute) and horizontal relevance (the relationship between one feature and another). Second, to quantify the redundancy criteria, an approximate redundancy feature framework based on mutual information (MI) is defined to remove redundant and irrelevant features. To evaluate the effectiveness of the proposed algorithm, controlled trials against typical feature selection algorithms are conducted using three different classifiers, and the experimental results indicate that the FSBRR algorithm can effectively reduce the feature dimension and improve the classification accuracy. In addition, an experiment on a small sample dataset is designed and conducted in the discussion and analysis section to clarify the specific implementation process of the FSBRR algorithm.


Introduction
The analysis of high dimensional disease data [1][2] is a very important research field, especially for cancer [3] and mental disease (e.g. depression [4][5]). It is unrealistic to cure these diseases completely, so early diagnosis or prevention plays an important role in their treatment. However, high dimensional biomedical data usually contain a large number of weakly relevant or irrelevant features. If all the features are treated equally, the time complexity, space complexity and accuracy of the prediction can be seriously affected. Therefore, feature selection is considered an essential step in the diagnosis of related diseases using high dimensional biomedical data.
As stated in [6], feature selection is also referred to as feature subset selection. The main purpose of feature selection is to remove irrelevant and redundant features in the classification process while retaining the most valuable information of the original data. In other words, the objective of feature selection is to select an optimal feature subset [7] from the original feature set, which lays a good foundation for subsequent classification or learning work. As an important part of knowledge discovery technology, feature selection [8] can effectively improve the computing speed of the subsequent prediction algorithm, enhance the compactness of the prediction model, increase the generalization ability of the corresponding model, and avoid overfitting.
For these reasons, feature selection has always been a hot research topic, and new achievements are constantly emerging. For example, in [9], a feature selection method based on multi-objective binary biogeography based optimization (MOBBBO) is proposed for gene selection, which combines the non-dominated sorting method and the crowding distance method into the binary biogeography based optimization (BBBO) framework. In [10], a novel feature selection method via chaotic optimization is developed to balance exploration of the search space against exploitation of the best solutions. In [11], Liu et al. used a discrete biogeography based optimization (DBBO) method, which integrates a discrete migration model and a discrete mutation model, for feature selection in molecular signatures. In [12], Li et al. proposed a feature selection algorithm based on the multi-objective ranking binary artificial bee colony to select the optimal subset from the original high dimensional data while retaining a subset that satisfies the defined objective. In addition, recent advances in feature selection can be found in [13][14]; [14] reviews the latest research on evolutionary computation (EC) for feature selection and identifies the contributions of the different algorithms.
To better introduce the research status of feature selection for high-throughput or high-dimensional data [1,15], the following is an overview of some representative studies in recent years. Tan et al. [1] proposed a new minimax sparse LR model for very high-dimensional feature selection, which can be efficiently solved by a cutting plane algorithm. To effectively identify chromosome-wide spatial clusters from high-throughput chromatin conformation capture data, a population based optimization algorithm that coordinates and guides the non-negative matrix factorization toward global optima was proposed in [16]. In [17], a novel feature selection method based on high dimensional model representation (HDMR) was proposed to solve the hyper-spectral image classification problem. The core idea of this method is to rank the global sensitivity indices calculated via the HDMR to find the most relevant features. To explore and identify small clusters of spatially proximal genomic regions, Li et al. [18] proposed evolutionary computation methods to evolve and confirm functionally related genomic regions. Chen et al. [19] proposed a feature selection algorithm named genetic programming (GP) with permutation importance (GPPI) to select features for high-dimensional symbolic regression (SR) using GP. Based on two typical applications, microarray analysis and target detection, Augusto et al. [20] discussed feature selection for high-dimensional spatial data. To solve the problem of high-dimensional data classification, Zhang [21] proposed an improved artificial bee colony (ABC) algorithm to select the optimal feature subset. To improve the convergence of ABC, the modified algorithm (named OGR-ABC) introduces three strategies: opposite initialization, global optimum based search equations, and a ranking based selection mechanism.
The above research shows that the feature selection process mainly includes two steps [22]: search strategy and evaluation criterion. Depending on whether the classifier itself is used as the feature evaluation index, the evaluation criterion can be categorized into the wrapper method and the filter method. The wrapper method [23,24] evaluates the quality of a candidate feature subset with a fixed classification algorithm, and the corresponding classification accuracy is adopted as the index for selecting the optimal feature subset. The feature subset selected by the wrapper method is therefore not universal: the feature selection process must be executed again when the classification algorithm is changed. As a result, its time complexity is high, and the execution time may be long, especially for high dimensional data. In the filter method [25][26][27], the search of the feature space depends on the intrinsic correlation of the data itself rather than on the classification algorithm. The filter method is increasingly attractive because of its simplicity and speed, and is therefore more popular than the wrapper method.
Four intrinsic correlation metrics are often adopted by the filter method to evaluate a feature subset: MI [28], fractal dimension [29], dependency degree [30] and distance [31]. Among them, MI is considered the most acceptable criterion because of two major advantages [32]: (1) it can measure different kinds of relationships between random variables, including nonlinear ones; (2) it remains stable under invertible transformations of the high dimensional feature space.
Based on the above analysis, this paper proposes a filter feature selection method based on redundant removal for high dimensional biomedical data. First, we analyze the four boundary extremes of the correlation between feature and target class and the correlation between feature and feature. Based on this, two redundancy criteria are proposed. Then, to quantify the redundancy criteria, the core module based on MI is proposed: the definition of the approximate redundancy feature. Finally, the algorithm implementation is given.

Mathematical symbols
Many mathematical symbols are used in this study. To improve readability, these symbols and their abbreviations are listed below.
P: a probability measure.
R: the relevance between any two variables.
$R_{i,j}$: the relevance between any pair of features $F_i$ and $F_j$, $i \neq j$.
$R_{i,c}$: the relevance between any feature $F_i$ and the class attribute C.

Basic concepts
To lay a basis for further investigation, three basic concepts [33] used in this study (strongly relevant feature, weakly relevant feature, and irrelevant feature) are listed as follows.
Strong relevance: $F_i$ is a strongly relevant feature iff there exists $P(F_i, A_i) > 0$ such that …
Weak relevance: $F_i$ is a weakly relevant feature iff it is not strongly relevant (i.e. …
Irrelevance: $F_i$ is an irrelevant feature iff it is neither strongly relevant nor weakly relevant, …
Strong relevance indicates that the feature is very important for improving classification accuracy, so it cannot be arbitrarily removed. Weak relevance indicates that the feature can sometimes contribute to improving prediction accuracy. Irrelevance indicates that the feature is useless for improving classification accuracy, so it can be directly deleted.

Determination of redundancy criterion
In actual calculation, it is difficult to determine whether any pair of features is completely correlated, and hence whether there is redundancy among features. To address this, a redundancy criterion based on correlation is proposed in this study to lay the foundation for further feature selection. Based on the three basic concepts, the redundancy of feature $F_i$ is analyzed under different extreme values of $R_{i,c}$ and $R_{i,j}$. The different cases of extreme values are shown in Table 1.
The following four conclusions can be drawn from the four cases:
Conclusion 1: $R_{i,c}$ is large, which means that $F_i$ contains more information about C. $R_{i,j}$ is large, which means that the correlation between $F_i$ and $F_j$ is strong. If $R_{i,j} = 1$, then $F_i$ and $F_j$ are completely correlated, hence $F_i$ is redundant. If $R_{i,j} \neq 1$, it is difficult to determine whether the feature $F_i$ is redundant.
Conclusion 2: $R_{i,j}$ is small, which means that the correlation between $F_i$ and $F_j$ is weak. Hence $F_j$ cannot replace $F_i$. In other words, regardless of the size of $R_{i,c}$, the feature $F_i$ is not redundant.
Conclusion 3: $R_{i,c}$ is small, which means that $F_i$ contains less information about C. $R_{i,j}$ is large, which means that the correlation between $F_i$ and $F_j$ is strong. In this case, the feature $F_i$ is redundant with higher probability, and this probability increases as $R_{i,j}$ increases.
Conclusion 4: $R_{i,j}$ is small, which means that the correlation between $F_i$ and $F_j$ is weak. This is consistent with Conclusion 2: regardless of the size of $R_{i,c}$, the feature $F_i$ is not redundant.
Based on the above four conclusions, two redundancy criteria can be obtained:
Criterion 1: when $R_{i,j}$ is large, whether $F_i$ is redundant is uncertain.
Criterion 2: when $R_{i,j}$ is small, the feature $F_i$ is not redundant regardless of the size of $R_{i,c}$.

Definition approximate redundancy feature
Based on the two redundancy criteria in the previous section, the approximate redundancy feature is proposed and defined in this section.
Assume that the $R_{i,c}$ of the feature $F_i$ is very close to $R_{max}$, which indicates that $F_i$ contains a lot of information about the class attribute C. In this condition, $F_i$ can be considered an approximate redundancy feature only if the value of $R_{i,j}$ is large enough. Otherwise, it cannot be considered a redundancy feature. The reason is that $F_i$ plays an important role in improving the accuracy of classification and should not be easily removed as redundant. By contrast, assume that the $R_{i,c}$ of feature $F_i$ is not very close to $R_{max}$, which indicates that $F_i$ contains relatively less information about the class attribute C. In this condition, $F_i$ is considered an approximate redundancy feature as long as the value of $R_{i,j}$ is relatively large. The reason is that $F_i$ does not play a main role in improving the accuracy of classification. Based on the above analysis and discussion, the approximate redundancy feature is formally described in Definition 1.
Definition 1 (approximate redundancy feature): Let $F_i$ and $F_j$ be any pair of correlated features with $R_{j,c} \geq R_{i,c}$.
(1) $F_i$ is an approximate redundancy feature iff $R_{max} - R_{j,c} \leq \delta$, $0.05 \leq \delta \leq 0.13$, such that …
Definition 1 indicates that $F_j$ can serve as an approximate alternative for $F_i$.

Correlation calculation
In general, correlation measures can be linear or nonlinear. A nonlinear measure based on MI is adopted in this study, because high dimensional biomedical data in the real world usually exhibit nonlinear relationships. The correlation between any pair of variables (X, Y) can be calculated with formula (6) or (7).
$$IG(X;Y) = H(X) - H(X|Y) \qquad (6)$$
$$IG(X;Y) = H(X) + H(Y) - H(X,Y) \qquad (7)$$
where H(X), H(X|Y) and H(X,Y) can be calculated on the basis of formulas (8), (9) and (10).
$$H(X) = -\sum_{x} P(x) \log P(x) \qquad (8)$$
$$H(X|Y) = -\sum_{y} P(y) \sum_{x} P(x|y) \log P(x|y) \qquad (9)$$
$$H(X,Y) = -\sum_{x} \sum_{y} P(x,y) \log P(x,y) \qquad (10)$$
According to the above three formulas, the value of H(X|Y) or H(X,Y) is smaller when Y contains more information about X. In other words, a greater value of IG(X;Y) means that X and Y are more relevant.
To unify the scale of the data and to reduce the effect of extreme values, each IG(X;Y) is normalized to the range [0, 1] using formula (11).
$$R_{X,Y} = \frac{2\, IG(X;Y)}{H(X) + H(Y)} \qquad (11)$$
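The correlation calculation can be illustrated with a minimal Python sketch for discrete (already discretized) variables. It uses formula (7) for IG(X;Y) and the normalization 2·IG/(H(X)+H(Y)) that appears in Algorithm 1. The function names, the choice of base-2 logarithms, and the convention of returning 0 when both variables are constant are assumptions made for illustration; the normalized value itself does not depend on the logarithm base.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H(X), in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def joint_entropy(xs, ys):
    """Joint entropy H(X,Y), computed as the entropy of the paired values."""
    return entropy(list(zip(xs, ys)))


def information_gain(xs, ys):
    """IG(X;Y) = H(X) + H(Y) - H(X,Y), following formula (7)."""
    return entropy(xs) + entropy(ys) - joint_entropy(xs, ys)


def normalized_relevance(xs, ys):
    """Relevance R = 2 * IG(X;Y) / (H(X) + H(Y)), which lies in [0, 1]."""
    denominator = entropy(xs) + entropy(ys)
    if denominator == 0:
        # Assumption: two constant variables carry no information.
        return 0.0
    return 2.0 * information_gain(xs, ys) / denominator


x = [0, 0, 1, 1]
print(normalized_relevance(x, [0, 0, 1, 1]))  # identical variables: 1.0
print(normalized_relevance(x, [0, 1, 0, 1]))  # independent variables: 0.0
```

Continuous features, such as gene expression levels, would need to be discretized before these counts are meaningful; the discretization scheme is not specified here.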

Algorithm implementation
Based on the definition of the approximate redundancy feature and the correlation calculation method, a feature selection algorithm for high dimension biomedical data classification based on redundant removal (FSBRR) is given in Algorithm 1.
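The general shape of a redundancy-removal filter of this kind can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not Algorithm 1 itself: the function names, the dictionary layout of the features, the rule that features with relevance at or below a threshold `tau` are dropped as irrelevant, and the greedy ranking by relevance are all illustrative choices. The approximate-redundancy test is passed in as a callable so that Definition 1, with its δ and α thresholds, could be substituted; the placeholder `fcbf_style_rule` used here is the simpler rule from the FCBF filter and does not use $R_{max}$.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def relevance(xs, ys):
    """Normalized MI: 2 * IG(X;Y) / (H(X) + H(Y)), IG = H(X) + H(Y) - H(X,Y)."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # assumption: constant variables are treated as unrelated
    return 2.0 * (hx + hy - entropy(list(zip(xs, ys)))) / (hx + hy)


def redundancy_filter(features, target, is_redundant, tau=0.0):
    """Greedy filter: drop irrelevant features, then approximately redundant ones.

    features: dict mapping a feature name to its list of discrete values.
    is_redundant(r_ic, r_jc, r_ij, r_max) -> bool decides whether F_i can be
    replaced by an already selected, at least as relevant feature F_j.
    """
    r_c = {name: relevance(column, target) for name, column in features.items()}
    # Assumption: a feature with R_{i,c} <= tau is treated as irrelevant.
    ranked = sorted((n for n in r_c if r_c[n] > tau),
                    key=lambda n: r_c[n], reverse=True)
    if not ranked:
        return []
    r_max = r_c[ranked[0]]
    selected = []
    for name in ranked:
        replaceable = any(
            is_redundant(r_c[name], r_c[kept],
                         relevance(features[name], features[kept]), r_max)
            for kept in selected
        )
        if not replaceable:
            selected.append(name)
    return selected


def fcbf_style_rule(r_ic, r_jc, r_ij, r_max):
    """Placeholder rule (FCBF-style, NOT Definition 1): redundant if R_ij >= R_ic."""
    return r_ij >= r_ic


target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f1": [0, 0, 0, 1, 1, 1, 1, 1],       # relevant
    "f1_copy": [0, 0, 0, 1, 1, 1, 1, 1],  # exact duplicate of f1
    "f2": [1, 0, 0, 0, 1, 1, 1, 1],       # relevant, weakly tied to f1
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],    # independent of the target
}
selected = redundancy_filter(features, target, fcbf_style_rule)
print(sorted(selected))  # keeps f1 and f2; drops the duplicate and the noise
```

Each candidate is compared only with selected features whose relevance to the class is at least as high, which mirrors the precondition that the replacing feature $F_j$ is no less relevant than $F_i$.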
…
% to select current first feature in each cycle, i.e. first variable
for j = i+1 : size(I,2)
    F_j = F'(:,j);   % to select next feature (or variable)
    R_{j,c} = 2 * IG(C;F_j) / (H(C) + H(F_j))
    …

The time consumption of the FSBRR algorithm is mainly spent on calculating $R_{i,c}$ and $R_{i,j}$, so its atomic operation is the calculation of $R_{i,c}$ and $R_{i,j}$. Assuming that a dataset contains n features …

Performance evaluation function
In this study, classification accuracy and the number of selected features are used to design the performance evaluation function [34][35], which is shown in formula (12).
n is the number of selected features and N is the total number of features. $w_1$ and $w_2$ are predefined weight coefficients, which adjust the importance of the two indicators in the performance evaluation function. In this study, the values of $w_1$ and $w_2$ are set to 0.999 and 0.001 respectively. There are three main reasons for this setting: (1) Given that the data dimensionality is reduced, this study mainly focuses on classification accuracy as the metric of a feature selection algorithm. (2) If the number of selected features is significantly reduced but the classification accuracy is not improved, the dimensionality reduction of high-dimensional data loses its application value. (3) A performance evaluation function with a high weight coefficient for classification accuracy and a low weight coefficient for the number of selected features has been recognized and widely applied in many feature selection studies, such as [34].
In addition, Acc is the classification accuracy as defined in formula (13).
$$Acc = \frac{C_{num}}{C_{num} + I_{num}} \qquad (13)$$
$C_{num}$ and $I_{num}$ are the numbers of correct and incorrect classification labels respectively.

Data description
Eight well-known biomedical datasets (Table 2) were used to evaluate the performance of the FSBRR algorithm. These datasets cover eight kinds of disease diagnosis data. The data dimension ranged from 319 to 21548. The first three datasets were taken from the Kent Ridge Biomedical [36]. p53 Mutants and Arcene were taken from the UCI dataset [37]. Breast invasive carcinoma (BRCA), Glioblastoma multiforme (GBM), and tumour sequencing project (TSP) were taken from the TCGA [38].

Experimental design
To evaluate the performance of the FSBRR algorithm, we designed and conducted the following experiments under the same conditions: the eight high dimensional biomedical datasets were analyzed with FSBRR, Relief (a filter method based on the nearest neighbor distance) [39], maximum relevance and minimum redundancy (mRmR) [40] and genetic algorithm (GA) [41], and the results were compared. Here, the same conditions have two meanings: (1) Random forest (RF, numTrees = 10), K-nearest neighbor (KNN, k = 1), and Support Vector Machine (SVM, linear kernel) were adopted as classifiers to evaluate classification performance. (2) In the FSBRR algorithm, the parameter τ is set to 0 in order to avoid losing weakly correlated features in the absence of prior knowledge. In addition, after adaptive testing, the values of δ and α were set to 0.08 and 0.64 respectively.
To obtain an unbiased experimental result, 10-fold cross validation was adopted to evaluate the classification performance. Each dataset was stratified into 10 folds, of which 9 folds were used as the training sample and the remaining fold was used as the testing sample. Moreover, to get a statistically meaningful result, each experiment was executed 100 times independently. This means that the classification task was executed 1000 times in total, and the average value was taken as the final result. The above experiments were implemented in Matlab 2017a. The experimental hardware and software configuration is shown in Table 3.
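The evaluation protocol (stratified 10-fold cross validation repeated 100 times, giving 1000 classification runs whose accuracies are averaged) can be sketched in Python as follows. This is an illustrative sketch and not the Matlab code used in the experiments: the function names, the round-robin stratification, the simple 1-nearest-neighbour classifier with squared Euclidean distance, and the toy data are all assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Split sample indices into k folds with roughly equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)
    return folds


def one_nn_predict(train_x, train_y, sample):
    """Label of the nearest training sample (squared Euclidean distance)."""
    nearest = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], sample)),
    )
    return train_y[nearest]


def repeated_cv_accuracy(xs, ys, k=10, repeats=100, seed=0):
    """Return (mean accuracy, number of fold evaluations) over repeated CV."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        for test_idx in stratified_folds(ys, k, rng):
            if not test_idx:
                continue  # fewer samples than folds
            held_out = set(test_idx)
            train_idx = [i for i in range(len(ys)) if i not in held_out]
            train_x = [xs[i] for i in train_idx]
            train_y = [ys[i] for i in train_idx]
            correct = sum(
                one_nn_predict(train_x, train_y, xs[i]) == ys[i] for i in test_idx
            )
            accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies), len(accuracies)


# Toy data (not from the paper): two well separated classes of 10 samples each.
xs = [[0.1 * i, 0.0] for i in range(10)] + [[10.0 + 0.1 * i, 0.0] for i in range(10)]
ys = [0] * 10 + [1] * 10
mean_accuracy, runs = repeated_cv_accuracy(xs, ys)
print(mean_accuracy, runs)  # 1.0 1000
```

Fixing the random seed makes the 100 repetitions reproducible while still using a different fold assignment in each repetition.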

Results
For the eight datasets, we conducted the experiments described in the section above. Six major statistical indicators were compared and analyzed, and the results are shown in Table 4.
From Table 4, we can observe the following aspects:
(1) …
(2) The RF-based GA algorithm on the GBM dataset, and the KNN-based GA algorithm on the p53 Mutants and GBM datasets, obtained the highest Mean of 82.03%, 89.90%, and 81.39%, respectively. However, the Std, MeanFN and RT of GA were significantly higher than those of FSBRR in these three experiments.
(3) For eighteen out of twenty-four experimental results, the Std obtained by FSBRR was smaller than that of the other three algorithms.
(4) All four feature selection algorithms can effectively reduce the feature dimension, but the dimension reduction of the FSBRR algorithm was the most pronounced. Among FSBRR, Relief, mRmR and GA, only GA is a wrapper feature selection algorithm, so the number of features in its selected subsets differed for the RF, KNN and SVM classifiers.
(5) In most experiments, except for the running time index, the performance of RF was significantly better than that of KNN and SVM for the same dataset. Such results indicate that, to get the best experimental results for a specified dataset, a matching classification (or learning) algorithm must be found.
From Fig 1 we can observe that the performance, stability, number of features in the optimal subset, and time complexity of FSBRR were superior to those of the other three algorithms. We believe that the reasons for these results are: (1) The FSBRR algorithm uses $R_{i,j}$ to explore the horizontal correlation between features, and uses $R_{i,c}$ to explore the longitudinal correlation between features and classes. Meanwhile, based on the basic relevance theory, the horizontal relevance and the longitudinal relevance are effectively combined. (2) The FSBRR algorithm removes not only the irrelevant features but also the approximately redundant features.

Discussion and analysis
To verify the effect of the parameters δ and α on the performance of the FSBRR algorithm, the classification accuracy on the 8 datasets was tested with RF, KNN, and SVM, respectively. In the first experiment, the parameter α was kept constant at 0.6 while the parameter δ was increased from 0 to 0.2 with a step length of 0.01. In the second experiment, the parameter δ was kept constant at 0.1 while the parameter α was increased from 0.5 to 0.7 with a step length of 0.01. The other experimental procedures are described in the experimental design section. Moreover, to facilitate discussion and analysis, only the classification accuracy was considered here. The results of the two experiments are shown in Figs 2 and 3. Statistical analysis of the experimental results in Figs 2 and 3 reveals that, although the parameter values at which the highest classification accuracy was obtained differed among the eight datasets, their ranges overlapped. By selecting the results in the top 20% of classification accuracy, we can clearly observe the significant overlap of the parameter ranges, and the results are shown in …
Each original dataset may have one or more approximately optimal feature subsets. The purpose of feature selection is to find one of these optimal subsets. However, the dimensions of the optimal subset differ among the original datasets, and the distributions of the correlation values ($R_{i,j}$ and $R_{i,c}$) vary among datasets, which leads to different parameter values (δ or α). According to the statistical analysis of the above two experimental results, the parameters δ and α should be selected within the ranges [0.05, 0.13] and [0.60, 0.66], respectively, because the classification accuracy may reach its highest value in these ranges.
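The parameter analysis can be illustrated with a short Python sketch: build the sweep grid with a step of 0.01, keep the parameter values whose accuracy falls in the top 20% for each dataset, and intersect the resulting ranges across datasets. The function names are hypothetical, and the accuracy curves below are synthetic stand-ins chosen only to make the example run; they are not results from the experiments.

```python
import math


def parameter_grid(start, stop, step=0.01):
    """Inclusive grid of parameter values, rounded to avoid float drift."""
    count = round((stop - start) / step)
    return [round(start + i * step, 2) for i in range(count + 1)]


def top_fraction_range(accuracy_by_value, fraction=0.2):
    """(low, high) bounds of the parameter values in the top `fraction` by accuracy."""
    ranked = sorted(accuracy_by_value, key=accuracy_by_value.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    best = ranked[:keep]
    return min(best), max(best)


def overlap(ranges):
    """Intersection of several (low, high) ranges, or None if they are disjoint."""
    low = max(r[0] for r in ranges)
    high = min(r[1] for r in ranges)
    return (low, high) if low <= high else None


delta_grid = parameter_grid(0.0, 0.2)  # 21 values: 0.00, 0.01, ..., 0.20

# Synthetic accuracy curves for two hypothetical datasets (illustration only).
dataset_a = {d: 1.0 - (d - 0.09) ** 2 for d in delta_grid}
dataset_b = {d: 1.0 - (d - 0.10) ** 2 for d in delta_grid}

range_a = top_fraction_range(dataset_a)
range_b = top_fraction_range(dataset_b)
print(range_a, range_b)             # (0.07, 0.11) (0.08, 0.12)
print(overlap([range_a, range_b]))  # (0.08, 0.11)
```

The same helpers apply to the α sweep by calling `parameter_grid(0.5, 0.7)`.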
To further analyze and interpret the proposed algorithm, the Breast Cancer Wisconsin (Diagnostic) dataset [42] from UCI was used in this subsection. This dataset contains two classes of data: malignant and benign. It is composed of 569 instances, and each instance contains 32 attributes. For the FSBRR algorithm, the relationship between relevance and redundant features was analyzed using the RF algorithm. The other experimental procedures are described in the experimental design section. In addition, to facilitate discussion and analysis, only the classification accuracy was considered here.
The main results of this experiment are listed in Table 5. Two attributes in Table 5 need to be explained: (1) "Feature set": for example, in {5} the number is the subscript of the feature, i.e. the fifth feature. (2) "Change": the accuracy change relative to the current feature set. From Table 5 we can observe the following points: (1) With {5} as the current feature set, the accuracy increases after adding feature 7 (see the third row); in contrast, the accuracy decreases after adding feature 21 (see the fourth row). These results verify redundancy Criterion 1 (when $R_{i,j}$ is large, whether $F_i$ is redundant is uncertain). In this case, the definition of the approximate redundancy feature is needed to determine whether the current feature is redundant. (2) With {5, 7} as the current feature set, the accuracy increases after adding feature 18 (see the fifth row); with {8, 3} as the current feature set, the accuracy increases after adding feature 30 (see the seventh row). These results verify redundancy Criterion 2 (when $R_{i,j}$ is small, the feature $F_i$ is not redundant regardless of the size of $R_{i,c}$).

Conclusions
In this paper, the relationship between two kinds of correlation ($R_{i,c}$ and $R_{i,j}$) is established, which effectively combines the correlation between features and classes with the correlation between features to eliminate redundant features. Because completely redundant features are difficult to determine in actual operation, we first analyze four kinds of boundary conditions between $R_{i,c}$ and $R_{i,j}$, and then propose redundancy criteria. On this basis, the approximate redundancy feature is defined. Finally, we propose a new feature selection algorithm based on approximate redundancy removal (FSBRR) for high dimensional biomedical data classification.
To verify the effectiveness of the FSBRR algorithm, three classification algorithms (RF, KNN and SVM) are used to compare FSBRR with three typical feature selection algorithms on eight high dimensional biomedical datasets. The experimental results show that the FSBRR algorithm can effectively remove redundant features and improve the classification performance. Additionally, we designed a set of comparative experiments to discuss and analyze the effects of the parameters δ and α on the performance of the FSBRR algorithm.