CBFS: High Performance Feature Selection Algorithm Based on Feature Clearness

Background: The goal of feature selection is to select useful features and simultaneously exclude garbage features from a given dataset for classification purposes. This is expected to reduce processing time and improve classification accuracy.
Methodology: In this study, we devised a new feature selection algorithm (CBFS) based on the clearness of features. Feature clearness expresses the separability among classes in a feature. Highly clear features contribute towards obtaining high classification accuracy. CScore is a measure that scores the clearness of each feature, based on how well samples cluster to the centroids of their classes in that feature. We also suggest combining CBFS with other algorithms to improve classification accuracy.
Conclusions/Significance: From the experiments we confirm that CBFS outperforms up-to-date feature selection algorithms, including FeaLect. CBFS can be applied to microarray gene selection, text categorization, and image classification.


Introduction
The fundamental goal of feature selection is to select useful features and eliminate useless ones in a high-dimensional dataset to improve the performance of learning models by alleviating the effects of dimensionality, enhancing generalization capability, speeding up the learning process and improving model interpretability. Typical application areas of feature selection are gene selection from microarray data and text categorization.
In the machine learning literature there are three general approaches to feature selection: filters, wrappers, and embedded methods [1,2]. Filter methods select the optimal feature subset based solely on the dataset, evaluating each feature with specific statistics and completely independently of the classification algorithm. In contrast, wrapper methods use the algorithm that will build the final classifier to select a feature subset. Compared to filters, they tend to be more computationally expensive but provide superior performance [3], since they operate inside the learning algorithm and are tuned to the needs of the classifier. In the embedded technique, the search for an optimal subset of features is built into the classifier construction and can be seen as a search in the combined space of feature subsets and hypotheses. Just like wrapper approaches, embedded approaches are thus specific to a given learning algorithm.
Filter methods are the most popular approach because they are simple and fast at extracting target features. FSDD [4], Relief [5], and MRMR [6] are up-to-date feature selection algorithms that belong to the filter methods.
FSDD is a distance discriminant method. This algorithm calculates the grade of each feature using a distance matrix. The criterion used for selecting good features is $d_b - \beta d_w$, where $d_b$ is the distance between different classes, $d_w$ is the distance within classes, and $\beta$ is a user-defined value, usually set to 2, that controls the impact of $d_w$. The higher the value of $\beta$, the more weight is placed on the distance within classes.
The Relief algorithm iteratively selects a random instance and identifies its nearest neighbors, one from its own class and others from different classes. In each iteration, an instance x is selected randomly, and its nearest instance is found from the same class (NH) as well as from different classes (NM). The quality estimate of every attribute is then updated to reflect how well the feature distinguishes the instance from its closest neighbors. Because of this iterative procedure, its runtime is slow compared with the other feature selection methods. Its results also differ from run to run because the method selects instances randomly.
MRMR [6] uses mutual information to maximize the mutual Euclidean distance and minimize the pairwise correlations of the features. The minimum redundancy condition is $\min W_I$ with $W_I = \frac{1}{|S|^2} \sum_{i,j \in S} I(i,j)$, where S is the set of selected features and I(i,j) is the mutual information between features i and j.
FeaLect [7] is a very high quality wrapper method. FeaLect proposes an alternative algorithm for feature selection based on the Lasso [8] for building a robust predictor. The Lasso is an L1-regularization technique for linear regression which has attracted much attention in machine learning and statistics. Although efficient algorithms exist for recovering the whole regularization path of the Lasso, finding a subset of highly relevant features that leads to a robust predictor remains an important aspect to investigate.
The hypothesis of FeaLect is that defining a scoring scheme that measures the quality of each feature can provide a more robust selection of features. The FeaLect approach is to generate several samples from the training data, determine the best relevance ordering of the features for each sample, and finally combine these relevance orderings to select highly relevant features.
In this paper, we propose a clearness-based feature selection (CBFS) algorithm, which can be classified as a filter method. In our context, clearness means the separability between classes in a feature. If (clearness of feature $f_2$) > (clearness of feature $f_1$), then $f_2$ is more advantageous for classification than $f_1$. In Fig. 1, feature $f_2$ is clearer than $f_1$: O and X mark data samples in $f_1$ and $f_2$, and the mixed area of $f_1$ is larger than that of $f_2$. Therefore, the classification accuracy obtained with $f_1$ may be lower than with $f_2$. In the CBFS method, we measure the clearness of each feature in a dataset and select the top-ranked features. CBFS calculates the distance between the target sample and the centroid of each class, and then compares the class of the nearest centroid with the class of the target sample. The matching ratio over all samples in a feature becomes the clearness value of that feature. We describe the detailed process for obtaining clearness values in the Materials and Methods section.
The proposed method can also be combined with other feature ranking algorithms. We combine the proposed method with R-value [9] and validate the improvement in classification accuracy. R-value is a feature ranking algorithm that also measures the clearness of each feature, but in a different way from CBFS. It considers the nearest neighbor samples of a target sample to decide whether the sample is located in a congestion area. In some cases, R-value based feature selection produces better accuracy than CBFS, so we can expect that combining the two improves classification accuracy.

Materials and Methods
Clearness-based feature scoring scheme
As mentioned, the proposed method can be classified as a filter method. Every filter method has a scoring scheme for each feature in a dataset. CBFS adopts CScore. CScore($f_i$) is a scoring function for feature $f_i$ that measures the clearness of the feature. Intuitively, CScore for feature $f_i$ is the proportion of samples that are correctly clustered to the centroid of their own class in $f_i$. In the context of CBFS, each sample is clustered to the nearest class centroid. If a sample of class A is clustered to the centroid of class B, it is a mis-clustered sample. In Fig. 2(b), two samples are mis-clustered, whereas all samples are correctly clustered in Fig. 2(a). Well-clustered features lead to high classification accuracy.
Suppose a dataset DS has n samples, m features, and p classes. DS can be denoted as a set of samples $x_i$:
$$DS = \{x_1, x_2, \ldots, x_n\}$$
Each sample is a vector with m elements (features):
$$x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$$
A set CS contains the class labels corresponding to the samples in DS:
$$CS = \{c_1, c_2, \ldots, c_n\}$$
A class label is a sequential integer in the range [0, p-1]. We now introduce the procedure to obtain CScore($f_i$).
Step 1. Calculate the centroid of each class. The centroid is treated as the median point of a class and is calculated by averaging. Med($f_i$, j) denotes the median point of class j in feature $f_i$, which is calculated by:
$$\mathrm{Med}(f_i, j) = \frac{1}{k} \sum_{r:\, c_r = j} x_{ri} \qquad (1)$$
where k is the number of samples of class j.
Step 2. Calculate the predicted class label for each $x_{ij}$ in sample $x_i$. After calculating the distance between $x_{ij}$ and Med($f_j$, t) for all classes t, we take the nearest centroid Med($f_j$, s), and s is the predicted class label for $x_{ij}$. The distance between $x_{ij}$ and Med($f_j$, t) is calculated by:
$$D(x_{ij}, \mathrm{Med}(f_j, t)) = |x_{ij} - \mathrm{Med}(f_j, t)|$$
As a result of Step 2, we have an $n \times m$ matrix $M_1$ whose element $M_1(i,j)$ is the predicted class label for $x_{ij}$.
Step 3. Calculate the $n \times m$ matrix $M_2$, which records whether each predicted class label matches the correct class label in CS. $M_2(i,j)$ is calculated by:
$$M_2(i,j) = \begin{cases} 1 & \text{if } M_1(i,j) = c_i \\ 0 & \text{otherwise} \end{cases}$$
Step 4. Calculate CScore($f_i$). Finally we calculate CScore($f_i$) by:
$$\mathrm{CScore}(f_i) = \frac{1}{n} \sum_{r=1}^{n} M_2(r, i)$$
If CScore($f_i$) is close to 1, the classes in feature $f_i$ are clustered well and the elements in $f_i$ can be clearly classified. Therefore, we can use CScore($f_i$) as a criterion to select features for classification work. CBFS chooses highly scored features using the CScore() function. An implementation of the algorithm that computes CScore() is available at the Supplementary Material link (http://biosw.dankook.ac.kr/cbfs). CBFS can be combined with other feature scoring schemes. To distinguish the combined algorithms presented in the next section, we denote the pure CBFS algorithm as CBFS_org.
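The four steps can be illustrated with a short pure-Python sketch. It is not the implementation distributed at the supplementary link; the function names `cscore` and `select_top` are hypothetical, and the data are assumed to be a list of numeric feature vectors with one class label per sample.

```python
from collections import defaultdict


def cscore(X, labels):
    """Return one clearness score per feature (column) of X.

    X      : list of n samples, each a list of m numeric feature values
    labels : list of n class labels
    """
    n, m = len(X), len(X[0])
    scores = []
    for j in range(m):
        # Step 1: centroid (mean) of every class in feature j.
        sums, counts = defaultdict(float), defaultdict(int)
        for row, c in zip(X, labels):
            sums[c] += row[j]
            counts[c] += 1
        centroids = {c: sums[c] / counts[c] for c in sums}

        # Steps 2-3: assign each value to its nearest centroid and
        # count how often the predicted label matches the true one.
        matches = 0
        for row, c in zip(X, labels):
            predicted = min(centroids, key=lambda t: abs(row[j] - centroids[t]))
            matches += predicted == c

        # Step 4: matching ratio over all n samples.
        scores.append(matches / n)
    return scores


def select_top(X, labels, k):
    """Indices of the k features with the highest score (hypothetical helper)."""
    scores = cscore(X, labels)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]
```

Because each feature is scored independently, the outer loop could be parallelized or vectorized without changing the result.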

Improvement of CBFS with R-value
Though CBFS itself shows high performance for feature selection, we can improve its quality by combining it with other scoring schemes. In this section, we describe an example that combines CBFS and R-value. The same approach can be applied to combine other scoring schemes. The scoring function of CBFS is based on the distance between each data point and the centroids of the classes. In some cases, this produces wrong scores, as shown in Fig. 4. In Fig. 4(a), class A and class B are clearly separated, but the two points of class B in the dotted circle are classified as class A, which decreases the value of CScore(). If two classes overlap widely, as shown in Fig. 4(b), many points in the overlapping area will be mis-classified. In the cases shown in Fig. 4, the R-value is a better scoring function because it does not consider the distance to the class centroids but instead considers the nearest neighbors of each point.
Traditional approaches to combining different feature selection methods usually just use the intersection. The next box presents the simple steps required to combine CBFS and R-value. We denote this approach as CBFS_intersection.
Figure 4. Two cases in which CScore() produces wrong scores. If the data range of a class is much smaller than that of a neighboring class, CScore may produce a wrong score (Case 1). If two classes have a wide overlapping area, CScore may produce a wrong score (Case 2). doi:10.1371/journal.pone.0040419.g004
A difficulty in CBFS_intersection is how to determine n if we fix m. For example, even if we want to get 20 features using CBFS_intersection, we do not know the correct value of n because we cannot estimate the number of intersections between step 1 and step 2. Therefore, we modify CBFS_intersection to extract exactly m features. We denote it as CBFS_exact. Basic steps for CBFS_exact are as follows:
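The sketch below is not the authors' published procedure. It only illustrates, under stated assumptions, how two feature rankings could be intersected and how n might be grown until m common features are found. The function names are hypothetical, both score lists are assumed to be oriented so that higher means better, and the stopping rule in `combine_exact` is a guess made for illustration.

```python
def top_n(scores, n):
    """Indices of the n highest-scoring features, best first."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:n]


def combine_intersection(scores_a, scores_b, n):
    """Features ranked in the top n by both scoring schemes."""
    keep = set(top_n(scores_b, n))
    # Preserve the ordering given by the first scheme.
    return [j for j in top_n(scores_a, n) if j in keep]


def combine_exact(scores_a, scores_b, m):
    """Grow n until the intersection holds at least m features, then cut to m.

    Assumption: this stopping rule is illustrative, not taken from the paper.
    """
    total = len(scores_a)
    for n in range(m, total + 1):
        common = combine_intersection(scores_a, scores_b, n)
        if len(common) >= m:
            return common[:m]
    return combine_intersection(scores_a, scores_b, total)[:m]
```

With a fixed n, the size of the intersection is only known after both rankings are computed, which is the difficulty the text describes.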

Datasets, feature selection algorithms, and classifiers
To compare feature selection algorithms we chose various kinds of datasets, which contain varying numbers of features and samples. Duke, Leukemia, DLBCL, and Carcinoma are well-known microarray datasets. The other datasets come from the UCI repository [10] and several websites. Table 1 summarizes the benchmark datasets. The FeaLect, FSDD, and Relief feature selection algorithms are compared with the proposed CBFS_org, CBFS_intersection, and CBFS_exact. FeaLect is widely considered a state-of-the-art algorithm, and its details are described in Section 1. For simplicity we denote FeaLect as Lect from here on.
The basis of the FSDD algorithm is to identify the features that give good separability among classes and that keep the samples of the same class as close together as possible. The criterion used for selecting good features is $d_b - \beta d_w$. In the criterion function, m is the number of selected features, c is the number of classes, and $r_i$ is the prior probability of the ith class.
Relief is regarded as one of the successful feature selection algorithms. The basic idea of Relief is to iteratively estimate feature weights according to their ability to discriminate between neighboring instances. In each iteration, an instance x is selected randomly, and its nearest instance is found from the same class (NH) as well as from different classes (NM). Finally, the weight of each feature is updated. If $w_1 < w_2$, feature 2 is better than feature 1. The ReliefF (Relief-F) algorithm [11], an updated version of Relief, is more robust and can deal with incomplete and noisy data.
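As a minimal sketch of the idea for two classes, the code below uses a commonly cited form of the Relief update, in which each weight grows by the distance to the nearest miss and shrinks by the distance to the nearest hit. The exact update used in the experiments is not assumed here; the function name `relief`, the Manhattan distance, and the `iterations` and `seed` parameters are illustrative choices.

```python
import random


def relief(X, labels, iterations=100, seed=0):
    """Return one weight per feature; larger weights mean better features."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    weights = [0.0] * m

    def distance(a, b):
        # Manhattan distance over all features.
        return sum(abs(u - v) for u, v in zip(a, b))

    for _ in range(iterations):
        i = rng.randrange(n)  # randomly selected instance x
        x, c = X[i], labels[i]
        hits = [X[r] for r in range(n) if r != i and labels[r] == c]
        misses = [X[r] for r in range(n) if labels[r] != c]
        if not hits or not misses:
            continue
        nh = min(hits, key=lambda s: distance(x, s))    # nearest hit (NH)
        nm = min(misses, key=lambda s: distance(x, s))  # nearest miss (NM)
        for j in range(m):
            weights[j] += abs(x[j] - nm[j]) - abs(x[j] - nh[j])
    return weights
```

Changing `seed` changes which instances are drawn, which is why repeated runs of Relief can rank features differently.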
To compare classification accuracy between the current feature selection algorithms and the proposed CBFS, we used the k-nearest neighbor (KNN) and support vector machine (SVM) classifiers. In the KNN classification analysis, we used k = 5 because this value was found to produce the best accuracy in most cases. For the SVM test, we used the LIBSVM tool [12] with a linear kernel. All user-defined values, such as degree, gamma, and coef0, were left at their defaults. We used the Lect algorithm as provided in its R package. The user-defined values of FSDD are Beta = 3 and K = 3. For ReliefF, we used K = 7. The proposed CBFS_intersection uses a threshold value of n = 100. To avoid over-fitting in the classification, we used a well-known validation method, k-fold cross validation [13] with k = 5.
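The evaluation protocol can be sketched in pure Python as follows. This is only an illustration of 5-fold cross validation with a 5-nearest-neighbor classifier on a chosen feature subset, not the code used for the reported results; the function names, the squared Euclidean distance, and the way folds are drawn are assumptions.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k training samples closest to x."""
    order = sorted(range(len(train_X)),
                   key=lambda r: sum((a - b) ** 2 for a, b in zip(train_X[r], x)))
    votes = Counter(train_y[r] for r in order[:k])
    return votes.most_common(1)[0][0]


def cross_val_accuracy(X, y, features, folds=5, k=5, seed=0):
    """Mean accuracy of KNN over the folds, restricted to the given features."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)

    def project(row):
        # Keep only the selected feature columns.
        return [row[j] for j in features]

    accuracies = []
    for f in range(folds):
        test = idx[f::folds]  # every folds-th shuffled sample is held out
        held_out = set(test)
        train = [r for r in idx if r not in held_out]
        train_X = [project(X[r]) for r in train]
        train_y = [y[r] for r in train]
        correct = sum(knn_predict(train_X, train_y, project(X[r]), k) == y[r]
                      for r in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / folds
```

Calling `cross_val_accuracy` with the feature indices returned by each selection method, at the same subset size, gives the kind of like-for-like comparison described above.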

Results
Relevance, sparsity, and optimality are measures used to evaluate feature selection algorithms. Relevance and sparsity are generally used in the microarray area and require domain knowledge. Optimality evaluates classification accuracy using the same number of features from different feature selection algorithms. Fig. 5 and 6 present the optimality evaluation. We test KNN and SVM to compare classification accuracy based on 5-fold validation. Fig. 5 and 6 show that the proposed CBFS_org is far better than current filter methods such as FSDD and Relief. In addition, it also outperforms Lect, which is a high quality feature selection method. Lect can be classified as a wrapper method. In general, wrapper methods produce better classification accuracy but require long execution times. Though the proposed method is a filter method, it matches or exceeds the performance of the wrapper method. CBFS_org shows good classification accuracy with both KNN and SVM, so it generalizes across well-known classification algorithms.
Fig. 7 shows a PCA analysis of the feature selection results for the Arcene dataset. Red and black points represent samples of two different classes. The congestion areas of red and black points are narrower in the CBFS graphs than in the others. In general, the narrower the congestion area, the better the classification accuracy we can expect. This is why the CBFS algorithm produces higher accuracy than the other algorithms.
Table 2 summarizes the best classification accuracies obtained by the tested feature selection algorithms on the benchmark datasets. We test various parameter values and numbers of features for each feature selection algorithm and classifier, and choose the best accuracies. The proposed CBFS_org and CBFS_exact achieve the top accuracy on every dataset except Prostate. In particular, CBFS_exact is 22.7% and 23.8% higher than Lect on the Duke and Madelon datasets, respectively.
The number of features needed by CBFS_org and CBFS_exact to produce their best accuracy is generally smaller than for the other algorithms. For example, Lect, CBFS_org, and CBFS_exact all produce the best accuracy on the Leukemia dataset, but Lect uses 25 features whereas CBFS_exact uses only 5. Table 2 also shows that CBFS_exact produces better accuracy than CBFS_org in some cases. CBFS_intersection has lower accuracy than CBFS_org and CBFS_exact. We can therefore consider combining multiple feature selection algorithms to improve classification accuracy.
Some microarray datasets have a small number of samples. For example, Carcinoma has only 72 samples. In that case, classification accuracy is not a reliable measure for evaluating feature selection algorithms, and instead we need to analyze the risk of mis-classification or prediction using the 'loss function' [14] or the Receiver Operating Characteristic (ROC) curve [15]. A ROC curve is a two-dimensional graph of sensitivity against 1-specificity. ROC curves are widely used in biology and medical science for evaluating prediction methods or markers. We used ROC curves to compare the stability of CBFS with that of Lect. Lect is currently a top-ranked feature selection algorithm, so we use only it for comparison. Fig. 8 and 9 show the ROC analysis for CBFS and Lect on the Duke and Prostate datasets, respectively. We extract five features using CBFS and Lect, and list the relationship values between the samples in the features and their class labels. We also draw AUC curves according to [15], which use the average values obtained from the ROC curves. Fig. 8(c) and Fig. 9(c) show that the AUC value of CBFS is greater than that of Lect, which means that CBFS is a more stable and superior method than Lect.
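For readers unfamiliar with the quantities involved, the following sketch shows how ROC points (1-specificity against sensitivity) and the area under the curve can be computed from a list of scores and binary labels. It is a generic illustration, not the analysis pipeline behind Fig. 8 and 9; the function names and the trapezoidal rule are assumptions.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs as the threshold is lowered.

    labels are 1 for positive and 0 for negative samples; both classes
    must be present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(ranked):
        # Move past every sample that shares the current score.
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

An AUC of 1.0 means every positive sample is scored above every negative one, while 0.5 corresponds to a ranking no better than chance.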

Discussion
Time complexity of calculating CScore
CBFS is a fast and efficient algorithm. We analyze its time complexity from the steps used to calculate CScore() in Section 2.1. Let n, c, and m equal the number of samples, classes, and features of the dataset. The total time complexity is $O((2+c) \cdot m \cdot n)$. Table 3 shows the computation times of feature selection for the selected algorithms. CBFS is the fastest of the feature selection algorithms compared.

Overfitting problem of proposed algorithm
Overfitting is a general problem of machine learning tasks such as classification. To avoid overfitting, K-fold validation and LOOCV schemes are used in classification tests. Validation errors can be used to evaluate feature selection algorithms. Table 4 shows the classification accuracy and validation errors of Lect and CBFS on the benchmark datasets. We calculate validation errors for five to twenty features derived by Lect and CBFS. Lect uses the L1-regularization technique to avoid the overfitting problem, so we can indirectly evaluate the validation error of CBFS by comparing Lect and CBFS. CBFS gives lower validation errors than Lect for every dataset except Sonar. The average validation errors of Lect and CBFS are 68.33% and 66.85%, respectively. If a feature selection algorithm has a lower validation error, the algorithm is less sensitive to the distribution of samples and may suffer less from overfitting. Most distance-based filters assume that a feature with short intra-class and long inter-class distances can produce high classification accuracy, but this assumption carries the risk of higher overfitting. Proposed CScore() evaluates each feature based on the degree of conden-

Application of CBFS
CBFS can be applied to any area of data analysis that requires a feature selection scheme, such as microarray gene selection, text categorization, and image classification. Microarray data are used to screen thousands of genes and determine whether genes are related to a specific disease such as cancer. A gene corresponds to a feature, and CBFS may suggest candidate genes according to their feature evaluation values. Medical experts can then analyze the biological functions of the candidate genes and find target genes that are related to diseases. Feature selection is also an essential part of text classification. Document collections have 10,000 to 100,000 or more unique words, and many of these words are not useful for classification. Restricting the set of words used for classification makes classification more efficient and can improve generalization error [16]. Image retrieval is another application area of CBFS. In image retrieval, each image may have a great many features that characterize the data. In feature extraction step, we don't know which features are efficient to