Rough sets and Laplacian score based cost-sensitive feature selection

Cost-sensitive feature selection is an important preprocessing step in machine learning and data mining. Most existing cost-sensitive feature selection algorithms are heuristic: they evaluate the importance of each feature individually and select features one by one, and therefore do not consider the relationships among features. In this paper, we propose a new algorithm for minimal-cost feature selection, called rough sets and Laplacian score based cost-sensitive feature selection (CSFS-RSLS). The importance of each feature is evaluated by both rough sets and the Laplacian score. Compared with heuristic algorithms, the proposed algorithm takes the relationships among features into account through the locality-preserving property of the Laplacian score. We select a feature subset with maximal feature importance and minimal cost, with cost considered in parallel, where the costs are drawn from three different distributions to simulate different applications. Unlike existing cost-sensitive feature selection algorithms, our algorithm selects a predetermined number of "good" features. Extensive experimental results show that the approach is efficient and effectively obtains a minimum-cost subset. In addition, its results are more promising than those of other cost-sensitive feature selection algorithms.


Introduction
Feature selection [1][2][3][4] is an essential process for machine learning applications [5][6][7], because it improves generalization capability and reduces running time [8][9][10]. The goal of feature selection is to find a feature subset that reduces the dimensionality of the feature space and improves the predictive accuracy of a classification algorithm [11][12][13][14][15][16]. There are various feature evaluation methods, such as maximal margin [17], maximal stability [18], effective distance [19], maximum relevance-maximum significance [20], and matrix factorization subspace learning [21,22]. These evaluation methods assume that the data are obtained for free. However, in many real-world applications, we must pay test costs to collect data items [23][24][25]. Test costs are often measured in time, money, and other resources [26]. Therefore, cost must be considered in the feature selection process.

Definition 1 [45]. A test-cost-sensitive decision system (TCS-DS) S is the 6-tuple

S = (U, C, D, V, I, c*),   (1)

where U is a finite set of objects called the universe, C is the set of conditional features, and D is the set of decision features. For each a ∈ C ∪ D, V_a is the set of values of a and I_a: U → V_a is an information function. Finally, c*: 2^C → R⁺ ∪ {0} is the feature-subset test cost function, where R⁺ is the set of positive real numbers.
We assume that the test sequence does not influence the total cost, and for each A ⊆ C, one should specify the value of c*(A). The feature-subset test cost function c* can therefore be stored as a vector

c* = [c*(∅), c*({a_1}), c*({a_2}), ..., c*({a_1, a_2}), ..., c*(C)].   (2)

The space required to store the function c* is 2^|C|, which soon becomes unacceptable as |C| increases. To deal with this problem, we need an alternative representation of the test cost function. We therefore let c: C → R⁺ ∪ {0} be the test cost function, assuming that the range of c is non-negative (R⁺ ∪ {0}), which is a natural assumption in practice. A feature test cost function can then easily be represented by a vector c = [c(a_1), c(a_2), ..., c(a_|C|)].

Definition 2 [43]. Let S = (U, C, D, V, I, c*, e) be a TCS-DS-ER, where U, C, D, V, I, and c* have the same meanings as in Definition 1, e: C → R⁺ ∪ {0} gives the maximal error range of each a ∈ C, and ±e(a) is the error range of a. The error range of feature a is defined in terms of Δ, where we set Δ = 0.1, a(x_i) is the i-th instance value of a ∈ C, i ∈ [1, m], and m is the number of instances. The precision of e(a) can be adjusted by setting Δ.

To facilitate processing and comparison, the conditional feature values are normalized so that they range from 0 to 1. Among the many normalization approaches, we employ the simple min-max function y = (x − min)/(max − min), where y is the normalized value, x is the initial value, and max and min are the maximal and minimal values of each conditional feature.
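The min-max normalization above is straightforward to implement. A minimal sketch (the function name is ours):

```python
def min_max_normalize(column):
    """Scale a list of feature values into [0, 1] via y = (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant feature: the formula would divide by zero, so map to 0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]
```

For example, min_max_normalize([2, 4, 10]) yields [0.0, 0.25, 1.0].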

Rough sets and Laplacian score based cost-sensitive feature selection
In this section, we introduce the relative reduct from rough sets, the Laplacian score (LS), and our algorithm. The first part describes the relative reduct for numeric data. The second part describes the use of the LS in cost-sensitive feature selection. Our CSFS-RSLS algorithm is described in the last part. The key elements of the exponential-weighting algorithm are the exponentially weighted feature-importance function, the test costs, and a user-specified exponent α.

Relative reducts in rough sets
Rough set theory [46], proposed in the early 1980s, is a mathematical tool for dealing with uncertainty and a relatively new soft computing method. The concept of the relative reduct has been thoroughly investigated in rough set theory. It is built on decision systems, and there are many different definitions, such as positive approximation reducts [47], parallel reducts [48,49], and general definition reducts [50]. We use the standard decision-relative reduct.

Definition 3. Let S be a decision system. A feature subset B ⊆ C is a decision-relative reduct of S if (1) POS_B(D) = POS_C(D), and (2) for any a ∈ B, POS_{B∖{a}}(D) ≠ POS_C(D).

The first condition guarantees that the information in terms of the positive region is preserved, and the second condition guarantees that no superfluous test is included. With this decision-relative reduct, the decision-relative core is naturally defined as follows.
Definition 4. Let Red(S) denote the set of all decision-relative reducts of S. The decision-relative core of S is Core(S) = ∩Red(S).
In other words, Core(S) contains those tests appearing in all decision-relative reducts. A decision-relative reduct is also called a reduct for brevity.
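On small decision tables, the decision-relative reducts and the core can be computed by brute force directly from the two conditions above (positive-region preservation and minimality). The following exponential-time sketch is for illustration only; all names are ours:

```python
from itertools import combinations

def partition(U, rows, attrs):
    """Group object indices by their values on attrs (indiscernibility classes)."""
    blocks = {}
    for i in U:
        key = tuple(rows[i][a] for a in attrs)
        blocks.setdefault(key, []).append(i)
    return list(blocks.values())

def positive_region(U, rows, attrs, d):
    """Objects whose indiscernibility class is consistent on the decision d."""
    pos = set()
    for block in partition(U, rows, attrs):
        if len({rows[i][d] for i in block}) == 1:
            pos.update(block)
    return pos

def reducts(U, rows, C, d):
    """All decision-relative reducts: minimal subsets preserving POS_C(d)."""
    full_pos = positive_region(U, rows, C, d)
    keep = [set(B) for r in range(1, len(C) + 1)
            for B in combinations(C, r)
            if positive_region(U, rows, list(B), d) == full_pos]
    # keep only minimal subsets (condition 2 of the reduct definition)
    return [B for B in keep if not any(other < B for other in keep)]
```

On a four-object toy table, the core is then obtained as set.intersection(*reducts(...)), i.e., the features common to all reducts.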

Laplacian score in cost-sensitive feature selection
In real-world applications, the LS can be applied to supervised or unsupervised feature selection. For many datasets, the local structure of the data space is more important than the global structure. To represent the local geometry of the data, the LS constructs a nearest-neighbor graph, based on the observation that two data points are probably related to the same topic if they are close to each other. This local structure is well preserved by the nearest-neighbor graph, and the importance of each feature is calculated from it. The basic idea of the LS is to evaluate the importance of each feature according to its locality-preserving power. Here, we apply the LS to unsupervised feature selection. For each feature a ∈ C, let LS(a) denote its importance. We combine feature importance and cost as

LS(a, c) = LS(a) · c(a)^α,

where α is a user-specified non-positive exponent and c(a) is the test cost of feature a. If α = 0, this function reduces to the traditional feature importance. Before defining LS(a) [51], we introduce some notation. Let LS(a_r) denote the Laplacian score of the r-th feature, and let f_ri denote the i-th sample of the r-th feature, i = 1, ..., m. The score is computed in four main steps:

1. Construct a nearest-neighbor graph G over the m samples, where the i-th node corresponds to x_i. We put an edge between nodes i and j if x_i and x_j are "close", i.e., x_i is among the k nearest neighbors of x_j or x_j is among the k nearest neighbors of x_i. When label information is available, one can instead put an edge between two nodes sharing the same label.
2. If nodes i and j are connected, put S_ij = exp(−‖x_i − x_j‖² / t), where t is a suitable constant; otherwise, put S_ij = 0. The weight matrix S of the graph models the local structure of the data space.

3. For the r-th feature, let f_r = [f_r1, f_r2, ..., f_rm]ᵀ, D = diag(S1), and L = D − S (the graph Laplacian), where 1 = [1, ..., 1]ᵀ. Remove the weighted mean by computing f̃_r = f_r − ((f_rᵀ D 1)/(1ᵀ D 1)) 1.
4. Compute the LS of the r-th feature as

LS(a_r) = (f̃_rᵀ L f̃_r) / (f̃_rᵀ D f̃_r).

Example 1. First, we use a subtable of Table 1, shown in Table 3, and obtain the error range vector in Table 4 from Table 3. Second, we find from Table 4 that the core feature is Gammagt. Setting k = 3 and t = 1, we compute the weight matrix S. The Laplacian score LS(a) of each feature is then shown in Table 5.
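Under the assumptions above (k-NN graph, heat-kernel weights), the four steps can be sketched with NumPy; note that in the original formulation of He et al. [51], a smaller raw score indicates stronger locality preservation, so importance rankings may invert the score:

```python
import numpy as np

def laplacian_score(X, k=3, t=1.0):
    """Laplacian score of each feature; X is an (m, n) data matrix,
    k the neighbourhood size, t the heat-kernel width."""
    m, n = X.shape
    # Steps 1-2: k-NN graph with heat-kernel weights S_ij = exp(-||x_i - x_j||^2 / t)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.zeros((m, m))
    for i in range(m):
        nn = np.argsort(sq[i])[1:k + 1]          # k nearest neighbours of x_i (skip self)
        S[i, nn] = np.exp(-sq[i, nn] / t)
    S = np.maximum(S, S.T)                       # edge if either point is a neighbour
    D = np.diag(S.sum(1))
    L = D - S                                    # graph Laplacian
    ones = np.ones(m)
    scores = np.empty(n)
    for r in range(n):
        f = X[:, r]
        # Step 3: remove the weighted mean so the score ignores constant shifts
        f_t = f - (f @ D @ ones) / (ones @ D @ ones)
        # Step 4: LS_r = f~^T L f~ / f~^T D f~
        scores[r] = (f_t @ L @ f_t) / (f_t @ D @ f_t)
    return scores
```

On data with two well-separated clusters, a feature aligned with the cluster structure receives a much smaller raw score than a random-noise feature.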
The value of LS(a) indicates the quality of the feature. Table 5 shows the feature-importance ordering Gammagt > Sgpt > Sgot > Mcv > Alkphos. When we add the cost and set α =

The proposed algorithm
To deal with the test-cost problem more quickly and efficiently, we propose a feature-importance function that incorporates cost sensitivity into the feature score. This function combines feature importance and cost, and is more reasonable and more widely applicable to practical problems. The pseudocode is listed in Algorithm 1 and contains two main steps: 1. Add the core features to B according to the reduct computed by rough sets; 2. Repeatedly add the current-best feature to B according to the feature-importance function LS(a, c) until |B| reaches the desired number of features.
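The two-step selection loop can be sketched as follows, assuming the exponential weighting LS(a, c) = LS(a) · c(a)^α with α ≤ 0 (all names are illustrative):

```python
def csfs_rsls(features, core, ls_score, cost, alpha, num_features):
    """Greedy sketch of the CSFS-RSLS selection loop.
    features: candidate feature names; core: rough-set core features;
    ls_score/cost: dicts mapping feature -> Laplacian score / test cost;
    alpha <= 0: cost exponent; num_features: desired size of B."""
    B = list(core)                                   # step 1: start from the core
    candidates = [a for a in features if a not in B]
    # step 2: repeatedly add the feature maximizing LS(a, c) = LS(a) * c(a)^alpha
    while len(B) < num_features and candidates:
        best = max(candidates, key=lambda a: ls_score[a] * cost[a] ** alpha)
        B.append(best)
        candidates.remove(best)
    return B
```

With α < 0, an expensive feature needs a proportionally larger Laplacian score to be selected; with α = 0, cost is ignored and the loop degenerates to plain importance ranking.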

Evaluation method
Among the existing algorithms, many deal with the MTR problem. It is necessary to define several evaluation methods to compare their performance. First, we need a method to evaluate the quality of a single feature subset. For example, if the test cost of the optimal feature subset is $100, a feature subset of the same size with test cost $120 is better than another with test cost $150. The proposed algorithm can be run on many datasets, or on one dataset with different test cost settings. We propose two statistical metrics: the average below factor and the average exceeding factor.

Below factor
For a dataset with a test cost setting, let R_0 be an optimal reduct. The below factor of a feature subset R is

bf(R) = (c*(R_0) − c*(R)) / c*(R_0).

The below factor is a quantitative metric for evaluating the performance of a feature subset: it measures how much better a feature subset is than the optimal reduct. Naturally, if R is an optimal feature subset, the below factor is 0.

Maximal below factor. To demonstrate the performance of the algorithm, statistical metrics are needed. Let the number of experiments be K. In the i-th experiment (1 ≤ i ≤ K), the feature subset computed by the algorithm is denoted by R_i. The maximal below factor (MBF) is defined as

MBF = max_{1 ≤ i ≤ K} bf(R_i).

This is the best case of the algorithm on the given dataset and, to some extent, expresses its performance.

Average below factor. The average below factor (ABF) is defined as

ABF = (1 / K_1) Σ bf(R_i),

where the sum runs over the K_1 test cost settings for which c*(R_i) < c*(R_0). Because the ABF is averaged over K_1 different test cost settings, it characterizes the performance of the algorithm from a purely statistical perspective.
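The extracted text omits the display formulas; a plausible reconstruction, consistent with the surrounding description (bf(R) = 0 when R matches the optimal reduct's cost, ABF averaged over the K1 cheaper-than-optimal runs), can be sketched as:

```python
def below_factor(cost_R, cost_opt):
    """bf(R) = (c*(R0) - c*(R)) / c*(R0): positive when R undercuts the optimal reduct."""
    return (cost_opt - cost_R) / cost_opt

def mbf_abf(costs_R, costs_opt):
    """Maximal below factor, and average below factor over the K1 runs in which
    the selected subset was strictly cheaper than the optimal reduct.
    Returns (0.0, 0.0) when no run was below optimal."""
    bfs = [below_factor(r, o) for r, o in zip(costs_R, costs_opt) if r < o]
    mbf = max(bfs) if bfs else 0.0
    abf = sum(bfs) / len(bfs) if bfs else 0.0
    return mbf, abf
```

For instance, subset costs [80, 100, 50] against an optimal reduct cost of 100 give below factors 0.2 and 0.5 (the middle run is not below optimal), hence MBF 0.5 and ABF 0.35.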

Exceeding factor
For a dataset with a test cost setting, the exceeding factor is used to show the performance of the algorithm. Similarly, if the algorithm is run K times, the exceeding factor and the maximal exceeding factor are defined in [31]. The exceeding factor provides a quantitative metric to evaluate the performance of a feature subset: it measures how much worse a feature subset is than the optimal. The value of the maximal exceeding factor is the worst case for a dataset.
Although it relates to the performance of one particular feature subset, it should be viewed as a statistical rather than an individual metric. The average exceeding factor (AEF) is defined as

AEF = (1 / K_2) Σ ef(R_i),

where the sum runs over the K_2 = K − K_1 test cost settings not counted in the ABF. The AEF is a statistical metric that represents the overall performance of the algorithm.
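A sketch of these metrics, assuming the standard form ef(R) = (c*(R) − c*(R0)) / c*(R0) from [31]:

```python
def exceeding_factor(cost_R, cost_opt):
    """ef(R) = (c*(R) - c*(R0)) / c*(R0): 0 when R matches the optimal reduct's cost."""
    return (cost_R - cost_opt) / cost_opt

def mef_aef(costs_R, costs_opt):
    """Maximal exceeding factor, and average exceeding factor over the
    K2 = K - K1 runs in which the subset was not cheaper than the optimal reduct."""
    efs = [exceeding_factor(r, o) for r, o in zip(costs_R, costs_opt) if r >= o]
    mef = max(efs) if efs else 0.0
    aef = sum(efs) / len(efs) if efs else 0.0
    return mef, aef
```

For instance, subset costs [120, 100, 150] against an optimal reduct cost of 100 give exceeding factors 0.2, 0.0, and 0.5, hence MEF 0.5 and AEF 0.7/3 ≈ 0.233.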

Experiments
In this section, we try to answer the following questions by experimentation. The data for our experiments come from real applications. However, because these datasets do not provide test costs, we use the uniform, normal, and Pareto distributions to generate random test costs in [1, 100]. These costs are created to help show the performance of the cost-sensitive feature selection algorithm. The data underlying this study have been uploaded to GitHub and are accessible at the following link: https://github.com/fhqxa/PLOSONE-D-17-34607.
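The test cost generation can be sketched as follows; the normal and Pareto parameters are illustrative assumptions, since the paper does not state them:

```python
import random

def generate_test_costs(num_features, distribution, lo=1, hi=100, seed=None):
    """Random integer test costs in [lo, hi] under one of three distributions."""
    rng = random.Random(seed)
    costs = []
    for _ in range(num_features):
        if distribution == "uniform":
            x = rng.uniform(lo, hi)
        elif distribution == "normal":
            x = rng.gauss((lo + hi) / 2, (hi - lo) / 6)   # most mass inside the range
        elif distribution == "pareto":
            x = lo + rng.paretovariate(2)                 # heavy tail: many small values
        else:
            raise ValueError(distribution)
        costs.append(int(min(max(x, lo), hi)))            # clip into [lo, hi]
    return costs
```

The Pareto setting produces mostly cheap tests with a few expensive outliers, which matches the paper's explanation of why it yields the best optimal factors.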
Their basic information is listed in Table 6, where |C| is the number of features, |U| is the number of instances, and |D| is the number of classes.

Comparison of three distributions
For each dataset, we use different α values, and the test cost settings are generated from three distributions. The algorithm is run 100 times with different test cost settings and different α settings on nine datasets. Figs 1-9 show the results of finding the optimal factors for the three distributions. The proposed algorithm performs best with the Pareto distribution on every dataset. Except for the Ionosphere dataset, the normal distribution leads to the worst performance. A possible reason is that the Pareto distribution generates many small values and a few large values, so there are many features with both low test costs and large LSs. In contrast, the normal distribution generates many values close to the mean, so there are few features with both low test costs and large LSs. Finally, the uniform distribution yields more cheap tests than the normal distribution, but fewer than the Pareto distribution.

Figs 10-18 show the average below factor. For the three distributions, the average below factor is more convincing than the maximal below factor because it is computed from a statistical point of view, and hence better describes the performance of the CSFS-RSLS algorithm. From these results, we can see that the proposed algorithm obtains the best performance with the uniform distribution on every dataset except SMK-CAN-187. With the Pareto distribution, the average below factor is 0 for the Wpbc, Promoters, Prostate-GE, SMK-CAN-187, and Waveform datasets, indicating that the test cost of the selected feature subset equals that of the optimal reduct. For the Ionosphere dataset, although the optimal factor is 1, the average below factor is not 0 but about 0.5.
This result shows that the test cost of the selected feature subset is less than that of the optimal reduct, at about half its cost. Here, the optimal setting of α is very close, if not equal, to that for finding the optimal factor, so we only need the optimal setting to find the optimal factor. When α is at its optimal setting, the average exceeding factor is very low; for example, it is 0 for all nine datasets with the Pareto distribution. That is, on average, the constructed feature subsets do not have a higher test cost than the optimal reduct. This performance would be very satisfactory in practical applications.
In Table 7, we list the results for each dataset to compare two approaches according to the optimal factor. Both are based on CSFS-RSLS. The first, called the non-weighting approach, is implemented by setting α = 0. The second, called the average α approach, sets α to its average optimal value.
We observe the following: 1. The non-weighting approach performs well only on the Ionosphere dataset, where it has the highest average value, 0.661. For the other eight datasets, the results are unacceptable; for the Waveform dataset, when α = 0, it obtains an optimal factor of 0 for the uniform and normal distributions. In short, when α = 0, the algorithm is ineffective. Therefore, the non-weighting approach is not suitable for the minimal test-cost feature subset problem.
2. The average α approach takes a statistical approach and significantly improves the quality of the results on every dataset. For the Promoters, Prostate-GE, Credit-g, and SMK-CAN-187 datasets, the results are especially good; for the SMK-CAN-187 dataset, the value increases by about 99.1% for the uniform distribution. Relatively good results are obtained for the other datasets. For example, for the uniform distribution, the best value of the α = 0 approach is 0.440, whereas the average α approach reaches 0.979, an increase of 52.9%. This is a big improvement.

Effectiveness compared with two algorithms
In this section, we compare the proposed algorithm with two existing algorithms [43,44] to show its efficiency. First, the feature subsets produced by the two existing algorithms and the CSFS-RSLS algorithm are fed to a support vector machine classifier to compute the classification accuracy; we use 60% of each dataset as the training set and the rest as the test set. Second, using the uniform distribution, each algorithm is run 100 times with different test cost settings, and the optimal factor is compared under different exponential weight settings. Fig 28 shows the classification accuracy of the three algorithms on eight datasets. For the Liver, Wpbc, Promoters, Voting, Credit, Prostate-GE, and SMK-CAN-187 datasets, the classification accuracy of the λ-weighted and δ-weighted algorithms is the same. On these datasets, the CSFS-RSLS algorithm achieves higher classification accuracy than both; for the Prostate-GE dataset, it is higher by about 10%. For the Ionosphere dataset, CSFS-RSLS is lower than the δ-weighted algorithm by only about 1%, but higher than the λ-weighted algorithm by about 7%. Fig 29 shows the optimal factor found by the three algorithms with the optimal exponential weight. For the Promoters, Voting, Ionosphere, Credit, Prostate-GE, and SMK-CAN-187 datasets, the optimal factor found by the CSFS-RSLS algorithm is 1. For the Ionosphere dataset, the optimal factor found by CSFS-RSLS is higher than that of the λ-weighted algorithm by about 0.4, an unsatisfactory result for the λ-weighted algorithm. For the Liver dataset, the optimal factor found by CSFS-RSLS is lower than that of the δ-weighted algorithm by about 0.01, which is acceptable.
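The evaluation protocol (train on 60% of the data restricted to a selected feature subset, test on the remaining 40%) can be sketched as follows; a 1-NN classifier stands in for the paper's SVM so the example stays dependency-free, and the deterministic split is an illustrative simplification:

```python
def holdout_accuracy(X, y, selected, train_frac=0.6):
    """Evaluate a selected feature subset with a deterministic 60/40 split
    and a 1-NN classifier (a stand-in for the SVM used in the paper)."""
    Xs = [[row[j] for j in selected] for row in X]   # restrict to selected features
    period = 5
    cut = round(train_frac * period)                 # 3 of every 5 samples train
    train = [i for i in range(len(Xs)) if i % period < cut]
    test = [i for i in range(len(Xs)) if i % period >= cut]
    correct = 0
    for i in test:
        # predict with the label of the nearest training sample (squared Euclidean)
        j = min(train, key=lambda t: sum((p - q) ** 2 for p, q in zip(Xs[t], Xs[i])))
        correct += y[j] == y[i]
    return correct / len(test)
```

On two well-separated clusters, an informative feature subset yields perfect accuracy, while an uninformative (constant) feature yields chance-level accuracy.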

Conclusion and further work
In this paper, we have developed a new method for cost-sensitive feature selection. First, we use rough sets to compute the core of the features and the LS to compute the importance of each feature. Second, costs are randomly generated from three different distributions. Finally, we combine feature importance and cost. To assess the performance of the proposed algorithm, we compare it with two heuristic algorithms in the same experimental environment. Extensive experimental results show that the proposed algorithm performs better and obtains a feature subset with low cost; the CSFS-RSLS algorithm outperforms the existing algorithms. With regard to further work, many tasks remain. First, other realistic data models with test costs can be built. Second, the misclassification cost should also be added to the model; a model combining test cost with misclassification cost will be more suitable for real applications. In the future, we will focus on designing more effective and efficient algorithms for the minimal-cost feature selection problem. In summary, this study suggests new research directions for feature selection and cost-sensitive learning.