A Robust and Accurate Method for Feature Selection and Prioritization from Multi-Class OMICs Data

Selecting relevant features is a common task in most OMICs data analysis, where the aim is to identify a small set of key features to be used as biomarkers. To this end, two alternative but equally valid methods are mainly available, namely the univariate (filter) or the multivariate (wrapper) approach. The stability of the selected lists of features is an often neglected but very important requirement. If the same features are selected in multiple independent iterations, they more likely are reliable biomarkers. In this study, we developed and evaluated the performance of a novel method for feature selection and prioritization, aiming at generating robust and stable sets of features with high predictive power. The proposed method uses the fuzzy logic for a first unbiased feature selection and a Random Forest built from conditional inference trees to prioritize the candidate discriminant features. Analyzing several multi-class gene expression microarray data sets, we demonstrate that our technique provides equal or better classification performance and a greater stability as compared to other Random Forest-based feature selection methods.


Introduction
Identifying discriminant features, for instance from transcriptomics experiments, and modelling classifiers based on them are fundamental tasks when the aim is to highlight biomarkers (e.g. genes or transcripts discriminating healthy from diseased samples). Indeed, clinical classification based on high throughput molecular profiling has been already explored for a number of complex diseases, such as cancer [1,2]. These studies become crucial also in terms of public health when such approaches are considered for clinical practice [3]. On the other hand, concerns about the stability and reproducibility of microarray results have been raised, despite the huge propagation of the gene selection methods in biomarker discovery [4] and interest on this topic seems to be increasing [5,6]. The most frequently used feature selection techniques include univariate (filter), and multivariate (wrapper), approaches. Univariate techniques, such as the formal statistical hypothesis testing or, more in general, the ranking methods, test each feature separately. Multivariate techniques assess the relevance of groups of features simultaneously, by using selection methods (e.g. forward or backward selection) coupled with machine learning techniques such as logistic regression, support vector machines (SVM) or random forests (RF) [7][8][9]. Unfortunately, multivariate methods tend to identify different subsets of candidate biomarkers with equal accuracy, even when feature selection algorithms are used on the same data [5,6]. This is particularly true for feature selection problems in OMICs data analysis, where the number of investigated features is much larger than the number of samples. Multiple stability issues can in fact affect these data sets, and the data sets can contain large number of redundant features [10].
The aim of this study was to develop a feature selection and prioritization framework capable of guaranteeing high stability as well as high classification performance. First, an unsupervised fuzzy pattern discovery method [11] is used to discretize the gene expression data and to identify fuzzy-based feature signatures called fuzzy patterns (FP). Each FP summarizes the most relevant features of each class. Next, a Random Forest (RF) procedure, where the prior knowledge of the FPs is used to enhance the performance of each tree within the classifier, is run on the original data. Last, a permutation based variable importance measure is used to rank the selected features and produce the final prioritized feature list.
We tested our fuzzy pattern -random forest (FPRF) procedure, implemented in R language [12], as well as two widely used methods for feature selection based on RFs and also implemented in R -varSelRF [8] and Boruta [9], on several gene expression microarray data sets investigating human samples in different pathophysiological conditions. The basic idea of varSelRF is to execute a backward iterative selection process that exploits the measures of variable importance, computed by RF based on CART trees [13], whereas Boruta uses an iteration process that removes, at each run, the features with less contribution to classification accuracy by introducing random variables for the competition [10].

Materials and Methods
The FPRF algorithm Figure 1 depicts the proposed schema of feature selection and prioritization, when dealing with multiclass high-throughput data. The first step consists of a fuzzy pattern discovery method, implemented in the R package DFP [11], which is used to select large subsets of relevant and independent class-specific features indicated as FPs (see Figure S1 for details). First, the membership function for each feature is computed and each value is consequently transformed into a linguistic label (i.e. ''Low'', ''Medium'', ''High'' and their intersections ''Low-Medium'' and ''Medium-High''). The output is a discretized (fuzzyfied) dataset that is only used to generate the FPs. A FP is a large set of features whose fuzzyfied pattern is correlated with a specific class (or a biological state of interest). The union of all FPs forms the set of selected features (step 1 in Figure 1). At the second step, the selected features are prioritized using a RF-based classifier. This is achieved by using a modified RF algorithm that helps reducing the risk of considering redundant features for the node splitting process, improving the accuracy of the decision trees and the final rank of the features.

The discretization and selection steps
As shown in Figure S1, the full dataset is discretized into fuzzy labels through the application of membership functions (MFs). A MF is a curve that defines, for each feature in the data set, how each numerical value in the input space is mapped to a membership value or linguistic label (i.e., ''Low'', ''Medium'' and ''High''). In the DFP package, two types of membership functions are used: i) the polynomial approximation of a Gaussian membership function to model the range of 'normal' expression levels of a gene and ii) the polynomial approximation of two sigmoidal membership functions, which are able to specify asymmetric membership functions for the 'low' and 'high' expression levels. After the discretization step, the FP for each class (or outcome) is computed by selecting the genes with highly frequent discretized label in at least one class ( Figure S1). The discretization and selection step is based on two parameters (zeta, piVal). The zeta parameter is a threshold used in the membership functions to label the float values with a discrete value, and is thus important for the fuzzy discretization process. The parameter piVal, on the other hand, specifies the percentage of values of a class to determine the FPs. Essentially, these two parameters influence the number of features included in each FP. Smaller values of zeta and bigger values of piVal result in smaller FPs containing less features. While too small FPs can resolve in some empty FPs, bigger FPs might include less relevant features. In our experiments, we have preferred to work with bigger FPs by using the parameter configurations shown in Table S1. Since in FPRF the predictive power of the features is evaluated in the RF-based classifier, the less informative features will rank low in the final list. Nevertheless, nominally less predictive features that become Figure 1. Feature selection and prioritization schema. The feature selection step is based on a fuzzy pattern discovery method implemented into the R-package DFP. This method is able to identify the most relevant class-specific features, forming a fuzzy pattern, for each class (A). The selected features are then ranked using a modified RF procedure that exploits the fuzzy patterns to improve the performance of the decision trees grown from large subset of samples and features (B). The RF algorithm works on the gene expression dataset given by the union of all features selected in the first step. Furthermore, the knowledge of the fuzzy-based feature signatures is exploited in order to have a different random selection procedure. For each node, the subset of features used for the splitting process is composed by a random selected feature from each fuzzy feature pattern. doi:10.1371/journal.pone.0107801.g001 Feature Selection and Prioritisation from OMICs Data PLOS ONE | www.plosone.org important in combination with others will have a higher rank in the output.

The RF-based feature ranking method
We used a RF-based classifier to rank the features selected by the fuzzy pattern discovery method. The RF implementation utilized is based on unbiased classification trees, as implemented in the ctree function in the R package party [14,15]. The feature importance is usually evaluated through methods such as the Gini importance and the ''mean decrease in accuracy'' or ''permutation'' test, available in the package randomForest [16]. Similarly, a permutation importance measure for cforest is available in party. Since, the Gini importance criterion may lead to biased results [17,18], we used the permutation accuracy importance score to evaluate the selected features.
Furthermore, the package party was modified in order to introduce a new RF procedure that exploits the information of FPs to improve the accuracy of the decision trees. At each node, the standard RF procedure, selects M variables at random and searches for the best split over them. This procedure is referred to as node splitting process. The new RF procedure simply replaces the random selection of M features with a new process that picks one random relevant feature from each fuzzy pattern. The basic idea is to increase the number of relevant features selected for the node splitting process, restricting the random feature selection from the FPs. Indeed, we observed by internal studies that the RF models built with the proposed RF procedure are significantly better than those obtained with the standard procedure. In a more detail, the random subset of features at each node splitting is built as follows: Where f ij is the feature randomly selected from the FP i and the index j[f1,:::,kg. k is the number of genes in FP i , and n is the number of FPs ( = #number of classes). The number of trees used by the RF classifiers to rank the selected features is a relevant parameter, and we used 1000 trees for all the data sets. This value was also used to evaluate the accuracy of all trained RF-based classifiers. The methods varSelRF and Boruta were used with default parameters.

Datasets analyzed
Four multi-class gene expression microarray data sets were analyzed to evaluate the performances of FPRF. The first data set is a compendium of human peripheral blood mononuclear cells (PBMC) samples, consisting of seven classes, which were generated by integrating multiple independent publicly available series (Table S2) from the NCBI GEO repository (http://www.ncbi. nlm.nih.gov/geo/). The second data set is a series of transcriptomics profiles of bone marrow cells (BM) from patients with different subtypes of acute lymphoblastic leukemia (ALL) [19] (http://www.stjuderesearch.org/site/data/ALL1). In the original study, the dataset Leukemia has been divided into six diagnostic groups (BCR-ABL, E2A-PBX1, Hyperdiploid.50, MLL, T-ALL and TELL-AML), and one that contains samples that did not fit into any one of the above groups. But, in our study we preferred to consider only those samples that were belonging to one of 6 categories (276). In addition, two data sets from the IMPROVER challenge [20], profiling respectively lung cancer (four classes,, GSE43580) and psoriasis specimens (three classes, GSE13355 and GSE14905), respectively, were also analyzed. The preprocessing of all the data sets has followed a similar procedure. First, the raw data (.CEL files) were imported into R v. 3.0.0 and their quality were checked with the packages AffyQCReport (http://www. bioconductor.org/packages/2.12/bioc/html/affyQCReport.html.) and affyPLM (http://www.bioconductor.org/packages/2.12/ bioc/html/affyPLM.html). Subsequently, the background correction, normalization and summarization of the gene expression values were performed according to the RMA algorithm implemented in the package affy [21]. Alternative CDFs v.18 (http:// brainarray.mbni.med.umich.edu/brainarray/default.asp) were used for probe annotations based on the Ensembl Gene (http:// www.ensembl.org/index.html) or the Entrez Gene (http://www. ncbi.nlm.nih.gov/gene) databases. For the PBMC compendium, the technical biases associated with the series and the platform (different Affymetrix chipsets) were corrected using the ComBat algorithm [22] implemented in the R sva package [23,24]. Finally, the genes with low variability across the samples were eliminated.
Testing and assessment of the results Figure 2 depicts the general schema used to evaluate and compare our method with varSelRF and Boruta. After having randomly selected and left aside the 30% of the data, the remaining 70% of the data was used to generate the sets of selected, relevant features (or genes). Since our method runs feature selection and prioritization, the generated ranked lists were cut-off at different points (or lengths) in order to produce lists of features of different sizes, referred to as n-top ranked feature lists. Next, the classification accuracy of each n-top ranked feature list and the feature lists obtained with varSelRF and Boruta were evaluated. For each list of features, a RF classifier was first trained on the randomly selected 70% of the data set, and then tested on the remaining 30% of the data, in order to get the post-selection classification accuracy. This process was repeated 30 times in order to assess the resulting classification metrics and the stability of the selected features.
The execution time for each run was recorded. All the analyses were performed with a quad-core Intel Core i7 3.4 GHz and 32GB DDR3 RAM running Apple Mac OS X v.10.9.2 and R 3.0.2.

Evaluation Metrics
The evaluation metrics used in this study aimed at assessing the classification accuracy and the stability of the compared methods. The stability was evaluated by comparing the lists of features selected by each method over 30 bootstrap iterations. From these, we looked for significantly self-consistent features [10], that were selected more frequently by bootstrap iterations than what would be expected at random. For each method, we computed the ratio between the number of self-consistent features and the total number of features selected over the 30 bootstrap iterations.
The post-classification accuracy was estimated by the F-score and the G-mean metrics, which are particularly appropriate when unbalance multi-class problems are considered [25]. The F-score is based on the F-measure, which is calculated as follows: Where R i and P i are the recall and the precision, respectively, for the i th class, and k is the number of class labels. A high Fmeasure value guarantees that both recall and precision are reasonably high. The extended F-measure metric for the multiclass case is described as follows: The G-mean function is instead defined as the geometric mean of the recalls across all the classes.
In addition to these two main metrics, we also report the overall accuracy.

Results and Discussion
We compared the novel feature selection and prioritization method FPRF with two common RF-based feature selection methods, varSelRF and Boruta. We used publicly available gene expression datasets, and evaluated the classification performance and the stability of the selected features following the schema illustrated in Figure 2.

Empirical evaluation -accuracy
A common method for the assessment and tuning of feature selection methods consists in evaluating the accuracy/error of a classifier that is trained using only the selected features. At each bootstrap iteration, the features that were selected from the training set were used to model a RF-based classifier on the test set. The test set contained samples that were not present in the corresponding training set, and therefore not considered during the feature selection step. Since our method selects and ranks the features, we trained different RF-based classifiers on the training set, selecting different cut points: the first 2, 3, 4, 5, 10, 20, 30, 50, 150, 200 and 250 top-ranked features. We then compared the classification performance of the classifiers FPRF.2-250, with those obtained using the features selected from varSelRF and Boruta. The idea was to compare our method with Boruta and varSelRF, and simultaneously find the best cut points for each analyzed datasets. To impartially assess the classification performance, we used three extended measures, namely, Acc, G-mean (G) and F-score (F), as defined in the Material and Methods. Table 1 reports the results of varSelRF and Boruta, along with the different n-top ranked feature lists generated by FPRF on the four analyzed data sets. The evaluation metrics were averaged over 30 bootstraps. The RF-based classifiers that were trained with less than or equal to 50-top ranked feature lists exhibited the best classification performance among the different n-top ranked  (Table 2). In the Lekumia dataset the Accuracy, Fscore and G-mean values compiled for FPRF.10 and FPRF.20 are almost always bigger than the same values compiled for FPRF.3-5 and FPRF.30-250. Of note, the Leukemia dataset was sensitive to class imbalance, while other datasets were not, as shown by the difference between the Accuracy as well as the F-score and Gmean values. An Accuracy value larger than the F-score or the Gmean value indicates that the classifier is significantly affected by imbalanced class distribution, which is evident with the RF-based classifiers obtained from varSelRF and Boruta. However, this class imbalance problem does not apparently affect the classification performance of the RF-classifiers obtained using FPRF.10 and FPRF.20. In Lung Cancer dataset, Boruta performs better than our RF-based classifiers achieving 99% of accuracy. However, using only the first four top ranked features, the accuracy of the FPRF.4 classifier reaches 98%.
Our results confirm the recent observations of Kursa [10], where the robustness of different RF-based gene selection methods was evaluated in terms of accuracy and stability. The high accuracy of the generated gene sets is due to the nature of RFbased classifiers. They can handle a large number of noisy features without a significant increase in error. Therefore, although the classification performance is relevant for evaluating the quality of the selected feature sets, it is alone not sufficient to provide a reliable assessment of the selection quality.

Empirical evaluation -stability
For each method, the feature stability was assessed using the sets of selected features over 30 bootstrap iterations. The stability of the different n-top ranked feature sets were compared with those obtained with varSelRF and Boruta. For each method, we identified the set of stable features by applying a recently described self-consistency-based method [10]. Table 3 reports the ratios of the self-consistent features to the total number of different features selected over the 30 bootstrap iterations for each data set and method considered. The 2, 3, 4, 5, 10 20 top-ranked features provided by our method exhibit better robustness for all data sets. This is particularly true in the case of the Lung Cancer and Psoriasis datasets, where the observed stability values are much higher than those achieved by Boruta.
In Figure 3, we evaluated the stability and accuracy metrics together rather than individually. This figure allows to easily identify the methods providing the best trade-offs with respect to the two selected metrics. The methods providing the best tradeoffs can be found in the right top corner; here there are all the methods which are strictly better on at least one metric. For instance, in the case of the Lung Cancer dataset, we can observe that Boruta provides high accuracy (99%) and very low stability (22%). While the FPRF.4 provides high accuracy (98%) and better stability (50%).

Execution time
The average running time of the methods benchmarked in this study is reported in the Table 4. The fastest method is FPRF, when considering the sum of the execution time for both the feature selection and the prioritization steps. Instead, the slowest method is Boruta, especially for the PBMC data set, which consists of many samples. The varSelRF algorithm required less execution time than Boruta, but it is systematically slower than FPRF.
Biological significance of the selected feature lists Next, we evaluated the ability of FPRF to select biologically relevant features. The lists of genes selected from each dataset are given as Data S1. Leukemia is a hematological neoplasm, in which the bone marrow generates abnormal, rapidly growing white blood cells. As expected, genes involved in acute myeloid leukemia along with other cancer-related genes, such as breast and ovarian cancers, were among the top ranked genes. Moreover, genes related to T-and B-cell activation and differentiation, as well as leukocyte activation, were also retrieved. Interestingly, FPRF selected several genes coding for proteins located in the cell membrane, suggesting that this approach might provide a valuable additional tool in leukemia differential diagnosis, which is currently dependent on flow cytometric analysis of patterns and intensity of antigen expression on the cell membrane to reach a definitive diagnosis [26]. The genes selected in the lung cancer data set are involved in epithelial differentiation. In addition, a number of known genes currently used for lung cancer subtyping were retrieved in the top positions of the ranked output list, such as KRT5 and TP63 [27].Psoriasis is a chronic inflammatory skin disease characterized by highly inflamed and sharply demarcated scaly skin lesions or plaques, which histologically show marked epidermal hyperplasia, prominent inflammatory infiltrate and increased vascularization. T-cells play a key role, and it is becoming increasingly more apparent that a pathogenic crosstalk between innate and adaptive cells underlie the dysregulated immune response that leads to abnormal epidermal proliferation [28]. As expected, the top ranked genes in the psoriasis data include members of relevant immune signaling pathways such as the IL-1 family of cytokines (IL-36G), the IFN-gamma (OASL, STAT1, CXCL1) and IL-17 (CCL20, CXCL8) signaling pathways. Furthermore, antimicrobial peptides (AMPs, S100A7A, S100A12) that bridge the immune system and the epidermal component, and kallikrein-related peptidases (KLK9) that induce AMPs were retrieved. FRPF also selected epidermal differentiation markers (members of the small proline-rich protein family (SPRR2) and the late cornified envelope family (LCE3D)), and genes associated with keratin regulation (KRT77), similar to previous studies [29,30].Peripheral blood mononuclear cells (PBMC) represent a heterogeneous set of circulating immune cells, mainly including lymphocytes (T-, B-and NK-cells), monocytes and macrophages. The selected genes in this data set indeed highlighted relevant pathways including immune response, T-cell activation and differentiation, and host-virus interactions.

Conclusions
We have developed a new method, FPRF, for fast feature selection and prioritization that ensures the identification of relevant and stable sets of features from high-throughput transcriptomics data. By evaluating FPRF on different multi-class microarray data sets, we show that it is able to reach a high classification power, while gaining stability over other popular algorithms.  Supporting Information Figure S1 The fuzzy pattern discovery method implemented in the R package DFP [11] is described in details.
(TIF) Code S1 Supplementary methods and the implementation/evaluation in R of the FPRF method. (ZIP) Data S1 Supplementary Data containing the list of genes selected for each dataset and their ranks. (XLS)