Figure 1.
Feature selection and prioritization schema.
The feature selection step is based on a fuzzy pattern discovery method implemented into the R-package DFP. This method is able to identify the most relevant class-specific features, forming a fuzzy pattern, for each class (A). The selected features are then ranked using a modified RF procedure that exploits the fuzzy patterns to improve the performance of the decision trees grown from large subset of samples and features (B). The RF algorithm works on the gene expression dataset given by the union of all features selected in the first step. Furthermore, the knowledge of the fuzzy-based feature signatures is exploited in order to have a different random selection procedure. For each node, the subset of features used for the splitting process is composed by a random selected feature from each fuzzy feature pattern.
Figure 2.
The full dataset is a matrix with thousands of features (e.g. genes) in rows and tens or hundreds of samples (belonging to different classes) in columns. For each sample, the outcome (class) is given. The dataset is randomly divided into training and test sets using a stratified random selection (1). Within the training set, relevant features are selected using the compared methods (2). The FPRF method identifies a wide set of relevant features using a fuzzy pattern discovery technique and ranks them applying a RF-based procedure (3). The most n-relevant features are then selected with n = 30, 50, 100, 150 and 200 (4). The different sets of features are used to evaluate the stability and the corresponding classification performance. For each set of selected features an RF-based classifier is trained on the training set (5). After training, the classifiers are asked to predict the outcome of the test set patients (6). The predicted outcome is compared with the true outcome and the number of correctly classified samples is noted. Steps 1–6 are repeated 30 times, and the resulting evaluation metrics are obtained by averaging over the 30 runs.
Table 1.
Overview of the analyzed datasets.
Table 2.
Classification performance.
Table 3.
Selection consistency analysis.
Figure 3.
The accuracy and the stability are evaluated together rather than individually for each data set and each analysis method. x-axis shows the percentage stability while y-axis shows the percentage accuracy (G-mean). Moreover, the size of the circles denotes the mean number of selected features over the 30 runs. Each subplot allows to analyze and identify the methods having the beset trade-off between a accuracy and stability for a specific dataset. (.a) trade-offs in Leukemia; (.b) trade-offs in Lung Cancer; (.c) trade-offs in Psoriasis; (.d) trade-offs in PBMC.
Table 4.
Running time.