Accurate and fast feature selection workflow for high-dimensional omics data

doi:10.1371/journal.pone.0189875

Fig 1.

Proposed workflow for FS including a filtering step with univariate and/or multivariate approaches, followed by a wrapper approach (recursive feature elimination).

Fig 2.

(A) Correlation matrix for the 544 physicochemical properties (features) of the 7,391 peptides (samples) included in Dataset 2; (B) the final 20 variables remaining after the correlation-matrix filtering steps.
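As an illustration of what a correlation-matrix filter of this kind does, the following is a minimal sketch and not the authors' implementation. It assumes a greedy rule (keep a feature only if its absolute Pearson correlation with every feature already kept is below a cutoff). The 0.9 threshold, the function names and the toy feature names are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant feature: treated as uncorrelated in this sketch
    return cov / sqrt(var_x * var_y)


def correlation_matrix_filter(columns, threshold=0.9):
    """Greedily keep features that are not highly correlated with kept ones.

    columns:   dict mapping feature name -> list of values (one per sample).
    threshold: assumed cutoff on |r|; not a value taken from the paper.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data with hypothetical feature names; "mass_x2" duplicates "mass".
features = {
    "mass": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mass_x2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "charge": [1.0, -1.0, 2.0, -2.0, 0.5],
}
selected = correlation_matrix_filter(features)
print(selected)
```

In this toy case the redundant column is dropped and the two weakly correlated columns survive. The result of a greedy filter depends on the order in which features are visited, so a real pipeline would need a stated ordering rule.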

Fig 3.

(A) Proportion of variance and (B) cumulative variance of principal components for the analysis of Dataset 1.
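The two quantities plotted in panels (A) and (B) can be derived from the eigenvalues of the principal components. The sketch below shows the arithmetic only; the eigenvalues are made-up numbers, not values from Dataset 1, and `variance_profile` is a hypothetical helper name.

```python
from itertools import accumulate


def variance_profile(eigenvalues):
    """Return (proportion, cumulative) of variance per principal component.

    eigenvalues: variances along each component, sorted in decreasing order.
    """
    total = sum(eigenvalues)
    proportion = [ev / total for ev in eigenvalues]
    cumulative = list(accumulate(proportion))
    return proportion, cumulative


# Illustrative eigenvalues only (assumption, not data from the paper).
eigenvalues = [4.0, 2.0, 1.0, 1.0]
proportion, cumulative = variance_profile(eigenvalues)
print(proportion)
print(cumulative)
```

A common use of the cumulative curve is to keep the smallest number of components whose cumulative variance exceeds a chosen fraction; which fraction is appropriate is a modelling decision not fixed by this sketch.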

Table 1.

Benchmark of the SVM regression model for Dataset 2 applying different FS methods: (SVM) no feature selection, (X2) univariate correlation alone, (CM) correlation matrix filtering, and (RFE) wrapper recursive feature elimination.

The figures labelled CV3, CV7 and CV10 correspond to the number of iterations in the cross-validation steps during the RFE feature selection.

Fig 4.

Error plot of the predicted isoelectric point vs. the experimental isoelectric point (Dataset 2): (SVM) without applying any FS or cross-correlation step; (X2-CM-SVM) adding correlation filters as the only steps for feature selection; (RFE-SVM-CV3) recursive feature elimination with three iterations of cross-validation, combined with SVM; (X2-CM-RFE-SVM-CV3) considering the full FS workflow.

Fig 5.

Accuracy vs. feature selection combination for expression datasets (1, 3, 4, 5, 6 and 7).

(RF) random forest without a previous feature selection step; (X2-CM-RFE-RF) random forest classification after a feature selection step using the univariate correlation filter with the correlation matrix filter and recursive feature elimination; (X2-PCA-RFE-RF) random forest classification after a feature selection step using the univariate correlation filter with principal component analysis and recursive feature elimination. All methods include an internal 10-fold cross-validation step. All accuracy metrics were estimated following the approach previously reported by Pochet et al. [31], where 20-fold randomized test data were used to summarize the accuracy of the FS combination.

Table 2.

Benchmarking of the random forest model (classification) for Dataset 1, when different FS methods are applied: (RF) random forest only, (RFE) wrapper recursive feature elimination with 10-times internal cross-validation, (PCA) principal component analysis, (X2) univariate correlation filtering or (CM) correlation matrix filter.

Each method is applied 20 times with randomized and class-balanced training datasets. The accuracy values provided correspond to the average value.
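The repeated evaluation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: "class-balanced" is taken here to mean an equal number of training samples drawn at random from each class, and `evaluate` is a hypothetical callback that would train a model on the training indices and return its accuracy on the test indices.

```python
import random
from collections import defaultdict
from statistics import mean


def balanced_split(labels, per_class, rng):
    """Return (train_idx, test_idx) with `per_class` training samples per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train = []
    for label in sorted(by_class):
        # per_class must not exceed the size of the smallest class
        train.extend(rng.sample(by_class[label], per_class))
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return train, test


def repeated_accuracy(labels, evaluate, per_class, repeats=20, seed=0):
    """Average accuracy over `repeats` randomized, class-balanced splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train, test = balanced_split(labels, per_class, rng)
        scores.append(evaluate(train, test))
    return mean(scores)


# Toy labels and a placeholder evaluator that always predicts class "a".
labels = ["a"] * 6 + ["b"] * 4


def always_a(train, test):
    return sum(labels[i] == "a" for i in test) / len(test)


average = repeated_accuracy(labels, always_a, per_class=2)
print(average)
```

In practice `always_a` would be replaced by a function that fits the chosen classifier (for example a random forest after feature selection) on the training indices and scores it on the held-out indices.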

Fig 6.

Visualization of the classification process using the first two principal components (PC1 and PC2) of the original data before (A, C, E) and after (B, D, F) applying the following FS workflow: univariate correlation (X2) with correlation matrix filter (CM), followed by recursive feature elimination (RFE) wrapped with random forest (RF). The figure shows the class distribution for Dataset 1 (A, B), Dataset 3 (C, D) and Dataset 4 (E, F).

Table 3.

Performance comparison between the proposed approach (X2-PCA-RFE-RF) and the method reported by Li et al. [22].

The computer used in the original manuscript was an Intel(R) Core(TM) i5-4690 @ 3.5 GHz CPU, with 16 GB of RAM. In this study, we used an Intel(R) Core(TM) i5-4200 @ 2.5 GHz CPU, with 16 GB of RAM.
