Fig 1.
Proposed workflow for FS including a filtering step with univariate and/or multivariate approaches, followed by a wrapper approach (recursive feature elimination).
Fig 2.
(A) Correlation matrix for the 544 physicochemical properties (features) of the 7,391 peptides (samples) included in Dataset 2; (B) the final 20 variables after the correlation-matrix filtering steps.
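The correlation-matrix filtering summarized in this caption can be illustrated with a minimal sketch. It assumes a simple greedy rule (a feature is kept only if its absolute Pearson correlation with every feature already kept does not exceed a cutoff), which is not necessarily the rule used in this study. The function names, the cutoff of 0.9 and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant feature: treated as uncorrelated (assumption)
    return sxy / sqrt(sxx * syy)


def correlation_filter(features, cutoff=0.9):
    """Greedy filter: keep a feature only if |r| with every kept feature <= cutoff.

    `features` maps a feature name to its list of values across samples.
    The cutoff value is an illustrative assumption, not taken from the paper.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: 3 features measured on 5 samples.
toy = {
    "mass": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mass_x2": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with "mass"
    "charge": [1.0, -1.0, 2.0, 0.0, 1.0],
}
selected = correlation_filter(toy, cutoff=0.9)
print(selected)
```

In this toy case the redundant feature "mass_x2" is dropped because it duplicates the information in "mass", while "charge" is retained.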
Fig 3.
(A) Proportion of variance and (B) cumulative variance of principal components for the analysis of Dataset 1.
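The two quantities plotted in this figure follow directly from the variances of the principal components. A minimal sketch is given below. The variance values are hypothetical, and the helper `components_for` with its target threshold is an illustrative assumption, not a criterion stated in this study.

```python
from itertools import accumulate


def variance_profile(component_variances):
    """Return (proportion, cumulative) of variance for each principal component."""
    total = sum(component_variances)
    proportion = [v / total for v in component_variances]
    cumulative = list(accumulate(proportion))
    return proportion, cumulative


def components_for(cumulative, target):
    """Smallest number of components whose cumulative variance reaches `target`."""
    for count, value in enumerate(cumulative, start=1):
        if value >= target:
            return count
    return len(cumulative)


# Hypothetical variances (eigenvalues) of four principal components.
variances = [4.0, 3.0, 2.0, 1.0]
proportion, cumulative = variance_profile(variances)
n_components = components_for(cumulative, target=0.8)
print(proportion, cumulative, n_components)
```

Panel (A) corresponds to `proportion` and panel (B) to `cumulative`; a threshold on the cumulative curve is one common way to decide how many components to retain.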
Table 1.
Benchmark of the SVM regression model for Dataset 2 applying different FS methods: (SVM) no feature selection, (X2) univariate correlation alone, (CM) correlation matrix filtering, and (RFE) wrapper recursive feature elimination.
The labels CV3, CV7 and CV10 indicate the number of iterations in the cross-validation steps during the RFE feature selection.
Fig 4.
Error plot of the predicted isoelectric point vs. the experimental isoelectric point (Dataset 2): (SVM) without applying any FS or cross-correlation step; (X2-CM-SVM) adding correlation filters as the only steps for feature selection; (RFE-SVM-CV3) recursive feature elimination with three iterations of cross-validation, combined with SVM; (X2-CM-RFE-SVM-CV3) considering the full FS workflow.
Fig 5.
Accuracy vs. feature selection combination for expression datasets (1, 3, 4, 5, 6 and 7).
(RF) random forest without a previous feature selection step; (X2-CM-RFE-RF) random forest classification after feature selection using the univariate correlation filter with the correlation matrix filter and recursive feature elimination; (X2-PCA-RFE-RF) random forest classification after feature selection using the univariate correlation filter with principal component analysis and recursive feature elimination. All methods include an internal 10-fold cross-validation step. All accuracy metrics were estimated following the approach previously reported by Pochet et al. [31], where 20-fold randomized test data were used to summarize the accuracy of the FS combination.
Table 2.
Benchmarking of the random forest model (classification) for Dataset 1 when different FS methods are applied: (RF) random forest only, (RFE) wrapper recursive feature elimination with 10-fold internal cross-validation, (PCA) principal component analysis, (X2) univariate correlation filtering and (CM) correlation matrix filter.
Each method is applied 20 times with randomized and class-balanced training datasets. The accuracy values reported are the averages over these 20 runs.
Fig 6.
Visualization of the classification process using the first two principal components (PC1 and PC2) from the original data before (A, C, E) and after (B, D, F) applying the following FS workflow: univariate correlation (X2) with correlation matrix filter (CM), followed by recursive feature elimination (RFE) wrapped with random forest (RF). The figure shows the class distribution for Dataset 1 (A, B), Dataset 3 (C, D) and Dataset 4 (E, F).
Table 3.
Performance comparison between the proposed approach (X2-PCA-RFE-RF) and the method reported by Li et al. [22].
The computer used in the original manuscript was an Intel(R) Core(TM) i5-4690 @ 3.5 GHz CPU, with 16 GB of RAM. In this study, we used an Intel(R) Core(TM) i5-4200 @ 2.5 GHz CPU, with 16 GB of RAM.