Fig 1.
Summary of the BOSO algorithm.
An example dataset with 7 features is split into training and validation sets. For any given subset size K, a linear model is constructed with the training data and assessed with the validation data. The optimal features for a specific K value (green boxes) are those of the model that minimizes the validation error. For example, for K = 2, the linear model trained with the feature subset {X3, X6} is the one that minimizes the validation error. The problem of selecting the best subset of K features is formulated as a mixed-integer quadratic program (MIQP) (see Methods section) and solved with standard MIQP tools. With this MIQP approach, we directly assess all linear models that involve K features and select the one with the least validation error. The process is repeated for increasing K until an information criterion, in this case the extended Bayesian Information Criterion (eBIC), no longer improves. In this example, the minimal eBIC is attained at K = 2. The final model is obtained by Ridge regression restricted to these two selected variables.
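The workflow above can be sketched in code. This is a minimal illustration only: brute-force subset enumeration stands in for the MIQP formulation described in the Methods section, the Gaussian-likelihood form of the eBIC is assumed, and its evaluation on the validation data is an assumption of this sketch; all function names are hypothetical.

```python
# Illustrative sketch of the Fig 1 workflow; enumeration replaces the MIQP
# solver used in the actual BOSO implementation.
from itertools import combinations
from math import comb, log

import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with intercept; returns [b, w1, ..., wk]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + X @ coef[1:]


def best_subset_for_k(X_tr, y_tr, X_val, y_val, k):
    """Best k-feature subset by validation MSE (stand-in for the MIQP step)."""
    best_err, best_subset = np.inf, None
    for subset in combinations(range(X_tr.shape[1]), k):
        cols = list(subset)
        coef = fit_linear(X_tr[:, cols], y_tr)
        err = np.mean((predict(coef, X_val[:, cols]) - y_val) ** 2)
        if err < best_err:
            best_err, best_subset = err, cols
    return best_subset, best_err


def fit_ridge(X, y, alpha=1.0):
    """Ridge regression with unpenalized intercept (closed form)."""
    Xm, ym = X.mean(axis=0), y.mean()
    Xc = X - Xm
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - ym))
    return np.concatenate([[ym - Xm @ w], w])


def boso_sketch(X_tr, y_tr, X_val, y_val, gamma=0.5, alpha=1.0):
    n, p = X_val.shape
    best_ebic, best_subset = np.inf, None
    for k in range(1, p + 1):
        subset, mse = best_subset_for_k(X_tr, y_tr, X_val, y_val, k)
        # Gaussian eBIC: n*log(RSS/n) + k*log(n) + 2*gamma*log(C(p, k))
        ebic = n * log(mse) + k * log(n) + 2 * gamma * log(comb(p, k))
        if ebic >= best_ebic:  # stop once the criterion no longer improves
            break
        best_ebic, best_subset = ebic, subset
    # Final model: Ridge regression restricted to the selected features
    return best_subset, fit_ridge(X_tr[:, best_subset], y_tr, alpha)
```

Note that enumeration scales combinatorially in the number of features, which is precisely why BOSO formulates the inner step as an MIQP rather than an exhaustive search.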
Fig 2.
Performance comparison of BOSO with different feature selection algorithms using F1-score.
a) Low setting; b) Medium setting; c) High-5 setting; d) High-10 setting. Dots and bars represent, respectively, the mean and standard deviation of F1-scores across 10 random samples for the different SNR values.
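In this setting the F1-score measures support recovery: the harmonic mean of precision and recall of the selected feature set against the true non-zero coefficients. A minimal sketch of that computation (function name hypothetical):

```python
def support_f1(selected, true_nonzero):
    """F1-score of a selected feature set against the true support."""
    selected, true_nonzero = set(selected), set(true_nonzero)
    tp = len(selected & true_nonzero)   # correctly selected features
    fp = len(selected - true_nonzero)   # spurious selections
    fn = len(true_nonzero - selected)   # missed true features
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```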
Fig 3.
Performance comparison of BOSO with different feature selection algorithms using Number of non-zeros in the 4 considered problem settings.
a) Low setting; b) Medium setting; c) High-5 setting; d) High-10 setting. Dots and bars represent, respectively, the mean and standard deviation of the Number of non-zeros across 10 random samples for different SNR values. The dotted line marks the true number of non-zero coefficients (s) at each SNR value.
Fig 4.
Performance comparison of BOSO with different feature selection algorithms using False Positives in the 4 considered problem settings.
a) Low setting; b) Medium setting; c) High-5 setting; d) High-10 setting. Dots and bars represent, respectively, the mean and standard deviation of False Positives across 10 random samples for different SNR values.
Fig 5.
Performance comparison of BOSO with different feature selection algorithms using False Negatives in the 4 considered problem settings.
a) Low setting; b) Medium setting; c) High-5 setting; d) High-10 setting. Dots and bars represent, respectively, the mean and standard deviation of False Negatives across 10 random samples for different SNR values.
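The metrics of Figs 3–5 all derive from the same comparison between the estimated and true coefficient supports. A small sketch, assuming dense coefficient vectors and a hypothetical tolerance for declaring a coefficient non-zero:

```python
import numpy as np


def support_counts(beta_hat, beta_true, tol=1e-8):
    """Number of non-zeros, false positives and false negatives of an estimate."""
    sel = np.abs(np.asarray(beta_hat)) > tol    # estimated support
    true = np.abs(np.asarray(beta_true)) > tol  # true support
    return {
        "nonzeros": int(sel.sum()),
        "false_positives": int(np.sum(sel & ~true)),
        "false_negatives": int(np.sum(~sel & true)),
    }
```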
Fig 6.
Performance comparison of BOSO under different information criteria using the F1-score in the 4 considered problem settings.
a) Low setting; b) Medium setting; c) High-5 setting; d) High-10 setting. Dots and bars represent, respectively, the mean and standard deviation of F1-score across 10 random samples for different SNR values. Note here that BOSO-BIC and BOSO-eBIC obtained the same result in the low setting and, for this reason, the blue and green lines overlap in panel a.
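For reference, the criterion compared here is, in its standard Gaussian-likelihood form (Chen and Chen, 2008), an assumption made explicit below: for a model with $k$ of $p$ features, residual sum of squares $\mathrm{RSS}_k$ and $n$ samples,

```latex
\mathrm{eBIC}_{\gamma}(k) \;=\; n \log\!\left(\frac{\mathrm{RSS}_k}{n}\right) \;+\; k \log n \;+\; 2\gamma \log \binom{p}{k}, \qquad \gamma \in [0, 1],
```

where $\gamma = 0$ recovers the ordinary BIC, which explains why BOSO-BIC and BOSO-eBIC can coincide when the extra $\binom{p}{k}$ penalty does not change the selected subset size.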
Fig 7.
Prediction of Methotrexate cytotoxicity in cancer.
Using 100 random partitions of the data into training, validation and test sets: a) Pearson correlation obtained with BOSO, Forward Stepwise, Lasso and Relaxed on the test partition; b) Number of active features selected by the approaches in panel a; c) Experimental validation of IC50 values predicted by the BOSO-BIC algorithm for 5 MTX-sensitive cell lines (PF-382, P12-ICHIKAWA, JVM-2, PEER, SEM) and 5 MTX-resistant cell lines (U87MG, A498, LOUNH91, UMUC1, UMUC7). The cell lines with available GDSC IC50 values (PF-382, P12-ICHIKAWA, JVM-2, U87MG, A498, LOUNH91) were excluded from the model construction process.