A refined approach for evaluating small datasets via binary classification using machine learning

doi:10.1371/journal.pone.0301276

Fig 1.

Synthetic dataset creation process.

In the first step, n/2 data points are created for the first class, where n is the target dataset size. Each data point is given 10 features whose values are drawn from a Gaussian distribution with a mean of 0 and a standard deviation of 1. After these points are added to the dataset, another n/2 data points are created for class 2, each with 10 features drawn from a Gaussian distribution with a mean of 0.2 and a standard deviation of 1. These points are also added, and the resulting dataset is shuffled.
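The creation process described in this caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `make_synthetic_dataset`, the labels 0 and 1 for the two classes, and the fixed seed are assumptions made for the example.

```python
import numpy as np

def make_synthetic_dataset(n, n_features=10, shift=0.2, seed=0):
    """Two-class dataset: class 0 ~ N(0, 1), class 1 ~ N(shift, 1) per feature."""
    rng = np.random.default_rng(seed)  # seed is an assumption, for reproducibility
    half = n // 2

    # First class: n/2 points, each feature drawn from N(0, 1).
    X_first = rng.normal(loc=0.0, scale=1.0, size=(half, n_features))
    # Second class: n/2 points, each feature drawn from N(0.2, 1).
    X_second = rng.normal(loc=shift, scale=1.0, size=(half, n_features))

    X = np.vstack([X_first, X_second])
    y = np.concatenate([np.zeros(half, dtype=int), np.ones(half, dtype=int)])

    # Shuffle features and labels with the same permutation.
    order = rng.permutation(len(y))
    return X[order], y[order]

X, y = make_synthetic_dataset(100)
```

With a mean shift of only 0.2 per feature, the two classes overlap heavily, which is what makes the dataset a useful test of whether a score differs from chance.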

Fig 2.

Structure of the nested cross-validation method in this study.

It consists of an outer CV to evaluate the performance of the trained ML model and an inner CV to find the optimal (hyper)parameters and features of the model. In the innermost loop, feature selection, standardisation, and oversampling are applied to the training subset of the inner CV before passing the data to the SVM classifier. The combination of hyperparameters that achieves the highest score in the inner CV is used to retrain the model on the outer CV training data and predict its test data. The averaged results of all iterations of the outer CV are returned as the result.
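The nested structure can be sketched with scikit-learn as below. This is a hedged illustration rather than the study's implementation: the parameter grid, the fold counts, the `SelectKBest` selector and the `make_classification` stand-in data are all assumptions. The oversampling step is omitted here because a plain scikit-learn `Pipeline` does not accept resamplers; it would need a pipeline that supports them, such as the one in imbalanced-learn.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption); any small binary dataset would do.
X, y = make_classification(n_samples=100, n_features=10, n_informative=4,
                           random_state=0)

# Steps inside the pipeline are refit on each training subset only,
# so no information leaks from validation or test folds.
pipe = Pipeline([
    ("select", SelectKBest(f_classif)),  # feature selection
    ("scale", StandardScaler()),         # standardisation
    ("svm", SVC()),                      # classifier
])

# Hypothetical search space for the inner CV.
param_grid = {"select__k": [3, 5, 10], "svm__C": [0.1, 1, 10]}
mcc_scorer = make_scorer(matthews_corrcoef)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner CV: pick the best combination, then refit on the outer training data.
search = GridSearchCV(pipe, param_grid, scoring=mcc_scorer, cv=inner_cv)

# Outer CV: score the refitted model on each held-out outer test fold.
outer_scores = cross_val_score(search, X, y, scoring=mcc_scorer, cv=outer_cv)
mean_mcc = outer_scores.mean()
```

Passing the `GridSearchCV` object itself to `cross_val_score` is what makes the procedure nested: the whole search is repeated inside every outer training fold.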

Fig 3.

Dependence of the probability obtained with the permutation test using the MCC on the number of data points for different numbers of permutations.

The probability obtained is depicted as a function of the number of data points for 25 (orange), 50 (blue), 75 (red), and 100 permutations (green). The coloured area represents the 95% confidence interval calculated over all datasets using bootstrapping, that is, by repeatedly resampling the results with replacement. The dotted grey line indicates a probability of 5%.
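A generic permutation test can be sketched as follows. It is an illustration under stated assumptions: the p-value uses the common (count + 1) / (permutations + 1) estimate, which may differ from the formula used in the study, and `mean_gap_score` is a hypothetical stand-in for the cross-validated MCC that the study would compute.

```python
import numpy as np

def permutation_p_value(score_fn, X, y, n_permutations=100, seed=0):
    """Estimate how often shuffled labels score at least as well as the real ones."""
    rng = np.random.default_rng(seed)
    observed = score_fn(X, y)
    # Re-score the same data with the labels randomly permuted each time.
    permuted = np.array([score_fn(X, rng.permutation(y))
                         for _ in range(n_permutations)])
    # Assumed estimator: the +1 terms count the observed labelling itself.
    p_value = (np.sum(permuted >= observed) + 1) / (n_permutations + 1)
    return observed, p_value

def mean_gap_score(X, y):
    """Hypothetical score: absolute difference between the two class means."""
    return abs(X[y == 1].mean() - X[y == 0].mean())

# Toy one-feature data with clearly separated classes (assumption).
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0.0, 1.0, 20), rng.normal(5.0, 1.0, 20)])
y = np.array([0] * 20 + [1] * 20)

observed, p_100 = permutation_p_value(mean_gap_score, X, y, n_permutations=100)
_, p_25 = permutation_p_value(mean_gap_score, X, y, n_permutations=25)
```

Under this estimator the smallest attainable probability is 1 / (n_permutations + 1), roughly 0.038 for 25 permutations and 0.0099 for 100, so the number of permutations limits how far below the 5% line a result can fall.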

Fig 4.

(a) Dependence of the normalised prediction score and (b) the probabilities of the corresponding permutation tests on the number of data points using MCC, F₁ and recall.

The MCC score was normalised to the same range of values from 0 to 1 as the other two metrics. The coloured area represents the 95% confidence interval calculated by bootstrapping over all datasets.
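The normalisation and the confidence band can be illustrated as below. Both parts are assumptions made for the sketch: the caption does not state the mapping, so the usual linear rescaling (MCC + 1) / 2 is used, and the interval is a simple percentile bootstrap of the mean. The per-dataset MCC values are made up for the example.

```python
import numpy as np

def normalised_mcc(mcc):
    """Map MCC from [-1, 1] onto [0, 1] (assumed linear rescaling)."""
    return (np.asarray(mcc, dtype=float) + 1.0) / 2.0

def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each row holds indices for one resample drawn with replacement.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(means, [alpha, 1.0 - alpha])
    return lower, upper

# Hypothetical MCC scores, one per dataset of a given size.
mcc_per_dataset = [0.10, 0.30, -0.20, 0.50, 0.40, 0.00, 0.20, 0.35]
scores = normalised_mcc(mcc_per_dataset)
lower, upper = bootstrap_ci(scores)
```

After rescaling, an MCC of 0 (chance level) corresponds to 0.5, which puts it on the same 0 to 1 axis as F₁ and recall.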

Fig 5.

Mean value of (a) the MCC and (b) the probability of five nCV runs versus the MCC of an rnCV consisting of five replicates.

The red line represents the bisector for clarity and the connecting line between the data points is only a guide for the eye.

Table 1.

Minimum, maximum and standard deviation of the MCC scores for the individual nCV components of an rnCV.

The table also gives the probability of the null hypothesis obtained when evaluating with rnCV, contrasted with the average probability obtained when five separate nCV runs, rather than a single rnCV, are used in the permutation test. Assessments are performed on random subsets of the CESAR dataset.

Table 2.

Scores of the ACC, BA, precision, recall, F₁-Score, MCC, AUC, and κ for rnCV on random subsets of the CESAR dataset.

Table 3.

Probabilities of the ACC, BA, precision, recall, F₁-Score, MCC, AUC, and κ for rnCV on random subsets of the CESAR dataset.

Table 4.

Confusion matrices of the ACC, F₁-Score, and MCC for rnCV on random subsets of the CESAR dataset.

Table 5.

Scores of the MCC for rnCV without hyperparameter tuning, without feature selection, or without both, on random subsets of the CESAR dataset.

Table 6.

Probabilities of the MCC for rnCV without hyperparameter tuning, without feature selection, or without both, on random subsets of the CESAR dataset.
