Fig 1.
Study flow.
Fig 2.
The relationship between the number of input features and the variability in estimated and test model performance across 1,000 different random data splits.
A, B, C, and D: Each colored line indicates a result from a single random training-test set split, and each thick black line shows the average of the results from 1,000 data splits in relation to the number of features: mean cross-validated AUC without (A) or with undersampling (B), and test AUC without (C) or with undersampling (D). On average, AUC increased at first and later decreased as the number of features increased. However, without averaging, the AUC in some trials did not decrease even at higher numbers of features, widening the variability in AUC across data splits as the number of features increased. E and F: Each colored line indicates the average of the percentage AUC differences between CV and testing (i.e., the generalization gap) from 1,000 data splits in relation to the number of features, for datasets without (E) or with undersampling (F). The vertical lines indicate 95% confidence intervals. Note that all of these trends were more pronounced with the difficult task (i.e., meningioma grading) and with undersampling.
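The repeated-split procedure behind Fig 2 can be sketched as below. This is a minimal illustration on synthetic data, not the study's pipeline: the radiomics features, meningioma labels, classifier, and split count are stand-ins, and the 1,000 splits are reduced to 50 to keep the sketch fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical dataset standing in for the radiomics features and labels
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

gaps = []
for seed in range(50):  # the study used 1,000 random splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    # Mean cross-validated AUC estimated within the training set
    cv_auc = cross_val_score(model, X_tr, y_tr, cv=5, scoring="roc_auc").mean()
    # AUC of the refitted model on the held-out test set
    test_auc = roc_auc_score(
        y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    )
    # Percentage AUC difference between CV and testing (generalization gap)
    gaps.append(100 * (cv_auc - test_auc) / cv_auc)

print(f"mean gap: {np.mean(gaps):.2f}%, SD across splits: {np.std(gaps):.2f}%")
```

Repeating the loop over many feature counts (and with undersampling of the majority class in the training folds) would reproduce the curves in panels A-F.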
Fig 3.
Model performance estimate and generalization gap in 1,000 different training-test set pairs according to the sample size and the level of task difficulty.
Each point indicates a model performance estimate in the training set (X axis) and its gap from the performance in the test set (Y axis) for a single training-test set pair. How loosely the datapoints cluster shows that the variability in both the model performance estimate and the gap between CV and testing was greater with the difficult task and with undersampling.
Table 1.
Model performance estimate and generalization gap according to the sample size and the level of task difficulty.
Table 2.
Model performance estimate and generalization gap according to the level of task difficulty in some representative training-test set pairs.
Fig 4.
Discrepancy between performance estimated by cross-validation in the training set and performance in the test set (i.e., generalization gap), explained by the distribution of datapoints in the feature space.
The left panel shows the mean cross-validated AUC in the training set and the AUC in the test set in three representative trials. The horizontal error bars indicate standard deviations for CV and 95% confidence intervals for testing. The right panel shows the distribution of datapoints in the space defined by the two most important radiomics features (sphericity and flatness). The blue and red dots indicate low- and high-grade meningioma cases, respectively. The straight line dividing the feature space into two areas is the decision boundary; if a datapoint is located in the blue or red area, the model predicts low- or high-grade meningioma, respectively.
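The linear decision boundary in Fig 4's right panel can be illustrated as follows. This is a hypothetical sketch: the two Gaussian clusters merely stand in for the sphericity and flatness values of low- and high-grade cases, and the classifier is an assumed logistic regression rather than the study's model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical (sphericity, flatness) values for the two classes
low = rng.normal([0.8, 0.5], 0.1, size=(50, 2))   # low-grade cases (blue)
high = rng.normal([0.6, 0.3], 0.1, size=(50, 2))  # high-grade cases (red)
X = np.vstack([low, high])
y = np.array([0] * 50 + [1] * 50)

model = LogisticRegression().fit(X, y)

# The decision boundary is the line w0*x0 + w1*x1 + b = 0; datapoints on
# either side are predicted as low- or high-grade, respectively.
w, b = model.coef_[0], model.intercept_[0]
slope, intercept = -w[0] / w[1], -b / w[1]
print(f"boundary: flatness = {slope:.2f} * sphericity + {intercept:.2f}")
```

Because the boundary is fitted only to the training datapoints, a test set drawn near the boundary can produce a test AUC well below the cross-validated estimate, which is the discrepancy the figure explains.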
Table 3.
Comparison of four methods to estimate model performance in the meningioma grading task.