Developing an ensemble machine learning study: Insights from a multi-center proof-of-concept study

doi:10.1371/journal.pone.0303217

Table 1.

Table of the clinical features of the CT dataset.

For each feature, the absolute value and its frequency are shown.

Fig 1.

Workflow for implementing machine learning algorithms in the challenge.

The same 80:20 hold-out validation scheme was used for all initially trained machine learning algorithms. Additionally, each machine learning algorithm was trained on the training sample and validated in 100 rounds of 10-fold cross-validation. The algorithms thus defined were then validated on the independent dataset. Performance was evaluated for both training validation and the independent test in terms of the Area Under the Curve (AUC), Accuracy, Sensitivity, Specificity, Precision and F1 score.
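
The six performance metrics named in this caption can be sketched in a few lines of plain Python. This is a minimal illustration, not the study's code: the function names, the 0/1 label coding and the 0.5 decision threshold are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def auc_score(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(y_true, scores, threshold=0.5):
    """Compute the six metrics; the threshold is an assumed default."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sens = safe_div(tp, tp + fn)
    prec = safe_div(tp, tp + fp)
    return {
        "AUC": auc_score(y_true, scores),
        "Accuracy": safe_div(tp + tn, len(y_true)),
        "Sensitivity": sens,
        "Specificity": safe_div(tn, tn + fp),
        "Precision": prec,
        "F1": safe_div(2 * prec * sens, prec + sens),
    }
```

Calling `evaluate(labels, scores)` once on the training-validation scores and once on the independent-test scores would give the two sets of values that the caption describes.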

Fig 2.

Workflow of the classifier Ensemble method.

The scores of the various algorithms were averaged and aggregated; they became the “features” of an ensemble machine learning model. Final performances for train and test were evaluated, and an XAI approach was implemented to explain which feature-algorithm had the greatest impact on the final predictions.
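
One way to picture the aggregation step is the sketch below, which averages each algorithm's scores over its validation rounds and arranges the averages as one feature column per algorithm. The function names and the input layout (a dictionary mapping an algorithm name to a list of per-round score lists) are hypothetical; the caption does not specify how the scores were stored or which model was fitted on them.

```python
from statistics import mean


def average_scores(rounds):
    """Average one algorithm's scores over rounds.

    `rounds` is a list of score lists, each aligned to the same samples.
    """
    n_samples = len(rounds[0])
    if any(len(r) != n_samples for r in rounds):
        raise ValueError("every round must score the same number of samples")
    return [mean(r[i] for r in rounds) for i in range(n_samples)]


def build_meta_features(scores_by_algorithm):
    """Return (algorithm names, rows) with one averaged score per algorithm per sample."""
    names = sorted(scores_by_algorithm)
    columns = [average_scores(scores_by_algorithm[name]) for name in names]
    # Transpose columns so that each row is the feature vector of one sample.
    rows = [list(row) for row in zip(*columns)]
    return names, rows
```

The resulting rows could then be passed, together with the class labels, to whichever classifier plays the role of the ensemble model.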

Fig 3.

Pie charts of the software (a), balancing technique (b), classifier (c) and feature selection technique (d) adopted by the various algorithms.

Fig 4.

Heatmaps of the correlation coefficients among the classification scores of all seven algorithms for training (a) and test (b).
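
The matrix behind such a heatmap can be sketched as follows. The caption does not state which correlation coefficient was used, so the Pearson coefficient here is an assumption, as are the function names.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(vx * vy)


def correlation_matrix(columns):
    """Pairwise correlations; `columns` holds one score list per algorithm."""
    return [[pearson(a, b) for b in columns] for a in columns]
```

With seven score lists as input, `correlation_matrix` returns a symmetric 7 x 7 matrix with ones on the diagonal, which is the quantity a heatmap of this kind would display.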

Fig 5.

Score distributions for training (a) and test (b) of the various algorithms and the Classifier Ensemble model.

Fig 6.

Comparison of ROC curves and the resulting AUC values.

Blue curve: Ensemble model in a leave-one-out validation scheme over the training set; red curve: Ensemble model over the test set. The shaded area around each curve indicates the 95% confidence interval.
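
The caption does not say how the 95% confidence intervals were obtained. As one possible approach, the sketch below estimates a percentile bootstrap interval for the AUC value; the resampling scheme, the number of resamples, the seed and all function names are assumptions made for illustration.

```python
import random


def auc_score(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC at level 1 - alpha."""
    if len(set(y_true)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that lost one of the classes
        stats.append(auc_score(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Applying the same resampling to the points of the ROC curve, rather than to the single AUC value, would give a band of the kind shaded in the figure.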

Fig 7.

Radar plots of the performances of the various algorithms (dashed lines) and the Classifier Ensemble model for training (a) and test (b). The performance metrics were AUC, Accuracy (ACC), Sensitivity (Sens), Specificity (Spe), Precision (Pre) and F1 score.

Fig 8.

Charts of log-loss metrics for the various algorithms for training and test.

The scores of each algorithm were averaged first and then used for the comparison.
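
For reference, the binary log-loss of a set of predicted probabilities can be computed as in this minimal sketch. The function name and the clipping constant `eps` are assumptions; clipping only guards against taking the logarithm of zero.

```python
from math import log


def log_loss(y_true, probs, eps=1e-15):
    """Mean binary cross-entropy for 0/1 labels and predicted probabilities of class 1."""
    if not y_true or len(y_true) != len(probs):
        raise ValueError("labels and probabilities must be non-empty and equally long")
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(t * log(p) + (1 - t) * log(1 - p))
    return total / len(y_true)
```

Lower values indicate probabilities that sit closer to the true labels, so confident wrong predictions are penalised more heavily than hesitant ones.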

Fig 9.

Bee-swarm plot of the global model (a) and table of the corresponding strategies (b) adopted by each specific algorithm (outlier mechanism, balancing technique, classifier used, and feature selection algorithm).
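
Bee-swarm and force plots are usually drawn from per-sample feature attributions. The caption does not describe how the attributions were computed, so the sketch below only illustrates the idea on an assumed linear model with independent features, for which the Shapley value of feature j is its weight times the distance of the sample's value from the background mean. The weights, the background data and the function names are all hypothetical.

```python
def linear_predict(weights, bias, x):
    """Output of a linear model: bias plus the weighted sum of the features."""
    return bias + sum(w * xj for w, xj in zip(weights, x))


def linear_shapley(weights, x, background):
    """Attributions w_j * (x_j - mean_j) for one sample of a linear model.

    `background` is a list of reference samples used to compute the feature means.
    """
    n_features = len(weights)
    means = [
        sum(row[j] for row in background) / len(background)
        for j in range(n_features)
    ]
    return [w * (xj - m) for w, xj, m in zip(weights, x, means)]
```

The attributions of a sample sum to the difference between its prediction and the average prediction over the background, which is the decomposition a force plot displays for a single sample and a bee-swarm plot summarises over all samples.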

Fig 10.

Force plots of a wrongly classified non-metastatic sample.
