Expert-augmented automated machine learning optimizes hemodynamic predictors of spinal cord injury outcome

doi:10.1371/journal.pone.0265254

Fig 1.

A framework for applying Automated Machine Learning (AutoML) for reproducible inferences in biomedical research.

After data is curated, we perform a cyclical model development process utilizing AutoML to optimize an array of models. Reproducible and explainable AI tools and strategies can be applied to ultimately draw clinical and biological inferences from the models and allow for integration of domain expertise. Critically for clinical modeling, we also include a feature reduction component to achieve a more parsimonious model. The final models are then validated with external validation data along with population similarity analysis for further clinical contextualization. By applying this framework, models produced by AutoML can be stabilized and interpreted for inferential reproducibility and clinical verifiability.

Fig 2.

AutoML generated 15 models that performed better than the Majority Class Classifier model.

Each model consisted of automatically implemented preprocessing steps and algorithms. Models were assigned names according to the algorithm and encoded by a unique color. Blueprints of the same algorithm class are numbered for identification across both (A) LogLoss and (B) Area Under Curve (AUC) plots. Two models were selected for additional analysis: BP_log (blue box) and BP_XGB (green box). Aggregating across 25 projects (unique partitioning arrangements of the dataset), BP_log had an average performance of 0.67 ± 0.01 LogLoss and 0.68 ± 0.02 AUC; BP_XGB had an average performance of 0.68 ± 0.01 LogLoss and 0.67 ± 0.02 AUC. (C) BP_log consisted of a regularized logistic regression (L2) algorithm with a notable quintile spline transformation preprocessing step for numeric variables. (D) BP_XGB implemented an eXtreme Gradient Boosted (XGB) trees classifier with unsupervised learning features, which refers to the TensorFlow Variational Autoencoder preprocessing step for categorical variables.
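
The two metrics reported above can be illustrated with a minimal, standard-library sketch. The helper names (`log_loss`, `auc`, `summarize`) are hypothetical, and treating the "±" values as a standard deviation across partitions is an assumption; the caption does not say which spread measure was used.

```python
import math
import statistics


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy averaged over samples (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def auc(y_true, p_pred):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def summarize(scores):
    """Mean and sample standard deviation across partitions (assumed spread measure)."""
    return statistics.mean(scores), statistics.stdev(scores)


# A majority-class style baseline predicts the same probability for everyone,
# so it cannot rank patients and its AUC is 0.5.
y = [1, 0, 0, 0]
baseline = [0.25] * 4
baseline_logloss = log_loss(y, baseline)
baseline_auc = auc(y, baseline)
```

A model is only useful here if its LogLoss falls below the constant-prediction baseline and its AUC rises above 0.5, which is the comparison the figure makes against the Majority Class Classifier.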

Fig 3.

Feature rank instability (FRI) analysis as a function of number of projects aggregated.

As the number of projects increased, FRI decreased (i.e., the permutation feature importance (pFI) ranking became more stable). (A, B) Expected FRI calculated for all 46 features. BP_log had an average FRI of 174.40 ± 2.14 with 2-project aggregation and 13.03 ± 0.34 with 25-project aggregation (A). Similarly, BP_XGB started with an average FRI of 153.83 ± 3.06 that decreased to 11.65 ± 0.33 at 25 projects (B). (C, D) Focusing only on the bottom five features by pFI to calculate FRI, BP_log had an average FRI of 20.41 ± 0.75 with 2-project aggregation, which decreased to 0.96 ± 0.08 with 25-project aggregation (C). Similarly, BP_XGB started with an average FRI of 7.77 ± 0.37 and decreased to 0.56 ± 0.06 for the bottom five features with 25-project aggregation (D).
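
The general effect described above, that rankings stabilize as more projects are averaged, can be reproduced on simulated data. This caption does not give the FRI formula, so the sketch below uses a stand-in measure (the sum over features of the variance of each feature's rank across repeated aggregations). The function names, the noise level, and the simulated importances are all assumptions for illustration and are not the authors' method.

```python
import random
import statistics


def rank_features(importances):
    """Return the rank of each feature (1 = most important)."""
    order = sorted(range(len(importances)), key=lambda i: -importances[i])
    ranks = [0] * len(importances)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def simulate_project(true_importance, noise_sd, rng):
    """One simulated project: true importance plus Gaussian noise (assumed)."""
    return [t + rng.gauss(0, noise_sd) for t in true_importance]


def rank_instability(true_importance, k, n_repeats=200, noise_sd=2.0, seed=0):
    """Stand-in instability score when importances are averaged over k projects."""
    rng = random.Random(seed)
    all_ranks = []
    for _ in range(n_repeats):
        projects = [simulate_project(true_importance, noise_sd, rng) for _ in range(k)]
        mean_imp = [statistics.mean(col) for col in zip(*projects)]
        all_ranks.append(rank_features(mean_imp))
    # Sum, over features, of the variance of that feature's rank across repeats.
    return sum(statistics.pvariance(col) for col in zip(*all_ranks))


true_importance = [float(i) for i in range(10)]  # hypothetical ground truth
instability_by_k = {k: rank_instability(true_importance, k) for k in (2, 5, 25)}
```

Averaging over more projects shrinks the noise in each feature's mean importance, so neighbouring features swap ranks less often and the stand-in score falls.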

Fig 4.

Applying an iterative backward feature reduction process to identify parsimonious feature lists that maximize model performance.

The process was performed first by removing the lowest five features by feature importance (step size = 5) and then repeated with step size = 1 within the feature list size range that contained the best performance. (A) For BP_log, the step size was reduced starting at 16 features with the best performance observed with the 9-feature parsimonious feature list (LogLoss = 0.55 ± 0.02). (B) The corresponding pFI of the 9-feature parsimonious BP_log model showed that the MRI BASIC score and the time patients spent outside of the MAP thresholds were the most important features. The remaining features included other intraoperative timeseries-derived features and the time between hospitalization and surgery (Time_to_OR_a). (C) The feature reduction for BP_XGB was expanded to always preserve the two MAP threshold features. The step size was reduced to one starting at 16 features with the best performance observed with the 11-feature parsimonious feature list (LogLoss = 0.48 ± 0.02). (D) The corresponding pFI for the parsimonious BP_XGB model showed that the AIS score at admission (AIS_ad) was the most important feature. Non-timeseries-derived features included Cervical_Injury, Vertebral_Artery_Injury, and TBI_Present. The time_MAP_Avg_above_104 and time_MAP_Avg_below_76 features were ranked 7th and 9th, respectively.
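
The coarse-then-fine procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, importances are held fixed in a dictionary (a real pipeline would presumably refit the model and recompute importance after each removal), and `toy_logloss` is a made-up scoring function standing in for model training and evaluation.

```python
def drop_least_important(features, importance, n_drop, protected=()):
    """Remove the n_drop lowest-importance features, never touching protected ones."""
    removable = sorted((f for f in features if f not in protected),
                       key=lambda f: importance[f])
    to_drop = set(removable[:n_drop])
    return [f for f in features if f not in to_drop]


def backward_reduction(features, importance, score_fn, step, min_size, protected=()):
    """Score the current list, drop `step` features, and repeat down to min_size."""
    history = []
    current = list(features)
    while len(current) >= min_size:
        history.append((list(current), score_fn(current)))
        nxt = drop_least_important(current, importance, step, protected)
        if len(nxt) == len(current) or len(nxt) < min_size:
            break
        current = nxt
    return history


def coarse_then_fine(features, importance, score_fn, coarse_step=5, protected=()):
    """Coarse pass with a large step, then a step-1 pass around the best coarse size."""
    coarse = backward_reduction(features, importance, score_fn, coarse_step, 1, protected)
    best_idx = min(range(len(coarse)), key=lambda i: coarse[i][1])
    start = coarse[max(best_idx - 1, 0)][0]  # list one coarse step larger
    floor = max(len(coarse[best_idx][0]) - coarse_step, 1)
    fine = backward_reduction(start, importance, score_fn, 1, floor, protected)
    return min(fine, key=lambda item: item[1])  # lowest LogLoss wins


# Toy setup (hypothetical): 20 features, of which the first 9 carry signal.
features = [f"f{i}" for i in range(20)]
importance = {f: 20 - i for i, f in enumerate(features)}
signal = set(features[:9])


def toy_logloss(selected):
    noise = sum(1 for f in selected if f not in signal)
    missing = sum(1 for f in signal if f not in selected)
    return 0.5 + 0.01 * noise + 0.05 * missing


best_features, best_score = coarse_then_fine(features, importance, toy_logloss)
```

Passing the two MAP threshold feature names as `protected` would mirror the variant used for BP_XGB, where those features are always preserved.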

Fig 5.

Partial dependence plots (PDPs) for features of interest help interpret how features affect model prediction of BP_log and BP_XGB.

(A) For BP_log, an MRI BASIC score of 4 resulted in a lower predicted probability of improved outcome. An MRI BASIC score of 0–3 increased the predicted probability of a better outcome, with an MRI BASIC score of 2 leading to the highest probability of improvement. (B) For BP_XGB, an AIS score of A or D at admission resulted in a lower probability of patient improvement. AIS scores of B and C both led to a higher probability of improvement, with AIS score C resulting in the highest probability. (C) For BP_log and (D) BP_XGB, if a patient’s MAP exceeded an upper threshold of 104 mmHg for more than 50–75 minutes, the predicted probability of improvement decreased significantly. (E) For BP_log and (F) BP_XGB, if a patient’s MAP fell below a lower threshold of 76 mmHg for more than 100–150 minutes, the predicted probability of improvement decreased significantly. Notably, the BP_XGB PDPs for both time_MAP_Avg_above_104 and time_MAP_Avg_below_76 exhibited a rebound in predicted improvement probability at extreme upper values that was absent from the BP_log PDPs.
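
For readers unfamiliar with how a partial dependence curve is computed, a minimal sketch follows. The feature of interest is set to each grid value for every patient in turn, and the model's predictions are averaged. The model here (`toy_predict`) is a hypothetical logistic function with made-up coefficients, and the patient rows are invented; only the feature names are taken from the caption.

```python
import math


def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value for all rows."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]  # rows not mutated
        curve.append((value, sum(preds) / len(preds)))
    return curve


def toy_predict(row):
    """Hypothetical stand-in model: logistic score with made-up coefficients."""
    z = 0.5 - 0.02 * row["time_MAP_Avg_above_104"] - 0.01 * row["time_MAP_Avg_below_76"]
    return 1.0 / (1.0 + math.exp(-z))


# Invented patient rows, in minutes spent outside each threshold.
rows = [
    {"time_MAP_Avg_above_104": 10, "time_MAP_Avg_below_76": 30},
    {"time_MAP_Avg_above_104": 60, "time_MAP_Avg_below_76": 120},
    {"time_MAP_Avg_above_104": 5, "time_MAP_Avg_below_76": 200},
]

curve = partial_dependence(toy_predict, rows, "time_MAP_Avg_above_104", [0, 25, 50, 75, 100])
```

Because the toy model is a monotone logistic function, its curve can only fall steadily. A tree ensemble such as BP_XGB is not constrained this way, which is one plausible reason a rebound can appear at sparsely sampled extreme values.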

Fig 6.

LogLoss performance plots for investigating different lower and upper MAP thresholds using best-performing parsimonious BP_log and BP_XGB models.

(A) With BP_log, the lower threshold values of 74, 75, 76, and 79 mmHg performed the best of the lower thresholds. The upper threshold values of 103, 104, and 105 mmHg performed the best of the upper thresholds. Notably, the best-performing upper threshold feature (104 mmHg) resulted in a larger improvement to model performance compared to the best-performing lower threshold feature (79 mmHg). (B) With BP_XGB, the values of 74, 75, and 76 mmHg performed the best of the lower thresholds, and the values of 103 and 104 mmHg performed the best of the upper thresholds. Similar to BP_log, the best-performing upper threshold feature (104 mmHg) resulted in a larger improvement to model performance compared to the best-performing lower threshold feature (76 mmHg).
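
A threshold sweep of this kind needs one candidate feature per threshold value. The sketch below shows one way such features could be derived from a MAP time series. It assumes evenly spaced readings (one per minute by default) and strict inequalities, and it borrows the caption's naming pattern for the output keys; how the authors actually derived the features is not stated here, and the function names are hypothetical.

```python
def minutes_outside(map_values, threshold, direction, minutes_per_sample=1.0):
    """Total time with MAP strictly above or below a threshold (mmHg)."""
    if direction == "above":
        count = sum(1 for v in map_values if v > threshold)
    elif direction == "below":
        count = sum(1 for v in map_values if v < threshold)
    else:
        raise ValueError("direction must be 'above' or 'below'")
    return count * minutes_per_sample


def threshold_features(map_values, lower_thresholds, upper_thresholds,
                       minutes_per_sample=1.0):
    """Build one candidate feature per threshold for a single patient."""
    feats = {}
    for t in lower_thresholds:
        feats[f"time_MAP_Avg_below_{t}"] = minutes_outside(
            map_values, t, "below", minutes_per_sample)
    for t in upper_thresholds:
        feats[f"time_MAP_Avg_above_{t}"] = minutes_outside(
            map_values, t, "above", minutes_per_sample)
    return feats


# Invented MAP readings (mmHg), one per minute.
map_series = [70, 75, 80, 90, 104, 105, 110]
feats = threshold_features(map_series, lower_thresholds=[74, 75, 76],
                           upper_thresholds=[103, 104, 105])
```

Each candidate feature would then be substituted into the parsimonious model in turn and the resulting LogLoss compared across thresholds.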

Fig 7.

Model validation confusion matrices and clustering analysis to demonstrate differences in patient population between training and validation datasets.

Validation predictions were scored by comparing the average predicted probability of each validation sample against the average best F1 threshold for the corresponding model. (A) The best parsimonious BP_log model correctly predicted 13 of the 14 actual positives (i.e., patients whose outcome improved) and 15 of the 45 actual negatives. (B) The best parsimonious BP_XGB model correctly predicted 9 of the 14 actual positives and 14 of the 45 actual negatives. (C) UMAP and HDB clustering analysis on the combined training and validation data produced six clusters of patients. Notably, Clusters 1 and 2 showed high representation in the training cohort and low representation in the validation cohort. Conversely, Cluster 4 showed low representation in the training cohort and high representation in the validation cohort. Clusters 3, 5, and 6 showed no discernible differences between cohorts.
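
The scoring rule in the first sentence can be sketched as follows: find the F1-maximizing threshold for each fitted model, average those thresholds, average the predicted probabilities per validation sample, and compare the two. The function names and the toy numbers are hypothetical, and scanning only the observed probabilities as candidate thresholds is an assumption about how the "best F1 threshold" is located.

```python
import statistics


def confusion(y_true, y_pred):
    """Return counts as a dict with keys tp, fp, fn, tn."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for y, p in pairs if y == 1 and p == 1),
        "fp": sum(1 for y, p in pairs if y == 0 and p == 1),
        "fn": sum(1 for y, p in pairs if y == 1 and p == 0),
        "tn": sum(1 for y, p in pairs if y == 0 and p == 0),
    }


def f1_score(counts):
    denom = 2 * counts["tp"] + counts["fp"] + counts["fn"]
    return 2 * counts["tp"] / denom if denom else 0.0


def best_f1_threshold(y_true, probs):
    """Scan the observed probabilities and keep the cut-off with the highest F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(probs)):
        pred = [1 if p >= t else 0 for p in probs]
        score = f1_score(confusion(y_true, pred))
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t


def score_validation(y_true, probs_per_model, thresholds):
    """Compare each sample's mean probability with the mean best-F1 threshold."""
    avg_threshold = statistics.mean(thresholds)
    avg_probs = [statistics.mean(col) for col in zip(*probs_per_model)]
    pred = [1 if p >= avg_threshold else 0 for p in avg_probs]
    return confusion(y_true, pred)


# Toy example: two models (e.g. two partitions) scoring four validation patients.
y_val = [1, 1, 0, 0]
probs_per_model = [[0.9, 0.6, 0.4, 0.1], [0.8, 0.7, 0.5, 0.2]]
counts = score_validation(y_val, probs_per_model, thresholds=[0.5, 0.6])
```

In practice each model's threshold would be chosen on its own training partition, so the validation labels play no part in selecting the cut-off.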
