Fig 1.
An illustrative schematic for AutoPrognosis.
In this depiction, AutoPrognosis constructs an ensemble of three ML pipelines. Pipeline 1 uses the MissForest algorithm to impute missing data, then compresses the data into a lower-dimensional space using principal component analysis (PCA), and finally uses the random forest algorithm to issue predictions. Pipelines 2 and 3 use different algorithms for imputation, feature processing, classification and calibration. AutoPrognosis uses the algorithm in [19] to decide which pipelines to select and how to tune their parameters.
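The structure described in this caption (imputation, then feature processing, then classification, with several pipelines combined into an ensemble) can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the AutoPrognosis implementation: every function name is hypothetical, column-mean imputation stands in for MissForest, a keep-the-highest-variance-columns step stands in for PCA, an untrained logistic scorer stands in for the random forest, the calibration stage is omitted, and the weighted average used to combine pipelines is an assumption.

```python
import math
from statistics import mean, pvariance
from typing import Callable, List, Optional, Sequence

Matrix = List[List[float]]


def impute_column_means(rows: Sequence[Sequence[Optional[float]]]) -> Matrix:
    """Fill missing entries (None) with the column mean (simple stand-in for MissForest)."""
    columns = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in columns]
    return [[means[j] if v is None else float(v) for j, v in enumerate(row)]
            for row in rows]


def keep_top_variance(k: int) -> Callable[[Matrix], Matrix]:
    """Keep the k highest-variance columns (crude stand-in for PCA compression)."""
    def transform(rows: Matrix) -> Matrix:
        columns = list(zip(*rows))
        ranked = sorted(range(len(columns)),
                        key=lambda j: pvariance(columns[j]), reverse=True)
        keep = sorted(ranked[:k])  # preserve the original column order
        return [[row[j] for j in keep] for row in rows]
    return transform


def toy_scorer(rows: Matrix) -> List[float]:
    """Untrained placeholder classifier: logistic function of the row mean."""
    return [1.0 / (1.0 + math.exp(-mean(row))) for row in rows]


def make_pipeline(imputer, processor, classifier):
    """Chain the three stages so that raw rows go in and probabilities come out."""
    def predict(rows):
        return classifier(processor(imputer(rows)))
    return predict


def ensemble_predict(pipelines, weights, rows) -> List[float]:
    """Combine pipeline outputs by a weighted average (assumed combination rule)."""
    total = sum(weights)
    per_pipeline = [pipeline(rows) for pipeline in pipelines]
    return [sum(w * probs[i] for w, probs in zip(weights, per_pipeline)) / total
            for i in range(len(rows))]


# Toy data with one missing value; three pipelines that differ only in k.
toy_rows = [[1.0, None, 0.0], [3.0, 2.0, 0.0], [5.0, 4.0, 0.0]]
toy_pipelines = [make_pipeline(impute_column_means, keep_top_variance(k), toy_scorer)
                 for k in (1, 2, 3)]
toy_probabilities = ensemble_predict(toy_pipelines, [0.5, 0.3, 0.2], toy_rows)
```

In a real system each stage would be fitted on training data, and the choice of stages and their hyperparameters would be made by the selection algorithm cited in the caption rather than fixed by hand as here.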
Table 1.
List of algorithms included in AutoPrognosis.
Table 2.
Performance of all prediction models under consideration.
Table 3.
Variable ranking by their contribution to the predictions of AutoPrognosis.
Table 4.
Performance of AutoPrognosis in the diabetic patient subgroup.
Table 5.
Variable ranking for the diabetic population.
Fig 2.
Predictive ability of the UK Biobank variables for men and women.
Each point represents a variable in the UK Biobank, positioned by its ability to predict CVD events in men and in women. Predictions based solely on age achieved an AUC-ROC of 0.632 ± 0.003 for men and 0.665 ± 0.002 for women. We report the AUC-ROC of models trained with each individual variable in addition to age, and display only variables that achieved a statistically significant improvement in AUC-ROC over predictions based on age alone. Each color represents a different variable category. Variables that deviate from the dotted gray regression line have an AUC-ROC that differs between men and women by more than the overall association between the two genders would predict, suggesting a stronger relative importance in one gender group.
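The two quantities this caption relies on, the AUC-ROC of a set of risk scores and the deviation of a point from a fitted regression line, can be illustrated with a short standard-library sketch. The function names and the toy numbers are hypothetical and are not taken from the study. The sketch also assumes that the line is an ordinary least-squares fit of the AUC-ROC in women on the AUC-ROC in men, which the caption does not state.

```python
from statistics import mean
from typing import List, Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a random case outranks a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("both outcome classes must be present")
    wins = sum(1.0 if c > n else 0.5 if c == n else 0.0
               for c in cases for n in controls)
    return wins / (len(cases) * len(controls))


def residuals_from_line(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Vertical distance of each point from the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical outcomes and risk scores for one group.
toy_labels = [0, 0, 1, 1]
toy_scores_age_only = [0.30, 0.40, 0.35, 0.45]
toy_scores_age_plus_variable = [0.10, 0.40, 0.35, 0.80]
toy_gain = (auc_roc(toy_labels, toy_scores_age_plus_variable)
            - auc_roc(toy_labels, toy_scores_age_only))

# Hypothetical per-variable AUC-ROC values in each group (one entry per variable).
toy_auc_men = [0.640, 0.650, 0.660, 0.670]
toy_auc_women = [0.670, 0.675, 0.700, 0.695]
toy_residuals = residuals_from_line(toy_auc_men, toy_auc_women)
```

Under the assumed axis orientation, a large positive residual marks a variable that is more predictive in women than the overall trend suggests, and a large negative residual marks one that is more predictive in men. Whether a gain over the age-only model is statistically significant would need a separate test, which this sketch does not attempt.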