
Using tree-based ensemble methods to produce a population-based mortality risk score in Ontario, Canada

  • Steven Habbous ,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft

    Steven.habbous@ontariohealth.ca

    Affiliations Ontario Health, Toronto, Ontario, Canada, Department of Epidemiology and Biostatistics, Western University, London, Ontario, Canada

  • Peter C. Austin,

    Roles Investigation, Methodology, Validation, Writing – review & editing

    Affiliations ICES, Toronto, Canada, Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada

  • Shabnam Balamchi,

    Roles Investigation, Methodology, Validation, Writing – review & editing

    Affiliation Ontario Health, Toronto, Ontario, Canada

  • Davood Astaraky,

    Roles Methodology, Validation, Writing – review & editing

    Affiliation Ontario Health, Toronto, Ontario, Canada

  • Roozbeh Yousefi,

    Roles Methodology, Validation, Writing – review & editing

    Affiliation Ontario Health, Toronto, Ontario, Canada

  • Munaza Chaudhry,

    Roles Methodology, Validation, Writing – review & editing

    Affiliation Ontario Health, Toronto, Ontario, Canada

  • Erik Hellsten

    Roles Conceptualization, Investigation, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation Ontario Health, Toronto, Ontario, Canada

Abstract

Introduction

Risk adjustment is critical in observational epidemiology to control for confounding of the exposure-outcome relationship. Accurate prediction of outcomes, such as mortality, can improve risk adjustment. In the present study, we compared logistic regression with a range of tree-based ensemble methods to predict 1-year mortality in the general population of Ontario, Canada.

Methods

Ontario adults (age 18 years and older) who were alive as of January 1, 2022 were included. Using a lookback window of up to 3 years, various measures of health and healthcare utilization were captured from administrative databases. To predict 1-year mortality, we applied logistic regression, random forests, extremely randomized trees, adaptive boosting, gradient boosting, extreme gradient boosting, Newton boosting, and CatBoost. All models also included age and sex. Performance was evaluated in the 30% test set using the area under the ROC curve (AUROC), the area under the precision-recall curve (PR-AUC), the Brier score, and a quantile-based version of the Integrated Calibration Index (ICI). Feature importance was assessed using CatBoost’s internal model structure, supplemented with permutation feature importance, explainable boosting machines, and marginal effects.

Results

A total of 12,080,801 Ontarians were included and 121,951 (1.0%) died within 1 year. Logistic regression showed excellent discrimination (AUROC 0.926; PR-AUC 0.256) and acceptable calibration (ICI 0.0022). The best model was CatBoost, which had the best discrimination (AUROC 0.933, PR-AUC 0.280) and calibration (ICI 0.0003). In sensitivity analyses of the CatBoost model, including more detailed definitions of cancer (to include its subtype) and chronic kidney disease (defined using serum creatinine instead of diagnostic codes) produced modest improvements in PR-AUC (0.290), along with substantially improved calibration amongst the highest-risk (70–100%) individuals. The most influential feature was age. Residence in long-term care and receipt of palliative care were associated with the largest marginal effects.

Conclusion

The machine learning method CatBoost yielded the most accurate predictive model for 1-year mortality using individual comorbidities and additional measures of healthcare utilization for the general population. These findings demonstrate that machine learning methods can enhance risk adjustment efforts in observational studies, leading to more accurate confounder control and better support for health policy and epidemiologic research.

Introduction

Risk prediction models are common in medicine, often guiding treatment decisions. Examples include risk equations that predict the probability of kidney failure, the probability of lung cancer given a pulmonary nodule, the probability of postoperative pulmonary complications, and the 10-year risk of cardiovascular disease [1–4]. Traditionally, such risk scores have been constructed using standard statistical regression techniques such as linear regression, logistic regression, or Cox proportional hazards regression.

Machine learning methods may be better suited than traditional statistical methods when the number of potential predictors is large, when there is potential for multicollinearity, when synergistic or antagonistic effects (e.g., interactions) between predictors exist, and when associations are non-linear. Traditional statistical models would require the onerous and subjective task of manually coding interaction terms. Tree-based methods automatically consider interactions and non-linearity by virtue of successively splitting the data into partitions as the depth of the trees increases, and, like logistic regression, can also output a predicted probability of the outcome.

Despite this, applications of machine learning methods in public health are limited [5,6]. In the present study, we constructed models for estimating the risk of all-cause 1-year mortality in the general adult population in Ontario, Canada. The output from these models can be used as a single variable for risk adjustment or confounder control, or as a variable of interest to understand the burden of illness in a population. We use a range of tree-based ensemble methods, uncovering some of the black-box features of machine learning methods and providing additional exposure for epidemiologists and health system analysts interested in predictive modeling [7].

Methods

Cohort creation

This retrospective population-based study was conducted in Ontario, Canada. The cohort included all Ontario residents (with valid Ontario postal code and valid unique identifier) who were alive as of January 1, 2022. Persons with sex coded as M or F and aged 18–105 as of the index date (January 1, 2022) were included (N = 12,080,801). Reporting follows STROBE for observational studies (STROBE Checklist in S1 File).

Outcome

The outcome of interest was 1-year all-cause mortality from the index date (January 1, 2022). The date of death was captured from the Registered Persons Database.

Predictors of 1-year mortality

Two sets of predictors were considered. The first set of predictors included age (continuous), sex, and comorbidities (chronic conditions) captured from a hospital setting adapted from the Charlson Comorbidity Index using the updated codes for HIV/AIDS and renal disease as outlined by Glasheen et al [8,9]. The second set of predictors expanded on the first by including additional measures of healthcare utilization that could be captured outside of the hospital setting.

Comorbidities – Chronic conditions

Comorbidities were categorized using ICD-10 codes from hospital visits within three years prior to the index date [8,10]. We used the Discharge Abstract Database (DAD) for inpatient visits (allowing up to 25 diagnostic codes per visit) and the National Ambulatory Care Reporting System (NACRS) for ambulatory hospital visits (allowing up to 10 diagnostic codes per visit) (Appendix 1 in S2 File for codes). Comorbidities included myocardial infarction, congestive heart failure, peripheral vascular disease, cerebrovascular disease, dementia, chronic pulmonary disease, connective tissue/rheumatoid disease, peptic ulcer disease, mild liver disease, diabetes without complications, diabetes with complications, hemiplegia/paraplegia, moderate-to-severe liver disease, mild-to-moderate renal disease, severe renal disease, primary cancer, metastatic cancer, HIV, and AIDS (HIV with opportunistic infection).

Measures of healthcare utilization

We used the Ontario Health Insurance Plan (OHIP) database to capture health and healthcare utilization based on physician billing codes that could plausibly be related (positively or negatively) to mortality. Flags were created using a lookback of 1 year and included billing codes related to obesity, long-term care/chronic care residence, mental health, house calls, American Society of Anesthesiologists (ASA) physical status classification (V, IV, III, none), chronic pain, fibromyalgia, palliative care, smoking cessation, homecare, substance use/abuse, critical care, hyperbaric treatment, vaginal delivery or delivery by Cesarean section, major burns, and amputation (Appendix 2 in S2 File). Acute hospital-based indicators included hip fracture, delirium, and hospital-acquired pressure injury. Indicators of healthcare utilization included the number of unique visits with the healthcare system (OHIP), the number of admissions (DAD), and the number of outpatient hospital encounters (NACRS). The Activities of Daily Living (ADL) score was obtained from the Continuing Care Reporting System (long-term care; complex continuing care) and InterRAI-Home Care databases. The ADL is a measure of an individual’s ability to perform self-care tasks (range 0–28). The most recent score available during the preceding year was used, with missing values assigned a “missing” category.

Data pre-processing

Numeric predictors (age, number of healthcare utilization visits, ADL score) were retained as continuous predictors. Persons missing an ADL score (e.g., those not in a setting to complete an InterRAI survey) were assigned a separate category. These variables were not rescaled, as the machine learning models employed were based on decision trees, which are invariant to monotonic transformations and operate effectively on the rank order of predictor values. Categorical variables were transformed using one-hot encoding (dummy-coded to 0/1) to ensure compatibility with the machine learning algorithms.
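One-hot encoding of a categorical flag can be sketched as follows. This is a minimal scikit-learn illustration, not the study’s actual pipeline; the ASA-style levels are used purely as a hypothetical example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical predictor (levels mimic the ASA flag for illustration).
asa = np.array([["none"], ["III"], ["IV"], ["V"], ["III"]])

encoder = OneHotEncoder()                 # one indicator column per level
X = encoder.fit_transform(asa).toarray()  # dense 0/1 matrix for the learners

print(encoder.categories_[0])  # levels found: ['III' 'IV' 'V' 'none']
print(X.shape)                 # (5, 4)
```

Tree-based learners consume the resulting 0/1 columns directly; no rescaling is needed, consistent with the invariance to monotonic transformations noted above.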

Models

Base model: Logistic regression.

The baseline set of models was logistic regression, which is estimated using maximum likelihood estimation (MLE). MLE identifies the parameters that maximize the log-likelihood function. In machine learning implementations of logistic regression, the objective is typically framed as minimizing the negative average log-likelihood function.
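The negative average log-likelihood described above can be written directly. This is a minimal NumPy sketch of the objective, not the study’s code.

```python
import numpy as np

def neg_avg_log_likelihood(y, p, eps=1e-15):
    """Negative average log-likelihood (log loss) for binary outcomes y
    given predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([0, 0, 1, 1])
good = np.array([0.1, 0.2, 0.8, 0.9])  # probabilities agreeing with outcomes
bad = np.array([0.9, 0.8, 0.2, 0.1])   # the same probabilities reversed

print(neg_avg_log_likelihood(y, good))  # small loss
print(neg_avg_log_likelihood(y, bad))   # much larger loss
```

Minimizing this quantity over the model parameters is equivalent to maximizing the log-likelihood in MLE.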

Tree-based ensemble methods I: Bagging.

The first set of ensemble methods involved bagging (bootstrap aggregation) methods [11]. These methods train n base estimators (e.g., decision tree classifiers) on random resamples of the data; because the estimators do not depend on one another, they can be trained in parallel. In standard bagging, bootstrap samples (sampling with replacement) are drawn from the training data, and each base estimator uses all available features.

In random forests, instead of using all available features to split each node within a tree, a random subset of features is selected at each split, increasing diversity and reducing overfitting.

Extremely randomized trees (ExtraTrees) are similar to random forests, but instead of searching for the optimal split threshold among the random sample of features at each node, ExtraTrees draws a random split threshold for each candidate feature and selects the best-performing of these random splits.

All these bagging-based methods aim to reduce variance and improve generalization by introducing diversity among the base learners, which are typically strong learners (e.g., deep trees).
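A toy comparison of the two bagging-style ensembles is sketched below on synthetic data (not the study cohort). Both expose predicted probabilities through `predict_proba`, as noted above for logistic regression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary data (~10% positives) for illustration only.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

aucs = {}
for Model in (RandomForestClassifier, ExtraTreesClassifier):
    clf = Model(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    p = clf.predict_proba(X_te)[:, 1]       # predicted probability of the positive class
    aucs[Model.__name__] = roc_auc_score(y_te, p)

print({k: round(v, 3) for k, v in aucs.items()})
```

The two classifiers differ only in how node splits are chosen (optimal thresholds on a feature subset vs. randomly drawn thresholds), which is why they share the same scikit-learn interface.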

Tree-based ensemble methods II: Boosting.

The second set of ensemble methods involved boosting methods, which sequentially construct shallow trees (i.e., weak learners), with each iteration aimed at correcting the errors of the previous model [11].

In adaptive boosting (AdaBoost), observations misclassified by previous learners are assigned higher weights, so that subsequent learners focus more on the difficult-to-classify cases. Each weak learner is associated with its own weight, and the final prediction is a weighted combination of all learners.

In gradient boosting, instead of reweighting misclassified observations, learners are fit to the gradients of a loss function, which quantify the degree of error. Here, the loss function is based on residual errors (observed minus predicted from previous weak learner), and weak regressors are fit to the residuals, which are real-valued rather than binary. Gradient boosting uses first-order (single derivative) information.
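The residual-fitting loop described above can be sketched by hand. This is a toy regression example with squared-error loss, where the negative gradient is exactly the residual (observed minus predicted); it is an illustration of the mechanism, not the study’s implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)  # noisy non-linear target

pred = np.full_like(y, y.mean())  # initial prediction: the overall mean
learning_rate = 0.1
for _ in range(100):
    residual = y - pred                        # negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=2)  # shallow tree = weak learner
    tree.fit(X, residual)                      # weak regressor fit to real-valued residuals
    pred += learning_rate * tree.predict(X)    # small corrective step

mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - pred) ** 2)
print(mse_start, mse_end)  # boosting drives the training error down
```

Each iteration nudges the predictions toward the observed values, which is the first-order (single-derivative) update the text describes.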

Newton boosting extends gradient boosting by incorporating second-order derivatives (the Hessian) to adjust weights, enabling faster convergence. Extreme gradient boosting (XGBoost) is a type of Newton boosting, designed for greater computational efficiency and allowing for regularized loss functions.

Finally, we used CatBoost, a newer Newton boosting method optimized for datasets with many categorical variables, especially those with high cardinality (many levels for a categorical variable, such as primary cancer type). CatBoost uses oblivious decision trees instead of standard decision trees (the same splitting criterion is used for all splits occurring at the same level) (https://catboost.ai/) [12–14]. Although CatBoost handles categorical variables and missing values natively, in our application, all categorical variables were non-missing and already one-hot encoded due to their dichotomous nature.

Assessing model performance

The dataset was randomly split into 70% training and 30% test, stratified by the outcome to ensure equal proportions of the outcome in both datasets. Model performance was assessed in the test set while considering performance in the training set to identify evidence of overfitting (e.g., marked degradation in performance between training and test datasets).
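The outcome-stratified split can be sketched as follows, using synthetic labels with the cohort’s approximate 1% event rate (not the study data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1% events, mimicking the cohort's mortality rate.
y = np.array([1] * 100 + [0] * 9900)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature matrix

# stratify=y preserves the event rate in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

print(y_train.mean(), y_test.mean())  # ~0.01 in both sets
```

Without `stratify`, a rare outcome could by chance be over- or under-represented in the test set, distorting the performance comparison between models.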

Discrimination.

The primary metric was the area under the receiver operating characteristic curve (AUROC). We also considered the area under the precision-recall curve (PR-AUC). Whether AUROC or PR-AUC is more appropriate when the outcome is unbalanced (e.g., the outcome of mortality is uncommon) remains a matter of discussion, so we provide both, favouring the AUROC [15,16]. We supplemented this with the Brier score (the analogue of mean squared error for binary outcomes, computed as the average squared difference between the predicted probability and the observed dichotomous outcome coded 0/1) [17]. For all models, we inspected the distribution and the range of the predicted probabilities (ideally ranging from 0.01 to 0.99), how close the mean predicted mortality was to the observed mean mortality (ideally identical), and whether the AUROC, PR-AUC, and Brier scores differed between the training and test datasets (ideally identical).
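The three discrimination metrics are all available in scikit-learn; a minimal sketch on toy predictions (not study output) is shown below. `average_precision_score` is a standard estimator of the PR-AUC.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             roc_auc_score)

# Toy outcomes and predicted probabilities: both positives ranked highest.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.05, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.7, 0.9])

auroc = roc_auc_score(y_true, y_prob)
pr_auc = average_precision_score(y_true, y_prob)            # PR-AUC estimator
brier = brier_score_loss(y_true, y_prob)                    # mean squared distance
print(auroc, pr_auc, brier)
```

Because the two events receive the highest probabilities here, ranking-based metrics (AUROC, PR-AUC) are perfect, while the Brier score still penalizes probabilities that are not exactly 0 or 1, illustrating why it also captures calibration.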

Calibration.

While the Brier score also incorporates calibration [17], we additionally assessed calibration by visual inspection of the calibration curve, plotting centiles of the predicted probability of mortality versus the observed probability of mortality [18,19]. Large deviations of the calibration curve from the diagonal line of perfect calibration may be driven by sparsity (e.g., few observations with a predicted probability in that range) and should therefore not be over-interpreted [20]. Given that the lowess() function was computationally expensive, we calculated a metric comparable to the Integrated Calibration Index (ICIeq) by computing ICIeq = Σ_i (n_i/N) × |mean(y_i) − mean(ŷ_i)|, interpreted as the weighted average of the absolute difference between mean observed and expected outcomes, where n_i is the size of bin i and N is the total sample size (100 bins were selected) (Appendix 3 in S2 File for Python code). We also supplemented this by comparing mean(y) with mean(ŷ) (the closer the better; calibration-in-the-large) and Cox’s calibration slope and intercept [21]. To calculate Cox’s calibration slope and intercept, we used a logistic regression of the observed outcome regressed on the logit-transformed predicted probability of the outcome [21–23]. A well-calibrated model is expected to have a slope of 1 (a unitless quantity) and an intercept of 0. Lastly, for visual support we used Kaplan-Meier plots stratified by predicted probabilities of the outcome, expecting the observed event rates to match the predicted event rates in each group [24].
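A hedged reimplementation of the binned calibration metric described above is sketched below (the study’s actual code is in Appendix 3 in S2 File; this version and its bin-assignment details are our own illustration).

```python
import numpy as np

def ici_eq(y, p, n_bins=100):
    """Quantile-binned ICI: weighted mean absolute difference between
    observed and mean predicted risk per bin."""
    edges = np.quantile(p, np.linspace(0, 1, n_bins + 1))      # quantile bin edges
    idx = np.clip(np.searchsorted(edges, p, side="right") - 1, 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        n_i = mask.sum()
        if n_i:
            total += (n_i / len(y)) * abs(y[mask].mean() - p[mask].mean())
    return total

# Simulated check: outcomes drawn from the predicted probabilities themselves,
# so the "model" is perfectly calibrated by construction.
rng = np.random.default_rng(1)
p = rng.uniform(0.01, 0.2, size=50_000)
y = (rng.uniform(size=p.size) < p).astype(int)
print(ici_eq(y, p))  # near 0 for a well-calibrated model
```

Shifting all predictions upward (e.g., `p + 0.1`) inflates the metric, illustrating that it detects systematic miscalibration.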

Sensitivity analyses

Definition of select comorbidities.

We refined the definition of renal disease and primary cancer to leverage our more detailed data assets (Appendix 4 in S2 File for details). For renal disease, we used serum creatinine from the Ontario Laboratory Information System (OLIS) to calculate eGFR using the 2021 CKD-EPI formula, which is not race-adjusted [25]. This was done because ICD-10 codes have demonstrably poor sensitivity for capturing chronic kidney disease (CKD), resulting in underestimation of this highly prevalent condition [26]. Persons with no serum creatinine were assumed to have no kidney disease. For primary cancer, we used the Ontario Cancer Registry (the gold standard for cancer diagnoses), categorizing each diagnosis using the SEER recode classification system. The SEER codes were unchanged, with the following exceptions due to small sample sizes: 22050 (pleura) and 22060 (trachea, mediastinum, other respiratory organs) were combined with 22030 (lung and bronchus); 35023 (other myeloid/monocytic leukemia), 35031 (acute monocytic leukemia), and 35041 (other acute leukemia) were combined as “Other leukemia”; and 33011 (nodal) and 33012 (extranodal) were combined into “Hodgkin lymphoma”.

Source of comorbidity.

Comorbidities were primarily dichotomized as present/absent. In a sensitivity analysis, we further distinguished comorbidities by hospital visit type as: 1) none (comorbidity absent); 2) ambulatory visit only (NACRS); 3) inpatient visit only (DAD); and 4) ambulatory and inpatient visit (both DAD and NACRS).

Data splitting.

In a sensitivity analysis, we compared the performance of the best-performing model when the same model was constructed using 10-fold cross-validation.

Hyperparameter tuning.

We started with arbitrary values and fine-tuned manually for select models (e.g., number of trees, depth of trees, learning rate). Due to resource constraints, we did not perform a randomized or grid search of hyperparameters.

Validation cohort.

We constructed a second, identically defined validation cohort comprising Ontario adults alive as of January 1, 2024. We applied the best-performing model to this cohort to assess generalizability and potential drift.

Explainability

Feature importance: Internal to the model.

Methods using gradients (gradient boosting, extreme gradient boosting, Newton boosting, CatBoost) do not use entropy or Gini impurity to determine a split. Instead, leaf values are updated using gradient information and indicate the extent to which the model’s predictions should be increased or decreased. Using CatBoost’s internal structure, a feature’s importance was calculated as the sum of the absolute changes in prediction (leaf value changes) caused by splits on that feature, weighted by the number of samples passing through those splits, and then normalized so the importances sum to 100 across all features (formula: https://catboost.ai/docs/en/concepts/fstr#regular-feature-importance).

Feature importance: External to the model.

Model-agnostic methods do not require the inner workings of the model, only its output. We provide two approaches: permutation feature importance, which is more common in the machine learning literature, and marginal effects, which are more common in the statistical literature.

Permutation feature importance (PFI): for each column, the values are randomly shuffled within the dataset, and the difference between the original performance metric and the metric after permutation is estimated [27]. We chose the AUROC as the metric for estimating the PFI.
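The shuffle-and-rescore procedure can be sketched by hand on synthetic data (not the study data), with the AUROC as the metric, mirroring the choice above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic data with a few informative features, for illustration only.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
baseline = roc_auc_score(y, clf.predict_proba(X)[:, 1])

rng = np.random.default_rng(0)
importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    rng.shuffle(X_perm[:, j])               # break the feature-outcome link
    auc = roc_auc_score(y, clf.predict_proba(X_perm)[:, 1])
    importances.append(baseline - auc)      # AUROC lost when feature j is shuffled

print(np.round(importances, 3))
```

Because only the model’s predictions are queried, the same loop applies unchanged to any fitted classifier, which is what makes PFI model-agnostic.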

Marginal effects: To estimate the overall effect of each feature on 1-year mortality, we computed the average marginal effect (AME) and the relative marginal effect (RME) using recycled predictions [28]. For each categorical variable x_i, we computed the predicted probability of the outcome (ŷ) under two scenarios: 1) when all individuals in the cohort had x_i = 0 (yielding ŷ_0); and 2) when all individuals in the cohort had x_i = 1 (yielding ŷ_1). For a numeric variable (age, number of healthcare visits), ŷ_0 uses the value of the variable as-is and ŷ_1 uses the value of the variable + 1. The AME was computed as the mean of the differences (ŷ_1 − ŷ_0), interpreted as the mean absolute change in the outcome due to a 1-unit increase in the predictor. The RME was calculated as the mean of the relative differences (ŷ_1 − ŷ_0)/ŷ_0, interpreted as the average relative change in the predicted probability associated with a 1-unit increase in the predictor. The 95% confidence interval for the AME and RME was computed using the standard error of the difference times the critical t-statistic with alpha = 0.05.
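Recycled predictions for a binary feature can be sketched as follows. A logistic regression on simulated data stands in for the study’s CatBoost model; the data-generating values are our own illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulate a cohort with age and one binary flag that raises mortality risk.
rng = np.random.default_rng(0)
n = 5000
age = rng.normal(50, 15, size=n)
flag = rng.integers(0, 2, size=n)
logit = -5 + 0.05 * age + 1.0 * flag           # assumed true model
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([age, flag])

model = LogisticRegression(max_iter=1000).fit(X, y)

# Recycle the whole cohort under both scenarios: flag forced to 0, then to 1.
X0, X1 = X.copy(), X.copy()
X0[:, 1], X1[:, 1] = 0, 1
p0 = model.predict_proba(X0)[:, 1]
p1 = model.predict_proba(X1)[:, 1]

ame = np.mean(p1 - p0)          # mean absolute change in predicted risk
rme = np.mean((p1 - p0) / p0)   # mean relative change in predicted risk
print(round(ame, 4), round(rme, 4))
```

Because the same individuals are scored under both scenarios, the covariate distribution is held fixed and only the feature of interest changes, which is the essence of the recycled-predictions approach.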

Explainable boosting machines.

Explainable boosting machines (EBM) are tree-based ensemble methods that sacrifice some accuracy for explainability [11,29]. EBMs leverage generalized additive models (GAMs), whereby each predictor in the model is represented by a function learned from the data rather than a single parameter. Each function is smoothed and can be non-linear. These functions are then added together to produce the linear predictor and, as in other regression models, are linked with the outcome through a link function (e.g., logit). With EBMs, each feature’s function is modeled using shallow trees trained by gradient boosting with a small learning rate. Pairwise interactions can be added to GAMs manually, but the EBM approach automates this after first fitting the main effects, and only retains interactions if they improve performance.

Software

Cohort creation was performed using SAS v9.4 (SAS Institute Inc., Cary, NC), and all analyses were performed in Python in Jupyter Notebook (v4.0.11): statsmodels.api v0.14.2 for logistic regression with MLE; scikit-learn v1.5.2 for random forest, ExtraTrees, and (extreme) gradient boosting; catboost v1.2.7 for CatBoost; and interpret.glassbox v0.6.10 for EBMs. numpy v1.26.4 was used because a version <2 is required by CatBoost and older versions are incompatible with EBM. Feature details are presented in Appendix 5 in S2 File. Sample code for training and testing a CatBoost model is provided in Appendix 6 in S2 File.

Privacy and ethics

Research ethics approval was not required as per the Ontario Health privacy assessment as this work was performed for the purpose of quality improvement and no identifying information was obtained. This study was compliant with section 45(1) of PHIPA (Ontario Health is a prescribed entity); thus, patient consent was not required.

Results

A total of 12,080,801 Ontarians who were alive as of January 1, 2022 were included. Just over half (n = 6,194,717; 51.3%) were female, the mean age was 49.0 (SD 18.6) years, and 121,951 (1.0%) died within 1 year (Table 1).

Table 1. Prevalence and crude risk of 1-year all-cause mortality by comorbidity in full cohort (N = 12,080,801).

https://doi.org/10.1371/journal.pone.0347302.t001

The most prevalent comorbidity was diabetes without complications (n = 513,176; 4.3%), followed by primary cancer (n = 257,092; 2.1%) and diabetes with complications (n = 236,554; 2.0%) (Table 1). OLIS identified nearly 6 times more persons living with CKD (n = 345,477; 2.9%) than CIHI. The OCR identified 1.39% of the population as having had a cancer diagnosis in the last 3 years, compared with 2.13% under the standard CCI definition. Among other health encounters, flu vaccination was the most prevalent (12.8%), followed by breast cancer screening (8.8%) and a mental health flag (3.7%).

The risk of 1-year mortality varied by comorbidity type or health encounter type, with the highest mortality rates associated with receipt of palliative care (36%), residence in LTC (27%), dementia (26%), pressure injury (25%), delirium (21%), metastatic cancer (20%), stage 4–5 renal disease (16–17%), and congestive heart failure (16%) (Table 1). The probability of death for primary cancer was 9.3% but varied from 1% (e.g., thyroid or testicular cancer) to >30% (e.g., mesothelioma, pancreatic cancer, brain cancer, esophageal cancer) (Appendix 7 in S2 File).

Model performance

Logistic regression (MLE; model 1A) produced AUROC 0.926, PR-AUC 0.256, Brier Score 0.0085, and ICIeq 0.0022, performing similarly to random forest (model 1B) (Table 2). The best-performing model was CatBoost with a learning rate of 0.05 and a maximum tree depth of 6 (model 1H, Table 2), which had the highest AUROC (0.933), highest PR-AUC (0.281), lowest Brier Score (0.0083), and lowest ICIeq (0.0003) (Fig 1A). Cox’s calibration intercept was not significantly different from 0 (intercept = 0.017; p = 0.09) and Cox’s calibration slope was not significantly different from 1 (slope = 1.005; p = 0.08).

Table 2. Model performance statistics on the test set.

https://doi.org/10.1371/journal.pone.0347302.t002

Fig 1. Calibration curve and Receiver Operating Characteristic (ROC) curves for best performing models.

A) Model 1H (CatBoost; learning rate = 0.05, min samples leaf = 10, no regularization, 1000 iterations, depth = 6); B) Model 2E (CatBoost; same as Model 1H, but with the inclusion of primary cancer types, chronic kidney disease stages, and comorbidities categorized by hospital encounter source). AUROC – area under the ROC curve.

https://doi.org/10.1371/journal.pone.0347302.g001

Sensitivity analysis

Using Model 1H as the comparator, we assessed a series of sensitivity analyses (Table 2, section 3). Defining cancer from the Ontario Cancer Registry improved performance in PR-AUC only, whether the cancer diagnoses were one-hot encoded (Model 2A) or ordered-target encoded (Models 2B–2C). Including the source of comorbidity (e.g., inpatient, outpatient) (Model 2D), defining CKD using OLIS as stages 2–5 (Model 2E), or using 10-fold cross-validation instead of a 70%/30% training/test split (Model 2F) did not appreciably affect performance.

Survival analysis.

A Kaplan-Meier plot of time until death, stratified by the predicted probabilities from model 1H in the test dataset, revealed separation of the curves that was evident as early as 1 month of follow-up and continued through 12 months (Fig 2A). The model over-estimated the risk of death among the group with predicted probabilities between 70 and 100%, which contained only 321 persons (Table 3). We therefore hypothesized that people with the highest actual likelihood of dying could be better captured by incorporating individual cancer diagnoses from the OCR. Using Model 2E, despite no noticeable improvement in overall performance, we found better calibration for the 70–100% risk group (Figs 1B and 2B), although a small degree of miscalibration was observed [Cox’s intercept 0.028 (p = 0.006) and Cox’s slope 1.009 (p = 0.002)].

Table 3. 1-year mortality by summary score in test set (n=3,624,241).

https://doi.org/10.1371/journal.pone.0347302.t003

Fig 2. Kaplan-Meier plots for 1-year all-cause mortality stratified by the predicted probability of death.

A) Model 1H (CatBoost; learning rate = 0.05, min samples leaf = 10, no regularization, 1000 iterations, depth = 6); B) Model 2E (CatBoost; same as Model 1H, but with the inclusion of primary cancer types, chronic kidney disease stages, and comorbidities categorized by hospital encounter source).

https://doi.org/10.1371/journal.pone.0347302.g002

Explainability

Based on feature importance from CatBoost model 1H, age was the most important feature, followed by the number of outpatient hospital visits, sex, the number of encounters with the healthcare system (based on OHIP), ADL score, the number of hospitalizations, palliative care, and breast cancer screening (Fig 3A). Using PFI, a similar set of features was identified (Fig 3B). Using EBM, age was the most important feature and was included in interactions with several other variables (Fig 3C).

Fig 3. Feature importance: A) Feature importance from CatBoost (Model 1H), internal from model structure; B) Permutation feature importance (Model 1H), model-agnostic; C) Importance from Explainable Boosting Machine (Model 3D).

Model 1H: CatBoost; learning rate = 0.05, min samples leaf = 10, no regularization, 1000 iterations, depth = 6. Model 3D: explainable boosting machine max rounds = 1000, learning rate = 0.01, max leaves = 3, max bins = 255, interactions = 20. DAD – Discharge Abstract Database (hospitalizations); NACRS – National Ambulatory Care Reporting System (ambulatory hospital visits); OHIP – Ontario Health Insurance Plan (physician billing); ASA – American Society of Anesthesiologists (ASA) physical status classification; ADL – Activities of Daily Living score.

https://doi.org/10.1371/journal.pone.0347302.g003

To examine the effect of each covariate on mortality risk on an interpretable scale, we report the model parameters from a logistic regression model equivalent to Model 2E in Appendix 8 in S2 File (except comorbidities are dichotomized for simplicity). We also calculated the average and relative marginal effects using model 1H (Appendix 9 in S2 File). Features associated with the highest increased predicted risk of death included receipt of palliative care (AME 4.03%; RME 438%), followed by moderate-to-severe liver disease (AME 2.60%; RME 261%) and metastatic cancer (AME 1.53%; RME 159%). Several features were associated with a lower predicted risk of mortality (e.g., breast cancer screening, ASA score 3, AIDS, influenza vaccination), but the AME was small in magnitude (<0.3% absolute reduction).

Validation

Applying the best model (2E) to the cohort of Ontarians alive as of January 1, 2024 produced the same AUROC (0.933) and a similar Brier Score (0.0077), but a slightly lower PR-AUC (0.254) and worse calibration (ICIeq 0.0008) due to underestimation of risk (Fig 4).

Fig 4. Calibration curve and Receiver Operating Characteristic (ROC) curves for the best-performing model (2E) applied to the 2024 validation cohort.

AUROC – area under the ROC curve.

https://doi.org/10.1371/journal.pone.0347302.g004

Discussion

In the present study we examined different machine learning methods for estimating the risk of 1-year mortality using a wide range of health and healthcare indicators. The tree-based ensemble method CatBoost produced the most accurate predictive model and exhibited the best calibration statistics.

Having an accurate predictive model is useful because the predicted probability of mortality can provide some degree of confounder control in epidemiologic studies [30]. Logistic regression to predict the risk of 1-year mortality has been used to produce some of the most common summary scores used for risk adjustment, including the Charlson Comorbidity Index [8], the Elixhauser comorbidity index [31–34], and the Johns Hopkins Aggregated Diagnostic Groups (ADGs) [35]. Although these methods have produced AUROCs as high as 0.917 in the general population and remain prognostic many years after their development [35,36], the better discrimination found in our models (AUROC 0.933) suggests that there is opportunity for more accurate risk adjustment in the general population by leveraging as little as one additional data source (physician billing). Moreover, we used multiple metrics and visual cues to identify the best-performing model. For cohorts where cancer is more prevalent, the performance of the model, and particularly its calibration, would be improved if a more detailed accounting of primary cancer were included.

The improved performance may be driven by several factors. First, we included a wider range of predictors, including those that may be associated with lower mortality (e.g., caesarean or vaginal delivery, influenza vaccination, breast cancer screening). Second, we explicitly included age and sex in the model, while non-linearity and interactions were captured implicitly. Compared with tree-based methods, logistic regression would require manual coding of higher-order features for non-linearity, as well as interaction terms, which would be onerous, subjective, and prone to overfitting or underfitting even with backwards selection. Tree-based methods automatically incorporate interactions (the EBM model supported the importance of interactions, particularly with age) and non-linearity by recursively splitting the data into partitions as the depth of the trees increases. Like logistic regression, tree-based methods can output a predicted probability of the event. For all comparable models in the general population, CatBoost outperformed logistic regression on all metrics.

Strengths

Machine learning models have been criticized as ‘black boxes’, producing parameters that are unobservable or uninterpretable. The coefficients from a logistic regression output can be interpreted as the relative change in the log odds of an event due to the presence of a comorbidity, but the presence of higher-order terms and interaction terms makes interpretation challenging. Moreover, negative (protective) coefficients are sometimes difficult to contextualize and are likely driven by residual confounding. The Elixhauser comorbidity scoring system assigns negative weights to several conditions, including obesity and AIDS, features that we also found to have small negative marginal effects [34]. CIHI’s Population Grouper [37] and the Johns Hopkins ADG [36] are both proprietary algorithms, so a deeper understanding of their inner workings is unavailable. A similar issue affects machine learning methods, but there it arises from the opaqueness of the methods themselves rather than from the modeler’s choice. To mitigate this, we unpacked the inner workings of the model through feature importance rankings (typical for machine learning applications), supplemented with absolute and relative marginal effects (more typical of epidemiology applications). We also provide the code that an analyst can use to build their own model using a similar approach, as well as the actual code used to make predictions using the model ‘as-is’ (File S3 for Model 1H and File S4 for Model 2E).

There is a trade-off between explainability and performance, and many analytics practitioners have sought to meet somewhere in between (e.g., using ‘glass-box’ methods optimized for explainability, or reducing the number of features and interactions for parsimony) [29,38,39]. There are a few considerations regarding this trade-off. One is the application: predicting weather or stock prices may demand maximal accuracy, while healthcare applications may strive for more explainability (e.g., identifying modifiable risk factors or high-risk groups). Another approach is to take a data-informed rather than a data-driven approach. For example, one study computed a 12-month mortality risk score at the time of admission, and patients with a score exceeding some threshold were referred for palliative care consultation [40]. However, the actual risk score was not provided in the referral because the decision to provide palliative care should be left to the discretion of the clinical staff, not the model. In this example, neither the score itself (only the chosen threshold) nor the individual components driving the score were important. In our study, we promote explainability to lend credibility and support to the model’s construction and outputs, and we do so by several means. First, interpretation of calibration plots and supportive visualizations such as Kaplan-Meier plots was useful for improving and understanding the model more fully. Increasing model complexity by including primary cancer types, the source of comorbidities, and chronic kidney disease stages improved calibration-in-the-large for people with a predicted mortality of 70% or higher. This decision was supported by the knowledge (and observation) that different cancer types have vastly different mortality rates, and the intuition that persons hospitalized for a condition are likely at a higher risk of mortality than those who are not [41].
Since this is a small group of people and death is an uncommon outcome, there was little impact on the overall performance of the model based on summary statistics alone (AUROC, PR-AUC, Brier score, ICIeq). Second, we took a top-down approach by identifying factors potentially associated with mortality and then measuring them, which facilitates interpretability. This was done by leveraging existing comorbidity scores prevalent in the literature, plus factors known or purported to be associated (positively or negatively) with mortality. Conversely, a bottom-up approach could have yielded better performance at the expense of explainability: the analyst would be removed from the feature engineering process, leaving the model-building process to select predictors on its own. Features could include all diagnostic codes, procedure codes, physician billing codes, laboratory values, and a range of questions from InterRAI surveys beyond ADL, which would yield thousands of features.

Such an approach would also be computationally prohibitive. Moreover, there is an unknown practical limit for all statistical measures of discrimination and calibration, since 1) mortality is not always predictable; and 2) data are imperfect (e.g., inaccurate diagnoses, migration patterns, incomplete death linkage). Nevertheless, our models performed very well relative to the theoretical limits (AUROCmax = 1; ICIeq,max = 0), and changes to hyperparameters produced only modest changes in discrimination or calibration. Still, there are some limitations that must be acknowledged.

Limitations

One limitation is the computational resources available. While advancing our ability and capacity to use machine learning methods, our organization is still immature on this front [6]. We adhere to a ‘Cloud-First’ strategy, but reliance on a single-vendor model imposes structural and financial constraints that are ill-suited for the iterative nature of machine learning research. On our Azure Virtual Desktop we have access to eight CPU cores, 137 GB of RAM, and no GPUs. We were therefore unable to conduct full cross-validation assessments or hyperparameter tuning through grid or randomized search, and it remains possible that a better model fits these data. Despite this, the convergence of different models across all measures of model performance suggests that any further improvements are likely to be small. For the present study this was not a barrier, but for work requiring more sophisticated models, a hybrid approach of cloud and on-premises computing resources would be important for cost-contained model development.

Another limitation of the present work, which extends to all predictive models, is the risk of data drift and the changing importance of a predictor over time [42,43]. Diagnostic codes, case definitions, and clinical management may change over time, requiring periodic monitoring for drift in performance. For example, we observed a degradation of calibration when the model was applied to a 2024 cohort, but no degradation in discrimination (AUROC). This is perhaps not surprising, since factors associated with mortality are unlikely to change drastically over time and the AUROC does not rely on calibration. However, the worse calibration in 2024 illustrates that the absolute value of the score can be misleading and should be interpreted cautiously, whereas its ranking (i.e., discrimination) is less time-varying.

It is possible that the use of variables related to healthcare resource utilization resulted in some algorithmic bias, and therefore different degrees of accuracy for different sociodemographic groups [44]. For example, if select populations are less likely to engage with the healthcare system, less likely to have a diagnosis rendered, or have the diagnosis rendered at a more advanced stage of illness, then predictors like healthcare utilization frequency may perpetuate such biases [45–47]. We would anticipate underperformance (e.g., higher than expected mortality rates) for such groups. Checklists for artificial intelligence algorithms can improve the transparency of artificial intelligence and machine learning models, and additional guidelines are available to help mitigate bias (IJMEDI checklist in S1 File) [48].

Importantly, we expect these findings to be generalizable to other jurisdictions because comorbidity indices have been demonstrated to be similarly prognostic internationally [49,50]. We also expect the model to generalize to outcomes that are correlated with 1-year mortality, although we expect performance to deteriorate as look-forward windows increase (e.g., 5-year mortality). Our results are also likely generalizable to disease-specific cohorts, but with worse performance, because disease-specific or event-specific indicators become more important as they become more prevalent. Similar reasoning can explain the worse performance for outcomes like hospital readmission, potentially requiring different models to be trained for specific outcomes [34,51].

Another limitation is the worse performance of the model for predicted probabilities of dying ≥70%, although this represents a small absolute number of people (n = 543; 0.01% of the adult population). The performance of the model would be expected to improve if more granular clinical indicators were available. For example, we demonstrated that the specific cancer type is an important prognostic factor that improved model performance (accuracy and calibration), but further breakdown by cancer stage, cancer subtype, and even method of detection (screening versus symptomatic) would be expected to yield further improvements. Similar arguments can be made for other diseases and for measures of general health beyond ADL, but such data are unavailable or incomplete in our databases.

Another limitation is inherent errors in administrative coding of diagnoses. Inaccuracies in coding renal disease have been noted, and algorithms for more accurate case definitions have been published for diabetes [52], congestive heart failure [53], ischemic heart disease [54], stroke [55], and upper gastrointestinal diseases [56]. However, none of these algorithms is perfect, and a search for optimal definitions of comorbidities becomes onerous and perhaps unnecessary for the present scope, which aims to model mortality rather than to measure the association of each comorbidity with mortality.

Conclusion

Tree-based ensemble methods yielded the most accurate predictive model for 1-year mortality using individual comorbidities and additional measures of health and healthcare utilization in the general population. We provide algorithms that other investigators can use to estimate a person's baseline risk of mortality, for direct study or for risk adjustment.

Supporting information

S1 File. STROBE checklist for observational studies in epidemiology.

International Journal of Medical Informatics (IJMEDI) checklist for assessment of medical artificial intelligence.

https://doi.org/10.1371/journal.pone.0347302.s001

(PDF)

S2 File. Includes supplementary tables for administrative codes, feature definitions, Python code for training/testing the CatBoost model 1H, and supporting results.

https://doi.org/10.1371/journal.pone.0347302.s002

(DOCX)

Acknowledgments

Parts of this material are based on data and information compiled and provided by CIHI. However, the analyses, conclusions, opinions, and statements expressed herein are those of the author, and not necessarily those of CIHI. Parts of this publication are based on data provided by ICES. However, the views expressed in this publication are those of the researcher and do not necessarily represent those of ICES. This report was produced with the support of the Ontario Ministry of Health. However, the views expressed herein are those of the author, and not necessarily those of the Ontario Ministry of Health or the Government of Ontario.

References

1. Tangri N, Stevens LA, Griffith J, Tighiouart H, Djurdjev O, Naimark D, et al. A predictive model for progression of chronic kidney disease to kidney failure. JAMA. 2011;305(15):1553–9. pmid:21482743
2. McWilliams A, Tammemagi MC, Mayo JR, Roberts H, Liu G, Soghrati K. Probability of cancer in pulmonary nodules detected on first screening CT. N Engl J Med. 2013;369(10):910–9.
3. Canet J, Gallart L, Gomar C, Paluzie G, Vallès J, Castillo J, et al. Prediction of postoperative pulmonary complications in a population-based surgical cohort. Anesthesiology. 2010;113(6):1338–50. pmid:21045639
4. D’Agostino RB Sr, Vasan RS, Pencina MJ, Wolf PA, Cobain M, Massaro JM, et al. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation. 2008;117(6):743–53. pmid:18212285
5. Pagallo U, O’Sullivan S, Nevejans N, Holzinger A, Friebe M, Jeanquartier F, et al. The underuse of AI in the health sector: opportunity costs, success stories, risks and recommendations. Health Technol (Berl). 2024;14(1):1–14. pmid:38229886
6. Habbous S, Herring J, Badiani T, Brown A, Hillmer M, Rosella LC. A qualitative assessment of the use of artificial intelligence in public sector health organisations in Ontario, Canada. Swiss Med Wkly. 2026;156:4942. pmid:41962065
7. Wiens J, Shenoy ES. Machine learning for healthcare: on the verge of a major shift in healthcare epidemiology. Clin Infect Dis. 2018;66(1):149–53.
8. Charlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis. 1987;40(5):373–83. pmid:3558716
9. Glasheen WP, Cordier T, Gumpina R, Haugh G, Davis J, Renda A. Charlson comorbidity index: ICD-9 update and ICD-10 translation. Am Health Drug Benefits. 2019;12(4):188.
10. Dobbins TA, Creighton N, Currow DC, Young JM. Look back for the Charlson Index did not improve risk adjustment of cancer surgical outcomes. J Clin Epidemiol. 2015;68(4):379–86.
11. Kunapuli G. Ensemble methods for machine learning. In: Olstein K, Miller K, editors. Manning Publications Co.; 2023.
12. Hancock JT, Khoshgoftaar TM. CatBoost for big data: an interdisciplinary review. J Big Data. 2020;7(1).
13. Hu L, Li L. Using tree-based machine learning for health studies: literature review and case series. Int J Environ Res Public Health. 2022;19(23).
14. Serrano LG. Grokking machine learning [Internet]. Hales K, editor. Shelter Island (NY): Manning Publications Co.; 2021. p. 1–513.
15. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432. pmid:25738806
16. Richardson E, Trevizani R, Greenbaum JA, Carter H, Nielsen M, Peters B. The receiver operating characteristic curve accurately assesses imbalanced datasets. Patterns. 2024;5(6):100994.
17. Rufibach K. Use of Brier score to assess binary predictions. J Clin Epidemiol. 2010;63(8):938–9; author reply 939. pmid:20189763
18. Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230. pmid:31842878
19. Austin PC, Steyerberg EW. Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med. 2014;33(3):517–35.
20. Austin PC, Steyerberg EW. The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models. Stat Med. 2019;38(21):4051–65.
21. Stevens RJ, Poppe KK. Validation of clinical prediction models: what does the “calibration slope” really measure? J Clin Epidemiol. 2020;118:93–9.
22. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128–38. pmid:20010215
23. Cox DR. Two further applications of a model for binary regression. Biometrika. 1958;45(3/4):562.
24. Janes H, Pepe MS, Gu W. Assessing the value of risk predictions by using risk stratification tables. Ann Intern Med. 2008;149(10):751–60. pmid:19017593
25. Lu S, Robyak K, Zhu Y. The CKD-EPI 2021 equation and other creatinine-based race-independent eGFR equations in chronic kidney disease diagnosis and staging. J Appl Lab Med. 2023;8(5):952–61.
26. Fleet JL, Dixon SN, Shariff SZ, Quinn RR, Nash DM, Harel Z, et al. Detecting chronic kidney disease in population-based administrative databases using an algorithm of hospital encounter and physician claim codes. BMC Nephrol. 2013;14:81. pmid:23560464
27. Kaneko H. Cross-validated permutation feature importance considering correlation between features. Anal Sci Adv. 2022;3(9–10):278–87. pmid:38716264
28. Williams R. Using the margins command to estimate and interpret adjusted predictions and marginal effects. Stata J. 2012;12(2):308–31.
29. Körner A, Sailer B, Sari-Yavuz S, Haeberle HA, Mirakaj V, Bernard A, et al. Explainable Boosting Machine approach identifies risk factors for acute renal failure. Intensive Care Med Exp. 2024;12(1):55. pmid:38874694
30. Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Stat Med. 2010;29(3):337.
31. Elixhauser A, Steiner C, Harris DR, Coffey RM. Comorbidity measures for use with administrative data. Med Care. 1998;36(1):8–27. pmid:9431328
32. van Walraven C, Austin PC, Jennings A, Quan H, Forster AJ. A modification of the Elixhauser comorbidity measures into a point system for hospital death using administrative data. Med Care. 2009;47(6):626–33. pmid:19433995
33. Quan H, Sundararajan V, Halfon P, Fong A, Burnand B, Luthi J-C, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med Care. 2005;43(11):1130–9. pmid:16224307
34. Mehta HB, Li S, An H, Goodwin JS, Alexander GC, Segal JB. Development and validation of the summary Elixhauser comorbidity score for use with ICD-10-CM-coded data among older adults. Ann Intern Med. 2022;175(10):1423–30. pmid:36095314
35. Austin PC, Van Walraven C, Wodchis WP, Newman A, Anderson GM. Using the Johns Hopkins Aggregated Diagnosis Groups (ADGs) to predict mortality in a general adult population cohort in Ontario, Canada. Med Care. 2011;49(10):932–9.
36. Austin PC, Walraven CV. The mortality risk score and the ADG score: two points-based scoring systems for the Johns Hopkins aggregated diagnosis groups to predict mortality in a general adult population cohort in Ontario, Canada. Med Care. 2011;49(10):940–7.
37. Canadian Institute for Health Information. CIHI’s Population Grouping Methodology 1.4 — Overview and Outputs, 2023 [Internet]. Ottawa, ON; 2023.
38. Ladbury C, Zarinshenas R, Semwal H, Tam A, Vaidehi N, Rodin AS, et al. Utilization of model-agnostic explainable artificial intelligence frameworks in oncology: a narrative review. Transl Cancer Res. 2022;11(10):3853–68. pmid:36388027
39. Robson B, Cooper R. Glass box and black box machine learning approaches to exploit compositional descriptors of molecules in drug discovery and aid the medicinal chemist. ChemMedChem. 2024:e202400169. pmid:38837320
40. Wegier P, Koo E, Ansari S, Kobewka D, O’Connor E, Wu P, et al. mHOMR: a feasibility study of an automated system for identifying inpatients having an elevated risk of 1-year mortality. BMJ Qual Saf. 2019;28(12):971–9. pmid:31253736
41. Kuhn M, Johnson K. Applied predictive modeling. Springer New York; 2013. 600 p.
42. Chi S, Tian Y, Wang F, Zhou T, Jin S, Li J. A novel lifelong machine learning-based method to eliminate calibration drift in clinical prediction models. Artif Intell Med. 2022;125:102256. pmid:35241261
43. Davis SE, Greevy RA Jr, Lasko TA, Walsh CG, Matheny ME. Detection of calibration drift in clinical prediction models to inform model updating. J Biomed Inform. 2020;112:103611. pmid:33157313
44. Schulte KJ, Mayrovitz HN. Myocardial infarction signs and symptoms: females vs. males. Cureus. 2023;15(4):e37522.
45. Mamary AJ, Stewart JI, Kinney GL, Hokanson JE, Shenoy K, Dransfield MT, et al. Race and gender disparities are evident in COPD underdiagnoses across all severities of measured airflow obstruction. Chronic Obstr Pulm Dis. 2018;5(3):177–84. pmid:30584581
46. Mittermaier M, Raza MM, Kvedar JC. Bias in AI-based models for medical applications: challenges and mitigation strategies. NPJ Digit Med. 2023;6(1):113. pmid:37311802
47. Chen RJ, Wang JJ, Williamson DFK, Chen TY, Lipkova J, Lu MY. Algorithm fairness in artificial intelligence for medicine and healthcare. Nat Biomed Eng. 2023;7(6):719.
48. Nazer LH, Zatarah R, Waldrip S, Ke JXC, Moukheiber M, Khanna AK, et al. Bias in artificial intelligence algorithms and recommendations for mitigation. PLOS Digit Health. 2023;2(6):e0000278. pmid:37347721
49. Oliveros H, Buitrago G. Validation and adaptation of the Charlson Comorbidity Index using administrative data from the Colombian health system: retrospective cohort study. BMJ Open. 2022;12(3):e054058. pmid:35321892
50. Quan H, Li B, Couris CM, Fushimi K, Graham P, Hider P, et al. Updating and validating the Charlson comorbidity index and score for risk adjustment in hospital discharge abstracts using data from 6 countries. Am J Epidemiol. 2011;173(6):676–82. pmid:21330339
51. Preen DB, Holman CDJ, Spilsbury K, Semmens JB, Brameld KJ. Length of comorbidity lookback period affected regression model performance of administrative health data. J Clin Epidemiol. 2006;59(9):940–6. pmid:16895817
52. Lipscombe LL, Hwee J, Webster L, Shah BR, Booth GL, Tu K. Identifying diabetes cases from administrative data: a population-based validation study. BMC Health Serv Res. 2018;18(1).
53. Schultz SE, Rothwell DM, Chen Z, Tu K. Identifying cases of congestive heart failure from administrative data: a validation study using primary care patient records. Chronic Dis Inj Can. 2013;33(3):160–6. pmid:23735455
54. Tu K, Mitiku T, Lee DS, Guo H, Tu JV. Validation of physician billing and hospitalization data to identify patients with ischemic heart disease using data from the Electronic Medical Record Administrative data Linked Database (EMRALD). Can J Cardiol. 2010;26(7):e225-8. pmid:20847968
55. Hall R, Mondor L, Porter J, Fang J, Kapral MK. Accuracy of administrative data for the coding of acute stroke and TIAs. Can J Neurol Sci. 2016;43(6):765–73.
56. Lopushinsky SR, Covarrubia KA, Rabeneck L, Austin PC, Urbach DR. Accuracy of administrative health data for the diagnosis of upper gastrointestinal diseases. Surg Endosc. 2007;21(10):1733–7. pmid:17285379