
Using tree-based ensemble methods to produce a population-based mortality risk score in Ontario, Canada

  • Steven Habbous ,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft

    Steven.habbous@ontariohealth.ca

    Affiliations Ontario Health, Toronto, Ontario, Canada, Department of Epidemiology and Biostatistics, Western University, London, Ontario, Canada

  • Peter C. Austin,

    Roles Investigation, Methodology, Validation, Writing – review & editing

    Affiliations ICES, Toronto, Canada, Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada

  • Shabnam Balamchi,

    Roles Investigation, Methodology, Validation, Writing – review & editing

    Affiliation Ontario Health, Toronto, Ontario, Canada

  • Davood Astaraky,

    Roles Methodology, Validation, Writing – review & editing

    Affiliation Ontario Health, Toronto, Ontario, Canada

  • Roozbeh Yousefi,

    Roles Methodology, Validation, Writing – review & editing

    Affiliation Ontario Health, Toronto, Ontario, Canada

  • Munaza Chaudhry,

    Roles Methodology, Validation, Writing – review & editing

    Affiliation Ontario Health, Toronto, Ontario, Canada

  • Erik Hellsten

    Roles Conceptualization, Investigation, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation Ontario Health, Toronto, Ontario, Canada

Abstract

Introduction

Risk adjustment is critical in observational epidemiology to control for confounding of the exposure-outcome relationship. Accurate prediction of outcomes, such as mortality, can improve risk adjustment. In the present study, we compared logistic regression with a range of tree-based ensemble methods to predict 1-year mortality in the general population of Ontario, Canada.

Methods

Ontario adults (age 18 years and older) who were alive as of January 1, 2022 were included. Using a lookback window of up to 3 years, various measures of health and healthcare utilization were captured from administrative databases. To predict 1-year mortality, we applied logistic regression, random forests, extremely randomized trees, adaptive boosting, gradient boosting, extreme gradient boosting, Newton boosting, and CatBoost. All models also included age and sex. Performance was evaluated in the 30% test set using the area under the ROC curve (AUROC), the area under the precision-recall curve (PR-AUC), the Brier score, and a quantile-based version of the Integrated Calibration Index (ICI). Feature importance was assessed using CatBoost’s internal model structure, supplemented with permutation feature importance, explainable boosting machines, and marginal effects.

Results

A total of 12,080,801 Ontarians were included and 121,951 (1.0%) died within 1 year. Logistic regression showed excellent discrimination (AUROC 0.926; PR-AUC 0.256) and acceptable calibration (ICI 0.0022). The best model was CatBoost, which had the best discrimination (AUROC 0.933, PR-AUC 0.280) and calibration (ICI 0.0003). In sensitivity analyses of the CatBoost model, including more detailed definitions of cancer (to include its subtype) and chronic kidney disease (defined using serum creatinine instead of diagnostic codes) produced modest improvements in PR-AUC (0.290), along with substantially improved calibration amongst the highest-risk (70–100%) individuals. The most influential feature was age. Residence in long-term care and receipt of palliative care were associated with the largest marginal effects.

Conclusion

The machine learning method CatBoost yielded the most accurate predictive model for 1-year mortality using individual comorbidities and additional measures of healthcare utilization for the general population. These findings demonstrate that machine learning methods can enhance risk adjustment efforts in observational studies, leading to more accurate confounder control and better support for health policy and epidemiologic research.

Introduction

Risk prediction models are common in medicine, often guiding treatment decisions. Examples include risk equations that predict the probability of kidney failure, the probability of lung cancer given a pulmonary nodule, the probability of postoperative pulmonary complications, and the 10-year risk of cardiovascular disease [1–4]. Traditionally, such risk scores have been constructed using standard statistical regression techniques such as linear regression, logistic regression, or Cox proportional hazards regression.

Machine learning methods may be better suited than traditional statistical methods when the number of potential predictors is large, when there is potential for multicollinearity, when synergistic or antagonistic effects (e.g., interactions) between predictors exist, and when associations are non-linear. Traditional statistical models would require the onerous and subjective task of manually coding interaction terms. Tree-based methods automatically consider interactions and non-linearity by virtue of successively splitting the data into partitions as the depth of the trees increases, and, like logistic regression, can also output a predicted probability of the outcome.

Despite this, applications of machine learning methods in public health are limited [5,6]. In the present study, we constructed models for estimating the risk of all-cause 1-year mortality in the general adult population in Ontario, Canada. The output from these models can be used as a single variable for risk adjustment or confounder control, or as a variable of interest to understand the burden of illness in a population. We use a range of tree-based ensemble methods, uncovering some of the black-box features of machine learning methods and providing additional exposure for epidemiologists and health system analysts interested in predictive modeling [7].

Methods

Cohort creation

This retrospective population-based study was conducted in Ontario, Canada. The cohort included all Ontario residents (with valid Ontario postal code and valid unique identifier) who were alive as of January 1, 2022. Persons with sex coded as M or F and aged 18–105 as of the index date (January 1, 2022) were included (N = 12,080,801). Reporting follows STROBE for observational studies (STROBE Checklist in S1 File).

Outcome

The outcome of interest was 1-year all-cause mortality from the index date (January 1, 2022). The date of death was captured from the Registered Persons Database.

Predictors of 1-year mortality

Two sets of predictors were considered. The first set of predictors included age (continuous), sex, and comorbidities (chronic conditions) captured from a hospital setting adapted from the Charlson Comorbidity Index using the updated codes for HIV/AIDS and renal disease as outlined by Glasheen et al [8,9]. The second set of predictors expanded on the first by including additional measures of healthcare utilization that could be captured outside of the hospital setting.

Comorbidities – Chronic conditions

Comorbidities were categorized using ICD-10 codes from hospital visits within three years prior to the index date [8,10]. We used the Discharge Abstract Database (DAD) for inpatient visits (allowing up to 25 diagnostic codes per visit) and the National Ambulatory Care Reporting System (NACRS) for ambulatory hospital visits (allowing up to 10 diagnostic codes per visit) (Appendix 1 in S2 File for codes). Comorbidities included myocardial infarction, congestive heart failure, peripheral vascular disease, cerebrovascular disease, dementia, chronic pulmonary disease, connective tissue/rheumatoid disease, peptic ulcer disease, mild liver disease, diabetes without complications, diabetes with complications, hemiplegia/paraplegia, moderate-to-severe liver disease, mild-to-moderate renal disease, severe renal disease, primary cancer, metastatic cancer, HIV, and AIDS (HIV with opportunistic infection).

Measures of healthcare utilization

We used the Ontario Health Insurance Plan (OHIP) database to capture health and healthcare utilization based on physician billing codes that could plausibly be related (positively or negatively) to mortality. Flags were created using a lookback of 1 year and included billing codes related to obesity, long-term care/chronic care residence, mental health, house calls, American Society of Anesthesiologists (ASA) physical status classification (V, IV, III, none), chronic pain, fibromyalgia, palliative care, smoking cessation, homecare, substance use/abuse, critical care, hyperbaric treatment, vaginal delivery or delivery by Cesarean section, major burns, and amputation (Appendix 2 in S2 File). Acute hospital-based indicators included hip fracture, delirium, and hospital-acquired pressure injury. Indicators of healthcare utilization included the number of unique visits with the healthcare system (OHIP), the number of admissions (DAD), and the number of outpatient hospital encounters (NACRS). The Activities of Daily Living (ADL) score was obtained from the Continuing Care Reporting System (long-term care; complex continuing care) and InterRAI-Home Care databases. The ADL is a measure of an individual’s ability to perform self-care tasks (range 0–28). The most recent score available during the preceding year was used, with missing values assigned a “missing” category.

Data pre-processing

Numeric predictors (age, number of healthcare utilization visits, ADL score) were retained as continuous predictors. Persons missing an ADL score (e.g., those not in a setting to complete an InterRAI survey) were assigned a separate category. These variables were not rescaled, as the machine learning models employed were based on decision trees, which are invariant to monotonic transformations and operate effectively on the rank order of predictor values. Categorical variables were transformed using one-hot encoding (dummy-coded to 0/1) to ensure compatibility with the machine learning algorithms.
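One-hot encoding of a categorical flag can be sketched as follows. This is a minimal scikit-learn illustration, not the study’s actual pipeline; the ASA-style levels are used purely as a hypothetical example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical predictor (levels mimic the ASA flag for illustration).
asa = np.array([["none"], ["III"], ["IV"], ["V"], ["III"]])

encoder = OneHotEncoder()                 # one indicator column per level
X = encoder.fit_transform(asa).toarray()  # dense 0/1 matrix for the learners

print(encoder.categories_[0])  # levels found: ['III' 'IV' 'V' 'none']
print(X.shape)                 # (5, 4)
```

Tree-based learners consume the resulting 0/1 columns directly; no rescaling is needed, consistent with the invariance to monotonic transformations noted above.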

Models

Base model: Logistic regression.

The baseline set of models was logistic regression, which is estimated using maximum likelihood estimation (MLE). MLE identifies the parameters that maximize the log-likelihood function. In machine learning implementations of logistic regression, the objective is typically framed as minimizing the negative average log-likelihood function.
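The negative average log-likelihood described above can be written directly. This is a minimal NumPy sketch of the objective, not the study’s code.

```python
import numpy as np

def neg_avg_log_likelihood(y, p, eps=1e-15):
    """Negative average log-likelihood (log loss) for binary outcomes y
    given predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([0, 0, 1, 1])
good = np.array([0.1, 0.2, 0.8, 0.9])  # probabilities agreeing with outcomes
bad = np.array([0.9, 0.8, 0.2, 0.1])   # the same probabilities reversed

print(neg_avg_log_likelihood(y, good))  # small loss
print(neg_avg_log_likelihood(y, bad))   # much larger loss
```

Minimizing this quantity over the model parameters is equivalent to maximizing the log-likelihood in MLE.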

Tree-based ensemble methods I: Bagging.

The first set of ensemble methods involved bagging (bootstrap aggregation) methods [11]. These methods train n base estimators (e.g., decision tree classifiers) on random resamples of the data; because the estimators do not depend on one another, they can be trained in parallel. In standard bagging, bootstrap samples (sampling with replacement) are drawn from the training data, and each base estimator uses all available features.

In random forests, instead of using all available features to split each node within a tree, a random subset of features is selected at each split, increasing diversity and reducing overfitting.

Extremely randomized trees (ExtraTrees) are similar to random forests, but instead of searching for the optimal split threshold among the random sample of features at each node, ExtraTrees draws a random split threshold for each candidate feature and selects the best-performing of these random splits.

All these bagging-based methods aim to reduce variance and improve generalization by introducing diversity among the base learners, which are typically strong learners (e.g., deep trees).
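A toy comparison of the two bagging-style ensembles is sketched below on synthetic data (not the study cohort). Both expose predicted probabilities through `predict_proba`, as noted above for logistic regression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary data (~10% positives) for illustration only.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

aucs = {}
for Model in (RandomForestClassifier, ExtraTreesClassifier):
    clf = Model(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    p = clf.predict_proba(X_te)[:, 1]       # predicted probability of the positive class
    aucs[Model.__name__] = roc_auc_score(y_te, p)

print({k: round(v, 3) for k, v in aucs.items()})
```

The two classifiers differ only in how node splits are chosen (optimal thresholds on a feature subset vs. randomly drawn thresholds), which is why they share the same scikit-learn interface.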

Tree-based ensemble methods II: Boosting.

The second set of ensemble methods involved boosting methods, which sequentially construct shallow trees (i.e., weak learners), with each iteration aimed at correcting the errors of the previous model [11].

In adaptive boosting (AdaBoost), observations misclassified by previous learners are assigned higher weights, so that subsequent learners focus more on the difficult-to-classify cases. Each weak learner is associated with its own weight, and the final prediction is a weighted combination of all learners.

In gradient boosting, instead of reweighting misclassified observations, learners are fit to the gradients of a loss function, which quantify the degree of error. Here, the loss function is based on residual errors (observed minus predicted from previous weak learner), and weak regressors are fit to the residuals, which are real-valued rather than binary. Gradient boosting uses first-order (single derivative) information.
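The residual-fitting loop described above can be sketched by hand. This is a toy regression example with squared-error loss, where the negative gradient is exactly the residual (observed minus predicted); it is an illustration of the mechanism, not the study’s implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)  # noisy non-linear target

pred = np.full_like(y, y.mean())  # initial prediction: the overall mean
learning_rate = 0.1
for _ in range(100):
    residual = y - pred                        # negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=2)  # shallow tree = weak learner
    tree.fit(X, residual)                      # weak regressor fit to real-valued residuals
    pred += learning_rate * tree.predict(X)    # small corrective step

mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - pred) ** 2)
print(mse_start, mse_end)  # boosting drives the training error down
```

Each iteration nudges the predictions toward the observed values, which is the first-order (single-derivative) update the text describes.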

Newton boosting extends gradient boosting by incorporating second-order derivatives (the Hessian) to adjust weights, enabling faster convergence. Extreme gradient boosting (XGBoost) is a type of Newton boosting, designed for greater computational efficiency and allowing for regularized loss functions.

Finally, we used CatBoost, a newer Newton boosting method optimized for datasets with many categorical variables, especially those with high cardinality (many levels for a categorical variable, such as primary cancer type). CatBoost uses oblivious decision trees instead of standard decision trees (the same splitting criterion is used for all splits occurring at the same level) (https://catboost.ai/) [12–14]. Although CatBoost handles categorical variables and missing values natively, in our application, all categorical variables were non-missing and already one-hot encoded due to their dichotomous nature.

Assessing model performance

The dataset was randomly split into 70% training and 30% test, stratified by the outcome to ensure equal proportions of the outcome in both datasets. Model performance was assessed in the test set while considering performance in the training set to identify evidence of overfitting (e.g., marked degradation in performance between training and test datasets).
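The outcome-stratified split can be sketched as follows, using synthetic labels with the cohort’s approximate 1% event rate (not the study data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1% events, mimicking the cohort's mortality rate.
y = np.array([1] * 100 + [0] * 9900)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature matrix

# stratify=y preserves the event rate in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

print(y_train.mean(), y_test.mean())  # ~0.01 in both sets
```

Without `stratify`, a rare outcome could by chance be over- or under-represented in the test set, distorting the performance comparison between models.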

Discrimination.

The primary metric was the area under the receiver operating characteristic curve (AUROC). We also considered the area under the precision-recall curve (PR-AUC). Whether AUROC or PR-AUC is more appropriate when the outcome is unbalanced (e.g., the outcome of mortality is uncommon) remains a matter of discussion, so we provide both, favouring the AUROC [15,16]. We supplemented this with the Brier score (the analogue of mean squared error for binary outcomes, computed as the average squared difference between the predicted probability and the observed dichotomous outcome coded 0/1) [17]. For all models, we inspected the distribution and the range of the predicted probabilities (ideally ranging from 0.01 to 0.99), how close the mean predicted mortality was to the observed mean mortality (ideally identical), and whether the AUROC, PR-AUC, and Brier scores differed between the training and test datasets (ideally identical).
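The three discrimination metrics are all available in scikit-learn; a minimal sketch on toy predictions (not study output) is shown below. `average_precision_score` is a standard estimator of the PR-AUC.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             roc_auc_score)

# Toy outcomes and predicted probabilities: both positives ranked highest.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.05, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.7, 0.9])

auroc = roc_auc_score(y_true, y_prob)
pr_auc = average_precision_score(y_true, y_prob)            # PR-AUC estimator
brier = brier_score_loss(y_true, y_prob)                    # mean squared distance
print(auroc, pr_auc, brier)
```

Because the two events receive the highest probabilities here, ranking-based metrics (AUROC, PR-AUC) are perfect, while the Brier score still penalizes probabilities that are not exactly 0 or 1, illustrating why it also captures calibration.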

Calibration.

While the Brier score also incorporates calibration [17], we additionally assessed calibration by visual inspection of the calibration curve, plotting centiles of the predicted probability of mortality versus the observed probability of mortality [18,19]. Large deviations of the calibration curve from the diagonal line of perfect calibration may be driven by sparsity (e.g., few observations with a predicted probability in that range) and should therefore not be over-interpreted [20]. Given that the lowess() function was computationally expensive, we calculated a metric comparable to the Integrated Calibration Index (ICIeq) by computing ICIeq = Σ_i (n_i/N) × |mean(y_i) − mean(ŷ_i)|, interpreted as the weighted average of the absolute difference between mean observed and expected outcomes, where n_i is the size of bin i and N is the total sample size (100 bins were selected) (Appendix 3 in S2 File for Python code). We also supplemented this by comparing mean(y) with mean(ŷ) (the closer the better; calibration-in-the-large) and Cox’s calibration slope and intercept [21]. To calculate Cox’s calibration slope and intercept, we used a logistic regression of the observed outcome regressed on the logit-transformed predicted probability of the outcome [21–23]. A well-calibrated model is expected to have a slope of 1 (a unitless quantity) and an intercept of 0. Lastly, for visual support we used Kaplan-Meier plots stratified by predicted probabilities of the outcome, expecting the observed event rates to match the predicted event rates in each group [24].
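A hedged reimplementation of the binned calibration metric described above is sketched below (the study’s actual code is in Appendix 3 in S2 File; this version and its bin-assignment details are our own illustration).

```python
import numpy as np

def ici_eq(y, p, n_bins=100):
    """Quantile-binned ICI: weighted mean absolute difference between
    observed and mean predicted risk per bin."""
    edges = np.quantile(p, np.linspace(0, 1, n_bins + 1))      # quantile bin edges
    idx = np.clip(np.searchsorted(edges, p, side="right") - 1, 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        n_i = mask.sum()
        if n_i:
            total += (n_i / len(y)) * abs(y[mask].mean() - p[mask].mean())
    return total

# Simulated check: outcomes drawn from the predicted probabilities themselves,
# so the "model" is perfectly calibrated by construction.
rng = np.random.default_rng(1)
p = rng.uniform(0.01, 0.2, size=50_000)
y = (rng.uniform(size=p.size) < p).astype(int)
print(ici_eq(y, p))  # near 0 for a well-calibrated model
```

Shifting all predictions upward (e.g., `p + 0.1`) inflates the metric, illustrating that it detects systematic miscalibration.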

Sensitivity analyses

Definition of select comorbidities.

We refined the definition of renal disease and primary cancer to leverage our more detailed data assets (Appendix 4 in S2 File for details). For renal disease, we used serum creatinine from the Ontario Laboratory Information System (OLIS) to calculate eGFR using the 2021 CKD-EPI formula, which is not race-adjusted [25]. This was done because ICD-10 codes have demonstrably poor sensitivity for capturing chronic kidney disease (CKD), resulting in underestimation of this highly prevalent condition [26]. Persons with no serum creatinine were assumed to have no kidney disease. For primary cancer, we used the Ontario Cancer Registry (the gold standard for cancer diagnoses), categorizing each diagnosis using the SEER recode classification system. The SEER codes were unchanged, with the following exceptions due to small sample sizes: 22050 (pleura) and 22060 (trachea, mediastinum, other respiratory organs) were combined with 22030 (lung and bronchus); 35023 (other myeloid/monocytic leukemia), 35031 (acute monocytic leukemia), and 35041 (other acute leukemia) were combined as “Other leukemia”; and 33011 (nodal) and 33012 (extranodal) were combined into “Hodgkin lymphoma”.

Source of comorbidity.

Comorbidities were primarily dichotomized as present/absent. In a sensitivity analysis, we further distinguished comorbidities by hospital visit type as: 1) none (comorbidity absent); 2) ambulatory visit only (NACRS); 3) inpatient visit only (DAD); and 4) ambulatory and inpatient visit (both DAD and NACRS).

Data splitting.

In a sensitivity analysis, we compared the performance of the best-performing model when the same model was constructed using 10-fold cross-validation.

Hyperparameter tuning.

We started with arbitrary values and fine-tuned manually for select models (e.g., number of trees, depth of trees, learning rate). Due to resource constraints, we did not perform a randomized or grid search of hyperparameters.

Validation cohort.

We constructed a second, identically defined validation cohort comprising Ontario adults alive as of January 1, 2024. We applied the best-performing model to this cohort to assess generalizability and potential drift.

Explainability

Feature importance: Internal to the model.

Methods using gradients (gradient boosting, extreme gradient boosting, Newton boosting, CatBoost) do not use entropy or Gini impurity to determine a split. Instead, leaf values are updated using gradient information and indicate the extent to which the model’s predictions should be increased or decreased. Using CatBoost’s internal structure, a feature’s importance was calculated as the sum of the absolute changes in prediction (leaf value changes) caused by splits on that feature, weighted by the number of samples passing through those splits, and then normalized so the importances sum to 100 across all features (formula: https://catboost.ai/docs/en/concepts/fstr#regular-feature-importance).

Feature importance: External to the model.

Model-agnostic methods do not require the inner workings of the model, only its output. We provide two approaches: permutation feature importance, which is more common in the machine learning literature, and marginal effects, which are more common in the statistical literature.

Permutation feature importance (PFI): for each column, the values are randomly shuffled within the dataset, and the difference between the original performance metric and the metric after permutation is estimated [27]. We chose the AUROC as the metric for estimating the PFI.
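The shuffle-and-rescore procedure can be sketched by hand on synthetic data (not the study data), with the AUROC as the metric, mirroring the choice above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic data with a few informative features, for illustration only.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
baseline = roc_auc_score(y, clf.predict_proba(X)[:, 1])

rng = np.random.default_rng(0)
importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    rng.shuffle(X_perm[:, j])               # break the feature-outcome link
    auc = roc_auc_score(y, clf.predict_proba(X_perm)[:, 1])
    importances.append(baseline - auc)      # AUROC lost when feature j is shuffled

print(np.round(importances, 3))
```

Because only the model’s predictions are queried, the same loop applies unchanged to any fitted classifier, which is what makes PFI model-agnostic.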

Marginal effects: To estimate the overall effect of each feature on 1-year mortality, we computed the average marginal effect (AME) and the relative marginal effect (RME) using recycled predictions [28]. For each categorical variable x_i, we computed the predicted probability of the outcome (ŷ) under two scenarios: 1) when all individuals in the cohort had x_i = 0 (yielding ŷ_0); and 2) when all individuals in the cohort had x_i = 1 (yielding ŷ_1). For a numeric variable (age, number of healthcare visits), ŷ_0 uses the value of the variable as-is and ŷ_1 uses the value of the variable + 1. The AME was computed as the mean of the differences (ŷ_1 − ŷ_0), interpreted as the mean absolute change in the outcome due to a 1-unit increase in the predictor. The RME was calculated as the mean of the relative differences (ŷ_1 − ŷ_0)/ŷ_0, interpreted as the average relative change in the predicted probability associated with a 1-unit increase in the predictor. The 95% confidence interval for the AME and RME was computed using the standard error of the difference times the critical t-statistic with alpha = 0.05.
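Recycled predictions for a binary feature can be sketched as follows. A logistic regression on simulated data stands in for the study’s CatBoost model; the data-generating values are our own illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulate a cohort with age and one binary flag that raises mortality risk.
rng = np.random.default_rng(0)
n = 5000
age = rng.normal(50, 15, size=n)
flag = rng.integers(0, 2, size=n)
logit = -5 + 0.05 * age + 1.0 * flag           # assumed true model
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([age, flag])

model = LogisticRegression(max_iter=1000).fit(X, y)

# Recycle the whole cohort under both scenarios: flag forced to 0, then to 1.
X0, X1 = X.copy(), X.copy()
X0[:, 1], X1[:, 1] = 0, 1
p0 = model.predict_proba(X0)[:, 1]
p1 = model.predict_proba(X1)[:, 1]

ame = np.mean(p1 - p0)          # mean absolute change in predicted risk
rme = np.mean((p1 - p0) / p0)   # mean relative change in predicted risk
print(round(ame, 4), round(rme, 4))
```

Because the same individuals are scored under both scenarios, the covariate distribution is held fixed and only the feature of interest changes, which is the essence of the recycled-predictions approach.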

Explainable boosting machines.

Explainable boosting machines (EBM) are tree-based ensemble methods that sacrifice some accuracy for explainability [11,29]. EBMs leverage generalized additive models (GAMs), whereby each predictor in the model is represented by a function learned from the data rather than a single parameter. Each function is smoothed and can be non-linear. These functions are then added together to produce the linear predictor and, as in other regression models, are linked with the outcome through a link function (e.g., logit). With EBMs, each feature’s function is modeled using shallow trees trained by gradient boosting with a small learning rate. Pairwise interactions can be added to GAMs manually, but the EBM approach automates this after first fitting the main effects, and only retains interactions if they improve performance.

Software

Cohort creation was performed using SAS v9.4 (SAS Institute Inc., Cary, NC), and all analyses were performed in Python in Jupyter Notebook (v4.0.11): statsmodels.api v0.14.2 for logistic regression with MLE; scikit-learn v1.5.2 for random forest, ExtraTrees, and (extreme) gradient boosting; catboost v1.2.7 for CatBoost; and interpret.glassbox v0.6.10 for EBMs. numpy v1.26.4 was used because a version <2 is required by CatBoost and older versions are incompatible with EBM. Feature details are presented in Appendix 5 in S2 File. Sample code for training and testing a CatBoost model is provided in Appendix 6 in S2 File.

Privacy and ethics

Research ethics approval was not required as per the Ontario Health privacy assessment as this work was performed for the purpose of quality improvement and no identifying information was obtained. This study was compliant with section 45(1) of PHIPA (Ontario Health is a prescribed entity); thus, patient consent was not required.

Results

A total of 12,080,801 Ontarians who were alive as of January 1, 2022 were included. Just over half (n = 6,194,717; 51.3%) were female, the mean age was 49.0 (SD 18.6) years, and 121,951 (1.0%) died within 1 year (Table 1).

Table 1. Prevalence and crude risk of 1-year all-cause mortality by comorbidity in full cohort (N = 12,080,801).

https://doi.org/10.1371/journal.pone.0347302.t001

The most prevalent comorbidity was diabetes without complications (n = 513,176; 4.3%), followed by primary cancer (n = 257,092; 2.1%) and diabetes with complications (n = 236,554; 2.0%) (Table 1). OLIS identified nearly 6 times more persons living with CKD (n = 345,477; 2.9%) than CIHI. The OCR identified 1.39% of the population as having had a cancer diagnosis in the last 3 years, compared with 2.13% under the standard CCI definition. Among other health encounters, flu vaccination was the most prevalent (12.8%), followed by breast cancer screening (8.8%) and a mental health flag (3.7%).

The risk of 1-year mortality varied by comorbidity type or health encounter type, with the highest mortality rates associated with receipt of palliative care (36%), residence in LTC (27%), dementia (26%), pressure injury (25%), delirium (21%), metastatic cancer (20%), stage 4–5 renal disease (16–17%), and congestive heart failure (16%) (Table 1). The probability of death for primary cancer was 9.3% but varied from 1% (e.g., thyroid or testicular cancer) to >30% (e.g., mesothelioma, pancreatic cancer, brain cancer, esophageal cancer) (Appendix 7 in S2 File).

Model performance

Logistic regression (MLE; model 1A) produced AUROC 0.926, PR-AUC 0.256, Brier Score 0.0085, and ICIeq 0.0022, performing similarly to random forest (model 1B) (Table 2). The best-performing model was CatBoost with a learning rate of 0.05 and a maximum tree depth of 6 (model 1H, Table 2), which had the highest AUROC (0.933), highest PR-AUC (0.281), lowest Brier Score (0.0083), and lowest ICIeq (0.0003) (Fig 1A). Cox’s calibration intercept was not significantly different from 0 (intercept = 0.017; p = 0.09) and Cox’s calibration slope was not significantly different from 1 (slope = 1.005; p = 0.08).

Table 2. Model performance statistics on the test set.

https://doi.org/10.1371/journal.pone.0347302.t002

Fig 1. Calibration curve and Receiver Operating Characteristic (ROC) curves for best performing models.

A) Model 1H (CatBoost; learning rate = 0.05, min samples leaf = 10, no regularization, 1000 iterations, depth = 6); B) Model 2E (CatBoost; same as Model 1H, but with the inclusion of primary cancer types, chronic kidney disease stages, and comorbidities categorized by hospital encounter source). AUROC – area under the ROC curve.

https://doi.org/10.1371/journal.pone.0347302.g001

Sensitivity analysis

Using Model 1H as the comparator, we assessed a series of sensitivity analyses (Table 2, section 3). Defining cancer from the Ontario Cancer Registry improved performance in PR-AUC only, whether the cancer diagnoses were one-hot encoded (Model 2A) or ordered-target encoded (Models 2B–2C). Including the source of comorbidity (e.g., inpatient, outpatient) (Model 2D), defining CKD using OLIS as stages 2–5 (Model 2E), or using 10-fold cross-validation instead of a 70%/30% training/test split (Model 2F) did not appreciably affect performance.

Survival analysis.

A Kaplan-Meier plot of time until death, stratified by the predicted probabilities from model 1H in the test dataset, revealed separation of the curves that was evident as early as 1 month of follow-up and continued through 12 months (Fig 2A). The model over-estimated the risk of death among the group with predicted probabilities between 70 and 100%, which contained only 321 persons (Table 3). We therefore hypothesized that people with the highest actual likelihood of dying could be better captured by incorporating individual cancer diagnoses from the OCR. Using Model 2E, despite no noticeable improvement in overall performance, we found better calibration for the 70–100% risk group (Figs 1B and 2B), although a small degree of miscalibration was observed [Cox’s intercept 0.028 (p = 0.006) and Cox’s slope 1.009 (p = 0.002)].

Table 3. 1-year mortality by summary score in test set (n=3,624,241).

https://doi.org/10.1371/journal.pone.0347302.t003

Fig 2. Kaplan-Meier plots for 1-year all-cause mortality stratified by the predicted probability of death.

A) Model 1H (CatBoost; learning rate = 0.05, min samples leaf = 10, no regularization, 1000 iterations, depth = 6); B) Model 2E (CatBoost; same as Model 1H, but with the inclusion of primary cancer types, chronic kidney disease stages, and comorbidities categorized by hospital encounter source).

https://doi.org/10.1371/journal.pone.0347302.g002

Explainability

Based on feature importance from CatBoost model 1H, age was the most important feature, followed by the number of outpatient hospital visits, sex, the number of encounters with the healthcare system (based on OHIP), ADL score, the number of hospitalizations, palliative care, and breast cancer screening (Fig 3A). Using PFI, a similar set of features was identified (Fig 3B). Using EBM, age was the most important feature and was included in interactions with several other variables (Fig 3C).

Fig 3. Feature importance: A) Feature importance from CatBoost (Model 1H), internal from model structure; B) Permutation feature importance (Model 1H), model-agnostic; C) Importance from Explainable Boosting Machine (Model 3D).

Model 1H: CatBoost; learning rate = 0.05, min samples leaf = 10, no regularization, 1000 iterations, depth = 6. Model 3D: explainable boosting machine max rounds = 1000, learning rate = 0.01, max leaves = 3, max bins = 255, interactions = 20. DAD – Discharge Abstract Database (hospitalizations); NACRS – National Ambulatory Care Reporting System (ambulatory hospital visits); OHIP – Ontario Health Insurance Plan (physician billing); ASA – American Society of Anesthesiologists (ASA) physical status classification; ADL – Activities of Daily Living score.

https://doi.org/10.1371/journal.pone.0347302.g003

To examine the effect of each covariate on mortality risk on an interpretable scale, we report the model parameters from a logistic regression model equivalent to Model 2E in Appendix 8 in S2 File (except comorbidities are dichotomized for simplicity). We also calculated the average and relative marginal effects using model 1H (Appendix 9 in S2 File). Features associated with the highest increased predicted risk of death included receipt of palliative care (AME 4.03%; RME 438%), followed by moderate-to-severe liver disease (AME 2.60%; RME 261%) and metastatic cancer (AME 1.53%; RME 159%). Several features were associated with a lower predicted risk of mortality (e.g., breast cancer screening, ASA score 3, AIDS, influenza vaccination), but the AME was small in magnitude (<0.3% absolute reduction).

Validation

Applying the best model (2E) to the cohort of Ontarians alive as of January 1, 2024 produced the same AUROC (0.933) and a similar Brier Score (0.0077), but a slightly lower PR-AUC (0.254) and worse calibration (ICIeq 0.0008) due to underestimation of risk (Fig 4).

Fig 4. Calibration curve and Receiver Operating Characteristic (ROC) curves for the best-performing model (2E) applied to the 2024 validation cohort.

AUROC – area under the ROC curve.

https://doi.org/10.1371/journal.pone.0347302.g004

Discussion

In the present study we examined different machine learning methods for estimating the risk of 1-year mortality using a wide range of health and healthcare indicators. The tree-based ensemble method CatBoost produced the most accurate predictive model and exhibited the best calibration statistics.

Having an accurate predictive model is useful because the predicted probability of mortality can provide some degree of confounder control in epidemiologic studies [30]. Logistic regression to predict the risk of 1-year mortality has been used to produce some of the most common summary scores used for risk adjustment, including the Charlson Comorbidity Index [8], the Elixhauser comorbidity index [31–34], and the Johns Hopkins Aggregated Diagnostic Groups (ADGs) [35]. Although these methods have produced AUROCs as high as 0.917 in the general population and remain prognostic many years after their development [35,36], the better discrimination found in our models (AUROC 0.933) suggests that there is opportunity for more accurate risk adjustment in the general population by leveraging as little as one additional data source (physician billing). Moreover, we used multiple metrics and visual cues to identify the best-performing model. For cohorts where cancer is more prevalent, the performance of the model, and particularly its calibration, would be improved if a more detailed accounting of primary cancer were included.

The improved performance may be driven by several factors. First, we included a wider range of predictors, including those that may be associated with lower mortality (e.g., caesarean or vaginal delivery, influenza vaccination, breast cancer screening). Second, we explicitly included age and sex in the model, while non-linearity and interactions were captured implicitly. Compared with tree-based methods, logistic regression would require manual coding of higher-order features for non-linearity, as well as interaction terms, which would be onerous, subjective, and prone to overfitting or underfitting even with backwards selection. Tree-based methods automatically incorporate interactions (the EBM model supported the importance of interactions, particularly with age) and non-linearity by recursively splitting the data into partitions as the depth of the trees increases. Like logistic regression, tree-based methods can output a predicted probability of the event. For all comparable models in the general population, CatBoost outperformed logistic regression on all metrics.

Strengths

Machine learning models have been criticized as ‘black boxes’, producing parameters that are unobservable or uninterpretable. The coefficients from a logistic regression output can be interpreted as the relative change in the log odds of an event due to the presence of a comorbidity, but the presence of higher-order terms and interaction terms makes interpretation challenging. Moreover, negative (protective) coefficients are sometimes difficult to contextualize and are likely driven by residual confounding. The Elixhauser comorbidity scoring system assigns negative weights to several conditions, including obesity and AIDS, features that we also found to have small negative marginal effects [34]. CIHI’s Population Grouper [37] and the Johns Hopkins ADG [36] are both proprietary algorithms, so a deeper understanding of their inner workings is unavailable. A similar issue affects machine learning methods, but there it arises from the opaqueness of the methods themselves rather than from the modeler’s choice. To mitigate this, we unpacked the inner workings of the model through feature importance rankings (typical for machine learning applications), supplemented with absolute and relative marginal effects (more typical of epidemiology applications). We also provide the code that an analyst can use to build their own model using a similar approach, as well as the actual code used to make predictions using the model ‘as-is’ (File S3 for Model 1H and File S4 for Model 2E).

There is a trade-off between explainability and performance, and many analytics practitioners have sought to meet somewhere in between (e.g., using ‘glass-box’ methods optimized for explainability, or reducing the number of features and interactions for parsimony) [29,38,39]. There are a few considerations regarding this trade-off. One is the application: predicting weather or stock prices may demand maximal accuracy, while healthcare applications may strive for more explainability (e.g., identifying modifiable risk factors or high-risk groups). Another approach is to take a data-informed rather than a data-driven approach. For example, one study computed a 12-month mortality risk score at the time of admission, and patients with a score exceeding some threshold were referred for palliative care consultation [40]. However, the actual risk score was not provided in the referral because the decision to provide palliative care should be left to the discretion of the clinical staff, not the model. In this example, neither the score itself (only the chosen threshold) nor the individual components driving the score were important. In our study, we promote explainability to lend credibility and support to the model’s construction and outputs, and we do so by several means. First, interpretation of calibration plots and supportive visualizations such as Kaplan-Meier plots was useful for improving and understanding the model more fully. Increasing model complexity by including primary cancer types, the source of comorbidities, and chronic kidney disease stages improved calibration-in-the-large for people with a predicted mortality of 70% or higher. This decision was supported by the knowledge (and observation) that different cancer types have vastly different mortality rates, and the intuition that persons hospitalized for a condition are likely at a higher risk of mortality than those who are not [41].
Since this is a small group of people and death is an uncommon outcome, there was little impact on the overall performance of the model based on summary statistics alone (AUROC, PR-AUC, Brier score, ICIeq). Second, we took a top-down approach by identifying factors potentially associated with mortality and then measuring them, which facilitates interpretability. This was done by leveraging existing comorbidity scores prevalent in the literature, plus factors known or purported to be associated (positively or negatively) with mortality. Conversely, a bottom-up approach could have yielded better performance at the expense of explainability: the analyst would be removed from the feature engineering process, leaving the model-building process to select predictors on its own. Features could include all diagnostic codes, procedure codes, physician billing codes, laboratory values, and a range of questions from InterRAI surveys beyond ADL, which would yield thousands of features.

Such an approach would also be computationally prohibitive. Moreover, there is an unknown practical limit for all statistical measures of discrimination and calibration, since 1) mortality is not always predictable; and 2) data are imperfect (e.g., inaccurate diagnoses, migration patterns, incomplete death linkage). Nevertheless, our models performed very well relative to the theoretical limits (AUROCmax = 1; ICIeq,max = 0), and changes to hyperparameters produced only modest changes in discrimination or calibration. Still, there are some limitations that must be acknowledged.

Limitations

One limitation is the computational resources available. While advancing our ability and capacity to use machine learning methods, our organization is still immature on this front [6]. We adhere to a ‘Cloud-First’ strategy, but reliance on a single-vendor model imposes structural and financial constraints that are ill-suited for the iterative nature of machine learning research. On our Azure Virtual Desktop we have access to eight CPU cores, 137 GB of RAM, and no GPUs. We were therefore unable to conduct full cross-validation assessments or hyperparameter tuning through grid or randomized search, and it remains possible that a better model fits these data. Despite this, the convergence of different models across all measures of model performance suggests that any further improvements are likely to be small. For the present study this was not a barrier, but for work requiring more sophisticated models, a hybrid approach of cloud and on-premises computing resources would be important for cost-contained model development.

Another limitation of the present work, which extends to all predictive models, is the risk of data drift and the changing importance of a predictor over time [42,43]. Diagnostic codes, case definitions, and clinical management may change over time, requiring periodic monitoring for drift in performance. For example, we observed a degradation of calibration when the model was applied to a 2024 cohort, but no degradation in discrimination (AUROC). This is perhaps not surprising, since factors associated with mortality are unlikely to change drastically over time and the AUROC does not rely on calibration. However, the worse calibration in 2024 illustrates that the absolute value of the score can be misleading and should be interpreted cautiously, whereas its ranking (i.e., discrimination) is less time-varying.

It is possible that the use of variables related to healthcare resource utilization resulted in some algorithmic bias, and therefore different degrees of accuracy for different sociodemographic groups [44]. For example, if select populations are less likely to engage with the healthcare system, less likely to have a diagnosis rendered, or have the diagnosis rendered at a more advanced stage of illness, then predictors like healthcare utilization frequency may perpetuate such biases [45–47]. We would anticipate underperformance (e.g., higher than expected mortality rates) for such groups. Checklists for artificial intelligence algorithms can improve the transparency of artificial intelligence and machine learning models, and additional guidelines are available to help mitigate bias (IJMEDI checklist in S1 File) [48].

Importantly, we expect these findings to be generalizable to other jurisdictions because comorbidity indices have been demonstrated to be similarly prognostic internationally [49,50]. We also expect the model to generalize to outcomes that are correlated with 1-year mortality, although we expect performance to deteriorate as look-forward windows increase (e.g., 5-year mortality). Our results are also likely generalizable to disease-specific cohorts, but with worse performance, because disease-specific or event-specific indicators become more important as they become more prevalent. Similar reasoning can explain the worse performance for outcomes like hospital readmission, potentially requiring different models to be trained for specific outcomes [34,51].

Another limitation is the worse performance of the model for predicted probabilities of dying ≥70%, although this represents a small absolute number of people (n = 543; 0.01% of the adult population). The performance of the model would be expected to improve if more granular clinical indicators were available. For example, we demonstrated that the specific cancer type is an important prognostic factor that improved model performance (accuracy and calibration), but further breakdown by cancer stage, cancer subtype, and even method of detection (screening versus symptomatic) would be expected to yield further improvements. Similar arguments can be made for other diseases and for measures of general health beyond ADL, but such data are unavailable or incomplete in our databases.

Another limitation is inherent errors in administrative coding of diagnoses. Inaccuracies in coding renal disease have been noted, and algorithms for more accurate case definitions have been published for diabetes [52], congestive heart failure [53], ischemic heart disease [54], stroke [55], and upper gastrointestinal diseases [56]. However, none of these algorithms is perfect, and a search for optimal definitions of comorbidities becomes onerous and perhaps unnecessary for the present scope, which aims to model mortality rather than to measure the association of each comorbidity with mortality.

Conclusion

Tree-based ensemble methods yielded the most accurate predictive model for 1-year mortality using individual comorbidities and additional measures of health and healthcare utilization in the general population. We provide algorithms that other investigators can use to estimate a person's baseline risk of mortality, for direct study or for risk adjustment.

Supporting information

S1 File. STROBE checklist for observational studies in epidemiology.

International Journal of Medical Informatics (IJMEDI) checklist for assessment of medical artificial intelligence.

https://doi.org/10.1371/journal.pone.0347302.s001

(PDF)

S2 File. Includes supplementary tables for administrative codes, feature definitions, Python code for training/testing the CatBoost model 1H, and supporting results.

https://doi.org/10.1371/journal.pone.0347302.s002

(DOCX)

Acknowledgments

Parts of this material are based on data and information compiled and provided by CIHI. However, the analyses, conclusions, opinions, and statements expressed herein are those of the author, and not necessarily those of CIHI. Parts of this publication are based on data provided by ICES. However, the views expressed in this publication are those of the researcher and do not necessarily represent those of ICES. This report was produced with the support of the Ontario Ministry of Health. However, the views expressed herein are those of the author, and not necessarily those of the Ontario Ministry of Health or the Government of Ontario.

References

1. Tangri N, Stevens LA, Griffith J, Tighiouart H, Djurdjev O, Naimark D, et al. A predictive model for progression of chronic kidney disease to kidney failure. JAMA. 2011;305(15):1553–9. pmid:21482743
2. McWilliams A, Tammemagi MC, Mayo JR, Roberts H, Liu G, Soghrati K. Probability of cancer in pulmonary nodules detected on first screening CT. N Engl J Med. 2013;369(10):910–9.
3. Canet J, Gallart L, Gomar C, Paluzie G, Vallès J, Castillo J, et al. Prediction of postoperative pulmonary complications in a population-based surgical cohort. Anesthesiology. 2010;113(6):1338–50. pmid:21045639
4. D’Agostino RB Sr, Vasan RS, Pencina MJ, Wolf PA, Cobain M, Massaro JM, et al. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation. 2008;117(6):743–53. pmid:18212285
5. Pagallo U, O’Sullivan S, Nevejans N, Holzinger A, Friebe M, Jeanquartier F, et al. The underuse of AI in the health sector: opportunity costs, success stories, risks and recommendations. Health Technol (Berl). 2024;14(1):1–14. pmid:38229886
6. Habbous S, Herring J, Badiani T, Brown A, Hillmer M, Rosella LC. A qualitative assessment of the use of artificial intelligence in public sector health organisations in Ontario, Canada. Swiss Med Wkly. 2026;156:4942. pmid:41962065
7. Wiens J, Shenoy ES. Machine learning for healthcare: on the verge of a major shift in healthcare epidemiology. Clin Infect Dis. 2018;66(1):149–53.
8. Charlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis. 1987;40(5):373–83. pmid:3558716
9. Glasheen WP, Cordier T, Gumpina R, Haugh G, Davis J, Renda A. Charlson comorbidity index: ICD-9 update and ICD-10 translation. Am Health Drug Benefits. 2019;12(4):188.
10. Dobbins TA, Creighton N, Currow DC, Young JM. Look back for the Charlson Index did not improve risk adjustment of cancer surgical outcomes. J Clin Epidemiol. 2015;68(4):379–86.
11. Kunapuli G. Ensemble methods for machine learning. In: Olstein K, Miller K, editors. Manning Publications Co.; 2023.
12. Hancock JT, Khoshgoftaar TM. CatBoost for big data: an interdisciplinary review. J Big Data. 2020;7(1).
13. Hu L, Li L. Using tree-based machine learning for health studies: literature review and case series. Int J Environ Res Public Health. 2022;19(23).
14. Serrano LG. Grokking machine learning [Internet]. Hales K, editor. Shelter Island (NY): Manning Publications Co.; 2021. p. 1–513.
15. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432. pmid:25738806
16. Richardson E, Trevizani R, Greenbaum JA, Carter H, Nielsen M, Peters B. The receiver operating characteristic curve accurately assesses imbalanced datasets. Patterns. 2024;5(6):100994.
17. Rufibach K. Use of Brier score to assess binary predictions. J Clin Epidemiol. 2010;63(8):938–9; author reply 939. pmid:20189763
18. Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230. pmid:31842878
19. Austin PC, Steyerberg EW. Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med. 2014;33(3):517–35.
20. Austin PC, Steyerberg EW. The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models. Stat Med. 2019;38(21):4051–65.
21. Stevens RJ, Poppe KK. Validation of clinical prediction models: what does the “calibration slope” really measure? J Clin Epidemiol. 2020;118:93–9.
22. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128–38. pmid:20010215
23. Cox DR. Two further applications of a model for binary regression. Biometrika. 1958;45(3/4):562.
24. Janes H, Pepe MS, Gu W. Assessing the value of risk predictions by using risk stratification tables. Ann Intern Med. 2008;149(10):751–60. pmid:19017593
25. Lu S, Robyak K, Zhu Y. The CKD-EPI 2021 equation and other creatinine-based race-independent eGFR equations in chronic kidney disease diagnosis and staging. J Appl Lab Med. 2023;8(5):952–61.
26. Fleet JL, Dixon SN, Shariff SZ, Quinn RR, Nash DM, Harel Z, et al. Detecting chronic kidney disease in population-based administrative databases using an algorithm of hospital encounter and physician claim codes. BMC Nephrol. 2013;14:81. pmid:23560464
27. Kaneko H. Cross-validated permutation feature importance considering correlation between features. Anal Sci Adv. 2022;3(9–10):278–87. pmid:38716264
28. Williams R. Using the margins command to estimate and interpret adjusted predictions and marginal effects. Stata J. 2012;12(2):308–31.
29. Körner A, Sailer B, Sari-Yavuz S, Haeberle HA, Mirakaj V, Bernard A, et al. Explainable Boosting Machine approach identifies risk factors for acute renal failure. Intensive Care Med Exp. 2024;12(1):55. pmid:38874694
30. Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Stat Med. 2010;29(3):337.
31. Elixhauser A, Steiner C, Harris DR, Coffey RM. Comorbidity measures for use with administrative data. Med Care. 1998;36(1):8–27. pmid:9431328
32. van Walraven C, Austin PC, Jennings A, Quan H, Forster AJ. A modification of the Elixhauser comorbidity measures into a point system for hospital death using administrative data. Med Care. 2009;47(6):626–33. pmid:19433995
33. Quan H, Sundararajan V, Halfon P, Fong A, Burnand B, Luthi J-C, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med Care. 2005;43(11):1130–9. pmid:16224307
34. Mehta HB, Li S, An H, Goodwin JS, Alexander GC, Segal JB. Development and validation of the summary Elixhauser comorbidity score for use with ICD-10-CM-coded data among older adults. Ann Intern Med. 2022;175(10):1423–30. pmid:36095314
35. Austin PC, Van Walraven C, Wodchis WP, Newman A, Anderson GM. Using the Johns Hopkins Aggregated Diagnosis Groups (ADGs) to predict mortality in a general adult population cohort in Ontario, Canada. Med Care. 2011;49(10):932–9.
36. Austin PC, Walraven CV. The mortality risk score and the ADG score: two points-based scoring systems for the Johns Hopkins aggregated diagnosis groups to predict mortality in a general adult population cohort in Ontario, Canada. Med Care. 2011;49(10):940–7.
37. Canadian Institute for Health Information. CIHI’s Population Grouping Methodology 1.4 — Overview and Outputs, 2023 [Internet]. Ottawa, ON; 2023.
38. Ladbury C, Zarinshenas R, Semwal H, Tam A, Vaidehi N, Rodin AS, et al. Utilization of model-agnostic explainable artificial intelligence frameworks in oncology: a narrative review. Transl Cancer Res. 2022;11(10):3853–68. pmid:36388027
39. Robson B, Cooper R. Glass box and black box machine learning approaches to exploit compositional descriptors of molecules in drug discovery and aid the medicinal chemist. ChemMedChem. 2024:e202400169. pmid:38837320
40. Wegier P, Koo E, Ansari S, Kobewka D, O’Connor E, Wu P, et al. mHOMR: a feasibility study of an automated system for identifying inpatients having an elevated risk of 1-year mortality. BMJ Qual Saf. 2019;28(12):971–9. pmid:31253736
41. Kuhn M, Johnson K. Applied predictive modeling. Springer New York; 2013. 600 p.
42. Chi S, Tian Y, Wang F, Zhou T, Jin S, Li J. A novel lifelong machine learning-based method to eliminate calibration drift in clinical prediction models. Artif Intell Med. 2022;125:102256. pmid:35241261
43. Davis SE, Greevy RA Jr, Lasko TA, Walsh CG, Matheny ME. Detection of calibration drift in clinical prediction models to inform model updating. J Biomed Inform. 2020;112:103611. pmid:33157313
44. Schulte KJ, Mayrovitz HN. Myocardial infarction signs and symptoms: females vs. males. Cureus. 2023;15(4):e37522.
45. Mamary AJ, Stewart JI, Kinney GL, Hokanson JE, Shenoy K, Dransfield MT, et al. Race and gender disparities are evident in COPD underdiagnoses across all severities of measured airflow obstruction. Chronic Obstr Pulm Dis. 2018;5(3):177–84. pmid:30584581
46. Mittermaier M, Raza MM, Kvedar JC. Bias in AI-based models for medical applications: challenges and mitigation strategies. NPJ Digit Med. 2023;6(1):113. pmid:37311802
47. Chen RJ, Wang JJ, Williamson DFK, Chen TY, Lipkova J, Lu MY. Algorithm fairness in artificial intelligence for medicine and healthcare. Nat Biomed Eng. 2023;7(6):719.
48. Nazer LH, Zatarah R, Waldrip S, Ke JXC, Moukheiber M, Khanna AK, et al. Bias in artificial intelligence algorithms and recommendations for mitigation. PLOS Digit Health. 2023;2(6):e0000278. pmid:37347721
49. Oliveros H, Buitrago G. Validation and adaptation of the Charlson Comorbidity Index using administrative data from the Colombian health system: retrospective cohort study. BMJ Open. 2022;12(3):e054058. pmid:35321892
50. Quan H, Li B, Couris CM, Fushimi K, Graham P, Hider P, et al. Updating and validating the Charlson comorbidity index and score for risk adjustment in hospital discharge abstracts using data from 6 countries. Am J Epidemiol. 2011;173(6):676–82. pmid:21330339
51. Preen DB, Holman CDJ, Spilsbury K, Semmens JB, Brameld KJ. Length of comorbidity lookback period affected regression model performance of administrative health data. J Clin Epidemiol. 2006;59(9):940–6. pmid:16895817
52. Lipscombe LL, Hwee J, Webster L, Shah BR, Booth GL, Tu K. Identifying diabetes cases from administrative data: a population-based validation study. BMC Health Serv Res. 2018;18(1).
53. Schultz SE, Rothwell DM, Chen Z, Tu K. Identifying cases of congestive heart failure from administrative data: a validation study using primary care patient records. Chronic Dis Inj Can. 2013;33(3):160–6. pmid:23735455
54. Tu K, Mitiku T, Lee DS, Guo H, Tu JV. Validation of physician billing and hospitalization data to identify patients with ischemic heart disease using data from the Electronic Medical Record Administrative data Linked Database (EMRALD). Can J Cardiol. 2010;26(7):e225-8. pmid:20847968
55. Hall R, Mondor L, Porter J, Fang J, Kapral MK. Accuracy of administrative data for the coding of acute stroke and TIAs. Can J Neurol Sci. 2016;43(6):765–73.
56. Lopushinsky SR, Covarrubia KA, Rabeneck L, Austin PC, Urbach DR. Accuracy of administrative health data for the diagnosis of upper gastrointestinal diseases. Surg Endosc. 2007;21(10):1733–7. pmid:17285379