
Paying attention to cardiac surgical risk: An interpretable machine learning approach using an uncertainty-aware attentive neural network

  • Jahan C. Penny-Dimri ,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Writing – original draft, Writing – review & editing

    jahan.penny-dimri@monash.edu

    Affiliation Department of Surgery, School of Clinical Sciences at Monash Health, Monash University, Melbourne, Vic, Australia

  • Christoph Bergmeir,

    Roles Conceptualization, Supervision, Writing – review & editing

    Affiliations Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain, Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Melbourne, Vic, Australia

  • Christopher M. Reid,

    Roles Data curation, Writing – review & editing

    Affiliation Department of Epidemiology and Preventive Medicine, Monash University, Melbourne, Vic, Australia

  • Jenni Williams-Spence,

    Roles Data curation, Writing – review & editing

    Affiliation Department of Epidemiology and Preventive Medicine, Monash University, Melbourne, Vic, Australia

  • Andrew D. Cochrane,

    Roles Supervision, Writing – review & editing

    Affiliation Department of Surgery, School of Clinical Sciences at Monash Health, Monash University, Melbourne, Vic, Australia

  • Julian A. Smith

    Roles Supervision, Writing – review & editing

    Affiliation Department of Surgery, School of Clinical Sciences at Monash Health, Monash University, Melbourne, Vic, Australia

Abstract

Machine learning (ML) is increasingly applied to predict adverse postoperative outcomes in cardiac surgery. Commonly used ML models fail to translate to clinical practice due to absent model explainability, limited uncertainty quantification, and inflexibility to missing data. We aimed to develop and benchmark a novel ML approach, the uncertainty-aware attention network (UAN), to overcome these common limitations. Two Bayesian uncertainty quantification methods were tested: generalized variational inference (GVI) and a posterior network (PN). The UAN models were compared with an ensemble of XGBoost models and a Bayesian logistic regression model (LR) with imputation. The derivation datasets consisted of 153,932 surgery events from the Australian and New Zealand Society of Cardiac and Thoracic Surgeons (ANZSCTS) Cardiac Surgery Database. The external validation dataset consisted of 7,343 surgery events extracted from the Medical Information Mart for Intensive Care (MIMIC) III critical care dataset. The highest performing model on the external validation dataset was a UAN-GVI with an area under the receiver operating characteristic curve (AUC) of 0.78 (0.01). Model performance improved on high confidence samples, with an AUC of 0.81 (0.01). Confidence calibration for aleatoric uncertainty was excellent for all models. Calibration for epistemic uncertainty was more variable, with an ensemble of XGBoost models performing best with an AUC of 0.84 (0.08). Epistemic uncertainty was improved using the PN approach compared with GVI. The UAN uses an interpretable and flexible deep learning approach to provide estimates of model uncertainty alongside state-of-the-art predictions. The model has been made freely available as an easy-to-use web application, demonstrating that, by designing uncertainty-aware models with innately explainable predictions, deep learning may become more suitable for routine clinical use.

Introduction

Machine learning (ML) is increasingly being applied to risk stratification and prediction of postoperative outcomes in cardiac surgery [1]. The extreme physiological demands of cardiac surgery make the development of effective risk stratification tools an important strategy for improving patient care [2]. Currently, the most widely used tools are clinical scores derived from logistic regression (LR) models [3–5]. Modern approaches, such as ensemble decision tree (EDT) models and deep neural networks, have had limited success in improving upon the performance or interpretability of the standard LR methods [6–9].

An ML model must demonstrate four key qualities to be safely applied in a healthcare setting: it needs to be performant, interpretable, uncertainty-aware, and robust to incomplete data [10]. The current gold standard in risk stratification for cardiac surgery is the LR model, which has proven performant, interpretable, and, to an extent, uncertainty-aware [11]. EDT models, such as gradient boosting machines, have also been shown to be highly performant, flexible to missing data, and globally interpretable [12]. While each of these models fulfills some of the qualities needed, their limitations restrict widespread adoption into clinical practice.

Aims and hypothesis

Our aims were to: 1) develop a novel ML model, the uncertainty-aware attention network (UAN), that fulfills the four key qualities of a clinical ML model; 2) benchmark its predictive performance against two current gold standard models, a popular EDT (XGBoost) and a Bayesian LR model; 3) benchmark its uncertainty quantification against the gold standard models; 4) benchmark its performance with missingness against common imputation methods; and 5) demonstrate the predictive explanations generated by the UAN.

We hypothesized that the UAN would be as performant as the benchmark models, quantify uncertainty well, and provide individual explanations for each prediction, visualized with a heat map.

Methods

Study population

The Australian and New Zealand Society of Cardiac and Thoracic Surgeons (ANZSCTS) Cardiac Surgery Database registry recorded 153,932 cardiac surgery events in 151,078 unique patients from June 2001 to December 2019, captured at 32 centers in Australia. As the Database stores sensitive patient information, it is not publicly available. The criterion for inclusion in the database was any patient undergoing cardiac surgery, other thoracic surgery using cardiopulmonary bypass (CPB), or pericardiectomy for constrictive pericarditis, whether performed on or off CPB [13].

The third version of the Medical Information Mart for Intensive Care (MIMIC III), a large single-center database, contained data from 53,423 distinct hospital admissions for 38,597 distinct adult patients admitted to the intensive care unit at Beth Israel Deaconess Medical Center in Boston, Massachusetts between 2001 and 2012 [14]. A surgical cohort undergoing either coronary artery bypass or valvular surgery was subsetted and used for external validation of models fitted to the ANZSCTS database.

Variable selection

Forty-one preoperative, intraoperative, and early postoperative variables available in either dataset were analyzed for inclusion in the predictive modeling. Included variables and their definitions are listed in S1 Table of the supplemental materials.

Outcomes

Eight clinically significant outcomes were modeled; however, only 30-day mortality was available in both datasets for external validation. Outcomes and their definitions are listed in S2 Table of the supplemental materials. While multiple clinically important outcomes were tested, for simplicity only the results for 30-day mortality on the external validation dataset are reported in the rest of the manuscript. Performance and calibration results for all outcomes on the training and validation datasets are available in S4 and S5 Tables of the supplemental materials.

Data preprocessing

Three different processing strategies were used. Firstly, a minimal processing strategy was applied for models capable of handling missing data, whereby all patient data were included in training and validation. Secondly, a partial dataset was further subsetted, including only variables present in both the external and derivation datasets and excluding patient data with missing values. The partial dataset was used to train and validate models incapable of handling missing data. Thirdly, two imputation strategies were used to replace missing values, train the baseline models, and compare performance against the UAN without imputation: simple imputation using mode and mean replacement, and multivariate imputation with chained equations (MICE) [15]. Throughout this process, outlier values for each variable were relabelled as missing. The data processing pipeline for the datasets described above is visualized in Fig 1, and a minimal sketch of the two imputation strategies is given below. Definitions for outliers can be found in S1 Table of the supplemental materials.
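To make the preprocessing concrete, the following sketch shows how the two imputation strategies could be set up with scikit-learn's SimpleImputer and IterativeImputer (a MICE-style imputer). The arrays, shapes, and settings are illustrative placeholders, not the study's actual pipeline.

    # Illustrative sketch only: simple and MICE-style imputation.
    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    # Hypothetical feature arrays; np.nan marks missing values
    # (including outliers relabelled as missing).
    X_cont = np.array([[62.0, np.nan], [71.0, 88.0], [np.nan, 74.0]])  # continuous
    X_cat = np.array([[0.0, 1.0], [np.nan, 1.0], [1.0, np.nan]])       # categorical

    # Simple imputation: mean for continuous, mode for categorical features.
    mean_imputer = SimpleImputer(strategy="mean").fit(X_cont)
    mode_imputer = SimpleImputer(strategy="most_frequent").fit(X_cat)

    # Multivariate imputation with chained equations (MICE-style).
    mice_imputer = IterativeImputer(max_iter=10, random_state=0).fit(X_cont)

    # Imputers are calibrated on the training fold only, then applied
    # to both the training and validation folds.
    X_cont_filled = mice_imputer.transform(X_cont)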

Fig 1. Flow diagram depicting included and excluded data following cleaning and subsetting for the full and partial ANZSCTS datasets, and the MIMIC III dataset.

https://doi.org/10.1371/journal.pone.0289930.g001

Machine learning models

Uncertainty-aware attention network.

At the core of the UAN is the attention function. The attention function is a method of mapping outcome vectors, called 'queries,' to inputs that are transformed with neural networks into 'keys' and 'values' [16]. The queries, keys, and values are combined using dot product operations, allowing for the processing of sequences of different lengths and the masking of specific values [16]. Dot-product attention has been very successful in processing sequences, but it can also be extended to processing incomplete inputs [16].

In predictive tasks with missing values, the question being answered is 'what is the probability of the outcome given the observed values?' An attention model that learns a posterior distribution with Bayesian methods can effectively answer that question for multiple outcomes, in a principled and consistent way, without imputation, excluding data, or limiting the scope of input variables to the model. The attention function is also innately interpretable: the attention the model pays to each variable can be visualized in a heat map, allowing for predictive explainability.
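As a concrete illustration, the sketch below shows scaled dot-product attention with missing features masked out before the softmax, so the prediction is a weighted combination of observed inputs only. All names and dimensions are illustrative assumptions; the actual architecture is described in S1 Appendix.

    # Hedged sketch: dot-product attention that ignores missing inputs.
    import torch
    import torch.nn.functional as F

    def masked_attention(query, keys, values, observed):
        # query: (d,) outcome embedding; keys/values: (n_features, d);
        # observed: (n_features,) bool, True where the feature is present.
        d = query.shape[-1]
        scores = keys @ query / d ** 0.5                  # (n_features,)
        scores = scores.masked_fill(~observed, float("-inf"))
        weights = F.softmax(scores, dim=-1)               # zero weight on missing features
        return weights @ values, weights                  # output embedding + heat-map weights

    # Toy example: 5 features embedded in 8 dimensions, 2 of them missing.
    torch.manual_seed(0)
    q, K, V = torch.randn(8), torch.randn(5, 8), torch.randn(5, 8)
    observed = torch.tensor([True, True, False, True, False])
    out, attn = masked_attention(q, K, V, observed)       # attn sums to 1 over observed features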

Two different approaches to modeling uncertainty were applied to the UAN. The first was to learn the parameters of a posterior beta distribution using generalized variational inference (GVI) [17]. The second was to learn the parameters of a posterior beta distribution with the addition of posterior networks (PN), which have the reported advantage of better uncertainty on out-of-distribution inputs [18].
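As a hedged illustration of the idea shared by both approaches, the sketch below maps a network output to the two parameters of a Beta distribution, from which a predicted probability (the mean) and a simple epistemic confidence score can be read off. The layer sizes and names are assumptions; GVI and PN differ in how these parameters are trained, which is beyond this sketch.

    # Minimal sketch of a Beta-distribution output head (assumed names/sizes).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BetaHead(nn.Module):
        def __init__(self, d_model: int = 8):
            super().__init__()
            self.to_params = nn.Linear(d_model, 2)

        def forward(self, h: torch.Tensor):
            # softplus (+1) keeps alpha and beta positive, evidence-like.
            alpha, beta = (F.softplus(self.to_params(h)) + 1.0).unbind(-1)
            p_mean = alpha / (alpha + beta)               # predicted probability
            confidence = torch.maximum(alpha, beta)       # epistemic confidence score
            return p_mean, alpha, beta, confidence

    head = BetaHead()
    p, a, b, conf = head(torch.randn(8))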

A further description and implementation details of the UAN can be found in S1 Appendix. An example code-base for the model and web application for hosting the model can be found at https://github.com/jahanpd/UAN.

Bayesian logistic regression.

A Bayesian logistic regression model is a generalized linear model which learns a distribution over the coefficients in the model [11]. It was trained using variational inference [19].
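The following is a minimal mean-field variational inference sketch for Bayesian logistic regression, using a reparameterized Gaussian posterior over the coefficients and a standard normal prior. It illustrates the general idea only; the study used the approach of Drugowitsch [19].

    # Hedged sketch: variational Bayesian logistic regression.
    import torch
    import torch.nn.functional as F

    def fit_vb_logreg(X, y, steps=2000, lr=1e-2):
        # X: (n, d) float tensor; y: (n,) float tensor of 0/1 labels.
        n, d = X.shape
        mu = torch.zeros(d, requires_grad=True)           # variational mean
        log_sigma = torch.zeros(d, requires_grad=True)    # variational log std
        opt = torch.optim.Adam([mu, log_sigma], lr=lr)
        for _ in range(steps):
            eps = torch.randn(d)
            w = mu + log_sigma.exp() * eps                # reparameterized sample
            nll = F.binary_cross_entropy_with_logits(X @ w, y, reduction="sum")
            # KL(q || N(0, I)) for a diagonal Gaussian posterior.
            kl = 0.5 * ((2 * log_sigma).exp() + mu ** 2 - 1 - 2 * log_sigma).sum()
            loss = nll + kl                               # negative ELBO
            opt.zero_grad(); loss.backward(); opt.step()
        return mu.detach(), log_sigma.exp().detach()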

Extreme gradient boosting (XGBoost).

XGBoost is a popular gradient boosting decision tree model [12]. To estimate uncertainty, an ensemble of XGBoost models was trained on bootstrapped samples of the training dataset [20].
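A minimal sketch of such an ensemble follows; the hyperparameters and the inverse-variance confidence score are illustrative assumptions rather than the study's tuned configuration.

    # Hedged sketch: bootstrapped XGBoost ensemble for uncertainty.
    import numpy as np
    from xgboost import XGBClassifier

    def fit_bootstrap_ensemble(X, y, n_models=10, seed=0):
        rng = np.random.default_rng(seed)
        models = []
        for _ in range(n_models):
            idx = rng.integers(0, len(X), size=len(X))    # resample with replacement
            model = XGBClassifier(n_estimators=200, eval_metric="logloss")
            models.append(model.fit(X[idx], y[idx]))
        return models

    def ensemble_predict(models, X):
        probs = np.stack([m.predict_proba(X)[:, 1] for m in models])
        # Mean is the prediction; inverse empirical variance across the
        # ensemble serves as an epistemic confidence score.
        return probs.mean(0), 1.0 / (probs.var(0) + 1e-9)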

Implementations.

All models were implemented using open source software, with the UAN and LR models implemented in PyTorch, and XGBoost from the publicly available Python package [12, 21]. Imputation was performed using the implementations in the Python package scikit-learn [22].

Training, visualization, and statistical methodology

Models were trained on rebalanced datasets using randomly combined majority-class undersampling and minority-class oversampling [23]. Model performance was assessed using 5-fold cross-validation repeated 4 times. Imputation, rebalancing, and standardization of data were all calibrated using the training set on a per-fold basis. Performance metrics were the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Pairwise testing of distributional differences was performed using Student's t-test. A sketch of this evaluation loop follows below.
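Below is a hedged sketch of the evaluation loop, with an illustrative rebalancing helper built from scikit-learn utilities; the exact sampling ratios and pipeline details are assumptions.

    # Sketch: repeated 5-fold CV with per-fold rebalancing (illustrative).
    import numpy as np
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.metrics import roc_auc_score
    from sklearn.utils import resample

    def rebalance(X, y, seed=0):
        # Combine minority oversampling and majority undersampling to an
        # equal class split (one simple choice among many).
        X_min, X_maj = X[y == 1], X[y == 0]
        target = (len(X_min) + len(X_maj)) // 2
        X_min = resample(X_min, n_samples=target, replace=True, random_state=seed)
        X_maj = resample(X_maj, n_samples=target, replace=False, random_state=seed)
        return np.vstack([X_min, X_maj]), np.r_[np.ones(target), np.zeros(target)]

    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=4, random_state=0)
    # for train_idx, test_idx in cv.split(X, y):
    #     X_tr, y_tr = rebalance(X[train_idx], y[train_idx])
    #     ... fit a model on (X_tr, y_tr), then roc_auc_score(y[test_idx], p_hat)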

To assess confidence calibration we distinguished between two forms of uncertainty: aleatoric and epistemic. Aleatoric uncertainty arises from irreducible noise intrinsic to the data, and epistemic uncertainty from absent knowledge due to unseen data [18]. Confidence calibration was assessed using the AUC with labels of 1 for correct predictions and 0 for incorrect predictions. Aleatoric confidence calibration was determined using the probability of the predicted class as the score. Epistemic confidence calibration was determined using, as the score, either the maximum parameter of the beta distribution for the UAN or the inverse empirical variance for the LR and ensemble models. The Brier score was measured to further assess aleatoric uncertainty calibration for each model. Performance was also measured on high-confidence samples, defined as predictions where the 95% credible or confidence interval did not overlap with the baseline risk.
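The sketch below shows how these calibration metrics could be computed from a vector of predicted probabilities and a per-sample epistemic confidence score; the variable names are illustrative.

    # Sketch: aleatoric/epistemic calibration AUCs and the Brier score.
    import numpy as np
    from sklearn.metrics import roc_auc_score, brier_score_loss

    def calibration_metrics(y_true, p_hat, epistemic_score):
        pred = (p_hat >= 0.5).astype(int)
        correct = (pred == y_true).astype(int)        # 1 = correct, 0 = incorrect
        # Aleatoric score: probability of the predicted class.
        aleatoric = np.where(pred == 1, p_hat, 1 - p_hat)
        auc_aleatoric = roc_auc_score(correct, aleatoric)
        # Epistemic score: e.g. max Beta parameter (UAN) or inverse
        # empirical variance (LR and ensemble models).
        auc_epistemic = roc_auc_score(correct, epistemic_score)
        brier = brier_score_loss(y_true, p_hat)
        return auc_aleatoric, auc_epistemic, brier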

Model interpretability for UAN predictions was visualized using an attention heat map depicting the outcomes and feature importance from the partial dataset. The UAN model was further interrogated for model fit using two plotting methods. Firstly, the univariate predicted probability for mortality was plotted against the input range of four continuous features: age, body mass index, estimated glomerular filtration rate (eGFR), and preoperative creatinine. This is possible because the UAN is flexible to input features and, in contrast to the other models, can provide predictions with only one input variable. Secondly, the embedded representations of the input features and outcomes were visualized using t-distributed stochastic neighbor embedding (t-SNE) into 2-dimensional space. This plot shows how closely related the UAN considers features or outcomes to be, as demonstrated by their proximity in the plot. A sketch of both interrogation plots follows below.
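The following sketch outlines both interrogation plots under stated assumptions: model.predict_single is a hypothetical handle for single-feature prediction (possible because the UAN accepts any subset of inputs), and the embedding matrix is a random placeholder for the learned feature vectors.

    # Sketch: univariate probability sweep and t-SNE of feature embeddings.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # 1. Univariate sweep across the input range of one feature.
    ages = np.linspace(18, 95, 100)
    # p_mortality = [model.predict_single(feature="age", value=a) for a in ages]
    # plt.plot(ages, p_mortality)  # hypothetical single-input prediction API

    # 2. t-SNE projection of the learned feature embeddings.
    embeddings = np.random.randn(41, 16)              # placeholder for learned vectors
    coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1])
    plt.show()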

Ethics

This project was conducted in concordance with the National Health and Medical Research Council (NHMRC) National Statement on Ethical Conduct in Human Research, with approval from the Monash University Human Research Ethics Committee (HREC) approval number 2020-24850-45439. The ANZSCTS National Database has approval for data collection from Monash University and Alfred Health HREC utilizing an opt-out system for patient consent with the data policy and patient information sheet available in the supplemental materials.

Results

Data subsets

All 153,932 cardiac surgical procedures in 151,078 unique patients were included from the ANZSCTS Database, and 7,343 cardiac surgical procedures in 6,748 unique patients were subsetted from the MIMIC III database. A further subset of 149,988 procedures in 147,317 unique patients was taken from the ANZSCTS Database for the partial dataset. Patient characteristics for the partial dataset and the MIMIC III dataset are presented in Table 1, and for the full dataset in S3 Table of the supplemental materials.

Table 1. Patient characteristics of the partial database subsets.

https://doi.org/10.1371/journal.pone.0289930.t001

As expected, many differences were detected between the derivation and external validation datasets. The largest were in a recorded history of cardiac arrhythmia (17.17% vs 52.34%) and a history of smoking (57.35% vs 15.92%) for the ANZSCTS and MIMIC III datasets, respectively.

Model performance and uncertainty calibration

For the main outcome of AUC, the highest performing models were the UAN-GVI and UAN-PN, each with a score of 0.78 (0.01) (Table 2). The performance of all models improved when measured on high confidence samples, with the UAN-GVI and UAN-PN performing best with scores of 0.82 (0.01) (Table 3).

Table 2. Model performance on the MIMIC III dataset for 30 day mortality.

https://doi.org/10.1371/journal.pone.0289930.t002

Table 3. Uncertainty calibration on the MIMIC III dataset for 30 day mortality.

https://doi.org/10.1371/journal.pone.0289930.t003

Confidence calibration for aleatoric uncertainty was excellent for all models, achieving scores close to, or exactly, 1.0. Calibration for epistemic uncertainty was more variable, with the ensemble of XGBoost models with MICE performing best with a score of 0.84 (0.08) (Table 3). The UAN-GVI and the LR with simple imputation had the worst epistemic calibration, with scores of 0.63 (0.10) and 0.57 (0.14), respectively. Compared with the UAN-GVI, the PN approach had better epistemic calibration. The Brier score was best for XGBoost and worst for the LR model, with scores of 0.12 (0.07) and 0.23 (0.01), respectively.

Statistical testing is reported in S5–S12 Tables of the supplementary materials.

Interpretability

The attention paid to each feature by the UAN-PN model was visualized for a high-risk, a low-risk, and an uncertain-risk prediction (Fig 2).

Fig 2. Attention heat maps for 3 samples from the MIMIC III dataset with high, low and baseline population risk of a postoperative complication.

These plots provide a visual explanation of how the model weights each feature with respect to each of the outcomes. The patient in a) had a low probability of 30-day mortality of 0.10; their characteristics were age 51, male, BMI 25.4, privately insured, no history of arrhythmia, diabetes, smoking, high cholesterol, hypertension, peripheral vascular disease, or heart failure, an isolated CABG, preoperative creatinine of 53, and ICU length of stay of 43 hours. The patient in b) had a baseline probability of 30-day mortality of 0.5; their characteristics were age 61, male, BMI 21.0, privately insured, no history of arrhythmia, smoking, diabetes, high cholesterol, hypertension, or peripheral vascular disease, a positive history of heart failure, an isolated CABG, preoperative creatinine of 61.9, and ICU length of stay of 146 hours. The patient in c) had a high probability of 30-day mortality of 0.85; their characteristics were age 86, male, BMI 23.0, insured under Medicare, no history of smoking, diabetes, high cholesterol, hypertension, peripheral vascular disease, or heart disease, a positive history of cardiac arrhythmia, a combined CABG and valve operation, preoperative creatinine of 106, and ICU length of stay of 317 hours.

https://doi.org/10.1371/journal.pone.0289930.g002

Model interrogation

The univariate predicted probability distribution for age, body mass index, eGFR, and preoperative creatinine demonstrated sensible trends. The probability of 30-day mortality increased with age, body mass index, and preoperative creatinine and decreased with eGFR (Fig 3).

Fig 3. Univariate predicted probability plots for the outcome of 30 day mortality across the input range of 4 different continuous variables.

This plot demonstrates that the UAN is able to learn the appropriate non-linear risk for each variable in the univariate case, with the risk decreasing with increasing eGFR, but increasing with age, preoperative creatinine and BMI.

https://doi.org/10.1371/journal.pone.0289930.g003

The cluster-mapped features showed that the UAN learned feature embeddings that were similar for features that were related or had similar effects on the predicted outcome (Fig 4).

Fig 4. A visualization in 3 dimensions using t-SNE clustering on the 16-dimensional feature vector representations learned by the UAN.

Related variables tended to cluster in 3-dimensional space; for example, history of heart failure clustered with being admitted with heart failure, and type of operation, operative urgency, and bypass time clustered together.

https://doi.org/10.1371/journal.pone.0289930.g004

Web application

The UAN-PN was made available for test use at www.cardiac-ml.com.

Discussion

This is the first description of an uncertainty-aware attention-based neural network. The results presented demonstrate that the UAN performs as well as or better than traditional benchmarks, with acceptable uncertainty calibration. The success of ML across many industries has not translated to healthcare, as performance alone does not overcome clinicians' lack of trust in black-box model predictions [24]. Key to developing clinician trust and use are flexibility to incomplete data, interpretable predictions, and uncertainty assessment. We have demonstrated that the UAN provides all of these features.

Incomplete data

Healthcare data are often incomplete, which presents a problem for many models. LR models, and their derivative scoring tools, are only valid when the clinician has all variables at hand [11]. Excluding patient data with missing values or performing imputation is likely to introduce bias [25]. The UAN overcomes this problem by applying the attention function in a novel way, allowing extremely flexible inference with one or more features available. In this work, we demonstrated this capability by training the model on 41 features, while the external validation dataset had only 14 features from which to generate predictions. Importantly, in this setting, the UAN is as good as or better than imputing data before inference or training.

Model interpretability

For clinicians to trust a predictive model, the decision-making mechanism must be transparent. In LR, the coefficient associated with each variable is interpretable as the importance the fitted model ascribes to that variable [11]. Similarly, for EDT models, methods exist to interpret the importance the model assigns to each feature. These are, however, global explanations and do not provide transparency for a specific prediction [12]. Further explanatory methods can be applied to trained models to provide individualized explanations [26]. This approach is limited by the need to optimize another model based on the output of the trained model, which can be inefficient or result in inaccurate explanations [26–28]. The UAN is inherently interpretable: visualizing the attention paid to each variable allows for individualized, prediction-level explanations (Fig 2). In this way, the UAN provides individualized risk profiles, making it the first deep-learning model to achieve this without resorting to explanatory modeling [29].

Trust in the UAN can be furthered by visualizing certain unique characteristics of the model. Plotting the univariate probability distribution across the input range of a feature (Fig 3) allows the clinician to see how the model maps a value to a predicted probability. For each feature we interrogated with this strategy, the model learned a mapping matching the expected trend and provided unique insight into the data. For example, the model learned a non-linear distribution for age with a small peak at younger ages, a nadir at 40 years old, and a gradual increase accelerating at 70 years old. This pattern correlates with the fact that younger patients requiring cardiac surgery are generally suffering from trauma or congenital disease, which confers a higher mortality than that of middle-aged patients undergoing coronary bypass or valve surgery [30, 31]. This is in contrast to a logistic regression model, where the coefficient associated with the feature is limited to a linear relationship [11].

Unique to the UAN model architecture is a feature-specific embedding that maps the feature value to a feature-specific space. By using t-SNE on these embeddings, it is possible to view the clustered relationships between the features (Fig 4). Features that are similar are mapped to a similar space and are therefore clustered. A clinician is able to use this plot to ensure that the UAN learns to cluster features that have similar effects on the outcome. Using both the clustering and univariate probability plots, a clinician can have increased confidence that the model has learned appropriate relationships between the features and the outcome.

Computational efficiency

A further benefit of the UAN model is that it is able to model multiple outcomes at the same time. In this study, the model learned the relationship between the inputs and eight different clinically significant outcomes (see S4 Table of the supplemental materials). This is in contrast to traditional models, such as LR, where a separate model needs to be trained for each outcome. At inference, the model generates predictions for all outcomes simultaneously. The advantage is a reduced computational cost when training on large datasets for multiple outcomes.

Limitations

The primary limitations of this study arose from the derivation and external validation datasets. Firstly, clinical variables available for input into a predictive model can be categorized as modifiable or non-modifiable. For example, age is non-modifiable, whereas preoperative hemoglobin is modifiable by giving red blood cells. Ideally, a predictive model would contain a large number of modifiable risk factors, which would allow the clinician to better optimize patient care pre- or early postoperatively. The ANZSCTS Database was primarily created for auditing surgical outcomes and does not contain a large number of modifiable variables [13]. Future datasets should strive to incorporate more modifiable preoperative variables in order to develop models with more actionable insights. Secondly, there were many differences in how the two databases defined variables. For example, the ANZSCTS Database is updated stochastically with information from clinicians who are directly responsible for patient care, whereas much of the data in the MIMIC III database is derived from codes related to hospital billing [13, 14]. This explains the difference in comorbidity rates for some key variables, such as smoking (Table 1). While this may be a reason for reduced performance on the MIMIC III dataset, these differences may in fact improve the robustness of the external validation results by factoring in the differences that healthcare networks across nations may have in defining these variables.

There were also some limitations inherent to the UAN model. The attention function approximates a mixture model, where the output is a mixture of the vector representations of the input variables [16]. This is a modeling assumption that may not accurately reflect the best way to combine the input values.

Finally, although a PN approach to modeling epistemic uncertainty was more effective than GVI, epistemic uncertainty was still poorly modeled compared to aleatoric uncertainty. Future research will be needed to develop and deploy better methods of modeling epistemic uncertainty.

Conclusion

The UAN is a novel, interpretable, and readily available tool for clinicians to assist in risk stratification in cardiac surgery. The model outperforms current gold standards on important performance benchmarks. Further research is needed to improve uncertainty calibration and to externally validate the model in new cohorts.

Supporting information

S1 Table. Variable definitions.

A description of how each variable was defined.

https://doi.org/10.1371/journal.pone.0289930.s001

(DOCX)

S2 Table. Outcome definitions.

A description of how each outcome was defined.

https://doi.org/10.1371/journal.pone.0289930.s002

(DOCX)

S3 Table. Patient characteristics of the database subsets.

A complete table of patient characteristics for all variables included in the analysis.

https://doi.org/10.1371/journal.pone.0289930.s003

(DOCX)

S4 Table. Model performance across all datasets and outcomes.

A complete table for all internal validation and external validation results for performance.

https://doi.org/10.1371/journal.pone.0289930.s004

(DOCX)

S5 Table. Uncertainty calibration across all datasets and outcomes.

A complete table for all internal validation and external validation results for uncertainty calibration.

https://doi.org/10.1371/journal.pone.0289930.s005

(DOCX)

S6 Table. Pairwise T-test p-values for AUC.

Statistical testing for performance differences across cross-validation.

https://doi.org/10.1371/journal.pone.0289930.s006

(DOCX)

S7 Table. Pairwise T-test p-values for sensitivity.

Statistical testing for performance differences across cross-validation.

https://doi.org/10.1371/journal.pone.0289930.s007

(DOCX)

S8 Table. Pairwise T-test p-values for specificity.

Statistical testing for performance differences across cross-validation.

https://doi.org/10.1371/journal.pone.0289930.s008

(DOCX)

S9 Table. Pairwise T-test p-values for AUC of high confidence samples.

Statistical testing for performance differences across cross-validation.

https://doi.org/10.1371/journal.pone.0289930.s009

(DOCX)

S10 Table. Pairwise T-test p-values for AUC of aleatoric uncertainty.

Statistical testing for performance differences across cross-validation.

https://doi.org/10.1371/journal.pone.0289930.s010

(DOCX)

S11 Table. Pairwise T-test p-values for AUC of epistemic uncertainty.

Statistical testing for performance differences across cross-validation.

https://doi.org/10.1371/journal.pone.0289930.s011

(DOCX)

S12 Table. Pairwise T-test p-values for the Brier score.

Statistical testing for performance differences across cross-validation.

https://doi.org/10.1371/journal.pone.0289930.s012

(DOCX)

S1 Appendix. Uncertainty-aware attention network implementation details.

A detailed description of the uncertainty aware network implementation.

https://doi.org/10.1371/journal.pone.0289930.s013

(DOCX)

Acknowledgments

The ANZSCTS Cardiac Surgery Database Program thanks all of the investigators, data managers, and institutions that participate in the Program.

References

  1. Kilic A. Artificial intelligence and machine learning in cardiovascular health care. The Annals of Thoracic Surgery. 2020;109: 1323–1329. pmid:31706869
  2. Sanders J, Cooper J, Mythen MG, Montgomery HE. Predictors of total morbidity burden on days 3, 5 and 8 after cardiac surgery. Perioperative Medicine. 2017;6. pmid:28228937
  3. Billah B, Huq MM, Smith JA, Sufi F, Tran L, Shardey GC, et al. AusSCORE II in predicting 30-day mortality after isolated coronary artery bypass grafting in Australia and New Zealand. The Journal of Thoracic and Cardiovascular Surgery. 2014;148: 1850–1855.e2. pmid:24655903
  4. Nashef SAM, Roques F, Sharples LD, Nilsson J, Smith C, Goldstone AR, et al. EuroSCORE II. European Journal of Cardio-Thoracic Surgery. 2012;41: 734–745. pmid:22378855
  5. Thakar CV, Arrigain S, Worley S, Yared J-P, Paganini EP. A clinical score to predict acute renal failure after cardiac surgery. Journal of the American Society of Nephrology. 2004;16: 162–168. pmid:15563569
  6. Benedetto U, Dimagli A, Sinha S, Cocomello L, Gibbison B, Caputo M, et al. Machine learning improves mortality risk prediction after cardiac surgery: Systematic review and meta-analysis. The Journal of Thoracic and Cardiovascular Surgery. 2022;163: 2075–2087.e9. pmid:32900480
  7. Penny-Dimri JC, Bergmeir C, Perry L, Hayes L, Bellomo R, Smith JA. Machine learning to predict adverse outcomes after cardiac surgery: A systematic review and meta-analysis. Journal of Cardiac Surgery. 2022. pmid:36001761
  8. Jiang H, Liu L, Wang Y, Ji H, Ma X, Wu J, et al. Machine learning for the prediction of complications in patients after mitral valve surgery. Frontiers in Cardiovascular Medicine. 2021;8. pmid:34977184
  9. Shuhaiber JH, Conte JV. Machine learning in heart valve surgery. European Journal of Cardio-Thoracic Surgery. 2021;60: 1386–1387. pmid:34279023
  10. Scott I, Carter S, Coiera E. Clinician checklist for assessing suitability of machine learning applications in healthcare. BMJ Health & Care Informatics. 2021;28. pmid:33547086
  11. Sperandei S. Understanding logistic regression analysis. Biochemia Medica. 2014; 12–18. pmid:24627710
  12. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2016.
  13. ANZSCTS National Cardiac Surgery Database. Data definitions manual version 4. Melbourne, Australia: Australian and New Zealand Society of Cardiac and Thoracic Surgeons; 2018.
  14. Johnson AEW, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Scientific Data. 2016;3. pmid:27219127
  15. van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software. 2011;45: 1–67.
  16. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Advances in neural information processing systems. Curran Associates, Inc.; 2017. Available: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  17. Knoblauch J, Jewson J, Damoulas T. Generalized variational inference: Three arguments for deriving new posteriors. arXiv; 2019.
  18. Charpentier B, Zügner D, Günnemann S. Posterior network: Uncertainty estimation without OOD samples via density-based pseudo-counts. Advances in neural information processing systems 33. Curran Associates, Inc.; 2020.
  19. Drugowitsch J. Variational Bayesian inference for linear and logistic regression. arXiv e-prints. 2013; arXiv:1310.5438. Available: https://arxiv.org/abs/1310.5438
  20. Lakshminarayanan B, Pritzel A, Blundell C. Simple and scalable predictive uncertainty estimation using deep ensembles. Proceedings of the 31st international conference on neural information processing systems. Red Hook, NY, USA: Curran Associates Inc.; 2017. pp. 6405–6416.
  21. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R, editors. Advances in neural information processing systems. Curran Associates, Inc.; 2019. Available: https://proceedings.neurips.cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf
  22. Buitinck L, Louppe G, Blondel M, Pedregosa F, Mueller A, Grisel O, et al. API design for machine learning software: Experiences from the scikit-learn project. ECML PKDD workshop: Languages for data mining and machine learning. 2013. pp. 108–122.
  23. Shamsudin H, Yusof UK, Jayalakshmi A, Akmal Khalid MN. Combining oversampling and undersampling techniques for imbalanced classification: A comparative study using credit card fraudulent transaction dataset. 2020 IEEE 16th international conference on control & automation (ICCA). 2020. pp. 803–808.
  24. Vayena E, Blasimme A, Cohen IG. Machine learning in medicine: Addressing ethical challenges. PLOS Medicine. 2018;15: e1002689. pmid:30399149
  25. van der Heijden GJMG, Donders ART, Stijnen T, Moons KGM. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: A clinical example. Journal of Clinical Epidemiology. 2006;59: 1102–1109. pmid:16980151
  26. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Proceedings of the 31st international conference on neural information processing systems. Red Hook, NY, USA: Curran Associates Inc.; 2017. pp. 4768–4777.
  27. Fryer DV, Strümke I, Nguyen H. Shapley values for feature selection: The good, the bad, and the axioms. IEEE Access. 2021;9: 144352–144360.
  28. Sharma P, Mirzan SR, Bhandari A, Pimpley A, Eswaran A, Srinivasan S, et al. Evaluating tree explanation methods for anomaly reasoning: A case study of SHAP TreeExplainer and TreeInterpreter. In: Grossmann G, Ram S, editors. Advances in conceptual modeling. Cham: Springer International Publishing; 2020. pp. 35–45.
  29. Penny-Dimri JC, Bergmeir C, Reid CM, Williams-Spence J, Cochrane AD, Smith JA. Machine learning algorithms for predicting and risk profiling of cardiac surgery-associated acute kidney injury. Seminars in Thoracic and Cardiovascular Surgery. 2021;33: 735–745. pmid:32979479
  30. Kang N, Cole T, Tsang V, Elliott M, de Leval M. Risk stratification in paediatric open-heart surgery. European Journal of Cardio-Thoracic Surgery. 2004;26: 3–11. pmid:15200974
  31. Mortasawi A, Arnrich B, Rosendahl U, Frerichs I, Albert A, Walter J, et al. Is age an independent determinant of mortality in cardiac surgery as suggested by the EuroSCORE? BMC Surgery. 2002;2. pmid:12370083