Abstract
Aim
In this review, we investigated how Machine Learning (ML) was utilized to predict all-cause somatic hospital admissions and readmissions in adults.
Methods
We searched eight databases (PubMed, Embase, Web of Science, CINAHL, ProQuest, OpenGrey, WorldCat, and MedNar) from their inception date to October 2023, and included records that predicted all-cause somatic hospital admissions and readmissions of adults using ML methodology. We used the CHARMS checklist for data extraction, PROBAST for bias and applicability assessment, and TRIPOD for reporting quality.
Results
We screened 7,543 records, of which 163 were read in full text and 116 met the review inclusion criteria. Among these, 45 predicted admission, 70 predicted readmission, and one study predicted both. There was substantial variety in the types of datasets, algorithms, features, data preprocessing steps, evaluation, and validation methods. The most used types of features were demographics, diagnoses, vital signs, and laboratory tests. Area Under the ROC curve (AUC) was the most used evaluation metric. Models trained using boosting tree-based algorithms often performed better than others, and ML algorithms commonly outperformed traditional regression techniques. Sixteen studies used natural language processing (NLP) of clinical notes for prediction, and all of them yielded good results. Overall adherence to reporting guidelines was poor among the reviewed studies, and only five percent of models were implemented in clinical practice. The methodological aspects most frequently addressed inadequately were providing model interpretations at the individual patient level, making full code available, performing external validation, calibrating models, and handling class imbalance.
Citation: Askar M, Tafavvoghi M, Småbrekke L, Bongo LA, Svendsen K (2024) Using machine learning methods to predict all-cause somatic hospitalizations in adults: A systematic review. PLoS ONE 19(8): e0309175. https://doi.org/10.1371/journal.pone.0309175
Editor: Tariq Jamal Siddiqi, The University of Mississippi Medical Center, UNITED STATES OF AMERICA
Received: February 1, 2024; Accepted: August 6, 2024; Published: August 23, 2024
Copyright: © 2024 Askar et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: This is a systematic review. The extracted data is available in the supplementary material.
Funding: The publication charges for this article have been funded by a grant from the publication fund of UiT The Arctic University of Norway. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Unplanned hospital admissions and readmissions (hospitalizations) account for a significant share of global healthcare expenditures [1,2]. Interestingly, up to 35% of these hospitalizations are potentially avoidable [3]. One approach to address avoidable hospitalizations is to implement statistical and mathematical models on healthcare datasets in order to predict future hospitalization [4,5].
Previous attempts were mainly based on regression models and specific risk indexes (scores). Systematic reviews have concluded that most models had poor or inconsistent performance and limited applicability. They also found that models utilizing health records data performed better than models using self-reported data [4,6,7].
More recently, prediction models that utilize Machine Learning (ML) [8,9] algorithms have become more popular. Recent reviews emphasized the growing importance and effectiveness of ML models in predicting clinical outcomes such as hospital readmissions. These reviews concluded that ML techniques can improve readmission prediction ability over traditional statistical models. This improvement could be explained by ML models offering several advantages over traditional regression models such as flexibility, the ability to handle large, complex, high dimensional datasets, and identifying non-linear relationships [10]. The reviews also highlighted the critical role of selecting features and addressed some challenges such as transparency, the difficulty of ML models’ interpretation, and the importance of handling class imbalance to enhance the models’ performance. Moreover, they highlighted the importance of demonstrating the clinical usefulness of the models in practice [11–13]. A systematic analysis of readmission prediction literature proposed a comprehensive framework for ML model development detailing steps from data preparation and preprocessing to suggesting methods of feature selection and transformation, data splitting, model training, validation, and evaluation [14].
Although several reviews have considered the use of ML in predicting hospitalizations for specific diseases and conditions [15–17], none has systematically reviewed the literature on all-cause hospital admissions. With this review we aim to (i) summarize the characteristics of ML studies used in predicting all-cause somatic admissions and readmissions; (ii) provide a picture of the ML pipeline steps, including data preprocessing, feature selection, model evaluation, validation, calibration, and explanation; (iii) assess the risk of bias, applicability, and reporting completeness of the studies; and finally (iv) comment on the challenges facing implementation of ML models in clinical practice.
Materials and methods
The protocol of this systematic review was registered in the International Prospective Register of Systematic Reviews (PROSPERO, CRD42021276721). The PRISMA and PRISMA-Abstract guidelines [18] were followed in reporting this review, see S1 File: Section 1.
Inclusion/exclusion criteria
To formulate the research question, we used the PICOTS checklist [19,20]. Studies that only included non-adults were excluded. Hospitalizations were defined as all-cause somatic admissions or readmissions from outside the hospital; hence, psychiatric, disease-specific, and internal admissions between wards were excluded. Emergency Departments (EDs) were considered portals of entry, so admissions from an ED to the hospital were included, whereas ED visits followed by discharge were excluded.
Our focus was on studies performed in the ML context (using ML either in the model development steps, e.g., feature engineering, or for making the final predictions), so studies that only used statistical learning or risk indexes for prediction were excluded. All performance measures reported for competing models were extracted. This review is mainly descriptive of how ML was used in predicting hospitalization; hence, we chose to include studies conducted using real-world data with hospital admissions and readmissions as a valid outcome, regardless of the timing of the outcome. Table 1 presents the overall inclusion criteria. A detailed description of the inclusion and exclusion criteria is provided in S1 File: Section 2.
Search strategy
We searched four main databases, PubMed, Embase (via Ovid), Web of Science, and CINAHL (via EBSCO), from their inception dates to October 13th, 2023. The search strategy was developed by piloting it on a set of relevant studies. Controlled vocabulary terms were used where available in the database (MeSH for PubMed and CINAHL, and Emtree for Embase). We also searched four other databases, ProQuest, OpenGrey, WorldCat (OCLC FirstSearch), and MedNar, for grey literature.
Four main search blocks were used to identify relevant studies: prediction, hospitalization, machine learning, and exclusions. The list of excluded irrelevant search terms was developed iteratively through preliminary title/abstract piloting. The Boolean operators AND, OR, and NOT were used alongside truncation operators and phrase searching. The search syntax was adapted for each database using the Polyglot tool [21] with manual supervision. The complete search syntax can be found in S1 File: Section 3.
Duplicate studies were removed using Mendeley Reference Manager (version 1.19.8, Elsevier). In cases where the reference manager was uncertain, we manually checked and removed any duplicates. Titles and abstracts were screened by two independent investigators (MA and KS), and full-text papers were retrieved for all candidate studies. The full-text screening was performed separately by MA, MT, LS, and KS. A manual search of the reference lists of the included studies was conducted to identify literature that did not appear in the electronic search. A list of all full-text screened studies, including those that were included and excluded with the reason(s) for exclusion, is attached to S2 File, sheet: Included & excluded studies. The final set of included studies was decided by discussion between MA, KS, and LS. The descriptive results were synthesized using pivot tables in Microsoft Excel.
Data extraction
The data were extracted separately by MA, MT, and LS using the Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) checklist [19]. For further analysis, features were grouped into administrative and clinical feature groups. The included records, extracted data, models’ features, and feature groupings can be found in S2 File, sheets: CHARMS and Features.
Assessment of bias and applicability
Although the main purpose of the review is descriptive, MA and MT assessed the risk of bias and applicability using the Prediction model Risk of Bias Assessment Tool (PROBAST) [22], a commonly used tool for assessing prediction models. The tool evaluates four domains: Participants, Predictors, Outcome, and Analysis. For each domain, a set of questions helps judge the risk of both bias and applicability concerns. If any domain was not rated “low”, the overall risk of bias was considered “high”. Abstracts were not assessed due to their limited information. The assessment is attached to S2 File, sheet: PROBAST.
Quality of reporting
To assess the quality of reporting, we utilized the Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) checklist [23]. We followed the methodology suggested by Andaur Navarro et al. [24] to evaluate adherence to TRIPOD per article and per item of the reporting checklist. Each item was scored as 1 = reported, 0 = not reported, 0.5 = incomplete reporting, or ‘_’ = not applicable. Abstracts and conference proceedings were not evaluated. We then calculated the adherence per TRIPOD item by dividing the sum of that item’s scores across all studies by the total number of studies. Adherence for each article was calculated as the sum of its TRIPOD item scores divided by the maximum achievable score had the reporting been complete, S2 File, sheet: TRIPOD. All abbreviations mentioned in the study are included in S2 File, sheet: Abbreviations.
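To make the adherence calculations concrete, the following short sketch uses a hypothetical three-study, three-item scoring table (the real checklist has 20 scored items) and assumes that ‘not applicable’ items are excluded from an article’s denominator:

```python
import numpy as np
import pandas as pd

# Hypothetical scoring table: rows = studies, columns = TRIPOD items.
# 1 = reported, 0.5 = incomplete, 0 = not reported, NaN = not applicable.
scores = pd.DataFrame(
    {"item_1": [1.0, 0.5, 0.0], "item_2": [1.0, 1.0, np.nan], "item_3": [0.0, 1.0, 1.0]},
    index=["study_A", "study_B", "study_C"],
)

# Adherence per TRIPOD item: sum of the item's scores divided by the number of studies.
adherence_per_item = scores.sum(axis=0) / len(scores)

# Adherence per article: sum of the article's item scores divided by the
# maximum achievable score (one point per applicable item).
adherence_per_article = scores.sum(axis=1) / scores.notna().sum(axis=1)

print(adherence_per_item.round(2))
print(adherence_per_article.round(2))
```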
Results
Of the 7,543 records reviewed, 147 records were eligible for full-text screening. We included 16 additional records identified by manual searching of the references. In total, 163 studies were fully screened and 116 studies were included in the review, of which 87 were peer-reviewed articles (76%), 17 conference articles, nine abstracts, and three theses (Fig 1).
Data extraction results
Characteristics of the included studies.
Sixty-one studies (53%) were conducted using data from the USA, followed by Australia (seven studies), Taiwan (four studies), and Canada and Singapore, each with three studies. The oldest article is from 2005, and 2019 had the greatest number of articles (22 articles, 19%), followed by 2020 (17 articles, 15%), see Fig 2.
Left panel, a bar plot of the top 10 countries where the datasets originate and the number of publications. Right panel, a bar plot of the number of publications by year.
Population characteristics.
Only 23 studies (20%) had a complete reporting of sample size (number of unique patients and the total number of admissions). Six studies (5%) neither reported the number of patients nor admissions (among them were 4 abstracts). The sample size varied from 371 to 4,637,294. Regarding age, 49 studies (42%) did not report an age range of the included patients. The rest of the studies had different minimum age requirements for the studied patients.
Outcomes characteristics.
Readmission was the outcome in 70 studies (60%), while 45 studies (39%) had hospital admission as an outcome, and one study investigated both outcomes. The readmission prediction horizon varied from 24 hours to 1 year. The most frequently predicted horizon was 30-day readmission, used alone in 51 studies (73% of the readmission studies) and combined with other readmission horizons in seven more, for a total of 58 studies (83%). The datasets’ inclusion periods varied from 1 month to 30 years (median: 1.25 years, mean: 3.2 years). Excluding rebalanced datasets, the readmission proportion varied from 0.7% to 34.6% (median 12.4%), while the admission proportion varied from 0.38% to 41% (median 17.2%).
Datasets
The numbers of studies that used administrative, clinical, or both types of datasets were close (40, 36, and 38 studies, respectively), with two abstracts having an unclear description of the dataset type. Among the studies that reported an Area Under the ROC curve (AUC) and used these types of datasets (103 studies), the mean AUCs were 0.80, 0.78, and 0.77 with standard deviations (SD) of 0.08, 0.07, and 0.09, respectively. Six studies reported an AUC over 90%, while 81 studies reported an AUC of 70–90% and 18 studies reported an AUC of 60–70%. Fig 3 shows the relationship between outcomes, dataset types, dataset sources, and the best model performance. S1 File: Section 4 includes detailed information on the data types, sources, and frequency of use in predicting either admissions or readmissions.
The thickness of the streams indicates the number of records common between pairs of categories. Medical records include patient information from EHR or EMR. Hospital datasets include data from hospital information systems.
Types of features included in models.
The most used feature groups were demographics (92 studies, 79%), diagnoses (43 studies, 37%), vital signs (34 studies, 29%), and laboratory tests (28%). Fig 4 presents the most used feature groups in the included studies. Natural Language Processing (NLP) techniques were used in 16 studies (14%) to predict hospitalizations from clinical free-text notes.
Missing data and data imbalance.
Missing values were not mentioned at all in 52 studies (45%). In the 55 studies (47%) that reported how missing values were handled, the most used methods were removing records with missing values (27 studies, 23%) and various imputation methods (25 studies, 22%), with some studies using both removal and imputation.
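As an illustration of these two approaches, the following sketch (invented column names and values, not drawn from any reviewed study) shows record removal and median imputation side by side:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical admission-level data with missing laboratory values.
df = pd.DataFrame({
    "age": [67, 81, 74, 59],
    "creatinine": [1.1, np.nan, 2.3, np.nan],
    "sodium": [138.0, 141.0, np.nan, 136.0],
})

# Approach 1: remove records (rows) containing any missing value.
complete_cases = df.dropna()

# Approach 2: impute missing values, here with the column median.
imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns
)
```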
Of the 99 studies (85%) that reported class imbalance in the outcome, only 32 studies (32%) reported handling this imbalance with some technique. The most used techniques were undersampling (19 studies, 59%), oversampling (7 studies, 22%), and the Synthetic Minority Oversampling TEchnique (SMOTE) (6 studies, 19%). Note that some studies tested more than one resampling method.
Models’ performance and comparison
In total, 57 different algorithms were used for predicting the outcomes. Regression models were the most frequently used algorithm group (73 studies, 63%), followed by bagging tree-based algorithms (61 studies, 53%) and boosting tree-based algorithms (60 studies, 52%). The best-performing algorithm group was boosting algorithms in 35 studies (42%) and bagging algorithms in 16 studies (19%), followed by regression and Neural Network (NN) models in 14 studies (17%) each, see Table 2.
Comparing the performance of algorithms.
Eighty-three studies (72%) compared the performance of multiple algorithms. Based on the results of the best-performing algorithm groups (Table 2), we compared the performance of some of these algorithm groups. Decision tree (DT) and Bayesian models were not included in the comparison, as they did not perform best in any of the studies. Fig 5 illustrates the performance comparison between the different algorithm groups in the retrieved studies.
The numbers on each segment denote the count of publications in which the first algorithm group demonstrated superior, equivalent, or inferior performance compared to the second one. Adjacent to each bar, the total number of publications involving such comparisons is indicated.
Evaluation metrics.
AUC was the most used evaluation metric (105 studies), followed by precision, sensitivity, specificity, and accuracy (Fig 6). Thirty-seven studies (32%) reported only one evaluation metric, such as AUC or accuracy, without reporting a clinical performance metric such as sensitivity or specificity. Of the 105 studies that reported AUC, 18 studies (17%) reported an AUC of 60–70%, 42 studies (40%) of 70–80%, 39 studies (37%) of 80–90%, and only six studies (6%) above 90% (Fig 3). The highest reported AUC was 95% among admission models and 99% among readmission models. The mean AUC in the studies that used administrative, clinical, or combined datasets was 0.80, 0.78, and 0.77 (SD: 0.08, 0.07, and 0.09), respectively.
Each subplot represents the frequency of use in the reviewed studies.
Model calibration and benchmarking.
Only 28 studies (24%) calibrated their models using one of the calibration methods. Fig 6 presents the calibration methods used and the count of publications. Eighteen studies (16%) were benchmarked against one or more risk prediction indexes, such as the LACE [25], PARR [26], and HOSPITAL [27] indexes. The most used risk index in benchmarking was the LACE index (nine studies), followed by PARR and HOSPITAL (two studies each). In all 18 studies, ML models outperformed predictions obtained from these risk indexes. A detailed comparison is attached to S1 File: Section 5.
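For context, a generic sketch of model calibration on simulated data (not based on any included study) shows a reliability curve to assess calibration and Platt scaling (the ‘sigmoid’ method) to recalibrate predicted probabilities:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Simulated data with roughly a 15% event rate.
X, y = make_classification(n_samples=5000, weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Reliability curve: observed event rate vs mean predicted probability per bin.
prob_true, prob_pred = calibration_curve(
    y_test, model.predict_proba(X_test)[:, 1], n_bins=10
)

# Recalibration with Platt scaling, fitted via internal cross-validation.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="sigmoid", cv=5
).fit(X_train, y_train)
```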
Model validation.
In the majority of studies (96, 83%), models were trained and validated retrospectively. Only 17 studies (15%) trained their models retrospectively and tested them prospectively; among these, three studies performed real-time validation. The study design was unclear in three studies. Fig 6 depicts the internal and external validation methods used in the studies.
Model explainability and availability
Providing model interpretation at the patient level (local model interpretation) was presented in only three studies. Fig 6 presents the different interpretation methods used in the studies. Twenty studies (17%) used publicly available datasets, and 15 studies (13%) reported providing the data upon request. Only 17 studies (15%) made their code available, and only six studies implemented their models in clinical practice.
Quality of the studies
Bias and applicability assessment.
Of 106 studies assessed, 68 (64%) were evaluated to be at high risk of bias. We evaluated 94 studies (87%) to be at low concern of applicability. Assessment results are attached to S2 File, sheet: PROBAST.
Reporting quality assessment.
Only nine studies reported adherence to the TRIPOD checklist. These studies [28–36] had generally better reporting quality (scoring 17, 17, 19, 17.5, 16.5, 18, 17.5, 17, and 16 out of 20, respectively). The overall median 20-item TRIPOD adherence was 77% (IQR 63–95%). The assessment of adherence to TRIPOD reveals insufficient reporting, especially for some items such as reporting the flow of participants (35% of the studies), supplementary material (52%), population characteristics (53%), missing data (56%), and funding (58%), among others (Fig 7). The evaluation sheet is attached to S2 File, sheet: TRIPOD.
Only explicit reporting of Confidence Intervals (CI) was considered complete reporting. Note that all items were calculated excluding abstracts and proceedings (10 studies). Hence, percentages for some items, such as missing data, can differ from those reported in the Results section, where the calculations included all studies.
Discussion
To our knowledge, this is the first systematic review to focus on ML models for predicting all-cause somatic hospitalizations. Of 7,543 citations, 116 studies were included. Our review reveals the potential that ML models have in predicting all-cause somatic hospitalizations, which is consistent with what is reported by both a general review of AI and machine learning and disease-specific reviews [8,9]. Our findings also raise concerns regarding the quality of the studies conducted. Despite the potential of the ML prediction framework and the superiority over traditional statistical prediction shown in many studies, there are clear issues with the quality of reporting, external validation, model calibration, and interpretation. All of these aspects must complement model performance before models can be conveniently implemented in real-life clinical practice. These main findings are consistent with findings from other reviews [11,12,37–39].
Most studies were based on data from the USA, which can be an issue. This geographic skew limits the generalizability of the developed models, considering the differences in healthcare systems and patient populations between countries [40]. As 30-day readmission is a widely used indicator of hospital care quality [41], the majority of the included readmission studies used this indicator as an outcome.
Datasets and features
A wide variety of data sources and types were used. We found the performance of models trained on administrative (claims) datasets, clinical datasets, or datasets combining both clinical and administrative variables to be similar, with a slight edge for models trained on administrative datasets.
The most important features varied between the different studies. This lack of convergence of risk factors is due to: i) the different definitions of admissions and readmissions as outcomes in these studies, ii) the use of different feature selection methods [42], iii) the diversity of recorded features in different healthcare databases, iv) the lack of standardized data preprocessing and the variety of methods for handling and generating variables, v) the variety of populations, subpopulations, and exclusion criteria, and finally, vi) the use of different risk scores and indexes that include different sets of features. This is consistent with what previous studies concluded about the difficulty of finding universal features for predicting hospitalization [43–45]. While defining general risk factors is particularly difficult for studies of all-cause hospitalizations, it may be feasible in subpopulations (e.g., patients with specific diseases) that have more similarities and less diversity. Yet, some groups of risk factors are shown to be more common than others (Fig 4).
The most used feature groups were demographics, diagnoses, physiological measurements, and laboratory tests, in that order (Fig 4). Some studies used only one or a limited number of feature groups [46–53]. All of these studies yielded generally good predictive performance, suggesting that the sole use of one or a limited set of feature categories can be enough to predict hospitalization. However, this needs to be further investigated by comparing the performance of models built exclusively on one or a few feature groups with models built on several feature groups.
Some studies used Natural Language Processing (NLP) techniques to extract information from clinical text, either combining it with other structured features [54–57] or using it as a sole source of data [47,52,53,58–62]. Some studies reported better prediction performance using textual data than numerical data (e.g., laboratory tests and vital signs), suggesting the existence of relevant expert knowledge within these reports [53]. We noticed an increase in applying NLP techniques in recent studies, suggesting that utilizing textual data is a promising future direction for predicting hospitalizations. Incorporating NLP techniques into prediction models provides models with a rich source of clinical information that may not be present in the tabular format of patient records. It also improves research scalability through automatic extraction of relevant information rather than manual processing. Furthermore, it can provide real-time assistance for clinicians. However, some challenges should be considered, such as the limited availability of large, annotated, shareable datasets, which are necessary for developing efficient NLP models; currently popular evaluation methods that may not be clinically relevant; and the lack of transparent protocols to ensure NLP methods are reproducible [63,64].
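As a hedged illustration of the kind of pipeline involved (the notes, labels, and choice of a TF-IDF bag-of-words representation with logistic regression are assumptions made for this example; the reviewed studies used a range of NLP techniques):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented triage/clinical notes and hospitalization labels (1 = admitted).
notes = [
    "chest pain radiating to left arm, shortness of breath",
    "routine follow-up, no acute complaints",
    "worsening dyspnea, bilateral leg edema, known heart failure",
    "minor laceration, wound cleaned and dressed",
]
labels = [1, 0, 1, 0]

# TF-IDF vectorization of the free text followed by a linear classifier.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(notes, labels)
print(pipeline.predict_proba(["acute shortness of breath and chest pain"])[:, 1])
```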
Data preprocessing
How multiple admissions of the same patient were handled during data preparation should be reported. Only 23 studies reported both the number of unique patients and the total number of admissions, indicating poor reporting of this item. Reporting both the number of patients and admissions and the methods of handling multiple admissions of the same patient is important, since neglecting the correlation between admissions may lead to unreliable predictions.
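One way to respect that correlation is to split the data by patient rather than by admission, so that all admissions of a patient fall on the same side of the split. A minimal sketch, assuming each row is one admission and a hypothetical `patient_id` column identifies the patient:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical admission-level dataset: several rows can belong to one patient.
admissions = pd.DataFrame({
    "patient_id":  [1, 1, 2, 3, 3, 3, 4, 5],
    "age":         [70, 70, 55, 81, 81, 81, 64, 73],
    "readmit_30d": [0, 1, 0, 1, 0, 1, 0, 0],
})

# Group-aware split: no patient appears in both the training and test sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(admissions, groups=admissions["patient_id"]))
train, test = admissions.iloc[train_idx], admissions.iloc[test_idx]

assert not set(train["patient_id"]) & set(test["patient_id"])
```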
Similarly, less than half of the studies (47%) reported a method of handling missing values. Only a few studies (32) reported handling class imbalance in the dataset. Class imbalance means that the outcome contains more samples from one class (the majority class) than from the other classes (minority classes) [65], and it represents one of the most common issues in training ML models to predict hospitalizations. However, it is not usually taken into consideration in the readmission risk prediction literature [66]. The problem with class imbalance is that models can be biased towards the majority class, leading to misleadingly high prediction performance [67]. Resampling techniques, especially undersampling, were the most used approach. Resampling techniques balance the distribution of outcome classes by either oversampling or undersampling. Oversampling increases the number of minority class instances (e.g., SMOTE), while undersampling randomly reduces the number of majority class instances, thereby balancing the class distribution [68]. It should be noted that resampling techniques have drawbacks, such as overfitting or loss of useful information, which can introduce problematic consequences and hinder model learning [69,70].
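A brief sketch of these resampling options, using the imbalanced-learn package on simulated data with roughly a 10% event rate (illustrative only; resampling is applied to the training split alone, since resampling the test set would distort evaluation):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Undersampling: randomly drop majority-class training samples.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)

# SMOTE: synthesize new minority-class training samples by interpolation.
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X_train, y_train)

print(Counter(y_train), Counter(y_under), Counter(y_smote))
```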
Models’ performance comparisons
Model performance for health-related outcomes should be reported on two levels: model performance metrics (e.g., AUC, F1-score) and clinical performance metrics (e.g., sensitivity, specificity, PPV, NPV) [71]. More than one-third of the studies reported only a model performance metric, with AUC the most used, which could limit their acceptance in clinical practice.
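To illustrate the two levels, the sketch below derives a model performance metric (AUC) and clinical performance metrics (sensitivity, specificity, PPV, NPV) from the same set of predictions; the labels, probabilities, and 0.5 decision threshold are simulated assumptions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)          # simulated outcomes
y_prob = 0.3 * y_true + 0.7 * rng.random(1000)  # simulated predicted probabilities

# Model performance metric.
auc = roc_auc_score(y_true, y_prob)

# Clinical performance metrics at an arbitrary 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity, specificity = tp / (tp + fn), tn / (tn + fp)
ppv, npv = tp / (tp + fp), tn / (tn + fn)

print(f"AUC={auc:.2f} Se={sensitivity:.2f} Sp={specificity:.2f} "
      f"PPV={ppv:.2f} NPV={npv:.2f}")
```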
The analysis of the different algorithms’ performance confirms that no algorithm consistently performs better than the others [72]. Yet, some algorithms more frequently yield better results than others. In this review, we found that tree-based boosting algorithms often outperformed other algorithms (Table 2 and Fig 5). Tree-based boosting algorithms, such as Gradient Boosting Machine (GBM), XGBoost, and AdaBoost, are a class of ensemble learning methods that build multiple decision trees sequentially [73]. Each new decision tree corrects the errors of previous ones by focusing on samples that were difficult to estimate [74]. The predictions of the trees are then combined to produce the final model prediction [75]. This group of algorithms has many advantages, such as training multiple models, which enhances prediction performance over training a single one, flexibility to handle different data types, the ability to capture non-linear patterns, and being less prone to overfitting [76].
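A minimal sketch of this family of models on simulated data, here scikit-learn’s gradient boosting alongside a logistic regression baseline (illustrative only; the reviewed studies used various implementations such as GBM, XGBoost, and AdaBoost on real clinical features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30, weights=[0.85, 0.15],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Boosting: trees are added sequentially, each correcting the errors of its predecessors.
gbm = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, max_depth=3,
                                 random_state=0).fit(X_train, y_train)
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for name, model in [("Boosting", gbm), ("Logistic regression", lr)]:
    print(name, roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```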
Many studies compared the performance of different algorithms on the same dataset. In this regard, we suggest that further studies focusing solely on comparing the performance of commonly used ML algorithms are not needed unless they aim to benchmark new algorithms against existing ones. We propose that researchers instead focus on how to generalize ML models and implement them in clinical practice.
There is an ongoing discussion about whether ML models can offer better predictive abilities than conventional statistical models such as logistic regression (LR). While some studies found that ML models outperform regression models [11,37,77–81], others suggest that ML models give no better predictions than LR [55,82,83]. In our analysis, ML models mostly performed better than regression models. This is consistent with a meta-analysis that reached the same conclusion by comparing LR to advanced ML algorithms such as NN [84]. Regression models performed better in only 17% of the studies that compared regression and ML algorithms (Table 2 and Fig 5). This may be explained by LR being a parametric algorithm that lacks the flexibility of non-parametric ones [85], or by its restrictive assumptions, which favor less restricted or assumption-free algorithms [86].
We also found that ML models outperform risk indexes in prediction performance. This is reasonable because risk indexes usually contain few predictors and aim mainly at simplifying predictions, while ML models utilize more predictors and more complex methods to learn the patterns in datasets. It could also be argued that ML models are developed and tested on the same dataset and may therefore be better tuned, or even overfitted, to that specific dataset, whereas risk indexes are usually developed in one setting and then validated in different datasets and settings, which would favor the ML models in such comparisons.
Finally, two studies compared ML models to clinicians’ predictions. They concluded that the models outperform ED nurses in predicting hospital admission from the ED and that combining ML models with clinical insight improves model performance [87,88].
Model validation
External validation (EV) should ideally be conducted on unrelated and structurally different datasets from the dataset used for model training [89,90]. If the validation dataset differs only temporally but still originates from the same setting, place, etc., it is called temporal EV and is regarded as an approach that lies midway between internal and external validation [91], because the overall patient characteristics are similar between the two datasets [92]. Our analysis shows a clear shortage of EV of models, and most of the EV performed can be regarded as temporal EV. Although recent studies indicate an increased awareness of EV, performing EV continues to be a critical step in the current development of ML models [93]. However, several obstacles still face ML models’ generalizability. These obstacles can be categorized as either model-related or data-related. Model-related obstacles include issues with transparency in model development and results reporting. Data-related obstacles include the diversity of data structures, formats, populations, etc. across different healthcare systems, the lack of a standardized data preprocessing framework, and strict health data privacy regulations.
Adopting Common Data Models (CDMs) [94], designing a comprehensive and widely accepted framework for data preprocessing, and implementing Federated Learning (FL) [95] could help address these issues. In S1 File: Section 6, we provide a more detailed explanation of these obstacles and solutions.
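For clarity, a sketch of the distinction between temporal and full external validation, assuming an admission-level table with a hypothetical `admit_date` column (names and values are invented):

```python
import pandas as pd

# Hypothetical admission-level dataset from a single hospital.
data = pd.DataFrame({
    "admit_date": pd.to_datetime(["2018-03-01", "2018-09-12", "2019-02-20",
                                  "2020-05-05", "2021-01-15"]),
    "age": [66, 72, 58, 80, 69],
    "readmit_30d": [0, 1, 0, 1, 0],
})

# Temporal external validation: train on earlier admissions, validate on later ones
# from the same setting.
cutoff = pd.Timestamp("2020-01-01")
train = data[data["admit_date"] < cutoff]
temporal_validation = data[data["admit_date"] >= cutoff]

# Full external validation would instead use data from a different hospital or
# healthcare system, e.g.:
# external_validation = pd.read_csv("other_hospital_admissions.csv")  # hypothetical file
```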
Model explainability and availability
Model interpretation is of great importance in predicting health-related outcomes. Global model interpretation involves describing the most important rules and most influential features that the model learned during training [96], while local model interpretation refers to explaining how the model derived each individual prediction (i.e., for each patient) [97,98].
In our analysis, 62 studies (53%) provided a global interpretation of their model, for example in the form of feature importance or a risk score [28,99,100], while only three studies presented methods for local interpretation [56,78]. Introducing both global and local model interpretation is important to increase trustworthiness and to enhance the implementation of these models in practice [101–104].
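A sketch of one widely used route to local interpretation, SHAP values for a tree-based model (simulated data and an invented model; the shap package is assumed to be available): the per-patient attributions explain a single prediction, and averaging their absolute values yields a global feature-importance view.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)

# Local interpretation: feature attributions for a single (here, the first) patient.
local_attributions = explainer.shap_values(X[:1])

# Global interpretation: mean absolute SHAP value per feature across patients.
global_importance = np.abs(explainer.shap_values(X)).mean(axis=0)
print(local_attributions, global_importance)
```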
Few studies made their dataset (20 studies) or code (17 studies) publicly available. To facilitate the technical reproducibility of the models, publishing both datasets and code is necessary. Admittedly, healthcare datasets contain patients’ confidential information, which hinders publishing them. Hence, some suggestions have been reported to partially solve this issue, such as publishing a simulated dataset [105], providing complementary empirical results on an open-source benchmark dataset [106], or sharing model predictions and data labels to allow further statistical analysis [107]. There is also no doubt that publicly available datasets such as MIMIC-III [108] have boosted ML research and opened many opportunities to develop ML in the health domain. MIMIC-III has been cited more than 3,000 times to date and has enabled numerous studies that focus on developing predictive models and enhancing clinical decision support systems [109,110].
Reporting model development code and the experiments performed can help others understand the final methodology, accelerate overall development, and ensure that models are safeguarded from data leakage and other pitfalls in model development [71,111]. Additionally, reporting the software and package versions is necessary. Many algorithmic decisions are made silently through the default settings of different packages, leading to differences in results when an experiment is repeated, even on the same dataset [112].
Bias risk and applicability
More than 60% of the assessed studies had a high risk of bias, in line with the findings of other reviews [38,39,113]. Twelve studies were found to have a high concern of applicability. However, factors such as variability of populations, settings, and dataset characteristics are anticipated to further constrain the applicability of these studies.
In general, we observed poor quality of reporting in the studies. This is consistent with findings in other studies [24,114,115]. Poor reporting quality raises concerns about the reproducibility of models [105]. Studies that adhered to TRIPOD had better scores than those that did not. This points to the importance of adhering to a reporting checklist in ML studies, especially in the health domain. It also raises the need to develop ML-specific checklists for quality assessment and reporting. Ongoing research is currently addressing this requirement [116]. In S1 File: Section 7, we suggest a reporting scheme for ML studies.
Limitations
We identified the relevant literature from eight databases, but we did not approach authors for missing information on the studies. The considerable amount of missing information could therefore have affected the assessment of bias risk. Our results are also limited by the fact that most of the reviewed studies were based on data from the USA, which limits generalizability because of differences in populations and healthcare systems between countries. To address this limitation, future studies should aim to include diverse datasets from various countries and healthcare settings. Additionally, more effort should be directed toward comparing models from different populations and settings to understand their limitations in different contexts. Assessing the quality of the studies was also limited by our not being able to access their code scripts. Potential publication bias also limits the ability of the review to comprehensively evaluate the overall results. Additionally, reporting quality varied significantly between the studies, which can affect the reliability of the findings.
The heterogeneity of healthcare systems and patient populations across countries, and of ML algorithms and settings, limits the comparison of results between studies and makes it difficult to harmonize the results of different models. Due to this heterogeneity, we had to make decisions regarding the inclusion criteria that may have caused us to miss relevant studies. Finally, only literature published in English was included, which also limits our insight into the overall picture of ML development globally.
Conclusions
The main purpose of the review was to describe how ML was used in predicting all-cause somatic hospitalizations. The review raises some concerns about the quality of data preprocessing, the reporting quality, reproducibility, local interpretation, and the external validity of many studies. The quality of studies needs to improve to meet the expectations of clinicians and stakeholders before using these models in clinical practice. We recommend that future studies should prioritize generalizing ML models and integrating them into clinical practice.
Supporting information
S1 File. Includes: Section 1: PRISMA checklist, Section 2: Detailed inclusion/exclusion criteria, Section 3: Literature search syntax, Section 4: Studies data sources, Section 5: Benchmarking with risk indexes, Section 6: A comment on the generalizability of ML models, and Section 7: A suggestion of reporting checklist specifically for ML models in structured datasets.
https://doi.org/10.1371/journal.pone.0309175.s001
(DOCX)
S2 File. Includes: Sheet: Abbreviations, (Sheet: CHARMS) include the extracted data for review studies and studies citations, (Sheet: Features) includes feature-related extractions, (Sheet: PROBAST) includes the applicability and risk of bias assessment, (Sheet: TRIPOD) includes the reporting quality assessment, and (Sheet: Included & excluded studies) includes full-text screened studies with the reason(s) of exclusion.
https://doi.org/10.1371/journal.pone.0309175.s002
(XLSX)
References
- 1. McDermott KW, Jiang HJ. Characteristics and Costs of Potentially Preventable Inpatient Stays, 2017. Healthcare Cost and Utilization Project (HCUP) Statistical Briefs. Agency for Healthcare Research and Quality (US); 2006. Available: https://www.ncbi.nlm.nih.gov/books/NBK559945/.
- 2. Jencks SF, Williams M V., Coleman EA. Rehospitalizations among Patients in the Medicare Fee-for-Service Program. N Engl J Med. 2009;361: 311–312. pmid:19605841
- 3. Lyhne CN, Bjerrum M, Riis AH, Jørgensen MJ. Interventions to Prevent Potentially Avoidable Hospitalizations: A Mixed Methods Systematic Review. Front Public Heal. 2022;10. pmid:35899150
- 4. Kansagara D, Englander H, Salanitro A, Kagen D, Theobald C, Freeman M, et al. Risk Prediction Models for Hospital Readmission. JAMA. 2011;306: 1688. pmid:22009101
- 5. Dhillon SK, Ganggayah MD, Sinnadurai S, Lio P, Taib NA. Theory and Practice of Integrating Machine Learning and Conventional Statistics in Medical Data Analysis. Diagnostics 2022, Vol 12, Page 2526. 2022;12: 2526. pmid:36292218
- 6. Wallace E, Stuart E, Vaughan N, Bennett K, Fahey T, Smith SM. Risk Prediction Models to Predict Emergency Hospital Admission in Community-dwelling Adults. Med Care. 2014;52: 751–765. pmid:25023919
- 7. Zhou H, Della PR, Roberts P, Goh L, Dhaliwal SS. Utility of models to predict 28-day or 30-day unplanned hospital readmissions: an updated systematic review. BMJ Open. 2016;6: e011060. pmid:27354072
- 8. Helm JM, Swiergosz AM, Haeberle HS, Karnuta JM, Schaffer JL, Krebs VE, et al. Machine Learning and Artificial Intelligence: Definitions, Applications, and Future Directions. Curr Rev Musculoskelet Med. 2020;13: 69–76. pmid:31983042
- 9. El Naqa I, Murphy MJ. What Is Machine Learning? Machine Learning in Radiation Oncology. Cham: Springer International Publishing; 2015. pp. 3–11. https://doi.org/10.1007/978-3-319-18305-3_1
- 10. Rajula HSR, Verlato G, Manchia M, Antonucci N, Fanos V. Comparison of Conventional Statistical Methods with Machine Learning in Medicine: Diagnosis, Drug Development, and Treatment. Medicina (B Aires). 2020;56: 455. pmid:32911665
- 11. Artetxe A, Beristain A, Graña M. Predictive models for hospital readmission risk: A systematic review of methods. Comput Methods Programs Biomed. 2018;164: 49–64. pmid:30195431
- 12. Teo K, Yong CW, Chuah JH, Hum YC, Tee YK, Xia K, et al. Current Trends in Readmission Prediction: An Overview of Approaches. Arab J Sci Eng. 2021. pmid:34422543
- 13. Benedetto U, Dimagli A, Sinha S, Cocomello L, Gibbison B, Caputo M, et al. Machine learning improves mortality risk prediction after cardiac surgery: Systematic review and meta-analysis. J Thorac Cardiovasc Surg. 2022;163: 2075–2087.e9. pmid:32900480
- 14. Chen T, Madanian S, Airehrour D, Cherrington M. Machine learning methods for hospital readmission prediction: systematic analysis of literature. J Reliab Intell Environ. 2022;8: 49–66.
- 15. Cho SM, Austin PC, Ross HJ, Abdel-Qadir H, Chicco D, Tomlinson G, et al. Machine Learning Compared With Conventional Statistical Models for Predicting Myocardial Infarction Readmission and Mortality: A Systematic Review. Can J Cardiol. 2021;37: 1207–1214. pmid:33677098
- 16. Sun Z, Dong W, Shi H, Ma H, Cheng L, Huang Z. Comparing Machine Learning Models and Statistical Models for Predicting Heart Failure Events: A Systematic Review and Meta-Analysis. Front Cardiovasc Med. 2022;9. pmid:35463786
- 17. Mahajan SM, Heidenreich P, Abbott B, Newton A, Ward D. Predictive models for identifying risk of readmission after index hospitalization for heart failure: A systematic review. Eur J Cardiovasc Nurs J Work Gr Cardiovasc Nurs Eur Soc Cardiol. 2018;17: 675–689. pmid:30189748
- 18. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Syst Rev. 2021;10: 89. pmid:33781348
- 19. Moons KGM, de Groot JAH, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG, et al. Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies: The CHARMS Checklist. PLoS Med. 2014;11: e1001744. pmid:25314315
- 20. Debray TPA, Damen JAAG, Snell KIE, Ensor J, Hooft L, Reitsma JB, et al. A guide to systematic review and meta-analysis of prediction model performance. BMJ. 2017;356: 6460. pmid:28057641
- 21. Clark JM, Sanders S, Carter M, Honeyman D, Cleo G, Auld Y, et al. Improving the translation of search strategies using the Polyglot Search Translator: a randomized controlled trial. J Med Libr Assoc. 2020;108. pmid:32256231
- 22. Wolff RF, Moons KGM, Riley RD, Whiting PF, Westwood M, Collins GS, et al. PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies. Ann Intern Med. 2019;170: 51. pmid:30596875
- 23. Collins GS, Reitsma JB, Altman DG, Moons K. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement. BMC Med. 2015;13: 1. pmid:25563062
- 24. Andaur Navarro CL, Damen JAA, Takada T, Nijman SWJ, Dhiman P, Ma J, et al. Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review. BMC Med Res Methodol. 2022;22: 12. pmid:35026997
- 25. van Walraven C, Dhalla IA, Bell C, Etchells E, Stiell IG, Zarnke K, et al. Derivation and validation of an index to predict early death or unplanned readmission after discharge from hospital to the community. CMAJ. 2010;182: 551–7. pmid:20194559
- 26. Billings J, Blunt I, Steventon A, Georghiou T, Lewis G, Bardsley M. Development of a predictive model to identify inpatients at risk of re-admission within 30 days of discharge (PARR-30). BMJ Open. 2012;2. pmid:22885591
- 27. Donzé J, Aujesky D, Williams D, Schnipper JL. Potentially Avoidable 30-Day Hospital Readmissions in Medical Patients. JAMA Intern Med. 2013;173: 632. pmid:23529115
- 28. Fenn A, Davis C, Buckland DM, Kapadia N, Nichols M, Gao M, et al. Development and Validation of Machine Learning Models to Predict Admission From Emergency Department to Inpatient and Intensive Care Units. Ann Emerg Med. 2021;78: 290–302. pmid:33972128
- 29. Chandra A, Rahman PA, Sneve A, McCoy RG, Thorsteinsdottir B, Chaudhry R, et al. Risk of 30-Day Hospital Readmission Among Patients Discharged to Skilled Nursing Facilities: Development and Validation of a Risk-Prediction Model. J Am Med Dir Assoc. 2019;20: 444–450.e2. pmid:30852170
- 30. Spangler D, Hermansson T, Smekal D, Blomberg H. A validation of machine learning-based risk scores in the prehospital setting. PLoS One. 2019;14: e0226518. pmid:31834920
- 31. Raita Y, Goto T, Faridi MK, Brown DFM, Camargo CAJ, Hasegawa K. Emergency department triage prediction of clinical outcomes using machine learning models. Crit Care. 2019;23: 64. pmid:30795786
- 32. Rahimian F, Salimi-Khorshidi G, Payberah AH, Tran J, Ayala Solares R, Raimondi F, et al. Predicting the risk of emergency admission with machine learning: Development and validation using linked electronic health records. PLoS Med. 2018;15: e1002695. pmid:30458006
- 33. Hegselmann S, Ertmer C, Volkert T, Gottschalk A, Dugas M, Varghese J. Development and validation of an interpretable 3 day intensive care unit readmission prediction model using explainable boosting machines. Front Med. 2022;9. pmid:36082270
- 34. Olza A, Millán E, Rodríguez-Álvarez MX. Development and validation of predictive models for unplanned hospitalization in the Basque Country: analyzing the variability of non-deterministic algorithms. BMC Med Inform Decis Mak. 2023;23: 152. pmid:37543596
- 35. Brankovic A, Rolls D, Boyle J, Niven P, Khanna S. Identifying patients at risk of unplanned re-hospitalisation using statewide electronic health records. Sci Rep. 2022;12: 16592. pmid:36198757
- 36. Dadabhoy FZ, Driver L, McEvoy DS, Stevens R, Rubins D, Dutta S. Prospective External Validation of a Commercial Model Predicting the Likelihood of Inpatient Admission From the Emergency Department. Ann Emerg Med. 2023;81: 738–748. pmid:36682997
- 37. Shin S, Austin PC, Ross HJ, Abdel‐Qadir H, Freitas C, Tomlinson G, et al. Machine learning vs. conventional statistical models for predicting heart failure readmission and mortality. ESC Hear Fail. 2021;8: 106–115. pmid:33205591
- 38. Huang Y, Talwar A, Chatterjee S, Aparasu RR. Application of machine learning in predicting hospital readmissions: a scoping review of the literature. BMC Med Res Methodol. 2021;21: 96. pmid:33952192
- 39. Kamel Rahimi A, Canfell OJ, Chan W, Sly B, Pole JD, Sullivan C, et al. Machine learning models for diabetes management in acute care using electronic medical records: A systematic review. Int J Med Inform. 2022;162: 104758. pmid:35398812
- 40. Kashyap M, Seneviratne M, Banda JM, Falconer T, Ryu B, Yoo S, et al. Development and validation of phenotype classifiers across multiple sites in the observational health data sciences and informatics network. J Am Med Inform Assoc. 2020;27: 877–883. pmid:32374408
- 41. Demir E. A Decision Support Tool for Predicting Patients at Risk of Readmission: A Comparison of Classification Trees, Logistic Regression, Generalized Additive Models, and Multivariate Adaptive Regression Splines. Decis Sci. 2014;45: 849–880.
- 42. Junqueira ARB, Mirza F, Baig MM. A machine learning model for predicting ICU readmissions and key risk factors: analysis from a longitudinal health records. Health Technol (Berl). 2019;9: 297–309.
- 43. Hong WS, Haimovich AD, Taylor RA. Predicting hospital admission at emergency department triage using machine learning. PLoS One. 2018;13: e0201016. pmid:30028888
- 44. Sabbatini AK, Kocher KE, Basu A, Hsia RY. In-Hospital Outcomes and Costs Among Patients Hospitalized During a Return Visit to the Emergency Department. JAMA. 2016;315: 663. pmid:26881369
- 45. Adams JG. Ensuring the Quality of Quality Metrics for Emergency Care. JAMA. 2016;315: 659. pmid:26881367
- 46. Fialho AS, Cismondi F, Vieira SM, Reti SR, Sousa JMC, Finkelstein SN. Data mining using clinical physiology at discharge to predict ICU readmissions. Expert Syst Appl. 2012;39: 13158–13165.
- 47. Lorenzana A, Tyagi M, Wang QC, Chawla R, Nigam S. Using text notes from call center data to predict hospitalization. Value Heal. 2016;19: A87.
- 48. Jayousi R, Assaf R. 30-day Hospital Readmission Prediction using MIMIC Data. The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Conference Proceedings. Al-Quds University, Jerusalem, Palestine: IEEE; 2020. pp. 1–6. http://dx.doi.org/10.1109/AICT50176.2020.9368625.
- 49. Xue Y, Klabjan D, Luo Y. Predicting ICU readmission using grouped physiological and medication trends. Artif Intell Med. 2019;95: 27–37. pmid:30213670
- 50. Feretzakis G, Karlis G, Loupelis E, Kalles D, Chatzikyriakou R, Trakas N, et al. Using Machine Learning Techniques to Predict Hospital Admission at the Emergency Department. J Crit Care Med. 2022;8: 107–116. pmid:35950158
- 51. Aphinyanaphongs Y, Liang Y, Theobald J, Grover H, Swartz JL. Models to predict hospital admission from the emergency department through the sole use of the medication administration record. Acad Emerg Med. 2016;23: S116.
- 52. Lucini FR, Fogliatto FS, da Silveira GJC, Neyeloff JL, Anzanello MJ, Kuchenbecker R de S, et al. Text mining approach to predict hospital admissions using early medical records from the emergency department. Int J Med Inform. 2017;100: 1–8. pmid:28241931
- 53. Curto S, Carvalho JP, Salgado C, Vieira SM, Sousa JMC. Predicting ICU readmissions based on bedside medical text notes. 2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). Piscataway: IEEE; 2016. pp. 2144–2151. https://doi.org/10.1109/FUZZ-IEEE.2016.7737956
- 54. Handly N, Thompson DA, Li J, Chuirazzi DM, Venkat A. Evaluation of a hospital admission prediction model adding coded chief complaint data using neural network methodology. Eur J Emerg Med. 2015;22: 87–91. pmid:24509606
- 55. Zhang X, Kim J, Patzer RE, Pitts SR, Patzer A, Schrager JD. Prediction of Emergency Department Hospital Admission Based on Natural Language Processing and Neural Networks. Methods Inf Med. 2017;56: 377–389. pmid:28816338
- 56. Hilton CB, Milinovich A, Felix C, Vakharia N, Crone T, Donovan C, et al. Personalized predictions of patient outcomes during and after hospitalization using artificial intelligence. NPJ Digit Med. 2020;3: 1–8. pmid:32285012
- 57. Fernandes M, Mendes R, Vieira SM, Leite F, Palos C, Johnson A, et al. Predicting Intensive Care Unit admission among patients presenting to the emergency department using machine learning and natural language processing. Olier I, editor. PLoS One. 2020;15: e0229331. pmid:32126097
- 58. Topaz M, Woo K, Ryvicker M, Zolnoori M, Cato K. Home Healthcare Clinical Notes Predict Patient Hospitalization and Emergency Department Visits. Nurs Res. 2020;69: 448–454. pmid:32852359
- 59. Boggan JC, Schulteis RD, Simel DL, Lucas JE. Use of a natural language processing algorithm to predict readmissions at a veterans affairs hospital. J Gen Intern Med. 2019;34: S396–S397.
- 60. Teo K, Yong CW, Chuah JH, Hum YC, Tee YK, Xia K, et al. Current Trends in Readmission Prediction: An Overview of Approaches. Arab J Sci Eng. pmid:34422543
- 61. Li Z, Xing X, Lu B, Zhao Y, Li Z. Early Prediction of 30-Day ICU Re-admissions Using Natural Language Processing and Machine Learning. Biomed Stat Informatics. 2019;4: 22.
- 62. Sterling NW, Patzer RE, Di M, Schrager JD. Prediction of emergency department patient disposition based on natural language processing of triage notes. Int J Med Inform. 2019;129: 184–188. pmid:31445253
- 63. Velupillai S, Suominen H, Liakata M, Roberts A, Shah AD, Morley K, et al. Using clinical Natural Language Processing for health outcomes research: Overview and actionable suggestions for future advances. J Biomed Inform. 2018;88: 11–19. pmid:30368002
- 64. Sheikhalishahi S, Miotto R, Dudley JT, Lavelli A, Rinaldi F, Osmani V. Natural Language Processing of Clinical Notes on Chronic Diseases: Systematic Review. JMIR Med Informatics. 2019;7: e12239. pmid:31066697
- 65. Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. J Big Data. 2019;6: 27.
- 66. Artetxe A, Graña M, Beristain A, Ríos S. Emergency Department Readmission Risk Prediction: A Case Study in Chile. In: Vicente JMF, Alvarez-Sanchez JR, Lopez F, Moreo JT, Adeli H, editors. Biomedical Applications Based on Natural and Artificial Computing, Pt II. 2017. pp. 11–20. https://doi.org/10.1007/978-3-319-59773-7_2
- 67. Tanha J, Abdi Y, Samadi N, Razzaghi N, Asadpour M. Boosting methods for multi-class imbalanced data classification: an experimental review. J Big Data. 2020;7: 70.
- 68. Guo X, Yin Y, Dong C, Yang G, Zhou G. On the Class Imbalance Problem. 2008 Fourth International Conference on Natural Computation. IEEE; 2008. pp. 192–201. https://doi.org/10.1109/ICNC.2008.871
- 69. He H, Garcia EA. Learning from Imbalanced Data. IEEE Trans Knowl Data Eng. 2009;21: 1263–1284.
- 70. Kaur H, Pannu HS, Malhi AK. A Systematic Review on Imbalanced Data Challenges in Machine Learning. ACM Comput Surv. 2020;52: 1–36.
- 71. Norgeot B, Quer G, Beaulieu-Jones BK, Torkamani A, Dias R, Gianfrancesco M, et al. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med. 2020;26: 1320–1324. pmid:32908275
- 72. Wolpert DH, Macready WG. No free lunch theorems for optimization. IEEE Trans Evol Comput. 1997;1: 67–82.
- 73. Zhou Z-H. Ensemble Learning. Encyclopedia of Biometrics. Boston, MA: Springer US; 2009. pp. 270–273. https://doi.org/10.1007/978-0-387-73003-5_293
- 74. Zhang Y, Haghani A. A gradient boosting method to improve travel time prediction. Transp Res Part C Emerg Technol. 2015;58: 308–324.
- 75. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM; 2016. pp. 785–794. https://doi.org/10.1145/2939672.2939785
- 76. Opitz D, Maclin R. Popular Ensemble Methods: An Empirical Study. J Artif Intell Res. 1999;11: 169–198.
- 77. Graham B, Bond R, Quinn M, Mulvenna M. Using Data Mining to Predict Hospital Admissions From the Emergency Department. IEEE Access. 2018;6: 10458–10469.
- 78. Lo Y-T, Liao JC-H, Chen M-H, Chang C-M, Li C-T. Predictive modeling for 14-day unplanned hospital readmission risk by using machine learning algorithms. BMC Med Inform Decis Mak. 2021;21: 288. pmid:34670553
- 79. Futoma J, Morris J, Lucas J. A comparison of models for predicting early hospital readmissions. J Biomed Inform. 2015;56: 229–238. pmid:26044081
- 80. Li Q, Yao X, Échevin D. How Good Is Machine Learning in Predicting All-Cause 30-Day Hospital Readmission? Evidence From Administrative Data. Value Heal J Int Soc Pharmacoeconomics Outcomes Res. 2020;23: 1307–1315. pmid:33032774
- 81. Barbieri S, Kemp J, Perez-Concha O, Kotwal S, Gallagher M, Ritchie A, et al. Benchmarking Deep Learning Architectures for Predicting Readmission to the ICU and Describing Patients-at-Risk. Sci Rep. 2020;10: 1111. pmid:31980704
- 82. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110: 12–22. pmid:30763612
- 83. Gravesteijn BY, Nieboer D, Ercole A, Lingsma HF, Nelson D, van Calster B, et al. Machine learning algorithms performed no better than regression models for prognostication in traumatic brain injury. J Clin Epidemiol. 2020;122: 95–107. pmid:32201256
- 84. Talwar A, Lopez-Olivo MA, Huang Y, Ying L, Aparasu RR. Performance of advanced machine learning algorithms OVER logistic regression in predicting hospital readmissions: A meta-analysis. Explor Res Clin Soc Pharm. 2023; 100317. pmid:37662697
- 85. Kino S, Hsu Y-T, Shiba K, Chien Y-S, Mita C, Kawachi I, et al. A scoping review on the use of machine learning in research on social determinants of health: Trends and research prospects. SSM—Popul Heal. 2021;15: 100836. pmid:34169138
- 86. Mesgarpour M, Chaussalet T, Chahed S. Ensemble Risk Model of Emergency Admissions (ERMER). Int J Med Inform. 2017;103: 65–77. pmid:28551003
- 87. Peck JS, Benneyan JC, Nightingale DJ, Gaehde SA. Predicting emergency department inpatient admissions to improve same-day patient flow. Acad Emerg Med Off J Soc Acad Emerg Med. 2012;19: E1045–54. pmid:22978731
- 88. Flaks-Manov N, Shadmi E, Yahalom R, Perry-Mezre H, Balicer R, Srulovici E. Identification of elderly patients at-risk for 30-day readmission: clinical insight beyond big data prediction. J Nurs Manag. 2021. pmid:34661943
- 89. Ramspek CL, Jager KJ, Dekker FW, Zoccali C, van Diepen M. External validation of prognostic models: what, why, how, when and where? Clin Kidney J. 2021;14: 49–58. pmid:33564405
- 90. Riley RD, Ensor J, Snell KIE, Debray TPA, Altman DG, Moons KGM, et al. External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges. BMJ. 2016; i3140. pmid:27334381
- 91. Staartjes VE, Kernbach JM. Significance of external validation in clinical machine learning: let loose too early? Spine J. 2020;20: 1159–1160. pmid:32624150
- 92. Altman DG, Vergouwe Y, Royston P, Moons KGM. Prognosis and prognostic research: validating a prognostic model. BMJ. 2009;338: b605–b605. pmid:19477892
- 93. Cabitza F, Campagner A, Soares F, García de Guadiana-Romualdo L, Challa F, Sulejmani A, et al. The importance of being external. methodological insights for the external validation of machine learning models in medicine. Comput Methods Programs Biomed. 2021;208: 106288. pmid:34352688
- 94. Ryu B, Yoo S, Kim S, Choi J. Development of Prediction Models for Unplanned Hospital Readmission within 30 Days Based on Common Data Model: A Feasibility Study. Methods Inf Med. 2021. pmid:34583416
- 95. Rieke N, Hancox J, Li W, Milletarì F, Roth HR, Albarqouni S, et al. The future of digital health with federated learning. npj Digit Med. 2020;3: 119. pmid:33015372
- 96. Yang C, Rangarajan A, Ranka S. Global Model Interpretation via Recursive Partitioning. 2018 [cited 27 Dec 2022]. Available: http://arxiv.org/abs/1802.04253.
- 97. Kopitar L, Cilar L, Kocbek P, Stiglic G. Local vs. Global Interpretability of Machine Learning Models in Type 2 Diabetes Mellitus Screening. 2019. pp. 108–119.
- 98. Du M, Liu N, Hu X. Techniques for interpretable machine learning. Commun ACM. 2019;63: 68–77.
- 99. Wu CX, Suresh E, Phng FWL, Tai KP, Pakdeethai J, D’Souza JLA, et al. Effect of a Real-Time Risk Score on 30-day Readmission Reduction in Singapore. Appl Clin Inform. 2021;12: 372–382. pmid:34010978
- 100. Maali Y, Perez-Concha O, Coiera E, Roffe D, Day RO, Gallego B. Predicting 7-day, 30-day and 60-day all-cause unplanned readmission: a case study of a Sydney hospital. BMC Med Informatics Decis Mak. 2018;18: 1-N.PAG. pmid:29301576
- 101. Petch J, Di S, Nelson W. Opening the Black Box: The Promise and Limitations of Explainable Machine Learning in Cardiology. Can J Cardiol. 2022;38: 204–213. pmid:34534619
- 102. Ribeiro MT, Singh S, Guestrin C. “Why should i trust you?” Explaining the predictions of any classifier. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery; 2016. pp. 1135–1144.
- 103. Sheu Y. Illuminating the Black Box: Interpreting Deep Neural Network Models for Psychiatric Research. Front Psychiatry. 2020;11. pmid:33192663
- 104. Barredo Arrieta A, Díaz-Rodríguez N, Del Ser J, Bennetot A, Tabik S, Barbado A, et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf Fusion. 2020;58: 82–115.
- 105. McDermott MBA, Wang S, Marinsek N, Ranganath R, Foschini L, Ghassemi M. Reproducibility in machine learning for health research: Still a ways to go. Sci Transl Med. 2021;13. pmid:33762434
- 106. Pineau J, Vincent-Lamarre P, Sinha K, Lariviére V, Beygelzimer A, d’Alché-Buc F, et al. Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program). J Mach Learn Res. 2020;22: 1–20.
- 107. Haibe-Kains B, Adam GA, Hosny A, Khodakarami F, Shraddha T, Kusko R, et al. Transparency and reproducibility in artificial intelligence. Nature. 2020;586: E14–E16. pmid:33057217
- 108. Johnson AEW, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3: 160035. pmid:27219127
- 109. Harutyunyan H, Khachatrian H, Kale DC, Ver Steeg G, Galstyan A. Multitask learning and benchmarking with clinical time series data. Sci Data. 2019;6: 96. pmid:31209213
- 110. Johnson AEW, Ghassemi MM, Nemati S, Niehaus KE, Clifton D, Clifford GD. Machine Learning and Decision Support in Critical Care. Proc IEEE. 2016;104: 444–466. pmid:27765959
- 111. Varoquaux G, Cheplygina V. Machine learning for medical imaging: methodological failures and recommendations for the future. npj Digit Med. 2022;5: 48. pmid:35413988
- 112. Beam AL, Manrai AK, Ghassemi M. Challenges to the Reproducibility of Machine Learning Models in Health Care. JAMA. 2020;323: 305. pmid:31904799
- 113. Lans A, Pierik RJB, Bales JR, Fourman MS, Shin D, Kanbier LN, et al. Quality assessment of machine learning models for diagnostic imaging in orthopaedics: A systematic review. Artif Intell Med. 2022;132: 102396. pmid:36207080
- 114. Yusuf M, Atal I, Li J, Smith P, Ravaud P, Fergie M, et al. Reporting quality of studies using machine learning models for medical diagnosis: a systematic review. BMJ Open. 2020;10: e034568. pmid:32205374
- 115. Li J, Zhou Z, Dong J, Fu Y, Li Y, Luan Z, et al. Predicting breast cancer 5-year survival using machine learning: A systematic review. Baltzer PAT, editor. PLoS One. 2021;16: e0250370. pmid:33861809
- 116. Collins GS, Dhiman P, Andaur Navarro CL, Ma J, Hooft L, Reitsma JB, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021;11: e048008. pmid:34244270