
Using machine learning methods to predict all-cause somatic hospitalizations in adults: A systematic review

  • Mohsen Askar ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    mohsen.g.askar@uit.no

    Affiliation Faculty of Health Sciences, Department of Pharmacy, UiT-The Arctic University of Norway, Tromsø, Norway

  • Masoud Tafavvoghi,

    Roles Data curation, Investigation, Validation, Writing – review & editing

    Affiliation Faculty of Science and Technology, Department of Computer Science, UiT-The Arctic University of Norway, Tromsø, Norway

  • Lars Småbrekke,

    Roles Data curation, Investigation, Methodology, Supervision, Writing – review & editing

    Affiliation Faculty of Health Sciences, Department of Pharmacy, UiT-The Arctic University of Norway, Tromsø, Norway

  • Lars Ailo Bongo,

    Roles Supervision, Writing – review & editing

    Affiliation Faculty of Science and Technology, Department of Computer Science, UiT-The Arctic University of Norway, Tromsø, Norway

  • Kristian Svendsen

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Supervision, Writing – review & editing

    Affiliation Faculty of Health Sciences, Department of Pharmacy, UiT-The Arctic University of Norway, Tromsø, Norway

Abstract

Aim

In this review, we investigated how Machine Learning (ML) was utilized to predict all-cause somatic hospital admissions and readmissions in adults.

Methods

We searched eight databases (PubMed, Embase, Web of Science, CINAHL, ProQuest, OpenGrey, WorldCat, and MedNar) from their inception date to October 2023, and included records that predicted all-cause somatic hospital admissions and readmissions of adults using ML methodology. We used the CHARMS checklist for data extraction, PROBAST for bias and applicability assessment, and TRIPOD for reporting quality.

Results

We screened 7,543 studies, of which 163 full-text records were read and 116 met the review inclusion criteria. Among these, 45 predicted admission, 70 predicted readmission, and one study predicted both. There was substantial variety in the types of datasets, algorithms, features, data preprocessing steps, and evaluation and validation methods. The most used types of features were demographics, diagnoses, vital signs, and laboratory tests. Area Under the ROC curve (AUC) was the most used evaluation metric. Models trained using boosting tree-based algorithms often performed better than others. ML algorithms commonly outperformed traditional regression techniques. Sixteen studies used Natural Language Processing (NLP) of clinical notes for prediction, and all of them yielded good results. Overall adherence to reporting guidelines was poor in the reviewed studies. Only five percent of models were implemented in clinical practice. The most frequently inadequately addressed methodological aspects were: providing model interpretations on the individual patient level, full code availability, performing external validation, calibrating models, and handling class imbalance.

Conclusion

This review has identified considerable concerns regarding methodological issues and reporting quality in studies investigating ML to predict hospitalizations. To ensure the acceptability of these models in clinical settings, it is crucial to improve the quality of future studies.

Introduction

Unplanned hospital admissions and readmissions (hospitalizations) account for a significant share of global healthcare expenditures [1,2]. Interestingly, up to 35% of these hospitalizations are potentially avoidable [3]. One approach to address avoidable hospitalizations is to implement statistical and mathematical models on healthcare datasets in order to predict future hospitalization [4,5].

Previous attempts were mainly based on regression models and specific risk indexes (scores). Systematic reviews have concluded that most models had poor, inconsistent performance, and limited applicability. They also found that models utilizing health records data performed better than models using self-report data [4,6,7].

More recently, prediction models that utilize Machine Learning (ML) [8,9] algorithms have become more popular. Recent reviews emphasized the growing importance and effectiveness of ML models in predicting clinical outcomes such as hospital readmissions. These reviews concluded that ML techniques can improve readmission prediction ability over traditional statistical models. This improvement could be explained by ML models offering several advantages over traditional regression models, such as flexibility, the ability to handle large, complex, high-dimensional datasets, and the identification of non-linear relationships [10]. The reviews also highlighted the critical role of feature selection and addressed challenges such as transparency, the difficulty of interpreting ML models, and the importance of handling class imbalance to enhance model performance. Moreover, they highlighted the importance of demonstrating the clinical usefulness of the models in practice [11–13]. A systematic analysis of readmission prediction literature proposed a comprehensive framework for ML model development, detailing steps from data preparation and preprocessing to methods of feature selection and transformation, data splitting, model training, validation, and evaluation [14].

Although several reviews have considered the use of ML in predicting hospitalizations for specific diseases and conditions [15–17], none has systematically reviewed the literature on all-cause hospital admissions. With this review, we aim to (i) summarize the characteristics of ML studies used in predicting all-cause somatic admissions and readmissions; (ii) provide a picture of the ML pipeline steps including data preprocessing, feature selection, model evaluation, validation, calibration, and explanation; (iii) assess the risk of bias, applicability, and reporting completeness of the studies; and finally (iv) comment on the challenges facing implementation of ML models in clinical practice.

Materials and methods

The protocol of this systematic review was registered in the International Prospective Register of Systematic Reviews, PROSPERO (CRD42021276721). The PRISMA and PRISMA-Abstract guidelines [18] were followed in reporting this review, see S1 File: Section 1.

Inclusions/Exclusion criteria

To formulate the research question, we used the PICOTS checklist [19,20]. Studies that only included non-adults were excluded. Hospitalizations were defined as all-cause somatic admissions or readmissions from outside the hospital; hence, admissions related to psychological conditions, disease-specific admissions, and internal transfers between wards were excluded. Emergency Departments (EDs) were considered portals; thus, admissions from an ED to the hospital were included, but ED admissions followed by discharge were excluded.

Our focus is on studies performed in an ML context (whether ML was used in model-development steps, e.g., feature engineering, or to make the final predictions), so studies that only used statistical learning or risk indexes for prediction were excluded. All performance measures reported for competing models were included. This review is mainly descriptive of how ML was used in predicting hospitalization; hence, we chose to include studies conducted using real-world data with hospital admissions and readmissions as a valid outcome, regardless of the timing of the outcome. Table 1 presents the overall inclusion criteria. A detailed description of the inclusion and exclusion criteria is provided in S1 File: Section 2.

Search strategy

We searched four main databases: PubMed, Embase (via Ovid), Web of Science, and CINAHL (via EBSCO) from their inception dates to October 13th, 2023. The search strategy was developed through piloting of relevant studies. Database-specific index terms were used where available (MeSH for PubMed and CINAHL, and Emtree for Embase). We also searched four other databases for grey literature: ProQuest, OpenGrey, WorldCat (OCLC FirstSearch), and MedNar.

Four main search blocks were used to identify relevant studies: prediction, hospitalization, machine learning, and exclusions. The exclusion of irrelevant search words was developed by iteration and preliminary title/abstract piloting. The Boolean operators AND, OR, and NOT were used alongside truncation operators and phrase-searching. Search syntax was adapted for each database using the Polyglot tool [21] with manual supervision. The complete search syntax can be found in S1 File: Section 3.

Duplicate studies were removed using Mendeley Reference Manager (version 1.19.8, Elsevier). In cases where the reference manager was uncertain, we manually checked and removed any duplicates. Titles and abstracts were screened by two independent investigators (MA and KS), and full-text papers were retrieved for all candidate studies. The full-text screening was performed separately by MA, MT, LS, and KS. A manual search of the reference lists of the included studies was conducted to identify literature that did not appear in the electronic search. A list of all full-text screened studies, including those that were included and excluded with the reason(s) for exclusion, is attached to S2 File, sheet: Included & excluded studies. The final included studies were decided by discussion between MA, KS, and LS. The descriptive results were synthesized using Pivot tables in Microsoft Excel.

Data extraction

Data were extracted separately by MA, MT, and LS using the Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) checklist [19]. For further analysis, features were grouped into administrative and clinical feature groups. The included records, extracted data, models’ features, and feature groupings can be found in S2 File, sheets: CHARMS and Features.

Assessment of bias and applicability

Although the main purpose of the review is descriptive, MA and MT assessed the risk of bias and applicability using the Prediction model Risk of Bias Assessment Tool (PROBAST) [22]. PROBAST is a commonly used tool to assess prediction models. The tool evaluates four domains: Participants, Predictors, Outcome, and Analysis. For each domain, there is a set of questions to help judge the risk of both bias and applicability concerns. If any domain was not rated “low”, the overall risk of bias was considered “high”. Abstracts were not assessed due to their limited information. The assessment is attached to S2 File, sheet: PROBAST.

Quality of reporting

To assess the quality of reporting, we utilized the Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) checklist [23]. We followed the methodology suggested by Andaur Navarro et al. [24] to evaluate adherence to TRIPOD per article and per item in the reporting checklist. Each item was scored as 1 = reported, 0 = not reported, 0.5 = incomplete reporting, or ‘_’ = not applicable. Abstracts and conference proceedings were not evaluated. We then calculated the adherence per TRIPOD item by dividing the sum of that item’s scores across all studies by the total number of studies. Adherence for each article was calculated as the sum of its TRIPOD item scores divided by the maximum possible score had reporting been complete, S2 File, sheet: TRIPOD. All abbreviations mentioned in the study are included in S2 File, sheet: Abbreviations.
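
As an illustration of this scoring arithmetic, the sketch below (Python, using a randomly generated score matrix rather than the review's actual extraction sheet) computes adherence per item and per article; the matrix dimensions and column names are assumptions made only for the example.

```python
# Illustrative sketch (not the authors' code): computing TRIPOD adherence
# per item and per article from a studies x items score matrix, where each
# cell is 1 (reported), 0.5 (incomplete), 0 (not reported) or NaN (not applicable).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = pd.DataFrame(
    rng.choice([1.0, 0.5, 0.0, np.nan], size=(106, 20), p=[0.6, 0.1, 0.2, 0.1]),
    columns=[f"item_{i + 1}" for i in range(20)],
)

# Adherence per TRIPOD item: sum of scores across studies divided by the
# number of studies for which the item was applicable.
adherence_per_item = scores.sum(axis=0) / scores.notna().sum(axis=0)

# Adherence per article: sum of the article's scores divided by the number of
# applicable items (the maximum achievable if reporting were complete).
adherence_per_article = scores.sum(axis=1) / scores.notna().sum(axis=1)

print(adherence_per_item.round(2).head())
print(f"Median adherence per article: {adherence_per_article.median():.0%}")
```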

Results

Of the 7,543 records reviewed, 147 were eligible for full-text screening. We included 16 additional records identified by manual searching of the references. In total, 163 studies were fully screened and 116 studies were included in the review, of which 87 were peer-reviewed articles (76%), 17 conference articles, nine abstracts, and three theses (Fig 1).

Data extraction results

Characteristics of the included studies.

Sixty-one studies (53%) were conducted using data from the USA, followed by Australia (seven studies), Taiwan (four studies), and Canada and Singapore, each with three studies. The oldest article is from 2005, and 2019 had the greatest number of articles (22 articles, 19%), followed by 2020 (17 articles, 15%), see Fig 2.

Fig 2.

Left panel, a bar plot of the top 10 countries where the datasets originate and the number of publications. Right panel, a bar plot of the number of publications by year.

https://doi.org/10.1371/journal.pone.0309175.g002

Population characteristics.

Only 23 studies (20%) completely reported sample size (both the number of unique patients and the total number of admissions). Six studies (5%) reported neither the number of patients nor the number of admissions (four of them were abstracts). The sample size varied from 371 to 4,637,294. Regarding age, 49 studies (42%) did not report an age range for the included patients. The remaining studies had different minimum age requirements for the studied patients.

Outcomes characteristics.

Readmission was the outcome in 70 studies (60%), while 45 studies (39%) had hospital admission as an outcome, and one study investigated both outcomes. The readmission prediction horizon varied from 24 hours to 1 year. The most frequently predicted horizon was 30-day readmission (51 studies, 73% of the readmission studies); seven other studies combined it with other readmission horizons, giving 58 studies (83%) in total. The datasets’ inclusion periods varied from 1 month to 30 years (median: 1.25 years, mean: 3.2 years). Excluding rebalanced datasets, the readmission proportion varied from 0.7% to 34.6% (median 12.4%), while the admission proportion varied from 0.38% to 41% (median 17.2%).
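
For readers less familiar with how such outcome horizons are operationalized, the following sketch shows one way a 30-day readmission label can be derived from an admissions table; the column names and toy data are hypothetical and not taken from any reviewed study.

```python
# Illustrative sketch (hypothetical column names): deriving a 30-day
# all-cause readmission label from an admissions table with one row per stay.
import pandas as pd

admissions = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3],
    "admit_date": pd.to_datetime(
        ["2020-01-01", "2020-01-20", "2020-02-01", "2020-05-01", "2020-03-10"]),
    "discharge_date": pd.to_datetime(
        ["2020-01-05", "2020-01-25", "2020-02-06", "2020-05-04", "2020-03-12"]),
})

admissions = admissions.sort_values(["patient_id", "admit_date"])
# Date of the patient's next admission, if any.
admissions["next_admit"] = admissions.groupby("patient_id")["admit_date"].shift(-1)
# Label = 1 if the next admission starts within 30 days of this discharge.
days_to_next = (admissions["next_admit"] - admissions["discharge_date"]).dt.days
admissions["readmit_30d"] = ((days_to_next >= 0) & (days_to_next <= 30)).astype(int)
print(admissions[["patient_id", "discharge_date", "readmit_30d"]])
```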

Datasets

The number of studies that used administrative, clinical, or both types of data was similar (40, 36, and 38 studies, respectively); two abstracts had an unclear description of the dataset type. Among the studies that reported an Area Under the ROC curve (AUC) and used these types of datasets (103 studies), the mean AUCs were 0.80, 0.78, and 0.77, with standard deviations (SD) of 0.08, 0.07, and 0.09, respectively. Six studies reported an AUC over 90%, while 81 studies reported an AUC of 70–90% and 18 studies reported an AUC of 60–70%. Fig 3 shows the relationship between outcomes, dataset types, dataset sources, and the best model performance. S1 File: Section 4 includes detailed information on the data types, sources, and frequency of use in predicting either admissions or readmissions.

Fig 3. Sankey diagram showing the type of outcome, datatypes and sources of datasets, and model performance by AUC.

The thickness of the streams indicates the number of records common between pairs of categories. Medical records include patient information from EHR or EMR. Hospital datasets include data from hospital information systems.

https://doi.org/10.1371/journal.pone.0309175.g003

Types of features included in models.

The most used feature groups were demographics (92 studies, 79%), diagnoses (43 studies, 37%), vital signs (34 studies, 29%), and laboratory tests (28%). Fig 4 presents the most used feature groups in the included studies. Natural Language Processing (NLP) techniques were used in 16 studies (14%) to predict hospitalizations from clinical free-text notes.

Fig 4. The most frequently used feature groups in the retrieved studies.

https://doi.org/10.1371/journal.pone.0309175.g004

Missing data and data imbalance.

Missing values were not mentioned at all in 52 studies (45%). In the 55 studies (47%) that reported how missing values were handled, the most used methods were removing records with missing values (27 studies, 23%) and various imputation methods (25 studies, 22%), with some studies combining removal and imputation.
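
The sketch below illustrates the two approaches mentioned, complete-case removal and simple imputation, on toy data; it is not code from any included study, and the feature names are invented.

```python
# Illustrative sketch (toy data): the two most common strategies reported in
# the reviewed studies -- dropping records with missing values vs. imputing them.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({
    "age": [71, 65, np.nan, 80],
    "creatinine": [1.1, np.nan, 0.9, 1.4],
    "prior_admissions": [2, 0, 1, np.nan],
})

# Strategy 1: complete-case analysis (drop any row with a missing value).
X_dropped = X.dropna()

# Strategy 2: simple imputation (here, median per feature; in a real pipeline
# the imputer is fitted on training data only to avoid leakage into the test set).
imputer = SimpleImputer(strategy="median")
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

print(f"Rows kept after dropping: {len(X_dropped)} of {len(X)}")
print(X_imputed)
```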

Of 99 studies (85%) that reported class imbalance in the outcome, only 32 studies (32%) reported handling this imbalance by some technique. The most used techniques were undersampling (19 studies, 59%), oversampling (7 studies, 22%), and Synthetic Minority Oversampling TEchnique (SMOTE) (6 studies, 19%). Note that some studies tested more than one resampling method.

Models’ performance and comparison

In total, 57 different algorithms were used for predicting the outcomes. Regression models were the most frequently used algorithm group (73 studies, 63%) followed by bagging tree-based algorithms, in 61 studies (53%), and boosting tree-based algorithms in 60 studies (52%). The best-performing algorithm group was boosting algorithms in 35 studies (42%), bagging algorithms in 16 studies (19%), followed by regression and Neural Networks (NN) models in 14 studies (17%) each, see Table 2.

Table 2. Algorithms’ groups and frequency of use in the included studies.

https://doi.org/10.1371/journal.pone.0309175.t002

Comparing the performance of algorithms.

Eighty-three studies (72%) compared the performance of multiple algorithms. Based on the results of the best-performing algorithm groups (Table 2), we compared the performance of some of these algorithm groups. Decision Tree (DT) and Bayesian models were not included in the comparison, as they did not perform best in any of the studies. Fig 5 illustrates the performance comparison between different algorithm groups in the retrieved studies.

Fig 5. A pairwise comparison of the performance of different algorithms’ groups.

The numbers on each segment denote the count of publications in which the first algorithm group demonstrated superior, equivalent, or inferior performance compared to the second one. Adjacent to each bar, the total number of publications involving such comparisons is indicated.

https://doi.org/10.1371/journal.pone.0309175.g005

Evaluation metrics.

AUC was the most used evaluation metric (105 studies), followed by precision, sensitivity, specificity, and accuracy (Fig 6). Thirty-seven studies (32%) reported only one evaluation metric, such as AUC or accuracy, without reporting a clinical performance metric such as sensitivity or specificity. Of the 105 studies that reported AUC, 18 studies (17%) reported an AUC of 60–70%, 42 studies (40%) reported an AUC of 70–80%, 39 studies (37%) reported an AUC of 80–90%, and only six studies (6%) reported an AUC above 90% (Fig 3). The highest reported AUC was 95% in admission models and 99% in readmission models. The mean AUC reported in the studies that used administrative, clinical, or combined datasets was 0.80, 0.78, and 0.77 (SD: 0.08, 0.07, and 0.09), respectively.

Fig 6. Various aspects of model evaluation.

Each subplot represents the frequency of use in the reviewed studies.

https://doi.org/10.1371/journal.pone.0309175.g006

Model calibration and benchmarking.

Only 28 studies (24%) calibrated their models using one of the calibration methods. Fig 6 presents the calibration methods used and the count of publications. Eighteen studies (16%) were benchmarked against one or more risk prediction indexes such as the LACE [25], PARR [26], and HOSPITAL [27] indexes. The most used risk index in benchmarking was the LACE index (nine studies), followed by PARR and HOSPITAL (two studies each). In all 18 studies, ML models outperformed predictions obtained from these risk indexes. A detailed comparison is attached to S1 File: Section 5.
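
As a hedged illustration of what calibration involves, the sketch below applies Platt scaling and inspects a reliability (calibration) curve on synthetic data; it does not reproduce any reviewed study's method, and isotonic regression is an equally common alternative.

```python
# Illustrative sketch (synthetic data): checking and improving probability
# calibration, one of the under-reported steps noted in this review.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.85], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

base = GradientBoostingClassifier(random_state=0)
# Platt scaling (sigmoid) fitted via cross-validation on the training data.
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5).fit(X_train, y_train)

# Reliability pairs: predicted probability per bin vs. observed event fraction.
prob_true, prob_pred = calibration_curve(
    y_test, calibrated.predict_proba(X_test)[:, 1], n_bins=10)
print(list(zip(prob_pred.round(2), prob_true.round(2))))
```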

Model validation.

The majority of studies were trained and validated retrospectively (96 studies, 83%). Only 17 studies (15%) were trained retrospectively and tested prospectively; among them, three studies performed real-time validation. The study design was not clear in three studies. Fig 6 depicts the internal and external validation methods used in the studies.

Model explainability and availability

Providing model interpretation at the patient level (local model interpretation) was presented in only three studies. Fig 6 presents the different interpretation methods used in the studies. Twenty studies (17%) used publicly available datasets, and 15 studies (13%) reported providing the data upon request. Only 17 studies (15%) made their code available, and only six studies implemented their models in clinical practice.

Quality of the studies

Bias and applicability assessment.

Of 106 studies assessed, 68 (64%) were evaluated to be at high risk of bias. We evaluated 94 studies (87%) to be at low concern of applicability. Assessment results are attached to S2 File, sheet: PROBAST.

Reporting quality assessment.

Only nine studies reported adherence to the TRIPOD checklist. These studies [28–36] had generally better reporting quality (scoring 17, 17, 19, 17.5, 16.5, 18, 17.5, 17, and 16 out of 20, respectively). The overall median adherence to the 20 TRIPOD items was 77% (IQR 63–95%). The assessment of adherence to TRIPOD reveals insufficient reporting, especially for items such as reporting the flow of participants (35% of the studies), supplementary material (52%), population characteristics (53%), reporting missing data (56%), and funding (58%), among others (Fig 7). The evaluation sheet is attached to S2 File, sheet: TRIPOD.

Fig 7. Studies’ adherence proportion to a range of TRIPOD checklist items.

Only explicit reporting of Confidence Intervals (CI) was considered complete reporting. Note that all items were calculated excluding abstracts and proceedings (10 studies). Hence, some items, such as missing data, can differ from the results section, where calculations included all studies.

https://doi.org/10.1371/journal.pone.0309175.g007

Discussion

To our knowledge, this is the first systematic review to focus on ML models for predicting all-cause somatic hospitalizations. Of 7,543 citations, 116 studies were included. Our review reveals the potential of ML models in predicting all-cause somatic hospitalizations, which is consistent with what is reported by both a general review of AI and machine learning and disease-specific reviews [8,9]. Our findings also raise concerns regarding the quality of the studies conducted. Therefore, despite the potential of the ML prediction framework and the superiority over traditional statistical prediction shown in many studies, there are clear issues with the quality of reporting, external validation, model calibration, and interpretation. All these aspects should complement model performance for the models to be suitable for implementation in real-life clinical practice. These main findings are consistent with findings from other reviews [11,12,37–39].

Most studies were based on data from the USA, which can be an issue. This geographic skew limits the generalizability of the developed models, considering the differences in healthcare systems and patient populations between countries [40]. As 30-day readmission is a widely used indicator of hospital care quality [41], the majority of the included readmission studies used it as an outcome.

Datasets and features

A wide variety of data sources and types were used. We found the performance of models trained on administrative (claims) data, clinical data, or datasets combining both clinical and administrative variables to be similar, with a slight edge for models trained on administrative datasets.

The most important features varied between the different studies. This lack of convergence of risk factors is due to: i) different definitions of admission and readmission outcomes across studies, ii) the use of different feature selection methods [42], iii) the diversity of recorded features in different healthcare databases, iv) the lack of standard handling of data preprocessing steps and the variation in methods for handling and generating variables, v) the variety in populations, subpopulations, and exclusion criteria, and finally, vi) the use of different risk scores and indexes, which include different sets of features. This is consistent with what previous studies concluded about the difficulty of finding universal features for predicting hospitalization [43–45]. While defining general risk factors is particularly difficult for studies of all-cause hospitalizations, it may be appropriate in subpopulations (e.g., patients with specific diseases) that have more similarities and less diversity. Yet, some groups of risk factors are shown to be more common than others (Fig 4).

The most used feature groups were demographics, diagnoses, physiological measurements, and laboratory tests, respectively (Fig 4). Some studies used only one or a limited number of feature groups [46–53]. All these studies yielded generally good predictive performance, suggesting that the sole use of one or a limited number of feature categories can be enough to predict hospitalization. However, this needs to be further investigated by comparing the performance of models built exclusively on one or a few feature groups with models built on several feature groups.

Some studies used Natural Language Processing (NLP) techniques to extract information from clinical text, either combining it with other structured features [54–57] or using it as the sole source of data [47,52,53,58–62]. Some studies reported better prediction performance using textual data than numerical data (e.g., laboratory tests and vital signs), suggesting the existence of relevant expert knowledge within these reports [53]. We noticed an increase in the application of NLP techniques in recent studies, suggesting that utilizing textual data is a promising future direction for predicting hospitalizations. Incorporating NLP techniques in prediction models provides a rich source of clinical information that may not be present in the tabular format of patient records. It can also improve research scalability through automatic extraction of relevant information rather than manual processing. Furthermore, it can provide real-time assistance for clinicians. However, some challenges should be considered, such as the limited availability of large, shared, annotated datasets necessary for developing efficient NLP models, popular evaluation methods that may not be clinically relevant, and the lack of transparent protocols to ensure NLP methods are reproducible [63,64].
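
To make the NLP approach concrete, here is a minimal bag-of-words baseline on invented triage-note snippets; the reviewed studies used much larger clinical corpora and, often, more sophisticated language models.

```python
# Illustrative sketch (toy notes, not a clinical dataset): a simple
# bag-of-words baseline for predicting admission from free-text triage notes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = [
    "chest pain radiating to left arm, diaphoretic",
    "minor laceration to finger, no active bleeding",
    "shortness of breath, history of heart failure",
    "medication refill request, no acute complaints",
]
admitted = [1, 0, 1, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),  # word and bigram features
    LogisticRegression(max_iter=1000),
)
model.fit(notes, admitted)
print(model.predict_proba(["worsening dyspnea and chest pain"])[:, 1])
```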

Data preprocessing

How multiple admissions for the same patient were handled during data preparation should be reported. Only 23 studies reported both the number of unique patients and the total number of admissions, indicating poor reporting of this item. Reporting both numbers, along with the method of handling multiple admissions for the same patient, is important, since neglecting the correlation between admissions may lead to unreliable predictions.
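
One common way to respect this correlation, splitting at the patient level so the same patient never appears in both training and test data, is sketched below using synthetic data and hypothetical variable names.

```python
# Illustrative sketch: split at the patient level so that no patient
# contributes admissions to both the training and test sets.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_admissions = 1000
rng = np.random.default_rng(0)
X = rng.normal(size=(n_admissions, 15))                # admission-level features
y = rng.integers(0, 2, size=n_admissions)              # readmission label
patient_id = rng.integers(0, 400, size=n_admissions)   # ~2.5 admissions/patient

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))

# No patient appears in both folds.
assert set(patient_id[train_idx]).isdisjoint(patient_id[test_idx])
print(len(train_idx), "training admissions,", len(test_idx), "test admissions")
```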

Similarly, less than half of the studies (47%) reported a method of handling missing values. Only a few studies (32) reported handling class imbalance in the dataset. Class imbalance means that the outcome contains more samples from one class (the majority class) than from the other (the minority class) [65] and is one of the most common issues in training ML models to predict hospitalizations. However, it is not usually taken into consideration in the readmission risk prediction literature [66]. The problem with class imbalance is that models can become biased towards the majority class, leading to misleadingly high prediction performance [67]. Resampling techniques, especially undersampling, were the most used approach. Resampling techniques balance the distribution of outcome classes either by oversampling or undersampling. Oversampling involves increasing the number of minority-class instances (e.g., SMOTE), while undersampling involves randomly reducing the number of majority-class instances, thus balancing the class distribution [68]. It should be noted that resampling techniques have drawbacks, such as overfitting or loss of useful information, which can introduce problematic consequences and hinder model learning [69,70].
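
A minimal sketch of the resampling techniques described above, using the imbalanced-learn package (an assumption; the reviewed studies used various implementations), applied to a synthetic imbalanced outcome and, as recommended, to training data only:

```python
# Illustrative sketch: rebalancing an imbalanced readmission outcome with
# random undersampling and SMOTE (requires the imbalanced-learn package).
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X_train, y_train = make_classification(
    n_samples=5000, n_features=20, weights=[0.9], random_state=0)
print("original:", Counter(y_train))

# Random undersampling of the majority (non-readmitted) class.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)
print("undersampled:", Counter(y_under))

# SMOTE: synthesize new minority-class samples by interpolating neighbours.
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X_train, y_train)
print("SMOTE:", Counter(y_smote))
```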

Models’ performance comparisons

Model performance for health-related outcomes should be reported on two levels: model performance metrics (e.g., AUC, F1-score) and clinical performance metrics (e.g., sensitivity, specificity, PPV, NPV) [71]. More than one-third of the studies reported only a model performance metric, with AUC as the most used, which could limit their acceptance in clinical practice.
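
The sketch below illustrates this two-level reporting on synthetic data: a discrimination metric (AUC) plus threshold-dependent clinical metrics. The 0.3 operating threshold is arbitrary and chosen only for the example.

```python
# Illustrative sketch (synthetic data): reporting both a model-level metric
# (AUC) and threshold-dependent clinical metrics, as recommended above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print(f"AUC: {roc_auc_score(y_te, probs):.2f}")

preds = (probs >= 0.3).astype(int)  # example operating threshold
tn, fp, fn, tp = confusion_matrix(y_te, preds).ravel()
print(f"Sensitivity: {tp / (tp + fn):.2f}  Specificity: {tn / (tn + fp):.2f}")
print(f"PPV: {tp / (tp + fp):.2f}  NPV: {tn / (tn + fn):.2f}")
```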

The analysis of different algorithms’ performance confirms that no algorithm consistently performs better than the others [72]. Yet, some algorithms more frequently yield better results than others. In this review, we found that tree-based boosting algorithms often outperformed other algorithms (Table 2 and Fig 5). Tree-based boosting algorithms, such as Gradient Boosting Machine (GBM), XGBoost, and AdaBoost, are a class of ensemble learning methods that build multiple decision trees sequentially [73]. Each new decision tree corrects the errors of previous ones by giving more focus to samples that were difficult to estimate [74]. The predictions of the trees are then combined to produce the final model prediction [75]. This group of algorithms has many advantages, such as training multiple models, which enhances prediction performance over training a single one, flexibility to handle different data types, the ability to capture non-linear patterns, and being less prone to overfitting [76].
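
As an illustration of the boosting approach described above, and of the typical comparison against logistic regression, the following sketch uses scikit-learn on synthetic tabular data; it is not meant to reproduce any reviewed study's results.

```python
# Illustrative sketch (synthetic tabular data): a tree-based boosting model of
# the kind most often reported as best-performing, compared against logistic
# regression using cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(
    n_samples=5000, n_features=30, n_informative=10, weights=[0.85], random_state=0)

# Sequentially fitted trees, each correcting the errors of the previous ones.
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 max_depth=3, random_state=0)
lr = LogisticRegression(max_iter=5000)

for name, model in [("boosting", gbm), ("logistic regression", lr)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC {auc.mean():.3f} (+/- {auc.std():.3f})")
```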

Many studies tended to compare the performance of different algorithms on the same dataset. In this regard, we suggest that conducting more studies focused solely on comparing the performance of commonly used ML algorithms is not needed unless they aim to benchmark new algorithms against existing ones. We propose that researchers should instead focus on how to generalize ML models and implement them in clinical practice.

There is an ongoing discussion about whether ML models can offer better predictive abilities than conventional statistical models such as logistic regression (LR). While some studies found that ML models outperform regression models [11,37,77–81], others suggest that using ML models gives no better prediction than LR [55,82,83]. In our analysis, ML models mostly performed better than regression models. This is consistent with a meta-analysis that concluded the same by comparing LR to advanced ML algorithms such as NN [84]. Regression models performed better in only 17% of the studies that compared regression with ML algorithms (Table 2 and Fig 5). This can be explained by LR being a parametric algorithm that lacks the flexibility of non-parametric ones [85], or by LR’s restrictive assumptions, which favor less restricted or assumption-free algorithms [86].

We also found that ML models outperform risk indexes in prediction performance. This is reasonable because risk indexes usually contain few predictors and aim mainly to simplify prediction, while ML models utilize more predictors and complex methods to capture patterns in the data. It could also be argued that ML models are developed and tested on the same dataset and may therefore be better tuned to that specific dataset, or even overfitted to it, whereas risk indexes are usually developed in one setting and then validated on different datasets and settings, which would favor ML models in such comparisons anyway.

Finally, two studies compared ML models to clinicians’ predictions. They concluded that the models outperform ED nurses in predicting hospital admission from the ED and that combining ML models with clinical insight improves model performance [87,88].

Model validation

External validation (EV) should ideally be conducted on datasets that are unrelated to, and structurally different from, the dataset used for model training [89,90]. If the validation dataset differs only temporally but still originates from the same setting and place, it is called temporal EV and is regarded as an approach that lies midway between internal and external validation [91]. This is because the overall patient characteristics are similar between the two datasets [92]. Our analysis shows a clear shortage of EV of models; most of the EV performed can be regarded as temporal EV. Although recent studies indicate an increased awareness of EV, it remains a critical and often missing step in the current development of ML models [93]. However, there are still several obstacles facing ML models’ generalizability. These obstacles can be categorized as either model-related or data-related. Model-related obstacles include issues with transparency in model development and results reporting. Data-related obstacles include the diversity of data structures, formats, and populations across different healthcare systems, the lack of a standardized data preprocessing framework, and strict health data privacy regulations.
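
A minimal sketch of a temporal validation split of this kind, training on earlier admissions and evaluating on later ones, using synthetic data and hypothetical column names:

```python
# Illustrative sketch: temporal validation -- train on pre-2019 admissions,
# evaluate on 2019 and later (labels here are random, so AUC will be ~0.5).
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 3000
data = pd.DataFrame(rng.normal(size=(n, 10)), columns=[f"x{i}" for i in range(10)])
data["admit_date"] = pd.to_datetime("2015-01-01") + pd.to_timedelta(
    rng.integers(0, 6 * 365, size=n), unit="D")
data["readmit_30d"] = rng.integers(0, 2, size=n)

cutoff = pd.Timestamp("2019-01-01")
train = data[data["admit_date"] < cutoff]
test = data[data["admit_date"] >= cutoff]

features = [f"x{i}" for i in range(10)]
model = GradientBoostingClassifier(random_state=0).fit(
    train[features], train["readmit_30d"])
auc = roc_auc_score(test["readmit_30d"], model.predict_proba(test[features])[:, 1])
print(f"Temporal validation AUC: {auc:.2f}")
```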

Adopting Common Data Models (CDMs) [94], designing a comprehensive and widely accepted framework for data preprocessing, and implementing Federated Learning (FL) [95] could help address these issues. In S1 File: Section 6, we provide a more detailed explanation of these obstacles and solutions.

Model explainability and availability

Model interpretation is of great importance in predicting health-related outcomes. Global model interpretation involves describing the most important rules and most influential features that the model learned during training [96], while local model interpretation refers to explaining how the model derived each individual prediction (i.e., for each patient) [97,98].

In our analysis, 62 studies (53%) provided a global interpretation of their model, for example in the form of feature importance or a risk score [28,99,100], while only three studies presented methods for local interpretation [56,78]. Introducing both global and local model interpretation is important to increase trustworthiness and to enhance the implementation of these models in practice [101–104].
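
A hedged sketch of producing both levels of interpretation with SHAP values, assuming the shap package is available (the reviewed studies used a variety of interpretation methods, not necessarily this one):

```python
# Illustrative sketch (synthetic data): global interpretation via mean absolute
# SHAP values and local interpretation for a single patient.
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # shape: (n_samples, n_features)

# Global interpretation: average contribution of each feature across patients.
global_importance = np.abs(shap_values).mean(axis=0)
print("Global feature importance:", global_importance.round(3))

# Local interpretation: contribution of each feature to one patient's prediction.
patient_idx = 0
print("Patient-level contributions:", shap_values[patient_idx].round(3))
```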

Few studies made their dataset (20 studies) or code (17 studies) publicly available. To facilitate the technical reproducibility of a model, publishing both the dataset and the code is necessary. Healthcare datasets, however, contain patients’ confidential information, which hinders publishing them. Hence, some suggestions have been reported to partially address this issue, such as publishing a simulated dataset [105], providing complementary empirical results on an open-source benchmark dataset [106], or sharing model predictions and data labels to allow further statistical analysis [107]. There is also no doubt that publicly available datasets such as MIMIC-III [108] have boosted ML research and opened many opportunities to develop ML in the health domain. MIMIC-III has been cited more than 3,000 times to date. The dataset has enabled numerous studies that focus on developing predictive models and enhancing clinical decision support systems [109,110].

Reporting model development code and the experiments performed can help readers understand the final methodology, accelerate overall development, and ensure that models are safeguarded from data leakage and other pitfalls in model development [71,111]. Additionally, reporting the software and package versions is necessary. Many algorithmic decisions are made silently through the default settings of different packages, leading to differences in results when an experiment is repeated, even on the same dataset [112].

Bias risk and applicability

More than 60% of the assessed studies had a high risk of bias in line with other reviews’ findings [38,39,113]. Twelve studies were found to have a high concern of applicability. However, factors such as variability of populations, settings, and dataset characteristics are anticipated to further constrain the applicability of these studies.

In general, we observed poor quality of reporting in the studies. This is consistent with findings in other studies [24,114,115]. Poor reporting quality raises concerns about the reproducibility of models [105]. Studies that adhered to TRIPOD had better scores than those that did not. This points to the importance of adherence to a reporting checklist in ML studies, especially in the health domain. It also raises the need to develop ML-specific checklists for quality assessment and reporting. Ongoing research is currently addressing this requirement [116]. In S1 File: Section 7, we suggest a reporting scheme for ML studies.

Limitations

We identified the relevant literature from eight databases, but we did not approach authors for missing information on the studies. This was due to the considerable amount of missing information, which could have impacted the assessment of bias risk. Our results are also limited by the fact that most of the reviewed studies were based on data from the USA, which limits generalizability because of differences in populations and healthcare systems between countries. To address this limitation, future studies should aim to include diverse datasets from various countries and healthcare settings. Additionally, more efforts should be directed towards comparing models from different populations and settings to understand their limitations in different contexts. Assessing the quality of studies was also limited by not being able to access their code scripts. Potential publication bias also limits the ability of the review to comprehensively evaluate the overall results. Additionally, reporting quality varied significantly between studies, which can affect the reliability of the findings.

The heterogeneity of healthcare systems and patient populations across countries, and of ML algorithms and settings, limits comparisons of results between studies and makes it more difficult to harmonize the results of different models. Due to this heterogeneity, we had to make decisions regarding the inclusion criteria that may have caused us to miss relevant studies. Finally, only literature published in English was included, which also limits our insight into the overall picture of ML development globally.

Conclusions

The main purpose of the review was to describe how ML was used in predicting all-cause somatic hospitalizations. The review raises some concerns about the quality of data preprocessing, the reporting quality, reproducibility, local interpretation, and the external validity of many studies. The quality of studies needs to improve to meet the expectations of clinicians and stakeholders before using these models in clinical practice. We recommend that future studies should prioritize generalizing ML models and integrating them into clinical practice.

Supporting information

S1 File. Includes: Section 1: PRISMA checklist, Section 2: Detailed inclusion/exclusion criteria, Section 3: Literature search syntax, Section 4: Studies’ data sources, Section 5: Benchmarking with risk indexes, Section 6: A comment on the generalizability of ML models, and Section 7: A suggested reporting checklist specifically for ML models in structured datasets.

https://doi.org/10.1371/journal.pone.0309175.s001

(DOCX)

S2 File. Includes: (Sheet: Abbreviations), (Sheet: CHARMS) includes the extracted data for the reviewed studies and study citations, (Sheet: Features) includes feature-related extractions, (Sheet: PROBAST) includes the applicability and risk of bias assessment, (Sheet: TRIPOD) includes the reporting quality assessment, and (Sheet: Included & excluded studies) includes the full-text screened studies with the reason(s) for exclusion.

https://doi.org/10.1371/journal.pone.0309175.s002

(XLSX)

References

  1. 1. McDermott KW, Jiang HJ. Characteristics and Costs of Potentially Preventable Inpatient Stays, 2017. Healthcare Cost and Utilization Project (HCUP) Statistical Briefs. Agency for Healthcare Research and Quality (US); 2006. Available: https://www.ncbi.nlm.nih.gov/books/NBK559945/.
  2. 2. Jencks SF, Williams M V., Coleman EA. Rehospitalizations among Patients in the Medicare Fee-for-Service Program. N Engl J Med. 2009;361: 311–312. pmid:19605841
  3. 3. Lyhne CN, Bjerrum M, Riis AH, Jørgensen MJ. Interventions to Prevent Potentially Avoidable Hospitalizations: A Mixed Methods Systematic Review. Front Public Heal. 2022;10. pmid:35899150
  4. 4. Kansagara D, Englander H, Salanitro A, Kagen D, Theobald C, Freeman M, et al. Risk Prediction Models for Hospital Readmission. JAMA. 2011;306: 1688. pmid:22009101
  5. 5. Dhillon SK, Ganggayah MD, Sinnadurai S, Lio P, Taib NA. Theory and Practice of Integrating Machine Learning and Conventional Statistics in Medical Data Analysis. Diagnostics 2022, Vol 12, Page 2526. 2022;12: 2526. pmid:36292218
  6. 6. Wallace E, Stuart E, Vaughan N, Bennett K, Fahey T, Smith SM. Risk Prediction Models to Predict Emergency Hospital Admission in Community-dwelling Adults. Med Care. 2014;52: 751–765. pmid:25023919
  7. 7. Zhou H, Della PR, Roberts P, Goh L, Dhaliwal SS. Utility of models to predict 28-day or 30-day unplanned hospital readmissions: an updated systematic review. BMJ Open. 2016;6: e011060. pmid:27354072
  8. 8. Helm JM, Swiergosz AM, Haeberle HS, Karnuta JM, Schaffer JL, Krebs VE, et al. Machine Learning and Artificial Intelligence: Definitions, Applications, and Future Directions. Curr Rev Musculoskelet Med. 2020;13: 69–76. pmid:31983042
  9. 9. El Naqa I, Murphy MJ. What Is Machine Learning? Machine Learning in Radiation Oncology. Cham: Springer International Publishing; 2015. pp. 3–11. https://doi.org/10.1007/978-3-319-18305-3_1
  10. 10. Rajula HSR, Verlato G, Manchia M, Antonucci N, Fanos V. Comparison of Conventional Statistical Methods with Machine Learning in Medicine: Diagnosis, Drug Development, and Treatment. Medicina (B Aires). 2020;56: 455. pmid:32911665
  11. 11. Artetxe A, Beristain A, Graña M. Predictive models for hospital readmission risk: A systematic review of methods. Comput Methods Programs Biomed. 2018;164: 49–64. pmid:30195431
  12. 12. Teo K, Yong CW, Chuah JH, Hum YC, Tee YK, Xia K, et al. Current Trends in Readmission Prediction: An Overview of Approaches. Arab J Sci Eng. 2021. pmid:34422543
  13. 13. Benedetto U, Dimagli A, Sinha S, Cocomello L, Gibbison B, Caputo M, et al. Machine learning improves mortality risk prediction after cardiac surgery: Systematic review and meta-analysis. J Thorac Cardiovasc Surg. 2022;163: 2075–2087.e9. pmid:32900480
  14. 14. Chen T, Madanian S, Airehrour D, Cherrington M. Machine learning methods for hospital readmission prediction: systematic analysis of literature. J Reliab Intell Environ. 2022;8: 49–66.
  15. 15. Cho SM, Austin PC, Ross HJ, Abdel-Qadir H, Chicco D, Tomlinson G, et al. Machine Learning Compared With Conventional Statistical Models for Predicting Myocardial Infarction Readmission and Mortality: A Systematic Review. Can J Cardiol. 2021;37: 1207–1214. pmid:33677098
  16. 16. Sun Z, Dong W, Shi H, Ma H, Cheng L, Huang Z. Comparing Machine Learning Models and Statistical Models for Predicting Heart Failure Events: A Systematic Review and Meta-Analysis. Front Cardiovasc Med. 2022;9. pmid:35463786
  17. 17. Mahajan SM, Heidenreich P, Abbott B, Newton A, Ward D. Predictive models for identifying risk of readmission after index hospitalization for heart failure: A systematic review. Eur J Cardiovasc Nurs J Work Gr Cardiovasc Nurs Eur Soc Cardiol. 2018;17: 675–689. pmid:30189748
  18. 18. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Syst Rev. 2021;10: 89. pmid:33781348
  19. 19. Moons KGM, de Groot JAH, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG, et al. Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies: The CHARMS Checklist. PLoS Med. 2014;11: e1001744. pmid:25314315
  20. 20. Debray TPA, Damen JAAG, Snell KIE, Ensor J, Hooft L, Reitsma JB, et al. A guide to systematic review and meta-analysis of prediction model performance. BMJ. 2017;356: 6460. pmid:28057641
  21. 21. Clark JM, Sanders S, Carter M, Honeyman D, Cleo G, Auld Y, et al. Improving the translation of search strategies using the Polyglot Search Translator: a randomized controlled trial. J Med Libr Assoc. 2020;108. pmid:32256231
  22. 22. Wolff RF, Moons KGM, Riley RD, Whiting PF, Westwood M, Collins GS, et al. PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies. Ann Intern Med. 2019;170: 51. pmid:30596875
  23. 23. Collins GS, Reitsma JB, Altman DG, Moons K. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement. BMC Med. 2015;13: 1. pmid:25563062
  24. 24. Andaur Navarro CL, Damen JAA, Takada T, Nijman SWJ, Dhiman P, Ma J, et al. Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review. BMC Med Res Methodol. 2022;22: 12. pmid:35026997
  25. 25. van Walraven C, Dhalla IA, Bell C, Etchells E, Stiell IG, Zarnke K, et al. Derivation and validation of an index to predict early death or unplanned readmission after discharge from hospital to the community. CMAJ. 2010;182: 551–7. pmid:20194559
  26. 26. Billings J, Blunt I, Steventon A, Georghiou T, Lewis G, Bardsley M. Development of a predictive model to identify inpatients at risk of re-admission within 30 days of discharge (PARR-30). BMJ Open. 2012;2. pmid:22885591
  27. 27. Donzé J, Aujesky D, Williams D, Schnipper JL. Potentially Avoidable 30-Day Hospital Readmissions in Medical Patients. JAMA Intern Med. 2013;173: 632. pmid:23529115
  28. 28. Fenn A, Davis C, Buckland DM, Kapadia N, Nichols M, Gao M, et al. Development and Validation of Machine Learning Models to Predict Admission From Emergency Department to Inpatient and Intensive Care Units. Ann Emerg Med. 2021;78: 290–302. pmid:33972128
  29. 29. Chandra A, Rahman PA, Sneve A, McCoy RG, Thorsteinsdottir B, Chaudhry R, et al. Risk of 30-Day Hospital Readmission Among Patients Discharged to Skilled Nursing Facilities: Development and Validation of a Risk-Prediction Model. J Am Med Dir Assoc. 2019;20: 444–450.e2. pmid:30852170
  30. 30. Spangler D, Hermansson T, Smekal D, Blomberg H. A validation of machine learning-based risk scores in the prehospital setting. PLoS One. 2019;14: e0226518. pmid:31834920
  31. 31. Raita Y, Goto T, Faridi MK, Brown DFM, Camargo CAJ, Hasegawa K. Emergency department triage prediction of clinical outcomes using machine learning models. Crit Care. 2019;23: 64. pmid:30795786
  32. 32. Rahimian F, Salimi-Khorshidi G, Payberah AH, Tran J, Ayala Solares R, Raimondi F, et al. Predicting the risk of emergency admission with machine learning: Development and validation using linked electronic health records. PLoS Med. 2018;15: e1002695. pmid:30458006
  33. 33. Hegselmann S, Ertmer C, Volkert T, Gottschalk A, Dugas M, Varghese J. Development and validation of an interpretable 3 day intensive care unit readmission prediction model using explainable boosting machines. Front Med. 2022;9. pmid:36082270
  34. 34. Olza A, Millán E, Rodríguez-Álvarez MX. Development and validation of predictive models for unplanned hospitalization in the Basque Country: analyzing the variability of non-deterministic algorithms. BMC Med Inform Decis Mak. 2023;23: 152. pmid:37543596
  35. 35. Brankovic A, Rolls D, Boyle J, Niven P, Khanna S. Identifying patients at risk of unplanned re-hospitalisation using statewide electronic health records. Sci Rep. 2022;12: 16592. pmid:36198757
  36. 36. Dadabhoy FZ, Driver L, McEvoy DS, Stevens R, Rubins D, Dutta S. Prospective External Validation of a Commercial Model Predicting the Likelihood of Inpatient Admission From the Emergency Department. Ann Emerg Med. 2023;81: 738–748. pmid:36682997
  37. 37. Shin S, Austin PC, Ross HJ, Abdel‐Qadir H, Freitas C, Tomlinson G, et al. Machine learning vs. conventional statistical models for predicting heart failure readmission and mortality. ESC Hear Fail. 2021;8: 106–115. pmid:33205591
  38. 38. Huang Y, Talwar A, Chatterjee S, Aparasu RR. Application of machine learning in predicting hospital readmissions: a scoping review of the literature. BMC Med Res Methodol. 2021;21: 96. pmid:33952192
  39. 39. Kamel Rahimi A, Canfell OJ, Chan W, Sly B, Pole JD, Sullivan C, et al. Machine learning models for diabetes management in acute care using electronic medical records: A systematic review. Int J Med Inform. 2022;162: 104758. pmid:35398812
  40. 40. Kashyap M, Seneviratne M, Banda JM, Falconer T, Ryu B, Yoo S, et al. Development and validation of phenotype classifiers across multiple sites in the observational health data sciences and informatics network. J Am Med Inform Assoc. 2020;27: 877–883. pmid:32374408
  41. 41. Demir E. A Decision Support Tool for Predicting Patients at Risk of Readmission: A Comparison of Classification Trees, Logistic Regression, Generalized Additive Models, and Multivariate Adaptive Regression Splines. Decis Sci. 2014;45: 849–880.
  42. 42. Junqueira ARB, Mirza F, Baig MM. A machine learning model for predicting ICU readmissions and key risk factors: analysis from a longitudinal health records. Health Technol (Berl). 2019;9: 297–309.
  43. 43. Hong WS, Haimovich AD, Taylor RA. Predicting hospital admission at emergency department triage using machine learning. PLoS One. 2018;13: e0201016. pmid:30028888
  44. 44. Sabbatini AK, Kocher KE, Basu A, Hsia RY. In-Hospital Outcomes and Costs Among Patients Hospitalized During a Return Visit to the Emergency Department. JAMA. 2016;315: 663. pmid:26881369
  45. 45. Adams JG. Ensuring the Quality of Quality Metrics for Emergency Care. JAMA. 2016;315: 659. pmid:26881367
  46. 46. Fialho AS, Cismondi F, Vieira SM, Reti SR, Sousa JMC, Finkelstein SN. Data mining using clinical physiology at discharge to predict ICU readmissions. Expert Syst Appl. 2012;39: 13158–13165.
  47. 47. Lorenzana A, Tyagi M, Wang QC, Chawla R, Nigam S. Using text notes from call center data to predict hospitalization. Value Heal. 2016;19: A87.
  48. 48. Jayousi, Rashid; Assaf R. 30-day Hospital Readmission Prediction using MIMIC Data. The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Conference Proceedings. Al-Quds University,Jerusalem,Palestine: The Institute of Electrical and Electronics Engineers, Inc. (IEEE); 2020. pp. 1–6. http://dx.doi.org/10.1109/AICT50176.2020.9368625.
  49. 49. Xue Y, Klabjan D, Luo Y. Predicting ICU readmission using grouped physiological and medication trends. Artif Intell Med. 2019;95: 27–37. pmid:30213670
  50. 50. Feretzakis G, Karlis G, Loupelis E, Kalles D, Chatzikyriakou R, Trakas N, et al. Using Machine Learning Techniques to Predict Hospital Admission at the Emergency Department. J Crit Care Med. 2022;8: 107–116. pmid:35950158
  51. 51. Aphinyanaphongs Y, Liang Y, Theobald J, Grover H, Swartz JL. Models to predict hospital admission from the emergency department through the sole use of the medication administration record. Acad Emerg Med. 2016;23: S116.
  52. 52. Lucini FR, Fogliatto FS, da Silveira GJC, Neyeloff JL, Anzanello MJ, Kuchenbecker R de S, et al. Text mining approach to predict hospital admissions using early medical records from the emergency department. Int J Med Inform. 2017;100: 1–8. pmid:28241931
  53. 53. Curto S, Carvalho JP, Salgado C, Vieira SM, Sousa JMC. Predicting ICU readmissions based on bedside medical text notes. 2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). Piscataway: IEEE; 2016. pp. 2144-a-2151-h. https://doi.org/10.1109/FUZZ-IEEE.2016.7737956
  54. 54. Handly N, Thompson DA, Li J, Chuirazzi DM, Venkat A. Evaluation of a hospital admission prediction model adding coded chief complaint data using neural network methodology. Eur J Emerg Med. 2015;22: 87–91. pmid:24509606
  55. 55. Zhang X, Kim J, Patzer RE, Pitts SR, Patzer A, Schrager JD. Prediction of Emergency Department Hospital Admission Based on Natural Language Processing and Neural Networks. Methods Inf Med. 2017;56: 377–389. pmid:28816338
  56. 56. Hilton CB, Milinovich A, Felix C, Vakharia N, Crone T, Donovan C, et al. Personalized predictions of patient outcomes during and after hospitalization using artificial intelligence. NPJ Digit Med. 2020;3: 1–8. pmid:32285012
  57. 57. Fernandes M, Mendes R, Vieira SM, Leite F, Palos C, Johnson A, et al. Predicting Intensive Care Unit admission among patients presenting to the emergency department using machine learning and natural language processing. Olier I, editor. PLoS One. 2020;15: e0229331. pmid:32126097
  58. 58. Topaz M, Woo K, Ryvicker M, Zolnoori M, Cato K. Home Healthcare Clinical Notes Predict Patient Hospitalization and Emergency Department Visits. Nurs Res. 2020;69: 448–454. pmid:32852359
  59. 59. Boggan JC, Schulteis RD, Simel DL, Lucas JE. Use of a natural language processing algorithm to predict readmissions at a veterans affairs hospital. J Gen Intern Med. 2019;34: S396–S397.
  60. 60. Teo K, Yong CW, Chuah JH, Hum YC, Tee YK, Xia K, et al. Current Trends in Readmission Prediction: An Overview of Approaches. Arab J Sci Eng. pmid:34422543
  61. 61. Li Z, Xing X, Lu B, Zhao Y, Li Z. Early Prediction of 30-Day ICU Re-admissions Using Natural Language Processing and Machine Learning. Biomed Stat Informatics. 2019;4: 22.
  62. 62. Sterling NW, Patzer RE, Di M, Schrager JD. Prediction of emergency department patient disposition based on natural language processing of triage notes. Int J Med Inform. 2019;129: 184–188. pmid:31445253
  63. 63. Velupillai S, Suominen H, Liakata M, Roberts A, Shah AD, Morley K, et al. Using clinical Natural Language Processing for health outcomes research: Overview and actionable suggestions for future advances. J Biomed Inform. 2018;88: 11–19. pmid:30368002
  64. 64. Sheikhalishahi S, Miotto R, Dudley JT, Lavelli A, Rinaldi F, Osmani V. Natural Language Processing of Clinical Notes on Chronic Diseases: Systematic Review. JMIR Med Informatics. 2019;7: e12239. pmid:31066697
  65. 65. Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. J Big Data. 2019;6: 27.
  66. 66. Artetxe A, Graña M, Beristain A, Ríos S. Emergency Department Readmission Risk Prediction: A Case Study in Chile. In: Vicente JMF, AlvarezSanchez JR, Lopez F, Moreo JT, Adeli H, editors. BIOMEDICAL APPLICATIONS BASED ON NATURAL AND ARTIFICIAL COMPUTING, PT II. Vicomtech IK4 Res Ctr, Mikeletegi Pasealekua 57, San Sebastian 20009, Spain; 2017. pp. 11–20. https://doi.org/10.1007/978-3-319-59773-7_2
  67. 67. Tanha J, Abdi Y, Samadi N, Razzaghi N, Asadpour M. Boosting methods for multi-class imbalanced data classification: an experimental review. J Big Data. 2020;7: 70.
  68. 68. Guo X, Yin Y, Dong C, Yang G, Zhou G. On the Class Imbalance Problem. 2008 Fourth International Conference on Natural Computation. IEEE; 2008. pp. 192–201. https://doi.org/10.1109/ICNC.2008.871
  69. 69. He Haibo, Garcia EA. Learning from Imbalanced Data. IEEE Trans Knowl Data Eng. 2009;21: 1263–1284.
  70. 70. Kaur H, Pannu HS, Malhi AK. A Systematic Review on Imbalanced Data Challenges in Machine Learning. ACM Comput Surv. 2020;52: 1–36.
  71. Norgeot B, Quer G, Beaulieu-Jones BK, Torkamani A, Dias R, Gianfrancesco M, et al. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med. 2020;26: 1320–1324. pmid:32908275
  72. Wolpert DH, Macready WG. No free lunch theorems for optimization. IEEE Trans Evol Comput. 1997;1: 67–82.
  73. Zhou Z-H. Ensemble Learning. Encyclopedia of Biometrics. Boston, MA: Springer US; 2009. pp. 270–273. https://doi.org/10.1007/978-0-387-73003-5_293
  74. Zhang Y, Haghani A. A gradient boosting method to improve travel time prediction. Transp Res Part C Emerg Technol. 2015;58: 308–324.
  75. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM; 2016. pp. 785–794. https://doi.org/10.1145/2939672.2939785
  76. Opitz D, Maclin R. Popular Ensemble Methods: An Empirical Study. J Artif Intell Res. 1999;11: 169–198.
  77. Graham B, Bond R, Quinn M, Mulvenna M. Using Data Mining to Predict Hospital Admissions From the Emergency Department. IEEE Access. 2018;6: 10458–10469.
  78. Lo Y-T, Liao JC-H, Chen M-H, Chang C-M, Li C-T. Predictive modeling for 14-day unplanned hospital readmission risk by using machine learning algorithms. BMC Med Inform Decis Mak. 2021;21: 288. pmid:34670553
  79. Futoma J, Morris J, Lucas J. A comparison of models for predicting early hospital readmissions. J Biomed Inform. 2015;56: 229–238. pmid:26044081
  80. Li Q, Yao X, Échevin D. How Good Is Machine Learning in Predicting All-Cause 30-Day Hospital Readmission? Evidence From Administrative Data. Value Health. 2020;23: 1307–1315. pmid:33032774
  81. Barbieri S, Kemp J, Perez-Concha O, Kotwal S, Gallagher M, Ritchie A, et al. Benchmarking Deep Learning Architectures for Predicting Readmission to the ICU and Describing Patients-at-Risk. Sci Rep. 2020;10: 1111. pmid:31980704
  82. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110: 12–22. pmid:30763612
  83. Gravesteijn BY, Nieboer D, Ercole A, Lingsma HF, Nelson D, van Calster B, et al. Machine learning algorithms performed no better than regression models for prognostication in traumatic brain injury. J Clin Epidemiol. 2020;122: 95–107. pmid:32201256
  84. Talwar A, Lopez-Olivo MA, Huang Y, Ying L, Aparasu RR. Performance of advanced machine learning algorithms over logistic regression in predicting hospital readmissions: A meta-analysis. Explor Res Clin Soc Pharm. 2023; 100317. pmid:37662697
  85. Kino S, Hsu Y-T, Shiba K, Chien Y-S, Mita C, Kawachi I, et al. A scoping review on the use of machine learning in research on social determinants of health: Trends and research prospects. SSM Popul Health. 2021;15: 100836. pmid:34169138
  86. Mesgarpour M, Chaussalet T, Chahed S. Ensemble Risk Model of Emergency Admissions (ERMER). Int J Med Inform. 2017;103: 65–77. pmid:28551003
  87. Peck JS, Benneyan JC, Nightingale DJ, Gaehde SA. Predicting emergency department inpatient admissions to improve same-day patient flow. Acad Emerg Med. 2012;19: E1045–54. pmid:22978731
  88. Flaks-Manov N, Shadmi E, Yahalom R, Perry-Mezre H, Balicer R, Srulovici E. Identification of elderly patients at-risk for 30-day readmission: clinical insight beyond big data prediction. J Nurs Manag. 2021. pmid:34661943
  89. Ramspek CL, Jager KJ, Dekker FW, Zoccali C, van Diepen M. External validation of prognostic models: what, why, how, when and where? Clin Kidney J. 2021;14: 49–58. pmid:33564405
  90. Riley RD, Ensor J, Snell KIE, Debray TPA, Altman DG, Moons KGM, et al. External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges. BMJ. 2016; i3140. pmid:27334381
  91. Staartjes VE, Kernbach JM. Significance of external validation in clinical machine learning: let loose too early? Spine J. 2020;20: 1159–1160. pmid:32624150
  92. Altman DG, Vergouwe Y, Royston P, Moons KGM. Prognosis and prognostic research: validating a prognostic model. BMJ. 2009;338: b605. pmid:19477892
  93. Cabitza F, Campagner A, Soares F, García de Guadiana-Romualdo L, Challa F, Sulejmani A, et al. The importance of being external. methodological insights for the external validation of machine learning models in medicine. Comput Methods Programs Biomed. 2021;208: 106288. pmid:34352688
  94. Ryu B, Yoo S, Kim S, Choi J. Development of Prediction Models for Unplanned Hospital Readmission within 30 Days Based on Common Data Model: A Feasibility Study. Methods Inf Med. 2021. pmid:34583416
  95. Rieke N, Hancox J, Li W, Milletarì F, Roth HR, Albarqouni S, et al. The future of digital health with federated learning. npj Digit Med. 2020;3: 119. pmid:33015372
  96. Yang C, Rangarajan A, Ranka S. Global Model Interpretation via Recursive Partitioning. 2018 [cited 27 Dec 2022]. Available: http://arxiv.org/abs/1802.04253.
  97. Kopitar L, Cilar L, Kocbek P, Stiglic G. Local vs. Global Interpretability of Machine Learning Models in Type 2 Diabetes Mellitus Screening. 2019. pp. 108–119.
  98. Du M, Liu N, Hu X. Techniques for interpretable machine learning. Commun ACM. 2019;63: 68–77.
  99. Wu CX, Suresh E, Phng FWL, Tai KP, Pakdeethai J, D’Souza JLA, et al. Effect of a Real-Time Risk Score on 30-day Readmission Reduction in Singapore. Appl Clin Inform. 2021;12: 372–382. pmid:34010978
  100. Maali Y, Perez-Concha O, Coiera E, Roffe D, Day RO, Gallego B. Predicting 7-day, 30-day and 60-day all-cause unplanned readmission: a case study of a Sydney hospital. BMC Med Inform Decis Mak. 2018;18: 1. pmid:29301576
  101. Petch J, Di S, Nelson W. Opening the Black Box: The Promise and Limitations of Explainable Machine Learning in Cardiology. Can J Cardiol. 2022;38: 204–213. pmid:34534619
  102. Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?” Explaining the predictions of any classifier. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery; 2016. pp. 1135–1144.
  103. Sheu Y. Illuminating the Black Box: Interpreting Deep Neural Network Models for Psychiatric Research. Front Psychiatry. 2020;11. pmid:33192663
  104. Barredo Arrieta A, Díaz-Rodríguez N, Del Ser J, Bennetot A, Tabik S, Barbado A, et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf Fusion. 2020;58: 82–115.
  105. McDermott MBA, Wang S, Marinsek N, Ranganath R, Foschini L, Ghassemi M. Reproducibility in machine learning for health research: Still a ways to go. Sci Transl Med. 2021;13. pmid:33762434
  106. Pineau J, Vincent-Lamarre P, Sinha K, Larivière V, Beygelzimer A, d’Alché-Buc F, et al. Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program). J Mach Learn Res. 2020;22: 1–20.
  107. Haibe-Kains B, Adam GA, Hosny A, Khodakarami F, Shraddha T, Kusko R, et al. Transparency and reproducibility in artificial intelligence. Nature. 2020;586: E14–E16. pmid:33057217
  108. Johnson AEW, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3: 160035. pmid:27219127
  109. Harutyunyan H, Khachatrian H, Kale DC, Ver Steeg G, Galstyan A. Multitask learning and benchmarking with clinical time series data. Sci Data. 2019;6: 96. pmid:31209213
  110. Johnson AEW, Ghassemi MM, Nemati S, Niehaus KE, Clifton D, Clifford GD. Machine Learning and Decision Support in Critical Care. Proc IEEE. 2016;104: 444–466. pmid:27765959
  111. Varoquaux G, Cheplygina V. Machine learning for medical imaging: methodological failures and recommendations for the future. npj Digit Med. 2022;5: 48. pmid:35413988
  112. Beam AL, Manrai AK, Ghassemi M. Challenges to the Reproducibility of Machine Learning Models in Health Care. JAMA. 2020;323: 305. pmid:31904799
  113. Lans A, Pierik RJB, Bales JR, Fourman MS, Shin D, Kanbier LN, et al. Quality assessment of machine learning models for diagnostic imaging in orthopaedics: A systematic review. Artif Intell Med. 2022;132: 102396. pmid:36207080
  114. Yusuf M, Atal I, Li J, Smith P, Ravaud P, Fergie M, et al. Reporting quality of studies using machine learning models for medical diagnosis: a systematic review. BMJ Open. 2020;10: e034568. pmid:32205374
  115. Li J, Zhou Z, Dong J, Fu Y, Li Y, Luan Z, et al. Predicting breast cancer 5-year survival using machine learning: A systematic review. Baltzer PAT, editor. PLoS One. 2021;16: e0250370. pmid:33861809
  116. Collins GS, Dhiman P, Andaur Navarro CL, Ma J, Hooft L, Reitsma JB, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021;11: e048008. pmid:34244270