
Performance of predictive AI-based clinical decision support systems across clinical domains: A systematic review and meta-analysis

  • William J. Waldock,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Writing – original draft, Writing – review & editing

    Affiliation Institute of Global Health Innovation, Imperial College London, London, United Kingdom

  • Ahmad Guni,

    Roles Data curation, Formal analysis, Writing – review & editing

    Affiliation Institute of Global Health Innovation, Imperial College London, London, United Kingdom

  • Ara Darzi,

    Roles Conceptualization, Supervision, Writing – review & editing

    Affiliation Institute of Global Health Innovation, Imperial College London, London, United Kingdom

  • Hutan Ashrafian

    Roles Conceptualization, Data curation, Formal analysis, Writing – original draft, Writing – review & editing

    hutan@ic.ac.uk

    Affiliation Institute of Global Health Innovation, Imperial College London, London, United Kingdom

Abstract

Despite advances in deep learning and transformer architectures, prior reviews have focused narrowly on traditional clinical decision support systems (CDSS) or single medical domains, leaving significant gaps in understanding contemporary AI-driven predictive tools. This systematic review and meta-analysis evaluated the predictive performance of artificial intelligence-based CDSS (AI-CDSS) across multiple medical specialties. Following PRISMA guidelines, PubMed and Cochrane Library were searched through December 2024 for studies evaluating predictive AI-CDSS using real-world clinical data. Two reviewers independently screened 3,296 records (κ = 0.833), with study quality assessed via QUADAS-2 and performance measures pooled using random-effects meta-analysis. Fifty studies spanning 17 medical specialties were included. Meta-analysis demonstrated moderate discriminatory ability (pooled AUC: 0.652, 95% CI: 0.562–0.743), high specificity (0.819, 95% CI: 0.793–0.844), moderate accuracy (0.765, 95% CI: 0.734–0.796), and variable sensitivity (0.660, 95% CI: 0.535–0.785), with substantial heterogeneity across all measures (I² ≥ 98.9%). Only 24% of studies involved prospective deployment, and 64% reported exclusively technical metrics without clinical workflow data. Predictive AI-CDSS demonstrate moderate-to-good diagnostic performance with strong specificity; however, the predominance of retrospective study designs and limited implementation reporting reveal critical gaps between technical validation and real-world clinical utility. To address these shortcomings, we propose the ROADMAP framework, structured around seven domains: Representative development, Outcomes-focused evaluation, Assessment for deployment, Data harmonization, Monitoring for bias, Allocation via economic evaluations, and Priorities for standardized reporting and prospective validation. 
This framework provides a practical roadmap for bridging the gap between algorithmic performance and meaningful clinical integration.

Author summary

In our study, we set out to understand how well modern Artificial Intelligence (AI) assists doctors in making clinical decisions across a wide range of medical specialties. While AI technology has advanced rapidly, we realized that previous research was often too narrow or outdated to show the full picture of these modern predictive tools.

After reviewing 50 studies covering 17 different medical fields, we found that current AI tools demonstrate moderate to good accuracy. They are particularly effective at correctly identifying when a patient does not have a condition (high specificity). However, they are less consistent at catching every positive case, and their performance varies significantly depending on the setting.

Crucially, we identified a major gap between technical success and real-world usefulness. Most studies tested AI on historical data rather than in live hospital environments, often ignoring how these tools fit into a doctor’s actual workflow.

To address this, we developed the ROADMAP framework while applying our findings to a case study of antimicrobial resistance. This seven-step guide outlines how researchers can move beyond simple math scores to create AI tools that are representative, fair, economically viable, and proven to work in actual patient care scenarios.

Background

Artificial Intelligence (AI), encompassing a wide spectrum of computational approaches, has transformed clinical decision-making in healthcare by enhancing predictive capabilities and enabling precise, data-driven interventions [1]. The landscape of clinical decision support has evolved rapidly with recent advances in AI methodologies, particularly deep learning architectures, convolutional neural networks, and transformer-based models. These modern AI techniques have demonstrated unprecedented capacity to identify complex patterns in clinical data, generating predictions that inform diagnostic, prognostic, and therapeutic decisions.

Despite the proliferation of AI-CDSS development and a growing body of literature reporting technically promising results but limited evidence of real-world impact, several critical knowledge gaps persist. First, previous systematic reviews have primarily focused on traditional CDSS or have been limited to single clinical domains such as cardiology, oncology, or radiology [2–4], lacking comprehensive synthesis across the diverse landscape of AI-driven predictive tools. Second, the rapid evolution of AI methodologies, particularly the emergence of deep learning and transformer-based architectures in recent years [5], necessitates contemporary evaluation that reflects current technological capabilities. Third, significant heterogeneity exists in how AI-CDSS performance is evaluated and reported, with inconsistent use of metrics and validation approaches across studies [6]. This variability complicates efforts to assess the true clinical utility of these tools and compare performance across different systems and clinical contexts.

Furthermore, concerns regarding explainability, clinical integration, and liability have been identified as barriers to frontline clinical adoption [7]. While AI models may achieve high technical performance in controlled settings, questions remain about their real-world effectiveness, including how well they integrate into clinical workflows, whether clinicians trust and adopt their recommendations, and whether they ultimately improve patient outcomes. The gap between technical validation and clinical implementation represents a critical consideration for the field.

The need for rigorous, comprehensive evaluation of AI-CDSS spans multiple clinical domains. While specific applications, such as antimicrobial stewardship, where prescription surveillance has been implemented in Australia [8], Japan [9], and Africa [10], demonstrate the potential impact of decision support tools, the broader landscape of predictive AI-CDSS warrants systematic examination. A previous systematic review examining traditional Clinical Decision Support Systems and their role in antibiotic stewardship [11] found that CDSS interventions significantly improved outcomes relevant to antibiotic prescribing, with both active and passive systems contributing to more appropriate antibiotic use and improved patient outcomes. However, this work focused specifically on antibiotic stewardship and did not comprehensively evaluate modern AI-driven predictive models across diverse clinical domains. Unlike domain-specific reviews of AI-CDSS in nursing, psychiatry, obstetrics, and oncology, our review delivers the first multi-domain synthesis of predictive AI-CDSS performance across more than 15 specialties using standardised metrics, explicitly excluding rule-based systems and non-predictive tools examined in prior work. We uniquely quantify the critical implementation gap, demonstrating that while AI-CDSS show strong technical performance, evidence for clinical integration remains severely limited.

Objective

This study aims to systematically evaluate the predictive performance of AI-based clinical decision support systems (AI-CDSS) across a broad range of medical domains. By synthesizing evidence on diagnostic accuracy, prognostic capability, and, where reported, clinical implementation metrics, we seek to:

  1. Quantify the pooled diagnostic performance of predictive AI-CDSS using standardized metrics (sensitivity, specificity, accuracy, and AUROC)
  2. Assess the methodological quality and risk of bias in AI-CDSS evaluation studies
  3. Identify sources of heterogeneity in reported performance across clinical domains, patient populations, and AI methodologies
  4. Inform the potential and limitations of predictive AI-CDSS for integration into routine clinical practice across diverse healthcare settings

This multi-domain synthesis provides a foundation for understanding the current state of predictive AI-CDSS development and validation, highlighting both the promise of these technologies and the critical needs for standardized evaluation methodologies and real-world clinical assessment.

Results

Study selection and screening process

This systematic review adheres to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) [12] guidelines to ensure transparency and methodological rigor. The study selection process is summarized in the PRISMA flow diagram (Fig 1), and the completed PRISMA checklist is provided as supplementary material (S1 File). A total of 3,296 records were identified through database searches (PubMed: 2,960; Cochrane: 336). After removing 2,824 records due to duplication or clear ineligibility, 472 abstracts were screened in detail. Of these, 60 studies were retrieved for full-text assessment, and 50 met the inclusion criteria and were included in the final analysis (Tables 1 and 2).

Ten studies were excluded at the full-text stage for the following reasons: inappropriate or irrelevant outcome measures (n = 5), absence of a true AI intervention (n = 1), or ineligible study design such as simulation studies or narrative reviews (n = 4). No studies were classified as ongoing or awaiting assessment.

Reviewer agreement and inter-rater reliability

The screening process demonstrated high inter-reviewer agreement, confirming the robustness and reliability of study selection: title screening: 3,146 of 3,296 records agreed upon (κ = 0.911, Cohen’s kappa statistic); abstract screening: 429 of 472 records agreed upon (κ = 0.818); full-text screening: 55 of 60 records agreed upon (κ = 0.833). These kappa statistics indicate excellent agreement beyond chance, reflecting consistent application of inclusion criteria.
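As an illustration of how Cohen's kappa relates to the raw agreement counts above, the following minimal sketch computes observed agreement from the reported counts and kappa from a 2×2 table of include/exclude decisions. The full decision tables are not reported here, so the cell counts passed to `cohen_kappa` are hypothetical and serve only to show the calculation; the function name is likewise an assumption.

```python
def cohen_kappa(both_include, only_a, only_b, both_exclude):
    """Cohen's kappa for two reviewers making include/exclude decisions.

    Arguments are the four cells of the 2x2 agreement table.
    Raises ZeroDivisionError if chance agreement equals 1 (not handled here).
    """
    n = both_include + only_a + only_b + both_exclude
    observed = (both_include + both_exclude) / n
    a_include = (both_include + only_a) / n   # reviewer A's inclusion rate
    b_include = (both_include + only_b) / n   # reviewer B's inclusion rate
    # Agreement expected by chance, given each reviewer's marginal rates
    expected = a_include * b_include + (1 - a_include) * (1 - b_include)
    return (observed - expected) / (1 - expected)


# Observed (raw) agreement from the counts reported in the text
reported = [("title", 3146, 3296), ("abstract", 429, 472), ("full text", 55, 60)]
for stage, agreed, total in reported:
    print(f"{stage}: observed agreement = {agreed / total:.3f}")

# Hypothetical 2x2 table (NOT from the review), for illustration only
print(f"kappa (hypothetical table) = {cohen_kappa(40, 5, 5, 50):.3f}")
```

Kappa depends on each reviewer's marginal inclusion rate as well as on raw agreement, which is why the agreement counts alone do not determine it.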

Characteristics of included studies

Specialty distribution

The 50 included studies encompassed a diverse range of medical specialties, reflecting the broad applicability of predictive AI-based CDSS across clinical domains. The distribution was as follows: Infectious Diseases (n = 14), Cardiology (n = 8), Emergency Medicine (n = 5), Oncology (n = 4), Gastroenterology (n = 2), Obstetrics and Gynaecology (n = 2), Orthopaedics (n = 2), Otolaryngology (n = 2), Paediatrics (n = 2), Renal (n = 2), and one study each in Dermatology, Endocrinology, Haematology, Plastic Surgery, Respiratory, Urology, and Vascular specialties.

This distribution indicates that research on predictive AI-CDSS spans multiple clinical fields, with concentration in areas characterized by high data availability, complex decision-making requirements, and urgent clinical needs.

Geographic distribution

Studies were conducted across multiple countries, demonstrating global interest in AI-CDSS development and validation: USA (n = 16), China (n = 10), South Korea (n = 3), Israel (n = 3), Canada (n = 2), Australia (n = 2), Sweden (n = 2), Greece (n = 2), United Kingdom (n = 2), and one study each from Colombia, Germany, Hong Kong, Ireland, Italy, Japan, Thailand, and Turkey.

AI tool characteristics (Table 3)

The AI models evaluated ranged from traditional machine learning algorithms (logistic regression, random forests, gradient boosting machines, support vector machines) to complex deep learning approaches (neural networks, convolutional neural networks for imaging, recurrent neural networks). Approximately one-third of studies involved imaging data (e.g., radiology or pathology images interpreted by AI), another third focused on electronic health record (EHR) tabular data or vital signs, and the remainder used other data types including genomic data, clinical text, and multimodal inputs.

Table 3. Summary of Studies Using AI Models and Explainability Tools.

https://doi.org/10.1371/journal.pdig.0001310.t003

Clinical implementation and deployment context

Of the 50 included studies, the majority (76%) evaluated AI-CDSS using retrospective clinical datasets, while 12 studies (24%) involved prospective clinical deployment or real-time implementation in healthcare settings.

Among studies reporting clinical implementation details (n = 18), the following metrics were documented:

  • Workflow integration: 11 studies described integration approaches, including EHR embedding (n = 7), standalone dashboard systems (n = 3), and mobile applications (n = 1).
  • Clinician adoption/usage rates: reported in 6 studies, ranging from 45% to 89% of eligible cases.
  • Alert response metrics: 4 studies documented alert override rates (range: 12%–38%).
  • Time-to-decision: 3 studies reported decision time improvements (reductions of 2.3–15.7 minutes).
  • Clinical outcome measures: 8 studies assessed impacts on patient outcomes, including mortality (n = 4), length of stay (n = 3), and diagnostic accuracy in practice (n = 5).

However, the majority of studies (n = 32, 64%) focused exclusively on technical performance metrics without reporting clinical workflow integration or adoption data, representing a significant gap between ML model development and clinical utility assessment.

Meta-analysis overview

The meta-analysis incorporated 58 outcome measurements from the 50 included studies (some studies contributed multiple outcomes) assessing the diagnostic performance of predictive AI-based clinical decision support tools. A random-effects inverse variance model using DerSimonian-Laird estimates for between-study variance (τ²) was employed.
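For readers unfamiliar with the pooling method, a minimal sketch of a DerSimonian-Laird random-effects inverse-variance model is given below. It is a generic textbook implementation and not the analysis code used in this review; the function name, the 1.96 critical value, and the two-study example inputs are assumptions chosen for illustration.

```python
import math


def dersimonian_laird(effects, variances, z_crit=1.96):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator.

    effects   -- per-study estimates (e.g. AUC or sensitivity)
    variances -- per-study within-study variances (squared standard errors)
    """
    k = len(effects)
    w = [1.0 / v for v in variances]                 # fixed-effect weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Between-study variance, truncated at zero
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]   # random-effects weights
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)
    return {
        "pooled": pooled,
        "se": se,
        "ci": (pooled - z_crit * se, pooled + z_crit * se),
        "tau2": tau2,
        "q": q,
    }


# Hypothetical two-study example (values are illustrative, not from the review)
result = dersimonian_laird([0.6, 0.8], [0.01, 0.01])
print(result)
```

Because τ² is added to every study's variance, large between-study variance pulls the weights towards equality and widens the pooled confidence interval.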

Diagnostic performance measures

The pooled analysis yielded the following performance metrics: Area Under the Curve (AUC): 0.652 (95% CI: 0.562–0.743), based on 58 outcome measurements; z = 14.162, p < 0.001; Specificity: 0.819 (95% CI: 0.793–0.844), based on 34 studies; Sensitivity: 0.660 (95% CI: 0.535–0.785), based on 40 studies; Accuracy: 0.765 (95% CI: 0.734–0.796), based on 39 studies. These results are visualized in forest plots (Figs 3–6).
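The relationship between a pooled estimate, its confidence interval, and the z statistic can be checked with a short calculation. The sketch below assumes a symmetric normal-approximation interval with a critical value of 1.96; the helper name is hypothetical.

```python
def se_and_z_from_ci(estimate, lower, upper, z_crit=1.96):
    """Recover the standard error and z statistic from a symmetric 95% CI."""
    se = (upper - lower) / (2 * z_crit)
    return se, estimate / se


# Pooled AUC and its 95% CI as reported above
se, z = se_and_z_from_ci(0.652, 0.562, 0.743)
print(f"SE = {se:.4f}, z = {z:.2f}")
```

Under these assumptions the result is close to the reported z = 14.162; the small difference is what would be expected from rounding of the published interval bounds.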

Heterogeneity and variability across studies

Substantial heterogeneity was observed across all performance metrics, confirming extreme variability unlikely to be due to chance alone: AUC Heterogeneity: Cochran’s Q = 1.2 × 10⁵; H = 46.414; I² = 100%; τ² = 0.1227; Specificity Heterogeneity: Q = 7,107.85; H = 14.676; I² = 99.5%; τ² = 0.0054; Sensitivity Heterogeneity: Q = 2.7 × 10⁵; H = 83.015; I² = 100%; τ² = 0.1624; Accuracy Heterogeneity: Q = 3,354.90; H = 9.396; I² = 98.9%; τ² = 0.0092.
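The H and I² statistics above follow directly from Cochran's Q and the degrees of freedom (number of studies minus one). The sketch below reproduces the specificity and accuracy values, assuming the study counts reported for each metric (34 and 39 respectively); the function name is hypothetical.

```python
import math


def h_and_i2(q, k):
    """Return (H, I^2) from Cochran's Q and the number of studies k."""
    df = k - 1
    h = math.sqrt(q / df)
    i2 = max(0.0, (q - df) / q)   # proportion of variability beyond chance
    return h, i2


for label, q, k in [("Specificity", 7107.85, 34), ("Accuracy", 3354.90, 39)]:
    h, i2 = h_and_i2(q, k)
    print(f"{label}: H = {h:.3f}, I2 = {100 * i2:.1f}%")
```

Because I² = (Q − df)/Q, very large Q values drive I² towards 100% regardless of how many studies contribute, so τ² is the more informative measure of the absolute spread between studies.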

Explainability and model transparency

Among the 50 studies reviewed, 13 (26%) incorporated explainability tools to enhance model transparency. Specifically, 12 studies used SHAP (Shapley Additive Explanations), and 1 study used both SHAP and LIME (Local Interpretable Model-Agnostic Explanations). These methods were applied post hoc and did not influence the underlying predictive performance. Our analysis found no consistent differences in AUROC or accuracy between models with or without reported use of explainability tools, as these techniques are model-agnostic and do not modify algorithm outputs. Explainability tools improved model transparency and aided in identifying potential sources of bias, but transparency is not the same as safety: SHAP and LIME alone neither mitigate bias nor guarantee generalisability.

Risk of Bias Assessment (QUADAS-2)

The QUADAS-2 tool (Quality Assessment of Diagnostic Accuracy Studies) [13] was employed to evaluate the methodological quality and risk of bias across included studies. Fig 2 presents a summary of QUADAS-2 risk of bias assessments for all studies. Methodological quality was heterogeneous, with most studies demonstrating at least one domain rated as high or unclear risk of bias.

Regarding patient selection, approximately 85% of studies were rated as low risk, employing consecutive patient recruitment or random sampling from existing databases to ensure representative samples. The remaining 15% exhibited unclear or high risk attributable to case-control designs or convenience sampling methodologies. For the index test domain, most studies provided comprehensive descriptions of their AI algorithms and implementation protocols. However, approximately 20% demonstrated unclear risk due to insufficient documentation of model training procedures, validation protocols, or threshold selection criteria. Concerning reference standards, the majority of studies utilized appropriate gold standards, including laboratory-confirmed diagnoses, expert consensus determinations, or validated clinical outcomes. Approximately 15% showed unclear or high risk resulting from suboptimal reference standards or inadequate blinding procedures. In the flow and timing domain, approximately 70% of studies achieved low risk ratings, with complete or near-complete patient accounting and consistent application of reference standards. The remaining 30% exhibited concerns including substantial attrition, exclusion of indeterminate results, or differential assessment timing, factors potentially introducing bias if inadequately addressed. Overall applicability was satisfactory regarding patient populations and index tests in relation to the review objectives. Given this review’s intentionally broad, multi-domain scope, most studies were deemed applicable to the research question.

The AI-specific bias evaluation revealed several critical limitations. External validation was conducted in only 32% of studies (16/50). Explainability methods, such as SHAP or LIME, were implemented in 28% of studies (14/50). Algorithmic bias was explicitly assessed in merely 4% of studies (2/50; Bolton 2024, Du 2022). Prospective validation was performed in 8% of studies (4/50).

An important distinction exists between domain-specific and global bias assessment. Re-analysis revealed that only 15 studies (30%) demonstrated low risk of bias across all four QUADAS-2 domains (patient selection, index test, reference standard, and flow/timing). While approximately 85% achieved low risk in patient selection alone, this metric does not reflect overall methodological quality. High-risk AI-specific concerns were prevalent: approximately 70% of studies (35/50) lacked external validation, 30% (15/50) demonstrated flow and timing concerns, and 20% (10/50) provided insufficient detail regarding model training or threshold determination.

A substantial gap exists between technical validation and real-world implementation evidence. Technical validation alone characterised 92% of studies (46/50), while only 8% (4/50) reported implementation outcomes. Critical metrics remained largely unreported: no studies documented adoption rates, one study reported alert override rates, no studies measured time-to-decision, and only two studies assessed patient outcomes beyond diagnostic accuracy.

Interpretation and implications

The pooled analysis demonstrates technically promising performance, particularly regarding specificity (0.819) and accuracy (0.765), but limited evidence of real-world impact. Sensitivity (0.660) and discriminatory ability as measured by AUC (0.652) were more moderate. The substantial heterogeneity across all metrics (I² ≥ 98.9% for all measures) underscores considerable variability in study populations, clinical tasks, AI methodologies, and evaluation approaches. The limited reporting of clinical implementation metrics in the majority of studies highlights a critical gap between technical performance validation and real-world clinical utility assessment. While ML performance metrics provide essential information about model accuracy, they do not fully capture whether AI-CDSS tools integrate effectively into clinical workflows, achieve clinician adoption, or improve patient outcomes in practice. These findings emphasize the need for standardized evaluation methodologies, more comprehensive reporting of both technical and clinical metrics, and further validation in diverse clinical environments to enhance the generalizability and practical utility of predictive AI-CDSS tools.

Discussion

Principal findings

This systematic review and meta-analysis of 50 studies across 17 medical specialties revealed moderate-to-good predictive performance of AI-based clinical decision support systems, with notable variability across performance metrics and substantial methodological heterogeneity. Specificity was notably high at 0.819 (95% CI: 0.793–0.844), indicating strong potential for accurately identifying true negatives and reducing unnecessary interventions. Accuracy was robust at 0.765 (95% CI: 0.734–0.796), highlighting overall reliability in practical applications. However, sensitivity was more moderate at 0.660 (95% CI: 0.535–0.785), suggesting limitations in identifying true positives and raising concerns about missed diagnoses. The pooled AUC of 0.652 (95% CI: 0.562–0.743) demonstrated moderate discriminatory ability across diverse clinical contexts. Our analysis prioritised clinically meaningful discrimination metrics: primary outcomes included sensitivity, specificity, and AUROC (reported for all studies), with secondary outcomes of PPV and NPV where available. Critically, the absence of calibration assessments and prediction-decision curves in most studies represents a significant evaluation gap. While discrimination metrics reveal whether models distinguish between outcome classes, they do not assess whether predicted probabilities align with actual outcome frequencies. Such alignment is essential for clinical decision-making, particularly for rare but high-consequence events where miscalibration can have catastrophic implications. We strongly advocate that future AI-CDSS evaluations adopt calibration curves and prediction-decision analyses as mandatory reporting elements, alongside explicit acknowledgment that accuracy alone may be misleading in imbalanced datasets. These tools are indispensable for evaluating whether AI systems provide actionable, reliable probability estimates that support clinical judgment, especially in scenarios where false negatives carry severe consequences.
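Two of the points above, that accuracy can mislead on imbalanced data and that calibration asks whether predicted probabilities match observed frequencies, can be made concrete with a minimal sketch. All data below are hypothetical, and the helper names (`brier_score`, `calibration_table`) are illustrative and not drawn from any included study.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome (0/1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def calibration_table(probs, outcomes, n_bins=5):
    """Group predictions into equal-width probability bins.

    Returns one (count, mean predicted probability, observed event rate)
    tuple per non-empty bin; a well-calibrated model has the last two close.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            rows.append((len(members), mean_pred, observed))
    return rows


# Hypothetical imbalanced cohort: 5 events among 100 patients
outcomes = [1] * 5 + [0] * 95
always_negative = [0] * 100   # a "model" that never flags anyone
accuracy = sum(p == y for p, y in zip(always_negative, outcomes)) / len(outcomes)
sensitivity = sum(p == 1 and y == 1 for p, y in zip(always_negative, outcomes)) / sum(outcomes)
print(f"accuracy = {accuracy:.2f}, sensitivity = {sensitivity:.2f}")
```

In this hypothetical cohort the trivial model scores high accuracy while detecting no events, which is the failure mode that calibration and decision-analytic reporting are intended to expose.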

The substantial heterogeneity observed across all performance metrics reflects the diversity of clinical tasks, patient populations, AI methodologies, and evaluation approaches encompassed in this review. This variability underscores that AI-CDSS performance is highly context-dependent and cannot be easily generalized across clinical domains without careful consideration of task-specific characteristics and implementation settings. Our meta-analysis employed the DerSimonian-Laird random-effects model, which may underestimate between-study variance under conditions of extreme heterogeneity. While alternative estimators such as restricted maximum likelihood (REML) or Paule-Mandel, potentially with Hartung-Knapp adjustments, might provide more conservative confidence intervals, we maintained the DL approach for consistency with established systematic review methodology. The substantial heterogeneity observed (I² > 90%) reflects genuine diversity in clinical domains, AI methodologies, and patient populations rather than methodological limitations, reinforcing the need for domain-specific validation of AI-CDSS.

Gap between technical performance and clinical implementation

A critical finding is the marked gap between technical validation and real-world clinical implementation. While 76% of studies evaluated AI-CDSS using retrospective datasets, only 24% involved prospective deployment. Furthermore, 64% reported exclusively on technical metrics (sensitivity, specificity, accuracy, AUROC) without documenting workflow integration, clinician adoption, or patient outcomes. Among 18 studies (36%) reporting implementation details, approaches varied considerably, including EHR embedding, standalone dashboards, and mobile applications. However, inconsistent reporting limits assessment of clinical utility beyond technical performance. This gap highlights a fundamental challenge: demonstrating high predictive accuracy in controlled settings does not guarantee successful clinical integration or improved outcomes. Transition from development to deployment requires attention to human factors, workflow compatibility, and organizational readiness, dimensions that remain understudied.

A key methodological issue is inconsistency in performance metric reporting [14,15]. Terms like AUC and AUROC were used interchangeably, often without clear definitions [16]. Some studies differentiated AUROC from AUC-PR, offering better understanding of model performance in imbalanced datasets [17,18]. Others used “AUC” ambiguously, complicating cross-study comparisons [19]. While sensitivity and specificity were generally reported consistently, they were calculated at varying probability thresholds, making direct comparisons problematic [20,21]. Threshold choice has substantial clinical implications; optimizing for high sensitivity versus specificity represents different priorities depending on context [22,23]. Predictive values (PPV and NPV), crucial for clinical decision-making, were seldom reported despite their direct clinical relevance [24,25]. This lack of standardization undermines evidence synthesis and limits generalizability [26,27]. The field would benefit from consensus guidelines on performance metric reporting, similar to TRIPOD-AI or CONSORT-AI initiatives [28,29].
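To illustrate why predictive values matter alongside sensitivity and specificity, the sketch below applies Bayes' rule to the pooled sensitivity (0.660) and specificity (0.819) at several prevalence levels. The prevalence values are hypothetical, and pairing the two pooled estimates is itself a simplification, since they were derived from different sets of studies.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Pooled estimates from this review; prevalence values are illustrative only
for prevalence in (0.05, 0.20, 0.50):
    ppv, npv = predictive_values(0.660, 0.819, prevalence)
    print(f"prevalence = {prevalence:.2f}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

The same sensitivity and specificity yield very different positive predictive values as prevalence changes, which is one reason threshold-dependent metrics are hard to compare across clinical settings.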

Explainability and trust in AI-CDSS

Explainability emerged as critical for clinician trust and adoption, yet remains inadequately addressed [30,31]. Complex models, particularly deep neural networks and ensemble methods, often function as “black boxes” [32,33]. This opacity leads to clinician hesitation, especially when AI outputs conflict with clinical judgment [34,35]. Lack of explainability raises legal and ethical concerns, as clinicians must justify decisions to patients, colleagues, and in medico-legal contexts [36,37]. The question of accountability remains unresolved in most healthcare systems [38,39]. Explainability techniques such as LIME and SHAP were employed in 26% of studies [40,41]. These methods identify which features most influenced model outputs [42,43]. However, explainability tools were applied retrospectively and did not consistently improve performance or adoption [44,45]. There is an important distinction between model interpretability and prediction explainability; both are needed for full integration, yet most studies addressed only the latter [46,47]. The tension between model complexity and interpretability remains fundamental [48,49]. Simpler models offer inherent interpretability but may sacrifice accuracy compared to deep learning [50,51]. Future research must balance predictive performance with explainability, potentially through hybrid models or interpretable-by-design architectures [52,53].

Bias, fairness, and ethical concerns in AI-CDSS

AI models inherit biases from training data, potentially perpetuating healthcare disparities [54,55]. Development must prioritize diverse, representative datasets ensuring adequate representation across race, ethnicity, sex, age, socioeconomic status, and geography [56,57]. However, diverse data alone is insufficient; developers must assess performance across demographic subgroups, implement fairness-aware algorithms, and monitor deployed models for bias [58,59]. Regulatory frameworks emphasize these requirements, though standardized approaches remain underdeveloped [60,61].

Ethical implications extend beyond technical fairness to consent, transparency, and patient autonomy [62,63]. Patients should be informed when AI contributes to their care and have mechanisms to understand or contest AI-influenced decisions [64,65]. Current practice rarely includes such transparency measures [66,67].

This review represents one of the first comprehensive multi-domain assessments of predictive AI-CDSS, addressing a critical literature gap [68,69]. Previous reviews focused on rule-based CDSS or single clinical domains [70,71]. By synthesizing evidence across 17 specialties, this review reveals patterns transcending individual contexts [72,73].

ROADMAP framework for AI-driven clinical decision support systems in antimicrobial resistance management

We propose the ROADMAP framework to synthesise evidence-based principles for developing, implementing, and evaluating AI-driven CDSS addressing antimicrobial resistance challenges [74,75].

Representative development principles

Development requires representative training datasets reflecting target population diversity [76–81].

Outcomes and patient-centred evaluation

Evaluation must expand beyond technical performance to patient-centered outcomes including quality of life, satisfaction, treatment burden, and health equity impacts [82,83].

Assessment requirements for clinical deployment

AI-CDSS applications demonstrate potential through pathogen resistance profiling [84], contact tracing [85], and predicting Gram-negative bacterial resistance [86]. Digital health tools reveal disparities between high and low-income countries [87], while intrinsic resistance mechanisms contribute significantly to mortality [88]. Infection risk modeling incorporates vital signs and laboratory results [89], with applications in sepsis prediction [90], diagnostics and drug discovery [91], battlefield medicine [92], and sociotechnical frameworks [93]. Adaptive learning systems provide implementation frameworks [94], while AI enhances biosafety protocols and outbreak management [95]. Prediction models for decompensated cirrhosis [96], nosocomial infections [97], and resistant Enterobacterales [98] demonstrate utility. Genomic surveillance informs vaccine development [99], though antibiotic therapy thresholds vary considerably [100]. Context-specific evaluation is essential given performance heterogeneity [101,102]. The COMBACTE-Magnet EPI-Net COACH project assembled evidence for surveillance systems [103], while European surveillance programs publish annual reports [104]. Machine learning for IV-to-oral antibiotic switches faces workflow integration and trust challenges [105]. Case-based reasoning systems demonstrate enhanced prescribing appropriateness [106], while microbiome analysis separates biological signals for AMR surveillance [107].

Data harmonization and global surveillance

Critical gaps include limited understanding of environmental AMR levels, unclear high-risk transmission definitions, and insufficient knowledge of concentrations driving resistance [108]. These gaps are compounded by training biases, inequitable access, and standardization needs [109,110]. Global sewage analysis accommodates regional AMR diversity [111], while clinical microbiologists’ collaboration requires international consensus [112]. The TSARA trial generates actionable data on resistance and prescriptions for low-resource settings [113].

Monitoring challenges in AMR surveillance

Ongoing monitoring addresses performance degradation, bias emergence, and evolving clinical utility [114,115]. Successful implementation encompasses workflow integration, training programs, and organizational readiness [116,117].

Allocation of resources and economic sustainability

Rigorous health economic evaluations (cost-effectiveness analyses, budget impact assessments, value-based implementation modelling) inform resource allocation and guide sustainable integration within resource-constrained systems, fundamental to long-term viability [118,119].

Priorities for targeted research

Critical gaps include standardized reporting frameworks, prospective validation studies in diverse settings, and implementation science frameworks elucidating determinants of AI-CDSS adoption [120,121].

Limitations

Several limitations warrant acknowledgment. First, our focus on predictive AI-CDSS excludes knowledge graphs, natural language generation, and conversational agents [122,123]. Second, the predominance of retrospective evaluations (76%) over prospective deployments introduces potential performance bias [124,125], because retrospective performance often represents an upper bound for prospective deployment [126]. Third, limited reporting of implementation metrics prevented comprehensive assessment of real-world utility [127,128]. Fourth, QUADAS-2, while providing valuable quality assessment, was not designed for AI-driven diagnostic tools and may not capture AI-specific biases such as overfitting, poor generalizability, or data leakage [13,129]. The upcoming QUADAS-AI tool [130] is intended to standardize assessment in systematic reviews [131]. Finally, our search was limited to PubMed and Cochrane Library [132]. Although these provide extensive coverage, relevant studies in computer science venues or preprint servers may have been missed [133].

Materials and methods

This systematic review adheres to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) [12] guidelines to ensure transparency and methodological rigor.

Rationale and scope

The landscape of clinical decision support has been transformed by recent advances in artificial intelligence, particularly deep learning and transformer-based architectures. While earlier systematic reviews examined traditional clinical decision support systems, the rapid evolution of AI methodologies, including convolutional neural networks, recurrent neural networks, and attention mechanisms, has created a knowledge gap requiring contemporary synthesis. Previous reviews have not comprehensively evaluated AI-based CDSS performance across multiple clinical domains using modern predictive modelling approaches, nor have they systematically assessed the standardization of performance metrics in this rapidly evolving field.

This review focuses specifically on predictive AI-based clinical decision support systems: tools that use machine learning or deep learning to generate individualized predictions or risk assessments to inform clinical decisions. This represents the largest and fastest-growing segment of AI-CDSS applications and merits focused analysis given its clinical prevalence. This operational definition intentionally excludes other AI-CDSS types such as knowledge graphs, natural language generation systems, and conversational agents, which would require different methodological approaches; their exclusion is noted as a limitation of this review.

Definition and identification of CDSS studies

To ensure consistency and reproducibility in study selection, we established explicit operational definitions. A Clinical Decision Support System (CDSS) was defined as a health information technology system designed to assist clinicians in making decisions by providing individualized, actionable recommendations or predictions based on clinical data inputs.

An AI-based predictive CDSS was defined as a digital clinical decision support tool that utilizes artificial intelligence techniques, including machine learning (ML), deep learning (DL), or natural language processing (NLP), to derive predictions or risk assessments from clinical data. This focus on predictive modeling reflects the dominant paradigm in current AI-CDSS research and clinical implementation.

Tools were excluded if they met any of the following criteria: (1) use of rule-based logic without learning algorithms (e.g., IF-THEN statements); (2) being non-digital (e.g., paper-based algorithms); (3) functioning solely as descriptive analytics tools without providing individualized outputs for clinical decision-making; or (4) representing non-predictive AI-CDSS types such as knowledge retrieval systems, natural language generation tools, or conversational agents.

Search strategy

A systematic literature search was conducted across major biomedical databases, covering studies published up to December 6, 2024. We searched PubMed and the Cochrane Library, which were selected based on their comprehensive coverage of peer-reviewed medical literature and their established role as primary sources for clinical evidence synthesis. PubMed provides extensive indexing of biomedical journals with robust MeSH term capabilities, while Cochrane captures high-quality systematic reviews and controlled trials. For our research question focused on clinical decision support systems in healthcare settings, these databases offer comprehensive coverage of the target literature. While EMBASE provides additional European coverage, preliminary scoping indicated substantial overlap with PubMed for our inclusion criteria. IEEE Xplore, while valuable for computer science perspectives, primarily indexes technical implementations rather than clinical evaluations, which formed the core of our inclusion criteria.

The search strategy employed combinations of terms including “Clinical Decision Support System,” “CDSS,” “Artificial Intelligence,” “Machine Learning,” “Deep Learning,” “Predictive model,” and associated Medical Subject Headings (MeSH) terms. Results were supplemented with grey literature and clinical trial registries to minimize publication bias.

Eligibility criteria

Studies were included if they met all of the following criteria:

  1. Tool Characteristics: The study evaluated an AI-based Clinical Decision Support System (CDSS) that used machine learning, deep learning, or related AI methods to generate individualized clinical predictions or recommendations.
  2. Evaluation Focus: The study assessed predictive performance using standard metrics (accuracy, sensitivity, specificity, or AUROC).
  3. Clinical Context: The CDSS was evaluated using real-world clinical datasets or implemented in actual healthcare settings. This included both systems deployed in clinical practice and systems rigorously validated using authentic clinical data.
  4. Study Design: The study used a quantitative observational or experimental design, such as retrospective cohort studies, prospective trials, or randomized controlled trials.
  5. Language and Publication Date: Studies published in English on or before December 6, 2024.

Exclusion criteria

Studies were excluded if they:

  1. Did not use AI methodologies (e.g., relied solely on rule-based or expert systems)
  2. Failed to report predictive performance using standard metrics
  3. Were published in non-peer-reviewed formats (e.g., editorials, conference abstracts, case reports)
  4. Described AI tools that did not directly support clinical decision-making (e.g., image segmentation algorithms without interpretive or predictive outputs)
  5. Involved non-predictive AI-CDSS types (knowledge graphs, conversational agents, NLP generation systems)

Study selection

Two reviewers (WW and AG) independently screened all retrieved records in two stages. First, they reviewed titles and abstracts to identify potentially eligible studies. Second, they conducted full-text reviews to confirm eligibility.

Discrepancies were resolved through discussion. When consensus could not be reached, a third senior reviewer (HA) made the final decision. A detailed screening log documented all decisions and rationales. Inter-rater reliability was assessed using Cohen’s kappa statistic at each stage.
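As an illustration of the inter-rater reliability statistic named above, the following minimal Python sketch computes Cohen's kappa from two reviewers' screening decisions. The function name `cohens_kappa` and the decision lists are hypothetical and are not data from this review; the sketch assumes both reviewers labelled the same records in the same order.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of records.")
    n = len(rater_a)
    # Observed agreement: share of records on which both reviewers made the same call.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical title/abstract screening decisions (illustrative only).
reviewer_1 = ["include", "exclude", "exclude", "include",
              "exclude", "exclude", "include", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude",
              "exclude", "exclude", "include", "exclude"]
print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters in screening because most records are excluded and two reviewers would agree often even if they decided at random.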

Data extraction

Data extraction was performed independently by the same two reviewers (WW and AG) using a standardized extraction template. Extracted information included publication details (authors, year, journal, geographic location); clinical domain and setting (specialty; inpatient, outpatient, or emergency care setting; patient population); study characteristics (sample size, study design, data sources); AI-CDSS characteristics (algorithm type, e.g., random forest or neural network; input features; targeted clinical task; training dataset details); performance metrics (sensitivity, specificity, accuracy, AUROC, PPV, NPV); and, where reported, clinical implementation metrics (clinical workflow integration, clinician adoption rates, time-to-decision, alert override rates, and clinical outcome measures such as changes in mortality, length of stay, and diagnostic accuracy in practice). Any disagreements encountered during data extraction were resolved through discussion or, when necessary, through consultation with the third reviewer (HA).

Risk of bias assessment

The QUADAS-2 tool (Quality Assessment of Diagnostic Accuracy Studies) [13] was applied to assess the methodological quality and risk of bias in the included studies. QUADAS-2 evaluates risk of bias in four domains: patient selection, index test, reference standard, and flow of patients/timing of assessments. Two reviewers independently applied QUADAS-2 to each study, with disagreements resolved by consensus.

Signalling questions were answered per QUADAS-2 guidance, and each domain was rated as “low,” “high,” or “unclear” risk of bias. We also evaluated concerns regarding applicability in each domain. It is important to note that QUADAS-2 was not originally designed to assess AI-driven diagnostic tools, which often have unique sources of bias such as overfitting to training data, poor generalizability across populations, and data leakage. As such, this quality assessment may not fully capture AI-specific bias issues. The overall QUADAS-2 ratings for each study are presented in Fig 1a and 1b.

Outcome measures and standardization

For consistency, we standardized the definitions of key metrics across studies: Sensitivity (recall): The proportion of true positive cases correctly identified by the AI tool; Specificity: The proportion of true negative cases correctly identified; PPV (precision): The probability that a positive prediction by the AI is a true positive; NPV: The probability that a negative prediction is a true negative; Accuracy: The overall proportion of correct classifications (true positives plus true negatives over all cases); AUC (AUROC): The Area Under the Receiver Operating Characteristic curve, which plots sensitivity versus (1–specificity).
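The standardized definitions above can be expressed directly in terms of confusion-matrix counts. The Python sketch below is illustrative only: the function names and the example counts and scores are hypothetical, and the AUROC is computed by pairwise rank comparison (ties counted as one half), which is one of several equivalent ways to obtain the area under the ROC curve.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def diagnostic_metrics(tp, fp, tn, fn):
    """Threshold-based metrics from true/false positive and negative counts."""
    return {
        "sensitivity": _ratio(tp, tp + fn),   # recall: true positives among actual positives
        "specificity": _ratio(tn, tn + fp),   # true negatives among actual negatives
        "ppv": _ratio(tp, tp + fp),           # precision: true positives among positive calls
        "npv": _ratio(tn, tn + fn),           # true negatives among negative calls
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
    }


def auroc(scores, labels):
    """Probability that a random positive case scores above a random negative case."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case.")
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical counts and risk scores (not taken from any included study).
print(diagnostic_metrics(tp=80, fp=30, tn=170, fn=20))
print(auroc([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0]))
```

Sensitivity, specificity, PPV, NPV, and accuracy all depend on the decision threshold applied to the model output, whereas AUROC summarizes discrimination across all thresholds; PPV and NPV additionally depend on outcome prevalence.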

Data synthesis and analysis

We summarized key findings of included studies qualitatively and, where appropriate, quantitatively via meta-analysis. For studies sufficiently homogeneous in terms of reported metrics, we pooled performance measures using random-effects meta-analysis models (DerSimonian-Laird method). We chose a random-effects model a priori given the anticipated heterogeneity in study populations, clinical tasks, and AI models.

Pooled estimates with 95% confidence intervals (CI) were computed for the primary metrics of interest (sensitivity, specificity, accuracy, and AUROC). Each study’s contribution was weighted by the inverse of its variance (incorporating sample size and outcome prevalence), so that larger studies (with more precise estimates) had greater influence on the pooled result.
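The pooling step can be sketched in a few lines of Python. This is not the Stata code used for the analysis; it is a minimal illustration of the DerSimonian-Laird estimator under stated assumptions. The function name, the example proportions and sample sizes, and the choice to approximate each within-study variance as p(1 − p)/n on the raw proportion scale are all assumptions made for this example (transformed scales such as the logit are also common).

```python
import math


def dersimonian_laird(estimates, variances, z=1.96):
    """Pool study estimates with the DerSimonian-Laird random-effects method."""
    if len(estimates) != len(variances) or len(estimates) < 2:
        raise ValueError("Need matching lists with at least two studies.")
    k = len(estimates)
    # Fixed-effect (inverse-variance) weights, needed to compute Cochran's Q.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    # Method-of-moments estimate of the between-study variance, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return {"pooled": pooled, "ci": (pooled - z * se, pooled + z * se),
            "tau2": tau2, "q": q}


# Hypothetical sensitivities and sample sizes (illustrative only).
sens = [0.62, 0.71, 0.55, 0.80]
n = [200, 150, 400, 120]
var = [p * (1 - p) / size for p, size in zip(sens, n)]  # assumed raw-scale variance
print(dersimonian_laird(sens, var))
```

Because τ² is added to every study's variance, the random-effects weights are more even than fixed-effect weights: larger studies still count for more, but less dominantly when between-study heterogeneity is large.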

We assessed statistical heterogeneity using Cochran’s Q and the I² statistic. I² quantifies the dispersion of effect sizes in a meta-analysis as the percentage of total variation across studies that is due to heterogeneity rather than chance; values of 25%, 50%, and 75% are commonly interpreted as low, moderate, and high heterogeneity, respectively, and we treated I² > 75% as indicating substantial heterogeneity. We also report τ² as the between-study variance. All meta-analyses were conducted using Stata Statistical Software Release 15 (StataCorp), and results are displayed in forest plots (Fig 36).
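The relationship between Cochran's Q and I² can be made concrete with a short Python sketch. The function names and the example values are hypothetical; the banding function simply applies the 25%, 50%, and 75% benchmarks quoted above as lower cut-offs, which is one possible reading of those benchmarks rather than a fixed rule.

```python
def i_squared(q, k):
    """I^2 (%) from Cochran's Q and the number of pooled studies k."""
    if k < 2:
        raise ValueError("At least two studies are needed.")
    df = k - 1  # degrees of freedom of Q under homogeneity
    if q <= 0:
        return 0.0
    # Share of Q in excess of its expected value under homogeneity, floored at zero.
    return max(0.0, (q - df) / q) * 100.0


def heterogeneity_band(i2):
    """Map I^2 to the conventional benchmarks (assumed to act as lower cut-offs)."""
    if i2 >= 75:
        return "high"
    if i2 >= 50:
        return "moderate"
    if i2 >= 25:
        return "low"
    return "below the low benchmark"


# Hypothetical values: Q = 4500 across k = 50 studies.
value = i_squared(4500.0, 50)
print(f"I^2 = {value:.1f}% ({heterogeneity_band(value)})")
```

Because Q grows with the number and precision of studies, I² can approach 100% even when absolute differences between studies are modest, so it is best read alongside τ² and the spread visible in the forest plots.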

We evaluated publication bias qualitatively (e.g., noting if only positive studies were published in certain domains) and with funnel plots for the main outcome (AUC) when ≥10 studies were available.

Where reported in source studies, we qualitatively synthesized clinical implementation metrics including workflow integration approaches, clinician adoption patterns, alert response rates, and impacts on clinical outcomes.

No formal patient or public involvement was applicable in this evidence synthesis, as it relied on previously published studies.

Conclusion

This review finds that predictive AI-CDSS achieve moderate diagnostic performance across diverse specialties, with particular strength in specificity [134,135]. However, substantial performance heterogeneity, predominance of retrospective studies, and limited reporting of implementation metrics highlight a critical gap between technical validation and real-world utility assessment [136,137]. To realize the potential of AI-CDSS in enhancing clinical decision-making and patient care, the field must move from a focus on technical metrics toward comprehensive evaluation encompassing workflow integration, clinician adoption, patient outcomes, and health equity impacts [138,139]. This requires collaboration across AI developers, clinicians, patients, health system administrators, regulators, and policymakers to establish standardized frameworks, address ethical concerns, and develop implementation strategies that facilitate successful translation from development to deployment [140,141]. As AI methodologies evolve rapidly with deep learning, transformer architectures, and foundation models, maintaining rigorous, transparent, and clinically meaningful evaluation standards will be essential [142,143]. The QUADAS-AI tool [130] represents an important step toward standardizing quality assessment [144], and its adoption should be prioritized. By addressing identified gaps, particularly the need for prospective validation, standardized reporting, and implementation-focused research, the field can move toward evidence-based integration of AI-CDSS that demonstrably improves clinical care while maintaining safety, fairness, and patient trust [145,146].

Supporting information

S1 File. PRISMA Checklist (From: Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021;372:n71. https://doi.org/10.1136/bmj.n71. This work is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/).

https://doi.org/10.1371/journal.pdig.0001310.s001

(PDF)

S1 Table. Systematic Review Study Characteristics.

https://doi.org/10.1371/journal.pdig.0001310.s003

(PDF)

S3 Table. Summary of Studies Using AI Models and Explainability Tools.

https://doi.org/10.1371/journal.pdig.0001310.s005

(PDF)

S4 Table. All Studies Identified in Literature Search (Excel spreadsheet).

https://doi.org/10.1371/journal.pdig.0001310.s006

(CSV)

Acknowledgments

AI Disclosure: No generative AI tools were used in the drafting of this manuscript.

References

  1. Magrabi F, Ammenwerth E, McNair JB, De Keizer NF, Hyppönen H, Nykänen P, et al. Artificial Intelligence in Clinical Decision Support: Challenges for Evaluating AI and Practical Implications. Yearb Med Inform. 2019;28(1):128–34. pmid:31022752
  2. Moazemi S, Vahdati S, Li J, Kalkhoff S, Castano LJV, Dewitz B, et al. Artificial intelligence for clinical decision support for monitoring patients in cardiovascular ICUs: A systematic review. Front Med (Lausanne). 2023;10:1109411. pmid:37064042
  3. Oehring R, Ramasetti N, Ng S, Roller R, Thomas P, Winter A, et al. Use and accuracy of decision support systems using artificial intelligence for tumor diseases: a systematic review and meta-analysis. Front Oncol. 2023;13:1224347. pmid:37860189
  4. Beşler MS, Koç U. Systematic review of artificial intelligence competitions in radiology: a focus on design, evaluation, and trends. Diagn Interv Radiol. 2026;32(2):164–70. pmid:40192339
  5. Moor M, Banerjee O, Abad ZSH, Krumholz HM, Leskovec J, Topol EJ, et al. Foundation models for generalist medical artificial intelligence. Nature. 2023;616(7956):259–65. pmid:37045921
  6. Hama T, Alsaleh MM, Allery F, Choi JW, Tomlinson C, Wu H, et al. Enhancing Patient Outcome Prediction Through Deep Learning With Sequential Diagnosis Codes From Structured Electronic Health Record Data: Systematic Review. J Med Internet Res. 2025;27:e57358. pmid:40100249
  7. Jones C, Thornton J, Wyatt JC. Artificial intelligence and clinical decision support: clinicians’ perspectives on trust, trustworthiness, and liability. Med Law Rev. 2023;31(4):501–20. pmid:37218368
  8. Maher D, Sluggett JK, Soriano J, Hull D-A, Hillock NT. Surveillance of Antimicrobial Use in Long-Term Care Facilities: An Antimicrobial Mapping Survey. J Am Med Dir Assoc. 2024;25(9):105144. pmid:38991651
  9. Sugawara T, Ohkusa Y, Kawanohara H, Kamei M. Prescription surveillance for early detection system of emerging and reemerging infectious disease outbreaks. Biosci Trends. 2018;12(5):523–5. pmid:30473564
  10. Okedo-Alex IN, Akamike IC, Iyamu I, Umeokonkwo CD. Pattern of antimicrobial prescription in Africa: a systematic review of point prevalence surveys. Pan Afr Med J. 2023;45:67. pmid:37637407
  11. Rittmann B, Stevens MP. Clinical Decision Support Systems and Their Role in Antibiotic Stewardship: a Systematic Review. Curr Infect Dis Rep. 2019;21(8):29. pmid:31342180
  12. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. pmid:33782057
  13. Whiting PF, Rutjes AWS, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529–36. pmid:22007046
  14. Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health. 2019;1(6):e271–97. pmid:33323251
  15. Nagendran M, Chen Y, Lovejoy CA, Gordon AC, Komorowski M, Harvey H, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368:m689. pmid:32213531
  16. Park SH, Han K. Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction. Radiology. 2018;286(3):800–9. pmid:29309734
  17. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432. pmid:25738806
  18. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44–56. pmid:30617339
  19. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015;350:g7594. pmid:25569120
  20. Wynants L, Van Calster B, Collins GS, Riley RD, Heinze G, Schuit E, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ. 2020;369:m1328. pmid:32265220
  21. Rajkomar A, Dean J, Kohane I. Machine Learning in Medicine. N Engl J Med. 2019;380(14):1347–58. pmid:30943338
  22. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128–38. pmid:20010215
  23. Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230. pmid:31842878
  24. Vickers AJ, Van Calster B, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ. 2016;352:i6. pmid:26810254
  25. Kerr KF, Brown MD, Zhu K, Janes H. Assessing the Clinical Impact of Risk Prediction Models With Decision Curves: Guidance for Correct Interpretation and Appropriate Use. J Clin Oncol. 2016;34(21):2534–40. pmid:27247223
  26. Riley RD, Ensor J, Snell KIE, Harrell FE Jr, Martin GP, Reitsma JB, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368:m441. pmid:32188600
  27. Collins GS, Dhiman P, Andaur Navarro CL, Ma J, Hooft L, Reitsma JB, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021;11(7):e048008. pmid:34244270
  28. Collins GS, Moons KGM. Reporting of artificial intelligence prediction models. Lancet. 2019;393(10181):1577–9. pmid:31007185
  29. Rivera SC, Liu X, Chan AW, Denniston AK, Calvert MJ. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat Med. 2020;26(9):1351–63.
  30. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019;17(1):195. pmid:31665002
  31. Char DS, Shah NH, Magnus D. Implementing Machine Learning in Health Care - Addressing Ethical Challenges. N Engl J Med. 2018;378(11):981–3. pmid:29539284
  32. Rudin C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat Mach Intell. 2019;1(5):206–15. pmid:35603010
  33. Lipton ZC. The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery. Queue. 2018;16(3):31–57.
  34. Ghassemi M, Oakden-Rayner L, Beam AL. The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit Health. 2021;3(11):e745–50. pmid:34711379
  35. Tonekaboni S, Joshi S, McCradden MD, Goldenberg A. What clinicians want: contextualizing explainable machine learning for clinical end use. Proceedings of Machine Learning Research. 2019;106:359–80.
  36. Grote T, Berens P. On the ethics of algorithmic decision-making in healthcare. J Med Ethics. 2020;46(3):205–11. pmid:31748206
  37. Price WN 2nd, Gerke S, Cohen IG. Potential Liability for Physicians Using Artificial Intelligence. JAMA. 2019;322(18):1765–6. pmid:31584609
  38. Challen R, Denny J, Pitt M, Gompels L, Edwards T, Tsaneva-Atanasova K. Artificial intelligence, bias and clinical safety. BMJ Qual Saf. 2019;28(3):231–7. pmid:30636200
  39. Gerke S, Minssen T, Cohen G. Ethical and legal challenges of artificial intelligence-driven healthcare. Artif Intell Healthc. 2020;:295–336.
  40. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30:4765–74.
  41. Ribeiro MT, Singh S, Guestrin C. Why should I trust you? Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. 1135–44.
  42. Murdoch WJ, Singh C, Kumbier K, Abbasi-Asl R, Yu B. Definitions, methods, and applications in interpretable machine learning. Proc Natl Acad Sci U S A. 2019;116(44):22071–80. pmid:31619572
  43. Amann J, Blasimme A, Vayena E, Frey D, Madai VI, Precise4Q consortium. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Med Inform Decis Mak. 2020;20(1):310. pmid:33256715
  44. van der Velden BHM, Kuijf HJ, Gilhuijs KGA, Viergever MA. Explainable artificial intelligence (XAI) in deep learning-based medical image analysis. Med Image Anal. 2022;79:102470. pmid:35576821
  45. Holzinger A, Biemann C, Pattichis CS, Kell DB. What do we need to build explainable AI systems for the medical domain? arXiv preprint. 2017.
  46. Caruana R, Lou Y, Gehrke J, Koch P, Sturm M, Elhadad N. Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015. 1721–30.
  47. Doshi-Velez F, Kim B. Towards a rigorous science of interpretable machine learning. arXiv preprint. 2017.
  48. Zhang Z, Beck MW, Winkler DA, Huang B, Sibanda W, Goyal H; written on behalf of AME Big-Data Clinical Trial Collaborative Group. Opening the black box of neural networks: methods for interpreting neural network models in clinical applications. Ann Transl Med. 2018;6(11):216.
  49. Tjoa E, Guan C. A Survey on Explainable Artificial Intelligence (XAI): Toward Medical XAI. IEEE Trans Neural Netw Learn Syst. 2021;32(11):4793–813. pmid:33079674
  50. Luo W, Phung D, Tran T, Gupta S, Rana S, Karmakar C, et al. Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View. J Med Internet Res. 2016;18(12):e323. pmid:27986644
  51. Beam AL, Kohane IS. Big Data and Machine Learning in Health Care. JAMA. 2018;319(13):1317–8. pmid:29532063
  52. Adadi A, Berrada M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access. 2018;6:52138–60.
  53. Barredo Arrieta A, Díaz-Rodríguez N, Del Ser J, Bennetot A, Tabik S, Barbado A, et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion. 2020;58:82–115.
  54. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–53. pmid:31649194
  55. Chen IY, Pierson E, Rose S, Joshi S, Ferryman K, Ghassemi M. Ethical Machine Learning in Healthcare. Annu Rev Biomed Data Sci. 2021;4:123–44. pmid:34396058
  56. Norori N, Hu Q, Aellen FM, Faraci FD, Tzovara A. Addressing bias in big data and AI for health care: A call for open science. Patterns (N Y). 2021;2(10):100347. pmid:34693373
  57. Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data. JAMA Intern Med. 2018;178(11):1544–7. pmid:30128552
  58. Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring Fairness in Machine Learning to Advance Health Equity. Ann Intern Med. 2018;169(12):866–72. pmid:30508424
  59. Vokinger KN, Feuerriegel S, Kesselheim AS. Mitigating bias in machine learning for medicine. Commun Med (Lond). 2021;1:25. pmid:34522916
  60. Gerke S, Babic B, Evgeniou T, Cohen IG. The need for a system view to regulate artificial intelligence/machine learning-based software as medical device. NPJ Digit Med. 2020;3:53. pmid:32285013
  61. Reddy S, Allan S, Coghlan S, Cooper P. A governance model for the application of AI in health care. J Am Med Inform Assoc. 2020;27(3):491–7. pmid:31682262
  62. Mittelstadt BD, Allo P, Taddeo M, Wachter S, Floridi L. The ethics of algorithms: Mapping the debate. Big Data & Society. 2016;3(2).
  63. Shortliffe EH, Sepúlveda MJ. Clinical Decision Support in the Era of Artificial Intelligence. JAMA. 2018;320(21):2199–200. pmid:30398550
  64. Vayena E, Blasimme A, Cohen IG. Machine learning in medicine: Addressing ethical challenges. PLoS Med. 2018;15(11):e1002689. pmid:30399149
  65. Morley J, Machado CCV, Burr C, Cowls J, Joshi I, Taddeo M, et al. The ethics of AI in health care: A mapping review. Soc Sci Med. 2020;260:113172. pmid:32702587
  66. Wachter S, Mittelstadt B, Russell C. Why fairness cannot be automated: Bridging the gap between EU non-discrimination law and AI. Computer Law & Security Review. 2021;41:105567.
  67. Coiera E, Kocaballi B, Halamka J, Laranjo L. The digital scribe. NPJ Digit Med. 2018;1:58. pmid:31304337
  68. Jiang F, Jiang Y, Zhi H, Dong Y, Li H, Ma S, et al. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol. 2017;2(4):230–43. pmid:29507784
  69. Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24–9. pmid:30617335
  70. Sutton RT, Pincock D, Baumgart DC, Sadowski DC, Fedorak RN, Kroeker KI. An overview of clinical decision support systems: benefits, risks, and strategies for success. NPJ Digit Med. 2020;3:17. pmid:32047862
  71. Aggarwal R, Sounderajah V, Martin G, Ting DSW, Karthikesalingam A, King D, et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digit Med. 2021;4(1):65. pmid:33828217
  72. Sendak MP, Gao M, Brajer N, Balu S. Presenting machine learning model information to clinical end users with model facts labels. NPJ Digit Med. 2020;3:41. pmid:32219182
  73. Wiens J, Saria S, Sendak M, Ghassemi M, Liu VX, Doshi-Velez F, et al. Do no harm: a roadmap for responsible machine learning for health care. Nat Med. 2019;25(9):1337–40. pmid:31427808
  74. Rawson TM, Moore LSP, Zhu N, Ranganathan N, Skolimowska K, Gilchrist M, et al. Bacterial and Fungal Coinfection in Individuals With Coronavirus: A Rapid Review To Support COVID-19 Antimicrobial Prescribing. Clin Infect Dis. 2020;71(9):2459–68. pmid:32358954
  75. Tacconelli E, Carrara E, Savoldi A, Harbarth S, Mendelson M, Monnet DL, et al. Discovery, research, and development of new antibiotics: the WHO priority list of antibiotic-resistant bacteria and tuberculosis. Lancet Infect Dis. 2018;18(3):318–27. pmid:29276051
  76. 76. Verma AA, Murray J, Greiner R, Cohen JP, Shojania KG, Ghassemi M, et al. Implementing machine learning in medicine. CMAJ. 2021;193(34):E1351–7. pmid:35213323
  77. 77. D’Agostino RB Sr, Vasan RS, Pencina MJ, Wolf PA, Cobain M, Massaro JM, et al. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation. 2008;117(6):743–53. pmid:18212285
  78. 78. Amann J, Vayena E, Ormond KE, Frey D, Madai VI, Blasimme A. Expectations and attitudes towards medical artificial intelligence: A qualitative study in the field of stroke. PLoS One. 2023;18(1):e0279088. pmid:36630325
  79. 79. McCradden MD, Joshi S, Mazwi M, Anderson JA. Ethical limitations of algorithmic fairness solutions in health care machine learning. Lancet Digit Health. 2020;2(5):e221–3. pmid:33328054
  80. 80. Kaushal A, Altman R, Langlotz C. Geographic Distribution of US Cohorts Used to Train Deep Learning Algorithms. JAMA. 2020;324(12):1212–3. pmid:32960230
  81. 81. Cabitza F, Rasoini R, Gensini GF. Unintended Consequences of Machine Learning in Medicine. JAMA. 2017;318(6):517–8. pmid:28727867
  82. 82. Lehman CD, Wellman RD, Buist DSM, Kerlikowske K, Tosteson ANA, Miglioretti DL. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA Intern Med. 2015;175(11):1828–37.
  83. 83. Scott IA, Carter SM, Coiera E. Exploring stakeholder attitudes towards AI in clinical practice. BMJ Health Care Inform. 2021;28(1):e100450. pmid:34887331
  84. 84. Sintchenko V, Iredell JR, Gilbert GL. Pathogen profiling for disease management and surveillance. Nat Rev Microbiol. 2007;5(6):464–70. pmid:17487146
  85. 85. Keskin S, Emecen AN, Ergör A. Infection Risk Prediction in Healthcare Settings: Lessons from COVID-19 Contact Tracing. Infect Dis Clin Microbiol. 2024;6(1):44–54. pmid:38633443
  86. 86. Kherabi Y, Thy M, Bouzid D, Antcliffe DB, Rawson TM, Peiffer-Smadja N. Machine learning to predict antimicrobial resistance: future applications in clinical practice? Infect Dis Now. 2024;54(3):104864. pmid:38355048
  87. 87. Rawson TM, Zhu N, Galiwango R, Cocker D, Islam MS, Myall A, et al. Using digital health technologies to optimise antimicrobial use globally. Lancet Digit Health. 2024;6(12):e914–25. pmid:39547912
  88. 88. Baltas I, Rawson TM, Houston H, Grandjean L, Pollara G. Antimicrobial resistance-attributable mortality: a patient-level analysis. JAC Antimicrob Resist. 2024;6(6):dlae202. pmid:39703831
  89. 89. Feng T, Noren DP, Kulkarni C, Mariani S, Zhao C, Ghosh E, et al. Machine learning-based clinical decision support for infection risk prediction. Front Med (Lausanne). 2023;10:1213411. pmid:38179280
  90. 90. Murri R, De Angelis G, Antenucci L, Fiori B, Rinaldi R, Fantoni M, et al. A Machine Learning Predictive Model of Bloodstream Infection in Hospitalized Patients. Diagnostics (Basel). 2024;14(4):445. pmid:38396484
  91. 91. Rabaan AA, Alhumaid S, Mutair AA, Garout M, Abulhamayel Y, Halwani MA, et al. Application of Artificial Intelligence in Combating High Antimicrobial Resistance Rates. Antibiotics (Basel). 2022;11(6):784. pmid:35740190
  92. Liu GY, Yu D, Fan MM, Zhang X, Jin ZY, Tang C. Antimicrobial resistance crisis: could artificial intelligence be the solution? Mil Med Res. 2024;11(1):7.
  93. Ali T, Ahmed S, Aslam M. Artificial Intelligence for Antimicrobial Resistance Prediction: Challenges and Opportunities towards Practical Implementation. Antibiotics (Basel). 2023;12(3):523. pmid:36978390
  94. Howard A, Aston S, Gerada A, Reza N, Bincalar J, Mwandumba H, et al. Antimicrobial learning systems: an implementation blueprint for artificial intelligence to tackle antimicrobial resistance. Lancet Digit Health. 2024;6(1):e79–86. pmid:38123255
  95. Lv J, Deng S, Zhang L. A review of artificial intelligence applications for antimicrobial resistance. Biosafety and Health. 2021;3(1):22–31.
  96. Zheng J, Li J, Zhang Z, Yu Y, Tan J, Liu Y, et al. Clinical Data based XGBoost Algorithm for infection risk prediction of patients with decompensated cirrhosis: a 10-year (2012-2021) Multicenter Retrospective Case-control study. BMC Gastroenterol. 2023;23(1):310. pmid:37704966
  97. Chen Y, Zhang Y, Nie S, Ning J, Wang Q, Yuan H, et al. Risk assessment and prediction of nosocomial infections based on surveillance data using machine learning methods. BMC Public Health. 2024;24(1):1780. pmid:38965513
  98. Deelen JWT, Rottier WC, Giron Ortega JA, Rodriguez-Baño J, Harbarth S, Tacconelli E, et al. An International Prospective Cohort Study To Validate 2 Prediction Rules for Infections Caused by Third-generation Cephalosporin-resistant Enterobacterales. Clin Infect Dis. 2021;73(11):e4475–83. pmid:32640024
  99. Lipworth S, Vihta K-D, Chau KK, Kavanagh J, Davies T, George S, et al. Ten Years of Population-Level Genomic Escherichia coli and Klebsiella pneumoniae Serotype Surveillance Informs Vaccine Development for Invasive Infections. Clin Infect Dis. 2021;73(12):2276–82. pmid:33411882
  100. Cressman AM, MacFadden DR, Verma AA, Razak F, Daneman N. Empiric Antibiotic Treatment Thresholds for Serious Bacterial Infections: A Scenario-based Survey Study. Clin Infect Dis. 2019;69(6):930–7. pmid:30535310
  101. Hoffman SJ, Outterson K. What Will It Take to Address the Global Threat of Antibiotic Resistance? J Law Med Ethics. 2015;43(2):363–8. pmid:26242959
  102. Laxminarayan R, Duse A, Wattal C, Zaidi AKM, Wertheim HFL, Sumpradit N, et al. Antibiotic resistance-the need for global solutions. Lancet Infect Dis. 2013;13(12):1057–98. pmid:24252483
  103. Pezzani MD, Mazzaferri F, Compri M, Galia L, Mutters NT, Kahlmeter G, et al. Linking antimicrobial resistance surveillance to antibiotic policy in healthcare settings: the COMBACTE-Magnet EPI-Net COACH project. J Antimicrob Chemother. 2020;75(Suppl 2):ii2–19. pmid:33280049
  104. European Centre for Disease Prevention and Control. Surveillance of antimicrobial resistance in Europe 2022. Stockholm: ECDC. 2023.
  105. Bolton WJ, Wilson R, Gilchrist M, Georgiou P, Holmes A, Rawson TM. Personalising intravenous to oral antibiotic switch decision making through fair interpretable machine learning. Nat Commun. 2024;15(1):506. pmid:38218885
  106. Rawson TM, Hernandez B, Moore LSP, Herrero P, Charani E, Ming D, et al. A Real-world Evaluation of a Case-based Reasoning Algorithm to Support Antimicrobial Prescribing Decisions in Acute Care. Clin Infect Dis. 2021;72(12):2103–11. pmid:32246143
  107. Golob JL, Rao K. Signal Versus Noise: How to Analyze the Microbiome and Make Progress on Antimicrobial Resistance. J Infect Dis. 2021;223(12 Suppl 2):S214–21. pmid:33880565
  108. Bengtsson-Palme J, Abramova A, Berendonk TU, Coelho LP, Forslund SK, Gschwind R, et al. Towards monitoring of antimicrobial resistance in the environment: For what reasons, how to implement it, and what are the data needs? Environ Int. 2023;178:108089. pmid:37441817
  109. Tomašev N, Harris N, Baur S, Mottram A, Glorot X, Rae JW, et al. Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records. Nat Protoc. 2021;16(6):2765–87. pmid:33953393
  110. Parikh RB, Teeple S, Navathe AS. Addressing Bias in Artificial Intelligence in Health Care. JAMA. 2019;322(24):2377–8. pmid:31755905
  111. Hendriksen RS, Munk P, Njage P, van Bunnik B, McNally L, Lukjancenko O, et al. Global monitoring of antimicrobial resistance based on metagenomics analyses of urban sewage. Nat Commun. 2019;10(1):1124. pmid:30850636
  112. Morency-Potvin P, Schwartz DN, Weinstein RA. Antimicrobial Stewardship: How the Microbiology Laboratory Can Right the Ship. Clin Microbiol Rev. 2016;30(1):381–407. pmid:27974411
  113. Elias C, Raad M, Rasoanandrasana S, Raherinandrasana AH, Andriananja V, Raberahona M, et al. Implementation of an antibiotic resistance surveillance tool in Madagascar, the TSARA project: a prospective, observational, multicentre, hospital-based study protocol. BMJ Open. 2024;14(3):e078504. pmid:38508637
  114. Finlayson SG, Subbaswamy A, Singh K, Bowers J, Kupke A, Zittrain J, et al. The Clinician and Dataset Shift in Artificial Intelligence. N Engl J Med. 2021;385(3):283–6. pmid:34260843
  115. Davis SE, Greevy RA, Fonnesbeck C, Lasko TA, Walsh CG, Matheny ME. A nonparametric updating method to correct clinical prediction model drift. J Am Med Inform Assoc. 2019;26(12):1448–57. pmid:31397478
  116. Kawamoto K, Houlihan CA, Balas EA, Lobach DF. Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. BMJ. 2005;330(7494):765. pmid:15767266
  117. Berner ES. Clinical Decision Support Systems: Theory and Practice. 3rd ed. New York: Springer. 2016.
  118. Jha S, Topol EJ. Adapting to Artificial Intelligence: Radiologists and Pathologists as Information Specialists. JAMA. 2016;316(22):2353–4. pmid:27898975
  119. Stevenson M, Scope A, Sutcliffe P, Booth A, Slade P, Parry G. The cost-effectiveness of group cognitive behavioural therapy compared with routine primary care for women with postnatal depression: value of information analysis. Health Technol Assess. 2010;14(44):1–107.
  120. Mooney SJ, Pejaver V. Big Data in Public Health: Terminology, Machine Learning, and Privacy. Annu Rev Public Health. 2018;39:95–112. pmid:29261408
  121. Keane PA, Topol EJ. With an eye to AI and autonomous diagnosis. NPJ Digit Med. 2018;1:40. pmid:31304321
  122. Yu K-H, Beam AL, Kohane IS. Artificial intelligence in healthcare. Nat Biomed Eng. 2018;2(10):719–31. pmid:31015651
  123. Haug CJ, Drazen JM. Artificial intelligence and machine learning in clinical medicine. N Engl J Med. 2023;388(13):1201–8.
  124. Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015;162(1):W1–73. pmid:25560730
  125. Steyerberg EW, Harrell FE Jr. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol. 2016;69:245–7. pmid:25981519
  126. Van Calster B, Wynants L, Verbeek JFM, Verbakel JY, Christodoulou E, Vickers AJ, et al. Reporting and Interpreting Decision Curve Analysis: A Guide for Investigators. Eur Urol. 2018;74(6):796–804. pmid:30241973
  127. Ash JS, Sittig DF, Campbell EM, Guappone KP, Dykstra RH. Some unintended consequences of clinical decision support systems. AMIA Annu Symp Proc. 2007;2007:26–30. pmid:18693791
  128. Greenhalgh T, Wherton J, Papoutsi C, Lynch J, Hughes G, A’Court C, et al. Beyond Adoption: A New Framework for Theorizing and Evaluating Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies. J Med Internet Res. 2017;19(11):e367. pmid:29092808
  129. Sounderajah V, Ashrafian H, Aggarwal R, De Fauw J, Denniston AK, Greaves F, et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: The STARD-AI Steering Group. Nat Med. 2020;26(6):807–8. pmid:32514173
  130. Guni A, Sounderajah V, Whiting P, Bossuyt P, Darzi A, Ashrafian H. Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies Using AI (QUADAS-AI): Protocol for a Qualitative Study. JMIR Res Protoc. 2024;13:e58202. pmid:39293047
  131. DECIDE-AI Steering Group. DECIDE-AI: new reporting guidelines to bridge the development-to-implementation gap in clinical artificial intelligence. Nat Med. 2021;27(2):186–7. pmid:33526932
  132. Rethlefsen ML, Kirtley S, Waffenschmidt S, Ayala AP, Moher D, Page MJ, et al. PRISMA-S: an extension to the PRISMA Statement for Reporting Literature Searches in Systematic Reviews. Syst Rev. 2021;10(1):39. pmid:33499930
  133. Bramer WM, Rethlefsen ML, Kleijnen J, Franco OH. Optimal database combinations for literature searches in systematic reviews: a prospective exploratory study. Syst Rev. 2017;6(1):245. pmid:29208034
  134. Liu Y, Chen P-HC, Krause J, Peng L. How to Read Articles That Use Machine Learning: Users’ Guides to the Medical Literature. JAMA. 2019;322(18):1806–16. pmid:31714992
  135. Vabalas A, Gowen E, Poliakoff E, Casson AJ. Machine learning algorithm validation with a limited sample size. PLoS One. 2019;14(11):e0224365. pmid:31697686
  136. Shah NH, Milstein A, Bagley SC. Making Machine Learning Models Clinically Useful. JAMA. 2019;322(14):1351–2. pmid:31393527
  137. McCoy LG, Nagaraj S, Morgado F, Harish V, Das S, Celi LA. What do medical students actually need to know about artificial intelligence? NPJ Digit Med. 2020;3:86. pmid:32577533
  138. Sendak M, Elish MC, Gao M, Futoma J, Ratliff W, Nichols M, et al. The human body is a black box: supporting clinical decision-making with deep learning. In: Proc Conf Fairness Accountability Transp, 2020. 99–109.
  139. Cruz Rivera S, Liu X, Chan AW, Denniston AK, Calvert MJ. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Lancet Digit Health. 2020;2(10):e549–60.
  140. Ibrahim H, Liu X, Denniston AK. Reporting guidelines for artificial intelligence in healthcare research. Clin Exp Ophthalmol. 2021;49(5):470–6. pmid:33956386
  141. Khullar D, Casalino LP, Qian Y. The transformative potential of artificial intelligence in health care delivery. JAMA Health Forum. 2022;3(6):e222186.
  142. Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit Med. 2023;6(1):120. pmid:37414860
  143. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–80. pmid:37438534
  144. Mongan J, Moy L, Kahn CE Jr. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): A Guide for Authors and Reviewers. Radiol Artif Intell. 2020;2(2):e200029. pmid:33937821
  145. Celi LA, Cellini J, Charpignon M-L, Dee EC, Dernoncourt F, Eber R, et al. Sources of bias in artificial intelligence that perpetuate healthcare disparities-A global review. PLOS Digit Health. 2022;1(3):e0000022. pmid:36812532
  146. He J, Baxter SL, Xu J, Xu J, Zhou X, Zhang K. The practical implementation of artificial intelligence technologies in medicine. Nat Med. 2019;25(1):30–6. pmid:30617336