Abstract
Background
Sepsis-associated delirium (SAD) occurs due to disruptions in neurotransmission linked to inflammatory responses from infections. It poses significant challenges in clinical management and is associated with poor outcomes. Survivors often experience long-term cognitive and behavioral issues that impact their quality of life and place a burden on their families. This study aimed to develop and validate an interpretable machine learning model for early prediction of SAD in critically ill patients. Additionally, we constructed an online risk calculator to facilitate real-time clinical assessment.
Methods
This study is a retrospective analysis utilizing data from 16,120 patients in the Medical Information Mart for Intensive Care IV database. To manage imbalanced data, we applied the Synthetic Minority Over-sampling Technique (SMOTE) method. Feature selection was conducted using Multivariate Logistic Regression, LASSO regression, and the Boruta algorithm. We developed predictive models using eight machine learning algorithms and selected the best one for validation. The SHapley Additive exPlanations (SHAP) method was used for visualization and interpretation, enhancing the clinical understanding of the model, alongside the creation of an online web calculator.
Results
We combined three feature selection methods to identify 17 key features for our machine learning prediction model. The Gradient Boosting Machine (GBM) model demonstrated excellent calibration and strong predictive accuracy in the validation cohort. The SHAP feature importance ranking revealed five critical risk factors for predicting outcomes: Glasgow Coma Scale (GCS), ICU stay duration, chloride, sodium, and Sequential Organ Failure Assessment (SOFA). Based on this optimal model, we successfully developed an online web calculator.
Conclusion
We developed and validated a machine learning model capable of accurately predicting SAD with high clinical applicability. The integration of interpretable machine learning and an online calculator offers a practical tool to support early identification and timely management of SAD in critically ill patients.
Citation: Gao L, Wang GD, Yang XY, Tong SJ, Wang XJ, Chen YR, et al. (2025) Development of a risk prediction model for sepsis-related delirium based on multiple machine learning approaches and an online calculator. PLoS One 20(7): e0323831. https://doi.org/10.1371/journal.pone.0323831
Editor: Chiara Lazzeri, Azienda Ospedaliero Universitaria Careggi, ITALY
Received: April 15, 2025; Accepted: June 29, 2025; Published: July 16, 2025
Copyright: © 2025 Gao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Abbreviations: MIMIC-IV, Medical Information Mart for Intensive Care IV; SAD, Sepsis-associated delirium; MLR, Multivariate Logistic Regression; LASSO, Least Absolute Shrinkage and Selection Operator; SHAP, SHapley Additive exPlanations; SOFA, Sequential Organ Failure Assessment; GCS, Glasgow Coma Scale; SAPSII, Simplified Acute Physiology Score II; RDW, Red cell distribution width; AKI, Acute kidney injury; MV, Mechanical ventilation; CRRT, Continuous renal replacement therapy; CAM-ICU, Confusion Assessment Method for the Intensive Care Unit; SMOTE, Synthetic Minority Over-sampling Technique; LR, Logistic Regression; SVM, Support Vector Machine; GBM, Gradient Boosting Machine; RF, Random Forest; XGBoost, Extreme Gradient Boosting; AdaBoost, Adaptive Boosting; LightGBM, Light Gradient Boosting Machine; DCA, Decision Curve Analysis.
1. Introduction
Sepsis is an uncontrolled systemic inflammatory response of the host to infection, in which a surge of inflammatory factors causes multi-organ dysfunction that can progress to life-threatening multiple organ failure [1]. Although bundled supportive care has helped reduce mortality among sepsis patients, the overall prognosis remains poor. According to the Centers for Disease Control and Prevention, the annual incidence of sepsis in the United States is 300–1,000 cases per 100,000 people. Global burden of disease studies show that the global annual incidence of sepsis reaches up to 31 million cases, making it a major public health threat [2–4].
Delirium is an acute brain dysfunction characterized by fluctuations in attention and impaired cognitive function, with diverse symptoms that may include psychomotor agitation and altered consciousness. It is commonly seen in critically ill patients admitted to the intensive care unit (ICU) [5]. Sepsis is one of the major risk factors for delirium, as the systemic inflammatory response syndrome associated with sepsis can disrupt the balance of central nervous system function, leading to delirium in patients. Delirium in sepsis patients is referred to as sepsis-associated delirium (SAD); however, the mechanisms by which sepsis affects the central nervous system remain unclear and may involve brain inflammation, cerebral perfusion, blood-brain barrier disruption, and neurotransmitter disturbances. Managing SAD in the ICU has historically been challenging, with poor prognoses for SAD patients. Survivors often experience long-term and severe cognitive impairments and behavioral abnormalities, significantly reducing their quality of life and placing a heavy burden on families [6,7]. Therefore, early identification and prevention of potential SAD patients in clinical practice are of utmost importance. Consequently, we developed a predictive model for SAD and constructed an online calculator to provide clinicians with an important tool for early identification of high-risk populations, optimizing individual intervention measures through various simple clinical indicators to improve clinical outcomes and reduce the length of hospital stay.
The rapid development of artificial intelligence and machine learning algorithms has significantly accelerated the innovation of predictive models for medical diagnosis and prognosis assessment of various diseases [8,9]. Compared to traditional regression analysis, machine learning is widely used to handle clinical data, extracting features related to clinical outcomes from large datasets and identifying independent predictive factors associated with clinical results, thereby better addressing clinical decision-making issues [10,11]. In the ICU, assessing a patient’s risk of developing delirium heavily relies on subjective judgment by healthcare professionals, necessitating the development of a more straightforward, objective, and accurate tool to evaluate the risk of delirium. The aim of this study is to establish an interpretable machine learning-based online predictive tool for SAD risk to assist clinicians in accurately assessing the risk of delirium early, allowing timely adjustments to treatment plans to reduce the incidence of delirium and make more informed clinical decisions.
2. Materials and methods
2.1 Study design
The data for this study were sourced from the Medical Information Mart for Intensive Care III (MIMIC-III, version 1.4) and MIMIC-IV (version 3.1) databases, developed by the Laboratory for Computational Physiology at the Massachusetts Institute of Technology. These publicly available, de-identified databases comprise one of the largest and most comprehensive collections of critical care data worldwide. MIMIC-III includes detailed clinical records from approximately 60,000 ICU admissions at the Beth Israel Deaconess Medical Center between 2001 and 2012. MIMIC-IV expands on this scope, containing data from nearly 90,000 patients admitted between 2008 and 2022. Both databases provide a wide array of structured and unstructured clinical information, including demographic characteristics, diagnoses, procedures, medication use, laboratory test results, nursing documentation, and outcome data [12,13]. To access the databases, we completed the web-based course provided by the National Institutes of Health (NIH), fulfilling the CITI program training requirements and obtaining research ethics certification (certification number: 66380198). Our study complies with the Declaration of Helsinki and international medical ethics standards. Because all patient data were fully anonymized, the requirements for informed consent and ethical review were waived.
2.2 Study population and outcome
We extracted each patient’s hospitalization information, including demographics, vital signs, laboratory test indicators, and treatment information, from these databases using Structured Query Language (SQL). Patients included in the study met the following criteria: (1) confirmed or suspected infection within 24 hours of ICU admission and, according to the Sepsis-3.0 diagnostic criteria, a Sequential Organ Failure Assessment (SOFA) score of ≥2; (2) age 18–100 years with laboratory test records within 24 hours of ICU admission; (3) first admission to the ICU; and (4) assessment with the Confusion Assessment Method for the Intensive Care Unit (CAM-ICU). CAM-ICU is an effective screening tool for identifying delirium in the intensive care unit, and septic patients with positive assessments were defined as having SAD. The exclusion criteria were as follows: (1) dementia or schizophrenia; (2) more than 30% of variables missing; (3) ICU stay of less than 24 hours; (4) outliers in vital signs; and (5) delirium that developed before admission to the ICU. The data collection process for the training and testing sets is illustrated in Fig 1.
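The criteria above amount to a single eligibility predicate. The following is a minimal sketch of that logic, not the authors' extraction code (the study used SQL against MIMIC): the `IcuStay` record and all of its field names are hypothetical and do not correspond to MIMIC column names.

```python
from dataclasses import dataclass, replace


@dataclass
class IcuStay:
    # Hypothetical fields for illustration; not MIMIC column names.
    age: int
    sofa: int
    suspected_infection_24h: bool
    has_labs_24h: bool
    first_icu_admission: bool
    cam_icu_assessed: bool
    dementia_or_schizophrenia: bool
    missing_fraction: float   # share of candidate variables that are missing
    icu_hours: float
    vital_sign_outlier: bool
    delirium_before_icu: bool


def is_eligible(stay: IcuStay) -> bool:
    """Apply the four inclusion criteria, then the five exclusion criteria."""
    included = (
        stay.suspected_infection_24h and stay.sofa >= 2   # Sepsis-3.0
        and 18 <= stay.age <= 100 and stay.has_labs_24h
        and stay.first_icu_admission
        and stay.cam_icu_assessed
    )
    excluded = (
        stay.dementia_or_schizophrenia
        or stay.missing_fraction > 0.30
        or stay.icu_hours < 24
        or stay.vital_sign_outlier
        or stay.delirium_before_icu
    )
    return included and not excluded


# Two toy records: one eligible, one excluded for a short ICU stay.
eligible_stay = IcuStay(67, 5, True, True, True, True, False, 0.05, 72.0, False, False)
short_stay = replace(eligible_stay, icu_hours=12.0)
```

Keeping inclusion and exclusion as two separate expressions mirrors how the criteria are listed and makes each rule easy to audit.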
2.3 Inclusion of variables
This retrospective study included 46 clinical variables extracted from the MIMIC-III and MIMIC-IV databases. The dataset comprised demographic information (age, sex, and weight), baseline clinical characteristics, and comorbidities. Vital signs recorded within the first 24 hours of ICU admission included heart rate (HR), systolic blood pressure (SBP), diastolic blood pressure (DBP), mean arterial pressure (MAP), respiratory rate (RR), oxygen saturation (SpO₂), body temperature, Sequential Organ Failure Assessment (SOFA) score, Simplified Acute Physiology Score II (SAPS II), and Glasgow Coma Scale (GCS) score. Laboratory parameters measured within 24 hours of ICU admission encompassed hematological indices, red blood cell (RBC) count, white blood cell (WBC) count, platelet count, hemoglobin, hematocrit, red cell distribution width (RDW), as well as coagulation profiles (international normalized ratio, prothrombin time, activated partial thromboplastin time), renal function markers (creatinine, blood urea nitrogen), and metabolic indicators (anion gap, pH, bicarbonate, calcium, magnesium, chloride, potassium, sodium, lactate, and glucose). Recorded comorbidities included hypertension, acute kidney injury (AKI), type 2 diabetes mellitus, and heart failure. Iatrogenic and environmental factors comprised major therapeutic interventions during ICU stay, such as mechanical ventilation (MV), continuous renal replacement therapy (CRRT), and administration of medications including midazolam and vasopressin (VP). All variables had less than 20% missing data. Multiple imputation techniques were applied to address missingness, ensuring dataset completeness and minimizing potential bias.
2.4 The selection of features and the processing of data
In this study, SMOTE was used to balance the training set; this approach alleviates the bias introduced by sample imbalance while avoiding the risks associated with excessive oversampling. The training set was used for model development, while the validation set was employed to assess the model’s performance. To determine potential predictive factors in the training set, we utilized three independent methods to filter baseline variables: Multivariate Logistic Regression (MLR), Least Absolute Shrinkage and Selection Operator (LASSO), and the Boruta algorithm [14]. MLR is a variable selection method based on covariate adjustment that controls for confounding factors. By utilizing statistical significance (P < 0.05), MLR identifies variables independently associated with outcomes while retaining predictive factors that remain interpretable in the presence of multivariable co-occurrence, providing adjusted odds ratios (OR) and their confidence intervals. This process results in a validated variable set with both statistical and clinical significance, offering robust support for clinical decision-making [15]. LASSO regression, when handling high-dimensional data, applies penalties to feature coefficients, automatically selecting variables that have a practical impact on the prediction target. When multiple features are correlated, LASSO regression retains the most representative feature while eliminating multicollinearity interference [16,17]. To optimize model effectiveness, we employed 20-fold cross-validation to analyze baseline high-dimensional data and selected variables, which mitigated the risk of overfitting and ensured result stability. The Boruta algorithm is an ensemble feature selection framework based on random forests, where the core idea is to conduct significance testing on original variables by constructing “shadow features” [18].
The Boruta algorithm is capable of capturing complex nonlinear relationships, effectively reducing the risk of overfitting while enhancing model performance, thus providing a more reliable feature space for subsequent model construction [19]. To ensure maximum robustness of the selected features, mitigate overfitting risks, enhance clinical interpretability, and meet the requirements for both feature quantity and practicality in subsequent online calculators, this study ultimately selected the intersecting features identified by Boruta, LASSO, and MLR as the modeling feature set.
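The final step described here, keeping only the features that all three selectors agree on, is a set intersection. The sketch below illustrates it with invented toy selections; the feature names are placeholders and are not the study's actual MLR, LASSO, or Boruta outputs.

```python
def intersect_selected(*selections):
    """Return the features chosen by every selector, in sorted order."""
    common = set(selections[0])
    for chosen in selections[1:]:
        common &= set(chosen)
    return sorted(common)


# Toy selections for illustration only (not the study's results).
mlr_selected = {"GCS", "SOFA", "Sodium", "Age", "Lactate"}
lasso_selected = {"GCS", "SOFA", "Sodium", "Chloride", "Age"}
boruta_selected = {"GCS", "SOFA", "Sodium", "Chloride", "RDW", "Age"}

shared = intersect_selected(mlr_selected, lasso_selected, boruta_selected)
```

Taking the intersection is the most conservative way to combine selectors: a feature survives only if no method rejects it, which keeps the set small at the cost of possibly dropping features that a single method would have found useful.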
2.5 Model development and evaluation
We employed a random sampling method to allocate the patients included in the study into the training set and validation set in a ratio of 7:3. The data distribution in the training set was as follows: 2,355 SAD cases and 8,930 Non-SAD cases. Given that class imbalance may introduce bias into the machine learning prediction model, we conducted a supplementary analysis comparing the use of oversampling, undersampling, and the Synthetic Minority Over-sampling Technique (SMOTE) to balance the training set. Critically, SMOTE was applied exclusively to the training set. After balancing using the SMOTE algorithm, the final training set comprised 7,065 SAD cases and 7,065 Non-SAD cases. This approach effectively mitigates model bias induced by class imbalance while avoiding the risk associated with oversampling.
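The core of SMOTE is interpolation between a minority-class sample and one of its nearest minority-class neighbours. The sketch below is a simplified, dependency-free illustration of that idea and not the "DMwR" implementation used in the study: the function name, the random choice of base samples, and the toy points are all assumptions. Note that the reported 7,065 SAD cases equals 3 × 2,355, which would be consistent with two synthetic cases per original case, although the SMOTE parameters are not stated.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points, each on the line segment between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (simplified: base samples are drawn at random)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy two-feature minority class (values are invented).
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.4)]
new_points = smote(minority_points, n_synthetic=8, k=2)
```

Because every synthetic point lies between two real minority samples, SMOTE fills in the minority region rather than duplicating records, which is why it must be applied to the training set only.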
We utilized eight machine learning methods to construct models in the training set: Logistic Regression (LR), Support Vector Machine (SVM), Gradient Boosting Machine (GBM), Neural Network, Random Forest (RF), Extreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost), and Light Gradient Boosting Machine (LightGBM). In this study, we employed the “caret” package to perform model selection using 10-fold cross-validation, while incorporating a grid parameter optimization process during the model training phase, as detailed in S1 Table [20]. Subsequently, we tested the models’ performance using both internal and external validation sets, evaluating different models through the Area Under the Receiver Operating Characteristic Curve (AUC), Decision Curve Analysis (DCA), and calibration curves. Additionally, we calculated the Accuracy, Sensitivity, and F1-score for each model to further assess performance. Based on the evaluation results, we selected the best-performing model for SHAP significance analysis and generated SHAP summary plots to assess feature importance. We then conducted SHAP dependence plots to analyze the mechanisms by which features influence the prediction outcomes, and finally quantified the contribution weights of each feature within individual samples using SHAP analysis. For ease of use by clinicians, we developed a user-friendly web calculator.
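As a reference for the evaluation metrics named above, the following sketch computes the AUC from its rank interpretation and derives Accuracy, Sensitivity, Precision, and F1-score from confusion-matrix counts. It is an illustration on invented labels and scores, not the "pROC"-based code used in the study.

```python
def auc(y_true, scores):
    """AUC as the probability that a random positive case receives a
    higher score than a random negative case (ties count one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, precision, and F1 from binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Invented example: 3 positive and 5 negative cases.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
risk_scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1, 0.35]
predicted = [1 if s >= 0.5 else 0 for s in risk_scores]

example_auc = auc(labels, risk_scores)
example_metrics = classification_metrics(labels, predicted)
```

The AUC is threshold-free, whereas the other four metrics depend on the cut-off applied to the predicted risk (0.5 in this toy example).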
2.6 Statistical analysis
In this study, all statistical analyses were performed using R software (version 4.4.2) and Python software (version 3.10.6). To compare baseline characteristics between the two groups in the MIMIC-IV dataset, continuous variables conforming to a normal distribution were expressed as Mean ± Standard Deviation (x ± s), while non-normally distributed continuous variables were denoted as Median (Interquartile Range) [M (IQR)]. Continuous variables were analyzed using independent samples t-test or Wilcoxon rank-sum test. Categorical variables were expressed as numbers (percentage) [n (%)], and the χ² test or Fisher’s exact test was employed for categorical variables, with a two-sided P < 0.05 indicating statistical significance. To address the issue of sample imbalance in the training set, we utilized the SMOTE algorithm based on the “DMwR” package in R. For model development, the dataset was randomly divided into a training set (70%) and an internal validation set (30%). Variable selection was performed using MLR, LASSO regression, and the Boruta algorithm, incorporating the shared variables from these three algorithms into the machine learning model. The “pROC” and “ggplot2” packages were employed to plot the ROC curve analysis and AUC values for the internal validation set, identifying the best predictive model. Finally, SHAP analysis was utilized to interpret the optimal model.
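The 70:30 partition can be sketched as below. The text describes the split only as random, so stratifying by outcome is an assumption made for this illustration; it is worth noting that a class-stratified split with per-class rounding up reproduces the training counts reported in Section 2.5 (2,355 SAD and 8,930 Non-SAD) from 3,364 SAD and 12,756 Non-SAD patients. The function name and seed are hypothetical.

```python
import random


def stratified_split(labels, num=7, den=10, seed=42):
    """Split indices into train/validation sets, sampling num/den of each
    class (rounded up) into the training set."""
    rng = random.Random(seed)
    train, valid = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = -(-num * len(idx) // den)  # integer ceiling of num/den * n
        train += idx[:n_train]
        valid += idx[n_train:]
    return train, valid


# 3,364 SAD (label 1) and 16,120 - 3,364 = 12,756 Non-SAD (label 0).
outcome = [1] * 3364 + [0] * 12756
train_idx, valid_idx = stratified_split(outcome)
train_sad = sum(outcome[i] for i in train_idx)
train_non_sad = len(train_idx) - train_sad
```

Stratifying keeps the SAD proportion the same in both partitions, so performance on the validation set is not distorted by a chance difference in prevalence.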
3. Results
3.1 Baseline characteristics
We collected data from 16,120 sepsis patients in the MIMIC-IV database, with detailed screening processes illustrated in Fig 1. Among the included sepsis patients, 3,364 (20.9%) were diagnosed with delirium. Table 1 summarizes the characteristics of SAD and Non-SAD patients in the MIMIC-IV database, including demographics, treatment information, and laboratory test results. Overall, the SAD group had significantly higher values for Age, ICU Day, SOFA, SAPS II, Creatinine, Blood Urea Nitrogen, Anion Gap, Glucose, and Platelet count than the Non-SAD group (P < 0.05). The median age of the study population was 67 years (57, 77), and 9,532 (59%) patients were male. We divided the total population into a training set (70%) and an internal validation set (30%), utilizing the training set for model development. To explore the correlations between variables, we plotted a correlation bar chart for each variable (Fig 2A) and a heatmap of the correlations between variables (Fig 2B).
Fig 2. (A) Bar chart of variable correlation. (B) Heat map depicting the correlations among variables.
3.2 Variable selection
The 16,120 patients collected from the MIMIC-IV database were randomly divided into a training set (70%) and an internal validation set (30%). To exclude irrelevant variables, we performed an initial screening of variables using MLR, LASSO regression, and the Boruta algorithm. As shown in Table 2, univariate logistic regression (ULR) analysis indicated statistically significant associations between SAD and AKI, Type 2 Diabetes Mellitus, Heart Failure, MV, CRRT, Midazolam, VP, Age, Weight, ICU Day, SOFA, SAPS II, GCS, SBP, DBP, MAP, Temperature, Platelet, RDW, MCHC, MCV, Creatinine, Blood Urea Nitrogen, Anion Gap, Calcium, Magnesium, Chloride, Sodium, Lactate, and Glucose (P < 0.05). MLR analysis of these results suggested that these factors remain significantly associated with SAD, indicating they may be independent risk factors for SAD. The best LASSO regression lambda.1se value was confirmed to be 0.0035 through 20-fold cross-validation, resulting in the identification of 28 variables with significant predictive power (Fig 3A and 3B). Additionally, we utilized the Boruta algorithm for a more in-depth analysis to clarify key variables. This algorithm effectively distinguishes between strongly correlated and weakly correlated variables, significantly improving predictive accuracy. In the presentation of the results, the green boxes indicate shadow features automatically generated by the algorithm, which were excluded from the final analysis to focus on the most influential variables. Ultimately, we identified 40 impactful variables (Fig 3C and 3D). Finally, we generated a Venn diagram using the “ggvenn” package in R, revealing 17 features shared among the three algorithms (Fig 3E). Based on these features, we established eight machine learning predictive models.
Fig 3. (A) Variable trajectory screening via LASSO regression. (B) The LASSO model was subjected to 20-fold cross-validation to determine the optimal penalization parameter (lambda.1se). (C, D) Variables selected by the Boruta algorithm. In terms of feature importance scores, the 40 red variables are considered to be important variables. (E) The Venn diagram illustrates the features selected by Boruta, LASSO, and MLR. The intersection of the features identified by these three methods reveals 17 clinical characteristics.
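For readers unfamiliar with lambda.1se: it is the largest penalty whose mean cross-validated error lies within one standard error of the minimum error, which favours a sparser model than the error-minimizing penalty. The sketch below applies that rule to invented cross-validation results; the numbers are toy values and are unrelated to the study's LASSO path.

```python
def lambda_min_and_1se(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries.

    lambda_min minimises the mean CV error; lambda_1se is the largest
    lambda whose mean CV error is <= min error + its standard error.
    """
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[i_min] + cv_se[i_min]
    within = [lam for lam, err in zip(lambdas, cv_mean) if err <= threshold]
    return lambdas[i_min], max(within)


# Toy CV results (invented): mean error and standard error per lambda.
toy_lambdas = [0.001, 0.002, 0.004, 0.008, 0.016]
toy_cv_mean = [0.520, 0.510, 0.500, 0.506, 0.530]
toy_cv_se = [0.008, 0.008, 0.008, 0.008, 0.008]

lam_min, lam_1se = lambda_min_and_1se(toy_lambdas, toy_cv_mean, toy_cv_se)
```

In this toy example the error is minimised at 0.004, but 0.008 is still within one standard error, so the one-standard-error rule selects the stronger penalty and therefore fewer variables.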
3.3 Model performance on test and external validation datasets
Upon identifying 17 clinical features, we employed eight distinct machine learning algorithms to construct predictive models for SAD. After the training set was balanced with SMOTE, the eight models (LR, SVM, GBM, Neural Network, RF, XGBoost, AdaBoost, and LightGBM) achieved AUC values of 0.67, 0.68, 0.73, 0.69, 0.71, 0.68, 0.69, and 0.70, respectively, on the internal validation set (Fig 4A). Corresponding AUC values for the external validation set were 0.61, 0.61, 0.70, 0.72, 0.70, 0.65, 0.66, and 0.67 (Fig 4B). Detailed performance metrics, including Accuracy, Sensitivity, Specificity, Precision, and F1-Score for both internal and external validation sets, are provided in Table 3. Because the RF model showed a substantial performance gap between the training and validation sets, raising concerns about potential overfitting, the GBM model was ultimately selected as optimal. To further evaluate the GBM model’s performance, we generated calibration and DCA curves for both the internal and external validation sets (Fig 4C–4F). The GBM model’s calibration curve and DCA demonstrated robust predictive performance across the majority of threshold probability ranges and indicated significant clinical net benefit. Based on the comprehensive assessment of the above performance metrics, the GBM model was confirmed as the most suitable predictive tool for this dataset and was selected for subsequent analyses. A confusion matrix illustrating the GBM model’s classification performance for patients in the internal validation set is presented in Fig 4G. Finally, the feature importance plot for the GBM model is shown in Fig 4H.
Fig 4. (A) ROC curves for the internal validation set. (B) ROC curves for the external validation set. (C, D) Calibration curves for the internal and external validation sets, respectively. (E, F) DCA curves for the internal and external validation sets, respectively. (G) Confusion matrix for the internal validation set. (H) Feature importance plot for the GBM model in the internal validation set.
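The quantity plotted in a decision curve is the net benefit at a given threshold probability p_t, defined as TP/n − (FP/n) × p_t/(1 − p_t), compared against the "treat all" and "treat none" (net benefit 0) strategies. The sketch below computes it on invented data; it illustrates the standard formula and is not the code used to produce Fig 4E and 4F.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Invented example: 3 events among 10 patients.
events = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted_risk = [0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]

nb_model = net_benefit(events, predicted_risk, threshold=0.5)
nb_all = net_benefit_treat_all(events, threshold=0.5)
```

A model is clinically useful at a given threshold only if its net benefit exceeds both that of treating everyone and zero, the net benefit of treating no one.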
3.4 Interpretability analysis
To further explore the clinical application of the GBM model, we utilized the SHAP algorithm to quantify the contribution of each feature within the model. The feature importance bee swarm plot we created illustrates how each feature affects the model’s predictions, with SHAP values plotted on the x-axis; higher SHAP values indicate an increased likelihood of the outcome occurring, and vice versa. The y-axis lists the features, and the color of each point encodes the magnitude of the feature value through a gradient from yellow to purple, where yellow indicates high feature values and purple signifies low feature values. According to the results, lower GCS scores, lower Chloride levels, and higher ICU Day, SOFA scores, Sodium levels, RDW, MCV, Age, and the administration of Midazolam are associated with higher SHAP values, indicating a greater likelihood of delirium in sepsis patients (Fig 5A). Fig 5B displays the GBM model’s SHAP significance analysis, visualizing the ranking of feature importance. To enhance the interpretability of the model’s decision-making process at the individual level, we conducted a systematic interpretability study on two representative cases. We plotted bar charts for one SAD patient and one Non-SAD patient, where yellow represents increased risk and purple indicates reduced risk (Fig 5C and 5D). Additionally, to investigate the interaction effects among variables, we generated SHAP dependence plots. Using the GCS score as an example, the SHAP values for SOFA score, serum sodium levels, and ICU day exhibited significant variation across different GCS scores. The results revealed that patients with GCS scores in the two extreme ranges (1–3 and 14–15) consistently demonstrated significantly lower delirium incidence rates compared to patients with scores in the intermediate ranges (Fig 6A–6C).
Fig 5. (A) The SHAP summary plot of the GBM model. (B) Significance analysis of feature importance ranking via SHAP, based on the mean value. (C, D) The force plots offer individualized feature attributions for two representative examples. C: Patients with SAD; D: Patients with Non-SAD.
Fig 6. (A-C) The Y-axis represents SHAP values, while the X-axis depicts actual clinical parameters. It is noteworthy that when the SHAP value of a feature is greater than 0, it indicates an increased risk of SAD, while a negative SHAP value suggests a reduced risk.
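To make the meaning of a SHAP value concrete, the sketch below computes exact Shapley values for a three-feature toy risk function by enumerating all feature coalitions. It is an illustration under stated assumptions: the toy model, its coefficients, and the patient values are invented; features outside a coalition are replaced by a single baseline value rather than averaged over background data; and the study's GBM would in practice be explained with a tree-specific SHAP algorithm rather than brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    names = list(x)
    n = len(names)

    def value(coalition):
        mixed = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {name}) - value(members))
        phi[name] = total
    return phi


def toy_risk(f):
    """Hypothetical risk score with an interaction between GCS and SOFA."""
    interaction = 0.1 if (f["gcs"] < 10 and f["sofa"] > 8) else 0.0
    return (0.04 * (15 - f["gcs"]) + 0.02 * f["sofa"]
            + 0.005 * (f["sodium"] - 140) + interaction)


patient = {"gcs": 8, "sofa": 10, "sodium": 148}
reference = {"gcs": 15, "sofa": 4, "sodium": 140}
contributions = shapley_values(toy_risk, patient, reference)
```

The contributions sum to the difference between the patient's prediction and the reference prediction, and the interaction term is shared equally between GCS and SOFA, which is the additivity property that makes per-patient plots such as Fig 5C and 5D readable.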
3.5 Construction of the online calculator
In this study, we developed an online web calculator based on the GBM model (Fig 7) (https://risk-model.shinyapps.io/make_web/). This calculator can predict the likelihood of delirium occurring in sepsis patients within the ICU based on various clinical feature variables. Clinicians can conveniently input relevant data into this tool, facilitating the use of the model to predict the incidence of SAD and allowing for timely adjustments to treatment plans to improve patient outcomes. The example parameters can be found in S2 Table.
4. Discussion
SAD is a severe neurological syndrome that significantly increases mortality among affected patients, leading to long-term mental and cognitive impairments, and even dementia, placing a substantial burden on patients and their families [21,22]. We developed a machine learning prediction model based on GBM, which showed promising results in internal validation (AUC: 0.73). Utilizing the SHAP method, we enhanced the clinical interpretability of the model. Although several predictive models have been developed to assess the risk of SAD in the ICU, their practicality in clinical practice remains insufficient, failing to effectively translate into clinical tools [6,7]. Therefore, we created a simple web-based calculator to assist healthcare professionals in quickly identifying SAD and timely adjusting medical strategies to improve patient outcomes.
Currently, clinical recognition of SAD in the ICU remains poor, mainly because physicians have an insufficient understanding of this complication. This underscores the importance of enhancing clinical awareness. Systematic training can improve healthcare teams’ ability to recognize sepsis-associated delirium, allowing more accurate identification of early intervention opportunities and a scientific assessment of the risk-benefit ratio of treatment strategies. Commonly used delirium assessment methods in the ICU include CAM-ICU and the Intensive Care Delirium Screening Checklist (ICDSC). CAM-ICU is applicable for patients requiring mechanical ventilation, offering rapid and effective assessment; however, it is not suitable for deeply sedated or comatose patients and requires multiple evaluations for diagnosis [23–25]. In contrast, the ICDSC has broader applicability, as it does not require patient cooperation and is based on nurses observing patients’ behavior over a 24-hour period. Nonetheless, it relies on nurses’ subjective judgment, which may lead to lower consistency among different evaluators and increase the nursing workload [26–28]. The time-consuming nature of both methods may delay treatment decisions and compromise patient safety. To address this issue, we constructed a web-based calculator using the GBM model to enable clinicians to detect SAD quickly and early. It is important to emphasize that, although preliminary validation shows promising performance, multi-center, prospective cohort studies providing external validation with independent datasets are still required to ensure the model’s applicability in various clinical scenarios. This will help objectively assess the model’s generalizability and diagnostic accuracy, providing evidence to support its clinical translation.
Feature selection is a crucial step in building predictive models, as its validity directly affects the model’s predictive performance [29]. We obtained a substantial sample from the MIMIC-IV database and utilized MLR to identify independent risk factors for SAD patients. LASSO regression was applied to process features, avoiding multicollinearity and reducing the risk of overfitting [30]. Building on this approach, we employed the Boruta algorithm for feature selection, leveraging its rigorous statistical significance-based filtering mechanism to retain critical features. This methodology effectively curbs model complexity while enhancing generalization capability on independent datasets. The cross-verified features identified through these three complementary methods deliver optimal predictive power and generalizability. This strategy ensures model robustness and parsimony, simultaneously providing an operationally efficient, computationally feasible, and clinically actionable feature set for online calculator implementation. In the realm of healthcare big data, machine learning algorithms demonstrate significant advantages in handling complex medical data [31]. Compared to traditional statistical models, the nonlinear modeling capabilities of machine learning can capture higher-order interactions between variables and enable multivariable synchronous analysis of high-dimensional data through parallel computing frameworks, thereby increasing the ability to predict diseases [32]. In this study, we compared the performance of eight machine learning models. Although RF theoretically mitigates overfitting risk through bagging and random feature selection, it can still overfit substantially in practical applications, particularly with noisy datasets or limited sample sizes. To address this, we implemented rigorous optimization procedures, including grid search, 10-fold cross-validation, and the SMOTE algorithm to alleviate sample insufficiency.
Nevertheless, even after applying SMOTE to augment the training dataset, the RF model continued to demonstrate a tendency toward overfitting. Consequently, based on comprehensive performance evaluation, we ultimately identified GBM as the optimal model for this study.
In prior work, Zhang Y et al. used LASSO regression to identify key predictors for SAD patients and subsequently constructed an XGBoost prediction model based on these factors. Their study ultimately identified lower GCS scores, sedative medication use, and concomitant AKI as risk factors for SAD development, findings consistent with the results of the present study. Their model demonstrated robust predictive performance on both the internal validation set (AUC = 0.793) and the external validation set (AUC = 0.701). By contrast, the AUC value of the model developed in the current study is comparatively lower. This discrepancy may stem from differences in data preprocessing pipelines and the specific feature sets incorporated. Notably, Zhang et al.’s model did not include core variables such as lactate levels and ICU_Day. The omission of such critical information may have limited its predictive capability [33]. On the other hand, Gu Q et al. employed MLR to identify higher SOFA scores, elevated lactate levels, elevated phosphate levels, and MV use as independent risk factors for SAD. They subsequently developed a nomogram for risk prediction based on these factors. However, unlike the present study, their model lacks an external validation cohort. Consequently, its generalizability remains unassessed, posing challenges for real-world clinical application and implementation [34].
GBM is one of the most representative ensemble learning methods. It builds a model by iteratively adding weak learners, each fitted along the negative gradient of the loss function to reduce the remaining prediction error, a procedure loosely analogous to the progressive learning and collaborative optimization found in biological systems [35–37]. As an efficient predictive model, GBM can capture nonlinear relationships and complex interaction effects in the data and is widely used in the construction of clinical prediction models. The clinical interpretability of machine learning models is critical for medical practice; however, interpretability has long been one of the core challenges in this field [38]. To address this issue, we adopted the SHAP method to analyze features and enhance model interpretability [39]. Compared with traditional weight-based explanatory methods, SHAP exhibits superior consistency and greater stability across various models [40]. In this study, SHAP value analysis made the model considerably easier to interpret than the coefficient-based interpretation of traditional generalized linear regression models. SHAP values not only quantify the contribution of each feature to the predicted outcome but also provide intuitive visualizations through feature importance plots [41]. This approach clarifies the decision-making mechanism of the model by showing the specific impact of each feature on its predictions, thereby improving transparency.
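As an illustration of the boosting procedure described above, in which weak learners are added iteratively to reduce the remaining error, the following minimal Python sketch fits gradient boosting with one-split regression stumps under squared-error loss. It is a teaching sketch, not the GBM implementation or hyperparameters used in this study. The helper names (`fit_stump`, `fit_gbm`, `predict`, `mse`) and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict(model, x):
    """Base value plus the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    total = base
    for threshold, left_mean, right_mean in stumps:
        total += learning_rate * (left_mean if x <= threshold else right_mean)
    return total


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    model = (sum(ys) / len(ys), learning_rate, [])
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient equals the residual.
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        model[2].append(fit_stump(xs, residuals))
    return model


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Invented toy data with a non-monotonic shape; not taken from the study.
xs = list(range(1, 14))
ys = [0.10, 0.20, 0.40, 0.60, 0.70, 0.80, 0.80,
      0.70, 0.60, 0.40, 0.30, 0.10, 0.05]

baseline = fit_gbm(xs, ys, n_rounds=0)   # predicts the mean only
boosted = fit_gbm(xs, ys, n_rounds=50)   # mean plus 50 shrunken stumps
```

Comparing `mse(baseline, xs, ys)` with `mse(boosted, xs, ys)` shows the training error falling as stumps are added. The learning rate shrinks each step, which is one of the usual levers for limiting overfitting in boosted models.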
According to the SHAP feature importance plot, a low GCS score is the most significant risk factor for SAD. Studies have shown that low GCS scores, high SOFA scores, advanced age, prolonged ICU stay, hypernatremia, and the use of midazolam are all risk factors for SAD. The GCS, originally designed by Graham Teasdale and Bryan Jennett at the University of Glasgow for assessing traumatic brain injury, is among the strongest predictive features. It quantifies eye-opening, verbal, and motor responses to objectively evaluate the degree of consciousness impairment [42]. In general, the lower the GCS score, the higher the risk of delirium; research indicates that for each point decrease in GCS, the risk of delirium increases by approximately 34% [43,44]. Notably, the SHAP dependence plot further reveals a complex non-linear association between the GCS score and SAD. Clinically, GCS scores at the lowest end of the scale (the minimum total score is 3) typically indicate severe brain injury, while scores of 14–15 signify a state of clear consciousness. Interestingly, the incidence of delirium in both of these extreme scoring groups is significantly lower than in patients with intermediate scores. This pattern is consistent with previously reported delirium prediction models [33]. The SOFA score is an effective tool for assessing organ dysfunction in sepsis patients, and its central nervous system component relies on the GCS. By treating GCS as an independent variable rather than relying only on the aggregate SOFA score, we were able to reflect brain function impairment more accurately. An elevated SOFA score signifies systemic inflammatory response, tissue hypoxia, and organ dysfunction, with central nervous system dysfunction being the most common. These factors are interrelated via a systemic inflammation-brain injury axis, in which inflammatory factors can disrupt the blood-brain barrier and subsequently induce delirium [45–47]. Patients with a SOFA score exceeding 9 have been reported to have a probability of SAD greater than 70%, and a high SOFA score is an independent risk factor for delirium, making it a core indicator for predicting SAD [48].
Advanced age and prolonged ICU stay are both significant risk factors for delirium. Older patients often have multiple chronic diseases, malnutrition, sensory impairments, and cognitive deficits, which reduce the brain’s physiological reserve and thereby heighten the probability of delirium [49]. Patients who remain in the ICU for extended periods are more likely to be exposed to mechanical ventilation, sedative medications, sleep deprivation, and infections, which can further trigger delirium. Delirious patients, in turn, commonly exhibit complications such as agitation, respiratory dysfunction, and infections, which may significantly prolong their ICU stays, creating a vicious cycle. Studies have shown that remaining in the ICU for more than 7 days is associated with a transition of delirium from an “acute” to a “persistent” state, with about 30%–40% of patients experiencing long-term cognitive sequelae [50,51]. Additionally, midazolam, a benzodiazepine, suppresses the central nervous system, potentially interfering with cholinergic neurotransmission and disrupting sleep architecture, thus increasing the risk of delirium. Our findings indicate that the pre-illness use of midazolam significantly raises the risk of delirium, consistent with recent studies [52,53]. Hypernatremia also contributes to an increased incidence of delirium. It affects sodium-potassium pump function, leading to abnormal neural excitability and interfering with the release of inhibitory neurotransmitters such as gamma-aminobutyric acid. Furthermore, hypernatremia decreases brain energy metabolism by impairing glucose utilization and ATP production, and the resulting neuronal energy deficit further contributes to the onset of delirium [54–56].
In summary, the SHAP method provides significant support for personalized diagnosis and treatment of SAD patients. By quantifying the contribution of each feature to the predicted outcome, it makes the model’s predictions interpretable and offers intuitive decision-making support for clinicians. The SHAP values show how individual clinical features shift a patient’s predicted risk of SAD, helping physicians identify key risk factors and formulate personalized intervention strategies. Integrating GBM with SHAP therefore combines predictive accuracy with interpretability. The visualization of SHAP values allows clinicians to gain deeper insight into important features and their interactions, increasing the transparency and credibility of medical decisions and providing a scientific basis for the personalized risk management of SAD. This interpretable machine learning approach holds substantial practical value in clinical settings and supports the development of precision medicine.
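The per-feature contributions discussed above can be illustrated with a small exact Shapley computation. The following Python sketch averages each feature’s marginal contribution over all feature orderings for a single patient relative to a reference patient. It is a simplified illustration under stated assumptions: `toy_risk` is a hypothetical scoring rule and not the fitted GBM, the feature names and values are invented, and practical SHAP tools for tree models rely on more efficient algorithms and a background dataset rather than one reference point.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    names = list(instance)
    contributions = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)            # start from the reference patient
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # reveal one feature at a time
            value = model(current)
            contributions[name] += value - previous
            previous = value
    return {name: total / len(orderings)
            for name, total in contributions.items()}


def toy_risk(features):
    """Hypothetical scoring rule for illustration; NOT the study's model."""
    score = 0.05 * (15 - features["gcs"]) + 0.03 * features["sofa"]
    if features["gcs"] < 13 and features["sofa"] > 8:
        score += 0.10                       # interaction between two features
    return score                            # sodium is deliberately unused


# Invented reference patient and patient of interest.
baseline = {"gcs": 15, "sofa": 2, "sodium": 140}
patient = {"gcs": 10, "sofa": 10, "sodium": 150}

phi = shapley_values(toy_risk, patient, baseline)
```

Two properties of this decomposition are worth noting. The contributions in `phi` add up to the difference between the patient’s score and the reference score, and a feature that the scoring rule never uses (here `sodium`) receives a contribution of zero. The interaction term is shared between `gcs` and `sofa` rather than being assigned to either one alone.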
This study has several limitations. First, as a retrospective analysis, it cannot establish causal relationships between features and outcomes. Second, this study did not include key inflammatory biomarkers such as procalcitonin (PCT) and interleukin-6 (IL-6), which are closely associated with the pathophysiology of sepsis and delirium. Because of their limited availability and high rates of missingness in the MIMIC-IV database, these variables could not be incorporated into the current model. This omission may have constrained the model’s ability to capture certain inflammatory dimensions of SAD. In future work, we plan to conduct prospective, multi-center studies that systematically collect longitudinal data on biomarkers such as PCT and IL-6. Integrating these markers may improve the predictive performance and biological interpretability of the model, thereby enhancing its clinical utility and generalizability. Lastly, limitations in sample size may have reduced the performance of the machine learning algorithms. Future research should establish a systematic data collection mechanism to expand the size and diversity of the sample, thereby improving the predictive accuracy and reliability of the model and strengthening the clinical applicability of the findings.
5. Conclusion
In this study, we developed and validated an interpretable machine learning model based on the GBM algorithm to predict SAD using clinical data from the MIMIC-IV database. SHAP analysis enhanced the model’s interpretability by identifying key predictors such as GCS, ICU stay duration, chloride, sodium, and SOFA score. A web-based risk calculator was also constructed to facilitate bedside risk assessment. The model demonstrates promising predictive accuracy and practical utility, offering a valuable tool for early identification of SAD and supporting individualized clinical decision-making.
Supporting information
S1 Table. Hyperparameter settings for eight models.
Gradient Boosting Machine: GBM; Support Vector Machine: SVM; Random Forest: RF; Extreme Gradient Boosting: XGBoost; Adaptive Boosting: AdaBoost; Light Gradient Boosting Machine: LightGBM.
https://doi.org/10.1371/journal.pone.0323831.s001
(DOCX)
S2 Table. Example input parameters for the online risk calculator predicting SAD.
https://doi.org/10.1371/journal.pone.0323831.s002
(DOCX)
S3 Table. Baseline characteristics of SAD and non-SAD patients in the external validation cohort.
https://doi.org/10.1371/journal.pone.0323831.s003
(DOCX)
Acknowledgments
The authors sincerely thank the MIMIC team for their outstanding contributions in building and maintaining the database.
References
- 1. Tang J, Huang J, He X, Zou S, Gong L, Yuan Q, et al. The prediction of in-hospital mortality in elderly patients with sepsis-associated acute kidney injury utilizing machine learning models. Heliyon. 2024;10(4):e26570. pmid:38420451
- 2. Chung HY, Wickel J, Brunkhorst FM, Geis C. Sepsis-associated encephalopathy: from delirium to dementia? J Clin Med. 2020;9(3).
- 3. Fleischmann C, Scherag A, Adhikari NKJ, Hartog CS, Tsaganos T, Schlattmann P, et al. Assessment of global incidence and mortality of hospital-treated sepsis. Current estimates and limitations. Am J Respir Crit Care Med. 2016;193(3):259–72. pmid:26414292
- 4. Kempker JA, Martin GS. The changing epidemiology and definitions of sepsis. Clin Chest Med. 2016;37(2):165–79.
- 5. Ebersoldt M, Sharshar T, Annane D. Sepsis-associated delirium. Intensive Care Med. 2007;33(6):941–50. pmid:17410344
- 6. Atterton B, Paulino MC, Povoa P, Martin-Loeches I. Sepsis associated delirium. Medicina (Kaunas). 2020;56(5).
- 7. Tokuda R, Nakamura K, Takatani Y, Tanaka C, Kondo Y, Ohbe H. Sepsis-associated delirium: a narrative review. J Clin Med. 2023;12(4).
- 8. Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol. 2022;23(1):40–55. pmid:34518686
- 9. Shamout F, Zhu T, Clifton DA. Machine learning for clinical outcome prediction. IEEE Rev Biomed Eng. 2021;14:116–26. pmid:32746368
- 10. Deo RC. Machine learning in medicine. Circulation. 2015;132(20):1920–30.
- 11. Hunter DJ, Holmes C. Where medical statistics meets artificial intelligence. N Engl J Med. 2023;389(13):1211–9.
- 12. Preliminary criteria for the classification of systemic sclerosis (scleroderma). Subcommittee for scleroderma criteria of the American Rheumatism Association Diagnostic and Therapeutic Criteria Committee. Arthritis Rheum. 1980;23(5):581–90. pmid:7378088
- 13. Vitali C, Bombardieri S, Moutsopoulos HM, Coll J, Gerli R, Hatron PY, et al. Assessment of the European classification criteria for Sjögren’s syndrome in a series of clinically defined cases: results of a prospective multicentre study. The European Study Group on Diagnostic Criteria for Sjögren’s Syndrome. Ann Rheum Dis. 1996;55(2):116–21. pmid:8712861
- 14. Sanchez-Pinto LN, Venable LR, Fahrenbach J, Churpek MM. Comparison of variable selection methods for clinical predictive modeling. Int J Med Inform. 2018;116:10–7. pmid:29887230
- 15. Pate A, Riley RD, Collins GS, van Smeden M, Van Calster B, Ensor J, et al. Minimum sample size for developing a multivariable prediction model using multinomial logistic regression. Stat Methods Med Res. 2023;32(3):555–71. pmid:36660777
- 16. Frost HR, Amos CI. Gene set selection via LASSO penalized regression (SLPR). Nucleic Acids Res. 2017;45(12):e114. pmid:28472344
- 17. Lee S, Gornitz N, Xing EP, Heckerman D, Lippert C. Ensembles of lasso screening rules. IEEE Trans Pattern Anal Mach Intell. 2018;40(12):2841–52. pmid:29989981
- 18. Wang X, Ren J, Ren H, Song W, Qiao Y, Zhao Y, et al. Diabetes mellitus early warning and factor analysis using ensemble Bayesian networks with SMOTE-ENN and Boruta. Sci Rep. 2023;13(1):12718. pmid:37543637
- 19. Dang T, Fermin ASR, Machizawa MG. oFVSD: a Python package of optimized forward variable selection decoder for high-dimensional neuroimaging data. Front Neuroinform. 2023;17:1266713. pmid:37829329
- 20. Poldrack RA, Huckins G, Varoquaux G. Establishment of best practices for evidence for prediction: a review. JAMA Psychiatry. 2020;77(5):534–40. pmid:31774490
- 21. Gao Q, Hernandes MS. Sepsis-associated encephalopathy and blood-brain barrier dysfunction. Inflammation. 2021;44(6):2143–50. pmid:34291398
- 22. Sonneville R, Benghanem S, Jeantin L, de Montmollin E, Doman M, Gaudemer A, et al. The spectrum of sepsis-associated encephalopathy: a clinical perspective. Crit Care. 2023;27(1):386. pmid:37798769
- 23. Kotfis K, Marra A, Ely EW. ICU delirium - a diagnostic and therapeutic challenge in the intensive care unit. Anaesthesiol Intensive Ther. 2018;50(2):160–7.
- 24. Miranda F, Gonzalez F, Plana MN, Zamora J, Quinn TJ, Seron P. Confusion Assessment Method for the Intensive Care Unit (CAM-ICU) for the diagnosis of delirium in adults in critical care settings. Cochrane Database Syst Rev. 2023;11(11):CD013126. pmid:37987526
- 25. Tomasi CD, Grandi C, Salluh J, Soares M, Giombelli VR, Cascaes S, et al. Comparison of CAM-ICU and ICDSC for the detection of delirium in critically ill patients focusing on relevant clinical outcomes. J Crit Care. 2012;27(2):212–7. pmid:21737237
- 26. Fagundes JA de O, Tomasi CD, Giombelli VR, Alves SC, de Macedo RC, Topanotti MFL, et al. CAM-ICU and ICDSC agreement in medical and surgical ICU patients is influenced by disease severity. PLoS One. 2012;7(11):e51010. pmid:23226448
- 27. Krewulak KD, Rosgen BK, Ely EW, Stelfox HT, Fiest KM. The CAM-ICU-7 and ICDSC as measures of delirium severity in critically ill adult patients. PLoS One. 2020;15(11):e0242378. pmid:33196655
- 28. von Hofen-Hohloch J, Awissus C, Fischer MM, Michalski D, Rumpf J-J, Classen J. Delirium screening in neurocritical care and stroke unit patients: a pilot study on the influence of neurological deficits on CAM-ICU and ICDSC outcome. Neurocrit Care. 2020;33(3):708–17. pmid:32198728
- 29. Chowdhury MZI, Turin TC. Variable selection strategies and its importance in clinical prediction modelling. Fam Med Community Health. 2020;8(1):e000262. pmid:32148735
- 30. Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med. 1997;16(4):385–95. pmid:9044528
- 31. Hale AT, Stonko DP, Brown A, Lim J, Voce DJ, Gannon SR, et al. Machine-learning analysis outperforms conventional statistical models and CT classification systems in predicting 6-month outcomes in pediatric patients sustaining traumatic brain injury. Neurosurg Focus. 2018;45(5):E2. pmid:30453455
- 32. Nishi H, Oishi N, Ishii A, Ono I, Ogura T, Sunohara T. Predicting clinical outcomes of large vessel occlusion before mechanical thrombectomy using machine learning. Stroke. 2019;50(9):2379–88.
- 33. Zhang Y, Hu J, Hua T, Zhang J, Zhang Z, Yang M. Development of a machine learning-based prediction model for sepsis-associated delirium in the intensive care unit. Sci Rep. 2023;13(1):12697. pmid:37542106
- 34. Gu Q, Yang S, Fei D, Lu Y, Yu H. A nomogram for predicting sepsis-associated delirium: a retrospective study in MIMIC III. BMC Med Inform Decis Mak. 2023;23(1):184. pmid:37715189
- 35. Teramoto R. Balanced gradient boosting from imbalanced data for clinical outcome prediction. Stat Appl Genet Mol Biol. 2009;8:Article20. pmid:19409064
- 36. Zeng X. Length of stay prediction model of indoor patients based on light gradient boosting machine. Comput Intell Neurosci. 2022;2022:9517029.
- 37. Zhou S, Wang S, Wu Q, Azim R, Li W. Predicting potential miRNA-disease associations by combining gradient boosting decision tree with logistic regression. Comput Biol Chem. 2020;85:107200. pmid:32058946
- 38. Racine AM, Tommet D, D’Aquila ML, Fong TG, Gou Y, Tabloski PA, et al. Machine learning to develop and internally validate a predictive model for post-operative delirium in a prospective, observational clinical cohort study of older surgical patients. J Gen Intern Med. 2021;36(2):265–73. pmid:33078300
- 39. Song Y, Zhang D, Wang Q, Liu Y, Chen K, Sun J, et al. Prediction models for postoperative delirium in elderly patients with machine-learning algorithms and SHapley Additive exPlanations. Transl Psychiatry. 2024;14(1):57. pmid:38267405
- 40. Gong K, Lee HK, Yu K, Xie X, Li J. A prediction and interpretation framework of acute kidney injury in critical care. J Biomed Inform. 2021;113:103653. pmid:33338667
- 41. Nohara Y, Matsumoto K, Soejima H, Nakashima N. Explanation of machine learning models using shapley additive explanation and application for real data in hospital. Comput Methods Programs Biomed. 2022;214:106584. pmid:34942412
- 42. Rabiu TB. Revisiting the eye opening response of the Glasgow Coma Scale. Indian J Crit Care Med. 2011;15(1):58–9. pmid:21633551
- 43. Bodien YG, Barra A, Temkin NR, Barber J, Foreman B, Vassar M, et al. Diagnosing level of consciousness: the limits of the glasgow coma scale total score. J Neurotrauma. 2021;38(23):3295–305. pmid:34605668
- 44. Wang L, Ma X, Zhou G, Gao S, Pan W, Chen J, et al. SOFA in sepsis: with or without GCS. Eur J Med Res. 2024;29(1):296. pmid:38790024
- 45. Myrstad M, Kuwelker K, Haakonsen S, Valebjorg T, Langeland N, Kittang BR, et al. Delirium screening with 4AT in patients aged 65 years and older admitted to the Emergency Department with suspected sepsis: a prospective cohort study. Eur Geriatr Med. 2022;13(1):155–62.
- 46. Qian X, Sheng Y, Jiang Y, Xu Y. Associations of serum lactate and lactate clearance with delirium in the early stage of ICU: a retrospective cohort study of the MIMIC-IV database. Front Neurol. 2024;15:1371827. pmid:39011361
- 47. Zhang Z, Guo L, Jia L, Duo H, Shen L, Zhao H. Factors contributing to sepsis-associated encephalopathy: a comprehensive systematic review and meta-analysis. Front Med (Lausanne). 2024;11:1379019. pmid:38835794
- 48. Zhao Q, Xiao J, Liu X, Liu H. The nomogram to predict the occurrence of sepsis-associated encephalopathy in elderly patients in the intensive care units: a retrospective cohort study. Front Neurol. 2023;14:1084868.
- 49. Tang D, Ma C, Xu Y. Interpretable machine learning model for early prediction of delirium in elderly patients following intensive care unit admission: a derivation and validation study. Front Med (Lausanne). 2024;11:1399848. pmid:38828233
- 50. Crimi C, Bigatello LM. The clinical significance of delirium in the intensive care unit. Transl Med UniSa. 2012;2:1–9. pmid:23905039
- 51. Kukreja D, Günther U, Popp J. Delirium in the elderly: current problems with increasing geriatric age. Indian J Med Res. 2015;142(6):655–62. pmid:26831414
- 52. Peng W, Shimin S, Hongli W, Yanli Z, Ying Z. Delirium risk of dexmedetomidine and midazolam in patients treated with postoperative mechanical ventilation: a meta-analysis. Open Med (Wars). 2017;12:252–6. pmid:28828407
- 53. Shi H-J, Yuan R-X, Zhang J-Z, Chen J-H, Hu A-M. Effect of midazolam on delirium in critically ill patients: a propensity score analysis. J Int Med Res. 2022;50(4):3000605221088695. pmid:35466751
- 54. Ali MA, Hashmi M, Ahmed W, Raza SA, Khan MF, Salim B. Incidence and risk factors of delirium in surgical intensive care unit. Trauma Surg Acute Care Open. 2021;6(1):e000564. pmid:33748426
- 55. Hong L, Shen X, Shi Q, Song X, Chen L, Chen W. Association between hypernatremia and delirium after cardiac surgery: a nested case-control study. Front Cardiovasc Med. 2022;9:828015.
- 56. Zhao L, Wang Y, Ge Z, Zhu H, Li Y. Mechanical learning for prediction of sepsis-associated encephalopathy. Front Comput Neurosci. 2021;15:739265. pmid:34867250