Abstract
Background
Postoperative pulmonary complications (POPC) are common after general anaesthesia and are a major cause of increased morbidity and mortality in surgical patients. However, prevention and treatment methods for POPC that are considered effective tie up human and technical resources. Therefore, the planned research project aims to create a prediction model that enables the reliable identification of high-risk patients immediately after surgery based on a tailored machine learning algorithm.
Methods
This clinical cohort study will follow the TRIPOD statement for multivariable prediction model development. Development of the prognostic model will require 512 patients undergoing elective surgery under general anaesthesia. Besides the collection of perioperative routine data, standardised lung sonography will be performed postoperatively in the recovery room on each patient. During the postoperative course, patients will be examined in a structured manner on postoperative days 1, 3 and 7 to detect POPC. The endpoints determined in this way, together with the clinical and imaging data collected, will then be used to train a machine learning model based on neural networks and ensemble methods to predict POPC in the early postoperative phase.
Discussion
In the perioperative setting, detecting POPC before they become clinically manifest is desirable. This would ensure optimal patient care and resource allocation and help initiate adequate patient treatment after being transferred from the recovery room to the ward. A reliable prediction algorithm based on machine learning holds great potential to improve postoperative outcomes.
Citation: Trautwein B, Beer M, Blobner M, Jungwirth B, Kagerbauer SM, Götz M (2025) Preventing postoperative pulmonary complications by establishing a machine-learning assisted approach (PEPPERMINT): Study protocol for the creation of a risk prediction model. PLoS One 20(8): e0329076. https://doi.org/10.1371/journal.pone.0329076
Editor: Silvia Fiorelli, Sapienza University of Rome: Universita degli Studi di Roma La Sapienza, ITALY
Received: July 15, 2024; Accepted: July 3, 2025; Published: August 19, 2025
Copyright: © 2025 Trautwein et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data cannot be shared publicly because of Article 9 of the EU General Data Protection Regulation (GDPR). According to Article 9 of the EU GDPR, health-related data are classified as sensitive personal data, and their processing is generally prohibited unless specific exceptions apply. Additionally, under the German Federal Data Protection Act (Bundesdatenschutzgesetz, BDSG), the processing of health data is subject to strict requirements, particularly regarding pseudonymization and the risk of re-identification. Even with pseudonymization, there remains a risk of re-identification, especially in a dataset obtained from a special patient population in a defined timeframe at a single hospital. Therefore, publishing these data in an open repository would not comply with data protection regulations and could compromise patient confidentiality. For this reason, we are not allowed to publish the raw data publicly. However, we fully agree that aggregated or properly anonymized data, which are no longer considered personal data under the GDPR (Recital 26), can be shared without restrictions, and we will make such data available in a public repository. To promote transparency and reproducibility, we will make the following resources available via a public repository (e.g., GitHub) or as supplemental material of the final publication: the full source code and model architecture; a trained version of the predictive model; a fully anonymized test dataset for evaluation purposes; and detailed aggregated statistics describing the dataset. Furthermore, researchers who meet the criteria for access to confidential data will be granted controlled access to the raw data upon reasonable request and in compliance with Article 9 of the GDPR, the BDSG, and local ethical approvals (Ethics committee University of Ulm, mail: ethik-kommission@uni-ulm.de).
Funding: The study is funded by the Department of Anaesthesiology and Intensive Care Medicine of the University Hospital Ulm and the associated study centre. The necessary equipment, including ultrasound devices and tablets, as well as staffing, is available on site. Intramural funding is used for the resources that the Department of Radiology requires for data evaluation and model development, including powerful computers and scientific staff. Furthermore, the study team of the PEPPERMINT study has submitted a grant application to Deutsche Forschungsgemeinschaft (DFG), which is currently being evaluated. The funders provide financial support only and have no role in the design, management, analysis, interpretation of the data or reporting of this study. No other institution or industrial company was or will be involved in financing, planning or conducting the study.
Competing interests: I have read the journal's policy and the authors of this manuscript have the following competing interests: B. Jungwirth and S. Kagerbauer received grants from Löwenstein Medical Innovation (Berlin, Germany). M. Blobner received research support from MSD (Haar, Germany), fees for consultancy or lectures from GE Healthcare (Helsinki, Finland), Grünenthal (Aachen, Germany), and Senzime (Landshut, Germany). This does not alter our adherence to PLOS ONE policies on sharing data and materials.
Abbreviations: ARDS, Acute respiratory distress syndrome; ARISCAT, assess respiratory risk in surgical patients in Catalonia; AI, Artificial intelligence; AUPRC, area under the precision recall curve; AUROC, Area under the receiver operating characteristic; CNN, Convolutional Neural Network; CPAP, Continuous positive airway pressure; CRP, C-reactive protein; DCA, Decision curve analysis; Diast., Diastolic; DICOM, Digital imaging and communications in medicine; DFG, Deutsche Forschungsgemeinschaft; EPCO, European perioperative clinical outcome; FiO2, Fraction of inspired oxygen; ICU, Intensive care unit; ML, Machine learning; NIV, Non-invasive ventilation; PACS, Picture archiving and communication system; PACU, Post-anesthesia care unit; paCO2, Partial pressure of arterial carbon dioxide; paO2, Partial pressure of arterial oxygen; PCT, Procalcitonin; PDMS, Patient data management system; POPC, Postoperative pulmonary complications; SQL, Structured query language; StEP, Standardised endpoints for perioperative medicine; Syst., Systolic; TRIPOD, Transparent reporting of a multivariable prediction model for individual prognosis and diagnosis; QoR-9, Quality of recovery-9
Introduction
The incidence of postoperative pulmonary complications (POPC) ranges from 9% to 40%, depending on the surgical procedure and the definition used [1]. Therefore, the introduction of a standardised definition of the outcome “postoperative pulmonary complications”, developed by the Standardised Endpoints for Perioperative Medicine (StEP) collaboration in 2018, represents a prerequisite for future studies in this field [1]. Even supposedly minor complications have the potential to significantly increase the length of hospital stay [2]. Various preoperative risk factors are known but usually cannot be modified. As POPC are the main cause of postoperative morbidity and mortality and the reduction of perioperative mortality is dependent on early recognition and treatment [3], accurate prediction is of paramount interest. There are only a few clinical scoring systems [4]; the currently best-evaluated preoperative score for predicting postoperative pulmonary complications (ARISCAT: Assess Respiratory Risk in Surgical Patients in Catalonia, Table 1) has sufficient sensitivity but lacks specificity [5]. A first retrospective study using machine-learning (ML) methods for determining risk for pneumonia and pulmonary embolism using pre- and intraoperative routine data has shown good accuracy [6]. However, the study shows high specificity but low sensitivity, which could result in overtreatment in clinical practice. In our study, we aim to create a high-precision decision-support tool for perioperative physicians to identify high-risk patients at an early stage.
However, ML algorithms based only on routine clinical data depend highly on data quality and comprehensiveness. Algorithms based on standardised imaging data are easier to transfer to other facilities and implement in routine clinical practice. Therefore, adding image analysis might be beneficial, especially as sonography is becoming increasingly important as a non-invasive examination method that can be performed at the bedside. Various sonographic scores and models have been developed to predict pulmonary complications [7]. Image processing methods and machine learning, particularly deep learning, are also increasingly used in ultrasound diagnostics [8,9]. Augmented algorithms using pre- and intraoperative clinical information in addition to ultrasound imaging data may provide better predictive accuracy than the respective individual methods. However, to our knowledge, combining routine clinical data and ultrasound imaging data to develop a predictive machine-learning algorithm has not yet been tested. In addition, prospective clinical evaluation of machine-learning algorithm-based prediction models, which is planned herein, is lacking to date.
Measures for preventing POPC, such as postoperative non-invasive ventilation and physiotherapy, are known and considered effective [10,11] but are probably not consistently applied in clinical routine due to the increased demand, especially for human resources.
This study aims to combine pre- and intraoperative data with lung ultrasound imaging in the recovery room to develop an ML-based risk score for POPC. A precise score that reliably identifies patients at risk in the early postoperative phase and simultaneously avoids overtreatment can ensure adequate personalised treatment of postoperative patients.
Materials and Methods
Trial registration
Name of the registry: ClinicalTrials.gov
Registration ID: NCT05789953
Approval date: 29/03/2023
Ethics approval
Name: Ethics committee University of Ulm
Approval Number: 369/22
Approval date: 22/12/2022
Head of committee: Prof. Dr. Florian Steger, Oberberghof 7, 89081 Ulm, Germany
Mail: ethik-kommission@uni-ulm.de
Homepage: https://www.uni-ulm.de/einrichtungen/ethikkommission-der-universitaet-ulm/
Written informed consent to participate will be obtained from all participants.
Objectives
The main hypothesis of the PEPPERMINT study is that a patient’s risk of POPC can be reliably predicted using a machine-learning algorithm and that the predictive accuracy of the algorithm outperforms common clinical scoring systems.
The primary objective is to develop a machine-learning algorithm based on immediately postoperatively obtained lung ultrasound imaging data of adult patients undergoing surgery in general anaesthesia to predict the risk of postoperative pulmonary complications. This model is intended to provide better predictive ability than the currently best-established clinical score, the ARISCAT [5] or a machine learning model solely based on clinical routine data.
The secondary objective is to investigate whether improving model performance by adding clinical routine parameters to the imaging data is possible.
Furthermore, the optimal risk threshold for an intervention will be determined in case of a clinical application of the model.
Further objectives include identifying patient-specific risk factors for POPC through analysis of the collected routine clinical data and modification of the models created to predict the secondary endpoints of hospital length of stay, in-hospital mortality and postoperative quality of recovery.
Trial design
The PEPPERMINT study is a prospective, single-center clinical cohort study designed to develop and evaluate a risk prediction model for POPC. The study follows the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) guidelines for multivariable model development and validation [12].
A total of 512 patients will be enrolled based on sample size calculation. For each patient, both clinical data and lung ultrasound images will be prospectively collected. The primary endpoint is the occurrence of any postoperative pulmonary complication, as defined by standardised criteria.
Three predictive models will be developed: first, a model based on deep learning using lung ultrasound images; second, a model based solely on clinical variables using common frameworks like automated machine learning (AutoML); and finally, a combined model integrating both clinical and imaging features.
The dataset will be split using stratified, patient-wise sampling into non-overlapping training and hold-out test sets. Internal model performance will be assessed using k-fold cross-validation within the training set, and final evaluation will be conducted on the independent test set. Full details of the modelling and validation procedures are provided in the statistical methods.
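The splitting scheme described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 20% hold-out fraction, the five folds, the random seed, and the function names `stratified_patient_split` and `stratified_k_folds` are hypothetical and are not taken from the protocol. The key point is that patients, not individual images, are assigned to a set, so that all 12 images and videos of one patient always stay together.

```python
import random
from collections import defaultdict


def stratified_patient_split(labels, test_fraction=0.2, seed=0):
    """Split patient IDs into train and hold-out test sets.

    labels: dict mapping patient_id -> outcome (1 = POPC, 0 = no POPC).
    The split is done per outcome class so both sets keep a similar POPC rate.
    test_fraction and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for patient_id, outcome in labels.items():
        by_class[outcome].append(patient_id)

    train, test = [], []
    for outcome in sorted(by_class):
        patient_ids = sorted(by_class[outcome])
        rng.shuffle(patient_ids)
        n_test = round(len(patient_ids) * test_fraction)
        test.extend(patient_ids[:n_test])
        train.extend(patient_ids[n_test:])
    return sorted(train), sorted(test)


def stratified_k_folds(train_ids, labels, k=5, seed=0):
    """Distribute training patients over k folds, class by class (round-robin)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for patient_id in train_ids:
        by_class[labels[patient_id]].append(patient_id)

    folds = [[] for _ in range(k)]
    for outcome in sorted(by_class):
        patient_ids = by_class[outcome]
        rng.shuffle(patient_ids)
        for position, patient_id in enumerate(patient_ids):
            folds[position % k].append(patient_id)
    return folds
```

In each cross-validation round, one fold would serve as the validation set and the remaining folds as training data; the hold-out test set is touched only once, for the final evaluation.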
Study setting
The PEPPERMINT study will be conducted as a single-centre study at the University Hospital Ulm in Germany. The hospital is a tertiary care and academic hospital where about 30,000 anaesthesia procedures are performed annually, including a broad spectrum of surgical disciplines and interventions. To establish a data processing pipeline for imaging data and to integrate artificial intelligence (AI) algorithms into the study setting, an interface between the ultrasound devices and the hospital’s internal Picture Archiving and Communication System (PACS) has already been established. To bundle the expertise in image processing, collection, and processing of big data, the departments of anaesthesiology and radiology cooperate for this study.
The members of the study group are predominantly physicians and specialists in the fields of anesthesiology and radiology. The leader of the research group “Experimental Radiology”, who holds a formal background in engineering, is part of the study team. His expertise lies in machine learning, deep learning, and computer vision, with a focus on medical imaging applications. Another group member holds a master’s degree in medical informatics and has substantial experience in machine learning for perioperative risk prediction. The study team is completed by Master’s and PhD students from the Department of Radiology with a background in computer science and engineering.
In addition to specialised personnel, high computing capacity for fast processing and reliable storage is necessary to develop the model and will be provided by the involved departments.
As the study is designed to build decision-support systems, on-site evaluation is planned as a subsequent step, which will be covered by a separate study. The data is evaluated offline without direct interaction between medical staff and the AI algorithm. Consequently, no feedback loop is planned in this phase.
Eligibility criteria
Adult patients (≥18 years) of all sexes are eligible for the study.
Patients must meet the following inclusion criteria: Scheduling for elective surgical procedures under general anaesthesia with a planned overnight hospital stay, and written informed consent by patients.
If they meet any of the following exclusion criteria, they will not be included in the study: Younger than 18 years of age, outpatient surgery, planned postoperative admission to intensive care unit (ICU), need for intensive care treatment before surgery and emergency surgery.
Secondary exclusion criteria are: Unplanned hospital discharge/transfer on the day of surgery, which does not allow examination of the primary outcome; cancellation/postponement of index surgery; and immediate unplanned postoperative admission to the ICU due to an intraoperative complication.
Furthermore, the inclusion criteria for input data are: At least two adjacent ribs and the pleura or corresponding pathologies (e.g., pneumothorax, pleural effusion) must be visible on the ultrasound image. The ultrasound examination must cover all 12 defined areas (details in chapter “Study measures”).
The exclusion criteria for input data are: Ultrasound images on which the leading structures ribs and pleura or alternative pathologies cannot be depicted, or incomplete visualisation of the 12 previously defined examination areas in the thoracic region, e.g., in the case of immobile patients or inaccessibility due to a bandage or drains. Importantly, decisions regarding potential exclusion due to poor image quality will not be made by the examining physician, but rather by an independent, blinded radiologist, thereby minimising the risk of subjective selection bias.
Clinical data is collected from patients throughout their hospitalisation. An employee who is not involved in data collection will carry out a plausibility check and random comparison with the medical records at regular intervals. Implausible data is removed from the data record.
Patients who cannot be visited postoperatively and for whom no endpoint could be defined are excluded from the study; however, their data may be used for secondary analyses. Incomplete pre-operative routine data in the electronic patient record are not a reason for exclusion if the patient can be visited postoperatively.
Recruitment and informed consent
The University Hospital Ulm is a tertiary care hospital where about 30,000 anaesthesia procedures are performed annually. The PEPPERMINT study evaluates a general surgical population; therefore, the eligibility criteria were kept as broad as possible. Approximately 50 patients are seen in the pre-anaesthesia outpatient clinic for preoperative evaluation each working day, of whom about 80% will fulfil the inclusion criteria. The three risk groups defined by the ARISCAT score will be recruited equally: after about 60 patients have been enrolled in one risk group, recruitment into that group will be paused, and the other risk groups will be prioritised until the same number is present in all risk groups. The equipment and staffing of the recovery rooms and the existing expertise in lung sonography allow the inclusion of 3–5 patients daily, so patient recruitment should be feasible within one year.
Informed consent from trial participants or authorised surrogates will be obtained by a physician from the Department of Anaesthesiology and Intensive Care Medicine of the University Hospital of Ulm. The informed consent discussion will be part of the scheduled informed consent discussion for general anaesthesia. According to the usual routine preoperative procedure, the optimal anaesthesia for the patient is planned based on previous diseases, previous anaesthetic and surgical procedures, and personal preference. In case of doubt, the senior physician will be consulted. If the inclusion criteria are met and the patient agrees, an informed consent discussion about the PEPPERMINT study will take place. Written informed consent will be obtained by the physicians, who will explain the hospital’s policy on data collection and storage, the general process, and the goals of the study. Additional consent for the collection of participant data, which is routinely obtained during standard pre-, intra- and postoperative anesthesiologic care, will be obtained as well.
The participant schedule is shown in Fig 1.
Schematic diagram presenting the schedule for participants, based on SPIRIT schedule; LUS = Lung ultrasound; POPC = postoperative pulmonary complications.
Study measures
Study-related measures to acquire the imaging data include performing a standardised lung ultrasound on each included patient immediately postoperatively in the recovery room or post-anaesthesia care unit (PACU). Sonographic examination of the lungs is a common, noninvasive bedside procedure. It can be performed without additional positioning in the supine position and takes approximately 5 minutes. The examination is performed using a standardised, previously published method [13]. A convex probe (5 MHz) and the device’s default “lung” preset will be used. Each hemithorax will be divided into 6 areas: the anterior and posterior axillary lines separate an anterior, a lateral, and a posterior zone, each of which is divided into a superior and an inferior area. For each patient, 12 pictures and 12 videos, one in each area, will be captured. This 12-zone protocol showed the highest intra-class correlation coefficient compared to other protocols [14]. A graphical representation of the areas can be found in Fig 2.
Lung ultrasound will be performed in 12 areas, 6 in each hemithorax. Areas are separated by the anterior and posterior axillary line into an anterior, lateral and posterior zone and a superior and inferior area.
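As a small illustration of the 12-zone scheme and of the completeness requirement for input data, the zones can be enumerated and an examination checked for missing recordings. The zone labels (e.g. `left-anterior-superior`), the `(zone, kind)` tuple layout, and the function names are hypothetical and chosen only for this sketch; image quality itself (visibility of ribs and pleura) is judged by a radiologist and is not modelled here.

```python
from itertools import product

SIDES = ("left", "right")
REGIONS = ("anterior", "lateral", "posterior")  # separated by the axillary lines
LEVELS = ("superior", "inferior")

# 2 sides x 3 regions x 2 levels = 12 examination areas
ZONES = [f"{side}-{region}-{level}"
         for side, region, level in product(SIDES, REGIONS, LEVELS)]

RECORDING_KINDS = ("image", "video")  # one picture and one video per area


def missing_recordings(recordings):
    """Return the (zone, kind) pairs that are absent from an examination.

    recordings: iterable of (zone, kind) tuples actually stored for a patient.
    """
    available = set(recordings)
    return [(zone, kind)
            for zone in ZONES
            for kind in RECORDING_KINDS
            if (zone, kind) not in available]


def exam_is_complete(recordings):
    """True if all 12 areas have both a picture and a video."""
    return not missing_recordings(recordings)
```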
The criteria for including imaging data are described above (“Eligibility Criteria”). If these conditions are not met, the interpretability of the ultrasound is considered insufficient.
To maximize inclusiveness and minimize selection bias, operators are encouraged to adjust technical parameters (e.g., depth and gain) during image acquisition in order to meet these quality criteria. A convex transducer was selected to allow visualization of deeper structures across a wide range of patient anatomies.
The performing physicians are experienced in perioperative medicine and critical care and will be trained in ultrasound methodology before the start of the study.
After transmission and storage in the PACS of the Department of Radiology, image pre-processing and artifact correction are performed before the data serves as input for the machine learning model. Images are not labeled by human experts; only the pre-defined endpoint POPC serves as labels for the imaging data.
The clinical data obtained for the study correspond to parameters routinely collected during anesthesiologic preoperative evaluation, the course of anaesthesia during surgery, and postoperatively in the recovery room. Plausibility checks are carried out on the numerical data; if these deviate from the valid value range, they are removed. Missing data may be imputed by different methods as described below.
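A minimal sketch of such a plausibility check is shown below. The variable names and the value ranges in `VALID_RANGES` are placeholders invented for illustration; they are not the ranges used in the study. Out-of-range values are set to `None` so that they can later be treated like any other missing value.

```python
# Hypothetical valid ranges, for illustration only (not the study's ranges).
VALID_RANGES = {
    "heart_rate_bpm": (20, 250),
    "spo2_percent": (50, 100),
    "temperature_c": (30.0, 43.0),
}


def remove_implausible(record, valid_ranges=VALID_RANGES):
    """Return a copy of the record with out-of-range numerical values removed.

    record: dict mapping variable name -> numerical value or None.
    Variables without a defined range are passed through unchanged.
    """
    cleaned = {}
    for name, value in record.items():
        bounds = valid_ranges.get(name)
        if value is not None and bounds is not None:
            low, high = bounds
            if not (low <= value <= high):
                value = None  # treated as missing from here on
        cleaned[name] = value
    return cleaned
```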
Based on the collected data and the outcome parameters (description in chapter “Outcome”), machine learning algorithms will be trained to predict POPC by outputting an individual patient’s percentage risk of suffering a postoperative pulmonary complication.
Modifications, adherence and concomitant care
Regarding the study protocol, the following scenarios allow a change in the study protocol: (1) In case of patient withdrawal of consent, the study protocol will be stopped immediately, and the patient will be excluded from the study. (2) If it is not possible to perform an ultrasound examination that meets the abovementioned quality requirements, the patient will be excluded from the study. (3) Postoperative lung ultrasound will take place in the recovery room; dates might be rescheduled depending on the postponement of the surgery. (4) Postoperative visits on the normal ward will be on postoperative days 1, 3 and 7. If the patient is discharged from the hospital or transferred to another hospital within the first 7 days, the study ends on the day of discharge.
To improve workflow, the recruiting anaesthetist in the pre-anaesthesia outpatient clinic receives a simple checklist with eligibility criteria and the ARISCAT score (S1 File), and study information will be handed to patients in the waiting area. To improve adherence to the lung ultrasound protocol in the recovery room, a group of selected anaesthetists will undergo personal training in ultrasound methodology and receive detailed instructions in written form, which are also attached to the ultrasound device. Additionally, pocket cards with brief instructions will be distributed among the responsible physicians.
Postoperative ward visits will be performed by qualified study nurses and trained medical students with a tablet to simplify the visits and enhance data management.
During the trial, patients will receive routine perioperative care according to the local standard. Concomitant care and interventions are therefore provided at the discretion of the treating physicians, and none are prohibited during the trial. Relevant information regarding POPC that results from postoperative diagnostics or interventions will be recorded. If a suspected complication that has not yet been treated is noticed during a postoperative visit, the study staff will inform the responsible ward physician so that any necessary therapy can be initiated.
Sample size
Predicting POPC for the clinician is difficult on a case-by-case basis. Therefore, several scoring systems have been developed in the past. The most common of these, the ARISCAT score, has an AUROC of 0.83 [15]. It should be noted that the current literature does not provide any further metrics such as area under the precision-recall curve (AUPRC) or F1 score on the ARISCAT [15,16]. We therefore used the AUROC as the reference metric. The model we aim to create should be significantly better than the ARISCAT score and thus have at least an AUROC of 0.93. Achieving this seems realistic since, in the preliminary work of our research group, prediction models for various postoperative complications have already been created, whose predictive accuracy is in this range [17,18]. With a significance level of 0.05 and a power of 80%, 512 patients would be needed to create the database based on the method described by Hanley and McNeil for comparing ROC curves [19].
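For readers who want to explore this kind of calculation, the sketch below implements the standard error of an AUROC as given by Hanley and McNeil and a normal-approximation power estimate for the difference between two AUROCs. It is not the authors' calculation: the number of patients with and without POPC and the correlation `r` between the two AUROC estimates are inputs that are not stated in this section, so the sketch is not expected to reproduce the figure of 512. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUROC (Hanley and McNeil approximation).

    n_pos: number of patients with the outcome, n_neg: number without.
    """
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (auc * (1.0 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return sqrt(variance)


def power_to_detect_difference(auc_ref, auc_new, n_pos, n_neg, r=0.0, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference between two AUROCs.

    r is the assumed correlation between the two AUROC estimates when both
    are derived from the same patients (r = 0 treats them as independent).
    """
    se_ref = hanley_mcneil_se(auc_ref, n_pos, n_neg)
    se_new = hanley_mcneil_se(auc_new, n_pos, n_neg)
    se_diff = sqrt(se_ref ** 2 + se_new ** 2 - 2.0 * r * se_ref * se_new)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return NormalDist().cdf(abs(auc_new - auc_ref) / se_diff - z_crit)
```

A call such as `power_to_detect_difference(0.83, 0.93, n_pos, n_neg)` can then be evaluated for different cohort sizes and assumed POPC rates to see how strongly the required sample size depends on those assumptions.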
Patients are stratified according to the risk criteria determined by ARISCAT so that approximately equal numbers of patients are included in each of the low-risk (ARISCAT < 26 points), intermediate-risk (ARISCAT 26–44 points), and high-risk (ARISCAT ≥ 45 points) groups.
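The risk-group thresholds given above translate directly into a small helper. The function name and the string labels are hypothetical; the cut-offs are the ones stated in the text.

```python
def ariscat_risk_group(score):
    """Map an ARISCAT score (in points) to the risk group used for stratification."""
    if score < 26:
        return "low"
    if score <= 44:
        return "intermediate"
    return "high"
```

Counting enrolled patients per returned label is then enough to see which group is ahead and should be paused during recruitment.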
Outcomes
The primary outcome of the PEPPERMINT study to be predicted by the machine learning model is the risk of developing postoperative pulmonary complications after surgery in general anaesthesia between postoperative days 1 and 7. POPC will be defined and graded by severity according to the standardised criteria of the StEP collaboration [1]. Complications not further described by the StEP collaboration will be defined according to the EPCO (European Perioperative Clinical Outcome) task force [20], as listed in Table 2. POPC, as a composite outcome, summarises atelectasis, pneumonia, acute respiratory distress syndrome (ARDS), pulmonary aspiration, pulmonary embolism, pleural effusion, pneumothorax, and bronchospasm [1]. POPC is assumed as soon as at least one of the listed events occurs and will be detected during the postoperative visit or chart review after the patient is discharged.
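The composite nature of the endpoint can be made explicit with a short sketch: the binary label is positive as soon as any listed component is documented at any assessment. The event identifiers and the dictionary layout (assessment name mapped to a set of documented events) are assumptions made for illustration; the clinical definitions of the individual components are those of the StEP and EPCO criteria and are not encoded here.

```python
# Components of the composite outcome as listed in the text
# (identifier spelling is an illustrative choice).
POPC_COMPONENTS = frozenset({
    "atelectasis",
    "pneumonia",
    "ards",
    "pulmonary_aspiration",
    "pulmonary_embolism",
    "pleural_effusion",
    "pneumothorax",
    "bronchospasm",
})


def has_popc(assessments):
    """Return True if at least one POPC component was documented.

    assessments: dict mapping an assessment (e.g. postoperative day 1, 3, 7
    or "chart_review") to the set of events documented at that assessment.
    """
    return any(event in POPC_COMPONENTS
               for events in assessments.values()
               for event in events)
```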
Outcome assessment will be performed by qualified study staff consisting of three study nurses and two doctoral students on postoperative visits on days 1, 3 and 7. To detect complications, visits will include a questionnaire (pulmonary symptoms and mental state), a clinical examination (pulmonary auscultation), the collection of vital parameters (heart rate, blood pressure, oxygen saturation, breathing rate, temperature) as well as a chart review (oxygen supply, medication, signs of aspiration, admission to ICU). If available during the postoperative course, the following measures will be included: laboratory values (C-reactive protein (CRP), procalcitonin (PCT), leukocytes, partial pressure of arterial oxygen (paO2), and partial pressure of arterial carbon dioxide (paCO2)), thoracic imaging (chest radiography or computed tomography) and respiratory support (CPAP, non-invasive or invasive ventilation). All potentially collected parameters are described in Table 3. To check postoperative mental status, the Mini-Cog test will be used. This test consists of a 3-word recall task and the clock drawing test [21].
Additionally, after hospital discharge, relevant pulmonary imaging, diagnoses or complications are extracted from the discharge letter.
Postoperative recovery and patient satisfaction, as a secondary outcome parameter, will be evaluated with the Quality of Recovery-9 (QoR-9) questionnaire (Table 4) on days 1, 3 and 7 [22]. The score ranges from 0 to 18, with a higher score indicating a better subjective recovery.
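The stated range of 0 to 18 is consistent with nine items that are each scored from 0 to 2, which is the assumption behind the following sketch; the item wording itself is given in Table 4 and is not reproduced here. The function name is hypothetical.

```python
def qor9_total(item_scores):
    """Sum the nine QoR-9 item scores into a total between 0 and 18.

    Assumes each item is scored 0, 1 or 2; raises ValueError otherwise,
    mirroring the input validation of an electronic entry form.
    """
    if len(item_scores) != 9:
        raise ValueError("QoR-9 requires exactly nine item scores")
    if any(score not in (0, 1, 2) for score in item_scores):
        raise ValueError("each QoR-9 item must be scored 0, 1 or 2")
    return sum(item_scores)
```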
Other secondary outcome parameters, hospital length of stay and in-hospital mortality, will be determined by a chart review after hospital discharge.
Data management
A FileMaker™ database is used to record postoperative outcomes. Database constraints avoid duplicate entries and values outside the valid range of clinical routine data. The ARISCAT score will be determined pre-operatively with the help of a paper-based checklist. A detailed description of the score is provided in “Background and rationale”. The score has been validated in several European populations [15,23]. The primary endpoint, POPC, is defined according to the consensus definition of the StEP collaboration [1] and complemented by the definition of the EPCO task force [20]. The definition is described in detail in the chapter “Outcome”. As mental status is part of the StEP criteria, a brief cognitive screening test (Mini-Cog test) will be performed regularly during the post-operative visits. The sensitivity of this test is 0.76–0.99, and the specificity is 0.83–0.93 [24]. As a secondary endpoint, the Quality-of-Recovery 9 questionnaire will be applied to assess subjective recovery after surgery. The test is highly sensitive (0.92 ± 0.01) with a high negative predictive value (0.93 ± 0.01) in a German study collective [22]. The tests and the questionnaire are detailed in the chapter “Outcome”.
To promote participant retention, all outcome data will be assessed while the patient is still hospitalised. If a patient drops out, for example, because of withdrawal of consent or unexpected surgery rescheduling and therefore missed ultrasound, the study protocol will be stopped immediately, no further data will be collected, and already gathered data will be deleted. In case of patient discharge from the hospital before performing postoperative visits on days 3 and/or 7, the study protocol will continue, and the available data will be processed if the postoperative visit on day one is documented.
The data are collected and stored in different formats. The inclusion criteria and the ARISCAT score will be collected on paper in the pre-anaesthesia outpatient clinic. Lung ultrasound is performed with the SonoSite PX (Fujifilm SonoSite Inc., Bothell, Washington, USA). The imaging data is transferred to the hospital’s internal PACS and stored and processed in Digital Imaging and Communications in Medicine (DICOM) format within the Department of Radiology. Routinely collected perioperative data are stored in the hospital’s internal patient data management system (PDMS) in an Oracle8i database. The data is accessible via Structured Query Language (SQL) and can be exported in CSV format. Postoperative visits are conducted in the form of structured questionnaires using tablets. Data collection and processing are done with FileMaker Pro software (Claris, version 19.6.3.302). In the electronic data entry form for the postoperative visits, no user input is possible outside the valid value ranges for numerical data; dichotomous questions (yes/no) are documented with the help of radio buttons. The clock is drawn by the patient on the tablet and also stored in the database as a drawing. The Mini-Cog score is calculated from the clock drawing together with the performance in the word-recall task. To ensure that the drawn clock is evaluated correctly, the score is cross-checked by a physician before it is finally transferred to the database. After patient discharge, discharge notes are searched for relevant diagnoses and complications. All collected data will be merged into a comprehensive FileMaker database. Data are stored in pseudonymised, de-identified form.
No laboratory evaluations or biological specimens outside the clinical routine will be obtained or stored as part of this study. For the detection of postoperative pulmonary complications, diagnostic tests and relevant results performed by the respective departments will be collected during a chart review and directly transferred to the pseudonymized data collection.
Any documents with identifiable information will be collected in paper-based form and stored in a locked cabinet at the study centre, where only authorised personnel will have access to them. This includes original informed consent, a checklist with eligibility criteria, the ARISCAT score, and the patient identification list. The documents will be kept there until completion of the study. The identification list will be stored separately, and only authorised study personnel will have access to it. After completion of the study, all paper records will be stored in a central archive for at least ten years according to the clinic’s specifications and legal requirements. Information collected during the postoperative visits will be saved without any identifiable information on password-protected study tablets. After transferring the data to a hospital-internal computer, they are deleted from the tablets.
Pseudonymised data received from PDMS, chart review, and postoperative visits will be securely stored in the hospital’s internal server infrastructure according to GDPR requirements. Imaging data is exclusively stored within the clinic’s radiology information system.
Statistical methods
The primary objective of the PEPPERMINT study is to develop and evaluate prediction models for postoperative pulmonary complications (POPC) using prospectively collected clinical and imaging data. Model development and validation will be conducted in accordance with the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) guidelines. The statistical analysis plan is provided in S3 File.
Overview of modelling approaches
Three predictive modelling strategies will be pursued. First, an imaging model will be created using deep learning techniques applied to lung ultrasound images. This model will be developed in Python using the PyTorch framework. Transfer learning will be employed, utilizing pretrained convolutional neural network (CNN) architectures such as ResNet or DenseNet. These models will be fine-tuned on the study-specific dataset. We will also assess the performance of medical foundation models and contrastive learning-based pretraining to optimize feature extraction from the ultrasound images.
Second, a clinical model will be developed using only structured patient data. This model will be trained using frameworks like H2O AutoML within R/RStudio as well as built-in R functions, which allows for the systematic evaluation and tuning of various machine learning algorithms, including gradient boosting machines, random forests, and neural networks.
Third, a combined model will integrate both clinical and imaging data. Various fusion strategies will be explored, including early and late fusion, to determine the most effective method for combining heterogeneous data types. The final architecture for this combined model will be selected based on performance observed in a hold-out test dataset.
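The difference between early and late fusion can be illustrated with a minimal sketch. It assumes a hypothetical image embedding and clinical feature vector per patient, as well as hypothetical per-model probabilities; the function names and the weighting parameter `w_image` are illustrative and are not part of the study protocol.

```python
from typing import List, Sequence


def early_fusion(image_embedding: Sequence[float],
                 clinical_features: Sequence[float]) -> List[float]:
    """Early fusion: concatenate both feature vectors into one model input."""
    return list(image_embedding) + list(clinical_features)


def late_fusion(p_image: float, p_clinical: float, w_image: float = 0.5) -> float:
    """Late fusion: weighted average of the two models' predicted probabilities."""
    if not 0.0 <= w_image <= 1.0:
        raise ValueError("w_image must lie in [0, 1]")
    return w_image * p_image + (1.0 - w_image) * p_clinical


# Hypothetical values for a single patient
fused_input = early_fusion([0.12, -0.40, 0.93], [66.0, 1.0])
fused_risk = late_fusion(p_image=0.70, p_clinical=0.50, w_image=0.5)
```

In early fusion a single downstream model sees the concatenated vector, whereas in late fusion each modality keeps its own model and only the outputs are combined.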
Model training and evaluation
The primary model development will be based on a dataset of 512 patients. Within this cohort, we will implement stratified k-fold cross-validation (typically 5- or 10-fold, depending on the final class distribution) where possible to assess model robustness and mitigate overfitting. Because image-based deep learning models require extensive computational resources, cross-validation will be applied more selectively to them: for imaging data, we will use a fixed training/validation/test split with non-overlapping patient groups. Early model experiments will use hold-out validation within the training data to optimize architecture and hyperparameters. Final model evaluation will be performed on an independent hold-out test set.
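A stratified split at the patient level could look roughly like the following sketch. It is a simplified stand-in for library routines, assuming one binary label per patient; the patient IDs, the cohort size of 100, and the function name are hypothetical. Splitting on patient IDs (rather than on individual images) is what keeps patient groups non-overlapping.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_patient_folds(labels: Dict[str, int], k: int = 5,
                             seed: int = 0) -> List[List[str]]:
    """Assign each patient to exactly one of k folds with similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for patient_id, label in labels.items():
        by_class[label].append(patient_id)

    folds: List[List[str]] = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        # Deal patients round-robin; the running offset keeps fold sizes balanced
        for i, patient_id in enumerate(ids):
            folds[(i + offset) % k].append(patient_id)
        offset += len(ids)
    return folds


# Hypothetical cohort: 100 patients, 43 of them with POPC (label 1)
labels = {f"P{i:03d}": int(i < 43) for i in range(100)}
folds = stratified_patient_folds(labels, k=5)
```

Each fold then serves once as the validation set while the remaining folds are used for training.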
This hold-out test set will be compiled prospectively during the course of the study after model development, using newly enrolled patients not included in the training dataset. This temporal separation will ensure unbiased performance evaluation of the final model.
Model performance will be evaluated on the independent hold-out test set using several complementary performance metrics. These will include the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), overall accuracy, sensitivity, specificity, F1-score, as well as positive and negative predictive values. Calibration will be assessed using appropriate measures, such as the Brier score, the Hosmer-Lemeshow test, and calibration curves. Clinical utility for decision-making can additionally be evaluated using decision curve analysis (DCA).
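Two of these metrics can be written down compactly. The sketch below computes the AUROC through its rank interpretation (the probability that a randomly chosen patient with POPC receives a higher predicted risk than a randomly chosen patient without) and the Brier score as the mean squared difference between predicted probability and outcome. The labels and probabilities are made-up illustrative values, not study data.

```python
from typing import Sequence


def auroc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """AUROC via pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared error between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Illustrative values only
y_true = [1, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
auc = auroc(y_true, y_prob)          # 8 of 9 positive/negative pairs ranked correctly
brier = brier_score(y_true, y_prob)  # lower values indicate better calibrated, sharper predictions
```

The pairwise formulation is quadratic in the number of patients, which is unproblematic at the planned sample size; established library implementations would normally be preferred in the actual analysis.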
Importantly, our models are designed to output individualized risk probabilities, not dichotomous predictions. As such, the selection of a cut-off point for potential clinical decision-making will be a post-modelling step, and risk thresholds for high-risk patient classification will not be predefined. Instead, various thresholding strategies will be explored post hoc, including Youden’s index (for balanced sensitivity and specificity), cut-offs optimized for specific clinical priorities (e.g., maximizing sensitivity), and thresholds guided by DCA.
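As an illustration of two such strategies, the following sketch selects the threshold that maximizes Youden’s index, J = sensitivity + specificity − 1, and computes the net benefit used in DCA, TP/n − (FP/n) · p_t/(1 − p_t), at a given threshold probability p_t. The data and function names are hypothetical; ties in J are resolved here in favour of the lower threshold, which is an arbitrary choice.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_prob: Sequence[float],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) when predicting positive for p >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = int(p >= threshold)
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif not predicted and not y:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def youden_threshold(y_true: Sequence[int],
                     y_prob: Sequence[float]) -> Tuple[float, float]:
    """Scan observed probabilities as candidate cut-offs and maximize J."""
    if len(set(y_true)) < 2:
        raise ValueError("both classes must be present")
    best_t, best_j = 0.0, -1.0
    for t in sorted(set(y_prob)):
        tp, fp, tn, fn = confusion_counts(y_true, y_prob, t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float],
                threshold: float) -> float:
    """Net benefit of treating all patients with predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    tp, fp, _, _ = confusion_counts(y_true, y_prob, threshold)
    n = len(y_true)
    return tp / n - fp / n * threshold / (1.0 - threshold)


# Illustrative values only
y_true = [1, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
best_t, best_j = youden_threshold(y_true, y_prob)
nb = net_benefit(y_true, y_prob, threshold=0.5)
```

In a decision curve, the net benefit would be plotted over a range of threshold probabilities and compared with the "treat all" and "treat none" strategies.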
Addressing overfitting and class imbalance
Overfitting will be addressed through the implementation of several regularization techniques commonly used in deep learning. These include dropout layers within the network architecture, L2 weight regularization to penalize large weights, and data augmentation techniques applied to ultrasound images (e.g., rotation, flipping, and zooming). Model complexity will also be restricted as needed.
To better estimate the prevalence and clinical relevance of POPC, we conducted a preliminary single-center observational study involving 259 patients undergoing surgery under general anesthesia. The cohort included 106 female patients (41%) and 62 current smokers (24%) with a median age of 66 years. Overall, 111 patients (43%) experienced at least one POPC, indicating that POPC is a relatively frequent complication in the targeted patient population [25].
Based on this preliminary observational data, only moderate class imbalance is to be expected. Nevertheless, additional strategies will be employed to ensure balanced learning. These include maintaining class ratios across cross-validation folds, monitoring class-wise performance metrics, and using class weighting techniques if imbalances are observed in specific subgroups.
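To give a sense of scale, class weights for the preliminary cohort (111 of 259 patients with POPC) can be computed with a common inverse-frequency heuristic, w_c = n / (k · n_c), where n is the number of patients, k the number of classes, and n_c the size of class c. This is one of several possible weighting schemes and is shown as an assumption, not as the scheme the study will necessarily adopt.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Inverse-frequency weights: rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


# Preliminary cohort from the text: 111 patients with POPC, 148 without
labels = [1] * 111 + [0] * 148
weights = balanced_class_weights(labels)
```

With a 43% event rate the resulting weights stay close to 1 (about 1.17 for patients with POPC and 0.875 for patients without), which is consistent with the expectation of only moderate imbalance.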
Missing values and data imputation
With regard to imaging data, only complete data sets will be accepted for analysis.
If H2O AutoML is used, missing values in the clinical dataset will initially be handled by its default preprocessing pipeline. This pipeline applies median or mean imputation for numerical variables and mode imputation for categorical variables (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/imputing-data.html). In addition to this standard approach, we will evaluate alternative imputation strategies, particularly for approximately normally distributed numerical variables. These include, e.g., k-nearest neighbors (kNN) imputation, which leverages similarities across observations in the multivariate feature space, and multiple imputation using chained equations (MICE), which incorporates multivariable relationships and reflects the uncertainty associated with imputed values.
Given the prospective nature of the study, we anticipate that most missing data will not be missing at random (NMAR), but rather occur systematically, for instance, due to early patient discharge or an inability to perform assessments as a result of clinical deterioration. In such cases, the absence of data may itself be informative with respect to the patient’s risk for postoperative complications. Therefore, in addition to the imputation techniques mentioned above, we will explore modelling strategies that explicitly account for informative missingness. These include the incorporation of missingness indicators (binary flags denoting whether a variable was observed or missing) and sensitivity analyses to assess the robustness of model predictions under different assumptions about the missing data mechanism.
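A missingness indicator combined with simple median imputation might be sketched as follows. The variable name `spo2_day3` and its values are hypothetical; in a real pipeline the fill value would be estimated on the training folds only and then applied to validation and test data to avoid information leakage.

```python
from statistics import median
from typing import List, Optional, Tuple


def impute_with_indicator(
    values: List[Optional[float]],
) -> Tuple[List[float], List[int]]:
    """Median-impute a numerical variable and return a binary missingness flag."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a variable with no observed values")
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    missing_flag = [int(v is None) for v in values]
    return imputed, missing_flag


# Hypothetical day-3 measurement, missing for patients discharged early
spo2_day3 = [96.0, None, 94.0, 98.0, None]
imputed, missing_flag = impute_with_indicator(spo2_day3)
```

The flag enters the model as an additional feature, so that the fact that a value is absent can carry predictive information of its own.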
The final approach to handling missing data will be selected based on a combination of model performance, diagnostic checks, and clinical plausibility.
Secondary models and endpoints
In addition to the prediction models, we aim to extract specific predictive markers that might be evaluated in a further study. Once established and evaluated in additional studies, both the AI models and the markers might help to identify high-risk patients, allowing treatment to be adapted at an early point of care.
In addition to the primary binary classification model for the presence or absence of POPC, the study will explore the development of additional models that aim to distinguish between different types and severities of complications. These extended models will incorporate more granular outcome definitions, including the timing of complication onset and the presence of multiple complications in a single patient.
Moreover, secondary predictive models will be constructed to estimate outcomes such as quality of postoperative recovery, length of hospital stay, and in-hospital mortality. Important features will be identified based on variable importance metrics. Comparisons between patients with and without complications will be conducted using appropriate statistical tests, including t-tests or Mann-Whitney U tests for continuous variables, and chi-squared or Fisher’s exact tests for categorical variables.
External and subgroup validation
While model development and internal validation will be performed using data from a single clinical centre, further validation is planned at a second site of our hospital that predominantly treats gynaecological and ENT patients. This cohort will serve as a distinct population for assessing model generalisability across surgical disciplines.
Additionally, the model will be evaluated in a separate high-risk subgroup comprising patients undergoing urgent or emergency procedures, such as elderly individuals with hip fractures. These patients were intentionally excluded from the initial development cohort and will offer insight into the model’s performance in more acute care settings.
Interim analysis
An interim analysis will be conducted after the enrollment of approximately 50 patients (around 10% of the planned sample size). This analysis will assess the technical feasibility, data quality, and operational workflow. As this is a non-interventional study, early termination is not planned regardless of interim results.
Oversight and monitoring
The coordinating investigators are responsible for study design, funding, and creation of the study protocol. They coordinate communication between the two involved departments, the Department of Anaesthesiology and Intensive Care Medicine and the Department of Radiology, and the persons involved in the study. Furthermore, they are part of the trial steering committee.
The trial steering committee is responsible for adhering to the study protocol, conducting the planned patient enrollment, and compiling the patient identification list. They will monitor the study’s progress and, if necessary, agree on changes to the procedure and study protocol. The committee meets once every 2 months in an in-person or online meeting.
The lead investigator is responsible for eligibility screening, consent, and enrollment of patients as well as imaging and data collection, and therefore supports the practising physicians on a day-to-day basis.
The present study does not fall under the German Medicinal Products Act (AMG) or the Medical Devices Regulation (MDR). The study is monocentric, and no intervention will take place. No risks for the patients are to be expected. For these reasons, no external data monitoring committee will be set up.
However, the coordinating investigators will be responsible for the creation of the database, supporting data entry, data verification, and quality management. Data monitoring and outcome reports will take place every 8 weeks.
Adverse events and harms
Patients will undergo routine perioperative care as per standard during this trial; the attending physicians and departments remain responsible for patient care. Additionally, patients will receive (1) a lung ultrasound, considered non-invasive and without side effects, and (2) a clinical examination and interview at up to three time points postoperatively. All other procedures are part of general anaesthesia or usual perioperative management and are performed regardless of study participation. Therefore, we do not expect any complications or harm from trial participation.
Nevertheless, patients can report adverse events or other unintended effects of the trial to the study hotline or email address. The trial steering committee will process the reports.
Due to the relatively small size of the data set, the prediction model will be developed using the whole dataset. Cross-validation will be used to evaluate the performance of the model. Furthermore, validation is planned on a temporally independent data set obtained during the period of model training. Therefore, the study will be classified as Type 1b according to the TRIPOD Statement [12].
Error analysis will be carried out with the help of a confusion matrix after threshold determination. The cases of incorrect predictions will be analyzed in more detail, in particular, to determine whether certain characteristics correlate with the errors. External validation will take place in future studies.
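The error analysis described above can be sketched as a confusion matrix that retains patient identifiers, so that false-positive and false-negative cases can be inspected individually afterwards. The patient IDs, probabilities, and the threshold of 0.5 are hypothetical placeholders; the actual threshold will only be determined post hoc.

```python
from typing import Dict, List, Sequence


def error_analysis(patient_ids: Sequence[str], y_true: Sequence[int],
                   y_prob: Sequence[float],
                   threshold: float) -> Dict[str, List[str]]:
    """Group patient IDs into the four confusion-matrix cells."""
    cells: Dict[str, List[str]] = {"TP": [], "FP": [], "TN": [], "FN": []}
    for pid, y, p in zip(patient_ids, y_true, y_prob):
        predicted = int(p >= threshold)
        # First letter: prediction correct (T) or not (F); second: predicted class
        key = ("T" if predicted == y else "F") + ("P" if predicted else "N")
        cells[key].append(pid)
    return cells


# Illustrative values only
ids = ["A", "B", "C", "D", "E", "F"]
y_true = [1, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.7, 0.8, 0.1]
cells = error_analysis(ids, y_true, y_prob, threshold=0.5)
misclassified = cells["FP"] + cells["FN"]
```

The characteristics of the patients in `misclassified` can then be compared with those of correctly classified patients to look for systematic error patterns.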
In the current study, we are not focusing on the identification of possible confounders for model performance. This will be done in a follow-up study, investigating the impact of possible confounders such as different raters, imaging devices, bad image quality, resolutions, etc., in a possibly multi-centric study.
Discussion
A major aim of clinical research in the perioperative field is to reduce the occurrence of postoperative complications. In times of staff and resource shortages, personalised medicine, which delivers the treatment a patient requires while avoiding overtreatment, is becoming more important. Machine learning algorithms might improve risk prediction as a prerequisite for personalised medicine. POPC represent a large proportion of the overall postoperative complications and occur about twice as frequently as cardiac complications. POPC are not only common, but they are also responsible for increased morbidity and mortality. Furthermore, they contribute to increased hospital length of stay and a higher frequency of hospital readmissions. Therefore, they occupy more healthcare resources and cause higher healthcare costs [26–29].
Despite these facts, there are only a few scores for evaluating pulmonary risk, and none has yet become standard in clinical routine, even though pulmonary complications could be controlled and avoided by specific but personnel-intensive measures.
Lung ultrasound is a non-invasive, bedside diagnostic screening measure that has recently gained popularity, not least due to the COVID-19 pandemic. With standardised protocols and guidelines, lung ultrasound is becoming increasingly important in clinical medicine [30]. A growing number of machine-learning models based on ultrasound examinations are being developed that deliver high diagnostic accuracy [31] and already exceed the currently best evaluated conventional score, the ARISCAT Score [32].
The PEPPERMINT study aims to develop a tailored machine-learning model to reliably predict the risk for POPC, based on lung ultrasound imaging performed in the recovery room and perioperatively assessed clinical data. We hypothesize that the accuracy of the prediction model outperforms common scoring systems. Early identification of patients at risk helps to target scarce resources and apply adequate therapy in the sense of personalised medicine.
Limitations
We wanted to set a specific time frame for the postoperative visits. Therefore, in-person visits occur exclusively during the first 7 postoperative days. After that time point, the collection of findings is limited to a chart review. However, since the majority of POPC occur within the first week [26], this pragmatic approach is justified.
Secondly, risk identification by the model will take place only after the surgery. Therefore, preoperative assessment and optimization are not the subject of our study. However, if one considers the criteria that are relevant in preoperative risk assessment scores [5], for example, age, respiratory infection, or expected surgery duration and incision, most of the criteria are related to the underlying disease or planned surgical procedure, and are therefore not amenable to preoperative modifications. Consequently, in this study we would like to develop a tool that reliably predicts complications in the early postoperative phase in order to be able to provide the patient with adequate postoperative treatment and monitoring.
Strengths
The PEPPERMINT study will be the first study to combine ultrasound imaging data with clinical data in an artificial intelligence prediction model. We, therefore, hope to achieve a highly accurate risk prediction that can be applied in clinical practice. Besides POPC, we also investigate secondary endpoints that are of interest to the healthcare system, like the length of in-hospital stay and endpoints that are relevant to the subjective feelings of patients, like the quality of recovery.
In the future, the suitability of the algorithm will be tested in a clinical intervention study. For this purpose, a larger number of patients will be screened with the created model. High-risk patients will receive a multimodal training and therapy program postoperatively to reduce the rate of POPC. This includes, among other things, non-invasive ventilation in the recovery room, physiotherapy, respiratory training, a nutrition plan to prevent malnutrition, fluid balancing to prevent overhydration, and special oral hygiene. All included patients will again be visited on the ward on days 1, 3, and 7 and examined for signs of pulmonary complications. The aim is a reduction of pulmonary complications with measurable clinical benefit. Clinically measurable success parameters are a shorter hospital stay, a lower rate of unplanned intensive care admissions, and a higher quality of life.
Precise risk assessment using a machine-learning algorithm combined with targeted preventive and therapeutic measures for identified high-risk patients, therefore, has great potential to improve patient outcomes and could also help to reduce health care costs.
Dissemination plans
Trial results will be communicated via publication in international, peer-reviewed journals and at international congresses in the fields of anaesthesia and radiology. Positive as well as negative results will be published.
Protocol amendments
All important protocol modifications will be communicated to the necessary parties through the trial steering committee via direct contact or online meeting. Necessary changes will be submitted to the trial registries and the ethics committee as soon as possible.
Trial sponsor
Prof. Dr. Bettina Jungwirth
University Hospital Ulm, Albert-Einstein-Allee 23, 89081 Ulm
mail: ains@uniklinik-ulm.de
Prof. Dr. Meinrad Beer
University Hospital Ulm, Albert-Einstein-Allee 23, 89081 Ulm
mail: sekretariat.radiologie1@uniklinik-ulm.de
This is an investigator-initiated trial. The funding source had no role in the design of this study and will not have any role during its execution, analyses, interpretation of the data, or decision to submit results.
Supporting information
S2 File. SPIRIT-AI checklist.
Recommended items to address in a protocol and related documents for clinical trials evaluating AI interventions.
https://doi.org/10.1371/journal.pone.0329076.s002
(PDF)
Acknowledgments
We acknowledge F. Scheffenbichler, B. Ulm and A. Podtschaske for their support and advice in the implementation of the study. We acknowledge K. Lukas-Jazwinski, S. Hoheisen, F. Branz, G. Frömmichen, P. Leibinger and P. S. Sam for data acquisition and H. Hillenhagen and T. Bader for the preliminary work on the machine learning model. We also want to thank the patients for their willingness to participate in this study.
References
- 1. Abbott TEF, Fowler AJ, Pelosi P, Gama AM, Moller AM, Canet J, et al. A systematic review and consensus definitions for standardised end-points in perioperative medicine: pulmonary complications. Br J Anaesth. 2018;120(4):705–11. pmid:29576111
- 2. Fernandez-Bustamante A, Frendl G, Sprung J, Kor DJ, Subramaniam B, Martinez Ruiz R, et al. Postoperative Pulmonary Complications, Early Mortality, and Hospital Stay Following Noncardiothoracic Surgery: A Multicenter Study by the Perioperative Research Network Investigators. JAMA Surg. 2017;152(2):157–66. pmid:27829093
- 3. Ghaferi AA, Birkmeyer JD, Dimick JB. Variation in hospital mortality associated with inpatient surgery. N Engl J Med. 2009;361(14):1368–75. pmid:19797283
- 4. Ball L, Pelosi P. Predictive scores for postoperative pulmonary complications: time to move towards clinical practice. Minerva Anestesiol. 2016;82(3):265–7. pmid:26344668
- 5. Nithiuthai J, Siriussawakul A, Junkai R, Horugsa N, Jarungjitaree S, Triyasunant N. Do ARISCAT scores help to predict the incidence of postoperative pulmonary complications in elderly patients after upper abdominal surgery? An observational study at a single university hospital. Perioper Med (Lond). 2021;10(1):43. pmid:34876228
- 6. Xue B, Li D, Lu C, King CR, Wildes T, Avidan MS, et al. Use of Machine Learning to Develop and Evaluate Models Using Preoperative and Intraoperative Data to Identify Risks of Postoperative Complications. JAMA Netw Open. 2021;4(3):e212240. pmid:33783520
- 7. Szabó M, Bozó A, Darvas K, Soós S, Őzse M, Iványi ZD. The role of ultrasonographic lung aeration score in the prediction of postoperative pulmonary complications: an observational study. BMC Anesthesiol. 2021;21(1):19. pmid:33446103
- 8. van Sloun RJG, Demi L. Localizing B-Lines in Lung Ultrasonography by Weakly Supervised Deep Learning, In-Vivo Results. IEEE J Biomed Health Inform. 2020;24(4):957–64. pmid:31425126
- 9. Brusasco C, Santori G, Tavazzi G, Via G, Robba C, Gargani L, et al. Second-order grey-scale texture analysis of pleural ultrasound images to differentiate acute respiratory distress syndrome and cardiogenic pulmonary edema. J Clin Monit Comput. 2022;36(1):131–40. pmid:33313979
- 10. Miskovic A, Lumb AB. Postoperative pulmonary complications. Br J Anaesth. 2017;118(3):317–34. https://doi.org/10.1093/bja/aex002 pmid:28186222
- 11. Ferreyra GP, Baussano I, Squadrone V, Richiardi L, Marchiaro G, Del Sorbo L, et al. Continuous positive airway pressure for treatment of respiratory complications after abdominal surgery: a systematic review and meta-analysis. Ann Surg. 2008;247(4):617–26. pmid:18362624
- 12. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): the TRIPOD statement. Ann Intern Med. 2015;162(1):55–63. pmid:25560714
- 13. Bouhemad B, Mongodi S, Via G, Rouquette I. Ultrasound for “lung monitoring” of ventilated patients. Anesthesiology. 2015;122(2):437–47. pmid:25501898
- 14. Tung-Chen Y, Ossaba-Vélez S, Acosta Velásquez KS, Parra-Gordo ML, Díez-Tascón A, Villén-Villegas T, et al. The Impact of Different Lung Ultrasound Protocols in the Assessment of Lung Lesions in COVID-19 Patients: Is There an Ideal Lung Ultrasound Protocol?. J Ultrasound. 2022;25(3):483–91. pmid:34855187
- 15. Kokotovic D, Degett TH, Ekeloef S, Burcharth J. The ARISCAT score is a promising model to predict postoperative pulmonary complications after major emergency abdominal surgery: an external validation in a Danish cohort. Eur J Trauma Emerg Surg. 2022;48(5):3863–7. pmid:35050387
- 16. Kiyatkin ME, Aasman B, Fazzari MJ, Rudolph MI, Vidal Melo MF, Eikermann M, et al. Development of an automated, general-purpose prediction tool for postoperative respiratory failure using machine learning: A retrospective cohort study. J Clin Anesth. 2023;90:111194. pmid:37422982
- 17. Andonov DI, Ulm B, Graessner M, Podtschaske A, Blobner M, Jungwirth B, et al. Impact of the Covid-19 pandemic on the performance of machine learning algorithms for predicting perioperative mortality. BMC Med Inform Decis Mak. 2023;23(1):67. pmid:37046259
- 18. Graeßner M, Jungwirth B, Frank E, Schaller SJ, Kochs E, Ulm K, et al. Enabling personalized perioperative risk prediction by using a machine-learning model based on preoperative data. Sci Rep. 2023;13(1):7128. pmid:37130884
- 19. Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology. 1983;148(3):839–43. pmid:6878708
- 20. Jammer I, Wickboldt N, Sander M, Smith A, Schultz MJ, Pelosi P, et al. Standards for definitions and use of outcome measures for clinical effectiveness research in perioperative medicine: European Perioperative Clinical Outcome (EPCO) definitions: a statement from the ESA-ESICM joint taskforce on perioperative outcome measures. Eur J Anaesthesiol. 2015;32(2):88–105. pmid:25058504
- 21. Borson S, Scanlan J, Brush M, Vitaliano P, Dokmak A. The mini-cog: a cognitive “vital signs” measure for dementia screening in multi-lingual elderly. Int J Geriatr Psychiatry. 2000;15(11):1021–7. pmid:11113982
- 22. Anetsberger A, Blobner M, Krautheim V, Umgelter K, Schmid S, Jungwirth B. Self-Reported, Structured Measures of Recovery to Detect Postoperative Morbidity. PLoS One. 2015;10(7):e0133871. pmid:26207620
- 23. Mazo V, Sabaté S, Canet J, Gallart L, de Abreu MG, Belda J, et al. Prospective external validation of a predictive score for postoperative pulmonary complications. Anesthesiology. 2014;121(2):219–31. pmid:24901240
- 24. Fage BA, Chan CC, Gill SS, Noel-Storr AH, Herrmann N, Smailagic N, et al. Mini-Cog for the detection of dementia within a community setting. Cochrane Database Syst Rev. 2021;7(7):CD010860. pmid:34259337
- 25. Trautwein B. Preventing postoperative pulmonary complications after general anaesthesia in adult surgical patients – an interim analysis. Euroanaesthesia Congress 2025.
- 26. Shander A, Fleisher LA, Barie PS, Bigatello LM, Sladen RN, Watson CB. Clinical and economic burden of postoperative pulmonary complications: patient safety summit on definition, risk-reducing interventions, and preventive strategies. Crit Care Med. 2011;39(9):2163–72. pmid:21572323
- 27. Lawrence VA, Hilsenbeck SG, Mulrow CD, Dhanda R, Sapp J, Page CP. Incidence and hospital stay for cardiac and pulmonary complications after abdominal surgery. J Gen Intern Med. 1995;10(12):671–8. pmid:8770719
- 28. Lawrence VA, Hilsenbeck SG, Noveck H, Poses RM, Carson JL. Medical complications and outcomes after hip fracture repair. Arch Intern Med. 2002;162(18):2053–7. pmid:12374513
- 29. McAlister FA, Bertsch K, Man J, Bradley J, Jacka M. Incidence of and risk factors for pulmonary complications after nonthoracic surgery. Am J Respir Crit Care Med. 2005;171(5):514–7. pmid:15563632
- 30. Demi L, Mento F, Di Sabatino A, Fiengo A, Sabatini U, Macioce VN. Lung Ultrasound in COVID-19 and Post-COVID-19 Patients, an Evidence-Based Approach. J Ultrasound Med. 2022;41(9):2203–15. https://doi.org/10.1002/jum.15902 pmid:34859905
- 31. Dave C, Wu D, Tschirhart J, Smith D, VanBerlo B, Deglint J, et al. Prospective Real-Time Validation of a Lung Ultrasound Deep Learning Model in the ICU. Crit Care Med. 2023;51(2):301–9. pmid:36661454
- 32. Li P, Gao S, Wang Y, Zhou R, Chen G, Li W, et al. Utilising intraoperative respiratory dynamic features for developing and validating an explainable machine learning model for postoperative pulmonary complications. Br J Anaesth. 2024;132(6):1315–26. pmid:38637267