Abstract
Background
Pediatric patients under general anesthesia are particularly vulnerable to hypoxemia, as interruptions in oxygen supply can lead to rapid oxygen desaturation. This vulnerability necessitates heightened vigilance from anesthesiologists, making pediatric anesthesia management especially challenging. Continuous intraoperative monitoring of oxygenation is critical. However, traditional methods relying solely on SpO2 readings may be insufficient and prone to inaccuracies.
Methods
This study aimed to develop and externally validate various machine learning models to predict hypoxemia in pediatric patients under general anesthesia. This retrospective observational study included 800 pediatric cases from Seoul National University Hospital and 134 pediatric cases from Chungnam National University Hospital. Patient data, including vital signs and ventilator parameters sampled every 2 seconds, were analyzed. Four machine learning models (XGBoost, LSTM, InceptionTime, and Transformer) were evaluated using area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and F1-score.
Results
XGBoost achieved the highest performance in internal validation (AUROC, 0.85), whereas the Transformer model demonstrated the best performance in external validation (AUROC, 0.85). Reducing the observation window from 1 minute to 10 seconds lowered the AUPRC but preserved a high AUROC.
Conclusions
The XGBoost and Transformer models demonstrated robust performance in predicting intraoperative hypoxemia in pediatric patients under general anesthesia across two hospitals. Adjustments for age-related variations did not enhance model performance. Future research should focus on developing machine learning models that can accurately distinguish true hypoxemia, leading to clinically significant improvements in patient outcomes.
Citation: Baek S, Park J-B, Heo J, Kim K, Baek D, Oh C, et al. (2026) Hypoxemia prediction in pediatric patients under general anesthesia using machine learning: A retrospective observational study and external validation. PLoS One 21(1): e0339276. https://doi.org/10.1371/journal.pone.0339276
Editor: Vijayalakshmi Kakulapati, Sreenidhi Institute of Science and Technology, INDIA
Received: June 8, 2025; Accepted: December 3, 2025; Published: January 8, 2026
Copyright: © 2026 Baek et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data supporting the findings of this study are not publicly available due to patient confidentiality and ethical restrictions on the use of medical data. The initial Institutional Review Board (IRB) approvals for this study limited data use to internal research and did not include provisions for public data deposition. However, de-identified data may be made available from the corresponding author upon reasonable request and with approval from the relevant institutional review boards. Contact Information for Data Access Requests: - Chungnam National University Hospital Institutional Review Board: cnuhirb@cnuh.co.kr, +82-42-280-6713 - Seoul National University Hospital Institutional Review Board: irb@snuh.org, +82-2-2072-0694.
Funding: This research was supported and funded by the SNUH Lee Kun-hee Child Cancer & Rare Disease Project, Republic of Korea (grant number: 24C-003-0100) and supported by a research fund from Chungnam National University.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Anesthesia-related cardiac arrest in pediatric patients is a significant concern, with approximately 27% of cases attributed to respiratory causes [1]. During surgical procedures, pediatric patients have an increased susceptibility to hypoxemia compared with adults, presenting a potentially fatal risk [2,3]. This heightened vulnerability stems from their smaller functional residual capacity, leading to reduced oxygen reserves and greater chest wall compliance, which predisposes them to small airway collapse [4,5]. Consequently, interruptions or inadequacies in oxygen supply can lead to a rapid decline in arterial oxygen levels. Therefore, vigilant intraoperative monitoring of oxygenation is imperative in pediatric patients.
False hypoxemia alarms can also complicate monitoring in pediatric patients, increasing the workload of anesthesiologists and obstructing careful monitoring efforts. A study using Anesthesia Information Management Systems found that the incidence of pediatric intraoperative hypoxemia was 12%, with only 54% of events classified as true hypoxemia [2]. Conventionally, true hypoxemia is defined by a low arterial oxygen pressure (PaO2) on arterial blood gas analysis; relying solely on peripheral oxygen saturation (SpO2) readings during hypoxemia events is therefore insufficient for accurately assessing the condition of the patient. Immediate assessment of hypoxemia thus requires evaluating other patient-specific parameters, such as airway pressure and the end-tidal carbon dioxide (EtCO2) waveform, which inform interpretation during monitoring [6,7]. In practice, however, clinicians rely on SpO2 as a real-time surrogate, even though it is prone to artifacts.
Recent studies have reported the use of machine learning models, such as the gradient boosting machine (GBM) and long short-term memory (LSTM) networks, to predict hypoxemia in pediatric patients undergoing general anesthesia [8,9]. Lundberg et al. achieved high predictive performance for hypoxemia, with an area under the receiver operating characteristic curve (AUROC) of up to 0.92, using explainable machine learning models; however, their study was limited to internal validation, leaving the generalizability of the models unconfirmed [10]. In addition, Erion et al. [11] and Liu et al. [12] demonstrated the effectiveness of SpO2-based predictive models and proposed hybrid networks to address class imbalance and the challenge of predicting persistent hypoxemia. However, these models relied primarily on low-resolution data measured at 1-minute intervals and long observation windows (10 minutes to 1 hour), limiting their ability to capture the rapid desaturation events typical of pediatric hypoxemia, which occur within an average of 45 seconds [11,12]. Park et al. achieved high accuracy, recording an AUROC of up to 0.939, using high-frequency data sampled at 2-second intervals to improve temporal resolution; however, as in previous studies, the lack of external validation limits assessment of the model’s generalizability across diverse clinical settings [13].
This study aimed to propose a machine learning model to predict intraoperative hypoxemia in pediatric patients undergoing general anesthesia. We used high-resolution time-series data normalized and stratified by age group to account for pediatric-specific characteristics and compared the performance of various machine learning models. Additionally, external validation with data from another hospital was performed to verify model generalizability, and the importance of key features influencing predictions was analyzed.
Materials and methods
Study design
This retrospective multicenter external validation study was approved by the institutional review boards (IRBs) of Chungnam National University Hospital (CNUH; IRB number 2023-09-052) and Seoul National University Hospital (SNUH; IRB number H-2303-092-1412). Authorization for data anonymization and external transfer was granted by the Data Review Board of SNUH (DRB number DRB-R-2023-03-05).
This study used data from 800 pediatric cases at Seoul National University Hospital and 134 cases at Chungnam National University Hospital (Table 1). All biosignal data were obtained from the prospective registry of vital signs for surgical patients at SNUH and CNUH, using Vital Recorder version 1.9 (accessed at https://vitaldb.net; VitalDB, Seoul, Republic of Korea) [14]. The authors accessed the datasets for research purposes on 01/06/2023 at SNUH and on 15/12/2023 at CNUH. All data were de-identified, and no identifiable information was accessible to the authors. This manuscript adheres to the TRIPOD-AI (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis-Artificial Intelligence) statement (S1 Table).
As described in Table 1, demographic data, including sex, age, height, and weight, were extracted from the electronic medical record. Surgical and anesthesia-related details were collected, including the surgery name, diagnosis, surgery date, and start and end times of surgery and anesthesia. The Vital Recorder program was used to gather physiological data, including vital signs (including SpO2) and parameters from the anesthesia ventilator machine, including EtCO2, fraction of inspired oxygen (FiO2), tidal volume measured on exhalation (TV), peak inspiratory pressure (PIP), and minute ventilation (MV).
The inclusion criteria were pediatric patients younger than 18 years who underwent general anesthesia in the operating room and whose oxygen saturation decreased below 95% at least once during surgery. Patients without Vital Recorder data during anesthesia were excluded, as were patients undergoing cardiac or pulmonary surgery, those who underwent procedures that induce apnea (such as airway surgeries), and those with preoperative oxygen saturation <95%. These exclusions were intended to minimize confounding factors from procedure-specific causes of hypoxemia, thereby focusing the model on predicting events arising from general anesthetic management. Additionally, to exclude clinically insignificant and noisy hypoxemic events, data were excluded when the pulse rate measured via pulse oximetry differed by more than 20% from the heart rate recorded by electrocardiography, when the plethysmography waveform was severely distorted, or when an anesthesiologist recorded that the oxygen saturation reading was inaccurate (Fig 1).
SNUH, Seoul National University Hospital; CNUH, Chungnam National University Hospital.
Preprocessing
A hypoxemic event was defined as the period from the onset of SpO2 decline to 95%, through its nadir, and until SpO2 values recovered to 95%. Clinically important time points in the vital sign data included the onset of SpO2 decline, the point when SpO2 dropped to 95%, the nadir of SpO2, and the time when SpO2 returned to baseline levels.
The biosignal data used in this study were sampled at 2-second intervals, with each time-series segment arranged into 120-second intervals. To predict potential hypoxemia events, the first half of each segment (30 time points consisting of 6 physiological parameters) was used as input for the model to forecast events occurring in the latter half of the segment.
Each segment for model training was 1 minute in length, and a segment was designated a hypoxemia-predictive interval if a hypoxemia event occurred within 1 minute after the segment’s end. This 1-minute prediction window was selected because it provides a clinically actionable timeframe for preventive interventions and is consistent with methodologies used in prior studies on pediatric hypoxemia prediction [13]. All other segments were labeled as non-hypoxemia intervals (Fig 2). Successive segments were staggered by 2 seconds, producing overlapping windows. To ensure that the SpO2-defined labels reflected clinically valid hypoxemic events, all labeling was performed by two expert anesthesiologists and cross-checked between observers; any disagreements were resolved through discussion.
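The windowing and labeling scheme described above can be sketched as follows. The function and argument names are illustrative, not taken from the study’s code; only the sampling interval (2 s), window length (60 s), prediction horizon (60 s), stride (2 s), and SpO2 threshold (95%) come from the text.

```python
import numpy as np

def make_segments(signals, spo2, sample_period=2, window_s=60,
                  horizon_s=60, stride_s=2, threshold=95.0):
    """Slice one case into overlapping 60-s observation windows and label
    each window positive if SpO2 drops below `threshold` within the
    following 60-s prediction window.
    `signals` is (T, 6) for the six physiological channels; `spo2` is (T,).
    """
    win = window_s // sample_period      # 30 time points per window
    hor = horizon_s // sample_period     # 30 time points of lookahead
    step = stride_s // sample_period     # slide by one 2-s sample
    X, y = [], []
    for start in range(0, len(spo2) - win - hor + 1, step):
        X.append(signals[start:start + win])
        future = spo2[start + win:start + win + hor]
        y.append(int(np.any(future < threshold)))
    return np.array(X), np.array(y)
```

Note that because windows overlap at a 2-second stride, one hypoxemic event generates many positive segments, which is one reason segment-level event rates (S2 Table) differ from patient-level incidence.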
Each 1-minute segment was labeled as hypoxemic if a hypoxemic event occurred within 1 minute after the segment ended and as non-hypoxemic otherwise. Yellow bar, prediction window with non-hypoxemic event; orange bar, prediction window including SpO2 < 95% during hypoxemic event; green bar, observational window. SpO2, peripheral oxygen saturation.
The vital signs used for model input were sequential tabular data, comprising numerical time-series data sampled at 2-second intervals. The six key physiological parameters used were SpO2, FiO2, EtCO2, PIP, TV, and MV. These time-series features were combined with four static demographic variables (height, weight, sex, and age) to form the complete input for the model. To prepare these features for the models, we first imputed any missing values with zero. Subsequently, for each 60-second input window, each of the six time-series signals was independently scaled to a range of [0, 1] using local min-max normalization to ensure stable model training.
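A minimal sketch of this preprocessing step, assuming a (30, 6) array holding one 60-second window of the six signals; constant channels are mapped to 0 to avoid division by zero, a detail the paper does not specify:

```python
import numpy as np

def preprocess_window(window):
    """Zero-impute missing values, then min-max scale each of the six
    channels independently within the window (local normalization)."""
    w = np.nan_to_num(window, nan=0.0)          # missing values -> 0
    lo = w.min(axis=0, keepdims=True)
    hi = w.max(axis=0, keepdims=True)
    rng = np.where(hi - lo > 0, hi - lo, 1.0)   # guard constant channels
    return (w - lo) / rng                       # each channel in [0, 1]
```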
Data preparation
Patients at SNUH from January 2019 to October 2020 were included, with the dataset split into training and internal validation sets at an 8:2 ratio, yielding 724 and 76 patients, respectively, based on surgery dates. The training set was further split into training and tuning sets at an 8:2 ratio for model training. For external validation, the dataset included 134 patients from CNUH between January 2021 and December 2022.
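The date-based 8:2 split can be sketched like this; the case-tuple layout and helper name are hypothetical, with only the ratio and the use of surgery dates taken from the text:

```python
def chronological_split(cases, train_frac=0.8):
    """Split patient cases by surgery date so the validation set comes
    strictly after the training set in time.
    `cases` is a list of (case_id, surgery_date) tuples.
    """
    ordered = sorted(cases, key=lambda c: c[1])   # oldest surgeries first
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]
```

Splitting at the patient level by date, rather than shuffling segments, prevents overlapping windows from the same case leaking between training and validation.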
The proportion of data segments labeled as hypoxemia was higher in the CNUH dataset (1.42%) than in the SNUH dataset (1.15%). This difference reflects that hypoxemia events occurred more frequently but with shorter durations at SNUH, whereas they were less frequent but had longer durations at CNUH (S2 Table).
Machine learning models and hyperparameter optimization
This study used four types of machine learning models: XGBoost, LSTM, InceptionTime, and Transformer [9,15–17]. To optimize each model, we performed a hyperparameter search for XGBoost, LSTM, and Transformer using a 5-fold cross-validation strategy on the training set. The parameter combination that yielded the highest average AUROC was selected as the final configuration. For the InceptionTime model, we adopted the architecture from the original publication due to its proven performance on time-series classification tasks. The complete search space and the final selected hyperparameters for all models are detailed in S3 Table.
We used XGBoost, which offers several advantages for biosignal analysis, including high computational efficiency, the ability to handle missing data, robustness to noise, and strong performance with structured and time-series data due to its gradient boosting framework [15,18]. We used 2,000 trees with a maximum depth of 5, a subsampling rate of 0.5, a gamma value of 0.4, and a minimum child weight of 2.
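Expressed as the parameter dictionary one would pass to `xgboost.XGBClassifier`, the reported settings map onto the following sketch. Options not reported in the text (e.g. the learning rate) are left unset, and the binary-logistic objective shown is the standard choice for this task, an assumption on our part:

```python
# Reported XGBoost hyperparameters (comments restate the text above).
xgb_params = {
    "n_estimators": 2000,          # 2,000 trees
    "max_depth": 5,                # maximum tree depth
    "subsample": 0.5,              # row subsampling rate
    "gamma": 0.4,                  # minimum loss reduction to split
    "min_child_weight": 2,         # minimum child weight
    "objective": "binary:logistic" # assumed; standard for binary labels
}
```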
We also used LSTM, which is well-suited for processing time-series data and has been used in previous hypoxemia prediction studies [10–13], to effectively capture the long-term dependencies of biosignals [19]. Our LSTM model had a single hidden layer with 64 hidden nodes and 16 dense nodes along with a dropout rate of 0.5.
Additionally, we used InceptionTime, a scalable deep-learning model for time-series classification inspired by the Inception-v4 architecture [16]. Our configuration used 8 filters, residual connections, and bottleneck layers, with a depth of 6 and a kernel size of 20.
Lastly, the Transformer captures long-range dependencies through self-attention mechanisms, processes large datasets efficiently in parallel, and adapts well to complex, multivariate time-series data [20]. The Transformer model featured 64 filters, 3 attention heads, an embedding dimension of 32, 1 convolutional layer, 3 transformer layers, and a dropout rate of 0.2.
Implementation details
All models were trained for 100 epochs using five-fold cross-validation to improve robustness, with binary cross-entropy loss as the optimization criterion. As hypoxemic events were nearly 100 times rarer than non-hypoxemic periods, the loss function for class 1 was heavily weighted compared with class 0 to address the data imbalance issue.
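A sketch of such a class-weighted binary cross-entropy. The exact positive-class weight used in the study is not reported; the default of 100 here is only a placeholder matching the stated ~1:100 imbalance:

```python
import numpy as np

def weighted_bce(y_true, y_pred, pos_weight=100.0, eps=1e-7):
    """Binary cross-entropy with the positive (hypoxemic) class
    up-weighted by `pos_weight` to counter class imbalance."""
    p = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    loss = -(pos_weight * y_true * np.log(p)
             + (1 - y_true) * np.log(1 - p))
    return loss.mean()
```

With this weighting, missing a hypoxemic segment costs roughly 100 times as much as a false alarm at the same predicted probability, which pushes the model toward higher sensitivity.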
The batch size was set to 1,024, and all deep learning models used the Adam optimizer with an initial learning rate of 0.001, beta1 of 0.9, and beta2 of 0.999. The model weights that achieved the best AUROC across all epochs were selected for evaluation. The F1 score was calculated, and the optimal threshold for binary classification was chosen based on the highest F1 score.
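The F1-maximizing threshold selection can be illustrated with a straightforward sweep over candidate cutoffs; the study’s exact search procedure may differ:

```python
import numpy as np

def best_f1_threshold(y_true, scores):
    """Return the score cutoff that maximizes F1 on (y_true, scores)."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):            # every observed score as cutoff
        pred = (scores >= t).astype(int)
        tp = int(np.sum((pred == 1) & (y_true == 1)))
        fp = int(np.sum((pred == 1) & (y_true == 0)))
        fn = int(np.sum((pred == 0) & (y_true == 1)))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```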
The models were developed on an NVIDIA GeForce RTX 4090 and implemented using Python version 3.10.1, TensorFlow version 2.10.1, and Keras version 2.10.0. For further details, the source code is publicly available on GitHub at https://github.com/jihyeheo/ML-PredGA-Hypoxemia.
Evaluation metrics and feature importance
The models’ performance was evaluated using the AUROC, AUPRC, and F1 score. AUROC assesses the overall classification performance, AUPRC focuses on precision and recall in imbalanced datasets, and the F1 score provides a balance between precision and recall.
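As an illustration of the headline metric, AUROC can be computed directly from its rank interpretation, without any particular library: it is the probability that a randomly chosen positive segment receives a higher score than a randomly chosen negative one.

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the Mann-Whitney U statistic (ties count as 0.5)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

This pairwise view also makes clear why AUROC can stay high on a rare-event task while AUPRC and F1 remain low: AUROC ignores the absolute number of false positives relative to the few true positives.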
To further interpret the model’s predictions and understand feature importance, we used Shapley Additive exPlanations (SHAP) values. The SHAP values, derived from cooperative game theory, provide a unified measure of feature importance by considering all possible combinations of feature values [21]. SHAP values are essential for understanding which features are most important to a model and how different values of these features influence the predictions.
Statistical analysis
Statistical analyses were conducted using Python version 3.10.1. The normality of the data distribution was assessed using the Shapiro–Wilk test. Based on these results, continuous variables were reported as medians and interquartile ranges (IQRs). To evaluate data comparability, the Mann–Whitney U-test was used for continuous variables. Categorical variables were presented as numbers (%) and analyzed using Pearson’s chi-square test. A comparison of the AUROCs between the models was conducted, and statistical significance was assessed using the DeLong test.
Results
Model performance for predicting hypoxemia
The models’ performance in predicting intraoperative hypoxemia in pediatric patients under general anesthesia was evaluated using AUROC, AUPRC, and F1 score values across the four models in both the internal and external validation datasets (Table 2 and Fig 3). In the internal validation dataset, the XGBoost model achieved the highest performance among the four models, with an AUROC of 0.85, an AUPRC of 0.18, and an F1 score of 0.24, consistent with previous studies; however, it showed the lowest performance on the external dataset. XGBoost was followed by the InceptionTime, Transformer, and LSTM models in descending order of internal validation performance. In the external validation dataset, the Transformer model performed best, with an AUROC of 0.85, an AUPRC of 0.06, and an F1 score of 0.12, followed by the LSTM and InceptionTime models. Nonetheless, the AUPRC and F1 score values were notably low across all four models.
AUROC curves for internal (a) and external (b) validations. Abbreviations: LSTM, long short-term memory; AUROC, area under the receiver operating characteristic curve.
Comparative performance in hypoxemia prediction
We evaluated four adjustment methods intended to improve model performance. First, we applied data normalization to account for characteristics that vary significantly with age in pediatric patients. On the internal validation dataset, the AUROC of XGBoost decreased from 0.855 to 0.804 and its AUPRC from 0.182 to 0.139; on the external validation dataset, the AUROC of the Transformer decreased from 0.850 to 0.711 and its AUPRC from 0.060 to 0.041. Second, the dataset was stratified into age subgroups for training. Although the model trained on the 2–8-year age subgroup showed improved performance on the external validation dataset, this improvement was not consistent across subgroups, and overall these adjustments decreased performance (S4 and S5 Tables).
We also aimed to identify the minimal observation window required to predict hypoxemia without significantly compromising performance, reducing the window from 1 minute down to 10 seconds and examining the resulting changes. For XGBoost, lengthening the observation window yielded only a modest improvement in AUROC on the internal validation dataset, rising from 0.836 to 0.855, whereas the AUPRC increased more substantially, from 0.144 to 0.182. Thus, the AUROC remained high even with a reduced 10-second observation window (S6 Table).
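The window-shortening experiment amounts to truncating each observation window to its most recent samples before retraining; a minimal sketch, with illustrative names:

```python
import numpy as np

def truncate_window(X, keep_s=10, sample_period=2):
    """Keep only the last `keep_s` seconds of each observation window,
    e.g. 10 s = 5 samples at the 2-s sampling interval.
    `X` is (n_segments, 30, 6)."""
    k = keep_s // sample_period
    return X[:, -k:, :]
```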
Additionally, we investigated the use of waveform biosignals for model performance enhancement. Three waveforms—photoplethysmography, airway pressure, and capnography—were converted into 2D spectrograms, and features were extracted using EfficientNet. These waveform features were concatenated with traditional single-measurement features derived using InceptionTime to generate the final prediction. The pretrained model achieved an AUROC of 0.8073 and an AUPRC of 0.0936 on the internal validation dataset and an AUROC of 0.7523 and an AUPRC of 0.0497 on the external validation dataset. The F1 scores were 0.0217 for internal validation and 0.0318 for external validation (S7 Table).
Finally, to address the potential for circular reasoning from using SpO2 to predict a SpO2-defined event, we evaluated the performance of models trained using only SpO2 time-series data. The results, detailed in S8 Table, showed that the SpO2-only models had lower performance across most metrics compared to our proposed multi-variable models. For instance, the AUROC of the multi-variable XGBoost model on the internal validation set was 0.855, significantly higher than the 0.766 achieved by its SpO2-only counterpart. This demonstrates that incorporating additional physiological variables provides critical contextual information that enhances the model’s predictive power beyond simple SpO2 trend extrapolation.
Feature importance analysis
To visually explain the impact of the variables, we used SHAP to illustrate how they influence the model’s predictions (Fig 4). The wide distribution of SHAP values for the demographic variables across both positive and negative domains suggests that their influence is highly context-dependent, varying with interactions with other physiological variables. The SHAP values for SpO2, FiO2, and EtCO2 show a broader distribution, with a greater density of points indicating both positive and negative effects on the model’s predictions, compared with those for PIP, TV, and MV. This indicates that SpO2, FiO2, and EtCO2 have a greater impact on the model’s predictions. Additionally, PIP, TV, and MV contributed predominantly negatively, as their distributions lie mainly to the left of zero, suggesting that these parameters tended to lower the predicted probability of hypoxemia.
SHAP values for internal (a) and external (b) validations. The plot shows that demographic features have a highly context-dependent influence. In contrast, key physiological variables such as SpO2, FiO2, and EtCO2 exhibit a more significant and direct impact on the model’s predictions than mechanical ventilation parameters such as PIP, TV, and MV, aligning with clinical intuition. Abbreviations: SHAP, Shapley Additive exPlanations; SpO2, peripheral oxygen saturation; FiO2, fraction of inspired oxygen; EtCO2, end-tidal carbon dioxide; PIP, peak inspiratory pressure; TV, tidal volume; MV, minute ventilation.
Discussion
This study developed various machine learning models to predict intraoperative hypoxemia in pediatric patients undergoing general anesthesia and performed external validation. The results showed that the XGBoost model demonstrated the best performance in internal validation, whereas the Transformer model outperformed others in external validation. Various methods were applied to enhance the performance of the hypoxemia prediction models, revealing that normalization of input values or stratification into age subgroups had minimal impact on performance. Additionally, reducing the observation window from 1 minute to 10 seconds maintained a high AUROC but resulted in a decreased AUPRC. Notably, SpO2, FiO2, and EtCO2 were identified as having a greater impact on model predictions compared to PIP, TV, and MV.
Pediatric anesthesia poses unique risks because of narrow safety margins, limited drug choices, and the uneven distribution of specialized anesthesiologists [22]. Therefore, in medical environments where it is difficult to allocate dedicated pediatric anesthesiologists, machine learning-based predictive models could potentially serve as a supportive tool to help address this gap. Previous studies on hypoxemia prediction have demonstrated that deep learning models perform as well as or better than traditional machine learning models [23]. However, other research has shown that machine learning approaches, such as the GBM, achieved the best performance for hypoxemia prediction [13]. Our findings align with this, indicating that even machine learning models like XGBoost, which lack a native sequential architecture [24], can achieve high performance by effectively capturing the complex interactions between biosignals and demographic data. Furthermore, as with previous studies, it is crucial to evaluate models using external validation rather than relying solely on internal validation to confirm their generalizability [10–13].
In cases such as inadequate ventilation, abnormal values may be observed in ventilator-related parameters such as EtCO2, PIP, TV, and MV. TV and MV vary substantially with age, necessitating subgroup analysis and min-max normalization. These adjustments did not markedly improve performance, and SHAP values indicated relatively modest contributions from TV and MV, which were offset by integrating demographic data into the model’s final layer.
Decreasing the observation window from 1 minute to 10 seconds revealed that the AUROC remained substantially unchanged even with a mere 10-second window. This implies that the model predicts hypoxemia primarily from the absolute values just before a hypoxemic event rather than from trends within the time-series data, clarifying why models such as GBM can excel on these data despite lacking a sequential architecture. In particular, the SpO2 value, a key parameter used in training, serves both as the pivotal criterion for determining hypoxemic events and as a predictor. Prior research on hypoxemia prediction indicates that models relying solely on prior SpO2 data generally outperform those incorporating multiple clinical variables [11], and the SHAP values likewise show that SpO2 influences the model’s predictions more than other variables.
Additionally, AUPRC values tended to increase as the observation window lengthened. AUPRC is more informative than AUROC for evaluating classifier performance on highly imbalanced datasets [25]. Although a high AUROC is beneficial, a clinically valuable model should also prioritize generating true positives, reflected in higher F1 scores and AUPRC values. As only about 54% of intraoperative low-SpO2 events represent true hypoxemia, refining predictive accuracy for early detection remains a priority.
Lastly, we aimed to improve the prediction of pediatric hypoxemia by incorporating photoplethysmography, airway pressure, and capnography waveforms; however, this approach did not significantly enhance the model’s performance and, in some cases, added noise. This outcome underscores the complexity of effectively integrating waveform features from multiple sources: careful feature selection and tailored model design are essential to optimize predictions for pediatric hypoxemia.
This study has several limitations. The study is limited by biases inherent in its retrospective analysis, insufficient evidence linking hypoxemia prediction in children to reduced event occurrences, and the exclusion of high-risk groups, which restricts the applicability of the results. Furthermore, while our models demonstrated predictive capability significantly better than baseline, their precision, as measured by AUPRC, remains modest. This indicates a high rate of false alarms, which could lead to alarm fatigue. Notably, this discrepancy between high AUROC and modest AUPRC is consistent with previous findings on a similar high-frequency pediatric dataset [13], suggesting it may be an inherent challenge of this specific prediction task. However, this reflects a necessary trade-off in a clinical context where failing to predict a true hypoxemic event (a false negative) carries significantly greater risk than a false alarm. This trade-off solidifies the model’s current role as a situational awareness support tool rather than a definitive diagnostic alarm. The model primarily focuses on ventilation-related measures, overlooking circulatory factors that also contribute to hypoxemia. Incorporating variables such as blood pressure and ECG data could improve the model’s relevance to real-world surgical scenarios. Future research should adopt a prospective approach with a broader patient demographic, include real-time data collection to enhance precision and clinical applicability, and prioritize external validation across multiple institutions to ensure the model’s generalizability and robustness.
In conclusion, we used machine learning models, specifically XGBoost and Transformer, to predict intraoperative hypoxemia in pediatric patients under general anesthesia. Internal and external validations were conducted using data from two distinct hospitals (SNUH and CNUH), demonstrating robust predictive performance. With improved precision and with generalizability confirmed through external testing at additional institutions, the proposed model has the potential for widespread adoption in clinical practice to accurately identify hypoxemia in pediatric patients.
Supporting information
S2 Table. Counts of each labeled segment in the dataset.
This table presents the distribution of labeled segments across the training, internal validation, and external validation datasets, highlighting that hypoxemia-labeled segments were more prevalent in the CNUH dataset (1.42%) than in the SNUH datasets (1.10% in training and 1.08% in internal validation). Abbreviations: SNUH, Seoul National University Hospital; CNUH, Chungnam National University Hospital.
https://doi.org/10.1371/journal.pone.0339276.s002
(DOCX)
S3 Table. Hyperparameter search space and final selected values for the machine learning models.
https://doi.org/10.1371/journal.pone.0339276.s003
(DOCX)
S4 Table. Comparative performance of machine learning models for hypoxemia prediction in pediatric patients after normalization.
The performance of the XGBoost and Transformer models for predicting hypoxemia in pediatric patients under general anesthesia was compared before and after normalization. The performance metrics, including the AUROC, AUPRC, and F1 score, were evaluated on both internal and external validation datasets. The highest values for each metric in both datasets are highlighted in bold. Abbreviations: AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve; w/o, without; w/, with; norm., normalization.
https://doi.org/10.1371/journal.pone.0339276.s004
(DOCX)
S5 Table. Comparative performance of machine learning models for hypoxemia prediction in pediatric patients after training by age subgroup.
The performance of the XGBoost and Transformer models for hypoxemia prediction in pediatric patients under general anesthesia, stratified by age subgroups, shows variations in the AUROC, AUPRC, and F1 scores across internal and external validation datasets. Abbreviations: AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve.
https://doi.org/10.1371/journal.pone.0339276.s005
(DOCX)
S6 Table. Comparative performance of the XGBoost model for hypoxemia prediction in pediatric patients based on changes in observation window length.
This table presents the performance of the XGBoost model for hypoxemia prediction in pediatric patients under general anesthesia, highlighting how different observation window lengths (from 10 to 60 seconds) impact the AUROC, AUPRC, and F1 scores across internal and external validation datasets. Abbreviations: AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve.
https://doi.org/10.1371/journal.pone.0339276.s006
(DOCX)
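The observation-window analysis above varies how much recent history (10 to 60 seconds of vitals sampled every 2 seconds) the model sees before making a prediction. A minimal sketch of this windowing step is shown below; the helper name and toy SpO2 values are illustrative, not taken from the study code:

```python
import numpy as np

def make_windows(series, window_s, sample_interval_s=2):
    """Slice a vital-sign series sampled every `sample_interval_s` seconds
    into non-overlapping observation windows of `window_s` seconds."""
    n = window_s // sample_interval_s            # samples per window
    usable = (len(series) // n) * n              # drop the incomplete tail
    return series[:usable].reshape(-1, n)

# toy example: 2 minutes of SpO2 recorded at one sample every 2 seconds
spo2 = np.full(60, 98.0)
print(make_windows(spo2, 10).shape)  # 10 s windows -> (12, 5)
print(make_windows(spo2, 60).shape)  # 60 s windows -> (2, 30)
```

Shorter windows yield more (but less informative) training segments per case, which is consistent with the reported drop in AUPRC at a 10-second window while AUROC remained high.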
S7 Table. Comparative performance of the machine learning model for hypoxemia prediction in pediatric patients using waveform biosignals.
In this analysis, three waveforms (photoplethysmography, airway pressure, and capnography) were converted into 2D spectrograms, and features were extracted from them using EfficientNet. These waveform features were then concatenated with features extracted by InceptionTime from the conventional numeric measurements, and the combined feature vector was used for the final prediction. Abbreviations: AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve.
https://doi.org/10.1371/journal.pone.0339276.s007
(DOCX)
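The waveform pipeline described for S7 Table has two core steps: converting each 1D biosignal into a 2D spectrogram, and concatenating the resulting embedding with the numeric-feature embedding. The sketch below illustrates only these two steps with NumPy; the EfficientNet and InceptionTime encoders are replaced by simple stand-ins, and the signal, window, and hop values are illustrative assumptions:

```python
import numpy as np

def spectrogram(signal, win=64, hop=32):
    """Convert a 1D waveform into a 2D magnitude spectrogram via a simple STFT."""
    window = np.hanning(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq_bins, time_frames)

def fuse_features(wave_feats, numeric_feats):
    """Concatenate waveform-derived and numeric-derived feature vectors."""
    return np.concatenate([wave_feats, numeric_feats])

# toy example: a 2 s photoplethysmography-like waveform at 100 Hz
t = np.linspace(0, 2, 200, endpoint=False)
ppg = np.sin(2 * np.pi * 1.5 * t)              # ~90 bpm pulse component
spec = spectrogram(ppg)                        # 2D input for the image encoder
wave_vec = spec.mean(axis=1)                   # stand-in for an EfficientNet embedding
fused = fuse_features(wave_vec, np.zeros(8))   # stand-in for InceptionTime features
print(spec.shape, fused.shape)
```

In the actual study, the stand-in embeddings would be produced by the trained EfficientNet and InceptionTime networks, and `fused` would feed the final classification head.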
S8 Table. Comparative performance of the machine learning model for hypoxemia prediction in pediatric patients using only the SpO2 feature.
This table presents the performance of the four models when trained and evaluated using only the time-series of peripheral oxygen saturation (SpO2) as an input feature. These results serve as a baseline to assess the added predictive value of the multi-variable model. Abbreviations: LSTM, long short-term memory; AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve.
https://doi.org/10.1371/journal.pone.0339276.s008
(DOCX)
Acknowledgments
The IRBs of CNUH and SNUH waived the requirement for informed consent owing to the retrospective nature of the study.
References
- 1. Bhananker SM, Ramamoorthy C, Geiduschek JM, Posner KL, Domino KB, Haberkern CM, et al. Anesthesia-related cardiac arrest in children: Update from the Pediatric Perioperative Cardiac Arrest Registry. Anesth Analg. 2007;105(2):344–50. pmid:17646488
- 2. de Graaff JC, Bijker JB, Kappen TH, van Wolfswinkel L, Zuithoff NPA, Kalkman CJ. Incidence of intraoperative hypoxemia in children in relation to age. Anesth Analg. 2013;117(1):169–75. pmid:23687233
- 3. Mir Ghassemi A, Neira V, Ufholz L-A, Barrowman N, Mulla J, Bradbury CL, et al. A systematic review and meta-analysis of acute severe complications of pediatric anesthesia. Paediatr Anaesth. 2015;25(11):1093–102. pmid:26392306
- 4. Harless J, Ramaiah R, Bhananker SM. Pediatric airway management. Int J Crit Illn Inj Sci. 2014;4(1):65–70. pmid:24741500
- 5. Adewale L. Anatomy and assessment of the pediatric airway. Paediatr Anaesth. 2009;19 Suppl 1:1–8. pmid:19572839
- 6. Schmidt GA. Monitoring gas exchange. Respir Care. 2020;65(6):729–38. pmid:32457167
- 7. Rackley CR. Monitoring during mechanical ventilation. Respir Care. 2020;65(6):832–46. pmid:32457174
- 8. Friedman JH. Greedy function approximation: A gradient boosting machine. Ann Statist. 2001;29(5):1189–232.
- 9. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. pmid:9377276
- 10. Lundberg SM, Nair B, Vavilala MS, Horibe M, Eisses MJ, Adams T, et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat Biomed Eng. 2018;2(10):749–60. pmid:31001455
- 11. Erion G, Chen H, Lundberg SM, Lee SI. Anesthesiologist-level forecasting of hypoxemia with only SpO2 data using deep learning. arXiv preprint. 2017.
- 12. Liu H, Montana M, Li D, Renfroe C, Kannampallil T, Lu C. Predicting intraoperative hypoxemia with hybrid inference sequence autoencoder networks. In: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022. 1269–78.
- 13. Park J-B, Lee H-J, Yang H-L, Kim E-H, Lee H-C, Jung C-W, et al. Machine learning-based prediction of intraoperative hypoxemia for pediatric patients. PLoS One. 2023;18(3):e0282303. pmid:36857376
- 14. Lee H-C, Jung C-W. Vital Recorder-a free research tool for automatic recording of high-resolution time-synchronised physiological data from multiple anaesthesia devices. Sci Rep. 2018;8(1):1527. pmid:29367620
- 15. Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. 785–94.
- 16. Ismail Fawaz H, Lucas B, Forestier G, Pelletier C, Schmidt DF, Weber J, et al. InceptionTime: Finding AlexNet for time series classification. Data Min Knowl Disc. 2020;34(6):1936–62.
- 17. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.
- 18. Paliari I, Karanikola A, Kotsiantis S. A comparison of the optimized LSTM, XGBOOST and ARIMA in time series forecasting. In: 2021 12th International Conference on Information, Intelligence, Systems & Applications (IISA), 2021. 1–7.
- 19. Lindemann B, Müller T, Vietz H, Jazdi N, Weyrich M. A survey on long short-term memory networks for time series prediction. Procedia CIRP. 2021;99:650–5.
- 20. Khan S, Naseer M, Hayat M, Zamir SW, Khan FS, Shah M. Transformers in vision: A survey. ACM Comput Surv. 2022;54(10s):1–41.
- 21. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30.
- 22. Muffly MK, Medeiros D, Muffly TM, Singleton MA, Honkanen A. The geographic distribution of pediatric anesthesiologists relative to the US Pediatric Population. Anesth Analg. 2017;125(1):261–7. pmid:27984248
- 23. Pigat L, Geisler BP, Sheikhalishahi S, Sander J, Kaspar M, Schmutz M, et al. Predicting hypoxia using machine learning: Systematic review. JMIR Med Inform. 2024;12:e50642. pmid:38329094
- 24. Masini RP, Medeiros MC, Mendes EF. Machine learning advances for time series forecasting. J Econ Surv. 2021;37(1):76–111.
- 25. Hancock JT, Khoshgoftaar TM, Johnson JM. Evaluating classifier performance with highly imbalanced Big Data. J Big Data. 2023;10(1).