Using machine learning models to predict oxygen saturation following ventilator support adjustment in critically ill children: A single center pilot study

Background In intensive care units, experts in mechanical ventilation are not continuously at the patient's bedside to adjust ventilation settings and to analyze the impact of these adjustments on gas exchange. The development of clinical decision support systems analyzing patients' data in real time offers an opportunity to fill this gap. Objective The objective of this study was to determine whether a machine learning predictive model could be trained on a set of clinical data and used to predict transcutaneous hemoglobin oxygen saturation 5 min (5min SpO2) after a ventilator setting change. Data sources Data of mechanically ventilated children admitted between May 2015 and April 2017 were included and extracted from a high-resolution research database. More than 776,727 data rows were obtained from 610 patients, discretized into 3 class labels (< 84%, 85% to 91% and 92% to 100%). Performance metrics of predictive models Due to data imbalance, four different data balancing processes were applied. Then, two machine learning models (artificial neural network and bootstrap aggregation of complex decision trees) were trained and tested on these four different balanced datasets. The best model predicted SpO2 with areas under the ROC curves < 0.75. Conclusion This single center pilot study using a machine learning predictive model resulted in an algorithm with poor accuracy. The comparison of machine learning models showed that bagged complex trees were a promising approach. However, these models need to be improved before they are incorporated into a clinical decision support system. One potential solution for improving the predictive model would be to increase the amount of data available in order to limit over-fitting, which is potentially one of the causes of the poor classification performance for two of the three class labels.


Introduction
In case of respiratory failure, mechanical ventilation supports the diffusion of oxygen (O2) into the lungs and the removal of carbon dioxide (CO2) from the body. As an expert in mechanical ventilation cannot reasonably be expected to be continuously present at the patient's bedside, specific medical devices designed to assist with ventilator setting adjustments may help to improve the quality of care [1]. Such devices are developed using algorithms either based on medical reasoning that adapt ventilator settings in real time based on patients' characteristics [2,3] or based on physiologic models that simulate cardiorespiratory responses to modifications of mechanical ventilation settings [4]. The former are not accurate enough to be used widely in clinical practice, especially in children, and the latter are not validated for this indication. Neither type of algorithm learns from ever-growing sets of clinical research data that could potentially improve its performance. To overcome this drawback, another avenue is the development of algorithms using artificial intelligence to provide caregivers with support in their decision-making tasks.
Among the vital parameters, transcutaneous hemoglobin oxygen saturation (SpO2) is monitored continuously at the bedside in intensive care and must be maintained in an adequate range to ensure tissue oxygenation. In mechanically ventilated patients, when SpO2 is low, either FiO2 or ventilation pressures/volume are increased.
In this retrospective study, we assessed machine learning methods to predict the SpO2 class (normal, low or critically low) of mechanically ventilated children after a ventilator setting change, using a high-resolution research database. Such a model would help caregivers prescribe ventilator settings: the caregiver would use the model to predict the effect of a ventilator setting change on SpO2 and would apply the modification if satisfied with the predicted SpO2.

Materials and methods
This retrospective study was conducted at Sainte-Justine Hospital, Quebec, Canada. It included data collected prospectively between May 2015 and April 2017 from all children less than 18 years old who were admitted to the Pediatric Intensive Care Unit (PICU) and mechanically ventilated with an endotracheal tube. Patients' data were excluded if the patient was hemodynamically unstable, defined as 2 or more vasoactive drugs delivered at the same time (i.e., epinephrine, norepinephrine, dopamine or vasopressin), or had an uncorrected cyanotic heart disease, defined by no SpO2 > 97% during the entire PICU stay. All the respiratory data from included patients were extracted from the PICU research database [5], after study approval by the ethics review board (ERB) of Sainte-Justine Hospital (ERB study number 2017 1480).

Prediction problem
The predicted SpO2 class (prognostic class) was the class of the SpO2 measured 5 minutes after a change of a ventilator setting. The delay of 5 min corresponded to the shortest period of time needed to reach a steady state after modification of a ventilator setting [6]. SpO2 levels at 5 min were classified into three categories (Table 1). The thresholds were selected according to clinical value: an SpO2 < 92% is a threshold below which oxygenation is increased in mechanically ventilated children [7]. The critical level of 85% SpO2 is used as an alarm for severe hypoxemia in intensive care [8]. The success criterion for prediction was the ability of the model to predict the SpO2 category 5 min after a ventilator setting change, i.e. a delta in inspired fraction of oxygen (ΔFiO2), a delta in tidal volume, pressure support or pressure control (ΔVt, ΔPS or ΔPC) or a delta in the set positive end-expiratory pressure (ΔPEEP). The variables used in the model are detailed in Fig 1. These ventilator parameters were determined by an item generation and selection process conducted by three physicians (PAJ, MS, DB). The resulting items are presented in Fig 1 with their sources and means of extraction, together with a schematic of the main components of the study.
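The three-category target can be sketched as a small function. This is a minimal illustration only: the exact class boundaries are those of Table 1, which is not reproduced here, so the cut-offs below (85% and 92%, taken from the clinical thresholds quoted in the text) and the function name `spo2_class` are assumptions.

```python
def spo2_class(spo2_percent: float) -> int:
    """Return the assumed class label (1, 2 or 3) for an SpO2 value in %."""
    if spo2_percent < 85.0:   # assumed cut-off for severe hypoxemia
        return 1
    if spo2_percent < 92.0:   # assumed cut-off for low SpO2
        return 2
    return 3                  # normal range


for value in (80, 85, 91, 92, 99):
    print(value, spo2_class(value))
```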

Data preparation for model building
The data were extracted from a research database approved by the ethics committee of Sainte-Justine Hospital (database ERB number 2016-1210, 4061). The extracted data required three processing steps: (1) removal of erroneous data due to disconnection of the patient from the ventilator or the monitor, or due to transient interventions such as suctioning; (2) removal of the rows in which no ventilator setting variable was modified; (3) adaptation of the data format for classifier training. The methodology used to format the data is described in S1 File. In summary, we first transformed the data from the linear format into a table in which the clinical variables are the column labels and the patient codes and storing times are the row labels. Since the readings for the various variables are not all recorded at the same frequency, the data for the different variables were aligned along the row time-steps. Then, only the rows in which at least one of the setting variables was modified were preserved in the data file. The rows with a change in "FiO2 Setting" of more than 0.2 were excluded, to remove increases of FiO2 to 1 during suctioning. For each row, the target variable was added by binning the data of the variable "SpO2 in 5 min" into three classes (Table 1). The binning of the target variable data into three classes allows for better classification performance. For all time-steps, SpO2 values were validated and kept in the database if the heart rate (HR) from the monitors in each row was within ± 10 bpm of the HR from the pulse oximeter. All rows containing HR readings that did not meet this condition were removed.
A total of 610 mechanically ventilated children were included, and the total number of rows per SpO2 class is specified in Table 1. We randomly distributed the rows between the training and test databases, without considering the number of patients in each dataset.
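The row selection summarized here can be sketched in plain Python. This is only an illustration under stated assumptions: the variable names (`fio2_set`, `peep_set`, `hr_monitor`, `hr_oximeter`) are hypothetical, and carrying the last known value forward is one possible way of aligning variables recorded at different frequencies; the procedure actually used is the one described in S1 File.

```python
from collections import defaultdict

# Hypothetical variable names; the study's actual names are in S1 File.
SETTINGS = ("fio2_set", "peep_set")


def long_to_wide(records):
    """Pivot (patient, time, variable, value) tuples into one dict per row."""
    table = defaultdict(dict)
    for patient, time, variable, value in records:
        table[(patient, time)][variable] = value
    return table


def select_rows(table, max_fio2_step=0.2, max_hr_gap=10):
    """Keep time-steps where a setting changed and the SpO2 reading looks valid."""
    kept, last_seen = [], {}
    for patient, time in sorted(table):
        previous = last_seen.get(patient)
        # Assumption: carry the last known value forward to align variables
        # that are recorded at different frequencies.
        current = {**(previous or {}), **table[(patient, time)]}
        last_seen[patient] = current
        if previous is None:
            continue  # first row of a patient: nothing to compare against
        deltas = {s: current[s] - previous[s]
                  for s in SETTINGS if s in current and s in previous}
        if not any(deltas.values()):
            continue  # no ventilator setting was modified
        if abs(deltas.get("fio2_set", 0.0)) > max_fio2_step:
            continue  # large FiO2 jump, e.g. FiO2 raised to 1 for suctioning
        if abs(current["hr_monitor"] - current["hr_oximeter"]) > max_hr_gap:
            continue  # oximeter heart rate disagrees with the monitor
        kept.append((patient, time, deltas))
    return kept


records = [
    ("p1", 0, "fio2_set", 0.4), ("p1", 0, "peep_set", 5),
    ("p1", 0, "hr_monitor", 120), ("p1", 0, "hr_oximeter", 118),
    ("p1", 1, "fio2_set", 0.5),      # small FiO2 change
    ("p1", 2, "hr_monitor", 125),    # no setting change
    ("p1", 3, "fio2_set", 1.0),      # FiO2 step larger than 0.2
    ("p1", 4, "peep_set", 6), ("p1", 4, "hr_oximeter", 90),  # HR mismatch
]
kept = select_rows(long_to_wide(records))
print(kept)
```

In this toy example only the time-step with the small FiO2 change survives the three filters.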
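A row-level random split of this kind can be sketched as follows. The test fraction, the row layout and the helper names are hypothetical (the proportion used in the study is not stated here); the second helper simply lists the patients who contribute rows to both datasets, which is the expected consequence of splitting by row rather than by patient.

```python
import random


def split_rows(rows, test_fraction, rng):
    """Shuffle rows and split them without regard to the patient they come from."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def shared_patients(train, test):
    """Patients whose rows appear in both the training and the test set."""
    return {r["patient"] for r in train} & {r["patient"] for r in test}


rows = [{"patient": p, "row": i} for p in ("a", "b", "c") for i in range(10)]
train, test = split_rows(rows, 0.2, random.Random(0))
print(len(train), len(test), sorted(shared_patients(train, test)))
```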

Data balancing
The data analysis showed a severe imbalance, with most SpO2 values at 5 min above 92%. This is expected, as caregivers aim to maintain SpO2 in the normal range during a child's PICU stay. In such a condition, the classifier learns the majority class label (class 3) (Table 1) but does not learn the minority class labels (classes 1 and 2) [9]. As the data balancing process aims to allow the classifier to learn from all classes equally, a combination of down-sampling and up-sampling techniques was used: to balance the three classes, a down-sampling of SpO2 class 3 using the TOMEK algorithm [10] and an over-sampling of SpO2 classes 1 and 2 using the Synthetic Minority Oversampling Technique (SMOTE) [11] were performed. The down-sampling process was made up of the following steps: (1) the TOMEK algorithm was used to detect TOMEK links throughout the whole dataset, for all three classes, and to remove them (TOMEK links are the links between any two observations that are nearest neighbors but belong to different classes [9]); (2) the remaining points to be removed were selected at random.
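The detection of TOMEK links can be sketched as below. This is a minimal brute-force illustration, assuming the usual definition in which two observations form a link when each is the other's nearest neighbor and their class labels differ; the function names and the Euclidean distance are assumptions, and a quadratic search like this one would not scale to hundreds of thousands of rows.

```python
import math


def nearest(index, points):
    """Index of the nearest neighbor of points[index] (Euclidean distance)."""
    return min((j for j in range(len(points)) if j != index),
               key=lambda j: math.dist(points[index], points[j]))


def tomek_links(points, labels):
    """Pairs of mutual nearest neighbors that belong to different classes."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and nearest(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0)]
labels = [3, 1, 3, 3]
print(tomek_links(points, labels))
```

Only the first two points form a link here: they are mutual nearest neighbors with different labels, whereas the last two share the same class.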
The creation of synthetic data points by SMOTE can be formulated as follows:
$$x_{syn} = x_i + \delta \, (x_{knn} - x_i)$$
In this equation, x_syn represents the synthetic data point. The variables x_i and x_knn are respectively the original instance and the nearest-neighbor data point, which is randomly picked among the k nearest neighbors. The random number δ is generated in [0,1] to determine the position of the created synthetic data point along the straight line joining the original data point x_i and its chosen nearest neighbor x_knn.
To study which data balancing method provided the most accurate algorithm, four datasets were produced via four different balancing procedures, involving different combinations of data balancing techniques (Table 2).
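The interpolation step can be sketched in a few lines. This is an illustration of the standard SMOTE rule, x_syn = x_i + δ (x_knn − x_i), written from the definitions given in the text; the function name, the Euclidean distance and the toy data are assumptions, and the study's actual implementation is not shown in the source.

```python
import math
import random


def smote_point(minority, i, k, rng):
    """Create one synthetic point from minority[i] and one of its k neighbors."""
    x_i = minority[i]
    others = [p for j, p in enumerate(minority) if j != i]
    neighbors = sorted(others, key=lambda p: math.dist(x_i, p))[:k]
    x_knn = rng.choice(neighbors)   # neighbor picked at random among the k
    delta = rng.random()            # random position along the segment
    return tuple(a + delta * (b - a) for a, b in zip(x_i, x_knn))


minority = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
synthetic = smote_point(minority, 0, 1, random.Random(0))
print(synthetic)
```

With k = 1 the only candidate neighbor of (0, 0) is (1, 1), so the synthetic point falls on the segment joining these two points.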

Predicted SpO 2 model construct
To identify the best machine learning classification method, we tested two classification models: artificial neural network and bagged complex decision trees, on the four balanced training datasets.
Artificial Neural Network (ANN). Once the data had been pre-processed, a machine learning predictive model was trained on a subset of labeled training data. The model was then used to predict the target variable values on a testing subset in which the class labels were hidden. We used Artificial Neural Networks (ANN) to make predictions of the SpO2 variable, based on the values of the other variables of interest. Through the function approximation that the ANN performs, it is possible to make predictions of the SpO2 variable based on the input data. The outputs are the probabilities of each of the 3 classes, which sum to 1.
The ANN is trained on the training data using the backpropagation algorithm [12] and is tested on a test set made of the remaining rows of data to validate the generalization of the model. The learning algorithm runs through all the rows of the training dataset and compares the predicted outputs with the target outputs found in the training dataset. The weights are adjusted via supervised learning so as to minimize the error between predicted SpO2 and target SpO2. The process is repeated until the error is minimized.
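A common way for a network to produce class probabilities that sum to 1 is a softmax output layer. The sketch below assumes such a layer (the source does not name the output function), and the raw scores are made-up values for illustration.

```python
import math


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for SpO2 classes 1, 2 and 3.
probabilities = softmax([0.2, 1.1, 3.0])
print(probabilities, sum(probabilities))
```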
The ANN classifier was implemented through cycles of forward propagation followed by backward propagation through the network's layers. The backpropagation algorithm is used for performance optimization. For detailed information see S2 File.
Bootstrap aggregation of complex decision trees. Bootstrap aggregating (acronym: bagging) was proposed by L Breiman in 1994 to improve classification by combining classifications of randomly generated training sets [13]. Bagging allows for the creation of an aggregated predictor via the use of multiple training subsets taken from the same training set. Let (T_i) denote the replicate training subsets bootstrapped from the training set T. These replicate subsets each contain N observations, drawn at random and with replacement from T. For each of these subsets of N observations, a prediction model, or classifier, is created. The computational model used for bagging was complex decision trees. This means that, for each bootstrapped subset of training data, a complex decision tree is trained and thus a classifier is created. If i = 1, ..., n, then n classifiers are created through the bagging process.
A decision tree is a flowchart-like computational model that can be used for both regression and classification problems. Paths from the root of the tree to its various leaf nodes go through decision nodes in which decision rules are applied in a recursive manner, based on the values of input variables. Each path represents an observation (X, y) = (x_1, x_2, x_3, ..., x_n, y), where the label assigned to the target y is given in the leaf node at the end of the path, i.e. the classification [14].
The measure used to build sub-trees was the Gini index (see infogain.doc for details). We tested the bootstrap aggregation of complex decision trees (BACDT) model using 30, 50 and 70 decision trees.
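The bagging procedure can be sketched as follows. Two points are assumptions rather than details from the source: the base learner is a deliberately trivial nearest-class-mean rule standing in for a complex decision tree, and the n classifiers are aggregated by majority vote. The one-dimensional toy data are invented for the example.

```python
import random
from collections import Counter


def bootstrap(training_set, rng):
    """Draw N rows at random, with replacement, from a training set of N rows."""
    return [rng.choice(training_set) for _ in training_set]


def fit_nearest_mean(sample):
    """Trivial stand-in for a complex decision tree: nearest class mean."""
    by_label = {}
    for x, label in sample:
        by_label.setdefault(label, []).append(x)
    means = {label: sum(xs) / len(xs) for label, xs in by_label.items()}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def bagged_predict(classifiers, x):
    """Aggregate the n classifiers by majority vote (assumed aggregation rule)."""
    return Counter(clf(x) for clf in classifiers).most_common(1)[0][0]


rng = random.Random(0)
# Toy training set T: (feature, class label) pairs.
T = [(v / 10, 3) for v in range(10)] + [(9 + v / 10, 1) for v in range(10)]
classifiers = [fit_nearest_mean(bootstrap(T, rng)) for _ in range(30)]
print(bagged_predict(classifiers, 0.4), bagged_predict(classifiers, 9.6))
```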
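For reference, the Gini index of a node with class proportions p_c is commonly computed as 1 − Σ p_c², and a candidate split is scored by the size-weighted average of its two child nodes. The sketch below uses this standard definition; the function names are illustrative and the exact splitting criterion of the toolbox used in the study may differ in detail.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini index of the two child nodes of a candidate split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


print(gini([3, 3, 3, 3]), gini([1, 3]), split_impurity([1, 1], [3, 3]))
```

A split is preferred when it lowers the weighted index, that is, when the child nodes are purer than the parent.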
With the aim of maximizing the model's generalization capability during the training process, the performance of the bagged complex trees was tested via k-fold cross-validation. A value of k = 10, which is common practice, was used in this study for both the complex decision trees and the ANN. The data-set is first divided into two parts: the training set and the test set. The training of the bagged complex trees includes a k-fold cross-validation, which is performed as follows:
• Randomly partition the data-set into k equal-sized subsets (folds).
• For each of the k equal-sized subsets:
  ◦ Train/fit the model on the elements contained in the other (k-1) subsets.
  ◦ Test the model's accuracy on the given subset.
• Iterate over the k subsets, until each one has been used once for testing the model's performance during its training.
• The training validation score consists of the average score obtained by validating the model on all k subsets.
The MathWorks MATLAB R2016b Machine Learning toolbox was used to create the ensemble of bagged complex trees. The ANN classifiers were implemented using the Scikit-Learn package within the Python programming language [http://scikit-learn.org].
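The steps listed above can be sketched as follows. The model and the score are placeholders chosen to keep the example self-contained (a majority-class predictor and plain accuracy); they are assumptions for illustration and not the classifiers evaluated in the study.

```python
import random
from collections import Counter


def k_fold_indices(n_rows, k, rng):
    """Randomly partition row indices into k folds of (near) equal size."""
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [indices[start::k] for start in range(k)]


def cross_validate(rows, k, fit, score, rng):
    """Average the score obtained on each fold after fitting on the other k-1."""
    scores = []
    for fold in k_fold_indices(len(rows), k, rng):
        held_out = set(fold)
        train = [row for i, row in enumerate(rows) if i not in held_out]
        test = [rows[i] for i in fold]
        scores.append(score(fit(train), test))
    return sum(scores) / k


def fit_majority(train):
    """Placeholder model: always predict the most frequent training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def accuracy(model, test):
    """Fraction of held-out rows whose label equals the predicted label."""
    return sum(label == model for _, label in test) / len(test)


rows = [(i, 3) for i in range(90)] + [(i, 1) for i in range(10)]
mean_accuracy = cross_validate(rows, 10, fit_majority, accuracy, random.Random(0))
print(mean_accuracy)
```

The toy data are deliberately imbalanced: a model that only ever predicts the majority class still reaches a high average accuracy, which is why accuracy alone is a weak criterion on imbalanced data.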

Classifiers performances assessments
If the model output a predicted probability > 0.9 for a given class, then the predicted class was considered positive. We evaluated the performance of the classifiers using metrics including ROC curves, average accuracy, precision (ratio of the number of correct classifications for class i to the number of instances labeled as class i by the model), recall (ratio of the number of instances correctly classified as class i to the number of true class i labels) and F score (a single measure of the classification performance of the model used); see S2 File for further details [15].
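The per-class metrics can be sketched as below. The F score is assumed here to be the usual F1 score (harmonic mean of precision and recall); the exact definitions used in the study are given in S2 File, and the labels in the example are invented.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 score for one class label."""
    true_positive = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)   # rows the model labeled as class
    actual = sum(t == label for t in y_true)      # rows that truly are the class
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


y_true = [3, 3, 3, 2, 2, 1]
y_pred = [3, 3, 2, 2, 3, 1]
for label in (1, 2, 3):
    print(label, per_class_metrics(y_true, y_pred, label))
```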

Results
A total of 610 mechanically ventilated children were included, with a median duration of ventilation of 33 hr (1st quartile: 6.5 hr; 3rd quartile: 116.9 hr), similar to a previous study [16]. Of the 776,727 ventilator setting modifications (Table 1), 98% were FiO2 setting changes. The two machine learning classifiers used to predict SpO2 at 5 min after a ventilator setting change (i.e. FiO2, PEEP, Vt/pressure support or pressure control above PEEP) were developed on four different balanced training datasets and assessed on four different balanced test datasets (see Table 2). In Fig 2 and Table 3, we report the performance of these two classifiers. According to the classification performance metrics, the bagged trees classifier trained on dataset #3 yielded the best classification performance on the test sets (Table 3) and was the predictive model retained. The ROC curves are shown in Fig 2, with areas under the curve below 0.75 for all classes.

Impact of hidden layers for ANN and number of complex trees for BACDT on performance
For the artificial neural network, varying the number of hidden layers and the number of neurons per hidden layer did not seem to have a significant effect on the model's classification performance (Table 4). As for the bagged complex trees, varying the number of complex trees did not yield significant changes in classification performance (Table 5). The number of decision trees used in the best BACDT model was 50.

Discussion
This single center pilot study using a machine learning predictive model resulted in a predictive model with poor accuracy (area under the ROC curves < 0.75). The comparison of machine learning models showed that bagged complex trees were the best approach. However, the model was of limited value for predicting SpO2 below 92%.
In agreement with previous studies reporting that bagging is a better method for medical data classification, tree bagging fared better than the artificial neural network [13]. The gap in performance between the training and testing confusion matrices in the case of the bagged trees model (data not shown) seems to indicate that, although the bagged trees model was capable of learning very well from the data, there is still room for improvement in generalization. The SMOTE algorithm is designed in such a way that it should theoretically not affect the generalization of the trained model. However, in cases of extreme data imbalance, as in this study, the over-sampling of the minority class labels is also likely to be extreme. This may render the data space of these classes relatively dense with respect to the rest of the data, which is made up of real data points. This may explain the classification model's relatively poor generalization for 5min SpO2 classes "1" and "2". Also, since SMOTE generates synthetic data points by interpolating between existing minority class instances, it can increase the risk of over-fitting when classifying minority class labels, because the synthetic points may nearly duplicate minority class instances; this needs to be further investigated.
... Table 2). Avg/total: average accuracy of total classification values. In italics is the performance of the best predictive model obtained among the eight tested.
Table 4. Absence of impact on performance of the increase of neurons and hidden layers for artificial neural network (ANN). Example of the performance assessed by the F score on the balanced test dataset 3 (see Table 2).

The strengths of this study include a large clinical database of mechanically ventilated children with more than 776,727 rows. In a recent similar study in PICU, 200 patients were included with 1,150 rows [17]. However, the volume of data is clearly insufficient to use such machine learning predictive models both for the low SpO2 classes and for ventilator setting modifications such as PEEP. The pediatric intensive care community needs to combine multicenter high-resolution databases to increase the size of the datasets. In addition, children's data could be pooled with neonatal and adult intensive care data, when possible, such as the MIMIC III database in specific clinical analyses [18]. The other strength is the process used to transform the data into a usable format and to correct a variety of artifacts (see S1 File). In health care, there is a significant interest in using clinical databases including dynamic and patient-specific information to develop clinical decision support algorithms. The ubiquitous monitoring of patients in critical care units has generated a wealth of data that creates many opportunities in this domain. However, when algorithms are developed in other fields, such as transport or finance, data are specifically collected for research purposes. This is not the case in healthcare, where the primary objective of data collection systems is to document clinical activity, resulting in several issues to address in data collection, data validation and complex data analysis [19]. As detailed in S1 File, a significant amount of effort is needed, once data have been successfully archived and retrieved, to transform the data into a usable format for research.
This study has several limitations. First, the limited number of rows with low SpO2 levels restricted the SpO2 classification for the machine learning predictive model to three clinically relevant classes. SpO2 is a continuous variable and the use of three classes is probably insufficient [20,21]. Instead of a classification model, the next step could be to test the performance of regression models. Second, SpO2 was predicted at 5 min after a ventilator setting change, a clinically relevant delay. However, the delay between a ventilator setting change and oxygenation steady state is not well defined and varies from 1 to 71 minutes according to the parameter set (FiO2, PEEP or other parameters that change mean airway pressure) and the clinical conditions studied [17,22,23]. This needs further research, and more sophisticated clinical decision support systems using machine learning predictive models should probably consider these factors. Third, we excluded hemodynamically unstable patients using a treatment criterion (≥ 2 vasoactive drugs infused) because this condition decreases pulse oximeter reliability [24,25]. The validation and electronic availability of reliable markers of hemodynamic instability in children, such as plethysmographic variability indices, could be helpful [26]. Finally, based on the classification approach taken, we did not stratify the number of unique patients whose data were used for training versus testing, but only the number of instances for training versus testing. The median duration of ventilation in our PICU is 33 hours, the medical conditions are numerous and the weaning phase, during which the lung condition is almost the same among children, represents 50% of the mechanical ventilation duration [16].
With random allocation, the number of unique patients in the training and validation datasets is proportional to the whole population and reflects the whole PICU population studied. If we had assigned a given number of patients to the training and validation datasets, we would probably also have needed to balance the medical condition, the duration of ventilation and the underlying medical conditions between them. To address this problem, we included in the model variables that characterize the patient and lung severity at a given time, including age, weight and mean airway pressure (see Fig 1).
Table 5. Absence of impact on performance of the number of complex trees for bootstrap aggregation of complex decision trees (BACDT). Example of the performance assessed by the F score on the balanced test dataset 3 (see Table 2).

Conclusion
This pilot study using a machine learning predictive model resulted in an algorithm with poor accuracy. We have proposed a method to apply supervised machine learning algorithms to extract knowledge from large amounts of patient mechanical ventilation data. Our method aimed at predicting the behavior of SpO2, based on ventilator setting changes made by the clinician and other clinical variables. To do that, we exploited large amounts of data from a PICU research database and proposed a data formatting process which creates datasets that can be used for supervised training. The comparison of machine learning models showed the use of ensembles of bagged complex trees to be a promising approach. As for future work, various approaches and methods may be considered, with the aim of improving the prediction of SpO2 classification, or level prediction in the case of regression models. One potentially viable solution for improving predictive models would be to use a greater amount of data. Although this could not be considered a guarantee of better classifier robustness, it would decrease the need for a data balancing process and may be a relatively simple approach to consider in future work. This will require multicenter pediatric intensive care high-resolution databases. For the moment, the study presents a model that predicts SpO2 using known setting changes made by the clinician, as well as the other clinical data that the clinicians involved in the study deemed relevant for SpO2 prediction. However, it is hoped that this predictive model will be incorporated in a larger Clinical Decision Support System to assist PICU clinicians in making decisions about required setting changes, based on the range in which SpO2 and other parameters (PaCO2, hemodynamic status, ...) are to be maintained.
Supporting information
S1 File. Data formatting process. (DOCX)
S2 File. Performance tests used in the machine learning models to predict oxygen saturation following ventilator support adjustment in critically ill children.
by grants from the "Fonds de Recherche du Québec-Santé (FRQS)", the Quebec Ministry of Health and Sainte Justine Hospital.