Prediction of blood pressure changes associated with abdominal pressure changes during robotic laparoscopic low abdominal surgery using deep learning

Background Intraoperative hypertension and blood pressure (BP) fluctuation are known to be associated with negative patient outcomes. During robotic lower abdominal surgery, the patient’s abdominal cavity is filled with CO2, and the patient’s head is steeply positioned toward the floor (Trendelenburg position). Pneumoperitoneum and the Trendelenburg position, together with physiological alterations during anesthesia, interfere with predicting BP changes. Recently, deep learning using recurrent neural networks (RNN) was shown to be effective in predicting intraoperative BP. A model for predicting BP rise under these special conditions during robotic laparoscopic surgery was designed using an RNN, and its accuracy was tested. Methods Databases that included adult patients (over 19 years old) undergoing low abdominal da Vinci robotic surgery (ovarian cystectomy, hysterectomy, myomectomy, prostatectomy, and salpingo-oophorectomy) at Soonchunhyang University Bucheon Hospital from October 2018 to March 2021 were used. An RNN-based model was designed in Python 3 with the PyTorch package. The model was trained to predict whether hypertension (20% increase in the mean BP from baseline) would develop within 10 minutes after pneumoperitoneum. Results Eight distinct datasets were generated and the predictive power was compared. The macro-average F1 scores of the datasets ranged from 68.18% to 72.33%. It took only 3.472 milliseconds to obtain 39 prediction outputs. Conclusions A prediction model using the RNN may predict BP rises during robotic laparoscopic surgery.

waived because all data were obtained by retrospective chart review. Databases that included adult patients (over 19 years old) treated with low abdominal da Vinci robotic surgery (robotic ovarian cystectomy, hysterectomy, myomectomy, prostatectomy, and salpingo-oophorectomy) at Soonchunhyang University Bucheon Hospital from October 1, 2018, to March 31, 2021, were selected. The data were collected from one da Vinci robot operating room. Each patient's data set combined attributes obtained from two sources: the conventional electronic medical record (EMR) database and the operation data server (ODS). The EMR database provides the general attributes of a given patient, and its values were preprocessed by scaling and one-hot representation. The ODS maintains vital values using Vital recorder software [15], which gathers data in real time from the Bx50 patient monitor and the Datex-Ohmeda anesthesia machine. The total number of enrolled patients was 625. Data from patients with an American Society of Anesthesiologists (ASA) class of 4 or higher, or for whom the Vital recorder or EMR file was not fully recorded, were excluded from the analysis. Finally, data from 533 patients were analyzed (Fig 1).
A feature vector from the attributes derived from the two sources (i.e., the EMR database and the ODS) was utilized for training a binary classification model for forecasting potential mean BP hypertension. Hypertension was defined as a 20% rise in the mean BP from baseline within 10 minutes after gas injection to the peritoneum. Two labels including "non-H" and "H" were used to indicate non-hypertension and hypertension, respectively. Fig 2 outlines the process flow. The purpose of this study was to demonstrate that our RNN model could predict non-H and H.
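The labeling rule described above can be sketched in a few lines of Python. This is only an illustrative reading of the definition: the function name, the inclusive comparison (a rise of at least 20%), and the idea of passing the mean BP readings recorded during the 10 minutes after gas injection as a list are assumptions, and how the baseline value was chosen is not stated here.

```python
from typing import Sequence


def label_hypertension(baseline_mbp: float,
                       mbp_after_gas: Sequence[float],
                       rise_threshold: float = 0.20) -> str:
    """Return "H" if any mean-BP reading in the post-insufflation window
    reaches baseline * (1 + rise_threshold); otherwise return "non-H".

    `mbp_after_gas` is assumed to hold only the readings taken within
    10 minutes after gas injection (hypothetical input format).
    """
    if baseline_mbp <= 0:
        raise ValueError("baseline mean BP must be positive")
    cutoff = baseline_mbp * (1.0 + rise_threshold)
    return "H" if any(v >= cutoff for v in mbp_after_gas) else "non-H"
```

For a baseline mean BP of 80 mmHg the cutoff is 96 mmHg, so a window containing a reading of 97 mmHg would be labeled "H" under these assumptions.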
From the EMR database, several demographic and patient-status attributes were extracted, including sex, age, weight, height, and ASA class. The categorical attributes such as sex and ASA class were converted into one-hot representations, while other numerical attributes (e.g., age, weight, and height) were kept as real numbers, resulting in an 8-dimensional vector Femr. The attributes are listed in S1 Appendix. The 19 vital values were extracted from the ODS to obtain a real-valued attribute matrix Fvital ∈ ℝ^(W×19), where W is the size of the time window. The matrix Fvital had no missing values, but the vital values collected from different devices (or in different ways) had different sampling rates. For example, tidal volume and minute ventilation (MV) were collected every six seconds, whereas BP data were obtained every minute. Preprocessing was conducted as reported in a previous study [14] so that all vital values had the same sampling interval of 3 seconds; i.e., the size of the time window W was 60 when 3 minutes of vital values were used. All numerical attributes in Femr and Fvital were normalized using a 0-1 scaling algorithm fitted on the training data only. S2 Appendix describes the per-label number of instances from a total of 533 instances.
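The preprocessing steps in this paragraph (one-hot encoding, alignment to a common 3-second grid, and 0-1 scaling fitted on training data only) can be illustrated with a minimal standard-library sketch. Several details are assumptions rather than the authors' procedure: the category sets (two sexes and ASA classes 1 to 3, which is one reading consistent with the stated 8 dimensions), the hold-last-value resampling (the actual method is the one described in [14]), and all function names.

```python
def one_hot(value, categories):
    """Encode `value` as a 0/1 list over the given category order."""
    return [1.0 if value == c else 0.0 for c in categories]


def fit_min_max(train_column):
    """Fit 0-1 scaling bounds on training data only."""
    return min(train_column), max(train_column)


def apply_min_max(x, lo, hi):
    """Scale x with bounds fitted on the training data.
    Values outside the training range fall outside [0, 1]."""
    return 0.0 if hi == lo else (x - lo) / (hi - lo)


def build_emr_vector(sex, asa, age, weight, height, scalers):
    """Assumed layout: 2 (sex) + 3 (ASA 1-3) + 3 numeric = 8 dimensions."""
    vec = one_hot(sex, ("F", "M")) + one_hot(asa, (1, 2, 3))
    for name, x in (("age", age), ("weight", weight), ("height", height)):
        lo, hi = scalers[name]
        vec.append(apply_min_max(x, lo, hi))
    return vec


def resample_hold(samples, period_s, step_s=3, duration_s=180):
    """Put a signal recorded every `period_s` seconds onto a `step_s` grid
    by repeating the most recent reading (an assumed strategy)."""
    out = []
    for t in range(0, duration_s, step_s):
        idx = min(t // period_s, len(samples) - 1)
        out.append(samples[idx])
    return out
```

With these defaults, three once-per-minute BP readings expand to 60 grid points, which matches a 3-minute window sampled every 3 seconds.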

Methods
An RNN-based model was designed in Python 3 with the PyTorch package. The architecture is shown in Fig 3. The key components of the RNN-based model were the recurrent connections in the hidden layers and the output of the hidden layers, which conveyed the sequential patterns underlying the given data. The model took Fvital and Femr as input and generated a 2-dimensional output vector for prediction. The matrix Fvital represents a sequence of 19-dimensional vital values with window size W. The sequence was fed into a stack of bidirectional gated recurrent unit [16] layers, resulting in two R5-dimensional real-valued vectors, hF and hB. The hF vector was generated by the forward RNN and the hB vector by the backward RNN; these two vectors captured the sequential patterns in the forward and backward directions, respectively. They were concatenated with the other input, the 8-dimensional real-valued vector Femr, yielding a (2 × R5 + 8)-dimensional vector. The vector Femr consisted of independent features (e.g., age, sex, and ASA class) and therefore was not passed through the RNN layers but was concatenated with the vectors they generated. The concatenated vector was delivered to two consecutive fully connected (FC) layers followed by a 2-dimensional output layer. The two nodes of the output layer indicated hypertension and non-hypertension, respectively. The proposed RNN architecture was found via grid search, that is, by repeated experiments with training and validation datasets to identify, for example, the most promising number of RNN layers. The dimension of all RNN hidden layers (i.e., R1, R2, R3, R4, and R5) was set to 15. The time window size W = 10 means that the RNN model took the vital sequences for 30 seconds as input. The dimensions of the two FC layers, F1 and F2, were 15 and 10, respectively. A cross-entropy loss function was used for parameter estimation.
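To make the sizes in this description concrete, the following standard-library sketch traces the vector dimensions from the bidirectional GRU outputs to the output layer and evaluates a cross-entropy loss on a 2-dimensional output. It is bookkeeping only, not the authors' PyTorch implementation; the function names and the dictionary layout are illustrative assumptions, while the numbers (19 vitals, hidden size 15, 8 EMR features, FC sizes 15 and 10, 2 outputs) come from the text.

```python
import math


def trace_dimensions(n_vitals=19, rnn_hidden=15, emr_dim=8,
                     fc_dims=(15, 10), n_classes=2):
    """Follow the vector sizes from the GRU outputs to the output layer."""
    concatenated = 2 * rnn_hidden + emr_dim  # [hF ; hB ; Femr]
    return {
        "per_step_input": n_vitals,
        "hF": rnn_hidden,
        "hB": rnn_hidden,
        "concatenated": concatenated,
        "fc_layers": list(fc_dims),
        "output": n_classes,
    }


def cross_entropy(logits, target_index):
    """Cross-entropy of a softmax over `logits` against one true class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]
```

With a hidden size of 15, the concatenated vector has 2 × 15 + 8 = 38 dimensions before it enters the first FC layer.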

Statistical analysis of patient's demographic data
Data analysis was conducted using SPSS software version 21.0 (SPSS Inc., Chicago, IL, USA). The Kolmogorov-Smirnov test was performed to test the normality of the continuous variables. The Student's t-test or Mann-Whitney U test was used depending upon the normality of the continuous variables. The chi-squared test or Fisher's exact test was used for categorical variables. The Kruskal-Wallis test or one-way analysis of variance (ANOVA) along with post hoc analysis using Bonferroni's method was used for multiple group comparisons. A P-value of less than 0.05 was considered significant.
There is no absolute sample size for training machine learning and deep learning models. The complexity of the model and the training parameters are determined based on the amount and nature of the data. For that reason, the data are commonly divided into training, validation, and test datasets. Grid search is the most widely used method for verifying that model training has been performed properly for the given data [17], and it was also used in this study.
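A grid search of the kind mentioned here can be sketched generically as follows. The `evaluate` callback, which would train a model with the given settings and return a validation score, is hypothetical, as are the parameter names in the usage note; the actual search space is not given in the text.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in `param_grid` and keep the best one.

    `evaluate(params)` is a user-supplied function (hypothetical here) that
    trains with `params` and returns a validation score, higher being better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

A call such as `grid_search({"n_layers": [1, 3, 5], "lr": [0.01, 0.001]}, evaluate)` would run six training jobs and return the setting with the highest validation score.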

Performance evaluation
The performance of the RNN model was evaluated using precision, recall, and F1 score.
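For reference, the three metrics and the macro-average F1 score can be computed from true and predicted labels as in the sketch below. The function names and the convention of returning 0 when a denominator is zero are choices made for this illustration, not details taken from the study.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treating it as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred, labels=("non-H", "H")):
    """Unweighted mean of the per-class F1 scores."""
    scores = [per_class_metrics(y_true, y_pred, lab)[2] for lab in labels]
    return sum(scores) / len(scores)
```

Because the macro average weights both labels equally, it is not dominated by whichever of "non-H" and "H" is more frequent in the test data.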

Results
The patient demographics in the non-H and H groups are listed in Tables 1 and 2. There were no differences in age, height, weight, body mass index (BMI), or comorbidities between the two groups, but the groups differed significantly in surgery type and composition. Therefore, patient demographics were also analyzed by surgery type, which revealed significant differences in age, height, weight, BMI, and comorbidities between the patient groups (S3 Appendix and Table 3). Using the total dataset Dtotal (n = 533), three distinct datasets with different proportions of test data were generated. Five additional datasets were created using each of the five surgeries as test data. These eight derived datasets were generated from Dtotal (S4 Appendix). The three datasets, D7:3, D8:2, and D9:1, were generated by random sampling while maintaining the proportions of the different surgeries. The remaining datasets treated a particular type of surgery as test data. For example, Dsal indicates that all instances of salpingo-oophorectomy surgery constituted the test data. Note that all datasets were the same size as the total dataset Dtotal. Ten independent experiments were conducted with each dataset. All the reported performances (e.g., F1 score, precision, recall) represent the average values. The machine contained two Intel(R) Xeon(R) Silver 4214 central processing units at 2.20 GHz and four NVIDIA Quadro RTX 5000 graphics processing units. When the proposed model was trained, the optimal training recipe (e.g., number of epochs, drop-out probability, parameter initialization, and learning rate) was found via a grid search with validation data, where the train:validation ratio was 9:1. Fig 4 shows the receiver operating characteristic (ROC) curve generated by a separate run with D9:1. The Adam optimizer [18] was used with an initial learning rate of 0.001.
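The two ways of deriving datasets from Dtotal described above, random splits that preserve the surgery proportions and splits that hold out one surgery type as test data, can be sketched as follows. The representation of an instance as a `(patient_id, surgery)` tuple, the rounding rule, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(instances, test_fraction, seed=0):
    """Randomly split while keeping each surgery's share in the test set.

    `instances` is a list of (patient_id, surgery) tuples (assumed format).
    """
    rng = random.Random(seed)
    by_surgery = defaultdict(list)
    for inst in instances:
        by_surgery[inst[1]].append(inst)
    train, test = [], []
    for surgery in sorted(by_surgery):
        group = by_surgery[surgery][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def hold_out_surgery(instances, surgery):
    """Use every instance of one surgery type as the test data."""
    train = [i for i in instances if i[1] != surgery]
    test = [i for i in instances if i[1] == surgery]
    return train, test
```

In both cases the training and test parts together contain every instance exactly once, which is consistent with each derived dataset having the same size as Dtotal.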
The early-stopping algorithm was adopted with at least 50 epochs, and the mini-batch size was 32. Drop-out [19] with a keep probability of 0.1 was applied in the RNN layers. Table 4 summarizes the overall performance. The macro-average F1 scores of all datasets were 72.33% and 68.18% for non-hypertension and hypertension, respectively. Based on these results, the proposed model achieved a predictive performance of about 70% in terms of F1 score. The elapsed time was about 3.472 milliseconds for generating 39 prediction outputs.
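Early stopping with a minimum number of epochs, as mentioned in this paragraph, can be sketched as a simple rule over the validation losses recorded after each epoch. The patience value and the use of validation loss as the monitored quantity are assumptions, since neither is specified in the text; only the minimum of 50 epochs comes from the source.

```python
def early_stopping_epoch(val_losses, min_epochs=50, patience=5):
    """Return the 1-based epoch at which training would stop.

    Training never stops before `min_epochs`; after that it stops once the
    validation loss has not improved for `patience` consecutive epochs
    (`patience` is an assumed value).
    """
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
        if epoch >= min_epochs and since_best >= patience:
            return epoch
    return len(val_losses)
```

If the validation loss stops improving early, for example after epoch 10, this rule still trains through epoch 50 before stopping.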

Discussion
In this study, the task of predicting mean BP hypertension in patients who underwent robotic laparoscopic surgery was addressed using an RNN-based model that takes vital sequences and EMR data as input. The model used 30 seconds of vital values to capture arbitrary sequential patterns and used the patient EMR data as supplementary features to predict potential hypertension within 10 minutes after pneumoperitoneum. The effectiveness of the model was verified by experimental results with our collected data. The model architecture and training recipe were found using a grid search, which yielded macro-average F1 scores of 68.18% to 72.33%. The gap between the macro and micro averages was attributed to differences between the datasets (e.g., the proportion of test data). Performance degraded as the proportion of training data decreased. For example, the F1 score with the D9:1 dataset was substantially greater than that with D7:3, which is reasonable because a model generally becomes stronger when it has more training data. The F1 scores with Dhys were consistent with this observation: it had the worst result because it contained the smallest amount of training data. Interestingly, among the five datasets whose test data involved a particular type of surgery, the best F1 scores were achieved with Dcys, even though it was not the dataset with the most training data. S3 Appendix and Table 3 show that the patients who underwent cystectomy were younger and that fewer of them had hypertension. Vaso-reactivity is increased in patients with essential hypertension, and therefore even a small change in a sympathetic agent induces large BP fluctuations, which complicates the prediction of BP fluctuations. Arterial stiffness increases with age, which may increase systolic BP and pulse pressure. This difference in basal conditions affected the predictive power with the Dcys dataset.
The results for two datasets, Dhys and Dpro, were contrasting, with high F1 scores for 'H' in Dhys but high F1 scores for 'non-H' in Dpro. The contrast can be explained by the opposite label proportions in the datasets. As the Dhys dataset had more 'H' than 'non-H' instances, the model tended to yield 'H' as a prediction output. Likewise, the Dpro dataset had more 'non-H' than 'H' instances, and the model yielded 'non-H' more often as a prediction output.
There are several widely adopted ML models for medical applications, including logistic regression, support vector machine [20], decision tree, random forest [21], naive Bayes, extreme gradient boosting [22], and artificial neural networks (ANN). Lee et al. [23] proposed an ANN model to predict postoperative in-hospital mortality and achieved an area under the ROC curve of 0.91 (95% confidence interval: 0.88-0.93). Jeong et al. [24] compared a few ML models for predicting the potential postoperative complications of patients diagnosed with end-stage renal disease. Even though ML models are effective in performing several medical tasks, a major limitation is their need for human intervention in the form of feature engineering. The performance of ML models strongly depends upon feature definition, and defining features that give satisfactory performance is expensive for each individual task [25,26]. The DL technique is one of the solutions that address this limitation. It does not require significant human effort but automatically learns arbitrary features (or patterns) from the dataset. However, it is worth noting that the domain knowledge of the target task needs to be examined before designing DL models. For example, decisions regarding input data are based on domain knowledge. While there are several well-known types of DL models, RNNs are known to be effective in capturing sequential patterns [27]. The RNN model has been adopted in various medical applications such as intensive care unit mortality risk prediction [28], predicting diabetes mellitus [29], and predicting specific targeted clinical events [30]. These studies generally utilized sequential data (e.g., vital signs and time-stamped electronic health records) as input for the RNN model so that the model learned to predict the desired output (i.e., adverse outcomes). The RNN has also been successfully used to predict BP values and BP-related events (e.g., hypertension). Peng Su et al. [31] proposed an RNN architecture to predict future BP values a few days in advance. Thus, the RNN model could be utilized to predict whether hypertension would occur within 10 minutes after pneumoperitoneum in robot-assisted laparoscopic low abdominal surgery.
As the output of the proposed model was an indicator of potential mean BP hypertension within 10 minutes after pneumoperitoneum, the model should generate the prediction results as early as possible. It only took about 3.472 milliseconds to generate 39 prediction outputs, suggesting that it is possible to inform the physicians of potential hypertension immediately after gas injection during surgery using our model.
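The reported timing can be turned into an average per-prediction latency with a one-line calculation. This is only arithmetic on the two reported numbers; it assumes the elapsed time is shared evenly across the 39 outputs, which the text does not state.

```python
def per_prediction_ms(total_ms=3.472, n_outputs=39):
    """Average elapsed time per prediction output, in milliseconds."""
    return total_ms / n_outputs
```

Under that assumption the average is roughly 0.089 milliseconds per prediction, far shorter than the 3-second interval between consecutive vital samples.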
Despite the encouraging experimental results and overall F1 scores of around 70%, there were several limitations to this study. First, the performance, especially recall in the 'H' cases, needs improvement. For the 'H' cases in cystectomy surgery, the proposed model achieved a recall of 91.29%, in contrast to the 'H' cases in prostatectomy surgery. Based on demographic differences, the patients who underwent prostatectomy were older and had underlying diseases, which may also suggest that BP fluctuations are more difficult to predict in the elderly or in those with a number of underlying diseases. This limitation can be addressed by a better model architecture or an alternative training recipe. Second, further data are needed to improve the overall model performance. In particular, as shown in Table 4, the performance with the five datasets (i.e., Dsal, Dpro, Dmyo, Dhys, and Dcys) whose test data came from a particular target surgery was lower than that with the other datasets. The collection of additional data will facilitate the creation of a robust model for different types of surgery. Third, patient data from other instruments, such as the bispectral index monitor for the depth of anesthesia or the syringe pumps injecting anesthesia-related drugs, are often missing from the vital database due to connection problems, so they were not used as input variables in this RNN model. If additional cases become available and these data are also used as variables, the performance of the model is expected to improve. Fourth, the data were obtained from a single source (a surgery robot in the same operating room). A robust model requires data from different surgery rooms, medical centers, and hospitals. Fifth, since the model in this study was not a real-time prediction model, there are restrictions on applying it to individual patients. However, we are conducting research to develop real-time perioperative BP prediction models based on models developed using these tasks [14]. We expect that the data accumulated through these modeling tasks will greatly help in creating an accurate real-time blood pressure prediction model in the future. Sixth, potential confounding factors such as incision time and depth of anesthesia were not considered in this model. As a limitation of the retrospective design, it was difficult to precisely specify the incision time. In contrast, the initiation time of the pneumoperitoneum could be accurately determined from the record of changes in abdominal pressure levels. However, the incision time is unlikely to have had a significant impact on the result because, in general, the trocar is inserted immediately at the start of the incision and the pneumoperitoneum is started within 1-2 minutes. It was also difficult to include depth of anesthesia in the analysis. Uniform data were difficult to obtain because the records of values for depth of anesthesia were omitted in many cases and various devices (i.e., bispectral index, entropy, and sedline) were used for measuring the depth of anesthesia. There is a need for additional prospective studies.
In conclusion, the current RNN model could predict hypertension within 10 minutes after changing to the Trendelenburg position with CO2 pneumoperitoneum during various types of robotic surgery. Further studies that incorporate other vital features (e.g., dose of anesthetics, intra-abdominal pressure, and angle of the Trendelenburg position) are needed to increase the overall predictive power and provide reliable information to anesthesiologists.