Development and validation of machine learning models to identify high-risk surgical patients using automatically curated electronic health record data (Pythia): A retrospective, single-site study

Background
Pythia is an automated, clinically curated surgical data pipeline and repository housing all surgical patient electronic health record (EHR) data from a large, quaternary, multisite health institute for data science initiatives. In an effort to better identify high-risk surgical patients from complex data, a machine learning project trained on Pythia was built to predict postoperative complication risk.

Methods and findings
A curated data repository of surgical outcomes was created using automated SQL and R code that extracted and processed patient clinical and surgical data across 37 million clinical encounters from the EHRs. A total of 194 clinical features including patient demographics (e.g., age, sex, race), smoking status, medications, comorbidities, procedure information, and proxies for surgical complexity were constructed and aggregated. A cohort of 66,370 patients that had undergone 99,755 invasive procedural encounters between January 1, 2014, and January 31, 2017, was studied further for the purpose of predicting postoperative complications. The average complication and 30-day postoperative mortality rates of this cohort were 16.0% and 0.51%, respectively. Least absolute shrinkage and selection operator (lasso) penalized logistic regression, random forest models, and extreme gradient boosted decision trees were trained on this surgical cohort with cross-validation on 14 specific postoperative outcome groupings. Resulting models had area under the receiver operator characteristic curve (AUC) values ranging between 0.747 and 0.924, calculated on an out-of-sample test set from the last 5 months of data. Lasso penalized regression was identified as a high-performing model, providing clinically interpretable actionable insights. Highest and lowest performing lasso models predicted postoperative shock and genitourinary outcomes with AUCs of 0.924 (95% CI: 0.901, 0.946) and 0.780 (95% CI: 0.752, 0.810), respectively. A calculator requiring input of 9 data fields was created to produce a risk assessment for the 14 groupings of postoperative outcomes. A high-risk threshold (15% risk of any complication) was determined to identify high-risk surgical patients. The model sensitivity was 76%, with a specificity of 76%. Compared to heuristics that identify high-risk patients developed by clinical experts and the ACS NSQIP calculator, this tool performed superiorly, providing an improved approach for clinicians to estimate postoperative risk for patients. Limitations of this study include the missingness of data that were removed for analysis.

Conclusions
Extracting and curating a large, local institution’s EHR data for machine learning purposes resulted in models with strong predictive performance. These models can be used in clinical settings as decision support tools for identification of high-risk patients as well as patient evaluation and care management. Further work is necessary to evaluate the impact of the Pythia risk calculator within the clinical workflow on postoperative outcomes and to optimize this data flow for future machine learning efforts.

Author summary
Why was this study done?
• Most postoperative complication risk prediction models use American College of Surgeons (ACS) National Surgical Quality Improvement Program (NSQIP) data.
• Few published postoperative risk prediction models using electronic health record (EHR) data exist.
• Creating and updating manual datasets such as ACS NSQIP are intensive processes with regard to time, labor, and cost.

What did the researchers do and find?
• A curated data repository of postoperative outcomes was created that extracted and processed patient clinical and surgical data across 37 million clinical encounters in EHRs into 194 clinical features.
• Machine learning models were built off this dataset to predict risk of postoperative complications. Models were able to classify patients at high risk of postoperative complication with high sensitivity and specificity.
• An online calculator requiring input of 9 data fields was created to produce a risk assessment within the clinic environment.

Introduction
Complications arise in 15% of all US surgical procedures performed, with high-risk surgeries having complications in up to 50% of cases [1]. In addition to worsening quality of life, surgical complications in the US cost on average over $11,000 per major 30-day complication [2]. With an estimated 19 million surgeries performed each year [3], the total cost of surgical complications per year in the US is approximately $31.35 billion. In response, efforts to enhance preoperative and perioperative support for high-risk and high-cost patients are increasing nationwide [4]. Targeted preoperative intervention clinics for high-risk individuals have been shown to improve 30-day postoperative outcomes at a multisite, quaternary health center [5]. However, the task of identifying these patients within a preoperative setting is challenged by difficulties in timely access to pertinent patient care data and lack of robust predictive models. The most widespread pre-surgical high-risk patient identification program is the National Surgical Quality Improvement Program (NSQIP) calculator developed by the American College of Surgeons (ACS). This online risk prediction calculator represents national surgical data from 393 different institutions [6]. It has been shown that predictive models built from nationally derived databases have limited local accuracy due to an average effect derived from aggregating data from many different institutions, populations, and regions. Cologne et al. demonstrated that NSQIP postoperative risk predictions differed significantly in terms of length of stay, surgical site infections, and major complications from actual rates at a single institution [7]. Moreover, Etzioni et al. and Osborne et al. demonstrated that enrollment in and feedback from NSQIP are not associated with improved postoperative outcomes or lower Medicare payments among surgical patients [8,9]. 
This indicates the need for institution-specific improvement efforts driven by highly curated institution-specific data.
The aggregation of health data within each local institution's electronic health records (EHRs) serves as fertile ground for machine learning to transform healthcare. Machine learning models utilizing EHR data to predict in-hospital length of stay and mortality as well as postoperative complications can be more accurate than prediction models built from manually collected data [10][11][12]. However, despite the maturation of methodological approaches to working with health data, there has been limited impact on provider productivity and patient outcomes [13]. Current health information technology infrastructure does not facilitate rapid transmission of data between EHRs and model applications. Furthermore, building technologies that integrate with current EHR systems requires significant financial investment [14,15].
The primary aim of this study was to demonstrate an initial use case of machine learning leveraging an institute-specific surgical data pipeline and repository derived from EHRs, Pythia, to identify patients at high risk of post-surgical complications. Pythia was built as part of an innovation initiative to efficiently curate high-volume, high-quality data to monitor surgical care and outcomes. Although EHR data can be inaccurate or incomplete [16], models that are developed and validated on local, structured data in EHRs are best positioned for deployment to support clinical workflows [17]. Pythia was designed to both promote the development of machine learning models and bridge the translational gap to enable rapid deployment of validated models. The secondary aim of this study was to describe machine learning model design decisions that supported clinical interpretability and rapid development of a decision support tool to be used within the preoperative clinic workflow. This decision support tool enables surgeons and referring clinicians to identify high-risk patients who may require targeted assessments and optimization as part of their preoperative care.

Methods
This project was approved by the Duke Institutional Review Board (Pro00081702), with waiver of informed consent. This was a single-center, retrospective study at Duke University Health System (DUHS), a large, quaternary, multisite hospital system that had 68,000 inpatient stays and more than 2 million outpatient visits in 2017 [18]. This study is reported as per the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines (S1 Checklist) [19].

Training dataset for models
A cohort of 163,599 patients was identified in the DUHS EHR system who had undergone any surgical procedure between June 1, 2012, and May 31, 2017. Clinical patient data were extracted from the EHR Oracle database across all inpatient and outpatient encounters using SQL queries. Outpatient and inpatient medications, vital signs, diagnoses, procedures, and orders were extracted across 37,195,164 inpatient and outpatient encounters. Patient demographic and social history data including age, sex, BMI, race, and smoking status were also extracted (Fig 1).
A cohort of 90,145 patients that had undergone 145,604 invasive procedures between January 1, 2014, and January 31, 2017, was identified to develop machine learning models to predict post-surgical complications. Patients under the age of 18 years were excluded from the cohort. Encounters with a CPT code included in the Surgery Flag Software [20] were defined as invasive procedures and included in the cohort. All CPT codes for invasive procedures were grouped into 128 procedure groupings (S1 Table). Predictor variables for the models included comorbidities, number of CPT codes recorded during the invasive procedure, outpatient medications, and demographics. Patient comorbidities were identified by surveilling all ICD codes within 1 year preceding the date of the procedure. These diagnosis codes were then classified into 29 binary comorbidity groupings (S1 Table) as defined by the Elixhauser Comorbidity Index [21]. Patients' active outpatient medications recorded during medication reconciliation at preoperative visits were classified into 15 therapeutic binary indicator groupings (S1 Table), along with a separate feature that counted the total number of active medications. Surgical complications were defined by diagnosis codes occurring within 30 days following the surgical procedure. In total, 271 diagnosis codes (S2 Table) were grouped into 12 groupings that aligned with prior studies evaluating post-surgical complications [5]. A composite variable, "any complication," for each procedure was created by aggregating across all 12 complication groupings, and mortality was identified as death occurring within 30 days of the index procedure date. Mortality was captured in the EHRs during encounters (for in-hospital deaths) or uploaded from the Social Security Death Index (for out-of-hospital deaths). In total, 99,755 encounters had complete information on the set of 194 predictors and 14 postoperative outcome groupings, including any complication and 30-day mortality. 
Encounters missing EHR data were deemed not missing at random and were therefore excluded from the model development cohort. This cohort was used for training, validating, and testing model prediction algorithms. SQL and R code were subsequently written to extract, clean, and curate patient data from the EHRs to the surgical data repository. Efforts to automate this process are currently underway.
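The feature and outcome definitions above can be sketched in a few lines. This is a minimal illustration in Python rather than the SQL and R code used in the study; the two code-to-grouping maps, the function name, and the handling of boundary dates are assumptions made for the example, and the actual mappings are those in S1 Table and S2 Table.

```python
from datetime import date, timedelta

# Hypothetical code-to-grouping maps for illustration only; the study's
# real mappings are listed in S1 Table and S2 Table.
COMORBIDITY_GROUPS = {"I10": "hypertension", "E11.9": "diabetes_uncomplicated"}
COMPLICATION_GROUPS = {"N17.9": "renal", "A41.9": "sepsis"}


def build_flags(procedure_date, diagnoses):
    """Return (comorbidity flags, complication flags) for one procedure.

    diagnoses: iterable of (ICD code, diagnosis date) pairs for the patient.
    """
    lookback_start = procedure_date - timedelta(days=365)
    followup_end = procedure_date + timedelta(days=30)
    comorbid = {group: 0 for group in COMORBIDITY_GROUPS.values()}
    complication = {group: 0 for group in COMPLICATION_GROUPS.values()}
    for code, dx_date in diagnoses:
        # Comorbidities: codes recorded in the year preceding the procedure.
        if code in COMORBIDITY_GROUPS and lookback_start <= dx_date < procedure_date:
            comorbid[COMORBIDITY_GROUPS[code]] = 1
        # Complications: codes recorded within 30 days after the procedure.
        if code in COMPLICATION_GROUPS and procedure_date < dx_date <= followup_end:
            complication[COMPLICATION_GROUPS[code]] = 1
    # Composite outcome: 1 if any complication grouping is flagged.
    complication["any_complication"] = int(any(complication.values()))
    return comorbid, complication
```

Codes outside either window are ignored, so a diagnosis made two years before the procedure does not set a comorbidity flag in this sketch.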

Machine learning methods
Due to the high dimensionality of the input features and the large sample size, machine learning methods were used to model the likelihood of post-surgical complications. Least absolute shrinkage and selection operator (lasso) penalized logistic regression [22], random forest [23], and extreme gradient boosted decision tree [24] models were trained on 88,255 surgical encounters from the Pythia data repository for each of the 14 postoperative outcomes (13 outcome groupings and 30-day mortality). Lasso is an ℓ1-penalized regression method that performs both regularization and variable selection, which results in a regression solution with improved interpretability and prediction accuracy compared to other regression approaches. Random forests are an ensemble learning method in which multiple decision trees are constructed and averaged to form a solution that is resistant to overfitting to the training data. Lastly, boosted decision trees are another ensemble decision tree approach that aims to optimize a differentiable loss function; in our case, the loss function is the area under the receiver operator characteristic curve (AUC). The latest set of 11,500 (11.5%) encounters, from October 1, 2016, to January 31, 2017, was held out from the complete set of 99,755 encounters for validation testing. This was done in order to estimate how the models would perform if deployed today within our local setting.
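The held-out test set is defined by date rather than by random sampling. A minimal sketch of such a temporal split is shown below, in Python for illustration (the study used R); the `procedure_date` key and the function name are hypothetical.

```python
from datetime import date

TEST_START = date(2016, 10, 1)  # first day of the held-out period, per the text
STUDY_END = date(2017, 1, 31)   # last day of the model cohort, per the text


def temporal_split(encounters):
    """Split encounters into (train, test) by procedure date, without shuffling.

    encounters: list of dicts with a hypothetical "procedure_date" key.
    """
    train = [e for e in encounters if e["procedure_date"] < TEST_START]
    test = [e for e in encounters if TEST_START <= e["procedure_date"] <= STUDY_END]
    return train, test
```

With the counts reported above, 88,255 training encounters plus 11,500 test encounters give the full 99,755, and the test share is 11,500 / 99,755 ≈ 11.5%.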
Ten-fold cross-validation was used within the training set to train lasso models, using the R package glmnet [25] to find the optimal shrinkage hyperparameter for each of the 14 outcomes. Random forests were trained using the R package randomForest [26]. The number of trees was set to 500, and the number of candidate splits was determined using cross-validation across a range of possible values. Lastly, extreme gradient boosted decision tree models were trained using the R package xgboost [27], where the learning rate (eta) and the depth of trees were chosen by cross-validation across possible ranges of values. Hyperparameters were selected by cross-validation separately for each outcome and each of the 3 model types.
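For reference, the shrinkage hyperparameter mentioned above enters the lasso-penalized logistic regression objective as follows. This is the standard textbook form, written in our own notation rather than taken from the study: $N$ is the number of training encounters, $x_i$ the predictor vector and $y_i \in \{0, 1\}$ the outcome for encounter $i$, $p$ the number of predictors, and $\lambda \ge 0$ the shrinkage hyperparameter chosen by cross-validation.

```latex
\[
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}
\; -\frac{1}{N}\sum_{i=1}^{N}\Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
   - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
\; + \; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\]
```

Larger values of $\lambda$ drive more coefficients $\beta_j$ exactly to zero, which is the variable selection property the text relies on.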
Lasso penalized logistic regression, random forest models, and extreme gradient boosted decision trees were chosen specifically due to their ability to provide variable importance and interpretability. By providing model users with additional information about predictor weights, clinicians can glean insights into potential patient risk mitigation strategies. During experiments, elastic net penalized logistic regression models were built as well, but their performance was almost identical to that of the lasso models and they were therefore omitted. Furthermore, by comparing different approaches, clinician users can understand how different types of machine learning models perform on large complex EHR patient data.
Superior models were chosen based on predictive performance measured by AUC, sensitivity, and specificity, with a focus on clinical interpretability. After model selection, an online calculator was built using R Shiny [28] for use in the clinic. The calculator was organized into 3 sections requiring patient information: (1) procedure details, (2) demographic and social history, and (3) patient comorbidities and outpatient medications. The calculator ran the selected machine learning models to provide complication risk scores for all 14 outcomes. Complication risks greater than 5% were displayed on the user interface, as requested by clinical partners. A high-risk threshold that maximizes the sum of sensitivity and specificity for the "any complication" outcome was chosen in order to identify high-risk patients requiring further evaluation. Table 1 displays summary statistics for all invasive procedures within Pythia and the machine learning model cohort. In the model cohort, 45% of encounters involved male patients, and the average age was 62.1 years. The most common comorbidities were hypertension (47.4%), tumor without metastasis (13.8%), and uncomplicated diabetes (13.4%). The most common outpatient medications were cardiovascular drugs (68.2%), analgesics (40.0%), and antiplatelet drugs (32.8%). Post-surgical complication rates were 16.0% for any complication within 30 days and 0.5% for death within 30 days. Characteristics between the 2 groups were consistently similar. Pythia encounters excluded from the machine learning model cohort were most often missing active outpatient medications.

Results
The resulting 42 models (lasso, random forest, and extreme gradient boosted decision trees for 14 outcomes) overall demonstrated strong predictive performance, with AUCs ranging between 0.747 and 0.924 (Table 2) calculated on a non-random, out-of-sample test set of the latest patient encounters from a different time period than the training set data. The size of this test set was 11,500 encounters. However, the lasso penalized logistic regression performed slightly better than the random forest models, with AUCs ranging from 0.747 to 0.903. Lasso and extreme gradient boosted decision tree models performed very similarly. However, lasso outperformed extreme gradient boosted decision trees in 8 outcome models (any complication, 30-day mortality, gastrointestinal, genitourinary, hematological, integumentary, renal, and shock), while extreme gradient boosted decision trees outperformed lasso in 5 of the remaining outcome models (cardiac, endocrine, pulmonary, sepsis, and vascular). Neurological outcome models had the same AUC performance (0.810) in lasso and extreme gradient boosted decision trees. The receiver operator characteristic curves in Fig 2 display a curve for each modeling method across all complications. This visualization confirms that lasso, random forest, and extreme gradient boosted decision trees performed very similarly. These well-established and interpretable models achieved strong performance on structured data that are highly available within our EHR system, paving the way for rapid deployment of an application to impact patient care. High-risk thresholds were determined with the aim of identifying patients needing referrals to targeted perioperative optimization treatment programs. Fig 3 displays predicted probabilities for patients who experienced a postoperative complication and for those who did not for each machine learning model. 
This plot demonstrates a clear delineation between the 2 populations, with patients experiencing postoperative complications having higher risk predictions. Due to the differences in distributions displayed, the high-risk threshold was purposefully chosen as a percent risk value bordering between the 2 population distributions, resulting in a strong cutoff. Specifically, the thresholds were chosen by maximizing the sum of sensitivity and specificity. The resulting sensitivities and specificities are displayed in Table 3. Under a threshold of 0.142 (14.2% risk of any complication), a sensitivity of 0.775, specificity of 0.749, and positive predictive value (PPV) of 0.362 were achieved with lasso modeling. The resulting sensitivities and specificities were similar across methods. A threshold of 14.9% was chosen for random forest models, and a threshold of 17.4% for extreme gradient boosted decision tree models, resulting in sensitivities of 0.757 and 0.725, specificities of 0.744 and 0.792, and PPVs of 0.351 and 0.390, respectively. Consequently, lasso and extreme gradient boosted decision tree models identified a more concentrated group of patients with higher complication rates (36% and 39%) than random forest (35%). Furthermore, lasso and extreme gradient boosted decision tree models have a higher PPV (36.2% and 39.0%) compared to random forest (35.1%), thereby better identifying high-risk patients who then have postoperative complications. However, in order to optimize model performance for healthcare providers by providing clinically interpretable insights regarding risk factors, and identifying a more targeted number of patients, we chose the 14 lasso models to predict complication risk through the online web application for DUHS clinicians. 
This was done because lasso models allow for better interpretability of which particular health predictors will affect a patient's risk of postoperative complication, as well as how much each predictor affects the predicted postoperative outcome.
Our clinical partners wanted this insight within this decision support tool.
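The evaluation quantities used in this section (AUC, and a threshold chosen by maximizing the sum of sensitivity and specificity) can be sketched as follows. This is an illustrative Python version with hypothetical function names, not the study's R code; the pairwise AUC is quadratic in the number of encounters and is written for clarity rather than speed.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def metrics_at(labels, scores, threshold):
    """Sensitivity, specificity, and PPV when scores >= threshold are flagged."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return sens, spec, ppv


def best_threshold(labels, scores):
    """Observed score that maximizes sensitivity + specificity."""
    def total(t):
        sens, spec, _ = metrics_at(labels, scores, t)
        return sens + spec
    return max(sorted(set(scores)), key=total)
```

Applied to the predicted risks and observed outcomes of a test set, `best_threshold` returns the cutoff and `metrics_at` returns the corresponding sensitivity, specificity, and PPV of the kind reported in Table 3.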
In order to test model stability over time, the observed versus predicted rate of any complication in our data was plotted (Fig 4). Using a high-risk threshold of 14.4% risk of complication, the machine learning models predict that approximately 35% of procedures will result in a complication, while the actual rate of complications is approximately 17% at DUHS over time. This rate difference was intentional, to increase sensitivity to capture more high-risk patients for perioperative optimization. Our clinical and operational partners felt that targeted [...]
In response to requests made by peer reviewers, we further assessed Pythia's model performance through a local validation analysis by comparing our methods to expert clinical criteria for a geriatric preoperative optimization intervention within our healthcare system. Expert clinical criteria for geriatric high risk included patients taking more than 5 medications or with multiple comorbidities, neurological disorders, or recent weight loss, as previously described [5]. A new test set (n = 5,734) was identified using our original test set of all patients on or after October 1, 2016, and filtering for geriatric patients (age > 65 years). Our model identified 1,933 patients with risk scores above our model's high-risk threshold. In comparison, we identified 3,102 geriatric patients using expert clinical criteria within the same test set. The mean complication rate for high-risk patients identified by our model was 37.99%, while the mean complication rate for patients identified by clinical criteria was 16.55%, indicating that our model identified a more specific high-risk cohort of patients. The sensitivity, [...]
In order to directly compare our model predictions to the ACS NSQIP model predictions, we input preoperative health information from 75 patients into the ACS NSQIP calculator and then into Pythia's risk calculator. 
These patients were real patients from our local setting who were randomly selected and were not present within our models' training set. We upsampled patients with postoperative mortality (16%) to provide more stable estimates of AUC, sensitivity, and specificity. We compared the risk predictions and performance of the 30-day postoperative mortality model and found that Pythia's model outperformed the NSQIP model by 0.12 AUC (Pythia 0.79 versus NSQIP 0.67) (Fig 5). Furthermore, sensitivity (0.9167 versus 0.7500), specificity (0.5873 versus 0.5556), and PPV (0.2973 versus 0.2432) were also higher for Pythia's 30-day mortality risk prediction.
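As a consistency check on these figures, the reported sensitivity, specificity, and PPV can be reproduced from whole-number confusion matrix counts. The counts below are inferred by us, assuming that 16% of the 75 patients (12 patients) died within 30 days; they are not reported in the source and should be read as one set of counts that matches the published values, not as the study's actual tables.

```python
def summarize(tp, fn, tn, fp):
    """Sensitivity, specificity, and PPV from confusion matrix counts,
    rounded to 4 decimal places as in the text."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    return round(sensitivity, 4), round(specificity, 4), round(ppv, 4)


# Inferred counts (assumption): 12 deaths and 63 survivors among 75 patients.
pythia = summarize(tp=11, fn=1, tn=37, fp=26)
nsqip = summarize(tp=9, fn=3, tn=35, fp=28)
print(pythia)
print(nsqip)
```

Under these assumed counts, each additional death correctly flagged changes sensitivity by about 8 percentage points, which illustrates how sensitive the comparison is to the small sample.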

Discussion
We demonstrated that machine learning models built from highly curated, clinically meaningful features from local, structured EHR data were able to achieve high sensitivity and specificity for classifying patients at risk of post-surgical complications. The models and accompanying application can be easily deployed to identify patients for targeted perioperative treatment.
We chose the 14 lasso models to predict complication risk through an online web application built for local clinicians to identify high-risk patients. Our results show that the performances of lasso, random forest, and extreme gradient boosted decision tree models on a non-random, out-of-sample test set from a later time period are nearly identical. However, lasso models performed superiorly to the random forest and extreme gradient boosted decision tree models as a whole, while also returning interpretable coefficients that provide clinicians insights into why patients are at high risk for complications. Lasso models also perform variable selection, minimizing the number of data inputs required in the web application. The variable selection used for the initial pilot of the application, during which manual entry of input features is required, will enable rapid use during clinic visits. The tool requires the input of 9 patient features, curated by grouping the reduced set of covariates chosen by our lasso models, to produce risk scores for 14 postoperative outcome groupings. For example, the comorbidities feature within our calculator is comprised of the 29 binary Elixhauser groupings. This comorbidities feature contains a dropdown menu where multiple comorbidities can be selected if needed. Moreover, Fig 6 demonstrates how the structure of the calculator inputs aligns well with the information collected during surgical clinic visits and with the typical presurgical evaluation workflow. As all fields are available as structured data in the EHR system, if the initial pilot is successful, Pythia will enable the rapid deployment of an automated pipeline to extract patient data and calculate risk to notify relevant providers.
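To make the calculator logic concrete, the sketch below shows how a lasso logistic model's coefficients could be turned into displayed risks and a high-risk flag. It is written in Python for illustration, whereas the actual tool was built in R Shiny. The intercepts, coefficients, and feature names are invented for the example and are not the study's fitted values; the 5% display cutoff and the 0.142 threshold are the values stated in the text.

```python
import math

DISPLAY_CUTOFF = 0.05        # risks greater than 5% are displayed, per the Methods
HIGH_RISK_THRESHOLD = 0.142  # lasso "any complication" threshold from the Results

# Hypothetical intercepts and nonzero lasso coefficients (illustration only).
MODELS = {
    "any_complication": (-2.5, {"age_over_65": 0.4, "hypertension": 0.3, "n_medications": 0.05}),
    "mortality_30d": (-6.0, {"age_over_65": 0.8, "n_medications": 0.04}),
}


def predict_risks(features):
    """Apply each logistic model: risk = 1 / (1 + exp(-(intercept + sum of w * x)))."""
    risks = {}
    for outcome, (intercept, coefs) in MODELS.items():
        z = intercept + sum(w * features.get(name, 0) for name, w in coefs.items())
        risks[outcome] = 1 / (1 + math.exp(-z))
    return risks


def calculator_output(features):
    """Return (risks to display, high-risk flag) for one patient."""
    risks = predict_risks(features)
    shown = {k: v for k, v in risks.items() if v > DISPLAY_CUTOFF}
    high_risk = risks["any_complication"] >= HIGH_RISK_THRESHOLD
    return shown, high_risk
```

Because lasso sets many coefficients to zero, only predictors with nonzero weights need to be collected as inputs, which is what keeps the number of calculator fields small.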
Through our sensitivity analysis comparing expert clinical criteria to Pythia's models, we were able to demonstrate that machine learning models trained from local data can identify individuals at high risk of complications and high cost within the local patient population. Pythia's models were shown to perform at a higher sensitivity and specificity through this analysis. By specifically targeting a narrower population of patients needing preoperative optimization, our healthcare system can better utilize clinical resources while lowering clinic costs.
Currently, the NSQIP calculator is the most widely used pre-surgical risk prediction model. Recent publications predict postoperative complication risks using ACS NSQIP data [29][30][31][32][33]. These data are manually extracted from EHRs, making them high fidelity but very difficult to update with new patient data or to adjust by adding new variables. Two studies compare models using automatically extracted (EHR) data versus manually collected data. Comparable AUCs were reported by Anderson et al. [34] for multivariate logistic regression models trained on 66 manually collected NSQIP variables versus 25 EHR NSQIP variables. Differences ranged from −0.0073 to 0.1944 across specific surgery procedures for mortality and from 0.0198 to 0.0687 for morbidity [34]. In Amrock et al., the AUCs were 0.813 for mortality and 0.629 for morbidity in multivariate logistic regression models utilizing manually collected data versus 0.795 for mortality and 0.629 for morbidity in the same type of models utilizing EHR data [35]. Both studies found that models using EHR data perform similarly to models using manually extracted data in predicting postoperative morbidity and mortality. However, deploying machine learning models in operations at scale requires automated pipelines for structured EHR data to calculate risk scores and trigger clinical workflows.
The NSQIP calculator uses a logistic regression model using random intercepts per hospital [6], while our models incorporate machine learning via lasso, random forest, and extreme gradient boosted decision trees. We specifically decided not to utilize logistic regression due to the complexity of our patient data. Due to the high collinearity within our dataset and inherent sparsity of many of the covariates, we found that logistic regression suffers from inflated variance of the learned coefficients. Benefits of the lasso model include its ability to perform variable selection, thereby helping to reduce multicollinearity while providing clinical clarity into which predictors cause an increase of specific complication risks. Beyond differences in model choice, the data and validation methodologies differ significantly between the 2 calculators. The ACS NSQIP calculator reports strong predictive in-sample performance, with AUCs of 0.944 for mortality and 0.816 for morbidity [6]. Because the ACS NSQIP models are trained and tested on the same cohort of patients, it is difficult to discern whether these results indicate accurate model predictions or overfitting, limiting the NSQIP calculator's clinical use capabilities. In comparison, Pythia's calculator is validated on a non-random, out-of-sample test set from a different time period derived from our health system's EHRs, with similar AUCs demonstrating strong potential performance in clinical practice with appropriate validation methods.
Our analysis directly comparing the 30-day postoperative mortality models from ACS NSQIP and Pythia demonstrates the superior performance of Pythia's predictions on our local patients. Many publications have demonstrated the inability of the ACS NSQIP calculator to accurately depict postoperative complication risks in many different patient populations [30][31][32][33]. However, very few publications propose superior methods. Not only does this direct comparison between the 2 models provide further evidence that the ACS NSQIP calculator does not perform strongly on our local patients, but it also puts forth a new methodology of local data extraction, curation, and modeling. This new methodology is shown to be superior to ACS NSQIP's for predicting postoperative complications in a local setting.
Few published postoperative risk prediction models utilizing EHR data exist. SORT (Surgical Outcome Risk Tool) was developed by Protopapa et al. and predicts 30-day mortality utilizing 6 predictor variables using logistic regression, with AUCs ranging from 0.82 to 0.96 for surgical subspecialty groups [36]. To our knowledge, the only other studies utilizing EHR-extracted data to build and train machine learning models for postoperative risk predictions are Weller et al. [11] and Soguero-Ruiz et al. [12]. Weller et al. built 5 different types of machine learning models to predict postoperative superficial skin infection, ileus, and bleeding in colorectal surgery cases. However, due to small sample size, their reported AUCs were not as strong, with the exception of random forest models predicting postoperative bleeding complications (AUC 0.8) [11]. Similarly, Soguero-Ruiz et al. used EHR data to predict postoperative anastomosis leakage in colorectal surgeries. Their reported AUC was strong (0.92) using a support vector machine (SVM) model [12]. Our body of work, however, differs greatly from these previously generated postoperative risk prediction models. Not only do we utilize EHR data-driven machine learning models, but our models have strong predictive performance while predicting postoperative outcomes across a broad range of surgical procedures. In addition, our calculator is based on real-time data extraction from a pipeline from the EHR system that can be continuously and automatically updated and does not rely on manual extraction. The current study addresses a substantial breadth of surgical complications, providing diverse opportunity to intervene on high-risk patients and improve outcomes.
Limitations of our work include missing data: because of missing values, only 99,755 of the 145,604 invasive procedure encounters within the data repository were used to build these models. Although this excludes 45,849 encounters (31.5% of the total), the remaining sample is large enough to model and predict complications with strong discrimination (AUCs between 0.747 and 0.924). We chose not to consider imputation methods due to the underlying difficulties of imputing clinical data. Specifically, the most frequently missing variables were outpatient medication lists. The complication groupings within our models are also defined broadly, limiting the user's ability to understand exactly which type of complication a patient is at risk for within a grouping. For example, cardiac complications include a wide range of ICD-9 and ICD-10 diagnosis codes. Outpatient death data may also be incomplete, because the Social Security Death Index has excluded a portion of state-reported deaths since 2011 due to public data restrictions [37]. Pythia is an ongoing project that requires optimization with regard to missing data and further curation of fields from additional tables in the EHR database. Efforts to develop strategies for effective imputation, data curation, and the addition of other quality data sources are priorities for future iterations.
A further analytical limitation concerns our comparison of 30-day mortality model performance against ACS NSQIP's model. This analysis was based on a random sample of 75 patients, so analyses with larger patient samples will be needed in the future.
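To illustrate how sample size affects the uncertainty of such a comparison, the following minimal sketch computes a percentile bootstrap confidence interval for an AUC. It is not the analysis code used in this study: the function names (`auc`, `bootstrap_auc_ci`), the resampling settings, and the labels and scores are all assumptions made for illustration, and the data are synthetic.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random positive case outscores
    a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap interval for the AUC."""
    point = auc(labels, scores)  # also validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi

# Synthetic example only: 75 hypothetical patients, 8 of them with the outcome.
rng = random.Random(42)
labels = [1] * 8 + [0] * 67
scores = [min(1.0, max(0.0, rng.gauss(0.6 if y else 0.3, 0.2))) for y in labels]
point, lo, hi = bootstrap_auc_ci(labels, scores)
print(f"AUC {point:.3f} (95% CI: {lo:.3f}, {hi:.3f})")
```

With few outcome events, the resampled AUCs vary widely and the interval is correspondingly wide, which is why a larger sample would be needed to compare two models with confidence.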
Our proposed data collection methodology is less burdensome than manual extraction, reducing both cost and time, but as with all data collection methods, it has limitations. Over time, both the standards of data collection within the EHR system and clinical practice trends may change, altering the way data are represented in large EHR data repositories. Monitoring systems that can detect these variations need to be put in place as these healthcare repositories continue to grow over long periods of time. Efforts to develop the monitoring architecture around EHR repositories are also priorities for future iterations of this surgical data repository.
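As one possible form such monitoring could take, the sketch below compares the distribution of a single categorical feature between a baseline period and a recent period using the population stability index (PSI). This is an assumption for illustration, not a component of Pythia: the function name `psi`, the example feature, and the data are hypothetical, and any alerting threshold would need to be chosen and validated locally.

```python
import math
from collections import Counter

def psi(baseline, recent, eps=1e-6):
    """Population stability index between two samples of a categorical
    feature. Missing values (None) are treated as their own category."""
    if not baseline or not recent:
        raise ValueError("both samples must be non-empty")
    b_counts, r_counts = Counter(baseline), Counter(recent)
    total = 0.0
    for category in set(baseline) | set(recent):
        # Floor each proportion at eps so unseen categories do not cause log(0).
        b = max(b_counts[category] / len(baseline), eps)
        r = max(r_counts[category] / len(recent), eps)
        total += (r - b) * math.log(r / b)
    return total

# Hypothetical example: smoking status becomes more often missing
# after a change in how the field is recorded.
baseline = ["never"] * 60 + ["former"] * 30 + ["current"] * 10
recent = ["never"] * 40 + ["former"] * 20 + ["current"] * 10 + [None] * 30

print(f"PSI unchanged: {psi(baseline, baseline):.3f}")
print(f"PSI shifted:   {psi(baseline, recent):.3f}")
```

A PSI of zero means the two distributions are identical, and larger values indicate larger shifts. Running such a check per feature on each data refresh would flag fields whose recording has changed before the models are retrained on them.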
Our work demonstrates that we can better identify patients who are high risk and high cost within our referral base by creating a site-specific surgical data pipeline and repository to fuel our clinical calculator. Our calculator is personalized to our institution because it is derived from our local patient population and our university-affiliated surgeons. As stated by Bates et al., "algorithms are most effective and perform best when they are derived from and then used in similar populations" [38], further highlighting the need for local data to drive healthcare insights. By leveraging our local institution's EHR data, we are able not only to build machine learning models that improve healthcare delivery to our patients, but also to enhance education for our trainees and build future quality improvement initiatives. In addition, our calculator takes the form of a clinical portal in a web application for ease of use. By quickly entering a surgical patient's information into the 9 fields of the calculator, a clinician can see whether the patient is deemed high risk, which would warrant further preoperative evaluation and prompt referral to a high-risk clinic. Displays of the risk attributable to a given disease or medication can also help the team prioritize preoperative interventions and postoperative monitoring and care that have been shown to significantly lower postoperative complications as well as length of stay [5]. Use of the tool may also promote more specific discussions with patients about the benefits and risks of surgery, enhance shared decision-making, and advance care planning. If implemented thoughtfully within a preoperative clinic workflow, this tool can help support decisions made by a patient's care team.
Hosted through a simple web application, our risk calculator can be easily incorporated into the EHR system and automatically populated as patient features are entered into the chart. Plans to integrate this calculator into our institute's EHR system are currently underway. Once fully implemented, the models would be updated on a yearly basis by retraining and validating them with the latest patient data. Through this yearly retraining and validation plan, we will be able to track any changes made to our EHR data collection system and determine whether our models continue to perform strongly over time. Furthermore, as a project within our institute's learning health system, we plan to reevaluate our implementation as a whole and make corrective adjustments on a continual basis to best support our providers in their decision-making process. We believe that our methods for building a data pipeline from EHRs in order to develop machine learning models create a prototype for an institutional learning health system. In the future, our methods can be disseminated to develop infrastructure and best practices that extend to other healthcare institutions and patient populations in order to improve patient care.