A 12-hospital prospective evaluation of a clinical decision support prognostic algorithm based on logistic regression as a form of machine learning to facilitate decision making for patients with suspected COVID-19

Objective To prospectively evaluate a logistic regression-based machine learning (ML) prognostic algorithm implemented in real time as a clinical decision support (CDS) system for symptomatic persons under investigation (PUI) for Coronavirus disease 2019 (COVID-19) in the emergency department (ED). Methods We developed a model in a 12-hospital system using training and validation cohorts, followed by a real-time assessment. Least Absolute Shrinkage and Selection Operator (LASSO)-guided feature selection included demographics, comorbidities, home medications, and vital signs. We constructed a logistic regression-based ML algorithm to predict “severe” COVID-19, defined as intensive care unit (ICU) admission, invasive mechanical ventilation, or in- or out-of-hospital death. Training data included 1,469 adult patients who tested positive for Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) within 14 days of acute care. We performed: 1) temporal validation in 414 SARS-CoV-2-positive patients, 2) validation in a PUI set of 13,271 patients with a symptomatic SARS-CoV-2 test during an acute care visit, and 3) real-time validation in 2,174 ED patients with a PUI test or positive SARS-CoV-2 result. Subgroup analyses were conducted across race and gender to ensure equitable performance. Results The algorithm performed well on pre-implementation validations for predicting COVID-19 severity: 1) the temporal validation had an area under the receiver operating characteristic curve (AUROC) of 0.87 (95% CI: 0.83, 0.91); 2) validation in the PUI population had an AUROC of 0.82 (95% CI: 0.81, 0.83). The ED CDS system performed well in real time, with an AUROC of 0.85 (95% CI: 0.83, 0.87). No patients in the lowest quintile developed “severe” COVID-19, compared with 33.2% of patients in the highest quintile. The model performed without significant differences between genders or among races/ethnicities (all p-values > 0.05).
Conclusion A logistic regression model-based ML-enabled CDS can be developed, validated, and implemented with high performance across multiple hospitals while being equitable and maintaining performance in real-time validation.


Introduction
The dynamic of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infection raised concerns regarding resource availability throughout medical systems, including intensive care unit (ICU) healthcare providers, personal protective equipment, total hospital and ICU beds, and mechanical ventilators. On March 11th, 2020, the World Health Organization declared the Coronavirus disease 2019 (COVID-19) a pandemic. The COVID-19 pandemic has caused over 249 million confirmed infections and over 5 million confirmed deaths as of November 9th, 2021 [1]. One of the initial large observational studies, published from China, revealed that approximately 15% of confirmed cases required hospitalization, 5% needed ICU admission, and 2.3% died [2]. A multihospital United States (U.S.) based cohort study identified that the 30-day mean risk-standardized event rate of hospital mortality and hospice referral among patients with COVID-19 varied from 9% to 16%, with better outcomes occurring in communities with lower disease prevalence [3]. A large cross-sectional study found racial and ethnic disparities in rates of COVID-19 hospital and ICU admission and in-hospital mortality in the US [4].
Since the beginning, global efforts by the scientific community to understand SARS-CoV-2 and COVID-19 from the bench to the bedside have been remarkable [5]. Stratifying disease severity is an essential aspect of patient care; however, during a pandemic, its role becomes paramount and expands to improving patient safety while also optimizing hospital resource utilization. Several studies have developed emergency department (ED) evaluation systems with variable goals and methods [6][7][8][9][10][11][12]. These models successfully evaluated the possibility of isolating COVID-19 patients in the ED, COVID-19 epidemiology and clinical data, the advantage of distinguishing life-threatening emergencies, and the likelihood of COVID-19 diagnosis [6][7][8][9][10][11][12].
Most predictive models for COVID-19 severity involved patients with a positive polymerase chain reaction (PCR) test rather than patients with suspected COVID-19. A systematic evaluation of COVID-19 predictive models aimed at identifying clinical deterioration found that the majority of published studies included patients with confirmed infection [13], making them less useful in the clinic or emergency department when the diagnosis remains uncertain. The majority of predictive models for patients with suspected COVID-19 infection aimed to diagnose COVID-19, and very few predicted severity [14]. One systematic review of the prognostic models emphasized their high risk of bias and did not yet recommend their use in clinical practice [15]. Because such limitations mark the systematic reviews of prognostic models, a group of researchers from the United Kingdom (UK) developed a COVID-19 living review document [16]. Another group of researchers proposed an open platform for such reviews, continuously updated using artificial intelligence and numerous experts [17]. QCOVID is a published living risk prediction algorithm that performed well for predicting time to death in patients with confirmed or suspected COVID-19 [18].
We hypothesize that a logistic regression-based machine learning (ML) tool for patients with suspected or confirmed COVID-19 can accurately and equitably predict the development of "severe" COVID-19. The objective of this study was to conduct a 12-site prospective observational study to evaluate the real-time performance of an ML-enabled COVID-19 prognostic tool delivered as clinical decision support (CDS) to ED providers to facilitate shared decision-making with patients regarding ED discharge.

Study design and setting
This is a retrospective and prospective multihospital observational study that developed, implemented, and evaluated a prognostic model in patients with PCR-confirmed COVID-19 diagnosis or suspected COVID-19 (person under investigation [PUI]) in a 12-hospital system. This study was approved and determined as non-human research by the University of Minnesota Institutional Review Board (STUDY00011742).

Selection of participants
Patients were included if they were PCR-confirmed COVID-19 positive or symptomatic PUI with a patient status of emergency, observation, or inpatient at a participating center. We only included patients who did not opt out of research on admission. Patients were excluded if they did not have at least one recorded ED vital sign (heart rate, respiratory rate, temperature, oxygen saturation, or systolic blood pressure) or if they were missing comorbidity data. A complete set of vital signs was deemed necessary, given that our model was intended to be implemented and utilized in patients receiving a complete evaluation, which would include at least one full set of vital signs.

Feature selection and model development
A team of subject matter experts with expertise treating patients with COVID-19 and research experience in COVID-19 identified features hypothesized to be associated with development of "severe" disease (S1 Table). To reduce the likelihood of over-fitting, a Least Absolute Shrinkage and Selection Operator (LASSO)-logit model was used to facilitate feature selection from this list, with the tuning parameter determined by the Bayesian information criterion (BIC), as previously done by our group [19,20]. LASSO is a penalized regression method that facilitates factor selection by excluding factors with a minor contribution to the model [21]. S1 Table lists the features selected for the final model following LASSO selection. Final features selected by LASSO included age (years), male gender [3,22], race or ethnicity, non-English speaking [23,24], overweight or obese (body mass index [BMI] > 25) [19,25,26], home medications [27] prescribed within the 3 months prior to the index acute care visit, and chronic comorbidities [3,28] extracted from ICD-10 codes (S2 Table) collected in the 5 years prior to the index visit. Finally, we included the following vital signs: maximum heart rate (HR), respiratory rate (RR), and temperature within the first 24 hours, and minimum peripheral arterial oxygen saturation (SpO2) and systolic blood pressure (SBP) within the first 24 hours. We included in the final list of features for LASSO only the variables available on presentation to the ED.
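The BIC-tuned LASSO-logit selection step can be sketched as follows. This is a minimal illustration on synthetic data using scikit-learn (the study's analyses were run in Stata), with illustrative variable names rather than the authors' actual pipeline:

```python
# Sketch of LASSO-logit feature selection with a BIC-tuned penalty.
# Assumes scikit-learn; synthetic data stands in for the clinical features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1469, n_features=40,
                           n_informative=8, random_state=0)
n = len(y)

best_bic, best_model = np.inf, None
for C in np.logspace(-3, 1, 20):              # grid over penalty strengths
    m = LogisticRegression(penalty="l1", solver="liblinear", C=C,
                           max_iter=1000).fit(X, y)
    p = np.clip(m.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    k = np.count_nonzero(m.coef_) + 1         # nonzero betas plus intercept
    bic = k * np.log(n) - 2 * log_lik         # smaller BIC is better
    if bic < best_bic:
        best_bic, best_model = bic, m

selected = np.flatnonzero(best_model.coef_)   # indices of surviving features
print(len(selected), "features selected by BIC-tuned LASSO")
```

Features whose coefficients the L1 penalty shrinks to exactly zero drop out, which is how LASSO excludes factors with a minor contribution.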

Model construction
The purpose of this model generation was to develop a prognostic model that could predict which patients would develop a severe case of COVID-19. Because of its ease of interpretation and the importance of providing clinicians and patients with the basis for model predictions, a multivariable logistic regression model was trained using the features selected by LASSO. This model was developed using only data from the training dataset. A risk score was calculated in the validation cohorts based on the sum of the beta coefficients. The AUROC was calculated for each validation cohort to evaluate discrimination.
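The risk score follows the standard logistic-regression construction: the intercept plus the sum of the beta coefficients weighted by the feature values, passed through the logistic function. A minimal sketch, with entirely hypothetical coefficients and features:

```python
# Risk score = sigmoid(intercept + sum of beta_i * x_i).
# Coefficients and features below are hypothetical, not the fitted model.
import numpy as np

def risk_score(x, beta, intercept):
    """Predicted probability of "severe" COVID-19 for feature vector x."""
    linear = intercept + np.dot(x, beta)
    return 1.0 / (1.0 + np.exp(-linear))

beta = np.array([0.03, 0.8, -0.05])   # hypothetical: age, male gender, SpO2
x = np.array([65.0, 1.0, 92.0])       # a 65-year-old male with SpO2 of 92%
print(round(risk_score(x, beta, intercept=-1.0), 3))
```

Applying the same coefficients to each validation-cohort patient yields the score whose discrimination the AUROC summarizes.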

Outcomes
Our primary outcome was "severe" COVID-19 infection, defined as intensive care unit (ICU) admission, need for invasive mechanical ventilation (ventilator use), or in-hospital or out-of-hospital mortality (defined using the state death certificate database) [2,29,30]. The secondary outcomes were the individual components, and combinations, of the dependent variables mentioned above.

Training and test datasets
The training dataset included 1,469 patients who were PCR-positive for SARS-CoV-2 within 14 days of an acute care, hospital-based visit, including emergency department, observation, and inpatient encounters, between March 4th and August 21st, 2020. The test set included 158 patients (a random 90:10 selection from the training set).

Validation datasets
We included three validation sets: 1) a temporal validation set of 414 SARS-CoV-2-positive patients; 2) a PUI set of 13,271 patients with a symptomatic SARS-CoV-2 test ordered during an acute care visit; and 3) a real-time validation set of 2,174 ED patients with a PUI test or a positive SARS-CoV-2 result.

Analysis
Patients' characteristics were compared between datasets using ANOVA for continuous variables and chi-square tests for categorical variables. Odds ratios (OR) and 95% confidence intervals (CI) were also reported. The sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), likelihood ratios, false negative and false positive rates, and the area under the receiver operating characteristic curve (AUROC) were summarized to describe model performance. Statistical significance was defined with alpha set to 0.05; all tests were two-tailed. Statistical analyses were performed using Stata MP, version 16 (StataCorp, College Station, TX). The real-time model was evaluated across gender and racial/ethnic groups to compare performance across different groups and ensure the model performed equitably.
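Although the paper's analyses were performed in Stata, the same performance summary can be sketched in Python with scikit-learn; the data below are synthetic stand-ins, not study data:

```python
# Compute AUROC and the 2x2-table metrics reported in the paper,
# using synthetic labels and scores for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)                       # 0/1 outcomes
y_prob = np.clip(y_true * 0.4 + rng.random(500) * 0.6, 0, 1)

auroc = roc_auc_score(y_true, y_prob)
tn, fp, fn, tp = confusion_matrix(y_true, (y_prob > 0.5).astype(int)).ravel()
sens = tp / (tp + fn)             # sensitivity (true positive rate)
spec = tn / (tn + fp)             # specificity (true negative rate)
ppv  = tp / (tp + fp)             # positive predictive value
npv  = tn / (tn + fn)             # negative predictive value
lr_pos = sens / (1 - spec)        # positive likelihood ratio
print(f"AUROC={auroc:.2f} sens={sens:.2f} spec={spec:.2f} "
      f"ppv={ppv:.2f} npv={npv:.2f} LR+={lr_pos:.2f}")
```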

Model implementation
Implementation into the Electronic Health Record (EHR) occurred for ED patients on November 23rd, 2020. The logistic prognostic model was exported as a Predictive Model Markup Language (PMML) file. An EHR reporting workbench was developed to feed inputs into the model. All inputs were mapped using corresponding ICD-10 codes (S2 Table), pharmaceutical subclasses, RxNorm codes [31], and EHR documentation flowsheets (for vitals). The output was delivered as a clinical decision support system to ED providers. For visualization purposes, the COVID-19 severity risk score was multiplied by 100, and cut points were applied to identify patients at Low Risk (low probability of the primary outcome) and High Risk (high probability of the primary outcome). The visualization (S1 Fig) was highlighted on the patient sidebar, available to all ED providers and nurses, as well as physicians and staff involved in triage, patient flow, and capacity management.

Descriptive results
A total of 2,041 patients were included in the final model training (1,469), testing (158), and temporal validation (414) cohorts (Fig 1). Table 1 lists patients' characteristics in each cohort. Overall, significant differences across the training and validation cohorts existed for all demographic, home medication, comorbidity, and 24-hour vital sign variables, except for loop diuretic use, inflammatory bowel disease, and rheumatoid arthritis. Compared to the COVID-19 PCR-positive patients in the training set, the patients in the temporal validation and PUI sets were slightly younger (median age of 52.2 and 49.1 years vs. 53.6 years) and had lower rates of ICU admission (18.1% and 10.8% vs. 23.4%), ventilator use (3.4% and 5.3% vs. 11.1%), and mortality (1.7% and 3.5% vs. 8.5%). Compared to the training set, the real-time dataset was older (median age of 56.9 years) and had lower rates of ICU admission, ventilator use, and mortality (9.4%, 3.5%, and 6.8%, respectively). Table 2 describes the odds ratios used in the logistic regression model generation. "Other" race and inflammatory bowel disease were the two variables with the highest odds ratios that reached statistical significance; warfarin was the variable with the lowest odds ratio that reached statistical significance.
The model included factors that increase the odds of COVID-19 severity, such as age, male gender, Asian or Hispanic race/ethnicity, obesity, use of a calcium channel blocker, rivaroxaban, oral steroids, clopidogrel, aspirin, or a loop diuretic, hypertension, type 2 diabetes mellitus, venous thromboembolism, pacemaker/automatic implantable cardioverter-defibrillator, pulmonary hypertension, chronic kidney disease, inflammatory bowel disease, and maximum 24-hour temperature, heart rate, and respiratory rate, as well as factors that decrease the odds, such as use of hydrochlorothiazide, an angiotensin-converting enzyme inhibitor, an angiotensin II receptor blocker, or warfarin, rheumatoid arthritis, and minimum 24-hour peripheral oxygen saturation and systolic blood pressure.
In the validation cohorts, the risk score was used to identify clinically useful thresholds for predicting the institutional metric. Multiple thresholds were defined, and 2x2 contingency tables including sensitivity, specificity, PPV, and NPV were created for each threshold. The multidisciplinary team reviewed model performance, including sensitivity, specificity, PPV, NPV, and likelihood ratios, across the candidate thresholds to facilitate rapid implementation, and system leadership defined appropriate thresholds based on clinical resources. Cut-off points flagging high- and low-risk patients were chosen in collaboration with system leadership following engagement with front-line providers. The goal for the low-risk cut-off was high sensitivity at the expense of specificity, to reduce potential errors associated with inappropriate discharge home. The goal for the high-risk cut-off was higher specificity, to balance the need for close monitoring against resource scarcity, including ICU and step-down capacity.
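The threshold review described above can be sketched as a sweep over candidate cut points, tabulating the 2x2-table metrics at each. The data and threshold grid below are synthetic and illustrative, assuming scikit-learn:

```python
# Sweep candidate thresholds and tabulate sensitivity/specificity/PPV/NPV,
# as done when presenting options to leadership. Synthetic data only.
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 1000)                               # 0/1 outcomes
score = np.clip(rng.normal(0.15 + 0.25 * y, 0.15), 0.001, 0.999)

rows = []
for t in (0.05, 0.1, 0.2, 0.3):                            # candidate cut points
    tn, fp, fn, tp = confusion_matrix(y, (score > t).astype(int)).ravel()
    rows.append((t, tp/(tp+fn), tn/(tn+fp), tp/(tp+fp), tn/(tn+fn)))
    print(f"t={t:.2f} sens={rows[-1][1]:.2f} spec={rows[-1][2]:.2f} "
          f"ppv={rows[-1][3]:.2f} npv={rows[-1][4]:.2f}")
```

Raising the threshold trades sensitivity for specificity, which is exactly the trade-off navigated between the low-risk and high-risk cut-offs.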

Pre-implementation validation: Temporal validation and in PUI
The model produced an AUROC of 0.87 (95% CI: 0.83, 0.91) for predicting the primary outcome (ICU admission, ventilator use, or death) in the temporal validation cohort (S2 Fig). None of the patients in the lowest 20% of scores (0-0.0104) required ICU admission or ventilator use or died, compared to 62%, 15.9%, and 7.3%, respectively, of patients in the highest 20% of scores (0.168-1.0) (S3 Table). At a cut point of >0.1, the model had a sensitivity of 73.7% and specificity of 79.9% for predicting the composite outcome (S4 Table).
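The quintile analysis reported above can be reproduced schematically: bin the risk scores into fifths and tabulate the outcome rate per bin. The data here are synthetic, constructed so that risk rises with score:

```python
# Bin risk scores into quintiles and report the outcome rate in each.
# Synthetic data; outcome probability grows with the score by construction.
import numpy as np

rng = np.random.default_rng(2)
score = rng.random(2000)
outcome = (rng.random(2000) < score * 0.3).astype(int)

edges = np.quantile(score, [0.2, 0.4, 0.6, 0.8])   # quintile boundaries
quintile = np.digitize(score, edges)               # 0 = lowest fifth, 4 = highest
for q in range(5):
    rate = outcome[quintile == q].mean()
    print(f"quintile {q + 1}: outcome rate {rate:.1%}")
```

A well-behaved prognostic score shows a steep gradient across quintiles, as the study observed between the lowest (0%) and highest (33.2%) fifths.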
This model was further tested in the PUI cohort, which included 13,271 patients who had a SARS-CoV-2 test with a "symptomatic" designation ordered in the ED. Of note, the accumulative Table). At the cut point of >0.1, the model had a sensitivity of 52.2% and specificity of 88.1% for predicting the composite outcome (S5 Table).

Real-time validation
Critically, we implemented this model to predict the composite outcome for COVID-19 severity and assessed the model's real-time performance. The COVID-19 positivity rate in the real-time validation set was 61.2% (1,331 of 2,174 patients). This real-time cohort had a median age of 56.9 years (IQR: 35.4-72.4), an ICU admission rate of 9.4%, a ventilation rate of 3.5%, and a mortality rate of 6.8%. The model had an AUROC of 0.85 (95% CI: 0.83, 0.87) for predicting the primary outcome in the real-time dataset (Fig 2). The rates of ICU admission, ventilator use, and death in patients with the lowest 20% of scores (0.001-0.009) were zero, significantly lower than the corresponding rates (32.7%, 15.5%, and 22.0%, respectively) in patients with the highest 20% of scores (0.20-0.99) (Table 3). At the cut point of >0.1, the model had a sensitivity of 78% and a specificity of 71% in the real-time dataset (Table 4). To evaluate the predicted probabilities in the real world, we depicted the calibration plot for the real-time validation set (S4 Fig).
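A calibration plot like the one referenced here compares predicted probabilities to observed outcome rates within probability bins. A minimal sketch of the underlying computation, assuming scikit-learn and synthetic data that are perfectly calibrated by construction:

```python
# Compute the points of a calibration plot: observed outcome rate vs.
# mean predicted probability per bin. Synthetic, well-calibrated data.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(3)
y_prob = rng.random(2000)                            # predicted probabilities
y_true = (rng.random(2000) < y_prob).astype(int)     # outcomes drawn at those rates

prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
for obs, pred in zip(prob_true, prob_pred):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```

Points falling near the diagonal indicate good calibration; systematic deviation at the high-risk end is the kind of miscalibration a plot like S4 Fig would expose.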

Model performance on individual and combined outcomes and across minorities
The AUROCs of all cohorts for predicting the combined and individual outcomes are listed in Table 5. Performance remained strong for predicting the secondary outcomes of combinations of ICU admission, need for mechanical ventilation, and mortality.

Discussion
We developed and implemented an ML-enabled model to predict increased risk of COVID-19 severity to support ED physicians' clinical decision-making across our 12-site medical system. Despite significant variability in the underlying factors, our model performed well in a large PUI study population. This approach is beneficial for clinical decision-making in the ED, where COVID-19 PCR results are inconsistently available at the time of disposition. Importantly, we evaluated our model in real time in PUI patients seeking acute care in the ED after the score became available in the EHR, and model performance remained strong. The differences in ICU admission, ventilator use, and mortality rates between the training set and the temporal, PUI, and real-time validation sets can be explained by the temporal improvement in COVID-19 patients' outcomes noted in other studies [4,32,33]. COVID-19 ICU admission and patient survival improved in our study over time, as in other reports, perhaps because of better understanding of the disease and improvement in treatment as the pandemic progressed [4,33]. In our study, at a cutoff of 0.1 for COVID-19 severity, our model had a sensitivity of 73.7% and specificity of 79.9% in the prospective validation set, 52.2% and 88.1%, respectively, in the PUI set, and 78% and 71%, respectively, in the real-time validation set. These results show good discrimination, with higher scores associated with increased rates of the primary outcome. Furthermore, the performance of the model was robust to the secular improvements in outcomes throughout validation. Our model's purpose was to estimate the risk of severe disease, delivered via ML-enabled CDS, in patients with confirmed or suspected COVID-19 presenting to the ED, and to facilitate shared decision-making between ED providers and patients regarding ED discharge and home saturation monitoring.
The variables included (demographics, comorbidities, home medications, and vital signs) are readily available in the ED. Laboratory values, which are not always obtainable in the ED, were not included in the final model, an approach shown to be feasible in a recently published ML model [34]. The variables associated with a significantly higher risk of COVID-19 severity in our model were male gender, older age, "other" race, increased temperature, increased respiratory rate, decreased oxygen saturation, and inflammatory bowel disease. Comparable to our model, vital signs, age, BMI, and comorbidities were the most important predictors in other investigations and reviews [35,36]. Oxygen saturation and patient age were strong risk factors for deterioration and mortality in COVID-19 in a systematic evaluation of predictive models [13]. The use of warfarin appeared to be protective for our study's composite outcome, similar to another report [37]. Hypercoagulability and the need for anticoagulation are well recognized in COVID-19 and likely result from an increased immune response [38,39]. We included variables that were not significant on univariate analysis as well as variables that were protective. These variables make our model valuable in real life, where many covariates and confounding factors exist, and improved model calibration.
It is imperative that ML models are evaluated for equity across gender, race, and ethnicity. We included gender, race, and ethnicity in our model given the associations of minority populations and male gender with worse COVID-19 outcomes [23,24,[40][41][42]. While others chose to create separate prognostic models for males and females, we decided to include all patients in a single model [18]. Male gender was a significant predictor in our study, and the AUROC in male patients showed good performance without statistical difference from the AUROC in female patients. While including race has led to both over- and undertreatment of minority populations [43,44] due to sampling bias, others argue that creating a "race-unaware" model also carries risk in specific situations [45]. One such situation is when race/ethnicity is associated with increased risk of the outcome, as with "other" race in our study, which showed increased risk of COVID-19 severity. By creating a model without race or ethnicity, the model is trained to reflect the majority population and will inherently underappreciate the risk in minority populations [45]. Our model performed equitably across racial/ethnic minorities and did not increase the risk of widening the disparate outcomes observed throughout the pandemic [24,46,47]. By increasing treatment and resource allocation to non-whites, we hypothesize that this will increase equitable treatment allocation and attenuate disparate care. Unlike most prognostic models, which predict the COVID-19 diagnosis [35,[48][49][50], our study aimed to implement and assess a predictive model in patients with suspected COVID-19, or PUI. It is worth noting that 68% of our patients were discharged before their test resulted in our medical system. During this period of uncertainty, many ED physicians are required to make clinical and triage decisions.
Previous predictive models for patients with suspected COVID-19 infection have used imaging, demographics, signs and symptoms, and vital signs to predict the likelihood of COVID-19 diagnosis, but they have not sought to predict the severity of the disease [12,14,51]. The Epic Deterioration Index (EDI) is a proprietary emergency deterioration index developed in 3 US hospitals between 2012 and 2016; although it is not specific to COVID-19, it has been introduced in over 100 US hospitals to predict COVID-19 deterioration [30].
Multiple prognostic models for COVID-19 have been previously developed [14,18,30,52,53]. However, previous models suffer from multiple limitations. For example, many prior prognostic models included very limited training datasets [54,55]. The largest study to date, published in Great Britain, used the 4C Mortality Score to stratify the severity of COVID-19 [56]. In contrast to our model, the 4C score includes laboratory values (urea level and C-reactive protein) not always available in the ED, and used data from COVID-19-positive patients admitted to the hospital: AUROC 0.79 (95% CI 0.78-0.79). A systematic external validation of 22 prognostic models in a cohort of 411 patients with COVID-19 found that the NEWS2 score, which predicted ICU admission or death within 14 days of symptom onset (AUROC 0.78; 95% CI 0.73-0.83), achieved the highest AUROC [13]. The EDI was recently tested on 392 hospitalized COVID-19 patients in a single center, with an AUROC of 0.79 (95% CI 0.74-0.84) [30]. Our model's performance for predicting COVID-19 severity in the prospective validation, PUI, and real-time datasets is more robust than in the above-mentioned external validations of prognostic models. Data from the national Registry of suspected COVID-19 in Emergency care (RECOVER network), comprising 116 hospitals from 25 states in the US, produced a 13-variable score that can predict the probability of infection in patients presenting to the ED with suspected COVID-19 [57]. The large RECOVER registry used patient data readily available in the ED, such as age, temperature, oxygen saturation, symptoms, and ethnicity; however, the score was developed with retrospective data and was not tested in real time [57].

Strengths and limitations
Our study has several strengths. First, the model was validated both in patients with a COVID-19 diagnosis and in patients with suspected COVID-19. Second, the logistic regression-based ML used data readily available in the ED. Third, we included variables that were non-significant or protective in univariate analysis, making the logistic regression-based ML more suitable for real-life settings where many confounders exist. Fourth, it was tested in real time in patients with suspected COVID-19 presenting to the acute care setting, as a CDS for ED providers and patients. Finally, our model was tested for gender and race/ethnicity differences and performed equitably, avoiding disparities.
These findings must be viewed within the context of the following limitations. First, this study was done within a single healthcare system. Despite a large catchment area that includes surrounding states, these results are specific to the regional patient population in which the model was derived until they have been validated in other populations with different demographics and socioeconomic backgrounds. Second, our model over-predicted disease severity, making it a valuable tool for patient safety but less so for resource utilization. Third, the accuracy of patient comorbidities and medications available in the ED relies on the EHR history, which is not consistently updated during the acute care visit. Fourth, as seen in the calibration plot, the model suffers from miscalibration at the high-risk end, likely due to imbalance in the dataset, which contained relatively few severe outcomes. Future studies will seek to increase the sample size and include external institutions, which will aid in further optimizing the model and addressing generalizability. Lastly, this study sought to develop, validate, and implement a prediction model to support clinical decision-making. Importantly, the model was never intended to replace clinical judgment; rather, it was intended to complement and better inform providers and patients, specifically when there is a large degree of clinical uncertainty. The effect on clinical decisions and the long-term effect on patient safety remain to be determined and were beyond the scope of this analysis.

Conclusions
COVID-19 has burdened healthcare systems on multiple fronts, and finding ways to alleviate that stress is crucial. CDS through ML-enabled predictive modeling may add to patient care, reduce undue variation in decision-making, and optimize resource utilization, especially during a pandemic. We present the successful development and implementation, across 12 hospitals, of a COVID-19 prediction model that performs well across gender, race, and ethnicity for three different outcomes. The primary severity-of-illness outcome performed well in the PUI population despite the model being developed on a COVID-19-positive population. Further studies of the effect on patient outcomes and resource use are needed to assess the benefits of the model presented here.