Comparison of in-hospital mortality risk prediction models from COVID-19

Objective Our objective is to compare the predictive accuracy of four recently established outcome models of patients hospitalized with coronavirus disease 2019 (COVID-19) published between January 1st and May 1st, 2020. Methods We used data obtained from the Veterans Affairs Corporate Data Warehouse (CDW) between January 1st, 2020, and May 1st, 2020 as an external validation cohort. The outcome measure was hospital mortality. Areas under the receiver operating characteristic curves (AUCs) were used to evaluate discrimination of the four predictive models. The Hosmer–Lemeshow (HL) goodness-of-fit test and calibration curves assessed applicability of the models to individual cases. Results During the study period, 1634 unique patients were identified. The mean age of the study cohort was 68.8±13.4 years. Hypertension, hyperlipidemia, and heart disease were the most common comorbidities. The crude hospital mortality was 29% (95% confidence interval [CI] 27–31%). Evaluation of the predictive models showed an AUC range from 0.63 (95% CI 0.60–0.66) to 0.72 (95% CI 0.69–0.74), indicating poor to fair discrimination across all models. There were no significant differences among the AUC values of the four prognostic systems. All models were poorly calibrated, either overestimating or underestimating hospital mortality. Conclusions All four prognostic models examined in this study carry a high risk of bias. The performance of these scores needs to be interpreted with caution in hospitalized patients with COVID-19.


Introduction
Since the first reported case of COVID-19 in Wuhan, China, at the end of 2019, COVID-19 has rapidly spread throughout the globe, shattering the world economy and traditional ways of life [1]. As of August 1, 2020, more than 17 million laboratory-confirmed cases had been reported worldwide. The number of infected individuals has surpassed that of SARS and MERS combined. Despite valiant public health responses aimed at flattening the curve to slow the spread of the virus, more than 675,000 people have died from the disease [2].
Numerous prognostic models, ranging from rule-based scoring systems to advanced machine learning models, have been developed to provide prognostic information on patients with COVID-19 [3]. Such information is valuable both to clinicians and patients. It allows healthcare providers to stratify treatment strategies and plan for appropriate resource allocation. As for patients, it offers valuable guidance when advance directives are to be implemented. However, the initial description of these prognostic models has been based on patients from a localized geography and time frame. These evaluations may thus be limited in the scope of their predictability, as concerns have been raised about the applicability of such models when patient demographics change with geography, clinical practice evolves with time, and disease prevalence varies with both [4,5]. In response to the call for sharing relevant COVID-19 research findings, many of these models have been published in open access forums before undergoing peer review. The quality of these models is further compromised by the relatively small sample sizes used in both derivation and validation [6]. Recently, Wynants and colleagues [7] conducted a systematic review of COVID-19 models developed for predicting diagnosis, progression, and mortality from the infection. All models reviewed were at high risk of bias because of improper selection of control patients, data overfitting, and exclusion of patients who had not experienced the event of interest by the end of the study. Moreover, external validation of these models was rarely performed. In the present study, we sought to examine the external validity of four scoring models that have shown excellent precision for predicting hospitalization outcome from COVID-19 [8-10].

Patients
We used data from the Veterans Affairs Corporate Data Warehouse (CDW) of all patients who tested positive by reverse transcriptase polymerase chain reaction assay for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) between January 1st, 2020, and May 1st, 2020. Data were extracted from the CDW using Structured Query Language (SQL) with pgAdmin4 PostgreSQL 9.6 on July 16, 2020. The de-identified database contained demographic information, laboratory values, treatment processes, and survival data. We excluded patients who had a length of stay <24 hours, who lacked vital signs or laboratory data, and who were transferred to or from another acute care facility (because we could not accurately determine the onset or subsequent course of their illness). The median time between the date the index case tested positive for COVID-19 and the date of discharge (whether alive or dead) was referred to as the median follow-up. All data analysis was done on the VA Informatics and Computing Infrastructure (VINCI) workspace. Access to the CDW for research was approved by the Institutional Review Board of the VA Western New York Healthcare System. Because the study was deemed exempt, informed consent was not required.

Missing data
Demographic and comorbidity data contained almost no missing data. However, many baseline laboratory values had up to 20% missing data. When data are missing at random, statistical methods such as multiple imputation give less biased and more realistic results compared with complete case analysis [11]. However, the ordering of a laboratory test is likely driven by factors that make the assumptions underlying multiple imputation inaccurate. In the absence of a standardized method to address missing data under these conditions, we adopted the following approach: missing data were imputed with the centered mean, and a dummy variable (also called an indicator variable) was added to the statistical model to indicate whether the value for that variable was available [12]. With this missing-indicator method, the indicator is set to 1 when the covariate value is missing and to 0 otherwise. Both the mean-imputed primary variable and its missingness indicator were then entered into a mixed-effects logistic regression model to predict the intended outcome.
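The missing-indicator approach described above can be sketched as follows. This is a minimal illustrative example, not the study's actual code (which ran on the VINCI workspace); the column name `d_dimer` and the toy values are hypothetical, and overall-mean imputation stands in for the paper's "centered mean".

```python
import numpy as np
import pandas as pd

def impute_with_indicator(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Mean-impute one laboratory variable and add a missingness indicator.

    The indicator is 1 when the original value was missing, 0 otherwise;
    both the imputed variable and the indicator are then entered into the
    downstream regression model.
    """
    out = df.copy()
    missing = out[column].isna()
    out[column + "_missing"] = missing.astype(int)        # indicator variable
    out[column] = out[column].fillna(out[column].mean())  # mean imputation
    return out

# Hypothetical toy data: a D-dimer column with two missing values
labs = pd.DataFrame({"d_dimer": [0.5, np.nan, 1.2, np.nan, 0.9]})
labs = impute_with_indicator(labs, "d_dimer")
```

In a regression, both `d_dimer` and `d_dimer_missing` would then be included as covariates so that the model can separate the effect of the measured value from the effect of the test having been ordered at all.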

External validation of risk models
We initially conducted a search of the PubMed and Medline databases covering January 1st, 2020 to May 1st, 2020. The literature search used the following keywords in combination: 1) (COVID-19 OR SARS-CoV-2 OR 2019-nCoV) AND 2) (Mortality OR Death) AND 3) (Predictive model OR Scoring system) ("S1 Table" in S1 File). Inclusion criteria were: 1) English-written peer-reviewed studies; 2) hospitalized patients with COVID-19; 3) prognostic models for predicting in-hospital mortality; and 4) sample size of no less than 100. Exclusion criteria included duplicate studies and lack of access to full documents. Studies identified by the search strategy were reviewed by title and abstract. Screening was conducted by two independent investigators (YL and DES). Any disagreements were resolved by consensus. Fifteen studies were identified. Two were concise reviews, leaving 13 studies for further evaluation. Four prognostic models were selected based on availability of the predictive parameters in the CDW [8-10,13]. For each predictive model, we replicated the methods used by the original authors to calculate the predicted hospital mortality from COVID-19. The main outcome of interest was in-hospital mortality.
We incorporated the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) principles for validating each of the selected predictive models [14]. The risk of bias for each predictive model was evaluated by the Prediction model Risk Of Bias Assessment Tool (PROBAST) described by Moons and colleagues [15].

Statistical analysis
The normality of continuous variables was assessed using the Kolmogorov-Smirnov test. Continuous variables with and without normal distribution were reported as mean (standard deviation (SD)) and median (interquartile range (IQR)), respectively. Categorical variables were presented as number (percentage). Continuous variables with or without normal distribution were compared between survivors and non-survivors using the t-test and the Mann-Whitney U test, respectively. Comparisons of categorical variables were performed using chi-square tests.
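The univariate comparisons above can be sketched in a few lines. The study used STATA; this is an illustrative Python equivalent with simulated data (the group sizes, means, and 2×2 counts below are hypothetical, not values from the CDW cohort).

```python
import numpy as np
from scipy import stats

# Simulated continuous variable (e.g. age) for two outcome groups
rng = np.random.default_rng(0)
survivors = rng.normal(60, 12, 200)
nonsurvivors = rng.normal(72, 11, 80)

# Normality check: Kolmogorov-Smirnov test against a fitted normal
ks_stat, ks_p = stats.kstest(
    survivors, "norm", args=(survivors.mean(), survivors.std(ddof=1))
)

# Normally distributed continuous variable: two-sample t-test
t_stat, t_p = stats.ttest_ind(survivors, nonsurvivors)

# Non-normal continuous variable: Mann-Whitney U test
u_stat, u_p = stats.mannwhitneyu(survivors, nonsurvivors)

# Categorical variable (e.g. hypertension yes/no): chi-square test on a 2x2 table
table = np.array([[120, 80],   # survivors: exposed / unexposed
                  [60, 20]])   # non-survivors: exposed / unexposed
chi2_stat, chi_p, dof, expected = stats.chi2_contingency(table)
```

In practice the choice between the t-test and the Mann-Whitney U test for each laboratory variable follows the normality result, as described above.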
Receiver operating characteristic (ROC) curves were drawn for each model by plotting sensitivity versus one minus specificity. The area under the receiver operating characteristic curve (AUC) was used to evaluate the discriminatory capacity of the selected models [16]. An ideal discrimination produces an AUC of 1.0, whereas discrimination that is no better than chance produces an AUC of 0.5. Based on a rough classifying system, the AUC can be interpreted as follows: 0.90–1.00 = excellent; 0.80–0.90 = good; 0.70–0.80 = fair; 0.60–0.70 = poor; 0.50–0.60 = fail [17]. Pairwise comparison of the area under the ROC curve for each model was performed according to the method described by Hanley and McNeil [16]. If P is less than the conventional 5% (P < .05), the compared areas are considered statistically different. Calibration was assessed with the Hosmer-Lemeshow goodness-of-fit χ² estimates by grouping cases into deciles of risk [18]. The method involves sorting the predictive probabilities of death in ascending order and dividing the total number of cases into 10 equally sized subgroups, or deciles. Calibration plots were provided to show the relationship between model-based predictions of mortality and observed proportions of mortality using the loess algorithm [19]. The non-parametric bootstrapping method was used to calculate the 95% confidence intervals (CIs) of both discrimination and calibration estimates [20]. These CIs were reported using the percentile method, or the bias-corrected method if the estimation bias was greater than 25% of the standard error [21]. All analyses were performed using STATA 15.0 (STATA Corp). A P-value less than .05 was considered statistically significant.
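The three core computations of this section (the AUC, its bootstrap percentile CI, and the Hosmer-Lemeshow decile test) can be sketched as follows. This is an illustrative Python implementation under simplifying assumptions, not the study's STATA code; the simulated, deliberately well-calibrated data are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata, chi2

def auc_score(y_true, y_prob):
    """AUC via the Mann-Whitney rank formulation (ties handled by midranks)."""
    ranks = rankdata(y_prob)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auc_ci(y_true, y_prob, n_boot=500, seed=0):
    """Non-parametric bootstrap 95% CI for the AUC (percentile method)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)        # resample patients with replacement
        if 0 < y_true[idx].sum() < n:      # resample must contain both outcomes
            aucs.append(auc_score(y_true[idx], y_prob[idx]))
    return np.percentile(aucs, [2.5, 97.5])

def hosmer_lemeshow(y_true, y_prob, groups=10):
    """HL goodness-of-fit chi-square over deciles of predicted risk."""
    order = np.argsort(y_prob)             # sort by predicted probability
    stat = 0.0
    for grp in np.array_split(order, groups):  # ~equal-sized risk deciles
        n_g = len(grp)
        obs = y_true[grp].sum()            # observed deaths in decile
        exp = y_prob[grp].sum()            # expected deaths in decile
        p_bar = exp / n_g
        stat += (obs - exp) ** 2 / (n_g * p_bar * (1 - p_bar))
    return stat, chi2.sf(stat, groups - 2)  # p < .05 suggests poor calibration

# Hypothetical, well-calibrated toy data (not the CDW cohort)
rng = np.random.default_rng(1)
y_prob = rng.uniform(0.05, 0.95, 500)
y_true = (rng.uniform(0, 1, 500) < y_prob).astype(int)

auc = auc_score(y_true, y_prob)
ci_lo, ci_hi = bootstrap_auc_ci(y_true, y_prob)
hl_stat, hl_p = hosmer_lemeshow(y_true, y_prob)
```

Because the toy outcomes are generated from the predicted probabilities themselves, the HL statistic should not flag miscalibration here; in the actual validation, the same test rejected the fit of all four models.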

Results
A total of 1634 patients were hospitalized for COVID-19 between January 1, 2020 and May 1, 2020. The majority of patients were male (95%), with 47% identified as Caucasian, 43% as African American, and 10% as Latino. Fever (65%), dyspnea (41%), and cough (32%) were the three most common manifestations at hospital admission. The mean age of the cohort was 68.8±13.4 years. Fifty percent of the group had three or more comorbidities. Hypertension was the most common comorbidity, followed by hyperlipidemia and heart disease. The median time from illness onset to admission was 7.8 days (interquartile range 1.0-14.2). Of the 817 patients treated in the intensive care units, 478 (59%) required invasive mechanical ventilation. Overall, 73.8% received at least one antibiotic treatment during their hospital stay. Almost half of the patients received azithromycin and/or hydroxychloroquine. After a median follow-up of 58 (IQR, 50-68) days, there were 475 deaths (overall mortality, 29%) for a mortality rate of 12 (95% CI, 11-12) per 1000 patient-days.
The clinical characteristics of survivors and non-survivors of the CDW cohort are depicted in "Table 1". In univariate analysis, age, current tobacco smoking, high burden of comorbidities, lymphopenia, thrombocytopenia, liver function abnormalities, and elevated procalcitonin and D-dimer levels were associated with mortality. Compared with survivors, non-survivors were more likely to receive vasopressors, to require mechanical ventilation, and to develop complications including acute respiratory distress syndrome, acute renal failure, and septic shock.
A summary of the models' methodology is depicted in "S2 Table" in S1 File. "Table 2" shows the independent risk variables and corresponding odds ratios of the four prognostic models. All four models were classified as at overall high risk of bias (ROB), either because of flawed methods of data analysis pertaining to the handling of missing data or because of the lack of a validation cohort ("S3 Table" in S1 File). The predictive performances of the four models on the CDW cohort are presented in "Table 3". The AUCs indicate inferior discriminative power across all models compared with the AUCs obtained in the derivation cohorts. Pairwise comparisons of the AUCs were performed using the method described by Hanley and McNeil [22] ("Table 4"). Overall, the best discrimination was obtained by the scoring model proposed by Shang et al. [9], which reached statistical significance with respect to the Chen et al. [8] and Yu et al. [10] models (AUC 0.72 (95% CI 0.69-0.74) versus 0.68 (95% CI 0.66-0.70) and 0.63 (95% CI 0.60-0.66), respectively) ("Fig 1"). The least discriminatory model was that of Yu et al. [10], with an AUC of 0.63 (95% CI 0.60-0.66).
The Hosmer-Lemeshow goodness-of-fit test revealed poor calibration (p < 0.05) for all the models ("Table 3"). Calibration was further explored by plotting the observed to expected frequency of death for each quintile of every model ("Fig 2"). The Chen et al. model [8] showed a departure from expected risks at the tail of the risk distribution for each of the three endpoints selected (14-, 21-, and 28-day predicted mortality) ("Fig 2A-2C"). The predictions overestimated the probability of death for high-risk patients. This was also the case for the model by Yu et al. [10] ("Fig 2E"). In contrast, the model by Wang et al. [13] underestimated the probability of death for low-risk patients and overestimated it for high-risk patients ("Fig 2F"), while the Shang et al. model [9] consistently overestimated mortality risk across the range of total scores ("Fig 2D").

Discussion
This is, to our knowledge, the first study to evaluate and externally validate risk prediction models of in-hospital mortality from COVID-19 in a large cohort. Our results showed that the external validation performance of all four selected scores was not commensurate with the performance observed in the primary derivation cohorts, underscoring that a model's performance can be considered generalizable only once the model has been tested in a separate cohort exposed to similar risk pressure.
With the rapid spread of COVID-19, healthcare providers have struggled to institute clinical strategies aimed at optimizing outcomes and reducing resource consumption. In response, more than two dozen prediction models were submitted for publication in just over a 12-week period after COVID-19 was declared a pandemic by the WHO [3]. Many of these prediction models were developed as simplified scoring systems or nomograms. Despite the excellent predictive accuracy shown in the initial derivations, the validity of these models has not been confirmed independently. Based on our observations, the performance of these prognostic systems varied in the ability to discriminate between survivors and non-survivors, and all were rated either fair or poor in contrast to their original designation as excellent or good. We should point out that the four models originated from mainland China, which was initially
hard hit by the pandemic. Given the large disparity in medical resources among the Chinese provinces [23], these models can be expected to be accurate only in clinical settings similar to those in which they were derived. As such, risk prediction models developed in a different geographic setting can be less accurate in providing risk-adjusted outcomes when applied externally [24]. Various statistical and clinical factors may cause a prognostic model to perform poorly when applied to other cohorts [25]. First, the models presented in this study are parsimonious, making a variety of assumptions in order to simplify applicability and avoid overfitting the limited and often incomplete data available. Even though these predictive models were constructed with similar variables, such as age, presence of comorbidities, and laboratory values (procalcitonin, C-reactive protein, or D-dimer), the thresholds selected for each of these variables vary significantly with geographic locality [26]. Second, these models have several sources of uncertainty, including the definition of parameters entered into the final model, differences in handling missing data, and, most importantly, non-comparable traits (genetic diversity), which can weaken model prognostication and lessen its discrimination accuracy [27,28].
Even though discrimination can be useful for generic risk stratification, the observed poor calibration underscores that these prognostic scoring models may not be applicable to heterogeneous systems of health care delivery dissimilar to the derivation cohorts. The four prognostic models showed shortcomings with regard to calibration, tending to over-predict or under-predict hospital mortality. This may partly reflect the inclusion criteria of the sample (in which, for example, do-not-resuscitate patients were not included) and improvements in care (e.g., timing of transfer to the ICU or prone positioning in the management of ARDS) since the models were first developed. A relevant factor in explaining the divergence in performance accuracy is that the time from onset of illness to admission was not similar among the cohorts. Wang and colleagues [13] reported the shortest interval, a median of 5.0 days for survivors and 6.8 days for non-survivors, while Yu and coworkers [10] reported a median of 10.0 days for both survivors and non-survivors. Our interval was comparable to that of Shang and colleagues [9], which may explain the higher performance of that model on the CDW cohort.
It could also be argued that our CDW cohort consisted predominantly of male, Caucasian and African American patients with multiple comorbidities, demographics that differ from those of the original training datasets and, as such, may impose significant strain on the accuracy of the risk estimates. Age-standardized mortality in men has been shown to be almost double that of women across all age groups [29]. Reports have similarly suggested disproportionate mortality rates among Black and Latino residents relative to their proportion of the US population; age- and population-adjusted mortality among Black individuals was reported to be more than twice that of Whites [30,31]. Accordingly, the predictive models might be expected to give different predictions of mortality risk in our validation cohort. While this may cause prognostic systems to underestimate the mortality rate at the lower end of the calibration curve, only two of the four tested models exhibited this pattern. Multiple studies have demonstrated a decreasing ratio of observed to expected mortality over time [32,33]. Changing risk profiles, advances in treatment modalities, and changes in the association of risk factors with outcomes can all contribute to poor calibration. Given that the CDW cohort overlaps the time period during which the four models were constructed, we cannot attribute the failure of the Hosmer-Lemeshow tests to this phenomenon [34]. Consideration of other variables, such as severity of comorbid diseases, lifestyle habits (smoking or alcohol intake), and prescribed treatments, may improve the predictive accuracy of these models.
Alternatively, an ensemble learning model [35], which combines multiple decision-making tools, could be implemented to produce more accurate output [36].
Our study has strengths but also several limitations. The systematic nature of the model identification, the large sample in which the models were validated, and the opportunity to compare performance among the predictive models are all substantial strengths. Conversely, at the time our analysis was conducted, the number of COVID-19 cases was relatively small compared with the most recent statistics on veterans infected with COVID-19. This limits the precision in re-estimating the baseline prevalence of the disease, which may have hampered the calibration performance of the models. However, the CDW is continuously updated, and re-conducting this validation in the larger, expanded cohort may mitigate some of the issues related to selection bias. Such validations in large datasets have been advocated to ensure that developed prediction models are fit for use in all intended settings [37]. Finally, while previous studies have shown that physicians usually overestimate patients' mortality [38], there is limited evidence so far that prognostic models represent a superior solution when their performance in actual clinical practice is taken into consideration.

Conclusions
In conclusion, predictions arising from risk models applied to cohorts drawn from a different distribution of patient characteristics should not be adopted without appropriate validation. The variability in predicted outcomes documented in this analysis highlights the challenges of forecasting the course of a pandemic during its early stages [7]. To achieve more robust prediction, the focus should be placed on developing platforms that enable deployment of well-validated predictive models and prospective evaluation of their effectiveness. We are actively engaged in pursuing these objectives at the Department of Veterans Affairs.