Regional performance variation in external validation of four prediction models for severity of COVID-19 at hospital admission: An observational multi-centre cohort study

Background Prediction models should be externally validated to assess their performance before implementation. Several prediction models for coronavirus disease 2019 (COVID-19) have been published. This observational cohort study aimed to validate published models of severity for hospitalized patients with COVID-19 using clinical and laboratory predictors. Methods Prediction models fitting relevant inclusion criteria were chosen for validation. The outcome was either mortality or a composite outcome of mortality and ICU admission (severe disease). 1295 patients admitted with symptoms of COVID-19 at King's College Hospital (KCH) in London, United Kingdom, and 307 patients at Oslo University Hospital (OUH) in Oslo, Norway were included. The performance of the models was assessed in terms of discrimination and calibration. Results We identified two models for prediction of mortality (referred to as Xie and Zhang1) and two models for prediction of severe disease (Allenbach and Zhang2). The performance of the models was variable. For prediction of mortality, Xie had good discrimination at OUH with an area under the receiver-operating characteristic curve (AUROC) of 0.87 [95% confidence interval (CI) 0.79–0.95] and acceptable discrimination at KCH, AUROC 0.79 [0.76–0.82]. For prediction of severe disease, Allenbach had acceptable discrimination (OUH AUROC 0.81 [0.74–0.88] and KCH AUROC 0.72 [0.68–0.75]). The Zhang models had moderate to poor discrimination. Initial calibration was poor for all models but improved with recalibration. Conclusions The performance of the four prediction models was variable. The Xie model had the best discrimination for mortality, while the Allenbach model had acceptable results for prediction of severe disease.


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was discovered in Wuhan, China in December 2019. The virus was shown to cause viral pneumonia, later designated coronavirus disease 2019 (COVID-19) [1]. The disease has evolved into a pandemic with a large number of severe cases and high mortality [2]. Several biomarkers and clinical and epidemiological parameters have been associated with disease severity [3,4]. Practical tools for predicting prognosis in COVID-19 patients are still lacking in clinical practice [5,6]. We observed that many laboratory tests are ordered for patients with COVID-19 because of their predictive value. Very likely, the information from the different tests is partly redundant, and it may be possible to improve prediction with a multivariable model while reducing the number of redundantly ordered tests. Prediction models can be crucial for prioritizing patients for hospitalization, intensive care treatment, or future individualized therapy.
Since the onset of the pandemic, the number of prediction models for COVID-19 patients has been growing continuously [7]. Prediction models should be validated in different populations, with a sufficient number of patients reaching the outcome, before implementation [8–10]. A validation study of 22 prediction models at one site was recently published [6]. Interestingly, this study found that none of the models performed better than oxygen saturation alone, even though the performance at the original study sites was in most cases much better.
This study aimed to validate published prediction models of severity and mortality for hospitalized patients based on laboratory and clinical values in COVID-19 cohorts from London (United Kingdom) and Oslo (Norway).
The study is reported according to the guidelines in "Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis" (TRIPOD) [11] and has also followed recommendations from "Prediction Model Risk of Bias Assessment Tool" (PROBAST) [12].

Study design and participants
The study was performed as a retrospective validation study of adult patients hospitalized with COVID-19. Two cohorts were included: (1) Oslo University Hospital (OUH) in Norway and (2) King's College Hospital (KCH) in London, United Kingdom. The patients included were all adult inpatients testing positive for SARS-CoV-2 by real-time polymerase chain reaction (RT-PCR) with symptoms consistent with COVID-19 at admission. SARS-CoV-2-positive patients admitted for conditions not related to COVID-19, e.g. pregnancy-related conditions or trauma, were excluded. Patients referred from other hospitals were also excluded, as we did not have access to measurements from the first hospital admission.
OUH cohort. OUH is a large urban university hospital. Patients admitted between 6th March and 31st December 2020 were included. The OUH project protocol was approved by the Regional Ethical Committee of South East Norway (Reference 137045). All patients with confirmed COVID-19 were included in the quality registry "COVID19 OUS", approved by the data protection officer (Reference 20/08822). Informed consent was waived because of the strictly observational nature of the project. Demographics, clinical variables and hospital stay information were manually recorded in the registry and merged with laboratory results exported from the laboratory information system in Microsoft Excel.

PLOS ONE
KCH cohort. In the KCH cohort, patients were admitted between 23rd February and 1st May 2020 at two hospitals (King's College Hospital and Princess Royal University Hospital) of King's College Hospital NHS Foundation Trust (KCHFT) in South East London, UK.
Data (demographics, emergency department letters, discharge summaries, laboratory results, vital signs) were retrieved from components of the electronic health record (EHR) using a variety of natural language processing (NLP) informatics tools belonging to the CogStack ecosystem [13]. The study adhered to the principles of the UK Data Protection Act 2018, UK National Health Service (NHS) information governance requirements, and the Declaration of Helsinki. De-identified data from patients admitted to KCHFT were analysed under London SE Research Ethics Committee approval (reference 18/LO/2048) granted to the King's Electronic Records Research Interface (KERRI), which waives the consent requirements. Specific work on COVID-19 on the de-identified data was approved by the KERRI committee, which included patients and the Caldicott Guardian, in March 2020 and was reaffirmed in May 2020. Data from this cohort have been published in prior studies [14,15].

Selection of prediction models
A literature search was performed to select prediction models for validation. Published articles and preprint manuscripts were included until 29 May 2020. A structured search was performed in PubMed with the terms "COVID-19" and "prediction model" or "machine learning" or "prognosis model". Prediction models included in the review by Wynants et al. [7], published 7 April 2020, were also investigated, as was a Google Scholar search (18 May 2020) for articles/preprints citing Wynants et al.
The inclusion criteria for selection of multivariable prediction models were: (1) symptomatic hospitalized patients over 18 years with PCR-confirmed COVID-19; (2) outcomes including respiratory failure, intensive care unit (ICU) admission, death, or composite outcomes of these; (3) the models had to include at least one laboratory test, as we wanted to explore models that combined clinical and laboratory variables; and (4) all variables had to be available in the datasets and the model had to be described in adequate detail.

Missing values
Predictive variables were collected from the admission to the emergency department (ED). If not available in the ED, the first available values within 24 hours from hospital admission were used. Missing values (i.e. no recorded values within 24 hours) were imputed using both simple (k-nearest neighbors (KNN) and random forest) and multiple imputation (Bayesian ridge and Gaussian process) [16,17], using the multivariate imputation by chained equations method implemented in the Python function IterativeImputer available in the scikit-learn package, version 0.24.2 [18].
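To illustrate the simple imputation strategy, the k-nearest-neighbours idea can be sketched in pure Python (the study itself used the imputers in scikit-learn; the toy data and the distance over shared columns here are illustrative assumptions, not the study's code):

```python
import math

def knn_impute(rows, k=2):
    """Fill each None entry with the mean of that column over the k
    nearest rows that observe it; distance is computed over the
    columns both rows share (a pure-Python sketch of KNN imputation)."""
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return float("inf")
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    out = [list(r) for r in rows]  # leave the input untouched
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is None:
                donors = sorted(
                    (r for r in rows if r is not row and r[j] is not None),
                    key=lambda r: dist(row, r))[:k]
                vals = [r[j] for r in donors]
                out[i][j] = sum(vals) / len(vals)
    return out
```

The multiple-imputation variants (Bayesian ridge, Gaussian process) follow the same pattern but model each incomplete column conditionally on the others, as in scikit-learn's IterativeImputer.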

Statistical analyses and performance measurements for the prediction models
Univariate comparisons between patients with 'mild' versus 'severe' disease were carried out for continuous (Wilcoxon rank-sum test) and binary (χ2 test) measures. Severe disease was defined as transfer to ICU or in-hospital mortality.
Validation of the selected prediction models was assessed with discrimination and calibration, as recommended in TRIPOD [11]. Discrimination is the ability of the model to differentiate between those who do and do not experience the outcome. It is commonly estimated by the concordance index (c-index), which is identical to the area under the receiver-operating characteristic curve (AUROC) for models with binary endpoints. The discrimination of the models at OUH and KCH was also compared with the discrimination in the original development cohorts and with the external validation by Gupta et al. [6]. Calibration is the agreement between the observed outcomes and the outcome predictions from the model. It is preferably reported by a calibration plot, intercept and slope.
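For a binary endpoint the c-index reduces to a pairwise comparison, which a short sketch makes concrete (toy scores and labels, not the study data):

```python
def auroc(scores, labels):
    """Concordance index (c-index): the fraction of (case, non-case)
    pairs in which the case receives the higher predicted risk; ties
    count as half. Equals the AUROC for binary outcomes."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A model that ranks every case above every non-case scores 1.0; a model no better than chance scores 0.5.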
The mortality rate and the rate of 'poor outcome' varied between the cohorts. The models were therefore recalibrated by adjusting the intercept of the logistic regression models according to the frequency of outcomes at each study site [19]. Validation of the recalibration was not performed. All statistical analyses were conducted in Python 3.7 and R 3.4 [20].
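The intercept update can be written down directly: the intercept is shifted by the difference in outcome log-odds between the local and development cohorts. A minimal sketch (the prevalences and coefficients are illustrative, not those of the validated models):

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def predict_risk(intercept, coefs, x):
    """Risk from a logistic model: sigmoid(intercept + x . coefs)."""
    z = intercept + sum(c * xi for c, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

def recalibrate_intercept(intercept, prev_dev, prev_local):
    """Intercept-only updating: shift the intercept so average
    predicted risk tracks the local outcome frequency."""
    return intercept + logit(prev_local) - logit(prev_dev)
```

For example, moving a model from a development cohort with 25% mortality to a site with 10% lowers every predicted log-odds by logit(0.25) − logit(0.10) ≈ 1.1, leaving the coefficients (and hence discrimination) unchanged.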

Selection of prediction models
Four publications comprising five prediction models met our inclusion criteria [14,21–23]. The inclusion process is illustrated in Fig 1. However, since one of the models was developed at KCH and validated at OUH in a previous publication [14], only four models are presented here. The four models are referred to as 'Xie' [21], 'Zhang1', 'Zhang2' [22] and 'Allenbach' [23].
Information on the predictor variables and outcomes of the four models is summarized in Table 1.
All predictors were measured at hospital admission. Treatment of missing values in the development cohorts was not well described, and imputation methods were not mentioned. The Xie model had hospital mortality as the only outcome. Zhang presented two models with different outcomes: (1) mortality and (2) a composite outcome of mortality or 'poor outcome'. Poor outcome was defined as acute respiratory distress syndrome (ARDS), intubation or extracorporeal membrane oxygenation (ECMO) treatment, ICU admission or death. The Allenbach model used a composite outcome of transfer to ICU or mortality within 14 days of hospital admission. The original studies gave no details of the censoring date. For the OUH cohort, mortality during the hospital stay was used; for the KCH cohort, hospital mortality at the time of data collection.
All prediction models were based on multiple logistic regression and presented coefficients and intercepts for the different variables that enabled the calculation of risk prediction for our cohorts. Allenbach additionally provided an 8-point scoring system derived from the logistic regression model. However, we chose to use the regression model for calculation as this retains as much information as possible.

Description of the cohorts
Patient characteristics for the three development cohorts and the KCH and OUH cohorts are shown in S1 Table. Since the models use different outcomes and timeframes, the number of patients included in each validation is not the same. An overview of missing values is presented in Table 2. Missing values were imputed via simple and multiple imputation [17]. Preliminary analyses showed no differences between AUROCs calculated with the different imputation methods (see S2 Table). Thus, the simple imputation method k-nearest neighbors was used for the rest of this paper. At KCH the number of missing values was very high for LDH (87.8%) and relatively high for SpO2 (33.3%) and the WHO scale (33.8%).
The OUH cohort consisted of 307 patients and the KCH cohort of 1295 patients (S1 Fig). In the OUH cohort the median age was 60 years with 57% males, while in the KCH cohort the median age was 69 years with 59% males. In the OUH cohort, 32 patients (10.4%) died in hospital, while in the KCH cohort 333 (26.8%) had died in hospital by the time of data collection. For the composite outcome of death or ICU transfer, the number of patients with the outcome was 66 (21.5%) at OUH and 419 (33.7%) at KCH.
The percentage of patients with hypertension and diabetes was higher in the KCH cohort (54% and 35%, respectively) than in the OUH cohort (34% and 21%, respectively). The patients at KCH also had higher levels of CRP, creatinine, LDH, and possibly a lower number of lymphocytes than the OUH patients; all of which are known predictors for severe COVID-19.
In Table 3, univariate associations are presented for the mild/moderate and severe groups in the KCH and OUH cohorts. In general, the same variables were predictive of severe disease at KCH and OUH, except for ischemic heart disease, temperature and platelets, which were associated with severe disease at OUH but not at KCH.

Performance of the prediction models
The validation of the four prediction models in the OUH and KCH cohorts is presented in terms of discrimination (AUROC) and calibration (slope and intercept) in Table 2 and Figs 2 and 3, respectively. For the models predicting mortality, the Xie model had the highest AUROC in both the KCH cohort (0.79; 95% CI 0.76-0.82) and the OUH cohort (0.87; 95% CI 0.79-0.95). The Zhang1 model had a lower AUROC at both KCH (0.64; 95% CI 0.60-0.68) and OUH (0.72; 95% CI 0.62-0.82).
For 'severe disease', discrimination was highest in the Allenbach model with AUROCs 0.72 (95% CI 0.68-0.75) for KCH and 0.81 (95% CI 0.74-0.88) for OUH. For the Zhang2 model, the AUROC was 0.67 (95% CI 0.64-0.70) for KCH and 0.77 (95% CI 0.70-0.84) for OUH. For the Xie and Allenbach models, discrimination at OUH was similar to the development cohorts (Fig 2). We compared the AUROCs between the KCH and OUH cohorts using the bootstrap method implemented in the pROC R package [24]. The results indicated that there was a statistically significant difference in the AUROCs between KCH and OUH for the Xie model (p = 0.01), Allenbach model (p = 0.009), and the Zhang2 model (p = 0.007), but not for the Zhang1 model (p = 0.140).
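A stratified-bootstrap comparison of two independent AUROCs, in the spirit of pROC's roc.test, can be sketched as follows (a toy implementation under stated assumptions: resampling within each outcome class, and a z-statistic against the bootstrap SD of the difference; not the study's analysis code):

```python
import math
import random

def auroc(scores, labels):
    """c-index for a binary outcome (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def resample(scores, labels, rng):
    """Stratified bootstrap: resample cases and non-cases separately
    so every replicate keeps both classes."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    rpos = [rng.choice(pos) for _ in pos]
    rneg = [rng.choice(neg) for _ in neg]
    return rpos + rneg, [1] * len(rpos) + [0] * len(rneg)

def auroc_diff_p(s1, y1, s2, y2, n_boot=2000, seed=1):
    """Two-sided p-value for AUROC(cohort 1) != AUROC(cohort 2)."""
    rng = random.Random(seed)
    obs = auroc(s1, y1) - auroc(s2, y2)
    diffs = []
    for _ in range(n_boot):
        a1, b1 = resample(s1, y1, rng)
        a2, b2 = resample(s2, y2, rng)
        diffs.append(auroc(a1, b1) - auroc(a2, b2))
    mean = sum(diffs) / n_boot
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n_boot - 1))
    z = obs / sd if sd > 0 else float("inf")
    # two-sided tail of the standard normal via erf
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2))))
```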
The calibration plots are shown in Fig 3 (after recalibration). S3 Fig shows the calibration results before and after recalibration for the Xie and Allenbach models. Recalibration will not render models with poor discrimination more useful. Thus, we focused on the recalibration of the Xie and Allenbach models as these had the best discrimination. Recalibration improved the predictions for both the Xie and Allenbach models at OUH and the Xie model at KCH, and the slope and intercept were acceptable for both models at both hospitals after recalibration.
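The calibration intercept and slope are, in essence, the coefficients of a logistic regression of the observed outcomes on the logit of the predicted risks (intercept near 0 and slope near 1 indicate good calibration). A small sketch using plain gradient ascent (an illustrative assumption for the fitting method, not the study's estimation code):

```python
import math

def calibration_slope_intercept(probs, labels, lr=0.1, steps=5000):
    """Estimate calibration intercept a and slope b by fitting
    y ~ sigmoid(a + b * logit(p)) with plain gradient ascent on the
    log-likelihood (a sketch; well-calibrated predictions give
    a close to 0 and b close to 1)."""
    lp = [math.log(p / (1.0 - p)) for p in probs]  # logit of predictions
    a, b = 0.0, 1.0
    n = len(lp)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(lp, labels):
            pred = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - pred
            grad_b += (y - pred) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b
```

In practice this fit is done with a standard logistic-regression routine; the sketch only shows which quantities the reported intercept and slope summarize.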
Continuous variables are presented as median [IQR] and categorical variables as number (percent). P-values are calculated with the Pearson χ2 test for categorical variables and with the Wilcoxon rank-sum test for continuous variables.

Discussion
In this study, we validated four prediction models for prognosis in hospitalized COVID-19 patients from London, UK and Oslo, Norway. We found varying performance of the models in the two cohorts. The models performed better in the OUH cohort, with discrimination similar to the original studies. The Xie and Allenbach models had the best performance for prediction of death and severe disease, respectively. Initial calibration was poor for all models but improved after recalibration of the intercept according to the frequency of the outcome in our cohorts. This improves the accuracy of the prediction for each patient without affecting discrimination and is recommended in several publications [5,11,19]. Local, or possibly regional/national, recalibration is likely to be important for COVID-19 prediction models, since there is large variation in the frequency of severe disease and death between studies. Ideally, however, the local recalibration should also be tested for optimism using methods for internal validation such as bootstrapping or cross-validation.
In some cases, we found poorer discrimination in the validation cohorts than in the development cohorts. This is consistent with past evidence showing that discrimination in development cohorts is better than at external validation, owing to overfitting and differences in cohort characteristics [25]. The cohorts in the original studies and at KCH and OUH differed in many respects, such as mortality, age and the frequencies of severe disease and comorbidities. The UK and Norway differ in the structures of their healthcare systems, and the incidence of COVID-19 has been far higher in the UK. These factors may have affected the selection of patients for hospital and ICU admission, which might have resulted in a more homogeneous patient population with regard to severity at KCH. Discrimination is expected to be poorer when the population is more homogeneous.

Fig 2. AUROCs from validation of the four models in the KCH and OUH cohorts, together with the original AUROCs from the development cohorts [21–23] and the results from the external validation of the Xie and Zhang models by Gupta et al. [6]. Lines represent the 95% CIs of the AUROCs. Of the development cohorts, only Xie reported confidence intervals. https://doi.org/10.1371/journal.pone.0255748.g002
The findings underline the importance of validation at several external sites. This is particularly true for a new disease like COVID-19, with rapidly developing treatment guidelines, and with an overwhelming effect on healthcare resources in some locations, but not at others.
The Xie model had the best results of the four models. The differences in the performance of the prediction models might have several reasons. Firstly, the predictors used in one model might have better predictive value than those used in others. SaO2, which is included in the Xie model, is a strong clinical indicator of disease severity and often indicates a need for ICU transfer. Secondly, there might be weaknesses in the models, as bias is common in prediction models [12]. To date, only the Allenbach study has been published in a peer-reviewed journal, while Xie and Zhang are preprints. Thirdly, criteria for ICU admittance might vary across sites. The fact that we and other studies generally find better discrimination for mortality than for severe disease (often defined by ICU admittance) supports this hypothesis. For instance, patients with short life expectancy will often not be admitted to the ICU, but instead given oxygen therapy in a hospital ward and transferred to nursing homes for palliative care. Such patients do not fulfil the criteria for severe disease, yet often have predictors indicating severe disease at admission.
Many prediction models have been published, but few have been systematically validated [26]. To our knowledge, only one other study to date has validated COVID-19 prediction models: Gupta et al. recently validated 22 prognostic models [6], including the Xie and Zhang models. For the OUH cohort, we found substantially better discrimination for the Xie and Allenbach models for the prediction of mortality and severe disease, respectively. The performance of the models at KCH was more similar to the results in the Gupta study, which was also performed at a London hospital. The rates of severe disease and mortality and the characteristics of the London cohorts are quite similar, which might explain the similar performance at these two sites.
Several other prediction models have been published recently, such as models based on NEWS2 or the ISARIC model [14,27]. The AUROCs of these models are in the range of 0.75 to 0.80, which is not a substantial improvement over single univariate predictors of severity. Thus, the finding that the Xie and Allenbach models perform well both at the original study sites and in our OUH validation cohort might indicate that it is possible to achieve higher AUROCs with relatively simple prediction models.
Our study has several strengths. Validation was performed at two sites in different countries with consistent inclusion and exclusion criteria. We included all eligible patients admitted to hospital during the study period, so the cohorts should be representative of the study sites. Moreover, the study was conducted and reported according to the TRIPOD guidelines. However, there are also some weaknesses. Firstly, the OUH cohort is not very large, with relatively few patients meeting the outcomes. Some publications recommend including at least 100 patients with the relevant outcome [10]; however, studies with lower numbers may still contain useful information. Furthermore, the KCH cohort is probably one of the largest cohorts analysed in prediction models for severe COVID-19 in hospital. Secondly, Gupta et al. included 22 models in their validation study, while we ended our inclusion of models in May 2020 and included only four models. While it could be interesting to include more models, we think the results for the Xie and Allenbach models at OUH indicate that further studies of these models would be worthwhile. Thirdly, there was a relatively high number of missing values for LDH and SpO2 at KCH. It is uncertain how much this affected the results. Both are included in the Xie model, and SpO2 is a strong predictor of mortality, while LDH is probably a weaker predictor [6]. The number of missing values at OUH was low and probably did not affect the validation.
In conclusion, following the TRIPOD guidelines, our study validated previously developed models for prediction of prognosis in COVID-19 and showed that these models have variable performance in different cohorts. The Xie and Allenbach models clearly had the best performance, and we suggest that these models be included in future studies of COVID-19 prediction models. However, the performance of these models was not similar at our two validation sites, which underlines the importance of externally validating prediction models at several study sites before their implementation in clinical practice.
Supporting information S1