Training and testing of a gradient boosted machine learning model to predict adverse outcome in patients presenting to emergency departments with suspected covid-19 infection in a middle-income setting

COVID-19 infection rates remain high in South Africa. Clinical prediction models may support rapid triage and clinical decision making for patients with suspected COVID-19 infection. The Western Cape, South Africa, has integrated electronic health care data, facilitating large-scale linked routine datasets. The aim of this study was to develop a machine learning model to predict adverse outcome in patients presenting with suspected COVID-19, suitable for use in a middle-income setting. A retrospective cohort study was conducted using linked, routine data from patients presenting with suspected COVID-19 infection to public-sector emergency departments (EDs) in the Western Cape, South Africa, between 27th August 2020 and 31st October 2021. The primary outcome was death or critical care admission at 30 days. An XGBoost machine learning model was trained and internally tested using split-sample validation. External validation was performed in three test cohorts: Western Cape patients presenting during the Omicron COVID-19 wave, a UK cohort during the ancestral COVID-19 wave, and a Sudanese cohort during the ancestral and Eta waves. A total of 282,051 cases were included in a complete case training dataset. The prevalence of 30-day adverse outcome was 4.0%. The most important features for predicting adverse outcome were the requirement for supplemental oxygen, peripheral oxygen saturations, level of consciousness, and age. Internal validation using split-sample test data revealed excellent discrimination (C-statistic 0.91, 95% CI 0.90 to 0.91) and calibration (CITL of 1.05). The model achieved C-statistics of 0.84 (95% CI 0.84 to 0.85), 0.72 (95% CI 0.71 to 0.73), and 0.62 (95% CI 0.59 to 0.65) in the Omicron, UK, and Sudanese test cohorts respectively. Results were materially unchanged in sensitivity analyses examining missing data.
An XGBoost machine learning model achieved good discrimination and calibration in predicting adverse outcome in patients presenting with suspected COVID-19 to Western Cape EDs. Performance was reduced in temporal and geographical external validation.


Introduction
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged in December 2019 and subsequently spread globally, causing the coronavirus disease 2019 (COVID-19) pandemic.[1] To date, South Africa has experienced four distinct pandemic waves caused by the ancestral Wuhan SARS-CoV-2 strain and subsequent evolutionary variants (Alpha and Beta, Delta, and Omicron).[2,3] Although much reduced from earlier in the pandemic, infection rates remain high, with approximately 3,000 confirmed cases recorded per week across South Africa in Autumn 2022.[4] The morbidity and mortality of COVID-19 infection have been attenuated by vaccination, development of natural immunity, and the evolution of less pathogenic variants.[5] However, emergency health care systems in middle-income settings, such as South Africa, remain vulnerable to being overwhelmed due to low vaccine coverage (35% fully vaccinated in October 2022) and the emergence of future severe COVID-19 variants.[6] Moreover, healthcare within South African emergency systems may be delivered by less experienced clinicians, with restricted access to laboratory or radiological investigations.[7] Clinical prediction models could help risk-stratify patients presenting to emergency departments (EDs) with suspected COVID-19 and support clinical decision making around triage and management for individual patients. Existing models, such as the COVID-specific Pandemic Respiratory Infection Emergency System Triage (PRIEST) score, were developed in high-income settings and may not be generalisable or applicable to less well-resourced settings.[8] Machine learning is where data are provided to a computer algorithm to produce a mathematical model for prediction of future outcomes.[9] Machine learning algorithms may have advantages over traditional statistical prediction models, such as logistic regression, in 'big data' settings, in high signal-to-noise scenarios, where high-order interactions exist between model
inputs, and if continuous model inputs are non-linear.[9] The Western Cape of South Africa has recently developed integration of electronic health care data across prehospital, ED, laboratory, and public health systems. Linkage of these routine data sources provides a unique opportunity to produce a very large study sample amenable to machine learning approaches.
The aim of this study was therefore to develop a model to predict adverse outcome in patients presenting with suspected COVID-19, suitable for use in a middle-income setting. Specific objectives were to train a gradient boosted machine learning model using data from Western Cape EDs, explore which clinical features were most predictive of adverse outcome, and test the model's discrimination and calibration in external validation.

Study design
A retrospective cohort study was conducted to train and test a machine learning model, using previously collected routine electronic data, to predict adverse outcomes in patients with suspected COVID-19 in a middle-income setting. The study was conducted and reported in accordance with relevant expert guidelines: Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD),[10] Reporting of studies Conducted using Observational Routinely collected Data (RECORD),[11] and DOME: recommendations for supervised machine learning validation in biology.[12]

Setting and study populations
The source population for model training (derivation) was patients aged over 16 years presenting with suspected COVID-19 infection to public-sector EDs in the Western Cape, South Africa during the Alpha, Beta and Delta waves of the COVID-19 pandemic (27th August 2020 to 31st October 2021).[3] The subsequent study population comprised consecutive patients presenting to seven hospital EDs contributing to the Hospital Emergency Centre Triage and Information System (HECTIS) data repository. Participating hospitals were from the urban Cape Town metropole district and a single large peri-rural hospital. Patients were included where an ED clinical impression of suspected, or confirmed, COVID-19 infection had been recorded.
Three additional populations were studied for model testing (external validation). A temporal validation sample consisted of Western Cape patients who presented to participating hospitals after the emergence of the Omicron COVID-19 variant (1st November 2021 to 11th March 2022).[3] A geographical validation sample was derived from the PRIEST mixed prospective and retrospective cohort study, which collected data from 70 EDs across 53 sites in the UK during the initial COVID-19 ancestral wave, between 26th March and 28th May 2020.[8] A second geographical validation was performed in a retrospective cohort of Sudanese patients presenting to two government referral hospitals in Sudan's most populous region, Khartoum State, between January 2020 and 14th December 2021.[13] This period corresponded to two distinct COVID-19 waves: an initial ancestral wave, and a later Eta variant wave. A base case complete case analysis was performed, excluding all cases with any missing values from the training or test data sets.

Data collection and preparation
The Western Cape data sets were produced by linking information from routinely collected public data sources: the ED HECTIS system, National Health Laboratory Services, death certification, and other Western Cape Public Health data sources. Deterministic matching, based on unique patient hospital numbers, was performed by the Western Cape Provincial Health Data Centre (PHDC). The final data set comprised patient demographics, ED clinical details, COVID-19 status, hospital and critical care admissions, and death during the index COVID-19 encounter. For patients with multiple ED attendances, data were extracted for the first ED attendance and outcomes were assessed up to 30 days from index attendance. Data collection for the UK PRIEST and Sudanese cohort studies has been described in detail in previous publications.[8,13] Anonymised versions of the PRIEST and Sudan study data were used to derive the geographical external validation cohorts.
Where no comorbidities were recorded, they were assumed not to be present. Implausible physiological variables were set as missing, including systolic blood pressure <50 mmHg, temperature >42 or <25 degrees Celsius, heart rate <10/minute, peripheral oxygen saturation <10% and respiratory rate of 0/minute. List-wise deletion of cases with missing feature data was performed for a complete case base case analysis.
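The implausible-value rules above can be sketched as follows (a minimal illustration using NumPy; the function and variable names are ours, not from the study code):

```python
import numpy as np

def clean_vitals(sbp, temp, hr, spo2, rr):
    """Set physiologically implausible vital signs to NaN (missing),
    using the thresholds described in the study."""
    sbp = np.where(sbp < 50, np.nan, sbp.astype(float))
    temp = np.where((temp > 42) | (temp < 25), np.nan, temp.astype(float))
    hr = np.where(hr < 10, np.nan, hr.astype(float))
    spo2 = np.where(spo2 < 10, np.nan, spo2.astype(float))
    rr = np.where(rr == 0, np.nan, rr.astype(float))
    return sbp, temp, hr, spo2, rr

# Example: the second case has implausible values in every field
sbp, temp, hr, spo2, rr = clean_vitals(
    np.array([120, 40]), np.array([37.0, 24.0]), np.array([80, 5]),
    np.array([98, 5]), np.array([18, 0]),
)
```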

Features and feature engineering
Candidate features were selected a priori on the basis of a previous systematic review of COVID-19 outcome predictors suitable for use in lower- and middle-income countries,[14] previous research, expert opinion within the research team, and availability at ED triage in the Western Cape.[8,15,16] The final features considered were: age, sex, presenting symptoms (cough or fever), co-morbidities (heart disease, diabetes, immunosuppression (including HIV), asthma, chronic obstructive pulmonary disease, other chronic respiratory disease, hypertension or pregnancy), first ED recorded physiological parameters (respiratory rate, pulse rate, systolic blood pressure, level of consciousness, peripheral oxygen saturations), and requirement for supplemental oxygen in the ED. Comorbidities were one-hot encoded, with asthma, chronic obstructive pulmonary disease, and other chronic respiratory disease grouped into a single feature. Level of consciousness, recorded using the ACVPU scale in Western Cape data, was pre-processed to a numeric AVPU scale (confusion grouped with verbal) to ensure consistency across data sets. Continuous physiological features were not transformed, and no other feature engineering was performed. No feature selection was performed prior to machine learning. All features were available for the Western Cape and PRIEST data. A restricted range of features was available in the Sudanese data, with no information available for temperature, immunosuppression, or level of consciousness.
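As an illustration of the consciousness pre-processing, the ACVPU-to-AVPU grouping could be encoded as below. The numeric codes and dictionary are hypothetical; only the grouping of 'confused' with 'verbal' comes from the text:

```python
# Hypothetical numeric encoding of the ACVPU scale onto a 4-level AVPU
# scale; 'confused' is grouped with 'verbal' as described in the text.
ACVPU_TO_AVPU = {
    "alert": 0,
    "confused": 1,  # grouped with 'verbal' per the pre-processing described
    "verbal": 1,
    "pain": 2,
    "unresponsive": 3,
}

def encode_consciousness(level: str) -> int:
    """Map a recorded ACVPU level to its numeric AVPU code."""
    return ACVPU_TO_AVPU[level.lower()]
```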

Label
The label was a composite, binary, adverse outcome of either intubation or non-invasive ventilation in the ED on index attendance, Intensive Care Unit (ICU) admission, or inpatient death up to 30 days from index attendance. This was comparable to the PRIEST study primary outcome (used in the geographical external validation sample) of death or organ support (respiratory, cardiovascular, or renal) at 30 days.[8] The outcome in the Sudanese data was intubation or non-invasive ventilation in the ED, High Dependency/Intensive Care Unit (HDU/ICU) admission, or inpatient death. The model aimed to provide both good discrimination and well calibrated predictions of adverse outcome, rather than label classification per se; methods to address class imbalance were therefore not applied.[17]
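A minimal sketch of the composite label, assuming hypothetical field names for the component outcomes:

```python
def adverse_outcome(ventilated_in_ed: bool, icu_admission_30d: bool,
                    inpatient_death_30d: bool) -> int:
    """Composite binary label: intubation or non-invasive ventilation in
    the ED, ICU admission, or inpatient death within 30 days of index
    attendance. Argument names are illustrative placeholders."""
    return int(ventilated_in_ed or icu_admission_30d or inpatient_death_30d)
```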

Model training
Supervised machine learning was performed using an ensemble, decision tree-based, gradient boosting algorithm, implemented using the XGBoost framework.[18] This approach was chosen over alternative algorithms due to its scalability to large datasets, flexibility in capturing non-linear relationships and interactions, and favourable predictive performance compared to other algorithms.[19] Training was initially performed using default parameters, with regularization and early stopping rules defined to reduce variance and mitigate over-fitting. Key hyperparameters (number of decision trees, learning rate, and tree depth) were tuned using 5-fold cross validation of the training data, across a manually selected range of parameter values, aiming to optimise model discrimination. To facilitate out-of-sample prediction, conservative hyperparameter values were favoured in the absence of significant gains in model performance.
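The study implemented this in XGBoost (R and Python). As a hedged illustration of the tuning strategy described (5-fold cross-validation over tree number, learning rate, and depth, with early stopping, optimising discrimination), the sketch below uses scikit-learn's GradientBoostingClassifier as a stand-in on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (the study used ~282,000 ED cases
# with ~4% outcome prevalence; values here are illustrative only).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95], random_state=0)

# Tune the hyperparameters named in the text -- number of trees, learning
# rate, and tree depth -- by 5-fold CV, optimising discrimination (AUC).
param_grid = {
    "n_estimators": [100, 200],  # number of boosting rounds (trees)
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
base = GradientBoostingClassifier(
    n_iter_no_change=10, validation_fraction=0.1,  # early stopping
    random_state=0,
)
search = GridSearchCV(base, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_)
```

Favouring the more conservative of two near-equivalent parameter sets, as the text describes, would be applied after inspecting `search.cv_results_`.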
The relative importance of each feature in the base case model was evaluated by calculating the average training loss reduction gained across all occasions it was used for decision tree splitting. A waterfall chart, showing the additive contribution of each individual feature to label prediction, was constructed for illustrative cases to aid model interpretation. Machine learning was carried out with the xgboost library in R 4.1.2 (R Core Team, 2021) using the RStudio interface.[20] The DALExtra R package was used to construct waterfall charts. Models were independently developed using the same methods by a second data analyst in Python (Python Software Foundation, version 3.8.8) using the scikit-learn package (version 0.24.1).

Model testing
Internal validation was performed using a random 80:20 train-test split. The model was then re-trained on the whole training dataset, and apparent validation assessed. External validation was evaluated by application of the final model to the Western Cape Omicron period, UK PRIEST, and Sudanese external validation cohorts, with adverse outcome probability calculated for each case. Model discrimination was assessed through receiver-operating characteristic (ROC) curves and calculation of the area under the ROC curve (C-statistic).[21] Calibration in the large was evaluated by comparing the average predicted risk to the average observed risk. Calibration plots were constructed to compare predicted to observed risks across deciles of predicted outcome probability. Calibration plot slope and intercept (weak calibration) and Locally Weighted Scatterplot Smoothing (LOWESS, moderate calibration) were also evaluated.[22] Diagnostic parameters (accuracy, precision, recall, negative predictive value, and specificity), and the proportion of cases with adverse outcome, were calculated and presented graphically for different model probability thresholds to inform clinical management decisions.[23] Discrimination was computed using the pROC library in R 4.1.2 (R Core Team, 2021). Calibration metrics and plots were calculated using the pmcalplot package.
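The two headline metrics can be illustrated as follows. The C-statistic is the area under the ROC curve; calibration-in-the-large is taken here as the ratio of mean predicted to mean observed risk, one reading of the description above (values above 1 indicating over-prediction) -- an assumption, not the study's exact code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def c_statistic(y_true, y_prob):
    """Area under the ROC curve (discrimination)."""
    return roc_auc_score(y_true, y_prob)

def citl_ratio(y_true, y_prob):
    """Calibration-in-the-large, taken here as mean predicted risk
    divided by mean observed risk; > 1 indicates over-prediction."""
    return float(np.mean(y_prob) / np.mean(y_true))
```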

Secondary analyses
Case-wise and variable-wise missing data patterns were examined and the influence of missing data was explored using three approaches: deterministic imputation (single imputation within normal ranges defined by the South African Triage Early Warning Score (TEWS)),[24] multiple imputation (data assumed to be missing at random, chained equations, 5 imputations, model predictions averaged across datasets),[25] and the in-built XGBoost missing data algorithm (based on surrogate decision tree splits).[18] For simplicity and increased usability, an alternative model was also developed using categorised physiological variables, engineering features according to thresholds used by the TEWS score in complete case data.[24] To provide an indication of likely performance in future waves, irrespective of vaccination prevalence and variant dominance, an additional model was trained on Western Cape data across the Alpha, Delta, and Omicron waves. A random 80:20 train/test split of the pooled data was used, with modelling and validation otherwise proceeding as described for the base case model.
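A sketch of the deterministic (single) imputation approach; the placeholder 'normal' values below are illustrative only and are NOT the actual TEWS ranges:

```python
import numpy as np

# Illustrative single ("deterministic") imputation: replace missing vital
# signs with a fixed value inside a plausible normal range. These
# placeholder normals are illustrative, not the published TEWS thresholds.
NORMAL_PLACEHOLDERS = {"resp_rate": 16.0, "pulse": 75.0,
                       "sbp": 120.0, "spo2": 98.0}

def deterministic_impute(values, feature):
    """Replace NaN entries of one feature with its placeholder normal."""
    values = np.asarray(values, dtype=float)
    return np.where(np.isnan(values), NORMAL_PLACEHOLDERS[feature], values)
```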

Sample size
The training sample size was fixed based on a census sample of patients in the Western Cape recorded on HECTIS during the study period. There were 282,051 patients in this cohort, with over 100 outcomes per model parameter. Test sample size was also fixed, based on the size of the Western Cape Omicron wave, PRIEST, and Sudanese data sets. However, assuming an outcome prevalence of 10% and a C-statistic of 0.75, external validation would require only 6,420 cases to provide measurement of the area under the receiver operating characteristic curve with a standard error of 0.01.[26]
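One common way to reproduce this kind of sample-size reasoning is the Hanley-McNeil approximation for the standard error of the AUC (an assumption on our part; the paper does not state which method its cited calculation used):

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of the AUC (Hanley & McNeil, 1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return math.sqrt(var)

# 6,420 cases at 10% outcome prevalence and an assumed AUC of 0.75
se = hanley_mcneil_se(0.75, n_pos=642, n_neg=5778)
print(round(se, 3))  # on the order of 0.01
```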

Patient and Public Involvement (PPI)
A community advisory board comprising eight community members affected by COVID-19 (infected themselves or immediate family infected/hospitalised) was purposively recruited by an experienced community liaison officer to achieve representation across the Western Cape population. Through several meetings, the community advisory board were able to influence study planning and conduct. Implementation of machine learning and algorithmic (un)fairness were considered to ensure acceptability and avoid minority group disparities.

Study sample
A total of 305,564 patients aged over 16 years presented to participating Western Cape hospitals with suspected COVID-19 during the Alpha, Beta, and Delta waves between 27th August 2020 and 31st October 2021.

Trained model
The base case XGBoost model hyperparameters following tuning are presented in Table 2. The most important features for predicting adverse outcome in the final model were the requirement for supplemental oxygen, peripheral saturations, level of consciousness, and age (Figs 1 and 2). Apparent validation revealed excellent discrimination (C-statistic 0.91, 95% CI 0.90 to 0.91). This was unchanged on internal validation using a split train-test sample (C-statistic 0.91, 95% CI 0.90 to 0.91, Fig 3).

Secondary analyses
Case-wise and variable-wise missing data patterns are presented in the supplementary materials for training and testing data (S4-S10 Figs). Results were not substantively changed in secondary analyses exploring different missing data mechanisms. On internal validation, C-statistics ranged from 0.891 to 0.892 across deterministic, surrogate split, and multiple imputation analyses, and calibration plots were not significantly changed from the complete data base case analysis. Discrimination and calibration metrics were similarly unchanged in missing data secondary analyses in the Western Cape Omicron, UK PRIEST, and Sudanese test cohorts (S11-S13 Figs).

Summary of results
An XGBoost machine learning model was trained in patients with suspected COVID-19 presenting to Western Cape public hospitals during the Alpha/Beta/Delta pandemic waves. The most important features for predicting adverse outcome were the requirement for supplemental oxygen, peripheral oxygen saturations, level of consciousness, and age. Internal validation using a split-test sample revealed excellent discrimination (C-statistic 0.91, 95% CI 0.90 to 0.91) and calibration (CITL of 1.05). The model achieved C-statistics of 0.84 (95% CI 0.84 to 0.85), 0.72 (95% CI 0.71 to 0.73), and 0.62 (95% CI 0.59 to 0.65) in the Omicron wave, UK, and Sudanese external validation cohorts respectively.

Interpretation
Clinical prediction models rarely out-perform the clinical judgement of experienced clinicians.[27,28] Western Cape EDs demonstrated excellent diagnostic performance for management of COVID-19 during the study period, admitting only 14.7% of patients as inpatients, with a risk of false negative triage of around 1%.[29] Based on internal validation results, a model probability cut point of 8% predicted risk would result in a similar admission rate of 13.1% and a negative predictive value of 98.7%. Despite the favourable accuracy of clinical judgement at this threshold, sensitivity will fall, and specificity will increase, at lower prevalence.[30] The high proportion of adverse outcome in the Sudanese test cohort may reflect differing patient demographics, increased SARS-CoV-2 virulence, or inclusion of more severe cases from tertiary referral hospitals. However, the primary reasons for lack of model transportability to non-Western Cape settings are likely to be differences in population characteristics and variation in measurement of features and labels across data sets.
The model label of adverse outcome from death or ICU admission/organ support has face validity for guiding clinical management decisions in the ED. However, it is important to note that the model may predict differentially across the individual outcomes comprising the composite endpoint.[31] Differences in the relative proportions of death and organ support across test cohorts could therefore explain differential model performance. The previously published PRIEST score demonstrated better prediction for death than for critical care requirement,[8] implying that the current model should be restricted to determining hospital admission only, rather than guiding escalation of care decisions. Additionally, caution may be required when interpreting model outputs in advanced age, where the underlying hazard of mortality may dominate prediction.
Calibration drift occurs when deploying models in non-stationary clinical scenarios,[32] where differences arise over time between the training population and the test populations to which the model is applied. Predicting outcome in patients with suspected COVID-19 is a highly dynamic situation, with multiple potential changes in data collection, patient case mix, vaccination coverage, and clinical decision-making. Adaptive mutations in the SARS-CoV-2 genome can alter the virus's pathogenic potential, influencing transmissibility, virulence, and vaccination effectiveness.[33] The base case model, trained using Alpha/Beta/Delta data, retained favourable discrimination in patients with the Omicron variant, but systematically overestimated individual risk of adverse outcome. Updated model training and re-calibration will likely be required as the COVID-19 pandemic evolves and future variants of concern emerge in South Africa.

Comparison to other literature
There is a paucity of research investigating risk-stratification scores for patients presenting to EDs with suspected COVID-19 in middle-income settings. The Nutri-CoV score was developed and internally validated in Mexico using data from the ancestral Wuhan strain of the pandemic.[34] A smaller number of features were included in that model (age, comorbidities, peripheral oxygen saturations, respiratory rate, pneumonia), with lower discrimination for adverse outcome than the current study (C-statistic 0.797). That study's sample of RT-PCR confirmed cases may limit generalisability to undifferentiated ED patients. Moreover, the inclusion of a diagnosis of pneumonia in the score, requiring imaging or advanced clinical skills, may reduce its relevance in less well-resourced settings.
Our research group has previously developed similar risk-stratification scores using statistical modelling rather than machine learning. Multivariable logistic regression with the Least Absolute Shrinkage and Selection Operator (LASSO) and fractional polynomials achieved a slightly lower C-statistic of 0.87 (95% CI 0.866 to 0.874) and CITL of -0.017 (95% CI -0.043 to 0.009) on internal validation in Western Cape Alpha/Beta/Delta data, compared to the machine learning model reported here.[35] Compared with the machine learning model, discrimination on external validation was lower in the Omicron (C-statistic 0.79, 95% CI 0.79 to 0.80) and Sudanese data (C-statistic 0.53, 95% CI 0.53 to 0.54), but higher in the UK PRIEST test data (C-statistic 0.79, 95% CI 0.79 to 0.80). Machine learning involves a trade-off between bias and variance, and despite the large sample size and use of regularization, there is a risk of over-fitting to the Western Cape data, which may explain the differential performance across test cohorts compared to a more conservative statistical modelling strategy.[36]

Limitations
This study has several strengths, including large sample size, adherence to established prediction modelling principles, external validation, assessment of calibration, and exploration of model interpretability.[10] However, there are potential limitations. There is a risk of selection bias from incomplete identification of patients with suspected COVID-19 infection, inaccurate linking of health records, and incomplete outcome ascertainment.[37] Furthermore, list-wise deletion of missing data, and multiple imputation, could result in systematic error if data are not missing completely at random, or missing at random, respectively.[38] The use of routine data, not primarily intended for research purposes, could also result in information bias from measurement error and incomplete ascertainment of deaths.[11]

Generalisability
The external validity of any COVID-19 prediction model will depend on the circulating COVID-19 variant, population vaccination status, clinical setting, and underlying patient demographics. The model's inclusion of basic patient characteristics and vital signs should help ensure it is transportable to other middle-income settings. However, the requirement for pulse oximetry may limit application in lower-income settings. South Africa is an upper middle-income country, with large wealth disparities, a mixed state-private health economy, and a high prevalence of HIV.[39] Generalisability of model predictions to other middle-income settings with differing characteristics therefore requires caution.

Clinical and research implications
Despite increasing numbers of published machine learning models, very few are implemented into clinical practice.[40] Furthermore, there is little experience of deploying machine learning models into clinical practice outside high-income settings.[41] Independent external validation, impact studies, and qualitative work to explore acceptability are therefore recommended prior to any introduction of the current model into clinical use. Western Cape emergency medical services use an electronic patient record with the functionality to incorporate electronic decision support, potentially facilitating translation into clinical practice. Model operationalisation as a smartphone application is an alternative strategy that could aid usability once regulatory requirements are met. The 'black box' nature of artificial intelligence, with a patient-level prediction provided without any explanation or rationale, is a major barrier to uptake of machine learning models.[9] Using explainable machine learning tools, such as waterfall charts, could help with future model implementation.[42]

Conclusions
An XGBoost machine learning model was trained and achieved good discrimination and calibration in prediction of adverse outcome in patients presenting to Western Cape EDs with suspected COVID-19 infection. Performance was reduced in temporal and geographical external validation. Independent external validation, impact studies, and qualitative work to explore acceptability are recommended prior to any introduction of the current model into clinical use. Updated model training and re-calibration will likely be required as the COVID-19 pandemic evolves and future variants of concern emerge.
Excellent calibration was also apparent on internal validation (CITL of 1.05, Fig 3). The final base-case model, saved in XGBoost-internal binary format, is available in the supplementary materials (S1 Data). The final model showed good discrimination in the Western Cape Omicron test cohort (C-statistic 0.84, 95% CI 0.83 to 0.85, Fig 3); however, calibration was sub-optimal, with over-prediction of adverse outcome across all risk subgroups (CITL of 1.7, Fig 3). Discrimination was reduced in the UK PRIEST cohort (C-statistic 0.72, 95% CI 0.71 to 0.73, Fig 3), with under-prediction of adverse outcome (CITL of 0.68, Fig 3). Model discrimination was lower still in the Sudanese cohort (C-statistic 0.62, 95% CI 0.59 to 0.65, Fig 3), with under-prediction of adverse outcome for low and moderate risk patients. Estimated diagnostic accuracy (recall, specificity, NPV, precision) at different model probability thresholds across the train and test populations is presented in Fig 4, with tables available in the supplementary materials (S4-S7 Text).
The alternative model, using categorised physiological features in complete case data, achieved a minimal reduction in discrimination (C-statistic 0.90, 95% CI 0.89 to 0.91) and similar calibration (CITL 1.05) compared to the base case continuous model (S14 Fig). The model trained on Western Cape data across all variants and time periods demonstrated similar discrimination (C-statistic 0.91, 95% CI 0.90 to 0.91) and calibration (CITL 1.07) metrics compared to the base case model (S15 Fig).

Ethics
Use of routinely collected electronic health care records from the Western Cape for the derivation of the development and Omicron cohorts for this study was approved by the University of Cape Town Human Research Ethics Committee (HREC 594/2021) and the Western Cape Health Research Committee (WC_202111_034). Analysis of Sudanese data was approved by the University of Cape Town Human Research Ethics Committee (HREC 594/2021), the Western Cape Health Research Committee (WC_202111_034), and the Khartoum State Ministry of Health. As all data were de-identified at source before being provided to the research team, the need for patient consent was waived. Data collection for the UK validation cohort was first approved by the Northwest-Haydock Research Ethics Committee on 25 June 2012 (reference 12/NW/0303), and for the updated PRIEST study on 23rd March 2020. The Confidentiality Advisory Group of the Health Research Authority granted approval to collect data without patient consent in line with Section 251 of the National Health Service Act 2006.
Of these, 282,051 (92.3%) cases had complete data available and were included in the base case training dataset. The prevalence of 30-day adverse outcome was 4.0%, and 74,580 patients (24.4%) had a diagnosis of COVID-19 confirmed by PCR testing. There were 140,520 patients in the Omicron wave test (external validation) cohort, of whom 130,407 (92.8%) had complete data. The PRIEST test (external validation) cohort comprised 20,698 patients, of whom 18,960 (91.6%) had complete data. The Sudanese test (external validation) cohort comprised 2,583 patients, of whom 1,290 (49.9%) had complete data. The prevalence of adverse outcome was 1.98%, 22.1%, and 35.7% in the Omicron wave, PRIEST, and Sudanese cohorts respectively. Table 1 summarises the characteristics of the Western Cape data. Derivation of the Western Cape, PRIEST, and Sudanese cohorts, and comparison of patient characteristics across train and test cohorts, is provided in the supplementary materials (S1-S3 Fig; and S1-S3 Text).

Table 1 . Characteristics of Western Cape Alpha/Beta/Delta wave study participants.
Calibration was sub-optimal, with over-prediction of adverse outcome across all risk subgroups in Omicron cases (CITL of 1.7), and under-prediction of adverse outcome in UK cases with ancestral COVID-19 infection and in Sudanese patients with ancestral or Eta variant infection (CITL of 0.68 and 0.61 respectively). Results were not substantively changed in extensive sensitivity analyses exploring different missing data mechanisms.