Severity and mortality prediction models to triage Indian COVID-19 patients

As the second wave in India mitigates, COVID-19 has now infected about 29 million patients countrywide, leading to more than 350 thousand people dead. As the infections surged, the strain on the medical infrastructure in the country became apparent. While the country vaccinates its population, opening up the economy may lead to an increase in infection rates. In this scenario, it is essential to effectively utilize the limited hospital resources by an informed patient triaging system based on clinical parameters. Here, we present two interpretable machine learning models predicting the clinical outcomes, severity, and mortality, of the patients based on routine non-invasive surveillance of blood parameters from one of the largest cohorts of Indian patients at the day of admission. Patient severity and mortality prediction models achieved 86.3% and 88.06% accuracy, respectively, with an AUC-ROC of 0.91 and 0.92. We have integrated both the models in a user-friendly web app calculator, https://triage-COVID-19.herokuapp.com/, to showcase the potential deployment of such efforts at scale.


Introduction
Translational science is a rapidly growing field with immediate potential in direct clinical applications. This includes the development of computational models that analyze Electronic Health Records (EHRs) and aid us in interpreting the complex biological associations between clinical measurements and patient outcomes. Machine Learning is an extremely powerful tool deployed by translational scientists to recognize patterns and identify features from medical data that correlate with clinical outcomes. The predictions obtained from such machine learning models may assist in clinical decision-making. This can enable us to automate certain stages of diagnosis, especially when during a scarcity of medical resources such as trained medical professionals or intensive care units (ICUs).
Globally, several models have been proposed by researchers to tackle the problem of triaging COVID-19 patients to budget for, allocate and effectively manage appropriate medical resources, such as by D Ellinghaus et al. [1] and by Yuan Hou et al. [2]. These studies found that the susceptibility and mortality rates are widely variable across countries. However, only a handful of these models have been trained on datasets populations in the developing world, such as India. To the best of our knowledge, only two such models have been trained and tested on Indian populations. First, a model based on a Random Forests classifier predicting ICU admissions using features such as age, symptoms at diagnosis, number of comorbidities, chest X-ray, SpO2 concentration, ANC/ALC ratio, CRP, and Serum ferritin concentrations [3]. Second, a set of models forecast the number of COVID-19 cases in India using simple SEIR mathematical models and also using LSTMs [4], [5].

Results
Li Yan et al. proposed one of the first mortality prediction models for COVID-19 [6]. This model was trained and tested on 375 infected patients in the region of Wuhan, China. Of the 375 patients, 201 had recovered, and 174 had died. This paper also proposed a clinically operable decision tree that predicted the outcome based on lactic dehydrogenase (LDH), lymphocytes, and high-sensitivity C-reactive protein (hs-CRP) values. They achieved 100% accuracy in predicting COVID-19 severity and 81% accuracy in predicting patient mortality in their dataset using this decision tree. However, as we demonstrated previously (8), testing these models on our cohort of Indian patients was not as successful (Fig 1). Applying this model to a subset of 120 patients from the current cohort of patients, ensuring a maximum overlap of parameters used in the model trained by Li Yan et al., the overall severity prediction accuracy was 65.26%, and the mortality prediction accuracy was 88% [7]. The poor performance in predicting severity is of particular concern as this directly affects the expected medical resources that a patient may require, thereby fulfilling the purpose of triaging. Therefore, it was noted that the existing models might be population-specific due to various intrinsic factors such as genetics and external factors such as demography, population density, or access to appropriate medical infrastructure.
Following this, we endeavored to build two supervised machine learning models pertaining to the needs of the Indian population.

Evaluation of the mortality prediction model
Performance metrics for the mortality model are given in Table 1.
From the SHAP(13) value and the mean |SHAP| value importance plot (Fig 2), we can get an idea about the relative importance of different parameters. We see that age plays a vital role in determining the mortality of a patient, which is in accordance with the literature available on COVID-19. Having a lower value for age greatly impacts the model's output negatively (or favors the negative class, i.e., alive) than the positive class (which is deceased in this case). According to the Union Health Ministry, Government of India,~53% of people who have died because of COVID-19 in India are above the age of 60 [8].

PLOS DIGITAL HEALTH
Other parameters that are useful to determine the mortality of a patient are the percentage of neutrophils in the blood, the creatinine levels in the urine, and the Urea levels as well, all of which, when present in higher amounts than usual, point towards a positive classification. Parameters moderately impacting the output are alkaline phosphatase (ALP) enzyme level, serum sodium levels in the blood, and indirect bilirubin in the blood, suggesting that the liver could be affected by COVID-19. We can see from the ROC curve, plotted on the validation set, that it covers an area of 0.91, which means it can separate the classes correctly 91% of the time (Fig 2).

Evaluation of the severity model
Performance metrics for the severity model are given in Table 2.
We see from the plots of the SHAP and the mean |SHAP| values (Fig 3), Age and Urea levels play an essential role in determining the severity of the disease. But, since we had used additional biomarkers for COVID-19 in this model, we get interesting results showing that High sensitivity C-reactive protein (hs-CRP) and D-D dimer have a big impact on the model. We see that for all these parameters, an increased value suggests that the patient is categorized as severe. However, we also see indirect bilirubin and AST/SGOT having a moderate impact on the model output as well. The severity model performs well with its ROC curve (plotted on validation set) covering an area of 0.92, thus differentiating between classes 92% of the time, slightly better than the mortality prediction model. A comparison of the two models is made in the next section.

Reduced models
Since our machine learning models use 33 clinical parameters, which is a lot and would not always be available in a general setting. The distributions of the important features (according to the feature importance plots) are given in Supplementary Material (Fig B in S1 Text). We decided to make 'reduced' models that use only the top ten features that were found by the mean SHAP values since ten features would be easier and faster to collect from a patient than 33 features. We performed hyperparameter tuning again for the two reduced models of mortality and severity and were able to achieve good accuracy and similar AUC-ROC score of 0.91 and 0.93, respectively, on the validation set without facing overfitting (Fig 4 and Tables C,D in S1 Text).

Discussion and conclusion
We were successfully able to make two machine learning models to triage COVID-19 patients in India into prediction categories like alive versus deceased and severe versus non-severe. This was done using XGBoost (12) models that use gradient boosting in decision trees (the gbtree booster of XGBoost was internally used). These models can be summarized through sample decision trees (Fig 5) to inform a clinical decision support system. Note that XGBoost is an ensemble model and creates multiple weak classifiers which work strongly when together. In order to make the results of our model available for testing, we made an online calculator web app that can be used to determine the mortality and severity using our models at: https:// triage-COVID-19.herokuapp.com/. In the mortality model, we see that most of the evaluation parameters, like test accuracy, Fscore, recall, precision, and AUC-ROC scores are less than those for the severity model. This can be because, for the severity model, we have used certain biomarkers that have been shown to be affected by COVID-19. These include high sensitivity C-Reactive Protein (hs-CRP), ferritin, D-D dimer, and Lactate Dehydrogenase (LDH). However, we could not use LDH because of data limitations as described in the section on building the severity model. Even so, we found in the above section that ferritin does not contribute as much to the severity model's output as hs-CRP and D-D dimer do. This could imply there is a more direct correlation

PLOS DIGITAL HEALTH
between COVID-19 and hs-CRP and D-D dimer than ferritin. Additionally, in the ROC curves, the severity model can achieve around 60% sensitivity (TPR) with a very low FPR, whereas the mortality model can only achieve up to about 50% sensitivity for the same low FPR. In the precision-recall curves, we see that the mortality model can achieve a recall of 0.2 without predicting any false positives, whereas severity can achieve a recall of 0.4.

Confidentiality statement and ethics
The study protocol was reviewed and approved by Institute ethics committees at both the participating centres through Institute ethics committee File No. IEC/320/325 to SS. Consent waiver was granted given the retrospective nature of analysis and emergent nature of the pandemic. Confidentiality was maintained by the de-identification of data. All analysis was performed on de-identified data.

Data collection
The data for this study was collected from one of the largest dedicated COVID-19 centers in Northern India, Sardar Vallabhbhai Patel COVID Hospital and PM CARES COVID Care Hospitals. In these hospitals, a series of blood tests were performed for all the patients with  confirmed infection at the time of admission. The patients were diagnosed using a Rapid Antigen Test or RT-PCR testing of a nasal/throat swab sample. The data for 815 patients were collected between 13th July and 31st December 2020. The clinicians followed the guidelines provided by the MOHFW, Govt. of India [9], to classify 390, 160, 84, 181 patients into the categories mild, moderate, severe, and dead based on symptoms on arrival. These results were used to identify the primary biomarkers that control the risk of developing severe infections and survival chances.

Data preprocessing
Data preprocessing involved removing missing values by deleting the patients or features entirely if they are available for significantly fewer patients. We avoided simple imputation, i.e., replacing missing values by the mean, because the size of the dataset was not appreciable enough to guess values accurately. Imputing the missing values would have resulted in a low variance, and decreasing our data's variance directly implies adding a bias to our model. Other than this, we converted all categorical variables into one-hot encoded form, and we made binary variables for signs/symptoms and comorbidities of each patient. The comorbidities were classified into broad categories: cardiac disease, chronic liver disease, hypertension, diabetes, chronic kidney disease, lung disease, morbid obesity, and hypothyroidism. The dataset obtained after completing data-preprocessing contained 600 patients, out of which 170 were deceased, 250 had experienced mild symptoms, 99 moderate, and 81 severe with 33 clinical parameters.

Clustering
We had performed clustering of patients based on their collected clinical parameters using the KPrototypes algorithm to check for the presence of any clinical bias. We have summarized the results in Supplementary Material (Fig A in S1 Text).

Modelling strategy
4.5.1 Mortality prediction model. All the features, including their gross statistics, used for training mortality prediction models can be found in Supplementary Material (Table A in S1 Text). The number of deceased patients was less than those who had recovered (170:430); this formed a skewed data distribution. Standard machine learning algorithms perform poorly on such datasets as they tend to be biased towards the majority class. This decreases the prediction accuracy of the minority class [10]. For this reason, we decided to use ensemble models, which are multiple classifier methods, i.e., these learning machines combine the decision of multiple base classifiers to reach the final prediction. A widely-used class of ensemble models is boosting algorithms, in which the final classifier is built through the sequential addition of multiple weak classifiers. The overall performance of the model increases with the addition of each classifier. We did an extensive survey of the different CART (Classification and Regression Trees) models and decided to implement AdaBoost(16), XGBoost(12), and CatBoost(15) because of their proof of performance.
In boosting algorithms, a 1:1 ratio of the classes in the data is recognized as ideal. We used random undersampling for the majority class, i.e., recovered patients, to reduce the data imbalance. The new dataset contained the information of 170 deceased patients and 205 recovered patients. We ensured that the number of mild, moderate, and severe patients were equally represented (Mild-55, Moderate-55, Severe-55) in the dataset to minimize bias in the model. Following this, we split this dataset into training and validation sets containing 302 and 73 patients, respectively. The splitting was performed using stratified sampling to maintain the ratio of mild, moderate, severe, and deceased patients in both sets.
Once the training dataset had been finalized, we started with the training of the model. We implement Repeated Stratified 5-fold cross-validation using the GridSearchCV function from the sklearn(14) python package. While iterating with a wide range of hyperparameters during cross-validation, we have optimized the model for the best accuracy and re-fitted it for the best F-Score. The hyperparameter values of the model obtained after cross-validation are listed in Table 3. Using colsample_bytree enables us to induce randomness in the training examples and make our model robust from noise, and varying max_depth, min_child_weight, gamma, and alpha enables us to reduce and remove any overfitting. 4.5.2 Severity prediction model. All the features, including their gross statistics, used for training mortality prediction models can be in Supplementary Material (Table B in S1 Text). The objective of this model was to predict the risk of developing severe infection. We clubbed mild and moderate patients as non-severe, and severe and dead patients were grouped as severe. Many researchers have found that COVID-19 causes a change in these four critical biomarkers: D-D Dimer, High Sensitivity C-Reactive Protein(hs C-RP), Lactate Dehydrogenase (LDH), and ferritin. However, these were not measured for all patients. We had 396, 390, 305, 398 patients for those parameters, respectively. We calculated the number of unique patients with different combinations of three of these parameters and found the maximum to be 331 patients with values for D-Dimer, hs-CRP, and ferritin as the optimal combination keeping in mind how the model performance depends on the quantity of data and number of parameters. Following this, we split these 331 patients into training and validation sets containing 264 and 67 patients, respectively. The choice of model was the same as in the mortality prediction model mentioned above. After hyperparameter tuning and cross-validation as mentioned in the mortality model above, the hyperparameters of the best model are mentioned in Table 3.

Evaluation methodology.
We evaluated the models based on different parameters, namely Accuracy, Recall, Precision, and F-Score, along with the ROC and Precision-Recall curves. These are defined in terms of the number of True Positives(TP), False Positive(FP), True Negative(TN), and False Negative(FN) for a class as:  . FPR(1-specificity(or TNR)). It is a measure of how well our model can differentiate between 2 classes. The AUC-ROC is the Area under the ROC curve.
The methods accuracy_score(), recall_score(), precision_score() and f1_score() of the sklearn.metrics module were used to evaluate the respective parameters and the plot_roc_curve() and precision_recall_curve() to plot the ROC and P-R curves and get the AUC value.
For the mortality model, the positive class was chosen to be deceased(dead) patients and the negative class to be alive (mild + moderate + severe) patients. Following the same pattern, in the severity model, the positive class was chosen to be severe(severe + dead) patients, whereas the negative class was chosen to be non-severe (mild + moderate) patients.
The SHAP value plots are not part of the model itself; rather, SHAP uses an external model to vary the value of every feature and observe how much the output is affected. The SHAP value plots tell us how these variations affect the model's output. The mean |SHAP| value plot gives us information about which features are the most important in our model and have the most impact.
Please note that we can use the precision-recall curve here since we are more concerned with the positive class and having fewer False Negatives than False Positives (as patients whose mortality/severity goes undetected on being classified as non-severe/alive is more harmful to the society than a patient which was not severe being classified as severe/dead) [11].
Supporting information S1 Text. Fig A: The points are colored according to the clustering labels (learned from data), and the point shape represents the severity status of the patients. Table A: Comparison of features in deceased and alive patients in the mortality prediction model. Table B: Comparison of features in severe and non-severe patients in the severity prediction model. Fig B: Distribution of features with high predictive power.