Utilizing machine learning for survival analysis to identify risk factors for COVID-19 intensive care unit admission: A retrospective cohort study from the United Arab Emirates

  • Aamna AlShehhi ,

    Contributed equally to this work with: Aamna AlShehhi, Hiba Alblooshi

    Roles Conceptualization, Formal analysis, Methodology, Visualization, Writing – original draft

    Affiliations Biomedical Engineering Department, College of Engineering, Khalifa University, Abu Dhabi, United Arab Emirates, Healthcare Engineering Innovation Center (HEIC), Khalifa University, Abu Dhabi, United Arab Emirates

  • Taleb M. Almansoori,

    Roles Conceptualization, Investigation, Resources, Validation, Writing – review & editing

    Affiliation Department of Radiology, College of Medicine and Health Sciences, United Arab Emirates University, Al Ain, United Arab Emirates

  • Ahmed R. Alsuwaidi,

    Roles Conceptualization, Investigation, Resources, Validation, Writing – review & editing

    Affiliation Department of Pediatrics, College of Medicine and Health Sciences, United Arab Emirates University, Al Ain, United Arab Emirates

  • Hiba Alblooshi

    Contributed equally to this work with: Aamna AlShehhi, Hiba Alblooshi

    Roles Conceptualization, Data curation, Investigation, Project administration, Writing – original draft, Writing – review & editing

    hiba.alblooshi@uaeu.ac.ae

    Affiliation Department of Genetics and Genomics, College of Medicine and Health Sciences, United Arab Emirates University, Al Ain, United Arab Emirates

Abstract

Background

The unprecedented COVID-19 pandemic has prompted the use of Artificial Intelligence (AI) as an innovative tool for addressing evolving clinical challenges. One example is Machine Learning (ML), a subfield of AI whose models take advantage of observational data such as Electronic Health Records (EHRs) to support clinical decision-making for COVID-19 cases. This study aimed to evaluate the clinical characteristics and risk factors for COVID-19 patients in the United Arab Emirates using EHRs and ML models for survival analysis.

Methods

We tested various ML models for survival analysis. The models were trained on different subsets of features extracted by several feature selection methods. Finally, the best model was evaluated and interpreted using goodness-of-fit based on calibration curves, Partial Dependence Plots and the concordance index.

Results

The risk of severe disease increases with elevated levels of C-reactive protein, ferritin, lactate dehydrogenase, Modified Early Warning Score, respiratory rate and troponin. The risk also increases with hypokalemia, oxygen desaturation, lower estimated glomerular filtration rate, hypocalcemia and lymphopenia.

Conclusion

Analyzing clinical data using AI models can provide vital information for clinicians to measure the risk of morbidity and mortality of COVID-19 patients. Further validation is crucial before the model can be implemented in real clinical settings.

Introduction

The COVID-19 pandemic began in Wuhan (China) in December 2019 and has expanded to every inhabited continent [1, 2]. In March 2020, the World Health Organization (WHO) categorized it as a global pandemic with remarkably high incidence and mortality rates [1–3]. To date, 683,955,862 individuals have been infected and 6,831,756 deaths have been reported. Supportive care remains the main treatment available. Different COVID-19 vaccines were developed and are currently used to reduce susceptibility to infection [1].

The pandemic has drastically challenged health care systems globally. The increase in the number of affected individuals exerted substantial pressure on health sectors, particularly given the limited capacity of intensive care units (ICUs) [2]. During the early stages of the pandemic, the main challenge for health professionals was to identify the cases that were more likely to progress from mild to severe disease or sudden death. Understanding the risk factors for severe disease can help clinicians provide timely and efficient interventions, including proper utilization of ICU facilities.

Since the beginning of the pandemic, a tremendous amount of quantitative research using Electronic Health Records (EHRs) has been undertaken for different objectives, such as patient discharge time prediction [2], mortality risk prediction [4] and early-detection models for COVID-19 [5]. Several studies reported the clinical characteristics and relevant risk factors for severe disease. In a Malaysian cohort [6] of 5,889 confirmed COVID-19 patients, chronic kidney disease, chronic pulmonary disease, fever, cough, diarrhea, breathlessness, tachypnoea, abnormal chest radiographs and high serum CRP (≥5 mg/dL) at the time of hospital admission were identified as risk factors for severe disease using univariate and multivariate logistic regression. Another study of 17,278,392 patients from England [3] reported gender, age, diabetes and asthma to be associated with severe COVID-19 using a multivariable Cox proportional hazards model. A second retrospective study from England [7] (n = 3,138,410), focusing on the diabetic population, showed using a Cox proportional hazards model that the severity of COVID-19 was associated with a history of cardiovascular disease, gender, age, renal impairment, non-white ethnicity, socioeconomic deprivation, poor glycemic control and high body mass index (BMI). A representative Scottish cohort [8] of 5,463,300 patients, also focusing on diabetic patients, reported using logistic regression that the risk factors associated with fatal or critical care unit-treated COVID-19 are gender, smoking, living in residential care, retinopathy, reduced renal function, worse glycemic control, diabetic ketoacidosis and hypoglycemia. A retrospective study of the Kazakhstan diabetic population [1] (n = 1961) showed that the severity of COVID-19 was higher in diabetic patients, who have higher rates of coexisting cardiovascular pathology and kidney disease. In addition, clinical symptoms such as impaired breathing, nausea/vomiting and weakness/lethargy were worse in this group than in the matched non-diabetic group. These previous works reported several risk factors associated with COVID-19 severity. Age and gender are common risk factors for severe cases, while the other risk factors vary across countries.

This study aims to explore and report the association between patients’ medical information at hospital admission and their deterioration leading to ICU admission. To the best of our knowledge, this is the first quantitative study employing machine learning for survival analysis to report risk factors for ICU admission in a United Arab Emirates (UAE) cohort. A cross-validated predictive model of critical care unit admission for COVID-19 patients is also explored.

Materials and methods

Ethical statement

This study was approved by the Institutional Review Board of the Department of Health, Abu Dhabi. Approval number: IRB DOH/CVDC/2020/799. The review board waived the requirement for individual informed consent as this study was part of an outbreak investigation in the UAE. All investigators had access only to anonymized patient information. This study was performed in accordance with the relevant laws and regulations that govern research in the Emirate of Abu Dhabi, UAE.

Data source

In this retrospective study, anonymized medical records of COVID-19 patients were extracted from the Abu Dhabi Health Services Company (SEHA) healthcare system. The SEHA dataset is a high-dimensional UAE population health data source that includes rich patient medical information: sociodemographic data, comorbidities and, upon admission, symptoms, laboratory results, vital signs, and COVID-19 medications.

Study design

A COVID-19 cohort was identified and extracted by SEHA. The clinical diagnosis of COVID-19 was confirmed using reverse transcriptase polymerase chain reaction (RT-PCR) from nasal swabs. The data contained 1,800 registered patients (ages ≥ 18 years) admitted to any of the SEHA healthcare facilities between March 1, 2020, and April 20, 2020. We excluded patients who were under 19 years of age following Petrilli et al. [9]; Fig 1 shows the flow diagram of the cohort inclusion and exclusion criteria. Patients were divided into two groups with respect to severity, defined by ICU admission. The follow-up period started on the date of hospital admission and ended on the date of ICU admission or the date of discharge from the hospital. Some patients had up to 60 days of follow-up. We extracted all baseline (upon admission) covariates provided in the dataset for each patient.

Fig 1. Flow diagram of the cohort.

Inclusion and exclusion criteria for determining the patients considered in this study. SEHA extracted 1,800 COVID-19 patients from March 1, 2020 to April 20, 2020. After applying the inclusion and exclusion criteria, our study included 1,787 patients.

https://doi.org/10.1371/journal.pone.0291373.g001

Modeling

EHR data contain rich medical information about patients. These patient-centered records pose challenges because of their heterogeneous, high-dimensional nature as well as the presence of missing and censored records [10]. These challenges require advanced techniques to detect patterns and report unseen associations between the variables of interest and the outcome. In the following section, we outline the study pipeline used to address the challenges imposed by such complex data and describe how we built a robust model to report the association between patient baseline information and the development of a critical condition leading to ICU admission.

Data pre-processing.

Missing values are a common problem in medical data such as electronic health records and need to be handled properly to avoid biased results [11]. We encountered this problem in our dataset; variables with missing values are reported in S1 to S6 Figs. To deal with this problem, we first excluded the variables with more than 70% missing information and imputed the remaining variables. Understanding and describing the missingness mechanism is an important step in determining the best way to handle missing values. The pattern of missingness was examined using pairs plots to assess the relationships between missing values and observed values in all variables [12] (S7 to S17 Figs). From the pattern analysis, we concluded that the data are missing at random (MAR). Therefore, multiple imputation using random forest was employed in this study; it is a popular and robust imputation method widely used for this task [11] (S18 Fig).
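
As a minimal sketch of the exclusion step described above, the following Python snippet drops every variable whose fraction of missing values exceeds 70%. The variable names and values are hypothetical toy data, the function names are illustrative, and the snippet does not reproduce the software used in the study; the random forest imputation step is not shown.

```python
from typing import Dict, List, Optional

Column = List[Optional[float]]


def missing_fraction(column: Column) -> float:
    """Return the fraction of entries in a column that are missing (None)."""
    if not column:
        return 0.0
    return sum(value is None for value in column) / len(column)


def drop_sparse_columns(table: Dict[str, Column], threshold: float = 0.70) -> Dict[str, Column]:
    """Keep only the columns whose missing fraction does not exceed the threshold."""
    return {
        name: values
        for name, values in table.items()
        if missing_fraction(values) <= threshold
    }


# Hypothetical admission records for five patients (None marks a missing value).
records = {
    "crp": [12.0, None, 48.5, 7.1, 90.2],
    "ferritin": [None, None, None, None, 310.0],
    "resp_rate": [18.0, 22.0, None, 30.0, 24.0],
}

kept = drop_sparse_columns(records)
print(sorted(kept))
```

In this toy table, the ferritin column is missing in four of five records (80%) and is therefore excluded, while the other two columns would be passed on to imputation.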

Baseline characteristics statistical analysis.

Differences in baseline patient information between the ICU and non-ICU groups were tested using the t-test for parametric continuous variables (with the equal variance assumption), while the Mann-Whitney U test was used for nonparametric continuous variables. The χ2 test (with continuity correction) was used for categorical variables, while Fisher’s exact test was used when cell counts were small. All statistical tests were performed at the 5% significance level (95% confidence level).

Statistical and machine learning models.

  1. Feature Selection: To address the high-dimensionality problem in our dataset, we used several feature selection approaches to find the most prominent subsets of features for training our final models. Feature selection helps to speed up training, enhance model interpretability, and improve model performance. The following methods were used in this study:
    • Random forest variable importance (RF Var Imp): This method relies on random permutation of a feature to calculate its importance. Model performance is measured before and after the permutation. If the model prediction error increases significantly after the permutation, the feature is assigned an importance score that reflects how important it is to the final prediction [10].
    • Random forest minimal depth (RF Min Depth): This method measures the shortest distance from the root of the tree to the root of the largest subtree that splits on the feature of interest. The shorter the distance, the more influential and significant the feature is [10].
    • Univariate Score using Survival tree: A survival tree is fitted to each single feature in turn, and the importance of the feature is determined by the predictive performance of that survival tree model.
    • Cox model permutation importance (CPH Perm Imp): Similar to RF Var Imp, but with the random forest model replaced by a Cox model.
    For the machine learning based feature selection methods, we selected the optimal subset of features by keeping a specific percentage of the highest-importance features reported by the model. The optimal percentage threshold was selected using 5-fold cross-validation.
  2. Survival Analysis Models: Survival analysis is a subfield of statistics that focuses on modeling the time to event, that is, the expected time until the event of interest occurs. For some patients, the event of interest might not be observed during the study period; those data points are called censored data [2, 10]. A survival analysis approach was used in this study rather than a classification approach. In a classification analysis, the model has to be trained at each time step, and a classifier can only determine whether or not a patient will experience the event of interest, without indicating when the event will occur. In contrast, survival analysis considers the time until the event occurs [13]. One of the most popular survival analysis models in clinical studies is the Cox Proportional Hazard Model, which lacks scalability to high-dimensional data [10]. Recently, many machine learning methods have been adapted for survival analysis to exploit their ability to handle complex relations between variables and to scale to high-dimensional data [10]. Machine learning methods for survival analysis include Survival Tree, Gradient Boosting Machine, Random Forests and Regularized Generalized Linear Model. The statistical and machine learning survival models considered in this study are described below:
    • Multivariate Cox Proportional Hazard (CPH) Model is the standard and most popular survival analysis model in medical domains [14]. It is fast, computationally inexpensive, and easy to use and interpret. However, CPH has several drawbacks, such as its inability to deal with high-dimensional datasets and to model nonlinear interactions and correlations between features. In addition, the CPH model makes several assumptions that need to be checked to produce valid results, including the proportional hazards assumption, the absence of influential observations, and linearity [14].
    • Survival Tree is a tree-based method similar to the traditional decision tree machine learning method. The model recursively partitions tree nodes based on splitting rules that incorporate censoring information, such as log-rank, which is the most popular splitting rule for survival models [13].
    • Random Forests (RFs) are an extended version of traditional random forests that incorporates censoring information during model training. RFs are an ensemble tree-based model; each individual decision tree divides its nodes based on a splitting rule that incorporates censoring information, such as log-rank splitting or the gradient-based Brier score. The final outcome is calculated by averaging the predictions of each tree [14].
    • Regularized Generalized Linear Model (GLM): fits a regularized Cox model using a penalized negative log partial likelihood with an elastic net penalty. This model adds a regularization term to control overfitting and reduce model complexity. Three GLMs can be fitted depending on the value of alpha (α): elastic net, ridge, and lasso. The elastic net model combines and bridges the gap between the other two penalized models: ridge (α = 0) and lasso (α = 1) [15].
    • Gradient Boosting Machines (GBMs) are stage-wise models that convert a weak learner (tree-based) into a stronger model by using an optimization method such as gradient descent to minimize the objective (loss) function [14].
  3. Hyperparameters Optimization: The hyperparameters of the machine learning models were tuned with a random search optimization algorithm to choose the set of parameters that yields the best performance. We used 5-fold cross-validation to assess the quality of the selected parameters. Table 4 presents the parameter search space of each model and the parameters selected for the final models.
  4. Model Evaluation and Explanation
    • Discrimination (Concordance Index): model performance was measured and compared using the Concordance index (C-index). The C-index is a standard evaluation measure for survival analysis; it measures the proportion of concordant pairs among all possible evaluation pairs [2, 14].
    • Partial Dependence Plots (PDPs): model interpretation and transparency are important factors in adopting machine learning for clinical practice [14]. In this study, we applied PDPs as a post hoc technique to explain model decisions. The plots show the marginal effect of a feature of interest, as a risk factor, on the outcome of interest [15].
    • Calibration curve: a plot of the model-predicted probabilities versus the observed event rates within a given duration. Subjects are first ranked by predicted survival probability and then partitioned into groups. The subjects in the upper group are those who are least likely to experience the event of interest, while those in the lower group are most likely to experience it [16].
    • Models’ comparison: We compared the performance of the different algorithms over several runs with shuffling using the Kruskal-Wallis test, a non-parametric method for comparing the distributions of model outcomes (C-index). We then performed pairwise comparisons among the models using the Nemenyi post hoc test to detect which models differ from each other.
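
To make the concordance index described above concrete, the following Python sketch computes Harrell's C-index on a small toy sample. It is an illustration under stated assumptions rather than the implementation used in this study: the follow-up times, event indicators and risk scores are hypothetical, pairs with tied times are ignored, and tied risk scores are counted as half concordant.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[int],
                      risk_scores: Sequence[float]) -> float:
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when subject i experienced the event and
    did so strictly before subject j's observed time. The pair is concordant
    when the subject with the earlier event has the higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored subject cannot be the earlier member of a comparable pair.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in the sample")
    return concordant / comparable


# Hypothetical toy sample: days to ICU admission (event = 1) or discharge (event = 0).
days = [2, 4, 5, 7, 9]
icu_event = [1, 1, 0, 1, 0]
predicted_risk = [0.9, 0.4, 0.6, 0.3, 0.1]

c_index = concordance_index(days, icu_event, predicted_risk)
print(c_index)
```

In this toy sample there are eight comparable pairs, of which seven are ordered correctly by the predicted risk. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.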

Results

Baseline characteristics statistical analysis

Baseline sociodemographic data, comorbidities and, upon admission, symptoms, laboratory results and vital signs are presented with descriptive statistics in Tables 1–3 for the 60 features included in this study. In general, the non-ICU group was younger and larger than the ICU group. The majority of the population was male in both groups, with no significant difference between them. Several features were significantly different between the two groups, including age, diet, obesity, diabetes, CKD, ESRD, cough, C-reactive protein, calcium level, chloride level, CO2 level, creatinine and ferritin level. We used a correlation matrix to identify and remove highly correlated independent variables (correlation greater than 0.7 or less than -0.7), which left 55 features for training the different models; the heat map of the correlation matrix is shown in S19 Fig.
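
The correlation filter described above can be sketched in Python as follows. This is a minimal illustration, not the procedure used in the study: the text does not state which member of a highly correlated pair was removed, so the sketch assumes a greedy rule that keeps the first feature encountered, uses the Pearson correlation coefficient, and runs on hypothetical toy values.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant feature is treated as uncorrelated
    return cov / (sd_x * sd_y)


def drop_correlated(features: Dict[str, List[float]], threshold: float = 0.7) -> List[str]:
    """Greedily keep features whose |r| with every already kept feature is <= threshold."""
    kept: List[str] = []
    for name, values in features.items():
        if all(abs(pearson(values, features[other])) <= threshold for other in kept):
            kept.append(name)
    return kept


# Hypothetical toy values for five patients.
toy_features = {
    "creatinine": [0.8, 1.0, 1.5, 2.1, 3.0],
    "egfr": [110.0, 95.0, 60.0, 40.0, 25.0],
    "age": [34.0, 71.0, 45.0, 29.0, 62.0],
}

selected = drop_correlated(toy_features)
print(selected)
```

In the toy data, eGFR is strongly negatively correlated with creatinine and is removed, while age is only weakly correlated with creatinine and is kept. The order in which features are visited determines which member of a correlated pair survives, which is a consequence of the assumed greedy rule.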

Table 1. Baseline characteristics statistical analysis: Baseline characteristics of patients stratified by severity measure by accessing the ICU, mean (SD) or N (%).

https://doi.org/10.1371/journal.pone.0291373.t001

Table 2. Baseline characteristics statistical analysis: Baseline characteristics of patients stratified by severity measure by accessing the ICU, mean (SD) or N (%).

https://doi.org/10.1371/journal.pone.0291373.t002

Table 3. Baseline characteristics statistical analysis: Baseline characteristics of patients stratified by severity measure by accessing the ICU, mean (SD) or N (%).

https://doi.org/10.1371/journal.pone.0291373.t003

Statistical and machine learning models

Fig 2 illustrates the hyperparameter tuning matrix for the combinations of different models and feature selection methods. The heatmap shows the mean C-index over 5-fold cross-validation. Across all models, random forest minimal depth yielded the best selected features. The chosen hyperparameters and the number of features selected for each model are reported in Table 4. As the table shows, the number of features selected for each model ranges from 10 to 49. We trained all models with the selected feature subsets and with the complete feature set (Table 5). The table shows the C-index for each combination of selected feature subset and model. Gradient Boosting Machines (GBMs) slightly outperform CPH and RFs when trained on the features selected by CPH, GBMs, GLM, RFs and Survival Tree.

Fig 2. Models hyperparameters tuning: Heatmap of the mean C-index (5-fold cross-validation) for the best combination of hyperparameters and feature selection methods; the threshold for selecting the highest-importance features was also tuned.

https://doi.org/10.1371/journal.pone.0291373.g002

Table 4. Models’ hyperparameters: Machine learning models parametrized using random search optimization algorithm of 20 different parameter settings with a 5-fold cross validation to maximize the C-index.

https://doi.org/10.1371/journal.pone.0291373.t004

Table 5. Selected features and models combined performance using repeated 5-fold cross-validation, C-index with 95% confidence interval (95% CI).

https://doi.org/10.1371/journal.pone.0291373.t005

Model evaluation and explanation

The Kruskal-Wallis test showed a significant difference between the models’ performance (H = 926.50, p ≤ 0.05). This was followed by Nemenyi’s multiple comparisons test (Table 6), which showed no significant difference between the performance of GBMs and that of CPH or RFs at the 5% significance level. Based on Table 5, we selected the GBM model for further analysis and interpretation. Calibration curves for the probability of ICU access at 2, 3, and 5 days showed excellent agreement between model prediction and actual observation (Fig 3a–3c). The time-dependent ROC analysis (discrimination accuracy) for predicting ICU access yielded 96.8%, 96.8%, and 96.5% for 2, 3, and 5 days, respectively (Fig 3d–3f).
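
For readers unfamiliar with the omnibus test reported above, the following Python sketch computes the Kruskal-Wallis H statistic from per-model C-index samples. It is a simplified illustration: the C-index values are hypothetical, tied values receive average ranks but no tie correction factor is applied, and the p-value and the post hoc pairwise test are not computed.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, assigning the average rank to tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """Kruskal-Wallis H statistic (without tie correction) for several samples."""
    pooled = [value for group in groups for value in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted_sum = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_sum += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted_sum - 3.0 * (n_total + 1)


# Hypothetical C-index values from three repeated runs of three models.
gbm_runs = [0.91, 0.92, 0.93]
cph_runs = [0.88, 0.89, 0.90]
tree_runs = [0.80, 0.81, 0.82]

h_statistic = kruskal_wallis_h([gbm_runs, cph_runs, tree_runs])
print(round(h_statistic, 4))
```

Under the null hypothesis, H approximately follows a chi-squared distribution with (number of groups − 1) degrees of freedom, so a large H indicates that at least one model's C-index distribution differs from the others.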

Fig 3.

Calibration curves and time-dependent ROC curves of the gradient boosting machine (GBM) model: Calibration curves of predicted compared with observed ICU access after 2 days (a), 3 days (b) and 5 days (c) of hospital admission. Time-dependent ROC curves for predicting ICU admission after 2 days (d), 3 days (e) and 5 days (f) of hospital admission.

https://doi.org/10.1371/journal.pone.0291373.g003

Table 6. Results of Nemenyi’s post hoc multiple comparisons for different models.

https://doi.org/10.1371/journal.pone.0291373.t006

Finally, we explored the marginal risk effect of the model features of interest using PDPs. Fig 4 shows that the risk of disease severity rises with increases in C-reactive protein, ferritin level, lactate dehydrogenase level (LDH), Modified Early Warning Score (MEWS), respiratory rate, and troponin levels. In addition, the risk increases with lower potassium, calcium, oxygen saturation, estimated glomerular filtration rate (eGFR) and lymphocyte values. Finally, the adopted COVID-19 treatment regimen using hydroxychloroquine and favipiravir was associated with reduced COVID-19 severity.

Fig 4. Partial dependence plots inferred from gradient boosting machine (GBM) model using random forest minimal depth features subset: The lines present the change in the risk to access the ICU across selected variable of interest whilst holding other variables constant.

https://doi.org/10.1371/journal.pone.0291373.g004

Discussion

Various studies have leveraged machine learning (ML) to characterize the clinical risk factors for COVID-19 severity globally. To the best of our knowledge, this is the first study in the UAE to use data from electronic health records (EHR) to facilitate clinical decision making during the COVID-19 pandemic using machine learning for survival analysis. In our study we identified the risk factors that increase the severity of COVID-19 (Fig 4) in a UAE cohort. The highest clinical risk variables featured by the model were inflammatory markers including C-reactive protein, ferritin, and lactate dehydrogenase (LDH). Around seventy-six of the cases with elevated CRP upon admission entered ICU. As reported in the literature, an elevated level of C-reactive protein might be an indicator of COVID-19 severity and/or mortality [17] and can be used as a biomarker to identify patients’ progression status [18]. Elevated LDH has been reported to be associated with respiratory failure in COVID-19 patients. It has been described as a marker of severe COVID-19, with a six-fold increase in progression to severe COVID-19 disease [19]. Another factor identified in our study is serum ferritin, which has been reported as a predictor of severity in patients with COVID-19. A recent study indicated that elevated ferritin (above the 25th percentile) was associated with pulmonary involvement [20], and ferritin has been targeted as a biomarker for therapeutic monitoring of methylprednisolone [21].

In this study, low serum calcium (hypocalcemia) was identified as a COVID-19 severity risk factor. A recent meta-analysis [22] reported that hypocalcemia was significantly associated with COVID-19 severity, mortality, number of hospitalization days and admission to the ICU. These findings support the use of serum calcium level as a prognostic marker for COVID-19, especially at initial assessment. Another study reported that COVID-19 patients with hypocalcemia were more likely to require high oxygen support during hospitalization and to be admitted to the ICU [23]. The role of calcium in the pathophysiology of SARS-CoV-2 is still not clear.

Low potassium (hypokalemia) was identified as a risk factor for COVID-19 severity. Many cohort studies reported an association between hypokalemia and increased severity of COVID-19 among affected patients [24]. Another study reported an association between hypokalemia and COVID-19 pneumonia (Moreno-Pérez et al, 2020). Both studies suggest the presence of a renin-angiotensin disorder. SARS-CoV-2 binds to ACE2 and enhances its degradation, which reduces the counteracting effect of ACE2 on the renin-angiotensin system (RAS). The main challenge is to maintain the potassium level despite continuous renal potassium loss. Consequently, this can affect cardiovascular function, neurohormonal activation and other vital organs such as the lung. Therefore, severe hypokalemia in COVID-19 patients indicates that mechanical ventilation should be considered [25]. In an Italian cohort, hypokalemia was reported in 41% of the hospitalized patients but was not associated with ICU admission or mortality [26].

Another risk factor identified in our study is a low lymphocyte count (lymphopenia). This finding aligns with various studies that reported an association between lymphopenia and the severity and hospitalization of COVID-19 patients [27–29]. Multiple mechanisms have been proposed to explain lymphocyte deficiency in COVID-19, including direct infection of lymphocytes by the virus, which results in cell destruction because lymphocytes express ACE2 receptors on their surface. However, further studies are needed to understand why lymphopenia is an indicator of COVID-19 severity and poor outcome.

Low oxygen saturation (hypoxemia) and high respiratory rate upon admission are other COVID-19 severity risk factors selected in our study. The relative risk is higher in patients with oxygen saturation below 88% and respiratory rate below 38 breaths per minute. As reported in the literature, oxygen saturation below 90% is a predictive risk factor for severe COVID-19 and/or mortality [30]. These factors can provide a clinical indication upon admission to consider the patient for appropriate oxygen supplementation and timely access to hospital care, especially given the limited critical care resources during the COVID-19 pandemic. In addition, hypoxia was reported as an independent marker associated with in-hospital mortality in COVID-19 patients [30].

The initial Modified Early Warning Score (MEWS) is an important variable for measuring the deterioration of patients’ status in hospital-based settings. In our study the MEWS is a factor that predicts the risk of ICU admission (Fig 4). The initial MEWS is a significant indicator of ICU admission, especially in patients with silent hypoxemia [31]. The strength of this factor is that it leverages information from the EHR to provide an actionable strategy for COVID-19 patients during a pandemic. Hence, it reflects the effect of clinical decision support tools on the health system.

Elevated cardiac troponin was observed to be a predictive risk factor for COVID-19 severity. Cardiac troponin is a well-known marker of myocardial injury. Various studies evaluated elevated cardiac troponin as a biomarker of COVID-19 severity and an indicator of patient deterioration [32–34]. In addition, cardiac troponin was evaluated in COVID-19 patients as a risk factor for severity and an independent predictor of death within 30 days [34]. A recent study evaluated the cut-off of high-sensitivity troponin I in non-severe COVID-19 patients as an indicator of cardiac damage in the second week after onset [32].

Low estimated glomerular filtration rate (eGFR) is a risk factor for progression of COVID-19 severity in the studied cohort. This risk factor has been reported in various studies as a predictor of COVID-19 prognosis [35]. Uribarri et al. [36] clearly demonstrated the impact of renal function using the international HOPE COVID-19 (Health Outcome Predictive Evaluation for COVID-19) Registry. Kidney dysfunction is common in COVID-19 patients upon admission, with various possible complications such as renal failure or in-hospital mortality [35]. Furthermore, patients with eGFR below 60 ml/min/m2 were found to have a higher risk of worse prognosis because of respiratory failure and sepsis. Fifty-six percent of COVID-19 patients with eGFR below 30 upon admission exhibited significant deterioration in their renal function. Renal involvement in COVID-19 patients is a very important risk factor that requires close follow-up during hospitalization to avoid potential renal complications. Furthermore, early identification of kidney injury can predict COVID-19 progression and poor prognosis [37].

In this study, we included COVID-19 treatment in the analysis because we evaluated the risk factors for ICU admission during hospitalization. Hydroxychloroquine and other antiviral therapies such as favipiravir were part of the initial treatment offered to inpatients with COVID-19 during the early phase of the pandemic. In Fig 4, these two pharmacotherapies were identified as factors supporting the avoidance of ICU admission. The findings from this model do not evaluate treatment effectiveness independently, as most of the cohort patients were under different pharmacotherapies (n = 735). The results only evaluate the effect of hydroxychloroquine and favipiravir on the ICU admission of COVID-19 patients during hospitalization. Early in the pandemic, with the increasing number of critically ill patients and the desperation of clinicians, preliminary data on hydroxychloroquine released on March 16, 2020 provided hope for a potential treatment for COVID-19 patients across the globe. On March 28, 2020, the US Food and Drug Administration (FDA) authorized the early use of hydroxychloroquine. The dataset in this study represents the initial pandemic stage, when most symptomatic COVID-19 inpatients received hydroxychloroquine with or without antiviral pharmacotherapy. Many follow-up studies reported inconclusive efficacy of hydroxychloroquine in COVID-19 patients [38–40]. Based on the accumulated scientific evidence, the FDA revoked the authorization on June 15, 2020 [41, 42]. In addition, hydroxychloroquine is no longer recommended in the UAE.

Favipiravir is an antiviral therapy that was initially used to treat COVID-19 in the early wave of the pandemic. No statistically significant effect of favipiravir on oxygen supplementation, ICU admission, or mortality has been reported [43]. A meta-analysis reported a lack of favipiravir effectiveness in reducing mortality among patients with mild to moderate COVID-19 [43]. Another study reported that favipiravir can promote viral clearance by 7 days and improve clinical outcome within 14 days. However, that study recommended further evaluation of the dosing and duration of the treatment to validate the findings [44]. In our study, favipiravir is identified as a factor that can reduce the chance of ICU admission. However, this finding is only marginal (p-value = 0.068), possibly because of the small sample size in this category of the cohort. Several limitations were encountered in this analysis. The ICU group was small, so a full-spectrum evaluation was not possible. There are other predictors and features that we did not consider in the implemented models, such as lifestyle, viral variant strains, viral load, and severity score. Furthermore, the model was not adjusted for smoking because most of the values were missing from the dataset. As most patients in this cohort had pre-existing comorbidities, they were on different medications that were not included in this investigation. Medication effectiveness was not part of the analysis because the sample size of each pharmacotherapy group was too small to be representative.

Further validation is required to evaluate the performance of the model. The relevant attributes may change with several factors, including the emergence of new virus strains, improvements in management practice, new pharmacotherapies, and vaccine availability around the globe.

Conclusion

In conclusion, predicting COVID-19 severity and evaluating the risk factors during hospitalization is challenging. However, ML models can assist clinicians in identifying high-risk patients upon admission. The focus of our study is to evaluate risk factor criteria in a UAE COVID-19 cohort to facilitate clinicians' decisions on ICU admission when critical care resources are limited during fast-evolving COVID-19 waves. Our study is the first study of risk factors from EHRs in the UAE at the early pandemic stage. Various clinical markers can be used as predictor variables for ICU admission, including C-reactive protein, ferritin level, lactate dehydrogenase (LDH) level, Modified Early Warning Score (MEWS), respiratory rate, and troponin level. The risk also increases with lower potassium level, oxygen saturation, and estimated glomerular filtration rate (eGFR). By using the identified features, clinicians can adapt treatment plans and prioritize ICU admission for high-risk patients. Mortality prediction was not investigated in this study.

Supporting information

S1 Fig. Sociodemographic information missing values.

The figure illustrates the number of missing instances in each variable. Variables with more than 70% missing information were excluded from this analysis.

https://doi.org/10.1371/journal.pone.0291373.s001

(PDF)

S2 Fig. Comorbidity missing values.

The figure illustrates the number of missing instances in each variable. Variables with more than 70% missing information were excluded from this analysis.

https://doi.org/10.1371/journal.pone.0291373.s002

(PDF)

S3 Fig. Symptoms upon admission missing values.

The figure illustrates the number of missing instances in each variable. Variables with more than 70% missing information were excluded from this analysis.

https://doi.org/10.1371/journal.pone.0291373.s003

(PDF)

S4 Fig. Laboratory results upon admission missing values.

The figure illustrates the number of missing instances in each variable. Variables with more than 70% missing information were excluded from this analysis.

https://doi.org/10.1371/journal.pone.0291373.s004

(PDF)

S5 Fig. Vital information upon admission missing values.

The figure illustrates the number of missing instances in each variable. Variables with more than 70% missing information were excluded from this analysis.

https://doi.org/10.1371/journal.pone.0291373.s005

(PDF)

S6 Fig. COVID-19 therapy missing values.

The figure illustrates the number of missing instances in each variable. Variables with more than 70% missing information were excluded from this analysis.

https://doi.org/10.1371/journal.pone.0291373.s006

(PDF)
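
The exclusion rule stated in the captions of S1 to S6 Figs (variables with more than 70% missing information were excluded) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy table, its column names and values, and the choice of Python are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional

# A column is a list of values in which None marks a missing entry.
Column = List[Optional[float]]


def missing_fraction(values: Column) -> float:
    """Return the fraction of entries in a column that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def drop_sparse_columns(table: Dict[str, Column],
                        threshold: float = 0.70) -> Dict[str, Column]:
    """Keep columns whose missing fraction is at most `threshold`.

    Columns with MORE than 70% missing values are dropped, mirroring
    the rule stated in the captions.
    """
    return {name: col for name, col in table.items()
            if missing_fraction(col) <= threshold}


# Hypothetical toy data (not taken from the study).
toy = {
    # 3 of 10 values missing (30%): kept
    "ferritin": [310.0, None, 455.0, 120.0, None, 98.0, 640.0, 210.0, None, 377.0],
    # 8 of 10 values missing (80%): dropped
    "smoking": [None, None, None, 1.0, None, None, None, None, 0.0, None],
}

kept = drop_sparse_columns(toy)
print(sorted(kept))  # ['ferritin']
```

In this sketch a column that is exactly 70% missing is retained, because the captions exclude only variables with more than 70% missing information.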

S7 Fig. Missing data patterns in multivariate data.

Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for continuous features, and proportions are shown for categorical variables.

https://doi.org/10.1371/journal.pone.0291373.s007

(PDF)

S8 Fig. Missing data patterns in multivariate data.

Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for continuous features, and proportions are shown for categorical variables (continued).

https://doi.org/10.1371/journal.pone.0291373.s008

(PDF)

S9 Fig. Missing data patterns in multivariate data.

Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for continuous features, and proportions are shown for categorical variables (continued).

https://doi.org/10.1371/journal.pone.0291373.s009

(PDF)

S10 Fig. Missing data patterns in multivariate data.

Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for continuous features, and proportions are shown for categorical variables (continued).

https://doi.org/10.1371/journal.pone.0291373.s010

(PDF)

S11 Fig. Missing data patterns in multivariate data.

Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for continuous features, and proportions are shown for categorical variables (continued).

https://doi.org/10.1371/journal.pone.0291373.s011

(PDF)

S12 Fig. Missing data patterns in multivariate data.

Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for continuous features, and proportions are shown for categorical variables (continued).

https://doi.org/10.1371/journal.pone.0291373.s012

(PDF)

S13 Fig. Missing data patterns in multivariate data.

Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for continuous features, and proportions are shown for categorical variables (continued).

https://doi.org/10.1371/journal.pone.0291373.s013

(PDF)

S14 Fig. Missing data patterns in multivariate data.

Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for continuous features, and proportions are shown for categorical variables (continued).

https://doi.org/10.1371/journal.pone.0291373.s014

(PDF)

S15 Fig. Missing data patterns in multivariate data.

Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for continuous features, and proportions are shown for categorical variables (continued).

https://doi.org/10.1371/journal.pone.0291373.s015

(PDF)

S16 Fig. Missing data patterns in multivariate data.

Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for continuous features, and proportions are shown for categorical variables (continued).

https://doi.org/10.1371/journal.pone.0291373.s016

(PDF)

S17 Fig. Missing data patterns in multivariate data.

Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for continuous features, and proportions are shown for categorical variables (continued).

https://doi.org/10.1371/journal.pone.0291373.s017

(PDF)

S18 Fig. Missing value imputation using random forest.

The figure compares the distributions of the original and imputed data. The magenta points represent the imputed values, and the blue points show the observed values. The plots suggest that the imputed values are plausible replacements for the missing points.

https://doi.org/10.1371/journal.pone.0291373.s018

(PDF)

S19 Fig. Correlation plot.

Heat map visualizing the correlation between the study features after removing the highly correlated features (correlation > 0.7 or < -0.7).

https://doi.org/10.1371/journal.pone.0291373.s019

(PDF)
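
The correlation filter mentioned in the S19 Fig caption (removing features correlated above 0.7 or below -0.7) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the caption does not say which correlation coefficient was used or which member of a correlated pair was removed, so the Pearson coefficient, the greedy "keep the first, drop the later one" rule, the function names, and the toy data are all hypothetical.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def drop_correlated(features: Dict[str, List[float]],
                    cutoff: float = 0.7) -> List[str]:
    """Greedily keep features whose |r| with every kept feature is <= cutoff.

    Assumption: features are visited in insertion order, and the later
    member of a highly correlated pair is the one that is dropped.
    """
    kept: List[str] = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy features (not taken from the study).
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],  # strongly positively correlated with a
    "c": [5.0, 4.1, 2.9, 2.0, 1.1],  # strongly negatively correlated with a
    "d": [3.0, 1.0, 4.0, 1.0, 5.0],  # only weakly correlated with a
}

print(drop_correlated(toy))  # ['a', 'd']
```

Taking the absolute value handles both tails of the caption's rule at once. A constant feature would make the denominator zero, so such columns would need to be removed before this step.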

Acknowledgments

We thank the Abu Dhabi Health Services Company (SEHA) healthcare system for providing the data used in this study.

References

  1. Dyusupova A., Faizova R., Yurkovskaya O., Belyaeva T., Terekhova T., Khismetova A., et al. Clinical characteristics and risk factors for disease severity and mortality of COVID-19 patients with diabetes mellitus in Kazakhstan: A nationwide study. Heliyon. 7, e06561 (2021,3), https://www.sciencedirect.com/science/article/pii/S2405844021006642 pmid:33763618
  2. Nemati M., Ansary J. & Nemati N. Machine-Learning Approaches in COVID-19 Survival Analysis and Discharge-Time Likelihood Prediction Using Clinical Data. Patterns. 1, 100074 (2020,8), https://www.sciencedirect.com/science/article/pii/S2666389920300945 pmid:32835314
  3. Williamson E., Walker A., Bhaskaran K., Bacon S., Bates C., Morton C., et al. Factors associated with COVID-19-related death using OpenSAFELY. Nature. 584, 430–436 (2020,8) pmid:32640463
  4. Schwab P., Mehrjou A., Parbhoo S., Celi L., Hetzel J., Hofer M., et al. Real-time prediction of COVID-19 related mortality using electronic health records. Nature Communications. 12, 1058 (2021,2) pmid:33594046
  5. Soltan A., Kouchaki S., Zhu T., Kiyasseh D., Taylor T., Hussain Z., et al. Rapid triage for COVID-19 using routine clinical data for patients attending hospital: development and prospective validation of an artificial intelligence screening test. The Lancet Digital Health. 3, e78–e87 (2021,2), Publisher: Elsevier pmid:33509388
  6. Sim B., Chidambaram S., Wong X., Pathmanathan M., Peariasamy K., Hor C., et al. Clinical characteristics and risk factors for severe COVID-19 infections in Malaysia: A nationwide observational study. The Lancet Regional Health Western Pacific. 4 (2020,11), Publisher: Elsevier pmid:33521741
  7. Holman N., Knighton P., Kar P., O’Keefe J., Curley M., Weaver A., et al. Risk factors for COVID-19-related mortality in people with type 1 and type 2 diabetes in England: a population-based cohort study. The Lancet Diabetes & Endocrinology. 8, 823–833 (2020,10), Publisher: Elsevier
  8. McGurnaghan S., Weir A., Bishop J., Kennedy S., Blackbourn L., McAllister D., et al. Risks of and risk factors for COVID-19 disease in people with diabetes: a cohort study of the total population of Scotland. The Lancet Diabetes & Endocrinology. 9, 82–93 (2021,2), Publisher: Elsevier pmid:33357491
  9. Petrilli C., Jones S., Yang J., Rajagopalan H., O’Donnell L., Chernyak Y., et al. Factors associated with hospital admission and critical illness among 5279 people with coronavirus disease 2019 in New York City: prospective cohort study. BMJ. 369 pp. m1966 (2020,5), http://www.bmj.com/content/369/bmj.m1966.abstract pmid:32444366
  10. Spooner A., Chen E., Sowmya A., Sachdev P., Kochan N., Trollor J. et al. A comparison of machine learning methods for survival analysis of high-dimensional clinical data for dementia prediction. Scientific Reports. 10, 20410 (2020,11) pmid:33230128
  11. Chang C., Deng Y., Jiang X. & Long Q. Multiple imputation for analysis of incomplete data in distributed health data networks. Nature Communications. 11, 5467 (2020,10) pmid:33122624
  12. Li J., Yan X., Chaudhary D., Avula V., Mudiganti S., Husby H., et al. Imputation of missing values for electronic health record laboratory data. Npj Digital Medicine. 4, 147 (2021,10) pmid:34635760
  13. Wang P., Li Y. & Reddy C. Machine Learning for Survival Analysis: A Survey. ACM Comput. Surv. 51 (2019,2), Place: New York, NY, USA Publisher: Association for Computing Machinery
  14. Moncada-Torres A., Maaren M., Hendriks M., Siesling S. & Geleijnse G. Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival. Scientific Reports. 11, 6968 (2021,3) pmid:33772109
  15. Steele A., Denaxas S., Shah A., Hemingway H. & Luscombe N. Machine learning models in electronic health records can outperform conventional survival models for predicting patient mortality in coronary artery disease. PLOS ONE. 13, e0202344 (2018,8), Publisher: Public Library of Science pmid:30169498
  16. Austin P., Harrell F. & Klaveren D. Graphical calibration curves and the integrated calibration index (ICI) for survival models. Statistics In Medicine. 39, 2714–2742 (2020,9), https://pubmed.ncbi.nlm.nih.gov/32548928, Edition: 2020/06/16 Place: England
  17. Li J., Wang L., Liu C., Wang Z., Lin Y., Dong X. et al. Exploration of prognostic factors for critical COVID-19 patients using a nomogram model. Scientific Reports. 11, 8192 (2021,4) pmid:33854118
  18. Yitbarek G., Walle Ayehu G., Asnakew S., Ayele F., Bariso Gare M., Mulu A., et al. The role of C-reactive protein in predicting the severity of COVID-19 disease: A systematic review. SAGE Open Medicine. 9 pp. 20503121211050755 (2021,1), Publisher: SAGE Publications Ltd pmid:34659766
  19. Henry B., Aggarwal G., Wong J., Benoit S., Vikse J., Plebani M. et al. Lactate dehydrogenase levels predict coronavirus disease 2019 (COVID-19) severity and mortality: A pooled analysis. The American Journal Of Emergency Medicine. 38, 1722–1726 (2020,9), https://www.sciencedirect.com/science/article/pii/S0735675720304368 pmid:32738466
  20. Carubbi F., Salvati L., Alunno A., Maggi F., Borghi E., Mariani R., et al. Ferritin is associated with the severity of lung involvement but not with worse prognosis in patients with COVID-19: data from two Italian COVID-19 units. Scientific Reports. 11, 4863 (2021,3) pmid:33649408
  21. Papamanoli A., Kalogeropoulos A., Hotelling J., Yoo J., Grewal P., Predun W., et al. Association of Serum Ferritin Levels and Methylprednisolone Treatment With Outcomes in Nonintubated Patients With Severe COVID-19 Pneumonia. JAMA Network Open. 4, e2127172–e2127172 (2021,10) pmid:34605919
  22. Alemzadeh E., Alemzadeh E., Ziaee M., Abedi A. & Salehiniya H. The effect of low serum calcium level on the severity and mortality of Covid patients: A systematic review and meta-analysis. Immunity, Inflammation And Disease. 9, 1219–1228 (2021), https://onlinelibrary.wiley.com/doi/abs/10.1002/iid3.528 pmid:34534417
  23. Torres B., Alcubilla P., González-Cordón A., Inciarte A., Chumbita M., Cardozo C., et al. & COVID19 Hospital Clínic Infectious Diseases Research Group Impact of low serum calcium at hospital admission on SARS-CoV-2 infection outcome. International Journal Of Infectious Diseases: IJID: Official Publication Of The International Society For Infectious Diseases. 104 pp. 164–168 (2021,3), https://pubmed.ncbi.nlm.nih.gov/33278624, Edition: 2020/12/02 Place: Canada
  24. Chen D., Li X., Song Q., Hu C., Su F., Dai J., et al. Assessment of Hypokalemia and Clinical Characteristics in Patients With Coronavirus Disease 2019 in Wenzhou, China. JAMA Network Open. 3, e2011122–e2011122 (2020,6) pmid:32525548
  25. Moreno-Pérez O., Leon-Ramirez J., Fuertes-Kenneally L., Perdiguero M., Andres M., Garcia-Navarro M., et al. Hypokalemia as a sensitive biomarker of disease severity and the requirement for invasive mechanical ventilation requirement in COVID-19 pneumonia: A case series of 306 Mediterranean patients. International Journal Of Infectious Diseases. 100 pp. 449–454 (2020,11), Publisher: Elsevier
  26. Alfano G., Ferrari A., Fontana F., Perrone R., Mori G., Ascione E., et al. & The Modena Covid-19 Working Group (MoCo19) Hypokalemia in Patients with COVID-19. Clinical And Experimental Nephrology. 25, 401–409 (2021,4) pmid:33398605
  27. Tan L., Wang Q., Zhang D., Ding J., Huang Q., Tang Y., et al. Lymphopenia predicts disease severity of COVID-19: a descriptive and predictive study. Signal Transduction And Targeted Therapy. 5, 33 (2020,3) pmid:32296069
  28. Ghizlane E., Manal M., Abderrahim E., Abdelilah E., Mohammed M., Rajae A., et al. Lymphopenia in Covid-19: A single center retrospective study of 589 cases. Annals Of Medicine And Surgery (2012). 69 pp. 102816–102816 (2021,9), https://pubmed.ncbi.nlm.nih.gov/34512964, Edition: 2021/09/08 Place: England
  29. Huang I. & Pranata R. Lymphopenia in severe coronavirus disease-2019 (COVID-19): systematic review and meta-analysis. Journal Of Intensive Care. 8 pp. 36–36 (2020,5), https://pubmed.ncbi.nlm.nih.gov/32483488, Place: England
  30. Xie J., Covassin N., Fan Z., Singh P., Gao W., Li G., et al. Association Between Hypoxemia and Mortality in Patients With COVID-19. Mayo Clinic Proceedings. 95, 1138–1147 (2020,6), Publisher: Elsevier pmid:32376101
  31. Tobin M., Laghi F. & Jubran A. Why COVID-19 Silent Hypoxemia is Baffling to Physicians. American Journal Of Respiratory And Critical Care Medicine. 202 (2020,6) pmid:32539537
  32. Lin Y., Yan K., Chen L., Wu Y., Liu J., Chen Y., et al. Role of a lower cutoff of high sensitivity troponin I in identification of early cardiac damage in non-severe patients with COVID-19. Scientific Reports. 12, 2389 (2022,2) pmid:35149778
  33. Manocha K., Kirzner J., Ying X., Yeo I., Peltzer B., Ang B., et al. Troponin and Other Biomarker Levels and Outcomes Among Patients Hospitalized With COVID-19: Derivation and Validation of the HA(2)T(2) COVID-19 Mortality Risk Score. Journal Of The American Heart Association. 10, e018477–e018477 (2021,3), https://pubmed.ncbi.nlm.nih.gov/33121304, Edition: 2020/10/30 Publisher: John Wiley and Sons Inc.
  34. Guadiana-Romualdo L., Morell-García D., Rodríguez-Fraga O., Morales-Indiano C., María Lourdes Padilla Jiménez A., Gutiérrez Revilla J., et al. Cardiac troponin and COVID-19 severity: Results from BIOCOVID study. European Journal Of Clinical Investigation. 51, e13532 (2021,6), Publisher: John Wiley & Sons, Ltd
  35. Xiang H., Fei J., Xiang Y., Xu Z., Zheng L., Li X., et al. Renal dysfunction and prognosis of COVID-19 patients: a hospital-based retrospective cohort study. BMC Infectious Diseases. 21, 158 (2021,2) pmid:33557785
  36. Uribarri A., Núñez-Gil I., Aparisi A., Becerra-Muñoz V., Feltes G., Trabattoni D., et al. Impact of renal function on admission in COVID-19 patients: an analysis of the international HOPE COVID-19 (Health Outcome Predictive Evaluation for COVID 19) Registry. Journal Of Nephrology. 33, 737–745 (2020,8), Place: Italy pmid:32602006
  37. Xia T., Zhang W., Xu Y., Wang B., Yuan Z., Wu N., et al. Early kidney injury predicts disease progression in patients with COVID-19: a cohort study. BMC Infectious Diseases. 21, 1012 (2021,9) pmid:34579666
  38. Reis G., Moreira Silva E., Medeiros Silva D., Thabane L., Singh G., Park J., et al. & TOGETHER Investigators Effect of Early Treatment With Hydroxychloroquine or Lopinavir and Ritonavir on Risk of Hospitalization Among Patients With COVID-19: The TOGETHER Randomized Clinical Trial. JAMA Network Open. 4, e216468–e216468 (2021,4) pmid:33885775
  39. Shih R., Johnson H., Maki D. & Hennekens C. Hydroxychloroquine for Coronavirus: The Urgent Need for a Moratorium on Prescriptions. The American Journal Of Medicine. 133, 1007–1008 (2020,9), Publisher: Elsevier pmid:32502485
  40. Hennekens C., Rane M., Solano J., Alter S., Johnson H., Krishnaswamy S., et al. Updates on Hydroxychloroquine in Prevention and Treatment of COVID-19. The American Journal Of Medicine. 135, 7–9 (2022,1), Publisher: Elsevier pmid:34437834
  41. Thomson K. & Nachlis H. Emergency Use Authorizations During the COVID-19 Pandemic: Lessons From Hydroxychloroquine for Vaccine Authorization and Approval. JAMA. 324, 1282–1283 (2020,10) pmid:32870235
  42. Coronavirus (COVID-19) Update: FDA Revokes Emergency Use Authorization for Chloroquine and Hydroxychloroquine, https://www.fda.gov/news-events/press-announcements/coronavirus-covid-19-update-fda-revokes-emergency-use-authorization-chloroquine-and, 14 04 2022.
  43. Hassanipour S., Arab-Zozani M., Amani B., Heidarzad F., Fathalipour M. & Hoyo R. The efficacy and safety of Favipiravir in treatment of COVID-19: a systematic review and meta-analysis of clinical trials. Scientific Reports. 11, 11022 (2021,5) pmid:34040117
  44. Manabe T., Kambayashi D., Akatsu H. & Kudo K. Favipiravir for the treatment of patients with COVID-19: a systematic review and meta-analysis. BMC Infectious Diseases. 21, 489 (2021,5) pmid:34044777