Abstract
Background
The unprecedented COVID-19 pandemic has prompted the use of Artificial Intelligence (AI) as an innovative tool for addressing evolving clinical challenges. One example is Machine Learning (ML), a subfield of AI whose models take advantage of observational data and Electronic Health Records (EHRs) to support clinical decision-making for COVID-19 cases. This study aimed to evaluate the clinical characteristics and risk factors of COVID-19 patients in the United Arab Emirates using EHRs and ML models for survival analysis.
Methods
We tested various ML models for survival analysis. The models were trained on different subsets of features extracted by several feature selection methods. Finally, the best model was evaluated and interpreted using the concordance index, goodness-of-fit based on calibration curves, and Partial Dependence Plots.
Results
The risk of severe disease increases with elevated levels of C-reactive protein, ferritin, lactate dehydrogenase, Modified Early Warning Score, respiratory rate, and troponin. The risk also increases with hypokalemia, hypocalcemia, lymphopenia, oxygen desaturation, and lower estimated glomerular filtration rate.
Citation: AlShehhi A, Almansoori TM, Alsuwaidi AR, Alblooshi H (2024) Utilizing machine learning for survival analysis to identify risk factors for COVID-19 intensive care unit admission: A retrospective cohort study from the United Arab Emirates. PLoS ONE 19(1): e0291373. https://doi.org/10.1371/journal.pone.0291373
Editor: Luis Felipe Reyes, Universidad de La Sabana, COLOMBIA
Received: November 23, 2022; Accepted: August 26, 2023; Published: January 11, 2024
Copyright: © 2024 AlShehhi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: “The data that support the findings of this study are available from Department of Health (DOH), Abu Dhabi, UAE medical.research@doh.gov.ae. But restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available.”
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The COVID-19 pandemic began in Wuhan, China, in December 2019 and has spread to every inhabited continent [1, 2]. In March 2020, the World Health Organization (WHO) categorized it as a global pandemic with remarkably high incidence and mortality rates [1–3]; to date, 683,955,862 individuals have been infected and 6,831,756 have died. Supportive care remains the main treatment available. Several COVID-19 vaccines have been developed and are currently used to reduce susceptibility to infection [1].
The pandemic has drastically challenged health care systems globally. The rise in the number of affected individuals exerted substantial pressure on health sectors, particularly given the limited capacity of intensive care units (ICUs) [2]. During the early stages of the pandemic, the main challenge for health professionals was to identify the cases most likely to progress from mild to severe disease or sudden death. Understanding the risk factors for severe disease can help clinicians provide timely and efficient intervention, including proper utilization of ICU facilities.
Since the beginning of the pandemic, a large body of quantitative research using Electronic Health Records (EHRs) has been undertaken for different objectives, such as predicting patient discharge time [2], predicting mortality risk [4], and building early-detection models for COVID-19 [5]. Several studies have reported the clinical characteristics and relevant risk factors for severe disease. In a Malaysian cohort of 5,889 confirmed COVID-19 patients [6], univariate and multivariate logistic regression identified chronic kidney disease, chronic pulmonary disease, fever, cough, diarrhea, breathlessness, tachypnoea, abnormal chest radiographs, and high serum CRP (≥5 mg/dL) at the time of hospital admission as risk factors for severe disease. A study of 17,278,392 patients from England [3] reported gender, age, diabetes, and asthma to be associated with severe COVID-19 using a multivariable Cox proportional hazards model. A second retrospective study from England [7] (n = 3,138,410), focusing on the diabetic population, used a Cox proportional hazards model to show that the severity of COVID-19 was associated with a history of cardiovascular disease, gender, age, renal impairment, non-white ethnicity, socioeconomic deprivation, poor glycemic control, and high body mass index (BMI). A representative Scottish cohort of 5,463,300 patients [8], also focusing on diabetic patients, reported using logistic regression that the risk factors associated with fatal or critical care unit-treated COVID-19 were gender, smoking, living in residential care, retinopathy, reduced renal function, worse glycemic control, diabetic ketoacidosis, and hypoglycemia. A retrospective study of the Kazakhstan diabetic population [1] (n = 1961) showed that COVID-19 was more severe in diabetic patients, who had higher rates of coexisting cardiovascular pathology and kidney disease.
Also, clinical symptoms such as impaired breathing, nausea/vomiting, and weakness/lethargy were worse in this group than in the matched non-diabetic group. These previous works reported several risk factors associated with COVID-19 severity. Age and gender are common risk factors for severe cases, while the other reported risk factors vary across countries.
This study aims to explore and report the association between patients’ medical information at hospital admission and deterioration leading to ICU admission. To the best of our knowledge, this is the first quantitative study employing machine learning for survival analysis to report risk factors for ICU admission in a United Arab Emirates (UAE) cohort. A cross-validated predictive model of critical care unit admission for COVID-19 patients is also explored.
Materials and methods
Ethical statement
This study was approved by the Institutional Review Board of the Department of Health, Abu Dhabi (approval number: IRB DOH/CVDC/2020/799). The review board waived the requirement for individual informed consent, as this study was part of an outbreak investigation in the UAE. All investigators had access only to anonymized patient information. This study was performed in accordance with the relevant laws and regulations that govern research in the Emirate of Abu Dhabi, UAE.
Data source
In this retrospective study, anonymized medical records of COVID-19 patients were extracted from the Abu Dhabi Health Services Company (SEHA) healthcare system. The SEHA dataset is a high-dimensional UAE population health data source that includes rich patient medical information such as sociodemographics, comorbidities, and, upon admission, symptoms, laboratory results, vital signs, and COVID-19 medications.
Study design
A COVID-19 cohort was identified and extracted by SEHA. The clinical diagnosis of COVID-19 was confirmed using reverse transcriptase polymerase chain reaction (RT-PCR) on nasal swabs. The data contained 1,800 registered patients (ages ≥ 18 years) admitted to any SEHA healthcare facility between March 1, 2020, and April 20, 2020. We excluded patients who were under 19 years of age following Petrilli et al. [9]; Fig 1 shows the flow diagram of the cohort inclusion and exclusion criteria. Patients were divided into two groups by severity, defined by ICU admission. The follow-up period ran from the date of hospital admission to the date of ICU admission or the date of discharge from the hospital. Some patients had up to 60 days of follow-up. We extracted all baseline (upon admission) covariates provided in the dataset.
Inclusion and exclusion criteria for determining the patients considered in this study. SEHA extracted 1,800 COVID-19 patients from March 1, 2020 to April 20, 2020. After applying the inclusion and exclusion criteria, our study included 1,787 patients.
Modeling
EHR data contain rich information about patients’ medical histories. These patient-centered records pose challenges because of their heterogeneous, high-dimensional nature as well as the presence of missing and censored records [10]. These challenges require advanced techniques to detect patterns and report unseen associations between the variables of interest and the outcome. In the following section, we outline the study pipeline used to address the challenges posed by such complex data and describe how we built a robust model to report the association between patient baseline information and the development of a critical condition leading to ICU admission.
Data pre-processing.
Missing values are a common problem in medical data such as electronic health records and need to be handled properly to avoid biased results [11]. We encountered this problem in our dataset; variables with missing values are reported in S1 to S6 Figs. To deal with this problem, we first excluded the variables with more than 70% missing information and imputed the remaining variables. Understanding and describing the missingness mechanism is an important step in determining the best way to handle missing values. The pattern of missingness was examined using pairs plots to assess the relationships between missing values and observed values in all variables [12] (S7 to S17 Figs). From the pattern analysis, we concluded that the data are missing at random (MAR). Therefore, multiple imputation using random forest was employed in this study. Multiple imputation using random forest is a popular and robust imputation method that is widely used for this task [11] (S18 Fig).
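The exclusion step described above can be illustrated with a minimal sketch. It assumes a simple in-memory table in which missing entries are represented by None; the function names, the variable names and the toy values are hypothetical and are not taken from the SEHA dataset. The random forest imputation itself is not shown.

```python
from typing import Dict, List, Optional

Column = List[Optional[object]]


def missing_fraction(values: Column) -> float:
    """Fraction of entries in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def drop_sparse_variables(table: Dict[str, Column],
                          threshold: float = 0.70) -> Dict[str, Column]:
    """Keep only variables whose missing fraction does not exceed the threshold.

    The 70% cut-off follows the text; everything else here is illustrative.
    """
    return {name: col for name, col in table.items()
            if missing_fraction(col) <= threshold}


# Hypothetical toy data: 'crp' is 30% missing, 'smoking' is 80% missing.
toy = {
    "crp": [12.0, None, 85.0, 40.0, None, 7.5, 60.0, 33.0, None, 18.0],
    "smoking": [None] * 8 + ["yes", "no"],
}
kept = drop_sparse_variables(toy)
print(list(kept))
```

In this toy table only the first variable survives the filter; the retained variables would then be passed to the imputation step.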
Baseline characteristics statistical analysis.
Differences in baseline patient information between the ICU and non-ICU groups were tested using the t-test for parametric continuous variables (assuming equal variances) and the Mann-Whitney U test for nonparametric continuous variables. The χ2 test (with continuity correction) was used for categorical variables, and Fisher’s exact test was used when cell counts were small. All statistical tests were performed at the 5% significance level (95% confidence level).
Statistical and machine learning models.
- Feature Selection: To address the high dimensionality of our dataset, we used several feature selection approaches to find prominent subsets of features for training our final models. Feature selection helps to speed up training, enhance model interpretability, and improve model performance. The following methods were used in this study:
- Random forest variable importance (RF Var Imp): This method relies on random permutation of a feature to calculate its importance. Model performance is measured before and after the permutation. If the prediction error increases significantly after the permutation, the feature is assigned an importance score that reflects how important it is to the final prediction [10].
- Random forest minimal depth (RF Min Depth): This method measures the shortest distance from the root of the tree to the largest subtree whose root is the feature of interest. The shorter the distance, the more influential and significant the feature is [10].
- Univariate Score using Survival tree: A survival tree is fitted for each feature in turn, and the importance of the feature is determined by the predictive performance of that tree.
- Cox model permutation importance (CPH Perm Imp): Similar to RF Var Imp, but the random forest model is replaced with a Cox model.
- Survival Analysis Models: Survival analysis is a subfield of statistics that focuses on modeling the time to an event, that is, the expected time until the event of interest occurs. For some patients the event of interest may not be observed during the study period; those data points are called censored data [2, 10]. A survival analysis approach was used in this study rather than a classification approach. In classification, the model has to be trained at each time step, and a classifier can only determine whether or not a patient will experience the event of interest, without knowing when the event will occur. Survival analysis, in contrast, considers the time until the event occurs [13]. One of the most popular survival analysis models in clinical studies is the Cox Proportional Hazards model, which lacks scalability to high-dimensional data [10]. Recently, many machine learning methods have been adapted for survival analysis to exploit their ability to handle complex relationships between variables and to scale to high-dimensional data [10]. Machine learning methods for survival analysis include Survival Tree, Gradient Boosting Machine, Random Forests, and Regularized Generalized Linear Model. The statistical and machine learning survival models considered in this study are described below:
- Multivariate Cox Proportional Hazards (CPH) Model: the standard and most popular survival analysis model in medical domains [14]. It is fast, computationally inexpensive, and easy to use and interpret. However, CPH has several drawbacks, such as its inability to deal with high-dimensional datasets and to model nonlinear interactions and correlations between features. In addition, the CPH model makes several assumptions that must be checked to produce valid results, such as proportional hazards, the absence of influential observations, and linearity of covariate effects [14].
- Survival Tree: a tree-based method similar to the traditional decision tree. The model recursively partitions tree nodes using splitting rules that incorporate censoring information, such as the log-rank rule, which is the most popular splitting rule for survival models [13].
- Random Forests (RFs): an extension of traditional random forests that incorporates censoring information during model training. An RF is an ensemble tree-based model; each individual decision tree splits its nodes using a rule that incorporates censoring information, such as log-rank splitting or the gradient-based Brier score. The final outcome is calculated by averaging the predictions of the trees [14].
- Regularized Generalized Linear Model (GLM): fits a regularized Cox model using a penalized negative log partial likelihood with an elastic net penalty. This model adds a regularization term to control overfitting and reduce model complexity. Three GLMs can be fitted depending on the value of alpha (α): elastic net, ridge, and lasso. The elastic net model bridges the gap between the other two penalty models, ridge (α = 0) and lasso (α = 1) [15].
- Gradient Boosting Machines (GBMs): stage-wise models that combine weak (tree-based) learners into a stronger model by using an optimization procedure such as gradient descent to minimize the objective (loss) function [14].
- Hyperparameter Optimization: The hyperparameters of the machine learning models were tuned with a random search algorithm to choose the set of parameters that yields the best performance. We used 5-fold cross-validation to assess the quality of the selected parameters. Table 4 presents the parameter search space of each model and the parameters selected for the final models.
- Model Evaluation and Explanation
- Discrimination (Concordance Index): Model performance was measured and compared using the concordance index (C-index). The C-index is a standard evaluation measure for survival analysis; it measures the proportion of concordant pairs among all comparable pairs [2, 14].
- Partial Dependence Plots (PDPs): Model interpretation and transparency are important factors in adopting machine learning for clinical practice [14]. In this study, we applied PDPs as a post hoc technique to explain model decisions. The plots show the marginal effect of a feature of interest, as a risk factor, on the outcome of interest [15].
- Calibration curve: a plot of the model-predicted probabilities against the observed event rates within a given duration. Subjects are first ranked by predicted survival probability and then partitioned into groups. The subjects in the upper group are those least likely to experience the event of interest, while those in the lower group are most likely to experience it [16].
- Model comparison: We compared the performance of the different algorithms over several runs with shuffling using the Kruskal-Wallis test, a non-parametric method for comparing distributions of model outcomes (C-index). We then performed pairwise comparisons among the models using the Nemenyi post hoc test to detect which models differ from each other.
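To make the concordance index described above concrete, the following is a minimal sketch of Harrell's C-index in plain Python. It assumes the common convention that a pair is comparable when the subject with the shorter follow-up time experienced the event, and that a higher risk score should correspond to an earlier event; ties in follow-up time are ignored for brevity. The function name and the toy values are hypothetical, and this is not necessarily the exact implementation used in the study.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[int],
                      risks: Sequence[float]) -> float:
    """Proportion of comparable pairs ordered correctly by the risk score.

    times  : follow-up time per subject (e.g. days to ICU admission or discharge)
    events : 1 if the event was observed, 0 if the subject was censored
    risks  : model risk score (higher means the event is expected sooner)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:  # subject i had the event before j's time
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: five patients, two of them censored.
times = [2, 3, 5, 7, 9]
events = [1, 1, 0, 1, 0]
risks = [0.9, 0.7, 0.4, 0.5, 0.1]
print(concordance_index(times, events, risks))
```

A value of 1 indicates that every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0 to a perfectly inverted ranking.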
Results
Baseline characteristics statistical analysis
Baseline sociodemographics, comorbidities, and, upon admission, symptoms, laboratory results, and vital information, together with descriptive statistics, are presented in Tables 1–3 for the 60 features included in this study. In general, the non-ICU group was younger and larger than the ICU group. Also, the majority of the population was male in both groups, with no significant difference between them. Several features differed significantly between the two groups, namely age, diet, obesity, diabetes, CKD, ESRD, cough, C-reactive protein, calcium level, chloride level, CO2 level, creatinine, ferritin level, etc. We used a correlation matrix to identify and remove highly correlated independent variables (correlation greater than 0.7 or less than -0.7), leaving 55 features for training the different models; the heat map of the correlation matrix is shown in S19 Fig.
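The correlation-based filtering can be sketched as follows. The text specifies only the 0.7 threshold; the use of Pearson correlation, the greedy rule that keeps the first feature of each highly correlated pair, and all feature names and values below are assumptions made for illustration.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / (sx * sy)


def drop_correlated(features: Dict[str, List[float]],
                    threshold: float = 0.7) -> List[str]:
    """Greedily keep a feature only if |r| <= threshold with every kept feature."""
    kept: List[str] = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy values: 'urea' tracks 'creatinine' almost linearly.
toy_features = {
    "creatinine": [0.8, 1.0, 1.4, 2.1, 3.0],
    "urea": [20.0, 26.0, 35.0, 52.0, 77.0],
    "potassium": [4.1, 3.6, 4.4, 3.9, 4.0],
}
selected = drop_correlated(toy_features)
print(selected)
```

The order in which features are visited determines which member of a correlated pair is retained, so a real analysis would need to state that rule explicitly.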
Statistical and machine learning models
Fig 2 illustrates the hyperparameter tuning matrix for the combinations of models and feature selection methods. The heatmap shows the mean C-index over 5-fold cross-validation; across all models, random forest minimal depth yielded the best selected features. The chosen hyperparameters and the number of features selected for each model are reported in Table 4. As the table shows, the number of features selected for each model ranges from 10 to 49. We trained all models with both the selected features and the complete feature set (Table 5). The table shows the C-index for each combination of selected feature subset and model. Gradient Boosting Machines (GBMs) slightly outperform CPH and RFs when trained on the features selected from CPH, GBMs, GLM, RFs, and Survival Tree.
Model evaluation and explanation
The Kruskal-Wallis test showed a significant difference between the models’ performance (H = 926.50, p ≤ 0.05). It was followed by Nemenyi’s multiple comparisons test (Table 6), which showed no significant difference between the performance of GBMs and that of CPH or RFs at the 95% confidence level. Based on Table 5, we selected the GBM model for further analysis and interpretation. Calibration curves for the probability of ICU access at 2, 3, and 5 days showed excellent agreement between model prediction and actual observation (Fig 3a–3c). The area under the time-dependent ROC curve (discrimination accuracy) for predicting ICU access was 96.8%, 96.8%, and 96.5% at 2, 3, and 5 days, respectively (Fig 3d–3f).
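For readers unfamiliar with the H statistic reported above, the following minimal sketch computes the Kruskal-Wallis H (with the standard tie correction) from per-model lists of C-index values. The function name and the C-index values are hypothetical and serve only to show the calculation; the p-value, which is obtained by comparing H with a chi-square distribution with k - 1 degrees of freedom for k groups, is not computed here.

```python
from typing import Dict, List, Sequence


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """Kruskal-Wallis H statistic with tie correction."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank_of: Dict[float, float] = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        # Sorted positions i..j-1 hold ranks i+1..j; tied values share the mean.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        t = j - i
        tie_term += t ** 3 - t
        i = j
    # Sum over groups of (rank sum squared) / (group size).
    total = 0.0
    for g in groups:
        rank_sum = sum(rank_of[v] for v in g)
        total += rank_sum * rank_sum / len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    return h / correction


# Hypothetical C-index values from repeated runs of three models.
c_index_runs: List[List[float]] = [
    [0.90, 0.91, 0.92],  # e.g. GBM
    [0.87, 0.88, 0.89],  # e.g. CPH
    [0.80, 0.81, 0.82],  # e.g. survival tree
]
h_stat = kruskal_wallis_h(c_index_runs)
print(round(h_stat, 3))
```

A large H relative to the chi-square reference indicates that at least one model's C-index distribution differs from the others, which is what motivates the pairwise post hoc comparison.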
Calibration curves and time-dependent ROC curves of the gradient boosting machine (GBM) model: calibration curves of predicted compared with observed ICU access after 2 days (a), 3 days (b), and 5 days (c) of hospital admission. Time-dependent ROC curves for predicting ICU admission after 2 days (d), 3 days (e), and 5 days (f) of hospital admission.
Finally, we explored the marginal risk effect of the model features of interest using PDPs. Fig 4 shows that the risk of disease severity rises with increases in C-reactive protein, ferritin level, lactate dehydrogenase (LDH) level, Modified Early Warning Score (MEWS), respiratory rate, and troponin level. In addition, the risk increases with lower potassium, calcium, oxygen saturation, estimated glomerular filtration rate (eGFR), and lymphocyte count. Finally, the adopted COVID-19 treatment regimen using hydroxychloroquine and favipiravir reduced the severity of COVID-19.
Discussion
Various studies have leveraged Machine Learning (ML) to characterize the clinical risk factors for COVID-19 severity globally. To the best of our knowledge, this is the first study in the UAE to use data from electronic health records (EHRs) to facilitate clinical decision-making during the COVID-19 pandemic using Machine Learning for survival analysis. In our study we identified the risk factors that increase the severity of COVID-19 (Fig 4) in a UAE cohort. The highest clinical risk variables featured by the model were inflammatory markers, including C-reactive protein, ferritin, and lactate dehydrogenase (LDH). Around seventy-six of the cases with elevated CRP upon admission entered the ICU. As reported in the literature, an elevated level of C-reactive protein might be an indicator of COVID-19 severity and/or mortality [17] and can be used as a biomarker to identify patients’ progression status [18]. Elevated LDH has been reported to be associated with respiratory failure in COVID-19 patients. It has been tagged as a marker of severe COVID-19, with a six-fold increase in the odds of progressing to severe disease [19]. Another factor identified in our study is serum ferritin, which has been reported as a predictor of severity in patients with COVID-19. A recent study indicated that elevated ferritin (above the 25th percentile) was associated with pulmonary involvement [20] and was targeted as a biomarker for therapeutic monitoring of methylprednisolone [21].
In this study, low serum calcium (hypocalcemia) was identified as a risk factor for COVID-19 severity. A recent meta-analysis [22] reported that hypocalcemia was significantly associated with COVID-19 severity, mortality, number of hospitalization days, and admission to the ICU. These findings support the use of serum calcium level as a prognostic marker for COVID-19, especially at initial assessment. Another study reported that COVID-19 patients with hypocalcemia were more likely to require high oxygen support during hospitalization and to be admitted to the ICU [23]. The role of calcium in the pathophysiology of SARS-CoV-2 is still not clear.
Low potassium (hypokalemia) was identified as a risk factor for COVID-19 severity. Many cohort studies have reported an association between hypokalemia and increased severity of COVID-19 among affected patients [24]. Another study reported an association between hypokalemia and COVID-19 pneumonia (Moreno-Pérez et al, 2020). Both studies suggest the presence of a renin-angiotensin disorder: SARS-CoV-2 binds to ACE2 and enhances its degradation, which reduces the counteraction of ACE2 on the renin-angiotensin system (RAS). The main challenge is to maintain the potassium level in the face of continuous renal potassium loss. Consequently, this can affect cardiovascular function, neurohormonal activation, and other vital organs such as the lungs. Therefore, severe hypokalemia in COVID-19 patients indicates that mechanical ventilation should be considered [25]. In an Italian cohort, hypokalemia was reported in 41% of hospitalized patients but was not associated with ICU admission or mortality [26].
Another risk factor identified in our study is a low lymphocyte count (lymphopenia). This finding aligns with various studies that reported an association between lymphopenia and the severity and hospitalization of COVID-19 patients [27–29]. Multiple mechanisms have been proposed to explain lymphocyte deficiency in COVID-19, including direct infection of lymphocytes by the virus, resulting in cell destruction, as lymphocytes express ACE2 receptors on their surface. However, further studies are needed to understand why lymphopenia is an indicator of COVID-19 severity and poor outcome.
Low oxygen saturation (hypoxemia) and high respiratory rate upon admission are other COVID-19 severity risk factors selected in our study. The relative risk is higher in patients with oxygen saturation below 88% and respiratory rate below 38 breaths per minute. As reported in the literature, oxygen saturation below 90% is a predictive risk factor for severe COVID-19 and/or mortality [30]. These factors can provide a clinical indication upon admission to consider patients for appropriate oxygen supplementation and timely access to hospital care, especially given the limited critical care resources during the COVID-19 pandemic. In addition, hypoxia was reported as an independent marker associated with in-hospital mortality in COVID-19 patients [30].
The initial Modified Early Warning Score (MEWS) is an important variable for measuring the deterioration of patients’ status in hospital-based settings. In our study, MEWS is a factor that predicts the risk of ICU admission (Fig 4). The initial MEWS is a significant indicator of ICU admission, especially in patients with silent hypoxemia [31]. The strength of this factor is that it leverages information from the EHR to provide an actionable strategy for COVID-19 patients during the pandemic. Hence, it reflects the effect of clinical decision aid tools on the health system.
Elevated cardiac troponin was observed to be a predictive risk factor for COVID-19 severity. Cardiac troponin is a well-known marker of myocardial injury. Various studies have evaluated elevated cardiac troponin as a biomarker of COVID-19 severity and as indicative of patient deterioration [32–34]. In addition, cardiac troponin was evaluated in COVID-19 patients as a risk factor for severity and an independent predictor of death within 30 days [34]. A recent study evaluated the cut-off of high-sensitivity troponin I in non-severe COVID-19 patients as an indicator of cardiac damage in the second week after onset [32].
Low estimated glomerular filtration rate (eGFR) is a risk factor for progression of COVID-19 severity in the studied cohort. This risk factor has been reported in various studies as a predictor of COVID-19 prognosis [35]. Uribarri et al. [36] clearly demonstrated the impact of renal function using the international HOPE COVID-19 (Health Outcome Predictive Evaluation for COVID-19) Registry. Upon admission of COVID-19 patients, kidney dysfunction is common, with various possible complications such as renal failure or in-hospital mortality [35]. Furthermore, patients with eGFR below 60 ml/min/m2 were found to have a higher risk of worse prognosis because of respiratory failure and sepsis. Fifty-six percent of COVID-19 patients with eGFR below 30 upon admission exhibited significant deterioration in their renal function. Renal involvement in COVID-19 patients is a very important risk factor that requires close follow-up during hospitalization to avoid potential renal complications. Furthermore, early identification of kidney injury can predict COVID-19 progression and poor prognosis [37].
In this study, we considered COVID-19 treatment as part of the analysis, since we are evaluating risk factors for ICU admission during hospitalization. As part of the initial treatment offered to inpatients with COVID-19 during the early pandemic, Hydroxychloroquine and other antiviral therapies such as Favipiravir were offered. In Fig 4, two pharmacotherapies were identified as factors that help avoid ICU admission. The findings from this model do not evaluate treatment effectiveness independently, as most of the cohort patients were under different pharmacotherapies (n = 735). Rather, the results only evaluate the effect of Hydroxychloroquine and Favipiravir on the ICU admission of COVID-19 patients during hospitalization. Early in the pandemic, with the increasing number of critically ill patients and the desperation of clinicians, preliminary data on Hydroxychloroquine on March 16, 2020 provided hope of a potential treatment for COVID-19 patients across the globe. On March 28, 2020, the US Food and Drug Administration (FDA) authorized the early use of Hydroxychloroquine. The dataset in this study represents the initial pandemic stage, when most symptomatic COVID-19 inpatients received Hydroxychloroquine with or without antiviral pharmacotherapy. Many follow-up studies reported inconclusive efficacy of Hydroxychloroquine in COVID-19 patients [38–40]. Based on well-defined scientific evidence, on June 15, 2020 the FDA revoked the approval [41, 42]. In addition, Hydroxychloroquine is no longer recommended in the UAE.
Favipiravir is an antiviral therapy that was initially used to treat COVID-19 in the early wave of the pandemic. No statistically significant effect of Favipiravir on oxygen supplementation, ICU admission, or mortality has been reported [43]. A meta-analysis reported a lack of Favipiravir effectiveness in reducing mortality among mild to moderate COVID-19 patients [43]. Other work reported that Favipiravir can promote viral clearance by 7 days and improve clinical outcomes within 14 days; however, it recommended further evaluation of the dosing and duration of treatment to validate the findings [44]. In our study, Favipiravir was identified as a factor that can reduce the chance of ICU admission. However, this finding is marginal (p-value = 0.068), possibly due to the small sample size in this category in this cohort. Various limitations were encountered in analyzing the small ICU group, for which a full-spectrum evaluation was not possible. There are other predictors and features that we did not consider in the implemented models, such as lifestyle, viral variant strains, viral load, and severity score. Furthermore, the model was not adjusted for smoking, as most of the values were missing from the dataset. As most patients in this cohort had pre-existing comorbidities, these patients were on different medications that were not included in this investigation. Medication effectiveness was not part of the analysis, as each pharmacotherapy group was too small to be representative.
Further validation is required to evaluate the performance of the model. Various attributes may change due to factors including the emergence of new virus strains, improvements in management practice, new pharmacotherapies, and vaccine availability around the globe.
Conclusion
In conclusion, predicting COVID-19 severity and evaluating risk factors during hospitalization is challenging. However, ML models can assist clinicians in identifying high-risk patients upon admission. The focus of our study was to evaluate risk factors in a UAE COVID-19 cohort to facilitate clinicians’ decisions on ICU admission when critical care resources are limited during rapidly evolving COVID-19 waves. To our knowledge, this is the first study of risk factors from EHRs in the UAE at the early pandemic stage. Various clinical markers can be used as predictor variables for ICU admission, including C-reactive protein, ferritin level, lactate dehydrogenase (LDH) level, Modified Early Warning Score (MEWS), respiratory rate, and troponin level; the risk also increases with lower potassium level, oxygen saturation, and estimated glomerular filtration rate (eGFR). By using the identified features, clinicians can adjust treatment plans and prioritize ICU admission for high-risk patients. Mortality prediction was not investigated in this study.
Supporting information
S1 Fig. Sociodemographic information missing values.
The figure illustrates the number of missing instances in each variable. Variables with more than 70% missing information were excluded from this analysis.
https://doi.org/10.1371/journal.pone.0291373.s001
(PDF)
S2 Fig. Comorbidity missing values.
The figure illustrates the number of missing instances in each variable. Variables with more than 70% missing information were excluded from this analysis.
https://doi.org/10.1371/journal.pone.0291373.s002
(PDF)
S3 Fig. Symptoms upon admission missing values.
The figure illustrates the number of missing instances in each variable. Variables with more than 70% missing information were excluded from this analysis.
https://doi.org/10.1371/journal.pone.0291373.s003
(PDF)
S4 Fig. Laboratory results upon admission missing values.
The figure illustrates the number of missing instances in each variable. Variables with more than 70% missing information were excluded from this analysis.
https://doi.org/10.1371/journal.pone.0291373.s004
(PDF)
S5 Fig. Vital information upon admission missing values.
The figure illustrates the number of missing instances in each variable. Variables with more than 70% missing information were excluded from this analysis.
https://doi.org/10.1371/journal.pone.0291373.s005
(PDF)
S6 Fig. COVID-19 therapy missing values.
The figure illustrates the number of missing instances in each variable. Variables with more than 70% missing information were excluded from this analysis.
https://doi.org/10.1371/journal.pone.0291373.s006
(PDF)
S7 Fig. Missing data patterns in multivariate data.
Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for the continuous features, and proportions are shown for the categorical variables.
https://doi.org/10.1371/journal.pone.0291373.s007
(PDF)
S8 Fig. Missing data patterns in multivariate data.
Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for the continuous features, and proportions are shown for the categorical variables (continued).
https://doi.org/10.1371/journal.pone.0291373.s008
(PDF)
S9 Fig. Missing data patterns in multivariate data.
Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for the continuous features, and proportions are shown for the categorical variables (continued).
https://doi.org/10.1371/journal.pone.0291373.s009
(PDF)
S10 Fig. Missing data patterns in multivariate data.
Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for the continuous features, and proportions are shown for the categorical variables (continued).
https://doi.org/10.1371/journal.pone.0291373.s010
(PDF)
S11 Fig. Missing data patterns in multivariate data.
Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for the continuous features, and proportions are shown for the categorical variables (continued).
https://doi.org/10.1371/journal.pone.0291373.s011
(PDF)
S12 Fig. Missing data patterns in multivariate data.
Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for the continuous features, and proportions are shown for the categorical variables (continued).
https://doi.org/10.1371/journal.pone.0291373.s012
(PDF)
S13 Fig. Missing data patterns in multivariate data.
Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for the continuous features, and proportions are shown for the categorical variables (continued).
https://doi.org/10.1371/journal.pone.0291373.s013
(PDF)
S14 Fig. Missing data patterns in multivariate data.
Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for the continuous features, and proportions are shown for the categorical variables (continued).
https://doi.org/10.1371/journal.pone.0291373.s014
(PDF)
S15 Fig. Missing data patterns in multivariate data.
Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for the continuous features, and proportions are shown for the categorical variables (continued).
https://doi.org/10.1371/journal.pone.0291373.s015
(PDF)
S16 Fig. Missing data patterns in multivariate data.
Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for the continuous features, and proportions are shown for the categorical variables (continued).
https://doi.org/10.1371/journal.pone.0291373.s016
(PDF)
S17 Fig. Missing data patterns in multivariate data.
Patterns of missingness between levels of the included variables. The pairs plots show relationships between missing values (gray) and observed values (blue) for all the features. Distributions are shown for the continuous features, and proportions are shown for the categorical variables (continued).
https://doi.org/10.1371/journal.pone.0291373.s017
(PDF)
S18 Fig. Missing value imputation using random forest.
The figure compares the distributions of the original and imputed data. The magenta points represent the imputed values, and the blue points show the observed values. The plots suggest that the imputed values are plausible replacements for the missing points.
https://doi.org/10.1371/journal.pone.0291373.s018
(PDF)
S19 Fig. Correlation plot.
Heat map visualizing the correlation between the study features after removing the highly correlated features (correlation > 0.7 or < -0.7).
https://doi.org/10.1371/journal.pone.0291373.s019
(PDF)
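As an illustration of the filtering step mentioned in the S19 Fig caption, the following is a minimal sketch of removing features whose pairwise correlation exceeds 0.7 in absolute value. It is not the authors' code: the use of Pearson correlation, the rule of dropping the later feature of each offending pair, the function names `pearson` and `drop_highly_correlated`, and the toy values are all assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_highly_correlated(columns, threshold=0.7):
    """Return the feature names kept after removing one of each correlated pair.

    columns: dict mapping feature name -> list of numeric values.
    Assumption: when |r| > threshold, the later feature of the pair is dropped;
    the source does not state which member of a pair was removed.
    """
    dropped = set()
    for a, b in combinations(columns, 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson(columns[a], columns[b])) > threshold:
            dropped.add(b)
    return [name for name in columns if name not in dropped]


# Hypothetical toy data, not taken from the study cohort.
toy = {
    "crp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "ferritin": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly proportional to crp
    "potassium": [4.0, 3.5, 4.2, 3.6, 4.1],
}
kept = drop_highly_correlated(toy)
```

In this toy example `ferritin` is removed because it is almost perfectly correlated with `crp`, while `potassium` is retained. In practice, the choice of which feature to keep from a correlated pair would normally be guided by clinical relevance or data completeness rather than column order.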
Acknowledgments
We thank the Abu Dhabi Health Services Company (SEHA) healthcare system for providing the data used in this study.
References
- 1. Dyusupova A., Faizova R., Yurkovskaya O., Belyaeva T., Terekhova T., Khismetova A., et al. Clinical characteristics and risk factors for disease severity and mortality of COVID-19 patients with diabetes mellitus in Kazakhstan: A nationwide study. Heliyon. 7, e06561 (2021,3), https://www.sciencedirect.com/science/article/pii/S2405844021006642 pmid:33763618
- 2. Nemati M., Ansary J. & Nemati N. Machine-Learning Approaches in COVID-19 Survival Analysis and Discharge-Time Likelihood Prediction Using Clinical Data. Patterns. 1, 100074 (2020,8), https://www.sciencedirect.com/science/article/pii/S2666389920300945 pmid:32835314
- 3. Williamson E., Walker A., Bhaskaran K., Bacon S., Bates C., Morton C., et al. Factors associated with COVID-19-related death using OpenSAFELY. Nature. 584, 430–436 (2020,8) pmid:32640463
- 4. Schwab P., Mehrjou A., Parbhoo S., Celi L., Hetzel J., Hofer M., et al. Real-time prediction of COVID-19 related mortality using electronic health records. Nature Communications. 12, 1058 (2021,2) pmid:33594046
- 5. Soltan A., Kouchaki S., Zhu T., Kiyasseh D., Taylor T., Hussain Z., et al. Rapid triage for COVID-19 using routine clinical data for patients attending hospital: development and prospective validation of an artificial intelligence screening test. The Lancet Digital Health. 3, e78–e87 (2021,2), Publisher: Elsevier pmid:33509388
- 6. Sim B., Chidambaram S., Wong X., Pathmanathan M., Peariasamy K., Hor C., et al. Clinical characteristics and risk factors for severe COVID-19 infections in Malaysia: A nationwide observational study. The Lancet Regional Health Western Pacific. 4 (2020,11), Publisher: Elsevier pmid:33521741
- 7. Holman N., Knighton P., Kar P., O’Keefe J., Curley M., Weaver A., et al. Risk factors for COVID-19-related mortality in people with type 1 and type 2 diabetes in England: a population-based cohort study. The Lancet Diabetes & Endocrinology. 8, 823–833 (2020,10), Publisher: Elsevier
- 8. McGurnaghan S., Weir A., Bishop J., Kennedy S., Blackbourn L., McAllister D., et al. Risks of and risk factors for COVID-19 disease in people with diabetes: a cohort study of the total population of Scotland. The Lancet Diabetes & Endocrinology. 9, 82–93 (2021,2), Publisher: Elsevier pmid:33357491
- 9. Petrilli C., Jones S., Yang J., Rajagopalan H., O’Donnell L., Chernyak Y., et al. Factors associated with hospital admission and critical illness among 5279 people with coronavirus disease 2019 in New York City: prospective cohort study. BMJ. 369 pp. m1966 (2020,5), http://www.bmj.com/content/369/bmj.m1966.abstract pmid:32444366
- 10. Spooner A., Chen E., Sowmya A., Sachdev P., Kochan N., Trollor J. et al. A comparison of machine learning methods for survival analysis of high-dimensional clinical data for dementia prediction. Scientific Reports. 10, 20410 (2020,11) pmid:33230128
- 11. Chang C., Deng Y., Jiang X. & Long Q. Multiple imputation for analysis of incomplete data in distributed health data networks. Nature Communications. 11, 5467 (2020,10) pmid:33122624
- 12. Li J., Yan X., Chaudhary D., Avula V., Mudiganti S., Husby H., et al. Imputation of missing values for electronic health record laboratory data. Npj Digital Medicine. 4, 147 (2021,10) pmid:34635760
- 13. Wang P., Li Y. & Reddy C. Machine Learning for Survival Analysis: A Survey. ACM Comput. Surv. 51 (2019,2), Place: New York, NY, USA Publisher: Association for Computing Machinery
- 14. Moncada-Torres A., Maaren M., Hendriks M., Siesling S. & Geleijnse G. Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival. Scientific Reports. 11, 6968 (2021,3) pmid:33772109
- 15. Steele A., Denaxas S., Shah A., Hemingway H. & Luscombe N. Machine learning models in electronic health records can outperform conventional survival models for predicting patient mortality in coronary artery disease. PLOS ONE. 13, e0202344 (2018,8), Publisher: Public Library of Science pmid:30169498
- 16. Austin P., Harrell F. & Klaveren D. Graphical calibration curves and the integrated calibration index (ICI) for survival models. Statistics In Medicine. 39, 2714–2742 (2020,9), https://pubmed.ncbi.nlm.nih.gov/32548928, Edition: 2020/06/16 Place: England
- 17. Li J., Wang L., Liu C., Wang Z., Lin Y., Dong X. et al. Exploration of prognostic factors for critical COVID-19 patients using a nomogram model. Scientific Reports. 11, 8192 (2021,4) pmid:33854118
- 18. Yitbarek G., Walle Ayehu G., Asnakew S., Ayele F., Bariso Gare M., Mulu A., et al. The role of C-reactive protein in predicting the severity of COVID-19 disease: A systematic review. SAGE Open Medicine. 9 pp. 20503121211050755 (2021,1), Publisher: SAGE Publications Ltd pmid:34659766
- 19. Henry B., Aggarwal G., Wong J., Benoit S., Vikse J., Plebani M. et al. Lactate dehydrogenase levels predict coronavirus disease 2019 (COVID-19) severity and mortality: A pooled analysis. The American Journal Of Emergency Medicine. 38, 1722–1726 (2020,9), https://www.sciencedirect.com/science/article/pii/S0735675720304368 pmid:32738466
- 20. Carubbi F., Salvati L., Alunno A., Maggi F., Borghi E., Mariani R., et al. Ferritin is associated with the severity of lung involvement but not with worse prognosis in patients with COVID-19: data from two Italian COVID-19 units. Scientific Reports. 11, 4863 (2021,3) pmid:33649408
- 21. Papamanoli A., Kalogeropoulos A., Hotelling J., Yoo J., Grewal P., Predun W., et al. Association of Serum Ferritin Levels and Methylprednisolone Treatment With Outcomes in Nonintubated Patients With Severe COVID-19 Pneumonia. JAMA Network Open. 4, e2127172–e2127172 (2021,10) pmid:34605919
- 22. Alemzadeh E., Alemzadeh E., Ziaee M., Abedi A. & Salehiniya H. The effect of low serum calcium level on the severity and mortality of Covid patients: A systematic review and meta-analysis. Immunity, Inflammation And Disease. 9, 1219–1228 (2021), https://onlinelibrary.wiley.com/doi/abs/10.1002/iid3.528 pmid:34534417
- 23. Torres B., Alcubilla P., González-Cordón A., Inciarte A., Chumbita M., Cardozo C., et al. & COVID19 Hospital Clínic Infectious Diseases Research Group Impact of low serum calcium at hospital admission on SARS-CoV-2 infection outcome. International Journal Of Infectious Diseases: IJID: Official Publication Of The International Society For Infectious Diseases. 104 pp. 164–168 (2021,3), https://pubmed.ncbi.nlm.nih.gov/33278624, Edition: 2020/12/02 Place: Canada
- 24. Chen D., Li X., Song Q., Hu C., Su F., Dai J., et al. Assessment of Hypokalemia and Clinical Characteristics in Patients With Coronavirus Disease 2019 in Wenzhou, China. JAMA Network Open. 3, e2011122–e2011122 (2020,6) pmid:32525548
- 25. Moreno-Pérez O., Leon-Ramirez J., Fuertes-Kenneally L., Perdiguero M., Andres M., Garcia-Navarro M., et al. Hypokalemia as a sensitive biomarker of disease severity and the requirement for invasive mechanical ventilation requirement in COVID-19 pneumonia: A case series of 306 Mediterranean patients. International Journal Of Infectious Diseases. 100 pp. 449–454 (2020,11), Publisher: Elsevier
- 26. Alfano G., Ferrari A., Fontana F., Perrone R., Mori G., Ascione E., et al. & The Modena Covid-19 Working Group (MoCo19) Hypokalemia in Patients with COVID-19. Clinical And Experimental Nephrology. 25, 401–409 (2021,4) pmid:33398605
- 27. Tan L., Wang Q., Zhang D., Ding J., Huang Q., Tang Y., et al. Lymphopenia predicts disease severity of COVID-19: a descriptive and predictive study. Signal Transduction And Targeted Therapy. 5, 33 (2020,3) pmid:32296069
- 28. Ghizlane E., Manal M., Abderrahim E., Abdelilah E., Mohammed M., Rajae A., et al. Lymphopenia in Covid-19: A single center retrospective study of 589 cases. Annals Of Medicine And Surgery (2012). 69 pp. 102816–102816 (2021,9), https://pubmed.ncbi.nlm.nih.gov/34512964, Edition: 2021/09/08 Place: England
- 29. Huang I. & Pranata R. Lymphopenia in severe coronavirus disease-2019 (COVID-19): systematic review and meta-analysis. Journal Of Intensive Care. 8 pp. 36–36 (2020,5), https://pubmed.ncbi.nlm.nih.gov/32483488, Place: England
- 30. Xie J., Covassin N., Fan Z., Singh P., Gao W., Li G., et al. Association Between Hypoxemia and Mortality in Patients With COVID-19. Mayo Clinic Proceedings. 95, 1138–1147 (2020,6), Publisher: Elsevier pmid:32376101
- 31. Tobin M., Laghi F. & Jubran A. Why COVID-19 Silent Hypoxemia is Baffling to Physicians. American Journal Of Respiratory And Critical Care Medicine. 202 (2020,6) pmid:32539537
- 32. Lin Y., Yan K., Chen L., Wu Y., Liu J., Chen Y., et al. Role of a lower cutoff of high sensitivity troponin I in identification of early cardiac damage in non-severe patients with COVID-19. Scientific Reports. 12, 2389 (2022,2) pmid:35149778
- 33. Manocha K., Kirzner J., Ying X., Yeo I., Peltzer B., Ang B., et al. Troponin and Other Biomarker Levels and Outcomes Among Patients Hospitalized With COVID-19: Derivation and Validation of the HA(2)T(2) COVID-19 Mortality Risk Score. Journal Of The American Heart Association. 10, e018477–e018477 (2021,3), https://pubmed.ncbi.nlm.nih.gov/33121304, Edition: 2020/10/30 Publisher: John Wiley and Sons Inc.
- 34. Guadiana-Romualdo L., Morell-García D., Rodríguez-Fraga O., Morales-Indiano C., María Lourdes Padilla Jiménez A., Gutiérrez Revilla J., et al. Cardiac troponin and COVID-19 severity: Results from BIOCOVID study. European Journal Of Clinical Investigation. 51, e13532 (2021,6), Publisher: John Wiley & Sons, Ltd
- 35. Xiang H., Fei J., Xiang Y., Xu Z., Zheng L., Li X., et al. Renal dysfunction and prognosis of COVID-19 patients: a hospital-based retrospective cohort study. BMC Infectious Diseases. 21, 158 (2021,2) pmid:33557785
- 36. Uribarri A., Núñez-Gil I., Aparisi A., Becerra-Muñoz V., Feltes G., Trabattoni D., et al. Impact of renal function on admission in COVID-19 patients: an analysis of the international HOPE COVID-19 (Health Outcome Predictive Evaluation for COVID 19) Registry. Journal Of Nephrology. 33, 737–745 (2020,8), Place: Italy pmid:32602006
- 37. Xia T., Zhang W., Xu Y., Wang B., Yuan Z., Wu N., et al. Early kidney injury predicts disease progression in patients with COVID-19: a cohort study. BMC Infectious Diseases. 21, 1012 (2021,9) pmid:34579666
- 38. Reis G., Moreira Silva E., Medeiros Silva D., Thabane L., Singh G., Park J., et al. & TOGETHER Investigators Effect of Early Treatment With Hydroxychloroquine or Lopinavir and Ritonavir on Risk of Hospitalization Among Patients With COVID-19: The TOGETHER Randomized Clinical Trial. JAMA Network Open. 4, e216468–e216468 (2021,4) pmid:33885775
- 39. Shih R., Johnson H., Maki D. & Hennekens C. Hydroxychloroquine for Coronavirus: The Urgent Need for a Moratorium on Prescriptions. The American Journal Of Medicine. 133, 1007–1008 (2020,9), Publisher: Elsevier pmid:32502485
- 40. Hennekens C., Rane M., Solano J., Alter S., Johnson H., Krishnaswamy S., et al. Updates on Hydroxychloroquine in Prevention and Treatment of COVID-19. The American Journal Of Medicine. 135, 7–9 (2022,1), Publisher: Elsevier pmid:34437834
- 41. Thomson K. & Nachlis H. Emergency Use Authorizations During the COVID-19 Pandemic: Lessons From Hydroxychloroquine for Vaccine Authorization and Approval. JAMA. 324, 1282–1283 (2020,10) pmid:32870235
- 42. Coronavirus (COVID-19) Update: FDA Revokes Emergency Use Authorization for Chloroquine and Hydroxychloroquine, https://www.fda.gov/news-events/press-announcements/coronavirus-covid-19-update-fda-revokes-emergency-use-authorization-chloroquine-and, 14 04 2022.
- 43. Hassanipour S., Arab-Zozani M., Amani B., Heidarzad F., Fathalipour M. & Hoyo R. The efficacy and safety of Favipiravir in treatment of COVID-19: a systematic review and meta-analysis of clinical trials. Scientific Reports. 11, 11022 (2021,5) pmid:34040117
- 44. Manabe T., Kambayashi D., Akatsu H. & Kudo K. Favipiravir for the treatment of patients with COVID-19: a systematic review and meta-analysis. BMC Infectious Diseases. 21, 489 (2021,5) pmid:34044777