Profiles and outcomes in patients with COVID-19 admitted to wards of a French oncohematological hospital: A clustering approach

Objectives: Although some prognostic factors for COVID-19 were consistently identified across the studies, differences were found for other factors that could be due to the characteristics of the study populations and the variables incorporated into the statistical model. We aimed to a priori identify specific patient profiles and then assess their association with the outcomes in COVID-19 patients with respiratory symptoms admitted specifically to hospital wards.
Methods: We conducted a retrospective single-center study from February 2020 to April 2020. A non-supervised cluster analysis was first used to detect patient profiles based on characteristics at admission of 220 consecutive patients admitted to our institution. Then, we assessed the prognostic value using Cox regression analyses to predict survival.
Results: Three clusters were identified, with 47 patients in cluster 1, 87 in cluster 2, and 86 in cluster 3; the presentation of the patients differed among the clusters. Cluster 1 mostly included sexagenarian patients with active malignancies who were admitted early after the onset of COVID-19. Cluster 2 included the oldest patients, who were generally overweight and had hypertension and renal insufficiency, while cluster 3 included the youngest patients, who had gastrointestinal symptoms and delayed admission. Sixty-day survival rates were 74.3%, 50.6% and 96.5% in clusters 1, 2, and 3, respectively. This was confirmed by the multivariable Cox analyses that showed the prognostic value of these patterns.
Conclusion: The cluster approach seems appropriate and pragmatic for the early identification of patient profiles that could help physicians segregate patients according to their prognosis.


Introduction
The coronavirus disease 2019 (COVID-19) epidemic has been spreading worldwide since the beginning of 2020. Copious data concerning the clinical presentation and prognosis of the disease have been published. The results report variable mortality rates ranging from 3.2 to 28% [1] and different risk factors associated with mortality, which is predominantly secondary to respiratory failure. These differences vary depending on the geographical location of the study [2], the characteristics of the study population, including whether patients were admitted to wards and/or intensive care units (ICU), and the prognostic variables selected for inclusion in the statistical model. The difficulties regarding the prediction of the progression of COVID-19 may be attributed to the low precision of the available tools and the absence of a more global approach to prognostic prediction. At a time when countries are facing subsequent waves of the pandemic, we need complementary data and complementary statistical approaches to better classify and manage patients admitted to the hospital for COVID-19. Many studies have focused on one predictor of mortality or ICU admission, such as lymphopenia [3], the platelet count [4] or the level of NT-proBNP [5]. Some studies have also analyzed mortality in specific populations, such as obese patients [6], diabetic patients [7], cancer patients [8] or kidney transplant recipients [9]. Although still small, the number of COVID-19 prognostic models is increasing, but their validity and widespread use remain uncertain [10,11]. A few studies have proposed rapid scoring systems that can be used to predict mortality in critically ill patients and non-critically ill patients. These scores included clinical parameters such as heart rate, systolic blood pressure, respiratory rate, body temperature, consciousness, Glasgow coma scale (GCS) score, level of oxygen saturation, and age [12]. Nevertheless, most studies have ignored the time to death.
In contrast, various clinical presentations of COVID-19 patients have been reported; most reports have described the prevalence of each symptom separately, either from a single cohort or from meta-analyses of previous reports [13], or they have focused on specific symptoms, such as neurological symptoms (including anosmia [13]) or ocular symptoms [14].
Based on descriptive studies focused on cancer patients [8,15,16] and studies that matched cancer patients to a control population [17], patients with cancer who developed COVID-19 were found to be at an elevated risk of mortality/severe disease [8,18,19]. However, although the univariate analyses from the largest descriptive studies identified cancer as a prognostic factor, it was not confirmed by multivariate analyses [20]. One explanation could be that these studies included only a small number of cancer patients (1 to 6%) [20]. Furthermore, these studies mainly included patients with solid cancers, and few hematological malignancies [15,21].
By contrast, less focus has been placed on patient profiles involving the entire combination of symptoms, especially in the ICU, or biological measurements such as cytokine levels and their relationships with the outcome [22,23,24].
Herein, we aimed to a priori identify specific patient profiles of COVID-19 and their association with outcomes in a French cohort of consecutive patients with respiratory symptoms at admission to wards in our institution specializing in onco-hematology.

Study population
We conducted a retrospective single-center study from February 2020 to April 2020. All consecutive patients who were admitted for at least 48 hours to one of the four different COVID-19 wards of Saint Louis Hospital (Paris, France), excluding those directly admitted to the ICU, were considered for inclusion in this study. Only patients with respiratory symptoms, namely, dyspnea, cough, thoracic pain and/or the need for supplemental oxygen (oxygen saturation at room air ≤ 94%), were selected for further analyses. The Saint Louis University Hospital is a 650-bed hospital, with 330 beds dedicated to the management of oncohematology patients.
This retrospective cohort study was approved by the Institutional Review Board of the French Learned Society for Respiratory Medicine (CEPRO 2020-029). All data were fully anonymized before we accessed them. Patients provided informed written consent to have data from their medical records used in research.

Data collection
The dataset contained records of all patients, including their basic information (record ID, age), height, weight, body mass index, type of admission (ward or ICU), comorbidities (diabetes, high blood pressure, cardiovascular diseases, kidney failure, active malignant disease (with ongoing treatment), HIV status), medical history pertaining to COVID-19 (time of symptom onset and all clinical symptoms), laboratory measures (complete blood count, electrolytes, and inflammatory markers, such as C-reactive protein (CRP), fibrinogen, ferritin, and D-dimer), radiological findings, and use of supplemental oxygen at admission. Final survival status up to 6 months after hospital discharge was collected by phone (alive or dead), and the time to death was recorded. Patient medical information was recorded from May 3, 2020.

Outcomes
The primary outcome was overall survival (OS), measured from the time of hospital admission until the date of last follow-up or death. At each hospital admission, the risk/benefit balance with regard to ICU transfer in case of clinical deterioration was discussed collaboratively. When the decision was not in favor of resuscitation, the patient received a "do not resuscitate" (DNR) order. For patients discharged alive, information regarding their status was obtained on September 25, 2020.
First, we used a principal component analysis (PCA) algorithm, a non-supervised statistical approach, to discover inherent but hidden profiles in the patient baseline data, as measured at the time of hospital admission; the resulting plots allow visualization of the distances between the variables and between the patients, thus facilitating interpretation of the data. Sixteen variables were used, namely age, body mass index, high blood pressure, malignancy, acute renal failure, chronic obstructive pulmonary disease, days elapsed since disease onset, oxygen flow at baseline, body temperature, cough, dyspnea, digestive symptoms, neurological symptoms, lymphocyte count, CRP, and platelet count. All data were scaled to unit variance. Missing values were imputed as a preliminary step before performing PCA on the completed dataset. Imputation used an iterative algorithm consisting of (i) imputing missing values with initial values such as the mean of the variable, (ii) performing PCA on the completed dataset, and (iii) re-imputing the missing values with the (regularized) fitted matrix. These steps of estimation of the parameters via PCA and imputation of the missing values are iterated until convergence [25]. In PCA plots, similar individuals (and the characteristics shared by these individuals) are represented as points that tend to group together, whereas dissimilarity results in distance between the points.
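The iterative imputation loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single principal component, omits the regularization and the unit-variance scaling, and finds the leading eigenvector by power iteration so that only the Python standard library is needed. The function name `iterative_pca_impute` and the toy data are hypothetical.

```python
import math


def iterative_pca_impute(rows, max_iter=1000, tol=1e-10):
    """Fill None entries using a rank-1 PCA fit, iterated until stable."""
    n, p = len(rows), len(rows[0])
    missing = [(i, j) for i in range(n) for j in range(p) if rows[i][j] is None]

    # Step (i): start from the mean of the observed values in each column.
    data = [list(r) for r in rows]
    for j in range(p):
        observed = [rows[i][j] for i in range(n) if rows[i][j] is not None]
        col_mean = sum(observed) / len(observed)
        for i in range(n):
            if data[i][j] is None:
                data[i][j] = col_mean

    for _outer in range(max_iter):
        # Step (ii): PCA on the completed data (first component only).
        means = [sum(data[i][j] for i in range(n)) / n for j in range(p)]
        centered = [[data[i][j] - means[j] for j in range(p)] for i in range(n)]
        cov = [[sum(centered[i][a] * centered[i][b] for i in range(n))
                for b in range(p)] for a in range(p)]
        v = [1.0 / math.sqrt(p)] * p
        for _inner in range(200):  # power iteration for the leading eigenvector
            w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]

        # Step (iii): replace the missing cells with the rank-1 fitted values.
        change = 0.0
        for i, j in missing:
            score = sum(centered[i][b] * v[b] for b in range(p))
            fitted = means[j] + score * v[j]
            change = max(change, abs(fitted - data[i][j]))
            data[i][j] = fitted
        if change < tol:
            break
    return data


# Toy data: the second column is twice the first, with one missing cell.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, None]]
filled = iterative_pca_impute(toy)
```

On this toy example the imputed cell starts at the column mean (4) and moves toward 8, the value on the line through the complete rows. Observed cells are never modified.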
Then, clustering was conducted on the five computed components of the PCA (accounting for 47.1% of the total variance) using an iterative partitioning k-means method, which aims to partition the observations into k clusters such that each observation belongs to the cluster with the nearest mean (cluster center), minimizing within-cluster variance. Initialization used the Forgy method, which randomly chooses k observations from the dataset and uses them as the initial means. The optimal number of clusters k was estimated as the value most frequently selected by 30 different indices, as proposed by Charrad et al. [26]. Hierarchical clustering on the results of the PCA was then conducted: starting from one cluster, the algorithm splits the data into 3 clusters according to the similarities measured by the distances among points.
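As an illustration of the partitioning step, the following sketch implements k-means with Forgy initialization, together with a majority vote over candidate values of k. It is a simplified stand-in, not the software used in the study; the function names, the `seed` parameter, and the toy points are assumptions introduced here.

```python
import random
from collections import Counter


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the point (first one wins on ties)."""
    return min(range(len(centers)), key=lambda c: squared_distance(point, centers[c]))


def kmeans_forgy(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Forgy initialization: k distinct observations serve as the initial means.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the means are stable too
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def majority_k(votes):
    """Number of clusters proposed most often across a set of validity indices."""
    return Counter(votes).most_common(1)[0][0]


# Toy example: six points in two dimensions (hypothetical PCA scores).
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans_forgy(pts, 2, seed=1)
```

Because Forgy initialization is random, different seeds can lead to different local optima. Running the algorithm from several starts and keeping the solution with the lowest within-cluster variance is a common safeguard.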
Clinical characteristics, disease presentation, and outcome were compared across the different clusters using the chi-square test or Fisher's exact test for qualitative variables, the nonparametric Kruskal-Wallis test for continuous variables, and the log-rank test for the censored outcome. The cumulative probability of OS was plotted using the Kaplan-Meier method.
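For readers unfamiliar with the Kaplan-Meier method, the product-limit estimate can be sketched in a few lines. The data below are invented for illustration and are unrelated to the cohort; `kaplan_meier` and `survival_at` are hypothetical helper names.

```python
def kaplan_meier(times, events):
    """Return [(event_time, survival_probability)] for right-censored data.

    events[i] is 1 if death was observed at times[i] and 0 if censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored subjects leave the risk set
    return curve


def survival_at(curve, t):
    """Estimated survival probability at time t (step function)."""
    s = 1.0
    for time, value in curve:
        if time <= t:
            s = value
    return s


# Invented follow-up times (days) with event indicators (1 = death).
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

A quantity such as a 60-day survival rate corresponds to reading the step function at day 60, for example `survival_at(curve, 60)`.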
Finally, prognostic analyses for OS were conducted according to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guidelines. A Cox proportional hazards model with a stepwise selection procedure was used to select covariates based on their statistical significance (P < .05) from among a list of variables with prognostic relevance according to the univariable analyses or previous findings of cluster analyses. Significant covariates were confirmed by forward selection and backward elimination techniques.
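The building block of such a selection procedure is the fit of a Cox model and the Wald test for each candidate covariate. The sketch below fits a single-covariate Cox model by Newton-Raphson on the partial likelihood; it does not reproduce the stepwise procedure itself. It assumes the Breslow form for tied event times and a 95% Wald interval, and the function names and toy data are hypothetical.

```python
import math


def cox_score_and_information(beta, times, events, x):
    """Score and observed information of the Cox partial likelihood (one covariate)."""
    score, info = 0.0, 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored subjects contribute only through risk sets
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        s0 = sum(math.exp(beta * x[j]) for j in risk)
        s1 = sum(x[j] * math.exp(beta * x[j]) for j in risk)
        s2 = sum(x[j] ** 2 * math.exp(beta * x[j]) for j in risk)
        score += x[i] - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return score, info


def fit_cox_univariable(times, events, x, tol=1e-9, max_iter=50):
    beta = 0.0
    for _ in range(max_iter):
        score, info = cox_score_and_information(beta, times, events, x)
        step = score / info  # Newton-Raphson update
        beta += step
        if abs(step) < tol:
            break
    _, info = cox_score_and_information(beta, times, events, x)
    se = 1.0 / math.sqrt(info)
    p_value = math.erfc(abs(beta / se) / math.sqrt(2.0))  # two-sided Wald test
    return {
        "beta": beta,
        "hr": math.exp(beta),
        "ci": (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)),
        "p": p_value,
    }


# Invented data: six deaths, x = 1 marks a hypothetical binary risk factor.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
x = [1, 1, 0, 1, 0, 0]
fit = fit_cox_univariable(times, events, x)
```

In a stepwise procedure, the Wald p-value returned here would be compared with the entry or removal threshold (P < .05 in the text). The sketch fails if the covariate does not vary within any risk set, because the information is then zero.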
All p-values were two-sided with values < .05 considered statistically significant.

Patients
A total of 330 consecutive patients hospitalized with COVID-19 in our hospital were enrolled in the study (Fig 1). Eighty-seven patients were directly admitted to the ICU. Among the 243 COVID-19 patients admitted to the wards, 220 (91%) patients had respiratory symptoms and were selected for further analyses. Table 1 reports the patients' characteristics at baseline. These 220 patients were admitted to 4 wards of Saint Louis Hospital; of these, 93 (42.3%) were aged > 65 years and 75 (34.1%) had an active malignant disease.

Clustering
Unsupervised statistical learning methods were used to discover inherent but hidden patterns in the data without any a priori hypotheses. Fig 2 displays the data on the first axes of the PCA, showing that age correlated with CRP and oxygen flow, whereas older patients were unlikely to have GI tract symptoms, regardless of malignancy; patients with malignancy more frequently had a high body temperature but were less likely to present with dyspnea. K-means was performed on the five computed components of the PCA, accounting for 47.1% of the data variance. According to the majority rule, the best number of clusters was 3. Hierarchical clustering then segregated 47 patients in cluster 1, 87 in cluster 2, and 86 in cluster 3; the three clusters differed in terms of presentation (Table 2). Cluster 1 mostly included sexagenarian patients with active malignancies. Cluster 2 included the oldest patients, who needed supplemental oxygen and had high C-reactive protein levels. Cluster 3 included the youngest patients with digestive symptoms. Note that inclusion of anosmia did not modify those results, with anosmia correlated with digestive disorders (S1

Follow-up
A total of 33 patients were admitted to the ICU, 13 of whom received mechanical ventilation. Sixty patients died, 8 of whom were non-DNR patients and 50 of whom had a DNR order, including 12 (25.5%) in cluster 1, 37 (42.5%) in cluster 2, and 1 (1.1%) in cluster 3. Patients with a DNR order included not only cancer patients but also the oldest patients and those who had many comorbidities. Patients with a DNR order had a higher mortality rate (64.1%) than non-DNR patients (6.7%) (p-value <0.0001). The 30-day survival rate was estimated to be 75.9% (95% CI, 70.5-81.8%) (Fig 3). Among the survivors, the median length of hospital stay was 5 days [IQR, 3 to 10 days], with a maximum of 56 days. We then examined whether the cluster distinctions had prognostic value. Survival differed substantially across the clusters defined above, with 60-day survival rates of 74.3% (cluster 1), 50.6% (cluster 2), and 96.5% (cluster 3) (Fig 3).

Prognostic analyses
Analyses of the patient outcomes were then performed in an attempt to define prognostic characteristics based on measured outcomes. We applied a Cox proportional hazards model to predict survival among the 220 patients admitted to the wards. Table 3 summarizes the results of the univariable and multivariable analyses. Among the 64 patients for whom the dosage was available, an elevated D-dimer level was associated with increased mortality (HR = 1.024 (CI, 1.006 to 1.042), p = 0.0084). The multivariable model identified five independent predictors of survival at the 0.001 level (Table 3). Interestingly, those variables were selected by non-supervised analyses and distinguished the clusters (Table 4).
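The reported hazard ratio, interval, and p-value for D-dimer can be checked for internal consistency with a short calculation. This sketch assumes that the interval is a 95% Wald confidence interval, which the text does not state explicitly; the function name is hypothetical.

```python
import math


def wald_p_from_hr_ci(hr, lower, upper, z_crit=1.96):
    """Recover the Wald z statistic and two-sided p-value from an HR and its CI."""
    # On the log scale the interval is log(HR) +/- z_crit * SE.
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


z, p = wald_p_from_hr_ci(1.024, 1.006, 1.042)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Under that assumption the recovered p-value is roughly 0.008, close to the reported p = 0.0084. The small gap is plausibly due to rounding of the published hazard ratio and interval bounds.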

Discussion
In this study, we identified three clusters of patients using a non-supervised approach, which allowed learning from the data without any prior hypotheses. The three clusters of patients distinguished based on their initial profiles exhibited different outcomes. This was unexpected, given that such an approach only attempts to reduce the dimensionality of the dataset from information provided at baseline, thus ignoring patient outcomes. However, this finding has important implications given the emerging nature of the disease, and the need to reduce the delay in the observation of outcomes from cohort studies to understand the prognostic value of patient presentation.
Patients selected in cluster 2 had the worst survival. They were characterized by older age, more comorbidities, and a higher level of need for supplemental oxygen. Cluster 1 included a majority of patients with active malignancies and intermediate outcomes, and we identified specific patients managed at our institution. The characteristics of these patients may reflect the management and close follow-up provided because of the underlying active malignancy: early admission after the onset of COVID-19 symptoms and lower lymphocyte and platelet cell counts. DNR orders, which have been previously found to be a predictor of mortality [27], were less common in cluster 1 than in cluster 2. This is probably because of the selection of patients with active malignancies but few comorbidities and younger age in cluster 1. Finally, cluster 3 had the best outcome and included the youngest patients, who were characterized by relatively few comorbidities and COVID-related gastrointestinal (GI) symptoms.
The prognostic value of the clusters was confirmed by multivariable prognostic analyses, which selected five key independent clinical variables associated with mortality, namely, age, active malignancy, dyspnea, supplemental oxygen (>5 L/min), and acute renal failure, which were also differentially distributed across the clusters. These results, obtained from an unsupervised model, confirm and reinforce the results obtained with supervised analytical models [6,12,19,28]. Older age is certainly the strongest risk factor for a poor outcome. It has been identified as such in most of the studies, regardless of the population studied [12]. A recent meta-analysis further identified more than 30 independent clinical or biological risk factors for severe COVID-19, most of which were in agreement with the results of previous meta-analyses [29]. Although our findings may appear somewhat expected, with young and old patients differing in terms of both presentation and outcomes, the poor outcome of patients with malignancies is worthy of attention. Indeed, surprisingly, malignancy was not associated with a poor outcome in multivariate analyses in previous studies in the general patient population. This may reflect the small proportion of patients with cancer in most studies and the lack of distinction between patients with active cancer or a past history of cancer [20,30]. Laboratory findings have also been associated with the prognosis of COVID-19, including blood cell counts, markers of inflammation, and coagulation factors. As practices have evolved over time, we were only able to include biological parameters that have been routinely used in the cluster model, i.e., blood cell counts and the levels of CRP and creatinine. Unfortunately, while the D-dimer level was associated with prognosis in many studies and its measurement has become routine, missing data in our cohort prevented us from including this parameter in the multivariable analysis. We were also not able to study the levels of IL-6 and troponin. However, we used simple clinical and biological parameters that are accessible in all centers, making our findings pragmatic.
We found that patients with GI tract symptoms had better survival than the others, although most of them had dyspnea, which is known to be a poor prognostic factor. This finding is in accordance with recent data that further showed that patients with GI symptoms have reduced levels of circulating cytokines associated with inflammation and tissue damage [31]. This could be explained by the various settings of those studies, which involved the overall population (mostly in China) or hospitalized patients (Europe and the US) and sometimes focused on specific subpopulations, such as patients with hypertension [32].
Our study has limitations. As this was a retrospective study, there is always potential for biases. It should be noted that at the time of the study, the treatment for COVID-19 was not standardized. Nonetheless, we made systematic efforts to obtain a thorough and detailed history from each patient included in the study, in part by performing a chart review, and we performed prolonged follow-up after patient discharge. The study was monocentric, given that it was scheduled in the emergency context of the pandemic; however, patients were prospectively enrolled in 4 different wards in the hospital, involving different specialists (from pulmonary to infectious diseases, post-emergency care or internal medicine) and therefore cared for by different teams, which somewhat increases its external validity. Furthermore, the outcome results were extracted from the very early cases of the "first wave" of COVID-19; it would be interesting to investigate whether outcomes have changed in the subsequent waves. Last, we used PCA as the method of data reduction, whereas newer methods of dimensionality reduction, such as autoencoders based on neural networks, could have been used; these can handle non-linearity, allowing the model to learn more powerful generalizations than PCA and to reconstruct the input with significantly lower information loss [33]. Other data mining techniques that combine non-supervised and supervised information, such as subgroup discovery, which extracts interesting rules with respect to a target variable, or semisupervised learning methods, could also be of interest. However, we placed ourselves in the setting of defining clusters of patients from baseline information, that is, of evaluating patient profiles when no target outcome is available, even if their relationship with the outcome is of interest.

Conclusion
This study in a large cohort of COVID-19 patients admitted to wards with respiratory symptoms identified different patient profiles based on their history and presentation at the time of hospital admission; these profiles correlated with patient outcomes. This study emphasized the heterogeneity among the profiles and outcomes of COVID-19 patients in hospital wards, as well as the similarities of profiles compared to a recent Spanish cohort. The cluster approach seems appropriate and pragmatic to help physicians segregate patients according to their predicted outcomes.