Clinical and biological clusters of sepsis patients using hierarchical clustering

Background Heterogeneity in sepsis expression is multidimensional, including highly disparate data such as the underlying disorders, infection source, causative micro-organismsand organ failures. The aim of the study is to identify clusters of patients based on clinical and biological characteristic available at patients’ admission. Methods All patients included in a national prospective multicenter ICU cohort OUTCOMEREA and admitted for sepsis or septic shock (Sepsis 3.0 definition) were retrospectively analyzed. A hierarchical clustering was performed in a training set of patients to build clusters based on a comprehensive set of clinical and biological characteristics available at ICU admission. Clusters were described, and the 28-day, 90-day, and one-year mortality were compared with log-rank rates. Risks of mortality were also compared after adjustment on SOFA score and year of ICU admission. Results Of the 6,046 patients with sepsis in the cohort, 4,050 (67%) were randomly allocated to the training set. Six distinct clusters were identified: young patients without any comorbidities, admitted in ICU for community-acquired pneumonia (n = 1,603 (40%)); young patients without any comorbidities, admitted in ICU for meningitis or encephalitis (n = 149 (4%)); elderly patients with COPD, admitted in ICU for bronchial infection with few organ failures (n = 243 (6%)); elderly patients, with several comorbidities and organ failures (n = 1,094 (27%)); patients admitted after surgery, with a nosocomial infection (n = 623 (15%)); young patients with immunosuppressive conditions (e.g., AIDS, chronic steroid therapy or hematological malignancy) (n = 338 (8%)). Clusters differed significantly in early or late mortality (p < .001), even after adjustment on severity of organ dysfunctions (SOFA) and year of ICU admission. Conclusions Clinical and biological features commonly available at ICU admission of patients with sepsis or septic shock enabled to set up six clusters of patients, with very distinct outcomes. Considering these clusters may improve the care management and the homogeneity of patients in future studies.


Introduction
In European intensive care units (ICU), the frequency of admission for sepsis still ranges from 10 to 64% [1]. Although the prognosis has improved thanks to a better management of vital organ support, the in-ICU mortality of patients with sepsis still ranges between 20 to 30% [2]. Despite many studies, the lack of effective specific therapies remains the main issue for the management of septic patients. The heterogeneity of patients included in studies focusing on sepsis could partly explain these failures [3][4][5]. Sepsis is not a specific illness, but rather a syndrome encompassing a still-uncertain pathobiology. Currently, the classification of sepsis is only based on the etiology, the distinction between sepsis and septic shock, and mortality risk levels after stratification on severity scores [6]. These scores do not capture adequately the heterogeneity of sepsis [7]. As explained by J. Castela Forte, to personalize and improve treatments of sepsis, patients must be clustered into common phenotypes based on clinically objective parameters reflecting disease mechanisms. After external validation step, these clusters could be grouped by underlying causal mechanisms and will improve patient characterization; optimized design and powering of randomized control trials. Finally, these clusters can allow identifying differential response patterns by considering baseline characteristics of sepsis patients [8]. Cluster analysis refers to statistical methods of data partitioning whereby objects or individuals are grouped into homogeneous groups on the basis of similarity independent of any outcome variables. Several methods may be used, each of them with advantage and pitfalls [9]. The multiple correspondence analysis (MCA) approach, combined with hierarchical clustering (HC) has already been used in several diseases such as chronic obstructive pulmonary disease, asthma, obstructive sleep apnea, lung cancer and vasculitis [10][11][12][13][14]. This clustering method has two advantages; it limits the noise in the data set without inducing bias by reducing the variables in coordinates obtained for each patient which summarize the main part of the information and does not require defining a priori the number of clusters.
In ICU, acute respiratory distress syndrome (ARDS) was evaluated through a cluster analysis [15]. Two clusters were described, with consistent outcomes and responses to treatment; these two ARDS clusters were a tangible discovery made thanks to a retrospective analysis of negative clinical trials. In sepsis, several studies used cluster analysis mainly for genotypic or transcriptomic approach [16,17]. Recently, Seymour et al conducted a clustering analysis on several large cohorts of patients meeting the definition of sepsis 3.0 [18]. Four interesting patient clusters were identified that correlated with host-response patterns and clinical outcomes. However, understanding the phenotype of critically ill patients should include more data characterizing patient comorbidity and infection [19]. The comparison of cluster analysis on different cohorts is therefore essential.
The primary objective of our study was to identify clusters among patients with sepsis by considering data available at admission including: underlying disorders, source of infection, micro-organism, biological host response and organ failures. The secondary objectives were to assess their heterogeneity on outcomes and validate them in an independent dataset.

Study design and data source
We conducted a retrospective analysis of a prospective observational multicenter database (OutcomeRea™). The database, fed by 20 French ICUs, collects prospective data on daily disease severity, iatrogenic events, and nosocomial infections. Each year, each ICU includes a random sample of at least 50 patients who have ICU stays longer than 24 h. Each ICU could choose to obtain the random sample by taking either consecutive admissions to selected ICU beds throughout the year or consecutive admissions to all ICU beds for 1 month. This study was approved by our institutional review board (CECIC Clermont-Ferrand-IRB n˚5891; Ref: 2007-16), which waived the need for signed informed consent of the participants, in accordance with French legislation on non-interventional studies. However, the patients and their next of kin were asked whether they were willing to participate in the database and use of their personal anonymized data, none declined participation.

Participants
All patients admitted in ICU for sepsis and septic shock were included. ICU-acquired infections were excluded. The presence or absence of infection at admission was prospectively recorded by clinical physicians according to the standard definitions developed by the Centers for Disease Control and Prevention and recently updated [21].
According to the new Sepsis 3.0 definition [5], sepsis was defined as a life-threatening organ dysfunction, identified by an increase by 2 points or more of the SOFA score, associated with an infection. Patients were included in the OUTCOMEREA™ database prior to this new definition. Accordingly, the increase in the SOFA score was retrospectively calculated. The maximum SOFA score measured on the first day of ICU stay was used. Prior to admission in ICU, SOFA score baseline can be considered to be at zero. The baseline SOFA score of patients with chronic renal replacement therapy was assumed at 4 points. Septic shock was defined by a need to administer vasopressor agents for maintaining a mean arterial pressure of 65 mmHg or greater, and a serum lactate level greater than 2 mmol/L (>18 mg/dL) in the absence of hypovolemia.
automatic check for internal consistency, generating queries that were sent to the ICUs for resolution before incorporation of the new data into the database. In each participating ICU, data quality is checked by having a senior physician from another participating ICU who performs a review of a 2% random sample of the study data every other year. A 1-day data-capture training course held once a year is open to all OutcomeRea™ investigators and study monitors. All qualitative variables used in the analyses had Cohen's kappa coefficient > 0.8 and all variables had inter-rater coefficients in the 0.67 to 1 range, indicating good to excellent reproducibility.

Statistical methods
For variables with less than 20% of missing values, we performed a multiple imputation of missing data using Markov Chain Monte Carlo. Sixty-three clinical and biological variables were available at admission. Description of 63 variables included in the cluster analysis was available S1 Table. The original data set was randomly split into a training set (2/3 of the patients) and a validation set (1/3 of the patients). The statistical analyses comprised 4 steps: 1) Reduction dimension and cluster analysis; 2) Cluster description, 3) Outcomes, 4) Binary tree and cluster validation (Fig 1). Due to the absence of recommendations, we empirically chose the combination of MCA and HC.
Cluster analysis. An MCA was performed to reduce the dimension of the 63 variable of the dataset (S1 Table) in "Euclidian patient-coordinates" dataset. Because MCA is based on qualitative variables, quantitative variables we categorized. The first 52 dimensions out of a total of 79, which explained at least 90% of the total variability, were considered in the HC [20]. The HC was performed on this patient-coordinates dataset using the Ward's minimumvariance. Initially, each patient was his own cluster, and was thereafter merged into larger clusters to minimize the within-cluster homogeneity and to maximize the inter-cluster heterogeneity. The final number of clusters was defined on the basis of the Semi partial R-Squared, the Squared-R, Pseudo F statistic and The Pseudo t 2 statistic. There is no consensus in the literature on the final choice of the number of clusters regardless of the clustering method. Kmeans algorithm was considered to identify potential outliers in the dataset [21].
Cluster description. Variables were described separately for each cluster, by the use of median and interquartile range (IQR) for quantitative variables, and frequency and percent for qualitative variables. The probability for each variable of belonging to one cluster was assessed using odds ratios, determined by a univariable logistic regression.
Outcomes. The associations between clusters and mortality were assessed by using the status of the patients at day-28, day-90 and one year after admission. The log-rank test was used to compare clusters for mortality. Risk of mortality were described after adjustment on SOFA score at admission using Cox model. Analyses were performed on sub-groups of patients with septic shock. The length of ICU and hospital stay, the number of ventilator-free days at day-28, the duration of the renal replacement therapy, the number of catecholaminefree days at day-28, and the number of organ system failure-free days at day-28 were evaluated.
Binary tree. To build a simple tool able to assign a new patient into clusters using only data that are commonly available at ICU admission, a binary tree was performed using classification and regression tree (CART) [22,23]. The accuracy of the binary tree was evaluated using sensitivity, specificity and Area under the Receiver Operating Characteristic (AUC) in the training set.
Cluster validation. Cluster analysis was also applied in the validation set. Results obtained were compared to clusters obtained by applying the binary tree using AUC. New cluster analysis and cluster description were performed after exclusion of the oldest data (patients admitted before 2008).
Data analyses were performed using R (The R foundation, Vienna, Austria) and SAS version 9.4 (SAS Institute Inc., Cary, NC) [24].

Subject demographics
The new definition of sepsis applied on the 18,840 patients of the OUTCOMEREA1 database yielded a total of 6,046 patients admitted for a first episode of sepsis between 1997 and 2015, of whom 58% (n = 3,479) had a septic shock (S1 Fig). Details of missing data are available (S1 Table). The initial sample was split in two sets: a training set (n = 4,050 (67%)) and a validation set (n = 1,996 (33%)) (Fig 1 and S1 Table).

Binary tree
Six discriminatory variables available on admission were identified by CART methods to assign new patients into a cluster (S8 Fig). The distribution of the patients amongst the clusters was similar. The accuracy of the binary tree in the training set is shown on S3 Table.

Cluster validation
Six clusters were identified in the validation set (S9 Fig). The contributions of variables in the construction of the first four dimensions of the MCA are depicted on S10 Fig. S4 Table describe the distribution of demographic characteristics, various comorbidities, sources of infection, micro-organisms, clinical and biological data, and the organ failures at admission. The accuracy of the binary tree in the validation set is shown on S5 Table. The results of the clusters analysis after exclusion of the oldest data (admission before 2008) are provided in S6 Table.

Fig 2. Mortality estimated by Kaplan-Meier according to the cluster assignment with log rank tests (performed in training set). Definition of abbreviations:
Each curve was compared one by one using a log rank test; Results of these tests are presented in a double entry matrix; each cluster can be identified by its color; Analysis was performed including all patients in sepsis or septic shock. Cluster 1 = young patients without any comorbidities, admitted in ICU for community-acquired pneumonia; Cluster 2 = young patients without any comorbidities, admitted in ICU for meningitis or encephalitis; Cluster 3 = elderly patients with COPD, admitted in ICU for bronchial infection with few organ failures; Cluster 4 = elderly patients with several comorbidities and organ failures; Cluster 5 = patients admitted after surgery with a nosocomial infection; Cluster 6 = young patients with immunosuppressive disease or therapy, such as AIDS, chronic steroid therapy or hematological malignancy. https://doi.org/10.1371/journal.pone.0252793.g002

Discussion
Our work is an original clinical study based on a large sample that tried to reduce the heterogeneity of septic population. Using the new definition of sepsis 3.0 [4], we performed a HC based on clinical and biological data commonly available at ICU admission. We were able to discriminate 6 rather homogeneous clusters of patients with sepsis and septic shock. Three clusters were characterized by underlying disorders, while two clusters were characterized by the source of infection. Baseline risk of death at day-28, day-90, and one year were significantly different across clusters, independently of organ failures at admission. After having developed a binary classification tree, we were able to identify similar results from the validation subset. The ability to affect a patient with sepsis to a homogeneous cluster should enable to achieve a personalized ICU medical care strategy, and to test potentially appropriate new therapies by performing more efficiently targeted clinical trials. The simplest way to do so is to use classification tree. The analysis was performed after exclusion of patients without septic shock. A Cox model was used to determine the hazard ratio; Data are reported as HR ± 95% confidence intervals, presented from lowest to highest;presented from lowest to highest; Cluster 3 was used as reference; Adjusted mortality were adjusted using SOFA score at admission and year of ICU admission. Cluster 1 = young patients without any comorbidities, admitted in ICU for community-acquired pneumonia; Cluster 2 = young patients without any comorbidities, admitted in ICU for meningitis or encephalitis; Cluster 3 = elderly patients with COPD, admitted in ICU for bronchial infection with few organ failures; Cluster 4 = elderly patients with several comorbidities and organ failures; Cluster 5 = patients admitted after surgery with a nosocomial infection; Cluster 6 = young patients with immunosuppressive disease or therapy such as AIDS, chronic steroid therapy or hematological malignancy. https://doi.org/10.1371/journal.pone.0252793.g003 The low influence of organ dysfunction to characterize clusters was the main result that was comparable with the study of Seymour et al [18]. Two clusters share identical characteristics, one that included young patients with low comorbidity, and one that included elderly patients with high Charlson scores. Some similarities can be found between their δ clusters and our cluster 5. Patients had more liver dysfunction, and more often a septic shock and intra-abdominal sepsis. It was the most frequent site of infection of this cluster. Interestingly, we identified a well-defined cluster including immunocompromised patients, that mainly comprised hematologic patients. These patients were excluded of the SENECA cohorts, GenIMS cohort, and of the ProCESS trial, PROWESS trial or ACCESS trial.
We identified a small cluster that included patients with COPD exacerbation. These patients met to the definition of sepsis 3.0, yet mechanisms of organ failure were probably very different from other cluster. The inclusion of these patients in sepsis studies should be discussed.
Some limitations must be acknowledged for this study. First, these phenotypes will reduce heterogeneity prior to randomization; however, because they will not necessarily create better trials, further studies are necessary to better explain this clustering [31]. Second, the training set and the validation set come from the same database. A validation with an external database would have been more robust. Third, the time of patient's admission on our study ranged from 1997 to 2015 with heterogeneous periods of inclusion and number of patients included between each center (S7 Table). The improvement of the prognosis related to the improvements in sepsis management is no longer debated. This difference in risks of mortality was taken into account by an adjustment on the year of ICU admission and a sensitivity analysis was performed without the oldest data. Finally, unsupervised analysis is not unbiased because only the data available within the database can be explored, and only the data that are known or thought to be important are entered [25]. Some data would have been necessary, in particular data about clinical management or specific therapy [26]. Also, there is a lack of information about cytokines, ethnicity, genetic polymorphism, precise dating of the infection onset, and biomarkers like C-reactive protein test. Several studies have focused on identification of sepsis molecular phenotypes based on gene expression data [16,27,28]. Between two to four clusters were identified. Genotype or endotype approach has the potential to substantially improve our understanding of the key biological pathways involved in human diseases and to suggest new targets for treatment or prevention. However, studies using these approaches often failed to replicate positive findings, especially when investigating associations with sepsis outcomes [29,30]. The possible explanations include low statistical power, heterogeneous patient populations, and imprecise definition of phenotypes [31]. As suggested in a review by Clark et al, sepsis defines a syndrome rather than a specific disease. This may lead to a marked heterogeneity in patient populations, which may explain some of the variations in the findings. Also, he suggested to use more precise phenotypes, for example, meningococcal sepsis or fecal peritonitis. According to Rautanen et al, promising results might be generated by focusing on more homogenous subgroups such as sepsis patients with pneumonia [32]. The understanding of the immune system and its interaction with pathogens must take into account the high dimensionality of the data, future studies are needed to aggregate different data in order to combine clinical data, host genomics, transcriptomic responses and cytokines, and using data science approaches accounting for longitudinal data.
When the hypothesis is made that the treatment effect is similar in all patients and evolves linearly with the severity of illness, the expected effect of the new treatment in a randomized trial measured by the reduction of the risk of death is dependent of the baseline risk of death and its distribution in the sample. This was described and illustrated by Kent [33]. In their simulation study of therapeutic trials on ARDS and sepsis, Iwashyna et al showed that the variation in the baseline risk of death was the main determinant of the heterogeneity of the treatment response [34]. Thus, the impact of the same treatment applied to populations with a different baseline risk of death spanned from an increase in the risk of death in a low risk population to a decrease in the risk of death in patients whom baseline risk of death was high. However, patients with sepsis are intrinsically heterogeneous, not only in their baseline risk of death [35,36], but also in their risk of adverse outcome [37]. Identifying treatable traits and setting an accurate diagnosis are the major challenges in sepsis [37]. Improving prognostic estimation, and performing comparisons and benchmarking of processes and outcomes between different ICUs are also necessary. Although the current definition of sepsis groups was designed to reduce heterogeneity of the patients [4], an important variability can still be observed for many patients' characteristics such as the sources of infection, causative pathogen (s), age, lifestyle, comorbidities, or genetic profile.

Conclusion
Because the prognostication and the identification of target patients remain difficult in sepsis, new approaches are necessary. In patients who met the new sepsis 3.0 definition, six clusters, clearly different in their clinical and biological presentation, were identified by using hierarchical clustering. These clusters also differed in mortality and severity of illness. Considering these clusters may reduce the uncontrolled differences in patients' prognosis and improve the power of studies. Future works including big data analysis, clinical and genomic data and several biomarkers may contribute to better defining homogeneous subsets of sepsis patients. Dendrogram obtained after application of hierarchical clustering analysis by accounting for the 51 dimensions of the multiple correspondence analysis. The vertical axis of the dendrogram represents the distance between clusters. The horizontal vertical axis represents the patients and clusters. Each junction between two clusters is represented on the graph by the split of a vertical line into two vertical lines. The vertical position of the split, shown by the short horizontal bar, gives the distance between the two clusters. The red line shows the cut level that determines the number of clusters. The indices used to determine this cut level, Semi partial R-Squared, the Squared-R, the Pseudo F statistic and the Pseudo t2 statistic, are presented in S3 The analysis was performed after exclusion of patients without septic shock. A logistic regression was used to determine the odds ratio; Data are reported as odds ratios ± 95% confidence intervals, presented from lowest to highest; Cluster 3 was used as reference class; Adjusted mortality were adjusted using SOFA score at admission and year of ICU admission. Cluster 1 = young patients without any comorbidities, admitted in ICU for communityacquired pneumonia; Cluster 2 = young patients without any comorbidities, admitted in ICU for meningitis or encephalitis; Cluster3 = elderly patients with COPD, admitted in ICU for bronchial infection with few organ failures; Cluster 4 = elderly patients with several comorbidities and organ failures; Cluster 5 = patients admitted after surgery with a nosocomial infection; Cluster 6 = young patients with immunosuppressive disease or therapy, such as AIDS, chronic steroid therapy or hematological malignancy. (DOCX)

S8 Fig. Binary tree using variables available at the ICU admission.
Definition of abbreviations: ICU = intensive care unit; The binary tree was built in the training set using Breiman methods [22] with Rpart package version 4.1-10 [23], R version 3.1.0. The structure is similar to a real tree, from the bottom up: There is a root, where the first split happens. After each split, two new nodes are created. Each node contains only a subset of the patients with a minimum size of 20. The partitions of the data, which are not split any more, are called terminal nodes or leafs. The second stage of the procedure consists to prune the tree using cross-validation. Pruning means to shorten the tree, which makes trees more compact and avoids over fitting to the training data. Each split is examined, if it brings a reliable improvement. The six variables used by the binary tree are: Lung infection, Surgical admission, Hematological malignancy, Bronchial infection, Chronic heart failure and Meningeal infection. The accuracy of the binary tree evaluated in the training dataset was available on S3 Table. Cluster 1 = young patients without any comorbidities, admitted in ICU for community-acquired pneumonia; Cluster 2 = young patients without any comorbidities, admitted in ICU for meningitis or encephalitis; Cluster3 = elderly patients with COPD, admitted in ICU for bronchial infection with few organ failures; Cluster 4 = elderly patients with several comorbidities and organ failures; Cluster 5 = patients admitted after surgery with a nosocomial infection; Cluster 6 = young patients with immunosuppressive disease or therapy, such as AIDS, chronic steroid therapy or hematological malignancy.  Table. Outcomes of patients (performed in training set). Definition of abbreviations: ICU = intensive care unit; IQR = interquartile range; Free days were censored at 28 days; A ventilator-free day refers to a day without invasive or non-invasive mechanical ventilation or death; A catecholamine-free day refers to a day without to vasoactive or inotropic agent or death; An organ system failure-free day refers to a day without SOFA score upper zero or death; Cluster 1 = young patients without any comorbidities, admitted in ICU for communityacquired pneumonia; Cluster 2 = young patients without any comorbidities, admitted in ICU for meningitis or encephalitis; Cluster 3 = elderly patients with COPD admitted in ICU for bronchial infection with few organ failures; Cluster 4 = elderly patients, with several comorbidities and organ failures; Cluster 5 = patients admitted after surgery with a nosocomial infection; Cluster 6 = young patients with immunosuppressive disease or therapy, such as AIDS, chronic steroid therapy or hematological malignancy. Values in Numbers (%) or median [IQR]. P-values were obtained by Analysis of variance or Chi-2 test. (DOCX) S3 Table. Accuracy of the binary tree (performed in training set). Definition of abbreviations: Truly assigned = Number of patients correctly assigned as belonging to the cluster according to the total number of patients in the cluster; Falsely assigned = Number of patients incorrectly labeled as belonging to the cluster according to the total number of patients out of the cluster; Se = Sensitivity; Sp = Specificity; AUC = Area under the Receiver Operating Characteristic curve; IC95% = 95% confidence intervals; Cluster 1 = young patients, without any comorbidities, admitted in ICU for community-acquired pneumonia; Cluster 2 = young patients, without any comorbidities, admitted in ICU for meningitis or encephalitis; Cluster 3 = elderly patients with COPD, admitted in ICU for bronchial infection with few organ failures; Cluster 4 = elderly patients, with several comorbidities and organ failures; Cluster 5 = patients admitted after surgery with a nosocomial infection; Cluster 6 = Young patients, with immunosuppressive disease or therapy, such as AIDS, chronic steroid therapy or hematological malignancy.  Table. Accuracy of the binary tree (performed in validation set). Definition of abbreviations: Truly assigned = Number of patients correctly assigned as belonging to the cluster according to the total number of patients in the cluster; Falsely assigned = Number of patients incorrectly labeled as belonging to the cluster according to the total number of patients out of the cluster; Se = Sensitivity; Sp = Specificity; AUC = Area under the Receiver Operating Characteristic curve; IC95% = 95% confidence intervals; Cluster 1 = young patients, without any comorbidities, admitted in ICU for community-acquired pneumonia; Cluster 2 = young patients, without any comorbidities, admitted in ICU for meningitis or encephalitis; Cluster 3 = elderly patients with COPD, admitted in ICU for bronchial infection with few organ failures; Cluster 4 = elderly patients, with several comorbidities and organ failures; Cluster 5 = patients admitted after surgery with a nosocomial infection; Cluster 6 = Young patients, with immunosuppressive disease or therapy, such as AIDS, chronic steroid therapy or hematological malignancy.