Benchmarking machine learning models on multi-centre eICU critical care dataset

Progress of machine learning in critical care has been difficult to track, in part due to absence of public benchmarks. Other fields of research (such as computer vision and natural language processing) have established various competitions and public benchmarks. Recent availability of large clinical datasets has enabled the possibility of establishing public benchmarks. Taking advantage of this opportunity, we propose a public benchmark suite to address four areas of critical care, namely mortality prediction, estimation of length of stay, patient phenotyping and risk of decompensation. We define each task and compare the performance of both clinical models as well as baseline and deep learning models using eICU critical care dataset of around 73,000 patients. This is the first public benchmark on a multi-centre critical care dataset, comparing the performance of clinical gold standard with our predictive model. We also investigate the impact of numerical variables as well as handling of categorical variables on each of the defined tasks. The source code, detailing our methods and experiments is publicly available such that anyone can replicate our results and build upon our work.


Introduction
Increasing availability of clinical data and advances in machine learning have addressed a wide range of healthcare problems, such as risk assessment and prediction in acute, chronic and critical care. Critical care is a particularly data-intensive field, since continuous monitoring of patients in Intensive Care Units (ICU) generates large streams of data that can be harnessed by machine learning algorithms. However, progress in harnessing digital health data faces several obstacles, including reproducibility of results and comparability between competing models. While, other areas of machine learning research, such as image and natural language processing have established a number of benchmarks and competitions (including ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [1] and National NLP Clinical Challenges (N2C2) [2], respectively), progress in machine learning for critical care has been difficult to measure, in part due to absence of public benchmarks. Availability of large clinical data sets, including Medical Information Mart for Intensive Care (MIMIC III) [3] and more recently, a multi-centre eICU Collaborative Research Database [4] are opening the possibility of establishing public benchmarks and consequently tracking the progress of machine learning models in critical care. Availing of this opportunity, we propose a public benchmark suite to address four areas of critical care, namely mortality prediction, estimation of length of stay, patient phenotyping and risk of decompensation. We define each task and evaluate our algorithms on a multi-centre dataset of 73,718 patients (containing 4,564,844 clinical records) collected from 335 ICUs across 208 hospitals. While there has been work in this area that has focused on the single-center MIMIC III clinical dataset [5], our work is the first to focus on a multi-center critical care dataset, the eICU database [4]. Evaluating models on a multi-center dataset typically results in the inclusion of a wider range of patient groups, larger number of patients, external validity and lower systematic bias in comparison to a single-center dataset, resulting in increased generalisability of the study [6,7]. However building a predictive model on a multi-centre dataset is more challenging due to heterogeneity of the data. Nevertheless, the performance of our models (as measured by AUC ROC) compare favourably with the performance of the models using the single-center MIMIC III dataset as reported in [5].
The main contributions of this work are as follows: i) we provide the baseline performance, (using either on clinical gold standard or Logistic/Linear Regression algorithm) and compare it against our benchmark result, achieved using a model based on bidirectional long shortterm memory (BiLSTM); ii) investigate impact of categorical and numerical variables on all four benchmarking tasks; iii) evaluate entity embedding for categorical variables, versus one hot encoding; iv) show that for some tasks the number of variables can be reduced significantly without greatly impacting prediction performance; and v) report six evaluation metrics for each of the tasks, facilitating direct comparison with future results. The source code for our experiments is publicly available at https://github.com/mostafaalishahi/eICU_Benchmark, so that anyone with access to the public eICU database can replicate our experiments and build upon our work.

Ethics statement
This study was an analysis of a publicly-available, anonymised database with pre-existing institutional review board (IRB) approval; thus, no further approval was required.

eICU dataset description and cohort selection
The eICU Collaborative Research Database [4] is a multi-center intensive care unit database with high granularity data for over 200,000 admissions to ICUs monitored by eICU programs across the United States. The eICU database comprises 200,859 patient unit encounters for 139,367 unique patients admitted between 2014 and 2015 to 208 hospitals located throughout the US. We selected adult patients (age > 18) that had an ICU admission with at least 15 records, leading to 73,718 unique patients with a median age of 62.41 years (Q1-Q3: 52-75), 45.5% female. Hospital mortality rate was 8.3% and average length of stay in hospital and in unit were 5.29 days and 3.9 days respectively (further details are provided in Table 1). Cohort selection criteria are detailed in Fig 1. The final patient cohort contained 4,564,844 clinical records where we grouped these records on 1 hour window, imputed the missing values based on the mean of that window and took the last valid record of that specific window.
Out of 31 tables in the eICU (v1.0) database we selected variables from the following tables: patient (administrative information and patient demographics), lab (Laboratory measurements collected during routine care), nurse charting (bedside documentation) and diagnosis based on advice from a clinician as well as consistency with other similar tasks reported in the related work section. Selected variables are shown in Table 2 and are common across all the four tasks.

Description of tasks
In this section, we define four different benchmark tasks, namely in-hospital mortality prediction, remaining length of stay (LoS) forecasting, patient phenotyping, and risk of physiologic decompensation. After applying selection criteria for each task, the resulting patient cohorts are outlined in Table 3 Mortality prediction. In-hospital mortality is defined as the patient's outcome at the hospital discharge. This is a binary classification task, where each data sample spans a 1-hour window. The cohort for this task was selected based on the presence of hospital discharge status in patients' record and length of stay of at least 48 hours (we focus on prediction during the first 24 and 48 hours). This selection criteria resulted in 30,680 patients containing 1,164,966 records.
Length of stay prediction. Length of stay is one of the most important factors accounting for the overall hospital costs, as such its forecast could play an important role in healthcare management [8]. Length of stay is estimated through analysis of events occurring within a fixed time-window, once every hour from the initial ICU admission. This is a regression task, where we use 20 clinical variables described in Table 2. For this cohort we selected patients whose length of stay was present in their records. These selection criteria resulted in 73,389 ICU stays, containing 3,054,314 records. The mean length of stay was 1.86 days with standard deviation of 1.94 days, as shown in Table 1.
Phenotyping. Phenotyping is a classification problem where we classify whether a condition (ICD-9 code) is present in a particular ICU stay record. Since any given patient may have more than one ICD-9 code, this is defined as a multi-label classification problem.
While our definition is focused on diagnosis using ICD codes for this task, the definition of phenotyping may encompass other domains, such as procedures [9] [10] for example. However, expanding the definition of phenotyping beyond standardised ICD codes would have required development of non-standardised rules, as no common standard approach for  [12]. Consequently, it would have been challenging to compare this work with the already published benchmarks. Furthermore, there is some concern regarding reproducibility of rule-based phenotyping as found in [12]. Considering these issues, as well as keeping consistent with previously published benchmarks, we settled on using ICD codes as the basis for the definition of this task. Accordingly, the dataset contains 767 unique ICD codes, which are grouped into 25 categories shown in Table 4. The cohort for this task, considering initial inclusion criteria as well as recorded diagnosis during the ICU stay, resulted in 49,299 patients. Physiologic decompensation. There are a number of ways to define decompensation, however in clinical setting majority of early warning systems, such as National Early Warning Score (NEWS) [13] are based on prediction of mortality within the next time window (such as 24 hours after the assessment). Following suit and keeping consistent with previously published benchmarks [5], we also define decompensation as a binary classification problem, where the target label indicates whether the patient dies within the next 24 hours. The cohort for this task results in 55,933 patients (2,800,711 records), where the decompensation rate is around 6.5% (3,664 patients).

Prediction algorithms
Baselines. We compare our model with two standard baseline approaches namely, logistic/linear regression (LR) and a 1-layer artificial neural network (ANN). The embeddings for these models are learned in the same way as for the proposed BiLSTM model as explained in the section that follows.
Deep learning models. In this section, we describe the selected clinical variables, approaches to represent these variables as well as baseline and deep models used in this study. The architecture of this work consists of three modules, namely input module, encoder module and output module as shown in Fig 2. Input representation: We process and model both numerical and categorical variables separately, as shown in Table 2. Categorical variables are represented using either one-hot encoding (OHE) or entity embedding (EE). OHE is the baseline approach that converts the variables into binary representation. Using this approach for our 7 categorical variables results in 429 unique records, rendering a large sparse matrix. In response, we represent each variable as an embedding and compare the performance with the OHE approach. We use entity embedding [14], where each categorical variable in the dataset is mapped to a vector and the corresponding embedding is added to the patient's record. This entity embedding is learned by the neural network during the training phase along with other parameters. So the final representation of the input at time t is as follows: where Num t is the numerical variable, Cat t is the categorical variable at time t and U is the embedding matrix learned by the model. Encoder: To capture sequential dependency in our data, we use Recurrent Neural Network (RNN) that resemble a chain of repeating modules to efficiently model sequential data [15]. They take sequential data X = (x 1 , x 2 , . . ..x n ) as input and provide a hidden representation H = (h 1 , h 2 , . . ..h n ) which captures the information at every time step in the input. Formally, where x t is the input at time t, W is the parameter of RNN learned during training and f is a non-linear operation such as sigmoid, tanh or ReLU.
A drawback of regular RNNs is that the input sequence is fed in one direction, normally from past to future. In order to capture both past and future context, we use a Bidirectional Long Short Term Memory (BiLSTM) [16] [17] for our model, which processes the input in both forward and backward direction. Using a BiLSTM the model is able to capture the context of a record not only by its preceding records but also with the following records, allowing the model to produce more informed predictions. The input at time t is represented by both Output: The choice of output layer is based on whether the benchmarking task is a regression or a classification task.
Remaining LoS prediction is a regression task, in which we predict the remaining LoS record-wise. That is, each patient record is fed to the model to predict the remaining LoS for that specific time step. This task is realized using a many to many architecture, where we assign a label to each patient record. The score for this task is obtained using: where y t is the remaining LoS predicted and ReLU is the non-linear activation function used as the prediction of remaining LoS cannot be negative. In-hospital mortality and decompensation are binary classification tasks. For the in-hospital mortality the many to one architecture is applied and the classifier is as follows: For the decompensation task, a many to many architecture is applied. Prediction at eachtime step is treated as a binary classification and the classifier is defined as: Phenotyping is defined as a multi-label task with 25 binary classifiers for each phenotype, and the score for the task is obtained using: where t is the time step and n is the phenotype being predicted and W n is the model parameter.

Results
In this section, we report benchmarking results of methods and prediction algorithms, focusing on answering the following questions: (a) How does performance of our model compare to the performance of clinical scoring systems as well as baseline algorithms (logistic/linear regression in our case); and (b) What is the impact on prediction performance when using different feature sets, such as categorical and numerical variables, solely categorical and solely numerical variables? We evaluate our model through a 5-fold cross-validation using the following evaluation metrics: for the regression tasks we report coefficient of determination

Mortality prediction
Results from this task indicate that the proposed approach of learning embeddings for categorical variables is more effective than OHE representation. This holds true for both baseline models (LR and ANN) as well as BiLSTM model, reflected in the prediction performance of each model. Furthermore, BiLSTM based model outperforms all the other approaches in predicting mortality in both the 24 hour window and the 48 hour window as shown in Table 5. It is interesting to note that using only categorical variables (reducing the number of variables from 20 to only 7) with embedding provides a better performance than using numerical variables only (AUROC 78.57 vs. 76.63 for the first 24h). These results suggest that EE of categorical features in vector space is more effective in the prediction of mortality.

Remaining length of stay in unit prediction
Predicting remaining LoS in the ICU unit requires capturing temporal dependencies between each time-step. For this reason baseline models perform poorly due the lack of explicit modelling of temporal dependencies. The proposed BiLSTM model on the other hand is able to capture this dependency effectively and outperforms the baseline models as shown in Table 6. We can also see that the numerical variables are the most effective in prediction of remaining LoS. However, using categorical variables encoded with OHE reduces the model performance, while EE improves R 2 measures.

Phenotyping
For the phenotyping task, we focus on comparing performance (AUROC) of the proposed model on different subset of features, namely numerical versus categorical variables. Using only the categorical features, modelled as entity embeddings shows a significantly higher performance (77.75) compared to using only the numerical features (67.05) as outlined in Table 7.
Clearly categorical features are more effective in representing patients' phenotype, since integrating both of the subsets does not significantly improve the result (79.89 from 77.75). In this task there is a wide difference between performance of the model on individual diseases, varying from 61.55 (diabetes mellitus without complications) to 94.24 (acute cerebrovascular disease). As a general trend prediction performance on acute diseases is higher (82.49) than that on chronic diseases (73.55). This may be due to the slow-progressing nature of chronic diseases, where recorded ICU data is relatively short and thus unable to fully capture events related to chronic diseases.

Decompensation prediction
As mentioned in Section Physiologic Decompensation, decompensation is related to mortality prediction with the difference that we predict whether the patient survives in the next 24 hours, given the current time step. As such, time-dependence is critical. Since 3 categorical variables (out of 7) are time-independent and only 4 are time-dependent, they pose a difficult challenge for the model to be able to predict decompensation using only the time-independent categorical variables. For this reason, the model with only numerical variables outperforms the model with only categorical variables as shown in Table 8.

Discussion
In this study we have described four standardised benchmarks in machine learning for critical care research. Our definition of benchmark tasks is consistent with previously published benchmarks to facilitate comparison with already published results. However, in this work we focus on the more recent eICU database, where clinical data has been collected from 335 ICUs across 208 hospitals across the United States. Our dataset contains a larger number of patients and a wider range of patient groups, in comparison to benchmarks published using a single center dataset, which should result in lower systematic bias and increased generalisability of the study. We provided a set of baselines for our benchmarks and show that BiLSTM model significantly outperforms clinical gold standard as well as the baseline models. Of note is the impact of entity embedding of categorical variables in further improving the performance of our LSTM-based model. Clearly, interpretability remains a significant challenge of models based on deep neural networks, including our BiLSTM model. However, there has been significant progress in "opening the black box" [18] as demonstrated by a recently updated review of interpretability methods [19], bringing these models one step closer to clinical practice. As our work is meant to track the progress of machine learning in critical care, interpretability is certainly an important aspect of this progress. We believe that our work will provide a solid basis to further improve critical care decision making and we provide the source code for other researchers that wish to replicate our experiments and build upon our results.

Related work
In this Section, we provide a brief review of the most relevant studies related to each of the tasks, mortality, length of stay, phenotyping, and physiologic decompensation. We briefly review the other benchmarking studies in critical care, related to our work.

Mortality prediction
Many clinical scoring systems have been developed for mortality prediction, including Acute Physiology and Chronic Health Evaluation (APACHE III [20], APACHE IV [21]) and Simplified Acute Physiology Score [22] (SAPS II, SAPS III). Most of these scoring systems use logistic regression to identify predictive variables to establish these scoring systems. Providing an accurate prediction of mortality risk for patients admitted to ICU using the first 24/48 hours of ICU data could serve as an input to clinical decision making and reduce the healthcare costs.
In this regard, recent advances in deep learning have been shown to outperform the conventional machine learning methods as well as clinical prediction techniques such as APACHE and SAPS [5] [23] [24]. Mortality prediction has been a popular application for deep learning researchers in recent years, though model architecture and problem definition vary widely. Convolutional neural network and gradient boosted tree algorithm have been used by Darabi et al. [25], in order to predict long-term mortality risk (30 days) on a subset of MIMIC-III dataset. Similarly, Celi et al. [26] developed mortality prediction models based on a subset of MIMIC database using logistic regression, Bayesian network and artificial neural network.

Length of stay
Resource allocation and identifying patients with unexpected extended ICU stays would help decision-making systems to improve the quality of care and ICU resource allocation. Therefore forecasting the length of stay (LoS) in ICU would be significantly important in order to provide high-quality care to a patient, and it would avoid extra costs for care providers. In this regard, Sotoodeh et al [27] applied hidden markov models to predict LoS by using the first 48 hours of physiological measurements. Ma et al. [28] defined LoS as a classification problem in which the objective was to create a personalized model for patients to forecast LoS. Previous studies [23] [5] have shown that deep learning models obtain good results on forecasting length of stay in ICU. In this regard, Tu et al [29] applied neural network based methods on a Canadian private dataset, which includes patients with cardiac surgery. The developed model was able to detect the patient with low, intermediate, and high prolonged stay in ICU.

Phenotyping
Phenotyping has been a popular task in recent years [30] [31], although problem definition varies widely, from focusing on ICD based diagnosis [24] up to including clinical procedures and medications [9] [10]. Several works on phenotyping from clinical time series have focused on variations of tensor factorization and related models [30] [31] [32], and the most recently published studies on phenotyping are focused on deep learning methods. In this regard, Razavian et al [33] and Lipton et al [24] applied deep learning methods to predict diagnoses. While the first trained RNN LSTM and CNN for prediction of 133 diseases based on 18 laboratory tests on a private dataset including 298k patients, the latter applied an RNN LSTM on a single-center, private pediatric intensive care unit (PICU) dataset in order to classify 128 diagnoses given 13 clinical measurements.

Physiologic decompensation
Early detection of physiologic decompensation could be used to avoid or delay the occurrence of decompensation. Recently machine learning researchers have started to apply various machine learning methods in order to predict the decompensation incident. Recent study by Ren et al. [34] applied gradient boosting models (GBM) to predict required intubation 3 hours ahead of time, in this work they used a cohort of 12,470 patients to predict unexpected respiratory decompensation. Differently, Xu et al [35] proposed a deep learning model to predict the decompensation event. The proposed attention-based model was applied on MIMIC-III Waveform Database and it outperformed several machine learning and deep learning models.

Benchmark
Harutyunyan et al. [5] developed a deep learning model based on RNN LSTM called multitask RNN, in order to predict a number of clinical tasks such as mortality prediction in hospital, physiologic decompensation, phenotyping, and length of stay in ICU unit. The proposed model was applied on MIMIC-III dataset. Similarly, Purushotham et al [23] have provided a single-center benchmark of several machine learning and deep learning models trained on MIMIC-III for various tasks, showing that deep learning models consistently outperformed conventional machine learning models and clinical scoring systems. One common theme across the reviewed work is that the current literature focuses on single-center databases, while we did not find any work in this area that addressed multi-centre datasets, including the associated challenges.