A new analytical framework for missing data imputation and classification with uncertainty: Missing data imputation and heart failure readmission prediction

Background The wide adoption of electronic health records (EHR) system has provided vast opportunities to advance health care services. However, the prevalence of missing values in EHR system poses a great challenge on data analysis to support clinical decision-making. The objective of this study is to develop a new methodological framework that can address the missing data challenge and provide a reliable tool to predict the hospital readmission among Heart Failure patients. Methods We used Gaussian Process Latent Variable Model (GPLVM) to impute the missing values. Specifically, a lower dimensional embedding was learned from a small complete dataset and then used to impute the missing values in the incomplete dataset. The GPLVM-based missing data imputation can provide both the mean estimate and the uncertainty associated with the mean estimate. To incorporate the uncertainty in prediction, a constrained support vector machine (cSVM) was developed to obtain robust predictions. We first sampled multiple datasets from the distributions of input uncertainty and trained a support vector machine for each dataset. Then an optimal classifier was identified by selecting the support vectors that maximize the separation margin of a newly sampled dataset and minimize the similarity with the pre-trained support vectors. Results The proposed model was derived and validated using Physionet MIMIC-III clinical database. The GPLVM imputation provided normalized mean absolute errors of 0.11 and 0.12 respectively when 20% and 30% of instances contained missing values, and the confidence bounds of the estimations captures 97% of the true values. The cSVM model provided an average Area Under Curve of 0.68, which improves the prediction accuracy by 7% as compared to some existing classifiers. Conclusions The proposed method provides accurate imputation of missing values and has a better prediction performance as compared to existing models that can only deal with deterministic inputs.


Introduction
The digitized healthcare data has increased significantly due to the wide adoption of electronic health records (EHR) system, which provides useful resources to advance clinical research and improve healthcare [1]. Many studies have been done to extract important information from EHR to predict different events such as readmission and mortality. [1][2][3][4][5][6]. Despite the vast opportunities, EHR datasets usually contain many missing values, which poses a great challenge on data analysis and clinical applications [7]. One typical approach in dealing with the incomplete data is to delete the instances that contain missing information. However, this is not efficient for EHR data where majority of the values are missing. For example, at a New York academic medical center, 48.9% of patients with disease code for pancreatic cancers did not have corresponding disease documentation in pathology reports [8]. Instead of discarding the missing instances, imputation can be done to estimate the missing information. For instance, a missing value can be replaced with either previous observation (last observation carried forward-LOCF), the average of available data (mean imputation), or fitted values from a regression model (regression imputation).
In the medical literature, many efforts have been made to explore efficient missing data imputation techniques in recent years. Beaulieu-Jones et al. [9] created a representative EHR dataset using a large amount of lab measurements and evaluated the performance of twelve different imputation methods under different missing mechanisms. In their study, two methods, i.e., Multivariate Imputation by Chained Equations (MICE) [10], and softImpute [11], could consistently generate better imputation results as compared with the rest. One possible reason is that these two methods can successfully identify the relationship between different measurements in the constructed dataset, which enables better imputation accuracy. Further, Codella et al. [12] developed an ensemble technique by combining 6 models (i.e., first-order polynomial function, k-nearest neighbors, random forest, perceptron neural network, recurrent neural network architecture with bi-directional gated recurrent unit and bidirectional LSTM) to impute missing values on a time series dataset. Although significant progress has been made to deal with incomplete data, there are still gaps. For example, MICE method assumes a functional relationship between each single variable and all the remaining variables. If the relationship is not properly defined, the method may not provide an accurate estimation of missing values. In addition, Codella's ensemble technique contains a large number of tuning parameters, and the parameter calibration can introduce extra sources of uncertainty to the imputation (e.g., model errors). In this study, we present a new imputation technique which can overcome the aforementioned limitations and provide a reliable estimation of missing values. Specifically, Gaussian Process Latent Variable Model (GPLVM) will be used to impute the missing values, where a lower dimensional embedding of the observations will be learned from a small complete dataset and then used to impute the missing values in the incomplete dataset.
After imputing the missing values, the complete dataset can be applied to develop models to support clinical decision-making. For example, statistical and Machine Learning (ML) models can be used to predict 30-day readmission of Heart Failure (HF) patients after discharged from Intensive Care Units (ICUs). However, the imputation inevitably introduces uncertainties (e.g. imputation errors) to the dataset. Most ML techniques are sensitive to data uncertainty; and small variations in the inputs can lead to significant changes in model outputs. It is important to quantify these uncertainties to ensure accurate modeling and decision-making. Support Vector Machine (SVM) is one of the most popular models in ML field due to its superior performance on high dimensional data and the capability of handling nonlinear classification. But traditional SVM cannot deal with data uncertainty and is sensitive to noises. In this study, we designed a constrained SVM (cSVM) to account for the data uncertainty (imputation errors). Specifically, multiple datasets were constructed by sampling from the imputed distributions, which were then used to train a group of SVMs. Further, an optimal classifier is identified by selecting the support vectors that maximize the separation margin of a newly sampled dataset and eliminate the underperformed support vectors due to data uncertainty.
The proposed framework was used to predict the 30-day readmission of HF patients using the MIMIC-III database. Heart disease is the leading cause of death and about 5.7 million adults in the United States have HF [13,14]. Hospitalization provides patients with optimized treatment to reduce mortality and improve treatment outcomes. A continued challenge in the care of HF patients is the high readmission rate. The 30-day readmission rate for HF is more than 20% and the 6-month readmission rate is up to 50% [15]. It is important to predict the readmission rate before discharges so care providers can be proactive and take actions to reduce readmission risk rather than responding to its consequences. Some models have been developed to predict the HF readmission, but the results are suboptimal. Ross et al. [16] reviewed the models for HF readmission prediction between 1950 and 2007 and reported a best C-statistic of 0.60. Frizzell et al. [17] compared 4 ML algorithms for the prediction of 30-day HF readmission and the best C-statistic is 0.624. Awan et al. [18] developed a multilayer perceptron-based approach for the prediction of 30-day HF readmission, which provide a C-statistic of 0.62. HF readmission prediction needs to be improved and more research efforts are demanded to explore accurate models. The objective of this study is to address the challenges of missing data associated with EHR data analysis and readmission risk prediction.
The rest of the paper is organized as follows. Descriptions of the dataset and clinical variables used to test the methodological framework are presented in the Materials and Methods section. The GPLVM based imputation and cSVM are developed in the section of Readmission Prediction Model. The missing data imputation and readmission prediction performance are analyzed and compared with existing methods in Results, which are followed by Discussions and Conclusions sections.

Dataset
The HF readmission data was constructed from the publicly available MIMIC-III database [19], which consists the EHR data for over 40,000 patients in the critical care unit of the Beth Israel Deaconsess Medical Center from 2001 to 2012. We first discarded patients with age under 18 at admission and died in the ICU. Then patients with discharge diagnosis as HF were screened out based on the International Classification of Diseases, 9 th Revision codes (ICD-9) 402.01, 402. 11 [20]. In total, we got 5959 patients with 8439 ICU stays, among which there were 1000 (11.8%) 30-day readmissions. The data will be used to train the models and conduct cross-validation where 80% will be used as training and 20% as testing. The code for data extraction was developed based on the open source code (https://github.com/Jeffreylin0925/MIMIC-III_ICU_Readmission_ Analysis) from Lin et al. [3].

Data preprocessing
A few clinically important features were taken out from the dataset to build the classifier, including demographics, chart events, and length of stay (LOS). Among these data, demographic variables were widely used in risk stratification models, which provide basic information such as gender, age, race. Chart events are time series data, which represent the variation of patients' physiological conditions during their stay. LOS is the time in hours that a patient spends during the stay. Table 1 provides a summary of these variables for the whole cohort, readmitted group, and non-readmitted group, where continuous variables are expressed as median (interquartile range) while categorical variables are expressed as percentages. All these features were used to develop a classifier to predict the readmission of HF patients.
Demographics. Five demographic attributes were extracted including gender, age, race, insurance type and discharge location. The first three features are frequently used in readmission risk prediction in the literature. Since HF hospitalization is highly correlated with medical costs, different insurance types might affect patient's decision for discharge and readmission. As for discharge location, patients would receive different health care at different discharge facilities, which might potentially influence their recovery and thus affecting the readmission rate. All features except age were coded with dummy variables for model development.
Chart events. Chart events are measurements of patients' vital signs, such as heart rate and blood pressure, during their stay in the ICU. Although the data contains a significant amount of valuable information, they have not been explored much in risk prediction [3]. It has been reported that the charted events within the 48 hours before patients' discharge are correlated with the readmission rate [21,22]. In this study, 13 numerical attributes in 48 hours prior discharge were used, which include Blood urea nitrogen (BUN), Creatinine, Diastolic blood pressure (DSP), Fraction inspired oxygen (FIO), Glucose, Heart rate (HR), Mean blood pressure (MBP), Oxygen saturation (OS), Respiratory rate (RR), Systolic blood pressure (SBP), Temperature, Weight and pH. Two basic statistical parameters, i.e. mean and standard deviation, were calculated based on the measurements of each variable within the last 48-hour window. The corresponding features were considered as missing when no measurements were available in the 48-hour window.

Readmission prediction model
As aforementioned, missing values are prevalent in the EHR, which poses a great challenge on readmission risk prediction. To facilitate an accurate prediction of readmission, we will first perform missing data imputation using GPLVM. This model can provide a mean estimate of each missing feature along with the variance quantifying the imputation error. Further, the imputation error will be propagated into the cSVM model to achieve a robust risk prediction. The imputation and prediction models are described in detail as follows.
Gaussian Process Latent Variable Model (GPLVM). GPLVM is an extension of Gaussian Process (GP) [23], where the functional relationship between inputs and outputs is assumed to have a GP prior, and the inputs are treated as latent variables to be estimated along with model hyperparameters [24]. To better understand GPLVM, we will first review the concepts of GP. Let Y ¼ ½y 1 ; y 2 ; . . . ; y M ��R N�M denote the complete observations, i.e. no missing values, where N is the total number of patients and M is the total number of attributes. In the general GP modelling framework, the input X ¼ ½x 1 ; x 2 ; . . . ; x N � T �R N�Q is known and each dimension of the data is modelled as where where μ(x) is the mean function and k(x,x 0 ) is the covariance function. Therefore, the conditional distribution of y m given X is where , θ groups all the parameters to be estimated. By taking all the attributes together, Y should be modelled as a multiple-output GP. Further, we assume conditional independence across different attributes, then the likelihood of the observations can be written as which can be used to estimate the model parameters through Maximum Likelihood Estimation. However, when the input X is unknown, it should be estimated along with the hyperparameters.
To obtain the distribution of the latent variable X, we seek to optimize the likelihood of data (to simplify the notation, the dependence over θ was omitted) However, the nonlinear incorporation of X inside the kernel function makes the optimization intractable. Thus, a variational distribution q(X) approximating the true posterior distribution p(X|Y) is introduced to derive the following variational lower bound for log{p(Y)} [25] where the second term is the KL divergence between the variational distribution and the prior distribution of X, which could be analytically calculated. Further, the first term in Eq 6 can be written as respectively. The imputation is done as follows. Firstly, the distribution q(x�) of the latent variable x� corresponding to z� will be obtained by maximizing the following density [26]: Let h � ¼ ½h O � ; h U � � 2 R N be the latent function value corresponding to z�, then the density of the h U � can be calculated: Although the marginalization of x� generates an intractable multivariate density, the moments of the density qðh U � Þ are computable for some covariance functions. Therefore, the mean and variance of the missing part z U � could be calculated as where the mean provides the estimate of missing values and the variance quantifies the uncertainty associated with the mean estimate. This information will be further propagated into the classification model described in the next section to obtain a robust estimation of readmission risk.
To evaluate the performance of the imputation method, two metrics were used. The first metric is Mean Absolute Error (MAE), which is calculated as: 1 N miss P ði;jÞ2M jp ij À y ij j, where N miss is the number of missing values, M contains the index for all the missing values, π ij and y ij respectively represent the imputed mean and the true value. Since GPLVM can also provide the estimation of variance, another metric calculates the proportion of the true values that were contained in the confidence bound π ij ±2σ ij .
Constrained Support Vector Machine (cSVM). SVM is a well-established classifier in the ML community, which has been widely applied in different applications [27]. However, traditional SVM can only deal with deterministic inputs. When there is uncertainty involved, SVM needs to be modified to consider the uncertainty in both training and validation processes. In this study, the GPLVM model provides statistical estimations of missing values and quantifies the uncertainty (e.g., imputation error and measurement noises) in all patients' data. To propagate the uncertainty into classification, we developed a constrained SVM model, which is described as follows.
Given the data D ¼ fs j ; y j g N j¼1 ; y j 2 fÀ 1; 1g, s j 2 R M , the original SVM solves the following problem to maximize the margin between two classes [28]: where w is the weight vector, b the bias, C is a term determining the cost of mis-classification, ξ j is a slack variable, φ(�)is a kernel function mapping the original inputs to the feature space. SVM aims at selecting some points from each class to construct a hyperplane that could best separate the two classes. The selected points are called support vectors and the performance of SVM greatly depends on the support vectors [28]. When there are uncertainties involved, the hyperplane could be affected greatly, thus influencing the classification performance of SVM. For example, the imputed values in this study are described as probability distributions, and the true values are random variables that can take any values with a certain probability.
Different samples of the missing values can lead to different classification models and provide different classification results, i.e., readmission or non-readmission. Therefore, it is important to consider the input uncertainty to achieve a robust prediction.
To account for input uncertainty, we will first sample p sets of data from the distributions of missing values. Then, a w i is trained with each set of samples, which can generate p sets of weighting vectors fw i g p i¼1 . Further, a constrained SVM model is trained using fw i g p i¼1 and a new set of samples by solving the following optimization problem where λ is a penalty coefficient. The penalty term containing w i s can incorporate the uncertainty information in the data into SVM optimization. Specifically, minimizing the similarity between w and w i s will guide the optimization to seek a w that can maximumly capture the variations across all sets of data. Since direct solution of this problem is complex, the following dual problem was formulated based on the Karush-Kuhn-Tucker conditions [29]: which is a quadratic programing problem and can be solved using standard optimization packages. Compared with the original SVM, cSVM includes a penalty term to consider the support vectors identified from the classifiers trained with multiple samples. This will take into account the uncertainty in the imputed dataset. Later we will show that this penalty term will improve the performance of classification in the presence of uncertainty (i.e., missing values) (See Results Section).

Feature selection
To avoid overfitting, feature selection was performed on the imputed dataset to select the attributes that are strongly correlated with the outputs. This procedure was done for all the variables list in in the Data Preprocessing section. Logistic regression (LR) with LASSO regularization was used to select the significant features for subsequent analysis.

Missing data imputation
When a vital sign recording is not available within the 48 hours before discharge, the corresponding statistical features are missing and need to be imputed through GPLVM. The original dataset used in this study contains 8439 instances, and 88% of them have missed at least one vital sign recording. To test the performance of the missing data imputation, we extracted all the instances (i.e. patients) with complete statistical features from the original data, and deliberately set some items as missing under the missing at random assumption [9]. The incomplete datasets were generated following three steps: firstly, for each feature, a linear regression model was fitted using the remaining features, and the quartiles of each fitted feature was calculated; secondly, different proportion of instances, i.e., 20% and 30%, were randomly selected; lastly, within each selected instance, the feature value was set as missing if its value was within the predetermined quartile range of its fitted value (e.g. between 2 nd and 3 rd quartile). GPLVM with a squared exponential kernel function was trained using complete instances and tested the imputation performance on the generated incomplete dataset.
To provide some insights into the GPLVM-based imputation results, the imputation for a patient with six missing values: BUN, Creatinine, DSP, FIO, Glucose, and HR is demonstrated in Fig 1. In Fig 1(A) the vertical dark solid lines mark the true measurements while the horizonal dashed lines represent the imputed means. When the intersection between these two lines lies on the black dashed line, the mean estimate matches well with the true value. As seen in Fig 1(A), five missing values were well imputed with the mean estimates except for Cr. The true values and the imputed means are also shown in Fig 1(B). The advantage of GPLVM based imputation is that it can provide the variance along with the mean estimate, which is displayed as the probability density function in Fig 1(A) and the confidence interval of the imputation (CI, bounds within 2-fold standard deviation) in Fig 1(B). For Cr, the true value is away from the estimated mean, but it is still within imputation bounds.
To quantify the imputation accuracy, we normalized all variables to a common scale of zero mean and unit standard deviation. The two metrics, i.e. MAE and the proportion of the true values captured by the imputation bound, were used to evaluate the imputation accuracy. Since the incomplete dataset is generated randomly, experiments were repeated for 50 times, and the box plots of MAEs for different percentages of instances that have missing values are shown in Fig 2(A). As seen, the GPLVM can impute all missing values at an average MAE of 0.11 and 0.12 for 20% and 30%. In addition, when the proportion of missing values increases, the variance of MAE increases. In addition, we also looked at how many true values are within two standard deviation of the mean estimates, i.e., (π ij +2σ ij ). As seen in Fig 2(B), about 97% of the true values are within two standard deviation of the mean estimates. In other words, 97% of the estimated confidence bounds contain the true values of missing data. The metric is important because when the true values of missing data are within the confidence bounds, they will be accounted in the cSVM, thus the classification will be more robust to the uncertainty introduced by missing data imputation.

Model performance
Since missing values are considered as random variables in the GPLVM imputation, traditional ML methods, such as SVM, cannot take such variables as inputs. To make full use of the information, a new classifier, cSVM, was trained for readmission prediction. Since the data is highly unbalanced, patients with no-readmission were down-sampled to obtain an even training data, i.e., the ratio between the two classes is 1:1. Considering the nonlinearity of the data, the radial basis function (RBF) kernel was used to train the classifier. We investigated four parameters to optimize the performance of cSVM. These parameters include box constraint C, kernel scale σ, number of pre-sampled datasets p, and penalty coefficient λ. Specifically, σ was set as the optimal value that provides the best performance for SVM. In addition, we found that the accuracy of cSVM is not significantly sensitive to the sample size p. We recommend choosing a sample size large than 30. However, p should be determined case-by-case based on the number of features that involve uncertainty. In our study, we used p = 200. Further, λ is set to be 1 and the box constraint is set to be 10000.
To benchmark the results of the proposed cSVM model, we trained different classifiers of LR, Naïve Bayes (NB), traditional SVM, and a modified NB (mNB) which can deal with input uncertainty [30]. The receiver operating characteristic (ROC) curves of different classifiers are plotted in Fig 3. The left figure shows the ROC curves and their 95% bootstrap confidence bounds, and the right figure shows the average ROC curve. As seen in the figure, cSVM shows a better ROC as compared to the other four classifiers and the area under curve (AUC) of cSVM is the largest. cSVM has a 7% improvement on AUC as compared to the traditional SVM, which demonstrates that the introduced penalty term could improve the classification performance. Further, the mean True Positive Rates (TPRs) with respect to different False Positive Rates (FPRs) for each classifier were provided in Table 2, which also reveals that cSVM could consistently achieve better results compared with other classifiers. In addition to the AUC, some targeted operating points are also useful in clinical applications [3], for example application of high-sensitivity (TPR) can rule out the disease, whereas high-specificity (1-FPR) can rule in the disease [31]. Thus, the operating points with sensitivity and specificity fixed at 0.85 and 0.85 were adopted to evaluate the performance of different classifiers [32,33], and results are given in Table 3. From the table, we can find that, compared with LR, NB, mNB and SVM, cSVM can improve the performance by 18%, 61%, 57% and 27% at operating point of high-sensitivity and by 22%, 63%, 35%, 6% at the operating point of high-specificity. All the classifiers are trained with the complete dataset after the GPLVM-based missing data imputation. The LR, NB and SVM models used the mean estimates of the imputed data.
Further, to investigate the performance of the combination of GPLVM-based imputation and the cSVM classifier, we replaced the GPLVM with other imputation techniques, and then performed classification using LR, NB and SVM models. Two different imputation methods were used to impute the missing values, which were LOCF and mean imputation. The AUCs of different combinations are given in Table 4. As shown in the table, all three classifiers (LR, NB, and SVM) achieve better AUCs with GPLVM imputation than the LOCF and the mean imputation. This reveals that the GPLVM-based imputation outperforms the traditional  LOCF and mean imputation thus benefiting the classification for readmission prediction. The comparison study suggests that the proposed analytical framework improves the readmission prediction performance as compared with traditional methods.

Discussion
The wide adoption of EHR system provides a great opportunity to mine the large database of patients and analyze patient traits, cause of disease, treatment effects, risks and mortalities. However, missing data is prevalent in EHR system, which poses a great challenge on data analysis to support clinical decision-making. This study developed a new analytical framework to address the missing data challenge and facilitate a robust prediction of HF readmission using EHR data. Specifically, GPLVM was introduced to impute the missing values in EHR dataset. The intuition of using this technique is that physiological measurements are indicators of patient's condition; and there should be inherent correlations among features of an individual patient and across the patient population. Since GPLVM employs the versality of GP in representing a complex functional relationship between inputs and outputs, the optimized latent variables can provide better representation of the features, which in turn facilitate accurate missing data imputation. In addition, GPLVM can provide both the mean estimates of the missing values as well as their variances that quantify the uncertainty associated with the mean estimate. Quantifying the uncertainty of the imputation is useful, as this information can be propagated into the subsequent analysis to ensure a robust prediction. To propagate the uncertainties (i.e., imputation errors) in the imputed dataset into modeling, a new classifier, cSVM, was developed based on SVM. Different sets of features were randomly sampled from the distributions of missing values, and SVM models were trained using these random samples. A penalization term was introduced to constrain the deviation of the optimal classifier from the group of pre-trained support vectors; this will ensure the optimal classifier is not biased due to the imputation uncertainty. The resulting classifier can retain the discriminative features in the dataset while minimizing the influence of data uncertainty, and thereby improving the classification performance. The proposed missing data imputation and classification framework was tested and validated using a clinical database to predict 30-day readmission of HF patients discharged from ICUs. The model can provide a better AUC as compared to other popular classifiers in the literature. The improvement in the prediction accuracy is awarded by the better missing data imputation performance of GPLVM and the robust classification of the cSVM. However, further improvement is limited by the efficiency of the predictors (features). Existing studies on 30-day readmission prediction for HF patients have AUCs up to 0.62 [16][17][18]34]. A recent study using structured data, such as age and race, and unstructured data, such as discharge notes, to predict the readmission of Congestive Heart Failure patients can achieve an AUC of 0.97 [35], but the model has a high level of complexity and needs extensive validation. The generally low accuracy of readmission prediction can be due to the limited predictive potentials of the features. Future research should be done to discover significant risk factors and biomarkers that can improve HF readmission prediction. The GPLVM imputation technique learns the correlations among features to estimate missing values. It can provide better estimations for datasets that have correlated attributes. However, for datasets where all attributes are independent, the accuracy of GPLVM needs further validation. In addition, this study calculated the statistical features, i.e., mean and standard deviation, from the vital sign data, and then imputed the missing features using GPLVM. The two statistical features may not be able to exploit all useful information in the time series data, thus the advantage of GPLVM was not fully explored. Future work can be done to model the time series to extract more informative features for readmission prediction.

Conclusions
Mining EHR datasets to facilitate clinical decision-making has become a popular topic. However, EHR often contains a large amount of missing values, which greatly challenges existing techniques. This study designed a new analytical framework based on the GPLVM and cSVM. The former can effectively impute the missing values in EHR and provide both the mean and variance estimates, while the latter provides better diagnostic capability by incorporating the input uncertainties during model development. The proposed method was tested using the MIMIC-III database and showed better performance for HF readmission prediction compared with some existing models.