Artificial intelligence algorithm for predicting mortality of patients with acute heart failure

Aims This study aimed to develop and validate a deep-learning-based artificial intelligence algorithm for predicting mortality of patients with AHF (DAHF). Methods and results 12,654 datasets from 2165 patients with AHF in two hospitals were used as training data for DAHF development, and 4759 datasets from 4759 patients with AHF in 10 hospitals enrolled in the Korean AHF registry were used as performance test data. The endpoints were in-hospital, 12-month, and 36-month mortality. We compared the performance of the DAHF with the Get with the Guidelines-Heart Failure (GWTG-HF) score, the Meta-Analysis Global Group in Chronic Heart Failure (MAGGIC) score, and other machine-learning models by using the test data. The area under the receiver operating characteristic curve of the DAHF was 0.880 (95% confidence interval, 0.876–0.884) for predicting in-hospital mortality; this significantly outperformed the GWTG-HF score (0.728 [0.720–0.737]) and the other machine-learning models. For the 12- and 36-month endpoints, the DAHF (0.782 and 0.813) significantly outperformed the MAGGIC score (0.718 and 0.729). During the 36-month follow-up, the high-risk group defined by the DAHF had a significantly higher mortality rate than the low-risk group (p<0.001). Conclusion The DAHF predicted the in-hospital and long-term mortality of patients with AHF more accurately than the existing risk scores and other machine-learning models.

Introduction Approximately 26 million adults worldwide have heart failure, and acute heart failure (AHF) is the leading cause of hospitalization in Europe and the United States, accounting for more than 1 million admissions and 1%-2% of all hospitalizations. [1,2] Over the past decades, the mortality rate of AHF has improved with advances in treatment, but AHF remains a leading cause of mortality worldwide. [1][2][3] Risk stratification and prognosis prediction are therefore critical for identifying high-risk patients and guiding treatment decisions for patients with AHF.
There are several mortality prediction models for heart failure, such as the Get with the Guidelines-Heart Failure (GWTG-HF) score and the Meta-Analysis Global Group in Chronic Heart Failure (MAGGIC) score. [4,5] However, these prognostic models have limitations in current daily practice. First, GWTG-HF and MAGGIC are limited to specific situations: GWTG-HF was developed only for in-hospital mortality, and MAGGIC only for long-term mortality. [4,5] Second, their accuracy is unsatisfactory, so they cannot be used to guide patient treatment. Third, these models use only limited information because they are based on a conventional statistical approach, such as multivariate logistic regression, which carries a potential risk of information loss. [6][7][8] Recently, artificial intelligence algorithms have achieved high performance in several medical domains, such as image detection and clinical outcome prediction. [9][10][11] An advantage of deep-learning is the automatic learning of features and relationships from given data. [12] In this study, we developed and validated a deep-learning-based artificial intelligence algorithm for predicting mortality of patients with acute heart failure (DAHF) by using large datasets from 12 hospitals.

Study population
We conducted a retrospective observational cohort study using AHF patient data from 10 university hospitals of the Korean Acute Heart Failure (KorAHF) registry and 2 hospitals (hospital A, a cardiovascular teaching hospital, and hospital B, a community general hospital), as shown in Fig 1. We defined patients with AHF as patients with signs or symptoms of heart failure who met either of the following criteria: 1) lung congestion or 2) objective findings of left ventricular systolic dysfunction or structural heart disease.
First, we collected the algorithm training data from patients with AHF admitted to hospitals A and B from October 2016 to December 2017. The data obtained through the electronic health records of the two hospitals comprised demographic information, treatment and medication, laboratory results, electrocardiography (ECG) and echocardiography findings, final diagnosis, clinical outcome during the hospital stay, and 12-month prognosis after discharge.
Second, we used the data of patients with AHF enrolled in KorAHF as performance test data. KorAHF is a prospective multicenter registry of Korean patients with AHF. The cardiovascular centers of 10 tertiary university hospitals enrolled patients in KorAHF from March 2011 to January 2014. The full details of the KorAHF registry's aims and protocols have been published elsewhere. [13] The data obtained through KorAHF comprised demographic information, treatment and medication, laboratory results, ECG and echocardiography findings, final diagnosis, clinical outcome during the hospital stay, 12-month prognosis, and 36-month prognosis.
Because hospitals A and B were not university hospitals, they were not included in KorAHF. Moreover, the period of the training data differed from that of the test data. Therefore, the training and test data were completely separated. We excluded patients with missing values of predictor variables and endpoints, as shown in Fig 1. This study was conducted in accordance with the Declaration of Helsinki and the relevant guidelines and regulations. The institutional review boards (IRBs) of Sejong General Hospital (2018-0839) and Mediplex Sejong Hospital (2018-073) approved this study protocol and granted waivers of informed consent based on general impracticability and minimal harm. Patient information was anonymized and de-identified before the analysis. KorAHF data were collected by each site and approved by the IRB at each hospital. The KorAHF committee approved and provided data for the present study.

Data management
We used the data of hospitals A and B as training data for prediction algorithm development, and we used the KorAHF data as test data to confirm whether the DAHF could be applied to other hospitals after development. These two datasets were completely separated.
In the training data of hospitals A and B, we created a training dataset each time an ECG was taken during a patient's hospital stay, as shown in Fig 2. If the values of other variables were missing at the time of the ECG, we used the most recent values of the demographic information, vital signs, and echocardiography and laboratory findings, as shown in Fig 2. Hence, several training datasets were generated from one patient. Using this method, we amplified the data and created a dataset sufficient for developing machine- and deep-learning models.
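The carry-forward joining described above, pairing each ECG with the most recent earlier measurements for the same patient, can be sketched with pandas `merge_asof`. This is a minimal illustration; the column names and values are hypothetical, not the registries' actual schema.

```python
import pandas as pd

# Hypothetical example: one row per ECG, joined (per patient) with the
# most recent lab result taken at or before each ECG time.
ecgs = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "time": pd.to_datetime(["2017-01-01 08:00", "2017-01-02 09:00",
                            "2017-01-01 10:00"]),
    "qrs_duration": [96, 104, 110],
})
labs = pd.DataFrame({
    "patient_id": [1, 2],
    "time": pd.to_datetime(["2017-01-01 06:00", "2017-01-01 07:00"]),
    "creatinine": [1.1, 1.8],
})

# merge_asof requires both frames sorted on the join key.
train = pd.merge_asof(
    ecgs.sort_values("time"), labs.sort_values("time"),
    on="time", by="patient_id", direction="backward",
)
train = train.sort_values(["patient_id", "time"]).reset_index(drop=True)
# Each ECG row becomes one training example carrying the latest lab values,
# so one patient can contribute several training datasets.
```

With this toy input, the patient-1 ECG on January 2 reuses the January 1 creatinine value, mirroring the "most recent value" rule in the text.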
In the KorAHF test data, only the initial values of the demographic information, vital signs, ECG, echocardiography, and laboratory data at the time of admission were used. If a variable's value was missing at admission, the first result during the hospital stay was used. Because the purpose of the test dataset was to assess the accuracy of each prediction model, and the goal of each model was to predict the patient's prognosis at the time of admission, we generated only one test dataset per patient.

Endpoints
We evaluated the accuracy of each prediction model in predicting in-hospital mortality and 12- and 36-month mortality. The primary endpoint was in-hospital mortality, defined as death during hospitalization; readmission within 24 h was considered part of the same hospital stay. The secondary endpoints, 12- and 36-month mortality, were defined as death within 12 and 36 months, respectively, among patients who survived to discharge, as shown in Fig 1.

Development of deep-and machine-learning prediction models
We developed the DAHF by using only the training dataset. The DAHF was built on a deep neural network (DNN), a deep-learning method, with 3 hidden layers of 33 nodes, batch normalization, and dropout layers. [14,15] We used TensorFlow (the Google Brain Team, Mountain View, United States) as the backend. [16] In this DNN, we used the rectified linear unit (ReLU) as the activation function and backpropagation as the training method. We developed the DAHF by ensembling a DNN model fitted to in-hospital death and a DNN model fitted to 12-month death, as shown in Fig 2. A detailed description, references, and a figure of the DNN are provided in S1 File.
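A minimal Keras sketch of the architecture described above (3 hidden layers of 33 nodes, batch normalization, dropout, ReLU activations, and a sigmoid output for the binary mortality endpoint) might look as follows. The input width, dropout rate, and optimizer are illustrative assumptions, not the authors' exact configuration.

```python
import tensorflow as tf

def build_dahf_dnn(n_features: int, dropout_rate: float = 0.5) -> tf.keras.Model:
    """Sketch: 3 hidden layers x 33 nodes, each with batch norm, ReLU, dropout."""
    inputs = tf.keras.Input(shape=(n_features,))
    x = inputs
    for _ in range(3):
        x = tf.keras.layers.Dense(33)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation("relu")(x)
        x = tf.keras.layers.Dropout(dropout_rate)(x)
    # Sigmoid output: predicted probability of the mortality endpoint.
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    # Training by backpropagation through a gradient-based optimizer.
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model

# n_features=27 is a placeholder for the number of predictor variables.
model = build_dahf_dnn(n_features=27)
```

One such network would be fitted per endpoint (in-hospital and 12-month death) and their outputs combined, as described in the text.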
We also developed four machine-learning models, random forest (RF), logistic regression (LR), support vector machine (SVM), and Bayesian network (BN), for performance comparison with the DAHF. [17] In previous studies, these were the most commonly used machine-learning methods and showed better performance than traditional methods in several medical domains. [18,19] We used the randomForest, glmulti, e1071, and bnlearn packages in R (R Development Core Team, Vienna, Austria) to develop the RF, LR, SVM, and BN models, respectively; detailed descriptions, references, and figures are provided in S1 File. We analyzed the variable importance of LR, RF, SVM, BN, and DNN by using deviance difference, mean decrease in Gini, sensitivity analysis, deviance difference, and AUC difference, respectively.
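The study built these baselines in R (randomForest, glmulti, e1071, bnlearn). For illustration only, scikit-learn offers close analogues for three of the four (Bayesian networks would need a separate library such as pgmpy); the toy data below stand in for the clinical predictors.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in data: 200 patients, 5 predictors, binary endpoint.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Rough scikit-learn counterparts of the R packages named in the text.
baselines = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True, random_state=0),  # probability=True yields risk scores
}
# Fit each model and collect predicted event probabilities.
probs = {name: m.fit(X, y).predict_proba(X)[:, 1]
         for name, m in baselines.items()}
```

In practice each baseline's probabilities would be scored on the held-out KorAHF test data, exactly as the DAHF was.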

Test of prediction model performance
After developing the DAHF and the machine-learning models, we compared their performance with that of the conventional prediction scores, which are well validated and used globally: the GWTG-HF score for predicting in-hospital mortality and the MAGGIC score for predicting 12- and 36-month mortality. [4,5] We compared the models by using only the test data, which were not used for model development. We used the area under the receiver operating characteristic curve (AUC) as the comparative measure. The AUC is a frequently used metric; the receiver operating characteristic curve plots sensitivity against 1-specificity. [20] We evaluated the 95% confidence intervals using bootstrapping (10,000 resamplings with replacement). [21] We used the ROCR package in R (R Development Core Team, Vienna, Austria) for these analyses.
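The analyses above were done in R with ROCR; a Python equivalent of the percentile-bootstrap AUC confidence interval (resampling patients with replacement) might look like this. The toy data and the small resample count are illustrative, the study used 10,000 resamplings.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC: resample cases with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)
        if y_true[idx].min() == y_true[idx].max():
            continue  # a resample needs both classes for the AUC to be defined
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)

# Toy data: predicted risks that separate survivors (0) from deaths (1).
y = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
p = np.array([0.1, 0.2, 0.3, 0.4, 0.9, 0.8, 0.7, 0.35, 0.6, 0.15])
auc, (lo, hi) = bootstrap_auc_ci(y, p, n_boot=200)
```

The point estimate plus the 2.5th and 97.5th percentiles of the bootstrap distribution reproduce the "AUC (95% CI)" format reported in the abstract.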
We divided the patients in the test data into high- and low-risk groups based on the DAHF. In this analysis, we used the data of patients who survived to discharge. The optimal cutoff point of the DAHF score was the point at which the Youden J statistic was at its maximum. [22] After dividing the patients into risk groups, we estimated the 36-month mortality by risk group using the Kaplan-Meier method. We used the pROC and survival packages in R (R Development Core Team, Vienna, Austria) for this analysis.
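The Youden J cutoff is the ROC threshold maximizing sensitivity + specificity − 1. The study computed it with pROC in R; a scikit-learn sketch on hypothetical scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_cutoff(y_true, y_score):
    """Return the threshold where Youden's J = sensitivity + specificity - 1 peaks."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    j = tpr - fpr  # J = TPR - FPR, equivalent to sensitivity + specificity - 1
    return thresholds[np.argmax(j)]

# Toy risk scores: events (1) all score at or above 0.5, non-events below.
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
p = np.array([0.2, 0.3, 0.45, 0.5, 0.4, 0.7, 0.8, 0.9])
cut = youden_cutoff(y, p)
# Patients with a predicted risk >= cut would be assigned to the high-risk group.
```

Applying such a cutoff to the DAHF scores of survival-to-discharge patients yields the high-/low-risk split used for the Kaplan-Meier analysis.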

Results
We included 8094 patients with AHF (hospitals A and B: 2469; KorAHF: 5625) in the present study (Fig 1). We excluded 1170 patients because of missing values and endpoints. The study subjects comprised 6724 patients, of whom 194 died in hospital. The DNN prediction model, DAHF, was developed using 12,654 training datasets from 2165 patients of hospitals A and B. The performance test was performed using 4759 test datasets from 4759 patients of KorAHF (Fig 2). Table 1 shows the baseline characteristics; a significant difference was found between the characteristics of the training and test data. In the training dataset, the optimal cutoff score for the DAHF risk groups was 0.472. Among the 4577 survival-to-discharge patients in the test dataset, the DAHF classified 2668 and 1909 patients as high and low risk, respectively. The cumulative hazard plot in Fig 5 shows that the high-risk group of the DAHF had a significantly higher hazard than the low-risk group. The high-risk group, defined by the DAHF, had a significantly higher mortality rate than the low-risk group (p<0.001).
The characteristics and variable importance of each prediction model are shown in S2 File. The variable importance differs across the prediction models. In the conventional machine-learning models, the EF and QRS duration variables are less important for prediction; however, for the DAHF, these variables are important (S2 File). The BMI and LAD variables are less important for prediction in the DAHF than in the other conventional machine-learning prediction models.

Discussion
In this study, we developed a deep-learning-based artificial intelligence algorithm, DAHF, for predicting the mortality of patients with AHF using two hospital datasets and validated the DAHF using separate AHF registry data. This study revealed that the deep-learning-based artificial intelligence model predicted mortality with excellent accuracy, better than the conventional risk score models. In addition, the deep-learning model outperformed the other machine-learning prediction models in predicting the endpoints. To the best of our knowledge, this study is the first to predict AHF patient endpoints using a deep-learning-based artificial intelligence prediction model. The GWTG-HF and MAGGIC scores are well-validated conventional models for risk stratification of AHF patients. [23][24][25][26] Previous validation studies have reported that the AUC of the GWTG-HF score for predicting in-hospital mortality of patients with AHF was 0.71-0.76. [24,25] Furthermore, the AUC of the MAGGIC score for predicting 1-3-year mortality of patients with AHF in previous studies was 0.73-0.74, implying moderate accuracy for predicting the mortality of patients with AHF. [24,26] These results were similar to those of this study.
However, the GWTG-HF and MAGGIC scores have several limitations. First, these models were developed and validated in specific situations: the GWTG-HF score was developed to predict in-hospital mortality, and the MAGGIC score was developed and validated to predict long-term endpoints. [4,5] Moreover, these prediction scoring methods do not perform well enough to be actively used to guide patient treatment. These scoring models were developed by a conventional statistical approach using logistic regression, which has limitations, including fixed assumptions about data behavior and the necessity of preselecting variables in the development phase, thus leading to potential information loss. [6][7][8] Unlike the conventional statistical approach, deep-learning does not require the preselection of important variables; less important variables are naturally ignored in the model-fitting process. [12,14,15] Furthermore, deep-learning does not limit the number of input predictive factors and can use all available information without potential loss. In addition, the old models cannot reflect the relationships between variables, because the risk is measured only by the sum of the variables. Meanwhile, deep-learning captures the relationships between variables, as shown in S1 File, unlike the conventional methods. [12,14,15] Previous studies have attempted to predict AHF prognosis, such as readmission and mortality, by using conventional machine-learning prediction models. [18,19] In the present study, we confirmed that the DNN outperformed the other conventional machine-learning prediction models, such as RF, LR, SVM, and BN. Machine-learning prediction models require careful engineering to design a feature extractor that selects important variables, [17] a process that requires substantial manpower and risks the loss of important information.
Deep-learning with DNNs includes a feature-learning process: a set of methods that allows a model to be fed with raw data and to automatically identify the features needed for a task. [12,14,15] Deep-learning comprises multiple processing layers of features, as shown in S1 File, obtained by composing simple but non-linear modules, each of which transforms the features at one level (starting with the raw input) into features at a higher and more abstract level. Because this process is conducted automatically, it is good at discovering intricate structures in high-dimensional data without information loss, and it requires very little engineering by humans. [12,14,15] Therefore, deep-learning can be applied to tasks quickly and with ease and can outperform other conventional machine-learning models.
Deep- and machine-learning models learn the relationship between the predictor and outcome variables rather than applying rules based on medical knowledge. Hence, the performance of machine- and deep-learning models is not guaranteed in other settings. Wolpert's no-free-lunch theorem states that a model optimized for one situation cannot be expected to produce good results in other situations. [27] Because deep- and machine-learning models can overfit to the characteristics of the hospitals in the training data, we conducted a performance test using completely separated test data, which were not used for model development.
In the present study, we amplified the training dataset by collecting multiple datasets from each patient's data. Using this method, we obtained a sufficient amount of data for developing deep- and machine-learning models, which require abundant data for their development. Because the available data are limited and the outcomes to be predicted are highly diverse in the medical field, this method is promising for future studies in medical domains.
The deep- and other machine-learning models predicted the endpoints using different structures, as shown in S1 File, and the patients whose endpoints each model correctly predicted also differed. Furthermore, the variable importance of the DAHF differs from that of RF, LR, SVM, and BN, as shown in S2 File. Therefore, different algorithms can complement each other's weaknesses. We used a combination of two deep-learning algorithms (DNNs) to predict both in-hospital and long-term mortality simultaneously; this approach of combining diverse predictive algorithms to improve accuracy is called an ensemble, and many researchers have adopted it. [28]
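In its simplest form, such an ensemble could average the predicted probabilities of the two DNNs; the equal weighting below is an illustrative assumption, not the authors' published combination scheme.

```python
import numpy as np

def ensemble_risk(p_inhospital, p_12month, w=0.5):
    """Weighted average of two models' predicted probabilities.

    w is the weight on the in-hospital model; w=0.5 gives equal weighting.
    This is a sketch of a simple probability-averaging ensemble, not the
    study's exact combination rule.
    """
    p_inhospital = np.asarray(p_inhospital, dtype=float)
    p_12month = np.asarray(p_12month, dtype=float)
    return w * p_inhospital + (1 - w) * p_12month

# Two hypothetical patients, scored by both endpoint-specific models.
risk = ensemble_risk([0.8, 0.1], [0.6, 0.3])
```

Because averaging preserves the [0, 1] range, the combined score can still be thresholded (e.g., at the Youden J cutoff) to define risk groups.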
Our study has several limitations. First, the deep-learning model is known as a "black box." Although we can fit and develop a deep-learning-based artificial intelligence model, we cannot interpret how the model arrives at its risk score. For example, if the DAHF predicts a high risk of mortality for a patient, the reason for that decision cannot be ascertained. Interpretable deep-learning models have recently been studied and will be our next area of study. [29,30] Second, this prediction model was developed with the limited set of variables that could be collected from the KorAHF registry. As shown in previous studies by Cacciatore et al., functional parameters such as the 6-minute walk test are good prognostic factors in cardiac diseases. [31] We plan to use more information on HF patients, with more valuable variables, to enhance the performance of the AI algorithm. Third, this study was conducted with retrospective big data. Thus, we could not extract accurate past medical histories, such as respiratory diseases and malignancies, that could also affect long-term mortality. We plan to conduct a prospective study to validate the AI algorithm and confirm the correlation between medical history and HF in our next research.

Conclusions
In conclusion, we developed and validated a new artificial intelligence mortality prediction model for AHF based on the deep-learning approach. The deep-learning algorithm, DAHF, predicted the in-hospital, 12-month, and 36-month mortality of patients with AHF more accurately than the existing risk scores and other machine-learning methods. This study showed the feasibility and effectiveness of a deep-learning-based artificial intelligence algorithm for cardiology, which can be a useful tool for precise decision making in daily practice.