Development and verification of prediction models for preventing cardiovascular diseases

Objectives: Cardiovascular disease (CVD) is one of the major causes of death worldwide. To improve the accuracy of CVD prediction, risk classification was performed using national time-series health examination data. These data offer an opportunity to apply deep learning (RNN-LSTM), which is widely known as an outstanding algorithm for analyzing time-series datasets. The objective of this study was to show the improved accuracy of deep learning by comparing the performance of a Cox hazard regression model and an RNN-LSTM based on survival analysis.
Methods and findings: We selected 361,239 subjects (age 40 to 79 years) with at least two health examination records from 2002–2006 using the National Health Insurance System-National Health Screening Cohort (NHIS-HEALS). The average number of health screenings (from 2002–2013) used in the analysis was 2.9 ± 1.0. Two CVD prediction models were developed from the NHIS-HEALS data: a Cox hazard regression model and a deep learning model. In an internal validation of the NHIS-HEALS dataset, the Cox regression model showed a highest time-dependent area under the curve (AUC) of 0.79 (95% CI 0.70 to 0.87) in females and 0.75 (95% CI 0.70 to 0.80) in males at 2 years. The deep learning model showed a highest time-dependent AUC of 0.94 (95% CI 0.91 to 0.97) in females and 0.96 (95% CI 0.95 to 0.97) in males at 2 years. Layer-wise Relevance Propagation (LRP) revealed that age was the variable that had the greatest effect on CVD, followed by systolic blood pressure (SBP) and diastolic blood pressure (DBP), in that order.
Conclusion: The deep learning model predicted CVD occurrence better than the Cox regression model. In addition, LRP confirmed that the risk factors identified as important by the model matched the known risk factors reported in previous clinical studies.


Introduction
Cardiovascular disease (CVD) is one of the leading causes of mortality worldwide [1]. Because multiple risk factors are associated with CVD, managing these risk factors is difficult but could prevent numerous deaths. In previous studies, various prediction models were developed to identify individuals who have a high risk of developing CVD, and Cox hazard regression analysis has been the traditional approach [2][3][4][5][6][7]. Cox hazard regression models have been used to quantify risk factors in terms of risk ratios and to provide the probability that an individual will develop CVD, enabling personalized treatment for high-risk individuals [8].
Cox hazard regression models assume the independence of predictors and use pre-specified risk factors [8]. In a prospective cohort, the selected risk factors are measured at pre-planned times, so the collected information can be fully used by statistical methods. However, because the types and cycles of risk factor measurements vary across clinical studies, existing statistical models cannot use all the available information on CVD risk, and only parts of those databases are used. Modern hospital information systems (HIS) have created complex, digitalized, time-series health datasets. However, appropriate analysis methods for maximizing predictive performance with these multi-measurement datasets have not been clearly defined.
Deep learning is a type of machine learning algorithm [9,10] and has been demonstrated to have outstanding performance in the classification of data [11,12]. Deep learning applies multiple layers of transformations [8], which can improve a predictive model's performance on datasets composed of complex time-varying data. To date, several small studies have explored the potential of deep learning for disease-risk prediction using data from specific time points [13][14][15]. Accordingly, this study evaluates the discriminative accuracy of a deep learning model, based on survival analysis with repeated health data for CVD prediction, by comparing its results with those of a conventional Cox hazard regression analysis. The forecasts of the two models were calculated for specific time points through classification. We also verified the models.

Data source
This study used the National Health Insurance System-National Health Screening Cohort (NHIS-HEALS) [16] data, derived from a national health screening program and the national health insurance claim database in the National Health Insurance System (NHIS) of South Korea, and prospective cohort data from the Rotterdam Study [17]. Data from the NHIS-HEALS were fully anonymized for all analyses, and informed consent was not specifically obtained from each participant. In the Rotterdam Study, all data were collected in a standardized manner according to a pre-determined study protocol, and informed consent was obtained from all participants. This study was approved, with a waiver of informed consent, by the Institutional Review Board of Yonsei University, Severance Hospital in Seoul, South Korea (IRB no. 4-2016-0383).

Study population
The NHIS constructed the NHIS-HEALS cohort, which consists of data from 514,866 people (age 40 to 79 years), randomly sampled from 10% of the source population, who had undergone the NHIS health examination in 2002-2003 as the baseline. These cohort data represent the Korean adult population, as every Korean over 40 years of age is required to join the NHIS and is recommended to have regular biennial checkups. Because of this recommendation, the baseline for this study can be defined as the years 2002-2003. The data include information from 2002 to 2013, and repeated measurements were selected for research purposes because they are useful for assessing discriminative accuracy.
The following steps were implemented for the data manipulation: (a) of the 514,866 individuals, those with pre-existing histories of CVD were excluded; (b) those who had treatment records of CVD or death, or a history of stroke or heart disease, at the baseline were removed; (c) only those with at least two screenings from 2002-2006 were included; and (d) the remaining 361,239 subjects, who did not have CVD at the baseline, were divided into two subgroups: a training set (80%, 288,992 subjects) and a test set (20%, 72,247 subjects).
Consequently, a total of 288,992 subjects were allocated to the training set (18,904 with CVD vs. 277,088 without CVD) and were used to build separate models by sex. We also constructed a dedicated dataset from the Rotterdam Study for external verification of the model built on NHIS-HEALS (see S1 Appendix for the details of the Rotterdam Study). For the external verification, the Rotterdam Study dataset was constructed using the same criteria as the NHIS-HEALS training set. Fig 1 presents the flow and detailed processes of all data handling.
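The 80%/20% division described above can be illustrated with a short sketch. The source does not state how subjects were assigned, so the random shuffle, the fixed seed, and the rounding-up of the training size are assumptions chosen only because they reproduce the reported counts; `split_cohort` is a hypothetical helper name.

```python
import math
import random

def split_cohort(subject_ids, train_fraction=0.8, seed=0):
    """Shuffle subject IDs and split them into training and test sets.

    Assumption: a simple random split; the paper does not describe the
    exact assignment procedure.
    """
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    # Rounding up is an assumption that matches the reported 288,992 / 72,247.
    n_train = math.ceil(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]

# Hypothetical subject IDs 0..361,238 standing in for the eligible cohort.
train_ids, test_ids = split_cohort(range(361_239))
print(len(train_ids), len(test_ids))
```

With 361,239 eligible subjects, this yields set sizes equal to those reported in the text.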

Outcomes
The primary outcome was defined as the occurrence of one of the following events during the follow-up period after the baseline health examination: (1) death from CVD (International Classification of Diseases 10th edition [ICD-10] codes), (2) hospitalization due to myocardial infarction, coronary arterial intervention or bypass surgery or (3) hospitalization due to stroke.

Converting the output variables for clinical studies
In medical research, using a Recurrent Neural Network-Long Short-Term Memory (RNN-LSTM) model for survival analysis requires a way to determine whether disease occurred at a specific time point. We therefore transformed the binary output variable into a vector of output variables covering multiple time points, enabling point-in-time analysis in line with previous studies that used vector output variables [18][19][20][21].
To find the specific points in time when disease occurred, we analyzed each year separately by converting the output variables. In the output layer, each node represents a time point, from two to ten years, in 1-year intervals. The value of each node is the probability of survival at that time point. The survival probability after disease onset is 0, and for censored cases the probability after the disease-free survival time is estimated by the Kaplan-Meier survival function [20]. The predicted output is thus the probability of survival for each time point.
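The target construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy follow-up data are hypothetical, and the use of the conditional Kaplan-Meier probability S(t)/S(c) for a subject censored at time c is an assumption about how the Kaplan-Meier estimate enters the target.

```python
YEARS = list(range(2, 11))  # one output node per year, from 2 to 10

def km_survival(times, events, t):
    """Kaplan-Meier estimate of S(t) from follow-up times and event flags."""
    surv = 1.0
    for et in sorted({u for u, e in zip(times, events) if e and u <= t}):
        at_risk = sum(1 for u in times if u >= et)
        failed = sum(1 for u, e in zip(times, events) if e and u == et)
        surv *= 1.0 - failed / at_risk
    return surv

def survival_targets(time, event, times, events):
    """Build the per-year survival target vector for one subject.

    time/event describe the subject; times/events describe the cohort
    used for the Kaplan-Meier estimate (the subject is assumed to be in it).
    """
    s_at_exit = km_survival(times, events, time)
    targets = []
    for year in YEARS:
        if year < time or (not event and year == time):
            targets.append(1.0)   # still observed disease-free
        elif event:
            targets.append(0.0)   # disease has already occurred
        else:
            # Censored: assumed conditional survival given survival to `time`.
            targets.append(km_survival(times, events, year) / s_at_exit)
    return targets

# Hypothetical toy cohort: follow-up in years and event indicator (1 = CVD).
followup_years = [3, 5, 6, 8, 10]
had_event = [1, 0, 1, 1, 0]
event_vec = survival_targets(3, 1, followup_years, had_event)
censored_vec = survival_targets(5, 0, followup_years, had_event)
```

A subject with an event at year 3 receives a target of 1 for year 2 and 0 for every later year, whereas a subject censored at year 5 receives 1 up to year 5 and a Kaplan-Meier-based probability afterwards.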
We then compared the survival probability from the Cox regression model and the probability from the deep learning model with the observed outcomes to obtain the AUC for each year. In this way, we assessed the predictive performance of both models, Cox regression and deep learning, by calculating the AUC for each year.
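The year-by-year comparison can be sketched as below. This is a simplified illustration that treats each year as a binary classification problem and ignores censoring, whereas the study reports time-dependent AUCs; the function names and the toy predictions are hypothetical.

```python
def auc(labels, scores):
    """Rank-based AUC: probability that a case scores higher than a non-case."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def yearly_auc(event_year, predicted_survival, years):
    """AUC per year, using 1 - predicted survival as the risk score.

    event_year[i] is the year of the CVD event, or None if no event.
    predicted_survival[i][k] is the predicted survival at years[k].
    """
    result = {}
    for k, year in enumerate(years):
        labels = [1 if (ey is not None and ey <= year) else 0
                  for ey in event_year]
        risk = [1.0 - p[k] for p in predicted_survival]
        result[year] = auc(labels, risk)
    return result

# Hypothetical predictions for four subjects at years 2, 3 and 4.
event_year = [2, None, 4, None]
predicted_survival = [
    [0.20, 0.10, 0.10],
    [0.90, 0.90, 0.80],
    [0.80, 0.60, 0.85],
    [0.95, 0.90, 0.90],
]
auc_by_year = yearly_auc(event_year, predicted_survival, [2, 3, 4])
print(auc_by_year)
```

The same routine can be applied to the survival probabilities of either model, which gives one AUC per model per year for comparison.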

Risk predictors used in model building
To develop the risk model, an a priori decision was made to use the following predictor variables: age, body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), total cholesterol (TC), fasting plasma glucose (FPG), current smoking and exercise. Details of the variables included in the Cox regression and deep learning models are described in S1 Table. Variables with missing data (less than 4%) were included in the analysis. Missing values were handled by multiple imputation with fully conditional specification [22], using the MI procedure in SAS 9.4 [23].

Prediction model of statistics and deep learning
We developed CVD prediction models by sex, as it is known that there are significant differences in the risk factors and occurrence rates of CVD between the sexes [24]. Data from the baseline health examinations and repeated measurements from the periodic follow-up examinations were used to build the prediction models. The time to event was defined as the time between the date of the first health examination and that of the first diagnosed event or the last date in the cohort in non-event subjects. Also, the data used in the analysis was the health examination data from 2002-2006. For example, if a patient with a disease in 2005 had two records of health screenings in 2002 and 2004, the analysis was performed using both health screening records. As another example, assuming that a patient diagnosed with a disease in 2009 had four health examinations every two years from 2002 to 2008, the analysis was conducted by using only health information from 2002 to 2006. This decision was made to control the disparity in the volume of information among subjects by adjusting the amount of time from which screening records were used.
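The record-selection rule and the time-to-event definition above can be sketched as follows. The exact dates are hypothetical, and two details are assumptions rather than statements from the source: the window is taken to end on 31 December 2006, and the last date in the cohort is taken to be 31 December 2013.

```python
from datetime import date

WINDOW_END = date(2006, 12, 31)   # assumed end of the 2002-2006 window
COHORT_END = date(2013, 12, 31)   # assumed last date in the cohort

def select_records(exam_dates, event_date=None):
    """Keep screenings inside the window that precede the event, if any."""
    return [d for d in sorted(exam_dates)
            if d <= WINDOW_END and (event_date is None or d < event_date)]

def time_to_event_years(first_exam, event_date=None):
    """Years from the first examination to the event or to the cohort end."""
    end = event_date if event_date is not None else COHORT_END
    return (end - first_exam).days / 365.25

# First example from the text: event in 2005, screenings in 2002 and 2004.
early_case = select_records([date(2002, 5, 10), date(2004, 5, 12)],
                            event_date=date(2005, 3, 1))

# Second example: event in 2009, biennial screenings from 2002 to 2008.
late_case = select_records(
    [date(2002, 4, 1), date(2004, 4, 1), date(2006, 4, 1), date(2008, 4, 1)],
    event_date=date(2009, 1, 15))

followup = time_to_event_years(date(2002, 6, 1), date(2005, 6, 1))
```

In the first case both screenings are kept; in the second, the 2008 screening is dropped, so both subjects contribute records from the same calendar window.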
The Cox model using longitudinal data, and its improved accuracy over single-measure methods, has been described previously [25]; it was used here as the comparator for deep learning on longitudinal data. In this study, for the Cox regression model, we used the mean, minimum and maximum values and standard deviations (SDs) as continuous variables and the mean and SDs as categorical variables calculated from the periodic health screening data. The details of the measurement of risk factors in the Cox modeling are described in S2 Appendix.
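As an illustration of how repeated screenings could be collapsed into covariates for the Cox model, the sketch below computes the summary statistics named above. It reads the text as "mean, minimum, maximum and SD for continuous predictors; mean and SD for categorical predictors coded 0/1", which is an interpretation; the sample SD and the function names are likewise assumptions, and the actual definitions are in S2 Appendix.

```python
import statistics

def _sd(values):
    # Sample SD (assumption); defined as 0.0 for a single measurement.
    return statistics.stdev(values) if len(values) > 1 else 0.0

def summarize_continuous(values):
    """Summary covariates for a repeatedly measured continuous predictor."""
    return {
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
        "sd": _sd(values),
    }

def summarize_categorical(values):
    """Summary covariates for a repeatedly measured 0/1 predictor."""
    return {"mean": statistics.fmean(values), "sd": _sd(values)}

# Hypothetical subject with three screenings.
sbp_summary = summarize_continuous([120, 130, 140])   # SBP in mmHg
smoking_summary = summarize_categorical([1, 1, 0])    # current smoking
```

For a 0/1 predictor, the mean is simply the proportion of screenings at which the factor was present.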
For the deep learning model based on survival analysis, an RNN-LSTM [26] network was used. The deep learning model was constructed using the same variables as the Cox regression model with longitudinal data. Our proposed LSTM model was designed with the following structure. For optimization, RMSProp [27] was used to update the parameters through back-propagation. The hyper-parameters were a learning rate of 0.01, a dropout probability of 50%, and a mini-batch size of 64. The target labels were one-hot encoded for use with a cross-entropy loss function. The number of classes was 2. The details of the deep learning and model building process are described in S3 Appendix. The resulting performance was then evaluated with the C-statistic or AUC [28]. Research has demonstrated that the C-statistic is analogous to the AUC [29].
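The training ingredients named above (one-hot targets, cross-entropy loss, and RMSProp updates with a learning rate of 0.01) can be illustrated without a deep learning framework. The sketch below is not the LSTM itself and omits dropout and mini-batching; the decay rate of 0.9, the epsilon values, and the toy two-logit "model" are assumptions made for illustration only.

```python
import math

def one_hot(label, num_classes=2):
    """Encode a class index as a one-hot vector."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))

def rmsprop_step(params, grads, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update; decay and eps are assumed, lr is from the text."""
    new_params, new_cache = [], []
    for p, g, c in zip(params, grads, cache):
        c = decay * c + (1.0 - decay) * g * g   # running mean of squared grads
        new_params.append(p - lr * g / (math.sqrt(c) + eps))
        new_cache.append(c)
    return new_params, new_cache

# Toy example: fit two logits directly to a single sample of class 1.
logits, cache = [0.0, 0.0], [0.0, 0.0]
target = one_hot(1)
initial_loss = cross_entropy(target, softmax(logits))
for _ in range(100):
    probs = softmax(logits)
    grads = [p - t for p, t in zip(probs, target)]  # gradient of softmax + CE
    logits, cache = rmsprop_step(logits, grads, cache)
final_loss = cross_entropy(target, softmax(logits))
```

In a full model the same update would be applied to the LSTM weights, with gradients obtained by back-propagation through time over mini-batches of 64 subjects.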

Evaluation of prediction performance
The prediction performance of each model was evaluated using the NHIS-HEALS data and external test data from the Rotterdam Study. Model discrimination was quantified by calculating the C-statistic for the survival model. All statistical analyses were conducted with SAS (version 9.4, SAS Inc., Cary, NC, USA) and the R Statistical Package (www.R-project.org). The statistical significance criterion was set at 2-sided p < 0.05.

The solution to the problem of understanding classification decisions
To overcome the inability of the deep learning model to explain the reasons for its classifications, we examined the influence of the input variables using Layer-wise Relevance Propagation (LRP) [30], one of many explainable artificial intelligence (XAI) techniques used with artificial neural networks [31,32].
The order of the variables is determined by the mean of the LRP output values over the input samples, sorted in descending order. With n feature variables, m input samples, and prediction-model outputs $o = \{o_1, \ldots, o_m\}$, the feature variables are ranked by their mean relevance across the m samples.
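The ranking step described above can be sketched as follows, assuming the per-sample LRP relevance scores have already been computed by some LRP implementation. The relevance values, the feature subset, and the function name are hypothetical and serve only to show the averaging and sorting.

```python
def rank_features(relevance, feature_names):
    """Rank features by mean LRP relevance across samples, highest first.

    relevance[i][j] is the relevance of feature j for input sample i
    (m samples by n features).
    """
    m = len(relevance)
    n = len(feature_names)
    means = [sum(row[j] for row in relevance) / m for j in range(n)]
    order = sorted(range(n), key=lambda j: means[j], reverse=True)
    return [(feature_names[j], means[j]) for j in order]

# Hypothetical relevance scores for m = 3 samples and n = 4 features.
features = ["age", "SBP", "DBP", "BMI"]
relevance = [
    [0.50, 0.30, 0.15, 0.05],
    [0.40, 0.25, 0.20, 0.15],
    [0.60, 0.20, 0.10, 0.10],
]
ranking = rank_features(relevance, features)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Averaging over samples gives one global importance score per variable, which is what allows the variables to be listed in a single ordered table.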
Through this technique, we present the effect of the feature variables used to build the model.

Results
Table 1 presents the characteristics of the training cohort at baseline. The mean age was 51.2 ± 8.9 years, and a total of 164,024 male subjects (56.76%) were included in the cohort. The average number of health screenings used in the analysis was 3.1 ± 1.1 for male subjects and 2.6 ± 0.9 for female subjects.
In the internal validation using the NHIS-HEALS cohort data, the highest time-dependent AUC of the Cox regression model was 0.79 (95% CI 0.70 to 0.87) at 2 years in female subjects [...] Table. Furthermore, the results of the LRP demonstrated that the known risk factors identified in previous studies do affect CVD, and they provided a numerical impact for each risk factor used in the deep learning modeling. The deep learning model showed that age was the variable with the greatest effect on CVD occurrence. SBP, DBP, sex and FPG were also ranked near the top. The details are described in Table 2.

Discussion
The principal findings of this study were as follows. (1) The deep learning algorithm had significantly better predictive power for CVD than Cox regression analysis. However, although the deep learning algorithm maintained high predictive power within 5 years, its performance decreased sharply thereafter. (2) The verification using the Rotterdam Study confirmed that the predictive power of the deep learning algorithm was better than that of the Cox regression analysis. This is the first large-scale and systematic assessment of a deep learning approach for predicting the occurrence of CVD at a particular point in time, and it suggests that the approach can be generalized regardless of race. (3) The effects of the various risk factors were identified through LRP. LRP might be useful for identifying the impact of risk factors that the deep learning approach alone cannot identify. Since electronic health records (EHR) were introduced decades ago, huge amounts of medical data have accumulated. The nationwide repeated health screening system in Korea cannot be applied to all medical systems, but as HIS has developed into a medical platform, the accumulation of large-scale datasets in the medical field is accelerating. The deep learning model can be a useful tool for risk prediction in the EHR era by providing discrimination and calibration using repeatedly measured data.
Disease prediction using deep learning, a subfield of machine learning, has been studied previously [33][34], and deep learning has been shown to have high value in classification problems [11][12][35][36]. Deep learning differs from statistical approaches such as Cox regression analysis. The Cox regression model assumes independence between predefined variables and does not reflect changes in those variables over time, whereas deep learning can use variables that are constantly changing. In this study, these advantages were reflected in the improved accuracy of CVD predictions, but after five years the performance of the deep learning model was similar to that of the Cox model. In the Rotterdam Study, the deep learning model maintained a high level of performance (an AUC of about 0.8) over a longer period of time than the Cox model. This seems to be due to an increase in CVD incidence rates over time: the annual incidence rate of CVD in the internal data increased by about 0.5%, whereas in the Rotterdam Study it increased by about 1.5%, with the rate of increase falling markedly from year 9. When the rate of increase of CVD occurrence was significantly reduced, the predictive power of the deep learning model was also reduced. Therefore, while deep learning is appropriate for identifying risk factors that predict the occurrence of disease within 5 years using constantly changing data, predictions beyond 5 years require scrutiny. One of the major disadvantages of the deep learning model is that it cannot provide specific recommendations for controlling risk factors, because the risk factors that affect the event occurrence are unknown. To overcome this shortcoming, we used LRP to assess the risk factors individually. The results of the LRP show that the risk factors considered important in previous clinical studies were similar to those shown to be important by the deep learning model: age, sex, SBP, TC, smoking, and exercise, among others [37][38][39].
However, this study has several limitations. First, because only the information obtained from the screening data was available, it was not possible to reflect changes in the level of risk caused by unpredictable pharmacological or non-pharmacological treatments arising from physician or patient behavior during follow-up. In addition, the risk of CVD may change with changes in the risk factors and with interactions between risk factors, but research on this is still lacking. Second, although we ranked the risk factors separately using LRP, the model does not provide the size of the effect of each risk factor, such as a hazard ratio, because of the nature of the hidden layers of neural network models. Therefore, the model is not yet ready for clinical use, and further studies are needed to overcome this limitation. Third, unlike in the NHIS-HEALS, the comparison with the internal validation performance was limited in the Rotterdam Study because the only variables available were age, sex, BLDS, BMI, SBP, DBP, exercise and smoking.

Conclusions
Deep learning models have greater predictive power for CVD occurrence than the Cox regression model within five years. In addition, it was confirmed that the risk factors shown to be important in previous clinical studies were also extracted from the results of this study using LRP.