Comparison of risk models for mortality and cardiovascular events between machine learning and conventional logistic regression analysis

Aims: Non-linear models developed by machine learning may identify risk factors, and weight them, differently from conventional linear models.

Methods and results: The analyses were performed in 15,933 patients included in the Shinken Database (SD) 2004–2014 (n = 22,022) for whom baseline blood sampling and ultrasound cardiogram data and 2-year follow-up data were available. Using non-linear models built with machine learning software, 118 risk factors and their risk weightings for all-cause mortality, heart failure (HF), acute coronary syndrome (ACS), ischemic stroke (IS), and intracranial hemorrhage (ICH) were identified; the top two risk factors for these outcomes were albumin/hemoglobin, left ventricular ejection fraction/history of HF, history of ACS/anti-platelet use, history of IS/deceleration time, and history of ICH/warfarin use, respectively. The areas under the curve of the developed models for each event were 0.900, 0.912, 0.879, 0.758, and 0.753, respectively.

Conclusion: Here, we describe our experience with the development of models for predicting cardiovascular prognosis by machine learning. Machine learning identified risk prediction models with good predictive capability and good discrimination of the risk impact.


Introduction
Risk prediction models for cardiovascular disease (CVD) are generally based on the assumption that each risk factor is linearly associated with CVD outcomes. [1,2] Such models may oversimplify complex relationships, which potentially include both non-linear associations and non-linear interactions. Therefore, better approaches are needed to develop risk models that reflect the real relationships between risk factors and outcomes.
Machine learning (ML) offers an alternative approach for the development of prediction models. ML is a scientific discipline at the intersection of statistics and computer science that depends on efficient computing algorithms. The importance of ML has been recognized through the challenges of building statistical models from massive data sets, which require computational methods. [3] ML may detect complex and non-linear interactions between variables by minimizing the error between predicted and observed outcomes. [4] To date, although there have been several investigations comparing model development for prognostic assessment of CVD between ML and commonly used statistical methods, no large-scale investigations have been reported in a Japanese cohort. This study was performed to examine whether there are any differences between modelling by ML and by logistic regression analysis using a single-centre cohort in a cardiovascular hospital in Japan. [5,6]

Study population
The Shinken Database includes all patients newly visiting the Cardiovascular Institute in Tokyo, Japan ('Shinken' is a Japanese abbreviation for the name of the hospital), excluding foreign travelers and patients with active cancer. This hospital-based database was established to investigate the prevalence and prognosis of CVD. [5,6] Our hospital is a specialized cardiology hospital in an urban area of Tokyo, Japan. The patients seen were not only local residents but also patients referred from other clinics for treatment of CVD. The attending physicians were all cardiologists or cardiothoracic surgeons.
The registry began in June 2004, and patients have been registered continuously in the database since then. A total of 29,832 patients were registered between June 2004 and March 2016. Of these, 15,993 patients whose 2-year follow-up data were available were analysed in the present study.

Ethics
The ethics committee of the Cardiovascular Institute approved this study, and all patients provided written informed consent.

Data collection
After obtaining an electrocardiogram and chest X-ray, the cardiovascular status of the patients was evaluated by echocardiography, exercise testing, 24-hour Holter recording, and blood laboratory data from the initial visit. In addition to gender, age, height, weight, and medications prescribed at the initial visit, we collected data on CVD, including heart failure (HF; New York Heart Association class ≥ 2), valvular heart disease (moderate or severe stenosis or regurgitation on echocardiography), coronary artery disease (diagnosed by angiography or scintigraphy), hypertrophic and dilated cardiomyopathy (diagnosed by echocardiography or magnetic resonance imaging), left ventricular non-compaction (diagnosed by echocardiography), and history of a disabling cerebral infarction or transient ischemic attack (diagnosed by computed tomography or magnetic resonance imaging). The presence of cardiovascular risk factors was also determined, including hypertension (use of anti-hypertensive agents, systolic blood pressure ≥ 140 mmHg, or diastolic blood pressure ≥ 90 mmHg on admission), diabetes mellitus (use of oral hypoglycemic agents or insulin, or glycosylated hemoglobin ≥ 6.5%), dyslipidemia (use of a statin or triglyceride-lowering drugs, low-density lipoprotein cholesterol ≥ 140 mg/dL, high-density lipoprotein cholesterol < 40 mg/dL, or triglycerides ≥ 150 mg/dL), chronic kidney disease (estimated glomerular filtration rate < 60 mL/minute/1.73 m^2), chronic obstructive pulmonary disease, and use of anti-coagulant and anti-platelet medications. Body mass index (BMI) was calculated as weight in kilograms divided by height in meters squared. The glomerular filtration rate (GFR) was estimated using the new Japanese coefficient for the modified isotope dilution mass spectrometry (IDMS)-traceable 4-variable Modification of Diet in Renal Disease (MDRD) study equation: GFR = 194 × SCr^(−1.094) × Age^(−0.287) × 0.739 (if female), where SCr is serum creatinine.
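For concreteness, the eGFR calculation above can be sketched in Python (a minimal illustration for clarity; the function names here are our own, not from the software used in the study):

```python
def egfr_mdrd_jp(scr_mg_dl: float, age_years: float, female: bool) -> float:
    """Estimated GFR (mL/min/1.73 m^2) by the Japanese coefficient for the
    IDMS-traceable 4-variable MDRD study equation used in this study:
    GFR = 194 x SCr^-1.094 x Age^-0.287 x 0.739 (if female)."""
    gfr = 194.0 * (scr_mg_dl ** -1.094) * (age_years ** -0.287)
    if female:
        gfr *= 0.739
    return gfr

def has_ckd(scr_mg_dl: float, age_years: float, female: bool) -> bool:
    """Chronic kidney disease was defined in this study as eGFR < 60."""
    return egfr_mdrd_jp(scr_mg_dl, age_years, female) < 60.0
```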
[7]

Parameters assessed for the prediction models

Of the parameters obtained for the database, 118 parameters were used for development of the prediction models (Table 1). Patient parameters (including age, sex, smoking habit, and drinking habit), diagnoses of comorbidities, and information regarding medications were obtained. Blood pressure, laboratory data, and ultrasound cardiogram parameters were measured in almost all of the patients; however, in some patients, measurements were occasionally not performed (i.e., at the patient's request or at the discretion of the attending physician), and some parameters were missing for technical reasons. As described below, in the ML method, missing values were imputed by an automatically selected algorithm, whereas data with missing values were excluded from the logistic regression analysis with commonly used statistical software.

Patient outcome
The patient outcomes in the present study included the following six events: all-cause mortality, cardiovascular events, heart failure (HF) events, acute coronary syndrome (ACS) events, ischemic stroke (IS) events, and intracranial hemorrhage (ICH) events.

Data analysis by machine learning
We developed linear and non-linear models using an automated ML platform, DataRobot. [8] More than 3,000 procedure sets of data processing, feature engineering, and ML algorithms, including the Support Vector Machine, Elastic Net Classifier, Regularized Logistic Regression, Stochastic Gradient Descent Classifier, and Neural Network Classifier, are available from its repository. The software automatically chooses and executes suitable procedure sets when investigating the patterns in the data. All of the developed models were verified by cross-validation and sorted by the selected evaluation metric, e.g., the area under the curve (AUC).

1) Data preprocessing. From a large number of data preprocessing approaches, the following were automatically selected in the final models: imputing missing values, one-hot encoding of categorical values, standardization of numerical values, and creation of new parameters by unsupervised learning from the original parameters. Missing numerical values were imputed with the median of each parameter, and missing categorical values were treated as their own categorical level and given their own parameters. Categorical values were converted to multiple binary parameters by one-hot encoding where needed. For some models, numerical values were standardized within each parameter by subtracting the mean and dividing by the standard deviation. Moreover, some new parameters were created internally by summarizing the original parameters with an unsupervised learning method.
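The preprocessing steps above can be sketched as follows (a simplified illustration of the general techniques; DataRobot's internal implementations differ in detail):

```python
import numpy as np

def impute_median(x):
    """Replace missing numeric values (np.nan) with the column median,
    as in the imputation step described above."""
    x = np.asarray(x, dtype=float)
    med = np.nanmedian(x)
    return np.where(np.isnan(x), med, x)

def one_hot(values, missing_token="MISSING"):
    """One-hot encode a categorical column into binary parameters; missing
    values (None) become their own level, mirroring the handling above."""
    vals = [missing_token if v is None else v for v in values]
    levels = sorted(set(vals))
    matrix = np.array([[1.0 if v == lv else 0.0 for lv in levels] for v in vals])
    return levels, matrix

def standardize(x):
    """Standardize a numeric column: subtract the mean, divide by the SD."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()
```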
2) Model validation. All developed models were validated by cross-validation and holdout, using the AUC of the receiver-operating characteristic (ROC) curve as the evaluation metric. Before developing the models, 20% of the dataset was randomly selected as the holdout, which was never used in training or validation. The remaining data were randomly divided into five mutually exclusive folds, four of which were used together for training, with the final fold used for validation. [9] Models were trained five times per algorithm, with each fold used once for validation. Cross-validation scores were calculated as the mean AUC over the five validation folds. [10] Random selection for cross-validation and holdout was performed by stratified sampling, which preserves the ratio of positive and negative cases. Finally, the models were validated on the holdout to demonstrate generalization performance on new data. As the holdout was taken as a single sample, no confidence intervals were calculated.
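The validation scheme above (stratified five-fold cross-validation scored by AUC) can be sketched as follows; `fit` and `predict` are placeholder callables standing in for any modelling algorithm, not part of the study's software:

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC of the ROC curve via the pairwise (Mann-Whitney) identity:
    the probability that a positive case outranks a negative case."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(y, n_folds=5, seed=0):
    """Split indices into mutually exclusive folds while preserving the
    ratio of positive and negative cases (stratified sampling)."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for label in (0, 1):
        idx = np.flatnonzero(np.asarray(y) == label)
        rng.shuffle(idx)
        for i, j in enumerate(idx):
            folds[i % n_folds].append(int(j))
    return [sorted(f) for f in folds]

def cross_validate(fit, predict, X, y, n_folds=5, seed=0):
    """Train on four folds, validate on the fifth, and report the mean
    AUC over the validation folds (the cross-validation score)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    scores = []
    for val_idx in stratified_folds(y, n_folds, seed):
        train = np.setdiff1d(np.arange(len(y)), val_idx)
        model = fit(X[train], y[train])
        scores.append(auc_score(y[val_idx], predict(model, X[val_idx])))
    return float(np.mean(scores))
```

In practice the 20% holdout would be split off by the same stratified sampling before these folds are formed and scored only once at the end.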
3) Permutation importance. The relative importance of a parameter in the models was assessed using the permutation importance (PI), as described by Breiman. [11] This method is widely used in ML, as it can be applied to both linear and non-linear models. To calculate the PI of a parameter in a model, its values in the validation data were randomly shuffled (reordered), keeping the other parameters the same as before. If the parameter has considerable importance for the outcome, the resulting performance score on the evaluation metric should decline significantly. We calculated the PI of all parameters and divided by the maximum decline in the performance score relative to the original score, to normalize the values and allow comparison among different models. The calculation was conducted several times to ensure stability against the random shuffling.

4) Partial dependence. To understand how changes in the value of a parameter affect the outcome, we constructed partial dependence plots as described by Friedman. [12] To construct the partial dependence plot of a parameter in a model, we replaced all values of that parameter with a constant value, calculated predictions from the model, and computed the mean of those predictions. We repeated the calculation for many values to observe how the model reacts to changes in the parameter of interest.
Logistic regression analysis by commonly used software

1) Model development. For comparison with prediction modelling by ML, logistic regression analysis was performed with commonly used statistical software (SPSS ver. 19; IBM Corp., Armonk, NY, USA). We used the 118 parameters described above, and continuous variables were assumed to have a linear association with the patient outcomes. The multivariate model was developed with the forward stepwise method. Interactions between parameters were not considered. Data with missing values were excluded from the analysis.
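The forward stepwise idea can be sketched generically as follows (a greedy simplification for illustration only; SPSS's stepwise procedure uses formal entry/removal tests on score and Wald statistics, and `score_with` here is a placeholder for whatever model-fitting criterion is used):

```python
def forward_stepwise(candidates, score_with):
    """Greedy forward selection: starting from an empty model, repeatedly
    add the candidate parameter that most improves the model score, and
    stop when no addition improves it."""
    selected = []
    best = score_with(selected)
    improved = True
    while improved:
        improved = False
        gains = [(score_with(selected + [c]), c)
                 for c in candidates if c not in selected]
        if gains:
            top_score, top_c = max(gains)
            if top_score > best:
                selected.append(top_c)
                best = top_score
                improved = True
    return selected
```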
2) Impact of risk factors. The impact of risk factors (IRF) was calculated for each parameter retained in the multivariate logistic regression models using the following equation: IRF = (Wald statistic for each parameter)/(maximum Wald statistic among parameters in the multivariate model). The IRF in the logistic regression model corresponds to the permutation importance in the ML model.
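The IRF normalization reduces to a one-liner; for example (the Wald values below are illustrative, not taken from the study's models, and are scaled by 100 to match how IRF is reported in the Results):

```python
def impact_of_risk_factors(wald):
    """IRF: each parameter's Wald statistic divided by the maximum Wald
    statistic in the multivariate model, scaled so the top parameter
    (the reference) is 100."""
    top = max(wald.values())
    return {name: 100.0 * w / top for name, w in wald.items()}
```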

Other statistical methods
Categorical and continuous data are presented as numbers (%) and means ± standard deviation, respectively. Statistical analyses other than ML were performed using SPSS ver. 19 (IBM Corp.). In all analyses, two-sided P < 0.05 was taken to indicate statistical significance.

Incidence rates of patient outcomes
The incidence rates of the patient outcomes (percentages of the study population) are shown in Table 3.

1) All-cause mortality. All-cause mortality occurred in 217 patients (1% within 2 years). Among the ML models, a model with an SVM classifier had the largest AUC for all-cause mortality (0.900); the top parameters determined by PI were albumin (100, reference) and hemoglobin (78). In the logistic regression model, the AUC for all-cause mortality was 0.881. In this model, the top five parameters determined by IRF were albumin (100, reference), age (36), total protein (29), dyslipidemia (28), and carvedilol use (25) (Fig 1A).
2) Cardiovascular events. Among the ML models, a model with Random Forest Classifier (Tree-based Algorithm) had the largest AUC for cardiovascular events (0.848). In this model, the top five parameters determined by PI were HF (100, reference), history of ACS (76), left ventricular ejection fraction (72), estimated glomerular filtration rate (57), and mitral regurgitation (degree determined by ultrasound cardiogram) (42). Mitral regurgitation showed a linear relationship with cardiovascular events, whereas left ventricular ejection fraction and estimated glomerular filtration rate showed non-linear relationships (Table 4B, Figs 1B and 2B).
3) Heart failure events. Among the ML models, a model with Random Forest Classifier (Tree-based Algorithm) had the largest AUC for HF event (0.912). In this model, the top five parameters determined by PI were left ventricular ejection fraction (100, reference), HF (93), age (57), left ventricular dimension at end-diastole (57), and left atrial dimension (49). Left ventricular ejection fraction, age, left ventricular dimension at end-diastole, and left atrial dimension showed non-linear relationships (with a threshold) with HF events (Table 4C, Figs 1C and 2C).
In the logistic regression model, the AUC for HF events was 0.907. In this model, the top five parameters determined by IRF were left ventricular dimension at end-systole (100, reference), diuretic use (77), HF (71), direct oral anti-coagulant use (61), and left atrial dimension (58) (Fig 1C).

4) Acute coronary syndrome events. Among the ML models, a model with Elastic-Net Classifier had the largest AUC for ACS events (0.879). In this model, the top parameters determined by PI included history of ACS (100, reference) and anti-platelet use (26), followed by … (18), HF (17), and angiotensin-receptor-II blocker use (10) (Table 4D, Figs 1D and 2D).

5) Ischemic stroke events. Among the ML models, a model with Nystroem Kernel SVM Classifier (Regularized Linear Model) had the largest AUC for IS events (0.758). In this model, the top five parameters determined by PI were history of IS or transient ischemic attack (100, reference), deceleration time (98), history of intracranial hemorrhage (76), diastolic blood pressure (48), and left ventricular ejection fraction (34). Deceleration time, diastolic blood pressure, and left ventricular ejection fraction showed linear relationships with IS events (Table 4E, Figs 1E and 2E).
In the logistic regression model, the AUC for IS events was 0.757. In this model, the top five parameters determined by IRF were history of IS or transient ischemic attack (100, reference), systolic blood pressure (52), blood glucose (51), aortic aneurysm (30), and tricuspid regurgitation (28) (Fig 1E).

6) Intracranial hemorrhage events. Among the ML models, a model with Elastic-Net Classifier had the largest AUC for intracranial hemorrhage events (0.753). In this model, the top five parameters determined by PI were history of intracranial hemorrhage (100, reference), warfarin use (65), interventricular septal thickness (37), dilated-phase hypertrophic cardiomyopathy (29), and sick sinus syndrome (29). Interventricular septal thickness showed a linear relationship with intracranial hemorrhage events (Table 4F, Figs 1F and 2F).
In the logistic regression model, the AUC for intracranial hemorrhage events was 0.726. In this model, the top five parameters determined by IRF were history of intracranial hemorrhage (100, reference), tricuspid regurgitation (degree determined by echocardiogram) (79), interventricular septal thickness (50), estimated glomerular filtration rate (33), and history of coronary artery bypass graft (31) (Fig 1F).

Major findings
In this analysis, we developed prediction models for six prognostic outcomes related to CVD by ML (ML model) using ensemble modelling software (DataRobot) and logistic regression analysis (LR model) with commonly used statistical software (SPSS). The AUCs for each prognostic outcome were mostly similar between the ML and LR models. Interestingly, the parameter with the greatest impact in each model (top parameter) was mostly similar between ML and LR, but the other risk factors were not necessarily consistent between them.

Fig 1. (A) All-cause mortality: the areas under the curve (AUC) by the prediction models with ML and LR were 0.900 and 0.881, respectively; the risk factor with the strongest impact was albumin (impact 100, reference) for both models, followed by hemoglobin (78) and age (36) in the ML and LR models, respectively. (B) Cardiovascular events: AUC 0.848 and 0.831, respectively; the risk factors with the strongest impact were heart failure (ML) and acute coronary syndrome (ACS) (LR) (both 100, reference), followed by ACS (76) and heart failure (57), respectively. (C) Heart failure events: AUC 0.912 and 0.907, respectively; the risk factors with the strongest impact were left ventricular ejection fraction (ML) and left ventricular dimension at end-systole (LR) (both 100, reference), followed by heart failure (93) and diuretic use (77), respectively. (D) ACS events: AUC 0.879 and 0.884, respectively; the risk factor with the strongest impact was a history of ACS (100, reference) for both models, followed by anti-platelet use for both models (26 and 23 in ML and LR, respectively). (E) Ischemic stroke events: AUC 0.758 and 0.757, respectively; the risk factor with the strongest impact was a history of ischemic stroke or transient ischemic attack (100, reference) for both models, followed by deceleration time (98) and systolic blood pressure (52), respectively. (F) Intracranial hemorrhage events: AUC 0.753 and 0.726, respectively; the risk factor with the strongest impact was a history of intracranial hemorrhage (100, reference) for both models, followed by warfarin use (65) and tricuspid regurgitation (79) in the ML and LR models, respectively. https://doi.org/10.1371/journal.pone.0221911.g001

Modelling in ML

We used DataRobot in the present analysis, which automatically selected the model with the largest AUC among numerous ML models. For ACS and intracranial hemorrhage events, the Elastic-Net Classifier model was selected. This model is designed for data with strong multicollinearity. In the prediction models for ACS and intracranial hemorrhage, the history of each event was determined as the top parameter, and the weighting of risk in the model was concentrated on that parameter; the weighting of risk for the other parameters was very low. This distribution of impact may be because parameters indicating the progression of the pathological condition (i.e., calcification or plaque of the coronary arteries for ACS, cerebral artery aneurysm or microbleeds for intracranial hemorrhage) were not included in the present analysis, so the history of each event acted as a surrogate marker on which the weighting of risk was concentrated. As a consequence, the pattern of relationships among parameters came to resemble a sparse representation, as seen in the field of visual image reconstruction, [13,14] for which the Elastic-Net Classifier model would fit well. [15]

The SVM model was selected for all-cause mortality and IS events. This model is designed to obtain solutions to complex classification problems. [16] In particular, SVM is suitable for situations where appropriate and representative examples of all the different categories (classes) are available, [17] and, as an advantage, SVMs do not require linear relationships or independence between the parameters and are thus well suited to clinical data classification. [18] As shown in Fig 2A and 2E (permutation importance for mortality and IS, respectively), multiple factors had a strong impact on these outcomes, which may be why SVM suited them better than the Elastic-Net Classifier. On the other hand, the changes in risk with changes in each parameter were gradual (without a threshold), which may be why SVM suited them better than random forest, which is better suited to parameters with a definite threshold.
For cardiovascular events and HF events, the random forest model was selected. This model is designed to obtain a solution when many parameters have similar effects and is widely used among non-linear models. [11] Similar to SVM, random forest does not require linear relationships or independence between the parameters and is thus suitable for clinical data classification. However, compared with SVM, random forest is more suitable for categorical data or continuous values with a definite threshold, because its basic methodology builds categories through multiple layers of splits. As shown in Fig 2B and 2C (permutation importance for cardiovascular events and heart failure events, respectively), multiple factors had a strong impact on the outcomes and, of note, the continuous parameters had definite thresholds, which may be why random forest suited these outcomes better than SVM.
There are several possible explanations for why the predictive capabilities represented by the AUCs were similar between the ML and LR models. First, the sample size was adequate for both ML and LR models, although ML can be applied to larger populations with greater numbers of predictors. Second, the number of risk parameters may have been adequate for both the ML and LR models, and many of the continuous parameters were mostly linearly associated with the outcomes, which may make it difficult to distinguish differences between the ML and LR models. Third, given the relatively low event rate, the low signal-to-noise ratio may be another reason for the difficulty in distinguishing between the ML and LR models. [19] Theoretically, the superiority of ML in developing prediction models for prognostic outcomes is expected to be more obvious in larger populations, especially those with greater numbers of parameters and complex confounding factors or interactions. From another viewpoint, the handling of missing values may also be important. In the present study, albumin was determined as the top predictor in both the ML and LR models. Although several recent studies have identified hypoalbuminemia as an independent predictor of mortality in CVD [20] (i.e., HF, [21][22][23] myocardial infarction, [24][25][26] or hemodialysis [27][28][29]), albumin has not been included in most risk scores for mortality in CVD. [30][31][32][33][34][35] The insufficient recognition of albumin as a risk factor for mortality may be due to its strong interactions with other risk factors (i.e., hemoglobin and body weight) or to missing values in previous observational studies resulting from low attention to its risk. The ML method takes interactions into account and complements missing values. Therefore, analysis by ML may allow the rediscovery of true risk factors from existing data in a data-driven manner.

Limitation
This study had several limitations. First, because our patients were from a single-centre cohort from a cardiovascular institute, the results should be interpreted carefully, and cannot be easily extrapolated to other populations. Second, as mentioned above, although the sample size seemed to be adequate for both ML and LR models, larger cohorts would be necessary to distinguish between the predictive capabilities of ML and LR models.

Conclusion
We reported our experience with the development of prediction models for cardiovascular prognosis by ML. ML identified risk prediction models with good predictive capability and good discrimination of the risk impact.