Abstract
Cardiovascular diseases (CVDs) are the leading cause of death globally. Timely and accurate identification of people at risk of developing an atherosclerotic CVD and its sequelae is a central pillar of preventive cardiology. One widely used approach is risk prediction models; however, currently available models consider only a limited set of risk factors and outcomes, yield no actionable advice to individuals based on their holistic medical state and lifestyle, are often not interpretable, were built with small cohort sizes or are based on lifestyle data from the 1960s, e.g. the Framingham model. The risk of developing atherosclerotic CVDs is heavily lifestyle dependent, potentially making many occurrences preventable. Providing actionable and accurate risk prediction tools to the public could assist in atherosclerotic CVD prevention. Accordingly, we developed a benchmarking pipeline to find the best set of data preprocessing and algorithms to predict absolute 10-year atherosclerotic CVD risk. Based on the data of 464,547 UK Biobank participants without atherosclerotic CVD at baseline, we used a comprehensive set of 203 consolidated risk factors associated with atherosclerosis and its sequelae (e.g. heart failure). Our two best performing absolute atherosclerotic CVD risk prediction models achieved higher performance (AUROC: 0.7573, 95% CI: 0.755–0.7595; and AUROC: 0.7544, 95% CI: 0.7522–0.7567) than Framingham (AUROC: 0.680, 95% CI: 0.6775–0.6824) and QRisk3 (AUROC: 0.725, 95% CI: 0.7226–0.7273). Using a subset of 25 risk factors identified with feature selection, our reduced model achieved similar performance (AUROC 0.7415, 95% CI: 0.7392–0.7438) while being less complex. Further, it is interpretable, actionable and highly generalizable. The model could be incorporated into clinical practice and might allow continuous personalized predictions with automated intervention suggestions.
Citation: Kesar A, Baluch A, Barber O, Hoffmann H, Jovanovic M, Renz D, et al. (2022) Actionable absolute risk prediction of atherosclerotic cardiovascular disease based on the UK Biobank. PLoS ONE 17(2): e0263940. https://doi.org/10.1371/journal.pone.0263940
Editor: Thippa Reddy Gadekallu, Vellore Institute of Technology: VIT University, INDIA
Received: November 24, 2021; Accepted: January 28, 2022; Published: February 11, 2022
Copyright: © 2022 Kesar et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: There are restrictions prohibiting the direct provision of UK Biobank data used in this manuscript. The data were obtained from a third party, UK Biobank, upon application under accession/application number 34802. Interested parties can apply for the data from UK Biobank directly, at http://www.ukbiobank.ac.uk. The UK Biobank will consider data applications from bona fide researchers for health-related research that is in the public interest. This process for reader access to the UK Biobank is the same process as followed by the authors of this study and will provide the same data. All extracted columns are stated in the supporting information files S1 Table and S2 Table.
Funding: This research was funded by Ada Health GmbH and has been conducted using the UK Biobank under application id 34802.
Competing interests: This research was funded by Ada Health GmbH and has been conducted using the UK Biobank under application number 34802. All of the authors are or were employees of, contractors for, or hold equity in Ada Health GmbH. AK, AB, OB, HH, MJ, DN, BLS and SG are employees or company directors of Ada Health GmbH and some of the listed authors hold stock options in the company. Ada Health GmbH has received research grant funding from the Bill & Melinda Gates Foundation, Fondation Botnar, the Federal Ministry of Education and Research Germany, the Federal Ministry for Economic Affairs and Energy Germany and the European Union. PW is employed by Wicks Digital Health Ltd, which has received funding from Ada Health, AstraZeneca, Baillie Gifford, Biogen, Bold Health, Camoni, Compass Pathways, Coronna, EIT, Endava, Happify, HealthUnlocked, Inbeeo, Kheiron Medical, Lindus Health, Sano Genetics, Self Care Catalysts, The Learning Corp, The Wellcome Trust, THREAD Research, VeraSci, and Woebot. HH is the topic driver of the AI-based symptom assessment group of the WHO/ITU Focus Group on AI4H (Artificial Intelligence for Health) and SG is a member of the clinical evaluation topic group of the WHO/ITU Focus Group on AI4H. A related patent application is currently pending with the title “System and method for predicting the risk of a patient to develop an atherosclerotic cardiovascular disease” and application number EP21191089.8. This does not alter our adherence to PLOS ONE policies on sharing data and materials.
Introduction
Globally, cardiovascular diseases (CVDs) are the number one cause of all death [1, 2]. In 2016, 17.9 million people died of CVDs alone, accounting for 31% of all global deaths [1]. The direct costs of CVDs in the US for 2010 were $272.5b whereas indirect costs were $171.7b and are expected to increase to $818.1b and $275.8b in 2030 respectively [3, 4]. Atherosclerosis alone is responsible for 1.3% of all hospital stays with costs of $9b per year, while all atherosclerosis-related diseases amount to $43.5b of hospital costs annually [5]. Individually, patients with CVD incur more than twice the medical costs of age- and sex-matched patients without CVD, largely because of the increased likelihood of subsequent hospitalizations. The greatest differences in total CVD costs usually occur when comparing patients with and without a secondary CVD hospitalization [6].
All current guidelines on the prevention of CVD in clinical practice recommend the assessment of total CVD risk since atherosclerosis is usually the product of a number of risk factors [7, 8] and in recent years these guidelines have evolved to focus on the absolute risk of disease as opposed to relative risk [7–10]. Clinician tools for CVD risk estimation must enable rapid and accurate estimation of an individual patient’s absolute CVD risk [7], or for opportunistic screening of high-risk patients from relevant populations [11]. Screening is the identification of unrecognized disease or risk of disease in individuals without symptoms. In addition to opportunistic screening, which is carried out without a predefined strategy (e.g. when the individual is consulting a general practitioner (GP) for some other reason), tools can be used for systematic screening, which is centrally organized strategic screening in the general population or in targeted subpopulations, such as subjects with a family history of premature CVD or familial hyperlipidaemia [7]. There is ongoing debate on the role of systematic centralized population based screening in CVD [10, 12] because of burdensome diagnostic testing following the use of risk based screening tools [13]. A relatively new area of screening is self-screening, carried out by proactive individuals using screening tools on mobile devices such as smartphones or smartwatches, which may use built in app-linked sensors or screening chat-bots [14–16]. There is public demand for reliable, actionable, explainable and usable health information tools [17], including for disease screening.
The risk of building up atherosclerotic plaque varies between individuals and is determined by multiple factors such as genetics, environment and lifestyle [11, 18–21]. This risk can be reduced by modifying an individual’s behavioral risk factors, such as smoking, physical activity and nutrition [1, 11, 19, 20].
Most diseases, including atherosclerotic CVDs, have a complex pathophysiology that involves multiple interacting molecular systems, making it insufficient to look only at an isolated biological pathway or a subset of markers to predict disease risk [22]. A precision medicine based approach is required, where multiple biological layers are considered (i.e., ‘multi-omics’), alongside clinical and lifestyle data [22]. Such an approach has the potential to capture all important interactions or correlations detected between molecules in different biological layers, providing a holistic understanding of an individual’s current health status and enabling the quantification of an individual’s absolute risk of atherosclerotic CVDs [23, 24].
Previous studies in this area use a limited set of risk factors and outcomes for their analyses [7, 25, 26]. In recent years, the knowledge of behavioral risk factors and of the pathophysiology of atherosclerotic CVDs has advanced tremendously [11, 25]. Current absolute risk prediction models have limited predictive capability as they have not been trained on all possible atherosclerotic CVD outcomes [27–29], or they include outcomes which are unmodifiable such as those related to pregnancy, accidents, or congenital factors [29].
Both SCORE (Systematic COronary Risk Evaluation) and SCORE2 [30, 31] are models for predicting relative CVD risk, whereas we focus on predicting absolute CVD risk, which is why we chose to omit those models from our analysis. Another related investigation, which also used the UK Biobank (UKB) dataset, developed multiple Cox Proportional Hazard models for 10-year CVD risk prediction, with a reduced version requiring 47 risk factors and another version disregarding all cholesterol risk factors as well as systolic blood pressure, in order to provide a simple approach for risk prediction in remote settings with limited testing resources [32]. However, survival models such as the proportional hazard model are not designed to provide absolute risk estimates for individual patients.
Machine learning (ML) based approaches have many advantages compared to humans or standard statistical algorithms, such as superior performance, being able to identify complex non-linear patterns, the ability to encode diverse and high dimensional data types, being more stable to outliers, allowing continuous model updates, versatility for different domains and scalability [33–36].
However, classic disadvantages of ML based approaches are their lack of interpretability, the risk of inherent bias arising from the data used, and the difficulty of gaining physician adoption, which includes explaining to physicians why a new risk model might be superior to existing ones. All of these hinder widespread adoption of ML based risk prediction models [36, 37]. One example of ML based CVD risk prediction is the AutoPrognosis based approach, where an ensemble of multiple ML pipelines has also been applied to the UK Biobank dataset for 5-year CVD risk prediction [29]. Further, using a purely ML-driven approach can lead to a model that requires too many risk factors to compute risk, which is infeasible for routine clinical check-ups. Another disadvantage of purely data-driven approaches is the inclusion of risk factors which might show strong correlations but are unrelated to the pathophysiology of CVDs or are not actionable, making them inapplicable in a clinical setting or as an actionable self-management tool [29].
The aim of this study was to use a large-data ML approach to develop an actionable absolute risk prediction tool which considers the holistic health of an individual. Uniquely, we focused on behavioral risk factors relating to all atherosclerotic CVD outcomes. Our approach is novel in that we employ a holistic understanding of an individual’s current health status to better quantify their risk of all atherosclerotic CVDs and to provide actionable advice. By utilizing a comprehensive set of lifestyle factors, we enable the subsequent suggestion of personalized and actionable advice relating to unhealthy risk factors. Instead of using only a limited set of risk factors, we aimed to achieve this by taking multiple biological layers into account, which include: (i) multi-omics data from blood samples (e.g. lipidome and proteome); (ii) family history (e.g. genome); (iii) lifestyle data; (iv) clinical data; and (v) environmental data; along with (vi) an extensive set of risk factors and outcomes.
We used data from 464,547 participants of the UK Biobank study who did not have atherosclerotic CVD at their baseline visit. We created an automated pipeline to benchmark risk prediction classifier algorithms against each other, then evaluated their predictive performances in the overall population and tested the generalizability of the top-performing classifiers through retraining and testing on different sub-populations. We explored the clinical implications of the proposed classifiers, with a focus on the top-performing models. This study does not focus on the algorithmic aspects of the utilized classifiers.
Methodological details on the utilized classifiers can be found in the open-source documentation of the respective algorithms of the scikit-learn [38] and xgboost [39] libraries and in the supporting information (S4 Table).
Materials and methods
Baseline data from the UK Biobank was utilized to extract an extensive set of risk factors and outcomes associated with the pathophysiology of atherosclerotic CVDs. A benchmarking pipeline was used to train and evaluate different standard and ML algorithms for the task of 10-year atherosclerotic CVD risk prediction. The performance was measured using AUROC and compared against the baseline models Framingham and QRisk3, which are widely used and recommended models. We further evaluated our best performing models by analyzing the most informative features, assessing model generalizability and creating a reduced model.
Study design and participants
The UK Biobank is a long-term prospective large-scale biomedical database including over 500,000 participants aged 40–69 years (when recruited between 2006 and 2010). The database is globally accessible to approved researchers undertaking research into the most common and life-threatening diseases and continuously collects phenotypic and genotypic data about its participants, including data from questionnaires, physical measures, blood, urine and saliva samples, and lifestyle data [40]. This data is further linked to each participant’s health-related records, accelerometry, multimodal imaging, genome-wide genotyping and longitudinal follow-up data for a wide range of health-related outcomes [40, 41]. The UK Biobank study protocol is available online [42].
The North West Multi-Centre Research Ethics Committee approved the UK Biobank study and all participants provided written informed consent prior to study enrollment. Our research is covered by the UK Biobank’s Generic Research Tissue Bank (RTB) Approval and was approved by the UK Biobank Access Management Team [43].
We excluded participants with atherosclerotic CVDs present before or during baseline, participants who chose to leave the UKB study and participants who were lost due to various reasons. The resulting cohort consisted of 464,547 participants. The last available date of participant follow-up was March 5th, 2020.
Risk factor definition.
We curated a list of all generally known risk factors and outcomes for atherosclerotic CVDs from the medical literature and from validated risk prediction models. This preliminary list of risk factors was reduced through curation to focus on those factors that were clearly involved in the pathophysiology of atherosclerosis and those that are modifiable through behavioral change. The curation was carried out by three medical doctors with experience in diagnosing or scientifically modelling cardiovascular diseases. We consolidated all relevant UKB columns into 203 risk factors and grouped them into six categories: demographics (e.g. age, biological sex, ethnicity), biomarkers (e.g. cholesterol, glucose, blood pressure, heart rate), lifestyle (e.g. alcohol consumption, smoking, physical activity, sleep, social visits), environment (e.g. exposure to tobacco smoke, work and housing and other socio-economic related factors), genetics (e.g. family history of CVD, stroke, diabetes, high cholesterol, high blood pressure) and comorbidities (e.g. heart arrhythmias, diabetes, acute & chronic kidney injury, migraines, rheumatoid arthritis, systemic lupus erythematosus, severe mental illnesses (schizophrenia, bipolar disorder, depression, psychosis), diagnosis or treatment of erectile dysfunction, atypical antipsychotic medication). A categorized list of all risk factors used in our analysis is provided in the supplementary data (S1 Table).
Outcome definition.
In the same manner as described above, an initial list of atherosclerotic CVDs was further reviewed and curated by the same team of medical doctors. All resulting CVDs of interest are associated with atherosclerotic plaque build-up, are modifiable and relate to the collected risk factors only. Thus, we disregard brain haemorrhages due to accidents and congenital and pregnancy-related CVDs, which are not actionable. The curated list of all ICD-10 and ICD-9 outcomes meeting the above criteria consists of 193 total (125 unique) CVD outcomes, e.g. coronary/ischaemic heart disease, heart attack, angina, stroke, cardiac arrest, congestive heart failure, left ventricular failure, myocardial infarction, aortic valve stenosis, cerebral artery occlusions, nontraumatic haemorrhages. A list with all outcome codes used in our analysis is provided in the supplementary data (S2 Table). An atherosclerotic CVD event was defined as the first occurrence of any of the atherosclerotic CVD outcome diagnosis codes, including as a primary or secondary cause of death, during the 10-year follow-up period.
Cohort follow-up.
Follow-up time was set to 10 years, as commonly used in other risk models (see Table 2 in [7]), and counted from the date of the initial assessment center visit. Individuals who died from other causes during their follow-up period, or who had a relevant CVD event only after their individual follow-up period, were marked as not having had a relevant CVD event.
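The follow-up rule above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, and treating the end date of the window as inclusive is an assumption, since the text does not state how boundary dates were handled.

```python
from datetime import date
from typing import Optional

FOLLOW_UP_YEARS = 10  # follow-up horizon stated in the text


def follow_up_end(baseline: date, years: int = FOLLOW_UP_YEARS) -> date:
    """Return the date `years` after baseline (29 Feb falls back to 28 Feb)."""
    try:
        return baseline.replace(year=baseline.year + years)
    except ValueError:  # baseline is 29 February and the target year is not a leap year
        return baseline.replace(year=baseline.year + years, day=28)


def label_event(baseline: date, first_cvd_event: Optional[date]) -> int:
    """1 if the first relevant CVD event falls inside the follow-up window, else 0."""
    if first_cvd_event is None:
        return 0  # no relevant event recorded, e.g. death from another cause
    # Assumption: the window is (baseline, baseline + 10 years], end date inclusive.
    return int(baseline < first_cvd_event <= follow_up_end(baseline))
```

For example, a participant assessed on 1 March 2008 with a first event on 1 June 2019 would be labeled 0 under this sketch, because the event falls after the 10-year window.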
Models used in comparison
Framingham risk score.
The Framingham 10-year CVD absolute risk score is based on the data of the two prospective studies, the Framingham Heart Study and the Framingham offspring study [27]. The cohort consists of 8491 participants, with 4522 women and 3969 men who attended a baseline examination between 30 and 74 years of age and were free of CVD. A positive CVD outcome was defined as any of the following: coronary death, myocardial infarction, coronary insufficiency, angina, ischemic stroke, hemorrhagic stroke, transient ischemic attack, peripheral artery disease and heart failure.
Participants were followed up for 12 years, during which 1174 participants developed a CVD. Two biological sex-specific risk models were derived, one using lipid measurements and the other using Body Mass Index (BMI). The variables used were biological sex, age, total cholesterol, HDL cholesterol, treated and untreated systolic blood pressure, smoking status and diabetes status.
The Framingham risk calculators and model coefficients are publicly available [44]. We imputed missing data using simple mean imputation.
QRisk3.
The QRisk3 10-year CVD absolute risk score is based on a prospective open cohort study using data from general practices (GPs), mortality and hospital records in England [28]. The cohort consists of 10.56 million patients between the age of 25 and 84 years, where 75% of the patients were used for training and 25% for validation. Patients with a pre-existing CVD, missing Townsend score or using statins were removed from the baseline. Patients were classified as having a positive CVD outcome when any of the following outcomes was present during follow-up in the GP, hospital or mortality records: coronary heart disease, ischaemic stroke, or transient ischaemic attack. QRisk3 used the following ICD-10 codes: G45 (transient ischaemic attack and related syndromes), I20 (angina pectoris), I21 (acute myocardial infarction), I22 (subsequent myocardial infarction), I23 (complications after myocardial infarction), I24 (other acute ischaemic heart disease), I25 (chronic ischaemic heart disease), I63 (cerebral infarction), and I64 (stroke not specified as haemorrhage or infarction). The utilized ICD-9 codes were: 410, 411, 412, 413, 414, 434, and 436. Participants were followed up for 15 years, during which 363,565 participants of the training set (4.6%) developed a relevant CVD. One biological sex-specific risk model was derived.
The risk factors used in the final model were age, ethnicity, deprivation, systolic blood pressure, BMI, total cholesterol/HDL cholesterol ratio, smoking status, family history of coronary heart disease, diabetes status, treated hypertension, rheumatoid arthritis, atrial fibrillation, chronic kidney disease, systolic blood pressure variability, diagnosis of migraine, corticosteroid use, systemic lupus erythematosus, atypical antipsychotic use, diagnosis of severe mental illnesses, diagnosis or treatment of erectile dysfunction.
The QRisk3 risk calculator and model coefficients are publicly available [45], built into all major NHS GP systems and included in the UK’s national guidelines (https://www.healthcheck.nhs.uk/seecmsfile/?id=1687, accessed 10th November 2021). We imputed missing data using simple mean imputation.
Standard linear and ML models.
Since the introduction of the classic CVD risk prediction methods, the field of supervised machine learning has developed from classical statistics with the sole purpose of maximizing predictive accuracy with modern statistical methods. Therefore, in addition to using standard linear models, we tested the major ML approaches, covering a wide spectrum of the possible ML design space, to evaluate which model type performs best for our task. Based on our initial benchmarking pipeline results, we focused on reporting the results of the initially best performing models: logistic regression, random forest and XGBoost.
We compared regularized logistic regression (with L1 penalty), random forests and gradient boosting (xgboost implementation) to assess the highest achievable Area Under the Receiver Operating Characteristic Curve (AUROC) value. We then used this value as a reference for assessing the trade-off between number of features and predictive performance of several simpler practical risk predictors, as determined by an iterative feature elimination procedure outlined below. L1 regularization for logistic regression implements a strong penalty for non-zero feature weights, resulting in a feature selection procedure that discards features that are likely to be non-predictive. Random Forest is an ensemble method that fits many decision trees independently, each to a subset of the data. We implemented both methods using their scikit-learn library implementation. Finally, we evaluated Extreme Gradient Boosting. Gradient boosting is an ensemble tree-based machine learning method that combines many weak classifiers to produce a stronger one. It sequentially fits a series of classification or regression trees, with each tree created to predict the outcomes misclassified by the previous tree [46]. By sequentially predicting residuals of previous trees, the gradient boosting process has a focus on predicting more difficult cases and correcting its own shortcomings. Extreme Gradient Boosting (XGB / XGBoost) is a specific implementation of the gradient boosting process, and uses memory-efficient algorithms to improve computational speed and model performance [39, 47].
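The idea of sequentially fitting trees to the residuals of earlier trees can be sketched in a few lines of Python. This is a deliberately simplified illustration and not XGBoost itself: it uses one-split regression trees (stumps) on a single feature with squared-error loss, and all names, the toy data and the learning rate are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, float]  # (threshold, left_value, right_value)


def fit_stump(x: Sequence[float], residuals: Sequence[float]) -> Stump:
    """Fit a one-split regression tree to the residuals by least squares."""
    best: Stump = (min(x), 0.0, 0.0)  # fallback when no split is possible
    best_err = float("inf")
    for t in sorted(set(x))[:-1]:  # rows with x <= t go left, the rest go right
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if err < best_err:
            best, best_err = (t, lm, rm), err
    return best


def predict_stump(stump: Stump, xi: float) -> float:
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def boost(x: Sequence[float], y: Sequence[float],
          rounds: int = 20, learning_rate: float = 0.5) -> List[float]:
    """Return training predictions after sequentially fitting stumps to residuals."""
    preds = [sum(y) / len(y)] * len(y)  # start from the mean outcome
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # what is still unexplained
        stump = fit_stump(x, residuals)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
    return preds


# Toy data (hypothetical): the outcome switches from 0 to 1 between x = 2 and x = 3.
x = [1.0, 2.0, 3.0, 4.0]
y = [0.0, 0.0, 1.0, 1.0]
fitted = boost(x, y)
```

Each round shrinks the remaining residual, which is the sense in which later trees concentrate on the cases that earlier trees predicted poorly. Production libraries add regularization, second-order gradient information and efficient split finding on top of this basic loop.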
For completeness, we briefly evaluated a number of other standard classifiers, but discarded them due to excessive computational complexity or inferior performance so we do not report their performances here: Decision Trees [48], Voting Classifiers, Multi-Layer Perceptrons with 2 layers and 200 and 150 neurons each (Neural Network) [49], stochastic gradient descent implementing a support vector machine algorithm [50, 51], Ada Boost [52, 53], Gradient Boosting [46], K Neighbors [54], Quadratic Discriminant Analysis [55] and Gaussian Naive Bayes [38, 56].
Model development and benchmarking using pipeline
We built a benchmarking pipeline for automated and reproducible data extraction, normalization, imputation, model training, tuning of model hyperparameters, classification, documentation and reporting.
We implemented all models using their respective scikit-learn library or xgboost library implementation using the Python programming language [38, 39]. Details on the used Python libraries, methods and parameters are provided in the supplementary data (S3 and S4 Tables).
Categorical values were one-hot encoded. Data normalization was performed by removing the mean and scaling to unit variance. Data imputation was performed for all models using a simple mean imputation. The models’ hyper-parameters were determined using grid search and stratified k-fold cross validation using 3 folds was employed to avoid overfitting.
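As a rough illustration of these preprocessing steps, the following standard-library Python sketch one-hot encodes a categorical column and applies mean imputation followed by standardization to a numeric column. The order of imputation and scaling, the use of the population standard deviation, and all function and column names are assumptions; the actual pipeline relied on library implementations (S3 and S4 Tables).

```python
import math
from typing import Dict, List, Optional, Sequence


def one_hot(values: Sequence[str]) -> List[Dict[str, int]]:
    """Encode each categorical value as a set of 0/1 indicator columns."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def impute_and_standardize(column: Sequence[Optional[float]]) -> List[float]:
    """Replace missing values with the column mean, then scale to zero mean, unit variance."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    filled = [mean if v is None else v for v in column]  # simple mean imputation
    variance = sum((v - mean) ** 2 for v in filled) / len(filled)  # population variance
    std = math.sqrt(variance) or 1.0  # guard against constant columns
    return [(v - mean) / std for v in filled]


# Hypothetical toy columns, not UK Biobank data.
smoking = one_hot(["never", "current", "never"])
cholesterol = impute_and_standardize([1.0, None, 3.0])
```

Because the imputed value equals the mean, an imputed entry maps to exactly zero after standardization, which means it contributes nothing to a linear model's prediction beyond the intercept.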
Finally, we assessed model performance mainly using the AUROC. Fig 1 visualizes an overview of all performed steps of our experimental setup.
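For readers less familiar with the metric, the AUROC can be read as the probability that a randomly chosen participant with an event receives a higher predicted risk than a randomly chosen participant without one. The sketch below computes it directly from that definition; it is an illustrative quadratic-time implementation with a hypothetical function name, not the library routine used in the pipeline.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive case is scored above a random negative one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ranked correctly.
example = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases and non-cases.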
Iterative feature elimination.
We employed an iterative feature elimination procedure based on the regularized logistic regression for finding the best trade-off between predictive performance and number of risk factors, with the aim of creating a risk prediction algorithm that is applicable in the clinical context. We used the standard L1 regularization (also known as Lasso) proposed by [57]; it implements a strong penalty on non-zero feature weights of our logistic regression model, resulting in a sparse feature set for prediction.
A logistic regression coefficient value $\beta$ can be interpreted as the expected change in log odds of having the outcome per unit change in the feature $x_\beta$. Therefore, increasing the feature by one unit multiplies the odds of having the outcome by $e^{\beta}$. Because all features were normalized to unit variance, we can interpret the absolute coefficient values as feature importance values, in the sense that the feature with the smallest absolute coefficient has the least influence on model predictions. Importantly, this holds true only in the context of the parameters contained in the current model. Thus, we re-estimate the model after each feature elimination round.
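The odds interpretation follows directly from the form of the logistic model. As a short derivation, write $p$ for the predicted probability of the outcome, $\beta$ for the coefficient of the feature $x_\beta$, and $\beta_j x_j$ for the remaining terms (this notation for the remaining terms is ours, introduced only for the derivation):

```latex
\log\frac{p}{1-p} = \beta_0 + \beta\,x_\beta + \sum_{j}\beta_j x_j

\frac{\mathrm{odds}(x_\beta + 1)}{\mathrm{odds}(x_\beta)}
  = \frac{\exp\!\bigl(\beta_0 + \beta\,(x_\beta + 1) + \sum_{j}\beta_j x_j\bigr)}
         {\exp\!\bigl(\beta_0 + \beta\,x_\beta + \sum_{j}\beta_j x_j\bigr)}
  = e^{\beta}
```

As a purely hypothetical numeric example, a coefficient of $\beta = 0.1$ on a standardized feature gives $e^{0.1} \approx 1.105$, that is, roughly 10.5% higher odds per standard deviation increase in that feature, with all other features held fixed.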
In each iteration, we re-estimated the logistic regression model on the remaining parameters, and then discarded all parameters that were set to zero by the L1 regularization; finally, we also discarded the parameter with the lowest non-zero absolute value.
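The elimination loop described above can be sketched as follows. The sketch keeps the model fitting abstract: `fit` is a hypothetical callback that takes the remaining feature names and returns one coefficient per feature, standing in for re-estimating the L1-regularized logistic regression. The toy weights are invented for illustration only.

```python
from typing import Callable, Dict, List, Sequence

FitFn = Callable[[Sequence[str]], Dict[str, float]]  # feature names -> coefficients


def iterative_elimination(features: Sequence[str], fit: FitFn) -> List[List[str]]:
    """Return the feature set kept after each elimination round, largest first."""
    remaining = list(features)
    history = [list(remaining)]
    while len(remaining) > 1:
        coefs = fit(remaining)  # re-estimate the model on the remaining features
        kept = [f for f in remaining if coefs[f] != 0.0]  # drop features zeroed by L1
        if kept:
            # additionally drop the feature with the smallest non-zero absolute weight
            kept.remove(min(kept, key=lambda f: abs(coefs[f])))
        if not kept:
            break
        remaining = kept
        history.append(list(remaining))
    return history


# Hypothetical stand-in for a fitted model: fixed weights, two of them zero.
HYPOTHETICAL_WEIGHTS = {"age": 0.9, "sex": 0.6, "smoking": 0.3,
                        "noise_a": 0.0, "noise_b": 0.0}


def toy_fit(names: Sequence[str]) -> Dict[str, float]:
    return {name: HYPOTHETICAL_WEIGHTS[name] for name in names}


history = iterative_elimination(list(HYPOTHETICAL_WEIGHTS), toy_fit)
```

With these toy weights the first round removes the two zero-weight features plus the weakest remaining one, and the second round leaves a single feature. In a real run the coefficients change after every refit, which is why the model is re-estimated in each round rather than ranked once.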
As an additional step, we created a ranking of the relative feature importance value of each feature by dividing its absolute coefficient weight by the sum of all absolute coefficient weights.
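The normalization used for this ranking is simple enough to show directly. The following sketch divides each absolute coefficient by the sum of all absolute coefficients and returns the features in descending order of importance; the function name and the example weights are hypothetical.

```python
from typing import Dict


def relative_importance(coefs: Dict[str, float]) -> Dict[str, float]:
    """Share of the total absolute coefficient weight held by each feature."""
    total = sum(abs(w) for w in coefs.values())
    if total == 0.0:
        raise ValueError("all coefficients are zero")
    ranked = sorted(coefs, key=lambda f: abs(coefs[f]), reverse=True)
    return {f: abs(coefs[f]) / total for f in ranked}


# Hypothetical coefficients, not values from the study.
ranking = relative_importance({"bmi": 0.1, "age": 0.6, "sex": -0.3})
```

The resulting values sum to one, so each can be read as the fraction of total model weight attributed to that feature, regardless of the sign of its coefficient.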
Statistical analysis.
To reduce overfitting, we evaluated the classification performance of all our benchmarked algorithms by using 3-fold stratified cross-validation and measuring the Area Under the Receiver Operating Characteristic Curve. For the cross-validation, we used a training set with 325,182 participants to train and derive our standard linear and ML models and then assessed the AUROC performance on the held-out test set with 139,365 participants using 203 risk factors respectively. We reported the AUROC and the 95% confidence intervals (Wilson score intervals) for all models and performed a sensitivity analysis using Shapley Additive Explanations (SHAP values) for the best performing linear model.
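A Wilson score interval can be computed with the standard library alone. The sketch below treats the AUROC as a proportion and uses the size of the held-out test set as the sample size; that choice of sample size is our assumption, since the text does not spell it out, and the function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a proportion p_hat observed on n samples."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


# XGB test-set AUROC from the text, with n assumed to be the test-set size.
low, high = wilson_interval(0.7573, 139_365)
```

Under that assumption the bounds come out close to the interval reported for XGB (0.755–0.7595), which suggests, but does not confirm, that the reported intervals were derived this way.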
Generalizability.
With 442,620 out of the 502,551 participants in the UK Biobank, the cohort has a high proportion (88.1%) of participants with British White ethnicity. In an effort to estimate a proxy for out-of-sample generalizability, we re-trained the two best models, XGB and logistic regression with L1 regularization, only on White participants and tested their performance on a non-White test set. The White-only training set consists of 378,836 participants (81.5%). The non-White test set consists of 85,711 participants (18.5%).
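The group-based split used for this experiment differs from a random hold-out split in that membership of the training or test set is decided entirely by a participant attribute. A minimal sketch, assuming each record is a dictionary with a hypothetical `ethnicity` field:

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]


def split_by_group(records: Sequence[Record], train_group: str = "White",
                   field: str = "ethnicity") -> Tuple[List[Record], List[Record]]:
    """Train on one group and test on everyone else (no random sampling involved)."""
    train = [r for r in records if r[field] == train_group]
    test = [r for r in records if r[field] != train_group]
    return train, test


# Hypothetical toy records, not UK Biobank data.
toy_records = [
    {"id": "a", "ethnicity": "White"},
    {"id": "b", "ethnicity": "Asian"},
    {"id": "c", "ethnicity": "White"},
    {"id": "d", "ethnicity": "Black"},
]
train_set, test_set = split_by_group(toy_records)
```

Because no test participant shares the grouping attribute with any training participant, the test performance serves as a proxy for how the model transfers to a population it has not seen.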
Results
Characteristics of the training and test populations
Of the 502,551 participants in the UK Biobank, we excluded 7.6%: those who had already experienced a relevant CVD outcome (before or during baseline), those who were lost to follow-up and those who withdrew from the biobank. This resulted in 464,547 participants who met the inclusion criteria. 28,561 (6.1%) of those participants developed at least one of the relevant CVD outcomes during their 10-year follow-up period. We used a common split of 70% of the data as a training set and 30% as a hold-out test set. Table 1 shows the overlap of our atherosclerotic CVD outcome definition with the CVD outcome definition used in the related work approach by Alaa et al. [29].
Prediction accuracy
The resulting prediction accuracy of the benchmarked models is depicted in Table 2. We used both Framingham 10-year CVD risk versions, with and without lipids, as well as QRisk3 as baseline models to assess the performance of predicting someone’s 10-year risk of developing an atherosclerotic cardiovascular disease based on a holistic set of risk factors, with a focus on actionable risk factors and outcomes. The best performing model was XGB with an AUROC of 75.73%, only marginally higher than the logistic regression model with L1 regularization (75.44%) and substantially better than the Random Forest model (66.90%).
Fig 2 shows the AUROCs of the best performing model, XGB, and of logistic regression with L1 regularization, which is the simplest model tested and among the two best performing models. Logistic regression comes with the advantages of being interpretable by providing reasoning for its classifications, and being a simple and robust method [36].
In order to better evaluate the clinical implications and significance of our results, we compared the results of our benchmarked models with our baseline models Framingham and QRisk3. Table 2 shows that both our XGB and logistic regression classifiers achieved superior performance compared to the baseline models. Apart from the Random Forest model, all tested models had a higher AUROC than both baseline Framingham (68.0% and 68.1%) and QRisk3 (72.5%) models.
The difference in AUROC performance of the Framingham score in our experiments in Fig 2 compared to Alaa et al. [29] is explainable by their use of an older UK Biobank version with 40,000 fewer baseline patients with their last available date of participant follow-up being February 17, 2016. The UK Biobank version we used includes biochemistry data which was released May 1, 2019 including cholesterol and additional questionnaires data. Additionally, more diagnosis data was made available over time. These dataset differences may help explain the difference in AUROC.
Figs 3 and 4 show the AUROCs of all baseline models on imputed and unimputed data respectively.
Both Framingham versions perform nearly identically on imputed and unimputed data whereas QRisk3 performs worse on unimputed data.
Feature elimination vs. predictive performance
Fig 5 shows how the performance of the best logistic regression model depends on the number of risk factors used. Discarding the risk factors stepwise leads to a relatively unchanged and stable model performance until around 170 iterations of feature elimination. This indicates that for predicting an individual’s 10-year atherosclerotic CVD risk, many features provide only marginal value and a small subset of features provides substantial informative value. After around 170 iterations, there was a marked decline in model performance associated with further reductions in utilized features.
Fig 5. AUROC performance of the best performing logistic regression model with L1 regularization (continuous blue line) compared to the number of features utilized in each iterative feature elimination step (orange line). The dotted blue horizontal line shows the intersection of 25 features with the iterative feature elimination step, allowing for extrapolation to model performance.
Table 3 shows in more detail the dependence of the model performance on the number of features. Utilizing only 25 (12%) of the 203 total risk factors still leads to a reasonable AUROC performance while reducing the number of utilized features by 88%. Compared to the model performance with an AUROC of 75.44% when using all 203 risk factors, the model still achieves 74.15% (95% CI: 0.7392–0.7438) with the 25 most informative risk factors.
We also assessed the performance for fewer features. To reach the same performance as QRisk3 of 72.5% AUROC, 16 features would be necessary. The two most informative features were age and biological sex. To reach a similar performance as Framingham (68.0%), just two features were necessary (68.98%). It is worth noting, however, that both Framingham and QRisk3 were trained and tuned on other datasets and have different CVD definitions and objectives.
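The stepwise elimination described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: the elimination criterion (dropping the feature with the smallest absolute coefficient after each refit), the synthetic data, the regularization strength `C=1.0` and the helper name `eliminate_features` are all hypothetical choices. Only the scikit-learn calls themselves are real library functions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cohort (assumption): 30 features, 5 of them informative.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
scaler = StandardScaler().fit(X_train)  # put coefficients on a comparable scale
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

def eliminate_features(X_train, y_train, X_test, y_test, min_features=1):
    """Refit an L1 logistic regression, record the test AUROC, drop the weakest feature, repeat."""
    kept = list(range(X_train.shape[1]))
    history = []  # (number of features, test AUROC) per elimination step
    while True:
        model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
        model.fit(X_train[:, kept], y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test[:, kept])[:, 1])
        history.append((len(kept), auc))
        if len(kept) <= min_features:
            break
        weakest = int(np.argmin(np.abs(model.coef_[0])))  # smallest absolute coefficient
        del kept[weakest]
    return history, kept

history, kept = eliminate_features(X_train, y_train, X_test, y_test)
for n_features, auc in history[::5]:
    print(f"{n_features:2d} features: AUROC {auc:.3f}")
```

Plotting the recorded AUROC against the number of remaining features would give a curve of the same kind as the one described for Fig 5, from which the smallest feature count matching a given baseline can be read off.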
Generalizability of results
We assessed the generalizability of our models by re-training the two previously best performing models only on a White cohort and then testing them on a non-White cohort. Table 4 and Fig 6 show the results for logistic regression and XGB. The logistic regression model has an AUROC of 75.86% in the generalizability experiment, compared with an AUROC of 75.44% in the previous experiment. XGB has an AUROC of 76.26% in the generalizability experiment and 75.73% in the previous experiment. These results show only marginal differences to the results of the previous experiments.
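The design of this experiment (fit on one subgroup, evaluate on another) can be sketched as below. Everything in the example is an assumption made for illustration: the data are simulated, the subgroup flag `majority` and the mild covariate shift are invented, and the additional held-out evaluation within the training subgroup is included only as a reference point and is not part of the experiment described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 4000
majority = rng.random(n) < 0.9             # hypothetical subgroup flag (about 90% / 10%)
X = rng.normal(size=(n, 5))
X[~majority] += 0.3                        # mild covariate shift in the smaller subgroup
true_coef = np.array([1.0, -0.8, 0.5, 0.0, 0.0])
risk = 1.0 / (1.0 + np.exp(-(X @ true_coef - 1.5)))
y = (rng.random(n) < risk).astype(int)     # simulated binary 10-year outcome

# Train only on the larger subgroup; keep part of it as an in-distribution reference.
maj_idx = np.flatnonzero(majority)
rng.shuffle(maj_idx)
cut = int(0.7 * len(maj_idx))
train_idx, holdout_idx = maj_idx[:cut], maj_idx[cut:]
min_idx = np.flatnonzero(~majority)        # never seen during training

model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
model.fit(X[train_idx], y[train_idx])

auc_same = roc_auc_score(y[holdout_idx], model.predict_proba(X[holdout_idx])[:, 1])
auc_other = roc_auc_score(y[min_idx], model.predict_proba(X[min_idx])[:, 1])
print(f"held-out, training subgroup: {auc_same:.3f}")
print(f"other subgroup:              {auc_other:.3f}")
```

A small gap between the two AUROC values is what the text interprets as a sign of generalizability; because the smaller subgroup contains far fewer participants, its estimate is also noisier.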
Predictive ability of individual variables in UK Biobank
Table 5 shows the relative regression feature weights of the 25 most informative risk factors in descending order. A full list is provided in the supplementary materials (S5 Table). Based on our previous manual curation of risk factors and outcomes, we can see that the most informative risk factors are distributed across 5 categories (Table 6), with the lifestyle category contributing the most risk factors. The two most informative features were age and biological sex. We provided a sensitivity analysis using SHAP values of the best performing logistic regression model for all risk factors in the supplementary materials (S1 Fig).
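For a logistic regression model, both quantities mentioned here have simple closed forms, sketched below. The sketch assumes that a relative weight is the share of a coefficient's absolute value in the total, and it uses the linear SHAP formula that holds when features are treated as independent (coefficient times deviation from the cohort mean, on the log-odds scale). The coefficients, intercept and cohort values are invented for illustration and are not the values in Table 5 or S1 Fig.

```python
# Hypothetical standardized coefficients and intercept; NOT the values from Table 5.
coef = {"age": 0.9, "walking_pace": -0.4, "social_visits": -0.2}
intercept = -2.0

# Tiny made-up cohort of standardized feature values.
cohort = [
    {"age": 1.2, "walking_pace": -0.5, "social_visits": 0.0},
    {"age": -0.3, "walking_pace": 1.0, "social_visits": 0.8},
    {"age": 0.4, "walking_pace": 0.2, "social_visits": -1.1},
    {"age": -1.3, "walking_pace": -0.7, "social_visits": 0.3},
]

def log_odds(x):
    """Model output on the log-odds scale for one individual."""
    return intercept + sum(coef[name] * x[name] for name in coef)

# Relative weight (assumed definition): share of each absolute coefficient in the total.
total = sum(abs(b) for b in coef.values())
relative_weight = {name: abs(b) / total for name, b in coef.items()}

# Linear SHAP under feature independence: coefficient times deviation from the cohort mean.
mean = {name: sum(x[name] for x in cohort) / len(cohort) for name in coef}

def shap_values(x):
    return {name: coef[name] * (x[name] - mean[name]) for name in coef}

baseline = sum(log_odds(x) for x in cohort) / len(cohort)  # average model output
person = cohort[0]
phi = shap_values(person)

for name in coef:
    print(f"{name:14s} weight {relative_weight[name]:.2f}  SHAP {phi[name]:+.3f}")
# Additivity: the SHAP values sum to the individual's deviation from the baseline.
print(f"sum of SHAP values {sum(phi.values()):+.3f}, "
      f"log-odds minus baseline {log_odds(person) - baseline:+.3f}")
```

The per-individual SHAP values are what a summary plot displays as points, while the relative weights give a single global ranking of risk factors.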
Discussion
Using data gathered from the large longitudinal UK Biobank cohort study, we developed a pipeline to benchmark several classification models for predicting a subject’s 10-year absolute risk of developing an atherosclerotic CVD. We used an extensive, physician-curated set of risk factors and outcomes, employing a holistic view of the subject’s current health status rooted in a precision medicine approach. The models were trained and evaluated using data from 464,547 UK Biobank participants, spanning 203 CVD risk factors for each subject. Using a simple logistic regression model with a holistic set of risk factors significantly improved the accuracy of atherosclerotic CVD risk prediction compared to currently available, widely used and recommended models such as Framingham and QRisk3. Both of these existing models rely on a limited set of risk factors and outcomes and do not focus on modifiable lifestyle factors. Further, our best performing logistic regression model utilizes new CVD risk predictors showing high predictive power, namely social visits, walking pace and overall health rating. The frequency of social visits could be indicative of someone’s current mental health status, which has been shown to be a relevant CVD risk factor [58, 59]. These and other non-laboratory risk factors could be collected by means of a questionnaire or passively deduced using data analytics from data sources such as GPS, calendar and sensors [26, 60] in, for example, smartphones, smartwatches and fitness trackers.
Additionally, our best performing models, XGBoost and logistic regression, showed marginal differences when trained and tested on particular sub-populations, which is indicative of good generalizability to other ethnicities.
As there was little performance difference between the best performing models, we primarily discuss the simplest model, logistic regression with L1 regularization. This model has the inherent benefit of offering reasoning for its predictions through the learned coefficients for every risk factor, and of performing feature selection through the L1 regularization. With L1 regularization, the coefficients of less important risk factors are shrunk and some are set exactly to zero, which removes these features from the model entirely, so that fewer risk factors are needed for an accurate prediction.
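The effect of the L1 penalty can be made explicit by writing out the objective of an L1-regularized logistic regression in its standard textbook form. The notation is an assumption introduced here and does not appear in the study: n participants, d risk factors, binary outcome y_i, risk factor vector x_i, intercept β_0, coefficients β and penalty strength λ ≥ 0.

```latex
\hat{\beta}_0, \hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i \log p_i + (1-y_i)\log(1-p_i) \Bigr]
  + \lambda \sum_{j=1}^{d} \lvert \beta_j \rvert,
\qquad
p_i = \frac{1}{1 + \exp\bigl(-(\beta_0 + x_i^{\top}\beta)\bigr)}
```

Because the penalty term is not differentiable at zero, the minimizer can contain coefficients that are exactly zero, and a larger λ tends to produce more of them. In scikit-learn the penalty strength is controlled through the parameter `C`, which is the inverse of the regularization strength.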
Using iterative feature elimination, we identified a subset of the 25 most relevant risk factors providing a similar performance compared to using all 203 risk factors. The 25 most relevant risk factors are distributed across five different categories, suggesting that different biological layers contribute to the risk of atherosclerotic CVD. This result supports the view that it is insufficient to assess only one biological layer for accurate risk prediction, in line with our initial model development approach [61]. Our approach takes into account multiple biological layers by using multi-omics as well as clinical and lifestyle data with the aim of capturing potential interactions or correlations between molecules in different biological layers [22]. Multi-omics data generated for the same set of samples can provide useful insights into the interaction of biological information at multiple layers and thus can help in understanding the mechanisms underlying the complex biological condition of interest.
In our model, the lifestyle category contributed the most risk factors, suggesting that accurate prediction relies upon continuous daily lifestyle data and not just periodic snapshots of clinical data. The causal relationships between the risk factors considered in our model and atherosclerotic CVDs have been demonstrated by other studies [11, 19, 21, 25].
Innovative approaches are needed in order to tackle the increasing prevalence and mortality of CVDs [2], and the associated financial burden on healthcare systems. This is particularly true in low and middle income countries, where CVD prevalence has also been increasing and is expected to increase further as a consequence of an aging and growing population [2]. Our atherosclerotic CVD prediction model has the potential to support healthcare systems by identifying more people at risk earlier and more accurately than currently available models and intervening with personalized behavior change programs. Currently available models, like Framingham and QRisk3, have limited predictive capability for atherosclerotic CVDs as they were not trained on all of them and do not provide actionable results.
There is potential for novel disruptive approaches to affordably improve CVD outcomes. Areas where this may have an impact include novel approaches to screening, lifestyle coaching and prevention [2]. Screening will become more accessible and widespread as more (near-)medical-grade sensors are integrated into smartphones and smartwatches, enabling continuous monitoring of relevant behavioral CVD risk factors, as well as biomarkers such as heart rate, blood pressure and blood glucose. By gathering a wider spectrum of relevant risk factors for cardiovascular disease automatically and continuously, an ongoing and personalized cardiovascular disease risk prediction could be enabled. Through linking personalized information on an individual’s CVD risk with app-based programs for sustained behavioral modification, it may be possible to lower the incidence and mortality of CVDs [62]. Combined with a companion smartphone-based app, an AI or healthcare provider-generated personalized intervention program could be provided and targeted at those people who need it the most.
A system that gathers personal health data and predicts an individual’s atherosclerotic CVD risk handles sensitive health data (e.g. laboratory values) and must adhere to local regulations and best practices in data transfer, processing and storage to ensure data privacy and security.
Many studies have shown that digital health interventions are cost effective for managing CVD (for a review see [63]). One report found that a community-based prevention program could have a mean return on investment (ROI) on medical cost savings of $5.60 for every $1 spent within a 5 year timeframe by improving physical activity and nutrition and reducing tobacco usage [64]. A review of 11 in-home cardiac rehabilitation programs for the secondary prevention of CVD found that social support, goal setting, monitoring, credible instructions and literature resources are all effective behavior change techniques to reduce behavioral risk factors for CVD [65].
The improvement achieved by our models might be partially attributed to being trained and assessed on the UK Biobank dataset, whereas the baseline Framingham model was derived from a different population. The population and many of the data sources used in the QRisk3 model are similar, being the general UK population and using their GP, hospital and mortality records. However, our risk model generation approach and QRisk3’s approach were designed with different aims and objectives and the modelling strategy was different. For these reasons, direct comparison between the models is limited. Notable differences between the approaches include the more limited set of risk factors in Framingham and QRisk3 and the focused and wider range of atherosclerotic CVDs in our approach.
The results from our generalizability sub-analysis indicate that our XGB and logistic regression models might generalize well to other ethnicities and do not overfit to our cohort; however, this needs to be further evaluated with more data from diverse ethnicities.
Our results show that our models have improved performance over the baseline models Framingham and QRisk3 (Table 2). The selection of the appropriate disease modelling approach and classifiers, and careful tuning of the model’s hyperparameters, are crucial steps for realizing the potential benefits of ML. Our pipeline automates some of these steps, which makes the tuning and discovery of new disease risk models easily accessible for clinical research. Our prospective cohort modelling approach, which is rooted in precision medicine, is the first to generate an atherosclerotic CVD absolute risk prediction tool based upon a complete definition of atherosclerotic CVD outcomes and a holistic set of risk factors.
Limitations
The UK Biobank only admitted participants aged 40 and older at their initial signup. This might limit the applicability of the risk score to younger populations, and further tests with data from younger populations need to be conducted.
There are many missing data values related to the potential risk factors for many participants. Having more unimputed data of relevant CVD risk factors could improve the predictive performance of all our benchmarked classifiers and could also lead to changes in the classifier ranking from Table 2 and relative risk factor importances in Table 5. However, the use of imputed data is unlikely to have an impact on our conclusion that a holistic set of risk factors and an exhaustive atherosclerotic CVD outcome definition could improve actionable atherosclerotic CVD risk prediction.
An additional limitation of our study is that the UK Biobank dataset consists of participants of predominantly (88%) British ethnicity, with an even larger portion having a White background (91%). Therefore, further assessments of the influence of the ethnicity predictor need to be carried out to enable a generalizable tool. Previous work in this area indicates that the development of plaques seems to be independent of ethnicity [21].
A further limitation of this UK-focused dataset is that socio-economic and other environmental factors differ between countries. This is another potential bias that needs to be further evaluated with datasets from other countries with different socio-economic characteristics.
Disease risk prediction models that include subjective non-laboratory risk factors, such as the self-reported health rating and usual walking pace, should be cautiously evaluated to minimize self-report bias. These risk factors have been found to be good predictors of overall CVD risk in another study using UK Biobank data [29].
Conclusions
We benchmarked multiple classifiers to predict an individual’s 10-year risk of developing an atherosclerotic CVD, using a holistic set of risk factors and a specific definition of atherosclerotic CVDs. Our reduced logistic regression with L1 regularization classifier, a simple and interpretable model, is amongst our best prediction models, includes actionable lifestyle factors, has great predictive power and requires 25 unique features. Our experiments showed that a two-feature questionnaire is as accurate as the Framingham models and a 16-feature questionnaire is as accurate as QRisk3 for 10-year atherosclerotic CVD risk prediction. Both prediction models, XGBoost and logistic regression, generalize well to non-White people, which might indicate that our models generalize well to other (western) countries. Framingham and QRisk3, which are well established and validated absolute risk prediction models, do not perform as well on predicting individuals’ 10-year risk of developing an atherosclerotic CVD. With our logistic regression model, we created a promising new interpretable, actionable and accurate risk prediction tool that could assist individuals and public health in CVD risk reduction.
Supporting information
S1 Fig. Shapley Additive Explanations (SHAP value) of each risk factor for the logistic regression model.
This summary plot combines risk factor importance with risk factor effects. It shows the relationship between the value of a risk factor and its impact on the prediction. Risk factors are sorted according to their importance along the y-axis. Each point in the summary plot is a Shapley value for a risk factor and an instance. The position of a Shapley value on the y-axis is determined by the risk factor importance and on the x-axis by the Shapley value. The color represents the value of a risk factor from low to high. Overlapping points are jittered on the y-axis direction, showing the distribution of the Shapley values per risk factor.
https://doi.org/10.1371/journal.pone.0263940.s001
(TIFF)
S1 Table. List of all risk factors used in our analysis.
The listed risk factors were summarized into 203 risk factors for the respective UK Biobank participant.
https://doi.org/10.1371/journal.pone.0263940.s002
(XLSX)
S2 Table. List of all outcomes used in our analysis.
The following outcomes were all consolidated into one final binary outcome column indicating whether the respective UK Biobank participant did or did not develop one of the relevant atherosclerotic CVDs during their individual 10-year follow-up period starting from their individual initial assessment attendance date.
https://doi.org/10.1371/journal.pone.0263940.s003
(XLSX)
S3 Table. Specifications of the Python (v3.9.6) libraries and their versions used in this study.
https://doi.org/10.1371/journal.pone.0263940.s004
(PDF)
S4 Table. List of utilized open-source methods, best parameters and references.
https://doi.org/10.1371/journal.pone.0263940.s005
(PDF)
S5 Table. Full list of relative informative values for each risk factor for logistic regression model.
https://doi.org/10.1371/journal.pone.0263940.s006
(XLSX)
References
- 1. Cardiovascular diseases (CVDs) [Internet]. [cited 2021 Sep 28]. Available from: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
- 2. Roth GA, Mensah GA, Johnson CO, Addolorato G, Ammirati E, Baddour LM, et al. Global Burden of Cardiovascular Diseases and Risk Factors, 1990–2019. J Am Coll Cardiol. 2020 Dec 22;76(25):2982–3021. pmid:33309175
- 3. Heidenreich PA, Trogdon JG, Khavjou OA, Butler J, Dracup K, Ezekowitz MD, et al. Forecasting the Future of Cardiovascular Disease in the United States. Circulation. 2011 Mar 1;123(8):933–44. pmid:21262990
- 4. Weintraub WS, Daniels SR, Burke LE, Franklin BA, Goff DC, Hayman LL, et al. Value of Primordial and Primary Prevention for Cardiovascular Disease. Circulation. 2011 Aug 23;124(8):967–90. pmid:21788592
- 5. Evsikova C, Raplee I, Lockhart J, Jaimes G, Evsikov A. The Transcriptomic Toolbox: Resources for Interpreting Large Gene Expression Data within a Precision Medicine Context for Metabolic Disease Atherosclerosis. J Pers Med. 2019 Apr 29;9:21.
- 6. Nichols GA, Bell TJ, Pedula KL, O’Keeffe-Rosetti M. Medical care costs among patients with established cardiovascular disease. Am J Manag Care. 2010 Mar 1;16(3):e86–93. pmid:20205493
- 7. Piepoli MF, Hoes AW, Agewall S, Albus C, Brotons C, Catapano AL, et al. 2016 European Guidelines on cardiovascular disease prevention in clinical practice: The Sixth Joint Task Force of the European Society of Cardiology and Other Societies on Cardiovascular Disease Prevention in Clinical Practice (constituted by representatives of 10 societies and by invited experts)Developed with the special contribution of the European Association for Cardiovascular Prevention & Rehabilitation (EACPR). Eur Heart J. 2016 Aug 1;37(29):2315–81. pmid:27222591
- 8. 2013 ACC/AHA Guideline on the Assessment of Cardiovascular Risk. J Am Coll Cardiol. 2014 Jul 1;63(25 Pt B):2935–59.
- 9. Sedgwick JEC. Absolute, attributable, and relative risk in the management of coronary heart disease. Heart. 2001 May 1;85(5):491–2. pmid:11302989
- 10. Jackson R. Guidelines on preventing cardiovascular disease in clinical practice: Absolute risk rules—but raises the question of population screening. BMJ. 2000 Mar 11;320(7236):659–61. pmid:10710556
- 11. Libby P, Bonow RO, Mann DL, Tomaselli GF, Zipes DP. Braunwald’s Heart Disease E-Book: A Textbook of Cardiovascular Medicine. Elsevier Health Sciences; 2018. 2527 p.
- 12. Eriksen CU, Rotar O, Toft U, Jørgensen T. What is the effectiveness of systematic population-level screening programmes for reducing the burden of cardiovascular diseases? [Internet]. Copenhagen: WHO Regional Office for Europe; 2021 [cited 2021 Oct 12]. (WHO Health Evidence Network Synthesis Reports). Available from: http://www.ncbi.nlm.nih.gov/books/NBK567843/
- 13. Lim LS, Haq N, Mahmood S, Hoeksema L. Atherosclerotic Cardiovascular Disease Screening in Adults: American College of Preventive Medicine Position Statement on Preventive Practice. Am J Prev Med. 2011 Mar 1;40(3):381.e1–381.e10. pmid:21335273
- 14. Espinoza J, Crown K, Kulkarni O. A Guide to Chatbots for COVID-19 Screening at Pediatric Health Care Facilities. JMIR Public Health Surveill. 2020 Apr 30;6(2):e18808. pmid:32325425
- 15. Perez MV, Mahaffey KW, Hedlin H, Rumsfeld JS, Garcia A, Ferris T, et al. Large-Scale Assessment of a Smartwatch to Identify Atrial Fibrillation. N Engl J Med. 2019 Nov 14;381(20):1909–17. pmid:31722151
- 16. Lemmen C, Simic D, Stock S. A Vision of Future Healthcare: Potential Opportunities and Risks of Systems Medicine from a Citizen and Patient Perspective—Results of a Qualitative Study. Int J Environ Res Public Health. 2021 Sep 19;18(18):9879. pmid:34574802
- 17. Peeters JM, Krijgsman JW, Brabers AE, Jong JDD, Friele RD. Use and Uptake of eHealth in General Practice: A Cross-Sectional Survey and Focus Group Study Among Health Care Users and General Practitioners. JMIR Med Inform. 2016 Apr 6;4(2):e4515.
- 18. Bui QT, Prempeh M, Wilensky RL. Atherosclerotic plaque development. Int J Biochem Cell Biol. 2009 Nov 1;41(11):2109–13. pmid:19523532
- 19. Herrington W, Lacey B, Sherliker P, Armitage J, Lewington S. Epidemiology of Atherosclerosis and the Potential to Reduce the Global Burden of Atherothrombotic Disease. Circ Res. 2016 Feb 19;118(4):535–46. pmid:26892956
- 20. Bentzon JF, Otsuka F, Virmani R, Falk E. Mechanisms of Plaque Formation and Rupture. Circ Res. 2014 Jun 6;114(12):1852–66. pmid:24902970
- 21. Insull W. The Pathology of Atherosclerosis: Plaque Development and Plaque Responses to Medical Treatment. Am J Med. 2009 Jan 1;122(1, Supplement):S3–14. pmid:19110086
- 22. Picard M, Scott-Boyer M-P, Bodein A, Périn O, Droit A. Integration strategies of multi-omics data for machine learning analysis. Comput Struct Biotechnol J. 2021 Jan 1;19:3735–46. pmid:34285775
- 23. Collins FS, Varmus H. A New Initiative on Precision Medicine [Internet]. https://doi.org/10.1056/NEJMp1500523. Massachusetts Medical Society; 2015 [cited 2021 Sep 29]. Available from: https://www.nejm.org/doi/10.1056/NEJMp1500523
- 24. Leon-Mimila P, Wang J, Huertas-Vazquez A. Relevance of Multi-Omics Studies in Cardiovascular Diseases. Front Cardiovasc Med. 2019;6:91. pmid:31380393
- 25. Fruchart J-C, Nierman MC, Stroes ESG, Kastelein JJP, Duriez P. New Risk Factors for Atherosclerosis and Patient Risk Assessment. Circulation. 2004 Jun 15;109(23_suppl_1):III–15. pmid:15198961
- 26. Shah A, Ahirrao S, Pandya S, Kotecha K, Rathod S. Smart Cardiac Framework for an Early Detection of Cardiac Arrest Condition and Risk. Front Public Health. 2021;9:1536. pmid:34746087
- 27. D’Agostino RB, Vasan RS, Pencina MJ, Wolf PA, Cobain M, Massaro JM, et al. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation. 2008 Feb 12;117(6):743–53. pmid:18212285
- 28. Hippisley-Cox J, Coupland C, Brindle P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ. 2017 May 23;357:j2099. pmid:28536104
- 29. Alaa AM, Bolton T, Angelantonio ED, Rudd JHF, Schaar M van der. Cardiovascular disease risk prediction using automated machine learning: A prospective study of 423,604 UK Biobank participants. PLOS ONE. 2019 May 15;14(5):e0213653. pmid:31091238
- 30. Conroy RM, Pyörälä K, Fitzgerald AP, Sans S, Menotti A, De Backer G, et al. Estimation of ten-year risk of fatal cardiovascular disease in Europe: the SCORE project. Eur Heart J. 2003 Jun 1;24(11):987–1003. pmid:12788299
- 31. SCORE2 working group and ESC Cardiovascular risk collaboration. SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe. Eur Heart J. 2021 Jul 1;42(25):2439–54. pmid:34120177
- 32. Dolezalova N, Reed AB, Despotovic A, Obika BD, Morelli D, Aral M, et al. Development of an accessible 10-year Digital CArdioVAscular (DiCAVA) risk assessment: a UK Biobank study. Eur Heart J—Digit Health. 2021 Sep 1;2(3):528–38.
- 33. Kopitar L, Kocbek P, Cilar L, Sheikh A, Stiglic G. Early detection of type 2 diabetes mellitus using machine learning-based prediction models. Sci Rep. 2020 Jul 20;10(1):11981. pmid:32686721
- 34. Ngiam KY, Khor IW. Big data and machine learning algorithms for health-care delivery. Lancet Oncol. 2019 May 1;20(5):e262–73. pmid:31044724
- 35. Doupe P, Faghmous J, Basu S. Machine Learning for Health Services Researchers. Value Health. 2019 Jul 1;22(7):808–15. pmid:31277828
- 36. Adadi A, Berrada M. Explainable AI for Healthcare: From Black Box to Interpretable Models. In: Bhateja V, Satapathy SC, Satori H, editors. Embedded Systems and Artificial Intelligence. Singapore: Springer Singapore; 2020. p. 327–37.
- 37. He J, Baxter SL, Xu J, Xu J, Zhou X, Zhang K. The practical implementation of artificial intelligence technologies in medicine. Nat Med. 2019 Jan;25(1):30–6. pmid:30617336
- 38. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12(85):2825–30.
- 39. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proc 22nd ACM SIGKDD Int Conf Knowl Discov Data Min. 2016 Aug 13;785–94.
- 40. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLOS Med. 2015 Mar 31;12(3):e1001779. pmid:25826379
- 41. About us [Internet]. [cited 2021 Nov 9]. Available from: https://www.ukbiobank.ac.uk/learn-more-about-uk-biobank/about-us
- 42. Collins R. UK Biobank Protocol. 112.
- 43. Ethics [Internet]. [cited 2021 Nov 9]. Available from: https://www.ukbiobank.ac.uk/learn-more-about-uk-biobank/about-us/ethics
- 44. Cardiovascular Disease (10-year risk) | Framingham Heart Study [Internet]. [cited 2021 Nov 10]. Available from: https://framinghamheartstudy.org/fhs-risk-functions/cardiovascular-disease-10-year-risk/
- 45. QRISK3 [Internet]. [cited 2021 Nov 10]. Available from: https://qrisk.org/three/index.php
- 46. Friedman JH. Greedy Function Approximation: A Gradient Boosting Machine. Ann Stat. 2001;29(5):1189–232.
- 47. XGBoost Documentation—xgboost 1.6.0-dev documentation [Internet]. [cited 2021 Nov 8]. Available from: https://xgboost.readthedocs.io/en/latest/
- 48. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification And Regression Trees. Boca Raton: Routledge; 2017. 368 p.
- 49. Hinton GE. Connectionist Learning Procedures. In: Kodratoff Y, Michalski RS, editors. Machine Learning [Internet]. San Francisco (CA): Morgan Kaufmann; 1990 [cited 2022 Jan 10]. p. 555–610. Available from: https://www.sciencedirect.com/science/article/pii/B9780080510552500298
- 50. Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B. Support vector machines. IEEE Intell Syst Their Appl. 1998 Jul;13(4):18–28.
- 51. Zhang T. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the twenty-first international conference on Machine learning [Internet]. New York, NY, USA: Association for Computing Machinery; 2004 [cited 2021 Nov 12]. p. 116. (ICML ’04). Available from: https://doi.org/10.1145/1015330.1015332
- 52. Freund Y, Schapire RE. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J Comput Syst Sci. 1997 Aug 1;55(1):119–39.
- 53. Hastie T, Rosset S, Zhu J, Zou H. Multi-class AdaBoost. Stat Interface. 2009;2(3):349–60.
- 54. Omohundro SM. Five balltree construction algorithms. International Computer Science Institute Berkeley; 1989.
- 55. Srivastava S, Gupta MR, Frigyik BA. Bayesian quadratic discriminant analysis. J Mach Learn Res. 2007;8(6).
- 56. Zhang H. The optimality of naive Bayes. AA. 2004;1(2):3.
- 57. Tibshirani R. Regression Shrinkage and Selection Via the Lasso. J R Stat Soc Ser B Methodol. 1996;58(1):267–88.
- 58. Correll CU, Solmi M, Veronese N, Bortolato B, Rosson S, Santonastaso P, et al. Prevalence, incidence and mortality from cardiovascular disease in patients with pooled and specific severe mental illness: a large-scale meta-analysis of 3,211,768 patients and 113,383,368 controls. World Psychiatry. 2017;16(2):163–80. pmid:28498599
- 59. Cunningham R, Poppe K, Peterson D, Every-Palmer S, Soosay I, Jackson R. Prediction of cardiovascular disease risk among people with severe mental illness: A cohort study. PLOS ONE. 2019 Sep 18;14(9):e0221521. pmid:31532772
- 60. Ghayvat H, Pandya S, Patel A. Deep Learning Model for Acoustics Signal Based Preventive Healthcare Monitoring and Activity of Daily Living. In 2020. p. 1–7.
- 61. Hasin Y, Seldin M, Lusis A. Multi-omics approaches to disease. Genome Biol. 2017 May 5;18(1):83. pmid:28476144
- 62. Gao W, Yu C. Wearable and Implantable Devices for Healthcare. Adv Healthc Mater. 2021 Sep 1;10(17):2101548. pmid:34495580
- 63. Jiang X, Ming W-K, You JH. The Cost-Effectiveness of Digital Health Interventions on the Management of Cardiovascular Diseases: Systematic Review. J Med Internet Res. 2019 Jun 17;21(6):e13166. pmid:31210136
- 64. Trust for America’s Health. Prevention for a healthier America: Investments in disease prevention yield significant savings, stronger communities. 2008.
- 65. Heron N, Kee F, Donnelly M, Cardwell C, Tully MA, Cupples ME. Behaviour change techniques in home-based cardiac rehabilitation: a systematic review. Br J Gen Pract. 2016 Oct;66(651):e747–57. pmid:27481858