Risk prediction for cardiovascular related diseases using PRS and EHR in the Framingham Heart Study

Taegun Kim; Jaeseung Song; Jong Wha J. Joo

doi:10.1371/journal.pone.0345914

Abstract

Cardiovascular disease is a leading cause of mortality and rising healthcare costs worldwide. Fortunately, the disease is preventable, and addressing risk factors can significantly reduce its effects. Over the past decade, risk prediction models have advanced significantly, notably through polygenic risk score analysis, which is often combined with clinical health information for prediction. However, most previous cardiovascular disease prediction studies based on polygenic risk scores have focused on a single specific disease or event, such as cardiac events. Given the complex nature of cardiovascular disease, which involves a combination of genetic and environmental factors, a comprehensive analysis of disease prediction results is essential. In this study, we investigate the genetic and environmental factors contributing to cardiovascular disease using data from the Framingham Heart Study, a leading cardiovascular cohort. We compared the prediction performance of different methods across various scenarios and assessed performance using multiple evaluation metrics to identify the best-fitting model for six cardiovascular-related diseases. We also analyzed the feature importance of genetic and clinical variables, noting that different variables had varying effects on each disease. Our findings demonstrate the performance of prediction algorithms in forecasting cardiovascular disease from genetic and clinical factors and highlight the importance of each feature in disease prediction. While models relying solely on polygenic risk scores showed relatively low prediction performance for some diseases, integrating genetic information with clinical data improved prediction performance in most cases. For certain diseases, particularly those known to be heritable, polygenic risk scores demonstrated predictive ability, suggesting that they may serve as standalone predictive tools. We believe our study reveals the value of combining polygenic risk scores with clinical variables and expect that our thorough analysis can inform study designs tailored to specific diseases and research objectives.

Citation: Kim T, Song J, Joo JWJ (2026) Risk prediction for cardiovascular related diseases using PRS and EHR in the Framingham Heart Study. PLoS One 21(4): e0345914. https://doi.org/10.1371/journal.pone.0345914

Editor: Giuseppe Novelli, Universita degli Studi di Roma Tor Vergata, ITALY

Received: July 19, 2025; Accepted: March 12, 2026; Published: April 17, 2026

Copyright: © 2026 Kim et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data used in this study were obtained from the FHS through dbGaP (phs000007.v32.p13: Framingham Cohort). The FHS genotype and phenotype data are not publicly available due to participant privacy restrictions. The data are available to other researchers upon approval of data requests through dbGaP. Data can be requested at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000007.v32.p13. Summary statistics used for PRS estimation were obtained from the GWAS Catalog (https://www.ebi.ac.uk/gwas/) for the following datasets: GCST006414 (AF), GCST90473543 (Myocardial Ischemia), GCST90480183 (Diastolic heart failure), GCST007320 (Alzheimer’s disease), GCST90267278 (Diabetes), and GCST90044350 (Stroke). All scripts used for data preprocessing, feature engineering, model training, and figure generation are available on GitHub: https://github.com/DGU-CBLAB/FHS-Riskprediction.

Funding: This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2026-RS-2020-II201789), and the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2026-RS-2023-00254592) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Abbreviations: AF, atrial fibrillation; AUC, area under the receiver operating characteristic curve; AUPRC, area under the precision–recall curve; BG, blood glucose; CALC_LDL, calculated low-density lipoprotein cholesterol; CHD, coronary heart disease; CHF, congestive heart failure; CI, confidence interval; CPD, cigarettes per day; CREAT, creatinine; CVD, cardiovascular disease; DBP, diastolic blood pressure; DLVH, definite left ventricular hypertrophy; DOR, diagnostic odds ratio; EHR, electronic health records; FHS, Framingham Heart Study; GWAS, genome-wide association studies; HDL, high-density lipoprotein cholesterol; HGT, height; HIP, hip girth; LR+, positive likelihood ratio; LR−, negative likelihood ratio; MAF, minor allele frequency; ML, machine learning; PCA, principal component analysis; PRS, polygenic risk score; QC, quality control; SBP, systolic blood pressure; SHAP, Shapley Additive Explanations; TRIG, triglycerides; VENT_RT, ventricular rate by electrocardiography

Introduction

Cardiovascular disease (CVD) is one of the most prevalent diseases in modern society, posing a significant health risk and a substantial financial burden on global healthcare systems [1–13]. CVD is a complex disorder resulting from interactions between genetic and environmental factors; however, the specific contributions of genes and environmental influences remain poorly understood. CVD encompasses a variety of conditions affecting the heart and blood vessels, including myocardial infarction, coronary heart disease (CHD), arrhythmia, and heart failure. Moreover, CVD has a substantial influence on major adult diseases including diabetes and dementia that can precede or follow it [14–20]. Although CVD is largely preventable, many individuals remain unaware of their risk because early stages may be asymptomatic or present with nonspecific symptoms that overlap with other diseases.

Over the past decade, risk prediction models have advanced considerably. Early CVD risk prediction models typically relied on linear multivariate regression methods, which generally exhibited only moderate prediction performance for specific subpopulations [1,21,22]. However, with the rise of machine learning (ML) technologies, newer models have enhanced CVD risk predictions by analyzing larger datasets and identifying the complex interactions among predictors [21]. These approaches have employed various algorithms, including random forest, gradient boosting machine, Naive Bayes, support vector machines, and artificial neural networks, often in combination as ensembles to enhance prediction accuracy [23–28].

Genome-wide association studies (GWAS) have identified numerous genetic variants associated with CVD. The polygenic nature of complex diseases, such as CVD, involves multiple genetic variants, each contributing a small effect that collectively influences trait variability [29]. Polygenic risk scores (PRS) quantify an individual’s genetic risk by aggregating the effects of multiple variants, thereby improving disease prediction. Recognizing the substantial role of environmental factors in CVD susceptibility, studies have aimed to enhance PRS performance by incorporating data from multiple traits [30,31], disease-related biomarkers [32,33], clinical risk factors [34–36], and environmental variables [32–35] that influence disease risk. However, most previous CVD prediction studies based on PRS focus on one specific disease or event, such as cerebrovascular disease [37], coronary artery disease [38–42], diabetes [43,44], or cardiac events [45], without thoroughly investigating the genetic and environmental factors associated with overall CVD-related diseases. Given the complex nature of CVDs, which result from a combination of genetic and environmental factors, a detailed analysis of CVD prediction outcomes is essential.

This study aimed to investigate multiple CVD-related phenotypes and assess the relative importance of predictive variables, comparing the contribution of genetic and clinical measurements for each phenotype. Six cardiovascular-related diseases, namely atrial fibrillation (AF), stroke, CHD, congestive heart failure (CHF), dementia, and diabetes, were analyzed using data from the Framingham Heart Study (FHS), a long-term, large-scale representative cardiovascular cohort. PRS was calculated to estimate genetic risk using FHS genotypes and GWAS summary statistics from the GWAS Catalog. Electronic health records (EHR) from the FHS provided clinical variables for risk prediction. To evaluate the contributions of genetic and environmental factors, prediction performance was assessed under three scenarios: PRS-only, EHR-derived clinical variables-only, and an integrated model combining PRS and EHR variables. Logistic regression and four representative ML methods (random forest, XGBoost [46], CatBoost [47], and LightGBM [48]) were employed to predict each disease and identify the most suitable model. The Shapley Additive Explanations (SHAP) framework was used to quantify the influence of genetic and clinical variables on the prediction of each phenotype.

Materials and methods

Overview of the study design

The study aims to systematically evaluate the predictive utility of PRS and EHR for CVD-related disease risk prediction by comparing different models. PRS was calculated for six disease outcomes (AF, stroke, CHD, CHF, dementia, and diabetes) using three widely adopted PRS methods: PRSice2 [49], LDpred2 [50], and Lassosum [51]. For each disease, PRS predictive performance was initially compared using logistic regression, and the method with the highest AUC was selected as the disease-specific genetic risk score model. Using the selected PRS, three prediction scenarios were evaluated: (1) PRS-only, (2) EHR-only, and (3) PRS + EHR. This framework enabled a comprehensive assessment of the standalone predictive value of PRS and its contribution beyond conventional clinical risk factors. Across these scenarios, logistic regression and four ML methods (random forest, XGBoost, LightGBM, and CatBoost) were applied for risk prediction. Fig 1 illustrates the overall workflow of our analysis.

Fig 1. Workflow of risk prediction analysis.

GWAS summary statistics from the GWAS Catalog and genotypes from the FHS were used as base and target data, respectively, for PRS estimation. PRS were calculated using PRSice2, LDpred2, and Lassosum, with prediction accuracy assessed by the area under the receiver operating characteristic curve (AUC) in logistic regression. The best-performing PRS method was selected for each disease and used for risk prediction with logistic regression and four ML methods under three scenarios: PRS-only, EHR-only, and PRS + EHR. Model evaluation metrics were calculated for each scenario and disease outcome.

https://doi.org/10.1371/journal.pone.0345914.g001

CVD data from FHS and GWAS catalog

In the study, we aim to compare the prediction performance of six CVD-related diseases under three different scenarios: PRS-only, EHR-only, and PRS + EHR. For the analysis, genetic and clinical data as well as GWAS summary statistics are required.

Genetic and clinical data were obtained from the FHS, which was initiated in 1948 in Framingham, Massachusetts, to investigate the long-term development of CVD and identify its underlying causes. This study includes a total of 7,082 participants from four FHS cohorts: 596 from the Original Cohort, 2,754 from the Offspring Cohort, 88 from the New Offspring Spouse Cohort, and 3,644 from the Third Generation Cohort. The number of clinical examinations differed by cohort: 32 for the Original Cohort, nine for the Offspring Cohort, and three each for the New Offspring Spouse and Third Generation Cohorts. Data are available upon approval through dbGaP (phs000007.v32.p13).

GWAS summary statistics for AF, stroke, myocardial ischemia, diastolic heart failure, Alzheimer’s disease, and diabetes were obtained from the GWAS Catalog (https://www.ebi.ac.uk/gwas/): GCST006414 [52], GCST90044350 [53], GCST90473543 [54], GCST90480183 [55], GCST007320 [56], and GCST90267278 [57], respectively. These datasets were used as base data for PRS construction for the corresponding diseases: AF, stroke, CHD, CHF, dementia, and diabetes.

Preprocessing and quality control

Genotype quality control (QC) was performed using PLINK v1.9. Variants were excluded if they had a call rate < 99%, a minor allele frequency (MAF) < 1%, or deviated from Hardy–Weinberg equilibrium in controls (P < 1 × 10⁻⁶). Samples with a call rate below 99% were also removed. For PRS analysis, harmonization between base GWAS summary statistics and target genotype data was performed. Variants with allele mismatches, strand inconsistencies, or genomic position discrepancies were excluded, retaining only variants present in both datasets. PRS weights were assigned using complete GWAS summary statistics for each disease outcome. After QC and harmonization, genotype data from 6,170 individuals were retained for dementia, and from 7,082 individuals for all other diseases; these data were then used for downstream analyses. The final number of variants used for PRS construction is shown in Table 1.
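
The variant-level thresholds above can be illustrated with a minimal sketch. This is not the PLINK implementation used in the study: the genotype encoding (0, 1, or 2 copies of one allele, `None` for a missing call) and the function names are assumptions made for illustration, and the Hardy–Weinberg test is omitted.

```python
from typing import List, Optional

# Hypothetical encoding: 0/1/2 copies of the counted allele, None = missing call.
Genotype = Optional[int]


def call_rate(calls: List[Genotype]) -> float:
    """Fraction of samples with a non-missing genotype call for one variant."""
    return sum(g is not None for g in calls) / len(calls)


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """Minor allele frequency computed from the non-missing calls only."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1.0 - freq)


def passes_variant_qc(calls: List[Genotype],
                      min_call_rate: float = 0.99,
                      min_maf: float = 0.01) -> bool:
    """Keep a variant only if call rate >= 99% and MAF >= 1%."""
    return (call_rate(calls) >= min_call_rate
            and minor_allele_frequency(calls) >= min_maf)


# Toy variants for 200 samples each.
common_variant = [0] * 150 + [1] * 40 + [2] * 10   # MAF = 60 / 400 = 0.15
rare_variant = [0] * 199 + [1]                     # MAF = 1 / 400 = 0.0025
poorly_called = [None] * 5 + [1] * 195             # call rate = 0.975

for name, calls in [("common", common_variant),
                    ("rare", rare_variant),
                    ("poorly_called", poorly_called)]:
    print(name, passes_variant_qc(calls))
```

In practice these filters would be applied by the genotype toolkit itself; the sketch only makes the two thresholds concrete.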

Table 1. Sample size and number of variants included in the study after QC.

https://doi.org/10.1371/journal.pone.0345914.t001

QC and harmonization procedures were applied to the GWAS summary statistics as well. GCST90473543 and GCST90480183 were originally aligned to genome build GRCh38, whereas the FHS genotype data are based on GRCh37. Therefore, these two datasets were converted to GRCh37 using CrossMap [58] before further analysis. Following the genome build conversion, variant-level QC was conducted by excluding SNPs with MAF < 0.01 or imputation quality score < 0.8. Duplicate SNPs with identical rsIDs were removed. Strand-ambiguous SNPs (A/T or C/G) were excluded to prevent alignment errors. SNP harmonization between the base and target datasets was performed; variants with discordant alleles were removed unless strand flipping resolved the mismatch. Variants with ambiguous alleles relative to the target dataset were further excluded. Table 2 summarizes sample sizes and variant counts after QC.
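
A minimal sketch of the allele harmonization logic described above, for single-nucleotide alleles only. The function names are hypothetical, and one step is an assumption not stated in the text: when the base and target alleles match in swapped order, the sketch keeps the variant and flips the sign of the effect size.

```python
from typing import Optional

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1: str, a2: str) -> bool:
    """A/T and C/G SNPs look identical on both strands."""
    return COMPLEMENT[a1] == a2


def harmonize_beta(base_effect: str, base_other: str, beta: float,
                   target_a1: str, target_a2: str) -> Optional[float]:
    """Return beta expressed for target_a1, or None if the SNP is dropped."""
    if is_strand_ambiguous(base_effect, base_other):
        return None  # excluded to prevent alignment errors
    for flip in (False, True):
        # Second pass retries the comparison on the complementary strand.
        effect = COMPLEMENT[base_effect] if flip else base_effect
        other = COMPLEMENT[base_other] if flip else base_other
        if (effect, other) == (target_a1, target_a2):
            return beta
        if (effect, other) == (target_a2, target_a1):
            return -beta  # assumption: swapped alleles handled by sign flip
    return None  # discordant alleles that strand flipping cannot resolve


print(harmonize_beta("A", "G", 0.10, "A", "G"))  # same alleles, same order
print(harmonize_beta("A", "G", 0.10, "T", "C"))  # resolved by strand flip
print(harmonize_beta("A", "T", 0.10, "A", "T"))  # strand-ambiguous
print(harmonize_beta("A", "G", 0.10, "A", "C"))  # discordant
```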

Table 2. GWAS summary statistic information used for the analysis.

https://doi.org/10.1371/journal.pone.0345914.t002

For EHR data, 18 clinical variables were initially considered. Disease onset times were identified using the disease-specific Survival and Follow-up Datasets provided by the FHS. For each individual, clinical examination data corresponding to the year of disease diagnosis were used for analysis. When data from the exact diagnosis year were unavailable, data from the closest preceding examination were substituted. Through this process, a comprehensive set of clinical and lifestyle variables was compiled. For each disease-specific dataset, principal component analysis (PCA) was applied independently to reduce dimensionality and mitigate multicollinearity. The leading principal components that together explained at least 90% of the total variance were retained. To further remove redundant information, pairwise correlation analysis was conducted among candidate variables. When the absolute correlation coefficient between two variables exceeded 0.8, the variable with the lower explanatory contribution was excluded. Notably, although feature selection was performed independently for each disease outcome, the same set of EHR variables was retained across all outcomes, indicating that these variables robustly satisfied the selection criteria and reflected shared cardiometabolic risk factors underlying multiple disease phenotypes. The resulting set of EHR features is summarized in Table 3. Age and sex were included as common covariates in all analyses. EHR variables were obtained from the FHS Workthru files, specifically from the Frequently Used Cardiovascular Risk Factors dataset. Alcohol consumption variables were processed and categorized to reflect actual intake levels, using the same calculation approach as described in prior work [59]. After excluding individuals with missing data, the final numbers of cases and controls were as follows: AF (658 cases, 5,220 controls), CHD (461 cases, 5,264 controls), CHF (374 cases, 5,569 controls), dementia (323 cases, 5,213 controls), diabetes (681 cases, 5,386 controls), and stroke (256 cases, 5,681 controls).
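
The two selection rules in this paragraph (a 90% cumulative-variance cutoff and a 0.8 absolute-correlation cutoff) can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the explained-variance values and the per-variable `contribution` scores are hypothetical inputs, and the correlation filter is implemented as a simple greedy pass.

```python
import math
from typing import Dict, List


def n_components_for_variance(explained_variance: List[float],
                              threshold: float = 0.90) -> int:
    """Smallest number of leading components whose variance share reaches threshold."""
    total = sum(explained_variance)
    cumulative = 0.0
    ordered = sorted(explained_variance, reverse=True)
    for k, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return k
    return len(ordered)


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def drop_correlated(columns: Dict[str, List[float]],
                    contribution: Dict[str, float],
                    max_abs_r: float = 0.8) -> List[str]:
    """Greedy filter: visit variables from highest to lowest contribution and
    keep one only if |r| <= max_abs_r with every variable already kept."""
    kept: List[str] = []
    for name in sorted(columns, key=lambda c: contribution[c], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= max_abs_r for k in kept):
            kept.append(name)
    return kept


# Toy data: SBP and DBP are perfectly correlated, HDL is not.
toy_columns = {
    "SBP": [120.0, 130.0, 140.0, 150.0, 160.0],
    "DBP": [80.0, 85.0, 90.0, 95.0, 100.0],
    "HDL": [50.0, 40.0, 60.0, 45.0, 55.0],
}
toy_contribution = {"SBP": 0.5, "DBP": 0.3, "HDL": 0.2}  # hypothetical scores

print(n_components_for_variance([5.0, 4.5, 0.5]))
print(drop_correlated(toy_columns, toy_contribution))
```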

Table 3. Patient clinical and lifestyle variables from the FHS included in the analysis after PCA. Values are presented as mean (min, max).

https://doi.org/10.1371/journal.pone.0345914.t003

Among the participants included in the analysis, the number of cases was relatively low compared with controls, resulting in class imbalance. To mitigate this issue during model training, sampling-based imbalance handling strategies were incorporated into the hyperparameter tuning process to identify the most appropriate approach for each model. Details are provided in the “Hyperparameter Tuning for Risk Prediction Models” section.
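
As a rough illustration of one sampling-based strategy, the sketch below performs random undersampling of controls. It assumes that a sampling ratio means the number of cases divided by the number of controls after resampling (the convention used by imbalanced-learn); the text does not define the ratio, so this interpretation and the function name are assumptions.

```python
import random
from typing import List


def undersample_controls(labels: List[int], ratio: float = 1.0,
                         seed: int = 0) -> List[int]:
    """Return sorted row indices that keep every case (label 1) and a random
    subset of controls (label 0) so that n_cases / n_controls is about ratio."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed for reproducibility
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(controls), int(round(len(cases) / ratio)))
    return sorted(cases + rng.sample(controls, n_keep))


# Toy training fold: 30 cases and 270 controls.
toy_labels = [1] * 30 + [0] * 270
for r in (0.6, 0.8, 1.0):
    kept = undersample_controls(toy_labels, ratio=r)
    n_cases = sum(toy_labels[i] for i in kept)
    print(r, n_cases, len(kept) - n_cases)
```

Resampling of this kind would be applied to training rows only, so that the evaluation data keep their original class balance.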

PRS

PRS provides a comprehensive estimate of the combined effect of genetic variants on disease occurrence and has been increasingly applied across various diseases, enhancing prediction performance [5,6]. PRS is defined as the sum of risk alleles, each weighted by its corresponding effect size. To calculate the PRS, base and target datasets are required. Base data typically consist of GWAS summary statistics detailing the associations between genetic variants and specific diseases or traits. Target data include genotype and phenotype information from the population under study, which must be independent of the base data. The PRS for an individual is calculated using the following equation:

$$\mathrm{PRS}_i = \sum_{j=1}^{M} \beta_j X_{ij}$$

where $\mathrm{PRS}_i$ represents the score for individual $i$, $M$ denotes the number of SNPs, $X_{ij}$ indicates the genotype of individual $i$ at SNP $j$ in the target data, and $\beta_j$ represents the weight of SNP $j$ derived from the base data. These weights were determined based on effect size estimates from GWAS and may be adjusted depending on the calculation methods used.
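
The weighted sum described above can be sketched in a few lines. The weights and genotype dosages below are made-up values for illustration, and the function name is hypothetical; tools such as PRSice2, LDpred2, and Lassosum additionally adjust or select the weights before this step.

```python
from typing import List, Sequence


def polygenic_risk_score(dosages: Sequence[float],
                         weights: Sequence[float]) -> float:
    """Sum over SNPs of (weight from base data) x (allele dosage in target data)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must cover the same SNPs")
    return sum(w * x for w, x in zip(weights, dosages))


# Hypothetical effect sizes for three SNPs and dosages for two individuals.
toy_weights = [0.12, -0.05, 0.30]
toy_dosages: List[List[int]] = [
    [2, 1, 0],  # individual 1
    [0, 2, 1],  # individual 2
]

toy_scores = [polygenic_risk_score(row, toy_weights) for row in toy_dosages]
print(toy_scores)
```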

In this study, PRS were calculated using three widely used methods: PRSice2, Lassosum, and LDpred2. These methods represent different modeling strategies for PRS construction and were applied uniformly to the same datasets to enable a fair comparison of predictive performance. For each disease, the PRS derived from the method yielding the highest AUC in logistic regression was selected for analysis. Effect sizes for PRS construction were based on the beta coefficients from the previously described GWAS summary statistics.

Risk prediction model

Logistic regression and four ML methods (random forest, XGBoost, CatBoost, and LightGBM) were used to develop risk prediction models for CVD based on PRS and EHR data and to identify the best-fitting model for each disease. Random forest is an ensemble learning method that combines multiple decision trees to produce aggregated predictions. It has outperformed other ML methods, such as Naïve Bayes and AdaBoost, in previous studies [1,52,53]. XGBoost, CatBoost, and LightGBM are based on the gradient boosting algorithm [46–48]. XGBoost improves overall model predictions by iteratively training new models that correct the errors of earlier ones. CatBoost handles categorical features natively, making it well suited to categorical data. LightGBM uses a leaf-wise tree-growth strategy to minimize prediction errors. The dataset was split into training (80%) and test (20%) sets for model training and evaluation [54].

Hyperparameter tuning for risk prediction model

Logistic regression and four ML methods were evaluated for risk prediction. For each model, algorithm-specific hyperparameter search spaces were predefined, and a fully nested cross-validation framework was employed to prevent optimistic bias. The dataset was partitioned using stratified four-fold outer cross-validation, with each outer test fold used exclusively for final performance evaluation.

Within each outer training fold, stratified three-fold inner cross-validation was conducted to jointly optimize model hyperparameters and class imbalance handling strategies, including undersampling, SMOTE-based oversampling, and class-weight adjustment. Sampling ratios of 0.6, 0.8, and 1.0 were explored where applicable. The mean ROC-AUC across inner folds was used as the selection criterion. All resampling procedures and preprocessing steps were performed using training data only to prevent information leakage. Regarding feature preprocessing, age was intentionally retained on its original scale, as it serves as a clinically interpretable risk adjustment covariate rather than a feature whose scale is intended to influence model optimization. Sex and DLVH status were also used in their original form, while all remaining continuous variables were z-score standardized. The algorithm-specific hyperparameter search spaces explored for each model are summarized in S1 Table.
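
The fold structure of this nested design (four stratified outer folds, three stratified inner folds within each outer training set) can be sketched with index bookkeeping alone. The sketch fits no model and is not the authors' code; the function names and the round-robin stratification are assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Iterator, List, Sequence, Tuple


def stratified_folds(indices: Sequence[int], labels: Sequence[int],
                     k: int, seed: int) -> List[List[int]]:
    """Split indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds: List[List[int]] = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)  # deal each class round-robin
    return folds


def nested_splits(labels: Sequence[int], outer_k: int = 4, inner_k: int = 3,
                  seed: int = 0
                  ) -> Iterator[Tuple[List[int], List[int],
                                      List[Tuple[List[int], List[int]]]]]:
    """Yield (outer_train, outer_test, [(inner_fit, inner_validation), ...])."""
    outer = stratified_folds(range(len(labels)), labels, outer_k, seed)
    for o, outer_test in enumerate(outer):
        outer_train = [i for f, fold in enumerate(outer) if f != o for i in fold]
        inner = stratified_folds(outer_train, labels, inner_k, seed + o + 1)
        inner_pairs = []
        for v, validation in enumerate(inner):
            fit = [i for f, fold in enumerate(inner) if f != v for i in fold]
            inner_pairs.append((fit, validation))
        yield outer_train, outer_test, inner_pairs


# Toy labels: 24 cases and 96 controls.
toy_labels = [1] * 24 + [0] * 96
for outer_train, outer_test, inner_pairs in nested_splits(toy_labels):
    # Hyperparameters and the imbalance strategy would be chosen on inner_pairs
    # only; outer_test is touched once, for the final evaluation.
    print(len(outer_train), len(outer_test), [len(v) for _, v in inner_pairs])
```

Keeping every tuning decision inside the inner pairs is what prevents the outer test fold from influencing model selection.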

After selecting the optimal configuration, models were retrained on the full outer training set and evaluated on the independent outer test set. Model performance was primarily assessed using ROC-AUC. For clinical interpretability and comparability across models, a fixed classification threshold of 0.5 was applied uniformly to compute sensitivity, specificity, likelihood ratios, and diagnostic odds ratios (DORs). Full results from the nested cross-validation are provided in S1 File.

Model evaluation metrics

Model performance was evaluated using multiple complementary metrics, including the AUC, area under the precision–recall curve (AUPRC), Brier score, sensitivity, specificity, DOR, positive likelihood ratio (LR+), and negative likelihood ratio (LR−). Discriminative ability was primarily assessed using ROC curves, which plot sensitivity against the false positive rate across varying decision thresholds. The AUC, which equals 0.5 for chance-level discrimination and 1.0 for perfect discrimination, was used as the main summary measure of discrimination, with higher values indicating improved ability to distinguish individuals with and without disease. To quantify statistical uncertainty, 95% confidence intervals (CIs) for AUC were estimated using DeLong’s method.
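
The AUC can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The minimal sketch below computes it by direct pairwise comparison, counting ties as one half; the function name and toy values are illustrative, and the DeLong confidence interval is not implemented here.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (case, control) pairs ranked correctly."""
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("both classes must be present")
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(case_scores) * len(control_scores))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in sample size, which is fine for a sketch; rank-based formulas give the same value more efficiently.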

Given the presence of class imbalance in several disease outcomes, AUPRC was additionally reported to provide a more informative assessment of predictive performance for the positive class. Unlike AUC, AUPRC emphasizes precision–recall trade-offs and is particularly suitable for imbalanced datasets. Model calibration was evaluated using the Brier score, which measures the mean squared difference between predicted probabilities and observed outcomes, with lower values indicating better calibrated and more accurate probabilistic predictions.
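
A minimal sketch of the two quantities discussed here. The Brier score follows the definition in the text. For the precision–recall summary, the sketch uses average precision, one common estimator of AUPRC; whether the study used this estimator is not stated, so that choice is an assumption, as is the simplification that no two scores are tied.

```python
from typing import Sequence


def brier_score(labels: Sequence[int], probabilities: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probabilities)) / len(labels)


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of each true case.
    Assumes untied scores."""
    n_cases = sum(labels)
    if n_cases == 0:
        raise ValueError("at least one case is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank  # precision when this case is retrieved
    return precision_sum / n_cases


toy_labels = [0, 0, 1, 1]
toy_probabilities = [0.1, 0.4, 0.35, 0.8]
print(brier_score(toy_labels, toy_probabilities))
print(average_precision(toy_labels, toy_probabilities))
```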

Sensitivity and specificity were calculated to assess the ability of each model to correctly identify diseased and non-diseased individuals, respectively. CIs were estimated to account for sampling variability. Overall diagnostic performance was further summarized using the DOR, derived from sensitivity and specificity, with higher values indicating stronger discriminatory power. In addition, the LR+, defined as sensitivity divided by (1 − specificity), and the LR−, defined as (1 − sensitivity) divided by specificity, were computed. Higher LR+ values and lower LR− values indicate superior diagnostic performance.
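
These threshold-based metrics can be sketched directly from the definitions above, using DOR = LR+ / LR−. The function name, the toy data, the choice to classify a probability equal to the threshold as positive, and the use of infinity for a zero denominator are assumptions of this sketch.

```python
import math
from typing import Dict, Sequence


def _ratio(numerator: float, denominator: float) -> float:
    """Division that returns infinity instead of raising on a zero denominator."""
    return numerator / denominator if denominator != 0 else math.inf


def diagnostic_metrics(labels: Sequence[int], probabilities: Sequence[float],
                       threshold: float = 0.5) -> Dict[str, float]:
    """Sensitivity, specificity, LR+, LR-, and DOR at a fixed threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probabilities):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_pos = _ratio(sensitivity, 1 - specificity)
    lr_neg = _ratio(1 - sensitivity, specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "dor": _ratio(lr_pos, lr_neg),
    }


# Toy example: 10 cases (8 detected) and 90 controls (10 false positives).
toy_labels = [1] * 10 + [0] * 90
toy_probabilities = [0.9] * 8 + [0.2] * 2 + [0.1] * 80 + [0.7] * 10
toy_metrics = diagnostic_metrics(toy_labels, toy_probabilities)
print(toy_metrics)
```

For this toy table the DOR also equals (TP × TN) / (FN × FP) = (8 × 80) / (2 × 10), which is a quick way to check the likelihood-ratio route.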

Implementation

All analyses were conducted using R (version 3.6.0) and Python (version 3.10.19). PRS estimation was performed in R using PRSice2, LDpred2, and lassosum; clumping and p-value thresholding were applied only in PRSice2 using default settings, whereas LDpred2 and lassosum leveraged LD information from the HapMap3 reference panel without explicit p-value thresholds. All other analyses, including ML model training and evaluation, were performed in Python. ML methods were implemented using scikit-learn (version 1.7.2), XGBoost (version 2.0.3), LightGBM (version 4.6.0) and CatBoost (version 1.2.8). Class imbalance handling methods, including oversampling and undersampling, were implemented using the imbalanced-learn package (version 0.14.0). Logistic regression models were implemented using the LogisticRegression class from scikit-learn with L2 regularization and the liblinear solver. All random seeds were fixed within each cross-validation procedure to ensure reproducibility, and SHAP (version 0.44.1) values were computed on the held-out test data only to avoid information leakage; for tree-based models, a model-consistent explainer (TreeExplainer) was used.

Results

PRS estimation

To select the method for estimating the PRS, three PRS computation methods were compared: PRSice2 [60], LDpred2 [61], and lassosum [62]. For each disease outcome, a logistic regression model was fitted using a single PRS variable as the sole predictor. Model performance was assessed using the AUC.

To obtain robust performance estimates, the training and evaluation procedure was repeated 100 times with different random data splits, and the mean AUC across repetitions was used as the final performance metric. Mean AUC values for each PRS method and disease are summarized in Table 4. For each disease, the PRS method achieving the highest mean AUC was selected as the representative genetic risk score and subsequently incorporated into the corresponding prediction model.

Table 4. Comparison of AUC values (mean, CI) obtained from PRSice2, LDpred2, and Lassosum.

https://doi.org/10.1371/journal.pone.0345914.t004

Although performance varied across diseases, the method yielding the highest mean AUC was selected for each outcome. PRSice2 was used for AF and CHF; LDpred2 was applied for CHD, dementia and diabetes; and Lassosum was utilized for stroke. Additional model performance metrics, including AUPRC and mean Brier scores, are reported in S2 Table.

Performance of CVD prediction model

To evaluate the performance of models for predicting six CVD-related conditions: AF, stroke, CHD, CHF, dementia, and diabetes, AUC values were calculated for logistic regression and four ML methods: random forest, XGBoost, CatBoost, and LightGBM. The method achieving the highest AUC value in each case was identified as the best-fit prediction method for the respective disease. The analysis was conducted across three scenarios: (1) PRS-only, (2) EHR-only and (3) PRS + EHR. Five prediction methods were evaluated for each disease under three different scenarios, resulting in 15 combinations of models. Fig 2 presents histograms of AUC values for the five methods across the six diseases under the three scenarios. The full model evaluation metrics are provided in S2 File.

Fig 2. Histograms illustrate AUC values for five prediction methods across six CVD-related diseases under three scenarios.

(1) PRS-only, (2) EHR-only and (3) PRS + EHR. Blue, orange, green, red and purple bars represent the AUC values for CatBoost, LightGBM, logistic regression, random forest and XGBoost respectively.

https://doi.org/10.1371/journal.pone.0345914.g002

Additionally, the ROC curves for the three scenarios were compared using the best-fit prediction method for each of the six CVD-related diseases (Fig 3).

Fig 3. ROC curves for the three scenarios evaluated using the best-fit prediction method for six CVD-related diseases.

Blue, orange, and green lines represent (1) PRS-only, (2) EHR-only and (3) PRS + EHR, respectively.

https://doi.org/10.1371/journal.pone.0345914.g003

In addition to AUC, the AUPRC, Brier score, sensitivity, specificity, DOR, LR+, and LR− were compared for the best-fit prediction methods across the three scenarios (Table 5).

Table 5. Performance metrics of the best-fit prediction models across three scenarios.

https://doi.org/10.1371/journal.pone.0345914.t005

For AF, the PRS + EHR scenario achieved the highest discrimination, outperforming the other scenarios across all evaluation metrics except sensitivity, where it matched the EHR-only scenario. Interestingly, the PRS-only scenario also showed strong performance in AF, underscoring the important role of genetic risk in this condition. Similarly, for CHF, the PRS-only scenario yielded a relatively high AUC of 0.812, and apart from AUPRC, incorporating PRS information appeared to enhance the predictive performance beyond that of EHR alone.

In some diseases, the EHR-only models performed so well that adding PRS provided little additional benefit. For dementia, the PRS-only scenario achieved a fairly high AUC of 0.892, but the EHR-only scenario reached an AUC of 0.910, leaving little room for improvement and resulting in no meaningful gain from combining PRS with EHR. In diabetes, the EHR-only scenario showed extremely strong performance with an AUC of 0.967, whereas the PRS-only scenario had a comparatively lower AUC of 0.715, so adding PRS contributed only modestly to the prediction.

For CHD, the PRS-only scenario showed a relatively modest performance with an AUC of 0.677, while the EHR-only and PRS + EHR scenarios exhibited similar levels of discrimination. For stroke, all three scenarios yielded AUCs in the 0.7 range with no substantial differences, and the best-performing scenario varied by evaluation metric, with PRS + EHR, PRS-only, and EHR-only each showing stronger performance in different aspects.

Feature importance of genetic and environmental variables in CVD prediction

To evaluate feature importance in disease prediction, SHAP analysis was conducted to identify factors that most strongly influenced model outputs and to facilitate interpretability of the prediction results. The analysis was performed for the PRS + EHR models, using the best-performing method selected separately for each disease outcome based on predictive performance. Fig 4 presents the SHAP summary plots highlighting the relative contributions of PRS and clinical features to disease risk prediction across the six diseases.
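
To make the quantity behind SHAP concrete, the sketch below computes exact Shapley values for a toy model by enumerating all feature coalitions. It is not the TreeExplainer algorithm used in the study: the linear risk function, its coefficients, the feature values, and the single background point are all hypothetical, and brute-force enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(model: Callable[[List[float]], float],
                   x: Sequence[float],
                   background: Sequence[float]) -> List[float]:
    """Exact Shapley values for one instance relative to one background point.
    Features outside a coalition are replaced by their background values."""
    n = len(x)

    def value(coalition: set) -> float:
        z = [x[i] if i in coalition else background[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk(z: List[float]) -> float:
    """Hypothetical linear risk score over (age, PRS, HDL)."""
    age, prs, hdl = z
    return 0.04 * age + 0.5 * prs - 0.02 * hdl


toy_x = [70.0, 1.2, 40.0]           # instance being explained
toy_background = [50.0, 0.0, 55.0]  # reference individual
toy_phi = shapley_values(toy_risk, toy_x, toy_background)
print(toy_phi)
```

For a linear model each value reduces to the coefficient times the difference from the background, and the values sum to the difference between the prediction for the instance and the prediction for the background, which is the additivity property that SHAP summary plots rely on.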

Fig 4. SHAP plot for the PRS + EHR model showing feature contributions to predictions for CVD-related diseases. The figure illustrates the distribution of feature importance and the correlation of predictive variables across six CVDs. The x-axis represents the SHAP values, while the color of the dots indicates the corresponding feature values.

https://doi.org/10.1371/journal.pone.0345914.g004

The analysis revealed that age was the dominant contributor to phenotype prediction, ranking first for four of the six examined phenotypes and second for diabetes and CHD. This finding aligns with the established epidemiological characteristics of CVDs, although this predictive dominance does not necessarily imply a direct causal relationship [63–65]. Additionally, blood glucose (BG) emerged as a dominant positive predictor for diabetes, reflecting a key clinical feature of the disease. High-density lipoprotein (HDL) was identified as a protective feature for most CVDs, consistent with its beneficial role in maintaining vascular health [66–68]. Distinct sex-specific contributions were observed in AF, where sex exhibited a consistent fixed effect across all epochs [69]. This pattern conforms to clinical data indicating that baseline risk levels for AF and CHF are substantially higher in males than in females.

Beyond clinical parameters, genetic risk features contributed to varying degrees depending on the phenotype. For AF, the polygenic risk feature from PRSice2 ranked second, suggesting that genetic predisposition is a critical determinant for this arrhythmia. Similarly, for dementia, the polygenic risk feature from LDpred2 ranked third, with higher genetic predisposition associated with higher predicted risk. However, genetic predisposition did not strongly drive the prediction of the remaining phenotypes. Consequently, while PRS-only models exhibited moderate performance, the relative contribution of genetic factors diminished substantially upon the inclusion of robust clinical predictors in the integrated models [64,70].
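To make the attribution idea behind the SHAP rankings concrete, the following is a minimal sketch that computes exact Shapley values by enumerating feature coalitions. It is an illustration only, not the pipeline used in this study: the function `toy_risk`, its coefficients, the feature values, and the single reference point `baseline` are all hypothetical, and practical SHAP implementations average over a background dataset and use faster model-specific algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline).

    Features outside a coalition are set to their baseline value
    (a single reference point; an assumption made for simplicity).
    """
    names = list(x)
    n = len(names)
    phi = {}
    for name in names:
        others = [f for f in names if f != name]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_f = {f: x[f] if (f in subset or f == name) else baseline[f]
                          for f in names}
                without_f = {f: x[f] if f in subset else baseline[f]
                             for f in names}
                total += weight * (predict(with_f) - predict(without_f))
        phi[name] = total
    return phi


def toy_risk(v):
    # Hypothetical risk score with one interaction term; not the paper's model.
    return (0.04 * v["age"] + 0.5 * v["prs"] - 0.02 * v["hdl"]
            + 0.01 * v["age"] * v["prs"])


x = {"age": 70.0, "prs": 1.5, "hdl": 40.0}         # hypothetical individual
baseline = {"age": 50.0, "prs": 0.0, "hdl": 55.0}  # hypothetical reference
phi = shapley_values(toy_risk, x, baseline)

for feature, value in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{feature}: {value:+.3f}")
```

In this toy setting the attributions sum to the difference between the individual's score and the baseline score, and an HDL value below the reference receives a positive attribution, which mirrors how a protective feature appears in a SHAP summary plot.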

Discussion

CVD remains a leading cause of mortality worldwide and contributes substantially to escalating healthcare expenditures. Effective management of modifiable risk factors can markedly reduce its burden in many cases. Over the past decade, risk prediction methodologies have evolved considerably, particularly with the introduction of PRS, which are often combined with clinical information to enhance prediction. Nonetheless, most existing studies have focused on single endpoints, such as specific cardiac events. Given that CVD arises from a complex interplay of genetic predisposition and environmental influences, a broader, more integrated evaluation of prediction performance is needed. Using data from the FHS, we examined how genetic and clinical factors jointly contribute to cardiovascular risk. We compared multiple prediction models across different scenarios for six cardiovascular-related diseases and evaluated their performance using various metrics. We also assessed the importance of individual genetic and clinical variables and found that the relative contributions varied by disease. Overall, our findings indicate that the relative contributions of PRS and EHR variables differ substantially by disease, and that the method of integrating these data sources can influence both predictive performance and interpretation.

Across many CVD-related outcomes, the EHR-only scenario achieved high discrimination, leaving limited room for improvement by adding PRS. For dementia, the PRS-only scenario yielded a strong AUC of 0.892; however, the EHR-only scenario performed better (AUC = 0.910), and combining PRS with EHR did not produce additional gains. For diabetes, the EHR-only scenario reached an AUC of 0.967, whereas the PRS-only scenario achieved an AUC of 0.715. The incremental benefit of adding PRS was modest, likely reflecting the presence of highly informative clinical variables such as BG levels in the EHR. In contrast, PRS played a more prominent role in CHF and AF, where adding PRS to EHR data improved predictive performance, suggesting that integrating genetic information with clinical data can provide additional value. Notably, PRS alone achieved an AUC greater than 0.8 for AF, CHF, and dementia, suggesting that genetic information may have substantial diagnostic utility for these conditions. This pattern was also reflected in the SHAP analysis, where genetic risk features ranked among the top predictors for these diseases. In AF, genetic factors accounted for a large proportion of feature importance, consistent with prior estimates of clinical heritability (derived from twin or family studies) and genetic heritability (h²_g) [71,72]. For example, a 2009 Danish twin study using a biometric model with environmental and additive genetic components estimated the heritability of AF at 62% [73]. Similarly, elevated genetic risk for dementia has been reported in large-scale population genetic studies [74–76]. By contrast, the PRS-only scenario demonstrated relatively modest performance for CHD. In this context, the EHR-only and PRS + EHR scenarios exhibited similar discrimination, suggesting that clinical and environmental factors captured in the EHR may be more influential for risk prediction. There were also cases in which no single scenario consistently dominated across metrics. 
For stroke, all three scenarios produced AUCs in the 0.7 range, and the best-performing scenario varied by metric. This finding suggests that, with the current data, no single information source clearly outperforms the others.
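As a reminder of what the discrimination and calibration-type metrics discussed here measure, the sketch below computes the AUC as the probability that a randomly chosen case is scored above a randomly chosen control, together with the Brier score. The labels and predicted probabilities are made-up illustrative values, not data from this study, and the function names are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    Ties between a case and a control count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(labels, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Made-up outcomes (1 = disease) and predicted probabilities.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]

auc = roc_auc(labels, scores)
brier = brier_score(labels, scores)
print(f"AUC = {auc:.3f}, Brier = {brier:.3f}")
```

Because the AUC depends only on the ranking of cases relative to controls while the Brier score also penalizes miscalibrated probabilities, two scenarios can be ordered differently by the two metrics, as observed for stroke.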

Collectively, these results highlight the need for disease-specific strategies when integrating genetic and clinical information. Integrating PRS with EHR data generally improves predictive performance compared with using either PRS or EHR alone. However, for diseases in which EHR-based prediction is already highly accurate, the added value of PRS may be limited. Conversely, for conditions such as AF and CHF, PRS can provide substantial predictive information, either alone or in combination with EHR. These insights can inform the design of future risk prediction models by helping to prioritize when PRS should be incorporated and when investment in richer clinical data may yield greater gains.

Some limitations of this study should be acknowledged. First, the FHS predominantly comprises individuals of European ancestry, which may limit the generalizability of our findings to populations with different genetic backgrounds and environmental exposures. External validation in independent, multi-ethnic cohorts will be essential to assess the robustness and transferability of the proposed models. Second, although we applied a nested cross-validation framework and performed all preprocessing steps within the training folds to minimize information leakage, residual sources of leakage cannot be completely excluded. In particular, explicit family or pedigree information was not available in the dataset, preventing family-aware data partitioning. As a result, related individuals may have been included across training and test folds, potentially leading to optimistic performance estimates due to shared genetic and environmental factors. Future studies incorporating family identifiers should evaluate model performance using family-based or cluster-level validation strategies. Third, although SHAP analysis was employed to enhance interpretability and identify features that strongly influenced model predictions, these importance scores should not be interpreted as evidence of causal relationships. SHAP values reflect model-dependent associations rather than underlying biological causation; therefore, conclusions regarding causality should be drawn with caution. In addition, formal uncertainty estimation for SHAP values (e.g., via bootstrapping) was not performed, which limits the robustness and quantitative interpretability of the reported feature importance. 
Additional limitations include the lack of benchmarking against established clinical risk equations, such as the Framingham risk scores, and the absence of formal calibration and clinical utility analyses, including calibration-in-the-large, net reclassification improvement, integrated discrimination improvement, and decision curve analysis. Furthermore, the feature set was restricted to variables available in the FHS Workthru datasets, and the analyses relied on a single cohort, which may limit the breadth of clinical and environmental factors captured. Addressing these limitations through external validation, expanded feature sets, and comprehensive clinical utility assessments will be important directions for future research.
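The family-aware partitioning suggested above as future work can be sketched as follows. This is a hypothetical illustration under the assumption that family identifiers exist (they were not available in the dataset used here): whole families are assigned to a single fold, so related individuals never appear on both sides of a train/test split. The function name `group_k_fold` and the identifiers are invented for the example.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits):
    """Return a fold index per sample such that no group spans two folds.

    Groups are placed largest first into the currently smallest fold,
    which keeps fold sizes roughly balanced.
    """
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    if n_splits > len(members):
        raise ValueError("more folds than distinct groups")

    fold_sizes = [0] * n_splits
    fold_of = [None] * len(groups)
    ordered = sorted(members.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))
    for _, indices in ordered:
        target = fold_sizes.index(min(fold_sizes))
        for i in indices:
            fold_of[i] = target
        fold_sizes[target] += len(indices)
    return fold_of


# Hypothetical family identifiers for ten participants.
family_ids = ["F1", "F1", "F2", "F3", "F3", "F3", "F4", "F5", "F5", "F6"]
folds = group_k_fold(family_ids, n_splits=3)

for k in range(3):
    test_idx = [i for i, f in enumerate(folds) if f == k]
    print(f"fold {k}: test samples {test_idx}")
```

Using each fold in turn as the test set then yields performance estimates that are not inflated by relatives shared between training and test data.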

Conclusion

This study systematically evaluated the predictive performance of PRS, EHR-based clinical variables, and their integration across six CVD-related outcomes using data from the FHS. By comparing multiple prediction models under PRS-only, EHR-only, and PRS + EHR scenarios, we demonstrated that the relative contribution of genetic and clinical information to disease risk prediction varies substantially across diseases.

Overall, EHR-based models achieved strong discrimination for several outcomes, particularly diabetes and dementia, leaving limited room for improvement through the addition of PRS. In these settings, highly informative clinical variables derived from routine care dominated predictive performance, and the incremental value of genetic risk information was modest. In contrast, integrating PRS with EHR data improved prediction accuracy for AF and CHF, highlighting the complementary role of genetic information when clinical predictors alone are insufficient. Notably, PRS-only models achieved substantial discrimination for certain highly heritable conditions, suggesting that genetic information may provide meaningful predictive utility even in the absence of detailed clinical data.

Feature importance analysis using SHAP further indicated that age and key clinical variables were the dominant predictors across most disease outcomes, while the contribution of genetic risk varied by phenotype. These patterns are consistent with established epidemiological and heritability evidence and emphasize that predictive importance does not necessarily imply causality. Notably, the reduced relative contribution of PRS in the presence of strong clinical predictors underscores that the value of genetic information depends on both disease etiology and the availability and quality of EHR data.

In summary, the findings indicate that a uniform strategy for integrating PRS into cardiovascular risk prediction is unlikely to be optimal. Instead, disease-specific modeling approaches are warranted, with PRS offering the greatest benefit for conditions with substantial genetic susceptibility or limited clinical predictors. These insights provide practical guidance for developing future risk prediction models and for determining when incorporating genetic information is most likely to yield clinically meaningful improvements.

Supporting information

S1 Table. Hyperparameter search spaces for each model.

The hyperparameters and value ranges evaluated during model development are listed for each algorithm.

https://doi.org/10.1371/journal.pone.0345914.s001

(DOCX)

S2 Table. Mean AUC, AUPRC, and Brier scores for PRSice2, LDpred2, and Lassosum across diseases.

The table reports the mean AUC, standard deviation, percentile-based confidence intervals, AUPRC, and Brier score for each model and disease.

https://doi.org/10.1371/journal.pone.0345914.s002

(DOCX)

S1 File. Full results from the nested cross-validation.

https://doi.org/10.1371/journal.pone.0345914.s003

(ZIP)

S2 File. Additional model performance metrics from PRSice2, LDpred2, and Lassosum.

https://doi.org/10.1371/journal.pone.0345914.s004

(ZIP)

References

1. Yang L, Wu H, Jin X, Zheng P, Hu S, Xu X, et al. Study of cardiovascular disease prediction model based on random forest in eastern China. Sci Rep. 2020;10(1):5245. pmid:32251324
2. Andersson C, Nayor M, Tsao CW, Levy D, Vasan RS. Framingham Heart Study: JACC focus seminar, 1/8. J Am Coll Cardiol. 2021;77(21):2680–92. pmid:34045026
3. Townsend N, Kazakiewicz D, Lucy Wright F, Timmis A, Huculeci R, Torbica A, et al. Epidemiology of cardiovascular disease in Europe. Nat Rev Cardiol. 2022;19(2):133–43. pmid:34497402
4. Gersh B. The epidemic of cardiovascular disease in the developing world: Global implications. Lippincott Williams & Wilkins. 2012.
5. Almansouri NE, Awe M, Rajavelu S, Jahnavi K, Shastry R, Hasan A, et al. Early diagnosis of cardiovascular diseases in the era of artificial intelligence: An in-depth review. Cureus. 2024;16(3):e55869. pmid:38595869
6. Qu C, Liao S, Zhang J, Cao H, Zhang H, Zhang N, et al. Burden of cardiovascular disease among elderly: Based on the Global Burden of Disease Study 2019. European Heart Journal - Quality of Care and Clinical Outcomes. 2024;10(2):143–53.
7. Padda I, Fabian D, Farid M, Mahtani A, Sethi Y, Ralhan T, et al. Social determinants of health and its impact on cardiovascular disease in underserved populations: A critical review. Curr Probl Cardiol. 2024;49(3):102373. pmid:38185436
8. Naser MA, Majeed AA, Alsabah M, Al-Shaikhli TR, Kaky KM. A Review of machine learning’s role in cardiovascular disease prediction: Recent advances and future challenges. Algorithms. 2024;17(2):78.
9. Naeem AB, Senapati B, Bhuva D, Zaidi A, Bhuva A, Sudman MdSI, et al. Heart disease detection using feature extraction and artificial neural networks: A sensor-based approach. IEEE Access. 2024;12:37349–62.
10. Jiang X, Alnoud MAH, Ali H, Ali I, Hussain T, Khan MU, et al. Heartfelt living: Deciphering the link between lifestyle choices and cardiovascular vitality. Curr Probl Cardiol. 2024;49(3):102397. pmid:38232921
11. Catak A, Dinarevic S, Prnjavorac B, Naser N, Masic I. Public Health Dimensions of Cardiovascular Diseases (CVD) Prevention and Control – Global Perspectives and Current Situation in the Federation of Bosnia and Herzegovina. Mater Sociomed. 2023;35(2):88.
12. Kumar A, Siddharth V, Singh SI, Narang R. Cost analysis of treating cardiovascular diseases in a super-specialty hospital. PLoS One. 2022;17(1):e0262190. pmid:34986193
13. Chhezom K, Gurung MS, Wangdi K. Comparison of Laboratory and Non-Laboratory-Based 2019 World Health Organization Cardiovascular Risk Charts in the Bhutanese Population. Asia Pac J Public Health. 2024;36(1):29–35. pmid:38116599
14. Boivin-Proulx L-A, Brouillette J, Dorais M, Perreault S. Association between cardiovascular diseases and dementia among various age groups: A population-based cohort study in older adults. Sci Rep. 2023;13(1):14881. pmid:37689801
15. Leon BM, Maddox TM. Diabetes and cardiovascular disease: Epidemiology, biological mechanisms, treatment recommendations and future research. World J Diabetes. 2015;6(13):1246–58. pmid:26468341
16. Shaw J, Thomas M, Magliano D. The dark heart of type 2 diabetes. 2018.
17. National Diabetes Data Group, National Institute of Diabetes and Digestive and Kidney Diseases. Diabetes in America. National Institutes of Health. 1995.
18. Einarson TR, Acs A, Ludwig C, Panton UH. Prevalence of cardiovascular disease in type 2 diabetes: A systematic literature review of scientific evidence from across the world in 2007-2017. Cardiovasc Diabetol. 2018;17(1):83. pmid:29884191
19. Petrie JR, Guzik TJ, Touyz RM. Diabetes, hypertension, and cardiovascular disease: Clinical Insights and Vascular Mechanisms. Can J Cardiol. 2018;34(5):575–84. pmid:29459239
20. Shaw J, Tanamas S. Diabetes: the silent pandemic and its impact on Australia. Melbourne: Baker IDI Heart and Diabetes Institute. 2012.
21. Alaa AM, Bolton T, Di Angelantonio E, Rudd JHF, van der Schaar M. Cardiovascular disease risk prediction using automated machine learning: A prospective study of 423,604 UK Biobank participants. PLoS One. 2019;14(5):e0213653. pmid:31091238
22. Siontis GCM, Tzoulaki I, Siontis KC, Ioannidis JPA. Comparisons of established risk prediction models for cardiovascular disease: Systematic review. BMJ. 2012;344:e3318. pmid:22628003
23. Dalal S, Goel P, Onyema EM, Alharbi A, Mahmoud A, Algarni MA, et al. Application of machine learning for cardiovascular disease risk prediction. Computational Intelligence and Neuroscience. 2023;2023(1).
24. Saranya G, Pravin A. A comprehensive study on disease risk predictions in machine learning. IJECE. 2020;10(4):4217.
25. Reddy KVV, Elamvazuthi I, Aziz AA, Paramasivam S, Chua HN, Pranavanand S. Heart disease risk prediction using machine learning classifiers with attribute evaluators. Applied Sciences. 2021;11(18):8352.
26. Ward A, Sarraju A, Chung S, Li J, Harrington R, Heidenreich P, et al. Machine learning and atherosclerotic cardiovascular disease risk prediction in a multi-ethnic population. NPJ Digit Med. 2020;3:125. pmid:33043149
27. Ordikhani M, Saniee Abadeh M, Prugger C, Hassannejad R, Mohammadifard N, Sarrafzadegan N. An evolutionary machine learning algorithm for cardiovascular disease risk prediction. PLoS One. 2022;17(7):e0271723. pmid:35901181
28. Trigka M, Dritsas E. Long-term coronary artery disease risk prediction with machine learning models. Sensors (Basel). 2023;23(3):1193. pmid:36772237
29. Gibson G. Rare and common variants: Twenty arguments. Nat Rev Genet. 2012;13(2):135–45. pmid:22251874
30. Turley P, Walters RK, Maghzian O, Okbay A, Lee JJ, Fontana MA, et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat Genet. 2018;50(2):229–37. pmid:29292387
31. Lin J, Tabassum R, Ripatti S, Pirinen M. MetaPhat: Detecting and decomposing multivariate associations from univariate genome-wide association statistics. Front Genet. 2020;11:431. pmid:32499813
32. Ma Y, Patil S, Zhou X, Mukherjee B, Fritsche LG. ExPRSweb: An online repository with polygenic risk scores for common health-related exposures. Am J Hum Genet. 2022;109(10):1742–60. pmid:36152628
33. Lin J, Mars N, Fu Y, Ripatti P, Kiiskinen T, et al. Integration of biomarker polygenic risk score improves prediction of coronary heart disease. Basic to Translational Science. 2023;8(12):1489–99.
34. Mars N, Koskela JT, Ripatti P, Kiiskinen TTJ, Havulinna AS, Lindbohm JV, et al. Polygenic and clinical risk scores and their impact on age at onset and prediction of cardiometabolic diseases and common cancers. Nat Med. 2020;26(4):549–57. pmid:32273609
35. Tamlander M, Mars N, Pirinen M, Widén E, Ripatti S. Integration of questionnaire-based risk factors improves polygenic risk scores for human coronary heart disease and type 2 diabetes. Commun Biol. 2022;5(1):158. pmid:35197564
36. Riveros-Mckay F, Weale ME, Moore R, Selzam S, Krapohl E, Sivley RM, et al. Integrated polygenic tool substantially enhances coronary artery disease prediction. Circ Genom Precis Med. 2021;14(2):e003304. pmid:33651632
37. Samani NJ, Beeston E, Greengrass C, Riveros-McKay F, Debiec R, Lawday D, et al. Polygenic risk score adds to a clinical risk score in the prediction of cardiovascular disease in a clinical setting. Eur Heart J. 2024;45(34):3152–60. pmid:38848106
38. Rout M, Tung GK, Singh JR, Mehra NK, Wander GS, Ralhan S, et al. Polygenic risk score assessment for coronary artery disease in Asian Indians. J Cardiovasc Transl Res. 2024;17(5):1086–96. pmid:38658478
39. Agbaedeng TA, Noubiap JJ, Mofo Mato EP, Chew DP, Figtree GA, Said MA, et al. Polygenic risk score and coronary artery disease: A meta-analysis of 979,286 participant data. Atherosclerosis. 2021;333:48–55. pmid:34425527
40. Wu Z, Lou Y, Jin W, Liu Y, Lu L, Chen Q, et al. Relationship of the p22phox (CYBA) gene polymorphism C242T with risk of coronary artery disease: a meta-analysis. PLoS One. 2013;8(9):e70885. pmid:24039708
41. Gola D, Erdmann J, Müller-Myhsok B, Schunkert H, König IR. Polygenic risk scores outperform machine learning methods in predicting coronary artery disease status. Genet Epidemiol. 2020;44(2):125–38. pmid:31922285
42. Schunkert H, König IR, Kathiresan S, Reilly MP, Assimes TL, Holm H, et al. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nat Genet. 2011;43(4):333–8. pmid:21378990
43. Shoaib M, Ye Q, IglayReger H, Tan MH, Boehnke M, Burant CF, et al. Evaluation of polygenic risk scores to differentiate between type 1 and type 2 diabetes. Genet Epidemiol. 2023;47(4):303–13. pmid:36821788
44. Hodgson S, Huang QQ, Sallah N, Genes & Health Research Team, Griffiths CJ, Newman WG, et al. Integrating polygenic risk scores in the prediction of type 2 diabetes risk and subtypes in British Pakistanis and Bangladeshis: A population-based cohort study. PLoS Med. 2022;19(5):e1003981. pmid:35587468
45. Pan C, Cheng B, Qin X, Cheng S, Liu L, Yang X, et al. Enhanced polygenic risk score incorporating gene-environment interaction suggests the association of major depressive disorder with cardiac and lung function. Brief Bioinform. 2024;25(2):bbae070. pmid:38436562
46. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
47. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: Unbiased boosting with categorical features. Advances in neural information processing systems. 2018;31.
48. Rufo DD, Debelee TG, Ibenthal A, Negera WG. Diagnosis of Diabetes Mellitus Using Gradient Boosting Machine (LightGBM). Diagnostics (Basel). 2021;11(9):1714. pmid:34574055
49. Choi SW, O’Reilly PF. PRSice-2: Polygenic Risk Score software for biobank-scale data. Gigascience. 2019;8(7):giz082. pmid:31307061
50. Privé F, Arbel J, Vilhjálmsson BJ. LDpred2: Better, faster, stronger. Bioinformatics. 2021;36(22–23):5424–31. pmid:33326037
51. Mak TSH, Porsch RM, Choi SW, Zhou X, Sham PC. Polygenic scores via penalized regression on summary statistics. Genet Epidemiol. 2017;41(6):469–80. pmid:28480976
52. Nielsen JB, Thorolfsdottir RB, Fritsche LG, Zhou W, Skov MW, Graham SE, et al. Biobank-driven genomic discovery yields new insight into atrial fibrillation biology. Nat Genet. 2018;50(9):1234–9. pmid:30061737
53. Jiang L, Zheng Z, Fang H, Yang J. A generalized linear mixed model association tool for biobank-scale data. Nat Genet. 2021;53(11):1616–21. pmid:34737426
54. UK Biobank Whole-Genome Sequencing Consortium. Whole-genome sequencing of 490,640 UK Biobank participants. Nature. 2025;645(8081):692–701. pmid:40770095
55. Verma A, Huffman JE, Rodriguez A, Conery M, Liu M, Ho Y-L, et al. Diversity and scale: Genetic architecture of 2068 traits in the VA Million Veteran Program. Science. 2024;385(6706):eadj1182. pmid:39024449
56. Jansen IE, Savage JE, Watanabe K, Bryois J, Williams DM, Steinberg S, et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nat Genet. 2019;51(3):404–13. pmid:30617256
57. Schoeler T, Speed D, Porcu E, Pirastu N, Pingault J-B, Kutalik Z. Participation bias in the UK Biobank distorts genetic associations and downstream analyses. Nat Hum Behav. 2023;7(7):1216–27. pmid:37106081
58. Zhao H, Sun Z, Wang J, Huang H, Kocher J-P, Wang L. CrossMap: A versatile tool for coordinate conversion between genome assemblies. Bioinformatics. 2014;30(7):1006–7. pmid:24351709
59. Willett WC, Stampfer MJ, Colditz GA, Rosner BA, Hennekens CH, Speizer FE. Moderate alcohol consumption and the risk of breast cancer. N Engl J Med. 1987;316(19):1174–80. pmid:3574368
60. Imran B, Hambali H, Subki A, Zaeniah Z, Yani A, Alfian MR. Data mining using random forest, naïve bayes, and adaboost models for prediction and classification of benign and malignant breast cancer. pilar. 2022;18(1):37–46.
61. Chang V, Ganatra MA, Hall K, Golightly L, Xu QA. An assessment of machine learning models and algorithms for early prediction and diagnosis of diabetes using health indicators. Healthcare Analytics. 2022;2:100118.
62. Andersson C, Johnson AD, Benjamin EJ, Levy D, Vasan RS. 70-year legacy of the Framingham Heart Study. Nat Rev Cardiol. 2019;16(11):687–98. pmid:31065045
63. Dhingra R, Vasan RS. Age as a risk factor. Med Clin North Am. 2012;96(1):87–91. pmid:22391253
64. Urbut SM, Cho SMJ, Paruchuri K, Truong B, Haidermota S, Peloso GM, et al. Dynamic importance of genomic and clinical risk for coronary artery disease over the life course. Circ Genom Precis Med. 2025;18(1):e004681. pmid:39851049
65. Wood Alexander M, Paterson J, Arvanitakis Z, Black SE, Casaletto KB, Christakis MK, et al. Cardiovascular contributions to dementia: Examining sex differences and female-specific factors. Alzheimers Dement. 2025;21(8):e70610. pmid:40851413
66. Franczyk B, Rysz J, Ławiński J, Rysz-Górzyńska M, Gluba-Brzózka A. Is a High HDL-Cholesterol Level Always Beneficial? Biomedicines. 2021;9(9):1083. pmid:34572269
67. Woollard KV. High-density lipoprotein and other risk factors for coronary artery disease. Br Med J (Clin Res Ed). 1981;282(6278):1799. pmid:6786629
68. Nagao M, Nakajima H, Toh R, Hirata K-I, Ishida T. Cardioprotective effects of high-density lipoprotein beyond its anti-atherogenic action. J Atheroscler Thromb. 2018;25(10):985–93. pmid:30146614
69. Magnani JW, Moser CB, Murabito JM, Sullivan LM, Wang N, Ellinor PT, et al. Association of sex hormones, aging, and atrial fibrillation in men: The Framingham Heart Study. Circ Arrhythm Electrophysiol. 2014;7(2):307–12. pmid:24610804
70. Elliott J, Bodinier B, Bond TA, Chadeau-Hyam M, Evangelou E, Moons KGM, et al. Predictive accuracy of a polygenic risk score-enhanced prediction model vs a clinical risk score for coronary artery disease. JAMA. 2020;323(7):636–45. pmid:32068818
71. Weng L-C, Choi SH, Klarin D, Smith JG, Loh P-R, Chaffin M, et al. Heritability of Atrial Fibrillation. Circ Cardiovasc Genet. 2017;10(6):e001838. pmid:29237688
72. Segan L, Ho Ho WW, Crowley R, William J, Cho K, Prabh S, et al. Combining polygenic and clinical risk scores in atrial fibrillation risk prediction: Implications for population screening. Heart Rhythm. 2025;22(8):1906–14. pmid:40288473
73. Christophersen IE, Ravn LS, Budtz-Joergensen E, Skytthe A, Haunsoe S, Svendsen JH, et al. Familial aggregation of atrial fibrillation: a study in Danish twins. Circ Arrhythm Electrophysiol. 2009;2(4):378–83. pmid:19808493
74. van Gennip ACE, van Sloten TT, Fayosse A, Sabia S, Singh-Manoux A. Age at cardiovascular disease onset, dementia risk, and the role of lifestyle factors. Alzheimers Dement. 2024;20(3):1693–702. pmid:38085549
75. Bellenguez C, Küçükali F, Jansen IE, Kleineidam L, Moreno-Grau S, Amin N, et al. New insights into the genetic etiology of Alzheimer’s disease and related dementias. Nat Genet. 2022;54(4):412–36. pmid:35379992
76. Shade LMP, Katsumata Y, Abner EL, Aung KZ, Claas SA, Qiao Q, et al. GWAS of multiple neuropathology endophenotypes identifies new risk loci and provides insights into the genetic risk of dementia. Nat Genet. 2024;56(11):2407–21. pmid:39379761

[ref1] 1. Yang L, Wu H, Jin X, Zheng P, Hu S, Xu X, et al. Study of cardiovascular disease prediction model based on random forest in eastern China. Sci Rep. 2020;10(1):5245. pmid:32251324
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Andersson C, Nayor M, Tsao CW, Levy D, Vasan RS. Framingham Heart Study: JACC focus seminar, 1/8. J Am Coll Cardiol. 2021;77(21):2680–92. pmid:34045026
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Townsend N, Kazakiewicz D, Lucy Wright F, Timmis A, Huculeci R, Torbica A, et al. Epidemiology of cardiovascular disease in Europe. Nat Rev Cardiol. 2022;19(2):133–43. pmid:34497402
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. Gersh B. The epidemic of cardiovascular disease in the developing world: Global implications. Lippincott Williams & Wilkins. 2012.

[ref5] 5. Almansouri NE, Awe M, Rajavelu S, Jahnavi K, Shastry R, Hasan A, et al. Early diagnosis of cardiovascular diseases in the era of artificial intelligence: An in-depth review. Cureus. 2024;16(3):e55869. pmid:38595869
View Article
PubMed/NCBI
Google Scholar

[15] View Article

[16] PubMed/NCBI

[17] Google Scholar

[ref6] 6. Qu 6, Liao S, Zhang J, Cao H, Zhang H, Zhang N. Burden of cardiovascular disease among elderly: Based on the Global Burden of Disease Study 2019. European Heart Journal - Quality of Care and Clinical Outcomes. 2024;10(2):143–53.
View Article
Google Scholar

[19] View Article

[20] Google Scholar

[ref7] 7. Padda I, Fabian D, Farid M, Mahtani A, Sethi Y, Ralhan T, et al. Social determinants of health and its impact on cardiovascular disease in underserved populations: A critical review. Curr Probl Cardiol. 2024;49(3):102373. pmid:38185436
View Article
PubMed/NCBI
Google Scholar

[22] View Article

[23] PubMed/NCBI

[24] Google Scholar

[ref8] 8. Naser MA, Majeed AA, Alsabah M, Al-Shaikhli TR, Kaky KM. A Review of machine learning’s role in cardiovascular disease prediction: Recent advances and future challenges. Algorithms. 2024;17(2):78.
View Article
Google Scholar

[26] View Article

[27] Google Scholar

[ref9] 9. Naeem AB, Senapati B, Bhuva D, Zaidi A, Bhuva A, Sudman MdSI, et al. Heart disease detection using feature extraction and artificial neural networks: A sensor-based approach. IEEE Access. 2024;12:37349–62.
View Article
Google Scholar

[29] View Article

[30] Google Scholar

[ref10] 10. Jiang X, Alnoud MAH, Ali H, Ali I, Hussain T, Khan MU, et al. Heartfelt living: Deciphering the link between lifestyle choices and cardiovascular vitality. Curr Probl Cardiol. 2024;49(3):102397. pmid:38232921
View Article
PubMed/NCBI
Google Scholar

[32] View Article

[33] PubMed/NCBI

[34] Google Scholar

[ref11] 11. Catak A, Dinarevic S, Prnjavorac B, Naser N, Masic I. Public Health Dimensions of Cardiovascular Diseases (CVD) Prevention and Control – Global Perspectives and Current Situation in the Federation of Bosnia and Herzegovina. Mater Sociomed. 2023;35(2):88.
View Article
Google Scholar

[36] View Article

[37] Google Scholar

[ref12] 12. Kumar A, Siddharth V, Singh SI, Narang R. Cost analysis of treating cardiovascular diseases in a super-specialty hospital. PLoS One. 2022;17(1):e0262190. pmid:34986193

[ref13] 13. Chhezom K, Gurung MS, Wangdi K. Comparison of Laboratory and Non-Laboratory-Based 2019 World Health Organization Cardiovascular Risk Charts in the Bhutanese Population. Asia Pac J Public Health. 2024;36(1):29–35. pmid:38116599

[ref14] 14. Boivin-Proulx L-A, Brouillette J, Dorais M, Perreault S. Association between cardiovascular diseases and dementia among various age groups: A population-based cohort study in older adults. Sci Rep. 2023;13(1):14881. pmid:37689801

[ref15] 15. Leon BM, Maddox TM. Diabetes and cardiovascular disease: Epidemiology, biological mechanisms, treatment recommendations and future research. World J Diabetes. 2015;6(13):1246–58. pmid:26468341

[ref16] 16. Shaw J, Thomas M, Magliano D. The dark heart of type 2 diabetes. 2018.

[ref17] 17. National Diabetes Data Group, National Institute of Diabetes and Digestive and Kidney Diseases. Diabetes in America. National Institutes of Health. 1995.

[ref18] 18. Einarson TR, Acs A, Ludwig C, Panton UH. Prevalence of cardiovascular disease in type 2 diabetes: A systematic literature review of scientific evidence from across the world in 2007-2017. Cardiovasc Diabetol. 2018;17(1):83. pmid:29884191

[ref19] 19. Petrie JR, Guzik TJ, Touyz RM. Diabetes, hypertension, and cardiovascular disease: Clinical Insights and Vascular Mechanisms. Can J Cardiol. 2018;34(5):575–84. pmid:29459239

[ref20] 20. Shaw J, Tanamas S. Diabetes: the silent pandemic and its impact on Australia. Melbourne: Baker IDI Heart and Diabetes Institute. 2012.

[ref21] 21. Alaa AM, Bolton T, Di Angelantonio E, Rudd JHF, van der Schaar M. Cardiovascular disease risk prediction using automated machine learning: A prospective study of 423,604 UK Biobank participants. PLoS One. 2019;14(5):e0213653. pmid:31091238

[ref22] 22. Siontis GCM, Tzoulaki I, Siontis KC, Ioannidis JPA. Comparisons of established risk prediction models for cardiovascular disease: Systematic review. BMJ. 2012;344:e3318. pmid:22628003

[ref23] 23. Dalal S, Goel P, Onyema EM, Alharbi A, Mahmoud A, Algarni MA, et al. Application of machine learning for cardiovascular disease risk prediction. Computational Intelligence and Neuroscience. 2023;2023(1).

[ref24] 24. Saranya G, Pravin A. A comprehensive study on disease risk predictions in machine learning. IJECE. 2020;10(4):4217.

[ref25] 25. Reddy KVV, Elamvazuthi I, Aziz AA, Paramasivam S, Chua HN, Pranavanand S. Heart disease risk prediction using machine learning classifiers with attribute evaluators. Applied Sciences. 2021;11(18):8352.

[ref26] 26. Ward A, Sarraju A, Chung S, Li J, Harrington R, Heidenreich P, et al. Machine learning and atherosclerotic cardiovascular disease risk prediction in a multi-ethnic population. NPJ Digit Med. 2020;3:125. pmid:33043149

[ref27] 27. Ordikhani M, Saniee Abadeh M, Prugger C, Hassannejad R, Mohammadifard N, Sarrafzadegan N. An evolutionary machine learning algorithm for cardiovascular disease risk prediction. PLoS One. 2022;17(7):e0271723. pmid:35901181

[ref28] 28. Trigka M, Dritsas E. Long-term coronary artery disease risk prediction with machine learning models. Sensors (Basel). 2023;23(3):1193. pmid:36772237

[ref29] 29. Gibson G. Rare and common variants: Twenty arguments. Nat Rev Genet. 2012;13(2):135–45. pmid:22251874

[ref30] 30. Turley P, Walters RK, Maghzian O, Okbay A, Lee JJ, Fontana MA, et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat Genet. 2018;50(2):229–37. pmid:29292387

[ref31] 31. Lin J, Tabassum R, Ripatti S, Pirinen M. MetaPhat: Detecting and decomposing multivariate associations from univariate genome-wide association statistics. Front Genet. 2020;11:431. pmid:32499813

[ref32] 32. Ma Y, Patil S, Zhou X, Mukherjee B, Fritsche LG. ExPRSweb: An online repository with polygenic risk scores for common health-related exposures. Am J Hum Genet. 2022;109(10):1742–60. pmid:36152628

[ref33] 33. Lin J, Mars N, Fu Y, Ripatti P, Kiiskinen T, et al. Integration of biomarker polygenic risk score improves prediction of coronary heart disease. Basic to Translational Science. 2023;8(12):1489–99.

[ref34] 34. Mars N, Koskela JT, Ripatti P, Kiiskinen TTJ, Havulinna AS, Lindbohm JV, et al. Polygenic and clinical risk scores and their impact on age at onset and prediction of cardiometabolic diseases and common cancers. Nat Med. 2020;26(4):549–57. pmid:32273609

[ref35] 35. Tamlander M, Mars N, Pirinen M, Widén E, Ripatti S. Integration of questionnaire-based risk factors improves polygenic risk scores for human coronary heart disease and type 2 diabetes. Commun Biol. 2022;5(1):158. pmid:35197564

[ref36] 36. Riveros-Mckay F, Weale ME, Moore R, Selzam S, Krapohl E, Sivley RM, et al. Integrated polygenic tool substantially enhances coronary artery disease prediction. Circ Genom Precis Med. 2021;14(2):e003304. pmid:33651632

[ref37] 37. Samani NJ, Beeston E, Greengrass C, Riveros-McKay F, Debiec R, Lawday D, et al. Polygenic risk score adds to a clinical risk score in the prediction of cardiovascular disease in a clinical setting. Eur Heart J. 2024;45(34):3152–60. pmid:38848106

[ref38] 38. Rout M, Tung GK, Singh JR, Mehra NK, Wander GS, Ralhan S, et al. Polygenic risk score assessment for coronary artery disease in Asian Indians. J Cardiovasc Transl Res. 2024;17(5):1086–96. pmid:38658478

[ref39] 39. Agbaedeng TA, Noubiap JJ, Mofo Mato EP, Chew DP, Figtree GA, Said MA, et al. Polygenic risk score and coronary artery disease: A meta-analysis of 979,286 participant data. Atherosclerosis. 2021;333:48–55. pmid:34425527

[ref40] 40. Wu Z, Lou Y, Jin W, Liu Y, Lu L, Chen Q, et al. Relationship of the p22phox (CYBA) gene polymorphism C242T with risk of coronary artery disease: a meta-analysis. PLoS One. 2013;8(9):e70885. pmid:24039708

[ref41] 41. Gola D, Erdmann J, Müller-Myhsok B, Schunkert H, König IR. Polygenic risk scores outperform machine learning methods in predicting coronary artery disease status. Genet Epidemiol. 2020;44(2):125–38. pmid:31922285

[ref42] 42. Schunkert H, König IR, Kathiresan S, Reilly MP, Assimes TL, Holm H, et al. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nat Genet. 2011;43(4):333–8. pmid:21378990

[ref43] 43. Shoaib M, Ye Q, IglayReger H, Tan MH, Boehnke M, Burant CF, et al. Evaluation of polygenic risk scores to differentiate between type 1 and type 2 diabetes. Genet Epidemiol. 2023;47(4):303–13. pmid:36821788

[ref44] 44. Hodgson S, Huang QQ, Sallah N, Genes & Health Research Team, Griffiths CJ, Newman WG, et al. Integrating polygenic risk scores in the prediction of type 2 diabetes risk and subtypes in British Pakistanis and Bangladeshis: A population-based cohort study. PLoS Med. 2022;19(5):e1003981. pmid:35587468

[ref45] 45. Pan C, Cheng B, Qin X, Cheng S, Liu L, Yang X, et al. Enhanced polygenic risk score incorporating gene-environment interaction suggests the association of major depressive disorder with cardiac and lung function. Brief Bioinform. 2024;25(2):bbae070. pmid:38436562

[ref46] 46. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

[ref47] 47. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: Unbiased boosting with categorical features. Advances in neural information processing systems. 2018;31.

[ref48] 48. Rufo DD, Debelee TG, Ibenthal A, Negera WG. Diagnosis of Diabetes Mellitus Using Gradient Boosting Machine (LightGBM). Diagnostics (Basel). 2021;11(9):1714. pmid:34574055

[ref49] 49. Choi SW, O’Reilly PF. PRSice-2: Polygenic Risk Score software for biobank-scale data. Gigascience. 2019;8(7):giz082. pmid:31307061

[ref50] 50. Privé F, Arbel J, Vilhjálmsson BJ. LDpred2: Better, faster, stronger. Bioinformatics. 2021;36(22–23):5424–31. pmid:33326037

[ref51] 51. Mak TSH, Porsch RM, Choi SW, Zhou X, Sham PC. Polygenic scores via penalized regression on summary statistics. Genet Epidemiol. 2017;41(6):469–80. pmid:28480976

[ref52] 52. Nielsen JB, Thorolfsdottir RB, Fritsche LG, Zhou W, Skov MW, Graham SE, et al. Biobank-driven genomic discovery yields new insight into atrial fibrillation biology. Nat Genet. 2018;50(9):1234–9. pmid:30061737

[ref53] 53. Jiang L, Zheng Z, Fang H, Yang J. A generalized linear mixed model association tool for biobank-scale data. Nat Genet. 2021;53(11):1616–21. pmid:34737426

[ref54] 54. UK Biobank Whole-Genome Sequencing Consortium. Whole-genome sequencing of 490,640 UK Biobank participants. Nature. 2025;645(8081):692–701. pmid:40770095

[ref55] 55. Verma A, Huffman JE, Rodriguez A, Conery M, Liu M, Ho Y-L, et al. Diversity and scale: Genetic architecture of 2068 traits in the VA Million Veteran Program. Science. 2024;385(6706):eadj1182. pmid:39024449

[ref56] 56. Jansen IE, Savage JE, Watanabe K, Bryois J, Williams DM, Steinberg S, et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nat Genet. 2019;51(3):404–13. pmid:30617256

[ref57] 57. Schoeler T, Speed D, Porcu E, Pirastu N, Pingault J-B, Kutalik Z. Participation bias in the UK Biobank distorts genetic associations and downstream analyses. Nat Hum Behav. 2023;7(7):1216–27. pmid:37106081

[ref58] 58. Zhao H, Sun Z, Wang J, Huang H, Kocher J-P, Wang L. CrossMap: A versatile tool for coordinate conversion between genome assemblies. Bioinformatics. 2014;30(7):1006–7. pmid:24351709

[ref59] 59. Willett WC, Stampfer MJ, Colditz GA, Rosner BA, Hennekens CH, Speizer FE. Moderate alcohol consumption and the risk of breast cancer. N Engl J Med. 1987;316(19):1174–80. pmid:3574368

[ref60] 60. Imran B, Hambali H, Subki A, Zaeniah Z, Yani A, Alfian MR. Data mining using random forest, naïve bayes, and adaboost models for prediction and classification of benign and malignant breast cancer. pilar. 2022;18(1):37–46.

[ref61] 61. Chang V, Ganatra MA, Hall K, Golightly L, Xu QA. An assessment of machine learning models and algorithms for early prediction and diagnosis of diabetes using health indicators. Healthcare Analytics. 2022;2:100118.

[ref62] 62. Andersson C, Johnson AD, Benjamin EJ, Levy D, Vasan RS. 70-year legacy of the Framingham Heart Study. Nat Rev Cardiol. 2019;16(11):687–98. pmid:31065045

[ref63] 63. Dhingra R, Vasan RS. Age as a risk factor. Med Clin North Am. 2012;96(1):87–91. pmid:22391253

[ref64] 64. Urbut SM, Cho SMJ, Paruchuri K, Truong B, Haidermota S, Peloso GM, et al. Dynamic importance of genomic and clinical risk for coronary artery disease over the life course. Circ Genom Precis Med. 2025;18(1):e004681. pmid:39851049

[ref65] 65. Wood Alexander M, Paterson J, Arvanitakis Z, Black SE, Casaletto KB, Christakis MK, et al. Cardiovascular contributions to dementia: Examining sex differences and female-specific factors. Alzheimers Dement. 2025;21(8):e70610. pmid:40851413

[ref66] 66. Franczyk B, Rysz J, Ławiński J, Rysz-Górzyńska M, Gluba-Brzózka A. Is a High HDL-Cholesterol Level Always Beneficial? Biomedicines. 2021;9(9):1083. pmid:34572269

[ref67] 67. Woollard KV. High-density lipoprotein and other risk factors for coronary artery disease. Br Med J (Clin Res Ed). 1981;282(6278):1799. pmid:6786629

[ref68] 68. Nagao M, Nakajima H, Toh R, Hirata K-I, Ishida T. Cardioprotective effects of high-density lipoprotein beyond its anti-atherogenic action. J Atheroscler Thromb. 2018;25(10):985–93. pmid:30146614

[ref69] 69. Magnani JW, Moser CB, Murabito JM, Sullivan LM, Wang N, Ellinor PT, et al. Association of sex hormones, aging, and atrial fibrillation in men: The Framingham Heart Study. Circ Arrhythm Electrophysiol. 2014;7(2):307–12. pmid:24610804

[ref70] 70. Elliott J, Bodinier B, Bond TA, Chadeau-Hyam M, Evangelou E, Moons KGM, et al. Predictive accuracy of a polygenic risk score-enhanced prediction model vs a clinical risk score for coronary artery disease. JAMA. 2020;323(7):636–45. pmid:32068818

[ref71] 71. Weng L-C, Choi SH, Klarin D, Smith JG, Loh P-R, Chaffin M, et al. Heritability of Atrial Fibrillation. Circ Cardiovasc Genet. 2017;10(6):e001838. pmid:29237688

[ref72] 72. Segan L, Ho Ho WW, Crowley R, William J, Cho K, Prabh S, et al. Combining polygenic and clinical risk scores in atrial fibrillation risk prediction: Implications for population screening. Heart Rhythm. 2025;22(8):1906–14. pmid:40288473

[ref73] 73. Christophersen IE, Ravn LS, Budtz-Joergensen E, Skytthe A, Haunsoe S, Svendsen JH, et al. Familial aggregation of atrial fibrillation: a study in Danish twins. Circ Arrhythm Electrophysiol. 2009;2(4):378–83. pmid:19808493

[ref74] 74. van Gennip ACE, van Sloten TT, Fayosse A, Sabia S, Singh-Manoux A. Age at cardiovascular disease onset, dementia risk, and the role of lifestyle factors. Alzheimers Dement. 2024;20(3):1693–702. pmid:38085549

[ref75] 75. Bellenguez C, Küçükali F, Jansen IE, Kleineidam L, Moreno-Grau S, Amin N, et al. New insights into the genetic etiology of Alzheimer’s disease and related dementias. Nat Genet. 2022;54(4):412–36. pmid:35379992

[ref76] 76. Shade LMP, Katsumata Y, Abner EL, Aung KZ, Claas SA, Qiao Q, et al. GWAS of multiple neuropathology endophenotypes identifies new risk loci and provides insights into the genetic risk of dementia. Nat Genet. 2024;56(11):2407–21. pmid:39379761

