Predicting cardiovascular disease risk using photoplethysmography and deep learning

Cardiovascular diseases (CVDs) are responsible for a large proportion of premature deaths in low- and middle-income countries. Early CVD detection and intervention are critical in these populations, yet many existing CVD risk scores require a physical examination or laboratory measurements, which can be challenging in such health systems due to limited accessibility. We investigated the potential of photoplethysmography (PPG), a sensing technology available on most smartphones that could enable large-scale screening at low cost, for CVD risk prediction. We developed a deep learning PPG-based CVD risk score (DLS) to predict the probability of major adverse cardiovascular events (MACE: non-fatal myocardial infarction, stroke, and cardiovascular death) within ten years, given only age, sex, smoking status and PPG as predictors. We compared the DLS with the office-based refit-WHO score, which adopts the shared predictors of the WHO and Globorisk scores (age, sex, smoking status, height, weight and systolic blood pressure) but was refitted on the UK Biobank (UKB) cohort. All models were trained on a development dataset (141,509 participants) and evaluated on a geographically distinct test dataset (54,856 participants), both from UKB. The DLS's C-statistic (71.1%, 95% CI 69.9–72.4) was non-inferior to that of the office-based refit-WHO score (70.9%, 95% CI 69.7–72.2; non-inferiority margin of 2.5%, p<0.01) in the test dataset. The calibration of the DLS was satisfactory, with a 1.8% mean absolute calibration error. Adding DLS features to the office-based score increased the C-statistic by 1.0% (95% CI 0.6–1.4). The DLS thus predicts ten-year MACE risk comparably to the office-based refit-WHO score. Interpretability analyses suggest that the DLS-extracted features are related to PPG waveform morphology and are independent of heart rate.
Our study provides a proof of concept and suggests the potential of PPG-based strategies for community-based primary prevention in resource-limited regions.


Introduction
Cardiovascular diseases (CVDs) are responsible for one third of deaths globally 1 , with approximately three quarters occurring in low- and middle-income countries (LMICs), where there is a paucity of resources for early disease detection 2,3 . Because CVD risk factors such as hypertension, diabetes, or hyperlipidemia are typically symptomless before advanced disease, there is a great need for screening programs to identify those at high risk of CVD events. Interventions such as lifestyle counseling, with or without prescription medications, have been shown to be an effective strategy for CVD prevention among these individuals 4 .
Multiple risk scores, such as WHO/ISH risk chart and Globorisk scores, have been developed to triage CVD risk based on demographics, past medical history, vital signs, and laboratory data [4][5][6][7] . However, the dependency of these risk scores on medical and laboratory equipment (e.g., sphygmomanometers) 8,9 limits their reach. Specifically, low-resource healthcare systems have relied largely on opportunistic screening 10 , such as via community healthcare workers (CHWs) 11 , to close access gaps. We reasoned that developing low-cost, easy-to-use, lightweight, digital point-of-care tools using sensors already available in smartphones [12][13][14] , could potentially further the reach and capability of CHW-based programs and enable large-scale screening at low cost 15 .
Among sensing signals for the circulatory system, photoplethysmography (PPG) is a non-invasive, fast, simple, and low-cost technology, and can be captured with sensors available on increasingly ubiquitous devices such as smartphones and pulse oximeters 16 . PPG measures the change in blood volume in an area of tissue across cardiac cycles and is primarily used for heart rate monitoring in healthcare settings 17,18 . Research has also investigated the utility of PPG in understanding short-term fluctuations in vascular compliance, by estimating continuous blood pressure (BP) in an ICU setting 17,19,20 , though the accuracy of such approaches is known to be insufficient even when per-user calibration is available 17 . Beyond short-term vascular changes, research has also been conducted into understanding the slow manifestation of vascular aging and arterial stiffness from PPG waveforms 21-23 , which are useful for longer-term CVD risk assessment. Since PPG is potentially more accessible and requires less training to measure, such technologies could provide accurate real-time insights. The ubiquity of smartphones has also prompted research involving PPG as measured from smartphone cameras, by placing a finger on the camera 16 . Taken together, CVD-risk estimation based on PPG signals could serve as a highly accessible screening tool in low-resource health systems (Figure 1a).
In this paper, we investigate the feasibility of leveraging PPG for CVD risk prediction using data from the UK Biobank (UKB). Specifically, we predict the ten-year risk of developing a major adverse cardiovascular event (MACE) using deep learning-based PPG embeddings and heart rate (measured by PPG), along with other demographics, including age, sex and smoking status, but without any inputs from physical examination or laboratory data (Supplementary Figure 1). We find that our deep learning PPG-based CVD risk prediction score (DLS) is well-calibrated and non-inferior to the existing comparative office-based CVD risk score using predictors from WHO/ISH and Globorisk, that requires blood pressure, weight and height measurement, or laboratory data.

Results
We showed that the DLS was non-inferior to the office-based refit-WHO score. We evaluated the ten-year MACE risk prediction performance of all methods on our UKB test subset, which was held out during training. The DLS yielded a C-statistic of 71.1% (95% CI 69.9–72.4). Compared with the office-based refit-WHO score, the DLS was non-inferior (p<0.01). Based on the C-statistic, there was an incremental improvement when the metadata model (69.1%) was augmented with manually engineered (not deep learning derived) PPG morphology features (70.0%). The DLS was superior to this metadata + PPG morphology features model (p<0.01), indicating value in deep learning-based feature extraction. The lab-based model (which requires total cholesterol and glucose information) was superior to the office-based refit-WHO score (71.6% versus 70.9%, p<0.01). Applying the Globorisk scores from 7 , recalibrated on the UKB cohort for baseline hazard and mean risk factors but without re-estimating the coefficients, the office-based Globorisk yielded a C-statistic of 70.0% (68.8, 71.2), and the lab-based Globorisk a C-statistic of 69.8% (68.5, 71.1). More details are shown in Table 1.
For a fair comparison, we then selected the risk thresholds that matched the specificity or sensitivity of SBP-140 (see Statistical analysis in Methods) (specificity of 63.7%, sensitivity of 55.2%). We found that at matched specificity, the sensitivity of the DLS (67.9%) was non-inferior to that of the office-based refit-WHO score (67.7%) (p=0.012), with a comparable NRI, while the metadata and metadata + PPG morphology models were not (p=0.984 and p=0.305, respectively). At matched sensitivity, the DLS's specificity (74.0%) was also non-inferior to the baseline (73.1%) (showing superiority with p<0.01), with a comparable NRI. The laboratory-based refit-WHO and the model using metadata and PPG morphology-based features were also non-inferior to the office-based refit-WHO score, despite requiring additional inputs from laboratory measurements or engineered PPG features, respectively. The metadata-only model performed more poorly than the office-based refit-WHO score across the different metrics. We also conducted Kaplan-Meier analyses on risk groups defined using the above approach (Figure 1b). Both thresholds showed significant (p<0.01, log-rank tests) differences between the groups. Results for the 10% risk threshold are in Supplementary Table 1.

In addition to the default set of inputs to the DLS, we also evaluated models with BMI, and with both BMI and SBP (both of which are predictors in the office-based refit-WHO score), included as additional inputs; we refer to these as DLS+ and DLS++, respectively. We found that adding BMI (DLS+) improved the DLS in terms of both discrimination and net reclassification. Adding SBP as well (DLS++) further improved the model, demonstrating superiority across different metrics (Supplementary Table 2a, 2b).
We also showed that for DLS and its variants (DLS+ and DLS++), the cfNRI and NRI with different risk thresholds (Supplementary Table 2a) were also on par with the office-based refit-WHO score (Supplementary Table 2b). These findings indicate that combining the existing non-laboratory risk factors from the refit-WHO score with the DLS features yields a more accurate CV risk estimation. We further developed a model (Full model) that includes more risk factors used in the CVD risk scores commonly used in high-income countries (QRISK and/or ASCVD), as well as a model that incorporates genetic risk, and listed the findings in the Supplementary Results.
We also examined the association between each model and MACE via the coefficients and hazard ratios (HRs) (Supplementary Table 3). We found that in the office-based refit-WHO score, smoking, older age, higher BMI, and higher SBP were associated with the ten-year MACE risk. We also found that some DLS features were also associated with the ten-year MACE risk (p<0.05 for four deep learning PPG features in DLS and DLS+, and for two PPG features for DLS++).
Meanwhile, the predicted and observed risks of ten-year MACE were similar across the different models (Figure 1c), indicating that the DLS has calibration performance similar to the other models. The calibration slope of the DLS was similar to that of the office-based refit-WHO score (0.981 versus 0.979) (Table 2).
We also found that DLS++ has comparable calibration performance (Supplementary Table 2a, 2b). All models except DLS+ estimated the observed ten-year MACE risk within a 5% mean absolute calibration error (i.e., slopes between 0.95 and 1.05).
Finally, the DLS is on par with the office-based refit-WHO score in several subgroups. Supplementary Table 4 shows that the DLS demonstrated non-inferiority in several subgroups and superiority in the smoking, hypertensive and male subgroups. Both the office-based refit-WHO score and the DLS had similar performance trends: both models have higher sensitivity and lower calibration error but lower specificity in the smoking, older, male, and hypertensive subgroups. The models were well-calibrated for most subgroups, but systematically overestimated absolute risk by about 4.0% in the elevated-A1c and about 1.0% in the hypertensive subgroups. This finding indicates that the developed risk models tend to be better calibrated and to better predict ten-year MACE risk in populations with more known CVD risk factors, such as the older, male, smoking, higher-blood-glucose and hypertensive subgroups (Supplementary Table 4). Across the age, sex, smoking, and comorbidity (diabetes and hypertension) subgroups, the calibration of all risk scores was similar in the smoking, age greater than 55, male, and non-elevated-A1c populations, with prediction errors within 10% (i.e., calibration regression slopes between 0.9 and 1.1) (Supplementary Table 4).

Discussion
We developed a deep learning PPG-based CVD risk score, DLS, to predict ten-year MACE risk using age, sex, smoking status, heart rate and deep learning-derived PPG features. Without requiring any vital signs or laboratory measurement, DLS demonstrated non-inferior performance compared to the office-based refit-WHO score with coefficients re-estimated on the same cohort. Results were consistent between metrics (C-statistic, NRI, cfNRI, sensitivity, specificity, calibration slope), and in various subgroups. Improved cfNRI and NRI also indicate the capability of DLS to reclassify cases better than the office-based refit-WHO score. Additionally, if available, adding office-based features (BMI, SBP) on top of DLS further improved the model performance.
Our work focuses on understanding the role that PPG and deep learning can play in settings where access to healthcare and equipment is limited, such as community-based screening programs in LMICs.
Several CVD prediction scores that do not assume the availability of laboratory measurements exist for primary prevention, such as the WHO/ISH risk prediction chart 24 , office-based Framingham risk score (FRS) 25 , office-based Globorisk score 4 , non-laboratory INTERHEART risk score 26 , and Harvard NHANES risk score 27 . Some of these are also deployed in real-world clinical practice 4,28 , though these methods require either body measurements (BMI, waist-hip ratio), SBP, or both. Challenges remain in scaling up CVD screening in resource-limited areas for reasons such as the lack of laboratory devices, sphygmomanometer cuffs, or the training CHWs need for accurate measurements. In our study, the DLS demonstrated performance comparable to that of the re-estimated office-based refit-WHO score, without requiring laboratory examination, vital signs measured via additional devices, or BMI. This improves accessibility for health systems that have limited resources to collect vitals and labs for CVD risk screening and triage. More intriguingly, PPG signals could in principle be captured through a smartphone 16 , and future work could leverage smartphone-based PPGs along with the DLS to enable large-scale screening and triage in the community at low cost (Figure 1a) 14,29 .
Due to the higher prevalence, lower diagnosis rate and lower treatment of CVD in LMICs, WHO has listed preventing and controlling CVD as main targets in their "Global action plan for the prevention and control of non-communicable diseases (NCDs) 2013-2030" 30 . PPG-based screening may allow healthcare systems to optimize use of resources by funneling in those who are likely to benefit the most and improve the early detection of CVDs. Thus, our study represents a step on the journey towards enabling community-based preventive treatment for high CVD risk individuals with limited healthcare access.
The deep learning-based features are challenging to interpret directly, and the pathophysiological link between PPG and CVD risk is still under investigation 31 . We computed the Pearson correlation coefficient between DLS features and engineered PPG morphological features (Supplementary Table 5), and found some correlations exceeded 0.3. We also found that using a summarized resting electrocardiogram (ECG) yielded performance comparable to the office-based model on a UKB subset containing resting ECG, with a C-statistic of 70.9% (56.9, 83.0) versus 69.9% (56.7, 81.1) (p=0.845), yet further evidence is required to draw conclusions.
Several limitations of the study should also be noted. We used a single dataset, UKB, for both modeling and evaluation. Though we have stratified the UKB cohort based on geographical information to allow for non-random variation 32 , further work is needed to understand generalization to other populations.
Notably, UKB is not representative of the population in LMICs. However, using UKB to demonstrate the capability of the DLS for long-term CVD risk prediction is an important first step in justifying prospective data collection in LMICs. The device used for PPG acquisition in the UKB is a clinical pulse oximeter, thus our results provide direct evidence that the pulse oximeter may be a reliable CVD screening tool. Studies have found that the heart rate and rhythm extracted from smartphone PPG were comparable with those from clinical-grade devices such as ECG 33-35 , but additional work is needed to know whether deep learning models can be developed directly on smartphone-collected PPGs. Future work could focus on predicting CVD risk using prospective smartphone PPG datasets from low-resource healthcare systems.
To summarize, our study found that a deep learning model extracted features that when added to easily extractable clinical and demographic variables (such as smoking status, age and sex), provided statistically significant prognostic information about cardiovascular risk. Our work is an initial step towards accurate and scalable CVD screening in resource-limited areas around the world.

Overview
We developed a new CVD risk prediction score, DLS, using age, sex, smoking status and the results of analysis of PPG signals using deep learning. We used a Cox proportional hazard model and data from UKB to predict the ten-year risk of MACE among individuals free of CVD at baseline.

Data Source and Cohort
The DLS was developed and evaluated using data from the UKB dataset, filtered to focus on participants aged 40-74 to mirror a previous study 4 . We then stratified UKB participants who had PPG waveforms recorded into three subsets: train (n=105,319), tune (n=46,868), and test (n=57,702) subsets based on geographic information on the site of data collection, i.e., latitude and longitude. This strategy aligns with TRIPOD guidelines 32 on external validation (specifically validation on a different geographic region) by allowing for non-random variation between data splits such as differences in data acquisition or environment.
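The geographic split described above can be sketched as follows. This is a hypothetical illustration: the paper states only that subsets were defined by the latitude and longitude of the data-collection site, so the longitude-only rule and the cut-off values below are made-up for demonstration.

```python
# Hypothetical sketch of a geography-based train/tune/test split.
# The cut-off longitudes are illustrative, not the study's actual boundaries.
def assign_split(longitude, tune_cutoff=-2.5, test_cutoff=-1.0):
    """Return 'train', 'tune', or 'test' for a given assessment-centre longitude."""
    if longitude < tune_cutoff:
        return "train"
    if longitude < test_cutoff:
        return "tune"
    return "test"

participants = [
    {"id": 1, "longitude": -3.2},  # e.g. a centre further west
    {"id": 2, "longitude": -1.9},
    {"id": 3, "longitude": -0.1},  # e.g. a centre in the south-east
]
splits = {p["id"]: assign_split(p["longitude"]) for p in participants}
```

Splitting by site geography, rather than at random, allows for non-random variation between splits (e.g. differences in acquisition devices or environment), which is the property TRIPOD's external-validation guidance targets.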
We used PPG waveforms from all visits for the participants in this train subset to train the PPG feature extractor in the DLS (details in "Model Development"). The low-dimensional numeric outputs (embeddings) computed by this model were used as additional input features to our Cox model. To develop the Cox model that generates the DLS to predict MACE risk, additional clinical and demographic variables and inclusion/exclusion criteria were needed. First, we excluded participants with non-fatal myocardial infarction or stroke before their first visit, or missing any of the variables for our model (age, sex, and smoking status). We also excluded those without body mass index (BMI) or systolic BP (SBP), for a fair comparison against the other office- and lab-based risk prediction models. For each participant, we only included measurements related to their first visit. All numerically measured variables were standard-scaled. Cox models were regularized using a ridge penalty. In the final cohort, 97,970, 43,539, and 54,856 participants were included to train, tune, and test the survival model, respectively (Figure 2). Descriptive statistics for this cohort are in Table 3.

Model Development
First Stage: PPG Feature Extractor
For the DLS, we first trained a deep learning-based feature extractor to learn PPG representations from raw PPG waveform signals, using a one-dimensional ResNet18 21,22,36 as the neural network architecture. We trained the feature extractor on the train subset, and picked the network weights that maximized the Cox pseudolikelihood (see description of the second stage below) on the tune subset.
These weights were used to compute PPG embeddings on the train, tune, and test subsets. The PPG embeddings were then reduced in dimension using principal component analysis (PCA) before serving as inputs to the second-stage survival model.
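As a rough illustration of the extractor's building block (not the actual trained network, which is a full one-dimensional ResNet18), a 1-D residual unit can be sketched with NumPy. The kernels here are random and purely illustrative:

```python
import numpy as np

def conv1d_same(x, kernel):
    """1-D convolution with 'same' padding; kernel flipped so this acts as
    cross-correlation, as in deep learning frameworks."""
    return np.convolve(x, kernel[::-1], mode="same")

def residual_block(x, k1, k2):
    """Residual unit: conv -> ReLU -> conv, then add the skip connection."""
    h = np.maximum(conv1d_same(x, k1), 0.0)
    return np.maximum(x + conv1d_same(h, k2), 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=100)                       # toy single-channel PPG segment
k1, k2 = rng.normal(size=3), rng.normal(size=3)  # illustrative random kernels
y = residual_block(x, k1, k2)                  # same length as the input
```

The skip connection is what lets an 18-layer network train stably; stacking such blocks with downsampling yields the low-dimensional embedding used downstream.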

Second Stage: Survival Model
In the second stage, we developed a Cox proportional hazards regression model for predicting ten-year MACE risk, using as inputs age, sex, smoking status, the PCA-derived PPG embeddings and PPG-HR (heart rate measured during PPG assessment). The model was trained on the train subset and tuned on the tune subset to decide the best-performing ridge regularization parameter (Supplementary Table 6).
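Given a fitted Cox model, the ten-year risk follows from the baseline ten-year survival and the linear predictor: risk(x) = 1 - S0(10)^exp(β·(x - x̄)). A minimal sketch, with made-up coefficients and baseline survival (the real model also takes the PCA-derived PPG embeddings and PPG-HR):

```python
import math

def ten_year_risk(x, coefs, means, s0_10yr):
    """Ten-year event risk under a Cox model:
    1 - S0(10) ** exp(beta . (x - mean))."""
    lp = sum(b * (xi - m) for b, xi, m in zip(coefs, x, means))
    return 1.0 - s0_10yr ** math.exp(lp)

# Hypothetical standardized predictors: age, sex, smoking, one PPG feature.
coefs = [0.5, 0.3, 0.4, 0.2]        # made-up illustrative coefficients
means = [0.0, 0.0, 0.0, 0.0]        # predictors already standard-scaled
baseline_survival = 0.97            # made-up baseline ten-year survival S0(10)

# At the cohort mean, lp = 0, so risk = 1 - S0(10) = 3%.
risk_avg = ten_year_risk([0, 0, 0, 0], coefs, means, baseline_survival)
```

Because the predictors are standard-scaled, an individual at the cohort mean receives exactly the baseline risk, and each unit increase in a predictor multiplies the hazard by exp(coefficient).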

Models for Comparison/Reference
For comparison, we developed different survival models based on different feature sets (Table 3 "Features used" column, and Supplementary Table 7), including office-based and laboratory-based refit-WHO scores using the CVD risk predictors adopted in the WHO/ISH risk chart and Globorisk studies, a metadata-only model (age, sex, smoking status), a metadata + PPG morphology model (metadata plus engineered PPG features describing waveform morphology, such as dicrotic notch presence; details in Supplementary Table 5), models without smoking status as an input (metadata without smoking, DLS without smoking), and a "Full" model that considered metadata, laboratory data, medication and medical history as a reference. We chose the model using the shared predictors from the office-based WHO/ISH risk chart and Globorisk score (the office-based refit-WHO score) as the main reference since these are adopted in CVD risk research for low-resource settings. To ensure the fairest comparison, the coefficients for the WHO and Globorisk predictors were re-fitted on the same UKB train subset as our DLS, and a sensitivity analysis was conducted using the original coefficients with recalibration.
We further developed DLS+ (DLS with BMI), and DLS++ (DLS with BMI and SBP) that additionally included more non-laboratory, office-based measurements as inputs of the survival model to better understand the prognostic value of PPG on top of the existing office-based refit-WHO model.
All models were trained on the same train subset and tuned on the tune subset, except for the laboratory-based refit-WHO score, the metadata + PPG morphology model, and the Full model, which were trained, tuned and compared on a subset of the test data without missing values for their input features.

Evaluation Endpoints
The outcome, ten-year risk of MACE, was defined as a composite of three components: non-fatal myocardial infarction, stroke, and CVD-related death (identified using ICD codes and cause of death; see Supplementary Table 5 for details) 7,28 . To define the outcome, we used (1) the date of heart attack, myocardial infarction, stroke, or ischemic stroke, either diagnosed by a doctor or self-reported, (2) the record of ICD-10 (International Classification of Diseases, 10th revision) clinical codes, and (3) strings associated with CVD-related death. The ICD-10 codes used included I21 (acute myocardial infarction), I22 (subsequent myocardial infarction), I23 (complications after myocardial infarction), I63 (cerebral infarction), and I64 (stroke not specified as hemorrhage or infarction). The strings we used for matching include those related to coronary artery diseases, myocardial infarction, stroke, hypertensive diseases, heart failure, thromboembolism, arrhythmia, valvular diseases and other heart problems. We used the earliest date across all of the data sources mentioned above as the outcome date.
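The composite-outcome logic above can be sketched as follows. The event-tuple format and helper function are hypothetical simplifications of the actual UKB field handling; the ICD-10 prefixes match those listed in the text:

```python
from datetime import date

# ICD-10 prefixes for the MACE composite, as listed in the text.
MACE_ICD10_PREFIXES = ("I21", "I22", "I23", "I63", "I64")

def mace_outcome(events):
    """events: list of (date, icd10_code_or_None, is_cvd_death) tuples.
    A participant is a case if any source qualifies (matching ICD-10 code or
    CVD-related death); the outcome date is the earliest qualifying date.
    Returns None when no event qualifies."""
    dates = [
        d for d, code, cvd_death in events
        if cvd_death or (code is not None and code.startswith(MACE_ICD10_PREFIXES))
    ]
    return min(dates) if dates else None

events = [
    (date(2012, 5, 1), "I10", False),    # essential hypertension: not MACE
    (date(2014, 3, 2), "I21.9", False),  # acute myocardial infarction: MACE
    (date(2016, 8, 9), None, True),      # CVD-related death: MACE
]
first_mace = mace_outcome(events)        # earliest qualifying event date
```

Taking the earliest qualifying date across sources matters for survival analysis, since it determines the event time entered into the Cox model.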

Statistical analysis
For primary analysis, we compared DLS with the office-based refit-WHO score, which is a risk model for healthy individuals across different countries 4,7,24,37 , using Harrell's C-statistic. We conducted a non-inferiority test with a pre-specified margin of 2.5% and alpha of 0.05, both selected based on power simulations using the tune subset. For secondary analyses, we also compared DLS with scores generated by other models mentioned in "Models for Comparison/Reference" above.
Additional evaluation metrics included the category-free net reclassification improvement (cfNRI) 38 and, after defining a specific risk threshold (model operating point), sensitivity, specificity, NRI, and adjusted hazard ratios (HRs). For the NRI and cfNRI, we also reported the respective event and non-event components. Risk thresholds were selected in three ways: (1) matching the sensitivity of SBP-140 (described next), (2) matching the specificity of SBP-140, and (3) the 10% predicted risk threshold suggested by the Globorisk study 4 . Elevated SBP above 140 mmHg ("SBP-140") 39 was used for threshold selection because it is used as a simple single-visit indicator of BP control in the healthcare programs of some countries such as India 40 , and we hypothesized that the PPG provides a single-visit indicator of vascular properties. To calculate sensitivity and specificity, we excluded participants without ten years of follow-up unless they had a MACE event within ten years. To evaluate model calibration, we used the slope of the line comparing predicted and observed event rates across deciles of model prediction 37 . We also performed subgroup analyses based on smoking status, sex, age, elevated HbA1c and hypertension status. We used quintiles for the elevated HbA1c subgroup due to the smaller sample size.
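The decile-based calibration slope can be sketched in pure Python: sort participants into deciles of predicted risk, compute the mean predicted risk and the observed event rate per decile, and fit an ordinary least-squares line through those ten points (a perfectly calibrated model gives a slope of 1). This is a minimal sketch, not the study's exact implementation:

```python
def calibration_slope(predicted, observed, n_bins=10):
    """OLS slope of observed event rate vs. mean predicted risk over deciles.
    `predicted`: per-participant predicted risks; `observed`: event indicators
    (or observed risks). Slope near 1 indicates good calibration."""
    pairs = sorted(zip(predicted, observed))
    size = len(pairs) // n_bins
    xs, ys = [], []
    for i in range(n_bins):
        # Last bin absorbs any remainder so every participant is used.
        chunk = pairs[i * size:(i + 1) * size] if i < n_bins - 1 else pairs[i * size:]
        xs.append(sum(p for p, _ in chunk) / len(chunk))  # mean predicted risk
        ys.append(sum(o for _, o in chunk) / len(chunk))  # observed event rate
    mx, my = sum(xs) / n_bins, sum(ys) / n_bins
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

A slope below 1 indicates the model overestimates risk in high-risk deciles (as reported for the elevated-A1c subgroup); a slope above 1 indicates underestimation.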
For statistical precision, we used the Clopper-Pearson exact method to compute 95% confidence intervals (CIs) for sensitivity and specificity, and the non-parametric bootstrap with 1,000 iterations to compute 95% CIs for all remaining metrics and delta values. For hypothesis tests in the secondary and exploratory analyses, we used a permutation test to examine non-inferiority and superiority of the C-statistic, and the one-sided Wald test for sensitivity and specificity. The log-rank test was used to determine whether survival differs between the model-defined low- and high-risk groups. For all two-sided tests, we used an alpha value of 0.05.
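The non-parametric bootstrap used for the CIs can be sketched as follows: resample participants with replacement, recompute the statistic, and take the 2.5th and 97.5th percentiles of the resampled values. The statistic here is a simple mean for brevity; in the study it would be, for example, the C-statistic or a between-model delta:

```python
import random

def bootstrap_ci(values, statistic, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample `values` with replacement `n_iter`
    times, apply `statistic` to each resample, and return the empirical
    (alpha/2, 1 - alpha/2) percentiles."""
    rng = random.Random(seed)
    stats = sorted(
        statistic([rng.choice(values) for _ in values]) for _ in range(n_iter)
    )
    lo = stats[int((alpha / 2) * n_iter)]
    hi = stats[int((1 - alpha / 2) * n_iter) - 1]
    return lo, hi

mean = lambda xs: sum(xs) / len(xs)
# Toy per-resample statistic values centred on 0.71.
lo, hi = bootstrap_ci([0.70, 0.72, 0.71, 0.69, 0.73] * 40, mean)
```

For a delta between two models, each bootstrap resample is scored by both models so that the resulting CI reflects the paired nature of the comparison.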

Acknowledgements
This work was funded by Google LLC. All authors affiliated with Google are employees and own stock as part of the standard employee compensation package. We acknowledge Nick Furlotte (Google Research) and the Google Research team for software infrastructure support. We also thank Boris

Data availability
This research has been conducted using the UK Biobank Resource under Application Number 65275.

Code availability
The deep learning framework (JAX) used in this study is available at https://github.com/google/jax 41 . All survival analyses were conducted using Lifelines 42 , an open source Python library.

Declaration of competing interests
Authors WHW, SB, MD, CC, YL, and DA are employed at Google LLC and hold shares in Alphabet, and are co-inventors on patents (in various stages) for CVD risk prediction using deep learning and PPG, but declare no non-financial competing interests. LH, BB, CYM, YM, GSC, SS, and SP are employed at Google LLC and hold shares in Alphabet but declare no non-financial competing interests. SK did this work at Google via PRO Unlimited, holds shares in Alphabet, serves as an Associate Editor for this journal but had no role in the editorial process and decisions for this manuscript, and is a co-inventor on patents for CVD risk prediction using deep learning and PPG. GD declares no financial or non-financial competing interests.

Tables

Table 1: Model performance comparison of 10-year major adverse cardiovascular events (MACE) risk prediction between the DLS and other methods for the non-operating-point-dependent metrics. The primary analysis of the study is non-inferiority of the C-statistic of the DLS model compared with the office-based refit-WHO model. 95% confidence intervals (CIs) of the C-statistic, cfNRI, and slope were obtained via bootstrapping, and p-values were computed via a permutation test. The slope was not calculated for SBP-140 since its output is binary. *In the "Features used" column, "Metadata" includes age, sex, and smoking status.

Supplementary Figures
Supplementary Figure 1
This supplementary material has been provided by the authors to give readers additional information about their work.

Details of UK Biobank photoplethysmography data
The photoplethysmography (PPG) waveforms in the UKB (Data field 4205) were acquired using the PulseTrace PCA2 device (CareFusion, USA). The device collected and averaged a minimum of six heart beats per user with a pulse interval close to the average pulse interval. It has been shown that morphological properties of a single representative PPG waveform are related to CV aging and CVD risk, such as augmentation index 1-3 . We preprocessed PPG waveforms by re-scaling to [0, 1].
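The stated preprocessing is a min-max rescale of each waveform to the unit interval; a minimal sketch:

```python
def rescale_unit(waveform):
    """Min-max rescale a PPG waveform to [0, 1], removing amplitude offsets
    so the model sees only the waveform's shape."""
    lo, hi = min(waveform), max(waveform)
    return [(v - lo) / (hi - lo) for v in waveform]

# Toy raw sensor values (arbitrary units).
ppg = [512.0, 600.0, 480.0, 556.0]
scaled = rescale_unit(ppg)  # minimum maps to 0.0, maximum to 1.0
```

Discarding the absolute amplitude is deliberate: raw PPG amplitude depends on contact pressure and perfusion at the measurement site, whereas the morphological features of interest are shape-based.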

Details of model training
During model training, we adopted a multitask learning framework with multiple proxy prediction tasks, such as predicting age, sex, BMI, blood pressure, and laboratory values (Supplementary Table 8).
We also developed a custom data augmentation, inspired by Brownian motion, which we termed Brownian tape speed augmentation, to improve the generalizability of the model. The method simulates playing back the signal on a tape whose playback speed varies according to Brownian motion. Specifically, the playback speed increment at each time step is drawn from a normal distribution. The method has a single hyperparameter, the magnitude, which defines the standard deviation of this normal distribution: for each sequence (i.e., PPG signal), the magnitude is divided by the sequence length to set the standard deviation for that sequence. This division ensures that, regardless of sequence length, the overall amount of transformation is similar.
We then calculate a running sum of this array of normal-distribution samples, and add 1 everywhere to simulate a random walk of tape speed starting at 1. The array now represents the tape speed. We then calculate another running sum, after which the array represents displacement. We use this displacement as a flow field, which is applied to transform the input using the tfa_image.dense_image_warp function from TensorFlow Addons.
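The procedure above can be sketched in pure Python. This is an illustrative re-implementation: the actual pipeline warps the signal with tfa_image.dense_image_warp, whereas this sketch substitutes simple linear interpolation for the final resampling step:

```python
import random

def lerp_sample(signal, x):
    """Linearly interpolate `signal` at fractional index `x` (clamped)."""
    x = max(0.0, min(x, len(signal) - 1.0))
    i = int(x)
    j = min(i + 1, len(signal) - 1)
    frac = x - i
    return signal[i] * (1 - frac) + signal[j] * frac

def brownian_tape_speed(signal, magnitude=4.0, seed=0):
    """Brownian tape speed augmentation:
    1) draw per-step increments ~ Normal(0, magnitude / len(signal)),
    2) running sum + 1  -> instantaneous tape speed (random walk from 1),
    3) running sum      -> displacement: where to read the original signal,
    4) resample the signal at the displaced positions."""
    rng = random.Random(seed)
    n = len(signal)
    sigma = magnitude / n
    speed, s = [], 1.0
    for _ in range(n):
        s += rng.gauss(0.0, sigma)
        speed.append(s)
    warped, pos = [], 0.0
    for v in speed:
        warped.append(lerp_sample(signal, pos))  # clamped read position
        pos += v
    return warped
```

With magnitude 0 the tape speed stays exactly 1 and the signal is returned unchanged; larger magnitudes locally stretch and compress the waveform, mimicking natural beat-to-beat timing variation.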
Training setup for the PPG feature extractor is listed in Supplementary Table 6.

Photoplethysmography morphology-based features
In the metadata + PPG morphology model, we used the engineered features available in the UK Biobank for PPG-based arterial stiffness evaluation (Supplementary Table 5). The features are pulse wave reflection index (RI), peak to peak time, pulse wave peak position, pulse wave notch position, pulse wave shoulder position, the presence/absence of dicrotic notch, and arterial stiffness index derived from the peak to peak time and the height of the participant.

Polygenic risk model creation
Individuals of European genetic ancestry who did not have PPG data were split into genome-wide association study (GWAS) (N=208k), train (N=40k), and tune (N=40k) sets. We performed GWAS on the GWAS dataset using BOLT-LMM v2.3.6 4 , adjusting for age, sex, genotyping array, smoking status, and the top 15 genetic principal components, for the following 24 cardiovascular disease-related phenotypes: angina, myocardial infarction, coronary artery disease, heart failure, stroke, cardiovascular death, hypertension, atrial fibrillation, rheumatic heart disease, rheumatoid arthritis, chronic renal failure, diabetes, systolic blood pressure, diastolic blood pressure, low-density lipoprotein (LDL) cholesterol, high-density lipoprotein (HDL) cholesterol, total cholesterol, triglycerides, hemoglobin A1C, glucose, body mass index (BMI), and three definitions of major adverse cardiovascular event (MACE): the logical OR of myocardial infarction, cardiovascular death, and stroke; the logical OR of myocardial infarction, cardiovascular death, and heart failure; and the logical OR of myocardial infarction, cardiovascular death, stroke, heart failure, angina, and coronary artery disease (MACE-lenient). For each phenotype, a polygenic risk score (PRS) was generated by BOLT-LMM using the --predBetasFile option. Additionally, we ran PolyFun 5 to create functionally informed fine-mapping PRSs. We trained a multilayer perceptron to predict the MACE-lenient phenotype from the 48 PRSs in the train set and selected hyperparameters based on performance in the tune set. We then applied the model to all individuals with PPG data and used the resulting model prediction as the MACE PRS.

Supplementary Results
Analysis for the fixed model operating point at 10% risk
At the 10% risk threshold suggested by the Globorisk study, the sensitivity, specificity, and NRI of the DLS were 4.0% (3.0, 5.1), 98.9% (98.8, 99.0), and 0.6% (-0.6, 1.7), respectively. Meanwhile, the sensitivity and specificity of the office-based refit-WHO score were 3.0% (2.2, 4.0) and 99.1% (99.0, 99.2), respectively. We found that, without any medical device-dependent measurement, the DLS is non-inferior to the office-based refit-WHO score at this threshold. The full evaluation of all models is listed in Supplementary Table 1.

Models without smoking status
We further examined Cox's models without using smoking status as a feature (the office-based refit-WHO score without smoking status, DLS without smoking status). We found that removing smoking status from the predictor set did not reduce the DLS performance and non-inferiority relative to the office-based refit-WHO was maintained (Supplementary Table 9). However, the calibration was worse (slope of 0.968).
Applying additional features helps improve cardiovascular disease risk prediction
In Supplementary Figure 4, we investigated the extent to which individuals predicted to be at high risk by a model are enriched for MACE prevalence. As expected, given the observed improvement in C-statistic, the DLS+ model shows superior performance to the Metadata+ model (which contains only age, sex, smoking status, and BMI) when examining MACE prevalence in the top 5% and 10% of predicted risk. We observed a similar, and slightly more pronounced, improvement from a model that includes a PRS component in addition to the Metadata+ features at the same 5% and 10% most extreme risk percentiles (2.39- and 2.67-fold enrichment over total sample prevalence, respectively, compared to 2.14- and 2.26-fold enrichment for Metadata+). Interestingly, the contributions of PPG and genetic risk appear complementary, as a model that includes Metadata+, PPG, and PRS was the most enriched for MACE prevalence (2.53- and 2.87-fold enrichment, respectively).
While both the Full model and the model including polygenic risk show improved MACE prediction performance, we note that each requires more variables that may not be available in low-resource settings, which may limit the use of such lab-based approaches like QRISK, ASCVD, and PRS.

Supplementary Tables
Supplementary Table 2: Model performance comparison of 10-year major adverse cardiovascular events (MACE) risk prediction between DLS versus DLS+ (adding BMI) and DLS++ (adding BMI and SBP). (a)
We examined the discrimination performance using the C-statistic, reclassification improvement using the category-free net reclassification improvement (cfNRI), and model calibration using the slope value from the reliability diagram. *In the "Features used" column, "Metadata" includes age, sex, and smoking status. (b) The sensitivity was calculated at the risk threshold matching the specificity of SBP-140, and the specificity at the risk threshold matching the sensitivity of SBP-140. 95% confidence intervals (CIs) of the C-statistic, cfNRI, and slope were obtained via bootstrapping, and p-values were computed via a permutation test. CIs of sensitivity and specificity were obtained from the Clopper-Pearson exact method, and p-values were calculated by a permutation test with the prespecified margin of 2.5% and alpha of 0.05.