Polygenic risk scores and Parkinson’s disease in South Africa: Advancing ancestry-informed disease prediction

Kathryn Step; Carene Anne Alene Ndong Sima; Spencer Grant; Jonggeol Jeffrey Kim; Emily Waldo; Soraya Bardien; Ignacio F. Mata; on behalf of the Global Parkinson’s Genetics Program (GP2)

doi:10.1371/journal.pgen.1012064

Abstract

Parkinson’s disease (PD) is a complex neurodegenerative disorder with environmental and genetic influences. Using genotyping array data of 661 South African PD cases and 737 controls, we conduct a polygenic risk score (PRS) analysis using PRSice-2. Summary statistics from two PD association studies have been used as base datasets. We split the target dataset into training (70%; n = 979) and validation (30%; n = 419) cohorts. We test various clumping window sizes, linkage disequilibrium thresholds, and p-value thresholds to determine the optimal combination for risk prediction. Additionally, we investigate the variance explained by different combinations of covariates. Overall, we observe modest predictive performance (AUC: 0.5847-0.6183). Age at recruitment emerges as the strongest individual predictor, while sex contributes the least. These findings provide the first evaluation of PRS performance for PD in a highly admixed South African cohort, underscoring the importance of including underrepresented populations in genetic risk prediction. By systematically assessing predictive performance across two base datasets, we highlight how ancestry composition and study design affect risk estimation in diverse populations. This work lays a foundation for refining genomic prediction in admixed populations and contributes to ongoing efforts to ensure that advances in precision medicine are globally relevant.

Author summary

Parkinson’s disease is a complex brain disorder influenced by both genes and environment. Most of what we know about the genetic contribution to this disease comes from studies in people of European ancestry, leaving major gaps in our understanding of how genetic risk works in other populations. In this study, we examined how well existing genetic risk models, known as polygenic risk scores, can predict Parkinson’s disease in people from South Africa, a population with a rich mix of ancestries. We compared different approaches to building these scores and tested how accurately they could identify individuals with the disease. Although we found that genetic scores alone are modest predictors compared to age, our results highlight important insights about the limits and possibilities of using genetic information in diverse populations. By including underrepresented groups in genetic research, our study takes an important step toward making precision medicine more inclusive and equitable worldwide.

Citation: Step K, Ndong Sima CAA, Grant S, Kim JJ, Waldo E, Bardien S, et al. (2026) Polygenic risk scores and Parkinson’s disease in South Africa: Advancing ancestry-informed disease prediction. PLoS Genet 22(3): e1012064. https://doi.org/10.1371/journal.pgen.1012064

Editor: Heather J. Cordell, Newcastle University, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND

Received: October 14, 2025; Accepted: February 19, 2026; Published: March 9, 2026

Copyright: © 2026 Step et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Data used in the preparation of this article were obtained from the Global Parkinson’s Genetics Program (GP2; https://gp2.org). Specifically, we used Tier 2 data from GP2 release 9 (https://doi.org/10.5281/zenodo.14510099). GP2 data are available on AMP PD (https://amp-pd.org). For the NAMA dataset, the data analyzed in this study are subject to the following licenses/restrictions: No new genetic data were generated for this study; however, summary statistics for the quality and accuracy assessment of the genetic data for the NAMA participants will be made available to researchers who meet the criteria for access after application to the Health Research Ethics Committee of Stellenbosch University (https://www.su.ac.za/en/faculties/medicine/research/ethics). The NAMA datasets can be requested via EGA: https://ega-archive.org/datasets/EGAD00001006198. Summary statistics for the base datasets are available from the NHGRI-EBI GWAS catalog (https://www.ebi.ac.uk/gwas/) under accession numbers GCST009325 and GCST90275127. Code Availability: The QC and ancestry inference pipelines were developed and are maintained by Dr. Thiago Peixoto Leal (peixott@ccf.org) and are available at https://github.com/MataLabCCF. Additionally, an overview of the analysis and any additional scripts not available through the Mata Lab GitHub can be found in the GP2 public domain on GitHub (https://github.com/GP2code/SouthAfrican_PD_PRS) and were given a persistent identifier via Zenodo (https://doi.org/10.5281/zenodo.16583859).

Funding: This research was carried out with the support of The Michael J. Fox Foundation (to K.S., MJFF-026283 to I.M. and E.W.), Aligning Science Across Parkinson’s Global Parkinson Genetic Program (to K.S. and I.M.), the National Institutes of Health (1R01NS112499 and R01NS132437 to I.M., U01AG076482 to I.M. and E.W.), the American Parkinson’s Disease Association (to I.M.), the Department of Veterans Affairs (I01BX005978-01A1 to I.M.), the National Research Foundation of South Africa (129429 to S.B.), the South African Medical Research Council (Self-Initiated Research Grant to S.B.), and the Centre for Tuberculosis Research of the South African Medical Research Council (to S.B.). The funder, Aligning Science Across Parkinson’s Global Parkinson Genetic Program, had a role in the data generation, decision to publish, and preparation of the manuscript.

Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: I.F.M. has received honorarium from the Parkinson’s Foundation PD GENEration Steering Committee and Aligning Science Across Parkinson’s Global Parkinson Genetic Program (ASAP-GP2).

Introduction

Parkinson’s disease (PD) is a complex neurodegenerative disorder characterized by motor and non-motor symptoms [1]. Once considered primarily sporadic and environmentally driven [2,3], research has revealed a multifactorial etiology involving monogenic causes, gene-environment, and gene-gene interactions [2]. Approximately 5–10% of cases are monogenic, caused by single-gene mutations with large effect sizes, identified through genetic linkage analysis in families [4,5]. Although often observed in familial PD, they can occur in sporadic cases due to incomplete penetrance [6]. Genome-wide association studies (GWAS) have further shown that sporadic PD has a polygenic basis, with many small-effect variants jointly contributing to risk [7].

These susceptibility variants can be used for risk prediction through polygenic risk scores (PRS) [8]. In a PRS analysis, the variants identified through GWAS, along with their effect sizes (whether conferring increased or decreased risk), are combined to estimate an individual’s genetic predisposition to a disease [9]. In 2016, the first report on polygenic risk and clinical outcomes for PD was published [10]. Since then, several studies have evaluated PRS for PD risk prediction, with predictive performance, assessed by the area under the receiver operating characteristic curve (AUC), ranging from ~60% to 76% [7,11–14]. This performance is largely dependent on the number of single-nucleotide polymorphisms (SNPs) included and the population characteristics [15].

PRS analysis classifies individuals by relative disease risk, supporting risk stratification, early intervention strategies, and tailored precision medicine approaches [8,15]. We conducted the first PRS analysis for PD in a South African cohort.

Results

PRSice-2 for disease status prediction

The analysis included 35,075,375 variants in the two target datasets: training (70%; n = 979) and validation (30%; n = 419) cohorts. We conducted the analysis using two sets of summary statistics as the base data. For each, we evaluated varying combinations of thresholds for SNP inclusion (S1 and S2 Figs). The analyses were performed on the training dataset to determine the optimal thresholds and replicated in the validation dataset. All analyses were adjusted for sex, age, and inferred local ancestry components (ANC) with linkage disequilibrium (LD) estimated using the European (EUR) reference panel from the 1000 Genomes project.

One of the main uses of PRS is to predict an individual’s case status according to their genetic risk or predisposition, making it a useful prognostic tool (S3 Fig) [15]. Using the EUR-based summary statistics [7], we tested 48 combinations of parameters to identify the best-performing PRS model for PD status prediction (S4 Fig). The highest predictive performance was observed under the following parameter set: 100kb clumping window, r² = 0.8, and a SNP inclusion threshold of p-value = 1 × 10⁻³. Here, 3,466 SNPs remained after clumping and the full model explained 35.84% of the variance in the disease phenotype (PRS R² = 0.005), although the PRS association did not reach significance at the 0.05 level (empirical p-value = 0.076). The PRS analysis was then conducted on the validation dataset (30%) using the abovementioned parameter combination and the 3,466 SNPs included in the model. When applied to the validation set, the full model explained 37.90% of the variance in case-control status (PRS R² = 0.019).

Given the high genetic admixture in the South African cohort [16], we evaluated predictive accuracy using multi-ancestry summary statistics from Kim et al. (2024) [17]. We tested the same 48 parameter combinations as for the previous analysis (S5 Fig). The highest predictive performance was observed under the following parameter set: a 100kb clumping window, r² = 0.8, and a p-value threshold of 1 × 10⁻³. The full R² explained 36.93% of the variance (PRS R² = 0.015; adjusted R² = 0.016; p-value = 9.82 × 10⁻⁵; empirical p-value = 5.0 × 10⁻⁴). A total of 3,208 variants remained after clumping in this model. We performed PRS analysis on the validation dataset (30%), where the best-fitting model, defined by a p-value threshold of 1 × 10⁻³ and including 3,208 SNPs, explained 40.17% of the variance in disease phenotype (PRS R² = 0.042).

Assessment of model performance

The AUC, sensitivity, and specificity were assessed using the two base datasets as well as the training and validation cohorts of the target dataset (Table 1; Fig 1). To assess the robustness of our data split into training and validation cohorts, we calculated the mean AUC across 20 random data splits (Nalls et al. 2019: 0.6038 ± 0.0103; Kim et al. 2024: 0.6052 ± 0.0178). The results were consistent with the original split presented below, indicating that the predictive performance was stable and not notably influenced by the random split.

Table 1. Model performance across base and target datasets.

https://doi.org/10.1371/journal.pgen.1012064.t001

Fig 1. Receiver Operating Characteristic (ROC) Curve Comparing Polygenic Risk Score (PRS) Models for Case Status.

The ROC curves compare the predictive performance of two PRS models for Parkinson’s disease: one based on Nalls et al. (2019) (blue) and the other on Kim et al. (2024) (green). The area under the curve (AUC) indicates the discriminative ability of each model, with higher AUC values reflecting better classification of cases and controls. The AUC for the Nalls model is 0.585, while the AUC for the Kim model is 0.616. PRS were calculated and matched to phenotype data from the same sample set (N = 419).

https://doi.org/10.1371/journal.pgen.1012064.g001

Using the Nalls et al., 2019 summary statistics, the PRS demonstrated a modest ability to distinguish PD cases from controls, with an AUC of 0.6077 (95% CI: 0.5721-0.6433) in the training dataset and 0.5847 (95% CI: 0.5302-0.6391) in the validation dataset. At a fixed probability threshold of 0.5, classification accuracy was 57.00% (95% CI: 53.83-60.12%) in the training dataset and 55.85% (95% CI: 50.95-60.67%) in the validation dataset. The observed balanced accuracy varied slightly between the training and validation datasets (53.43% and 55.60%, respectively). Sensitivity was higher in the validation dataset (63.43%) than in the training dataset (14.16%), whereas specificity was higher in the training dataset (92.70% versus 47.78%).

The PRS model constructed using summary statistics from Kim et al., 2024 yielded a comparable discriminative ability, with an AUC of 0.6183 (95% CI: 0.5830-0.6537) in the training dataset and 0.6159 (95% CI: 0.5624-0.6694) in the validation dataset. Classification accuracy at the 0.5 threshold was 60.16% (95% CI: 57.02-63.25%) and 58.23% (95% CI: 53.35-63.00%) in the training and validation datasets, respectively. Balanced accuracy was modest in both datasets (58.56% and 58.05%), with sensitivity again higher in the validation dataset (63.88%) than in the training dataset (40.90%). Specificity followed the same pattern as observed with the EUR-based base dataset [7], with higher values in the training dataset (76.22%) relative to validation (52.22%).
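The reported accuracy and balanced accuracy follow arithmetically from sensitivity, specificity, and the case/control counts. The sketch below, which is illustrative and not the authors' code, checks this for the Nalls et al. (2019) training cohort, assuming the cohort sizes given in the Materials and methods (445 cases, 534 controls). The function names are hypothetical.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity (insensitive to class imbalance)."""
    return (sensitivity + specificity) / 2


def accuracy(sensitivity: float, specificity: float,
             n_cases: int, n_controls: int) -> float:
    """Overall accuracy: correctly classified cases plus controls over all samples."""
    correct = sensitivity * n_cases + specificity * n_controls
    return correct / (n_cases + n_controls)


# Nalls et al. (2019) base data, training cohort, probability threshold 0.5.
ba = balanced_accuracy(0.1416, 0.9270)
acc = accuracy(0.1416, 0.9270, n_cases=445, n_controls=534)
print(f"{ba:.4f} {acc:.4f}")  # 0.5343 0.5700
```

Both values match those reported above. The low training sensitivity alongside high specificity indicates that, at the fixed 0.5 threshold, most individuals in that cohort were classified as controls.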

Finally, we evaluated the predictive performance of the PRS using top percentile thresholds (S1 Table). The sensitivity increased as more cases were included when the threshold was lowered, while specificity decreased correspondingly. The positive predictive value remained low across all thresholds, whereas the negative predictive values were consistently high (>99%). The overall patterns observed were similar between the PRS derived from the two base datasets with minor differences in the number of cases captured at the top 5% threshold.

Covariate contribution to the variance observed

We evaluated the contribution of covariates to the explained variance by examining their effect on model performance (Table 2). We looked at the two variance models (Full R² and Null R²) under seven different covariate inclusion scenarios, using each possible combination of covariates: age, sex, and ANC.

Table 2. Variance explained by polygenic risk score models with different covariate adjustments for the training dataset.

https://doi.org/10.1371/journal.pgen.1012064.t002

We assessed this using the Nalls et al., 2019 summary statistics as the base data. Here, the highest PRS R², representing the variance explained by the PRS after accounting for covariates, was observed when adjusting for sex only. The highest Null R² was observed with age, sex, and ANC, indicating they explained the most variability independent of polygenic score. In contrast, the lowest Null R² was seen in the model including only sex, suggesting this alone has a limited contribution to the model variance and cannot be used alone to predict risk. The models that included age or sex, or both had statistically significant empirical p-values. Models with ANC alone or sex and ANC showed nominal significance, while the age, sex, and ANC as well as the age and ANC models were not significant. However, the magnitude of variance explained by genetic and covariate components varied depending on the covariate combination.

For Kim et al., 2024, the same seven covariate combinations were assessed. The highest PRS R², showing the variance explained by the PRS model only, was observed when adjusting for sex (PRS R² = 0.0552), while the lowest was seen when adjusting for age and ANC (PRS R² = 0.0139). The Null R² values are consistent with those observed in the EUR-based summary statistics, as expected, since they capture the variance explained by covariates alone and are independent of genetic influence. Finally, the variance explained by the PRS was dependent on the covariate structure, highlighting the impact of covariate selection on model performance.

We evaluated whether the inclusion of PRS improved the predictive performance of models including only covariates (Table 3). For this, we looked at both base dataset summary statistics as well as the training and validation cohorts. The addition of the PRS consistently increased the AUC for all covariate combinations. The improvement was more pronounced in models with fewer covariates, where it reached statistical significance (DeLong p-value <0.05). For models including the full set of covariates (age, sex, and ANC), the PRS still increased the AUC, though the change was smaller and, in some cases, not statistically significant (DeLong p-value >0.05). These results indicate that the PRS provides a meaningful addition to the predictive power beyond the covariates included.

Table 3. Effect of adding the polygenic risk score on predictive performance of covariate models.

https://doi.org/10.1371/journal.pgen.1012064.t003

Polygenic risk score analysis: age at onset

The PRS explained a small proportion of variance in PD age at onset (AAO) across all datasets. Specifically, for the Nalls et al. (2019) analysis, the PRS R² was 0.0019 in the training subset and 0.0054 in the validation subset. For the Kim et al. (2024) analysis, the PRS R² was 0.0005 in the training subset and 0.0243 in the validation subset. Overall, these results suggest that the selected PRS have limited explanatory power for AAO in this cohort.

Internal polygenic risk score analysis using South African summary statistics

We constructed an internal PRS for PD using 351 variants that reached a suggestive significance threshold in our South African GWAS. After LD clumping, 141 independent variants remained and were included in the PRS. In the training cohort, the PRS explained 33.9% of the phenotypic variance (R² = 0.3386) and achieved an AUC of 0.7736. When applied to the validation cohort, the PRS explained 36.0% of the variance (R² = 0.3598) with an AUC of 0.7667, indicating consistent discriminative performance across cohorts.

Discussion

To our knowledge, this is the first study to evaluate PRS for PD prediction in a South African cohort. We used a well-established PRS software, PRSice-2, and leveraged summary statistics from both EUR-ancestry and multi-ancestry GWAS for PD. Despite the summary statistics not fully matching the genetic background of our cohort, which is five-way admixed [16], our findings demonstrate that polygenic models can still capture a modest but significant proportion of the variance in PD susceptibility, highlighting the utility of PRS in diverse populations.

Using PRSice-2, we identified the best parameter sets for both the Nalls et al. (2019) and Kim et al. (2024) summary statistics. In the training dataset, we found that clumping parameters (e.g., r² and window size) and p-value thresholds had notable effects on predictive power. The optimal PRS derived from the Nalls et al. (2019) summary statistics achieved an AUC of 0.6077. Similarly, using the Kim et al. (2024) summary statistics with more diverse populations, the best-performing PRS model had an AUC of 0.6183. The PRS derived from Kim et al. (2024) explained more variance in disease risk yet achieved only a slightly higher AUC than the PRS derived from Nalls et al. (2019), which highlights how R² and AUC capture different aspects of predictive performance: the underlying genetic predisposition and classification accuracy, respectively. This may suggest that multi-ancestry meta-analysis PRS better models genetic predisposition for distinguishing cases from controls in the study population. The observed AUCs are nonetheless close to the lower end of the range previously reported for PD PRS (0.620 to 0.760) [15].

Notably, the models that achieved the highest predictive accuracy tended to favor moderate clumping thresholds and relatively lenient SNP inclusion thresholds, likely balancing the trade-off between including informative variants and controlling for LD-induced noise. Moreover, the validation of model performance in an independent subset of the cohort strengthens the robustness of these findings and supports the potential clinical utility of PRS in diverse populations.

Overall, our results highlight that while sex contributes modestly to the explained variance, accounting for approximately 3–5% across models, it cannot be used in isolation to predict disease status [15], as evidenced by the low full model R² values when sex was the only covariate. In contrast, age appears to provide a more meaningful contribution, particularly when combined with other covariates such as ancestry. Importantly, the PRS R² values were consistently low across all models and datasets, suggesting that genetic risk alone, as captured by current PRS methods, is insufficient for reliable disease prediction [18]. However, the DeLong test demonstrated that adding PRS to models with sex or ancestry alone can significantly improve AUC, whereas the incremental gain was not significant when age and ancestry covariates were already included. This underscores that the predictive contribution of PRS is dependent on demographic and ancestral factors. These findings underscore the importance of incorporating non-genetic covariates to enhance predictive performance, as previously illustrated in a PD context [19]. Ultimately, the limited variance explained by PRS alone constrains its current clinical utility for PD and emphasizes the need for integrative models that include both genetic and phenotypic information [8].

The additional sensitivity and specificity analyses support the overall performance of our PRS models, showing results that are consistent with previous work on multi-ancestry populations [20]. Our AUC values ranged from 0.5847 to 0.6183 across base datasets, with corresponding balanced accuracy values between 0.5343 and 0.5856. These values are comparable to those previously reported, whose AUC ranged from 0.505 to 0.651 across four ancestries [20]. Our model performs similarly to theirs, particularly in populations with higher predictive power, such as the EUR populations. Our sensitivity and specificity values also reflect expected trade-offs: for example, when sensitivity increased, specificity tended to decrease, consistent with typical classification dynamics. Together, these findings reinforce the idea that PRS models retain moderate predictive power in diverse populations, but that further optimization may be needed to improve performance, especially in underrepresented groups. This aligns with previous reports showing PRS transferability from EUR to AFR populations is substantially reduced, with performance estimated at 20–40% of that observed in EUR populations [21].

We generated an internal PRS using summary statistics from our South African GWAS. Of 351 suggestively associated variants, 141 remained after clumping. The PRS explained 33.9% of the variance in the training cohort (AUC = 0.774) and 36.0% in the validation cohort (AUC = 0.767). As this PRS was derived and tested in the same cohort, these estimates likely reflect overfitting and should be interpreted as cohort-specific rather than generalizable predictive performance.

Despite the modest predictive performance observed, our study underscores both the promise and limitations of current PRS models in underrepresented populations. A key limitation of this study is the small sample size and the lower mean age of our control group relative to the case group. A further limitation of the study is the lack of an appropriate ancestry-matched validation cohort. Moreover, the AUC values achieved reflect modest discriminative power. This finding suggests that while PRS can contribute to risk stratification, they are not yet sufficient for clinical decision-making on their own. This is further illustrated by the AAO PRS, which explains only a modest proportion of variation in AAO, highlighting its limited predictive power for this phenotype. Future studies incorporating ancestry-specific GWAS, functional annotations, and integrative risk models may further improve PRS accuracy in AFR and admixed populations. As LD resources tailored to AFR and admixed genomes continue to improve, evaluating LD-aware methods such as PRS-CS or PRS-CSx will be critical to determine whether these approaches can further enhance PRS performance in this setting [22,23].

A key strength of the study is its novelty, representing the first evaluation of PRS for PD in a South African study collection, thereby addressing a critical gap in genetic risk research. By including both EUR-based and multi-ancestry summary statistics, we were able to compare the PRS transferability across ancestries and assess how base dataset ancestral composition influences predictive performance. Furthermore, the systematic evaluation of clumping thresholds, LD parameter, and p-value thresholds to identify the optimal input parameter combinations further strengthens the methodological approach of this analysis. Finally, the application of PRS across various diseases, including diabetes and cancer [24,25], has proven valuable for stratifying individuals at highest risk, rather than serving as a direct predictor of disease development [26]. In this context, our study contributes by refining PD risk prediction for a smaller subset of individuals most at risk for developing PD.

In conclusion, our results highlight the importance of including diverse ancestral cohorts and relevant covariates when constructing and evaluating PRS models. By systematically assessing the variance observed across different covariate combinations, we highlight the contributions of demographic and genetic factors to disease risk prediction. Future efforts should continue to refine ancestry-specific risk models to ensure equitable translation of PRS from research into clinical applications for early screening, disease risk prediction, and precision medicine [27].

Materials and methods

Ethics statement

Ethical approval was obtained from the Health Research Ethics Committee 1 (HREC 1), Stellenbosch University, South Africa. The HREC 1 reference number is: S23/10/251 (PhD) Sub Study 2002C/059. All participants provided a written statement of formal consent.

Participant demographics

Study participants were recruited from 2002 to 2020 as part of the South African PD Study Collection (S2 Table) [28]. Individuals living with PD (n = 691) were diagnosed in accordance with the Queen Square Brain Bank criteria [29]. In total, 826 controls were recruited as part of the study collection [28].

Data preprocessing

Genotyping was completed through the Global Parkinson’s Genetics Program [30] using the NeuroBooster Array (v1.0, Illumina, San Diego, CA) [31]. Quality control (QC) was performed using PLINK v1.9 and v2.0 [32,33], as previously described [16]. Imputation was performed using the TOPMed Imputation Server [34]. Related individuals (n = 63) were identified using NAToRA [35] with a kinship coefficient threshold of 0.0884 (second-degree relation [36]) and excluded from the analysis. After QC, 661 PD cases and 737 controls remained (S2 Table). The South African population is five-way admixed [37]; therefore, a reference panel was created using samples from the 1000 Genomes Project Phase III, including individuals of African (AFR), EUR, and South Asian (SAS) ancestries [38,39]. Additional individuals of Malay (MAL) ancestry and an indigenous hunter-gatherer Khoe-San (NAMA) population were included in the reference panel [40], as previously described [16]. The reference files were phased using the TOPMed Imputation Server [34]. For the South African dataset, local ancestry inference was performed using G-Nomix [41], as previously described [16] (S6 Fig).

Polygenic risk score calculation

A PRS analysis includes the following steps (Fig 2): (1) data QC and preparation, (2) PRS calculation, and (3) PRS performance assessment [9]. The analysis utilizes two independent datasets: discovery (base) and target datasets [9]. The discovery dataset consists of GWAS summary statistics, including effect sizes for each variant. The target dataset contains individual-level genotype data, from which SNP dosages are derived for variants included in the PRS calculation. In general, PRS is computed for each individual as the sum of the dosages of risk alleles at selected SNPs, weighted by their corresponding effect sizes from the discovery dataset [42]. The PRS analysis was run using PRSice-2 v2.3.3 [43], which applies clumping and thresholding based on LD and p-values. The predictive PRS model is evaluated using Nagelkerke’s pseudo-R² (R²) [9].
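The weighted-sum definition above can be sketched in a few lines. This is a minimal illustration, not the PRSice-2 implementation: the variant identifiers, effect sizes, and dosages are hypothetical, and software such as PRSice-2 may additionally scale the sum (for example, by the number of alleles included), which is omitted here.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float],
                         effect_sizes: Dict[str, float]) -> float:
    """Sum of risk-allele dosages weighted by base-data effect sizes.

    Variants absent from the individual's genotypes are skipped; this is an
    illustrative choice, as real tools handle missingness in configurable ways.
    """
    return sum(beta * dosages[snp]
               for snp, beta in effect_sizes.items()
               if snp in dosages)


# Hypothetical variants with log-odds effect sizes (base dataset) and
# risk-allele dosages between 0 and 2 (target dataset).
effect_sizes = {"rsA": 0.20, "rsB": -0.10, "rsC": 0.05}
dosages = {"rsA": 2.0, "rsB": 1.0, "rsC": 0.0}

score = polygenic_risk_score(dosages, effect_sizes)
print(round(score, 2))  # 0.3
```

Negative effect sizes lower the score, so protective alleles and risk alleles are handled by the same sum.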

Fig 2. An overview of the methods followed in the present study.

LiftOver was specific to the summary statistics used as the base datasets. PRS, polygenic risk score. Image created with BioRender.com (https://BioRender.com/19qh9dv).

https://doi.org/10.1371/journal.pgen.1012064.g002

Polygenic risk score: file preparation

The covariate file was created using the local ancestry analysis output containing the ancestry inferences for the parental populations for each individual and both haplotypes across inferred local ancestry windows. The ancestry proportions were calculated by extracting the ancestry window information, specifically calculating the total genomic span of each parental ancestry, and normalizing the values to determine relative ancestry contribution per individual. Additional covariates included age and sex. Full summary statistics were obtained from the NHGRI-EBI GWAS catalog [44] on 05/09/2024 for studies GCST009325 [7] and GCST90275127 [17]. These two base datasets were used to assess and compare the predictive power of the EUR dataset relative to a multi-ancestry dataset which is better matched to the admixture of the South African cohort. This approach was used to evaluate whether ancestry-matched summary statistics enhanced predictive performance in an admixed population, as traditional EUR-derived GWAS does not fully capture the genetic architecture in diverse populations, like the South African population. The summary statistics were converted to GRCh38 using LiftOver v1.0 [45]. The South African PD data, serving as the target dataset, was randomly split into two cohorts: 70% of the samples were in the training dataset (n = 979 individuals; n = 445 cases; n = 534 controls), and 30% in the validation dataset (n = 419 individuals; n = 216 cases; n = 203 controls). To assess the robustness of the data split, we ran the PRS analysis across 20 random splits and compared the distribution of the AUC values with the original split (S3 Table).
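The ancestry-proportion calculation described above (summing the genomic span assigned to each parental ancestry across both haplotypes, then normalizing) can be sketched as follows. The window layout, coordinates, and ancestry calls are hypothetical and do not reproduce the actual local ancestry output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical window record: (start_bp, end_bp, ancestry_hap1, ancestry_hap2).
Window = Tuple[int, int, str, str]


def ancestry_proportions(windows: Iterable[Window]) -> Dict[str, float]:
    """Fraction of the total haplotype span assigned to each parental ancestry."""
    span: Dict[str, float] = defaultdict(float)
    for start, end, hap1, hap2 in windows:
        length = end - start
        span[hap1] += length  # each haplotype contributes the window length
        span[hap2] += length
    total = sum(span.values())
    return {ancestry: bp / total for ancestry, bp in span.items()}


# Toy individual with three local ancestry windows.
windows = [
    (0, 1_000_000, "AFR", "EUR"),
    (1_000_000, 3_000_000, "AFR", "NAMA"),
    (3_000_000, 4_000_000, "SAS", "AFR"),
]
props = ancestry_proportions(windows)
print(props)
```

Because the proportions for each individual sum to one, including all five ancestries as covariates would make them linearly dependent, which is why one component has to be left out of a regression model.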

Polygenic risk score analysis: case status

For this analysis, clumping was conducted using window sizes of 100kb, 250kb, and 500kb; LD thresholds (r²) of 0.1, 0.2, 0.5, and 0.8; as well as the p-value thresholds of 1 × 10⁻³, 1 × 10⁻⁵, 1 × 10⁻⁶, and 5 × 10⁻⁸. The 1000 Genomes EUR reference panel was used for LD calculations in both base datasets, as the Nalls et al. (2019) study is EUR-based and approximately 62% of participants in the Kim et al. (2024) study were of EUR ancestry [7,17,46]. All analyses were performed using logistic regression, adjusting for sex, age, and ancestral components (AFR, EUR, NAMA, SAS). To prevent perfect multicollinearity, we excluded the MAL ancestry from the covariates, as it represented the smallest ancestral contribution. To obtain robust significance estimates, empirical p-values were calculated using 10,000 phenotype permutations. The initial search for optimal PRS parameters was conducted on 70% of the dataset (training cohort), and the best-performing parameters and variants included in the model were then applied to fit a new model on the remaining 30% (validation cohort). The AUC was used to evaluate the performance of each PRS model using the pROC package [47] in R v4.2.0, providing a quantitative metric for comparing models [48]. Additionally, predicted probabilities from a logistic regression model were converted to binary disease status using a fixed threshold of 0.5, and model performance was evaluated using accuracy, balanced accuracy, sensitivity, and specificity calculated at this threshold. Finally, the positive and negative predictive values were calculated at multiple top-percentile thresholds (5%, 10%, and 20%) using the global PD prevalence of 1.386 × 10⁻⁴ [49].
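Two pieces of the procedure above lend themselves to a short sketch: enumerating the clumping and thresholding grid, and converting sensitivity and specificity into predictive values at a given prevalence using Bayes' rule. The code is illustrative only; the sensitivity and specificity passed to `predictive_values` are hypothetical, and the function name is not taken from the study's scripts.

```python
from itertools import product
from typing import Tuple

# Parameter values as listed in the text.
windows_kb = [100, 250, 500]
r2_thresholds = [0.1, 0.2, 0.5, 0.8]
p_thresholds = [1e-3, 1e-5, 1e-6, 5e-8]

# Every (window, r2, p-value) combination to evaluate in the training cohort.
grid = list(product(windows_kb, r2_thresholds, p_thresholds))
print(len(grid))  # 48


def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> Tuple[float, float]:
    """Return (PPV, NPV) for a test applied at a given population prevalence."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical sensitivity/specificity at a top-percentile cut-off, combined
# with the prevalence stated in the text.
ppv, npv = predictive_values(0.10, 0.95, 1.386e-4)
print(ppv < 0.01, npv > 0.99)  # True True
```

With a prevalence this low, false positives among the many unaffected individuals outnumber true positives, so a low positive predictive value and a very high negative predictive value are expected almost regardless of the score's discriminative ability.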

Assessment of covariate variance

We evaluated the contribution of different covariate combinations to the variance explained. Using PRSice-2 outputs, we examined the variance models across seven covariate inclusion scenarios, each incorporating a distinct combination of the following variables: age, sex, and ANC. This analysis allowed us to quantify the incremental variance explained by each covariate and their combinations, providing insight into their effects on risk prediction.
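The seven scenarios are not listed individually in the text. The sketch below assumes they correspond to the non-empty subsets of the three variables (2^3 − 1 = 7) and enumerates them; the function name covariate_scenarios is hypothetical.

```python
from itertools import combinations

COVARIATES = ("age", "sex", "ANC")


def covariate_scenarios(covariates=COVARIATES):
    """Return every non-empty subset of the covariates, smallest first."""
    return [
        subset
        for size in range(1, len(covariates) + 1)
        for subset in combinations(covariates, size)
    ]


scenarios = covariate_scenarios()
for scenario in scenarios:
    print(" + ".join(scenario))
```

Under this assumption, the scenarios comprise three single-covariate models, three two-covariate models, and one model containing all three covariates.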

Polygenic risk score analysis: age at onset

PRS for PD age at onset (AAO) were generated using PRSice-2. The analysis used the same 70/30 training-validation split and the same PRS parameters described above; the only modification was that AAO was modelled as a continuous phenotype rather than as a binary outcome (as in the PRS for case status). The same covariates were used, and LD was calculated with the EUR reference panel. The analysis was performed with both base datasets in both the training and validation cohorts.

Internal polygenic risk score analysis using South African summary statistics

We constructed a PD PRS using summary statistics from our internal GWAS [16]. Variants were selected based on a suggestive significance threshold (p < 5 × 10^-6) to capture loci with a potential contribution to PD risk within this population and to assess how predictive performance compares with that obtained from the external summary statistics. The analysis was performed in the same 70/30 split using the optimal thresholds determined above. Predictive performance of the PRS was assessed by calculating the AUC.
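As an illustration of threshold-based variant selection, the sketch below filters summary statistics at p < 5 × 10^-6 and computes a simple additive score (effect size multiplied by effect-allele dosage, summed over variants). It is not PRSice-2's implementation, which applies its own clumping and scoring options. The variant identifiers, effect sizes, p-values, and function names are invented for the example.

```python
def select_variants(summary_stats, p_threshold=5e-6):
    """Keep variants whose GWAS p-value is below the threshold."""
    return {
        snp: beta
        for snp, (beta, p_value) in summary_stats.items()
        if p_value < p_threshold
    }


def additive_prs(dosages, weights):
    """Sum effect size times effect-allele dosage over the selected variants."""
    # Variants absent from an individual's dosages are skipped.
    return sum(
        beta * dosages[snp] for snp, beta in weights.items() if snp in dosages
    )


# Hypothetical summary statistics: variant -> (effect size, p-value).
summary_stats = {
    "rsA": (0.30, 1e-7),
    "rsB": (-0.20, 4e-6),
    "rsC": (0.10, 2e-3),
}
weights = select_variants(summary_stats)
score = additive_prs({"rsA": 2, "rsB": 1, "rsC": 2}, weights)
print(sorted(weights), round(score, 3))
```

In this toy example the third variant is excluded because its p-value exceeds the threshold, so only the first two variants contribute to the score.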

Supporting information

S1 Fig. Plots generated using Nalls et al. (2019) as the base dataset.

(A) PRS R² across clumping thresholds. The variance explained (R²) by the polygenic risk score at different GWAS p-value thresholds, stratified by linkage disequilibrium clumping parameters (r² and kb). Each panel corresponds to a clumping window size (kb), with points and lines indicating R² across p-value thresholds. (B) PRS significance versus predictive performance. Relationship between PRS model significance (PRS association p-value; log-scaled) and variance explained (R²). Colors indicate clumping window sizes (kb) and shapes indicate LD thresholds (r²), highlighting the trade-off between model fit and predictive power.

https://doi.org/10.1371/journal.pgen.1012064.s001

(TIFF)

S2 Fig. Plots generated using Kim et al. (2024) as the base dataset.

(A) PRS R² across clumping thresholds. The variance explained (R²) by the polygenic risk score at different GWAS p-value thresholds, stratified by linkage disequilibrium clumping parameters (r² and kb). Each panel corresponds to a clumping window size (kb), with points and lines indicating R² across p-value thresholds. (B) PRS significance versus predictive performance. Relationship between PRS model significance (PRS association p-value; log-scaled) and variance explained (R²). Colors indicate clumping window sizes (kb) and shapes indicate LD thresholds (r²), highlighting the trade-off between model fit and predictive power.

https://doi.org/10.1371/journal.pgen.1012064.s002

(TIFF)

S3 Fig. Density plots showing the polygenic risk score distribution for cases versus controls.

(A) Density plot with Nalls et al. (2019) as the base dataset. (B) Density plot with Kim et al. (2024) as the base dataset.

https://doi.org/10.1371/journal.pgen.1012064.s003

(TIFF)

S4 Fig. Bar plot showing the proportion of phenotypic variance (R²) explained by the polygenic risk score (PRS) at varying SNP inclusion p-value thresholds.

Each bar represents a PRS model fit calculated at a specific threshold. The optimal threshold (defined as the point with the highest R²) is highlighted, indicating the most predictive model. (A) Results shown are based on the training dataset using summary statistics from Nalls et al., 2019 for the strongest association. (B) Results shown are based on the training dataset using summary statistics from Nalls et al., 2019 for the highest predictive performance. (C) Results shown are based on the validation dataset using summary statistics from Nalls et al., 2019 based on the highest predictive performance thresholds from the training dataset.

https://doi.org/10.1371/journal.pgen.1012064.s004

(TIFF)

S5 Fig. Bar plot showing the proportion of phenotypic variance (R²) explained by the polygenic risk score (PRS) at varying SNP inclusion p-value thresholds.

Each bar represents a PRS model fit calculated at a specific threshold. The optimal threshold (defined as the point with the highest R²) is highlighted, indicating the most predictive model. (A) Results shown are based on the training dataset using summary statistics from Kim et al., 2024 for the strongest association. (B) Results shown are based on the training dataset using summary statistics from Kim et al., 2024 for the highest predictive performance. (C) Results shown are based on the validation dataset using summary statistics from Kim et al., 2024 based on the highest predictive performance thresholds from the training dataset.

https://doi.org/10.1371/journal.pgen.1012064.s005

(TIFF)

S6 Fig. Karyograms showing the inferred local ancestry of three individuals from the South African Parkinson’s Disease Study Collection.

This highlights the complex admixture of the cohort. AFR, African; EAS, East Asian; EUR, European; MAL, Malaysian; NAMA, Nama; POP, Population; SAS, South Asian.

https://doi.org/10.1371/journal.pgen.1012064.s006

(TIFF)

S1 Table. Positive predictive value and negative predictive value of polygenic risk scores for Parkinson’s disease at multiple top-percentile thresholds.

https://doi.org/10.1371/journal.pgen.1012064.s007

(XLSX)

S2 Table. Study participants included in the polygenic risk score analysis.

https://doi.org/10.1371/journal.pgen.1012064.s008

(XLSX)

S3 Table. Stability of PRS performance estimates across 20 repeated random splits.

https://doi.org/10.1371/journal.pgen.1012064.s009

(XLSX)

S4 Table. Global Parkinson’s Genetics Program banner author list.

https://doi.org/10.1371/journal.pgen.1012064.s010

(XLSX)

Acknowledgments

We would like to acknowledge and thank the study participants for their contribution. For a complete list of GP2 members see https://zenodo.org/records/17753486. All figures were created using BioRender (https://www.biorender.com/). We would like to acknowledge Lim Shen-Yang, Tan Ai-Huey, and Azlina Ahmad-Annuar for their efforts in recruiting study participants from Malaysia. We thank Kate Andersh, Laurel Screven, and Kim Paquette for their work as scientific project managers for this project. We would like to acknowledge Dr. Thiago Peixoto Leal for his work in developing the scripts used in the analysis. We also acknowledge the Centre for High Performance Computing (CHPC), South Africa, for providing computational resources.

This research was supported in part by the Intramural Research Program of the National Institutes of Health (NIH). The contributions of the NIH author(s) were made as part of their official duties as NIH federal employees, are in compliance with agency policy requirements, and are considered Works of the United States Government. However, the findings and conclusions presented in this paper are those of the author(s) and do not necessarily reflect the views of the NIH or the U.S. Department of Health and Human Services.

References

1. Bloem BR, Okun MS, Klein C. Parkinson’s disease. Lancet. 2021;397(10291):2284–303.
2. Chen H, Ritz B. The Search for Environmental Causes of Parkinson’s Disease: Moving Forward. J Parkinsons Dis. 2018;8(s1):S9–17. pmid:30584168
3. Ascherio A, Schwarzschild MA. The epidemiology of Parkinson’s disease: risk factors and prevention. Lancet Neurol. 2016;15(12):1257–72. pmid:27751556
4. Jia F, Fellner A, Kumar KR. Monogenic Parkinson’s disease: Genotype, phenotype, pathophysiology, and genetic testing. Genes. 2022;13(3).
5. Klein C, Westenberger A. Genetics of Parkinson’s disease. Cold Spring Harb Perspect Med. 2012;2(1):a008888.
6. Reed X, Bandrés-Ciga S, Blauwendraat C, Cookson MR. The role of monogenic genes in idiopathic Parkinson’s disease. Neurobiol Dis. 2019;124:230–9. pmid:30448284
7. Nalls MA, Blauwendraat C, Vallerga CL, Heilbron K, Bandres-Ciga S, Chang D, et al. Identification of novel risk loci, causal insights, and heritable risk for Parkinson’s disease: a meta-analysis of genome-wide association studies. Lancet Neurol. 2019;18(12):1091–102. pmid:31701892
8. Step K, Ndong Sima CAA, Mata I, Bardien S. Exploring the role of underrepresented populations in polygenic risk scores for neurodegenerative disease risk prediction. Front Neurosci. 2024;18:1380860. pmid:38859922
9. Ndong Sima CAA, Step K, Swart Y, Schurz H, Uren C, Möller M. Methodologies underpinning polygenic risk scores estimation: a comprehensive overview. Hum Genet. 2024;143(11):1265–80. pmid:39425790
10. Pihlstrøm L, Morset KR, Grimstad E, Vitelli V, Toft M. A cumulative genetic risk score predicts progression in Parkinson’s disease: Genetic Risk and Progression in Parkinson’s Disease. Mov Disord. 2016;31(4):487–90.
11. Li W-W, Fan D-Y, Shen Y-Y, Zhou F-Y, Chen Y, Wang Y-R, et al. Association of the Polygenic Risk Score with the Incidence Risk of Parkinson’s Disease and Cerebrospinal Fluid α-Synuclein in a Chinese Cohort. Neurotox Res. 2019;36(3):515–22. pmid:31209785
12. Foo JN, Chew EGY, Chung SJ, Peng R, Blauwendraat C, Nalls MA, et al. Identification of Risk Loci for Parkinson Disease in Asians and Comparison of Risk Between Asians and Europeans: A Genome-Wide Association Study. JAMA Neurol. 2020;77(6):746–54. pmid:32310270
13. Ibanez L, Dube U, Saef B, Budde J, Black K, Medvedeva A, et al. Parkinson disease polygenic risk score is associated with Parkinson disease status and age at onset but not with alpha-synuclein cerebrospinal fluid levels. BMC Neurol. 2017;17(1):198. pmid:29141588
14. Loesch DP, Horimoto ARVR, Sarihan EI, Inca-Martinez M, Mason E, Cornejo-Olivas M, et al. Polygenic risk prediction and SNCA haplotype analysis in a Latino Parkinson’s disease cohort. Parkinsonism Relat Disord. 2022;102:7–15. pmid:35917738
15. Dehestani M, Liu H, Gasser T. Polygenic Risk Scores Contribute to Personalized Medicine of Parkinson’s Disease. J Pers Med. 2021;11(10):1030. pmid:34683174
16. Step K, Leal TP, Waldo E, Madula L, Swart Y, Hernández CF, et al. Genome-wide association analyses reveal susceptibility variants linked to Parkinson’s disease in the South African population using inferred global and local ancestry. medRxiv. 2025. Available from:
17. Kim JJ, Vitale D, Otani DV, Lian MM, Heilbron K, 23andMe Research Team, et al. Multi-ancestry genome-wide association meta-analysis of Parkinson’s disease. Nat Genet. 2024;56(1):27–36. pmid:38155330
18. Staerk C, Klinkhammer H, Wistuba T, Maj C, Mayr A. Generalizability of polygenic prediction models: how is the R2 defined on test data? BMC Med Genomics. 2024;17(1):132.
19. Nalls MA, McLean CY, Rick J, Eberly S, Hutten SJ, Gwinn K, et al. Diagnosis of Parkinson’s disease on the basis of clinical and genetic classification: a population-based modelling study. Lancet Neurol. 2015;14(10):1002–9. pmid:26271532
20. Saffie-Awad P, Grant SM, Makarious MB, Elsayed I, Sanyaolu AO, Crea PW, et al. Insights into ancestral diversity in Parkinson’s disease risk: a comparative assessment of polygenic risk scores. NPJ Parkinsons Dis. 2025;11(1):201. pmid:40610451
21. Wang Y, Tsuo K, Kanai M, Neale BM, Martin AR. Challenges and Opportunities for Developing More Generalizable Polygenic Risk Scores. Annu Rev Biomed Data Sci. 2022;5:293–320. pmid:35576555
22. Ge T, Chen C-Y, Ni Y, Feng Y-CA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun. 2019;10(1):1776. pmid:30992449
23. Ruan Y, Lin Y-F, Feng Y-CA, Chen C-Y, Lam M, Guo Z, et al. Improving polygenic prediction in ancestrally diverse populations. Nat Genet. 2022;54(5):573–80. pmid:35513724
24. Mavaddat N, Michailidou K, Dennis J, Lush M, Fachal L, Lee A. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. Am J Hum Genet. 2019;104(1):21–34.
25. Tremblay J, Haloui M, Attaoua R, Tahir R, Hishmih C, Harvey F, et al. Polygenic risk scores predict diabetes complications and their response to intensive blood pressure and glucose control. Diabetologia. 2021;64(9):2012–25. pmid:34226943
26. Ayoub A, McHugh J, Hayward J, Rafi I, Qureshi N. Polygenic risk scores: improving the prediction of future disease or added complexity? Br J Gen Pract. 2022;72(721):396–8. pmid:35902257
27. Roberts MC, Khoury MJ, Mensah GA. Perspective: The Clinical Use of Polygenic Risk Scores: Race, Ethnicity, and Health Disparities. Ethn Dis. 2019;29(3):513–6. pmid:31367172
28. van Rensburg ZJ, Abrahams S, Chetty D, Step K, Acker D, Lombard CJ, et al. The South African Parkinson’s Disease Study Collection. Mov Disord. 2022;37(1):230–2.
29. Hughes AJ, Daniel SE, Kilford L, Lees AJ. Accuracy of clinical diagnosis of idiopathic Parkinson’s disease: a clinico-pathological study of 100 cases. J Neurol Neurosurg Psychiatry. 1992;55(3):181–4. pmid:1564476
30. Global Parkinson’s Genetics Program. GP2: The global Parkinson’s genetics program. Mov Disord. 2021;36(4):842–51.
31. Bandres-Ciga S, Faghri F, Majounie E, Koretsky MJ, Kim J, Levine KS, et al. NeuroBooster Array: A Genome-Wide Genotyping Platform to Study Neurological Disorders Across Diverse Populations. Mov Disord. 2024;39(11):2039–48. pmid:39283294
32. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75. pmid:17701901
33. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. pmid:25722852
34. Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, et al. Next-generation genotype imputation service and methods. Nat Genet. 2016;48(10):1284–7. pmid:27571263
35. Leal TP, Furlan VC, Gouveia MH, Saraiva Duarte JM, Fonseca PA, Tou R, et al. NAToRA, a relatedness-pruning method to minimize the loss of dataset size in genetic and omics analyses. Comput Struct Biotechnol J. 2022;20:1821–8.
36. Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen W-M. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26(22):2867–73. pmid:20926424
37. Swart Y, Uren C, van Helden PD, Hoal EG, Möller M. Local Ancestry Adjusted Allelic Association Analysis Robustly Captures Tuberculosis Susceptibility Loci. Front Genet. 2021;12:716558. pmid:34721521
38. Byrska-Bishop M, Evani US, Zhao X, Basile AO, Abel HJ, Regier AA, et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell. 2022;185(18):3426-3440.e19. pmid:36055201
39. Shriner D, Bentley AR, Gouveia MH, Heuston EF, Doumatey AP, Chen G, et al. Universal genome-wide association studies: Powerful joint ancestry and association testing. HGG Adv. 2023;4(4):100235. pmid:37653728
40. Ragsdale AP, Weaver TD, Atkinson EG, Hoal EG, Möller M, Henn BM, et al. Publisher Correction: A weakly structured stem for human origins in Africa. Nature. 2023;620(7972):E11. pmid:37460744
41. Hilmarsson H, Kumar AS, Rastogi R, Bustamante CD, Montserrat DM, Ioannidis AG. High resolution ancestry deconvolution for next generation genomic data. bioRxiv. 2021. http://dx.doi.org/10.1101/2021.09.19.460980
42. Choi SW, Mak TS-H, O’Reilly PF. Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc. 2020;15(9):2759–72. pmid:32709988
43. Choi SW, O’Reilly PF. PRSice-2: Polygenic Risk Score software for biobank-scale data. Gigascience. 2019;8(7).
44. Cerezo M, Sollis E, Ji Y, Lewis E, Abid A, Bircan KO, et al. The NHGRI-EBI GWAS Catalog: standards for reusability, sustainability and diversity. Nucleic Acids Res. 2025;53(D1):D998–1005. pmid:39530240
45. Perez G, Barber GP, Benet-Pages A, Casper J, Clawson H, Diekhans M, et al. The UCSC Genome Browser database: 2025 update. Nucleic Acids Res. 2025;53(D1):D1243–9.
46. 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. pmid:26432245
47. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77. pmid:21414208
48. Konuma T, Okada Y. Statistical genetics and polygenic risk score for precision medicine. Inflamm Regen. 2021;41(1):18. pmid:34140035
49. Ou Z, Pan J, Tang S, Duan D, Yu D, Nong H, et al. Global trends in the incidence, prevalence, and years lived with disability of Parkinson’s disease in 204 countries/territories from 1990 to 2019. Front Public Health. 2021;9:776847.

[ref1] 1. Bloem BR, Okun MS, Klein C. Parkinson’s disease. Lancet. 2021;397(10291):2284–303.
View Article
Google Scholar

[2] View Article

[3] Google Scholar

[ref2] 2. Chen H, Ritz B. The Search for Environmental Causes of Parkinson’s Disease: Moving Forward. J Parkinsons Dis. 2018;8(s1):S9–17. pmid:30584168
View Article
PubMed/NCBI
Google Scholar

[5] View Article

[6] PubMed/NCBI

[7] Google Scholar

[ref3] 3. Ascherio A, Schwarzschild MA. The epidemiology of Parkinson’s disease: risk factors and prevention. Lancet Neurol. 2016;15(12):1257–72. pmid:27751556
View Article
PubMed/NCBI
Google Scholar

[9] View Article

[10] PubMed/NCBI

[11] Google Scholar

[ref4] 4. Jia F, Fellner A, Kumar KR. Monogenic Parkinson’s disease: Genotype, phenotype, pathophysiology, and genetic testing. Genes. 2022;13(3).
View Article
Google Scholar

[13] View Article

[14] Google Scholar

[ref5] 5. Klein C, Westenberger A. Genetics of Parkinson’s disease. Cold Spring Harb Perspect Med. 2012;2(1):a008888.
View Article
Google Scholar

[16] View Article

[17] Google Scholar

[ref6] 6. Reed X, Bandrés-Ciga S, Blauwendraat C, Cookson MR. The role of monogenic genes in idiopathic Parkinson’s disease. Neurobiol Dis. 2019;124:230–9. pmid:30448284
View Article
PubMed/NCBI
Google Scholar

[19] View Article

[20] PubMed/NCBI

[21] Google Scholar

[ref7] 7. Nalls MA, Blauwendraat C, Vallerga CL, Heilbron K, Bandres-Ciga S, Chang D, et al. Identification of novel risk loci, causal insights, and heritable risk for Parkinson’s disease: a meta-analysis of genome-wide association studies. Lancet Neurol. 2019;18(12):1091–102. pmid:31701892
View Article
PubMed/NCBI
Google Scholar

[23] View Article

[24] PubMed/NCBI

[25] Google Scholar

[ref8] 8. Step K, Ndong Sima CAA, Mata I, Bardien S. Exploring the role of underrepresented populations in polygenic risk scores for neurodegenerative disease risk prediction. Front Neurosci. 2024;18:1380860. pmid:38859922
View Article
PubMed/NCBI
Google Scholar

[27] View Article

[28] PubMed/NCBI

[29] Google Scholar

[ref9] 9. Ndong Sima CAA, Step K, Swart Y, Schurz H, Uren C, Möller M. Methodologies underpinning polygenic risk scores estimation: a comprehensive overview. Hum Genet. 2024;143(11):1265–80. pmid:39425790
View Article
PubMed/NCBI
Google Scholar

[31] View Article

[32] PubMed/NCBI

[33] Google Scholar

[ref10] 10. Pihlstrøm L, Morset KR, Grimstad E, Vitelli V, Toft M. A cumulative genetic risk score predicts progression in Parkinson’s disease: Genetic Risk and Progression in Parkinson’s Disease. Mov Disord. 2016;31(4):487–90.
View Article
Google Scholar

[35] View Article

[36] Google Scholar

[ref11] 11. Li W-W, Fan D-Y, Shen Y-Y, Zhou F-Y, Chen Y, Wang Y-R, et al. Association of the Polygenic Risk Score with the Incidence Risk of Parkinson’s Disease and Cerebrospinal Fluid α-Synuclein in a Chinese Cohort. Neurotox Res. 2019;36(3):515–22. pmid:31209785
View Article
PubMed/NCBI
Google Scholar

[38] View Article

[39] PubMed/NCBI

[40] Google Scholar

[ref12] 12. Foo JN, Chew EGY, Chung SJ, Peng R, Blauwendraat C, Nalls MA, et al. Identification of Risk Loci for Parkinson Disease in Asians and Comparison of Risk Between Asians and Europeans: A Genome-Wide Association Study. JAMA Neurol. 2020;77(6):746–54. pmid:32310270
View Article
PubMed/NCBI
Google Scholar

[42] View Article

[43] PubMed/NCBI

[44] Google Scholar

[ref13] 13. Ibanez L, Dube U, Saef B, Budde J, Black K, Medvedeva A, et al. Parkinson disease polygenic risk score is associated with Parkinson disease status and age at onset but not with alpha-synuclein cerebrospinal fluid levels. BMC Neurol. 2017;17(1):198. pmid:29141588
View Article
PubMed/NCBI
Google Scholar

[46] View Article

[47] PubMed/NCBI

[48] Google Scholar

[ref14] 14. Loesch DP, Horimoto ARVR, Sarihan EI, Inca-Martinez M, Mason E, Cornejo-Olivas M, et al. Polygenic risk prediction and SNCA haplotype analysis in a Latino Parkinson’s disease cohort. Parkinsonism Relat Disord. 2022;102:7–15. pmid:35917738
View Article
PubMed/NCBI
Google Scholar

[50] View Article

[51] PubMed/NCBI

[52] Google Scholar

[ref15] 15. Dehestani M, Liu H, Gasser T. Polygenic Risk Scores Contribute to Personalized Medicine of Parkinson’s Disease. J Pers Med. 2021;11(10):1030. pmid:34683174
View Article
PubMed/NCBI
Google Scholar

[54] View Article

[55] PubMed/NCBI

[56] Google Scholar

[ref16] 16. Step K, Leal TP, Waldo E, Madula L, Swart Y, Hernández CF, et al. Genome-wide association analyses reveal susceptibility variants linked to Parkinson’s disease in the South African population using inferred global and local ancestry. medRxiv. 2025. Available from:
View Article
Google Scholar

[58] View Article

[59] Google Scholar

[ref17] 17. Kim JJ, Vitale D, Otani DV, Lian MM, Heilbron K, 23andMe Research Team, et al. Multi-ancestry genome-wide association meta-analysis of Parkinson’s disease. Nat Genet. 2024;56(1):27–36. pmid:38155330
View Article
PubMed/NCBI
Google Scholar

[61] View Article

[62] PubMed/NCBI

[63] Google Scholar

[ref18] 18. Staerk C, Klinkhammer H, Wistuba T, Maj C, Mayr A. Generalizability of polygenic prediction models: how is the R2 defined on test data? BMC Med Genomics. 2024 May 16;17(1):132.
View Article
Google Scholar

[65] View Article

[66] Google Scholar

[ref19] 19. Nalls MA, McLean CY, Rick J, Eberly S, Hutten SJ, Gwinn K, et al. Diagnosis of Parkinson’s disease on the basis of clinical and genetic classification: a population-based modelling study. Lancet Neurol. 2015;14(10):1002–9. pmid:26271532
View Article
PubMed/NCBI
Google Scholar

[68] View Article

[69] PubMed/NCBI

[70] Google Scholar

[ref20] 20. Saffie-Awad P, Grant SM, Makarious MB, Elsayed I, Sanyaolu AO, Crea PW, et al. Insights into ancestral diversity in Parkinson’s disease risk: a comparative assessment of polygenic risk scores. NPJ Parkinsons Dis. 2025;11(1):201. pmid:40610451
View Article
PubMed/NCBI
Google Scholar

[72] View Article

[73] PubMed/NCBI

[74] Google Scholar

[ref21] 21. Wang Y, Tsuo K, Kanai M, Neale BM, Martin AR. Challenges and Opportunities for Developing More Generalizable Polygenic Risk Scores. Annu Rev Biomed Data Sci. 2022;5:293–320. pmid:35576555
View Article
PubMed/NCBI
Google Scholar

[76] View Article

[77] PubMed/NCBI

[78] Google Scholar

[ref22] 22. Ge T, Chen C-Y, Ni Y, Feng Y-CA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun. 2019;10(1):1776. pmid:30992449
View Article
PubMed/NCBI
Google Scholar

[80] View Article

[81] PubMed/NCBI

[82] Google Scholar

[ref23] 23. Ruan Y, Lin Y-F, Feng Y-CA, Chen C-Y, Lam M, Guo Z, et al. Improving polygenic prediction in ancestrally diverse populations. Nat Genet. 2022;54(5):573–80. pmid:35513724
View Article
PubMed/NCBI
Google Scholar

[84] View Article

[85] PubMed/NCBI

[86] Google Scholar

[ref24] 24. Mavaddat N, Michailidou K, Dennis J, Lush M, Fachal L, Lee A. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. Am J Hum Genet. 2019;104(1):21–34.
View Article
Google Scholar

[88] View Article

[89] Google Scholar

[ref25] 25. Tremblay J, Haloui M, Attaoua R, Tahir R, Hishmih C, Harvey F, et al. Polygenic risk scores predict diabetes complications and their response to intensive blood pressure and glucose control. Diabetologia. 2021;64(9):2012–25. pmid:34226943
View Article
PubMed/NCBI
Google Scholar

[91] View Article

[92] PubMed/NCBI

[93] Google Scholar

[ref26] 26. Ayoub A, McHugh J, Hayward J, Rafi I, Qureshi N. Polygenic risk scores: improving the prediction of future disease or added complexity? Br J Gen Pract. 2022;72(721):396–8. pmid:35902257
View Article
PubMed/NCBI
Google Scholar

[95] View Article

[96] PubMed/NCBI

[97] Google Scholar

[ref27] 27. Roberts MC, Khoury MJ, Mensah GA. Perspective: The Clinical Use of Polygenic Risk Scores: Race, Ethnicity, and Health Disparities. Ethn Dis. 2019;29(3):513–6. pmid:31367172
View Article
PubMed/NCBI
Google Scholar

[99] View Article

[100] PubMed/NCBI

[101] Google Scholar

[ref28] 28. van Rensburg ZJ, Abrahams S, Chetty D, Step K, Acker D, Lombard CJ, et al. The South African Parkinson’s Disease Study Collection. Mov Disord. 202;37(1):230–2.
View Article
Google Scholar

[103] View Article

[104] Google Scholar

[ref29] 29. Hughes AJ, Daniel SE, Kilford L, Lees AJ. Accuracy of clinical diagnosis of idiopathic Parkinson’s disease: a clinico-pathological study of 100 cases. J Neurol Neurosurg Psychiatry. 1992;55(3):181–4. pmid:1564476
View Article
PubMed/NCBI
Google Scholar

[106] View Article

[107] PubMed/NCBI

[108] Google Scholar

[ref30] 30. Global Parkinson’s Genetics Program. GP2: The global Parkinson’s genetics program. Mov Disord. 2021;36(4):842–51.
View Article
Google Scholar

[110] View Article

[111] Google Scholar

[ref31] 31. Bandres-Ciga S, Faghri F, Majounie E, Koretsky MJ, Kim J, Levine KS, et al. NeuroBooster Array: A Genome-Wide Genotyping Platform to Study Neurological Disorders Across Diverse Populations. Mov Disord. 2024;39(11):2039–48. pmid:39283294
View Article
PubMed/NCBI
Google Scholar

[113] View Article

[114] PubMed/NCBI

[115] Google Scholar

[ref32] 32. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75. pmid:17701901
View Article
PubMed/NCBI
Google Scholar

[117] View Article

[118] PubMed/NCBI

[119] Google Scholar

[ref33] 33. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. pmid:25722852
View Article
PubMed/NCBI
Google Scholar

[121] View Article

[122] PubMed/NCBI

[123] Google Scholar

[ref34] 34. Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, et al. Next-generation genotype imputation service and methods. Nat Genet. 2016;48(10):1284–7. pmid:27571263
View Article
PubMed/NCBI
Google Scholar

[125] View Article

[126] PubMed/NCBI

[127] Google Scholar

[ref35] 35. Leal TP, Furlan VC, Gouveia MH, Saraiva Duarte JM, Fonseca PA, Tou R, et al. NAToRA, a relatedness-pruning method to minimize the loss of dataset size in genetic and omics analyses. Comput Struct Biotechnol J. 2022;20:1821–8.
View Article
Google Scholar

[129] View Article

[130] Google Scholar

[ref36] 36. Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen W-M. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26(22):2867–73. pmid:20926424
View Article
PubMed/NCBI
Google Scholar

[132] View Article

[133] PubMed/NCBI

[134] Google Scholar

[ref37] 37. Swart Y, Uren C, van Helden PD, Hoal EG, Möller M. Local Ancestry Adjusted Allelic Association Analysis Robustly Captures Tuberculosis Susceptibility Loci. Front Genet. 2021;12:716558. pmid:34721521
View Article
PubMed/NCBI
Google Scholar

[136] View Article

[137] PubMed/NCBI

[138] Google Scholar

[ref38] 38. Byrska-Bishop M, Evani US, Zhao X, Basile AO, Abel HJ, Regier AA, et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell. 2022;185(18):3426-3440.e19. pmid:36055201
View Article
PubMed/NCBI
Google Scholar

[140] View Article

[141] PubMed/NCBI

[142] Google Scholar

[ref39] 39. Shriner D, Bentley AR, Gouveia MH, Heuston EF, Doumatey AP, Chen G, et al. Universal genome-wide association studies: Powerful joint ancestry and association testing. HGG Adv. 2023;4(4):100235. pmid:37653728
View Article
PubMed/NCBI
Google Scholar

[144] View Article

[145] PubMed/NCBI

[146] Google Scholar

[ref40] 40. Ragsdale AP, Weaver TD, Atkinson EG, Hoal EG, Möller M, Henn BM, et al. Publisher Correction: A weakly structured stem for human origins in Africa. Nature. 2023;620(7972):E11. pmid:37460744
View Article
PubMed/NCBI
Google Scholar

[148] View Article

[149] PubMed/NCBI

[150] Google Scholar

[ref41] 41. Hilmarsson H, Kumar AS, Rastogi R, Bustamante CD, Montserrat DM, Ioannidis AG. High resolution ancestry deconvolution for next generation genomic data. bioRxiv. 2021. http://dx.doi.org/10.1101/2021.09.19.460980
View Article
Google Scholar

[152] View Article

[153] Google Scholar

[ref42] 42. Choi SW, Mak TS-H, O’Reilly PF. Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc. 2020;15(9):2759–72. pmid:32709988
View Article
PubMed/NCBI
Google Scholar

[155] View Article

[156] PubMed/NCBI

[157] Google Scholar

[ref43] 43. Choi SW, O’Reilly PF. PRSice-2: Polygenic Risk Score software for biobank-scale data. Gigascience. 2019;8(7).
View Article
Google Scholar

[159] View Article

[160] Google Scholar

[ref44] 44. Cerezo M, Sollis E, Ji Y, Lewis E, Abid A, Bircan KO, et al. The NHGRI-EBI GWAS Catalog: standards for reusability, sustainability and diversity. Nucleic Acids Res. 2025;53(D1):D998–1005. pmid:39530240

[ref45] 45. Perez G, Barber GP, Benet-Pages A, Casper J, Clawson H, Diekhans M, et al. The UCSC Genome Browser database: 2025 update. Nucleic Acids Res. 2025;53(D1):D1243–9.

[ref46] 46. 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. pmid:26432245

[ref47] 47. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77. pmid:21414208

[ref48] 48. Konuma T, Okada Y. Statistical genetics and polygenic risk score for precision medicine. Inflamm Regen. 2021;41(1):18. pmid:34140035

[ref49] 49. Ou Z, Pan J, Tang S, Duan D, Yu D, Nong H, et al. Global trends in the incidence, prevalence, and years lived with disability of Parkinson’s disease in 204 countries/territories from 1990 to 2019. Front Public Health. 2021;9:776847.


