Benchmarking multi-ancestry prostate cancer polygenic risk scores in a real-world cohort

Prostate cancer is a heritable disease with ancestry-biased incidence and mortality. Polygenic risk scores (PRSs) offer promising advancements in predicting disease risk, including prostate cancer. While their accuracy continues to improve, research aimed at enhancing their effectiveness within African and Asian populations remains key for equitable use. Recent algorithmic developments for PRS derivation have resulted in improved pan-ancestral risk prediction for several diseases. In this study, we benchmark the predictive power of six widely used PRS derivation algorithms, including four of which adjust for ancestry, against prostate cancer cases and controls from the UK Biobank and All of Us cohorts. We find modest improvement in discriminatory ability when compared with a simple method that prioritizes variants, clumping, and published polygenic risk scores. Our findings underscore the importance of improving upon risk prediction algorithms and the sampling of diverse cohorts.


Introduction
Prostate cancer (PCa) is the most commonly diagnosed tumor type in American men, accounting for 26% of all cancer cases [1].Several genetic and environmental factors such as age, body mass index (BMI), family history and ancestry are strong risk factors for developing PCa [2].Men of African ancestry have 1.5-fold higher incidence and those of Asian ancestry have a decreased incidence rate when compared to non-Hispanic white men [3,4].However, the genetic basis for ancestry-linked disparities remains unclear.PCa is a heritable cancer where 58% of disease risk has been linked to genetic factors [5].It is hypothesized that genetic predisposition may contribute to ancestry-linked risk.
Although genome-wide associations studies (GWASs) have identified several single nucleotide polymorphisms (SNPs) associated with PCa risk, they have primarily been conducted in European-ancestry cohorts [6][7][8].Polygenic risk scores (PRSs), which combine risk associated with multiple variants into a single risk score, are usually created using SNPs derived from these GWASs.Consequently, PRSs may share the population-bias of the European ancestrybased GWAS and cannot be confidently applied to non-European populations.These biases are associated with differences in linkage disequilibrium, causal variant status and differing causal variant allele frequencies across populations [9].Admixture mapping and meta analyses have revealed rare and common risk alleles specific to men of African ancestry [10][11][12].The challenge of integrating genetic findings with the well-powered European-based GWAS results has led to the development of several methods that correct the biases within the European ancestry results [13][14][15][16][17].The variability in complexity and correction efficiency across these methods presents an opportunity to systematically analyze their performance in PCa.
PRS-stratified prostate-specific antigen (PSA) screening may help reduce overdiagnosis and subsequent over-treatment associated with PSA screening alone [18].However, given ancestry-associated PCa incidence and outcome biases, it is imperative that risk scores work equally well across populations in order to limit disparities.
In this study, we first evaluated the performance of several published polygenic risk scores in the UK biobank cohort.Next, we trained polygenic risk score models using the largest GWAS summary data from the ELLIPSE Prostate Cancer Meta-Analysis cohorts and tuning dataset from the UK Biobank.Model performance was evaluated in held-out individuals from the UK biobank as well as in individuals from the held-out All of Us cohort.

Population ancestral assignment
We sought to stratify the UK Biobank cohort by ancestry so as to evaluate the efficacy of multiancestry PRSs.Extensive quality control was performed prior to ancestry assignment.We began with the 488,377 individuals who had genotypic data.We removed 1,807 individuals who were considered outliers in heterozygosity, had putative sex chromosome aneuploidy, or had excess relatives according to the quality control procedures completed by the UK Biobank.
Continental global ancestry assignment was performed by integrating genotypic data and self-reported ethnicity, similar to previous work [19].UK Biobank participants were split into three populations (African (AFR) males = 3,846, Asian (ASN) males = 6905.European (EUR) males = 209,894) by using their self-reported answer to the question "What is your ethnic group?".
Next, we performed k-means clustering (k = 3) based on the Euclidian distance across the first 40 genetic principal components.The entire cohort was included in the clustering without regard for their questionnaire-derived population label.We integrated genetic and questionnaire data by assigning each cluster an ancestry based on the most prevalent selfreported population of the constituent individuals.This yielded high concordance between genetic and questionnaire data.97% of all individuals in cluster 1 were considered ASN, 99% of all individuals in cluster 2 were AFR and 99% of all individuals in cluster 3 were EUR.Participants whose self-reported population disagreed with the computed population or who were in the 99 th percentile of distance from the cluster's center were annotated as a population termed Other (n = 2,381, Fig 1A).
We further assessed population assignment by comparing our clusters with published country-level assignments based on the top 16 principal components.This revealed high concordance between annotations.Moreover, participants that did not fall into our 3-population buckets were identified as Other, Iranian, and Asheknazi by Prive ´et al., 2022 [19].Interestingly, the use of additional principal components enabled population classification of samples annotated as Other by Prive ´et al., 2022 [19].This led to the addition of 2,438, 4,882 and 2,406 samples to AFR, ASN and EUR groups, respectively (Fig 1B).
Finally, we limited our dataset to all males who were categorized into either AFR, ASN, or EUR.This resulted in a final dataset of 220,245 participants.Population assignments determined from the k-means clustering were utilized for all analyses.

Assessment of published genetic risk scores
We annotated participants as cases and controls based on ICD-9, ICD-10, and self-reported information on prostate cancer status prior to conducting any genetic analyses.This process identified 13,097 cases and 207,148 controls in our dataset.Stark differences in population grouping, birth year, smoking, alcohol, and vital status were observed between cases and controls (S1 Table ).Next, we aimed to ascertain the accuracy of published polygenic risk scores in our cohort.These scores were not derived from the UK Biobank and included both those which were ancestry-informed and those that were generated in a homogenous European population (S2 Table ).
Fitting of polygenic risk scores involved a random stratified 2:1 split of our dataset into tuning and testing.The split was stratified by population grouping and case-control status.No differences in prostate cancer status or risk factors such as birth year, BMI, smoking status, or alcohol usage were identified (S3 Table ).Publicly available polygenic risk scores were evaluated in the testing dataset.Briefly, we trained logistic models on scores obtained from allelic scoring of risk alleles and the first 10 genetic principal components using the entire testing dataset with cross validation, regardless of population grouping.Evaluation of predictive power in the testing cohort was performed in two ways, both, evaluation within the entire cohort and evaluation within each annotated population (S1A Fig).
We found that prediction AUROCs ranged from 0.55-0.67(Fig 2A) when evaluating the entire cohort and compared them to published results when available.We note that our evaluation resulted in comparable AUROCs to those reported in the original publications.Population-specific evaluation revealed the highest average AUROC for EUR predictions (0.63 [0.58-0.69],mean [95% CI]) across all evaluated PRSs.Interestingly, AUROC for ASN predictions was not significantly different from EUR (0.64 [0.59-0.68],p = 0.46).However, AFR participants had poorer AUROC than EUR (0.55 [0.52-0.58]vs 0.63 [0.58-0.69],p = 0.02, paired t test) (S2A Fig) .Moreover, we find that all tested scores, apart from PGS000714, were associated with worse predictive power in AFR (Fig 2B, p < 0.05, DeLong's test for receiver operator characteristic curves).
Despite these overall lower performances, we find that an ancestry-informed PRS (PGS000662) resulted in improved prediction in AFR participants (AUROC 0.59) relative to other PRSs.We estimated AFR-specific SNP weights by multiplying the SNP beta (as calculated by Conti et al. [10]) by the AFR allele frequency to better understand variants that may enhance performance in AFR individuals.This revealed that rs10090154, rs72725879, and rs11263763 were the top contributing SNPs for prediction in AFR.rs10090154 and rs72725879 are located on the 8q24 AFR risk locus near the MYC oncogene [11,20].Fine mapping in a European cohort has identified the association between rs11263763, located on 17q12, and prostate cancer susceptibility [21].Given that an ancestry-informed meta-analysis (PGS000662) resulted in improved prostate cancer risk prediction in AFR, we hypothesized that additional rigorous adjustment of cross-population linkage disequilibrium and population allele may improve predictive power in AFR men.

Generation of polygenic risk scores
As most existing PRSs were developed in majority EUR populations and were linked to suboptimal performance in African ancestry, we hypothesized that appropriately adjusting for corsspopulation linkage disequilibrium may improve disease prediction across populations.To develop such an improved PRS, we employed clumping, PRScs, PRScsx, IMPACT, PolyFun, and XPASS to generate scores across populations and tuning parameters (Methods).This process generated 206 unique scores that were designed for prediction in population specific (pairwise AFR, ASN, EUR) or agnostic contexts (Total).Indeed, we found that scores were positively correlated with each other (Fig 3A, mean Pearson r 0.266[0.263-0.27]).Although the scores were correlated, hierarchical clustering revealed population-specific clustering (χ 2 p < 2.2x10 -16 ) suggesting that the methods may not adequately adjust for ancestry.
Next, we similarly trained logistic models and evaluated model fit with Nagelkerke's pseudo R 2 .Unsurprisingly, scores derived for EUR were associated with the highest variance explained (Fig 3B and S4 Table ).We calculated a relative R 2 metric to identify optimal methods for ancestry-adjusted scoring.This involved dividing all R 2 values by the mean R 2 associated with clumping in a population-specific manner.This revealed that in AFR and ASN, PRScsx was the only method that consistently explained greater variance than clumping alone (S3A Fig).We repeated the model training and evaluation in propensity score matched data to mitigate the effect of any potential confounding from environmental factors.We found that it was not essential to match cases and controls, as the matched results largely aligned with the unmatched ones (S3B Fig) .Our analysis utilized unmatched data to maximize statistical power and evaluated the prediction accuracy of all scores in the held-out dataset.Pairwise evaluation (all scores evaluated in all target populations) identified that clumping (0.612[0.603-0.62]),PRScsx (0.608[0.6-0.616]) and PRScs (0.605[0.598-0.613])led to the highest AUROC (Fig 3C and S4 Table).Next, we identified scores that performed the best in a population specific manner (Fig 3D).Interestingly, the best performing scores did not correct for inter-population LD patterns and were associated with summary statistics from larger GWASs.We identified that clumping the combined summary statistics generated by Conti et al. [10], led to the highest AUROC when predicting disease in AFR (0.631[0.577-0.685]).In contrast, ASN were associated with the highest AUROC when PRScs was applied to EUR summary statistics (0.712[0.634-0.788]).While the difference between models that explained the greatest variance and those that had the highest AUROC was not statistically significant (S3C Fig, AFR p = 0.2, ASN p = 0.22), potentially due to sample size, our findings suggest that summary statistics from larger GWASs may be more informative than ancestry adjusted summary statistics from smaller ones.

Integration of clinical risk factors
To evaluate the polygenic risk scores in models that reflect potential clinical utility, we fit a model that included top performing PRSs, age at assessment, PCa family history, and body mass index.These models, when evaluated in a population specific manner, drastically improved prostate cancer prediction (Fig 4A).Our analysis revealed that across all ancestries, the addition of age significantly increased the AUROC of a PRS-only model (Fig 4B, p < 0.05).The addition of additional risk factors did not improve predictive power substantially.Interestingly, the AUROC increase associated with the inclusion of age was the greatest in AFR (Δ AUROC = 0.16).It is possible that this increase in AUROC is due to unmatched cases and controls.However, we find that cases were older than controls across all populations (S4A Fig) .Next, we explored whether aggregating all computed polygenic risk scores could improve disease classification using a deep learning framework.We found that employing a deep learning model resulted in performance similar to single-sourced polygenic risk scores (AUROC AFR = 0.79, EUR = 0.78, ASN = 0.81, Total = 0.79, S4B Fig) .Taken together, we find that age and PRS alone are sufficient to predict prostate cancer at AUROCs greater than 0.75 across all ancestries.

Age as a disease risk modifier
In exploring the role of age as a potential modifier of prostate cancer risk, we investigated its influence based on previous findings.Conti et al. (2021) [10] demonstrated that individuals younger than 55 face a heightened risk of disease compared to their older counterparts, especially among high-risk individuals.Our analysis employed logistic models that incorporated an interaction between polygenic risk scores and age categories (� 55 vs > 55).Although our study did not establish age as a significant modifier of disease risk among African and Asian populations, possibly due to the limitations in sample size, we observed that 78 out of 206 analyzed scores indicated an elevated risk in individuals under 55 years (Odds ratio > 1, FDR < 0.05, S4C Fig) .Indeed, the models including the PRS-age interaction yielded a better fit compared to models without this interaction (Likelihood ratio test FDR < 0.05).Taken together, our findings suggest that age is a disease risk modifier in patients younger than 55 years.

Validation in All of Us
To produce out-of-sample validation results, we assayed PRS performance of the top UK biobank score (clumping) in the All of Us (v6) dataset which is annotated with genetic ancestry (S5A Fig) .This process involved array based genotyping data for 28,704 AFR and EUR participants.We could not reliably evaluate ASN individuals due the low sample size of cases.The dataset, while not imputed, contains measurements for 1,824,517 unique biallelic genotypes.We scored these genotypes with the UK Biobank PRS associated with the highest AUROC in AFR (Total clump) as well as the 269-SNP score from Conti et al. [10].A random ancestryand-case-stratified 2:1 split of these data yielded training and testing datasets.Fitting logistic models for these scores revealed that PGS000662 was associated with the poorer pseudo R 2 (Fig 5A, 0.035 vs 0.045).Indeed, we found that clumping was associated with higher AUROC in AFR, although not statistically significant (Fig 5B, 0.55 vs 0.513, p = 0.22).Statistical significance was, however, achieved when evaluating EUR (0.63 vs 0.56, p = 1.4 x 10 −4 ).
Next, we examined the effect of the addition of clinical risk factors in our models.We note that AFR cases and controls are younger than their EUR counterparts (S5B Fig)  performance of trained models drastically improved upon integration of participant age and polygenic score.Similar to our findings in the UK Biobank cohort, the addition of family history and BMI did not increase predictive power (Fig 5C).Taken together, we demonstrate the utility of scores generated from the UK Biobank dataset and show that genetic risk coupled with participant age is sufficient to predict prostate cancer.

Discussion
Recent advances in statistical genetics have led to improved methods for cross-ancestry PRS disease risk prediction.However, these improved models have not been formally assessed for their predictive performance in prostate cancer.In our comprehensive analysis across two biobank-scale datasets, we benchmarked six algorithms and find modest improvement over traditional clumping methods.These algorithms are based on variant prioritization, continuous shrinkage, joint analysis of summary statistics, functional annotation, fine mapping and the utilization of population-specific heritability estimates.To our knowledge, this is the first study to assess pan-ancestry polygenic risk for prostate cancer in the All of Us cohort.
In our benchmark, we find that hierarchical clustering is adequate to stratify polygenic scores by ancestry.These results suggest that the genetic correlation, measured by the correlation of polygenic scores, within individual ancestries is greater than that between prostate cancer cases.This finding may reflect the poor performance of ancestry-adjustment methods.Our study employed ancestry-correction using linkage disequilibrium maps from one of the largest publicly available sources [22].It is possible that the sample sizes that derived these maps are inadequate, and that finer resolution mapping is required.Furthermore, inclusion of local ancestry haplotype models may improve predictive power of scores, especially for admixed individuals.
We find that while PRScsx explains the greatest amount of variance in logistic classifier models, it does not outperform other methods in classification accuracy on held-out data.While there is no consensus on PRS evaluation metrics and several studies report pseudo-R 2 , odds ratio or AUROC, we show that alternative metrics such as AUROC produce a different ranking of polygenic risk scores.Lastly, we note that polygenic score and the age of an individual are sufficient for acceptable classification of prostate cancer cases and controls across ancestries.
Our work, however, is constrained by certain limitations.First, we limit the analysis to genotyping arrays to maintain consistency and maximize sample size across the UK Biobank and All of Us cohorts.Genotype imputation has been performed for UK Biobank, but not for All of Us.Discrepancies in processing pipelines between derivation and validation data may limit the generalizability of our findings.Second, although we identified genetic ancestry, we could not perfectly account for population structure, admixture and subpopulations.For example, we grouped East Asian and South Asian individuals together to maximize the sample size of a population with low prostate cancer incidence.Additionally, it is possible that individuals annotated as AFR may represent distinct subpopulations, considering the vast genetic diversity within Africa, and the dissimilarity may be heightened between continents (Europe vs North America) [23].Moreover, ancestry-bias correcting algorithms assume homogenous population structure for optimal performance.However, most individuals are admixed [24], and each admixture component has further fine-scale population structure that may prevent portability (e.g.Saharan vs Sub-Saharan ancestry).Third, although several ancestry-adjustment algorithms exist [25][26][27][28], we limited our study to six commonly used ones.These algorithms are complex and require fine-tuned parameterization for optimal implementation.Although we performed a robust benchmark and evaluated several parameters, a comprehensive sweep or parameters was not possible.However, based on our results, it seems unlikely that additional tuning would significantly alter prediction accuracy.
Taken together, our findings highlight the necessity for advanced computational strategies that more accurately reflect the genetic diversity across populations.Recent reports suggest that multi-ancestry PRS generating strategies are trait-specific and that out-of-thebox methods may not generalize [29].Careful integration of several strategies such as thresholding genetic correlation between the source and target population and sample size thresholding may yield improved risk prediction.Moreover, the integration of non-linear models could complement the linear strategies we tested.For example, transfer learning and XGBoost result in more accurate predictions across a variety of traits [27,30].Additionally, it is possible that whole genome sequencing, rather than array-based genotyping, may capture additional ancestry-biased causal variants, yielding improved risk prediction.
Given the dramatic diversity in incidence and outcome data across ancestry groups, poorly performing risk prediction tools may compromise risk stratification and patient care, particularly for those most likely to be afflicted.Our work highlights the need for additional algorithms to predict risk across populations as well as the sampling of diverse cohorts.

Data acquisition
The primary derivation source of data was the UK Biobank, a large, prospective, general-health study [31].Approximately 500,000 individuals aged 40-69 were enrolled between the years 2006-2010.A diverse range of data was acquired during assessment, including electronic health records, diet and exercise habits, anthropometric measurements, self-reported disease history and blood samples [32].Genotypic data for each individual was obtained by processing the blood sample with either the UK Biobank Axiom Array or UK BiLEVE Array.The resulting array data was imputed to 93,095,688 variants through the application of the IMPUTE2 program.We identified an individual as having prostate cancer if they either self-reported "prostate cancer" or their electronic health records contained an ICD-10 code C61 or an ICD-9 code 185.The electronic health records are continuously updated during inpatient hospitalizations and are available starting before each individual's date of baseline assessment.
Genetic associations to prostate cancer were acquired from the ELLIPSE Prostate Cancer Meta-Analysis [10].The data from this meta-analysis was deposited within dbGaP (phs001081.v1.p1).After approval (on Sep16, 2021) we downloaded the genome-wide association study (GWAS) summary statistics that were separately computed for four different ancestry-defined populations: European, African, Asian and Total.These four sets of summary statistics underwent standard quality control procedures which removed variants with missing statistics, duplicated rs IDs, alleles that were not A/C/T/G, ambiguous, not found in the UK Biobank imputed set, or that would not properly flip to match the alleles of the UK Biobank [33].After rigorous quality control, all four sets of summary statistics contained more than five million variants.
The data source for score validation was the All of Us cohort (v6).Whole genome sequencing and genotype array data were available for 98,560 and 165,080 individuals respectively.We limited our study the genotype array data to maximize sample size and consistency with the UK Biobank.Genetic ancestry estimation and genotype quality control were performed by the All of Us research group prior to data acquisition.Cases were identified by medical history of prostate cancer while males without disease history were selected as controls.This yielded a total of 1,458 cases (213 African ancestry and 1245 European ancestry) and 27,246 controls (8,915 African ancestry and 18,331 European ancestry) with complete genotype data present.only designed for use in a specific population.Specifically, the PLINK utility's "-score" command computed the polygenic risk score.

Model fit and evaluation
Logistic models were with the glm function as implemented in caret using 5-fold cross validation.In all cases, we used "summaryFunction = twoClassSummary, metric = ROC".Evaluation of model fit in the training dataset involved calculation of Nagelkerke's R 2 as implemented in "fmsb::NagelkerkeR2".A receiver operator characteristics curve as implemented in "pROC:: roc" was used to calculate the area under the curve.Specifically, we used "ci.auc" for estimating bootstrapped confidence intervals and "roc.test" to compare curves.When evaluated, sensitivity and specificity were identified at the Youden index.

Deep learning model to aggregate polygenic risk scores
We trained a multilayer perceptron using PyTorch (v2.2.0) to test if aggregating polygenic risk scores would improve disease classification across ancestries.The architecture consisted of an input layer, two fully connected hidden layers (128 and 64 neurons, each with L1 regularization of 0.1) and a final dense layer with sigmoid activation.The model included dropout (0.1) and batch normalization between layers.Model training involved 5-fold cross validation and included randomly splitting the dataset into training, validation, and testing data.Optimization was performed using Adam with a learning rate of 0.0001, and the objective was to minimize binary cross-entropy loss.The model was trained for a maximum of 100 epochs, where each epoch was fit over a total of 1000 batches.An early stopping threshold of 10 epochs was set to prevent over-fitting based on validation loss, where the best model weights were restored upon early stopping.AUROC was computed using cross-validation predictions.

Statistical analysis
Unless otherwise reported, Wilcoxon-Rank Sum test was used to compare groups and statistical significance was considered as p < 0.05.

Fig 2 .
Fig 2. (A) Evaluation of published prostate cancer polygenic scores sourced from PGScatalog without stratifying by ancestry in the UK Biobank testing dataset.(B) Evaluation of published prostate cancer polygenic scores sourced from PGScatalog after stratifying by ancestry in the UK Biobank testing dataset.Confidence intervals were obtained by bootstrapping the AUC 2000 times.p values were obtained from De-Long's test.https://doi.org/10.1371/journal.pcbi.1011990.g002

Fig 3 .
Fig 3. (A) Pairwise Pearson correlation matrix of 206 scores generated.(B) Evaluation of Nagelkerke's pseudo R 2 based on models trained on summary statistics adjusted for specific ancestries.(C) Evaluation of AUROC in the UK Biobank testing cohort.(D) Identification of polygenic scores associated with highest AUROC for each evaluation-derivation ancestry pair.Text within squares indicates associated algorithm.Top and right sample size annotations are in log scale.https://doi.org/10.1371/journal.pcbi.1011990.g003

Fig 4 .
Fig 4. (A) Integrated risk score containing polygenic risk, age, family history and BMI for scores associated with highest population-specific performance.Best AFR: total-clumping; best EUR and best EAS: EUR-PRScs.(B) Predictive power associated with the inclusion of risk factors.https://doi.org/10.1371/journal.pcbi.1011990.g004 Predictive

Fig 5 .
Fig 5. (A) Evaluation of Nagelkerke's pseudo-R 2 in models trained in the All of Us dataset.(B) Prediction AUROC of polygenic scores alone in the All of Us testing dataset.(C) Prediction AUROC of polygenic scores and age in the All of Us testing dataset.https://doi.org/10.1371/journal.pcbi.1011990.g005