On cross-ancestry cancer polygenic risk scores

Polygenic risk scores (PRS) can provide useful information for personalized risk stratification and disease risk assessment, especially when combined with non-genetic risk factors. However, their construction depends on the availability of summary statistics from genome-wide association studies (GWAS) independent from the target sample. For best compatibility, it was reported that GWAS and the target sample should match in terms of ancestries. Yet, GWAS, especially in the field of cancer, often lack diversity and are predominated by European ancestry. This bias is a limiting factor in PRS research. By using electronic health records and genetic data from the UK Biobank, we contrast the utility of breast and prostate cancer PRS derived from external European-ancestry-based GWAS across African, East Asian, European, and South Asian ancestry groups. We highlight differences in the PRS distributions of these groups that are amplified when PRS methods condense hundreds of thousands of variants into a single score. While European-GWAS-derived PRS were not directly transferrable across ancestries on an absolute scale, we establish their predictive potential when considering them separately within each group. For example, the top 10% of the breast cancer PRS distributions within each ancestry group each revealed significant enrichments of breast cancer cases compared to the bottom 90% (odds ratio of 2.81 [95%CI: 2.69,2.93] in European, 2.88 [1.85, 4.48] in African, 2.60 [1.25, 5.40] in East Asian, and 2.33 [1.55, 3.51] in South Asian individuals). Our findings highlight a compromise solution for PRS research to compensate for the lack of diversity in well-powered European GWAS efforts while recruitment of diverse participants in the field catches up.


Introduction
Translating findings from genome-wide association studies (GWAS) to clinical utility in terms of complex trait prediction is a major milestone in genetics research [1]. This is especially important for traits whose estimated heritability was reported to be high. However, the identified common single nucleotide polymorphisms (SNPs) seldom have deterministic consequences. While each identified common risk SNP contributes to the overall disease risk, by itself it is unlikely to predict a large degree of variation in a disease outcome and thus usually represents a poor predictor by itself. The combination of all risk SNPs into a polygenic risk score (PRS) is a popular approach to improve predictive power and can be valuable for risk stratification, i.e., the identification of a small subset of a population with extreme PRS values that is at higher risk to develop a disease [1].
The discovery of risk SNPs through GWAS often depends on very large sample sizes of genotyped data (hundreds of thousands of tag SNPs or more) especially if one aims to capture a large fraction of the SNP heritability [2][3][4]. Until recently, GWAS of this scale were either exclusively or predominantly based on European populations, trailed by Asian populations, while all other ancestry groups comprised less than 5% [5]. The resulting bias in published GWAS results [6] is passed on to the development and application of PRS for many complex traits and despite current efforts to increase diversity in genetics research will likely continue in the foreseeable future [6].
The lack of portability of PRS across populations with different ancestry compositions is known and usually attributed to differences in causal variants, linkage disequilibrium (LD) patterns, allele frequencies, and effect sizes [7,8]. In addition, genotyping or imputation methods that were originally developed for European ancestry (EUR) studies can amplify such differences [7,8].
There are several examples of studies that explore PRS constructed using GWAS results from different ancestry groups. Belsky et al. [9] constructed an obesity PRS based on EUR-GWAS and found that it performed poorly individuals of African American compared to those of EA [9]. Grinde et al. [10] assessed the performance of PRS based on EUR GWAS in a Hispanic/Latino population for three groups of traits: anthropometric measures, blood pressure, and blood count. The EUR-based PRS performed well for anthropometric and blood count traits but performed poorly for blood pressure traits [10]. EUR-based PRS for these quantitative traits also showed on average a 3.3-fold decrease in predictive performance in East Asian population when compared to the European population [11]. Others have demonstrated an association between PRS and genetic ancestry [12,13]. Simply put, the literature cautions against the transferability of EUR-based GWAS to other populations [5,8]. Recently we have provided a catalog of more than 500 PRS for various cancer using EUR-based GWAS [14]. However, there are little or no reports on the transferability of cancer PRS or whether these PRS can be used for other ancestries.
The UK Biobank Study (UKB) offers detailed questionnaire, electronic health record (EHR) and genetic data representing an excellent resource to study the influence of genetic risk factors on common complex disease. While predominantly European ancestry, it also includes over 20,000 participants of self-reported non-EUR ancestry (reported as "ethnic groups") [15] that can, together with genetically inferred ancestry information, be stratified into the four main ancestry groups: African, East Asian, European or South Asian ancestry (S1 Table). Thus, UKB offers the opportunity to evaluate the performance of PRS across various ancestry groups and to assess the transferability of EUR-based cancer PRS.
To increase power for such an evaluation, we focus on two common cancer traits, breast and prostate cancer. Both of these traits offer several advantages for PRS explorations: high disease prevalence, large fraction of heritability already explained through known risk variants, low chance of phenotype misclassification, and available full summary statistics from very large, EUR-based GWAS [16,17].

Results
We constructed cancer PRS specifically for the European subgroup of UKB individuals using two different approaches for each cancer trait: GPRS is an effect-size weighted PRS based on a sparse set of GWAS hits (independent risk SNPs with P-value below 5x10 -8 ) and CSPRS, a PRS based on the Bayesian-regression method CS-PRS that uses continuous shrinkage (CS) priors [18]. Relatively sparse sets of 334 and 377 SNPs were incorporated in the GPRS for breast cancer and prostate cancer, respectively. By contrast the CSPRS constructs integrated over 1.1 million SNPs for each of the two cancers.
What can be clearly seen in Fig 1 are the different distributions of PRS across the European, South Asian, African and East Asian ancestry groups that were statistically significantly different in group means by one-way ANOVA (P < 2.31x10 -141 ; S2 Table). Both breast cancer PRS were on average higher in non-EUR groups, whereas prostate cancer PRS were higher in African and lower in East and South Asian ancestry groups (Fig 1 and S2 Table). These differences were pronounced for the CSPRS. This is likely a result of the summation of hundreds of thousands allele-frequency differences between ancestry groups compared to a few hundred for the GPRS. Overall, this suggests that these PRS are not directly transferable, e.g., a high breast cancer PRS in EUR individuals might fall into the lower PRS distribution of Africans ancestry individuals. Fig 1 also shows that most of the EAS and AFR and half of the SAS female control individuals would have a breast cancer CSPRS above the 10% population threshold and thus might be considered at increased risk, when using the EUR-based scale as reference. And conversely, most if not all the EAS males would not reach the 10% population threshold of the prostate cancer CSPRS, and males with genetic profiles that would place them at risk within the EAS ancestry group would not be considered at risk on a EUR-centric scale.
Still, what is striking is the consistent right shift of the PRS distributions in cases compared to controls with each ancestry group (Fig 1). With exception of the small sample of East Asian prostate cancer cases (n = 7), all PRS were significantly associated with increased continuous ORs for their corresponding cancers when standardized to one standard deviation (S.D.) within each ancestry group (OR [per unit S.D.] � 1.44, Table 1). Furthermore, all PRS also  Table 1. The CSPRSs usually outperformed the GPRSs in terms of association strength, accuracy and discrimination (Table 1). Especially for the breast cancer, the CSPRS showed consistent effect sizes across the ancestry groups (1.66 � OR [per unit S.D.] � 1.77) and good discriminatory ability (0.64 � AAUC � 0.66) To evaluate if the increased risk is observable with increasing score or only present in the tails of the distribution, we stratified the PRS, again standardized within each ancestry group, and detected a trend of increasing number of cases within the increasing CSPRS deciles. This trend was strikingly monotonous in the substantially larger sample of European ancestry and, except for the small sample of prostate cancer cases of East Asian ancestry, noticeable though more capricious in non-EUR groups (Cochran-Armitage P < 0.00297; Fig 2 and S3 and S4 Tables). We saw similar trends for the GPRS (S1 Fig and S3 and S4 Tables).
Finally, we quantified the PRS's ability to enrich cases in the top 10% of the PRS distribution (defined in controls within each ancestry group) when compared to the bottom 90%. We observed an enrichment for breast cancer cases in the tail of the PRS distribution when we defined the top 10% within each ancestry group (breast cancer: OR Top10% > 2. 18 Table 2).
One may be concerned regarding the unbalanced case: control ratio used in our analysis and potential distortion of the asymptotic properties of the test statistics. For our ancestry specific calibration and reference group selection, we need to retain the maximal number of controls. We conducted a sensitivity analysis by using a matched sample that does not use all of the controls and we noted consistent point estimates but wider confidence intervals due to loss of sample size (S1 Text and S9 and S10 Tables).

Discussion
Overall, our findings in the UKB data are encouraging and suggest that cancer PRS derived from large EUR-based GWAS can, to a certain degree, be useful for risk stratification within EUR or within non-EUR individuals even though their distributions are dissimilar. However, there are limitations in regard to the generalizability of this approach. First, a matching ancestry group with sufficiently large control sample sizes is needed to adequately place a person's PRS within its reference PRS distributions. In this study, we   more homogenous groups by combining self-reported ethnic groups with genetically inferred ancestry groups. However, even within such groups an adjustment for any remaining population stratification, e.g., by including the first ten principal components, should be considered. Secondly, overall breast and prostate cancer were selected because they offered several advantages compared to other traits: their estimated heritability is relatively high [17,19,20], they are common across all ancestry groups (breast cancer 3.1-6.2%; prostate cancer 1.2-5.1%; S1 Table) and each had summary statistics publicly available from large EUR-based GWAS meta-analyses.
Thirdly, the UKB study individuals were recruited from the same country, the UK, where healthcare coverage and non-genetic risk factors might be more similar compared to diverse ancestries from geographically separate populations. Though we recognize that lifestyle, health disparities and socioeconomic factors (e.g., education and income, S1 Table) might vary between ethnic groups of the UKB study.
While a fraction of risk variants is likely population-specific, our observation of a decent predictive PRS performance across ancestry groups indicated that, for the two analyzed cancers, a fraction of the cancer risk variants obtained from an EUR-based GWAS is shared with non-EUR groups. So, while PRS that rely on EUR-based GWAS were reported to be not ideal for non-EUR groups, they can be useful for risk stratification also in non-EUR groups. In our examples, the proportion of cases by PRS risk decile was informative within the studies ancestry group, i.e., an increasing PRS was associated with increased proportion of cases also among non-EUR groups. However, we noted that the EUR-based prostate cancer PRS performed particularly poor in AFR males indicating ancestry-specific diversity for prostate cancer as previously reported [21]. This also suggested that transferability of PRS across ancestries needs to be carefully evaluated by cancer and by ancestry group.
In our manuscript we focused on the two primary methods GPRS and CSPRS representing a simple and complex method of the spectrum of PRS methods. However, we also compared the performance of three alternative methods C+T PRS, Lassosum PRS, and LDpred PRS. For breast and prostate cancer these additional methods performed better than GPRS but poorer than CSPRS (S1 Text and S6-S10 Figs and S5 and S6 Tables). Furthermore, we repeated the same approach in another genetics linked biorepository at the University of Michigan, the Michigan Genomics Initiative (MGI) and though much limited in sample size, we saw similar performance behavior of GPRS and CSPRS in relative terms and the improved performance in terms of risk stratification by using ancestry-specific reference groups (S1 Text and S11 and S12 Figs and S7 and S8 Tables).
We recommend that PRS be constructed using GWAS based on the same ancestry group, if large diverse GWAS and their summary statistics are available. In the absence of large-scale GWAS for non-EUR groups, several groups are developing methods to improve PRS performance in non-EUR groups. These methods may leverage evidence that SNP selection based on EUR-based GWAS is generally appropriate while the use of EUR-based GWAS effect sizes in ethnically mismatched groups might not [22]. Duncan et al. [5] highlight the need for improved understanding and consideration of LD and variant frequencies when applying European ancestry based GWAS to non-EUR groups, while at the same time calling for largescale GWAS in diverse populations [5]. Modelling ancestry into polygenic risk predictors or focusing on global risk variants might allow the retention of comparable predictive power across ancestries [8] and allow risk stratification also in understudies populations as shown for Hispanics/Latinos [10]. However, a restriction to global risk variants, e.g., defined by similar frequencies across all ancestry groups, might lead to the exclusion of true causal risk variants. When we applied such a global risk variant approach to the current dataset through simple frequency filtering, we made PRS distributions more similar across ancestry groups but also observed markedly reduced predictive power (S2-S5 Figs). While efforts are underway to contribute more diverse samples to genetic studies, their sample sizes will trail behind sample sizes of European ancestry GWAS for a long time [6]. Multiethnic PRS that combine larger EURbased GWAS with smaller GWAS of the target ancestry group were recently proposed and might alleviate the discrepancies in sample sizes for the time being [23]. Until we obtain comparable/reasonable sized discovery cohorts, we need alternative approaches that make the best use of the data we already have, so that diverse ancestry groups can benefit from PRS research, too.
Taken together, our findings suggest that cross-ancestry cancer PRS can be useful for risk stratification, especially when there is a lack of well-powered diverse cancer GWAS. However, caution needs to be applied to the interpretation and application of such genetic risk predictors as they can be prone to multiple sources of bias [8].

Ethics statement
UK Biobank received ethical approval from the NHS National Research Ethics Service North West (11/NW/0382). Michigan Genomic Initiative (MGI) study participants' consent forms and protocols were reviewed and approved by the University of Michigan Medical School Institutional Review Board (IRB ID HUM00099605 and HUM00155849). All participants gave full informed written consent.

Subjects/Genotypes
The UK Biobank (UKB) is a population-based cohort collected from multiple sites across the United Kingdom and includes over 500,000 participants aged between 40 and 69 years when recruited in 2006-2010 [15]. The open-access UK Biobank data used in this study included questionnaire data, electronic health record data, and genotype and genotyped derived data. The present analyses were conducted under UK Biobank data application number 24460.
We used both principal component-based ancestry prediction and self-reported ethnic information to define ancestry groups. For the ancestry prediction, we applied online augmentation, decomposition and Procrustes (OADP) method to the genotype data of 488,366 UK Biobank samples with 2492 samples from the 1000 Genomes Project data as the reference (FRAPOSA; see Web resources) [26] to infer the super populations membership (AFR: African, AMR: Ad Mixed American, EAS: East Asian, EUR: European, and SAS: South Asian ancestry). We combined the self-reported ethnic group and the inferred super population membership to define the following four ancestry groups for downstream analyses: African (self-reported "Black or Black British" and inferred AFR), East Asian (self-reported "Asian or Asian British" or East Asian and inferred EAS), European (self-reported European and inferred EUR), and South Asian individuals (self-reported "Asian or Asian British" and inferred SAS). By doing so we excluded individuals with admixed and/or unknown ancestry as well as individuals where self-reported ethnic group did not match their inferred ancestry.
For each cancer trait and each ancestry group, we extracted a maximal set of unrelated individuals (defined as kinship coefficient < 0.0884) [27] by first selecting a maximal set of unrelated cases before selecting a set of unrelated controls that was not related to any of the selected cases. [28] PRS construction PRS combine information across a defined set of genetic loci, incorporating each locus's association with the target trait. The PRS for person j takes the form PRS j = ∑ i β i G ij where i indexes the included loci for that trait, weight β i is the log odds ratios retrieved from the external GWAS summary statistics for locus i, and G ij is a continuous version of the measured dosage data for the risk allele on locus i in subject j.
We downloaded full GWAS summary statistics made available by the "Breast Cancer Association Consortium" (BCAC) [20], and the "Prostate Cancer Association Group to Investigate Cancer Associated Alterations in the Genome" (PRACTICAL) [17] (also see Web resources) both based on European ancestry samples. For each set of GWAS summary statistics, we create two PRS. For the first PRS construction method, we performed linkage disequilibrium (LD) clumping of variants with p-values below 5x10 -8 by using the imputed allele dosages of 10,000 randomly selected samples and a pairwise correlation cut-off at r 2 < 0.1 within 1Mb window. Using the resulting loci ("independent GWAS hits"), we calculated the weighted PRS (see above) denoted as GPRS ("GWAS hits-based PRS"). For the second PRS construction method, we used the software package "PRS-CS" [18] which uses a precomputed LD reference panel based on external European samples of the 1000 Genomes Project ("EUR reference") to define a PRS based on the continuous shrinkage (CS) priors that we denote as CSPRS. We applied a MAF filter of 1% and, in contrast to the GPRS only included autosomal variants that overlap between summary statistics, LD reference panel, and target panel. Full list of weights can be downloaded from our web site (see Web resources).
We obtained deep sequenced data on the 2504 samples in the 1000 Genomes Project's phase three panel that were generated by the New York Genome Center (see Web resources). Sequencing data was filtered to have a minimum depth of 10, to be polymorphic and located on chromosomes 1-22, X. We stratified the data according to their super populations (AFR, African; AMR, Ad Mixed American; EAS, East Asian; EUR, European; SAS, South Asian) and calculated their population specific allele frequencies using PLINK 1.9 (see Web resources). We created five sets of variants whose MAF was >1% in AFR, EAS, EUR, SAS and whose maximal allele frequency difference between any of the four populations was below 5, 10, 15, 20 or 25%. The resulting sets were used to filter the GWAS summary statistics before running PRS-CS.
Using the R package "Rprs" (see Web resources) and the weights from the two PRS methods, the dosage-based value of each PRS was then calculated for each UKB individual. For comparability of association effect sizes corresponding to the continuous PRS across cancer traits and PRS construction methods, we centered PRS values to their mean and scaled them to have a standard deviation of 1.

Statistical tests
For the PRS evaluations, we fit the following model for each PRS and cancer phenotype adjusting for covariates Birthyear, genotyping Array, and the first ten principal components (PC): logit ðPðPhenotype is presentjPRS; Birthyear; Array; PCÞÞ where the PCs were the first ten principal components obtained from the principal component analysis provided by the UK Biobank study and where "Array" represents the genotyping array. For each PRS derived for each GWAS source/method combination, we also assessed the following PRS performance measures relative to observed binary disease status: overall association and the ability to discriminate between cases and controls as measured by the area under the covariate-adjusted receiver operating characteristic (AROC; semiparametric frequentist inference [29]) curve (denoted AAUC) using R package "ROCnReg" [30]. Firth's bias reduction method was used to resolve the problem of separation in logistic regression (R package "brglm2") [31,32].
For each ancestry group (African, East Asian, European, and South Asian), we also stratified the UKB control dataset (i.e., the corresponding gender subset depending on cancer type) into ten groups of equal size by PRS deciles and determined the number of observed case subjects that were observed in the range of each risk decile. To assess for the presence of an association between cancer and increasing PRS risk deciles, we performed a Cochran Armitage Test for Trend implemented in the R package "DescTools" [33]. To study the ability of the PRS to identify high risk patients, we fit the above model (Eq 1) by replacing the PRS with an indicator for whether the PRS value was in the top decile or not.
To test if the PRS means between the ancestry groups are equal we used ANOVA adjusting for genotyping array, birthyear and the first 10 principal components.