Comparing distributions of polygenic risk scores of type 2 diabetes and coronary heart disease within different populations

Polygenic risk scores are gaining more and more attention for estimating genetic risks for liabilities, especially for noncommunicable diseases. They are now calculated using thousands of DNA markers. In this paper, we compare the score distributions of two previously published very large risk score models within different populations. We show that the risk score model together with its risk stratification thresholds, built upon the data of one population, cannot be applied to another population without taking into account the target population’s structure. We also show that if an individual is classified to the wrong population, his/her disease risk can be systematically incorrectly estimated.


Introduction
Noncommunicable diseases, also known as chronic diseases, are currently responsible for more deaths than all other causes together [1]. Cardiovascular diseases (CVD), cancers, chronic respiratory diseases, and diabetes in particular are responsible for the majority of them [1]. The major cause of death among all CVDs is coronary heart disease (CHD) [1].
It is recognised that CVDs and the type 2 diabetes (T2D) are potentially preventable [1,2]. For this reason, the early identification of individuals with a high risk of these diseases is most important.
The cause of these diseases is considered to be complex, combining both genetic and environmental factors [1][2][3][4]. While environmental factors have been thoroughly studied, it has been a challenge to find the exact most important genetic markers that explain the occurrence of such complex diseases. It is believed and supported by large-scale GWAS studies that genetic risk depends on a large number of genetic markers, each one of them having relatively small effect if taken separately [5].
For this reason, the polygenic risk score (PRS), the risk metric calculated on several single nucleotide polymorphisms (SNP), weighted by their effect-size estimates (logistic/linear regression coefficients from GWAS meta-analysis), can be seen as an approximation of the a1111111111 a1111111111 a1111111111 a1111111111 a1111111111

OPEN ACCESS
Citation: Reisberg S, Iljasenko T, Läll K, Fischer K, Vilo J (2017) Comparing distributions of polygenic risk scores of type 2 diabetes and coronary heart disease within different populations. PLoS ONE 12 (7): e0179238. https://doi.org/10.1371/journal. pone.0179238 total genetic risk and is in the focus of the current research. Starting with a few dozens of markers [3,6], PRSs are now being calculated using hundreds, thousands [7] and even tens of thousands [8] of SNPs.
Genetic risk estimation is mostly based on percentiles of the PRS distribution in the study cohort [12] and many studies in this field have estimated the relative or absolute risk differences between highest and lowest deciles or quintiles. However, the long-term purpose is to incorporate PRSs in the clinical risk stratification algorithms, to assess the risk levels of individuals outside the original study cohorts. For that reason, absolute thresholds are needed.
It is shown that PRSs, particularly those that consist of up to hundreds of SNPs, are dependent on the discovery cohort [13]. Usually, the selection of SNPs and their corresponding weights are based on previously published meta-analysis, conducted mainly in Europeanancestry populations [14,15] that makes the PRS to be biased towards Europeans [13]. However, to the best of our knowledge it has not been investigated whether the risk estimates that are based on PRS distribution in one cohort are accurate for individuals that do not belong to the cohort.
In this study, we have used two previously published PRSs, both based on thousands of SNPs and compared their distributions within different populations. The populations of interest include Estonia, Europe, America, South-Asia, East-Asia and Africa.

Materials and methods
PRS is calculated as a sum of weighted effect alleles. The general mathematical formula of the PRS is written as follows: where X i denotes the effect allele count and w i the weight of the i-th SNP for a certain outcome, accordingly. The number of SNPs included in PRS (denoted with n) varies, depending on the trait/disease. We used PRS calculation pipelines from two recently published articles. The first is PRS for predicting the risk of CHD (PRS CHD ), based on 49310 SNPs [8]. It is built on European populations-particularly on Finnish, Dutch and other Western and Southern European ancestries. The second PRS is also built on samples of European descent for predicting T2D (PRS T2D ), based on 7502 SNPs [7]. In both articles, the effect sizes (ß) for SNPs are estimated in the meta-analyses which were performed using additive models. For PRS CHD , weights w i are taken to be equal to the estimated effect sizes ß i . In T2D model, an additional parameter π i is used for each SNP to determine the weight, so that w i = β i Á π i . That kind of double-weighting of SNPs helps to minimise the bias arising from the "winner's curse". For a better comparison, both PRS are scaled over all samples.
In the CHD model, we omitted palindromic SNPs and the calculation was based on 46648 SNPs. According to the supplementary materials of Abraham et al. [8], this does not affect the performance significantly. In addition, we left out 107 SNPs from T2D and 652 SNPs from CHD calculation, due to the missing data (see below). As a result, our calculations were conducted on 7395 and 45996 SNPs accordingly, sharing 5164 common SNPs by ID. To make sure that omitted SNPs have negligible effect on the results, we fitted logistic regression model for prevalent type 2 diabetes, including only PRS T2D as a covariate (there were 1199 common samples with Läll et al. [7]). The odds ratios (OR) remain similar-our OR is 1.76 (95% confidence interval 1.26..2.46) compared to 1.61 (1.16..2.24) in the original article.
In order to calculate the PRS CHD and PRS T2D in different populations, we used 1000 Genomes Project data from Phase 3 (October 2014) release [16]. It contains samples of 2504 individuals from 5 super-populations (in this paper called populations): East-Asia (EAS, 504 individuals), South-Asia (SAS, 489), Europe (EUR, 503), America (AMR, 347) and Africa (AMR, 661). In addition, to represent the Estonian population (EST), we added 2244 samples having a full DNA sequence available, from the Estonian Biobank [17]-a population-based biobank, holding samples of approximately 5% of the Estonian adult population [17]. SNPs were extracted from both datasets either by their ID that was mentioned in the PRS model or by their alias, found from dbSNP [18]. No imputation was performed. SNPs that were not present in VCF, are listed in supplementary materials (S1 File) and were left out from the analysis.
PRSs were calculated by using PLINK (v1.9) [19]. Output files are available in S2 File. For the genetic risk estimation, individuals are divided into quintiles, based on the PRS values in each study cohort. In both models the risk is considered to be highest for individuals in the top PRS quintile and lowest for the bottom PRS quintile.
Finally, the distributions of the scores in all populations were plotted, quintiles calculated and compared by using R version 3.2.3.
As there is no phenotype data in 1000 Genome Project, we were unable to analyse the association between disease prevalence and the risk score within different populations.
In order to explore the genetic variability of the input data between populations and to better interpret the outcomes, we also performed a Principal Component Analysis (PCA) of SNPs.

Results
The observed distributions of PRS CHD and PRS T2D are shown in Fig 1 and Fig 2 accordingly. The corresponding quintiles are given in Table 1.
The order of PRS CHD and PRS T2D distributions follow the same pattern: Europeans, including Estonians, are getting lower scores than Americans and South-Asians on both plot. East-Asians and Africans are getting the highest scores. However, when looking at the means of the PCA plot of all samples, based on 7395 SNPs from T2D model, is shown in Fig 3. PCA plot for CHD model is almost identical (available in S1 Fig).
For each population, the correlation between the first component of PCA, conducted only on the SNP data of that population, and PRS is given in Table 1. The correlation is very strong within American and African populations.
In order to illustrate the differences in effect allele frequencies of SNPs that have the strongest effect in the model, we have taken 20 top SNPs from the T2D model and compared their effect allele frequencies in African and European population in    is the allele which increases the risk score (has positive weight). It can be observed that frequencies tend to be higher in African than European population, which considerably is the cause of getting higher PRS values.

Discussion
We calculated two polygenic risk scores PRS CHD and PRS T2D , both containing large number of SNPs, for the samples from different populations and compared their distributions. We found that the distribution plots for both PRS models follow a similar pattern-the distribution parameters are considerably different. Estonians, together with other Europeans, tend to get the lowest and Africans the highest scores among considered demographic groups. Large shifts mean that the absolute ranges of the quintiles can be very different in different populations. The absolute score which in one population indicates the highest risk may mean the lowest risk in the other. If we apply the genetic risk cut-offs from European ancestry in individuals of African ancestry, then everyone would have an extremely high estimated risk level. Although the prevalence of the disease in different populations is indeed slightly different [20], it does not explain such a large variability in PRS distributions and stratifying the entire population to a high risk group does not make sense. One might argue that the importance of absolute cut-off values is questionable because usually relative PRS thresholds are used in research instead. This holds only for research domain, where one is mainly interested in the strength of PRS-phenotype association and the absolute values are not important. However, if a PRS is used for personalised risk prediction in clinical practice, absolute thresholds are needed. Therefore, these PRS models together with their absolute thresholds, which were designed to the data of European populations, cannot be applied directly to other populations for risk estimation. These findings are coherent with Martin et al. [13] who repeated PRS calculations for different models up to several hundred SNPs. They also found that in different populations the distributions vary. We can see that by using thousands of SNPs, the distribution plots of large PRS models are more likely to drift apart.
Carlson et al. provide an explanation for observing higher scores for non-European populations. They argue that because of the linkage disequilibrium in GWAS studies which are conducted mainly on European ancestries, discovered rare disease-associated variants are often not the true causal variants. As the linkage disequilibrium between causal and associated SNPs varies in different populations, the effect size of the disease-associated variant tends to be overestimated in non-European ancestries for approximately a quarter of SNPs [15]. We also observed that most contributing (largest z-score) SNPs in our models tend to have higher effect allele frequencies in African populations compared to Europeans (Fig 4), consequently leading to relatively higher scores.
As a result of different effect allele frequencies, SNP data that is used for PRS calculation already includes the population information. The PCA plot in Fig 3 confirms that the populations are different when viewed from the 7395- (Fig 3) or the 45996-SNP perspective. That is, even before applying any weighting in the PRS model, populations already differ considerably from each other, making the starting point of using the model unequal. It is coherent with Lu et al. [21] who used 7775 SNPs (different from our models) from an older release of 1000G data for PCA plotting and found that African, European, and Asian ancestries are clearly distinguishable from each other, while the American population is admixed.
In order to overcome the PRS distribution shift problem the final score or SNP weights individually have to be adjusted according to the particular population where the score is applied. So far, the large-scale GWAS studies have found relatively little between-cohort heterogeneity in the effects of individual SNPs. We cannot distinguish, whether it is so because there is not enough power to detect that or whether the effect sizes are actually homogenous-research so far supports the latter. Thus, there is no evidence to support differential weighting of individual SNPs. As the differences in PRS distribution depend mainly on different allele frequencies across populations, it seems justified to apply a population-specific correction to the entire PRS (rather than individual SNPs), to make the correct decision on general genetic risk level of any given individual.
One option is to simply recalculate the PRS distribution cut-offs for given target population by using sample data from the same population as a reference. This would solve the problem relatively easily for homogeneous populations. However, the problem still arises in admixed populations, where an individual might have a mixed set of SNPs from several ancestries and his/her individual cut-off thresholds do not match with the others. In such cases, first, we have to detect all these ancestries and then apply corresponding score adjustments to these populations.
However, even in a relatively homogeneous population or discovery-cohort, there is a potential risk of misclassifying an individual into a wrong population which would lead systematically to a wrong risk estimation. It can be observed from Table 1 that there is a significant correlation between the PRS and the first component of PCA analysis, especially for Africans and Americans. Due to the correlation, a person who is misclassified to a wrong population, will also get extreme PRS values and as risk score quintiles differ in different populations, this will lead to wrong risk estimation. This highlights the importance of correct population detection. Even in the discovery-population, in order to apply personalised medicine approaches like PRS-based risk estimation for an individual, he should always be tested beforehand to verify his descent from the same population.
How to detect the true mixture of ancestries for an individual effectively and taking it into account when adjusting PRS, remain an open question. We are getting incomparable scores because of the differences of the effect allele frequencies between populations, and at the same time, in order to suppress these differences, we have to know the descent of the individual. That brings us back to the PCA plot where we saw that the sample already holds the information about the ancestry. Can the same information be used for adjusting the score in-place? For instance, by weighting SNPs accordingly, by using the characteristics of the sample. We believe this issue deserves further investigation.
Adjusting the weights according to the descent require a trans-ethnic understanding of the disease-associated SNPs. Gathering such information is a tremendous challenge and this might be also the reason, why we did not find any such PRS models during the writing of this paper that has at least thousand SNPs and is built on global GWAS data.