Clinical Utility of a Coronary Heart Disease Risk Prediction Gene Score in UK Healthy Middle Aged Men and in the Pakistani Population

Background Numerous risk prediction algorithms based on conventional risk factors for Coronary Heart Disease (CHD) are available but provide only modest discrimination. The inclusion of genetic information may improve clinical utility. Methods We tested the use of two gene scores (GS) in the prospective second Northwick Park Heart Study (NPHSII) of 2775 healthy UK men (284 cases), and Pakistani case-control studies from Islamabad/Rawalpindi (321 cases/228 controls) and Lahore (414 cases/219 controls). The 19-SNP GS included SNPs in loci identified by GWAS and candidate gene studies, while the 13-SNP GS only included SNPs in loci identified by the CARDIoGRAMplusC4D consortium. Results In NPHSII, the mean of both gene scores was higher in those who went on to develop CHD over 13.5 years of follow-up (19-SNP p=0.01, 13-SNP p=7x10^-3). In combination with the Framingham algorithm the GSs appeared to show improvement in discrimination (increase in area under the ROC curve, 19-SNP p=0.48, 13-SNP p=0.82) and risk classification (net reclassification improvement (NRI), 19-SNP p=0.28, 13-SNP p=0.42) compared to the Framingham algorithm alone, but these were not statistically significant. When considering only individuals who moved up a risk category with inclusion of the GS, the improvement in risk classification was statistically significant (19-SNP p=0.01, 13-SNP p=0.04). In the Pakistani samples, risk allele frequencies were significantly lower compared to NPHSII for 13/19 SNPs. In the Islamabad study, the mean gene score was higher in cases than controls only for the 13-SNP GS (2.24 vs 2.34, p=0.04). There was no association with CHD and either score in the Lahore study. Conclusion The performance of both GSs showed potential clinical utility in European men but much less utility in subjects from Pakistan, suggesting that a different set of risk loci or SNPs may be required for risk prediction in the South Asian population.



Introduction
Despite being largely preventable, coronary heart disease (CHD) remains the most common cause of death worldwide [1,2]. There is a growing CHD burden in the developing world [3], particularly in South Asian countries such as Pakistan, where the prevalence of CHD in urban Karachi has approximately doubled since 1970 [4]. This increase is most likely due to the adoption of a more "Western" lifestyle combined with a greater susceptibility to metabolic syndrome [5]. Many CHD risk factors, such as smoking status and hypertension, can be modified through lifestyle changes or drug therapy [6], which can greatly reduce CHD morbidity and mortality [7,8]. Given that the atherosclerotic process can begin many decades before clinical manifestation, this provides an opportunity for preventative measures to be employed to avoid its escalation [9].
In order to target lifestyle and therapeutic intervention appropriately, those most at risk of developing CHD must be identified. A number of conventional risk factor (CRF) scores to determine 10-year CHD risk have been developed. These include the Framingham risk score [10], SCORE [11] and QRISK2 [12]. Individuals are then classified on the basis of the CRF score. Until recently, the cut-off for the high risk category (those who qualify for statin treatment for primary prevention of CHD) was set at 20% ten-year risk of CHD [6]. However, lower cut-offs have been proposed in both the USA [13] and the UK, with the 2014 National Institute for Health and Care Excellence (NICE) guidelines setting the threshold at 10% [14]. To date, such risk scores have provided only modest discrimination. Most events occur in individuals classified as being at intermediate risk [1,15]. By improving risk prediction tools, preventative measures can be targeted more appropriately.
It has been estimated that the heritability of CHD is 40-50% [16], and more than forty CHD-associated loci have been identified using both traditional candidate gene studies and genome-wide association studies (GWAS) [17,18]. The most recent publication from the CARDIoGRAMplusC4D consortium [18] increased the number of robustly associated loci to 46. The inclusion of genetic information is therefore a good candidate to improve CHD risk prediction. As each variant is associated with a modest effect size, the addition of a single CHD risk variant to a CRF score does not improve risk stratification [19-21]. However, combining even a small number of variants into a "gene score" has been shown to improve classification or discrimination of individuals [22,23].
The objective of this study was to determine whether the inclusion of a CHD risk gene score has clinical utility in the European population. We sought to do this by investigating whether addition of a gene score improves discrimination and prediction over and above the use of the Framingham CRF score alone in a prospective study of middle-aged men from the UK. We then looked to determine if the same genetic variants can be used in individuals from the Indian Subcontinent. Two gene scores, one with 19 SNPs taken from both GWAS and candidate gene studies and one with 13 SNPs where only SNPs in loci identified by the CARDIoGRAMplusC4D consortium were included, were tested.

Methods

NPHSII
NPHSII is a prospective CHD study of approximately 3000 men as described previously [2]. Briefly, middle-aged men (aged 50-64 years) were recruited from 9 general practices in the UK. Anyone with a history of CHD or diabetes was excluded. There was a median of 13.5 years follow-up. CHD was defined as acute myocardial infarction (MI), silent MI or undergoing coronary surgery. All subjects gave written informed consent and the study had ethical approval from the National Research Ethics Service (NRES) Committee London-Central.

Pakistani cohorts
Two case-control studies from Pakistan were collected. One group was collected from the Rawalpindi Institute of Cardiology, Pakistan. All cases had had an MI as defined by a positive test for troponin T, ST segment changes on electrocardiogram and typical chest pain radiating in the chest that was not relieved at rest. This study will hereafter be referred to as the "Islamabad" study. The study had approval from the Institutional Review Board and Ethics Committee of Shifa College of Medicine, Shifa International Hospital, Islamabad and all subjects gave written informed consent. The second group was collected from hospitals in Lahore, Pakistan. All cases had CHD as defined by echocardiogram or angiography. This study will hereafter be referred to as the "Lahore" study. All subjects gave written informed consent and the study had ethical approval from the institutional ethical committee, University of the Punjab, Lahore. In both studies, controls were taken from the general population.

Gene Scores
The SNPs included in the gene scores are presented in Table 1, along with the source publication(s). Each SNP was weighted by its effect size as reported in its source publication, and the gene scores were calculated by multiplying the number of risk alleles by the natural log of the odds ratio for each SNP and adding the products together. It was assumed that all SNPs were acting in an additive manner apart from rs1799983, which was treated in a recessive manner. A 13 SNP score was also considered using only SNPs in loci identified by the most recent publication from the CARDIoGRAMplusC4D consortium [18].
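The weighting scheme described above can be illustrated with a minimal Python sketch. The function name, the placeholder SNP labels and the odds ratios below are hypothetical and are not the values in Table 1; the recessive coding (1 only for carriers of two risk alleles, otherwise 0) is an assumption about how "treated in a recessive manner" was implemented.

```python
import math

def weighted_gene_score(risk_allele_counts, odds_ratios, recessive_snps=frozenset()):
    """Sum over SNPs of (coded risk-allele count) x ln(odds ratio).

    risk_allele_counts: dict mapping SNP id -> number of risk alleles (0, 1 or 2)
    odds_ratios:        dict mapping SNP id -> published per-allele odds ratio
    recessive_snps:     SNP ids coded recessively (assumed: 1 if two risk alleles, else 0)
    """
    score = 0.0
    for snp, count in risk_allele_counts.items():
        if count not in (0, 1, 2):
            raise ValueError(f"{snp}: risk allele count must be 0, 1 or 2")
        if snp in recessive_snps:
            coded = 1 if count == 2 else 0   # recessive coding (assumption)
        else:
            coded = count                    # additive coding
        score += coded * math.log(odds_ratios[snp])
    return score

# Placeholder odds ratios for illustration only (NOT taken from Table 1).
example_odds_ratios = {"snp_a": 1.20, "snp_b": 1.10, "rs1799983": 1.30}
example_genotype = {"snp_a": 2, "snp_b": 1, "rs1799983": 1}

example_score = weighted_gene_score(
    example_genotype, example_odds_ratios, recessive_snps={"rs1799983"}
)
print(round(example_score, 4))
```

In this toy genotype the heterozygous rs1799983 carrier contributes nothing to the score under the assumed recessive coding, whereas the two additive SNPs contribute in proportion to their risk allele counts.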

Results from NPHSII
The baseline characteristics for the participants of NPHSII are presented in Table 2. As expected, the men who went on to develop CHD were older, had higher BMI, higher blood pressure, higher cholesterol, higher triglycerides and a higher proportion were smokers, at baseline. Moreover, those who went on to develop CHD had a statistically significantly higher 10 year CHD risk as calculated using the Framingham score. The genotype frequency of each SNP in NPHSII is shown in S1 Table. All SNPs except rs1042031, in APOB, were in Hardy-Weinberg equilibrium. A comparison of risk allele frequency in those who did and those who did not go on to develop CHD is presented in S4 Table. There was a statistically significant difference between the two groups for rs10757274 at the 9p21 locus and rs1746048 (close to gene CXCL12), with the risk allele frequency being higher in the CHD group for both SNPs.
The genotype information was combined using a weighted genetic risk score. For the 19 SNP GS, full data for Framingham score plus gene score was available for 1164 individuals, 114 of whom developed CHD. For the 13 SNP GS, full data for Framingham score plus gene score was available for 1437 individuals, 143 of whom developed CHD. Both gene scores were higher in those who developed CHD during follow-up (19 SNP GS p = 0.01, 13 SNP GS p = 7x10^-3).
We examined the ability of the two gene scores to reclassify those who did and those who did not go on to develop CHD. As shown in Table 3, although more participants appeared to be reclassified correctly with the inclusion of both the 19 SNP and 13 SNP GS compared to the Framingham score alone, this increase was not statistically significant in either case. When only those who were reclassified into the high risk category (set at 10% ten-year risk) were considered (those with a strong genetic predisposition to CHD with intermediate and low CRF scores), a statistically significant number of participants were reclassified correctly (Table 3). This was also the case when the cut-off for high risk was set at 20% (shown in S5 Table).
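For readers unfamiliar with reclassification statistics, the following Python sketch shows one common way a categorical net reclassification improvement (NRI) is computed. It assumes the standard definition (net proportion of events moving up a category plus net proportion of non-events moving down); the text does not spell out the exact formula or software used, so the function name and toy data are illustrative only.

```python
def net_reclassification_improvement(old_categories, new_categories, had_event):
    """Categorical NRI under the standard definition (an assumption here).

    old_categories / new_categories: ordered risk categories as integers
        (e.g. 0 = low, 1 = intermediate, 2 = high) before and after adding the gene score
    had_event: booleans, True if the participant went on to develop CHD
    Returns (event component, non-event component, total NRI).
    """
    up_events = down_events = n_events = 0
    up_nonevents = down_nonevents = n_nonevents = 0
    for old, new, event in zip(old_categories, new_categories, had_event):
        if event:
            n_events += 1
            up_events += new > old
            down_events += new < old
        else:
            n_nonevents += 1
            up_nonevents += new > old
            down_nonevents += new < old
    if n_events == 0 or n_nonevents == 0:
        raise ValueError("need at least one event and one non-event")
    event_component = (up_events - down_events) / n_events
    nonevent_component = (down_nonevents - up_nonevents) / n_nonevents
    return event_component, nonevent_component, event_component + nonevent_component

# Toy data: four participants with CHD followed by four without (illustrative only).
old = [1, 1, 0, 2, 1, 1, 0, 0]
new = [2, 1, 1, 1, 0, 1, 0, 1]
event = [True, True, True, True, False, False, False, False]
print(net_reclassification_improvement(old, new, event))
```

Moving up is "correct" for those who develop CHD and moving down is "correct" for those who do not, which is why the two components carry opposite signs on the upward movements.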
The predictive ability of the Framingham CRF score alone was compared to that of the Framingham score plus the genetic risk score. There was no statistically significant increase in the area under the ROC curve for either the 19 SNP GS (p = 0.48) or the 13 SNP GS (p = 0.82) (S1 Fig). As shown in Fig 1 and S6 Table, there was a statistically significant trend for higher quintiles of gene score to have a higher odds ratio for CHD compared to the bottom quintile for both the 19 SNP GS (p = 8x10^-3) and the 13 SNP GS (p = 0.01) after adjustment for age.
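As a reminder of what the area under the ROC curve measures, the sketch below computes it directly as the probability that a randomly chosen case has a higher predicted risk than a randomly chosen non-case (ties counted as one half). This is a generic illustration with made-up scores, not the procedure or software used in the study, and it does not include the significance test for the difference between two curves.

```python
def roc_auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    if not case_scores or not control_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0      # case ranked above control
            elif case == control:
                wins += 0.5      # tie counts as half
    return wins / (len(case_scores) * len(control_scores))

# Made-up predicted risks, for illustration only.
print(roc_auc([0.30, 0.12], [0.20, 0.10]))
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation of cases from non-cases.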

Results from Pakistani studies
We then attempted to apply these scores to two case-control samples from Pakistan. The available conventional risk factor information for both studies is presented in Table 4. Data was missing for up to 60% of participants for some variables in the Islamabad study. As expected, in both studies the case group was older and had a higher proportion of participants with diabetes, of participants with hypertension and of smokers. There was no difference in the proportion of males present in each group. Surprisingly, in the Islamabad study, mean LDL cholesterol was not statistically significantly different between cases and controls. This can be attributed to the treatment of those post-MI with lipid-lowering therapies. BMI, total cholesterol, triglycerides, family history and HDL cholesterol data were available for some of the participants of the Islamabad study. Only for HDL cholesterol was the difference between the two groups statistically significant, being lower in the cases.
The 19 SNPs were genotyped in both Pakistani sample sets and the results are presented in S1 Table. For the Islamabad study, five SNPs were not in Hardy-Weinberg equilibrium (MIA3 rs17465637, CXCL12 rs1746048, MRAS rs9818870, LPL rs1801177 and SMAD3 rs17228212), with an excess of homozygotes. Genotypes were confirmed by sequencing. In the Lahore study, only one SNP (LPA rs10455872) was not in Hardy-Weinberg equilibrium. The risk allele frequencies did not differ between the two Pakistani groups, as shown in Table 5. The data from the control groups was combined and compared to that from the group in NPHSII. The risk allele frequency was lower in the Pakistani group for 13 SNPs and higher for three SNPs compared to NPHSII.
The 19 and 13 SNP GSs were calculated for both Pakistani studies and the results are shown in Table 4. For the 19 SNP GS, full genotyping was available for 304 samples (119 controls/175 cases) in the Islamabad study and for 438 samples (130 controls/308 cases) in the Lahore study.
For the 13 SNP GS, full genotyping was available for 317 samples (123 controls/194 cases) in the Islamabad study and for 490 samples (145 controls/345 cases) in the Lahore study. In the Islamabad sample, a statistically significant increase in the mean gene score was observed between cases and controls for the 13 SNP GS (p = 0.04), but not for the 19 SNP GS (p = 0.35). For the Lahore sample, the mean gene score was not statistically significantly different between cases and controls for either score (19 SNP GS p = 0.75, 13 SNP GS p = 0.95). The 19 and 13 SNP GSs were found to be lower in the controls compared to those in NPHSII who did not go on to develop CHD in both the Islamabad (19 SNP GS p = 2.92x10^-14, 13 SNP GS p = 5.62x10^-6) and the Lahore cohorts (19 SNP GS p = 7.78x10^-13, 13 SNP GS p = 1.49x10^-9).
While the higher quintiles of the 13 SNP GS appear to have a greater odds ratio for outcome compared to the bottom quintile in the Islamabad sample, this was not statistically significant. No trend between quintile of gene score and outcome was observed for the 19 SNP GS in the Islamabad sample or for either GS in the Lahore sample (Fig 2 and S7 Table).

Discussion
The aim of including genetic information in CHD risk prediction is to identify those who have a high overall risk but who would be classed as being at low or intermediate risk using conventional risk factor scores. In this study, we found that a 19 SNP GS, which uses CHD risk weighting from published meta-analyses, showed improved risk classification over and above the Framingham score alone in NPHSII when those men who went up a risk category (for example moving into the high risk category) were considered. The exclusion of six SNPs in non-CARDIoGRAMplusC4D loci had a modest detrimental effect on the utility of the gene score in the UK men, reducing the overall NRI non-significantly from 4.5% to 3.0%, with a similar non-significant decrease in the number of men correctly reclassified into the high risk group. Thus the 19 SNP GS appears to have the better utility. By contrast, in two case-control studies from Pakistan the gene scores did not perform as well. While in the subjects from Islamabad the 13 SNP GS showed a significant case-control effect, neither score showed utility in the samples from Lahore.
There are a number of potential reasons why the results from the European group were not confirmed in Pakistanis. Firstly, both Pakistani studies comprise only a few hundred samples, and thus sample size and statistical power are an issue. We estimated that, should the same effect found in NPHSII be present in the Pakistani group, approximately 340 cases and approximately 340 controls would be required for both gene scores to have 80% power to detect it (at the 5% significance threshold). This is well above the number of individuals for whom complete genotyping was available in both Pakistani studies. Secondly, "cases" were defined differently in the two Pakistani studies. There was very little difference between the mean gene score in cases and controls in the Lahore sample, which used a much broader definition of CHD, compared to the Islamabad group, which used an MI phenotype (more like the "hard" endpoints used in the prospective NPHSII). A further limitation of the study is that using retrospectively collected cases restricts the case group to those who have survived, which may introduce a survivor bias to the sample.
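The kind of calculation behind such a sample size estimate can be sketched as follows. The text does not state which method or effect size was used, so this Python example assumes a two-sided comparison of two group means using the normal approximation; the function name is hypothetical, and the standardized difference of 0.215 in the usage line is back-calculated purely for illustration (it is the value that happens to give roughly 340 per group under this approximation) and is not a figure reported in the study.

```python
import math
from statistics import NormalDist

def n_per_group(standardized_difference, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-sample comparison of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d is the difference in means divided by the common standard deviation.
    """
    if standardized_difference <= 0:
        raise ValueError("standardized_difference must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 / standardized_difference ** 2
    return math.ceil(n)

# Illustrative, back-calculated effect size (an assumption, not from the study).
print(n_per_group(0.215))
```

Because the required sample size scales with the inverse square of the effect size, even a small reduction in the true case-control difference in gene score would push the required numbers well beyond a few hundred per group.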
A further possibility is SNP selection. While eight of the SNPs are non-synonymous (and thus are likely to be functional), and the APOA5 promoter variant has been shown to be directly functional [27], many are not in protein coding regions. It is possible that such SNPs merely "tag" the functional SNP located in the same linkage disequilibrium (LD) block. Given the differing LD patterns between ethnicities, it is possible that some of the SNPs used here are in strong LD with the functional SNP in Europeans but not in other ethnic groups. For example, the SNPs rs599839 and rs646776 are often used to tag each other at the CELSR2-PSRC1-SORT1 locus, with rs599839 being known to be functional [28]. Data from the HapMap project [29] show that the LD between the two SNPs is r^2 = 0.95 in the Northern European group and r^2 = 0.76 in the Japanese group, but the two SNPs are not in LD in the Yoruba group (from Ibadan, Nigeria). However, a study of approximately 2000 Indians collected from Mumbai and Bangalore found the two SNPs to be in strong LD (r^2 = 0.98) in this group. Furthermore, a number of SNPs at the 9p21 locus were genotyped in the PROMIS study of ethnic Pakistanis living in the UK and the LD pattern was found to be similar to Europeans at this locus [30], and a comparison of the genetic architecture of the LPA gene in individuals of different ethnicities (including South Asians) living in Canada found similar haplotype blocks to be present in the groups studied [31]. Therefore, for these loci we conclude that it is appropriate to use the SNP identified in studies with European participants. To what extent this can be extrapolated to the other SNPs and loci included is not clear. For example, a study of CETP polymorphisms in the Punjabi population from Northern India found the LD pattern to differ from that observed in Europeans [32], but overall information regarding LD patterns in South Asians is limited.
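The r^2 values quoted above follow the usual definition of pairwise LD, which can be sketched in Python as below. The sketch assumes that phased haplotype frequencies are available (in practice these are usually estimated from genotype data), and the function name and the frequencies in the usage lines are invented for illustration; they are not HapMap values.

```python
def ld_r_squared(p_ab, p_a, p_b):
    """Pairwise LD r^2 between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2
    p_a:  frequency of allele A at SNP 1
    p_b:  frequency of allele B at SNP 2
    r^2 = D^2 / (p_a * (1 - p_a) * p_b * (1 - p_b)), with D = p_ab - p_a * p_b
    """
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b
    return d * d / denominator

# Invented frequencies: a perfectly tagging pair, then an independent pair.
print(ld_r_squared(p_ab=0.30, p_a=0.30, p_b=0.30))
print(ld_r_squared(p_ab=0.06, p_a=0.20, p_b=0.30))
```

An r^2 close to 1 means that one SNP is an almost perfect proxy for the other, so a tag SNP chosen in one population only remains a good proxy elsewhere if r^2 stays high there.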
Moreover, given that the risk allele frequency is statistically significantly lower for 13 of the 19 SNPs in the Pakistani group compared to NPHSII, it is unsurprising that the gene scores are also statistically significantly lower. This shows that even if many of the SNPs are functional, gene scores based on these SNPs will perform less well in the Pakistani group. In order to optimise the gene scores for use in those of South Asian ethnicity, the functional SNP should be identified in all cases (bearing in mind that the functional SNP may differ between ethnic groups) to maximise the performance of a gene score based on this set of loci, and it may be necessary to design a South Asian-specific gene score.
We can only speculate whether the modest performance of the gene scores observed in Pakistani subjects here will be seen in samples from other parts of the Indian Subcontinent, as there is considerable genetic diversity within the region and since score performance is in part dependent on the frequency of the risk alleles. To date, allele frequency information across the Indian Subcontinent is limited for the SNPs used. A small number of Punjabi subjects (n = 96) from Lahore were genotyped as part of the 1000 Genomes Project (phase 3) [33]. No difference in allele frequencies was observed for any of the SNPs in this data set when compared with the control groups presented here. The risk allele frequencies of the 9p21 SNPs genotyped in the PROMIS study were statistically significantly higher than observed here [30]. While a similar risk allele frequency to that found here was observed for rs662799 (in the APOA5 promoter) in a Punjabi Pakistani case-control study, the risk allele frequency observed for rs599839 (at the SORT1 locus) in that study was much lower than observed here [34]. Another factor that may influence how representative the results presented here are is that five of the 19 SNPs genotyped in the Islamabad case-control study were not in Hardy-Weinberg equilibrium. In each case this was due to an excess of homozygotes compared to what would be expected. This is likely caused by the presence of population sub-structure [35]. Clearly, much more data is required from larger samples representative of different regions and groups from the Indian Subcontinent to determine the frequencies of these SNPs.
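A departure from Hardy-Weinberg equilibrium with an excess of homozygotes, as described above, can be checked with a simple goodness-of-fit test. The Python sketch below assumes a one degree of freedom chi-square test, which may differ from the test actually used in the study; the function name and genotype counts are invented for illustration.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) for Hardy-Weinberg equilibrium at a biallelic SNP.

    Returns (chi-square statistic, p-value, F), where F = 1 - observed/expected
    heterozygotes; F > 0 indicates an excess of homozygotes.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)       # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("SNP must be polymorphic")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))   # chi-square survival function, 1 df
    f = 1 - n_ab / expected[1]
    return chi2, p_value, f

# Invented counts showing a clear excess of homozygotes.
print(hwe_chi_square(40, 20, 40))
```

A positive F alongside a small p-value is the pattern that would be expected if the sample combined sub-groups with different allele frequencies, although genotyping error can produce a similar signal.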
Overall, this work demonstrated that the use of a 19 SNP gene score has the potential to improve risk stratification in European individuals over and above classical CHD risk factors. Reducing the number of SNPs in the score to only the 13 with GWAS-proven effects on CHD risk had a modest detrimental effect on the utility. We do not have definitive evidence of the utility of either score in subjects from Pakistan and larger samples are required to determine this.

S4 Table. Comparison of risk allele frequencies in those who did and did not develop CHD during follow-up of NPHSII. Comparisons were performed using proportion tests. CI = Confidence Interval. (DOCX)
S5 Table. Reclassification of NPHSII participants with the addition of the gene scores to the Framingham conventional risk factor score (20% cut-off for high risk). NRI = Net Reclassification Index. (DOCX)
S6 Table. Odds ratio for CHD in NPHSII by quintile of gene score, compared to the lowest quintile. Logistic regression, age adjusted, was performed for each group. OR = Odds Ratio, CI = Confidence Interval. (DOCX)
S7 Table. Odds ratio for outcome (MI for Islamabad, CHD for Lahore) by quintile of gene score, compared to the lowest quintile. Logistic regression, adjusted for age and sex, was performed for each group. OR = Odds Ratio, CI = Confidence Interval. For the Islamabad study full data was available for 258 participants for the 19 SNP GS (106 controls/152 cases) and 268 participants for the 13 SNP GS (110 controls/158 cases). For the Lahore study full data was available for 438 participants for the 19 SNP GS (130 controls/308 cases) and 490 participants for the 13 SNP GS (145 controls/345 cases). (DOCX)