Identification of Allelic Heterogeneity at Type-2 Diabetes Loci and Impact on Prediction

Although over 60 single nucleotide polymorphisms (SNPs) have been identified by meta-analysis of genome-wide association studies for type-2 diabetes (T2D) among individuals of European descent, much of the genetic variation remains unexplained. There are likely many more SNPs that contribute to variation in T2D risk, some of which may lie in the regions surrounding established SNPs - a phenomenon often referred to as allelic heterogeneity. Here, we use the summary statistics from the DIAGRAM consortium meta-analysis of T2D genome-wide association studies along with linkage disequilibrium patterns inferred from a large reference sample to identify novel SNPs associated with T2D surrounding each of the previously established risk loci. We then examine the extent to which the use of these additional SNPs improves prediction of T2D risk in an independent validation dataset. Our results suggest that multiple SNPs at each of 3 loci contribute to T2D susceptibility (TCF7L2, CDKN2A/B, and KCNQ1; p<5×10−8). Using a less stringent threshold (p<5×10−4), we identify 34 additional loci with multiple associated SNPs. The addition of these SNPs slightly improves T2D prediction compared to the use of only the respective lead SNPs, when assessed using an independent validation cohort. Our findings suggest that some currently established T2D risk loci likely harbor multiple polymorphisms which contribute independently and collectively to T2D risk. This opens a promising avenue for improving prediction of T2D, and for a better understanding of the genetic architecture of T2D.


Introduction
Approximately 65 loci have been shown to be associated with type-2 diabetes (T2D) through genome-wide association studies (GWAS). However, variation at these loci accounts for a small proportion of the expected heritability of T2D [1,2]. Among several potential strategies for identifying additional contributing genetic variation, one approach is to determine whether there are additional genetic markers near established loci that act independently or jointly with the reported marker (lead SNP).
Allelic heterogeneity is a feature of the genetic architecture of many traits, including common traits and diseases such as height, BMI, and T2D [3±6]. In the context of T2D genetics, both Morris et al. [2] and Yang et al. [7] have suggested that additional SNPs in established loci are associated with T2D risk. However, Morris et al. only considered SNPs in weak linkage disequilibrium (r 2 ,0.05) with the lead SNP, and that were not in the same recombination interval. Hence, without formal conditional analysis, they identified two loci as having multiple associations at genome-wide significance (KCNQ1 and CDKN2A/B), and two more at suggestive levels (DGKB and MC4R). Yang et al. have recently developed a method for identifying additional associated SNPs based on conditional/joint (C/J) analysis using GWAS summary statistics and linkage disequilibrium (LD) information from a reference sample [7]. They applied their method to only a single established T2D locus (CDKN2A/B), and identified two novel SNPs at that locus that were significantly associated with T2D when fitted jointly. Finally, on a smaller scale (1,924 cases and 5,380 controls), Ke [8] identified multiple associated loci at the CDKN2A/B and TSPAN8 loci. Although higher power is afforded with the GWAS meta-analysis approach to identify associations with single SNPs, it does not allow for direct C/J analysis since the actual genotype data is not available. The advantage of the method developed by Yang et al. is that it takes advantage of the greater power of GWAS meta-analyses, while also testing for C/J associations, which would otherwise be impossible without individual level data.
Here, we comprehensively examine allelic heterogeneity based on the method developed by Yang et al. at 65 T2D loci discovered by the DIAGRAM consortium, using the summary statistics from their recent meta-analysis of T2D GWAS. We then examine the extent to which these newly identified SNPs increase the accuracy of T2D risk prediction in an independent validation dataset.

Datasets
We used 6,054 nominally unrelated European-American subjects (genomic relationship coefficient ,0.025, based on approximately 2.5 million SNPs) from the Atherosclerosis Risk in Communities (ARIC) study [9] to obtain linkage disequilibrium (LD) estimates. According to Yang et al. [7], this sample size is sufficient for LD estimation with minimal error. In order to maximize the overlap of SNPs between the meta-analysis summary statistics (see below) and the ARIC study, we used IMPUTE2 software [10] along with 1000 Genomes reference data to impute millions of additional SNPs. Prior to imputation, we excluded individuals with a high genotype missing rate (.10%). SNPs were excluded based on extreme minor allele frequency (,0.5%), a high missing rate (.10%), or failed Hardy-Weinberg equilibrium (p,0.005). After imputation, we excluded SNPs with info' ,0.6 (measure of imputation quality), and SNPs with genotype dosage between 0.33 and 0.66, or between 1.33 and 1.66. Intermediate dosages outside of these specified ranges were rounded to the nearest integer. We did not use intermediate genotype dosages since this was not an option with the GCTA software, described below.
The validation dataset consisted of European-American subjects from the Multi-Ethnic Study of Atherosclerosis (MESA) [11], which included 225 T2D cases and 1,985 controls. T2D cases were defined as having a fasting glucose level $126 mg/dL, a selfreport of taking diabetes medication, or a physician diagnosis of T2D. This dataset can be considered an independent validation dataset since it was not part of the DIAGRAM meta-analysis, whereas ARIC was a part of this meta-analysis, thus precluding it from any validation assessment. We implemented genotype QC and imputation as detailed above. However, we did not round or remove intermediate genotype dosages. The MESA and ARIC dataset were obtained from dbGaP (database of Genotypes and Phenotypes). IRB approval was obtained from the University of Arizona.

Conditional/Joint Analysis
Using the summary statistics from the discovery phase of the latest version (v3) of the DIAbetes Genetics Replication And Metaanalysis (DIAGRAM) consortium, available to the public through an online source [12] and LD estimates from the ARIC dataset as described above, we used GCTA software [13] to perform stepwise model selection. Briefly, SNPs were selected into the model based on p-values in the meta-analysis. An iterative scheme was adopted in which C/J analyses were alternatively performed with the stepwise selection procedure. SNPs with a re-estimated (i. e. through C/J estimation as opposed to marginal estimation) pvalue under a certain threshold were selected. For a full description of the method, see Yang et al. [7]. We restricted our analysis to only the genomic regions within 1 Mb of the top SNP at the 65 established T2D loci as reported in Morris et al [2]. This filtering along with the QC filtering described above resulted in 112,329 SNPs being used in this analysis. For each SNP, we recorded the following information as input for the C/J analysis: effect allele, effect size (log of odds ratio), corresponding standard error, p-value, allele frequency of the effect allele (based on ARIC sample described above, as this was not available in the DIAGRAM summary statistic file), and sample size (sum of cases and controls). We used PolyPhen-2 [14] to determine whether any of the newly identified SNPs had any predicted functional effect, and RegulomeDB [15] to determine whether these SNPs may lie in regulatory regions (e.g. transcription factor binding sites) or are associated with specific DNA features (e.g. DNAse sensitivity).

Validation/Prediction
We compared several prediction models. First we constructed a baseline model which only considered demographic information (sex and age). Then we added a weighted genetic risk score [16] based on only the set of lead SNPs with weights corresponding to the log odds ratios according to the DIAGRAM meta-analysis summary statistics. Lead SNPs were defined as those that had the lowest p-value in the respective 2 Mb region according to the DIAGRAM Stage 1 meta-analysis summary statistics. We then considered a weighted genetic risk score based on all SNPs identified by the C/J analysis with weights corresponding to the coefficients estimated from the C/J analysis. We conducted the above analyses at the following p-value thresholds based on the C/ J results: 5610 28 , 5610 27 , 5610 26 , 5610 25 , and 5610 24 . We examined the proportion of variance explained by these additional SNPs by calculating the variance explained on the liability scale, estimated through the odds ratios and allele frequencies of the SNPs, and assuming a disease prevalence of 10%, using the Mangrove package [17] in R [18]. We also calculated Nagelkerke's R 2 [19] of the models which include age and sex and each of the GRS, using the fmsb package [20] in R, and report the Akaike information criterion (AIC) [21] for each of these models. Prediction accuracy was estimated using the area under the receiver operating characteristic curve (AUC) as implemented in the pROC package [22] in R. Differences in AUC among models were compared by examining the change in AUC (DAUC) and assessed using the DeLong test [23] to determine statistical significance.

Conditional/Joint analysis
We identified novel genome-wide significant (p,5610 28 ) SNPs in the C/J analysis at the three following loci: TCF7L2, CDKN2A/ B, and KCNQ1 (see Table 1). In the TCF7L2 region, we identified three SNPs (rs7917983, rs17747324, rs12266632) within a 32 kb region. The lead SNP (rs4506565) was not selected in this model, but is positioned in this region and is in moderate LD with each of the novel findings (r 2 between 0.18 and 0.70). For each of these novel findings, the marginal effect sizes and p-values in the meta-analysis are similar to those estimated in the C/J analysis. By relaxing the p-value threshold to p,5610 24 , we discovered an additional SNP in this region (rs10128255). In the CDKN2A/B region, the lead SNP (rs2383208) was not selected in the C/J analysis. Instead, two SNPs (rs10757282 and rs10811661) approximately 1.9 kb downstream of the lead SNP were discovered. These SNPs are only 110 bases apart. rs10757282 is in relatively low LD with the lead SNP (r 2 = 0.29). However, rs10811661 is in high LD with the lead SNP (r 2 = 0.94). It should be noted that the correlation between the respective risk alleles is negative (r = 20.54), suggesting that estimates obtained through the single marker association were underestimated for both SNPs, as evidenced by the larger effect sizes and lower p-values estimated in the C/J analysis compared to the meta-analysis marginal association results (see Table 1). In the KCNQ1 region, we identified two SNPs in the C/J analysis (p,5610 28 ). One SNP (rs462402) is in moderate LD with the lead SNP, rs231362 (r 2 = 0.45). The other SNP (rs163177) is approximately 121 kb upstream and is not in LD with the lead SNP (r 2 ,0.01). By relaxing the p-value threshold to p,5610 26 , we identified additional novel discoveries in the DGKB and TP53INP1 genes. Continuing to relax this threshold, we identify 17 (p,5610 25 ) (see Table 1) and 34 (p,5610 24 ) regions with multiple associated SNPs. According to PolyPhen, none of the SNPs identified through C/J analysis had any predicted functional effect. According to our query in RegulomeDB, rs387769 near HNF4A shows evidence of being linked to expression of a gene target, affecting binding of a transcription factor, and shows evidence of a DNase footprint. SNP rs7176681 near ZFAND6 also displays evidence of being a transcription factor binding site, and evidence of a DNase footprint. SNP rs17168486 in DGKB shows evidence of transcription factor binding and a DNase peak. Several other SNPs show evidence of transcription factor binding or a DNase peak (see Table 1).

Validation/Prediction
The AUC of the baseline prediction model which included only sex and age was 0.5702. For each of the three loci with additional SNPs that were significant at the p,5610 28 threshold (TCF7L2, CDKN2A/B, and KCNQ1), the inclusion of the SNPs identified by the C/J analysis resulted in a higher AUC than a model including only the lead SNP (although not statistically significant) in all regions except for KCNQ1 (see Figure 1).
Considering all three loci with additional SNPs at the p,5610 28 threshold collectively, we found that the use of the seven SNPs identified by the C/J analysis resulted in a slightly higher AUC (0.5979) than when using only the three lead SNPs (0.5803). This represents a doubling in DAUC over the age+ sex model (see Figure 2), although this difference is not quite statistically significant (p = 0.055), according to the DeLong test. The inclusion of all SNPs (lead and from C/J analysis) results in a statistically significant (p = 0.049), yet small, increase in AUC (see Figure 2). At the p,5610 26 threshold, the use of 11 SNPs at 5 loci (TCF7L2, CDKN2A/B, KCNQ1, DGKB and TP53INP1), slightly, but not significantly, increased prediction accuracy (AUC = 0.5885) over a model considering only the corresponding 5 lead SNPs (AUC = 0.5779; p = 0.126). At the p,5610 25 threshold, we observe a small increase in prediction accuracy when using the 39 SNPs identified by the C/J analysis instead of the corresponding 17 lead SNPs (AUC = 0.5892 vs. 0.5724; p = 0.079). Finally, at the p,5610 24 threshold, the use of 120 SNPs identified by the C/J analysis and the lead SNPs results in a slightly higher and nearly statistically significant increase in AUC over that of a model which includes only the 34 lead SNPs at the corresponding loci (AUC = 0.5965 vs. 0.5858; p = 0.067). Table 2 shows the proportion of variance explained by the additional SNPs identified by the C/J analysis. For each of the three loci, the SNPs identified by the C/J analysis explained slightly more of the variance in T2D risk than the lead SNP. Similarly, for the collection of SNPs identified by the C/J analysis at various p-value thresholds, we observe an increase in the proportion of T2D variance explained by the SNPs and the GRSs, along with decreasing AIC values.

Discussion
Our analyses confirm previous findings regarding the allelic heterogeneity present at the CDKN2A/B, KCNQ1, DGKB, and MC4R loci. We provide novel evidence of allelic heterogeneity at genome-wide significance at the TCF7L2 locus. We support our finding in TCF7L2 by showing that the use of the three identified SNPs results in a small increase in AUC (albeit not statistically significant) compared to using the lead TCF7L2 SNP (rs4506565) alone. We observe similar but much weaker trends at the CDKN2A/B and KCNQ1 loci.  Abbreviations: Chr: chromosome, bp: base pair position, refA: reference allele, freq: frequency of the risk allele, b: regression coefficient from meta-analysis summary statistics, se: standard error from meta-analysis summary statistics, p: p-value from meta-analysis summary statistics, n: sample size in meta-analysis, bJ: regression coefficient estimated from conditional/joint analysis, bJ_se: standard error estimated from conditional/joint analysis, pJ: pvalue from estimated from conditional/joint analysis, LD (r): linkage disequilibrium between corresponding SNP and the following SNP at the same locus.
(1f: eQTL+transcription factor (TF) binding/DNase peak; 2a: TF binding+matched TF motif+matched DNase Footprint+DNase peak; 3a: TF binding+any motif+DNase peak; 5: TF binding or DNase peak). doi:10.1371/journal.pone.0113072.t001 At less stringent p-value thresholds, we observe additional putatively associated SNPs at up to 34 loci. Considering the collective set of loci in which additional associated SNPs were identified through C/J analysis, prediction accuracy appears to slightly improve with the addition of these additional SNPs in our validation dataset. At all p-value thresholds, the DAUC over the sex + age model is at least two-fold greater when using the C/J identified SNPs compared to using the lead SNPs alone.
The strength of the method developed by Yang et al. is well exemplified by the multiple associated SNPs identified at the TCF7L2 locus, since the use of the three SNPs (which do not include the lead SNP) appears to be more informative than only using the lead SNP, rs4506565. Another example of the strength  of this method is the case in which two risk alleles are in negative LD. Without the C/J analysis, the additional SNPs in the CDKN2A/B region would not be identified when analyzed on their own.
The main limitation of this method is that associations are not tested directly, but rather through knowledge of marginal associations, and LD patterns in a different dataset (of the same ancestral background). A major limitation of the validation stage of our study is the relatively small sample size which limits the statistical power to detect differences in prediction accuracy between different GRSs. From this perspective, it will be important to continue validating these findings in larger datasets, and to combine actual genotype data across multiple datasets instead of using summary statistics. Furthermore, it will be important to dissect the allelic heterogeneity on a locus-by-locus basis to closely examine the patterns/existence of dependencies and additive or interactive effects. Finally, it will be important to functionally characterize these as well as all GWAS findings to more firmly establish causality and better understand molecular mechanisms leading to T2D.
Nevertheless, this approach is clearly promising for a greater understanding of the molecular basis of type-2 diabetes, and potentially for use in risk prediction scores. As additional loci are identified through GWAS, it will be important to systematically identify instances of allelic heterogeneity and to examine the extent to which additional SNPs can help to shed light on the functional basis of genetic variation.