Caution in Interpreting Results from Imputation Analysis When Linkage Disequilibrium Extends over a Large Distance: A Case Study on Venous Thrombosis

By applying an imputation strategy based on the 1000 Genomes project to two genome-wide association studies (GWAS), we detected a susceptibility locus for venous thrombosis on chromosome 11p11.2 that was missed by previous GWAS analyses that had been conducted on the same datasets. A comprehensive linkage disequilibrium and haplotype analysis of the whole locus where twelve SNPs exhibited association p-values lower than 2.23 10−11 and the use of independent case-control samples demonstrated that the culprit variant was a rare variant located ∼1 Mb away from the original hits, not tagged by current genome-wide genotyping arrays and even not well imputed in the original GWAS samples. This variant was in fact the rs1799963, also known as the FII G20210A prothrombin mutation. This work may be of major interest not only for its scientific impact but also for its methodological findings.


Introduction
We had previously reported the results of two genome-wide association studies (GWAS) for venous thrombosis (VT) conducted in samples of French origin [1,2]. The first GWAS included 419 VT patients and 1,228 controls genotyped with the Illumina Sentrix HumanHap 300 beadchip [1] while the second, composed of 1,542 cases and 1,110 controls were genotyped with Illumina Human 610Quad and 660W beadchips [2]. In both studies, cases were unrelated VT patients, free of any chronic conditions and without known major genetic risk factors including anti-thrombin, protein C or protein S deficiency, homozygosity for FV Leiden and FII G2021A mutations. Controls were healthy individuals selected from two French national cohorts, the SUVIMAX [3] and Three-City Study (3C) [4], respectively. The meta-analysis of these two GWAS identified the well-established ABO, F5, F11 and FGG genes and provided novel strong support in favor of HIVEP1, PROCR and STAB2 loci as VT susceptibility genes [2]. This metaanalysis was performed using imputed data derived from the HapMap2 release 21 reference dataset containing 2,557,252 autosomal single nucleotide polymorphisms (SNPs). This genotype resource, used in a very large number of meta-GWAS analyses, is particularly efficient for testing the association with a phenotype of non genotyped SNPs whose minor allele frequencies (MAF) are greater than 5% [5]. However, it may not be adapted to infer the association of SNPs with lower MAF. To overcome this limitation and make possible the inference of rare SNPs with MAF as low as 1%, imputation analysis using data from the 1000 Genomes project [6] has recently been proposed, and a few examples [7][8][9][10] substantiated the interest of such approach.

Results
The imputation analysis was performed using MACH [11] (v1.0.16a) (http://www.sph.umich.edu/csg/abecasis/mach/) and Minimac (v4.4.3) (http://genome.sph.umich.edu/wki/Minimac) software. 6,754,935 SNPs were imputed with good imputation quality (r 2 .0.3) [11] in both GWAS. The allele frequency distribution of the imputed studied SNPs was shown in Figure  S1. Association of imputed SNPs with VT was tested using a likelihood ratio test statistics implemented in the mach2dat (v1.0.18) software (http://www.sph.umich.edu/csg/abecasis/ mach/) while adjusting for the first four principal components as described in [2], separately in each GWAS. The results obtained in the two GWAS were then combined using a fixed-effect metaanalysis based methodology implemented in the METAL software [12]. A Quantile-Quantile (QQ) plot representation of the results was shown in Figure S2 and the resulting genomic control factor was 0.993. 217 SNPs were found, at the genome-wide significant at the 7.4610 29 level, consistently associated with VT across the two GWAS samples (Table S1). These VT-associated SNPs overlapped five loci on four chromosomes. Four of the loci mapped the aforementioned ABO, F5, F11 and FGG genes while a novel association involving the 11p11.2 locus (Table 1) was identified. Twelve SNPs, from position 47,373,425 to 48,064,194 (according to hg19 reference) and overlapping the MYBPC3-SPI1-CELF1-KBTBD4-NUP160-PTPRJ gene cluster demonstrated significant associations with VT, ranging from P = 6.97 10 213 to P = 2.23 10 211 . All these SNPs had similar allele frequencies (,3%) and similar genetic effects on VT risk (Odds Ratio (OR) , 2.5) suggesting the existence of a strong linkage disequilibrium (LD) block (pairwise r 2 or |D'|.0.8) explaining the association signal observed at the 11p11.2 locus. This hypothesis was supported by the results of a series of conditional logistic analyses showing that after adjusting on any of these SNPs all other observed associations at this locus vanished (P.0.10). Conversely, genotyped SNPs at this locus exhibited low to moderate LD ( Figure 2) with median and 90 th percentile of the pairwise r 2 distribution being 0.10 and 0.52, respectively. We further investigated the haplotype structure derived from the genotyped SNPs using a previously described statistical methodology [13] based on the Stochastic-EM algorithm [14]. For this, an Akaike Information Criterion (AIC) based strategy was applied to our largest GWAS [2] in order to identify the most informative and parsimonious haplotypic model of 1 to 4 genotyped SNPs, mapping 47,300,000 to 48,100,000, with respect to disease risk prediction. The best identified combination included three SNPs, rs2856650, rs3740689 and rs10769258, that generated six haplotypes whose global distribution strongly differed between cases and controls, consistently in both GWAS ( Table 2). The haplotypic association appeared to be mainly attributable to the uncommon AGT haplotype that was more frequent in cases than in controls. The Odds Ratio for VT associated with this AGT haplotype was 2.37 [1.36-4.15] (P = 2.39 10 23 ) and 2.99 [2.02-4.44] (P = 4.23 10 28 ) in the first and second GWAS, respectively) ( Table 2). When haplotype analyses were adjusted for the imputed dose at any of the twelve SNPs, these ORs were reduced to 1.03 [0.41-2.58] (P = 0.950) and 0.96 [0.55-1.66] (P = 0.879) suggesting that the AGT haplotype actually tagged the rare alleles of the long range LD block.
To replicate the association observed at the 11p11.2 locus, we genotyped two SNPs selected from Table 1, the MYBPC3 rs2856656 and the CELF1 rs60206633, in an independent sample of 592 controls and 596 VT cases from the FARIVE study [1]. We opted for the genotyping of two SNPs to further validate our hypothesis that the detected imputed rare variants were in strong LD. In FARIVE, the two SNPs were indeed in strong LD (D' = +0.89, r 2 = 0.49) and showed evidence for association with VT (Table 3). In particular, the rs2856656-G allele was associated with an OR of 1.70 [1.15-2.51] (P = 0.008; P Bonferroni-corrected = 0.016). This pattern of association was very consistent with that observed with the imputed SNPs of the long range LD block (Tables 1 & 2). Haplotype analysis of these two SNPs further showed that the rs2856656-G allele was carried by two haplotypes that were both more frequent in cases than in controls (Table S2). One of these two haplotypes also carried the rare rs60206633-G allele suggesting that the association signal observed at rs60206633 was due to its LD with rs2856656.
To further study the association of the 11p11.2 locus with VT, we genotyped the MYBPC3 rs2856656 and the CELF1 rs60206633 in the MARTHA study [1] composed of 1150 VT cases, among which 812 were part of the second GWAS, and a new sample of 801 controls that were not part of any of the two GWAS. By design, the control group of the MARTHA study had been enriched in healthy individuals heterozygous for the FV Leiden or FII G20210A mutations (,40% of all controls). The association of rs2856656 with VT risk was not significant in MARTHA. While the frequency of the rs2856656-G allele in cases was comparable to that of other VT samples, in contrast it was higher than expected in controls (Minor Allele Frequency = 0.069). Because of the enrichment of MARTHA controls in FV and FII carriers, we further investigated the genotype distribution of the rs2856656 according to the presence of FV and FII mutations and observed that the frequency of the rs2856656-G allele was strongly increased in FII carriers in all subgroups (Table 4). Conditioning on FII G20210A mutation, the allele frequency of the rs2856656 did not differ according to VT status (Table 4). In FARIVE, a conditional logistic regression analysis was conducted and revealed that, after adjusting for the FII G20210A (rs1799963), the rs2856656-G allele was no longer associated with VT risk (OR = 1.26 [0.80-1.99], p = 0.310). The rs1799963 variant is located on chromosome 11 at a distance of 612 kb from rs2856656. The pairwise LD r 2 between these two SNPs were 0.53 and 0.24 in MARTHA and FARIVE, respectively, while the associated D' were +0.86 and +0.68, respectively.
Coming back to our imputation GWAS data, we observed that the rs1799963 variant was imputed with poor quality (r 2 = 0.12 and r 2 = 0.27 in the first and second GWAS, respectively) and thus did not pass the quality control. Nevertheless, it showed suggestive association with VT, P = 0.053 and P = 0.110, in the first and second GWAS, respectively, for a combined statistical evidence of P = 0.020. A conditional logistic regression analysis was then conducted to estimate the effect of the imputed rs2856656 after adjusting for the imputed rs1799963. As indicated in Table 5, the imputed rs2856656-G allele was still associated with increased risk for VT, P = 7.22 10 24 and P = 1.13 10 211 , in the first and second GWAS, respectively. However, because the rs1799963 variant was typed in the GWAS patients as part of the inclusion/exclusion criteria (see Materials and Methods), we re-ran the conditional analyses using the true genotyped rs1799963 in cases rather than the imputed dose. The association of rs2856656 with VT was now no longer significant, P = 0.643 and P = 0.122, in the first and second GWAS respectively (Table 5). To corroborate the poor quality of the imputation at rs1799963 mentioned above, we calculated the Spearman correlation between the imputed dose and the true genotype at rs1799963 in all cases for whom both information was available (Figure 1). This correlation was only r = 0.36 and r = 0.48 in the 419 and 1,542 cases from the first and second GWAS, respectively. As shown in Figure 3, the rs1799963 genotype was poorly imputed in heterozygotes individuals. Finally, a further haplotype analysis revealed that the rare rs17999963-A allele was mainly carried by the AGT at-risk haplotype discussed above (Table S3). Of note, a LD analysis of the whole chromosome 11 region from 46,600,000 to 48,000,000 bp containing 119 SNPs (Figure 4) showed that the rs1799963 variant was not in strong pairwise LD with any other common SNPs, the higher observed r 2 being 0.15 with AGBL2 rs7930612.
To summarize, in the two discovery GWAS based on imputed data as well as in the two standard case-control samples where the rs2856656 and the rs1799963 were directly genotyped, the association of rs2856656 with VT vanished after adjusting for rs1799963.
Finally, we re-analyzed the two imputed GWAS datasets by conditioning on the previously identified VT-associated SNPs at ABO, HIVEP1, FV, F11, FGG, PROCR and STAB2 loci [2]. Except the 11p11.2 locus that remained genome-wide significant, no other locus demonstrated statistical association at P,1.0 10 26 .

Discussion
By conducting an updated and comprehensive in-depth analysis of two GWAS, we were able to ''re-discover'' a strong risk locus for VT known for more than one decade [15], the F2 gene, but missed by all large scale association studies conducted so far on the disease [1,2,16]. Several conclusions can be drawn from this work. First, it adds to the rather limited illustrative literature about the interest of imputation-based GWAS analyses using the 1000 Genomes project that can help identify rare variants in new disease-associated loci not detected by the first waves of GWAS; Second, the functional variant could be quite far away from the detected hits. In our example, the original association signal mapped to an interval from 47,373,425 bp (MYBPC3) to 48,064,194 bp (PTPRJ) on chromosome 11, and this is up to 1.3 Mb away from the functional G20210A mutation. Would PTPRJ have been a plausible biological candidate for VT, our quest for the culprit variant could have led us to a dead end; Third, a functional variant could be missed if its imputation quality is low which would likely be the case for a non genotyped rare variant showing low to modest pairwise LD with other SNPs in its neighborhood. As shown in Figure 5, imputation quality was satisfactory for SNPs with inferred MAF .0.01. About 75% of the SNPs with MAF ,0.01 demonstrated poor imputation quality in our study while ,82% of the SNPs with MAF .0.01 were correctly (r 2 .0.3) imputed. Rare variants, like FII G20210A, which are not present on genotyping arrays can nevertheless be tagged by haplotypes generated from common SNPs not necessarily in strong  LD with each other. A similar phenomenon was previously observed at the LPA locus associated with coronary artery disease [13]. From a population genetics perspective, it would be interesting to investigate whether evolutionary selection forces could be exerted on the F2 locus as suspected for the F5 gene [17,18] and could explain why a functional ''deleterious'' allele was maintained on long-range haplotype.
To conclude, we have shown how a comprehensive analysis of 1000G imputed genotype data was able to discover a disease locus missed by previous GWASes. Our work also demonstrates the need for exercising careful analysis of detected imputed rare associations to avoid false inference on the functional variant, and for this, LD and haplotype analyses of the associated loci may be of great value. This may have both scientific and methodological utility for geneticists involved in similar studies.

Ethics Statement
Each individual study was approved by its institutional ethics committee and informed written consent was obtained in accordance with the Declaration of Helsinki. All subjects were of European origin. All subjects were of European origin.
Ethics approval were obtained :

Studied populations
The description of the studied populations has already been extensively described in [1] for the first GWAS MARTHA and FARIVE studies, and in [2] for the second GWAS.
Briefly, all studied patients were unrelated Caucasians subjects with a document first event of VT and lacking strong known genetic risk factors, including AT, PC, or PS deficiencies, and   [3] with no chronic conditions, and no regular medicines. Patients of the second GWAS (n = 1,597) were recruited from the Thrombophilia center of La Timone hospital (Marseille, France) with no restriction one age of onset, and were compared to another independent sample of 1,140 French controls free of any chronic disease and selected from the 3C study [4]. The MARTHA study was composed of 1,150 VT patients also recruited from the Thrombophilia center of La Timone hospital, among which 812 were part of the second GWAS, and of a control sample of 801 healthy individuals. 475 of these controls were recruited from the Marseille area and 326 were healthy heterozygous carriers of the FV Leiden or FII G20210A mutations selected from the national health examination centers of the French Social Security in collaboration with the Hemostasis and Thrombosis Study Group. The FARIVE study is a multicenter case-control study composed of 607 patients with a documented first episode of VT and 607 controls matched for age and sex.

Genotyping and Quality control
First Genome Wide Association Study [1]. Individuals participating in this GWAS were genotyped for 317,139 SNPs using the Illumina Sentrix HumanHap300 Beadchip. The application of several quality control criteria previously described [1] lead the final analysis of 291,872 genotyped SNPs in a sample of 419 cases and 1,228 controls.
Second Genome Wide Association Study [2]. As extensively described in [2], 1011 VT patients were typed with the Illumina Human 610-Quad Beadchip and 586 VT patients were typed with the Illumina Human 660W-Quad Beadchip. Healthy individuals from the 3C study were typed with the Illumina Human 610-Quad Beadchip. The application of quality control filters lead to the final analysis of 481,0002 autosomal SNPs in a sample of 1,542 VT patients and 1,110 healthy individuals.
Replication studies. In FARIVE and MARTHA, the MYBPC3 rs2856656 and the CELF1 rs60206633 were genotyped by allele-specific PCR.

Statistical Analysis
Imputation. In both GWAS datasets, imputation of 11,572,501 autosomal SNPs was conducted using the MACH [11] according the 1000G 2010-08 release reference dataset. The association of each imputed SNP with VT was tested by use of a logistic regression analysis in which allele dosage (from 0 to 2 copies of the minor allele) of imputed SNPs was used. Analyses were adjusted for the first four principal components and were performed using the mach2dat (v 1.0.18) software (http://www. sph.umich.edu/csg/abecasis/MACH/download/).
Combined GWAS analysis. All SNPs with acceptable imputation quality (r 2 .0.3) in both imputed GWAS datasets were entered into a meta-analysis, leading to 6,754,935 SNPs left for statistical association analysis. For the meta-analysis, a fixed-effect model relying on the inverse-variance weighting was used as implemented in the METAL software [12]. Homogeneity of associations across the two GWAS studies was tested using the Mantel-Haenszel method [19]. A statistical threshold of 7.4 10 29 , which controls for the Bonferroni corrected type I error rate of 0.05 according to the number of tested SNPs, was used to declare genome-wide significance.
Haplotype analysis. A systematic analysis of all possible combinations of 1 to 4 typed SNPs within the chromosome 11 47,300,000-48,100,000 interval was performed to select the most informative and parsimonious haplotype configuration in terms of predicting disease status using a previously described strategy based on the Akaike's Information Criterion (AIC) [13]. This strategy calculates an AIC value for each investigated haplotype model and then subtracts from each model the minimum AIC value obtained over all models explored, giving a rescaled AIC value for each haplotype model. The models with a rescaled AIC #2 are considered equivalent to the most informative model. Among these equivalent models, the most parsimonious model with the fewest polymorphisms is considered the best model. The THESIAS program implementing the stochastic-EM algorithm [14] for haplotype inference was used to estimate haplotype frequencies and haplotype effects on VT risk derived from specific sets of SNPs.
Replication. In the FARIVE and MARTHA studies, association of tested SNPs with VT risk was assessed by use of the Cochran-Armitage trend test [20] and logistic regression models. Supporting Information Figure S1 Distribution of the inferred minor allele frequencies in two imputed GWAS datasets. Only SNPs with imputation quality control r 2 . 0.30 are represented. (TIF) Figure S2 Quantile-Quantile plots summarizing the results of a meta-analysis of two GWAS for VT. QQ plot derived from SNPs imputed according to 1000G 2010-08 release is shown in blue with its 95% confidence interval in shaded area. The corresponding genomic control coefficient was 0.993. QQ plot derived from HapMap2 release 21 imputed data is shown in black with a genomic control of 1.023. (TIF) Table S1 Genome-wide significant (p,7.4610-9) SNPs associated with VT risk in a meta-analysis of two imputed GWAS datasets. Results are shown in a separate EXCEL file (Table S1). Beta: log Odds Ratio for VT risk associated with the A2 allele SE: Standard error of the beta coefficient r 2 : imputation quality control coefficient. (XLSX)