Evaluation and application of summary statistic imputation to discover new height-associated loci

As most of the heritability of complex traits is attributed to common and low-frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-increasing reference panels is very cumbersome, as it requires reimputation of the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to impute the summary statistics directly, termed summary statistics imputation, which we improved to accommodate variable sample size across SNVs. Its performance relative to genotype imputation and its practical utility have not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that genotype imputation has a 3- to 5-fold lower root-mean-square error and better distinguishes true associations from null ones. We observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01 and 0.05, using summary statistics imputation yielded a decrease in statistical power of 9, 43 and 35%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants, and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci. Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian randomisation or LD-score regression.

Genome-wide association studies (GWASs) have been successfully applied to reveal genetic markers associated with hundreds of traits and diseases. The genotyping arrays used in these studies only interrogate a small proportion of the genome and are therefore typically unable to pinpoint the causal variant. Such arrays have been designed to be cost-effective and include only a set of tag single nucleotide variants (SNVs) that allow the inference of many other unmeasured markers. To date, thousands of individuals have been sequenced [1, 2] to provide high resolution haplotypes for genotype imputation tools such as IMPUTE and minimac [3,4], which are able to infer sequence variants with ever-increasing accuracy as the reference haplotype set grows. Downstream analyses such as Mendelian randomisation [5], approximate conditional analysis [6], heritability estimation [7], and enrichment analysis using high resolution annotation (such as DHS) [8] often require genome-wide association results at the highest possible genomic resolution. Summary statistics imputation has been proposed as a solution that only requires summary statistics and the linkage disequilibrium (LD) information estimated from the latest sequencing panel to directly impute up-to-date

(M), and c represents the correlations between M and SNV u. Both correlation entities are regularised [14] using a regularisation parameter λ, yielding C_λ and c_λ. In this paper we use λ = 1 − 2/√n [15]. Z_u|M describes the Z-statistic of an untyped SNV u given the Z-statistics of a set of tag SNVs (M).

Since LD between SNVs is minimal beyond 250 Kb, we choose M to include all measured variants within at least 250 Kb from SNV u. To speed up the computation when imputing SNVs genome-wide, we apply a sliding window strategy, where SNVs within a 1 Mb window are imputed simultaneously using the same set of m tag SNVs within the 1 Mb window ± 250 Kb flanking regions.
We use an adjusted imputation quality that corrects for the effective number of tag SNVs p_eff [16]:

To account for variable sample size in summary statistics of tag SNVs, we use an approach to down-weight entries in the C_λ and c_λ matrices for which summary statistics were estimated from a GWAS sample size lower than the maximum sample size in that dataset [10].

For more details on our summary statistics imputation method and extensions of it, see our complementary paper [10].
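The imputation step described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the SSIMP implementation. It assumes the standard conditional-expectation form (the imputed Z-statistic as a linear combination of the tag Z-statistics) and a simple shrinkage of the LD matrices towards the identity. The function name `impute_z`, the parameter `shrink` (the weight placed on the identity matrix) and all numeric values are hypothetical. The sample-size down-weighting and the adjustment of the imputation quality for p_eff are not included.

```python
import numpy as np


def impute_z(z_tags, C, c, shrink):
    """Impute the Z-statistic of one untyped SNV from tag-SNV Z-statistics.

    z_tags : (m,) Z-statistics of the tag SNVs
    C      : (m, m) LD (correlation) matrix among the tag SNVs
    c      : (m,) LD (correlation) between each tag SNV and the target SNV
    shrink : weight on the identity matrix (assumed regularisation form)
    """
    m = len(z_tags)
    # Assumed regularisation: shrink off-diagonal LD towards zero.
    C_reg = (1.0 - shrink) * C + shrink * np.eye(m)
    c_reg = (1.0 - shrink) * c
    # Weights of the linear combination: C_reg^{-1} c_reg.
    weights = np.linalg.solve(C_reg, c_reg)
    z_imputed = float(weights @ z_tags)
    # Unadjusted imputation quality: variance explained by the tag SNVs.
    r2_pred = float(c_reg @ weights)
    return z_imputed, r2_pred


# Toy example with two tag SNVs; the reference panel size of 3 781 is taken
# from the text, the LD values and Z-statistics are invented.
C_example = np.array([[1.0, 0.8], [0.8, 1.0]])
c_example = np.array([0.9, 0.85])
z_example, r2_example = impute_z(
    np.array([3.0, 2.5]), C_example, c_example, shrink=2 / np.sqrt(3781)
)
print(z_example, r2_example)
```

With `shrink = 0` and a target SNV in perfect LD with a single independent tag, the sketch returns that tag's Z-statistic unchanged and an imputation quality of 1, as expected.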

To assess the performance of summary statistics imputation in realistic scenarios we used two different datasets. In Section "Comparison of summary statistics imputation versus genotype imputation" we compare the performance of summary statistics imputation to genotype imputation, using measured and imputed genotype data from 120 086 individuals in the UK Biobank. In Section "Summary statistics imputation of the height GWAS of the GIANT consortium", we use published association summary statistics from 253 288 individuals to show that summary statistics imputation can be used to identify novel associations. Both analyses are centred around the genetics of human height. In the following we will often refer to two GIANT (Genetic Investigation of ANthropometric Traits) publications: Wood et al. [12], an analysis of HapMap

We term the 6 080 SNVs correlated with a height-associated lead SNV associated SNVs. Conversely, we refer to the 31 567 SNVs that are not correlated with any height-associated lead SNV as null SNVs. For both the null and the associated SNV groups, the largest group of analysed variants was common and well-imputed (S1 Fig)

an advantage compared to genotype imputation when α is low. As we approach lower imputation quality and MAF, the advantage of summary statistics imputation becomes more and more apparent across the whole range of α values. In general, whenever summary statistics imputation outperforms genotype imputation, this is because of a lower FPR (horizontal shift), and not because of increased power. This aspect is more clearly visible in S5 Fig

While previous studies have examined the role of (common) HapMap variants for height [12,17], the impact of rare coding variants could not be investigated until bespoke genotyping chips (interrogating low-frequency and rare coding variants) were designed to address this question in a cost-effective manner.
Such an exome chip based study was conducted by the GIANT consortium in 381 000 individuals and revealed 120 height-associated loci, of which 83 loci were rare or low-frequency [13]. These

Table 1. By imputing summary statistics for > 6M additional SNVs, using HapMap variants [12] as tag SNVs, we were interested in two aspects: (1) discovering new height-associated candidate loci, and (2) replicating these candidate loci in the UK Biobank and the GIANT exome chip look-up (Fig 6). We used the HapMap-based height study and the UK10K reference panel as inputs for summary statistics imputation and used all HapMap SNVs as tag SNVs. We imputed variants that were available in UK10K with a MAF_UK10K ≥ 0.1%, as well as all reported exome variants in Marouli et al. [13]. In

variants are listed in S1 Table and locus-zoom plots are provided in S7 Fig.

Next, we used the UK Biobank to replicate the associations with height of these 35 candidate variants and subsequently grouped them into replicating (20 variants) and not replicating (15 variants) (at α = 0.05/35 level).

An overview of the 20 replicating variants is given in Table 2. One region had already been discovered in the GIANT exome chip study: rs28929474, located in gene SERPINA1. Fig 7 shows this region as a locus-zoom plot with summary statistics from the HapMap study, summary statistics imputation, and the exome chip study. To annotate these 20 novel candidate variants further, we investigated whether they are eQTLs or associated with other traits. We report this in Table 3, where we list eQTLs detected by GTEx [18], and Table 4, which presents a curated association-trait list by Phenoscanner [19]. In the following we describe variants that replicated in UK Biobank which are either eQTLs or have previously been associated with another trait.

We can classify the 35 candidate loci into three categories (i), (ii) and (iii) that

three candidate loci (#2, #3, #21 in S1 Table) contain borderline significant HapMap signals (P-value between 10⁻⁶ and 10⁻⁸ in [12]).

We observed that variants with higher MAF have a higher chance of replicating. Among the 20 candidate variants that did replicate in UK Biobank, 19 were common and one a low-frequency variant (rs112635299, MAF = 2.32%). Conversely, among the 15 candidate variants that did not replicate in the UK Biobank, 10 are rare, three are low-frequency variants, and two are common.

Locus #1: rs112635299 (imputed P-value 4.21 × 10⁻¹⁴) is a proxy of rs28929474 (LD = 0.88), has been associated with alpha-1 globulin [20] and is associated with multiple lipid metabolites [21]. rs28929474 was identified in the GIANT exome chip study to be height-associated (P = 1.39 × 10⁻⁴⁵) [13]. The P-value calculated with summary statistics imputation was P = 1.06 × 10⁻¹³. rs28929474 is a low-frequency variant (MAF = 2.3%) and replicates in the UK Biobank with P = 1.66 × 10⁻²⁵.

Fig 6. Overview of imputation and replication scheme. This illustration gives an overview of how we used > 2M GIANT HapMap summary statistics (black rectangle) as tag SNVs to impute > 10M variants with MAF ≥ 0.1% in UK10K. After adjusting the summary statistics for conditional analysis we applied a selection process that resulted in 35 candidate loci. To confirm these 35 loci we used summary statistics from UK Biobank (blue) as replication as well as summary statistics from the exome chip study, if available [13] (red). Loci that had not been discovered by the exome chip study were termed novel.

Table 2. This table presents 20 regions that contain at least one imputed variant that is independent from top HapMap variants nearby and that replicated in the UK Biobank (at α = 0.05/35 level). Each row represents one region (#), indicating the SNV with the lowest conditional P-value. The first seven columns provide general information for each variant, followed by the P-value and sample size from summary statistics imputation, and the P-value and sample size from the UK Biobank. The last column assigns each of the 35 candidate loci to one of three groups: candidate loci (i) that were reported by [13] already, (ii) that had no reported HapMap variant nearby and (iii) that had reported HapMap variants nearby. r²_pred,adj of all variants listed was greater than or equal to 0.3. We provide a more detailed table for all 35 variants (both replicating and not replicating) in S1 Table.
(*) rs28929474, exome chip study results: P = 1.

Locus #15: rs78566116 is a variant on chromosome 6. rs78566116 has been associated with HPV8 seropositivity in cancer [22], rheumatoid arthritis [23] and ulcerative colitis [24].

Next, we focussed on 122 novel variants of Marouli et al. [13]. For this analysis we did not apply any MAF restrictions. Of these 122 variants, 11 variants were either not referenced in UK10K or on chromosome X, and were therefore not imputed, limiting the number of loci and variants to 111 (S3 Table). By grouping results below or above the P-value threshold of α = 0.05/111 we could classify variants into the ones that replicated and those that failed replication. This is summarised in Table 5

Table 3. This table shows SNVs which are significant eQTLs in GTEx [18]. We only report SNV-gene expression associations where the summary statistics pass the significance threshold of α = 10⁻⁶. The first four columns represent the region number, SNV, P-value from summary statistics imputation and the P-value in the UK Biobank. The four remaining columns are information extracted from GTEx, with the tissue name, gene name, the P-value of the association between the SNV and the gene expression, and the gene type. For each region, we only include the tissue with the lowest P-value per SNV-gene association. The full version of this table is available in S2 Table. # refers to the region number.

Table 4. This table describes SNVs previously associated with other traits. The search was conducted with Phenoscanner [19]. We only list SNVs for which Phenoscanner had information available regarding GWAS traits or metabolites. The first four columns specify region and SNV-id, followed by the P-value from summary statistics imputation and the P-value from the UK Biobank. Columns five to ten contain information extracted from Phenoscanner. We report the respective summary statistics that pass the significance threshold of α = 10⁻⁶. # refers to the region number, conc. to concentration.

Details on the imputation of all 111 variants are listed in S3 Table.

Table 5. This table presents summary statistics imputation results, limited to 111 variants identified as "novel" by [13]. We summarised the results according to their allele frequency and imputation quality category. For each subgroup we calculated the fraction of top exome variants that had a P-value ≤ 0.05/111 with summary statistics imputation.

In this article, we focussed on the comparison between genotype and summary statistics imputation. In contrast to previous work by others and us [10,11,25], here we systematically assessed the performance and limitations of summary statistics imputation through real data applications for different SNV subgroups characterised by allele frequency, imputation quality and association status (null/associated). In addition, we demonstrated the usefulness of summary statistics imputation to discover novel associated regions using existing association data. Note that in this paper we used an improved version of the original summary statistics imputation [11], which uses reference panel size dependent shrinking of the correlation matrix and incorporates variable sample size of tag SNVs.

Our study design has several limitations: for replication of summary statistics from European individuals we use the UK Biobank, which represents only a subset of all European ancestries and is genotype-imputed (instead of sequenced), but on the other hand provides a reliable resource due to its sample size. Furthermore, in UK Biobank, genotype imputation done for genotyped variants can only partially be compared to genotype imputation for untyped variants, as genotyped variants were used for phasing (therefore genotype imputation of genotyped variants is much easier and leads

The summary statistics imputation method itself has several limitations too. First, due to the size of publicly available sequenced reference panels we cannot explore the performance of rare variants (MAF < 1%). Second, the imputation quality metric r²_pred,adj tends to be inaccurate in the case of small reference panels. Third, the imputation of summary statistics of an untyped SNV is essentially the linear combination of the summary statistics of the tag SNVs (Eq. (1)). Such a model cannot capture non-linear dependence between tag and target SNVs [9], which is often the case for rare variants [26,27]. In contrast, genotype imputation is able to capture such non-linear

Comparison of summary statistics imputation versus genotype imputation
We compared summary statistics imputation and genotype imputation by using individual-level data from the UK Biobank, where we evaluated the imputation results for 6 080 SNVs that were correlated with a height-associated variant (associated SNVs) and 31 567 that were not correlated to any height-associated SNVs on the same chromosome (null SNVs).

In general, summary statistics imputation leads to a larger RMSE than genotype imputation in all twelve SNV subgroups investigated (Fig 4). Among associated SNVs, summary statistics imputation performs similarly to genotype imputation for well-imputed SNVs, but shows a trend for underestimation of the Z-statistics and lower correlation with the true effect size for medium- and badly-imputed SNVs (Fig 2). Conversely, genotype imputation has more consistent results for most of the twelve SNV subgroups (Fig 2 and 3), which is reflected in a correlation close to one between Z-statistics from genotype data and genotype imputation data.

Underestimation for null and associated SNVs
Ultimately, the underestimation of imputed Z-statistics with summary statistics imputation leads to a lower type I error. We calculated power and FPR for both methods and observe that for a given significance threshold, summary statistics imputation has a lower FPR at the cost of lower power compared to genotype imputation. This effect is amplified for SNV groups with lower imputation quality (r²_pred,adj < 1). For associated SNVs with r²_pred,adj < 1 we expect an underestimation because we are imputing summary statistics under the null model, whereas for null SNVs with r²_pred,adj < 1 we expect an underestimation due to decreased variance of the summary statistics imputation estimation.
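The decreased variance for null SNVs can be illustrated with a small simulation. The sketch below is not part of the original analysis. It assumes the unregularised linear imputation form, two hypothetical tag SNVs with invented LD values, and a fixed random seed. Under the null, the tag Z-statistics are drawn from a multivariate normal distribution with the LD matrix as covariance, so the imputed Z-statistic has variance equal to the imputation quality rather than 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical LD structure: two tag SNVs and one untyped target SNV.
C = np.array([[1.0, 0.5], [0.5, 1.0]])   # LD among tag SNVs
c = np.array([0.6, 0.6])                 # LD between tags and target

weights = np.linalg.solve(C, c)          # imputation weights C^{-1} c
r2 = float(c @ weights)                  # imputation quality, here 0.48

# Null tag Z-statistics: mean zero, covariance given by the LD matrix.
z_tags = rng.multivariate_normal(np.zeros(2), C, size=200_000)
z_imputed = z_tags @ weights

var_imputed = float(z_imputed.var())
# Two-sided test at alpha = 0.05 uses |Z| > 1.96.
fpr_imputed = float(np.mean(np.abs(z_imputed) > 1.959964))

print(f"imputation quality r2   : {r2:.3f}")
print(f"variance of imputed Z   : {var_imputed:.3f}")
print(f"FPR at nominal alpha=.05: {fpr_imputed:.4f}")
```

Because the imputed Z-statistics are shrunk towards zero, the empirical false positive rate falls well below the nominal level, which matches the lower type I error described above.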

Ideally, for an unbiased estimation of causal and null SNVs, the imputed Z-statistics (Eq. (1)) should be divided by r². However, as the imputation quality r²_pred,adj is noisily estimated from small reference panels (discussed below) and it is not guaranteed that the SNV we impute is causal, we risk overestimating the summary statistics of associated SNVs. This is the reason why we refrain from doing so.

SNVs with an accumulation of low −log10(P)-values for well-imputed SNVs and an accumulation of high −log10(P)-values for badly-imputed SNVs. We think that two factors are in play here. First, mostly due to polygenicity, the genomic lambda for height is λ_GC = 1.94, therefore we expect even seemingly null variants to show inflation. Second, for null SNVs, the sample variance of the imputed Z-statistics should be proportional to the average imputation quality. We calculated for each of the null SNV subgroups the ratio between the sample variance for Z-statistics from summary statistics imputation and the sample variance for Z-statistics from genotype data. For common null SNVs we observe a ratio that gradually decreases with imputation quality (0.89 for perfectly-, 0.79 for medium- and 0.68 for badly-imputed SNVs). For

Because the number of associated SNVs with MAF < 1% was too low (13 variants) to draw any meaningful conclusions, we refrained from analysing this MAF group. One other reason to exclude rare variants from this analysis is that the reference panel used (UK10K) contains 3 781 individuals, and therefore estimates of LD for rare variants are unreliable and rare variants can (in theory) only be covered down to MAF = 1/(2 · 3 781). We believe improving summary statistics imputation for rare variants will require not only larger reference panels to allow estimation of LD of rare variants, but also methods which would allow non-linear tagging of variants. It should be kept in mind that, just like for genotype imputation, even with very large reference panels, one will not be able to impute variants with extremely rare allele counts. To investigate these SNVs, full genome sequencing is indispensable [28].

Imputation quality
We find that our imputation quality measure r²_pred,adj is conservative and probably

Biobank have borderline significant HapMap signals in close proximity (P-value between 10⁻⁶ and 10⁻⁸ in [12]) and were therefore not reported in the study in 2014.

The 15 non-replicating candidate loci were on average on a lower allele frequency spectrum (ten are rare, three are low-frequency variants, and two are common). Allele frequency was higher among the 20 replicating candidate variants (19 were common and one a low-frequency variant).

Replicating GIANT exome chip imputation results
We then focussed on the summary statistics imputation of the 111 reported exome chip variants [13]. Knowing from our previous findings that rare variants are challenging to impute due to reference panel size, we expected to retrieve a larger fraction of common and low-frequency than rare variants. S8 Fig shows that we retrieved 49.5%

Among variants with lower imputation quality only two common and medium-imputed variants could be retrieved. As shown in Fig 2 and 5, the power of summary statistics imputation decreases with lower MAF and imputation quality.

imputation is a rapid and cost-effective way to discover novel trait associated loci. We also highlight that the principal limitations of summary statistics imputation are rooted in the LD estimation and in imputing very rare variants with sufficient confidence.

analysis we used Caucasian individuals (amongst people who self-identified as British) from the first release of the genetic data (n = 120 086). For SNVs, the number of individuals ranges between n = 3 431 and n = 120 082. In addition to custom SNP array data, UK Biobank contains imputed genotypes [30]. A subset of 820 967 variants were genotyped and imputed, and 72M variants were imputed by UK Biobank, using SHAPEIT2 and IMPUTE2 [30].

Imputation of height GWAS summary statistics conducted in UK Biobank
We imputed GWAS Z-statistics (run on directly genotyped data) within 1 Mb-wide regions, by blinding one SNV at a time and therefore allowing the remaining SNVs to be used for tagging. As tag SNVs we used all SNVs except the focal SNV within a 1.5 Mb window.

We selected 706 regions in total, consisting of 535 loci containing height-associated SNVs [12,13] and 171 regions not containing any height-associated (all P ≥ 10⁻⁵) SNV. More specifically, within each height-associated region we only imputed SNVs that have LD_max > 0.2. LD_max was defined as the largest squared correlation between a SNV and all height-associated SNVs on the same chromosome. In the 171 null regions we chose

To compare the performance between summary statistics imputation and genotype imputation followed by association we compared each method to the directly genotyped data association as gold standard. We used RMSE, bias, correlation, and the regression slope (no intercept) to evaluate these approaches against the truth.

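More precisely, the RMSE and the bias for a set of k = 1, ..., K SNVs are:

$$\mathrm{RMSE} = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(\hat{Z}_k - Z_k\right)^2}, \qquad \mathrm{Bias} = \frac{1}{K}\sum_{k=1}^{K}\left(\hat{Z}_k - Z_k\right),$$

with $\hat{Z}_k$ being the Z-statistic resulting from summary statistics imputation for SNV k and $Z_k$ the Z-statistic resulting from genotype data for SNV k (our gold standard).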
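The four evaluation measures can be computed as in the following sketch. The function name `evaluate` and the toy values are hypothetical. Two conventions are assumptions rather than statements from the text: the bias is taken as imputed minus gold standard, and the no-intercept slope regresses the imputed Z-statistics on the gold-standard ones.

```python
import numpy as np


def evaluate(z_imputed, z_truth):
    """Compare imputed Z-statistics with gold-standard Z-statistics."""
    z_imputed = np.asarray(z_imputed, dtype=float)
    z_truth = np.asarray(z_truth, dtype=float)
    diff = z_imputed - z_truth
    return {
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "bias": float(np.mean(diff)),  # assumed sign: imputed minus truth
        "correlation": float(np.corrcoef(z_imputed, z_truth)[0, 1]),
        # Least-squares slope of imputed on truth, forced through the origin.
        "slope": float(np.sum(z_imputed * z_truth) / np.sum(z_truth ** 2)),
    }


# Toy example: imputed Z-statistics shrunk to half the true values.
result = evaluate([0.5, 1.0, 1.5], [1.0, 2.0, 3.0])
print(result)
```

In this toy case the correlation is 1 while the slope is 0.5, which shows why the slope and bias are reported alongside the correlation: a systematic underestimation leaves the correlation untouched.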

For genotype summary statistics from associated SNVs that resulted from data with partial sample size, we computed an upsampled Z-statistic, where $Z_u$ represents the Z-statistic for SNV u, $N_u$ the sample size of SNV u and $N_{\max}$ the maximal sample size within the study: $Z^*_u = Z_u \cdot \sqrt{N_{\max}/N_u}$. Whenever we use Z-statistics from associated genotype data we use this upsampled version $Z^*$.
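A minimal sketch of the upsampling step follows. It assumes that the rescaling uses the square root of the sample size ratio, which is how Z-statistics of a truly associated SNV are expected to grow with sample size. The function name `upsample_z` and the example values are hypothetical.

```python
import math


def upsample_z(z, n, n_max):
    """Rescale a Z-statistic from sample size n to the maximal sample size n_max.

    Assumes the Z-statistic of an associated SNV grows with sqrt(sample size).
    """
    if n <= 0 or n_max < n:
        raise ValueError("expected 0 < n <= n_max")
    return z * math.sqrt(n_max / n)


# A SNV measured in a quarter of the samples: its Z-statistic doubles.
print(upsample_z(2.0, n=30_000, n_max=120_000))
```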

Additionally, we calculated power and false positive rate (FPR) for each method.

For SNVs with a real association we calculated the power as the fraction of SNVs with P ≤ α, whereas for SNVs with no association we calculated FPR as the fraction of SNVs with P ≤ α. We varied α between 0 and 1 and visualised FPR versus power for each method.
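The power and FPR calculation can be sketched as follows. The function names and the P-values are hypothetical; only the definition (fraction of SNVs with P ≤ α in the associated and in the null group) comes from the text.

```python
import numpy as np


def rejection_rate(p_values, alpha):
    """Fraction of SNVs whose P-value is at most alpha."""
    return float(np.mean(np.asarray(p_values, dtype=float) <= alpha))


def fpr_power_curve(p_associated, p_null, alphas):
    """Return (alpha, FPR, power) triples for a grid of thresholds."""
    return [
        (a, rejection_rate(p_null, a), rejection_rate(p_associated, a))
        for a in alphas
    ]


# Invented P-values for four associated and four null SNVs.
p_associated = [1e-9, 1e-4, 0.03, 0.2]
p_null = [0.04, 0.5, 0.7, 0.9]

for alpha, fpr, power in fpr_power_curve(p_associated, p_null, [0.001, 0.01, 0.05]):
    print(f"alpha={alpha}: FPR={fpr:.2f}, power={power:.2f}")
```

Plotting the FPR against the power over a fine grid of α values gives the curves used to compare the two imputation methods.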

Summary statistics imputation of Wood et al.
We imputed all non-HapMap variants that were available in UK10K, using the summary statistics in [12] as tag SNVs. In general, we only imputed variants with MAF_UK10K ≥ 0.1% (this allows a minimal allele count of 8 ≈ 0.001 · 3 781 · 2), except for the 111 exome variants reported in [13], which we imputed regardless of their MAF. We divided the genome into 2 789 core windows of 1 Mb. We imputed the summary

imputed P-value ≤ 10⁻⁸, ranging from position bp(1) to bp(2). The second SNV set contained all reported HapMap SNVs (697 in total) within a range of bp(1) − 1 Mb and bp(2) + 1 Mb. Having two SNV sets (the first set with newly detected variants, the second set with reported HapMap variants), we could then condition each SNV in the first set on all SNVs in the second set, using approximate conditional analysis [32] and UK10K as the reference panel. Next, we declared a region as a candidate locus if at least one imputed variant in that locus had a conditional P-value ≤ 10⁻⁸. Finally, we performed a conditional analysis for nearby candidate loci (neighbouring windows), to avoid double counting. In each candidate locus we report the imputed variant with the smallest conditional P-value as the top variant.
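Two of the quantities in this section can be checked with a short sketch: the minimal allele count implied by the MAF threshold, and the layout of 1 Mb core windows. The helper names are hypothetical, and the ± 250 Kb flanks for the tag SNVs are an assumption taken from the description of the sliding window strategy earlier in the paper.

```python
def min_allele_count(maf, n_individuals):
    """Expected minor allele count at a given MAF in a diploid panel."""
    return maf * 2 * n_individuals


def core_windows(chrom_length_bp, core_bp=1_000_000, flank_bp=250_000):
    """Split a chromosome into core windows with flanks for tag SNVs.

    SNVs inside 'core' are imputed together; tag SNVs are taken from 'tags',
    i.e. the core window extended by flank_bp on both sides (assumed layout).
    """
    windows = []
    start = 0
    while start < chrom_length_bp:
        end = min(start + core_bp, chrom_length_bp)
        windows.append(
            {
                "core": (start, end),
                "tags": (max(0, start - flank_bp), min(chrom_length_bp, end + flank_bp)),
            }
        )
        start = end
    return windows


# MAF threshold of 0.1% in the 3 781 UK10K individuals.
print(round(min_allele_count(0.001, 3781)))

# A hypothetical 2.5 Mb chromosome yields three core windows.
windows = core_windows(2_500_000)
for w in windows:
    print(w)
```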

Replication of candidate loci emerging from summary statistics imputation
We replicate our findings using our UK Biobank height GWAS results and for SNVs present on the exome chip we also use the recent height GWAS [13]. For both attempts to replicate our findings, UK Biobank and the exome chip study, the significance threshold for replication is α = 0.05/k, with k as the number of candidate loci.
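The replication rule amounts to a Bonferroni-corrected threshold, which can be sketched as follows. The function names are hypothetical. The P-value for rs28929474 is the UK Biobank value quoted in the text; the second entry, `rs_example`, is an invented variant added only to show a failed replication.

```python
def replication_threshold(k, alpha=0.05):
    """Bonferroni-corrected threshold for k candidate loci."""
    return alpha / k


def split_by_replication(replication_p, k, alpha=0.05):
    """Split variants into replicating and non-replicating sets."""
    threshold = replication_threshold(k, alpha)
    replicated = {snv for snv, p in replication_p.items() if p <= threshold}
    failed = set(replication_p) - replicated
    return replicated, failed


# k = 35 candidate loci, as in the text.
replication_p = {
    "rs28929474": 1.66e-25,  # UK Biobank P-value reported in the text
    "rs_example": 0.01,      # hypothetical variant, not from the study
}
replicated, failed = split_by_replication(replication_p, k=35)
print(replication_threshold(35), replicated, failed)
```

Note that a nominally significant P-value of 0.01 does not pass the corrected threshold of 0.05/35 ≈ 0.0014.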

For replication using UK Biobank we used summary statistics based on the latest release of genetic data with n = 336 474 individuals, provided by the Neale lab [33]. For SNVs that were not present in the latest release we used summary statistics from the first release of genetic data (n = 120 086).

Annotation of candidate loci
We use two databases to annotate newly discovered SNVs. First, we use GTEx [18], an eQTL database with SNV-gene expression association summary statistics for 53 tissues. Second, we conduct a search in Phenoscanner [19], to identify previous studies (GWAS and metabolites) where the newly discovered SNVs had already appeared. For these two databases we report the respective summary statistics that pass the significance threshold of α = 10⁻⁶. We only extract the information for variants that were defined as novel discoveries.

Reference panels
To estimate LD structure in C and c (Eq. (1)) we used 3 781 individuals from UK10K data [34,35], a reference panel of British ancestry that combines the TWINSUK and ALSPAC cohorts.

Software
All analyses were performed in the R programming language (version 3.2.5) [36], except the GWAS summary statistics computation for UK Biobank genotype and genotype-imputed data, for which SNPTEST-5.2 [37] was used. For summary statistics imputation we used SSIMP [38].

[13] or HapMap study [12]). To make the density more visible, dots have been made transparent. The

[13]. In the top window we mark the rs-id of variants that are part of the 122 reported variants of [13] in bold black, and if they are part of the 697 variants of [12] in bold orange font. Variants that are black (plain) are imputed variants (that had the lowest conditional P-value). Variants in orange (plain) are

SNP.cond.info presents each HapMap SNV used for conditional analysis, including its MAF, LD between the HapMap SNV and the imputed SNV, and a reversed conditional analysis result (HapMap variant conditioned on the imputed SNV). The column Group classifies each row into candidate loci (i) that were reported by [13] already, (ii) that had no reported HapMap variant nearby, (iii) that had at least one reported HapMap variant nearby. P = P-value, N = sample size, r2 = imputation quality, eff = effect size, EAF = effect allele frequency, MAF = minor allele frequency. If a candidate locus was not available in the UK Biobank, we provide a replication for a second variant that is in high LD with the primary variant, hence duplicated region numbers for some candidate loci.

Link to S1 Table.

S2 Table. GTEx annotation results for variants in eQTLs
This table shows SNVs which are significant eQTLs in GTEx [18]. We only report SNV-gene expression associations where the summary statistics pass the significance threshold of α = 10⁻⁶. The first four columns represent the region number, SNV, P-value from summary statistics imputation and the P-value in the UK Biobank. The three remaining columns are information extracted from GTEx, with the tissue name, gene name and the P-value of the association between the SNV and the gene expression. For each region, we order SNV-gene-tissue associations according to their P-value. # refers to the region number. Link to S2 Table.

Fig 7. Replication of exome variant rs28929474. rs28929474 is a missense variant on chromosome 14 in gene SERPINA1, low-frequency (MAF = 2.3%), imputed summary statistics (P_SSimp = 1.06 × 10⁻¹³), replication in the UK Biobank (P_UKBB = 6.49 × 10⁻⁷⁸). rs112635299 has the strongest signal in this region (P = 4.21 × 10⁻¹⁴), but is highly correlated to rs28929474 (LD = 0.95). This figure shows three datasets: results from the HapMap and the exome chip study, and imputed summary statistics. The top window shows HapMap P-values as orange circles and the imputed P-values (using summary statistics imputation) as solid circles, with the colour representing the imputation quality (only r²_pred,adj ≥ 0.3 shown). The bottom window shows exome chip study results as solid, grey dots. Each dot represents the summary statistics of one variant. The x-axis shows the position (in Mb) on a ≥ 2 Mb range and the y-axis the −log10(P)-value. The horizontal lines show the P-value thresholds of 10⁻⁶ (dotted) and 10⁻⁸ (dashed). Top and bottom window have annotated summary statistics: in the bottom window we mark dots as black if they are part of the 122 reported hits of [13]. In the top window we mark the rs-id of variants that are part of the 122 reported variants of [13] in bold black, and if they are part of the 697 variants of [12] in bold orange font. Variants that are black (plain) are imputed variants (that had the lowest conditional P-value). Variants in orange (plain) are HapMap variants, but were not among the 697 reported hits. Each of the annotated variants is marked for clarity with a bold circle in the respective colour. The genes annotated in the middle window are printed in grey if the gene has a length < 5 000 bp or is an unrecognised gene (RP-).