Correcting the coverage of credible sets in Bayesian genetic fine-mapping

Genome Wide Association Studies (GWAS) have successfully identified thousands of loci associated with human diseases. Bayesian genetic fine-mapping studies aim to identify the specific causal variants within GWAS loci responsible for each association, reporting credible sets of plausible causal variants, which are interpreted as containing the causal variant with some “coverage probability”. Here, we investigate the coverage probabilities of credible sets through simulations and find that these are systematically biased. We present a method to re-estimate the coverage of credible sets using rapid simulations based on the observed, or estimated, SNP correlation structure, we call this the “corrected coverage estimate”. This is extended to find “corrected credible sets”, which are the smallest set of variants such that their corrected coverage estimate meets the target coverage. We use our method to improve the resolution of a fine-mapping study of type 1 diabetes. We found that in 27 out of 39 associated genomic regions our method could reduce the number of potentially causal variants to consider for follow-up, and found that none of the 95% or 99% credible sets required the inclusion of more variants – a pattern matched in simulations of well powered GWAS. Crucially, our correction method requires only GWAS summary statistics and remains accurate when SNP correlations are estimated from a large reference panel. Using our method to improve the resolution of fine-mapping studies will enable more efficient expenditure of resources in the follow-up process of annotating the variants in the credible set to determine the implicated genes and pathways in human diseases. Author summary Pinpointing specific genetic variants within the genome that are causal for human diseases is difficult due to complex correlation patterns existing between variants. Consequently, researchers typically prioritise a set of plausible causal variants for functional validation - these sets of putative causal variants are called “credible sets”. We find that the probabilistic interpretation that these credible sets do indeed contain the true causal variant are systematically biased, in that the reported probabilities consistently underestimate the true coverage of the causal variant in the credible set. We have developed a method to provide researchers with a “corrected coverage estimate” that the true causal variant appears in the credible set, and this has been extended to find “corrected credible sets”, allowing for more efficient allocation of resources in the expensive follow-up laboratory experiments. We used our method to reduce the number of genetic variants to consider as causal candidates for follow-up in 27 genomic regions that are associated with type 1 diabetes.

Genome-Wide Association Studies (GWAS) have identified thousands of disease-associated regions in the causal variant association studies using 1006 European haplotypes or 1322 African haplotypes from the 1000 50 Genomes Phase 3 data set [26]. Sets of SNPs were sampled from genomic regions with various LD patterns 51 and here we present results from two regions; one of low LD ( Fig 1D) and one of high LD (Fig 1H), although 52 the results are similar for other LD patterns. 53 In each region, causality was randomly allocated to one of the variants (with minor allele frequency (MAF) 54 > 0.05) with an additive phenotypic effect (OR; 1.05, 1.1 or 1.2). Sample sizes (NN; number of cases = 55 number of controls = 5000, 10000 or 50000) were also varied across simulations. We calculated the frequentist 56 empirical estimate of the true coverage for each simulated credible set as the proportion of 5000 replicate 57 credible sets that contained the simulated causal variant. 58 We found that the threshold coverage estimate consistently under-estimated the true coverage (Fig 1A,E), 59 implying that this value could be used as a lower bound for the coverage of the credible set [8,22,23]. As the 60 minimum P value across all the SNPs in the region (P min ) decreases, the power of the study increases and 61 the ability to detect the true causal variant (and therefore include it in the credible set) increases, while the 62 threshold coverage estimate remains fixed. 63 We define the "claimed coverage" of credible sets as the sum of the PPs of the variants in the set [24]. In 64 very high powered studies (P min < 10 −12 ), the claimed coverage estimates were unbiased but with relatively 65 large variability between estimates, especially in high LD regions. However, we found that these claimed 66 coverage estimates are also systematically biased in representatively powered simulations where fine-mapping 67 is usually performed, P min > 10 −12 (Fig 1B,F). 68 Thus, the probabilities that the causal variant is contained within the credible set that are reported in the 69 literature are typically too low, and researchers can afford to be "more confident" that they have captured 70 the true causal variant in their credible set. 71 Our results imply that the accuracy of the coverage estimates depend on the background LD for the 72 genomic region. We repeated the analysis, varying LD patterns over a much larger population (7562 European 73 UK10K haplotypes) and averaging the results over a range of LD patterns (see Methods). We found that the 74 results were similar to those of the high LD region in Fig 1 (S1 Fig). 75 We next investigated potential causes of this systematic bias. We found that the PPs themselves are 76 empirically well calibrated (S2 Fig), implying that it is the procedure of forming the credible sets that is 77 causing the bias. Sorting the variants into descending order of PPs prior to the assembly of the credible set 78 ensures that the sets contain as few variants as possible. However, it also confers additional information 79 which is not utilised in the procedure, specifically the ranking of each SNP's PP relative to all the other PPs 80 for the SNPs in the region. We found that removing this ordering step from the algorithm, such that the 81 variants were added to the set in a random order had only a minor effect on the bias (S3 ( 1 0 For each of the simulated credible sets, we found that the corrected coverage estimates were better 95 empirically calibrated than the claimed coverage estimates in simulations that are representative of those 96 considered for fine-mapping (P min < 0.01) (Fig 1C,G). Particularly, the median accuracy of the corrected 97 coverage estimates improve for 10 −12 < P min < 10 −2 , and their variability also decreases where the claimed 98 coverage estimates are unbiased but with large variability (P min < 10 −12 ).

99
Corrected coverage robust to departures from single causal variant assumption 100 The Bayesian approach for fine-mapping described by Maller et al. assumes a single causal variant per 101 genomic region, which may be unrealistic [28]. Using simulated data with 2 causal variants, and defining 102 coverage as the frequency with which a credible set contained at least 1 causal variant, we found that the 103 corrected coverage estimates tend to be more accurate than the claimed coverage estimates for causal variants 104 in low LD (r 2 < 0.01, Fig 2A). When the 2 causal variants are in high LD (r 2 > 0.7), the corrected coverage 105 estimates are still generally more accurate than the claimed coverage estimates, although both tend to 106 underestimate the true coverage (and are thus conservative) ( Fig 2B). These results imply that even when 107 the key assumption underlying Bayesian fine-mapping is violated, the corrected coverage estimates are often 108 still better empirically calibrated than the claimed coverage estimates.  Our method relies on MAF and SNP correlation data to simulate GWAS summary statistics representative of 111 the GWAS data. So far we have assumed that this information is available from the GWAS samples, but this 112 is not generally the case. We therefore evaluated the performance of our correction when using independent 113 7/23 reference data to estimate MAFs and SNP correlations. We applied our correction to sets simulated from the 114 1000 Genomes data using either 1000 Genomes or UK10K MAF and SNP correlation estimates. We found 115 that the corrected coverage estimates remained accurate (Fig 3).

117
Obtaining an accurate coverage estimate that the causal variant appears in the credible set is useful in its 118 own right, but it is also beneficial to obtain a "corrected credible set" -that is, the smallest set of variants 119 required such that the corrected coverage estimate of the resultant credible set achieves some desired coverage. 120 For example, discovering that a 90% credible set actually has 99% coverage of the causal variant is useful, 121 but an obvious follow-up question is "What variants do I need such that the coverage is actually 90%?". 122 We explored this using an example simulated GWAS across 200 SNPs. The 90% credible set, constructed 123 using the standard Bayesian approach, contained 8 variants and had a claimed coverage value of 0.903. The 124 corrected coverage estimate of this credible set was 0.969, which is close to the empirical coverage of the 125 credible set, which was 0.972. 126 We used the root bisection method [29] to iteratively search for the threshold value required that yields a 127 8/23 credible set with accurate coverage of the causal variant. We found a corrected 90% credible set could be 128 constructed using a threshold value of 0.781. This corrected credible set had a coverage estimate of 0.905 129 (empirical estimated coverage of 0.907) and reduced in size from 8 to 4 variants (Fig 4).   Impact of correcting credible sets in a GWAS 147 We applied our corrected coverage method to association data from a large type 1 diabetes (T1D) genetic 148 association study consisting of 6,670 cases and 12,262 controls [30]. In the original study, 99% credible sets 149 are found for 40 genomic regions. Here we focus on 95% credible sets as these best illustrate the utility of our 150 method due to the greater margin for error, and we exclude the INS region with lead SNP rs689 which failed 151 QC in the ImmunoChip (and for which additional genotyping data was used in the original study). 152 The results match our previous findings -that the claimed coverage estimates are often too low (Fig 5). 153 We found that the size of the 95% credible set could be reduced in 27 out of the 39 regions, without the use 154 of any additional data (S1 Table S2 Table). Similarly, we found that the size of the 99% credible set could be 155 reduced in 26 out of the 39 regions (S8 Fig, S2 File, S3 Table, S4 Table). Fine mapping to single base resolution has been used as a measure of GWAS resolution [23]. scenarios where fine-mapping is usually performed. 176 We could not pinpoint the exact source of the bias. We found that the posterior probabilities of causality 177 themselves are well empirically calibrated, which implies that it is the procedure of forming the credible sets 178 that is responsible for the bias. Investigating the cause of this bias and whether this problem arises in other 179 variable selection problems is an interesting direction for future research.

180
Our method is limited in that it does not model multiple causal variants. Fine-mapping approaches that 181 jointly model SNPs have been developed, such as GUESSFM [4] which uses genotype data and FINEMAP [11] 182 and JAM [14] which attempt to reconstruct multivariate SNP associations from univariate GWAS summary 183 statistics, differing both in the form they use for the likelihood and the method used to stochastically search 184 the model space. The output from these methods are posterior probabilities for various configurations of 185 causal variants, and therefore the grouping of SNPs to distinct association signals must be performed post-hoc 186 to obtain similar inferences to that of single causal variant fine-mapping (e.g. to obtain credible sets). 187 The sum of single effects (SuSiE) method [8] removes the single causal variant assumption and groups 188 SNPs to distinct association signals in the analysis, such that it aims to find as many credible sets of variants 189 that are required so that each set captures an effect variant, whilst also containing as few variants as possible 190 (similar to "signal clusters" in Lee et al.'s DAP-G method [32]). This sophisticated approach has great 191 potential but the simulated 95% credible sets formed using both the SuSiE and DAP-G methods "typically 192 had coverage slightly below 0.95, but usually > 0.9" (Fig 3 and Fig S3 in

206
The values for the OR and MAF of the causal variant (CV) were selected such that the simulations reflect 207 the common disease-common variant (CDCV) hypothesis, which asserts that common diseases are caused 208 by common variants with small to modest effects [33,34]. Sample sizes (NN; number of cases = number of 209 controls = 5000, 10000 or 50000) were also varied across simulations. 210 The haplotype frequencies and sampled parameter values were then used in the simGWAS R package [27] 211 to: (i) simulate the results of a case-control GWAS (the study to "correct") (ii) simulate results from 5000 212 case-control GWASs (to evaluate the accuracy of our method). These simulated GWAS results are marginal Z 213 scores, which were then converted to PPs using the corrcoverage::ppfunc or corrcoverage::ppfunc.mat 214 functions. 215 The variants are then sorted into descending order of their PPs and these are summed until the credible 216 set threshold (0.9) is exceeded. The variants required to exceed this threshold comprise the 90% credible set. 217 The frequentist empirical estimate of the true coverage is calculated as the proportion of the 5000 simulated 218 credible sets that contain the CV. The claimed coverage is defined as the size of credible set that we wish to 219 correct -that is, the sum of the PPs of the variants in the credible set [24], which must be greater than or 220 For the evaluation of averaging results over a range of LD patterns, we used haplotypes from the UK10K 225 data. In each simulation, genomic regions were randomly selected (bounded by recombination hotspots 226 defined using the LD detect method [35]) on chromosome 22 and two non-overlapping sets of 100 adjacent 227 variants were selected, so that the simulated region consisted of 200 correlated and non-correlated variants. 228 The simulation pipeline described above was then followed to obtain a final simulation data set for various 229 LD patterns.

230
For investigating the effect of the ordering step in the credible set algorithm, we re-ran the simulations 231 whereby the variants were not sorted into descending order of PP prior to assembly of the credible set. This 232 means that they were included into the credible set in a random order until the threshold was exceeded.
where Z J has length equal to the number of SNPs in the region, and all elements equal to 0 except at the 244 causal SNP's position which takes the value µ. 245 The value of µ is unknown in genetic association studies and it must therefore be estimated in our method 246 to derive the Z J vector. We consider using the absolute Z score at the lead-SNP as an estimate for µ, but 247 find this to be too high in low powered scenarios (S5 Fig). This is because E(|Z|) > 0 even when E(Z) = 0, 248 and thus E(|Z|) > E(Z) when E(Z) is close to zero. Instead, we consider a weighted average of the absolute 249 Z scores, so that for a region comprising of k SNPs, and find this estimate to have small relative error, even at small µ (S5 Fig). 251 The joint Z vector can now be estimated by and the expected marginal Z scores can be written as where Σ is the SNP correlation matrix [36]. 254 The asymptotic distribution of these test statistics is multi-variate normal (MVN) with variance equal to 255 the SNP correlation matrix [36], We simulate multiple replicates of marginal Z scores, convert these into PPs and derive credible sets. 257 Thus, the proportion of these credible sets that contain the assumed CV can be empirically calculated. For 258 each SNP i considered as the CV, the joint Z vector is constructed as and we simulate N = 1000 marginal Z score vectors, Each simulated Z * vector is then converted to PPs and credible sets are formed using the standard 261 Bayesian method (sort and sum). The proportion of the N = 1000 simulated credible sets that contain SNP 262 i, prop i , is calculated.

263
This procedure is implemented for each SNP in the genomic region with P P > 0.001 (this value can be 264 altered using the 'pp0min' parameter in the software) considered as causal. The final corrected coverage 265 estimate is then calculated by weighting each of these proportions by the PP of the SNP considered causal, 266 15/23 so that for a region containing p SNPs with P P > 0.001, values were then used with MAFs and haplotype data from the 1000 Genomes data to simulate marginal Z 286 scores from various genetic association studies. The standard claimed and corrected coverage estimates (Fig 287   3A,B respectively) were then derived as usual and the corrected coverage estimates were also calculated when 288 using the reference data to estimate MAFs and SNP correlations (Fig 3C).

289
For comparison, we also investigated the effect of using a reference panel for the correction in the high 290 LD region previously discussed (we omitted the low LD region as this used African haplotypes, for which we 291  For the T1D data analysis, we used the index SNPs for the genomic regions reported in the original study [30]. 296 For each of these index SNPs, we found the relevant 1000 Genomes build 37 genomic region and used 297 ImmunoChip data to find the other SNPs in each of these regions. We then used the corrcoverage R 298 package with default parameters to find 95% (and 99%) credible sets of variants, along with the claimed 299 and corrected coverage estimates for each of these. 95% confidence intervals for the corrected coverage ignoring the null model and rescaling, so that the PPs over the variants sum to 1, implies that some variants 329 will be inappropriately dropped, causing the empirical coverage to be lower than the target.  90% credible sets were "corrected" using the corrcoverage::corrected cs function (with default parameters 339 and 'desired.cov=0.9'), and the "required threshold" value obtained from each simulation was used to form 340 5000 replicate credible sets to estimate the empirical coverage of these corrected 90% credible sets. where the credible set did not need to be corrected since the threshold was contained in the 99% confidence 348 interval of the coverage estimate, or because the credible set already contained only a single variant.