The authors have declared that no competing interests exist.

The maximum entropy (ME) method is a recently developed approach for estimating local false discovery rates (LFDR) that incorporates external information, allowing a subset of tests to be assigned to a category with a different prior probability of following the null hypothesis. Using this ME method, we have reanalyzed the findings from a recent large genome-wide association study of coronary artery disease (CAD), incorporating biologic annotations. Our revised LFDR estimates show many large reductions in LFDR, particularly among the genetic variants belonging to annotation categories that were known to be of particular interest for CAD. However, among SNPs with low minor allele frequencies, the reductions in LFDR were modest in size.

Current technologies for measuring genome-wide genetic variation easily capture millions of variants across the genome. High-dimensional genotyping arrays already commonly include several million variants. Through direct sequencing or by imputation against large previously sequenced reference panels where sequencing has been performed[

Stringent genome-wide significance thresholds of 5x10^{-8}, established to control the family-wise error rate (FWER) at approximately 5% for genome-wide testing, have been in standard usage in the field of human genetics for many years[
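As a rough illustration of where a threshold of this size comes from, the sketch below applies a Bonferroni-style division of the target FWER by the number of tests. The figure of one million effectively independent tests is an assumption made for the example and is not stated in the text above; the function name is hypothetical.

```python
# Bonferroni-style arithmetic behind a genome-wide significance threshold.
# ASSUMPTION: about one million effectively independent tests. This count is
# illustrative only and is not taken from the text above.
def bonferroni_threshold(fwer: float, n_independent_tests: int) -> float:
    """Per-test p-value threshold that bounds the FWER at `fwer`."""
    return fwer / n_independent_tests


threshold = bonferroni_threshold(fwer=0.05, n_independent_tests=1_000_000)
print(f"per-test threshold: {threshold:.1e}")
```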

However, although few false positive results are now published when associations meet genome-wide significance thresholds controlling FWER, power can be severely compromised by the use of these necessarily very small significance thresholds. Substantial missing heritability has been seen for many traits and diseases, and many true loci of small effect may remain unidentified because studies are underpowered.

In order to improve power, many research groups are increasing their sample sizes through collaborations and meta-analyses (e.g.[

Despite the potential benefits of using FDR-defined significance thresholds instead of the FWER-defined thresholds, control of type 1 errors through the use of a chosen, fixed FDR threshold may be suboptimal. In fact, the use of fixed FDR thresholds incurs a bias and tends to allow a higher proportion of false positives than indicated by the selected FDR[

FDR at a cutoff of 1.95 (p-value 0.05 for a normally distributed test) is the area of the light blue region divided by the combined area of the beige and light blue regions. LFDR compares the height of the dark blue line to the height of the brown line.
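The distinction between an area ratio (FDR) and a height ratio (LFDR) can be made concrete with a toy two-group normal mixture. The sketch below is illustrative only: the null proportion `PI0` and the effect size `MU` are assumed values, and the symmetric alternative is a modelling convenience, not something taken from the figure.

```python
import math


def norm_pdf(z, mu=0.0):
    """Density of a unit-variance normal with mean mu."""
    return math.exp(-0.5 * (z - mu) ** 2) / math.sqrt(2.0 * math.pi)


def two_sided_tail(c, mu=0.0):
    """P(|Z| >= c) for Z ~ N(mu, 1)."""
    upper = 0.5 * math.erfc((c - mu) / math.sqrt(2.0))
    lower = 0.5 * math.erfc((c + mu) / math.sqrt(2.0))
    return upper + lower


def tail_fdr(c, pi0, mu):
    """Tail-area FDR: null share of everything at least as extreme as c."""
    null_area = pi0 * two_sided_tail(c)
    alt_area = (1.0 - pi0) * two_sided_tail(c, mu)
    return null_area / (null_area + alt_area)


def local_fdr(z, pi0, mu):
    """LFDR: null share of the mixture density exactly at z."""
    null_height = pi0 * norm_pdf(z)
    # Symmetric alternative: effects of +mu and -mu are equally likely.
    alt_height = (1.0 - pi0) * 0.5 * (norm_pdf(z, mu) + norm_pdf(z, -mu))
    return null_height / (null_height + alt_height)


PI0, MU, CUTOFF = 0.9, 3.0, 1.95  # PI0 and MU are illustrative assumptions
print(f"tail FDR for |z| >= {CUTOFF}: {tail_fdr(CUTOFF, PI0, MU):.3f}")
print(f"LFDR at |z| = {CUTOFF}: {local_fdr(CUTOFF, PI0, MU):.3f}")
```

In this toy model the LFDR evaluated at the cutoff is larger than the tail-area FDR, because the tail area averages over statistics more extreme than the cutoff itself.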

Power remains low even when FDR methods (or LFDR methods) are used, due not only to the chosen significance thresholds, but also because the effect sizes of most genetic variants tend to be small. Furthermore, power is particularly low for variants that have lower minor allele frequencies—i.e. variants that are uncommon in the population—since the standard errors associated with the estimated effects are large due to the small number of individuals carrying the uncommon variants. Additional strategies for increasing power are, therefore, of great interest. Approaches motivated along these lines include finding subsets of genetic variants that are of particular interest, and performing significance threshold adjustments that prioritize these subsets. It has been shown that judicious use of good external annotation about the genetic variants can increase the statistical power to identify associations with low prior power, and that the significance rankings can be improved[

In this paper, we focus on obtaining improved false discovery rate estimates for coronary artery disease (CAD) through the use of an LFDR method that incorporates external information about the genetic variants, leading to a posterior probability of non-association that varies with the annotations. Similar approaches for the FDR, i.e. methods that stratify or modify the FDR as a function of external information, have been shown to be effective in reducing the overall type 1 error rates[

We modelled the p-values arising from the CARDIOGRAMplusC4D genome-wide association (GWA) consortium[

The p-values from the CARDIOGRAMplusC4D Consortium (

There are several well-known approaches to estimate LFDR[

To decide whether a separate or a combined reference class should be used, the ME method bases the LFDR estimate mostly on the separate reference class if it has enough SNPs for reliable estimation and otherwise uses the combined reference class alone [

SNPs were categorized into 53 overlapping functional categories based on the annotation data from Finucane et al. [

The results from CARDIOGRAMplusC4D can be seen in their primary publication[^{-8} (^{-8}, and 32,508 that are in the tail of the QQ plot that deviates from the null distribution. For this latter definition, we refer to these SNPs as “P deviated” in

MAF bins | <0.001 | 0.001–0.005 | 0.005–0.01 | 0.01–0.05 | ≥0.05 | All |
---|---|---|---|---|---|---|
All SNPs (N) | 0 | 0 | 240,423 | 2,500,103 | 6,715,230 | 9,455,778 |
All SNPs (%) | 0 | 0 | 2.54 | 26.44 | 71.02 | 100 |
P deviated (N) | 0 | 0 | 103 | 1,988 | 30,417 | 32,508 |
P deviated (%) | 0 | 0 | 0.32 | 6.11 | 93.57 | 100 |
^{-8} (N) | 0 | 0 | 0 | 61 | 2,152 | 2,213 |
^{-8} (%) | 0 | 0 | 0 | 2.76 | 97.24 | 100 |
^{-8} (N) | 0 | 0 | 0 | 39 | 1,836 | 1,875 |
^{-8} (%) | 0 | 0 | 0 | 2.08 | 97.92 | 100 |

To demonstrate the potential of the ME LFDR method, we applied it to nine annotation categories [22] known to significantly contribute to CAD heritability (

Also, the distances between p-value distributions (D-statistics) from Kolmogorov-Smirnov tests are shown, comparing different MAF groups: (a) [0.005–0.01) vs. [0.01–0.05); (b) [0.005–0.01) vs. (≥0.05); (c) [0.01–0.05) vs. (≥0.05).

Annotation Category | h^{2} obs (SE) | h^{2} exp | P-value (adjusted)^{(1)} | # of SNPs^{(2)} | KS-test D measure (a, b, c) |
---|---|---|---|---|---|
Enhancer_Hoffman.extend.500^{(3)} | 0.18 (0.03) | 0.03 | 1.1x10^{-04} | 401,897 | 0.030, 0.069, 0.042 |
H3K9ac_Trynka | 0.15 (0.03) | 0.02 | 2.7x10^{-04} | 322,412 | 0.027, 0.074, 0.048 |
H3K9ac_Trynka.extend.500 | 0.18 (0.03) | 0.04 | 3.7x10^{-04} | 601,848 | 0.028, 0.072, 0.045 |
Enhancer_Hoffman | 0.14 (0.03) | 0.01 | 4.1x10^{-04} | 163,480 | 0.030, 0.072, 0.044 |
H3K27ac_PGC2.extend.500 | 0.19 (0.03) | 0.07 | 3.8x10^{-03} | 962,593 | 0.024, 0.065, 0.041 |
H3K4me3_Trynka.extend.500 | 0.20 (0.04) | 0.05 | 3.9x10^{-03} | 713,844 | 0.024, 0.065, 0.042 |
H3K27ac_PGC2 | 0.18 (0.03) | 0.05 | 3.9x10^{-03} | 768,410 | 0.024, 0.065, 0.042 |
H3K9ac_peaks_Trynka | 0.11 (0.03) | 0.01 | 4.0x10^{-03} | 95,531 | 0.032, 0.079, 0.049 |
FetalDHS_Trynka | 0.18 (0.04) | 0.02 | 9.1x10^{-03} | 255,582 | 0.022, 0.059, 0.039 |

^{(1)} Adjusted p-value for enrichment, using a Bonferroni correction

^{(2)} The number of SNPs used for the adjusted p-value

^{(3)} “extend.500” implies that a 500 base pair window around the category was included with the annotation to minimize inflation of heritability from flanking regions[
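The D measure reported in the table is the two-sample Kolmogorov-Smirnov statistic, that is, the largest vertical gap between two empirical distribution functions. A minimal standard-library sketch of that statistic follows; the function name and the toy p-values are hypothetical and are not drawn from the consortium data.

```python
from bisect import bisect_right


def ks_two_sample_d(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The supremum of |F_a - F_b| is attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical p-values for two MAF bins (illustration only).
p_low_maf = [0.02, 0.15, 0.33, 0.48, 0.61, 0.77, 0.90]
p_high_maf = [0.001, 0.01, 0.04, 0.20, 0.35, 0.52, 0.81]
print(f"D = {ks_two_sample_d(p_low_maf, p_high_maf):.3f}")
```

With millions of SNPs per bin, even small D values such as those in the table can correspond to very small KS p-values, so the size of D is the more informative quantity.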

Within each panel, the three distributions are divided by p-value ranges: unadjusted p<0.05; unadjusted p<0.01; unadjusted p<0.001.

In ^{−16}) indicating differences between the distributions.

Differences are on a 0.25 power scale. (Left) MAF between 0.005 and 0.01. (Middle) MAF between 0.01 and 0.05. (Right) MAF greater than 0.05.

Finally, in

Among the selected subset of 93 SNPs from the H3K9ac annotation, rs41423244 on chromosome 12 showed the greatest LFDR decrease of 14.7%, from 24.7% to 9.998%, with a raw p-value of 1.35x10^{-4}. This SNP lies in the CS gene (citrate synthase), and the gene has been previously associated with psoriasis, height and celiac disease. Similarly, the SNP with the largest LFDR decrease among the 67 SNPs highlighted in ^{-5}; LFDR falls from 18.5% to 9.5% with the ME method. Although no previous GWAS associations have been reported with this SNP, the gene DENND5A has been associated with Beta2-glycoprotein plasma levels. Finally, the SNP whose LFDR was most influenced by use of the fetal DHS mark is rs75274818 on chromosome 12 (naïve p = 6.1x10^{-5}; LFDR.ME 9.9%; original LFDR 14.5%), located in

These data showed many associations with CAD as has been previously reported[

Although the LFDR estimates changed for many SNPs, the LFDR estimates that changed most due to the use of the ME method were not those that were particularly rare. When we restricted SNPs to those with MAF <0.1 and LFDR < 0.1, it was always the SNPs near the upper MAF bound that had the largest LFDR decreases (

Here, we explored the effect on LFDR estimates by partitioning SNPs into reference sets using functional annotation categories pointing towards excess risk for CAD that were obtained from LD-score regression[

The general concept of giving different priority to different subsets of hypotheses has been previously approached in several ways. Stratified FDR estimates can be obtained by separately calculating the FDR in different classes, and then combining the results[

Some substantial decreases in LFDR were seen in our work—as large as a drop of 0.4 in the LFDR estimate. Inevitably, these very large changes tended to occur for SNPs where the original LFDR estimate was large, and hence these SNPs may not be of great interest. Nevertheless, the ME LFDR method has the potential to increase the level of interest for pursuing a SNP for further investigations by using external annotation in a statistically principled way, and we saw larger reductions in the LFDR. Since all the LFDR estimates calculated here are relevant for functional annotations shown to be significantly associated with CAD, SNPs associated with substantially reduced LFDR estimates may be worth further investigation. Therefore, we have provided a spreadsheet (

Following [ the test statistic for SNP $i$, $X_i$, follows a chi-square distribution. Under the null hypothesis that there is no association between SNP $i$ and the disease, $X_i$ follows the central chi-square distribution with one degree of freedom and the corresponding density is denoted by $f_0(\cdot)$. Under the alternative hypothesis, the test statistic $X_i$ is assumed to follow a non-central chi-square distribution with one degree of freedom and non-centrality parameter $\delta$, with density $f_\delta(\cdot)$. Now, let the LFDR for SNP $i$, given the observed statistic $x_i$, be

$$\psi_i = \frac{\pi_0 f_0(x_i)}{\pi_0 f_0(x_i) + (1 - \pi_0) f_\delta(x_i)},$$

where $\pi_0$ is the prior probability of SNPs not being associated with the disease. To estimate $\psi_i$, one would have to estimate the parameters $\pi_0$ and $\delta$.
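The two-group model described here (central chi-square null, non-central chi-square alternative, and a prior null probability) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation; the names `pi0` and `delta` and the example values are assumptions. It uses the standard fact that a chi-square variable with one degree of freedom and non-centrality `delta` is the square of a normal variable with mean `sqrt(delta)` and unit variance, which gives the density in closed form.

```python
import math


def _phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def chi2_1df_pdf(x, delta=0.0):
    """Density at x > 0 of a chi-square with 1 df and non-centrality delta.

    delta = 0 gives the central chi-square density.
    """
    r, m = math.sqrt(x), math.sqrt(delta)
    return (_phi(r - m) + _phi(r + m)) / (2.0 * r)


def lfdr(x, pi0, delta):
    """Posterior probability of no association given test statistic x."""
    null = pi0 * chi2_1df_pdf(x)
    alt = (1.0 - pi0) * chi2_1df_pdf(x, delta)
    return null / (null + alt)


# Illustrative parameter values only (assumed, not estimated from data).
for x in (1.0, 10.0, 30.0):
    print(f"x = {x:5.1f}  LFDR = {lfdr(x, pi0=0.95, delta=10.0):.4g}")
```

Under these assumed values the LFDR falls as the test statistic grows, which is the behaviour the estimates in this paper rely on.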

Following [_{0}, _{i}(_{0}_{0}(_{i}) + (1 − _{0})_{δ}(_{i})) is the likelihood function based on SNPs falling into the separate reference class _{1} and _{2} are pre-specified limits of the non-centrality parameter _{1} = 0.1 and _{2} = 50.
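To make the role of the likelihood function concrete, the sketch below maximises the two-component mixture likelihood over a grid, restricting the non-centrality parameter to the stated limits of 0.1 and 50. It does not reproduce the likelihood-set construction, whose definition is not shown here. The grid resolution, the simulated data and all function names are assumptions made for illustration.

```python
import math
import random


def chi2_1df_pdf(x, delta=0.0):
    """Density at x > 0 of a chi-square with 1 df and non-centrality delta."""
    r, m = math.sqrt(x), math.sqrt(delta)
    phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (phi(r - m) + phi(r + m)) / (2.0 * r)


def log_likelihood(stats, pi0, delta):
    """Log of the product over SNPs of pi0*f0(x) + (1 - pi0)*f_delta(x)."""
    return sum(
        math.log(pi0 * chi2_1df_pdf(x) + (1.0 - pi0) * chi2_1df_pdf(x, delta))
        for x in stats
    )


def grid_mle(stats, delta_lo=0.1, delta_hi=50.0, n_pi0=20, n_delta=50):
    """Grid search for (pi0, delta) with delta_lo <= delta <= delta_hi."""
    best = (-math.inf, None, None)
    for i in range(1, n_pi0 + 1):
        pi0 = i / n_pi0
        for j in range(n_delta):
            delta = delta_lo + (delta_hi - delta_lo) * j / (n_delta - 1)
            ll = log_likelihood(stats, pi0, delta)
            if ll > best[0]:
                best = (ll, pi0, delta)
    return best


def simulate(n, pi0, delta, seed=1):
    """Simulated test statistics from the two-group model (toy data)."""
    rng = random.Random(seed)
    shift = math.sqrt(delta)
    stats = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        if rng.random() >= pi0:  # non-null SNP: shift the normal mean
            z += shift
        stats.append(max(z * z, 1e-12))  # guard against an exact zero
    return stats


stats = simulate(n=400, pi0=0.9, delta=16.0)
ll_hat, pi0_hat, delta_hat = grid_mle(stats)
print(f"pi0_hat = {pi0_hat:.2f}, delta_hat = {delta_hat:.2f}")
```

A numerical optimiser would normally replace the grid; the grid is used here only so that the restriction on the non-centrality parameter is explicit.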

The likelihood set _{s} provides a set of pairs of (_{0}, _{0}, _{i} can be computed. Computing _{i} values for all pairs of (_{0}, _{s} would provide us a range of LFDR values, say _{i}, consider the following relative entropy function

Then _{i,ME}, the ME estimate, is the value of _{i} that minimizes the relative entropy function
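The final minimisation step can be illustrated as follows. The exact relative entropy function is not shown above, so this sketch assumes the binary (Bernoulli) relative entropy between a candidate LFDR and the LFDR from the combined reference class, minimised over the range of LFDR values obtained from the separate reference class. The function names and numeric inputs are hypothetical.

```python
import math


def relative_entropy(psi, psi_ref):
    """ASSUMED form: KL divergence of Bernoulli(psi) from Bernoulli(psi_ref)."""
    def term(p, q):
        return 0.0 if p == 0.0 else p * math.log(p / q)
    return term(psi, psi_ref) + term(1.0 - psi, 1.0 - psi_ref)


def me_lfdr(psi_lower, psi_upper, psi_combined, n_grid=10001):
    """Minimise the relative entropy over [psi_lower, psi_upper] on a grid.

    psi_combined must lie strictly between 0 and 1.
    """
    width = psi_upper - psi_lower
    candidates = [psi_lower + width * k / (n_grid - 1) for k in range(n_grid)]
    return min(candidates, key=lambda psi: relative_entropy(psi, psi_combined))


# Hypothetical inputs: separate-class range [0.05, 0.15], combined value 0.25.
print(f"{me_lfdr(0.05, 0.15, 0.25):.4f}")
# Combined-class value inside the separate-class range.
print(f"{me_lfdr(0.05, 0.15, 0.10):.4f}")
```

Under this assumed form the minimiser is the combined-class value whenever it lies inside the separate-class range, and otherwise the nearest endpoint of that range. A narrow range (many SNPs in the separate class) therefore pulls the estimate toward the separate class, while a wide range leaves the combined-class estimate unchanged.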

Columns include the SNP id (legendrs), chromosome (chr), position (pos), minor allele frequency (maf), slope coefficient (beta) and p-value (p_dgc) for association with CAD from the consortium, z-squared (z_sq), and then various LFDR estimates. They are named for the set of SNPs used (LFDR.ME for the ME method, LFDR.Big for LFDR estimated from the large set of SNPs, and LFDR.Small for LFDR from the small annotated category) as well as for which annotation category was used (EH_ext for Enhancer Hoffman extend 500, H3K9_Try for H3K9ac Trynka, H3K9_Try_ext for H3K9ac Trynka extend 500, EH for Enhancer Hoffman, H3K27_ext for H3K27ac PGC2 extend 500, H3K4_Try_ext for H3K4me3 Trynka extend 500, H3K27 for H3K27ac PGC2, H3K9 for H3K9ac peaks Trynka and FDHS for Fetal DHS Trynka). Differences between overall LFDR and maximum entropy LFDR are also provided (Diff). The gene name is provided if the SNP is in a gene.

(XLSX)

AK made his contributions while a postdoctoral fellow at the University of Ottawa with DB. This work was funded primarily by CIHR Operating grant #123508 to DB, and also by an NSERC operating grant to CG. We would like to acknowledge the substantial assistance of Majid Nikpay with providing the data and annotations, and assisting with understanding of these data.