Integrating comprehensive functional annotations to boost power and accuracy in gene-based association analysis

doi:10.1371/journal.pgen.1009060

Table 1.

Forms of gene-based test statistics.

Fig 1.

GAMBIT analysis framework & workflow.

Overview of the GAMBIT software pipeline. (1) GWAS association summary statistics (single-variant z-scores, or effect size estimates and standard errors) are cross-referenced and linked with multiple sets of functional annotations. (2) Annotated GWAS variants are cross-referenced with LD reference data (a haplotype reference panel used to estimate LD as needed). (3) GWAS summary statistics, annotations, and LD estimates are used to calculate stratified gene-based test statistics. (4) Stratified gene-based tests are combined for each gene to construct omnibus test statistics. GAMBIT supports multiple single-annotation test methods and multiple omnibus test methods for combining single-annotation tests. Statistical tests are listed in Table 1; basic annotation types are illustrated in Fig 2 and listed in Table 2. A complete description of statistical methods and annotation types can be found in Materials and methods.

Fig 2.

Regulatory annotation tracks and gene weights.

Illustration of the primary regulatory annotation tracks used in the GAMBIT gene-based analysis framework at the CELSR2 locus on chromosome 1. Top panel: Distance-to-transcription start site (dTSS) weights, calculated as w_jk(α) = exp(−α|d_jk|), where d_jk is the number of base pairs between variant j and the TSS of gene k, shown for α = 10⁻⁵ (solid lines), α = 5 × 10⁻⁵ (dashed lines), and α = 10⁻⁴ (dotted lines). Gene bodies are indicated by arrows and variant locations are marked in black at y = 0. Middle panel: Enhancer-to-target-gene confidence weights. Weights are shown for each pair of enhancer variant and target gene, and unique enhancer elements are marked by black lines at y = 0. Bottom panel: Tissue-specific eQTL weights for each gene. eQTL tissues are differentiated by shape.
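The dTSS weight function in the top panel can be sketched in a few lines. This is a minimal illustration of the formula w_jk(α) = exp(−α|d_jk|) as stated in the caption, not GAMBIT's own code; the function names `dtss_weight` and `dtss_weight_table` are hypothetical.

```python
import math

# The three decay rates named in the caption (solid, dashed, dotted lines).
ALPHAS = (1e-5, 5e-5, 1e-4)


def dtss_weight(distance_bp: float, alpha: float) -> float:
    """Return exp(-alpha * |d|) for a variant d base pairs from a gene's TSS.

    The sign of distance_bp (upstream vs. downstream) does not matter,
    because the formula uses the absolute distance.
    """
    return math.exp(-alpha * abs(distance_bp))


def dtss_weight_table(distances_bp):
    """Map each alpha in ALPHAS to the list of weights for the given distances."""
    return {
        alpha: [dtss_weight(d, alpha) for d in distances_bp]
        for alpha in ALPHAS
    }
```

A variant located exactly at the TSS receives weight 1 for every α, and a larger α makes the weight decay faster with distance: at 10 kb from the TSS, α = 10⁻⁴ gives exp(−1), whereas α = 10⁻⁵ gives exp(−0.1).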

Table 2.

Single-annotation gene-based tests.

Fig 3.

Performance identifying causal genes in simulations.

Proportion of simulation replicates in which the causal gene is top-ranked at its locus (y-axis) for each gene-based association or gene ranking method (x-axis & bar fill color), stratified by locus heritability (color shade), when coding, eQTL, enhancer, or UTR variants are causal (left panel facets), or when the causal class at each locus is coding, eQTL, enhancer, or UTR with equal probability (“heterogeneous across loci”; right panel). TSS-to-top-SNP refers to ranking genes by the distance between their TSS and the most significant single variant at each locus; dTSS-weighted gene-based tests (labeled dTSS) use exponential weight functions to assign higher weight to variants nearer the TSS for each gene (Materials and methods).

Fig 4.

Statistical power to detect gene-based associations in simulations.

Statistical power (proportion of simulation replicates in which gene-based p-value ≤2.5 × 10⁻⁶ across loci; y-axis) for each gene-based testing approach (x-axis & color) stratified by locus heritability (plot rows) when coding, eQTL, enhancer, UTR variants, or a mixture of these (“heterogeneous across loci”) are causal (plot columns). In the rightmost column, either coding, eQTL, enhancer, or UTR variants are causal with equal probability (as when the causal annotation class is heterogeneous across loci for a single trait). Power is shown separately for causal genes and proximal genes (non-causal genes that are proximal to a causal gene, as defined in Materials and methods). Ideally, gene-based tests should have high power for causal genes, and relatively lower power for proximal genes. Error bars show 95% confidence intervals for average power across loci.
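The power summary described in this caption can be sketched as follows. The threshold 2.5 × 10⁻⁶ comes from the caption; the normal-approximation interval for the mean across loci is an assumption made here for illustration (the caption does not state how the 95% intervals were computed), and the function names are hypothetical.

```python
import math
from statistics import mean, stdev

THRESHOLD = 2.5e-6  # gene-based significance threshold from the caption


def locus_power(p_values, threshold=THRESHOLD):
    """Proportion of simulation replicates with gene-based p-value <= threshold."""
    return sum(p <= threshold for p in p_values) / len(p_values)


def average_power_with_ci(p_values_by_locus, threshold=THRESHOLD, z=1.96):
    """Average power across loci with an approximate 95% confidence interval.

    ASSUMPTION: the interval is mean +/- z * SE, using the sample standard
    deviation of per-locus power, clipped to [0, 1].
    """
    powers = [locus_power(p, threshold) for p in p_values_by_locus]
    if len(powers) < 2:
        raise ValueError("need at least two loci to estimate a standard error")
    avg = mean(powers)
    half_width = z * stdev(powers) / math.sqrt(len(powers))
    return avg, (max(0.0, avg - half_width), min(1.0, avg + half_width))
```

Running the same function separately on replicates for causal genes and for proximal genes would yield the two quantities contrasted in the figure.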

Fig 5.

UK Biobank analysis: Numbers of significant independent associations detected.

Numbers of independent gene-based associations (at the Bonferroni-corrected 5% significance level) detected by each method across 128 UK Biobank traits. Panel A: Total number of significant independent associations across traits (delineated by horizontal black lines) for each gene-based test; Wilcoxon signed-rank p-values (top) are shown for paired comparisons between the number of associations detected by the omnibus test (“GAMBIT”; red) versus Pascal/SOCS (blue) and single-annotation gene-based tests (green). The omnibus test detects significantly more associations than any individual constituent gene-based test or Pascal/SOCS across UK Biobank traits. Panel B: Comparison of total numbers of genes detected across individual traits for the omnibus test (y-axis) versus single-annotation tests (x-axis).

Fig 6.

UK Biobank analysis: Overlap between gene-based association methods.

Panel A: Total number of significant genes (p-value < 2.5 × 10⁻⁶) for each method across all 128 traits. Unlike Fig 5, the gene-based associations shown here are not filtered or LD-pruned, so a single significant GWAS variant can produce multiple significant gene-based associations for a given method. A larger number of significant genes therefore does not necessarily indicate greater statistical power. Panel B: The (i, j)th heatmap element can be interpreted as the conditional probability that gene-based test i is significant given that gene-based test j is significant, which is estimated as the total number of overlapping significant genes between tests i and j divided by the total number of significant genes for test j.
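The Panel B estimate is a ratio of set sizes, which the sketch below makes concrete. It assumes each test's significant results are available as a set of hashable items (for example, trait and gene pairs); the function name `conditional_overlap` and the input layout are hypothetical.

```python
def conditional_overlap(significant_by_test):
    """Estimate P(test i significant | test j significant) for all pairs.

    significant_by_test maps a test name to the set of genes (or
    (trait, gene) pairs) that the test declared significant.
    Returns a dict keyed by (i, j); an entry is NaN when test j has
    no significant genes, since the ratio is then undefined.
    """
    names = list(significant_by_test)
    matrix = {}
    for i in names:
        for j in names:
            denominator = len(significant_by_test[j])
            overlap = len(significant_by_test[i] & significant_by_test[j])
            matrix[(i, j)] = overlap / denominator if denominator else float("nan")
    return matrix
```

The resulting matrix is not symmetric in general: if test A flags 4 genes, test B flags 6, and 2 are shared, then the (A, B) entry is 2/6 while the (B, A) entry is 2/4.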

Fig 7.

UK Biobank analysis: Performance identifying benchmark genes.

Percentage of loci at which the benchmark gene (identified from HPO and/or ClinVar) is top-ranked for each gene-based association or gene ranking method. For each method, bars on the left (outlined in black) are calculated for benchmark loci present in both HPO and ClinVar (54 loci), and bars on the right (faded outline) are calculated using the union of all HPO and ClinVar loci (153 loci). Horizontal red lines indicate the expected percentage of top-ranked benchmark genes under the null hypothesis that gene rank and benchmark labels are independent. Error bars indicate 95% confidence intervals. TSS-to-top-SNP refers to ranking genes by the distance between their TSS and the most significant single variant at each causal locus; the dTSS-weighted gene-based test (dTSS) uses an exponential weight function to assign higher weight to variants nearer the TSS for each gene (Materials and methods).
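One way to read the null expectation marked by the red lines: if ranks are independent of benchmark labels, the top-ranked gene at a locus is a benchmark gene with probability equal to the fraction of genes at that locus that are benchmark genes. The sketch below averages that fraction over loci. This is an interpretation under that stated assumption, not a description of how the figure was produced, and the function name is hypothetical.

```python
def expected_top_ranked_pct(loci):
    """Expected % of loci whose top-ranked gene is a benchmark gene under the null.

    loci is a list of (n_genes, n_benchmark_genes) tuples, one per locus.
    ASSUMPTION: under independence, every gene at a locus is equally likely
    to be top-ranked, so the per-locus probability is n_benchmark / n_genes.
    """
    if not loci:
        raise ValueError("at least one locus is required")
    probabilities = [n_benchmark / n_genes for n_genes, n_benchmark in loci]
    return 100.0 * sum(probabilities) / len(probabilities)
```

For three hypothetical loci containing 4, 10, and 5 genes with one benchmark gene each, the expected percentage is 100 × (0.25 + 0.10 + 0.20) / 3, or about 18.3%.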
