Fig 1.
An overview of gene-set analyses for GWAS data.
Gene-set analyses for GWAS data require that SNVs are first mapped to genes (A; SNV-to-Gene Mapping), so that GWAS results (that is, SNV-level associations for all tested SNVs) can be used to test genes for association with a phenotype (B; Gene Scoring). The resulting gene scores are then used to identify biological processes enriched for phenotype association (C; Competitive Gene-Set Analysis). All steps are illustrated in the context of our work, which used MAGMA, one of the most popular tools for gene-set analysis of GWAS data. Abbreviations: RI (regulatory interaction).
Table 1.
Overview of SNV-to-gene mappings incorporated into our analyses (baseline model in bold font).
Table 2.
GWAS datasets (summary statistics) analyzed.
Fig 2.
Larger SNV-to-gene mappings (by coverage) yield more significant genes.
(A) The number of significant genes detected for diverse SNV-to-gene mappings as a function of their coverage (that is, the number of non-redundant base pairs covered by all features defining a gene, namely gene bodies, flanks, and regulatory elements, summed across all genes). Blue, solid line shows the trend (loess smooth) for flank-based mappings (including Gene Body + U0D0). For each phenotype, we reported Spearman’s rank correlation coefficient (ρ) and its associated p-value (p) for a test of positive correlation based on all mappings (including gene bodies with 250kb flanks, gene bodies with 500kb flanks, and gene bodies with 1000kb flanks, which are not depicted on the graphs themselves but are provided in Table B in S2 Table). (B) Gene scores of individual genes (circles), compared between a mapping with large flanks (that is, 100kb flanks) and a mapping with small flanks (that is, 10kb flanks, which is the baseline model). Black, solid line shows the identity line. Red, dashed line shows the significance cut-off (α = 0.05). We tested (binomial test) if there were more novel genes (N; genes significant only with large flanks) than lost genes (L; genes significant only with small flanks), against the null hypothesis that both outcomes are equally likely or that losing is more likely. Counts (N:L) and the FDR-adjusted p-value (p) of each test (that is, FDR-adjusted within each phenotype across all the mappings reported in S3 Table) are reported. (A) and (B) Flanks were defined as regions extending from gene bodies; specifically, as UX (U; upstream from the transcription start-site) and DY (D; downstream from the transcription end-site), where X and Y are flank sizes in kb. Phenotype abbreviations: C-Artery Disease (coronary-artery disease); Mac. Degeneration (macular degeneration).
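As an illustration of the two quantities in this caption, the following minimal sketch computes coverage for a toy mapping and the one-sided binomial p-value for a novel-versus-lost comparison. It is not the code used in the study: the function names, the half-open coordinate convention, and the toy data are assumptions made for the example, and the p-value shown is unadjusted (the caption reports FDR-adjusted values).

```python
from collections import defaultdict
from math import comb


def gene_coverage(features):
    """Non-redundant base pairs covered by one gene's features.

    features: iterable of (chrom, start, end) with half-open coordinates
    (an assumed convention for this sketch).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in features:
        by_chrom[chrom].append((start, end))
    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:      # overlapping or adjacent: extend the run
                cur_end = max(cur_end, end)
            else:                     # gap: close the current run
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


def mapping_coverage(mapping):
    """Coverage of a mapping: per-gene coverage summed across all genes."""
    return sum(gene_coverage(features) for features in mapping.values())


def binomial_p_more_novel(n_novel, n_lost):
    """One-sided p-value for P(X >= n_novel) with X ~ Binomial(N + L, 0.5)."""
    n = n_novel + n_lost
    return sum(comb(n, k) for k in range(n_novel, n + 1)) / 2 ** n


# Toy mapping (hypothetical genes and coordinates).
toy_mapping = {
    "GENE_A": [("chr1", 100, 200), ("chr1", 150, 260), ("chr1", 500, 520)],
    "GENE_B": [("chr1", 240, 300)],
}
toy_coverage = mapping_coverage(toy_mapping)
toy_p = binomial_p_more_novel(8, 2)
```

Because coverage is summed across genes, base pairs shared by two genes (here, 240 to 260 on chr1) count once for each gene, while overlapping features within a gene count only once.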
Fig 3.
Genuine augmentations are associated with more significant genes than matched, random augmentations.
The number of significant genes detected for selected, genuine augmentations (saturated bars), in comparison to matched, random augmentations (desaturated bars). For random augmentations, counts and error bars represent the mean and standard deviation, respectively (based on 20 independent permutations of EPVP). Red, dashed line shows the number of significant genes detected with the baseline model itself. We tested (t-test) if there were more significant genes with genuine augmentation than with matched, random augmentation (the letter “a” above a saturated bar indicates that the p-value of the test, which was FDR-adjusted across all mappings within each phenotype, was smaller than 0.05). We then tested, in a similar way, if there were more significant genes with the baseline model than with random augmentation of the baseline model (the letter “b” above a desaturated bar indicates an FDR-adjusted p-value smaller than 0.05). The augmentation, “Big Flanks”, refers to gene bodies with 100kb upstream and downstream flanks (note, as part of the baseline model, the first 10kb on either side of a gene body were not permuted by EPVP). Phenotype abbreviations: C-Artery Disease (coronary-artery disease); Mac. Degeneration (macular degeneration). Mapping abbreviations: Br-Neuronal Cells (brain-neuronal cells); H-Muscle Cells (heart-muscle cells); NP Cells (neural-progenitor cells); Pa-Islet Cells (pancreatic-islet cells); Pr-Gland Cells (prostate-gland cells).
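The FDR adjustment across mappings within a phenotype can be sketched as follows. The caption does not name the FDR procedure, so the Benjamini-Hochberg step-up adjustment used here is an assumption, and the function name and p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumption: the caption only says "FDR-adjusted"; BH is used here
    as one common choice, not as the study's confirmed procedure.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four mappings within one phenotype.
raw = [0.01, 0.04, 0.03, 0.20]
adj = bh_adjust(raw)
# A bar would carry a significance letter only where adj < 0.05.
flagged = [q < 0.05 for q in adj]
```

In this toy case only the first mapping stays below 0.05 after adjustment, even though three raw p-values were below that threshold.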
Fig 4.
Mappings that yield more significant genes are associated with fewer significant gene sets.
(A) The number of non-redundant, significant gene sets detected as a function of the number of significant genes detected for diverse mappings. For each phenotype, we reported Spearman’s rank correlation coefficient (ρ) and its associated p-value (p) for a test of negative correlation based on all mappings (including gene bodies with 250kb flanks, gene bodies with 500kb flanks, and gene bodies with 1000kb flanks, which are not depicted on the graphs themselves but are provided in Table B in S6 Table). (B) Gene-set scores of individual gene sets (circles), compared between a mapping with large flanks (that is, 100kb flanks) and a mapping with small flanks (that is, 10kb flanks, which is the baseline model). All gene sets are shown (irrespective of redundancy). Black, solid line shows the identity line. Red, dashed line shows the significance cut-off (α = 0.05). We tested (one-sided, paired Wilcoxon test) if there was a tendency for gene-set scores to be attenuated with large flanks (relative to small flanks). The FDR-adjusted p-value (p) of each test (that is, FDR-adjusted within each phenotype across all the mappings reported in S6 Table) is shown (note that extreme p-values were truncated to 1e-300 for readability). (A) and (B) Flanks were defined as regions extending from gene bodies; specifically, as UX (U; upstream from the transcription start-site) and DY (D; downstream from the transcription end-site), where X and Y are flank sizes in kb. Phenotype abbreviations: C-Artery Disease (coronary-artery disease); Mac. Degeneration (macular degeneration).
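To make the correlation in panel (A) concrete, here is a minimal, dependency-free sketch of Spearman’s rank correlation coefficient (ρ), computed as the Pearson correlation of average ranks. The function names and the counts are hypothetical, and the associated one-sided p-value is deliberately not computed here.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical counts for five mappings of one phenotype.
significant_genes = [10, 20, 30, 40, 50]
significant_gene_sets = [9, 7, 8, 3, 1]
rho = spearman_rho(significant_genes, significant_gene_sets)
```

A strongly negative ρ, as in this toy example, corresponds to the pattern in the figure title: mappings with more significant genes tend to have fewer significant gene sets.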
Fig 5.
Gene sets detected as significant with some augmentation do not necessarily gain from that augmentation.
Significant gene sets detected for selected augmentations for each phenotype were stratified according to whether they gained from augmentation (that is, demonstrated a stronger enrichment for phenotype association with augmentation than without it) or not. A bar directed to the right-hand side (positive values) denotes the number of gene sets that gained (“gaining”), and a bar directed to the left-hand side (negative values) denotes the number of gene sets that did not gain (“non-gaining”). The augmentation, “Big Flanks”, refers to gene bodies with 100kb upstream and downstream flanks. Phenotype abbreviations: C-Artery Disease (coronary-artery disease); Mac. Degeneration (macular degeneration). Mapping abbreviations: Br-Neuronal Cells (brain-neuronal cells); H-Muscle Cells (heart-muscle cells); NP Cells (neural-progenitor cells); Pa-Islet Cells (pancreatic-islet cells); Pr-Gland Cells (prostate-gland cells).
Fig 6.
Significant gene sets that gain from augmentation often gain no more from genuine augmentation than from matched, random augmentation.
Left panel: the numbers of gaining (bars directed rightwards and positive values) and non-gaining (bars directed leftwards and negative values) gene sets amongst significant gene sets detected for atrial fibrillation with selected augmentations (note, this is similar to the graph for atrial fibrillation in Fig 5). Right panel: significant gene sets that gained from augmentation were stratified and counted according to three validation categories (strongly validated, mildly validated, and invalidated), which reflect how much more pronounced a gain was with genuine augmentation than with matched, random augmentation. The augmentation, “Big Flanks”, refers to gene bodies with 100kb upstream and downstream flanks (note, as part of the baseline model, the first 10kb on either side of a gene body were not permuted by EPVP). Mapping abbreviations: H-Muscle Cells (heart-muscle cells). Refer to S4 Fig for similar graphs for the other phenotypes.
Fig 7.
The IRED procedure for identifying robust gains from augmentation for gene sets.
Genes were iteratively and cumulatively removed, one by one, from a selected gene set (beginning with the top-gaining gene, then the second-biggest gaining gene, and so forth). At each iteration, the impact on the gain of the gene set itself was inspected. One dot is shown for each iteration. Each dot represents the gene-set score difference (determined following probit transformations of one minus each FDR-adjusted, upper-tail p-value) in relation to the difference in the gene score (as used in gene-set analysis and for ranking the genes for IRED, before multiple-testing correction) of the top-gaining gene still present at an iteration. Given the nature of the procedure, progressive iterations are traced by moving from right to left on a graph. For clarity, red dots mean that the gene set demonstrates a gain from augmentation (otherwise, dots are black). The number of iterations for a gain to be lost for the first time (that is, the number of red dots before the first black dot is encountered when moving from right to left) is counted to assess the robustness of a gain of a gene set. For both graphs, some top-gaining genes have been labelled for reference. (A) A robust gain (actin-mediated cell contraction gene set detected for atrial fibrillation with augmentation from the EPM of JEME dataset of regulatory interactions). (B) A non-robust gain (positive T-cell selection gene set detected for Crohn’s disease with augmentation from the EPM of PsychENCODE dataset of regulatory interactions).
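The iteration logic described in this caption can be sketched as below. This is a schematic under stated assumptions, not the study’s implementation: in practice each iteration requires re-running the gene-set analysis on the reduced gene set, which is represented here by a user-supplied callable (`gain_of_set`, a hypothetical name), and the toy stand-in simply averages per-gene gains.

```python
from statistics import NormalDist


def probit_score(p_upper):
    """Probit of one minus an upper-tail p-value (requires 0 < p < 1)."""
    return NormalDist().inv_cdf(1.0 - p_upper)


def set_score_difference(p_augmented, p_baseline):
    """Gene-set gain: probit score with augmentation minus without."""
    return probit_score(p_augmented) - probit_score(p_baseline)


def ired_iterations(gene_gains, gain_of_set):
    """Count iterations (red dots) before the gain is first lost.

    gene_gains: dict of gene -> gene-score difference (augmented - baseline).
    gain_of_set: callable taking the remaining genes and returning the
        set-level gain; a hypothetical stand-in for re-running the
        gene-set analysis on the reduced set.
    """
    remaining = sorted(gene_gains, key=gene_gains.get, reverse=True)
    count = 0
    while remaining and gain_of_set(remaining) > 0:
        count += 1
        remaining = remaining[1:]   # drop the top-gaining gene still present
    return count


def make_toy_gain(gene_gains):
    """Toy set-level gain: mean of per-gene gains (illustration only)."""
    return lambda genes: sum(gene_gains[g] for g in genes) / len(genes)


robust = {"G1": 1.0, "G2": 0.9, "G3": 0.8, "G4": -0.1}
fragile = {"G1": 3.0, "G2": 0.5, "G3": -0.2, "G4": -0.4}
robust_count = ired_iterations(robust, make_toy_gain(robust))
fragile_count = ired_iterations(fragile, make_toy_gain(fragile))
```

In the toy data, the “fragile” set loses its gain as soon as its single top-gaining gene is removed, whereas the “robust” set keeps gaining through several removals, mirroring the contrast between panels (B) and (A).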
Fig 8.
Identifying genes and regulatory elements that carry gains from particular augmentations for selected gene sets.
(A) Ten top-gaining genes from the schizophrenia-associated gene set, post-synaptic chemical transmission (officially, go_chemical_synaptic_transmission_postsynaptic), which gained from augmentation of the baseline model with the pc-HiC of brain dataset of regulatory interactions, were inspected to relate regulatory elements to schizophrenia risk. Left panel: genes illustrated in their genomic context to relate their regulatory elements to their gains (right panel), and in turn, to the gain at the level of the gene set itself. Regulatory elements that overlap with strong SNV-level associations for a phenotype (note, SNV-level associations from the GWAS dataset are depicted on the red midline) are of particular interest (note that the legend specifies the most significant SNV-level association represented by a given shade of red). Regulatory interactions involving interesting regulatory elements may be independently supported by eQTL data (GTEx version 8 for the European population, covering all tissues and cell types) for the same gene (purple dots). Right panel: gene scores (as used in gene-set analysis, before multiple-testing correction) for selected mappings. Bigger, positive scores imply a stronger phenotype association. For random augmentation, gene scores and error bars represent the mean and standard deviation, respectively (based on 20 independent permutations of EPVP). (B) Ten top-gaining genes from the type-2-diabetes-associated gene set, endocrine system development (officially, go_endocrine_system_development), which gained from augmentation with the cMap of Pa-Islet Cells dataset of regulatory interactions, were likewise inspected. Mapping abbreviations: Pa-Islet Cells (pancreatic-islet cells).