Fig 1.
Overview of the methodology to compute gene and pathway scores.
a) We compute gene scores by aggregating SNP p-values from a GWAS meta-analysis (without the need for individual genotypes), while correcting for linkage disequilibrium (LD) structure. To this end, we use numerical and analytic solutions to compute gene p-values efficiently and accurately given LD information from a reference population (e.g. one provided by the 1000 Genomes Project[22]). Two options are available: the max and sum of chi-squared statistics, which are based on the most significant SNP and the average association signal across the region, respectively. b) We use external databases to define gene sets for each reported pathway. We then compute pathway scores by combining the scores of genes that belong to the same pathways, i.e. gene sets. The fast gene scoring method allows us to dynamically recalculate gene scores by aggregating SNP p-values across pathway genes that are in LD and thus cannot be treated independently. This amounts to fusing the genes and computing a new score that takes the full LD structure of the corresponding locus into account. We evaluate pathway enrichment of high-scoring (possibly fused) genes using one of two parameter-free procedures (chi-squared or empirical score), avoiding any p-value thresholds inherent to standard binary enrichment tests.
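As a minimal sketch of the two gene-level statistics named in (a), the following Python converts two-sided SNP p-values into 1-df chi-squared statistics and forms their sum and max. The function names are hypothetical. The sketch deliberately stops at the raw statistics: it omits the LD-aware null distribution that is needed to turn them into gene p-values.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def snp_chi2(p_value: float) -> float:
    """Convert a two-sided SNP p-value into a 1-df chi-squared statistic."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must lie in (0, 1]")
    # For x = z**2 with z standard normal, p = 2 * Phi(-|z|),
    # hence z = Phi^{-1}(p / 2) and x = z**2.
    z = _STD_NORMAL.inv_cdf(p_value / 2.0)
    return z * z


def gene_statistics(snp_p_values):
    """Return the sum and max chi-squared statistics for one gene region.

    Illustration only: no LD correction is applied here.
    """
    chi2 = [snp_chi2(p) for p in snp_p_values]
    return {"sum": sum(chi2), "max": max(chi2)}


# Example with three hypothetical SNP p-values in one gene region.
print(gene_statistics([0.05, 1.0, 0.5]))
```

The max statistic depends only on the strongest SNP, whereas the sum statistic accumulates signal from every SNP in the region, which is why the two options can rank genes differently.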
Fig 2.
Comparing efficiency between VEGAS and Pascal.
a) Run times of VEGAS and Pascal (both options). Gene scores were computed on two GWAS (one HapMap imputed[23], one 1KG imputed[22,25]) for 18,132 genes on a single core. Pascal was compared to VEGAS for the HapMap-imputed study and to VEGAS2 for the 1KG-imputed study. For this plot, VEGAS and VEGAS2 were used with the default maximum number of Monte Carlo samples of 10^6 for both studies and additionally with 10^8 Monte Carlo samples for the HapMap-imputed study. b) Scatter plot of -log10-transformed gene p-values for the sum gene scores obtained by VEGAS and Pascal, respectively. P-values above 10^-6 are in excellent concordance. Below this value VEGAS could not give precise estimates, since it was run with the maximal number of Monte Carlo samples set to 10^8.
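The loss of precision for very small p-values follows from simple counting. With N Monte Carlo samples, a true p-value p yields about N·p exceedances, and the relative standard error of the estimated p-value is roughly sqrt((1 − p)/(N·p)). The sketch below works through this arithmetic. The function name is hypothetical, and the binomial model is an assumption about sampling-based estimators in general, not a description of the VEGAS implementation.

```python
import math


def monte_carlo_precision(p_true: float, n_samples: int):
    """Expected exceedance count and relative standard error of a
    Monte Carlo p-value estimate, under a simple binomial model."""
    expected_hits = n_samples * p_true
    rel_se = math.sqrt((1.0 - p_true) / expected_hits)
    return expected_hits, rel_se


# Precision at a true p-value of 1e-6 for the two sample sizes in the caption.
for n in (10**6, 10**8):
    hits, rel_se = monte_carlo_precision(1e-6, n)
    print(f"N={n:.0e}: expected exceedances={hits:.0f}, relative SE={rel_se:.2f}")
```

Under this model, a p-value of one in a million rests on about one exceedance with a million samples and about 100 exceedances with a hundred million samples, so estimates below that level quickly become unreliable.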
Fig 3.
Pathway scores for random phenotypes.
As input data we used 100 simulated instances of a random Gaussian phenotype and genotype data for 379 individuals from the EUR-1KG panel. Using the Pascal pipeline with sum gene scores and chi-squared pathway integration strategy we computed p-values for 1,077 pathways from our pathway library (results for max gene scores are similar, see S4 Fig). Panel (a) shows the p-value distributions without merging of neighbouring genes and (b) with merging of neighbouring genes (gene-fusion strategy). P-value distributions are represented by QQ-plots (upper panels) and histograms (lower panels). Results are colour-coded according to the fraction of genes in a given pathway that have a neighbouring gene in the same pathway, i.e. that are located nearby on the genome (distance <300 kb). (a) P-values of pathways that contain genes in LD are strongly inflated without correction. (b) The gene-fusion approach provides well-calibrated p-values independently of the number of pathway genes in LD.
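The colour-coding quantity can be illustrated with a short sketch that computes, for one pathway, the fraction of genes that have a neighbour within 300 kb in the same pathway. The `Gene` record, the function names and the definition of distance as the gap between gene intervals are all assumptions made for illustration; the caption does not specify how the distance is measured.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # base-pair coordinate
    end: int    # base-pair coordinate, end >= start


def gap_bp(a: Gene, b: Gene) -> int:
    """Gap between two gene intervals on one chromosome (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def fraction_with_neighbour(genes, max_gap_bp: int = 300_000) -> float:
    """Fraction of pathway genes with another pathway gene closer than max_gap_bp."""
    if not genes:
        return 0.0
    n_with_neighbour = 0
    for gene in genes:
        has_neighbour = any(
            other is not gene
            and other.chrom == gene.chrom
            and gap_bp(gene, other) < max_gap_bp
            for other in genes
        )
        n_with_neighbour += has_neighbour
    return n_with_neighbour / len(genes)


# Hypothetical pathway: A and B are 150 kb apart, C and D are isolated.
pathway = [
    Gene("A", "chr1", 1_000_000, 1_050_000),
    Gene("B", "chr1", 1_200_000, 1_250_000),
    Gene("C", "chr1", 5_000_000, 5_010_000),
    Gene("D", "chr2", 1_000_000, 1_050_000),
]
print(fraction_with_neighbour(pathway))
```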
Fig 4.
Performance of pathway enrichment methods for blood lipid traits and Crohn’s disease.
Displayed is the mean area under the precision-recall curve (AUC) for pathways identified using Pascal, a standard hypergeometric test at various gene score threshold levels, and a rank-sum test (vertical bars show the standard error). We show results for the max gene scores (sum gene score results are similar, see S5 Fig). a) Results for four blood lipid traits. The gold standard pathway list was defined as all pathways that show a significance level below 5×10^-6 for any of the tested threshold parameters for hypergeometric tests in the largest study of lipid traits to date[23]. The significance level of 5×10^-6 corresponds to the Bonferroni-corrected, genome-wide significance threshold at the 0.5% level for a single method. For each phenotype, error bars denote the standard error computed from three independent subsamples of the CoLaus study (1,500 individuals each). Pascal pathway scores perform well overall, whereas results for discrete gene sets vary widely with the choice of threshold parameter for the hypergeometric test. b) Results for Crohn’s disease using the same approach as in (a). A reference standard pathway list was defined as in (a) using the largest study of Crohn’s disease to date[31]. The chi-squared strategy performs at least as well as all other strategies in this setting, whereas performance of the hypergeometric testing strategy varies.
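To make the evaluation metric concrete, the following sketch computes the area under the precision-recall curve for a ranked pathway list against a reference set, using average precision as the estimator. Average precision is one common choice and is an assumption here, since the caption does not state which estimator was used. The function name and the example pathway names are hypothetical.

```python
def average_precision(scored_pathways, gold_standard) -> float:
    """Average precision of pathways ranked by ascending p-value.

    scored_pathways: iterable of (pathway_name, p_value) pairs.
    gold_standard:   set of pathway names treated as true positives.
    Only gold-standard pathways present in scored_pathways are counted.
    """
    ranked = sorted(scored_pathways, key=lambda item: item[1])
    n_positive = sum(1 for name, _ in ranked if name in gold_standard)
    if n_positive == 0:
        raise ValueError("no gold-standard pathway among the scored pathways")
    hits = 0
    precision_sum = 0.0
    for rank, (name, _) in enumerate(ranked, start=1):
        if name in gold_standard:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / n_positive


# Hypothetical example: two of four pathways are in the reference set.
scores = [("P1", 0.001), ("P2", 0.01), ("P3", 0.02), ("P4", 0.5)]
print(average_precision(scores, {"P1", "P3"}))
```

A method that ranks every reference pathway ahead of all others scores 1 on this measure, so the metric rewards the ordering of pathways rather than any single significance cutoff.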
Fig 5.
Power of pathway scoring methods across diverse traits and diseases.
Bar heights represent the number of pathways found to be significant after Bonferroni correction. Within a given trait group, results are aggregated over all tested GWAS. In total, 65 GWAS had at least one significant pathway for at least one of the tested methods. For each GWAS, the raw number of significant pathways was divided by the number of pathways found by the best-performing method. This normalisation prevents a few studies with many significant pathways from dominating the aggregate. We show results for the max gene scores (sum gene score results are similar, see S6 Fig). (a) Results are aggregated over all trait groups. (b) Results for different trait groups.
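The per-study normalisation can be sketched as follows. The function name, the input layout and the choice to aggregate by summing the normalised counts are assumptions for illustration; the caption only states that each count is divided by that of the best-performing method.

```python
def normalised_pathway_counts(counts_by_gwas):
    """Aggregate per-method counts after normalising within each GWAS.

    counts_by_gwas: {gwas_name: {method_name: n_significant_pathways}}
    Returns {method_name: sum over GWAS of n / best_n_in_that_GWAS}.
    GWAS in which no method finds a significant pathway are skipped.
    """
    totals = {}
    for counts in counts_by_gwas.values():
        best = max(counts.values(), default=0)
        if best == 0:
            continue  # nothing significant for any method in this study
        for method, n in counts.items():
            totals[method] = totals.get(method, 0.0) + n / best
    return totals


# Hypothetical counts for three studies and two methods.
example = {
    "gwas_1": {"pascal": 10, "hypergeometric": 5},
    "gwas_2": {"pascal": 2, "hypergeometric": 2},
    "gwas_3": {"pascal": 0, "hypergeometric": 0},
}
print(normalised_pathway_counts(example))
```

With this scheme each study contributes at most 1 to any method's total, regardless of how many pathways it yields.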
Fig 6.
Examples of pathway enrichments comparing Pascal (chi-squared method) to the hypergeometric method.
Displayed are results for four phenotypes showing improvement when using Pascal instead of the hypergeometric (binary) enrichment strategy at the 5% threshold level. Underlying gene scores were calculated using the sum method. Dashed lines refer to the Bonferroni significance level when correcting for the number of pathways (1,077). Apart from a few cancer-related pathways, all pathways highlighted by this analysis have been implicated by prior research (see main text). (a) For the trait insulin resistance, Pascal scored the pathway insulin signal attenuation first, followed by two other trait-relevant pathways (PI3K AKT activation and insulin receptor signaling), while the hypergeometric test did not find any significant pathways. (b) For smoking amount (number of cigarettes per day), Pascal revealed three significant pathways related to nicotinic acetylcholine receptors. (c) For osteoporosis, two cancer-related pathways scored significant using both Pascal and the hypergeometric test, but only Pascal revealed the WNT and Hedgehog signaling pathways, which are known to be involved in osteoblast biology. (d) For macular degeneration, Pascal found three significant, trait-relevant pathways related to lipoproteins and the complement system.
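For comparison, the binary enrichment baseline can be sketched as a one-sided hypergeometric test: genes are first split into high-scoring and remaining genes at a chosen threshold, and the test asks whether a pathway contains more high-scoring genes than expected by chance. The function and parameter names are hypothetical, and the example numbers are invented for illustration rather than taken from the study.

```python
from math import comb


def hypergeom_enrichment_p(n_genes: int, n_top: int,
                           n_pathway: int, n_overlap: int) -> float:
    """P(X >= n_overlap) when n_top high-scoring genes are drawn without
    replacement from n_genes genes, of which n_pathway are in the pathway."""
    upper = min(n_top, n_pathway)
    denominator = comb(n_genes, n_top)
    numerator = sum(
        comb(n_pathway, k) * comb(n_genes - n_pathway, n_top - k)
        for k in range(n_overlap, upper + 1)
    )
    # Exact integer arithmetic, converted to float only at the end.
    return numerator / denominator


# Invented example: 1,000 genes, top 50 called high-scoring,
# a 20-gene pathway containing 5 of those high-scoring genes.
print(hypergeom_enrichment_p(1000, 50, 20, 5))
```

Because the result depends on where the high-scoring cutoff is placed, the same pathway can move in and out of significance as the threshold changes, which is the sensitivity that threshold-free scores avoid.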