Gene set meta-analysis with Quantitative Set Analysis for Gene Expression (QuSAGE)

Small sample sizes combined with high person-to-person variability can make it difficult to detect significant gene expression changes from transcriptional profiling studies. Subtle, but coordinated, gene expression changes may be detected using gene set analysis approaches. Meta-analysis is another approach to increase the power to detect biologically relevant changes by integrating information from multiple studies. Here, we present a framework that combines both approaches and allows for meta-analysis of gene sets. QuSAGE meta-analysis extends our previously published QuSAGE framework, which offers several advantages for gene set analysis, including fully accounting for gene-gene correlations and quantifying gene set activity as a full probability density function. Application of QuSAGE meta-analysis to influenza vaccination response shows it can detect significant activity that is not apparent in individual studies.


Introduction
Whole-genome transcriptional profiling, using DNA microarray technology or next-generation sequencing (RNA-seq), is widely used to gain insights into disease pathophysiology and response to therapy. While it is important to identify individual genetic associations, the high level of variation between individuals due to genetic and phenotypic heterogeneity can result in inconsistent biological insights [1]. With the availability of biological annotation for known genes [2][3][4][5], the focus of gene analysis has shifted from individual genes to gene sets. Gene set analysis can be used to detect and compare the activity of pre-defined lists of genes that can be related directly to the underlying biological processes. Compared to differential expression PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1006899 April 2, 2019 1 / 10 a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 (DE) analysis of individual genes, gene set analysis examines the cumulative effect of multiple related genes, and thus offers the possibility to detect more subtle, but coordinated, expression changes [6][7][8][9][10]. Despite this increased power, gene set analysis can still be limited by the small sample sizes of many current studies. Combining multiple related studies through meta-analysis offers the possibility of increased power and improved reproducibility [11]. Such studies can leverage the large and growing number of transcriptional profiling data sets available in public repositories, such as GEO [12]. However, combining information from multiple studies and performing meta-analysis at the gene set level remains challenging. Meta-Analysis of Pathway Enrichment (MAPE), including MAPE-P, MAPE-G, and MAPE-I, use maximum, minimum, or Fisher's statistics to combine P values from each individual study for meta-analysis [13]. Instead of combining P values, MetaPath leverages a Bayesian model and was developed to perform gene set meta-analysis by simultaneously modeling gene expression data and gene set information from multiple studies [14]. Recently, Lu et al. developed iGSEA that uses an adaptive testing method for choosing either random Effects (RE) or fixed effects (FE) model to integrate gene set analysis from multiple studies [15]. We previously proposed Quantitative Set Analysis for Gene Expression (QuSAGE) [16] as a computational framework for gene set analysis. QuSAGE quantifies gene set activity with a complete probability density function (PDF), and improves power by accounting for genegene correlations. The QuSAGE R package is available on Bioconductor [17], and is widely used with 1554 downloads from distinct IPs in 2017. In 2015, Turner et al. extended the applicability of QuSAGE to longitudinal studies by adding functionality for general linear mixed models [18]. In this study, we further extend the applicability of QuSAGE to include metaanalysis of gene sets. QuSAGE meta-analysis was adopted by the NIH/NIAID Human Immunology Project Consortium (HIPC)-Center for Human Immunology (CHI) Signature Project Team to successfully detect baseline transcriptional predictors of influenza vaccination responses from multiple studies [19].
As an alternative gene set meta-analysis method, QuSAGE meta-analysis has several advantages: 1) It is a natural extension of QuSAGE, so it facilitates gene set meta-analysis for the large number of existing QuSAGE users, 2) QuSAGE improves power by accounting for genegene correlations and QuSAGE meta-analysis inherits this advantage, and 3) Since QuSAGE quantifies a gene set activity with a PDF, it is capable of performing complicated post hoc comparisons that other gene set meta-analysis methods cannot achieve easily, as we demonstrate in our case study.

Design & implementation
QuSAGE quantifies gene set activity with a complete probability density function (PDF). The QuSAGE meta-analysis pipeline proceeds in three steps (Fig 1).
Frist, gene set analysis is performed with gene expression data separately for each individual study using QuSAGE. Differential gene expression of individual gene is quantified by a full PDF rather than a single P value. Then all PDFs of genes within the gene set of interest are combined into a single activity (PDF) using numerical convolution. The variance of the combined PDF is corrected for gene-gene correlation by calculating a variance inflation factor (VIF).
Next, the meta-analysis is performed through the function combinePDFs (Table 1). To carry out meta-analysis of S studies, the PDFs from each individual study are combined into a single PDF using a weighted numeric convolution algorithm [20]. The sample sizes of each study are considered as weight factors. In short, the continuous PDFs are sampled within an interval that spans their individual ranges. Each PDF is sampled by a finite number of points that is proportional to its weight. These discretized PDFs are then convoluted and the result is resampled and transformed back to the initial interval. P values and confidence intervals can be easily extracted from the resulting combined PDF.
Finally, the results of QuSAGE meta-analysis can be visualized by the function plotCombinedPDF. Gene expression data of each study is first analyzed separately by QuSAGE to produce gene set activity PDFs. Next, meta-analysis is performed through the function combinePDFs, where PDFs from each individual study are combined into a single PDF using a weighted numeric convolution algorithm. The results of QuSAGE meta-analysis can then be visualized by the function plotCombinedPDF. https://doi.org/10.1371/journal.pcbi.1006899.g001

Results
To illustrate how QuSAGE meta-analysis works, we analyzed three influenza vaccination transcriptional profiling studies of young adults [21]. The data from these studies is available in GEO (GSE59635, GSE59654, and GSE59743) and ImmPort (SDY63, SDY404, and SDY400). The goal of the analysis was to detect gene sets associated with successful (i.e., high) antibody responses using the transcriptional response data measured from blood samples taken pre-and 7 days post-vaccination. Subjects were categorized as high-responders (HR) and low-responders (LR) based on their adjusted maximum fold change (adjMFC) from hemagglutination inhibition assay (HAI) measurements taken pre-and 28 days post-vaccination [22]. GSE59635 (SDY63) included 7 young subjects (3 LR and 4 HR); GSE59654 (SDY404) contained 13 young subjects (7 LR and 6 HR); GSE59743 (SDY400) had 15 young subjects (7 LR and 8 HR). The data and R code of this case study can be found from: https://bitbucket.org/kleinstein/qusage.
The analysis consisted of two major steps: 1. Identify candidate vaccination response gene sets. First, the set of 346 blood transcription modules (BTMs) described in Li et al. [4] was filtered to a smaller list of "response" sets that showed significant activity following influenza vaccination in the set of HR subjects. To define these response gene sets, QuSAGE meta-analysis was used to compare day 7 postvaccination with pre-vaccination transcriptional profiles in HR subjects across all three studies. This analysis identified 62 response gene sets with a Benjamani-Hochberg false discovery rate (FDR) cutoff of 5%.
2. Detect gene sets associated with successful antibody responses. For each response gene set selected in step 1, QuSAGE was first used to carry out a two-way comparison on each study independently. A PDF reflecting the response difference between HR and LR was quantified by calculating the difference of two PDFs, one representing the temporal gene set activity in HR (day 7 vs. pre-vaccination) and the other representing LR (day 7 vs. prevaccination). Next, QuSAGE meta-analysis was used to combine the PDFs from the three studies into one single PDF. Statistical significance of the meta-analysis was calculated by testing whether the central tendency of the final PDF is zero using a two-sided test with 15% FDR cutoff.
As expected from the known biology, "plasma cells, immunoglobulins (M156.1)" was one of top-ranked gene sets from QuSAGE meta-analysis (Fig 2), and was significantly more up-regulated (day 7 vs. pre-vaccination) in HR compared to LR. In total, QuSAGE meta-analysis identified 11 gene sets associated with a successful antibody response (Table 2). In most cases (8 of 11; 73%), the QuSAGE meta-analysis of these gene sets yielded a lower P value compared with the individual studies.
We next compared QuSAGE meta-analysis with other meta-analysis approaches. Existing gene set meta-analysis methods were designed to perform pairwise comparisons between two phenotypes/conditions and cannot be easily applied to the four-way comparison in our case study. For our comparative analysis, we first used Fisher's method [23] and Stouffer's method [24] to combine P values from QuSAGE single gene set analysis from each study and compared the results with QuSAGE meta-analysis. Using the same FDR cutoff of 15%, Fisher's QuSAGE meta-analysis of gene set "plasma cells, immunoglobulins (M156.1)". The differential response between HR and LR subjects was first calculated for each individual study (colored lines). QuSAGE meta-analysis was then used to combine these individual PDFs into a single meta-analysis PDF (black line).
https://doi.org/10.1371/journal.pcbi.1006899.g002 method and Stouffer's method identified fewer gene sets than QuSAGE. Fisher's method and Stouffer's method identified 4 and 1 significant gene sets, respectively, including only a single gene set not found by QuSAGE (Fig 3A, Table 2). It is possible that QuSAGE meta-analysis was more sensitive, and identified additional significant gene sets, compared with Fisher's method or Stouffer's method at the cost of decreased specificity. To investigate the specificity of QuSAGE meta-analysis, we permutated the labels of LR and HR individuals 2000 times and applied the same meta-analyses using all three approaches. With the same FDR cutoff 15% applied to each permutation, only 134 out of 2000 permutations generated even a single false positive gene set result using QuSAGE meta-analysis; while 380 and 384 permutations produced false positives when using Fisher's and Stouffer's method, respectively (Fig 3B). These results suggest that QuSAGE meta-analysis is conservative and the increased number of significant gene sets identified by QuSAGE in the real data was not due to QuSAGE simply generating lower P values (i.e., QuSAGE meta-analysis is not trading off specificity for sensitivity). However, a limitation of Fisher's method and Stouffer's method is that neither accounts for the direction of gene set activity (e.g., higher in HR vs. higher in LR), but simply combines the resulting P values from each individual study. As a consequence, low P values may be produced by cases where the change for the individual studies is significant but in different directions, leading to false positives. To account for the directionality of gene set activity differences when applying Fisher's method and Stouffer's method, we carried out a three-step analysis, which were referred to directional Fisher's method and directional Stouffer's method. First, separate one-tailed tests were carried out for each study to test for (1) higher gene set activity in HR, and (2) higher gene set activity in LR. In this way, lower P values in each type of onetailed test, have a consistent meaning. Second, in the meta-analysis, Fisher's method or Stouffer's method was applied to the set of P values from each type of one-tailed test to generate a combined P values. Third, the final P value of the meta-analysis was the smaller of the two combined P value from each of the one-tailed tests, corrected by multiplying by 2. We also tested another popular meta-analysis method in which effect sizes (Hedges' g) are calculated for every gene set in each study separately and then combined using linear (mixed-effects) models (implemented in the rma() function from the metafor R package, and hereafter referred to as the "effect-size" method) [25]. Using the same FDR cutoff of 15%, directional Fisher's method, Stouffer's method and the effect-size method identified 16, 27 and 40 significant gene sets respectively (S1 Table). All 11 gene sets detected by QuSAGE meta-analysis were found by directional Fisher's method and directional Stouffer's method, and 10 of the 11 gene sets were found by the effect-size method, suggesting a high level of confidence in the QuSAGE results (Fig 4A). To quantify the specificity of the three approaches, we permutated the labels of LR and HR individuals 2000 times and applied the same meta-analyses on each permuted data set. With the same FDR cutoff 15% applied to each permutation, QuSAGE meta-analysis generated false positive results in only 8% (159 out of 2000) of the permutations (Fig 4B). In contrast, directional Fisher's method, directional Stouffer's method and the effectsize method generated at least one false positive gene set in 17%, 14% and 63% (337, 280 and 1267 out of 2000) of the permutations, respectively (Fig 4B).This higher false positive rate may account, at least partially, for the additional gene sets identified by directional Fisher's method, directional Stouffer's method and the effect-size method. Overall, the results on this case study show that QuSAGE meta-analysis is comparable with existing methods, but has better specificity.

Gene Sets
In this study, we describe an extension of QuSAGE to enable meta-analysis of gene sets. Instead of summarizing P values, QuSAGE integrates gene set activity and estimates a full PDF of activity across multiple studies, thus easing the process of post hoc comparisons. Furthermore, by integrating information from a larger pool of samples, QuSAGE meta-analysis increases the power of analysis, and allows detection of biologically-relevant gene sets that would not be detectable in single studies. Existing common meta-analysis methods, such as Fisher's method, Stouffer's method, or the effect-size method, are limited by the fact that the gene set activity from each study is represented by a single P value (Stouffer weighs P values by sample size from each study) or a single statistic (effect size). However, QuSAGE describes the gene set activity using a PDF and the meta-analysis of QuSAGE fully takes the advantage of the richer information provided from PDFs. QuSAGE meta-analysis combines PDFs from multiple studies using a weighted numeric convolution algorithm, and thus implicitly considers not only the differences but also directions and confidence intervals of gene set activities, leading to a more accurate estimation of combined gene set activity. The QuSAGE algorithm is also computationally efficient. It took totally only 4 minutes to run the whole case study in our manuscript on a single PC with a 2.80GHz Intel Core i7 CPU and 16G memory. Our case study suggests that QuSAGE is comparable or better than the commonly used Fisher and Stouffer methods. In the future, performing comparisons of QuSAGE with other existing meta-analysis methods [13][14][15]26]would be desirable.