Integrating predicted transcriptome from multiple tissues improves association detection.

Integration of genome-wide association studies (GWAS) and expression quantitative trait loci (eQTL) studies is needed to improve our understanding of the biological mechanisms underlying GWAS hits, and our ability to identify therapeutic targets. Gene-level association methods such as PrediXcan can prioritize candidate targets. However, limited eQTL sample sizes and absence of relevant developmental and disease context restrict our ability to detect associations. Here we propose an efficient statistical method (MultiXcan) that leverages the substantial sharing of eQTLs across tissues and contexts to improve our ability to identify potential target genes. MultiXcan integrates evidence across multiple panels using multivariate regression, which naturally takes into account the correlation structure. We apply our method to simulated and real traits from the UK Biobank and show that, in realistic settings, we can detect a larger set of significantly associated genes than using each panel separately. To improve applicability, we developed a summary result-based extension called S-MultiXcan, which we show yields highly concordant results with the individual level version when LD is well matched. Our multivariate model-based approach allowed us to use the individual level results as a gold standard to calibrate and develop a robust implementation of the summary-based extension. Results from our analysis as well as software and necessary resources to apply our method are publicly available.


94
To expand the applicability of our method to massive sample sizes and to studies where individual 95 level data are not available, we have extended our method to use GWAS summary results rather than 96 individual-level data. 97 We call this extension Summary-MulTiXcan (S-MulTiXcan). Here we derive an analytic expression 98 that relates the joint regression estimates (g j ) to the marginal regression estimates (γ j as obtained from 99 S-PrediXcan), assuming a known LD structure from a reference panel. We display a conceptual overview  To reduce false positives due to LD misspecification, we discard any significant association result 105 for a gene when the best single tissue result has p-value greater than 10 −4 . This is rather conservative 106 since it is possible that evidence with modest significance from weakly correlated tissues can lead to very 107 We examined the biological relevance of some of the genes detected by our new method that was missed 120 by looking at one tissue at a time (S-PrediXcan).

121
For example, in the Early Growth Genetics (EGG) Consortium's Body-Mass Index (BMI) study,

126
In the CARDIoGRAM+C4D Coronary Artery Disease (CAD) study, S-MulTiXcan detected 12 as-127 sociations not significant in S-PrediXcan. The top result was AS3MT (p-value=4.3 × 10 −9 ), related 128 to arsenic metabolism; interestingly, environmental and toxicological studies link arsenic exposure and (MulTiXcan) that aggregates information across multiple tissues and improves the identification of genes 140 significantly associated with complex traits. To expand the applicability of our approach we have extended 141 the method to accommodate GWAS studies where only summary results are available (S-MulTiXcan).

142
We show through applications to hundreds of traits the performance of both individual and summary 143 based methods. We also show that the summary based method provides a reasonably good approximation 144 to the individual level results.

145
As any method relying on a reference panel, S-MulTiXcan might be inaccurate when the study 146 population has a different Linkage Disequilibrium (LD) structure than the reference panel. For example, 147 should two models for a gene yield predicted expressions that are lowly correlated in the reference panel 148 but highly correlated in the study population, then this method underestimates their correlation. To avoid 149 this misspecification, a reference panel matching the study population should be used when available (i.e.

150
using East Asian population from 1000 Genomes if the study set is composed of East Asian individuals).

151
A limitation of PrediXcan and S-PrediXcan is LD contamination, i.e. when causal loci for the trait 152 and expression are different but in LD. We have addressed this in S-PrediXcan through an additional 153 colocalization filtering step. For MulTiXcan, this could be avoided by restricting the analysis to gene-154 tissue pairs with high colocalization probability.

155
Here we showed the advantages of our joint estimation method through application to multiple traits 156 with publicly available GWAS results as well as the ones available in the UK Biobank. These results 157 include many novel associations of interest, which we make publicly available to the research community 158 in http://gene2pheno.org.

159
Software And Resources 160 We make our software publicly available on a GitHub repository: https://github.com/hakyimlab/ 7 dictDB.org. A short working example can be found on the GitHub page; more extensive documentation 163 can be found on the project's wiki. The results of S-MulTiXcan applied to the 44 human tissues and a 164 broad set of phenotypes can be queried on http://gene2pheno.org.

167
Let us consider a GWAS study of n samples, and assume availability of prediction models in p different 168 tissues. Each model j is a collection of prediction weights w j i .

170
• y be an n-vector of phenotypes, assumed to be centered for convenience.

171
• X the genotype matrix, where each column X l is the n-vector genotype for SNP l. We assume it 172 coded in the range [0,2] but it can be defined in another range, or standardized.

173
•t j = i∈modelj w j i X i be the predicted expression in tissue j. This is the independent variable used 174 by single-tissue PrediXcan.

175
• t j be the standardization oft j to mean = 0 and standard deviation = 1.

176
In our application, we will consider p = 44 models for a given gene's expression, trained on GTEx 177 data. This method is easily extensible to support incorporation of other covariates, or correction by them.

178
MulTiXcan 179 MulTiXcan consists of fitting a linear regression of the phenotype on predicted expression from multiple 180 tissue models jointly: • Computation of single tissue association results with S-PrediXcan.

210
• Estimation of the correlation matrix of predicted gene expression for the models using the Linkage

215
• Estimation of joint effects from the univariate (single-tissue) results and expression correlation.

216
• Discarding suspicious results, suspect to be false positives arising from LD-structure mismatch.

218
To derive the multivariate regression (2) effect sizes and variances using the marginal regression (3) 219 estimates, we employ a technique presented in [33].

220
More specifically, we want to obtain the multivariate regression coefficient estimates for g j (2) using 221 the estimates from the marginal regression: where we assume y centered for convenience, and j is the marginal regression error term with variance 223 σ 2 (i.e. we assume a common variance σ 2 for all j).

224
First, notice that the solution to the multivariate regression in Eq (2) iŝ whereas the solution to the marginal regression in Eq (3) is: where γ is the vector of effect sizes γ j , and 1 is the p × p identity matrix. Please note that, since the t j 226 are standardized, then D = (n − 1)1 and se(γ j ) = var(γ j ) = σ √ n−1 .

227
From (6) we get T t y = Dγ, which we replace in (4) and obtain the relationship between marginal 228 and joint estimates: To compute the variance of the estimated effect sizes (5) we use the variance of the phenotype as a 230 conservative estimate of σ 2 e and LD information from reference samples as described next.

232
As the genotypes from most GWAS are typically unavailable, we must use a reference panel to compute 233 T t T, using only those SNPS available in the GWAS results. To do so, notice that: where Γ ij are the elements of the covariance matrix Γ = var (X) = (X −X) t (X −X)/(n − 1). We compute the variances as in the S-PrediXcan analysis: Addressing Singularity Of The Correlation Matrix

235
Given the high degree of correlation among many of the prediction models, T t T is close to singular 236 and its inverse cannot be reliably calculated for many genes. To address this problem, we compute the 237 pseudo-inverse via Singular Value Decomposition, decomposing the correlation matrix into its principal 238 components and removing those with small eigenvalues. In other terms, we will restrict the analysis to 239 axes of largest variation of the expression data. This is analogous to the principal components-based 240 approach used with individual level data. We denote with Σ + the pseudo-inverse for any matrix Σ. To quantify significance, we use the fact that the regression coefficient estimates follow a (approximate) multivariate normal distribution:ĝ ∼ N (g, σ 2 e (T t T) −1 ). Under the null hypothesis of no association, it follows thatĝ t T t T σ 2 eĝ ∼ χ 2 p . We can then replaceĝ with its estimate from the marginal regression: where Cor(T) is the autocorrelation of T, andẑ is the p-vector of marginal analysis z-scores, γ j /se(γ j ).

246
We have used σ 2 e ≈ σ 2 as an approximation (i.e. the residual variance of the marginal regression as 247 approximation of the residual variance of the joint regression). This simplification is conservative, and 248 based on our comparison to the individual multivariate results we consider the loss of efficiency acceptable.

249
In practice, we will use the SVD pseudo-inverse Cor(T) + as explained in the previous section, and a 250 χ 2 -test:ẑ t Cor(T) +ẑ ∼ χ 2 k , with k the number of components surviving the SVD pseudoinverse.

252
Prediction Models were obtained from PredictDB.org resource. These models were trained using Elastic  Figure 1. MulTiXcan method. Panel a illustrates MulTiXcan method. Predicted expression from all available tissue models are used as explanatory variables. To avoid multicolinearity, we use the first k Principal Components of the predicted expression. y is a vector of phenotypes for n individuals, t tissue j g is the standardized predicted gene expression for tissue j, g j is its effect size, a is an intercept and e is an error term. Panel b shows a schematic representation of MulTiXcan results compared to classical PrediXcan, both for a single relevant tissue and all available tissues in agnostic scanning. y is a vector of phenotypes for n individuals, t j is the standardized predicted gene expression for model j, g j is its effect size in the joint regression, γ j is its effect size in the marginal regression using only prediction j, e and j are error terms.  GTEx individuals were used as a reference panel for estimating expression correlation in the study population. The summary data-based method shows a good level of agreement with the individual-based method. In cases where the LD-structure between reference and study cohorts is mismatched, the summary-based method becomes less accurate. For example in Asthma, two genes are significantly overestimated; however it tends to be conservative for most genes. Panel c displays QQ-Plots for the association p-values from S-MulTiXcan and S-PrediXcan in Schizophrenia, using a model trained on brain's cerebellum, and S-PrediXcan associations for all 44 GTEx tissues. Panel d shows the number of significant associations in Schizophrenia for each method as a bar plot.