Cross-GWAS coherence test at the gene and pathway level

Proximal genetic variants are frequently correlated, implying that the corresponding effect sizes detected by genome-wide association studies (GWAS) are also not independent. Methods already exist to account for this when aggregating effects from a single GWAS across genes or pathways. Here we present a rigorous yet fast method for detecting genes with coherent association signals for two traits, facilitating cross-GWAS analyses. To this end, we devised a new significance test for the covariance of datapoints not drawn independently but with a known inter-sample covariance structure. We show that the distribution of its test statistic is a linear combination of χ2 distributions with positive and negative coefficients. The corresponding cumulative distribution function can be efficiently calculated with Davies’ algorithm at high precision. We apply this general framework to test for dependence between SNP-wise effect sizes of two GWAS at the gene level. We extend this test to detect also gene-wise causal links. We demonstrate the utility of our method by uncovering potential shared genetic links between the severity of COVID-19 and (1) being prescribed class M05B medication (drugs affecting bone structure and mineralization), (2) rheumatoid arthritis, (3) vitamin D (25OHD), and (4) serum calcium concentrations. Our method detects a potential role played by chemokine receptor genes linked to TH1 versus TH2 immune response, a gene related to integrin beta-1 cell surface expression, and other genes potentially impacting the severity of COVID-19. Our approach will be useful for similar analyses involving datapoints with known auto-correlation structures.

We deduce that we can still factorize the moment generating function, such that xy ∼ σ x σ y 2 χ 2 with χ 2 1 (c) the non-central χ 2 distribution with one degree of freedom. Hence, the general product-normal can be expressed as a linear combination of non-central χ 2 1 distributions, for ϱ = 0.

PROBABILITY DENSITY FUNCTIONS
a. Product-Normal: Making use of the relation (2) for standardized variables and expressed in terms of gamma distributions, the pdf, denoted as f , of the product-normal distribution can be calculated analytically via convolution (cf., [2])  Completing the square and invoking a hyperbolic substitution, we arrive at with K 0 the modified Bessel function of the second kind at zero order. The result above for f agrees with the previous derivations of [3,4]. Note that the analytic calculation of the corresponding cdf requires the solution of an integral of the type ∞ x dt e at K 0 (t) . We are not aware of a known closed-form solution.
For illustration, we plot the pdf for ϱ = 1/2 together with the corresponding histogram sampled from (2) expressed as a difference of gamma distributions in Fig 1. We also show the cdf obtained via numerical integration of (3). We verified that the numerical integration matches the results obtained via Davies' algorithm for the cdf calculation for various ϱ, cf., Table 1.
Note that since the pdf of the non-central χ 2 distribution includes a Bessel function, analytic calculation of the pdf of xy in the more general case of section 1 is more complicated than in (3), and will not be discussed here.
b. Variance-Gamma: Consider a random variable X distributed according to The corresponding pdf can be calculated similarly as above via convolution. We infer dy (y 2 − xy) h/2−1 e − 2y g . Completing the square and using as before hyperbolic substitution, we arrive at with K n a modified Bessel function of second kind at order n.
Hence, due to symmetry the pdf is well normalized. However, we are not aware of closed-form solutions for Bessel function integrals of the type ∞ x dt t ν K ν (t), which are needed to provide a closed-form expression for the cdf.
The distribution given by (5) occurred before in the finance domain as a special case of the variance-gamma distribution [5]. It can be traced back further to the distribution of the bivariate correlation. In detail, the gamma-variance distribution corresponds to the offdiagonal marginal of a two-dimensional Wishart distribution, which models the covariance matrix [6]. However, to the best of our knowledge, what is new is the expression in terms of the difference distribution in equation (4).
For h = n and g = 1, we have Hence, for n = 1, we obtain the product-normal distribution with ϱ = 0 as discussed in the previous section.
For general n, we can view V G(n, 1) as the distribution of a sum of n independently distributed product-normal random variables. In particular, we can use Davies' algorithm to calculate the cdf for V G(n, 1) exactly at the desired precision.

SIMULATIONS: N = 2
The importance of correcting for the inter-dependence between the elements of w and z in the index I = i w i z i introduced in the main text can be seen easily in the N = 2 case. Consider the covariance matrices, such that w ∼ N (0, Σ) and z ∼ N (0, Σ), and with r varying. In Fig 2 we show various significance threshold curves of I for varying r, as calculated from the to I corresponding distribution expressed as a linear combination of χ 2 distributions, cf., the main text. Clearly, for increasing inter-element correlation, the significance threshold level rises. The magnitude of the effect increases with the desired level of significance.

ACCOUNTING FOR SAMPLE OVERLAP
For GWAS, the populations used to investigate two traits for co-significant signals might overlap. In particular, it may even be that the GWAS for the two traits have been conducted for the same population (full overlap). If there is in addition a strong phenotypic correlation between the traits, a danger for a significant biased result is present, as long as one does not correct for this effect induced by the sample overlap. Let us consider for simplicity first the case of full population sample overlap. For same population (and so genotype matrix X), and for two different traits y (1) and y (2) , we have as multi-variate models with true effect sizes α (i) and residuals ϵ (i) ∼ N (0, 1).
We can assume that the residuals are independent for independent populations but generally ζ ̸ = 0 for overlapping populations. This implies Recall from the main text that the z-scored null effect estimates read (as vector over some region) with Σ := 1 n x T x and x i columns of X. It follows that Using block Gaussian elimination, one can show that Hence, We conclude that Consider now a matrix U such that Σ = U ΛU T with Λ the diagonal matrix of eigenvalues. Clearly, U U T = 1. We then have with w, v ∼ N (0, 1). The correlation between the new variables w and v can be calculated to be given by It follows that a. Coherence test Recall from the main text (Methods, Product-Normal distribution) that the product distribution for correlated N (0, 1) variables with coefficient ϱ can be expressed as We conclude that for the corrected null distribution with identical populations for the two traits (16) The coefficient ζ is given by the phenotypic correlation. As one would expect, for identical traits (ζ = 1), the above turns into the standard χ 2 test of Pascal [7]. For ζ = 0, we end up with the formula (2.2) of the main text.
The generalization to partial sample overlap is straightforward. We can assume that the samples are sorted such that the overlapping samples form the first n o rows, followed by the n and the z-scored null effect estimates with the first n o rows of x (1) identical to x (2) .
We assume that the population covariance matrix Σ is well approximated by the sub-populations of both GWAS, i.e., For the off-diagonal parts, we have under the assumption that the overlapping subpopulation also approximates the population covariance structure well. It follows that the sample overlap calculation proceeds as for the full overlap case discussed above, but with with n (k) = n o + n (k) i the individual GWAS sample sizes. Note that for n o sufficiently small the correction factor vanishes, as it should be.
The calculation of ζ requires knowledge of the phenotypic correlation. However, often only public GWAS summary statistics are at hand. Fortunately, in this case, one can use LD score regression [8] to obtain an estimation, as the intercept term thereof precisely corresponds to the correction factor (19).
In order to illustrate the impact of such a non-zero correction factor, we consider a multivariate normal distribution with correlation matrix Σ of dimension 100. Σ is taken to be glued together from two 50-dimensional correlation matrices with off-diagonal elements set identical to 0.2 and two off-diagonal block matrices with elements of 0.2 multiplied by various ζ. This setup simulates (9).
We calculate p-values for the index (z (1) ) T z (2) given in equation (16) for 1000 random samples with and without correction factor ζ. Corresponding QQ-plots are shown in Fig 3 and Fig 4. We observe that inclusion of ζ indeed removes the bias.
b. Ratio test The index of the ratio test reads (see Results, section Ratio test of the main text) with z (i) defined as above. (Similarly for interchanged indices.) We have We can linearly transform variables such that Under the re-definition u := v − r w we arrive at Pr(R ≤ r) = Pr(w T Λu ≤ 0) = Fwū(0) , as in the main text. However, because of the nonvanishing correlation (15), we now have insteadw ∼ N (0, Λ) andū ∼ N (0, (1 + r 2 − 2ζr)Λ). In particular, With Var(w i ) = 1 and Var(u j ) = (1+r 2 −2ζr) we obtain for the corresponding correlation Therefore, we deduce similar as in the main text that   Similar to the main text, a consistency check can be performed against the Cauchy distribution. We know that for two normal distributed random variables x ∼ N (0, σ 2 x ) and y ∼ N (0, σ 2 y ) with correlation ζ, with C the general Cauchy distribution. The corresponding cumulative distribution function for our case reads Explicit evaluation shows agreement with the χ 2 based calculation via equation (22). For partial sample overlap, the previous discussion for the coherence test applies one to one to the ratio test, and therefore the phenotypic correlation ζ has to be adjusted as well according to equation (19). As for the coherence, we calculate p-values for the index R defined in equation (20) for 1000 random samples with and without correction factor ζ. Corresponding QQ-plots are shown in Fig  5 and Fig 6. We observe that the correction indeed improves the calibration of p-values.