Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies

Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well-suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes, covariate adjustment, and can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500 K SNPs, a real genome-wide data set of 300 K SNPs, and a sequence-based data set, each of which can be analysed in a few hours on a desktop workstation.


The Likelihood
The log-likelihood and its first and second derivatives are given by

$$l(\beta) = -\sum_{i=1}^{n} \log\left(1 + e^{-\eta_i}\right), \tag{6}$$

$$\frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{n} \frac{y_i x_{ij}}{1 + e^{\eta_i}}, \qquad \frac{\partial^2 l}{\partial \beta_j^2} = -\sum_{i=1}^{n} \frac{x_{ij}^2\, e^{\eta_i}}{\left(1 + e^{\eta_i}\right)^2}, \tag{7}$$

where $\eta_i = y_i \left( \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} \right)$ and $y_i \in \{-1, 1\}$ denotes case/control status. If $\beta_j = 0$ it will remain there if the derivative of the log-posterior at the origin is negative in $|\beta_j|$, which occurs if

$$\left| \frac{\partial l}{\partial \beta_j} \right| < f'(\beta_j = 0^+). \tag{8}$$

Writing $p_i = 1/(1 + e^{\eta_i})$, $u = \min_i p_i$ and $v = \max_i p_i$, the derivative is bounded above and below as follows:

$$\sum_i y_i x_{ij} \left[ u\, I(y_i x_{ij} > 0) + v\, I(y_i x_{ij} < 0) \right] \;\le\; \frac{\partial l}{\partial \beta_j} \;\le\; \sum_i y_i x_{ij} \left[ v\, I(y_i x_{ij} > 0) + u\, I(y_i x_{ij} < 0) \right],$$

where $I(E)$ equals one when $E$ is true and zero otherwise. Thus the derivative of the log-likelihood only needs to be calculated when the absolute value of either bound exceeds $f'(\beta_j = 0^+)$.
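These quantities can be sketched numerically. The following is an illustrative stand-in only, not the authors' implementation: the function names are ours, the data are simulated, and `f_prime_0` is an arbitrary placeholder for the prior's derivative at the origin.

```python
import numpy as np

# Logistic log-likelihood with y_i in {-1, +1} and
# eta_i = y_i * (beta0 + sum_j beta_j * x_ij).
def log_likelihood(beta0, beta, X, y):
    eta = y * (beta0 + X @ beta)
    return -np.sum(np.log1p(np.exp(-eta)))

# First derivative of the log-likelihood with respect to beta_j:
# sum_i y_i * x_ij / (1 + exp(eta_i)).
def score_j(beta0, beta, X, y, j):
    eta = y * (beta0 + X @ beta)
    return np.sum(y * X[:, j] / (1.0 + np.exp(eta)))

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 5)).astype(float)  # genotypes coded 0/1/2
y = rng.choice([-1.0, 1.0], size=200)                # case/control labels

# beta_j stays at the origin when the score there is smaller in absolute
# value than f'(0+), the derivative of the penalty at the origin
# (an arbitrary illustrative value here).
f_prime_0 = 25.0
beta = np.zeros(5)
stays = [abs(score_j(0.0, beta, X, y, j)) < f_prime_0 for j in range(5)]
```

At $\beta = 0$ every $\eta_i = 0$, so the log-likelihood reduces to $-n \log 2$ and each score to $\tfrac{1}{2}\sum_i y_i x_{ij}$, which gives a quick sanity check on the formulas.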
The bounds must be updated every time β changes. Implementing this bound sped up the code by approximately a factor of 60.
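A minimal numerical check of the screening idea (our own sketch, with illustrative names and an arbitrary threshold): because $p_i = 1/(1 + e^{\eta_i})$ does not depend on $j$, per-SNP bounds on the derivative need only the precomputed positive and negative parts of $y_i x_{ij}$, so the exact derivative is computed only where the bounds fail to rule out crossing $f'(0^+)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 20
X = rng.integers(0, 3, size=(n, m)).astype(float)
y = rng.choice([-1.0, 1.0], size=n)
eta = rng.normal(size=n)            # current linear predictors (arbitrary here)

p = 1.0 / (1.0 + np.exp(eta))       # p_i depends on eta only, not on j
u, v = p.min(), p.max()

Z = y[:, None] * X                  # z_ij = y_i * x_ij
pos = np.where(Z > 0, Z, 0.0).sum(axis=0)   # sum of positive z_ij per SNP
neg = np.where(Z < 0, Z, 0.0).sum(axis=0)   # sum of negative z_ij per SNP
lower = u * pos + v * neg           # lower bound on the score of each SNP
upper = v * pos + u * neg           # upper bound on the score of each SNP

score = Z.T @ p                     # exact derivative for every SNP

# The exact score is needed only where the bounds do not rule out
# crossing the threshold f'(0+) (arbitrary illustrative value):
f_prime_0 = 30.0
needs_exact = np.maximum(np.abs(lower), np.abs(upper)) > f_prime_0
```

The per-SNP sums `pos` and `neg` are fixed once, so after each change to β only `u` and `v` need recomputing, which is the source of the speed-up.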

Derivation of formula for type-I error probability
We can choose prior parameters to control the type-I error by assuming asymptotic normality of the likelihood function. The asymptotic null distribution of an MLE is [2]

$$\hat{\beta} \sim N\!\left(\beta,\; I(\beta)^{-1}\right), \tag{9}$$

where $I(\beta)$ denotes the Fisher information. By evaluating (7) at the null β = 0, assigning $\beta_0 = \log(n_1/n_0)$, and with standardised genotype data, (9) can be expressed as

$$\hat{\beta}_j \sim N\!\left(0,\; \frac{n_0 + n_1}{n_0 n_1}\right). \tag{10}$$
Differentiating the approximate log-likelihood defined by (10) and substituting into (8) gives

$$|\hat{\beta}_j|\, \frac{n_0 n_1}{n_0 + n_1} < f'(\beta_j = 0^+),$$

and thus $\beta_j$ will remain at the origin if

$$|\hat{\beta}_j| < \frac{n_0 + n_1}{n_0 n_1}\, f'(\beta_j = 0^+). \tag{11}$$

For equal numbers of cases and controls this simplifies to $|\hat{\beta}_j| < 4 f'(\beta_j = 0^+)/(n_0 + n_1)$. For a per-SNP type-I error rate of α we require the probability of (11) to be 1 − α at $\beta_j = 0$.
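Under these asymptotics the prior's derivative at the origin that yields a target per-SNP type-I error rate can be computed directly. The sketch below is our own (using SciPy's standard-normal `ppf` and `cdf`); it also checks that the resulting inclusion probability under the null recovers α.

```python
import numpy as np
from scipy.stats import norm

# Per-SNP type-I error alpha fixes the penalty's derivative at the origin
# via the asymptotic null beta_hat_j ~ N(0, (n0 + n1)/(n0 * n1)).
def f_prime_origin(alpha, n0, n1):
    return norm.ppf(1.0 - alpha / 2.0) * np.sqrt(n0 * n1 / (n0 + n1))

n0 = n1 = 1000
alpha = 1e-4
fp = f_prime_origin(alpha, n0, n1)

# beta_j stays at the origin iff |beta_hat_j| < f'(0+) * (n0 + n1)/(n0 * n1);
# with n0 = n1 this threshold equals 4 * f'(0+)/(n0 + n1).
thresh = fp * (n0 + n1) / (n0 * n1)
sd = np.sqrt((n0 + n1) / (n0 * n1))   # null standard deviation of beta_hat_j

# Probability of inclusion under the null recovers alpha:
p_include = 2.0 * norm.cdf(-thresh / sd)
```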
From the distribution of $\hat{\beta}_j$ given in (10) this implies

$$f'(\beta_j = 0^+) = \Phi^{-1}(1 - \alpha/2)\, \sqrt{\frac{n_0 n_1}{n_0 + n_1}}. \tag{12}$$


Behaviour when β_k ≠ 0

When one or more SNPs are included in the model, i.e. $\beta_k \neq 0$ for some $k \neq j$, the null distribution of $\hat{\beta}_j$ in (9) becomes

$$\hat{\beta}_j \sim N\!\left(0,\; \Big(\sum_i w_i x_{ij}^2\Big)^{-1}\right), \qquad w_i = \hat{p}_i (1 - \hat{p}_i),$$

where $\hat{p}_i$ is the fitted probability for individual $i$, and from (8) the criterion for inclusion becomes

$$|\hat{\beta}_j| \sum_i w_i x_{ij}^2 > f'(\beta_j = 0^+),$$

giving a probability of inclusion of

$$2\Phi\!\left(-\frac{f'(\beta_j = 0^+)}{\sqrt{\sum_i w_i x_{ij}^2}}\right).$$

Substituting in the value of $f'(\beta_j = 0^+)$ given in (12) and assuming equal numbers of cases and controls this can be expressed as

$$2\Phi\!\left(-\Phi^{-1}(1 - \alpha/2)\, \sqrt{\frac{n/4}{\sum_i w_i x_{ij}^2}}\right) < \alpha,$$

since the numerator in the square root is greater than the denominator ($w_i \le 1/4$ and $\sum_i x_{ij}^2 = n$ for standardised genotypes). Thus the test is now conservative. As more SNPs are included in the model the fit improves, the log-likelihood (6) increases, the fitted probabilities $\hat{p}_i$ move closer to 0 or 1, and the test becomes increasingly conservative. This establishes that inclusion of true positives reduces the false-positive rate below that expected under the global null.
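The conservativeness argument can be checked numerically. The sketch below is our own, with the fitted probabilities set by hand rather than estimated: it evaluates the inclusion probability at $w_i = 1/4$ (no SNPs in the model) and at fitted probabilities nearer 0 or 1 (a better-fitting model).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, alpha = 2000, 1e-3
z = norm.ppf(1.0 - alpha / 2.0)

x = rng.integers(0, 3, size=n).astype(float)
x = (x - x.mean()) / x.std()          # standardised: sum_i x_i^2 = n

# Inclusion probability 2 * Phi(-z * sqrt((n/4) / sum_i w_i x_i^2))
# for a given vector of fitted probabilities.
def p_incl(fitted_p):
    w = fitted_p * (1.0 - fitted_p)   # w_i = p_i (1 - p_i) <= 1/4
    info = np.sum(w * x**2)
    return 2.0 * norm.cdf(-z * np.sqrt((n / 4.0) / info))

p_null = p_incl(np.full(n, 0.5))      # no SNPs in model: w_i = 1/4, recovers alpha
p_fit = p_incl(np.full(n, 0.9))       # better fit: probabilities nearer 0/1
```

As the fitted probabilities move toward 0 or 1 the information sum shrinks below $n/4$ and `p_fit` falls below `p_null`, matching the argument above.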
Equivalence with the Armitage Trend Test (ATT)

The ATT statistic is $n$ times the squared correlation between phenotype and genotype. For a univariate hypothesis this can be simplified to declaring SNP $j$ significant when

$$\frac{1}{s_j} \left| \sum_i y_i \left( x_{ij} - \bar{x}_j \right) \right| > \kappa \tag{14}$$

for a constant κ determined by the significance level, where $\bar{x}_j$ and $s_j$ denote the mean and standard deviation of the genotypes at SNP $j$.
From the asymptotic distribution given in (9), and assuming $\eta_i = \eta$ is constant for all individuals (no other covariate effects), the derivative of the log-likelihood at the origin can be expressed as

$$\frac{\partial l}{\partial \beta_j}\bigg|_{\beta_j = 0} = \frac{1}{1 + e^{\eta}} \sum_i y_i x_{ij}.$$

Since the terms involving η are constant, the condition (8) on $\beta_j$ remaining at the origin can be expressed as

$$\left| \sum_i y_i x_{ij} \right| < \kappa$$

for some constant κ that controls the type-I error at the desired rate.
Returning to variable selection via shrinkage priors: for normalised data and η constant, the criterion (11) for $\beta_j$ remaining at the origin can be expressed as

$$\left| \sum_i y_i x'_{ij} \right| < \kappa, \tag{15}$$

where κ is determined by the derivative of the prior at the origin; all priors with the same derivative at the origin will have the same type-I error, which can be controlled at the desired rate by appropriate choice of prior parameters. Let $\hat{\beta}'_j, x'_{ij}$ and $\hat{\beta}_j, x_{ij}$ denote the MLE and covariates for normalised and unnormalised data respectively, so that

$$x'_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad \bar{x}_j = \frac{1}{n} \sum_i x_{ij}, \qquad s_j^2 = \frac{1}{n} \sum_i \left( x_{ij} - \bar{x}_j \right)^2.$$

Rewriting (15) in terms of unnormalised data we get (14); thus the ATT and univariate variable selection using shrinkage priors, starting the search at the origin, are equivalent when our asymptotic assumptions hold.
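A small numerical check of the claimed equivalence (our own sketch; it uses the standard identity that, for balanced case-control data, the trend statistic is $n$ times the squared genotype-phenotype correlation): the trend statistic and the univariate shrinkage criterion rank SNPs identically, so they select the same SNPs at matched thresholds.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 400, 10
y = np.repeat([-1.0, 1.0], n // 2)            # equal cases and controls
X = rng.integers(0, 3, size=(n, m)).astype(float)
Xn = (X - X.mean(axis=0)) / X.std(axis=0)     # normalised genotypes x'_ij

shrink_stat = np.abs(Xn.T @ y)                # |sum_i y_i x'_ij|
r = (Xn.T @ y) / n                            # correlation (y has mean 0, var 1)
trend_stat = n * r**2                         # n * r^2 trend statistic

# Both statistics order the SNPs identically.
same_order = np.array_equal(np.argsort(shrink_stat), np.argsort(trend_stat))
```

Because $n r^2 = \left(\sum_i y_i x'_{ij}\right)^2 / n$, the trend statistic is a strictly increasing function of the shrinkage statistic, which is what makes the two decision rules equivalent.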