BS and MS conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, and wrote the paper.

¤a Current address: INRA Laboratoire de Génétique Cellulaire, Castanet-Tolosan, France

The authors have declared that no competing interests exist.

We introduce a new framework for the analysis of association studies, designed to allow untyped variants to be more effectively and directly tested for association with a phenotype. The idea is to combine knowledge of patterns of correlation among SNPs (e.g., from the International HapMap project or resequencing data in a candidate region of interest) with genotype data at tag SNPs collected on a phenotyped study sample, to estimate (“impute”) unmeasured genotypes, and then assess association between the phenotype and these estimated genotypes. Compared with standard single-SNP tests, this approach results in increased power to detect association, even in cases in which the causal variant is typed, with the greatest gain occurring when multiple causal variants are present. It also provides more interpretable explanations for observed associations, including assessing, for each SNP, the strength of the evidence that it (rather than another correlated SNP) is causal. Although we focus on association studies with a quantitative phenotype and a relatively restricted region (e.g., a candidate gene), the framework is applicable and computationally practical for whole genome association studies. Methods described here are implemented in a software package, Bim-Bam, available from the Stephens Lab website.

Although the development of cheap high-throughput genotyping assays has made large-scale association studies a reality, most ongoing association studies genotype only a small proportion of SNPs in the region of study (be that the whole genome, or a set of candidate regions). Because of correlation (linkage disequilibrium, LD) among nearby markers, many untyped SNPs in a region will be highly correlated with one or more nearby typed SNPs. Thus, intuitively, testing typed SNPs for association with a phenotype will also have some power to pick up associations between the phenotype and untyped SNPs. In practice, typical analyses involve testing each typed SNP individually, and in some cases combinations of typed SNPs jointly (e.g., haplotypes), for association with phenotype, and hoping that these tests will indirectly pick up associations due to untyped SNPs. Here, we present a framework for more directly and effectively interrogating untyped variation.

In outline, our approach improves on standard analyses by exploiting available information on LD among untyped and typed SNPs. Partial information on this is generally available from the International HapMap project [

These imputation-based methods can be viewed as a natural

Although our methods are developed in a Bayesian framework, they can also be used to compute

We focus on an association study design in which genotype data are available for a dense set of SNPs on a panel of individuals, and genotypes are available for a subset of these SNPs (which for convenience we refer to as “tag SNPs”) on a cohort of individuals who have been phenotyped for a univariate quantitative trait. We assume the cohort to be a random sample from the population, and consider application to other designs in the discussion.

Our strategy is to use patterns of LD in the panel, together with the tag SNP genotypes in the cohort, to explicitly predict the genotypes at all markers for members of the cohort, and then analyze the data

We now provide further details of our Bayesian regression approach. The literature on Bayesian regression methods is too large to review here, but papers particularly relevant to our work include [

For simplicity, we focus on the situation where cohort genotypes are known at all SNPs (tag and non-tag). Extension to the situation where the cohort is genotyped only at tag SNPs and other genotypes are imputed using sampling-based algorithms such as PHASE [

Let G denote the cohort genotypes for all _{1}, …, _{n}_{i}_{ij}_{j}_{i}_{i}

We assume a genetic model where the genetic effect is additive across SNPs (i.e., no interactions) and where the three possible genotypes at each SNP (major allele homozygote, heterozygote, and minor allele homozygote) have effects 0, _{j1},_{j}_{2}), which are, respectively, the SNP additive effect _{j}_{j} = a_{j}k_{j}
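As a rough illustration of this genetic model, the sketch below encodes each SNP as two design-matrix columns, one for the additive effect and one for the dominance effect. It assumes genotypes are coded as minor-allele counts (0, 1, 2) and that the three genotypes have effects 0, a(1 + k), and 2a, which is one parameterization consistent with d = ak. The function names `design_row` and `genetic_value` are hypothetical and are not taken from Bim-Bam.

```python
def design_row(genotypes):
    """Encode one individual's genotypes as [additive, dominance] column pairs.

    Assumption: each genotype is a minor-allele count in {0, 1, 2}.
    The additive column holds the allele count; the dominance column is 1
    for heterozygotes and 0 otherwise.
    """
    row = []
    for g in genotypes:
        if g not in (0, 1, 2):
            raise ValueError("genotypes must be coded as 0, 1 or 2")
        row.extend([g, 1 if g == 1 else 0])
    return row


def genetic_value(genotypes, additive, dominance_ratio):
    """Sum the per-SNP effects, with no interactions between SNPs.

    additive[j] is a_j and dominance_ratio[j] is k_j, so the dominance
    effect is d_j = a_j * k_j. Genotype effects are then 0, a_j + d_j, 2 * a_j.
    """
    row = design_row(genotypes)
    total = 0.0
    for j, (a, k) in enumerate(zip(additive, dominance_ratio)):
        d = a * k
        total += a * row[2 * j] + d * row[2 * j + 1]
    return total
```

For example, with a = 0.5 and k = 1 (minor allele fully dominant under this parameterization), heterozygotes and minor-allele homozygotes both receive a genetic value of 1.0.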

Prior specification is intrinsically subjective, and specifying priors that satisfy everyone is probably a hopeless goal. Our aim is to specify “useful” priors, which avoid some potential pitfalls (discussed below), facilitate computation, and have some appealing properties, while leaving some room for context-specific subjective input. In particular, we describe two priors below, which we refer to as priors D_{1} and D_{2}, that were developed based on the following considerations: (i) inference should not depend on the units in which the phenotype is measured; (ii) even if the phenotype is affected by SNPs in this region, the majority of SNPs will likely not be causal; (iii) for each causal variant there should be some allowance for deviations from additive effects (i.e., dominant/recessive effects) without entirely discarding additivity as a helpful parsimonious assumption; and (iv) computations should be sufficiently rapid to make application to genome-wide studies practical (this last consideration refers to prior D_{2}).

The parameters μ and τ relate to the mean and variance of the phenotype, which depend on units of measurement. It seems desirable that estimates (and, more generally, posterior distributions) of these parameters scale appropriately with the units of measurement, so, for example, multiplying all phenotypes by 1,000 should also multiply estimates of μ by 1,000. Motivated by this, for prior D_{1} we used Jeffreys' prior for these parameters:

This prior is well known to have the desired scaling properties in the simpler context where observed data are assumed to be _{1} also possesses these desired scaling properties in the more complex context considered here, although we have not proven this.

For prior D_{2} we used a slightly different prior, based on assuming a prior for (μ, τ) of the form

Specifically, our prior D_{2} assumes the limiting form of this prior as κ, λ → 0 and

Both prior distributions above are “improper” (meaning that the densities do not integrate to a finite value). Great care is necessary before using improper priors, particularly where one intends to compute BFs to compare models, as we do here. However, we believe results obtained using these priors are sensible. For prior _{2}_{1} we believe this to be true, although we have not proven it.

For brevity, we refer to SNPs that affect phenotype as QTNs, for quantitative trait nucleotides. Our prior on the SNP effects has two components: a prior on which SNPs are QTNs and a prior on the QTN effect sizes.

We assume that with some probability, _{0}, none of the SNPs is a QTN; that is, the “null model” of no genotype–phenotype association holds. Otherwise, with probability (1 − _{0}), we assume there are _{s}_{s}_{0} and

If SNP _{j}_{j} = a_{j}k_{j}_{j}_{j}_{j}_{a}_{a}

The parameter _{j} = a_{j}k_{j}_{j} =_{j} =

Prior _{1}_{j}_{j}_{k}

Prior _{2}: We assume that _{j}_{j}_{d}_{a}_{j}_{j}_{j}_{j}

Prior _{1} has the attractive property that the prior probability of overdominance is independent of the QTN additive effect _{j}_{j}_{j}_{1} by numerical methods, such as Laplace Approximation; e.g., [_{2} is more convenient, as, when combined with the priors on μ and τ in

For both priors D_{1} and D_{2} we assume effect parameters for different SNPs are, a priori, independent (given the other parameters).

The above priors include “hyperparameters,” _{0} and σ_{a}_{0} gives the prior probability that the region contains no QTNs. While choice of appropriate value is both subjective and context-specific, for candidate regions we suggest _{0} will typically fall in the range 10^{−2} to 0.5. If data on multiple regions are available, then it might be possible to estimate _{0} from the data, although we do not pursue this here. Instead, we mostly sidestep the issue of specifying _{0} by focusing on the BF (described below), which allows readers of an analysis to use their own value for _{0} when interpreting results.

In specifying the prior,

Finally, specification of the standard deviation of the effect size, σ_{a}_{a}_{a}_{a}_{a}_{a}_{a}_{a},_{a}

We focus on two key inferential problems: (i) _{j}_{j}

To measure the evidence for _{0} denotes the null hypothesis that none of the SNPs is a QTN (_{j} = d_{j} =_{1} denotes the complementary event (i.e., at least one SNP is a QTN). Computing the BF involves integrating out unknown parameters, as described in _{0} = 0.5, so association with genetic variation in the region is considered equally plausible, a priori, as no association) then a BF of 10 gives posterior odds of 10:1, or ∼91% probability of an association.
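The conversion from a BF to a posterior probability of association can be checked with a few lines of arithmetic. The sketch below is only an illustration of that calculation; the function name and the argument `prior_null` (the prior probability that the region contains no QTN) are hypothetical names chosen here.

```python
def posterior_prob_association(bf, prior_null):
    """Posterior probability of association given a Bayes factor.

    bf         : Bayes factor for association versus no association.
    prior_null : prior probability of no association (hypothetical name),
                 strictly between 0 and 1.
    """
    if not 0.0 < prior_null < 1.0:
        raise ValueError("prior_null must lie strictly between 0 and 1")
    if bf < 0.0:
        raise ValueError("a Bayes factor cannot be negative")
    prior_odds = (1.0 - prior_null) / prior_null  # odds in favour of association
    posterior_odds = bf * prior_odds
    return posterior_odds / (1.0 + posterior_odds)
```

With a prior probability of 0.5 for no association, a BF of 10 gives posterior odds of 10:1 and hence a probability of 10/11, about 0.91, matching the figure quoted above. A more sceptical reader can pass a larger `prior_null` to see how the same BF translates under their own prior.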

In the special case where we allow at most one QTN, _{j}_{j}
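For the at-most-one-QTN case, a minimal sketch of how single-SNP BFs might be combined is given below. It assumes that, a priori, each SNP is equally likely to be the QTN; under that assumption the regional BF is the average of the single-SNP BFs, and the posterior probability that a given SNP is the QTN, conditional on an association, is proportional to its own BF. The function names are hypothetical, and the single-SNP BFs are taken as given inputs.

```python
def regional_bf(single_snp_bfs):
    """Regional BF when at most one QTN is allowed.

    Assumption: a uniform prior over which SNP is the QTN, so the
    regional BF is the simple average of the single-SNP BFs.
    """
    if not single_snp_bfs:
        raise ValueError("need at least one single-SNP BF")
    return sum(single_snp_bfs) / len(single_snp_bfs)


def snp_posteriors(single_snp_bfs):
    """Posterior probability that each SNP is the QTN, given an association.

    Under the same uniform prior, these are the single-SNP BFs
    normalized to sum to one.
    """
    total = sum(single_snp_bfs)
    if total <= 0.0:
        raise ValueError("single-SNP BFs must have a positive sum")
    return [bf / total for bf in single_snp_bfs]
```

Averaging, rather than taking the maximum, means that several SNPs with moderately large BFs all contribute to the regional evidence.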

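From a Bayesian viewpoint, the BF provides a measure of the strength of the evidence for an association. However, a p-value for testing the null hypothesis can also be obtained from a BF through permutation. Specifically, one can compute the BF for the observed data, and for artificial data sets created by permuting observed phenotypes among cohort individuals, and obtain a p-value from the proportion of permuted data sets whose BF is at least as large as that of the observed data.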
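The permutation procedure can be sketched generically as follows. This is an illustration only: `statistic` stands for any function that returns a test statistic such as the BF, and because the BF computation itself is not reproduced here, the example uses a simple stand-in statistic (`abs_covariance`). Both names are hypothetical, and adding one to the numerator and denominator is a common convention that is assumed here rather than taken from the text.

```python
import random


def permutation_p_value(statistic, genotypes, phenotypes, n_perm=1000, seed=0):
    """Estimate a p-value by permuting phenotypes among individuals.

    statistic : callable(genotypes, phenotypes) -> float, larger = more evidence.
    The observed data are counted as one of the permutations (the "+1"
    convention), so the returned value is never exactly zero.
    """
    rng = random.Random(seed)
    observed = statistic(genotypes, phenotypes)
    shuffled = list(phenotypes)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-phenotype link
        if statistic(genotypes, shuffled) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (n_perm + 1)


def abs_covariance(x, y):
    """Stand-in statistic (not a BF): absolute unnormalized covariance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return abs(sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)))
```

In practice one would pass a function that computes the BF for the region in place of `abs_covariance`; the cost is then `n_perm` additional BF evaluations.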

To “explain” observed associations we compute posterior distributions for SNP effects (_{j}_{j}_{j}_{j}

We also argue that the imputation-based approach brings us closer to being able to interpret estimated effects for each SNP as actual

In the tagSNP design, observed genotypes G_{obs}_{1}, this involves adding a step in the MCMC scheme to sample the imputed genotypes from their posterior distribution given all the data; for prior _{2} it involves simply averaging relevant calculations over imputations. Details are given in

Methods described here are implemented in a software package, Bim-Bam (Bayesian IMputation-Based Association Mapping), available from the Stephens Lab website

We compared the power of our approach to other common approaches via simulation. We simulated genotype and phenotype data (with μ = 0 and τ = 1) for genetic regions of length 20 kb containing a single QTN, and genetic regions of length 80 kb containing four QTNs, as follows:

(1) Using a coalescent-based simulation program,

(2) Form genotypes for a “panel” of 100 individuals by randomly pairing 200 haplotypes, and a “cohort” of 200 individuals by randomly pairing the other 400 haplotypes.

(3) Select tag SNPs from the panel data using the approach of Carlson et al. [^{2} cutoff of 0.8. As in Carlson et al. [

(4) Select which SNPs are QTNs, and their effect sizes, and simulate phenotype data for each cohort individual according to
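Step (2) of the simulation procedure can be sketched as below. The code assumes that step (1) yields 600 haplotypes, as implied by the counts in step (2), each represented as a list of 0/1 alleles (1 for the minor allele); the function name `form_genotypes` and this data layout are hypothetical choices made for illustration.

```python
import random


def form_genotypes(haplotypes, n_panel=100, n_cohort=200, seed=0):
    """Randomly pair haplotypes into panel and cohort genotypes.

    haplotypes : list of equal-length lists of 0/1 alleles (assumed layout).
    Returns (panel, cohort), each a list of genotype vectors whose entries
    are minor-allele counts in {0, 1, 2}.
    """
    needed = 2 * (n_panel + n_cohort)
    if len(haplotypes) != needed:
        raise ValueError(f"expected {needed} haplotypes, got {len(haplotypes)}")
    rng = random.Random(seed)
    order = list(range(len(haplotypes)))
    rng.shuffle(order)  # random assignment of haplotypes to individuals

    def pair(indices):
        # Consecutive shuffled indices form one diploid individual.
        return [
            [a + b for a, b in zip(haplotypes[i], haplotypes[j])]
            for i, j in zip(indices[0::2], indices[1::2])
        ]

    panel = pair(order[: 2 * n_panel])
    cohort = pair(order[2 * n_panel:])
    return panel, cohort
```

Each haplotype is used exactly once, so the panel and cohort share no individuals, and phase information is discarded in the returned genotypes.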

We compared power of tests based on the BF (under prior _{2}, allowing at most one QTN, using

(1) Two tests based on _{min},

(2) A test based on _{reg},

(3) A test based on BF_{max}

For each test, we analyzed each dataset in two ways: as if data had been collected using (i) a “resequencing design” (i.e., all individuals were completely resequenced, so genotype data are available at all SNPs in all individuals); and (ii) a “tag SNP design” (i.e., in panel individuals genotype data are available at all SNPs, but in cohort individuals genotype data are available at tag SNPs only). For the tag SNP design, we assumed haplotypic phase is known in the panel (as it is, mostly, for the HapMap data for example), but not in the cohort; however our approach can also deal with unknown phase in the panel. For _{reg}_{min}_{max},_{reg},

(A) single common variant, modest dominance; (B) single common variant, strong dominance for minor allele; (C) single rare variant, no dominance; (D) multiple common variants.

Each colored line shows power of test varying with significance threshold (type I error). Black: BF from our method (prior _{2}); Green: _{min}_{min}_{reg},_{max}

Comparing _{min}_{reg}_{min}_{reg}_{min}_{reg}_{min}_{reg}_{reg}_{reg}

Turning now to our approach, except for Scenario (B) in the tag SNP design, the test based on the BF is as powerful or more powerful than the other tests. Thus, unlike _{reg}_{min},

In Scenario (D), which involved multiple QTNs, tests based on the BF clearly outperformed other tests considered, even though the BF was computed allowing at most one QTN. Our explanation is that the BF, being the _{max},

A second, and perhaps more surprising, situation where the BF outperforms other methods is when all SNPs are typed and tested (i.e., Scenario (A), resequencing design). Here, in contrast to Scenario (D), BF_{max}_{0}, and so ^{−5} in a study involving few individuals may be less impressive than the same

An important feature of

Panels show: (a) errors in the estimates (posterior means) of the heterozygote effect (

In contrast, when the causal variant is rare, there is a noticeable drop in power for the tag SNP design versus the resequencing design, and the BFs, posterior probabilities, and effect size estimates under the two designs often differ substantially (unpublished data). This may seem slightly disappointing: one might have hoped that, even with tag SNPs chosen to capture common variants, they might also capture some rare variants. Indeed, this can happen: in some simulated data sets the rare causal variant was clearly identified by our approach, presumably because it was highly correlated with a particular haplotype background, and could thus be accurately predicted by tag SNPs. However, this occurred relatively rarely (just a few simulations out of 100).

We wondered whether a different tagging strategy, aimed at capturing rare variants, might improve performance when the causal variant is rare. The development of such strategies lies outside the scope of this paper, but, to assess potential gains that ^{2}-based tag SNP selection, it remained substantially lower than in the resequencing design, where the causal variant is typed.

Solid line: Resequencing design; dashed line: tag SNP design, with tags selected using method from [

We also wondered whether a different approach to impute missing genotypes (in the cohort at non-tag SNPs) might improve performance. For results above, we used the software fastPHASE [

In summary, imputation-based methods appear to increase power of the tag SNP design to detect rare variants, but nevertheless remain notably less powerful than BFs based on the complete resequencing data.

Priors _{1} and _{2} differ in their assumed correlation between the dominance effect (_{1} the prior probability of overdominance is independent of _{2} overdominance is more likely for small _{1} is perhaps more sensible than _{2}; however, _{2} is computationally much simpler. To examine the effects of these priors on inference, we compared (i) the BF and (ii) the posterior probability assigned to the actual causal variant under each prior for the datasets from Scenarios (A) and (B). Results agreed quite closely (_{2} provides a reasonable approximation to prior _{1} in the scenarios considered. This is important, since prior _{2} is computationally practical for computing BFs for very large datasets (e.g., genome-wide association studies with hundreds of thousands of SNPs), for which sampling posterior distributions of parameters using an MCMC scheme would be computationally daunting.

The solid yellow line corresponds to

Results shown are for all datasets for the common variant Scenarios (A) and (B) and for both the resequencing design and the tag SNP design. The discrepancy between the larger estimated BFs is caused by the fact that we used insufficient MCMC iterations to accurately estimate very large BFs (>10^{6}) under prior D_{1}.

When analyzing a candidate region, one would ideally like not only to detect any association, but also to identify the causal variants (QTNs). Since a candidate region could contain multiple QTNs, we implemented an MCMC scheme (using prior D_{1}) to fit multi-QTN models where the number of QTNs is estimated from the data; here, we consider a multi-QTN model with equal prior probabilities on 1, 2, 3, or 4 QTNs. (A similar MCMC scheme could also be implemented for prior D_{2}, and could exploit the analytical advantages of this prior to reduce computation. Indeed, for regions containing a modest number of SNPs it would be possible to examine all subsets of SNPs, and entirely avoid MCMC.)

We compare this multi-QTN model with a one-QTN model on a dataset simulated with four QTNs (Scenario (D)). The estimated BF for the one-QTN model was ∼6,000, while for the multi-QTN model it was >10^{5} (we did not perform sufficient iterations to estimate how much bigger than 10^{5}). Thus, if a region contains multiple causal variants, then allowing for this possibility may provide substantially higher BFs.

The figure shows, for each SNP in a dataset simulated under Scenario (D), the estimated posterior probability that it is a QTN, conditional on an association being observed. Left: Results from one-QTN model. Right: Results from multi-QTN model allowing up to four QTNs. The four actual QTNs are indicated with a star. Colors of the vertical lines indicate tag SNP “bins” (i.e., groups of SNPs tagged by the same variant).

We applied our method to data from association studies involving the

We first estimated haplotypes in 64 parents using the trio option in PHASE [_{1} and _{2}.

BFs for priors D_{1} and D_{2} were, respectively, 3.15 and 2.33, and the corresponding

Among the 15 SNPs analyzed, snp7 was assigned the highest posterior probability of being a QTN (

Left panel shows the posterior probability assigned to each SNP being a QTN, with filled triangles denoting tag SNPs and open circles denoting non-tag SNPs. The right panel shows (in gray) estimated posterior densities of the additive effect for each of the seven SNPs assigned the highest posterior probabilities of non-zero effect (representing 90% of the posterior mass). The average of these curves is shown in black.

In summary, these data provide modest evidence of association between

We described a new approach for analysis of association studies, with two important components: (i) it uses imputation of unknown genotypes, based on statistical modeling of patterns of LD, to allow untyped SNPs to be directly assessed for association with phenotype; (ii) it uses BFs, rather than

The idea of trying to find associations between phenotypes and untyped variants is old, and underlies many existing methods for assessing association. In some cases this aim is implicit (e.g., testing for association between haplotypes and phenotypes can be thought of as an attempt to indirectly test untyped variants that may lie on a particular haplotype background), and in others it is explicit (to give just one example, Zöllner and Pritchard [

While several papers have suggested Bayesian approaches to association studies (e.g., [_{2} (see

Perhaps the most important _{a}, σ_{d}

Choice of _{a}, σ_{d}_{2}) _{a}_{d} = σ_{a}_{a}_{d} = σ_{a}

Regarding the normality assumption, following a suggestion by Mathew Barber (personal communication), in practical applications, we are currently applying a normal quantile transform to phenotypes (replacing the
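A normal quantile transform of the kind mentioned here can be sketched with the Python standard library. The rank offset used below, (rank − 0.5)/n, and the handling of ties (broken by original order) are assumptions made for this illustration; other offsets are also in common use.

```python
from statistics import NormalDist


def normal_quantile_transform(phenotypes):
    """Replace each phenotype by a standard normal quantile based on its rank.

    Assumptions: the value with rank r (1 = smallest) among n observations is
    mapped to the (r - 0.5) / n quantile of N(0, 1); ties are broken by
    original order because Python's sort is stable.
    """
    n = len(phenotypes)
    order = sorted(range(n), key=lambda i: phenotypes[i])
    std_normal = NormalDist()  # mean 0, standard deviation 1
    transformed = [0.0] * n
    for rank, i in enumerate(order, start=1):
        transformed[i] = std_normal.inv_cdf((rank - 0.5) / n)
    return transformed
```

The transform preserves the ordering of individuals while removing the influence of outlying phenotype values, which is the motivation for applying it before a regression that assumes normal residuals.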

Throughout this paper, we have assumed a “population” sampling design in which phenotype and genotype data are available on a random sample from a population, and perform analyses conditional on the observed genotype data. An alternative common design involves collecting genotypes only on individuals whose phenotypes lie in the tails of the distribution [

The National Center for Biotechnology Information (NCBI) Entrez (

We thank D. Goldstein for access to the

¤b Current address: Departments of Statistics and Human Genetics, University of Chicago, Chicago, Illinois, United States of America

BF, Bayes factor

df, degree of freedom

IBD, identity by descent

LD, linkage disequilibrium

MCMC, Markov Chain Monte Carlo

QTN, quantitative trait nucleotide