An Evolutionary Framework for Association Testing in Resequencing Studies

Sequencing technologies are becoming cheap enough to apply to large numbers of study participants and promise to provide new insights into human phenotypes by bringing to light rare and previously unknown genetic variants. We develop a new framework for the analysis of sequence data that incorporates all of the major features of previously proposed approaches, including those focused on allele counts and allele burden, but is both more general and more powerful. We harness population genetic theory to provide prior information on effect sizes and to create a pooling strategy for information from rare variants. Our method, EMMPAT (Evolutionary Mixed Model for Pooled Association Testing), generates a single test per gene (substantially reducing multiple testing concerns), facilitates graphical summaries, and improves the interpretation of results by allowing calculation of attributable variance. Simulations show that, relative to previously used approaches, our method increases the power to detect genes that affect phenotype when natural selection has kept alleles with large effect sizes rare. We demonstrate our approach on a population-based re-sequencing study of association between serum triglycerides and variation in ANGPTL4.


Relationship to Previous Methods in the Evolution of Quantitative Traits
The main text uses a simplified development based on fairly strong assumptions: directional selection, and independent pleiotropic effects. Several alternatives are available; these alternatives also illustrate the potential violation of the linearity assumption. Previous work which describes the evolution of quantitative traits starts with phenotype effects and derives conditional fitness effects [1]. For example, using the stabilizing selection approximation of a gaussian penalty for square effect size (nor-optimal approximation): The square penalty has theoretical basis (suitably differentiable functions are locally quadratic around their optima), but is largely chosen for mathematical tractability. Under this assumption, conditioning on observed frequency yields and where LR(± √ s j ) is the likelihood ratio for γ j equal to positive and minus √ s j using the current parameters of the distribution of γ. This scheme can be plugged into our framework using the simulated distribution of s j |f j , albeit with somewhat more difficulty in optimization. Of note, if γ are all the same sign for a particular gene under polygenic stabilizing selection, then this reduces to This model is quite similar to that proposed in the main paper; differentiating between them will require empirical testing on large datasets.
Loosening the assumption to a stabilizing selection with pleiotropic effects of unknown joint distribu-tion with phenotype effects and only specifying conditional moments is also possible. Something similar to our model arises by giving s "at birth" the form of a correlated pleiotropic effect with u independent of z and γ, that is, a random piece of the phenotype effect becomes a fitness effect along with a random impact not mediated by the phenotype. While it seems straightforward to proceed as above with iterated conditional expectations of s j and a Taylor expansion of E √ s j , conditioning on the observed frequency removes the independence and complicates the development. Going further will require either making additional assumptions or use of more advanced semiparametric techniques, which we leave for future work. Special attention must be paid to the boundary conditions around neutrality [1], as further discussed in the section on attributable variance, as well as ensuring compatible marginal and conditional distributions. Our current model requires that selection acts through pure pleiotropy since we do not require neutral alleles to have zero effect.
The complete generality of allowing arbitrary joint distributions of pleiotropic effects and phenotype effects or arbitrary nonlinear forms of stabilizing selection comes with an intractable analytical problem.
Appealing to methods based on the first few moments of the joint distribution, as we do, is one away around that problem. Additionally, since we do not have an idea of what the marginal distribution of γ for new mutations should be a-priori, modeling its moments conditional on s, for which we do have a marginal distribution, is easier. Semiparametric methods or using the empirical distribution ofγ are another way forward.

SNP Effect Prediction and Attributable Variance Calculations
The calculation of the variance explained by genetic factors can proceed in two ways. First, we can follow the derivation of Johnson and Barton [1] (their appendix A) which requires that E γ 2 s be finite. The formula there yields in the diffusion approximation However, the second term is infinite in our chosen DFE because it contains a point mass at zero; the positive variance of phenotype effects among these neutral mutations creates infinite E γ 2 s . Were we to substitute a DFE with finite harmonic mean, this would provide a separation into a selection dependent first term and selection independent second term. To avoid the reliance on the diffusion approximation, we could in the same spirit write the total variance using an iterated conditional expectation where the outer variances and expectations are across individuals in the sample. The interpretation of these estimates is unintuitive; they describe an average variation explained over possible sets of SNPs in a new population with identical SNP structure to the present study. That may not be interesting to practicing epidemiologists who want to know about the variance due to SNPs which actually exist.
We can also try to build attributable variance based on predicted effect sizes. A formula for best linear then we can estimate the total SNP effect contribution to each person which are fitness independent and fitness dependent components. Taking the sample variance over subjects of each half of SNP effect contribution then produces attributable variance. This formula relies on getting effect sizes right, and may be more sensitive to incorrect specification. Because the model ignored features like the higher moments of the DFE and point mass at zero, a certain amount of mis-specification is guaranteed. Additionally, because of the phenomenon of shrinkage whereby random effect estimates are biased towards zero, the above estimates will tend to systematically underestimate attributable variance.
In particular, while the model allows that, as a collection, the effects of rare SNPs are systematically larger than common SNPs, because there is little information per-rare-SNP the effect estimates are strongly shrunk to a common value.
Formulas for the asymptotic sampling variance of random effect estimates are also provided in [2], and these are used to generate confidence intervals. These asymptotics are known to converge somewhat slowly, so the coverage properties may not be exact.

DFE as known versus estimating its parameters
We take the DFE as known in our procedures, but for investigators with large datasets it may be tempting to re-estimate its parameters, perhaps on a per-gene basis, to calibrate our assumed relationship between observed frequencyf j and mean fitnessŝ j . The prfreq software [3] uses the observed site-frequency spectra of derived alleles as data to fit an assumed shape of a DFE with an EM algorithm. Because it relies on having many variants to form a reliable empirical distribution of allele frequencies, this method is suitable to collections of genes or genes which contain a large number of variants believed to be suitable for pooling. Similarly, mkprf [4] compares the rate of polymorphism and fixed differences across specified genomic areas to estimate selection intensity and rank genes based on evidence for selection. Because more data is gathered by expanding the phylogenetic tree examined for fixed differences rather than by including a broader genomic area, it can estimate the mean fitness effect very locally and is suitable to single genes or even smaller regions.
We do not use gene-specific estimates of the DFE in our approach because neither method implemented in existing software is adapted to our setting. prfreq relies on having a large number of variants and is computationally workable only with small sample sizes, but our example has over 3500 participants sequenced on only 3 genes. mkprf assumes a hierarchical structure on fitness-effects and requires a larger number of genes to create effective contrast between genes. It also is very sensitive to prior parameters [5].
Because it fits a single fitness effect per-gene, it is not clear how it functions in the presence of both adaptive and purifying selection. Future sequencing studies may better meet the requirements of the above software, and some investigators are attempting to combine both site frequency spectrum and phylogenetic approaches into a single estimate or include more information, such as the apparent age of variants.
One potential problem is that we have taken parameters estimated with uncertainly and treated them as knows. Particularly, estimating demographic parameters from the datasets used in [3] may not be reliable because the small number of people sequenced constrains the extent to which the SFS can inform on recent demographic patterns. These estimates are doubtless incorrect in some way, and we will lose efficiency as a result. We do not believe our procedure for computing fitness effects to be optimal, but an advantage of our formulation is that as improved methods for estimating of fitness effects become available they can be easily substituted into the association procedure.
Instead of our fairly complicated mechanism for estimating fitness effects, one could instead choose apriori to have the mean and variance of γ come from a specified form and maximize over its parameters.
For example, one could impose thatŝ j |f j followed a power-law and maximize the implied likelihood over its parameter. We found a power law to be a reasonable approximation toŝ in some settings, but not others (data not shown). Such an approach would come with additional flexibility at the price of more variable null hypothesis behavior and hence a loss in power. Because the other parameters in the model could change value dramatically depending on the chosen parameters forŝ j |f j , confidence intervals for them might be adversely effected. Of course should the underlying conditional expectation of fitness not follow the assumed form, we would expect this to be less accurate. It would also eliminate the fitness-phenotype correlation interpretation of the fitted model. Because the population-genetic approach describes the underlying data-generating mechanism for a resequencing study, we prefer it to ad-hoc methods.

Non-frequency Predictors of Fitness or Function
It would be advantageous to be able to include prior information on SNP effects besides frequency. For example, predictions based on substitution type and sequence composition for non-synonymous variants [6][7][8][9][10][11][12][13][14][15] or eQTL maps for non-coding variation [16] would add valuable prior information. Our procedure can be generalized to include them in three ways. First, for those bioinformatic tools already calibrated to predict fitness effect, the features of deleterious sequences which they rely on could be added to the generative model for fitness effects of new mutations in the simulation method. The mean and variance of true fitness effects conditional on observed features of the variants in the real data could then be calculated in the same way as before. Second, for any of the tools which are not calibrated on fitness, their estimates of SNP function can be included as a fixed effect in the model analogous toŝ. If the mean prediction error of the tool relative to its underlying construct is available, that prediction error can be included as a variance component analogous to the role of V j . Third, tools which identify regions as probably relevant (like highly conserved regions) could warrant either stratified parameter estimates or a correlation between the effects of variants in the region.