Conceived and designed the experiments: CRK PJR DLN. Analyzed the data: CRK. Contributed reagents/materials/analysis tools: CRK PJR. Wrote the paper: CRK DLN.

The authors have declared that no competing interests exist.

Sequencing technologies are becoming cheap enough to apply to large numbers of study participants and promise to provide new insights into human phenotypes by bringing to light rare and previously unknown genetic variants. We develop a new framework for the analysis of sequence data that incorporates all of the major features of previously proposed approaches, including those focused on allele counts and allele burden, but is both more general and more powerful. We harness population genetic theory to provide prior information on effect sizes and to create a pooling strategy for information from rare variants. Our method, EMMPAT (Evolutionary Mixed Model for Pooled Association Testing), generates a single test per gene (substantially reducing multiple testing concerns), facilitates graphical summaries, and improves the interpretation of results by allowing calculation of attributable variance. Simulations show that, relative to previously used approaches, our method increases the power to detect genes that affect phenotype when natural selection has kept alleles with large effect sizes rare. We demonstrate our approach on a population-based re-sequencing study of association between serum triglycerides and variation in ANGPTL4.

Studies correlating genetic variation to disease and other human traits have examined mostly common mutations, partly because of technological restrictions. However, recent advances have resulted in dramatically declining costs of obtaining genomic sequence data, which provides the opportunity to detect rare genetic variation. Existing methods of analysis designed for an earlier era of technology are not optimal for discovering links to rare mutations. We take advantage of 1) the advanced theoretical understanding of evolutionary mechanics and 2) genome-wide evidence about evolutionary forces on the human genome to suggest a framework for understanding observed correlations between rare genetic variation and modern traits. The model leads to a powerful test for genetic association and to an improved interpretation of results. We demonstrate the new method on previously confirmed results in a gene related to high blood cholesterol levels.

Over the past 20 years, positional cloning guided by linkage analysis and genome-wide association studies (GWAS) have identified many loci relevant to human disease and other quantitative phenotypes such as height, body mass index, and serum lipid composition. However, in most cases the total amount of phenotypic variance explained is small compared to the heritability observed in twin or adoption studies.

There are three signatures of association in a resequencing study which we want to use to assess candidate genes. Some SNPs could have effect sizes large enough that they have individually noticeable impact on phenotype; this is the information underlying regression procedures, like those put forward by Hoggart et al

We present a method capable of detecting all three signatures of association. Our method generalizes allele count and rare-variant-burden methods by explicitly constructing a model relating disease impact, selective pressure, and SNP frequency in a candidate gene. By doing so, we will be able to provide intuitive interpretations of detected associations, allowing investigators to answer additional questions with their data. If the model is close to correct, our approach will yield substantially more power, and it does so without introducing bias or sacrificing much efficiency when our assumptions are not met.

We propose to estimate the evolutionary fitness burden of each SNP using its observed frequency and population genetic parameters inferred by other authors. That estimate of fitness burden will act as prior information on the variant effect, acting like a burden function

In what follows, we will briefly introduce the population genetics ideas which underlie our approach. Next, we construct our statistical model and discuss estimation and testing within it. Finally, we illustrate the method both in simulation studies and on a real candidate gene resequencing study examining serum triglyceride levels in a multi-ethnic prospective community-based sample.

Several authors have reviewed the potential contribution of low frequency alleles to variation in phenotypes

Panel A depicts the scenario where the trait is directly under selection. Panel B depicts the scenario where a gene with pleiotropic effects creates fitness-trait correlation via a related phenotype.

Hartl and Clark

Rather than assume that all variants in the region have the same

With these facts in mind, in what follows we will use fitness effects to operationalize the construct of functional status for each SNP. Whereas Johnson and Barton

Assume the context of a simple random cross-sectional sample of

We can write a regression model for person

Using standard least-squares regression to estimate such a model will pose several problems. First, because there will be many rare variants,

To overcome these problems, we need to make more assumptions and model the

Equation (4) asserts that phenotype-effect and fitness-effect are linearly related; that seems correct for the scenario in

Our model is quite general in that existing methods correspond to submodels of (4). An allele count method tests the model with only

The specification of equations (1), (4), and (6) yields a natural interpretation to the fitted model. After estimating the population parameters of phenotype effects, we will be able to jointly estimate individual SNP effects

Simulation results using fitted DFE of non-synonymous variation from

An important consideration is how to interpret the results when multiple ethnic groups are analyzed simultaneously. Because some genetic variation is fixed between ethnic groups in the sample, the average effect of single-population variation will be absorbed into the fitted mean for that group. As a result, the interpretation for “total explained variation” is actually “total explained within-ethnic-group variation;” genetic variation may explain some of the phenotypic difference between groups, but we do not include it in our estimate because of confounding between environmental exposures and ethnic background.

Another point requiring clarification is the assumption that genotype effects are independent. In the context of GWAS, nearby SNPs are often thought to have correlated effects because they mutually tag a functional variant. Additionally, estimates of SNP effects will be correlated due to LD, making their true separate effects difficult or impossible to identify. However, in the underlying data-generating mechanism, true genotype effects are independent. Because sequencing identifies all the variation within the region and eliminates much of the correlation due to untyped alleles, we believe that the independence assumption is a useful approximation in this case. Non-independence of the true effects could be accommodated by imposing a covariance structure on SNP effects, for example using their spatial distance in the genome or folded protein. Alternatively, the phylogenetic approach of TreeLD

Model (4) relies on a prediction

Take as given the fitted distributional form of fitness effects and population history since out-of-Africa

Use existing software SFS_CODE

For each variant in the real dataset, find variants in the pseudo-data with the same sampled frequency, and calculate the mean

To reduce computational requirements, steps 2 and 3 above can be replaced by simulating a smaller number of large populations and calculating the expected mean and variance of fitness using simple random sampling.
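The frequency-matching step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the pseudo-data have already been simulated and reduced to hypothetical `(sampled_allele_count, selection_coefficient)` pairs, and that the quantities of interest are the mean and variance of the fitness effect among simulated variants with the same sampled count. The function names and the toy values are invented for the example.

```python
from collections import defaultdict
from statistics import mean, pvariance


def fitness_moments_by_count(sim_variants):
    """Group simulated variants by sampled allele count.

    sim_variants: iterable of (sampled_allele_count, selection_coefficient)
    pairs (a hypothetical reduction of the simulated pseudo-data).
    Returns {count: (mean, variance)} of the selection coefficients.
    """
    by_count = defaultdict(list)
    for count, coeff in sim_variants:
        by_count[count].append(coeff)
    return {c: (mean(v), pvariance(v)) for c, v in by_count.items()}


def predict_fitness(observed_counts, moments):
    """Look up the matched (mean, variance) for each observed variant.

    Returns None for a variant whose sampled count never occurs in the
    pseudo-data; in practice more simulation would be needed there.
    """
    return [moments.get(c) for c in observed_counts]


# Toy pseudo-data: five simulated variants (illustrative values only).
sim = [(1, -0.02), (1, -0.04), (2, -0.01), (2, -0.01), (5, 0.0)]
moments = fitness_moments_by_count(sim)
priors = predict_fitness([1, 2, 3], moments)
```

Because the lookup table depends only on the simulated population history and the sampled counts, it can be built once and reused across phenotypes measured on the same cohort.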

An advantage of this method is that because it refers to a feature of genetic history rather than a phenotype, it need only be done once for any trait under study on the same cohort. While the fitness - phenotype relationship will be different for all traits, that is modeled by the fitted parameter

Our model fitting procedure will be likelihood-based, so we will use a standard hypothesis testing method: likelihood ratio tests. To improve robustness, our examples will use permutation p-values obtained by comparing the likelihood ratio of the fitted model to that generated under the null hypothesis by randomly swapping genotype vectors between members of the same ethnicity. Permuting genotype labels simulates the null hypothesis that no relationship exists between any genotype and any aspect of the response, which in our parametric setup is equivalent to
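The within-ethnicity permutation scheme can be illustrated with a short Python sketch. It is written under stated assumptions rather than as the procedure actually used: `statistic` stands in for the likelihood ratio of the fitted model (here replaced by a toy carrier versus non-carrier mean difference), the helper names are hypothetical, and the `(exceed + 1) / (n_perm + 1)` correction is a common convention that the text does not confirm.

```python
import random


def permute_within_groups(genotypes, groups, rng):
    """Shuffle whole genotype vectors among individuals in the same group."""
    permuted = list(genotypes)
    index_by_group = {}
    for i, g in enumerate(groups):
        index_by_group.setdefault(g, []).append(i)
    for idx in index_by_group.values():
        shuffled = idx[:]
        rng.shuffle(shuffled)
        for dst, src in zip(idx, shuffled):
            permuted[dst] = genotypes[src]
    return permuted


def permutation_p_value(statistic, phenotype, genotypes, groups,
                        n_perm=1000, seed=0):
    """Compare the observed statistic with its within-group permutation null."""
    rng = random.Random(seed)
    observed = statistic(phenotype, genotypes)
    exceed = 0
    for _ in range(n_perm):
        perm = permute_within_groups(genotypes, groups, rng)
        if statistic(phenotype, perm) >= observed:
            exceed += 1
    # Add-one correction (an assumption) keeps the estimate above zero.
    return (exceed + 1) / (n_perm + 1)


def carrier_mean_gap(phenotype, genotypes):
    """Toy statistic: |mean phenotype of carriers - mean of non-carriers|."""
    carriers = [y for y, g in zip(phenotype, genotypes) if sum(g) > 0]
    others = [y for y, g in zip(phenotype, genotypes) if sum(g) == 0]
    if not carriers or not others:
        return 0.0
    return abs(sum(carriers) / len(carriers) - sum(others) / len(others))
```

Shuffling entire genotype vectors, rather than individual SNPs, preserves the linkage disequilibrium pattern within each person while breaking any link to the phenotype; restricting the shuffle to members of the same group keeps ethnicity-level differences out of the null distribution.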

For numerical convenience and statistical robustness, we will use only the first two moments of the model in equations (1), (4), and (6), and assume

We allow the procedure to exploit the possibility that individuals with a high burden of rare alleles not only have drift in their mean phenotype because of

We will fit the mixed effects model (8)–(9) using modified Newton-Raphson optimization of the implied likelihood. The linear mixed effects approach is equivalent to assuming normality for the error terms

As discussed above, we will be interested in fitting distinct

Alternatively, if we use a single

Our model is easily recast in a purely Bayesian framework. One would need to write priors for

The Bayesian analyst could continue to use our normal approximation of the distribution of the latent

About 3500 prospectively sampled individuals from the population in Dallas, Texas, were sequenced at a candidate gene for dyslipidemia: ANGPTL4 (Ensembl Acc:16039). These individuals come primarily from three ethnic backgrounds: non-Hispanic white (N = 1043), non-Hispanic black (N = 1832), and Hispanic (N = 601). We will exclude from our analysis the 75 individuals listed as “Other” ethnicity. Our outcome phenotype is log-transformed serum triglyceride levels. Details of the cohort

Population | N individuals | N Non-synonymous variants | N Non-coding variants |

Pooled | 3476 | 32 | 62 |

Non-Hispanic whites | 1043 | 20 | 23 |

Non-Hispanic blacks | 1832 | 15 | 38 |

Hispanic | 601 | 8 | 17 |

Population | SNP Type | SE | nonfitness % variance | fitness % variance | |||

Pooled | non-syn | 0.13 | 0.0 | 2.5 | 8.7 | 0.54 | 0.003 |

Pooled | non-coding | 0.02 | 8.3 | −9.6 | 6.5 | 0.09 | 0.08 |

NHW | non-syn | 0.15 | 0.0 | 5.8 | 13.5 | 0.53 | 0.03 |

NHW | non-coding | 0.02 | 0.0 | 1.9 | 7.3 | 0.004 | 0.008 |

NHB | non-syn | 0.08 | 0.0 | 0.5 | 11.4 | 0.42 | 0.0002 |

NHB | non-coding | 0.02 | 0.0 | −11.4 | 8.1 | 0.07 | 0.13 |

Hispanic | non-syn | 0.00 | 0.0 | 20.5 | 43.9 | 0 | 0.03 |

Hispanic | non-coding | 0.10 | 19.6 | −40.8 | 38.2 | 0.08 | 0.66 |

For ANGPTL4, we observe a p-value of .006 on 10,000 permutations versus the strong null hypothesis that no SNPs have any effect. Previous authors

The SNPs have been rank-ordered by observed frequency on the x-axis with ties broken by estimated effect size. The left y-axis is the predicted effect (

An interesting data point in

We can understand this model fit by looking at the green OLS estimates in

An interesting potential story about natural selection on ANGPTL4 activity emerges from

In order to determine the power and robustness of our procedure, we simulated variation in a gene with the exon structure of the gene ANGPTL4 in a study population using SFS_CODE

We chose several levels of the phenotype parameters to correspond to potential cases of interest while keeping the total fraction of variation explained by the gene about the same: a weak mean variant effect, a strong fitness-related component of the phenotype, and a strong fitness independent component of the phenotype. We chose the baseline values such that

scenario | | | | residual standard deviation | expected fitness % variance explained | expected nonfitness % variance explained |

Base | −7.0 | 0.012 | 0.007 | 0.22 | 0.84 | 0.84 |

High | −21.0 | 0.012 | 0.007 | 0.50 | 1.51 | 0.17 |

High | −7.0 | 0.018 | 0.007 | 0.28 | 0.55 | 1.13 |

Low | −6.4 | 0.012 | 0.003 | 0.21 | 0.83 | 0.85 |

Very high | −63.1 | 0.012 | 0.003 | 1.43 | 1.66 | 0.02 |

Zero | 0.0 | 0.012 | 0.007 | 0.16 | 0.00 | 1.68 |

Two additional batches of simulation examine the robustness of our procedure to incorrect assumptions. First we created violations of the assumed population model. We mis-specified the assumed DFE in our analysis, making the scale parameter a factor of 5 too large or too small and keeping the truth the same. We also simulated violation of our demographic assumptions using a population which experienced an additional 100 fold exponential growth over the last 11% of generations since out-of-Africa. Second we created violations of the assumed statistical model. We simulated three scenarios violating the linearity assumption. First, with

To compare power with existing methods, we included several previously proposed methods of analysis. First, we test the Bonferroni-corrected minimum p-value among SNPs with minor allele frequency >1% or >5%. Other proposed methods using allele counts like CAST

To demonstrate the gain (or loss) in information by considering the marginal variance, we apply a similar regression with an optimal mean model, that is (8) either for all SNPs or treating common SNPs as free. We tested our model both with a single

scenario | Min p 1% | Min p 5% | CAST 1% | CAST 5% | CMC 1% | CMC 5% | Weighted Sum All | Weighted Sum 5% | Optimal Mean All | Optimal Mean 5% | EMMPAT Split | EMMPAT One |

Null | .05 | .05 | .05 | .05 | .04 | .06 | .05 | .04 | .04 | .05 | .05 | .04 |

Base | .22 | .26 | .30 | .27 | .30 | .36 | .22 | .35 | .45 | .39 | .54 | .56 |

High | .06 | .08 | .12 | .11 | .08 | .10 | .26 | .13 | .38 | .21 | .48 | .48 |

High | .28 | .34 | .27 | .27 | .42 | .46 | .18 | .45 | .36 | .47 | .52 | .53 |

Low | .20 | .26 | .21 | .21 | .32 | .36 | .18 | .38 | .36 | .42 | .49 | .48 |

Very High | .04 | .06 | .06 | .06 | .05 | .06 | .22 | .08 | .32 | .15 | .45 | .45 |

Zero | .44 | .52 | .48 | .44 | .61 | .66 | .09 | .62 | .45 | .61 | .58 | .62 |

violation of population model assumptions | ||||||||||||

DFE*5 | .23 | .28 | .30 | .27 | .33 | .39 | .24 | .39 | .43 | .42 | .51 | .53 |

DFE*.2 | .22 | .27 | .33 | .30 | .33 | .36 | .24 | .37 | .46 | .39 | .56 | .55 |

Exponential growth | .29 | .32 | .37 | .35 | .46 | .51 | .13 | .49 | .38 | .48 | .50 | .53 |

violation of fitness linearity and distribution of | ||||||||||||

Square | .20 | .23 | .28 | .23 | .29 | .33 | .12 | .31 | .39 | .39 | .57 | .59 |

Random sign | .20 | .27 | .28 | .25 | .32 | .36 | .10 | .33 | .28 | .34 | .43 | .45 |

Square root | .23 | .27 | .35 | .28 | .35 | .34 | .23 | .39 | .43 | .41 | .49 | .48 |

Skew effects | .22 | .29 | .34 | .31 | .33 | .37 | .24 | .38 | .45 | .41 | .56 | .56 |

20% no effect | .30 | .33 | .31 | .28 | .40 | .45 | .24 | .46 | .45 | .50 | .59 | .60 |

80% no effect | .33 | .33 | .18 | .17 | .35 | .36 | .13 | .36 | .24 | .38 | .44 | .43 |

We propose a novel method, EMMPAT, for testing association between sequenced genes and phenotype, which uses population genetic theory to pool information among rare variants. Our method generalizes allele-count and allele-burden techniques and presents several advantages. Of greatest importance to the practicing scientist will be increased power and interpretability. As shown above, our method allows us to leverage allele frequency as auxiliary data related to SNP effects and to substantially increase power to detect association in many scenarios. The availability of a well-motivated pooling strategy allows an omnibus test which incorporates common and rare variation simultaneously. Our approach provides clear interpretations for the fitted model, such as the attributable variance in phenotype due to all polymorphisms observed in a gene, particular types of SNPs, or only the rare variation. Furthermore, it facilitates tests of meaningful parameters (such as mean derived allele burden) and group differences (such as non-synonymous versus non-coding). The regression toolbox allows model checking and exploration, such as in

A relevant question is how important our method will be for diseases which have not been strongly selected against. There are three answers to consider. First, when selection and disease effect are completely independent, common SNPs will tend to have just as large effect sizes as rare SNPs and explain much of the heritable variation in phenotype

We have planned several extensions to this method. In addition to improved techniques of estimating fitness effects, we need to incorporate evidence for adaptive selection. Signatures of positive selection

Supplementary methods and discussion.

We would like to thank Dara Torgerson and Ryan Hernandez for their assistance with using SFS_CODE and insightful thoughts on population genetic models and software. We would like to thank Helen Hobbs and Jonathan Cohen for access to the Dallas Heart Study dataset. We are grateful to Nancy Cox and anonymous reviewers for comments on a draft of the paper.