
Conceived and designed the experiments: TJH NJM JSW. Performed the experiments: TJH. Wrote the paper: TJH NJM JSW.

The authors have declared that no competing interests exist.

Recent findings suggest that rare variants play an important role in both monogenic and common diseases. Due to their rarity, however, it remains unclear how to appropriately analyze the association between such variants and disease. A common approach entails combining rare variants together based on

There is increasing evidence supporting the role of rare variants in both monogenic and complex diseases

Evaluating the potential impact of rare variants on disease is complicated, however, by their uncommon nature. Several approaches have been proposed for the analysis of rare variants. At one extreme, one can collect such an enormous study sample that rare variants are detected often enough to allow each variant to be tested individually; for example, Nejentsev et al.

An alternative is to combine rare variants together into groups in a reasonable manner so they can be efficiently analyzed. Note that when we use “efficient” in this manuscript, we will always be referring to statistical power; computational time will be referred to as runtime. One might simply tabulate in cases and controls the number of individuals that have any rare variants (e.g., within a given locus), and contrast these counts. Morgenthaler et al.
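The simple carrier-count idea just described can be sketched in a few lines. This is a hedged illustration only; the function and variable names are ours, not from any published software:

```python
import numpy as np

def cast_carrier_counts(genotypes, is_case):
    """genotypes: (n_subjects, n_variants) matrix of minor-allele counts;
    is_case: boolean vector marking cases. Returns the number of cases
    and the number of controls carrying at least one rare variant."""
    carrier = (genotypes > 0).any(axis=1)
    return int(carrier[is_case].sum()), int(carrier[~is_case].sum())

# Toy data: 4 cases then 4 controls, 3 rare variants (illustrative only)
geno = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0], [1, 0, 1],
                 [0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]])
case = np.array([True] * 4 + [False] * 4)
print(cast_carrier_counts(geno, case))  # (3, 1): 3 case carriers, 1 control carrier
```

The two carrier counts can then be contrasted with, for example, a 2x2 exact test.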

Another option is to somehow weight each rare variant and then combine them. The optimal approach will upweight the variants most likely to cause disease and downweight variants that have no effect on disease. The weights could be calculated in a number of different ways. Madsen and Browning
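A minimal sketch of frequency-based weighting in this spirit, where each variant's weight is inversely related to a minor-allele frequency estimated among controls, so that rarer variants are upweighted. The exact smoothing and names here are our assumptions, not a definitive implementation:

```python
import numpy as np

def mb_weights(control_genotypes):
    """Weights inversely related to a smoothed minor-allele frequency
    estimated among controls: rarer variants receive larger weights."""
    n = control_genotypes.shape[0]
    m = control_genotypes.sum(axis=0)        # minor-allele counts in controls
    q = (m + 1.0) / (2.0 * n + 2.0)          # smoothed control frequency
    return 1.0 / np.sqrt(n * q * (1.0 - q))

def weighted_burden(genotypes, weights):
    """Per-subject weighted sum of minor-allele counts."""
    return genotypes @ weights
```

The per-subject weighted burden can then be compared between cases and controls in place of the unweighted carrier count.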

The decision to aggregate rare variants – with or without explicit weighting – requires a number of strong assumptions about the similarity of their effects on disease. This raises a critical unanswered question: how to best combine rare variants for analysis? For instance, one might choose a minor allele frequency threshold to define what is “rare,” or choose a weighting scheme for the variants (even if constant weights). In addition, one might decide to only aggregate nonsynonymous variants in the coding regions

Two methods have recently been proposed to collapse rare variants in a data-driven manner. Price et al.

Our approach considers multiple possible groupings, chooses the “best” set based on statistical criteria, and corrects for the resulting multiple comparisons by permutation. One can use prior information from several sources to define these groupings; e.g., different protein-coding function algorithms. Alternatively, or in addition, one can define the groupings in a data-driven manner based only on statistical criteria; e.g., all possible allele frequency thresholds, all possible subsets of rare variants, or the “step-up” approach we propose here. That is, we use the data to decide whether a variant's effect should be modeled as deleterious or protective, or whether the variant should be in the model at all. We use a simulation study to evaluate these approaches. The simulations are based on data from deeply sequenced candidate genes in the one-carbon folate metabolic pathway

Assume that we have undertaken a study of the relationship between

An alternative is to somehow aggregate multiple rare variants, and leverage their combined strength to improve estimation. This can be formalized with a second-stage model for the parameters of interest, a vector of coefficients

However, most of the existing rare variant approaches essentially model a single combined genetic effect

We will explore different ways to model

For the continuous weight

If we believe all variants have a deleterious effect, we can set

Lastly, we have

Another example of how to choose

There is one other model we will introduce for

With the computational complexity of testing multiple weights in mind, we also consider a data-driven method for specifying

We can further extend this to allow the set of all models considered to include any combination of the approaches from above, restricted to being computationally feasible. That is,
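The step-up idea can be illustrated with a deliberately simplified greedy search: starting from the empty model, repeatedly add the signed variant that most improves a simple association statistic, and stop when no addition helps. Here a squared correlation between the weighted burden and the trait stands in for the score statistic, and all names are ours; the overall significance of the maximized statistic would then be assessed by permutation, as described in the text:

```python
import numpy as np

def step_up(genotypes, trait):
    """Greedily add signed variants while the statistic improves."""
    p = genotypes.shape[1]
    weights = np.zeros(p)
    best = 0.0
    while True:
        cand_stat, cand = best, None
        for j in range(p):
            if weights[j] != 0:
                continue                    # variant already in the model
            for sign in (1.0, -1.0):        # deleterious vs. protective
                trial = weights.copy()
                trial[j] = sign
                burden = genotypes @ trial  # signed burden score
                if burden.std() == 0:
                    continue
                stat = np.corrcoef(burden, trait)[0, 1] ** 2
                if stat > cand_stat + 1e-12:
                    cand_stat, cand = stat, trial
        if cand is None:
            return weights, best            # no remaining variant improves the fit
        weights, best = cand, cand_stat
```

Because the statistic is maximized over many candidate models, its null distribution must come from permutation rather than from an asymptotic reference.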

In the previous section we described a general framework and strategies for constructing a model for the variant weights

This is similar to CAST, but summing

In addition to these, we then fit models

We investigated several different rare variant disease models. Dichotomous traits were simulated using the disease model given in equation 1 under a logit link, and continuous traits with the identity link. We simulated a range of odds ratios (
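A hedged sketch of drawing a dichotomous trait from a logit-link disease model. Equation 1 is not reproduced here, so the linear predictor below is a generic stand-in, and the intercept and effect sizes are illustrative assumptions:

```python
import numpy as np

def simulate_dichotomous(genotypes, log_or, intercept=-2.0, seed=0):
    """Draw case/control status under a logit link: each carried risk
    allele multiplies the odds of disease by exp(log_or)."""
    rng = np.random.default_rng(seed)
    eta = intercept + genotypes @ log_or      # linear predictor
    prob = 1.0 / (1.0 + np.exp(-eta))         # inverse logit
    return rng.random(len(prob)) < prob
```

For a continuous trait, the same linear predictor would be used with an identity link plus Gaussian noise.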

The variant data were generated using the haplotype frequencies across genes from an existing sequence-level dataset. One thousand cases were drawn according to the joint distribution of

We ran 500 simulations per gene, and averaged the empirical power over all of the genes according to a type I error rate of 0.05 (i.e., average power for gene-specific detection, not pathway). We ran 500 permutations for each test (except CMC, for which an asymptotic test is available
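The permutation scheme used for the tests above can be sketched as follows. This is a hedged illustration: the `+1` correction in the final line is a common convention for permutation p-values, not necessarily the exact one used in the paper:

```python
import numpy as np

def permutation_pvalue(stat_fn, genotypes, trait, n_perm=500, seed=0):
    """Shuffle the trait, recompute the statistic, and report the
    fraction of shuffles at least as extreme as the observed value."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(genotypes, trait)
    hits = sum(
        stat_fn(genotypes, rng.permutation(trait)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```

Any of the collapsing or step-up statistics can be passed in as `stat_fn`, since the maximization over candidate models is simply repeated inside each permutation.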

We ran several simulations for dichotomous traits with the following values of

We also reran simulations 1 and 5 for continuous traits. Here we replace the odds ratio

The deep sequenced dataset on which our simulations were based was rich with rare variants; out of 764 putative SNPs, 653 had allele frequencies less than

For each gene, the number of variants from sequencing that are nonsynonymous, and then deemed deleterious by three different methods (SIFT

| SIFT      | PMUT | PolyPhen | Count |
|-----------|------|----------|-------|
| I         | Path | Prob     | 8     |
| I-LC      | Path | Prob     | 2     |
| I         | Neut | Prob     |       |
| I-LC      | Neut | Prob     |       |
| tolerated | Neut | Prob     | 1     |
| I         | Path | Poss     | 3     |
| tolerated | Path | Poss     |       |
| I         | Neut | Poss     |       |
| I-LC      | Neut | Poss     | 2     |
| tolerated | Neut | Poss     | 6     |
| I         | Path | Ben      |       |
| I-LC      | Path | Ben      |       |
| tolerated | Path | Ben      |       |
| I         | Neut | Ben      |       |
| I-LC      | Neut | Ben      | 1     |
| tolerated | Neut | Ben      | 43    |

Each simulation enumerated above is highlighted in

500 simulations were based on the haplotype distribution of each of 13 deep sequenced candidate genes, and the results were averaged. 500 permutations were run per test. The label at the bottom of each plot consists of three parts that indicate the test used:

Results in Figures A and B show the effect of having both deleterious and protective rare variants. Figures C and D switch to a continuous trait, with Figure D showing the effect of having both deleterious and protective rare variants. Results are sorted by the area under each curve, i.e., by overall power. See the

We then looked at the effect of common variation according to the PMUT algorithm in simulation 4 (4 genes had common variants,

In the top panels of

For continuous traits, our simulations gave generally similar results to those seen for dichotomous traits.

We have compared several different approaches to rare variant analysis that incorporate varying amounts of prior information in deciding how to aggregate such variants. When one does not know how rare variants affect disease, and is hesitant to make the strong assumptions required to collapse them together, the completely agnostic step-up approach presented here may be the most appropriate. It performed either the best, or close to the best (excluding the “perfect” but unrealistic tests) in the various situations considered.

When it is possible that both protective and deleterious variants are present, we found it useful to sign variants (although there was little difference between the stepwise and signed stepwise approaches). Signing variants greatly improved efficiency when both protective and deleterious variants were present, although some efficiency was lost when only deleterious alleles were present. The weighting schemes we considered based on allele frequency (models for

Our simulations focused on combining rare variants within particular genes. One can extend this approach to pathways, exomes, or entire genomes, although the latter may be computationally challenging. Some computational time may be saved by using an adaptive permutation that stops earlier for genes or regions that appear to have no impact. For exomes, one could also further collapse entire pathways instead of genes. A fast analysis of different pathways could be done by testing each gene individually, and combining the resulting p-values with the Fisher product test statistic
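Fisher's product method combines k independent p-values via X = -2 Σ log p_i, which follows a chi-squared distribution with 2k degrees of freedom under the null of no association in any gene. A minimal sketch; the closed-form survival function below is valid only for even degrees of freedom, which always holds here:

```python
import math

def fisher_combined(pvalues):
    """Return Fisher's combined chi-squared statistic and its df."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return stat, 2 * len(pvalues)

def chi2_sf_even_df(x, df):
    """Survival function of chi-squared with even df (df = 2k):
    exp(-x/2) * sum_{i<k} (x/2)^i / i!."""
    k = df // 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (x / 2.0) / i
        total += term
    return math.exp(-x / 2.0) * total
```

With a single gene the combined p-value reduces to that gene's own p-value, which is a useful sanity check on the implementation.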

Many complex diseases are likely due to a combination of rare and common variants. One can jointly analyze rare and common variants as in the CMC approach
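The CMC-style joint treatment can be sketched as collapsing the rare variants into a single carrier indicator while keeping the common variants as separate terms in a multivariate model. This is a hedged illustration; the frequency cutoff and names are ours:

```python
import numpy as np

def cmc_design(genotypes, freqs, rare_cutoff=0.01):
    """Build a design matrix: one 0/1 column indicating carriage of any
    rare variant, plus the common variants left as separate columns."""
    rare = freqs < rare_cutoff
    collapsed = (genotypes[:, rare] > 0).any(axis=1).astype(float)
    return np.column_stack([collapsed, genotypes[:, ~rare]])
```

The resulting design matrix can then be tested jointly, e.g., with a multivariate regression of the trait on all of its columns.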

Another promising approach for rare variant analysis is hierarchical modeling

As with any genetic analysis, one may need to adjust for potential confounding (e.g., due to population stratification). Dichotomous covariates, or covariates with only a few levels, can be accommodated easily in these rare variant approaches by stratifying on them. Otherwise, the residuals of a logistic/linear regression of the trait on the covariates of interest can be analyzed with the continuous version of the test. One could also simply use the model in Equation 1 adjusting for covariates; here, one might always use linear regression, as it will be faster. The score test from linear regression is nearly the same as the score test from logistic regression, with the modification that the information contribution of each subject is weighted by
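The residual-based adjustment just described can be sketched with ordinary least squares. This is an illustrative simplification with names of our choosing:

```python
import numpy as np

def residualize(trait, covariates):
    """Least-squares residuals of the trait regressed on the covariates
    (an intercept column is added automatically)."""
    X = np.column_stack([np.ones(len(trait)), covariates])
    beta, *_ = np.linalg.lstsq(X, trait, rcond=None)
    return trait - X @ beta
```

The returned residuals, being uncorrelated with the covariates, can then be supplied directly to the continuous-trait version of any of the tests above.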

In summary, our simulations suggest that the step-up approach works quite well without requiring

Our thanks to Dr. Gary Shaw and the California Department of Public Health for use of the deeply sequenced genetic data.