Testing for an Unusual Distribution of Rare Variants

Technological advances make it possible to use high-throughput sequencing as a primary discovery tool of medical genetics, specifically for assaying rare variation. Still this approach faces the analytic challenge that the influence of very rare variants can only be evaluated effectively as a group. A further complication is that any given rare variant could have no effect, could increase risk, or could be protective. We propose here the C-alpha test statistic as a novel approach for testing for the presence of this mixture of effects across a set of rare variants. Unlike existing burden tests, C-alpha, by testing the variance rather than the mean, maintains consistent power when the target set contains both risk and protective variants. Through simulations and analysis of case/control data, we demonstrate good power relative to existing methods that assess the burden of rare variants in individuals.

• N is the set of distinct counts of variants. If the data includes doubletons, triples, ..., up to max(n) then N = {2, 3, . . . , max(n), n }, where n refers to the number of singletons; see below.
• m(n) is the number of variants with n copies.
•q n (y) is the proportion of variants with y copies out of n total, i.e., the empirical probability density function of variants with n copies.

Motivation
Under the null hypothesis, y i ∼ binomial(n i , p 0 ). Under the alternative hypothesis, p i is less than p 0 for some variants and greater than p 0 for others. More formally, we say p 1 , . . . , p m is a sample from a distribution G. We take a nonparametric approach, allowing G to follow an arbitrary distribution. For a Mendelian risk variant p i = 1.
We test to determine if H 0 : p i = p 0 for all variants, versus H 1 : p i = p 0 for some variants.
The "C-alpha test", introduced by Neyman and Scott, is the score test for this hypothesis [4,6]. The test is based on how strongly y i deviates from n i p 0 over the whole set of variants. We do not try to determine which variants are risk or protective. The test is based on determining if the mixing distribution G has variance. If G is a point mass at p 0 it has no variance. For this reason the test has only 1 degree of freedom, resulting in a flexible and powerful test.
We expect the number of copies of each variant, n i , to vary, yielding singletons (n i = 1), doubletons (n i = 2), triples (n i = 3), and so forth, up to the largest count declared to be rare, say 0.01(N 1 + N 2 ). The potential information increases as n i gets larger; quadruples can be more informative than doubles and so forth. On the other hand, risk is believed be higher for variants that are more rare. Singletons must be handled specially.
The Test Statistic Assuming no effect (H 0 ), the probability of observing y i copies of a variant in cases out of n i copies total is f (y i |n i , p 0 ). To derive the C-alpha test, first reparameterized from p to θ, the log odds: θ 0 = log(p 0 /1 − p 0 ). We move between parameterizations as convenient, but derivatives are always taken with respect to θ. Re-expressing the probability density function for a single variant in the exponential family form we have Following Zelterman and Chen (1988), the C-alpha test statistic is Direct calculation of the derivatives, using facts about the exponential family form, reveals Note that T can be rewritten more compactly as where u is a dummy variable that indexes over possible values of y.
To compute the denominator of the C-alpha statistic we require the variance of T , which is approximately where This quantity can be written in a computationally more convenient form as This is an approximation because we ignore potential variability in p 0 .
Reject when Z = T / √ c is larger than expected using a one-tailed test. Alternatively we can do a permutation directly, computing T or Z for each permutation, to get the P-value.
It is a good idea to do the permutation for confirmation in any promising target regions. The permutation can account for the effects of features like linkage disequilibrium.
Singletons do impart some information, but they have to be treated specially because a singleton tells nothing about variability by itself. If we pool singletons, then we can extract information from them. The simplest approach would be to let y be the total count of singletons in cases, and n be the total count of singletons. Under H 0 , y ∼ binomial(n , p 0 ).
For the remainder of the supplement we pool the singletons into a single variant having n copies, with y of them observed in the cases.
Estimating p 0 In the proposed formulation of the C-alpha test, we require an estimate of the probability of observing a variant in the cases, p 0 . Two procedures for p 0 are possible for this test statistic, fixing it to a known value, or estimation from observed data. Fixing p 0 to a known value (e.g. p 0 = N 1 /(N 1 + N 2 ), which equals 1/2 for matched case-controls) assumes matching of cases and controls was carefully done to account for any biases that might be introduced in the variant count data. Several estimation procedures for p 0 are proposed for when data are not carefully matched. All methods are performed post discovery of variants.
• Let p 0 be the number of case variant alleles divided by the total number of variant alleles in the region of interest. This approach is very robust to stratification, but it will lose power in the presence of real effects.
• Extend the calculation above to all sequenced regions.
• Restrict the calculation above to variants that are likely to be phenotypically neutral (e.g. synonymous variants). This approach could be powerful and yet robust if a large fraction of the exome is sequenced.

Genomic Control
If the C-alpha statistic is inflated due to population stratification we expect the test statistic to be inflated at a constant rate across the genome (approximately). To correct for stratification we suggest using Genomic Control. To do so, first select equal numbers of random clusters of variants from the regions sequenced. These variants can be selected following the same strategy as used to estimate p 0 . Next, for each constructed target region, convert the observed P-value to a χ 2 statistic. The median of these statistics, divided by 0.455, estimates λ. Because the C-alpha is distributed normally, divide each test by √ λ rather than λ to correct for stratification. While it is preferable to control for stratification by matching cases and controls by ancestry, this correction will reduce the impact of stratification. The approach works best if target regions are of approximately equal length.

Weighted C-alpha Test
Let w i be a weight for the i'th variant, such as the quantity used by Madsen and Browning (2009) based on the standard deviation in the controls [3]. The weighted C-alpha test follows directly with and Notice that when we let the weights vary by variant, a computational "shortcut" formula like equation (3) cannot be obtained.
To preserve normality of the statistic, weights must follow some simple constraints, typical of weighted Central Limit Theorems (Weber 2006) [5]. For example, no weights should strongly dominate the others. In addition, the i'th weight should be conditionally independent of the contribution to the statistic from the i'th variant, given p i (Genovese et al. 2006) [1]. We propose using an independent resource, like the 1000 Genomes Project to derive weights.
Specifically, a natural choice would be to use those individuals of similar ancestry to the cases and controls to weigh variants based on their standard deviation in the 1000 genome subset.
By contrast, Madsen and Browning derive their weights from the study-specific controls. The use of such weights can introduce bias, and would require a special permutation to obtain an appropriate version of the C-alpha test. The bias can only be corrected if the algorithm recomputes the weights as well as the statistic, for each permutation. In this way we capture the effect of the bias and adjust the P-values accordingly.
Fitting mixture models and calculating diagnostics We limit ourselves to those regions in which C-alpha is significant (at least at the 0.05 level). Rather than assume the mixing distribution has K = 3 components, we can find the mixing distribution that best fits the data. Using the estimated mixing distribution we obtain post hoc estimates of the effects of particular variants -these inferences will be very weak, but still interesting. For instance, K = 3 points might give a good fit, if your initial assumptions were approximately true. This more general approach finds what K gives the best approximation. K = 1 is the null hypothesis.
A previously published paper describes diagnostics for mixing that can be calculated and plotted to determine if the data appear to be mixed, and to find K [2]. The diagnostic for mixing would be helpful to determine if it is worthwhile to do the permutation test.
The EM algorithm to fit the mixture Notation • k = 1, . . . , K index for the mixture components • p k binomial parameter for the k'th component • π k mixing weight for the k'th component • b ik probability that y i comes from the k'th component in the mixture distribution.
• Likelihood contribution from i'th variant for k'th component For the EM algorithm iterate between the following 3 equations, updating b ik 's, π k 's and p k 's each step, until convergence.
, i = 1, . . . , m; k = 1, . . . , K, Note: b ik is the posterior probability variant i is from mixing component k. Results can be interpreted to wager which variants are most likely detrimental (or likewise protective).
Supposing we have a significant C-alpha test, and we fit a mixture component which has a component with p k > p 0 . We can conclude that any variant i for which b ik >> π k is likely to have a strong detrimental effect. (These are not significance tests, but if the overall test is significant we can take a guess at which variants seem to be causing the signal.)

Gradient Diagnostics
A diagnostic plot helps to find K that best fits the data. Try K = 2 and increment until the following diagnostic criterion is satisfied. Let G K be the maximum likelihood estimate (mle) mixing distribution with K components using the EM algorithm. The gradient function is Plot this as a function of p. When D(G K , p) is (numerically approximately) nonpositive for p ∈ the interval [0,1], then G K is the global mle -no K will give a better solution in terms of likelihood. In fact, for the right choice of K, D(G K , p) looks like a polynomial with zeros at points corresponding to components in the mixing distribution. When K = 1, f (y i |n i , G 1 ) = f (y i |n i , p 0 ). If a mixture model is appropriate, the diagnostic D(G 1 , p) = D(p 0 , p) should be a positive function that looks like a U. The C-alpha test is based on the curvature of D(p 0 , p) at p 0 -a strong second derivative indicates a mixing distribution. So the U shaped plot and the C-alpha test are two views of the same feature.
When the diagnostic and test show mixing, we believe it corresponds to a rare variant effect.