Integrated Model of De Novo and Inherited Genetic Variants Yields Greater Power to Identify Risk Genes

De novo mutations affect risk for many diseases and disorders, especially those with early-onset. An example is autism spectrum disorders (ASD). Four recent whole-exome sequencing (WES) studies of ASD families revealed a handful of novel risk genes, based on independent de novo loss-of-function (LoF) mutations falling in the same gene, and found that de novo LoF mutations occurred at a twofold higher rate than expected by chance. However successful these studies were, they used only a small fraction of the data, excluding other types of de novo mutations and inherited rare variants. Moreover, such analyses cannot readily incorporate data from case-control studies. An important research challenge in gene discovery, therefore, is to develop statistical methods that accommodate a broader class of rare variation. We develop methods that can incorporate WES data regarding de novo mutations, inherited variants present, and variants identified within cases and controls. TADA, for Transmission And De novo Association, integrates these data by a gene-based likelihood model involving parameters for allele frequencies and gene-specific penetrances. Inference is based on a Hierarchical Bayes strategy that borrows information across all genes to infer parameters that would be difficult to estimate for individual genes. In addition to theoretical development we validated TADA using realistic simulations mimicking rare, large-effect mutations affecting risk for ASD and show it has dramatically better power than other common methods of analysis. Thus TADA's integration of various kinds of WES data can be a highly effective means of identifying novel risk genes. Indeed, application of TADA to WES data from subjects with ASD and their families, as well as from a study of ASD subjects and controls, revealed several novel and promising ASD candidate genes with strong statistical support.


Probabilistic model of de novo mutations and inherited variations in family data
Given a trio of normal parents and an affected offspring, there are four primary genotype combinations ( Figure 2 in the main text). The informative cases are the de novo mutations, nontransmitted and transmitted variants (the second, third and fourth cases of Figure 2, respectively). Risk genes are more likely to carry de novo mutations than non-risk genes, and also are more likely to display an imbalance of transmitted vs. nontransmitted variants (the basis of transmission disequilibrium test, or TDT). We use p d , p nt and p t , to denote the probability and X d , X nt and X t the count of events of de novo, not-transmitted and transmitted events, respectively, out of N d families. For mutations with strong deleterious effects, all three probabilities are small. Hence we model the counts using Poisson distributions: X d ∼ Pois(N d p d ), and so forth for X nt and X t . Let x be the genotype (AA or Aa) and y be the phenotype of an individual (Disease or Unaffected). The probability of de novo mutations given the trio is: (1) where the subscripts p and c represent the parents and the child respectively. To compute the denominator, we sum over all the possible genotypes of the parent and child: P (y c = D|y p = UU) = xp,xc P (x p |y p = UU)P (x c |x p )P (y c = D|x c ). ( The genotype probabilities (the first two terms) and the phenotype probabilities (the last term) of the four cases are shown in Figure 2. Plugging in the relevant terms into Equation 1, we have: We focus on rare variants, hence q is typically less than 1%. The de novo mutation rate is generally very low, on the order of 1 × 10 −5 to 1 × 10 −6 for LoF or damaging missense mutations. We thus have the approximation: p d ≈ 2μγ. Following the same logic, it is easy to show that p nt ≈ q and p t ≈ qγ. Thus we have the approximate distributions for X d , X u and X t : Because an estimate of the mutation rate μ can be obtained from other sources, for each gene, we assume μ is known, but q and γ need to be estimated. For the case/control data, the number of Aa genotypes in the cases and controls follow a binomial distribution: X cs ∼ Bin(N cs , p cs ) and X cn ∼ Bin(N cn , p cn ). Following the standard analysis of case/control data [1], it is easy to show that for rare variants: where K is the disease prevalence. Again, assuming q is small, the binomial distribution can be approximated by the Poisson distribution: The similarity of the distributions of case/control and transmitted/nontransmitted data suggests a natural way to combine the two types of data. Focus on the type of mutation we are studying (e.g. LoF) and let X 0 = X u + X cn be the total number of LoF variants in the controls plus the number of nontransmitted LoF variants, and X 1 = X t + X cs be the total number of LoF variants in the cases plus the number of transmitted LoF variants, we have: where N 0 = N cn + N d and N 1 = N cs + N d . For any gene, the likelihood function is simply the product of the three individual probabilities:

The impact of mutation rates on the Multiplicity Test
Our model states that the number of de novo mutations per gene (X) follows Poisson distribution, X ∼ Pois (2μγN ), where μ is the mutation rate of the gene (of the type of mutation we are studying, e.g. LoF), N is the number of family trios and γ is the relative risk of the mutation (1 for non-risk genes). Let λ = 2μγN be the Poisson rate. We introduce an approximation for Poisson distribution when the rate parameter λ is small: where we use the second-order Taylor expansion: e −λ ≈ 1 − λ + λ 2 /2, which suggests that the probability of having a recurrent mutation increases quadratically with the mutation rate μ. This observation has two consequences. First, it means that the power of detecting highmutation rate (usually large) genes by the Multiplicity Test is much higher than for smaller genes, and it explains our observations in Figure 1A. Second, it also means that the same double-hit (two de novo mutations in one gene) should be assigned different meaning or p-value for genes of different sizes, as it is easier to achieve a double-hit by chance for large genes. The De Novo Poisson Test adjusts the mutation rate difference across genes. For instance, in Table S3, we see that the gene CHD8 has a relatively high mutation rate 0.0001 (the genome-wide average is about 2.4×10 −5 ), and its p-value from the De Novo Test is about 0.0001; on the other hand, KATNAL2 has below-average mutation rate, and its p-value from the De Novo Test is much smaller at 3.1 × 10 −6 .

Bayesian inference of risk genes
In this section, we focus on data from a single type of mutation (LoF or damaging missense) and discuss how to combine the evidence from the two types of mutations later. To infer if a gene is a causal gene of the disease, we compare two hypothesis, H 1 : γ = 1 vs. H 0 : γ = 1. For most genes, the number of LoF mutations are generally very small (often 0), so the likelihood-based method may produce incorrect or unrealistic values for the parameters, e.g.q = 0. Thus we take a Bayesian approach. Let x = (x d , x 1 , x 0 ) be the count data for a gene to be tested, and let q and γ be the population frequency and relative risk, respectively. We model the prior of q and γ using the Gamma distribution, which is conjugate to the Poisson distribution: Note that we use different priors for q under H 1 and H 0 . This reflects the possibly different selection pressure on a risk gene, which would result in a different population frequency. The model evidence P (x|H 1 ) and P (x|H 0 ) are given by: and P (x|H 1 ) = p(x|q, γ)p(q|H 1 )p(γ|H 1 )dqdγ. (12) The computation of the model evidence can be simplified by taking advantage of the fact that the Gamma distribution is the conjugate prior for the Poisson distribution. First, for the model evidence of H 0 , we note that P (x d |H 0 ) can be factored out because its distribution does not depend on q.
The probability of P (x 1 , x 0 |H 0 ) can be computed using the marginal distribution of the Poisson random variable (integrating the rate parameters): Multiply this with Pois(x d |2μN d ), we have: where we have the constant terms (not dependent on the parameters): and The R term is 1 if x 1 + x 0 = 0. For the model evidence of H 1 , we use the iterated integral: Integration over q is similar to the integration under H 0 above: Plugging in the relevant terms, we have: where The integration is performed numerically. Once we have the model evidence, the role of gene can be inferred by Bayes factor: In general the Bayes factor is somewhat sensitive to the values of the prior parameters, so we estimate the empirical p-value of the Bayes factor using simulations. Specifically, given the count of LoF variants of the test gene, we estimate the population frequency, q, of the mutant allele assuming that the gene is not associated with the disease. Given the prior distribution q|H 0 ∼ Gamma(ρ 0 , ν 0 ), its posterior distribution, given the counts . Thus our point estimate of q is its posterior expectation:q We obtain the null distribution of the Bayes factor by simulating data under the estimated q and γ = 1, according to Equation 7.

Estimation of the prior parameters by the Hierarchical Bayes method
The Bayes inference procedure described above depends on some prior parameters. For ease of notation, we reparameterize the prior distributions using the expectations:γ = α/β,q 1 = ρ/ν andq 0 = ρ 0 /ν 0 . We thus have the hyperparameters to be estimated: φ = (γ, β,q 1 , ν 1 ) for H 1 and φ 0 = (q 1 , ν 0 ) for H 0 . We estimate the values of these parameters by joint analysis of all genes in our dataset, following a hierarchical Bayes strategy [2]. Specifically, we assume a mixture model on all genes, with each gene having probability π to be a risk gene. The marginal likelihood for all genes, as a function of φ 1 and φ 0 , is thus: where m is the total number of genes, x represents the count data of all genes, and x i the count data of the i-th gene. The model evidence P (x i |φ 0 ) and P (x i |φ 1 ) is given by Equation 14 and 19, respectively. We assume the value of π (the fraction of risk genes) is known and maximize the marginal likelihood to obtain the values of φ 1 and φ 0 .

TADA for multiple types of mutations
When analyzing multiple types of mutations (LoF and Mis3 in our analysis of ASD data), we assume the data from each type of mutation are independent of each other. In the first step, we estimate the prior parameters of each type of mutation using the hierarchical Bayes method. Next, we compute the Bayes factor of any gene, as the product of the Bayes factor from each type of mutation. In the case of two types of mutations, we simply have: In our analysis of ASD data, we note that the mis3 mutations annotated in the data are probably a mixture of those causing protein-damaging changes that increase the risk of ASD and those having no real effects on the phenotype. For example, none of the five genes with recurrent LoF have x 1 > x 0 for mis3 mutations ( Table 1, Table S3).
We developed a mixture model for the Mis3 data. Let x be the count data of mis3 variants (de novo count, and the counts from the case-control and inherited data), I be an indicator variable of whether the mis3 mutations are informative or not, and D be the indicator variable for whether the gene is associated with the disease. Also let w be the mixture ratio, i.e., P (I|D), the probability the mis3 data are informative given that the gene is related to the disease. The BF is then defined as: where P (x|I, D)/P (x|D) is simply the BF defined before. So our procedure of computing the new BF for a gene is: where B LoF and B Mis3 are the BFs computed from LoF and Mis3 data, without the benefit of mixture modeling.

Method of moment estimation of the number of ASD genes and their relative risks
We take advantage of our probabilistic model of de novo mutations to estimate the number of ASD genes. Suppose we have m genes in the genome, k of which are ASD genes (the set denoted as A), and m − k are non-ASD genes (the set denoted as A c ). The sample size is N = 932 in this study.
For the i-th gene, its rate of LoF mutation is μ i , and its relative risk is γ i (equal to 1 for non-ASD genes). The value of μ i is known, and γ i is assumed to follow the distribution: Gamma(α, β). We write: α =γβ, whereγ is the average relative risk of all ASD genes. Let C be the observed number of LoF events in all genes, and M be the observed number of multi-hit genes (the genes with recurrent LoF mutations). Our goal is to estimate k andγ from the observed values of C and M . Following our model, Equation 4, the expected number of LoF events in all genes is given by: Assuming μ i and γ i are independent, we take the expectation on both sides: This leads to the constraint relating k andγ: To estimate the number of multi-hit genes, we define p i as the probability of the i-th gene having more than one LoF event in the de novo data. If the gene is an ASD gene, this probability is: where X i is the number of de novo LoF events in this gene across N families. If the gene is not associated with ASD, the probability is: Summing over all ASD and non-ASD genes: Taking the expectation in both sides, we have: where the two expectation terms are the average of the multi-hit probability of all genes under the ASD and non-ASD cases, respectively: Given the values ofγ and β, these expectation terms can be easily calculated according to Equations 30 and 31. There are three unknowns in Equations 29 and 33: k,γ and β, so they cannot be uniquely determined. We explore the values of k andγ under different choices of β. In our analysis of ASD data, we fix β = 1, and estimate k andγ accordingly.

Comparison with an earlier estimate of the number of ASD genes
In an earlier paper, it was estimated that the number of ASD genes (k) is about 370 [3]. This is considerably lower than our estimate (nearly 1,000). We first review the methodology of the earlier work and then explain that it may underestimate k by treating all genes as equal. Specifically, the earlier paper uses the same de novo data as ours, including 127 de novo LoF events. By comparing with the rate of de novo LoF events in the unaffected siblings, it was estimated that about half of them (65) happened in ASD genes. Among these 65 genes, 5 genes have two de novo LoF mutations. Let the number of ASD genes be k. The data suggests that if we sample from the k genes (with replacement) 65 times, five genes will be sampled twice. For each ASD gene, let X be the number of times it was sampled, then X follows Binomial distribution with probably 1/k. So we have: k · P (X ≥ 2|Bin(65, 1/k)) = 5.
Solving the equation gives k = 370. Clearly, in the above analysis, the probability of sampling one ASD gene is assumed to be equal. In fact, if we assume that all genes have the same mutation rates, and all ASD genes have the same relative risks, solving Equations 29 and 33 would lead to k = 376. We show that assuming equal mutation rates and equal relative risks significantly underestimate k. First, we use an approximation for the Poisson distribution introduced earlier (Equation 9), when the rate parameter λ is small: Applying this approximation to the probability of having multiple de novo mutations under the ASD and non-ASD scenarios (Equations 30 and 31, assuming a fixed γ i ): Taking the expectation of p : We can then solve for E (γ 2 ): where the inequality above is based on the Jensen's Inequality: [E (μ)] 2 ≤ E (μ 2 ). This suggests that if we assume all genes have equal mutation rate, and replace E (μ 2 ) above with [E (μ)] 2 , we would overestimate E (γ 2 ). Again by Jensen's inequality, we know that E (γ) 2 ≤ E (γ 2 ). Taken together, we have: Thus, the actual average γ is always lower than the estimate based on the assumption that all genes have the same mutation rates and all ASD genes have the same relative risks. Because of the relationship between k and γ (Equation 29), overestimation of γ leads to an underestimation of k.

Parameterization in the analysis of ASD data
We first discuss how we estimate the parameters of the Hierarchical Bayes (HB) model displayed in Figure 6 for the LoF data. We use a two-step procedure.
1. Using only the de novo data we identify a range of values of k, denoted as R k , that are consistent with our observations (M = 5, C = 123). From the method of moments analysis, by varying M from 4 to 6 we estimate that there are R k = [550, 1000] ASD genes (main text, Figure 3B). For each k ∈ R k , the fraction of ASD genes π(k) = k/18, 900 (18,900 is the number of human genes), and the correspondingγ(k) is determined by Equation 29. For example, at k = 550, we haveγ = 38.5, and at k = 1000, we haveγ = 20.2.
2. Using the transmitted and case-control data, we performed the HB analysis for a grid of values of k ∈ R k (k = 550, 600, · · · , 1000) and the corresponding values of π(k) andγ(k). We maximize the marginal likelihood defined by Equation 23. In the HB analysis, the other parameters in φ 1 and φ 0 are free parameters, but we limit the search range of the variance parameters of H 1 : β ≤ 1, and ν 1 ≤ 10000 (our simulations suggest that the likelihood becomes unstable outside of this range). We found that the likelihood was maximized at k = 1000. Setting k = 1000, the resulting values of the hyperparameters from the HB optimization are shown in Table S1.
We also performed a heuristic estimate ofq 1 (the mean LoF frequency of ASD genes) to check its consistency with the HB estimations. We analyze the list of 111 genes that have at least one de novo LoF mutation in the combined 932 families, and the counts of the LoF mutations of these genes in the AASC controls. It has been estimated that about half of these genes are ASD genes [3]. In the control sample we observe 64, 20 and 27 genes which have 0, 1 and ≥ 2 LoF hits, respectively. Based on the high penetrance of LoF mutations, we infer that genes with 0 counts in the controls are much more likely ASD candidates than those with 2 or more counts. The 20 single-count genes create more ambiguity. Four of these have 2 or more counts in cases, rendering them likely ASD candidates as well. This simple analysis suggests that in the 55 candidate ASD genes, somewhere between 4 and 20 LoF hits were observed, providing an interval estimate forq 1 of 8.4 × 10 −5 to 4.2 × 10 −4 . Furthermore, we note that the genes with higher mutation rates are more likely to have de novo LoF mutations, thus the group of 111 genes should be enriched with high mutation-rate genes. Because we are estimating the average q 1 of all ASD genes, we need to correct for the higher mutation rates in our list. We estimate via simulation the mean mutation rate of genes, given that the genes have at least one de novo LoF. It is about 1.7 times higher than average. We assume that the population frequency is proportional to μ (from mutation-selection balance in population genetics, q = μ/s, where s is the selection coefficient), thus we correctq 1 by dividing 1.7, and this leads to an estimate ofq 1 from 4.9 × 10 −5 to 2.4 × 10 −4 for ASD genes.
For Mis3 mutations, we first estimate w in Equation 26, the fraction of informative Mis3 mutations. We compile a small list of highly confident ASD genes. (1) We first choose the most likely ASD genes from a list of about 100 candidates from a recent review paper [4]: TSC2, TSC1, NLGN4X, SCN1A, CNTNAP2, CHD7, CACNA1C and TBX1. (2) We intersect this list with the list of genes with at least one de novo LoF mutation in the probands, and this results in five additional genes: FOXP1, GRIN2B, MBD5, NRXN1, FMR1. (3) We choose all five double-hit genes: KATNAL2, CHD8, DYRK1A, POGZ and SCN2A (Table 1, Table S3). (4) We add two additional genes TBR1 and CUL3 that have been replicated recently [5,6]. In this total list of 20 genes: 11 genes have more Mis3 mutations in cases than in controls. So we estimate that the Mis3 data are informative in about 55% of the risk genes. We thus choose w = 0.55 in experiments.
Assuming k = 1000 ASD genes from the LoF analysis, we estimate the averageγ of Mis3 mutations using Equation 29. Specifically, we observe E = 319 de novo Mis3 mutations in probands. The expected number of mutations, 2Nμm, is 253.7 (number of de novo Mis3 mutations in siblings, normalized by the number of siblings). There is clearly, however, some uncertainty of this estimate, so we allow a 5% error of this expected number (241 to 266), and this leads to a range ofγ in (4.73, 7.12). We then search for the bestγ in this range with the HB method. Similar to the LoF analysis, we restrict the variance parameters of γ and q under H 1 : β ≤ 1, ν 1 ≤ 2000 (ν 1 is lower to allow more variance of q, which is expected for missense mutations, as opposed to LoF mutations).
In summary, the parameters we use in the predictions are k = 1000 genes, or π = 1000/18, 900 = 0.053, the weight of the Mis3 mutations, w = 0.55, and the hyperparameters for the LoF and Mis3 mutations are listed in Table S1.

Controlling FDR in simulations
In our simulations, we apply the three tests, De Novo, Meta and TADA, to the data simulated from the procedure described in Methods. We describe how we control FDR for each test. Specifically, for any simulated dataset (about 18,000 genes), we calculate M 1 , the number of true discoveries, and FDR, under various significance levels. At any significance level α, we count the number of genes with p-value below α and denote it by M + . We also estimate the number of false positives at the level α, M 0 . For the Meta and TADA tests, this is simply: where m = 18, 000 is the total number of genes, and k = 1, 000 is the number of risk genes. This follows from the uniform distribution of p-values under the null model. For the De Novo test, because p-values are generally discrete (a gene can have 0, 1 or 2 de novo LoF mutations), they do not follow the uniform distribution; however, we can easily estimate M 0 as follows. For the i-th gene, let p i be the probability that it reaches a p-value below α, assuming it is not related to the disease. Because we know that the count of genes under the null model follows the distribution X i ∼ Pois(2μ i N ), this count is easily calculated for any α. Consequently we have: Once M 0 is estimated, we have M 1 = M + −M 0 for any significance level α, and the FDR is M 0 /M + . To meet the FDR target 0.1, we choose the largest α that leads to FDR below 0.1, and obtain the corresponding M 1 as the number of true discoveries.

Genomic control for TADA
In Liu [7] it was noted that a variation on the traditional genomic control (GC) factor performs better when the test statistic follows the assumed distribution for small p-values, but does not closely follow this distribution for p-values that are not of interest. TADA offers an example of this phenomenon: for a substantial number of genes all LoF counts are equal to 0 creating a mass of statistics with equal p-value. (For these data, this mass occurs near .3.) Naturally, these genes are not of interest, but this mass of p-values can cause a substantial inflation in λ, as traditionally calculated.
To compute the traditional GC factor, we define p med to be the median p-value and let λ = F −1 (1 − p med )/.455, where F −1 is the inverse CDF of the χ 2 1 distribution. To obtain a more robust GC factor, let p q be the first quantile of the p-values, and define λ q = F −1 (1 − p q )/1.32. This statistic has expectation 1 if the statistic follows the assumed model. Values bigger than 1 arise due to deviations from the assumed distribution such as would arise due to population stratification.

Variant Calling
De novo events were called as described in the relevant publications [8,9,10,11]. Variant calling for the case-control data is described in Liu et al. [7]. To call transmitted variants for the portion of the data derived from the Simon's trios the following procedures were utilized. Whole-exome data was obtained using the NimbleGen EZ SeqCap v2 and Illumina HiSeq for 656 ASD families (473 quartets including an unaffected sibling, and 183 trios with only an affected proband). The reads were aligned to hg19 using BWA and variants were predicted using Samtools after removal of duplicate reads for all family members. An in-house script was used to exclude variants that were over 37bp off-target (i.e. 50% of the 75bp read-length from the target). Fifteen families were excluded on the basis of at least one family member having a number of predicted variants that was > 3 standard deviations from the mean leaving 641 families (1,282 parents; 460 quartets, 181 trios). All predicted substitution variants were cross-referenced with the proband and sibling data to identify the percentage of unique reads in the offspring that support the parental variant; a variant was considered transmitted if this value was greater than 20%. Parental variants were annotated against UCSC gene definitions and the Exome Sequencing Project (ESP) variant database for 6,500 control individuals [12].
To ensure high quality data the variant were filtered to include only the variants with: ≥ 20 unique reads in all family members; ≥ 6 unique reads supporting the variant in the parent; heterozygous variants in the parent; and a Samtools SNP quality score ≥ 50 in the parent. Common variants were excluded by removing any variant that was present at > 1% population frequency in the ESP controls and/or the 1,282 parents. After filtering, 364,445 variants remained giving a mean of 284 variants per parent.