A Flexible, Efficient Binomial Mixed Model for Identifying Differential DNA Methylation in Bisulfite Sequencing Data

Identifying sources of variation in DNA methylation levels is important for understanding gene regulation. Recently, bisulfite sequencing has become a popular tool for investigating DNA methylation levels. However, modeling bisulfite sequencing data is complicated by dramatic variation in coverage across sites and individual samples, and because of the computational challenges of controlling for genetic covariance in count data. To address these challenges, we present a binomial mixed model and an efficient, sampling-based algorithm (MACAU: Mixed model association for count data via data augmentation) for approximate parameter estimation and p-value computation. This framework allows us to simultaneously account for both the over-dispersed, count-based nature of bisulfite sequencing data, as well as genetic relatedness among individuals. Using simulations and two real data sets (whole genome bisulfite sequencing (WGBS) data from Arabidopsis thaliana and reduced representation bisulfite sequencing (RRBS) data from baboons), we show that our method provides well-calibrated test statistics in the presence of population structure. Further, it improves power to detect differentially methylated sites: in the RRBS data set, MACAU detected 1.6-fold more age-associated CpG sites than a beta-binomial model (the next best approach). Changes in these sites are consistent with known age-related shifts in DNA methylation levels, and are enriched near genes that are differentially expressed with age in the same population. Taken together, our results indicate that MACAU is an efficient, effective tool for analyzing bisulfite sequencing data, with particular salience to analyses of structured populations. MACAU is freely available at www.xzlab.org/software.html.

Text S1: Detailed Methods

Binomial Mixed Model
To detect differentially methylated sites, we model each potential target of DNA methylation one site at a time. For each site, we consider the following binomial mixed model (BMM): where r i is the total read count for ith individual; y i is the methylated read count for that individual, constrained to be an integer value less than or equal to r i ; and π i is an unknown parameter that represents the true proportion of methylated reads for the individual at the site. We use a logit link to model π i as a linear function of parameters: g = c(g 1 , · · · , g n ) T ∼ MVN(0, σ 2 h 2 K), e = c(e 1 , · · · , e n ) T ∼ MVN(0, σ 2 (1 − h 2 )I n×n ), where logit denotes a logistic transformation logit(π i ) = log( πi 1−πi ); λ i = πi 1−πi is the odds; w i is a c-vector of covariates including an intercept and α is a c-vector of corresponding coefficients; x i is the predictor of interest and β is its coefficient; g is an n-vector of genetic random effects that model correlation due to population structure or individual relatedness; e is an n-vector of environmental residual errors that model independent variation; K is a known n by n relatedness matrix that can be calculated based on a pedigree or genotype data and that has been standardized to ensure tr(K)/n = 1 (this ensures that h 2 lies between 0 and 1, and can be interpreted as heritability, see [1]); I is an n by n identity matrix; σ 2 h 2 is the genetic variance component; σ 2 (1 − h 2 ) is the environmental variance component; h 2 is the heritability of the logit transformed methylation proportion (i.e. logit(π)); and MVN denotes the multivariate normal distribution.
The binomial mixed model proposed here belongs to the generalized linear mixed model family [2]. Both g and e model over-dispersion, the increased variance in the data that is not explained by the binomial model. However, they model different aspects of over-dispersion: e models the variation that is due to independent environmental noise (a known problem in data sets based on sequencing reads), while g models the variation that is explained by kinship or population structure. Effectively, our model improves and generalizes the previous beta binomial model by introducing this extra g term to model individual relatedness due to kinship, population structure, or stratification.

Inference Method Overview
We are interested in testing the null hypothesis H 0 : β = 0. This requires obtaining the maximum likelihood estimateβ from the model. Unlike its linear counter-part, obtaining the estimate of β from the binomial mixed model is not a trivial task, as the joint likelihood consists of an n-dimensional integral that cannot be solved analytically [2]. Previous frequentist approaches to address this problem include direct numerical integration using Gauss-Hermite quadrature [3], or Laplace approximation that is applied to the likelihood function [4] or the quasi-likelihood function [5][6][7][8]. However, both numerical integration and analytic approximation do not scale well with the increasing dimension of the integral, which unfortunately equals the sample size in our model. Even a second order Laplace approximation yields a biased estimate and overly narrow confidence interval, especially when the uncertainty in the variance component estimate is large [9][10][11][12][13]. Therefore, frequentist approaches for estimation and inference in the binomial mixed model remain notoriously difficult and is still an active area of research [14].
In contrast to the frequentist methods, Markov chain Monte Carlo (MCMC)-based Bayesian approaches provide an appearing alternative [11]. Bayesian methods naturally account for the uncertainty in the variance component estimates and can achieve arbitrary inference accuracy if the chain is allowed to run long enough. Despite these attractive theoretical features, however, constructing an efficient MCMC algorithm for practical problems is not easy. Previous MCMC approaches for generalized linear mixed models either require a normal approximation to the likelihood function that diminishes its gains over the frequentist methods [15,16], or use n-steps of Metropolis-Hastings algorithm to sample the n-dimensional latent rate variables where efficient proposal distributions for all of them can be hard to construct [17,18]. To improve upon these previous approaches, a new MCMC algorithm [19][20][21] has been recently developed based on auxiliary variable representation of the binomial distribution [22]. By introducing latent variables to replace the observed count data, the algorithm makes sampling and computation relatively straightforward.
Therefore, we rely on this particular form of MCMC in the present study. Our main contribution is to further develop an accurate approximation to the distribution of these latent variables, where the approximation form is specifically designed to allow us to adapt recent mixed model innovations [23][24][25][26] that substantially reduce the computational burden. By using a mean-normal mixture approximation to the negative log gamma distribution, our approach reduces the per-MCMC iteration computational complexity from O(n 3 ) to O(n 2 ), where n is the sample size. This modification allows the binomial mixed model to be efficiently applied to hundreds of individuals and millions of methylation sites.
Although we use MCMC for posterior sampling, our primary goal is not to perform a Bayesian analysis by producing Bayes factors for model comparison (although this is an interesting area to explore in the future). Rather, our goal is to use MCMC as a convenient and accurate tool to obtain the marginal likelihood of β that is otherwise infeasible or inaccurate to obtain under various frequentist approaches. Under asymptotics, both the likelihood function and the marginal posterior distribution for β will be approximately normal [27]. Since the likelihood function is simply the difference between the posterior and the prior, once we have obtained the posterior mean and standard deviation of β and paired these values to their prior counter-parts, we can easily obtain the approximate likelihood function and compute the approximate maximum likelihood estimateβ and its standard error se(β) using the method of moments. We can then construct approximate Wald test statistics and p values for hypothesis testing.
In the present study, we use flat priors for all nuisance parameters (α, σ 2 , h 2 ), or p(α) ∝ 1, p(σ 2 ) ∝ 1 and p(h 2 ) ∝ 1. (Notice that a uniform distribution for σ 2 on the log scale, or p(log(σ 2 )) ∝ 1, would make the posterior distribution different from the likelihood.) For the parameter of interest, β, we could also use a flat prior, in which case the posterior would be the likelihood. For computational stability reasons, however, we use a relatively informative prior, β ∼ N (0, σ 2 b ) instead. A relatively informative prior restricts the sampling space when the likelihood is not informative, allowing efficient posterior sampling. Since we rely on the difference betwen the posterior and the prior for approximate inference, the choice of prior for β does not influence the eventual results. In the present study, we set σ 2 b = 1. Applications to real data confirm that this procedure produces well-calibrated p-values (Figure 1), suggesting that a few dozen samples are large enough to ensure asymptotic behavior. Moreover, although our approach is inherently stochastic -because the posterior mean and standard deviation of β may be slightly different for different chains -we show that a thousand MCMC iterations per site is large enough to produce stable estimates of the test statistics and p values ( Figure S2).

Data Augmentation
To bypass the difficult likelihood function that results from the count nature of the data, we introduce continuous auxiliary variables to replace y i . For ith individual, observing y i methylated reads out of r i total reads is equivalent to observing a sequence of r i binary read indicators (y i1 , · · · , y iri ), where y ij = 1 indicates that the jth read is a methylated read and y ij = 0 indicates otherwise. Obviously, y i = ri j=1 y ij . We can view each y ij as a random variable generated from a logistic regression model with mean log(λ i ). We further introduce a continuous latent variable u ij [19,20], often referred to as a utility [22], such that where EV(0, 1) denotes a standard type-1 extreme value distribution with density function e −x e −e −x . Then where 0 ij ∼ EV(0, 1). The above two equations come from the fact that the difference between two type-1 extreme value distributed random variables follows a logistic distribution, and a random variable that follows a logistic distribution serves as a liability variable for a logistic regression [22].
The attractive feature of introducing this set of independently and identically distributed u ij is that, conditional on all u ij , the posterior of (α, β, σ 2 , h 2 ) no longer depends on the observed methylated read indicator y ij , hence removing the non-linearity constraint that comes with the binomial aspect of our model. Applying the relationship between the EV distribution and the exponential distribution, we have e −uij ∼ Exp(λ i ) and e − 0 ij ∼ Exp(1), where Exp denotes the exponential distribution. This relationship allows us to easily sample u ij conditional on λ i and y ij based on the convenient exponential distribution rather than the more difficult EV distribution, as e −uij ∼ Exp(1 + λ i ) if y ij = 1 and e −uij ∼ Exp(1 + λ i ) + Exp(λ i ) if y ij = 0.
An undesirable feature of the above approach, however, is that we have to work with a much larger latent space of u ij than the original n observations of y i . Effectively, we have to retain the data at the individual read level. This drawback can be mitigated by combining all exponentiated negative latent utilities together [21], by introducing a new latent variable where i = − log( ri j=1 e − 1 ij ) follows a negative log gamma distribution, − log(Ga(r i , 1)); Ga denotes a gamma distribution with the two parameters representing shape and rate, respectively. This is because a gamma random variable is a summation of independent exponential random variables with a same rate parameter.
Using the latent variable z i instead of u ij reduces the size of the latent space back to the observed space. Conditional on z i , we again do not need to use y i , allowing us to bypass the count feature of the observed data in the algorithm.

Normal Mixture Approximation
To further circumvent the difficulty introduced by the non-normality of i , we follow previous ideas [20,21] to approximate the non-normal distribution by using a mixture of normals. Importantly, we take advantage of recent innovations in efficient mixed model algorithms [23][24][25][26] by using a mean mixture of normals where each normal distribution has a different mean but share the same variance.
Specifically, for every possible integer value of r, we identify a normal approximation in the form of kr k=1 w rk N(m rk , s 2 r ), to the negative log gamma distribution − log (Ga(r, 1)). Because the mean (−Ψ(r), where Ψ denotes a digamma function) and the variance (Ψ (r), where Ψ denotes a trigamma function) of the negative log gamma distribution is a function of r, to ensure approximation stability we work on the standardized version of the negative log gamma distribution, by centering with the mean and standardizing with the standard deviation. Then, we estimate the number of components k r , the weights w rk , the means m rk and the variance s 2 r via the Nelder-Mead algorithm by minimizing the Kullback-Leibler (KL) divergence between the two distributions. These parameter estimates ensure that the KL divergence is smaller than 0.0005, so that the difference between the approximate and the exact distributions are ignorable in practice. Because the negative log gamma distribution asymptotically approximates a normal distribution, the approximation becomes easier for larger r. Therefore, we can use increasingly smaller number of normal components for accurate approximation.
For small values of r (r ∈ [1, 5]), we provide detailed parameter values in Table S1. For median values of r (r ∈ [6, 169]), we no longer need to store parameters for every r. Instead, we can interpolate the weight, mean and variance estimates across the range of r using rational functions without loss of accuracy. These functions are provided in the Table S2. For large values of r (r ∈ [170, ∞), we use a single normal distribution N(0, Ψ (r)) for approximation. The mean normal mixture approximations are accurate. Even in the most difficult case where r = 1, we only observe small difference between the approximate and the exact distributions ( Figure S3).

Detailed Sampling Steps and Efficient Computation
Now we are ready to describe the detailed MCMC algorithm. Here, with the normal mixture approximation, we have We introduce a vector of latent indicators γ = (γ 1 , · · · , γ n ), where each γ i ∈ (1, · · · , k ri ) indicates which normal component the corresponding i is from. Conditional on z i and (α, β, g i , e i ), we have where k ∈ (1, · · · , k ri ) and Φ denotes the normal density function. Conditional on γ, we can integrate out α, β, g, e and analytically to obtain the marginal distribution of σ 2 and h 2 , where z = (z 1 , · · · , z n ) T , m γ = (m r1γ1 , · · · , m rnγn ) T , W = (w 1 , · · · , w n ) T , D r is an n by n diagonal matrix with iith element σ 2 We can use the Metropolis-Hastings (MH) algorithm to obtain posterior samples for σ 2 and h 2 jointly. Afterwards, we can obtain posterior samples for α, β and g + e in turn, Finally, conditional on y i and λ i , the posterior of z i is easy to sample. By using the relationship between the gamma distribution and the exponential distribution, we have The most computationally expensive part of the algorithm is the MH step: a naive approach to evaluate P (σ 2 , h 2 |z i , γ i ) would involve cubic operations. Our mean normal mixture approximation allows us to evaluate this marginal likelihood efficiently as we can apply here the mixed model innovations developed recently [23][24][25][26]. This is because given the observed data, D r is a fixed diagonal matrix where the elements do not depend on a γ that changes in every MCMC iteration. Therefore, for a given matrix V, we can perform an eigen-decomposition on D r Ux. This procedure avoids any cubic operations later on in the MCMC steps. Therefore, with the mean normal mixture approximation, we only need to perform eigen-decompositions at the beginning of the MCMC. Afterwards, each Gibbs step only requires quadratic operations (transformation of z − m γ ). In practice, because V is a function of h 2 , we assign a discrete uniform prior for h 2 and evaluate the eigen-decompositions on every discrete values of h 2 . In the present study, we found that using either 10 or 100 discrete values of h 2 yields almost identical results (and we present the analyses results for the formal in the main text), suggesting that a fine grid for h 2 is not necessary because of our small sample size. Finally, for all analyses in the present study, we ran 1100 Gibbs sampling iterations with the first 100 as burn-in. In each Gibbs iteration, after sampling the latent variables z and the latent indicators γ, we further ran 10 MH steps before continuing the Gibbs iterations.

Parameter Estimation and p Value Computation
Denoteβ as the posterior mean and σ 2 β as the posterior variance. Since both the likelihood and the posterior follow normal distributions asymptotically, and because we also use a normal distribution as the prior distribution, we can easily obtain the approximate maximum likelihood estimate and its standard error by the method of moments, orβ The condition σ 2 b > σ 2 β is guaranteed by asymptotics. In rare cases, however, this condition may not be satisfied because of the limited MCMC sampling iterations in practice. This may be particularly concerning for sites where the likelihood function is not informative. Arguably, these non-informative sites are the ones that we do not want to perform analysis on in the first place. Therefore, this condition gives us a natural way to perform post-filtering. In the software implementation, we do not analyze sites where σ 2 β ≥ cσ 2 b for a user defined threshold c (c ≤ 1). We use c = 0.95 throughout the present study. This post-filtering step, however, has minimal influence on the results, as only a few dozen sites, out of half a million, are filtered out in each analysis.