msCentipede: Modeling Heterogeneity across Genomic Sites and Replicates Improves Accuracy in the Inference of Transcription Factor Binding

Understanding global gene regulation depends critically on accurate annotation of the regulatory elements that are functional in a given cell type. CENTIPEDE, a powerful probabilistic framework for identifying transcription factor binding sites from tissue-specific DNase I cleavage patterns and genomic sequence content, leverages the hypersensitivity of factor-bound chromatin and the information in the DNase I spatial cleavage profile characteristic of each DNA binding protein to accurately infer functional factor binding sites. However, the model for the spatial profile in this framework fails to account for the substantial variation in DNase I cleavage profiles across different binding sites. Nor does it account for variation in the profiles at the same binding site across multiple replicate DNase I experiments, which are increasingly available. In this work, we introduce new methods, based on multi-scale models for inhomogeneous Poisson processes, to account for such variation in DNase I cleavage patterns both within and across binding sites. These models account for the spatial structure in the heterogeneity of DNase I cleavage patterns for each factor. Using DNase-seq measurements assayed in a lymphoblastoid cell line, we demonstrate the improved performance of this model for several transcription factors by comparing against ChIP-seq peaks for those factors. Finally, we explore the effects of DNase I sequence bias on the inference of factor binding using a simple extension to our framework that allows for a more flexible background model. The proposed model can also be easily applied to paired-end ATAC-seq and DNase-seq data. msCentipede, a Python implementation of our algorithm, is available at http://rajanil.github.io/msCentipede.


msCentipede model for one replicate
Consider a genomic window of length $L$ around each of $N$ putative binding sites. We define $X^n = (X^n_l)_{l=1}^{L}$ for $n = 1, \ldots, N$, where $X^n_l$ is the read count at the $l$-th base pair in the window around the $n$-th site. Let $Z_n$ denote a binary indicator for whether the $n$-th site is bound ($Z_n = 1$). A mixture model at the $n$-th site can be written as
$$P(X^n) = P(X^n \mid Z_n = 1)\,P(Z_n = 1) + P(X^n \mid Z_n = 0)\,P(Z_n = 0), \quad (1)$$
where the prior probability $P(Z_n = 1) = \zeta_n$ can be modeled as a logistic function of genomic information (e.g., position weight matrix score and sequence conservation score of the motif instance) as in CENTIPEDE. Specifically, if $S_n = (S_{n1}, \ldots, S_{nF})$ is a vector representing various genomic features for the $n$-th site, then
$$\zeta_n = \frac{1}{1 + e^{-\sum_f \gamma_f S_{nf}}},$$
where $\gamma_f$ captures how much the $f$-th feature informs the prior probability that a site is bound by a transcription factor.
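As a concrete illustration, the prior $\zeta_n$ is just a logistic function of a weighted sum of site-level features. A minimal sketch (the function name, feature values, and weights below are ours, chosen for illustration, not taken from the msCentipede API):

```python
import numpy as np

def prior_bound_probability(S, gamma):
    """Prior P(Z_n = 1) as a logistic function of genomic features.

    S     : (N, F) array of site-level features (e.g. PWM score,
            conservation score); values below are made up.
    gamma : (F,) array of feature weights.
    """
    logits = S @ gamma              # sum_f gamma_f * S_nf for each site
    return 1.0 / (1.0 + np.exp(-logits))

# two hypothetical sites with two features each
S = np.array([[2.0, 0.5],
              [-1.0, 0.1]])
gamma = np.array([1.0, 2.0])
zeta = prior_bound_probability(S, gamma)   # one prior per site
```

A site with a strong motif score and high conservation gets a prior near 1; a weak site gets a prior near 0, before any DNase I data is seen.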
At the zeroth scale, we model the total number of reads in the entire region, $Y^n_{01} := \sum_l X^n_l$, as $Y^n_{01} \sim \text{Poisson}(\lambda_n)$. At the first scale, conditional on $Y^n_{01}$, we model the number of reads in the first half of the region, $Y^n_{11} := \sum_{l \le L/2} X^n_l$, using a binomial distribution, $Y^n_{11} \mid Y^n_{01} \sim \text{binomial}(Y^n_{01}, p^n_{01})$. The number of reads in the second half of the region is then determined: $Y^n_{12} := Y^n_{01} - Y^n_{11}$. At the second scale, conditional on $Y^n_{11}$, we model the read count in the first quarter of the region using a binomial distribution, $Y^n_{21} \mid Y^n_{11} \sim \text{binomial}(Y^n_{11}, p^n_{11})$, which determines the read count in the second quarter, $Y^n_{22} := Y^n_{11} - Y^n_{21}$. Similarly, conditional on $Y^n_{12}$, the read count in the third quarter is modeled as $Y^n_{23} \mid Y^n_{12} \sim \text{binomial}(Y^n_{12}, p^n_{12})$, which determines the read count in the fourth quarter, $Y^n_{24} := Y^n_{12} - Y^n_{23}$. This process continues through the scales: at the $J$-th (finest-resolution) scale, there are $L = 2^J$ parts of the region, and the read count in the $l$-th part, $Y^n_{Jl}$, is equal to $X^n_l$. In general, the read counts are modeled as $Y^n_{j,2k-1} \mid Y^n_{j-1,k} \sim \text{binomial}(Y^n_{j-1,k}, p^n_{j-1,k})$ with $Y^n_{j,2k} := Y^n_{j-1,k} - Y^n_{j,2k-1}$, for $k = 1, \ldots, 2^{j-1}$ (see Figure S1 for a brief illustration of the model).

Figure S1: Multi-scale models for inhomogeneous Poisson processes. Suppose we have observations $X^n = (X^n_l)_{l=1}^{4}$ on four bases. Multi-scale models assume that $Y^n_{01} \sim \text{Poisson}(\lambda_n)$, $Y^n_{11} \mid Y^n_{01} \sim \text{binomial}(Y^n_{01}, p^n_{01})$ (then $Y^n_{12} = Y^n_{01} - Y^n_{11}$), $Y^n_{21} \mid Y^n_{11} \sim \text{binomial}(Y^n_{11}, p^n_{11})$ (then $Y^n_{22} = Y^n_{11} - Y^n_{21}$), and $Y^n_{23} \mid Y^n_{12} \sim \text{binomial}(Y^n_{12}, p^n_{12})$ (then $Y^n_{24} = Y^n_{12} - Y^n_{23}$). It follows from elementary properties of the Poisson distribution (see [2] and [3] for details) that this is equivalent to $Y^n_{2l} = X^n_l \sim \text{Poisson}(\mu^n_l)$ for $l = 1, 2, 3, 4$, where $\mu^n_1 = \lambda_n p^n_{01} p^n_{11}$, $\mu^n_2 = \lambda_n p^n_{01}(1 - p^n_{11})$, $\mu^n_3 = \lambda_n (1 - p^n_{01}) p^n_{12}$, and $\mu^n_4 = \lambda_n (1 - p^n_{01})(1 - p^n_{12})$.
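The deterministic part of this decomposition, computing the counts $Y^n_{jk}$ from the base-pair counts, can be sketched as follows (the helper name is ours, not part of msCentipede):

```python
import numpy as np

def multiscale_counts(x):
    """Compute the multi-scale counts Y_jk from base-pair counts x.

    x must have length L = 2^J. Returns a list Y where Y[0] holds the
    total count and Y[j] holds the counts in the 2^j parts at scale j;
    Y[J] recovers x itself.
    """
    x = np.asarray(x)
    L = x.size
    J = int(np.log2(L))
    assert 2 ** J == L, "window length must be a power of two"
    Y = [np.array([x.sum()])]
    for j in range(1, J + 1):
        # split the window into 2^j equal parts and sum within each
        Y.append(x.reshape(2 ** j, L // 2 ** j).sum(axis=1))
    return Y

Y = multiscale_counts([3, 1, 0, 2])
# Y[0] -> [6], Y[1] -> [4, 2], Y[2] -> [3, 1, 0, 2]
```

Since each scale is a deterministic refinement of the previous one, the collection $\{Y^n_{jk}\}$ carries exactly the same information as the raw counts $X^n$.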
When $Z_n = 1$, heterogeneity across genomic sites is modeled as follows:
$$\lambda_n \sim \text{gamma}\!\left(\alpha, \tfrac{\alpha}{\bar\lambda_0}\right), \qquad p^n_{jk} \sim \text{beta}\!\left(\bar p_{jk}\tau_j, (1 - \bar p_{jk})\tau_j\right),$$
so that, marginally, $Y^n_{01} \sim \text{negative binomial}(\alpha, \bar\lambda_0)$ and each binomial split becomes a beta-binomial. Here $\alpha$ and $\bar\lambda_0$ capture the mean ($\bar\lambda_0$) and variance ($\bar\lambda_0^2/\alpha$) in the overall DNase I hypersensitivity of TF-bound genomic locations, and $\bar p_{jk}$ and $\tau_j$ capture the mean and variance in the DNase I cleavage profiles across TF-bound genomic locations. When $Z_n = 0$, we model heterogeneity only in the total DNase I hypersensitivity, $Y^n_{01} \sim \text{negative binomial}(\alpha_o, \bar\lambda_{0o})$, keeping the cleavage profile fixed.
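The bound-site model can be illustrated by sampling from it. The sketch below uses a single shared split-probability mean for simplicity, whereas the actual model has a separate mean $\bar p_{jk}$ per node $(j, k)$; the function name and all parameter values are illustrative, not msCentipede defaults:

```python
import numpy as np

def sample_bound_site(L, alpha, lam0, p_mean, tau, rng):
    """Draw base-pair counts at one bound (Z_n = 1) site.

    lambda_n ~ gamma(alpha, rate=alpha/lam0), so the total count is
    marginally negative binomial with mean lam0 and variance
    lam0^2/alpha; each left/right split probability is drawn from a
    beta distribution with mean p_mean and concentration tau.
    """
    J = int(np.log2(L))
    lam_n = rng.gamma(shape=alpha, scale=lam0 / alpha)
    counts = np.array([rng.poisson(lam_n)])     # zeroth scale: total reads
    for _ in range(J):
        # draw one split probability per current region, then split
        p = rng.beta(p_mean * tau, (1.0 - p_mean) * tau, size=counts.size)
        left = rng.binomial(counts, p)
        counts = np.column_stack([left, counts - left]).ravel()
    return counts

rng = np.random.default_rng(0)
x = sample_bound_site(L=8, alpha=2.0, lam0=50.0, p_mean=0.5, tau=10.0, rng=rng)
```

Small values of $\alpha$ (or $\tau$) produce strong site-to-site variation in total counts (or in the cleavage profile); large values recover the homogeneous Poisson-multinomial model.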

msCentipede model for multiple replicates
Suppose we have $S$ replicate DNase-seq measurements for a particular cell type. Given a genomic window of length $L$ around each of $N$ putative binding sites, we define $X^{n,s} = (X^{n,s}_l)_{l=1}^{L}$ for $n = 1, \ldots, N$ and $s = 1, \ldots, S$, where $X^{n,s}_l$ is the read count at the $l$-th base pair in the window around the $n$-th site for the $s$-th replicate.
Ideally, one would model the variation across genomic sites and the variation across replicates separately. In practice, however, we usually have only two or three replicate DNase-seq measurements in a given cell type, making it difficult to accurately quantify the variation across replicates. Instead, we assume that the variation across replicates and the variation across genomic sites follow the same distribution: the replicate-level counts $X^{n,s}$ are treated as independent draws from the site-level model, with replicate-specific parameters for the total read counts and profile parameters $(\bar p_{jk}, \tau_j)$ shared across replicates. The background model $P(X^n \mid Z_n = 0)$ can be constructed in a similar way.
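Under this assumption, the marginal likelihood of a split is a beta-binomial with shared profile parameters, summed over replicates. A sketch of that evaluation (function names are ours; scipy supplies the special functions):

```python
import numpy as np
from scipy.special import betaln, gammaln

def betabinom_logpmf(k, n, a, b):
    """log P(k) for the beta-binomial(n; a, b) distribution."""
    return (gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
            + betaln(k + a, n - k + b) - betaln(a, b))

def split_loglik(y_left, y_parent, p_mean, tau):
    """Log likelihood of one left/right split, summed over replicates.

    y_left, y_parent : (S,) arrays of counts, one entry per replicate.
    The profile parameters (p_mean, tau) are shared across replicates,
    as described above.
    """
    a, b = p_mean * tau, (1.0 - p_mean) * tau
    return betabinom_logpmf(y_left, y_parent, a, b).sum()

# three replicates of the same split, evaluated under shared parameters
ll = split_loglik(np.array([4, 3, 6]), np.array([10, 8, 11]), 0.5, 4.0)
```

Because the beta-binomial marginalizes out the site- and replicate-specific split probabilities, no per-replicate profile parameters need to be estimated.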

Maximum likelihood estimation and inference
Inference for the above model requires computing the posterior distribution $P(Z \mid X)$, evaluated at the maximum likelihood estimate $\Theta^*$ of the model parameters, where $\Theta = \{\bar p_{jk}, \tau_j, \alpha_s, \bar\lambda^s_0, \alpha^s_o, \bar\lambda^s_{0o}\}$ (the subscript $o$ denotes the corresponding background-model parameters). This problem is equivalent to finding, within a parametric family of distributions $q(Z)$, the distribution that has the smallest Kullback-Leibler (KL) divergence to the posterior distribution of interest.
The first term in Equation 15 is the log likelihood of the data for a fixed value of the model parameters $\Theta$. Thus, minimizing the KL divergence is equivalent to maximizing the function $F[q(Z); \Theta]$, keeping the model parameters fixed. Note that, when the true posterior distribution $P(Z \mid X)$ lies in the specified parametric family, the maximum value $F[q^*(Z); \Theta]$ equals the log likelihood of the data, $\log P(X; \Theta)$ (i.e., the minimum KL divergence is zero). The maximum likelihood estimate of the model parameters $\Theta$ can then be obtained by maximizing $F[q^*(Z); \Theta]$. Maximizing $F[q(Z); \Theta]$ with respect to $q(Z)$ and $\Theta$, iteratively until convergence, yields the maximum likelihood estimates of the model parameters and the posterior probability of transcription factor binding. Conditional on the model parameters, the data at all putative binding sites in all replicates are independent. Thus, the true posterior $P(Z \mid X)$ factorizes as $P(Z \mid X) = \prod_n P(Z_n \mid X^n)$, motivating the following choice for $q(Z)$.
The function to be maximized can now be written, up to terms that do not depend on $\Theta$, as
$$F[q(Z); \Theta] = \sum_n \tilde\zeta_n \log P(X^n \mid Z_n = 1, \Theta) + (1 - \tilde\zeta_n) \log P(X^n \mid Z_n = 0, \Theta). \quad (18)$$
Maximizing $F$ with respect to $\tilde\zeta_n$ for fixed $\Theta$ gives
$$\tilde\zeta_n = \frac{\zeta_n P(X^n \mid Z_n = 1, \Theta)}{\zeta_n P(X^n \mid Z_n = 1, \Theta) + (1 - \zeta_n) P(X^n \mid Z_n = 0, \Theta)}.$$
This equation gives us the posterior probability that a site is bound, and is equivalent to the E-step in the EM algorithm. Using the model specified in the previous section, we have
$$\log P(X^n \mid Z_n = 1, \Theta) = \sum_s \Big[ \log P(Y^{n,s}_{01} \mid \alpha_s, \bar\lambda^s_0) + \sum_{j,k} \log P(Y^{n,s}_{j,2k-1} \mid Y^{n,s}_{j-1,k}, \bar p_{jk}, \tau_j) \Big],$$
where $P(Y^{n,s}_{01} \mid \alpha_s, \bar\lambda^s_0) = \text{negative binomial}(\alpha_s, \bar\lambda^s_0)$ and $P(Y^{n,s}_{j,2k-1} \mid Y^{n,s}_{j-1,k}, \bar p_{jk}, \tau_j) = \text{beta binomial}(Y^{n,s}_{j-1,k}; \bar p_{jk}\tau_j, (1 - \bar p_{jk})\tau_j)$.
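The E-step update for $\tilde\zeta_n$ is a ratio of very small likelihoods, so in practice it is computed on the log scale. A sketch (the function name is ours; inputs are per-site log likelihoods and priors):

```python
import numpy as np

def e_step_posterior(loglik_bound, loglik_unbound, zeta):
    """Posterior probability that each site is bound (the E-step).

    loglik_bound, loglik_unbound : (N,) arrays holding
        log P(X_n | Z_n = 1, Theta) and log P(X_n | Z_n = 0, Theta).
    zeta : (N,) prior probabilities P(Z_n = 1).
    Worked on the log scale to avoid underflow in the likelihoods.
    """
    log_num = np.log(zeta) + loglik_bound
    log_den = np.logaddexp(log_num, np.log1p(-zeta) + loglik_unbound)
    return np.exp(log_num - log_den)

post = e_step_posterior(np.array([-100.0, -500.0]),
                        np.array([-100.0, -400.0]),
                        np.array([0.3, 0.5]))
```

When the bound and unbound likelihoods are equal, the posterior reduces to the prior; when one model dominates by hundreds of nats, the posterior saturates at 0 or 1 without numerical overflow.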
A similar likelihood function, $\log P(X^n \mid Z_n = 0, \Theta)$, can be written for the background model. Keeping $q(Z)$ fixed, the function $F$ can be maximized with respect to the model parameters $\Theta$, within appropriate constraints, using standard optimization solvers. By iterating between computing the posterior probability $q(Z)$ (at fixed $\Theta$) and optimizing $F$ over $\Theta$ (keeping $q(Z)$ fixed), until convergence, we obtain the maximum likelihood estimates of the parameters $\Theta$. In this sense, the iterative variational optimization scheme described above is equivalent to the EM algorithm for computing the maximum likelihood estimates of the parameters.
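The overall structure of this scheme can be sketched as a generic loop; the callables `e_step` and `m_step` are placeholders for the steps described above, and this loop is not the msCentipede implementation itself:

```python
import numpy as np

def fit_em(X, theta, e_step, m_step, tol=1e-6, max_iter=200):
    """Generic sketch of the iterative variational / EM scheme.

    e_step(X, theta) -> (posteriors, F)  computes q(Z) and the
        objective F at fixed theta;
    m_step(X, posteriors) -> theta       maximizes F at fixed q(Z).
    """
    prev_F = -np.inf
    for _ in range(max_iter):
        posteriors, F = e_step(X, theta)
        theta = m_step(X, posteriors)
        if F - prev_F < tol:        # F is non-decreasing for EM
            break
        prev_F = F
    return theta, posteriors
```

Each pass cannot decrease $F$, so monitoring the change in $F$ gives a simple, monotone convergence criterion.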
In the current version of the software, we optimize the model parameters $\Theta$ using cvxopt [1], a standard primal-dual interior-point optimization solver. In addition, by treating the simple iterative algorithm above as a fixed-point solver, we can accelerate its convergence [4]. While this accelerated version reaches the optimum using fewer evaluations of the gradient and Hessian of $F$, the gain comes at the expense of the monotonicity guaranteed by the simple iterative algorithm.
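The idea behind such fixed-point acceleration can be sketched as a SQUAREM-style extrapolation. This is our simplified sketch of the general technique, not the exact scheme of [4] or of msCentipede:

```python
import numpy as np

def squarem_step(theta, update):
    """One SQUAREM-style extrapolation step for a fixed-point map
    theta -> update(theta): two plain updates are combined into a
    single extrapolated jump.
    """
    theta1 = update(theta)
    theta2 = update(theta1)
    r = theta1 - theta              # first difference
    v = (theta2 - theta1) - r       # second difference
    vv = v @ v
    if vv == 0.0:                   # map already converged locally;
        return theta2               # fall back to the plain update
    step = -np.sqrt(r @ r) / np.sqrt(vv)
    return theta - 2.0 * step * r + step ** 2 * v

# for a linear contraction, one accelerated step lands on the fixed point
update = lambda t: 0.5 * t + 1.0    # fixed point at t = 2
t_new = squarem_step(np.array([10.0]), update)
```

The extrapolated jump can overshoot for non-linear maps, which is precisely why the accelerated scheme loses the monotonicity guarantee mentioned above; implementations typically fall back to the plain update when $F$ decreases.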