Differential chromatin profiles partially determine transcription factor binding

We characterize how genomic variants that alter chromatin accessibility influence regulatory factor binding with a new method called DeltaBind that predicts condition specific factor binding more accurately than other methods based on DNase-seq data. Using DeltaBind and DNase-seq experiments we predicted the differential binding of 18 factors in K562 and GM12878 cells with an average precision of 28% at 10% recall, with the prediction of individual factors ranging from 5% to 65% precision. We further found that genome variants that alter chromatin accessibility are not necessarily predictive of altering proximal factor binding. Taken together these findings suggest that DNase-seq or ATAC-seq Quantitative Trait Loci (dsQTLs), while important, must be considered in a broader context to establish causality for phenotypic changes.

We find that it is infeasible for an unsupervised method to model the conditional probability of differential binding directly, since we know neither the prediction accuracy of the single-condition scores, nor the size of the true ground truth differential ChIP-seq events. However, it is possible to model the joint probability (given DNase scores) of an event being reproducible in one condition and significantly more weakly bound in the other condition. We find that this can be used as a good surrogate for probability of differential binding, and provides a good score metric to rank candidate binding sites in terms of their likelihood of differential binding.
Suppose we have two DNase-seq experiment replicates each for K562 and GM12878, and we want to infer binding sites which are bound in K562 and unbound in GM12878. Let !" ! , 1 ≤ ≤ , = " " " ", = 1 2, be the rank of the PIQ shape score of binding site , condition , and replicate number , where is the number of candidate binding sites, = " " denotes a GM12878 value, = " " denotes a K562 value, and indexes the replicates. Let ! ! = ( !! ! , !! ! ) be the average rank of a binding site in condition , and ! be the vector of all 4 ranks in two conditions and two replicates. Let ! be the event that site is bound in K562, and ! be the event that site has a significantly weaker DNase profile in GM12878 than K562. Thus, we aim to model the conditional probability, for each site . We decompose this into and model each component part by assuming mixture models on the relevant subset of data.
The first part, ( ! | ! ), denotes the probability of a binding site being bound in K562. First we note that it does not depend on GM12878 ranks, so As mentioned above, we model this probability by resorting to a measure of reproducibility. Reproducibility is a concept introduced in (Li, Brown, Huang, & Bickel, 2011)] and characterizes an event which produces positively correlated scores in replicate experiments with high mean. Its counterpart, irreproducibility, characterizes an event which produce uncorrelated scores in replicates with low mean. Let ! ! denote a reproducible event in K562. Then ! ! does not depend on GM12878 scores, so . Using the framework in the IDR paper, we estimate ! ! !! ! , !! ! ) by transforming the rank values through a normal quantile function and fitting a reproducible and irreproducible cluster (Fig. S1a).
To improve robustness against noise in DNase scores and hence ranks, we compute the binding probability as a function of mean K562 ranks ! ! . This is done by averaging across all binding sites with identical mean rank ! ! under the IDR model, i.e., . This average can be estimated numerically using the fitted joint distributions of the IDR model for each small window of ! ! = . In practice, we find that it is important to have fine window sizes for high values of as they correspond to bound K562 sites; for low values of , we use coarser windows.
In this process, we made the assumption that reproducibility of K562 DNase rank is a good surrogate for the actual binding of the transcription factor. This is reasonable because we expect strongly bound sites to have higher reproducible DHS shapes, and weak or unbound sites to generate random DHS shapes.
In the second part, we estimate ( ! | ! , ! ), the probability of site belonging to a weaker-than-typical class of binding sites in GM12878. We reason that this probability depends on the event ! , the rank of site in K562, ! ! , and the difference of ranks between GM12878 and K562, !
into a fine sequence of values covering its range, and then estimating a model of ( ! | ! , ! ) for each fixed ! ! value. For each ! ! (within a small window), we fit a mixture model for the observed values of ! ! -! ! .
Conditioning on ! can be approximated well for strong K562 binding sites by selecting binding sites close to the diagonal on the !! ! vs !! ! plot, since sites on this diagonal has high probability of belonging to the reproducible component in the IDR model. This approximation is very good for highly ranked K562 binding sites, since the reproducible component dominates there, and thus provides high fidelity for the events we are most interested in. Figure S1b shows a typical plot of ! ! -! ! for a high K562 rank (black curve). The plot shows a concentrated cluster of "typical" values for the rank difference around 0, and a heavier left tail of sites with much lower ranks ("atypical values") in GM12878. We use a mixture model to capture this overall distribution, consisting of three components, a "typical" component in the center around 0, an "atypical left" component and an "atypical right" component (insignificant for this window of ! ! ). For simplicity, we model the two tail components as uniform distributions and the center component Gaussian. Results show that this simplification works reasonably well. Then, we estimate model parameters with an EM-like algorithm to iteratively assign sites to each cluster and refine parameters.
Finally, taking the product of the two probabilities gives an estimate of the probability of differential binding for site . Then, decision boundaries at any specified differential binding probability can be defined. A set of decision boundaries derived by this probability score is shown in Figure S1c (orange). We have enforced that decision boundaries are increasing and lie above the diagonal. Axes are average K562 rank vs. average GM12878 rank. The red points are the ground truth sites.
The DeltaBind software is available at: bitbucket.org/deltabind_cgs/deltabind (a) (b) (c) Figure S1: DeltaBind model fitting. (a) Fitting reproducible (red) and irreproducible (orange) clusters of binding sites based on PIQ shape score. (b) Fitting the conditional model of differential binding, black: data distribution, red: non-differential component, blue: differential component with lower ranks in GM12878 than K562, green: differential component with higher ranks in GM12878 than K562. (c) Decision boundaries (orange) of different confidence levels. Axes are K562 PIQ rank vs. GM12878 PIQ rank. Red represents true differential sites indicated by ChIP-seq signals.
PR and ROC curves for selected TFs. Figure S2 shows the PR and ROC curves of some of the well-performing factors that we have identified in our dataset. The dashed line represents the linear diffPIQ classifier which prioritizes candidate sites by the difference between K562 shape score and GM12878 shape score. We note that while AUPR is relatively low for all TFs, AUROC is generally high for most TFs (values in Table 1). Figure S2: PR and ROC curves for selected TFs.

DeltaBind improves differential binding prediction from Centipede scores
Centipede uses a hierarchical Bayesian mixture model to infer the posterior probability of transcription factor binding from single condition DNase-seq experiments. We first explored whether we can use Centipede outputs for K562 and GM12878 to detect differential factor occupancy. Let be the posterior probability of a site in K562, and be that in GM12878, we tested three methods to combine scores from the two cell types into a score for ranking differential binding (bound in K562 and unbound in GM12878): * (1 − ) product of probability of being bound in K562 and probability of not bound in GM12878 difference of probability difference of log probability We find that DeltaBind applied to PIQ output achieves higher prediction accuracy than all three Centipede scoring methods for all eighteen factors tested. Figure S3 shows the AUPR for DeltaBind and the best Centipede method for each factor.
We next tested applying DeltaBind to Centipede's posterior binding probabilities jointly, by converting them into rank space. Our results show that DeltaBind is able to improve the prediction accuracy compared to all three Centipede scoring methods above, demonstrating the generality of the DeltaBind method. Overall, DeltaBind has the best prediction performance when applied to PIQ outputs. (Fig. S4) Figure S3. Comparison of DeltaBind prediction performance with best Centipede scoring method. Figure S4. DeltaBind applied to Centipede outputs improves differential binding prediction. Table S1 shows the relevant statistics of the differential binding inference task for all transcription factors studied. Specifically, the table columns show, in order, factor name, number of candidate binding sites, number of true differentially bound sites evidenced by ChIP-seq data (true positives), the proportion of the true positives, DeltaBind precision, AUC and AUPR.