DNase-capture reveals differential transcription factor binding modalities

We describe DNase-capture, an assay that increases the analytical resolution of DNase-seq by focusing its sequencing phase on selected genomic regions. We introduce BaseNormal, a new method that compensates for capture bias and allows accurate recovery of transcription factor protection profiles from DNase-capture data. We show that these normalized data allow nuanced detection of transcription factor binding heterogeneity with as few as dozens of sites.

were retained. Coverage at a base was calculated using only the outermost coordinate of a read.
DNase-seq and DNase-capture replicates were kept separate.

BaseNormal DNase-capture normalization
The BaseNormal corrections are described in the order in which they are applied.
Previous studies have shown that the sequence affinity of DNase-I contributes substantially to the bias of DNase-seq and that this bias can be characterized by 6-mers. To correct for the sequence affinity of DNase-I, we used the 6-mer bias table computed from cut rates in a naked DNA digestion [11]. The DNase-seq and DNase-capture data were then adjusted base by base: at each base, the number of reads was divided by the table value for the corresponding 6-mer.
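The 6-mer correction can be illustrated with a minimal sketch. The function name, the `offset` that places each base inside its 6-mer, and the choice to leave a base unchanged when no full 6-mer or table entry is available are assumptions made for illustration; the text does not specify them.

```python
from typing import Dict, List


def kmer_corrected_counts(
    sequence: str,
    counts: List[float],
    bias: Dict[str, float],
    k: int = 6,
    offset: int = 3,  # assumed position of the base within its k-mer
) -> List[float]:
    """Divide each per-base read count by the bias value of its k-mer."""
    corrected = []
    for i, count in enumerate(counts):
        start = i - offset
        # Only look up a k-mer that lies fully inside the sequence.
        kmer = sequence[start:start + k] if start >= 0 else ""
        factor = bias.get(kmer)
        if len(kmer) == k and factor:
            corrected.append(count / factor)
        else:
            # Assumption: keep the raw count when no usable bias value exists.
            corrected.append(count)
    return corrected
```

A table entry above 1 (a 6-mer that DNase-I cuts more readily than average) scales the count down, and an entry below 1 scales it up.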
Different tiling densities were used in different genomic regions, which resulted in more reads in regions of higher tiling density. To correct for this tiling density bias, we normalized the expected counts for each tiling density as a preprocessing step: we calculated the average number of reads at each tiling density and rescaled the reads so that every tiling density had the same expected number of reads.
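As a rough sketch of this step, the per-base counts can be rescaled by the ratio of a common target mean to the mean of their tiling density class. Using the overall mean as the common target, and the names below, are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def normalize_by_tiling_density(
    counts: Sequence[float], density_labels: Sequence[Hashable]
) -> List[float]:
    """Rescale counts so every tiling density class has the same mean."""
    totals = defaultdict(float)
    sizes = defaultdict(int)
    for count, label in zip(counts, density_labels):
        totals[label] += count
        sizes[label] += 1
    class_mean = {label: totals[label] / sizes[label] for label in totals}
    # Assumption: the shared target is the mean over all bases.
    target = sum(counts) / len(counts)
    return [
        count * target / class_mean[label] if class_mean[label] > 0 else count
        for count, label in zip(counts, density_labels)
    ]
```

After rescaling, the average count within each tiling density equals the target, so differences between regions no longer reflect how densely they were tiled.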
Finally, we control for other biases arising from the DNase-capture protocol and for the effect of read towers by modeling the DNase-seq reads as a linear combination of DNase-capture reads, DNase-capture genomic control reads, and two indicators for read towers, one for each dataset. The fitted values (the predicted $y_i$ as defined below) are the bias-controlled accessibility estimates that we use. The features used in the linear regression for a given experiment are its observed DNase-capture reads, the DNase-capture genomic control reads, and indicators of whether the genomic control and DNase-capture reads were above their 95th percentiles, each taken over a +/-15 bp window.
Formally, denote the reads from the DNase-capture genomic control at base $i$ by $g_i$ and the reads for a given DNase-capture experiment by $c_i$, both truncated at their respective 95th percentiles. Let $I^{g}_i$ and $I^{c}_i$ be indicators that $g_i$ and $c_i$ were truncated, respectively. Let $y_i$ be the reads of the DNase-seq experiment at base $i$. Finally, let $\beta_{k,j}$ be the regression coefficients, where $k$ indexes the four feature types and $j$ the offset within the window. Then the regression is of the form
$$y_i = \sum_{j=-15}^{15} \left( \beta_{1,j}\, c_{i+j} + \beta_{2,j}\, g_{i+j} + \beta_{3,j}\, I^{c}_{i+j} + \beta_{4,j}\, I^{g}_{i+j} \right)$$
The regression was done using Vowpal-Wabbit [2], a fast, open-source regressor. Default options were used, except that two passes were used in the learning step. The default loss function is the squared (L2) loss.
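The feature construction for this regression can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used here: the nearest-rank percentile, the zero contribution of positions outside the region, the feature names, and the namespace `w` are all hypothetical. Only the windowing, the truncation with indicators, and the Vowpal Wabbit text input format (`label |namespace name:value ...`) follow from the description.

```python
import math
from typing import Dict, List, Tuple


def truncate_at_percentile(
    values: List[float], q: float = 0.95
) -> Tuple[List[float], List[int]]:
    """Cap values at the q-th percentile; flag the values that were capped."""
    ranked = sorted(values)
    # Assumption: nearest-rank definition of the percentile.
    cap = ranked[max(0, math.ceil(q * len(ranked)) - 1)]
    truncated = [min(v, cap) for v in values]
    indicators = [1 if v > cap else 0 for v in values]
    return truncated, indicators


def window_features(
    i: int, tracks: Dict[str, List[float]], half_width: int = 15
) -> Dict[str, float]:
    """Collect each track's values at offsets -half_width..+half_width of base i."""
    features = {}
    for name, track in tracks.items():
        for j in range(-half_width, half_width + 1):
            pos = i + j
            # Zero values are omitted: absent features count as zero in a
            # sparse format. Positions outside the track are skipped (assumption).
            if 0 <= pos < len(track) and track[pos] != 0:
                features[f"{name}{j:+d}"] = float(track[pos])
    return features


def to_vw_line(label: float, features: Dict[str, float]) -> str:
    """Format one example in Vowpal Wabbit's text input format."""
    body = " ".join(f"{name}:{value:g}" for name, value in features.items())
    return f"{label:g} |w {body}"
```

With four tracks (`c`, `g`, and their two indicators) and a half width of 15, each base yields up to 4 x 31 = 124 features. A file of such lines could then be passed to the `vw` command-line tool, where multiple passes are requested with `--passes` and require a cache (`-c`).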

ChIP-seq peak calling
Analogously to how the DNA sequence was processed, we realigned the raw ChIP-seq reads as single-paired reads, using the same BWA settings as for DNA sequence processing.
ChIP-seq peaks were called using the GEM peak caller [3] with the recommended parameters k_min=6, k_max=13, and s=2000000000. Peak calling was done with all the reads, to better learn the ChIP-seq profiles. For analysis, we used only peaks that were called within the capture region and that had motifs.

AUROC p-value computation
To compute the p-value of the AUROC, we used a permutation test. Specifically, the class labels of the examples were randomly permuted, and the leave-one-out validation was redone with the permuted labels. The permutation and validation were repeated 1000 times. To compute the p-value, we computed the rank of the true AUROC value among the 1000 permuted trials, divided the rank by 1000, and subtracted the result from 1.
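A minimal sketch of this permutation test follows. The `evaluate` callable stands in for the full leave-one-out validation, which is not reproduced here; it is a hypothetical interface that takes a label vector and returns an AUROC. Treating the rank as the number of permuted AUROC values strictly below the true value is also an assumption about how ties are handled.

```python
import random
from typing import Callable, List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def permutation_p_value(
    labels: Sequence[int],
    evaluate: Callable[[List[int]], float],  # hypothetical: labels -> AUROC
    n_permutations: int = 1000,
    seed: int = 0,
) -> float:
    """Return 1 - rank / n_permutations for the AUROC of the true labels."""
    rng = random.Random(seed)
    observed = evaluate(list(labels))
    shuffled = list(labels)
    rank = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # permute the class labels in place
        if evaluate(shuffled) < observed:
            rank += 1  # assumption: ties do not count toward the rank
    return 1.0 - rank / n_permutations
```

For a quick check with fixed scores, `permutation_p_value(labels, lambda y: auroc(scores, y))` can be used, keeping in mind that the analysis described above reruns the whole validation for every permutation rather than reusing one set of scores.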