Figure 1.
H3K27me3 and H3K36me3 diffuse histone marks.
ChIP-seq was used to identify regions of enrichment based on read density profiles, visualized here in the UCSC genome browser (http://genome.ucsc.edu/). The enriched islands identified by BCP (blue) and SICER (red) are indicated. Additionally, the posterior mean estimates used in BCP island detection are shown along with a line (orange) illustrating how thresholds are used to segment the signal. The correlation between H3K36me3 and gene bodies (outlined in green) and the mutual exclusivity of H3K27me3 and H3K36me3 were evident. The signal fluctuations caused by the highly variable read densities common to ChIP-seq data of diffuse marks are one of the notable difficulties for standard peak-calling algorithms, causing them to fragment the broader regions of enrichment into smaller, discontiguous peaks.
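The thresholding step illustrated by the orange line can be sketched as follows. This is a minimal illustration only, not the BCP implementation: the function name `segment_islands`, the fixed `bin_size`, and the use of a single global threshold are assumptions made for the example, and the way BCP actually selects its threshold is not reproduced here.

```python
def segment_islands(posterior_means, threshold, bin_size=200):
    """Return (start, end) islands where the per-bin signal exceeds threshold.

    posterior_means: one value per consecutive genomic bin (hypothetical input).
    bin_size: bin width in base pairs (assumed value, not taken from the paper).
    Coordinates are half-open: start inclusive, end exclusive.
    """
    islands = []
    start = None  # index of the first bin of the island currently being built
    for i, value in enumerate(posterior_means):
        if value > threshold and start is None:
            start = i  # signal rises above the threshold: open an island
        elif value <= threshold and start is not None:
            islands.append((start * bin_size, i * bin_size))  # close it
            start = None
    if start is not None:  # an island that runs to the end of the profile
        islands.append((start * bin_size, len(posterior_means) * bin_size))
    return islands
```

Because the posterior means are piecewise smooth, consecutive bins inside a broad domain tend to stay above the threshold together, which is why a simple cut of this kind can yield contiguous islands rather than fragments.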
Table 1.
H3K36me3 islands and common associations.
Figure 2.
The distance from H3K36me3 island boundaries to the nearest gene boundary was used as a measure of accuracy.
H3K36me3 islands have been shown to correspond to actively transcribed gene bodies, so we expected the boundaries of islands and genes to coincide. The sum of the distances from the upstream and downstream island boundaries to their nearest gene boundaries was used as a per-island error, illustrated in the histograms for BCP (left) and SICER (right).
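The per-island error described above can be sketched as follows. This is a hedged illustration under stated assumptions: the helper names are hypothetical, all coordinates are assumed to lie on a single chromosome, and either end of any gene is treated as a candidate "gene boundary", which is one reading of the caption.

```python
from bisect import bisect_left


def nearest_distance(position, sorted_boundaries):
    """Distance from position to the closest value in a sorted, non-empty list."""
    i = bisect_left(sorted_boundaries, position)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_boundaries[max(i - 1, 0):i + 1]
    return min(abs(position - b) for b in candidates)


def island_boundary_error(island, genes):
    """Sum of distances from both island ends to their nearest gene boundary.

    island: (start, end) tuple; genes: non-empty list of (start, end) tuples
    on the same chromosome (assumption for this sketch).
    """
    boundaries = sorted({coord for gene in genes for coord in gene})
    start, end = island
    return nearest_distance(start, boundaries) + nearest_distance(end, boundaries)
```

An island that matches a gene body exactly scores 0, while a fragment in the middle of a long gene scores high at both ends, so a fragmented call set shifts the histogram toward larger errors.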
Figure 3.
BCP was robust, providing consistent results in replicate and at various coverage depths.
Using a second H3K36me3 dataset and sub-samples of the full replicate one dataset (30–90% of reads, randomly selected), we evaluated the reproducibility of BCP island calls. A) Enriched regions coinciding with gene coordinates were captured by the large, contiguous BCP islands (blue), while SICER islands (red) were more fragmented. B) We quantified the reproducible fraction of the full dataset results versus the replicate and sub-samples (the number of full dataset island bases covered by a replicate/sub-sample island divided by the total bases in full dataset islands, averaged across all islands) and vice versa. We also computed the fraction of island base pairs overlapping genic and intergenic regions (the number of island bases covered by genic/intergenic annotation divided by the total bases in the island, averaged across all islands).
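The coverage fractions in panel B can be sketched as follows. This is a minimal illustration that assumes the fraction is computed per island and then averaged, with half-open intervals on one chromosome; the function names are hypothetical. The same function serves for the genic/intergenic fraction by passing annotation intervals as `others`.

```python
def covered_bases(island, others):
    """Number of bases of island covered by the union of the intervals in others."""
    start, end = island
    merged = []  # union of others, so overlapping intervals are not double counted
    for s, e in sorted(others):
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return sum(max(0, min(end, e) - max(start, s)) for s, e in merged)


def mean_covered_fraction(islands, others):
    """Average, over islands, of the fraction of each island covered by others."""
    fractions = [covered_bases(i, others) / (i[1] - i[0]) for i in islands]
    return sum(fractions) / len(fractions)
```

Swapping the two arguments gives the "vice versa" direction, that is, the fraction of each sub-sample island recovered in the full dataset.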
Figure 4.
BCP dynamically adapted to many different types of data.
To demonstrate its versatility, we compiled a set of several histone modifications and analyzed each under the default parameters for BCP and SICER. Regardless of the histone mark characteristics, whether more punctate as in acetylation marks and H3K4me3 or broad as in H3K27me3, H3K36me3, and H3K9me3, BCP (black) was able to make reasonable island calls that effectively described the underlying read profiles. SICER (grey) seemed more primed to identify smaller, sharper islands and so often fragmented broader regions of enrichment.
Figure 5.
BCP showed strong performance in punctate transcription factor ChIP-seq data.
Compared to MACS, a representative peak-calling algorithm designed for punctate peak detection, BCP showed a comparable false-discovery rate (FDR) and rate of motif occurrence in both CTCF and STAT1 datasets. We applied the empirical FDR described in the Methods and by [17], dividing the number of negative peaks (detected when the input control sample was set as the test and the ChIP sample was set as the control) by the number of test peaks (detected when the ChIP sample was set as the test and the input control sample was set as the control). Peaks were ranked according to p-value. Additionally, BCP displayed a slightly improved motif occurrence rate (the fraction of peaks containing a match to the TRANSFAC consensus motifs, as determined by STORM).
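The empirical FDR described above can be sketched as follows. This is an illustration under assumptions rather than the procedure used for the figure: it assumes that, at each test peak's p-value taken as a cutoff, the FDR is the number of negative peaks at or below that cutoff divided by the number of test peaks at or below it. The function name and inputs are hypothetical.

```python
from bisect import bisect_right


def empirical_fdr(test_pvalues, negative_pvalues):
    """Empirical FDR at each test peak, with peaks ranked by ascending p-value.

    test_pvalues: p-values of peaks called with ChIP as test, input as control.
    negative_pvalues: p-values of peaks called with the two samples swapped.
    Returns one FDR estimate per test peak, in ranked order, capped at 1.0.
    """
    test_sorted = sorted(test_pvalues)
    negative_sorted = sorted(negative_pvalues)
    fdr = []
    for p in test_sorted:
        n_test = bisect_right(test_sorted, p)          # test peaks with p-value <= p
        n_negative = bisect_right(negative_sorted, p)  # negative peaks with p-value <= p
        fdr.append(min(1.0, n_negative / n_test))
    return fdr
```

For example, with three test peaks and one negative peak that is more significant than only the weakest test peak, the first two ranks have an estimated FDR of 0 and the third has 1/3.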