Identification of factors associated with duplicate rate in ChIP-seq data

doi:10.1371/journal.pone.0214723

Fig 1.

Duplicate rate versus sequencing depth and target enrichment level in ER ChIP-seq data.

(A) Duplicate rate versus sequencing depth. (B) Enrichment level and the percentage of nonredundant reads in peaks. Duplicate rate was defined as the ratio of duplicate reads over uniquely mapped reads. Enrichment level was estimated as (number of nonredundant reads in peaks / total nonredundant reads in IP) / (number of nonredundant reads in peak-corresponding regions in input / total nonredundant reads in input).
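The two quantities defined in this legend can be written out as a minimal sketch. The function and argument names are illustrative assumptions, not taken from the paper; the read counts are presumed to have been tallied beforehand from the IP and input alignments.

```python
def duplicate_rate(n_duplicates, n_uniquely_mapped):
    """Duplicate reads divided by uniquely mapped reads."""
    if n_uniquely_mapped <= 0:
        raise ValueError("n_uniquely_mapped must be positive")
    return n_duplicates / n_uniquely_mapped


def enrichment_level(ip_in_peaks, ip_total, input_in_peaks, input_total):
    """Fraction of nonredundant IP reads in peaks, divided by the fraction of
    nonredundant input reads in the peak-corresponding regions."""
    ip_fraction = ip_in_peaks / ip_total
    input_fraction = input_in_peaks / input_total
    if input_fraction == 0:
        # The legend does not say how an empty input region is handled.
        raise ValueError("no input reads in peak-corresponding regions")
    return ip_fraction / input_fraction
```

For example, an IP library with half of its nonredundant reads in peaks and an input library with one eighth of its reads in the same regions would give an enrichment level of 4.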

Table 1.

Duplicate level in ER peaks and non-peak regions.

Table 2.

Enrichment of duplicates in H3K4me3 and NRF1 peaks.

Fig 2.

Scatter plot of duplicate level within peaks between replicates.

(A) NRF1 in K562. (B) NRF1 in MCF7. (C) H3K4me3 in GM12878. (D) H3K4me3 in GM12891. Peaks overlapping the blacklist were filtered out. Duplicate level was estimated as the number of duplicates per kb per 10 million uniquely mapped reads and log₂ transformed. R² value was calculated using Pearson correlation.
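As a rough illustration of the normalization and correlation described in this legend, the sketch below computes a log₂ duplicate level for one peak and a Pearson R² between two replicates. All names are hypothetical. The legend does not state how peaks with zero duplicates were treated, so this sketch simply rejects them.

```python
import math


def log2_duplicate_level(n_duplicates, peak_length_bp, n_uniquely_mapped):
    """log2 of duplicates per kb of peak per 10 million uniquely mapped reads."""
    if n_duplicates <= 0:
        # Assumption: zero-duplicate peaks are excluded rather than given a pseudocount.
        raise ValueError("log2 is undefined for peaks without duplicates")
    per_kb = n_duplicates / (peak_length_bp / 1000)
    per_kb_per_10m = per_kb / (n_uniquely_mapped / 1e7)
    return math.log2(per_kb_per_10m)


def pearson_r_squared(xs, ys):
    """Squared Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)
```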

Fig 3.

Plot of duplicate abundance versus peak confidence.

(A,C,E,G,I) Duplicate rate in 10 groups of peaks and in the corresponding regions in input. (B,D,F,H,J) Proportion of total library duplicates in each of the groups and in the corresponding regions in input. Peaks were sorted based on p value in ascending order and split into 10 equal-sized groups, with group 1 having the smallest p values. Y-axis in the left panels represents duplicate rate per group, i.e., the number of duplicates over the total uniquely mapped reads in a group, as defined in Fig 1 legend. Y-axis in the right panels represents the proportion of total duplicates from a library in each group and in the peak-corresponding regions in input. The dotted horizontal lines in the left panels denote duplicate rates in the non-peak regions in IP (≥ 100 bp away from peaks). See S1 Table for sample information.
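The grouping described in this legend might be sketched as follows. The peak records are assumed to be dictionaries with hypothetical keys `p_value`, `duplicates` and `uniquely_mapped`; when the number of peaks is not divisible by 10, the leftover peaks are spread over the first groups, which is a choice made here and not one stated in the legend.

```python
def split_into_groups(peaks, n_groups=10):
    """Sort peaks by ascending p value and split them into near-equal groups.

    Group 1 (index 0) holds the peaks with the smallest p values.
    """
    ranked = sorted(peaks, key=lambda p: p["p_value"])
    size, extra = divmod(len(ranked), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        end = start + size + (1 if i < extra else 0)
        groups.append(ranked[start:end])
        start = end
    return groups


def group_summary(group, total_library_duplicates):
    """Duplicate rate within a group and the group's share of all library duplicates."""
    dups = sum(p["duplicates"] for p in group)
    reads = sum(p["uniquely_mapped"] for p in group)
    return {
        "duplicate_rate": dups / reads if reads else 0.0,
        "share_of_library_duplicates": dups / total_library_duplicates,
    }
```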

Fig 4.

Duplicate level in peak is correlated with mark enrichment.

(A) PCR-free H3K4me3 ChIP-seq data in the HeLa cell line. (B-D) ER ChIP-seq data in the MCF7 cell line. (E,F) NRF1 ChIP-seq data in HepG2 and MCF7 cell lines. (G-I) H3K36me3 ChIP-seq data in MCF7 and ZR751 cell lines and in fetal retinal tissue. Peaks were called from alignments with duplicates removed. X-axis indicates mark enrichment level in peaks, estimated as the number of nonredundant reads per kb, and y-axis shows the number of duplicates per kb. The curve was constructed using the "lowess()" function in R. R² value was calculated using Spearman rank correlation.
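The per-kb levels and the rank correlation mentioned in this legend can be illustrated with a small pure-Python sketch. The paper used R; the functions below are hypothetical stand-ins that use average ranks for ties, and the LOWESS curve is not reproduced here.

```python
import math


def per_kb(count, peak_length_bp):
    """Read count normalized to peak length in kb."""
    return count / (peak_length_bp / 1000)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the nonredundant reads per kb for each peak and `ys` the duplicates per kb for the same peaks; squaring the returned coefficient gives an R²-style value.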

Fig 5.

Box plot of Spearman rank correlation between duplicate level in peak and six factors.

(A) Thirteen ER libraries in breast cancer cell lines. (B) Six NRF1 libraries, including one in HepG2, two in MCF7 and three in K562. (C) Thirteen H3K4me3 libraries in lymphoblastoid cell lines. (D) Four H3K36me3 libraries in fetal retinal tissue. (E) Twelve H3K36me3 libraries in breast cancer cell lines. (F) Twelve H3K27me3 libraries in breast cancer cell lines. For each peak, duplicate level was estimated as the number of duplicates divided by peak size in kb, and non-duplicate level was estimated similarly. Duplicate and non-duplicate levels in peak-corresponding regions in input were also calculated. GC content represents the number of guanine and cytosine bases divided by the total bases in a peak. Percentage of segmental duplication is the proportion of a peak that overlaps regions of segmental duplication, defined as those with ≥ 90% sequence identity over at least 1 kb (http://humanparalogy.gs.washington.edu/build37/build37.htm) [30]. Percentage of low-complexity sequence is the proportion of a peak that overlaps low-complexity regions (https://figshare.com/articles/Low_complexity_regions_in_hs37d5/969685) [31].
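Two of the sequence-based factors in this legend, GC content and the proportion of a peak overlapping an annotation track, can be sketched as below. The function names and the half-open `[start, end)` coordinate convention are assumptions made for illustration; the same overlap function would apply to either the segmental duplication or the low-complexity track.

```python
def gc_content(sequence):
    """Number of G and C bases divided by the total number of bases."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def overlap_fraction(peak_start, peak_end, intervals):
    """Fraction of the peak [peak_start, peak_end) covered by the union of intervals.

    Overlapping or nested annotation intervals are counted only once.
    """
    clipped = sorted(
        (max(s, peak_start), min(e, peak_end))
        for s, e in intervals
        if s < peak_end and e > peak_start
    )
    covered, cursor = 0, peak_start
    for s, e in clipped:
        s = max(s, cursor)  # skip the part already counted
        if e > s:
            covered += e - s
            cursor = e
    return covered / (peak_end - peak_start)
```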

Fig 6.

Prediction of duplicates as signal based on peak enrichment.

(A) Duplicate rate in a library and in peaks and proportion of duplicates in peaks. Duplicate rate in a library was estimated as the number of duplicates divided by the number of uniquely mapped reads. Duplicate rate in peaks was estimated in the same way. (B) Plot of peak coverage, fraction of positions with duplicates and fraction of nonredundant reads in peaks. Peak coverage was estimated as the total peak size over the mappable genome size (0.75 × genome size). Fraction of positions with duplicates was estimated as the number of positions with duplicates over the number of positions with uniquely mapped reads. Fraction of reads in peaks (FRiP) is the fraction of uniquely mapped, nonredundant reads in peaks. (C) Proportion of duplicates predicted as signal. The prediction was based on the correlation between peak duplicate and non-duplicate level, as shown in Fig 4.
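The three panel B quantities can be sketched as follows. The names are hypothetical, and the sketch assumes that a "position" is a `(chromosome, start, strand)` tuple and that a position carries duplicates when more than one uniquely mapped read starts there; the legend itself does not spell out either detail.

```python
from collections import Counter

MAPPABLE_FRACTION = 0.75  # mappable genome size = 0.75 x genome size, per the legend


def peak_coverage(peak_lengths_bp, genome_size_bp):
    """Total peak size divided by the mappable genome size."""
    return sum(peak_lengths_bp) / (MAPPABLE_FRACTION * genome_size_bp)


def fraction_of_positions_with_duplicates(read_positions):
    """Positions holding more than one read, over all positions holding a read."""
    counts = Counter(read_positions)
    if not counts:
        raise ValueError("no read positions supplied")
    with_duplicates = sum(1 for n in counts.values() if n > 1)
    return with_duplicates / len(counts)


def frip(nonredundant_in_peaks, nonredundant_total):
    """Fraction of uniquely mapped, nonredundant reads that fall in peaks."""
    return nonredundant_in_peaks / nonredundant_total
```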

Fig 7.

Flowchart for optimal deduplication in peaks.

The workflow takes a BAM file and a list of peaks as input. It outputs a table that shows, for each peak, the number of nonredundant reads (non-duplicates), the number of duplicates predicted as signal and the number of duplicates predicted as noise. A properly deduplicated BAM file is also generated, which contains alignments for all nonredundant reads and for duplicates in peaks that are predicted as signal. For each peak, if N represents the predicted number of noise duplicates and S represents the predicted number of signal duplicates, a list of N read IDs is randomly drawn from the N+S duplicates mapped to that peak. Alignments for these noise duplicates are then excluded, and alignments for the remaining duplicates are combined with those from nonredundant reads.
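The random selection step for a single peak could look like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function name, the optional seed and the assumption that read IDs are unique within a peak are all introduced here, and reading or writing BAM files is left out.

```python
import random


def split_duplicates(duplicate_read_ids, n_noise, seed=None):
    """Randomly choose n_noise duplicate read IDs to discard for one peak.

    Returns (signal_ids, noise_ids): signal_ids are kept and later combined
    with the nonredundant reads; noise_ids are excluded.
    """
    ids = list(duplicate_read_ids)  # the N + S duplicates mapped to the peak
    if not 0 <= n_noise <= len(ids):
        raise ValueError("n_noise must lie between 0 and the number of duplicates")
    rng = random.Random(seed)  # a seed makes the draw reproducible
    noise_ids = set(rng.sample(ids, n_noise))
    signal_ids = [read_id for read_id in ids if read_id not in noise_ids]
    return signal_ids, noise_ids
```

With ten duplicates in a peak and four predicted as noise, the call returns six IDs to keep and four to exclude; which four are excluded depends on the random draw.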
