dPeak: High Resolution Identification of Transcription Factor Binding Sites from PET and SET ChIP-Seq Data

Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-Seq) has been successfully used for genome-wide profiling of transcription factor binding sites, histone modifications, and nucleosome occupancy in many model organisms and humans. Because the compact genomes of prokaryotes harbor many binding sites separated by only few base pairs, applications of ChIP-Seq in this domain have not reached their full potential. Applications in prokaryotic genomes are further hampered by the fact that well studied data analysis methods for ChIP-Seq do not result in a resolution required for deciphering the locations of nearby binding events. We generated single-end tag (SET) and paired-end tag (PET) ChIP-Seq data for factor in Escherichia coli (E. coli). Direct comparison of these datasets revealed that although PET assay enables higher resolution identification of binding events, standard ChIP-Seq analysis methods are not equipped to utilize PET-specific features of the data. To address this problem, we developed dPeak as a high resolution binding site identification (deconvolution) algorithm. dPeak implements a probabilistic model that accurately describes ChIP-Seq data generation process for both the SET and PET assays. For SET data, dPeak outperforms or performs comparably to the state-of-the-art high-resolution ChIP-Seq peak deconvolution algorithms such as PICS, GPS, and GEM. When coupled with PET data, dPeak significantly outperforms SET-based analysis with any of the current state-of-the-art methods. Experimental validations of a subset of dPeak predictions from PET ChIP-Seq data indicate that dPeak can estimate locations of binding events with as high as to resolution. Applications of dPeak to ChIP-Seq data in E. coli under aerobic and anaerobic conditions reveal closely located promoters that are differentially occupied and further illustrate the importance of high resolution analysis of ChIP-Seq data.

). For PET ChIP-Seq data, MACS first finds the best pairs of 5 and 3 reads from multiple alignment results. Then, only the 5 read position is kept for every pair and shifted to its 3 direction by 100bp without estimation of the shift parameter. Then, the 1 standard MACS analysis [1] is applied to the processed data. In MOSAiCS, when bin-level data are constructed, each read pair is connected and this connected read pair contributes to all the bins it overlaps. The standard MOSAiCS analysis [2] is applied to this bin-level data. Detailed comparison of the MACS and MOSAiCS peaks reveals that each MACS peak on average has 1.54 to 2.23 MOSAiCS peaks (Table S2)

The dPeak Model
Consider a peak region with n reads (DNA fragments) and let 1 and m denote the start and end positions of the peak region, respectively. Let g * denote the number of binding events within the region and µ g be the position of g-th binding event, g = 1, 2, · · · , g * . Without loss of generality, assume that 1 ≤ µ 1 < µ 2 < · · · < µ g * ≤ m for identifiability. In both PET and SET data, a fraction of reads will denote background noise. We assume that background reads are uniformly distributed over the whole candidate region and denote the background component as g = 0.
Let π g denote the strength of g-th binding event, g = 1, 2, · · · , g * . π 0 indicates degree of non-specific binding in the candidate region. Let Z i be the group index of i-th DNA fragment and Z i ∈ {0, 1, 2, · · · , g * }. For notational convenience, we denote Z ig = 1 {Z i = g}, where 1 {A} is an indicator function of event A. We assume that P (Z i = g) = P (Z ig = 1) = π g , g = 0, 1, 2, · · · , g * and g * g=0 π g = 1. Note that the dPeak model allows each DNA fragment to overlap with multiple binding events. The unobserved Z i variable ensures that each fragment that is not part of the background overlaps with at least one binding event.

Generative model for paired-end tag (PET) data
Let S i and L i be the start position and length of i-th DNA fragment, respectively. If we denote end position of i-th fragment as E i , then E i = S i +L i −1 by definition. In the PET data, we directly observe S i and E i (equivalently, S i and L i ) for each DNA fragment. Moreover, distribution of library size, P (L), can be empirically estimated from the PET data and hence, we treat P (L) as known. We denote the whole candidate region as C = {1 − L i + 1 ≤ S i ≤ m} and the region corresponding to g-th binding event as B g = {µ g − L i + 1 ≤ S i ≤ µ g }. If i-th fragment is generated from g-th binding event (Z i = g), then for given L i , we assume that S i is generated from the following Uniform-like distribution: where γ denotes the weight assigned to the area outside of the region corresponding to g-th binding event.
The main purpose to using P (s|l; µ g , γ) is to make it easier to escape from local maxima during the early iterations of EM algorithm, by making boundaries of B g "softer" than Uniform distribution. As shown in Section 3.1, γ estimate is essentially obtained as the proportion of DNA fragments that belong to one of the binding 4 events (i.e., not correspond to background) but do not overlap positions of binding events (µ g ). As iterations progress in the EM algorithm, estimates of µ g improve and number of such DNA fragments decreases. As a result, in the later iterations of EM algorithm, γ estimate becomes close to zero and P (s|l; µ g , γ) converges to Uniform distribution.
2. Draw library size, L i , from known distribution P (L).
3. Draw start position of the DNA fragment, S i , conditional on Z i and L i : (a) If the DNA fragment belongs to g-th binding event (Z ig = 1, 1 ≤ g ≤ g * ), draw start position of the fragment, S i , from P (S|L; µ g , γ).

Generative model for single-end tag (SET) data
In the SET data, one of two ends of each DNA fragment is randomly selected and sequenced. Hence, L i for each fragment is not observable; however, positions and strands of the reads corresponding to the sequenced ends are known (denoted by R i and D i , respectively). We assume that D i follows Bernoulli distribution with known parameter p D . Exploratory analysis indicates that these read distributions can be well approximated with Normal distribution. Specifically, we assume that (R|Z = g, D = 1; µ g , δ, σ 2 ) ∼ N (µ g − δ, σ 2 ), and (R|Z = g, D = 0; µ g , δ, σ 2 ) ∼ N (µ g + δ, σ 2 ).
3. Draw position of the read, R i , conditional on Z i and D i : (a) If the read belongs to g-th binding event (Z ig = 1, 1 ≤ g ≤ g * ) and it is in the forward strand (D i = 1), draw position of the read, R i , from Normal(µ g − δ, σ 2 ).
(b) If the read belongs to g-th binding event (Z ig = 1, 1 ≤ g ≤ g * ) and it is in the reverse strand (D i = 0), draw position of the read, R i , from Normal(µ g + δ, σ 2 ).
(c) If the read is from background (Z i0 = 1) and it is in the forward strand (D i = 1), draw position of the read, R i , from Uniform(1 − β + 1, m).
(d) If the read is from background (Z i0 = 1) and it is in the reverse strand (D i = 0), draw position of the read, R i , from Uniform(1, m + β − 1). 6

The dPeak Algorithm
We estimate parameters of the models for PET and SET data using the Expectation-Maximization (EM) algorithm [3]. We do not have explicit solutions in the M-step for the PET model. Maximization with respect to (µ 1 , µ 2 , · · · , µ g * ) requires searching over g * -dimensional space and O(m g * ) operations, which is computationally prohibitive. In order to boost up computation and stabilize estimation, we employ the Expectation-Conditional-Maximization (ECM) algorithm [4]. The ECM algorithm requires only searching over one-dimensional space, [1, m], for the maximization with respect to each µ g while keeping the other parameters fixed. This reduces the computation time to O(mg * ) operations. Our simulation studies show that this approach is computationally efficient and provides fast convergence with accurate and stable estimation (data not shown). We have explicit solutions in the M-step for the SET model. Although the EM algorithm has desirable convergence properties, it does not guarantee convergence to the global maximum when there are multiple maxima. As a result, the final estimates depend upon the initial values [5,6]. In order to address this issue, we consider the stochastic EM algorithm [7], which is a special case of Monte Carlo EM [5,6], for the first half of iterations. The stochastic EM algorithm allows a chance of escaping from a current path of convergence to a local maximizer to other paths [5]. After certain number of iterations, we switch to the ordinary version of our EM algorithm because the stochastic EM is not desirable when the process is near to convergence to a suitable local maximizer [5].
In the EM implementation, non-identifiability due to overfitting (fitting too many components in the model) is problematic and should be avoided [5,8]. We address this issue during the EM iterations as follows. If the distance between two binding events is shorter than the size of the binding site (defined by the length of the known or predicted consensus motif), we combine these two components and consider it as one component during the remaining iterations. For the σ 70 application, we set this parameter to 20bp since σ 70 binds to −35bp and −10bp from transcription start site. Moreover, if the strength of a binding event is too weak (π g < 0.01), this component is also removed from further consideration in the remaining iterations. 7

The dPeak algorithm for PET data
Given the generative model for PET data described in Section 2.1, we have the following complete likelihood: Let S = (S 1 , S 2 , · · · , S n ), L = (L 1 , L 2 , · · · , L n ), and g * , γ (t) . Then, the EM algorithm for the PET data is obtained as follows:

M-step:
For g = 1, 2, · · · , g * , we obtain Moreover, This algorithm has the following intuitive interpretation. In the E step, each fragment is allocated to a binding event or background component based on whether or not the fragment overlaps the actual binding events. When the fragment overlaps with more than one binding events, it is assigned to each of these events in a fractional manner. The fractions are proportional to relative strengths of the binding events (π g ). In the M step, location of each binding event (µ g ) is essentially updated to the position with the largest number of aligning fragments. In this step, fragments with shorter library size (L i ) have more voting power. This is intuitive from the experimental procedure point of view because it is easier to identify the actual position of a binding event with shorter fragments.

The dPeak algorithm for SET data
Given the generative model for SET data described in Section 2.2, we have the following complete likelihood: Let R = (R 1 , R 2 , · · · , R n ), D = (D 1 , D 2 , · · · , D n ), and Then, the EM algorithm for the SET data is obtained as follows: E-step: For g = 1, 2, · · · , g * , , and for g = 0, where A is an appropriate normalizing constant.

M-step:
For g = 1, 2, · · · , g * , we obtain Similarly, Moreover, This algorithm has the following intuitive interpretation. In the E step, each read is allocated to a binding event or background component based on the distance between the binding events and the read shifted by δ towards its 3 direction. Both the peak shape (p D , δ, and σ 2 ) and the relative strengths of the binding events (π g ) are considered in this allocation. In the M step, location of each binding event (µ g ) is updated to the averaged position of reads corresponding to the binding event, after reads are shifted by δ towards their 3 direction. One peak shape is estimated for each candidate region through δ and σ 2 . Optimal shift of reads from their corresponding binding events, δ, is updated to the averaged distance between the location of each binding event and the positions of reads corresponding to this binding event, averaged over binding events in the region. Dispersion of the reads around their corresponding binding events, σ 2 , is updated to the variance of the position of reads corresponding to the binding event around location of each binding event (µ g ), after reads are shifted by δ to their 3 direction, averaged over binding events in the region.

Model selection
In practice, determining the optimal number of binding events, g * , in each candidate region can be cast as a model selection problem. Model selection based on the Bayesian Information Criterion (BIC) [9] is a popular choice in mixture modeling and has shown superior performance in diverse applications [10,11]. Therefore, for pre-specified g max , we fit models for each of g * = 1, 2, · · · , g max binding event components and choose the model with the BIC value corresponding to the first local minimum, as the final model.
Choice of g max is an important issue in model selection. g max should be large enough so that all binding events in each candidate region can be considered. On the other hand, setting g max larger than necessary should also be avoided in order to prevent choosing a model due to ill-conditioning rather than a genuine indication of a better model [10,11]. For appropriate choice of g max in the current application, we checked the number of known binding events in each candidate region of σ 70 data from the RegulonDB database [12] (http://regulondb.ccg.unam.mx) and found that 92% of the peaks have either one or two binding sites within the peak region. Based on this exploratory analysis, we set g max = 5 as the default value and use it for all the analysis described in the manuscript. For other applications, appropriate choice of g max might depend on the protein type and experimental conditions. Figure S2A displays the density of library size in the σ 70 PET ChIP-Seq data. The corresponding mean and standard deviation are 192.01bp and 26.90bp, respectively. Figure S2B shows the estimated density of 2δ in the σ 70 quasi-SET ChIP-Seq data, where δ is the half of the distance between modes of forward and reverse strand reads belonging to each binding event in the candidate region. Mean and standard deviation of 2δ are 187.36bp and 9.04bp, respectively. Figure S2C depicts the scatter plot of library size vs. estimated 2δ and it indicates that, overall, we have larger 2δ estimates for the candidate regions with larger average library sizes. We observe the same pattern in Figure S2D, which displays a similar plot for PET and SET simulation data.  Figure 4C for the PET and quasi-SET ChIP-Seq data, respectively. GOF plots compare the empirical distribution of the read positions with that obtained by simulating from estimated model parameters. These GOF plots are representative of the GOF plots for other candidate regions and they indicate that the dPeak models fit the data well.

Effects of the Merging Step in PICS for Closely Spaced Binding Events
PICS [13] generates initial predictions for locations of protein binding events and then merges initial predictions that have overlapping "binding event neighborhoods". A binding event neighborhood is defined as the predicted location of a binding event extended by three standard errors of the shift parameter estimate to both sides. In order to evaluate the effect of merging on PICS binding event predictions, we regenerated results in Figures 2A, B without the merging step for PICS. Figures S4A,  B show that PICS without merging step performs comparable to dPeak for SET ChIP-Seq data and the merging step of PICS results in loss of resolution for closely spaced binding events. Although it might be possible to tune the merging step, PICS currently does not provide this functionality.

Peak Shape Estimation of GPS for Closely Spaced
Binding Events Figure S5A displays the peak shape estimated by the GPS algorithm [14] for synthetic ChIP-Seq data when there is only one binding event in each candidate region. It depicts density of forward strand reads with respect to the distance from the location of binding event (corresponding to zero in the x axis). This same peak shape is used genome-wide for modeling of reads in both forward and reverse strands.
When there is single binding event, peak shape is correctly estimated as uni-modal. Figure S5B displays the peak shape when the distance between two binding sites in each candidate region is set to 450bp. The peak shape is still correctly estimated as uni-modal and it looks similar to the peak shape estimated for single binding events. Moreover, in these two cases, the estimated peak shapes are similar to their initial shapes. Figure S5C shows the estimated peak shape when the distance between two binding sites in each candidate region is set to 140bp. In this case, both of the two closely spaced binding events affect peak shape estimation of the GPS algorithm. As a result, the peak shape is estimated as bi-modal, which in turn leads to predicting the two binding events as a single event after a few rounds of the GPS iterations. We note that this problem typically occurs for nonparametric mixture models when the distances between mixture components are relatively short compared to the bandwidth.

Evaluations on Simulation Data based on Different Data Generation Process
When comparing PET and SET data with simulations ( Figures 2C, D and Figure  S6), we first generated PET data and then obtained corresponding SET data by randomly sampling one of two ends of each resulting DNA fragment. Although such a data generation process closely mimics the process for generating real SET ChIP-Seq data, dPeak model for SET ChIP-Seq data capitulates this process by a Normal approximation of the density of each of forward and reverse strand reads.
In order to assure that our evaluation using random sampling does not give unwarranted advantages to PET data, we generated SET data with read positions directly originating from Normal distribution and repeated the analysis in Figures 2C, D. Figures S7A, B, C confirm that the comparisons between PET and SET data remain the same regardless of how SET data is simulated.  Figure S6, where SET data is generated by random sampling of one of the two ends from each DNA fragment in PET data. 24 cation Consider a region with two closely located binding events. Processing of DNA fragments generated from this region will lead to classification of the fragments in one of the following four categories: Category I: Fragments overlapping a single true binding event.
Category II: Fragments overlapping both binding events.
Category III: Fragments overlapping only the false binding event.
Category IV: Fragments not overlapping any binding events.
Only fragments in category I are truly informative. Fragments in category II are less informative than fragments in category I. They could potentially contribute to both binding events, possibly through proportional allocation based on relative distances from each binding event. However, ambiguity in prediction increases as the number of fragments in category II increases. Fragments in category III introduce noise to binding event estimation since they are associated with the wrong binding event. Fragments in category IV are uninformative. In summary, invasion refers to increased number of category II fragments in SET data compared to PET data and truncation refers to increased number of category III and IV fragments in SET data compared to PET data. Table S5 displays the number of fragments in each category from one simulated dataset where we set the distance between the two binding events as 50bp. Average library size is 139bp in the PET data. The estimated library size used with SET analysis are reported in parentheses in the first column. In the corresponding SET data, even when extension is relatively accurate (extension = 150bp), numbers of fragments in categories II to IV increase significantly compared to PET data. When the library size is under-estimated as 100bp, we have significantly more fragments in categories III and IV (truncation; Figure S8B). In contrast, when it is over-estimated as 200bp, we have significantly more fragments in category II (invasion; Figure S8A).
We used the dPeak generative model and calculated the probability of invasion and truncation ( Figure S8) as follows. As in the previous sections, let S and L be start position of DNA fragment and its length, respectively, in PET ChIP-Seq data. Let l * denote the fixed library size used in the analysis of SET ChIP-Seq data. Z indicates group index of the DNA fragment where Z = 1 and Z = 2 indicates correspondence to the first and second binding events, respectively. Let µ 1 and µ 2 be  Table S5: Classification of 600 DNA fragments from one simulated dataset with two binding events separated by 50bp.
positions of first and second binding events, respectively, and assume that µ 1 < µ 2 . Probability of invasion ( Figure S8A) is obtained as: As illustrated in Figure S8B, for truncation, we consider the case that the original DNA fragment covers both binding events in PET data. The corresponding probability can be calculated by defining the truncation event with the use of the Z variable: P (T runcation) = E L [P (S + l * < µ 2 , S < µ 1 < µ 2 < S + L|Z = 2, L)] = L=l P (L = l)P (S + l * < µ 2 , S < µ 1 < µ 2 < S + l|Z = 2, L = l)

Evaluations on σ 70 PET and SET ChIP-Seq Data Using RegulonDB and Experimentally Validated Sites as a Gold Standard
We compared performances of deconvolution algorithms dPeak, PICS, GPS, and GEM using σ 70 PET and quasi-SET ChIP-Seq data by considering RegulonDB annotated binding sites as a gold standard. We assessed sensitivity of each algorithm using the set of candidate regions with at least two annotated binding sites and evaluated resolution using the candidate regions with exactly one annotated binding site. Table S6 and Figures S9A, B show that dPeak using PET ChIP-Seq data provides significantly higher sensitivity and resolution than SET ChIP-Seq data regardless of the deconvolution algorithm used. GPS performs the worst and its poor performance had recently motivated the development of GEM [15]. Overall, dPeak and GEM perform similarly and both are slightly better than PICS with SET data in terms of sensitivity.
We also compared deconvolution algorithms using our small set of experimentally validated binding sites as a gold standard. This comparison ( Figure S9C) further confirmed our conclusions from the RegulonDB-based comparisons. The differences in resolution between dPeak using PET ChIP-Seq data and each of the deconvolution algorithms using SET ChIP-Seq data are statistically significant with p-values < 0.01.  Table S6: Sensitivity comparisons across regions with at least two annotated binding events for σ 70 PET and quasi-SET ChIP-Seq data in aerobic and anaerobic conditions. RegulonDB annotated binding sites are used as a gold standard. A gold standard binding event is marked as identified if the distance between the prediction and the RegulonDB reported location is less than 30bp (overall conclusions remained the same with other distances).    Figure S10: MochiView genome browser [16] screenshots of promoter regions of yejG (A) and spr (B) genes.  Figure S12: MochiView genome browser [16] screenshots of promoter regions of serC (A) and hybO (B) genes.  Figure S13: MochiView genome browser [16] screenshots of promoter regions of ybgI (A) and ptsG (B) genes.
15 Differential Occupancy of Closely Located Binding Sites between Aerobic and Anaerobic Conditions in E. coli σ 70 PET ChIP-Seq Data Figure 5C elucidates the merit of high resolution analysis in the studies of differential occupancy. However, if there is no occupancy in one condition at all, such differential binding could still be identified in the peak-level analysis and high resolution analysis might be considered less interesting. High resolution analysis is perhaps most interesting when the same region is identified as a peak in both conditions but different numbers of binding events are identified between conditions. We further decomposed the predicted binding events based on the number of predicted events in the region in each condition. Table S8 shows that although many regions are occupied in both conditions, the number of predicted binding events can differ significantly. Figure S14 depicts an example of differential occupancy of closely located binding sites in the promoter region of gltA gene. Specifically, two binding sites are predicted by dPeak in anaerobic condition while only one of them are identified in aerobic condition. In contrast, MOSAiCS identified the region covering both binding sites as a single peak in both aerobic and anaerobic conditions.   Figure S14: MochiView genome browser [16] screenshots of promoter region of gltA gene. Both gltAp1 and gltAp2 are identified as binding sites in anaerobic condition using PET ChIP-seq data while only gltAp1 is identified in aerobic condition.

Evaluations of the Algorithms for PET ChIP-Seq Data
To the best of our knowledge, SIPeS is currently the only algorithm specifically designed for supporting PET ChIP-Seq data and has been shown to attain better resolution than a version of MACS that can analyze PET data [17]. We used C implementation version 2.0 of SIPeS from http://gmdd.shgmo.org/Computational-Biology/ ChIP-Seq/download/SIPeS. In our computational experiments and data analysis, we both used its default parameters and also considered alternative values for the parameters that define the range of the dynamic baseline to construct the signal map. SIPeS constructs signal map by piling up the aligned paired-end reads. [17] observed that SIPeS was able to attain high resolution for binding event identification when used with a wide range of dynamic baseline. Therefore, we investigated the performance of SIPeS when the DNA fragment pileups corresponding to two binding events are above ( Figures S15A, B) and within the range of dynamic baseline ( Figure S15C) in our computational experiments as described in the main manuscript. We observed that tuning the range of the dynamic baseline is far from trivial. Furthermore, a global value across the whole genome is not likely to perform well. There are also no guidelines or objective ways of configuring such a range. Figure S15A shows that SIPeS has low sensitivity when two binding events are closely spaced. In this case, the value of the DNA fragment pileups between these two binding events did not belong to the range of dynamic baseline and, as a result, SIPeS identified the whole region as a single peak. Hence, although two binding events reside within this peak, SIPeS reported only a single summit. In contrast, Figure S15B shows that, on average, SIPeS identified more than 10 binding events when the distance between two binding events is larger than average library size. For these settings, there were some regions with low DNA fragment pileup within the range of the dynamic baseline between the two binding events. As a result, SIPeS essentially identified all local maxima as binding events and this resulted in low positive predictive value of SIPeS.
When the values of the DNA fragment pileups corresponding to the binding events are within the range of dynamic baseline, SIPeS is able to identify the two binding events ( Figure S15C). However, SIPeS also identified all other local maxima as binding events and exhibited significant loss of positive predictive value. We also note that the SIPeS predictions corresponding to true binding events could not be distinguished from others, using the other summary statistics such as p-value or maximum fragment pileup value provided by SIPeS.
Finally, we evaluated SIPeS predictions for the 8 experimentally validated re-gions of σ 70 PET ChIP-Seq data. As discussed in the manuscript, these regions harbor a total of 14 experimentally validated σ 70 binding sites. Figure S15D illustrates that dPeak attains significantly higher resolution compared to SIPeS in these regions (p-value of the paired t-test between dPeak and SIPeS < 0.01). Furthermore, although these regions harbor at most two validated σ 70 binding sites, SIPeS predicted 2 to 18 binding sites. In summary, SIPeS does not sufficiently leverage PET ChIP-Seq data to provide high resolution for studying protein-DNA interactions. Furthermore, it is also highly sensitive to background noise in ChIP-Seq data and requires parameter tuning. We also note that the analysis of high depth PET ChIP-Seq data, such as that of σ 70 , using SIPeS requires considering wider ranges of dynamic baseline. This, in turn, increases the computation time significantly.
Overall, it seems computationally prohibitive to implement a genome-wide analysis of such data using SIPeS, i.e., analysis of σ 70 required more than 72 hours on a standard 64 bit machine with Intel Xeon 3.0GHz processor.

Application of dPeak to a GATA1 SET ChIP-Seq Peak
In this section, we discuss an application of dPeak in eukaryotic genomes using the GATA1 SET ChIP-Seq data from [18]. This dataset has 106,381,508 reads and measures GATA1 occupancy in G1E-ER4 cells after estradiol treatment. GATA1 is known to bind to short consensus sequence WGATAR (W = A or T, R = A or G) [19]. A typical GATA1 ChIP-Seq peak on average harbors 2.32 WGATAR sites in this dataset. Being able to identify which of these are occupied is important for refining consensus sequences and deriving functional roles of about 7 million WGATAR sites in the mouse genome. Figure S16 displays coverage plot of the GATA switch site of the GATA2 locus (-2.8 kb). This region contains four WGATAR motifs separated by 20bp to 109bp. dPeak predicts that GATA1 factor binds to the second consensus site.

Evaluations on Human SET ChIP-Seq Data
We evaluated the performance of dPeak on human SET ChIP-Seq data that GPS and PICS were optimized for. We considered GABPA SET ChIP-Seq data in GM12878 cell line from the ENCODE database. We identified 2,469 candidate regions using MOSAiCS (FDR = 1e-20) and these candidate regions were explicitly provided to the GPS and GEM algorithms as candidate regions. Candidate regions for PICS were identified using the function segmentReads() in the PICS R package (default parameters). Default tuning parameters were used during model fitting for all the methods.
In the case of a sequence-specific factor with well-conserved motif such as the GABPA factor, we observed that dPeak prediction can be further improved in a straightforward way by incorporating sequence information. Specifically, after identifying initial dPeak predictions, we identified a de novo motif using MEME [20] and detected positions of these consensus sequences using FIMO [21]. Then, we updated the dPeak predictions if the GABPA consensus sequences were found within the 50bp window around initial dPeak predictions. We call these dPeak predictions that integrate sequence information as 'dPeak2'. Figure S17 shows resolution comparison on the GABPA-GM12878 dataset. The resolution is defined as the absolute distance to the nearest predicted consensus site, where the prediction utilizes the independent position weight matrix from JASPAR [22]. The results indicate that dPeak performs comparable to GPS (median resolution = 18bp and 19bp for dPeak and GPS, respectively) and they both significantly outperform PICS (median resolution = 30bp). Moreover, dPeak2 performs comparable to GEM and identifies the GABPA binding sites with high accuracy.