Detection of cooperatively bound transcription factor pairs using ChIP-seq peak intensities and expectation maximization

Transcription factors (TFs) often work cooperatively, where the binding of one TF to DNA enhances the binding affinity of a second TF to a nearby location. Such cooperative binding is important for activating gene expression from promoters and enhancers in both prokaryotic and eukaryotic cells. Existing methods to detect cooperative binding of a TF pair rely on analyzing the sequence that is bound. We propose a method that uses, instead, only ChIP-seq peak intensities and an expectation maximization (CPI-EM) algorithm. We validate our method using ChIP-seq data from cells where one of a pair of TFs under consideration has been genetically knocked out. Our algorithm relies on our observation that cooperative TF-TF binding is correlated with weak binding of one of the TFs, which we demonstrate in a variety of cell types, including E. coli, S. cerevisiae and M. musculus cells. We show that this method performs significantly better than a predictor based only on the ChIP-seq peak distance of the TFs under consideration. This suggests that peak intensities contain information that can help detect the cooperative binding of a TF pair. CPI-EM also outperforms an existing sequence-based algorithm in detecting cooperative binding. The CPI-EM algorithm is available at https://github.com/vishakad/cpi-em.

called peaks on these combined alignments. We called peaks on this merged set using MACS2, with a p-value threshold of 0.1, and retained peaks whose q-values were less than 0.1. MACS2 was run with the additional --nomodel --extsize 147 -g 1.21e7 options.
A.3 ChIP-seq of FIS and CRP in E. coli from early-exponential (EE) and mid-exponential (ME) phase cultures: For FIS and CRP ChIP-seq datasets, we utilized pre-computed peak calls that were available on the GEO database with accession number GSE92255. Though the ChIP-seq experiments were carried out in replicates, these peaks were called by running MACS2 on merged alignments of sequence reads from both replicates. The peak calls in this set were filtered such that all peaks had a q-value less than 0.05.
In order to analyze whether these samples were sequenced to saturation, we had to call peaks on individual replicates.
We followed the same peak calling protocol for the individual replicates as was stated on the GEO database entry. We used MACS2 (v2.0.1) in paired-end mode (-f BAMPE) to call peaks on the invidividual replicates, with those peak calls whose q-value is less than 0.05 being retained. To call peaks on the merged replicates, we passed alignment files of both individual ChIP and input replicates to the -t and -c options of MACS2 (v2.0.1).
We would like to highlight here that the pipeline used for calling peaks in the individual early-exponential (∆FIS) CRP ChIP-seq replicates was slightly modified from those of the other datasets, where MACS2 was (v2.0.1) was run with the additional options (--keep-dup 'all'). This was in line with the original study that reported this data, where PCR duplicates were kept in the individual replicates because MACS2 reported no peaks in the second ChIP-seq replicate.
However, the peak calls for the merged replicates were called with the default PCR duplicate removal options for MACS2.  Table A: Summary of number of reads and peak calls in each ChIP-seq replicate of FOXA1, HNF4A and CEBPA. As stated in A Section, for FOXA1 (∆CEBPA), FOXA1 (∆HNF4A) and FOXA1 (∆HNF4A) datasets, only a single replicate of ChIP-seq was performed. The peak calls for these data sets have a q-value less than 0.05. For the remaining datasets where ChIP-seq was performed in two replicates, the number of peaks shown for each replicate are those whose p-value is less than 0.001. For the merged replicates, those peaks whose IDR values are less than 0.01 are then retained for further analyses.      Figure 2. The second column describes the total number of doubly bound regions in each dataset. In parentheses, the total number of target TF peaks, and the fraction of target TF peak regions that are doubly bound are shown (f = n double /n peaks ). The remaining columns list the number of target TF peaks that were lost, increased in rank, decreased in rank, or unchanged in rank, from genomic regions bound by both target and partner TFs. The numbers in parentheses are the number of target TF peaks that were lost, increased in rank, decreased in rank, or unchanged in rank, from all genomic regions bound by the target TF. The last column is the total fraction of cooperatively bound peaks in each dataset. Statistical tests to detect significant peak rank changes could be carried out only in FOXA1-HNF4A, FOXA1-CEBPA and HNF4A-CEBPA datasets (see A1 and A3 sections).   Figure I after indirectly bound peaks have been removed. The second column describes the total number of doubly bound regions in each dataset. In parentheses, the total number of target TF peaks, and the fraction of target TF peak regions that are doubly bound are shown (f = n double /n peaks ). The remaining columns list the number of target TF peaks that were lost, increased in rank, decreased in rank, or unchanged in rank, from genomic regions bound by both target and partner TFs. The numbers in parentheses are the number of target TF peaks that were lost, increased in rank, decreased in rank, or unchanged in rank, from all genomic regions bound by the target TF. Statistical tests to detect significant peak rank changes could be carried out only in FOXA1-HNF4A, FOXA1-CEBPA and HNF4A-CEBPA datasets (see A1 and A3 sections).

Genotype
C Estimation of average mappability of M. musculus, S. cerevisiae and

E. coli genomes
The single-end reads in the M. musculus data we employed was 36 bp in length. We downloaded mappability data for the mm9 mouse genome assembly for 36 bp reads from the UCSC genome browser mm9 download section, which was in the bigWig format. We converted this into a bedgraph file using the bigWigToBedGraph program (version dated 03 April 2018).
For E. coli and S. cerevisiae genomes, we created mappability files in the wiggle format using the program GEM (v2.5.2.1) [25]. We used GEM to compute the mappability of 42 bp long reads in the S. cerevisiae genome and 101 bp long reads in the E. coli genome using the gem-indexer. These were the read lengths employed in the original studies from which we sourced the ChIP-seq data. The sequence of commands used to create the S. cerevisiae genome mappability data for 42 bp reads were - We note that this E. coli mappability track was created assuming that the reads were single-end reads. Given that the data we used employed paired-end reads, the mappability of these regions are higher in practice than that calculated by gem. The wig2bed program (v2.4.32) was part of the BEDOPS suite [26].
From these mappability tracks, we followed the procedure outlined in [24] to compute the mappability of ChIP-seq peaks. For a peak of length L base pairs in length, if the mappability score of a M base pair read at each of the L positions is m 1 , m 2 , . . . , m L , then, the average mappability of the peak is computed asm = L i=1 m i /L. This calculation was performed for all peaks in each peak call file using the map program (with options -o mean -c 4, where the 4th column of the bedgraph is the mappability score) of the BEDTools suite (v2.25.0). D Intensities of trimmed peaks of cooperatively bound target TFs are weaker than those of non-cooperatively bound target TFs Since we constructed target and partner TF peak pair sets by choosing those peak pairs whose peak regions overlapped by at least one base pair, we checked if our results were an artefact of this overlap criterion. When we trimmed all ChIP-seq peaks across datasets to 50 base pairs on either side of their summits, we still found cooperatively bound target TF peaks to be more weakly bound than non-cooperatively bound target TF peaks ( D Fig). However, this trend was no longer statistically significant in S. cerevisiae and E. coli datasets. However, the fact that the notches of the boxplots are larger than the 25 − th and 75 − th percentiles of the peak intensity distributions indicates that the number of peaks available in these datasets is small. Thus, the lack of statistical significance of the trends in intensities of these datasets is likely a signature of the small number of peaks.

E Performance of different variants of the CPI-EM algorithm
When CPI-EM was run to compute cooperative binding probabilities in the main text, Lognormal shapes were fitted to the joint probability function of observing cooperative and non-cooperatively bound peak intensity pairs (step 2 of the CPI-EM algorithm). In Figure L, we compare the detection performance of this version of the CPI-EM algorithm with two other variants that fit Gamma and Gaussian shapes instead of a Lognormal shape. Figure L shows the auROC of these three variants of the CPI-EM algorithm after they were run on all the datasets shown in Figure 1 in the main text. We also compared the auROC of these CPI-EM variants to the peak distance detector, which is described in the Materials and Methods of the main text, and a detector based purely on chance. The chance detector is based on using tosses from a biased coin, whose probability of showing heads is α, to detect cooperative interactions. The area under the ROC of this detector will be 0.5 for any dataset (see J Section). An auROC of 0.5 thus represents the minimum level of detection performance that an algorithm should obtain to be considered a useful detector in practice.
In Figure L, it can be seen that the Log-normal CPI-EM variant has an auROC of at least 0.7, and thus can consistently detect cooperative interactions across all datasets. The Gamma and Gaussian variants, on the other hand, perform close to the level of the chance detector in FOXA1-CEBPA, HNF4A-CEBPA and early-exponential phase CRP-FIS datasets. There is considerable variation in the auROC of the peak distance based algorithm: less than 0.5 in early-exponential phase FIS-CRP and RTG3-GCN4 datasets, but higher than 0.5 in the remaining datasets. The fact that this algorithm can perform worse than a chance detector shows that peak distance, by itself, is an unreliable criterion for detecting cooperative binding. The complete ROC curves of each of the CPI-EM and peak distance algorithms for the datasets in We checked if the performance of CPI-EM was dependent on instances of indirect binding between target and partner TF peaks and the size of the ChIP-seq peaks input to it. We found that removing indirectly bound peaks of target and partner TFs ( Figure QA) did not noticeably alter the auROCs of any of the CPI-EM variants, with the log-normal CPI-EM variant continuing to perform better than chance across all datasets. When we trimmed the peaks of both TFs to within 50 base pair on either side of the peak summits before passing them as inputs to CPI-EM, there was no noticeable drop in the auROCs of any of the CPI-EM variants, except in the case of the mid-exponential phase CRP-FIS dataset. The auROC of the log-normal CPI-EM variant ( Figure QB) on this dataset dropped below 0.5.

F Analyzing the extent of saturation of sequencing depth for ChIP-seq datasets
To determine the extent to which the ChIP and input samples were sequenced, we called peaks on ChIP and input samples after sub-sampling 20%,40%,60% and 80% of reads from each file. In samples that have been sequenced to saturation, a small change in the number of reads should not result in a change in the number of peaks called.
We sub-sampled reads from the BAM alignments using the samtools view -bs program. This was done by passing values of the form xy.f to the −s flag, where xy was a integer sampled uniformly between 0 and 99, and f was either 2,4,6 or 8, which corresponded to sampling 20,40,60 or 80% of the reads. We ran this command three times for each ChIP and input BAM alignment pair, thus generating three replicates for each sub-sampling fraction. We called peaks on each matched pair of sub-sampled ChIP and input BAM files using the appropriate pipelines described in A Section.
The results of this analysis are shown in K Fig. G Detecting significant rank changes of target TF peaks after the partner TF is knocked out To determine if a change in the peak rank of X upon the knocking out of Y is statistically significant, we construct a null distribution that captures the magnitude of rank changes of X expected purely due to variability in the ChIP-seq protocol. Suppose r 2 , . . . , r (2) n represent the normalized ranks (whose values are between 0 and 1) of n overlapping peaks in biological replicates 1 and 2 of the ChIP-seq of X (in the presence of Y). We then divide the interval [0, 1] into 10 equally sized bins (we verified that changing the number of bins did not drastically change the results), and compute the null rank change probability density g k null (x) of the k − th bin from the samples falls in the k th bin. X Gaussian kernel density estimator implemented in the Scipy library was used to compute g k null (x) for each bin. This represents the probability of observing a rank change purely due to inter-replicate variation, conditioned on the bin to which the peak's rank in replicate 1 belongs. The process of computing rank changes separately within each bin better captured the skew expected in rank changes arising from replicate variation. For instance, a peak of X, whose rank in replicate 1 is low, is far more likely to have a higher rank in replicate 2, than a peak with a high rank in replicate 1.
We then proceed to compute the significance of rank changes observed in peaks of X after Y has been knocked out. For this, we computed the ranks r from peaks of X that have been called from merging the read alignments of replicates 1 and 2. The average change in peak rank due to the merging of alignments was close to zero, i.e., the ranks where p is the number of peaks common between peak calls in the replicates and merged alignments. We also compute the ranks r ∆ 1 , r ∆ 2 , . . . , r ∆ q of peak calls from the ChIP-seq of X after Y is knocked out. We then construct the set of rank changes {|r falls. This is the probability of observing a rank change of magnitude |r belongs to the k − th bin. We finally obtain a sequence of probabilities p 1 , p 2 , . . . , p q corresponding to each rank change observed upon knocking out Y.
We then conduct q one-sided hypothesis tests, each of which test the null hypothesis H i : |r We carry out the hypothesis tests by checking if each p i < α, where α is chosen according to the Benjamini-Hochberg multiple hypothesis testing procedure [10] that sets the false discovery rate at 0.01. Statistics on the number of significant peak rank changes we observed in different datasets are shown in Table E.
We used this procedure to detect significant rank changes in FOXA1-HNF4A, FOXA1-CEBPA, and HNF4A-CEBPA datasets only. This was because (a) the ChIP-seq of FOXA1, HNF4A and CEBPA were carried out in replicates, and (b) the number of peaks that remained after IDR-based filtering was large. This gave us a sufficient number of peaks with which to reliably compute the null rank change distribution. This was not the case in the RTG3-GCN4 and GCN4-RTG3 datasets, where the number of ChIP-seq peaks in the merged alignments (89 in RTG3-GCN4 and 343 in GCN4-RTG3 datasets) was small. We could not detect rank changes in the CRP-FIS and FIS-CRP datasets because peak calls from individual replicates were not available, and hence, the null rank change distributions could not be computed.
H CPI-EM : Estimating parameters required to compute the probability of cooperative binding at a location The input to the CPI-EM algorithm consists of a set of peak intensity pairs where {x i } and {y i } are peak intensities of the target TF X and partner TF Y. We assume that the joint probability density of peak intensities from all these regions, p(x, y), is a mixture (i.e., a sum) of two densities representing cooperative and non-cooperative peak intensity distributions: where p 0 and p 1 are the joint densities of peak intensities from non-cooperatively and cooperatively bound regions, respectively. θ 0 and θ 1 represent the parameters of both joint distributions. As shown in the main text, we make three assumptions in the CPI-EM algorithm, which we describe and justify in detail below - The parameter vectors of the joint and marginal distributions are related as θ 0 = (θ X 0 , θ Y 0 ) and θ 1 = (θ X 1 , θ Y 1 ). This assumption reduces equation (1) to We found this to be a reasonable assumption across all our data sets when we calculated the mutual information (MI) [13] between peak intensities of cooperatively and non-cooperatively bound peak pairs, as determined by partner TF knockouts, across all our data sets (Table G). Mutual information, measured in bits, is a robust measure of statistical dependence between two random variables, whose value is zero if the variables are statistically independent [12].  Table G: A summary of the MI values estimated between pairs of peak intensities on the one hand and motif scores from the sequences underlying these peaks. We computed the MI separately for cooperatively, non-cooperatively and all doubly bound regions. The MI between target and partner TF peak intensities from both cooperatively and non-cooperatively bound regions are close to zero across all data sets. This is the case even after indirectly bound peaks are removed or if the MI is computed between motif score pairs instead of peak intensity pairs. MI is a non-negative quantity but the estimator we employ can give negative estimates of MI if the true MI in the data is low or if the number of peak pairs available is low. In cases where the number of peak pairs was far too low (typically, < 20), the MI could not be estimated and we have displayed these values as "-".
• Assumption 2 : We choose p X 0 , p Y 0 , p X 1 , p Y 1 to be either a Lognormal, Gamma or Gaussian density function, whose expressions and corresponding parameter sets are - Across most datasets, we found that the Lognormal distribution tended to best fit peak intensity distributions. This could be seen in terms of the log-likelihood scores obtained from fitting the three distributions individually to peak intensities of target and partner TFs from cooperatively and non-cooperatively bound regions. The log-likelihood score obtained from fitting these distributions to a set of peak intensities of a given TF is computed as where p is either the cooperative (p 1 ) or non-cooperative , which are target or partner TF peak intensities, respectively. Θ are parameters of the distribution chosen for p. A larger log-likelihood value indicates a better fit to data.
We computed Θ for each distribution using maximum likelihood estimates of these parameters. We used fit routines of the stats library of the Python package SciPy [11] to compute these estimates. The log-likelihood values calculated for each of the three distributions across all ChIP-seq datasets is shown in Table H.  • Assumption 3 : A cooperatively bound target TF is, on average, more weakly bound than a non-cooperatively bound target TF. This implies that p X 1 (x) < p X 0 (x) . We found this to be a reasonable assumption since in Figure  2 in the main text, cooperatively bound target TF peak intensities were significantly lower than non-cooperatively bound target TF peak intensities across all datasets.
From equation (3) each of θ X 0 , θ X 1 , θ Y 0 , θ Y 1 consist of two parameters, irrespective of whether p 1 is a Lognormal, Gamma or Gaussian density function. Along with π 0 , there are thus a total of 9 parameters that we need to estimate from D in order to compute the probability of each peak intensity pair in D being cooperative -

H.1 The expectation-maximization (EM) algorithm
We use the expectation-maximization (EM) algorithm [14,15] to estimate the parameters in equations (5) and (2). The output of the EM algorithm is a single set of parameters Θ = (π 0 , θ X 0 , θ Y 0 , θ X 1 , θ Y 1 ) that maximizes the log-likelihood log P (D, L|Θ), where D represents the peak intensity pairs {(x i , y i )} N i=1 and L = (L 1 , L 2 , . . . , L N ) are labels assigned to each of the N locations, where L i = 1 represents cooperative binding and L i = 0 represents non-cooperative binding.
The expectation-maximization algorithm [14,15] does this by computing a function Q(Θ, Θ ′ ), which is the expected value of the log-likelihood log P (D, L|Θ), given an earlier estimate of Θ = Θ ′ [16]: where S represents the set of all possible values of L.
We now describe the Q function employed in CPI-EM, and the implementation of the EM iteration process in detail below.
The set S of all possible labels L in equation (6) consists of 2 N elements because each element of L takes on values of either 0 or 1. This is a very large number of terms that need to be added to evaluate Q. However, Q simplifies to a sum over N terms for our model of cooperative binding. Q can be rewritten as Since a peak intensity pair is either cooperative or non-cooperative, we can write P (X i = x i , Y i = y i ) = π 0 p 0 (x i , y i ; θ 0 ) + π 1 p 1 (x i , y i ; θ 1 ), where π 0 + π 1 = 1. Since we consider (X i , Y i ) and (X j , Y j ) (i = j) to be statistically independent, P (L|D, Θ) in equation (7) can be expanded as - where l i is 0 or 1. Substituting the above two expansions into the expression for Q in equation (7), it can be shown that Q simplifies to the form shown below, where it is a sum over only N terms (page 4 in [16]) where, . Note that the first term is independent of Θ ′ , so it can be maximized independently of the second term. The choice of π l that maximizes the first term (page 5 in [16]) is - The k − th EM iteration involves choosing a value Θ = Θ (k+1) that maximizes Q(Θ, Θ (k) ), where Θ (k) is kept fixed at the value obtained in the previous EM iteration that maximizes Q(Θ, Θ (k−1) ). EM involves two steps, an E-step and an M-step, which are both needed to maximize Q(Θ, Θ (k) ). The E-and M-steps evaluated at the i − th iteration in our algorithm are [17] - π 0 (k) p 0 (x i , y i ; Θ (k) ) + π 1 (k) p 1 (x i , y i ; Θ (k) ) for l = 0, 1; i = 1, 2, . . . , N.
We choose the initial value Θ (0) as follows. From the data {(x i , y i )} N i=1 , we separate the peak intensities of X and Y as We then compute the value θ X mle that maximizes the likelihood N i=1 p(x i ; θ), where f is a Lognormal, Gamma or Gaussian density function. Similarly, we also compute the value of θ Y mle that maximizes the likelihood N i=1 p(y i ; θ). These maximum likelihood estimates θ X mle and θ Y mle are computed using the fit function provided by the Python Scipy stats library, which can provide maximum likelihood estimates when p 0 and p 1 are either Lognormal, Gamma or Gaussian density functions. We choose π (0) 0 from a Uniform[0, 1] distribution. We finally set our initial parameter vector Θ (0) to (π (0) 0 , θ X mle , θ Y mle , θ X mle , θ Y mle ). We verified that EM converged to the same local maximum when Θ (0) was perturbed by up to 30% around this choice (data not shown).

I Calculation of receiver operating characteristic (ROC) curves
CPI-EM : Given probability of cooperative binding p coop 1 , p coop 2 , . . . , p coop N , the ROC curve was calculated by picking thresholds α on these probabilities that corresponded to false positive rates between 0.1 and 1 in steps of 0.1. The true positive rate at each of these thresholds was then computed. The area under the ROC curve was then calculated using the trapezoidal integration rule available in the Python numpy library. This procedure was repeated for each of the three variants of the CPI-EM algorithm. Similarly, the ROC curve of the peak distance algorithm was computed by choosing thresholds on the peak distance that corresponded to false positive rates between 0.1 and 1 in steps of 0.1. After the true positive rate at each threshold was calculated, the area under the ROC was computed with the trapezoidal integration rule. J Area under ROC of a chance detector is 0.5 The chance detector is based purely on using tosses from a biased coin to detect cooperative interactions. Let the probability of the coin showing heads be α. Out of a set of N peak intensity pairs {(x i , y i )}, suppose there are N c and N nc cooperatively and non-cooperatively bound pairs, respectively. The number of false positives, resulting from N tosses of the coin, would be N nc α, while the number of true positives would be N c α. This means that both the FPR and TPR of the chance detector would be α. Thus, as α is varied between 0 and 1, the ROC of the chance detector will be the straight line F P R = T P R = α, which encloses an area of 0.5.
We checked if the performance of CPI-EM was dependent on instances of indirect binding between target and partner TF peaks and the size of the ChIP-seq peaks input to it. We found that removing indirectly bound peaks of target and partner TFs (

K Calculation of precision-recall curves
Precision-recall curves provide an alternate method to determining the performance of a detection algorithm. Using the terminology introduced in the main text, the precision (P ) and recall R of an algorithm is where, N T P is the number of true positives, N F P is the number of false positives and N F N is the number of false negatives (i.e. the number of cooperatively bound peaks that are erroneously declared as non-cooperatively bound). As stated in the main text, these quantities are dependent on the detection threshold employed in the algorithm. Changing the threshold over a range gives a series of precision and recall values, with the area under the precision-recall curve (referred to as the average precision), providing a measure of detection performance.
We to these two routines. In the case of STAP, we passed the cooperative indices ∆ 1 , ∆ 2 , . . . , ∆ n to compute its precisionrecall curve.

L Estimation of mutual information
Given a probability distribution p(x, y) over the set of peak intensity pairs {(x i , y i )} N i=1 , mutual information (MI) is calculated as where p X (y) and p Y (y) are the marginal distributions of p(x, y), and n is the number of {(x i , y i )} pairs. MI is a nonnegative quantity whose value is zero if X and Y are statistically independent i.e. if p(x, y) = p X (x)p Y (y). From the knockout data available for each of the TF pairs, we separated the peak intensity pairs {(x i , y i } N i=1 into a set of cooperatively bound peak intensity pairs A c = {(x j , y j )} and a set of non-cooperatively bound peak intensity pairs A nc = {(x k , y k )}. We separately computed the MI of peak intensity pairs in A c (setting p = p 1 in equation (L)) and A nc (setting p = p 0 in equation (L)) in each ChIP-seq knockout dataset we analyzed (Table G).
Estimating mutual information (MI) through direct use of the definition specified by equation (L) leads to many problems; such MI estimates can be biased, or go to infinity in the case of certain distributions [13]. We estimated MI using the LNC algorithm implemented in [13] that circumvents these issues. The drawback of the LNC algorithm was that it gave non-negative estimates of MI only when a sufficient number of data points (in our case, at least 20 peak pairs) were present and if the true value of MI is not too low [13] . We were thus unable to reliably estimate MI in some of the ChIP-seq datasets we analyzed. The code for the LNC algorithm is available at https://github.com/BiuBiuBiLL/NPEET_LNC/blob/master/lnc.py Our finding that the mutual information between target and partner TF peak intensities is low, irrespective of whether they are cooperatively or non-cooperatively bound (Table G) reflects only on the low statistical dependence between both distribution of peak intensities. Importantly, this does not necessarily contradict our other finding that target TF peaks are more weakly bound when that are cooperatively bound, in comparison to regions where they are not cooperatively bound. The example below demonstrates that both these findings can be consistent with each other -Suppose A and B are target and partner TFs, and the intensities of A and B are either 1 or 2, with "1" representing weak binding and "2" representing strong binding. Our statement that target TFs are more weakly bound when they  Two sets of probabilities that satisfy this statement are shown in Table I, where each cell represents the probability of observing a given peak intensity for A and B. For example, in sub-table I of Table I Thus, the statement that A is more likely to be weakly bound when it is cooperatively bound with B is satisfied.
The statement about mutual information being low shows that the peak intensities of A and B are statistically independent. In our example, the mutual information between peak intensities of A and B is computed as where E refers to the distribution corresponding to the values in sub-tables I or II in Table I. It can be seen that the MI will be zero if the ratios within the logarithms are equal to 1, that is - where i and j are either 1 or 2. The values chosen in Table I are such that the equalities in equation (9) will always hold.
We demonstrate this below for the case when A = 2 and B = 1, and both A and B cooperatively bind DNA. Thus, in this example, we can see that the MI between the peak intensities of A and B is zero, and at the same time, A is more weakly bound when it is cooperatively bound with B when compared to regions where A is not cooperatively bound.
In more intuitive terms, that a target TF tended to be more weakly bound when it cooperatively bound DNA, means that we are more likely to find the target TF peak intensity to be low if we are told that it is cooperatively bound by the partner TF. While it is true that a target TF should have a higher affinity to DNA when cooperatively binding with a partner TF, this implicitly assumes that binding site affinities of the target TF are, on average, the same between cooperatively and non-cooperatively bound regions. However, in F Fig The low value of MI reflects the statistical independence between the peak intensities of TFs that cooperatively bind DNA. The consequence of this is that the probability of finding a target TF to be weakly bound is independent of the peak intensity of the partner TF, and that this is true whether or not the two are cooperatively bound. Put differently, if we were asked to guess if the target TFs peak intensity at a particular location was low or high, then, the accuracy of our guess will remain the same even if we were told the peak intensity of the partner TF. Thus, the mutual information refers to the probability of finding a target TF to be weakly bound based on knowledge of the partner TFs peak intensity, and does not refer to the actual peak intensity of the target TF. In each panel, the white boxplot is the intensity distribution of those cooperatively bound target TF peaks whose average mappability falls within the 5th and 95th percentile of the average mappability distribution of non-cooperatively bound target TF peaks. The orange boxplot is the intensity distribution of all cooperatively bound peaks, which is identical to the boxplots of the target TF shown in orange in Figure 1 in the main text. The p-values in each panel are calculated from a Wilcoxon rank sum test between the white and orange distributions in each panel.    Figure 2 of the main text. The motifs of FOXA1, HNF4A and CEBPA in M. musculus ChIP-seq data were sourced from the HOCOMOCO v10 database [19]. The motifs of GCN4 and RTG3 were sourced from the ScerTF database [20]. The motifs of CRP and FIS were learnt de novo using the MEME suite (see Materials and Methods in the main text).  Each panel is a scatter between the highest motif score in each peak region and the intensity of that peak, with the R 2 value representing the Pearson correlation coefficient between the motif scores and peak intensities of each dataset, with ** indicating a p-value of less than 0.01. The left column represents scatters of motif scores and peak intensities from all doubly bound regions across the genome, while the middle and the right columns are scatters from cooperatively bound regions and non-cooperatively bound regions, respectively. The R 2 value is statistically significant across all M. musculus datasets (FOXA1,HNF4A and CEBPA), but is true only in subsets of doubly bound locations in the remaining datasets. Figure I: After indirectly bound ChIP-seq peaks are removed, the peak intensities of the remaining cooperatively bound target TFs are lower than the peak intensities of the remaining non-cooperatively bound target TFs. Box-plots of peak intensity distributions of cooperatively (orange) and non-cooperatively (gray) bound TF pairs, with target TFs on the left and partner TFs on the right. **** indicates a p-value of < 10 −4 from a Wilcoxon rank sum test. The whiskers of the box plot are the 5 − th and 95 − th percentiles of the distributions shown. In S. cerevisiae and E. coli datasets, cooperatively bound target peak intensities are not significantly lower than those of non-cooperatively bound peaks. However, this is partly because a large number of peaks were filtered out as indirectly bound in these datasets (E and F Tables). Figure J: After indirectly bound peaks are removed, motif scores and peak intensities of the target TF are not significantly correlated Each row represents a different dataset. Each panel is a scatter between the highest motif score in each peak region and the intensity of that peak, with the R 2 value representing the Pearson correlation coefficient between the motif scores and peak intensities of each dataset, with ** indicating a p-value of less than 0.01. The left column represents scatters of motif scores and peak intensities from all doubly bound regions across the genome, while the middle and the right columns are scatters from cooperatively bound regions and non-cooperatively bound regions, respectively. 0 Figure L: The CPI-EM variant that fits Log-normal distributions to peak intensity pairs consistently performs well across all datasets The area under the ROC curve (auROC) of the CPI-EM algorithm applied to each of the datasets shown in Figure 1 in the main text. CPI-EM variants that fit Lognormal, Gamma and Gaussian distributions are represented in orange, black and gray, respectively. The auROC of the peak distance based detector is shown in blue, while the auROC of the chance detector is shown by a dashed line. See I Section for the calculation of the ROC curve for both the CPI-EM and peak distance algorithms. The complete ROC curves for these datasets are shown in Figure    CPI-EM variants that fit Lognormal, Gamma and Gaussian distributions are represented in orange, black and gray, respectively. The auROC of the peak distance based detector is shown in blue.(A) In most datasets, removing indirectly bound peaks has only a minimal impact on the performance of CPI-EM. In the mid-exponential CRP-FIS dataset, however, the Log-normal CPI-EM variant performs worse than chance.
(B) The trimming of peaks to 50 base pairs on either side of the summit does not affect the performance of CPI-EM in M. musculus ChIP-seq data. However, in the mid-exponential CRP-FIS dataset, the Log-normal CPI-EM variant performed worse than chance.