Population size estimation for quality control of ChIP-Seq datasets

Chromatin immunoprecipitation followed by sequencing, i.e. ChIP-Seq, is a widely used experimental technology for the identification of functional protein-DNA interactions. Nowadays, such databases as ENCODE, GTRD, ChIP-Atlas and ReMap systematically collect and annotate a large number of ChIP-Seq datasets. Comprehensive control of dataset quality is currently indispensable to select the most reliable data for further analysis. In addition to existing quality control metrics, we have developed two novel metrics that allow to control false positives and false negatives in ChIP-Seq datasets. For this purpose, we have adapted well-known population size estimate for determination of unknown number of genuine transcription factor binding regions. Determination of the proposed metrics was based on overlapping distinct binding sites derived from processing one ChIP-Seq experiment by different peak callers. Moreover, the metrics also can be useful for assessing quality of datasets obtained from processing distinct ChIP-Seq experiments by a given peak caller. We also have shown that these metrics appear to be useful not only for dataset selection but also for comparison of peak callers and identification of site motifs based on ChIP-Seq datasets. The developed algorithm for determination of the false positive control metric and false negative control metric for ChIP-Seq datasets was implemented as a plugin for a BioUML platform: https://ict.biouml.org/bioumlweb/chipseq_analysis.html.


Introduction
Understanding the basic mechanisms of transcription regulation remains to be the great challenge in modern biology. Regulation of transcription is a complex process in which transcription factors (TFs) play the key role. As a rule, TFs recognize and bind with corresponding TF binding sites (TFBSs) in the genome. The in silico recognition of those TFBSs in whole genomes has been staying one of the most complex problems in bioinformatics. Nowadays, chromatin immunoprecipitation followed by sequencing (ChIP-Seq) is a widely used experimental technology for the identification of TF binding regions (TFBRs) containing TFBSs. For now, tens of thousands of ChIP-Seq experiments have been conducted. It is reasonable to assume that this number will increase rapidly year by year. PLOS

Algorithm for determination of FPCM and FNCM
Let D denote meta-set D = {D 1 , . . .,D k } consisting of k datasets of TFBRs D i , i = 1, . . .,k. We considered two following dual settings. In the first case, D 1 , . . .,D k are datasets of TFBRs obtained by independent application of k distinct peak callers to the same ChIP-Seq set of reads aligned to the reference genome. In particular, we considered the following k = 4 peak callers available in GTRD: GEM [19], MACS [20], PICS [21], and SISSRs [22]. In the second case, a meta-set contains TFBRs datasets obtained by application of single peak caller to the distinct ChIP-Seq sets of reads when the same TF was studied in different ChIP-Seq experiments. We developed our FPCM and FNCM metrics to assess the quality of individual datasets D i , i = 1, . . .,k as well as the whole meta-set D.
To derive FPCM and FNCM, we initially merged all TFBRs available in meta-set D, see Fig  1. After that, the pivotal frequencies f 1 , . . .,f k were counted on the basis of all merged TFBRs where f i was defined as the number of merged TFBRs that were composed by exactly i TFBRs from meta-set D. In particular, f 1 + . . .+ f k = n where n denotes the number of all merged TFBRs. On the one hand, f k is the frequency of those merged TFBRs that contained initial TFBRs from each D 1 , . . .,D k . On the other hand, f 1 is the number of so-called orphans, i.e. such TFBRs that did not overlap with other initial TFBRs. To control FP rate, we defined FPCM under the natural assumption that almost all FP TFBRs must be orphans, i.e. not confirmed by other TFBRs. Additionally, we assumed that unknown number of genuine TFBRs is a random variable with Poisson distribution. Under these assumptions, FPCM was defined as the ratio of the observed number of orphans f 1 to the unknown number of genuine orphans f 1 gen , i.e. ð1Þ The estimate f 1 e of the unknown f 1 gen was derived, in turn, as solution of the system of three equations where λ is unknown parameter of Poisson distribution, and pi is a probability that randomly chosen merged TFBR is composed by exactly i initial TFBRs. As a result, the final version of FPCM is expressed as ð5Þ where f 1 e is the expected number of orphans.
The detailed derivation of FPCM is as follows. According to formulas for p 1 and p 2 , the ratio p 2 / p 1 is equal to λ / 2, hence According to formulas for p 1 and p 3 , the ratio p 3 / p 1 is equal to λ 2 / 6, hence According to formulas (7) and (8), p 1 can be estimated as p 1 e where p 1 e = 2 p 2 2 / (3 p 3 ).
Finally, the formula (5) for FPCM = p 1 / p 1 e was obtained with the help of replacement of probabilities p 1 , p 2 , and p 3 by frequencies f 1 , f 2 , and f 3 respectively. In general, we can assume that each set of orphans may consist of True Positive Orphans (TPOs) and False Positive Orphans (FPOs), i.e. number of orphans (f 1 ), can be expressed as The number of TPOs was estimated as f 1 e with the help of Poisson distribution. Therefore the proportion of FPOs (say, p false ) can be estimated as and FPCM can be re-expressed as FPCM = 1 / (1-p false ). If Poisson distribution is not contaminated then f 1 and f 1 e have to approximately coincide. In this case p false has to be negligible and FPCM has to be close to 1. However, if Poisson assumption is seriously violated then proportion p false takes large values and FPCM considerably exceeds 1. For example, if p false = 1/2, 2/3 or 0.98, then FPCM takes the values 2.0, 3.0 and 50.0 correspondingly. In other words, If FPCM takes the values 2.0 or 3.0 then a half or more than a half orphans are FPOs. Basically, FPCM can not only assess the quality of datasets, but also can recommend to modify them to improve their quality, if necessary. Thus, if FPCM exceeds a prespecified threshold, such as FPCM 0 = 2 or 3, then FPCM recommends to modify dataset by removing orphans because in these cases, at least, a half (p false = 1/2) or a majority (p false = 2/3) of them are falsely generated by peak callers.
To control FN rates in D 1 , . . .,D k datasets, we defined FNCMs for each of them. Thus, FNCM was defined for every Di as the ratio of the observed number of TFBRs in Di to the unknown number of genuine TFBRs, say, N gen , i.e.
where | D i | denotes the size of the D i dataset. If FPCM is less than the prespecified threshold (FPCM 0 ), then it is not necessary to modify initial datasets, and N gen is estimated as the average of the four distinct published estimates E C , E LB , E Z , and E ML of the N gen , i.e.
where E C is Chao's estimate [23], E LB is Lanumteang-Bohling's estimate [24], E Z is Zelterman's estimate [25] and E ML is maximum likelihood estimate [26]. The explicit forms of these estimates are as follow: ð15Þ where λ � is calculated numerically by maximization of log-likelihood function L(λ) of zero- If FPCM exceeds the prespecified threshold FPCM 0 , then it is necessary to modify initial datasets by removing orphans. In this case the estimates E C , E LB , E Z , and E ML are degenerated because f 1 is vanished (f 1 = 0) due to discarding orphans. To obtain new estimate of unknown number of genuine TFBRs N gen , we considered all k(k-1)/2 distinct pairs (D i , D j ) i<j and calculated for each pair (D i , D j ) the Chapman's estimate [27] E i,j by the formula Then we checked for outliers in the obtained sample E Chap = {E i,j } and discarded the detected outliers. An arbitrary element X in sample is classified as outlier, if the following inequality holds: where X 0 and σ are mean value and standard deviation when X is temporary removed from the sample E Chap . Finally, N gen is estimated as the average of sample E Chap and FNCM(D i ) is expressed as

FPCM and FNCM: Guidelines for dataset selection and modification
To get the first idea about some particularities of FNCM and FPCM, we applied independently the proposed algorithm for their calculation to two meta-sets derived from the GTRD database developed by our team [2]. Meta-sets PEAKS035099 and PEAKS039626 contained TFBRs of TF CTCF and were generated by the following peak callers in GTRD: GEM, MACS, PICS, and SISSRs. According to Fig 1, we merged initially four individual datasets PEAKS035099 generated separately by those four peak callers. Then, the pivotal frequencies f i were computed on the base of 44699 merged TFBRs: f 1 = 5534, f 2 = 4542, f 3 = 2482, and f 4 = 32141. The value 0.998 of FPCM was calculated by formula (5). One can conclude that there are almost no FPs among TFBRs because FPCM was approximately equal to 1.0. In other words, the observed number of orphans was approximately equal to the expected one. To estimate the total number of genuine binding regions N 1 e in this case, it is sufficient to calculate the arithmetic mean of four estimates: Finally, the estimated number N 1 e = 49525 of TFBRs was used to compute FNCMs for all four individual datasets PEAKS035099. Table 1 contains these values.
92436 merged TFBRs were obtained after merging four individual datasets PEAKS039626. The pivotal frequencies f 1 = 46452, f 2 = 5797, f 3 = 12012, and f 4 = 28175 were used for computation of FPCM = 24.901. Obviously, this value essentially exceeded the threshold 2.0 (or 3.0) therefore we concluded that there is a large number of FPs among 46452 orphans. In other words, the majority of orphans were, in fact, falsely generated TFBRs. In this case, we were forced to discard orphans and obtain the following six Chapman's estimates for calculation of FNCMs by using formula (19):  Table 1. It is interesting to note that some high values of FPCM can be easily explained by abnormal outcome of one of the peak callers. Thus, GEM, MACS, PICS, and SISSRs generated 41827, 45318, 78011, and 43215 TFBRs correspondingly in case of PEAKS035099. It seems likely that PICS overgenerated a large number of TFBRs and many of them were classified by FPCM as FPs.
Basically, FPCM recommends to remove or not the orphans while the FNCMs allow to select more reliable dataset from meta-set. Thus, FPCM recommended to remove orphans from PEAKS039626, PEAKS038673, PEAKS038812, and PEAKS040149, see FPCM values in Table 1. FNCM recommended to select MACS-generated datasets PEAKS035099, PEAKS039626, PEAKS033837, PEAKS039665, PEAKS033184, PEAKS038673, and PEAKS038812 while in cases of PEAKS038038 and PEAKS040149 FNCMs recommended to select GEM-generated datasets.

Comprehensive quality control of ChIP-Seq datasets in the GTRD database
The first release of GTRD [28] has been prepared without taking into account the quality control of ChIP-Seq TFBR datasets. For the second release, we have computed the quality metrics FNCMs and FPCMs for all available datasets in GTRD according to the proposed algorithm. Then we studied the influence of presence/absence of input control in ChIP-Seq experiments on quality of TFBRs datasets generated by distinct peak callers. Four classification models: perceptron, Fisher's discriminant model, logistic regression, and support vector machine (SVM)- were exploited to reveal putative relation between presence/absence of input control and our metrics FNCMs and FPCM. The strength of putative relation was measured by accuracy of the classification models, which was defined as the fraction of all correctly classified instances. We applied the mentioned classification models to all 5084 human datasets in GTRD where input controls were available for 4033 (79.3%) datasets. To control the overfitting of the classification models we divided the whole set of datasets into equal-halves: training subset and test subset.  Table 3) for datasets with control is greater than the mean value of FPCM 11.876 for datasets without control. However, after removing 132 outliers (abnormal values) one can conclude that mean value of FPCM (namely, 3.923, see 2-nd row of Table 3) for datasets with control is less, than the mean value of FPCM 8.562 for datasets without control. Empirical densities of FPCM (see Fig 2(A)) confirmed correctness of the second version of FPCM average. On the base of this version and all FNCM averages in Table 3, we made a conclusion that the absence of input control resulted in increase of FP rate and in decrease of FN rates of MACS, PICS, and SISSRs. Using Wilcoxon rank sum test we have found that FPCM and almost all FNCMs (excluding FNCMs for MACS) made the statistically significant contribution into discrimination between presence/ absence of input control, see the corresponding p-values in Table 3. Fig 2(B) demonstrates the densities of FNCM(PICS), which is the most significant feature for discrimination.

Peak caller comparison
Despite the current existence of more than 30 published peak callers, various comparative analyses of them did not reveal the best one. These comparisons were performed frequently on a small number of datasets and distinct metrics were exploited for comparison. We also performed comparative analysis of GEM, MACS, SISSRs, and PICS by using 5084 human datasets and FNCMs. For this purpose, we determined the priority of peak callers in descending order of FNCMs for each dataset. Then we counted the frequencies of all 4! = 24 distinct priorities. Expected proportion of each priority was defined as its probability when all distinct priorities are equally probable, hence expected proportion is equal to 1 / 4! = 0.042. The following priority appeared to be the most frequent among distinct ones:

MACS > GEM > SISSRs > PICS:
On the one hand, the observed proportion 0.181 of this priority essentially and significantly (p-value < 10 −20 ) exceeded the expected probability, because the ratio between observed and expected proportions (say, R o/e ) was equal to 4.3. Statistical significance was estimated with the help of binomial distribution. Importantly, this excess was invariant with respect to the presence or absence of input control, see Table 4.
On the other hand, we could not accept this priority as the general tendency among peak callers, because the majority (namely, 81.8%) of datasets had other priorities. To obtain more reliable inference, we considered the relaxed arrangements for comparison of pairs of peak callers. In this case the most frequent priority among 12 distinct ones was fMACS and GEMg > fSISSRs and PICSg: The observed proportion 0.469 of this priority also essentially (R o/e = 5.6) and significantly (p-value < 10 −20 ) exceeded the expected proportion 1 / 12 = 0.083, and about a half of datasets had this priority, see Table 4. Therefore, one can conclude that, in general, MACS and GEM outperformed SISSRs and PICS. It is important to note that the reliability of this conclusion is increasing during transition from all datasets to datasets with input control, because the observed proportion of the corresponding priority increased from 0.469 to 0.568. This conclusion is also confirmed by FNCM values contained in Table 3. Finally, it is interesting to note that GEM appeared to produce weaker results when the input control was absent, and the priority fSISSRs; MACS; and PICSg > GEM was observed for majority (63.5%) of datasets without input control. Tables 3 and 4 illustrate this conclusion.

Relationships between the proposed quality metrics and other features of ChIP-Seq datasets
There are, at least, two types of ChIP-Seq dataset features in addition to the proposed quality metrics. The first type features are well-known standard quality metrics developed by the ENCODE consortium. For instance, metrics such as NRF, PBC1, PBC2, NSC, and RSC assess quality of read alignment to individual genomes. The second type features can be easily determined on the base of characteristics generated by individual peak callers. For example, MACS assigned such characteristics as 'Fold enrichment', 'FDR' (False Discovery Rate), 'Tags number' and '-lg(p-value)' to each generated TFBR. To obtain second type features for the whole dataset we averaged available characteristics over generated TFBRs in given dataset.
To study relations between the proposed metrics and the features of both types we performed regression analysis. For this purpose, we applied three multiple regression models, namely, ordinary least squares (OLS), random forest (RF), and SVM to 5084 human datasets in GTRD. The strength of relationships between the features and the FNCM/FPCM metrics was measured by application of Pearson's correlation between observed and predicted values of the metrics. To avoid the impact of overfitting the regression models we divided the whole set of 5084 datasets into the following equal-sizes subsets: training and test subsets. The regression models were fitted to the training subset while the correlation between observed and predicted values were calculated on the test subset. The maximal values of correlation 0.657 and 0.644 were achieved by RF models, see Table 5. In first case, regression model described the relation between FNCM (PICS) and the ENCODE quality metrics while the second regression described the relation between FNCM (GEM) and the peak caller characteristics. In general, moderate values of correlations indicate that there are no strong relations between the proposed metrics and the existing features, in particular, see Fig 3. In other words, there are no combinations of known features that can replace FNCM or FPCM. Despite the absence of strong relations, it is fruitful to interpret the moderate associations between the individual features and FNCM or FPCM. For this purpose, for each FNCM and FPCM we determined the relevant features as features with absolute values of individual correlations between them and proposed metrics greater than 0.1. All revealed relevant features representing the quality metrics from ENCODE had positive correlations with FNCMs, see Table 5. In other words, the more qualitative dataset from the point of view of the ENCODE metrics, the lower FN rates from the point of view of FNCMs. Thus, there is positive association between the ENCODE metrics and FNCMs.
When we performed analogous comparison between FNCMs and the peak caller characteristics, it appeared that, on the one hand, there existed positive association between FNCMs and the probabilistic characteristics such as-lg(p-value) or-lg(q-value) and, on the other hand, there was the surprising negative association between FNCMs and the not-probabilistic characteristics such as Fold enrichment and Tags number.
Finally, it was important to demonstrate the usefulness of the proposed quality metrics, when almost all ENCODE control metrics failed. Thus, Fig 4 demonstrates such cases. On the one hand, such characteristics as NRF, PBC1, NSC and RSC recommended to exclude these data from further analysis. On the other hand, in almost all cases FNCM indicated the high rate of FNs. In other words, peak callers overlooked numerous genuine TFBRs. However, Population size estimation for quality control of ChIP-Seq datasets FPCM indicated the low rate of FPs, therefore it recommended to use whole merged datasets without removing orphans for applications such as identification of TFBSs within ChIP-Seq datasets, or comparison of motif prediction methods.

Identification of TFBSs within ChIP-Seq datasets
Accurate identification of TFBSs (or site motifs) is still the big challenge in bioinformatics. Though comprehensive study of abilities of the existing models for motif prediction is beyond of our study, we demonstrate below that reasonable application of FPCM can essentially improve the accuracy of TFBS identification within TFBRs. To confirm this, we applied two PWM models (namely, MATCH and HOCOMOCO) to some datasets of TFBRs when these models shared the same matrices.
In general, accuracy of TFBS identification within a given dataset of TFBRs depends on, at least, four following factors: 1) quality of matrix, 2) quality of scoring method, 3) quality of dataset, and 4) unknown proportion of tethered binding when a given TF bound to DNA fragment not because it recognized its site, but due to protein-protein interaction with another TF that, in turn, bounds to DNA directly. To demonstrate the influence of dataset quality on motif identification, we built ROC curves and calculated AUCs on datasets of TFBSs mentioned in Table 1. In particular, Fig 5 contains the ROC curves obtained on the whole dataset PEAKS038038 and the one without orphans. On the base of low values of AUCs (0.633 for the HOCOMOCO model and 0.565 for the MATCH model, see Table 6) one can conclude that both models had low predictive abilities. However, according to Table 1, FPCM was equal to 48.883, hence majority of orphans in PEAKS038038 were TFBRs falsely identified by peak callers. After removing orphans, the values of AUCs (0.843 for the HOCOMOCO model and 0.808 for the MATCH model, see Table 6) increased essentially. The analogous effect of significant increase of AUCs after removing orphans has been observed for other datasets (such as PEAKS039626, PEAKS038673, PEAKS038812, and PEAKS040149). According to Table 1, the FPCM values calculated for these datasets considerably exceeded 1.0. However, exclusion of orphans did not lead to essential increase of the AUC values for such datasets as PEAKS035099, PEAKS033754, PEAKS033837, PEAKS039665, and PEAKS033184. This effect was not unexpected because the corresponding FPCM values for these datasets were close to 1.0.
It is interesting to note that the HOCOMOCO model outperformed the MATCH model. This outperformance can be the result of taking into account the background nucleotide Population size estimation for quality control of ChIP-Seq datasets  composition of the motifs by the HOCOMOCO model (unlike the MATCH model) [29]. However, this superiority was not high because of small differences between corresponding AUCs. Anyway, the differences between AUCs indicated that distinctions between the PWM models were less considerable than effect of appropriate removing orphans. Finally, the same relations between removing orphans and essential increase of AUCs have also been observed when we merged datasets of TFBRs obtained by application of a single peak caller to the distinct ChIP-Seq sets of reads when the same TF was studied in different ChIP-Seq experiments. In particular, Table 7 contains AUCs obtained for the following TFs: ATF-1, SRF and NF-E2. We selected these TFs because the corresponding values of FPCM exceeded threshold 2.0, hence FPCM actually recommended to remove orphans.

Implementation
The developed algorithm for determination of FPCM and FNCM for ChIP-Seq datasets was implemented as a plugin for the BioUML platform [30]: https://ict.biouml.org/bioumlweb/ chipseq_analysis.html. BioUML is an open source comprehensive bioinformatics platform, free for non-commercial use.

Conclusions
In this study we developed two novel metrics: FPCM and FNCM, which allow to control FP and FN rates of peak callers for assessment of quality of TFBR datasets. The main aim of the developed metrics is the selection of the most reliable datasets or recommendation of dataset modification by removing the orphans.
After estimation of FNCM and FPCM metrics for all human ChIP-Seq datasets from GTRD, we observed strong relations between presence/absence of input control in ChIP-Seq experiment and the FNCM and FPCM metrics. In particular, the absence of input control resulted in increase of FP rate and decrease of FN rates of the peak callers MACS, PICS, and SISSRs. In addition, we performed a comparative analysis of four peak callers: MACS, PICS, GEM and SISSRs using FNCM metrics. It was revealed that, in general, MACS and GEM outperformed SISSRs and PICS, especially when input controls were available for ChIP-Seq datasets. Moreover, comparative analysis of the existing quality metrics developed by the ENCODE project, FNCM and FPCM metrics and characteristics generated by individual peak callers has been performed. No strong relationships between FNCM and FPCM metrics and existing quality metrics or peak callers' characteristics have been revealed. In other words, there are no combinations of known metrics and peak callers' characteristics that can replace FNCMs and FPCM metrics. Thus, reasonable application of FPCM can considerably improve the accuracy of TFBS identification within ChIP-Seq datasets.