Transcription Factor Binding Sites Prediction Based on Modified Nucleosomes

In computational methods, position weight matrices (PWMs) are commonly applied for transcription factor binding site (TFBS) prediction. Although these matrices are more accurate than simple consensus sequences to predict actual binding sites, they usually produce a large number of false positive (FP) predictions and so are impoverished sources of information. Several studies have employed additional sources of information such as sequence conservation or the vicinity to transcription start sites to distinguish true binding regions from random ones. Recently, the spatial distribution of modified nucleosomes has been shown to be associated with different promoter architectures. These aligned patterns can facilitate DNA accessibility for transcription factors. We hypothesize that using data from these aligned and periodic patterns can improve the performance of binding region prediction. In this study, we propose two effective features, “modified nucleosomes neighboring” and “modified nucleosomes occupancy”, to decrease FP in binding site discovery. Based on these features, we designed a logistic regression classifier which estimates the probability of a region as a TFBS. Our model learned each feature based on Sp1 binding sites on Chromosome 1 and was tested on the other chromosomes in human CD4+T cells. In this work, we investigated 21 histone modifications and found that only 8 out of 21 marks are strongly correlated with transcription factor binding regions. To prove that these features are not specific to Sp1, we combined the logistic regression classifier with the PWM, and created a new model to search TFBSs on the genome. We tested the model using transcription factors MAZ, PU.1 and ELF1 and compared the results to those using only the PWM. The results show that our model can predict Transcription factor binding regions more successfully. The relative simplicity of the model and capability of integrating other features make it a superior method for TFBS prediction.


Introduction
Gene regulation is affected by the binding of transcription factors (TFs) to regulatory sequences in DNA. Recognition of transcription factor binding sites (TFBSs) improves insights into the genes regulated by a TF. These target genes combined with their expression data can be used to elucidate transcriptional regulatory networks and transcription regulation mechanisms [1,2]. Due to significant sequence variation in the binding sites of a TF, transcription factor binding site prediction is still known as a difficult and central problem in computational biology [1][2][3][4][5][6][7][8][9][10]. This problem uses motif prediction in which a set of annotated binding sites and a new sequence are given as input, with the goal of finding similar binding site on the sequence [5,8].
Binding sites of a TF can be represented as a consensus sequence or a position weight matrix (PWM). Despite the ease of visual interpretation, variations in nucleotide composition of binding sites make consensus sequences an unsuitable approach to represent TFs [4,6]. So, the most common methods apply PWMs for TFBS representation instead of consensus sequence [8].
TFs usually bind to short (4-12 bp) DNA sequences. The repetitive nature of DNA causes the binding sites to occur at many locations throughout the eukaryotic genome, of which only a small number are involved in the regulatory processes of the cell. These considerations make motif scanning i.e. searching DNA sequence for matches with a PWM, to be highly uncertain and to produce a high frequency of false positive predictions [3,5,9]. This problem is more evident in mammalian genome since cis-regulatory elements are usually kilo bases away from target genes, making it necessary to search large regions, which in turn leads to an increase in false positives [7]. These challenges undermine the use of motif scanning as a standalone method for TFBS prediction.
Common PWMs do not take into account higher order dependencies between nucleotides, thus it is believed that developing better models for binding sites and utilizing higher order background models will improve the performance of motif prediction [22][23][24]. Construction of such complex models has proved to be challenging [25], so the use of additional data sources in the context of TFBS prediction is attracting more attention [5].
Eukaryotic DNA is packaged into nucleosomes and forms local structures of chromatin [26,27]. Dynamic changes in chromatin structure through post-translational modifications of histones, restrict accessibility of DNA for TFs [3,28]. Several studies have shown that TFs binding to genomic regions are associated with various histone modification levels [20,[29][30][31]. According to these observations, several studies have developed frameworks to improve TFBSs prediction accuracy using a limited number of epigenetic experimental assays [1][2][3]7,9,10]. They considered the numbers of different histone modification tags as additional information sources for improving the prediction accuracy.
Nozaki et al. [28] showed that nucleosomes harboring histone modifications like H3K4me1, H3K4me2, H3K4me3 as well as the histone variant H2A.z have an aligned and periodic pattern around broad promoters. They concluded that this might be due to the accessibility of TFs to DNA.
In this study, we were interested in using information from the positions and distribution of modified nucleosomes to improve the performance of TFBS prediction. We have examined the effects of two features ''modified nucleosomes neighboring'' (MNN) and ''modified nucleosomes occupancy'' (MNO) around TFBSs.
The MNN feature considers the closest distance from a TFBS to the nearest nucleosome harboring a specific histone modification. The MNO feature represents the total number of nucleosomes containing a histone modification around the binding sites of a transcription factor.
To investigate these features, we considered 21 histone methylation modifications [30]. For each feature (MNN, MNO) a set of values corresponding to different modifications were computed based on Sp1 binding sites on Chromosome 1 in human CD4+T cells. Then, these values were applied as a training set in a logistic regression classifier (LRC). The rest of (Chromosome 2-22) autosomes and two sex chromosomes were used as a test set to show that these features are capable of predicting Sp1 binding regions on other chromosomes. We found that only 8 out of 21 histone modifications, namely H2A.z; H3K4me1, H3K4me2, H3K4me3; H3K9me1; H3K27me1; H4K20me1 and H2BK5me1 are strongly correlated with transcription factor binding regions.
We next designed a second model to search a genome for TFBSs based on the features (MNN, MNO) combined with using the PWM. We applied this model on MAZ, PU.1 and ELF1 TFs and compared the results with the common model using PWM alone. The results show that false positives are significantly decreased with only a minor decrease in true positives.

Results and Discussion
In this section, we introduce two features based on the modified nucleosomes which are effective for distinguishing transcription factor binding regions from random ones. We call these features ''modified nucleosomes neighboring'' (MNN) and ''modified nucleosomes occupancy'' (MNO).
At first, we analyzed the MNN feature using the general transcription factor Sp1 in human resting CD4+T cells. Through the evaluation of the MNN feature, eight significant histone modifications were identified for TFBS prediction. Next, the MNO feature was applied on these eight marks to show that the occupancy of modified nucleosomes is also predictive of true binding regions. Finally, a model was designed which integrates PWM with the MNN and MNO features to improve prediction of the MAZ, PU.1 and ELF1 binding sites.

Benefit of ''Modified Nucleosomes Neighboring''
In this part, we would like to show that the vicinity of nucleosomes containing a certain histone modification is useful for predicting TFBSs. At first, we extracted Sp1 binding sites from Chromosome 1 in human CD4+T cells. Then, for each histone modification, the distance between the Sp1 binding sites and the nearest nucleosomes containing the modification was calculated and considered as a training set in the LRC model (see Methods). Next, 21 (Chromosome 2-22) autosomes and two sex chromosomes were binned into 1-kb non overlapping intervals and used as a test set for the model (see Methods). Intervals on these chromosomes were then predicted by the model as Sp1 binding regions [1].
This task was repeated for nucleosomes containing each of the 21 histone modifications. Figure 1 (and Figure S1 and Table S1) and Figure 2 show ROC curves and AUC values, respectively for 21 LRCs associated with different histone modifications. These figures show the results of evaluations averaged on the test set (21 autosomes and two sex chromosomes).
To compare the accuracy of this feature with PWM, test set intervals were scored using an Sp1 PWM constructed based on binding sites on Chromosome 1 (see Methods). The ROC curve and AUC value of PWM scanning is shown in Figure 3.
Comparison of the PWM AUC value with previous results clearly shows that the MNN feature, especially for certain marks, greatly outperforms the traditional scanning approach. This demonstrates the usefulness of our proposed MNN feature in recognition of target sites.
These figures clearly show that using the MNN feature greatly reduces the number of false positives with a minor decrease in sensitivity. However histone modifications do not contribute equally in prediction improvements. Comparing AUC values show that seven types of histone modification, H3K4me1, H3K4me2, H3K4me3, H4K20me1, H2BK5me1, H3K9me1, H3K27m1 as well as histone variant H2A.z have a significant effect on improvement. We refer to these eight modifications as top marks.
It has been shown that 21 histone modifications can be classified into three groups [21]. Active modifications are related to active genes and enhancers. Repressive modifications are connected with repressed genes and heterochromatin and moderate modifications have no preference toward any of activated or repressed genes. Table 1 shows each modification and its assigned category.
Eight predicted top marks using our model are among the active modifications. Modifications with weak prediction accuracies (H3K9me2, H3K9me3 and H3K27me3) can be seen in the repressive group. The other histone modifications, with average prediction accuracies, appear in the moderate modifications. Among these moderate marks, H3K27me2 and H4K20me3 have the lowest AUC values. In contrast, H3K79me1 and H3K36me1 show relatively acceptable accuracies with AUC<0.79. From these observations, we conclude that H3K27me2, H4K20me3 as well as other modifications in the moderate group with low AUC values have a tendency toward repressed genes. On the other side, highly predictive modifications can be connected to the activation of transcription. These findings are in line with previous studies that H4K20me3 is associated with heterochromatin, H3K36me1 shows a slight tendency toward active genes and H3K27me2 signals are more prevalent at silent promoters [30,[32][33][34]. The signals and functional consequences of H3K79me1 are not well studied but our results predict a slightly active tendency for this mark.
Putting it all together, we reason out that active histone modifications are more predictive for TFBSs prediction with the exception of H3K36me3 mark. Studies have shown that this epigenetic mark is considered as an active mark only when this mark lies in the coding region of a gene, and a repressing mark in the promoter region [26,30]. This particular distribution of H3K36me3 may explain a cause for poor predictive power of this modification.
The histone code hypothesis [35] suggests that a combination of histone modifications affects gene regulation. So, we analyzed whether combining histone modifications can predict Sp1 binding sites better than single marks.
We considered three combinations among histone modifications as follows: 1) combining all 21 modifications, 2) combining the eight top marks 3) integrating H3K4me3 and H2A.Z data. We chose H3K4me3 and H2A.z because they have the largest AUC values among top 8 marks ( Figure 1).
In comparison with using single modifications, integrating more marks is expected to enhance the accuracy of predictions. The results show that the most predictive single histone modification for Sp1 is H3K4me3 with an AUC value of 0.9683. The AUC values for 21, 8 and 2 combined histone modifications are 0.9149, 0.9545 and 0.9682, respectively ( Figure 4).
The failure of combining modifications may be due to the fact that histone modifications are closely correlated and there is a data redundancy among them [36].2.
In this part, we are interested to show that the combination of the top eight marks can be used as an acceptable approach in TFBSs prediction. So, we considered another feature called modified nucleosomes occupancy which represents the total number of nucleosomes harboring a histone mark in the binding regions. Therefore, the total number of nucleosomes containing the top 8 marks in 1-kbp regions flanking Sp1 sites on Chromosome1 were computed and applied as a training set in the LRC (see Methods). As before, Sp1 bound regions on 21 autosomes and two sex chromosomes were employed to evaluate the predictive power of the feature. Figure S2 shows that utilizing occupancies of nucleosomes can be informative in actual binding sites prediction.
To provide evidence why certain modifications are more predictive than the others, the ratio of nucleosome containing top 8 marks and three repressive histone modifications (as a control) were calculated at each position around binding sites of Sp1 (see Methods). We found that nucleosomes containing the top 8 marks are enriched and show a bimodal pattern around Sp1 binding sites. A nucleosome depleted region with respect to the center of the binding sites is evident in active marks ( Figure 5). This bimodal distribution may indicate TFs compete with nucleosomes to access DNA. We can consider these nucleosome free regions flanked by modified nucleosomes as landing sites which direct TFs into the true binding locations.

Evaluation of the Suggested Features on the other Transcription Factors
Studies have suggested that epigenetic data show a general binding tendency for TFs in genomic regions and thus are not specific to a given TF [3]. Therefore, the assigned scores to the test set chromosomes based on Sp1 MNN and MNO features, may be predictive of the other TFs as well. First, each test set interval was scored based on the MNN feature, trained on Sp1 binding sites. Then, these scores were combined with the PWM of the corresponding TF (see Methods). These scores were used to predict binding regions related to each TF. ROC Curves for these three TFs and 21 modifications are shown in Figure 6 (and Figures S3). As expected, Comparing these figures with ROC curves of PWM scanning ( Figures S4, S5, S6) confirms the higher predictive power of the top eight identified modifications. The AUC values are illustrated in Table S2.
We further evaluated the effect of the MNO feature on prediction of these TFs ( Figure S7 and Table S3). Like the MNN feature, the nucleosome occupancies combined with PWM scores, significantly enhance predictions over using PWM alone. This confirms the usefulness of epigenetic data in the context of TFBS prediction.
Finally, we illustrated the ratio of marked nucleosome at each position around binding sites of MAZ, PU.1 and ELF1 ( Figures  S8, S9, S10) and observed the bimodal patterns of the top 8 modified nucleosomes distributed around the central position of binding sites in these three TFs as well.

Conclusions
By using a probabilistic approach, we discovered that using the genomic position of modified nucleosomes can be informative for predicting the binding locations of TFs. We first showed that the vicinity of modified nucleosomes around TF binding sites combined with PWM can enhance the performance of predictions over using PWM alone. Then, we observed that eight types of histone modifications correlate more highly with TFBSs. Using these eight modifications, we investigated the nucleosomes occupancy around the binding sites, and again showed that this feature is also correlated to the target regions of TFs. The analysis of the modified nucleosomes distribution around binding sites revealed that these nucleosomes show a bimodal distribution with a depleted region right on the center of binding sites.
In this study, we used two features, namely ''Modified Nucleosome Neighboring'' and ''Modified Nucleosome Occupancy'' to analyze whether the spatial distribution of nucleosomes are informative for TFBS prediction. The proposed features as well as the classifier can be easily applied to other TFs to evaluate how well these features will perform in prediction processes.
Here, we only analyzed the role of 21 histone methylation in TFBS prediction. As more and more genome-scale histone modification data sets become available, more complex features related to the distribution of nucleosomes may be defined and used  to uncover the actual patterns that the modified nucleosomes take around binding sites. We believe that our study is a step toward understanding epigenetic regulation of target genes of TFs and inferring how epigenetic modifications influence and recruit regulatory proteins.

Nucleosome Position Detection and Dataset
The genomic position of the 21 modified nucleosomes in human resting CD4+T cells were obtained from [30]. These data show the genomic position of unambiguously mapped sequence tags from chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) experiments. We used the NPS package [21] to determine the genomic positions of nucleosomes corresponding to the short sequence tags. The March 2006 human genome (NCBI Build 36.1, hg18 assembly) was used as a reference genome.

Position Weight Matrix Representation
A zero order background model represented by [38] was used to construct the matrix based on Sp1 binding sites extracted from Chromosome1. The number of nucleotides in each position was calculated and converted to a frequency as follows: where n ca and b ca are the real counts and pseudocounts of nucleotide a, respectively in position c. The total number of real counts and pseudocounts are called N c and B c in the position, respectively. Here we consider B c~1 and b ca~Bc |p a where p a is the background frequency of base a. These frequencies are then converted to a log odd score as follows: where w ca is the matrix value of base a in position c. The scores given to a binding site are converted to a relative unit scale by: score{score min scoremax{score min , where score min and score max are the sums of minimum and  Figure 1 shows that applying the LRCs to the data of single modifications perform better than those LRCs trained with the combination of histone modifications. This may be due to the fact that the predictive ability for distinguishing true target regions is redundantly encoded among histone marks. doi:10.1371/journal.pone.0089226.g004 maximum scores in each column of constructed PWM, respectively. This strategy was used to construct MAZ, PU.1 and ELF1 PWMs.

Scanning and Scoring Regions by PWM
All chromosomes except Chromosome 1, which is used as a training set, were binned into non-overlapping 1 kb intervals. Every base b i in each interval was scored by PWM. So, two scores score z i and score { i were assigned to the base b i corresponding to two sub sequences on positive and negative strands. The score of each interval is finally represented by: According to the constructed PWM, a score is assigned to each interval for each TF under study.

Training Set
For the modified nucleosome neighboring feature (MNN), the distances from the center of Sp1 binding sites on Chromosome 1 to the central positions of the nearest nucleosomes containing a

Test Set
Each interval from 21 autosomes and two sex chromosomes containing the center of a reported binding site was considered as a true binding location and the others as false binding regions. For each region, values corresponding to the MNN and MNO features were computed as follows. For the modified nucleosome neigh-boring feature (MNN), the closest distance from the center of the region to the nearest nucleosome containing a specific modification was computed. This task was done for each interval and for each 21 different modifications. Thus, for each interval 21 values corresponding to 21 different histone modifications were obtained. Finally each single value or combinations of these values, as described in the results, were inserted into the LRC model as a test set. The LRC classified the intervals as binding or non-binding locations.
For the nucleosome occupancy feature (MNO), only top 8 marks (recognized through evaluation of the MNN feature) were considered and in each test set interval, the total numbers of nucleosomes containing each of these top modifications were calculated separately. So, 8 values were assigned to each interval. These 8 values were inserted into the LRC for evaluation purposes. The LRCs assigned a score to each interval which showed the Sp1 binding probability to that interval.

Evaluating the Features on MAZ, PU.1 and ELF1
Since Sp1 is a well-known and ubiquitous protein and has been reported to bind practically everywhere in the human genome [5], the same LRCs trained on Sp1 were used to score test set intervals. Then, these scores were made specific to each MAZ, PU.1 and ELF by integrating these values with the PWMs corresponding to each factor (see below).

Integrating Logistic Regression with Position Weight Matrix
For each test interval the score is computed as follows: where Score interval and Score feature{i i[fMNN,MNOg, are calculated based on the PWM of the TF under study and the LRC model trained on SP1.

Logistic Regression Classifier
We used a logistic regression classifier (LRC) to integrate multiple data sources. This classifier maps a single or a set of computed features to a score which represents the probability of a TFBS.
A TFBS prediction can be represented as a binary classification problem.The LRC hypothesis function used for prediction is defined as h h (x)~g(h T x) where x is the vector of input features. The vector of parameters h (also called weights) can be estimated based on the training examples. In this study we chose the logistic function, g(z), to be a sigmoid function: is always a real number between 0 and 1(0ƒh h (x)ƒ1). Here h h (x) shows the probability of a region being a binding site: where y (i) shows the i th region which can be the target of a TF. To estimate the parameters h , we define the cost function as: Where m is the number of training examples and x (i) is the vector of input features (the MNN or the MNO) computed for the i th interval. We call this function J(h). To Estimate the parameters h, we need to minimize this function. The Matlab function fminunc was used to minimize and estimate these weights. All calculations were done in Matlab R2012a. Preparation of the data was done in C# 2010. Figure S1 ROC curves for predicting the binding regions of Sp1 based on the MNN feature. ROC curves are shown for 13 modifications with less predictive power for prediction of Sp1 binding regions on the test set. The MNN feature is used to train corresponding LRCs on Chromosome 1. Only scores assigned by the LRCs (without using PWM scores) are used to predict binding regions. The x-axis is the false positive rate and the y-axis is the true positive rate. (TIF) Figure S2 ROC Curve of the modified nucleosome occupancy feature for prediction of the Sp1 target regions. The ability of the LRC trained on the MNO featureto differentiate between reported bound locations of Sp1 and random sites (AUC = 0.9413). Not only is the vicinity to modified nucleosomes but also the total number of these nucleosomes an appropriate identifier of true binding regions. The MNO feature is an eight dimensional vector (corresponding to top 8 marks), each element of which is the total number of nucleosomes containing a certain marks.