ChIP-seq, which combines chromatin immunoprecipitation (ChIP) with next-generation parallel sequencing, allows for the genome-wide identification of protein-DNA interactions. This technology poses new challenges for the development of novel motif-finding algorithms and methods for determining exact protein-DNA binding sites from ChIP-enriched sequencing data. State-of-the-art heuristic, exhaustive search algorithms have limited application for the identification of short (, ) motifs (, ) contained in ChIP-enriched regions. In this work we have developed a more powerful exhaustive method (FMotif) for finding long (, ) motifs in DNA sequences. In conjunction with our method, we have adopted a simple ChIP-enriched sampling strategy for finding these motifs in large-scale ChIP-enriched regions. Empirical studies on synthetic samples and applications using several ChIP data sets including 16 TF (transcription factor) ChIP-seq data sets and five TF ChIP-exo data sets have demonstrated that our proposed method is capable of finding these motifs with high efficiency and accuracy. The source code for FMotif is available at http://184.108.40.206/FMotif/.
Citation: Jia C, Carson MB, Wang Y, Lin Y, Lu H (2014) A New Exhaustive Method and Strategy for Finding Motifs in ChIP-Enriched Regions. PLoS ONE 9(1): e86044. https://doi.org/10.1371/journal.pone.0086044
Editor: Ying Xu, University of Georgia, United States of America
Received: June 3, 2013; Accepted: December 4, 2013; Published: January 24, 2014
Copyright: © 2014 Jia et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported in part by National Nature Science Foundation of China (Grant No. 60905029, 61105055, 61105056, 81230086, and 31071167), the Beijing Natural Science Foundation (Grant No. 4112046), and the Fundamental Research Funds for the Central Universities. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Protein-DNA interactions play key roles in several cellular processes and functions including DNA transcription, packaging, replication, and repair. Identification of regions such as transcription factor binding sites (TFBSs), which are targeted by proteins called transcription factors (TFs), is crucial for a better understanding of transcriptional regulation. Although traditional footprinting assays can accurately identify the precise binding sites of any factor, this low-throughput method is highly technical and can only be used to analyze a single small region (1 kilobase pairs (kb)) at a time. Chromatin immunoprecipitation followed by high-throughput deep sequencing (ChIP-seq) enables genome-wide detection of transcription factor binding sites as well as the localization of epigenetic regulatory markers on a genomic scale , . It typically returns millions of short (35–50 base pairs (bps)) sequence tags mapped onto a reference genome from a sample organism. Putative binding sites with high confidence can be extracted from peak-enriched regions in the genome by peak-calling programs . However, the resolution of binding regions identified from ChIP-seq can be a few hundred base pairs and is one or two orders of magnitude larger than a typical TFBS. By using an exonuclease that trims DNA regions at a precise distance from binding sites, the novel ChIP-seq technique ChIP-exo is able to locate binding sites at high resolution . However, according to the results in Rhee and Pugh , binding regions identified from ChIP-exo experiments may be tens of bps away from the exact binding locations, although some of them at the location indicated by the experiments. Computational methods are still needed to identify the exact binding locations of a TF in ChIP-seq or ChIP-exo data sets.
Binding sites for a specific TF are often highly conserved and have strong evidence for sequence specificity . An actual DNA region interacting with and bound by a single TF usually ranges in size from 8–10 to 16–20 bps. In the past two decades, numerous programs have been developed to identify over-represented DNA sequence motifs from the promoters of co-regulated or homologous genes . These programs can be divided into two groups. The first includes profile-based methods such as CONSENSUS , MEME , Gibsampler , AlignACE , PROJECTION , and CRMD , each of which attempts to maximize a statistic- or entropy-related score from a profile matrix (also called a position weight matrix (PWM)). The second group is comprised of consensus-based methods, which include SPELLER , WEEDER , , MITRA-count , Voting , PMSprune , WINNOWER , iTriplet , VINE , Stemming , and RecMotif . These progams are designed to find potential motifs within DNA sequences , where is the length of a motif and is the maximum number of mutations between a predicted binding site and the motif consensus. In most cases, profile-based methods are faster but suffer from lower accuracy due to their tendency to be trapped in a local optimum. Consensus-based methods are more accurate but slower due to the exponential growth of the search space with increasing values of and .
Consensus-based methods can be further divided into two categories: pattern-driven and sample-driven approaches . A pattern-driven approach attempts to enumerate all possible -mer motifs with lexical order, while a sample-driven approach tries to test all possible (, ) motifs generated from real -mers of input sequences. For the methods mentioned above, SPELLER, WEEDER, and MITRA-count are pattern-driven approaches and Voting, PMPprune, WINNOWER, iTriplet, VINE, Stemming, and RecMotif are sample-driven. By using pattern-driven approaches (with the exception of MITRA-count), one can automatically find planted (, ) motifs without prior knowledge of the length . On the contrary, sample-driven approaches require that be specified for each run. In real applications, the exact length of motifs contained in a set of sequences is usually unknown. The pattern-driven algorithm WEEDER has been successful in real eukaryotic applications  but has not been improved upon to the best of our knowledge. In this study, we have developed a more powerful method to extract motifs and their binding locations contained in DNA sequences without prior knowledge of motif length and have used this method to identify motifs and their binding locations in ChIP-enriched regions.
The pattern-driven approach MITRA-count builds a mismatch tree for all -mers first, then traverses search space recursively from the root down in depth-first order. Therefore, the length of a predicted motif must be specified in advance. SPELLER enumerates all possible motifs in a depth-first manner throughout the search space, then scans and counts all possible instances of the current motif with length () from the suffix tree of input sequences. The algorithm can identify planted (, ) motifs efficiently when and (see Table 1). In order to increase the speed of SPELLER, WEEDER includes an error ratio () for the algorithm that narrows the search space such that for all the number of mismatches between the first nucleotides of a candidate -mer motif and the first nucleotides of a valid instance of the motif is at most . The algorithm can accelerate SPELLER to some extent, especially when is small (e.g., . Unfortunately, not all motif occurrences satisfy this restriction. WEEDER must lower the occurrence frequency to make sure exact motifs will not be missed. However, WEEDER’s run time increases dramatically with the decrease of . For instance, for (15, 4), should be lowered to half the number of sequences at the OOPS constraint (one occurrence(s) of the motif instance(s) per sequence) to make sure that the true motif will be discovered . However, WEEDER’s run time may be even longer than SPELLER under the condition that the two algorithms use the same programming techniques. Thus, a more efficient method is needed to improve the efficiency of pattern-driven algorithms without knowledge of the length of predicted motifs under the ZOMOPS constraint (zero, one or multiple occurrence(s) of the motif instance(s) per sequence).
Additionally, the programs mentioned above are not computationally efficient enough to process a large number of ChIP-seq peaks. In recent years, several programs have been developed to cope with large-scale ChIP-seq data. Some are ChIP-tailored versions of previously-developed software (e.g., ChIP-MEME , DREME , and GimmeMotifs ). These typically restrict motif discovery to a few hundred peaks and usually ignore the remaining unselected sequences. Other programs are faster versions of previous software (e.g., STEME , ChIPMunk , and HMS ). STEME is a faster version of MEME and involves indexing sequences with a suffix tree, which accelerates the expectation-maximization (EM) steps. ChIPMunk combines EM with a greedy approach similar to CONSENSUS and decreases the run time of the optimization procedure. HMS is an improved version of Gibbs Sampler and combines stochastic sampling with deterministic, greedy search steps. Another group of programs integrate other information such as TFBS positional priors  or transcription start sites  in order to optimize a PWM of ChIP-enriched regions. As mentioned above, these programs still have a local optimum problem. Similar to SPELLER and WEEDER, some of these programs are consensus-based methods (sometimes called word enumeration methods). These include RSAT , Cisfinder  and POSMO . RSAT is a word enumeration method and has been developed to process whole ChIP-seq peak data sets, but is limited to short (, ) motifs (). Cisfinder is a word clustering method and combines short -mer enumeration (, 8, or 9) with a clustering strategy. POSMO, also a word clustering method, uses TFBS positional bias information along with -mer enumeration and clustering. However, both Cisfinder and POSMO use clustering methods to group short -mers and therefore cannot find exact (, ) motifs contained in sequences. Thus, finding exact (, ) motifs with larger values of and in a large-scale sequence data set is still very difficult.
According to a previous study by Keich and Pevzner , real signals may be mixed with spurious motifs contained in background sequences under the OOPS constraint when the degenerative ratio . A larger makes it more difficult to discriminate between a real motif and spurious motifs. However, some sequences may not contain any occurrence of a motif. As previously mentioned, we have concentrated on a more generalized model (the ZOMOPS constraint). Under this constraint, we have found that, except for the degenerative ratio , the ratio of noise sequences , where is the number of sequences and is the number of sequences containing at least one variant of a motif, negatively affects (, ) motif searches. A larger leads to more spurious motifs in background sequences. It is suspected that of factor-bound locations in ChIP-seq data may be false positives . Plus, there may be different versions of DNA-binding motifs for any given TF. A specified motif may only occur in of binding regions. Although false positive rates in ChIP-seq data sets are low enough that statistical conclusions can be drawn in most cases, the noise (plus the diversity of DNA-binding modes) still interrupts the motif-finding process and alters motif-finding results. Thus, this may not be the best way to identify motifs in full-size ChIP-seq data sets. After running a peak-calling program on a raw ChIP-seq data set, peaks along with their ChIP enrichment values, -values, or false discovery rates (FDRs) can be obtained. False positive peaks are those with low peak enrichment values, -values, or FDRs. A better method may be to find motifs with a high confidence value (i.e., those that are plentiful enough to draw statistical conclusions) in peak-enriched regions and subsequently scan their binding locations with the degenerative value in the remaining peak regions that have low peak enrichment values, -values, or FDRs. This would not only exclude more noise and spurious motifs , but it would also take advantage of well-developed motif-finding tools with an acceptable level of scalability. A similar idea was used in MICSA and achieved good performance . However, MICSA used the optimal method MEME (the accuracy of which is limited , ) to get the PWM of a motif for only the first three hundred peak-enriched regions.
In this study we have found that for motifs with length , both SPELLER and WEEDER have been designed to check each -mer () in the pattern space with depth-first order and count the variants of the -mer in the suffix tree of sequences from the root to layer . The suffix tree is scanned one time for each -mer pattern. Thus, as increases, the algorithms scan the suffix tree an increasing number of times. In fact, the mismatch information in layer of a suffix tree can be used to search for ()-mers in the pattern space. For this reason, we constructed a new suffix tree structure with mismatch information (called a mismatched suffix tree) and developed a fast motif enumerative method (FMotif) under the ZOMOPS constraint. Using the newly constructed suffix trees, we incorporated the mismatch information in layer of the mismatched suffix trees to verify ()-mers in the pattern space. We then updated mismatch information in layer of the mismatched suffix trees. In this way we were able to implement a depth-first search within the pattern space and the mismatched suffix trees simultaneously. To process large-scale ChIP-seq data sets, we integrated the peak detection method MACS  with our motif-finding method and ChIP-enriched sampling strategy, which allowed us to locate the exact binding locations in ChIP-seq and ChIP-exo data sets. We chose MACS because it has been shown to perform well when compared to several other peak-calling programs .
Experimental Results on Artificial Data Sets
We compared FMotif with the existing pattern-driving methods including SPELLER, WEEDER, and MITRA-count (MITRA for short) on synthetic samples to show the efficiency of our proposed method. All synthetic samples were generated following the method of Pevsner and Sze , where () variants of an -length motif were randomly planted into sequences selected randomly from a set of sequences with length . In this (, ) model, each planted variant of the motif with length had exactly mismatches with the motif itself.
In the first group of experiments, we tested the performance of these algorithms on (, ) sample sets without noise sequences (i.e., ) at standard settings, where the number and the length of sequences are set to 20 and 600, respectively , , . These test results are shown in Table 1. ‘WEEDER()’ indicates the execution time of WEEDER given the occurrence frequency threshold . ‘’ indicates a run time of over 10 hours. , , and denote seconds, minutes, and hours respectively. The number after each run time is the ranking number of a true planted motif among the top 25 predicted motifs. ‘’ after a run time indicates that the real motifs were not in the top 25. In the second group of experiments, we first tested the influence of the ratio of noise sequences () on (, ) samples using FMotif with typical settings (i.e., and ). In order to provide a more comprehensive comparison of the calculation speed when noise was added, we compared FMotif to SPELLER and MITRA on (10, 2), (11, 2), (12, 3), and (13, 3) samples. We avoided comparisons over motifs more complicated than (13, 3) because SPELLER lacked computational efficiency on these problems and WEEDER required tuning of parameter . These test results are shown in Table 2, where is set at , and , ‘’ indicates that the real motifs were not in the top 25, and the first line for (10, 2), (11, 2), (12, 3) and (13, 3) is the FMotif result, the second line (denoted by ‘M..’) is the MITRA result, and the third line (denoted ‘S..’) is the SPELLER result. We then tested the influence of the noise ratio on samples with and to simulate ChIP-enriched regions because those regions are usually relatively short and the number of regions is usually large. We subsequently compared FMotif to SPELLER and MITRA on (10, 2), (11, 2), (12, 3), and (13, 3) samples as before. These test results are shown in Table 3, where is set at , and . In the third group of experiments, we tested FMotif scalability using two groups of samples to see whether it was suitable for recognizing motifs in large-scale ChIP-enriched regions. The settings of the first group were , and no noise sequences (). These test results are shown in Table 4. The settings of the second group were , and noise sequences in order to mimick ChIP-seq data. These test results are shown in Table 5. All experiments were performed on a computer with an Intel 2.99 GHz processor, 2.00GB of main memory, and the Windows XP operating system.
The results in Tables 1–5 lead to three observations. First, FMotif is a fast and exact algorithm and capable of finding (, ) motifs in synthetic samples without being given the length of a predicted motif. It performs faster than SPELLER, MITRA, and WEEDER without sacrificing accuracy. As mentioned above, WEEDER’s efficiency suffers significantly (see (14, 4) in Table 1 for an example) when the occurrence frequency threshold is too low, and MITRA requires that the length be specified a priori. It should be noted that FMotif ranked all motifs with different lengths together by significance score. For the samples whose true motifs were not ranked in the top 25, the top motifs were usually - or - substrings of the true motifs with length . In these cases the true motifs were still in the output list but were ranked below the top 25. Second, noise sequences have a strong effect on the results and the speed of the method. With an increase in , the run time increases as well. Like the degenerative ratio , the ratio of noise sequences also weakens motifs, especially when background sequences are long (see Table 2). Spurious motifs in background sequences bury the authentic signals when either or is large. For example, real motifs were difficult to filter by their significance score for the (15, 5) motif in Table 2, even when the noise sequence ratio was set to . In this case, many spurious motifs of length 14–15 with a large significance score were ranked among the top 25. When the length of background sequences was shorter and the number of sequences was larger, the signals were stronger and could be easily identified even if a large portion of noise sequences was added (see Table 3). This is consistent with the previous result that false-positives could be reduced by decreasing the sequence length or by adding more sequences to the data set . Third, as shown in Tables 4 and 5, FMotif is capable of operating on a large scale even when there are noise sequences in samples. This allowed us to use FMotif to process peak regions within ChIP-seq and ChIP-exo data sets.
Additionally, we compared FMotif with CisFinder, which uses -mer enumeration with -mer clustering to find motifs in large-scale ChIP-seq peak regions. Using both algorithms, we verified the accuracy of FMotif and CisFinder by searching for long (, ) motifs in synthetic sample sets with 3000 sequences, each of which contained a planted variant of a parent motif. The experimental results are shown in Table 6, where ‘Planted Motif’ indicates a planted motif consensus in a set of sequences, ‘FMotif (Top-1)’ indicates the top ranked motif consensus found by FMotif in a sample set, ‘CisFinder’ indicates the closest matching motif consensus (described by IUPAC nucleotide codes) found by CisFinder in a sample set, ‘’ indicates the number of variants of a reported motif found by FMotif or CisFinder in a sample set, and ‘Rank’ after ‘’ is the ranking number of the reported motif found by Cisfinder in Table 6.
We used the site-level sensitivity () and positive predictive value () metrics described by Tompa  to statistically quantify the accuracy of the two methods, where and , is the number of known sites overlapping predicted sites, is the number of known sites not overlapping predicted sites, and is the number of predicted sites not overlapping known sites. A predicted site overlaps a known site if they share at least a half of the length of known sites. In order to give a more comprehensive comparison of the accuracy of the two methods on simulated ChIP-seq data sets, we added noise sequences to samples with and and performed the experiments again. These test results are shown in Table 7.
As evident from Tables 6 and 7, FMotif is an exact algorithm. It reported all true motif consensuses and their planted variants plus false positive variants in background sequences. CisFinder performed quickly but suffered from low accuracy (due to low sensitivity), especially when was large. FMotif and CisFinder both were robust after noise sequences were added to the samples. It should be pointed out that, although there are various resources on CisFinder’s website (http://lgsun.grc.nia.nih.gov/cis-finder/download.html), we used only the motif-finding program. There are other programs focused on motif clustering, motif improvement, motif comparison, and other tasks. If all programs were used together, a better motif and more of its binding sites may be identified. However, the CisFinder algorithm  was implemented in that motif-finding program and there was no direct way to use all these programs together based on our knowledge.
Experimental Results Using ChIP-seq Data Sets
We tested FMotif using 12 mouse ChIP-seq data sets for 12 DNA-binding TFs (CTCF, cMyc, Esrrb, Klf4, Nanog, nMyc, Oct4, Smad1, Sox2, STAT3, Tcfcp2I1, and Zfx) involved in mouse embryonic stem cell pluripotency and self-renewal . These ChIP-seq data sets have been deposited in the GEO database with ID number GSE11431. We also tested FMotif using four widely used human ChIP-seq data sets for four DNA-binding TFs including CTCF (CCCTC-binding factor , named CTCF() in the paper), FoxA1 (hepatocyte nuclear factor 3 ), NRSF (neuron-restrictive silencer factor ), and STAT1 (signal transducer and activator of transcription protein ). The raw sequence of the FoxA1 ChIP-seq data set was downloaded from http://liulab.dfci.harvard.edu/MACS/Sample.html. The bed format of the CTCF, NRSF and STAT1 ChIP-seq data sets was downloaded from http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/. These downloaded short reads were mapped onto the newest version of mouse genome assembly mm10 and human genome assembly hg19, respectively. The peak regions were extracted from these reads using the peak finding program MACS  with a false discovery rate (FDR) threshold of 0.2. The reads were ranked by their FDR if a negative control was available, or by -value otherwise. To prepare the data sets for use with motif discovery algorithms, we mapped the summits of the ChIP-seq peaks and extracted the 100 bps of genomic sequence centered around each peak.
In order to facilitate a fast motif search, avoid the potential influence of false positive peaks, and reduce false positive motifs in background sequences , we ran FMotif on the first 3000 ChIP-enriched genomic sequences and then scanned for potential binding locations in the remaining genomic sequences with the degenerative value . Since binding sites could exist on either DNA strand and CisFinder searched both, we counted the instances of a predicted motif and those of its reverse complement motif. We then compared FMotif with CisFinder and published motifs , , – in literature. The experimental results are shown in Figures 1 and 2, where ‘Nb’ indicates the number of peak-enriched regions predicted by the peak-calling program MACS with an FDR threshold of 0.2 or a -value threshold of , ‘FMotif’ and ‘CisFinder’ indicate the closest matching motif logos found by these programs (all motif logos were generated using the web-based tool Weblogo ), ‘Literature’ indicates the corresponding motif logos published in literature, ‘’ indicates the number of binding sites found by either FMotif or CisFinder, and ‘Rank’ after ‘’ is the ranking number of a reported motif found by either FMotif or CisFinder.
FMotif was tested using mouse ChIP-seq data sets for 12 DNA-binding TFs (CTCF, cMyc, Esrrb, Klf4, Nanog, nMyc, Oct4, Smad1, Sox2, STAT3, Tcfcp2I1, and Zfx) involved in mouse embryonic stem cell pluripotency and self-renewal . Results from CisFinder and published motifs in literature are shown for comparison. ‘Nb’ indicates the number of peak-enriched regions predicted by the peak-calling program MACS with an FDR threshold of 0.2 or a -value threshold of , ‘FMotif’ and ‘CisFinder’ indicate the closest matching motif logos found by these programs (all motif logos were generated using the web-based tool Weblogo ), ‘Literature’ indicates the corresponding motif logos published in literature, ‘’ indicates the number of binding sites found by either FMotif or CisFinder, and ‘Rank’ after ‘’ is the ranking number of a reported motif found by either FMotif or CisFinder.
FMotif was tested with four widely used human ChIP-seq data sets for four DNA-binding TFs including CTCF (CCCTC-binding factor , named CTCF()), FoxA1 (hepatocyte nuclear factor 3 ), NRSF (neuron-restrictive silencer factor ), and STAT1 (signal transducer and activator of transcription protein ). Results from CisFinder and published motifs in literature are shown for comparison. Column definitions are the same as those in Figure 1.
We compared predicted and published motifs using a motif-level accuracy measure called the performance coefficient , where is a predicted motif consensus and is the motif consensus published in literature . As shown in Figures 1 and 2, the motif logos found by FMotif were more accurate compared with published logos from literature than those found by CisFinder. Furthermore, FMotif identified more TFBS locations than CisFinder. As for the 12 mouse TFs DNA-binding logos in , Chen et al. used the motif discovery algorithm WEEDER and subsequently extended the motifs using an expectation-maximization method. This second step was necessary because the supplied version of the WEEDER algorithm limited the motif search to a maximum of 12 bps. As discussed in the previous sections, WEEDER operated with low efficiency for long motifs and was difficult to tune for the parameter .
To estimate the robustness of our sampling strategy, we ran FMotif on the first 500, 1000, 1500, …, and 5000 sequences and the full-size ChIP-enriched genomic sequences of TFs n-Myc, Oct4, and NRSF. For all subsets and the full-size data sets, each of the corresponding motifs in Figures 1 and 2 was ranked within the top 25 motifs predicted by FMotif. The ranking number of reported motifs increased with subset size and tended to be stable when the size was greater than 1000. All potential binding sites of reported motifs were obtained from subsets and discovered during the scanning step. Thus, it was not necessary to run a motif-finding algorithm on the whole ChIP-seq data sets, especially when data sets were very large. Additionally, we tested FMotif on randomly selected sequences ( = 500, 1000, 1500, …, and 3000). These experiments were repeated 10 times. We then compared these results to those of the first sequences and those of the last sequences for TFs n-Myc, Oct4, and NRSF. In general, motif consensuses predicted from the first sequences were the most similar to published motifs and ranked highest in the final output. Those predicted from randomly selected sequences tended to be ranked second, while those predicted from the last sequences were usually ranked the lowest. Furthermore, for the same reported motif of a TF, the number of binding sites found in the first sequences was significantly greater than that found in randomly selected sequences. The number of predicted binding sites found in the last sequences was the lowest. In some cases there was no corresponding motif in randomly selected sequences or in the last sequences when employing the same parameter settings. This situation occurred more often when using the last sequences. Therefore, we decided that the first sequences with the lowest -value or FDR (i.e., the most ChIP-enriched sequences) were the best choice for drawing statistical conclusions about a corresponding motif. This is because, as discussed in the Introduction section, the first sequences were the least affected by noise. We selected the first 3000 peak regions to be sure that the selected subsets were large enough to account for the specificity of TF-DNA binding and to exclude false positive motifs. The same results may be obtained by running the algorithm on the first 1000–2000 sequences and then scanning potential locations in the remaining sequences.
To test the sensitivity of FMotif, we ran it on an NRSF-positive TFBS set (NRSF/qPCR), which was composed of 83 binding sites verified by qPCR . We then ran FMotif on four yeast DNA-binding TFs (Reb1, Gal4, Phd1, and Rap1) and one human TF (CTCF) ChIP-exo data sets. Raw sequence of the five ChIP-exo data sets are available from the NCBI Sequence Read Archive with accession number SRA0044886 . Since it is thought that of ChIP-exo peak regions contain one recognizable DNA-binding motif within tens of bps away from peak summits, these can be viewed as positive TFBS sets. We used the five ChIP-exo peaks reported in Data S1 from Rhee and Pugh . Similarly, we mapped the summits of ChIP-exo peaks and extracted 50 bps of genomic sequence centered around each peak in yeast genome sacCer3 and human genome hg19, respectively. This allowed us to avoid peak regions overlapping with each other due to some of the summits of ChIP-exo peaks being very close together. Results from CisFinder and published motifs , , , – in literature are shown for comparison (see Figure 3).
FMotif sensitivity was measured using an NRSF-positive TFBS set (NRSF/qPCR), which was composed of 83 binding sites verified by qPCR , four yeast DNA-binding TFs (Reb1, Gal4, Phd1, and Rap1), and one human TF (CTCF) ChIP-exo data sets. Results from CisFinder and published motifs in literature are shown for comparison. Column definitions are the same as those in Figure 1.
As shown in Figure 3, FMotif was capable of finding more matching motifs and true TF-binding locations when compared to CisFinder. For example, 76 true binding sites of NRSF/qPCR were predicted exactly by FMotif. On the same data set (NRSF/qPCR), MICSA  using MEME reported only 55 sites. This highlights the fact that FMotif is capable of identifying TF-binding locations with high sensitivity. It is well-established that specificity is an important consideration for this type of method. However, the ChIP-exo technique is a high-throughput approach, and the resolution of binding regions identified by ChIP-exo may still be tens of base pairs from where the true binding sites of between 8 and 25 base pairs are located. In addition, some of those binding regions are false positives, and it is difficult to say which ones are truly false positives without carefully designed wet-lab experiments. However, we show the specificity (i.e., ) of FMotif for artificial samples in Tables 6 and 7. From this information we conclude that FMotif has both a higher sensitivity and a higher specificity than CisFinder.
In this study, we have proposed a new and fast heuristic enumeration method, FMotif, for extracting motifs from sequences. We have used this method to identify motifs and their binding locations in widely-used large-scale ChIP-seq and ChIP-exo data sets by combining FMotif with a peak-enriched sampling strategy. Our empirical studies have shown that this algorithm is fast and exact when searching for motifs in (, ) samples and has achieved good performance when identifying motifs in ChIP-enriched regions. In addition, the ChIP-enriched sampling strategy worked well on large-scale ChIP-seq and ChIP-exo data sets. It not only allowed us to exclude both noise occurring in lower ChIP-enriched peak regions and false positive motifs contained in background sequences, but also let us take advantage of well-developed motif-finding tools with low-level scalability. However, it should be pointed out that, in general, no method can outperform others under all conditions. FMotif performed faster than SPELLER, WEEDER, and MITRA but used more memory to store mismatched information in suffix trees, and FMotif was much more accurate but much slower than CisFinder. FMotif does, however, provide a good trade-off between time, space, and accuracy.
Motif discovery has been a popular area of study for more than two decades. Many successful motif-finding programs have been developed. The programs are ideal for finding motifs in tens or hundreds of promoters of co-regulated or homologous genes and for extracting motifs in genome-wide ChIP-enriched regions contained in large-scale ChIP-chip, ChIP-seq, and ChIP-exo data sets. Still, the problem is far from solved due to diversity in gene expression/regulation and the low specificity of binding sites. With the advance of high-throughput and high-resolution sequencing techniques like ChIP-exo, researchers have an increasing number of tools for studying gene regulation on a genomic scale. This will make the motif-finding problem easier to solve. Using advanced techniques such as ChIP-exo, it is possible to acquire new knowledge of regulatory binding sites. This will not only be beneficial for understanding the mechanisms of gene regulation, but also for creating a proper computational model that will replace (, ) models and PWM matrix profiles for motif representation.
Materials and Methods
The (, ) Motif Search and Suffix Tree
A transcription factor binds to specific DNA sequences and is involved in controlling the transcription of genetic information from DNA to mRNA. The actual DNA regions bound by a TF usually range in size from 8–10 to 16–20 bps and display a short motif, but differ by a few nucleotides from one another. The computational problem is to determine such a motif by analyzing a set of sequences that contain instances of the motif.
In current literature, there are two main approaches to motif representation. The first involves using a motif profile characterized by a PWM . The PWM records the probability of an observed nucleotide () at position () for all aligned sites, where is the length of the motif. Numerous programs have been developed to maximize the score of a PWM by measuring, for example, the information content of a PWM:where is the background frequency for the nucleotide, which measures motif conservation . Using the second approach, one can characterize a motif as an -length consensus string and describe it using the most frequent nucleotide in each position of all aligned sites under the assumption that each sequence contains zero or one motif instance with up to or exactly mutations within the motif. Finding (, ) motifs with exactly mutations is more challenging than finding (, ) motifs with up to mutations, and algorithms designed for the former can usually be directly used to find the latter. In addition, profile-based optimization methods, e.g., CONSENSUS and MEME, have failed to find (, ) motifs such as (15, 4), where a 15-bp motif is planted into 20 sequences, each 600 bps in length with exactly 4 mismatches . Thus, in this study we focus on designing a fast and exact algorithm to find (, ) motifs with exactly mutations on (, ) samples.
When searching for the exact motifs contained in an (, ) sample, it is customary to perform an exhaustive search for all potential -mers and verify their occurrence in the entire sample set. When using SPELLER and WEEDER to perform fast -mer substring searching in a sequence set, a suffix tree structure is used to index sequences. A suffix tree presents the suffixes of a given string or a given set of strings in a way that allows for a very fast implementation of string operations. An example of a classic suffix tree for the string GAGAC is shown in Figure 4a. When the suffix tree of a string with length is constructed, searching for a substring of length () in the string only requires time proportional to instead of .
(a) The suffix tree of the sequence GAGAC. (b) A tree representation of pattern space in the search for an (, ) motif.
Nevertheless, it is still time consuming to perform an exhaustive motif search in a suffix tree of sequences because the search space (shown as a four-branch tree in Figure 4b) can be up to in size. With the increase in degenerative value , the valid instances of a motif in the suffix tree will increase dramatically. Therefore, SPELLER can handle only short (, ) motifs with and . In order to increase the speed of SPELLER, WEEDER introduces an error ratio () to narrow the search space such that for all , the number of mismatches between the first nucleotides of a candidate -mer motif and the first nucleotides of a valid instance of the motif is at most . The strategy can quickly discard -mers in the search space that do not satisfy this restriction. However, not all motif occurrences satisfy this restriction, and therefore the real motif may be missed by the algorithm. WEEDER lowers the occurrence frequency to make sure that the true motif will not be missed. Still, WEEDER is an almost exact algorithm. What’s more, with the decrease of , WEEDER’s run time will increase dramatically . Therefore, a fast and exact (, ) motif search method is needed.
Mismatched Suffix Trees and FMotif
SPELLER and WEEDER use a depth-first search to scan the entire pattern space. If an -mer along the pattern tree has enough instances in the suffix tree of sequences, the -mer can grow up to 4 -mers in the next layer of the pattern tree (see Figure 4b). Otherwise, the end node of an -mer in the pattern tree will not be allowed to grow and the sibling nodes of the -mer will be checked. If any of the sibling nodes can grow to the -th layer in the pattern tree, the search process will go down to the -th layer of the pattern tree in a depth-first manner. Otherwise, it will backtrack to the uncle nodes (the siblings of parent nodes) of the -mer in the -th layer of the pattern tree and so forth. The algorithms will end at the longest -mer or -mers in the pattern tree. The difference between SPELLER and WEEDER is that WEEDER reduces the number of possible instances of a motif by restricting its mutation locations such that the valid paths on the pattern tree are sharply reduced.
We discovered that for finding motifs with length , both SPELLER and WEEDER must check each -mer () in the pattern space with depth-first order and count the variants of the -mer in the suffix tree of sequences from the root to layer . The suffix tree is scanned one time for each -mer pattern. Thus, the algorithms scan the suffix tree an increasing number of times with the increase of . Actually, the mismatch information in layer of a suffix tree can be used to search ()-mers in the pattern space. In this work we constructed a new suffix tree structure with mismatch information, called a mismatched suffix tree, for each sequence. Using these trees, we took advantage of the mismatch information in the -th layer of the trees to verify ()-mers in the pattern space and then updated the mismatch information in the -th layer. In this way we were able to implement a depth-first search on the pattern space and mismatched suffix trees simultaneously, which avoided a large number of repeated scans on the suffix trees of sequences.
For instance, when searching occurrences of (4, 1) motifs in the sequence GAGAC, we started from the root of the pattern tree represented as in Figure 5a and initialized the mismatched suffix tree for the sequence GAGAC. We then checked the occurrences of pattern with up to 1 mismatch in the mismatched suffix tree and found that all nodes in the first layer have 0 or 1 mismatch(es) with . Next, we updated the mismatch value along the valid nodes and linked all of these nodes by points (see Figure 5b). We subsequently performed a depth-first search again and arrived at the pattern . We updated mismatch information for all child nodes of the nodes in the link set in the first layer by using the mismatch information of those nodes in the link set and found all nodes in the second layer had 1 mismatch with the pattern . We updated the mismatch value along the valid nodes in the second layer and linked all of these nodes by points to form a new link set (see Figure 5c). Then, we moved to the pattern in a depth-first manner and updated mismatch information of all child nodes of the nodes in the newly generated link set by using the mismatch information of those nodes in the new link set. We found that only the child node A, representing the 3-mer AGA from the root to node A in the third layer of the suffix tree, had 1 mismatch with the pattern (see Figure 5d). Other nodes with 2 mismatches did not need to be updated and checked for the longer pattern . We found that the child node of the node A in the third layer did not satisfy the 1-mismatch restriction with the pattern , so we looked at the pattern and found a (4, 1) occurrence of (see Figure 5e). We then went to the patterns and and found no occurrence of these patterns in the sequence GAGAC. We backtracked to the pattern and updated the mismatch information in the third layer by using the mismatch information of their parent nodes in link set of the second layer. There we found that only node C in the third layer satisfies the restriction (see Figure 5f), but that node C has no child. We then backtracked to pattern and continued the process as before until we found all occurrences for each (4, 1) motif. The details of the pattern search and mismatched suffix tree construction are shown in the subroutine , where is the mismatched suffix tree for a sequence, is the node currently being processed (representing a -mer pattern) in the -th layer of the pattern tree, is the link set representing all valid occurrences of the -mer pattern represented by the node in the -th layer of the pattern tree, and is mismatch value of the pattern represented by compared with the substring represented by the node in the tree .
Figures (a)–(f) illustrate the search process of (4, 1) motifs on the mismatched suffix tree of the sequence GAGAC.
for head node to tail node of do
for each do ( is the child node set of the node )
if , then
add to ;
if or then
for j = 1 4 do
for each tree in the tree set do
For each set of sequences, we counted the number of occurrences of a potential pattern in all sequences instead of just one sequence shown in subroutine . If the number of occurrences was larger than the threshold of occurrence frequency , it was reported as a potential pattern. The subroutine for counting occurrences of a -mer pattern, represented by the node in the -th layer of the pattern tree, is shown in , where is the maximum length allowed for a motif, is the set of mismatched suffix trees for all sequences, = , and is the number of total sequences.
The entire process of finding motifs with at least occurrences in a set of sequences is shown below. Additionally, since there may be many motifs that satisfy the quorum restriction , we sorted all potential motifs according to their statistical significance using the method in , . We reported the top significant motifs and their occurrences as output, where (, ) indicates an instance of a motif starting at the -th position of the -th sequence .
The FMotif Algorithm
- Initialize a mismatched suffix tree like the one shown in Figure 5a, ;
- Initialize tempnode, where tempnode is the pointer of a temporary node,
- Input ;
- Rank the found motifs according to their significance scores;
- Output the top motifs, their instances, and the positions (, ) of these instances.
According to our empirical study, FMotif is capable of increasing the speed of the algorithms SPELLER and WEEDER without loss of accuracy. In addition, we used the WEEDER strategy to further decrease the search space by allowing mismatches occurring at most times with an increase in . This strategy decreased FMotif’s run time but caused problems during the tuning of the parameter and resulted in a loss of accuracy.
Conceived and designed the experiments: CJ YW. Performed the experiments: CJ YW. Analyzed the data: CJ MBC. Contributed reagents/materials/analysis tools: YL HL. Wrote the paper: CJ MBC.
- 1. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, et al. (2007) Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods 4: 651–657.
- 2. Johnson DS, Mortazavi A, Myers RM, Wold B (2007) Genome-wide mapping of in vivo protein- DNA interactions. Science 316: 1497–1502.
- 3. Wilbanks EG, Facciotti MT (2010) Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One 5: e11471.
- 4. Rhee, S H, Pugh BF (2011) Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 147: 480–483.
- 5. Zhao Y, Stormo GD (2011) Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat Biotechnol 29: 1408–1419.
- 6. Zambelli F, Pesole G, Pavesi G (2013) Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Brief Bioinform 14: 225–237.
- 7. Hertz GZ, Stormo GD (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15: 563–577.
- 8. Bailey TL, Elkan C (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 1994, Menlo Park, CA. 28–36.
- 9. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, et al. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262: 208–214.
- 10. Hughes JD, Estep PW, Tavazoie S, Church GM (2000) Computational identification of cisregulatory elements associated with groups of functionally related genes in saccharomyces cerevisiae. Journal of Molecular Biology 296: 1205–1214.
- 11. Buhler J, Tompa M (2002) Finding motifs using random projections. Journal of Computational Biology 9: 225–242.
- 12. Gang L, Chan TM, Leung KS, Lee KH (2010) A cluster refinement algorithm for motif discovery. IEEE/ACM Trans on Computational Biology and Bioinformatic 7: 654–668.
- 13. Sagot MF (1998) Spelling approximate repeated or common motifs using a suffix tree. Proceedings of LATIN’98: Theoretical Informatics, LNCS 1380: 111–127.
- 14. Pavesi G, Mauri G, Pesole G (2001) An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 17: 207–214.
- 15. Pavesi G, Mereghetti P, Mauri G, Pesole G (2004) Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Research: Web Server Issue 32: W199–W203.
- 16. Eskin E, Pevzner P (2002) Finding composite regulatory patterns in DNA sequences. Bioinformatics 18: 354–363.
- 17. Chin YL, Leung CM (2005) Voting algorithms for discovering long motifs. In: Proceedings of the Third Asia-Pacific Bioinformatics Conference, 2005, Singapore. 261–271.
- 18. Davila J, Balla S, Rajasekaran S (2007) Fast and practical algorithms for planted (l, d) motif search. IEEE/ACM Trans On Computational Biology and Bioinformatics 4: 544–552.
- 19. Pevzner P, Sze S (2000) Combinatorial approaches to finding subtle signals in DNA sequences. In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, 2000, California, USA. 269–278.
- 20. Ho ES, Jakubowski CD, Gunderson SI (2009) iTriplet, a rule-based nucleic acid sequence motif finder. Algorithms Molecular Biology 4: doi:10.1186/1748-7188-4-14.
- 21. Huang CW, Lee WS, Hsieh SY (2010) An improved heuristic algorithm for finding motif signals in DNA sequences. IEEE/ACM Trans On Computational Biology and Bioinformatics 8: 959–975.
- 22. Kuksa PP, Pavlovic V (2010) Efficient motif finding algorithms for large-alphabet inputs. BMC Bioinformatics 1: S1.
- 23. Sun HQ, Low MYH, Hsu WJ, Rajapakse JC (2010) RecMotif: a novel fast algorithm for weak motif discovery. BMC Bioinformatics 11: S8.
- 24. Tompa M (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 23: 137–144.
- 25. Machanick P, Bailey TL (2011) MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 27: 1696–1697.
- 26. Bailey TL (2011) DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics 27: 1653–1659.
- 27. Heeringen SJV, Veenstra GJ (2011) GimmeMotifs: a de novo motif prediction pipeline for ChIPsequencing experiments. Bioinformatics 27: 270–271.
- 28. Reid JE, Wernisch L (2011) STEME: efficient EM to find motifs in large data sets. Nucleic Acids Res 38: e126.
- 29. Kulakovskiy IV, Boeva VA, Favorov AV, Makeev VJ (2010) Deep and wide digging for binding motifs in ChIP-seq data. Bioinformatics 26: 2622–2623.
- 30. Hu M, Yu J, Taylor JM, Chinnaiyan AM, Qin ZS (2010) On the detection and refinement of transcription factor binding sites using ChIP-seq data. Nucleic Acids Res 38: 2154–2167.
- 31. Guo Y, Mahony S, Gifford DK (2012) High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Comput Biol 8: e1002638.
- 32. He Y, Zhang Y, Zheng G, Wei C (2012) CTF: a CRF-based transcription factor binding sites finding system. BMC Genomics 13: S18.
- 33. Thomas-ChollierM, Herrmann C, Defrance M, Sand O, Thieffry D, et al. (2012) RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res 40: e31.
- 34. Sharov AA, Ko MSH (2009) Exhaustive search for over-represented DNA sequence motif with CisFinder. DNA Research 16: 261–273.
- 35. Ma X, Kulkarni A, Zhang Z, Xuan Z, Serfling R, et al. (2012) A highly efficient and effective motif discovery method for ChIP-seq/ChIP-chip data using positional information. Nucleic Acids Res 40: e50.
- 36. Keich U, Pevzner P (2002) Subtle motif: defining the limits of finding algorithms. Bioinformatics 18: 1382–1390.
- 37. Zia A, Moses AM (2012) Towards a theoretical understanding of false positives in DNA motif finding. BMC Bioinformatics 13: doi: 10.1186/1471-2105-13-151.
- 38. Boeva V, Surdez D, Guillon N, Tirode F, Fejes A, et al. (2010) De novo motif identification improves the accuracy of predicting transcription factor binding sites in ChIP-Seq data analysis. Nucleic Acids Res 38: e126.
- 39. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson D, et al. (2008) Model-based analysis of ChIPSeq (MACS). Genome Biology 9: R137.
- 40. Chen X, Xu H, Yuan P, Fang F, Huss M, et al. (2008) Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133: 1106–1117.
- 41. Barski A, Cuddapah S, Cui K, Roh T, Schones D, et al. (2007) High-resolution profiling of histone methylations in the human genome. Cell 129: 823–837.
- 42. Lupien M, Eeckhoute J, Meyer CA, Wang Q, Zhang Y, et al. (2008) FoxA1 translates epigenetic signatures into enhancer-driven lineage-specific transcription. Cell 132: 958–970.
- 43. Jothi R, Cuddapah S, Barski A, Cui K, Zhao K (2008) Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res 36: 5221–5231.
- 44. Jothi R, Cuddapah S, Barski A, Cui K, Zhao K (2012) Breast cancer risk-associated SNPs modulate the affinity of chromatin for FOXA1 and alter gene expression. Nat Genet 44: 1191–1198.
- 45. Crooks GE, Hon G, Chandonia JM, Brenner SE (2004) WebLogo: a sequence logo generator. Genome Research 14: 1188–1190.
- 46. Badis G, Chan ET, van Bakel H, Pena-Castillo L, Tillo D, et al. (2008) A library of yeast transcription factor motifs reveals a widespread function for Rsc3 in targeting nucleosome exclusion at promoters. Mol Cell 32: 878–887.
- 47. Spivak AT, Stormo GD (2012) ScerTF: a comprehensive database of benchmarked position weight matrices for Saccharomyces species. Nucleic Acids Res 40: D162–168.
- 48. Pachkov M, Erb I, Molina N, van Nimwegen E (2007) SwissRegulon: a database of genome-wide annotations of regulatory sites. Nucleic Acids Res 35: D127–131.