PairMotif: A New Pattern-Driven Algorithm for Planted (l, d) DNA Motif Search

Motif search is a fundamental problem in bioinformatics with an important application in locating transcription factor binding sites (TFBSs) in DNA sequences. The exact algorithms can report all (l, d) motifs and find the best one under a specific objective function. However, it is still a challenging task to identify weak motifs, since either a large amount of memory or execution time is required by current exact algorithms. A new exact algorithm, PairMotif, is proposed for planted (l, d) motif search (PMS) in this paper. To effectively reduce both candidate motifs and scanned l-mers, multiple pairs of l-mers with relatively large distances are selected from input sequences to restrict the search space. Comparisons with several recently proposed algorithms show that PairMotif requires less storage space and runs faster on most PMS instances. Particularly, among the algorithms compared, only PairMotif can solve the weak instance (27, 9) within 10 hours. Moreover, the performance of PairMotif is stable over the sequence length, which allows it to identify motifs in longer sequences. For the real biological data, experimental results demonstrate the validity of the proposed algorithm.


Introduction
Motif search plays an important role in gene finding and gene regulation relationship understanding. Taking a survey of recent developments in motif recognition algorithms, Das and Dai [1] pointed out that DNA motif search would still be an opening challenge for researchers. In this paper, we focus on the planted (l, d) motif search [2], a widely used model for motif finding:

Planted (l, d) Motif Search (PMS)
Given a set of n-length sequences S = {s 1 , s 2 , …, s t } over the alphabet {A, T, C, G} and nonnegative integers l and d, satisfying 0# d , l , n, the task is to find an l-mer (i.e., an l-length string) m, called a motif, such that each sequence s i contains an l-mer m i differing from m in at most d positions.
In the PMS problem, typical values of t and n are 20 and 600; then, various combinations of l and d correspond to different instances of PMS. The instances where the value of d is large in relation to the value of l are called weak instances, which are difficult to be solved. For example, the instances (15,4) and (18,6) are well-known weak instances [3].
Numerous recognition algorithms, either approximate or exact, have been proposed. The initialization of most approximate algorithms is selecting start sites randomly to begin iterations, which makes them easily fall into a local optimum. Gibbs Sampling [4] and MEME [5] are classic algorithms in this approach, and they usually use several groups of start sites as the initial states to avoid the local optimum. PROJECTION [6] partitions all l-mers in S into many buckets, and selects some valid buckets that contain several occurrences of the desired motif and little else. Then EM algorithm is used to refine the valid buckets. MotifCut [7] is a graph-theoretic approach in which motif finding is formulated as a search for the maximum density subgraph. MCEMDA [8], a Monte Carlo version of the EM motif finding algorithm, starts from an initial model, and then it iteratively performs Monte Carlo simulation and parameter update until convergence. SBaSeTraM [9] is a Bayesian search method obtained by combining two models, namely a foreground model and a background model, which describe the distribution of sequences under the alternative hypothesis that there is a TFBS and under the null hypothesis that there is no TFBS at the site, respectively. Vine [10], the recent method, is a polynomial-time heuristic algorithm for motif search based on WINNOWER [2]. Generally, the approximate algorithms are able to produce results in a short time, but they cannot guarantee global optimality.
Exact recognition algorithms are guaranteed to obtain the best solution, although exponential running time is required in the worst case due to the NP-hard nature of PMS [11]. It is therefore meaningful to design the exact algorithm with a small time overhead. According to the search space of PMS, there are two types of exact recognition algorithms. One is the exact algorithms based on alignment matrix, which test all (n 2 l +1) t possible combinations of motif positions in each of sequences to find the one that yields the highest score. The other is the exact algorithms based on pattern, which verify all 4 l possible patterns to find the one that appears in all t sequences with the minimum number of mutations. Although the two types of exact algorithms produce the consistent results [12], the initial search space of the latter is much smaller than that of the former. Therefore, the research mainly concentrates on the latter in recent years, and the associated algorithms [13][14][15][16][17][18][19][20][21][22] are called pattern-driven algorithms.
All pattern-driven algorithms aim to reduce candidate motifs through various means. However, no single algorithm is superior to others on all PMS instances. Being the fastest in the family of suffix tree algorithms [13][14][15][16][17][18], RISOTTO [18] shows competitive execution time, but its performance drops significantly with the increase of the motif length. PMSP [19] adopts the following idea: for each l-mer x in s 1 , it generates d-neighbors of x, and then verifies if each d-neighbor y is a motif by checking whether there are l-mers in s i (2# i # t) that are at distance d from y. PMSprune [20] is an improvement over PMSP obtained by using branch and bound, capable of solving many PMS instances in a short time. The iTriplet algorithm [21], which constructs multiple triplets of lmers to reduce candidate motifs, is suitable for identifying long motifs (.20 nucleotides), but suffers from substantial computational requirements. PMS5 [22], whose main idea is to use integer programming to compute the common d-neighbors of three lmers, is an efficient algorithm for solving weak PMS instances with the value of l about 20. But PMS5 is difficult to solve weak instances with large values of l, because of the substantial memory required for storing the results of all possible integer linear programs. There are some other methods for the exact algorithms, see for example: [23][24][25] etc.
Besides the search method, objective functions also play an important role in motif search. The analysis of wild-type and mutant Zif268 zinc fingers genes made by Bulyk et al. [26] indicated that there exists interdependence among positions in transcription factor binding sites for real biological data. If the objective function cannot express the interdependence, the obtained optimal solution may not possess a real biological significance. Li and Tompa [27] assessed and classified the objective functions used by existing tools, and they pointed out that an ideal objective function should assign the optimal score to the true motif. For the exact algorithms, the objective function is used to rank reported (l, d) motifs, and the prediction is done by selecting the motif with best score.
To effectively solve the PMS problem, a new pattern-driven algorithm, PairMotif, is proposed in this paper. First, motivated by the observation that the number of candidate motifs shared by two l-mers is dramatically affected by the distance between the two lmers, multiple pairs of l-mers are carefully selected from input sequences to limit the total candidate volume. Second, a new filtering rule based on a pair of l-mers is designed to filter out more l-mers to be scanned in candidate verification. Third, a deterministic and efficient method for traversing candidate motifs is given to perform candidate verification. The experimental results demonstrate the efficiency and effectiveness of the proposed algorithm.

Ethics Statement
N/A. The ethics approval is not necessary for our study. That is because: 1) our study doesn't make use of any human or vertebrate animal subjects and tissue; 2) our study focuses on faster algorithms for planted (l, d) motif search, which is a widely used computing model for DNA motif search, and our experiments are completed only by using computers.

Notations and Definitions
The Hamming distance between two l-mers x and x9 is denoted by d H (x, x9). In PairMotif, all pairs of l-mers x and x9 that satisfy d H (x, x9) .2d are ignored, since the Hamming distance between two instances of the same motif must be less than or equal to 2d and 2d is less than l. Given an l-mer x and a sequence s, let C(x, s) be the set of all such l-mers y in s that d H (x, y) #2d.
Given two l-mers x and x9, let P 1 (x, x9) be the set of positions at which the corresponding characters are identical, and P 0 (x, x9) be the set of positions at which the corresponding characters are different. Given another l-mer y, the l positions in the alignment of x, x9 and y can be divided into four categories: P 00 (x, x9, y), P 01 (x, x9, y), P 10 (x, x9, y) and P 11 (x, x9, y). More precisely, for each position i, assume that it belongs to P ab (x, x9, y). Then, a is 1 if Figure 1 shows an example for partitioning the positions in the alignment of two l-mers and that of three l-mers. Definition 1. Given two l-mers x and x9, the candidate motifs shared by x and x9, M d (x, x9), is defined to be {y: |y| = l, d H (y, x) # d and d H (y, x9) # d}.

PairMotif Algorithm
As shown in Figure 2, PairMotif consists of the following three stages: (1) Selecting pairs. For each l-mer x in s 1 , select a reference sequence s r from S 2 {s 1 }. Then, multiple pairs of l-mers are formed by pairing the l-mer x and each l-mer x9 in C(x, s r ). Note that, s r is selected with the most restrictive limit on the total number of candidate motifs in comparison with other sequences. (2) Filtering l-mers. For each selected pair of l-mers x and x9, use two filtering rules to filter out l-mers in C(x, s i ) (2# i # t, i ? r) that cannot be instances of any candidate motif in M d (x, x9). Let C9(x, s i ) denote the set of the remaining l-mers in C(x, s i ) after filtering. (3) Verifying candidate motifs. For each selected pair of l-mers x and x9, traverse each candidate motif y in M d (x, x9), and verify if y is a motif by checking whether there is an instance of y in each C9(x, s i ).
Based on these three stages, the PairMotif algorithm is presented as follows. 2: for each l-mer x in s 1 do 3: Compute C(x, s i ) (2# i # t) 4: Select a reference sequence s r from S 2 {s 1 } 5: for each x9 M C(x, s r ) do 6: Form a pair of l-mers x and x9 7: Filter l-mers in C(x, s i ) (2# i # t, i ? r) and store the remaining

Algorithm PairMotif
10: Add y to M 11: Output M In line 1, the set of (l, d) motifs M is initialized to an empty set. Lines 2-6 correspond to the stage of selecting pairs, in which C(x, s i ) is obtained by calculating the Hamming distance from the l-mer x to all l-mers in s i (2# i # t). Line 7 and lines 8-10 show the stages of filtering l-mers and verifying candidate motifs, respectively. PairMotif guarantees to discover all (l, d) motifs and outputs them in line 11.
Next, we explain key techniques for each stage.

Stage 1: Selecting Pairs
For each l-mer x in s 1 , a reference sequence s r is required to form multiple pairs of l-mers. Specifically, s r is selected from S -{s 1 } by satisfying.
where P . For any four 15-mers x 1 , x 1 9, x 2 and x 2 9, even though the difference between d H (x 1 , x 1 9) and d H (x 2 , x 2 9) is not large, |M d (x 1 , x 1 9)| and |M d (x 2 , x 2 9)| can differ by several times or more. For example, when d H (x 1 , Based on this observation, given an l-mer x in s 1 , we can analyze how the number of candidate motifs changes with different s i in S -{s 1 }. On the one hand, there are a relatively small number of candidate motifs, if all l-mers in s i are at a relatively large Hamming distance from x. On the other hand, x and s i will lead to a huge number of candidate motifs, once there are several l-mers in s i at a very small Hamming distance from x. Our selection method limits the total candidate volume by preventing the occurrence of the latter case.

Stage 2: Filtering l-mers
In PairMotif, for each selected pair of l-mers x and x9, two filtering rules below are used to determine if each l-mer z in S -{s 1 , s r } is a possible instance of a certain candidate motif in M d (x, x9): ) .2d, then z is not an instance of any candidate motif in M d (x, x9).
Rule 2. If there is no such a 2-tuple ,a, b. M R(x, x9) that satisfies abs(|P 10 (x, x9, z)|2a) + abs(|P 00 (x, x9, z)|2b) # d, then z is not an instance of any candidate motif in M d (x, x9). The symbol abs(?) denotes the operation of taking the absolute value.
Rule 1 employs the widely used criterion that two instances of the same motif must not differ by more than 2d differences. Rule 2 realizes filtration from a different perspective (the proof of its correctness is included in the Text S1): it compares the distance with d in the case that there is an error value ,a, b.. Therefore, Rule 2 can filter out some l-mers that cannot be filtered out by Rule 1. Let us take the instance (15,4) as an example. Assume that there are three l-mers x (AAAAAAAGGGGGGGG), x9 (AAAAAAACCCCCCCC) and z (AAAAAAATTTTTTTT). By Rule 2, z can be filtered out successfully since R(x, x9) = {,0, 0.} and abs(|P 10 (x, x9, z)|20) + abs(|P 00 (x, x9, z)|20) = 0+8 = 8. d.
However, there also exist some l-mers that can be filtered out by Rule 1, but cannot be filtered out by Rule 2. For example, keep x and x9 unchanged, and let z = TTTTAAAGGGGGGGG. By Rule 1, d H (x9, z) = 12.2d, so z is filtered out. But by Rule 2, z cannot be filtered out since abs(|P 10 (x, x9, z)|20) + abs(|P 00 (x, x9, z)|20) = 4+0 = 4# d. Taking these considerations into account, we use these two rules to simultaneously perform filtration.
Moreover, the fact that most pairs of l-mers selected in Stage 1 have relatively large Hamming distances is conducive to filtering out more l-mers. This can be understood through Observation 1. The larger the value of d H (x, x9), the smaller the value of |M d (x, x9)|. Accordingly, given a random l-mer z, the probability that z is one of the d-neighbors of M d (x, x9) is relatively small when d H (x, x9) is relatively large.
On the basis above, most l-mers, which need to be compared with each candidate motif in Stage 3, are filtered out in advance.

Stage 3: Verifying Motifs
For each selected pair of l-mers x and x9, candidate motifs in M d (x, x9) need to be traversed to perform candidate verification. This section gives a deterministic and efficient traversing method, rather than enumerating all possible l-mers.
At first, we discuss how to compute R(x, x9) given a pair of l-mers x and x9; R(x, x9) implies the approach to traversing candidate motifs in M d (x, x9). Assume that y is a candidate motif in M d (x, x9) with R(x, x9, y) = ,a, b.. By Definition 2, we have a # l2d H (x, x9), b # d H (x, x9) and d H (y, x) + d H (y, x9) = 2a +2b + (d H (x, x9)2b). Furthermore, we have d H (y, x) + d H (y, x9) #2d because both x and x9 are instances of y. Thus, we obtain: Obviously, the value of R(x, x9) is determined by d H (x, x9). Given the value of d H (x, x9), it is easy to compute R(x, x9) for the specified (l, d) instance by listing all 2-tuples ,a, b. satisfying (2). Table 1 shows R(x, x9) for the instance (15,4) under different Hamming distances.
Based on different 2-tuples in R(x, x9), candidate motifs in M d (x, x9) can be traversed in a simple way. M d (x, x9) consists of several mutually disjoint subsets with each subset sharing a different 2tuple in R(x, x9); these subsets are visited one by one. As shown in Figure 3, for each ,a, b. in R(x, x9), let y = x9, and then we traverse candidate motifs in the associated subset of M d (x, x9) by changing y with the following three steps. First, select a positions from P 1 (x, x9), and for each i of these a positions, change y[i] to one of the three characters different from x[i]; second, select b positions from P 0 (x, x9), and for each i of these b positions, change y[i] to one Table 1. R(x, x9) and |M d (x, x9)| for the instance (15,4).  According to the process of traversing candidate motifs, the size of M d (x, x9) is calculated as follows: Moreover, as the process of verifying candidate motifs is to frequently calculate the Hamming distance between two l-mers, an  , obtain d H (x, x9), which is the number of 2-bits that are not 00 in X, by searching a 256 byte table ql/4r times. At each position of the table, we cache the number of 2-bits that are not 00 in the associated 8-bit. In the implementation of PairMotif, all l-mers in S are represented as integers and cached for skipping the first step of this method. Therefore, in practice, this method is four times faster than comparing two l-mers directly. The details of this method and the evaluation of its speedup are included in the Text S2.

Results and Discussion
We mainly compare the time performance of PairMotif with that of other famous exact algorithms, since all exact algorithms report the same results with different time overheads.

Time and Space Analysis
Let p k denote the probability that the Hamming distance between two random l-mers is not more than k (0, k , l): The time complexity of PairMotif is analyzed by estimating the number of candidate motifs and the number of l-mers to be scanned in verifying each candidate motif. PairMotif loops through O(n 2 ) pairs of l-mers. For each pair of l-mers x and x9, the probability that a random l-mer y becomes a candidate motif is is approximately equal to 4 l p 2 d . Thus, the total number of candidate motifs isO(n 2 4 l p 2 d ). Furthermore, there are O(tn) potential l-mers to be scanned in verifying each candidate motif. After being filtered by both Rule 1 and Rule 2, each remaining l-mer z satisfies at least the condition that d H (z, x) #2d and d H (z, x9) #2d. Hence, the number of l-mers to be scanned is O(tnp 2 2d ). Based on the above considerations, the time complexity of PairMotif is O(tn 3 4 l p 2 d p 2 2d ). PairMotif requires little storage for implementation. In PairMotif, all l-mers in s 1 are traversed with each of them processed independently. After processing one l-mer, the associated memory can be released, and we just need to consider the memory requirement for processing one l-mer x. Specifically, we store the (t21)(n-l+1) l-mers in s 2 , …, s t and the Hamming distances from x to these l-mers. Therefore, the space complexity of PairMotif is O(tn). Table 2 shows the time and space complexities of PairMotif and those of several famous exact algorithms, such as PMSprune [20], iTriplet [21] and PMS5 [22]. PairMotif requires the least amount of storage space. Particularly, the space complexity of PairMotif and that of PMSprune, which depend on t and n, are fixed on different PMS instances; whereas the storage requirement of This table shows the time and space complexities of PairMotif and that of other famous exact algorithms. Note that, t is the number of sequences; n is the sequence length; p k is the probability that the Hamming distance between two random l-mers is not more than k; L represents the time to load the ILP   iTriplet and that of PMS5, which depend on l and d, may be unrealistic on the PMS instances with large values of l and d. For time complexity, PairMotif outperforms PMSprune because the ratio of their time complexities, np d p 2 2d , is less than 1. It also outperforms iTriplet on most PMS instances except for those with very large values of l. Moreover, PairMotif shows its performance advantage over PMS5 when l is small (l ,15); in this case, the time overhead of loading the ILP table becomes the limiting factor in the performance of PMS5.

Test on Simulated Data
We adopt the simulated data sets used in [6]. Generate a motif of length l and t identically distributed sequences of length n. Then, for each sequence s, implant a random motif instance, which differs from the motif in at most d positions, to a random position in s.
In experiments, we fix t = 20 and n = 600, and compare the performance of PairMotif with that of several recently proposed exact algorithms, such as RISOTTO [18], PMSprune [20], iTriplet [21] and PMS5 [22], by varying l and d values (PMS instances). For the motif length l, we consider its value ranging from 9 to 30, as the binding sites are short DNA segments. To select a group of PMS instances to carry out comparisons, we use the 2d neighborhood probability p 2d calculated by (4) to indicate the weakness of a PMS instance. The larger the value of p 2d , the weaker the corresponding PMS instance. All algorithms are performed in the same experimental environment with a 2.67 GHz processor and a 4 Gbyte memory. And the average running times are derived by executing different algorithms on ten simulated data sets.
At first, the comparisons are carried out on fixed 2d neighborhood probability p 2d . Table 3 gives the running times of these algorithms on the PMS instances with the value of p 2d around 0.05, which is approximately the same as the p 2d value of  [22] and PMSprune [20] on different sequence lengths on the instance (18,6). The x-axis shows the sequence lengths. The y-axis shows the running times.
doi:10.1371/journal.pone.0048442.g004 the instance (15,4). The results confirm with the time complexities in Table 2. PairMotif achieves the best execution time over other algorithms, and PMSprune is the second fastest. Although iTriplet exhibits stable running times, it does not show its performance advantage. PMS5 is defeated either because of the extra time overhead for loading ILP table or its high storage requirement [22]. RISOTTO is very sensitive to the value of l, and its running time increases exponentially when l increases. Second, the running times of these algorithms are compared on a group of PMS instances with different probability p 2d that ranges from 0.01 to 0.5. We do not consider the probability p 2d whose value exceeds 0.5, as the corresponding motif is so degenerate that although exact algorithms can report all (l, d) motifs, it is difficult to distinguish the planted motif from spurious motifs. The running times are shown in Table 4. The performance of RISOTTO is the worst due to the strong sensitivity to the value of l. The iTriplet algorithm fails to solve the PMS instances with p 2d .0.2 because the value of p 2d severely affects its time complexity. PMSprune can solve most of these PMS instances in a short time except for the instances (27,9) and (19,7). PMS5 works well on the PMS instances with large p 2d , whereas fails on the instances (29,8), (28,9) and (27,9) because of memory limits. For PairMotif, all these PMS instances can be solved within 10 hours. Particularly, among these algorithms, only PairMotif can solve the instance (27,9) while other algorithms fail because either a large amount of memory or execution time is required.
Moreover, it should be noticed that for a specific PMS instance, the longer the input sequences, the weaker the instance. It is therefore necessary to compare the performance of algorithms on different sequence length n given a PMS instance. To carry out comparisons, we select two algorithms PMS5 and PMSprune besides PairMotif, and then perform them on the well-known instance (18, 6) by varying n from 100 to 1000. Figure 4 plots the running times of these algorithms against the increase of n. The running time of PairMotif is almost linearly related to the sequence length; whereas the running time of PMS5 and that of PMSprune increase sharply as the sequence length increases, especially for PMSprune. The reason why the performance of PairMotif is stable over the sequence length is that PairMotif has strong ability of filtering scanned l-mers. That is, the remaining l-mers to be scanned after filtering are so few that their volume is almost unchanged for different sequence lengths. Neither PMS5 nor PMSprune possesses this property. Thus, although PairMotif does not outperform PMS5 when the input sequences are short, it does so when n .700.

Test on Biological Data
To test the validity of PairMotif, we identify the known transcription factor binding sites in five real data sets discussed in [28], including preproinsulin, DHFR, c-fos, metallothionein and Yeast ECB (these data sets are included in the Dataset S1). These data differ substantially from the simulated data.
PairMotif reports all (l, d) motifs. To obtain one predicted motif, a specific objective function is needed to rank the reported motifs, and the motif with maximum score is selected. In our experiments, three objective functions (consensus score [12], relative entropy [29] and sequence specificity [30]) suitable for exact algorithms are used to obtain three predicted motifs for each data set. Table 5 shows the results of our experiments. The (l, d) used for each data set is selected as follows: the value of l is fixed as the length of the published motif; the value of d is the minimum needed to ensure that the reported (l, d) motifs will contain the real motif. The third column gives the order of magnitude of the (l, d) motifs reported by PairMotif. The last column shows the predicted motifs selected from the reported (l, d) motifs by using three PairMotif ensures that each predicted motif is optimal under the associated objective function. On this basis, the prediction accuracy depends on the used objective function. To evaluate each predicted motif, the nucleotide-level correlation coefficient (nCC) [31] is adopted: where nTP, nTN, nFN and nFP are the number of true/false positive/negative predicted nucleotides. The use of correlation coefficient allows an integrated assessment of sensitivity and specificity. Figure 5 compares the predicted motifs under different objective functions by showing their nCCs. The value of nCC ranges from 21 (indicating perfect anticorrelation) to +1 (indicating perfect correlation); the larger the value of nCC, the higher the accuracy of the predicted motif. For the five real data sets, the result accuracy under relative entropy and sequence specificity is obviously higher than that under consensus score. Nevertheless, no single objective function outperforms the others for every data set. For example, sequence specificity is well used for preproinsulin and DHFR, but leads to lower accuracy compared with relative entropy for c-fos and metallothionein. From the above analysis, PairMotif provides a good foundation for obtaining the real motif under a given objective function. If the objective function is able to assign the maximum score to the real motif, the real motif will be quickly found by PairMotif.

Conclusions
DNA motif search is a challenging problem in computer science and bioinformatics. In this paper, we propose a new combinatorial algorithm, PairMotif, which restricts the search space of motif search by generating candidate motifs from multiple pairs of lmers. PairMotif requires a small space complexity of O(tn), where t is the number of input sequences and n is the sequence length. It has a stable time performance over the sequence length, and it can solve most PMS instances in a reasonably short amount of time. Experimental results on real biological data sets show that PairMotif provides a good foundation for obtaining the real motif under a given objective function.
In PairMotif, all l-mers in s 1 are traversed and each l-mer is processed independently. Therefore, PairMotif has a good feature of parallel computing, which allows it to use the strengths of parallel and distributed systems to improve the efficiency and quality of motif finding. Moreover, since all candidate motifs are traversed one by one in the current version of PairMotif, the performance of PairMotif can still be improved by using branch and bound in the process of traversing candidate motifs.

Supporting Information
Text S1 The correctness of filtering rule 2.

(DOC)
Text S2 Method for calculating Hamming distance between two l-mers.

(DOC)
Dataset S1 The real biological data sets used in our experiments, including preproinsulin, DHFR, c-fos, metallothionein and Yeast ECB, which are discussed in [28]. (RAR) Program S1 The executable program of algorithm PairMotif.