qPMS7: A Fast Algorithm for Finding (ℓ, d)-Motifs in DNA and Protein Sequences

Detection of rare events in a set of DNA/protein sequences could lead to new biological discoveries. One kind of such rare events is the presence of patterns called motifs in DNA/protein sequences. Motif discovery is an important problem in biology. For example, it is useful in the detection of transcription factor binding sites and transcriptional regulatory elements that are crucial in understanding gene function, human disease, drug design, etc. Finding motifs is challenging because the general version of motif search has been proven to be intractable. Many versions of the motif search problem have been proposed in the literature. One such version is the (ℓ, d)-motif search (or Planted Motif Search (PMS)). A generalized version of the PMS problem, namely, Quorum Planted Motif Search (qPMS), is shown to accurately model motifs in real data. However, solving the qPMS problem is extremely difficult because a special case of it, the PMS problem, is already NP-hard, which means that any algorithm solving it can be expected to take exponential time in the worst case. In this paper, we propose a novel algorithm named qPMS7 that tackles the qPMS problem on real data as well as challenging instances. Experimental results show that Algorithm qPMS7 is on average 5 times faster than the state-of-the-art algorithm. The executable program of Algorithm qPMS7 is freely available on the web at http://pms.engr.uconn.edu/downloads/qPMS7.zip. Our online motif discovery tools that use Algorithm qPMS7 are freely available at http://pms.engr.uconn.edu or http://motifsearch.com.


Introduction
Detection of rare events happening in a set of DNA/protein sequences often provides the main clue leading to new biological discoveries. One kind of such rare events is the presence of patterns called motifs. For example, regulatory regions in a genome such as promoters, enhancers, locus control regions, etc., contain motifs that control many biological processes such as gene expression (see [1]). Proteins known as transcription factors regulate the expression of a gene by binding to locations of motifs in regulatory regions. For instance, transcription factors such as TFIID, TFIIA and TFIIB usually bind to the sequence 5′-TATAAA-3′ in the promoter region of a gene in order to initiate its transcription. Such motifs and their locations in regulatory regions, i.e., binding sites, help to decipher the sophisticated regulatory mechanism of gene expression. As a result, motif identification plays an important role in biological studies.
Motif prediction is usually the first stage in the process of identifying motifs. An extensive amount of research has been done on this topic over the past twenty years. In the literature, many approaches for motif prediction have been proposed. One of them is a combinatorial approach that has proven to be more accurate than the others. Even in this combinatorial approach, many variations can be found such as Planted Motif Search (PMS), Simple Motif Search (SMS), and Edit-distance-based Motif Search (EMS) (see e.g., [2]).
Among the combinatorial variations, the PMS Problem has been the most widely studied, perhaps because it offers a higher level of accuracy in modeling the true motifs than the others. Motifs typically occur with mutations at binding sites. The binding sites are referred to as instances of a motif. A motif in this model is referred to as an (ℓ, d)-motif where ℓ is its length and d is the maximum number of mutations allowed for its instances. Given a set of sequences, the objective of the PMS problem is to find all the (ℓ, d)-motifs in them. The formal definition of the PMS Problem is given in Section 0.1. An algorithm that solves the PMS problem is called a PMS Algorithm.
Owing to its importance, the PMS problem has been extensively studied in the past twenty years. Many PMS algorithms have been proposed in the literature. There are two kinds of PMS algorithms, namely, exact and approximate. An exact algorithm always finds all the (ℓ, d)-motifs present in the input sequences. An approximate algorithm may not find all the motifs. In this paper we only consider exact algorithms. The (exact variant of the) PMS problem has been shown to be NP-hard, which means that there is unlikely to be a PMS algorithm that takes only polynomial time. As a result, all the existing exact PMS Algorithms take time that is exponential in some of the parameters in the worst case. In practice, all known PMS Algorithms (both exact and approximate) are only able to find (ℓ, d)-motifs for up to certain values of ℓ and d. The most recent exact algorithms that have been proposed in the literature are Algorithm PMS6 due to [3], Algorithm PMS5 due to [4], Algorithm Pampa due to [5], Algorithm PMSPrune due to [6], Algorithm PMS3 due to [7], Algorithm Voting due to [8], and Algorithm RISOTTO due to [9]. Some earlier PMS algorithms are due to [10], [11], [12], [13], [14], [15], [16], and [17]. Among these known algorithms, Algorithm PMS6 is considered the fastest one and has been developed closely following the ideas of Algorithm PMS5.
Approximate PMS algorithms usually tend to be faster than exact PMS algorithms. Typically, approximate PMS algorithms employ heuristics such as local search, Gibbs sampling, expectation optimization, etc. Examples of approximate algorithms are Algorithm MEME due to [18], Algorithm PROJECTION due to [19], Algorithm GibbsDNA due to [20], Algorithm WINNOWER due to [21], and Algorithm RandomProjection due to [22]. Some other approximate PMS algorithms are Algorithm MULTI-PROFILER due to [23], Algorithm PatternBranching due to [24], Algorithm ProfileBranching due to [24], and Algorithm CONSENSUS due to [25].
A generalized version of the PMS Problem, namely the Quorum Planted Motif Search (qPMS) Problem, was first considered in [6]. The qPMS problem is to find all the motifs that have motif instances present in q out of the n input sequences. The qPMS problem captures the nature of motifs more precisely than the PMS problem does because, in practice, some motifs may not have motif instances in all of the input sequences. The qPMS problem is formally defined in Section 0.1. An algorithm that solves the qPMS problem is called a qPMS algorithm. qPMS algorithms can be used to find DNA motifs and protein motifs as well as transcription factor binding sites. The larger the values of ℓ and d that a qPMS algorithm can handle, the more accurate will be the motifs it finds. So it is important to solve the qPMS problem instances with large values of ℓ and d. However, solving the qPMS problem is a difficult task since it is even harder than the PMS problem. To the best of our knowledge, the currently best exact qPMS algorithm is Algorithm qPMSPrune due to [6], which can only solve instances up to ℓ = 17 and d = 5 for q = n/2, where n is the number of input sequences. In this paper, we propose a new algorithm named Algorithm qPMS7 that can solve larger instances. Also, qPMS7 is ten times as fast as qPMSPrune. In addition, when applied to the PMS problem, our algorithm is faster than the best PMS algorithm, i.e., Algorithm PMS6 due to [3].

Problems Definition and Notations
Definition 0.1 A string x = x[1] . . . x[ℓ] of length ℓ is called an ℓ-mer.
Definition 0.2 Given two strings x = x[1] . . . x[ℓ] and s = s[1] . . . s[m] with ℓ < m, we use the notation x ∈_ℓ s if x is a contiguous substring of s. In other words, x ∈_ℓ s if there exists 1 ≤ i ≤ m − ℓ + 1 such that x[j] = s[j + i − 1] for every 1 ≤ j ≤ ℓ. We also say that x is an ℓ-mer in s.
Definition 0.3 Given two strings x and y of equal length, the Hamming distance between x and y, denoted d_H(x, y), is the number of positions at which they differ.
Definition 0.4 Given an ℓ-mer x and a string s with ℓ < |s|, the Hamming distance between x and s is d_H(x, s) = min { d_H(x, z) : z ∈_ℓ s }.
Definition 0.5 Given a set of n strings s_1, . . . , s_n of length m each, a string M of length ℓ is called an (ℓ, d, q)-motif of the strings if there are at least q out of the n strings such that the Hamming distance between each one of them and M is no more than d. M is called an (ℓ, d, q)-motif for short if the set of strings is clear.
The definition of the Quorum Planted Motif Search (qPMS) Problem is as follows. Given n input strings s_1, . . . , s_n of length m each, and three integer parameters ℓ, d and q, find all the (ℓ, d, q)-motifs of the input strings. The Planted Motif Search (PMS) problem is the special case of the qPMS problem when q = n. In this paper, we propose a fast algorithm for the qPMS problem.
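The definitions above can be made concrete with a few lines of Python. This is only an illustrative sketch: the function names (`hamming`, `lmers`, `dist_to_string`, `is_motif`) are hypothetical and not part of the qPMS7 program, and the check is the direct, unoptimized reading of the (ℓ, d, q)-motif definition.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(1 for u, v in zip(a, b) if u != v)


def lmers(s: str, l: int):
    """Yield every contiguous substring of length l of s, left to right."""
    for i in range(len(s) - l + 1):
        yield s[i:i + l]


def dist_to_string(m: str, s: str) -> int:
    """Smallest Hamming distance between m and any len(m)-mer of s."""
    return min(hamming(m, z) for z in lmers(s, len(m)))


def is_motif(m: str, strings: list[str], d: int, q: int) -> bool:
    """True if at least q of the strings contain an l-mer within distance d of m."""
    hits = sum(1 for s in strings if dist_to_string(m, s) <= d)
    return hits >= q
```

For example, with the strings `["ACGTAC", "TTGTAA", "CCCCCC"]`, the 4-mer `CGTA` occurs exactly in the first string and with one mismatch (`TGTA`) in the second, so it is a (4, 1, 2)-motif but not a (4, 1, 3)-motif.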

The Existing Algorithm qPMSPrune
Algorithm qPMSPrune for the qPMS problem was proposed by [6]. For the sake of completeness, we will describe Algorithm qPMSPrune in this section briefly because our new algorithm is partially based on it. For more details on Algorithm qPMSPrune, the readers are referred to [6].
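Algorithm qPMSPrune uses the d-neighborhood concept defined as follows. The d-neighborhood of an ℓ-mer x, denoted B_d(x), is the set of all ℓ-mers t such that d_H(t, x) ≤ d.

It is easy to see that |B_d(x)| = Σ_{i=0}^{d} C(ℓ, i) (|Σ| − 1)^i, where Σ is the alphabet and C(ℓ, i) is the binomial coefficient. We denote this quantity by N(ℓ, d).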
Algorithm qPMSPrune is based on the following observation. Any (ℓ, d, q)-motif of the input strings must be in B_d(x) for some ℓ-mer x in some input string s_i, and it must also be an (ℓ, d, q − 1)-motif of the input strings excluding s_i. This observation can be stated formally as follows.
Observation 0.1 Let M be any (ℓ, d, q)-motif of the input strings s_1, . . . , s_n. Then there exists an i (with 1 ≤ i ≤ n) and an ℓ-mer x ∈_ℓ s_i such that M is in B_d(x) and M is an (ℓ, d, q − 1)-motif of the input strings excluding s_i.
The above observation suggests the following algorithm. Compute B_d(x) for every ℓ-mer x in each input string s_i for 1 ≤ i ≤ n. For each ℓ-mer in the neighborhoods thus computed, check if it is an (ℓ, d, q − 1)-motif of the input strings excluding s_i. This simple algorithm can be improved further as shown in [6].
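The simple algorithm suggested by Observation 0.1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the helper names are hypothetical, the neighborhood is enumerated explicitly (which is feasible only for small ℓ and d), and Hamming distances are recomputed from scratch at every candidate.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for u, v in zip(a, b) if u != v)


def neighborhood(x, d, alphabet):
    """Return B_d(x): all strings within Hamming distance d of x."""
    result = set()
    for k in range(d + 1):
        for positions in combinations(range(len(x)), k):
            # At each chosen position, pick a letter different from x.
            choices = [[c for c in alphabet if c != x[p]] for p in positions]
            for letters in product(*choices):
                t = list(x)
                for p, c in zip(positions, letters):
                    t[p] = c
                result.add("".join(t))
    return result


def dist_to_string(m, s):
    """Smallest Hamming distance between m and any len(m)-mer of s."""
    l = len(m)
    return min(hamming(m, s[i:i + l]) for i in range(len(s) - l + 1))


def simple_qpms(strings, l, d, q, alphabet="ACGT"):
    """All (l, d, q)-motifs, found by scanning B_d(x) for every l-mer x."""
    motifs = set()
    for i, s_i in enumerate(strings):
        others = strings[:i] + strings[i + 1:]
        for start in range(len(s_i) - l + 1):
            for t in neighborhood(s_i[start:start + l], d, alphabet):
                # t is within d of s_i by construction; need q - 1 more strings.
                if sum(dist_to_string(t, s) <= d for s in others) >= q - 1:
                    motifs.add(t)
    return motifs
```

As a sanity check on the neighborhood size, `neighborhood("1010", 2, "01")` contains 1 + 4 + 6 = 11 strings.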
The key observation is that it is sufficient to consider each input string s_i for 1 ≤ i ≤ n − q + 1:
Observation 0.2 Let M be any (ℓ, d, q)-motif of the input strings s_1, . . . , s_n. Then there exists an i (with 1 ≤ i ≤ n − q + 1) and an ℓ-mer x ∈_ℓ s_i such that M is in B_d(x) and M is an (ℓ, d, q − 1)-motif of the input strings excluding s_i.
Algorithm qPMSPrune is based on the above observation. For any ℓ-mer x, it represents B_d(x) as a tree T_d(x) using the following rules.
1. Each node in T_d(x) is a pair (t, p) where t = t[1] . . . t[ℓ] is an ℓ-mer and p is an integer between 0 and ℓ such that t[p + 1] . . . t[ℓ] = x[p + 1] . . . x[ℓ].
For example, the tree T_2(1010) with alphabet Σ = {0, 1} is illustrated in Figure 1.
Clearly, the following properties of T d (x) can be inferred directly from the rules.

Each node in
The algorithm traverses the tree T_d(x) in a depth-first manner. At each node (t, p), it computes d_H(t, s_j) incrementally from its parent for 1 ≤ j ≤ n, j ≠ i. This operation can be done in O(nm) time by the incremental computation shown in [6]. Let q' be the number of input strings s_j such that d_H(t, s_j) ≤ d. If q' ≥ q − 1 then t is an (ℓ, d, q − 1)-motif of the input strings excluding s_i. If this condition is satisfied, the algorithm outputs t as an (ℓ, d, q)-motif of the entire set of input strings.
Algorithm qPMSPrune prunes certain nodes (and their descendants) during the traversal. Under what conditions can we prune the node (t, p)? Let q'' be the number of input strings s_j such that d_H(t, s_j) ≤ 2d − d_H(t, x). Observe that if q'' < q − 1 then none of the nodes in the subtree rooted at node (t, p) could be an (ℓ, d, q − 1)-motif. This is because if there is a node (t', p') in the subtree which is an (ℓ, d, q − 1)-motif, then there are at least q − 1 input strings s_j such that d_H(t', s_j) ≤ d. Consider such an input string s_j. Since t' is a descendant of t, we have d_H(t, t') ≤ d − d_H(t, x). By the triangle inequality, d_H(t, s_j) ≤ d_H(t, t') + d_H(t', s_j) ≤ (d − d_H(t, x)) + d = 2d − d_H(t, x). This inequality implies that q'' ≥ q − 1. Therefore, if the condition q'' < q − 1 occurs, the algorithm can safely prune the subtree rooted at node (t, p) without missing any (ℓ, d, q − 1)-motif. The pseudo-code of Algorithm qPMSPrune is as follows.

Algorithm qPMSPrune
For each x ∈_ℓ s_i, 1 ≤ i ≤ n − q + 1, do:
  Traverse the tree T_d(x) in a depth-first manner. At each node (t, p), do the following steps.
    i. Incrementally compute d_H(t, s_j) from its parent for 1 ≤ j ≤ n, j ≠ i.
    ii. Let q' be the number of input strings s_j such that d_H(t, s_j) ≤ d. If q' ≥ q − 1, output t.
    iii. Let q'' be the number of input strings s_j such that d_H(t, s_j) ≤ 2d − d_H(t, x). If q'' < q − 1, then prune the subtree rooted at node (t, p). Otherwise, explore its children.
It is easy to see that the time and space complexities of Algorithm qPMSPrune are O((n − q + 1) n m^2 N(ℓ, d)) and O(n m^2), respectively.
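The pseudo-code above can be sketched in Python as a depth-first search with the q'' pruning test. Several points are assumptions made for illustration and should not be read as the implementation of [6]: the children of a node (t, p) are taken to be the ℓ-mers obtained by substituting one position after p, distances are recomputed at every node instead of incrementally, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for u, v in zip(a, b) if u != v)


def dist_to_string(t, s):
    """Smallest Hamming distance between t and any len(t)-mer of s."""
    l = len(t)
    return min(hamming(t, s[i:i + l]) for i in range(len(s) - l + 1))


def qpms_prune(strings, l, d, q, alphabet="ACGT"):
    """Return the set of (l, d, q)-motifs using a pruned DFS over each B_d(x)."""
    n = len(strings)
    motifs = set()

    def visit(t, p, depth, x, others):
        # depth equals d_H(t, x): every edge changes one position where t == x.
        dists = [dist_to_string(t, s) for s in others]  # not incremental here
        if sum(dist <= d for dist in dists) >= q - 1:   # step ii (q')
            motifs.add(t)
        if depth == d:
            return                                       # leaf of T_d(x)
        if sum(dist <= 2 * d - depth for dist in dists) < q - 1:
            return                                       # step iii: prune (q'')
        for p2 in range(p, l):                           # positions after p
            for c in alphabet:
                if c != x[p2]:
                    child = t[:p2] + c + t[p2 + 1:]
                    visit(child, p2 + 1, depth + 1, x, others)

    for i in range(n - q + 1):                           # s_1 .. s_{n-q+1}
        others = strings[:i] + strings[i + 1:]
        for start in range(len(strings[i]) - l + 1):
            x = strings[i][start:start + l]
            visit(x, 0, 0, x, others)
    return motifs
```

On the toy input `["AAA", "AAT", "CCC"]` with ℓ = 3, d = 1 and q = 2, the only candidates within distance 1 of two strings are those that agree with `AA` in the first two positions, so the call returns `{"AAA", "AAC", "AAG", "AAT"}`.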

A Computational Technique Improving upon Algorithm qPMSPrune
In this section, we propose a speedup technique to improve the runtime of Algorithm qPMSPrune. Specifically, the technique reduces the time taken for computing the Hamming distances d_H(t, s_j) in step (i) of Algorithm qPMSPrune. Recall that this operation takes Ω(nm) time in Algorithm qPMSPrune because it considers every ℓ-mer in each input string s_j. We observe that some ℓ-mers can be ignored without changing the result, since we only need to count q' and q''. Any ℓ-mer z in s_j can be ignored, as far as a node (t, p) in the tree T_d(x) and its descendants are concerned, if d_H(t, z) > 2d − d_H(t, x). The reason for this will be given in the next paragraph. Based on this observation, the technique is implemented as follows. At each node (t, p), we store a list of surviving ℓ-mers for each input string s_j. It is sufficient to store the positions of the ℓ-mers in the input strings. If the list of surviving ℓ-mers of s_j is empty, then we set d_H(t, s_j) = ∞. In the incremental distance computation, only the surviving ℓ-mers are considered. The runtime of the operation now depends on the sizes of the lists of surviving ℓ-mers.
The reason for ignoring any ℓ-mer z in s_j with d_H(t, z) > 2d − d_H(t, x), as far as a node (t, p) and its descendants are concerned, is as follows. If this condition occurs, then for any node (t', p') in the subtree rooted at node (t, p) we have d_H(t', z) ≥ d_H(t, z) − d_H(t, t') > (2d − d_H(t, x)) − (d − d_H(t, x)) = d. Therefore, ignoring the ℓ-mer z at any node (t', p') in the subtree rooted at node (t, p) will not change its q'. The value of q'' at node (t', p') may become smaller as a result of ignoring the ℓ-mer z. However, the pruning condition based on q'' in step (iii) of the pseudo-code still holds.
Another way to view the ignoring condition is as follows. Consider a node (t, p) in the tree T_d(x) and an ℓ-mer z in the input string s_j. Let us separate each of t, x and z into two parts based on p, namely, t = t_1 t_2, x = x_1 x_2 and z = z_1 z_2 where p = |t_1| = |x_1| = |z_1| and ℓ − p = |t_2| = |x_2| = |z_2|. Notice that t_2 = x_2. Then the inequality d_H(t, z) > 2d − d_H(t, x) is equivalent to d_H(x_2, z_2) > (d − d_H(t_1, x_1)) + (d − d_H(t_1, z_1)). In other words, B_{d − d_H(t_1, x_1)}(x_2) and B_{d − d_H(t_1, z_1)}(z_2) are disjoint. Notice that this condition is independent of t_2. This view helps us in designing our best algorithm qPMS7, which is described in Section 0.4.
The speedup technique reduces the runtime of Algorithm qPMSPrune drastically because the deeper a node is, the smaller will be the size of its list of surviving ℓ-mers. Note that the number of nodes at a depth of h from the root is exponential in h. In practice, the runtime of Algorithm qPMSPrune is improved by a factor of around 5 when this technique is used (see Table 1 and Table 2). However, the technique does not change the worst-case time complexity of Algorithm qPMSPrune.
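A small sketch may help to see how the lists of surviving ℓ-mers shrink along a root-to-leaf path. The function name `surviving_positions` is hypothetical, and the distance to each candidate is recomputed directly here rather than incrementally as in the actual technique.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for u, v in zip(a, b) if u != v)


def surviving_positions(t, x, s, positions, d):
    """Filter the start positions in s, keeping each l-mer z with
    d_H(t, z) <= 2d - d_H(t, x); all other l-mers can be ignored for the
    node t and its descendants in T_d(x)."""
    l = len(t)
    bound = 2 * d - hamming(t, x)
    return [i for i in positions if hamming(t, s[i:i + l]) <= bound]


# Pivot x = AAAA, d = 1, one input string s.
s = "AAAACCCC"
root_list = surviving_positions("AAAA", "AAAA", s, range(len(s) - 3), 1)
# A child node inherits its parent's list and filters it further.
child_list = surviving_positions("CAAA", "AAAA", s, root_list, 1)
```

Here the root keeps the positions `[0, 1, 2]` (the bound is 2d = 2), while the child `CAAA` at depth 1 keeps only position `[0]` (the bound drops to 1), which illustrates why deeper nodes are cheaper to process.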

Our Best Algorithm qPMS7
In this section, we propose a fast algorithm called qPMS7 for the qPMS problem. Algorithm qPMS7 is a generalized version of Algorithm qPMSPrune combined with the core idea of Algorithm PMS5 which was introduced in [4].
Recall that Algorithm qPMSPrune considers one ℓ-mer x in a specific input string s_i at a time. Algorithm qPMS7 extends Algorithm qPMSPrune by considering two ℓ-mers x and y in two different input strings s_i and s_j. An observation similar to that of Algorithm qPMSPrune can be obtained as follows.
Observation 0.3 Let M be any (ℓ, d, q)-motif of the input strings s_1, . . . , s_n. Then there exist 1 ≤ i ≠ j ≤ n and ℓ-mers x ∈_ℓ s_i and y ∈_ℓ s_j such that M is in B_d(x) ∩ B_d(y) and M is an (ℓ, d, q − 2)-motif of the input strings excluding s_i and s_j.
Using an argument similar to the one in [6], we infer that it is enough to consider every pair of input strings s_i and s_j with 1 ≤ i, j ≤ n − q + 2. As a result, the above observation gets strengthened as follows.
Observation 0.4 Let M be any (ℓ, d, q)-motif of the input strings s_1, . . . , s_n. Then there exist 1 ≤ i ≠ j ≤ n − q + 2 and ℓ-mers x ∈_ℓ s_i and y ∈_ℓ s_j such that M is in B_d(x) ∩ B_d(y) and M is an (ℓ, d, q − 2)-motif of the input strings excluding s_i and s_j.
Like Algorithm qPMSPrune, Algorithm qPMS7 uses a routine that finds all of the motifs M such that M is in B_d(x) ∩ B_d(y) and is an (ℓ, d, q − 2)-motif of the input strings excluding s_i and s_j. Recall that Algorithm qPMSPrune explores B_d(x) by traversing the tree T_d(x). In Algorithm qPMS7, we explore B_d(x) ∩ B_d(y) by traversing an acyclic graph, denoted as G_d(x, y), with similar construction rules. The rules for constructing G_d(x, y) are given below.
1. Each node in G_d(x, y) is a pair (t, p) where t is an ℓ-mer and p is an integer between 0 and ℓ. A node (t, p) is referred to as ℓ-mer t if p is clear. Let t = t_1 t_2, x = x_1 x_2 and y = y_1 y_2 where p = |t_1| = |x_1| = |y_1| and ℓ − p = |t_2| = |x_2| = |y_2|. Node (t, p) must satisfy the following constraints: d_H(t, x) ≤ d and d_H(t, y) ≤ d.
It is not hard to see that if we traverse the graph G_d(x, y) in a depth-first manner starting from node (x, 0), then all the ℓ-mers in B_d(x) ∩ B_d(y) will be visited. For example, Figure 2 illustrates the nodes of the graph G_2(1010, 1100) visited in a depth-first manner starting from node (x, 0) = (1010, 0), where the alphabet is Σ = {0, 1}.
Algorithm qPMS7 traverses the graph G_d(x, y) in a depth-first manner with the starting node (x, 0). During the traversal, at each node (t, p) it computes d_H(t, s_k) incrementally from its parent for 1 ≤ k ≤ n, k ≠ i, k ≠ j. With the same method as the one in Algorithm qPMSPrune, we can achieve this task in O(nm) time. Also, it is easy to see that if q' ≥ q − 2 then t is an (ℓ, d, q − 2)-motif of the input strings excluding s_i and s_j, where q' is the number of input strings s_k (k ≠ i, k ≠ j) such that d_H(t, s_k) ≤ d. If this is the case, the algorithm outputs t as an (ℓ, d, q)-motif of the entire set of input strings.
Algorithm qPMS7 also uses a pruning strategy similar to that of Algorithm qPMSPrune and the speedup technique discussed in Section 0.3. In this case, the speedup technique ignores some ℓ-mers in s_k when computing d_H(t, s_k) at each node (t, p) during the traversal of the graph G_d(x, y). The ignoring condition of an ℓ-mer z in s_k for this case resembles that in Section 0.3. Let t = t_1 t_2, x = x_1 x_2, y = y_1 y_2 and z = z_1 z_2 where p = |t_1| = |x_1| = |y_1| = |z_1|. It is not hard to see that the ℓ-mer z can be safely ignored if B_{d − d_H(t_1, x_1)}(x_2) ∩ B_{d − d_H(t_1, y_1)}(y_2) ∩ B_{d − d_H(t_1, z_1)}(z_2) is empty. Checking for this condition can be done in O(1) time using the incremental computation shown in [4]. During the traversal of the graph, at each node (t, p) we also store a list of surviving ℓ-mers for each input string s_k. At node (t, p), if the list of surviving ℓ-mers of an input string is empty, then the input string will contribute nothing to any descendant node of (t, p) in order for that descendant to be an (ℓ, d, q − 2)-motif. Therefore, the pruning condition is q'' < q − 2, where q'' is the number of input strings whose lists of surviving ℓ-mers are not empty. The following pseudo-code describes Algorithm qPMS7.

Algorithm qPMS7
1. For each x ∈_ℓ s_i, y ∈_ℓ s_j, 1 ≤ i < j ≤ n − q + 2, do:
  (a) Traverse the graph G_d(x, y) in a depth-first manner starting from node (x, 0). At each node (t, p), do the following steps.
    i. Incrementally compute d_H(t, s_k) from its parent for 1 ≤ k ≤ n, k ≠ i, k ≠ j.
    ii. Let q' be the number of input strings s_k such that d_H(t, s_k) ≤ d. If q' ≥ q − 2, output t.
    iii. Let q'' be the number of input strings whose lists of surviving ℓ-mers are not empty. If q'' < q − 2, then backtrack. Otherwise, explore its children.
Theoretically, the time and space complexities of Algorithm qPMS7 are O((n − q + 1)^2 n m^2 N(ℓ, d)) and O(n m^2), respectively. In the worst case, the runtime of Algorithm qPMS7 is worse than that of Algorithm qPMSPrune by a factor of n − q + 1. However, Algorithm qPMS7 is much faster than Algorithm qPMSPrune in practice, as shown in Section 0.6.
Algorithm qPMS7 also employs the following observation, which has been used in many prior works such as [6] and [26]. Let M be any (ℓ, d)-motif in the input strings s_1, s_2, . . . , s_n. Let M_i be an instance of M in s_i (for 1 ≤ i ≤ n). Then the Hamming distance between M_i and M_j is ≤ 2d for any i and j (with 1 ≤ i, j ≤ n). In other words, if M_i is any ℓ-mer in some s_i, then it could possibly be an instance of M only if there are at least q − 1 out of the n − 1 sequences s_j, j ≠ i, that have an ℓ-mer M_j such that the Hamming distance between M_i and M_j is ≤ 2d. This observation can be utilized to preprocess the input strings so that for any input string only those ℓ-mers that satisfy the above condition are kept (and the other ℓ-mers are ignored from further processing).
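The preprocessing step described above can be sketched as a straightforward filter. The function name `preprocess` is hypothetical and the quadratic scan below is a direct, unoptimized reading of the 2d condition, not the filter as implemented in the qPMS7 program.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for u, v in zip(a, b) if u != v)


def preprocess(strings, l, d, q):
    """For each input string, return the start positions of the l-mers that
    have an l-mer within Hamming distance 2d in at least q - 1 other strings."""
    kept = []
    for i, s_i in enumerate(strings):
        keep_i = []
        for a in range(len(s_i) - l + 1):
            u = s_i[a:a + l]
            support = 0
            for j, s_j in enumerate(strings):
                if j == i:
                    continue
                # Does s_j contain any l-mer within distance 2d of u?
                if any(hamming(u, s_j[b:b + l]) <= 2 * d
                       for b in range(len(s_j) - l + 1)):
                    support += 1
            if support >= q - 1:
                keep_i.append(a)
        kept.append(keep_i)
    return kept
```

For instance, with the strings `["AAAACCCC", "AAAAGGGG", "TTTTTTTT"]`, ℓ = 4, d = 1 and q = 2, only the first three positions of the first two strings survive and every 4-mer of the third string is discarded.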

Transcription Factor Binding Sites Discovery
In this section we discuss how to use a qPMS Algorithm, e.g., Algorithm qPMS7, to discover transcription factor-binding sites. Given a set of DNA strings that likely contains transcription factor-binding sites, we propose a general framework to find them. The framework consists of two phases. The first phase selects a set of motifs by repeatedly calling the qPMS Algorithm on different values of ℓ, d, and q. The second phase uses a scoring function to eliminate some of the motifs returned in the first phase, and then identifies the transcription factor-binding sites based on the surviving motifs.
In the first phase we employ different values, ranging between ℓ_min and ℓ_max, for the length ℓ of motifs, where ℓ_min and ℓ_max are user-specified parameters. For each value of ℓ, we let d range from 0 to d_max, where d_max is another user-specified parameter, and call the best qPMS algorithm (let it be Algorithm A) to find (ℓ, d, q)-motifs. In this process, if some (ℓ, d, q)-motifs are found, we add them to the set of motifs. The pseudo-code of the first phase follows.
Phase I: selecting candidate motifs
Input: a set of strings
Parameters: ℓ_min, ℓ_max, d_max and q
Output: a set of (ℓ, d, q)-motifs M
  M ← ∅
  for ℓ = ℓ_min to ℓ_max do
    for d = 0 to d_max do
      Run the fastest qPMS Algorithm A to find (ℓ, d, q)-motifs of the input strings
      if algorithm A takes too long then
        Terminate algorithm A
        break the for loop of d
      end if
      Let M(ℓ, d, q) be the set of (ℓ, d, q)-motifs returned by algorithm A
      if M(ℓ, d, q) is NOT empty then
        M ← M ∪ M(ℓ, d, q)
        break the for loop of d
      end if
    end for
  end for
In the second phase, we sort the (ℓ, d, q)-motifs according to their scores and pick the top k motifs, where k is a user-specified parameter. For each picked (ℓ, d, q)-motif M and each input string s_i, transcription binding sites in s_i are identified as follows. We consider every ℓ-mer z in s_i and output the location of z in s_i as a transcription binding site if d_H(M, z) ≤ d. The following pseudo-code describes the second phase.
Phase II: identifying transcription factor binding sites
Input: a set of strings and a set of (ℓ, d, q)-motifs
The accuracy of the framework in discovering transcription factor-binding sites heavily depends on two factors: the qPMS Algorithm and the scoring function. The faster the qPMS Algorithm is, the larger the values of ℓ and d it can handle within the available time, and hence the more accurate the results it provides. Designing fast qPMS algorithms is our main focus because it is a difficult task. On the other hand, the choice of the scoring function is also critical. In general, the scoring function should measure the biological significance of a candidate motif, possibly via a probabilistic model. As a rule of thumb, the smaller the probability that a motif appears by random chance, the more likely it is to be biologically significant. In addition, the impact of the scoring function on the accuracy also depends on the size of the list of candidate motifs M. The larger the size is, the greater the scoring function's impact. For example, the scoring function called "sequence specificity" is commonly used. It is defined to be f(M) = −Σ_{i=1}^{n} log(E(d_H(M, s_i))), where E(d_H(M, s_i)) is the expected number of times a motif appears in string s_i with up to d_H(M, s_i) mismatches [6].
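The two phases of the framework can be sketched as follows. This is a hedged illustration under several assumptions: `solver` is a hypothetical callable standing in for Algorithm A that returns a set of motifs, or `None` when it was terminated for taking too long; scores are supplied by the caller rather than computed; and a higher score is assumed to be better.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for u, v in zip(a, b) if u != v)


def phase_one(strings, l_min, l_max, d_max, q, solver):
    """Phase I: for each l, keep the motifs found at the smallest d that
    yields any. Returns a list of (motif, d) pairs."""
    found = []
    for l in range(l_min, l_max + 1):
        for d in range(d_max + 1):
            motifs = solver(strings, l, d, q)  # hypothetical stand-in for A
            if motifs is None:                 # A took too long: give up on l
                break
            if motifs:
                found.extend((m, d) for m in sorted(motifs))
                break
    return found


def phase_two(strings, scored_motifs, k):
    """Phase II: scored_motifs is a list of (score, motif, d). Report
    (motif, string index, start position) for the k best-scoring motifs."""
    top = sorted(scored_motifs, key=lambda r: r[0], reverse=True)[:k]
    sites = []
    for _, m, d in top:
        l = len(m)
        for i, s in enumerate(strings):
            for a in range(len(s) - l + 1):
                if hamming(m, s[a:a + l]) <= d:
                    sites.append((m, i, a))
    return sites
```

Passing the solver in as a parameter keeps the sketch independent of any particular qPMS algorithm; positions in the reported sites are 0-based.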

Results
In this section we evaluate the performance of Algorithm qPMS7 on simulated as well as real data. With simulated data, we compare its runtime with that of other existing algorithms. With real data, we measure the accuracy of qPMS7 in detecting real motifs. The larger the values of ℓ and d that an algorithm can solve, the more accurate will be the results it yields, because it covers a larger search space of motifs.

Experiments on Simulated Data
We compared the runtime of Algorithm qPMS7 with other well-known algorithms such as Algorithm qPMSPrune of [6], Algorithm PMS6 of [3], Algorithm PMS5 of [4], Algorithm Pampa of [5], Algorithm Voting of [8], and Algorithm RISOTTO of [9]. Recall that among these algorithms, only Algorithm qPMSPrune deals with the qPMS problem. The rest of the algorithms deal with the simpler version, i.e., the PMS problem. The improved Algorithm qPMSPrune in Section 0.3 is named qPMSPruneI. To evaluate the performance of the algorithms, we test them on challenging instances of the problem. All of these algorithms have been run on the same machine running the Windows XP Operating System with a Dual Core Pentium 2.4GHz CPU and 3GB RAM. The experimental results below show that Algorithm qPMS7 is faster than any other algorithm.
0.6.1 DNA sequences. Following [21] and [6], the set of input strings of a challenging instance is typically generated as follows. Each input string is a random DNA string drawn according to the i.i.d. model. A random ℓ-mer M is chosen as an (ℓ, d, q)-motif and mutations of this ℓ-mer M are planted in q out of the n input strings at random positions. The Hamming distance between M and any of these mutations is at most d. The number of input strings n and the length of each of them m are chosen to be 20 and 600, respectively.
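The generation procedure above can be sketched with Python's standard `random` module. The function name `plant_instance` and its parameters are hypothetical, and one detail is an assumption: the number of mutated positions in each planted copy is drawn uniformly between 0 and d, whereas the text only requires it to be at most d.

```python
import random


def plant_instance(n=20, m=600, l=13, d=4, q=20, alphabet="ACGT", seed=0):
    """Generate n random strings of length m and plant mutated copies of a
    random l-mer in q of them. Returns (strings, motif, planted), where
    planted maps a string index to the 0-based start of its planted copy."""
    rng = random.Random(seed)
    strings = [[rng.choice(alphabet) for _ in range(m)] for _ in range(n)]
    motif = "".join(rng.choice(alphabet) for _ in range(l))
    planted = {}
    for i in rng.sample(range(n), q):
        variant = list(motif)
        # Assumption: mutate between 0 and d distinct positions.
        for p in rng.sample(range(l), rng.randint(0, d)):
            variant[p] = rng.choice([c for c in alphabet if c != variant[p]])
        start = rng.randrange(m - l + 1)
        strings[i][start:start + l] = variant
        planted[i] = start
    return ["".join(s) for s in strings], motif, planted
```

Calling `plant_instance()` with the defaults gives an instance with n = 20, m = 600 and (ℓ, d) = (13, 4) in the PMS setting q = n; setting `q=10` gives the corresponding q = n/2 instance.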
In the case of the PMS problem, q = n. The pairs (ℓ, d) corresponding to challenging instances are (13, 4), (15, 5), (17, 6), (19, 7), (21, 8), (23, 9), (25, 10), and so on. To the best of our knowledge, there has not been any algorithm that can solve the challenging instance (25, 10). Therefore, Table 1 reports the runtime of the algorithms on the challenging instances up to (23, 9). Algorithms qPMS7, PMS6 and PMS5 can solve any of these challenging instances. In Table 1, the symbol '-' indicates that the corresponding algorithm either takes too long or uses too much memory on the corresponding challenging instance.
Since none of the exact algorithms reported in the literature deals with the qPMS problem for protein sequences, we restrict our comparison to the algorithms qPMSPrune, qPMSPruneI, and qPMS7. Table 3 and Table 4 show the runtimes of these algorithms for the two cases q = n = 20 and q = n/2 = 10, respectively. As the results show, Algorithm qPMS7 outperforms Algorithms qPMSPruneI and qPMSPrune in all the cases.

Experiments on Real Data
0.7.1 Finding real DNA motifs. We tested Algorithm qPMS7 on the real datasets discussed in [27], which are commonly used to measure the accuracy of the existing algorithms (see e.g., [27], [19], and [7]). Each of the datasets is a collection of orthologous DNA sequences from many organisms. These real datasets are substantially different from the simulated data because they contain known transcription regulatory elements, i.e., known motifs. Algorithm qPMS7 was able to identify these known motifs for appropriate values of the parameters ℓ, d and q = n/2. We report these motifs in Table 5. We should mention that our results are similar to those published in [7], [6] as well as other papers.
0.7.2 Detecting transcription factor-binding sites. We have also tested our algorithms on the biological datasets described in [28]. In this collection there are several datasets. Some strings of each of these datasets contain known transcription factor-binding sites of different lengths and the others do not. Therefore, in order to test these real datasets we rely on the framework for transcription factor-binding sites discovery described in Section 0.5. Recall that this framework needs a qPMS algorithm and a scoring function. Since Algorithm qPMS7 is currently the fastest, we employ it in this framework.
Regarding the scoring function, we use the function called "sequence specificity", which is also the one used in [6]. It is defined to be f(M) = −Σ_{i=1}^{n} log(E(d_H(M, s_i))), where E(d_H(M, s_i)) is the expected number of times a motif appears in string s_i with up to d_H(M, s_i) mismatches, assuming the i.i.d. model. To complete the tests, we need to choose the parameters of the framework ℓ_min, ℓ_max, d_max, q, and k. We set ℓ_min = 10, ℓ_max = 21, d_max = 7, q = n/2, and k = 5. With this setting, we obtain good results like those in [28], [6], and [4], with many transcription factor-binding sites predicted correctly. Table 6 reports some of these correctly predicted binding sites together with the predicted motifs.
[Table 3 caption: the alphabet size |Σ| = 20, n = 20, m = 600, and q = n = 20.]

Discussion
In this paper we have presented Algorithm qPMS7 for the qPMS problem and tested it on DNA as well as protein sequences. Experimental results indicate that Algorithm qPMS7 is faster than other existing algorithms, especially for large values of ℓ and d. Since Algorithm qPMS7 is a search-based algorithm, it uses a small amount of memory. This feature of Algorithm qPMS7 is a major advantage compared to other algorithms such as RISOTTO, Voting, PMS5, and PMS6, which require a large amount of memory when solving instances with large values of ℓ and d. Another advantage of Algorithm qPMS7 is that it solves the qPMS problem, whereas these algorithms only handle the PMS problem.
Algorithm qPMS7 is the result of a combination of an extension of Algorithm qPMSPrune and the core idea of Algorithm PMS5. In Algorithm qPMSPrune, a "pivot" ℓ-mer is used. In Algorithm qPMS7, we extended this idea by considering two pivot ℓ-mers. This idea can be further generalized by considering more than two pivot ℓ-mers.
In this paper we have also proposed a framework for transcription factor-binding sites discovery. Our framework together with Algorithm qPMS7 is currently deployed in our online tools at http://pms.engr.uconn.edu or at http://motifsearch.com. We welcome comments and feedback from users.