CLIMP: Clustering Motifs via Maximal Cliques with Parallel Computing Design

A set of conserved binding sites recognized by a transcription factor is called a motif. Motifs can be found by many comparative genomics applications that identify over-represented segments. When numerous putative motifs are predicted from a collection of genome-wide data, their similarities can be represented as a large graph in which the motifs are connected to one another. An efficient clustering algorithm is therefore needed to group the motifs that belong together, separate the motifs that belong to different groups, and remove a number of spurious ones. In this work, a new motif clustering algorithm, CLIMP, is proposed; it is based on maximal cliques and is sped up by parallelizing its program. CLIMP is compared with two other high-performance algorithms on a synthetic motif dataset from the JASPAR database, a set of putative motifs from a phylogenetic footprinting dataset, and a set of putative motifs from a ChIP dataset. The results demonstrate that CLIMP mostly outperforms the two algorithms on the three datasets for motif clustering, so that it can be a useful complement to the clustering procedures in some genome-wide motif prediction pipelines. CLIMP is available at http://sqzhang.cn/climp.html.


Introduction
The rapid development of new technologies has reduced the cost of genome sequencing, and as a result, thousands of genomes are being sequenced [1,2]. Numerous comparative genomics-based algorithms have been developed to decipher the biological functions of the sequenced genomes; this is possible because these functions are encoded and relatively conserved in a group of closely related genomes. Transcription regulation is usually triggered by the binding of proteins called transcription factors (TFs) to specific DNA segments known as TF binding sites (TFBSs). These TFBSs are for the most part predicted by comparing multiple non-coding sequences that potentially contain them. A set of TFBSs recognized by the same TF is called a motif, which summarizes the commonalities among the binding sites of a TF [3].
A survey on graph clustering by Schaeffer [33] classified graph clustering methods into three categories: density-based methods (e.g., k-means and k-center [34]), cut-based methods (e.g., hierarchical clustering [35] and spectral clustering [36]), and random-walk methods (e.g., the Markov cluster algorithm, MCL). The main aims of density-based and cut-based methods are to find maximum cliques and sparsest cuts, respectively, both of which are known to be NP-hard [37,38]. Methods such as k-means, k-center, and expectation maximization (EM) clustering [39] keep a relatively small set of estimated cluster centers at each step. Affinity propagation (AP) improves on these algorithms by simultaneously considering all data points as candidate centers and then gradually merging them to identify clusters. Hierarchical clustering and spectral clustering, in contrast, address a different type of clustering problem, in which pairs of data points are recursively compared to partition the data.
However, hierarchical clustering and spectral clustering methods are not well suited to grouping motifs, because two motifs that should not be clustered together may in fact end up in the same cluster through a series of pairwise groupings [32]. When applied to motif similarity graphs or protein interaction graphs, these methods also have difficulty handling a significant level of background noise (e.g., high similarity scores among spurious motifs). Therefore, in this paper, AP and MCL, which to our knowledge are the best of the density-based and random-walk methods, respectively, are selected for comparison.
Because genome-wide prediction produces similarity data for a very large number of motifs, and because background noise must also be handled, a clustering algorithm is required that can cope with large-scale data and produce more accurate results than existing methods within a short time. Therefore, in this paper, we propose a new clustering algorithm called CLIMP (Cluster Cliques of Motifs in Parallels with openMP) and demonstrate that it mostly outperforms two outstanding clustering algorithms, MCL and AP, for large-scale motif similarity graph clustering.

Motivation and basic idea
The binding sites belonging to a TF may be identified by one or more motif-finding tools in one or more datasets. Motif-finding tools are usually designed to find well-conserved sites in the upstream sequence set of either a group of co-regulated genes in a genome or a group of orthologous genes in a set of closely related genomes (i.e., the phylogenetic footprinting technique), or in a ChIP dataset of a TF. The binding sites of a TF are often degenerate in a genome and divergent across related genomes [26,40,41]. Because of this degeneracy and diversity, multiple distinct sub-motifs of the TF may be found, either by one motif-finding tool that outputs multiple top results or by multiple motif-finding tools. For example, the experimentally verified motif of the TF CRP in E. coli K12 can be classified into at least three well-conserved sub-motifs: a canonical palindromic sub-motif, an A-rich sub-motif, and a T-rich sub-motif, although the latter two share a certain number of elements with the canonical one [28]. A report showed that roughly half of 104 distinct mouse DNA-binding proteins each recognized multiple distinctly different sequence motifs when the binding sites were examined in the mouse ChIP-chip datasets [41]. Furthermore, the binding sites of some TFs were reported to be divergent across three different yeast species (S. cerevisiae, S. mikatae, and S. bayanus) [40]. For example, of the 221 and 255 sites bound in total by the two TFs Ste12 and Tec1, respectively, only 47 (Ste12, 21%) and 50 (Tec1, 20%) were conserved across all three yeast species [40]. The remaining sites of Ste12 and Tec1 were thus conserved across at most two yeast species. Suppose that the entire motif $M$ of a TF can be divided into $k$ well-conserved sub-motifs $\{M_1, M_2, \ldots, M_k\}$.
If each well-conserved sub-motif $M_i$ is partially or fully predicted many times by one or more motif-finding tools from multiple sequence sets, a set $P(M_i)$ of predicted motifs corresponding to $M_i$ will be found. If each predicted motif of a sub-motif is treated as a node of a graph, and two predicted motifs are connected by an edge when their similarity is above a cutoff value, the predicted motifs in $P(M_i)$ are likely to form a clique (i.e., a complete subgraph). Therefore, a sub-motif can be modelled as a clique of its predicted motifs, and a motif composed of $k$ sub-motifs can be modelled as the merger of $k$ cliques.
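The graph model described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: `build_similarity_graph`, `is_clique`, and the toy scores are hypothetical names and values introduced here, and `toy_similarity` stands in for a real motif similarity metric, which is not implemented.

```python
from itertools import combinations


def build_similarity_graph(motifs, similarity, cutoff):
    """Return an adjacency map {motif: {neighbor: weight}} for pairs scoring above cutoff."""
    graph = {m: {} for m in motifs}
    for a, b in combinations(motifs, 2):
        score = similarity(a, b)
        if score > cutoff:  # an edge exists only when the similarity is above the cutoff
            graph[a][b] = score
            graph[b][a] = score
    return graph


def is_clique(graph, nodes):
    """True if every pair of the given nodes is joined by an edge."""
    return all(b in graph[a] for a, b in combinations(nodes, 2))


# Toy scores standing in for a real motif similarity metric (hypothetical values).
toy_scores = {
    frozenset(("m1", "m2")): 0.9,
    frozenset(("m1", "m3")): 0.8,
    frozenset(("m2", "m3")): 0.7,
    frozenset(("m3", "m4")): 0.3,
}


def toy_similarity(a, b):
    return toy_scores.get(frozenset((a, b)), 0.0)


graph = build_similarity_graph(["m1", "m2", "m3", "m4"], toy_similarity, cutoff=0.6)
```

In this toy example the three mutually similar predictions m1, m2, and m3 form a clique, while m4 is left isolated because its only score falls below the cutoff.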

The CLIMP algorithm with parallel computing design
For a set of binding site motifs with their corresponding position frequency matrices and position weight matrices, a motif similarity graph is constructed by using the SPIC metric to compute the similarity score between each pair of motifs. In the graph, each node represents a motif, and two nodes are connected by an edge if and only if the similarity score between the corresponding two motifs is greater than a preset threshold; the weight of the edge is that similarity score. Binding site motifs that belong to the same TF are more likely to form highly connected sub-graphs with high edge weights in the motif similarity graph than are motifs from different TFs or spurious motifs. However, because the binding sites of the same motif are degenerate, the similarity between two subsets of a motif (called sub-motifs here) may not be significantly high. For this reason, motifs that are very similar to each other are first grouped together to generate a set of clusters, and each of the remaining motifs is then assigned to a cluster if it is similar to a large proportion of the motifs in that cluster. Given a motif similarity graph G = (V, E), where V is the set of nodes and E is the set of edges, the algorithm proceeds in four steps as follows.
Step 1: For each node, find a maximal clique associated with it. For a node, a "greedy" strategy is used to find a maximal clique associated with the node. The clique can be regarded as the cohesion of the node.
The problems of enumerating all the maximal cliques and finding the maximum clique in a graph are NP-hard [38]. Here, only |V| maximal cliques are sought in the graph G = (V, E), where |V| is the number of nodes in G, rather than all the maximal cliques or the maximum clique. That is, for each given node v, the aim is to find a maximal clique whose nodes have the closest relationship to v. For this purpose, the neighborhood sub-graph N(v) of a node v is defined as the sub-graph induced by v and its neighbor nodes. By construction, all the maximal and maximum cliques containing the node v lie in N(v).
Here, for each node v, the neighborhood sub-graph N(v) is extracted from the graph G, and a greedy strategy for finding a maximal clique $C_v$ in N(v) is designed as follows:
2. For each node $u_i$ in the array U, $u_i$ and its incident edges are sequentially deleted until the degrees of v and the remaining nodes are identical in $C_v$. The deleted nodes are labelled as an array $\{u_1, \ldots, u_{k-1}, u_k\}$. Go to step 3.
3. For each node $u_j$ from $u_{k-1}$ to $u_1$ in reverse order, if each of the nodes in $C_v$ is joined with $u_j$ by an edge in N(v), update $C_v$ by adding $u_j$ and its incident edges $\{(u_j, u) : \ldots\}$.
The finally obtained $C_v$ is called the clique associated with node v. An example of finding a maximal clique associated with node v in sub-graph N(v) is illustrated in Fig 1. Note that the clique obtained in this step may not be the maximum clique associated with v in N(v). Even so, it is preferable to the maximum clique for this purpose, because its nodes have a closer relationship (higher similarity) to v than do the nodes in the maximum one.
For each node v, the procedure for finding its associated clique is identical, and the time complexity is $O(d_v^2)$, where $d_v$ is the degree of v. Fortunately, the motif similarity graph is generally sparse, and the degree $d_v$ is usually small. In this step, the "for" loop over the nodes can be easily parallelized. For example, if OpenMP (http://openmp.org) is used in the program, the directive "#pragma omp parallel for" is simply placed before the "for" loop. If k processes are invoked simultaneously, the time complexity is reduced to $O(|V| \cdot \max_{v \in V}\{d_v^2\}/k)$, where |V| is the number of nodes in graph G = (V, E).
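The idea of Step 1, one maximal clique per node with the per-node searches run independently, can be sketched in Python as follows. This is a simplified greedy variant written for illustration, not the deletion and re-insertion procedure itself: it grows the clique around v by trying the neighbors of v in descending order of edge weight. The names `greedy_clique` and `cliques_for_all_nodes` and the toy graph are assumptions. The thread pool only mirrors the structure of a parallel "for" loop; in CPython it does not give the CPU speed-up that OpenMP threads give a C++ program.

```python
from concurrent.futures import ThreadPoolExecutor


def greedy_clique(graph, v):
    """Greedily grow a maximal clique around v, trying the most similar neighbors first."""
    clique = [v]
    for u in sorted(graph[v], key=lambda n: (-graph[v][n], n)):
        # u joins only if it is adjacent to every node already in the clique
        if all(u in graph[w] for w in clique):
            clique.append(u)
    return frozenset(clique)


def cliques_for_all_nodes(graph, workers=4):
    """Run one independent clique search per node, like a parallel for loop over V."""
    nodes = list(graph)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(nodes, pool.map(lambda v: greedy_clique(graph, v), nodes)))


# Toy weighted graph: a, b, c form a triangle; d is attached to c by a heavy edge.
graph = {
    "a": {"b": 0.9, "c": 0.8},
    "b": {"a": 0.9, "c": 0.7},
    "c": {"a": 0.8, "b": 0.7, "d": 0.95},
    "d": {"c": 0.95},
}
cliques = cliques_for_all_nodes(graph)
```

For node c the sketch returns the two-node clique {c, d} rather than the larger triangle {a, b, c}, because d is the neighbor most similar to c. This shows how a clique associated with a node can be maximal without being maximum.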
Step 2: Merge cliques into clusters. By the law of gravity, a third body that is equally distant from two bodies is attracted more strongly by the heavier one. Similarly, a third clique that overlaps two cliques to the same extent has a greater affinity with the bigger one. A large clique is therefore more likely to be the core of a cluster than a small one; in other words, the smaller a clique is, the more likely it is that its nodes do not belong to the same cluster. Therefore, in this clique merging step, both the sizes of cliques and the extent of overlap among cliques are considered. Initially, all redundant cliques are merged to form unique ones, and then all unique cliques are sorted in descending order by the sum of edge weights to generate a ranked queue $\{C_1, C_2, \ldots, C_n\}$. Subsequently, as shown in Fig 2, the procedure begins with the largest clique, and each current unassigned clique in turn is set as an initial cluster. Then, the following unassigned cliques are successively integrated into the cluster if they satisfy the merging rule described next.
For two cliques $C_i$ and $C_j$ ($i < j$), the specific rule for merging $C_j$ into $C_i$'s cluster is as follows. If the (overlap) ratio of the nodes in $C_j$ appearing in $C_i$ is no less than α (i.e., $|C_i \cap C_j|/|C_j| \ge \alpha$) and, in the graph G, the ratio of nodes of $C_j$ having adjacent nodes in $C_i$ is no less than β, $C_j$ is merged into $C_i$'s cluster; otherwise, nothing is done. Such a process is denoted $(C_i \leftarrow C_j)$ for a pair of cliques $C_i$ and $C_j$ ($i < j$). The parameters α and β ($\alpha \le \beta$) can be set by users.
This step can also be parallelized by using a pipeline design. For each clique $C_i$ from i = 1 to n−1, a different processor can be called to run the merging processes $(C_i \leftarrow C_j)$ from j = i+1 to n. As shown in Fig 3, after running a process $(C_i \leftarrow C_j)$ in processor $P_i$, if $C_j$ cannot be merged into $C_i$, the process $(C_{i+1} \leftarrow C_j)$ in processor $P_{i+1}$ and the process $(C_i \leftarrow C_{j+1})$ in processor $P_i$ are run simultaneously. Clearly, if only one processor is called, the total number of processes is no more than $1 + 2 + \cdots + (n-1) = (n-1)n/2$, but if n processors are called simultaneously, the asynchronous processes number at most $(n-1) + (n-2) = 2n-3$, as illustrated in Fig 3.
Step 3: Delete redundant nodes. In the reduced sub-graph of a cluster, the sum of the weights of the edges incident to each node within the sub-graph is first calculated. Because the clusters do not interact with one another when these edge-weight sums are calculated, this step can be parallelized by handling the clusters separately in different processors. After that, for each node that appears in more than one cluster, only the copy with the maximum edge-weight sum is kept, and the redundant copies are deleted from the other clusters.
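The clique-merging rule of Step 2 can be sketched in Python as a single-processor loop. This is a sketch under assumptions, not the pipelined implementation: the names `should_merge` and `merge_cliques` are hypothetical, the toy graph is unweighted, and each candidate clique is compared against the seed clique of the current cluster rather than against the growing cluster, which is one possible reading of the rule.

```python
def should_merge(graph, c_i, c_j, alpha=0.5, beta=0.5):
    """Return True if clique c_j passes both ratio tests against clique c_i."""
    # ratio of nodes of c_j that also appear in c_i
    overlap = len(c_i & c_j) / len(c_j)
    # ratio of nodes of c_j that have at least one adjacent node in c_i
    touching = sum(1 for u in c_j if any(w in graph[u] for w in c_i))
    return overlap >= alpha and touching / len(c_j) >= beta


def merge_cliques(graph, ranked_cliques, alpha=0.5, beta=0.5):
    """Merge a ranked list of cliques into clusters, one seed clique at a time."""
    merged = [False] * len(ranked_cliques)
    clusters = []
    for i, c_i in enumerate(ranked_cliques):
        if merged[i]:
            continue  # this clique was already absorbed by an earlier cluster
        cluster = set(c_i)
        for j in range(i + 1, len(ranked_cliques)):
            if not merged[j] and should_merge(graph, c_i, ranked_cliques[j], alpha, beta):
                cluster |= ranked_cliques[j]
                merged[j] = True
        clusters.append(cluster)
    return clusters


# Toy graph: triangle a-b-c, edge c-d, and a separate edge e-f.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c"}, "e": {"f"}, "f": {"e"},
}
ranked = [frozenset("abc"), frozenset("cd"), frozenset("ef")]
clusters = merge_cliques(graph, ranked)
```

With the default (α, β) = (0.5, 0.5), the clique {c, d} shares half of its nodes with {a, b, c} and is merged into its cluster, whereas {e, f} has no overlap and starts a cluster of its own.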
Step 4: Sort clusters. All clusters are sorted in descending order of edge-weight sum to obtain the final set of ranked clusters. Note that the calculation of each cluster's edge-weight sum can also be parallelized.
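Steps 3 and 4 can be sketched together in Python. The function names below are hypothetical, and two details are assumptions made for the sketch: ties between clusters are resolved in favor of the earlier cluster, and a cluster's edge-weight sum counts each internal edge once.

```python
def edge_weight_sum(graph, cluster, node):
    """Sum of the weights of the edges joining node to other members of cluster."""
    return sum(w for nbr, w in graph[node].items() if nbr in cluster)


def remove_redundant_nodes(graph, clusters):
    """Keep each node only in the cluster where its edge-weight sum is largest."""
    best = {}
    for idx, cluster in enumerate(clusters):
        for node in cluster:
            score = edge_weight_sum(graph, cluster, node)
            if node not in best or score > best[node][0]:
                best[node] = (score, idx)
    return [{n for n in cluster if best[n][1] == idx} for idx, cluster in enumerate(clusters)]


def cluster_weight(graph, cluster):
    """Total weight of the edges inside a cluster, each edge counted once."""
    return sum(edge_weight_sum(graph, cluster, n) for n in cluster) / 2


def rank_clusters(graph, clusters):
    """Sort clusters in descending order of edge-weight sum."""
    return sorted(clusters, key=lambda c: cluster_weight(graph, c), reverse=True)


# Toy weighted graph in which node c initially appears in two clusters.
graph = {
    "a": {"b": 0.9, "c": 0.8},
    "b": {"a": 0.9, "c": 0.7},
    "c": {"a": 0.8, "b": 0.7, "d": 0.65},
    "d": {"c": 0.65, "e": 0.9},
    "e": {"d": 0.9},
}
clusters = [{"c", "d", "e"}, {"a", "b", "c"}]
deduped = remove_redundant_nodes(graph, clusters)
ranked = rank_clusters(graph, deduped)
```

Node c is kept in {a, b, c}, where its edge-weight sum is larger, and removed from the other cluster; the clusters are then ranked by their total internal edge weight.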
Among the four steps of the CLIMP algorithm, the largest computation is finding the maximal cliques in Step 1. Since motif similarity graphs are generally sparse and there is no vector-vector or matrix-matrix multiplication in clique finding, an adjacency list rather than an adjacency matrix is used to store the graph in order to reduce the storage required for G. In Step 1, for each node v, only the list of its neighbors needs to be reported, which takes $O(d_v)$ time, and the neighbors are represented as an array sorted by edge weight. Finally, the pseudo-code of the parallel clustering algorithm is shown in Table 1.

Performance assessment
Clearly, an ideal motif clustering algorithm groups relevant motifs into the same cluster and separates irrelevant motifs into different clusters. In a perfect motif clustering result, each cluster should contain exactly one motif, and each motif should be located in exactly one cluster. For m obtained clusters and n given motifs, the ability of a clustering algorithm to recover motifs from a motif similarity graph is evaluated using the Adjusted Rand Index (ARI) [42] derived from a contingency table $(n_{ij})_{n \times m}$, where each $n_{ij}$ represents the number of objects that are in both motif i and cluster j. Let N be the number of all objects. Let $n_{i\bullet}$ and $n_{\bullet j}$ be the number of objects in motif i and cluster j, respectively. The formula of the Adjusted Rand Index, in its standard form, is:

$$\mathrm{ARI} = \frac{\sum_{i,j}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{n_{i\bullet}}{2}\sum_{j}\binom{n_{\bullet j}}{2}\right]\Big/\binom{N}{2}}{\frac{1}{2}\left[\sum_{i}\binom{n_{i\bullet}}{2} + \sum_{j}\binom{n_{\bullet j}}{2}\right] - \left[\sum_{i}\binom{n_{i\bullet}}{2}\sum_{j}\binom{n_{\bullet j}}{2}\right]\Big/\binom{N}{2}}$$

Table 1 (fragment). Pseudo-code of the parallel clustering algorithm: clique-merging step.
If $C_i$ is not labeled "merged", in the thread $P_i$:
    If $C_j$ is not labeled "merged":
        If $|C_i \cap C_j|/|C_j| \ge \alpha$ and the ratio of nodes of $C_j$ having adjacent nodes in $C_i$ is no less than β:
            Merge $C_j$ into $C_i$ and label $C_j$ "merged" (i.e., the process $C_i \leftarrow C_j$);
        Else if $C_{i+1}$ is not labeled "merged":
            In the thread $P_{i+1}$, do the process $C_{i+1} \leftarrow C_j$;

Programs and parameter optimization

Parameter selection and performance on motif retrievals
Currently, all motif-finding tools are limited in that they can find only part of the binding sites of a TF; consequently, the predicted sites of a TF always appear as subsets of its full set of binding sites. Furthermore, any two subsets (sub-motifs) of binding sites recognized by the same TF are generally highly conserved. Therefore, a perfect motif clustering algorithm should ensure that each cluster contains the binding site motifs of exactly one TF and that each TF's motifs are located in exactly one cluster. Accordingly, if the binding sites of a TF are split at random into a series of subsets (sub-motifs), a clustering algorithm can be tested on whether it clusters these sub-motifs together again. To estimate the parameters of CLIMP and evaluate its performance on grouping sub-motifs of the same motif together and separating sub-motifs of different motifs, all non-redundant transcription factor binding sites (TFBSs), which belong to 593 motif profiles, were first downloaded from the JASPAR core database Version 5.0 (http://jaspar.genereg.net/html/DOWNLOAD/sites.tar.gz), which is a collection of experimentally defined TFBSs for eukaryotes [43]; these motifs were then used to generate numerous sub-motifs. We used the method described in the paper evaluating the SPIC metric [22] to produce artificial sub-motifs. Each motif consisting of n TFBSs is randomly divided into two subsets (sub-motifs) of sizes k and n−k, respectively, for each $k = 1, 2, \ldots, \lfloor n/2 \rfloor$. Therefore, $2 \times \lfloor n/2 \rfloor$ subsets (sub-motifs) are generated for a motif of n TFBSs. A motif with a greater number of binding sites thus yields a greater number of sub-motifs.
In addition, since all motif-finding tools are designed to find overrepresented segments as predicted binding sites in a set of DNA sequences, an "overrepresented" motif (i.e., a motif that has more binding sites) is more easily distinguished by motif-finding tools than a "non-overrepresented" motif (i.e., a motif that has fewer binding sites). It is therefore reasonable that a motif of n binding sites is divided into about n sub-motifs. For the 593 motif profiles, 30,000 sub-motifs are finally obtained. For these sub-motifs, the SPIC metric is employed to calculate the similarity between each pair. In addition, two distributions (Fig 4) are plotted to determine whether the similarity graph has clustering properties. In Fig 4, the curve labeled "all pairs" is the distribution of the similarity scores between each pair of the 30,000 sub-motifs, and the curve labeled "inner pairs" is the distribution of the similarity scores between each pair of sub-motifs within the same profile. Clearly, the two curves have a small overlapping area. Based on Fig 4, a similarity score cutoff can be chosen such that as many nodes as possible that represent sub-motifs of the same motif profile are connected, while as many nodes as possible that represent sub-motifs of different motif profiles are disconnected. As shown in Fig 4, if the similarity score cutoff γ is greater than 0.4, the SPIC-constructed similarity graph is sparse, whereas the relevant sub-motifs are still likely to be connected. For example, even with γ = 0.6, 83% of the sampled sub-motifs of a motif profile had an "inner pair" similarity score greater than 0.6, and the graph constructed with 0.6 as the cutoff contained only 1.3% of the "all pairs" possible edges of the motif similarity graph.
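The sub-motif generation described above can be sketched in Python. This is an illustration only: `split_into_submotifs` is a hypothetical name, the toy site strings are made up rather than taken from JASPAR, and drawing a fresh random split for every k is an assumption of the sketch.

```python
import random


def split_into_submotifs(sites, rng):
    """For k = 1..floor(n/2), randomly split the sites into sub-motifs of sizes k and n-k."""
    n = len(sites)
    submotifs = []
    for k in range(1, n // 2 + 1):
        shuffled = list(sites)
        rng.shuffle(shuffled)  # a new random division for each k
        submotifs.append(shuffled[:k])
        submotifs.append(shuffled[k:])
    return submotifs


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sites = ["TGTGA", "TGTGC", "AGTGA", "TGAGA", "TGTTA"]  # toy binding sites, not JASPAR data
submotifs = split_into_submotifs(sites, rng)
```

For this toy motif of n = 5 sites, the sketch yields 2 × ⌊5/2⌋ = 4 sub-motifs with sizes 1, 4, 2, and 3, and each pair of sub-motifs together covers all the sites of the motif.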
After the construction of a motif similarity graph, the clusters produced by each of the three clustering tools on sub-motif similarity graphs with different cutoff settings are evaluated. Based on the observation of Fig 5, the optimal similarity score cutoff is no lower than 0.4. From γ = 0.4 to 0.7 in steps of 0.05, sub-motif similarity graphs are successively constructed by keeping all edges with weights of no less than γ. Each of the three tools is run on each graph with its optimal parameters to obtain a set of clusters, and the corresponding adjusted Rand index values are calculated. As shown in Fig 5, each of the three clustering algorithms achieves its highest ARI value at the 0.6 cutoff, and, notably, CLIMP outperforms both MCL and AP on these graphs at the different cutoffs in clustering sub-motifs of the same motif and separating sub-motifs belonging to different motifs.
When the similarity score cutoff is 0.6, MCL, AP, and CLIMP are separately used to cluster the similarity graph with the parameters that maximize their adjusted Rand indices (i.e., the Inflation parameter value of MCL is 2.6, the Reference parameter value of AP is 0.55, and (α, β) = (0.5, 0.5) for CLIMP). As a result, 1647, 1423, and 1569 clusters are output by CLIMP, MCL, and AP, respectively. Clearly, a perfect clustering solution should result in one cluster corresponding to one motif. To evaluate the correspondence between the motif profiles and the clusters obtained by each tool, the number of motif profiles recovered by each cluster was first counted. The majority of the clusters corresponded to exactly one motif profile: for CLIMP, 62% of the clusters each contain only one motif profile, while the percentage is 56% for MCL's clusters and 51% for AP's clusters. Conversely, the number of clusters in which each motif profile's sub-motifs are located was also counted. Close to half of the 593 known motif profiles were each grouped into a single cluster: for CLIMP, 45% of the motifs were located in exactly one cluster, while the corresponding percentages are 47% and 48% for MCL and AP, respectively.
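The adjusted Rand index used in this comparison can be computed from a motif-by-cluster contingency table. The Python sketch below uses the standard pair-counting form of the index; the function name `adjusted_rand_index` and the toy tables are assumptions, and the input must contain at least two objects.

```python
from math import comb


def adjusted_rand_index(table):
    """ARI from a contingency table given as rows (motifs) by columns (clusters)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    sum_cells = sum(comb(n, 2) for row in table for n in row)
    sum_rows = sum(comb(n, 2) for n in row_sums)
    sum_cols = sum(comb(n, 2) for n in col_sums)
    expected = sum_rows * sum_cols / comb(total, 2)  # index expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single motif and a single cluster
    return (sum_cells - expected) / (max_index - expected)


perfect = adjusted_rand_index([[3, 0], [0, 3]])  # each motif in its own cluster
mixed = adjusted_rand_index([[2, 1], [1, 2]])    # motifs scattered over both clusters
```

A clustering that reproduces the motifs exactly scores 1, while a clustering that agrees with the motifs no better than chance scores about 0 or below.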

Performance on identifying true motifs from putative motifs
A genome-wide phylogenetic footprinting dataset of yeast was downloaded from MotifClick's website (http://motifclick.uncc.edu/yeast_intergenic_seq_sets.tar.gz) [44]. The dataset is composed of 5,137 intergenic sequence sets of orthologous genes from the target genome Saccharomyces (S.) cerevisiae and 6 reference genomes (S. castellii, S. bayanus, S. kluyveri, S. mikatae, S. kudriavzevii, and S. paradoxus). More specifically, orthologous genes between two genomes were predicted by the bidirectional best hits (BDBH) method using BLASTP with an E-value cutoff of $10^{-20}$ for both searches. Then, for each group of orthologous genes in the seven genomes, up to 1,000 bases of the upstream intergenic region of each gene were extracted to form an orthologous sequence set. Finally, 5,137 orthologous sequence sets, each containing at least three sequences, were obtained. As described in the MotifClick paper [44], the motif length is set to eight bases, and three high-performing motif-finding tools, MotifClick [44], MEME [45], and BioProspector [46], are separately run, in the 'anr' mode where available, to output the top 10 motifs on each of the 5,137 sequence sets. As a result, approximately 150,000 putative motifs are obtained, which contain 122 known TF motifs of S. cerevisiae found in both the YEASTRACT database (http://www.yeastract.com/download/TFConsensusList_20130918.Transfac.gz) [47] and the Saccharomyces Genome Database (SGD) (http://downloads.yeastgenome.org/published_datasets/MacIsaac_2006_PMID_16522208/) [48]. Because a TF can regulate multiple genes and a real motif is more likely than a spurious one to be predicted by multiple motif-finding tools, the predictions of a real motif belonging to the same TF are expected to gather in a set of similar putative motifs. MCL, AP, and CLIMP are then used to cluster these putative motifs to see whether the clusters that contain a majority of the 122 known motifs rank high on the cluster list.
First, the SPIC metric is used to compute the similarity between each pair of these putative motifs, and the cutoff is set to 0.6 based on the analysis in the first experiment, which yields a motif similarity graph with 145,581 nodes and 34,413,340 edges. The three clustering algorithms, with their optimal parameters from the first experiment (i.e., the Inflation parameter value of MCL is 2.6, the Reference parameter value of AP is 0.55, and (α, β) = (0.5, 0.5) for CLIMP), are successively run on the resulting motif similarity graph. We say that a putative motif recovers a true motif if the sites of the target genome in the putative motif are binding sites of the true motif. As shown in Fig 6(A), the top 130 clusters of CLIMP recover 104 (85.2%) of the 122 motifs, whereas the top 130 clusters of MCL and AP recover only 92 (75.4%) and 90 (73.8%) of the 122 motifs, respectively. After the 130th cluster, the motif recovery rate of CLIMP's clusters increases more gradually than do MCL's and AP's. In other words, compared with MCL's and AP's clusters, the motif recovery of CLIMP's clusters is closer to saturation after the 130th cluster, and the CLIMP clusters that contain known motifs rank higher within the top 130 clusters than do MCL's and AP's. Clearly, the top-ranked clusters contain more known motifs than the low-ranked ones. Note that clusters that do not contain any known motif might represent novel motifs. In particular, CLIMP's clusters essentially reach saturation at the 200th cluster, which is consistent with the number of transcription-related proteins in the DBD database [49] and in two references [50,51]. Furthermore, Fig 6(B) shows that CLIMP's clusters contain fewer cumulative putative motifs than AP's and MCL's; therefore, CLIMP can filter out more spurious motifs than the other clustering algorithms.
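The cumulative recovery curve discussed here (known motifs recovered by the top-ranked clusters) can be computed along the following lines. This is a Python sketch with hypothetical names: `cumulative_recovery` and the mapping `known_of` are assumptions, and the toy labels are illustrative rather than taken from the yeast dataset.

```python
def cumulative_recovery(ranked_clusters, known_of):
    """Number of distinct known motifs recovered by the top-1, top-2, ... clusters."""
    seen = set()
    curve = []
    for cluster in ranked_clusters:
        for putative in cluster:
            # known_of maps a putative motif to the known motifs it recovers, if any
            seen.update(known_of.get(putative, ()))
        curve.append(len(seen))
    return curve


# Toy input: p1..p6 are putative motifs grouped into four ranked clusters.
ranked_clusters = [["p1", "p2"], ["p3"], ["p4", "p5"], ["p6"]]
known_of = {"p1": ["GCN4"], "p2": ["GCN4"], "p4": ["STE12"], "p5": ["TEC1"]}
curve = cumulative_recovery(ranked_clusters, known_of)
recovery_rate = curve[-1] / 3  # fraction of the three toy known motifs recovered
```

Dividing the value of the curve at a given rank by the number of known motifs gives a recovery percentage of the kind quoted above, for example 104/122 for the top 130 clusters.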

Performance on clustering motifs for ChIP datasets
In DePCRM [13], a tool for de novo prediction of cis-regulatory elements (CREs) and modules from ChIP datasets in a eukaryote, 168 ChIP datasets of 56 TFs from Drosophila melanogaster were collected from the Berkeley Drosophila Transcription Network Project (BDTNP) [52], the modENCODE project [53], and the literature. The majority of the binding peaks in these datasets have a length of around 1,000 bp. DREME [54] was selected in DePCRM to identify all possible motifs in the binding peaks of each ChIP dataset. In total, 17,890 putative motifs containing 35,359,819 putative CREs were identified in 162 of the 168 ChIP datasets (6 low-quality datasets were removed). Clearly, the vast majority of the putative motifs found in the datasets are spurious predictions. The TOMTOM motif comparison tool (http://mccb.umassmed.edu/meme/cgi-bin/tomtom.cgi) was used to compare the putative motifs with the known motifs of D. melanogaster in the Redfly v3.0 [55], FlyFactorSurvey [56], and FlyReg [57] databases. We consider each of the 17,890 putative motifs likely to be a true motif if it is highly similar to a known motif of D. melanogaster at p<0.001. The TOMTOM comparisons showed that the 17,890 putative motifs cover 144 known true motifs of D. melanogaster at p<0.001.
As in the yeast experiment, MCL, AP, and CLIMP are used to cluster the 17,890 putative motifs to see whether the clusters that hit known true motifs rank high on the cluster list. First, we construct a motif similarity graph using the putative motifs as nodes and linking any two motifs by an edge if their SPIC metric score is no less than a preset cutoff γ. Based on the analysis in the first experiment, the motif similarity score cutoff γ is set to 0.6, and the three clustering algorithms, with the same parameters as in the first experiment (i.e., the Inflation parameter value of MCL is 2.6, the Reference parameter value of AP is 0.55, and (α, β) = (0.5, 0.5) for CLIMP), are successively run on the resulting motif similarity graph. As shown in Fig 7(A), in most cases (up to the top 160 ranked clusters), CLIMP cumulatively recovers more known motifs than AP and MCL. Furthermore, as shown in Fig 7(B), CLIMP's ranked clusters contain fewer cumulative putative motifs than AP's and MCL's. Consequently, CLIMP can filter out more spurious motifs than the other two clustering algorithms.

Computational speeds
Without parallel computing, MCL is the fastest program among the three evaluated clustering algorithms. For sparse or small graphs, the running times of the three algorithms are acceptable. Since no parallel version of AP is available, the computational speed of CLIMP is compared with that of MCL on a workstation with Intel Xeon E5 CPUs. When CLIMP and MCL were run on the aforementioned graph from the yeast phylogenetic footprinting dataset, with 145,581 nodes and 34,413,340 edges (similarity score cutoff 0.6), MCL required three hours of wall-clock time with one thread; in contrast, CLIMP required twelve hours of wall-clock time with one thread, and its running time was reduced to about three hours when ten processes were called. Therefore, it is necessary to speed up CLIMP by parallelizing its program.
For further comparison, 2,000 nodes are randomly selected from the 145,581 nodes (motifs) of the yeast dataset, and similarity score cutoffs from 0.10 to 0.95 in steps of 0.05 are applied, so that a series of motif similarity graphs with different graph densities is constructed (the density of a graph is defined as the number of edges divided by the number of nodes). Both CLIMP and MCL are run with a single process and with four processes on these graphs. The running times are plotted in Fig 8, which shows that CLIMP's running time is acceptable if enough processes (threads) are called; however, in most cases, CLIMP is slower than MCL because CLIMP is a heuristic enumeration algorithm with time complexity $O(|V| \cdot \max_{v \in V}\{d_v^2\})$, while MCL is a stochastic flow simulation algorithm with time complexity $O(|V|^2)$. If a graph is very sparse, CLIMP runs faster than MCL. If the graph is dense, MCL runs faster than CLIMP, although the computational speed of CLIMP can be improved by using more computer nodes.
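The graph density defined above and the two complexity expressions can be evaluated on a graph with a short Python sketch. The names `graph_stats` and `operation_bounds` are hypothetical, and the returned numbers are only the asymptotic expressions with constants ignored, so they indicate how the costs scale rather than predict running times.

```python
def graph_stats(graph):
    """Return (density, max_degree), with density defined as edges divided by nodes."""
    n_nodes = len(graph)
    n_edges = sum(len(nbrs) for nbrs in graph.values()) // 2  # each edge is stored twice
    max_degree = max((len(nbrs) for nbrs in graph.values()), default=0)
    return n_edges / n_nodes, max_degree


def operation_bounds(graph, processes=1):
    """Return (|V| * max d_v^2 / k, |V|^2), the expressions quoted for CLIMP and MCL."""
    _, max_degree = graph_stats(graph)
    n = len(graph)
    return n * max_degree ** 2 / processes, n ** 2


# Toy graph: a path a-b-c-d.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
density, max_degree = graph_stats(graph)
climp_bound, mcl_bound = operation_bounds(graph, processes=4)
```

By the same definition, the yeast graph with 145,581 nodes and 34,413,340 edges has a density of roughly 236 edges per node.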

Conclusions and Availability
In this paper, a new efficient clustering algorithm is proposed for large-scale motif clustering, which can complement MCL and AP in some genome-wide motif prediction pipelines such as GLECLUBS [28], eGLECLUBS [29], and DePCRM [13]. The C++ source code parallelized with OpenMP, the three datasets used in this article, and a web server of CLIMP are publicly available at http://sqzhang.cn/climp.html.