QuateXelero: An Accelerated Exact Network Motif Detection Algorithm

  • Sahand Khakabimamaghani,

    Affiliation Laboratory of Systems Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran

  • Iman Sharafuddin,

    Affiliation Laboratory of Systems Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran

  • Norbert Dichter,

    Affiliation Molecular Bioinformatics, Johann Wolfgang Goethe-University, Frankfurt am Main, Germany

  • Ina Koch,

    Affiliation Molecular Bioinformatics, Johann Wolfgang Goethe-University, Frankfurt am Main, Germany

  • Ali Masoudi-Nejad

    amasoudin@ibb.ut.ac.ir

    Affiliation Laboratory of Systems Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran

Abstract

Finding motifs in biological, social, technological, and other types of networks has become a widespread method for gaining knowledge about the structure and function of these networks. However, the task is computationally very demanding, because it is closely tied to graph isomorphism, a problem in NP that is not yet known to be in P or to be NP-complete. Accordingly, this research aims to reduce the number of calls to the NAUTY isomorphism detection method, which is the most time-consuming step in many existing algorithms. The work provides an extremely fast motif detection algorithm called QuateXelero, which has a quaternary tree data structure at its heart. The proposed algorithm is based on the well-known ESU (FANMOD) motif detection algorithm. The results of experiments on some standard model networks confirm the overall superiority of QuateXelero over two of the fastest existing algorithms, G-Tries and Kavosh. QuateXelero is especially fast in constructing the central data structure of the algorithm from scratch based on the input network.

Introduction

Milo et al. [1] define “Network Motifs” as connectivity-patterns (subgraphs) in a particular network that occur much more often than they do in random networks. These patterns can be seen as the building blocks of networks. The importance of network motifs arises from the fact that they are closely related to many network properties such as structure, function, and robustness.

Since the introduction of this concept by Milo et al. in a seminal paper [1], a considerable number of studies have been conducted on this subject. Some of these studies focused on the biological aspects [2] [3] [4], and others concentrated on computational facets [5] [6] [7] [8] [9] [10]. The first group has endeavored to interpret the motifs detected in biological networks by the existing motif detection tools, while the second group has tried to improve those tools to make this job easier for researchers of the first group. The current research belongs to the second group.

Motif detection in networks consists of two main steps: first, calculating the number of occurrences of a subgraph in the network and, second, evaluating the subgraph significance. Various methods proposed so far differ mainly in the first step, the enumeration of subgraphs. These methods can be grouped roughly into two categories regarding this aspect:

  1. Methods counting subgraph occurrences exactly.
  2. Methods using sampling and statistical approximations for the enumeration.

In this work, the focus is on the first category, which is also much more computationally demanding. The methods in this group require classifying the subgraphs enumerated in the network. In other words, the non-isomorphic classes of the enumerated subgraphs must be determined. This can be done in two ways. First, one can generate all non-isomorphic classes of a prescribed size and then calculate the frequency of each in the network (i.e., count the number of matches of each class in the network). The drawback is that the number of non-isomorphic classes grows exponentially with the size of the subgraph. Grochow-Kellis [7] and MODA [11] exploit this approach. Second, one can perform the classification after the subgraphs are enumerated (i.e., the non-isomorphic class is determined separately for each enumerated subgraph). Faster tools, such as FANMOD [5], Kavosh [6] and G-Tries [8], use the latter classification method. This is also the approach used in the algorithm proposed in this paper.

The classification step is the most time-consuming step of the methods that take the second approach. The reason is the application of isomorphism detection algorithms, mostly NAUTY [12], in this step. For example, in FANMOD and Kavosh, each enumerated subgraph of a predefined size s is first passed to the NAUTY algorithm, which produces a binary canonical labeling of length s² for that subgraph. Then, the canonical labeling is used as a key to search a binary tree, each leaf of which indicates a particular non-isomorphic class of size s. ESU, the algorithm used in the FANMOD tool, is shown in Table 1: Algorithm 1 below (adapted from [13]).

The approach is different in G-Tries, in which a multi-way tree of depth s, the G-Trie, is used instead of the binary tree. NAUTY is again used when enumerating the subgraphs of the original network, but the structure of the G-Trie is such that it can classify subgraphs of random networks without calling NAUTY. So, NAUTY is only used for the census on the original network. This makes G-Tries the fastest in the census on random networks.

Although NAUTY is one of the fastest isomorphism detection methods, its computational cost is O(s!) in the worst case, which is considerable. Graph isomorphism is a problem in NP, and no polynomial-time algorithm for solving it is known yet. Only a few methods, like SAUCY [14] and BLISS [15], have been designed to improve on NAUTY’s performance in special cases, such as sparse graphs. However, the upper bound is still O(s!). Furthermore, searching the binary tree takes s² operations, which is also considerable.
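To make the O(s!) bound concrete, the following is a minimal sketch of a brute-force canonical labeling. It is not NAUTY and does not use NAUTY's search strategy; it simply tries every vertex ordering and keeps the largest adjacency bit string of length s². The function name `canonical_label` and the edge-list input format are assumptions made for this illustration.

```python
from itertools import permutations

def canonical_label(vertices, edges):
    """Brute-force stand-in for NAUTY (illustration only).

    Returns the lexicographically largest adjacency bit string of
    length s*s over all s! orderings of the vertices.
    """
    edge_set = set(edges)
    best = None
    for order in permutations(vertices):            # s! candidate orderings
        bits = "".join(
            "1" if (u, v) in edge_set else "0"
            for u in order for v in order           # s*s adjacency entries
        )
        if best is None or bits > best:
            best = bits
    return best

# Two differently named directed paths on three vertices are isomorphic,
# so they receive the same label; the label has s*s = 9 bits.
a = canonical_label([1, 2, 3], [(1, 2), (2, 3)])
b = canonical_label(["x", "y", "z"], [("z", "x"), ("x", "y")])
print(a == b, len(a))
```

Two subgraphs get the same label exactly when they are isomorphic, which is why the label can serve as a key into the binary tree.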

According to the above, it seems rational to search for methods that eliminate or decrease the number of executions of NAUTY in finding motifs. In fact, as stated above, this is the reason for the success of G-Tries as the fastest method so far. The G-Tries algorithm eliminates the need to call NAUTY during the census on random networks. Still, it uses the FANMOD algorithm for enumerating the subgraphs of the original network, which is very time consuming and sometimes infeasible when the network and the subgraph are large. G-Tries also provides other options that improve its performance on the original network, but applying these options requires some prior knowledge or preprocessing. These options are discussed later.

This paper provides a new algorithm with the aim of decreasing the number of calls to NAUTY. For this, the authors propose embedding a quaternary tree data structure in ESU (the algorithm used in FANMOD). A quaternary tree is a rooted tree data structure in which each internal node has at most four children (see Figure 1). Accordingly, each internal node in the tree can have at most five neighbors: its parent and up to four children.

Figure 1. An example quaternary tree of depth 3.

The root node and internal nodes have at most four children.

https://doi.org/10.1371/journal.pone.0068073.g001

Each edge connecting a parent to one of its children can be labeled with a mark, which can be a number, character, or any other symbol. A labeled quaternary tree can be searched using a given string that consists of the same set of symbols used for labeling that tree. The search starts at the tree’s root. In each step, one symbol is read from the input string, and the current pointer, initially set to the root, moves to the child of the current node whose connecting edge is labeled with that symbol. Because nodes may be added during the search, if a node on the path has no child for an input symbol, a child is added to the current node for that symbol and the current pointer moves to that child. The search continues until the input string has been read completely. See Figure 2 for an example.

Figure 2. Searching a sample quaternary tree for input string “321”.

Searching starts at the root of the tree. After respectively visiting children 3 and 2 throughout the path, the search finishes in a newly added leaf, corresponding to number 1.

https://doi.org/10.1371/journal.pone.0068073.g002
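The search-with-insertion described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation; the names `QTNode` and `search`, and the use of a dictionary for the (at most four) children, are assumptions.

```python
class QTNode:
    """Tree node with at most four children, one per edge symbol."""
    __slots__ = ("children", "pointer")

    def __init__(self):
        self.children = {}     # edge symbol -> child node
        self.pointer = None    # free slot for data attached to a leaf

def search(node, symbols):
    """Follow `symbols` starting at `node`, adding any missing children.

    Returns (end_node, created), where `created` tells whether at least
    one node had to be added along the way.
    """
    created = False
    for sym in symbols:
        child = node.children.get(sym)
        if child is None:                  # no child for this symbol yet
            child = QTNode()
            node.children[sym] = child
            created = True
        node = child
    return node, created

root = QTNode()
search(root, "32")                         # path 3 -> 2 already in the tree
leaf, created = search(root, "321")        # only the last node is new
again, created_again = search(root, "321") # second search adds nothing
print(created, created_again, again is leaf)
```

Returning the `created` flag is a design choice of this sketch: it lets the caller distinguish a newly added leaf from one reached before without a second lookup.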

This quaternary tree performs a partial classification of enumerated subgraphs in the proposed algorithm. This data structure, which is similar to the G-Trie data structure in some aspects, is used before calling NAUTY and eliminates the need for NAUTY most of the time. According to the experimental results, the proposed algorithm outperforms the existing algorithms in most cases.

Materials and Methods

Like G-Tries, Kavosh, and FANMOD, QuateXelero consists of three main phases: enumeration, classification, and motif detection. Although enumeration and classification phases are intertwined, describing them separately makes them more understandable. Below, these phases are elaborated.

Enumeration

For enumerating all subgraphs of size k in a given network, the general procedure is like the one in the FANMOD algorithm. What makes the enumeration in QuateXelero different from that in FANMOD is the use of a quaternary tree. Hereafter, ‘vertex’ refers to the nodes of the input network, and ‘node’ refers to the nodes of the quaternary or binary trees. As in FANMOD, the subgraph is extended by one vertex in each step, using the procedure EXTENDSUBGRAPH. This step-by-step extension allows the use of the quaternary tree, which is searched along with the extension. In other words, as the partial subgraph is extended by one vertex, the quaternary tree is also searched some levels further. Table 2: Algorithm 2 shows the algorithm of QuateXelero for census on the original network in detail.

Lines 6, 7, and 8 classify a subgraph after it is fully expanded. This is described in detail in the next section. Here, the SEARCH procedure is described. This procedure is called inside the function EXTENDSUBGRAPH, which expands the partial subgraph by one vertex each time it is called. After the new vertex w is selected from VExt in line 11, the SEARCH procedure in line 12 uses the pattern of connections of w to the other vertices of the partial subgraph (i.e., VSubg) to search the quaternary tree from CurQTNode to CurQTNode', which is |VSubg| nodes deeper (lines 17 to 27). Note that during this search the quaternary tree might be expanded with new nodes as described in section 2.1. The pattern of connections of w to the other vertices of the partial subgraph is represented by a string of length e = |VSubg| over the symbols {−1, 0, 1, 2}. For a previously added vertex u in the subgraph and the newly added vertex w, these symbols respectively indicate a one-way connection from u to w, no connection between them, a one-way connection in the reverse direction, and a two-way connection. An example of such a search is depicted in Figure 3. Since the procedure EXTENDSUBGRAPH is called k−1 times for a particular subgraph of size k, the path from the root of the quaternary tree to its leaf has length 1+2+…+(k−1) = k(k−1)/2. This is the maximal cost of the procedure SEARCH for one subgraph. However, as a consequence of the recursive nature of the implementation, the quaternary tree does not need to be searched from the root for every subgraph, so the cost of the algorithm is reduced.

Figure 3. Steps taken to search the quaternary tree during expanding (enumerating) a sample subgraph.

In this figure, −1 indicates one way connection from the existing vertex to added vertex, 0 indicates no connection between them, 1 stands for a one way connection in the reverse direction, and 2 shows a two way connection. The order of numbers in the input string is the same order as the corresponding vertices are added during expanding the subgraph (that is 1, 2, 3, and then 4 in this example).

https://doi.org/10.1371/journal.pone.0068073.g003
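The encoding of the connection pattern can be sketched as follows. The function name `connection_pattern` and the representation of the network as a set of directed `(source, target)` pairs are assumptions for this illustration; the symbol meanings follow the description above.

```python
def connection_pattern(edges, vsubg, w):
    """Encode how the new vertex w connects to each vertex already in vsubg.

    -1: one-way edge from the existing vertex u to w
     0: no edge between u and w
     1: one-way edge from w to u
     2: edges in both directions
    The symbols follow the order in which the vertices were added.
    """
    pattern = []
    for u in vsubg:
        forward = (u, w) in edges      # u -> w
        backward = (w, u) in edges     # w -> u
        if forward and backward:
            pattern.append(2)
        elif forward:
            pattern.append(-1)
        elif backward:
            pattern.append(1)
        else:
            pattern.append(0)
    return tuple(pattern)

edges = {(1, 2), (3, 1), (2, 3), (3, 2)}
print(connection_pattern(edges, [1, 2], 3))   # (1, 2): 3 -> 1, and 2 <-> 3

# Path length from root to leaf for subgraph size k: 1 + 2 + ... + (k - 1)
k = 5
print(sum(range(1, k)) == k * (k - 1) // 2)
```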

After searching the quaternary tree, the VExt and VSubg sets are updated in lines 13 and 14 and the procedure EXTENDSUBGRAPH is recursively called based on these sets and the node CurQTNode'.

Classification

During the enumeration, the appropriate leaf of the quaternary tree is returned by the SEARCH procedure before the last call for EXTENDSUBGRAPH for a partial subgraph, in which the size of that subgraph reaches k. Then, the condition of ‘if’ in line 5 in Table 2: Algorithm 2 is satisfied. At this point, two cases might happen:

  1. The CurQTNode was created during the search performed for the current subgraph (see Figure 3): in this case, which is detected in line 6, NAUTY (CANONICALLABELING) must be called for the enumerated subgraph to determine its class, which corresponds to a leaf in the binary tree. Then a pointer from CurQTNode is set to that leaf of the binary tree (see Figure 4). This is performed in line 7 of Table 2: Algorithm 2.
  2. The leaf already existed in the tree and was not newly added: in this case, the leaf has a previously set pointer to a leaf in the binary tree (i.e., the condition in line 6 is not satisfied), which indicates the isomorphism class to which the current subgraph belongs (see Figure 5). So there is no need to call NAUTY and search the binary tree for this subgraph.

Figure 4. Steps taken during classifying a subgraph, in which a new leaf is added to the quaternary tree.

1) The quaternary tree is searched and the new leaf is added. 2) Because the leaf is new and its pointer is not set, NAUTY is executed for the subgraph being enumerated. 3) After finding the canonical label for the subgraph, the binary tree is searched using that label and the corresponding leaf in the binary tree is identified. 4) The subgraph counter of that leaf (which indicates the number of subgraphs of that class found so far in the network) is increased by one. 5) The pointer of the quaternary tree leaf is set to the identified leaf of the binary tree.

https://doi.org/10.1371/journal.pone.0068073.g004

Figure 5. Steps taken during classifying a subgraph which has reached a previously existing leaf in the quaternary tree.

1) The quaternary tree is searched and the corresponding leaf is identified. 2) Using the identified leaf’s pointer to the corresponding leaf of the binary tree, the latter’s counter is incremented.

https://doi.org/10.1371/journal.pone.0068073.g005

In either of the above cases, the next step is to increase the counter of the corresponding leaf in the binary tree. This is performed in line 8 of Table 2: Algorithm 2, using the CurQTNode.pointer which points to the binary tree’s leaf.

The rationale underlying this classification is that if two different subgraphs reach the same leaf in the proposed quaternary tree, then those subgraphs are isomorphic to each other. Note, however, that the converse is not true: two isomorphic subgraphs may reach two different leaves of the quaternary tree. Thus, there may be two or more different quaternary tree leaves pointing to the same binary tree leaf.
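The interplay of enumeration and classification can be sketched end to end. This is a simplified illustration under several assumptions, not the authors' implementation: the enumeration follows the general ESU scheme, a brute-force `canonical_label` stands in for NAUTY, a dictionary keyed by the canonical label stands in for the binary tree, and vertices are assumed to be comparable integers in a network without self-loops. All names are hypothetical.

```python
from itertools import permutations

class QTNode:
    def __init__(self):
        self.children = {}     # symbol in {-1, 0, 1, 2} -> child node
        self.pointer = None    # isomorphism class, set on first visit to a leaf

def canonical_label(vertices, edges):
    """Brute-force stand-in for NAUTY: largest adjacency bit string."""
    return max(
        "".join("1" if (u, v) in edges else "0" for u in p for v in p)
        for p in permutations(vertices)
    )

def symbol(edges, u, w):
    fwd, back = (u, w) in edges, (w, u) in edges
    return 2 if fwd and back else -1 if fwd else 1 if back else 0

def census(edge_list, k):
    """Count size-k subgraphs per class; also return the number of labelings."""
    edges = set(edge_list)
    nbrs = {}
    for u, v in edges:                         # undirected neighborhoods
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    root, counts, calls = QTNode(), {}, 0

    def extend(vsubg, vext, v, node):
        nonlocal calls
        if len(vsubg) == k:
            if node.pointer is None:           # case 1: newly added leaf
                node.pointer = canonical_label(vsubg, edges)
                calls += 1
            # both cases: increase the counter of the pointed-to class
            counts[node.pointer] = counts.get(node.pointer, 0) + 1
            return
        vext = set(vext)
        seen = set(vsubg).union(*(nbrs[u] for u in vsubg))
        while vext:
            w = vext.pop()
            child = node
            for u in vsubg:                    # SEARCH: |vsubg| levels deeper
                sym = symbol(edges, u, w)
                if sym not in child.children:
                    child.children[sym] = QTNode()
                child = child.children[sym]
            new_ext = vext | {u for u in nbrs[w] if u > v and u not in seen}
            extend(vsubg + [w], new_ext, v, child)

    for v in nbrs:
        extend([v], {u for u in nbrs[v] if u > v}, v, root)
    return counts, calls

# A directed star with four leaves has six 3-vertex subgraphs of one class;
# all of them reach the same leaf, so only one labeling is computed.
counts, calls = census([(0, 1), (0, 2), (0, 3), (0, 4)], 3)
print(sum(counts.values()), len(counts), calls)
```

In the star example the expensive labeling runs once for six subgraphs, which is the saving the classification step relies on. In less regular networks several leaves can map to the same class, so the number of labelings lies between the number of classes and the number of subgraphs.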

Accordingly, in this algorithm (lines 6 to 7) the need to invoke the NAUTY function and search the binary tree is eliminated in many cases by exploiting the proposed quaternary tree. That is, the cost of s²+O(s!) is reduced to less than s(s−1)/2 for many of the enumerated subgraphs, while for others an extra O(s(s−1)/2) operations are added to s²+O(s!). What, then, is the ratio of the former subgraphs (cost reduced) to the latter ones (cost increased)? The answer to this question determines the speedup of QuateXelero compared with Kavosh and FANMOD. As discussed in section 4, this highly depends on the number of non-isomorphic classes of the subgraphs of the given network. According to the experimental results, in most cases QuateXelero performs remarkably better than existing algorithms, because the number of subgraphs is far greater than the number of non-isomorphic classes (especially in large biological networks). This means that a large number of subgraphs reach the same leaf of the quaternary tree, so NAUTY needs to be called only for the first of them. Consequently, this significantly reduces the computational time of motif finding.
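The trade-off can be put into rough numbers with a toy cost model. This is only a back-of-the-envelope sketch: it assumes that the binary tree search costs s² steps (the length of the canonical label), treats the O(s!) worst case of NAUTY as exactly s! steps, and counts a full quaternary tree path of s(s−1)/2 steps for every subgraph. The function names and the example figures are hypothetical.

```python
from math import factorial

def cost_hit(s):
    """Subgraph reaches an existing leaf: only the quaternary tree path."""
    return s * (s - 1) // 2

def cost_miss(s):
    """Subgraph creates a new leaf: path, binary tree search, and labeling."""
    return cost_hit(s) + s * s + factorial(s)

def total_cost(s, n_subgraphs, n_new_leaves):
    """Modeled cost when n_new_leaves of the subgraphs create a new leaf."""
    hits = n_subgraphs - n_new_leaves
    return hits * cost_hit(s) + n_new_leaves * cost_miss(s)

def baseline_cost(s, n_subgraphs):
    """Modeled cost when every subgraph is labeled and searched."""
    return n_subgraphs * (s * s + factorial(s))

# Hypothetical census: 1000 subgraphs of size 5, 10 of which add a new leaf.
print(total_cost(5, 1000, 10), baseline_cost(5, 1000))
```

Under these assumptions the saving grows with the ratio of subgraphs to newly added leaves, which matches the qualitative argument in the text.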

There is a subtle difference between the census on the original network (Table 2: Algorithm 2) and the census on the random networks in QuateXelero. During the census on the original network, the binary tree is modified when a new isomorphism class is detected. For the random networks, however, the function BLeaf does not change the structure of the binary tree. It searches the binary tree until it reaches either a null node or a leaf. The former case means that the recently enumerated subgraph is of an isomorphism class that does not exist in the original network, so the subgraph is ignored. In the latter case, the counter of the corresponding leaf in the binary tree is increased to account for the enumerated subgraph.
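The two modes of searching the binary tree can be sketched with a simple binary trie keyed by the canonical label. The class `BNode`, the function `b_leaf`, and its `insert` flag are assumptions made for this illustration; the source names the function BLeaf but does not give its signature.

```python
class BNode:
    def __init__(self):
        self.child = [None, None]   # children for bit 0 and bit 1
        self.count = 0              # subgraphs counted for this class

def b_leaf(root, label, insert):
    """Walk the binary tree along the bits of `label`.

    With insert=True (original network) missing nodes are created.
    With insert=False (random networks) the tree is left unchanged and
    None is returned as soon as a null child is met.
    """
    node = root
    for bit in label:
        i = int(bit)
        if node.child[i] is None:
            if not insert:
                return None         # class absent from the original network
            node.child[i] = BNode()
        node = node.child[i]
    return node

root = BNode()
b_leaf(root, "0110", insert=True).count += 1     # class seen in the original

known = b_leaf(root, "0110", insert=False)       # present: can be counted
unknown = b_leaf(root, "0111", insert=False)     # absent: subgraph ignored
print(known is not None, unknown is None)
```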

At first glance, the algorithm might seem similar to the ESU option of the G-Tries algorithm [8] (please refer to http://www.dcc.fc.up.pt/gtries/), but there are substantial differences. While the quaternary tree serves the same function as the G-Trie multi-way data structure, and the two structures are similar in theory though not in practice, the way these data structures are exploited is completely different in the two algorithms. First, like the quaternary tree, the G-Trie is constructed while processing the original network, with the subtle difference that the quaternary tree is developed along with the enumeration, whereas the G-Trie is generated after the enumeration of the subgraphs of the original network (ESU) is complete. Moreover, unlike in QuateXelero, in the ESU step of the G-Tries algorithm the canonical labeling is computed with NAUTY for all subgraphs of the original network. Avoiding this remarkably reduces the computational time of the census on the original network in QuateXelero compared with G-Tries. Second, after constructing the G-Trie, NAUTY is no longer used for random networks; instead, the subgraphs are enumerated and classified using the G-Trie data structure. In this work, by contrast, NAUTY may also be called for some subgraphs of random networks, although this becomes gradually less likely as more random networks are processed. Accordingly, it is the total number of executions of NAUTY in these algorithms that determines the superiority of one over the other. Recall that NAUTY is the most time-consuming part of the motif detection algorithms that depend on it.

Motif Detection

After the census on the original network with the help of a quaternary tree, each leaf of the binary tree contains the number of subgraphs belonging to the corresponding isomorphism class. Then, some random networks are generated by rewiring, and the census is repeated on them. As the random generation method, we used the same method applied in G-Tries (3 swaps per edge with a random Markov chain process). The generated networks were checked against those generated by G-Tries, and the results indicate the consistency of the random generation method.
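A common way to realize such rewiring is a degree-preserving edge swap, sketched below for a directed network. This is a generic illustration and not necessarily the exact procedure of G-Tries or QuateXelero: the function name `rewire`, its parameters, and the rule of skipping swaps that would create self-loops or duplicate edges are assumptions, and the sketch does not try to preserve the number of bidirectional edges.

```python
import random

def rewire(edges, swaps_per_edge=3, seed=None):
    """Randomize a directed network while keeping every in- and out-degree."""
    rng = random.Random(seed)
    edge_list = list(edges)
    if len(edge_list) < 2:
        return edge_list
    edge_set = set(edge_list)
    for _ in range(swaps_per_edge * len(edge_list)):   # swap attempts
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        # proposed swap: a->b, c->d  becomes  a->d, c->b
        if a == d or c == b:
            continue                                   # would add a self-loop
        if (a, d) in edge_set or (c, b) in edge_set:
            continue                                   # would duplicate an edge
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list

original = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (1, 3)]
randomized = rewire(original, seed=1)
# Out-degrees and in-degrees are unchanged by construction.
print(sorted(u for u, _ in randomized) == sorted(u for u, _ in original))
print(sorted(v for _, v in randomized) == sorted(v for _, v in original))
```

Each accepted swap exchanges the targets of two edges, so every vertex keeps its number of outgoing and incoming edges. Here the loop counts attempts, so rejected swaps are not retried.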

Finally, the number of subgraphs of each isomorphism class for the original and random networks is used to calculate the z-score of each isomorphism class i as below:

$$z_i = \frac{C_i - \mu_i}{\sigma_i}$$

where C_i, µ_i and σ_i are respectively the number of occurrences of i in the original network, the average number of occurrences of i in the random networks, and the standard deviation of the occurrences of i in the random networks. The higher the z-score, the more likely it is that the particular isomorphism class (i) is a motif in the given network.
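The z-score of a class, taken as (C − µ)/σ with the quantities defined above, can be computed as in the following sketch. The function name `z_score` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether the population or the sample form is used. Returning `None` when all random counts are equal is likewise a choice made for this sketch.

```python
from statistics import mean, pstdev

def z_score(c_original, c_random):
    """z-score of one isomorphism class.

    c_original: occurrences in the original network
    c_random:   occurrences in each random network
    """
    mu = mean(c_random)
    sigma = pstdev(c_random)       # population standard deviation (assumed)
    if sigma == 0:
        return None                # undefined when the random counts agree
    return (c_original - mu) / sigma

# Hypothetical counts: 30 occurrences in the original network and
# 10, 12, 8, 10 occurrences in four random networks.
print(z_score(30, [10, 12, 8, 10]))
```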

Datasets

We used six standard networks for evaluating our algorithm. These were three biological networks: the metabolic pathway of the bacterium E. coli [16], the transcription network of the yeast S. cerevisiae [17], and the protein-protein interaction network of the budding yeast [18], [19]; and three non-biological networks: a real social network [6], a dolphin social network [20], [21] and an electronic network [1]. Self-loops were removed from all networks. The features of these networks are displayed in Table 3. All these datasets are included in the available online package for convenience.

Results

Because Kavosh and G-Tries are the best among the existing motif finders, they were chosen for comparison with QuateXelero. G-Tries is superior in speed and Kavosh is better in memory usage.

Comparison with Kavosh

For comparing QuateXelero with Kavosh, both algorithms were executed on the same computer with a Quad-Core AMD Opteron™ Processor 2354 and the CentOS Linux Release 6.0 (final) operating system. The number of random networks was set to two in all experiments, which is enough for obtaining valid results in these experiments. It is important to note that this number of random networks is not suitable for motif detection in practice and is only used here to get fast results for comparison. Moreover, different motif sizes were considered in the experiments in order to assess the effect of the motif size on the performance of the algorithms.

The results are presented in Table 4. While QuateXelero is much faster than Kavosh in all cases, the size of this advantage depends on the network size and structure, the motif size, and the variety of non-isomorphic classes. More precisely, it is closely related to the ratio of the number of subgraphs to the number of classes, displayed in the fifth column of Table 4. The greater the ratio, the greater the advantage of QuateXelero. For example, QuateXelero is up to 86 times faster when finding motifs of size 8 in the Yeast network, but only 21 times faster when identifying motifs of size 9 in the E.coli network. This is mainly because the number of subgraphs in Yeast is greater than in E.coli, but these subgraphs fall into a smaller number of non-isomorphic classes. So the need to call NAUTY is reduced more for Yeast than for E.coli.

Overall, the results indicate that QuateXelero outperforms Kavosh in processing time in all cases. This is also illustrated in Figure 6, which shows the gap between the algorithms growing as the motif size (i.e., s) increases. In other words, the advantage of QuateXelero persists as the motif size increases. The average run time growth ratios in Table 4 further confirm this.

Figure 6. Growing gap between the running times of Kavosh and QuateXelero.

In the charts, the horizontal axis indicates the size of motif and the vertical axis is the log of running time. The bases of logarithms are set to integer numbers close to the average running time growth rates shown in Table 5 for each network. The growing gaps are more visible in the charts for Yeast, Electronic, and E.coli networks.

https://doi.org/10.1371/journal.pone.0068073.g006

The only drawback of the proposed algorithm is the considerable amount of memory used to construct the quaternary tree for larger motif sizes and for networks containing a larger number of non-isomorphic subgraph classes. For example, among the experiments listed in Table 4, the highest amount of memory used by Kavosh was about 370 MB, for the Social network and motif size 9. QuateXelero, on the other hand, occupied about 2.8 GB of memory (more than 7 times as much) for the same test case and about 4.6 GB for the Electronic network and motif size 11. Nevertheless, given the availability and low price of large memories nowadays, this is not a very serious limitation, at least for the smaller, more popular sizes.

Comparison with G-Tries

To compare QuateXelero with G-Tries, three groups of experiments are conducted. First, both of the algorithms are tested against smaller motif sizes on directed networks, second the same experiments are performed for larger sizes to understand the effects of motif size on run time of the two algorithms, and finally algorithms’ performances are tested for undirected networks.

Before explaining the experimental results, one point is worth noting. Currently, G-Tries provides an important and useful option for census on networks: given a list of non-isomorphic classes whose occurrences are to be counted, one can generate a G-Trie based on those subgraphs and then apply that G-Trie for enumerating subgraphs of both original and random networks.

However, if the goal is to exploit this option to enumerate all subgraphs occurring in a given network, two rough solutions might come to mind initially: 1) knowing all non-isomorphic classes occurring in the given network in advance, one can generate a G-Trie based on those subgraphs and then apply the G-Trie for enumeration, and 2) one can generate a G-Trie containing all possible non-isomorphic classes of a given size and then use it for enumeration. The first solution is obviously impossible, as all subgraphs of a network must be enumerated before the complete list of their non-isomorphic classes is known. In other words, before being able to use this option to generate the solution, we need the solution itself. The second solution, although useful for smaller motif sizes, becomes impractical for sizes larger than 7 or 8 for directed and 11 or 12 for undirected networks, since the number of non-isomorphic classes grows exponentially and storing the generated G-Tries would need a tremendous amount of memory.

The option provided in G-Tries is useful when performing a set-centric subgraph enumeration (i.e., counting the occurrences of a given set of subgraphs) or when the motif size is small. This option can also be embedded in QuateXelero easily (and this is planned), as the general structures of QuateXelero and G-Tries are similar. However, the aim of this paper is not to compare the performance of the two algorithms in set-centric searches, but to compare them in both steps of generating and applying the quaternary tree and G-Trie data structures, especially for larger motifs where the set-centric option becomes inapplicable. Thus, here we focus on the ESU option of G-Tries, which we call ESU+G-Tries. The algorithm then has two steps: ESU (the algorithm of FANMOD) for the census on the original network, and G-Tries for the census on randomized networks. The comparison of other options of G-Tries with the equivalent options in the proposed algorithm (which are planned to be implemented) requires separate research.

Having said this, we continue with the comparison results. For comparing the algorithms, a metric called “Equality Point” is defined. The equality point (ep) is the number of random networks at which the total processing times of both algorithms are equal. It can be calculated using the equation below, in which t_oi is the time required by algorithm i for all calculations other than the census on random networks (including the census on the original network, writing the output file, etc.), and t_ri is the average time that algorithm i spends on the census on a single random network.

$$ep = \frac{t_{o2} - t_{o1}}{t_{r1} - t_{r2}}$$

This concept is also illustrated in Figure 7. This figure exhibits two different cases when the ep is positive (the left chart) and when it is negative (the right chart). In the former case, the equality point is the point after which the superior algorithm (i.e., A) becomes the inferior one, and the inferior one (i.e., B) becomes the superior. However, in the second case, one algorithm (e.g., Algorithm B) is superior to the other for all numbers of random networks. The ep metric is used later to investigate the usefulness of the proposed algorithm.

Figure 7. The concept of Equality Point.

Positive and negative equality points are illustrated respectively in the left and the right charts. The vertical axis t indicates the total time of algorithms and the horizontal axis r shows the number of random networks used for motif detection.

https://doi.org/10.1371/journal.pone.0068073.g007
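The equality point follows from modeling the total time of an algorithm as a fixed part plus a per-network part, t(r) = to + r·tr, and solving for the r at which both totals agree. The sketch below does this for two algorithms A and B; the function name `equality_point` and the example timings are hypothetical.

```python
def equality_point(to_a, tr_a, to_b, tr_b):
    """Number of random networks r at which both total times are equal.

    Total time of an algorithm is modeled as to + r * tr, where `to` is
    the time for everything except the random-network census and `tr` is
    the average census time on one random network.
    """
    if tr_a == tr_b:
        return None                # parallel lines: the totals never cross
    return (to_b - to_a) / (tr_a - tr_b)

# Positive case: A starts faster (to = 10 vs 50) but is slower per random
# network (tr = 2 vs 1), so the lines cross at r = 40.
print(equality_point(10, 2, 50, 1))

# Negative case: A is faster in both parts, so A wins for every r >= 0
# and the crossing lies at a negative r.
print(equality_point(10, 1, 50, 2))
```

With 100 random networks, a positive ep below 100 means the algorithm with the smaller per-network time wins overall, while a negative ep means one algorithm is faster for any number of random networks.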

First, the results for the small motifs are discussed. These results are presented in Table 5. Before interpreting them, a significant feature of QuateXelero that is not found in G-Tries should be noted. This feature is illustrated in Figure 8, which shows that, for all networks except Yeast, the average time spent on the census on a random network decreases as the number of random networks increases. This is especially visible for the Social network, for which the variety of non-isomorphic classes is greater than for the other networks. The reason is that the quaternary tree becomes more and more complete as more random networks are enumerated using it. In other words, the greater the variety of input subgraphs (i.e., the more random networks), the more comprehensive the quaternary tree. So, the need to call NAUTY declines for successive random networks and less time is spent on them. This was taken into account in designing the experiments for smaller motifs: the numbers of random networks for the Yeast, Social, E.coli, and Electronic networks were set to 10, 100, 100, and 100, respectively. This was done under the assumption that many motif finding tasks use 100 random networks in their calculations.

Figure 8. Effect of number of random networks on average time of census on a single random network.

Numbers in parentheses show the motif size for which the experiments were conducted (the results can be generalized to other motif sizes). The vertical axis indicates the ratio (as a percentage) of the run time to the run time for 20 random networks. Except for Yeast, the networks exhibit a decline in the random network census time for successive random networks.

https://doi.org/10.1371/journal.pone.0068073.g008

Now, we return to Table 5. In all cases, QuateXelero completes the census on the original network several times faster than the ESU step of G-Tries. On the other hand, G-Tries is faster in the census on the random networks for Yeast. Again assuming that most motif finding tasks use 100 random networks, and according to the Equality Point values, QuateXelero detects motifs faster than ESU+G-Tries in all cases except when finding motifs of size 6 in the Yeast regulatory network, for which the ep is below 100. The two algorithms perform almost the same for motifs of size 7 in the Yeast network (ep ≈ 100).

Taking into account the results for larger motifs shown in Table 6, it can be concluded that in the Social and Electronic networks the performances of the two algorithms converge as the motif size grows, and at some point ESU+G-Tries surpasses QuateXelero. For the Social network, this has already happened in Table 6, where the ep values are below 100. As stated in the previous section, this is partially related to the subgraphs/classes ratio displayed in column five, which is much smaller in the Social network than in the other networks. Furthermore, unlike in the other networks, in the Social network this value decreases as the motif size (i.e., s) increases (i.e., its growth ratio is below 1). However, this is not the only factor influencing the Equality Point. Another factor is the degree distribution, which is closer to a normal distribution in the Social network than in the other networks, which have power-law distributions. Also, the Social network has a higher density (0.041) than Yeast (0.002), E.coli (0.003), and Electronic (0.006). All these factors increase the variety of subgraphs in random networks and so increase the chance that QuateXelero calls NAUTY during the census on the random networks. This makes QuateXelero slower than ESU+G-Tries in detecting large motifs of the Social network when the number of random networks is high. While QuateXelero has always been better in detecting the motifs of the Electronic network in our experiments, the trend of ep values indicates that ESU+G-Tries will surpass QuateXelero for larger motif sizes. The same conclusions follow from the average growth ratios, as the average growth ratio of the random-network census time for QuateXelero (column 10) is always greater than that for G-Tries (column 9), except for large motifs of the E.coli network.

For the Yeast network the situation is different. Although the limited experiments here are not enough to make a firm judgment, Tables 5 and 6 suggest that the ep values do not exhibit a meaningful trend for this network and that the two algorithms perform almost equally, with ESU+G-Tries being somewhat superior in detecting larger motifs.

For E.coli, however, QuateXelero has always been superior to ESU+G-Tries, and the trend of the ep values indicates that they will remain negative for larger motifs, which suggests that QuateXelero will also be better for those motif sizes.

The third series of experiments concerned undirected networks. The results are displayed in Table 7 and Figure 9. The table and figure show that QuateXelero is faster for small motifs and slower for medium-size motifs. However, the trends of the random census time ratios (i.e., the ratio of the average time spent by QuateXelero for the census on random networks to the corresponding time for G-Tries) and of the ep values, shown respectively in the left and right charts of Figure 9, suggest that for larger motif sizes the results for YeastPPI and Electronic will follow the behavior observed for Dolphins. In other words, QuateXelero seems likely to take the lead again for larger motifs, for which some limitations (time for YeastPPI, and a core dump while running ESU+G-Tries for size 11 on the Electronic network) prevented us from conducting more experiments. Furthermore, there is probably a relationship between the Subgraphs/Classes ratio (column 4 of Table 7) and the performance of the algorithms. QuateXelero appears to perform generally better on networks for which this ratio is small, as illustrated by the Dolphins network.

Figure 9. Trends of random network census time ratio (left) and Equality Point (right) for undirected networks.

The ratio in the left chart indicates the ratio of average time spent by QuateXelero for census on random networks to the same time required for G-Tries.

https://doi.org/10.1371/journal.pone.0068073.g009

In general, the following can be concluded from the experiments:

  1. QuateXelero is always faster in census on original networks compared with ESU of G-Tries.
  2. QuateXelero is generally faster in census on random networks for smaller motifs.
  3. G-Tries is in most of the cases (especially for directed networks) faster in census on random networks for larger motif sizes.
  4. QuateXelero is always better than ESU+G-Tries for the tested motif sizes on the E.coli network regardless of the number of random networks (negative ep), and would probably dominate for larger motif sizes too.
  5. QuateXelero is generally better than ESU+G-Tries for smaller motif sizes.
  6. QuateXelero surpasses ESU+G-Tries in most of our experiments for larger motif sizes in directed networks; however, it seems that ESU+G-Tries will be better for larger sizes that are not achievable with the facilities available to the authors.
  7. For undirected networks, QuateXelero surpasses ESU+G-Tries for smaller and, seemingly, larger motifs; however, ESU+G-Tries is better for medium-size motifs.

Two points should be noted here. First, given the exponential growth in memory usage, it seems infeasible to go further in motif size than we have done, since this requires huge amounts of memory that are available only on a limited scale in supercomputers. Second, most current research focuses on motifs of size under 8, because the dynamical features of bigger motifs are as yet unknown. Accordingly, the performed tests seem sufficient to provide reliable data.

For the small experiments, we used a laptop computer with an Intel Core™ 2 Duo 2.5 GHz CPU and 4 GB of RAM. For the larger experiments, a master node with a Quad-Core AMD Opteron™ Processor 2384 800 MHz and 64 GB of main memory was used. The experiments for each network were conducted up to the largest motif size possible. However, some experiments were limited by the available memory and time. Generally, QuateXelero was mainly limited by the available memory, while ESU+G-Tries was sometimes limited by time and sometimes by memory. These limitations and their details are listed in Table 8. Since tests lasting more than 48 hours were cancelled, the first two cases indicated in Table 8 were not completed. Accordingly, the results displayed in Table 6 for ESU+G-Tries for motifs of size 9 in the Yeast transcription network were estimated. The estimation was based on the results shown in Table 5. To this end, the ratios of the times used by QuateXelero for the census on the original and random networks to the corresponding times for ESU+G-Tries were computed from the values in Table 5. We then extrapolated these ratios to size 9 according to the trends recorded for sizes 5 to 7. Finally, the estimated times for G-Tries were calculated by dividing the actual times recorded for QuateXelero by the extrapolated ratios.
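
The estimation procedure described above can be sketched as follows. The extrapolation method is not stated here, so a least-squares straight line through the ratios for sizes 5 to 7 is assumed purely for illustration. All names and numbers are hypothetical and are not values from Tables 5 or 6.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_gtries_time(sizes, q_times, g_times, target_size, q_time_at_target):
    """Estimate the ESU+G-Tries time at target_size.

    1. Compute the QuateXelero / ESU+G-Tries time ratio at each measured size.
    2. Extrapolate the ratio to target_size (assumed: linear trend).
    3. Divide the measured QuateXelero time by the extrapolated ratio.
    """
    ratios = [q / g for q, g in zip(q_times, g_times)]
    slope, intercept = fit_line(sizes, ratios)
    ratio_at_target = slope * target_size + intercept
    return q_time_at_target / ratio_at_target


# Hypothetical census times in seconds for motif sizes 5, 6 and 7.
sizes = [5, 6, 7]
q_times = [1.0, 6.0, 40.0]   # QuateXelero (illustrative)
g_times = [2.0, 8.0, 40.0]   # ESU+G-Tries (illustrative)

# Ratios are 0.5, 0.75, 1.0, so the fitted line gives 1.5 at size 9.
estimated = estimate_gtries_time(sizes, q_times, g_times, 9, 900.0)
```

The same function would be applied twice, once to the census times on the original network and once to the average census times on the random networks.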

Conclusions and Future Works

Network motif detection is a challenging problem in terms of the computational time and memory it requires, and there have been remarkable efforts to solve it efficiently. This paper provides a new solution to this problem that is claimed to be superior to the existing solutions in terms of processing time in special cases. This claim is supported by experimental results on some standard complex networks. The comparison of the proposed algorithm, QuateXelero, with the well-known existing method Kavosh showed that QuateXelero is superior to Kavosh in processing time in all cases. However, QuateXelero uses a massive amount of memory compared with Kavosh. Another, more important analysis was the comparison against the ESU+G-Tries algorithm (the ESU option of the G-Tries algorithm). Generally, the results indicate that QuateXelero is always much faster than the ESU option of G-Tries in constructing the central data structure (i.e., the census on the original network), but slower in the census on random networks for larger motif sizes in most of the directed cases. The results for undirected networks illustrate the superiority of QuateXelero in detecting small and probably large motifs, but not medium-size ones. Furthermore, while QuateXelero is faster in most of the experiments attempted, it seems that the two algorithms, QuateXelero and ESU+G-Tries, will converge and the situation will be reversed when the size of the directed motif exceeds the sizes tested here. However, it should be noted that larger motifs are only detectable with huge main memories, which might be found only in special supercomputers. Moreover, current research does not exhibit a tendency towards motifs larger than those we have discussed.

The proposed algorithm can still be improved. In light of the above, future work can focus on comparing the other options of the G-Tries algorithm with the equivalent options in QuateXelero. Another topic for further research is combining the strengths of QuateXelero (e.g., faster census on the original network) with those of G-Tries (e.g., generally faster census on random networks and lower memory usage) to obtain a more efficient motif detection tool for problems in which the motif size is large and other options are therefore infeasible. Furthermore, the question “When is QuateXelero faster than G-Tries, or vice versa, in the census on random networks?” has not yet been answered completely. Another point of focus can therefore be the development of a strategy for choosing the appropriate one of the two algorithms for the census on random networks when processing a particular input network. Finally, more compact data structures could be used to reduce the size of the constructed quaternary tree and thus improve the memory complexity of QuateXelero.

Implementation and Availability

QuateXelero is implemented in the C++ programming language under the Linux operating system. The program can also be used under Windows (please refer to the help file). The source code and sample networks are available for download at: http://lbb.ut.ac.ir/Download/LBBsoft/QuateXelero/.

Acknowledgments

AMN would like to thank the DAAD visiting professorship research program at Frankfurt University. The authors also acknowledge the support of Dr. Pedro Ribeiro of the University of Porto, Portugal, who provided the source code of G-Tries and invaluable comments on improving the manuscript.

Author Contributions

Conceived and designed the experiments: AMN SK. Performed the experiments: SK IS ND. Analyzed the data: SK. Contributed reagents/materials/analysis tools: IK. Wrote the paper: SK AMN IK.

References

  1. 1. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, et al. (2002) Network motifs: simple building blocks of complex networks. Science 298: 824–827.
  2. 2. Dekel E, Mangan S, Alon U (2005) Environmental selection of the feed-forward loop circuit in gene-regulation networks. Physical biology 2: 81–88.
  3. 3. Zabet NR (2011) Negative feedback and physical limits of genes. Journal of theoretical biology 284: 82–91.
  4. 4. Mangan S, Alon U (2003) Structure and function of the feed-forward loop network motif. Proceedings of the National Academy of Sciences of the United States of America 100: 11980–11985.
  5. 5. Wernicke S, Rasche F (2006) FANMOD: a tool for fast network motif detection. Bioinformatics 22: 1152–1153.
  6. 6. Kashani ZR, Ahrabian H, Elahi E, Nowzari-Dalini A, Ansari ES, et al. (2009) Kavosh: a new algorithm for finding network motifs. BMC bioinformatics 10: 318.
  7. 7. Grochow JA, Kellis M (2007) Network Motif Discovery Using Sub-graph Enumeration and Symmetry-Breaking. RECOMB. 92–106.
  8. 8. Ribeiro P, Silva F (2010) G-Tries: an efficient data structure for discovering network motifs. 25th ACM Symposium on Applied Computing - Bioinformatics and Computational Systems Biology Track, Sierre, Switzerland.
  9. 9. Wang J, Huang Y, Wu FX, Pan Y (2012) Symmetry Compression method for Discovering Network Motifs. IEEE/ACM transactions on computational biology and bioinformatics/IEEE, ACM 10A02234-FB2C-42D1-AE5A-CA813BF34133.
  10. 10. Beber ME, Fretter C, Jain S, Sonnenschein N, Muller-Hannemann M, et al. (2012) Artefacts in statistical analyses of network motifs: general framework and application to metabolic networks. Journal of the Royal Society, Interface/the Royal Society 9: 3426–3435.
  11. 11. Omidi S, Schreiber F, Masoudi-Nejad A (2009) MODA: an efficient algorithm for network motif discovery in biological networks. Genes & genetic systems 84: 385–395.
  12. 12. Brendan M (1981) Practical Graph Isomorphism. Congressus Numerantium 30: 45–87.
  13. 13. Ribeiro P, Silva F, Kaiser M (2009) Strategies for Network Motifs Discovery. Fifth IEEE International Conference on e-Science. 80–87.
  14. 14. Darga P, Sakallah K, Markov IL (2008) Faster Symmetry Discovery using Sparsity of Symmetries. The 45st Design Automation Conference. 149–154.
  15. 15. Junttila T, Kaski P (2007) Engineering an efficient canonical labeling tool for large and sparse graphs. the Ninth Workshop on Algorithm Engineering and Experiments (ALENEX07).
  16. 16. The E.coli Database. Available: http://www.kegg.com/
  17. 17. The S. cerevisiae Database. Available: http://www.weizmann.ac.il/mcb/UriAlon/
  18. 18. Bu D, Zhao Y, Cai L, Xue H, Zhu X, et al. (2003) Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic acids research 31: 2443–2450.
  19. 19. Batagelj M, Mrvar A (2006) Pajek Datasets. Available: http://vlado.fmf.uni-lj.si/pub/networks/data/
  20. 20. Lusseau D, Schneider K, Boisseau OJ, Haase P, Slooten E, et al. (2003) The bottlenose dolphin community of doubtful sound features a large proportion of long-lasting associations. can geographic isolation explain this unique trait? Behavioral Ecology and Sociobiology 54: 396–405.
  21. 21. Newman M (2009) Network Data. Available: http://www-personal.umich.edu/~mejn/netdata/