The Index-Based Subgraph Matching Algorithm with General Symmetries (ISMAGS): Exploiting Symmetry for Faster Subgraph Enumeration

Subgraph matching algorithms are used to find and enumerate specific interconnection structures in networks. By enumerating these specific structures/subgraphs, the fundamental properties of the network can be derived. More specifically in biological networks, subgraph matching algorithms are used to discover network motifs, specific patterns occurring more often than expected by chance. Finding these network motifs yields information on the underlying biological relations modelled by the network. In this work, we present the Index-based Subgraph Matching Algorithm with General Symmetries (ISMAGS), an improved version of the Index-based Subgraph Matching Algorithm (ISMA). ISMA quickly finds all instances of a predefined motif in a network by intelligently exploring the search space and taking into account easily identifiable symmetric structures. However, more complex symmetries (possibly involving switching multiple nodes) are not taken into account, resulting in superfluous output. ISMAGS overcomes this problem by using a customised symmetry analysis phase to detect all symmetric structures in the network motif subgraphs. These structures are then converted to symmetry-breaking constraints used to prune the search space and speed up calculations. The performance of the algorithm was tested on several types of networks (biological, social and computer networks) for various subgraphs with a varying degree of symmetry. For subgraphs with complex (multi-node) symmetric structures, high speed-up factors are obtained as the search space is pruned by the symmetry-breaking constraints. For subgraphs with no or simple symmetric structures, ISMAGS still reduces computation times by optimising set operations. Moreover, the calculated list of subgraph instances is minimal as it contains no instances that differ by only a subgraph symmetry. An implementation of the algorithm is freely available at https://github.com/mhoubraken/ISMAGS.


Introduction
In modern society, technology has been applied to create and study numerous advanced systems in various fields as biology, sociology, informatics and others. To understand their internal dynamics, many of these systems can be modelled using graph theory. By interpreting the systems as graphs of interconnected components, a vast array of network processing methods enables detailed analysis of the underlying, fundamental properties.
More specifically in biology, graphs are very well suited to model interactions between different proteins. A graph can be constructed by modelling proteins and interactions among them as nodes and edges respectively. A powerful analysis technique is described in [1] and consists of finding network motifs in the graph. These network motifs denote small interactions patterns between several proteins that are unusually more present in the graph than expected by chance. They can be modelled as small subgraphs which can then be searched in the larger network representing all known interactions between all proteins. By discovering these network motifs, our understanding of the underlying mechanisms of the network can be improved.
To find these network motifs, several tools and algorithms have been developed. Mfinder [2] was one of the early tools to mine graph data for network motifs. Similarly, the FANMOD [3] tool The algorithms mentioned above use subgraph enumeration to find network motifs in the network. However, a different but related network analysis approach [8] is based on calculating graphlet degree distributions. A graphlet is a small connected nonisomorphic induced subgraph in a larger network for which the instances will be counted. However, contrary to network motifs which are partial subgraphs, graphlets are induced graphs, which means that if an edge is absent in the graphlet specification, it should also be absent in the graphlet instances and thus in the larger network. While graphlets and motifs are defined differently, they are both used to analyse networks by enumerating the graphlet/motif instances in the graph. The network motif analysis consists of finding unusually frequent subgraphs, whereas the graphlet-based analysis aims at characterising entire graphs by counting the occurrences for each graphlet from a predefined set. Similar to a node degree distribution, the counts form a distribution that represents the structure of the network in terms of graphlets. However, contrary to network motif analysis, the graphlets do not need to be over-represented (compared to random networks) [1]. As with network motif analysis, the graphlet analysis heavily relies on the enumeration of the graphlet instances which should be optimised.
While the discussed algorithms so far were developed to analyse full networks (by using network motifs and graphlets), the aim of this paper is to present a general subgraph matching algorithm. This algorithm can be used on its own to count or enumerate all specific occurrences of a subgraph in a larger network but can also be used as a building block for a full network analysis algorithm. Such an algorithm needs to be carefully designed as the subgraph isomorphism problem is proven to be NP-complete [9,10]. As network modelling is used in various applications, the subgraph isomorphism problem has many variants and several classes of algorithms exist for solving it. In this paper, we focus on exact algorithms for which a strict correspondence between the specified subgraph and the requested instances in the graph is required. Well-known algorithms in this class are the Ullmann [11], the VF [12] and the VF2 [13] algorithms. Ullmann uses a matrix-based representation of the search space and iteratively prunes uninteresting branches in the search tree. Pruning is done by applying a refinement procedure to eliminate candidate nodes (for mapping to a subgraph node) based on the neighbours of the candidate and the required connectedness to the neighbours of the subgraph node. While the Ullmann algorithm is versatile, as it can be used in a wide range of isomorphism problems, it is matrix-based, which causes high memory requirements. Less memory is required by VF and VF2 algorithms which are graph-based. These algorithms search the network by creating an initial partial mapping between the source graph ( = the large network) and the subgraph ( = the network motif) and iteratively generating candidate pairs to be added to the mapping. Aside from speeding up the search, the graph modelling in VF2 significantly reduces the memory requirements as it only requires O(N) memory while Ullmann requires O(N 3 ). As biological networks tend to be very large (millions of nodes in some applications), reducing the memory requirements allows for a greater applicability.
In previous work, the Index-based Subgraph Matching Algorithm (ISMA) [14] was presented and compared against the above-mentioned subgraph matching algorithms. Like VF2, ISMA also searches the source graph for subgraph instances by creating a partial subgraph-to-graph-node mapping and expanding it iteratively. However, ISMA intelligently determines the order in which the partial mapping is expanded and avoids unnecessary computations. These optimisations greatly reduce the search space and speed up query times. In this paper, we introduce the Index-based Subgraph Matching Algorithm with General Symmetries (ISMAGS) in which search space size and query times are further reduced by incorporating the internal symmetry of subgraphs as constraints into the algorithm. Our constraint-based pruning is similar to that in [7] in which the breaking of the symmetry is done by iterating the symmetry analysis during the search. Based on the partial mapping constructed at that point, constraints are generated to avoid exploring symmetric parts of the search tree. However, the symmetry analysis in [7] is repeated several times and requires generating an exhaustive list of isomorphisms of the subgraph. In ISMAGS, only one symmetry analysis is needed to obtain a compact set of generating permutations and constraints to break the symmetry. Compared to [7], we also present results for larger networks with multiple edge types.
In the rest of this paper, we first briefly outline the ISMA algorithm and its functionality for incorporating simple symmetric structures. As only basic symmetric relations are incorporated by ISMA, we then continue by presenting our approach to incorporating symmetry in the search tree. After explaining how ISMA deals with symmetry, the symmetry detection in ISMAGS is presented and validated in a group-theoretical context. Subsequently, we show the derivation of the symmetry-breaking constraints and integrate them into the global algorithm. We then show the performance gain of ISMAGS over ISMA for multiple networks with various properties (size, edge types) and various subgraphs. We also compare against the VF2 algorithm and (parts of) the G-Trie and the Grochow-Kellis algorithm (as the full algorithms were developed to find network motifs while ISMAGS is only concerned with the subgraph enumeration method).

Methods
The following section contains a brief introduction to the core aspects of ISMA, followed by closer examination of its internal symmetry handling. Next, the core features of ISMAGS to take into account all symmetric structures are presented along with a description of the global algorithm.

ISMA
In previous work [14], ISMA was developed to find matches for composite network motifs (subgraphs with type-annotated edges) in large graphs by dynamically optimising the order in which nodes are investigated during the search process. More formally, the algorithm searches in a graph G~fV ,Eg with V denoting the set of nodes and E denoting the set of edges. Each edge e[E can be represented by a triplet (u,v,t) with u and v the start and end node respectively and the type t of the edge, defining properties such as whether it is directed or undirected.
Adopting the terminology of [14], a subgraph SG is defined as a string of tokens representing the specific subgraph topology. As (anti-)parallel edges are not allowed, a subgraph of s nodes has a maximum of K~s (s{1) 2 edges and can be represented by a string of K tokens, with each token encoding the type of the edge (including the direction) at that position. The first token in a subgraph string denotes the edge going from node 1 to node 2, the next 2 tokens denote the edges going from node 1 and 2 to node 3 and so on. The string of tokens ''ABCDEF…'' thus denotes a subgraph with an edge of type A from node 1 to node 2, an edge of type B from node 1 to node 3, an edge of type C from node 2 to node 3, an edge of type D from node 1 to node 4 and so on. A more elaborate description is given in [14].
In ISMA, the main goal is to reduce the search space by carefully selecting the next node to be matched. At any given stage in the search process, a partial mapping of graph nodes (in G) to subgraph nodes (in SG) is maintained. Initially, the mapping is empty as no nodes have been matched. The algorithm then selects a subgraph node based on the number of candidate nodes in G that can be mapped on that node. At the start of the algorithm, the candidate set for each node is constructed based on the edges arriving/departing in/from that node. For example, if the first node sgn 1 in the subgraph has an outgoing edge of type A and an incoming edge of type B, the candidate set C 1 is calculated as the intersection of the set of nodes in G with outgoing edges of type A and the set of nodes in G with incoming edges of type B. The subgraph node (sgn s ) with smallest candidate set is then selected to investigate next.
Once sgn s is found, the (partial) mapping is expanded by iteratively mapping every graph node n in its candidate set to sgn s . Mapping n to sgn s introduces new constraints for the rest of the subgraph instance: if a subgraph node sgn k is connected to sgn s with an edge of type B, the graph node mapped to sgn k also has to be connected to n with an edge of type B. To select the next node to investigate, the constraints are incorporated in the candidate sets by intersecting the old candidate sets with the neighbour set of the newly mapped node n. The mapping process can then repeat for each node in the subgraph. By iteratively mapping and backtracking, all subgraph instances in the graph are found.
The subgraph node selection method is optimised in ISMA to avoid unnecessary set operations. As described above, the candidate set C i is calculated as the intersection of a number of other sets. Instead of calculating all C i sets (sg in total) each requiring intersecting a few sets to find the smallest candidate set, ISMA keeps track of the smallest set for each subgraph node. The size of this set is a heuristic estimate for the size of C i . The algorithm is further optimised by using custom data structures for set operations.

Symmetry in ISMA
While ISMA can find all subgraph instances in a graph for any subgraph, it was designed to exploit symmetric properties. While a brief description of the symmetry-handling in ISMA is given here, the reader is referred to [14] for the full approach. Two nodes have a reflection symmetry if and only if they can be switched without changing the subgraph topology. If a subgraph contains two reflection symmetric nodes sgn a and sgn b , the graph nodes a and b, mapped to sgn a and sgn b respectively, can be switched with the result being a valid subgraph instance. This property can be exploited and allows to only examine half of the search space.
The symmetry is exploited in ISMA by adding extra constraints to the candidate set generation. If subgraph node sgn a is present in the partial mapping, the algorithm takes this into account when mapping sgn b . ISMA will prohibit nodes, that were previously mapped to sgn a , to be considered for mapping onto sgn b .
When the candidate set C a of subgraph node sgn a is determined, it will consist of all nodes that are valid to be mapped onto sgn a and by symmetry also onto sgn b . As described above, ISMA will iteratively add every node in C a to the partial mapping and continue mapping the rest of the nodes. Every time a new node from C a is examined, it is removed from C a before the search is continued. When the candidates for sgn b are determined, ISMA will use C a as a constraint to force nodes in C b to be in C a . This will avoid generating the symmetric counterparts of the subgraph instances. If ½X ,a,Y ,b,Z was previously examined, its counterpart ½X ,b,Y ,a,Z will not be examined as, at that time, a will no longer be in C a and therefore not in C b .
The same basic principle is also applied to cyclic rotations. A subgraph contains a cyclic rotation symmetry if and only if it has a sequence of nodes that can be shifted without changing the subgraph configuration. An example can be seen in the ''XX00XX'' subgraph in Fig. 1.

Symmetry in ISMAGS
While the symmetry handling approach in ISMA performs well for small subgraphs with small reflection or rotational symmetries, it cannot efficiently tackle larger subgraphs with more elaborate symmetric structures. ISMA was only optimized for simple symmetric structures that can be easily detected like reflection symmetries between 2 nodes ( = 2 nodes that can be switched) or ring structures ( = rotation symmetries). It eliminates similar subgraph instances induced by these symmetries but does not handle larger symmetric structures (consisting of multiple nodes being switched) or more complex symmetric structures (in which nodes can be part of multiple symmetric structures and the symmetric properties are less easily detected). As the symmetry analysis in ISMA was limited and did not extract complex multinode symmetries, these were not taken into account which means that multiple similar subgraph instances induced by these symmetries will be returned. Fig. 2 shows valid subgraph instances that differ by only a permutation of the subgraph nodes. As the Figure 1. Subgraph examples. In ''XXXXXX'', every node is symmetric to every other node as, in a clique, all nodes can be swapped. The ''XX00XX'' graph is symmetric as it has rotation symmetry (all nodes can be shifted in the ring) and reflection symmetry (top two nodes can be switched with bottom two nodes for example). The ''XZ00ZY'' graph is also symmetric as the same configuration is obtained when node 1 is switched with node 2 and node 3 with node 4. While the G4 graph has no symmetric properties, the tetrahedron and the Petersen graph [25] have more complex symmetric structures. doi:10.1371/journal.pone.0097896.g001 degree of symmetry increases, listing all possible permutations becomes very time-consuming. A more efficient approach would be to avoid finding the permuted instances by reducing the search space.
This section deals with the modified approach in ISMAGS to successfully handle all symmetric structures in a subgraph. The main idea of ISMAGS is to detect the symmetric properties in the subgraphs and convert them to pruning rules for the search space. Detecting the symmetries is detailed in the next paragraph, followed by the symmetry-breaking approach and a description of the data structures used. In general, breaking symmetry means that the information of the symmetries is used to develop rules or constraints to simplify the search and speed up calculations. By fully breaking the symmetry, the set of subgraph instances returned by ISMAGS is minimal as a single subgraph instance will only be exported once while similar subgraph instances induced by the subgraph symmetry are omitted.

Symmetry detection
The first step in ISMAGS is to determine the symmetric properties of the subgraph under examination. While the subgraph isomorphism problem is NP-complete, several basic techniques have been developed to minimise the required work. These techniques and how they are used in ISMAGS to develop a custom symmetry-breaking approach are explained next.
Subgraph partitioning. The basis for the symmetry detection is derived from Nauty [5]. The basic mechanisms are explained here but the reader is referred to [5,15] for more details. The analysis starts by grouping the subgraph nodes sgn i based on their incoming and outgoing edges as only nodes with similar properties could be symmetric to each other. The nodes are first grouped into an ordered partition p~½W 1 DW 2 D:::DW p with every node in a cell W i having the same number of outgoing/incoming edges to/from each of the other cells. Fig. 3 illustrates how these partitions are formed. In the initial partition, all nodes are put in the same cell W 1 . Every node is then analysed by determining the source/target cells of its edges. Nodes 1 and 2 both have 1 X -edge to node in W 1 and 1 outgoing Z-edge to another node in W 1 . However, the top 2 nodes have different edge properties compared to the bottom 2 nodes. If nodes within one cell do not have the same properties, the partition needs to be refined. Cells containing nodes with different properties are split to ensure partition validity. The top nodes of the subgraph are separated from the bottom nodes by introducing a new cell. This leads to the partition in the bottom of Fig. 3. The nodes are then again analysed by their edges as the introduction of new cells can require splitting up other cells. This process is repeated until all nodes in the same cell have the same properties.
Ordered partition pair. The actual subgraph symmetry analysis uses 2 partitions p t~½ T 1 DT 2 D:::DT t and p b~½ B 1 DB 2 D:::DB b to analyse the subgraph. An example of the analysis can be found in Fig. 4. The 2 partitions together from an ordered partition pair (OPP). This OPP will be used to investigate symmetric structures by simultaneously refining both partitions. If a subgraph has symmetric structures, the refinement will have multiple branching points which lead to the symmetries in the subgraph. By exploring all branches during the analysis, all symmetries can be found.
Coupling. The 2 partitions in the initial OPP are identical and follow from the initial partitioning of the subgraph. Fig. 4 shows the symmetry-breaking algorithm for the ''XX00XX'' subgraph. The partitioning in the initial OPP at the top of the search tree consists of a single cell as all nodes have 2 edges to other nodes. The different branches in the OPP search tree are then separately investigated by selecting one of the subgraph nodes in one of the cells of p t and mapping it to all subgraph nodes in the corresponding cell in p b . In the remainder of this work, each such mapping is referred to as a coupling. When coupling, the nodes are selected in the order of increasing node ID. The node with the smallest ID among all unmapped nodes gets selected first, both in the top partition (of which only 1 node is chosen) as in the bottom partition (for which coupling iterates over all nodes in order of increasing ID).
Recursive refinement. A coupling operation thus maps a node (sgn t ) from a top partition cell to a node (sgn b ) in the corresponding bottom partition cell. This is done in the partitions by putting sgn t and sgn b in a newly created cell in their respective partitions. The coupling is followed by a refinement operation on each of the partitions (top and bottom). As mentioned above, each partition groups nodes with identical properties. By introducing a new cell (for sgn t or sgn b ), the grouping might no longer be accurate. Nodes with edges to/from sgn t now have edges to/from the new cell while previously those edges were to/from the original cell of sgn t (analogously for the bottom partition with sgn b ). This change in properties needs to be incorporated in the partition, possibly yielding more cells to ensure that all nodes in one cell have the same properties. The partition is refined until it is completely valid again. Note that while the refinements of p t and p b are done separately, each top partition cell will correspond to a bottom cell. The coupling and refinement operations are done recursively until all nodes have their own cell. At that point, there is a one-on-one correspondence between nodes in the top partition and nodes in the bottom partition, yielding a valid permutation.
In the example of Fig. 4, subgraph node 1 from T 1 is coupled to the 4 possibilities in B 1 . This splits up T 1 and B 1 as node 1 gets its own cell because of the coupling. It also leads to the further refinement of the initial OPP as in both partitions for nodes 2 and 3, one of the incoming edges is coming from the new cell while for node 4 that is not the case. As shown in Fig. 4, the recursive refinement leads to the full subgraph analysis.
The key strength of using the OPPs in the analysis is that the discovered symmetry relations can be used to prune and speed up the analysis. When the coupling of a node in p t to a node in p b leads to 2 partitions with a different number of cells, the configuration can be discarded as no symmetries can be found. Even more extensive pruning is used in the original Nauty algorithm and its successors but is simplified here as the analysis in ISMAGS is combined with symmetry-breaking constraints (see next section) that modify the search process.
Orbit pruning. Orbit pruning [5] is a group theoretical optimisation technique to narrow down the search space. During the analysis of the subgraph, a set of generating permutations is built for the automorphism group A of the subgraph. In the example of Fig. 4, the set of generating permutations consists of P 1 and P 2 . The automorphism group A is a permutation group and will act on the set S of all possible permutations on the subgraph nodes. The effect of A can now be analysed in 2 ways.
First, the effect of permutations P k [A on individual subgraph nodes sgn i [V is considered. Starting from an element e[S, the permutation permutes the individual nodes to different positions. In general, the element e is transformed in P k (e) by permuting the individual subgraph nodes in the element. e~½sgn a ,sgn b ,::sgn s ?P k (e) ½sgn x ,sgn y ,:::sgn z ,fa,b,:::,sg~fx,y,:::,zg ð1Þ The image P k (sgn i ) of sgn i under P k can then be used to define the orbit partition. This is a partition of V into disjoint cells C j for which holds that V~fC 1 DC 2 D:::DC z g ð 2Þ More intuitively, when an automorphism is applied, a node can only be mapped to itself or another node in its orbit partition cell. Every time a permutation is found during the symmetry detection, the orbit partition is updated by merging the orbit partition cells of nodes that can be mapped to each other. In Fig. 4, P 1~( 23) is found as an automorphism. The image P 1 (2) of node 2 is node 3 (and vice versa) and so they will share the same cell in the orbit partition.
For the second analysis of the effect of A, consider the relations between the elements of S. Given an element e of S, its orbit is the set of all elements of S that can be found by combining all permutations in A to e. The orbits themselves form a partition of the set S of all possible permutations on the subgraph nodes.
Orbit pruning relies on this partitioning to prune parts of the search space. As one element suffices for generating the entire orbit (by applying the generating permutations), the analysis can omit parts of the search space that would result in redundant generators. The orbit partition is used to detect these cases and to abort the search. A more detailed application of orbit pruning can be found in [16].
Without orbit pruning, the subgraph symmetry analysis described above continues finding all permutations after the necessary set of permutations is already exported. As only that set is needed to generate all permutations, orbit pruning is used to reduce the search space. In the example of Fig. 4, the orbit partition gets updated when permutation P 1~( 23) is found and again when P 2~( 12)(34) is exported. For P 1 , node 2 and node 3 are put in the same cell in the orbit partition while P 2 merges the cell of node 2 with the cell of node 1 and the cell of node 3 with the cell of node 4. This results in all nodes being in a single cell. When the subgraph analysis is continued after finding P 2 , node 2 in the top partition would be coupled to node 4 in the bottom partition but as node 2 and 4 share the same cell in the orbit partition, the coupling can be pruned. Backtracking further in the analysis would lead to top node 1 being recoupled to bottom nodes 2, 3 and 4 but this can be omitted analogously.

Symmetry breaking
The symmetry detection in ISMAGS results in a set of permutations of the subgraph nodes and an orbit partition. The permutations can be applied on any valid subgraph instance to produce other valid instances. To avoid generating and exporting instances that can be obtained through permuting previously found instances, the symmetry needs to be broken. To explain the symmetry breaking in ISMAGS, some group theoretical concepts first need to be introduced.
Stabilisers. In general, permutations in a permutation group G act on sequences of nodes by switching/swapping nodes to different positions in the sequence. However, some permutations do not change all nodes and leave some nodes on their original position. These permutations can be used to define stabilisers. For every node i, the stabiliser i G is defined as Stabiliser chains. Using the concept of stabilisers, a stabiliser chain of (sub-)groups G 0 ,G 1 ,:::,G s is defined for a permutation group G, acting here on the set of subgraph node permutations. Each group in the chain is a subgroup of the previous (G i 5G i{1 ) starting with G 0~G . Formally, the groups are defined as More intuitively, the permutations in G 1 are those permutations that do not change node sgn 1 while the permutations in G 2 are those permutations in G 1 that do not change node sgn 2 and thus leave 2 nodes unchanged. Coset representatives. The coset representative set C i of a subgraph node sgn i is defined as the set of subgraph nodes sgn x to which sgn i can be mapped in G i{1 .
Vi~1::s : C i~f sgn x DP(sgn i )~sgn x ,P[G i{1 g ð 8Þ Given G 1 in subgraph ''XX00XX'', C 2~f sgn 2 ,sgn 3 ,sgn 4 g as sgn 2 can be mapped to any subgraph node in C 2 by the permutations in G 1 . Given the above definitions, the subgroup G i consists of those permutations that leave the first i nodes unchanged. As shown in [17], stabiliser chains can be converted to symmetry-breaking constraints that fully break the symmetry of the subgraph. However, contrary to [17], the G i groups are not fully generated in ISMAGS. To generate the constraints, only the coset representatives for each subgraph node are needed. These coset representatives are generated during the symmetry analysis phase.
Determining coset representatives. Recall that during the symmetry analysis, nodes are coupled according to increasing ID. In the search tree, constructed by the couplings, the first leaf node to be reached is the identity permutation. When backtracking starts, the couplings are undone by replacing the last coupling, say sgn k ?sgn k , with a new coupling, say sgn k ?sgn l . Undoing the mapping can be interpreted as looking for permutations that leave the first k{1 nodes in place but permute the remainder of the subgraph nodes. If such a permutation is found, it belongs to the G k{1 group. In Fig. 4, when P 1~( 23) is found, it is a permutation leaving the first node unchanged and thus belongs to the G 1 group. As detailed above, when a permutation is found, the orbit partition is updated. After P 1 is found, the orbit partition cell O 2 of node sgn 2 will be merged with the cell of node sgn 3 .
Subsequently, the algorithm continues by backtracking and uncouples sgn 2 ?sgn 3 . As all possible couplings for sgn 2 are tested, ISMAGS further backtracks and uncouples sgn 1 ?sgn 1 . However, at this point in the analysis, the coset representative set C 2 can be determined. Orbit partition cell O 2 only contains subgraph nodes sgn l that can be mapped to sgn 2 without remapping the first node as the first node has thus far remained mapped to itself due to the coupling. Cell O 2 thus corresponds to the set of coset representatives of sgn 2 .
Following the example above, the coset representative set C i is found for every node as an intermediate orbit partition cell. Coset representative set C i is set equal to its orbit partition cell when all possible mappings sgn i ?sgn j are evaluated and sgn i{1 ?sgn i{1 is to be undone next. While the orbit partition is updated every time a permutation is found, the C i sets are not.
Note that not all nodes will have coset representatives generated. During the creation of the initial chain of couplings that results in the identity permutation, some nodes are mapped to themselves by the couplings while other nodes are mapped to themselves by the refinement procedure. For the latter nodes, no coset representatives will be generated as, with the first nodes fixed, they can only be mapped to themselves.
For the example in Fig. 4, C 2 is set as f2,3g after P 1 is found, right before sgn 1 ?sgn 1 is undone. Set C 1~f 1,2,3,4g is found similarly at the end of the subgraph symmetry analysis, after sgn 1 ?sgn 4 is investigated. Sets C 3 and C 4 can be omitted as they can only be mapped to themselves if sgn 1 and sgn 2 need to remain fixed.
Generating symmetry-breaking constraints. To break the symmetry in the subgraph, the coset representatives are converted into constraints on the IDs of the graph nodes mapped to the subgraph nodes, as shown in [17]. For every subgraph node sgn j in the coset representative set C i of subgraph node sgn i , a constraint is introduced to force the ID of the graph node mapped to sgn j to be higher than the ID of the graph node mapped to sgn i . More formally, the following constraints are introduced to be used during the node mapping in ISMAGS.
While the constraint generation is done as in [17], ISMAGS does not require explicitly generating the stabiliser chains (introduced by [18]) as only the coset representatives are necessary and found during subgraph symmetry analysis.
The approach to breaking symmetry in ISMAGS begins with a subgraph symmetry analysis derived from Nauty [16]. In general, Nauty generates a set of generating permutations for the automorphism group of the subgraph. However, the set of generating permutations generated is not necessarily unique. For the ''XXXXXX'' subgraph in Fig. 1, a generating set of permutations could be Q 1~f (12),(123),(1234)g while Q 2~f (12),(23),(34)g, generated by ISMAGS, is equally as valid for generating the full set of permutations. By tuning the order of the node coupling, ISMAGS embeds the stabiliser chains into the search process. The permutations found are generators for as many subgroups G i as possible while the coset representatives can readily be found. This eliminates the need for explicitly generating all possible permutations, the stabiliser chains and coset representatives as in [17].

Integrating symmetry detection and symmetry breaking with ISMA
With the symmetry-breaking constraints described in the equation above, a full description of ISMAGS can now be given. While ISMAGS reuses the basic principles of ISMA to limit the search space, it goes much further in the subgraph analysis and pruning. A pseudocode description of the different steps is given in Table 1 in Algorithm 1. The actual subgraph instances are found and exported in the mapNodes function.
The search for all instances starts with the analysis of the subgraph specification for symmetric properties. This analysis is detailed in the previous sections and results in a set of constraints between the IDs of the graph nodes in a mapping. These constraints are stored for each subgraph node and used for determining the candidates for a specific subgraph node. After subgraph analysis, the search for subgraph instances, similar to ISMA, begins on line 3 of Algorithm 1 in Table 1 by creating the first candidate node lists based on the subgraph configuration. As in ISMA, the candidates for a subgraph node are determined by intersecting collections of nodes. However, the collections in ISMAGS are lists of ordered nodes based on node ID. Sorting of the lists of nodes and neighbours only needs to be done once for every network as the networks can be stored with the lists sorted. Using ordered lists accommodates quick subset selection (detailed further below) during node mapping. The candidates for a subgraph node are determined based on the edges in the subgraph specification. For each of its edges, the corresponding list of graph nodes is added to a set of lists. Once all lists are known, their intersection gives all graph nodes that have edges of the required types. Note that, as in ISMA, the intersection of the starting sets is delayed until after the subgraph node is selected to avoid calculating intersections that are not used.
Once the initial subgraph node is determined, the candidate node list is generated (see line 11 of Algorithm 1 in Table 1). This candidate node list is calculated with a linear sweep over the different (ordered) lists while all other intersections in ISMAGS are calculated by checking node membership to all lists (as in ISMA). As node lists are large at this point, a linear sweep is more efficient than node-by-node set membership tests.
The mapNodes function is very similar to the approach in ISMA and recursively maps graph nodes to subgraph nodes to find all subgraph instances. Every time a graph node is mapped to a subgraph node, the neighbours of the graph nodes are taken into account as new constraints for the remaining unmapped nodes. If graph node a is mapped to subgraph node sgn i and sgn i has an edge of type e to subgraph node sgn j , the candidate graph node for sgn j needs to have an edge of type e from a. Note that the direction of the edges is taken into account in the edge type. Once the lists of neighbouring nodes are added to the constraint sets, the next node to investigate is determined.
The next subgraph node to examine is determined on line 13 heuristically as in ISMA. The intersection of lists is delayed to avoid unnecessary work. Instead, the lists that need to be intersected are stored separately per subgraph node. The size of the intersection is estimated by the size of the smallest list of that subgraph node. This is an upper bound on the actual size and trades off a slightly larger search space for less list intersections. To further optimise the calculation of the candidate list, ISMAGS disregards the lists introduced during initialisation (see Algorithm 2 in Table 2). These lists are generally very large (they list all nodes with an edge of the correct type) and contain little useful information, as the lists only indicates the presence of an edge which is verified each time a node is mapped. Omitting the lists results in shorter computation times during intersection with only a limited increase in search space size.
The key difference with ISMA is that when a subgraph node is selected for examination, the actual intersection is calculated using the symmetry-breaking constraints. The constraints are combined with the partial instance constructed so far to determine boundaries for the node ID of the graph nodes to be mapped on the subgraph node. If for the selected node sgn i a constraint ID i vID j was generated and node sgn j is already mapped, this gives an upper bound on the ID of a candidate for sgn i . A lower limit is found analogously by considering the set of subgraph nodes which should have a smaller ID. The boundaries can then be used during intersection, allowing to skip the nodes with IDs outside the allowed range. As the lists to be intersected are sorted, binary search can be used to quickly find the start and ending point of the sublist of valid IDs.
While the symmetry breaking is the main source of the performance gain, additional speed-up could be gained by maintaining extra state. As explained above, the candidate sets are stored in memory as ordered lists to allow quick retrieval of nodes within a specific ID range. When lists are intersected, the intersection is determined by iterating over all nodes (in that valid ID range) and checking for membership of the other lists. In ISMAGS, this checking is done by using binary search on the sorted lists. To speed up this operation, a copy of the list could be maintained in a hash-based set. This would allow to check membership in overall constant time (O(1)), whereas binary search requires logarithmic time (O(log(n)), n = number of entries in the list). Experiments show that an additional 5% speed-up could be gained by using this optimisation at the cost of almost doubling memory requirements. As memory is often a bottleneck in biological networks, the optimisation was not included in the presented version of ISMAGS. The basic pseudocode of the subgraph analysis (to generate the constraints) is given by Algorithm 3 in Table 3. The initial OPP is constructed based on an initial partitioning of the input subgraph as detailed in the symmetry detection section above. The OPP is then recursively refined to find all symmetries (and constraints), as described in the two previous sections. As mentioned above, ISMAGS tunes the order in which the nodes are coupled in the OPPs by always selecting the node with the lowest ID first as shown in lines 5 and 9 of Algorithm 4 in Table 4. The constraints are generated on line 17.

Results
To illustrate the performance of ISMAGS, the algorithm is benchmarked against previously published results and algorithms. After a description of the algorithms and network data use, ISMAGS is compared against it predecessor ISMA to show the effects of the added symmetry breaking and related optimisations. In addition, ISMAGS is compared against the VF2 algorithm and the subgraph enumeration algorithms of Grochow-Kellis [7] (denoted by GK) and the G-Trie algorithm [6].

Algorithms
The ISMA and ISMAGS algorithms were implemented in Java (version 1.6.0_26) while for the VF2 experiments, the VFLibrary (http://mivia.unisa.it/datasets/graph-database/vflib/) was used. To perform the GK experiments, the authors of [7] provided the original (Java) code for their algorithm from which the code for subgraph enumeration was extracted. This was necessary as the GK algorithm is a network motif finding algorithm while we present a subgraph enumeration algorithm. The experiments for the G-Trie results were done using the reference implementation on the homepage of the G-Trie author (http://www.dcc.fc.up.pt/ gtries/). An implementation of the ISMAGS algorithm is freely available at https://github.com/mhoubraken/ISMAGS.
The experiments were performed on a single core of an Intel Core 2 Duo P8400 processor clocked at 2.26 GHz with 4 GB of RAM under a 64-bit Windows installation. To remove the influence of memory operations, the reported times exclude the reading of the networks and writing to memory of the subgraph instances found. The times reported in the results thus only pertain to the time needed to look for all possible matches in the search space and, if applicable, the time needed to analyse the subgraph for symmetries. Most of the results were averaged over 1000 runs. However, some test instances were limited to fewer runs as the  long calculation times were prohibitive for more elaborate testing. The number of runs used for averaging is shown along with the results. When fewer runs were used, this is denoted with an asterisk or circle, depending on the number of runs.

Network data
The input networks for the experiments are similar to the networks from [14]. They are denoted as biological, Slashdot and SNAP (based on the source of the data) and their properties can be found in Table 5. Note that while only results are shown for these 3 types of networks, ISMAGS can be used for subgraph matching in any graph with multiple edge types.
The biological networks comprise two networks with multiple edge types. The first network pertains to physical (P, undirected), genetic (G, undirected) and signalling (S, directed) interactions between kinases and phosphatases in yeast [19,20]. The second network consists of protein-protein interaction in yeast (X, undirected, obtained from the BioGRID [21] database), proteinprotein interactions in humans (Y, undirected, obtained from the BioGRID and STRING [22] databases), and orthology relations between human and yeast proteins (Z, bipartite, from the InParanoid database [23]). Additionally, both networks were modified to obtain additional test networks. The reduced PGSnetwork is constructed by interpreting all edges in the PGSnetwork to be undirected and of the same edge type. The ABZnetwork is derived from the input files of the XYZ-network. The X-and Y-edges, specified as ''node 1 node 2 '' on individual lines in their respective edge file, are interpreted as directed A-and Bedges going from node 1 to node 2 .
The Slashdot network [24] represents ''friend'' and ''foe'' relations between users of the technology-centred Slashdot community. A group of friends in which everyone is a friend of each other can be represented as an F-clique subgraph while two friends with a mutual enemy can be represented by a ''FEE'' subgraph. Note that the edges are assumed to be undirected. Finding subgraphs in this social network is an example of how ISMAGS can be used to mine social data.
The third set of networks consists of some of the networks available in the SNAP database (found at http://snap.stanford. edu/data/). The Wiki-Vote network represents votes cast by users during Wikipedia admin elections, the p2p-Gnutella08 and p2p-Gnutella30 are 2 snapshot of the Gnutella peer-to-peer network and the CA-CondMat and CA-HepTh networks are collaborations networks based on co-authorship of papers published in the arXiv repository in the Condense Matter and the High Energy Physics -Theory category, respectively. For these networks, the 3and 4-node cliques are searched as well as the instances of the subgraphs ''XxXXXX'' (tetrahedron) and ''xXxXxx'' (G4). Note that, during the search for the instances of the cliques, the edges in the networks are considered to be undirected while they are considered to be directed during the search for the instances of the tetrahedron and the G4 graphs. This difference in interpretation allows to show the performance of the symmetry handling of the algorithms on both directed and undirected networks. The tetrahedron subgraph was selected to be searched for as it contains a relatively simple symmetric structure that was not taken into account in ISMA. The G4 graph was selected to show the effects of the optimisation of the set operations in ISMAGS (compared to ISMA) as it does not have any symmetric properties.
Note that the number of nodes and edges reported in Table 5 can differ from the counts of the original data source. This is a result of network preprocessing. Aside from removing unconnected nodes and (anti-)parallel edges, the preprocessing also ensured that only 1 edge is present between any pair of nodes. While  The reported #instances is the number of subgraph instances as exported by ISMAGS. The search space size is the #nodes visited during the search process. SPRF is the search space reduction factor and is calculated as the ratio of the search space size for ISMA to the search space size for ISMAGS. The speed-up factor is defined analogously for the calculation time. All reported timings are averaged over 1000 runs unless denoted by an asterisk *, in which case only one test was performed as the long computation times were prohibitive for more elaborate testing. doi:10.1371/journal.pone.0097896.t006 The reported parameters are analogously defined as in

ISMA versus ISMAGS
To show the advantages of incorporating the symmetric information in the search, Table 6 compares the performance of ISMA to ISMAGS on the biological networks. The search space reduction factor (SPRF) varies depending on the subgraph.
For the subgraphs with SPRF w1, the reduction can be attributed to various factors. For the subgraphs with limited symmetry (''SsS'', ''SsG'', ''PGSPGS''), the reduction is due to a quicker termination of uninteresting paths. When the mapping of a node would result in empty candidate lists for an unmapped node, this is detected in ISMAGS before the node is mapped and can quickly be terminated. For the subgraphs with large symmetric structures (Petersen graph of Fig. 1, XYZ subgraphs), the reduction comes from the symmetry-handling which was not fully exploited in ISMA, showing the benefits of the improved symmetry-breaking approach.
For the ''ssG'' subgraph, a small increase in search space is present. This is due to the omission of list membership tests of the large initial lists of candidates. While omitting these tests allows to map candidates faster, it increases the search space slightly as some graph nodes get examined while they do not have all required edges. However, the speed gain of omitting these tests still outweighs the slight increase.
The SPRFs around 1 indicate that ISMA and ISMAGS follow the same path through the search space. This is primarily the case when the symmetry in the subgraphs is incorporated in ISMA (e.g. ''GGG'', ''SSS'', cliques). Interestingly, ISMAGS still reduces query times for these subgraphs due to the list-based implementation of its symmetry breaking. When calculating candidate sets for the subgraph nodes involved in the symmetric structures, ISMA removes nodes from constraints sets before calculating the intersections of its sets. This ensures that no nodes get mapped in symmetric configurations. However, to calculate the intersection, ISMA still needs to intersect the different sets. ISMAGS uses the ID-based constraints derived during symmetry analysis to avoid most work during list intersection. Using the constructed partial mapping and the constraints, ISMA can quickly find the interesting range of nodes in the to-be-intersected lists and ignore the remainder. This reduces the time needed for candidate set generation and improves execution time results. Additionally, as explained above, ISMAGS omits the large initial lists of candidate graph nodes when calculating candidate set lists once the initial node is determined. This reduces calculation time as less list membership needs to be checked.
The speed-up factor in Table 6 shows the ratio of the calculation time of ISMA to the calculation time of ISMAGS. For most investigated subgraphs, the speed-up factor is larger than 3, indicating that ISMAGS only needs a third of the calculation time of ISMA. While ISMAGS gives a speed-up for the subgraphs previously published [14] due to the more optimised list-based implementation, it was designed to handle more complex symmetries in the subgraphs. This can be seen in the results of the Petersen graph and the line graphs. The Petersen graph [25] is inherently very symmetric as one instance can be permuted in 120 other instances. The line graphs (''P0P'', ''P0P00P'' and ''P0P00P000P'') consist of n nodes, connected by n{1 edges, with node n i connected to n iz1 for i~1::n{1. As such, the line The top row denotes, for each algorithm, the number of runs averaged to obtain the reported timing results. However, for results denoted with an asterisk, only 1 run was performed. For the G-Trie algorithm, some results are missing as the size of the subgraph ( = 10 nodes) was not supported by the reference implementation. The results for the ''P0P'', ''P0P00P'' and ''P0P00P000P'' are also omitted as the algorithm did not support ''don't care'' links. doi:10.1371/journal.pone.0097896.t008 graphs correspond to chains of nodes. The symmetry in these graphs is limited to reversing the chains, as every subgraph instances can be read left-to-right and right-to-left. While this is a simple symmetry, it involves multiple nodes being remapped, which was not fully incorporated in ISMA. An analogous analysis was done on the Slashdot and SNAP networks and can be found in Table 7. The test instances with SPRF~1 (cliques) have similar search spaces between the 2 algorithms. For the G4 graph, the SPRF slightly increases, indicating that ISMAGS mostly traverses the same search space as ISMA (as no symmetry could be exploited), with the exception of a few optimised list selections. The results for the tetrahedron show a SPRF w1, indicating that the symmetric structure in the tetrahedron is taken into account by ISMAGS but not by ISMA. The speed-up factors are slightly lower for the Slashdot test instances than for the biological and SNAP networks. This is mostly due to the fact that ISMA already incorporates most of the symmetric information in the subgraphs and thus leaves less room for improvement for ISMAGS. However, ISMAGS still speeds up the searches. The speed-up factors for the G4 graph are also slightly lower than for the other graphs as no symmetry could be used to speed up the search.

Full comparison
In addition to ISMA, the ISMAGS algorithm was compared to 3 similar algorithms, viz the VF2, the GK and the G-Trie algorithm. Tables 8 and 9 show the timing results of the VF2 algorithm, the subgraph enumeration algorithm from [7] and the G-Trie algorithm for some of the subgraphs from the previous section along with the results of ISMAGS. Note that not all test instances from the previous table are repeated as the reference algorithms implementations, in contrast to ISMA and ISMAGS, were not designed to support multiple edges (of different types) between a pair of nodes. Table 8 shows the results for the test instances on the biological networks. Compared to the VF2 algorithm, the GK algorithm reduces query times by exploiting the symmetry in the cliques. As the cliques have more nodes, the symmetry breaking increasingly prunes the search space (e.g. for the 10-clique, only 1 match out of 10! permutations is retained). ISMAGS however further reduces computation times by the optimised matching processes and symmetry breaking described in the previous sections. For most instances, ISMAGS reduces query times by 1-2 orders of magnitude compared to VF2 and the GK algorithm.
Of the reference algorithms, the G-Trie algorithm was the only one able to match the performance of ISMAGS. While G-Trie Table 9. Comparison between ISMAGS, VF2, GK and G-Trie on the SNAP networks.