Graphettes: Constant-time determination of graphlet and orbit identity including (possibly disconnected) graphlets up to size 8

Graphlets are small connected induced subgraphs of a larger graph G. Graphlets are now commonly used to quantify local and global topology of networks in the field. Methods exist to exhaustively enumerate all graphlets (and their orbits) in large networks as efficiently as possible using orbit counting equations. However, the number of graphlets in G is exponential in both the number of nodes and edges in G. Enumerating them all is already unacceptably expensive on existing large networks, and the problem will only get worse as networks continue to grow in size and density. Here we introduce an efficient method designed to aid statistical sampling of graphlets up to size k = 8 from a large network. We define graphettes as the generalization of graphlets allowing for disconnected graphlets. Given a particular (undirected) graphette g, we introduce the idea of the canonical graphette K(g) as a representative member of the isomorphism group Iso(g) of g. We compute the mapping K, in the form of a lookup table, from all 2k(k − 1)/2 undirected graphettes g of size k ≤ 8 to their canonical representatives K(g), as well as the permutation that transforms g to K(g). We also compute all automorphism orbits for each canonical graphette. Thus, given any k ≤ 8 nodes in a graph G, we can in constant time infer which graphette it is, as well as which orbit each of the k nodes belongs to. Sampling a large number N of such k-sets of nodes provides an approximation of both the distribution of graphlets and orbits across G, and the orbit degree vector at each node.


Introduction
Network comparison is a growing area of research. In general the problem of complete comparison of large networks is intractable, being an NP-complete problem [1]. Thus, approximate heuristics are needed. Networks have been compared for statistical similarity from a high-level using simple, easy-to-calculate measures such as the degree distribution, clustering coefficients, network centrality, among many others [2,3]. While more sophisticated methods such as spectral analysis [4,5] and topological indices [6] have been useful, the study of small subnetworks such as motifs [7] and graphlets [8,9] have become popular. They have been used a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 extensively to globally classify highly disparate types of networks [10] as well as to aid in local measures used to align networks [11][12][13][14].
A graphlet is a small, connected, induced subgraph g of a larger graph G. Given a particular graphlet g, the automorphism orbits of g are the sets of nodes that are topologically identical to each other inside g. Graphlets and their automorphism orbits with up to k = 5 nodes were first introduced in 2004 [8], and are depicted in Fig 1. Recently, automated methods have been created that can enumerate, in a larger graph, all graphlets and their automorphism orbits up to graphlet size k = 5 [15] and subsequently to any k [16], although the latter authors only applied it up to k = 6. Unfortunately, we have found that these methods take a very long time (hours to The numbering of these graphlets and orbits were created by hand [8] and do not correspond to the automatically generated numbering used in this paper. The figure is taken verbatim from [16]. days) even just to count graphlets up to size k = 5 on some large biological networks, such as those in BioGRID [17]. It is not clear that such methods, especially for even larger k, will be applicable to the coming age of ever bigger networks, since the total number of graphlets appearing in a large network tends to increase exponentially with both k (the graphlet size) and n (the number of nodes in the large network). Eventually, an exhaustive enumeration of all graphlets appearing in a large network may become infeasible simply due to the number of graphlets that need to be enumerated, even under the optimization of using orbit counting equations. On the other hand, graphlets are too useful to abandon as a method of quantifying the topological structure of graphs. An achievable alternative for a large network G is to statistically sample its graphlets rather than exhaustively enumerate them. Additionally, such sampling could be useful with the recent advent of comprehensive biological network databases [18]: each sampled graphlet would act as a seed for local matching between larger networks, similar to how k-mers (short sequences of length k) are used for seed-and-extend sequence matching in BLAST [19].
To efficiently create a statistical sample of graphlets in a large network G, one must be able to take an arbitrary set of k nodes from G, and efficiently (preferably in constant time) determine both which graphlet is represented, as well as the automorphism orbits of each of the k nodes. Here, we solve this problem both by enumerating all graphlets (and their disconnected counterparts, which we term graphettes) and their automorphism orbits up to graphettes of size k = 8. We present a method that creates a lookup table that can quickly determine the graphette identity of any k nodes, as well as their automorphism orbits. Since the lookup table required significant time to pre-compute for k = 7 (a few hours on a single core) and k = 8 (hundreds of CPU weeks on a cluster), we provide the actual lookup tables for these values of k online at http://github.com/Neehan/Faye.

Definitions and notations
Given a graph G on n nodes, a k-graphette is a (not necessarily connected) induced subgraph g on any set of k nodes of G. There are many ways one could choose the k nodes, for example (i) choosing k nodes uniformly at random from G, or (ii) performing a local search around some node u. We expect the former to be useful only in dense networks, while the latter is probably more useful in sparse networks because most random sets of k nodes in a sparse graph will be highly disconnected and thus not very informative. One could also (iii) perform edge-based selection (with local expansion) to ensure dense regions are sampled more frequently than sparse regions [20]; still other methods have been suggested [21].
Given a set of k nodes, we wish to quickly ascertain which graphette is represented, and which automorphism orbits each of the k nodes belong to. To do that we need a canonical list of graphettes and their orbits, and a fast way to determine which canonical graphette is represented by any permutation of k nodes. Here we demonstrate how, if k is fixed and relatively small (k 8 in our case), this can be accomplished in constant time by pre-computing and storing a lookup table indexed by a bit vector representation of the lower triangular matrix of the (undirected) adjacency matrix of the induced subgraph. Given such an index, the value associated with that index identifies the canonical graphette (a canonical ordering of the nodes for that graphette). We also pre-compute the automorphism orbits of all the canonical graphettes. Thus, by reversing the lookup table we can, in constant time, infer the orbit identity of each of the k nodes in that k-graphette. As a corrollary, we can also update the (statistically sampled) graphette orbit degree vector of each of the k nodes, similar to the graphlet degree vector [9].
We use the following abbreviations and notations throughout:

Canonization of graphettes
If graphs G and H are isomorphic, it essentially means they are exactly the same graph, but drawn differently. For example, Fig 2 shows three different drawings of the Petersen graph. Technically, an isomorphism between networks G and H is a permutation p : VðGÞ ! VðHÞ so that EðG; u; vÞ () EðH; pðuÞ; pðvÞÞ; Consider a 3-graphette with nodes w, x and y. There are only 4 possible such graphettes, depicted in Fig 3. However, by permuting the order of the nodes, each of these graphettes can be represented by several isomorphic variants. In order to determine if two graphettes are isomorphic, we will represent its (undirected) graph with the lower-triangle of its adjacency matrix. We will place this lower-triangular matrix into a bit vector, resulting in a representation similar to existing ones for orbit identification [16].
We now describe the idea of a canonical representative of each isomorph. To provide an explicit example, consider Fig 4, depicting the three isomorphic configurations of the 3-graphette that has exactly one edge. In order to determine that these graphettes are all isomorphic, we take the bit vector representation depicted, and define the lowest-numbered bitvector among all the isomorphs as the canonical representative. All the other isomorphs in the lookup table point to it. In this way, every graph on 3 nodes can be efficiently mapped to its canonical 3-isomorph.
We also automatically determine the number of automorphism orbits (see below) for each canonical isomorph. Table 1 represents, for various values of k, the number of bits b(k) required to store the lower-triangular matrix of all graphettes on k nodes (i.e., the length of the bit vector used to store this matrix); the resulting total number possible representations of k nodes (which is simply 2 b(k) ); the number of canonical isomorphs NC(k); and the number of canonical automorphism orbits. Note that, to map each possible set of k nodes to their canonical isomorphs, the lookup table has 2 b(k) entries, and each entry has a value between 0 and NC(k) − 1. Note that for k up to 8, the graphettes can be stored in 32 bits. In that case, the maximum space required will be 32 × 2 28 = 1 GB. This is as far as we go, for now. Moore's Law suggests that we may be able to go to k = 9 within a few years, and to k = 10 in perhaps a decade or two. We note that the most expensive part of our algorithm is creating the lookup table between an arbitrary set of k nodes, to the canonical graphette represented by those k nodes; in the absence of a requirement for this lookup table, one could use orbit counting equations [16] to generate automorphism orbits up to k = 12.
Generating the lookup table from non-canonical to canonical graphettes Assume the large graph G has n nodes labeled 0 through n − 1, and pick an arbitrary set of k nodes U = {u 0 , u 1 , . . ., u k − 1 }. Create the subgraph g induced on the nodes in U VðGÞ, and  let its bit vector representation B be of the form lower-triangular matrix described in Fig 4. We now describe how to create the lookup table that maps any such B to its canonical representative.
We iterate through all 2 b(k) bit vectors in order; for each value B, we check to see if it is isomorphic to any of the previously found canonical graphettes; if so, the lookup table value is set to the previously found canonical graphette; otherwise we have a new, previously unseen canonical graphette and the lookup table value is set to itself (B).
When checking for isomorphism between B and all previously found canonical graphettes, we use a relatively simple brute force approach. If the degree distribution of the two graphettes are different, we can immediately discard the pair as non-isomorphic; otherwise we resort to cycling through every permutation of the nodes checking each pair for graph equality, which has worst-case running time of k 2 k!. The total run time to compute the lookup table for a particular value k is thus bounded above by k 2 k! Á NC(k) Á 2 b(k) , where k! is the maximum number of permutations we need to check if a non-canonical matches an existing canonical, k 2 is the worst-case running time to check if 2 specific permutations of k-graphettes are isomorphic, there are at most NC(k) canonicals to check against [22], and 2 b(k) = 2 n(n − 1)/2 is the total number of undirected graphs on k nodes. More sophisticated approaches exist [23], which may more easily allow higher values of k.
This process can also be parallelized, which is what we did for k = 8. Essentially, we can split the 2 b(k) non-canonical graphettes into m sets of about 2 b(k) /m graphettes each, and then spread the computation across m machines. For each of the m sets S i , we loop through all graphettes in that set and mark out which are isomorphic to each other. For each set S i , we will find a set T i of lowest-numbered "temporary" canonical graphettes in S i , along with the map TC: S i ! T i of which graphettes in S i map to each temporary canonical in T i . That is, for each graphette g 2 S i , 9h 2 T i for which the temporary canonical TC(g) = h. Finally, once all the m sets have been evaluated in this way, a second stage passes through all the T i , i = 0, . . ., m − 1, merging the temporary canonicals together into a final, global list of canonical graphettes, while also propagating these globally lowest-numbered canonicals back up through the m temporary canonical  [27]. Note that up to k = 8, together the lookup table for canonical graphettes and their canonical orbits fits into 31 bits, allowing storage as a single 4-byte integer, with 1 bit to store whether the graphette is connected (i.e., also a graphlet). The suffixes K, M, G, T, P, and E represent exactly 2 10 , 2 20 , 2 30 , 2 40 , 2 50 and 2 60 , respectively. maps, so each graphette g globally maps to the globally lowest-numbered canonical; we call this process sifting for canonicals, and it may require several iterations to globally find the final list of canonicals. In this way we ran k = 8 in about a week across 600 cores, for a total of 600 CPU-weeks. This process could probably be made more efficient with smarter isomorphism checking [23,24].

Graph automorphism and orbits
An isomorphism p : VðgÞ ! VðgÞ (from a graph g to itself) is called an automorphism. While an isomorphism is just a permutation of the nodes, it is called an automorphism if it results in exactly the same labeling of the nodes in the same order-in other words exactly the same adjacency matrix. The set of all automorphisms of g will be called Aut(g).
An automorphism orbit, or just orbit, of g is a minimally sized collection of nodes from VðgÞ that remain invariant under every automorphism of g [25]. There can be more than one automorphism orbit, and each orbit can have anywhere from 1 to k member nodes; refer again to Fig 1 for some examples. More formally, a set of nodes ω constitute an orbit of g iff: 1. For any node u 2 ω and any automorphism π of g, u 2 ω () π(u) 2 ω.
2. if nodes u, v 2 ω, then there exists an automorphism π of g and a γ > 0 so that π γ (u) = v. Now, we shall prove a few relevant results that will be useful later for automatically enumerating the orbits. Proposition 1. For each node u 2 VðgÞ and each automorphism p : VðgÞ ! VðgÞ, there exists an integer λ > 0 such that π λ (u) = u.
Proof. Because π is an automorphism, Since jVðgÞj is finite and π is bijective, the conclusion obviously follows. We shall call the set of nodes C p ðuÞ ¼ fu; pðuÞ; . . . ; p lÀ 1 ðuÞg the cycle of u under automorphism π, where λ is the smallest positive integer such that π λ (u) = u. Note that λ is not unique since π λ (u) = π 2λ (u) = Á Á Á = u. Also, π, u, and λ are tied together into triples such that knowing any two determines the third. Corollary 1.1. π maps every node 2 C p ðuÞ to a node (possibly same) 2 C p ðuÞ. Corollary 1.2. In any automorphism π of g, every node appears in exactly one cycle.
Following the same logic, ω 2 ω 1 , implying ω 1 = ω 2 . )( Corollary 2.1. Each cycle appears in exactly one orbit, which completely contains that cycle. Proof. If an orbit ω partially contains a cycle C p ðuÞ, then ω is not invariant under automorphism π, as π will map some node in ω (and C p ðuÞ) to another node outside ω (but still in C p ðuÞ) according to corollary 1.1, contradicting our definition of orbits. Since two orbits are disjoint, C p ðuÞ must appear only in ω, and in none of the other orbits.
These statements are enough to be able to find all orbits of each graphette, as we now demonstrate.

Automatically enumerating all orbits of a graph
From the propositions in the previous section, an algorithm to enumerate the orbits can be constructed like this: 1. Generate all automorphisms of g.

2.
Split each automorphism into its cycles.

Merge the cycles from different automorphisms to form orbits.
Generating all automorphisms of g. Referring to Algorithm 1, the function GENERATEAUTOMORPHISMS() applies every possible permutation of VðgÞ over Adj(g). Each permutation creates an isomorph of Adj(g). If Adj(g) is unchanged under some permutation π, then by definition, π is an automorphism of g. Hence it is saved into Aut(g).
Two optimization strategies are employed: 1. No node is mapped to another node with unequal degree.
2. An automorphism of graph g is also an automorphism of its complement graph g 0 .
In practice, this algorithm generates all automorphisms of all the canonical graphettes up to size 8 in a matter of seconds. Nevertheless, for additional speed up in higher sizes, modern sophisticated automorphism detection algorithms [23,24] may be used.
Splitting automorphisms into cycles. An automorphism π of g is basically a permutation of nodes of g. Hence, to split π into cycles, we can repeatedly apply π over every node u 2 π and remember the nodes u transforms into. This forms the cycle with node u, i.e. C p ðuÞ, which is saved in C. After first visit, each node is marked visited to prevent more visits.
Merging cycles to enumerate orbits. Suppose CðgÞ is the set of all cycles resulting from all the automorphisms of g.
To enumerate orbits from it, first each node u is colored with a unique color ω(u) = u. Then ω(u) is continuously updated to reflect the current color of u, as the nodes belonging to same orbits are gradually colored by identical color.
For the nodes of each cycle c 2 CðgÞ, we save their minimum color in ω min , and then color all of them with ω min . After coloring all the cycles in this way, nodes belonging to same orbits get the same color, and hence, get enumerated. Algorithm 1 Automatically enumerating automorphism orbits of a graph function GENERATEAUTOMORPHISMS (Graph g) Aut(g) = {} // Find the automorphisms of g for each permutation π of VðgÞ do apply π over Adj(g) if Adj(g) == π(Adj(g)) then put π in Aut(g) end if end for end function function GENERATECYCLES (automorphism π) for each node u 2 VðgÞ do ω(u) = u end for for cycle c 2 CðgÞ do let ω min = 1 for node u 2 c do ω min = min(ω min , ω(u)) end for for node u 2 c do ω(u) = ω min end for end for end function

Proof of correctness of Algorithm 1
Here we prove that Algorithm 1 determines every orbit of g.
Suppose a set ω is among the final sets generated by Algorithm 1. We shall prove ω is an orbit of g by showing that it follows the two properties of orbits: 1. Let a node u 2 ω form the cycle C p ðuÞ under automorphism π. The GENERATECYCLES function will apply π repeatedly until it finds a λ so that π λ (u) = u and will therefore determine C p ðuÞ. Since the ENUMERATEORBITS function assigned u to ω, it had also assigned all nodes in C p ðuÞ to ω. Hence u 2 ω () π(u) 2 ω.
2. Suppose nodes u, v 2 ω. Then, either they belonged to a cycle from which they were assigned to a mutual set ω in ENUMERATEORBITS function, or there is a third node w so that w shares separate cycles with u and v under different automorphisms π 1 and π 2 . In the first case, u and v already belong to a common cycle. In the second case, assume p g 1 1 ðwÞ ¼ u and p g 2 2 ðwÞ ¼ v. Consider the permutation 0 ¼ p g 2 2 p À g 1 1 . Since composition of two automorphisms is an automorphism [26], ϕ is also an automorphism. And notice that 0ðuÞ ¼ p g 2 2 ðp À g 1 1 ðuÞÞ ¼ p g 2 2 ðwÞ ¼ v implying u and v belong to a common cycle under ϕ.
Therefore, ω is indeed an orbit of g. Since each node was given a unique orbit color in the beginning of ENUMERATEORBITS, every orbit of g will be eventually found by Algorithm 1.

Results and discussion
Using the algorithms described herein, we have enumerated all possible graphlets, including the generalization of disconnected counterparts called graphettes, up to size k = 8. The code and data can be found in http://github.com/Neehan/Faye. (Note that the github code uses the upper triangle matrix, though we intend to convert it to use the lower tringle as that representation has already been established [16].) We have also enumerated all orbits up to size k = 8. More importantly to the statistical sampling technique described in the Introduction, we have used a bit-vector representation of all possible adjacency matrices of all possible sets of up to k = 8 nodes and created a lookup table from the 2 k(k − 1)/2 k-sets to their canonical graphette representatives. This allows us to determine, in constant time, the graphette represented by these k nodes, as well as the automorphism orbits of each nodes. This allows efficient estimation of both the global distribution of graphlets and orbits, as well as an estimation of the graphlet (or orbit) degree vector for each node in a large graph G.
Although the lookup tables for k > 8 are at present too big to compute or store, we could also use NAUTY or SAUCY to enumerate all the canonical graphettes up to size k = 12, and use our orbit generation code Algorithm 1 to determine all the orbits in all graphettes up to size k = 12. We have verified that previous results are consistent with ours in terms of the number of distinct graphettes [22] and orbits [27] determined, as displayed in Table 1.
In future work we will study which statistical sampling techniques most efficiently produce a good estimate of the complete graphlet and local (per-node) degree vectors. We also intend to study how this method may aid in cataloging of graphlets for database network queries, or in non-alignment network comparison [10]. Finally, there may be ways to combine our method with those of orbit counting equations [15,16] to more efficiently produce samples of orbit counts.