Fig 1.
Graphlet-based graph compression.
(A) Reduced representation of a graph G obtained by contracting subgraphs into colored supernodes representing the subgraphs. (In this example, two different graphlets, colored blue and green, are selected.) The cost of encoding the reduced representation can be split into two parts: (i) encoding the multigraph H obtained by contracting subgraphs in G, L(H, ϕ) (see "Base codes and null models" section), and (ii) encoding which nodes in H are supernodes and their color, designating which graphlet each represents [Eq (7)]. (B) Hierarchy of the four dyadic graph models [56] used as base codes. Each node in the diagram represents a model; an edge between two nodes indicates that the upper model is less random than the lower. The models are the Erdős-Rényi model P(N, E) (cyan), the directed configuration model (orange), the reciprocal Erdős-Rényi model (pink), and the reciprocal configuration model (yellow). (C-E) Encoding the additional information necessary for lossless reconstruction of G from H incurs a cost [Eq (8)] equal to the sum of three terms for each supernode: encoding the labels of the nodes inside the graphlet, i.e., the graphlet's orientation (C), and encoding how the graphlet's nodes are wired to other nodes in H (D,E). (C) Encoding the orientation of a graphlet is equivalent to specifying its automorphism class. For the graphlet shown in the example there are 3 distinguishable orientations, leading to a codelength of log 3. (D) Encoding the connections between a simple node and a supernode involves designating to which nodes in the graphlet the in- and out-going edges to the supernode attach. In this example, there are 36 possible wiring configurations for the in- and out-going edges together, leading to a wiring cost of log 36 (see Eq (9)). (E) Encoding the wiring configuration of the edges from a supernode i to another supernode j involves designating the edges from the group of nodes of supernode i to the group of nodes of j in the bipartite graph composed of the two groups (the edges from j to i are accounted for in the encoding of j). Here, there are 20 such configurations, leading to a rewiring cost of log 20 bits.
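To make the counting in panel (D) concrete, the following is a minimal sketch of the wiring cost, assuming that each in-going and each out-going edge attaches to a distinct node of the graphlet, so that configurations are counted by binomial coefficients; the function name is ours, and this counting illustrates the idea rather than transcribing Eq (9).

from math import comb, log2

def wiring_cost_bits(n_graphlet_nodes: int, d_in: int, d_out: int) -> float:
    # Number of ways to choose which graphlet nodes receive the d_out
    # edges going into the supernode, times the number of ways to choose
    # the sources of the d_in edges coming out of it (illustrative
    # assumption: at most one edge per graphlet node and direction).
    n_configurations = comb(n_graphlet_nodes, d_out) * comb(n_graphlet_nodes, d_in)
    return log2(n_configurations)

# With a 4-node graphlet and two edges in each direction, there are
# C(4,2) * C(4,2) = 36 configurations, i.e., log2(36) ≈ 5.17 bits,
# matching the log 36 wiring cost of the example in panel (D).
print(wiring_cost_bits(4, 2, 2))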
Table 1.
Base- and null-model codelengths.
The codelength of a model is equal to L(H, ϕ) = Sϕ + L(ϕ) (Eq (12)), with the entropy Sϕ and the model complexity L(ϕ) given by the appropriate expressions in the table. The entropies of the multigraph models are given in the first four rows, and those of the simple graph models in the next four. The parametric complexity of each model is the same for multigraphs and simple graphs; these complexities are listed in the following four rows. Finally, expressions for common parametric codelengths are given in the last four rows. For multigraph codes, the table refers separately to the asymmetric and symmetric parts of the adjacency matrix. For the reciprocal models (RER and RCM), the relevant parameters are the number of non-reciprocated edges and the number of reciprocated edges. For the configuration model (CM), they are the out-degrees and the in-degrees. For the reciprocal CM (RCM), they are the non-reciprocated out- and in-degrees together with the reciprocal degrees. (Details can be found in S4 Text.)
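As a concrete instance of the two-part codelength L(H, ϕ) = Sϕ + L(ϕ), the entropy term of the simple-graph Erdős-Rényi code can be computed directly, assuming its standard uniform form over all directed simple graphs with N nodes and E edges; this is our illustration of the general structure, not the table's printed expression.

from math import comb, log2

def er_entropy_bits(N: int, E: int) -> float:
    # Uniform code over the C(N(N-1), E) directed simple graphs with
    # N nodes and E edges: S = log2 C(N(N-1), E) bits (assumed standard
    # form of the ER model P(N, E)).
    return log2(comb(N * (N - 1), E))

# For example, with N = 198 and E = 6499 (the ER parameters of the
# Drosophila larva right MB connectome quoted in Fig 3), this gives the
# entropy part only; the total L(H, ϕ) adds the model complexity L(ϕ).
print(er_entropy_bits(198, 6499))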
Fig 2.
Greedy optimization algorithm.
(A) Illustration of a single step of the greedy stochastic algorithm. The putative compression ΔL(G, θ, s) that would be obtained by contracting each of the subgraphs in the minibatch is calculated, and the subgraph contraction resulting in the highest compression is selected (highlighted in blue). (B) Example of a motif set inferred in the connectome of the right hemisphere of the mushroom bodies (MB right) of the Drosophila larva. (C) Evolution of the codelength during a single run of the algorithm. The algorithm continues until no more subgraphs can be contracted. The representation θ* = θt with the shortest codelength is selected; here, after the 31st iteration (indicated by a vertical black dashed line). The horizontal orange dashed line indicates the codelength of the corresponding simple graph model without motifs (see "Motif-free reference codes" section). (D) The algorithm is run one hundred times for each dyadic base model, and the most compressing model is selected. Histograms represent the codelengths of models with motifs after each run of the greedy algorithm; colors correspond to the different base models (blue: ER model; orange: configuration model; pink: reciprocal ER model; yellow: reciprocal configuration model; see Fig 1B and Table 1). Vertical dashed lines represent the codelengths of models without motifs, and the black dashed line indicates the codelength of the shortest-codelength model, here the configuration model with motifs.
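A compact sketch of the minibatch step in panel (A), with placeholder names (candidates, delta_L, batch_size) standing in for the paper's actual data structures:

import random

def greedy_step(graph, candidates, delta_L, batch_size=100, rng=random):
    # candidates: pool of contractible subgraphs; delta_L(graph, s):
    # putative compression gained by contracting subgraph s (both are
    # hypothetical names, not the paper's API).
    if not candidates:
        return None  # no contractible subgraph left; the run ends here
    # Sample a minibatch and evaluate the compression of each contraction.
    batch = rng.sample(candidates, min(batch_size, len(candidates)))
    # Return the most compressing contraction in the minibatch.
    return max(batch, key=lambda s: delta_L(graph, s))

Note that, as panel (C) describes, the run keeps contracting until no candidate subgraph remains; the shortest-codelength representation θ* is then selected from the whole trajectory afterwards.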
Table 2.
For each connectome, we list its number of non-isolated nodes, N; its number of directed edges, E; its density, ρ = E/[N(N − 1)]; the features of its most compressing model; its compressibility, ΔL*; the difference in codelength between the best models with and without motifs, ΔLmotifs; and the reference to the original publication of the dataset. The absolute compressibility ΔL* measures the number of bits by which the shortest-codelength model compresses the connectome relative to a simple Erdős-Rényi model (Eq (15)). The difference ΔLmotifs quantifies the significance of the inferred motif sets as the number of bits gained by the motif-based encoding over the optimal motif-free, dyadic model. For datasets where no motifs are found, this column is marked "N/A". All datasets are available at https://gitlab.pasteur.fr/sincobe/brain-motifs/-/tree/master/data.
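In code form, the two summary quantities are simple codelength differences; this is a sketch, and the sign convention (bits saved) is our reading of the caption. The per-edge normalization is the one used later in Fig 4.

def compressibility(L_er: float, L_best: float) -> float:
    # Absolute compressibility ΔL* (cf. Eq (15)): bits saved by the
    # shortest-codelength model relative to the simple Erdős-Rényi code.
    return L_er - L_best

def motif_gain(L_motif_free: float, L_with_motifs: float) -> float:
    # ΔLmotifs: bits gained by the best motif-based encoding over the
    # best motif-free, dyadic encoding ("N/A" when no motifs are found).
    return L_motif_free - L_with_motifs

def compressibility_per_edge(L_er: float, L_best: float, E: int) -> float:
    # Per-edge normalization ΔL*/E plotted in Fig 4.
    return (L_er - L_best) / E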
Fig 3.
Performance of compression-based motif inference on numerically generated networks.
(A-D) Number of spurious motifs inferred in random networks generated from four null models, using our compression-based method with MDL-based model selection and using hypothesis testing with the same four null models: (A) the Erdős-Rényi model (ER); (B) the configuration model (CM); (C) the reciprocal ER model (RER); and (D) the reciprocal CM (RCM). The x-axis labels indicate which method was used for motif inference: our method (MDL) or classic hypothesis testing with each of the four null models as reference. The corresponding generative model is highlighted in boldface. To make hypothesis testing as conservative as possible, we applied a Bonferroni correction, which multiplies the raw p-values by |Γ| = 9576, and we set the uncorrected significance threshold to 0.01. The random networks in (A-D) are all generated by fixing each null model's parameters to the values of the Drosophila larva right MB connectome (e.g., N = 198 and E = 6499 for the ER model). (E-H) Ability of our method to correctly identify a planted graphlet as a motif as a function of the number of times it is repeated, mα. We show results for two selected 5-node graphlets: an hourglass structure (top row) and a clique (bottom row). The clique is the densest graphlet and is totally symmetric (its number of orientations, i.e., of non-automorphic node permutations, is equal to one). The hourglass has intermediate density, ρα = 2/5, and intermediate symmetry, with 60 non-automorphic orientations within a possible range of 1 to 5! = 120. The generated networks in (E-H) contain N = 300 nodes and have an edge density of either ρ = E/[N(N − 1)] = 0.025 (E,G) or ρ = 0.1 (F,H). Each point is an average over five independently generated graphs. (E,F) The discovery rate is the estimated probability that the planted motif belongs to the inferred motif set, i.e., 〈1 − δ(mα, 0)〉. (G,H) Average inferred number of repetitions of the planted motif, 〈mα〉.
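The hypothesis-testing baseline in (A-D) reduces to a one-line decision rule using the numbers quoted above; the function name is ours.

def is_significant_motif(p_raw: float, n_graphlets: int = 9576,
                         alpha: float = 0.01) -> bool:
    # Bonferroni correction as described in the caption: the raw p-value
    # is multiplied by the number of tested graphlets |Γ| = 9576 and
    # compared to the uncorrected significance threshold of 0.01.
    return min(1.0, p_raw * n_graphlets) < alpha

# A graphlet must reach a raw p-value below 0.01 / 9576 ≈ 1e-6 to be
# declared a motif under this correction.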
Fig 4.
Compressibility of neural connectomes.
Compressibility ΔL*/E (measured in bits per edge) of different connectomes, compared to encoding the edges independently using the Erdős-Rényi simple graph model (see Table 1). Two types of models are shown for each dataset: the best simple network encoding, and the best motif-based encoding whenever it compresses more than the simple encoding. Asterisks highlight connectomes where motifs permit a higher compression than the reference models. (A) Whole-CNS and whole-animal connectomes. (B) Connectomes of three different regions of the adult Drosophila right hemibrain. Note that while the gain in compressibility obtained using motifs is relatively small for these connectomes, the motifs are highly significant owing to the connectomes' large size (Table 2). (C) Connectomes of different brain regions of the first instar Drosophila larva. (D) Connectomes of C. elegans head ganglia at different developmental stages, from 0 to 50 hours (adult). While no higher-order motifs are found, the compressibility increases with the maturation (and thus the size) of the connectome.
Fig 5.
Topological properties of motif sets.
Graph measures averaged over the inferred graphlet multiset: for a network measure φ, each point corresponds to the average of φ over the graphlets in a connectome's inferred motif set. The density (A), reciprocity (B), and number of cycles (C) are standard properties of directed networks [75]. The graph polynomial root (D) measures the structural symmetry of the motifs [74]. Details can be found in S6 Text. Red squares indicate averages over the connectomes' inferred motif sets. Blue squares are reference values, computed as averages over randomized graphlets with conserved density. To obtain the fixed-density reference for each motif set, we generate for each graphlet a collection of one hundred randomized configurations sharing the same density. The black dots in panel (A) show the connectomes' global density.
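A minimal sketch of how the red and blue squares can be computed, assuming each graphlet is represented as a set of directed edges and phi is any scalar network measure (both names are ours):

import itertools
import random
from statistics import mean

def motif_set_average(graphlets, phi):
    # Red squares: the measure phi averaged over a connectome's inferred
    # graphlet multiset.
    return mean(phi(g) for g in graphlets)

def fixed_density_reference(graphlet_sizes_and_edges, phi, n_samples=100,
                            rng=random):
    # Blue squares: phi averaged over density-conserving randomizations,
    # i.e., random directed graphs with the same number of nodes and
    # edges as each original graphlet, one hundred per graphlet.
    values = []
    for n_nodes, n_edges in graphlet_sizes_and_edges:
        pairs = list(itertools.permutations(range(n_nodes), 2))
        for _ in range(n_samples):
            values.append(phi(set(rng.sample(pairs, n_edges))))
    return mean(values)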
Fig 6.
Connectomes share common motifs.
Most frequently appearing motifs in the motif sets inferred for all connectomes. (A) Most frequently found motifs: the fraction of connectomes in which each motif is found. (B) Most repeated motifs: the average graphlet concentration across connectomes.
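Both rankings can be computed as below; we assume each connectome's motif set is a mapping from graphlet to its repetition count mα, and we take a graphlet's concentration to be its share of all repetitions, mα/Σβ mβ, which is our reading rather than a definition quoted from the text.

from collections import Counter

def motif_fraction(motif_sets):
    # Panel (A): fraction of connectomes whose inferred motif set
    # contains each graphlet.
    counts = Counter(g for ms in motif_sets for g in ms)
    return {g: c / len(motif_sets) for g, c in counts.items()}

def average_concentration(motif_sets):
    # Panel (B): graphlet concentration m_a / sum(m) within each
    # connectome (assumed definition), averaged over all connectomes.
    totals = Counter()
    for ms in motif_sets:
        total = sum(ms.values())
        for g, m in ms.items():
            totals[g] += m / total
    return {g: v / len(motif_sets) for g, v in totals.items()}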