Clustering via hypergraph modularity

Despite the fact that many important problems (including clustering) can be described using hypergraphs, theoretical foundations as well as practical algorithms using hypergraphs are not well developed yet. In this paper, we propose a hypergraph modularity function that generalizes its well established and widely used graph counterpart measure of how clustered a network is. In order to define it properly, we generalize the Chung-Lu model for graphs to hypergraphs. We then provide the theoretical foundations to search for an optimal solution with respect to our hypergraph modularity function. A simple heuristic algorithm is described and applied to a few illustrative examples. We show that using a strict version of our proposed modularity function often leads to a solution where a smaller number of hyperedges get cut as compared to optimizing modularity of 2-section graph of a hypergraph.


Introduction
An important property of complex networks is their community structure, that is, the organization of vertices in clusters, with many edges joining vertices of the same cluster and comparatively few edges joining vertices of different clusters [1,2]. In social networks, communities may represent groups by interest (practical application include collaborative tagging- [3]), in citation networks they correspond to related papers (see [4]), similarly in the web communities are formed by pages on related topics.
Yet another example could be financial markets where we have several groups of financial instruments that might be correlated with each other in several different groups. Such groups can be represented as hyperedges and hence detection of communities in such hypergraph could lead to better understanding of dependencies between financial instruments.
Hypergraphs can also be used to model transportation systems. For example in [5] the authors consider transportation system represented as a directed hypergraph. A hyperedge can represent a situation where a single stop (starting or destination point) is being serviced by several public transportation lines and vehicle types that can be differently chosen by an agent traveling within the transportation grid and communities can represent paths taken often together. Another application was presented in [6]-the authors suggest using hypergraphs to model interactions between biological cells in computational biology. Finally, in [7] one can find a discussion on how hypergraphs can be used for modeling telecommunication systems PLOS

Chung-Lu model for graphs
Let G = (V, E) be a graph, where V = {v 1 , . . ., v n } are the vertices, the edges E are multisets of V of cardinality 2 (loops allowed), and deg G (v) is the degree of v in G (with a loop at v contributing 2 to the degree of v). For A � V, let the volume of A be vol G (A) = ∑ v2A deg G (v); in particular vol G (V) = ∑ v2V deg G (v) = 2|E|. We will omit the subscript G when the context is clear. We define GðGÞ to be the probability distribution on graphs on the vertex set V following the well-known Chung-Lu model [20][21][22][23]. In this model, each set e = {v i , v j }, v i , v j 2 V, is independently sampled as an edge with probability given by: i ¼ j: (Let us mention about one technical assumption. Note that it might happen that P(v i , v j ) is greater than one and so it should really be regarded as the expected number of edges between i and j; for example, as suggested in [24], one can introduce a Poisson-distributed number of edges with mean P(v i , v j ) between each pair of vertices i, j. However, since typically the maximum degree Δ satisfies Δ 2 � 2|E| it rarely creates a problem and so we may assume that P(v i , v j ) � 1 for all pairs.) This model is a function of the degree sequence of G. One desired property of this random model is that it yields a distribution that preserves the expected degree for each vertex, namely: for any i 2 [n], where all degrees are with respect to graph G. This model will be useful to understand the graph modularity definition and its generalization to hypergraphs.

Review of graph modularity
The definition of modularity for graphs was first introduced by Newman and Girvan in [8].
Despite some known issues with this function such as the "resolution limit" reported in [25], many popular algorithms for partitioning large graph data sets use it [26][27][28]. It was also recently studied for some models of complex networks [29][30][31][32]. The modularity function favours partitions in which a large proportion of the edges fall entirely within the parts (note that throughout the paper, we use the term "part" and "partition" that are more common in mathematical literature; in the information sciences the equivalent term is "cluster") and biases against having too few or too unequally sized parts. For a graph G = (V, E) and a given partition A = {A 1 , . . ., A k } of V, the modularity function is defined as follows: The first term in (1), is called the edge contribution, whereas the second one, then q G (A) = 0, and if A = {{v 1 }, . . ., {v n }}, then q G ðAÞ ¼ À The maximum modularity q � (G) is defined as the maximum of q G (A) over all possible partitions A of V; that is, q � (G) = max A q G (A). In order to maximize q G (A) one wants to find a partition with large edge contribution subject to small degree tax. If q � (G) approaches 1 (which is the trivial upper bound), we observe a strong community structure; conversely, if q � (G) is close to zero (which is the trivial lower bound), there is no community structure. The definition in (1) can be generalized to weighted edges by replacing edge counts with sums of edge weights.

Generalization of the Chung-Lu model to hypergraphs
Similarly to what we did for graphs, we define a random model on hypergraphs, HðHÞ, where the expected degrees of all vertices are the corresponding degrees in H. To simplify the notation, we omit the explicit references to H in the remaining of this section; in particular, deg(v) denotes deg H (v), H denotes HðHÞ, E d denotes the edges of H of size d. Moreover, we use E 0 to denote the edge set of H 0 .
Let F d be the family of multisets of size d; that is, The hypergraphs in the random model are generated via independent random experiments. For each d such that |E d | > 0, the probability of generating the edge e 2 F d is given by: Note that this is the expression found in (2); that is, P H ðeÞ ¼ jE d j � s H ðeÞ. As a result, alternatively one can think about the following auxiliary process. Select a random multiset consisting of d vertices (counting possible repetitions); in d independent rounds, vertex v i is selected with probability p H ðiÞ. Repeat this experiment |E d | times and use the expected number of times edge e occurred in this process for the value of P H ðeÞ. An immediate consequence of this coupling between the two processes is that the expected number of edges of size d is |E d |. Finally, as with the graph Chung-Lu model, if P H ðeÞ > 1, then it should be regarded as the expectation and multi-hypergraph should be considered instead. However, as before, from practical point of view it is safe to assume that all P H ðeÞ � 1.
In order to compute the expected d-degree of a vertex v i 2 V, note that where I fg is the indicator random variable. Hence, using the linearity of expectation, then splitting the sum into d + 1 partial sums for different multiplicities of v i , we get: The second last equality follows from the fact that we obtained the expected value of a random variable with binomial distribution. One can compute the expected degree as follows: We will use the generalization of the Chung-Lu model to hypergraphs as a null model allowing us to define hypergraph modularity.

Hypergraph modularities
Consider a hypergraph H = (V, E) and A = {A 1 , . . ., A k }, a partition of V. For edges of size greater than 2, several definitions can be used to quantify the edge contribution given A, such as: (a). all vertices of an edge have to belong to one of the parts (clusters) to contribute; this is a strict definition that we focus on in this paper; (b). the majority of vertices of an edge belong to one of the parts; (c). at least 2 vertices of an edge belong to the same part; this is implicitly used when we replace a hypergraph with its 2-section graph representation.
We see that the choice of hypergraph modularity function is not unique; in fact, it depends on how strongly we believe that a hyperedge is an indicator that vertices belonging to it fall into one community. More importantly, one needs to decide how often vertices in one community "blend" together with vertices from other community; that is, how hermetic the community is.
In particular, option (c) is the softest one and leads to standard 2-section graph modularity. In order to illustrate how one can obtain formulas for a given variant, we concentrate on the second extreme, option (a), that we call strict. However, it is easy to repeat calculations for other variants. We will skip details but differences will be highlighted and the final formula for option (b) will be presented next. Comparison of various variants will be done in the forthcoming paper.

Strict hypergraph modularity. In this case, the definition of edge contribution for
The strict modularity of A on H is then defined as a natural extension of standard modularity in the following way: Consider any A � V. We want to compute the expected edge contribution of A over H. Let F d (A) � F d be the family of multisets of size d with all members in A; that is, First, note that where Putting (6) properly into (4), we get the strict modularity function of a hypergraph partition: Just as for graphs, the corresponding modularity q � H is defined as the maximum of q H (A) over all possible partitions A of V.
2.4.2 Generalizations. As with graphs, one can easily generalize the modularity function to allow for weighted hyperedges. As we already mentioned, we focused on the strict definition of modularity but it is straightforward to adjust the degree tax to many natural definitions of edge contribution. In particular, for the majority definition (see option (b) at the beginning of this section), one can simply replace Pð (5), and so (vol(A)/vol(V)) d in (6) (that is equivalent to PðBinðd; volðAÞ=volðVÞÞ ¼ dÞ becomes PðBinðd; volðAÞ=volðVÞÞ > d=2Þ. The majority modularity function of a hypergraph partition is then: We can also consider the modularity independently over hyperedges of different sizes. Decomposing H into d-uniform hypergraphs H d , we get the following degree-independent modularity function: This corresponds to (7)

Searching the solution space
In this section, we show that the solution that maximizes (7) lies in a subset of PðVÞ of size at most 2 |E| avoiding the search of the full set PðVÞ. Of course, it is still not feasible to check all subsets of the edge set but observations made in this section will be useful for designing efficient heuristic algorithms. For example, in this paper we already used a simple CNM-like algorithm (see Algorithm 1) in which edges overlapping with at least two parts clusters) have to be considered. Using the results from this section, we can significantly reduce the search space by ignoring edges that fall into one part. More sophisticated (and, hopefully, much faster) algorithms utilizing properties investigated in this section will be developed in the forthcoming paper.
Let SðHÞ denote the set of all sub-hypergraphs of H = (V, E) on the vertex set V: We use |H 0 | to denote |E 0 |, the number of edges in H 0 . Moreover, let p : SðHÞ ! PðVÞ denote the function that sends a sub-hypergraph of H to the partition its connected components induce on V. We define a relation on SðHÞ: that puts two sub-hypergraphs in relation if they have identical connected components. Since � p is an equivalence relation (based on equality), we can define the quotient set SðHÞ= � p . This quotient set contains equivalence classes that are in bijection with the set of all different vertex partitions that can be induced by the union of elements of E. Its cardinality depends on E but is at most 2 |E| ; however, it is typically much smaller than this trivial upper bound. Now, let us define the canonical representative mapping which identifies a natural representative member for each equivalence class. The canonical representative mapping f : ] and so it is impossible that two members have the largest size. Its outcome is the subgraph H � = (V, E � ) whose edge set is the union of edges of all members of the equivalence class. The following lemma explains why the canonical representative is natural with respect to the definition of strict modularity. As this observation follows easily from definitions, the proof is omitted. The next Lemma shows how the degree tax behaves on partition refinement.

Lemma 3.2. Let H = (V, E) be a hypergraph and A be any partition of V. If B is a refinement of A, then the degree tax of A is larger than or equal to the degree tax of B and it is equal if and only if A = B.
Proof Proof. Assume that A = {A 1 ,. . ., A k } maximizes the strict modularity function q H (�). We will show that there exists H � ¼ ðV; E � Þ 2 S � ðHÞ such that q H (p(H � )) � q H (A). Let E � = {e 2 E: e � A i for some i}. By construction of H � , the (strict) edge contribution of partitions A and p(H � ) are identical. Again, from construction, note that the partition p(H � ) is a refinement of A. Hence, the previous Lemma states that the degree tax of A is larger than or equal to the degree tax of p(H � ). With equal edge contribution, this means that q H (p(H � )) � q H (A). Since A is an optimal solution, the equality must hold which is only possible if A = p(H � ).

Examples
In this section, we first illustrate the correlation between our hypergraph modularity and the Hcut measure, which counts the number of edges touching more than one part (cluster). We then propose an algorithm for hypergraph partitioning based on our hypergraph modularity function, which we apply to a real dataset. However, let us stress again that the aim of this paper is to introduce a generalization of the modularity to hypergraphs, not to introduce new algorithms to actually find good partitions. We currently work on designing and testing algorithms for the hypergraph counterpart and, after that, we plan to do extensive experiments. The results will be included in the forthcoming paper.

Synthetic hypergraphs
We generate hyperedges following the process described as line clustering in section 5.1 of [33]. We first randomly generate points from 3 lines in the range [−.5, .5] 2 , 30 points per line, perturbed with Gaussian noise N(0, σ 2 ) where σ = 0.01. Each line cuts the origin, and the respective slopes are -1, 0.02 and 0.8. We also generate 60 outlying points chosen at random over the same range, for a total of 150 points.
We build hyperedges of size 3 (3-edges) by sampling sets of 3 points {i, j, k} for which expðÀ dði; j; kÞ 2 =s 2 d Þ > 0:999, where d() is the mean distance of the points to their best fitting line, and σ d = 0.02. This amounts to selecting 3-edges consisting of sets of 3 well aligned points. We do the same with sets of 4 points to generate the 4-edges.
The hyperedges can either consist of points all coming from the same line (which we call "signal") or not (which we call "noise"). We sample hyperedges so that the expected proportion of signal vs. noise is 2:1, and we consider 3 different regimes for the mix of edge sizes: (i) 75% 3-edges, (ii) 75% 4-edges or (iii) balanced between 3 and 4-edges. For the 3 regimes, we generate 100 hypergraphs and for each hypergraph, we apply the fast Louvain clustering algorithm (see [34]) on the weighted 2-section graph. In most cases, vertices coming from the same line are correctly put in the same part (cluster). The complete source code for this example along with instructions is available online [19].
In the left plot of Fig 1, we plot the standard graph modularity vs. the Hcut value, which is simply the proportion of hyperedges that fall in two or more parts. The Louvain algorithm is not explicitly aiming at preserving the hyperedges, so we do not expect a high correlation between the two measures. In fact, fitting a regression line to the points from the balanced regime, we get a slope of 0.0061 with R 2 value of 0.0008 (for majority of 4-edges the slope is 0.0768 and R 2 0.0734, while for majority of 3-edges the slope is 0.0061 and R 2 0.0004).
In the right plot of Fig 1, we do the same, this time comparing our hypergraph modularity with the Hcut values for the same partitions as in the left plot. The correlation here is very high. For the balanced regime, linear regression yields a slope of -0.6364 with R 2 value of 0.9693 (for majority of 4-edges the slope is -0.7079 and R 2 0.9696, while for majority of 3-edges the slope is -0.7079 and R 2 0.9696). This is an illustration of the fact that when we measure our proposed hypergraph modularity for different partitions, we are favouring keeping hyperedges in the same parts (clusters).

Estimating the modularity
While we can compute the modularity for very small hypergraphs by exhausting over all possible partitions, this is generally not a viable option. We propose a generalization to hypergraphs of the CNM algorithm for graph partitioning [26]. In the CNM algorithm, we start with every vertex in its own part. At each step, we merge the two parts (clusters) that yield the largest increase in modularity, and we repeat until no such move exists. The algorithm comes in two versions. In the full version at each step all hyperedges are searched and evaluated for merging. Since for larger hypergraphs this would be prohibitively numerically expensive, in the stochastic version at each step we evaluate just one randomly chosen hyperedge.
Our proposed algorithm for hypergraphs is presented in Algorithm 1. The idea is that we start with a partition where each node is in its own part. Then in each step, we loop through every hyperedge touching two or more parts, and we select (or we just randomly chose a single hyperedge) the one which, when we merge all the parts it touches, yields the best modularity, provided it is at least as high as the modularity from the previous step. We stop when no such edge exists. We use this algorithm in the next example. compute the partition A e obtained when merging all parts in A opt touched by e, and compute its modularity q e ; 10 end 11 select edge e � 2 E � with highest q e � ; 12 end 13 if q e � � q opt then 14 A opt = A e � and q opt = q e � ; 15 update E � , the set of edges touching two or more parts in A opt ; 16 end 17 until (q e � < q opt ) or E � = ; or computational time budget exceeded; 18 output: A opt and q opt We have implemented the simplified stochastic version of the algorithm in Julia and use it on the hypergraph generated with the forementioned regime (i) that has 75% edges of degree 3 and 25% edges of the degree 4. In our implementation of Algorithm 1, we choose hyperedges to consider at random order (that is, at each step one hyperedge is randomly selected). In order to validate the algorithm, we have run it 1920 times with 500 steps at each run. A single run on a hypergraph consisting of 150 vertices and 5094 hyperedges, using a modern CPU and a single thread, took around 7 seconds. Note that the computational complexity of the proposed algorithm is O(n) and hence it can be practically used for real-world hypergraphs.
The results presented in Fig 2 show that the heuristics presented in Algorithm 1 leads to practically reasonable and useful communities. Firstly, on average, the modularity level after 500 steps is around 75% on modularity for an "optimal" partition of the hypergraph (that is, the partition that uses the knowledge on how the hypergraph was generated). Secondly, this result can be improved by rerunning the Algorithm 1 several times (and this process can be parallelized in high performance computing environments). In particular, we have noticed that for 1.2% of all runs a partition having the optimal modularity (at least 0.6194 for the partition of the partition that matched the process of data generation described in the section 4.1 where we have three groups of points (30 points in group) sharing common hyperedges in each group and an additional 60 points) have been found. Please also note that a higher observed empirical value was observed in some cases-this arises due to the fact that the for last group of synthetic hypergraph having 60 vertices the hyperedges have been generated randomly.

DBLP hypergraph
The DBLP computer science bibliography database contains open bibliographic information on major computer science journals and proceedings. The DBLP database is operated jointly by University of Trier and Schloss Dagstuhl. The DBLP Paper data is available at http://dblp. uni-trier.de/xml/.
We consider a hypergraph of citations where each vertex represents an author and hyperedges are papers. In order to properly match author names across papers we enhance the data with information scraped from journal web pages. The DBLP database contains the doi.org identifier. We use this information to obtain the journal name and retrieve paper author data directly from journal-we update available author name data using ACM, IEEE Xplore, Springer and Elsevier/ScienceDirect databases. Since the same author names can be written differently we match author names of the paper across all three data sources. This can give good representation of author names for later matching. For the analysis, we only kept the (single) large connected component. We obtained a hypergraph with 1637 nodes, 865 edges of size 2, 470 of size 3, 152 or size 4 and 37 of size 5 to 7. The complete source code for this example along with instructions and data files is available online [19].
In Table 1, we show our results with the Louvain algorithm on the 2-section graph using modularity function q G (), as well as the results with our CNM algorithm on hypergraphs. Comparing the Louvain and CNM algorithms, we see that there is a tradeoff between q H and q G and moreover, the Hcut value is lower with the CNM algorithm. The increased number of parts with our algorithms is mainly due to the presence of singletons.
Another observation is that the actual partitions obtained with objective function q G () (Louvain) and q H () (CNM) are different. For the Louvain and CNM algorithms, we found values of 0.4355 for the adjusted Rand index (ARI), 0.4416 for its graph-aware counterpart (see [35]) and 0.684 for the adjusted mutual information, which are commonly used measures of comparisons for partitions. Similar partitions would show values close to 1. We used adjusted measures which are preferable to non-adjusted ones (such as NMI, the normalized mutual information) as they correct for random chance.
One of the difference lies in the number of edges of size 2, 3 and 4 that are cut with the different algorithms, as we see in Table 2. The algorithms based on q H () will tend to cut less of the larger edges, as compared to the Louvain algorithm, at expense of cutting more size-2 edges.

Conclusion
In this paper, we presented a generalization of the Chung-Lu model for hypergraphs, which we used to define a modularity function on hypergraphs. Interestingly, in hypergraph modularity case there is no one unique way to define modularity and we show that it depends on how strongly we think that a hyperedge indicates members of the same community. If the belief is soft this leads to standard 2-section graph modularity. However, if it is strong, a natural definition is strict hypergraph modularity, which we tested on numerical examples. We also proposed the in-between majority-based modularity function.
The objective of this paper is to develop a definition of hypergraph modularity. However, in order to show that this notion is numerically traceable, at least approximately, we provided the theoretical foundations for the development of algorithms using this modularity function that greatly reduce the solution search space.
A key natural question with any new measure is if it provides qualitatively different outcomes than existing ones. Therefore we have compared strict hypergraph modularity with a standard 2-section graph modularity. For this we proposed a simple heuristic algorithm. We illustrated the fact that in comparison to 2-section graph modularity (optimized using Louvain algorithm) optimization using strict modularity function tends to cut a smaller number of hyperedges. Therefore the proposed measure is potentially highly valuable in application scenarios, where a hyperedge is a strong indicator that vertices it contains belong to the same community.
Hypergraph modularity is a new measure, and there is still a lot of work that should be done. First of all, the development of good, efficient heuristic algorithms would allow to look at larger hypergraphs. Such algorithms would allow us to perform a study over hypergraphs with different edge size distributions, comparing the hypergraph modularity function with other definitions such as graph modularity over the 2-section representation of the hyperedges, and hypergraph modularity using the less strict majority rule.
Finally, let us mention that the method of modularity maximization (in its generalized form which incorporates a resolution parameter controlling the size of the communities discovered) is equivalent to another widely used methods of community detection in networks, the method of maximum likelihood applied to the special case of the stochastic block model known as the planted partition model, in which all communities in a network are assumed to have statistically similar properties [36] (see also [37]). It would be interesting (and potentially useful) to investigate if there is a natural counterpart of our hypergraph modularity measure.