Online Community Detection for Large Complex Networks

Complex networks describe a wide range of systems in nature and society. To understand complex networks, it is crucial to investigate their community structure. In this paper, we develop an online community detection algorithm with linear time complexity for large complex networks. Our algorithm processes a network edge by edge, in the order in which the network is fed to the algorithm. When a new edge arrives, it updates the existing community structure in constant time instead of recomputing over the whole network, so it can efficiently process large networks in real time. At each step our algorithm optimizes the expected modularity instead of the modularity, which avoids poor performance. Experiments are carried out on 11 public data sets and measured by two criteria, modularity and NMI (Normalized Mutual Information). The results show that our algorithm's running time is less than that of the commonly used Louvain algorithm while it gives competitive performance.


Introduction
Complex networks describe a wide range of systems in nature and society [1][2][3]. Frequently cited examples include the Internet in which routers and computers are connected by physical links, and collaboration networks in which researchers are linked by coauthoring. To understand the formation, evolution, and function of complex networks, it is crucial to investigate their community structure, not only for uncovering the relations between internal structure and functions, but also for practical applications in many disciplines such as biology and sociology [4][5][6].
Intuitively, a community of a complex network consists of a cohesive group of nodes that are relatively densely connected to each other but sparsely connected to other dense groups in the network [7]. Community detection aims to identify the communities by only using the information encoded in the network topology [8]. It is one of the critical issues in the study of complex networks. A wide variety of community detection methods have been developed to serve different scientific needs [8,9].
Modularity is a commonly used criterion for community detection, first proposed by Newman et al. [10]. Good et al. [11] describe the performance of modularity maximization in practical contexts and present a broad characterization of its behavior in such situations. A wide variety of algorithms have been developed to solve the modularity optimization problem [12]. For example, Clauset et al. [13] present a hierarchical agglomeration algorithm for detecting communities, and Newman et al. [14] show that modularity can be expressed in terms of the eigenvectors of a characteristic matrix of the network, which leads to a spectral algorithm for community detection.
Modularity can be generalized in a principled fashion to incorporate edge information such as direction and weight. Leicht et al. [15] consider the problem of finding communities in directed networks. Newman et al. [16] point out that weighted networks can, in many cases, be analyzed using a simple mapping from a weighted network to an unweighted multigraph. Lancichinetti et al. [9] generate directed and weighted networks with built-in community structure and show how modularity optimization performs on their benchmark. However, Fortunato et al. [17] find that modularity optimization may fail to identify communities smaller than a scale that depends on the total size of the network and on the degree of interconnectedness of the communities, which is known as the resolution limit problem. To mitigate the resolution issue, Reichardt et al. [18] show how community detection can be interpreted as finding the ground state of an infinite-range spin glass. Ruan et al. [19] propose a recursive algorithm, HQCUT, to solve the resolution limit problem. Arenas et al. [20] propose a method that allows for multiple-resolution screening of modular structures. Aldecoa et al. [21] introduce a criterion called ''Surprise'' to resolve the resolution problem.
In some kinds of complex networks, new edges continually appear while old edges do not disappear, resulting in ever larger networks. For example, citation networks grow as new papers cite existing papers. To efficiently process such networks, we desire a community detection algorithm that can process a network (1) without recomputing the whole network after every new edge or node, and (2) without requiring the whole network structure to be available at each update. Although many community detection algorithms have been proposed, to the best of our knowledge none meets these two requirements. Many existing algorithms must start from the beginning when the network is expanded, even when only one node or edge is added.
Many efforts have been made to meet the two requirements. Leung et al. [22] identified novel characteristics and drawbacks of the label propagation algorithm, and extended it by incorporating different heuristics to facilitate reliable and multifunctional real-time community detection. Huang et al. [23] introduced a new quality function of local community, and presented a fast local expansion algorithm for uncovering communities in large-scale networks. Kawadia et al. [24] presented a new measure of partition distance called estrangement, and showed that constraining estrangement enables it to find meaningful temporal communities in diverse real-world data sets. However, neither Leung's algorithm nor Huang's algorithm can handle growing networks, since both must recompute the whole network after every new edge or node, and Kawadia's algorithm requires the whole network structure to be available at each update.
In this paper, we develop a community detection algorithm that meets both requirements. Our algorithm is an online algorithm, i.e. it processes a network edge by edge, in the order in which the network is fed to the algorithm, without having the whole network available from the start. Once a new edge is added, our algorithm updates the existing community structure in constant time. The update avoids re-processing the whole network, since it only needs knowledge of the network's local structure around the new edge; our algorithm can therefore efficiently process large networks in real time. It has O(M) time complexity and O(NK) space complexity, where M is the number of edges, N is the number of nodes, and K is the number of communities.
This paper is an extension of our previous work [25] published in IJCAI'13 (available free of charge at http://ijcai.org/papers13/Papers/IJCAI13-281.pdf). The main differences are three-fold: (1) this paper proposes a generative model for complex networks based on the preferential attachment mechanism, which lets us infer a network's future structure from its current structure and gives a solid theoretical motivation for the algorithm; (2) this paper develops a deterministic online community detection algorithm, which uses expected modularity to make an informed choice, whereas the conference paper's non-deterministic algorithm may need many runs; (3) this paper uses additional data sets and more extensive experiments for more convincing results.

Method
To achieve the online community detection, we first propose a generative model for complex networks based on the preferential attachment mechanism [26,27], which helps us to predict a network's future structure based on its current structure. We then develop an online community detection algorithm, which processes a network edge by edge. It optimizes expected modularity instead of modularity to avoid poor performance in some specific cases. Expected modularity can be calculated based on our generative model.

Preliminaries
A network G = {V, E} is a set of N nodes V = {v_1, ..., v_N} connected by a set of M edges E = {e_ij = {v_i, v_j}}. The network considered here is undirected, unweighted, and without self-loops or isolated nodes. Let P = {C_1, ..., C_K} denote a partition of V, i.e. a division of V into K non-overlapping and non-empty communities C_k that cover all of V. As a performance measure for partition quality, modularity was first proposed by Newman et al. [28]. It can be expressed as

q(P) = Σ_{k=1}^{K} [ edg(C_k)/|E| − (deg(C_k)/(2|E|))^2 ],

where edg(C_k) = |{e_ij : v_i ∈ C_k and v_j ∈ C_k}| is the number of intra-community edges within community C_k, |E| is the number of edges in network G, and deg(C_k) is the degree of community C_k, i.e. the sum of the degrees of the nodes belonging to C_k. Community detection can hence be formulated as the modularity optimization problem max_P q(P). Brandes et al. [29] prove the conjectured hardness of this problem, both in the general case and when the number of communities K is restricted. This result makes heuristic techniques the only viable option for the modularity optimization problem. However, heuristic techniques cannot guarantee that the resulting partition is good enough; they may produce a poor partition for some networks, i.e. fail to achieve an acceptable modularity. In particular, we say an algorithm encounters failure if it assigns all nodes to the same community.
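As a concrete illustration, the modularity above can be computed directly from an edge list and a node-to-community assignment. The sketch below uses our own function and variable names; it is not code from the paper.

```python
from collections import defaultdict

def modularity(edges, community):
    """q(P) = sum_k [ edg(C_k)/|E| - (deg(C_k)/(2|E|))^2 ].

    edges: list of undirected edges (u, v); community: dict node -> community id.
    """
    m = len(edges)                 # |E|
    edg = defaultdict(int)         # intra-community edge counts edg(C_k)
    deg = defaultdict(int)         # community degrees deg(C_k)
    for u, v in edges:
        deg[community[u]] += 1
        deg[community[v]] += 1
        if community[u] == community[v]:
            edg[community[u]] += 1
    return sum(edg[k] / m - (deg[k] / (2 * m)) ** 2 for k in deg)
```

For two triangles joined by a single bridge edge, partitioning each triangle as one community gives q = 6/7 − 1/2 ≈ 0.357, while putting all nodes in one community gives q = 0 — the failure case mentioned above.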

Generative Model for Complex Network
Complex networks have non-trivial topological features that do not occur in simple networks but often occur in real networks. An important feature of many complex networks is that their degree distributions follow a particular mathematical function, the power law [27,30,31], although this does not always hold [32]. The power law implies that the degree distribution of the network has no characteristic scale.
The model of Price et al. [26] is widely recognized as a seminal work explaining the observed stationary scale-free degree distributions of complex networks. Barabasi et al. [27] conclude that this feature is a consequence of two generic mechanisms: (1) networks expand continuously by the addition of new nodes; (2) new nodes attach preferentially to nodes that are already well connected. Barabasi's model is widely recognized in the literature [33,34]. Specifically, a new node v_j attaches to an existing node v_i with probability p(v_i) proportional to the degree of node v_i. The above model only covers the case in which a new edge links a new node to an existing node; however, a new edge may also link two existing nodes or two new nodes. In fact, estimating the likelihood of the appearance of a new edge between two existing nodes, called link prediction, is one of the fundamental problems in network analysis, and a variant of the preferential attachment mechanism can be used for link prediction [35]. Specifically, a new edge links two existing nodes v_i and v_j with probability p(v_i, v_j) proportional to the product of the degrees of nodes v_i and v_j. For a complete review of the statistical mechanics of network topology and dynamics of complex networks, one can refer to Boccaletti et al. [34] or Albert et al. [36]. Mitzenmacher et al. [37] briefly survey other generative models that lead to scale-free distributions. For a summary of recent progress on link prediction algorithms, one can refer to Lu et al. [38].
To facilitate the subsequent work, we generalize the preferential attachment mechanism from nodes to communities: a new node attaches to an existing community C_k with probability p(C_k) proportional to the degree of community C_k, and a new edge links two existing communities C_{k_1} and C_{k_2} with probability p(C_{k_1}, C_{k_2}) proportional to the product of the degrees of communities C_{k_1} and C_{k_2}. We now propose a generative model for complex networks. Our model generates a network G with M edges by the addition of new edges, starting from an empty network G_0 = ∅. For m = 0, ..., M−1, there are three cases for a new edge: case (a), with probability p_a, link a new node to an existing node; case (b), with probability p_b, link two existing nodes; case (c), with probability p_c, link two new nodes. For cases (a) and (b), the endpoints of the new edge are chosen following the preferential attachment mechanism above (See Fig. 1).
Notice that p_a + p_b + p_c = 1. When p_a = 1, our model is the same as Barabasi's model for growing networks.
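The three-case growth process can be sketched in a few lines. This is our own minimal simulation, not the paper's code: degree-proportional sampling is implemented by drawing uniformly from a list that repeats each node once per incident edge, and for brevity we only guard against self-loops, not duplicate edges.

```python
import random

def generate_network(M, pa=0.6, pb=0.3, pc=0.1, seed=0):
    """Grow a network edge by edge following the three-case model.

    Case (a) (prob. pa): a new node attaches to an existing node chosen with
    probability proportional to its degree; case (b) (prob. pb): an edge links
    two existing nodes, each chosen degree-proportionally; case (c) (prob. pc):
    an edge links two brand-new nodes.  pa + pb + pc must equal 1.
    """
    rng = random.Random(seed)
    edges, ends = [], []   # `ends` holds one entry per edge endpoint, so a
                           # uniform draw from it is degree-proportional
    next_node = 0
    for _ in range(M):
        r = rng.random()
        if not ends or r < pc:              # case (c); also bootstraps G_0 = empty
            u, v = next_node, next_node + 1
            next_node += 2
        elif r < pc + pa:                   # case (a): new node -> existing node
            u, v = next_node, rng.choice(ends)
            next_node += 1
        else:                               # case (b): two existing nodes
            u = rng.choice(ends)
            v = rng.choice(ends)
            while v == u:                   # avoid self-loops
                v = rng.choice(ends)
        edges.append((u, v))
        ends.extend((u, v))
    return edges
```

Setting pa = 1 (and pb = pc = 0) after the first edge recovers Barabasi-style growth, as noted above.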

Online Community Detection Algorithm
A straightforward way to do online community detection is to take a sequence of edges as input and to optimize the modularity q(P_{m+1}) of the current network G_{m+1} at each step, based on the previous partition P_m. However, this greedy algorithm may perform poorly. Consider Barabasi's model, in which every new edge links a new node to an existing node. Brandes et al. [29] prove that a partition with maximum modularity has no community consisting of a single node with degree one, so the greedy step always assigns each new node to an existing community; over the whole sequence this places all nodes in the same community and results in zero modularity.
To avoid poor performance, our algorithm optimizes the expected modularity E[q(P_M)] of the final network G_M, instead of the modularity q(P_{m+1}) of the current network G_{m+1}, at each step. We calculate E[q(P_M)] based on our generative model and the following partition: existing nodes are kept in their current communities, and new nodes are assigned to existing communities so that the degree of every existing community (defined as the sum of the degrees of the nodes belonging to it) keeps increasing, with the expected increment of a community's degree proportional to its degree. Such a partition keeps the subsequent derivation of the expected modularity simple.
First we calculate q(P_{m+1}). Let C_{k,m} denote community C_k at step m, and let |E_m| be the number of edges in network G_m; since our algorithm processes one edge per step, |E_m| = m. Hence q(P_{m+1}) can be expressed as

q(P_{m+1}) = Σ_k [ edg(C_{k,m+1})/(m+1) − (deg(C_{k,m+1})/(2(m+1)))^2 ].

We then calculate E[q(P_{m+1})] under the three cases separately, as follows. Case (a): link a new node to an existing node. Without loss of generality, we assume v_i is the existing node and v_j is the new node. We assign the new node to C_{k(i)}, the community that node v_i belongs to, so that edg(C_{k(i),m+1}) = edg(C_{k(i),m}) + 1 and deg(C_{k(i),m+1}) = deg(C_{k(i),m}) + 2, and substitute these values into the expression above.
Case (b): link two existing nodes. We do not change the partition. Case (c): link two new nodes. We assign the two new nodes to an existing community with probability proportional to the degree of that community; case (c)'s q(P_{m+1}) and E[q(P_{m+1})] are then the same as case (a)'s. Finally, we calculate E[q(P_M)] by combining E[q(P_{m+1})] under the three cases and applying the recursion iteratively, which yields an expression involving a factor f(M, m) that depends only on M and m. As our partition keeps the degree of every existing community increasing, with the expected increment proportional to the community's degree, the expected degree of community C_k at a later step m' is E[deg(C_{k,m'})] = (m'/m) deg(C_{k,m}). By the Popoviciu inequality on variance, the variance of deg(C_{k,m'}) has a loose upper bound, where K_m denotes the number of communities in network G_m.

We now describe the online community detection algorithm. For the initial network G_0 = ∅, the best partition P_0 is clearly the empty set as well. For the subsequent networks G_{m+1}, m = 0, 1, 2, ..., M−1, we consider candidate operations that update the partition. Each operation has a corresponding E[q(P_M)], and we take the operation with the largest value. In fact, we only need the expected modularity gain ΔE[q(P_M)], defined as the E[q(P_M)] of one operation minus that of another. We describe our operations under the three cases separately, as follows. Case (a): link a new node to an existing node. We consider two operations: the Split operation, where the new node splits off as a new community, and the Join operation, where the new node joins the same community as the existing node (See Fig. 2). Without loss of generality, we assume v_i is the existing node and v_j is the new node.
For the Split operation, the new node forms a new community C_{K+1} with deg(C_{K+1,m+1}) = 1, while the existing community C_{k(i)} has deg(C_{k(i),m+1}) = deg(C_{k(i),m}) + 1 at step m+1.
For the Join operation, the existing community C_{k(i)} has deg(C_{k(i),m+1}) = deg(C_{k(i),m}) + 2 at step m+1. We estimate p_b by the observed frequency of case (b). Putting the terms together and omitting the error term, we obtain ΔE[q(P_M)], and we take the Split operation if it is positive and the Join operation otherwise.
Case (b): link two existing nodes. The two nodes may or may not belong to the same community (See Fig. 3). If both belong to the same community, it is hard to propose a suitable candidate operation, so we take the Dense operation, keeping the current partition unchanged. Otherwise we consider two operations: (1) the Move operation, where we move one node from its community to the other node's community; (2) the Keep operation, where we keep the current partition unchanged. Without loss of generality, we assume v_i is the moving node, and let deg(v_i, C_{k,m}) = |{e_ij : v_j ∈ C_{k,m}}| denote the number of edges from node v_i to community C_k at step m. We then obtain ΔE[q(P_M)] and determine the operation in the same way as in case (a).
Case (c): link two new nodes. We consider two operations: the New operation, where we assign the two new nodes to a new community, and the Merge operation, where we assign them to an existing community C_{k(i),m}. Notice that the corresponding ΔE[q(P_M)] is almost always positive for large complex networks, so we always take the New operation for case (c) to reduce complexity.
In summary, our algorithm takes a sequence of edges as input and optimizes the expected modularity at each step, assigning nodes to communities according to the maximum expected modularity gain principle. If only one node of the current edge belongs to the existing network, we split the other node off into a new community if this maximizes the expected modularity gain, and otherwise let it join the same community as the existing node; if both nodes of the current edge belong to the existing network but to different communities, we move one node according to the same principle; if neither node belongs to the existing network, we simply assign both to a new community. The algorithm clearly has O(M) time complexity. The space complexity is O(NK), because we store deg(v_i, C_k) in order to compute the expected modularity gain in constant time. Our algorithm has two major advantages: (1) each update only uses knowledge of the network's local structure around the new edge; (2) each update runs in constant time. It can therefore efficiently process large networks in real time.
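The control flow of the edge-by-edge update can be sketched as follows. This is a simplified, OLTM-style skeleton of our own design: it greedily evaluates the modularity of the current network (the full OLEM variant would replace the gain computation with the expected-modularity-gain formulas derived above), and all class, method, and variable names are ours. It maintains the deg(v_i, C_k) counts (`nc`) so each decision touches only the new edge's local structure.

```python
from collections import defaultdict

def com_q(edg_k, deg_k, m):
    # contribution of one community to modularity at step m (|E| = m)
    return edg_k / m - (deg_k / (2 * m)) ** 2

class OnlineDetector:
    def __init__(self):
        self.com = {}                  # node -> community id
        self.edg = defaultdict(int)    # community -> intra-community edge count
        self.deg = defaultdict(int)    # community -> degree
        self.d = defaultdict(int)      # node -> degree
        self.nc = defaultdict(int)     # (node, community) -> deg(v_i, C_k)
        self.adj = defaultdict(list)   # node -> neighbours seen so far
        self.m = 0                     # edges processed
        self.next_com = 0

    def add_edge(self, u, v):
        self.m += 1
        self.d[u] += 1
        self.d[v] += 1
        if u not in self.com and v not in self.com:        # case (c): New
            c = self.next_com
            self.next_com += 1
            self.com[u] = self.com[v] = c
            self.edg[c] += 1
            self.deg[c] += 2
        elif (u in self.com) != (v in self.com):           # case (a): greedy Join
            u, v = (u, v) if v not in self.com else (v, u) # u existing, v new
            c = self.com[u]
            self.com[v] = c
            self.edg[c] += 1
            self.deg[c] += 2
        else:                                              # case (b)
            a, b = self.com[u], self.com[v]
            if a == b:                                     # Dense
                self.edg[a] += 1
                self.deg[a] += 2
            else:                                          # Move u -> b, or Keep
                self.deg[a] += 1                           # Keep state first
                self.deg[b] += 1
                gain = (com_q(self.edg[a] - self.nc[(u, a)],
                              self.deg[a] - self.d[u], self.m)
                        + com_q(self.edg[b] + self.nc[(u, b)] + 1,
                                self.deg[b] + self.d[u], self.m)
                        - com_q(self.edg[a], self.deg[a], self.m)
                        - com_q(self.edg[b], self.deg[b], self.m))
                if gain > 0:                               # perform the Move
                    self.edg[a] -= self.nc[(u, a)]
                    self.edg[b] += self.nc[(u, b)] + 1
                    self.deg[a] -= self.d[u]
                    self.deg[b] += self.d[u]
                    self.com[u] = b
                    for w in self.adj[u]:                  # repoint neighbours
                        self.nc[(w, a)] -= 1
                        self.nc[(w, b)] += 1
        self.adj[u].append(v)
        self.adj[v].append(u)
        self.nc[(u, self.com[v])] += 1
        self.nc[(v, self.com[u])] += 1

    def modularity(self):
        return sum(com_q(self.edg[c], self.deg[c], self.m)
                   for c in set(self.com.values()))
```

Note that the Move bookkeeping touches only the moving node's neighbours; all other decisions are O(1) per edge, in line with the complexity claims above.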

Results
In this section, we present the experimental results of our online community detection algorithm and compare it with a state-of-the-art algorithm, the Louvain algorithm, proposed by Blondel et al. [39]. For simplicity, we use OLEM to refer to our algorithm, OLTM to refer to a simplified version of our algorithm that greedily optimizes the temporal modularity q(P_{m+1}) (See Eq. (4)) instead of the expected modularity E[q(P_M)] (See Eq. (6)), and Louvain to refer to the Louvain algorithm.
The experiments use 11 public real-world large network data sets from the Stanford Large Network Dataset Collection (http://snap.stanford.edu/data/), which are commonly used by researchers. Their numbers of nodes vary from 12,008 to 2,394,385 and their numbers of edges vary from 93,439 to 4,659,565 (See Table 1). These data sets are:

- ca-CondMat: Collaboration network of Arxiv Condensed Matter [40]
- ca-HepPh: Collaboration network of Arxiv High Energy Physics [40]
- email-Enron: Email communication network from Enron [41]
- ca-AstroPh: Collaboration network of Arxiv Astro Physics [40]
- cit-HepTh: Arxiv High Energy Physics paper citation network [42]
- cit-HepPh: Arxiv High Energy Physics paper citation network [40]
- com-Amazon: Amazon product network with labeled community structure [43]
- com-DBLP: DBLP collaboration network with labeled community structure [43]
- web-Stanford: Web graph of Stanford.edu [41]
- Amazon0601: Amazon product co-purchasing network from June 1 2003 [44]
- WikiTalk: Wikipedia talk (communication) network [45]

The edges should be processed in the same order as the expanding procedure of the networks; however, these data sets do not carry timestamps on the edges, so in the experiments we process the edges in the order of their appearance in the raw files.
We implement our algorithms in C# (our C# implementation can be downloaded from http://www.cs.zju.edu.cn/~gpan/code/pone2013.zip). For comparison, we employ the C implementation of the Louvain algorithm provided by its authors (https://sites.google.com/site/findcommunities/). We carry out the experiments on a Windows machine with a Genuine Intel(R) i7 CPU @ 2.70 GHz and 4.00 GB memory.
Modularity and average running time (in seconds) over 10 runs by OLEM, OLTM, and Louvain are reported in Table 2 and Table 3. The evolution of temporal modularity over time by OLEM and OLTM is shown in Fig. 4.
We can see that OLTM is faster than Louvain on all data sets, and OLEM is faster than Louvain on all data sets except ca-AstroPh, cit-HepTh, and cit-HepPh. Measured by modularity, OLEM and OLTM do not reach Louvain's performance; this is because our algorithms are online one-pass algorithms while Louvain is an offline multi-pass algorithm. Our algorithms' running times are linear in the number of edges, as expected, whereas Louvain's is not, since its number of passes is not fixed. Most importantly, Louvain needs to start from the beginning when a new edge is added, while our algorithms do not.
OLTM is faster than OLEM because Δq(P_{m+1}) is simpler to evaluate than ΔE[q(P_M)]. In fact, our implementation calculates 2(m+1)^2 Δq(P_{m+1}) instead of Δq(P_{m+1}), as the former involves only integer arithmetic, which is faster than floating-point arithmetic. OLEM delivers relatively stable performance across all data sets, while OLTM performs exceptionally poorly on the email-Enron and WikiTalk data sets; we investigate the underlying cause for OLTM later. OLTM often performs slightly better than OLEM on the other data sets, which may be due to our approximation of the expected modularity by a lower bound in OLEM.
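The integer-arithmetic trick works because scaling every term of the modularity by the same positive factor does not change which candidate operation wins a comparison. The sketch below (our own names; for simplicity we scale the full q(P) by 4m^2 rather than scaling the gain by the paper's 2(m+1)^2, which has the same effect) shows the integer form:

```python
def scaled_q(edg, deg, m):
    """4*m^2 * q(P) for a partition summarized by per-community intra-edge
    counts `edg` and degrees `deg`, with m = |E| total edges.

    Since q(P) = sum_k [edg_k/m - (deg_k/(2m))^2], multiplying by 4*m^2
    gives sum_k [4*m*edg_k - deg_k^2], which involves only integers.
    Comparisons between candidates are preserved because the scale
    factor is positive and identical on both sides."""
    return sum(4 * m * e - d * d for e, d in zip(edg, deg))
```

For the two-triangles-plus-bridge network (m = 7), the two-community partition scores 70 in scaled form (i.e. 70 / (4·7²) = 5/14 in modularity), while the single-community partition scores 0, so the comparison can be made entirely in integers.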
As mentioned in the Introduction, the modularity optimization based approach may fail to identify communities smaller than a certain scale, which is known as the resolution limit problem [17]. To investigate this problem, we compare the results of OLEM, OLTM, and Louvain on the com-Amazon and com-DBLP data sets. We choose these two data sets because Yang et al. [43] released a labeled community structure for each of them (http://snap.stanford.edu/data/com-Amazon.html, http://snap.stanford.edu/data/com-DBLP.html). For the com-Amazon data set, Yang et al. labeled products from the same category as a community, so nodes (products) that belong to a common community share a common function or purpose. For the com-DBLP data set, they labeled authors who published in a certain journal or conference as a community, so nodes (authors) that belong to a common community share a common research interest. For each data set, we use the top 5,000 subset, the same as [43], for comparison.
We find that, although both our method and the Louvain method optimize the modularity function, the number of communities in Louvain's result is smaller than in ours (See Table 4). This is because the two methods optimize in different ways: the Louvain method optimizes the modularity function by merging pairs of communities in each pass, while our method does so by moving the nodes of the new edge at each step, in order to satisfy real-time processing. Generally speaking, merging communities can obtain a higher modularity gain than moving nodes, so the Louvain method optimizes modularity better than ours. However, each merging operation eliminates one community, reducing the number of communities in the final result; as a consequence, the Louvain method misses small communities. Further, the similarity between the detected results and the labeled community structures can be measured by the NMI (Normalized Mutual Information) criterion [46]. Measured by NMI, our results are more similar to the labeled community structure than Louvain's (See Table 5). The main reason may be that our methods can find more small-scale communities, which the Louvain method has difficulty identifying.
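NMI between a detected partition and a labeled one can be computed directly from two label assignments. The following is a minimal self-contained sketch of our own, using the common normalization 2·I(A;B) / (H(A) + H(B)); the function name and the one-label-per-node representation are our assumptions, not the paper's.

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Normalized mutual information 2*I(A;B)/(H(A)+H(B)) between two
    partitions given as equal-length label sequences (one label per node)."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # mutual information I(A;B) = sum_ab p(a,b) * log(p(a,b) / (p(a) p(b)))
    mi = sum((c / n) * log((c / n) * n * n / (ca[a] * cb[b]))
             for (a, b), c in joint.items())
    # entropies H(A), H(B)
    ha = -sum((c / n) * log(c / n) for c in ca.values())
    hb = -sum((c / n) * log(c / n) for c in cb.values())
    if ha + hb == 0:
        return 1.0  # both partitions are trivial and identical
    return 2 * mi / (ha + hb)
```

Identical partitions score 1 regardless of how the community ids are named, and independent partitions score 0, which is what makes NMI suitable for comparing detected and labeled community structures.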
The reason for OLTM's poor performance on the email-Enron and WikiTalk data sets is that OLTM in effect has no Split operation for case (a) edges: as a greedy approach, it always takes the Join operation for a case (a) edge to maximize the temporal modularity. Hence the only way for OLTM to create a new community is its New operation for case (c) edges. If a data set has few case (c) edges near its beginning, OLTM cannot create enough communities in the early stage and ends with a poor final partition. In the worst situation, the data set has no case (c) edges at all and OLTM fails. In fact, the email-Enron and WikiTalk data sets have very few case (c) edges near their beginning, compared with the other data sets.
In contrast, with the help of expected modularity, OLEM can take the Split operation for case (a) edges. Hence it can create enough communities in the early stage and obtains an acceptable final partition on the email-Enron and WikiTalk data sets.
To compare OLTM's and OLEM's operations, we plot the percentage of each operation taken by OLTM and OLEM over time in Fig. 5 and 6. On the email-Enron and WikiTalk data sets, OLTM generally takes only the Join and Dense operations until a very late stage, while OLEM takes many Split operations in the early stage. Consequently, OLEM's temporal modularity increases steadily over time, while OLTM's temporal modularity remains near zero until a very late stage on these data sets (See Fig. 4). In fact, OLEM obtains an acceptable modularity even in the early stage on the email-Enron and WikiTalk data sets.
As a statistical analysis, we created 10 copies of each original data set with the edges randomly reordered and ran our algorithm on these reordered data sets. The statistics of modularity on the reordered data sets, together with the modularity on the original data set, are reported in Table 6. The modularity on the original data set is significantly better than on the reordered ones for ca-CondMat, email-Enron, com-Amazon, com-DBLP, web-Stanford, and Amazon0601. We conjecture that, for these six data sets, the stored order of the edges is close to the order in which they appeared. For the other data sets, the difference in modularity between the original and reordered versions is slight; there, the edges are probably not stored by their creation time.

Conclusions
In this paper we have examined the problem of online community detection for large complex networks in which new edges continually appear while old edges do not disappear. We have formulated it as a modularity optimization problem, proposed a generative model for complex networks, and developed an online algorithm with linear time complexity. Our algorithm processes a network edge by edge, in the order in which the network is fed to the algorithm. It optimizes not the modularity but the expected modularity, to avoid poor performance. The two major advantages of our algorithm are (1) each update only uses knowledge of the network's local structure around the new edge; (2) each update runs in constant time. Our algorithm can therefore efficiently process large networks in real time. The algorithm has been applied to 11 public real-world large network data sets, and our experiments give very encouraging results. Not only is the proposed algorithm scalable in terms of both time and space complexity, but it also gives comparable performance. Our future research will consider (1) combining OLTM and OLEM into a single better algorithm; (2) improving the generative model to allow edges to appear and disappear under general probability distributions; (3) exploring how to apply our method to other objective functions.