Identifying influential nodes based on network representation learning in complex networks

Identifying influential nodes is an important topic in many diverse applications, such as accelerating information propagation, controlling rumors and diseases. Many methods have been put forward to identify influential nodes in complex networks, ranging from node centrality to diffusion-based processes. However, most of the previous studies do not take into account overlapping communities in networks. In this paper, we propose an effective method based on network representation learning. The method considers not only the overlapping communities in networks, but also the network structure. Experiments on real-world networks show that the proposed method outperforms many benchmark algorithms and can be used in large-scale networks.


Introduction
Identifying influential nodes in complex networks has gained great attention in the research community [1][2][3][4][5][6][7]. In recent years, many methods have been put forward to find influential nodes in complex networks. The knowledge of node's spreading ability shows new insight for application such as controlling propagation of messages and rumors in social networks [8], ranking reputation of scientists [9] and finding social leaders [10],etc.
The early measure of identifying influential nodes proposed by Shimbel is Stress Centrality [11] in 1950s. He suggested that the centrality of a node should be the total number of shortest paths that go through it. Degree Centrality [12] is a direct and effective method to measure the importance of nodes, but it neglects the global structure of the network. Eigenvector Centrality [13] considers the importance of node's neighbors. Betweenness Centrality [14] and Closeness Centrality [15] need to know all topology information of networks in advance and cannot be applied to large-scale networks. Comin et al. [16] combined degree and betweenness, but it is a time-consuming measure. Chen et al. [2] proposed a semi-local centrality measure, which is a tradeoff between the low-relevant degree centrality and other time-consuming measures. Additionally, Chen et al. [17] proposed ClusterRank, a local ranking method that considers the clustering coefficient of a node. Kitsak et al. [1] suggested that the influence of a node is mainly dependent on its position in the network and proposed K-shell to measure the importance of a node. However, K-shell considers only the links between the residual nodes, whereas the links a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 that connect to the exhausted nodes are entirely ignored. Johanhyun Bae et al. [18] extended the K-shell and proposed C nc and C nc+ . Zeng et al. [6] proposed a mixture decomposition method called Mixed Degree Decomposition (MDD), which considers both the residual degree and the exhausted degree. FD Malliaros et al. [19] proposed K-truss decomposition and suggested that the topological properties of the nodes play a crucial role. Lü et al. [20] showed the relationship between degree, H-index and coreness by constructing an operator and proved that the convergence to coreness can be guaranteed even under an asynchronous updating process. Numerical analyses in real networks suggested that the H-index is a good tradeoff that can better quantify node influence than either degree or coreness. In the same year, Lü et al. [21] reviewed the vital nodes identification methods and experimented on realworld networks to compare the mainstream algorithms. The methods of identifying influential nodes based on random walk are mainly used in web page sorting. The typical methods are Kleinberg's HITS algorithm [22], Google's PageRank algorithm [23] and Lv's LeaderRank algorithm [8].
Most of the previous methods only consider the node's topology information. In fact, realworld networks often have a strong community structure [24]. In social as well as other types of networks, nodes often belong to multiple communities simultaneously [25]. Influential nodes always act as "bridging" between the communities and exist in community overlaps. In this paper, we propose a new local central method to identify the influential nodes. The method assumes that the more communities a node belongs to, the greater influence of the node. To identify the influential nodes, we use the network representation learning to detect overlapping communities, and then combine with the topology information of the nodes. Experiments show state of the art performance in terms of the quality of identified influential nodes.

Network representation learning model
Network representation learning aims at learning distributed vector representation for each vertex in a network. It is also increasingly recognized as an important aspect for network analysis. Network representation learning tasks can be broadly abstracted into the following four categories: (a) node classification [26], (b) link prediction [27], (c) clustering [28], and (d) visualization [29].
J. Yang et al. [25] proposed the BIGCLAM model for network representation learning, which also covers the overlapping community detection. The model assumes that the overlaps of communities tend to be more densely connected than the non-overlapping parts. We briefly introduce this model with a bipartite graph in Fig 1. In Fig 1, the circles on the top represent communities, the squares at the bottom represent the nodes of the graph, and the edges indicate node community affiliations. Each affiliation edge in the bipartite affiliation network has a nonnegative weight. The higher the node's weight to the community the more likely is the node to be connected to other members in the same community. Each community c creates an edge between nodes u and node v with probability 1 − exp(−F uc Á F vc ). Where F uc is the nonnegative weight of node u to community c. The higher the value of F uc , the more likely is the node u has an edge with the nodes in community c. Furthermore, the model assumes that each community creates edges independently. For example, in Fig 1(A), node u and node v belong to community A and community B simultaneously. In Fig 1(B), F uA and F uB indicate the node u's weight of the affiliation to the community A and community B respectively. In community A, the probability of existing an edge between node u and node v is 1 − exp(−F uA Á F vA ). Similarly, the probability that there is an edge between node u and node v is 1 − exp(−F uB Á F vB ). Note that since node u and node v belong to community A and community B simultaneously, node u and v receive two chances to create a link. As each community creates edges independently, the probability of an edge existing between node u and node v is Given a network G(V,E), where V is the node set and E is the edge set. Let F 2 R NÂK be a nonnegative matrix, where N is the number of nodes and K is the number of communities. F uc is the weight between node u 2 V and community c 2 K. Given F, BIGCLAM generates G(V, E) by creating edge (u,v) 2 E between a pair of nodes u,v 2 V with probability p(u,v): where F u is a weight vector for node u. Each element in F u is the weight of node u to the corresponding community. The model aims to finding the most likely affiliation factor matrixF 2 R NÂK of the underlying network G by maximizing the likelihood: where Many times, we take the logarithm of the likelihood and call it log-likelihood: BIGCLAM learns a K-dimensional non-negative vector for each node in the network by optimizing the problem of Eq 4. Each dimension in the vector represents the probability that the node belongs to the corresponding community. After learningF, the model need to determine whether node u belongs to community c or not from the value of F uc . It ignores the membership of node u to community c if F uc is below some threshold δ. Otherwise (F uc > δ), it regards u as belonging to c. One node can belong to more than one community simultaneously. Based on the BIGCLAM model, we assume that the nodes in the community overlaps play the 'bridging' role between the communities. As these nodes belong to multiple communities, information through these nodes can be easily spread to other communities. It is reasonable to assume that nodes in community overlaps have greater influence.

Network constraint coefficient
Structural holes is a concept from social network research, which is originally developed by Burt [30] [31]. A structural hole is understood as a gap between two individuals who have complementary sources to information. Fig 2(A) is a structural hole of node E. The position of node E makes it serve as a bridge or a 'broker' between three different nodes. Thus, node E is likely to receive some non-redundant information from its contacts. The term 'structural holes' is used for the separation between non-redundant contacts. Because of the hole between two contacts, they provide network benefits to the third party (to node E). Burt used the network constraint coefficient C to measure the constraints imposed by forming a structural hole: where Γ(i) is the neighbor set of node i. As shown in Fig 2(B), p ij is proportion of i's energy invested in relationship with j and p ij ¼ 1 NðiÞ , where N(i) represents the degree of node i. p iq ¼ 1 NðiÞ and p qj ¼ 1 NðqÞ represent node i's and node j's energy respectively invested in relationship with the common neighbor q. From Eq 6 we can see that a node with a small constraint coefficient indicates that the degree of the node is large and the connections among neighbors are sparse. Thus, the node with a small constraint coefficient would have more chances to spread the information to a large portion of the network. The smaller the constraint coefficient of a node is, the faster the node can spread information.

Ranking method
Nodes in community overlaps play the 'bridging' role between the communities. Information can be spread to multiple communities through these nodes. The number of communities a node be longs to can be regarded as its propagation capacity. The more communities a node belongs to, the more communities the node can influence. Network constraint coefficient of a node can be regarded as the propagation speed in community. The smaller the constraint coefficient of a node is, the faster the node can spread information. We consider both propagation capacity and propagation speed of a node to evaluate its influence, which denoted by OC. The OC of node i is defined as follows: where Γ(i) is the neighbor set of node i, C k is the network constraint coefficient of node k, maxOC is the normalization factor, Nb(k) represents the total number of communities that node k's neighbors belong to. For example, assuming that node i has 3 neighbors a, b, c, node a has 2 communities named 1, 4, node b has 3 communities named 1, 2, 3, and node c has 4 communities named 2, 4, 5, 6, then the communities that the neighbors of node i have are 1, 2, 3, 4, 5, 6. Thus, Nb(i) = 6. To identify influential nodes in the network, we need to know the total number of communities that each node belongs to and their constraint coefficients. First, we use the BIGCLAM model to detect overlapping communities, and calculate the total number of communities that each node belongs to. Second, we calculate the network constraint coefficient for each node based on Eq 6. Third, according to Eq 7, we calculate the OC value of each node. Note that if there are no overlapping communities in the network, OC degrades to the network constraint coefficient.

Evaluation method
To evaluate the performance of the proposed method, we use the SIR model [32] to examine the influence of nodes. The model is used to simulate the spread of the virus or information process. The SIR model divides the network nodes into three types: (1) Susceptible nodes, healthy but not immune and can be infected; (2) Infected nodes, already infected and can infect susceptible nodes; (3) Recovered nodes, which have been cured and cannot be infected again. In the beginning, the node to be tested is in the Infected state whereas the rest of the nodes of the network are in the Susceptible state. This node triggers a spreading process where every infected node can infect its neighbors at each timestep t with probability β. Each infected node is cured with probability γ. In this paper, we set γ = 1, which means that each node has only one chance to infect its neighbors in every round. The sum (F(t)) of recovered nodes at time t when there is no infected node exiting in the network are defined as the influence of the node. In order to ensure the spreading process, we set β to be slightly larger than the epidemic threshold (b th ¼ hki ½hk 2 iÀ hki ) in the network [33], where hki and hk 2 i denote the average degree and the second order average degree, respectively. In this paper, we set b th ¼ hki To quantify the correctness of the ranking methods, we adopt Kendall's tau [34] as a rank correlation coefficient, which is defined as follows: where R 1 and R 2 are two different rank lists, n t = n(n − 1)/2, n t1 = ∑ i t i (t i − 1)/2, n t2 = ∑ j t j (t j − 1)/ 2, t i and t j are the number of tied values in the ith and jth groups of ties, respectively. n c and n d are the numbers of concordant and discordant pairs, respectively. For example, let X and Y be two ranking lists. (x 1 , y 1 ), (x 2 , y 2 ),. . .,(x i , y j ) are a set of joint ranks from X and Y, respectively. Any pair of ranks (x i , y i ) and (x j , y j ) is said to be concordant if x i > x j and y i > y j or x i < x j and y i < y j . If x i > x j and y i < y j or x i < x j and y i > y j , the pair is said to be discordant. If x i = x j or y i = y j , the pair is neither concordant nor discordant. This metric quantifies the similarity between the orderings of the measures and the real ranking.

Experimental results
In this section, we compare the proposed method OC with Degree Centrality(DC), Betweenness Centrality(BC), Closeness Centrality(CC), Eigenvector Centrality(EC), C nc , C nc+ , Network Constraint Coefficient(NC) and K-shell(KS). In each implementation, one node is selected to be infected, and then infects its neighbors according to the SIR model. The influence of the node (F(t)) is the sum of recovered nodes when the spreading process fade out. This value represents the average over multiple executions of the model (we performed 1000 simulations for large and 100 simulations for small datasets). Without special explanation, in this paper, the value of β is shown in Table 2, γ = 1, and the threshold for the dataset is 1000 nodes.
In Table 2, we compare the Kendall correlation coefficient τ of different ranking methods. The results in Table 2 manifest that our method outperforms the other methods in most cases.
Based on the above results, we plot the correlation of the influence measures in GDciting, Dolphins and CA-CondMat. The results are shown in Fig 3, Fig 4 and Fig 5 respectively. Due to the large number of nodes in CA-CondMat, we only show the result of the top 500 nodes. We can witness that there is a clear correlation between F(t) and OC, while the traditional measures, i.e., the BC and the CC, have little relationship with the influence capability of the spreaders in an epidemic process. As the proposed method OC contains the network constraint coefficient and the node degree information, we compare OC with DC and NC. The results are shown in Fig 6. The color of each point represents the influence of the node. We can observe that OC has strong correlation with DC and NC, but there are still many influential nodes with small values of DC and many little influential nodes with small values of NC. It indicates that NC or DC alone is not sufficient to identify influential nodes. The proposed method OC contains both constraint coefficient and node degree information. It can identify influential nodes better.

Table 1. The statistical properties of the networks, where n is the number of nodes, m is the number of edges, C is the number of communities divided by BIGCLAM model, MLC is the maximum number of communities the nodes have in the network.
To further estimate how the infection probability β affects the effectiveness of different methods, the Kendall correlation coefficient τ as a function of β for different methods is shown in Fig 7. The infection probability β varies from β th to 2β th . As described in Fig 7, on a wide range of probabilities β, OC is better than other measures in the four networks. In Fig 8, we conduct the same experiments for different values of γ, which varies from 0.5 to 1. As shown in Fig 8, the proposed OC presents better results than the other measures in the four networks.  Furthermore, we investigate the number of nodes being at the infected and recovered state for various timesteps of the SIR model. We focus on the nodes that appear in the top-k lists of each method. We initially set these nodes to be infected. Here we set k = 10 for the small networks (GDciting, Netscience and Email) and k = 100 for the large network (CA-CondMat). The cumulative number of infected nodes (F(t)) as a function of time t in the four networks are shown in Fig 9. Due to the randomness of transmission, the experimental result of SIR model is different in each experiment. We use the error bar graph to present the results. As shown in Fig 9, the number of cumulative infected nodes increases with time and ultimately reach the steady value. For all these four networks, OC outperforms the other methods for both spreading rate and the number of infected nodes.
In many cases, people are more interested in a small fraction of the most influential nodes in the network. Here, we use L to represent the fraction of the most influential nodes measured by each method. We let L vary from 0.1 to 1.0 and do the influence comparison experiment between the top nodes ranked by different methods. As shown in Fig 10, our method outperforms the other methods on almost the entire range of L in the four networks.
Determine the number of communities c is a challenging task in community detection. As our method need to use the result of community detection, it is necessary to evaluate the impact of the number of communities c on the result. We divide the network into different numbers of communities and then identify the influential nodes. Fig 11 shows the results in the three networks. The horizontal axis represents the number of communities divided and the vertical axis is the network correlation coefficient τ. As can be seen from Fig 11, the number of communities divided has a limited effect on the results. The fluctuation of τ does not exceed 0.1 with c varying.

Conclusion
Identifying influential nodes in complex networks is very important in theoretical and practical applications. In this paper, we proposed an efficient method based on BIGCLAM model. The method suggests that the community overlaps play the "bridging" role between the communities. The number of communities that a node belongs to represents its propagation capacity. In addition, we consider the network constraint coefficient of the node, which represents its propagation speed in community. The comparison results between the proposed method and the benchmark algorithms demonstrated that proposed method can obtain the best results. Our results could shed some light on how to utilize network representation learning and overlapping community detection to identify influential nodes.
Supporting information S1 Dataset. Nine real-world networks used in this paper. (ZIP) reviewers for their insightful and constructive commendations that have led to an improved version of this paper.