Comprehensive influence of topological location and neighbor information on identifying influential nodes in complex networks

Identifying the influential nodes of complex networks is now seen as essential for optimizing the network structure or efficiently disseminating information through networks. Most available methods determine the spreading capability of nodes from their topological locations or their neighbor information: the degree of a node is usually used to denote the neighbor information, and the k-shell index is used to denote the location of a node. However, k-shell does not provide enough information about the topological connections and position of the nodes. In this work, a new hybrid method is proposed to identify highly influential spreaders by considering not only the topological location of a node but also its neighbor information. The percentage of triangle structures is employed to measure both the connections among the neighbor nodes and the location of nodes, and the contact distance is taken into consideration to distinguish the interaction influence of neighbors at different steps. The comparison between the proposed method and some well-known centralities indicates that the proposed measure is more highly correlated with the real spreading process. Furthermore, another comprehensive experiment shows that removing the top nodes ranked by the proposed method destroys the network relatively more quickly than removing those ranked by the other compared semi-local measures. Our results may provide further insights into identifying influential individuals according to the structure of the networks.


Introduction and motivation
Networks play such an important role in people's social lives nowadays that a wide range of real-world phenomena, from social to medical and biological networks, can be described by complex networks [1,2]. The nodes play different roles in the network, since some nodes are more important than others according to their structural positions. Identification of the important nodes in networks has been a fundamental problem and has theoretical significance in many applications, such as constraining and preventing the spreading of disease [3] or rumor [4] [...] to which a node directly links [30], and the combination of intra-community and inter-community links [32].
In general, it has been revealed that the neighborhood attribute and the position attribute are two important factors in determining the importance of a node. Inspired by this, this paper proposes a new hybrid centrality to discover these influential nodes. The number of neighbors is used to denote the neighborhood attribute, and the position attribute is denoted by the proportion of the triangular structures formed by the node and its neighbors. Evaluation results in terms of discriminability and correctness demonstrate that the proposed method can efficiently discriminate the influence capability of nodes and provide a more reasonable ranking list than the other compared methods. The remainder of this paper is organized as follows. In Section 1, related work is reviewed. Section 2 describes the details of the proposed method. Section 3 reports and analyzes the experimental results, followed by a conclusion in Section 4.

Proposed method
This work attempts to determine the influential nodes from the natural characteristics of networks in a semi-local approach. K-shell is known as the position index of a node in the network. Usually, a higher K-shell value means that a node is surrounded by a large number of densely connected neighbors, so that the node itself is not easily removed in the iterations of the decomposition. Once a connection exists between any two of a node's neighbors, a triangle structure forms. If many triangle structures are formed among the node and its neighbors, the node is more likely to be located in a dense part of the network, so the number of triangles may be an effective indicator of the location of the node. The triangles also play another role, namely measuring the topological connections among nodes [33]: the higher the percentage of the triangular structures of the whole network that are formed by a node with its neighbor nodes, the denser the connections between the node and its neighbor nodes are. Inspired by this, we use the percentage of triangular structures to propose a hybrid centrality that considers the neighbor information and the position attribute of a node simultaneously. Moreover, during the spreading process, a node usually reaches its nearest neighbors first, then the next-nearest neighbors, and so on. The contact distance between nodes is an important parameter in a spreading process [34], since the interaction effect between two nodes decreases with their distance. Unlike time-consuming algorithms that calculate the shortest path distance [35,36], in this paper we simplify the distance as follows: the distance from a node to its nearest neighbors is one, to its next-nearest neighbors is two, and so on.
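The simplified contact distance described above (one for the nearest neighbors, two for the next-nearest neighbors, and so on) corresponds to the hop count of a breadth-first search. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `neighbors_within`, the adjacency-dictionary representation, and the toy graph are all hypothetical and introduced only for illustration.

```python
from collections import deque


def neighbors_within(adj, source, max_dist=2):
    """Return {node: hop distance} for all nodes within max_dist hops of source.

    adj is an assumed representation: a dict mapping each node to the set of
    its neighbors. The source node itself is excluded from the result.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue  # do not expand beyond the chosen distance range
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    del dist[source]
    return dist


# Hypothetical toy data: the path graph 0 - 1 - 2 - 3.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(neighbors_within(adj, 0, max_dist=2))  # {1: 1, 2: 2}
```

Stopping the search at `max_dist` keeps the cost local to each node, which is the motivation the text gives for avoiding a full shortest-path computation.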
In this way, the influence of a node (labeled as C) is defined by Eq (1), where $k_u$ is the degree of node $u$ and $TP(u)$ is the percentage of triangle structures that exist between the node and its neighborhood, calculated by $TP(u) = \frac{NTS(u)}{TNTS}$. Here $NTS(u)$ is the number of triangle structures formed between node $u$ and its neighbors, and $TNTS$ is the sum of the triangle structures formed by all the nodes in the network, namely $TNTS = \sum_{v=1}^{n} NTS(v)$. Since every triangle is counted once for each of its three nodes, the total number of triangle structures in the network is $\frac{1}{3} \times TNTS$. $d(uv)$ denotes the shortest distance between the nodes $u$ and $v$, and the neighborhood set $u \in F(v)$ denotes the nearby nodes, which include but are not limited to the nearest neighbors; that is, information from nodes more steps away is taken into consideration. To reduce the algorithm complexity, in this paper the distance range $d$ is set to 2, namely, only the nearest neighbors and the next-nearest neighbors are taken into consideration. The effect of $d$ is validated in Section 3.
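The triangle quantities defined above can be illustrated with a short sketch. It computes only $NTS(u)$, $TNTS$ and $TP(u)$ as described in the text; it does not attempt to reproduce the C index itself. The function names, the adjacency-dictionary representation and the toy graph are assumptions made for illustration.

```python
def triangle_counts(adj):
    """NTS(u): number of triangles that contain node u.

    adj is an assumed representation: dict mapping node -> set of neighbors.
    """
    nts = {}
    for u, nbrs in adj.items():
        # A triangle at u corresponds to an edge between two neighbors of u.
        links = sum(len(adj[v] & nbrs) for v in nbrs)
        nts[u] = links // 2  # each neighbor-neighbor edge is seen twice
    return nts


def triangle_percentage(adj):
    """TP(u) = NTS(u) / TNTS, with TNTS the sum of NTS over all nodes."""
    nts = triangle_counts(adj)
    tnts = sum(nts.values())
    if tnts == 0:
        return {u: 0.0 for u in adj}  # triangle-free graph
    return {u: count / tnts for u, count in nts.items()}


# Hypothetical toy data: two triangles {0, 1, 2} and {2, 3, 4} sharing node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3, 4}, 3: {2, 4}, 4: {2, 3}}
nts = triangle_counts(adj)
tp = triangle_percentage(adj)
print(nts)                     # {0: 1, 1: 1, 2: 2, 3: 1, 4: 1}
print(sum(nts.values()) // 3)  # 2 triangles in the whole network
```

In this toy graph the shared node 2 receives the largest share, TP(2) = 2/6, while every other node receives 1/6.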
Then, an extended index (labeled as Lhc) is further developed based on Eq (1), where $w \in \tau(v)$ ranges over the nearest neighbors of node $v$.
The procedure for computing Lhc is as follows. The algorithm traverses the nodes of the network in turn. The main work is to calculate the degree of each node and the number of triangle structures among the node and its neighbors.

Evaluation strategies
The effectiveness of the proposed method is empirically evaluated through a series of experiments. Lhc is compared with eight other well-known measures, covering local, global and semi-local metrics, from the aspects of discriminability, correctness and robustness. The methods are DC (degree centrality) [14], BC (betweenness centrality) [15], H-index [19], LC (local centrality) [22], Cnc+ (neighborhood coreness) [23], G+ (extended gravity index) [26], EW (extended weight degree centrality) [25] and LGM (local version of the gravity model) [28].

Discriminability.
If the nodes are assigned clearly different influence values, their influence capabilities can be easily distinguished from each other. In this section, the centrality measures are compared to evaluate how well each of them discriminates between nodes. With the help of the Complementary Cumulative Distribution Function (CCDF) [23], we can obtain a clear picture of the ranking distributions of the different measures and of how frequently nodes share a rank.
In the CCDF, $n_i$ denotes the number of nodes with rank $i$ on the list, $n$ is the total number of nodes in the network, and $r$ is the number of ranks. According to the CCDF principle, if $r \to n$, the discriminability is good and the CCDF plot declines slowly; if $r \to 1$, all nodes are assigned to only a few ranks and the CCDF plot decreases rapidly. The CCDF is plotted for the Dolphin, Polbook, Football, Usair, Elegans, and PowerGrid networks. As can be seen in Fig 1, in the Dolphin, Polbook, and Football networks the CCDF of DC and H-index drops to zero steeply, so the influence values of a large number of nodes cannot be distinguished from each other. The five semi-local methods LC, EW, G+, LGM, and our Lhc consider more topological information, so they perform better: their CCDF plots tend to zero with a gentler slope, following the diagonal line. Although Cnc+ considers the K-shell information of more neighbors, its performance does not seem to be as good as that of LC, G+, LGM, and Lhc. BC performs almost as well as LC, EW, G+ and Lhc; that is, the nodes in those three networks act in different bridge roles, so BC achieves a better discriminability. In Usair and Elegans, Lhc tends to show a gentler slope and more distinct ranks than LC. As shown by the basic topological statistics of these networks in Table 1, the clustering coefficients of the Usair and Elegans networks are rather large; that is, a cluster of nodes may have many triangle structures formed by a node and its neighbor nodes. Since Lhc considers the structure information of a node and its neighbors, it achieves a better ranking distribution. The CCDF plot of BC declines slowly at the beginning and then decreases rapidly, which means that it cannot distinguish any further nodes.
For the larger network, PowerGrid, the CCDF of DC and H-index clearly drops at the beginning, as in the other networks. BC still cannot perform as well as the semi-local methods, even though BC considers information in the global scope. In PowerGrid in particular, the clustering coefficient is small and many nodes share the same degree or K-shell value, so the performance of LC and Cnc+ is relatively poor compared with Lhc. Lhc performs best among these, even though EW and G+ also consider the information of neighbors more steps away from a node. It should be noted that LGM shows better performance than the above methods. The main reason is that the average distance of PowerGrid is large and LGM takes more path information into consideration, so it achieves better discriminability, but at the expense of a long running time in networks with such a large average distance. Nodes in the network may have the same H-index, DC or even K-shell value, while the numbers of triangle structures formed between each node and its neighbors may differ, so Lhc can achieve a better ranking distribution.
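The CCDF comparison in this section can be sketched in a few lines. The sketch assumes one common form of the rank CCDF that is consistent with the symbols defined in the text, $CCDF(r) = \frac{n - \sum_{i=1}^{r} n_i}{n}$, where nodes with equal scores share a rank; this form, the function name `ccdf` and the toy scores are assumptions, not taken from the paper.

```python
from collections import Counter


def ccdf(scores):
    """Return [CCDF(1), ..., CCDF(R)] for a dict mapping node -> score.

    Assumed form: CCDF(r) = (n - sum_{i<=r} n_i) / n, where n_i is the number
    of nodes sharing rank i and rank 1 is the highest score.
    """
    n = len(scores)
    counts = Counter(scores.values())      # n_i for every distinct score
    ranked_values = sorted(counts, reverse=True)
    curve, seen = [], 0
    for value in ranked_values:
        seen += counts[value]
        curve.append((n - seen) / n)
    return curve


# Hypothetical toy scores: a measure with many ties versus one with none.
tied = {"a": 3, "b": 3, "c": 3, "d": 1}
distinct = {"a": 4, "b": 3, "c": 2, "d": 1}
print(ccdf(tied))      # [0.25, 0.0]
print(ccdf(distinct))  # [0.75, 0.5, 0.25, 0.0]
```

The tied measure uses only two ranks and its curve falls to zero almost immediately, while the measure without ties descends gradually along the diagonal, which is the behavior the text associates with good discriminability.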

Correctness.
Apart from the discriminability evaluation above, this section evaluates the accuracy and correctness of the proposed measure in node ranking. In principle, the ranked list generated by an effective ranking method should be as consistent as possible with the ranked list generated by the real spreading process. The ranking results of spreading are usually obtained from the SIR model. In the SIR model [17,49], each node can be in one of three states: susceptible (S), infected (I), and recovered (R). To check the spreading influence of a given node $v$, initially only node $v$ is in the infected state and all other nodes are in the susceptible state. At every time step, each infected node infects each of its susceptible neighbors with infection probability β, and then enters the R state with probability μ. In this paper, we set μ = 1.0. The process continues until no nodes in the I state remain in the network. At the end of the SIR process, the number of R nodes is taken as the spreading capability of node $v$. By selecting different nodes as the initially infected node, the spreading influence of all network nodes and their ranking list can be obtained. In these experiments, the SIR simulation has been repeated $10^4$ times for a network with |E| < 100 and $10^3$ times for a network with 100 < |E| < 1000. The average number of recovered nodes is regarded as the spreading capability. In the SIR simulation, the infection probability β should be neither too small nor too large. When β is too small, the epidemic cannot successfully spread over the network; on the contrary, a large β may lead to an easy outbreak over almost the whole network. So a suitable β is needed to better measure the spreading ability of each node. Usually, the value of β is chosen relative to a threshold value, calculated as $\beta_{th} = \langle k \rangle / \langle k^2 \rangle$, where $\langle k \rangle$ and $\langle k^2 \rangle$ denote the average degree and the average second-order degree of the nodes, respectively.
The value of β is set slightly larger than $\beta_{th}$. The values of β for the different networks are given in Table 2.
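The SIR procedure and the threshold $\beta_{th} = \langle k \rangle / \langle k^2 \rangle$ described above can be sketched as follows. This is an illustrative sketch with μ fixed to 1, as in the text, so every infected node recovers after one step; the function names, the default number of runs, the fixed random seed and the toy graph are assumptions for illustration and not the authors' code.

```python
import random


def epidemic_threshold(adj):
    """beta_th = <k> / <k^2>, computed from the degree sequence."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    mean_k = sum(degrees) / len(degrees)
    mean_k2 = sum(k * k for k in degrees) / len(degrees)
    return mean_k / mean_k2


def sir_spread(adj, seed, beta, rng):
    """Run one SIR process with mu = 1 and return the number of recovered nodes."""
    infected, recovered = {seed}, set()
    while infected:
        newly_infected = set()
        for u in infected:
            for w in adj[u]:
                susceptible = (w not in infected and w not in recovered
                               and w not in newly_infected)
                if susceptible and rng.random() < beta:
                    newly_infected.add(w)
        recovered |= infected        # mu = 1: recover after a single step
        infected = newly_infected
    return len(recovered)


def spreading_capability(adj, seed, beta, runs=1000, rng=None):
    """Average number of recovered nodes over repeated runs from one seed node."""
    rng = rng or random.Random(0)    # fixed seed only to make the sketch repeatable
    return sum(sir_spread(adj, seed, beta, rng) for _ in range(runs)) / runs


# Hypothetical toy data: a star with center 0 and four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
beta_th = epidemic_threshold(star)   # <k> = 1.6, <k^2> = 4, so about 0.4
center = spreading_capability(star, 0, beta=0.5)
leaf = spreading_capability(star, 1, beta=0.5)
```

Ranking all nodes by `spreading_capability` gives the reference list that the centrality rankings are compared against. In the toy star the center is expected to reach more nodes than a leaf for any β strictly between 0 and 1.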
Kendall's rank correlation coefficient (τ) [50] is usually utilized to quantify the correlation between the ranked list generated by a certain centrality measure and the ranked list obtained from the SIR simulation. Let $(x_1, y_1), \ldots, (x_n, y_n)$ be a set of rank pairs from two distinct ranking lists X and Y. The observations $(x_i, y_i)$ and $(x_j, y_j)$ are said to be concordant if $x_i > x_j$ and $y_i > y_j$, or if $x_i < x_j$ and $y_i < y_j$. If $x_i > x_j$ and $y_i < y_j$, or if $x_i < x_j$ and $y_i > y_j$, the pair is said to be discordant. If $x_i = x_j$ or $y_i = y_j$, the pair is neither concordant nor discordant. Kendall's tau coefficient (τ) is defined as $\tau = \frac{N_c - N_d}{0.5\,n(n-1)}$, where $N_c$ and $N_d$ are the numbers of concordant and discordant pairs in the ranking lists, respectively. Note that τ increases with the concordance of the two ranking lists: a higher τ value indicates that the ranked list generated by a centrality measure is more strongly correlated with the real spreading process. In the proposed method, the neighborhood distance range was set by the parameter d = 2, that is, only the nearest and next-nearest neighbors are taken into consideration. With the help of the SIR model, the effect of different values of d is examined in the following experiment on the real networks Contiguous, Dolphin, Polbook, Football, Jazz, Usair, Netscience, Elegans, Euroroad, PowerGrid and PGP. The Kendall τ correlation between the SIR epidemic ranking list and the Lhc ranking list is obtained for a series of values of d. As shown in Fig 2, the optimal value of d is in general about 2-3. In most of the above networks, d = 2 gives a higher τ, and when d increases beyond 3, τ becomes stable. Also with the help of SIR, the effects of the K-shell value, the clustering coefficient, and the triangles of nodes on the evaluation of nodes' influence are compared.
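The concordant and discordant pair counting behind Kendall's τ can be made concrete with a small sketch. It assumes the common tau-a form, $\tau = 2(N_c - N_d) / (n(n-1))$, in which tied pairs contribute to neither count; the function name `kendall_tau` and the toy lists are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equally long score (or rank) lists.

    Assumed form: tau = 2 * (N_c - N_d) / (n * (n - 1)); pairs tied in
    either list are counted as neither concordant nor discordant.
    """
    n = len(x)
    n_c = n_d = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            n_c += 1      # both lists order the pair the same way
        elif sign < 0:
            n_d += 1      # the lists order the pair in opposite ways
    return 2 * (n_c - n_d) / (n * (n - 1))


# Hypothetical toy data: a centrality score list and SIR-based spreading scores.
centrality = [1, 2, 3, 4]
spreading = [1, 3, 2, 4]
print(round(kendall_tau(centrality, spreading), 4))   # 0.6667
print(kendall_tau(centrality, centrality))            # 1.0
print(kendall_tau(centrality, centrality[::-1]))      # -1.0
```

In the first call five of the six pairs are concordant and one is discordant, which gives τ = 2(5 - 1)/12.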
The K-shell value is a well-known index usually used to measure the location of a node, and the clustering coefficient is usually employed to evaluate the topological connections among the neighbors. The triangles, on the one hand, denote the extent to which the neighbors may infect each other and, on the other hand, may be an effective indicator of the location of the node. As shown in Fig 3, the clustering coefficient performs poorly in evaluating the spreading ability of nodes, since its correlation τ is considerably lower than that of the other two indexes in both denser and sparser networks. A node may sometimes have a large clustering coefficient but relatively few triangles, in which case the clustering coefficient is not an effective indicator. Compared with K-shell, the percentage of triangles (TP) shows comparable performance in the networks whose clustering coefficients are rather high, and in Contiguous, Polbook, Football, Jazz, and Netscience, TP achieves better performance than K-shell. In some sparser networks, such as Euroroad and PowerGrid, in which there are relatively few connections among nodes, K-shell performs relatively better than TP; that is, TP may lose its advantage in this kind of network, so more topological information is needed. This is what the proposed Lhc considers and combines: degree and TP, where the former reflects the neighborhood information of nodes and the latter denotes both the connections among the neighbors and the locations of nodes. Kendall's tau correlation coefficients between two ranking lists are calculated for each network: one list is given by each measure, denoted γ, where γ = DC, BC, H-index, LC, Cnc+, G+, EW, LGM, Lhc, and the other is obtained from the SIR process, denoted θ.
As shown in column τ(γ, θ) of Table 2, from small networks like Contiguous and Dolphin to a large network like PGP, the γ given by Lhc is more highly correlated with θ than that of the other measures.
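The remark that a node may have a large clustering coefficient but relatively few triangles can be checked on a small example. The sketch below uses the standard local clustering coefficient, $2T / (k(k-1))$ for a node with degree $k$ and $T$ triangles; the function name `local_stats` and the toy graph are assumptions for illustration.

```python
def local_stats(adj, u):
    """Return (degree, triangle count, local clustering coefficient) of node u.

    adj is an assumed representation: dict mapping node -> set of neighbors.
    """
    nbrs = adj[u]
    k = len(nbrs)
    # Each edge between two neighbors of u closes one triangle at u.
    triangles = sum(len(adj[v] & nbrs) for v in nbrs) // 2
    clustering = 2 * triangles / (k * (k - 1)) if k > 1 else 0.0
    return k, triangles, clustering


# Hypothetical toy data: hub 0 joined to six nodes that are paired up
# (1-2, 3-4, 5-6), so the hub sits in three triangles.
adj = {0: {1, 2, 3, 4, 5, 6},
       1: {0, 2}, 2: {0, 1},
       3: {0, 4}, 4: {0, 3},
       5: {0, 6}, 6: {0, 5}}
print(local_stats(adj, 0))  # (6, 3, 0.2)
print(local_stats(adj, 1))  # (2, 1, 1.0)
```

Node 1 has the maximum clustering coefficient although it lies in a single triangle, whereas the hub has a low clustering coefficient but three triangles, so a triangle-based quantity ranks the hub higher.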
To further evaluate how the probability β affects the performance of the different measures, ranking lists are next obtained from the SIR model under a series of β values around $\beta_{th}$. The correlations are plotted for the Contiguous, Dolphin, Usair, Netscience, Euroroad, and PowerGrid networks. As shown in Fig 4, Lhc achieves better performance for a given value of the spreading probability β in the above networks; especially when β is around the epidemic threshold $\beta_{th}$, the proposed method is more strongly correlated with θ.
In Contiguous and Dolphin, when β is far smaller than $\beta_{th}$, degree centrality performs better, and as the spreading probability β increases, its Kendall's τ becomes lower and lower. Compared with DC and H-index, the six semi-local measures LC, Cnc+, EW, G+, LGM, and Lhc perform better as the spreading probability approaches $\beta_{th}$. The larger the spreading probability, the farther the epidemic can spread from the initially infected node. LC, Cnc+, EW, G+, LGM, and Lhc consider nodes more steps away from the initially infected node, so they achieve better performance over a wide range of β. The above results confirm that local neighbor information alone is not effective in evaluating the influence of a node. In Usair, the clustering coefficient is relatively large, which means that the connections among nodes are dense. Lhc considers the topological connection structure in evaluating the spreading ability, so it achieves a better result; in particular, when β is far larger than $\beta_{th}$, Lhc still keeps its high correctness. In the Netscience network, DC and H-index perform better at the beginning, but as β becomes larger they lose their advantage and their two curves start to decrease. The clustering coefficient of Netscience is also quite large, so the same reasoning as for Usair explains why Lhc achieves better performance over a wide range of β. As for the two larger networks, Euroroad and PowerGrid, BC still cannot achieve a better τ than the other methods; seen in this way, BC is not good at evaluating the spreading influence of nodes in these networks.
Unlike the networks discussed above, which have a high average degree and a high clustering coefficient, the Euroroad and PowerGrid networks have a relatively small average degree and clustering coefficient; in other words, a node has few neighbors on average and the topological connections among the nodes may not be very dense. Lhc achieves better performance when β is small. As β becomes larger, LC performs almost as well as Lhc, but Lhc still achieves the largest τ when β is around $\beta_{th}$. These results again confirm its effectiveness and robustness in ranking nodes in networks with different topological characteristics. Fig 5 shows the relation between the centrality measures and the real spreading abilities on three networks. Each point indicates a node in the network, the x-axis denotes the centrality value and the y-axis denotes the spreading ability of the node. In the Dolphin network, both DC and BC suffer from the problem that the spreading abilities of nodes with the same index value vary considerably. For BC, moreover, a significant number of nodes have a large spreading influence while the value assigned by BC is quite small; that is, the spreading influence cannot be evaluated properly by BC. The value measured by a centrality method should be consistent with the spreading process; in other words, the larger the centrality value, the better the spreading ability of the node. Cnc+, EW, LGM, and Lhc consider more neighbor information, so they perform better than DC and BC, and the real spreading abilities of nodes with the same Lhc value are relatively concentrated.
In the Polbook network, the correlation between the BC value and the spreading ability is still not obvious, and the distribution of spreading ability is relatively scattered among nodes with the same BC value. In particular, some nodes have a large spreading ability although their BC value is not necessarily large. The clustering coefficient of Elegans is relatively larger than that of the other networks. Lhc takes both the number of neighbors and the connections among neighbors into consideration, so it achieves better performance: the real spreading abilities of nodes with the same Lhc value are more concentrated than for LC and Cnc+, and the values assigned by Lhc present a more obvious linear relationship with the real spreading.
From the results on the above three networks, we can see that the values of EW, Cnc+, LGM and Lhc present a positive, linear correlation with the real spreading ability; that is, the higher the centrality value, the stronger the node's spreading ability. However, the correlation for the values given by DC and BC is not that obvious: many nodes have the same DC index value, but their influences are quite different from each other. Moreover, the performance of DC is not stable across networks: the points are concentrated in Usair but relatively scattered in the Dolphin network. A good linear correlation between the real influence of a single node and the index value can be clearly seen for Lhc, and Lhc also performs better than the other semi-local metrics: nodes assigned the same Lhc value differ little in influence, and for a given Lhc value the real influence distribution of nodes is more concentrated. Fig 6 shows the relations between Lhc and four other centrality measures on three networks. Each point indicates a node in the network, the x-axis denotes the Lhc value, the y-axis denotes the value of one of the four centrality measures (H-index, LC, Cnc+ and G+), and the color represents the spreading influence of the node, denoted S. In Netscience, the nodes whose H-index is smaller than 7 do not differ much from each other in spreading influence (little color variation), while the spreading influences of the nodes whose H-index is 8 differ considerably. Seen in this way, H-index may not evaluate the spreading influence of nodes in Netscience well. Compared with the other three cases, G+ and Lhc are much more consistent with the spreading. In Elegans, the H-index, LC, Cnc+ and G+ centralities are all positively correlated with Lhc, and G+ in particular shows a stronger positive correlation with Lhc.
In addition, we can see that the nodes with higher G+ and Lhc values have deeper colors (that is, higher influence). In PowerGrid, some nodes have a small H-index but a higher influence, whereas for the three semi-local methods the nodes with high centrality are likely to have high influence. Compared with H-index, LC, G+ and Cnc+ take more information from nearby neighbors into consideration, so their results are much more consistent with the spreading. Overall, among the four cases, the correlation between G+ and Lhc is stronger than in the other three cases.

Robustness.
In the experiments above, the semi-local methods have shown their advantages over the other local or global methods in evaluating the spreading influence of nodes. Sometimes the whole network can be greatly damaged by attacking a few of its nodes; in this case, the importance of a node lies in its role in maintaining the connectivity of the network. In this section, the influence of nodes is measured from the perspective of the robustness of the network. In the experiment, a certain percentage of the nodes is first removed from the network, and the change in the connected part of the network is then used to measure the role of the removed nodes. The nodes are sorted in descending order by the different indexes, and then the same proportion of nodes (with value range [0, 1]) is removed in order for each index. G denotes the giant component remaining in the network after the top-k important nodes are removed. The smaller the value of G, the more isolated individual nodes or small groups there are in the network, and the more important the removed nodes are. We compare Lhc with four other methods, LC, Cnc+, EW and G+, on Polbook, Netscience, Elegans, and PowerGrid. As can be seen from Fig 7, the value of G decreases with the number of nodes removed (the curves decline). In the Polbook network, the curves of G+ and Lhc decline faster than those of LC and Cnc+, and Lhc achieves an obvious advantage over the other measures after the top 30% of nodes are removed. In Netscience, removing the top 10% of nodes ranked by LC, G+, or Lhc makes the network structure break down quickly, and the curve of Lhc declines slightly faster after the top 20% of nodes are removed. The same conclusions can be drawn for Elegans: the curves of G+ and Lhc still decline faster than those of LC and Cnc+, and Lhc in particular performs slightly better after the top 10% of nodes are removed.
The most obvious case is the PowerGrid network. Its clustering coefficient is small, and although removing some nodes cannot quickly break down the network structure, removing the top nodes ranked by Lhc destroys the network relatively quickly.
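The node removal experiment can be sketched as follows. The sketch assumes that G is reported as the size of the largest connected component divided by the original number of nodes; this normalization, the function names and the toy graph are assumptions, and the degree is used here only as a stand-in ranking, not as the Lhc index.

```python
def giant_component_fraction(adj, removed):
    """Size of the largest connected component after removing nodes,
    divided by the original number of nodes (an assumed normalization)."""
    alive = set(adj) - set(removed)
    seen, best = set(), 0
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:                     # depth-first search over surviving nodes
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w in alive and w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best / len(adj)


def attack_curve(adj, scores, fractions):
    """G after removing the top fraction p of nodes ranked by scores, for each p."""
    order = sorted(adj, key=lambda u: scores[u], reverse=True)
    curve = []
    for p in fractions:
        k = round(p * len(adj))          # number of top-ranked nodes to remove
        curve.append(giant_component_fraction(adj, order[:k]))
    return curve


# Hypothetical toy data: a star with center 0, ranked by degree.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
degree = {u: len(nbrs) for u, nbrs in star.items()}
print(attack_curve(star, degree, [0.0, 0.2, 0.4]))  # [1.0, 0.2, 0.2]
```

A ranking whose curve falls faster places structurally more important nodes at the top, which is the criterion used to compare the measures in Fig 7.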

Conclusion
Effectively identifying influential nodes in networks is of practical significance in many areas, such as the acceleration of information dissemination and the control of epidemic spreading. In this paper, a hybrid approach is adopted that combines two topological structural characteristics of a node to evaluate its influence. The proposed centrality measure considers both the neighbor information and the topological connections among the neighbor nodes. The neighbor information is reflected by the degree of the node, which shows how many nodes it connects with, and the number of triangle structures centered on the node is used to measure how closely its neighbors are connected. The interaction influence of neighbors at different steps is also considered, based on the fact that the interaction effect between two nodes decreases with their distance. Experimental results on several real-world networks show that the proposed Lhc method is more effective at distinguishing the influence of nodes than other conventional centrality methods as well as other semi-local methods. Further, Kendall's τ correlation coefficient is used to calculate the rank correlation between the ranked list generated by the SIR model and those of the different centrality measures, which shows that the proposed measure outperforms the other methods in evaluating the spreading influence of nodes. Finally, node removal experiments are applied to evaluate the effectiveness and performance of the centrality method as well; the results show that the top nodes ranked by Lhc are important to the structure of networks, since removing them destroys the network relatively quickly.