Maximizing the Spread of Influence via Generalized Degree Discount

It is a crucial and fundamental issue to identify a small subset of influential spreaders that can control the spreading process in networks. In previous studies, a degree-based heuristic called DegreeDiscount has been shown to effectively identify multiple influential spreaders and has served as a benchmark method. However, the basic assumption of DegreeDiscount is not adequate, because it treats all the nodes equally without any differences. To consider a general situation in real-world networks, a novel heuristic method named GeneralizedDegreeDiscount is proposed in this paper as an effective extension of the original method. In our method, the status of a node is defined as the probability of not being influenced by any of its neighbors, and an index, the generalized discounted degree of a node, is presented to measure the expected number of nodes it can influence. The spreaders are then selected sequentially according to their generalized discounted degrees in the current network. Empirical experiments are conducted on four real networks, and the results show that the spreaders identified by our approach are more influential than those identified by several benchmark methods. Finally, we analyze the relationship between our method and three common degree-based methods.


Introduction
In complex networks, models and methods for propagation behavior are always of great theoretical and practical importance. Consider a scenario in advertising: a small IT company develops a cool online application and wants to let more people know about its product. However, the funds for advertising are limited. An economical way to advertise is to deliver the product to a small group of initial users (or spreaders) who are willing to advertise the product by word of mouth. This is often referred to as influence maximization. Theoretically, influence maximization in networks is the problem of identifying a small subset of nodes whose spreading influence is maximal. Although much work has been done on measuring the influence of a single node [1][2][3][4][5][6], methods that can effectively identify multiple influential spreaders are still lacking.
The pioneers of this research area are Domingos and Richardson [7,8], who studied influence maximization as an algorithmic problem and developed a probabilistic method. Kempe, Kleinberg and Tardos [9] also made a significant contribution to this field. They showed that the problem is an NP-hard discrete optimization problem, and proposed a greedy strategy to select the spreaders that could achieve an approximation guarantee of 63%. Unfortunately, their greedy method encountered a serious drawback in computing efficiency, which limited its wide usage in large-scale networks. Leskovec et al. [10] demonstrated that many realistic influence maximization problems exhibit a property of "submodularity", and they proposed a Cost-Effective Lazy Forward (CELF) method to improve the efficiency of the greedy method. Narayanam et al. [11] analyzed the Shapley value concept from cooperative game theory and proposed ShaPley value-based Influential Nodes (SPIN). Zhao et al. [12] attempted to find effective multiple spreaders by generalizing the idea of the coloring problem in graph theory to complex networks. He et al. [13] suggested a novel method to identify multiple spreaders from communities in a balanced way. Zhang et al. [14] presented an iterative method named VoteRank to identify a set of decentralized spreaders. Chen et al. [15] decomposed the local topological structure of nodes and proposed a DegreeDiscount heuristic. Numerical experiments showed that DegreeDiscount could nearly match the performance of the greedy method, while its computational complexity was quite low. However, all nodes were treated equally in DegreeDiscount, which was somewhat oversimplified and might reduce the performance of the algorithm.
In this paper, we depict the status of nodes more concisely as a probabilistic score, and propose the GeneralizedDegreeDiscount heuristic. We discuss the computational efficiency of our method and demonstrate that the complexity is linearly correlated with the network scale, which makes our method efficient and scalable to large-scale networks. Experiments are performed on several real networks, and the results show that our method can outperform some centrality-based methods.

Intuition and theory
Degree is a basic centrality index in the research area of complex networks. It is well known that a node with a higher degree can influence more nodes than a node with a lower degree. Some studies in sociology have shown that selecting nodes with the highest degree as spreaders can result in better spreading influence than many other methods. However, in some recent studies, the authors argued that nodes with the highest degree might not always be the most influential ones. Though the effectiveness of the Degree heuristic is questionable, its low computational complexity has led to its widespread use in many business fields. In this section, we try to enhance the performance of this method by using several heuristic strategies.
Let node v be a neighbor of node u. Suppose u has been selected as a spreader. When considering the selection of v as a new spreader, one should formulate a method for calculating the contribution of edge uv to the degree of node v. It cannot be counted as 1, as has been done previously. As u has been selected as a spreader: (i) it is no longer necessary for v to influence u. (ii) u may also influence v with some probability, which further weakens the potential influence of v. Based on these considerations, we explore several heuristic strategies, in which the spreaders are selected one by one.
Degree-based heuristics. DegreeDistance [16] takes a naive approach to avoid the relative influence between adjacent nodes. For example, if a node has been selected as a spreader, we can ignore its neighbors and consider other nodes. DegreeDistance defines a candidate set $C$ and a distance threshold $d_{td}$. At first, all the nodes are in the candidate set $C$. In each round, a node $v$ with maximum degree in $C$ is selected as a spreader, and the nodes within a distance $d_{td}$ of $v$ are removed from $C$. The procedure ends when all the spreaders have been selected.
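The DegreeDistance procedure can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of Ref [16]: the adjacency-dictionary representation `adj` (node to set of neighbors), the function name `degree_distance`, and the tie-breaking rule (larger node label wins) are assumptions made for the example.

```python
from collections import deque

def degree_distance(adj, num_spreaders, d_td):
    """Select spreaders by degree, dropping nodes within d_td hops of each pick."""
    candidates = set(adj)  # candidate set C, initially all nodes
    spreaders = []
    while candidates and len(spreaders) < num_spreaders:
        # Highest-degree candidate; ties go to the larger label (assumption).
        v = max(candidates, key=lambda x: (len(adj[x]), x))
        spreaders.append(v)
        # Breadth-first search to collect every node within distance d_td of v.
        seen = {v}
        frontier = deque([(v, 0)])
        while frontier:
            node, dist = frontier.popleft()
            if dist == d_td:
                continue
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    frontier.append((nb, dist + 1))
        candidates -= seen  # remove v and its d_td-neighborhood from C
    return spreaders
```

With `d_td = 0` only the selected node itself is removed, so the sketch reduces to picking the top-degree nodes.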
In Ref [15], Chen et al. proposed two degree-based heuristics, SingleDiscount and DegreeDiscount. SingleDiscount considers a simple adaptive strategy. In each round, a node $u$ with the maximum degree is selected as a spreader. Then, for each $v \in \Gamma(u)$, we do not count the edge $uv$ when calculating the degree of $v$. In other words, the degree of $v$ will be discounted by 1. This type of degree is named the discount degree, and is denoted by $sd_v$:

$$ sd_v = d_v - t_v, \tag{1} $$

where $d_v$ denotes the original degree of $v$, and $t_v$ denotes the number of $v$'s neighbors that have already been selected as spreaders.
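As an illustration, SingleDiscount can be sketched as follows. The sketch assumes an adjacency dictionary `adj` (node to set of neighbors) and selects, in each round, the unselected node with the largest current discount degree $d_v - t_v$; the function name and the tie-breaking by node label are choices made for the example only.

```python
def single_discount(adj, num_spreaders):
    """Select spreaders by discount degree d_v - t_v."""
    # With no spreaders selected yet, t_v = 0 and the discount degree equals d_v.
    discount = {v: len(adj[v]) for v in adj}
    spreaders, chosen = [], set()
    for _ in range(min(num_spreaders, len(adj))):
        # Largest discount degree among unselected nodes; ties go to the larger label.
        u = max((v for v in adj if v not in chosen),
                key=lambda v: (discount[v], v))
        spreaders.append(u)
        chosen.add(u)
        # Each remaining neighbor of u gains one spreader neighbor: discount by 1.
        for v in adj[u]:
            if v not in chosen:
                discount[v] -= 1
    return spreaders
```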
DegreeDiscount is specifically designed for the independent cascade model. For a given spreading probability $p$, DegreeDiscount attempts a deeper analysis of the local structure of the nodes. Suppose that we want to calculate the potential spreading ability of node $v$. When $p$ is small, the multi-hop neighbors of $v$ can be ignored, and only the nearest neighbors are counted toward the degree. Let $u \in \Gamma(v)$ be a spreader neighbor of $v$. Obviously, the probability that $v$ is directly influenced by $u$ is $p$. As a result, $u$ not only contributes nothing to $v$, but also weakens the spreading ability of $v$.
When calculating the potential spreading ability of $v$, DegreeDiscount ignores the differences among $v$'s neighbors. As all the neighbors are treated equally, $v$ and its neighbors $\Gamma(v)$ can be mapped onto a star-like subgraph. Let Star($v$) be the subgraph considered here, whose edges are the edges incident to $v$. Let $d_v$ be the degree of node $v$, $t_v$ be the number of spreader neighbors of $v$, and $p$ be the spreading probability.
As the candidate node $v$ has $t_v$ spreader neighbors, the probability that $v$ is influenced by these neighbors is $1 - (1 - p)^{t_v}$. In this situation, selecting $v$ as a new spreader may not bring any additional influence. In the opposite situation, selecting $v$ will contribute to the spreading process through $v$ itself and its normal neighbors. The former term accounts for 1 node ($v$ itself), and the latter for $d_v - t_v$ nodes (normal neighbors), each influenced with probability $p$. Together, the expected number of nodes influenced by $v$ is

$$ (1 - p)^{t_v} \cdot \bigl(1 + (d_v - t_v)\,p\bigr). \tag{2} $$

Under the first-order Taylor expansion, when $p$ is small, the left term can be approximated by $1 - t_v p + o(t_v p)$. After further simplification, the whole equation becomes

$$ 1 + \bigl(d_v - 2t_v - (d_v - t_v)\,t_v\,p\bigr)\,p + o(t_v p). \tag{3} $$

Then, the discounted degree of $v$ can be defined as

$$ dd_v = d_v - 2t_v - (d_v - t_v)\,t_v\,p. \tag{4} $$

Note that the original expression in Eq (2) is always non-negative. However, in the simplified form Eq (4), $dd_v$ may be negative for some parameter values. In this situation, we manually set $dd_v$ to 0. Fig 1 depicts the local topology considered by DegreeDiscount. In this toy model, $d_v = 4$, $t_v = 1$, and

$$ dd_v = 4 - 2 \cdot 1 - (4 - 1) \cdot 1 \cdot p = 2 - 3p. \tag{5} $$

Generalized Degree Discount. As DegreeDiscount models all the neighbors in $\Gamma(v)$ equally, it ignores the differences among them. As an extreme example, let $s, t \in \Gamma(v)$ be two normal neighbors of node $v$. If all the neighbors of $s$ are spreaders and all the neighbors of $t$ are normal nodes, $s$ and $t$ should not be treated equally. Obviously, the probability that $s$ is influenced by its own neighbors is far larger than that for $t$. When calculating the potential contributions of $s$ and $t$ toward $v$, the latter should be given more weight. To make the original model more precise, we propose GeneralizedDegreeDiscount.
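Before turning to the generalized version, the DegreeDiscount quantities can be checked numerically. The sketch below assumes the expected influence takes the form $(1-p)^{t_v}(1 + (d_v - t_v)p)$ implied by the description above, and compares it with the first-order approximation $1 + dd_v \cdot p$; the function names are hypothetical.

```python
def expected_gain(d_v, t_v, p):
    """Expected number of nodes influenced by v, in the unsimplified form."""
    return (1 - p) ** t_v * (1 + (d_v - t_v) * p)

def degree_discount(d_v, t_v, p):
    """Simplified discounted degree, set to zero when it would be negative."""
    dd = d_v - 2 * t_v - (d_v - t_v) * t_v * p
    return max(dd, 0.0)

# Toy values from the text: d_v = 4, t_v = 1, with a small spreading probability.
p = 0.01
print(expected_gain(4, 1, p))            # unsimplified expectation
print(1 + degree_discount(4, 1, p) * p)  # approximation 1 + dd_v * p
```

For $t_v = 1$ the first-order expansion of $(1-p)^{t_v}$ is exact, so the two printed values agree up to floating-point rounding; for larger $t_v$ they differ by terms of higher order in $p$.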
Similar to the analysis in DegreeDiscount, the probability that node $v$ is not influenced by its spreader neighbors is $(1 - p)^{t_v}$. If $v$ is not influenced by any of those neighbors, selecting $v$ will enhance the total influence through $v$ itself and its $d_v - t_v$ normal neighbors. For any normal neighbor $w \in \Gamma(v)$, the probability that $w$ is not influenced by its own neighbors is also $(1 - p)^{t_w}$. In other words, $w$ will bring additional influence $(1 - p)^{t_w}$ to $v$ with probability $p$. Together, the expected number of nodes that will be influenced by $v$ is

$$ (1 - p)^{t_v} \cdot \Bigl(1 + \sum_{w \in \Gamma(v) \setminus S} p\,(1 - p)^{t_w}\Bigr), \tag{6} $$

where $S$ denotes the current set of spreaders, so that the summation is over all $d_v - t_v$ normal neighbors of node $v$.
Departing from the derivation in DegreeDiscount, here we consider the second-order Taylor expansion for the left term and the first-order expansion for the right term:

$$ \Bigl(1 - t_v p + \frac{t_v (t_v - 1)}{2}\,p^2 + o(p^2)\Bigr) \cdot \Bigl(1 + \sum_{w \in \Gamma(v) \setminus S} p\,\bigl(1 - t_w p + o(p)\bigr)\Bigr) \tag{7} $$

(the sums again run over the normal neighbors of $v$, that is, the neighbors not in the spreader set $S$). After further simplification, the equation becomes

$$ 1 + \Bigl(d_v - 2t_v - (d_v - t_v)\,t_v\,p + \frac{t_v (t_v - 1)}{2}\,p - \sum_{w \in \Gamma(v) \setminus S} t_w\,p\Bigr)\,p + o(p^2). \tag{8} $$

Then, the generalized discounted degree of $v$ can be defined as

$$ gdd_v = d_v - 2t_v - (d_v - t_v)\,t_v\,p + \frac{t_v (t_v - 1)}{2}\,p - \sum_{w \in \Gamma(v) \setminus S} t_w\,p. \tag{9} $$

Similar to the situation in DegreeDiscount, the simplified expression for $gdd_v$ may also be less than zero. In our implementation, we set $gdd_v = 0$ in this situation. Fig 2 depicts the local topology considered by GeneralizedDegreeDiscount. In this toy model, $d_v = 4$ and $t_v = 1$. Note that the summation is over all normal neighbors, i.e., $\{g, i, j\}$ in this figure. As $t_g = 2$, $t_i = 1$ and $t_j = 1$, the generalized discounted degree of $v$ is

$$ gdd_v = 4 - 2 \cdot 1 - (4 - 1) \cdot 1 \cdot p + \frac{1 \cdot 0}{2}\,p - (2 + 1 + 1)\,p = 2 - 7p. \tag{10} $$

Compared to the formulation of DegreeDiscount (Eq (4)), the formulation of GeneralizedDegreeDiscount (Eq (9)) adds the last two terms. As the latter considers the local structure in more depth than the former, GeneralizedDegreeDiscount should be more effective than DegreeDiscount. However, the difference between them is not so significant in real situations. In practice, only a small fraction of nodes is selected as spreaders, and their influence spreads with low probability. Because the number of spreaders is not large, usually $t_v \ll d_v$ for all nodes. Thus, in Eq (9), the fourth and fifth terms are smaller than the third term, which makes GeneralizedDegreeDiscount similar to DegreeDiscount. In the Results section, we will compare the two methods numerically.
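The generalized discounted degree can be evaluated directly from a graph and a spreader set. The following sketch assumes $gdd_v = d_v - 2t_v - (d_v - t_v)t_v p + \tfrac{1}{2}t_v(t_v - 1)p - \sum_w t_w p$, with the sum over the normal neighbors of $v$, as obtained from the expansion described above. The toy graph is hypothetical: it is built only to reproduce the counts $d_v = 4$, $t_v = 1$, $t_g = 2$, $t_i = t_j = 1$ quoted for Fig 2 and is not claimed to match the figure's exact topology.

```python
def generalized_discounted_degree(adj, spreaders, v, p):
    """Evaluate gdd_v for node v given the current spreader set."""
    def t(x):
        # Number of spreader neighbors of x.
        return sum(1 for nb in adj[x] if nb in spreaders)

    d_v, t_v = len(adj[v]), t(v)
    normal = [w for w in adj[v] if w not in spreaders]
    gdd = (d_v - 2 * t_v
           - (d_v - t_v) * t_v * p
           + 0.5 * t_v * (t_v - 1) * p
           - p * sum(t(w) for w in normal))
    return max(gdd, 0.0)  # negative values are set to zero

# Hypothetical toy graph: spreaders u and s2, candidate v, normal neighbors g, i, j.
adj = {
    "v": {"u", "g", "i", "j"},
    "u": {"v", "g", "i"},
    "s2": {"g", "j"},
    "g": {"v", "u", "s2"},
    "i": {"v", "u"},
    "j": {"v", "s2"},
}
print(generalized_discounted_degree(adj, {"u", "s2"}, "v", 0.05))  # 2 - 7p with p = 0.05
```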

Computational efficiency
The GeneralizedDegreeDiscount is implemented in Algorithm 1. If we want to select $l$ spreaders, the algorithm must run for $l$ rounds. Let $N$ be the number of nodes and $\langle k \rangle$ be the average degree. In each round, the selection scheme costs $O(N)$, the neighbor finding scheme costs $O(\langle k \rangle^2)$, and for each of those neighbors, the updating process costs $O(\langle k \rangle)$. Then, the total time cost of the algorithm is $O(l(N + \langle k \rangle^2 + \langle k \rangle^2 \cdot \langle k \rangle)) \approx O(l(N + \langle k \rangle^3))$. In many networks, the average degree is far less than the number of nodes: $\langle k \rangle \ll N$. Thus the time cost of GeneralizedDegreeDiscount will be nearly $O(lN)$, which is linear in the scale of the network.

Algorithm 1 GeneralizedDegreeDiscount(G, l, p)
        gdd_v = 0
      end if
    end for
  end for
  return S
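A minimal Python sketch of the selection procedure, written from the prose description rather than from the algorithm listing, is given here. It assumes an adjacency dictionary `adj`, scores nodes with the generalized discounted degree derived in the previous section, and after each selection recomputes the scores of nodes within two hops of the new spreader (its neighbors, whose $t_v$ changes, and their neighbors, whose sum over $t_w$ changes). The function name, data layout and tie-breaking rule are assumptions.

```python
def gdd_select(adj, num_spreaders, p):
    """Select spreaders one by one using the generalized discounted degree."""
    t = {v: 0 for v in adj}                      # t_v: spreader neighbors of v
    gdd = {v: float(len(adj[v])) for v in adj}   # with no spreaders, gdd_v = d_v
    spreaders, chosen = [], set()

    def score(v):
        d_v, t_v = len(adj[v]), t[v]
        # Sum of t_w over the normal (unselected) neighbors w of v.
        sum_tw = sum(t[w] for w in adj[v] if w not in chosen)
        value = (d_v - 2 * t_v - (d_v - t_v) * t_v * p
                 + 0.5 * t_v * (t_v - 1) * p - sum_tw * p)
        return max(value, 0.0)                   # negative scores are set to zero

    for _ in range(min(num_spreaders, len(adj))):
        # O(N) scan for the best unselected node; ties go to the larger label.
        u = max((v for v in adj if v not in chosen), key=lambda v: (gdd[v], v))
        spreaders.append(u)
        chosen.add(u)
        for v in adj[u]:
            t[v] += 1
        # Nodes within two hops of u are the only ones whose score can change.
        affected = set()
        for v in adj[u]:
            affected.add(v)
            affected.update(adj[v])
        for v in affected - chosen:
            gdd[v] = score(v)
    return spreaders
```

The linear scan in each round matches the $O(N)$ selection cost assumed in the complexity analysis; a priority queue could be substituted, but that is a design choice outside the description above.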

Results and Discussion
To evaluate the performance, we simulate the experiments using the Susceptible-Infected-Recovered (SIR) model. The SIR model was originally proposed as a model of the dynamics of the spread of disease. Due to the similarities between epidemic transmission and the spread of information, we use SIR to measure the spreading influence of individual nodes. In the SIR model, a node may assume one of three states (susceptible, infected and recovered). Specifically, susceptible individuals S are analogous to individuals who are not aware of the information. Infected individuals I are analogous to information carriers who are willing to spread information to their neighbors. Recovered individuals R are those who previously received the information but later lost interest. To better simulate the real-world spreading process, we use the SIR model with limited contact [17]. At each time step, each infected node randomly selects one neighbor to contact, and transmits the disease to that neighbor with probability p if the neighbor is susceptible. After the transmission process, the infected node becomes a recovered node with probability q. The effective spreading rate λ is defined as p/q. When there are no infected nodes left, the process stops, and we use the fraction of recovered nodes to measure the spreading influence.
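The SIR process with limited contact described above can be sketched as a short simulation. This is an illustrative version under stated assumptions: the graph is an adjacency dictionary with sortable node labels, updates are applied synchronously at the end of each time step, and the function name and the optional `rng` argument are hypothetical.

```python
import random

def sir_limited_contact(adj, spreaders, p, q, rng=None):
    """Run one SIR realization; return the final fraction of recovered nodes."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1] so that the process terminates")
    rng = rng or random.Random()
    infected = set(spreaders)
    recovered = set()
    while infected:
        newly_infected, newly_recovered = set(), set()
        for node in sorted(infected):            # sorted for reproducibility
            neighbours = sorted(adj[node])
            if neighbours:
                # Limited contact: each infected node contacts one random neighbor.
                target = rng.choice(neighbours)
                susceptible = target not in infected and target not in recovered
                if susceptible and rng.random() < p:
                    newly_infected.add(target)
            # After the transmission attempt, the node recovers with probability q.
            if rng.random() < q:
                newly_recovered.add(node)
        infected |= newly_infected
        infected -= newly_recovered
        recovered |= newly_recovered
    return len(recovered) / len(adj)
```

In practice the function would be called many times for the same spreader set and the results averaged, since a single run is one random realization.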

Data Description
To evaluate the influences of different groups of spreaders selected by various methods, we conduct the experiments on the following four networks from different fields.

1. Enron [18]: An email communication network that covers all the email communication within Enron Corporation. Nodes in the network are email addresses and edges represent the email communications among them.
2. Cond-mat [19]: A collaboration network of scientists posting preprints to the condensed matter archive at arxiv.org between January 1, 1995 and March 31, 2005. Nodes in the network represent the scientists and edges represent the collaborations among them.
3. Gnutella [20]: A snapshot of the Gnutella peer-to-peer file sharing network on August 31, 2002. Nodes represent hosts in the Gnutella network topology and edges represent connections between the Gnutella hosts.

Benchmark methods
In complex networks, many centrality indexes have been defined to measure the importance of nodes and links. It is believed that nodes with higher centrality are more influential than common nodes. Accordingly, one naive solution for the influence maximization problem is to select the top-l nodes with the highest centrality indexes. In this paper, we use centrality-based methods as the benchmark methods.
Degree is a basic local centrality index for nodes. The higher degree a node has, the more important it is. In a social network, a person with more followers or friends is likely to have a larger influence.
Betweenness [22,23] measures the extent to which a node is located on the shortest paths between pairs of nodes in networks:

$$ B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}, $$

where $\sigma_{st}$ denotes the number of shortest paths between a pair $(s, t)$ of nodes, and $\sigma_{st}(v)$ denotes the number of those shortest paths that pass through $v$.
Closeness [24] is an evaluation of the geometric location of nodes, based on the distances $d(v, u)$ between node $v$ and the other nodes $u$. Some researchers have proposed other definitions of closeness to measure the locations of nodes [25,26].
PageRank [27] evaluates the status of nodes in a random walk process on the network, and is also a core algorithm in many search engines:

$$ PR(v) = \frac{1 - d}{N} + d \sum_{u \in \Gamma(v)} \frac{PR(u)}{k_u^{out}}, $$

where $N$ denotes the total number of nodes, $\Gamma(v)$ denotes the set of neighbors of $v$, $k_u^{out}$ denotes the out-degree of node $u$, and $d$ is a damping factor. In our implementation, we set $d = 0.85$.
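For illustration, PageRank scores can be approximated by power iteration on the standard recurrence $PR(v) = (1-d)/N + d \sum_u PR(u)/k_u^{out}$. The sketch below is a generic textbook version, not the implementation used in the experiments; the `out_links` dictionary, the fixed iteration count, and the uniform redistribution of rank from nodes without out-links are assumptions.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Approximate PageRank by power iteration with damping factor d."""
    nodes = list(out_links)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}             # start from the uniform distribution
    for _ in range(iterations):
        new = {v: (1 - d) / n for v in nodes}    # teleportation term (1 - d) / N
        for u in nodes:
            targets = out_links[u]
            if targets:
                # u passes an equal share of its rank along each out-link.
                share = d * pr[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Assumption: a node without out-links spreads its rank uniformly.
                for v in nodes:
                    new[v] += d * pr[u] / n
        pr = new
    return pr
```

For an undirected network, each edge is listed in both directions, so that the out-degree of a node equals its degree.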
Coreness [28] is a well-established centrality index that focuses on the structure of networks. Kitsak et al. found that the most efficient spreaders are those located within the core of the network as identified by the k-shell decomposition analysis. The decomposition runs in an iterative way. Nodes are assigned to k-shells according to their remaining degrees, which are obtained by the successive pruning of nodes with degrees smaller than $k_s$. However, the performance of k-shell decomposition is not stable, and many studies have sought to enhance its effectiveness [29][30][31]. Recently, Liu et al. [32] analyzed the structure of core-like groups in networks, and improved the accuracy of the k-shell decomposition by filtering out the redundant links. In Ref [33], Lü et al. discovered an important relation among degree, H-index and coreness. By constructing a suitable operator, they proved that degree, H-index and coreness are the initial, intermediate and steady states of a special sequence, respectively.
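The iterative pruning behind the k-shell decomposition can be sketched as follows. This is the standard textbook procedure on an undirected adjacency dictionary, offered as an illustration; the function name `k_shell` and the data layout are assumptions.

```python
def k_shell(adj):
    """Return a dict mapping each node to its k-shell index (coreness)."""
    degree = {v: len(adj[v]) for v in adj}       # remaining degrees
    remaining = set(adj)
    shell = {}
    while remaining:
        # The current shell index is the smallest remaining degree.
        k = min(degree[v] for v in remaining)
        stack = [v for v in remaining if degree[v] <= k]
        while stack:
            v = stack.pop()
            if v not in remaining:
                continue                         # already pruned via another path
            remaining.remove(v)
            shell[v] = k
            # Pruning v lowers the remaining degree of its surviving neighbors,
            # which may pull them into the same shell.
            for w in adj[v]:
                if w in remaining:
                    degree[w] -= 1
                    if degree[w] <= k:
                        stack.append(w)
    return shell
```

For a triangle with one pendant node attached, the pendant node falls in shell 1 and the three triangle nodes in shell 2.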

Effectiveness
We use the SIR model to compare the effectiveness of GeneralizedDegreeDiscount with DegreeDistance, SingleDiscount, DegreeDiscount and the centrality-based methods discussed above. In each implementation, a fraction of the nodes is selected as spreaders, and the information spreads according to the SIR process described above. The spreading influence is used to measure the effectiveness of the methods. For each method, the SIR process is repeated many times to ensure the stability of the results.  small, the performance of GeneralizedDegreeDiscount is slightly worse than DegreeDiscount. Compared with DegreeDiscount, the performance of GeneralizedDegreeDiscount is consistently better. One promising phenomenon observed in our method is that as the fraction of spreaders becomes larger, the performance difference becomes more significant. Numerical results confirm that GeneralizedDegreeDiscount is indeed an effective extension of DegreeDiscount. In all networks, Coreness and Closeness perform the worst among all methods. In Ref [31], Liu et al. found that nodes in high shells may not be influential because of the existence of core-like groups: groups of nodes that link very locally within themselves. For nodes in the core-like groups, the Coreness cannot reflect the importance of their location in the network, which reduces the accuracy of the k-shell decomposition process. Moreover, if nodes in the highest shell tend to link with one another, their influence areas may overlap significantly. Obviously, selecting those nodes as spreaders may cause a large fraction of the network to be overlooked. The situation for Closeness is similar: nodes with high closeness values are often located close to one another.
In addition, we test the validity of our method with different effective spreading rates. We fix the fraction of spreaders to be 1% of the scale of the networks and vary the effective spreading rate λ. The results are shown in
Obviously, GeneralizedDegreeDiscount is an adaptive method which recalculates $gdd_v$ during each step of the spreader selection process, while the centrality-based benchmark methods are not. In this part, further comparisons are made between our method and adaptive versions of Degree, Betweenness and Closeness. To make them adaptive, a simple node-removing process is conducted: in each iteration, the node with the maximum centrality is selected as a spreader, and then we remove it from the network and recalculate the centralities. The whole process ends when all the spreaders have been selected. In fact, the adaptive version of Degree is the same as SingleDiscount. Fig 5 shows the numerical results. Unlike the previous results, when considering the top spreaders with a low effective spreading rate, our GeneralizedDegreeDiscount does not perform well. Especially in the Gnutella network, the performance of GeneralizedDegreeDiscount is worse than that of Betweenness-adaptive and Closeness-adaptive. As the clustering coefficient of Gnutella is so small, the spreader selection process in the early iterations of GeneralizedDegreeDiscount is similar to Degree, which may limit the performance of our method. In Figs 3 and 4, it can also be seen that the performance differences between GeneralizedDegreeDiscount and other methods are not so remarkable for a small number of spreaders and a low effective spreading rate. How to identify multiple influential spreaders in networks with low clustering coefficients is a challenging problem, and we leave it for future work.
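The node-removing process used to make a centrality adaptive can be sketched generically. The sketch assumes an adjacency dictionary and a `centrality` callable that maps a graph to a score per node; degree is used as the example, and the function names and tie-breaking by node label are assumptions.

```python
def adaptive_select(adj, num_spreaders, centrality):
    """Repeatedly pick the top-centrality node, remove it, and recompute."""
    graph = {v: set(nbs) for v, nbs in adj.items()}   # working copy of the network
    spreaders = []
    while graph and len(spreaders) < num_spreaders:
        scores = centrality(graph)                    # recalculated every iteration
        u = max(graph, key=lambda v: (scores[v], v))  # ties go to the larger label
        spreaders.append(u)
        # Remove u and all of its edges from the working copy.
        for w in graph.pop(u):
            graph[w].discard(u)
    return spreaders

def degree_centrality(graph):
    """Degree of every node in the current (pruned) graph."""
    return {v: len(nbs) for v, nbs in graph.items()}
```

With `degree_centrality`, removing a selected node lowers each neighbor's degree by one, which is the SingleDiscount update; any other centrality function with the same signature can be plugged in.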

Relations with other methods
In this subsection, we perform numerical comparisons among four degree-based methods: Degree, SingleDiscount, DegreeDiscount and GeneralizedDegreeDiscount. Though DegreeDistance is also a degree-based method, we do not consider it because there is no clear formulation to describe this method. The mathematical formulations of the four are listed below.
• Degree: $d_v$
• SingleDiscount: $sd_v = d_v - t_v$
• DegreeDiscount: $dd_v = d_v - 2t_v - (d_v - t_v)\,t_v\,p$
• GeneralizedDegreeDiscount: $gdd_v = d_v - 2t_v - (d_v - t_v)\,t_v\,p + \frac{t_v (t_v - 1)}{2}\,p - \sum_{w} t_w\,p$, where the sum runs over the normal neighbors $w$ of $v$

Obviously, the complexity of these methods increases one by one. These formulations indicate that DegreeDiscount has more terms in common with GeneralizedDegreeDiscount than the other two methods. In an extreme case, when p = 0, GeneralizedDegreeDiscount is exactly the same as DegreeDiscount. To better clarify the difference between the methods, we set the fraction of spreaders to be 1% and calculate the similarity (the fraction of commonly selected spreaders) between GeneralizedDegreeDiscount and other methods. Fig 6 shows the results for the four networks. In all the networks, GeneralizedDegreeDiscount shows the highest similarity with DegreeDiscount, moderate similarity with SingleDiscount, and the lowest similarity with Degree.
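The similarity measure, read here as the fraction of spreaders that two methods select in common, can be written as a small helper. Normalizing by the size of the first set is an assumption; when both methods select the same number of spreaders, as in the comparison above, the choice of denominator does not matter.

```python
def similarity(spreaders_a, spreaders_b):
    """Fraction of spreaders in the first selection that also appear in the second."""
    a, b = set(spreaders_a), set(spreaders_b)
    if not a:
        raise ValueError("the first spreader set must not be empty")
    return len(a & b) / len(a)

print(similarity([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```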

Conclusion
In this paper, we propose a novel degree-based heuristic, GeneralizedDegreeDiscount, which selects multiple spreaders and maximizes their spreading influence. In our method, when evaluating the potential influence of a candidate node v, the way in which each of its neighbors is treated depends on whether that neighbor has been selected as a spreader or not. Taking both situations into consideration, GeneralizedDegreeDiscount uses a heuristic scheme to evaluate the potential influence of all individuals in the network.
We analyze the computational complexity of our method and show that it is just linearly correlated with the network scale. Then, the performance of our method is evaluated in four real networks from different fields. Results show that our method outperforms several centrality-based methods and other heuristic methods in all cases, no matter how many spreaders we choose to select or what the effective spreading rate is.
A theoretical analysis of the influence maximization problem is still lacking. Although it has long been proven that there is a strong connection between the spreading process and the percolation process [34,35], few studies have discussed the relationship between influence maximization and percolation. Recently, Morone and Makse pointed out that the influence maximization problem could be mapped onto the optimal percolation problem in random networks [36], which might shed light on a new trend in future research [37,38]. Besides, we have witnessed the rapid development of theories and methods for temporal networks [39,40]. Further research on the influence maximization problem in temporal networks may also be a promising direction [41,42].