Optimizing Online Social Networks for Information Propagation

Online users nowadays face a serious information overload problem. In recent years, recommender systems have been widely studied to help people find relevant information. Adaptive social recommendation is one such system, in which the connections in the online social network are optimized for information propagation so that users can receive interesting news or stories from their leaders. Validation of such adaptive social recommendation methods in the literature assumes a uniform distribution of users' activity frequency. In this paper, our empirical analysis shows that the distribution of online users' activity is actually heterogeneous. Accordingly, we propose a more realistic multi-agent model in which users' activity frequency is drawn from a power-law distribution. We find that previous social recommendation methods lead to serious delay of information propagation, since many users are connected to inactive leaders. To solve this problem, we design a new similarity measure which takes users' activity frequencies into account. With this similarity measure, the average delay is significantly shortened and the recommendation accuracy is largely improved.


Introduction
Information and communication technologies have led us into an information-rich era in which recommender systems are widely used to filter out irrelevant information [1][2][3]. Recommendation algorithms include correlation-based collaborative filtering [4][5][6], Bayesian clustering [7], probabilistic latent semantic analysis [8], and matrix decomposition [9,10]. Many issues related to recommender systems have been studied, such as the diversity of recommendations [11,12], the effect of network topology [13], and ground user [14]. Recent research shows that social influence [15] is more powerful than recommendation based purely on mathematical analysis, as people are more likely to accept recommendations coming from their friends or peers [16]. Hence, a new technology named social recommendation has emerged [17][18][19], in which users (followers) can select some other users as information sources (leaders) and information automatically flows from leaders to followers. This framework has been successfully applied in many real websites, such as delicious.com, twitter.com and digg.com. The information can refer to news, movies, books, bookmarks, and so on. Without loss of generality, news is used as an example in this paper. When a piece of news is submitted or approved by a user, it is forwarded to her followers. The diffusion of news thus depends on the structure of the leader-follower network, with a higher transmission probability of news if users with more similar tastes are linked [20].
Recently, an adaptive news recommendation model was proposed [21]. In this model, when a user reads news, she can either "approve" or "disapprove" it. If approved, the news is forwarded to her followers. As news spreads, the leader-follower network is updated: the least suitable leader of a user is replaced by a better one according to the quality of the leader, which is measured by the similarity of the two users based on their past assessments of news. The model has been extensively studied with additional aspects such as users' reputation [22], implicit ratings [23], local topology optimization [24], leadership structure [25] and link reciprocity [26]. More recently, Cimini et al. [27] introduced two settings for modeling users' tastes and showed that the heterogeneous setting of users' tastes is closer to the real case than the homogeneous setting.
Many empirical analyses have confirmed that the activity frequency of online users is heterogeneous [28][29][30]. However, in the original adaptive news recommendation model and the studies that followed, users are randomly selected to be active (i.e. submitting or reading news), which means that the activity frequency of users is set to be homogeneous. In this paper, we find that the propagation of news is seriously delayed when some classic similarity metrics are applied in the setting with heterogeneous user activity. Moreover, the recommendation accuracy (i.e. the approval fraction of the news) is lowered as well. To solve this problem, we propose a new similarity measure which takes user activity into account. The simulation shows that the propagation delay is considerably shortened and the recommendation accuracy is largely improved. Finally, we introduce a more general similarity definition in which the weights on users' news assessments and users' activity are tunable. With this, the effectiveness (accuracy) and efficiency (time delay) of the information propagation are further improved.

Empirical analysis
To begin our analysis, we study the distribution of user activity frequency in real systems. Here, we consider the dataset of digg.com [31], which contains 3,018,197 votes on 3,553 popular stories made by 139,409 distinct users over a period of a month in 2009. In this dataset, users and stories form a bipartite network in which a link between a user i and a story a exists if user i reads story a. The degree of a user represents the number of stories read by her (i.e. the activity of this user). Figure 1 shows the degree distribution in three observation time windows. Clearly, the distribution follows a power law with an exponent of around $-2$. The results confirm that users are very heterogeneous in the frequency of their online activity.
Furthermore, we employ Kendall's tau coefficient ($\tau$) to calculate the correlation of users' activity frequency in two adjacent periods. The length of the period is set to one day, so we are actually calculating Kendall's tau coefficient between users' activity frequency on each day and on the previous day. As shown in Fig. 2, $\tau$ is always larger than zero, which means that users' activity frequency is positively correlated in time. Moreover, we observe a periodic fluctuation in Fig. 2. Checking against the actual dates, we find that each period is one week and that the correlation is higher on weekdays than on weekends. We conjecture that this is because people's lives are regular on weekdays but more diverse on weekends.
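As an illustration of the day-to-day comparison described above, the following minimal sketch computes Kendall's tau between two activity vectors. The function name `kendall_tau` and the toy activity counts are hypothetical. The tau-a variant is assumed here (tied pairs count as neither concordant nor discordant), since the text does not state which variant was used.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equally long sequences.

    Pairs that are tied in either sequence count as neither concordant
    nor discordant (an assumption of this sketch).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical activity counts of five users on two adjacent days.
day_before = [12, 3, 7, 1, 9]
day_after = [10, 4, 6, 2, 11]
print(kendall_tau(day_before, day_after))
```

A positive value means that users who were more active on one day tend to be more active on the next day as well. The quadratic pair loop is only suitable for small samples.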

Model description
In the original adaptive news recommendation model [21] and the subsequent studies [22][23][24][25][26][27], users' activity frequencies are assumed to be homogeneous, which is inconsistent with the results of the above empirical study. To bring the information propagation model closer to real systems, we introduce heterogeneity of users' activity frequency into it. Our model is built directly on the original news-sharing model in ref. [21]. The system consists of $U$ users. Each of them is connected by directed links to $L$ other users, who represent her news sources and to whom we refer as her leaders. The value of $L$ is fixed, as users can follow only a limited number of sources. Users receive pieces of news from their leaders and eventually assess them. In addition, they can introduce new content to the system. The evaluation of news $a$ by user $i$ ($e_{ia}$) is either $+1$ (liked), $-1$ (disliked) or $0$ (not read yet). The set of evaluations from any pair of users $i$ and $j$ is the basis for computing the similarity of their interests (or reading tastes), which is denoted as $s_{ij}$. The explicit recipes to compute users' similarity are presented in the next section. Note that, apart from their evaluations, no other information about users is assumed by the model.
Users' activity. In each time step of the simulation, a given user is active with probability $p_A$. When active, a user reads the top $R$ news from her recommendation list and immediately forwards the ones she likes to her followers. In addition, with probability $p_S$ she submits a new piece of news. Different from the original model in [21], the users' activity frequency is drawn from a power-law distribution $P(p_A) \sim p_A^{-\gamma}$, where $\gamma = 2$.
Propagation of news. When news $a$ is introduced to the system by user $i$ at time $t_a$, it is forwarded from $i$ to the users $j$ who have selected her as a leader, with a recommendation score proportional to their similarity $s_{ij}$. If this news is later liked by one of her followers $j$, it is similarly passed further to this user's followers $q$, with a recommendation score proportional to $s_{jq}$, and so on. For a generic user $q$ at time $t$, a news $a$ is recommended to her according to its current score. In this score, $L_q$ is the set of leaders of user $q$, and $\delta$ is the Kronecker delta, which has only two possible values, 0 and 1. If user $q$ has not read news $a$, $\delta_{e_{qa},0} = 1$ since $e_{qa} = 0$, and if $q$ has read news $a$, $\delta_{e_{qa},0} = 0$ since $e_{qa} \neq 0$. Similarly, $\delta_{e_{la},1} = 1$ if user $l$ likes news $a$, and $\delta_{e_{la},1} = 0$ otherwise. To make fresh news quickly accessible, recommendation scores are damped with time ($\lambda \in (0,1]$ is the damping factor).
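The scoring rule described above can be sketched as follows. The sketch assumes that the score is the product of the unread indicator, a damping term $\lambda^{t - t_a}$, and the sum of the similarities to those leaders who liked the news. This product form, the function name, the dictionary layout and the (follower, leader) index order of the similarity are illustrative assumptions, not taken from the text.

```python
def recommendation_score(q, a, t, leaders, evaluations, similarity,
                         submit_time, damping):
    """Score of news `a` for user `q` at time `t` (assumed product form).

    leaders[q]          -> iterable of q's leaders (the set L_q)
    evaluations[(u, a)] -> +1 liked, -1 disliked, missing or 0 not read
    similarity[(q, l)]  -> similarity between follower q and leader l
    submit_time[a]      -> time t_a at which news a entered the system
    damping             -> damping factor lambda in (0, 1]
    """
    if not 0.0 < damping <= 1.0:
        raise ValueError("damping factor must lie in (0, 1]")
    # Kronecker-delta term: news already read by q is not recommended again.
    if evaluations.get((q, a), 0) != 0:
        return 0.0
    # Sum the similarities of the leaders who liked the news.
    support = sum(similarity.get((q, l), 0.0)
                  for l in leaders[q]
                  if evaluations.get((l, a), 0) == 1)
    # Older news is damped so that fresh news is reached quickly.
    return damping ** (t - submit_time[a]) * support


# Hypothetical toy data: q follows l1 (liked the news) and l2 (disliked it).
leaders = {"q": ["l1", "l2"]}
evaluations = {("l1", "a"): 1, ("l2", "a"): -1}
similarity = {("q", "l1"): 0.5, ("q", "l2"): 0.8}
submit_time = {"a": 3}
score = recommendation_score("q", "a", 5, leaders, evaluations,
                             similarity, submit_time, damping=0.5)
```

Only the leader who liked the news contributes, so in this toy case the score is the similarity 0.5 damped over two time steps.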
Leader selection. The model is adaptive. Initially, each user randomly selects $L$ other users as her leaders. Leader-follower connections are periodically rewired so that the social network approaches an optimal state in which only highly similar users are connected [32]. In each rewiring, for user $i$, her current leader $j$ with the lowest similarity value is replaced with a new user $q$ if $s_{iq} > s_{ij}$. There are different strategies for picking new candidate leaders, which are discussed in detail in [22,24,26]. In this paper we employ a hybrid strategy in which the user $q$ is picked at random in the network with probability 0.1; otherwise she is selected among the leaders' leaders and followers of user $i$ so as to maximize $s_{iq}$. This mechanism mimics well how users establish mutual friendship relations, search for friends among friends of friends, and have casual encounters which may lead to long-term relationships. In addition, it is an excellent compromise between computational cost and system performance [24].
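A minimal sketch of one rewiring attempt under the hybrid strategy is given below. It reads "the leaders' leaders and followers of user $i$" as the union of the leaders of $i$'s leaders and $i$'s own followers, which is one possible interpretation. The function name, the set-based data layout and the `similarity` callable are hypothetical.

```python
import random


def rewire_leader(i, leaders, followers, similarity, users,
                  p_random=0.1, rng=random):
    """One rewiring attempt for user i; returns True if a leader changed.

    leaders[u] and followers[u] are sets of user ids;
    similarity(i, j) returns s_ij for follower i and candidate leader j.
    """
    worst = min(leaders[i], key=lambda j: similarity(i, j))
    excluded = leaders[i] | {i}
    if rng.random() < p_random:
        # Casual encounter: a random user from the whole network.
        pool = [u for u in users if u not in excluded]
        candidate = rng.choice(pool) if pool else None
    else:
        # Local search: leaders of i's leaders plus i's own followers.
        pool = set(followers[i])
        for j in leaders[i]:
            pool |= leaders[j]
        pool -= excluded
        candidate = (max(pool, key=lambda q: similarity(i, q))
                     if pool else None)
    # Replace the worst leader only if the candidate is strictly better.
    if candidate is None or similarity(i, candidate) <= similarity(i, worst):
        return False
    leaders[i].remove(worst)
    leaders[i].add(candidate)
    followers[worst].discard(i)
    followers[candidate].add(i)
    return True
```

Because one leader is swapped for another, every user keeps exactly $L$ leaders after each attempt.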

Measure of users' similarity
An essential ingredient of the social recommendation algorithms is the estimated similarity of users' reading tastes, which regulates the news' flow over the system by determining the leaders' selection from users (i.e., the link structure of the network) and recommendation scores of news. Since only users' ratings and records of activities are known, the similarity of a pair of users has to be estimated from their assessments on news, which in our case can be either approved, disapproved, or not rated.
The first similarity measure considered, denoted $s^{(0)}_{ij}$, is introduced in [21] and given by Eq. (2). It contains a multiplicative term that removes the effect of statistical fluctuation. If a user $i$ and a user $j$ share only a small number of commonly read news, they are more likely to achieve a "perfect" similarity of 1. After multiplying by this term, the similarity measure gives such a user pair a very low similarity value. In a sampling of $n$ trials, the typical relative fluctuation is of the order of $1/\sqrt{n}$, which motivates the form of this term. In [33], it is shown that Eq. (2) works well only in systems where the tastes of users are homogeneously distributed, i.e., each user has the same number of fields of interest. To achieve a more accurate leader assignment in systems where users can have different numbers of tastes, an asymmetric similarity measure $s^{(1)}_{ij}$ is defined in Eq. (3), in which the term $1 - 1/\sqrt{|A_j|}$ is again used to remove the effect of statistical fluctuation.
In this paper, we consider systems in which users' activity distribution is uneven. Some users can be extremely active and read many news, so that their followers constantly receive fresh news. On the other hand, if a user is connected to many inactive leaders, the news received by her will be very limited. Therefore, recommending highly inactive users as leaders is meaningless. Considering this, we modify the similarity in Eq. (3) into $s^{(2)}_{ij}$, given by Eq. (4), where $H_j(t)$ is a measure of user $j$'s activity. Many previous works have also shown that online users' activity frequency is unevenly distributed [34,35]. Users' activity frequencies $p_A$ are an inherent feature of the users and are unknown to the recommender system. We estimate $p_A$ of the users in the following way. Instead of taking the whole history into account, we only use the recent record of activities within a time window $[t', t-1]$ of length $T$ (in our simulation, $T = 250$ generally works best, see Supporting Information S1). The estimated probability $P_j(t)$ for user $j$ to be online is given by Eq. (5), where $f_j(w)$ is the number of times user $j$ was online from time $w$ to time $t-1$. If $f_j(w)$ equals 0 for each $w$, we set $P_j(t)$ to $\frac{1}{T(T+1)}$. In Eq. (5), users' recent record plays a more important role. This is very useful in real systems, since the correlation of real users' activity frequency is generally high in the short term (see Fig. 2). However, $P_j(t)$ cannot be directly used as $H_j(t)$ in Eq. (4). Since $P_j(t) \sim P(p_A)$, $P_j(t)$ follows a power-law distribution. If it were used as $H_j(t)$, a few highly active users would dominate the similarity matrix and always be selected as the leaders of others. To solve this problem, we embed $P_j(t)$ in $H_j(t)$ logarithmically. The logarithm is normalized using $x_0$, the lowest possible value of $P_j(t)$, which is set to $\frac{1}{T(T+1)}$. After simplification, $H_j(t) = 1 + \log_{T(T+1)} P_j(t)$. With this definition, $H_j(t)$ can distinguish different users, and the most inactive users are punished severely. However, the majority of users get $H_j(t)$ above 0.5, as shown in Fig. 3. Substituting this $H_j(t)$ into Eq. (4) gives the final form of $s^{(2)}_{ij}$.
In our simulation, we also compare our method $s^{(2)}$ with some state-of-the-art similarity methods based on both news reading and topology. The results show that $s^{(2)}$ can outperform the others, see Supporting Information S1. In the following, we study the behavior of the system under these similarity metrics. For numerical tests of the model, we use an agent-based framework.
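The simplified activity factor $H_j(t) = 1 + \log_{T(T+1)} P_j(t)$ can be evaluated with the short sketch below. The function name is hypothetical, and clipping the estimate to the interval $[x_0, 1]$ is an added safeguard of this sketch rather than something stated in the text.

```python
import math


def activity_factor(p_est, window):
    """H = 1 + log_{T(T+1)}(P) for an estimated activity P and window T.

    The estimate is clipped to [x0, 1] with x0 = 1 / (T (T + 1)), the value
    assigned to users with no recorded activity (clipping is an assumption).
    """
    base = window * (window + 1)
    x0 = 1.0 / base
    p = min(max(p_est, x0), 1.0)
    # Change of base: log_{T(T+1)}(p) = ln(p) / ln(T(T+1)).
    return 1.0 + math.log(p) / math.log(base)
```

With this form, a user at the lowest value $x_0$ gets $H = 0$ and a user with $P = 1$ gets $H = 1$. A user with $P = 1/T$ gets $H = 1 - \ln T / (\ln T + \ln(T+1))$, which is slightly above 0.5, so the logarithm compresses the wide power-law range of $P_j(t)$ into a narrow band of $H_j(t)$.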

Agent-based simulations
To model users' judgments of the news they read, we use a vector model in which the tastes of user $i$ are represented by a $D$-dimensional taste vector $\vec{a}_i = (a_i^1, \ldots, a_i^D)$ and the attributes of news $a$ are represented by a $D$-dimensional attribute vector $\vec{b}_a = (b_a^1, \ldots, b_a^D)$. Similar vector models are often used in semantic approaches to recommendation [36]. The opinion of user $i$ about news $a$ is based on the overlap of the user's tastes and the news's attributes, which can be expressed by the scalar product $V_{ia} = \vec{a}_i \cdot \vec{b}_a = \sum_{k=1}^{D} a_i^k b_a^k$. We assume that user $i$ approves news $a$ only when $V_{ia} \geq \Delta$ and disapproves otherwise, where $\Delta$ is the users' approval threshold: the higher it is, the more demanding the users are. Here, we adopt the heterogeneous setting of the taste/attribute vectors.
Each user has a preference for a variable number of the $D$ available tastes. Each taste vector has a user-specific number of elements equal to one (active tastes, their number denoted as $d$), and the remaining elements are zero. In this paper, we assume $D_{\min} \leq d \leq D_{\max}$, where $D_{\min}$ and $D_{\max}$ are the minimum and maximum numbers of active tastes that a user can have, respectively. Moreover, we assume that each news' attribute vector has a fixed number $D_{\min}$ of active attributes (number of ones), which are randomly chosen among the active tastes of the user who submits it.
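The heterogeneous taste setting and the approval rule can be sketched as follows. The function names are hypothetical, and drawing the number of active tastes uniformly between $D_{\min}$ and $D_{\max}$ is an assumption, since the text only states the bounds.

```python
import random


def random_taste_vector(dim, d_min, d_max, rng):
    """Binary taste vector with d active tastes, d_min <= d <= d_max.

    The uniform choice of d is an assumption of this sketch.
    """
    d = rng.randint(d_min, d_max)
    active = set(rng.sample(range(dim), d))
    return [1 if k in active else 0 for k in range(dim)]


def news_from_submitter(taste, d_min, rng):
    """Attribute vector with d_min ones chosen among the submitter's tastes."""
    active = [k for k, v in enumerate(taste) if v == 1]
    chosen = set(rng.sample(active, d_min))
    return [1 if k in chosen else 0 for k in range(len(taste))]


def approves(taste, attributes, threshold):
    """A user approves when the overlap (scalar product) reaches the threshold."""
    overlap = sum(a * b for a, b in zip(taste, attributes))
    return overlap >= threshold
```

Since the active attributes of a piece of news are drawn from the submitter's own active tastes, the submitter's overlap with her own news always equals $D_{\min}$.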
The simulation runs in discrete time steps. Assuming no a priori information, the starting network configuration is given by randomly assigning $L$ leaders to each user. Then, in each simulation step, an individual user is active with probability $p_A$. When active, the user reads and evaluates the $R$ top-recommended news she has received and, with probability $p_S$, submits a new piece of news. Connections are rewired every $u$ simulation steps. Parameter values used in all following simulations are given in Table 1. A detailed study of the effect of the parameters on the model can be found in [24].
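The per-user activity probabilities $p_A$ used in each simulation step can be drawn from a power law by inverse-transform sampling, as sketched below. The lower cut-off `p_min` is an assumption needed to make the distribution normalizable; the text does not state the cut-off, and the function name is hypothetical.

```python
import random


def sample_activity(p_min, gamma=2.0, rng=random):
    """Draw p_A from P(p_A) ~ p_A**(-gamma) on [p_min, 1].

    Uses the inverse of the cumulative distribution of a truncated
    power law; `p_min` is a hypothetical lower cut-off.
    """
    if not 0.0 < p_min < 1.0:
        raise ValueError("p_min must lie in (0, 1)")
    if gamma == 1.0:
        raise ValueError("gamma = 1 needs the logarithmic form, not covered here")
    u = rng.random()
    k = 1.0 - gamma
    low = p_min ** k
    # Solve F(p) = (p**k - p_min**k) / (1 - p_min**k) = u for p.
    return (low + u * (1.0 - low)) ** (1.0 / k)
```

For $\gamma = 2$ and `p_min = 0.01`, half of the users receive an activity probability below roughly 0.02, so most users are rarely active while a few are active in almost every step.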

Metrics
Three metrics are used to measure the performance of the recommendation models: approval fraction y, average differences d, and average delay w.
Approval fraction y. The ratio of news approvals to all assessments is an obviously important measure of the method's performance. This number, referred to as the approval fraction, tells us how often users are satisfied with the news recommended to them. The higher y is, the more accurate the recommendation is and the more satisfied the users are. It is computed as the number of approvals divided by the number of ratings, using the terms $\delta_{e_{ia},1}$, which equals one if user $i$ approved news $a$ and zero otherwise, and $\delta_{|e_{ia}|,1}$, which equals one if user $i$ has rated news $a$ and zero otherwise.
Average differences d. In the computer simulation, we have the luxury of knowing users' taste vectors, and hence we can compute the number of differences between the taste vector of a user and the taste vectors of the user's leaders. By averaging over all users, we obtain the average number of differences. Obviously, the fewer the differences, the better the assignment of leaders. In the definition of the average differences, $L_i$ is the set of leaders of user $i$, and $\vec{a}_i$ ($\vec{a}_l$) is the taste vector of user $i$ (user $l$).
Average delay w. The freshness of news is very important. Once news becomes old, it is of no interest to users. The average delay measures the novelty of the news read by users. A small average delay indicates that users are always reading fresh news. The average delay is computed from the differences $t_{ia} - t_a$, where $N_i$ is the set of news read by user $i$, $t_{ia}$ is the time when $i$ reads news $a$, $t_a$ is the time when news $a$ was submitted, and $U_s$ is the set of users. The smaller w is, the fresher the news read by users.
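The approval fraction and the average delay can be sketched as below. The function names and dictionary layouts are hypothetical. For the delay, averaging first over the news read by each user and then over users is an assumption about the order of averaging.

```python
def approval_fraction(evaluations):
    """Ratio of approvals (+1) to all ratings (+1 or -1).

    evaluations[(i, a)] is +1, -1, or 0 (not read yet).
    """
    rated = [e for e in evaluations.values() if e != 0]
    if not rated:
        raise ValueError("no news has been rated yet")
    return sum(1 for e in rated if e == 1) / len(rated)


def average_delay(read_times, submit_times):
    """Mean of t_ia - t_a, averaged per user and then over users.

    read_times[i][a] is the time t_ia at which user i read news a;
    submit_times[a] is the submission time t_a. Users who have read
    nothing are skipped (an assumption of this sketch).
    """
    per_user = []
    for reads in read_times.values():
        if reads:
            delays = [t_ia - submit_times[a] for a, t_ia in reads.items()]
            per_user.append(sum(delays) / len(delays))
    if not per_user:
        raise ValueError("no news has been read yet")
    return sum(per_user) / len(per_user)
```

Both functions operate on a snapshot of the simulation state, so they can be called once the system has reached its stationary regime.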

Results and Discussion
We now study the described adaptive social recommender system under the different definitions of the similarity measure. For comparison, initial conditions and parameters are identical for all simulations, as listed in Table 1. We obtain the average differences, approval fraction and average delay resulting from each similarity definition. We first consider the case where user activity and the number of user interests are uncorrelated; the results are shown in the uncorrelated case in Table 2. As expected, $s^{(1)}$ enjoys a higher approval fraction and a smaller average difference than $s^{(0)}$. The results are consistent with refs. [27,33]. However, $s^{(0)}$ results in a smaller average delay than $s^{(1)}$. Among these methods, $s^{(2)}$ performs best in all three metrics. These results suggest that introducing users' activity into the similarity measure can significantly speed up the propagation of news in online systems, so that users mostly receive fresh news. Moreover, it improves the leader assignment, resulting in a more accurate recommendation of news for users.
One concern with the new similarity measure is that the network updating algorithm might focus too strongly on the activity frequency of leaders rather than on the taste overlap between leader and follower, putting the recommendation accuracy at risk. Accordingly, we further introduce a parameter $\eta$ to adjust the effect of users' activity in the $s^{(2)}$ similarity calculation, which gives the similarity $s^{(3)}$ in Eq. (12). Parameter $\eta$ controls the weight assigned to $H_j(t)$. When $\eta = 0$, Eq. (12) reduces to $s^{(1)}$, and when $\eta = 1$, Eq. (12) reduces to $s^{(2)}$.
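The effect of the tunable weight can be illustrated with the sketch below. It assumes a multiplicative form in which the taste-based similarity is multiplied by the activity factor raised to the power of the weight. This form is only one choice that satisfies the two limiting cases stated above and is not taken from the text. The function names and the toy candidates are hypothetical.

```python
def tunable_similarity(s1, h, eta):
    """Assumed form s3 = s1 * h**eta.

    eta = 0 returns the taste-based similarity s1 unchanged;
    eta = 1 returns s1 * h, the activity-weighted similarity.
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("activity factor must lie in [0, 1]")
    return s1 * h ** eta


def best_leader(candidates, eta):
    """Pick the candidate with the highest tunable similarity.

    candidates maps a user id to a (s1, h) pair.
    """
    return max(candidates,
               key=lambda u: tunable_similarity(*candidates[u], eta))


# Hypothetical candidates: A is very similar but inactive, B the reverse.
candidates = {"A": (0.8, 0.2), "B": (0.6, 0.9)}
```

With a weight of 0 the similar but inactive candidate A is preferred, while with a weight of 1 the more active candidate B wins, which is the trade-off between accuracy and delay that the parameter controls.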
The stationary values of average delay w and approval fraction y obtained by using $s^{(3)}$ under different $\eta$ are reported in Fig. 4. One can immediately see that the approval fraction reaches a maximum as $\eta$ is adjusted. On the other hand, the average delay drops dramatically once $\eta > 0$ and then decreases monotonically with $\eta$. Compared to the case where users' activity is not considered in the similarity calculation ($\eta = 0$), the approval fraction can be improved and the average delay can be considerably shortened. Interestingly, the optimal $\eta$ is around 1, corresponding to the case of $s^{(2)}$.
We further consider the situation where user activity and the number of user interests are correlated. Three cases are considered in this paper: positive correlation, negative correlation, and no correlation. We first compare the three evaluation metrics under the different correlation settings (see Table 2). One can immediately see that both positive and negative correlation between user activity and the number of user interests can significantly shorten the average delay w of the news. However, the delays resulting from $s^{(0)}$ and $s^{(1)}$ are still longer than that from $s^{(2)}$. The advantage of $s^{(2)}$ can also be seen in the average difference d. A lower average difference indicates that the network has adapted to a better state for news propagation. We can also see that $s^{(2)}$ enjoys the highest approval fraction y in almost all cases. When the correlation between user activity and the number of user interests is positive, the approval fraction of $s^{(0)}$ is a bit higher than that of $s^{(2)}$. However, the delay with $s^{(0)}$ in this case is more than twice as long as that with $s^{(2)}$. Taken together, $s^{(2)}$ is a very effective and robust similarity measure for recommending leaders in online social systems.
Table 2. Three evaluation metrics on different similarity measures.
We next study the leader-follower networks after the systems reach a stable state. We first investigate the in-degree distribution of nodes (i.e. the distribution of the number of followers). The results show that the largest in-degree under $s^{(2)}$ is smaller than that under $s^{(0)}$ and $s^{(1)}$. This is because users under $s^{(2)}$ select leaders according to not only the similarity but also the activity frequency. It is more difficult for the largest in-degree nodes under $s^{(2)}$ to attract as many followers as under $s^{(0)}$, since these users need both high similarity to others and high activity frequency.
We then study some properties of users with different activities in Fig. 5. Fig. 5(a) shows the relation between a user's activity and the number of her followers. As discussed above, if the leaders of a user have low activity, the user may have no news to read and the propagation of news will be largely delayed. This happens often with the original similarity measures $s^{(0)}$ and $s^{(1)}$ (see their flat curves in Fig. 5(a)). We do not plot the curves of the negative and uncorrelated cases for $s^{(0)}$ and $s^{(1)}$ because they are as flat as in the positive correlation case. Under $s^{(2)}$, users with higher activity frequency have more followers, which provides users with plenty of news to read. Moreover, we observe that users with higher activity and fewer interests (see the negative correlation case when $p_A$ is large) have more followers. In ref. [33], it was already pointed out that users with few interests are good information sources and should be selected as leaders (since they are specialized in their fields). As shown in Fig. 5(a), $s^{(2)}$ recommends users with high activity and few interests as leaders to others. This again supports that $s^{(2)}$ is a good similarity measure.
In Fig. 5(b), we present the relation between users' activity and the average activity of their followers. We can see that users with higher activity and wider interests (i.e. large $p_A$ in the positive correlation case) have more active followers. Generally, the interests of followers are wider than those of their leaders [33]. The followers of users with large $p_A$ will therefore have wide interests and thus high activity.
Moreover, it is interesting to identify which kind of users can be the information hubs in online social networks. We first investigate the number of forwarded news of different users. As shown in Fig. 5(c), the users with higher activity and wider interests forward more news. With wider interests, these users are more likely to approve the news from their leaders, which results in a large number of forwarded news from them.
However, the results in Fig. 5(d) indicate that users with high activity and wide interests are actually not information hubs. Fig. 5(d) reports the spreading range of news originating from different users. The spreading range is defined here as the number of users who eventually read the news. As discussed above, users with fewer interests are more specialized in their fields and their followers are more likely to approve the news from them. Therefore, news originating from users with higher activity and fewer interests spreads more widely. The results imply that active and specialized users are the information hubs in online social networks.

Conclusion
In this paper, we study a new multi-agent model for information propagation and recommendation on online social networks. The original online information propagation model was proposed in ref. [21], where users' activity frequency is assumed to be homogeneously distributed. Since the empirical study of online news-sharing systems suggests that users' activity frequency actually follows a power-law distribution, we introduce heterogeneity of users' activity frequency into the model of ref. [21]. We find that previous similarity methods for leader recommendation connect many users to inactive leaders, resulting in serious delay of information propagation and a low approval fraction of news.
To solve this problem, we propose a new similarity measure which takes users' activity frequency into account. With the new similarity measure, the suitability of a leader is evaluated according to not only the similarity but also the activity frequency. The numerical simulation shows that our method can outperform the existing ones in network optimization for information recommendation, in both approval fraction and information delay. Finally, we introduce a parameter to adjust the effect of users' activity in the similarity calculation. We find that the leader recommendation can be further improved by this parameter.
Since real online users have heterogeneous activity frequencies, we believe that our method will be very useful from a practical point of view. Since real online news-sharing systems can differ from the current model in parameter settings or even in the news propagation mechanism, the optimal weight of users' activity frequency in the similarity calculation should be determined by preliminary tests. One possible way is to implement the method first on a small subset of users. After the optimal balance between users' activity frequency and similarity has been obtained from this learning procedure, the method can be applied to the whole system.

Supporting Information
Supporting Information S1 Supporting text and figures.