Cumulative Effect in Information Diffusion: Empirical Study on a Microblogging Network

Cumulative effect in social contagion underlies many studies on the spread of innovation, behavior, and influence. However, few large-scale empirical studies are conducted to validate the existence of cumulative effect in information diffusion on social networks. In this paper, using the population-scale dataset from the largest Chinese microblogging website, we conduct a comprehensive study on the cumulative effect in information diffusion. We base our study on the diffusion network of message, where nodes are the involved users and links characterize forwarding relationship among them. We find that multiple exposures to the same message indeed increase the possibility of forwarding it. However, additional exposures cannot further improve the chance of forwarding when the number of exposures crosses its peak at two. This finding questions the cumulative effect hypothesis in information diffusion. Furthermore, to clarify the forwarding preference among users, we investigate both structural motif in the diffusion network and temporal pattern in information diffusion process. Findings provide some insights for understanding the variation of message popularity and explain the characteristics of diffusion network.


Introduction
We are witnessing the emergence and rapid proliferation of various social applications, including resource sharing sites (e.g., Flickr, YouTube), blogs (e.g., Bloggers, LiveJournal), social networks (e.g., Facebook, Myspace), and microblogs (e.g., Twitter, Sina Weibo). These social applications enable users to produce, share, and consume online content. A prominent characteristic of these systems is the relationships formed among users. These relationships can be described by networks, where nodes represent users and links denote the relations or interactions among them. Many efforts have been made to understand the structure of these networks [1]. Recently, much research attention has been paid to various dynamics on these networks, investigating users' tendency to engage in activities such as forwarding messages, linking to articles, joining groups, purchasing products, or becoming fans of certain pages after their friends have done so [2][3][4][5][6][7][8][9][10][11].
Existing studies mainly focused on identifying the properties of these dynamics and the potential principles governing them [12][13][14][15]. Scientists have noticed several salient phenomena about information diffusion on networks and the evolution of underlying networks, including the rich-get-richer phenomenon [16], burst [17], the stability constraints [18], homophily [19], clustering [20], bridgeness [21], structural balance [22], structural regularities [23], and two-step flow [24]. However, the fundamental mechanism of information diffusion on networks is still unclear. Does a ''cumulative effect'' exist in information diffusion on social networks? Are there fundamental differences among the mechanisms underlying the diffusion of various messages? Does the relevant topic or the associated event of messages help explain the distinct characteristics of these messages? More importantly, are there any structural or temporal patterns that occur frequently in the process of information diffusion?
With the increasing availability of data recording the information diffusion on social networks, many efforts have been made to study the effect of multiple exposures on social networks. Using the data from LiveJournal and DBLP, Backstrom et al. found that the propensity of individuals to join communities was dominated by a ''diminishing return'' property [3]. Leskovec et al. examined the probability of purchasing a product as a function of the number of received recommendations about the product [7]. They observed a saturation point after receiving around 10 recommendations. Romero et al. studied the mechanics of information diffusion by comparing the information diffusion process across different topics on Twitter [25]. They found that the effect of multiple exposures decayed rapidly for hashtags representing idioms and neologisms. Ugander et al. found that the probability of contagion was tightly controlled by the number of connected components in an individual's neighborhood, rather than by the actual size of neighborhood [26]. In addition, Milo et al. defined ''network motifs'' and found them in networks from biochemistry, neurobiology, ecology, and engineering [27]. Zhang et al. proposed a new mechanism for the local organization and tested potential theory [28]. They found that the Bi-fan structure was the most favored local structure in directed networks. Bao et al. predicted the popularity of messages on social networks by leveraging the structural diversity of diffusion network [29]. However, recent works mainly focused on the diffusion of innovation, the adoption of new product, and the spread of certain behavior. It is still unclear whether these findings are applicable to the information diffusion on microblogging network.
In this paper, to understand the mechanism of information diffusion on social networks, we conduct a comprehensive empirical analysis on a population-scale dataset from Sina Weibo, the largest Chinese microblogging website. We study the statistics of the diffusion network, which characterizes the relationships among the individuals involved in the diffusion process. We then investigate the cumulative effect of multiple exposures during the spread of messages, with or without URLs and events. We find a peak in the curve of forwarding probability at 2 exposures and a subsequent slow drop. We also find that the probability of forwarding messages with URLs or events is significantly higher than that of the other messages. When examining the exposure curves corresponding to different events, we find that the exposure curve is heavily affected by outside intervention, such as restrictions on media coverage. Furthermore, we investigate the structural and temporal patterns frequently occurring in information diffusion. These findings provide insights into the fundamental mechanism of information diffusion and help predict the forwarding behavior of individuals.

Diffusion network
To study information diffusion on social networks, we represent the cascade of a message as a diffusion network. For each message, its diffusion network is a directed network where each node is a user who is involved in the diffusion of this message. A link from user u to user v denotes that v receives the message from u and then forwards it. Note that one user can forward a message more than once. In this paper, when constructing the diffusion network of a message, we only consider each user's first forwarding of the message, as done in [7]. We adopt this definition of diffusion network for two reasons: 1) given a particular message, multiple forwarding by one user is very rare; and 2) multiple forwarding behaviors may obscure the analysis of the cumulative effect in the diffusion process. In a diffusion network, there is only one node with no incoming link. We call this node the root node of the diffusion network because it corresponds to the source user of the message. Similarly, we call the nodes without outgoing links leaf nodes.
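The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_diffusion_network` and the input layout (a list of `(timestamp, path)` pairs, where `path` lists the users from the original poster to the forwarding user) are hypothetical, chosen to resemble the forwarding-path records described in Materials and Methods.

```python
from collections import defaultdict


def build_diffusion_network(root, forwards):
    """Return {parent: [children]}, keeping only each user's first forward.

    root     -- id of the user who posted the original message
    forwards -- iterable of (timestamp, path) pairs; path lists the users
                from the original poster to the forwarding user, so it is
                assumed to contain at least two entries
    """
    parent_of = {}
    # Process forwards in time order so that the first forward wins.
    for _, path in sorted(forwards, key=lambda record: record[0]):
        user, parent = path[-1], path[-2]
        if user == root or user in parent_of:
            continue  # skip the source itself and repeated forwards
        parent_of[user] = parent

    # Turn the child -> parent map into directed links parent -> children.
    children = defaultdict(list)
    for user, parent in parent_of.items():
        children[parent].append(user)
    return dict(children)


if __name__ == "__main__":
    records = [
        (10, ["a", "b"]),
        (20, ["a", "b", "c"]),
        (30, ["a", "c"]),      # second forward by c, ignored
        (15, ["a", "d"]),
    ]
    print(build_diffusion_network("a", records))
```

In this toy input, user `c` forwards twice; only the earlier forward (received from `b`) becomes a link, so every non-root node has exactly one incoming link.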
Diffusion network provides us important descriptive information for the cascade of a message. On one hand, the outgoing degree of a node characterizes its amplification factor at the diffusion process of message. The nodes with larger outgoing degrees are usually the so-called ''opinion leaders'' [24] and are essential to the popularity of a message. By inspecting the outgoing degree in diffusion network, we can easily identify these opinion leaders. On the other hand, each path from the root node to a leaf node depicts a forwarding trajectory of message. To a certain extent, the maximum length of all the paths reflects the penetration capability of message. Furthermore, a diffusion network generally has multiple layers. The nodes in the same layer have the same distance from the root node. Finally, the size of a diffusion network characterizes the popularity of the corresponding message. Figure  1 gives an example of diffusion network. The root node, colored in red, has a large outgoing degree and thus promotes the early popularity of the message. The large node in yellow is another node with a large outgoing degree, triggering a new spread range for the message.
We adopt three quantities to characterize the properties of diffusion network, i.e., the size, depth and width of diffusion network. The size of a diffusion network is the number of nodes in the diffusion network and reflects the popularity of message among users. The depth of a diffusion network is the length of the longest path from the root node to leaf nodes. The width of a diffusion network is the number of nodes in the layer with the largest number of nodes. As shown in Figure 2(a), the size of diffusion network follows a power law distribution with exponent 0.66, indicating that the popularity of messages is unequally distributed. This poses a big challenge for predicting the popularity of messages [29][30][31][32][33]. Figure 2(b) shows the distribution of width over all diffusion networks. The width distribution can be well fitted with a two-stage power law distribution with exponents respectively being 1.16 and 1.89. Figure 2(c) shows the distribution of depth over all diffusion networks. The depth roughly follows an exponential distribution with exponent 0.89, indicating that the majority of diffusion networks have shallow depth. To characterize the shallow structure of diffusion network, we further investigate the average number of nodes in each layer of diffusion networks. As shown in Figure 2(d), the average number of nodes decreases dramatically with respect to the depth of layer. The majority of nodes appear in the first five layers of diffusion network. In addition, we also show the error bars in the Figure 2(d) for the first five layers. These error bars show that the number of nodes in the same layer is quite heterogeneous.
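As an illustration of the three quantities defined above, the following sketch computes size, depth and width with a layer-by-layer traversal from the root. The function name `network_stats` and the `{parent: [children]}` input format are assumptions made for this example.

```python
def network_stats(root, children):
    """Return (size, depth, width) of a diffusion network.

    root     -- id of the root node
    children -- dict mapping a node to the list of nodes it links to
    """
    layer_sizes = []          # number of nodes at distance 0, 1, 2, ... from root
    seen = {root}
    frontier = [root]
    while frontier:
        layer_sizes.append(len(frontier))
        next_frontier = []
        for node in frontier:
            for child in children.get(node, []):
                if child not in seen:
                    seen.add(child)
                    next_frontier.append(child)
        frontier = next_frontier

    size = sum(layer_sizes)        # total number of nodes
    depth = len(layer_sizes) - 1   # longest root-to-leaf path length
    width = max(layer_sizes)       # size of the most populated layer
    return size, depth, width


if __name__ == "__main__":
    example = {"a": ["b", "d"], "b": ["c"]}
    print(network_stats("a", example))
```

Here the root layer is counted when taking the maximum for the width, so a message that is never forwarded has size 1, depth 0 and width 1. Whether the original analysis counts the root layer in this way is not stated in the text.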

Temporal characteristics of information diffusion
Information diffusion is a dynamical process on social networks. Besides the structural characteristics depicted in the previous section, information diffusion also exhibits several temporal patterns which are the focus of this section.
We first analyze the time lag of forwarding behaviors in the diffusion process. Figure 3(a) shows the distribution of the time interval between two successive forwarding behaviors, at a resolution of five minutes, over the cascades of all messages. The distribution follows a power law with exponent 2.16. In addition, Figure 3(b) gives the distribution of the time latency of message forwarding, which characterizes how long it takes for a message to be forwarded. This distribution roughly follows a log-normal distribution with a peak at 10 minutes. Indeed, after a message is submitted by a user, it usually takes several minutes to be forwarded by other users, which may result from the fact that users are not always online and check messages at a certain rate. Therefore, if a user is not active in a certain period, the messages submitted will need to wait for a long time to be forwarded by this user. As a typical example, users are usually active during the day and inactive at night.
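The interval statistic above can be sketched as follows, assuming the forwards of one cascade are available as timestamps in seconds. The function name `interval_histogram` and the `bin_seconds` parameter are hypothetical; the default of 300 seconds corresponds to the five-minute resolution mentioned in the text.

```python
from collections import Counter


def interval_histogram(timestamps, bin_seconds=300):
    """Count gaps between successive forwards, grouped into fixed-width bins.

    timestamps  -- forwarding times of one cascade, in seconds
    bin_seconds -- bin width; 300 s gives a five-minute resolution
    Returns a Counter mapping bin index to the number of gaps in that bin.
    """
    ordered = sorted(timestamps)
    gaps = (later - earlier for earlier, later in zip(ordered, ordered[1:]))
    return Counter(gap // bin_seconds for gap in gaps)


if __name__ == "__main__":
    print(interval_histogram([0, 100, 700, 1000]))
```

Summing such histograms over all cascades gives an empirical distribution to which a power law could then be fitted; the fitting procedure itself is not described in the text and is not shown here.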
To verify the activity pattern of users, we investigate the number of messages posted hourly. Figure 3(c) shows the averaged hourly activity of users for 30 days. We can see that users are active between 10am-10pm and are not active between 1am-7am.

Cumulative effect of multiple exposures
We now turn to the diffusion dynamics of messages on social networks. Specifically, we study the cumulative effect of multiple exposures, i.e., a user is more likely to forward a message if this user is exposed to the message for more times. There are two assumptions about the cumulative effect of multiple exposures. The first one claims that a user's multiple exposures to a message will always increase the possibility of the user's forwarding behavior. The second one insists that more exposures will not increase the forwarding possibility if a user has ever been exposed to the message but does not forward it.
To investigate the cumulative effect of multiple exposures, we need to capture the number of exposures before a user forwards a message. For this purpose, we say that a user is k-exposed to a message if the user has received the message k times but still has not forwarded it. When a message is submitted or forwarded by a user, all followers of the user are exposed to this message. Using the ordinal time estimate method [25], we denote by W(k) the number of users who are k-exposed to a message at a certain time, and by R(k) the number of users who forward the message directly after being k-exposed to the message. We then calculate the probability P(k) that a k-exposed user forwards the message before this user becomes (k+1)-exposed, i.e., P(k) = R(k)/W(k).
With the above definitions, we empirically study the forwarding probability P(k) using all the messages forwarded by more than 10 users. To alleviate the influence of users' activity patterns, we only consider the messages posted between 10am and 10pm each day, which is the active period depicted in Figure 3(c). Figure 4(c) gives the forwarding probability P(k) as a function of the number k of exposures. We can see that the curve of forwarding probability P(k) peaks at 2 exposures. After the peak, the value of P(k) drops in a power-law manner. These findings can provide some insights for designing viral marketing strategies, such as product promotion campaigns and influence maximization. Kempe et al. have proposed the Linear Threshold Model based on the idea of node-specific thresholds, which is related to the cumulative effect of multiple exposures on a user and is one of the best-known social cascade models [8]. In addition, the exposure curve P(k) of each user can help us understand users' forwarding behavior and further identify the users that are critical to triggering a diffusion from the perspective of sender and receiver. Indeed, Aral et al. have moved along this line and suggested that influential people with influential followers may be instrumental in the spread of products on social networks [34].
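A minimal sketch of the estimate P(k) = R(k)/W(k) for a single message is given below. It is an illustration under assumptions rather than the procedure used in the study: the function name `exposure_curve`, the event format `(timestamp, user)` and the `followers` dictionary are hypothetical, and only each user's first forward is treated as an exposure event, in line with the definition of the diffusion network.

```python
from collections import Counter


def exposure_curve(events, followers):
    """Estimate P(k) = R(k) / W(k) for one message.

    events    -- list of (timestamp, user); the earliest event is the
                 original post, later ones are forwards
    followers -- dict mapping a user to the set of users who follow them
    Returns {k: P(k)} for every k that at least one user reached.
    """
    exposures = Counter()  # exposures received so far by each user
    done = set()           # users who have already posted or forwarded
    W = Counter()          # W[k]: users who became k-exposed before forwarding
    R = Counter()          # R[k]: users who forwarded while k-exposed

    for _, user in sorted(events, key=lambda event: event[0]):
        if user in done:
            continue       # assumption: repeated forwards are ignored
        k = exposures[user]
        if k > 0:
            R[k] += 1      # forwarded directly after the k-th exposure
        done.add(user)
        # Every follower who has not forwarded yet receives one more exposure.
        for follower in followers.get(user, ()):
            if follower not in done:
                exposures[follower] += 1
                W[exposures[follower]] += 1

    return {k: R[k] / W[k] for k in sorted(W)}


if __name__ == "__main__":
    follower_map = {"a": {"b", "c", "d"}, "b": {"c", "d"}, "c": {"d"}}
    cascade = [(0, "a"), (1, "b"), (2, "c")]
    print(exposure_curve(cascade, follower_map))
```

To obtain a curve over many messages, one would sum the W and R counters across messages before taking the ratio, rather than averaging per-message ratios. Users who forward without any recorded exposure (k = 0) are left out of R in this sketch.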
To understand the variation of exposure curve for different messages, we classify messages into different categories and compare the exposure curves of each category. In our data set, messages could contain embedded URL and could be annotated with certain events denoted by several keywords. We classify all messages into different categories according to three criteria: (1) messages with embedded URL versus messages without embedded URL; (2) messages with events versus messages without events; and (3) messages with a single event versus messages with more than one event. The comparison of exposure curves is shown in Figure 5. We can see that the probability of forwarding a message with embedded URL is higher than that of forwarding a message without embedded URL, as shown in Figure 5(a). The probability of forwarding a message with events is higher than that of forwarding a message without events, as shown in Figure 5(b). The probability of forwarding a message with more than one event is higher than that of forwarding a message with a single event, as shown in Figure 5(c). In addition, the probability of forwarding a message with embedded URL or with events is higher than P(k) over all messages which is depicted in Figure 4(c). These findings indicate that users are prone to forward messages containing more information, e.g., with a URL providing additional information or with events implying much more information related to the message. In addition, a message with events can trigger more discussions about the events.
We further investigate the exposure curves of messages corresponding to individual events. The majority of them are similar to the overall shape in Figure 4(c). In particular, we notice that P(k) increases with more exposures to a message in some cases while P(k) decreases with more exposures in others. As examples, Figure 6(a) and Figure 6(b) show the exposure curves for the events ''Foxconn worker falls to death'' and ''Wenzhou train collision'' respectively. For these two particular cases, only a very small number of users are exposed more than five times. This makes the value of P(k) statistically unreliable. Thus, we depict the curve P(k) only for k ≤ 5. The difference between the two curves lies in the specific contexts of these messages. The ''Foxconn worker falls to death'' events occurred successively in a short period of time and prompted wide and in-depth discussions about laborers' working conditions and pay. As a result, the more exposures one receives, the higher the probability that one becomes involved. However, the ''Wenzhou train collision'' event happened suddenly. Two high-speed trains collided with each other, 40 people were killed, and at least 192 were injured. Officials responded to the accident by hastily concluding rescue operations and ordering the burial of the derailed cars. These actions elicited strong criticism from Chinese media and online communities. In response, the government issued directives to restrict media coverage, which was met with limited compliance, even on state-owned networks. Thus, the distinct forwarding curve P(k) for the messages about this event is partly caused by outside intervention (http://en.wikipedia.org/wiki/Wenzhou_train_collision).

Analysis of structural motif and temporal pattern
In this section, we study the structural and temporal patterns in diffusion process to answer the ''forwarding-whom'' question: for an individual exposed to a message for multiple times, whom does the individual echo, the first one, the last one, or the most influential one? Among all the cases where a user forwards a message after multiple exposures, 2-exposure case is the most frequent one and the study of ''forwarding-whom'' for 2-exposure case can be easily extended to other cases of multiple exposures. Thus, we only focus on the 2-exposure case that one user is exposed to a message for two times and then forwards it. In addition, we only consider multiple exposures from different users rather than multiple exposures to one distinct user.
For the 2-exposure case, without loss of generality, we assume that one exposure is from user A and the other exposure is from user B. Then, according to the relationship between A and B, we have three types of structural motifs, which are (a) ''Diverse motif'' if there is no direct relationship between A and B, (b) ''Reciprocal motif'' if A follows B and B also follows A, and (c) ''Unidirectional motif'' if A follows B or B follows A. Table 1 shows the percentage of the three types of two-node motifs over all data set. We can see that the percentage of ''Diverse motif'' over the whole data set is 76.5%, which is significantly higher than the other two patterns. For a detailed analysis, we further report the percentage of the three types of two-node motifs in different categories depicted in the previous section. We find that the percentage of ''Diverse motif'' over messages with events is 83.9%, which is higher than that over messages without events. The percentage of ''Diverse motif'' is even higher, i.e., 87.8%, over messages with more than one event. In addition, the percentage of ''Diverse motif'' over messages with a single event is 83.1%, which is still higher than the average percentage of that over all data set. However, the percentage of the three types of motifs over messages with or without URL is close to that over all data set. One possible explanation for these findings is that a message with events might trigger more discussions about the events, and then an individual is more likely to be exposed to the message for multiple times.
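The three motif types defined above can be expressed as a short classification routine. This is a sketch under assumptions: the function name `classify_motif` and the representation of the social graph as a set of `(follower, followee)` pairs are hypothetical, and ''Unidirectional'' is read as exactly one of the two follow links being present.

```python
def classify_motif(a, b, follows):
    """Classify the relationship between the two exposing users A and B.

    follows -- set of (follower, followee) pairs from the social network
    """
    a_follows_b = (a, b) in follows
    b_follows_a = (b, a) in follows
    if a_follows_b and b_follows_a:
        return "reciprocal"
    if a_follows_b or b_follows_a:
        return "unidirectional"   # exactly one direction is present
    return "diverse"              # no direct relationship between A and B


if __name__ == "__main__":
    graph = {("u1", "u2"), ("u2", "u1"), ("u3", "u1")}
    for pair in [("u1", "u2"), ("u1", "u3"), ("u2", "u3")]:
        print(pair, classify_motif(*pair, graph))
```

Counting the returned labels over all 2-exposure cases would give percentages of the kind reported in Table 1.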
Furthermore, we divide the messages into four classes according to their popularity: class 0, messages that were forwarded 10 to 100 times; class 1, messages that were forwarded 100 to 1000 times; class 2, messages that were forwarded 1000 to 10000 times; and class 3, messages that were forwarded more than 10000 times. As shown in Table 1, from class 0 to class 3, the percentage of the ''Diverse motif'' increases while the other two decrease. This finding reveals a correlation between message popularity and the structural diversity of the diffusion network.
We now turn to the problem we call ''forwarding-whom''. Given an individual X who is exposed to a message from two different users, A and B, from whom will X forward the message, A or B? We analyze this ''forwarding-whom'' problem in our data set. The results are shown in Table 2. When a user is exposed twice, the percentage of the temporal pattern in which X forwards from the later exposure is 85.5%, while the percentage of the pattern in which X forwards from the earlier one is just 14.5%. Furthermore, if user A's indegree on the social graph is larger than B's, the percentage of the temporal pattern in which X forwards from A is 38.9%. If A is the source of the message, the percentage is 43.7%. These results on the temporal patterns in information diffusion provide empirical evidence for understanding the forwarding behavior of individuals and the evolution of the diffusion network.

Discussion
In this paper, we have analyzed information diffusion on a microblogging network from a microscopic perspective. Our study is conducted on the biggest microblogging network in China. Specifically, we have studied the cumulative effect of multiple exposures on Sina Weibo. We have also studied in detail how this effect varies across groups of messages divided according to their associated events. We have observed a peak in the probability of forwarding at 2 exposures and then a slow drop. We have found that the probability of forwarding a message that contains an embedded URL, or that is related to a single event or to multiple events, is significantly higher than that of other messages. We have examined the exposure curves corresponding to different events specifically. To our surprise, we have found that the exposure curve could be affected by outside intervention, such as restrictions on media coverage. Furthermore, we have investigated the structural and temporal patterns frequently occurring in information diffusion. These findings provide insights into the fundamental mechanism of information diffusion and help predict the forwarding behavior of an individual.
A long list of extensions can be conducted based on our findings. Examples include deep exploration on the relationship between the final popularity of a message and the characteristics of the networks spanned by early adopters, i.e., the users who view or forward the content in the early stage of content dissemination. We will further study the various roles played by individuals on social network. A probabilistic view might be introduced to explain the cumulative effect of multiple exposures. Besides, one is also encouraged to discover more temporal characteristics by time series analysis. As future work, we will be devoted to the modeling of forwarding behavior of individuals and the popularity prediction problem.

Materials and Methods
The data set is collected from the most popular Chinese microblogging service, namely Sina Weibo. Sina Weibo has more than 300 million registered users and generates about 100 million messages per day. The length of each message is at most 140 characters. Users obtain messages from other users through the following relationships among them. Each following relationship is a directed link from the follower to the followee. For each user, the messages from his/her followees are ranked chronologically. Users can both post new messages and forward other users' messages. We obtained the data set from the WISE 2012 Challenge (http://www.wise2012.cs.ucy.ac.cy/challenge.html). This data set was crawled via the API provided by Sina Weibo. According to Sina Weibo's Terms of Service, both the user IDs and the message IDs are anonymized. The content of messages is also removed. However, some messages are annotated with events. Each event has the terms used to identify the event and a link to a Wikipedia (http://wikipedia.org) page containing a description of the event.
In this paper, we only use the messages that were originally posted to Sina Weibo between July 1, 2011 and July 31, 2011. There are 16.6 million such messages. For each message, we collect its forwarding information between July 1, 2011 and August 31, 2011. For each forwarding of a message, the recorded information contains the anonymized user IDs, the timestamp of this forwarding, and the forwarding path containing all the anonymized users on the path from the original user to the current user. The timestamp has a resolution of one second.
In addition, the data set also contains a snapshot of the social network recording the followships among users. The social network contains 58.6 million users and 265.5 million followships among them.