Emergence, Evolution and Scaling of Online Social Networks

Online social networks have become increasingly ubiquitous, and understanding their structural, dynamical, and scaling properties is not only of fundamental interest but also has a broad range of applications. Such networks can be extremely dynamic, generated almost instantaneously by, for example, breaking-news items. We investigate a common class of online social networks, the user-user retweeting networks, by analyzing empirical data collected from Sina Weibo (a massive Twitter-like microblogging social network in China) on the topic of the 2011 Japan earthquake. We uncover a number of algebraic scaling relations governing the growth and structure of the network and develop a probabilistic model that captures the basic dynamical features of the system. The model is capable of reproducing all the empirical results. Our analysis not only reveals the basic mechanisms underlying the dynamics of the retweeting networks, but also provides general insights into the control of information spreading on such networks.


Introduction
Online social networks have become an indispensable part of modern society for obtaining and spreading information. A piece of breaking news can activate a corresponding online social network, through which the news topic can spread rapidly to many individuals. By its very nature an online network is necessarily time dependent, growing rapidly in size initially as the news spreads and saturating after a certain amount of time. Since online social networks concerning certain topics can be active for only a transient period of time, they are extremely dynamic. This makes them quite distinct from the networks typically studied in the literature, which can be regarded as stationary with respect to the time scale of the dynamical processes they support. A question of interest is whether there are general rules underlying the evolution of online social networks. A viable approach to addressing this question is to analyze large empirical data sets that are becoming increasingly accessible [1]. In fact, recent years have witnessed a growing research interest in online social network systems. Efforts have addressed issues such as network and opinion coevolution [2], comparison of user participation across topics of current interest [3], information diffusion patterns in different domains [4,5], the dynamics of users' activity across topics and time [6,7], modeling of user behavior on networks [8,9], popular topic-style analysis in Twitter-like social media [10-12], user influence in social networks [13], and language geography studies of Twitter data sets [14].
In this paper, we aim to uncover the fundamental mechanisms underpinning the dynamical evolution of online social networks through empirical-data analysis. Our data come from Sina Weibo, a Twitter-like microblogging social network medium in China. The appealing features of the data include wide publicity, real-time availability of information, and message compactness. Similar to Twitter, Weibo attracts users through all kinds of breaking news and spotlight topics, such as the "Japan Earthquake", "Oscar Ceremony", "Boston Marathon Terrorist" and so on. All users can see messages, called Weibos in Chinese, published by the users they follow. Given a specific topic of interest, an individual can join the corresponding online social network simply by retweeting (forwarding) or tweeting (posting) the interesting Weibo [15]. To be concrete, we take the empirical data set of the Weibo topic on "Japan Earthquake" and focus on the spatiotemporal dynamics of the user-user retweeting network. We characterize the network by quantities such as the network size, the in-degree and out-degree distributions (which correspond to the frequencies of retweeting others and of being retweeted by others, respectively), and the in- and out-degree correlations. Our main findings are the following: (1) initially the network size increases algebraically with time, but it begins to plateau at a critical time when another significant topic of interest emerges; (2) both the in- and out-degrees of the dynamic online social network follow fat-tailed, approximately algebraic distributions; and (3) degree-correlation analysis indicates that the average out-degree is approximately independent of the in-degree. Based on these results and the rules of online social network systems, we articulate a theoretical model for the dynamical evolution of these networks. Simulation results of the model agree well with those from the empirical data. Our analysis also suggests a controlled approach to significantly enhancing information spreading on online social networks.

Results
The 2011 Japan earthquake was a magnitude-9.0 undersea megathrust earthquake that occurred on March 11 in the north-western Pacific Ocean near Tohoku, Japan. It was the most powerful earthquake ever to hit Japan; it triggered powerful tsunami waves and caused nuclear accidents at the Fukushima Daiichi Nuclear Power Plant complex [16], leading to tremendous loss of human life and large-scale infrastructure damage. This catastrophe aroused wide concern and discussion all over the world, especially in China. Since Weibo is the most accessible online social medium in China, a large number of Chinese users joined Weibo to discuss the earthquake and related issues, forming an extremely dynamic user-user retweeting network. We analyze more than 500 thousand Weibo items concerning "Japan Earthquake", starting from the 1st day of the earthquake until the 100th day (defined in Methods). A simple way for a user to join the Weibo social network is to retweet other users' Weibos. The user-user retweeting network can be generated from the data by identifying the retweeting actions among the users. In particular, when a Weibo published by user $i$ is retweeted by user $j$, we draw a directed link from $i$ to $j$. If $j$ again retweets a Weibo published by $i$, another link from $i$ to $j$ is added, and so on. There can thus be duplicate links between any two users in the retweeting network. For the case that a Weibo published by user $i$ is retweeted by user $j$, and then retweeted again from $j$ (instead of from $i$) by user $m$, we draw two directed links, one from $i$ to $j$ and one from $i$ to $m$. No link from $j$ to $m$ is established, since $j$ acts only as an intermediary in the associated information spreading process. In the network, a relatively large value of the out-degree indicates that the corresponding user may act as a main source of information, while a large value of the in-degree suggests a high level of retweeting activity of the corresponding user.
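The link-construction rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record format `(original_author, retweeter)`, the function name `build_retweet_network`, and the sample users are all hypothetical.

```python
from collections import Counter

def build_retweet_network(records):
    """Build a directed multigraph from (original_author, retweeter) pairs.

    Each record adds one link from the original author to the retweeter,
    even if the retweet passed through intermediaries. Repeated pairs
    give duplicate links, stored here as multiplicities.
    """
    links = Counter()   # (source, target) -> number of duplicate links
    k_out = Counter()   # times a user's Weibos were retweeted
    k_in = Counter()    # times a user retweeted others
    for author, retweeter in records:
        links[(author, retweeter)] += 1
        k_out[author] += 1
        k_in[retweeter] += 1
    return links, k_in, k_out

# Hypothetical sample: j retweets i twice; m retweets i's Weibo via j,
# which is still recorded as a link from i to m (none from j to m).
records = [("i", "j"), ("i", "j"), ("i", "m")]
links, k_in, k_out = build_retweet_network(records)
```

Storing multiplicities in a `Counter` keeps duplicate links without needing a graph library; in this sample user `i` ends up with out-degree 3 and user `j` with in-degree 2.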

Evolution of the user-user retweeting network
Figure 1(a) shows the evolution, in days, of the number $N$ of users connected by retweeting links (green circles). We observe that for the initial period of about 7 days, the size of the network increases approximately algebraically with a scaling exponent of about 1.3. At the critical time $t_c$, where $t_c \approx 7$, a crossover occurs, after which the number of nodes increases slowly or plateaus. While an algebraic scaling relation does not, in general, permit the definition of a global growth rate, we can still define an "instantaneous" growth rate: the increment $\Delta N$ per day. As shown in Fig. 1(b), the "instantaneous" growth rate is approximately constant for $t < t_c$, but for $t \geq t_c$ the rate decreases approximately algebraically from about $10^4$ per day to about $10^1$ per day at the end of the data duration.
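As a rough illustration of the two quantities discussed here, the sketch below computes the daily increment $\Delta N$ and estimates a growth exponent from the slope of a log-log least-squares fit. The source does not state how the growth exponent was fitted, so the fitting method is an assumption, and the data are synthetic (generated with exponent 1.3), not the Weibo data.

```python
import math

def daily_increments(sizes):
    """Return the 'instantaneous' growth rate: the increment of N per day."""
    return [later - earlier for earlier, later in zip(sizes, sizes[1:])]

def loglog_slope(ts, ys):
    """Least-squares slope of log(y) versus log(t)."""
    xs = [math.log(t) for t in ts]
    ls = [math.log(y) for y in ys]
    mean_x = sum(xs) / len(xs)
    mean_l = sum(ls) / len(ls)
    num = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, ls))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic network sizes for days 1..7 (hypothetical prefactor 100).
days = list(range(1, 8))
sizes = [100.0 * t ** 1.3 for t in days]
exponent = loglog_slope(days, sizes)
increments = daily_increments(sizes)
```

On noise-free synthetic data the fitted slope recovers the generating exponent; on real data the estimate would depend on the fitting window chosen before the crossover.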
The remarkable change in the temporal behavior of the system on the 7th day demands a sensible explanation. By looking further into the data and searching other media for information about "Japan Earthquake", we find that, at the critical time $t_c$, many users switched to discussing the issue of "Salt Rush", which is closely related to "Japan Earthquake." In fact, on the 7th day after the earthquake, a rumor began to spread on Weibo that salt may offer protection against radiation, but that the radiation leak from the Fukushima nuclear plant explosion would contaminate sea-salt products [17]. This new topic diverted many users' attention from the primary "Japan Earthquake" topic to the "Salt Rush" topic, and for $t > t_c$ many users stopped discussing the "Japan Earthquake" topic. As a consequence, the instantaneous growth rate for the original topic began to decrease.

Fat-tailed distribution of in- and out-degrees
The out-degree of user $j$, denoted by $k^{\text{out}}_j$, is the total number of times $j$'s Weibo(s) were retweeted by other users in the network, and the in-degree of $j$, denoted by $k^{\text{in}}_j$, is the number of retweets $j$ has performed. We note that the algebraic scaling exponents are 3.50 for the in-degree and 2.48 for the out-degree distributions. Moreover, the maximum value of the in-degree is 67, while the out-degree has a much larger maximum value (5,825). This means that, while the capacity of any individual user to retweet others is limited, users' collective retweeting behavior may congregate, generating superhubs with very large out-degrees. This can be considered evidence for preferential selection in the retweeting process, introduced by the scheme through which the Weibo system updates and recommends information.

Model of user-user behavior network
In the Weibo system, up-to-date topics emerge all the time and are recommended to users through the list of their friends' retweeting actions, ordered by time. As soon as a new item is added to the recommendation list, one of the early items is removed from the list. This rule stipulates that, when some extreme event occurs, the related topics may rapidly cover the entire recommendation list and attract more users who might not have paid any attention initially. This process could also attract users who are less likely to be interested in the topic. Thus, the number of potential users who may join the retweeting network and then become enabled users will increase. This mechanism in fact generates a self-reinforcing (positive feedback) process that makes the messages spread extremely fast initially in the Weibo system. Conversely, this kind of recommendation mechanism may also dramatically reduce the number of nodes joining the network when alternative topics emerge. As can be seen from Fig. 1, the "Salt Rush" event occurring on the 7th day after the Japan earthquake is a typical distracting topic with respect to the original earthquake topic. After the distracting topic emerged, the retweeting dynamics associated with the original topic entered a phase with distinct scaling behaviors.
The sketch map in Fig. 3 briefly illustrates the generation scheme of the retweeting network in our model, taking the aforesaid empirical rules and observations into consideration. The dynamical process of retweeting is usually initiated by some primary users' reporting of specific events. The basic element in the process is the spontaneous retweeting action of some users, i.e., a potential user voluntarily builds a directed link pointing from another user towards him- or herself. The final in-degree of each user characterizes its inherent property, i.e., the level of activity in the related topic. The algebraic in-degree distribution signifies the heterogeneity and diversity in the user activities. We are thus led to define the activity level of individual $i$ as
$$a_i = \frac{k^{\text{in}}_i}{\max[k^{\text{in}}]},$$
where $\max[k^{\text{in}}]$ is the maximum in-degree of all users in the system. A potential user $i$ will retweet a related message from others, i.e., add one in-link, with probability $I_i \equiv a_i$ at each time step. As soon as the first in-link is established, the user is enabled to behave as a new source of the topic and can be retweeted by others. The enabled users are thus those connected to the user-user retweeting network, which can be identified from real data. The probability $O_i$ for an enabled user $i$ to be retweeted by another potential user, i.e., to add an out-link, is
$$O_i \propto \frac{k^{\text{out}}_i}{\sum_{j \in P_e} k^{\text{out}}_j},$$
where $P_e$ denotes the set of enabled users. The proportional relation reflects the fact that, if a user is retweeted by others more frequently, its actions will appear in the recommendation list more often and are thus more likely to be further retweeted.
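The two rules of the model (retweet with probability equal to one's activity, and choose the source preferentially by out-degree) can be turned into a small simulation. The sketch below is a simplified illustration and not the authors' code. It assumes a fixed pool of users, sequential updating within each time step, and a source-selection weight of `k_out + 1` so that users with no retweets yet can still be chosen. The offset, the function name `simulate_retweeting`, and all parameter values are assumptions.

```python
import random

def simulate_retweeting(activities, n_seeds, steps, seed=0):
    """Simulate the retweeting model for a fixed pool of users.

    activities[i] is the per-step retweet probability a_i of user i.
    The first n_seeds users (n_seeds >= 1) start enabled and act only
    as sources.
    """
    rng = random.Random(seed)
    n = len(activities)
    k_in = [0] * n
    k_out = [0] * n
    enabled = list(range(n_seeds))
    is_enabled = [u < n_seeds for u in range(n)]
    for _ in range(steps):
        for i in range(n_seeds, n):
            if rng.random() >= activities[i]:
                continue  # user i does not retweet at this step
            sources = [u for u in enabled if u != i]
            # Assumed weight k_out + 1: preferential selection with an
            # offset that is not taken from the source text.
            weights = [k_out[u] + 1 for u in sources]
            src = rng.choices(sources, weights=weights)[0]
            k_out[src] += 1
            k_in[i] += 1
            if not is_enabled[i]:
                is_enabled[i] = True
                enabled.append(i)
    return k_in, k_out, enabled

# Hypothetical run: 50 users with identical activity 0.1, 3 seed users.
k_in, k_out, enabled = simulate_retweeting([0.1] * 50, n_seeds=3, steps=20, seed=1)
```

Every retweet adds exactly one in-link and one out-link, so the total in-degree always equals the total out-degree. Heterogeneous activities and a growing pool of potential users would be needed to approach the behavior reported in the paper.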
The temporal evolution of the number of enabled users $N_e$ can be obtained analytically. The recommendation mechanism requires that the number of potential users (denoted by $N_p$) increases rapidly with time in the initial phase of the retweeting process. To gain insights, we first consider the simple case where $N_p$ is assumed to be constant. The probability for a potential user $i$ to retweet the topic (i.e., to become enabled) at each time step is $I_i = a_i$ (the user's own level of activity). The probability for user $i$ to be enabled before time $t$ is then $p^t_i = 1 - (1 - a_i)^t \equiv f(a_i, t)$. For the case where the users have an identical activity level $a$, the expected number of enabled users at time $t$ is $\langle N_e(t) \rangle = N_p f(a, t)$, where $N_e(t)$ is distributed binomially:
$$P[N_e(t) = n] = \binom{N_p}{n} f(a, t)^n \, [1 - f(a, t)]^{N_p - n}.$$
Assuming that the user activity obeys a given probability distribution $P(a_i)$, the expected number of enabled users is
$$\langle N_e(t) \rangle = N_p \sum_{a_i} P(a_i) f(a_i, t).$$
As can be seen from the real data in Fig. 2(a), user activities $a_i$ are typically heterogeneous: the number of retweeting actions performed by users (the in-degree) ranges from 1 to 67 and approximately follows an algebraic distribution $P(k^{\text{in}}) \sim (k^{\text{in}})^{-\gamma}$, with $\gamma \approx 3.5$.
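The constant-$N_p$ case is easy to evaluate numerically. The sketch below implements the enabling probability $1 - (1 - a)^t$ and the resulting expected number of enabled users for a discrete activity distribution. The function names and the two-valued activity distribution are hypothetical choices made only for illustration.

```python
def f(a, t):
    """Probability that a user with activity a is enabled within t steps."""
    return 1.0 - (1.0 - a) ** t

def expected_enabled(n_p, activity_dist, t):
    """Expected number of enabled users for a constant pool of n_p users.

    activity_dist maps each activity level a to its probability P(a).
    """
    return n_p * sum(p * f(a, t) for a, p in activity_dist.items())

# Hypothetical distribution: half the users have a = 0.1, half a = 0.5.
dist = {0.1: 0.5, 0.5: 0.5}
values = [expected_enabled(1000, dist, t) for t in range(1, 6)]
increments = [b - a for a, b in zip(values, values[1:])]
```

With a constant pool, the per-step increment shrinks at every step, because each user's chance of still being unenabled decays geometrically.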
From the expressions of $p^t_i$ and $\langle N_e(t) \rangle$, we see that the growth rate of $\langle N_e(t) \rangle$ is a monotonically decreasing function of time. However, from Fig. 1, the rate $\Delta N_e(t)$ from the real data increases in the initial phase after the network emerges. This discrepancy originates from the assumption of constant $N_p$ in the simple case of our probabilistic model, whereas in the real system, $N_p$ increases rapidly initially as a result of the recommendation mechanism. It is thus necessary to take into account the fact that, at time step $t'$, $\Delta N^{t'}_p$ new potential users become aware of the topic from their respective recommendation lists on the Weibo page and then retweet with probability $I_i = a_i$. Here, $\Delta N^{t'}_p \equiv N_p(t') - N_p(t' - 1)$ and $N_p(0) = 0$. We assume that the time step $t'$ at which user $i$ becomes aware of the topic is independent of its activity level $a_i$. Equivalently, the activity levels of the new potential users at each time step obey the same distribution $P(a_i)$. Taking the increment of $N_p$ into account, we obtain the expected number of enabled users as
$$\langle N_e(t) \rangle = \sum_{t'=1}^{t} \Delta N^{t'}_p \sum_{a_i} P(a_i) f(a_i, t - t' + 1),$$
where $t - t' + 1$ is the duration for which the potential users have been aware of the topic since $t'$. The exact form of the function $\Delta N^{t'}_p$ cannot be obtained explicitly, as we can observe from the data only the increment in the number $N_e$ of enabled users. However, we note that the analog of $N_p$ is the coverage of a spreading process of the topic associated with the recommendation mechanism, which takes place on the underlying friendship network of the Weibo system. We thus have [18], approximately, $\Delta N^t_p \sim \nu t^{\beta}$, where the parameters $\nu$ and $\beta$ can be obtained by fitting to the real data.
Figure 2. In- and out-degree distributions of the user-user retweeting network generated from real Japan earthquake data (a) and from the model (b). The four distributions can be fitted as $P(k) \sim k^{-\alpha}$ with algebraic scaling exponents $\alpha \approx 3.50$, 2.48, 3.50, and 2.57 for the real in- and out-degree and the model in- and out-degree distributions, respectively. The distributions were recorded at $t = 100$ days, and the values of $\alpha$ are estimated using the maximum-likelihood estimator [18]. doi:10.1371/journal.pone.0111013.g002
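The expected number of enabled users under a growing pool of potential users is a sum over arrival times, which can be evaluated directly. The sketch below is a minimal illustration: the function names, the constant inflow of 100 new potential users per step, and the single activity level 0.5 are hypothetical values chosen so the arithmetic can be checked by hand.

```python
def f(a, t):
    """Probability that a user with activity a is enabled within t steps."""
    return 1.0 - (1.0 - a) ** t

def expected_enabled_growing(delta_np, activity_dist, t):
    """Expected enabled users when delta_np(t') new potential users
    become aware of the topic at each step t' = 1, ..., t."""
    total = 0.0
    for t_prime in range(1, t + 1):
        aware_for = t - t_prime + 1  # steps since this cohort became aware
        cohort = sum(p * f(a, aware_for) for a, p in activity_dist.items())
        total += delta_np(t_prime) * cohort
    return total

# Hypothetical inputs: constant inflow, identical activity a = 0.5.
inflow = lambda t_prime: 100.0
dist = {0.5: 1.0}
n1 = expected_enabled_growing(inflow, dist, 1)
n2 = expected_enabled_growing(inflow, dist, 2)
n3 = expected_enabled_growing(inflow, dist, 3)
```

Here the per-step increments grow (50, then 75, then 87.5), in contrast to the constant-pool case, which is the qualitative effect that an increasing number of potential users is meant to capture.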
As can be seen from Fig. 1, there is a crossover in the time evolution of $N_e$ due to the emergence of an alternative topic. For convenience, we denote the original topic, which begins at $t = 0$, by $A_0$ and the new topic, which emerges at $t = t_c$, by $A_1$. For $t \geq t_c$, as illustrated in Figs. 3(c) and 3(d), $A_1$ competes against $A_0$ for potential users. We assume that the basic dynamical process underlying $A_1$ is identical to that of $A_0$. The number of potential users left in $A_0$ for $t \geq t_c$ is thus given by $\tilde{N}_p(t) = N_p(t) - N_p(t - t_c)$, giving rise to a decrease in the instantaneous growth rate of the number of enabled users.
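The effect of the competing topic on the pool of potential users can be checked with a short calculation. In the sketch below, the cumulative form `n_p` (square-root growth with prefactor 1000) is a hypothetical stand-in, not the fitted function from the data, and the function name is likewise an assumption.

```python
def remaining_potential_users(n_p, t, t_c):
    """Potential users left for the original topic A0 at time t.

    n_p(t) is the cumulative number of potential users with n_p(0) = 0.
    After t_c, the new topic A1 draws users by the same process,
    shifted by t_c.
    """
    if t < t_c:
        return n_p(t)
    return n_p(t) - n_p(t - t_c)

# Hypothetical cumulative growth of potential users.
n_p = lambda t: 1000.0 * t ** 0.5
t_c = 7
before = remaining_potential_users(n_p, 3, t_c)
at_crossover = remaining_potential_users(n_p, 7, t_c)
after = remaining_potential_users(n_p, 8, t_c)
```

For any cumulative curve that grows more slowly over time, as in this example, the remaining pool for the original topic shrinks once the competing topic appears.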
Our model can be simulated to yield behaviors that reproduce those from the real data. In particular, in the simulation, each user's activity level $a_i$ is proportional to its in-degree, whose distribution can be obtained from the data. The increment of potential users obtained from data fitting is $\Delta N_p(t) \approx 1.89 \times 10^5 \, t^{-1.1}$. The topic $A_0$ is initially announced by $N_e(0)$ enabled users to trigger the retweeting process [e.g., $N_e(0) = 3$]. Results of $N_e(t)$ from our model agree well with those from the data, as shown in Fig. 1. The reproduced in- and out-degree distributions are shown in Fig. 2, which again agree with the distributions from the real data.
To further validate our model, we calculate and compare the degree-degree correlation behaviors from the real data and our model. Figure 4(a) plots the out-degree versus the in-degree for all users in the network at time $t = 100$. Figures 4(b) and 4(c) show, respectively, the average in-degrees for users having the same out-degree and the average out-degrees for users with the same in-degree. The two types of average values are approximately constant but with significant spreads, and the results from our model are qualitatively consistent with those from the real data. The spread can be attributed to fluctuations due to the small number of nodes with large in- or out-degrees. Furthermore, we have also calculated the Pearson correlation for directed networks [19] for both the user-user retweeting network and the network generated from our model. The four directed assortativity measures from the Pearson correlation, i.e., the (in, in), (in, out), (out, in), and (out, out) degree correlations averaged over pairs of neighbor nodes, are all found to be around zero.
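The four directed degree correlations can be computed by taking, for every link, one degree type at the source node and one at the target node, and then evaluating the Pearson coefficient over all links. The sketch below follows this common definition; whether it matches the exact estimator of [19] in every detail is an assumption, and the small edge list is made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns None if either variance is zero."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return None
    return cov / (sx * sy)

def directed_assortativity(edges):
    """Return the (in,in), (in,out), (out,in), (out,out) correlations,
    where the first label refers to the source of each link and the
    second to its target."""
    k_in, k_out = {}, {}
    for u, v in edges:
        k_out[u] = k_out.get(u, 0) + 1
        k_in[v] = k_in.get(v, 0) + 1
    degree = {"in": k_in, "out": k_out}
    result = {}
    for a in ("in", "out"):
        for b in ("in", "out"):
            xs = [degree[a].get(u, 0) for u, _ in edges]
            ys = [degree[b].get(v, 0) for _, v in edges]
            result[(a, b)] = pearson(xs, ys)
    return result

# Hypothetical directed links (source, target).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")]
r = directed_assortativity(edges)
```

Duplicate links, if present in the edge list, are counted once per occurrence, which is consistent with averaging over pairs of neighbor nodes in a multigraph.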
What would be an effective way to spread information? In a Twitter-like virtual social network, the performance of individual users in the spreading process is determined by their out-degrees $k^{\text{out}}_i$ [20,21]. Selecting users with larger out-degrees as the sources of spreading would then result in higher coverage in the subsequent time steps. To better understand the spreading process, we plot in Figs. 5(a) and 5(c) the average out-degree of each user's neighbors, denoted by $\langle k^{\text{out}}_{n_i} \rangle$, versus the user's own out-degree $k^{\text{out}}_i$, obtained from the real data and from the model, respectively, where the solid circles denote the average values of $\langle k^{\text{out}}_{n_i} \rangle$ over the users with the same value of $k^{\text{out}}_i$. We see that for users with a given out-degree $k^{\text{out}}$, the value of $\langle k^{\text{out}}_{n_i} \rangle$ is distributed over a wide interval of about 3 orders of magnitude. However, the average of $\langle k^{\text{out}}_{n_i} \rangle$ over each $k^{\text{out}}_i$ (the solid circles) is approximately constant. Figures 5(b) and 5(d) plot the product of the out-degree and the average neighbor out-degree, $k^{\text{out}} \cdot \langle k^{\text{out}}_{n_i} \rangle$, which measures the new information coverage one step after spreading from that particular user. On a logarithmic scale, the relation between $k^{\text{out}} \cdot \langle k^{\text{out}}_{n_i} \rangle$ and $k^{\text{out}}$ is approximately linear with unit slope for both the real data and the model. Moreover, the users with larger sums of neighboring out-degrees are those who perform well in the spreading process if they are selected to be the source. The upper-left regions in Figs. 5(b) and 5(d) thus locate the users who are not so popular (small out-degrees) but can spread news efficiently because they have relatively large sums of neighboring out-degrees. These users are the optimal candidates to be controlled for spreading information if a rapid growth of the underlying network is desired.
Figure 5. Neighboring out-degrees in the user-user retweeting network generated from real data and from the model. (a) Average neighboring out-degree ($\langle k^{\text{out}}_{n_i} \rangle$) and out-degree ($k^{\text{out}}_i$) of each user from real data, (b) product of the out-degree and the average neighboring out-degree ($k^{\text{out}} \cdot \langle k^{\text{out}}_{n_i} \rangle$) of each user from real data, and (c,d) the respective results from the model. doi:10.1371/journal.pone.0111013.g005
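The quantities plotted in Fig. 5 can be computed from an edge list as sketched below. The sketch assumes that a user's neighbors are the users who retweeted that user (the targets of its out-links), counted once per link. This interpretation, the function name, and the sample edges are assumptions rather than details given in the text.

```python
from collections import defaultdict

def neighbor_out_degree_stats(edges):
    """For each user with out-links, return a tuple
    (out-degree, average neighbor out-degree, sum of neighbor out-degrees).

    The sum equals the product of the first two entries and estimates
    the coverage one step after spreading from that user.
    """
    k_out = defaultdict(int)
    neighbors = defaultdict(list)
    for source, target in edges:
        k_out[source] += 1
        neighbors[source].append(target)
    stats = {}
    for user, nbrs in neighbors.items():
        total = sum(k_out.get(v, 0) for v in nbrs)
        stats[user] = (k_out[user], total / len(nbrs), total)
    return stats

# Hypothetical links: s is retweeted by a and b; a by c, d, e; b by c.
edges = [("s", "a"), ("s", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c")]
stats = neighbor_out_degree_stats(edges)
```

In this toy network, user `s` has a small out-degree but a comparatively large sum of neighboring out-degrees, which is the kind of user located in the upper-left region of Figs. 5(b) and 5(d).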

Discussion
Online social network systems are becoming increasingly ubiquitous in modern society. At present, few studies consider their dynamical behavior. Using the approach of empirical-data analysis, we have developed a probabilistic model for the growth dynamics of an important class of such systems: user-user retweeting networks. Our model is capable of reproducing the dynamical and statistical behaviors of the key characterizing quantities such as the growth of the network size, the in- and out-degree distributions, and the degree-degree correlations. The development of our model also leads to insights into controlling the information-spreading dynamics on these extremely dynamic networks. Our work represents an initial step in understanding, modeling, and controlling online social network systems, with potential applications not only in social sciences (e.g., for controlling opinion spreading) and commerce (e.g., for developing efficient recommendation algorithms), but also in other disciplines where rapidly time-varying, dynamic networks arise.