
Conceived and designed the experiments: KKY MG. Performed the experiments: KKY. Analyzed the data: KKY. Contributed reagents/materials/analysis tools: KKY. Wrote the paper: KKY MG.

The authors have declared that no competing interests exist.

The presence of web-based communities is a distinctive signature of Web 2.0. Because these communities are web-based, information propagates easily within each of them, which promotes complex collective dynamics of information exchange. In this work, we focus on a community of scientists and study, in particular, how awareness of a scientific paper spreads. Our work is based on the web usage statistics obtained from the PLoS Article-Level Metrics dataset compiled by PLoS. The cumulative number of HTML views was found to follow a long-tailed distribution that is reasonably well fitted by a lognormal one. We modeled the diffusion of information by a random multiplicative process, and thus extracted the rates of information spread at different stages after the publication of a paper. We found that the spread of information displays two distinct decay regimes: a rapid downfall in the first month after publication, and a gradual power law decay afterwards. We identified these two regimes with two distinct driving processes: a short-term behavior driven by the fame of a paper, and a long-term behavior consistent with citation statistics. The patterns of information spread were found to be remarkably similar in data from different journals, but there are intrinsic differences for different types of web usage (HTML views and PDF downloads versus XML). These similarities and differences shed light on the theoretical understanding of different complex systems, and could inform a better design of the corresponding web applications, which has high potential marketing impact.

In the era of Web 2.0, individuals are tightly connected in the virtual worlds and form various online communities. Recently, the propagation of information in these online communities has gained much attention

Over the last two decades, the WWW has revolutionized scientific research, in particular by speeding up the spread of information. Nowadays, once a paper is electronically published on a journal website, the information can propagate rapidly in the community, partially due to various scientific blogs and folksonomy websites like CiteULike and Connotea. The spread of a paper will then be reflected at the level of web usage statistics, in particular, the number of HTML views, i.e. the WWW traffic of the webpage corresponding to the paper. In this work, we regard readers of the 6 PLoS journals (PLoS Biology, PLoS Computational Biology, PLoS Genetics, PLoS Medicine, PLoS One and PLoS Pathogens) as a community of scientists. As an estimate of the size of the community, there were over 4000 papers published in 2008 and the total HTML views numbered over 7 million. We quantitatively examine the propagation process by studying the monthly web usage statistics of individual papers reported in the PLoS Article-Level Metrics (ALM) dataset. The dataset contains the monthly numbers of HTML views, PDF downloads and XML downloads of more than 13000 papers published from 2003 to 2009, recorded since their publication. Compiled by PLoS, the ALM dataset (

To develop a better intuition regarding the PLoS Article-Level Metrics, we examined the correlation pattern between various statistics.

The full meanings of the labels are: note threads (number of notes users put on an article), replies to comments (number of replies to comment threads of an article), rating+comments (number of users who leave a rating as well as a comment on an article), no. of ratings (how many times an article has been rated), average rating (the average rating an article received), comment threads (number of comment threads users put on an article), trackbacks (the number of trackbacks that have been made to this article by external sites), Bloglines, Nature Blogs and Postgenomics (the number of times an article has been blogged by the respective sites), Connotea and CiteULike (the counts of how many bookmarks have been made to an article by users of these social bookmarking sites), CrossRef, PubMed and Scopus (the counts of how many citations are recorded in these databases), HTML views, PDF downloads and XML downloads (the counts of HTML views, PDF and XML downloads for each article). The article access metrics, the citation metrics and the social bookmarking metrics form a broad cluster.
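As an illustration of how a pairwise correlation between two such metrics could be computed, the following is a minimal sketch using the Spearman rank correlation. The choice of rank correlation is an assumption made here (the text does not state which correlation measure was used), and the per-article counts are invented toy values, not values from the ALM dataset.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy per-article counts (hypothetical, for illustration only).
html_views = [1200, 340, 560, 2300, 150]
pdf_downloads = [300, 80, 150, 700, 40]
xml_downloads = [3, 5, 1, 2, 4]

rho_html_pdf = spearman(html_views, pdf_downloads)
rho_html_xml = spearman(html_views, xml_downloads)
```

Applying such a function to every pair of metrics yields a correlation matrix, which can then be clustered to reveal groups of related metrics.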

We next move our focus to the empirical observation of information propagation using a time series of web accesses. As described above, the number of PDF downloads and the number of HTML views are strongly correlated, and we thus use the HTML views as a proxy to measure information propagation. We considered 7000 papers that have been published for at least one year, and counted the number of HTML views they received at different time points after publication. These papers were published in one of the six PLoS journals: PLoS Biology, PLoS Computational Biology, PLoS Genetics, PLoS Medicine, PLoS One and PLoS Pathogens.

We study the access statistics of 7000 publications that have been published for at least one year. These publications belong to 6 different PLoS journals: PLoS Biology (1177), PLoS Computational Biology (688), PLoS Genetics (723), PLoS Medicine (1300), PLoS One (2796) and PLoS Pathogens (543). On average, the number of HTML views a paper receives decreases with its age. The decay is much faster from the first month to the second month after publication than in the subsequent period. The inset shows the average accesses of different journals normalized by the corresponding values of the first month. Note that the patterns are remarkably similar for the six different journals.

While the number of HTML views better reflects the knowledge of the existence of a paper, we repeated the decay pattern analysis for the number of PDF downloads, which might arguably measure the number of times a paper is read, and also the number of XML downloads.

The decay pattern of PDF downloads is consistent with the pattern of HTML views. Both profiles possess the same two phases of decay. The profile of XML downloads does not share the same characteristics.

In addition to the average number of HTML accesses as shown in

A. The cumulative number of HTML views of 7000 papers at the 3^{rd} month after publication is fitted using the maximum likelihood method by a lognormal distribution, with the mean and variance of the logarithmic values as shown. B. The normal Q-Q plot of the logarithm of the HTML views shown in panel A. Apart from a slightly longer tail, the plot is close to a straight line, meaning that the majority of the data is well explained by a lognormal distribution. C. The Q-Q plot between the cumulative accesses of the same set of 7000 papers at the 3^{rd} month and at the 10^{th} month after publication. The plot is very close to a straight line, suggesting the cumulative HTML views at different time points follow the same distribution up to certain shift and scaling factors.
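The maximum likelihood fit and the normal Q-Q comparison described in panels A and B can be sketched as follows. This is only an illustration under stated assumptions: the view counts are synthetic draws from a lognormal distribution with arbitrarily chosen parameters (6.0 and 0.8), not the ALM data, and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def fit_lognormal_mle(counts):
    """Maximum likelihood estimates for a lognormal distribution.

    Returns the mean and the (1/n) variance of the logarithmic values.
    All counts must be strictly positive.
    """
    logs = [math.log(c) for c in counts]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n
    return mu, var


def normal_qq_points(counts):
    """Pairs (theoretical normal quantile, observed log value).

    If the counts are lognormal, the points lie close to the line y = x.
    """
    logs = sorted(math.log(c) for c in counts)
    n = len(logs)
    mu, var = fit_lognormal_mle(counts)
    reference = NormalDist(mu, math.sqrt(var))
    return [(reference.inv_cdf((i + 0.5) / n), v) for i, v in enumerate(logs)]


# Synthetic stand-in for the cumulative HTML views of 7000 papers.
rng = random.Random(0)
views = [rng.lognormvariate(6.0, 0.8) for _ in range(7000)]

mu_hat, var_hat = fit_lognormal_mle(views)
qq = normal_qq_points(views)
```

A heavier-than-lognormal tail would show up as the last few Q-Q points bending above the straight line.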

While the lognormal distribution is a reasonable approximation to the distribution of cumulative HTML views at a certain instant, we ask whether this holds at any time point. As the lifespan of papers increases, the cumulative views of all individual papers increase monotonically. As shown in

It is well known that lognormal distributions can be generated by the so-called random multiplicative processes. A simple stochastic model, which was recently used by Wu and Huberman in a study of the voting statistics in digg.com, is used here. Let N_{t} be the cumulative number of HTML views at time t. The dynamical process is mathematically written as N_{t} = (1 + r_{t} X_{t}) N_{t−1}, where the X_{i} are positive, independent and identically distributed random variables with finite mean. A factor, r_{t}, is defined to moderate the average rate of spread of information at time t. As the time series are given at monthly resolution, r_{t} is a piecewise constant function such that r_{t} = r^{(j)} whenever t falls in the jth month after publication.
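A random multiplicative process of this kind can be simulated in a few lines. The sketch below assumes an update rule of the form N_{t} = (1 + r_{t} X_{t}) N_{t−1} with a monthly piecewise constant r_{t}. The exponential distribution chosen for X_{t}, the number of time steps per month, the initial size and the rate values are all hypothetical choices made for illustration, not parameters taken from the paper.

```python
import math
import random


def simulate_views(r_by_month, steps_per_month, n0, rng, x_mean=1.0):
    """Simulate one paper's cumulative views under a multiplicative process.

    r_by_month: modulating factor for each month (piecewise constant r_t).
    steps_per_month: number of small time steps within a month (assumed).
    n0: size of the initial sources.
    """
    n = float(n0)
    for r in r_by_month:
        for _ in range(steps_per_month):
            # Positive i.i.d. noise with finite mean (exponential is an assumption).
            x = rng.expovariate(1.0 / x_mean)
            n *= 1.0 + r * x
    return n


rng = random.Random(42)
monthly_rates = [0.10, 0.04, 0.03, 0.02]  # hypothetical decaying rates

finals = [simulate_views(monthly_rates, 30, 10.0, rng) for _ in range(5000)]
log_finals = [math.log(v) for v in finals]

mean_views = sum(finals) / len(finals)
median_views = sorted(finals)[len(finals) // 2]
```

Because the logarithm of the final count is a sum of many independent terms, the simulated counts are right-skewed (mean above median), while their logarithms are close to normally distributed.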

A. After a scientist has accessed a paper, he/she might spread information from the paper to his/her friends, colleagues or students. The information would then be further spread via a cascade of social interactions. The cumulative number of accesses at time t is pictured by the number of scientists enclosed in the concentric circles. Mathematically, N_{t} = (1 + r_{t} X_{t}) N_{t−1}, where the X_{t}'s are positive, independent and identically distributed random variables with finite mean, and r_{t} is a modulating factor (see main text). B. Modulating factors decay with respect to time. The value at the jth month, r^{(j)}, is normalized by the value at the first month r^{(1)}. The decay of modulating factors is divided into two regimes: a rapid drop from the first month to the second month, and a slow power law decay afterward. The power law regime is best fitted by the function ^{−3}. The residual of the first data point, compared to the fitted curve, is 0.48. This point deviates significantly from the power law regime.

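The simple model is able to explain the observed lognormal distributions. When time steps are small, X_{t} is small and therefore we write N_{t} ≈ N_{0} exp(Σ_{s=1}^{t} r_{s} X_{s}), where N_{0} is the size of initial sources. Taking the logarithm of both sides, we have log N_{t} − log N_{0} ≈ Σ_{s=1}^{t} r_{s} X_{s}. The right-hand side is a sum of independent random variables and, by the central limit theorem, is approximately normally distributed. Thus, if N_{0} is comparably small, the cumulative accesses of a paper could be viewed as a random variable drawn from a lognormal distribution. Furthermore, the average value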

By fitting the data to the stochastic model, we further extracted the parameters r^{(j)}, which are proportional to the rates of information propagation at different stages. While the values from the second month onward are well described by a power law (R^{2} = 0.997), the initial rapid decay deviates significantly from this power law (see caption for details).
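One way such rates and their power law decay could be estimated is sketched below. It rests on an assumption that is not necessarily the authors' procedure: under a multiplicative model, the mean monthly increment of the logarithm of the cumulative views is taken to be proportional to r^{(j)}. The power law is then fitted by ordinary least squares in log-log space over the months after the first, and the reported R^{2} is computed in that log-log space. The toy series are constructed to follow an exact power law with exponent −1.5, a value chosen only for illustration.

```python
import math


def mean_log_increments(cumulative_by_paper):
    """Mean of log N_j - log N_{j-1} over papers, for months j = 2..T.

    Each inner list holds one paper's cumulative views at months 1..T.
    """
    n_months = len(cumulative_by_paper[0])
    result = []
    for j in range(1, n_months):
        increments = [math.log(s[j]) - math.log(s[j - 1])
                      for s in cumulative_by_paper]
        result.append(sum(increments) / len(increments))
    return result


def fit_power_law(months, values):
    """Least-squares fit of values = prefactor * months**exponent in log-log space.

    Returns (exponent, prefactor, r_squared).
    """
    xs = [math.log(m) for m in months]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, math.exp(intercept), 1.0 - ss_res / ss_tot


def toy_series(scale, first_month, n_months=12, exponent=1.5):
    """Cumulative views whose log increments decay exactly as j**(-exponent)."""
    log_n = math.log(first_month)
    series = [first_month]
    for j in range(2, n_months + 1):
        log_n += scale * j ** (-exponent)
        series.append(math.exp(log_n))
    return series


# Hypothetical papers, for illustration only.
papers = [toy_series(0.8, 500.0), toy_series(1.0, 1200.0), toy_series(1.2, 300.0)]

rate_estimates = mean_log_increments(papers)          # months 2..12
months = list(range(2, 2 + len(rate_estimates)))
exponent, prefactor, r_squared = fit_power_law(months, rate_estimates)
```

On real data, the first-month value would be compared against the extrapolated fit to quantify how far the initial regime departs from the power law.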

We have empirically studied the online access statistics in the PLoS ALM dataset. We proposed to use the number of HTML views of a paper to quantify how far the paper has percolated in the scientific community, and explained the time series of HTML views using a simple stochastic model. We found that the rate of information spread decreases as a function of time after publication, and there are two decay regimes: a rapid drop from the first month to the second month after publication, and a slow power law decay afterward (

We have emphasized that a lognormal distribution is a reasonable approximation of the empirical data. Nevertheless, as shown in

The emergence of lognormal distribution via random multiplicative processes has been studied for a long time

We have asked whether the decay patterns are consistent among three types of web usage statistics: the number of HTML views and the numbers of PDF and XML downloads. Interestingly, the number of XML downloads is not consistent with the other two. The fact that XML downloads behave differently from the usage of both HTML and PDF is also apparent in the correlation map shown in

Over the last decades, electronic publications have revolutionized our ways of publishing, leading to many interesting new questions as well as methods in scientometrics. For instance, the relationship between the number of citations acquired by an article and the number of downloads or accesses

The PLoS Article-Level Metric (ALM) dataset was downloaded from the PLoS website (


We thank Mark Patterson for pointing us to the PLoS ALM dataset. We thank Mark Patterson, Richard Cave and Theodora Bloom for helpful comments on some of the analyses described in this manuscript. KKY thanks Rebecca Robilotto for helping with the design of