Beyond Word Frequency: Bursts, Lulls, and Scaling in the Temporal Distributions of Words

Background
Zipf's discovery that word frequency distributions obey a power law established parallels between biological and physical processes, and language, laying the groundwork for a complex systems perspective on human communication. More recent research has also identified scaling regularities in the dynamics underlying the successive occurrences of events, suggesting the possibility of similar findings for language as well.

Methodology/Principal Findings
By considering frequent words in USENET discussion groups and in disparate databases where the language has different levels of formality, here we show that the distributions of distances between successive occurrences of the same word display bursty deviations from a Poisson process and are well characterized by a stretched exponential (Weibull) scaling. The extent of this deviation depends strongly on semantic type (a measure of the logicality of each word) and less strongly on frequency. We develop a generative model of this behavior that fully determines the dynamics of word usage.

Conclusions/Significance
Recurrence patterns of words are well described by a stretched exponential distribution of recurrence times, an empirical scaling that cannot be anticipated from Zipf's law. Because the use of words provides a uniquely precise and powerful lens on human thought and activity, our findings also have implications for other overt manifestations of collective human dynamics.


Introduction
Research on the distribution of time intervals between successive occurrences of events has revealed correspondences between natural phenomena on the one hand [1,2] and social activities on the other hand [3][4][5]. These studies consistently report bursty deviations both from random and from regular temporal distributions of events [6]. Taken together, they suggest the existence of a dynamic counterpart to the universal scaling laws in magnitude and frequency distributions [7][8][9][10][11]. Language, understood as an embodied system of representation and communication [12], is a particularly interesting and promising domain for further exploration, because it both epitomizes social activity, and provides a medium for conceptualizing natural and biological reality.
The fields of statistical natural language processing and psycholinguistics study language from a dynamical point of view. Both treat language processing as encoding and decoding of information. In psycholinguistics, the local likelihood (or predictability) of words is a central focus of current research [13]. Many widely used practical applications of statistical natural language processing, such as document retrieval based on keywords, also exploit dynamic patterns in word statistics [10,14,15]. Particularly important for these applications, and also noticed in different contexts [16][17][18][19][20][21], is the non-uniform distribution of content words through a text, suggesting that connections to the previous discoveries about inter-event distributions may be revealed through a systematic investigation of the recurrence times of different words.
With the rise of the Internet, large records of spontaneous and collective language are now available for scientific inquiry [22][23][24], allowing statistical questions about language to be investigated with unprecedented precision. At the same time, large-scale text mining and document classification are of ever-increasing importance [25]. The primary datasets used in our study are USENET discussion groups available through Google (http://groups.google.com). These exemplify spontaneous linguistic interactions in large communities over a long period of time. We first focus on the $N = 2{,}128$ words that occurred more than 10,000 times between Sept. 1986 and Mar. 2008 in a ($2 \times 10^8$-word) discussion group, talk.origins. The data were collated chronologically, maintaining the thread structure (see Text S1, Databases).
Here, we show that long-time word recurrence patterns follow a stretched exponential distribution, owing to bursts and lulls in word usage. We focus on time scales that exceed the scale of syntactic relations, and the burstiness of the words is driven by their semantics (that is, by what they mean). The burstiness of physical events and socially contextualized choices makes words more bursty than an exponential distribution would predict. However, we show that words are typically less bursty than other human activities [26] due to their logicality or permutability [27,28], technical constructs of formal semantics that index the extent to which the meanings and usage of words are stable over changes in the discourse context. Our quantitative analysis of the empirical data confirms the inverse relationship between burstiness and permutability. The model we develop to explain these observations shares the generative spirit of local (n-gram) and weakly non-local models of text classification and generation [29][30][31]. However, it focuses on long time scales, picking up at temporal scales where studies of local predictability and coherence leave off [13]. We verify the generality of our main findings using different databases, including books of different genres and a series of political debates.

Methods
We are interested in the temporal distribution of each word $w$. All words are enumerated in order of appearance, $i = 1, 2, \ldots, N$, where $i$ plays the role of time along the text. The recurrence time $\tau^w_j = i^w_{j+1} - i^w_j$ is defined by the number of words between two successive uses ($i^w_j$ and $i^w_{j+1}$) of word $w$ (plus one). For instance, the first appearances of the word the in the abstract above are at $i^{the}_1 = 22$, $i^{the}_2 = 41$, $i^{the}_3 = 44$, $i^{the}_4 = 50, \ldots$, leading to a sequence of recurrence times $\tau^{the}_1 = 19$, $\tau^{the}_2 = 3$, $\tau^{the}_3 = 6, \ldots$. We are interested in the distribution $f_w(\tau)$ of $\tau = \tau^w_j$, $j = 1, \ldots, N_w$. The mean recurrence time, called by Zipf the wavelength of the word [7], is given by $\langle\tau\rangle_w = N/N_w \equiv 1/\nu_w$ [2] (hereafter we drop $w$ from our notation). It is mathematically convenient to consider $\tau$ to be a continuous time variable (an assumption that is justified by our interest in $\tau \gg 1$) and to use the cumulative probability density function defined by

$$F(\tau) \equiv \int_{\tau}^{\infty} f(\tau')\, d\tau'.$$

The first point of interest is how the distribution $f(\tau)$ [or $F(\tau)$] deviates from the exponential distribution

$$f_P(\tau) = \mu e^{-\mu\tau}, \tag{1}$$

where $\langle\tau\rangle = 1/\nu$ leads to $\mu = \nu$. The exponential distribution is predicted by a simple bag-of-words model in which the probability $\mu$ of using the word is time independent and equals $\nu$ (a Poisson process with rate $\mu = \nu$) [14,15,19,25,29], as observed if the words in the text are randomly permuted. Deviations are caused by the way that people choose their words in context. Numerous studies, as reviewed in Ref. [32], already demonstrate that language users dynamically modify their use of nouns and noun phrases as a function of the linguistic and external context. We analyze such modifications for all types of words.
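The definition of the recurrence time can be made concrete with a short sketch. The function names are illustrative and do not come from the study, and the input is assumed to be a list of already tokenized words (how the databases were tokenized is not specified in this section). The demonstration reuses the positions 22, 41, 44, 50 quoted above for the word the.

```python
from collections import defaultdict


def recurrence_times(tokens):
    """Map each word to the distances between its successive occurrences.

    Positions are counted from 1, so the position i plays the role of time.
    """
    last_seen = {}
    taus = defaultdict(list)
    for i, word in enumerate(tokens, start=1):
        if word in last_seen:
            taus[word].append(i - last_seen[word])
        last_seen[word] = i
    return dict(taus)


def mean_recurrence_time(tokens, word):
    """Zipf's 'wavelength' of a word: N / N_w, the inverse of its frequency."""
    return len(tokens) / tokens.count(word)


if __name__ == "__main__":
    # Hypothetical 50-word text with 'the' at positions 22, 41, 44 and 50.
    tokens = ["x"] * 50
    for position in (22, 41, 44, 50):
        tokens[position - 1] = "the"
    print(recurrence_times(tokens)["the"])  # [19, 3, 6]
```

Randomly permuting `tokens` before calling `recurrence_times` would give the bag-of-words (Poisson) baseline against which deviations are measured.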
Results and Discussion
Figure 1 shows the empirical results obtained for the example words theory and also in the talk.origins group of the USENET database. Both words have $\langle\tau\rangle \approx 820$ but are linguistically quite different: while theory is a common noun, also is an adverb that functions semantically as an operator. The deviation from the Poisson prediction (1) is apparent in Fig. 1(a-c): $f(\tau)$ is larger than the exponential distribution for distances $\tau$ both much shorter and much longer than $\langle\tau\rangle$, while it is smaller for $\tau \approx \langle\tau\rangle$. Both words exhibit a most probable recurrence time $\tau^* < 20$ and a monotonically decaying distribution $f(\tau)$ for larger times [Fig. 1(a)]. Comparing the insets in Fig. 1(b), one sees that the occurrences of theory are clustered close to each other in a phenomenon known as burstiness [6,14,15,19,21]. Due to burstiness, the frequency of the word theory estimated from a small sample would differ a great deal as a function of exactly where the sample was drawn. Similar but lesser deviations are observed for the word also.

Figure 1. Recurrence time distributions for the words theory (red) and also (blue) in the USENET group talk.origins, a discussion group about evolution and creationism. Both words have a mean recurrence time of $\langle\tau\rangle \approx 820$. (a) Linear-logarithmic representation of $f(\tau)$, showing that the decay is slower than the exponential $\beta = 1$ prediction (1) (black dashed line) and follows closely the stretched exponential distribution (2) with $\beta = 0.46$ ($R^2 = 0.9984$) for theory and $\beta = 0.85$ ($R^2 = 0.9999$) for also. For comparison, $\beta = 1$ yields $R^2 = 0.49$ for the word theory and $R^2 = 0.9904$ for the word also (see Text S1, Fitting Procedures). The inset in (a) shows a magnification for short times. A word-dependent peak at $\tau < 50$ reflects the domination of syntactic effects and local discourse structure at this scale.
Central to our discussion, Fig. 1 shows that the distributions of both words can be well described by the single free parameter $\beta$ of the stretched exponential distribution

$$f_\beta(\tau) = a\beta\tau^{\beta-1} e^{-a\tau^{\beta}}, \qquad a = a_\beta \equiv \left[\nu\,\Gamma\!\left(\frac{\beta+1}{\beta}\right)\right]^{\beta}, \tag{2}$$

where $\Gamma$ is the Gamma function, and $0 < \beta \leq 1$. The value of $a$ is fixed by the constraint $\langle\tau\rangle = 1/\nu$, which leaves $\beta$ as the only free parameter. Distribution (2), also known as the Weibull distribution, and similar stretched exponential distributions describe a variety of phenomena [6,23,[33][34][35], including the recurrence time between extreme events in time series with long-term correlations [2,36]. The stretched exponential (2) is more skewed than the simple exponential distribution (1), which corresponds to the limiting case $\beta = 1$, but less skewed than a power law, which is approached for $\beta \to 0$. Because the cumulative distribution of (2) is $F(\tau) = e^{-a\tau^{\beta}}$, the stretched exponential appears as a straight line of slope $\beta$ when $-\ln F(\tau)$ is represented as a function of $\tau$ in a double logarithmic plot [2]. The straight line behavior for almost three decades shown in Fig. 1(b), which is illustrative of the words in our datasets, provides strong evidence for the stretched exponential scaling (spam-related deviations for long $\tau$ are discussed in Text S1, Databases). This is a clear advance over the closest precedents to our results: (i) In Ref. [8] Zipf proposed a power-law decay, which would appear as a horizontal line in Fig. 1(b); (ii) Refs. [14,15] compare two non-stationary Poisson processes for predicting the counts of words in documents (see Text S1, Counting Distribution); (iii) Ref. [19] proposes a nonhomogeneous Poisson process for recurrence times, using a mixture of two exponentials with a total of four free parameters; (iv) Ref. [37] uses the Zipf-Alekseev distribution $f(\tau) \sim \tau^{-a-b\ln(\tau)}$, which we found to underestimate the decay rate for large $\tau$ and to leave larger residuals than our fittings (see Text S1, Zipf-Alekseev Distribution). The stretched exponential distribution was found to describe the time between usages of words in blogs and RSS feeds in Ref. [24]. However, in that study time was measured as actual time and the same distribution was found for different types of words, suggesting that those observations are driven by the bursty update of webpages, a related but different effect. More strongly related to our study is Ref. [5]'s analysis of email activity, in which a nonhomogeneous Poisson process captures the way one email can trigger the next.
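The straight-line signature of a stretched exponential in a double logarithmic plot suggests one simple way to estimate the exponent from a list of recurrence times. The sketch below is an assumption-laden illustration and not the fitting procedure of the study (which is described in Text S1): it regresses ln(-ln F) on ln(tau) by ordinary least squares, uses the plotting position 1 - k/(n+1) for the empirical survival function, and leaves both the slope and the intercept free, whereas the text fixes the prefactor through the word frequency. All names are hypothetical.

```python
import math
import random


def fit_stretched_exponential(taus):
    """Estimate (beta, a) in F(tau) = exp(-a * tau**beta).

    Uses ln(-ln F) = ln(a) + beta * ln(tau), so beta is the slope of a
    least-squares line through the empirical survival function.
    """
    data = sorted(t for t in taus if t > 0)
    n = len(data)
    xs = [math.log(t) for t in data]
    # Survival probability assigned to the k-th smallest value; it stays
    # strictly between 0 and 1, so both logarithms are defined.
    ys = [math.log(-math.log(1.0 - k / (n + 1.0))) for k in range(1, n + 1)]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    a = math.exp(mean_y - beta * mean_x)
    return beta, a


if __name__ == "__main__":
    # Synthetic recurrence times with a known shape parameter of 0.5.
    rng = random.Random(0)
    synthetic = [rng.weibullvariate(100.0, 0.5) for _ in range(20000)]
    print(fit_stretched_exponential(synthetic))
```

On real word data the short-time regime dominated by syntax would normally be excluded before fitting; that cutoff is left to the reader.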

Generative Model
Motivated by the successful description of the stretched exponential distribution (2), we search for a generative stochastic process that can model word usage. We consider the inverse frequency $\langle\tau\rangle$ as given and focus on describing how the words are distributed throughout the text. We assume that our text (abstractly regarded as arbitrarily long) is generated by a well-defined stationary stochastic process with finite $\langle\tau\rangle$ for the words of interest. We further assume that the probability $\mu(\tau)$ of using the word $w$ depends only on the distance $\tau$ since the last occurrence of the word. The latter means that we are modeling the word usage as a renewal process [34,36]. The distribution of recurrence times is then given by the (joint) probability of having the word at distance $\tau$ and not having this word for $t < \tau$:

$$f(\tau) = \mu(\tau)\, \exp\!\left(-\int_0^{\tau} \mu(t)\, dt\right). \tag{3}$$

The cumulative distribution function is written as

$$F(\tau) = \exp\!\left(-\int_0^{\tau} \mu(t)\, dt\right).$$

The time dependent probability $\mu(\tau)$, also known as the hazard function, can be obtained empirically as $\mu(\tau) = f(\tau)/F(\tau)$ (see Text S1, Hazard Function). Equation (3) reduces to the exponential distribution (1) for a time independent probability $\mu(\tau) = \mu = 1/\langle\tau\rangle$. The stretched exponential distribution (2) is obtained from (3) by asserting that [34,36,38]

$$\mu(\tau) = a\beta\tau^{\beta-1}. \tag{4}$$

This assertion means that in our model, the probability of using a word decays as a power law since the last use of that word. This is further justified by the power-law behavior of $\mu(\tau)$ determined directly from the empirical data, as shown in Fig. 1(c) and Text S1, Fig. 9, and is in agreement with results from mathematical psychology [39,40] and information retrieval [40]. The Weibull renewal process we propose can be analyzed formally as a particular instance of a doubly stochastic Poisson process [41].
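As a consistency check, the step from a power-law hazard function to the stretched exponential can be written out explicitly. The derivation below is a standard renewal-theory calculation added for the reader's convenience; it assumes the continuous-time form of the renewal relation, with hazard $\mu(\tau) = a\beta\tau^{\beta-1}$ and $0 < \beta \leq 1$.

$$\int_0^{\tau} \mu(t)\, dt = \int_0^{\tau} a\beta t^{\beta-1}\, dt = a\tau^{\beta}
\quad\Longrightarrow\quad
F(\tau) = e^{-a\tau^{\beta}}, \qquad
f(\tau) = -\frac{dF}{d\tau} = a\beta\tau^{\beta-1} e^{-a\tau^{\beta}}.$$

The prefactor $a$ then follows from requiring the mean recurrence time to equal the inverse frequency $1/\nu$. Using the substitution $u = a\tau^{\beta}$,

$$\langle\tau\rangle = \int_0^{\infty} F(\tau)\, d\tau = \frac{a^{-1/\beta}}{\beta}\int_0^{\infty} u^{1/\beta - 1} e^{-u}\, du = a^{-1/\beta}\, \Gamma\!\left(\frac{\beta+1}{\beta}\right) = \frac{1}{\nu}
\quad\Longrightarrow\quad
a = \left[\nu\, \Gamma\!\left(\frac{\beta+1}{\beta}\right)\right]^{\beta}.$$

For $\beta = 1$ this gives $a = \nu$, recovering the exponential (Poisson) case.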
Our model is illustrated in Fig. 1(d) and can be interpreted as a bag-of-words with memory that accounts for the burstiness of word usage. This model does not reproduce the positive correlations between $\tau_j$ and $\tau_{j+p}$ [2,6,20], which are usually small (less than 20% for $p = 1$) but decay slowly with $p$ (see Text S1, Correlation in $\{\tau_j\}$). These correlations quantify the extent to which the renewal model is a good approximation of the actual generative process, and show that the burstiness of words exists not only as a departure of $f(\tau)$ from the exponential distribution, but also as a clustering of small (large) $\tau$ [6] (see Text S1, Independence of $\{\tau_j\}$). The advantage of the renewal description is that the model (i) can be connected to a vast literature describing power-law decay of memory in agreement with Eq. (4), see Refs. [39,40] and references therein, and (ii) fully determines the dynamics (allowing, e.g., the precise derivation of counting distributions [38], which are used in applications to document classification [14,15] and information retrieval [40]).
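A minimal simulation of such a Weibull renewal process is sketched below, under stated assumptions. Recurrence times are drawn independently by inverse-transform sampling from a survival function of the form exp(-a * tau**beta), with the prefactor a chosen so that the mean recurrence time equals the inverse frequency. Time is treated as continuous rather than as integer word positions, and the function names are hypothetical; this is an illustration of the renewal idea, not the authors' implementation.

```python
import math
import random


def weibull_prefactor(nu, beta):
    """a = [nu * Gamma((beta + 1) / beta)] ** beta, so that <tau> = 1 / nu."""
    return (nu * math.gamma((beta + 1.0) / beta)) ** beta


def sample_recurrence_times(nu, beta, n, seed=0):
    """Draw n independent recurrence times with F(tau) = exp(-a * tau**beta)."""
    rng = random.Random(seed)
    a = weibull_prefactor(nu, beta)
    # Solve u = exp(-a * tau**beta) for tau, with u uniform on (0, 1].
    return [(-math.log(1.0 - rng.random()) / a) ** (1.0 / beta) for _ in range(n)]


def occurrence_positions(taus):
    """Cumulative sums of the recurrence times: where the word is used."""
    positions, current = [], 0.0
    for tau in taus:
        current += tau
        positions.append(current)
    return positions


if __name__ == "__main__":
    # Parameters loosely modeled on the word 'theory' (<tau> near 820, beta = 0.46).
    taus = sample_recurrence_times(nu=1.0 / 820.0, beta=0.46, n=100000)
    print(sum(taus) / len(taus))  # sample mean, expected to be close to 820
```

Because each recurrence time is drawn independently, this sketch shares the limitation noted above: it produces no correlations between successive recurrence times.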

Word Dependence
We have seen in Fig. 1 that the word-dependent deviation from the exponential distribution is encapsulated in the parameter $\beta$: the smaller the $\beta$ for any given word, the larger the deviation (see Text S1, Deviation from the Exponential Distribution). Next we investigate the dominant effects that determine the value of the parameter $\beta$ of a word. Previous research has observed that frequent function words (such as conjunctions and determiners) usually are closer to the random (Poisson) prediction while less frequent content words (particularly names and common nouns) are more bursty. These observations were quantified using: (i) an entropic analysis of texts [16]; (ii) the variance of the sequence of recurrence times [17]; (iii) the recurrence time distribution [19,42]; and (iv) the related distribution of the number of occurrences of words per document [14,15]. Because we have a large database and do not bin the datastream into documents, we are able to go beyond these insightful works and systematically examine frequency and linguistic status as factors in word burstiness.
Our large database allows a detailed analysis of words that, despite being in the same frequency range, have very different statistical behavior. For instance, in the range $2{,}000 < \langle\tau\rangle < 3{,}000$, words with high $\beta$ ($\approx 0.80$) include once, certainly, instead, yet, give, try, makes, and seem; the few words with $\beta \lesssim 0.40$ include design, selection, intelligent, and Wilkins. Corroborating Ref. [14], it is evident that words with low $\beta$ better characterize the discourse topic. However, these examples also show that the distinction between function words and content words cannot be explanatory. For instance, many content words, such as the adverbs and verbs of mental representation in the list just above, have $\beta$ values as high as many function words. Here we obtain a deeper level of explanation by drawing on tools from formal semantics, specifically on type theory [27,43,44], and on dynamic theories of semantics [45,46], which model how words and sentences update the discourse context over time. We use semantics rather than syntax because syntax governs how words are combined into sentences, and we are interested in much longer time scales over which syntactic relations are not defined. Type theory establishes a scale from simple entities (e.g., proper nouns) to high type words (e.g., words that cannot be described using first-order logic, including intensional expressions and operators). Simplifying the technical literature in the interests of good sample sizes and coding reliability, we define a ladder of four semantic classes, as listed in Table 1.
In Fig. 2, we report our systematic analysis of the recurrence time distribution of all 2,128 words that appeared more than ten thousand times in our database (for word-specific results see Table S1). We find a wide range of values for the burstiness parameter $\beta$ [$0.2 < \beta < 0.9$, Fig. 2(a,b)] and the stretched exponential distribution describes most of the words well [median $R^2 = 0.993$, Fig. 2(c)]. The Class-specific results displayed in Fig. 2(a-c) show that words of all classes are accurately described by the same statistical model over a wide range of scales, a strong indication of a universal process governing word usage at these scales. Figure 2(b) also reveals a systematic dependence of $\beta$ on the semantic Classes: burstiness increases ($\beta$ decreases) with decreasing semantic Class. This relation implies that words functioning unambiguously as Class 3 verbs should be less bursty than words of the same frequency functioning unambiguously as common nouns (Class 2). This prediction is confirmed by a paired comparison in our database: such verbs have a higher $\beta$ in 103 out of 116 pairs of verbs and frequency-matched nouns (sign test, $P \leq 8 \times 10^{-19}$). The relation applies even to morphologically related forms of the same word stem (see Text S1, Lemmatization): for 37 out of the 47 pairs of Class 3 adjectives and Class 4 adverbs in the database that are derived with -ly, such as perfect, perfectly, the adverbial form has a higher $\beta$ than the adjective form (sign test, $P \leq 5 \times 10^{-5}$). Figure 2(d) shows the dependence of $\beta$ on inverse frequency $\langle\tau\rangle$. This figure may be compared to the TF-IDF (term frequency-inverse document frequency) method used for keyword identification [14], but it is computed from a single document (see also Refs. [16][17][18]). Figure 2(d) reveals that $\beta$ is correlated with $\langle\tau\rangle$ and that the Class ordering observed in Fig. 2(b) is valid at all values of $\langle\tau\rangle$. The detailed analysis in Fig. 2(e) demonstrates that semantic Class is more important than frequency as a predictor of burstiness (Class accounts for 0.32 and log-frequency for 0.26 of the variance of $\beta$, by the test proposed in Ref. [47]).
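The sign tests quoted above can be reproduced with exact integer arithmetic. The sketch below assumes a one-sided test against the null hypothesis that either member of a pair is equally likely to have the higher exponent; whether the reported values are one-sided is an assumption, although the one-sided tail probabilities computed this way fall just under both reported bounds. The function name is hypothetical.

```python
from math import comb


def sign_test_one_sided(successes, n):
    """Exact P(X >= successes) for X ~ Binomial(n, 1/2)."""
    tail = sum(comb(n, k) for k in range(successes, n + 1))
    # Integer division to float is exact up to rounding, even for large n.
    return tail / 2 ** n


if __name__ == "__main__":
    # Verbs vs. frequency-matched nouns: 103 of 116 pairs.
    print(sign_test_one_sided(103, 116))
    # Adverbs in -ly vs. their adjectives: 37 of 47 pairs.
    print(sign_test_one_sided(37, 47))
```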
We are now in a position to discuss why burstiness depends on semantic Class. A straw man theory would seek to derive the burstiness of referring expressions directly from the burstiness of their referents. The limitations of such a theory are obvious: oxygen is a very bursty word in our database ($\beta \approx 0.25$) though oxygen itself is ubiquitous. A more careful observer would connect the burstiness of words to the human decisions to perform activities related to the words. For instance, the recurrence time between sending emails is known to approximately follow a power law [3,5]. However, in our database the word email is significantly closer to the exponential ($\beta \approx 0.5$) than a power law would predict ($\beta \to 0$). Indeed, a defining characteristic of human language is the ability to refer to entities and events that are not present in the immediate reality [48]. These nontrivial connections between language and the world are investigated in semantics. An insight into the problem of word usage can be obtained from Ref. [27], which establishes that the meaning and applicability of words with great logicality remain invariant under permutations of alternatives for the entities and relations specified in the constructions in which they appear. Here we consider permutability to be proportional to the semantic Classes of Table 1. As a long discourse unfolds exploring different constructions, we expect words with higher permutability (higher semantic Class) to be more homogeneously distributed throughout the discourse and therefore have higher $\beta$ (be less bursty). Critical to this explanation is the fact that human language manipulates representations of abstract operators and mental states [49]. However, the overt statistics of recurrence times do not need to be learned word by word. It seems more likely that they are an epiphenomenal result of the differential contextualization of word meanings. The fact that the behavior of almost all words deviates from a Poisson process to at least some extent indicates that the permutability and usage of almost all words are contextually restricted to some degree, whether by their intrinsic meaning or by their social connotations.

Different Databases
In Fig. 3 we verify our main results using databases of different sizes and characterized by different levels of formality. We analyzed a second example of a USENET group (U), a series of political debates (D), two novels (S,W), and a technical book (P) (for word-specific results see Table S1). The stretched exponential provides a close fit for frequent words in these datasets [Fig. 3(a,c)], and a wide and smoothly varying range of $\beta$ values is observed in each case [Fig. 3(b)]. The technical book exhibits lower $\beta$ values, which can be attributed to the predominance of specific scientific terms. These datasets include examples of texts differing by almost four orders of magnitude in size, generated by a single author (books), a few authors (debates) or a large number of authors (USENET), in writing and speech (e.g., books vs. debates), and in different languages (e.g., novels), indicating that the stretched exponential scaling is robust with regard to sample size, number of authors, language mode, and language.

Table 1. The primitive types are entities $e$, exemplified by proper nouns such as Darwin (Class 1), and truth values, $t$ (which are the values of sentences). Predicates or relations, such as the simple verb die, and the adjective/noun blue, take entities as arguments and map them to sentences (e.g., Darwin dies, Tahoe is blue). They are classified as $\langle e,t\rangle$ (Class 2). The notation $\langle x,y\rangle$ denotes a mapping from an element $x$ in the domain to the image $y$ [43,44]. The semantic types of higher Classes are established by assessing what mappings they perform when they are instantiated. For example, everyone is of type $\langle\langle e,t\rangle,t\rangle$ (Class 3), because it is a mapping from sets of properties of entities to truth values [44]; the verb believe shares this classification as a verb involving mental representation. The adverb supposedly is a higher order operator (Class 4), because it modifies other modifiers. Following Ref. [44] (contra Ref. [43]) words are coded by the lowest type in which they commonly occur (see Text S1, Coding of Semantic Types). doi:10.1371/journal.pone.0007678.t001

Figure 2. (c) Quality of fit quantified in terms of the coefficient of determination $R^2$ between the fitted stretched exponential and the empirical $F(\tau)$ (see Text S1, Quality of Fit). The boxplots are centered at the median and indicate the 1st, 2nd, 6th, and 7th octiles. For comparison, an exponential fit with two free parameters yields median $R^2 = 0.907$ (see Text S1, Deviation from the Exponential Distribution). (d) Relative dependence of $\beta$ on Class and $\langle\tau\rangle = 1/\nu$ (inverse frequency), indicating: running median on words ordered according to $\langle\tau\rangle$ (center black line) and 1st and 7th octiles (boundaries of the gray region); and running medians on words by Class (colored lines, Class 1-4, from bottom to top) with illustrative words for each Class. At each $\langle\tau\rangle$, large variability in $\beta$ and a systematic ordering by Class are observed. (e) Box-plots of the variation of $\beta$ for words in a given Class. The box-plots in the background are obtained using frequency to divide all words into four groups with the same number of words as the semantic Classes (the first box-plot has the words with lowest frequency and the last box-plot has the words with highest frequency). The classification based on Classes leads to a narrower distribution of $\beta$ values inside each Class and to a better discrimination between Classes. doi:10.1371/journal.pone.0007678.g002

Conclusions
The quest for statistical laws in language has been driven both by applications in text mining and document retrieval, and by the desire for foundational understanding of humans as agents and participants in the world. Taking texts as examples of extended discourse, we combined these research agendas by showing that word meanings are directly related to their recurrence distributions via the permutability of concepts across discourse contexts. Our model for generating long-term recurrence patterns of words, a bag-of-words model with memory, is stationary and uniformly applicable to words of all parts of speech and semantic types. A word's position along the range of the memory parameter in the model, $\beta$, effectively captures its position in between a power-law and an exponential distribution, thus capturing its degree of contextual anchoring. Our results agree with Ref. [49] in emphasizing both the specific ability to learn abstract operators and the broader conceptual-intentional system as components in the human capability for language and in its use in the flow of discourse.
Analogies between communicative dynamics and social dynamics more generally are suggested by the recent documentation of heavy-tailed distributions in many other human-driven activities [3,5,26]. They indicate that tracing linguistic activities in the ever larger digital databases of human communications can be a most promising tool for tracing human and social dynamics [22]. The stretched exponential form for recurrence distributions that derives from our model and the empirical finding it embodies are thus expected to also find applicability in other areas of human endeavor.