Zipf's Law Leads to Heaps' Law: Analyzing Their Relation in Finite-Size Systems

Background Zipf's law and Heaps' law are observed in disparate complex systems. Of particular interests, these two laws often appear together. Many theoretical models and analyses are performed to understand their co-occurrence in real systems, but it still lacks a clear picture about their relation. Methodology/Principal Findings We show that the Heaps' law can be considered as a derivative phenomenon if the system obeys the Zipf's law. Furthermore, we refine the known approximate solution of the Heaps' exponent provided the Zipf's exponent. We show that the approximate solution is indeed an asymptotic solution for infinite systems, while in the finite-size system the Heaps' exponent is sensitive to the system size. Extensive empirical analysis on tens of disparate systems demonstrates that our refined results can better capture the relation between the Zipf's and Heaps' exponents. Conclusions/Significance The present analysis provides a clear picture about the relation between the Zipf's law and Heaps' law without the help of any specific stochastic model, namely the Heaps' law is indeed a derivative phenomenon from the Zipf's law. The presented numerical method gives considerably better estimation of the Heaps' exponent given the Zipf's exponent and the system size. Our analysis provides some insights and implications of real complex systems. For example, one can naturally obtained a better explanation of the accelerated growth of scale-free networks.


Introduction
Giant strides in Complexity Sciences have been the direct outcome of efforts to uncover the universal laws that govern disparate systems. Zipf's law [1] and Heaps' law [2] are two representative examples. In 1940s, Zipf found a certain scaling law in the distribution of the word frequencies. Ranking all the words in descending order of occurrence frequency and denoting by z r ð Þ the frequency of the word with rank r, the Zipf's law reads z r ð Þ~z max : r {a , where z max is the maximal frequency and a is the so-called Zipf's exponent. This power-law frequency-rank relation indicates a power-law probability distribution of the frequency itself, say p z ð Þ*z {b with b equal to 1z1=a (see Materials and Methods). As a signature of complex systems, the Zipf's law is observed everywhere [3]: these include the distributions of firm sizes [4], wealths and incomes [5], paper citations [6], gene expressions [7], sizes of blackouts [8], family names [9], city sizes [10], personal donations [11], chess openings [12], traffic loads caused by YouTube videos [13], and so on. Accordingly, many mechanisms are put forward to explain the emergence of the Zipf's law [14,15], such as the rich gets richer [16,17], the self-organized criticality [18], Markov Processes [19], aggregation of interacting individuals [20], optimization designs [21] and the least effort principle [22]. To name just a few.
Heaps' law [2] can also be applied in characterizing natural language processing, according to which the vocabulary size grows in a sublinear function with document size, say N t ð Þ*t l with lv1, where t denotes the total number of words and N t ð Þ is the number of distinct words. One ingredient causing such a sublinear growth may be the memory and bursty nature of human language [23][24][25]. A particular interesting phenomenon is the coexistence of the Zipf's law and Heaps' law. Gelbukh and Sidorov [26] observed these two laws in English, Russian and Spanish texts, with different exponents depending on languages. Similar results were recently reported for the corpus of web texts [27], including the Industry Sector database, the Open Directory and the English Wikipedia. Besides the statistical regularities of text, the occurrences of tags for online resources [28,29], keywords for scientific publications [30], words contained by web pages resulted from web searching [31], and identifiers in modern Java, C++ and C programs [32] also simultaneously display the Zipf's law and Heaps' law. Benz et al. [33] reported the Zipf's law of the distribution of the features of small organic molecules, together with the Heaps' law about the number of unique features. In particular, the Zipf's law and Heaps' law are closely related to the evolving networks. It is wellknown that some networks grow in an accelerating manner [34,35] and have scale-free structures (see for example the WWW [36] and Internet [37]), in fact, the former property corresponds to the Heaps' law that the number of nodes grows in a sublinear form with the total degree of nodes, while the latter is equivalent to the Zipf's law for degree distribution.
Baeza-Yates and Navarro [38] showed that the two laws are related: when aw1, it can be derived that if both the Zipf's law and Heaps' law hold, l~1 a . By using a more sophisticated approach, Leijenhorst and Weide [39] generalized this result from the Zipf's law to the Mandelbrot's law [40] where z r ð Þ* r c zr ð Þ {a and r c is a constant. Based on a variant of the Simon model [16], Montemurro and Zanette [41,42] showed that the Zipf's law is a result from the Heaps' law with a depending on l and the modeling parameter. Also based on a stochastic model, Serrano et al. [27] claimed that the Zipf's law can result in the Heaps' law when aw1, and the Heaps' exponent is l~1 a . In this paper, we prove that for an evolving system with a stable Zipf's exponent, the Heaps' law can be directly derived from the Zipf's law without the help of any specific stochastic model. The relation l~1 a is only an asymptotic solution hold for very-large-size systems with aw1. We will refine this result for finite-size systems with a * > 1 and complement it with av1. In particular, we analyze the effects of system size on the Heaps' exponent, which are completely ignored in the literature. Extensive empirical analysis on tens of disparate systems ranging from keyword occurrences in scientific journals to spreading patterns of the novel virus influenza A (H1N1) has demonstrated that the refined results presented here can better capture the relation between Zipf's and Heaps' exponents. In particular, our results agree well with the evolving regularities of the accelerating networks and suggest that the accelerating growth is necessary to keep a stable power-law degree distribution.
Whereas the majority of studies on the Heaps' law are limited in linguistics, our work opens up the door to a much wider horizon that includes many complex systems.

Analytical Results
For simplicity of depiction, we use the language of word statistics in text, where z r ð Þ denotes the frequency of the word with rank r. However, the results are not limited to language systems. Note that r{1 ð Þ is the very number of distinct words with frequency larger than z r ð Þ. Denoting by t the total number of word occurrences (i.e., size of the text) and N t ð Þ the corresponding number of distinct words, then According to the Zipf's law z r ð Þ~z max : r {a and the relation between the Zipf's and power-law exponents b~1z 1 a , the right part of Eq. 2 can be expressed in term of z max and a, as Combine Eq. 1 and Eq. 3, we can obtain the estimation of z max , as Obviously, the text size t is the sum of all words' occurrences, say Notice that the summation The relative error of this approximation, for P { Ð À Á P , increases with the increasing of a and decreases with the increasing of N (see Figure S1 the numerical results on the sensitivity of relative errors to parameters a and N). Substituting z max by Eq. 4, it arrives to the relation between N t ð Þ and t: The direct comparison between the empirical observation and Eq. 6, as well as an improved version of Eq. 6, is shown in Materials and Methods. Clearly, Eq. 6 is not a simply power-law form as described by the Heaps' law. We will see that the Heaps' law is an approximate result that can be derived from Eq. 6. Actually, when a is considerably larger than 1, N t ð Þ 1{a %1 and N t ð Þ& a{1 ð Þ 1=a t 1=a ; while if a is considerably smaller than 1, N t ð Þ 1{a &1 and N t ð Þ& 1{a ð Þt. This approximated result can be summarized as which is in accordance with the previous analytical results [29,38,39] for aw1 and has complemented the case for av1. Although Eq. 6 is different from a strict power law, numerical results indicate that the relationship between N t ð Þ and t can be well fitted by the power-law functions (the fitting is usually much better than the empirical observations about the Heaps' law, see Materials and Methods for some typical examples). In Fig. 1, we report the numerical results with fixed total number of word occurrences t~10 5 . When a is considerably larger or smaller than 1, the numerical results agree well with the known analytical solution in Eq. 7, however, a clear deviation is observed for a&1 (see Materials and Methods about how to get the numerical results for a~1).
To validate the numerical results of Eq. 6, we propose a stochastic model. Given the total number of word occurrences t, clearly, there are at most t distinct words having the chance to appear. The initial occurrence number of each of these t words is set as zero. At each time step, these t words are sorted in descending order of their occurrence number (words with the same number of occurrences are randomly ordered), and the probability a word with rank r will occur in this time step is proportional to r {a . The whole process stops after t time steps. The distribution of word occurrence always obeys the Zipf's law with a stable exponent a, and the growth of N t ð Þ approximately follows the Heaps' law with l dependent on a (see Figure S2 for the simulation results of the stochastic model). The simulation results about l vs. a of this model are also reported in Fig. 1, which agree perfectly with the numerical ones by Eq. 6. The result of the stochastic model strongly supports the validity of Eq. 6, and thus we only discuss the numerical results of Eq. 6. In addition to a, the Heaps' exponent l also depends on the system size, namely the total number of word occurrences, t. An example for a~1 is shown in Fig. 2, and how l varies in the a,t ð Þ plane is shown in Fig. 3 (see Figure S3 for the comparison of fitting functions and four typical examples of numerical results). It is seen that the exponent l increases monotonously as the increasing of t. According to Eq. 6, it is obvious that in the large limit of system size, t??, the exponent l can be determined by the asymptotic solution Eq. 7. Actually, the asymptotic solution well describes the systems with a&1 or a%1 or t??. However, real systems are often with a around 1 and of finite sizes. As indicated by Fig. 2 and Fig. 3, the growth of l versus t is really slow. For example, when a~1, for most real systems with t scaling from 10 4 to 10 8 , the exponent l is considerably smaller than the asymptotic solution l~1. Even for very large t that is probably larger than any studied real systems, like t~10 16 , the difference between numerical result and asymptotic solution can be observed. As we will show in the next section, this paper emphasizes the difference between empirical observations and the asymptotic solution, and the simple numerical method based on Eq. 6 provides a more accurate estimation.

Experimental Results
We analyze a number of real systems ranging from small-scale system containing only 40 distinct elements to large-scale system consisting of more than 10 5 distinct elements. The results are listed in Table 1 while the detailed data description is provided in Materials and Methods. Four classes of real systems are considered, including the occurrences of words in different books and different languages (data sets Nos. 1-9), the occurrences of keywords in different journals (data sets Nos. 10-33), the confirmed cases of the novel virus influenza A (data set No. 34), and the citation record of PNAS articles (data set No. 35). Figure 4 reports the Zipf's law and Heaps' law of the four typical examples, each of which belongs to one class, respectively. Figure S4 in the Supporting Information displays the probability density function p z ð Þ, the Zipf's plot z r ð Þ and the Heaps' plot N t ð Þ for all the 35 data sets with the same order as shown in Table 1. To sum up, the empirical results indicate that (i) evolving systems displaying the Zipf's law also obey the Heaps' law even for small-scale systems; (ii) the asymptotic solution (Eq. 7) can well capture the relationship between the Zipf's exponent and Heaps' exponent, and the present numerical result based on Eq. 6 can provide considerably better estimations (the numerical results based on Eq. 6 outperforms Eq. 7 in 34, out of 35, tested date sets).

Discussion
Zipf's law and Heaps' law are well known in the context of complex systems. They were discovered independently and treated as two independent statistical laws for decades. Recently, the increasing evidence on the coexistence of these two laws leads to serious consideration of their relation. However, a clear picture cannot be extracted out from the literature. For example, Montemurro and Zanette [41,42] suggested that the Zipf's law is a result from the Heaps' law while Serrano et al. [27] claimed that the Zipf's law can result in the Heaps' law. In addition, many previous analyses about their relation are based on some stochastic models, and the results are strongly dependent on the corresponding models -we are thus less confident of their applicability in explaining the coexistence of the two laws observed almost everywhere.
In this article, without the help of any specific stochastic model, we directly show that the Heaps' law can be considered as a derivative phenomenon given that the evolving system obeys the Zipf's law with a stable exponent. In contrast, the Zipf's law can not be derived from the Heaps' law without the help of a specific model or some external conditions. In a word, our analysis indicates that the Zipf's law is more fundamental than the Heaps' law in the systems where two laws coexist, which provides a new perspective on the origin of the Heaps' law. For example, the observed Heaps' law in natural language processing was attributed to the bursty nature and memory effect of human language [23][24][25], while Serrano, Flammini and Menczer [27] recently showed that the word occurrences in English Wikipedia also display the Heaps' law. Since the English Wikipedia is attributed by many independent editors, the memory effect is obviously not a proper interpretation. Our analysis suggests that the observed Heaps' law may be just an accompanying phenomenon of a more fundamental law -the Zipf's law. However, one can not conclude that the Heaps' law is completely dependent on the Zipf's law since there may exists some mechanisms only resulting in the Heaps' law, namely it is possible that a system displays the Heaps' law while does not obey the Zipf's law. In addition, we refine the known asymptotic solution (Eq. 7) by a more complex formula (Eq. 6), which is considerably more accurate than the asymptotic solution, as demonstrated by both the testing stochastic model and the extensive empirical analysis. In particular, our investigation about the effect of system size fills the gap in the relevant theoretical analyses.
Our analytical result (Eq. 6) indicates that the growth of vocabulary of an evolving system cannot be exactly described by the Heaps' law even though the system obeys a perfect Zipf's law with a constant exponent. In fact, not only the solution of the Heaps' exponent (Eq. 7), but also the Heaps' law itself is an asymptotic approximation obtained by considering infinite-size systems. More terribly, a Zipf's exponent larger than one does not correspond to a true distribution p z ð Þ since SzT will diverge as the increasing of the system size, yet a large fraction of real systems can be well characterized by the Zipf's law with aw1 (see general examples in Refs. [3,15] and examples of degree distributions of complex networks in Refs. [46,47]). Putting the blemish in mathematical strictness behind, the Zipf's law and Heaps' law well capture the macroscopic statistics of many complex systems, and our analysis provides a clear picture of their relation.
Note that, our analysis depends on an ideal assumption of a ''perfect'' power law (Zipf's law) of frequency distribution, while a real system never displays such a perfect law. Indeed, deviations from a power law have been observed, but the assumption of a perfect power-law distribution is widely used in many theoretical analyses. For example, the degree distribution in email networks [48] has a cutoff at about z~100 and the one in sexual contact networks [49] displays a drooping head, while in the analysis of epidemic dynamics, the underlying networks are usually supposed to be perfect scale-free networks [50]. Another example is the study on the effects of human dynamics on epidemic spreading [51,52], where the interevent time distribution of human actions are supposed as a power-law distribution, ignoring the observed cutoffs and periodic oscillations [53,54]. In a word, although the ideal assumption of a perfect power-law distribution could not fully reflect the reality, the corresponding analysis indeed contributes much to our understanding of many phenomena.
We also tested the power-law distribution with exponential cutoff, as p z ð Þ*z {b exp {z=z c ð Þ , where z c is a free parameter controlling the cutoff effect. According to the stochastic model (we first generate the rank-based distribution z r ð Þ corresponding to the probability density function p z ð Þ, and then generate the relation N t ð Þ versus t by using the stochastic model), when the cutoff effect gets enhanced (by decreasing z c ), the Heaps' exponent l will increase (see a typical example for b~2 from Figure S5 in the Supporting Information). The simulation results suggest that the power-law part plays the dominant role, namely even under a very strong cutoff (e.g., b~2 and z c~1 , with the maximal degree is about 10), the Heaps' law still holds. But if p z ð Þ obeys an exponential form (it can even have heavier tail than the power-law distribution with strong cutoff, like p z ð Þ*z {2 exp {z ð Þ), then N t ð Þ will grow almost linearly in the early stage and soon bend, deviating from the Heaps' law. The comparison of the N t ð Þ curves for power-law distribution with exponential cutoff and exponential distribution can be found in Figure S6 in the Supporting Information. An interesting implication of our results lies in the accelerated growth of scale-free networks. Considering the degree of a node as its occurrence frequency and the total degree of all nodes as the text size, a growing network is analogous to a language system. Then, the scale-free nature corresponds to the Zipf's law of word frequency and the accelerated growth corresponds to the Heaps' law of the vocabulary growth. In an accelerated growing network, the total degree t (proportional to the number of edges) scales in a power-law form as t*N t ð Þ w , where N t ð Þ denotes the number of nodes and ww1 is the accelerating exponent. At the same time, the degree distribution usually follows a power law as p k ð Þ*k {b where k denotes the node degree. For example, the Internet at the autonomous system (AS) level displays the scale-free nature with b&2:16 (see Table 1 in Ref. [55]) and thus a~1 b{1 &0:862.
According to a recent report [37]  T is the total number of elements, N T ð Þ is the total number of distinct elements, a is the Zipf's exponent obtained by the maximum likelihood estimation [3,43], l a is the asymptotic solution of the Heaps' exponent as shown in Eq. 7, l n is the numerical value of the Heaps' exponent given T and a as shown in Fig. 3, and l e is the empirical result of the Heaps' exponent obtained by the least square method. The effective number of the 34th data set is only two digits since the size of this data set is very small. Except the 4th data set, in all other 34 real data sets, the numerical results based on Eq. 6 outperform the asymptotic solution shown in Eq. 7. Detailed description of these data sets can be found in Materials and Methods. doi:10.1371/journal.pone.0014139.t001 than Eq. 7 (w~1). Actually, the asymptotic solution indicates that all the scale-free networks with bw2 should grow in a steady (linear) manner, which is against many known empirical observations [34][35][36][37], while the refined result in this article is in accordance with them. Furthermore, our result provides some insights on the growth of complex networks, namely the accelerated growth can be expected if the network is scale-free with a stable exponent and this phenomenon is prominent when b is around 2.

Relation between Zipf's Law and Power Law
Given the Zipf's law z r ð Þ*r {a , we here prove that the probability density function p z ð Þ obeys a power law as p z ð Þ*z {b with b~1z 1 a . Considering the data points with ranks between r and rzdr where dr is a very small value. Clearly, the number of data points is dr, which can be expressed by the probability density function as where Therefore, we have namely b~1z 1 a . Analogously, the Zipf's law z r ð Þ*r {a can be derived from the power-law probability density distribution p z ð Þ*z {b , with a~1 b{1 .

Direct Comparison between Empirical and Analytical Results
Given the parameter a, according to Eq. 6, we can numerically obtain the function N t ð Þ. The comparison between Eq. 6 and the empirical data for words in the book ''La Divina Commedia'' and keywords in the PNAS articles are shown in Fig. 5. The growing tendency of distinct words can be well captured by Eq. 6. Actually, using a more accurate normalization condition Ð zmaxz 1 2 1 2 as an improved version of Eq. 4, the estimation of z max is determined by Given the parameter a, for an arbitrary N t ð Þ, one can estimate the corresponding z max according to Eq. 11 and then determine the value of t by Eq. 5. The numerical results of this improved version are also presented in Fig. 5, which fits better than Eq. 6 to the empirical data. Notice that, both the two analytical results give almost the same slope in the log-log plot of N t ð Þ function, namely the Heaps' exponents obtained by these two versions are almost the same.

Examples of Numerical Results
Mathematically speaking, as indicated by Eq. 6, N t ð Þ does not scale in a power law with t. However, the numerical results suggest that the dependence of N t ð Þ on t can be well approximated as power-law functions. As shown in Fig. 6, for a wide range of a, N t ð Þ can be well fitted by t l , and the value of fitting exponent l depends on both a and t.

The case of a~1
The numerical solution of Eq. 6 for a~1 can be obtained by considering the limitation a?1, where N t ð Þ a &N t ð Þ and N t ð Þ 1{a &1z 1{a ð ÞlnN t ð Þ. Accordingly, Eq. 6 can be rewritten as When t approaches to infinity, N t ð Þ scales almost linearly with t . Actually, the solution can be expressed For any finite system, the numerical result can be produced by Eq. 12.

Data description
The data sets analyzed in this article can be divided into four classes. According to the data sets shown in Table 1, these four classes are as follows.
(i) Occurrences of words in different books and different languages (data sets Nos. [1][2][3][4][5][6][7][8][9].   [44] where Z r ð Þ is the frequency of the word ranked r and N t ð Þ is the number of distinct words. (b) Keywords of articles published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS) [30] where Z r ð Þ is the frequency of the keyword ranked r and N t ð Þ is the number of distinct keywords; (c) Confirmed cases of the novel virus influenza A (H1N1) [45] where Z r ð Þ is the number of confirmed cases of the country ranked r and N t ð Þ is the number of infected country in the presence of t confirmed cases over the world; (d) PNAS articles having been cited at least once from 1915 to 2009 where Z r ð Þ is the number of citations of the article ranked r and N t ð Þ is the number of distinct articles in the presence of t citations to PNAS. In (c), the data set is small and thus the effective number is only two digits. The fittings in (c1) and (c2) only cover the area marked by blue. In (d1), the deviation from a power law is observed in the head and tail, and thus the fitting only covers the blue area. The Zipf's (power-law) exponents and Heaps' exponents are obtained by using the maximum likelihood estimation [3,43] and least square method, respectively. Statistics of these data sets can be found in Table 1 (the data set numbers of (a), (b), (c) and (d) are 9, 10, 34 and 35 in Table 1) with detailed description in Materials and Methods. doi:10.1371/journal.pone.0014139.g004 Figure 5. Direct comparison between the empirical data and Eq. 6 as well as its improved version. The left and right plots are for the words in ''La Divina Commedia'' and the keywords in PNAS. The blue dash lines and red solid lines present the results of Eq. 6 and Eq. 11, respectively. In accordance with Figure 4 and Table 1, the values of the parameter a are given as 1 Math., Ann. Neurol., J. Evol. Biol., Theo. Popul. Biol., MIS Quart., and IEEE Trans. Automat. Contr.. These data are collected from the ISI Web of Knowledge (http://isiknowledge. com/). For every scientific journal, we consider the keywords sequence in each article according to its publishing time. Since most of the published articles do not have keywords before 1990 in ISI database, we limit our collections from 1991 to 2007 (except for ACM Comput. Surv. which is available only from 1994 to 1999).
(iii) Confirmed cases of the novel virus influenza A in 2009 (data set No. 34). The data of the cumulative number of laboratory confirmed cases of H1N1 of each country are available from the website of Epidemic and Pandemic Alert of World Health Organization (WHO) (http://www.who.int/). The analyzed data set reported influenza A starting from April 26th to May 18th, updated each one or two days. After May 18th, the distribution of confirmed cases in each country shifted from a power law to a power-law form with exponential cutoff [45].
(iv) Citation record of PNAS articles (data set No. 35). This data set consists of all the citations to PNAS articles from papers published between 1915 and 2009 according to the ISI database, ordered by time. Figure S1 Relative errors of the approximation in Eq. 5.