Rank distributions: Frequency vs. magnitude

We examine the relationship between two different types of ranked data, frequencies and magnitudes. We consider data that can be sorted either way, by number of occurrences or by size of the measures, as is the case for, say, moon craters, earthquakes, billionaires, etc. We indicate that these two types of distributions are functional inverses of each other, and specify this link, first in terms of the assumed parent probability distribution that generates the data samples, and then in terms of an analog (deterministic) nonlinear iterated map that reproduces them. For the particular case of hyperbolic decay with rank the distributions are identical, that is, the classical Zipf plot, a pure power law. Their difference is largest when one displays logarithmic decay and its counterpart shows the inverse, exponential, decay, as is the case for Benford's law, or vice versa. For all intermediate decay rates generic differences appear not only between the power-law exponents for the midway rank decline but also for small and large rank. We extend the theoretical framework to include thermodynamic and statistical-mechanical concepts, such as entropies and configuration.


Introduction
Ranking data that originates from apparently disconnected subjects in many fields (astrophysical, geophysical, ecological, biological, technological, financial, urban, social, etc.) has revealed universal patterns [1,2] and opened intriguing questions about their origin. The empirical law of Zipf [3,4] for the numbers of occurrence (frequencies if normalized) of words in texts has played a central role in the development of this widespread research topic of multidisciplinary complex systems. Zipf's law has been found to be (approximately) followed by many sets of ranked data outside linguistics that record the number of occurrences [5] of other types of items. It has also been found, and this is an important distinction we address here, for the magnitudes or sizes of many measurable objects or entities, such as firmament voids, lengths of rivers, city populations, etc. [6].
Here we analyze the conceptual, and also quantitative, difference between frequency and size ranked data. To this purpose we make use of a straightforward stochastic procedure [7][8][9] to reproduce ranked data from an assumed parent distribution that governs sets of values of random variables that constitute samples. Examination of the expressions for the two types of rank functions indicates that they are functional inverses of each other. See also [6,10,11]. In particular, we focus on the case where the parent distribution $P(N)$, where $N$ is a magnitude random variable, has the power-law form $P(N) \sim N^{-\alpha}$, $1 \le \alpha < \infty$. We find that in the limit $\alpha = 1$ the size-rank distribution $N(k)$, where $k$ is the rank, decays exponentially as $k$ grows, while the frequency-rank distribution $F(k')$ decays logarithmically as $k'$, the corresponding rank variable, increases. On the contrary, in the limit $\alpha \to \infty$, $N(k)$ decays logarithmically while $F(k')$ does so exponentially. The intermediate case $\alpha = 2$ is the special exponent value for which both $N(k)$ and $F(k')$ decay as a power law with exponent $-1$, the classical Zipf power-law value. To complement our description we replicate the procedure by considering instead a starting parent distribution $Q(F) \sim F^{-\beta}$, $1 \le \beta < \infty$, where $F$ is a frequency random variable, and obtain an equivalent account with the exponents related by $1 - \alpha = (1 - \beta)^{-1}$.
We have recently [8,9,12] shown that the above-referred stochastic approach to size-rank distributions can be exactly represented by deterministic nonlinear one-dimensional iterated maps close to tangency [13]. Here we extend this strict analogy to determine frequency-size distributions within this nonlinear dynamical language. These distributions are given by areas below map trajectories.
To explore the duality between size-rank $N(k)$ and frequency-rank $F(k')$ distributions, we look at specific sets of real data that can be sorted out in both ways, magnitudes or numbers of occurrences, such as the cases of earthquakes [14] and forest fires [15] (see Fig 1), and we find agreement with the theoretical approach. We also comment on how Benford's law [16,17] for the frequency of digits corresponds in our scheme to the case $\alpha = 1$.
Finally, we extend our statistical-mechanical interpretation with generalized entropies of rank distributions [12,18] to include the role of $F(k')$.

Rank distributions from a size parent distribution
The basic ingredient in the stochastic method [7][8][9] for rank distributions is the probability distribution $P(N)$ of the magnitude or size data $N$ under consideration. The scheme is phenomenological since the form of $P(N)$ is assumed, and so the first common choices are Gaussian, exponential, or power-law expressions. For the latter case we write
$$P(N) \sim N^{-\alpha}, \qquad 1 \le \alpha < \infty. \tag{1}$$
Sets of data $N$ can be generated from Eq (1) and subsequently examined to see whether they match, statistically, real ranked data sets. Each data set, formed by a total of $\mathcal{N}$ entries expressed with a given suitable precision, can be ranked according to the sizes $N$ or according to the numbers of times $F$ with which its items appear. We shall consider that $N$ takes positive values within an interval $N_{\min} \le N \le N_{\max}$, where we allow as limiting values $N_{\min} = 0$ and/or $N_{\max} \to \infty$. To obtain the number of occurrences $F$ for real numbers $N$ recorded with a given precision it may be necessary to introduce a partition and count incidences within intervals.
The entries in the sample set of values $N$ can be sorted out starting with the largest, $N_{\max}$, and continuing with decreasing magnitudes down to $N_{\min}$. They are then labeled with the rank variable $k$, with $k = 0$ for $N_{\max}$ and $k = k_{\max}$ for $N_{\min}$. We call the function $N(k)$ the size-rank distribution. The rank $k$ can be an integer $k = 0, 1, 2, 3, \ldots, k_{\max}$ (often, elsewhere, the first value is $k = 1$) and it can be generalized to be a real number. The sample set can also be ordered in terms of the frequency with which its entries appear, that is, the number of occurrences $F$ having size equal to or greater than $N$, or equivalently the rate $f$, $0 \le f \le 1$, of occurrences having size equal to or greater than $N$. For this second sorting the occurrences are labeled with a rank variable $k'$, with $k' = 0$ for the most frequent and $k' = k'_{\max}$ for the least frequent. We call $F(k')$ the frequency-rank distribution. Similarly, the rank $k'$ can be an integer $k' = 0, 1, 2, 3, \ldots, k'_{\max}$. The normalized frequency-rank distribution is $f(k') = F(k')/\mathcal{N}$. The main task is to determine $N(k)$ and $F(k')$ from $P(N)$.
We now introduce the complementary cumulative distribution of $P(N)$,
$$\Pi(N, N_{\max}) = \int_{N}^{N_{\max}} P(N')\, dN', \tag{2}$$
where the normalization of $P(N)$ implies $\Pi(N_{\min}, N_{\max}) = 1$. The parent distribution $P(N)$ can be recovered from $\Pi(N, N_{\max})$ via
$$P(N) = -\frac{d\,\Pi(N, N_{\max})}{dN}. \tag{3}$$
In the theoretical approach the evaluation of $\Pi(N, N_{\max})$ is the means by which the values $N$ generated by $P(N)$ are sorted out, and it leads to the rank distributions. The cumulative distribution $\Pi(N, N_{\max})$ increases monotonically as $N$ decreases, taking values from $\Pi(N_{\max}, N_{\max}) = 0$ to $\Pi(N_{\min}, N_{\max}) = 1$. This distribution $\Pi(N(k), N_{\max})$, where we have now indicated the rank $k$ occupied by the variable magnitude $N$, is identified with $k/\mathcal{N}$, that is
$$\Pi(N(k), N_{\max}) = \frac{k}{\mathcal{N}}, \tag{4}$$
or
$$\int_{N(k)}^{N_{\max}} P(N)\, dN = \frac{k}{\mathcal{N}}. \tag{5}$$
The size-rank distribution $N(k)$ is obtained by solving for $N(k)$. Normalization of $P(N)$ indicates that $k_{\max} = \mathcal{N}$. If $k$ is to be an integer, the lower limit $N(k)$ of the integral in Eq (5) can take only the corresponding discrete set of values. On the other hand, the fraction $k/\mathcal{N}$ can also be seen as the rate or scaled frequency with which the sizes equal to or greater than $N$ occur, small for small $k \simeq 0$ and large for $k \simeq k_{\max}$. Therefore we identify the normalized frequency-rank distribution $f(k')$ as
$$f(k') = \Pi(k', N_{\max}) = \int_{k'}^{N_{\max}} P(N)\, dN, \tag{6}$$
where $k' \equiv N$. If $k'$ is to be an integer the values of $N$ to be used in $P(N)$ are integers. In practice, the non-normalized frequency-size distribution $F(k') \equiv \mathcal{N} f(k')$ is often used, as it is constructed directly from the numbers of occurrences in data samples. From the above definitions $k' \equiv N$ and $F(k') \equiv \mathcal{N} f(k')$, together with Eqs (4) and (6), it is clear that the rank distributions $N(k)$ and $F(k')$ are functional inverses of each other. That is, $k = F(N)$ or $N = F^{-1}(k)$. The inverse of a cumulative distribution is referred to as the quantile function [6,10]. We refer to $N(k)$ as the size-rank distribution even though technically it is not a probability distribution, as $P(N)$ and $f(k')$ are.
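The inverse relation between the two rankings can be checked directly on a sample. The following is a minimal sketch, not the authors' procedure: the helper names `size_rank` and `count_at_least`, the seed, and the parameter values are illustrative assumptions, and the sample is drawn from a normalized power law on $[N_{\min}, \infty)$ with $\alpha > 1$ by inverse-transform sampling.

```python
import numpy as np

def size_rank(samples):
    """Sizes sorted from largest to smallest; index k = 0 holds N_max."""
    return np.sort(np.asarray(samples, dtype=float))[::-1]

def count_at_least(samples, threshold):
    """Number of entries with size equal to or greater than `threshold`."""
    return int(np.sum(np.asarray(samples, dtype=float) >= threshold))

rng = np.random.default_rng(0)           # fixed seed, arbitrary choice
alpha, n_min, total = 2.0, 1.0, 10_000   # illustrative values only

# Inverse-transform sampling of P(N) ~ N**(-alpha) on [n_min, infinity).
# rng.random() lies in [0, 1), so the base (1 - u) lies in (0, 1].
samples = n_min * (1.0 - rng.random(total)) ** (-1.0 / (alpha - 1.0))

ranked = size_rank(samples)

# Counting the entries with size >= N(k) recovers the rank (k + 1 here,
# because ranks start at 0): ranking by size and counting occurrences
# above a size are inverse operations.
for k in (0, 10, 100, 1000):
    print(k, ranked[k], count_at_least(samples, ranked[k]))
```

Sorting gives $N(k)$, and counting entries above a threshold gives the non-normalized frequency at that size, so composing the two returns the rank up to the offset introduced by starting at $k = 0$.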

Rank distributions from a power-law parent distribution
We look now at the specific expressions that come out of the general equations in the previous Section when $P(N)$ is given by Eq (1). We have
$$\mathcal{N}^{-1} k = \int_{N(k)}^{N_{\max}} N^{-\alpha}\, dN, \tag{7}$$
or, in terms of the q-deformed logarithmic function $\ln_q(x) \equiv (x^{1-q} - 1)/(1-q)$,
$$\ln_\alpha N(k) = \ln_\alpha N_{\max} - \mathcal{N}^{-1} k. \tag{8}$$
The size-rank distribution $N(k)$ is explicitly obtained from the above with use of the inverse of $\ln_q(x)$, the q-deformed exponential function $\exp_q(x) \equiv [1 + (1-q)x]^{1/(1-q)}$,
$$N(k) = N_{\max} \exp_\alpha\!\left(-N_{\max}^{\alpha-1}\, \mathcal{N}^{-1} k\right), \tag{9}$$
while the frequency-rank distribution $f(k')$ is given by
$$f(k') = \ln_\alpha N_{\max} - \ln_\alpha k'. \tag{10}$$
In Fig 2 we show the agreement of Eqs (9) and (10) with the data on earthquakes and forest fires already shown in Fig 1. Our method for fitting the data to Eqs (9) and (10) is heuristic. We first select a data point to define $N_{\max}$. We then approximate by a straight-line segment, via least squares, a section of the data that appears linear when displayed in logarithmic scales (this involves a choice of its two extremes). This gives us, with the use of Eq (8), a set of two equations from which we determine numerically preliminary values for $\alpha$ and $\mathcal{N}$ (notice that Eq (7) has no normalization constant). We iterate this procedure to improve the fit (mostly only $\mathcal{N}$ changes its value appreciably). Once the parameters in Eq (9) are determined, $F(k')$ follows from Eq (10).
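As an illustration of how the q-deformed functions enter, the sketch below assumes that Eq (9) has the form $N(k) = N_{\max}\exp_\alpha(-N_{\max}^{\alpha-1}k/\mathcal{N})$ and Eq (10) the form $f(k') = \ln_\alpha N_{\max} - \ln_\alpha k'$, which is what integrating the unnormalized power law $N^{-\alpha}$ gives. The function names and the numerical values are hypothetical.

```python
import math

def ln_q(x, q):
    """q-deformed logarithm; reduces to log(x) when q = 1."""
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-deformed exponential, the inverse of ln_q; exp(x) when q = 1."""
    if q == 1.0:
        return math.exp(x)
    return (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))

def size_rank(k, alpha, n_max, total):
    """Assumed form of Eq (9): N(k) = N_max exp_alpha(-N_max**(alpha-1) k / total)."""
    return n_max * exp_q(-(n_max ** (alpha - 1.0)) * k / total, alpha)

def freq_rank(k_prime, alpha, n_max):
    """Assumed form of Eq (10): f(k') = ln_alpha(N_max) - ln_alpha(k')."""
    return ln_q(n_max, alpha) - ln_q(k_prime, alpha)

# Functional-inverse check: evaluating f at k' = N(k) returns k / total.
n_max, total, k = 100.0, 50.0, 10.0      # illustrative values only
for alpha in (1.0, 1.5, 2.0):
    print(alpha, freq_rank(size_rank(k, alpha, n_max, total), alpha, n_max))
```

For each exponent the printed value should equal `k / total` up to rounding, which is the statement that the two rank distributions are functional inverses.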
When $\alpha = 1$ Eq (9) acquires the ordinary exponential form
$$N(k) = N_{\max} \exp\!\left(-\mathcal{N}^{-1} k\right), \tag{11}$$
while Eq (10) becomes an ordinary logarithmic function,
$$f(k') = \ln N_{\max} - \ln k'. \tag{12}$$
We take the limit $\alpha \to \infty$ to signify that $P(N) = N_0 \exp(-N_0 N)$, and we choose $N_0 = 1$. We find
$$N(k) = -\ln\!\left[\exp(-N_{\max}) + \mathcal{N}^{-1} k\right] \tag{13}$$
and
$$f(k') = \exp(-k') - \exp(-N_{\max}). \tag{14}$$
In the limit $N_{\max} \to \infty$ Eq (9) becomes the power law $N(k) \sim k^{1/(1-\alpha)}$, which for $\alpha = 2$ gives the simple hyperbolic form $N(k) \sim k^{-1}$, whereas Eq (10) in the same limit becomes the power law $f(k') \sim k'^{\,(1-\alpha)}$, which for $\alpha = 2$ gives, coincidentally, the same hyperbolic form $f(k') \sim k'^{-1}$. For many sets of real frequency-rank data $\alpha \simeq 2$, and the standard Zipf law is $\alpha = 2$; the same feature in real size-rank data has led to Zipf's law being referred to (concurrently) in relation to $N(k)$. In contrast, when $\alpha \to \infty$, in the limit $N_{\max} \to \infty$ the rank distributions become $N(k) = \ln(\mathcal{N}/k)$ and $f(k') = \exp(-k')$. Here $N(k)$ decays very fast as $k$ increases, since the argument of the logarithmic function lies in the interval $0 < k/\mathcal{N} < 1$, while $f(k')$ decays exponentially as $k'$ increases. This can be compared with the case $\alpha = 1$, with $N_{\max}$ finite, when $N(k)$ decays exponentially as $k$ increases while $f(k')$ decays very fast as $k'$ increases, since again the argument of the logarithmic function lies in the interval $0 < k'/N_{\max} < 1$.
A note on normalization. The choice of $P(N)$ given by Eq (1) is not compatible with finite data sets ($\mathcal{N} < \infty$); these should be represented by a different expression for $P(N)$, at least one that differs from Eq (1) for some values of $N$, especially small $N$. Normalization of Eq (1) obeys $k_{\max} = \mathcal{N}$, with both $k_{\max} \to \infty$ and $\mathcal{N} \to \infty$, while $N_{\min} \to 0$.
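The power-law limit quoted above, $N(k) \sim k^{1/(1-\alpha)}$, suggests a simple way to read an exponent off a straight section of a log-log plot. The sketch below is only an illustration on synthetic data, not the fitting procedure used for Fig 2; the helper names are hypothetical and the ranks are taken to start at 1 so that logarithms are defined.

```python
import numpy as np

def loglog_slope(k, n_of_k):
    """Least-squares slope of log N(k) against log k."""
    slope, _intercept = np.polyfit(np.log(k), np.log(n_of_k), 1)
    return slope

def alpha_from_size_slope(slope):
    """Invert N(k) ~ k**(1/(1 - alpha)): slope = 1/(1 - alpha)."""
    return 1.0 - 1.0 / slope

# Synthetic pure power law with a known exponent (illustrative values only).
alpha_true = 2.5
k = np.arange(1, 1001, dtype=float)
n_of_k = 3.0 * k ** (1.0 / (1.0 - alpha_true))

alpha_est = alpha_from_size_slope(loglog_slope(k, n_of_k))
print(alpha_est)  # close to alpha_true for noise-free power-law data
```

With real data the choice of the two ends of the fitted section matters, because the small-rank and large-rank deviations from the power law bias the slope.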

Rank distributions from a frequency parent distribution
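To show a duality feature of the approach to rank distributions we now consider the derivation of these distributions from a different parent distribution. This distribution, $Q(F)$, generates values of the numbers of occurrences $F$ to form data sets. As before we introduce a complementary cumulative distribution
$$\Xi(F, F_{\max}) = \int_{F}^{F_{\max}} Q(F')\, dF', \tag{15}$$
where the normalization of $Q(F)$ implies $\Xi(F_{\min}, F_{\max}) = 1$. We denote by $\mathcal{F}$ the total number of elements in the occurrences sample set.
Proceeding as before we indicate the rank $k'$ occupied by the number of occurrences $F$ in the distribution $\Xi(F(k'), F_{\max})$ and identify this as $k'/\mathcal{F}$. That is
$$\Xi(F(k'), F_{\max}) = \frac{k'}{\mathcal{F}}, \tag{16}$$
or
$$\int_{F(k')}^{F_{\max}} Q(F)\, dF = \frac{k'}{\mathcal{F}}. \tag{17}$$
When we assume the power-law expression
$$Q(F) \sim F^{-\beta}, \qquad 1 \le \beta < \infty, \tag{18}$$
we obtain
$$\mathcal{F}^{-1} k' = \int_{F(k')}^{F_{\max}} F^{-\beta}\, dF, \tag{19}$$
or, in terms of the q-deformed logarithmic function,
$$\ln_\beta F(k') = \ln_\beta F_{\max} - \mathcal{F}^{-1} k'. \tag{20}$$
The frequency-rank distribution $F(k')$ is explicitly obtained from the above with use of the q-deformed exponential function, this is
$$F(k') = F_{\max} \exp_\beta\!\left(-F_{\max}^{\beta-1}\, \mathcal{F}^{-1} k'\right), \tag{21}$$
while the size-rank distribution $N(k)$, following arguments parallel to those given before for $F(k')$, is given, up to normalization, by
$$N(k) \sim \int_{k}^{F_{\max}} Q(F)\, dF, \tag{22}$$
where $k \equiv F$. Explicitly,
$$N(k) \sim \ln_\beta F_{\max} - \ln_\beta k. \tag{23}$$
Again, it is clear that the rank distributions $F(k')$ and $N(k)$ are functional inverses of each other. That is, $k' = N(F)$ or $F = N^{-1}(k')$. The exponent $\alpha$ in the previous two sections and the exponent $\beta$ in this section are related via
$$1 - \alpha = \frac{1}{1 - \beta}, \tag{24}$$
and coincide in value when $\alpha = \beta = 2$. Both distributions then acquire the simple hyperbolic forms $F(k') \sim k'^{-1}$ and $N(k) \sim k^{-1}$ when in addition $F_{\max} \to \infty$ and $N_{\max} \to \infty$, that is, the classical Zipf case.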
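The relation between the two exponents can be explored numerically. The sketch below assumes the relation $1 - \alpha = (1 - \beta)^{-1}$ quoted in the Summary, together with the limiting power laws $N(k) \sim k^{1/(1-\alpha)}$ and $f(k') \sim k'^{\,(1-\alpha)}$; the function names are hypothetical.

```python
def alpha_from_beta(beta):
    """Exponent alpha of P(N) paired with exponent beta of Q(F),
    from 1 - alpha = 1/(1 - beta)."""
    if beta <= 1.0:
        raise ValueError("the relation is singular at beta = 1; use beta > 1")
    return 1.0 + 1.0 / (beta - 1.0)

def size_rank_exponent(alpha):
    """Exponent of k in the limiting power law N(k) ~ k**(1/(1 - alpha))."""
    return 1.0 / (1.0 - alpha)

def frequency_rank_exponent(alpha):
    """Exponent of k' in the limiting power law f(k') ~ k'**(1 - alpha)."""
    return 1.0 - alpha

for beta in (1.5, 2.0, 3.0):
    alpha = alpha_from_beta(beta)
    print(beta, alpha, size_rank_exponent(alpha), frequency_rank_exponent(alpha))
```

The two rank exponents are reciprocals of each other, as expected for functional inverses, and they coincide only at $\alpha = \beta = 2$, where both equal $-1$. Applying `alpha_from_beta` twice returns the starting value, which reflects the symmetric roles of the two parent distributions.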

Rank distributions from a nonlinear map at tangency
We have shown recently [8,9,12] that there is an exact analogy between the expressions for the rank distributions as presented above for $N(k)$ and those for the trajectories associated with the tangent bifurcation in one-dimensional nonlinear iterated maps. A map $g(x)$ at the tangent bifurcation is written locally as $x' = g(x) = x - u|x|^z + \cdots$, $x \le 0$ [13], $z > 1$, and trajectories initiated at $x_0 \lesssim 0$ are obtained via repeated iterations of $g(x)$, i.e.
$$x_{\tau+1} = x_\tau - u|x_\tau|^z, \qquad \tau = 0, 1, 2, \ldots \tag{25}$$
These trajectories move monotonically towards the point of tangency at $x = 0$. If we make the replacement, valid for large time $\tau$, of the difference $x_{\tau+1} - x_\tau$ by $dx_\tau/d\tau$ in Eq (25) (written as $-u|x_\tau|^z = x_{\tau+1} - x_\tau$) we obtain the differential form $u\,d\tau = -|x_\tau|^{-z}\,dx_\tau$, and integration of both sides of it yields Eqs (26) and (27). The iteration number or time $t$ dependence of all trajectories, Eq (28), is obtained by solving the latter for $x_t$.
The equivalence of the trajectory positions $x_t$ with the size-rank distribution $N(k)$ is made clear by comparison of Eqs (27) and (28) with Eqs (8) and (9), respectively, together with the identifications $t = k$, $u = \mathcal{N}^{-1}$, $x_t = -N(k)$, $x_0 = -N_{\max}$ and $z = \alpha$. Also, comparison of the right-hand side of Eq (26) with that of Eq (7), taking into account Eq (6), indicates that the analog of the frequency-rank distribution $f(k')$ is the quantity $A_t$ of Eq (29), where $-x_t$ plays the role of $k'$. In [8] it is pointed out that the trajectories given by Eq (28) have precisely the analytical form of all trajectories with generic $x_0$ that are generated by the functional-composition renormalization group fixed-point map [13,19] at the tangent bifurcation, and therefore the areas $A_t$ in Eq (29) also have the same property. That is, all trajectories of the fixed-point map, for all $t$, initiated at the generic position $x_0$ obey Eq (28). Eq (29) likewise enjoys the degree of universality given by the fixed-point map.
In Fig 3 we illustrate the iterated map properties for the case $z = \alpha = 2$ that translate into the equivalent description of the rank distributions $N(k)$ and $F(k')$.
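The correspondence between map trajectories and the size-rank distribution can be illustrated numerically. The sketch below rests on stated assumptions rather than on the paper's own equations: it follows the distance $y_t = |x_t|$ to the tangency point, takes the iteration to be $y_{t+1} = y_t - u\,y_t^{\,z}$ with $u > 0$ (a reading of the map for trajectories that approach $x = 0$), and compares it with the continuous-time form $y_t = y_0 \exp_z(-y_0^{\,z-1} u t)$, which is what the identifications $t = k$, $u = \mathcal{N}^{-1}$, $|x_t| = N(k)$, $|x_0| = N_{\max}$ and $z = \alpha$ give when applied to a size-rank law of the form $N(k) = N_{\max}\exp_\alpha(-N_{\max}^{\alpha-1}k/\mathcal{N})$. All names and values are hypothetical.

```python
def exp_q(x, q):
    """q-deformed exponential for q != 1: [1 + (1 - q) x]**(1 / (1 - q))."""
    return (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))

def iterate_distance(y0, u, z, steps):
    """Iterate y -> y - u * y**z (assumed map for the distance to tangency)."""
    trajectory = [y0]
    y = y0
    for _ in range(steps):
        y = y - u * y ** z
        trajectory.append(y)
    return trajectory

def closed_form(y0, u, z, t):
    """Continuous-time approximation y_t = y0 * exp_z(-y0**(z - 1) * u * t)."""
    return y0 * exp_q(-(y0 ** (z - 1.0)) * u * t, z)

y0, u, z = 0.1, 1.0, 2.0                 # illustrative values only
trajectory = iterate_distance(y0, u, z, 1000)
for t in (10, 100, 1000):
    print(t, trajectory[t], closed_form(y0, u, z, t))
```

The iterated and closed-form values agree to within a fraction of a percent for these parameters, and the agreement is expected to be best when $u\,y_0^{\,z-1}$ is small, since that is when replacing the difference by a derivative is a good approximation.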
When $z = 1$ the trajectories take the exponential form of Eq (30), and
$$A_t = \ln|x_t| - \ln|x_0|. \tag{31}$$
The trajectories in Eq (30) are obtained when a linear map intersects the identity line, and this occurs locally when the tangent map is shifted into a double-secant map. In the limit $z \to \infty$ the counterpart of Eq (28) is an expression that transforms into Eq (13) for $N(k)$ under the same equivalences $t = k$, $u = \mathcal{N}^{-1}$, $x_t = -N(k)$ and $x_0 = -N_{\max}$.

Rank distributions associated with Benford's first digit law
Benford's first digit law [16,17],
$$P(n) = \log\!\left(1 + \frac{1}{n}\right),$$
where $n$ is the first digit of a decimal base number $N$ and $\log$ denotes the decimal base logarithmic function, can be readily expressed in terms of the complementary cumulative distribution Eq (2) when $\alpha = 1$ and the parent distribution is $P(N) = 1/N$. This is
$$P(n) = \log(N+1) - \log N = \frac{1}{\ln 10}\int_{N}^{N+1} \frac{dN'}{N'},$$
where $N + 1 = (n+1).000\cdots$ and $N = n.000\cdots$. Thus, by considering the cumulative version of Benford's law, with $N_{\max} = 10$ and $N = n.000\cdots$, $n = 1, 2, \ldots, 9$, we obtain the corresponding rank distributions, the $\alpha = 1$ logarithmic and exponential forms of Eqs (12) and (11). In Fig 4 we show these distributions together with numerical data that follow Benford's law, as shown in the figure's inset. Benford's law has been generalised to the case $\alpha > 1$ [7], so that its associated (complementary) cumulative distribution Eq (7) provides the connection with the rank distributions studied here. In particular the case $\alpha = 2$ corresponds to the classical Zipf's law described by $F(k')$ with $k' = 0, 1, 2, 3, \ldots$, shifted and limited to the values of the first digits $1, 2, 3, \ldots, 9$, when using decimal base logarithms.
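The link between Benford's law and a logarithmic cumulative distribution can be verified with a few lines. The sketch below uses the standard statement of the law, $P(n) = \log_{10}(1 + 1/n)$, and checks that the fraction of numbers whose first digit is $n$ or larger equals $1 - \log_{10} n$, the decimal-logarithm counterpart of integrating $1/N$ from $n$ to $N_{\max} = 10$. The function names are hypothetical.

```python
import math

def benford(n):
    """Benford probability that the first decimal digit equals n (1..9)."""
    return math.log10(1.0 + 1.0 / n)

def benford_cumulative(n):
    """Fraction of first digits equal to or greater than n: log10(10 / n)."""
    return 1.0 - math.log10(n)

for n in range(1, 10):
    tail_sum = sum(benford(m) for m in range(n, 10))
    print(n, benford(n), tail_sum, benford_cumulative(n))
```

The sum telescopes, so the last two printed columns agree to rounding error for every digit, and the value at $n = 1$ is 1, as required by normalization.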

Rank distributions as expressions of a thermodynamic structure
As pointed out, $N(k)$ is the functional inverse of $F(k')$, that is, the inverse of a (non-normalized complementary) cumulative distribution in reverse order, a quantile function [6,10]. Also, $N(k)$ has been interpreted [8,9] as the total number of times that the size-variable unit occurs at fixed rank $k$. That is, in thermal system language, $N(k)$ is equivalent to the degeneracy of a microstate of 'energy' $k$, or a micro-canonical partition function with fixed $k$, where the associated uniform probability is $p_i^{(k)} = p^{(k)} \equiv 1/N(k)$ for all $i = 1, \ldots, N(k)$. Thus, we can call $S(k) \equiv \ln N(k)$ an entropy for $\alpha = 1$ and define $S(k) \equiv \ln_\alpha N(k)$ as a generalized entropy for $\alpha > 1$. Likewise $S_{\max} \equiv \ln N_{\max}$ for $\alpha = 1$ and $S_{\max} \equiv \ln_\alpha N_{\max}$ for $\alpha > 1$, with $N(0) \equiv N_{\max}$. Eq (8) is written now as
$$S(k) = S_{\max} - \mathcal{N}^{-1} k,$$
where, if $S(k)$ is thought of as the entropy for the system with fixed $k$, then $S_{\max}(\mathcal{N}^{-1})$ would be a generalized Massieu potential when the variable $k$ is replaced by the (conjugate) variable $\mathcal{N}^{-1}$ via a Legendre transformation. Just as thermodynamic quantities are dominant values of statistical-mechanical fluctuating quantities in a macroscopic system, we think of Eq (11), valid for $\alpha = 1$, as the result of applying the saddle-point approximation for large $\mathcal{N}$. The consideration of the emergence of dominant rank fluctuations for the general case $\alpha > 1$ in the 'thermodynamic' limit $k_{\max} = \mathcal{N} \to \infty$ is less straightforward and we do not discuss it further here. We recall [20,21] that the formalism of thermodynamics can be expressed in two equivalent ways. One of them is to consider as starting point the entropy as the fundamental monotonic function of the energy (and other basic variables) that characterise the system, while the other is to begin with the internal energy as the fundamental quantity, a monotonic function of the entropy (and the same other variables).
The expressions obtained from the parent distribution P(N), Eqs (8) and (9), correspond to the former choice, while those obtained from the parent distribution Q(F), Eqs (20) and (21), relate to the second one.
In this statistical-mechanical interpretation the rank $k$ plays the role of energy and the entropy is $S(k) = \ln N(k)$ when $\alpha = 1$. In the alternative description $F$ plays the role of energy while the entropy is $\ln k'$, again when $\alpha = 1$. Thus here the quantities representing entropy and energy are, as customary, functional inverses of each other in accordance with the usual two equivalent thermodynamic frames [20,21].
As a consequence of the precise analogy between the rank distributions obtained from a parent distribution and the nonlinear iterated fixed-point map at tangency, we note that the thermodynamic structure observed above for the rank-distribution quantities translates thoroughly into an equivalent structure for the nonlinear dynamical problem. It is only necessary to recall the identifications $z = \alpha$, $t = k$, $u = \mathcal{N}^{-1}$, $x_t = -N(k)$ and $x_0 = -N_{\max}$.

Summary and discussion
We have analyzed the relationship that exists between two types of ranked data, numbers of occurrences and sizes or magnitudes of items. The technical relationship is well understood by statistics specialists: frequency-rank data is represented by a (complementary) cumulative probability distribution, while size-rank data is described by its functional inverse, a quantile function [6,10]. It is of wider interest, for those studying the many topics of complex systems science where universal patterns are observed in ranked data samples from very different sources, such as the empirical laws of Zipf and Benford, to understand the physical origin of the documented behavior. We have obtained expressions for the size-rank $N(k)$ and frequency-rank $F(k')$ distributions from a stochastic method and from an equivalent nonlinear deterministic approach, and corroborated that the two functions are inverses of each other. Their differences are most apparent when the exponent $\alpha$ of the power-law parent distribution differs from $\alpha = 2$, but they coincide and behave as hyperbolic functions (with deviations for small and large rank) when $\alpha = 2$. In this latter case, that of the classic Zipf law, we have a nonlinear map at tangency with nonzero curvature, the most common case of an analytic map at tangency. On the other hand, we illustrated the case $\alpha = 1$ with the first digit Benford law. We complemented our description by also considering the option that the parent distribution for the source of data be that for the number of occurrences $F$ instead of that for the size $N$. When these two distributions are assumed to have the power-law forms $P(N) \sim N^{-\alpha}$ and $Q(F) \sim F^{-\beta}$ we obtain parallel (and equivalent) descriptions for the size and frequency rank distributions, with the roles of cumulative distribution and quantile function interchanged and with the exponent relationship $1 - \alpha = (1 - \beta)^{-1}$.
Further, we advanced a thermodynamic and statistical-mechanical interpretation to be associated with the properties obtained for the rank distributions, and indicated that the rank $k$ plays the role of energy and $N(k)$ takes the place, in a prototypical thermal system, of the number of configurations at fixed energy $k$, with entropy $S(k) = \ln N(k)$ when $\alpha = 1$. The interpretation of the alternative description corresponds to $F$ playing the role of energy and $\ln k'$ that of entropy when $\alpha = 1$. Thus entropy and energy, as functional inverses of each other, provide two equivalent thermodynamic formalisms [20,21].
The case $\alpha > 1$ suggests the use of the generalized entropy expression $S(k) = \ln_\alpha N(k)$ in the thermodynamic description, but this poses a question for its corresponding statistical-mechanical formalism, in that the validity of the usual saddle-point approximation requires reconsideration. The outcome may be one in which fluctuations are not suppressed in the thermodynamic limit, here represented by $k_{\max} \to \infty$ and $\mathcal{N} \to \infty$. The reproduction of the rank distributions via a nonlinear map at a tangent bifurcation indicates a reason for the appearance of the generalized entropy expression: the drastic contraction of configuration space from a real-number set of possible iterated-map trajectory positions to only a finite number in the limit $k_{\max} \to \infty$ and $\mathcal{N} \to \infty$, which corresponds to $t \to \infty$ [12,18].