Fitting power-laws in empirical data with estimators that work for all exponents

doi:10.1371/journal.pone.0170920

Fig 1.

Decision tree of questions that should be clarified before estimating power-law exponents from data.

The tree shows under which conditions the fitting algorithms developed in this paper, r_plfit and r_plhistfit, can be used.


Fig 2.

The four types of distribution functions.

Data is sampled from a power-law distribution p(x) ∝ x^−λ with an exponent λ = 0.7 (red line). The relative frequencies f_i are shown for N = 10,000 sampled data points according to their natural (prior) ordering that is associated with p (blue). The rank-ordered distribution (posterior) is shown in yellow, where states i are ordered according to their observed relative frequencies f_i. The rank-ordered distribution follows a power-law, except for the exponential decay that starts at rank ∼ 500. A low-frequency cut-off should be used to remove this part before estimating exponents. The inset shows the frequency distribution ϕ(n), which describes how many states x appear n times (green). The frequency distribution has a maximum and a power-law tail with exponent α = 1 + 1/λ ∼ 2.43. To estimate α, one should only consider the tail of the frequency distribution function.
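The distributions described in this caption can be reproduced in a few lines. The following is a minimal sketch, not the code used for the figure: the function names are hypothetical, and the number of states (1,000) is an assumption, since the caption does not state the size of the sample space for this figure.

```python
import random
from collections import Counter


def sample_power_law(lam, n_states, n_samples, seed=0):
    """Draw samples from p(x) ∝ x^(-lam) on the states 1..n_states."""
    rng = random.Random(seed)
    states = list(range(1, n_states + 1))
    weights = [x ** (-lam) for x in states]
    return rng.choices(states, weights=weights, k=n_samples)


def distributions(samples, n_states):
    """Return relative frequencies in natural order, in rank order, and phi(n)."""
    counts = Counter(samples)
    n = len(samples)
    # Relative frequencies f_i in the natural (prior) ordering of the states.
    natural = [counts.get(x, 0) / n for x in range(1, n_states + 1)]
    # Rank-ordered distribution: the same frequencies sorted from largest to smallest.
    ranked = sorted(natural, reverse=True)
    # Frequency distribution phi(n): number of observed states that appear exactly n times.
    phi = Counter(counts.values())
    return natural, ranked, phi


# Assumed parameters: lambda = 0.7 and N = 10,000 as in the caption, 1,000 states.
samples = sample_power_law(lam=0.7, n_states=1000, n_samples=10000, seed=1)
natural, ranked, phi = distributions(samples, n_states=1000)
```

Plotting `natural`, `ranked`, and `phi` on double-logarithmic axes should give curves of the kind the caption describes; the exact shape depends on the assumed number of states and on the random seed.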


Fig 3.

Comparison of the three power-law exponent estimators, LS, ML_CSN, and ML*.

For 400 values of λ in the range between 0 and 4, we sample N = 10,000 events from Ω = {1, ⋯, 1,000} according to a power-law probability distribution p(x|λ, Ω) ∝ x^−λ. The estimated exponents λ_est for the estimators LS (red), ML_CSN (green), and the new ML* (black, λ_est = λ*) are plotted against the true value of the exponent λ of the probability distribution from which the samples are drawn. Clearly, below λ ∼ 1.5 the ML_CSN estimator no longer works reliably. ML_CSN and ML* work equally well in the range 1.5 < λ < 3.5. Outside this range ML* performs consistently better than the other methods. The inset shows the mean-square error σ² of the estimated exponents. The LS estimator has a much higher σ² than the ML* estimator over the entire region. The blue dot represents the ML* estimate for the Zipf exponent of C. Dickens’ “A tale of two cities”. This exponent could not be reliably obtained from the rank-ordered distribution using ML_CSN, whereas ML* works even for values of λ ∼ 0.
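To illustrate why a maximum-likelihood fit on a bounded sample space can work for small exponents, here is a minimal sketch of such an estimator. It is not the authors' r_plfit implementation and makes no claim to match ML* in detail. It assumes the model p(x|λ, Ω) ∝ x^−λ on Ω = {1, ⋯, W} and uses the standard condition that, at the likelihood maximum, the model expectation of log x equals the sample mean of log x. The function names, the bisection bracket [−2, 10], and the tolerance are all assumptions made for this example.

```python
import math
import random


def expected_log(lam, logs):
    """E[log x] under p(x) ∝ x^(-lam), given precomputed log x for every state."""
    weights = [math.exp(-lam * l) for l in logs]
    z = sum(weights)  # normalisation over the bounded sample space
    return sum(w * l for w, l in zip(weights, logs)) / z


def ml_exponent(samples, n_states, lo=-2.0, hi=10.0, tol=1e-6):
    """Maximum-likelihood exponent on {1..n_states}, found by bisection."""
    logs = [math.log(x) for x in range(1, n_states + 1)]
    target = sum(math.log(x) for x in samples) / len(samples)
    # E[log x] decreases monotonically as lam grows, so bisection is sufficient.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_log(mid, logs) > target:
            lo = mid  # model puts too much weight on large x: increase lam
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sample_power_law(lam, n_states, n_samples, seed=0):
    """Draw samples from p(x) ∝ x^(-lam) on the states 1..n_states."""
    rng = random.Random(seed)
    states = list(range(1, n_states + 1))
    weights = [x ** (-lam) for x in states]
    return rng.choices(states, weights=weights, k=n_samples)


# Same setting as the caption: N = 10,000 events from {1, ..., 1,000}.
est_small = ml_exponent(sample_power_law(0.0, 1000, 10000, seed=2), 1000)
est_mid = ml_exponent(sample_power_law(0.7, 1000, 10000, seed=3), 1000)
```

Because the normalisation is a finite sum over Ω, it stays well defined for every λ, including λ ≤ 1, where the normalisation over an unbounded sample space diverges.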


Table 1.

Comparison of the estimators ML* and ML_CSN on empirical data sets that were used in [23].

These include the frequency of surnames, intensity of wars, populations of cities, earthquake intensity, numbers of religious followers, citations of scientific papers, counts of words, wealth of the Forbes 500 firms, numbers of papers authored, solar flare intensity, terrorist attack severity, numbers of links to websites, and forest fire sizes. We added the word frequencies in the novel “A tale of two cities” (C. Dickens). The second column states whether α or λ was estimated. The exponents reported in [23] are found in column CSN₁; those we reproduced by applying their algorithm to the data [23, 34–37] are shown in column CSN₂. The latter correspond well with the results of the new ML* algorithm. For values λ < 1.5, CSN cannot be used. We list the corresponding Kolmogorov-Smirnov test values for the two estimators, KS_CSN and KS*.
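For readers unfamiliar with the Kolmogorov-Smirnov values listed in the table, the sketch below computes a KS distance between a sample and a fitted power-law on a bounded discrete sample space, defined here as the largest absolute difference between the empirical and the model cumulative distribution. This is an illustrative assumption about the definition; the exact procedure behind KS_CSN and KS* may differ, and the function name is hypothetical.

```python
from collections import Counter


def ks_distance(samples, lam, n_states):
    """Max |F_empirical(x) - F_model(x)| for p(x) ∝ x^(-lam) on {1..n_states}."""
    weights = [x ** (-lam) for x in range(1, n_states + 1)]
    z = sum(weights)
    counts = Counter(samples)
    n = len(samples)
    cdf_model = 0.0
    cdf_emp = 0.0
    distance = 0.0
    for x, w in zip(range(1, n_states + 1), weights):
        cdf_model += w / z               # model cumulative probability up to x
        cdf_emp += counts.get(x, 0) / n  # empirical cumulative frequency up to x
        distance = max(distance, abs(cdf_emp - cdf_model))
    return distance


# Toy example: four observations compared with a flat model (lam = 0) on four states.
d = ks_distance([1, 1, 2, 3], lam=0.0, n_states=4)
```

A smaller distance indicates that the fitted distribution tracks the data more closely over the whole sample space.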
