Assessing Conformance with Benford’s Law: Goodness-Of-Fit Tests and Simultaneous Confidence Intervals

Benford’s Law is a probability distribution for the first significant digits of numbers; for example, the first significant digits of the numbers 871 and 0.22 are 8 and 2, respectively. The law is particularly remarkable because many types of data are considered to be consistent with it, and scientists and investigators have applied it in diverse areas, for example, diagnostic tests for mathematical models in Biology, Genomics, Neuroscience, image analysis and fraud detection. In this article we present and compare statistically sound methods for assessing conformance of data with Benford’s Law, including discrete versions of Cramér-von Mises (CvM) statistical tests and simultaneous confidence intervals. We demonstrate that the common use of many individual binomial confidence intervals leads to rejection of Benford too often for truly Benford data. Based on our investigation, we recommend that the CvM statistic $U_d^2$, Pearson’s chi-square statistic and 100(1 − α)% Goodman simultaneous confidence intervals be computed when assessing conformance with Benford’s Law. Visual inspection of the data with simultaneous confidence intervals is useful for understanding departures from Benford and the influence of sample size.


Introduction
Benford's Law is a probability distribution for the first significant digit (FSD) of numbers; for example, the FSDs of the numbers 871 and 0.0561 are 8 and 5, respectively. The law is based on the empirical observation that for many sets of numerical data the FSD is not uniformly distributed, as might naively be expected, but rather follows a logarithmic distribution, that is, for first digit $D_1$,

$$\Pr(D_1 = d_1) = \log_{10}\left(1 + \frac{1}{d_1}\right), \quad d_1 = 1, 2, \ldots, 9. \quad (1)$$

For example, the probability that the first digit is 3 is $\log_{10}(1 + 1/3) \approx 0.1249$. The law is remarkable because many types of data are considered to be consistent with it. The Benford Online Bibliography [1] is a large database of papers, books, websites, etc. which apply Benford's Law in diverse areas, from diagnostic tests for mathematical models in Biology, Genomics and Neuroscience, to image analysis and fraud detection by the U.S. Internal Revenue Service, and two recent books [2,3] also bear testimony to the popularity of the law in many fields.
To demonstrate conformance with Benford's Law, many authors use simple statistical methodology: visual plots, Pearson's chi-square test and individual confidence intervals for digit probabilities based on the binomial distribution. These methods may be inefficient, inaccurate, or lacking in power to detect reasonable departures from (alternatives to) Benford's Law. In particular, methods based on individual confidence intervals do not take into consideration the phenomenon of multiple comparisons. For example, the joint confidence level for nine binomial 100(1 − α)% confidence intervals computed using the observed proportions of leading digits 1 through 9 in a sample of numbers may be very different from 100 (1 − α)%, the analyst's intended confidence level, and the problem is magnified if the first two or more digits are considered.
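As a quick illustration of this multiple-comparisons effect, the following sketch (ours, not taken from the article's R package) simulates truly Benford first-digit counts and estimates how often nine individual 95% Wald intervals simultaneously cover all nine Benford probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)

# Benford first-digit probabilities, d = 1..9.
p = np.log10(1 + 1 / np.arange(1, 10))

n, reps = 1000, 2000
z = 1.959964            # z_{alpha/2} for individual 95% intervals
covered = 0
for _ in range(reps):
    f = rng.multinomial(n, p)
    phat = f / n
    half = z * np.sqrt(phat * (1 - phat) / n)   # Wald half-widths
    # "Joint coverage": every one of the nine intervals contains its
    # Benford probability.
    covered += np.all(np.abs(phat - p) <= half)
print(covered / reps)   # well below the nominal 0.95
```

With a fixed seed the estimate is reproducible; the joint coverage falls far short of 95%, which is exactly the problem that simultaneous intervals are designed to fix.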
Often data sets are large, and Miller's (Chapter 1, 2015) [4] remark concerning conformance with Benford's Law, "It is a non-trivial task to find good statistical tests for large data sets", is pertinent. In this article we present and compare statistically sound methods for assessing conformance of data to Benford's Law for medium to large data sets. We investigate the likelihood ratio test for the most general alternative, three tests based on Cramér-von Mises statistics for discrete distributions, Pearson's chi-square statistic and simultaneous confidence interval procedures for assessing compliance with the set of Benford probabilities.
Because Benford's Law is of wide application and general interest, we first present a brief description of the law. This is followed by sections on the goodness-of-fit tests and simultaneous confidence intervals for multinomial probabilities. Comparisons of the power of the procedures to detect various plausible alternatives are provided as well as examples from Genomics and Finance. The final section concludes with a discussion of the results. An R [5] package for these methods is freely available.

Benford's Law
Benford's Law is based on the empirical observation that for many sets of numerical data, the first significant (or leading) digits follow a logarithmic distribution. For the first m digits $D_1, D_2, \ldots, D_m$,

$$\Pr(D_1 = d_1, \ldots, D_m = d_m) = \log_{10}\left[1 + \left(\sum_{i=1}^{m} d_i \times 10^{m-i}\right)^{-1}\right], \quad (2)$$

for $d_1 = 1, 2, \ldots, 9$ and $d_2, \ldots, d_m = 0, 1, \ldots, 9$, so that, for example, the probability that the first two digits are 30 is $\log_{10}[1 + (30)^{-1}] \approx 0.01424$ and the probability that the first three digits are 305 is $\log_{10}[1 + (305)^{-1}] \approx 0.00142$. This closely agrees with empirical distributions of first digits in much tabular data: for example, [6] considered areas of rivers, American League baseball statistics, atomic weights of elements and numbers appearing in Reader's Digest articles.
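The leading-block probabilities above are a one-line computation; a minimal sketch (the function name is ours):

```python
import math

def benford_prob(leading: int) -> float:
    """Probability that a number's leading digit block equals `leading`,
    e.g. 3 (first digit), 30 (first two digits) or 305 (first three)."""
    return math.log10(1 + 1 / leading)

print(round(benford_prob(3), 4))    # 0.1249
print(round(benford_prob(30), 5))   # 0.01424
print(round(benford_prob(305), 5))  # 0.00142
```

Note that the nine first-digit probabilities telescope to $\log_{10} 10 = 1$, so they form a proper distribution.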
There have been many attempts to explain Benford's Law; see [2,3,7-9] for reviews. One of the most convincing explanations is that put forward by Hill [8], who demonstrated that if numbers are generated by first selecting probability distributions at random and then choosing and combining random samples from those distributions, the distribution of FSDs converges to Benford's Law provided that the sampling is unbiased with regard to scale or base [2]. Thus, even if tabular data come from many sources, one might expect the empirical first digit frequencies to closely follow Benford's Law. Other explanations are provided in the books [2,3] and include spread, geometric, scale-invariance and Central Limit Theorem explanations.
Not all datasets conform to Benford's Law. For example, it does not hold for tables of (uniformly distributed) random numbers, nor for numbers in telephone directories, nor for dates (mm/dd/yy or dd/mm/yy). Rodriguez (2004) [10] demonstrates that Benford's Law is inadequate when data are drawn from commonly used distributions, including the standard normal, Cauchy and exponential distributions. He does show, however, that the Lognormal distribution yields FSD probabilities arbitrarily close to Benford as the log-scale variance increases.
Likelihood ratio and Pearson's chi-square tests for Benford's Law

Likelihood ratio tests are generally powerful tests [11] and are often the tests of choice of statisticians. Given the FSDs of a set of n entries in a set of data, we test whether they are compatible with Benford's Law Eq (1). That is, we test the null hypothesis for the first digit probabilities $p_i = \Pr(D_1 = i)$,

$$H_0: p_i = \log_{10}(1 + 1/i), \quad i = 1, 2, \ldots, 9,$$

against the broadest alternative hypothesis,

$$H_1: p_i \neq \log_{10}(1 + 1/i) \text{ for at least one } i.$$

With first digit frequencies $f_i$ and observed proportions $\hat{p}_i = f_i/n$, $i = 1, 2, \ldots, 9$, the likelihood ratio (LR) statistic Λ for testing $H_0$ vs. $H_1$ gives

$$-2 \ln \Lambda = 2 \sum_{i=1}^{9} f_i \ln\left(\frac{\hat{p}_i}{p_i}\right),$$

which asymptotically follows a $\chi^2(8)$ distribution, where ln is the natural logarithm. The LR test is asymptotically equivalent to Pearson's chi-square statistic,

$$X^2 = \sum_{i=1}^{9} \frac{(f_i - n p_i)^2}{n p_i}. \quad (3)$$

Tests based on Cramér-von Mises statistics

In this section we consider omnibus goodness-of-fit tests based on Cramér-von Mises type (CvM) statistics for discrete distributions [12,13]. Specifically we consider the statistics $W_d^2$, $U_d^2$ and $A_d^2$, which are analogues of, respectively, the Cramér-von Mises, Watson and Anderson-Darling statistics widely used for testing goodness of fit for continuous distributions. These discrete CvM statistics have been shown to have greater power than Pearson's chi-square statistic when testing for the grouped exponential distribution and the Poisson distribution [14-16].
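The LR and Pearson statistics above can be sketched in a few lines (variable names ours; the closed-form chi-square tail below is valid because the degrees of freedom, 8, are even):

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def lr_and_pearson(f):
    """LR statistic -2 ln(Lambda) and Pearson's X^2 for first-digit counts f."""
    n = sum(f)
    lr = 2 * sum(fi * math.log(fi / (n * p)) for fi, p in zip(f, BENFORD) if fi > 0)
    x2 = sum((fi - n * p) ** 2 / (n * p) for fi, p in zip(f, BENFORD))
    return lr, x2

def chi2_sf_8(x):
    """P(chi2(8) > x); a closed form exists for even degrees of freedom."""
    return math.exp(-x / 2) * sum((x / 2) ** k / math.factorial(k) for k in range(4))

f = [301, 176, 125, 97, 79, 67, 58, 51, 46]   # n = 1000, nearly exact Benford
lr, x2 = lr_and_pearson(f)
print(lr, x2, chi2_sf_8(x2))   # both statistics tiny, p-value near 1
```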
As above, we test Benford's Law against the most general alternative hypothesis, $H_1$. Let $S_i = \sum_{j=1}^{i} \hat{p}_j$ and $T_i = \sum_{j=1}^{i} p_j$ denote the cumulative observed and expected proportions, and let $Z_i = S_i - T_i$. Note that $Z_i$ is the difference between the empirical and null cumulative distribution functions, on which the CvM statistics are based. Define weights $t_i = (p_i + p_{i+1})/2$ for $i = 1, \ldots, 8$ and $t_9 = (p_9 + p_1)/2$, and define the weighted mean of the deviations $Z_i$ as $\bar{Z} = \sum_{i=1}^{9} t_i Z_i$. The CvM statistics are defined as follows [13]:

$$W_d^2 = n \sum_{i=1}^{9} Z_i^2 t_i, \qquad U_d^2 = n \sum_{i=1}^{9} (Z_i - \bar{Z})^2 t_i, \qquad A_d^2 = n \sum_{i=1}^{9} \frac{Z_i^2 t_i}{T_i(1 - T_i)}.$$

Note that since $Z_9 = 0$, the last term in $W_d^2$ is zero. The last term in $A_d^2$ is of the form 0/0, and is set equal to zero.
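The three discrete CvM statistics can be computed directly from these definitions; a sketch with our own function name:

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def cvm_statistics(f, p=BENFORD):
    """Discrete Cramer-von Mises W2_d, Watson U2_d and Anderson-Darling A2_d."""
    n, k = sum(f), len(p)
    S = [sum(f[: i + 1]) / n for i in range(k)]          # empirical cdf
    T = [sum(p[: i + 1]) for i in range(k)]              # hypothesized cdf
    Z = [s - t for s, t in zip(S, T)]                    # deviations
    t = [(p[i] + p[(i + 1) % k]) / 2 for i in range(k)]  # weights; t_9 wraps to p_1
    zbar = sum(ti * zi for ti, zi in zip(t, Z))
    w2 = n * sum(ti * zi ** 2 for ti, zi in zip(t, Z))
    u2 = n * sum(ti * (zi - zbar) ** 2 for ti, zi in zip(t, Z))
    # The last A2_d term is 0/0 (since T_9 = 1) and is set to zero.
    a2 = n * sum(t[i] * Z[i] ** 2 / (T[i] * (1 - T[i])) for i in range(k - 1))
    return w2, u2, a2

f = [301, 176, 125, 97, 79, 67, 58, 51, 46]   # n = 1000, close to Benford
print(cvm_statistics(f))                       # all three statistics near zero
```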
The CvM type statistics defined here take into account the order of the cells (or, digits here) in contrast to Pearson's statistic, X 2 , which does not. However, if the order of the cells is completely reversed, the values of the statistics are unaltered. Further, the statistic U 2 d is invariant to the choice of the origin for the hypothesized discrete distribution [13].
Under the null hypothesis, the asymptotic distribution of each CvM statistic is a linear combination of independent $\chi^2(1)$ random variables. Asymptotic percentage points (critical values) for the CvM statistics under the null are in Table 1, and R code for computing p-values for these statistics is available. Upper-tail probabilities for the asymptotic distribution can be obtained using a numerical method due to Imhof [17,18] or, more crudely, using a chi-square approximation. Imhof's method requires one-dimensional numerical integration of a closed-form expression, whereas the chi-square approximation is faster to compute since it only requires the first three cumulants of the statistic in question.
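The chi-square approximation can be sketched for $W_d^2$: the weights of the $\chi^2(1)$ variables are the eigenvalues of $D^{1/2} \Sigma D^{1/2}$, where Σ is the asymptotic covariance of the cumulative deviations and $D = \mathrm{diag}(t_i)$, and the first three cumulants of the limit law are matched to a shifted, scaled chi-square. This is our reconstruction of the general recipe, not the paper's R code:

```python
import numpy as np

# Benford cell probabilities, cumulative probabilities and CvM weights.
p = np.log10(1 + 1 / np.arange(1, 10))
T = np.cumsum(p)
t = (p + np.roll(p, -1)) / 2          # t_i = (p_i + p_{i+1})/2, t_9 wraps to p_1

# Asymptotic covariance of sqrt(n) * Z: Sigma_ij = T_min(i,j) - T_i * T_j.
Sigma = np.minimum.outer(T, T) - np.outer(T, T)

# W2_d -> sum_i lambda_i chi2(1), lambda = eigenvalues of D^1/2 Sigma D^1/2.
M = np.sqrt(np.outer(t, t)) * Sigma
lam = np.linalg.eigvalsh(M)

# First three cumulants of the limit law: kappa_r = 2^(r-1) (r-1)! sum lambda^r.
k1, k2, k3 = lam.sum(), 2 * (lam ** 2).sum(), 8 * (lam ** 3).sum()

# Match to a + b * chi2(nu); then p-value ~ P(chi2(nu) > (w2 - a) / b).
b = k3 / (4 * k2)
nu = 8 * k2 ** 3 / k3 ** 2
a = k1 - b * nu
print(a, b, nu)
```

The final tail probability can then be evaluated with any chi-square routine allowing fractional degrees of freedom.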

Simultaneous confidence intervals for multinomial probabilities
Confidence intervals provide more information about departures from Benford's Law than do p-values from goodness-of-fit tests. Ideally, we wish to compute a set of confidence intervals, with overall confidence level 100(1 − α)%, for the nine, or more generally k, digit probabilities using the observed digit frequencies $f_1, f_2, \ldots, f_k$. If all of the k confidence intervals cover all of the Benford probabilities, then the data are deemed to be consistent with Benford's Law at the 100(1 − α)% level. If they do not, we can easily determine for which digits departures occur and investigate further. The widths of the confidence intervals also clearly indicate the amount of information in the data, which is related to the sample size, n: the larger n, the narrower the confidence intervals. Indeed, extremely narrow confidence intervals that do not all cover all of the Benford probabilities may not represent practically significant departures from Benford's Law.

One approach that is commonly used to generate confidence intervals for multinomial probabilities is to compute, for each cell/digit in turn, a 100(1 − α)% (approximate) binomial confidence interval for that digit frequency versus all of the others, i.e. $\hat{p}_i \pm z_{\alpha/2}\sqrt{\hat{p}_i(1 - \hat{p}_i)/n}$. This procedure uses many (here, k) individual 100(1 − α)% confidence intervals and is problematic, since the probability that all of these confidence intervals simultaneously contain the population proportions is not (1 − α); by the Bonferroni inequality it can be as small as (1 − kα). To remedy this, we use simultaneous 100(1 − α)% confidence intervals constructed so that the probability that every one of the intervals contains the corresponding population proportion is (approximately) (1 − α). Several simultaneous confidence intervals for multinomial proportions have been proposed in the literature. We consider six techniques, ordered by date of publication, and present their formulae and some background below.
Let $\mathbf{f} = (f_1, \ldots, f_k)^T$ be the vector of observed cell frequencies, $\chi^2_{\nu,\alpha}$ be the upper α quantile of the chi-square distribution with ν degrees of freedom and $z_\alpha$ be the upper α quantile of the standard normal distribution. R code for computing the following simultaneous confidence intervals is available.

Quesenberry and Hurst [Ques] [19]: The Ques simultaneous confidence intervals, with $A = \chi^2_{k-1,\alpha}$,

$$\frac{A + 2f_i \pm \sqrt{A\left[A + 4f_i(n - f_i)/n\right]}}{2(n + A)},$$

are constructed so that the probability that all of them cover the corresponding Benford probabilities is at least (1 − α), i.e. they are conservative. The theory for the construction is based on the asymptotic $\chi^2$ distribution of Pearson's chi-square statistic Eq (3), and the intervals are recommended when the smallest expected frequency, $np_i$, is at least 5.

Goodman [Good] [20]: The Good simultaneous intervals modify the Ques intervals, replacing A with $B = \chi^2_{1,\alpha/k}$ to obtain typically shorter, and thus less conservative, intervals.
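A sketch of the Goodman construction, which is the one recommended later in the paper (stdlib only; the $\chi^2_{1,\alpha/k}$ quantile is obtained from the normal quantile since $\chi^2_{1,\alpha/k} = z^2_{\alpha/(2k)}$):

```python
from statistics import NormalDist

def goodman_intervals(f, alpha=0.05):
    """Goodman 100(1-alpha)% simultaneous intervals for multinomial proportions."""
    n, k = sum(f), len(f)
    # chi-square(1) upper alpha/k quantile, via the normal quantile (stdlib only).
    b = NormalDist().inv_cdf(1 - alpha / (2 * k)) ** 2
    intervals = []
    for fi in f:
        half = (b * (b + 4 * fi * (n - fi) / n)) ** 0.5
        intervals.append(((b + 2 * fi - half) / (2 * (n + b)),
                          (b + 2 * fi + half) / (2 * (n + b))))
    return intervals

f = [301, 176, 125, 97, 79, 67, 58, 51, 46]   # first-digit counts, n = 1000
for lo, hi in goodman_intervals(f):
    print(round(lo, 3), round(hi, 3))
```

Each interval is of Wilson-score type and always contains the observed proportion $\hat{p}_i$.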

Bailey angular transformation [Bang] [21]: Bailey modifies the Good simultaneous intervals, incorporating transformations of the observed frequencies which are known to be more nearly normally distributed, for large n, than the frequencies themselves. The modification uses the arcsine-square-root transformation, a variance-stabilizing transformation for binomial data, giving intervals of the form

$$\sin^2\left(\arcsin\sqrt{\hat{p}_i} \pm \sqrt{C}\right),$$

where C = B/(4n). We do not incorporate corrections for continuity since sample sizes are generally large in Benford's Law studies.

Fitzpatrick and Scott [Fitz] [22]: The Fitz intervals have a common half-width for every cell. Fitzpatrick and Scott show that a lower bound for the simultaneous coverage probability of the k intervals $\hat{p}_i \pm z_{\alpha/2}/(2\sqrt{n})$ is (1 − 2α) for small α. Therefore, their 100(1 − α)% intervals take the form

$$\hat{p}_i \pm \frac{z_{\alpha/4}}{2\sqrt{n}}.$$
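The Fitz intervals are simple enough to sketch in a few lines; the $\pm z_{\alpha/4}/(2\sqrt{n})$ form used here is our reading of the construction, obtained by halving α in the (1 − 2α) coverage bound:

```python
from statistics import NormalDist

def fitz_intervals(f, alpha=0.05):
    """Fitzpatrick-Scott 100(1-alpha)% simultaneous intervals (our reading):
    a common half-width z_{alpha/4} / (2 sqrt(n)) for every cell."""
    n = sum(f)
    d = NormalDist().inv_cdf(1 - alpha / 4) / (2 * n ** 0.5)
    return [(max(0.0, fi / n - d), min(1.0, fi / n + d)) for fi in f]
```

Note that the half-width depends only on n, not on k or on the observed proportions.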

Sison and Glaz [Sison] [23]: The Sison simultaneous confidence intervals are based on a relatively complex approximation to the probability that all multinomial frequencies lie within given intervals. The procedure does not have a closed form and must be implemented on a computer. Let $V_i$, $i = 1, 2, \ldots, k$, be independent Poisson random variables with means $f_i$, and let $f_1^*, \ldots, f_k^*$ be the cell frequencies in a sample of n observations from a multinomial distribution with cell probabilities $(f_1/n, \ldots, f_k/n)$. Using truncations of the $V_i$, an approximation ν(τ) to the probability $\Pr(f_i - \tau \le f_i^* \le f_i + \tau,\; i = 1, \ldots, k)$ is computed. The Sison and Glaz intervals then have the form

$$\hat{p}_i - \frac{\tau}{n} \;\le\; p_i \;\le\; \hat{p}_i + \frac{\tau + 2\gamma}{n},$$

where the integer τ satisfies the condition ν(τ) < 1 − α < ν(τ + 1), and γ = [(1 − α) − ν(τ)] / [ν(τ + 1) − ν(τ)].
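Because ν(τ) has no closed form, the construction is easiest to sketch with a Monte Carlo stand-in for the truncated-Poisson approximation (the function name and the simulation shortcut are ours; the real procedure computes ν analytically):

```python
import numpy as np

def sison_glaz_mc(f, alpha=0.05, reps=20000, seed=0):
    """Sison-Glaz-style intervals with nu(tau) estimated by Monte Carlo
    instead of the analytic truncated-Poisson approximation (a sketch)."""
    rng = np.random.default_rng(seed)
    f = np.asarray(f)
    n = f.sum()
    sims = rng.multinomial(n, f / n, size=reps)

    def nu(tau):
        # Estimated probability that every resampled count lies in f_i +/- tau.
        return np.mean(np.all(np.abs(sims - f) <= tau, axis=1))

    tau = 0
    while nu(tau + 1) < 1 - alpha:   # find nu(tau) < 1 - alpha < nu(tau + 1)
        tau += 1
    gamma = ((1 - alpha) - nu(tau)) / (nu(tau + 1) - nu(tau))
    return [(fi / n - tau / n, fi / n + (tau + 2 * gamma) / n) for fi in f]
```

As in the original procedure, every interval has the same width, τ is an integer and γ interpolates between ν(τ) and ν(τ + 1).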

Simulation Study
We investigated the finite sample behaviour of the test statistics and confidence intervals using a simulation study assuming several different alternative distributions. The simulation results, size (proportion of tests rejected when the data are truly Benford) and power (proportion of tests rejected when the data are truly not Benford), are compared. We considered three sample sizes, n = 100, n = 1,000 and n = 10,000. Ten thousand (N = 10,000) random samples were generated from each of the distributions listed in Table 2, which are alternative distributions that could reasonably be expected to arise in practice. The continuous distributions listed are commonly used, and Rodriguez (2004) [10] studies and tabulates the first significant digit probabilities for each of them. The "contaminated" distributions arise from contaminating one digit by γ, the amount specified in the table. Each digit is contaminated in turn, increasing that digit's Benford probability by γ, and the remaining digit probabilities are then rescaled so that all sum to one. This type of distribution was found to arise in practice, for example, when one specific accounting transaction had been processed many times. The Generalized Benford's Law [24] for the first digit, $D_1$, is

$$\Pr(D_1 = d_1) = \frac{(d_1 + 1)^{\gamma} - d_1^{\gamma}}{10^{\gamma} - 1}, \quad d_1 = 1, 2, \ldots, 9, \; \gamma \neq 0, \quad (4)$$

which tends to Benford's Law as γ → 0 and was found to approximate the distribution of first digits of southern California earthquake magnitudes. The Uniform/Benford mixture distribution could arise if a proportion γ of the data is generated (possibly fabricated) from a first-digit uniform distribution while the remainder of the data conforms to Benford.

1. Table 3 shows the proportion of samples of size n rejected at the 0.05 level when the generating distribution is Benford. With N = 10,000 replications, the margin of error (two standard errors) is 0.004, and all test statistics except the LR statistic with n = 100 show acceptable size (Type I error rate); that is, the proportions rejected are close to 0.05 when the generating distribution is Benford.
2. We investigated the empirical power, defined as the proportion of N = 10,000 samples which reject the null hypothesis of Benford at the 0.05 level, for each of the test statistics and alternative distributions given in Table 2. All test statistics have excellent power for detecting the discrete and continuous uniform alternatives for all n and the results are not shown here.
3. Simulated power for the Normal(13,400) alternative is given in Fig 1(a); the results are very similar for Normal(0,1). All statistics have good power for large n, and $U_d^2$ has the largest power for n = 100. Fig 1(c) displays results for the Lognormal(2,1), where none of the statistics have much power for n = 100 or even n = 1,000, but the CvM statistics, especially $U_d^2$, have good power to detect Lognormal(2,1) departures from Benford when n = 10,000. None of the statistics have power to detect Lognormal(2,9) alternatives to Benford (not shown here) because, as Rodriguez (2004) [10] notes, the first digit distribution of Lognormal(2,9) variates is essentially Benford.

4. Fig 2(a) and 2(c) graph the simulated power for the Exponential(.2) and Cauchy(1) generating distributions, respectively. The CvM statistics perform better than Pearson's chi-square and LR statistics for the Exponential(.2) distribution, and $U_d^2$ performs better for the Cauchy(1) distribution.

5. Fig 4(a) and 4(b) display the simulated power for Generalized Benford Eq (4) simulated data for n = 100 and 1,000. Note that the Generalized Benford distribution tends to Benford as γ tends to 0, and we expect the proportion rejected to be approximately 0.05 when γ = 0. $A_d^2$, $W_d^2$ and $U_d^2$ have the largest power; however, for n = 10,000, all tests perform very well (results not shown).

6. Results for the Uniform/Benford mixture distributions are given in Fig 5(a) and 5(b) for n = 100 and 1,000, since all tests perform well for n = 10,000. As the proportion, γ, of uniform data increases, the power of all of the tests increases.
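The alternative first-digit distributions of Table 2 used above can be sketched as follows (function names ours; the Generalized Benford parameterization is that of Eq (4)):

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def contaminated(digit, gamma):
    """Raise one digit's Benford probability by gamma; rescale the others."""
    rest = 1 - BENFORD[digit - 1]
    p = [pi * (rest - gamma) / rest for pi in BENFORD]
    p[digit - 1] = BENFORD[digit - 1] + gamma
    return p

def generalized_benford(gamma):
    """Generalized Benford first-digit law Eq (4), gamma != 0;
    tends to Benford's Law as gamma -> 0."""
    return [((d + 1) ** gamma - d ** gamma) / (10 ** gamma - 1)
            for d in range(1, 10)]

def uniform_benford_mixture(gamma):
    """Proportion gamma of first-digit-uniform data mixed with Benford."""
    return [gamma / 9 + (1 - gamma) * pi for pi in BENFORD]
```

Each function returns a probability vector over the digits 1-9 from which multinomial samples can be drawn.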

Results: Simultaneous Confidence Intervals
In this section, we assess the performance of simultaneous confidence intervals for testing conformance with Benford's Law. We do this by generating N = 10,000 samples from the distributions given in Table 2 and observing, for each sample, whether the nine Benford probabilities all fall within the set of simultaneous intervals computed for that sample.
1. Table 4 shows the estimated coverage probabilities, that is, the proportions out of the N = 10,000 replications for which nominal 95% simultaneous confidence intervals cover all nine Benford probabilities.

2. To study the power of the simultaneous confidence intervals, we graph the proportion of samples that do NOT simultaneously cover the Benford probabilities, or one minus the coverage proportion, since this is analogous to power computed for test statistics. For frequencies generated under the discrete and continuous uniform distributions, all intervals (except Quesenberry) perform well, since almost none of the sample confidence intervals simultaneously cover the set of Benford probabilities (results not shown here).
3. Results for the Normal(13,400) are shown in Fig 1(b) and are very similar to those for Normal(0,1). All intervals have good power for large n, and the Sison intervals have the best power for n = 100. Fig 1(d) displays results for the Lognormal(2,1), where none of the intervals have much power for n = 100 or even n = 1,000, but all but the Quesenberry and Fitz intervals have some power to detect Lognormal(2,1) departures from Benford when n = 10,000. None of the intervals have power to detect Lognormal(2,9) departures from Benford (not shown here).

4. Fig 2(b) and 2(d) graph the simulated power for the Exponential(.2) and Cauchy(1) generating distributions. The Fitz and Quesenberry intervals do not perform as well as the other intervals for these alternatives.

Examples
The following examples demonstrate applications of the tests and simultaneous confidence intervals studied in this paper in assessing conformance of real data to Benford's Law.

Genome Data

We have attempted to replicate Friar's findings [25] using the 2013 GOLD database. In the summer of 2013, the GOLD database held completed sequences for 121 Eukaryotes with their corresponding numbers of ORFs and total genome sizes. Table 5 displays the observed first-digit relative frequencies and the Goodman simultaneous confidence interval values. Table 6 lists p-values for the tests studied in this paper, and Fig 7 displays the Goodman simultaneous confidence intervals. $U_d^2$ is consistent with Pearson's chi-square and the LR test, all rejecting the hypothesis of Benford at the α = 0.05 level. From Fig 7 and Table 5, we note that the frequency of the first digit 5 is larger than expected under Benford; however, examination of Fig 7 indicates that it is quite close to the Benford probability, and the difference can be deemed practically insignificant.

Rodriguez Data
Rodriguez (2004) [10] analyzes 10 financial datasets, which we re-analyze using the proposed tests and simultaneous confidence intervals. The series are: net income (NI) and betas (Betas) from the Disclosure Global Researcher SEC database; the annual market rates of return (Mkt Return) from Ibbotson Associates' Stocks, Bonds, Bills, and Inflation yearbooks; the gross national product (GNP) from the 1998 World Bank Atlas; and the group of initial public offering (IPO) data: initial price (IPO Price), number of shares (IPO Shares), and total dollar value. In the figures, the Goodman simultaneous confidence intervals are drawn as vertical lines and the red crosses are the Benford probabilities. The widths of the interval estimates clearly display the precision of the confidence interval estimates, which is a function of the sample size. The graphs provide clear indications of which digits in the data sets are not consistent with Benford: those for which the crosses do not intersect the vertical lines. We note that GNP and deltaDJ/DJ are not statistically consistent with Benford; however, from the graph, they appear to be practically consistent with Benford since the Benford probabilities are very close to the intervals.

Discussion
In this paper we proposed and evaluated methods of testing conformance with Benford's Law. From the simulation study, we observed that Pearson's chi-square test does not have the greatest power under all alternatives and that the discrete CvM statistics often perform very well. The simulation study also confirmed that separate 100(1 − α)% binomial confidence intervals reject the hypothesis of Benford too often for truly Benford data, and they should not be used for this problem. The analyses of the genomic and financial data led to findings that were consistent with those of the simulation study.
As a result of our study, we make the following recommendations:

1. To assess conformance with Benford's Law, investigators should perform statistical tests; the CvM statistic $U_d^2$ is recommended, and if contamination is expected in the larger values of the first significant digit, Pearson's chi-square statistic should also be computed.
2. Visual inspection of the data is crucial for any dataset, and we recommend plotting simultaneous confidence intervals: they are useful for understanding the nature of departures from Benford's Law, and they are also a useful tool for understanding the precision inherent in the data. The Goodman and Sison simultaneous intervals perform best in our study; if computational resources are an issue, then we recommend that the Goodman simultaneous intervals be computed and plotted.
The work presented here applies to the first significant digit. It is extended to the first m > 1 digits in Wong (2010) [26]. Asymptotic power approximations are provided in Lesperance (2015) [27] which an investigator can use to perform sample size calculations to ensure that a study is adequately powered. R code for both is available.