
Empirical Bayes factors for common hypothesis tests

Abstract

Bayes factors for composite hypotheses have difficulty in encoding vague prior knowledge, as improper priors cannot be used and objective priors may be subjectively unreasonable. To address these issues I revisit the posterior Bayes factor, in which the posterior distribution from the data at hand is re-used in the Bayes factor for the same data. I argue that this is biased when calibrated against proper Bayes factors, but propose adjustments to allow interpretation on the same scale. In the important case of a regular normal model, the bias in log scale is half the number of parameters. The resulting empirical Bayes factor is closely related to the widely applicable information criterion. I develop test-based empirical Bayes factors for several standard tests and propose an extension to multiple testing closely related to the optimal discovery procedure. When only a P-value is available, an approximate empirical Bayes factor is 10p. I propose interpreting the strength of Bayes factors on a logarithmic scale with base 3.73, reflecting the sharpest distinction between weaker and stronger belief. This provides an objective framework for interpreting statistical evidence, and realises a Bayesian/frequentist compromise.

Introduction

Bayes factors contrast the statistical evidence for two hypotheses, updating their relative plausibility in the light of data [1]. In the recent debates about testing, they offer an approach that avoids some of the difficulties with P-values [2–5]. However, it is well known that Bayes factors for composite hypotheses are sensitive to the prior distribution of parameters [6, 7], and because improper priors cannot be used, it is not obvious how to encode vague prior knowledge [8–10]. For these reasons, Bayes factors are generally used with subjective, informative priors, or with objective priors that represent concepts of minimal information [1, 11–15].

Subjective priors are logically sound but perennially open to criticism; they may be supplemented by sensitivity analysis, but that may be tedious and distracting in routine applications. Objective priors offer the prospect of standardised analyses; however, they do not reflect a genuine state of knowledge and may be at odds with reasonable assumptions about the subject matter [8, 16, 17]. They may also be at odds with future studies that use knowledge learned from the data at hand, playing into concerns about reproducibility [18]. These issues have inhibited the use of Bayes factors with the common, textbook statistics upon which much applied research depends.

This paper aims to bridge this gap by constructing objective Bayes factors with priors that are subjectively reasonable: that is, they can be considered compatible with the subject matter. I take the view that from a position of complete ignorance, a reasonable prior can only be inferred from the data at hand. One approach is to use part of the data to learn the prior and the remainder to evaluate the Bayes factor, an idea developed in the fractional and intrinsic Bayes factors [19, 20]. However, there is no good general rule for dividing the data, and different allocations may suit different problems. On the other hand, the posterior Bayes factor [21] infers the prior from the entire data and then evaluates the Bayes factor in the same data. This approach was criticised for using the data twice, but that criticism only holds sway if the posterior Bayes factor is held to an agreed calibration; it is otherwise unclear how inferences are distorted (see discussion and rejoinder of Aitkin [21]). The proposal is to compare the posterior Bayes factor to the Bayes factor from an independently supplied prior with the same information. It will be seen that the posterior Bayes factor tends to over-estimate the evidence compared to such a proper Bayes factor, and it is therefore biased in that respect.

Here I develop the posterior Bayes factor in two ways. Firstly I correct this bias to give empirical Bayes factors (EBFs) that are interpretable on the same scale as ordinary Bayes factors. Secondly I calculate EBFs from summary test statistics, giving test-based Bayes factors [22]. This allows standard routine analyses to proceed as usual, with the EBF calculated at the same stage as a P-value.

Similar ideas occur in posterior predictive approaches to model selection, including Bernardo’s Bayesian reference criterion (BRC) [23] and Watanabe’s widely applicable information criterion (WAIC) [24, 25]. Indeed for regular normal models, the EBF is equivalent to the difference in WAIC under a vague prior. The EBF differs in explicitly seeking a Bayes factor interpretation. The focus on computing EBFs from summary statistics, and in multiple testing applications, also leads to differences from BRC and WAIC.

The EBF is empirical Bayes in that it is framed as a Bayesian quantity with prior estimated from the data. Empirical Bayes ideas have been applied to multiple testing for some time [26], but with the focus on posterior probabilities rather than the Bayes factor. Storey’s optimal discovery procedure [27] can be interpreted as an EBF, although it was developed as a frequentist statistic and does not explicitly estimate a prior distribution. The Bayesian decision procedure of Guindani et al. [28] also resembles an EBF. Closest to the approach developed here is work by Morisawa et al. [29], who constructed EBFs for multiple tests from a non-parametric estimate of the prior distribution. But they did not interpret their EBF beyond comparing its ranking of associations to that from P-values. Thus, while concepts related to EBFs have been in the multiple testing literature for some time, there has been little effort to interpret them in the same way as ordinary Bayes factors.

I emphasise the Bayes factor interpretation throughout, but one might ask why this is desirable. Although Bayes factors update prior odds to posterior odds, it is uncommon for either prior or posterior odds to be appraised precisely. More usually the strength of evidence is assessed qualitatively by the order of magnitude of the Bayes factor. But proposed interpretations with base 10 [9, 30, 31], natural logs [1] and base 2 [32] are based on subjective experiences, undermining the aim of an objective Bayes factor. Here I propose interpreting Bayes factors with base 3.73, using mathematical arguments flowing from Bayes’ theorem. Such an objective scale of interpretation is lacking from other accounts of evidence [33, 34], and gives a rationale for the Bayes factor interpretation separate from more philosophical arguments. With this scale, the EBF yields a surprising re-interpretation of classical significance testing, realising a compromise between Bayesian and frequentist paradigms.

Single tests

The Bayes factor in favour of hypothesis H0 over H1, in light of data x, is

B01 = Pr(x|H0) / Pr(x|H1)

It updates prior probability to posterior probability according to the odds version of Bayes’ theorem,

Pr(H0|x) / Pr(H1|x) = B01 × Pr(H0) / Pr(H1)

Suppressing the subscript, when H describes a set ΘH of parameter values in a model for x, the marginal likelihood is

Pr(x|H) = ∫_{ΘH} f(x|θ) πH(θ) dθ

where f(x|θ) is the density of x given θ, and πH(θ) is the prior density of θ∈ΘH under H (similar definitions hold with probabilities for discrete x or θ). Various terms are used for Pr(x|H); here it will be called the (prior) marginal likelihood.

Let πH(θ|x) denote a posterior density for θ∈ΘH in light of x. Take an unrestricted prior π(θ) truncated to ΘH, so that πH(θ|x) is similarly the truncated π(θ|x). The posterior marginal likelihood is given by

MH(x) = ∫_{ΘH} f(x|θ) πH(θ|x) dθ     (1)

equal to the posterior predictive density of x itself. The posterior marginal likelihood may be biased, in the sense of being systematically different from a marginal likelihood with the prior for θ supplied independently of X. To fix this idea, consider the marginal likelihood using, as the prior, the posterior distribution from independent replicate data. Such data are understood in the frequentist sense as exchangeable with the data at hand, in particular having the same sample size. Given replicate data y, the bias in log MH(x) is

bH(y) = log MH(x) − log ∫_{ΘH} f(x|θ) πH(θ|y) dθ

The expected bias EY bH(Y) is taken over all replicate data, and the bias-corrected posterior marginal likelihood is

M̃H(x) = exp{log MH(x) − EY bH(Y)}

The EBF is the ratio of the bias-corrected posterior marginal likelihoods,

EBF01 = M̃H0(x) / M̃H1(x)     (2)

Calculation of the bias requires a distribution for X and Y. I will use the joint prior predictive distribution

Pr(x, y) = ∫ f(x|θ) f(y|θ) π(θ) dθ

where θ is now not restricted to ΘH. In other words I consider the evidence for H under all data X, Y that may arise under model f(⋅) whether or not H is true. The expected bias is then explicitly

EY bH(Y) = ∫∫∫ bH(y) f(x|θ) f(y|θ) π(θ) dθ dx dy     (3)

where the outer integrals are over the entire domains of f(⋅) and π(⋅) respectively. Changing the order of integration gives

EY bH(Y) = ∫ {∫∫ bH(y) f(x|θ) f(y|θ) dx dy} π(θ) dθ     (4)

This will be useful when the integral in x and y does not depend on θ, in which case the outer integral over π(θ) may be discarded. This occurs in most of the following subsections.

The prior π(θ) enters the EBF both through the marginal likelihood and the bias correction. The approach to identifying priors will aim both to express vague information about θ and to produce a reasonable prior predictive distribution for the bias correction [35, 36]. This position is especially relevant for the binomial probability.

The EBF may be viewed as a bias-corrected posterior Bayes factor, or as the expected Bayes factor (in log scale) if the prior were provided from independent replicate data. Another interpretation is the expected Bayes factor (in log scale) if the posterior distribution from the data at hand were taken as the prior in independent replicate data. This reflects what might actually be done in practice, and in that respect the EBF can be viewed as a reproducible measure of evidence.
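As a concrete check of this construction, the following sketch (Python; illustrative code, not from the original analysis) estimates the expected bias by Monte Carlo for the simplest case treated below, a normal mean with known variance and a flat unrestricted prior. It anticipates the normal-theory result (Eq 6) that the expected bias is half the dimension.

```python
# Monte Carlo illustration of the bias b_H(y) in Eqs (1)-(4), for a normal
# mean with known variance and flat unrestricted prior (illustrative code,
# not from the paper).
import numpy as np
from scipy import stats

def log_marg_lik(x, y, sigma=1.0):
    # Flat prior: the posterior from y is N(y, sigma^2), so the marginal
    # likelihood of x under that posterior is the N(y, 2*sigma^2) density.
    return stats.norm.logpdf(x, loc=y, scale=np.sqrt(2.0) * sigma)

rng = np.random.default_rng(1)
theta = rng.normal(0.0, 3.0, size=200_000)   # the bias is free of theta here
x = rng.normal(theta, 1.0)                   # data at hand
y = rng.normal(theta, 1.0)                   # independent replicate
b = log_marg_lik(x, x) - log_marg_lik(x, y)  # prior re-used vs replicate prior
print(b.mean())                              # ~0.5 = d/2, anticipating Eq (6)
```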

Normal theory

Let the data be summarised by a single vector observation X such that the likelihood for the mean is ϕd(x; μ, Σ), the multivariate normal density of dimension d with mean vector μ, and the variance-covariance matrix Σ assumed known and of full rank. Here X may be any quantity with an assumed normal distribution, especially large sample parameter estimates. With an improper flat prior for μ, the posterior is ϕd(μ; x, Σ). The numerator of the posterior marginal likelihood is

∫_{ΘH} ϕd(x; μ, Σ) ϕd(μ; x, Σ) dμ     (5)

For the unrestricted hypothesis H:μ∈ℝ^d, the posterior marginal likelihood is just ϕd(x; x, 2Σ) = (4π)^{−d/2}|Σ|^{−1/2}, and the posterior predictive density of y is ϕd(y; x, 2Σ). Then the bias is

bH(y) = log ϕd(x; x, 2Σ) − log ϕd(x; y, 2Σ) = (x−y)^T Σ^{−1} (x−y) / 4

The expected bias is

EY bH(Y) = E{(X−Y)^T Σ^{−1} (X−Y)} / 4 = d/2     (6)

since X−Y has variance matrix 2Σ under the joint prior predictive. In a regular normal model, the expected bias is half the dimension, irrespective of μ. The bias-corrected posterior marginal likelihood is

M̃H(x) = (4π)^{−d/2} |Σ|^{−1/2} e^{−d/2}

On the deviance scale this is

−2 log M̃H(x) = −2 log ϕd(x; x, Σ) + {1 + log(2)} d     (7)

that is, minus twice the maximised log-likelihood penalised by the number of parameters times 1+log(2) = 1.69. This resembles the Akaike information criterion (AIC), but with a weaker penalty, and in fact it is equal to WAIC in the case of a regular normal model with asymptotically flat prior [24, 25]. This is unsurprising since the posterior marginal likelihood is also the posterior predictive density of x itself, which the WAIC adjusts to the posterior predictive density in new data. Therefore, in model selection the penalty of 1.69d gives a WAIC-like criterion interpretable on the Bayes factor scale. However, the focus on test-based EBFs leads to differences from WAIC in general.

Now consider a directional hypothesis, for example scalar H:μ>0. The posterior distribution is truncated to (0, ∞) and the expected bias is the analogue of Eq (4) with the integrals over μ restricted to (0, ∞), which now depends on the prior predictive distribution of X and Y. Note that for sufficiently large X, ϕ(X; μ, σ²) is arbitrarily close to zero for μ<0 and the integrals in the expected bias approach those for the unrestricted hypothesis. In the limit of a flat prior on μ this occurs with probability 1/2. Similarly with probability 1/2, −X is large enough that ϕ(X; μ, σ²) is arbitrarily close to zero for μ>0 and the integrals then approach zero. The expected bias is thus one half of that for the unrestricted hypothesis, 1/4.

In some one-sided problems, values of μ<0 (say) are impossible by definition. In these cases the above argument for –X does not apply and the expected bias is as for the unrestricted hypothesis.

In the multivariate case the expected bias can be taken as d1/4 + d2/2, where d1 is the number of one-sided hypotheses and d2 is the number of two-sided hypotheses for the elements of μ. Note that there is no bias for finite interval or point hypotheses.

From Eqs (2), (5) and (6) it is easy to construct EBFs for two-sided, one-sided and interval hypotheses about a scalar mean. For the two-sided test of H0:μ = 0 versus H1:μ≠0, the EBF in favour of H0 is

EBF01 = √2 exp{(1 − z²)/2}     (8)

where z = x/σ.

For H0:μ = 0 versus the one-sided alternative H1:μ>0 the EBF is

EBF01 = √2 exp(1/4 − z²/2) Φ(z)/Φ(√2z)     (9)

assuming that μ<0 is also possible. If μ<0 is impossible then the EBF is

EBF01 = √2 exp(1/2 − z²/2) Φ(z)/Φ(√2z)     (10)

For H0:μ<0 versus H1:μ>0, the EBF is

EBF01 = {Φ(−√2z) Φ(z)} / {Φ(−z) Φ(√2z)}     (11)

Some implications of these results are discussed below in the Case studies.

The multivariate version of Eq (8) is

EBF01 = 2^{d/2} exp{(d − z²)/2}     (12)

where z² = x^T Σ^{−1} x. This corresponds to a χ² test: z² has a central χ² distribution with d degrees of freedom under H0 and a non-central χ² distribution under H1.
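The normal-theory EBFs are simple enough to transcribe directly; a minimal sketch, assuming the forms of Eqs (8)–(12) as reconstructed above, with z supplied by the analyst:

```python
import numpy as np
from scipy.stats import norm

def ebf_two_sided(z):
    """Eq (8): EBF for H0: mu = 0 vs H1: mu != 0, with z = x/sigma."""
    return np.sqrt(2) * np.exp((1 - z**2) / 2)

def ebf_one_sided(z, negative_possible=True):
    """Eqs (9)-(10): EBF for H0: mu = 0 vs H1: mu > 0."""
    bias = 0.25 if negative_possible else 0.5
    return np.sqrt(2) * np.exp(bias - z**2 / 2) * norm.cdf(z) / norm.cdf(np.sqrt(2) * z)

def ebf_directional(z):
    """Eq (11): EBF for H0: mu < 0 vs H1: mu > 0."""
    return (norm.cdf(-np.sqrt(2) * z) * norm.cdf(z)) / (norm.cdf(-z) * norm.cdf(np.sqrt(2) * z))

def ebf_chi2(z2, d):
    """Eq (12): multivariate two-sided EBF with z2 = x' Sigma^{-1} x."""
    return 2**(d / 2) * np.exp((d - z2) / 2)

# The two-sided EBF favours H0 exactly when z^2 < 1 + log 2:
z_crit = np.sqrt(1 + np.log(2))
print(ebf_two_sided(z_crit), 2 * norm.cdf(-z_crit))   # 1.0 and ~0.193
```

The final lines verify that the two-sided EBF crosses 1 exactly at z² = 1 + log 2, the P-value 0.193 boundary discussed in the Case studies.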

t-tests

In the following subsections I consider scalar quantities in some commonly used tests. The principles for constructing their EBFs may be applied to other tests not treated here.

Let X and S be observed such that for some μ, (X−μ)/S is distributed as t with ν degrees of freedom. The location μ is the parameter of interest, with likelihood fν{(x−μ)/s}/s, where fν denotes the density of the t distribution with ν degrees of freedom.

With a flat prior for μ, the posterior is fν{(μ−x)/s}/s. The numerator of the posterior marginal likelihood is

∫_{ΘH} s^{−2} fν{(x−μ)/s} fν{(μ−x)/s} dμ     (13)

The expected bias is

EY bH(Y) = E[log ∫ fν{(x−μ)/s} fν{(μ−x)/s} dμ − log ∫ fν{(x−μ)/s} fν{(μ−y)/s} dμ]     (14)

for the unrestricted hypothesis, irrespective of μ; it is evaluated by numerical integration, with values for ν = 1,…,10 shown in Table 1. The bias is greater than for the normal distribution, but decreases to the asymptotic value of 1/2 as ν increases; for ν = 30, the expected bias is 0.513. As with the normal distribution, the bias is halved for directional hypotheses and vanishes for interval hypotheses.

Table 1. Expected bias EYbH(Y) of the log posterior marginal likelihood for the t distribution with ν degrees of freedom.

https://doi.org/10.1371/journal.pone.0297874.t001

This formulation accommodates the most common applications of t-tests. For tests of the mean, X is the sample mean (or for two samples, the difference in means) and S its estimated standard error. Similarly, for Wald tests of regression parameters, X is the parameter estimate and S its estimated standard error. Two-sided tests of H0:μ = μ0 take ΘH1 = (−∞, ∞), and one-sided tests take ΘH1 = (μ0, ∞) or (−∞, μ0) as appropriate, with the truncation integrals in Eq (13) calculated from the cumulative distribution function of t.
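The t-test quantities can be evaluated numerically along the lines of Eqs (13) and (14); the following Monte Carlo sketch (illustrative, with a modest simulation size) reproduces the reported behaviour of the expected bias, which approaches 1/2 and is about 0.513 at ν = 30.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def log_m(x, y, nu, lo=-np.inf, hi=np.inf, s=1.0):
    # Marginal likelihood of x when the prior is the (truncated) flat-prior
    # posterior from y: both factors are t_nu densities on the location mu.
    num, _ = quad(lambda mu: stats.t.pdf((x - mu) / s, nu) *
                             stats.t.pdf((mu - y) / s, nu) / s**2, lo, hi)
    den = stats.t.cdf((hi - y) / s, nu) - stats.t.cdf((lo - y) / s, nu)
    return np.log(num) - np.log(den)

def expected_bias_t(nu, n_sim=1000, seed=1):
    # Monte Carlo version of Eq (14): draw x, y from the joint predictive
    # (mu = 0 without loss of generality) and average the bias.
    rng = np.random.default_rng(seed)
    x, y = stats.t.rvs(nu, size=(2, n_sim), random_state=rng)
    return np.mean([log_m(xi, xi, nu) - log_m(xi, yi, nu) for xi, yi in zip(x, y)])

print(expected_bias_t(30))   # ~0.51 within Monte Carlo error
```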

Binomial probability

Consider a fixed number n of Bernoulli trials with x successes. The likelihood for the probability p is Binomial(n, p). Take the usual Beta(α, α) prior for p, with posterior Beta(x+α, n−x+α); I discuss the choice of α below. The numerator of the posterior marginal likelihood is

∫_{ΘH} C(n,x) p^x (1−p)^{n−x} fB(p; x+α, n−x+α) dp     (15)

where fB(p; α, β) denotes the Beta(α, β) density at p and C(n,x) the binomial coefficient. For the expected bias, the integrals over data in Eq (3) become sums over x and y, weighted by the joint prior predictive probability

Pr(x, y) = ∫₀¹ C(n,x) C(n,y) p^{x+y} (1−p)^{2n−x−y} π(p) dp     (16)

There are good arguments for different values of α [37, 38]. Here I prefer the uniform prior (α = 1), as it is consistent with a uniform prior predictive Pr(x) = (n+1)^{−1} [36, 39]. This is desirable because the expected bias explicitly depends on Pr(x), for which the uniform distribution is a natural non-parametric choice. The expected bias for n = 1,…,10, calculated from Eq (16) for ΘH = [0,1] and ΘH = [0,0.5], is shown in Table 2.

Table 2. Expected bias EYb(Y) of the log posterior marginal likelihood of the binomial distribution with n trials, with uniform prior.

https://doi.org/10.1371/journal.pone.0297874.t002

As n increases, the expected bias approaches the normal theory values of 0.5 for the full interval and 0.25 for the half interval. The expected bias for ΘH = [0.5,1] is the same as for ΘH = [0,0.5] so that in a comparison of H0:p≤0.5 to H1:p>0.5, the bias cancels.
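Because x and y are discrete, the expected bias can be reproduced by direct summation; a sketch under the uniform prior (α = 1), using the closed Beta form of the joint prior predictive in Eq (16):

```python
import numpy as np
from scipy.special import betaln, gammaln
from scipy.stats import beta

def log_comb(n, k):
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def expected_bias_binom(n, lo=0.0, hi=1.0):
    """Exact expected bias for Theta_H = [lo, hi] under the uniform prior."""
    def log_m(x, y):
        # log of int_lo^hi Binom(x; n, p) Beta(p; y+1, n-y+1) dp, normalised
        # by the truncated posterior mass on [lo, hi].
        a, b = x + y + 1, 2 * n - x - y + 1     # Beta kernel of the product
        num = (log_comb(n, x) + betaln(a, b) - betaln(y + 1, n - y + 1)
               + np.log(beta.cdf(hi, a, b) - beta.cdf(lo, a, b)))
        den = np.log(beta.cdf(hi, y + 1, n - y + 1) - beta.cdf(lo, y + 1, n - y + 1))
        return num - den

    total = 0.0
    for x in range(n + 1):
        for y in range(n + 1):
            # Joint prior predictive Pr(x, y) under the uniform prior (Eq 16)
            log_pr = (log_comb(n, x) + log_comb(n, y)
                      + betaln(x + y + 1, 2 * n - x - y + 1))
            total += np.exp(log_pr) * (log_m(x, x) - log_m(x, y))
    return total

print(expected_bias_binom(10))            # full interval, approaching 1/2
print(expected_bias_binom(10, 0.0, 0.5))  # half interval, approaching 1/4
```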

If instead the number of successes is fixed at x and the total number of trials n is random, similar calculations yield EBFs for the negative binomial distribution. In general, for given x and n the EBF is different for binomial and negative binomial models. This dependence on the sampling model is a violation of the likelihood principle, but without a sampling model one cannot identify a bias in the posterior Bayes factor. In practice, if the sampling process is uncertain one may use Bayesian model averaging to reduce the dependence on the sampling model. In its most objective form this simply calculates the bias-corrected posterior marginal likelihoods for each possible model and then takes their arithmetic mean.

F-tests

Let X be observed such that for some r>0, rX is distributed as F with (ν1, ν2) degrees of freedom. The likelihood for r is r fF(rx; ν1, ν2), where fF denotes the F density. Take the usual improper prior for a scale parameter π(r)∝r^{−1}, equivalent to a flat prior on log r. The posterior density is x fF(rx; ν1, ν2). The numerator of the posterior marginal likelihood is

∫_{ΘH} r fF(rx; ν1, ν2) x fF(rx; ν1, ν2) dr     (17)

Similar to the t-test, the expected bias

EY bH(Y) = E[log ∫ r fF(rx) x fF(rx) dr − log ∫ r fF(rx) y fF(ry) dr]     (18)

is calculated by numerical integration, irrespective of r. Some values are shown in Table 3.

Table 3. Expected bias of the log posterior marginal likelihood for the F distribution. Numerator degrees of freedom in rows, denominator degrees of freedom in columns.

https://doi.org/10.1371/journal.pone.0297874.t003

A common application of F-tests is in the analysis of variance, in which the tests are one-sided. In the present notation the hypotheses are H0:r = 1 versus H1:r<1. The posterior marginal likelihood is then

MH1(x) = {∫₀¹ r fF(rx) x fF(rx) dr} / F(x)     (19)

where F denotes the cumulative distribution function. Here values of r>1 are implausible by construction, so following a similar argument to the normal theory, the expected bias is the same as for the unrestricted hypothesis.
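The same Monte Carlo recipe applies to the F distribution; a sketch of the expected bias calculation (Eq 18), with r = 1 taken without loss of generality:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def log_m_f(x, y, df1, df2):
    # Marginal likelihood of x with the posterior from y as prior:
    # likelihood r*f(rx), posterior y*f(ry); unrestricted, so no truncation.
    val, _ = quad(lambda r: r * stats.f.pdf(r * x, df1, df2) *
                            y * stats.f.pdf(r * y, df1, df2), 0, np.inf)
    return np.log(val)

def expected_bias_f(df1, df2, n_sim=500, seed=1):
    # Monte Carlo version of Eq (18), drawing x, y from the predictive at r = 1.
    rng = np.random.default_rng(seed)
    x, y = stats.f.rvs(df1, df2, size=(2, n_sim), random_state=rng)
    return np.mean([log_m_f(xi, xi, df1, df2) - log_m_f(xi, yi, df1, df2)
                    for xi, yi in zip(x, y)])

print(expected_bias_f(2, 20))   # estimates the corresponding entry of Table 3
```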

P-values

In some situations only a P-value may be available, such as in non-parametric tests based on ranks. Here an EBF can be obtained by assuming that P-values arise from a Beta distribution. This assumption may not hold in reality, and such EBFs should be viewed as no more than a rough measure. However, the following result provides a useful rule of thumb.

Previously, Sellke et al. [40] obtained the lower bound of −e p log (p) for the Bayes factor when p<e−1, assuming that the P-value follows a Beta(α, 1) distribution with α<1. Held and Ott [41] considered a Beta(1,β) distribution with β>1, obtaining a smaller lower bound of −e(1−p)log(1−p). Here I adopt the latter model as it can accommodate a flat prior and yields an expected bias that is fairly constant over β.

Let p be a P-value in [0,1]. The likelihood for β is f(p|β) = β(1−p)^{β−1}. Under a flat prior for H:β>1 the posterior is Gamma(2, −log(1−p)) and the posterior marginal likelihood is

MH(p) = ∫₁^∞ β(1−p)^{β−1} πH(β|p) dβ     (20)

where πH(β|p) is the posterior density restricted to β>1.

Writing λ = −log(1−p), calculus gives the numerator as e^{−λ}{λ/2 + 1/2 + 1/(4λ)} and the denominator as e^{−λ}(1 + λ), so that

MH(p) = (2λ² + 2λ + 1) / {4λ(1 + λ)}

For a fixed value of β, the expected bias (21) is the expectation, over two independent P-values from the Beta(1, β) distribution, of the difference between the log numerator above and the log of the corresponding integral with the posterior taken from the replicate P-value. It is found that EY bH(Y) ≈ 0.92 for a wide range of β, and so this value is taken as a default. Assume that under H0, the P-value is uniformly distributed and the likelihood is 1. Then a P-value of 0.05 equates to an EBF of 1/2.05, compared to the Sellke et al. [40] lower bound of 1/2.45 and the Held and Ott [41] lower bound of 1/7.55.

For small P-values, say less than 0.1, the numerator of MH(p) is dominated by 1/(4λ) and the denominator by 1, giving MH(p) ≈ 1/(4λ) ≈ 1/(4p).

Then with the default bias adjustment the EBF is approximated by

EBF01 ≈ e^{0.92} × 4p ≈ 10p     (22)

Thus a small P-value can be converted into an EBF using a very simple rule of thumb. Indeed, with equal prior probabilities on H0 and H1, the corresponding posterior probability of H0 is 1−(1+10p)^{−1}. When p = 0.05, this probability is 1/3, consistent with the lower bound of around 30% proposed by Berger and Sellke [42]. For p = 0.005, proposed as a revised standard for significance testing [43], the posterior probability of H0 is 4.8%.
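The exact and approximate calibrations can be compared numerically; a sketch, evaluating Eq (20) by quadrature and using the default bias adjustment of 0.92 discussed above:

```python
import numpy as np
from scipy.integrate import quad

def ebf_from_p(p, expected_bias=0.92):
    """EBF in favour of H0 from a P-value, assuming a Beta(1, beta) alternative
    with flat prior on beta > 1; 0.92 is the default bias discussed above."""
    lam = -np.log1p(-p)                            # lambda = -log(1 - p)
    lik = lambda b: b * np.exp(-lam * (b - 1))     # Beta(1, beta) density at p
    post = lambda b: b * np.exp(-lam * b)          # posterior kernel on (1, inf)
    num, _ = quad(lambda b: lik(b) * post(b), 1, np.inf)
    den, _ = quad(post, 1, np.inf)
    m_h1 = num / den                               # posterior marginal likelihood (Eq 20)
    return np.exp(expected_bias) / m_h1            # likelihood is 1 under H0

for p in [0.05, 0.005]:
    print(p, ebf_from_p(p), 10 * p)   # ~0.49 (= 1/2.05) at p = 0.05; cf. the 10p rule
```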

Multiple tests

For multiple tests the principle is again to estimate a posterior distribution from the data, and construct a bias-corrected posterior marginal likelihood for each hypothesis in each test. Now the posterior distribution is estimated from the ensemble of tests, assuming exchangeability under a common random-effects distribution. This is not correcting for multiple testing in the usual sense, but is including more information in each test, potentially improving simultaneous inference in both frequentist and Bayesian paradigms [27, 28]. For simplicity I describe an EBF for scalar parameters, but the generalisation to vectors is straightforward.

Let the data be Xi ~ F(θi; ηi) independently for i = 1,…,m, where F is a known distribution, θi is the parameter of interest in test i and ηi the nuisance parameter assumed known. Let Θ be a common set such that the hypothesis in test i is H(i):θi∈Θ. We want a posterior distribution for θi given that H(i) is true, but H(j) is not necessarily true for j≠i. Let pΘ;i be a common prior probability that θj∈Θ for j≠i. Then take the mixture of single-test posterior distributions

πΘ;i(θ) = {πΘ(θ|xi) + pΘ;i Σ_{j≠i} πΘ(θ|xj)} / {1 + (m−1) pΘ;i}     (23)

The posterior marginal likelihood for test i is

MH(i)(xi) = {∫_Θ f(xi|θ) πΘ(θ|xi) dθ + pΘ;i Σ_{j≠i} ∫_Θ f(xi|θ) πΘ(θ|xj) dθ} / {1 + (m−1) pΘ;i}

where f is the density corresponding to F. The integrals in the numerator are the same as those appearing in the expected bias calculations for single tests, with xj in place of replicate data y.

The bias now involves logs of sums and cannot be easily derived. The following heuristic is proposed: the first term in the numerator is adjusted by the single-test correction from the previous section, while the other terms, involving independent data, are left unchanged. The adjusted posterior marginal likelihood for test i is thus

M̃H(i)(xi) = {e^{−EY bH(Y)} ∫_Θ f(xi|θ) πΘ(θ|xi) dθ + pΘ;i Σ_{j≠i} ∫_Θ f(xi|θ) πΘ(θ|xj) dθ} / {1 + (m−1) pΘ;i}     (24)

This approach is related to the ODP [27]. In the present notation the ODP uses the analogous sum of likelihoods f(xi|θj), involving the true but unknown parameters θj. In practice they must be estimated, here using posterior distributions. I use a single probability pΘ;i for all tests j≠i and treat this as an input parameter, but different probabilities could be specified for each test [44]. The main conceptual differences from the ODP are that I assume H(i) is true for test i, and recognise a bias from double use of xi (Eq 24). These differences arise because I seek a Bayes factor interpretation.

When the number of tests m is large, the contribution of each data point to its own test becomes negligible and pΘ;i almost cancels from Eq (24). In practice the EBF is often insensitive to values of pΘ;i outside a small range near 0. An example is given in the Case Studies.
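A vectorised sketch of this heuristic for normal means may be helpful; the mixture normalisation follows Eqs (23)–(24) as reconstructed above, and the function returns EBFs in favour of H1:

```python
import numpy as np
from scipy import stats

def multi_ebf10(x, sigma=1.0, p_theta=1.0):
    """Heuristic multiple-test EBF in favour of H1: mu != 0 (Eq 24 sketch).
    Each test borrows the flat-prior posteriors N(x_j, sigma^2) of the others."""
    x = np.asarray(x, dtype=float)
    m = len(x)
    # cross[i, j] = int f(x_i | theta) pi(theta | x_j) dtheta = N(x_i; x_j, 2 sigma^2)
    cross = stats.norm.pdf(x[:, None], loc=x[None, :], scale=np.sqrt(2) * sigma)
    own = np.diag(cross) * np.exp(-0.5)       # own-data term, single-test correction
    others = cross.sum(axis=1) - np.diag(cross)
    m_h1 = (own + p_theta * others) / (1 + (m - 1) * p_theta)
    return m_h1 / stats.norm.pdf(x, loc=0.0, scale=sigma)

# 900 null and 100 non-null tests, as in the large-scale case study below
rng = np.random.default_rng(1)
mu = np.concatenate([np.zeros(900), rng.normal(0, 1, 100)])
ebf = multi_ebf10(rng.normal(mu, 1.0))
```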

A related strand of work has focussed on the posterior probability of a point null hypothesis, compared to a general alternative [26, 45]. The observations are assumed to follow a two-groups model

f(X) = π0 f0(X) + (1−π0) f1(X)

where f(X) is the marginal density of X, fi is the density under hypothesis Hi and π0 is the prior probability of H0. Then according to Bayes’ formula the posterior probability is

Pr(H0|X) = π0 f0(X) / f(X)

Since f can be estimated directly from the observations, and f0 may be assumed by theory, there is a straightforward conversion of prior probability to posterior probability. The approach is even more appealing if π0 itself can be estimated from the data [46, 47], for we then directly obtain posterior probabilities of the hypotheses, ostensibly without prior information.

In practice π0 cannot be estimated without some prior assumption. Intuitively, in finite data a model with π0>0 is almost indistinguishable from a model with π0 = 0 and infinitesimally small departures from H0. Thus a degree of subjectivity is inevitable; usually the assumption is that π0 is close to 1. With π0 given, the problem reduces to density estimation for the marginal f(X).

In this model the Bayes factor is implicitly defined as

B01 = f0(X)/f1(X) = (1−π0) f0(X) / {f(X) − π0 f0(X)}

Of note, π0 is the same for all tests and appears both in the Bayes factor and as the prior probability in each test. In contrast, the EBF model for f1 allows pΘ;i to differ for each test and to differ from the prior probability, which does not feature in the EBF. While pΘ;i can be made the same for all tests, it remains a parameter in the Bayes factor and is formally separated from the prior probabilities of the individual tests. If these probabilities are also made equal to pΘ;i then the EBF will lead to the same posterior probability as the two-groups model.

Case studies

Normal mean

The test of a normal mean against a point null is a benchmark in discussions of hypothesis testing. Recall the EBF for the two-sided test of H0:μ = 0 versus H1:μ≠0 (Eq 8),

EBF01 = √2 exp{(1 − z²)/2}

Here z = √n x/σ, where x is the sample mean, n the sample size and σ the known standard deviation.

There is a correspondence between this EBF and the P-value, since p = 2Φ(−|z|). However, the EBF favours H0 when z² < 1+log 2; thus in contrast to P-values and minimum Bayes factors, it can give evidence for either hypothesis. Indeed p = 0.193 when z² = 1+log 2, so any larger P-value corresponds to an EBF in favour of H0 and any smaller one in favour of H1. Fig 1 shows an almost linear relationship in log scale between EBF and P-value. This suggests a degree of resolution to the “irreconcilability of P-values and evidence” [42], for the EBF gives a Bayes factor interpretation to Z and hence P. However, a given P-value corresponds to the same EBF irrespective of sample size, in contrast to ordinary Bayes factors [31, 48, 49]. The notion that a fixed P-value has less evidential value in larger samples is tied to the use of an informative prior for H1. With prior ignorance about H1, the posterior distribution is related to the P-value in such a way that the EBF has no further dependence on sample size.

Fig 1. Calibration of P-values for two-sided test of normal mean.

Solid black line, EBF for normal likelihood. Long dashed red line, P-value EBF with Beta likelihood. Dashed cyan line, Sellke et al. bound on the Bayes factor. Dotted blue line, Bayesian reference criterion. All quantities shown as negative logarithms with base 10.

https://doi.org/10.1371/journal.pone.0297874.g001

Fig 1 also shows the BRC [23], in the present notation (1 + z²)/2. This is also a posterior predictive approach, a key difference from the EBF being that it takes replicate data under the posterior distribution rather than the prior. Consequently the criterion always favours H1, and does so more strongly than the minimum Bayes factor over all priors [50, 51]. It is motivated by estimating Kullback-Leibler divergence rather than a Bayes factor, and is therefore less directly related to Bayesian hypothesis testing than the EBF. The figure also shows the P-value EBF, which is of the same order of magnitude as the normal EBF. The approximation of 10p is clearly seen.

The evidence for H0 is limited, even asymptotically, as EBF01 ≤ √2 e^{1/2} ≈ 2.33, attained when z² = 0. If H0 does hold exactly then Z² follows a χ² distribution with 1 degree of freedom regardless of sample size, with probability 0.193 of exceeding 1+log 2 and incorrectly providing evidence for H1.

This lack of asymptotic consistency may be considered a shortcoming. This has been discussed for related quantities such as AIC, with common responses being that H0 is never exactly true or that consistency is less critical for prediction models [52, 53]. Here I note that although a method may be asymptotically consistent, it may not be accurate at the current sample size. For if, in finite data, the Bayes factor favours a correct H0 with high probability, then for small μ≠0 it will incorrectly favour H0 with almost the same probability. This is shown in Fig 2 for a fixed sample size of 1000. The EBF is compared to the unit information Bayes factor √(n+1) exp{−z²n/(2(n+1))}, which has the prior N(0, σ²) for H1. This is regarded as minimally informative [12] and is approximately equivalent to the difference in Bayesian information (Schwarz) criterion for large n. For μ/σ≥0.15 the unit information Bayes factor is nearly consistent at this sample size, but for small μ/σ it favours H1 with low probability. As μ is unknown, asymptotic consistency may not be relevant to the data at hand.

Fig 2. Sensitivity of Bayes factors for two-sided test of normal mean.

Sensitivity is the probability that the Bayes factor favours the correct model, and is calculated from the χ² distribution with non-centrality parameter nμ²/σ². Sample size is fixed to 1000. Solid black line, EBF as a function of μ/σ. Dashed red line, unit information Bayes factor as a function of μ/σ. Dotted blue line, EBF when μ = 0. Long dashed cyan line, unit information Bayes factor when μ = 0.

https://doi.org/10.1371/journal.pone.0297874.g002
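The sensitivity curves in Fig 2 can be sketched as follows, assuming the unit-information form given above: the EBF favours H1 when z² > 1 + log 2, while the unit information Bayes factor favours H1 when z² > (n+1)log(n+1)/n.

```python
import numpy as np
from scipy.stats import ncx2

n, sigma = 1000, 1.0
thr_ebf = 1 + np.log(2)                  # EBF favours H1 above this z^2
thr_ui = (n + 1) * np.log(n + 1) / n     # unit information BF threshold

for mu in [0.05, 0.10, 0.15]:
    ncp = n * mu**2 / sigma**2           # non-centrality of z^2 under H1
    print(mu, ncx2.sf(thr_ebf, 1, ncp), ncx2.sf(thr_ui, 1, ncp))
```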

The EBF has greater sensitivity against H1 than the unit information Bayes factor, at the cost of fixed sensitivity against H0. This contrast underlies the so-called Jeffreys-Lindley paradox [54, 55]. The dilemma for the practitioner is between consistency and sensitivity, with good arguments for both frequentist and Bayesian readings [5658]. Here the behaviour of the EBF is more similar to frequentist testing, but this emerges from its objective construction rather than a philosophical position.

It is in fact remarkable that, with finite data, we can infer any evidence for the point H0, compared to an infinitesimal departure from it, from a prior position of complete ignorance. The EBF has probability 0.807 of favouring H0 when it is true, and log(EBF01) has expectation (log 2)/2 ≈ 0.35, also favouring H0. So in certain ways it is expected to favour a correct H0, although never overwhelmingly. On the other hand, if data were to accrue without limit, one may as well infer an informative prior from some finite subset and achieve consistent inference in the remainder. An EBF is only needed when data are finite and prior information is absent.

The same discussion applies to the one-sided test of H0:μ = 0 versus H1:μ>0 (Eqs 9, 10) and the χ² test (Eq 12). However, for H0:μ<0 versus H1:μ>0 (Eq 11) the EBF depends only on the signed value of z and will then be consistent for either hypothesis.

The two-sided EBF is approximately 2.33 e^{−z²/2}, which is a constant multiple of the minimum Bayes factor e^{−z²/2} over all priors [50, 51]. It is also similar to the minimum Bayes factor over all priors symmetric about 0, which is approximately 2e^{−z²/2} for z>1.64 [41]. These lower bounds, however, cannot favour H0 and only give a limit on the evidence in favour of H1; therefore they only quantify an absence of evidence and cannot support a conclusion favouring either hypothesis. The EBF by contrast is exact and can provide positive evidence for either hypothesis, albeit limited for H0. The numerical similarity to the lower bounds is reassuring for the accuracy of those bounds. The generalisation of the EBF to multivariate and interval hypotheses, and to non-normal distributions, also goes beyond these previous results.

Limited multiplicity

In the previous section an EBF was proposed for multiple testing, in which each test contributes a posterior distribution to the EBF for every other test. The contribution from a test to its own EBF is adjusted by the single-test correction (Eq 24). Clearly this reduces to the single-test EBF in the case of one test. Here I examine this heuristic EBF for a moderate number of tests.

Tests of a normal mean were simulated with the number of tests m varied from 1 to 10. In each case, three scenarios were simulated for the true means. Firstly, all means were set to zero. Secondly, the means were sampled from N(0,1) independently in each simulated dataset. Thirdly, the means were fixed to a uniform grid on [−5, 5]. In each scenario, 10,000 sample means were simulated for each test assuming a population variance of 1 and a sample size of 100. With each sample mean, an independent replicate value was also simulated, to estimate the bias in each scenario.

Table 4 shows the bias in the log posterior marginal likelihood for H1:μ≠0, with and without the proposed adjustment. As expected, the bias is 0.5 for a single test and decreases towards zero as the number of tests increases. The rate of decrease depends on the true values of the means. The proposed adjustment does a good job of reducing the bias. Qualitatively similar results are observed for different sample sizes and models (not shown).

Table 4. Bias in log posterior marginal likelihood for m tests of a normal mean with H1:μ≠0.

First column shows scenario numbers as described in the main text. Unadj, no adjustment for over-fitting. Adj, adjustment as proposed in Eq 24.

https://doi.org/10.1371/journal.pone.0297874.t004

Table 5 compares the mean square error for the single- and multiple-test EBFs, where the error is the difference in log scale between the posterior marginal likelihood and the marginal likelihood with prior supplied from replicate data. The mean square error is seen to decrease in the multiple-test EBF as the number of tests increases. The decrease is greatest in scenario 1, where all means are equal, whereas there is negligible decrease in scenario 3 where the means are well separated. Similar to the ODP, the EBF performs an adaptive shrinkage towards common effects where present. These results together suggest that the proposed EBF is useful for multiple testing situations.

Table 5. Mean square error in log posterior marginal likelihood for m tests of a normal mean with H1:μ≠0.

First column shows scenario numbers as described in the main text. Single, single-test EBF. Multiple, multiple-test EBF. Error is the difference between the log posterior marginal likelihood and the log prior marginal likelihood with prior from independent replicate data.

https://doi.org/10.1371/journal.pone.0297874.t005

Large-scale multiplicity

In Fig 3, 1000 tests of a normal mean were simulated in which μ = 0 for 900 tests and μ~N(0,1) for 100. The sampling variance is 1. The multiple test EBF shrinks the single test EBFs, reflecting reduced bias for each observed datum.

Fig 3. Single- and multiple-test EBFs.

EBFs are in favour of H1:μ≠0, for 900 simulated tests with μ = 0 and 100 with μ~N(0,1). Black circles, multiple-test EBF with pΘ;i = 1. Blue crosses, multiple-test EBF with pΘ;i = 0.01. The true value is pΘ;i = 0.1. EBFs shown in log10 scale.

https://doi.org/10.1371/journal.pone.0297874.g003

The lower line of circles shows the EBFs calculated with pΘ;i = 1 for all i, that is assuming μ≠0 for all tests. The results are virtually identical using the true value of pΘ;i = 0.1 (not shown). The upper line of crosses shows the EBFs still similar with pΘ;i = 0.01. In this case a default of pΘ;i = 1 would not be misleading when the truth is unknown.

Note that more extreme EBFs can occur as the number of tests increases, and so the usual issues of multiple testing remain. Although the multiple-test EBFs are reduced compared to the single-test version, there is little improvement in the ranking of the tests: the non-zero means have a mean rank of 411.8 for the single-test EBF and 411.28 for the multiple-test, where lower ranks indicate greater evidence against H0. However, the multiple-test EBF selects a higher proportion of true positives over a range of thresholds (Fig 4), in line with the ODP theory.

Fig 4. Proportion of true positives in Fig 3 among all tests with EBF above a threshold.

Solid black line, multiple-test EBF. Dashed blue line, single-test EBF.

https://doi.org/10.1371/journal.pone.0297874.g004

Qualitative interpretation of Bayes factors

Often the order of magnitude of the Bayes factor is of greater interest than its precise value. Several authors have proposed interpretations on the log scale with descriptions such as “moderate” and “strong” evidence. Jeffreys [9] used base 10, with Lee and Wagenmakers [30] and Held and Ott [31] suggesting minor adjustments to his scale. Kass and Raftery [1] used natural logs, while Royall [32] used base 2 for simple likelihood ratios. However, all authors used subjective experience to identify significant points on their scales, and it is unclear which scale is most appropriate for general use.

Here I propose measuring Bayes factors in logarithmic units with base 3.73. Briefly, the argument is the following, with further details in the supplementary material. Define weaker belief to be that which is more easily modified by evidence, and stronger belief to be that which is less easily modified. From Bayes’ theorem, the effect of evidence on belief is the first derivative of the logistic function. The sharpest distinction between weaker and stronger belief occurs where the effect of evidence is changing most rapidly. This is where the third derivative is zero, at probability 1/2 + 1/(2√3), or odds of 2+√3 ≈ 3.73. With that boundary, a Bayes factor of 2+√3 updates any weaker belief to a stronger belief and is said to comprise one unit of evidence.
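The boundary can be verified by elementary calculus; writing σ(x) = 1/(1+e^{−x}) for the posterior probability as a function of the log odds x, a short derivation:

```latex
% Derivatives of the logistic function; the third derivative vanishes where
% the effect of evidence, sigma', is changing most rapidly.
\begin{align*}
\sigma'   &= \sigma(1-\sigma), \qquad \sigma'' = \sigma'(1-2\sigma), \\
\sigma''' &= \sigma'\left\{(1-2\sigma)^2 - 2\sigma(1-\sigma)\right\} = 0
\;\Longrightarrow\; 6\sigma^2 - 6\sigma + 1 = 0, \\
\sigma    &= \tfrac12 + \tfrac{1}{2\sqrt{3}}
\;\Longrightarrow\; \text{odds} = \frac{\sigma}{1-\sigma} = 2+\sqrt{3} \approx 3.73 .
\end{align*}
```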

For the normal theory and P-value EBFs, the P-values corresponding to the first few units of evidence are shown in Table 6. These are remarkably similar to traditional significance levels. The EBF and the proposed units together give an evidential account of classical testing, whereby P = 0.05 is roughly enough evidence to decrease the qualitative belief in H0, P = 0.01 to decrease it further and P = 0.005 further again. Recall also that against even prior odds, the P-value EBF for P = 0.005 yields posterior probability of 4.8% for H0 (Eq 22), supporting the proposal for this level as a general threshold for hypothesis testing [43].

Table 6. P-value calibrations for EBFs corresponding to the first four units of evidence.

Normal mean and χ2 EBFs from Eqs (8) and (12). P-value EBF from the 10p calibration (Eq 22). EBFs shown in favour of H1. The table shows, for example, that for the test of a normal mean a P-value of 0.038 corresponds to an EBF of one unit in favour of the posterior distribution.

https://doi.org/10.1371/journal.pone.0297874.t006

The EBF and the proposed units give some rationale to traditions that have had only intuitive justification. Although this does little to discourage continued use of P-values, I prefer the EBF as the exact calibration is different for each test, and it has a direct evidential interpretation as in the following examples.

Examples

Given the calibrations of EBFs to P-values, the main operational difference is in interpretation. Here are two examples in which the interpretation of P-values has fuelled discussions of inference. In both cases there is little relevant prior information for an ordinary Bayes factor. The EBF seems to provide a direct and transparent interpretation.

The first concerns the use of stents to treat patients with angina [59]. In the first blinded randomized controlled trial of their effectiveness, patients receiving a stent increased their exercise time by 16.6s more than the controls [60]. With standard error 12.96s, z = 1.28 and two-sided P = 0.2, the authors concluded that stents provided no clinical benefit, and discussion ensued on whether clinical guidelines should be revised [61]. As a non-significant P-value is not evidence for H0, such discussion appeared premature [59].

In this case the two-sided EBF (Eq 8) is 1.03, favouring no effect but minimally. A simple interpretation is that the study gives no reason to change existing views on whether stents give any benefit. For a more nuanced interpretation, the authors considered an increase of at least 30s to be a clinically meaningful effect, but the study was only powered to detect an effect of at least 34s. The EBF comparing H0:μ∈(−∞, 30) to H1:μ∈[30, ∞) is 2.29 (Eqs 1 and 5), supporting the hypothesis of no clinically meaningful benefit, though at less than one unit, not to a degree that demands revision of clinical guidelines.
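Both figures can be reproduced from the normal-theory results; a sketch, using Eq (8) for the two-sided test and the directional form (Eq 11) shifted to the 30s threshold for the interval comparison:

```python
from math import exp, sqrt
from scipy.stats import norm

x, se = 16.6, 12.96                      # ORBITA: difference in exercise time (s)
z = x / se                               # 1.28
print(sqrt(2) * exp((1 - z**2) / 2))     # Eq (8): ~1.03 in favour of no effect

zc = (x - 30) / se                       # shift to the 30 s clinical threshold
print(norm.cdf(-sqrt(2) * zc) * norm.cdf(zc) /
      (norm.cdf(-zc) * norm.cdf(sqrt(2) * zc)))   # Eq (11) form: ~2.29
```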

The second example is the identification of the Higgs boson from particle collisions [62, 63]. In simple terms, the production of a new particle was inferred from a count of events 5σ above background expectation (one-sided P = 3×10⁻⁷). The Standard Model predicts an excess of 5.8σ for the Higgs boson and the observations were thus deemed compatible with its presence and incompatible with its absence.

The substantial media interest brought the usual clumsiness in communicating P-values, such as “a 0.00003% probability that the result is due to chance” [64]. For a normal likelihood, the one-sided EBF (Eq 9) for 5σ is 1.48×10⁵ in favour of a new particle; against even prior odds this gives 0.0007% posterior probability of a chance result. We may also calculate the EBF for the Standard Model expectation of 5.8σ versus a two-sided alternative, giving 1.69 in favour of the Higgs boson. Together these EBFs give positive evidence that a new particle was produced and that it is the Higgs boson as opposed to some other phenomenon.
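A corresponding sketch for this example, using the one-sided EBF (Eq 9) for the 5σ excess and the two-sided EBF (Eq 8) for the comparison with the Standard Model expectation:

```python
from math import exp, sqrt
from scipy.stats import norm

z = 5.0                                  # observed excess over background
ebf01 = sqrt(2) * exp(0.25 - z**2 / 2) * norm.cdf(z) / norm.cdf(sqrt(2) * z)
print(1 / ebf01)                         # Eq (9): ~1.48e5 for a new particle

z_sm = 5.0 - 5.8                         # departure from the Standard Model value
print(sqrt(2) * exp((1 - z_sm**2) / 2))  # Eq (8): ~1.69 in favour of the Higgs
```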

Discussion

The present aim is to construct objective Bayes factors with subjectively reasonable priors. Although the approach is motivated by Bayesian learning, the result is in some ways closer to frequentist testing. For tests of a point null there is a fixed asymptotic distribution under H0, allowing control of type-1 error. Like the P-value, the EBF appeals to hypothetical replicate data, and thus violates the likelihood principle, although uncertainty about the sampling model may be mitigated by Bayesian model averaging. EBFs can be calibrated to P-values; while the exact calibration is different for each test, a general calibration of 10p has been proposed (Eq 22).

Several other calibrations of P-values to Bayes factors have been proposed previously, the most well-known being −e p log(p) [40, 41, 65, 66]. These are all lower bounds on the Bayes factor in favour of H0, but are constrained to be at most 1. Consequently, these calibrations cannot produce positive evidence in favour of either hypothesis; at best they say that there is no more than so much evidence against H0. As such, they only quantify an absence of evidence and are not very useful for decision making. In contrast the EBF calibrations are exact, and can favour either hypothesis. They can be viewed as giving a Bayes factor interpretation to the P-value, alleviating some concerns about frequentist testing [5]. Informally, the EBF suggests that for P-values near 1 one may increase belief in H0, but only to a limited extent since there are also realisations of H1 under which such P-values are possible.

Such an interpretation may seem surprising, since the P-value quantifies evidence against but not for H0. However, the P-value is just a function of the data, and can be used to obtain a Bayes factor if H1 is appropriately specified, here as the posterior distribution. Thus the EBF presents a Bayesian/frequentist compromise in that it may be used as a frequentist statistic, having a fixed distribution under a point H0, while a P-value may be calibrated to a Bayes factor in which H1 is the posterior distribution.

This seems reasonable in parametric tests, in which a P-value is often supplemented with a point estimate or confidence interval. Then the EBF similarly reports both the posterior distribution and the evidence supporting it. But in non-parametric tests the correspondence between P and EBF is more tenuous. A calibration of 10p is obtained by assuming a parametric model for P, comparing the uniform distribution to Beta(1, β). Similarly for goodness-of-fit tests, Eq (12) might be used to obtain an EBF comparing the central χ2 distribution to a non-central distribution. In these cases the EBF requires a model for the alternative that is absent in the classical test, and the Bayesian interpretation of P should be treated with caution.

From the Bayesian perspective there are two main issues. The first is the lack of asymptotic consistency when testing a point null embedded within a composite alternative. I have argued that this is less important in the finite data scenarios for which EBFs are intended. Indeed it is remarkable that, in finite data, the EBF is expected to favour a true point null over an infinitesimal departure. The inconsistency of the EBF is not a defining characteristic, but a property that emerges from its objective construction.

The second issue concerns coherence. While the EBF is on the same scale as a proper Bayes factor, its prior does not precede the data and there is no formal justification for updating prior odds by Bayes’ theorem [67]. Of course this may be said of any empirical Bayes method, and is only brought into relief by the single-test EBF. If empirical Bayes is acceptable for multiple testing, but not for a single test, one must ask where the difference lies since a bias is present in all cases. A pragmatic line is the following: if the data at hand would yield this Bayes factor given replicate data, then one may as well update the prior odds now. Such an argument is not strictly Bayesian, but is a compromise combining the principle of Bayesian updating with the frequentist idea of a replicate experiment.

In reality, literal applications of Bayesian updating or frequentist sampling are uncommon. Especially in research contexts, the Bayesian and frequentist constructions merely serve as rhetorical devices to enable inference from the single case. My position is closer to the evidential view, according to which the primary aim is to quantify the evidence in data [32, 34, 68]. That evidence can then be interpreted within one paradigm or other according to the agent’s subjective will [69], but the evidence stands itself as an objective property of the data. At the community level, inference proceeds by consensus based on multiple lines of evidence, and formal testing is less important than the evidence base [70]. From this perspective the EBF is attractive as it admits Bayesian, frequentist and (via asymptotic equivalence to WAIC, Eq (7)) information interpretations.

Taken purely as a measure of evidence, the EBF is one of several possible approaches including information criteria [34], maximised likelihood ratios [71, 72] and frequentist error rates [33]. Indeed the posterior Bayes factor could simply be taken at face value (rejoinder of Aitkin [21]). These quantities are each on an absolute scale, and a key question is how to interpret a given value. I have given arguments for interpreting Bayes factors with logarithmic base 3.73, an approach that is lacking from other accounts of evidence. The availability of such a scale is an advantage of the Bayes factor, for it allows a more objective interpretation of the strength of evidence. Without this scale, there is little reason to adjust the posterior Bayes factor to the EBF. The P-value calibrations for EBFs on this scale are similar to traditional significance levels (Table 6), so that this framework is consistent with established practice while providing an arguably more transparent interpretation.

The use of test-based Bayes factors avoids some of the problems associated with nuisance parameters and likelihood-free approaches. Where, for example, an estimating equation is used whose solution is asymptotically normal, one may simply take the estimator to the normal EBF. However a potential limitation is the need for a prior predictive distribution for replicate data. In the tests considered in this paper, the expected bias either does not depend on the prior predictive, or there is a natural choice as in the binomial. But in other cases there may not be a clear model for replicate data and an EBF may be hard to define.

I have focused on some hypothesis tests that are commonly applied in routine fashion. These are the focus of much of the debate around testing, but the EBF is more generally applicable, for example to multi-dimensional parameters, non-nested and interval hypotheses. Similar approaches to those described in this paper can be used to develop EBFs in a wider range of settings.

References

1. Kass RE, Raftery AE. Bayes Factors. Journal of the American Statistical Association. 1995;90(430):773–95.
2. Benjamini Y, Veaux RDD, Efron B, Evans S, Glickman M, Graubard BI, et al. The ASA president’s task force statement on statistical significance and replicability. The Annals of Applied Statistics. 2021;15(3):1084–5.
3. Morey RD, Romeijn J-W, Rouder JN. The philosophy of Bayes factors and the quantification of statistical evidence. Journal of Mathematical Psychology. 2016;72:6–18.
4. Wasserstein RL, Lazar NA. The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician. 2016;70(2):129–33.
5. Wasserstein RL, Schirm AL, Lazar NA. Moving to a World Beyond “p < 0.05”. The American Statistician. 2019;73(sup1):1–19.
6. Liu CC, Aitkin M. Bayes factors: Prior sensitivity and model generalizability. Journal of Mathematical Psychology. 2008;52(6):362–75.
7. Kass RE. Bayes Factors in Practice. Journal of the Royal Statistical Society Series D (The Statistician). 1993;42(5):551–60.
8. DeGroot MH. Lindley’s Paradox: Comment. Journal of the American Statistical Association. 1982;77(378):336–9.
9. Jeffreys H. The Theory of Probability. Oxford University Press; 1961.
10. Gelman A, Carlin J. Some Natural Solutions to the p-Value Communication Problem—and Why They Won’t Work. Journal of the American Statistical Association. 2017;112(519):899–901.
11. Ly A, Verhagen J, Wagenmakers E-J. Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. Journal of Mathematical Psychology. 2016;72:19–32.
12. Raftery AE. Bayes Factors and BIC: Comment on “A Critique of the Bayesian Information Criterion for Model Selection”. Sociological Methods & Research. 1999;27(3):411–27.
13. Spiegelhalter DJ, Smith AFM. Bayes Factors for Linear and Log-Linear Models with Vague Prior Information. Journal of the Royal Statistical Society: Series B (Methodological). 1982;44(3):377–87.
14. Berger J. The case for objective Bayesian analysis. Bayesian Analysis. 2006;1:385–402.
15. Wagenmakers EJ, Wetzels R, Borsboom D, van der Maas HL. Why psychologists must change the way they analyze their data: the case of psi: comment on Bem (2011). J Pers Soc Psychol. 2011;100(3):426–32. pmid:21280965
16. Weakliem DL. A Critique of the Bayesian Information Criterion for Model Selection. Sociological Methods & Research. 1999;27(3):359–97.
17. Bem DJ, Utts J, Johnson WO. Must psychologists change the way they analyze their data? J Pers Soc Psychol. 2011;101(4):716–9. pmid:21928916
18. Munafo MR, Nosek BA, Bishop DVM, Button KS, Chambers CD, du Sert NP, et al. A manifesto for reproducible science. Nat Hum Behav. 2017;1:0021. pmid:33954258
19. O’Hagan A. Fractional Bayes Factors for Model Comparison. Journal of the Royal Statistical Society: Series B (Methodological). 1995;57(1):99–118.
20. Berger JO, Pericchi LR. The Intrinsic Bayes Factor for Model Selection and Prediction. Journal of the American Statistical Association. 1996;91(433):109–22.
21. Aitkin M. Posterior Bayes Factors. Journal of the Royal Statistical Society: Series B (Methodological). 1991;53(1):111–28.
22. Johnson VE. Bayes factors based on test statistics. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(5):689–701.
23. Bernardo JM. Nested Hypothesis Testing: The Bayesian Reference Criterion. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics 6. Oxford University Press; 1999. p. 101–30.
24. Gelman A, Hwang J, Vehtari A. Understanding predictive information criteria for Bayesian models. Statistics and Computing. 2014;24(6):997–1016.
25. Watanabe S. Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory. Journal of Machine Learning Research. 2010;11(116):3571–94.
26. Efron B. Microarrays, Empirical Bayes and the Two-Groups Model. Statistical Science. 2008;23(1):1–22.
27. Storey JD. The optimal discovery procedure: a new approach to simultaneous significance testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2007;69(3):347–68.
28. Guindani M, Müller P, Zhang S. A Bayesian discovery procedure. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2009;71(5):905–25. pmid:20694043
29. Morisawa J, Otani T, Nishino J, Emoto R, Takahashi K, Matsui S. Semi-parametric empirical Bayes factor for genome-wide association studies. Eur J Hum Genet. 2021;29(5):800–7. pmid:33495595
30. Lee MD, Wagenmakers E-J. Bayesian cognitive modeling: A practical course. New York: Cambridge University Press; 2013.
31. Held L, Ott M. How the Maximal Evidence of P-Values Against Point Null Hypotheses Depends on Sample Size. The American Statistician. 2016;70(4):335–41.
32. Royall R. Statistical Evidence: A Likelihood Paradigm. Taylor & Francis; 1997.
33. Mayo DG, Cox DR. Frequentist Statistics as a Theory of Inductive Inference. Lecture Notes-Monograph Series. 2006;49:77–97.
34. Taper ML, Ponciano JM. Evidential statistics as a statistical modern synthesis to support 21st century science. Population Ecology. 2016;58(1):9–29.
35. Hartmann M, Agiashvili G, Bürkner P, Klami A. Flexible Prior Elicitation via the Prior Predictive Distribution. In: Peters J, Sontag D, editors. Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI). Proceedings of Machine Learning Research: PMLR; 2020. p. 1129–38.
36. Stigler SM. Thomas Bayes’s Bayesian Inference. Journal of the Royal Statistical Society: Series A (General). 1982;145(2):250–8.
37. Geisser S. On Prior Distributions for Binary Trials. The American Statistician. 1984;38(4):244–7.
38. Tuyl F, Gerlach R, Mengersen K. A Comparison of Bayes-Laplace, Jeffreys, and Other Priors: The Case of Zero Events. The American Statistician. 2008;62(1):40–4.
39. Gerlach R, Mengersen K, Tuyl F. Posterior predictive arguments in favor of the Bayes-Laplace prior as the consensus prior for binomial and multinomial parameters. Bayesian Analysis. 2009;4(1):151–8.
40. Sellke T, Bayarri MJ, Berger JO. Calibration of p Values for Testing Precise Null Hypotheses. The American Statistician. 2001;55(1):62–71.
41. Held L, Ott M. On p-Values and Bayes Factors. Annual Review of Statistics and Its Application. 2018;5(1):393–419.
42. Berger JO, Sellke T. Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence. Journal of the American Statistical Association. 1987;82(397):112–22.
43. Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers EJ, Berk R, et al. Redefine statistical significance. Nat Hum Behav. 2018;2(1):6–10. pmid:30980045
44. Storey JD, Dai JY, Leek JT. The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments. Biostatistics. 2007;8(2):414–32. pmid:16928955
45. Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes Analysis of a Microarray Experiment. Journal of the American Statistical Association. 2001;96(456):1151–60.
46. Langaas M, Lindqvist BH, Ferkingstad E. Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(4):555–72.
47. Storey JD. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2002;64(3):479–98.
48. Royall RM. The Effect of Sample Size on the Meaning of Significance Tests. The American Statistician. 1986;40(4):313–5.
49. Wagenmakers E. Approximate Objective Bayes Factors From P-Values and Sample Size: The 3p√n Rule. PsyArXiv. 2022.
50. Edwards W, Lindman H, Savage LJ. Bayesian statistical inference for psychological research. Psychological Review. 1963;70(3):193–242.
51. Goodman SN. Toward evidence-based medical statistics. 2: The Bayes factor. Ann Intern Med. 1999;130(12):1005–13. pmid:10383350
52. Burnham KP, Anderson DR. Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociological Methods & Research. 2004;33(2):261–304.
53. Ding J, Tarokh V, Yang Y. Model Selection Techniques: An Overview. IEEE Signal Processing Magazine. 2018;35(6):16–34.
54. Lindley DV. A statistical paradox. Biometrika. 1957;44(1–2):187–92.
55. Wagenmakers E-J, Ly A. History and nature of the Jeffreys–Lindley paradox. Archive for History of Exact Sciences. 2023;77(1):25–72.
56. Robert CP. On the Jeffreys-Lindley Paradox. Philosophy of Science. 2014;81(2):216–32.
57. Spanos A. Who Should Be Afraid of the Jeffreys-Lindley Paradox? Philosophy of Science. 2013;80(1):73–93.
58. Sprenger J. Testing a Precise Null Hypothesis: The Case of Lindley’s Paradox. Philosophy of Science. 2013;80(5):733–44.
59. Matthews RAJ. Moving Towards the Post p < 0.05 Era via the Analysis of Credibility. The American Statistician. 2019;73(sup1):202–12.
60. Al-Lamee R, Thompson D, Dehbi HM, Sen S, Tang K, Davies J, et al. Percutaneous coronary intervention in stable angina (ORBITA): a double-blind, randomised controlled trial. Lancet. 2018;391(10115):31–40. pmid:29103656
61. Brown DL, Redberg RF. Last nail in the coffin for PCI in stable angina? Lancet. 2018;391(10115):3–4. pmid:29103657
62. Chatrchyan S, Khachatryan V, Sirunyan AM, Tumasyan A, Adam W, Aguilo E, et al. Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Physics Letters B. 2012;716(1):30–61.
63. Aad G, Abajyan T, Abbott B, Abdallah J, Abdel Khalek S, Abdelalim AA, et al. Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B. 2012;716(1):1–29.
64. Chalmers M. Physicists find new particle, but is it the Higgs? Nature. 2012.
65. Benjamin DJ, Berger JO. Three Recommendations for Improving the Use of p-Values. The American Statistician. 2019;73(sup1):186–91.
66. Kline B. Bayes Factors Based on p-Values and Sets of Priors With Restricted Strength. The American Statistician. 2022;76(3):203–13.
67. Gelman A, Robert CP, Rousseau J. Inherent difficulties of non-Bayesian likelihood-based inference, as revealed by an examination of a recent book by Aitkin. Statistics & Risk Modeling. 2013;30(2):105–20.
68. Edwards AWF. Likelihood. Cambridge University Press; 1972.
69. Gandenberger G. Why I Am Not a Likelihoodist. Philosophers’ Imprint. 2016;16.
70. Poole C, Peters U, Il’yasova D, Arab L. Commentary: This study failed? Int J Epidemiol. 2003;32(4):534–5. pmid:12913024
71. Zhang Z, Zhang B. A Likelihood Paradigm for Clinical Trials. Journal of Statistical Theory and Practice. 2013;7(2):157–77.
72. Bickel DR. The strength of statistical evidence for composite hypotheses: inference to the best explanation. Statistica Sinica. 2012;22:1147–98.