Conditional equivalence testing: An alternative remedy for publication bias

We introduce a publication policy that incorporates “conditional equivalence testing” (CET), a two-stage testing scheme in which standard NHST is followed conditionally by testing for equivalence. The idea of CET is carefully considered as it has the potential to address recent concerns about reproducibility and the limited publication of null results. In this paper we detail the implementation of CET, investigate similarities with a Bayesian testing scheme, and outline the basis for how a scientific journal could proceed to reduce publication bias while remaining relevant.


Introduction
Poor reliability, within many scientific fields, is a major concern for researchers, scientific journals and the public at large. In a highly cited essay, Ioannidis (2005) uses Bayes' theorem to claim that more than half of published research findings are false.
While not all agree with the extent of this conclusion, the argument raises concerns about the trustworthiness of science, amplified by a disturbing prevalence of scientific misconduct (Fanelli, 2009). That the reliability of a result may be substantially lower than its p-value suggests has been "underappreciated" (Goodman & Greenland, 2007), to say the least.
The limited publication of null results is certainly one of the most substantial factors contributing to low reliability. Whether due to a reluctance of journals to publish null results or to a reluctance of investigators to submit their null research (Dickersin et al., 1992; Reysen, 2006), the consequence is severe publication bias; see Franco et al. (2014) and Doshi et al. (2013). Despite repeated warnings, publication bias persists and, to a certain degree, this is understandable. Accepting a null result can be difficult, owing to the well-known fact that "absence of evidence is not evidence of absence" (Hartung et al., 1983; Altman & Bland, 1995). As Greenwald (1975) writes: "it is inadvisable to place confidence in results that support a null hypothesis because there are too many ways (including incompetence of the researcher), other than the null hypothesis being true, for obtaining a null result." Indeed, this is the foremost critique of NHST, that it cannot provide evidence in favour of the null hypothesis. The commonly held belief that a non-significant result showing high "retrospective power" (Zumbo & Hubley, 1998) implies support in favour of the null is problematic; see Hoenig & Heisey (2001). In fact, a large p-value (e.g. p-value > 0.05) combined with high power often occurs even in situations when the data support the alternative hypothesis more than the null (Greenland, 2012).
In order to address publication bias, it is often suggested (Walster & Cleary, 1970; Sterling et al., 1995; Dwan et al., 2008; Suñé et al., 2013) that publication decisions should be made without regard to the statistical significance of results, i.e. "result-blind peer review" (Greve et al., 2013). In fact, a growing number of psychology and neuroscience journals are adopting pre-registration (Nosek et al., 2017) including "Registered Reports" (RR) (Chambers et al., 2015), a publication policy in which authors "pre-register their hypotheses and planned analyses before the collection of data" (Chambers et al., 2014). If the rationale and methods are sound, the RR journal agrees (before any data are collected) to publish the study regardless of the eventual data and outcome obtained. Among many potential pitfalls with result-blind peer review (Findley et al., 2016), a legitimate and substantial concern is that, if a journal were to adopt such a policy, it might quickly become a "dumping ground" (Greve et al., 2013) for null and ambiguous findings that do little to contribute to the advancement of science. To address these concerns, RR journals require that, for any manuscript to be accepted, authors must provide a-priori (before any data are collected) sample size calculations that show statistical power of at least 90% (in some cases 80%). This is a reasonable remedy to a difficult problem. Still, it is problematic for two reasons.
First, it is acknowledged that this policy will disadvantage researchers who work with "expensive techniques or who have limited resources" (Chambers et al., 2014).
While not ideal, small studies can provide definitive value and potential for learning; see Sackett & Cook (1993). For this reason, some go as far as arguing against any requirement for a-priori sample size calculations (i.e. needing to show sufficient power as a requisite for publication). For example, Bacchetti (2002) writes: "If a study finds important information by blind luck instead of good planning, I still want to know the results"; see also Aycaguer & Galbán (2013). While unfortunate, the loss of potentially valuable "blind luck" results and small sample studies (Matthews, 1995) appears to be a necessary price to pay for keeping a "result-blind peer review" journal relevant. Is this too high a price? Based on simulations, Borm et al. (2009) conclude that the negative impact of publication bias does not warrant the exclusion of studies with low power.
Second, a-priori power calculations are often flawed, due to the unfortunate "sample size samba" (Schulz & Grimes, 2005): the practice of retrofitting the anticipated effect size in order to obtain a desirable sample size. Even under ideal circumstances, a-priori power estimation is often "wildly optimistic" (Bland, 2009) and heavily biased due to the "illusion of power" (Vasishth & Gelman, 2017). This "illusion" occurs when the estimated effect size is based on a literature filled with overestimates (to be expected in many fields due, somewhat ironically, to publication bias). Djulbegovic et al. (2011) conduct a retrospective analysis of phase III randomized controlled trials (RCTs) and conclude that optimism bias significantly contributes to inconclusive results; see also Chalmers & Matthews (2006). What's more, often due to unanticipated difficulties with enrolment, the actual sample size achieved is substantially lower than the target set out a-priori (Chan et al., 2008). In these situations, RR requires that either the study is rejected or withdrawn from publication, or a certain leeway is given under special circumstances (Chambers, 2017, personal communication).
Neither option is ideal. Given these difficulties with a-priori power calculations, it remains to be seen to what extent the 90% power requirement will reduce the number of underpowered publications that could lead a journal to be a dreaded "dumping ground".
An alternative proposal to address publication bias and the related issues surrounding low reliability is for researchers to adopt Bayesian testing schemes; e.g. Dienes & Mclatchie (2017), Kruschke & Liddell (2017) and Wagenmakers (2007). It has been suggested that with Bayesian methods, publication bias will be mitigated "because the evidence can be measured to be just as strong either way" (Dienes, 2016). Bayesian methods may also provide for a better understanding of the strength of evidence (Etz & Vandekerckhove, 2016). However, researchers in many fields remain uncomfortable with the need to define (subjective) priors and are concerned that Bayesian methods may increase "researcher degrees of freedom" (Simmons et al., 2011). Furthermore, it is acknowledged that sample sizes will typically need to be larger than with equivalent frequentist testing in situations when there is little prior information incorporated (Zhang et al., 2011). Nevertheless, a number of RR journals allow for a Bayesian option. At registration (before any data are collected), rather than committing to a specific sample size, researchers commit to attaining a certain Bayes Factor (BF). For example, the journals Comprehensive Results in Social Psychology (CRSP) and NFS Journal (the official journal of the Society of Nutrition and Food Science) require that one pledges to collect data until the BF is more than 3 (or less than 0.33) (Jonas & Cesario, 2017). The journals BMC Biology and the Journal of Cognition require, as a requisite for publication, a BF of at least 6 (or less than 1/6).
In this paper, we propose an alternate option made possible by adapting NHST to conditionally incorporate equivalence testing. While equivalence testing is by no means a novel idea, previous attempts to introduce equivalence testing have "largely failed" (Lakens, 2017). Our proposal to systematically incorporate equivalence testing into a two-stage testing procedure has not been extensively pursued (one exception may be Hauck & Anderson (1986)) and there has not been any discussion of how such a testing procedure could facilitate publication decisions for peer-review journals. One reason for this is a poor understanding of the conditional equivalence testing strategy whereby testing traditional non-equivalence (or superiority) is followed conditionally by testing equivalence (or non-inferiority). In fact, whether or not such a two-stage approach is beneficial has been somewhat controversial. As the sample size is typically determined based only on the primary test, the power of the secondary equivalence (non-inferiority) test is not controlled, thereby potentially increasing the false discovery rate (Ng, 2003). Koyama & Westfall (2005) investigate and conclude that, in most situations, such concern is unwarranted. In Section 2, in order to further the understanding of conditional equivalence testing, we provide a brief overview including how to carry out power calculations and how to (not necessarily prior to the study) establish appropriate equivalence margins.
One reason conditional equivalence testing (CET) is an appealing approach is that it shares many of the properties that make Bayesian testing schemes so attractive. As such, the publication policy we put forward is somewhat similar to the RR "Bayesian option". With conditional equivalence testing, evidence can be measured in favour of both the alternative and the null (at least in a pragmatic sense), and as such is "compatible with a Bayesian point of view" (Ocaña i Rebull et al., 2008). In Section 3, we conduct a simple simulation study to demonstrate how one will often arrive at the same conclusion whether using CET or a Bayesian testing scheme.
In Section 4, we outline how a publication policy could be framed around CET to encourage the publication of null results and make recommendations for reporting and implementation. Finally, Section 5 concludes with suggestions for future research.

Conditional Equivalence Testing Overview
Standard equivalence testing is essentially NHST with the hypotheses reversed. For example, for a two-sample study of means, the equivalence testing null hypothesis would be a difference in means, and the alternative hypothesis would be equal (within a given margin) means. Conditional equivalence testing (CET) is the practice of standard NHST followed conditionally (if one fails to reject the null) by equivalence testing. CET is not an altogether new way of testing. Rather it is the usage of established testing methods in a way that permits better interpretation of results.
In this regard, it is similar to other proposals such as the "three-way testing" scheme proposed by Goeman et al. (2010), and Zhao (2016)'s proposal for incorporating both statistical and clinical significance into one's testing. To illustrate, what follows is a brief outline of CET for a two-sample test of equal means (assuming equal variance). Let $x_{i1}$, for $i = 1, ..., n_1$, and $x_{i2}$, for $i = 1, ..., n_2$, be independent random samples from two normally distributed populations of interest with $\mu_1$, the true mean of population 1; $\mu_2$, the true mean of population 2; and $\sigma^2$, the true common population variance. Let $n = n_1 + n_2$ and define sample means and sample variances as follows: $\bar{x}_g = \sum_{i=1}^{n_g} x_{ig}/n_g$, and $s_g^2 = \sum_{i=1}^{n_g} (x_{ig} - \bar{x}_g)^2/(n_g - 1)$, for $g = 1, 2$. Also let $s_p = \sqrt{((n_1 - 1)s_1^2 + (n_2 - 1)s_2^2)/(n - 2)}$ denote the pooled standard deviation. The true difference in population means, $\mu_d = \mu_1 - \mu_2$, under the standard null hypothesis, $H_0$, is equal to zero. Under the standard alternative, $H_1$, we have that $\mu_d \neq 0$.
The term equivalence is not used in the strict sense that $\mu_1 = \mu_2$. Instead, equivalence in this context refers to the notion that the two means are "close enough", i.e. their difference is within the equivalence margin, $\delta = [-\Delta, \Delta]$, chosen to define a range of values considered equivalent (i.e. the "zone of indifference"). In equivalence (and non-inferiority) trials, $\Delta$ is ideally chosen to be the "minimum clinically meaningful difference" (Kaul & Diamond, 2006; Greene et al., 2008).
Let $F_{T_{df}}()$ be the cumulative distribution function (cdf) of the t distribution with $df$ degrees of freedom and define the following critical t values: $t^*_{\alpha_1/2} = F^{-1}_{T_{n-2}}(1 - 0.5\alpha_1)$ (i.e. the upper $100 \cdot \frac{\alpha_1}{2}$-th percentile of the t-distribution with $n-2$ degrees of freedom) and $t^*_{\alpha_2} = F^{-1}_{T_{n-2}}(1 - \alpha_2)$ (i.e. the upper $100 \cdot \alpha_2$-th percentile of the t-distribution with $n-2$ degrees of freedom). Here, $\alpha_1$ is the maximum allowable type I error (e.g. $\alpha_1 = 0.05$) and $\alpha_2$ is the maximum allowable "type E" error (erroneously concluding equivalence), possibly equal to $\alpha_1$. If a type E error is deemed less costly than a type I error, $\alpha_2$ may be set higher (e.g. $\alpha_2 = 0.10$). CET is the following conditional procedure consisting of five steps:

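Box 2.1
Step 1 - A two-sided, two-sample t-test for a difference of means.
Calculate the t-statistic, $T = (\bar{x}_1 - \bar{x}_2)/(s_p\sqrt{1/n_1 + 1/n_2})$, and associated p-value, $p_1 = 2F_{T_{n-2}}(-|T|)$.
Step 2 - If $p_1 \leq \alpha_1$, then declare a positive result. There is evidence of a statistically significant difference, p-value = $p_1$.
Otherwise, if $p_1 > \alpha_1$, proceed to Step 3.
Step 3 - Two one-sided tests (TOST) for equivalence of means.
Calculate the two t-statistics, $T^{-} = (\bar{x}_1 - \bar{x}_2 + \Delta)/(s_p\sqrt{1/n_1 + 1/n_2})$ and $T^{+} = (\bar{x}_1 - \bar{x}_2 - \Delta)/(s_p\sqrt{1/n_1 + 1/n_2})$, and the associated p-value, $p_2 = \max(1 - F_{T_{n-2}}(T^{-}), F_{T_{n-2}}(T^{+}))$.
Step 4 - If $p_2 \leq \alpha_2$, declare a negative result. There is evidence of statistical equivalence, p-value = $p_2$.
Otherwise, proceed to Step 5.
Step 5 - Declare an inconclusive result. There is insufficient evidence to support any conclusion.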
For ease of explanation, let us define $p_{CET} = p_1$ if the result is positive, and $p_{CET} = 1 - p_2$ if the result is negative or inconclusive. Thus, a small value of $p_{CET}$ suggests evidence in favour of a positive result, whereas a large value of $p_{CET}$ suggests evidence in favour of a negative result. It is important to acknowledge that, despite the above procedure being dubbed "conditional", $p_2$ is a marginal p-value, i.e. it is not calculated under the assumption that $p_1 > \alpha_1$. The interpretation of $p_2$ would be the same regardless of whether it was obtained following Steps 1 and 2 or was obtained "on its own" via standard equivalence testing.
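The five-step procedure for the two-sample case (a two-sided t-test, followed conditionally by TOST) can be sketched as below. This is a minimal illustration rather than a reference implementation: the function name `cet_two_sample`, its argument names and its return format are assumptions introduced here, and the t-distribution calls rely on SciPy.

```python
import math
from scipy import stats


def cet_two_sample(x1, x2, delta, alpha1=0.05, alpha2=0.10):
    """Conditional equivalence test for two samples with equal variance.

    Returns (conclusion, p1, p2); p2 is None when the result is positive.
    All names here are illustrative, not taken from the paper.
    """
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    s_p = math.sqrt((ss1 + ss2) / df)            # pooled standard deviation
    se = s_p * math.sqrt(1 / n1 + 1 / n2)        # standard error of the difference
    diff = m1 - m2

    # Steps 1-2: standard two-sided t-test.
    p1 = 2 * stats.t.sf(abs(diff / se), df)
    if p1 <= alpha1:
        return "positive", p1, None

    # Steps 3-4: two one-sided tests against the margin [-delta, delta].
    p_lower = stats.t.sf((diff + delta) / se, df)   # H0: mu_d <= -delta
    p_upper = stats.t.cdf((diff - delta) / se, df)  # H0: mu_d >= +delta
    p2 = max(p_lower, p_upper)
    if p2 <= alpha2:
        return "negative", p1, p2

    # Step 5: neither test rejects.
    return "inconclusive", p1, p2
```

Note that `p2` is computed marginally, exactly as it would be in a stand-alone equivalence test; only the decision to look at it is conditional on `p1`.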
Standard NHST involves the same first and second steps and ends with an alternative Step 3 which states that if $p_1 > \alpha_1$, one declares an inconclusive result ("there is insufficient evidence to reject the null"). Similar to two-sided CET, one-sided CET is straightforward, making use of non-inferiority testing in Step 3. Note that one-sided CET testing, like all one-sided testing, is vulnerable to potential post-hoc abuse, i.e. the direction of the test could be based on the data (Freedman, 2008).
There is a large literature on Step 3's TOST and non-inferiority testing; see Walker & Nowacki (2011) and Meyners (2012) for overviews that cover the basics as well as more subtle issues. There are also many proposed alternatives to TOST for equivalence testing, known as "the emperor's new tests" (Perlman et al., 1999).
These alternatives are offered as marginally more powerful options, yet are more complex and are not widely used. As such, we will not go into any further detail and refer those interested to Meyners (2012).
CET is a procedure applicable to any type of outcome. Let $\theta$ be the parameter of interest in a more general testing setting. CET can be described in terms of calculating confidence intervals around $\hat{\theta}$, the statistic of interest. For example, $\hat{\theta}$ may be defined as the difference in sample proportions, the hazard ratio, the risk ratio, or the slope of a linear regression model, etc. For details on equivalence testing in more general scenarios (i.e. non-continuously distributed outcomes, one-[…] that in these other cases, the equivalence margin, $\delta$, may not necessarily be centred at zero, but will be a symmetric interval around $\theta_0$, the value of $\theta$ under the standard null hypothesis. In general, let $\delta = [\delta_L, \delta_U]$, with $\Delta$ equal to half the length of the interval. Consider, more generally, CET as the following procedure:
Step 1 - Calculate a $(1-\alpha_1)\%$ confidence interval (C.I.) for $\theta$.
Step 2 - If this C.I. excludes $\theta_0$, then declare a positive result.
Otherwise, if $\theta_0$ is within the C.I., proceed to Step 3.
Step 3 - Calculate a $(1-2\alpha_2)\%$ C.I. for $\theta$.
Step 4 - If this C.I. is entirely within $\delta$, declare a negative result.
Otherwise, proceed to Step 5.
Step 5 - Declare an inconclusive result. There is insufficient evidence to support any conclusion.

[Figure 1: Conditional Equivalence Testing. Point estimates, $(1-\alpha_1)\%$ confidence intervals, and $(1-2\alpha_2)\%$ confidence intervals.]

Situation "a" is what […] (1995) calls a "definitive-positive" result. The lower confidence limit of the parameter is not only larger than zero (the null value, $\theta_0$), implying a "positive" result, but also is above the $\Delta$ threshold. Situations "b" and "c", in which the $(1-\alpha_1)\%$ C.I. excludes zero and the $(1-2\alpha_2)\%$ C.I. lies within $\delta$, are also "positive" results but require some additional interpretation. One could describe the effect in these cases as "significant yet not meaningful" or conclude that there is evidence of a significant effect, yet the effect is likely "not minimally important". A "positive" result such as "d", with a wider confidence interval, represents a significant, albeit imprecisely estimated, effect. One could conclude that additional studies, with larger sample sizes, are required to determine if the effect is of a meaningful magnitude.
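The interval-based form of CET reduces to two containment checks, which the following sketch makes explicit. The function name `cet_from_intervals` and the tuple-based interface are hypothetical choices made for illustration; the two intervals are assumed to have been computed beforehand by whatever method suits the parameter $\theta$.

```python
def cet_from_intervals(ci_nhst, ci_equiv, theta0, margin):
    """Classify a result from two confidence intervals.

    ci_nhst  : (lower, upper) of the (1 - alpha1) C.I. for theta
    ci_equiv : (lower, upper) of the (1 - 2*alpha2) C.I. for theta
    theta0   : value of theta under the standard null hypothesis
    margin   : (delta_L, delta_U), the equivalence margin
    All argument names are illustrative assumptions.
    """
    lo1, hi1 = ci_nhst
    lo2, hi2 = ci_equiv
    d_lo, d_hi = margin

    # Step 2: the first interval excludes the null value.
    if not (lo1 <= theta0 <= hi1):
        return "positive"
    # Step 4: the second interval sits entirely inside the margin.
    if d_lo <= lo2 and hi2 <= d_hi:
        return "negative"
    # Step 5: neither condition holds.
    return "inconclusive"
```

For instance, `cet_from_intervals((-0.2, 0.3), (-0.15, 0.25), 0.0, (-0.5, 0.5))` returns `"negative"`, while widening both intervals so that they cross the margin would give `"inconclusive"`. As the surrounding discussion stresses, the label alone is not enough for interpretation; the intervals themselves should be reported alongside it.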
Evidently, in some cases, the three categories are not sufficient on their own for adequate interpretation. For example, without any additional information the "positive" vs. "negative" distinction between cases "b" and "f" is misleading. These two results appear very similar: in both cases a substantial (i.e. greater than $\Delta$-sized) effect can be ruled out. Indeed, positive result "b" appears much more like the negative result "f" than positive result "a". Additional language and careful attention to the estimated effect size is required for correct interpretation. While case "k" has a similar point estimate to "b" and "f", the wider C.I. ($\Delta$ is within the interval) means that one cannot rule out the possibility of a meaningful effect, and so it is rightly categorized as "inconclusive".
It may be argued that CET is simply a recalibration of the standard confidence interval, like converting Fahrenheit to Celsius. This is valid commentary and in response, it should be noted that our suggestion to adopt CET (in the place of standard confidence intervals) is not unlike that of Gardner & Altman (1986) who suggest that confidence intervals should replace p-values; see also Cumming (2008) and Reichardt & Gollob (2016). One advantage of CET over confidence intervals is that it may improve the interpretation of null results; see Parkhurst (2001) and Hauck & Anderson (1986). By clearly distinguishing between what is a negative versus an inconclusive result, CET serves to simplify the long "series of searching questions" necessary to evaluate a "failed outcome" (Pocock & Stone, 2016). However, as can be seen with the examples of Figure 1, the use of CET should not rule out the complementary use of confidence intervals. Indeed, the best interpretation of a result will be when using both tools together.
As Dienes & Mclatchie (2017) clearly explain, with standard NHST, one is unable to make the "three-way distinction" between the positive, inconclusive and negative ("evidence for $H_1$", "no evidence to speak of", and "evidence for $H_0$"). Under the standard null ($\mu_d = 0$) with NHST, one is just as likely to obtain a large p-value as one is to obtain a small p-value. Under the standard null with CET, […] who write "[a]s long as we treat our larger p-values as unwanted children, they will continue disappearing in our file drawers, causing publication bias, which has been identified as the possibly most prevalent threat to reliability and replicability."

Defining the equivalence margin
As with standard equivalence and non-inferiority testing, defining the equivalence margin will be one of the "most difficult issues" for CET (Hung et al., 2005). If the margin is too large, then a claim of equivalence is meaningless. If the margin is too small, then the probability of declaring equivalence will be substantially reduced; see Wiens (2002). As stated earlier, the margin is ideally chosen as a boundary to exclude "minimum clinically meaningful differences" (Kaul & Diamond, 2006; Greene et al., 2008). However, "clinically meaningful" effects are difficult to define, and there is generally no clear consensus among stakeholders (Keefe et al., 2013). Furthermore, previously agreed-upon meaningful differences may be difficult to ascertain as they are rarely specified in protocols and published results (Djulbegovic et al., 2011).
In some fields, there are generally accepted norms. For example, in bioavailability studies, equivalence is routinely defined (and listed by regulatory authorities) as a difference of less than 20%. In oncology trials, a small effect size has been defined as odds ratio or hazard ratio of 1. In cases when a specified equivalence margin may not be as clear-cut, less conventional options have been put forth. Hauck & Anderson (1986) propose the concept of using an "equivalence curve" that illustrates results for a range of possibilities. Meyners (2007) proposes to use the least equivalent allowable difference (LEAD), the largest possible value of $\Delta$ for which one can claim equivalence. The public would then be left to draw their own conclusions from the data and whether they believe the LEAD-$\Delta$ is a reasonable standard for equivalence. This would no doubt lead to a discussion about which effect sizes are too small to be worthwhile for a given treatment, and advance researchers towards a specific standard.
One important choice a researcher must make in defining the equivalence margin is whether the margin should be defined on a raw or standardized scale. Lakens (2017) discusses the pros and cons of each option. For example, if there is no rationale for any standard margin in our two-sample normal case, taking equivalence to be a difference within half the estimated standard deviation, i.e. defining $\Delta = qs_p$ with pre-specified $q = 0.5$, seems reasonable. Note that the probability of obtaining a negative result may be zero (or negligible) for certain combinations of values of $\alpha_1$, $\alpha_2$, $n_1$, $n_2$ and $q$. For example, with $q = 0.5$, $\alpha_1 = 0.05$, and $\alpha_2 = 0.10$, $\Pr(\text{negative}) = 0$ for all $n \leq 26$ (with $n_1 = n_2$). As such, $q$ must be chosen with additional practical considerations (see additional details in 2.2). For binary and time-to-event outcomes, there is an even greater number of different ways one can define the margin (e.g. in terms of relative risk vs. odds ratio). For discussions on this, see Ng (2008), da Silva et al. […]
It is important to note that, while it may be ideal to specify the margin prior to collecting data, setting the margin afterwards will not lead to any type I error inflation (i.e. one will not erroneously reject $H_0: \theta = \theta_0$ with probability greater than $\alpha_1$). For the same reason that a 95% C.I. is just as valid as an 85% C.I., but must be interpreted differently, CET is valid regardless of whether the margin is specified (on a raw scale, or on a standardized scale by defining $q$) before or after data are obtained.
However, the conclusion made will clearly depend upon the margin chosen, and for any non-positive result, it is always possible to choose a specific margin (or specific value for $q$) which leads to a negative conclusion (see Figure 1). Since the choice of margin is often a difficult one in the best of circumstances, a retrospective choice is not ideal as there will be ample room for bias in one's choice, regardless of how well intentioned one may be. For this reason, for equivalence and non-inferiority RCTs, it is generally expected that margins are to be pre-specified (Piaggio et al., 2006).

Operating characteristics and sample size calculations
What follows is a brief overview of how to calculate the probabilities of obtaining each of the three conclusions (positive, negative, and inconclusive) as listed in our CET procedure above for two-sample testing of normal data with equal variance (box 2.1). In Figure 3, the diamond-shaped region (♦) covers the values for which a negative conclusion is obtained. A positive conclusion corresponds to when $|\bar{x}_1 - \bar{x}_2|$ is large and $s^*$ is relatively small. Note how $\Delta$ and $\sigma$ impact the conclusion. If the equivalence margin is sufficiently wide, one will have $\Pr(\text{inconclusive})$ approach zero as $\sigma$ approaches zero. Indeed, the ratio $\Delta/\sigma$ determines, to a large extent, the probability of a negative result. If the equivalence margin is sufficiently narrow and the variance relatively large (i.e. $\Delta/\sigma$ is very small), one will have $\Pr(\text{negative}) \approx 0$.
[Figure 3: The values of $\hat{\mu}_d = \bar{x}_1 - \bar{x}_2$ (on the x-axis) and $s^*$ (on the y-axis) are obtained from the data. The three conclusions, positive, negative and inconclusive, correspond to the three areas shaded in green, blue and red respectively. The black lettered points correspond to the scenarios of Figure 1.]
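Let us assume for simplicity that $n_1 = n_2$. Then, the sampling distributions of the sample mean and sample variance are well established: $\hat{\mu}_d \sim N(\mu_d, \sigma^{*2})$ and $(n-2)s^{*2}/\sigma^{*2} \sim \chi^2_{n-2}$, where $\sigma^* = \sigma\sqrt{1/n_1 + 1/n_2}$ and $s^* = s_p\sqrt{1/n_1 + 1/n_2}$. Therefore, given fixed values for $\mu_d$ and $\sigma^2$, we can calculate the probability of obtaining a positive result, $\Pr(\text{positive})$. In Figure 3, $\Pr(\text{positive})$ equals the probability of $\hat{\mu}_d$ and $s^*$ falling into either the left or right "positive" corners and is calculated (as in a usual power calculation for NHST): $$\Pr(\text{positive}) = 1 - F_{n-2,\,\mu_d/\sigma^*}(t^*_{\alpha_1/2}) + F_{n-2,\,\mu_d/\sigma^*}(-t^*_{\alpha_1/2}),$$ where $F_{df,ncp}(x)$ is the cdf of the non-central t distribution with $df$ degrees of freedom and non-centrality parameter $ncp$.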
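As a quick numerical check of the power calculation described above, the probability of a positive result can be evaluated with the non-central t distribution. The sketch below assumes SciPy's `scipy.stats.nct` and uses a hypothetical function name, `pr_positive`; the non-centrality parameter is taken to be $\mu_d/\sigma^*$ with $\sigma^* = \sigma\sqrt{1/n_1 + 1/n_2}$.

```python
import math
from scipy import stats


def pr_positive(mu_d, sigma2, n1, n2, alpha1=0.05):
    """Probability that the two-sided t-test rejects H0: mu_d = 0.

    mu_d and sigma2 are the assumed true mean difference and common
    variance; the function name is an illustrative assumption.
    """
    df = n1 + n2 - 2
    sigma_star = math.sqrt(sigma2 * (1 / n1 + 1 / n2))
    ncp = mu_d / sigma_star                      # non-centrality parameter
    t_crit = stats.t.ppf(1 - alpha1 / 2, df)     # upper alpha1/2 critical value
    # Upper-tail plus lower-tail rejection probability.
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)
```

Under the null ($\mu_d = 0$) this returns the nominal type I error, and with $\sigma^2 = 1$, $\mu_d = 0.205$ and $n_1 = n_2 = 500$ it returns a value close to 0.90, in line with the sample size example given later in this section.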
One can calculate the probability of obtaining a negative result, $\Pr(\text{negative})$, as the probability of $\hat{\mu}_d$ and $s^*$ falling into the "negative" diamond, ♦. Since $\hat{\mu}_d$ and $s^*$ are independent statistics, we can write their joint density as the product of a normal probability density function, $f_N()$, and a chi-squared probability density function, $f_{\chi^2}()$. However, the resulting double integral will remain difficult to evaluate algebraically over the boundary, ♦. Therefore, the probability is best approximated numerically, for example, by Monte Carlo integration, as follows: $$\Pr(\text{negative}) \approx \frac{1}{M}\sum_{j=1}^{M}\left[\Phi\left(\frac{h_2(c_j) - \mu_d}{\sigma^*}\right) - \Phi\left(\frac{h_1(c_j) - \mu_d}{\sigma^*}\right)\right],$$ where $\Phi()$ is the normal cdf and Monte Carlo draws from a chi-squared distribution provide $c_j = \sqrt{\sigma^{*2} q_{[j]}/(n-2)}$, with $q_{[j]} \sim \chi^2_{n-2}$ for $j = 1, ..., M$, so that each $c_j$ is a draw from the sampling distribution of $s^*$. The left and right-hand boundaries of the diamond-shaped "negative region" are defined by $h_1(c_j) = \min(0, \max(c_j t_2 - \Delta, -c_j t_1))$ and $h_2(c_j) = \max(0, \min(-c_j t_2 + \Delta, c_j t_1))$, with $t_1 = t^*_{\alpha_1/2}$ and $t_2 = t^*_{\alpha_2}$.
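The Monte Carlo approximation of the probability of a negative result can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `pr_negative` and its defaults are hypothetical, each draw `c` is taken to be a simulated value of $s^*$, and `t1`, `t2` are taken to be the critical values $t^*_{\alpha_1/2}$ and $t^*_{\alpha_2}$.

```python
import numpy as np
from scipy import stats


def pr_negative(mu_d, sigma2, n1, n2, delta,
                alpha1=0.05, alpha2=0.10, M=100_000, seed=0):
    """Monte Carlo estimate of Pr(negative) for a raw margin [-delta, delta]."""
    df = n1 + n2 - 2
    sigma_star = np.sqrt(sigma2 * (1 / n1 + 1 / n2))
    t1 = stats.t.ppf(1 - alpha1 / 2, df)
    t2 = stats.t.ppf(1 - alpha2, df)

    # Draws of s*: sigma_star * sqrt(chi2_df / df).
    rng = np.random.default_rng(seed)
    c = sigma_star * np.sqrt(rng.chisquare(df, size=M) / df)

    # Left and right edges of the "negative" region for each draw of s*;
    # both collapse to 0 when the region is empty.
    h1 = np.minimum(0.0, np.maximum(c * t2 - delta, -c * t1))
    h2 = np.maximum(0.0, np.minimum(-c * t2 + delta, c * t1))

    # Given s*, the mean difference is normal with mean mu_d and sd sigma_star.
    inside = (stats.norm.cdf(h2, loc=mu_d, scale=sigma_star)
              - stats.norm.cdf(h1, loc=mu_d, scale=sigma_star))
    return float(inside.mean())
```

Integrating the normal component analytically and simulating only the chi-squared component keeps the Monte Carlo error small relative to simulating both statistics.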
Defining the boundary with $h_1()$ and $h_2()$ allows for three distinct cases as seen in Figure 3: […]
When the equivalence boundaries are defined as a function of $s_p$ (e.g. $\Delta = qs_p$), the calculations are somewhat different; Figure 4 illustrates. In particular, a negative conclusion requires: $$\frac{|\bar{x}_1 - \bar{x}_2|}{s^*} \leq \min\left(t^*_{\alpha_1/2},\ \frac{q}{\sqrt{1/n_1 + 1/n_2}} - t^*_{\alpha_2}\right).$$ As such, for a given sample size, it will only be possible to obtain a negative result if $q > t^*_{\alpha_2}\sqrt{1/n_1 + 1/n_2}$. Likewise, an inconclusive result will only be possible if $q/\sqrt{1/n_1 + 1/n_2} < t^*_{\alpha_1/2} + t^*_{\alpha_2}$.
In order to determine an appropriate sample size for CET, one must replace $\mu_d$ with an a-priori estimate, $\tilde{\mu}_d$, the "anticipated effect size", and replace $\sigma^2$ with an a-priori estimate, $\tilde{\sigma}^2$, the "anticipated variance". Then one might be interested in calculating six values: the probabilities of obtaining each of the three possible results (positive, negative and inconclusive) under two hypothetical scenarios, (1) where $\mu_d = 0$, and (2) where $\mu_d$ is equal to $\tilde{\mu}_d$, the value expected given results in the literature.
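The feasibility conditions for a standardized margin, $\Delta = qs_p$, are easy to check numerically before committing to a sample size. The sketch below uses SciPy for the t quantile; the function names `negative_possible` and `smallest_balanced_n` are illustrative assumptions, and the search is restricted to balanced designs ($n_1 = n_2$).

```python
import math
from scipy import stats


def negative_possible(q, n1, n2, alpha2=0.10):
    """True if a negative result is attainable with margin delta = q * s_p."""
    df = n1 + n2 - 2
    return q > stats.t.ppf(1 - alpha2, df) * math.sqrt(1 / n1 + 1 / n2)


def smallest_balanced_n(q, alpha2=0.10, n_max=10_000):
    """Smallest total n (with n1 = n2) for which Pr(negative) can exceed zero."""
    for m in range(2, n_max // 2 + 1):
        if negative_possible(q, m, m, alpha2):
            return 2 * m
    return None  # no feasible balanced design up to n_max
```

With $q = 0.5$ and $\alpha_2 = 0.10$, the search returns a total sample size of 28, which agrees with the earlier statement that $\Pr(\text{negative}) = 0$ for all $n \leq 26$ in the balanced case.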
One might also be interested in a hybrid approach whereby one specifies a composite null and alternative distribution.Since the objective of any study should be to obtain a conclusive result, sample size could also be calculated with the objective to maximize the likelihood of success, i.e. to minimize P r(inconclusive).
[Figure: Required sample size for 90% power, or Pr(success) = 61%. Dashed blue lines correspond to Pr(positive) = 90% (i.e. "power" = 0.90) and solid red lines to a 61% "probability of success", with the equivalence margin pre-specified as $\Delta = 0.1025$ ($= \frac{1}{2}\tilde{\mu}_d$). Points "a1" and "a2" mark the scenario with $\sigma^2 = 1.25$ and $\mu_d = 0.123$; points "b1" and "b2" mark the scenario with $\sigma^2 = 0.75$ and $\mu_d = 0.33$.]
Using $\Pr(\text{success})$ as the criterion for determining sample size attenuates the effect of $\tilde{\mu}_d$ on the required sample size. Suppose one calculates that the required sample size is $n$ = 1,000 based on the desire for 90% statistical power and the belief that $\tilde{\sigma}^2 = 1$ and $\tilde{\mu}_d = 0.205$, with $n_1 = n_2$ as before. This corresponds to a 61% probability of success for $\Delta = 0.1025$. If the true variance is slightly larger than anticipated, $\sigma^2 = 1.25$, and the difference in means smaller, $\mu_d = 0.123$, the actual sample size needed for 90% power is in fact $n$ = 3,476, while the actual sample size needed for $\Pr(\text{success} \mid \Delta, d, s) = 61\%$ is $n$ = 1,770. On the other hand, if the true variance is slightly smaller than anticipated, $\sigma^2 = 0.75$, and the difference in means greater, $\mu_d = 0.33$, the actual sample size needed for 90% power is only $n$ = 288, while the actual sample size needed for $\Pr(\text{success} \mid \Delta, d, s) = 61\%$ is $n$ = 638. It follows that, if one has little certainty in $\mu_d$ and $\sigma^2$, calculating the required sample size with consideration of $\Pr(\text{success})$ may be less risky.
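The power-based sample sizes quoted above can be reproduced approximately with a simple search over balanced designs. The sketch below is an assumption-laden illustration: the function names `power_two_sided` and `required_n` are hypothetical, the search increments the per-group size one at a time, and small discrepancies from the quoted figures are to be expected depending on rounding and on the approximation used.

```python
import math
from scipy import stats


def power_two_sided(mu_d, sigma2, m, alpha1=0.05):
    """Power of the two-sided, two-sample t-test with m observations per group."""
    df = 2 * m - 2
    ncp = mu_d / math.sqrt(sigma2 * 2 / m)
    t_crit = stats.t.ppf(1 - alpha1 / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)


def required_n(mu_d, sigma2, target=0.90, alpha1=0.05):
    """Smallest total n (n1 = n2) whose power reaches the target."""
    m = 2
    while power_two_sided(mu_d, sigma2, m, alpha1) < target:
        m += 1
    return 2 * m


# The three scenarios discussed in the text: (mu_d, sigma^2).
scenarios = {"anticipated": (0.205, 1.00),
             "a (pessimistic)": (0.123, 1.25),
             "b (optimistic)": (0.330, 0.75)}
sizes = {name: required_n(mu, s2) for name, (mu, s2) in scenarios.items()}
```

The resulting totals fall close to the values of 1,000, 3,476 and 288 given in the text, which illustrates how strongly the power-based sample size reacts to modest changes in the assumed effect size and variance.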
For related work on statistical power, see Shao et al. (2008) who propose a hybrid Bayesian-frequentist approach to evaluate power for testing both superiority and non-inferiority. Jia & Lynn (2015) discuss a related sample size planning approach that considers both statistical significance and clinical significance. Finally, Jiroutek et al. (2003) advocate that, rather than calculate statistical power (the probability of rejecting $\theta = \theta_0$ should the alternative be true), one should calculate the probability that the width of a confidence interval is less than a fixed constant and the null hypothesis is rejected, given that the confidence interval contains the true parameter.

A comparison with Bayesian testing
Recently, Bayesian statistics have been advocated as a "possible solution to publication bias" (Konijn et al., 2015). In particular, there have been many Bayesian testing schemes proposed in the psychology literature; see the discussion of Mulder & Wagenmakers (2016) and, for an accessible overview of the "Bayesian t-test", see Gönen (2010). What's more, publication policies based on Bayesian testing schemes are currently in use by a small number of journals and are the preferred approach for some (e.g. Dienes & Mclatchie (2017)). In response to these developments, we will compare CET and a Bayesian testing scheme with regard to their operating characteristics. This brings to mind Dienes (2014) who compares testing with Bayes Factors (BF) to testing with "interval methods" and notes that with interval methods, a study result is a "reflection of the data", whereas with BFs the result reflects the "evidence of one theory over another". What follows is a brief overview of one Bayesian scheme and an investigation of how it compares to CET.
The Bayes Factor is a valuable tool for determining the degree of evidence for the absence of a treatment effect; see most recently Hoekstra et al. (2017). Consider, for the two-sample testing of normally distributed data (as described for box 2.1), a Bayes Factor testing scheme in which we take the JZS (Jeffreys-Zellner-Siow) prior for the alternative hypothesis; see Rouder et al. (2009). Note that, for the Bayes Factor, the null hypothesis, $H_0$, corresponds to $\mu_d = 0$; and the alternative, $H_1$, corresponds to $\mu_d \neq 0$.
The JZS testing scheme involves placing a normal prior on $\eta = (\mu_2 - \mu_1)/\sigma$, $\eta \sim \mathrm{Normal}(0, \sigma_\eta^2)$, and, for the hyper-parameter $\sigma_\eta$, placing an inverse chi-squared prior, $\sigma_\eta^2 \sim \text{inv-}\chi^2(1)$. Integrating out $\sigma_\eta$ shows that this is equivalent to having a Cauchy prior, $\eta \sim \mathrm{Cauchy}$. The JZS prior is recommended as a reasonable "objective prior" to be used in a Bayesian alternative to the common frequentist t-test (Rouder et al., 2009). We can write the JZS Bayes Factor in terms of the standard t-statistic, $T = (\bar{x}_1 - \bar{x}_2)/s^*$, with $n^* = n_1 n_2/(n_1 + n_2)$ and $\nu = n_1 + n_2 - 2$ degrees of freedom, as follows (Rouder et al., 2009):
$$BF_{01} = \frac{\left(1 + T^2/\nu\right)^{-(\nu+1)/2}}{\int_0^\infty (1 + n^* g)^{-1/2} \left(1 + \frac{T^2}{(1 + n^* g)\,\nu}\right)^{-(\nu+1)/2} (2\pi)^{-1/2}\, g^{-3/2}\, e^{-1/(2g)}\, dg}$$
When the observed mean difference is exactly 0, the BF increases logarithmically with n. For small to moderate $\hat{\mu}_d$, the BF supports the null for small values of n but, as n becomes larger, yields less support for the null and eventually favours the alternative. The horizontal lines mark the 3:1 and 1/3 thresholds ("moderate evidence") as well as the 10:1 and 1/10 thresholds ("strong evidence").
When the observed mean difference is exactly zero, CET provides increasing evidence in favour of equivalence with increasing n. For small to moderate $\hat{\mu}_d$, CET supports the null at first and then, as n becomes larger, at a certain point favours the alternative. The sharp change-point represents border cases. Consider case "f" in Figure 1: if n increased, the confidence intervals would shrink and, at a certain point, the $(1 - \alpha_1)$% C.I. would exclude 0 (similar to case "b"). At that point, the result [...] accordingly.
Results are presented in Figure 7, based on 5,000 distinct simulated datasets per scenario.
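To make the JZS Bayes Factor concrete, the following is a minimal numerical sketch of the Rouder et al. (2009) integral form, written in terms of the t-statistic $T$, $n^* = n_1 n_2/(n_1 + n_2)$ and $\nu = n_1 + n_2 - 2$. The function name `jzs_bf01`, the midpoint-rule integration and the grid size are illustrative assumptions, not the authors' implementation; the value returned is the evidence for the null relative to the alternative.

```python
import math


def jzs_bf01(t, n1, n2, grid=20000):
    """Sketch of the JZS Bayes Factor BF_01 (null over alternative).

    Assumes the integral form of Rouder et al. (2009) with a unit-scale
    Cauchy prior on the effect size; `grid` is an arbitrary choice.
    """
    n_star = n1 * n2 / (n1 + n2)   # effective sample size n*
    nu = n1 + n2 - 2               # degrees of freedom
    numerator = (1 + t * t / nu) ** (-(nu + 1) / 2)

    def integrand(g):
        a = 1 + n_star * g
        return (a ** -0.5
                * (1 + t * t / (a * nu)) ** (-(nu + 1) / 2)
                * (2 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1 / (2 * g)))

    # Map g in (0, inf) to u in (0, 1) via g = u / (1 - u), so dg = du / (1 - u)^2,
    # then apply the midpoint rule, which never evaluates the endpoints.
    total = 0.0
    for i in range(grid):
        u = (i + 0.5) / grid
        g = u / (1 - u)
        total += integrand(g) / (1 - u) ** 2
    return numerator / (total / grid)


if __name__ == "__main__":
    # With an observed mean difference of zero (t = 0), support for the null
    # should grow with the sample size.
    for n in (10, 50, 250):
        print(n, jzs_bf01(0.0, n, n))
```

Values above 3 (or below 1/3) would correspond to the "moderate evidence" thresholds mentioned above, and values above 10 (or below 1/10) to the "strong evidence" thresholds.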

Several findings merit comment:
• In this simulation study, the JZS-BF admits a very low frequentist type I error, with a recorded maximum of approximately 0.01 at a sample size of n = 110. As the sample size increases, the frequentist type I error diminishes to a negligible level.
• The JZS-BF requires less data to reach a negative conclusion than the CET. However, with moderate to large sample sizes (n = 100 to 5,000) and small true mean differences ($\mu_d$ = 0 to 0.25), both methods are approximately equally likely to deliver a negative conclusion.
• While the JZS-BF requires less data to reach a conclusion when the true mean difference is small ($\mu_d$ = 0 to 0.25) (see how the solid black curve drops more rapidly than the dashed grey line), there are scenarios in which larger sample sizes will, surprisingly, reduce the likelihood of obtaining a conclusive result (see how the solid black curve drops abruptly and then rises slightly as n increases for $\mu_d$ = 0.07, 0.09, 0.13, and 0.18).
• The JZS-BF is always less likely to deliver a positive conclusion (see how the dashed blue line is always higher than the solid blue line). In scenarios like those considered, the JZS-BF may require larger sample sizes to reach a positive conclusion and may be considered "less powerful" in a traditional frequentist sense.
The results of the simulation study suggest that, in many ways, the JZS-BF and CET operate very similarly. One can think of the JZS-BF and CET as two pragmatically similar, yet philosophically different, tools for making "trichotomous significance-testing decisions". Both tools will often result in the same outcome, given the same data.
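For comparison, the trichotomous CET decision can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes the first stage is a standard two-sided pooled-variance t-test at level $\alpha_1$, the second stage is a TOST equivalence test with margin $[-\Delta, \Delta]$ at level $\alpha_2$, and strict inequalities at both thresholds. The names `t_cdf` and `cet` are hypothetical, and the t-distribution CDF is obtained by Simpson's rule so that only the standard library is needed.

```python
import math


def t_cdf(x, df, steps=2000):
    """Student's t CDF via Simpson's rule on the density (steps must be even)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    a = abs(x)
    h = a / steps
    area = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def cet(x1, x2, delta, alpha1=0.05, alpha2=0.10):
    """Return (conclusion, p-value) for a two-sample CET sketch."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    df = n1 + n2 - 2
    # Pooled variance and the standard error s* of the mean difference.
    sp2 = (sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)) / df
    s_star = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    d = m1 - m2

    # Stage 1: standard two-sided NHST of mu_d = 0.
    p1 = 2 * (1 - t_cdf(abs(d) / s_star, df))
    if p1 < alpha1:
        return "positive", p1

    # Stage 2 (only if stage 1 fails to reject): two one-sided tests.
    p_lower = 1 - t_cdf((d + delta) / s_star, df)  # H0: mu_d <= -delta
    p_upper = t_cdf((d - delta) / s_star, df)      # H0: mu_d >= +delta
    p2 = max(p_lower, p_upper)
    if p2 < alpha2:
        return "negative", p2
    return "inconclusive", p2
```

With two identical samples of 100 observations each and $\Delta = 0.5$, this sketch returns a negative (equivalence) conclusion, whereas two very small samples with a modest observed difference typically yield an inconclusive result.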

A CET publication policy
Many researchers have put forth ideas for new publication policies aimed at addressing the issue of publication bias. There is a wide range of opinions on how to incorporate more null results into the published literature. Consider just a few interesting ideas.
In an editorial titled "Journals Should Publish All Null Results and Should Sparingly [...] Positive results should be published equally promptly, but only on the web, pending independent replication; once refuted, the original article and the refutation could be printed as a single nice null report; the rare validated findings should appear in print with full details."
Another suggestion is that of Shields (2000), who advocates accepting null papers in a special section of a journal, under the category of "Null Results in Brief". The null papers in this section would provide only a brief summary of the methods and results of the studies. With regard to power, Shields (2000) states that: "For the [null] paper to be considered for publication, there must be sufficient statistical power to test the a-priori hypothesis. For example, the authors should state the level of power to detect an odds ratio of 2.0 with the current sample size." Dirnagl et al. (2010) make a similar suggestion with the note that "the quality of the data submitted to our Negative Results section must meet the same rigorous standards that our journal applies to all other submissions. In fact, it may be said that the standards must even exceed those applied currently, as type II error (false negatives) considerations need to be included."
Certainly one of the most exciting proposals of late is that of Registered Reports (RR). RR is one of many proposed "two-step" manuscript review schemes in which acceptance for publication is granted prior to obtaining the results; see e.g. [...]
The policy we put forth here is not meant to be an alternative to the "preregistration" component of RR. The benefits of preregistration are clear and, in our view, it is most often "worth the effort" (Wager & Williams, 2013). If implemented properly, preregistration should not "stifle exploratory work" (Gelman, 2013b). Instead, what follows is an alternative to the second "commitment to publish" component.

Outline of a CET-based publication policy
Figure 9 illustrates the steps of our proposed policy. What follows is a general outline.
Registration: In the first stage of a CET-based policy, before any data are collected, a researcher will register the intent to conduct a study with a journal's editor. As in an RR policy, this registration process will detail the motivations and merits of the study and list the defined outcomes, the various hypotheses to be tested, and the proposed methods for analysis. Unlike in the RR policy, the registration will not require a sample size calculation showing a specific level of statistical power. Instead, the researcher will define an equivalence margin for each hypothesis test to be carried out. For example, if a researcher intends to fit a linear regression model with five explanatory variables, a margin should be defined for each of the five variables. A target sample size should be stated, but need not be justified with regard to power considerations. The researcher will also need to note if there are plans for any sample size reassessments, interim and/or futility analyses.
Editorial and Peer Review: If the merits of the study satisfy the editorial board, the registered study plan will then be sent to reviewers to assess whether the methods for analysis are adequate and whether the equivalence margins are sufficiently narrow. Once the peer reviewers are satisfied (possibly after revisions to the registration plan), the journal will then agree, in principle, to accept the study for eventual publication, on condition that either a positive or a negative result is obtained.
Data collection: Armed with this "in principle acceptance", the researcher will then collect data in an effort to meet the established sample size target. Once the data are collected and the analyses complete, the study will be published if and only if either a positive or a negative result is obtained, as defined by the pre-specified equivalence margins. Inconclusive results will not generally be published, thus protecting a journal from becoming a "dumping ground" for failed studies. In very rare circumstances, however, it may be determined that an inconclusive study offers a substantial contribution to the field and should therefore be considered for publication. As is required practice for good reporting, any failure to meet the target sample size should be stated clearly, along with a discussion of the reasons for failure and the consequences with regard to the interpretation of results; see Toerien et al. (2009).
This proposed policy is most similar to the RR Bayesian option outlined in Section 3. Journals only commit ("in principle acceptance") to publish conclusive studies (a small p-value, or a BF above or below a certain threshold) and no a-priori sample size requirements are forced upon a researcher. As noted in Section 3, we expect that the conclusions obtained under either policy will often be the same. The CET-based policy therefore represents a frequentist alternative to the RR-BF policy.
A journal may wish to impose stricter or weaker requirements for publication of results, and this can be done by setting thresholds for $\alpha_1$ and $\alpha_2$ accordingly (e.g. stricter thresholds: $\alpha_1 = 0.01$, $\alpha_2 = 0.05$, $\Delta = 0.5 s_p$; weaker thresholds: $\alpha_1 = 0.05$, $\alpha_2 = 0.10$, $\Delta = 0.5 s_p$). Also, note that a p-value may be used in one way for the interpretation of results and in another way to inform publication decisions. Recently, a large number of researchers (Benjamin et al., 2017) have publicly suggested that the threshold for defining "statistical significance" be changed from p < 0.05 to p < 0.005. However, they are careful to emphasize that while this stricter threshold should change the description and interpretation of results, it should not change "standards for policy action nor standards for publication".
"Gaming the system": When evaluating a new publication policy, one should always ask how (and how easily) a researcher could, unintentionally or not, "game the system". For the CET-policy as described above, we see one rather obvious strategy.
Consider the following, admittedly extreme, example. A researcher submits 20 different study protocols for pre-registration, each with a very small sample size target (e.g. n = 8). Suspend any disbelief, and suppose these studies all have good merit, are all well written, and are all well designed (besides being severely underpowered), and are therefore all granted in principle acceptance. Then, in the event that the null is true (i.e. µ = 0) for all 20 studies (and with $\alpha_1 = 0.05$), it is expected that, on average, one study out of the 20 will obtain a positive result and thus be published. This is publication bias at its worst.
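As a worked check of this example, assume (for illustration only) that the 20 studies are independent and that each has a false positive probability of exactly $\alpha_1 = 0.05$ when the null is true. The expected number of positive results and the probability of at least one are then:

```latex
\begin{align*}
E[\text{number of positive results}] &= 20 \times 0.05 = 1, \\
\Pr(\text{at least one positive result}) &= 1 - (1 - 0.05)^{20} \approx 0.64 .
\end{align*}
```

Under these assumptions, roughly two out of three researchers pursuing this strategy would obtain at least one publishable false positive.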
In order to discourage this unfortunate practice, we suggest making a researcher's history of pre-registered studies available to the public. The researcher "gaming the system" in this way will still score his one "type I error" publication, but it will also be known to the public that in 19 other experiments, research was unsuccessful and valuable resources were essentially wasted. With digital identifiers such as ORCID (Haak et al., 2012), it should be straightforward to maintain a track record across journals and disciplines of the number of successful and unsuccessful studies for each researcher/laboratory. However, while potentially beneficial, we do not anticipate this type of action being necessary. With the CET-policy, it is no longer in a researcher's interest to "game the system".

Conclusion
Publication bias has been recognized as a serious problem for several decades now (Rosenthal, 1979). Yet, in many fields, it is only getting worse (Pautasso, 2010). An investigation by Kühberger et al. (2014) concludes that the "entire field of psychology" is now tainted by "pervasive publication bias".
There remains substantial disagreement on the merits of pre-registration and result-blind peer-review (see e.g. ). We recognize that current publication policies, in which evidence is dichotomized (without any "ontological basis"), may be highly unsatisfactory. However, one benefit to adopting distinct categories based on clearly defined thresholds (as in the CET-policy) is that one can assess, in a systematic way, the state of published research (in terms of reliability, power, reproducibility, etc.). While using a "more holistic view of the evidence" to inform publication decisions may (or may not?) prove effective, 'meta-research' under such a paradigm is clearly less feasible. As such, the question of effectiveness may perhaps never be adequately answered.
Bayesian approaches offer many benefits. However, we see three main drawbacks.
First, adopting Bayesian testing requires a substantial paradigm shift. Since the interpretation of findings deemed significant by traditional NHST may differ with Bayes, some will no doubt be reluctant to accept the shift. With CET, the traditional usage and interpretation of the p-value remains unchanged, except in circumstances when one fails to reject the null. As such, CET does not change the interpretation of findings already established as significant. Indeed, CET simply "extend[s] the arsenal of confirmatory methods rooted in the frequentist paradigm of inference" (Wellek, 2017). Second, as we observed in our simulation study, Bayesian testing is potentially less powerful than NHST (in the traditional frequentist sense, with a fixed sample size design) and as such could require substantially larger sample sizes. Finally, we share the concern of Morey & Rouder (2011), who write that Bayesian testing "provides no means of assessing whether rejections of the nil [null] are due to trivial or unimportant effect sizes or are due to more substantial effect sizes." For these reasons, we believe CET should be welcomed by any "pragmatic Bayesian" (Kass et al., 2006).
There are many potential areas for further research. Determining whether $\Delta$ and/or $\alpha_1$ and/or $\alpha_2$ should be chosen with consideration of the sample size is important and not trivial; related work includes Pérez & Pericchi (2014), who put forward a "Bayes/non-Bayes compromise" in which the α-level of a confidence interval changes with n. Issues which have proven problematic for standard equivalence testing must also be addressed for CET. These include multiplicity control (Lauzon & Caffo, 2009) and potential problems with interpretation (Aberegg et al., 2017). It would also be worthwhile to consider whether CET is appropriate for testing for baseline balance (Senn, 1994). Finally, the impact of a CET policy on meta-analysis should be examined (Hedges, 1992), i.e. how should one account for the exclusion of inconclusive results in the published literature when deriving estimates in a meta-analysis?
The publication policy outlined here should be welcomed by journal editors, researchers and all those who wish to see more reliable science. Research journals which wish to remain relevant and gain a high impact factor should welcome the CET-policy as it offers a mechanism for excluding inconclusive results while providing a space for potentially impactful negative studies. Embracing equivalence testing is an effective way to make publishing null results "more attractive" (O'Hara, 2011). Researchers should be pleased with a policy that provides "in principle acceptance" and does not insist on specific sample size requirements that may not be feasible or desirable.
The requirement to specify an equivalence margin prior to collecting data will have the additional benefit of forcing researchers and reviewers to think about what would represent a meaningful effect size before embarking on a given study. While there will no doubt be pressure on researchers to "p-hack" in order to meet either the $\alpha_1$ or the $\alpha_2$ threshold, this can be discouraged by insisting that an analysis strictly follows the pre-registered analysis plan. Adopting strict thresholds for significance can also act as a deterrent. Finally, we believe that the CET-policy will improve the reliability of published science by not only allowing for more negative research to be published, but by modifying the incentive structure driving research (Nosek et al., 2012). [...] and provides the basis for why statistical power in many fields has not improved (Smaldino & McElreath, 2016) despite being highlighted as an issue over six decades ago (Cohen, 1962). A CET-based policy may provide the incentive scientists need to pursue higher statistical power. If CET can change the incentives driving research, the reliability of science will be further improved. More research on this question (i.e. "meta-research") is needed.

Figure 1 illustrates the three conclusions with their associated confidence intervals. In this figure, we once again consider two-sample testing of normally distributed data as described for box 2.1, with $\alpha_1 = 0.05$ and $\alpha_2 = 0.10$. Situation "a", in which the $(1 - \alpha_1)$% C.I. is entirely outside the equivalence margin, is what Guyatt et al. [...]
Figure 2 shows the distribution of standard two-tailed p-values under NHST and corresponding $p_{CET}$ values under CET for three different sample sizes. These are the result of two-sample testing of normally distributed data as described for box 2.1, with $n_1 = n_2$.

Figure 2: Distribution of two-tailed p-values from NHST and $p_{CET}$ values from CET, with varying n (total sample size) and $\delta = [-0.5, 0.5]$. Data are the results from 10,000 Monte Carlo simulations of two-sample normally distributed data with equal variance, $\sigma^2 = 1$. The true difference in means, $\mu_d = \mu_1 - \mu_2$, is drawn from a uniform distribution between −2 and 2. Black points ($= 1 - p_2$) indicate evidence in favour of equivalence ("negative result"), whereas blue points ($= p_1$) indicate evidence in favour of non-equivalence ("positive result"). An "inconclusive result" would occur for all black points falling in between the two dashed horizontal lines at $\alpha_1 = 0.05$ and $\alpha_2 = 0.10$ (i.e. for $p_2 > \alpha_2$). The format of this plot is based on Figure 1, "Distribution of two-tailed P-values from Student's t-test", from Lew (2013).

More in-depth related work includes Shieh (2016), who establishes exact power calculations for TOST for equivalence of normally distributed data, and da Silva et al. (2009), who review power and sample size calculations for equivalence testing of binary and time-to-event outcomes. As before, let the true population mean difference be $\mu_d = \mu_1 - \mu_2$ and the true population variance equal $\sigma^2$. Let $\sigma^* = \sigma\sqrt{1/n_1 + 1/n_2}$ and $s^* = s_p\sqrt{1/n_1 + 1/n_2}$.

Figure 3 illustrates how each of the three conclusions can be reached based on the values of $s^*$ and $\hat{\mu}_d = \bar{x}_1 - \bar{x}_2$ obtained from the data. For this plot, n = 90 (with $n_1 = n_2$), $\Delta = 0.5$, $\alpha_1 = 0.05$ and $\alpha_2 = 0.10$. The black lettered points correspond to the scenarios of Figure 1. The interior diamond, ♦, (with corners located at [0,0], [...]
[Figure 3: plot regions labelled "negative", "positive" and "inconclusive".]

Figure 5: Suppose the anticipated effect size is $\hat{\mu}_d = 0.205$ and the anticipated variance is $\hat{\sigma}^2 = 1$ [...]
$$\Pr(\text{success}) = 1 - \tfrac{1}{2}\Pr(\text{inconclusive};\ \mu_d = \hat{\mu}_d, \hat{\sigma}^2) - \tfrac{1}{2}\Pr(\text{inconclusive};\ \mu_d = 0, \hat{\sigma}^2) \qquad (2)$$
represents the probability of a "successful" study under the assumption that the null ($\mu_d = 0$) and alternative (with specified $\mu_d = \hat{\mu}_d$) are equally likely. (Note that both false positive and false negative studies in this equation are considered "successful".) This weighted average could be considered a simple version of what is known as "assurance" (O'Hagan et al., 2005). Figure 5 shows how consideration of Pr(success) [...]
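The probability of a "successful" study described above can be approximated in closed form under a simplifying assumption. The sketch below is not the authors' calculation: it treats the variance as known (a z-approximation rather than the t-based tests), declares a positive result when the two-sided test rejects at $\alpha_1$, and otherwise declares a negative result when both one-sided equivalence tests reject at $\alpha_2$. The function names `pr_inconclusive` and `pr_success` are hypothetical.

```python
from statistics import NormalDist


def pr_inconclusive(mu_d, sigma, n1, n2, delta, alpha1=0.05, alpha2=0.10):
    """Approximate Pr(inconclusive) for CET, assuming a known variance."""
    z = NormalDist()
    se = sigma * (1 / n1 + 1 / n2) ** 0.5            # sigma* in the text
    c_pos = z.inv_cdf(1 - alpha1 / 2) * se           # |D| > c_pos  -> positive
    c_neg = delta - z.inv_cdf(1 - alpha2) * se       # |D| < c_neg  -> negative
    lo = max(c_neg, 0.0)
    if lo >= c_pos:
        return 0.0                                   # every outcome is conclusive
    d = NormalDist(mu_d, se)                         # distribution of the observed difference D
    # Inconclusive region: lo <= |D| <= c_pos, on both sides of zero.
    return (d.cdf(c_pos) - d.cdf(lo)) + (d.cdf(-lo) - d.cdf(-c_pos))


def pr_success(mu_alt, sigma, n1, n2, delta, alpha1=0.05, alpha2=0.10):
    """Equal-weight average of Pr(conclusive) under the alternative and the null."""
    under_alt = pr_inconclusive(mu_alt, sigma, n1, n2, delta, alpha1, alpha2)
    under_null = pr_inconclusive(0.0, sigma, n1, n2, delta, alpha1, alpha2)
    return 1 - 0.5 * under_alt - 0.5 * under_null


if __name__ == "__main__":
    # Anticipated effect 0.205, variance 1, margin 0.5, as in the Figure 5 caption.
    for n in (45, 100, 250, 1000):
        print(2 * n, round(pr_success(0.205, 1.0, n, n, 0.5), 3))
```

Evaluating `pr_success` over a grid of sample sizes gives a rough picture of how the chance of a conclusive result changes with n; exact values would require the t-based calculations referenced in the text.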

Figure 6 (left panel) shows how the JZS Bayes Factor changes with sample size for four different values of the observed difference in means, $\hat{\mu}_d = \bar{x}_1 - \bar{x}_2$; the observed variance remains constant at $s_p^2 = 1$.

Figure 6: Based on Figure 5 from Rouder et al. (2009). The left (middle; right) panel shows how the JZS Bayes Factor (the posterior probability of $H_0$; $p_{CET}$) changes with sample size for four different observed mean differences, $\hat{\mu}_d = \bar{x}_1 - \bar{x}_2$. The observed variance is constant, $s_p^2 = 1$.

Figure 6 (right panel) shows $p_{CET}$ values for the same four values of $\hat{\mu}_d$, constant $s_p^2 = 1$, and the equivalence margin of [−0.50, 0.50]. The lines suggest that CET possesses "ideal behaviour" (Rouder et al., 2009) similar to that observed with the BF.

Figure 8: The RR publication policy.

Figure 8 illustrates the RR procedure and Chambers et al. (2014) [...] a small number of RR journals, a Bayesian alternative option is offered. Instead of committing to a specific sample size, researchers commit to achieving a certain BF.

Figure 9: The CET publication policy.
Using an optimality model, Higginson & Munafò (2016) conclude that, given current incentives, the rational strategy of a scientist is to "focus almost all of their research effort on underpowered exploratory work [... and] carry out lots of underpowered small studies to maximize their number of publications, even though this means around half will be false positives." This result is in line with the views of many (e.g. Bakker et al. (2012), Button et al. (2013) and Gervais et al. (2015)),