Conservative Tests under Satisficing Models of Publication Bias

Publication bias leads consumers of research to observe a selected sample of statistical estimates calculated by producers of research. We calculate critical values for statistical significance that could help to adjust after the fact for the distortions created by this selection effect, assuming that the only source of publication bias is file drawer bias. These adjusted critical values are easy to calculate and differ from unadjusted critical values by approximately 50%—rather than rejecting a null hypothesis when the t-ratio exceeds 2, the analysis suggests rejecting a null hypothesis when the t-ratio exceeds 3. Samples of published social science research indicate that on average, across research fields, approximately 30% of published t-statistics fall between the standard and adjusted cutoffs.


Introduction
A natural tendency in scientific work is for statistically significant results to be reported with greater likelihood than insignificant results.
In fields like economics or psychology, where hypothesis testing plays an important role in establishing the robustness of estimated effects, this tendency may produce a systematic selection effect, whereby published estimates are more extreme than the underlying population effects. Rosenthal famously notes that in the "extreme view of this problem, the 'file drawer problem,' . . . the journals are filled with the 5% of the studies that show type I errors, while the file drawers back at the lab are filled with the 95% of the studies that show nonsignificant . . . results" (see p. 638 in [1]). This fundamental problem has been widely acknowledged and appreciated in economics [2–13]. In recent years the issue has received renewed attention in finance [14–18], in statistics [19–22], in political science [23,24], in psychology [1,25–27], in medicine [28–30], and in other fields. Publication bias in medicine is a sufficiently serious concern that the U.S. Congress first mandated trial registration in 1997; the clinicaltrials.gov trial registry was created in 2000 and expanded in 2007; and the NIH recently sought public comment on a further expansion of the requirements for trial results reporting [31].
If file drawer bias leads statistically significant results to be more prevalent in the published literature than they would be in its absence, then the threshold for the results of statistical tests to be significant should be higher than it otherwise would be in order to maintain the desired type I error rate. This intuitive idea motivates the analysis of this paper. We show that if a finding's conditional publication probability, given the result of a statistical test, is a step function with the step occurring at the conventional critical value for the test (in other words, under a satisficing model of publication bias), then an adjustment to conventional critical values restores the intended type I error rate (significance level) of the test among the sample of published papers. These adjusted critical values are simple to calculate and require only access to a table of values of the original test statistic.
We give examples of adjusted critical values for various test statistics at conventional levels of significance. These adjusted critical values are on average 49% larger than the corresponding unadjusted critical values. For example, if authors use two-tailed t-tests to gauge the robustness of their findings, and only submit findings with t-statistics above 1.96 in absolute value, then 5 percent of the t-statistics observed by the editor will exceed 3.02 in absolute value. An editor seeking to counteract the selection effect created by authors' behavior, then, would use a critical value of 3.02. Further, suppose a literature contains independent estimates of the same quantity, with authors testing the same null hypothesis using the t-test, and that 95 percent of the t-statistics in the literature are between 2 and 3 in absolute value, with the remainder above. Despite this (hypothetical) literature of significant results, there would in fact be little evidence against the null hypothesis if file drawer bias prevented the submission of insignificant estimates. (Naturally, this game between authors and editors applies equally to editors and readers, with readers playing the role of the editors and editors the role of the authors.) The approach taken here to addressing publication bias is to restore the intended type I error rate of hypothesis tests by adjusting the critical value. The approach we propose might help to assess the reliability of an existing literature, and could complement current methods that assess and correct retrospectively for publication bias. Those approaches typically apply to meta-analysis and make use of funnel plots and related meta-regression techniques aimed at estimating the possible presence of file-drawer effects and recovering point estimates for the population average of the underlying estimates [32–34].
These latter methods are relatively narrow in scope and are sensitive to deviations from their underlying assumptions, for example by requiring large sample sizes and low heterogeneity [35].
Researchers have addressed the issue of hypothesis testing in the context of publication bias [15], but have focused on selection rules, such as specification searching and data mining, that are more pernicious than satisficing models. For the problems created by file drawer bias, a satisficing model may be realistic. Other types of publication bias may not be consistent with satisficing models, and the results of this paper do not apply to such problems. One way to understand the contribution of the present paper is that it clarifies that a researcher who insists on truly decisive rejections of null hypotheses (e.g., t-tests greater than 5 in absolute value) must implicitly believe in more troublesome forms of publication bias than simple file drawer bias (setting aside issues regarding inappropriate standard error calculation).

Methods
Suppose authors calculate a test statistic, T, and plan to reject a given null hypothesis at significance level α if T > c_{1−α}, for a known critical value c_{1−α}. For example, T could be the square of the t-ratio for a regression coefficient, and authors could plan to reject the null hypothesis of zero if T exceeds c_{0.95} = 1.96². Let the distribution function of T under the null hypothesis be denoted F(·) and let F^{−1}(·) denote the corresponding quantile function. (Throughout, we assume that the quantile function is uniquely defined, i.e., that F(·) is strictly monotonic. We also assume the null hypothesis is true, as is the case for all routinely calculated p-values and test statistics.) The critical value c_{1−α} is given by c_{1−α} = F^{−1}(1 − α). Throughout, we assume that authors submit statistically insignificant results with probability π_0, but submit statistically significant results with probability π_1. Formally, we state

ASSUMPTION 1. P(D = 1 | T = t) = π_0 if t ≤ c_{1−α}, and P(D = 1 | T = t) = π_1 if t > c_{1−α},
where D equals one if a study is submitted and equals zero otherwise. Thus, the conditional probability of submission is a step function, with the step occurring at c_{1−α} and with step height π_1 − π_0. A few remarks are in order. First, Assumption 1 would be unreasonable if different individuals held differing views about the significance level at which tests should be conducted. However, as there is a clear default of α = 0.05, the assumption seems reasonable. Second, while Assumption 1 simplifies the analysis, it is not the only condition under which the results derived in the next section obtain. In particular, it is not important that the conditional probability be constant to the left of c_{1−α}. However, it is important that it be constant to the right of c_{1−α}. That the submission probability is constant to the right of c_{1−α} is in fact the essence of a satisficing model of publication bias: there exists a threshold at which an estimate becomes statistically significant, and authors are just as likely to submit a paper with a test statistic of T = c_{1−α} + a as one with a test statistic of T = c_{1−α} + b, for any a, b > 0.
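As a concrete illustration, the step-function submission rule of Assumption 1 is easy to simulate. The sketch below (the submission probabilities π_1 = 0.9 and π_0 = 0.1 are hypothetical choices, not values from the paper) draws t-ratios under a true null, applies the satisficing rule, and reports the share of submitted statistics that clear the conventional two-sided cutoff:

```python
import random
from statistics import NormalDist

random.seed(1)
ALPHA = 0.05
C = NormalDist().inv_cdf(1 - ALPHA / 2)    # conventional two-sided cutoff, ~1.96
PI1, PI0 = 0.9, 0.1                        # hypothetical submission probabilities

submitted = []
for _ in range(200_000):
    t = random.gauss(0.0, 1.0)             # t-ratio under a true null
    keep_prob = PI1 if abs(t) > C else PI0 # step function at the cutoff
    if random.random() < keep_prob:
        submitted.append(t)

share_significant = sum(abs(t) > C for t in submitted) / len(submitted)
print(round(share_significant, 2))         # well above the nominal 0.05
```

Under these submission probabilities roughly a third of submitted statistics are significant, even though the null is true in every study, versus the nominal 5 percent.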

Results
Under Assumption 1, the distribution function of submitted test statistics is given by

G(t) = (π_0/π) F(t) for t ≤ c_{1−α}, and G(t) = 1 − (π_1/π)(1 − F(t)) for t > c_{1−α},

where π is the unconditional probability of submission: π = απ_1 + (1 − α)π_0. The calculation is a straightforward application of Bayes' rule, as follows: define G(t) = P(T ≤ t | D = 1). Bayes' rule implies

G(t) = P(D = 1 | T ≤ t) P(T ≤ t) / P(D = 1).
By inverting G(·), we can derive a formula for critical values that adjust for the type I error rate distortions induced by file drawer bias.

LEMMA. Under the null hypothesis and Assumption 1, the test that rejects when T > F^{−1}(1 − απ/π_1) has type I error rate α among submitted test statistics.
The lemma clarifies that to undo the selection effect created by authors' selective submission, an editor should calculate the critical value for the relevant testing procedure, using any standard table for the test, but pretending that the desired type I error rate was απ/π_1. Under the null hypothesis and Assumption 1, such a procedure guarantees a test with type I error rate α. (One could choose any type I error rate here, not just α, as this refers only to the type I error rate among the submitted test statistics, which is clearly a selected sample. We choose α for its intuitive appeal, keeping the level of false positives identical across the entire universe of tests and the selected sample of submitted tests.) This conclusion would seem to be of little practical consequence, since neither π_1 nor π_0 is known. However, it is straightforward to derive a bound under a worst-case scenario.
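If π_0 and π_1 were known, the adjusted cutoff could be read off a standard table at level απ/π_1. A minimal sketch for the two-tailed t-ratio, using hypothetical values π_1 = 0.9 and π_0 = 0.1 and Python's standard library:

```python
from statistics import NormalDist

alpha, pi1, pi0 = 0.05, 0.9, 0.1            # pi1, pi0 are hypothetical values
pi = alpha * pi1 + (1 - alpha) * pi0        # unconditional submission probability
alpha_star = alpha * pi / pi1               # level to look up in a standard table
c_adjusted = NormalDist().inv_cdf(1 - alpha_star / 2)  # two-sided t-ratio cutoff
print(round(alpha_star, 4), round(c_adjusted, 2))
```

With these values the editor would look up the two-sided critical value at a level of roughly 0.008 instead of 0.05, giving a cutoff between the conventional 1.96 and the worst-case 3.02.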
PROPOSITION. Under the null hypothesis and Assumption 1, a test with type I error rate no more than α is obtained by utilizing a critical value of F^{−1}(1 − α²).
PROOF. Since G(·) is increasing in π_0, an upper bound on the critical value is obtained by setting π_0 = 0, in which case απ/π_1 = α² and the critical value of the lemma becomes F^{−1}(1 − α²).

Examples

Table 1 lists unadjusted and adjusted critical values for selected tests, where the adjusted critical values are those of the proposition. As discussed, under a satisficing model of publication bias, use of these critical values guarantees that the tests have a type I error rate of at most 5 percent.
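For the two-tailed t-test, the proposition's worst-case cutoff F^{−1}(1 − α²) can be computed directly from the normal quantile function; a minimal sketch using Python's standard library:

```python
from statistics import NormalDist

alpha = 0.05
c_standard = NormalDist().inv_cdf(1 - alpha / 2)     # conventional cutoff
c_adjusted = NormalDist().inv_cdf(1 - alpha**2 / 2)  # worst case: pi0 -> 0
print(round(c_standard, 2), round(c_adjusted, 2))    # 1.96 3.02
```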
The tests considered are two-tailed t-tests, F-tests with 5 and 10 numerator degrees of freedom and large denominator degrees of freedom, and two types of nonparametric two-sample tests (Kolmogorov-Smirnov and Feller). All tests are of the form "reject if T > c_{1−α}," for some T and some c_{1−α}. The t- and F-test distributions are standard [36] and critical values are calculated using statistical software. The Kolmogorov-Smirnov two-sample test statistic is T = √n sup_x |Ĥ_1(x) − Ĥ_2(x)|, where Ĥ_1(x) and Ĥ_2(x) are the empirical distribution functions for two independent samples of sizes n_1 and n_2 drawn, under the null hypothesis, from the same distribution, and where n = n_1 n_2/(n_1 + n_2). Under the null, the distribution function of T converges to t ↦ 1 − 2 Σ_{j=1}^∞ (−1)^{j−1} exp(−2j²t²) (critical values for this distribution are taken from the tabulation in [37]). The Feller two-sample test statistic is T = √n sup_x {Ĥ_1(x) − Ĥ_2(x)}, with limiting distribution function t ↦ 1 − exp(−2t²) (Theorem 4, [38]). Critical values for this distribution are calculated directly.
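The nonparametric cutoffs can be recovered by numerically inverting the limiting distribution functions given above. A sketch, assuming the Kolmogorov series is truncated at 100 terms (ample at these quantiles) and using simple bisection:

```python
import math

def feller_cdf(t):
    # Limiting distribution of the one-sided (Feller) two-sample statistic
    return 1 - math.exp(-2 * t * t)

def ks_cdf(t):
    # Kolmogorov limiting distribution, series truncated at 100 terms
    return 1 - 2 * sum((-1) ** (j - 1) * math.exp(-2 * j * j * t * t)
                       for j in range(1, 101))

def quantile(cdf, p, lo=1e-6, hi=10.0):
    # Bisection inverse of a monotone distribution function
    for _ in range(200):
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha = 0.05
for name, cdf in [("KS", ks_cdf), ("Feller", feller_cdf)]:
    # Unadjusted cutoff at 1 - alpha, worst-case adjusted cutoff at 1 - alpha^2
    print(name, round(quantile(cdf, 1 - alpha), 3),
          round(quantile(cdf, 1 - alpha**2), 3))
```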
Looking over Table 1, it is apparent that the adjusted critical values for the tests considered are 30 to 70 percent larger than the corresponding unadjusted critical values. For example, the adjusted critical values for the common t-test are about 50 percent larger than their unadjusted counterparts. For a test of 5 percent type I error rate, we reject the null hypothesis if the absolute value of the t-ratio exceeds 1.96. Adjusting for file drawer bias, we reject if it exceeds 3.02. Adjusted critical values for the F-test with 10 numerator degrees of freedom are similarly about 50 percent larger than their unadjusted counterparts, while those for the F-test with 5 numerator degrees of freedom are 60-70 percent bigger.

Statistics from Published Papers
To gauge the practical difference in published research between standard and adjusted cutoffs, we probed the literature for data on the distribution of t-statistics to be compared to the adjusted and unadjusted cutoffs for two-tailed tests at the α = .05 level. Our search yielded eight studies with publicly available data (or data available in the original figures themselves) in the behavioral and social sciences and one from biology that assessed the prevalence of the file-drawer problem by examining the distribution of t-values or equivalent statistics [13,39–45]. These studies differed widely in key methodological choices, including discipline, sampling strategy (e.g., some surveyed specific journals, others specific topics), sample size, historical period considered, and whether they had extracted all available statistics or hand-selected substantive ones. This heterogeneity precludes the calculation of a meta-analytical summary. Nonetheless, this collection of published results yields a relatively coherent picture by suggesting that, in the behavioral and social sciences, on average across studies 31% of all test statistics might lie between the adjusted and unadjusted cutoffs (N studies = 6, excluding two studies which lacked data for t-statistics below 1.96) and 22% lie above the adjusted cutoff (N = 6, range: 4%-36%). Looking only at the fraction of test statistics above the standard threshold, across studies an average of 62% lie between the adjusted and unadjusted cutoffs (N = 8) and an average of 38% lie above the adjusted cutoff (N = 8, range: 7%-65%) (Fig 1).

Discussion
Intuitively, the results reported here are due to the narrow tails of the asymptotic distribution of most test statistics. For example, most econometric test statistics are asymptotically normal. The tails of the normal distribution decline so rapidly that, even if only significant results are observed, the chances are still quite good that a randomly chosen draw above the critical value will be close to the critical value. Thus, adjustments for the types of publication bias discussed in this paper may be small.
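This tail property is easy to quantify in the normal case: under the null, the share of draws clearing the conventional two-sided cutoff that nevertheless fall below the adjusted cutoff of about 3.02 is itself about 95 percent. A quick check with Python's standard library:

```python
from statistics import NormalDist

Z = NormalDist()
p_sig = 2 * (1 - Z.cdf(1.96))       # P(|Z| > 1.96), about 0.05
p_extreme = 2 * (1 - Z.cdf(3.02))   # P(|Z| > 3.02), about 0.0025
share_near_cutoff = 1 - p_extreme / p_sig
print(round(share_near_cutoff, 2))  # share of significant draws below 3.02
```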
Empirically, however, it appears that a quite sizable fraction of published research that appears significant by normal standards would not meet the adjusted standard of our satisficing model. Notably, this evidence comes mainly from the social sciences, where significance testing is common and where we have been able to find reasonably sized samples of published research. We conjecture that our method would make a modest difference in fields that typically work with large datasets, whilst it might have a significant impact in fields in which small sample sizes and low signal-to-noise ratios are common. This conjecture is supported by the data we have gleaned from the literature: t-statistic distributions from psychology, a discipline believed to have relatively low reproducibility rates (see [46]), tend to have a higher concentration in the 1.96-3.02 range compared to other social sciences.
The results obtained in this paper are highly specific in at least a few other regards. First, the results in this paper do not apply to settings where F(Á) fails to be the distribution function of the test statistic in question. This would occur, for example, in settings where regression standard errors are calculated incorrectly [47][48][49].
Second, they do not pertain to specification searching. For example, suppose we model specification searching in the following way: authors estimate J independent models, where the discrepancy between J and the true number of estimated models summarizes the dependence between the estimates, and authors report only the specification with the most significant result. In that case, the critical value that would undo the selection effect of the specification searching would be F^{−1}((1 − α)^{1/J}), where F(·) is the distribution function for the test statistic in question. This critical value is bounded only if it is possible to bound J. This gives rise to the emphasis by Sullivan, Timmermann, and White [16] on the ability of the analyst to specify the universe of tests conducted by a single researcher, or a field of researchers [6].

Fig 1. Distribution of t-statistics, as reported in the literature, lying below, between, or above the unadjusted and adjusted thresholds. Data were obtained from independent publications, which are referenced above each graph, and were either provided by the original authors or re-digitized from histograms provided in the texts. Below each graph are indicated the following key methodological characteristics: study sampling strategy (i.e., specific journals or a specific field), year range, number of articles included, and selection strategy for the statistics (i.e., whether, from each included article, the authors had taken all available statistics, only those referring to explicit hypotheses, or only a subsample of "substantive" results selected by human coders).

Third, a related point concerns the interpretation of any individual test statistic within a paper or within a literature. While a literature may collectively be biased by non-publication of null results, each individual test that is published may still be unbiased on its own.
In this case, our method in some sense requires a stricter test (lower type I error rate, and lower power) than intended. However, given the evidence that even individual papers with multiple experiments suffer from publication bias within the paper, we believe our method is still useful. (See the test for excess significance, developed in [50] and applied in [51] and [52].) Fourth, as mentioned previously, one could modify our method and derive a different cutoff by choosing a different type I error rate among the selected sample of published results, rather than the same α used in the submission/publication decision that generates the satisficing model of publication bias. We use α for its intuitive appeal: it keeps the level of false positives identical across the entire universe of tests and the selected sample of submitted tests, restoring the originally intended type I error rate.
Finally, we believe that our approach could be of use in fields where file-drawer effects are believed to be pervasive, by offering a simple rule of thumb to adjust after the fact for the rate of false positives. However, this method is unlikely to represent an active remedy to the problem of publication bias, as it may encourage a "t-ratio arms race," whereby authors understand that editors are suspicious of t-ratios just above 2, and adjust their submission behavior accordingly. Authors would become ever more selective in their submissions as editors became ever more critical, ad infinitum. In light of these considerations, perhaps the best way to understand the result described above is as a useful rule of thumb to employ, assuming that few other people deviate from standard practice.
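Returning to the specification-searching comparison above: under that model, the adjusted critical value F^{−1}((1 − α)^{1/J}) grows without bound in J, unlike the fixed worst-case cutoff of the satisficing model. A minimal sketch for the squared t-ratio (the values of J are hypothetical choices for illustration):

```python
from statistics import NormalDist

Z = NormalDist()

def t2_quantile(q):
    # Quantile of T = t^2 (chi-square with 1 df), via the normal quantile
    return Z.inv_cdf((1 + q) / 2) ** 2

alpha = 0.05
for J in (1, 10, 100):                  # hypothetical numbers of specifications
    c = t2_quantile((1 - alpha) ** (1 / J))
    print(J, round(c ** 0.5, 2))        # cutoff on the |t|-ratio scale
```

At J = 1 this recovers the conventional 1.96; the cutoff then rises steadily with the number of specifications searched, which is why bounding J is essential.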

Conclusion
In this paper, we have outlined a simple method for restoring the intended type I error rate of tests used by consumers of research (e.g., editors, readers) when producers of research (e.g., authors, editors) select results based on the statistical significance of tests, and where the selection follows a satisficing rule. The analysis shows that this selection effect distorts the size (type I error rate) of tests, and that the distortion may be eliminated using adjusted critical values approximately 50% larger than their unadjusted counterparts. These adjusted critical values are particularly simple to implement and require only a (detailed) table giving the distribution of the (unselected) test statistic under the null hypothesis. A leading example is the two-tailed t-test, where a test with a type I error rate of 5 percent uses a critical value of 1.96; distortions created by file drawer bias are adjusted for by using an adjusted critical value of 3.02. Samples of published social science research indicate that on average, across research fields, approximately 30% of t-statistics fall between the standard and adjusted critical values, and might thus be affected by the proposed method.