Abstract
In recent years, concern has grown about the inappropriate application and interpretation of P values, especially the use of P<0.05 to denote “statistical significance” and the practice of P-hacking to produce results below this threshold and selectively reporting these in publications. Such behavior is said to be a major contributor to the large number of false and non-reproducible discoveries found in academic journals. In response, it has been proposed that the threshold for statistical significance be changed from 0.05 to 0.005. The aim of the current study was to use an evolutionary agent-based model comprised of researchers who test hypotheses and strive to increase their publication rates in order to explore the impact of a 0.005 P value threshold on P-hacking and published false positive rates. Three scenarios were examined, one in which researchers tested a single hypothesis, one in which they tested multiple hypotheses using a P<0.05 threshold, and one in which they tested multiple hypotheses using a P<0.005 threshold. Effect sizes were varied across models and output was assessed in terms of researcher effort, number of hypotheses tested and number of publications, and the published false positive rate. The results supported the view that a more stringent P value threshold can serve to reduce the rate of published false positive results. Researchers still engaged in P-hacking with the new threshold, but the effort they expended increased substantially and their overall productivity was reduced, resulting in a decline in the published false positive rate. Compared to other proposed interventions to improve the academic publishing system, changing the P value threshold has the advantage of being relatively easy to implement and could be monitored and enforced with minimal effort by journal editors and peer reviewers.
Citation: Fitzpatrick BG, Gorman DM, Trombatore C (2024) Impact of redefining statistical significance on P-hacking and false positive rates: An agent-based model. PLoS ONE 19(5): e0303262. https://doi.org/10.1371/journal.pone.0303262
Editor: Vincent Antonio Traag, Leiden University: Universiteit Leiden, NETHERLANDS
Received: September 5, 2023; Accepted: April 23, 2024; Published: May 16, 2024
Copyright: © 2024 Fitzpatrick et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The public repository of this research is available at: https://osf.io/awqnf/.
Funding: The authors received no specific funding for this research.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Concern with the inappropriate application and interpretation of P values has grown in recent years, especially regarding making multiple comparisons within one dataset and using P<0.05 to denote “statistically significant” differences [1–3]. Gelman and Loken observe that many datasets and analysis plans present the opportunity to make multiple comparisons without investigators purposefully and deliberately fishing for a P value below 0.05 [4]. However, the facts that most published results are positive and that an unusually large number of these barely pass the threshold P<0.05 have raised concern that many of these results may in fact arise from data dredging, also referred to as “P-hacking” [5–9]. Data dredging denotes a deliberate and purposeful search for statistically significant results, with analysis continuing until at least one is found. This chance statistically significant result is then selectively reported in a manuscript as though it were the result of a prespecified test of a hypothesis.
A number of solutions to data dredging and selective outcome reporting have been proposed, including preregistration of study methods and analysis plans, data sharing, Registered Reports, blind data analysis, and adversarial collaboration. The latter two methods have not been widely adopted outside of physics and psychology, respectively [10,11]. While there is evidence that preregistration and data sharing can reduce the number of false positive results published by journals, these editorial practices have been undermined by a failure of many journals to enforce preregistration and data sharing policies and by some study registries allowing retrospective registration [12–14]. As for Registered Reports, interpretation of study findings indicating that it produces more results supporting the null is difficult as it cannot be ruled out that researchers especially concerned with selective outcome reporting bias and/or inclined to test “risky” hypotheses are more likely to use this format [15,16].
A formal proposal that the threshold for “statistical significance” be changed from P<0.05 to P<0.005 was made by Benjamin and other statisticians and research methodologists in 2018 [17]. It is premised on the idea that “statistical standards of evidence for claiming new discoveries in many fields of science are simply too low” [17, p. 6]. The practice of associating statistically significant findings with P<0.05 will result in a high false positive rate, they argue, irrespective of other problems in study design, data analysis, and reporting of results. In support of their argument, Benjamin et al. state that a two-sided P value of 0.05 corresponds to Bayes factors in favor of the alternative hypothesis in the range of 2.5 to 3.4, which is “weak” or “very weak” evidence. In contrast, a two-sided P value of 0.005 is in the “strong” to “substantial” range of Bayes factor recommendations. They also present a table showing, over a range of values of statistical power, the association between the two P value thresholds, study power, prior odds of the alternative hypothesis, and the false positive rate. For example, with prior odds of 1:10 and power of 1.0, the false positive rate drops from 33% with a P value threshold of 0.05 to 5% with a threshold of 0.005. Lakens et al. [18] question the underlying assumptions and data used in arriving at each of these justifications (e.g., the prior odds estimate used in calculating the false positive rates for each alpha level is based on data from just 73 studies from the Reproducibility Project: Psychology).
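The arithmetic behind the 33% and 5% figures can be reproduced with a short calculation. The sketch below assumes the usual definition of the false positive rate as the share of statistically significant results for which the null hypothesis is true; the function name and argument names are illustrative and are not taken from Benjamin et al.

```python
def false_positive_rate(alpha, power, prior_odds):
    """Share of significant results that arise from a true null hypothesis.

    prior_odds is the odds of the alternative to the null (1:10 -> 1/10).
    """
    prior_alt = prior_odds / (1.0 + prior_odds)   # prior probability of H1
    false_pos = alpha * (1.0 - prior_alt)         # true nulls that reach significance
    true_pos = power * prior_alt                  # true effects that reach significance
    return false_pos / (false_pos + true_pos)


fpr_05 = false_positive_rate(alpha=0.05, power=1.0, prior_odds=1 / 10)
fpr_005 = false_positive_rate(alpha=0.005, power=1.0, prior_odds=1 / 10)
print(f"P<0.05:  {fpr_05:.1%}")
print(f"P<0.005: {fpr_005:.1%}")
```

Lower power raises both rates, since the true positive term in the denominator shrinks while the false positive term is unchanged.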
Opinions as to the effect of changing the P-value threshold on data dredging and P-hacking vary. Ioannidis [1] considers it to be a “temporizing measure” to dam the flood of positive results, but likely to be more beneficial than harmful. On the other hand, Amrhein and Greenland [19] contend that the new threshold will lead to “more intense P hacking and selective reporting”. Benjamin et al. [17] acknowledge that an investigator can still engage in such analytic practices while using a P<0.005 threshold but contend that the likelihood of a “statistically significant” chance finding emerging from these analyses is lower than with a 0.05 threshold. Moreover, it is likely that the longer the data dredging exercise continues, the more easily identifiable it will be to those reading a publication in which such results are reported; that is, investigators will be forced into conducting more extreme analyses as the search for a statistically significant result continues (e.g., removing study groups or assessment points from the analysis, even when the study is registered) [e.g., 20,21].
A handful of studies have assessed the effects of the proposed change in threshold on the number of statistically significant results reported using P values from already published studies. Based on a large text-mining study of many thousands of P values published over 25 years, Ioannidis estimated that changing the P value threshold from 0.05 to 0.005 would remove about one-third of the statistically significant results of past biomedical literature [22,23].
In a series of studies, Vassar and colleagues examined the impact of changing the threshold on randomized controlled trials (RCTs) published in general medical, orthopaedic trauma, and orthopaedic sports medicine journals [24–26]. The results of these studies are summarized in Table 1, along with a study by Thakur and Jha [27] that examined changing the P value threshold on results from 123 RCTs pertaining to chronic rhinosinusitis and a study by Khan et al. [28] that focused on 72 RCTs from high impact general medical and cardiology journals. Across the five studies, the percentage of P values that retained statistical significance with a 0.005 threshold ranged from 38.9% to 70.7%.
The studies in Table 1 were focused on recent RCTs published in top medical journals and a majority would have been registered. Such studies are most appropriate for null hypothesis statistical testing using P values. In contrast, research focused on data dredging, P-hacking and the clustering of P values just below 0.05 has typically examined studies in psychology, biology and political science that are unlikely to be preregistered and many of which will use study designs other than RCTs [e.g., 7,9,29,30]. Benjamin et al. [17] recommend that research using non-experimental designs, especially exploratory research that tests numerous hypotheses, should employ even lower P value thresholds than 0.005. It is unknown what percent of P values from non-experimental studies that were statistically significant with the traditional 0.05 threshold would remain so with a 0.005 threshold, let alone one more stringent.
Assessing the potential effects of changes in P value thresholds on the validity of published results is difficult as traditional empirical methods involving experimental manipulation are not feasible. Such constraints apply to studying many aspects of the current academic incentive system and proposed changes to realign this in ways that produce higher quality research and more valid published findings. This has led to the application of simulation models to estimate the effects of the “publish or perish” academic culture on the quality of published scientific research and the potential of proposed changes to the publication process to improve this. An influential work in this area is that of Smaldino and McElreath [31], which describes a competitive agent-based model (ABM) [32] comprised of research laboratories that survive based on their ability to publish a high volume of papers while expending low effort, albeit many of which report false positive results. The model demonstrates that this way of conducting research, called “bad science”, can rapidly spread through a group of laboratories and become its dominant approach. The main results of the model were recently replicated [33], and its framework has been used to conduct virtual experiments of interventions such as auditing of research facilities, making publication of negative results more prestigious, improving peer review, assigning research funds randomly or according to methodological integrity, and researchers expending effort on the selection of strong hypotheses through theory development [34–36].
The aim of the current study is to examine the statistical process of empirical science using an ABM, building on the work of Smaldino and McElreath [31]. Here the agents are researchers who test hypotheses and publish positive findings. With this model we explore the impact of P-hacking and modified P value thresholds on false positive rates in the literature.
We should be clear as to two aspects of what it is we are modeling. First, the disciplines we have in mind are those in which a culture of “you can publish if you found a significant effect” prevails, thereby encouraging multiple statistical testing (i.e., data dredging or P-hacking) to obtain such effects [37]. Disciplines in which such concerns have been raised include biomedical sciences, ecology, psychology, and biology [37–40]. The percentage of published positive results in such “soft” disciplines is estimated to be over 90%, especially in applied areas of research [41]. Depending on study design, sample size, specificity of hypotheses, analytic flexibility, and other methodological features, most of these disciplines’ published positive results may be false [37,42–44]. While even the hardest of sciences are not immune to confirmation bias, multiple testing using P values to determine “significance” is not among the questionable research practices used to achieve desirable results, and therefore the solutions in these disciplines are different, albeit with potential application to the soft sciences [45,46].
The second clarification involves the type of null hypothesis significance testing (NHST) we model. There is a large literature detailing how NHST as reported in most published papers produced by researchers in the disciplines described above is a problematic amalgam of Fisher’s significance test and the hypothesis testing approach described by Neyman and Pearson [47–49]. The problems with this hybrid have been clearly explained but are not of immediate concern to us as we are presenting an idealized model of what researchers in disciplines such as biomedical science and psychology typically do when hypothesis testing rather than what they should do. Accordingly, in the current model the agents structure their studies as a null hypothesis statistical test, and they reject the null hypothesis on the basis of a P value of 0.05 or below.
Methods
Building on the model of Smaldino and McElreath [31], we populate our ABM with N researchers that dynamically perform experiments to generate new results. Each researcher is characterized by a vector of state variables: effort, number of hypotheses per experiment, value, age, and effect size. The first two values denote the researcher’s methods, in the sense that the amount of effort and the number of “bites at the apple” (i.e., attempts to find a statistically significant result) comprise the variables that drive publication outcomes, as will be seen below.
Effort
In our model, effort impacts the research process in two ways: the time it takes to perform an experiment and the power a study can achieve. These two impacts are in tension–more effort means potentially slowed productivity but improved true positive results. Smaldino and McElreath [31] model the probability of conducting a study during a time step with a convex, decreasing function. Our functional form is different but maintains the convex, decreasing shape:
in which E denotes effort, Emin denotes the lower bound on effort, and η is a tunable model parameter. Greater effort means smaller probability of false positive results. Fig 1 illustrates this functional relationship.
Number of hypotheses
It is in this component that our model is quite different from that of Smaldino and McElreath [31]. We simulate a statistical hypothesis test, a modeling choice that allows us to investigate the reduced P value threshold intervention. The number of statistical hypothesis tests performed models the process of P-hacking, repeatedly testing in hopes of finding something to be statistically significant.
Value
If the researcher performs a hypothesis test that achieves a sufficiently small P value, then that positive finding results in a publication. The researchers’ value is the total number of publications over the course of their careers.
Age
The number of time steps in the simulation that the researchers are active is their age.
Time progresses in discrete fixed steps, during which each researcher:
- applies an amount of effort in order to conduct an experiment (or not);
- collects data to analyze;
- tests one or more hypotheses;
- publishes any positive finding;
- checks the publication rates of other researchers to identify “better” methods.
In addition, researchers “quit” or “retire” at random with an exponential rate, and new researchers enter the workforce to maintain a fixed population size of N researchers.
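As an illustrative sketch only (not the authors' code), the retirement-and-replacement step described above might look as follows. The function names, the conversion of an exponential rate into a per-step probability via 1 - exp(-rate), the rate value, and the choice to track only age are all assumptions made for this example; a new entrant's other state variables would be reset in the same way.

```python
import math
import random


def retirement_probability(rate, dt=1.0):
    """Chance of leaving within one step of length dt, given an exponential rate."""
    return 1.0 - math.exp(-rate * dt)


def age_and_replace(ages, rate, rng):
    """Advance every researcher one step; retirees are replaced by newcomers of age 0.

    The list keeps its length, so the population size N stays fixed.
    """
    p_retire = retirement_probability(rate)
    return [0 if rng.random() < p_retire else age + 1 for age in ages]


rng = random.Random(42)
ages = [0] * 2000                 # N = 2000 researchers, as in the reported runs
for _ in range(100):              # 100 time steps
    ages = age_and_replace(ages, rate=0.01, rng=rng)  # rate value is hypothetical
```

Under this assumption, career lengths are geometrically distributed in discrete time, which is the discrete analogue of an exponential lifetime.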
Effect size and hypothesis testing in the model
In order to study the impact of the P value threshold change, we must simulate the process of statistical hypothesis testing, which involves two types of error: type I or false positive and type II or false negative. In aggregate, P value thresholds define the acceptable probability of false positive findings.
An important aspect of the simulation is the modeling of “truth.” To simulate false positive rates within a hypothesis testing context, we must select a means of generating “true” and “false.” In this model, each researcher conducts a number of independent-sample t-tests, the simplest test comparing a control group to an experimental or treatment group. To simulate these t-tests, we generate true Cohen’s d effect sizes from an exponential distribution:
$$f(d) = \frac{1}{d_0}\, e^{-d/d_0}, \qquad d \ge 0,$$
in which d0 is a tunable modeling parameter. Smaller effect sizes are more likely, and larger effect sizes are less likely in this model.
To generate experimental truth, we model an effect size greater than dmin (a tunable parameter) to yield an experiment that is a true positive. Within this modeling structure, the prior probability of the null hypothesis being true is
$$\Pr[H_0] = \Pr[d \le d_{\min}] = 1 - e^{-d_{\min}/d_0}.$$
For example, if we set dmin = 0.2, which is a small effect according to Cohen [50], and we set 80% as the prior probability of the null, we have d0 = -0.2/ln(1 - 0.8) = 0.1243. In general, the exponential scale parameter is determined from the minimum effect size and the prior probability by the formula
$$d_0 = -\frac{d_{\min}}{\ln\left(1 - \Pr[H_0]\right)}.$$
Fig 2 illustrates the probability density we use to generate effect sizes in the model.
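The relationship between dmin, the prior probability of the null, and the exponential scale d0 can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the Monte Carlo check simply confirms that the stated fraction of simulated effect sizes falls at or below dmin.

```python
import math
import random


def exponential_scale(d_min, prior_null):
    """Scale d0 such that Pr[d <= d_min] equals prior_null under Exp(scale=d0)."""
    return -d_min / math.log(1.0 - prior_null)


def prior_null_probability(d_min, d0):
    """Pr[d <= d_min] for an exponential distribution with scale d0."""
    return 1.0 - math.exp(-d_min / d0)


d0 = exponential_scale(d_min=0.2, prior_null=0.8)
print(f"d0 = {d0:.4f}")  # 0.1243, matching the worked example in the text

# Monte Carlo check: expovariate takes a rate, i.e. 1 / scale.
rng = random.Random(1)
draws = [rng.expovariate(1.0 / d0) for _ in range(100_000)]
share_null = sum(d <= 0.2 for d in draws) / len(draws)
print(f"simulated share of true nulls: {share_null:.3f}")
```

The same function gives the scale for the other two prior probabilities used in the Results (0.9 and 0.5) with dmin held at whatever value the model uses.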
This model treats dmin as a minimally relevant effect size, which is the smallest effect that would “justify associated costs, risks, and other harms” [51, p. 251]. An experiment simulated with an effect size below dmin would be considered as arising from a true null hypothesis; likewise, an experiment simulated with an effect above dmin would be a false null. Once the experiment is simulated, and data are obtained, the t-test is performed. A true positive, then, will happen when the point-null hypothesis of zero effect is rejected, in an experiment simulated with an underlying effect size above dmin. A false positive will come about when the null is rejected in a simulation with an underlying effect at or below dmin.
Of course, this definition of truth differs from the point-null that the effect size is 0 (versus the alternative that it is not). An alternative modeling choice would be to simulate truth of 0 effect sizes with some a priori probability and non-zero effect sizes with the complementary probability. But as Gelman and Carlin [52, p.900] remark, “[r]ealistically, all statistical hypotheses are false: effects are not exactly zero, groups are not exactly identical, distributions are not really normal, measurements are not quite unbiased, and so on.” Toward that end, we choose to model truth via minimally relevant effect sizes.
Within this structure there are many ways to simulate effect sizes for researchers. In our model, we simulate a single d value for each researcher, say di, i = 1,2,…,N. This value serves as a scale parameter for simulating effect sizes for individual experiments. That is, at each time step, Researcher i will obtain ki effect sizes, one for each of the multiple hypotheses 1,2,…,ki. Researcher i then obtains ki t-statistic values randomly sampled from the noncentral t distribution with 2*Ei-2 degrees of freedom and noncentrality parameters determined by these effect sizes.
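One way to sketch this sampling step with the Python standard library is shown below. Two points are assumptions rather than statements from the text: that effort Ei acts as the per-group sample size (inferred from the 2*Ei-2 degrees of freedom), and that the noncentrality parameter takes the standard equal-group two-sample form d*sqrt(E/2). The function names are hypothetical.

```python
import math
import random


def sample_noncentral_t(df, ncp, rng):
    """Draw one noncentral t variate as (Z + ncp) / sqrt(V / df)."""
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return (z + ncp) / math.sqrt(v / df)


def simulate_t_values(d_scale, effort, k, rng):
    """Simulate k effect sizes and their t statistics for one researcher.

    d_scale: the researcher's exponential scale parameter (d_i in the text)
    effort:  assumed per-group sample size E_i, so df = 2 * E_i - 2
    """
    df = 2 * effort - 2
    effects = [rng.expovariate(1.0 / d_scale) for _ in range(k)]
    # Assumed noncentrality for two equal groups of size E: d * sqrt(E / 2).
    t_values = [sample_noncentral_t(df, d * math.sqrt(effort / 2.0), rng)
                for d in effects]
    return effects, t_values


effects, t_values = simulate_t_values(d_scale=0.1243, effort=20, k=5,
                                      rng=random.Random(0))
```

Where SciPy is available, `scipy.stats.nct` could replace the hand-rolled sampler; the construction above is used only to keep the example free of dependencies.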
At this point in the simulation, each researcher has obtained a number of T values. A P-hacking researcher would ask if at least one of these is greater than the critical t-value for the null hypothesis at the threshold level α. If so, the researcher would claim a positive outcome and publish. To determine whether or not that result is a true positive in the simulation, we check that (1) the effect size was at least dmin and (2) that the T value exceeded a P value-corrected critical t-value using the Šidák correction [53],
$$\alpha_{\mathrm{corrected}} = 1 - (1 - \alpha)^{1/k},$$
in which k denotes the number of hypotheses (or T values obtained by the researcher). If these two conditions are met, the result is a true positive for the purposes of simulation. Otherwise, the result is a false positive. Any positive finding increments the researcher’s value by one unit. Negative results do not count towards the researcher’s value.
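The publication and classification rule can be sketched in a few lines. This example works with P values rather than critical t-values, which is equivalent for a fixed test and degrees of freedom. It also assumes one particular reading of the rule for multiple tests: a published result counts as a true positive if at least one test both passes the Šidák-corrected threshold and has an effect size of at least dmin. The function names are illustrative.

```python
def sidak_alpha(alpha, k):
    """Per-test threshold that keeps the family-wise error rate at alpha over k tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)


def familywise_error(alpha, k):
    """Chance that at least one of k independent true-null tests falls below alpha."""
    return 1.0 - (1.0 - alpha) ** k


def classify(p_values, effects, alpha, d_min):
    """Label one experiment as 'negative', 'true positive', or 'false positive'."""
    if min(p_values) >= alpha:
        return "negative"                      # nothing to publish
    corrected = sidak_alpha(alpha, len(p_values))
    for p, d in zip(p_values, effects):
        if p < corrected and d >= d_min:       # assumed reading of conditions (1) and (2)
            return "true positive"
    return "false positive"


for alpha in (0.05, 0.005):
    print(alpha, round(familywise_error(alpha, k=10), 4))
```

With ten independent tests of true nulls, the uncorrected chance of at least one "significant" result is about 40% at the 0.05 threshold and about 5% at 0.005, which illustrates why P-hacking becomes more costly under the stricter threshold.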
To be specific, P values computed herein correspond to the divergence P value as defined in Greenland [48]. That is, P values are computed as the probability of future, replicated test statistics exceeding the value of the test statistic computed for the data at hand, under the condition that the null hypothesis is true. To the best of our knowledge, most statistical software packages that provide P values compute them in this manner.
The parameters of the model are given in Table 2.
Results
We present three sets of model runs, organized by the prior probability of the null being 0.9, 0.8, or 0.5. In each set of runs, we compare three different scenarios. The first is an “ideal” scenario, in which each researcher tests exactly one hypothesis with each experiment. If that experiment’s analysis results in a P value of less than 0.05, the researcher publishes. In the simulation, we know which hypotheses are actually true, so we tabulate the fraction of published works that are false positives. In the second, researchers may test multiple hypotheses, and if at least one of these results in a P value below 0.05, that experiment is published. In this scenario, researchers are not using any type of P value correction for multiple testing. However, in the simulation, true positives are determined using the Šidák correction: “truth” is simulated, but “positive” for the purpose of “true positive” is computed with the Šidák correction. The third scenario follows the second except that we apply the intervention of reducing the P value threshold to 0.005. Fig 3 illustrates the evolution over 50,000 time steps and 100 evolutionary replicates. Panels in the figure show the medians across researchers and replicates at each time step.
Quantities are medians across 2000 researchers, averaged over 100 simulation replicates.
In this set of simulations, we use Pr[H0] = 0.9, wherein the majority of simulated experiments will come from effect sizes below the threshold for “truth.” We see that P-hacking continues at the 0.005 threshold, but that the false publication rate is lower than at the 0.05 threshold.
In a second simulation, presented in Fig 4, we examine the prior probability of the null set to 0.8, and in Fig 5 we show results for the prior probability of the null set to 0.5.
As in Fig 3, quantities in Figs 4 and 5 are medians across 2000 researchers, averaged over 100 simulation replicates.
In Figs 3–5, medians are graphed over time, but variation around the median is also of interest. To illustrate the variation in these quantities, we follow up with some histograms for the case of Pr[H0] = 0.8. In Fig 6 we show the system’s false positive publication rate at the final time step, as a histogram over 100 simulation replicates. For each replicate, we compute the cumulative number of false positives over the current researchers, and we divide by the cumulative number of publications (that is, true positives plus false positives) of the current researchers. We generate a histogram for each of the three simulation scenarios. For the single hypothesis scenario, values cluster in the center of the distribution, with a maximum frequency occurring near a false positive rate of 0.52. The distribution under the 0.005 scenario bears a resemblance to this, with a mode near 0.54 and somewhat more variation. In contrast, the 0.05 distribution clusters around 0.875–0.9, indicating a greater number of false positive results among the researchers at the final step of the model.
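The per-replicate quantity plotted in Fig 6 is a simple ratio, sketched below. The dictionary keys and the example counts are hypothetical; they stand in for whatever bookkeeping the simulation uses to track each current researcher's cumulative true and false positives.

```python
def false_positive_publication_rate(researchers):
    """Cumulative false positives divided by cumulative publications.

    Each researcher is a dict with hypothetical keys 'false_positives' and
    'true_positives'. Returns NaN if nobody has published yet.
    """
    false_pos = sum(r["false_positives"] for r in researchers)
    total = sum(r["false_positives"] + r["true_positives"] for r in researchers)
    return false_pos / total if total else float("nan")


# Made-up counts for two researchers at the final time step of one replicate.
replicate = [
    {"false_positives": 3, "true_positives": 1},
    {"false_positives": 4, "true_positives": 0},
]
print(false_positive_publication_rate(replicate))  # 7 / 8 = 0.875
```

Collecting this value once per replicate yields the 100 numbers summarized in each histogram.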
Variation in the number of hypotheses tested for each of the three scenarios is shown in Fig 7. The number of hypotheses for each researcher is tabulated over the 100 replicates and the 2000 researchers. This shows the continuation of testing multiple hypotheses, that is, P-hacking, under the 0.005 scenario. For both 0.05 and 0.005 thresholds, the distributions show a bimodal shape with one peak at the low end and a spread of values that skew slightly toward larger values. In both cases, the number of attempted hypotheses tapers off for values larger than 15.
Variation in the amount of effort for each scenario is shown in Fig 8. The effort for each researcher is tabulated over the 100 replicates and the 2000 researchers. The distribution for multiple hypotheses tested with a threshold of 0.05 resembles that of the single hypothesis scenario, but with higher peaks. Changing the threshold to 0.005 produces a more even distribution of effort with lower peaks than the 0.05 threshold.
Conclusions
The results of the simulations generally support Ioannidis’ [1] view that a more stringent P value threshold can serve to dam the flood of published positive results. While, as Benjamin et al. [17] acknowledge, investigators can still engage in P-hacking, the false positive rate declines using the 0.005 threshold while the effort expended by investigators increases substantially. Overall, their productivity is reduced, but there is an improvement in production of true positive results.
While the results of our simulation do suggest that reducing the P value threshold may reduce the number of false-positive publications, we must note that this model has a number of limitations. First and foremost, it serves as an idealized version of the statistical aspects of the research process. Simulated researchers herein are applying a simple (two-sided) point-hypothesis test that is an appropriate test for the simulated data. P-hacking is modeled as a simple repeated generation of data for that test. There is no “garden of forking paths” that might lead researchers to search for a statistical method based on the observed data [4]. Moreover, researchers are not permitted to remove or add data in order to obtain positive results [54]. Certainly these (and other) unmodeled researcher activities could serve to moderate the effectiveness of the 0.005 intervention.
As noted in several recent papers, academic publishing can be viewed as a production system which currently incentivizes engagement in problematic behaviors such as data dredging, P-hacking, HARKing, and selective outcome reporting [55–58]. Using such a systems framework allows for the identification of leverage points which can be the target of interventions designed to improve research quality and integrity [59]. Such leverage points in the system will vary in the extent to which they are amenable to change and the magnitude and type of improvements in outputs that can be expected should such change successfully occur.
Preventive interventions targeted at leverage points can have a positive effect by reducing the extent to which individuals within the system engage in an unwanted behavior and/or by reducing the negative consequences (or harms) of engaging in the behavior. The simulations presented here indicate that the introduction by academic journals of a more stringent P value threshold for “statistical significance” has the latter effect: researchers continue to P-hack but the number of false discoveries that enter the published literature as a result of this practice is reduced by more than half compared to using a 0.05 threshold when effect sizes are randomly generated.
One thing to consider when deciding if the reductions in false positive results observed in the simulations are worth pursuing is that a change in the P value threshold is a relatively simple intervention to introduce into the academic publishing system and one that could be monitored and enforced with minimal effort by journal editors and peer reviewers. Essentially, it involves changing one arbitrary threshold of “statistical significance” for another, albeit a less familiar one. In practice, it would require journals announcing in their instructions to authors that P<0.005 now constitutes statistical significance, running each submitted manuscript through a computer program to ensure the new threshold was adhered to (and returning to authors those that did not), and requesting that peer reviewers also ensure the 0.005 significance threshold was used in the analyses reported. In short, changing the P value threshold has the appeal of being an intervention with a clear target that involves an easy behavioral change and, if implemented widely, can reduce the false positive rate, albeit not entirely eradicating P-hacking.
While other interventions designed to improve research integrity and quality have greater potential impact, the feasibility of their widespread implementation, adherence, and enforcement is questionable. For example, in an ideal world every genuine a priori hypothesis-testing study would be written up in the form of a Registered Report. This essentially embeds pre-registration in the publication pipeline and eliminates the incentive for researchers to data dredge when writing up the results of their studies [15,16]. This format appears successful in reducing the publication of positive results and, consequently, reducing the rate of false discoveries [60,61]. However, Registered Reports have not enjoyed widespread adoption among journals: in early 2022, Chambers and Tzavella [15] reported that only 300 journals offered this as a publishing option, with just 94 of these having published a total of 591 final manuscripts reporting study results.
Prospective registration is another proposed intervention that, in principle, can greatly reduce P-hacking and selective publication of positive results [62,63]. While this has been more widely adopted by academic journals than the Registered Reports format, adherence by authors and enforcement by journals are suboptimal, and some registries allow retrospective registration and alterations of protocols after a study is underway or even complete [13,14].
To the extent that it reduces the number of false positive results that find their way into the published literature, changing the P value threshold to 0.005 appears to be an editorial procedure worth pursuing, given its minimal costs and inconvenience to editors and reviewers. In a recent discussion of P values, Greenland [64] argued that, as with tobacco smoking, education, not prohibition, might be the best way to limit their misuse and its attendant harms. However, while outright prohibition might have proved as problematic with tobacco as it did with alcohol, there is little doubt that policies that have restricted the circumstances and places in which one can smoke (e.g., smoking bans in workplaces, restaurants and bars, public transport, places of entertainment, aircraft) made significant contributions to declining smoking rates. Such policy changes restrict opportunities to smoke and decrease its social acceptability [65]. Our model suggests changing the P value threshold restricts the opportunity for researchers to find a “significant” (and publishable) result and, if the effort to produce such a result through data dredging becomes more arduous and extreme, its acceptability will, hopefully, decline over time. Researchers are more likely to see the absurd and unethical nature of trying to squeeze a P<0.005 result out of a data set as the analyses required become increasingly distant from those originally intended. This, over time, might help change the current research culture of many disciplines in which P-hacking is so easy it virtually goes unnoticed. As with smoking, a comprehensive approach to the problem of P values is required; we believe changing the threshold for statistical significance should be part of this approach.
Although the results of the simulations suggest there are benefits in terms of a reduction in published false positive results to be derived from changing the P value threshold from 0.05 to 0.005, there are also potential negative consequences that must be considered. First, there are compelling arguments to the effect that it is not the threshold used to designate “statistical significance” that is the problem with P values, but rather that the very use of this statistic is problematic and should be discontinued [66]. Changing the P value threshold will therefore simply encourage the continued use of a statistical practice that should be abandoned altogether. Second, and more broadly, Greenland [67] contends that null hypothesis significance testing reinforces cognitive biases that are detrimental to the practice of science, specifically “dichotomania” (the tendency to misperceive quantities as dichotomous even when this is incorrect and misleading) and “nullism” (the assumption that false positives are more problematic than false negatives). From this perspective changing the threshold simply moves the point at which the dichotomy is made and does so in a manner that assumes an over-abundance of false positives is a problem that requires addressing (based, as noted in the introduction, on the observation that null findings are relatively rare in the published literature of the academic disciplines upon which our modeling assumptions are based).
In response, while we are sympathetic to both arguments and believe they have merit, it seems very unlikely that the many disciplines in which P values are widely used (and misused) will abandon them anytime soon. To paraphrase Goodman [68], P values are in the statistical air of a great many academic disciplines and, as such, even those statisticians who would prefer a Bayesian atmosphere must live and breathe them to survive. In addition, eradicating P values would require an alternative, with effect size estimates (and confidence intervals) and Bayesian models the most frequently suggested [66,69,70]. The few cases in which journals have required authors to report effect estimates and not P values have met with limited success, with researchers largely ignoring confidence intervals when presenting their results and continuing to focus on their “significance” [71]. As Finch et al. [72] observe, proper presentation and interpretation of effect sizes and confidence intervals would require prior upstream training in these methods for researchers in many disciplines and not just editorial stricture. Even more training would be required if the alternative to P values were Bayesian analysis. A recent survey found that almost half of 323 clinical trial medical researchers reported insufficient knowledge as the main reason they did not use Bayesian statistics [73]. Thus, changing the P value threshold seems, at this point, a modest proposal that might help correct the problem of false positive results identified in many social and behavioral academic disciplines.
A third potential problem with introducing a stricter threshold for statistical significance was found in the optimality models presented by Campbell and Gustafson [74], which show that reducing false positives in the published literature can lead to a depletion in the number of truly “breakthrough discoveries” appearing in academic journals. This applies also to our models, as they require investigators to expend more effort to publish fewer papers but with more reliable results. The extent to which journal editors want to balance publication of truly reliable and valid results against publishing truly novel and surprising results might depend on the subject matter of the discipline and the question being addressed in the research. In cases where a very pressing issue with substantial societal implications is being addressed, less stringent requirements for “statistical significance” (i.e., P<0.05) might be warranted. In many cases, however, “breakthrough discoveries” will be of interest primarily to other academics in a particular field of research, and requiring these to meet a more stringent standard of statistical significance before publication will likely not result in any major cost to society. It might also help reduce the so-called “decline effect” [75], whereby an initially positive (but false) research finding concerning a phenomenon fails to be replicated in subsequent studies but becomes resistant to falsification by virtue of its perceived novelty and early influence on the field. The persistence of non-reproduced work can, in fact, be quite large [76,77].
Sir Ronald Fisher, considered by many to have popularized “tests of significance” using P values, stated, with respect to the P<0.05 criterion, that “[a] scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance” [78, p.85, Fisher’s emphasis]. That is to say, P<0.05 was intended as a screening tool, after which multiple replicates would be required for a “real” finding. Robinson and Wainer [79, p.264] emphasize that Fisher “understood science as a continuous and continuing process and viewed [what has come to be known as] ‘null hypothesis significance testing’ in that context.” Until the scientific research community can converge on longer-term and more challenging-to-implement interventions, a more onerous screening level may reduce the number of false positive publications as suggested by our models.
References
- 1. Ioannidis JPA. What have we (not) learned from millions of scientific papers with P values? The American Statistician. 2019;73(S1): 20–5.
- 2. Wasserstein RL, Lazar NA. ASA Statement on Statistical Significance and P-Values. The American Statistician. 2016;70:129–33.
- 3. Johnson VE. Revised standards for statistical evidence. PNAS. 2016;110(48): 19313–7. pmid:24218581
- 4. Gelman A, Loken E. The statistical crisis in science. American Scientist. 2014;102:460–5.
- 5. Erasmus A, Holman B, Ioannidis JPA. Data dredging bias. BMJ Evidence Based Medicine. 2022;27(4):209–11. pmid:34930812
- 6. Fanelli D. “Positive” results increase down the hierarchy of sciences. PLoS One. 2010;5(4):e10068. pmid:20383332
- 7. Masicampo EJ, Lalande DR. A peculiar prevalence of p values just below .05. Quarterly Journal of Experimental Psychology. 2012;55:2271–2279. pmid:22853650
- 8. Perneger TV, Combescure C. The distribution of P-values in medical research articles suggested selective reporting associated with statistical significance. Journal of Clinical Epidemiology. 2017;87:70–7. pmid:28400294
- 9. Simonsohn U, Simmons JP, Nelson LD. Better P-curves: Making P-curve analysis more robust to errors, fraud, and ambitious P-hacking, a Reply to Ulrich and Miller (2015). Journal of Experimental Psychology: General. 2015;144(6):1146–52. pmid:26595842
- 10. MacCoun R, Perlmutter S. Blind analysis as a corrective for confirmatory bias in physics and psychology. In: Lilienfeld SO, Waldman I, editors. Psychological Science under Scrutiny: Recent Challenges and Proposed Solutions. Hoboken (NJ): Wiley-Blackwell; 2017. p. 297–321.
- 11. Nuzzo R. Fooling ourselves. Nature. 2015;526:182–5. pmid:26450039
- 12. Boccia S, Rothman KJ, Panic N, Flacco ME, Ross A, Pastorino R, et al. Registration practices for observational studies on ClinicalTrials.gov indicated low adherence. Journal of Clinical Epidemiology. 2016;70:176–82. pmid:26386325
- 13. Taylor NJ, Gorman DM. Registration and primary outcome reporting in behavioral health trials. BMC Medical Research Methodology. 2022;22:41. pmid:35125101
- 14. Serghiou S, Axfors C, Ioannidis JPA. Lessons learnt from registration of biomedical research. Nature Human Behavior. 2023;7:9–11. pmid:36604496
- 15. Chambers CD, Tzavella L. The past, present and future of Registered Reports. Nature Human Behavior. 2022;6:29–42. pmid:34782730
- 16. Hardwicke TE, Ioannidis JPA. Mapping the universe of Registered Reports. Nature Human Behavior. 2018;2:793–6. pmid:31558810
- 17. Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E-J, Berk R, et al. Redefine statistical significance. Nature Human Behavior. 2018;2:6–10. pmid:30980045
- 18. Lakens D, Adolfi FG, Albers CJ, Anvari F, Apps MAJ, Argamon SE, et al. Justify your alpha. Nature Human Behavior. 2018;2(3):167–71. https://doi.org/10.1038/s41562-018-0311-x
- 19. Amrhein V, Greenland S. Remove, rather than redefine, statistical significance. Nature Human Behavior. 2018;2:4. pmid:30980046
- 20. van der Zee T, Anaya J, Brown NJL. Statistical heartburn: an attempt to digest four pizza publications from the Cornell Food and Brand Lab. BMC Nutrition. 2017;3:54. pmid:32153834
- 21. Gorman DM. Can a registered trial be reported as a one-group, pretest-posttest study with no explanation? A critique of Williams et al. (2021). Health and Justice. 2022;10:2. pmid:34978633
- 22. Chavalarias D, Wallach JD, Li AH, Ioannidis JP. Evolution of reporting P values in the biomedical literature, 1990–2015. JAMA. 2016;315(11):1141–8. pmid:26978209
- 23. Ioannidis JPA. The proposal to lower P value thresholds to .005. JAMA. 2018;319(14):1429–1430. pmid:29566133
- 24. Evans S, Anderson JM, Johnson AL, Checketts JX, Scott J, Middlemist K, et al. The potential effect of lowering the threshold of statistical significance from p <0.05 to p <0.005 in Orthopaedic Sports Medicine. Arthroscopy. 2021;37:1068–1074. pmid:33253798
- 25. Johnson AL, Evans S, Checketts JX, Scott J, Wayant C, Johnson M, et al. Effects of a proposal to alter the statistical significance threshold on previously published orthopaedic trauma randomized controlled trials. Injury. 2019;50:1934–7. pmid:31421816
- 26. Wayant C, Scott J, Vassar M. Evaluation of lowering the P value threshold for statistical significance from .05 to .005 in previously published randomized clinical trials in major medical journals. JAMA. 2018;320:1813–1815. pmid:30398593
- 27. Thakur P, Jha V. Potential effects of lowering the threshold of statistical significance in the field of chronic rhinosinusitis–A meta-research on published randomized controlled trials over last decade. Brazilian Journal of Otorhinolaryngology. 2022;88(S5):S83–9. pmid:35331655
- 28. Khan SK, Irfan S, Khan SU, Mehra MR, Vaduganathan M. Transforming the interpretation of significance in heart failure trials. European Journal of Heart Failure. 2020;22:177–80. pmid:31729133
- 29. Ioannidis JPA, Munafò MR, Fusar-Poli P, Nosek BA, David SP. Publication and other reporting biases in cognitive sciences: detection, prevalence, and prevention. Trends in Cognitive Sciences. 2014;18:235–41. pmid:24656991
- 30. Kerr NL. HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review. 1998;2:196–217. pmid:15647155
- 31. Smaldino PE, McElreath R. The natural selection of bad science. Royal Society Open Science. 2016;3:160384. pmid:27703703
- 32. Railsback SF, Grimm V. Agent-based and individual-based modeling: A practical introduction, 2nd ed. Princeton (NJ): Princeton University Press; 2019.
- 33. Kohrt F, Smaldino PE, McElreath R, Schönbrodt F. Replication of the natural selection of bad science. Royal Society Open Science. 2023;10:221306.
- 34. Barnett AG, Zardo P, Graves N. Randomly auditing research labs could be an affordable way to improve research quality: A simulation study. PLoS ONE. 2018;13(4):e0195613. pmid:29649314
- 35. Smaldino PE, Turner MA, Contreras Kallens PA. Open science and modified funding lotteries can impede the natural selection of bad science. Royal Society Open Science. 2019;6:190194. pmid:31417725
- 36. Stewart AJ, Plotkin JB. The natural selection of good science. Nature Human Behavior. 2021;5:1510–8. pmid:34002054
- 37. Forstmeier W, Wagenmakers E-J, Parker TH. Detecting and avoiding likely false-positive findings–a practical guide. Biological Reviews. 2017; 94:1941–968.
- 38. Begley CG, Ioannidis JPA. Reproducibility in science: Improving the standard for basic and preclinical research. Circulation Research. 2015;116:116–126. pmid:25552691
- 39. Kimmel K, Avolio ML, Ferraro PJ. Empirical evidence of widespread exaggeration bias and selective reporting in ecology. Nature Human Behavior. 2023;7:1525–1536. pmid:37537387
- 40. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science. 2011; 22:1359–1366. pmid:22006061
- 41. Fanelli D. “Positive” results increase down the hierarchy of sciences. PLoS One. 2010;5(4): e10068.
- 42. Ioannidis JPA. Why most published research findings are false. PLOS Medicine. 2005;2:e124. pmid:16060722
- 43. Ioannidis JPA, Tarone R, McLaughlin JK. The false-positive to false-negative ratio in epidemiologic studies. Epidemiology. 2011;22:450–456. pmid:21490505
- 44. Niemeyer RE, Proctor KR, Schwartz JA, Niemeyer RG. Are most published criminological research findings wrong? Taking stock of criminological research using a Bayesian simulation approach. International Journal of Offender Therapy and Comparative Criminology. Online ahead of print. pmid:36384305
- 45. MacCoun R, Perlmutter S. Hide results to seek the truth. Nature. 2015;526:187–189.
- 46. Platt JR. Strong inference. Science. 1964;146:347–353.
- 47. Goodman SN. p values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. American Journal of Epidemiology. 1993;137(5):485–96. pmid:8465801
- 48. Greenland S. Divergence versus decision P-values: A distinction worth making in theory and keeping in practice: Or, how divergence P-values measure evidence even when decision P-values do not. Scandinavian Journal of Statistics. 2023;50:54–88.
- 49. Schneider JW. Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations. Scientometrics. 2015;102:411–432.
- 50. Cohen J. Statistical Power Analysis for the Behavioral Sciences. New York: Routledge; 1988.
- 51. Barrett B, Brown D, Mundt M, Brown R. Sufficiently important difference: expanding the framework of clinical significance. Med Decis Mak. 2005;25:250–61. pmid:15951453
- 52. Gelman A, Carlin J. Some Natural Solutions to the p-Value Communication Problem–and Why They Won’t Work. Journal of the American Statistical Association. 2017;112(519):889–901.
- 53. Šidák ZK. Rectangular Confidence Regions for the Means of Multivariate Normal Distributions. Journal of the American Statistical Association. 1967;62(318):626–633.
- 54. Eisenach JC, Warner DS, Houle TT. Reporting of preclinical research in anesthesiology: transparency and enforcement. Anesthesiology. 2016;124(4):763–5. pmid:26845144
- 55. Gorman DM, Elkins AD, Lawley M. A systems approach to understanding and improving research integrity. Science and Engineering Ethics. 2019;25:211–225. pmid:29071573
- 56. Institute of Medicine. Integrity in Scientific Research: Creating an Environment that Promotes Responsible Conduct. Washington, DC: National Academies Press; 2002. https://doi.org/10.17226/10430 pmid:24967480
- 57. Nosek BA, Spies JR, Motyl M. Scientific utopia II. Restructuring incentives and practices to promote truth over publishability. Psychological Science. 2012;7:615–631. pmid:26168121
- 58. Robson SG, Baum MA, Beaudry JL, Beitner J, Brohmer H, Chin JM, et al. Promoting open science: A holistic approach to changing behavior. Collabra: Psychology; 2021;7(1):30137.
- 59. Meadows D. Leverage Points: Places to intervene in a System. Hartland, VT: The Sustainability Institute; 1999. https://donellameadows.org/wp-content/userfiles/Leverage_Points.pdf.
- 60. Allen C, Mehler DMA. Open science challenges, benefits and tips in early career and beyond. PLoS Biology. 2019;17:e3000246. pmid:31042704
- 61. Scheel AM, Schijen MRMJ, Lakens D. An excess of positive results: Comparing the standard psychology literature with Registered Reports. Advances in Methods & Practices in Psychological Science. 2021;4:1–12.
- 62. Humphreys M, de la Sierra RS, van der Windt P. Fishing, commitment, and communication: A proposal for comprehensive nonbinding research registration. Political Analysis. 2013;21:1–20.
- 63. Wagenmakers E-J, Wetzels R, Borsboom D, van der Maas HLJ, Kievit RA. An agenda for purely confirmatory research. Perspectives in Psychological Science. 2012;7:632–638. pmid:26168122
- 64. Greenland S. Connecting simple and precise P-values to complex and ambiguous realities (includes rejoinder to comments on “Divergence vs. decision P-values”). Scandinavian Journal of Statistics. 2023;50:899–914.
- 65. Flor LS, Reitman MB, Gupta V, Ng M, Gakidou E. The effects of tobacco control policies on global smoking prevalence. Nature Medicine. 2021;27:239–243. pmid:33479500
- 66. Trafimow D, Marks M. Editorial. Basic and Applied Social Psychology. 2018;37:1–2.
- 67. Greenland S. Invited Commentary: The Need for Cognitive Science in Methodology. American Journal of Epidemiology. 2017;186(6):639–645. pmid:28938712
- 68. Goodman S. Commentary: The P-value, devalued. International Journal of Epidemiology. 2003;32:699–702. pmid:14559733
- 69. Cumming G. The new statistics: Why and how. Psychological Science. 2014;25(1):7–29. pmid:24220629
- 70. Wagenmakers E-J, Verhagen J, Ly A, Matzke D, Steingroever H, Rouder JN, et al. The need for Bayesian hypothesis testing in psychological science. In: Lilienfeld SO, Waldman I, editors. Psychological Science under Scrutiny: Recent Challenges and Proposed Solutions. Hoboken (NJ): Wiley-Blackwell; 2017. p. 123–138.
- 71. Fidler F, Thomason N, Cumming G, Finch S, Leeman J. Editors can lead researchers to confidence intervals, but can’t make them think. Psychological Science. 2004;15(2):119–126.
- 72. Finch S, Cumming G, Williams J, Palmer L, Griffith E, Alders C, Anderson J, Goodman O. Reform of statistical inference in psychology: The case of Memory & Cognition. Behavior Research Methods, Instruments & Computers. 2004;36:312–324.
- 73. The Medical Outreach Subteam of the Drug Information Association Bayesian Scientific Working Group, Clark J, Muhlemann N, Natanegara F, Hartley A, Wenkert D, et al. Why are not there more Bayesian clinical trials? Perceived barriers and educational preferences among medical researchers involved in drug development. Therapeutic Innovation & Regulatory Science. 2023;57:417–425.
- 74. Campbell H, Gustafson P. The world of research has gone berserk: Modeling the consequences of requiring “greater statistical stringency” for scientific publication. The American Statistician. 2019;73(S1):358–373.
- 75. Lehrer J. The truth wears off. The New Yorker. December 13, 2010:52–57. https://sites.ualberta.ca/~ahamann/teaching/renr480/reading/Lehrer_2010_The_truth_wears_off.pdf.
- 76. Pietschnig J, Siegel M, Eder JSN, Gittler G. Effect declines are systematic, strong, and ubiquitous: A meta-meta-analysis of the decline effect in Intelligence Research. Frontiers in Psychology. 2019;10:2874. pmid:31920891
- 77. Begley CG, Ellis L. Drug development: raise standards for preclinical research. Nature. 2012;483:531–533.
- 78. Fisher RA. The arrangement of field experiments. Journal of the Ministry of Agriculture. 1926;33:503–513.
- 79. Robinson DH, Wainer H. On the past and future of null hypothesis significance testing. The Journal of Wildlife Management. 2002;66(2):263–271.