
Impact of redefining statistical significance on P-hacking and false positive rates: An agent-based model

  • Ben G. Fitzpatrick ,

    Contributed equally to this work with: Ben G. Fitzpatrick, Dennis M. Gorman

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

    bfitzpatrick@lmu.edu (BGF); ctrombat@lion.lmu.edu (CT)

    Affiliations Department of Mathematics, Loyola Marymount University, Los Angeles, California, United States of America, Tempest Technologies, Los Angeles, California, United States of America

  • Dennis M. Gorman ,

    Contributed equally to this work with: Ben G. Fitzpatrick, Dennis M. Gorman

    Roles Conceptualization, Methodology, Writing – original draft, Writing – review & editing

    Affiliation Department of Epidemiology & Biostatistics, School of Public Health, Texas A&M University, College Station, Texas, United States of America

  • Caitlin Trombatore

    Roles Formal analysis, Investigation, Software

    bfitzpatrick@lmu.edu (BGF); ctrombat@lion.lmu.edu (CT)

    Affiliation Department of Mathematics, Loyola Marymount University, Los Angeles, California, United States of America

Abstract

In recent years, concern has grown about the inappropriate application and interpretation of P values, especially the use of P<0.05 to denote “statistical significance” and the practice of P-hacking to produce results below this threshold that are then selectively reported in publications. Such behavior is said to be a major contributor to the large number of false and non-reproducible discoveries found in academic journals. In response, it has been proposed that the threshold for statistical significance be changed from 0.05 to 0.005. The aim of the current study was to use an evolutionary agent-based model comprised of researchers who test hypotheses and strive to increase their publication rates in order to explore the impact of a 0.005 P value threshold on P-hacking and published false positive rates. Three scenarios were examined: one in which researchers tested a single hypothesis, one in which they tested multiple hypotheses using a P<0.05 threshold, and one in which they tested multiple hypotheses using a P<0.005 threshold. Effect sizes were varied across models, and output was assessed in terms of researcher effort, number of hypotheses tested, number of publications, and the published false positive rate. The results supported the view that a more stringent P value threshold can serve to reduce the rate of published false positive results. Researchers still engaged in P-hacking with the new threshold, but the effort they expended increased substantially and their overall productivity was reduced, resulting in a decline in the published false positive rate. Compared to other proposed interventions to improve the academic publishing system, changing the P value threshold has the advantage of being relatively easy to implement and could be monitored and enforced with minimal effort by journal editors and peer reviewers.

Introduction

Concern with the inappropriate application and interpretation of P values has grown in recent years, especially regarding making multiple comparisons within one dataset and using P<0.05 to denote “statistically significant” differences [1–3]. Gelman and Loken observe that many datasets and analysis plans present the opportunity to make multiple comparisons without investigators purposefully and deliberately fishing for a P value below 0.05 [4]. However, the facts that most published results are positive and that an unusually large number of these barely pass the P<0.05 threshold have raised concern that many of these results may in fact arise from data dredging, also referred to as “P-hacking” [5–9]. Data dredging denotes a deliberate and purposeful search for statistically significant results, with analysis continuing until at least one is found. This chance statistically significant result is then selectively reported in a manuscript as though it were the result of a prespecified test of a hypothesis.

A number of solutions to data dredging and selective outcome reporting have been proposed, including preregistration of study methods and analysis plans, data sharing, Registered Reports, blind data analysis, and adversarial collaboration. The latter two methods have not been widely adopted outside of physics and psychology, respectively [10,11]. While there is evidence that preregistration and data sharing can reduce the number of false positive results published by journals, these editorial practices have been undermined by the failure of many journals to enforce preregistration and data sharing policies and by some study registries allowing retrospective registration [12–14]. As for Registered Reports, interpretation of findings indicating that this format produces more results supporting the null is difficult, as it cannot be ruled out that researchers who are especially concerned with selective outcome reporting bias and/or inclined to test “risky” hypotheses are more likely to use it [15,16].

A formal proposal that the threshold for “statistical significance” be changed from P<0.05 to P<0.005 was made by Benjamin and other statisticians and research methodologists in 2018 [17]. It is premised on the idea that “statistical standards of evidence for claiming new discoveries in many fields of science are simply too low” [17, p. 6]. The practice of associating statistically significant findings with P<0.05 will result in a high false positive rate, they argue, irrespective of other problems in study design, data analysis, and reporting of results. In support of their argument, Benjamin et al. state that a two-sided P value of 0.05 corresponds to Bayes factors in favor of the alternative hypothesis in the range of 2.5 to 3.4, which is “weak” or “very weak” evidence. In contrast, a two-sided P value of 0.005 is in the “strong” to “substantial” range of Bayes factor recommendations. They also present a table showing the association between the two P value thresholds, study power, prior odds of the alternative hypothesis, and the false positive rate. For example, with prior odds of 1:10, the false positive rate drops from 33% with a P value threshold of 0.05 to 5% with a threshold of 0.005, over a range of values of statistical power. Lakens et al. [18] question the underlying assumptions and data used in arriving at each of these justifications (e.g., the prior odds estimate used in calculating the false positive rates for each alpha level is based on data from just 73 studies from the Reproducibility Project: Psychology).

Opinions as to the effect of changing the P-value threshold on data dredging and P-hacking vary. Ioannidis [1] considers it to be a “temporizing measure” to dam the flood of positive results, but likely to be more beneficial than harmful. On the other hand, Amrhein and Greenland [19] contend that the new threshold will lead to “more intense P hacking and selective reporting”. Benjamin et al. [17] acknowledge that an investigator can still engage in such analytic practices while using a P<0.005 threshold but contend that the likelihood of a “statistically significant” chance finding emerging from these analyses is lower than with a 0.05 threshold. Moreover, it is likely that the longer the data dredging exercise continues, the more easily identifiable it will be to those reading a publication in which such results are reported; that is, investigators will be forced into conducting more extreme analyses as the search for a statistically significant result continues (e.g., removing study groups or assessment points from the analysis, even when the study is registered) [e.g., 20,21].

A handful of studies have assessed the effects of the proposed change in threshold on the number of statistically significant results reported using P values from already published studies. Based on a large text-mining study of many thousands of P values published over 25 years, Ioannidis estimated that changing the P value threshold from 0.05 to 0.005 would remove about one-third of the statistically significant results of past biomedical literature [22,23].

In a series of studies, Vassar and colleagues examined the impact of changing the threshold on randomized controlled trials (RCTs) published in general medical, orthopaedic trauma, and orthopaedic sports medicine journals [24–26]. The results of these studies are summarized in Table 1, along with a study by Thakur and Jha [27] that examined changing the P value threshold on results from 123 RCTs pertaining to chronic rhinosinusitis and a study by Khan et al. [28] that focused on 72 RCTs from high impact general medical and cardiology journals. Across the five studies, the proportion of results that retained statistical significance with a 0.005 threshold ranged from 38.9% to 70.7%.

Table 1. Summary of results from studies that assessed changing the P-value threshold.

https://doi.org/10.1371/journal.pone.0303262.t001

The studies in Table 1 were focused on recent RCTs published in top medical journals and a majority would have been registered. Such studies are most appropriate for null hypothesis statistical testing using P values. In contrast, research focused on data dredging, P-hacking and the clustering of P values just below 0.05 has typically examined studies in psychology, biology and political science that are unlikely to be preregistered and many of which will use study designs other than RCTs [e.g., 7,9,29,30]. Benjamin et al. [17] recommend that research using non-experimental designs, especially exploratory research that tests numerous hypotheses, should employ even lower P value thresholds than 0.005. It is unknown what percent of P values from non-experimental studies that were statistically significant with the traditional 0.05 threshold would remain so with a 0.005 threshold, let alone one more stringent.

Assessing the potential effects of changes in P value thresholds on the validity of published results is difficult as traditional empirical methods involving experimental manipulation are not feasible. Such constraints apply to studying many aspects of the current academic incentive system and proposed changes to realign it in ways that produce higher quality research and more valid published findings. This has led to the application of simulation models to estimate the effects of the “publish or perish” academic culture on the quality of published scientific research and the potential of proposed changes to the publication process to improve it. An influential work in this area is that of Smaldino and McElreath [31], which describes a competitive agent-based model (ABM) [32] comprised of research laboratories that survive based on their ability to publish a high volume of papers while expending low effort, albeit with many of these papers reporting false positive results. The model demonstrates that this way of conducting research, called “bad science”, can rapidly spread through a group of laboratories and become its dominant approach. The main results of the model were recently replicated [33], and its framework has been used to conduct virtual experiments on interventions such as auditing of research facilities, making publication of negative results more prestigious, improving peer review, assigning research funds randomly or according to methodological integrity, and researchers expending effort on the selection of strong hypotheses through theory development [34–36].

The aim of the current study is to examine the statistical process of empirical science using an ABM, building on the work of Smaldino and McElreath [31]. Here the agents are researchers who test hypotheses and publish positive findings. With this model we explore the impact of P-hacking and modified P value thresholds on false positive rates in the literature.

We should be clear about two aspects of what it is we are modeling. First, the disciplines we have in mind are those in which a culture of “you can publish if you found a significant effect” prevails, thereby encouraging multiple statistical testing (i.e., data dredging or P-hacking) to obtain such effects [37]. Disciplines in which such concerns have been raised include the biomedical sciences, ecology, psychology, and biology [37–40]. The percentage of published positive results in such “soft” disciplines is estimated to exceed 90%, especially in applied areas of research [41]. Depending on study design, sample size, specificity of hypotheses, analytic flexibility, and other methodological features, most of these disciplines’ published positive results may be false [37,42–44]. While even the hardest of sciences are not immune to confirmation bias, multiple testing using P values to determine “significance” is not among the questionable research practices used in these fields to achieve desirable results, and therefore the solutions in these disciplines are different, albeit with potential application to the soft sciences [45,46].

The second clarification involves the type of null hypothesis significance testing (NHST) we model. There is a large literature detailing how NHST as reported in most published papers produced by researchers in the disciplines described above is a problematic amalgam of Fisher’s significance test and the hypothesis testing approach described by Neyman and Pearson [47–49]. The problems with this hybrid have been clearly explained but are not of immediate concern to us as we are presenting an idealized model of what researchers in disciplines such as biomedical science and psychology typically do when hypothesis testing rather than what they should do. Accordingly, in the current model the agents structure their studies as a null hypothesis statistical test, and they reject the null hypothesis on the basis of a P value of 0.05 or below.

Methods

Building on the model of Smaldino and McElreath [31], we populate our ABM with N researchers that dynamically perform experiments to generate new results. Each researcher is characterized by a vector of state variables: effort, number of hypotheses per experiment, value, age, and effect size. The first two values denote the researcher’s methods, in the sense that the amount of effort and the number of “bites at the apple” (i.e., attempts to find a statistically significant result) comprise the variables that drive publication outcomes, as will be seen below.

Effort

In our model, effort impacts the research process in two ways: the time it takes to perform an experiment and the power a study can achieve. These two impacts are in tension: more effort potentially slows productivity but improves the chance of true positive results. Smaldino and McElreath [31] model the probability of conducting a study during a time step with a convex, decreasing function. Our functional form is different but maintains the convex, decreasing shape; it depends on the researcher's effort E, a lower bound on effort Emin, and a tunable model parameter η. Greater effort also means a smaller probability of false positive results. Fig 1 illustrates this functional relationship.
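The exact functional form is not reproduced in the text above, so the sketch below assumes a simple exponential decay in effort with the qualitative properties described (convex, decreasing, parameterized by Emin and η). The function name and default parameter values are illustrative assumptions, not the authors' specification.

```python
import math

def prob_conduct_experiment(effort: float, e_min: float = 1.0, eta: float = 0.1) -> float:
    """Probability that a researcher completes an experiment in a given time step.

    Assumed exponential form: convex and decreasing in effort E, equal to 1 at
    the lower bound E_min, with eta controlling how quickly higher effort
    slows productivity. Only the shape, not this formula, comes from the paper.
    """
    return math.exp(-eta * (max(effort, e_min) - e_min))
```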

Fig 1. Functional relationship between agent effort and probability of conducting an experiment.

https://doi.org/10.1371/journal.pone.0303262.g001

Number of hypotheses

It is in this component that our model is quite different from that of Smaldino and McElreath [31]. We simulate a statistical hypothesis test, a modeling choice that allows us to investigate the reduced P value threshold intervention. The number of statistical hypothesis tests performed models the process of P-hacking, repeatedly testing in hopes of finding something to be statistically significant.

Value

If a researcher performs a hypothesis test that achieves a sufficiently small P value, then that positive finding results in a publication. A researcher’s value is the total number of publications accumulated over the course of their career.

Age

A researcher’s age is the number of time steps in the simulation for which they have been active.

Time progresses in discrete fixed steps, during which each researcher:

  • applies an amount of effort in order to conduct an experiment (or not);
  • collects data to analyze;
  • tests one or more hypotheses;
  • publishes any positive finding;
  • checks the publication rates of other researchers to identify “better” methods.

In addition, researchers “quit” or “retire” at random with an exponential rate, and new researchers enter the workforce to maintain a fixed population size of N researchers.
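To make the time-step loop concrete, the following is a minimal sketch of one step under simplified assumptions: the Researcher fields, the imitation rule (one random researcher copies the most-published researcher's methods), the per-step retirement probability, and the effort and hypothesis ranges are illustrative stand-ins rather than the model's actual rules or parameter values.

```python
import math
import random
from dataclasses import dataclass

@dataclass
class Researcher:
    effort: float        # drives experiment speed (inversely) and power
    n_hypotheses: int    # "bites at the apple" per experiment
    value: int = 0       # cumulative publications
    age: int = 0         # time steps active

def prob_conduct(effort: float, e_min: float = 1.0, eta: float = 0.1) -> float:
    # Convex, decreasing probability of completing an experiment this step
    # (assumed exponential form, as in the earlier effort sketch).
    return math.exp(-eta * (max(effort, e_min) - e_min))

def time_step(researchers, run_experiment, rng, retire_prob=0.01):
    """One discrete step: experiment (or not), publish any positive finding,
    imitate a more productive peer, and retire/replace at random.

    `run_experiment(researcher, rng)` should return True when at least one
    hypothesis clears the significance threshold.
    """
    for r in researchers:
        if rng.random() < prob_conduct(r.effort):
            if run_experiment(r, rng):
                r.value += 1          # any positive finding yields a publication
        r.age += 1
    # simplified imitation: a random researcher adopts the methods of the
    # researcher with the most publications
    best = max(researchers, key=lambda r: r.value)
    learner = rng.choice(researchers)
    learner.effort, learner.n_hypotheses = best.effort, best.n_hypotheses
    # random retirement with replacement keeps the population size fixed
    for i, r in enumerate(researchers):
        if rng.random() < retire_prob:
            researchers[i] = Researcher(effort=rng.uniform(1.0, 75.0),
                                        n_hypotheses=rng.randint(1, 5))

# usage sketch: a toy experiment that "succeeds" 10% of the time
pop = [Researcher(effort=10.0, n_hypotheses=3) for _ in range(100)]
time_step(pop, run_experiment=lambda r, g: g.random() < 0.1, rng=random.Random(0))
```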

Effect size and hypothesis testing in the model

In order to study the impact of the P value threshold change, we must simulate the process of statistical hypothesis testing, which involves two types of error: type I or false positive and type II or false negative. In aggregate, P value thresholds define the acceptable probability of false positive findings.

An important aspect of the simulation is the modeling of “truth.” To simulate false positive rates within a hypothesis testing context, we must select a means of generating “true” and “false.” In this model, each researcher conducts a number of independent-sample t-tests, the simplest test comparing a control group to an experimental or treatment group. To simulate these t-tests, we generate true Cohen’s d effect sizes from an exponential distribution with density f(d) = (1/d0) exp(-d/d0) for d ≥ 0, in which d0 is a tunable modeling parameter. Smaller effect sizes are more likely, and larger effect sizes are less likely in this model.

To generate experimental truth, we model an effect size greater than dmin (a tunable parameter) as yielding an experiment for which the null hypothesis is false. Within this modeling structure, the prior probability of the null hypothesis being true is Pr[H0] = Pr[d ≤ dmin] = 1 - exp(-dmin/d0).

For example, if we set dmin = 0.2, which is a small effect according to Cohen [50], and we set 80% as the prior probability of the null, we have d0 = -dmin/ln(1 - 0.8) = -0.2/ln(0.2) = 0.1243. In general, the exponential scale parameter is determined from the minimum effect size and the prior probability by the formula d0 = -dmin/ln(1 - Pr[H0]). Fig 2 illustrates the probability density we use to generate effect sizes in the model.
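As a numerical check on this relationship, the short sketch below computes d0 from dmin and Pr[H0] and samples effect sizes from the exponential distribution; the function and variable names are ours, not the paper's.

```python
import numpy as np

def scale_from_prior(d_min: float, prior_null: float) -> float:
    """Exponential scale d0 chosen so that Pr[d <= d_min] = prior_null,
    i.e. d0 = -d_min / ln(1 - prior_null)."""
    return -d_min / np.log(1.0 - prior_null)

rng = np.random.default_rng(0)
d_min, prior_null = 0.2, 0.8
d0 = scale_from_prior(d_min, prior_null)
effects = rng.exponential(scale=d0, size=100_000)     # simulated "true" effect sizes
print(round(d0, 4))                                   # 0.1243, as in the text
print(round(float(np.mean(effects <= d_min)), 3))     # ~0.8 of effects fall at or below d_min
```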

Fig 2. Probability density used to generate effect sizes.

https://doi.org/10.1371/journal.pone.0303262.g002

This model treats dmin as a minimally relevant effect size, which is the smallest effect that would “justify associated costs, risks, and other harms” [51, p. 251]. An experiment simulated with an effect size below dmin would be considered as arising from a true null hypothesis; likewise, an experiment simulated with an effect above dmin would be a false null. Once the experiment is simulated, and data are obtained, the t-test is performed. A true positive, then, will happen when the point-null hypothesis of zero effect is rejected, in an experiment simulated with an underlying effect size above dmin. A false positive will come about when the null is rejected in a simulation with an underlying effect at or below dmin.

Of course, this definition of truth differs from the point-null that the effect size is 0 (versus the alternative that it is not). An alternative modeling choice would be to simulate truth of 0 effect sizes with some a priori probability and non-zero effect sizes with the complementary probability. But as Gelman and Carlin [52, p.900] remark, “[r]ealistically, all statistical hypotheses are false: effects are not exactly zero, groups are not exactly identical, distributions are not really normal, measurements are not quite unbiased, and so on.” Toward that end, we choose to model truth via minimally relevant effect sizes.

Within this structure there are many ways to simulate effect sizes for researchers. In our model, we simulate a single d value for each researcher, say di, i = 1, 2, …, N. This value serves as a scale parameter for simulating effect sizes for individual experiments. That is, at each time step, Researcher i obtains effect sizes di,1, di,2, …, di,ki for the multiple hypotheses 1, 2, …, ki. Researcher i then obtains ki t-statistic values randomly sampled from the noncentral t distribution with 2*Ei-2 degrees of freedom and noncentrality parameters di,j*sqrt(Ei/2), j = 1, …, ki (the standard noncentrality for a two-sample t-test with Ei observations per group).
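A minimal sketch of this sampling step follows. It assumes, as the degrees of freedom suggest but the text does not state outright, that effort plays the role of the per-group sample size n, so that df = 2n - 2 and the noncentrality parameter is d*sqrt(n/2); the function name and the example arguments are ours.

```python
import numpy as np
from scipy.stats import nct

def simulate_t_values(d_scale: float, k: int, effort: float, rng):
    """Draw k test statistics for one researcher at one time step.

    Assumptions: effort acts as the per-group sample size n (df = 2n - 2) and
    the noncentrality parameter is the usual two-sample value d * sqrt(n / 2).
    Per-hypothesis effect sizes are exponential with researcher-specific scale.
    """
    n = max(int(round(effort)), 2)
    d = rng.exponential(scale=d_scale, size=k)     # effect sizes d_{i,1..k}
    ncp = d * np.sqrt(n / 2.0)                     # noncentrality parameters
    t_vals = nct.rvs(df=2 * n - 2, nc=ncp, size=k, random_state=rng)
    return t_vals, d

rng = np.random.default_rng(0)
t_vals, effects = simulate_t_values(d_scale=0.1243, k=5, effort=50, rng=rng)
```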

At this point in the simulation, each researcher has obtained a number of T values. A P-hacking researcher would ask if at least one of these is greater than the critical t-value for the null hypothesis at the threshold level α. If so, the researcher would claim a positive outcome and publish. To determine whether or not that result is a true positive in the simulation, we check (1) that the effect size was at least dmin and (2) that the T value exceeded a critical t-value corrected for multiple testing using the Šidák correction [53], αk = 1 - (1 - α)^(1/k), in which k denotes the number of hypotheses (or T values obtained by the researcher). If these two conditions are met, the result is a true positive for the purposes of the simulation. Otherwise, the result is a false positive. Any positive finding increments the researcher’s value by one unit. Negative results do not count towards the researcher’s value.
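The sketch below restates this classification rule under two assumptions that the text leaves open: the rejection rule is one-sided (T greater than the critical t-value, as worded above) and the largest T is taken to be the published finding. It can be chained after the simulate_t_values sketch above; names are ours.

```python
import numpy as np
from scipy.stats import t as t_dist

def classify_outcome(t_vals, effects, d_min, alpha, effort):
    """Publish if any T clears the uncorrected critical value at level alpha;
    count the publication as a true positive only if the claimed finding has
    an underlying effect of at least d_min and its T clears the critical value
    at the Sidak-corrected level alpha_k = 1 - (1 - alpha)**(1/k)."""
    k = len(t_vals)
    n = max(int(round(effort)), 2)
    df = 2 * n - 2
    t_crit = t_dist.ppf(1.0 - alpha, df)             # researcher's naive threshold
    if max(t_vals) <= t_crit:
        return "no publication"
    alpha_sidak = 1.0 - (1.0 - alpha) ** (1.0 / k)   # Sidak correction
    t_crit_sidak = t_dist.ppf(1.0 - alpha_sidak, df)
    j = int(np.argmax(t_vals))                       # the "winning" hypothesis
    if effects[j] >= d_min and t_vals[j] > t_crit_sidak:
        return "true positive"
    return "false positive"
```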

To be specific, the P values computed herein correspond to the divergence P value as defined by Greenland [48]. That is, P values are computed as the probability that future, replicated test statistics exceed the value of the test statistic computed for the data at hand, under the condition that the null hypothesis is true. To the best of our knowledge, most statistical software packages that provide P values compute them in this manner.

The parameters of the model are given in Table 2.
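As a convenience, the sketch below gathers into one container the parameters named in the Methods and Results; values flagged as assumed are illustrative placeholders rather than the entries of Table 2.

```python
from dataclasses import dataclass

@dataclass
class ModelParameters:
    """Parameters named in the text; 'assumed' values are placeholders."""
    n_researchers: int = 2000     # population size N (Results)
    n_steps: int = 50_000         # time steps per run (Results)
    n_replicates: int = 100       # evolutionary replicates (Results)
    prior_null: float = 0.8       # Pr[H0]; runs use 0.9, 0.8, and 0.5
    d_min: float = 0.2            # minimally relevant effect size (Cohen's "small")
    alpha: float = 0.05           # significance threshold; the intervention uses 0.005
    e_min: float = 1.0            # lower bound on effort (assumed value)
    eta: float = 0.1              # effort-probability parameter (assumed value)
    retire_prob: float = 0.01     # per-step retirement probability (assumed value)
```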

Results

We present three sets of model runs, organized by the prior probability of the null being 0.9, 0.8, or 0.5. In each set of runs, we compare three different scenarios. The first is an “ideal” scenario, in which each researcher tests exactly one hypothesis with each experiment. If that experiment’s analysis results in a P value of less than 0.05, the researcher publishes. In the simulation, we know which hypotheses are actually true, so we tabulate the fraction of published works that are false positives. In the second, researchers may test multiple hypotheses, and if at least one of these results in a P value below 0.05, that experiment is published. In this scenario, researchers are not using any type of P value correction for multiple testing. However, in the simulation, true positives are determined using the Šidák correction: “truth” is simulated, but “positive” for the purpose of “true positive” is computed with the Šidák correction. The third scenario follows the second except that we apply the intervention of reducing the P value threshold to 0.005. Fig 3 illustrates the evolution over 50,000 time steps and 100 evolutionary replicates. Panels in the figure show the medians across researchers and replicates at each time step.

Fig 3. Evolution of effort, number of hypotheses tested, false positive publication rate, and value (clockwise from top left), Pr[H0] = 0.9.

Quantities are medians across 2000 researchers, averaged over 100 simulation replicates.

https://doi.org/10.1371/journal.pone.0303262.g003

In this set of simulations, we use Pr[H0] = 0.9, wherein the majority of simulated experiments will come from effect sizes below the threshold for “truth.” We see that P-hacking continues at the 0.005 threshold, but that the published false positive rate declines relative to the rate under the 0.05 threshold.

In a second set of simulations, presented in Fig 4, we examine the prior probability of the null set to 0.8, and in Fig 5 we show results for the prior probability of the null set to 0.5.

Fig 4. Evolution of effort, number of hypotheses tested, false positive publication rate, and value (clockwise from top left), Pr[H0] = 0.8.

Quantities are medians across 2000 researchers, averaged over 100 simulation replicates.

https://doi.org/10.1371/journal.pone.0303262.g004

Fig 5. Evolution of effort, number of hypotheses tested, false positive publication rate, and value (clockwise from top left), Pr[H0] = 0.5.

Quantities are medians across 2000 researchers, averaged over 100 simulation replicates.

https://doi.org/10.1371/journal.pone.0303262.g005

In Figs 3–5, medians are graphed over time, but variation around the median is also of interest. To illustrate the variation in these quantities, we follow up with some histograms for the case of Pr[H0] = 0.8. In Fig 6 we show the system’s false positive publication rate at the final time step, as a histogram over 100 simulation replicates. For each replicate, we compute the cumulative number of false positives over the current researchers, and we divide by the cumulative number of publications (that is, true positives plus false positives) of the current researchers. We generate a histogram for each of the three simulation scenarios. For the single hypothesis scenario, values cluster in the center of the distribution, with a maximum frequency occurring near a false positive rate of 0.52. The distribution under the 0.005 scenario bears a resemblance to this, with a mode near 0.54 and somewhat more variation. In contrast, the 0.05 distribution clusters around 0.875–0.9, indicating a greater number of false positive results among the researchers at the final step of the model.
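The per-replicate quantity plotted in Fig 6 reduces to a simple ratio; the sketch below restates that computation (function and argument names are ours).

```python
import numpy as np

def published_false_positive_rate(false_pos_per_researcher, true_pos_per_researcher):
    """Replicate-level rate at the final time step: cumulative false positives
    across current researchers divided by their cumulative publications
    (true positives plus false positives)."""
    fp = float(np.sum(false_pos_per_researcher))
    total = fp + float(np.sum(true_pos_per_researcher))
    return fp / total if total > 0 else float("nan")
```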

Fig 6. Distribution of false positive publication rates for the three scenarios.

https://doi.org/10.1371/journal.pone.0303262.g006

Variation in the number of hypotheses tested for each of the three scenarios is shown in Fig 7. The number of hypotheses for each researcher is tabulated over the 100 replicates and the 2000 researchers. This shows the continuation of testing multiple hypotheses, that is, P-hacking, under the 0.005 scenario. For both the 0.05 and 0.005 thresholds, the distributions show a bimodal shape with one peak at the low end and a spread of values that skews slightly toward larger values. In both cases, the number of attempted hypotheses tapers off for values larger than 15.

Fig 7. Distribution of number of hypotheses for the three scenarios (inset shows only the 0.05 and 0.005 thresholds).

https://doi.org/10.1371/journal.pone.0303262.g007

Variation in the amount of effort for each scenario is shown in Fig 8. The effort for each researcher is tabulated over the 100 replicates and the 2000 researchers. The distribution of effort for multiple hypotheses tested with a threshold of 0.05 resembles that of the single-hypothesis scenario, but with higher peaks. Changing the threshold to 0.005 produces a more even distribution of effort with lower peaks than the 0.05 threshold.

Conclusions

The results of the simulations generally support Ioannidis’ [1] view that a more stringent P value threshold can serve to dam the flood of published positive results. As Benjamin et al. [17] acknowledge, investigators can still engage in P-hacking, but with the 0.005 threshold the false positive rate declines while the effort expended by investigators increases substantially. Overall, their productivity is reduced, but there is an improvement in the production of true positive results.

While the results of our simulation do suggest that reducing the P value threshold may reduce the number of false-positive publications, we must note that this model has a number of limitations. First and foremost, it serves as an idealized version of the statistical aspects of the research process. Simulated researchers herein are applying a simple (two-sided) point-hypothesis test that is an appropriate test for the simulated data. P-hacking is modeled as a simple repeated generation of data for that test. There is no “garden of forking paths” that might lead researchers to search for a statistical method based on the observed data [4]. Moreover, researchers are not permitted to remove or add data in order to obtain positive results [54]. Certainly these (and other) unmodeled researcher activities could serve to moderate the effectiveness of the 0.005 intervention.

As noted in several recent papers, academic publishing can be viewed as a production system which currently incentivizes engagement in problematic behaviors such as data dredging, P-hacking, HARKing, and selective outcome reporting [55–58]. Using such a systems framework allows for the identification of leverage points which can be the target of interventions designed to improve research quality and integrity [59]. Such leverage points in the system will vary in the extent to which they are amenable to change and the magnitude and type of improvements in outputs that can be expected should such change successfully occur.

Preventive interventions targeted at leverage points can have a positive effect by reducing the extent to which individuals within the system engage in an unwanted behavior and/or by reducing the negative consequences (or harms) of engaging in the behavior. The simulations presented here indicate that the introduction by academic journals of a more stringent P value threshold for “statistical significance” has the latter effect: researchers continue to P-hack, but the number of false discoveries that enter the published literature as a result of this practice is reduced by more than half compared to using a 0.05 threshold when effect sizes are randomly generated.

One thing to consider when deciding whether the reductions in false positive results observed in the simulations are worth pursuing is that a change in the P value threshold is a relatively simple intervention to introduce into the academic publishing system and one that could be monitored and enforced with minimal effort by journal editors and peer reviewers. Essentially, it involves exchanging one arbitrary threshold of “statistical significance” for another, albeit a less familiar one. In practice, it would require journals announcing in their instructions to authors that P<0.005 now constitutes statistical significance, running each submitted manuscript through a computer program to ensure the new threshold was adhered to (and returning those that did not to their authors), and requesting that peer reviewers also ensure the 0.005 significance threshold was used in the analyses reported. In short, changing the P value threshold has the appeal of being an intervention with a clear target that involves an easy behavioral change and, if implemented widely, can reduce the false positive rate, albeit without entirely eradicating P-hacking.

While other interventions designed to improve research integrity and quality have greater potential impact, the feasibility of their widespread implementation, adherence, and enforcement is questionable. For example, in an ideal world every genuine a priori hypothesis-testing study would be written up in the form of a Registered Report. This essentially embeds preregistration in the publication pipeline and eliminates the incentive for researchers to data dredge when writing up the results of their studies [15,16]. This format appears successful in reducing the publication of positive results and, consequently, reducing the rate of false discoveries [60,61]. However, Registered Reports have not enjoyed widespread adoption among journals: in early 2022, Chambers and Tzavella [15] reported that only 300 journals offered this as a publishing option, with just 94 of these having published a total of 591 final manuscripts reporting study results.

Prospective registration is another proposed intervention that, in principle, can greatly reduce P-hacking and selective publication of positive results [62,63]. While this has been more widely adopted by academic journals than the Registered Reports format, adherence by authors and enforcement by journals is suboptimal and some registries allow retrospective registration and alterations of protocols after a study is underway or even complete [13,14].

To the extent that it reduces the number of false positive results that find their way into the published literature, changing the P value threshold to 0.005 appears to be an editorial procedure worth pursuing, given its minimal costs and inconvenience to editors and reviewers. In a recent discussion of P values, Greenland [64] argued that, as with tobacco smoking, education, not prohibition, might be the best way to limit their misuse and its attendant harms. However, while outright prohibition might have proved as problematic with tobacco as it did with alcohol, there is little doubt that policies that have restricted the circumstances and places in which one can smoke (e.g., smoking bans in workplaces, restaurants and bars, public transport, places of entertainment, aircraft) made significant contributions to declining smoking rates. Such policy changes restrict opportunities to smoke and decrease its social acceptability [65]. Our model suggests that changing the P value threshold restricts the opportunity for researchers to find a “significant” (and publishable) result and, if the effort to produce such a result through data dredging becomes more arduous and extreme, its acceptability will, hopefully, decline over time. Researchers are more likely to see the absurd and unethical nature of trying to squeeze a P<0.005 result out of a data set as the analyses required become increasingly distant from those originally intended. This, over time, might help change the current research culture of many disciplines in which P-hacking is so easy it virtually goes unnoticed. As with smoking, a comprehensive approach to the problem of P values is required; we believe changing the threshold for statistical significance should be part of this approach.

Although the results of the simulations suggest there are benefits, in terms of a reduction in published false positive results, to be derived from changing the P value threshold from 0.05 to 0.005, there are also potential negative consequences that must be considered. First, there are compelling arguments to the effect that it is not the threshold used to designate “statistical significance” that is the problem with P values, but rather that the very use of this statistic is problematic and should be discontinued [66]. Changing the P value threshold will therefore simply encourage the continued use of a statistical practice that should be abandoned altogether. Second, and more broadly, Greenland [67] contends that null hypothesis significance testing reinforces cognitive biases that are detrimental to the practice of science, specifically “dichotomania” (the tendency to misperceive quantities as dichotomous even when this is incorrect and misleading) and “nullism” (the assumption that false positives are more problematic than false negatives). From this perspective, changing the threshold simply moves the point at which the dichotomy is made and does so in a manner that assumes an over-abundance of false positives is a problem that requires addressing (based, as noted in the introduction, on the observation that null findings are relatively rare in the published literature of the academic disciplines upon which our modeling assumptions are based).

In response, while we are sympathetic to both arguments and believe they have merit, it seems very unlikely that the many disciplines in which P values are widely used (and misused) will abandon them anytime soon. To paraphrase Goodman [68], P values are in the statistical air of a great many academic disciplines and, as such, even those statisticians who would prefer a Bayesian atmosphere must live and breathe them to survive. In addition, eradicating P values would require an alternative, with effect size estimates (and confidence intervals) and Bayesian models the most frequently suggested [66,69,70]. The few cases in which journals have required authors to report effect estimates and not P values have met with limited success, with researchers largely ignoring confidence intervals when presenting their results and continuing to focus on their “significance” [71]. As Finch et al. [72] observe, proper presentation and interpretation of effect sizes and confidence intervals would require prior upstream training in these methods for researchers in many disciplines, not just editorial stricture. Even more training would be required if the alternative to P values were Bayesian analysis. A recent survey found that almost half of 323 medical researchers involved in clinical trials reported insufficient knowledge as the main reason they did not use Bayesian statistics [73]. Thus, changing the P value threshold seems, at this point, a modest proposal that might help correct the problem of false positive results identified in many social and behavioral academic disciplines.

A third potential problem with introducing a stricter threshold for statistical significance was identified in the optimality models presented by Campbell and Gustafson [74], which show that reducing false positives in the published literature can lead to a depletion in the number of truly “breakthrough discoveries” appearing in academic journals. This also applies to our models, as in them investigators expend more effort to publish fewer papers, albeit with more reliable results. The extent to which journal editors want to balance publication of truly reliable and valid results against publishing truly novel and surprising results might depend on the subject matter of the discipline and the question being addressed in the research. In cases where a very pressing issue with substantial societal implications is being addressed, less stringent requirements for “statistical significance” (i.e., P<0.05) might be warranted. In many cases, however, “breakthrough discoveries” will be of interest primarily to other academics in a particular field of research, and requiring these to meet a more stringent standard of statistical significance before publication will likely not impose any major cost on society. It might also help reduce the so-called “decline effect” [75], whereby an initially positive (but false) research finding concerning a phenomenon fails to be replicated in subsequent studies but becomes resistant to falsification by virtue of its perceived novelty and early influence on the field. The persistence of non-reproduced work can, in fact, be quite large [76,77].

Sir Ronald Fisher, considered by many to have popularized “tests of significance” using P values, stated, with respect to the P<0.05 criterion, that “[a] scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance” [78, p.85, Fisher’s emphasis]. That is to say, P<0.05 was intended as a screening tool, after which multiple replicates would be required for a “real” finding. Robinson and Wainer [79, p.264] emphasize that Fisher “understood science as a continuous and continuing process and viewed [what has come to be known as] ‘null hypothesis significance testing’ in that context.” Until the scientific research community can converge on longer-term and more challenging-to-implement interventions, a more onerous screening level may reduce the number of false positive publications as suggested by our models.

References

  1. Ioannidis JPA. What have we (not) learned from millions of scientific papers with P values? The American Statistician. 2019;73(S1):20–5.
  2. Wasserstein RL, Lazar NA. The ASA statement on statistical significance and P-values. The American Statistician. 2016;70:129–33.
  3. Johnson VE. Revised standards for statistical evidence. PNAS. 2013;110(48):19313–7. pmid:24218581
  4. Gelman A, Loken E. The statistical crisis in science. American Scientist. 2014;102:460–5.
  5. Erasmus A, Holman B, Ioannidis JPA. Data dredging bias. BMJ Evidence-Based Medicine. 2022;27(4):209–11. pmid:34930812
  6. Fanelli D. “Positive” results increase down the hierarchy of the sciences. PLoS One. 2010;5(4):e10068. pmid:20383332
  7. Masicampo EJ, Lalande DR. A peculiar prevalence of p values just below .05. Quarterly Journal of Experimental Psychology. 2012;65:2271–9. pmid:22853650
  8. Perneger TV, Combescure C. The distribution of P-values in medical research articles suggested selective reporting associated with statistical significance. Journal of Clinical Epidemiology. 2017;87:70–7. pmid:28400294
  9. Simonsohn U, Simmons JP, Nelson LD. Better P-curves: Making P-curve analysis more robust to errors, fraud, and ambitious P-hacking, a reply to Ulrich and Miller (2015). Journal of Experimental Psychology: General. 2015;144(6):1146–52. pmid:26595842
  10. MacCoun R, Perlmutter S. Blind analysis as a corrective for confirmatory bias in physics and psychology. In: Lilienfeld SO, Waldman I, editors. Psychological Science under Scrutiny: Recent Challenges and Proposed Solutions. Hoboken (NJ): Wiley-Blackwell; 2017. p. 297–321.
  11. Nuzzo R. Fooling ourselves. Nature. 2015;526:182–5. pmid:26450039
  12. Boccia S, Rothman KJ, Panic N, Flacco ME, Ross A, Pastorino R, et al. Registration practices for observational studies on ClinicalTrials.gov indicated low adherence. Journal of Clinical Epidemiology. 2016;70:176–82. pmid:26386325
  13. Taylor NJ, Gorman DM. Registration and primary outcome reporting in behavioral health trials. BMC Medical Research Methodology. 2022;22:41. pmid:35125101
  14. Serghiou S, Axfors C, Ioannidis JPA. Lessons learnt from registration of biomedical research. Nature Human Behaviour. 2023;7:9–11. pmid:36604496
  15. Chambers CD, Tzavella L. The past, present and future of Registered Reports. Nature Human Behaviour. 2022;6:29–42. pmid:34782730
  16. Hardwicke TE, Ioannidis JPA. Mapping the universe of Registered Reports. Nature Human Behaviour. 2018;2:793–6. pmid:31558810
  17. Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E-J, Berk R, et al. Redefine statistical significance. Nature Human Behaviour. 2018;2:6–10. pmid:30980045
  18. Lakens D, Adolfi FG, Albers CJ, Anvari F, Apps MAJ, Argamon SE, et al. Justify your alpha. Nature Human Behaviour. 2018;2(3):167–71. https://doi.org/10.1038/s41562-018-0311-x
  19. Amrhein V, Greenland S. Remove, rather than redefine, statistical significance. Nature Human Behaviour. 2018;2:4. pmid:30980046
  20. van der Zee T, Anaya J, Brown NJL. Statistical heartburn: an attempt to digest four pizza publications from the Cornell Food and Brand Lab. BMC Nutrition. 2017;3:54. pmid:32153834
  21. Gorman DM. Can a registered trial be reported as a one-group, pretest-posttest study with no explanation? A critique of Williams et al. (2021). Health and Justice. 2022;10:2. pmid:34978633
  22. Chavalarias D, Wallach JD, Li AH, Ioannidis JPA. Evolution of reporting P values in the biomedical literature, 1990–2015. JAMA. 2016;315(11):1141–8. pmid:26978209
  23. Ioannidis JPA. The proposal to lower P value thresholds to .005. JAMA. 2018;319(14):1429–30. pmid:29566133
  24. Evans S, Anderson JM, Johnson AL, Checketts JX, Scott J, Middlemist K, et al. The potential effect of lowering the threshold of statistical significance from p < 0.05 to p < 0.005 in orthopaedic sports medicine. Arthroscopy. 2021;37:1068–74. pmid:33253798
  25. Johnson AL, Evans S, Checketts JX, Scott J, Wayant C, Johnson M, et al. Effects of a proposal to alter the statistical significance threshold on previously published orthopaedic trauma randomized controlled trials. Injury. 2019;50:1934–7. pmid:31421816
  26. Wayant C, Scott J, Vassar M. Evaluation of lowering the P value threshold for statistical significance from .05 to .005 in previously published randomized clinical trials in major medical journals. JAMA. 2018;320:1813–5. pmid:30398593
  27. Thakur P, Jha V. Potential effects of lowering the threshold of statistical significance in the field of chronic rhinosinusitis: a meta-research on published randomized controlled trials over the last decade. Brazilian Journal of Otorhinolaryngology. 2022;88(S5):S83–9. pmid:35331655
  28. Khan SK, Irfan S, Khan SU, Mehra MR, Vaduganathan M. Transforming the interpretation of significance in heart failure trials. European Journal of Heart Failure. 2020;22:177–80. pmid:31729133
  29. Ioannidis JPA, Munafò MR, Fusar-Poli P, Nosek BA, David SP. Publication and other reporting biases in cognitive sciences: detection, prevalence, and prevention. Trends in Cognitive Sciences. 2014;18:235–41. pmid:24656991
  30. Kerr NL. HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review. 1998;2:196–217. pmid:15647155
  31. Smaldino PE, McElreath R. The natural selection of bad science. Royal Society Open Science. 2016;3:160384. pmid:27703703
  32. Railsback SF, Grimm V. Agent-Based and Individual-Based Modeling: A Practical Introduction. 2nd ed. Princeton (NJ): Princeton University Press; 2019.
  33. Kohrt F, Smaldino PE, McElreath R, Schönbrodt F. Replication of the natural selection of bad science. Royal Society Open Science. 2023;10:221306.
  34. Barnett AG, Zardo P, Graves N. Randomly auditing research labs could be an affordable way to improve research quality: A simulation study. PLoS ONE. 2018;13(4):e0195613. pmid:29649314
  35. Smaldino PE, Turner MA, Contreras Kallens PA. Open science and modified funding lotteries can impede the natural selection of bad science. Royal Society Open Science. 2019;6:190194. pmid:31417725
  36. Stewart AJ, Plotkin JB. The natural selection of good science. Nature Human Behaviour. 2021;5:1510–8. pmid:34002054
  37. Forstmeier W, Wagenmakers E-J, Parker TH. Detecting and avoiding likely false-positive findings: a practical guide. Biological Reviews. 2017;92:1941–68.
  38. Begley CG, Ioannidis JPA. Reproducibility in science: Improving the standard for basic and preclinical research. Circulation Research. 2015;116:116–26. pmid:25552691
  39. Kimmel K, Avolio ML, Ferraro PJ. Empirical evidence of widespread exaggeration bias and selective reporting in ecology. Nature Human Behaviour. 2023;7:1525–36. pmid:37537387
  40. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science. 2011;22:1359–66. pmid:22006061
  41. Fanelli D. “Positive” results increase down the hierarchy of the sciences. PLoS One. 2010;5(4):e10068.
  42. Ioannidis JPA. Why most published research findings are false. PLOS Medicine. 2005;2:e124. pmid:16060722
  43. Ioannidis JPA, Tarone R, McLaughlin JK. The false-positive to false-negative ratio in epidemiologic studies. Epidemiology. 2011;22:450–6. pmid:21490505
  44. Niemeyer RE, Proctor KR, Schwartz JA, Niemeyer RG. Are most published criminological research findings wrong? Taking stock of criminological research using a Bayesian simulation approach. International Journal of Offender Therapy and Comparative Criminology. Online ahead of print. pmid:36384305
  45. MacCoun R, Perlmutter S. Hide results to seek the truth. Nature. 2015;526:187–9.
  46. Platt JR. Strong inference. Science. 1964;146:347–53.
  47. Goodman SN. p values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. American Journal of Epidemiology. 1993;137(5):485–96. pmid:8465801
  48. Greenland S. Divergence versus decision P-values: A distinction worth making in theory and keeping in practice: Or, how divergence P-values measure evidence even when decision P-values do not. Scandinavian Journal of Statistics. 2023;50:54–88.
  49. Schneider JW. Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations. Scientometrics. 2015;102:411–32.
  50. Cohen J. Statistical Power Analysis for the Behavioral Sciences. New York: Routledge; 1988.
  51. Barrett B, Brown D, Mundt M, Brown R. Sufficiently important difference: expanding the framework of clinical significance. Medical Decision Making. 2005;25:250–61. pmid:15951453
  52. Gelman A, Carlin J. Some natural solutions to the p-value communication problem–and why they won’t work. Journal of the American Statistical Association. 2017;112(519):889–901.
  53. Šidák ZK. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association. 1967;62(318):626–33.
  54. Eisenach JC, Warner DS, Houle TT. Reporting of preclinical research in anesthesiology: transparency and enforcement. Anesthesiology. 2016;124(4):763–5. pmid:26845144
  55. Gorman DM, Elkins AD, Lawley M. A systems approach to understanding and improving research integrity. Science and Engineering Ethics. 2019;25:211–25. pmid:29071573
  56. Institute of Medicine. Integrity in Scientific Research: Creating an Environment that Promotes Responsible Conduct. Washington, DC: National Academies Press; 2002. https://doi.org/10.17226/10430 pmid:24967480
  57. Nosek BA, Spies JR, Motyl M. Scientific utopia II: Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science. 2012;7:615–31. pmid:26168121
  58. Robson SG, Baum MA, Beaudry JL, Beitner J, Brohmer H, Chin JM, et al. Promoting open science: A holistic approach to changing behavior. Collabra: Psychology. 2021;7(1):30137.
  59. Meadows D. Leverage Points: Places to Intervene in a System. Hartland, VT: The Sustainability Institute; 1999. https://donellameadows.org/wp-content/userfiles/Leverage_Points.pdf
  60. Allen C, Mehler DMA. Open science challenges, benefits and tips in early career and beyond. PLoS Biology. 2019;17:e3000246. pmid:31042704
  61. Scheel AM, Schijen MRMJ, Lakens D. An excess of positive results: Comparing the standard psychology literature with Registered Reports. Advances in Methods and Practices in Psychological Science. 2021;4:1–12.
  62. Humphreys M, de la Sierra RS, van der Windt P. Fishing, commitment, and communication: A proposal for comprehensive nonbinding research registration. Political Analysis. 2013;21:1–20.
  63. Wagenmakers E-J, Wetzels R, Borsboom D, van der Maas HLJ, Kievit RA. An agenda for purely confirmatory research. Perspectives on Psychological Science. 2012;7:632–8. pmid:26168122
  64. Greenland S. Connecting simple and precise P-values to complex and ambiguous realities (includes rejoinder to comments on “Divergence vs. decision P-values”). Scandinavian Journal of Statistics. 2023;50:899–914.
  65. Flor LS, Reitsma MB, Gupta V, Ng M, Gakidou E. The effects of tobacco control policies on global smoking prevalence. Nature Medicine. 2021;27:239–43. pmid:33479500
  66. Trafimow D, Marks M. Editorial. Basic and Applied Social Psychology. 2015;37:1–2.
  67. Greenland S. Invited commentary: The need for cognitive science in methodology. American Journal of Epidemiology. 2017;186(6):639–45. pmid:28938712
  68. Goodman S. Commentary: The P-value, devalued. International Journal of Epidemiology. 2003;32:699–702. pmid:14559733
  69. Cumming G. The new statistics: Why and how. Psychological Science. 2014;25(1):7–29. pmid:24220629
  70. Wagenmakers E-J, Verhagen J, Ly A, Matzke D, Steingroever H, Rouder JN, et al. The need for Bayesian hypothesis testing in psychological science. In: Lilienfeld SO, Waldman I, editors. Psychological Science under Scrutiny: Recent Challenges and Proposed Solutions. Hoboken (NJ): Wiley-Blackwell; 2017. p. 123–38.
  71. Fidler F, Thomason N, Cumming G, Finch S, Leeman J. Editors can lead researchers to confidence intervals, but can’t make them think. Psychological Science. 2004;15(2):119–26.
  72. Finch S, Cumming G, Williams J, Palmer L, Griffith E, Alders C, Anderson J, Goodman O. Reform of statistical inference in psychology: The case of Memory & Cognition. Behavior Research Methods, Instruments, & Computers. 2004;36:312–24.
  73. The Medical Outreach Subteam of the Drug Information Association Bayesian Scientific Working Group, Clark J, Muhlemann N, Natanegara F, Hartley A, Wenkert D, et al. Why are not there more Bayesian clinical trials? Perceived barriers and educational preferences among medical researchers involved in drug development. Therapeutic Innovation & Regulatory Science. 2023;57:417–25.
  74. Campbell H, Gustafson P. The world of research has gone berserk: Modeling the consequences of requiring “greater statistical stringency” for scientific publication. The American Statistician. 2019;73(S1):358–73.
  75. Lehrer J. The truth wears off. The New Yorker. December 13, 2010:52–7. https://sites.ualberta.ca/~ahamann/teaching/renr480/reading/Lehrer_2010_The_truth_wears_off.pdf
  76. Pietschnig J, Siegel M, Eder JSN, Gittler G. Effect declines are systematic, strong, and ubiquitous: A meta-meta-analysis of the decline effect in intelligence research. Frontiers in Psychology. 2019;10:2874. pmid:31920891
  77. Begley CG, Ellis L. Drug development: raise standards for preclinical research. Nature. 2012;483:531–3.
  78. Fisher RA. The arrangement of field experiments. Journal of the Ministry of Agriculture. 1926;33:503–13.
  79. Robinson DH, Wainer H. On the past and future of null hypothesis significance testing. The Journal of Wildlife Management. 2002;66(2):263–71.