The Extent and Consequences of P-Hacking in Science

A focus on novel, confirmatory, and statistically significant results leads to substantial bias in the scientific literature. One type of bias, known as “p-hacking,” occurs when researchers collect or select data or statistical analyses until nonsignificant results become significant. Here, we use text-mining to demonstrate that p-hacking is widespread throughout science. We then illustrate how one can test for p-hacking when performing a meta-analysis and show that, while p-hacking is probably common, its effect seems to be weak relative to the real effect sizes being measured. This result suggests that p-hacking probably does not drastically alter scientific consensuses drawn from meta-analyses.


Introduction
There is increasing concern that many published results are false positives [1,2] (but see [3]). Many argue that current scientific practices create strong incentives to publish statistically significant (i.e., "positive") results, and there is good evidence that journals, especially prestigious ones with higher impact factors, disproportionately publish statistically significant results [4][5][6][7][8][9][10]. Employers and funders often count papers and weigh them by the journal's impact factor to assess a researcher's performance [11]. In combination, these factors create incentives for researchers to selectively pursue and selectively attempt to publish statistically significant research findings.
There are two widely recognized types of researcher-driven publication bias: selection (also known as the "file drawer effect", where studies with nonsignificant results have lower publication rates [7]) and inflation [12]. Inflation bias, also known as "p-hacking" or "selective reporting," is the misreporting of true effect sizes in published studies (Box 1). It occurs when researchers try out several statistical analyses and/or data eligibility specifications and then selectively report those that produce significant results [12][13][14][15]. Common practices that lead to p-hacking include: conducting analyses midway through experiments to decide whether to continue collecting data [15,16]; recording many response variables and deciding which to report postanalysis [16,17]; deciding whether to include or drop outliers postanalysis [16]; excluding, combining, or splitting treatment groups postanalysis [2]; including or excluding covariates postanalysis [14]; and stopping data exploration if an analysis yields a significant p-value [18,19].
If published data are biased, data synthesis might lead to flawed conclusions. Meta-analysis is a set of statistical methods that combine studies on the same question to estimate the true effect size [33]. Meta-analyses are now the "gold standard" for synthesizing the evidence for an effect of a treatment or the existence of a relationship, and for combining effect size estimates across studies to give an overall estimate. Meta-analyses guide the application of medical treatments and policy decisions, and influence future research directions [34]. However, meta-analyses are compromised if the studies being synthesized do not reflect the true distribution of effect sizes [5,35,36,37].
Quantifying p-hacking is important because publication of false positives hinders scientific progress. When false positive results enter the literature they can be very persistent. In many fields, there is little incentive to replicate research [38]. Even when research is replicated, early positive studies often receive more attention than later negative ones. In addition, false positives can inspire investment in fruitless research programs, and even discredit entire fields [14,16].

Box 1. The History of P
Fisher [20] introduced null hypothesis significance testing (NHST) to objectively separate interesting findings from background noise [21]. NHST is the most widely used data analysis method in most scientific disciplines [22,23]. The null hypothesis is typically a statement of no relationship between variables or no effect of an experimental manipulation. With NHST, one computes the probability (i.e., p) of finding an effect at least as extreme as the observed finding if the null hypothesis is true [24,25].
The NHST approach uses an arbitrary cutoff value (usually 0.05). Findings with smaller p-values are described as "statistically significant" ("positive" findings), and the remainder as "nonsignificant" ("negative" findings). This arbitrary cutoff has led to the scientifically dubious practice of regarding "significant" findings as more valuable, reliable, and reproducible [24], thereby incentivizing various kinds of research bias.
Before computers, test statistics (e.g., t and F) were routinely calculated by hand and the associated p-value was looked up in statistical tables. Here, p-values were given for a limited set of values (e.g., 0.001, 0.01, 0.02, and 0.05) [26]. Researchers then reported p-values as the lowest threshold consistent with the test statistic (e.g., p < 0.05 or p < 0.01). With modern statistical software this practice is unnecessary, as precise p-values are now provided, but it is still commonplace. Previous research has shown that strict adherence to p-value thresholds can bias how research is reported, even within the region of significance [27].
The p-value is easily misinterpreted. For example, it is often equated with the strength of a relationship, but a tiny effect size can have very low p-values with a large enough sample size. Similarly, a low p-value does not mean that a finding is of major clinical or biological interest [28]. Many researchers have advocated abolishing NHST (e.g., [29,30]). However, others note that many of the problems with publication bias reoccur with other approaches, such as reporting effect sizes and their confidence intervals [31] or Bayesian credible intervals [32]. Publication biases are not a problem with p-values per se. They simply reflect the incentives to report strong (i.e., significant) effects.
Despite the potential importance of p-hacking, the consequences for formal and informal data synthesis are unknown. Here, we address both issues using p-curves (see Box 2). First, we used text-mining to obtain reported p-values in papers drawn from a broad range of scientific disciplines. We then looked for evidence of p-hacking based on the shape of the p-curves. Second, we produced p-curves from primary data used in published meta-analyses. This allowed us to test the evidence for p-hacking when looking at specific hypotheses which researchers have clearly identified as being of general interest (i.e., that warrant a meta-analysis).

Box 2. The P-Curve: What Can It Tell Us?
A p-curve is the distribution of p-values for a set of studies. P-curves can be a helpful tool to assess the reliability of published research. Here, we outline how they have been used to assess the literature.
Evidential value. One can examine whether a set of findings contains evidential value by examining the distribution of p-values, particularly those between 0 and 0.05. "Evidential value" refers to whether or not the published evidence for a specific hypothesis suggests that the effect size is nonzero.
When the effect size for a studied phenomenon is zero, every p-value is equally likely to be observed. The expected distribution of p-values under the null hypothesis is uniform (Black line, Fig. 1A and Fig. 2A), such that p<0.05 will occur 5% of the time, p<0.04 will occur 4% of the time, and so on. On the other hand, when the true effect size is nonzero, the expected distribution of p-values is exponential with a right skew [39][40][41][42] (Black line, Fig. 1B and Fig. 2B). When the true effect is strong, researchers are more likely to obtain very low p-values (e.g., p<0.001) than moderately low p-values (e.g., 0.01), and less likely still to obtain nonsignificant p-values (p > 0.05) [41]. So, as the true effect size increases the p-curve is more right skewed [41].
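The two expected shapes described above can be checked with a small simulation. The following is a minimal sketch, not part of the original analysis: it assumes each hypothetical study estimates a mean from 30 observations with known standard deviation 1 and applies a two-sided z-test. The function names, sample size, and effect size of 0.5 are all illustrative assumptions.

```python
import random
from statistics import NormalDist

def simulate_p_values(true_effect, n_per_study=30, n_studies=20000, seed=1):
    """Two-sided z-test p-values for simulated studies (known SD = 1)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = 1 / n_per_study ** 0.5  # standard error of each study's sample mean
    p_values = []
    for _ in range(n_studies):
        # The sample mean of one study is Normal(true_effect, se).
        sample_mean = rng.gauss(true_effect, se)
        z = sample_mean / se
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values

def share_below(p_values, cutoff):
    """Proportion of p-values smaller than the cutoff."""
    return sum(p < cutoff for p in p_values) / len(p_values)

null_p = simulate_p_values(true_effect=0.0)    # no true effect
effect_p = simulate_p_values(true_effect=0.5)  # assumed nonzero effect

# Under the null, roughly 5% of p-values fall below 0.05 and 4% below 0.04.
print(share_below(null_p, 0.05), share_below(null_p, 0.04))

# With a true effect, the bin below 0.025 holds far more p-values
# than the bin from 0.025 to 0.05 (right skew).
lower_bin = share_below(effect_p, 0.025)
upper_bin = share_below(effect_p, 0.05) - lower_bin
print(lower_bin, upper_bin)
```

Increasing `true_effect` or `n_per_study` in this sketch concentrates more of the significant p-values near zero, which is the stronger right skew the text describes.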
Publication bias. Several studies have plotted the distribution of p-values or related test statistics (i.e., Z or t) around the main significance threshold of p = 0.05 (often in the range of 0.01 to 0.1). A notable drop in p-values above 0.05 (or for Z values, 1.96) (Red line, Fig. 1A and Fig. 1B) is interpreted as evidence for publication bias (e.g., [40,43,44,45]). While a discontinuity in the distribution of p-values around 0.05 is indicative of publication bias, it does not distinguish between selective publication bias and p-hacking (see Box 1).
P-hacking. The p-curve can, however, be used to identify p-hacking, by only considering significant findings [14]. If researchers p-hack and turn a truly nonsignificant result into a significant one, then the p-curve's shape will be altered close to the perceived significance threshold (typically p = 0.05). Consequently, a p-hacked p-curve will have an overabundance of p-values just below 0.05 [12,40,41]. If researchers p-hack when there is no true effect, the p-curve will shift from being flat to left skewed (Fig. 2A). If, however, researchers p-hack when there is a true effect, the p-curve will be exponential with right skew but there will be an overrepresentation of p-values in the tail of the distribution just below 0.05 (Fig. 2B). Both p-hacking and selective publication bias predict a discontinuity in the p-curve around 0.05, but only p-hacking predicts an overabundance of p-values just below 0.05 [12]. The exact shape of the p-curve will, however, depend on both the true effect (i.e., the p-curve before p-hacking) and the intensity of p-hacking [41].
Assessing p-curves for p-hacking and evidential value. Similar to previous studies (e.g., [14,43]) we employ binomial tests to look for evidence of evidential value and p-hacking in both our text-mined and meta-analysis datasets. We tested for evidential value using a two-tailed sign test, in which we compared the number of p-values falling in the bin 0 ≤ p < 0.025 to the number in the bin 0.025 ≤ p < 0.05. Under the null hypothesis of no evidential value, the expected number of p-values in each of these bins is equal. Significantly more p-values in the lower bin is consistent with evidential value (i.e., right skewed p-curve), and significantly more p-values in the upper bin is consistent with severe p-hacking. This test is a slightly modified version of a test proposed by Simonsohn et al. [41], who suggest using two separate one-tailed sign tests for the same purpose.
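The two-tailed sign test described above amounts to an exact binomial test of the two bin counts against a proportion of 0.5. The sketch below is an illustration in Python rather than the authors' R code; the function names and the example p-values are hypothetical.

```python
from math import comb

def binomial_two_sided_p(k, n):
    """Exact two-sided binomial p-value for H0: proportion = 0.5."""
    # With a proportion of 0.5 the distribution is symmetric,
    # so the two-sided p-value is twice the smaller tail (capped at 1).
    tail = min(k, n - k)
    one_tail = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)

def evidential_value_test(p_values):
    """Compare counts in the bins 0 <= p < 0.025 and 0.025 <= p < 0.05."""
    lower = sum(0 <= p < 0.025 for p in p_values)
    upper = sum(0.025 <= p < 0.05 for p in p_values)
    return lower, upper, binomial_two_sided_p(upper, lower + upper)

# Hypothetical p-values; the nonsignificant 0.2 falls outside both bins.
example_p = [0.001, 0.003, 0.004, 0.01, 0.012, 0.02,
             0.021, 0.024, 0.03, 0.049, 0.2]
lower_count, upper_count, sign_test_p = evidential_value_test(example_p)
print(lower_count, upper_count, sign_test_p)
```

A small p-value with more counts in the lower bin would be read as evidential value; a small p-value with more counts in the upper bin would be read as severe p-hacking.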
The two-tailed sign test with a p = 0.025 threshold (above) and the tests proposed by Simonsohn et al. [41] can detect severe p-hacking, but are insensitive to more modest (and arguably more realistic) levels of p-hacking. This is true especially if the average true effect size is strong, as the right skew introduced to the p-curve will mask the left skew caused by p-hacking. A more sensitive approach to detect p-hacking is to look for an increase in the relative frequency of p-values just below 0.05, where we expect the signal of p-hacking to be strongest. Under the null hypothesis of no p-hacking, we expect either that the distribution of p-values is uniform close to 0.05 (if the true effect sizes are zero), or right skewed (i.e., if at least some effect sizes are nonzero). However, p-hacking introduces additional p-values close to 0.05, producing a left skew. Thus, a simple, and conservative, test for p-hacking involves testing the null hypothesis that the p-values just below 0.05 are either uniformly distributed or right skewed. We used a one-tailed sign test to ask whether the number of p-values in the bin that abuts 0.05 is greater than that in the adjacent lower bin. This test becomes more likely to detect p-hacking if one uses smaller bins, since p-values are right skewed when the average effect size is positive (masking p-hacking), but in practice, using smaller bins will reduce the sample size (and thus power) of the test. We selected a bin width of 0.005, with the lower bin specified as 0.04 < p < 0.045 and the upper bin as 0.045 < p < 0.05. We chose p < 0.05 as the cutoff for our upper bin (following [3]), rather than p = 0.05 (see [46]) because we suspect that many authors do not regard p = 0.05 as significant. As a measure of the strength of p-hacking, we present the proportion of p-values in the upper bin and the associated 95% confidence intervals (calculated following Clopper and Pearson [47] using the binom.test function in R).
We ran the above analyses separately for each discipline and meta-analysis dataset. In addition, we tested for overall evidential value (two-tailed test) and signs of p-hacking (one-tailed test) in the two main datasets (the text-mined p-values and the meta-analysis datasets, respectively). To do this, we used the proportion of p-values occurring in the upper bin for each discipline or meta-analysis (depending on the dataset being analysed) and ran a binomial generalised linear model to test whether the observed intercept differed from 0.5 (i.e., equal number of cases in the two bins). This approach is equivalent to a meta-analysis testing for a significant trend when combining the individual disciplines or questions because each is weighted by its sample size. The R code we used is deposited in Dryad [48].
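The one-tailed sign test and the Clopper-Pearson interval described above can be sketched as follows. This is an illustrative Python version, not the R `binom.test` call the authors used; the helper names and example p-values are hypothetical. The interval here is the two-sided exact interval; a one-sided lower limit would place all of alpha in one tail instead of alpha / 2.

```python
from math import comb

def binom_cdf(k, n, prob):
    """P(X <= k) for X ~ Binomial(n, prob)."""
    return sum(comb(n, i) * prob ** i * (1 - prob) ** (n - i)
               for i in range(k + 1))

def p_hacking_sign_test(p_values):
    """One-tailed sign test: is the bin abutting 0.05 fuller than the one below?"""
    lower = sum(0.04 < p < 0.045 for p in p_values)
    upper = sum(0.045 < p < 0.05 for p in p_values)
    n = lower + upper
    # P(X >= upper) when both bins are equally likely (H0 proportion = 0.5).
    return lower, upper, 1.0 - binom_cdf(upper - 1, n, 0.5)

def _bisect(func, target, increasing, steps=100):
    """Find q in [0, 1] with func(q) = target, for a monotonic func."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, conf=0.95):
    """Exact two-sided confidence interval for a binomial proportion k / n."""
    alpha = 1 - conf
    # Lower limit: the proportion at which P(X >= k) equals alpha / 2.
    low = 0.0 if k == 0 else _bisect(
        lambda q: 1 - binom_cdf(k - 1, n, q), alpha / 2, True)
    # Upper limit: the proportion at which P(X <= k) equals alpha / 2.
    high = 1.0 if k == n else _bisect(
        lambda q: binom_cdf(k, n, q), alpha / 2, False)
    return low, high

# Hypothetical p-values; 0.03 and 0.2 fall outside both narrow bins.
hypothetical_p = [0.041, 0.042, 0.044, 0.046, 0.047,
                  0.048, 0.049, 0.0495, 0.03, 0.2]
lower_n, upper_n, one_tailed_p = p_hacking_sign_test(hypothetical_p)
interval = clopper_pearson(upper_n, lower_n + upper_n)
print(lower_n, upper_n, one_tailed_p, interval)
```

Because the bins are written with strict inequalities, a p-value of exactly 0.045 or 0.05 is counted in neither bin in this sketch.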
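To illustrate why the pooled test weights each discipline or question by its sample size, here is a minimal sketch of an intercept-only binomial model. It is not the authors' deposited R code. It assumes a logit link and a Wald test, under which the fitted intercept is the logit of the pooled proportion, and it does not handle pooled proportions of exactly 0 or 1. The function name and the counts are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

def pooled_binomial_intercept(upper_counts, totals, one_tailed=False):
    """Intercept-only binomial model (logit link) tested against a proportion of 0.5."""
    k = sum(upper_counts)  # p-values in the upper bin, summed over datasets
    n = sum(totals)        # p-values in both bins, summed over datasets
    # The pooled proportion equals the sample-size-weighted mean
    # of the per-dataset proportions.
    p_hat = k / n
    intercept = log(p_hat / (1 - p_hat))  # 0 on the logit scale means 0.5
    std_error = 1 / sqrt(n * p_hat * (1 - p_hat))
    z = intercept / std_error
    std_normal = NormalDist()
    if one_tailed:
        p_value = 1 - std_normal.cdf(z)             # H1: proportion > 0.5
    else:
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # H1: proportion != 0.5
    return p_hat, z, p_value

# Hypothetical counts for three datasets.
upper_bin_counts = [30, 12, 55]
bin_totals = [50, 25, 100]
pooled, z_score, pooled_p = pooled_binomial_intercept(
    upper_bin_counts, bin_totals, one_tailed=True)
print(pooled, z_score, pooled_p)
```

Large datasets dominate the pooled proportion in this formulation, which is the sense in which the approach resembles a meta-analysis across disciplines or questions.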

Assessing the Extent of P-Hacking in the Scientific Literature Using Text-Mining
We used text-mining to search for p-values in all Open Access papers available in the PubMed database (see S1 Text). To quantify "evidential value" (i.e., if there is evidence that the true effect size is nonzero) and p-hacking, we constructed p-curves from the p-values we obtained (see Box 2). We present separate tests of evidential value and p-hacking for p-values extracted from the Results section, and for p-values extracted from the Abstracts. Researchers have identified weaknesses in the use of text-mined data to look for publication bias (e.g., [46]). Here, we adopted several measures to counter these weaknesses (see S1 Text).
Pooling p-values across all disciplines, there was strong evidence for "evidential value"; that is, researchers appear to be predominantly studying phenomena with nonzero effect sizes, as shown by the strong right skew of the p-curve for p-values found in both the Results (binomial glm: estimated proportion of p-values in the upper bin (0.025 ≤ p < 0.05) (lower CI, upper CI) = 0.257 (0.254, 0.259), p < 0.001, n = 14 disciplines) and the Abstracts (binomial glm: estimated proportion of p-values in the upper bin (0.025 ≤ p < 0.05) (lower CI, upper CI) = 0.262 (0.257, 0.267), p < 0.001, n = 10 disciplines). We found significant evidential value in every discipline represented in our text-mining data, irrespective of whether we tested the p-values from the Results or Abstracts (Table 1; Table 2). Based on the net trend across all disciplines, however, there was also strong evidence for p-hacking in both the Results (binomial glm: estimated proportion of p-values in the upper bin (0.045 < p < 0.05) (lower CI) = 0.546 (0.536), p < 0.001, n = 14 disciplines) and the Abstracts (binomial glm: estimated proportion of p-values in the upper bin (0.045 < p < 0.05) (lower CI) = 0.537 (0.518), p < 0.001, n = 10 disciplines).
In most disciplines, there were more p-values in the upper than the lower bin; and when we look at the p-values text-mined from Results sections in every discipline where we had good statistical power (i.e., Health and Medical Sciences, Biological Sciences, and Multidisciplinary), this difference was statistically significant (Table 1, Fig. 3A). When looking at p-values text-mined from Abstracts, despite the significant general trend, only the Multidisciplinary and Information and Computer Science categories were significant (Table 2, Fig. 3B).
Our text-mining suggests that p-hacking is widespread. Other studies that have inspected p-curves for far smaller sets of journals have also found evidence of p-hacking [12,40,45]. By contrast, Jager and Leek [3] found no evidence of p-hacking in a text-mining study of five medical journals. However, they were criticized for using p-values from Abstracts [46], because reporting p-values in Abstracts is optional, so they are more likely to contain only the strongest results (i.e., smallest p-values). Such a bias would exaggerate evidential value in our analysis, and make it harder to detect p-hacking (e.g., if researchers censor results with p = 0.049 from the Abstract, but not p = 0.041). Even though Abstracts are more likely to contain p-values that relate to primary hypotheses, which are expected to be more strongly p-hacked than p-values from less interesting, ancillary tests [41], lower power and reporting bias may impede detection of p-hacking using p-values obtained from Abstracts. The fact that we find evidence for p-hacking when using p-values from either the Abstracts or the Results sections across all scientific disciplines for which data are available (our overall analysis) supports the conclusion that p-hacking is rife.
Although we present evidence that p-hacking is widespread, there was still a strong right skew in all the p-curves we examined. This is consistent with researchers investigating predictions that lead to refutation of the null hypothesis, implying that the average true effect size studied by life scientists is nonzero. Given recent concerns about the lack of reproducibility of findings (e.g., [49] but see [50]) and the possibility that many published results are false [2], our results are reassuring. It is, of course, important to note that when using text-mining, we are combining many different types of questions to generate our p-curves. Consequently, it remains unclear whether there are some research fields or questions subsumed within the disciplines we considered for which the average effect size of published results is zero (i.e., the p-curve is flat). To examine this, it is important to also look at p-curves for well-defined research questions [41].
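The text-mining step in this section depends on locating reported p-values in running text. The authors' actual procedure is described in S1 Text and is not reproduced here. The following is only a rough sketch of the general idea, using a hypothetical regular expression and a made-up sentence.

```python
import re

# Hypothetical pattern: the letter p, a comparison operator, and a decimal
# number such as 0.032 or .001. Real articles use many more variants.
P_VALUE_PATTERN = re.compile(r"\bp\s*([<>=≤≥])\s*(0?\.\d+)", re.IGNORECASE)

def extract_p_values(text):
    """Return (operator, value) pairs for each reported p-value found."""
    return [(match.group(1), float(match.group(2)))
            for match in P_VALUE_PATTERN.finditer(text)]

sample_text = ("The effect was significant (p = 0.032), but not for "
               "group B (P > 0.05). Mass differed (p<.001).")
found = extract_p_values(sample_text)
print(found)

# Only exactly reported values can be placed in a narrow bin such as
# 0.045 < p < 0.05; a report of "p < 0.05" cannot.
exact_values = [value for operator, value in found if operator == "="]
print(exact_values)
```

Which operators to retain, and how to handle values reported only as thresholds, are analyst choices that this sketch leaves open.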

The Consequences of P-Hacking for Meta-analyses
Meta-analysis is an excellent method for systematically synthesizing the literature and quantifying an effect or relationship by averaging effect sizes from multiple studies after weighting each one by its reliability [33,51]. However, meta-analyses are only as good as the data they use, and a recent study estimated that up to 37% of meta-analyses of clinical trials reporting a significant mean effect size represent false positives [34].
Tests for evidential value and p-hacking can readily be used to detect biases in datasets used in meta-analyses. We encourage researchers conducting meta-analyses to report p-values associated with each effect size (which is not currently standard practice) and then to test for evidential value and p-hacking. For a recent example of this practice, see [52]. To demonstrate this procedure, we obtained p-values from studies subject to meta-analyses by evolutionary biologists studying sexual selection [53][54][55][56][57][58][59][60][61] (see S1 Text).
When conducting our own meta-analysis of all the data used in these meta-analyses, there was clear evidence that researchers have strong evidential value for claims that effect sizes are nonzero (binomial glm: estimated proportion of p-values in the upper bin (0.025 ≤ p < 0.05) (lower CI, upper CI) = 0.202 (0.179, 0.228), p < 0.001, n = 12 datasets). We then examined each dataset separately and found statistically significant evidential value for 9 of the 12 p-curves (Table 3). The three p-curves that did not show evidential value had the three lowest sample sizes, so low statistical power to detect evidential value may explain the lack of significance. Again, it is worth noting that evidential value for well-studied phenomena is not a given (see a real-world example in [62]).
When considering evidence for p-hacking, we found that when we included misreported p-values (those given as p < 0.05 which were actually larger; a total of 16 cases; see S1 Text) there were more p-values in the upper than the lower bin for 7 of 12 p-curves (Table 3). This bias was significant in one dataset (Fig. 4), which was also the one with the largest sample size. However, the evidence for p-hacking disappeared when we excluded misreported p-values from our analyses (Table 3). One could argue that including misreported p-values in the upper bin of our binomial test biases our results toward detecting p-hacking, but reporting nonsignificant results as "p < 0.05" is a component of p-hacking that should not be ignored. Indeed, Leggett et al. [45] also found considerable misreporting of p-values around the 0.05 threshold. They noted that p-values were more likely to be misreported as significant when they were not, rather than the reverse, and that this "error" has become more common in recent years.
More importantly, when misreported p-values were included in our analysis we found significant p-hacking from a meta-analysis of the p-curves of the 12 meta-analyses (binomial glm: estimated proportion of p-values in the upper bin (0.045 < p < 0.05) (lower CI) = 0.615 (0.513), p = 0.033; excluding misreported p-values: 0.489 (0.375), p = 0.443). Although questions subjected to meta-analysis might not be a representative sample of all research questions asked by scientists, our results indicate that studies on questions identified by researchers as important enough to warrant a meta-analysis tend to be p-hacked. Whether this influences the general conclusions of a meta-analysis depends on both the extent of p-hacking and the strength of the true effect. For instance, we found a statistically significant indication of p-hacking in only one of the 12 questions examined in published meta-analyses (Fig. 4). However, this study [56] also showed strong evidential value and p-values in the 0.045-0.05 bin were only a small proportion of published significant p-values. It is therefore unlikely that p-hacking would change the qualitative conclusions made in this meta-analysis, although p-hacking might have inflated the estimated mean effect size. In general, meta-analyses might be robust to inflated effect sizes that result from p-hacking, because: 1) all else being equal, studies that are most susceptible to p-hacking are those with small sample sizes (i.e., because low statistical power means less chance of a significant result), and these are given less weighting in a meta-analysis, 2) at least in some fields (e.g., ecology and evolution), meta-analyses often use data that are not directly related to the primary focus of the original paper. The p-values associated with secondary questions are less likely to be p-hacked.
One way to check how sensitive estimates of effect sizes are to p-hacking would be to randomly remove the appropriate number of studies that contribute to a hump in the p-curve just below 0.05. Alternatively, meta-analysts could estimate effect sizes using p-curves (i.e., using only the significant p-values they find), a method which has been proposed to account for publication biases and to offer a conservative estimate of the true effect when there is p-hacking [62,63]. Development of p-curve methods is ongoing and we look forward to further tests of their ability to correct for the file-drawer effect, p-hacking, and other forms of publication bias given that real world data are likely to violate some of the assumptions in the available simulations of their effectiveness.
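The first sensitivity check mentioned above could be sketched along the following lines. The text does not define the "appropriate number" of studies to remove. This sketch assumes it is the excess of the 0.045 < p < 0.05 bin over the adjacent 0.04 < p < 0.045 bin, and it uses a simple weighted mean as the pooled effect. The study records, field names, and function names are all hypothetical.

```python
import random

def weighted_mean_effect(studies):
    """Pooled effect size: mean of study effects weighted by study weight."""
    total_weight = sum(s["weight"] for s in studies)
    return sum(s["effect"] * s["weight"] for s in studies) / total_weight

def drop_hump_studies(studies, rng):
    """Randomly remove the excess studies in the bin just below 0.05."""
    lower = [s for s in studies if 0.04 < s["p"] < 0.045]
    upper = [s for s in studies if 0.045 < s["p"] < 0.05]
    # Assumption: the hump is the surplus of the upper bin over the lower bin.
    excess = max(0, len(upper) - len(lower))
    removed_ids = {id(s) for s in rng.sample(upper, excess)}
    return [s for s in studies if id(s) not in removed_ids]

def sensitivity_range(studies, n_draws=200, seed=0):
    """Smallest and largest pooled effect over repeated random removals."""
    rng = random.Random(seed)
    estimates = [weighted_mean_effect(drop_hump_studies(studies, rng))
                 for _ in range(n_draws)]
    return min(estimates), max(estimates)

# Hypothetical studies: one p-value in the lower bin, three in the upper bin.
studies = [
    {"effect": 0.30, "weight": 50, "p": 0.001},
    {"effect": 0.25, "weight": 40, "p": 0.012},
    {"effect": 0.20, "weight": 20, "p": 0.042},
    {"effect": 0.45, "weight": 10, "p": 0.046},
    {"effect": 0.50, "weight": 8, "p": 0.048},
    {"effect": 0.40, "weight": 12, "p": 0.049},
]
print(weighted_mean_effect(studies))
print(sensitivity_range(studies))
```

If the range of estimates after removal stays close to the original pooled effect, the qualitative conclusion is unlikely to hinge on the studies in the hump.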

Summary and Conclusions
Our study provides two lines of empirical evidence that p-hacking is widespread in the scientific literature. Our text-mining approach is based on a very large dataset that consists of p-values from different disciplines and questions, while our meta-analysis approach consists of p-values concerning a few specific hypotheses. Both approaches yielded similar results: there is evidential value for claims that the mean effect sizes for key study questions are nonzero (the conclusions researchers are making based on significant study findings), but the estimated mean effect size has probably been inflated by p-hacking. Eliminating p-hacking entirely is unlikely when career advancement is assessed by publication output, and publication decisions are affected by the p-value or other measures of statistical support for relationships. Even so, there are a number of steps that the research community and scientific publishers can take to decrease the occurrence of p-hacking (see Box 3).

Box 3. Recommendations
The key to decreasing p-hacking is better education of researchers. Many practices that lead to p-hacking are still deemed acceptable. John et al. [16] measured the prevalence of questionable research practices in psychology. They asked survey participants if they had ever engaged in a set of questionable research practices and, if so, whether they thought their actions were defensible on a scale of 0-2 (0 = no, 1 = possibly, 2 = yes). Over 50% of participants admitted to "failing to report all of a study's dependent measures" and "deciding whether to collect more data after looking to see whether the results were significant," and these practices received a mean defensibility rating greater than 1.5. This indicates that many researchers p-hack but do not appreciate the extent to which this is a form of scientific misconduct. Amazingly, some animal ethics boards even encourage or mandate the termination of research if a significant result is obtained during the study, which is a particularly egregious form of p-hacking (Anonymous reviewer, personal communication).
What can researchers do?
• Clearly label research as prespecified (i.e., designed to answer a specific question, where details of methods and analyses can be fully reported prior to data collection) or exploratory (i.e., involves exploration of data that looks intriguing, where methods and analyses used are often post hoc [13]), so that readers can treat results with appropriate caution. Results from prespecified studies offer far more convincing evidence than those from exploratory research [2].
• Adhere to common analysis standards [2], measure only response variables that are known (or predicted) to be important, and use sufficient sample sizes.
• Perform data analysis blind wherever possible. This approach makes it difficult to p-hack for specific results.
• Place greater emphasis on the quality of research methods and data collection rather than the significance or novelty of the subsequent findings when reviewing or assessing research. Ideally, methods should be assessed independently of results [13,44].
What can journals do?
• Provide clear and detailed guidelines for the full reporting of data analyses and results. For instance, state that it is necessary to report effect sizes whether small or large, to report all p-values to three decimal places [27,64], to report sample sizes, and, most importantly, to be explicit about the entire analysis process (not just the final tests used to generate reported p-values). This will reduce p-hacking and aid the collection of data for meta-analyses and text-mining studies.
• Encourage and/or provide platforms for method prespecification [13,65]. Although methods and results in publications do not always match their prespecified protocols [5,66], prespecification allows readers to assess the risk of p-hacking and adjust their confidence in the reported outcomes accordingly.
• Encourage and/or provide platforms for open access to raw data. While access to raw data does not prevent p-hacking, it does make researchers more accountable for marginal results and allows readers to reanalyze data to check the robustness of results.

Supporting Information
S1 Text. Details of how text-mined data and data from meta-analyses were collected and analysed. (DOCX)

Acknowledgments
We are grateful to members of the Jennions lab for comments and discussion on various versions of the manuscript.