
Conceived and designed the experiments: JMW. Performed the experiments: JMW MB DM. Analyzed the data: JMW MB DM. Wrote the paper: JMW.

The authors have declared that no competing interests exist.

The widespread reluctance to share published research data is often hypothesized to be due to the authors' fear that reanalysis may expose errors in their work or may produce conclusions that contradict their own. However, these hypotheses have not previously been studied systematically.

We related the reluctance to share research data for reanalysis to 1148 statistically significant results reported in 49 papers published in two major psychology journals. We found the reluctance to share data to be associated with weaker evidence (against the null hypothesis of no effect) and a higher prevalence of apparent errors in the reporting of statistical results. The unwillingness to share data was particularly clear when reporting errors had a bearing on statistical significance.

Our findings on the basis of psychological papers suggest that statistical results are particularly hard to verify when reanalysis is more likely to lead to contrasting conclusions. This highlights the importance of establishing mandatory data archiving policies.

Statistical analyses of research data are quite error prone.

Here we study whether researchers' willingness to share data for reanalysis is associated with the strength of the evidence (defined as the statistical evidence against the null hypothesis of no effect) and the quality of the reporting of statistical results (defined in terms of the prevalence of inconsistencies in reported statistical results). To this end, we followed up on Wicherts et al.'s requests for data.

In the summer of 2005, Wicherts and colleagues

We extracted from the papers all the t, F, and χ^{2} test statistics, together with their accompanying degrees of freedom (DFs).

Five undergraduates, who were unaware of which papers' data had been shared, also independently retrieved a total of 495 statistics and DFs. We compared these 495 statistics to ours and determined that the accuracy rate in our own data was 99.4%. The three minor errors in our data retrieval were corrected but proved trivial.

Inconsistencies between reported p-values (or ranges) and p-values recalculated from the retrieved statistics were detected automatically in Excel as follows. The recomputed p-value was first rounded to the same number of digits as was used in the reported p-value (or range). Subsequently, an IF-statement automatically checked for consistency. Next, we verified by hand that reporting errors were not due to errors in our extraction (none were found) and determined whether they could be attributed to rounding. For example, a test result such as “t(15) = 2.3; p = 0.034” could have arisen from a test statistic ranging from 2.25 to 2.35. Consequently, the correct p-value could range from .033 to .040, and so the reported value was not seen as inconsistent, although the recomputed p-value is .0362. In the analyses of the p-value distributions, we used the nearest next decimal that attained consistency for these correctly rounded cases (i.e., 2.34 in the example), but used the p-value based on the reported test statistic in all other cases. We checked whether over-reported p-values had been corrected upwards via procedures like Bonferroni's or Huynh-Feldt's, but did not use these corrections in analyzing the p-value distributions. Because some of the inconsistencies may have arisen from the use of one-sided testing, we additionally searched the text for explicit mentions of one-sided tests. In one instance, an F-test result was reported explicitly as a one-sided test, but because this result was equivalent to a one-sided t-test we did not consider it erroneous (as suggested by an independent reviewer). As a final check, the three authors independently verified all 49 inconsistencies against the papers. All documented errors are available upon request.
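The consistency check described above can be sketched in a few lines. The code below is an illustration, not the authors' Excel procedure (function names are ours; scipy is assumed): it recomputes a two-tailed p-value from a reported t statistic and allows for rounding in the reported statistic, as in the t(15) = 2.3 example.

```python
from scipy import stats

def recomputed_p(t_stat, df):
    # two-tailed p-value for a t statistic
    return 2 * stats.t.sf(abs(t_stat), df)

def consistent(reported_p, t_stat, df, stat_decimals, p_decimals):
    """Check whether a reported p-value is consistent with the reported
    (rounded) test statistic: the true statistic may lie anywhere within
    half a unit of the statistic's last reported digit."""
    half = 0.5 * 10 ** -stat_decimals
    p_low = recomputed_p(t_stat + half, df)    # larger |t| -> smaller p
    p_high = recomputed_p(t_stat - half, df)
    # compare at the precision of the reported p-value
    return round(p_low, p_decimals) <= reported_p <= round(p_high, p_decimals)

# "t(15) = 2.3; p = .034": the recomputed p is about .0362, but the
# admissible range is roughly .033 to .040, so .034 is not flagged
print(round(recomputed_p(2.3, 15), 4))
print(consistent(0.034, 2.3, 15, stat_decimals=1, p_decimals=3))
```

A reported “p = .02” for the same statistic would fall outside the admissible range and be flagged as inconsistent.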

The use of this method previously revealed quite high error rates in the reporting of p-values in published psychology papers. Known causes of such inconsistencies include the erroneous halving of p-values from tests that are already one-sided (the χ^{2} test is already a one-sided test) and the confusion of = with < (e.g., reporting p = .05 when p<.05 is meant).

This study has been approved by the Ethics Committee of the Psychology Department of the University of Amsterdam. In light of the purpose of our study, we could not ask the corresponding authors for their informed consent. The Ethics Committee exempted the use of informed consent because all corresponding authors had signed APA publication forms related to data sharing, and in light of Article 8.05 of the Ethical Principles of the APA. The documented errors are based on publicly available papers and so are considered archival material. The sharing or non-sharing of data is considered an unobtrusive observation of the professional behavior of the corresponding authors that should not create distress on their part, provided that anonymity is assured. To protect the identity of the corresponding authors, we are not allowed to make public who did or did not share data with Wicherts et al. However, this information is available upon request to allow others to verify our results through reanalysis. The problems that we highlight are general, and our results should not be used to question the academic integrity of individual researchers. The analyses we report here were all conducted independently by at least two of us on the basis of the data that all of us have in our possession.

Of the 49 corresponding authors, 21 (42.9%) had shared some data with Wicherts et al. Thirteen corresponding authors (26.5%) failed to respond to the request or to either of the two reminders. Three corresponding authors (6.1%) refused to share data, either because the data were lost or because they lacked the time to retrieve the data and write a codebook. Twelve corresponding authors (24.5%) promised to share data at a later date but had not done so six years later (we did not follow up on these promises). These authors commonly indicated that the data were not readily available or that they first needed to write a codebook.

The 49 papers contained a total of 1148 test statistics that were presented as significant at p<.05. For each paper we counted all misreporting errors, larger misreporting errors at the 2^{nd} decimal, and misreporting errors that had a bearing on statistical significance.

Distribution of the number of errors in the reporting of p-values for 28 papers from which the data were not shared (left column) and 21 from which the data were shared (right column) for all misreporting errors (upper row), larger misreporting errors at the 2^{nd} decimal (middle row), and misreporting errors that concerned statistical significance (p<.05; bottom row).

Journal | DOI | Pages | No. of stats. | Mean of ps | Median of ps | Reporting errors (all) | Reporting errors (2nd dec.) | Reporting errors (around .05) |

jep∶lmc | 10.1037/0278–7393.30.5.947 | 947–959 | 7 | 0.006636 | 0.00295 | 0 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.5.969 | 969–987 | 13 | 0.027302 | 0.02936 | 0 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.5.988 | 988–1001 | 33 | 0.010325 | 0.00482 | 3 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.5.1002 | 1002–1011 | 25 | 0.004257 | 0.00001 | 1 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.5.1012 | 1012–1025 | 83 | 0.003054 | 0.00000 | 0 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.5.1026 | 1026–1044 | 30 | 0.007286 | 0.00189 | 0 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.5.1045 | 1045–1064 | 19 | 0.005587 | 0.00073 | 0 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.5.1065 | 1065–1081 | 22 | 0.001672 | 0.00009 | 3 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.5.1082 | 1082–1092 | 9 | 0.001089 | 0.00010 | 0 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.5.1093 | 1093–1105 | 21 | 0.011132 | 0.00115 | 1 | 1 | 0 |

jep∶lmc | 10.1037/0278–7393.30.5.1106 | 1106–1118 | 16 | 0.002213 | 0.00001 | 2 | 2 | 1 |

jep∶lmc | 10.1037/0278–7393.30.5.1119 | 1119–1130 | 10 | 0.007128 | 0.00095 | 0 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.5.1131 | 1131–1142 | 21 | 0.003256 | 0.00098 | 0 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.6.1147 | 1147–1166 | 8 | 0.008461 | 0.00036 | 0 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.6.1167 | 1167–1175 | 8 | 0.011841 | 0.00231 | 0 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.6.1176 | 1176–1195 | 32 | 0.005418 | 0.00006 | 1 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.6.1196 | 1196–1210 | 37 | 0.004050 | 0.00000 | 1 | 1 | 0 |

jep∶lmc | 10.1037/0278–7393.30.6.1211 | 1211–1218 | 11 | 0.019460 | 0.01967 | 0 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.6.1219 | 1219–1234 | 39 | 0.016008 | 0.01084 | 7 | 6 | 1 |

jep∶lmc | 10.1037/0278–7393.30.6.1235 | 1235–1251 | 23 | 0.004993 | 0.00096 | 1 | 1 | 0 |

jep∶lmc | 10.1037/0278–7393.30.6.1252 | 1252–1270 | 46 | 0.010496 | 0.00058 | 0 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.6.1271 | 1271–1278 | 20 | 0.002645 | 0.00001 | 1 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.6.1290 | 1290–1301 | 35 | 0.013469 | 0.00475 | 0 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.6.1302 | 1302–1321 | 30 | 0.013727 | 0.00680 | 0 | 0 | 0 |

jep∶lmc | 10.1037/0278–7393.30.6.1322 | 1322–1337 | 37 | 0.006148 | 0.00094 | 0 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.5.557 | 557–572 | 33 | 0.016946 | 0.01104 | 1 | 1 | 0 |

jpsp | 10.1037/0022–3514.87.5.573 | 573–585 | 15 | 0.011696 | 0.00597 | 1 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.5.586 | 586–598 | 21 | 0.019989 | 0.01519 | 4 | 4 | 3 |

jpsp | 10.1037/0022–3514.87.5.599 | 599–614 | 24 | 0.009036 | 0.00263 | 0 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.5.615 | 615–630 | 27 | 0.003605 | 0.00000 | 3 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.5.631 | 631–648 | 6 | 0.008074 | 0.00385 | 0 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.5.649 | 649–664 | 16 | 0.012216 | 0.00510 | 4 | 4 | 0 |

jpsp | 10.1037/0022–3514.87.5.665 | 665–683 | 23 | 0.016715 | 0.00179 | 2 | 1 | 1 |

jpsp | 10.1037/0022–3514.87.6.733 | 733–749 | 24 | 0.023442 | 0.02068 | 2 | 2 | 2 |

jpsp | 10.1037/0022–3514.87.6.750 | 750–762 | 5 | 0.000002 | 0.00000 | 0 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.6.763 | 763–778 | 29 | 0.007420 | 0.00005 | 1 | 1 | 0 |

jpsp | 10.1037/0022–3514.87.6.779 | 779–795 | 9 | 0.025925 | 0.03231 | 0 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.6.796 | 796–816 | 15 | 0.006438 | 0.00072 | 0 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.6.817 | 817–831 | 20 | 0.007695 | 0.00011 | 0 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.6.832 | 832–844 | 8 | 0.021422 | 0.02079 | 4 | 4 | 1 |

jpsp | 10.1037/0022–3514.87.6.845 | 845–859 | 48 | 0.009394 | 0.00380 | 2 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.6.860 | 860–875 | 28 | 0.019047 | 0.01104 | 0 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.6.876 | 876–893 | 27 | 0.011934 | 0.00598 | 1 | 1 | 1 |

jpsp | 10.1037/0022–3514.87.6.894 | 894–912 | 8 | 0.009142 | 0.00092 | 0 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.6.913 | 913–925 | 7 | 0.018208 | 0.00783 | 0 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.6.926 | 926–939 | 9 | 0.011442 | 0.01224 | 0 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.6.940 | 940–956 | 36 | 0.009620 | 0.00314 | 2 | 2 | 0 |

jpsp | 10.1037/0022–3514.87.6.957 | 957–973 | 45 | 0.006310 | 0.00020 | 0 | 0 | 0 |

jpsp | 10.1037/0022–3514.87.6.974 | 974–990 | 30 | 0.018801 | 0.01527 | 1 | 1 | 0 |

correlational design;

mixed correlational/experimental design; the remaining papers involve experimental designs.

Predictor | Parameter (SE) | Wald χ^{2} (DF = 1) | p |

All reporting errors (range: 0–7) | |||

(Intercept) | −2.76 (1.30) | 4.53 | .033 |

Data shared (1) or not (0) | −0.83 (0.38) | 4.84 | .028 |

Square root (Average of p-values) | 4.39 (6.13) | 0.51 | .473 |

Log (No. of test statistics) | 0.85 (0.41) | 4.19 | .041 |

Neg.Binomial parameter | 0.83 (0.46) | ||

Large reporting errors at the second decimal (range: 0–6) | |||

(Intercept) | −4.10 (1.78) | 5.30 | .021 |

Data shared (1) or not (0) | −1.20 (0.52) | 5.39 | .020 |

Square root (Average of p-values) | 17.17 (9.42) | 3.32 | .069 |

Log (No. of test statistics) | 0.71 (0.45) | 2.53 | .112 |

Neg.Binomial parameter | 1.41 (0.84) |
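The negative binomial regressions summarized above can be reproduced in outline by maximizing the NB2 log-likelihood directly. The sketch below is illustrative only: it uses synthetic data (not the study's data; the variable names and the simulated effect size are ours) and regresses per-paper error counts on a data-sharing indicator with a log link and an estimated dispersion parameter.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def neg_loglik(params, X, y):
    """Negative NB2 log-likelihood: y ~ NegBin(mu, alpha) with
    mu = exp(X @ beta) and dispersion alpha = exp(params[-1])."""
    beta, alpha = params[:-1], np.exp(params[-1])
    mu = np.exp(np.clip(X @ beta, -20, 20))   # clip to avoid overflow
    size = 1.0 / alpha                         # NB "size" parameterization
    ll = (gammaln(y + size) - gammaln(size) - gammaln(y + 1)
          + size * np.log(size / (size + mu))
          + y * np.log(mu / (size + mu)))
    return -ll.sum()

def fit_nb(X, y):
    # maximize the likelihood over (beta, log alpha)
    start = np.zeros(X.shape[1] + 1)
    res = minimize(neg_loglik, start, args=(X, y), method="BFGS")
    return res.x

# synthetic example: papers whose data were shared get fewer errors
rng = np.random.default_rng(42)
shared = rng.integers(0, 2, 200)
mu = np.exp(0.5 - 1.0 * shared)               # true sharing effect: -1.0
size = 2.0                                     # true dispersion alpha = 0.5
y = rng.negative_binomial(size, size / (size + mu))
X = np.column_stack([np.ones(200), shared])
beta = fit_nb(X, y)
print(beta[:2])   # intercept and (negative) sharing coefficient
```

A negative coefficient on the sharing indicator, as in the table, corresponds to fewer expected reporting errors in papers from which data were shared.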

We came across a total of ten cases (from seven papers) in which the recomputed p-value was above .05, whilst the result was presented as being significant.

P-values from NHST are traditionally interpreted as the strength of the evidence against the null hypothesis of no effect.

Distribution of p-values reported as being significant (at p<.05) in 21 papers from which data were shared (N = 561; in black) and in 28 papers from which data were not shared (N = 587; in grey), showing that p-values often lie closer to the typical boundary of significance when data are not shared for reanalysis. Frequencies of reporting errors (as given above the bars) reflect higher error prevalence in papers from which no data were shared.

We also conducted a bootstrap analysis to test this difference between shared and non-shared papers on the basis of the individual p-values as clustered within papers. In this analysis, we determined, on the basis of 100,000 replications, the null distribution of Wilcoxon's W test for the 1138 statistically dependent p-values that were smaller than .05. To this end, we randomly assigned each paper (and hence all p-values in it) to either the shared or the non-shared category (using the base rate of p = 21/49) and repeated this process 100,000 times to obtain an empirical null distribution of W for our data. The W statistic computed from the actual difference between shared and non-shared papers corresponded to a p-value of .0298 (two-tailed) in this bootstrapped null distribution. Hence, the analyses of individual p-values corroborated that p-values were significantly higher in papers from which data were not shared.
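The procedure above amounts to a paper-level reassignment of the sharing labels. A minimal sketch, with synthetic clusters rather than the study's data and scipy's rank-sum statistic standing in for Wilcoxon's W:

```python
import numpy as np
from scipy.stats import ranksums

def cluster_null_p(shared, not_shared, n_rep=5000, seed=0):
    """Empirical two-tailed p for the rank-sum statistic, with papers
    (each a list/array of p-values) randomly reassigned to groups."""
    rng = np.random.default_rng(seed)
    papers = list(shared) + list(not_shared)
    p_share = len(shared) / len(papers)
    obs = ranksums(np.concatenate(shared), np.concatenate(not_shared)).statistic
    null = []
    while len(null) < n_rep:
        mask = rng.random(len(papers)) < p_share
        if mask.all() or not mask.any():      # degenerate draw: redo
            continue
        g1 = np.concatenate([p for p, m in zip(papers, mask) if m])
        g2 = np.concatenate([p for p, m in zip(papers, mask) if not m])
        null.append(ranksums(g1, g2).statistic)
    null = np.asarray(null)
    return np.mean(np.abs(null) >= abs(obs))

# toy example: p-values cluster lower in the "shared" papers
rng = np.random.default_rng(1)
shared = [rng.uniform(0.0, 0.01, 20) for _ in range(10)]
not_shared = [rng.uniform(0.02, 0.05, 20) for _ in range(10)]
print(cluster_null_p(shared, not_shared, n_rep=2000))
```

Resampling whole papers rather than individual p-values respects the statistical dependence of results reported within the same paper.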

In this sample of psychology papers, the authors' reluctance to share data was associated with more errors in the reporting of statistical results and with relatively weaker evidence (against the null hypothesis). The documented errors are arguably the tip of the iceberg of potential errors and biases in statistical analyses and in the reporting of statistical results. It is rather disconcerting that roughly 50% of published papers in psychology contain reporting errors.

The association between reporting errors and the sharing of data after results are published may also reflect differences in the rigor with which researchers manage their data. Rigorously working researchers may simply commit fewer reporting errors.

Regardless of the underlying processes, the results for the current papers imply that published statistical results are hardest to verify precisely when they are contentious. We focused here on NHST within two psychology journals, and so it is desirable to replicate our results in other fields and in the context of alternative statistical approaches. However, it is likely that similar problems play a role in the widespread reluctance to share data in other scientific fields.

More stringent policies concerning data archiving will not only facilitate the verification of analyses and the correction of the scientific record, but will also improve the quality of the reporting of statistical results. Changing policies requires better educational training in data management and data archiving, which is currently suboptimal in many fields.


We thank our colleagues at Psychological Methods for comments on an earlier draft, and Guusje Havenaar, Isabelle Hofmans, Vera Kruse, Femke Paling and Rianda van Veen for assistance in data collection.