Most Published Research Findings Are False—But a Little Replication Goes a Long Way

While the authors agree with John Ioannidis that "most research findings are false," here they show that replication of research findings enhances the positive predictive value of research findings being true.


Essay
February 2007 | Volume 4 | Issue 2 | e28

We know that research findings often fail to replicate, most notably in the field of genetic associations [1][2][3]. For example, a survey of 600 positive associations between gene variants and common diseases showed that out of 166 reported associations studied three or more times, only six were replicated consistently [4]. Lack of replication results from a number of factors such as publication bias, selection bias, Type I errors, population stratification (the mixture of individuals from heterogeneous genetic backgrounds), and lack of statistical power [5].
In a recent article in PLoS Medicine, John Ioannidis quantified the theoretical basis for lack of replication by deriving the positive predictive value (PPV) of the truth of a research finding on the basis of a combination of factors. He showed elegantly that most claimed research findings are false [6]. One of his findings was that the more scientific teams involved in studying the subject, the less likely the research findings from individual studies are to be true. The rapid early succession of contradictory conclusions is called the "Proteus phenomenon" [7]. For several independent studies of equal power, Ioannidis showed that the probability of a research finding being true when one or more studies find statistically significant results declines with increasing number of studies.
As part of the scientific enterprise, we know that replication (the performance of another study statistically confirming the same hypothesis) is the cornerstone of science, and that replication of findings is very important before any causal inference can be drawn. While the importance of replication is also acknowledged by Ioannidis, he does not show how PPVs of research findings increase when more studies have statistically significant results. In this essay, we demonstrate the value of replication by extending Ioannidis' analyses to calculation of the PPV when multiple studies show statistically significant results.
The probability that a study yields a statistically significant result depends on the nature of the underlying relationship. The probability is 1 − β (one minus the Type II error rate) if the relationship is true, and α (Type I error rate) when the relationship is false, i.e., there is no relationship. Similarly, the probability that r out of n studies yield statistically significant results also depends on whether the underlying relationship is true or not. Let B(p,r,n) denote the probability of obtaining at least r statistically significant results out of n independent and identical studies, with p being the probability that a single study yields a statistically significant result:

$$B(p, r, n) = \sum_{i=r}^{n} \binom{n}{i} p^{i} (1 - p)^{n - i}$$

In this formula, p is 1 − β when the underlying relationship is true and α when it is false. Let R be the pre-study odds and c be the number of relationships being probed in the field. The pre-study probability of a relationship being true is given by R/(R + 1). The expected values of the 2 × 2 table are given in Table 1. When r is equal to one, entries in Table 1 are identical to those in Table 3 of Ioannidis [6]. The probability that, in the absence of bias, at least r out of n independent studies find statistically significant results is given by

$$\frac{R\,B(1 - \beta, r, n) + B(\alpha, r, n)}{R + 1},$$

and the PPV when at least r studies are statistically significant is

$$\mathrm{PPV} = \frac{R\,B(1 - \beta, r, n)}{R\,B(1 - \beta, r, n) + B(\alpha, r, n)}.$$
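The quantities defined above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and the significance level α = 0.05 is an assumption, since the text does not state the value used.

```python
from math import comb


def binom_tail(p: float, r: int, n: int) -> float:
    """B(p, r, n): probability of at least r significant results among
    n independent, identical studies, each significant with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(r, n + 1))


def prob_at_least_r(R: float, power: float, r: int, n: int, alpha: float = 0.05) -> float:
    """Probability, in the absence of bias, that at least r of n studies
    are significant: (R*B(1-beta, r, n) + B(alpha, r, n)) / (R + 1)."""
    return (R * binom_tail(power, r, n) + binom_tail(alpha, r, n)) / (R + 1)


def ppv(R: float, power: float, r: int, n: int, alpha: float = 0.05) -> float:
    """PPV when at least r of n studies are statistically significant."""
    true_pos = R * binom_tail(power, r, n)   # proportional to true relationships flagged
    false_pos = binom_tail(alpha, r, n)      # proportional to null relationships flagged
    return true_pos / (true_pos + false_pos)


# Example: pre-study odds R = 0.1, power 20%, at least 3 of 10 studies significant.
# alpha = 0.05 is an assumed value.
print(f"P(>=3 significant) = {prob_at_least_r(0.1, 0.2, 3, 10):.3f}")
print(f"PPV                = {ppv(0.1, 0.2, 3, 10):.3f}")
```

With r = 1 the tail probability reduces to 1 − (1 − p)^n, so `ppv` collapses to the "at least one of n studies" case that Ioannidis considered.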

Positive Predictive Value as a Function of Study Replication
We examine the PPV as a function of the number of statistically significant findings. Figure 1 shows the PPV of at least one, two, or three statistically significant research findings out of ten independent studies as a function of the pre-study odds of a true relationship (R) for powers of 20% and 80%. The lower lines correspond to Ioannidis' finding and indicate the probability of a true association when at least one out of ten studies shows a statistically significant result. As can be seen, the PPV is substantially higher when more research findings are statistically significant. Thus, a few positive replications can considerably enhance our confidence that the research findings reflect a true relationship. When R ranges from 0.0001 to 0.01, a higher number of positive studies is required to attain a reasonable PPV. The difference in PPV for power of 80% and power of 20% when at least three studies are positive is higher than when at least one study is positive.

Figure 2 gives the PPV for increasing number of positive studies out of ten, 25, and 50 studies for pre-study odds of 0.0001, 0.01, 0.1, and 0.5 for powers of 20% and 80%. When there is at least one positive study (r = 1) and power equal to 80%, as indicated in Ioannidis' paper, PPV declined approximately 50% for 50 studies compared to ten studies for R values between 0.0001 and 0.1. However, PPV increases with increasing number of positive studies and the percentage of positive studies required to achieve a given PPV declines with increasing number of studies. The number of positive studies required to achieve a PPV of at least 70% increased from eight for ten studies to 12 for 50 studies when pre-study odds equaled 0.0001, from five for ten studies to eight for 50 studies when pre-study odds equaled 0.01, from three for ten studies to six for 50 studies when pre-study odds equaled 0.1, and from two for ten studies to five for 50 studies when pre-study odds equaled 0.5. The difference in PPV for powers of 80% and 20% declines with increasing number of studies.
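The threshold question discussed in this section (how many positive studies are needed to reach a given PPV) can be explored with a short search over r. The sketch below is illustrative only: the function names are hypothetical, α = 0.05 is an assumed significance level, and the 70% target is taken from the text. Under these assumptions and 20% power, the search for ten studies returns eight, five, three, and two positive studies for pre-study odds of 0.0001, 0.01, 0.1, and 0.5, which suggests that the thresholds quoted for ten studies correspond to the 20% power case.

```python
from math import comb
from typing import Optional


def binom_tail(p: float, r: int, n: int) -> float:
    """B(p, r, n): probability of at least r significant results in n studies."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(r, n + 1))


def ppv(R: float, power: float, r: int, n: int, alpha: float = 0.05) -> float:
    """PPV when at least r of n studies are statistically significant."""
    true_pos = R * binom_tail(power, r, n)
    false_pos = binom_tail(alpha, r, n)
    return true_pos / (true_pos + false_pos)


def min_significant_for_ppv(
    R: float, power: float, n: int, target: float = 0.70, alpha: float = 0.05
) -> Optional[int]:
    """Smallest r in 1..n with PPV >= target, or None if no r reaches it."""
    for r in range(1, n + 1):
        if ppv(R, power, r, n, alpha) >= target:
            return r
    return None


# Ten studies, 20% power, assumed alpha = 0.05, target PPV of 70%.
for R in (0.0001, 0.01, 0.1, 0.5):
    r_needed = min_significant_for_ppv(R, power=0.2, n=10)
    print(f"R = {R}: at least {r_needed} of 10 studies must be significant")
```

Changing `n` to 25 or 50, or `power` to 0.8, gives the other scenarios the section describes.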

Probability Distribution of Statistically Significant Results
Although the PPV increases with increasing statistically significant results, the probability of obtaining at least r significant results declines with increasing r. This probability and the corresponding PPV for pre-study odds of 0.0001, 0.01, 0.1, and 0.5 are given for ten studies in Table 2. When power is 20% and pre-study odds are 0.0001, the probability of obtaining at least three statistically significant results is 1% and the corresponding PPV is 0.3%. This probability and the corresponding PPV increase with increasing pre-study odds. For example, when R = 0.1, the probability of obtaining at least three significant results is 4% and the PPV is 74%. As expected, both the probability of obtaining statistically significant results and the corresponding PPV increase with increasing power. However, for very small R values (around 0.0001), the increase in power has a minimal impact on the probability of obtaining at least one, two, or three statistically significant results. When power is 80%, the probability of obtaining at least three statistically significant results is 1.2% and the corresponding PPV is 0.9% for R = 0.0001, and when pre-study odds are 0.1, the probability of obtaining at least three statistically significant results increases to 10% and the corresponding PPV to 90%.

Table 1. Expected values of the 2 × 2 table

| Research finding | True relationship | No relationship | Total |
|---|---|---|---|
| ≥ r significant studies | cRB(1 − β,r,n)/(R + 1) | cB(α,r,n)/(R + 1) | c(RB(1 − β,r,n) + B(α,r,n))/(R + 1) |
| < r significant studies | cR(1 − B(1 − β,r,n))/(R + 1) | c(1 − B(α,r,n))/(R + 1) | c(1 − (RB(1 − β,r,n) + B(α,r,n))/(R + 1)) |
| Total | cR/(R + 1) | c/(R + 1) | c |
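The trade-off described in this section, where a stricter replication requirement raises the PPV but makes the required outcome less likely, can be tabulated directly. The following sketch is an illustration under stated assumptions, not a reproduction of Table 2: the helper name `prob_and_ppv` is hypothetical and α = 0.05 is an assumed significance level.

```python
from math import comb


def binom_tail(p: float, r: int, n: int) -> float:
    """B(p, r, n): probability of at least r significant results in n studies."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(r, n + 1))


def prob_and_ppv(R: float, power: float, r: int, n: int = 10, alpha: float = 0.05):
    """Return (probability of at least r significant studies, corresponding PPV)."""
    true_part = R * binom_tail(power, r, n)
    false_part = binom_tail(alpha, r, n)
    prob = (true_part + false_part) / (R + 1)
    return prob, true_part / (true_part + false_part)


# Tabulate both quantities for ten studies across the scenarios in the text.
print(f"{'power':>5} {'R':>7} {'r':>2} {'P(>= r)':>9} {'PPV':>8}")
for power in (0.2, 0.8):
    for R in (0.0001, 0.01, 0.1, 0.5):
        for r in (1, 2, 3):
            prob, value = prob_and_ppv(R, power, r)
            print(f"{power:>5} {R:>7} {r:>2} {prob:>9.2%} {value:>8.2%}")
```

For a fixed R and power, reading down the rows for r = 1, 2, 3 shows the probability falling while the PPV rises.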

Comment
The importance of research replication was discussed in a Nature Genetics editorial in 1999 lamenting the nonreplication of association studies [8]. The editor emphasized that when authors submit manuscripts reporting genetic associations, the study should include an effect size and it should contain either a replication in an independent sample or physiologically meaningful data supporting a functional role of the polymorphism in question. While we acknowledge that our assumptions of identical design, power, and level of significance reflect a somewhat simplified scenario of replication, we quantified the positive predictive value of true research findings for increasing numbers of significant results. True replication, however, requires a precise process where the exact same finding is reexamined in the same way. More often than not, genuine replication is not done, and what we end up with in the literature is corroboration or indirect supporting evidence. While this may be acceptable to some extent in any scientific enterprise, the distance from this to data dredging, moving the goal post, and other selective reporting biases is often very small and can contribute to "pseudo" replication. Replication does not mean that we can have underpowered studies; even when we have several underpowered studies replicate a finding, the PPV remains low. Good replication practices require adequately powered studies. More generally, meta-analysis is a more useful approach to assess the totality of evidence in a body of work. Ioannidis discussed the importance of meta-analysis, and its weaknesses in cases where even the meta-analysis is underpowered.
Our calculations have not considered the possibility of bias, i.e., selective reporting problems that may change some "negative" results to "positive" or may leave "negative" results unpublished. John Ioannidis has shown that modest bias can decrease the PPV steeply [6]. Therefore, if replication is to work in genuinely increasing the PPV of research claims, it should be coupled with full transparency and non-selective reporting of research results. Note that when hypotheses are one-sided, according to our definition of replication, we only consider hypotheses that are in the same direction. Under this definition, statistically significant results in both directions do not arise. However, in meta-analysis, one can combine results that are significant in opposite directions. Calculations in a formal meta-analysis may not square fully with the inference presented here, since meta-analysis would incorporate both effect sizes and their uncertainty rather than just the "positive" versus "negative" inference. For example, we may have the necessary number of "positive" studies, but if the observed "positive" effects are small and all the other studies have trends in the opposite direction, the summary effect may well be null.
In summary, while we agree with Ioannidis that most research findings are false, we clearly demonstrate that replication of research findings enhances the positive predictive value of research findings being true. While this is not unexpected, it should be encouraging news to researchers in their never-ending pursuit of scientific hypothesis generation and testing. Nevertheless, more methodologic work is needed to assess and interpret cumulative evidence of research findings and their biological plausibility. This is especially urgent in the exploding field of genetic associations.