
The authors have declared that no competing interests exist.

The need to replicate initial results has been rediscovered only recently in many fields of research. In preclinical biomedical research, it is common practice to conduct exact replications with the same sample sizes as those used in the initial experiments. Such replication attempts, however, have a lower probability of success than is generally appreciated. Indeed, in the common scenario of an effect just reaching statistical significance, the statistical power of a replication experiment assuming the same effect size is approximately 50%; in essence, a coin toss. Accordingly, we use the provocative analogy of “replicating” a neuroprotective drug animal study with a coin flip to highlight the need for larger sample sizes in replication experiments. Additionally, we provide detailed background on the probability of obtaining a significant result in replication experiments.

Using a coin toss to ‘replicate’ a neuroprotective effect of valproic acid, this study highlights the fact that exact replications of biomedical experiments are usually underpowered, often with power of approximately 50%.

“Non-reproducible single occurrences are of no significance to science.” [

In modern times, replication of results has been considered an integral part of the scientific process at least since Karl Popper’s famous declaration [

As a consequence, many biomedical researchers are aware of potentially low replication rates across laboratories [

We use an empirical example from our own research to highlight the generally low statistical power of same sample-size exact replications, with emphasis on the common scenario of a barely significant initial finding. To this end, we take the unconventional approach of conducting a coin flip experiment in an attempt to “replicate” an animal experiment that found a small neuroprotective effect of valproic acid (VPA). We use this admittedly absurd procedure to provide the background for a broader discussion of the caveats and challenges involved in replications of preclinical experiments. In particular, we discuss the variability of effect size estimates and p-values across small-sample experiments.

VPA has been widely used as an anticonvulsant and mood-stabilizing drug for the treatment of epilepsy and bipolar disorders. Additional uses of the drug have been suggested by studies that have demonstrated its neuroprotective properties in rats [

In the experiment, 20 male C57BL/6N mice underwent transient intraluminal middle cerebral artery occlusion (MCAO) for 45 minutes (for a detailed description, see [ ]). The VPA-treated group displayed significantly lower infarct volumes (−37%) than the vehicle-treated group (mean: 39.4 mm^{3}, standard deviation [SD]: 27.6 mm^{3} versus mean: 63.6 mm^{3}, SD: 22.7 mm^{3}; absolute difference: 24.2 mm^{3}, 95% confidence interval [CI]: 0.3–48.0 mm^{3}; standardized effect size: 0.96, 95% CI: 0.01–1.87; t = 2.136, p = 0.047).
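As a sanity check, the standardized effect size and t statistic can be recomputed from the summary statistics reported above. This is a minimal sketch, assuming the 20 mice were split equally into two groups of n = 10 (an assumption; the group sizes are not stated explicitly here):

```python
import math

n = 10                               # animals per group (assumed: 20 mice, 2 groups)
mean_vpa, sd_vpa = 39.4, 27.6        # infarct volume in mm^3, VPA-treated group
mean_veh, sd_veh = 63.6, 22.7        # infarct volume in mm^3, vehicle-treated group

diff = mean_veh - mean_vpa                           # absolute difference, mm^3
sd_pooled = math.sqrt((sd_vpa**2 + sd_veh**2) / 2)   # pooled SD (equal group sizes)
d = diff / sd_pooled                                 # standardized effect size (Cohen's d)
t = d * math.sqrt(n / 2)                             # two-sample t statistic

print(f"difference = {diff:.1f} mm^3, d = {d:.2f}, t = {t:.2f}")
```

Up to rounding, these values agree with the reported effect size of 0.96 and t = 2.136.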

All animal experiments, inclusive of the welfare-related assessments and interventions that were carried out prior to, during, or after the experiment, were performed according to protocols approved by the Berlin Authorities (ethics committee of the “Landesamt für Gesundheit und Soziales Berlin,” LaGeSo Reg 390/09).

Given that the effect in our initial experiment only just reached statistical significance, we sought to confirm it through replication.

Rather than conduct a same sample-size exact replication as is typically done in biomedicine, however, we decided to attempt to “replicate” our initial findings by a probabilistically equivalent Bernoulli experiment—a simple coin flip (see

For a repetition of the experiment with the same sample size, intervention, and groups, the probability of again obtaining a significant result equals the power of the replication experiment with respect to the effect size observed in the first experiment [ ]. The corresponding test statistic follows a noncentral t distribution with n_total − 2 degrees of freedom and a noncentrality parameter (ncp) that depends on the observed effect size d of the original study and the size n of each group: ncp = d·√(n/2). For a single duplicate experiment, the probability of a statistically significant result in the same direction as the original experiment corresponds to the area under the density of this noncentral t distribution beyond the critical value t_{α/2} for the chosen alpha error probability. In our original study, we had an effect size d of 0.957 with n = 10 animals per group, yielding ncp = 0.957·√(10/2) ≈ 2.14 and a replication power of approximately 52.5%.
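The power calculation above can be checked with a small Monte Carlo simulation: repeatedly draw two groups of n = 10 from normal distributions separated by d = 0.957 and count how often the two-sample t test is significant in the original direction. This is an illustrative sketch (standard-library only, fixed seed), not part of the original study:

```python
import math
import random

random.seed(1)  # reproducible sketch

def t_statistic(ctrl, treat):
    """Pooled two-sample t statistic (treat minus ctrl)."""
    n1, n2 = len(ctrl), len(treat)
    m1, m2 = sum(ctrl) / n1, sum(treat) / n2
    v1 = sum((x - m1) ** 2 for x in ctrl) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in treat) / (n2 - 1)
    sp = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m2 - m1) / (sp * math.sqrt(1 / n1 + 1 / n2))

d, n = 0.957, 10   # observed standardized effect size and animals per group
t_crit = 2.101     # two-sided 5% critical value of the t distribution, df = 18
sims = 50_000

# count simulated replications significant in the same direction as the original
hits = sum(
    t_statistic([random.gauss(0, 1) for _ in range(n)],
                [random.gauss(d, 1) for _ in range(n)]) > t_crit
    for _ in range(sims)
)
power = hits / sims
print(f"simulated replication power: {power:.3f}")  # expected near 0.52
```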

A caveat is needed regarding the coin flip analogy used in our study. The analogy only holds under the assumption that the estimated effect size in the first experiment equals the true population effect. In general, however, data from initial experiments are consistent with a broad range of effect sizes, as can be inferred from the wide confidence intervals associated with the effects.
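To illustrate this point, the power of a same sample-size replication can be evaluated across the 95% CI of the original effect size (d from 0.01 to 1.87). The sketch below uses a simple normal approximation to the noncentral t power, Φ(d·√(n/2) − t_crit), rather than the exact computation; this is an assumption made for simplicity and may differ slightly from the exact value:

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_power(d, n, t_crit=2.101):
    """Approximate power of a same sample-size replication (n per group,
    alpha = 0.05 two-sided), counting only effects in the original direction."""
    ncp = d * math.sqrt(n / 2)      # noncentrality parameter
    return norm_cdf(ncp - t_crit)   # normal approximation to the tail area

n = 10  # per-group sample size of the original experiment
for d in (0.01, 0.957, 1.87):  # lower CI bound, point estimate, upper CI bound
    print(f"assumed true d = {d:5.3f}: replication power ~ {approx_power(d, n):.2f}")
```

Depending on which value within the CI is taken as the true effect, replication power ranges from essentially the alpha level to nearly 100%.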

Colored numerical values refer to our original experiment. The data underlying this figure can be found in

For our unconventional replication experiment, we used a fair coin and a single coin flip to attempt to replicate the effectiveness of VPA in lowering brain infarct volumes. The study plan and procedure of the replication experiment were preregistered [

Screenshots of the coin flip experiment: (A) blind selection of the coin, (B) flipping the coin, and (C) the result, heads.

Clearly, we do not believe that a coin flip can help us infer whether or not VPA has a beneficial neuroprotective effect. We use this absurd example to highlight the similarly absurd (from a frequentist probability perspective) exact replication scenario. In contrast to a coin toss, an exact replication would have consumed considerable resources and entailed the suffering and death of 20 additional mice, yet with no greater probability of replication than our coin toss experiment (under the assumption that the initially observed effect equaled the population effect). Even if the initially observed

A

Because treatment effects measured in small samples have larger variability around their corresponding population effects (relative to treatment effects estimated with larger sample sizes), their associated p-values also vary substantially from one experiment to the next.

Researchers are well advised to focus their research on the central question “What is the effect (size)?” instead of the binary “Is there a statistically significant effect?” A

Power considerations should be obligatory both for initial experiments and for their replications [

Simonsohn [ ] proposed that replications be powered to detect effects that the original study could only barely have detected, which leads to a recommended sample size of approximately 2.5 times that of the initial experiment.

To further explore and understand the role of power in replication experiments, we provide a web application in which the initial sample size, the initial results, and the sample size of the replication experiment can be varied to determine the power of the replication under different scenarios (s-quest.bihealth.org/power_replication/).

Three observations are noteworthy: (1) Assuming that the effect observed in our original experiment equals the population effect, an exact replication with the same sample size has 52.5% power to detect the effect with alpha set to 0.05, which motivated our coin flip analogy. (2) Under this same scenario, considerably larger sample sizes per group would have been necessary for a high-powered replication experiment (
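The effect of a larger replication sample can be sketched with a simple normal approximation to the noncentral t power, Φ(d·√(n/2) − t_crit) (a simplifying assumption, not the exact computation), for example comparing the original n = 10 per group with a 2.5-fold increase:

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# two-sided 5% critical values of the t distribution for df = 2n - 2
T_CRIT = {18: 2.101, 48: 2.011}

def approx_power(d, n):
    """Approximate replication power with n animals per group."""
    return norm_cdf(d * math.sqrt(n / 2) - T_CRIT[2 * n - 2])

d = 0.957  # effect size observed in the original experiment
for n in (10, 25):  # original and 2.5-fold per-group sample size
    print(f"n = {n} per group: replication power ~ {approx_power(d, n):.2f}")
```

Under this assumption, replication power rises from roughly a coin toss at n = 10 to around 90% at 2.5 times the original sample size.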

Exact replications, regardless of using the same or an increased sample size compared with the initial experiment, can only show whether a certain effect can be replicated in a specific setting. However, to what extent the same effect can also be generalized can only be learned from replications that vary some aspect of the original design (e.g., different species, different laboratory, etc.) and thus increase the external validity of the results [

In the table below, we contrast the coin flip with exact and conceptual replication approaches.

| Method of replication | Coin flip replication | Exact replication (same design, same sample size) | Exact replication with increased sample size (e.g., 2.5× sample size of initial study) | Conceptual replication (meaningful alterations to design, varying sample size) |
|---|---|---|---|---|
| Can identify technical mistakes in initial experiment | no | yes | yes | maybe |
| Can be used to reduce false inference on treatment effects | no | maybe | yes | maybe |
| Can provide information on robustness | no | no | no | yes |
| Can be used for meta-analyses | no | yes | yes | maybe |

We used the deliberately fallacious example of combining an animal experiment with a game of chance to illustrate and discuss the statistical challenges and complexities of replication experiments. We stress, however, that our argument is not that exact replication has no role in the scientific process; in fact, it serves a useful but limited and specific purpose. Although reproducibility is a complex construct [

We describe the design and results of a preregistered animal experiment to establish the efficacy of VPA to reduce brain infarct volumes in murine stroke, which we combine with a coin toss as a substitute for an exact replication. The absurd but true notion that a coin flip provides approximately the same positive predictive value as an exact replication experiment when the initial effect is barely significant highlights an important, but little known, limitation of exact replications. Although replication is a complex construct that eludes simple definition, we can learn from both successful and failed replication attempts [

Brain infarct volumes with and without treatment of VPA.

(TIF)


We thank John Ioannidis (METRICS, Stanford University, USA) for critical and stimulating discussions.

CI: confidence interval

MCAO: middle cerebral artery occlusion

ncp: noncentrality parameter

SD: standard deviation

VPA: valproic acid