The authors have declared that no competing interests exist.

A typical rule used for the endorsement of new medications by the Food and Drug Administration is to have two trials, each convincing on its own, demonstrating effectiveness. “Convincing” may be subjectively interpreted, but in practice the use of p-values and the focus on statistical significance (in particular with

Endorsement of medications (drugs and biologics) for clinical use is under rigorous control by regulatory agencies. In the United States, since 1962, this control has been provided by the Food and Drug Administration (FDA; [

Unfortunately,

Another issue that is associated with the use of

In this paper, we will present through simulation the extent to which strength of evidence varies when employing a criterion for drug approval of having exactly two trials with statistically significant results.

In the next section, we will describe the set-up of our simulations in detail. Then, we will present the results of our simulations, demonstrating both the range in strength of evidence and the proportion of times the evidence actually points in favor of the null hypothesis. We will conclude with a discussion of the implications of our results for regulatory assessment of new medications.

We conducted three sets of simulations. For every set, we generated 12,500 data sets. All of the data sets were intended to mimic two-condition between-subjects experiments with an experimental group and a control (e.g., placebo) group. Statistical significance in each trial was assessed with a two-tailed t test (α = .05).

Therefore, our simulations are:

Placebo_{123} ∼ N(0, 1)

Experimental_{1} ∼ N(0.2, 1)

Experimental_{2} ∼ N(0.5, 1)

Experimental_{3} ∼ N(0, 1)

where Placebo_{123} indicates simulated data for the placebo groups in all sets of simulations, and Experimental_{1}, Experimental_{2}, and Experimental_{3} indicate simulated data for the experimental groups in the first, second, and third set of simulations, respectively. The notation ∼ N(μ, σ) indicates random draws from a normal distribution with mean μ and standard deviation σ.

For each effect-size simulation set, we ran five trial-number scenarios: 2 statistically significant results out of 2 trials performed, 2 out of 3, 2 out of 4, 2 out of 5, and 2 out of 20. This was achieved by repeatedly regenerating data until exactly 2 significant results emerged. Note that our simulations are not concerned with the likelihood of obtaining exactly 2 out of, say, 5 significant results given a certain effect size. Their purpose is to demonstrate the range of strengths of evidence if such a scenario were to occur.

These simulations reflect different scenarios: at one end, exactly two trials were conducted and both were statistically significant in the expected direction; at the other end, twenty trials were conducted and exactly two were significant in the expected direction (with the remaining 18 not statistically significant). We also varied the number of participants per group, running five conditions: n = 20, n = 50, n = 100, n = 500, and n = 1,000.

Thus, to sum up, our simulations varied along the following dimensions:

Effect size: small (0.2 SD), medium (0.5 SD), and zero (0 SD)

Number of total trials: 2, 3, 4, 5, and 20

Number of participants: 20, 50, 100, 500, and 1,000

This resulted in a total of 75 types of simulations (3 × 5 × 5). We replicated each simulation type 500 times. In addition to these simulations, we performed sensitivity analyses with simulations that used individual differences in the effect size distribution and unequal variance in the two groups (see the online supplement).
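As an illustration of the data-generation procedure described above, the following is a minimal sketch (assuming normally distributed scores with unit variance and a two-tailed t test at α = .05, as in our set-up; the function and variable names are ours, not those of the actual simulation code):

```python
import numpy as np
from scipy import stats

def simulate_until_two_significant(effect_size, n_per_group, n_trials,
                                   alpha=0.05, rng=None):
    """Regenerate a full set of trials until exactly two of them are
    statistically significant (two-tailed p < alpha) in the expected
    direction, mimicking one replication of a simulation cell."""
    if rng is None:
        rng = np.random.default_rng(1)
    while True:
        # Placebo scores ~ N(0, 1); experimental scores ~ N(effect_size, 1).
        placebo = rng.normal(0.0, 1.0, size=(n_trials, n_per_group))
        experimental = rng.normal(effect_size, 1.0, size=(n_trials, n_per_group))
        # One independent-samples t test per trial (row-wise).
        t, p = stats.ttest_ind(experimental, placebo, axis=1)
        n_sig = int(np.sum((p < alpha) & (t > 0)))
        if n_sig == 2:
            return placebo, experimental

# Example: medium effect (0.5 SD), 50 participants per group, 2-of-3 scenario.
placebo, experimental = simulate_until_two_significant(0.5, 50, 3)
```

Rejection sampling of whole trial sets, as above, conditions on the "exactly 2 significant" outcome without regard to how probable that outcome is, which matches the stated purpose of the simulations.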

We calculated one-sided JZS Bayes factors for the combined data from the total number of trials conducted [

For each replication, we computed an independent-samples one-sided Bayesian t test on the combined data.
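For readers unfamiliar with the JZS Bayes factor, the two-sided version for an independent-samples t statistic can be computed by numerical integration over the g-prior, following Rouder and colleagues; the one-sided variant used in our analyses additionally restricts the prior to positive effect sizes. The sketch below assumes a Cauchy prior scale of r = 1 and is an illustration, not the exact code behind the reported results:

```python
import numpy as np
from scipy import integrate

def jzs_bf10(t, n1, n2, r=1.0):
    """Two-sided JZS Bayes factor (alternative over null) for an
    independent-samples t statistic, computed by integrating over g,
    whose inverse-chi-square(1) prior induces a Cauchy prior on effect size."""
    N = n1 * n2 / (n1 + n2)   # effective sample size
    nu = n1 + n2 - 2          # degrees of freedom
    # Marginal likelihood under H0 (up to a constant shared with H1).
    null = (1 + t**2 / nu) ** (-(nu + 1) / 2)

    def integrand(g):
        return ((1 + N * g * r**2) ** -0.5
                * (1 + t**2 / ((1 + N * g * r**2) * nu)) ** (-(nu + 1) / 2)
                * (2 * np.pi) ** -0.5 * g ** -1.5 * np.exp(-1 / (2 * g)))

    alt, _ = integrate.quad(integrand, 0, np.inf)
    return alt / null
```

With t = 0 the Bayes factor favors the null (BF10 < 1); it grows with the magnitude of t and with the sample sizes.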

The Bayes factor results of the small effect size simulations are shown in

Boxes contain Bayes factors for 50% of the simulations with tails extending to Bayes factors for 100% of the simulations excluding outliers. Circles indicate outliers and are those values more than 1.5*IQR removed from the boxes. Note that for large numbers of participants, Bayes factors increase exponentially and only the tail of the boxes is visible.

The results show substantial variability in evidential strength, both across different types of simulations (reflected by the different heights of the boxes) and within each type of simulation (reflected by the size of the boxes and the extent of the tails). Summarizing the main trends: the evidential strength for medications that achieve two trials with statistically significant results is lower when more trials were conducted (boxes get lower for higher numbers of trials); it is higher when more participants were tested in the placebo and experimental groups (boxes get higher toward the right of each panel); and the two factors interact, with increasing the number of participants having a stronger effect for larger numbers of trials.

How often would it happen that medications that achieve two trials with statistically significant results actually have the overall evidence point in favor of the null hypothesis? The answer can be found in

As shown in the bottom left of the bottom middle panel, when the number of trials is 20 and the number of participants per group is 20, the evidence actually favors the null hypothesis in 10.4% of all simulations (the black part of the bar plot). A BF that does not exceed 3 in favor of the alternative hypothesis over the null hypothesis is quite common for scenarios with relatively few participants or a relatively large number of trials (the black and red parts of the bar plot). For instance, with 2 out of 5 significant trials and 20 participants per group, the strength of evidence in favor of the alternative hypothesis is lower than 3 in 22% of all simulations. Finally, the combined black, red, and green parts of the bar plot show that a BF not exceeding 20 is very common: depending on the number of participants and the number of trials, the percentage of simulations with strength of evidence in favor of the alternative hypothesis lower than 20 can exceed half of all simulations.
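The percentages just described amount to simple proportions of Bayes factors falling below each cut-off across the 500 replications of a simulation cell; a minimal sketch (with hypothetical BF values for illustration):

```python
import numpy as np

def proportion_below(bfs, cutoffs=(1, 3, 20)):
    """Fraction of simulated Bayes factors below each cut-off.
    BF10 < 1 means the evidence favors the null hypothesis."""
    bfs = np.asarray(bfs, dtype=float)
    return {c: float((bfs < c).mean()) for c in cutoffs}

# Hypothetical BF10 values from one simulation cell:
result = proportion_below([0.8, 2.5, 4.0, 30.0])
# -> {1: 0.25, 3: 0.5, 20: 0.75}
```

The three cut-offs correspond to evidence favoring the null (BF below 1), weak evidence for the alternative (BF below 3), and less-than-strong evidence for the alternative (BF below 20).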

The median Bayes factor results of the medium effect size simulations are shown in

Boxes contain Bayes factors for 50% of the simulations with tails extending to Bayes factors for 100% of the simulations excluding outliers. Circles indicate outliers and are those values more than 1.5*IQR removed from the boxes.

Comparing these results to those obtained in

The percentage of 500 simulations for which the Bayes factor is lower than a certain cut-off value for the medium effect size is displayed in

The median Bayes factor results of the zero effect size simulations are shown in

Boxes contain Bayes factors for 50% of the simulations with tails extending to Bayes factors for 100% of the simulations excluding outliers. Circles indicate outliers and are those values more than 1.5*IQR removed from the boxes.

As shown, when the true effect size is zero, the Bayes factor becomes much lower than in scenarios where the true effect size is non-zero. In most simulation cells, the median Bayes factor is lower than 10, and in many cases the Bayes factor actually favors the null hypothesis. This is particularly true when the number of trials and the number of participants are larger.

The percentage of 500 simulations for which the Bayes factor is lower than a certain cut-off value for zero effect size is displayed in

Examining the results shows that if the true effect size is zero, Bayes factors usually favor the null hypothesis when the number of trials is large. When the true effect is zero, Bayes factors are rarely larger than 20, except when 2 out of 2 trials were significant, which should itself be a rare occurrence if the true effect size is zero. In additional simulations, we verified that our results hold with variable underlying effect size distributions and with unequal variance in the experimental group compared to the control group (see online supplement).

When taken together, the results provide striking evidence of the wide variation in strength of evidence that occurs as a result of a strict endorsement criterion of exactly two trials with statistically significant results.

In this study, we simulated clinical trial data comparing an experimental group to a placebo group. Simulations differed on the underlying true effect size, the number of clinical trials, and the number of participants in each trial. The simulations all had one thing in common: exactly two of the conducted trials were statistically significant, with a two-tailed p-value below .05.

The result of our simulations is simple yet compelling: a criterion of endorsement based on exactly two statistically significant trials is compatible with a wide range of evidential strength, including, in some scenarios, evidence that actually favors the null hypothesis.

What are we to conclude from this? First and foremost, it is likely that a criterion asking for two statistically significant trials would often lead to correct endorsement of new medications. It is difficult to estimate the true proportion of incorrectly endorsed medications, because the underlying true effect size cannot be known in advance. However, empirical evidence across many medical interventions suggests that most effects are small or modest [

Fortunately, a straightforward solution exists: quantifying evidence with the Bayes factor. Such a change in protocol for statistical inference is not unprecedented. Bayesian statistics have been used in clinical trial design and analysis for a number of years by the FDA in some domains (e.g., medical device clinical trials; [

Another important problem, not solved until recently, was the absence of easy-to-use statistical software with an intuitive interface, which meant that Bayesian hypothesis tests could only be applied by statistical experts. The recent development of online Bayes factor calculator tools [

It is important to stress that the results of our simulations make no assumptions as to how the data were obtained. Our simulations make no commitment about the nature of the data: whether it was obtained with honest intentions, through cherry-picking, or through

Some limitations of our study should be discussed. First, the “exactly 2 significant trials” rule that we simulated may not fully capture the way the FDA or other regulatory agencies operate, even without explicit consideration of Bayesian methods. The FDA’s position of requiring “…at least two adequate and well-controlled studies, each convincing on its own, to establish effectiveness” is not the same as “exactly 2 significant trials”. Our demonstration indicates the strength of evidence one would obtain when this policy is employed with exactly two significant trials. Furthermore, the approval process includes consideration of multiple aspects of efficacy and safety, and it also entails a qualitative assessment of the adequacy of the design, conduct, and analysis of the trial and of the relevance of the outcomes used. Therefore, our simulations should not be seen as exactly mapping the regulatory process, but rather as exploring the consequences of using a rule based on statistical significance alone. Safety assessment in particular has a stronger track record of use of Bayesian analysis [

Second, there can be differences of opinion on how strong the evidence should be before a medication is approved. A BF of 3 is typically considered very weak, worth no more than a bare mention, while even a BF of 20 may not always be considered conclusive [

Third, we used two different non-null effect sizes, but the magnitude of effect considered sufficient for approval may vary on a case-by-case basis. For example, the type of outcome, the availability of other drugs, the safety profile of the tested medication, and other factors may also enter into the decision-making.

Fourth, we have focused on evaluating superiority trials, but for some drugs decisions may be made based on non-inferiority designs. Non-inferiority trials are a minority: a survey identified 209 non-inferiority or equivalence trials published in 2009 [

Fifth, we used a standard Bayesian framework for all analyses, so as to standardize the inferences derived. In real practice, some further diversification can exist based on additional prior evidence. For our simulations, we assumed non-informative priors.

Allowing for these caveats, our study offers through simulations yet another demonstration of the unfortunate effect of
