The quest for an optimal alpha

Researchers who analyze data within the framework of null hypothesis significance testing must choose a critical “alpha” level, α, to use as a cutoff for deciding whether a given set of data demonstrates the presence of a particular effect. In most fields, α = 0.05 has traditionally been used as the standard cutoff. Many researchers have recently argued for a change to a more stringent evidence cutoff such as α = 0.01, 0.005, or 0.001, noting that this change would tend to reduce the rate of false positives, which are of growing concern in many research areas. Other researchers oppose this proposed change, however, because it would correspondingly tend to increase the rate of false negatives. We show how a simple statistical model can be used to explore the quantitative tradeoff between reducing false positives and increasing false negatives. In particular, the model shows how the optimal α level depends on numerous characteristics of the research area, and it reveals that although α = 0.05 would indeed be approximately the optimal value in some realistic situations, the optimal α could actually be substantially larger or smaller in other situations. The importance of the model lies in making it clear what characteristics of the research area have to be specified to make a principled argument for using one α level rather than another, and the model thereby provides a blueprint for researchers seeking to justify a particular α level.


S1 Appendix. Supplementary analysis of other possible α levels
In principle, α = 0.05 and α = 0.005 are not the only two possibilities, because any number between zero and one could be chosen as the α level. Thus, in this supplement we treat α level as a continuously varying value and investigate how its optimal value depends on research scenario parameters.
The overall payoff P_T (Eq 8) is a function of two parameters controlled by the researcher, α and sample size, plus six parameters inherent in the research scenario: d, π, P_tp, P_fp, P_tn, and P_fn. For any given values of the research scenario parameters, standard numerical search procedures [1] can be used to find the optimal combination of α and sample size, that is, the combination leading to the highest payoff within that scenario. These search procedures treat α as a continuously varying numerical quantity and can find whatever value is optimal, even if it is something other than α = 0.05 or α = 0.005. Fig A shows the optimal α levels for a variety of scenarios, and Fig B shows the corresponding optimal sample sizes. Inspection of Fig A reveals that the optimal α level depends very strongly on the base rate, π, and on the payoff associated with false positives, P_fp. At one extreme, with a very low base rate (i.e., π = 0.01) and a relatively high FP cost (i.e., P_fp = −5), the optimal α level is less than 0.001. In contrast, with a relatively large base rate (i.e., π = 0.5) and a relatively low FP cost (i.e., P_fp = −2), the optimal α level is greater than 0.1. Given that the optimal α varies across at least two orders of magnitude, the discussion over whether to use α = 0.05 or α = 0.005 actually seems somewhat narrow. Perhaps more importantly, the figure shows quite clearly that researchers cannot hope to identify the appropriate α level without having good quantitative information about the base rate of true effects and about the cost of FPs. Although base rate estimates have been considered in some recent discussions of α levels [2], FP costs have not been addressed quantitatively.
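The search described above can be illustrated with a minimal sketch. Eq 8 is not reproduced in this supplement, so the payoff function below is an assumption: each study's expected payoff is the probability-weighted sum of the four outcome payoffs, and a fixed pool of participants (the hypothetical `total_participants`) is divided among equally sized studies. Power uses a one-tailed normal approximation rather than the exact t distribution, and a plain grid search stands in for the numerical search procedures of [1]. All function names, grids, and example values are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Candidate grids (arbitrary choices for this sketch).
ALPHAS = [10 ** (-4 + 0.05 * i) for i in range(75)]  # about 0.0001 to 0.5
SIZES = range(5, 1001, 5)                            # participants per group


def power_two_group(alpha, n_per_group, d):
    """One-tailed power of a two-group comparison (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    noncentrality = d * math.sqrt(n_per_group / 2.0)
    return 1.0 - STD_NORMAL.cdf(z_crit - noncentrality)


def total_payoff(alpha, n_per_group, d, base_rate,
                 p_tp, p_fp, p_tn, p_fn, total_participants=10_000):
    """Assumed payoff: number of affordable studies times payoff per study."""
    power = power_two_group(alpha, n_per_group, d)
    per_study = (
        base_rate * (power * p_tp + (1.0 - power) * p_fn)
        + (1.0 - base_rate) * (alpha * p_fp + (1.0 - alpha) * p_tn)
    )
    n_studies = total_participants / (2 * n_per_group)
    return n_studies * per_study


def grid_search(d, base_rate, p_tp, p_fp, p_tn, p_fn):
    """Return (best payoff, best alpha, best n per group) over the grids."""
    return max(
        (total_payoff(a, n, d, base_rate, p_tp, p_fp, p_tn, p_fn), a, n)
        for a in ALPHAS
        for n in SIZES
    )


if __name__ == "__main__":
    # Illustrative scenario only; these values are not taken from Fig A.
    payoff, alpha, n = grid_search(d=0.5, base_rate=0.1,
                                   p_tp=1.0, p_fp=-2.0, p_tn=0.0, p_fn=0.0)
    print(f"payoff={payoff:.2f}, alpha={alpha:.4f}, n per group={n}")
```

A grid is coarse but transparent; a continuous optimizer would refine the same objective if finer resolution in α were needed.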
Surprisingly, Fig A also shows that the optimal α level is essentially independent of the effect size, d, and it is also relatively independent of the FN cost, at least within the ranges considered here. The reasons for such independence are not entirely clear, but in principle the independence makes it more feasible to identify optimal α levels, because it reduces the amount of prior knowledge that researchers must obtain to identify these levels (i.e., d and P_fn need not be identified). In practice, however, information about d is clearly needed to identify the optimal (α, n_s) pair because it is required for choosing n_s, as will be seen next.
Fig B shows the optimal sample sizes associated with each of the optimal α levels shown in Fig A. Quite reasonably, much larger samples are needed with smaller true effect sizes, and larger samples are also needed when the base rate of true effects is low. Interestingly, at least for the conditions shown in Fig A, the sample size increases for smaller effects to an extent that produces approximately the same levels of power for all three effect sizes. Thus, not only is the optimal α level rather independent of effect size, but so is the optimal level of power. It is disconcerting, however, that the optimal sample sizes are typically 300+ for medium-sized effects; sample sizes that large are rare in many research areas where two-group comparisons are common. For repeated measures designs (i.e., one-sample t-tests), analogous computations indicate that the optimal sample sizes are approximately 25% of those shown in Fig B.
Fortunately, further investigation reveals that the choice of the exactly optimal α level may not be very important in practice, at least under conditions similar to those considered here (i.e., π < 0.5, P_fp and P_fn of −2 or −5). Fig C shows expected payoffs for researchers using α = 0.005, α = 0.05, or the exact α yielding the maximum expected payoff, each with its own optimal sample size. Critically, in these scenarios researchers can achieve nearly the same payoff by using whichever is better of α = 0.005 or α = 0.05, without determining the best α level even more precisely.
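The sample-size relationships noted for Fig B can be made concrete with the textbook normal-approximation formulas for required sample size. This is a sketch rather than the computation behind Fig B: it fixes α and power at illustrative values instead of optimizing them, uses one-tailed tests, and assumes the one-sample effect size equals the two-group d. Under those assumptions, holding power constant requires n to grow as 1/d², and a one-sample design needs one quarter of the total participants of a two-group design. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def n_per_group_two_sample(d, alpha, power):
    """Approximate participants per group for a one-tailed two-group test."""
    z_sum = STD_NORMAL.inv_cdf(1.0 - alpha) + STD_NORMAL.inv_cdf(power)
    return 2.0 * (z_sum / d) ** 2


def n_one_sample(d, alpha, power):
    """Approximate participants for a one-tailed one-sample test."""
    z_sum = STD_NORMAL.inv_cdf(1.0 - alpha) + STD_NORMAL.inv_cdf(power)
    return (z_sum / d) ** 2


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        two_group_total = 2.0 * n_per_group_two_sample(d, alpha=0.05, power=0.8)
        one_sample = n_one_sample(d, alpha=0.05, power=0.8)
        print(f"d={d}: two-group total={two_group_total:.1f}, "
              f"one-sample={one_sample:.1f}, "
              f"ratio={one_sample / two_group_total:.2f}")
```

The fixed ratio of 0.25 follows directly from the two formulas and does not depend on the chosen α or power.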
We also conducted a more detailed analysis to see which values of α would be optimal using a wider range of payoff values. To that end, we computed the optimal combination of α level and sample size for two-sample t-tests using all of the scenarios that could be constructed from combinations of the parameter values listed in Table A. Excluding the approximately 11% of scenarios that were degenerate, in which the expected payoff was maximized by taking the minimum possible sample size (see [3], Appendix C), the optimal α levels ranged all the way from a minimum of 0.0001 to a maximum of 0.876. Interestingly, a logistic model predicted the optimal α levels from the scenario parameters fairly accurately. In this model, the to-be-predicted score was the logit of the optimal α level, Y = ln[α_optimal / (1 − α_optimal)], and Y was predicted as a linear function of the parameters in Table A with a constant of −4.14 and with the slopes shown in the table, with an overall R^2 = 0.92.
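To show how such a model turns scenario parameters into a predicted α, the sketch below assumes the standard logit link, so that the linear predictor Y is converted back to a probability with the inverse logit. The slopes from Table A are not reproduced here, so they must be supplied by the caller; the function names and the dictionary layout are hypothetical.

```python
import math


def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(y):
    """Map a linear predictor Y back to a probability."""
    return 1.0 / (1.0 + math.exp(-y))


def predict_alpha(intercept, slopes, params):
    """Predicted optimal alpha from an intercept, slopes, and parameter values.

    `slopes` and `params` are dicts keyed by parameter name; the actual
    slope values belong to Table A and are not assumed here.
    """
    y = intercept + sum(slopes[name] * params[name] for name in slopes)
    return inv_logit(y)


if __name__ == "__main__":
    # With every slope term contributing zero, the constant alone determines
    # the prediction: inv_logit(-4.14) is roughly 0.016.
    print(round(predict_alpha(-4.14, {}, {}), 4))
    # The logit of 0.01 is roughly -4.6.
    print(round(logit(0.01), 3))
```

Because the model is linear on the log-odds scale, a fixed change in any parameter multiplies the odds α/(1 − α) by a constant factor rather than adding a constant to α.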
The slopes in Table A show how the optimal values of α are affected by the values of the scenario parameters, and the sizes of the parameter effects are most clearly illustrated with numerical examples. As a baseline condition, assume P_fp = −2, P_tn = 0, P_fn = 0, π = 0.1, and d = 0.2. For this condition, the model predicts an optimal α = 0.01. If the loss associated with a false positive is reduced to P_fp = −1, the predicted optimal α level increases to 0.018. The gain associated with a true negative outcome, P_tn, has a much stronger influence on the predicted α_optimal, perhaps because there were so many opportunities for true negatives with the low base rates used in these computations. For example, a small change from the baseline condition to P_tn = 0.2 increases α_optimal to 0.036. Similarly, α_optimal was quite sensitive to the base rate of true effects; changing from the baseline to π = 0.3 yields predicted α_optimal = 0.04. In contrast, there seem to be relatively small influences of the effect size, d, and of the loss associated with a false negative, P_fn, at least over moderate ranges of those values. For example, the predicted α_optimal values remain at 0.01 when there is a change from the baseline to P_fn = −1 or to d = 0.8.