Second-generation p-values: Improved rigor, reproducibility, & transparency in statistical analyses

Verifying that a statistically significant result is scientifically meaningful is not only good scientific practice, it is a natural way to control the Type I error rate. Here we introduce a novel extension of the p-value, the second-generation p-value ($p_\delta$), that formally accounts for scientific relevance and leverages this natural Type I error control. The approach relies on a pre-specified interval null hypothesis that represents the collection of effect sizes that are scientifically uninteresting or practically null. The second-generation p-value is the proportion of data-supported hypotheses that are also null hypotheses. As such, second-generation p-values indicate when the data are compatible with null hypotheses ($p_\delta = 1$), when they are compatible with alternative hypotheses ($p_\delta = 0$), and when they are inconclusive ($0 < p_\delta < 1$). Moreover, second-generation p-values provide a proper scientific adjustment for multiple comparisons and reduce false discovery rates. This is an advance for data-rich environments, where traditional p-value adjustments are needlessly punitive. Second-generation p-values promote transparency, rigor, and reproducibility of scientific results by specifying a priori which candidate hypotheses are practically meaningful and by providing a more reliable statistical summary of when the data are compatible with alternative or null hypotheses.
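To make the definition concrete, here is a minimal Python sketch of the overlap computation. The function name and interval inputs are illustrative, and the correction factor for very wide interval estimates follows the small-sample correction discussed in Remark 4.

```python
def sgpv(est_lo, est_hi, null_lo, null_hi):
    """Second-generation p-value: the fraction of the interval estimate
    [est_lo, est_hi] that overlaps the interval null [null_lo, null_hi]."""
    overlap = max(0.0, min(est_hi, null_hi) - max(est_lo, null_lo))
    est_len = est_hi - est_lo
    null_len = null_hi - null_lo
    # Correction factor: when the interval estimate is more than twice as wide
    # as the interval null, cap the denominator at twice the null's width so
    # that very imprecise studies are reported as inconclusive (p_delta <= 1/2).
    return (overlap / est_len) * max(est_len / (2 * null_len), 1.0)

# Example: a 95% CI of (0.1, 0.9) against an indifference zone of (-0.2, 0.2).
print(sgpv(0.1, 0.9, -0.2, 0.2))  # 0.125 -- mostly, but not fully, alternative
```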

Remark 1. Naturally, neither $I$ nor $H_0$ may contain the entire parameter space. Other pathologies are easily rectified. For example, if the intervals $I$ and $H_0$ overlap with $I \subseteq H_0$, i.e., $I$ is a subset of $H_0$, then $p_\delta = 1$ regardless of the length of the intervals. The problem arises when the intersection is finite, $|I \cap H_0| < \infty$, but both intervals are not. For example, we might have $I = [a, \infty)$ and $H_0 = (-\infty, b]$ for real numbers $a < b$. Now $|I \cap H_0| = b - a$, and we could argue that $|I|/|H_0| = 1$. But $|I \cap H_0|/|I|$ is arguably zero, whereas $p_\delta = 0$ seems inappropriate here because the intervals have a finite set of hypotheses in common. A practical and realistic solution is to simply truncate the intervals at effects that are not possible to observe in practice.

Remark 2. Note that the procedure is inferentially consistent for all null and alternative hypotheses that are not on the boundary of the indifference zone. When the true hypothesis is exactly on the boundary of the interval null, say at $\theta_0 + \delta$, the second-generation p-value will have essentially the same frequency properties as a classical hypothesis test. As a result, the Type I error rate of $p_\delta$ remains constant as a function of the sample size, and the procedure is no longer inferentially consistent in the limit. That is, it will be wrong $\alpha \cdot 100\%$ of the time regardless of the sample size.

Remark 4. The exact relationship will depend on circumstances, but this simple case provides a good guide. Let $m$ be the margin of error (half-width) from a $(1-\alpha)100\%$ CI with sample size $n_{CI}$. Also, let $n_{HT}$ be the sample size from a two-sided hypothesis test with size $\alpha$ and power $1-\beta$ to detect an alternative that is $\delta$ units from the null hypothesis. Assuming the variance is constant, and letting $\delta$ represent the smallest change of scientific interest, we have $2\delta = |H_0|$ and $|I| = 2m$, so that
$$\frac{|I|}{|H_0|} = \frac{m}{\delta} = \sqrt{\frac{n_{HT}}{n_{CI}}} \cdot \frac{z_{\alpha/2}}{z_{\alpha/2} + z_{\beta}},$$
where $z_{\alpha/2} = \Phi^{-1}[1 - \alpha/2]$ and $\Phi[z] = P(Z \leq z)$ is the standard normal cumulative distribution function. When the two sample sizes are equal, we find that $|I| = |H_0|$ when the sample size confers 50% power to detect $\delta$. With 80% and 90% power, we have $|I| = 0.7|H_0|$ and $|I| = 0.6|H_0|$, respectively, when $\alpha = 0.05$. It follows that the power has to drop below 16% for the correction factor to be triggered.
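As a check on these numbers, the sketch below (assuming equal sample sizes for the CI and the hypothesis test, so the square-root factor drops out) evaluates $|I|/|H_0| = z_{\alpha/2}/(z_{\alpha/2} + z_\beta)$ at several powers and solves for the power below which the correction factor is triggered.

```python
from scipy.stats import norm

alpha = 0.05
z_a = norm.ppf(1 - alpha / 2)      # z_{alpha/2}, about 1.96

for power in (0.50, 0.80, 0.90):
    z_b = norm.ppf(power)          # z_beta
    ratio = z_a / (z_a + z_b)      # |I| / |H0| when n_CI = n_HT
    print(f"power {power:.0%}: |I| = {ratio:.1f} |H0|")
# power 50%: |I| = 1.0 |H0|
# power 80%: |I| = 0.7 |H0|
# power 90%: |I| = 0.6 |H0|

# The correction factor triggers when |I| > 2|H0|, i.e., when z_beta < -z_a/2,
# which corresponds to power below Phi(-z_a/2):
print(f"trigger power: {norm.cdf(-z_a / 2):.1%}")  # about 16%
```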
Remark 5. In future work, we intend to use 1/8 likelihood support intervals as the basis for our second-generation p-values. This is easily achieved with standard software by using a 96% CI from a normal approximation when the underlying sampling distribution is symmetric (which is most often the case).

Remark 7. Figure S2 shows how the second-generation p-values were computed to color the rug plot. The estimated survival differences are plotted with their confidence intervals and the indifference zone (shaded region). The confidence interval on the difference in survival rates could be computed using asymptotic methods or a simple bootstrap. Here we used the variance of the predictions from a Cox proportional hazards model and assumed the two groups were independent. An alternative approach would be to estimate the baseline hazard using some other non-parametric method.
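The 96% figure in Remark 5 follows from standard likelihood theory: for a normal mean, the $1/k$ support interval is $\hat{\theta} \pm \sqrt{2\ln k}\,\sigma/\sqrt{n}$, so the $1/8$ support interval uses the critical value $\sqrt{2\ln 8} \approx 2.04$. A quick check under that normal approximation:

```python
import math
from scipy.stats import norm

k = 8
z = math.sqrt(2 * math.log(k))   # critical value of the 1/8 support interval
coverage = 2 * norm.cdf(z) - 1   # confidence level of the matching CI
print(f"z = {z:.3f}, coverage = {coverage:.3f}")  # z = 2.039, coverage = 0.959
```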
Remark 8. The 2×2 table examines a binary exposure's association, say smoking, with a binary outcome, say lung cancer. Imagine 100 smokers and 100 non-smokers, where 65 smokers and 50 non-smokers developed lung cancer. This is displayed in Table S1. While the conclusion is essentially the same, the degree to which the data are deemed "inconclusive" varied slightly. Importantly, a $p_\delta$ of 0 or 1 will not be affected by this choice.

Remark 9. Suppose we have the following linear regression model for HbA1c,
$$\text{HbA1c}_i = \beta_0 + \beta_1\,\text{weight}_i + \beta_2\,\text{waist}_i + \beta_3\,\text{triceps}_i + \epsilon_i,$$
with independent errors $\epsilon_i \sim N(0, \sigma^2)$. Taken together, weight, waist size, and triceps thickness represent the impact of body size on HbA1c. We can assess the contribution of body size to this model and remove these predictors if they do not contribute sufficiently. This is usually posed as a test of $H_0: \beta_1 = \beta_2 = \beta_3 = 0$.
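To illustrate Remark 8's example, the sketch below computes a second-generation p-value for the risk difference using a Wald confidence interval. The indifference zone of $\pm 0.05$ on the risk-difference scale is an assumed value for illustration only; the text does not specify one.

```python
import math
from scipy.stats import norm

# Table S1: 65 of 100 smokers and 50 of 100 non-smokers developed lung cancer.
p1, n1 = 65 / 100, 100
p0, n0 = 50 / 100, 100

diff = p1 - p0
se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)  # Wald standard error
z = norm.ppf(0.975)
lo, hi = diff - z * se, diff + z * se                    # 95% CI on the risk difference

# Assumed indifference zone: risk differences within +/- 0.05 are practically null.
null_lo, null_hi = -0.05, 0.05

overlap = max(0.0, min(hi, null_hi) - max(lo, null_lo))
p_delta = (overlap / (hi - lo)) * max((hi - lo) / (2 * (null_hi - null_lo)), 1.0)
print(f"CI = ({lo:.3f}, {hi:.3f}), p_delta = {p_delta:.3f}")
# CI = (0.015, 0.285), p_delta = 0.177 -- leaning alternative, but not conclusively
```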

Remark 10. Statistical and frequency properties of second-generation p-values
There are three cases to consider: the probability that the data are compatible with the alternative, $P(p_\delta = 0)$; the probability that the data are compatible with the null, $P(p_\delta = 1)$; and the probability that the data are inconclusive, $P(0 < p_\delta < 1)$. Note that we have three potential outcomes to consider instead of just two ("Reject the null" or "Fail to reject the null"). In Remarks 11 through 18, we examine the statistical properties of second-generation p-values when the sampling distribution of the estimator can be approximated by a normal distribution. This scenario covers a large majority of statistical applications, including method-of-moments and maximum likelihood estimation, as well as common non-parametric estimators in large samples.
Remark 11. Distributional assumptions: Let $\hat{\theta}_n$ be an estimator of a parameter $\theta$. We consider the case where the sampling distribution satisfies $\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, \sigma^2)$, where the variance $\sigma^2$ is known or can be readily estimated. This scenario reflects the core behavior of a large majority of statistical applications, such as method-of-moments estimation, maximum likelihood estimation, and some common non-parametric estimators in large samples, e.g., U-statistics.
Remark 12. Observing data compatible with the alternative hypothesis: How often will a given set of data indicate compatibility with the alternative hypothesis? This probability, $P(p_\delta = 0)$, is analogous to power. Since $p_\delta$ is 0 only when the intersection of the intervals is the empty set, it follows that
$$P(p_\delta = 0) = \Phi\!\left(\frac{(\theta_0 - \delta - \theta)\sqrt{n}}{\sigma} - z_{\alpha/2}\right) + \Phi\!\left(\frac{(\theta - \theta_0 - \delta)\sqrt{n}}{\sigma} - z_{\alpha/2}\right), \tag{S4}$$
where $\theta_0$ is the point null hypothesis and $\theta$ is the 'true' data-generating hypothesis. As expected, the 'power curve' is a function of $\delta$, the indifference zone margin.
When graphed, it looks like a power curve that was cut in half and pulled apart.

Remark 13. When $\theta = \theta_0$, Equation S4 is analogous to the Type I error rate, and it reduces to
$$P(p_\delta = 0) = 2\Phi\!\left(-\frac{\delta\sqrt{n}}{\sigma} - z_{\alpha/2}\right).$$
Note the dependence on the sample size $n$ and on $\delta$. Hence, the Type I error rate is bounded above by $2\Phi[-z_{\alpha/2}] = \alpha$. Moreover, it shrinks to 0 as the sample size approaches infinity, for any given $\delta > 0$. When $\delta = 0$, we recover the usual Type I error rate.
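The probabilities in Remarks 12 and 13 are straightforward to compute under the normal approximation of Remark 11. A minimal sketch (the parameter values in the loop are illustrative):

```python
from scipy.stats import norm

def prob_sgpv_zero(theta, theta0, delta, sigma, n, alpha=0.05):
    """P(p_delta = 0): the CI falls entirely outside [theta0 - delta, theta0 + delta],
    as in Equation S4."""
    z = norm.ppf(1 - alpha / 2)
    s = sigma / n ** 0.5
    below = norm.cdf((theta0 - delta - theta) / s - z)  # CI entirely below the zone
    above = norm.cdf((theta - theta0 - delta) / s - z)  # CI entirely above the zone
    return below + above

# At theta = theta0 this is the Type I error rate, 2 * Phi(-delta*sqrt(n)/sigma - z),
# which shrinks toward 0 as n grows; with delta = 0 it would stay at alpha = 0.05.
for n in (20, 100, 500):
    print(n, prob_sgpv_zero(theta=0.0, theta0=0.0, delta=0.2, sigma=1.0, n=n))
```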
Remark 14. Comparison with a Bonferroni adjustment: Figure S4 displays this comparison.

Remark 17. Observing data that are inconclusive: Perhaps the scourge of any study is inconclusive results. Here we detail the probability that the second-generation p-value is inconclusive. The probability of observing data that are inconclusive is
$$P(0 < p_\delta < 1) = 1 - P(p_\delta = 0) - P(p_\delta = 1)$$
when $\delta > z_{\alpha/2}\sigma/\sqrt{n}$, and $1 - P(p_\delta = 0)$ otherwise, since an interval estimate wider than the interval null can never be nested within it, making $p_\delta = 1$ impossible. Figure S6 displays this behavior. When the indifference zone is small relative to the intended precision, and the true hypothesis is in or near the indifference zone, the probability of inconclusive results is high. As the indifference zone widens, the probability of inconclusive results remains high at its edges when the truth is also at the edges, but it drops rapidly near the middle of the zone, where such results would indicate compatibility with the null hypothesis. The take-home message is that data will tend to be inconclusive when the truth is near the edges of the indifference zone. The practical solution is to use an indifference zone that is neither too large nor too small, which is, of course, much easier said than done.

Remark 20. In the example used in the paper, the FDR and FCR for second-generation p-values are smaller than their hypothesis-testing counterparts. This is generally true for the FDR when the multiple comparisons being made have varying standard errors. However, it is possible for the FCR to be larger than the false non-discovery rate. This happens for hypotheses inside the null interval when the sample size is very large; as such, it is not of consequence. By design, hypotheses within the indifference zone are not detectable by second-generation p-values. We believe this can be addressed by allowing the null interval to shrink at a rate slower than the interval estimate, but this will be detailed elsewhere. Figure S7 displays the FDR and FCR as the sample size changes. Note that the FCR is undefined in the first plot because the sample size is too small to permit nesting of the interval estimate within the interval null hypothesis.

Figure S5: The relationship between the probability of data-supported compatibility with the null hypothesis, $P(p_\delta = 1)$, and various values of $\delta$. The black line represents $\delta = 0$, the traditional point null hypothesis. The orange line represents $\delta = 1/30 \approx 0.03$, a very small indifference zone relative to the observed precision. The green line represents $\delta = 1/2$ and the blue line represents $\delta = 1$, which are two larger indifference zones. The graph on the left has a smaller sample size, while the graph on the right has a larger sample size.

Figure S6: The relationship between the probability of inconclusive results, $P(0 < p_\delta < 1)$, and various values of $\delta$. The black line represents $\delta = 0.008$, which is very close to the traditional point null hypothesis. The orange line represents $\delta = 1/30 \approx 0.03$, a very small indifference zone relative to the observed precision. The green line represents $\delta = 1/2$ and the blue line represents $\delta = 1$, which are two larger indifference zones. The graph on the left has a smaller sample size, while the graph on the right has a larger sample size.

Figure S7: Illustrations of the false discovery rate (red) and false confirmation rate (blue) for second-generation p-values (solid lines). The false discovery rate (red) and false non-discovery rate (blue) from a comparable hypothesis test are shown as dotted lines. This example uses $\sigma = 1$, $\alpha = 0.05$, $\delta = 1/2$, and $n = 5, 20, 60, 100$.
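Complementing Figures S5 and S6, here is a self-contained sketch of the piecewise computation in Remark 17, with the nesting probability $P(p_\delta = 1)$ derived under the same normal approximation:

```python
from scipy.stats import norm

def sgpv_outcome_probs(theta, theta0, delta, sigma, n, alpha=0.05):
    """Return (P(p_delta = 0), P(p_delta = 1), P(0 < p_delta < 1)) under the
    normal approximation. Nesting of the CI inside the indifference zone is
    impossible when the CI half-width z*sigma/sqrt(n) exceeds delta."""
    z = norm.ppf(1 - alpha / 2)
    s = sigma / n ** 0.5
    # CI entirely outside the indifference zone (Equation S4):
    p0 = norm.cdf((theta0 - delta - theta) / s - z) + \
         norm.cdf((theta - theta0 - delta) / s - z)
    # CI entirely inside the indifference zone:
    if delta > z * s:
        p1 = norm.cdf((theta0 + delta - theta) / s - z) - \
             norm.cdf((theta0 - delta - theta) / s + z)
    else:
        p1 = 0.0
    return p0, p1, 1.0 - p0 - p1

# Truth at the edge of the indifference zone: inconclusive results dominate.
print(sgpv_outcome_probs(theta=0.5, theta0=0.0, delta=0.5, sigma=1.0, n=100))
# approximately (0.025, 0.025, 0.950)
```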