New relevance and significance measures to replace p-values

The p-value has been debated extensively in recent decades, drawing fierce critique but also finding some advocates. The fundamental issue with its misleading interpretation stems from its common use for testing the unrealistic null hypothesis of an effect that is precisely zero. A meaningful question asks instead whether the effect is relevant. It is then unavoidable that a threshold for relevance is chosen. Considerations that can lead to agreeable conventions for this choice are presented for several commonly used statistical situations. Based on the threshold, a simple quantitative measure of relevance emerges naturally. Statistical inference for the effect should be based on the confidence interval for the relevance measure. A classification of results that goes beyond a simple distinction like “significant / non-significant” is proposed. On the other hand, if desired, a single number called the “secured relevance” may summarize the result, as the p-value does, but with a scientifically meaningful interpretation.


Introduction
The p-value is arguably the most used and most controversial concept of applied statistics. Blume et al. [1] summarize the endless debate about its flaws as follows: "Recurring themes include the difference between statistical and scientific significance, the routine misinterpretation of non-significant p-values, the unrealistic nature of a point null hypothesis, and the challenges with multiple comparisons." They nicely collect 14 citations, and I refrain from repeating their introduction here, but complement the analysis of the problem and propose a solution that both simplifies and extends theirs.
The basic cause of the notorious lack of reliability of empirical research, notably in parts of social and medical science, can be found in the failure to ask scientific questions in a sufficiently explicit form, and the p-value problem is intrinsically tied to this flaw. Here is my argument.
Most empirical studies focus on the effect of some treatment, expressed as the difference of a target variable between groups, or on the relationship between two or more variables, often expressed with a regression model. Inferential statistics needs a probabilistic model that describes the scientific question. Usually, this is a parametric model.
In "ancient" times, before the computer produced p-values readily, statisticians examined the test statistics and then compared them to tables of "critical values." In the widespread case of the t-test, they used the t statistic as an informal quantitative measure of significance of an effect by comparing it to the number 2, which is approximately the critical value for moderate to large numbers of degrees of freedom. This comparison will reappear in the proposed significance measure.
In a similar way, the proposed quantitative measure of relevance divides the effect by a meaningful threshold, and a value above 1 indicates a relevant effect. In contrast to significance, relevance is a parameter of the model. As such, it is estimated on the basis of the observations, and a confidence interval for it should be determined. The lower end of this confidence interval will be proposed as a single most interpretable characteristic.
This quantitative measure of relevance is most generally interpretable if applied to a suitable way of expressing the effect of interest. This leads to standardizing or transforming model parameters in order to determine an appropriate "effect scale." The idea of effect scale is partly parallel and partly alternative to the "effect size" definitions that are popular in quantitative psychology [8,9]. Section 3 proposes these scales for the most commonly used statistical models.
The suitable relevance threshold should be determined in the context of the scientific question. As a professional statistician, I prefer to leave the choice to the scientist who formulates this question. As a consultant, I appreciate the hurdle that this desideratum poses to the practical application of the concept of the relevance measure and give in to providing a recommendation that can be used as a starting point and default (Section 5).

Definitions
The simplest case for statistical inference is the estimation of a constant based on a sample of normal observations. It directly applies to the estimation of a difference between two treatments using paired observations. I introduce the new concepts first for this situation. The application of the concepts for typical situations (comparison of two samples, estimation of proportions, simple regression and correlation) will be discussed in Section 3 and extended to a general parametric model and to multiple regression in Section 4.

The generic case
Consider a sample of n statistically independent observations Y_i with a normal distribution,
$$ Y_i \sim \mathcal{N}(\mu, \sigma^2). $$
The interest is in the effect parameter ϑ = μ, and more specifically, in knowing whether ϑ is different from 0 in a relevant manner, where relevance is determined by the relevance threshold ζ > 0. Thus, one wants to summarize the evidence for the hypotheses
$$ H_1: \vartheta > \zeta \quad \text{against} \quad H_0: \vartheta \le \zeta. $$
(The symbol ζ, pronounced "zeta," delimits the "zero" hypothesis.)
2.1.1 One sided. I consider a one-sided hypothesis here. In practice, only one direction of the effect is usually plausible and/or of interest. Even if this is not the case, the conclusion drawn will be one-sided: If the estimate turns out to be significant according to the two-sided test for 0 effect, then nobody will conclude that "the effect is different from zero, but we do not know whether it is positive or negative." Therefore, in reality, two one-sided tests are conducted, and technically speaking, a Bonferroni correction is applied by using the level α/ 2 = 0.025 for each of them. Thus, I treat the one-sided hypothesis and use this testing level.
The point estimate and confidence interval are
$$ \hat\vartheta = \bar Y = \frac{1}{n}\sum_i Y_i, \qquad \hat\vartheta \pm \hat\omega, \quad \hat\omega = q\,\sqrt{\hat V/n}, $$
where V̂ is the empirical variance of the sample,
$$ \hat V = \frac{1}{n-1}\sum_i (Y_i - \bar Y)^2, $$
and q is the 1−α/2 = 0.975 quantile of the appropriate t distribution. Thus, ω̂ is half the width of the confidence interval and equals the standard error, multiplied by the quantile.
2.1.2 Remark. The choice of the test level, α, is arbitrary in principle. In some fields, α = 0.01 is common, but α = 0.05 is clearly the most popular choice and ubiquitous in many fields. It is straightforward to adjust all concepts introduced here to any α.
2.1.3 Significance. The proposed significance measure compares the difference between the estimated effect and the relevance threshold with the half width of the confidence interval,
$$ \mathrm{Sig}_\zeta = (\hat\vartheta - \zeta)/\hat\omega. $$
The effect is statistically significantly larger than the threshold if and only if Sig_ζ > 1.
Significance can also be calculated for the common test for zero effect, Sig_0 = ϑ̂/ω̂. This quantity can be listed in computer output in the same manner as the p-value is given in today's programs, without a requirement to specify ζ. It is much easier to interpret than the p-value, since it is, for a given precision expressed by ω̂, proportional to the estimated effect ϑ̂. Furthermore, a standardized version of the confidence interval for the effect is Sig_0 ± 1. Nevertheless, it should be clear from the Introduction that Sig_0 should only be used with extreme caution, since it does not reflect relevance.
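The quantities defined so far can be computed in a few lines. The following is a minimal sketch for a single sample, not code from the paper: the function names are illustrative, the data are made up, and the t quantile q is passed in by the user because the Python standard library offers no t quantile function.

```python
import math
import statistics

def estimate_and_halfwidth(sample, q):
    """Return the effect estimate and the half-width of its confidence interval.

    q is assumed to be the 0.975 quantile of the t distribution with
    n - 1 degrees of freedom, supplied by the caller.
    """
    n = len(sample)
    estimate = statistics.mean(sample)
    variance = statistics.variance(sample)  # empirical variance, divisor n - 1
    return estimate, q * math.sqrt(variance / n)

def sig(estimate, halfwidth, zeta=0.0):
    """Significance measure Sig_zeta; zeta = 0 gives Sig_0."""
    return (estimate - zeta) / halfwidth

# Made-up illustration: five observations, t quantile for 4 degrees of freedom.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
est, omega = estimate_and_halfwidth(sample, q=2.776)
sig0 = sig(est, omega)
```

A value of `sig0` above 1 corresponds to a significant result of the usual two-sided t-test for zero effect.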

2.1.4 Relevance. An extremely simple and intuitive quantitative measure of relevance is the effect, expressed in ζ units,
$$ \mathrm{Rl} = \vartheta/\zeta. $$
Its point and interval estimates are Rle = ϑ̂/ζ and [Rls, Rlp], where
$$ \mathrm{Rls} = (\hat\vartheta - \hat\omega)/\zeta, \qquad \mathrm{Rlp} = (\hat\vartheta + \hat\omega)/\zeta. $$
The lower end of the confidence interval is called the "secured relevance," Rls, and the upper end, the "potential relevance," Rlp. The effect is called relevant if Rls > 1, that is, if the estimated effect is significantly larger than the threshold. The estimated relevance Rle is related to Sig_ζ by
$$ \mathrm{Rle} = \mathrm{Sig}_\zeta\,\hat\omega/\zeta + 1. $$
Consider an estimated gain of ϑ̂ = 140 minutes of sleep with half-width ω̂ = 86. If the relevance threshold is one hour of extra sleep, ζ = 60, then Sig_ζ = 80/86 = 0.93, and the gain is not significantly relevant. This is also seen when calculating the relevance and its confidence interval, Rle = 140/60 = 2.33 and Rls = 2.33 − 86/60 = 54/60 = 0.90, Rlp = 2.33 + 86/60 = 226/60 = 3.77. It therefore remains unclear whether the sleep prolongation is relevant. Fig 1 shows the results graphically.
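The arithmetic of the sleep example can be retraced with a small helper. This is a sketch under the assumption that the estimate (140), the half-width (86) and the threshold (60) are all in minutes; the function name and the dictionary keys are illustrative.

```python
def relevance(estimate, halfwidth, zeta):
    """Estimated, secured and potential relevance, plus Sig_zeta."""
    return {
        "Rle": estimate / zeta,                   # estimated relevance
        "Rls": (estimate - halfwidth) / zeta,     # secured relevance (lower end)
        "Rlp": (estimate + halfwidth) / zeta,     # potential relevance (upper end)
        "Sig_zeta": (estimate - zeta) / halfwidth,
    }

# Numbers quoted in the text: estimate 140, half-width 86, threshold 60.
res = relevance(140.0, 86.0, 60.0)
```

Here `res["Rls"]` is below 1 while `res["Rlp"]` is above 1, which matches the conclusion that the relevance of the gain remains unclear.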

Related concepts
2.2.1 Two One-Sided Tests (TOST). Lakens [5] focusses on testing for a negligible effect, advocating the paradigm of equivalence testing. He considers an interval of values that are negligibly different from the point null hypothesis, also called a "thick" or "interval null" [1,6]. If this interval is denoted as |ϑ| ≤ ζ, there is a significantly negligible effect if both hypotheses ϑ > ζ and ϑ < −ζ are rejected using a one-sided test for each of them. A respective p-value is the larger of the p-values for the two tests.
I have argued for a one-sided view of the scientific problem. With this perspective, the idea reduces to a single one-sided test for a negligible effect with significance measure −Sig_ζ.
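By the usual duality between tests and confidence intervals, and assuming each one-sided test uses the level α/2 adopted in this paper, both TOST hypotheses are rejected exactly when the interval ϑ̂ ± ω̂ lies inside (−ζ, ζ). The sketch below contrasts this with the one-sided version; the function names are illustrative.

```python
def negligible_two_sided(estimate, halfwidth, zeta):
    """TOST-style decision: the whole confidence interval lies in (-zeta, zeta)."""
    lower, upper = estimate - halfwidth, estimate + halfwidth
    return -zeta < lower and upper < zeta

def negligible_one_sided(estimate, halfwidth, zeta):
    """One-sided decision: -Sig_zeta > 1, i.e. the upper end is below zeta."""
    return -(estimate - zeta) / halfwidth > 1

# A strongly negative estimate is "negligible" only in the one-sided view.
two_sided = negligible_two_sided(-50.0, 30.0, 60.0)
one_sided = negligible_one_sided(-50.0, 30.0, 60.0)
```

The two decisions differ only when the confidence interval reaches below −ζ.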

Second Generation P-value. The "Second Generation P-Value" (SGPV) P_δ has been introduced by Blume et al. [1,11]. In the present notation, ζ is their δ. The definition of P_ζ starts from considering the length O of the overlap of the confidence interval with the interval defined by the composite null hypothesis H_0. Assume first that ϑ̂ > 0. Then, the overlap measures O = 2ω̂ if the confidence interval lies within the "null interval," that is, if ϑ̂ + ω̂ < ζ, and otherwise, O = ζ − (ϑ̂ − ω̂), or 0 if this is negative.
The definition of P_ζ distinguishes two cases based on comparing ω̂ to the threshold ζ. If ω̂ < 2ζ, P_ζ = 0 if there is no overlap, and P_ζ = 1 for complete overlap, O = 2ω̂. In between, the SGPV is the overlap, compared to the length of the confidence interval,
$$ P_\zeta = \frac{O}{2\hat\omega} = \frac{\zeta - \hat\vartheta + \hat\omega}{2\hat\omega} = \frac{1 - \mathrm{Sig}_\zeta}{2}. $$
In this case, then, P_ζ is a rescaled, mirrored, and truncated version of the significance at ζ. Here, I have neglected a complication that arises when the confidence interval covers values below −ζ. The definition of P_ζ starts from a two-sided formulation of the problem, H_0: |ϑ| < ζ. Then, the confidence interval can also cover values below −ζ. In this case, the overlap decreases and P_ζ changes accordingly.
The definition of P_ζ changes if the confidence interval is too large, specifically, if its half-width ω̂ exceeds 2ζ, so that its length exceeds twice the length of the null interval. This comes again from the fact that it was introduced with the two-sided problem in mind. In order to avoid small values of P_ζ caused by a large denominator 2ω̂ in this case, the length of the overlap O is divided by twice the length 2ζ of the "null interval," instead of the length of the confidence interval, 2ω̂, giving P_ζ = O/(4ζ). Then, P_ζ has a maximum value of 1/2, which is a deliberate consequence of the definition, as this value does not suggest a "proof" of H_0. For a comparison of the SGPV with TOST, see [12].
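The case distinctions just described can be collected in one function. This is a sketch of the two-sided definition with null interval [−ζ, ζ], written in the notation used here; it is not code from Blume et al., and the function name is illustrative.

```python
def sgpv(estimate, halfwidth, zeta):
    """Second Generation P-Value for the null interval [-zeta, zeta]."""
    lower, upper = estimate - halfwidth, estimate + halfwidth
    # Length of the overlap of the confidence interval with the null interval.
    overlap = max(0.0, min(upper, zeta) - max(lower, -zeta))
    ci_length = 2.0 * halfwidth
    null_length = 2.0 * zeta
    if ci_length > 2.0 * null_length:
        # Very wide interval: divide by twice the null length, capping at 1/2.
        return overlap / (2.0 * null_length)
    return overlap / ci_length

# Sleep example from the text: estimate 140, half-width 86, threshold 60.
p_sleep = sgpv(140.0, 86.0, 60.0)
```

For the sleep example, `p_sleep` equals (1 − Sig_ζ)/2 with Sig_ζ = 80/86, illustrating the "rescaled, mirrored" relation to the significance measure.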
If the overlap is empty, P_ζ = 0. In this case, the concept of SGPV is supplemented with the notion of the "δ gap," the distance between the confidence interval and the null interval in units of ζ, (ϑ̂ − ω̂ − ζ)/ζ = Rls − 1.
Since the significance and relevance measures are closely related to the Second Generation P-Value and the δ gap, one might ask why new measures should still be introduced. Here is why:
• An explicit motivation for the SGPV was that it should resemble the traditional p-value by being restricted to the 0-1 interval. I find this quite undesirable, as it perpetuates the misinterpretation of P as a probability. Even worse, the new concept is further removed from such an interpretation than the old one, for which the problem "Find a correct statement including the terms p-value and probability" still has a (rather abstract) solution.
• The new p-value was constructed to share with the classical one the property that small values signal a large effect. This is a counter-intuitive aspect that leads to confusion for all beginners in statistics. In contrast, larger effects lead to larger significance (and, of course, larger relevance).
• Taking these arguments together, the problems with the p-value are severe enough to prefer a new concept with a new name and more direct and intuitive interpretation over advocating a new version of p-value that will be confused with the traditional one.
• The definition of the SGPV is unnecessarily complicated, since it is intended to correspond to the two-sided testing problem, and only quantifies the undesirable case of ambiguous results. It deliberately avoids quantifying the strength of evidence in the two cases in which either H_0 or H_1 is accepted.

Classification of results
There is a wide consensus that statistical inference should not be reported simply as "significant" or "non-significant." Nevertheless, communication needs words. I therefore propose to distinguish the cases that the effect is shown to be relevant (Rlv), that is, H_1: ϑ > ζ is "statistically proven," or negligible (Ngl), that is, H_0: ϑ ≤ ζ is proven, or the result is ambiguous (Amb), based on the significance measure Sig_ζ or on the secured and potential relevance Rls and Rlp (Rls > 1 for Rlv, Rlp < 1 for Ngl and Rls ≤ 1 ≤ Rlp for Amb).
2.3.1 Remark. Kruschke [7] distinguishes the same cases on the basis of the Bayesian Posterior Highest Density Interval and calls them "Reject Null Value," "Accept Null Value," and "Undecided," examining, however, a two-sided "Region of Practical Equivalence."
For a finer classification, the significance for a zero effect, Sig_0, is also taken into account. This may even lead to a contradiction (Ctr) if the estimated effect is significantly negative. Fig 2 shows the different cases with corresponding typical confidence intervals, and Table 1 lists the respective significance and relevance ranges. Similar figures have appeared in [1, Fig 2] and [6, Fig 1] and before, with different interpretations.
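The coarse classification can be written as a short decision rule. The sketch below covers only the cases spelled out in the text (Rlv, Ngl, Amb, and Ctr for a significantly negative estimate, which for ζ > 0 is equivalent to Rlp < 0); the finer subdivision of Table 1 is not reproduced, and the function name is illustrative.

```python
def classify(rls, rlp):
    """Classify a result from its secured (rls) and potential (rlp) relevance."""
    if rls > 1:
        return "Rlv"  # relevant: the whole interval lies above the threshold
    if rlp < 0:
        return "Ctr"  # contradiction: the effect is significantly negative
    if rlp < 1:
        return "Ngl"  # negligible: the whole interval lies below the threshold
    return "Amb"      # ambiguous: the interval contains the threshold

# Sleep example: Rls = 0.90, Rlp = 3.77.
sleep_case = classify(0.90, 3.77)
```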

The two-sample problem
The usual model for comparing two treatments arises when x_i = 1 if observation i received one treatment, and x_i = −1 for the other treatment. (The code for the second group is −1 instead of 0 since this choice fits better with the standardized coefficient of linear regression to be treated below.) Then,
$$ Y_i \sim \mathcal{N}(\mu_0 + \theta x_i, \sigma^2). $$
The effect parameter θ is half the difference of expected values between the two groups, whereas μ_0 and σ are nuisance parameters.

Effect scale.
In several models, it appears useful to consider a transformed version of the parameter of interest as the effect, since the transformation leads to a more generally interpretable measure and may have more appealing properties, as in the next subsection. Therefore, the original parameter of interest is denoted as θ, or by the symbol customary in the model, and the transformed version will be considered as the effect, ϑ = g(θ).

Standardization.
In the case of two samples, it is very popular to standardize the difference between the groups in order to make it independent of any unit of measurement, leading to Cohen's d [13], which is, in the present notation, d = 2θ/σ. In the same way, the effect size is introduced here as
$$ \vartheta = \theta/\sigma = d/2. $$
Note that standardization with the variation σ of the target variable within groups makes good sense if σ measures the natural variation between observation units. It is less well justified if it includes measurement error, since this would change if more precise measurements were obtained, for example, by averaging over several repeated measurements. In this case, the standardized effect is not defined by the scientific question alone, but also by the study design.
Even though d and ϑ have been introduced in the two samples framework, they also apply to a single sample, since the effect in this case is the difference between its expected value and a potential population that has an expectation of zero. In this case, ϑ = d. Remember that the effect is defined as a function of parameters, not of their estimates.

Table 1. Classification of cases defined by ranges of significance and relevance measures. s and r are the place holders for the column headings. Columns: Case, Sig_0, Sig_ζ, Rls, Rlp.
Coming back to the paired observation case (Section 2), note that the standard deviation measures the variability of differences rather than of the observations of the variable under study, and this will often be inappropriate. This shows that standardization may be misleading in the sense that the standardized effect does not reflect an aspect of the scientific question alone but also depends on the study design and the estimator used (see [14], p.396).

Log scale.
Most quantitative target variables in the exact and life sciences are measurements that cannot be negative, and for which differences are naturally expressed as percentages, that is, effects are best described by proportions. Such variables have been called "amounts" by the great promoter of applied statistics John W. Tukey, and he strongly recommended expressing them in terms of logarithms, calling this the "first aid transformation" for such variables. On this scale, the variables usually fulfill assumptions of equal variances, of normal or at least symmetrical distributions, and of linear relationships much better than on their original scale of measurement. In other words, such variables often show a log-normal distribution on their original scale, and effects of treatments turn out to be multiplicative. Therefore, the log transformation turns them into normally distributed variables and the effects into additive ones [15].
A further advantage of the log scale is that differences become independent of any unit of measurement, and effects are directly comparable. An increase by 5% turns into an additive effect of about 0.05, and generally, an increase of p% turns into log(1 + p/100) (which is ≈ p/100 for small p). Therefore, no standardization relating to any variabilities is needed.
3.1.4 Log-percent. When using percentages or ratios, it is often arbitrary which of the two numbers is taken as the reference. If one is 25% larger than the other, then the other is 20% smaller than the first. This asymmetry is a nuisance that disappears on the log scale. Therefore, let the "log-percent" scale for relative effects be defined as 100·ϑ, ϑ = log(μ_1) − log(μ_0), and indicate it as, e.g., 22.3%ℓ. For small percentages, the ordinary "percent change" and the "log-percent change" are approximately equal. The new scale has the advantage of being symmetric in the two values generating the change, and therefore, the discussion whether to use the first or the second as a basis is obsolete. A change by 100%ℓ equals an increase of 100% (e − 1) = 172% in ordinary percent, or a decrease by 100% (1 − 1/e) = 63% in the reverse direction.
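The symmetry of the log-percent scale is easy to check numerically. The following sketch uses illustrative function names and the 25% example from the text.

```python
import math

def log_percent(new, reference):
    """Relative change on the log-percent scale, 100 * (log(new) - log(reference))."""
    return 100.0 * (math.log(new) - math.log(reference))

def to_ordinary_percent(lp):
    """Convert a log-percent change back to an ordinary percent change."""
    return 100.0 * (math.exp(lp / 100.0) - 1.0)

up = log_percent(125.0, 100.0)    # 25% larger
down = log_percent(100.0, 125.0)  # 20% smaller, the same change reversed
```

The two values `up` and `down` differ only in sign, whereas the ordinary percent changes, 25% and −20%, do not.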

Inference.
Note that all these considerations regard parameters of the model and do not depend on the methods used to estimate them from observations. Effects are estimated by replacing model parameters by estimators in their defining equations, leading to point and interval estimates.

Proportions
When a proportion is estimated, the model is, using B to denote the binomial distribution,
$$ Y \sim B(n, p). $$
Considering variations in the probability parameter p, a difference of 0.05 clearly has different relevance along the range of values of the parameter: It may be plausible to say that a change from p to p + 0.05 for p = 0.5 (i.e., from 0.5 to 0.55) is barely relevant, but if p = 0.05 or below, the difference is large, and for p > 0.95, it is even impossible.

Log odds.
For good and well known reasons, probabilities are often expressed as odds or log odds, also known as the "logit transformation." Let
$$ \vartheta = \log\frac{p}{1-p}. $$
The difference between p = 0.5 and 0.55 corresponds to a difference of 0.2 on the log odds scale. The same difference on this effect scale results between p = 0.1 and p = 0.12 and, also for smaller probabilities p, when one is about 20% larger than the other. In fact, for low probabilities, common in the assessment of risks, log odds turn into simple logarithms, and differences of logs correspond to relative differences on the original scale. Thus, generally, equal differences of log odds appear intuitively quite comparable in relevance on the original scale, and effects on proportions should be measured on this effect scale. (Note again that the problem of estimating the effect has not been considered yet.)

Comparing two proportions.
Log-odds are again suitable for a comparison between two proportions p_0 and p_1. They lead to the log-odds ratio,
$$ \vartheta = \log\frac{p_1/(1-p_1)}{p_0/(1-p_0)}. $$
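The numerical claims about the log-odds scale can be verified directly. The sketch below uses illustrative function names and the probabilities quoted in the text.

```python
import math

def log_odds(p):
    """Logit transformation of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def log_odds_ratio(p1, p0):
    """Effect for comparing two proportions on the log-odds scale."""
    return log_odds(p1) - log_odds(p0)

mid_range = log_odds_ratio(0.55, 0.50)   # from 0.5 to 0.55
low_range = log_odds_ratio(0.12, 0.10)   # from 0.1 to 0.12
```

Both differences are close to 0.2, although the differences on the probability scale are 0.05 and 0.02.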

Simple regression and correlation
3.3.1 Normal response. In applications of the common simple regression model,
$$ Y_i = \alpha + \beta x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), $$
the slope is almost always the parameter of interest, θ = β. It measures the change in the target variable Y evoked by a change δ_X = 1 in the input variable X. For a standardized measure of the effect, a suitable step δ_X, independent of X's unit of measurement, should be chosen, and the change in Y should also be standardized. The well known standardized coefficient β* uses the empirical standard deviation s_X as δ_X and the (marginal) standard deviation s_Y for the standardization, β̂* = β̂ s_X/s_Y. Here, I prefer to measure the effect in units of the error standard deviation σ, since s_Y is not a model parameter (unless X is modeled as a random variable), but depends on the set of x_i's for which observations are obtained, that is, on the design. Therefore, the effect measure is
$$ \vartheta = \beta\,\delta_X/\sigma. $$
In the case of a binary input variable X, the regression model is equivalent to the two groups problem treated above, and setting δ_X = 1 leads to the effect measure introduced there (if the two values are coded as 1 and −1). For X's with more values and in the absence of a more natural alternative, the "standard step" δ_X should be proportional to a measure of scatter of X values, and s_X is the straightforward choice. Note that for a binary variable with equal group sizes and codes 1 and −1, s_X = 1, and the two definitions match. (This was the reason for introducing these codes in the two groups case.)
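To illustrate the effect measure for simple regression, the following sketch plugs least-squares estimates into ϑ = β δ_X/σ. The function name and the data are made up for illustration; the residual standard deviation uses the divisor n − 2, and δ_X defaults to the empirical standard deviation of x (divisor n − 1).

```python
import math
import statistics

def regression_effect(x, y, delta_x=None):
    """Plug-in estimate of the effect beta * delta_x / sigma in simple regression."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx                       # least-squares slope
    alpha = mean_y - beta * mean_x         # least-squares intercept
    rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    sigma = math.sqrt(rss / (n - 2))       # residual standard deviation
    if delta_x is None:
        delta_x = statistics.stdev(x)      # "standard step" s_X
    return beta * delta_x / sigma

# Two groups coded -1 and 1, with group means 2 and 6.
x = [-1, -1, -1, 1, 1, 1]
y = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
effect = regression_effect(x, y, delta_x=1.0)
```

With the ±1 coding and δ_X = 1, the slope is half the difference of the group means, so `effect` is the plug-in version of θ/σ from the two-sample problem.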

Remark.
By setting δ_X = s_X, the standardization depends on the values x_i for which observations are obtained. In experiments, these are chosen by the experimenter, and the effect measure then does not describe a parameter of the process under examination alone, which is an undesirable feature in the spirit of this paper.

Other regression models.
For a binary response variable Y, logistic regression provides the most well established and successful model. It reads
$$ \log\frac{P(Y_i = 1)}{P(Y_i = 0)} = \alpha + \beta x_i. $$
The parameter of interest is again β. The considerations for proportions extend directly to this model, and the effect scale is ϑ = βδ_X with the same arguments for choosing δ_X as for ordinary regression. The same is true for proportional odds linear regression (POLR) for an ordered target variable.
In Poisson regression for frequency or count data, the link function connecting the linear predictor α + βx_i to the expected value of the target variable Y is the logarithm, which again needs no standardization, and the same simple definition of the effect scale applies. Finally, even for the models commonly used for survival data, i.e. Weibull or Cox regression, the log link function is used to connect the linear predictor to the hazard function of the target variable, and the effect scale is the same as before.

Correlation.
Before displaying formulas for a correlation, let us discuss its suitability as an effect. The related question is: "Is there a (monotonic, or even linear) relationship between the variables Y^(1) and Y^(2)?" According to the basic theme, we need to insert the word "relevant" into this question. But this does not necessarily make the question relevant. What would be the practical use of knowing that there is a relationship? It may be that
• there is a causal relationship; then, the problem is one of simple regression, as just discussed, since the relationship is then asymmetric, from a cause X to a response Y;
• one of the variables should be used to infer ("predict") the values of the other; again a regression problem;
• in an exploratory phase, the causes of a relationship may be indirect, both variables being related to common causes, and this should lead to further investigations; this is then a justified use of the correlation as a parameter, which warrants its treatment here.
The Pearson correlation is
$$ \rho = \frac{\Sigma_{12}}{\sqrt{\Sigma_{11}\,\Sigma_{22}}}, $$
with Σ_jk denoting the covariance of Y^(j) and Y^(k). A suitable effect scale is given by Fisher's well-known transformation
$$ \vartheta = \frac{1}{2}\log\frac{1+\rho}{1-\rho}, $$
which extends the limited range of values of ρ to all real numbers, as the logit does in the case of proportions. When large correlations are compared, the effect as measured by the difference of ϑ values is approximately ϑ = ϑ_1 − ϑ_0 ≈ ½ log((1 − ρ_0)/(1 − ρ_1)), that is, it compares the complements to the correlation on a relative (logarithmic) scale. For correlations around zero, the effect turns out to be approximately equal to the correlation itself and to the effect for the regression coefficient.
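The two approximations mentioned for Fisher's transformation can be checked numerically. In the sketch below the function names are illustrative, and the correlations 0.95 and 0.99 are arbitrary examples of "large" values.

```python
import math

def fisher_z(rho):
    """Fisher's transformation, 0.5 * log((1 + rho) / (1 - rho))."""
    return math.atanh(rho)

def correlation_effect(rho1, rho0):
    """Difference of two correlations on the Fisher scale."""
    return fisher_z(rho1) - fisher_z(rho0)

def large_correlation_approx(rho1, rho0):
    """Approximation for large correlations: compares 1 - rho on a log scale."""
    return 0.5 * math.log((1.0 - rho0) / (1.0 - rho1))

exact = correlation_effect(0.99, 0.95)
approx = large_correlation_approx(0.99, 0.95)
near_zero = fisher_z(0.1)
```

The values `exact` and `approx` agree to about 0.01, and `near_zero` is practically equal to the correlation 0.1 itself.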

General multivariate effects and multiple regression
This section is technically more involved. Readers are encouraged to continue with Section 5 in a first run.

The general model
The models just discussed are special cases of the general parametric model
$$ Y_i \sim F(\theta, \phi;\, x_i), \qquad (7) $$
where θ is the parameter of interest, ϕ denotes nuisance parameters, and the distribution F may vary between observations depending on covariates x_i. The parameters and covariates may be multidimensional. Interest is in a suitable function ϑ = g(θ) that turns the parameter of interest into the effect as measured on the "effect scale" if desired. Of course, ϑ may be θ without transformation. There is typically a value ϑ_0, and the question is if the true ϑ differs from it to a relevant extent. If ϑ is one-dimensional, the interest is in differences in one direction, ϑ > ϑ_0, say, and there is a threshold ζ defining the relevance.

Effect norm.
In the general case, θ is multidimensional and the interest is in a function ϑ = g(θ). In the regression case to be discussed below, g will just select components of θ.
A plausible general way to formalize the relevance for a p-dimensional ϑ is based on a matrix Q that defines the norm η, and the question is if η exceeds the relevance threshold ζ. The natural choice of Q is then
$$ Q = B\,J(\theta, \phi)\,B^{\top}, \qquad B = \partial\vartheta/\partial\theta. $$

Inference.
As mentioned in Section 3.1, these considerations only concern parameters and therefore, estimation methods are needed to get point and interval estimates in applications. Since such estimators (θ̂, ϕ̂) usually are approximately multivariate normal, pη² then follows approximately a chi-squared distribution or a mixture of scaled chi-squares.

Multidimensional effect?
Note that in this treatment of the problem, the alternative hypothesis is no longer one-sided for the parameter of interest itself (although it is for η), since there is no natural ordering in the multivariate space. This shows an intrinsic difficulty of the present approach for multivariate effects. However, the limitation mirrors the difficulty of asking scientifically relevant questions to begin with: What would be an effect that leads to new scientific insight?
In order to fix ideas, let us consider a regression model with a multivariate target variable. For example, Y may be a characterization of color or of shape, and the multivariate regression model may describe the effect of a treatment on the expected value of Y. In the case of a single predictor, e.g., in a two-groups situation, the parameter of interest θ has a direct interpretation as the difference of colors, shapes or the like, and a range of relevant differences may be determined using a norm that characterizes distinguishable colors or shapes, which will be different from V. In more general situations, it seems difficult to define the effect in a way that leads to a practical interpretation.
If the target variable Y measures different aspects of interest, like quality, robustness and price of a product or the abundance of different species in an environment, the scientific problem itself is a composite of problems that should be regarded in their own right and treated as univariate problems in turn.
A more common situation where there is an intrinsically multidimensional effect comes up in regression for a single target variable with categorical predictor variables, to be discussed now.

Multiple regression and analysis of variance
In the multiple regression model, the predictor is multivariate,
$$ Y_i = x_i^{\top}\beta + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2). \qquad (10) $$
The model also applies to (fixed effects) analysis of variance or general linear models, where a categorical predictor variable (often called a factor) leads to a group of components ("dummy variables") in the predictor vector x_i and correspondingly in the coefficient vector β.
Since we set out to ask scientifically relevant questions, a distinction must be made between two fundamentally different situations in which the model is proposed.
(a). In technical applications, the x values are chosen by the experimenter and are therefore fixed numbers. Then, a typical question is whether changing the values from an x 0 to x 1 evokes a relevant change in the target variable Y. This translates into the relevance of a single coefficient β j or of several of them.
(b). In other fields of applications, the values of the predictor variables are often also random, and there is a joint distribution of X and Y. A very common type of question asks whether a predictor variable or a group of them have a relevant influence on the target variable. The naive interpretation of influence here is that, as in the foregoing situation, an increase of the variable X (j) by one unit leads to a change given by β j in the target variable Y. However, this is not necessarily true since even if such an intervention may be possible, it can cause changes in the other predictors that lead to a compensation or an enhancement of the effect described by β j . Thus, the question if β j is relevantly different from 0 is of unclear scientific merit. A related question asks if a predictor (a component x (j) ) contributes in a relevant manner to the "explanatory value" of the model. This extends naturally to a group of coefficients that constitute a "term" of the model, typically a categorical predictor. In other words, one asks if the effect of dropping the predictor or the term from the complete model is relevant.
Another legitimate use of the model is estimation of an unknown value of the response Y, Y_0, on the basis of known values x_0 of the predictors, usually called "prediction." Then, one may ask if a predictor or a group of them reduce the prediction error by a relevant amount. It is of course also legitimate to use the model as a description of a dataset. Then, statistical inference is not needed, and there is a high risk of over-interpretation of the outputs obtained from the fitting functions.

Coefficient effect.
Let us first consider the experimental situation, where the effect of interest is a part of β. If it reduces to a single coefficient β_j, the other components are part of ϕ, and the formulas for simple regression generalize in a straightforward way. The "coefficient effect" is
$$ \vartheta_j = \beta_j\,\delta_j/\sigma, $$
where δ_j is the empirical standard deviation s_j for a continuous x^(j) and half the difference between the two possible values if x^(j) is binary.

Drop effect.
Applying the concept of standardization introduced above for the general model (7) leads to
$$ \vartheta_{\mathrm{drop},j} = \frac{\beta_j}{\sigma\sqrt{(C^{-1})_{jj}}} = \kappa_n\,\vartheta_j\,\sqrt{1 - R_j^2}, \qquad (11) $$
where C = XᵀX/n, the second equation holds if there is an intercept in the model and its coefficient is not β_j, R_j is the multiple correlation of the predictor X^(j) with all the other predictors, and κ_n = √(1 − 1/n) ≈ 1. The proof is given in the S1 Appendix. Eq (11) shows that ϑ_drop,j turns into the test statistic of the t-test for dropping the predictor X^(j) from the model, divided by √n, if estimators are plugged in for the parameters, whence its name "drop effect." It measures the change in the response (in σ units) of increasing X^(j), orthogonalized on the other predictors, by one of its standard deviations. If the predictor X^(j) is orthogonal to the others, ϑ_drop,j and ϑ_j coincide.
If a categorical predictor is in the focus, a contrast between its levels may be identified as the effect of interest. For example, a certain group may be supposed to have higher values for the target variable than the average of the other groups. Then, the problem can be cast in the same way as for the single coefficient.
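The relation between the coefficient effect and the drop effect for a single continuous predictor can be illustrated with a minimal Python sketch. It assumes the relation $\vartheta_{\mathrm{drop},j} = k_n \sqrt{1 - R_j^2}\, \beta_j s_j / \sigma$ with $k_n = \sqrt{1 - 1/n}$; the function names and the numerical inputs are hypothetical and serve only as an illustration.

```python
import math

def coefficient_effect(beta_j, delta_j, sigma):
    """Coefficient effect: beta_j * delta_j / sigma."""
    return beta_j * delta_j / sigma

def drop_effect(beta_j, s_j, sigma, r_j, n):
    """Drop effect for a continuous predictor (assumed relation, see lead-in).

    r_j is the multiple correlation of the predictor with the other predictors.
    """
    k_n = math.sqrt(1.0 - 1.0 / n)
    return k_n * math.sqrt(1.0 - r_j ** 2) * beta_j * s_j / sigma

# Hypothetical values: beta_j = 0.5, s_j = 2, sigma = 4, n = 100
theta = coefficient_effect(0.5, 2.0, 4.0)              # 0.25
theta_drop_orth = drop_effect(0.5, 2.0, 4.0, 0.0, 100)  # orthogonal predictor
theta_drop_coll = drop_effect(0.5, 2.0, 4.0, 0.8, 100)  # collinear predictor

print(theta, theta_drop_orth, theta_drop_coll)
print(theta_drop_coll / theta_drop_orth)  # sqrt(1 - 0.8^2) = 0.6
```

With collinearity ($R_j = 0.8$), the drop effect shrinks to 60% of its value for an orthogonal predictor, while the coefficient effect is unchanged.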

Multidimensional drop effect.
The effect of a categorical variable or another term in the model giving rise to a set $\beta_J$ of coefficients $\beta_j$, $j \in J$, can be assessed as a multidimensional effect. The general model (7) leads to

$$\eta_J^2 = \beta_J^\top \left( (C^{-1})_{JJ} \right)^{-1} \beta_J \,/\, \sigma^2. \qquad (12)$$

The derivation is again deferred to the S1 Appendix.
Noting that $\sigma^2 (C^{-1})_{JJ}/n$ is the variance-covariance matrix of the estimated effect $\hat\beta_J$ makes clear again that $\eta_J$ is the norm of a kind of standardized effect, and that $n \eta_J^2$ is related to the F test statistic for examining whether the term can be dropped from the model.
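The link to the F test can be made explicit in a short derivation. It is a sketch that assumes the standard F statistic for dropping a term, the plug-in estimate $\hat\eta_J^2 = \hat\beta_J^\top ((C^{-1})_{JJ})^{-1} \hat\beta_J / \hat\sigma^2$, and the notation $|J|$ for the number of coefficients in the term (a symbol introduced here only for illustration):

```latex
F \;=\; \frac{1}{|J|}\,
  \hat\beta_J^\top
  \left[ \hat\sigma^2 \, (C^{-1})_{JJ} / n \right]^{-1}
  \hat\beta_J
  \;=\; \frac{n}{|J|}\,
  \frac{\hat\beta_J^\top \left( (C^{-1})_{JJ} \right)^{-1} \hat\beta_J}{\hat\sigma^2}
  \;=\; \frac{n \, \hat\eta_J^2}{|J|} .
```

Under these assumptions, $n \hat\eta_J^2$ equals the F statistic multiplied by the number of coefficients in the term.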

Prediction effect.
The prediction error for predicting $Y_0$ for a given predictor vector $x_0$ has two sources: the variability of the predicted value, which depends on the observations used for estimating the parameters, and the random deviation $\varepsilon_0$ that is intrinsic to making the new observation $Y_0$. The latter is characterized by the parameter σ and will be considered here. The question to be asked is: Is the reduction in the random variation σ obtained by using a group of predictors relevant? The model with the group, called the "full model," is compared with the "reduced model," without them, and the corresponding σ's are $\sigma_f$ and $\sigma_r$. The following technical comment defines the parameters precisely.

Remark.
Whereas $\sigma_f$ is the σ of the regression model (10), $\sigma_r$ needs a definition, as will $\sigma_Y$ below. Assuming that (10) is correct, the reduced model will push some effects, that is, some constants, into the error term. The model results from projecting $X\beta$ (where X is the design matrix collecting the $x_i$'s as its rows) to the space spanned by the reduced design matrix $X_r$, with projection matrix

$$P_r = X_r (X_r^\top X_r)^{-1} X_r^\top.$$

Therefore, let $\gamma = (X_r^\top X_r)^{-1} X_r^\top \mathcal{E}(Y)$ be the linear fit to the expected values of Y. Then,

$$\sigma_r^2 = \sigma_f^2 + \frac{1}{n} \left\| \mathcal{E}(Y) - X_r \gamma \right\|^2.$$

Below, we will need $\sigma_Y^2$, defined as

$$\sigma_Y^2 = \sigma_f^2 + \frac{1}{n} \sum_i \left( \mathcal{E}(Y_i) - \overline{\mathcal{E}(Y)} \right)^2,$$

although this definition is a placeholder since $\sigma_Y^2$ will cancel in the definition of the effect. As an alternative to these definitions, the model may be modified by assuming $x_i$ to be random, with arbitrary distribution. Then, averages should be replaced by expectations.
In the sequel, I will use the multiple correlation R, related to the variances of the random deviations and of Y by

$$R^2 = 1 - \sigma^2 / \sigma_Y^2,$$

where $\sigma_Y$ is the (marginal) standard deviation of Y (see the remark for an exact definition). $R_f$ and $R_r$ denote the versions for the full and the reduced model, and J collects the predictors that do not appear in the reduced model. A comparison of variances, or other scale parameters for that matter, is best done on the logarithmic scale, since relative differences are a natural way of expressing such differences (cf. Section 3.1). Then, an effect measure is

$$\vartheta_{\mathrm{pred},J} = \log(\sigma_r / \sigma_f) = \tfrac{1}{2} \log\!\left( \frac{1 - R_r^2}{1 - R_f^2} \right). \qquad (13)$$

It measures the log(-percent) increase in the error standard deviation caused by dropping the considered group of predictors from the model. For simple analysis of variance, the model for comparing several groups, $\vartheta_{\mathrm{pred},J}$ reduces to

$$\vartheta_{\mathrm{pred},J} = -\tfrac{1}{2} \log(1 - R_f^2),$$

where $R_f^2$ is the fraction of the target variable's variance explained by the grouping, called $\eta^2$ in [16], and is between 0 and 1. Note that $\vartheta_{\mathrm{pred},J} = \tilde g(R_f) - \tilde g(R_r)$, where

$$\tilde g(R) = -\tfrac{1}{2} \log(1 - R^2).$$

It is related to Fisher's z transformation g for correlations (6) by $\tilde g(R) = g(R) - \log(1 + R)$ and shows the same behavior for large R.
In fact, the prediction effect is closely related to the drop effect, since

$$\sigma_r^2 = \sigma_f^2 \, (1 + \eta_J^2),$$

as shown in the S1 Appendix. Thus,

$$\vartheta_{\mathrm{pred},J} = \tfrac{1}{2} \log(1 + \eta_J^2) \approx \eta_J^2 / 2, \qquad (14)$$

the approximation being useful for reasonably small effects.

Estimation.
The effects $\eta_J$ and $\vartheta_{\mathrm{pred},J}$ are estimated by plugging in estimates for the parameters, $\hat\beta_J$, $\hat\sigma = \hat\sigma_f$ and $\hat\sigma_r$, into (12) or (13). Using the first option shows how to obtain a confidence interval. Assume that $\vartheta_0 = 0$, as is almost always the case. (This can always be achieved by subtracting $X_J \vartheta_0$ from both sides of the model (10).)

The three effect measures for regression models are:
• The coefficient effect $\vartheta_j = \beta_j \delta_j / \sigma$ describes the effect of manipulating a single predictor variable $X^{(j)}$.
• The drop effect examines the importance of a term involving a single coefficient, $\vartheta_{\mathrm{drop},j}$, or several of them, $\eta_J$, for "explaining" the target variable Y. If the term is orthogonal to the other terms in the model, $\vartheta_{\mathrm{drop},j}$ coincides with $\vartheta_j$.
• The prediction effect $\vartheta_{\mathrm{pred},J}$ is a function of $\eta_J$ and measures the effect of a term on reducing the essential part of the prediction error, the standard deviation of the random error $\varepsilon_i$, on the logarithmic scale.
Programs should provide the three measures for each term in a model. The scientific question should determine which one is appropriate for interpretation (see (a) and (b) above).
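The equivalence of the different expressions for the prediction effect can be checked numerically. The following minimal Python sketch assumes the definitions $\vartheta_{\mathrm{pred},J} = \log(\sigma_r/\sigma_f)$, $R^2 = 1 - \sigma^2/\sigma_Y^2$, and $\vartheta_{\mathrm{pred},J} = \tfrac12 \log(1 + \eta_J^2)$; the function names and the input values are hypothetical.

```python
import math

def pred_effect_from_sigmas(sigma_r, sigma_f):
    """Log ratio of the error standard deviations (reduced vs. full model)."""
    return math.log(sigma_r / sigma_f)

def pred_effect_from_correlations(r_reduced, r_full):
    """Same effect, expressed through the multiple correlations R_r and R_f."""
    return 0.5 * math.log((1.0 - r_reduced ** 2) / (1.0 - r_full ** 2))

def pred_effect_from_eta(eta):
    """Same effect, expressed through the multidimensional drop effect eta_J."""
    return 0.5 * math.log1p(eta ** 2)

# Hypothetical inputs: sigma_Y = 1, R_f = 0.6, R_r = 0.3
sigma_y = 1.0
r_full, r_reduced = 0.6, 0.3
sigma_f = sigma_y * math.sqrt(1.0 - r_full ** 2)
sigma_r = sigma_y * math.sqrt(1.0 - r_reduced ** 2)
# eta_J follows from the assumed relation sigma_r^2 = sigma_f^2 * (1 + eta_J^2)
eta = math.sqrt(sigma_r ** 2 / sigma_f ** 2 - 1.0)

effects = (
    pred_effect_from_sigmas(sigma_r, sigma_f),
    pred_effect_from_correlations(r_reduced, r_full),
    pred_effect_from_eta(eta),
)
print(effects)  # three numerically equal values, about 0.176
```

In this made-up case, dropping the term increases the error standard deviation by about 17.6%ℓ.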

Relevance thresholds
The arguments in the Introduction have led to the bothersome requirement of choosing a threshold of relevance, z. Ideally, such a choice is based on the specific scientific problem under study. However, researchers will likely hesitate to take such a decision and to argue for it. Conventions ease this burden, and it is foreseeable that rules will be invented and adhered to sooner or later, analogously to the ubiquitous fixation of the testing level α = 5%.
Therefore, some considerations about simple choices of the relevance threshold in typical situations follow here.

One and two samples, regression coefficients
An established "small" value of Cohen's d is 20% [13]. It may serve as the threshold for d. Since d = 2ϑ in the case of two groups (5), this leads to z = 10% for ϑ, which can also be used for a single sample and for the coefficient effect in regression, according to the discussion in Section 3. It extends to drop effects for terms with a single degree of freedom because the two effects coincide if $R_j = 0$ (11), and from there to multivariate drop effects. However, this threshold translates into a tiny effect $\vartheta_{\mathrm{pred},J}$ of 0.5%ℓ on the log ratio of lengths of prediction intervals according to (14). A threshold of 5%ℓ seems to be more appropriate here. This shows again that the scientific question should guide the choice of the effect scale and of the relevance threshold!
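The translation of the threshold to the prediction scale can be retraced as follows. The calculation is a sketch that assumes relation (14) in the form $\vartheta_{\mathrm{pred},J} = \tfrac12 \log(1 + \eta_J^2)$:

```latex
d = 0.2 \;\Rightarrow\; \vartheta = d/2 = 0.1, \qquad
\vartheta_{\mathrm{pred},J}
  = \tfrac{1}{2} \log\left(1 + 0.1^2\right)
  = \tfrac{1}{2} \log(1.01)
  \approx 0.00498 \approx 0.5\%\ell .
```

Conversely, under the same assumption, a prediction threshold of 5%ℓ corresponds to a drop effect of $\eta_J = \sqrt{e^{0.1} - 1} \approx 0.32$.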

Relative effect
General intuition may often lead to an agreeable threshold expressed as a percentage. For example, for a treatment to lower blood pressure, a reduction by 10% may appear relevant according to common sense. Admittedly, this value is as arbitrary as the 5% testing level. Physicians should determine if such a change usually entails a relevant effect on the patients' health, and subsequently, a corresponding standard might be generally accepted for treatments of high blood pressure.
As discussed in Section 3.1, when percentage changes are a natural way to describe an effect, it is appropriate to express it formally on the log scale, like $\vartheta = \mathcal{E}(\log Y^{(1)}) - \mathcal{E}(\log Y^{(0)})$ in the two-sample situation. Then, one might set z = 0.1 for a 10% relevance threshold for the change, or more precisely, using the "log percent" scale, z = 10%ℓ.

Log-linear models
Several useful models connect the logarithm of the expected response with a linear combination of the predictors, notably Poisson regression with the logarithm as the canonical link function, log-linear models for frequencies, and Weibull regression, a standard model for reliability and survival data. Here, the consideration of a relative effect applies again. An increase of 0.1 in the linear predictor leads to an increase of approximately 10% in the expected value (more precisely, $e^{0.1} - 1 \approx 10.5\%$), and therefore, z = 10%ℓ seems appropriate for the standardized coefficients $\vartheta_j = \beta_j s_j$.
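The correspondence between effects on the log scale and ordinary percentage changes can be sketched in a few lines of Python. The function names are hypothetical; only the standard `math` module is used.

```python
import math

def relative_change_from_log_effect(theta):
    """Relative change (as a fraction) implied by an effect theta on the log scale."""
    return math.expm1(theta)

def log_effect_from_relative_change(change):
    """Effect on the log scale implied by a relative change (as a fraction)."""
    return math.log1p(change)

increase = relative_change_from_log_effect(0.10)   # about 0.105, i.e. +10.5%
theta_up = log_effect_from_relative_change(0.10)   # about 0.095, i.e. 9.5 log percent
theta_down = log_effect_from_relative_change(-0.10)  # about -0.105

print(increase, theta_up, theta_down)
```

For small effects, the log percent and the ordinary percent scales nearly agree, which is why a threshold of 10%ℓ can be read informally as "about 10%."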

Proportions and logistic regression
As the "logit percent" scale (Section 3.2) extends the log percent scale and matches it for small proportions, the same threshold z = 10%ℓ should be applied. It declares a difference between p = 0.5 and p = 0.525, or between 0.35 and 0.373, or between 0.1 and 0.109 as relevant. As for the log-linear models, this threshold also applies to standardized coefficients in logistic regression.
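The three pairs of proportions quoted above can be checked on the logit scale with a minimal Python sketch (the helper name `logit` is introduced here for illustration):

```python
import math

def logit(p):
    """Log odds of a proportion p."""
    return math.log(p / (1.0 - p))

# Pairs of proportions named in the text as differing by about 10 logit percent
pairs = [(0.5, 0.525), (0.35, 0.373), (0.1, 0.109)]
logit_differences = [logit(p1) - logit(p0) for p0, p1 in pairs]

for (p0, p1), diff in zip(pairs, logit_differences):
    print(f"{p0} -> {p1}: {100 * diff:.1f} logit percent")
```

All three differences come out close to 0.1, that is, close to the threshold of 10%ℓ.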

Correlation
In the two-sample situation, considering the $x_i$ as random and assuming equal probabilities for both groups, the correlation is

$$\rho = \frac{d}{\sqrt{4 + d^2}},$$

and the threshold of 20% on Cohen's d leads again to z = 0.1. In Section 3, the logit scale according to Fisher has been recommended as the effect scale. Since ϑ = g(ρ) ≈ ρ for ρ ≤ 0.1, the threshold can be used on this scale, too, and the "logit percent" notation is appropriate, z = 10%ℓ.
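The step from Cohen's d to the correlation scale can be illustrated numerically. The sketch below assumes the point-biserial relation $\rho = d/\sqrt{4 + d^2}$ for two equally likely groups and uses Fisher's z transformation $g(\rho) = \operatorname{artanh}(\rho)$; the function names are hypothetical.

```python
import math

def correlation_from_cohens_d(d):
    """Correlation between group indicator and response for two equally likely groups."""
    return d / math.sqrt(d * d + 4.0)

def fisher_z(rho):
    """Fisher's z transformation, 0.5 * log((1 + rho) / (1 - rho))."""
    return math.atanh(rho)

rho = correlation_from_cohens_d(0.2)   # about 0.0995
theta = fisher_z(rho)                  # about 0.0998

print(rho, theta)
```

Both values round to 0.1, so the same numerical threshold can be used for d/2, for ρ, and for g(ρ).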

Summary
The scales and thresholds for the different models that are recommended here for the case that the scientific context does not suggest any choices are listed in Table 2.

Description of results
It is common practice to report the statistical significance of results by a p-value in parentheses, like "The treatment has a significant effect (p = 0.04)," and estimated values are often decorated with asterisks to indicate their p-values in symbolized form. If such short descriptions are desired, secured relevance values should be given. If Rls > 1, the effect is relevant; if Rls > 0, it is significant in the traditional sense. These cases can be distinguished in even shorter form in tables by an asterisk or plus signs as follows: * for significant, that is, Rls > 0; + for relevant (Rls > 1); ++ for Rls > 2; and +++ for Rls > 5. To make these indications well-defined, the relevance threshold z must be declared either for a whole paper or alongside the indications, like "Rls = 1.34 (z = 10%ℓ)." Since the secured (and potential) relevance also depends on the confidence level 1 − α, this quantity should also be declared.
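The symbol convention just described is simple enough to be written down as a small function. The following Python sketch is an illustration only; the function name `relevance_symbol` is hypothetical and not taken from any package.

```python
def relevance_symbol(rls):
    """Map a secured relevance value Rls to the short table symbol."""
    if rls > 5:
        return "+++"
    if rls > 2:
        return "++"
    if rls > 1:
        return "+"   # relevant
    if rls > 0:
        return "*"   # significant in the traditional sense only
    return ""        # not significant

for value in (-0.2, 0.3, 1.34, 4.16, 10.1):
    print(value, repr(relevance_symbol(value)))
```

The cut-offs are checked from the largest downwards, so each value receives the strongest symbol it qualifies for.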

Examples
The first examples are taken from the first "manylabs" project about replicability of findings in psychology [17], since for that study, the scientific questions had been judged to deserve replication and full data for the replication is easily available. The original studies were replicated in each of 36 institutions. Here, I pick the replication at Penn State University of the following item: "Students were asked to guesstimate the height of Mount Everest. One group was 'anchored' by telling them that it was more than 2000 feet, the other group was told that it was less than 45,500 feet. The hypothesis was that respondents would be influenced by their 'anchor,' such that the first group would produce smaller numbers than the second" [18]. The true height is 29,029 feet.
Table 2 (excerpt). Recommended effect scales and relevance thresholds.
Problem | Model | Effect scale | Threshold
Relative difference | log(Y) ~ N(μ_k, σ²) | log(μ_1/μ_0) | 10%ℓ
Proportion | B(n, p) | log(p/(1−p)) | 10%ℓ
Logistic regression | | |

According to the discussion in Section 5, the data are analyzed here on the logarithmic scale, and the threshold of 10%ℓ is applied. The data, reduced to the first 20 observations for simplicity, are given in Table 3.
A second study asked whether a positive or negative formulation of the same options had an effect on the choice [19]. Confronted with a new contagious disease, the government has a choice between action A that would save 200 out of 600 people or action B which would save all 600 with probability 1/3. The negative description was that either (A) 400 would die or (B) all 600 would die with probability 2/3. I report the results for Penn State (US) and Tilburg (NL) universities. The data are summarized in Table 4, and the effect, significance, and relevance in Table 5. The secured relevance is Rls = 4.16 (z = 10%ℓ) and 10.1 (z = 10%ℓ) for the two institutions; the effect is thus clearly relevant. One may ask if there is a relevant (!) difference between these two studies, with a view to applying the notions of this paper to the theme of replicability. This will be done in a forthcoming paper.
The third example is a multiple regression problem. The dataset reflects the blasting activity needed for digging a freeway tunnel beneath a Swiss city. Since blasting can cause damage in houses located at a small distance from the point of blasting, the charge should be adjusted to keep the tremor in the basement of such a house below a threshold $y_0$. The logarithmic tremor is modelled as a linear function of the logarithmic distance and charge, an additive adjustment to the house where the measurements are taken (factor location), and time, a rescaled calendar day. Only part of the data for 3 locations is used here; see Table 6. Results are collected in Table 7. The time does not show any significance and therefore no relevance either. The relevances of the coefficient and drop effects are related by (11). Thus, their ratio equals $\sqrt{1 - R_j^2}$ and is a useful measure of collinearity.
For the shortest description, the coefficient of log10(charge) would be indicated as 0.752 +++ .
The results for these examples have been obtained with the R package relevance, available from CRAN (https://r-project.org).

Relevance instead of p-values
The deficiencies of the common use of p-values have led to a fierce debate and a flood of papers, often resulting in the vague conclusion that the accused concept should be used with caution.
Here, I have argued that the origin of the crisis lies deeper: The misuse of the p-value reflects a way to avoid the effort of asking relevant scientific questions to begin with. Typical problems in empirical research often concern a quantity like the effect of a treatment on a specified target variable. These problems are only well-posed if there is a threshold of relevance. I am not the first to advocate this requirement, but I emphasize its importance again and develop it further into the novel measure of relevance. It is essential to keep in mind that the threshold should be determined only by the scientific problem and therefore should depend as little as possible on the design of the study that estimates the effect; in particular, it must not depend on the number of observations. Some earlier proposals are also based on the idea of a threshold that widens the null hypothesis, such as the "Second Generation P-Value" by Blume et al. [1], which was discussed above. Kruschke [7] uses ideas similar to those of this paper with a Bayesian approach (see footnote 2.3), and a referee suggested drawing conclusions on the basis of the posterior probability of the effect exceeding the threshold, ϑ > z. However, none of these proposals, whether frequentist or Bayesian, has yet been widely applied.
The paradigm of null hypothesis significance testing that is so well established asks for the choice of a threshold: the significance level α of the test, or the confidence level 1−α. In principle, α could be arbitrarily chosen, but tradition has fixed it at 5% for most scientific fields. The relevance threshold introduces yet another choice to be made. A careful selection should be sought in each scientific study. Since this is a cumbersome requirement, conventions have been proposed in this paper for the most common situations.
The traditional method to convey the assessment of an effect in a more informative way than the p-value is the confidence interval. Its downside is that it consists of two numbers that carry the measurement unit of the effect and are therefore not directly comparable between studies. The significance measure introduced here is a single, standardized number that conveys the essentials of the confidence interval. It depends, however, again on a given value of the effect. When this value is 0, the basic flaw of the p-value is inherited. Combining it with the relevance threshold is a necessary step to give an appropriate characterization of the relevance of a result.
The combination is best achieved by focussing on the confidence interval for the relevance measure, with boundaries called "secured" and "potential" relevance. The secured relevance Rls may even be used as a single number characterizing the knowledge gained about the effect of interest.
A conclusion from the p-value debate is that a simple yes-no decision about the result is misleading. Since our thinking likes categorization, I have introduced labels characterizing the comparison of the confidence interval with both the zero effect and the relevance threshold. The classification is defined on the basis of the two significance values $\mathrm{Sig}_0$ and $\mathrm{Sig}_z$ or of the two relevance limits Rls and Rlp.
The significance and relevance measures and the classification are straightforward enhancements of concepts that are well established and ubiquitously known. There is hope that they can form a new standard of presenting statistical results.
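How the secured and potential relevance arise from a confidence interval can be sketched in Python. The sketch makes two assumptions that go beyond the text here: the relevance measure is taken to be the effect divided by the threshold z, and the confidence interval is the normal-approximation interval "estimate ± quantile × standard error." The function name, the dictionary keys, and the input values are hypothetical.

```python
from statistics import NormalDist

def relevance_measures(estimate, std_error, threshold, alpha=0.05):
    """Estimated, secured and potential relevance (normal approximation)."""
    q = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    lower = estimate - q * std_error
    upper = estimate + q * std_error
    return {
        "estimated": estimate / threshold,
        "secured": lower / threshold,    # Rls: lower confidence limit / threshold
        "potential": upper / threshold,  # Rlp: upper confidence limit / threshold
    }

# Hypothetical result: estimated effect 0.25, standard error 0.05, threshold 0.1
result = relevance_measures(0.25, 0.05, 0.1)
print(result)  # secured about 1.52, potential about 3.48
```

In this made-up case, the secured relevance exceeds 1, so the effect would be called relevant at the 95% confidence level. With small samples, a t quantile would replace the normal quantile.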

Replicability
The p-value debate is closely related to and often confounded with the reproducibility crisis. In fact, there is ample evidence that in several fields of science, when a statistical study is replicated, a significant effect found in the original study turns out to be non-significant in the replication, thereby formally failing the fundamental requirement of reproducibility of empirical science. While many causes are suggested and found for such failures, prominent ones are tied to the problems with statistical testing and the p-value discussed in the Introduction. Here is the argument: The p-value was originally advocated as a filter against publication of results that may be due to pure randomness. It was soon converted into a tool to generate "significant" results regardless of their scientific relevance. This leads to so-called selection bias: When many studies examine small true effects with limited precision, some of them will turn out significant by chance, will thus pass the filter and be published, whereas the non-significant ones will go unnoticed. These studies will have a low probability of being successfully replicated.
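The selection mechanism can be made concrete with a small calculation. The Python sketch below assumes a deliberately simple setting that is not taken from the studies discussed here: a standardized effect ϑ is estimated from n observations with unit standard deviation, so that the estimate is normal with standard deviation $1/\sqrt{n}$, and a two-sided z-test at level 5% is applied. The function name and the numbers are hypothetical.

```python
from statistics import NormalDist

def prob_significant(theta, n, alpha=0.05):
    """Probability that a two-sided z-test is significant for true effect theta."""
    z = NormalDist()
    q = z.inv_cdf(1.0 - alpha / 2.0)
    shift = theta * n ** 0.5
    return (1.0 - z.cdf(q - shift)) + z.cdf(-q - shift)

p_small = prob_significant(0.1, 50)  # small true effect: about 0.11
p_large = prob_significant(0.5, 50)  # large true effect: about 0.94
p_null = prob_significant(0.0, 50)   # no effect: the level, 0.05

print(p_small, p_large, p_null)
```

Under these assumptions, only about one in nine studies of the small effect passes the significance filter, and an independent replication of the same size has the same low chance of being significant again.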
Clearly, using the criterion of a secured relevant effect (case Rlv) as a filter would reduce the frequency of phony results drastically, since the barely significant results would rarely pass it. A relevant result in this sense will usually have a high probability of showing at least a significant estimate (case Amb.Sig) and an estimated relevance above 1 upon replication, unless the precision is low or data snooping has been extensively applied to get it. A securely relevant result can be expected if the replication has sufficient power.
The concepts introduced here can be profitably applied to assess replications of results also in more depth, as will be shown in a forthcoming paper.

Conclusion
The p-value has been (mis-) used to express the results of statistical data analyses for too long, in spite of the extensive discussions about the bad consequences of this practice for science.
It is time to introduce a new concept for the presentation of the statistical inference for an effect under study. The measure of relevance introduced here is suitable to achieve this goal. It needs the choice of a relevance threshold for the effect of interest, a requirement posed by the desire to ask a scientifically meaningful question to begin with.
The goal of a typical statistical enquiry is to prove that an effect is relevant. Based on the measures "secured relevance," Rls, and "potential relevance," Rlp, either this can be achieved, or a "negligible" effect can be found, or the answer may be "ambiguous." Application of these concepts will enhance reproducibility: When relevant effects are examined rather than merely significant ones, the replication will much more often turn out to be at least significant.