Systematic review of the use of “magnitude-based inference” in sports science and medicine

Magnitude-based inference (MBI) is a controversial statistical method that has been used in hundreds of papers in sports science despite criticism from statisticians. To better understand how this method has been applied in practice, we systematically reviewed 232 papers that used MBI. We extracted data on study design, sample size, and choice of MBI settings and parameters. Median sample size was 10 per group (interquartile range, IQR: 8–15) for multi-group studies and 14 (IQR: 10–24) for single-group studies; few studies (15%) reported a priori sample size calculations. Authors predominantly applied MBI's default settings and chose "mechanistic/non-clinical" rather than "clinical" MBI even when testing clinical interventions (only 16 of 232 studies used clinical MBI). Using these data, we estimated the Type I error rates for the typical MBI study. Authors frequently made dichotomous claims about effects based on the MBI criterion of a "likely" effect and sometimes based on the MBI criterion of a "possible" effect. When the sample size is n = 8 to 15 per group, these inferences have Type I error rates of 12%–22% and 22%–45%, respectively. High Type I error rates were compounded by multiple testing: authors reported results from a median of 30 outcome-related tests per paper, and few studies (14%) specified a primary outcome. We conclude that MBI has promoted small studies, promulgated a "black box" approach to statistics, and led to numerous papers whose conclusions are not supported by the data. Amidst debates over the role of p-values and significance testing in science, MBI also provides an important natural experiment: we find no evidence that moving researchers away from p-values or null hypothesis significance testing makes them less prone to dichotomization or over-interpretation of findings.

Data extraction tool (excerpted items):

How many drop-outs were there? How many exclusions were there? (Exclusions refer to participants whose data were available, but were excluded by the authors for any reason.)

In your view, is the conclusion in the Abstract written in a way that supports the hypotheses of the study? (Yes, No, or there was no clear directional hypothesis.) Note: this question is getting at "hypothesizing after results are known"; we want to see if studies are unusually good at finding support for their stated hypotheses.

How are effects interpreted in the Conclusion of the Abstract? These effects are usually a descriptor of certainty plus a magnitude (e.g., MBI recommends certain descriptors, such as "likely harm/benefit", "very likely trivial", or "certain harm/benefit"). Authors may also describe their interventions as "substantial" or "implementable". Please insert all relevant descriptors, separated by a semi-colon.

Additional Comments: Please take note of any other oddities you found in the studies. For instance, some studies seem to confuse a clear observed difference in the sample with the inference to the population (e.g., "the treatment group might have performed better than control").

MBI parameters (i.e., the trivial thresholds, the maximum risk of harm (α₁), and the minimum evidence threshold) were critical to the simulations undertaken in the main paper, so we also conducted a thorough review of those values. There was good agreement between reviewers for the minimum threshold for harm (−δ), with a median value of −0.2 SD, and for the minimum threshold for benefit (+δ), with a median value of +0.2 SD. Similarly, there was moderate initial agreement between reviewers for the maximum risk of harm (α₁), with the most common thresholds being 5% or 0.5%, as shown in Supplemental Figure 2.
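To make concrete how these parameters enter an MBI analysis, the following R sketch classifies a single observed standardized effect using the common flat-prior formulation, in which the chance of benefit or harm equals a one-sided t-distribution probability computed against the +δ or −δ threshold. The function name, defaults, and the non-clinical "unclear" rule shown here are our illustrative assumptions, not code from any MBI software; the default values (trivial thresholds of ±0.2 SD, maximum risk of harm of 5%, and a "likely" (75%) evidence threshold) mirror the most common choices in the reviewed papers.

# Illustrative sketch only (not from MBI software or Supplemental Appendix II).
# Chances of benefit/harm are one-sided t probabilities against the +/- delta thresholds.
mbi_classify <- function(d_obs, se, df,
                         delta = 0.2,        # trivial thresholds: (-delta, +delta)
                         max_risk = 0.05,    # alpha_1: non-clinical "unclear" rule
                         evidence = 0.75) {  # minimum evidence threshold (e.g., "likely")
  p_benefit <- pt((d_obs - delta) / se, df)   # P(true effect > +delta | flat prior)
  p_harm    <- pt((-delta - d_obs) / se, df)  # P(true effect < -delta | flat prior)
  if (p_benefit >= max_risk && p_harm >= max_risk) return("unclear")
  if (p_benefit >= evidence) return("meets evidence threshold: benefit")
  if (p_harm    >= evidence) return("meets evidence threshold: harm")
  "clear, but below the evidence threshold"
}

# Example: two groups of n = 10, pooled SD = 1, observed standardized difference = 0.6
n <- 10
mbi_classify(d_obs = 0.6, se = sqrt(2 / n), df = 2 * n - 2)  # falls in "likely beneficial" territory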

Coding for the minimum evidence threshold generally had lower agreement, because this threshold was explicitly stated in only 23% of papers. An example of an explicit statement is found in Cruz et al. (2018)¹: "A likely difference (>75%) was considered as the minimum threshold to detect meaningful differences."
In another example, the authors reported likely beneficial effects for several outcomes ("… and .40, respectively), and total and moderate PA (d= .73 and .59, respectively) in the intervention compared to the usual care group. We found no likely beneficial improvements in any other outcome"; emphasis added) and then recommended the intervention's use: "This intervention has the potential for widespread implementation and adoption, which could considerably impact on post-treatment recovery in this population." One can infer from their description and ensuing recommendation that they used "likely" as their minimum evidence threshold, though this was never explicitly stated.
As another example, Weston et al. (2015)³ write in their abstract that, "Compared to the control group, the core training intervention group had a possibly large beneficial effect on 50-m swim time" and then state, "This is the first study to demonstrate a clear beneficial effect of isolated core training on 50-m front crawl swim performance" (emphasis added). The authors implicitly used "possible" as their minimum evidence threshold, because they draw definitive conclusions from a "possible" effect.
To ensure consistency in the coding of this parameter, reviewers 3 or 4 reviewed all 232 papers to confirm the reported values.
As shown in Supplemental Figure 2, the median threshold for evidence was 50%, but this was not the most common value; most studies used a threshold of 75% (n = 100) or 25% (n = 88). In some cases (n = 25), the choice was ambiguous and was coded as unclear.
Supplemental Figure 2. The 'Maximum Risk of Harm' and the 'Threshold for Evidence' across studies.

Systematic Review Methods: Risk of Bias Assessment
To assess the potential risk of bias in our data, our data extraction tool included questions relating specifically to Selection Bias, Performance Bias, and Detection Bias. Other questions about attrition, data exclusions, and the reporting of outcomes tap into potential Attrition Bias and Reporting Bias (Higgins & Green, 2011)⁴. Descriptive statistics related to each type of bias are presented below, but our primary focus was on the risk of Selection, Performance, and Detection Bias in these studies.
Selection Bias. Selection bias was assessed by two questions that asked about (1) random assignment to groups/conditions and (2) whether the method of randomization was made explicit. As shown in Supplemental Table 2, most controlled trials and cross-over designs had random assignment to groups/conditions, but very few studies were explicit about the method of randomization.
Supplemental Table 2. Summary of questions relating to selection bias as a function of study design.
Detection Bias. Detection bias was assessed by two questions that asked (1) whether blinding of assessors was stated and (2) whether the outcome measures were sufficiently objective that assessor blinding was unlikely to make a difference (in the judgment of the raters). As shown in Supplemental Table 4, blinding of assessors was rarely stated across study types. This concern is somewhat ameliorated, however, by the frequency of objective measures of performance.

Supplemental Table 4. Summary of questions relating to detection bias as a function of study design. Note that we added a category of "mixed" for the objectivity of measures because a subset of the dependent variables was judged to be sufficiently objective whereas others were not.

Attrition Bias.
Of the 232 studies included in the review, 55 (24%) reported some degree of attrition or exclusion of data. Among studies that reported drop-outs, the median number of participants dropping out was 5 (IQR: 2–9). Among studies that reported exclusion of existing data, the median number of participants excluded was 4 (IQR: 1–10). Given the sample sizes in these studies, this amount of attrition is worrisome: the median total sample size among studies reporting attrition was 21 (IQR: 16–30). Attrition or exclusions were reported in only 24% of all studies, but among those studies, it was common for up to 25% of the data to be censored or excluded.
Reporting Bias. Reporting bias is difficult to discern, but in our data extraction tool reviewers were asked to rate whether the abstract was "written in a way that supported the hypotheses" of a given study. In 104 cases (45%), this was rated as not clear because no specific directional hypothesis was given. In cases where hypothesis support could be judged, the abstract supported the hypotheses in 72 cases (31%), clearly did not support the hypotheses in 8 cases (3%), and presented mixed results in 37 cases (16%); an additional 11 cases (5%) were rated as "Other". The lack of clear directional hypotheses in nearly half of the papers suggests that a substantial number of these studies were exploratory in nature. Assuming that authors reported all measures they collected, however, we do not think that the risk of reporting bias is higher in these studies than in other studies in the field.

Simulations
We simulated the Type I error rates for MBI for both a between-group comparison and a within-group comparison, assuming a range of sample sizes. For the between-group comparison, we generated 200,000 simulated trials with a sample size of n per group, drawn from two normally distributed populations with the same variance and either zero or a trivial difference between the groups. For the within-group comparison, we generated 200,000 simulated trials with a sample size of n, drawn from a normally distributed population with a true effect size of 0 or a trivial effect size. Type I error rates were then calculated as the percentage of trials in which MBI returned a positive or negative inference that met a given minimum evidence threshold (e.g., "likely"), or the percentage of trials with p < .05 (for standard hypothesis testing). For Figures 2C and 2D, we simulated 5,000 between-group trials and classified MBI inferences as:
• "Likely" or higher effect: (LCL90 ≥ −δ and LCL50 ≥ δ) or (UCL90 ≤ δ and UCL50 ≤ −δ)
• "Possible" effect: (LCL90 ≥ −δ and UCL50 ≥ δ and LCL50 < δ) or (UCL90 ≤ δ and LCL50 ≤ −δ and UCL50 ≥ −δ)
• "Trivial" effect: (LCL90 ≥ −δ and UCL50 ≤ δ) or (UCL90 ≤ δ and LCL50 ≥ −δ)
• "Unclear" effect: LCL90 ≤ −δ and UCL90 ≥ δ
where LCL and UCL denote the lower and upper confidence limits of the 90% and 50% confidence intervals. Simulations were conducted in SAS 9.4; Supplemental Appendix II provides the simulation code in both SAS and R.
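As an illustration of this procedure (the definitive SAS and R versions are in Supplemental Appendix II), the R sketch below simulates between-group trials with a true effect of zero and applies the confidence-limit rules listed above; the sample size, number of trials, and function names here are our own illustrative choices rather than the Appendix II code.

# Illustrative R sketch of the between-group simulation (not the Appendix II code).
set.seed(1)
n      <- 10      # per-group sample size (illustrative)
delta  <- 0.2     # trivial thresholds: (-delta, +delta), in SD units
es     <- 0       # true standardized difference (0 => Type I error scenario)
n_sims <- 20000   # 200,000 in the paper; reduced here for speed

classify <- function(diff, se, df) {
  lcl90 <- diff - qt(0.95, df) * se; ucl90 <- diff + qt(0.95, df) * se
  lcl50 <- diff - qt(0.75, df) * se; ucl50 <- diff + qt(0.75, df) * se
  if (lcl90 <= -delta && ucl90 >= delta) return("unclear")
  if ((lcl90 >= -delta && lcl50 >= delta) || (ucl90 <= delta && ucl50 <= -delta)) return("likely or higher")
  if ((lcl90 >= -delta && ucl50 >= delta && lcl50 < delta) ||
      (ucl90 <= delta && lcl50 <= -delta && ucl50 >= -delta)) return("possible")
  "trivial or other"
}

res <- replicate(n_sims, {
  x <- rnorm(n, mean = es); y <- rnorm(n, mean = 0)
  sp <- sqrt(((n - 1) * var(x) + (n - 1) * var(y)) / (2 * n - 2))  # pooled SD
  classify(mean(x) - mean(y), sp * sqrt(2 / n), df = 2 * n - 2)
})

# Proportion of null trials yielding a positive or negative inference at each threshold
mean(res == "likely or higher")
mean(res %in% c("likely or higher", "possible"))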

Mathematical Calculations
We also calculated the Type I error rates mathematically. Mathematical symbols used:
n = sample size in each group
α₁ = significance level for deciding "unclear"
α₂ = significance level for the minimum evidence threshold of interest
δ_h = threshold for harm
δ_b = threshold for benefit
t₁ = t_(1−α₁), 2n−2, the (1−α₁) quantile of the t distribution with 2n−2 degrees of freedom
t₂ = t_α₂, 2n−2, the α₂ quantile of the t distribution with 2n−2 degrees of freedom
ES = true effect size
Z represents the standard normal distribution
χ²_(2n−2) is the chi-square distribution with 2n−2 degrees of freedom
Sainani⁵ previously showed that to achieve a minimum evidence threshold for benefit in clinical MBI, one must meet two constraints: (1) p < α₁ for H₀: true effect ≤ −δ_h; otherwise, the inference is deemed "unclear"; and (2) p < α₂ for H₀: true effect ≤ δ_b, where α₂ is the significance threshold that determines the exact inference achieved, whether "possibly", "likely", etc. α₂ = .05 corresponds to "very likely"; α₂ = .25 corresponds to "likely"; and α₂ = .75 corresponds to "possibly". For example, to achieve a clear beneficial inference of at least "likely" when α₁ = 5%, one must meet p < .05 for H₀: true effect ≤ −δ_h and p < .25 for H₀: true effect ≤ δ_b.
Sainani⁵ previously derived equations for these two constraints for the problem of comparing two means. For simplicity, we use a pooled sample variance, s², and equal sample sizes for the two groups, so the observed mean difference x̄₁ − x̄₂ has standard error s√(2/n). The two constraints are:
(1) x̄₁ − x̄₂ > −δ_h + t₁ · s√(2/n)
(2) x̄₁ − x̄₂ > δ_b − t₂ · s√(2/n)
Thus, to meet a given minimum evidence threshold for benefit, the following must be true:
x̄₁ − x̄₂ > max( −δ_h + t₁ · s√(2/n), δ_b − t₂ · s√(2/n) )
We have extended this to accommodate both directions, i.e., non-clinical MBI. For non-clinical MBI, a minimum evidence threshold for harm or benefit is met when:
x̄₁ − x̄₂ > max( −δ_h + t₁ · s√(2/n), δ_b − t₂ · s√(2/n) ) or x̄₁ − x̄₂ < min( δ_b − t₁ · s√(2/n), −δ_h + t₂ · s√(2/n) )
The probability of satisfying these conditions is obtained by averaging over the sampling distributions of the mean difference (normal, via Z) and the pooled variance (proportional to χ²_(2n−2)). See Supplemental Appendix II for SAS and R code that implements the math equations. Note that this calculation gives the probability of achieving a given inference; this probability is a Type I error rate when the true effect size is 0 or trivial, but a Type II error rate when the true effect is non-trivial.
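For readers who want to evaluate these expressions directly, the following R sketch (our own illustrative implementation, not the Appendix II code) computes the probability of a clinical "benefit" inference by integrating over the sampling distribution of the pooled variance; effect sizes are in between-person SD units, so σ = 1, and the function name and defaults are assumptions for illustration.

# Illustrative implementation of the constraints above (not the Appendix II code).
# Probability of a clinical "benefit" inference when comparing two means, n per group.
mbi_prob_benefit <- function(n, ES = 0, delta_h = 0.2, delta_b = 0.2,
                             alpha1 = 0.05, alpha2 = 0.25) {
  df  <- 2 * n - 2
  t1  <- qt(1 - alpha1, df)  # t_(1-alpha1), 2n-2
  t2  <- qt(alpha2, df)      # t_(alpha2), 2n-2
  se0 <- sqrt(2 / n)         # SE of the mean difference when s = sigma = 1
  # U = (2n-2) * s^2 follows a chi-square distribution with 2n-2 df when sigma = 1
  integrand <- function(u) {
    s <- sqrt(u / df)
    cutoff <- pmax(-delta_h + t1 * s * se0, delta_b - t2 * s * se0)
    pnorm(cutoff, mean = ES, sd = se0, lower.tail = FALSE) * dchisq(u, df)
  }
  integrate(integrand, lower = 0, upper = Inf)$value
}

# Probability of at least a "likely beneficial" inference when the true effect is zero
mbi_prob_benefit(n = 10)
# Same inference when the true effect is a trivial 0.1 SD
mbi_prob_benefit(n = 10, ES = 0.1)

With symmetric thresholds and a true effect of zero, the corresponding non-clinical (two-direction) error rate is approximately twice the value returned for a single direction.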
Results from the between-group comparison are shown visually within the main paper. Results for a within-person study design are shown in Supplemental Figure 4.
Supplemental Figure 4. Type I error rates (mathematically calculated) for a within-person study for MBI's "possible" (purple) and "likely" (red) thresholds. True effect size = 0. Standard hypothesis testing with a significance level of 0.05 is shown in blue. The shaded area is the interquartile range of total sample sizes among within-person studies. Typical MBI settings were used (thresholds for harm/benefit of −0.2/+0.2; α₁ = 5%; and equivalent treatment of the positive and negative directions). Panels assume within-person variances of 0.364 (A, B) or 0.80 (C) and a trivial range of 0.2 (A, C) or 0.1 (B) between-person standard deviations. In the papers we reviewed, within-person studies sometimes used smaller trivial ranges (B) and typically had large within-person variance (C).
We calculated Type I error rates for a range of scenarios (Table 3 of the main paper). All reported values were confirmed by both simulation and the mathematical calculations. For the base case, we assumed α₁ = 5% and thresholds of harm/benefit of ±0.2 standard deviations; these were varied in some scenarios. Effect sizes were expressed in standard deviation units; the true effect was set at 0 or a trivial amount (e.g., 0.1). For scenarios that assumed a cross-sectional comparison, we assumed that the variance was 1.0. For scenarios that assumed a pre-post parallel trial design, we allowed the within-person variance to differ from the between-person variance; in the base case, we set the within-person variance to 0.364 (which assumes a pre-post correlation of r = .818).
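As a complement to the mathematical results in Supplemental Figure 4, the short R sketch below approximates the within-person Type I error rates by simulation under the base-case assumptions stated above (within-person variance of 0.364, trivial thresholds of ±0.2, α₁ = 5%); the total sample size, number of trials, and code structure are our own illustrative choices rather than the Appendix II code.

# Illustrative within-person (paired) simulation; not the Appendix II code.
# Change scores are expressed in between-person SD units; the true effect is zero.
set.seed(2)
n      <- 14           # total sample size (median among single-group studies)
delta  <- 0.2          # trivial thresholds: (-delta, +delta)
sd_w   <- sqrt(0.364)  # SD of change scores (implies pre-post correlation r = .818)
n_sims <- 20000

res <- replicate(n_sims, {
  d  <- rnorm(n, mean = 0, sd = sd_w)
  se <- sd(d) / sqrt(n); df <- n - 1
  lcl90 <- mean(d) - qt(0.95, df) * se; ucl90 <- mean(d) + qt(0.95, df) * se
  lcl50 <- mean(d) - qt(0.75, df) * se; ucl50 <- mean(d) + qt(0.75, df) * se
  likely   <- (lcl90 >= -delta && lcl50 >= delta) || (ucl90 <= delta && ucl50 <= -delta)
  possible <- likely || (lcl90 >= -delta && ucl50 >= delta) || (ucl90 <= delta && lcl50 <= -delta)
  c(likely = likely, possible = possible)
})

rowMeans(res)  # proportion of null trials yielding at least a "likely" / "possible" inference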