Reader Comments


Conclusions largely unrelated to findings of the systematic review

Posted by jaisbett on 03 Aug 2020 at 10:21 GMT

Summary: The review collects informative statistics on MBI usage patterns and presents an exact mathematical formulation of Type I error rates in certain scenarios. However, the research methods are poorly and incompletely reported, key data were not captured, and relevant literature was ignored. The authors draw strong conclusions that do not pertain to findings of the systematic review.
__________________________________________________________________________
Five authors of this systematic review had separately or jointly published papers that conclude the MBI procedure should not be used. Nevertheless, they set out anew to “better understand how this method has been applied in practice” through a review of 232 papers augmented with theoretical and simulated calculations of error rates.

Given its distinguished authorship and context, one might expect the review to be careful, rigorous and transparent. Unfortunately, it is none of these. The reporting of the data extraction procedure is misleading and incomplete, and data needed to underpin the paper’s major conclusions were never captured.

Box 1 illustrates these deficiencies in relation to the claim that 38% of studies declared effects to be substantial or implementable at a test alpha level of .75.

Unravelling the data extraction procedure is made difficult by the paper’s presentation, in which results (and even conjectures about results) appear in the Methods section, incomplete descriptions of additional methods appear in the Results section, and new findings essential to the conclusions pop up in the Discussion section. The text is peppered with words like “frequently” and “routinely” to describe the occurrence of MBI usage problems for which, at best, only small amounts of anecdotal data appear to have been collected. The data that might support the paper’s conclusions require document analysis, which is not mentioned in the Methods section or supplementary material, and the only published text extracts from the systematic review are examples drawn from nine studies.

Box 2 illustrates findings reported in the Results section that cannot be supported by data collected with the published tool, and for which the anecdotal support cannot provide evidence of prevalence.

The Discussion section begins: “Through our systematic review, we have documented that studies using MBI are typically small in size and often make strong claims based on weak evidence.” The data extraction tool captures study size and it captures the lowest evidence level about which some conclusion is made. It does not, however, capture the strength of the claim, since qualifiers attached to conclusions were not captured (see Box 3).

None of the conclusions in the Conclusion section are supported by the reported results. Instead they draw on material introduced in the Discussion section, in which quotes are extracted from on-line blogs by MBI proponents and from the reviewed studies. Lohse et al. seem unaware that they are conducting qualitative research, which requires as much rigour as a statistical analysis would to conclude that an intervention is harmful or to establish causality. The research questions being pursued, the method of selecting the documents and the analysis method are not described. This makes it hard to evaluate conclusions such as: “We found that MBI has done direct harm to the sports science and medicine literature by causing authors to draw overly optimistic conclusions from their data.”

Box 3 explains that evidence to support this and other conclusions was not collected in the systematic review, and relies on selective quotes from the studies, papers and blogs. The authors do not attempt to demonstrate objectivity and sensitivity in the selection and analysis of these documents.

Some statements in the concluding sections lack any apparent basis. For example, the authors argue that because MBI has been described by its developers as a Bayesian procedure, “it’s not surprising that users appear unaware that, underneath the hood, MBI is in fact running non-zero hypothesis tests.” No source is cited. The review in fact found that no study described its analysis as Bayesian, and the t-distribution functions used in the MBI Excel spreadsheet implementations are visible on clicking the appropriate cell.
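To make concrete what “underneath the hood” amounts to, the sketch below is my own reconstruction of an MBI-style calculation, not code from the review or from the spreadsheet; the effect size, standard error, degrees of freedom and smallest worthwhile change are invented for illustration. The chance of benefit is simply one minus the one-sided p-value of a test against the non-zero benefit threshold, so requiring it to exceed 25% is a test at an alpha of .75.

```python
# Minimal sketch of an MBI-style calculation (illustrative numbers only;
# not the review's code or the actual spreadsheet formulas).
from scipy import stats

effect = 0.45   # observed effect (e.g. a standardised mean difference)
se = 0.30       # standard error of the effect
df = 18         # degrees of freedom
swc = 0.20      # smallest worthwhile change (benefit/harm threshold)

# Chance the true effect exceeds +swc (benefit) or falls below -swc (harm),
# taken from a t-distribution centred on the observed effect.
chance_benefit = 1 - stats.t.cdf((swc - effect) / se, df)
chance_harm = stats.t.cdf((-swc - effect) / se, df)
chance_trivial = 1 - chance_benefit - chance_harm

# chance_benefit equals one minus the one-sided p-value against H0: effect <= +swc,
# i.e. a non-zero-null hypothesis test; "chance of benefit >= 25%" is that test
# run at alpha = .75.
p_one_sided = 1 - stats.t.cdf((effect - swc) / se, df)
print(f"benefit {chance_benefit:.0%}, trivial {chance_trivial:.0%}, "
      f"harm {chance_harm:.1%}; 1 - p = {1 - p_one_sided:.0%}")
```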

The review echoes previous MBI critiques in claiming that MBI “misrepresents weak evidence as strong evidence” without reference to the extensive literature concerning how people interpret probabilistic terms. The Type I error rates that the paper attributes to “possible” and “likely” MBI decisions fall in or above the typical probability/frequency ranges experimentally shown to represent these terms.

Previous studies into the misuse and misreporting of statistical analyses form another extensive body of literature neglected by the review. Methodological approaches taken by these studies could have been followed to avoid problems such as the obviously inadequate piloting of the tool. This literature might also provide a comparative basis to assess the impact of MBI on the use of default meaningful effect bounds, average sample size, prevalence of multiple testing and so on.


*** Box 1: Determining the threshold level of evidence
1. The Methods section refers only to a carefully developed data extraction tool and a process for resolving disagreements in reviewer assessments.

The only item in the tool that concerns the threshold level of evidence (Item 22) asks that the minimum chance of benefit be extracted from the “MBI details” in the paper under review, with NA to be coded if none is provided.

Item 22 matches a field chance_benefit in the published spreadsheet of the extracted data. The spreadsheet has a field explicit_threshold which does not match any item in the data extraction tool.

2. The Results section contains a table footnote which refers to “explicit” and “implicit” setting of thresholds. This section quotes phrases from three of the reviewed studies to illustrate explicit and implicit setting of evidence thresholds. There is no reference to supporting supplementary material.

The published data do not contain the key phrases used to determine implicit thresholds. The lead author confirmed to me that these had not been captured.

3. The only further description of coding of the field chance_benefit is in S1 Appendix, amid a discussion of reviewer agreement on items:
“Coding for the minimum evidence threshold had generally lower agreement, because this was only explicitly stated in 23% of papers. ... Where this was not explicitly stated, we used the OBJECTIVE CRITERION OF WHAT RESULTS THE AUTHORS PRESENTED AS SUBSTANTIVE in their abstract, conclusion section, and/or first paragraph of their discussion section.” [emphasis added]

Two examples of implicit determinations follow, one of which was also used in the Results section. In both examples, phrases reporting a “possible” or “likely” finding are paired with a conclusion about that finding. These examples support the definition highlighted above: the objective criterion for coding a threshold level is an unqualified conclusion about a finding at that level.

4. The review assumes that all studies use only the MBI evidence to determine which findings to highlight. It did not collect data on whether statistical significance, implied by CIs or assessed directly through NHST, determined the highlighted findings.

5. To audit the coding of the chance_benefit field, studies in the review had to be located, downloaded and reviewed from scratch, because the reviewers did not capture the key phrases used in coding the threshold level. A small sample (see Box 1a) shows that the determination of the evidence threshold was not made according to the simple “objective criterion” claimed.

In an email, the lead author KL elaborated on the actual process: “We debated some cases and trained reviewers in the pilot stages when we were trying to improve the consistency of coding.” None of this process is documented and the training material has not been published. The description “objective criterion” is therefore misleading.

*** Box 1a: Sample audit
Sixty-three of the studies in the review are coded with .25 (“possible”) as the benefit evidence threshold together with .05 as the harm risk. Of these, the spreadsheet field explicit_threshold codes three studies as explicit in setting the evidence threshold. In fact, none does. Instead, all three conform to the previous definition of implicit coding, by presenting a “possible” result as substantial.

A sample of 10 of the remaining 60 studies (selected in alphabetical order) showed that none conforms to the clear-cut example of assumed benefit provided in Lohse et al. To illustrate with the first four studies:

• Conclusions of Aughey et al. (2015) (miscoded as AUGHEY2016) only concern findings reported as “likely”; these are also reported as 90% CIs that are significant at that level. The first paragraph of the discussion refers to a possible effect that “remains” under different conditions, but this may refer to a non-inferior effect, which is the correct interpretation of “possible.” *

• Ayala et al. (2017) explicitly set alpha to .05 (presumably on harm and benefit) and appear to only make conclusions where NHST is significant at .05. There may be arithmetical and other errors not attributable to MBI.

• Bartolomei et al. (2014) conclude that the intervention “may enhance upper-body power expression”; there is no indication of whether this is based on some or all of a group of findings declared in the abstract as “likely” or “possible.” There is no unambiguous example showing “possible” is treated as substantial.

• The statistical analysis section of Bellinger et al. (2012) proposes to use NHST at .05, with MBI used to determine practical significance. Conclusions are reported as p values. One finding is reported as non-significant (p = .20) with an MBI likelihood of 37%, and the qualified conclusion is made that there “may be a small meaningful improvement in performance.” The discussion/conclusion begins “The main findings from the present study were that [the intervention] did not significantly improve high-intensity cycling performance; however, magnitude based inferences demonstrated that in highly-trained cyclists β-alanine supplementation was 37% likely to improve performance with 0% likelihood of a negative effect.”

* In MBI “possible positive” corresponds to a non-inferiority test at either an alpha of .05 or .005, when neither superiority nor equivalence can be established at an alpha of .25.
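The correspondence asserted in this footnote can be checked numerically. The sketch below is again my own illustration with invented numbers: requiring the chance of harm to be below 5% (0.5% in clinical MBI) is exactly a one-sided non-inferiority test of the null hypothesis that the true effect lies at or below the harm threshold, while a chance of benefit of at least 25% (but under 75%) is what earns the label “possibly” beneficial.

```python
# Sketch of the footnote's correspondence (illustrative values, my own code).
from scipy import stats

effect, se, df, swc = 0.15, 0.20, 18, 0.20

# MBI quantities
chance_harm = stats.t.cdf((-swc - effect) / se, df)        # P(true effect < -swc)
chance_benefit = 1 - stats.t.cdf((swc - effect) / se, df)  # P(true effect > +swc)

# One-sided non-inferiority test of H0: true effect <= -swc
p_noninferiority = 1 - stats.t.cdf((effect - (-swc)) / se, df)

# chance_harm equals p_noninferiority, so "chance of harm < 5%" is the
# non-inferiority test at alpha = .05; with chance_benefit >= 25% but < 75%,
# the effect is declared "possibly" beneficial.
print(f"chance_harm = {chance_harm:.3f}, p_noninferiority = {p_noninferiority:.3f}, "
      f"chance_benefit = {chance_benefit:.0%}")
```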


*** Box 2: Reporting of unpublished data not collected through the data extraction tool
The quotes below [emphasis added] illustrate findings that are not related to data collected through the data extraction tool and for which the only supporting data appear to be the selected quotes.
1. “We also note that MANY authors seemed to erroneously believe that use of MBI circumvents the need for an adequate sample size [25–29]. For example, Stanley et al. [25] wrote of MBI: ‘With this statistical approach, our sample size is not limiting.’”

2. “MOST authors reported using the Excel spreadsheets to run their MBI analyses, but A FEW authors reported running frequentist statistical models in other programs, such as SAS, but then interpreted the resulting frequentist confidence intervals in MBI terms.”


3. “In fact, AUTHORS APPEARED TO VIEW the “likely” threshold as a high level of evidence. For example, Barrett et al. [35] state: ‘We adopted a conservative approach to inference whereby substantial effects were only declared clear when the probability likelihood for the effect was ≥75% (i.e., likely).’”

4. “When p-values were reported alongside MBI results, we found that MBI descriptors were FREQUENTLY prioritized above p-values in the interpretation of results. Indeed, the results were OFTEN used to justify the superiority of MBI over conventional approaches.” [Extracted phrases from 4 studies follow]




*** Box 3: Conclusions not based on review findings and apparently based on ad hoc evidence
1. “Through our systematic review, we have documented that studies using MBI ... often make strong claims based on weak evidence”
The review authors KS & KL told me that the evidence level threshold coded in the field chance_benefit is not the dichotomization level, since a finding that an effect is substantial can be qualified. However, no data were captured on how qualifiers were applied to a conclusion. Hence the strength of claims was not captured.

The small audit reported in Box 1a found that all studies with the explicit_threshold field in the spreadsheet coded “implicit” had qualified the conclusions they made about findings at this level, whereas the three studies with explicit_threshold coded “explicit” had at least one unqualified conclusion. Of the 232 studies reviewed, 23% were coded “explicit”. Perhaps in these studies the chance_benefit field corresponds to the level at which findings are considered substantive, without qualification, and hence would represent strong claims on weak evidence. However, this is nowhere described.

2. “...we find no evidence that moving researchers away from p-values or null hypothesis significance testing makes them less prone to dichotomization…”
This strong conclusion in the abstract implies that studies dichotomize their findings at the evidence level threshold, as suggested by the definition of this threshold in Table 2. However, as noted above, the coded threshold is not the dichotomization level, and no data on the dichotomization level were captured. This conclusion is therefore unfounded.

3. “We found that MBI has done direct harm to the sports science and medicine literature by causing authors to draw overly optimistic conclusions from their data.”
As noted, no data were collected on how qualifiers were applied to conclusions, and no data were collected on dichotomization. The measure of harm and the methods to establish causality are not discussed, and no comparative statistics on misuse are offered.

The Discussion section reports a check of 10 articles that cite one of the reviewed studies; none of these was found to take account of the evidence level of that study’s findings. The selected study appears atypical in that its abstract did not report evidence levels with its results or qualify its conclusions. In any case, the consequences of using weak results (for example in meta-analyses) are not examined.

4. “MBI has .... reinforced the harmful view that the purpose of research is to get publishable results for the researcher”
This conclusion is supported by a 2018 YouTube post from an MBI user and a 2018 blog on the sportsci.org site. However, a related blog article on this site quotes a user for whom MBI analyses “were a cornerstone for my work examining high performance athletes and utilizing its findings to SUPPORT COACHES and the integrative support team.” One of the earliest and most cited works on magnitude based inference, Batterham & Hopkins 2006, describes applying the “chances of benefit, triviality and harm” to make a “final decision about ACTING ON AN OUTCOME ...” [emphasis added]

In document analysis, the objects to be studied should be selected with the aim of gaining a complete understanding of the topic under research.

No competing interests declared.