
Unusual outcome variances as a method to identify potentially problematic clinical trials

  • Philippe P. Hujoel ,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Writing – original draft, Writing – review & editing

    hujoel@uw.edu

    Affiliations Department of Epidemiology, School of Public Health, University of Washington, Seattle, Washington, United States of America, Department of Oral Health Sciences, School of Dentistry, University of Washington, Seattle, Washington, United States of America

  • Margaux L.A. Hujoel

    Roles Formal analysis, Methodology, Supervision, Writing – review & editing

    Affiliations Department of Human Genetics, University of California, Los Angeles, California, United States of America, Department of Computational Medicine, University of California, Los Angeles, California, United States of America

Abstract

An unusual outcome variance contributed to uncovering major cases of research misconduct, leading to over 200 retractions. Detecting such problematic randomized trials early – before they unduly influence clinical guidelines – remains challenging. Empirical evidence indicates that differences in variances between trial arms (DiVBTAs) are usually small and non-significant in properly conducted trials. This study investigated whether the converse – unusually large and statistically significant DiVBTAs – can serve as a red flag for potentially problematic trials. We conducted simulations to assess the sensitivity and specificity of a DiVBTA-based decision rule under realistic scenarios, including proper randomization, heterogeneous treatment effects, and missing-not-at-random data. In parallel, we applied the rule in a real-world analysis of 226 systematically sampled randomized trials in diabetes research to assess whether unusually large and statistically significant DiVBTAs occur with sufficient frequency to warrant screening. Unusually large DiVBTA values were defined as those falling outside the 3-sigma prediction limits. Simulations demonstrated high specificity, with legitimate trials rarely flagged (low false-positive rate), and adequate sensitivity for detecting a specific form of severe fabrication. In the empirical analysis, 19 of 226 trials (8%) were flagged as potentially problematic, demonstrating the utility of screening trials for unusually large and statistically significant DiVBTAs. Subsequent screening of the flagged trials revealed additional concerns in 18 of the 19. These findings suggest that screening for unusually large, statistically significant DiVBTAs offers a simple, low-effort tool to identify trials warranting further scrutiny, potentially strengthening the reliability of evidence used in clinical guidelines.

Introduction

Problematic clinical trials are widespread and erode the credibility of health information. It is estimated that 25% of published clinical trials may be flawed or fraudulent [1]. In absolute terms, hundreds of thousands of trials are believed to lack credibility [2]. These unreliable trials have permeated meta-analyses and clinical guidelines [3–6], which has led to concerted efforts at curbing their influence. These efforts have included global regulatory guidelines which have imposed requirements on data collection that are designed to minimize fraud, misrepresentation, and data integrity issues [7].

Systematic reviews face their own set of challenges in preventing potentially problematic trials from infiltrating clinical guidelines. A 2021 Cochrane editorial warned of the threat posed by untrustworthy or “problematic” studies, highlighting that retracted studies are only the tip of the iceberg. The editorial coincided with the release of new Cochrane guidelines on how to handle concerns about the trustworthiness of a publication when no formal post-publication correction exists [8]. Checklists to identify problematic trials [9–12], and recommendations on integrating these tools during research synthesis, followed [13].

The 2021 Cochrane editorial also underscored the urgent need for validated statistical methods to reliably and fairly detect trials with statistical irregularities [8]. One particularly promising method involves identifying improbable distributions of baseline data across trial arms [14]. This method exploits the cornerstone assumption that participants are allocated randomly to interventions, leading to predictable distributions of baseline variables across groups. Trials flagged as having highly unusual distributions relative to those expected by chance have shown a higher likelihood of retraction [15–17]. Additional statistical methods have become available, and at least two statistical packages integrate methods to re-appraise publication integrity in groups of randomized controlled trials [18–21].

An unusual outcome variance – described as a “very small standard deviation” by the first whistleblower – helped uncover the largest body of fraudulent research identified to date [22]. Building on the informativeness of unusual standard deviations, we propose that Differences in Variance Between Trial Arms (DiVBTAs) of continuous outcome measures can serve as the basis for an objective method to identify potentially problematic trials.

Empirical support for this proposal comes from meta-analyses of DiVBTAs in clinical trials, a preponderance of which demonstrated that DiVBTAs are typically not statistically significant or, when statistically significant, small in size [23–27]. Furthermore, the power of statistical tests to detect significant DiVBTAs is low in meta-analyses of clinical trials, and especially so within single clinical trials [21].

The discovery of a statistically significant DiVBTA within a single clinical trial can thus be regarded as a somewhat unexpected finding. Explanations for such unexpected findings currently focus on genuine design and analysis issues such as randomization, heterogeneous treatment effects, informative trial participant dropout, compliance issues, or floor and ceiling effects of the outcome variable. We suggest here that fraud or unintentional error needs to be included in the list of plausible explanations. As such, we (1) describe methods to identify statistically significant DiVBTA outliers, (2) perform simulations to assess how likely genuine design and analysis issues are to cause statistically significant DiVBTA outliers, and (3) provide a case study of clinical trials included in systematic reviews on diabetes.

Methods

The methods section is presented in two parts: (i) simulations to assess the sensitivity and specificity of the proposed DiVBTA decision rule, and (ii) a diabetes case study to assess whether the proposed decision rule has real-world clinical utility. The following background introduces DiVBTA terms discussed in the two subsequent subsections.

Background: The proposed decision rule to flag a potentially problematic trial has two elements: (1) the DiVBTA has to be statistically significant, and (2) the DiVBTA needs to be an outlier, i.e., fall outside a tolerance band.

A first step is selecting a DiVBTA statistic among those available (for a review of DiVBTA statistics see [21]). For the detection of potentially problematic trials, it is advantageous to focus on those DiVBTA estimators (a) which can be derived from published summary statistics, (b) which are standardized by the mean, and (c) which are normally distributed and robust.

First, selecting a DiVBTA estimator which can be derived from published summary statistics is crucial given that it remains uncommon for authors of clinical trials to share individual participant data. A 2019 survey of authors from 619 randomized controlled trials published in high-impact anesthesiology journals (2014–2016) found that only about 4% provided individual participant data upon request [28]. A 2019 randomized controlled trial assessing the impact of financial incentives to encourage data sharing reported that none of the investigators provided individual participant data [29]. As a result, DiVBTA estimators requiring individual participant data for calculations remain currently of little value in identifying potentially problematic trials.

Potential summary statistics of interest for calculating DiVBTA measures are the standard deviation (s), the mean (x̄), and the derived coefficient of variation (CV = s/x̄). For a two-arm trial, subscripts T and C denote the treatment and control arms, respectively (e.g., CV_T = s_T/x̄_T and CV_C = s_C/x̄_C).

Second, DiVBTA measures which minimize assumptions about mean-variance relationships have been recommended over measures which are built on the assumption that no mean-variance relationship exists [30]. The log coefficient of variation ratio (ln(CV_T/CV_C), or lnCVR) is from this perspective a conservative choice, as it explicitly normalizes variability by the mean. The lnCVR has the further advantage of being a “master” statistic, one which simplifies to other DiVBTA statistics when no mean-variance relationship exists. The log of the variability ratio (ln(s_T/s_C), or lnVR) is the special case of lnCVR when the group means are equal. The F-ratio (s_T²/s_C²), another DiVBTA ratio measure, is related to VR by a log transformation (ln(F) = 2·lnVR).
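
These relationships can be made concrete in a few lines of code. The sketch below is in Python (the authors' published code is in R); the function names and summary values are illustrative:

```python
import math

def ln_cvr(mean_t, sd_t, mean_c, sd_c):
    # log coefficient of variation ratio: ln((sd_t/mean_t) / (sd_c/mean_c))
    return math.log((sd_t / mean_t) / (sd_c / mean_c))

def ln_vr(sd_t, sd_c):
    # log variability ratio: ln(sd_t / sd_c)
    return math.log(sd_t / sd_c)

def ln_f(sd_t, sd_c):
    # log of the F-ratio, sd_t^2 / sd_c^2
    return math.log(sd_t ** 2 / sd_c ** 2)
```

With equal group means, ln_cvr reduces to ln_vr, and ln_f is exactly twice ln_vr, illustrating the “master statistic” property described above.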

Third, normally distributed ratio DiVBTA measures are preferable when it comes to identifying outliers. A log transformation can achieve this goal by reducing skewness, which is, for instance, inherent to the F-statistic. Ratio DiVBTA measures are preferable to difference DiVBTA measures because they are scale-invariant, robust across heterogeneous studies, and largely unaffected by publication errors that mislabel standard errors as standard deviations or fail to label the reported measures of variability as either standard errors or standard deviations.

  (i). The first criterion for a DiVBTA-based decision rule is establishing statistical significance of the DiVBTA. Methods to test the statistical significance of lnCVR are presented in the section on simulation methods, and in the Supplementary Materials for lnVR and the F statistic (S1 Text).
  (ii). The second criterion for a DiVBTA-based decision rule is defining an outlier, i.e., constructing a DiVBTA tolerance band. Standard trial dynamics (which can lead to statistically significant DiVBTAs) should not be flagged as potentially problematic. Randomization, for instance, will in and of itself lead 5% of DiVBTAs to be statistically significant when the type I error rate is set at 5%. Thus, without a DiVBTA tolerance band, 5% of trials would be falsely flagged as potentially problematic. Other legitimate trial dynamics, such as heterogeneous treatment effects, may further increase the proportion of trials falsely flagged as potentially problematic. Construction of a tolerance band reduces such false alarms. The wider the tolerance band, the fewer the false alarms, but at the cost of fewer truly problematic trials being captured.

A DiVBTA tolerance band can be determined using parametric or non-parametric methods based on the standard trial dynamics of a given clinical response variable (e.g., blood pressure or quality of life) and its statistical characteristics (e.g., absence of mean-variance relationships or ceiling effects).

A parametric approach to define tolerance bands which account for standard trial dynamics is to construct the 1−α DiVBTA prediction intervals based on a meta-regression [31]. Let DiVBTA_ijk and sd_ijk be the estimator and its standard deviation for the ith variance difference or variance ratio between the control arm and treatment arm i, at the jth post-intervention time, in the kth trial. A meta-analysis of these DiVBTA_ijk leads to (1) the mean DiVBTA for the group of randomized trials (M*), (2) the sample estimate of the variance of the true effect sizes (T²), and (3) the variance (V_M*) of the mean effect size M* [32]. The prediction interval of the differences in variance between trial arms can be calculated as:

M* ± t_α √(T² + V_M*)

where t_α is the t-value corresponding to the desired prediction interval (e.g., α ≈ 0.0027 for a 3-sigma probability). For independent DiVBTAs (i.e., one DiVBTA per clinical trial), the degrees of freedom (df) equal the number of clinical trials minus 2, and V_M* can be derived from a model-based estimate [32]. For correlated DiVBTAs (e.g., trials with more than 2 arms), robust meta-regression methods can be used, where df is typically calculated using a Satterthwaite approximation and V_M* is estimated empirically using a cluster-robust “sandwich” estimator [33].
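
Once M*, T², and V_M* are in hand, the prediction interval is a one-line calculation. The Python sketch below (the authors' code is in R) approximates t_α by 3 for a 3-sigma interval, which is adequate when df = k − 2 is large; the input values in the example are illustrative:

```python
import math

def prediction_interval(m_star, tau2, v_m_star, t_alpha=3.0):
    # M* +/- t_alpha * sqrt(T^2 + V_M*): prediction limits for the
    # DiVBTA of a new trial drawn from the same population of trials.
    # t_alpha = 3 approximates the 3-sigma t quantile for large df.
    half_width = t_alpha * math.sqrt(tau2 + v_m_star)
    return m_star - half_width, m_star + half_width
```

For example, with M* = 0, T² = 0.02, and V_M* = 0.0025 (hypothetical values), the 3-sigma limits are ±0.45.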

A non-parametric approach to define tolerance bands which account for standard trial dynamics is to derive the median DiVBTA for each included trial (the median across all DiVBTA_ijk for a given trial k). The median and the interquartile range of these median DiVBTAs can then be used as the basis for constructing DiVBTA Tukey inner and outer fences [34]. Potentially problematic trials can then be defined as statistically significant DiVBTAs which fall outside of these inner or outer Tukey fences.
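
Assuming the usual Tukey defaults (inner fences at 1.5 × IQR beyond the quartiles, outer fences at 3 × IQR), the fences can be computed as follows; this Python sketch uses the inclusive quartile method, which may differ slightly from other quartile conventions:

```python
from statistics import quantiles

def tukey_fences(median_divbtas):
    # quartiles of the per-trial median DiVBTAs (inclusive method)
    q1, _, q3 = quantiles(median_divbtas, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # inner fences
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # outer fences
    return inner, outer
```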

The parametric or non-parametric cutoff values (e.g., α ≈ 0.0027 or inner Tukey fences) determine the width of the DiVBTA tolerance bands which in turn determine the sensitivity and specificity of the decision rule to identify potentially problematic trials. On one hand, defining a narrow tolerance band will increase sensitivity and decrease specificity; standard trial dynamics will frequently be falsely flagged as potentially problematic. On the other hand, setting a wide tolerance band will decrease sensitivity but increase specificity; potentially problematic trials will frequently fail to be flagged for further evaluation.

Simulation methods

The simulation methods are presented using the ADEMP structure [35].

Aims.

To evaluate the specificity and sensitivity of the proposed DiVBTA-based decision rule for detecting anomalous variance patterns under realistic and manipulated trial scenarios.

Specifically, the simulations assess:

  • Specificity (true negative rate): the probability that the test correctly classifies non-problematic trials as non-problematic (i.e., avoids false positives/ false alarms) under standard/legitimate trial dynamics and non-standard/questionable trial dynamics, including (i) proper randomization, (ii) heterogeneous treatment effects (HTE), and (iii) missing-not-at-random (MNAR) dropout.
  • Sensitivity (true positive rate): the probability that the test correctly identifies potentially problematic trials. For the simulations, a specific form of data manipulation (i.e., fraud) was modeled, namely deletion of a fraction of the worst responders in the treatment arm followed by replacement (duplication) of those values with copies of the best-responder observation. Because the mechanisms by which trial data are actually manipulated are largely unknown, the relevance of this specific form of simulated fraud is uncertain, and the sensitivity of any method for identifying potentially problematic trials is difficult to quantify. A proposed decision rule will nonetheless have utility if it has high specificity (i.e., few false positives), allowing it to serve as a screening tool to rule in potential fraud.

A secondary aim is to examine the performance of the proposed decision rule under both large-sample and small-sample scenarios. These aims focus on assessing robustness against false positive conclusions (specificity under realistic trial features that should not trigger alarms) and detection power (sensitivity to detect a targeted fraudulent mechanism), treating a statistically significant DiVBTA outlier as a diagnostic flag for potential issues (unintentional errors or fraud).

Data-generating mechanisms.

The data-generating mechanisms for the simulations to assess these specific aims were based on five parameter estimates: two characterizing the probability distribution of the clinical response variable (e.g., the mean and the standard deviation of a normal distribution) and three derived from a meta-analysis of clinical trials on the clinical response variable (the mean (M*), the variance of the mean (V_M*), and the variance of the true effect sizes (τ²)). Almost all clinical outcomes can be modeled with these 5 parameter estimates, making the provided R program versatile.

These 5 parameter estimates are specific to the (continuous) clinical response under investigation. Clinical outcomes such as patient-reported outcome measures can have ceiling or floor effects (which can create statistically significant DiVBTAs), whereas other clinical outcomes such as blood pressure do not. Other standard trial dynamics, such as the presence of heterogeneous treatment effects (which can also create statistically significant DiVBTAs), can likewise differ depending on the selected clinical outcome. In other words, the 5 parameter estimates are specific to a given response variable or domain.

The 5 parameters in the simulations presented here are diabetes-specific. Data were generated for simulated two-arm randomized trials with a continuous primary outcome (post-intervention HbA1c), modeled parametrically as a gamma distribution. Shape and rate parameters for the gamma distribution were estimated from individual participant data in a pivotal NIH-funded diabetes trial [36]. The standard trial dynamics in diabetes clinical trials were derived from a meta-analysis of post-intervention HbA1c standard deviations in 175 trials. The between-trial lnCVR heterogeneity was modeled as a baseline layer that existed in every simulated trial by sampling a trial-specific true lnCVR from N(M*, τ2).

Three standard (legitimate) trial dynamics were modeled to evaluate specificity:

  • Randomization — Participants randomly assigned to treatment or control arms (assessed under the assumption of no heterogeneous treatment effects and no missing data to isolate randomization effects on lnCVR variability).
  • Heterogeneous treatment effects (HTE) — Systematic variation in treatment response in the intervention arm, modeled via two parameters: (1) the proportion of participants responding to treatment (varied 0%–50%), and (2) the magnitude of the treatment effect among responders (HbA1c treatment effect from −0.2% to −2.0%). Standard and non-standard treatment effect sizes were defined as −0.2% to −1.4% and −1.4% to −2.0%, respectively (−1.4% is the 3-sigma bound for treatment effect sizes in a systematic review of diabetes trials; next section). A systematic review of diabetes trials failed to provide strong evidence in support of heterogeneous treatment effects. Standard and non-standard trial dynamics for the proportion of participants responding to treatment were defined as 0%–20% and 30%–50%, respectively.
  • Missing-not-at-random (MNAR) dropout — Dropout probability dependent on unobserved (missing) outcomes, modeled as deletion of the worst responders (highest HbA1c values) from the control group. This is an extreme mechanism unlikely in real settings and was specified as such to stress-test specificity. Less than 3% of the trials included in Cochrane reviews have a dropout rate of 30% or more [37], and the fraction of these trials exhibiting the extreme informative-dropout mechanism modeled in this study is likely to be very small. We classified a simulated dropout rate of 0% to 20% as a standard trial dynamic, and a simulated dropout of 30% to 50% (both with the extreme mechanism described above) as a non-standard trial dynamic. Given the extreme form of dropout modeled, this is likely to be a conservative definition of standard/non-standard.
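
These legitimate dynamics amount to simple data transformations. The Python sketch below (the authors provide R code; the gamma parameters and effect sizes here are illustrative, not the authors' estimates) shows one way to generate an arm and apply HTE or the extreme MNAR mechanism:

```python
import random

def simulate_arm(n, shape, rate, rng):
    # post-intervention HbA1c modeled as gamma(shape, rate);
    # random.gammavariate takes (alpha, beta) with beta = 1/rate
    return [rng.gammavariate(shape, 1.0 / rate) for _ in range(n)]

def apply_hte(arm, prop_responders, effect, rng):
    # heterogeneous treatment effects: shift a random subset of
    # responders by `effect` (e.g., -1.0 HbA1c percentage points)
    out = arm[:]
    n_resp = int(round(prop_responders * len(out)))
    for i in rng.sample(range(len(out)), n_resp):
        out[i] += effect
    return out

def apply_mnar_dropout(arm, dropout_rate):
    # extreme MNAR mechanism: delete the worst responders
    # (highest HbA1c values) from the arm
    n_keep = len(arm) - int(round(dropout_rate * len(arm)))
    return sorted(arm)[:n_keep]
```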

One fraudulent mechanism was modeled to evaluate sensitivity:

  • Deletion of a fraction of the worst responders (highest HbA1c) in the treatment arm, replaced by duplication of the single best responder (lowest post-treatment HbA1c) observation. The fraction replaced was varied between 50% and 90%.
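
A minimal sketch of this fabrication mechanism, assuming lower HbA1c is the better response (Python; illustrative only):

```python
def fabricate(arm, fraction):
    # delete the worst responders (highest HbA1c) and replace them
    # with duplicates of the single best responder (lowest HbA1c),
    # shrinking the arm's variance while improving its mean
    values = sorted(arm)
    n_replace = int(round(fraction * len(values)))
    kept = values[:len(values) - n_replace]
    best = values[0]
    return kept + [best] * n_replace
```

For example, fabricate([6.5, 7.0, 7.5, 8.0, 9.0], 0.4) replaces the two highest values with two copies of 6.5, lowering both the mean and the standard deviation of the arm.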

Simulations were run in both large-sample (n = 250 per arm) and small-sample (n = 20 per arm) scenarios, reflecting the approximate 95th and 25th percentiles of sample sizes in the motivating diabetes trial meta-analysis. For each scenario/combination, 1,000 independent trials (iterations) were simulated, which yields a Monte Carlo standard error of ~0.007 for a sensitivity or specificity of 0.95. The 3-sigma tolerance bounds for outlier classification were derived from a robust meta-analysis of 175 diabetes trials, specifically using the variance of true effects (τ²) and the variance of the mean effect (V_M*) estimated from lnCVR measures in those trials. All simulation code (R), random seeds, and modifiable key parameters are available at https://www.github.com/mhujoel/DIVBTA, enabling verification and adaptation to other clinical settings.
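
The quoted Monte Carlo standard error follows from the binomial formula √(p(1 − p)/n); a quick check in Python:

```python
import math

def mc_se(p, n_iter):
    # Monte Carlo standard error of an estimated proportion
    # (sensitivity or specificity) over n_iter simulated trials
    return math.sqrt(p * (1.0 - p) / n_iter)
```

mc_se(0.95, 1000) is approximately 0.0069, i.e., the ~0.007 quoted above.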

Estimand.

The estimand is the trial-specific true lnCVR. The estimator is the log coefficient of variation ratio (lnCVR), which quantifies relative variability between the treatment and control arms and is defined as lnCVR = ln(CV_T/CV_C) = ln((s_T/x̄_T)/(s_C/x̄_C)). The approximate sampling variance of lnCVR is s_T²/(n_T·x̄_T²) + 1/(2(n_T − 1)) + s_C²/(n_C·x̄_C²) + 1/(2(n_C − 1)).
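
In code, the estimator reads as follows (a Python sketch; the sampling-variance expression is the commonly used large-sample approximation for independent groups and may differ in detail from the authors' R implementation):

```python
import math

def lncvr_with_variance(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    # point estimate: ln((sd_t/mean_t) / (sd_c/mean_c))
    est = math.log((sd_t / mean_t) / (sd_c / mean_c))
    # large-sample approximation to the sampling variance of lnCVR
    var = (sd_t ** 2 / (n_t * mean_t ** 2) + 1.0 / (2 * (n_t - 1))
           + sd_c ** 2 / (n_c * mean_c ** 2) + 1.0 / (2 * (n_c - 1)))
    return est, var
```

Identical arm summaries give an estimate of 0, and greater relative variability in the treatment arm gives a positive lnCVR.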

Methods.

For each simulated trial (iteration), lnCVR and its sampling variance were computed using the formulas above. The decision rule to classify a trial as potentially problematic consisted of two criteria:

  1. lnCVR is statistically significant (p ≤ 0.05) using a two-sample t-test with Welch–Satterthwaite degrees of freedom for parallel-arm trials.
  2. lnCVR exceeds the 3-sigma tolerance bound, derived from the meta-analysis of 175 diabetes trials defined as reliable (see next section).
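
Combined, the two criteria amount to a few lines of code. The sketch below approximates the authors' t-test with a normal (z) test on lnCVR divided by its standard error, which is adequate except in very small arms; the tolerance bounds are passed in from the meta-analysis step:

```python
import math
from statistics import NormalDist

def flag_trial(lncvr_est, lncvr_var, lower, upper, alpha=0.05):
    # criterion 1: statistical significance of lnCVR (normal
    # approximation; the authors use a t-test with
    # Welch-Satterthwaite degrees of freedom)
    z = lncvr_est / math.sqrt(lncvr_var)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    significant = p_value <= alpha
    # criterion 2: lnCVR falls outside the 3-sigma tolerance band
    outlier = lncvr_est < lower or lncvr_est > upper
    return significant and outlier
```

Only trials meeting both criteria are flagged: a large but imprecise lnCVR, or a precise but in-band lnCVR, is not.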

No comparator methods were evaluated because comparators such as lnVR or the F-statistic are biased when mean-variance relationships are present, as is the case for the clinical outcome selected (hemoglobin A1c).

Performance measures.

The primary performance measures were specificity and sensitivity (framed in diagnostic testing terminology, where the “disease” is a problematic trial due to unintentional errors or fraud, and a positive test for “disease” is a statistically significant 3-sigma lnCVR outlier). These measures directly address the aims: high specificity indicates low risk of false alarms under legitimate trial features; high sensitivity indicates good detection of the targeted fraud mechanism.

Methods for case-study assessing real-world utility

A systematic search of the Cochrane Library was performed for systematic reviews with the key words diabetes and glycaemic: ((“The Cochrane database of systematic reviews”[Journal]) AND ((“2010/1/1”[Date – Publication]: “2023/05/23”[Date – Publication]))) AND (diabetes[Title] OR diabetic[Title]) AND (glycaemic OR “glucose-lowering drugs”). Key characteristics of the identified trials were abstracted and screened for the availability of post-intervention standard deviations or standard errors. Variance measures derived from statistical methods which assume homogeneity of variance were excluded. The Carlisle-Stouffer-Fisher p-value, a measure of the plausibility of randomization given the baseline data, was calculated for trials reporting baseline data [15,38]. Risk-of-bias scores for randomization, blinding, and ascertainment were abstracted from the Cochrane reviews and assigned values of 0 for high risk, 1 for unclear risk, and 2 for low risk. Data source (“published data only” or “published and unpublished data”), funding source, sample sizes, number of trial arms, trial duration, effect size, and PubMed identification numbers (PMID) were abstracted. Trials without a PMID included Ph.D. or Master’s theses, grey literature, clinical trial registries which posted results on the registration website, and other data sources not indexed in PubMed. The origin of the trial data was reclassified from “published data only” to “published and unpublished data” for 19 trials included in two Cochrane reviews because an organization other than Cochrane (the Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen, IQWiG) had obtained unpublished data for 6 and 13 trials, respectively, in those two reviews [39,40].

Trials with no or improbable baseline data (operationalized as a one-sided Carlisle-Stouffer-Fisher p-value which was < 0.025, > 0.975, or missing) were identified and excluded from the set of trials used to define the tolerance band. Bootstrap sampling assessed the robustness of the tolerance band's width to excluding trials with no or improbable baseline data. Parametric and non-parametric tolerance bands were estimated as described in the previous section. Statistical tests reported in the trials were described as questionable when the trial failed to provide a description of the primary statistical test or reported any of the following tests without accommodation for the extreme variance heterogeneity: (1) use of the standard Student’s t-test (or an equivalent pooled-variance method), (2) reliance on standard ANOVA (including repeated-measures ANOVA), (3) use of standard repeated-measures ANCOVA (or ANCOVA), (4) reporting of parametric or non-parametric tests “as appropriate,” as this assumes readers can retroactively infer the decision rule, or (5) reporting of p-values without reporting the statistical tests used. Countries with a retraction rate above 0.10% in the field of medicine were labeled as having a high retraction ranking [41].

Results

Simulation results

The specificity and sensitivity of the proposed decision rule to flag potentially problematic trials are presented for standard and non-standard trial dynamics (Tables 1–3 for a 3-sigma decision rule, and S1–S3 Tables for a 4-sigma decision rule).

Table 1. Specificity of 3-sigma statistically-significant lnCVR when randomization and heterogeneous treatment effects occur in legitimate trials, by sample size per trial arm.

https://doi.org/10.1371/journal.pone.0346238.t001

Table 2. Specificity of 3-sigma statistically significant lnCVR when randomization and an extreme form of MNAR dropout (0–50%) occur in trials, by sample size per trial arm.

https://doi.org/10.1371/journal.pone.0346238.t002

Table 3. Sensitivity of 3-sigma statistically significant lnCVR for detecting simulated fraud (50–90% worst HbA1c scores in intervention arm replaced by best-responder value), by sample size per trial arm.

https://doi.org/10.1371/journal.pone.0346238.t003

Specificity of 3- and 4-sigma significant lnCVRs in the presence of randomization only: Randomization alone leads to few false alarms. In large samples, the specificity of 3- and 4-sigma significant lnCVRs exceeds 99.4% and 99.9%, respectively (very few false alarms). In small samples, the specificity of 3- and 4-sigma lnCVRs exceeds 91.8% and 97.2%, respectively.

Specificity of 3- and 4-sigma significant lnCVRs in the presence of randomization and heterogeneous treatment effects: (i) Under standard trial dynamics (heterogeneous treatment effects combined with randomization), 3-sigma significant lnCVRs yielded 99.0% to 99.8% specificity in large-sample settings and 90.5% to 93.7% specificity in small-sample settings. By contrast, 4-sigma significant lnCVRs yielded 99.9% to 100% specificity in large-sample settings and 96.7% to 98.5% specificity in small-sample settings. (ii) Under non-standard trial dynamics (heterogeneous treatment effects combined with randomization), 3-sigma significant lnCVRs yielded 65.0% to 100% specificity in large-sample settings and 61.8% to 94.1% specificity in small-sample settings. By contrast, 4-sigma significant lnCVRs yielded 96.6% to 100% specificity in large-sample settings and 80.1% to 98.6% specificity in small-sample settings.

Specificity of 3- and 4-sigma significant lnCVRs in the presence of randomization and missing-not-at-random dropout: (i) Under standard-trial dynamics, when the prevalence of the extreme form of dropout modeled was 20% or less, the specificity exceeded 89.1% in large-sample settings and 83.0% in small-sample settings for 3-sigma significant lnCVRs. For 4-sigma significant lnCVRs, the specificity exceeded 98.2% and 90.9% in large- and small-sample settings, respectively. (ii) Under non-standard trial dynamics, for 3-sigma bounds, the specificity ranged from 69.0% to 77.3% for small-sample settings, and from 51.5% to 79.0% for large-sample settings. Under non-standard trial dynamics, for 4-sigma bounds, the specificity ranged from 69.0% to 84.8% for small-sample settings, and from 82.5% to 95.4% for large-sample settings.

Sensitivity of 3- and 4-sigma significant lnCVRs to duplication of responses: The proposed decision rule to flag potentially problematic trials is not sensitive to detecting a 50% duplication of responses in a trial arm. It is only when the rate of data duplication in the intervention arm reaches 80% or higher that the sensitivity becomes larger than 89.8% for small-sample trials. In large-sample settings, a data duplication rate of 90% leads to a sensitivity of 85.4%. The sensitivity decreased when a 4-sigma significant lnCVR tolerance was selected for the decision rule.

A case study assessing utility of the proposed decision rule

The PRISMA flow diagram shows the systematic selection process, which led to a sample of 305 trials, 226 of which permitted calculation of lnCVR (S1 Fig). Trials reporting calculable lnCVRs (n = 226), versus those for which no lnCVR could be calculated (n = 79), were more likely not to report funding, to be of shorter duration, to have fewer authors, not to be indexed in PubMed, to have a smaller sample size, to have fewer trial arms, and to originate from a country with a high retraction ranking (S4 Table).

The 3-sigma prediction interval for lnCVR (i.e., the selected DiVBTA) was −0.54 to 0.47 for trials with plausible baseline data (175 trials; 229 lnCVR estimates). Bootstrap sampling showed a modest impact of restricting the estimation of the tolerance bands to trials reporting plausible baseline data (S5 Table). Bootstrapping from the available 226 trials led to 3-sigma prediction intervals in which the 95% confidence interval for the lower bound ranged from −0.67 to −0.46 and that for the upper bound ranged from 0.39 to 0.65.

Nineteen trials reported statistically significant lnCVRs falling outside the parametric 3-sigma prediction interval of −0.54 to 0.47 (Fig 1). These trials, when compared to the 207 trials with either non-significant lnCVRs or lnCVRs falling inside the 3-sigma prediction interval, were more likely to report baseline data distributions inconsistent with randomization and to have smaller sample sizes (S6 Table). Eighteen of the 19 trials reported at least one additional potentially problematic feature (Table 4): (1) improbable or no baseline data (n = 7), (2) calculation or data errors in glycemic responses (n = 3), (3) 0% dropout (n = 4), (4) high risk of attrition bias (n = 2) or of allocation concealment bias (n = 2), (5) a larger-than-3-sigma effect size for HbA1c improvement (n = 1), and (6) reporting of a questionable statistical test (e.g., the standard Student t-test) or no statistical test (n = 13). Eleven of the 19 trials reported statistically significant lnCVRs falling outside the parametric 4-sigma bounds.

Table 4. Nineteen trials with DIVBTAs outside of the 3-sigma bounds – 7 checks on trustworthiness.

https://doi.org/10.1371/journal.pone.0346238.t004

Fig 1. Nineteen trials flagged as potentially problematic because their lnCVRs are (i) statistically significant and (ii) outside of the 3-sigma lnCVR prediction intervals estimated from 175 trials.

The left side of the graph shows the parametric approach to screening for potentially problematic trials: construction of the 3-sigma lnCVR prediction interval. The right side of the graph shows the non-parametric approach: construction of Tukey inner and outer fences. The parametric 3-sigma prediction interval and the Tukey inner fences are remarkably similar.

https://doi.org/10.1371/journal.pone.0346238.g001
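The non-parametric fences shown on the right of Fig 1 follow Tukey's standard construction [34]. A minimal sketch, assuming quartiles computed with Python's default exclusive method (the original analysis may have used a different quantile convention):

```python
import statistics

def tukey_fences(lncvrs):
    # Inner fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR; outer fences use 3*IQR.
    q1, _, q3 = statistics.quantiles(lncvrs, n=4)  # exclusive method by default
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer
```

lnCVR estimates beyond the inner fences are Tukey's "outside" values and those beyond the outer fences are "far out" values, roughly mirroring the parametric 3- and 4-sigma bands.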

The high prevalence of potentially problematic trials reported here (~8%) reflects calendar years during which awareness and scrutiny of problematic trials were low. The included Cochrane reviews were published starting in 2010; all but one were published before Cochrane’s 2021 editorial policy on managing potentially problematic studies. Authors of the most recent Cochrane review included in this report may have been unaware of the 2021 editorial policy, in part because it exists separately from the Cochrane Handbook.

Discussion

Three lines of evidence support the view that trials with unusually large and statistically significant Differences in Variances Between Trial Arms (DiVBTAs) can be flagged as potentially problematic. First, empirical evidence has shown that DiVBTAs are typically small and statistically non-significant [21,23–27,42], making trials with large and statistically significant DiVBTAs unexpected. Second, simulations demonstrated that the decision rule to flag studies with statistically significant DiVBTA outliers as potentially problematic has high specificity, especially when evaluated under standard trial dynamics. A high specificity implies that a flagged trial can reliably be ruled in as potentially problematic. Third, the real-world case study showed that the effort to calculate and analyze DiVBTAs is worthwhile: about 8% of trials were flagged even when a large DiVBTA was defined as a 3-sigma event. These three lines of evidence suggest that identifying studies with statistically significant DiVBTA outliers is a worthwhile screening tool and offers a fair and objective flag for potentially problematic studies.

An illustrative example demonstrates how the proposed DiVBTA method provides an objective and fair alternative to the subjective impression of unusual variances that flagged problematic trials in the past. The 2009 Boldt et al. trial—pivotal in exposing one of the largest research misconduct scandals to date [22]—was originally questioned in part based on perceived anomalies in reported variances. Our method retrospectively confirms this concern objectively: it flagged the trial as potentially problematic. The DiVBTA was statistically significant (p = 3.1 × 10⁻¹³), the first criterion for flagging a trial, and it was also an outlier, the second criterion. This example illustrates that what was once flagged subjectively as potentially problematic can now be detected reliably and transparently using statistical criteria, thereby enhancing fairness and reproducibility in identifying potentially fraudulent or problematic studies. Further real-world examples of confirmed fraudulent trials (e.g., from retracted studies or audits) may reveal how often fraud manifests itself as outlying significant variance differences across trial arms.
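When raw standard deviations and sample sizes are reported, a p-value for a DiVBTA can be attached with a simple normal approximation to the log variance ratio (lnVR). This is a hedged sketch of one such test, not the exact analysis applied to the Boldt trial; the numbers in the example are illustrative rather than the Boldt data.

```python
import math

def ln_vr_test(sd1, n1, sd2, n2):
    # Log SD ratio with small-sample bias correction; the standard error
    # uses the chi-square approximation 1/(2(n - 1)) per arm.
    lnvr = math.log(sd1 / sd2) + 1.0 / (2 * (n1 - 1)) - 1.0 / (2 * (n2 - 1))
    se = math.sqrt(1.0 / (2 * (n1 - 1)) + 1.0 / (2 * (n2 - 1)))
    z = lnvr / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return lnvr, p
```

With 50 patients per arm and standard deviations of 2 versus 1, the two-sided p-value falls below 10⁻⁵, far past conventional significance; equal standard deviations give p = 1.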

It is essential to clarify that a statistically significant and unusually large DiVBTA identifies trials as potentially problematic but does not confirm fraud, fabrication, unintentional error, or any specific form of misconduct. Such outliers may occasionally arise from genuine, albeit rare, trial dynamics. As illustrated by the simulations using 3- and 4-sigma outliers as flags for identifying potentially problematic trials, the wider the tolerance band, the less likely a flagged trial is to have arisen from rare trial dynamics. DiVBTA outliers may also arise from unintentional errors beyond the control of the trial investigators (e.g., transcription mistakes during the publication process or in intermediary steps between trial publication by the clinical trial team and meta-analysis publication by a separate research team). In one such case, the published trial report presented only non-parametric statistics [43]. The trial investigators later provided unpublished parametric statistics directly to the meta-analysis team [44]—statistics now flagged as outliers in this report. When unpublished data are used in this way, it becomes impossible to determine whether an error occurred or, if it did, which team (trial investigators or meta-analysis team) was responsible.

These considerations underscore the role of DiVBTA as a screening tool rather than a definitive diagnostic, complementing rather than replacing other data integrity tools.

Spotting DiVBTA outliers could become integrated into checklists to assess the trustworthiness of clinical trial reports. Checks of baseline data, such as the Carlisle-Stouffer-Fisher method, focus on detecting randomization failures [15]. DiVBTA methods extend scrutiny to post-randomization outcome data, capturing data problems that may not be manifest at baseline. The method’s compatibility with summary statistics commonly reported in clinical trials makes it practical to implement, and its applicability to non-normal outcomes via the lnCVR further enhances its versatility. By integrating parametric (robust meta-regression) and non-parametric (Tukey fences) approaches, and by assessing tolerance bands of different widths, a fair and objective framework can be constructed to assess the robustness of labeling trials as potentially problematic. The case study presented suggests that trials with DiVBTA outliers frequently fail to report statistical tests that account for the extreme variance heterogeneity. The absence of an appropriate analysis can bias p-values, risk invalid inferences, and consequently expose patients to ineffective or harmful treatments or deny them access to beneficial ones.

Despite its strengths, the DiVBTA method has limitations. First, detecting a DiVBTA becomes challenging when reported standard deviations are imprecise. This can occur when small standard deviations (SD ≈ 1–2) are excessively rounded, an issue that becomes more pronounced when values are reported as standard errors, where rounding can cause a greater loss of information. Second, no meaningful DiVBTA can be computed for trials reporting standard deviations or standard errors calculated under the assumption of homogeneous variances. Third, simulation studies showed that small sample sizes increase the risk of false alarms (ruling in a trial as potentially problematic when standard trial dynamics may in fact have been responsible). Fourth, in the unlikely scenario in which no systematic reviews of clinical trials are available, the proposed method needs to start with a search for trials reporting variability estimates.
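The first limitation, the loss of information when small standard deviations are rounded, can be illustrated numerically; the means and standard deviations below are hypothetical:

```python
import math

# A small SD rounded to an integer can erase the very signal being screened
# for: with identical arm means of 8, a true SD ratio of 1.4 vs 1.0 yields
# lnCVR = ln(1.4) (about 0.34, a sizable share of the 3-sigma band),
# while rounding 1.4 down to 1 collapses the difference to exactly zero.
true_diff = math.log(1.4 / 8.0) - math.log(1.0 / 8.0)
rounded_diff = math.log(round(1.4) / 8.0) - math.log(1.0 / 8.0)
```

Under this hypothetical, the rounded report would be indistinguishable from a trial with perfectly equal variances.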

By offering a scalable, objective method, DiVBTAs could be integrated into the checklists of journal editors, peer reviewers, and meta-analysts to flag trials for further scrutiny, potentially reducing or preventing the impact of problematic studies on the meta-analyses that inform clinical guidelines. Several such workflows are already in place [13,45]. Prospective studies tracking flagged trials for retraction rates could quantify the method’s predictive accuracy, while integration into automated tools or AI-assisted review systems could streamline its application. Collaborative efforts to combine DiVBTA with multimodal integrity checks (e.g., baseline anomalies, effect size outliers) could further improve specificity. Ultimately, the DiVBTA method may offer a robust, transparent approach to bolstering research integrity, addressing the urgent need for validated statistical tools to safeguard the credibility of clinical evidence.

Supporting information

S1 Text. DiVBTA measures other than lnCVR.

https://doi.org/10.1371/journal.pone.0346238.s001

(DOCX)

S1 Table. Specificity of 4-sigma statistically significant lnCVR when randomization and heterogeneous treatment effects occur in legitimate trials, by sample size per trial arm.

https://doi.org/10.1371/journal.pone.0346238.s002

(DOCX)

S2 Table. Specificity of 4-sigma statistically significant lnCVR when randomization and MNAR dropout (0–50%) occur in legitimate trials, by sample size per trial arm.

https://doi.org/10.1371/journal.pone.0346238.s003

(DOCX)

S3 Table. Sensitivity of 4-sigma statistically significant lnCVR for detecting simulated fraud (50–90% worst HbA1c scores in intervention arm replaced by best-responder value), by sample size per trial arm.

https://doi.org/10.1371/journal.pone.0346238.s004

(DOCX)

S1 Fig. Flow diagram of study selection for the meta-analysis of lnCVR estimates.

From 58 Cochrane reviews identified via search terms (n = 57) and follow-up (n = 1), 23 were excluded due to absence of HbA1c outcome or being reviews of reviews. 338 trials were identified in the remaining 35 Cochrane reviews, yielding 305 unique trials. 79 trials lacked informative SD estimates and were excluded, leaving 226 trials for the lnCVR meta-analysis.

https://doi.org/10.1371/journal.pone.0346238.s005

(PNG)

S4 Table. Characteristics of diabetes intervention trials stratified by reporting of outcome standard deviations.

https://doi.org/10.1371/journal.pone.0346238.s006

(DOCX)

S5 Table. Characteristics of diabetes intervention trials stratified by reporting of statistically significant lnCVR outliers.

https://doi.org/10.1371/journal.pone.0346238.s007

(DOCX)

S6 Table. Assessment of the robustness of the lnCVR meta-analysis to the selection of RCTs for inclusion.

Bootstrap summary of lnCVR effect estimates and 3σ prediction interval bounds. Results from 1000 bootstrap replications (with replacement at the study level). The original values are based on the full dataset (n = 226 studies). The 95% confidence intervals are percentile-based.

https://doi.org/10.1371/journal.pone.0346238.s008

(DOCX)

References

1. Van Noorden R, Thompson B. Audio long read: Medicine is plagued by untrustworthy clinical trials. How many studies are faked or flawed? Nature. 2023. doi:10.1038/d41586-023-02627-0. pmid:37626217
2. Ioannidis JPA. Hundreds of thousands of zombie randomised trials circulate among us. Anaesthesia. 2021;76(4):444–7. pmid:33124075
3. Anonymous. There is a worrying amount of fraud in medical research and a worrying unwillingness to do anything about it. The Economist. 2023.
4. Blake H, Watt H, Winnett R. Millions at risk in drug fraud scandal: Investigation prosecutors investigate doctor. The Daily Telegraph. 2011.
5. Subbaraman N. The band of debunkers busting bad scientists; Stanford’s president and a high-profile physicist are among those taken down by a growing wave of volunteers who expose faulty or fraudulent research. Wall Street Journal. 2023.
6. McKie R. Peer review and scientific publishing: ‘The situation has become appalling’: fake scientific papers push research credibility to crisis point. The Guardian. 2024.
7. International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH). Guideline for Good Clinical Practice E6(R2). 2016. https://www.ema.europa.eu/en/ich-e6-good-clinical-practice-scientific-guideline
8. Boughton SL, Wilkinson J, Bero L. When beauty is but skin deep: dealing with problematic studies in systematic reviews. Cochrane Database Syst Rev. 2021;6(6):ED000152. pmid:34081324
9. Hunter KE, Webster AC, Clarke M, Page MJ, Libesman S, Godolphin PJ, et al. Development of a checklist of standard items for processing individual participant data from randomised trials for meta-analyses: Protocol for a modified e-Delphi study. PLoS One. 2022;17(10):e0275893. pmid:36219622
10. Mol BW, Lai S, Rahim A, Bordewijk EM, Wang R, van Eekelen R, et al. Checklist to assess Trustworthiness in RAndomised Controlled Trials (TRACT checklist): concept proposal and pilot. Res Integr Peer Rev. 2023;8(1):6. pmid:37337220
11. Weibel S, Popp M, Reis S, Skoetz N, Garner P, Sydenham E. Identifying and managing problematic trials: A research integrity assessment tool for randomized controlled trials in evidence synthesis. Res Synth Methods. 2023;14(3):357–69. pmid:36054583
12. Wilkinson J, Heal C, Antoniou GA, Flemyng E, Ahnström L, Alteri A, et al. Assessing the feasibility and impact of clinical trial trustworthiness checks via an application to Cochrane Reviews: Stage 2 of the INSPECT-SR project. J Clin Epidemiol. 2025;184:111824. pmid:40349737
13. Mousa A, Flanagan M, Tay CT, Norman RJ, Costello M, Li W, et al. Research Integrity in Guidelines and evIDence synthesis (RIGID): a framework for assessing research integrity in guideline development and evidence synthesis. EClinicalMedicine. 2024;74:102717. pmid:39628711
14. Carlisle JB, Dexter F, Pandit JJ, Shafer SL, Yentis SM. Calculating the probability of random sampling for continuous variables in submitted or published randomised controlled trials. Anaesthesia. 2015;70(7):848–58. pmid:26032950
15. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944–52. pmid:28580651
16. Estruch R, Ros E, Salas-Salvadó J, Covas M-I, Corella D, Arós F, et al. Primary prevention of cardiovascular disease with a Mediterranean diet. N Engl J Med. 2013;368(14):1279–90. pmid:23432189
17. Editor’s note. N Engl J Med. 2018;378(25):2442.
18. Bolland MJ, Avenell A, Grey A. Statistical techniques to assess publication integrity in groups of randomized trials: a narrative review. J Clin Epidemiol. 2024;170:111365. pmid:38631528
19. Hunter KE, Aberoumand M, Libesman S, Sotiropoulos JX, Williams JG, Aagerup J, et al. The Individual Participant Data Integrity Tool for assessing the integrity of randomised trials. Res Synth Methods. 2024;15(6):917–39. pmid:39136348
20. Hunter KE, Aberoumand M, Libesman S, Sotiropoulos JX, Williams JG, Li W, et al. Development of the individual participant data integrity tool for assessing the integrity of randomised trials using individual participant data. Res Synth Methods. 2024;15(6):940–9. pmid:39155538
21. Mills HL, Higgins JPT, Morris RW, Kessler D, Heron J, Wiles N, et al. Detecting Heterogeneity of Intervention Effects Using Analysis and Meta-analysis of Differences in Variance Between Trial Arms. Epidemiology. 2021;32(6):846–54. pmid:34432720
22. Wise J. Boldt: the great pretender. BMJ. 2013;346:f1738.
23. Munkholm K, Winkelbeiner S, Homan P. Individual response to antidepressants for depression in adults-a meta-analysis and simulation study. PLoS One. 2020;15(8):e0237950. pmid:32853222
24. Plöderl M, Hengartner MP. What are the chances for personalised treatment with antidepressants? Detection of patient-by-treatment interaction with a variance ratio meta-analysis. BMJ Open. 2019;9(12):e034816. pmid:31874900
25. Guo X, McCutcheon RA, Pillinger T, Mizuno Y, Natesan S, Brown K, et al. The magnitude and heterogeneity of antidepressant response in depression: A meta-analysis of over 45,000 patients. J Affect Disord. 2020;276:991–1000. pmid:32750615
26. Volkmann C, Volkmann A, Müller CA. On the treatment effect heterogeneity of antidepressants in major depression: A Bayesian meta-analysis and simulation study. PLoS One. 2020;15(11):e0241497. pmid:33175895
27. Alsaeid M, Sung S, Bai W, Tam M, Wong YJ, Cortes J, et al. Heterogeneity of treatment response to beta-blockers in the treatment of portal hypertension: A systematic review. Hepatol Commun. 2024;8(2):e0321. pmid:38285880
28. Gabelica M, Cavar J, Puljak L. Authors of trials from high-ranking anesthesiology journals were not willing to share raw data. J Clin Epidemiol. 2019;109:111–6. pmid:30738169
29. Veroniki AA, Ashoor HM, Le SPC, Rios P, Stewart LA, Clarke M, et al. Retrieval of individual patient data depended on study characteristics: a randomized controlled trial. J Clin Epidemiol. 2019;113:176–88. pmid:31153977
30. Senior AM, Viechtbauer W, Nakagawa S. Revisiting and expanding the meta-analysis of variation: The log coefficient of variation ratio. Res Synth Methods. 2020;11(4):553–67. pmid:32431099
31. Tanner-Smith EE, Tipton E. Robust variance estimation with dependent effect sizes: practical considerations including a software tutorial in Stata and SPSS. Res Synth Methods. 2014;5(1):13–30. pmid:26054023
32. Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Introduction to Meta-Analysis. Wiley. 2021.
33. Tipton E. Small sample adjustments for robust variance estimation with meta-regression. Psychol Methods. 2015;20(3):375–93. pmid:24773356
34. Tukey JW. Exploratory data analysis. Reading, Mass.: Addison-Wesley Pub. Co. 1977.
35. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38(11):2074–102. pmid:30652356
36. Engebretson SP, Hyman LG, Michalowicz BS, Schoenfeld ER, Gelato MC, Hou W, et al. The effect of nonsurgical periodontal therapy on hemoglobin A1c levels in persons with type 2 diabetes and chronic periodontitis: a randomized clinical trial. JAMA. 2013;310(23):2523–32. pmid:24346989
37. Babic A, Tokalic R, Amílcar Silva Cunha J, Novak I, Suto J, Vidak M, et al. Assessments of attrition bias in Cochrane systematic reviews are highly inconsistent and thus hindering trial comparability. BMC Med Res Methodol. 2019;19(1):76. pmid:30953448
38. Carlisle JB. R code for calculating Carlisle-Stouffer-Fisher statistic. 2023.
39. Fullerton B, Siebenhofer A, Jeitler K, Horvath K, Semlitsch T, Berghold A, et al. Short-acting insulin analogues versus regular human insulin for adults with type 1 diabetes mellitus. Cochrane Database Syst Rev. 2016;2016(6):CD012161. pmid:27362975
40. Semlitsch T, Engler J, Siebenhofer A, Jeitler K, Berghold A, Horvath K. (Ultra-)long-acting insulin analogues versus NPH insulin (human isophane insulin) for adults with type 2 diabetes mellitus. Cochrane Database Syst Rev. 2020;11(11):CD005613. pmid:33166419
41. Sebo P, Sebo M. Geographical Disparities in Research Misconduct: Analyzing Retraction Patterns by Country. J Med Internet Res. 2025;27:e65775. pmid:39808480
42. Senior AM, Gosby AK, Lu J, Simpson SJ, Raubenheimer D. Meta-analysis of variance: an illustration comparing the effects of two dietary interventions on variability in weight. Evol Med Public Health. 2016;2016(1):244–55. pmid:27491895
43. Durán A, Martín P, Runkle I, Pérez N, Abad R, Fernández M, et al. Benefits of self-monitoring blood glucose in the management of new-onset Type 2 diabetes mellitus: the St Carlos Study, a prospective randomized clinic-based interventional study with parallel groups. J Diabetes. 2010;2(3):203–11. pmid:20923485
44. Malanda UL, Welschen LM, Riphagen II, Dekker JM, Nijpels G, Bot SD. Self-monitoring of blood glucose in patients with type 2 diabetes mellitus who are not using insulin. Cochrane Database Syst Rev. 2012;2012(1):CD005060.
45. Cochrane. Policy for managing potentially problematic studies: implementation guidance. Chichester, UK: John Wiley & Sons. 2024. https://www.cochranelibrary.com/cdsr/editorial-policies#problematic-studies
46. Meschi F, Beccaria L, Vanini R, Szulc M, Chiumello G. Short-term subcutaneous insulin infusion in diabetic children. Comparison with three daily insulin injections. Acta Diabetol Lat. 1982;19(4):371–5. pmid:6758461
47. Homko CJ, Santamore WP, Whiteman V, Bower M, Berger P, Geifman-Holtzman O, et al. Use of an internet-based telemedicine system to manage underserved women with gestational diabetes mellitus. Diabetes Technol Ther. 2007;9(3):297–306. pmid:17561800
48. Vincent D, Pasvogel A, Barrera L. A feasibility study of a culturally tailored diabetes intervention for Mexican Americans. Biol Res Nurs. 2007;9(2):130–41. pmid:17909165
49. Kiran M, Arpak N, Unsal E, Erdoğan MF. The effect of improved periodontal health on metabolic control in type 2 diabetes mellitus. J Clin Periodontol. 2005;32(3):266–72. pmid:15766369
50. Macedo GO, Novaes AB Jr, Souza SLS, Taba M Jr, Palioto DB, Grisi MFM. Additional effects of aPDT on nonsurgical periodontal treatment with doxycycline in type II diabetes: a randomized, controlled clinical trial. Lasers Med Sci. 2014;29(3):881–6. pmid:23474741
51. Agurs-Collins TD, Kumanyika SK, Ten Have TR, Adams-Campbell LL. A randomized controlled trial of weight reduction and exercise for diabetes management in older African-American subjects. Diabetes Care. 1997;20(10):1503–11. pmid:9314625
52. Zieve FJ, Kalin MF, Schwartz SL, Jones MR, Bailey WL. Results of the glucose-lowering effect of WelChol study (GLOWS): a randomized, double-blind, placebo-controlled pilot study evaluating the effect of colesevelam hydrochloride on glycemic control in subjects with type 2 diabetes. Clin Ther. 2007;29(1):74–83. pmid:17379048
53. Tsalikis L, Sakellari D, Dagalis P, Boura P, Konstantinidis A. Effects of doxycycline on clinical, microbiological and immunological parameters in well-controlled diabetes type-2 patients with periodontal disease: a randomized, controlled clinical trial. J Clin Periodontol. 2014;41(10):972–80. pmid:25041182
54. Schiel R, Müller UA. Efficacy and treatment satisfaction of once-daily insulin glargine plus one or two oral antidiabetic agents versus continuing premixed human insulin in patients with type 2 diabetes previously on long-term conventional insulin therapy: the Switch pilot study. Exp Clin Endocrinol Diabetes. 2007;115(10):627–33. pmid:18058596
55. Huang X, Song L, Li T. Effect of health education and psychosocial intervention on depression in patients with type 2 diabetes. Zhongguo-xinli-weisheng-zazhi. 2002;16(3):149–51.
56. Hirsch IB, Abelseth J, Bode BW, Fischer JS, Kaufman FR, Mastrototaro J, et al. Sensor-augmented insulin pump therapy: results of the first randomized treat-to-target study. Diabetes Technol Ther. 2008;10(5):377–83. pmid:18715214
57. Schade DS, Mitchell WJ, Griego G. Addition of sulfonylurea to insulin treatment in poorly controlled type II diabetes. A double-blind, randomized clinical trial. JAMA. 1987;257(18):2441–5. pmid:3106656
58. Ma W-J, Huang Z-H, Huang B-X, Qi B-H, Zhang Y-J, Xiao B-X, et al. Intensive low-glycaemic-load dietary intervention for the management of glycaemia and serum lipids among women with gestational diabetes: a randomized control trial. Public Health Nutr. 2015;18(8):1506–13. pmid:25222105
59. Petrovski G, Dimitrovski C, Bogoev M, Milenkovic T, Ahmeti I, Bitovska I. Is there a difference in pregnancy and glycemic outcome in patients with type 1 diabetes on insulin pump with constant or intermittent glucose monitoring? A pilot study. Diabetes Technol Ther. 2011;13(11):1109–13. pmid:21751889
60. Al-Zahrani MS, Bamshmous SO, Alhassani AA, Al-Sherbini MM. Short-term effects of photodynamic therapy on periodontal status and glycemic control of patients with diabetes. J Periodontol. 2009;80(10):1568–73. pmid:19792844
61. Fang Q, Fang M, Yao Y, Feng S, Yang Y, Xue L, et al. Efficacy and safety of pioglitazone for intervention therapy of impaired glucose regulation [吡格列酮干预糖调节受损的疗效和安全性]. Journal of Clinical Research. 2013;30(2):239–42.
62. Cohen D, Weintrob N, Benzaquen H, Galatzer A, Fayman G, Phillip M. Continuous subcutaneous insulin infusion versus multiple daily injections in adolescents with type I diabetes mellitus: a randomized open crossover trial. J Pediatr Endocrinol Metab. 2003;16(7):1047–50. pmid:14513883
63. Dans AML, Villarruz MVC, Jimeno CA, Javelosa MAU, Chua J, Bautista R, et al. The effect of Momordica charantia capsule preparation on glycemic control in type 2 diabetes mellitus needs further studies. J Clin Epidemiol. 2007;60(6):554–9. pmid:17493509