
Risk of Bias in Systematic Reviews of Non-Randomized Studies of Adverse Cardiovascular Effects of Thiazolidinediones and Cyclooxygenase-2 Inhibitors: Application of a New Cochrane Risk of Bias Tool

  • Anja Bilandzic,

    Affiliation Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada

  • Tiffany Fitzpatrick,

    Affiliation Ontario Strategy for Patient-Oriented Research Support Unit, Toronto, Ontario, Canada

  • Laura Rosella,

    Affiliations Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada, Public Health Ontario, Toronto, Ontario, Canada, Institute for Clinical Evaluative Sciences, Toronto, Ontario, Canada

  • David Henry

    Affiliations Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada, Institute for Clinical Evaluative Sciences, Toronto, Ontario, Canada, Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada



Systematic reviews of the effects of healthcare interventions frequently include non-randomized studies. These are subject to confounding and a range of other biases that are seldom considered in detail when synthesizing and interpreting the results. Our aims were to assess the reliability and usability of a new Cochrane risk of bias (RoB) tool for non-randomized studies of interventions and to determine whether restricting analysis to studies with low or moderate RoB made a material difference to the results of the reviews.

Methods and Findings

We selected two systematic reviews of population-based, controlled non-randomized studies of the relationship between the use of thiazolidinediones (TZDs) and cyclooxygenase-2 (COX-2) inhibitors and major cardiovascular events. Two epidemiologists applied the Cochrane RoB tool and made assessments across the seven specified domains of bias for each of 37 component studies. Inter-rater agreement was measured using the weighted Kappa statistic. We grouped studies according to overall RoB and performed statistical pooling for (a) all studies and (b) only studies with low or moderate RoB. Kappa scores across the seven bias domains ranged from 0.50 to 1.0. In the COX-2 inhibitor review, two studies had low overall RoB, 14 had moderate RoB, and five had serious RoB. In the TZD review, six studies had low RoB, four had moderate RoB, four had serious RoB, and two had critical RoB. The pooled odds ratios for myocardial infarction, heart failure, and death for rosiglitazone versus pioglitazone remained significantly elevated when analyses were confined to studies with low or moderate RoB. However, the estimate for myocardial infarction declined from 1.14 (95% CI 1.07–1.24) to 1.06 (95% CI 0.99–1.13) when analysis was confined to studies with low RoB. Estimates of pooled relative risks of cardiovascular events with COX-2 inhibitors compared with no nonsteroidal anti-inflammatory drug changed little when analyses were confined to studies with low or moderate RoB. The exception was a rise in the relative risk associated with ibuprofen from 1.07 (95% CI 0.97–1.18) to 1.14 (95% CI 1.03–1.26). The main limitation of our study was testing the instrument on a narrow range of pharmacoepidemiological studies; we cannot assume our findings extend to a broader range of interventions and settings.


The Cochrane RoB tool highlighted a wide range of risks of bias in studies included in two widely cited reviews and had the potential to change the conclusions of the reviews. Systematic reviews that incorporate non-randomized studies of medical interventions should include a detailed assessment of RoB for each included study.

Editors' Summary


In the past, clinicians used their own experience to help them make decisions about the best treatments (interventions) for their patients. Nowadays, “evidence-based medicine”—largely based on findings from randomized controlled trials (RCTs)—guides most clinical decisions. RCTs—studies that compare outcomes in groups of patients chosen at random to receive different interventions—are the best way to assess the efficacy of an intervention (the performance of a treatment under ideal conditions), but individual trials often fail to show a statistically significant difference (a difference unlikely to have arisen by chance) between two interventions. Significant differences between interventions can be detected, however, by undertaking a systematic review (a study that identifies all the RCTs on a given intervention using predefined criteria) and a meta-analysis (a statistical technique for combining, or “synthesizing,” the findings from several independent RCTs).

Why Was This Study Done?

Systematic reviews of healthcare interventions can also include non-randomized studies, which use administrative databases to identify people receiving different interventions and electronic health records to determine clinical outcomes. However, non-randomized studies of interventions are prone to many “biases” that affect the accuracy of their findings. For example, a potential bias in non-randomized studies is “confounding,” the possibility that an unmeasured characteristic shared by the people receiving a specific intervention, rather than the intervention itself, is responsible for the observed outcome. When undertaking systematic reviews and meta-analyses, it is essential to measure the risk of bias (RoB) in each individual study included in the review and meta-analysis. But, although a widely used tool is available for measuring RoB in RCTs, bias is seldom considered in detail when synthesizing the results of non-randomized studies of interventions. Here, the researchers assess the reliability and usability of ACROBAT-NRSI, a tool developed by Cochrane (an organization that promotes evidence-informed health decision-making) for the assessment of RoB in non-randomized intervention studies. ACROBAT-NRSI assists authors in identifying potential concerns across seven bias domains and assesses the overall RoB of individual non-randomized intervention studies.

What Did the Researchers Do and Find?

Two of the researchers independently applied the ACROBAT-NRSI process to 37 papers included in two widely cited systematic reviews of non-randomized studies of the relationship between the use of thiazolidinediones (drugs used to treat diabetes, such as rosiglitazone and pioglitazone) and cyclooxygenase-2 (COX-2) inhibitors (nonsteroidal anti-inflammatory drugs [NSAIDs] such as ibuprofen) and major cardiovascular events (heart attack [myocardial infarction] and heart failure). The two researchers largely agreed on their RoB assessments (good inter-rater agreement), which, after training and early experience, took roughly 2.5 hours to complete for each study. In the thiazolidinedione review, six studies had low overall RoB, four had moderate RoB, four had serious RoB, and two had critical RoB. In the COX-2 inhibitor review, two studies had low overall RoB, fourteen had moderate RoB, and five had serious RoB. When the researchers restricted meta-analysis to studies with low or moderate RoB, estimates of the pooled relative risks of cardiovascular events with COX-2 inhibitors (compared with no NSAID) changed little, except for a rise in the relative risk associated with ibuprofen. Finally, although the risk estimates for myocardial infarction, heart failure, and death for rosiglitazone compared with pioglitazone remained significantly raised when analyses were confined to studies with low or moderate RoB, there was no significantly increased risk of myocardial infarction when the analysis was confined to studies with low RoB.

What Do These Findings Mean?

These findings show that there was considerable variability in RoB among the studies included in two systematic reviews of non-randomized intervention studies. Although all 37 studies included in these reviews were originally considered to be of sufficiently high quality for inclusion using less comprehensive—or less RoB-focused—critical appraisal tools, only eight were judged to have low RoB using ACROBAT-NRSI. Notably, exclusion of studies with moderate, serious, or critical RoB resulted in clinically important changes to some of the conclusions of the original reviews. Because the researchers considered only two systematic reviews, their findings may not be generalizable—ACROBAT-NRSI needs further testing across a range of study types. Moreover, because the tool is designed to be used within a team setting, studies are needed to investigate whether the performance of the tool depends on the team’s skill mix. Importantly, however, these findings highlight the importance of including a detailed RoB assessment for each study included in systematic reviews of non-randomized studies of medical interventions.



Well-conducted randomized controlled trials (RCTs) remain the gold standard for assessing medical interventions because their design controls both measured and unmeasured confounding variables. Systematic reviews with meta-analyses of RCTs have become the accepted evidence base for many important clinical and policy decisions. However, the limitations of RCTs are well documented [1–3]. They may not reflect “real world” patient experiences because they study highly selected populations in atypical settings. Also, despite substantial investments of time and money, few trials enroll the number of patients over the necessary length of time to quantify uncommon or long-term outcomes.

Non-randomized studies of interventions have proliferated in recent years due to increased access to extensive linked administrative databases and electronic health records, with large populations, long follow-up periods, and advances in analytic approaches to control for confounding [4,5]. It is recognized that non-randomized studies provide different information (i.e., “real world” effectiveness, wider population inclusion, and longer follow-up) from RCTs [3]. Thus, the methods can be considered complementary, and systematic reviews of both types of studies are needed to provide a comprehensive assessment of a body of evidence.

However, controversy persists. While there is agreement that large, high-quality non-randomized studies can accurately quantify adverse outcomes of medical treatments [6], there is less agreement on their capacity to generate unbiased estimates of the effectiveness of medical interventions [7]. Nevertheless, non-randomized studies are increasingly being included in systematic reviews and meta-analyses [8]. Because many non-randomized studies have large sample sizes, their findings receive greater weight during statistical pooling. The concern is that, while larger sample sizes may increase the precision of summary estimates of treatment effects, the underlying studies may also be prone to bias [8]. To minimize this problem, it is necessary to measure the risk of bias (RoB) in the individual studies that are being included in systematic reviews. This enables studies that have an increased RoB to be excluded from the overall estimate, or from sensitivity analyses.

While a widely used gold-standard RoB tool exists for RCTs [9], there is less agreement on how to assess RoB within non-randomized study designs. A wide variety of checklists, judgment ratings, and scales for observational studies have been proposed [10,11], including the Newcastle–Ottawa Scale (NOS) [12], the Downs and Black checklist [13], and the Scottish Intercollegiate Guidelines Network’s methodology checklists [14]. None of these tools reflects a contemporary domain-based approach to bias assessment, and they are dated (e.g., the current version of the popular NOS was released in 2000) [11,12]. Many instruments use overall rating scales, which have been shown to be flawed [15].

To address this problem, the Cochrane Collaboration released a draft of a comprehensive tool specifically for non-randomized studies in September 2014 [16]. The Cochrane Risk of Bias Tool for Non-Randomized Studies of Interventions (ACROBAT-NRSI) builds upon the Cochrane Risk of Bias tool for RCTs [9] and assesses internal validity through a series of RoB judgments in seven chronologically organized domains to provide an overall RoB assessment for each study (see Box 1).

Box 1. Domains of Bias Assessed by ACROBAT-NRSI [16]

Domains of bias

Pre-intervention (baseline)

  1. Bias due to confounding
  2. Bias in selection of participants into the study

At intervention

  3. Bias in measurement of interventions

Post-intervention

  4. Bias due to departures from intended interventions
  5. Bias due to missing data
  6. Bias in measurement of outcomes
  7. Bias in selection of the reported result

Judgment about RoB for each domain

  1. Low RoB: the study is comparable to a well-performed randomized trial with regard to this domain
  2. Moderate RoB: the study is sound with regard to this domain, but cannot be considered comparable to a well-performed randomized trial
  3. Serious RoB: the study has some important problems in this domain
  4. Critical RoB: the study is too problematic in this domain to provide any useful evidence on the effects of intervention

Overall RoB judgment for each study*

  1. Low RoB: the study is judged to be at low RoB for all domains
  2. Moderate RoB: the study is judged to be at low or moderate RoB for all domains, and moderate in at least one domain
  3. Serious RoB: the study is judged to be at serious RoB in at least one domain, but not at critical RoB in any domain
  4. Critical RoB: the study is judged to be at critical RoB in at least one domain

*Reviewers have some discretion in making an overall risk of bias judgment based on the assessment of individual domains. To quote from the ACROBAT guidance document: “In practice some ‘Serious’ risks of bias (or ‘Moderate’ risks of bias) might be considered to be additive, so that ‘Serious’ risks of bias in multiple domains can lead to an overall judgement of ‘Critical’ risk of bias (and, similarly, ‘Moderate’ risks of bias in multiple domains can lead to an overall judgement of ‘Serious’ risk of bias).”
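
The baseline rule in Box 1 amounts to taking the worst domain-level judgment as the overall judgment. The following is a minimal sketch of that rule only; the names `RoB` and `overall_rob` are hypothetical, and the sketch deliberately does not model the reviewer discretion described in the footnote (for example, treating several “Serious” domains as additive).

```python
from enum import IntEnum


class RoB(IntEnum):
    """Ordered risk-of-bias categories from Box 1 (higher means worse)."""
    LOW = 1
    MODERATE = 2
    SERIOUS = 3
    CRITICAL = 4


def overall_rob(domain_judgments):
    """Return the default overall judgment: the worst of the seven domains.

    Assumption: reviewer discretion (additive 'Serious' or 'Moderate'
    judgments) is NOT modelled; only the baseline rule is encoded.
    """
    judgments = list(domain_judgments)
    if len(judgments) != 7:
        raise ValueError("ACROBAT-NRSI expects judgments for seven domains")
    return max(judgments)


# Hypothetical study: moderate RoB for confounding, low RoB elsewhere.
example = [RoB.MODERATE] + [RoB.LOW] * 6
print(overall_rob(example).name)  # MODERATE
```

Using an ordered enumeration keeps the four categories comparable, so the rule reduces to a single `max` call.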

The first aim of this study was to assess the performance of ACROBAT-NRSI by applying it to the studies included in two published systematic reviews of the adverse cardiovascular effects of thiazolidinediones (TZDs) [17] and cyclooxygenase-2 (COX-2) inhibitors [18]. The second aim was to determine whether limiting the meta-analyses to studies with lower RoB changed the overall estimates of the adverse drug effects.



The ACROBAT-NRSI instrument considers each non-randomized study as an attempt to emulate a hypothetical randomized trial (the “target trial”) that compares the health effects of two or more interventions. The ACROBAT-NRSI guidance points out that the target trial need not be feasible or ethical, and suggests that it is useful to consider the population, interventions, comparators, and outcomes of such a hypothetical trial [16]. It is also important to decide whether the target trial would be analyzed according to initial treatment assignment (“intention to treat” analog) or according to both initiation and adherence to treatment (“per protocol” analog).

Users of the instrument are guided through seven chronologically arranged (pre-intervention, at intervention, and post-intervention) bias domains (see Box 1). Signaling questions help flag potential bias concerns and help review authors make RoB judgments. The first three domains (pre-intervention and at intervention) are specific to non-randomized studies of interventions, whereas the remaining four domains also have relevance to the assessment of RoB in RCTs (Box 1). Signaling questions for the bias domains are framed so that “yes” indicates a lower RoB than “no” (e.g., “Did the authors use an appropriate analysis method that adjusted for all the critically important confounding domains?”). If the answers to all signaling questions for a domain are “yes” or “probably yes,” then the RoB for that domain is judged to be low. The ACROBAT-NRSI instrument is provided as S1 Table.

Selection of Systematic Reviews and Meta-analyses

We selected two widely cited systematic reviews with meta-analyses that addressed important questions about the safety of widely used prescription drugs: one by Loke, Kwok and Singh [17], who investigated the cardiovascular risks of TZDs (comparing rosiglitazone to pioglitazone) in diabetic patients, and one by McGettigan and Henry [18], who investigated the cardiovascular risks associated with a range of selective and nonselective COX-2 inhibitors, with non-use of the drug class as the reference.

We considered reviews of drug effects an appropriate subject for initial testing as they are generally simple interventions that do not involve complexities such as operator skill or extensive infrastructure. Cardiovascular outcomes are clear-cut and not prone to major misclassification. We chose reviews with a substantial number of component studies to give us a sample size sufficient to assess inter-rater agreement. In both reviews, the majority of studies used patient data from large, population-based administrative health databases, and most used sophisticated methods to adjust results for bias. We knew that the quality scores for the component studies in the COX-2 inhibitor review (using the popular NOS) were tightly grouped, with high overall scores [18,19]. Further, the results of both reviews were broadly similar to those from meta-analyses of randomized trials [17,18]. We considered that they would provide a good test of the responsiveness of the new tool to modest variations in RoB and allow us to assess consequential effects of bias on the pooled estimates of adverse effects associated with use of these drugs.

We retrieved full-text copies of the 39 component studies included in the two reviews. One study of TZDs (Graham et al. [20]) was used in a training and calibration exercise that involved all four authors. This article was chosen as it had been assessed during ACROBAT-NRSI working group meetings and majority consensus ratings had been established. All four authors applied ACROBAT-NRSI (version 1.0.0) to this study and met to compare and discuss judgments, interpretations of the guidance document, and user experiences.

Two reviewers (A. B. and T. F.) independently assessed the remaining component studies. Both authors are trained epidemiologists, but had no prior experience using ACROBAT-NRSI. For reports with multiple risk estimates (for a single outcome), the results cited in the original systematic review were extracted and their corresponding properties were assessed (combination of variables in the statistical model, exposure definition, etc.). Two study reports in the form of abstracts [21,22] were excluded, as they did not contain enough information to be assessed by the instrument. This left 37 articles to be evaluated. Individual study assessments were conducted independently, and results recorded, before a meeting at which reviewers compared judgments and achieved a consensus. If both raters had the same category of judgment for a domain, no further discussion occurred. If the ratings differed, each rater provided their reasoning for selecting their RoB judgment. The supporting notes (written in the comments area of the tool) were useful for recalling details relevant to RoB judgment. Inter-rater reliability was measured for each domain of bias, and for the overall RoB judgment, by calculating weighted Kappa scores using linear weighting in SAS 9.4 [23].
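
As a rough illustration of the agreement statistic described above, the sketch below computes Cohen's kappa with linear weights for two raters scoring studies on the four-level RoB scale. It is not the SAS procedure used in the study; the function name `linear_weighted_kappa`, the integer coding of categories (0 = low through 3 = critical), and the example ratings are all assumptions made for illustration.

```python
def linear_weighted_kappa(rater_a, rater_b, n_categories):
    """Cohen's kappa with linear weights for two raters.

    Ratings are integers 0..n_categories-1 on an ordinal scale.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    if n_categories < 2:
        raise ValueError("need at least two categories")
    n = len(rater_a)
    k = n_categories
    # Observed joint proportions of (rater A category, rater B category).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Linear disagreement weights: |i - j| / (k - 1).
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j]
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ratings for seven studies (0 = low ... 3 = critical).
rater_1 = [0, 1, 1, 2, 0, 1, 3]
rater_2 = [0, 1, 2, 2, 0, 1, 3]
print(round(linear_weighted_kappa(rater_1, rater_2, 4), 2))
```

With linear weights, a disagreement between adjacent categories (e.g., moderate versus serious) is penalized less than a disagreement between distant categories (e.g., low versus critical), which suits an ordinal scale.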

The meta-analyses from each review were replicated using RevMan 5.3 before and after RoB assessment. In the case of McGettigan and Henry [18], the risk estimates from the two studies available only in abstract form were excluded from both the before and after RoB assessment analyses. All three cardiovascular outcomes from Loke et al. [17] were assessed, as well as each individual nonsteroidal anti-inflammatory drug (NSAID) group exposure in relation to the major cardiovascular outcome in McGettigan and Henry [18].

Generic inverse-variance weighting was used in a random effects model, as set out in the methods sections of the original study reports. This exercise included all eligible studies. Next, those studies judged as having an overall serious or critical RoB were excluded from the meta-analyses, leaving only estimates from studies with overall low or moderate RoB. In a further analysis, moderate RoB studies were also excluded, resulting in meta-analyses of only low RoB studies. The heterogeneity in the meta-analyses was measured using the I² statistic, and changes in this statistic, the risk estimates, and their confidence intervals were recorded between the original meta-analysis and the re-analyses stratified by RoB. We made an informal assessment of usability by asking reviewers to record the time taken to complete evaluations and to record their overall impressions of using the ACROBAT-NRSI instrument.
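
The pooling and RoB-stratified re-analysis described above can be sketched as follows. This is an illustration rather than the RevMan analysis itself: it assumes the DerSimonian-Laird estimator for the between-study variance (the text does not name the estimator), and the function names and the four example studies are hypothetical.

```python
import math


def pool_random_effects(log_estimates, standard_errors):
    """Generic inverse-variance random-effects pooling on the log scale.

    Assumes the DerSimonian-Laird estimate of between-study variance.
    Returns (pooled_log_estimate, pooled_se, i_squared_percent).
    """
    k = len(log_estimates)
    if k == 0 or k != len(standard_errors):
        raise ValueError("need matching, non-empty estimates and SEs")
    w = [1.0 / se ** 2 for se in standard_errors]
    fixed = sum(wi * y for wi, y in zip(w, log_estimates)) / sum(w)
    # Cochran's Q and its degrees of freedom.
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_estimates))
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (se ** 2 + tau2) for se in standard_errors]
    pooled = sum(wi * y for wi, y in zip(w_star, log_estimates)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star)), i2


# Hypothetical studies: (log relative risk, standard error, overall RoB).
studies = [
    (math.log(1.20), 0.08, "low"),
    (math.log(1.05), 0.06, "moderate"),
    (math.log(1.40), 0.15, "serious"),
    (math.log(0.95), 0.20, "critical"),
]


def pooled_for(allowed):
    """Pool only the studies whose overall RoB is in `allowed`."""
    ys, ses = zip(*[(y, se) for y, se, rob in studies if rob in allowed])
    est, se, i2 = pool_random_effects(ys, ses)
    return math.exp(est), math.exp(est - 1.96 * se), math.exp(est + 1.96 * se), i2


for label, allowed in [
    ("all studies", {"low", "moderate", "serious", "critical"}),
    ("low or moderate RoB", {"low", "moderate"}),
    ("low RoB only", {"low"}),
]:
    rr, lo, hi, i2 = pooled_for(allowed)
    print(f"{label}: RR {rr:.2f} (95% CI {lo:.2f}-{hi:.2f}), I2 = {i2:.0f}%")
```

Running the same pooling function on progressively restricted subsets mirrors the three analyses in the text: all studies, low or moderate RoB, and low RoB only.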


Details of the 37 studies included in the two reviews are provided in Tables 1 and 2. Seventeen studies were analyzed as cohorts and 20 as case–control designs; the majority of the latter were nested in cohorts. In total, 34/37 (92%) studies were performed using linked administrative claims data or electronic medical records. Risk estimates varied across studies and outcomes. However, the majority of estimates were 1.00 or greater. In the case of TZDs, 28/31 relative risk estimates lay between 1.00 and 1.70. In the case of COX-2 inhibitors, 40/66 relative risk estimates lay between 1.00 and 2.29.

Table 1. Details of component studies included in the systematic review by Loke et al. [17].

Table 2. Details of component studies included in the systematic review by McGettigan and Henry [18].

Inter-Rater Agreement on Risk of Bias Judgments

The weighted kappa scores varied across the seven domains of bias assessed by ACROBAT-NRSI (Table 3). In the case of Loke et al. [17], kappa values ranged from 0.59 (bias due to missing data) to 0.91 (bias in selection of participants). The remaining kappa values were between 0.63 and 0.78, indicating substantial agreement between the two raters [60]. For McGettigan and Henry [18], the kappa scores ranged from 0.45 (bias in selection of reported results) to 1.00 (bias due to missing data). The remaining scores were between 0.50 and 0.91, denoting moderate to substantial agreement. For the overall RoB judgment, the kappa statistic showed substantial agreement for both reviews (0.72 and 0.91).

Table 3. Weighted Kappa scores for inter-rater agreement when assessing the component studies included in two systematic reviews.

Risk of Bias Assessments

The consensus judgments for the domains of bias and overall RoB assessments for studies included in the two systematic reviews are given in Tables 4 and 5. Assessment comments are summarized in S2 and S3 Tables. Loke et al. [17] studied three major outcomes (heart failure, myocardial infarction, and death). As the assessments of the RoB domains did not differ by individual outcome, a single set of domain-specific and overall judgments is provided.

Table 4. Consensus ACROBAT-NRSI judgments between two reviewers by domain of bias—component studies from Loke et al. [17].

Table 5. Consensus ACROBAT-NRSI judgments between two reviewers by domain of bias—component studies from McGettigan and Henry [18].

The overall judgments for the component studies from Loke et al. [17] were distributed across all four rating categories. Six studies were found to be at low RoB. The RoB assessments for the remaining studies were as follows: four moderate, four serious, and two critical RoB. For the component studies in McGettigan and Henry [18], the overall judgments appeared less variable. Fourteen of 21 studies fell into the moderate RoB category. Only two studies were rated as low RoB, and five were deemed to have serious RoB. None of the studies received a critical RoB rating. For both reviews, the main causes of serious or critical overall RoB assessments were weaknesses in the domains of confounding and selection of participants.

Changes in Risk Estimates and Conclusions

For rosiglitazone compared with pioglitazone, excluding all component studies judged to have serious or critical RoB resulted in slightly lower risk estimates for myocardial infarction and heart failure outcomes overall (Table 6). Both risk estimates remained elevated and statistically significant. The estimates for overall mortality did not change for either study type (cohort or case–control), or overall. However, when studies judged as having moderate RoB were also excluded from the meta-analysis, the pooled odds ratio estimate for myocardial infarction for rosiglitazone compared with pioglitazone fell from 1.16 (95% CI 1.07–1.24) to 1.06 (95% CI 0.99–1.13). The other outcomes, heart failure and overall mortality, did not change to a material extent.

Table 6. Risk estimates from meta-analyses: comparison of original estimates with post-assessment estimates for the systematic review by Loke et al. [17].

Risk estimates for COX-2 inhibitors tended to increase in re-analyses confined to studies judged to be at low or moderate overall RoB, except for indomethacin and meloxicam, which featured in only two studies (Table 7). Risk estimates for the more selective COX-2 inhibitors (celecoxib, rofecoxib) showed little change, with only one study removed from the meta-analyses. For the nonselective NSAIDs, the risk estimates for naproxen, diclofenac, and piroxicam remained similar to the original estimates. The relative risk estimate for ibuprofen increased from 1.07 (95% CI 0.97–1.18) to 1.14 (95% CI 1.03–1.26), indicating an elevated cardiovascular risk after exclusion of four studies assessed as having serious RoB. Due to the low number of studies deemed to have low RoB, we were unable to perform a sensitivity analysis excluding studies judged as having moderate RoB.
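
To see why these shifts matter for statistical significance, the sketch below recovers an approximate standard error on the log scale from each reported 95% confidence interval and computes a z statistic against the null value of 1. It assumes the intervals are symmetric on the log scale and based on a normal approximation; the function names are hypothetical, and the inputs are the rounded estimates quoted in the text, so the results are approximate.

```python
import math


def log_se_from_ci(lower, upper, z_crit=1.96):
    """Approximate SE of a log ratio from its 95% CI (assumes log-symmetry)."""
    return (math.log(upper) - math.log(lower)) / (2 * z_crit)


def z_statistic(estimate, lower, upper):
    """z statistic for the null hypothesis that the ratio equals 1."""
    return math.log(estimate) / log_se_from_ci(lower, upper)


# Rounded estimates quoted in the text.
reported = [
    ("rosiglitazone MI, low RoB only", 1.06, 0.99, 1.13),
    ("ibuprofen, all studies", 1.07, 0.97, 1.18),
    ("ibuprofen, low or moderate RoB", 1.14, 1.03, 1.26),
]

for label, est, lo, hi in reported:
    z = z_statistic(est, lo, hi)
    verdict = "exceeds" if abs(z) > 1.96 else "does not exceed"
    print(f"{label}: z = {z:.2f} ({verdict} 1.96)")
```

An interval that includes 1 corresponds to |z| below 1.96, which is how restricting the analysis can move an estimate across the conventional significance threshold in either direction.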

Table 7. Risk estimates from meta-analyses: comparison of original estimates with post-assessment estimates for the systematic review by McGettigan and Henry [18].

Effects on Heterogeneity of Risk Estimates

In the case of Loke et al. [17], I² statistics for the summary risk estimates for myocardial infarction, heart failure, and death changed little after exclusion of studies with critical or serious RoB (from 48%, 41%, and 0% to 19%, 41%, and 0%, respectively). After further exclusion of studies judged to have moderate RoB, there was reduced heterogeneity among the remaining studies (I² statistics: 0%, 16%, and 0%, respectively). No pattern could be seen with the nine individual NSAID analyses after exclusion of studies with critical or serious RoB.

Usability of Cochrane ACROBAT-NRSI

Initially, reviewers took an average of 4 h (but up to 8 h in one instance) to complete each component study assessment. By the end of the study, and with increased experience with the instrument, most studies were assessed within 2.5 h. The reviewers found that it took longer to assess cohort studies than case–control studies. In part, this was because of difficulty in evaluating the potential for time-varying confounding, as essential information regarding this domain was commonly not reported. Overall, reviewers agreed that important determinants of success in applying the instrument were training in epidemiology, familiarity with certain adjustment methods (e.g., propensity score matching), and the creation of a comprehensive list of potential confounders and co-interventions before starting the assessment.


We found that a comprehensive assessment revealed variability in the RoB of non-randomized studies included in two systematic reviews of adverse cardiovascular events associated with the use of TZDs and COX-2 inhibitors. Only eight of the 37 studies, all of which had been considered of sufficiently high quality for inclusion in the two published systematic reviews, were judged to have low RoB. The exclusion of studies with moderate, serious, or critical RoB resulted in changes to some risk estimates; in particular, rosiglitazone was no longer associated with an increased risk of myocardial infarction, while the reverse was true for ibuprofen and cardiovascular events.

Clinical Relevance

Although the changes in risk estimates after exclusion of poorer quality studies were small, they may be important in a field where decisions are made on the basis of small relative increases in the risk of serious adverse events. In the case of the NSAID meta-analysis, the most notable change was a rise in the relative risk estimate for ibuprofen (compared with no NSAID use). This was a small change, but the risk may be real, as ibuprofen has been shown to be associated with dose-related increases in the relative risk of cardiovascular events in both randomized and non-randomized studies [19]. In the case of rosiglitazone, the summary relative risk estimate (compared with pioglitazone) for myocardial infarction moved towards the null after exclusion of nine studies assessed as having moderate, serious, or critical RoB. This is not consistent with the most recent meta-analyses of RCTs of rosiglitazone [61]. However, the RCTs compared rosiglitazone with placebo, insulin, biguanides, or sulfonylureas, not with pioglitazone. The RoB-stratified estimates of the risk of myocardial infarction with rosiglitazone compared with pioglitazone should not therefore be assumed to conflict with the trial results.

Comparison with Other Tools to Assess Risk of Bias

The substantial variation in RoB we found in these published systematic reviews indicates that ACROBAT-NRSI is sensitive to variations in bias across a range of studies that were considered to be of sufficiently high quality to be included in the reviews considered here. In the case of the COX-2 inhibitors, the authors of the published review originally assessed the quality of the component studies by applying the NOS [18,19]. Using this scale, they found that all studies ranked highly (seven or eight out of a possible total of nine points on the scale). In contrast, with application of the domain-based ACROBAT-NRSI instrument, five of the studies were assessed as being at serious RoB, 14 at moderate RoB, and only two at low RoB. This comparison reveals two things. First, the NOS scores were too tightly clustered to enable examination of the impact of bias on the pooled risk estimates. Second, the overall rating scale used in the NOS did not reveal weaknesses in specific domains that generated poor overall assessments of RoB with the ACROBAT-NRSI instrument, which does not generate an overall score. A simple summary score implies equal weighting of domains of bias, and the overall score may disguise serious or critical flaws and fail to document where the flaws are occurring. The new Cochrane tool allows a more transparent judgment. The instrument enables the identification and categorization of the severity of domain-specific flaws that are important in determining the overall assessment of RoB.

There are many published instruments for assessing susceptibility to bias in non-randomized studies. While there is general agreement about the key domains that should be assessed in the case of RCTs, this is not so with non-randomized studies [9,11]. This is because non-randomized studies have considerably more opportunities for variation in design and analysis, in addition to RoB due to the lack of random allocation and blinding. In their review, Sanderson and colleagues identified 86 assessment tools for non-randomized studies, comprising 41 simple checklists, 12 checklists with additional summary judgments, and 33 scales [11]. The authors concluded that around half of the published scales did not describe the development process and had not been tested for reliability or validity. As a result, they were unable to recommend a specific instrument.

A recent review by Katikireddi et al. found that the majority of 59 systematic reviews published between March and May 2012 included some form of critical appraisal of the included studies [62]. The percentage was higher for RCTs (71%) than non-randomized studies (57%), which is ironic given that non-randomized studies are more susceptible to bias. Katikireddi et al. found that review authors used a variety of existing and adapted critical appraisal tools but that fewer than half included domain-level RoB assessments and that there was confusion about how these scores and ratings should be included in the synthesis and interpretation of review findings. This underscores the importance of assessing domain-specific RoB, which allows for a more nuanced understanding of biases within individual studies.

Experience with ACROBAT-NRSI

ACROBAT-NRSI is demanding to use as it addresses the serious and complex issues of RoB in non-randomized studies of healthcare interventions. It took two reviewers approximately 2.5 h to complete the process for each component study, including reading the paper, applying the tool, and achieving consensus. This was after training and early experience with the tool. Proper application of the instrument requires a substantial time and resource commitment in addition to an in-depth understanding of the sources of bias in non-randomized studies. We believe this commitment, including the use of two raters, is necessary because of the complexity of non-randomized studies, the inevitable discrepancies that emerge between ratings, and the value of the consensus process that follows. In our study, the raters were supported by a methods expert (L. R.) and a clinician (D. H.). We think both roles are a necessary part of teams that are evaluating (or conducting) systematic reviews that include non-randomized intervention studies. This RoB assessment effort is justified as the results of these systematic reviews may form the basis of policy or regulatory decisions.

We are aware that broader feedback from other users of the ACROBAT-NRSI instrument has indicated that rewording of some signaling questions within the domains of bias is desirable, and that process is underway. We anticipate that as more people use the instrument, further changes will be needed to improve its usability. It is important that potential users access the most recent version of the instrument. Further developments of the instrument are unlikely to change the domains of bias or how these are assessed, but changes to signaling questions will help guide interpretation. As such, our experiences in this study are relevant to future users of the instrument.

ACROBAT-NRSI has been used to assess the RoB of non-randomized studies included in several recently published systematic reviews [63–66]. We were unable to find another published study that reported on the inter-rater reliability of the instrument or estimated the effect of restricting reviews to studies with low or moderate RoB.

We are aware of three reports (in abstract form) of inter-rater reliability of the instrument presented at the 2015 Cochrane Colloquium in Vienna, Austria. The topic areas were environmental exposure, housing improvements, and the relationship between benzodiazepine use and mortality [67–69]. All studies found lower levels of inter-rater agreement than we did. The differences may have been due to the nature of the literature we reviewed and the fact that our raters were epidemiologists, had received training in the use of the instrument, and had gone through a calibration exercise that included an author involved in the development of ACROBAT-NRSI. The tool may not be so readily used by less qualified or less trained personnel, but, arguably, they should not be evaluating systematic reviews that include non-randomized studies of healthcare interventions.
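For readers who want to compute inter-rater agreement on ordinal RoB ratings, the following is a minimal sketch of a weighted Kappa calculation. It assumes linear disagreement weights, which is only one possible weighting scheme, and the ratings shown are invented for illustration; they are not the ratings from this study.

```python
from collections import Counter

def weighted_kappa(rater1, rater2, categories):
    """Weighted Kappa for two raters on an ordered scale.

    Uses linear disagreement weights |i - j| / (k - 1), an assumption
    made for this sketch. `categories` lists the scale in order.
    """
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("raters must supply the same non-zero number of ratings")
    index = {c: i for i, c in enumerate(categories)}
    k, n = len(categories), len(rater1)
    pairs = Counter(zip(rater1, rater2))          # observed joint counts
    margin1, margin2 = Counter(rater1), Counter(rater2)

    observed = expected = 0.0
    for a in categories:
        for b in categories:
            w = abs(index[a] - index[b]) / (k - 1)
            observed += w * pairs[(a, b)] / n
            expected += w * (margin1[a] / n) * (margin2[b] / n)
    if expected == 0:          # both raters used one identical category throughout
        return 1.0
    return 1.0 - observed / expected

# Invented ratings for eight hypothetical studies on a single bias domain.
SCALE = ["low", "moderate", "serious", "critical"]
rater_1 = ["low", "low", "moderate", "moderate", "serious", "serious", "critical", "low"]
rater_2 = ["low", "moderate", "moderate", "moderate", "serious", "critical", "critical", "low"]

print(round(weighted_kappa(rater_1, rater_2, SCALE), 2))
```

Weighting penalizes a low versus critical disagreement more heavily than a low versus moderate one, which suits an ordered severity scale better than unweighted Kappa.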

The information derived from application of ACROBAT-NRSI can be integrated into tools designed to provide overall ratings of systematic reviews. In the case of ROBIS (a tool for assessing the RoB in systematic reviews), the relevant domain is number 3, concerned with individual study appraisal [70]. ROBIS appraises a number of other steps in the review process that can introduce bias, in addition to flaws in the component studies. Likewise, ACROBAT-NRSI can provide information on RoB that can be integrated into the revised version of the popular AMSTAR systematic review critical appraisal instrument [71].


Our study has several limitations. First, ACROBAT-NRSI has not been subject to a formal test of construct validity. That means we cannot be certain that the instrument truly measures the constructs (in this case domains of bias) that it was designed to measure. However, we note that it underwent an extensive development program involving many methods experts, has considerable face validity, and was developed from a well-established and validated instrument (the Cochrane Risk of Bias tool for RCTs). Second, we limited our assessment to two reviews of relatively sophisticated pharmacoepidemiological studies. We cannot assume our findings extend to a broader range of interventions and settings. The instrument needs further testing across a range of study types. Third, many of the studies in the two reviews under consideration used propensity score or other matching methods, and ACROBAT-NRSI and related findings may function differently in non-randomized studies that use alternative methods such as self-controlled designs or interrupted time series analysis. Finally, ACROBAT-NRSI was designed to be used within a team setting, with methodologists and subject matter experts contributing to study evaluations [16]. Our study involved two reviewers with similar training backgrounds, who had access to content expertise. But it is possible that other skill mixes in the reviewers would lead to different RoB judgments.


Systematic reviews that include non-randomized studies of medical interventions should encompass a detailed assessment of domain-level RoB for each included study. Even in a sophisticated field such as contemporary pharmacoepidemiology, a sensitive rating tool can detect significant variation in RoB between individual studies. Exclusion of studies deemed to have unacceptably high RoB may impact the findings of pooled estimates of intervention effects, altering both the statistical and clinical significance of the results.
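The effect of restricting a pooled analysis by RoB can be sketched as follows. This is an illustration under stated assumptions, not the analysis performed in this study: it uses fixed-effect inverse-variance pooling (the pooling model is an assumption), it recovers standard errors from 95% confidence limits, and all study estimates and RoB labels are invented.

```python
import math

Z = 1.96  # normal quantile for a 95% confidence interval

def pool_fixed_effect(estimates):
    """Inverse-variance fixed-effect pooling on the log scale.

    `estimates` is a list of (ratio, ci_lower, ci_upper) tuples.
    Returns the pooled ratio with its 95% confidence limits.
    """
    log_sum = weight_sum = 0.0
    for ratio, lower, upper in estimates:
        se = (math.log(upper) - math.log(lower)) / (2 * Z)  # SE from CI width
        weight = 1.0 / se ** 2
        log_sum += weight * math.log(ratio)
        weight_sum += weight
    pooled = log_sum / weight_sum
    half_width = Z / math.sqrt(weight_sum)
    return (math.exp(pooled),
            math.exp(pooled - half_width),
            math.exp(pooled + half_width))

def pool_by_rob(studies, allowed):
    """Pool only the studies whose overall RoB rating is in `allowed`."""
    return pool_fixed_effect([s["est"] for s in studies if s["rob"] in allowed])

# Invented studies: overall RoB rating plus (ratio, lower, upper).
STUDIES = [
    {"rob": "low", "est": (1.05, 0.95, 1.16)},
    {"rob": "moderate", "est": (1.10, 0.98, 1.23)},
    {"rob": "serious", "est": (1.40, 1.15, 1.70)},
    {"rob": "critical", "est": (1.60, 1.10, 2.33)},
]

all_studies = pool_by_rob(STUDIES, {"low", "moderate", "serious", "critical"})
restricted = pool_by_rob(STUDIES, {"low", "moderate"})
print("all studies:", [round(x, 2) for x in all_studies])
print("low/moderate only:", [round(x, 2) for x in restricted])
```

In this invented data set the higher-RoB studies report larger ratios, so excluding them lowers the pooled estimate; in real reviews the direction and size of any change depend on the studies involved.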

Supporting Information

S1 Table. The Cochrane risk of bias tool for non-randomized studies of interventions.


S2 Table. Consensus overall risk of bias ratings by study and corresponding reasons for ranking of Loke et al. [17] component studies.


S3 Table. Consensus overall risk of bias ratings by study and corresponding reasons for ranking of McGettigan and Henry [18] component studies.


Author Contributions

Conceived and designed the experiments: DH LR. Performed the experiments: AB TF. Analyzed the data: AB TF. Contributed reagents/materials/analysis tools: AB TF DH LR. Wrote the first draft of the manuscript: AB. Contributed to the writing of the manuscript: AB TF LR DH. Agree with the manuscript’s results and conclusions: AB TF LR DH. All authors have read, and confirm that they meet, ICMJE criteria for authorship.


  1. Van Spall HC, Toren A, Kiss A, Fowler RA. Eligibility criteria of randomized controlled trials published in high-impact general medical journals: a systematic sampling review. JAMA. 2007;297:1233–1240. pmid:17374817
  2. Rothwell PM. External validity of randomised controlled trials: “To whom do the results of this trial apply?”. Lancet. 2005;365:82–93. pmid:15639683
  3. Black N. Why we need observational studies to evaluate the effectiveness of health care. BMJ. 1996;312:1215–1218. pmid:8634569
  4. Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. J Clin Epidemiol. 2005;58:323–337. pmid:15862718
  5. Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res. 2011;46:399–424. pmid:21818162
  6. Golder S, Loke YK, Bland M. Meta-analyses of adverse effects data derived from randomised controlled trials as compared to observational studies: methodological overview. PLoS Med. 2011;8:e1001026. pmid:21559325
  7. Sox HC, Greenfield S. Comparative effectiveness research: a report from the Institute of Medicine. Ann Intern Med. 2009;151:203–205. pmid:19567618
  8. Egger M, Schneider M, Davey Smith G. Spurious precision? Meta-analysis of observational studies. BMJ. 1998;316:140–144. pmid:9462324
  9. Higgins JPT, Altman DG, Gøtzsche PC, Jüni P, Moher D, Oxman AD, et al. The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. BMJ. 2011;343:d5928. pmid:22008217
  10. Deeks J, Dinnes J, D’Amico R, Sowden A, Sakarovitch C. Evaluating non-randomised intervention studies. Health Technol Assess. 2003;7:186.
  11. Sanderson S, Tatt ID, Higgins JP. Tools for assessing quality and susceptibility to bias in observational studies in epidemiology: a systematic review and annotated bibliography. Int J Epidemiol. 2007;36:666–676. pmid:17470488
  12. Wells GA, Shea B, O’Connell D, Peterson J, Welch V, Losos M, et al. The Newcastle–Ottawa Scale (NOS) for assessing the quality of nonrandomised studies in meta-analyses. 2014 [cited 26 Aug 2015]. Ottawa: Ottawa Hospital Research Institute. Available:
  13. Downs SH, Black N. The feasibility of creating a checklist for the assessment of the methodological quality both of randomised and non-randomised studies of health care interventions. J Epidemiol Community Health. 1998;52:377–384. pmid:9764259
  14. Scottish Intercollegiate Guidelines Network. Critical appraisal: notes and checklists. 2015 [cited 10 Oct 2015]. Available:
  15. Jüni P, Witschi A, Bloch R, Egger M. The hazards of scoring the quality of clinical trials for meta-analysis. JAMA. 1999;282:1054–1060. pmid:10493204
  16. Sterne J, Higgins J, Reeves B, editors. A Cochrane risk of bias assessment tool: for non-randomized studies of interventions (ACROBAT-NRSI). Version 1.0.0, 24 September 2014. Available:
  17. Loke YK, Kwok CS, Singh S. Comparative cardiovascular effects of thiazolidinediones: systematic review and meta-analysis of observational studies. BMJ. 2011;342:d1309. pmid:21415101
  18. McGettigan P, Henry D. Cardiovascular risk and inhibition of cyclooxygenase: a systematic review of the observational studies of selective and nonselective inhibitors of cyclooxygenase 2. JAMA. 2006;296:1633–1644. pmid:16968831
  19. McGettigan P, Henry D. Cardiovascular risk with non-steroidal anti-inflammatory drugs: systematic review of population-based controlled observational studies. PLoS Med. 2011;8:e1001098. pmid:21980265
  20. Graham DJ, Ouellet-Hellstrom R, MaCurdy TE, Ali F, Sholley C, Worrall C, et al. Risk of acute myocardial infarction, stroke, heart failure, and death in elderly medicare patients treated with rosiglitazone or pioglitazone. JAMA. 2010;304:411–418. pmid:20584880
  21. Singh G, Mithal A, Triadafilopoulos G. Both selective COX-2 inhibitors and non-selective NSAIDs increase the risk of acute myocardial infarction in patients with arthritis: selectivity is with the patient, not the drug class. Ann Rheum Dis. 2005;64(Suppl III):85.
  22. Sturkenboom MC, Dieleman J, Verhamme K, Straus S, Vander Hoeven-Borgman M, van der Lei J. Cardiovascular events during use of COX-2 selective and non-selective NSAIDs. Pharmacoepidemiol Drug Saf. 2005;14:S57.
  23. Cicchetti DV, Allison T. A new procedure for assessing reliability of scoring EEG sleep recordings. Am J EEG Technol. 1971;11:101–109.
  24. Bilik D, McEwen LN, Brown MB, Selby JV, Karter AJ, Marrero DG, et al. Thiazolidinediones, cardiovascular disease and cardiovascular mortality: Translating Research Into Action For Diabetes (TRIAD). Pharmacoepidemiol Drug Saf. 2010;19:715–721. pmid:20583206
  25. Brownstein JS, Murphy SN, Goldfine AB, Grant RW, Sordo M, Gainer V, et al. Rapid identification of myocardial infarction risk associated with diabetes medications using electronic medical records. Diabetes Care. 2010;33:526–531. pmid:20009093
  26. Dormuth CR, Maclure M, Carney G, Schneeweiss S, Bassett K, Wright JM. Rosiglitazone and myocardial infarction in patients previously prescribed metformin. PLoS ONE. 2009;4:e6080. pmid:19562036
  27. Hsiao F-Y, Huang W-F, Wen Y-W, Chen P-F, Kuo K, Tsai Y-W. Thiazolidinediones and cardiovascular events in patients with type 2 diabetes mellitus: a retrospective cohort study of over 473,000 patients using the National Health Insurance database in Taiwan. Drug Saf. 2009;32:675–690. pmid:19591532
  28. Juurlink DN, Gomes T, Lipscombe LL, Austin PC, Hux JE, Mamdani MM. Adverse cardiovascular events during treatment with pioglitazone and rosiglitazone: population based cohort study. BMJ. 2009;339:b2942. pmid:19690342
  29. Koro CE, Fu Q, Stender M. An assessment of the effect of thiazolidinedione exposure on the risk of myocardial infarction in type 2 diabetic patients. Pharmacoepidemiol Drug Saf. 2008;17:989–996. pmid:18759378
  30. Lipscombe LL, Gomes T, Lévesque LE, Hux JE, Juurlink DN, Alter DA. Thiazolidinediones and cardiovascular outcomes in older patients with diabetes. JAMA. 2007;298:2634–2643. pmid:18073359
  31. Margolis DJ, Hofstad O, Strom BL. Association between serious ischemic cardiac outcomes and medications used to treat diabetes. Pharmacoepidemiol Drug Saf. 2008;17:753–759. pmid:18613215
  32. Pantalone K, Kattan M, Yu C, Wells B, Arrigain S, Jain A, et al. The risk of developing coronary artery disease or congestive heart failure, and overall mortality, in type 2 diabetic patients receiving rosiglitazone, pioglitazone, metformin, or sulfonylureas: a retrospective analysis. Acta Diabetol. 2009;46:145–154. pmid:19194648
  33. Stockl KM, Le L, Zhang S, Harada ASM. Risk of acute myocardial infarction in patients treated with thiazolidinediones or other antidiabetic medications. Pharmacoepidemiol Drug Saf. 2009;18:166–174. pmid:19109802
  34. Tzoulaki I, Molokhia M, Curcin V, Little MP, Millett CJ, Ng A, et al. Risk of cardiovascular disease and all cause mortality among patients with type 2 diabetes prescribed oral antidiabetes drugs: retrospective cohort study using UK general practice research database. BMJ. 2009;339:b4731. pmid:19959591
  35. Walker AM, Koro CE, Landon J. Coronary heart disease outcomes in patients receiving antidiabetic agents in the PharMetrics database 2000–2007. Pharmacoepidemiol Drug Saf. 2008;17:760–768. pmid:18383443
  36. Wertz DA, Chang C-L, Sarawate CA, Willey VJ, Cziraky MJ, Bohn RL. Risk of cardiovascular events and all-cause mortality in patients treated with thiazolidinediones in a managed-care population. Circ Cardiovasc Qual Outcomes. 2010;3:538–545. pmid:20736441
  37. Winkelmayer WC, Setoguchi S, Levin R, Solomon DH. Comparison of cardiovascular outcomes in elderly patients with diabetes who initiated rosiglitazone vs pioglitazone therapy. Arch Intern Med. 2008;168:2368–2375. pmid:19029503
  38. Ziyadeh N, McAfee AT, Koro C, Landon J, Arnold Chan K. The thiazolidinediones rosiglitazone and pioglitazone and the risk of coronary heart disease: a retrospective cohort study using a US health insurance database. Clin Ther. 2009;31:2665–2677. pmid:20110009
  39. Bak S, Andersen M, Tsiropoulos I, García Rodríguez LA, Hallas J, Christensen K, et al. Risk of stroke associated with nonsteroidal anti-inflammatory drugs: a nested case-control study. Stroke. 2003;34:379–386.
  40. Curtis JP, Wang Y, Portnay EL, Masoudi FA, Havranek EP, Krumholz HM. Aspirin, ibuprofen, and mortality after myocardial infarction: retrospective cohort study. BMJ. 2003;327:1322–1323. pmid:14656840
  41. Fischer LM, Schlienger RG, Matter CM, Jick H, Meier CR. Current use of nonsteroidal antiinflammatory drugs and the risk of acute myocardial infarction. Pharmacotherapy. 2005;25:503–510. pmid:15977911
  42. García Rodríguez LA, Varas C, Patrono C. Differential effects of aspirin and non-aspirin nonsteroidal antiinflammatory drugs in the primary prevention of myocardial infarction in postmenopausal women. Epidemiology. 2000;11:382–387. pmid:10874543
  43. García Rodríguez LA, Varas-Lorenzo C, Maguire A, González-Pérez A. Nonsteroidal antiinflammatory drugs and the risk of myocardial infarction in the general population. Circulation. 2004;109:3000–3006. pmid:15197149
  44. Gislason GH, Jacobsen S, Rasmussen JN, Rasmussen S, Buch P, Friberg J, et al. Risk of death or reinfarction associated with the use of selective cyclooxygenase-2 inhibitors and nonselective nonsteroidal antiinflammatory drugs after acute myocardial infarction. Circulation. 2006;113:2906–2913. pmid:16785336
  45. Graham DJ, Campen D, Hui R, Spence M, Cheetham C, Levy G, et al. Risk of acute myocardial infarction and sudden cardiac death in patients treated with cyclo-oxygenase 2 selective and non-selective non-steroidal anti-inflammatory drugs: nested case-control study. Lancet. 2005;365:475–481. pmid:15705456
  46. Hippisley-Cox J, Coupland C. Risk of myocardial infarction in patients taking cyclo-oxygenase-2 inhibitors or conventional non-steroidal anti-inflammatory drugs: population based nested case-control analysis. BMJ. 2005;330:1366. pmid:15947398
  47. Johnsen SP, Larsson H, Tarone RE, McLaughlin JK, Nørgård B, Friis S, et al. Risk of hospitalization for myocardial infarction among users of rofecoxib, celecoxib, and other nsaids: a population-based case-control study. Arch Intern Med. 2005;165:978–984. pmid:15883235
  48. Kimmel SE, Berlin JA, Reilly M, Jaskowiak J, Kishel L, Chittams J, et al. The effects of nonselective non-aspirin non-steroidal anti-inflammatory medications on the risk of nonfatal myocardial infarction and their interaction with aspirin. J Am Coll Cardiol. 2004;43:985–990. pmid:15028354
  49. Kimmel SE, Berlin JA, Reilly M, Jaskowiak J, Kishel L, Chittams J, et al. Patients exposed to rofecoxib and celecoxib have different odds of nonfatal myocardial infarction. Ann Intern Med. 2005;142:157–164. pmid:15684203
  50. Lévesque LE, Brophy JM, Zhang B. The risk for myocardial infarction with cyclooxygenase-2 inhibitors: a population study of elderly adults. Ann Intern Med. 2005;142:481–489. pmid:15809459
  51. MacDonald TM, Wei L. Effect of ibuprofen on cardioprotective effect of aspirin. Lancet. 2003;361:573–574. pmid:12598144
  52. Mamdani M, Rochon P, Juurlink DN, Anderson GM, Kopp A, Naglie G, et al. Effect of selective cyclooxygenase 2 inhibitors and naproxen on short-term risk of acute myocardial infarction in the elderly. Arch Intern Med. 2003;163:481–486. pmid:12588209
  53. McGettigan P, Han P, Henry D. Cyclooxygenase-2 inhibitors and coronary occlusion—exploring dose–response relationships. Br J Clin Pharmacol. 2006;62:358–365. pmid:16934052
  54. Ray WA, Stein CM, Hall K, Daugherty JR, Griffin MR. Non-steroidal anti-inflammatory drugs and risk of serious coronary heart disease: an observational cohort study. Lancet. 2002;359:118–123. pmid:11809254
  55. Ray WA, Stein CM, Daugherty JR, Hall K, Arbogast PG, Griffin MR. COX-2 selective non-steroidal anti-inflammatory drugs and risk of serious coronary heart disease. Lancet. 2002;360:1071–1073. pmid:12383990
  56. Schlienger RG, Jick H, Meier CR. Use of nonsteroidal anti-inflammatory drugs and the risk of first-time acute myocardial infarction. Br J Clin Pharmacol. 2002;54:327–332. pmid:12236854
  57. Solomon DH, Glynn RJ, Levin R, Avorn J. Nonsteroidal anti-inflammatory drug use and acute myocardial infarction. Arch Intern Med. 2002;162:1099–1104. pmid:12020178
  58. Solomon DH, Schneeweiss S, Glynn RJ, Kiyota Y, Levin R, Mogun H, et al. Relationship between selective cyclooxygenase-2 inhibitors and acute myocardial infarction in older adults. Circulation. 2004;109:2068–2073. pmid:15096449
  59. Watson DJ, Rhodes T, Cai B, Guess HA. Lower risk of thromboembolic cardiovascular events with naproxen among patients with rheumatoid arthritis. Arch Intern Med. 2002;162:1105–1110. pmid:12020179
  60. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174. pmid:843571
  61. Nissen SE, Wolski K. Rosiglitazone revisited: an updated meta-analysis of risk for myocardial infarction and cardiovascular mortality. Arch Intern Med. 2010;170:1191–1201. pmid:20656674
  62. Katikireddi SV, Egan M, Petticrew M. How do systematic reviews incorporate risk of bias assessments into the synthesis of evidence? A methodological study. J Epidemiol Community Health. 2015;69:189–195. pmid:25481532
  63. Bernatsky S, Slim Z, Yuwan M. ACPA and future onset of rheumatoid arthritis among individuals with undifferentiated arthritis and arthritis free individuals: a systematic review of cohort studies. J Autoimmune Dis Rheumatol. 2015;3:30–40.
  64. Kalkhoran S, Glantz S. E-cigarettes and smoking cessation in real-world and clinical settings: a systematic review and meta-analysis. Lancet Respir Med. 2016;4:116–128. pmid:26776875
  65. Emmett C, Close H, Yiannakou Y, Mason J. Trans-anal irrigation therapy to treat adult chronic functional constipation: systematic review and meta-analysis. BMC Gastroenterol. 2015;15:139. pmid:26474758
  66. Staton C, Vissoci J, Gong E, Toomey N, Wafula R, Abdelgadir J, et al. Road traffic injury prevention initiatives: a systematic review and metasummary of effectiveness in low and middle income countries. PLoS ONE. 2016;11:e0144971. pmid:26735918
  67. Thomson H, Campbell M, Craig P, Hilton Boon M, Katikireddi V, editors. ACROBAT-NRSI for public health: reporting on feasibility and utility of applying ACROBAT to studies of housing improvement. Cochrane Colloquium 2015; 3–7 Oct 2015; Vienna, Austria.
  68. Morgan R, Thayer K, Guyatt G, Blain R, Eftim S, Ross P, et al. Assessing the usability of ACROBAT-NRSI for studies of exposure and intervention in environmental health research. Cochrane Colloquium 2015; 3–7 Oct 2015; Vienna, Austria.
  69. Couto E, Pike E, Torkilseng E, Klemp M. Inter-rater reliability of the Risk of Bias Assessment Tool: for Non-Randomized Studies of Interventions (ACROBAT-NRSI). Cochrane Colloquium 2015; 3–7 Oct 2015; Vienna, Austria.
  70. Whiting P, Savović J, Higgins J, Caldwell D, Reeves B, Shea B, et al. ROBIS: a new tool to assess risk of bias in systematic reviews was developed. J Clin Epidemiol. 2016;69:225–234. pmid:26092286
  71. Shea B, AMSTAR Development Group. AMSTAR: helping decision makers distinguish high and low quality systematic reviews that include non-randomized studies. Cochrane Colloquium 2015; 3–7 Oct 2015; Vienna, Austria.