Poor Reliability between Cochrane Reviewers and Blinded External Reviewers When Applying the Cochrane Risk of Bias Tool in Physical Therapy Trials

  • Susan Armijo-Olivo ,

    sla4@ualberta.ca

    Affiliations CLEAR (Connecting Leadership and Research) Outcomes Research Program, Faculty of Nursing, University of Alberta, Edmonton, Alberta, Canada, Faculty of Rehabilitation Medicine, Department of Physical Therapy, University of Alberta, Edmonton, Alberta, Canada

  • Maria Ospina,

    Affiliation Emergency Medicine Strategic Clinical Network, Alberta Health Services, Department of Emergency Medicine, Faculty of Medicine & Dentistry, University of Alberta, Edmonton, Alberta, Canada

  • Bruno R. da Costa,

    Affiliation Department of Physical Therapy, Florida International University, Miami, Florida, United States of America

  • Matthias Egger,

    Affiliation Institute of Social & Preventive Medicine, University of Bern, Bern, Switzerland

  • Humam Saltaji,

    Affiliation Orthodontic Graduate Program, School of Dentistry, University of Alberta, Edmonton, Alberta, Canada

  • Jorge Fuentes,

    Affiliations Faculty of Rehabilitation Medicine, University of Alberta, Edmonton, Alberta, Canada, Catholic University of Maule, Department of Physical Therapy, Talca, Maule, Chile

  • Christine Ha,

    Affiliation Rehabilitation Research Center, Faculty of Rehabilitation Medicine, University of Alberta, Edmonton, Alberta, Canada

  • Greta G. Cummings

    Affiliation CLEAR (Connecting Leadership and Research) Outcomes Research Program, Faculty of Nursing, University of Alberta, Edmonton, Alberta, Canada

Abstract

Objectives

To test the inter-rater reliability of the RoB tool applied to Physical Therapy (PT) trials by comparing ratings from Cochrane review authors with those of blinded external reviewers.

Methods

Randomized controlled trials (RCTs) in PT were identified by searching the Cochrane Database of Systematic Reviews for meta-analyses of PT interventions. RoB assessments were conducted independently by 2 reviewers blinded to the RoB ratings reported in the Cochrane reviews. Data on RoB assessments from Cochrane reviews and other characteristics of reviews and trials were extracted. Consensus assessments between the two reviewers were then compared with the RoB ratings from the Cochrane reviews. Agreement between Cochrane and blinded external reviewers was assessed using weighted kappa (κ).

Results

In total, 109 trials included in 17 Cochrane reviews were assessed. Inter-rater reliability on the overall RoB assessment between Cochrane review authors and blinded external reviewers was poor (κ = 0.02, 95% CI: −0.06, 0.06). Inter-rater reliability on individual domains of the RoB tool was poor (median κ = 0.19), ranging from κ = −0.04 (“Other bias”) to κ = 0.62 (“Sequence generation”). There was also no agreement (κ = −0.29, 95% CI: −0.81, 0.35) in the overall RoB assessment at the meta-analysis level.

Conclusions

Risk of bias assessments of RCTs using the RoB tool are not consistent across different research groups. Poor agreement was not only demonstrated at the trial level but also at the meta-analysis level. Results have implications for decision making since different recommendations can be reached depending on the group analyzing the evidence. Improved guidelines to consistently apply the RoB tool and revisions to the tool for different health areas are needed.

Introduction

The term “quality assessment” has been used extensively in the literature, particularly in the context of systematic reviews, to refer to the critical appraisal of primary studies. Different approaches have been proposed for assessing the quality of studies [1], [2]. A variety of methods (scales and checklists) have been used by different Cochrane Review groups [3], [4]; however, because of methodological inconsistencies across quality instruments and the lack of empirical evidence supporting their validity and reliability [5], [6], the use of these methods was explicitly discouraged in Cochrane reviews [3].

In 2008, the Cochrane Collaboration (CC) initiated a shift in the approach to the evaluation of trial quality by linking the concept of quality to the internal validity of a study (risk of bias; the extent to which the design and conduct of a study are likely to prevent bias) [3]. The Cochrane Collaboration developed the Risk of Bias tool (RoB) as a method to assess risk of bias based on study design and conduct rather than relying on general reporting issues of trial characteristics [3]. Since then, the Cochrane Collaboration has required the use of the RoB tool to establish consistency in the assessment of study quality across Cochrane Review groups.

The RoB tool is based on six domains and seven items: sequence generation, allocation concealment, blinding (assessed separately for participants and personnel and for outcome assessors), incomplete outcome data, selective outcome reporting, and “other sources of bias.” Critical assessments of the risk of bias (high, low, unclear) in each domain are made separately for each outcome in a given study. The choice of these components for inclusion in the tool was based on empirical evidence of their association with effect estimates [5], [7], [8]. Recent research [9], [10] recommends further testing of the psychometric properties (i.e., validity, reliability, and responsiveness) of the RoB tool, and evaluations of the tool in a broad range of research fields. In addition, researchers have called for the use of clear and consistent guidelines and classification systems to apply and interpret the RoB tool [11]. This information is essential since differences in the appraisal and interpretation of risk of bias across trials can explain variation in the interpretation of results of studies included in a systematic review, and ultimately affect conclusions and clinical practice.

Although the RoB tool is increasingly used in Cochrane reviews, few studies have assessed its psychometric properties, and those that have were conducted in paediatric, general medical, and oncology trials [9], [10], [12], [13]. The inter-rater agreement for the individual domains of the RoB tool has been found to range from poor (κ [kappa] = 0.13 for selective reporting) to substantial (κ = 0.74 for sequence generation) [9]. A recent study [13] assessed the reliability of the RoB tool between individual reviewers and across consensus ratings of pairs of reviewers on samples of 154 and 30 randomized clinical trials (RCTs) published in the general medical literature, respectively. The study found that the reliability between pairs of reviewers was “fair” for most RoB domains, with kappa values ranging from 0.2 to 0.34. However, the agreement between consensus ratings was always poorer than the agreement between pairs of reviewers, indicating high variability in interpreting and applying the RoB tool across different systematic review groups and across systematic reviews [13]. This comparison of consensus ratings (across pairs of reviewers) was conducted on only 30 trials, within a group of reviewers from the same team, using guidelines developed specifically for the study.

The reliability of the RoB tool has not been investigated by comparing ratings of an external blinded panel of reviewers with those obtained from authors of Cochrane reviews. This work is of crucial importance for researchers who incorporate risk of bias assessments from Cochrane and non-Cochrane systematic reviews into meta-epidemiological research approaches, since risk of bias assessments obtained by different research groups can lead to different results. Furthermore, the reliability of the RoB tool in the context of physical therapy (PT) trials has not yet been evaluated. The objectives of this study were to test the inter-rater reliability of the RoB tool applied to PT trials by comparing consensus ratings from Cochrane review authors with those of blinded external reviewers, and to investigate potential sources of disagreement to inform the use of the RoB tool.

Methods

The Cochrane Database of Systematic Reviews (CDSR) was systematically searched from 2005 to May 25 2011 for meta-analyses of PT interventions using the words physical therapy, physiotherapy, rehabilitation, exercise, electrophysical agents, acupuncture, massage, transcutaneous electrical stimulation (TENS), interferential current, ultrasound, stretching, chest therapy, pulmonary rehabilitation, manipulative therapy, mobilization, and related terms. For a detailed search strategy see Appendix S1. Meta-analyses and their RCTs were included if: 1) the meta-analysis included at least 5 RCTs, with at least one of the interventions being currently or potentially part of PT practice according to the World Confederation for Physical Therapy (WCPT) [14]; 2) the outcome of interest in the meta-analysis (explicitly described as the main outcome or the outcome with the largest number of trials) was continuous; and 3) the RoB tool was used for assessment of individual trials. A unique identifier was assigned to meta-analyses and trials that met the inclusion criteria.

RoB assessments procedure

The risk of bias of individual trials included in the meta-analyses was assessed on 6 domains (7 items) of the RoB tool [15]: sequence generation, allocation concealment, blinding of participants and personnel, blinding of outcome assessors, incomplete outcome data, selective outcome reporting, and other sources of bias. We followed the guidelines established by the Cochrane Collaboration to perform RoB assessments; however, we developed specific decision rules to guide our judgements (Appendix S2). Risk of bias evaluations for blinding and incomplete outcome data were based upon the primary (continuous) outcome of interest selected for meta-analysis in the Cochrane review. If not clearly specified, the outcome was chosen according to the meta-analysis that contained the largest number of trials in the review. The Cochrane guidelines recommend using trial protocols to complete assessments of selective outcome reporting bias. However, due to the low likelihood of locating protocols for trials, we did not search for study protocols [24]. Therefore, for the category of “low” risk of bias, it was required that trial publications reported all primary and secondary outcomes in the methods and results sections, with no new outcomes added in the results section. If the primary outcome of the trial was not included in the results, the trial was rated as high risk of selective outcome reporting bias. In addition, we paired outcomes reported in the methods and results sections. If more than 70% of the secondary outcomes were not reported in the results or methods sections, then the study was rated as high RoB. For “other bias”, we looked at baseline comparability, control for co-interventions (contamination bias), and whether treatment compliance was acceptable. These criteria have been used in the risk of bias assessments of the Cochrane Back Review Group to determine other sources of potential bias [16].

For the overall assessment of RoB, a trial was considered at low risk of bias if it was rated as low risk in all individual domains; if the rating was unclear in at least one domain, and the other domains were unclear or low, the overall assessment of RoB was unclear. Finally, an overall assessment of high risk of bias was considered if at least one domain was rated as high [12], [13].
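The overall-rating rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text; the function name `overall_rob` and the lowercase rating labels are assumptions made for the example and are not taken from the study materials.

```python
VALID_RATINGS = {"low", "unclear", "high"}


def overall_rob(domain_ratings):
    """Combine per-domain ratings into an overall risk-of-bias rating.

    Rule as stated in the text:
      - 'high'    if at least one domain is rated high
      - 'unclear' if no domain is high and at least one is unclear
      - 'low'     only if every domain is rated low
    """
    ratings = [r.strip().lower() for r in domain_ratings]
    if not ratings:
        raise ValueError("at least one domain rating is required")
    unknown = set(ratings) - VALID_RATINGS
    if unknown:
        raise ValueError(f"unexpected rating(s): {sorted(unknown)}")
    if "high" in ratings:
        return "high"
    if "unclear" in ratings:
        return "unclear"
    return "low"


if __name__ == "__main__":
    # Seven items, one of them unclear and none high.
    print(overall_rob(["low", "low", "unclear", "low", "low", "low", "low"]))
```

Because a single "high" or "unclear" item determines the overall rating, a trial needs seven "low" items to be rated "low" overall, which is worth keeping in mind when reading the distribution of overall ratings in the Results.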

Two independent reviewers (drawn from SAO, JF, HS, CH, AC, and DP), blinded to the RoB ratings reported in the Cochrane reviews, assessed the risk of bias of all PT trials included in the meta-analyses. Each pair of reviewers assessed risk of bias in each study, and disagreements were resolved by discussion between reviewers until consensus was reached. If consensus could not be achieved, a final decision on RoB assessments was to be reached after consultation with a third reviewer (first author), although this was never necessary. Blinding of the external panel of reviewers was achieved as follows: 1) reviewers were not told the objective of this study; 2) they were not provided with RoB assessments performed by Cochrane reviewers; 3) after the external panel of reviewers completed their assessments, an independent reviewer who was not part of the review panel (MO) extracted the RoB assessments performed by Cochrane reviewers. The integrity of blinding was assessed by asking the reviewers post hoc if they had checked the Cochrane RoB assessment. None of them reported having done so.

Data on RoB assessments from Cochrane reviews and other characteristics of reviews and trials were extracted by one reviewer (MO or SAO) and entered directly into a pilot tested electronic form. Consensus assessments between the two reviewers from our panel were then compared with the RoB ratings from the Cochrane reviews. In addition, two reviewers independently assessed the RoB at the meta-analysis level for both groups of reviewers (i.e. external panel of reviewers and Cochrane reviewers) using the guidelines established by the Cochrane handbook [15], [17]. A low, unclear and high RoB at the meta-analysis level was defined as: “most information is from studies at low, unclear or high risk of bias respectively” [15], [17]. Since no further guidance is in the Cochrane handbook, we established an arbitrary cut-off value of 60% to define the “majority of studies”. Assessments were compared and discrepancies were resolved by consensus between reviewers.
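A minimal sketch of the meta-analysis-level rule follows. It rests on several assumptions that the text does not settle: the 60% cut-off is applied to the share of trials (not to their statistical weight), a share of exactly 60% counts as reaching the cut-off, and the function returns `None` when no category reaches it. The function name `meta_analysis_rob` is hypothetical.

```python
from collections import Counter


def meta_analysis_rob(trial_ratings, cutoff=0.60):
    """Classify a meta-analysis by the rating held by most of its trials.

    trial_ratings: overall ratings ('low', 'unclear', 'high'), one per trial.
    cutoff: share of trials that defines "most" (0.60 in the text).
    Returns the dominant rating, or None if no rating reaches the cut-off
    (an assumption: the text does not describe this case).
    """
    n = len(trial_ratings)
    if n == 0:
        raise ValueError("a meta-analysis needs at least one trial")
    counts = Counter(r.strip().lower() for r in trial_ratings)
    for level in ("high", "unclear", "low"):
        # With a cut-off above 50%, at most one level can satisfy this test.
        if counts[level] / n >= cutoff:
            return level
    return None


if __name__ == "__main__":
    print(meta_analysis_rob(["high", "high", "high", "high", "unclear"]))
```

With five trials per meta-analysis (the minimum for inclusion), one reclassified trial can move a meta-analysis across the 60% line, so trial-level disagreement can carry straight through to the meta-analysis level.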

Characteristics of the reviewers' panel

Six reviewers with experience in different areas of health sciences research comprised the review panel in this study. Two reviewers had a Bachelor's degree in Health Sciences (CH, AC), one had a Master's in Public Health (DP), one had a Master's in Dentistry and was working on a PhD in Orthodontics (HS), and two were physical therapists with a Master's and PhD in Rehabilitation Sciences (SAO, JF) and at least 10 years of experience in critical appraisal and systematic reviews. Four of them (DP, HS, SAO, and JF) had formal training in critical appraisal and systematic reviews. The other two (CH, AC) had at least one year of hands-on experience conducting systematic reviews. Four of the reviewers (SAO, JF, HS, CH) were part of the research team collaborating in this project, and two (DP, AC) were hired to perform the data extraction and quality assessments. All of them verbally agreed to participate as reviewers in this study.

Training process

All reviewers were trained and received guidelines for RoB assessments from the first author (SAO), who was a physical therapist by training and had an MSc and PhD in Rehabilitation Sciences and more than 10 years of experience in critical appraisal and systematic reviews. Reviewer training was carried out using 10 trials not included in the study. Results of RoB assessments for these 10 studies were independently reviewed and discussed in a group meeting to determine consistency in ratings. In addition, the team members met on a regular basis to further calibrate RoB assessments throughout the study.

Statistical analysis

Inter-rater reliability of RoB assessments between Cochrane and blinded external reviewers [18]–[20] and within the panel of external reviewers was assessed using weighted kappa (κ) for categorical data. Inter-rater scores for both individual domains and overall assessments of the RoB tool were considered. Analyses were conducted using STATA (version 12, Stata Corp; College Station, Texas; USA). For raw data for each domain see Appendix S3.
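For readers unfamiliar with the statistic, the following is a minimal pure-Python sketch of Cohen's weighted kappa for two sets of ratings. It is not the STATA procedure used in the study. It assumes the ordering low < unclear < high and linear disagreement weights; the text does not state which weighting scheme was applied, and the function name `weighted_kappa` is hypothetical.

```python
def weighted_kappa(rater_a, rater_b, categories=("low", "unclear", "high")):
    """Cohen's weighted kappa with linear weights for ordered categories.

    kappa_w = 1 - (weighted observed disagreement / weighted expected disagreement),
    where the disagreement weight for cells (i, j) is |i - j| / (k - 1).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of trials")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed proportions for every pair of categories.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    disagree_obs = 0.0
    disagree_exp = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            disagree_obs += weight * observed[i][j]
            disagree_exp += weight * row[i] * col[j]

    if disagree_exp == 0.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - disagree_obs / disagree_exp


if __name__ == "__main__":
    cochrane = ["high", "high", "unclear", "low", "high"]   # illustrative ratings only
    external = ["unclear", "high", "unclear", "low", "unclear"]
    print(round(weighted_kappa(cochrane, external), 2))
```

Under linear weights, a high/unclear disagreement is penalised half as much as a high/low disagreement. Quadratic weights would penalise adjacent-category disagreements less still, so the choice of weights can change the reported κ.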

Criteria proposed by Byrt [21] were used to interpret kappa values. Values between 0.93–1.00 represented excellent agreement; 0.81–0.92 very good agreement; 0.61–0.80 good agreement; 0.41–0.60 fair agreement; 0.21–0.40 slight agreement; 0.01–0.20 poor agreement; and values of 0.00 or less were considered to represent no agreement.
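The interpretation bands above can be expressed as a small lookup. The sketch below is illustrative only. Because the bands are stated to two decimals, it assumes that a value falling between two bands (for example 0.205) is assigned to the higher band; the function name `interpret_kappa` is hypothetical.

```python
# Upper bound of each band, in ascending order, as listed in the text.
BYRT_BANDS = [
    (0.20, "poor"),
    (0.40, "slight"),
    (0.60, "fair"),
    (0.80, "good"),
    (0.92, "very good"),
    (1.00, "excellent"),
]


def interpret_kappa(kappa):
    """Map a kappa value to the agreement label used in this study."""
    if kappa > 1.0:
        raise ValueError("kappa cannot exceed 1")
    if kappa <= 0.0:
        return "no agreement"
    for upper, label in BYRT_BANDS:
        if kappa <= upper:
            return label


if __name__ == "__main__":
    for value in (0.02, 0.55, -0.29):
        print(value, interpret_kappa(value))
```

Applied to the values reported in this article, the lookup returns "poor" for κ = 0.02, "fair" for κ = 0.55, and "no agreement" for κ = −0.29, matching the labels used in the Results.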

Results

Literature search

The systematic search of the CDSR resulted in the identification of 3901 Cochrane review titles, with 271 reviews being potentially relevant to physical therapy. Of these, 68 Cochrane reviews included a meta-analysis of at least five studies on PT interventions assessing a continuous outcome. Figure 1 outlines the retrieval of Cochrane reviews and the number of trials included in the analysis. A total of 109 trials included in 17 Cochrane reviews that used the RoB tool were assessed. Table 1 summarizes the characteristics of the Cochrane reviews included in the study.

Table 1. Characteristics of Cochrane systematic reviews on physical therapy interventions that provided trial data for the analysis of inter-rater reliability of RoB.

https://doi.org/10.1371/journal.pone.0096920.t001

Characteristics of selected studies

Briefly, the reviews were published between 2008 and 2011 and included meta-analyses of the effectiveness of PT interventions for musculoskeletal (9 reviews) [22]–[30], cardiorespiratory (4 reviews) [31]–[34], neurological (2 reviews) [35], [36], gynaecological (1 review) [37], and general conditions (1 review) [38].

The majority of Cochrane reviews (15 reviews) did not include a formal evaluation of the inter-rater reliability of the RoB assessments. Although the majority of reviews stated that two independent reviewers assessed study RoB, in four reviews a single reviewer assessed RoB, with verification by a second reviewer. Similarly, twelve of the 17 (71%) Cochrane systematic reviews did not clearly specify the outcome used for the RoB assessments, and eight of the 17 (47%) combined all outcomes into a single bias assessment.

A median of six trials was included in the meta-analyses (interquartile range: 5, 8). All trials used a parallel-group design except one, which was a cross-over trial. The majority of trials (n = 93) used active controls, whereas 15 trials were placebo-controlled. The control group of one trial was not clearly identified. Seventy-five trials were efficacy trials, 26 were effectiveness trials, and seven combined an evaluation of the efficacy and effectiveness of PT interventions. One trial was not clearly described as an efficacy or effectiveness trial.

The number of trials available for assessing the inter-rater reliability of both individual-domain and overall RoB assessments varied as not all Cochrane reviews reported ratings for all the domains of the RoB tool. Inter-rater reliability of RoB assessments between Cochrane review authors and blinded external reviewers and the inter-rater reliability within the external panel of reviewers are presented in Table 2.

Table 2. Reliability between Cochrane Reviewers and External Panel and Reliability for the External Panel.

https://doi.org/10.1371/journal.pone.0096920.t002

Inter-rater agreement: Cochrane review authors vs. blinded external reviewers

Inter-rater reliability on the overall RoB assessment between Cochrane review authors and blinded external reviewers was poor (κ  =  0.02, 95%CI: −0.06, 0.06). Inter-rater reliability on individual domains of the RoB tool was poor (median κ  = 0.19), ranging from κ  =  −0.04 (“Other bias”) to κ  =  0.62 (“Sequence generation”). Table 2 displays the inter-rater reliability of the RoB tool between the blinded external review panel versus Cochrane reviewers.

When overall RoB categories assigned by blinded external reviewers were compared to those of Cochrane review authors, we found that the number of trials assessed as “low” risk of bias by Cochrane review authors (n = 9) was greater than the number so assessed by blinded external reviewers (n = 3). Similarly, the number of trials rated as “high” risk of bias by Cochrane review authors (n = 66) was greater than the number so rated by blinded external reviewers (n = 31). In contrast, blinded external reviewers assessed a greater number of trials as “unclear” in the overall RoB assessment (n = 74) compared to Cochrane review authors (n = 33). The main source of disagreement between Cochrane review authors and blinded external reviewers in the overall rating of RoB was the classification of “unclear” vs. “high” risk of bias, with 45 trials rated as “high” risk of bias by Cochrane review authors and “unclear” by blinded external reviewers.

Inter-rater agreement within the panel of blinded external reviewers

The inter-rater reliability between blinded external reviewers on the overall RoB rating was fair (κ  =  0.55, 95%CI: 0.40, 0.70). Inter-rater reliability on individual domains of the RoB tool was fair (median κ  = 0.56) ranging from κ  =  0.32 (“Other bias”) to κ  =  0.79 (“allocation concealment”).

Overall RoB at the Meta-analysis level

There was no agreement (κ  =  −0.29, 95%CI: −0.81, 0.35) in the overall RoB assessment at the meta-analysis level between Cochrane review authors and blinded external reviewers. Cochrane reviewers had evaluated 10 meta-analyses as high RoB while the external panel of reviewers classified them as “unclear”. Table 3 displays the RoB assessment at the meta-analysis level.

Table 3. Comparison of Overall ratings at the meta-analysis level between external panel and Cochrane reviewers.

https://doi.org/10.1371/journal.pone.0096920.t003

Discussion

Based on the assessment of RCTs included in Cochrane reviews of PT interventions, this study found that the inter-rater reliability of RoB assessments between Cochrane review authors and blinded external reviewers was poor. This result confirms the findings of previous studies regarding the poor reliability of the RoB tool domains in other areas of health research [9], [10], [12], [13]. Our results indicated that RoB assessments in Cochrane reviews could not be replicated consistently by an external panel of reviewers using consensus RoB assessments.

Consensus ratings are of crucial importance since they are commonly used in systematic reviews. Only one previous study assessed the reliability of the RoB tool based on consensus assessments across pairs of reviewers from four research centres, using a sample of 30 trials indexed in PubMed between 2000 and 2006 [13]. Using a larger number of trials in PT and comparing the RoB consensus ratings between blinded external reviewers and Cochrane reviewers, our study confirmed that agreement across pairs of reviewers is generally lower than agreement between reviewers. Cochrane reviews have long been considered the gold standard for systematic reviews in health care. Our results have important implications for the interpretation of RoB assessments reported across Cochrane reviews and produced by different Cochrane Review Groups. The poor agreement in RoB assessments between Cochrane reviewers and an external panel of reviewers raises several concerns: 1) RoB assessments cannot be reproduced by different groups of reviewers. If true, it would mean that RoB assessments are not reliable and depend on the reviewers' level of knowledge and familiarity with the information provided in the individual trials; 2) the RoB tool is a very subjective tool that cannot provide reliable assessments; 3) despite efforts by the Cochrane Collaboration to establish high quality standards for conducting systematic reviews, poor agreement appears to be the norm rather than the exception when conducting RoB assessments. Thus, we pose the following questions: can we trust risk of bias results reported in Cochrane reviews? Can we trust assessments using the RoB tool?

The low reliability of RoB assessments between our panel of blinded external reviewers and Cochrane reviewers has implications for researchers who use bias ratings from Cochrane reviews or other external sources to conduct meta-epidemiological research on the relationship between trial characteristics and over- and under-estimation of treatment effects, since bias ratings obtained by different research groups can lead to different results. For example, authors of meta-epidemiological studies [8], [39], [40] have taken information from external sources (Cochrane assessments, or information provided by authors of reviews). Although using data reported in the reviews is a practical and cost-efficient way to obtain information, authors should be aware that these evaluations may be inconsistent and prone to bias due to many factors such as the expertise, training, level of education, and other characteristics of the reviewers making quality judgements.

Very low agreement between Cochrane reviewers and the external panel was obtained for allocation concealment, blinding of participants, blinding of outcome assessment, and incomplete data. These features of a trial can have a substantial impact on the estimates of treatment effect [5], [9], [40]–[42]. Some studies, for example, have found that inadequate allocation concealment or lack of double-blinding can overestimate treatment effects on average by 18% and 9%, respectively [5], [40], [42]. Nevertheless, other studies have found that trials with adequate allocation concealment and blinding had higher treatment effects than trials that did not meet these methodological standards [43], [44]. Similarly, effect sizes from trials that excluded dropouts from the analysis or used a modified intention to treat (ITT) approach were more likely to show a beneficial effect than trials without exclusions, demonstrating that the ITT principle is important to preserve the benefits of randomization and keep estimates unbiased [45]–[47]. Over-estimates of treatment effects, or bias, at the trial level can lead to biased or inaccurate results and conclusions in systematic reviews and meta-analyses [40], [41], [48]–[50]. In addition, our analyses showed no agreement between decisions made based on RoB assessments at the level of meta-analysis. This means that the two groups of reviewers did not agree on the overall quality of the evidence at the meta-analysis level. These factors can ultimately have repercussions on decision-making and quality of patient care, since different assessments could lead to different decisions for clinical practice. It is therefore alarming that the disagreements between the two panels of reviewers are worse when it matters most.

The selection of different outcomes for RoB assessments may have contributed to the poor agreement between Cochrane reviewers and the panel of blinded external reviewers. The majority of Cochrane reviews analyzed did not clearly specify the outcome used for RoB assessments. This directly reduces the reproducibility of RoB assessments for outcome-dependent domains of the tool. Cochrane reviewers should report RoB assessments separately for each outcome analyzed, or at least for the main outcomes of the review. Nearly half of the systematic reviews included in this study (8 of 17) combined all outcomes into a single bias assessment; it is therefore uncertain to which outcome the RoB assessments applied. Cochrane reviewers should clearly state which outcomes were used to perform the RoB assessments, in order to allow reproducibility and comparison.

The RoB tool has been used extensively in Cochrane reviews, although information on its inter-rater reliability is rather limited. To date, five studies [9], [10], [12], [13], [51] have investigated the inter-rater reliability of the RoB tool. One of them [51] did not use the generic RoB tool but a 12-item modified version of the tool developed by the Cochrane Back Review Group. The four other studies were conducted by the same group of researchers. When our inter-rater reliability results for the RoB tool were compared to those of other studies, most kappa values for the RoB domains were similar, except for allocation concealment, incomplete data, selective reporting, and overall rating of the RoB tool. For these, our kappa values were much higher than those reported in previous studies (Table 4). We suggest a variety of reasons for these differences. Although we used the Cochrane Handbook guidelines for RoB assessments, we pre-defined specific decision rules to assess the individual domains of the tool. For example, the item of allocation concealment was scored low only when studies used central allocation (including telephone, web-based and centre controlled randomization) or when envelopes with three adequate safeguards were used (sequentially numbered, opaque, and sealed envelopes). If all three safeguards were not described, the item was scored as “unclear”. In addition to the Cochrane guidelines, the RoB item of incomplete data was rated “low” when an intention to treat analysis was conducted and the drop-out rate was less than or equal to 20%. When the drop-out rate was higher than 20%, the item was scored as “high” risk of bias since there is evidence that drop-out rates higher than 20% are likely to increase bias in treatment estimates [52], [53].

Table 4. Inter-rater reliability (kappa values) of the RoB tool reported in the scientific literature.

https://doi.org/10.1371/journal.pone.0096920.t004

Similarly, we created a precise decision rule for the item of selective reporting, and identified a clear cut-off to determine low, unclear and high RoB categories. These decision rules are likely to have increased the inter-rater reliability between the blinded external reviewers in the RoB assessments for these domains.
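As an illustration of how explicit such decision rules can be made, the sketch below encodes the allocation concealment and incomplete outcome data rules quoted above. It covers only the branches stated in this article; the full rules are in Appendix S2, which is not reproduced here. Function and parameter names are hypothetical, and cases the text does not describe are flagged in the comments rather than guessed.

```python
def rate_allocation_concealment(central_allocation=False,
                                envelopes_sequentially_numbered=False,
                                envelopes_opaque=False,
                                envelopes_sealed=False):
    """Return 'low' or 'unclear' following the two branches quoted in the text."""
    all_three_safeguards = (envelopes_sequentially_numbered
                            and envelopes_opaque
                            and envelopes_sealed)
    if central_allocation or all_three_safeguards:
        return "low"
    # Criteria for a 'high' rating are not given in the passage,
    # so this sketch stops at 'unclear'.
    return "unclear"


def rate_incomplete_outcome_data(dropout_rate, intention_to_treat):
    """Rate the incomplete outcome data item.

    dropout_rate: proportion of participants lost, between 0 and 1.
    intention_to_treat: True if an ITT analysis was conducted.
    Returns 'high', 'low', or None when the quoted rules do not cover the case.
    """
    if not 0.0 <= dropout_rate <= 1.0:
        raise ValueError("dropout_rate must be a proportion between 0 and 1")
    if dropout_rate > 0.20:
        return "high"
    if intention_to_treat:
        return "low"
    # Drop-out of 20% or less without ITT: not described in the text.
    return None


if __name__ == "__main__":
    print(rate_allocation_concealment(envelopes_opaque=True, envelopes_sealed=True))
    print(rate_incomplete_outcome_data(0.15, intention_to_treat=True))
```

Writing the rules down in this form forces every borderline case to be named in advance, which is one plausible reason pre-defined rules improve agreement within a panel.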

Final ratings of the RoB tool based on the Cochrane reviewers' assessments indicated that almost 92% of trials included in the reviews had either high or unclear RoB, a proportion that is similar to those identified in other studies [10], [13]. As expressed by other researchers [13], the large number of trials classified as high or unclear RoB casts doubt on the power of the RoB tool to discriminate between studies with different levels of risk of bias, and hence to explain variability of treatment effects across studies and accurately inform practice. Thus, it is important to highlight that the overall assessment of the RoB may not be useful to determine the quality of individual trials. We used the guidelines established by the Cochrane handbook to determine overall RoB. However, these criteria can be considered arbitrary and may not be appropriate. In addition, the items included in the RoB tool may be insufficient to represent the construct of interest: “risk of bias”. Other items not considered in this tool may need to be added to provide a more comprehensive evaluation. Some scales commonly used to evaluate the quality of research (e.g. the Jadad scale) use only a limited number of items (3) and have been criticized for their inability to distinguish between good and bad quality studies [54]. This may be a similar problem for the RoB tool, which may not include all important factors to evaluate the full construct of “risk of bias”. Empirical evidence supports the evaluation of randomization, allocation concealment and blinding of clinical trials, all of which are included in the RoB tool. While there is insufficient evidence to support other domains being included, other methodological factors could be important for evaluating RoB and could be considered for inclusion in the RoB tool after careful empirical testing.

It is recommended that RoB assessments be made by multidisciplinary groups of reviewers, in which epidemiologists, methodologists, and clinicians with expertise in the content area of the review participate in the assessments. Our panel of reviewers had different levels of expertise, with two reviewers having at least 10 years of experience in performing quality assessments and two with expertise in physical therapy. This might explain in part our higher levels of reliability compared to other studies.

When junior researchers are involved in RoB assessments, it is crucial that training in the concepts and guidelines for assessing study bias is provided before the review starts [4]. Training should be intensive and monitored at each stage of the review. Previous studies have trained reviewers using an average of 5 trials per study. In contrast, we used 10 studies for training purposes and held regular meetings to discuss bias ratings of common papers. These factors may have helped to obtain acceptable levels of reliability within the external reviewer panel for most of the domains of the RoB tool.
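As a rough illustration of how agreement between two sets of ratings can be quantified during training, the sketch below computes an unweighted Cohen's kappa for one RoB domain. It is an assumption-laden example rather than the procedure used in this study: the function name `cohen_kappa` and the ratings are hypothetical, and the unweighted form is used here only for simplicity (a weighted kappa, as in [19], would additionally give partial credit for near agreement).

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Unweighted Cohen's kappa for two raters judging the same trials."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must rate the same, non-empty set of trials")

    n = len(rater_a)

    # Observed agreement: proportion of trials given identical ratings.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal proportions,
    # summed over the rating categories.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category for every trial;
        # kappa is undefined, so this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of one domain for eight training trials.
reviewer_1 = ["low", "low", "high", "unclear", "low", "high", "unclear", "low"]
reviewer_2 = ["low", "unclear", "high", "unclear", "low", "low", "unclear", "low"]
print(round(cohen_kappa(reviewer_1, reviewer_2), 2))
```

Values close to 1 indicate agreement well beyond chance, whereas values close to 0 indicate agreement no better than chance; the resulting coefficient can then be read against a benchmark scale such as that of Landis and Koch [20].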

Limitations

This study restricted the analysis to a limited number of Cochrane systematic reviews in PT. Therefore, the results might not reflect the inter-rater agreement of the RoB tool when applied to Cochrane reviews conducted in other areas of research, or to systematic reviews conducted outside the Cochrane Collaboration. Future studies should further assess potential differences in the inter-rater reliability of the RoB tool by comparing bias ratings of Cochrane reviews and non-Cochrane reviews versus those of independent panels of reviewers.

Future directions

The reliability of RoB assessments applied to clinical trials in systematic reviews needs to be improved. A promising step in that direction would be the creation of an international database (a bias assessment bank) to which a qualified panel of experts, with extensive experience in trial methodology and critical appraisal of the scientific literature, contributes independent RoB assessments of RCTs in a variety of clinical areas. Researchers conducting systematic reviews and meta-epidemiological studies could then use this data bank as a gold standard resource for RoB assessments. If an RoB assessment bank is created, it is imperative that contributors have the proper qualifications and experience so that the resulting RoB assessments are less biased.

Conclusions

To the best of our knowledge, this study is the first to demonstrate, by contrasting results from Cochrane reviewers with those of an independent external panel of reviewers, that risk of bias assessments of RCTs using the RoB tool are not consistent across different research groups. Poor agreement was demonstrated not only at the trial level but also at the meta-analysis level. These results have important implications for decision making, since different recommendations can be reached depending on the group analyzing the evidence. Improved guidelines for applying the RoB tool and revisions to the tool for different health areas are needed. In addition, empirical evidence supporting additional items for the RoB tool needs to be developed. A call is made for the creation of a bank of RoB assessments of trial data, maintained by methodological and clinical experts, that can be used as a reliable gold standard resource for RoB assessments.

Supporting Information

Appendix S1.

Search strategy to identify systematic reviews in physical therapy from the Cochrane Library of Systematic Reviews.

https://doi.org/10.1371/journal.pone.0096920.s001

(DOC)

Appendix S2.

Guidelines for evaluating the Risk of Bias in PT trials.

https://doi.org/10.1371/journal.pone.0096920.s002

(DOC)

Appendix S3.

Frequency of responses between Cochrane reviewers and the external panel of reviewers by RoB Domain.

https://doi.org/10.1371/journal.pone.0096920.s003

(DOC)

Acknowledgments

Dr. Susan Armijo-Olivo is supported by the Canadian Institutes of Health Research (CIHR) through a full-time Banting fellowship, Alberta Innovates - Health Solutions through an incentive award, the STIHR Training Program from Knowledge Translation (KT) Canada, and the University of Alberta.

Dr. Greta Cummings has been funded both provincially with a Population Health Investigator award from the Alberta Heritage Foundation for Medical Research (2006–2013), and nationally with a New investigator award from the Canadian Institutes of Health Research (2006–2011). Currently, she holds a Centennial Professorship at the University of Alberta (2013–2020).

Dr. Fuentes is supported by the Government of Chile, the University of Alberta through a Dissertation fellowship, and the Catholic University of Maule.

Dr. Humam Saltaji is supported through a Clinician Fellowship Award by Alberta Innovates - Health Solutions (AIHS), the Honorary Izaak Walton Killam Memorial Scholarship by the University of Alberta and the WCHRI Award by the Women and Children's Health Research Institute (WCHRI).

In addition, the authors of this study thank the Alberta Research Center for Health Evidence (ARCHE) at the University of Alberta and all research assistants who helped with data collection.

Author Contributions

Conceived and designed the experiments: SA-O. Contributed reagents/materials/analysis tools: GGC ME BRdC. Wrote the paper: SA-O. Contributed to data collection: SA-O JF MO CH HS. Contributed to data analysis and interpretation: SA-O JF MO CH HS. Critically revised the manuscript and provided final approval of the version to be published: SA-O MO BRdC ME HS JF CH GGC. Provided feedback on the concept and research design and participated in interpretation of data: GGC ME BRdC.

References

  1. Armijo-Olivo S, Macedo LG, Gadotti IC, Fuentes J, Stanton T, et al. (2008) Scales to Assess the Quality of Randomized Controlled Trials: A Systematic Review. Physical Therapy 88: 156–175.
  2. Moher D, Jadad AR, Nichol G, Penman M, Tugwell P, et al. (1995) Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Controlled Clinical Trials 16: 62–73.
  3. Higgins J, Altman DG (2008) Chapter 8: Assessing risk of bias in included studies. In: Higgins J, Green S, editors. Cochrane Handbook for Systematic Reviews of Interventions Version 5.0.0 [updated February 2008]. Available from www.cochrane-handbook.org.
  4. Lundh A, Gotzsche PC (2008) Recommendations by Cochrane Review Groups for assessment of the risk of bias in studies. BMC Medical Research Methodology 8.
  5. Schulz KF, Chalmers I, Hayes RJ, Altman DG (1995) Empirical evidence of bias: Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. Journal of the American Medical Association 273: 408–412.
  6. Emerson JD, Burdick E, Hoaglin DC, Mosteller F, Chalmers TC (1990) An empirical study of the possible relation of treatment differences to quality scores in controlled randomized clinical trials. Controlled Clinical Trials 11: 339–352.
  7. Moher D, Cook DJ, Jadad AR, Tugwell P, Moher M, et al. (1999) Assessing the quality of reports of randomised trials: implications for the conduct of meta-analyses. Health Technology Assessment (Winchester, England) 3: i–iv.
  8. Savovic J, Jones HE, Altman DG, Harris RJ, Juni P, et al. (2012) Influence of reported study design characteristics on intervention effect estimates from randomized, controlled trials. Annals of Internal Medicine 157: 429–438.
  9. Hartling L, Ospina M, Liang Y, Dryden DM, Hooton N, et al. (2009) Risk of bias versus quality assessment of randomised controlled trials: Cross sectional study. BMJ 339: 1017.
  10. Armijo-Olivo S, Stiles CR, Hagen NA, Biondo PD, Cummings GG (2012) Assessment of study quality for systematic reviews: A comparison of the Cochrane Collaboration Risk of Bias Tool and the Effective Public Health Practice Project Quality Assessment Tool: Methodological research. Journal of Evaluation in Clinical Practice 18: 12–18.
  11. Boutron I, Ravaud P (2012) Classification systems to improve assessment of risk of bias. Journal of Clinical Epidemiology 65: 236–238.
  12. Hartling L, Bond K, Vandermeer B, Seida J, Dryden D, et al. (2011) Applying the Risk of Bias tool in a systematic review of combination long-acting beta-agonists and inhaled corticosteroids for persistent asthma. PLoS Medicine 6.
  13. Hartling L, Hamm MP, Milne A, Vandermeer B, Santaguida PL, et al. (2012) Testing the Risk of Bias tool showed low reliability between individual reviewers and across consensus assessments of reviewer pairs. Journal of Clinical Epidemiology 66: 973–981.
  14. World Confederation for Physical Therapy (2011) Position statement: standards of physical therapy practice. World Confederation for Physical Therapy. 1–45 p.
  15. Higgins JPT, Altman DG, Goetzsche PC, Juni P, Moher D, et al. (2011) The Cochrane Collaboration's tool for assessing risk of bias in randomised trials. BMJ 343.
  16. Furlan AD, Pennick V, Bombardier C, Van Tulder M (2009) 2009 Updated method guidelines for systematic reviews in the Cochrane Back Review Group. Spine 34: 1929–1941.
  17. Higgins J, Altman D (2008) Chapter 8: Assessing risk of bias in included studies. In: Higgins J, Green S, editors. Cochrane Handbook for Systematic Reviews of Interventions version 5.0. Chichester, UK: John Wiley & Sons, Ltd.
  18. Schuck P (2004) Assessing Reproducibility for Interval Data in Health-Related Quality of Life Questionnaires: Which Coefficient Should Be Used? Quality of Life Research 13: 571–586.
  19. Cohen J (1968) Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin 70: 213–220.
  20. Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33: 159–174.
  21. Byrt T (1996) How good is that agreement? Epidemiology (Cambridge, Mass) 7: 561.
  22. Fransen M, McConnell S, Hernandez MG, Reichenbach S (2009) Exercise for osteoarthritis of the hip. Cochrane Database of Systematic Reviews 2009, Issue 3. John Wiley & Sons, Ltd, Chichester, UK. DOI: 101002/14651858CD007.
  23. Handoll-Helen HG, Cameron ID, Mak-Jenson CS, Finnegan TP (2009) Multidisciplinary rehabilitation for older people with hip fractures. Cochrane Database of Systematic Reviews 2009, Issue 4. John Wiley & Sons, Ltd, Chichester, UK. DOI: 101002.
  24. Harvey LA, Brosseau L, Herbert RD (2010) Continuous passive motion following total knee arthroplasty in people with arthritis. Cochrane Database of Systematic Reviews 2010, Issue 3. John Wiley & Sons, Ltd, Chichester, UK. DOI: 101002/14.
  25. Katalinic OM, Harvey LA, Herbert RD, Moseley AM, Lannin NA, et al. (2010) Stretch for the treatment and prevention of contractures. Cochrane Database of Systematic Reviews 2010, Issue 9. John Wiley & Sons, Ltd, Chichester, UK. DOI: 101002/14 2010.
  26. Manheimer E, Cheng K, Linde K, Lao L, Yoo J, et al. (2010) Acupuncture for peripheral joint osteoarthritis. Cochrane Database of Systematic Reviews 2010, Issue 1. John Wiley & Sons, Ltd, Chichester, UK.
  27. Ostelo-Raymond WJG, Costa-Leonardo OP, Maher CG, de-Vet-Henrica CW, van-Tulder MW (2008) Rehabilitation after lumbar disc surgery. Cochrane Database of Systematic Reviews 2008, Issue 4. John Wiley & Sons, Ltd, Chichester, UK.
  28. Rutjes-Anne WS, Nüesch E, Sterchi R, Jüni P (2010) Therapeutic ultrasound for osteoarthritis of the knee or hip. Cochrane Database of Systematic Reviews 2010, Issue 1. John Wiley & Sons, Ltd, Chichester, UK. DOI: 101002/14651858CD00.
  29. Rutjes-Anne WS, Nüesch E, Sterchi R, Kalichman L, Hendriks E, et al. (2009) Transcutaneous electrostimulation for osteoarthritis of the knee. Cochrane Database of Systematic Reviews 2009, Issue 4. John Wiley & Sons, Ltd, Chichester. CD002823.
  30. Schaafsma F, Schonstein E, Whelan KM, Ulvestad E, Kenny DT, et al. (2010) Physical conditioning programs for improving work outcomes in workers with back pain. Cochrane Database of Systematic Reviews 2010, Issue 1. John Wiley & Sons, Ltd, Chichester, UK.
  31. Davies P, Taylor F, Beswick A, Wise F, Moxham T, et al. (2010) Promoting patient uptake and adherence in cardiac rehabilitation. Cochrane Database of Systematic Reviews 2010, Issue 7. John Wiley & Sons, Ltd, Chichester, UK.
  32. Effing T, Monninkhof-Evelyn EM, Valk-Paul PDLP, Zielhuis-Gerhard GA, Walters EH, et al. (2007) Self-management education for patients with chronic obstructive pulmonary disease. Cochrane Database of Systematic Reviews 2007, Issue 4. John Wiley & Sons, Ltd, Chichester, UK.
  33. Puhan MA, Gimeno SE, Scharplatz M, Troosters T, Walters EH, et al. (2009) Pulmonary rehabilitation following exacerbations of chronic obstructive pulmonary disease. Cochrane Database of Systematic Reviews 2009, Issue 1. John Wiley & Sons, Ltd, Chichester, UK.
  34. Taylor RS, Dalal H, Jolly K, Moxham T, Zawada A (2010) Home-based versus centre-based cardiac rehabilitation. Cochrane Database of Systematic Reviews 2010, Issue 1. John Wiley & Sons, Ltd, Chichester, UK. DOI: 101002/14651858CD.
  35. Sirtori V, Corbetta D, Moja L, Gatti R (2009) Constraint-induced movement therapy for upper extremities in stroke patients. Cochrane Database of Systematic Reviews 2009, Issue 4. John Wiley & Sons, Ltd, Chichester, UK.
  36. States RA, Pappas E, Salem Y (2009) Overground physical therapy gait training for chronic stroke patients with mobility deficits. Cochrane Database of Systematic Reviews 2009, Issue 3. John Wiley & Sons, Ltd, Chichester, UK.
  37. Kramer MS, McDonald SW (2006) Aerobic exercise for women during pregnancy. Cochrane Database of Systematic Reviews 2006, Issue 3. John Wiley & Sons, Ltd, Chichester, UK. DOI: 101002/14651858CD000180pub2.
  38. Orozco LJ, Buchleitner AM, Gimenez PG, Figuls M, Richter B, et al. (2008) Exercise or exercise and diet for preventing type 2 diabetes mellitus. Cochrane Database of Systematic Reviews 2008, Issue 3. John Wiley & Sons, Ltd, Chichester, UK.
  39. Egger M, Juni P, Bartlett C, Holenstein F, Sterne J (2003) How important are comprehensive literature searches and the assessment of trial quality in systematic reviews? Empirical study. Health Technology Assessment 7: 1–76.
  40. Wood L, Egger M, Gluud LL, Schulz KF, Juni P, et al. (2008) Empirical evidence of bias in treatment effect estimates in controlled trials with different interventions and outcomes: Meta-epidemiological study. BMJ 336: 601–605.
  41. Pildal J, Hrobjartsson A, Jorgensen KJ, Hilden J, Altman DG, et al. (2007) Impact of allocation concealment on conclusions drawn from meta-analyses of randomized trials. International Journal of Epidemiology 36: 847–857.
  42. Moher D, Pham B, Jones A, Cook DJ, Jadad AR, et al. (1998) Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 352: 609–613.
  43. Villari P, Manzoli L, Boccia A (2004) Methodological quality of studies and patient age as major sources of variation in efficacy estimates of influenza vaccination in healthy adults: A meta-analysis. Vaccine 22: 3475–3486.
  44. Manzoli L, Schioppa F, Boccia A, Villari P (2007) The efficacy of influenza vaccine for healthy children: A meta-analysis evaluating potential sources of variation in efficacy estimates including study quality. Pediatric Infectious Disease Journal 26: 97–106.
  45. Nuesch E, Trelle S, Reichenbach S, Rutjes AWS, Burgi E, et al. (2009) The effects of excluding patients from the analysis in randomised controlled trials: Meta-epidemiological study. BMJ 339: 679–683.
  46. Abraha I, Montedori A (2010) Modified intention to treat reporting in randomised controlled trials: Systematic review. BMJ (Online) 341: 33.
  47. Armijo-Olivo S, Warren S, Magee D (2009) Intention to treat analysis, compliance, drop-outs and how to deal with missing data in clinical research: a review. Physical Therapy Reviews 14: 36–49.
  48. Hewitt CE, Kumaravel B, Dumville JC, Torgerson DJ (2010) Assessing the impact of attrition in randomized controlled trials. Journal of Clinical Epidemiology 63: 1264–1270.
  49. Kjaergard LL, Als-Nielsen B (2002) Association between competing interests and authors' conclusions: Epidemiological study of randomised clinical trials published in the BMJ. British Medical Journal 325: 249–252.
  50. Trowman R, Dumville JC, Torgerson DJ, Cranny G (2007) The impact of trial baseline imbalances should be considered in systematic reviews: a methodological case study. Journal of Clinical Epidemiology 60: 1229–1233.
  51. Graham N, Haines T, Goldsmith CH, Gross A, Burnie S, et al. (2011) Reliability of three assessment tools used to evaluate randomized controlled trials for treatment of neck pain. Spine.
  52. Unnebrink K, Windeler J (2001) Intention-to-treat: Methods for dealing with missing values in clinical trials of progressively deteriorating diseases. Statistics in Medicine 20: 3931–3946.
  53. Wright CC, Sim J (2003) Intention-to-treat approach to data from randomized controlled trials: A sensitivity analysis. Journal of Clinical Epidemiology 56: 833–842.
  54. Herbison P, Hay-Smith J, Gillespie WJ (2006) Adjustment of meta-analyses on the basis of quality scores should be abandoned. Journal of Clinical Epidemiology 59: 1249.e1241–1249.e1211.