Agreement between the Cochrane risk of bias tool and Physiotherapy Evidence Database (PEDro) scale: A meta-epidemiological study of randomized controlled trials of physical therapy interventions

Background: The Cochrane risk of bias (CROB) tool and Physiotherapy Evidence Database (PEDro) scale are used to evaluate risk of bias of randomized controlled trials. We assessed the level of agreement between the instruments.
Methods: We searched the Cochrane Library to identify trials included in systematic reviews evaluating physical therapy interventions. For trials that met our inclusion criteria (primary reference in Cochrane review, review used CROB (2008 version), indexed in PEDro), CROB items were extracted from the reviews and PEDro items and total score were downloaded from PEDro. Kappa statistics were used to determine the agreement between CROB and PEDro scale items that evaluate similar constructs (e.g., randomization). The total PEDro score was compared to the CROB summary score (% of items met) using an Intraclass Correlation Coefficient. Sensitivity analyses explored the impact of the CROB “unclear” category and variants of CROB blinding items. Kappa statistics were used to determine agreement between different thresholds for “acceptable” risk of bias between CROB and PEDro scale summary scores.
Results: We included 1442 trials from 108 Cochrane reviews. Agreement was “moderate” for three of the six CROB and PEDro scale items that evaluate similar constructs (allocation concealment, participant blinding, assessor blinding; Kappa = 0.479–0.582). Agreement between the summary scores was “poor” (Intraclass Correlation Coefficient = 0.285). Agreement was highest when the CROB “unclear” category was collapsed with “high” and when participant, personnel and assessor blinding were evaluated separately in CROB. Agreement for different thresholds for “acceptable” risk of bias between CROB and PEDro summary scores was, at best, “fair”.
Conclusion: There was moderate agreement for half of the PEDro and CROB items that evaluate similar constructs. Interpretation of the CROB “unclear” category and variants of the CROB blinding items substantially influenced agreement. Either instrument can be used to quantify risk of bias, but they can’t be used interchangeably.


Introduction
Evidence-based practice is essential for health providers because it guides the adoption of effective interventions while eliminating those that are less effective or harmful [1]. Randomized controlled trials are recognized as the best study design to examine the effects of an intervention [1,2]. Critical appraisal of trial risk of bias (methodological quality) is used to confirm that the findings and conclusions are valid, and is one of the five steps of the evidence-based practice process. Two commonly employed instruments used to assess the risk of bias of trials of physical therapy interventions are the Cochrane risk of bias (CROB) tool [3] and the Physiotherapy Evidence Database (PEDro) scale [4].
The Cochrane Collaboration started using the CROB tool in 2008 to assess and report risk of bias in trials included in Cochrane reviews [3]. The CROB tool evaluates potential bias for seven items across six domains: selection bias (random sequence generation; allocation concealment), performance bias (blinding of participants and personnel), detection bias (blinding of outcome assessment), attrition bias (incomplete outcome data), reporting bias (selective reporting), and other sources of bias. Each domain (item) is rated as "high," "unclear" or "low" risk of bias, and the ratings are reported separately (a summary score is not calculated). All items evaluate risk of bias. Clinimetric evaluation of the CROB tool has focused on reliability, suggesting that inter-rater agreement for individual items varies from "poor" (Kappa = -0.04 for 'other bias') to "substantial" (Kappa = 0.79 for 'sequence generation') [5-7]. Inter-rater agreement for inexperienced raters with minimal training (Kappa = 0.00 to 0.38) can, however, be improved with standardized training (Kappa = 0.93 to 1.00) [8].
The PEDro scale was developed in 1999 to evaluate the risk of bias and completeness of statistical reporting of trial reports indexed in the PEDro evidence resource [4] and is now commonly used in systematic reviews [9]. This scale evaluates 11 items: inclusion criteria and source, random allocation, concealed allocation, similarity at baseline, subject blinding, therapist blinding, assessor blinding, completeness of follow up, intention-to-treat analysis, between-group statistical comparisons, and point measures and variability. Each item is rated as "yes" or "no," and the total PEDro score is the number of items met (excluding the inclusion criteria and source item). Eight items evaluate risk of bias (random allocation, concealed allocation, similarity at baseline, subject blinding, therapist blinding, assessor blinding, completeness of follow up, intention-to-treat analysis) and two items evaluate the completeness of statistical reporting (between-group statistical comparisons, and point measures and variability). Evaluation of the clinimetric properties of the PEDro scale reveals acceptable validity and reliability. There is evidence for convergent and construct validity for eight out of 11 individual items [10] and acceptably high reliability for the total PEDro score (Intraclass Correlation Coefficient = 0.56 to 0.91) [4,11-13] and individual items (Kappa = 0.45 to 1.00) [4,12,14,15]. Rasch analysis suggests that the PEDro scale can be used as a continuous scale [16].
There is recent evidence of convergent validity between the PEDro scale and the Cochrane Back and Neck Group risk of bias tool (which includes the seven items in the CROB tool plus intention-to-treat analysis, group similarity at baseline, co-interventions, compliance, and timing of outcome assessments) for pharmacological trials (Intraclass Correlation Coefficient = 0.83, 95% confidence interval (CI) 0.76 to 0.88) [17]. However, no research group has studied the convergent validity between the PEDro scale and CROB tool in trials evaluating the effects of physical therapy interventions. This would be interesting as trials evaluating physical therapy interventions, like exercise, do not have the same characteristics as pharmacological trials. Blinding participants and personnel in trials of complex physical therapy interventions is difficult and, usually, not possible [6,12]. As both the PEDro scale and CROB tool are commonly used in systematic reviews, evaluation of the convergent validity between these instruments would assist clinicians to understand risk of bias across the review articles they read (i.e., do the tools have a similar interpretation and can they be used interchangeably) and possibly provide some guidance for systematic reviewers when they are selecting a tool to evaluate risk of bias in their reviews.
Although both the PEDro scale and CROB tool have different approaches to assessing risk of bias, they have six items in common (random allocation, concealed allocation, blinding of participants, personnel and assessors, and incomplete outcome data). To date, only one study has made a direct comparison between the PEDro scale and CROB tool in trials of physical therapy interventions [18]. This study found poor agreement between the two instruments [18]. These results do, however, need to be interpreted with caution; there was a small sample size (n = 353), the analysis only considered three CROB items (random sequence generation, allocation concealment, assessor blinding), the CROB tool was assumed to be the gold standard, and the cut-point (i.e., "low" risk of bias for all three CROB items) used for "adequate" risk of bias was not explored. This highlights the need to evaluate agreement between the PEDro scale and CROB tool for the items that evaluate similar constructs. While the Cochrane Methods and Statistical Methods Groups do not recommend the use of summary scores [3], the judicious use of a CROB summary score could facilitate the comparison of the two instruments by allowing agreement to be calculated for overall scores.
The primary objective of this study was to determine the convergent validity (level of agreement) between individual items from the PEDro scale and CROB tool that evaluate similar constructs and for summary scores. Convergent validity, a subtype of construct validity (defined as "studied when the tester has no definite criterion measure of the quality with which he is concerned, and must use indirect measures" [19]), is the degree to which two measures of a construct that theoretically should be related are in fact related [20]. Sensitivity analyses were used to explore the impact of the CROB "unclear" category and variants of CROB blinding items on agreement. The secondary objective was to determine the level of agreement between different thresholds for "acceptable" risk of bias between the summary scores for the CROB tool and PEDro scale. Between-review agreement (inter-rater reliability) for the CROB tool was also evaluated.

Study design
This meta-epidemiological study was conducted using two online public health research databases: Cochrane Library (www.cochranelibrary.com) and PEDro (www.pedro.org.au). The study type was identified as a meta-epidemiological study because we used a systematic review to provide data for methodological analysis and the unit of analysis is at the study level [21]. Ethical approval was obtained from the University of Ottawa ethics board (H12-13-03B). The study did not have a protocol because the PROSPERO registry for systematic reviews only accepts protocols where there is a health-related outcome [22].

Literature search
Experienced librarians working for the Cochrane Collaboration searched the Cochrane Library to identify systematic reviews of randomized controlled trials evaluating physical therapy interventions that used the CROB tool and were published in the period January 2008 to October 2013. An author (AMM) repeated the search for November 2013 to December 2015; all previously included systematic reviews were updated to the most current version in this second search. The following key words were used: "physical therapy," "physiotherapy," "rehabilitation," "exercise," "electrophysical agents," "acupuncture," "massage," "transcutaneous electrical stimulation (TENS)," "interferential current," "ultrasound," "stretching," "chest therapy," "pulmonary rehabilitation," "manipulative therapy," "mobilization."

Trial selection
The eligibility of retrieved Cochrane systematic reviews was assessed by two trained evaluators (PR, AMM) who independently screened the titles and abstracts and, where necessary, the full-text using pre-determined criteria. The criteria were: (1) examined the efficacy of a physical therapy intervention (i.e., physical rehabilitation, excluding surgical or pharmacological interventions); (2) not withdrawn from the Cochrane Library; and (3) used the 2008 version of the CROB tool [3] to evaluate included studies. Any disagreements were resolved by discussion between the two evaluators that led to consensus.
All references for the included studies in the eligible Cochrane reviews were extracted (citation, digital object identification number and PubMed identification number). These data were compiled into an Excel spreadsheet by one data extractor (AMM) and verified by a second extractor (PR). When there was more than one reference for an included study, only the primary reference was retained. In most cases, the review authors had tagged the primary reference using an asterisk. If the primary reference was not tagged, the data extractors selected the most important reference. Again, any disagreements were resolved by discussion between the two evaluators that led to consensus.
Trials that were not indexed on PEDro because they did not fulfil the PEDro inclusion criteria were excluded. The PEDro inclusion criteria are: involve comparison of at least two interventions, at least one intervention is part of physiotherapy practice, intervention applied to subjects who would receive the intervention as part of physiotherapy practice, random or intended-to-be-random allocation to groups, full paper in a peer-reviewed journal.
When trials were in more than one of the included Cochrane reviews, a trial from one review was randomly selected to be in the data set. The duplicate trials were used to evaluate between-review agreement (inter-rater reliability) for the CROB ratings.

Data extraction and processing
CROB ratings ("low," "unclear," "high") for random sequence generation, allocation concealment, blinding of participants, blinding of personnel, blinding of outcome assessment, incomplete outcome data, selective reporting, and other sources of bias for each included trial were extracted from the Cochrane reviews. A number of reporting methods were used for the blinding items in the included reviews: participants and personnel combined, participants only, personnel only, outcome assessment only, outcome assessment for subjective outcomes, outcome assessment for objective outcomes, participants and personnel and outcome assessment combined. Each method was extracted into a separate column in the Excel spreadsheet. Two extractors independently extracted the CROB data for the included trials. Any disagreements were resolved by discussion between the two extractors that led to a consensus.
The CROB summary score was calculated as the number of items with "low" risk of bias divided by the number of core items evaluated in the review, and was expressed as a percentage.
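The calculation described above can be sketched in a few lines of Python. This is an illustration only, not the authors' spreadsheet formula: the function name `crob_summary_score` and the use of `None` to mark an item that a review did not evaluate are assumptions made for the example.

```python
from typing import Optional, Sequence


def crob_summary_score(ratings: Sequence[Optional[str]]) -> float:
    """Percentage of evaluated CROB items rated "low" risk of bias.

    `ratings` holds one entry per core item: "low", "unclear", "high",
    or None when the review did not evaluate that item (assumed encoding).
    """
    evaluated = [r for r in ratings if r is not None]
    if not evaluated:
        raise ValueError("no CROB items were evaluated for this trial")
    n_low = sum(1 for r in evaluated if r == "low")
    return 100.0 * n_low / len(evaluated)


# Hypothetical trial: six of seven items evaluated, three rated "low".
example = ["low", "unclear", "high", "low", None, "unclear", "low"]
print(crob_summary_score(example))  # 3 of 6 evaluated items -> 50.0
```

Dividing by the number of items actually evaluated, rather than by a fixed seven, matters because the included reviews did not all assess the same set of core items.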
PEDro scale scores (11 individual items and total PEDro score), citation, PubMed identification number, digital object identification number, and PEDro identification number for the primary reference for each included trial were downloaded from the PEDro evidence resource (www.pedro.org.au) and added to the Excel spreadsheet. The digital object identification number, PubMed identification number and citation were used to verify that the PEDro citations matched the citations for included studies extracted from the Cochrane reviews.
PEDro items scored as "yes" (i.e., a positive rating) were recoded as "1" and items scored as "no" (i.e., a negative rating) were recoded as "0". The total PEDro score is the number of items met, excluding the inclusion criteria and source item, and is expressed as a score ranging from 0 to 10.
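As a minimal sketch of this recoding, assuming the item ratings for a trial are held in a dictionary keyed by item name (the data layout and the names `PEDRO_ITEMS` and `total_pedro_score` are hypothetical), the total score could be computed as follows.

```python
PEDRO_ITEMS = [
    "inclusion criteria and source",  # rated, but not counted in the total
    "random allocation",
    "concealed allocation",
    "similarity at baseline",
    "subject blinding",
    "therapist blinding",
    "assessor blinding",
    "completeness of follow up",
    "intention-to-treat analysis",
    "between-group statistical comparisons",
    "point measures and variability",
]
EXCLUDED_ITEM = "inclusion criteria and source"


def recode(rating: str) -> int:
    """Recode a PEDro rating: "yes" becomes 1 and "no" becomes 0."""
    return {"yes": 1, "no": 0}[rating.lower()]


def total_pedro_score(item_ratings: dict) -> int:
    """Number of items met out of 10, excluding the first item."""
    missing = [item for item in PEDRO_ITEMS if item not in item_ratings]
    if missing:
        raise ValueError(f"missing PEDro items: {missing}")
    return sum(
        recode(item_ratings[item])
        for item in PEDRO_ITEMS
        if item != EXCLUDED_ITEM
    )


# Hypothetical trial that meets six of the ten scored items.
met = {
    "inclusion criteria and source",
    "random allocation",
    "similarity at baseline",
    "assessor blinding",
    "completeness of follow up",
    "between-group statistical comparisons",
    "point measures and variability",
}
trial = {item: ("yes" if item in met else "no") for item in PEDRO_ITEMS}
print(total_pedro_score(trial))  # 6
```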

Statistical analysis
The number and percentage of trials rating "yes" for each PEDro scale item and "low," "unclear" and "high" for each CROB item were tabulated. The mean and standard deviation (SD) for the total PEDro score and CROB summary score were calculated.
Agreement between the CROB tool and PEDro scale items. Under the supervision of a senior biostatistician (GAW), Kappa statistics, 95% CI and percent exact agreement were calculated to assess the level of agreement (or convergent validity) between individual items from the PEDro scale and CROB tool that evaluate similar constructs [23]. The items were: PEDro random allocation vs. CROB random sequence generation; PEDro concealed allocation vs. CROB allocation concealment; PEDro subject blinding vs. CROB blinding of participants; PEDro therapist blinding vs. CROB blinding of personnel; PEDro assessor blinding vs. CROB blinding of outcome assessment; PEDro completeness of follow up vs. CROB incomplete outcome data.
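For readers who want to reproduce this type of item-level comparison, the sketch below computes Cohen's Kappa, an approximate 95% CI and percent exact agreement for two dichotomous (0/1) ratings of the same trials. The statistical software and CI method used in the study are not stated here, so the simple large-sample standard error in the code is an assumption, and the ratings are invented.

```python
import math
from typing import Sequence, Tuple


def cohen_kappa(
    a: Sequence[int], b: Sequence[int]
) -> Tuple[float, Tuple[float, float], float]:
    """Return (kappa, approximate 95% CI, percent exact agreement)."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    agree = sum(1 for x, y in zip(a, b) if x == y)
    p_obs = agree / n
    # Chance agreement from the marginal proportions of "1" ratings.
    pa1 = sum(a) / n
    pb1 = sum(b) / n
    p_exp = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if p_exp == 1.0:
        raise ValueError("kappa is not calculable: no variation in ratings")
    kappa = (p_obs - p_exp) / (1 - p_exp)
    # Simple large-sample standard error (an assumed approximation).
    se = math.sqrt(p_obs * (1 - p_obs) / (n * (1 - p_exp) ** 2))
    ci = (kappa - 1.96 * se, kappa + 1.96 * se)
    return kappa, ci, 100.0 * agree / n


# Invented ratings for ten trials (1 = criterion met / "low" risk of bias).
pedro_item = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
crob_item = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
kappa, ci, pct = cohen_kappa(pedro_item, crob_item)
print(round(kappa, 3), pct)  # 0.583 80.0
```

Percent exact agreement and Kappa can diverge sharply when one category is rare, which is why both are reported for items such as therapist blinding.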
For the main analyses, the CROB categories were dichotomized by recoding "low" (i.e., a positive rating) as "1" and "unclear" or "high" (i.e., a negative rating) as "0." Two prespecified sensitivity analyses of the "unclear" category were also performed. The first dichotomized the CROB categories by recoding "low" or "unclear" as "1" and "high" as "0." The second omitted trials with an "unclear" CROB rating and recoded "low" as "1" and "high" as "0." A second set of sensitivity analyses, which compared different variants of the CROB blinding items, was also performed. PEDro subject blinding was compared to three groupings of variants of the CROB blinding of participants item. PEDro therapist blinding was compared to three groupings of variants of the CROB blinding of personnel item. PEDro assessor blinding was compared to six groupings of variants of the CROB blinding of outcome assessment item. For each of these blinding analyses, the CROB categories were dichotomized by recoding "low" as "1" and "unclear" or "high" as "0."
Agreement between the CROB summary score and total PEDro score. Intraclass Correlation Coefficients (type 1,1) and 95% CI were calculated to determine the level of agreement between the CROB summary score and the total PEDro score. In this analysis, the CROB summary score was calculated as the number of items with "low" risk of bias divided by the number of core items evaluated in the review, and was expressed as a percentage. An additional sensitivity analysis was used to test the impact of the "unclear" option. This sensitivity analysis computed the CROB summary score as the number of items with "low" or "unclear" risk of bias divided by the number of core items evaluated in the review.
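The three ways of handling the CROB "unclear" category in the item-level analyses can be made concrete with a short sketch. The scheme labels (`main`, `unclear_as_low`, `omit_unclear`) and function names are invented for this illustration and do not come from the study's analysis files.

```python
from typing import List, Optional, Sequence, Tuple

VALID_RATINGS = {"low", "unclear", "high"}


def dichotomize(rating: str, scheme: str = "main") -> Optional[int]:
    """Recode a CROB rating to 1/0, or None if the trial is omitted."""
    if rating not in VALID_RATINGS:
        raise ValueError(f"unexpected CROB rating: {rating!r}")
    if scheme == "main":  # "unclear" collapsed with "high"
        return 1 if rating == "low" else 0
    if scheme == "unclear_as_low":  # "unclear" collapsed with "low"
        return 0 if rating == "high" else 1
    if scheme == "omit_unclear":  # trials rated "unclear" are dropped
        if rating == "unclear":
            return None
        return 1 if rating == "low" else 0
    raise ValueError(f"unknown scheme: {scheme!r}")


def paired_codes(
    pedro_codes: Sequence[int], crob_ratings: Sequence[str], scheme: str
) -> List[Tuple[int, int]]:
    """Pair PEDro 0/1 codes with recoded CROB ratings, skipping omissions."""
    pairs = []
    for pedro_code, rating in zip(pedro_codes, crob_ratings):
        crob_code = dichotomize(rating, scheme)
        if crob_code is not None:
            pairs.append((pedro_code, crob_code))
    return pairs


# Invented ratings for four trials on one pair of matched items.
pedro = [1, 0, 1, 0]
crob = ["low", "unclear", "unclear", "high"]
for scheme in ("main", "unclear_as_low", "omit_unclear"):
    print(scheme, paired_codes(pedro, crob, scheme))
```

With these invented data, the same four trials give four agreements-or-disagreements under the first two schemes but only two pairs under the third, which illustrates why the choice of scheme can move the Kappa value.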
Agreement between different thresholds for the CROB summary score and total PEDro score. To examine the agreement between different thresholds for "acceptable" risk of bias on the CROB tool and PEDro scale, a Kappa statistic matrix was calculated to determine the level of agreement for 10% increments in the CROB summary score and 1-point increments for the total PEDro score. In this analysis, the CROB summary score was calculated after dichotomizing the CROB categories for individual items into "1" for "low" and "0" for "high" and "unclear" (as per the main analyses for agreement between PEDro and CROB items).
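A threshold matrix of this kind could be built as sketched below. The ranges of cut-points (total PEDro score of at least 1 to at least 10, CROB summary score of at least 10% to at least 100%) are an assumption based on the increments described above, and the scores in the example are invented.

```python
from typing import Dict, Optional, Sequence, Tuple


def kappa_binary(a: Sequence[int], b: Sequence[int]) -> Optional[float]:
    """Cohen's Kappa for two 0/1 vectors; None when it is not calculable."""
    n = len(a)
    p_obs = sum(1 for x, y in zip(a, b) if x == y) / n
    pa1, pb1 = sum(a) / n, sum(b) / n
    p_exp = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if p_exp == 1.0:  # every trial falls in the same cell
        return None
    return (p_obs - p_exp) / (1 - p_exp)


def threshold_matrix(
    pedro_totals: Sequence[int], crob_percent: Sequence[float]
) -> Dict[Tuple[int, int], Optional[float]]:
    """Kappa for every pair of 'acceptable' cut-points.

    Keys are (PEDro cut-point, CROB cut-point in %). A trial is
    'acceptable' on an instrument when its score is at or above the cut.
    """
    matrix = {}
    for pedro_cut in range(1, 11):  # total score >= 1 ... >= 10
        for crob_cut in range(10, 101, 10):  # summary >= 10% ... >= 100%
            a = [1 if s >= pedro_cut else 0 for s in pedro_totals]
            b = [1 if s >= crob_cut else 0 for s in crob_percent]
            matrix[(pedro_cut, crob_cut)] = kappa_binary(a, b)
    return matrix


# Invented scores for six trials.
pedro_totals = [3, 5, 6, 8, 4, 7]
crob_percent = [20.0, 40.0, 60.0, 80.0, 30.0, 50.0]
matrix = threshold_matrix(pedro_totals, crob_percent)
print(matrix[(6, 50)])  # 1.0 for these invented data
print(matrix[(1, 10)])  # None: all trials pass both cut-points
```

Returning `None` for degenerate cells mirrors the "not calculable" entries that arise when a cut-point classifies every trial the same way.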
Between-review agreement of CROB ratings. When trials were included in more than one Cochrane review, Kappa statistics and 95% CI plus percent exact agreement were calculated for each CROB item to quantify between-review agreement. The raw scores ("low," "unclear," and "high") were used in this analysis. Intraclass Correlation Coefficients (type 1,1) and 95% CI were used to quantify between-review agreement for the CROB summary score.
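The Intraclass Correlation Coefficient (type 1,1) comes from a one-way random-effects analysis of variance, in which ICC = (MSB - MSW) / (MSB + (k - 1) MSW), where MSB and MSW are the between-trial and within-trial mean squares and k is the number of ratings per trial. The sketch below applies that formula to invented summary scores from two reviews; the confidence interval is omitted for brevity, and the function name is hypothetical.

```python
from typing import Sequence


def icc_1_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(1,1) for n targets (rows), each with k ratings (columns)."""
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 targets with k >= 2 ratings each")
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    # Between-target mean square.
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-target mean square.
    msw = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Invented CROB summary scores (%) for four trials rated in two reviews.
paired_scores = [(40.0, 50.0), (60.0, 60.0), (20.0, 30.0), (80.0, 70.0)]
icc = icc_1_1(paired_scores)
print(round(icc, 3))  # 0.921
```

Because type 1,1 treats the two ratings of each trial as exchangeable, it suits the between-review comparison, where there is no natural ordering of "first" and "second" review.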

Included trials
The flow of reviews and trials in this analysis is illustrated in Fig 1. The literature search identified 194 Cochrane systematic reviews that appeared to be related to physical therapy interventions. Of these, 86 reviews were excluded, mostly because they did not evaluate a physical therapy intervention, did not use the 2008 version of the CROB tool, or were duplicates (Fig  1). The 108 eligible Cochrane reviews had a median of 12 included studies each (interquartile range 7; 22) (see S1 File for list of included reviews). The area of practice for the eligible Cochrane reviews was musculoskeletal (27 reviews), cardiorespiratory (20), continence and women's health (14), neurology (8), orthopedics (8), sports (8), oncology (7), endocrine and lifestyle (6), gerontology (5), ergonomics and occupational health (2), pediatrics (2), and mental health (1). The interventions evaluated were exercise (62 reviews), electrotherapy (11), behavioral (7), manual therapy (6), education (5), respiratory therapy (4), acupuncture (2), ergonomics (2), splinting (2), and a combination of the different treatment options (7). There were 2765 references from these included studies. Of these, 1520 were included in this analysis. There were 1442 unique trial ratings that were used in the CROB tool vs. PEDro scale analyses (see S1 File for list of included trials). There were 78 duplicate trial ratings, of which the second trial ratings (n = 74) were used to examine the between-review agreement for the CROB tool and the third trial ratings (n = 4) were excluded from the data set.
The numbers of trials classified as "low," "unclear," and "high" for the CROB items are listed in Table 1. The transformed CROB ratings are in Table 2. For the main analysis, the CROB items with the highest prevalence of having "low" risk of bias were other sources of bias (56%), blinding of outcome assessment for objective outcomes (53%), incomplete outcome data (52%), and random sequence generation (51%). The mean (SD) CROB summary score was 40.0% (24.4) for the main analysis (i.e., number of items with "low" risk of bias divided by the number of core items evaluated), and 74.4% (19.5) for the sensitivity analysis (i.e., number of items with "low" or "unclear" risk of bias divided by the number of core items evaluated).
Table 1. Number and percentage of trials classified as "low", "unclear" and "high" risk of bias using the Cochrane risk of bias tool.

Agreement between the CROB tool and PEDro scale items
For the main analyses (i.e., dichotomizing the CROB ratings into "1" for "low" and "0" for "unclear" or "high"), the level of agreement between the PEDro and CROB items that evaluate similar constructs ranged from "slight" to "moderate" (see Table 4). Three items were classified as "moderate:" PEDro concealed allocation vs. CROB allocation concealment, PEDro assessor blinding vs. CROB blinding of outcome assessment, and PEDro subject blinding vs. CROB blinding of participants (Kappa = 0.479-0.582). The item with the lowest Kappa value was PEDro random allocation vs. CROB random sequence generation (Kappa = 0.054). The Kappa values for PEDro therapist blinding vs. CROB blinding of personnel and PEDro subject blinding vs. CROB blinding of participants need to be interpreted with caution because of the low base rate of therapist (i.e., 2%, see Table 3) and subject (i.e., 5%, see Table 3) blinding.
The sensitivity analyses revealed that interpretation of the CROB "unclear" category had a large impact on the agreement values. With the exceptions of the PEDro random allocation vs. CROB random sequence generation and PEDro completeness of follow-up vs. CROB incomplete outcome data, Kappa values were lower in the sensitivity analyses (Table 4). Agreement for PEDro random allocation vs. CROB random sequence generation changed from "slight" in the main analysis to "moderate" in both sensitivity analyses. Agreement for PEDro completeness of follow-up vs. CROB incomplete outcome data was highest when trials with CROB "unclear" ratings were omitted from the analysis (i.e., sensitivity analysis 2).
The analyses exploring the impact of different groupings of variants of the CROB blinding items on agreement between the PEDro scale and CROB tool are in Table 5. Agreement was highest when blinding was reported separately for the participants ("moderate" agreement) and personnel ("fair" agreement). For example, PEDro subject blinding vs. CROB blinding of participants had a Kappa value of 0.479 ("moderate" agreement), compared to 0.328 for PEDro subject blinding vs. CROB blinding of participants + CROB blinding of participants and personnel combined + CROB blinding of participants and personnel and outcome assessment combined ("fair" agreement). In contrast, collapsing different methods of reporting blinding of outcome assessment was consistent with the main analyses (all classified as "moderate" agreement), with the exception of CROB blinding of outcome assessment for subjective outcomes ("slight" agreement) and CROB blinding of outcome assessment for objective outcomes ("substantial" agreement).

Agreement between the CROB summary score and total PEDro score
The agreement between the CROB summary score and total PEDro score was "poor" for the main analysis (CROB "unclear" collapsed with "high"), with an Intraclass Correlation Coefficient of 0.285 (95% CI -0.093 to 0.831). Results from the sensitivity analysis (CROB "unclear" collapsed with "low") were also classified as "poor" agreement (Intraclass Correlation Coefficient = 0.150, 95% CI -0.064 to 0.771).

Agreement between different thresholds for the CROB summary score and total PEDro score
The matrix of agreement between different thresholds for "acceptable" risk of bias for the CROB summary score and total PEDro score is in Table 6. At best, the Kappa scores could be categorized as "fair" for the total PEDro score thresholds of ≥5 to ≥8 and the CROB thresholds of ≥20% to ≥80%. The highest Kappa value (0.318, 95% CI 0.250 to 0.388) occurred for the total PEDro score threshold of ≥6 and the CROB summary score threshold of ≥50%.
Table 6. Matrix of number of trials achieving both the Cochrane risk of bias summary score and total Physiotherapy Evidence Database scale score thresholds for "acceptable" risk of bias (n), percent exact agreement, Kappa (K) and Kappa 95% confidence interval for the level of agreement between different thresholds for "acceptable" risk of bias for the Cochrane risk of bias summary score and total Physiotherapy Evidence Database scale score (N = 1442). Cells are shaded to indicate the degree of agreement: no shading = "poor" or not calculable, gray = "slight," dark gray = "fair" [note: no comparisons had "moderate," "substantial" or "almost perfect" agreement].

Between-review agreement of CROB ratings
The between-review agreement (inter-rater reliability) for individual CROB items is reported in Table 7 (note, agreement values were not calculable for five items because there were too few pairs of trials that had ratings for these items). One item was classified as "almost perfect" (allocation concealment, Kappa = 0.818), one as "substantial" (random sequence generation, Kappa = 0.686), two as "moderate" (blinding of outcome assessment, Kappa = 0.533; selective reporting, Kappa = 0.450), two as "fair" (incomplete outcome data, Kappa = 0.365; other sources of bias, Kappa = 0.314), and one as "poor" (blinding of participants and personnel, Kappa = -0.037). The agreement for the CROB summary score was "fair to good," with an Intraclass Correlation Coefficient of 0.711 (95% CI 0.578 to 0.808).
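The verbal labels attached to these Kappa values are consistent with the widely used Landis and Koch bands (below 0 "poor," 0 to 0.20 "slight," 0.21 to 0.40 "fair," 0.41 to 0.60 "moderate," 0.61 to 0.80 "substantial," 0.81 to 1.00 "almost perfect"). That these exact cut-points were used is an assumption on our part for this illustration; the sketch below simply checks that the bands reproduce the labels reported above.

```python
def kappa_label(kappa: float) -> str:
    """Map a Kappa value to a verbal label (assumed Landis and Koch bands)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Between-review Kappa values and labels as reported in the text above.
REPORTED = [
    ("allocation concealment", 0.818, "almost perfect"),
    ("random sequence generation", 0.686, "substantial"),
    ("blinding of outcome assessment", 0.533, "moderate"),
    ("selective reporting", 0.450, "moderate"),
    ("incomplete outcome data", 0.365, "fair"),
    ("other sources of bias", 0.314, "fair"),
    ("blinding of participants and personnel", -0.037, "poor"),
]

for item, kappa, reported_label in REPORTED:
    print(f"{item}: {kappa:+.3f} -> {kappa_label(kappa)}")
```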

Discussion
There was "moderate" agreement between the PEDro scale and CROB tool for three of the six items that evaluate similar constructs: PEDro concealed allocation vs. CROB allocation concealment, PEDro assessor blinding vs. CROB blinding of outcome assessment, and PEDro subject blinding vs. CROB blinding of participants (Kappa = 0.479-0.582). Agreement was "slight" to "fair" for the other three items, and "poor" for the CROB summary score vs. total PEDro score. Agreement tended to be higher when the CROB "unclear" category was collapsed with "high" and when blinding of participants, personnel and outcome assessment were evaluated separately within the CROB tool. It was not possible to draw a strong conclusion about level of agreement between different thresholds for "acceptable" risk of bias between summary scores from the two instruments.
The main strengths of this meta-epidemiological study were the rigorous methods used for data extraction, large sample size, and comprehensive analysis of convergent validity that included all core items from the CROB tool and PEDro scale. Our sample (1442 trials from 108 reviews) represents a four-fold increase on previous studies [18]. This allowed us to assess the agreement between the instruments and to conduct a series of sensitivity analyses to explore the impact of the CROB "unclear" category and how blinding is quantified in the CROB tool.
Calculating summary scores for instruments used to assess the risk of bias of trials is controversial. Critics of summary scores argue that summation is invalid because most instruments are comprised of heterogeneous items evaluating both quality of reporting and the conduct of trials [3,7,8,26-28]. Proponents argue that summing is justified if an instrument has empirical evidence indicating a unidimensional structure [10,16] and that summary scores facilitate analysis (e.g., being used as an independent variable in meta-regression). Perhaps there is room for some middle ground, with reporting of both individual items and summary scores for risk of bias assessment in systematic reviews and in the evaluation of the measurement properties of risk of bias instruments. Calculation of a summary score allowed us to perform a more rigorous comparison of the PEDro scale and CROB tool, including the systematic analysis of different thresholds for "acceptable" risk of bias.
The low agreement between some items of the CROB tool and PEDro scale could be due to the characteristics of the instruments and raters. Operationalization of some of the items that assess similar constructs differs between the instruments. This is particularly evident in the item with the lowest agreement, PEDro random allocation vs. CROB random sequence generation (Kappa = 0.054). In this instance, the CROB item is more stringent than the PEDro item, requiring the precise method of sequence generation to be specified. In contrast, agreement was "moderate" for items that had similar definitions (e.g., PEDro concealed allocation vs. CROB allocation concealment, Kappa = 0.582). Different pairs of raters generated the CROB ratings because they were extracted from Cochrane reviews evaluating physical therapy interventions. While we did not calculate Kappa using an approach that accommodates multiple raters, our agreement estimates are likely to be reasonable [29]. Online training for the 2008 version of the CROB tool is limited and the terminology used in the tool could be difficult for reviewers who do not have clinical epidemiology training; this may lead to increased usage of the "unclear" risk of bias rating [30-32]. While the interpretation of "unclear" is not well explained in the Cochrane handbook [3], we observed that the "unclear" category had a large impact on risk of bias scoring. For example, the CROB summary score was 40.0% when the number of items with "low" risk of bias was divided by the number of core items evaluated and 74.4% when the number of items with "low" or "unclear" risk of bias was divided by the number of core items evaluated. Inclusion of two items that evaluate completeness of statistical reporting in the total PEDro score may have contributed to the poor agreement between the CROB summary score and total PEDro score. However, because nearly all trials achieved the PEDro between-group statistical comparisons (95%) and point measures and variability (91%) items, the impact is likely to be small.
The focus of this meta-epidemiological study was on trials evaluating physical therapy interventions. These trials differ from pharmacological trials in methodological structure, particularly for blinding of participants (or subjects) and personnel (or therapists) [6,12]. Blinding of participants and personnel was not possible for the majority of the included trials (5% had subject blinding and 2% had therapist blinding; Table 3) because of the complex nature of the interventions being evaluated, but assessor blinding was achieved in about one-third of trials (37% had assessor blinding; Table 3). This highlights the importance of evaluating blinding separately for subjects, therapists and assessors when evaluating risk of bias in physical therapy trials. While we classified the physical therapy interventions evaluated in the included trials into 10 categories (exercise, electrotherapy etc), we did not perform any subgroup analyses on the blinding items. This could be the focus of future research.
The difficulty in blinding subjects and therapists may make applying the CROB tool more challenging, as evidenced by the included reviews using seven variations of the blinding items. While variation in the implementation of the CROB tool made it difficult to evaluate the blinding items because of incomplete data, agreement between the CROB and PEDro blinding items was highest when blinding was reported separately for the participants ("moderate" agreement) and personnel ("fair" agreement). Variation also occurred for other CROB items, with none of the core items being assessed for all trials (Table 1) and 25 potentially eligible reviews being excluded because they did not use the 2008 version of the CROB tool (Fig 1). The difference in methodological structure and variation in the application of the CROB tool in the included reviews could contribute to the lower agreement between the CROB summary score and total PEDro score observed in our evaluation of physical therapy interventions (i.e., "poor") compared to pharmacological trials ("strong" convergence) [17]. The observed variability in the implementation of the CROB tool could also add confusion for readers of Cochrane reviews.
Our evaluation of between-review agreement for the CROB tool revealed "almost perfect" agreement for allocation concealment, "substantial" for random sequence generation, "moderate" for blinding of outcome assessment and selective reporting, "fair" for incomplete outcome data and other sources of bias, and "fair to good" for the CROB summary score. With the exception of incomplete outcome data and blinding of participants and personnel, these levels of agreement were the same or better than the agreement observed between pairs of reviewers reported in the literature [6,7,33] and our sample size (n = 74) was larger than other studies [6,33].
Our analyses have implications for risk of bias assessment in systematic reviews and as a component of evidence-based practice. Researchers and clinicians could use either the CROB tool or the PEDro scale, as neither can be considered the gold standard for risk of bias evaluation. However, the instruments cannot be used interchangeably because of the low convergent validity for the summary scores and some individual items. We were not able to identify a robust threshold for "acceptable" risk of bias and so caution against the use of thresholds for "acceptable" risk of bias for both the CROB tool and PEDro scale. Optimal cut-offs have not been rigorously established for either instrument, and could be the focus of future research.
The 2008 version of the CROB tool was compared to the PEDro scale in this study. The next version of the CROB tool (called ROB 2.0) has recently been released [34], but has not yet been used to evaluate risk of bias in Cochrane reviews. It will take some time for the ROB 2.0 tool to be used in all Cochrane reviews and, because it will not be applied retrospectively, updated reviews may report risk of bias using the 2008 version of the CROB tool. Future meta-epidemiological studies could compare the two versions of the CROB tool or compare ROB 2.0 to the PEDro scale in order to provide empirical data that can be used to select the most robust risk of bias instrument. Our dataset may be used to facilitate future evaluations (S2 File).

Conclusion
The agreement between the PEDro scale and CROB tool was "moderate" for three of the six items that evaluate similar constructs. Interpretation of the CROB "unclear" category and variants of the CROB blinding items substantially influenced agreement. We caution against the use of thresholds for "acceptable" risk of bias for both the CROB tool and PEDro scale. Either instrument can be used to quantify risk of bias, but they can't be used interchangeably.

Acknowledgments
We thank Tamara Rader, Elizabeth Ghogomu, Véronique Beaudoin and Ana Lakic for their assistance and contribution. We would also like to acknowledge Éric Bergeron, Philippe Bergeron, Caroline Chrétien and Matthieu Hamel for their contribution, and Christopher Maher for his insightful comments.