Conceived and designed the experiments: JG MB BS NA LB. Performed the experiments: ZO JP VS AB. Analyzed the data: BS TR. Wrote the paper: ZO JG MB BS LB. Other: Designed the study: JS. Wrote the first draft of the paper: BS.
The authors have declared that no competing interests exist.
Thousands of systematic reviews have been conducted in all areas of health care. However, the methodological quality of these reviews is variable and should routinely be appraised. AMSTAR is a measurement tool to assess systematic reviews.
AMSTAR was used to appraise 42 reviews focusing on therapies to treat gastro-esophageal reflux disease, peptic ulcer disease, and other acid-related diseases. Two assessors applied AMSTAR to each review. Two other assessors, plus a clinician and/or methodologist, independently applied a global assessment to each review.
The sample of 42 reviews covered a wide range of methodological quality. The overall scores on AMSTAR ranged from 0 to 10 (out of a maximum of 11) with a mean of 4.6 (95% CI: 3.7 to 5.6) and median 4.0 (range 2.0 to 6.0). The inter-observer agreement of the individual items ranged from moderate to almost perfect agreement. Nine items scored a kappa of >0.75 (95% CI: 0.55 to 0.96). The reliability of the total AMSTAR score was excellent: kappa 0.84 (95% CI: 0.67 to 1.00) and Pearson's R 0.96 (95% CI: 0.92 to 0.98). The overall scores for the global assessment ranged from 2 to 7 (out of a maximum score of 7) with a mean of 4.43 (95% CI: 3.6 to 5.3) and median 4.0 (range 2.25 to 5.75). The agreement was lower with a kappa of 0.63 (95% CI: 0.40 to 0.88). Construct validity was shown by AMSTAR convergence with the results of the global assessment: Pearson's R 0.72 (95% CI: 0.53 to 0.84). For the AMSTAR total score, the limits of agreement were −0.19±1.38. This translates to a minimum detectable difference between reviews of 0.64 ‘AMSTAR points’. Further validation of AMSTAR is needed to assess its validity, reliability and perceived utility by appraisers and end users of reviews across a broader range of systematic reviews.
High-quality systematic reviews are increasingly recognized as providing the best evidence to inform health care practice and policy.
Several instruments exist to assess the methodological quality of systematic reviews.
In an attempt to achieve some consistency in the evaluation of systematic reviews, we have developed a tool to assess their methodological quality. This builds on previous work.
The characteristics and basic properties of the instrument have been described elsewhere.
For our validation test set, we chose systematic reviews or meta-analyses in the area of gastroenterology, specifically the upper gastrointestinal tract. CADTH's information specialist searched electronic bibliographic databases (i.e. Medline, Central and EMBASE) up to and including 2005. A total of 42 systematic reviews met the
Two CADTH assessors from two review groups (SS and FA, AL and CY) independently applied AMSTAR to each review and reached agreement on the assessment results. To assess construct validity, two reviewers (JP, ZO) plus a clinician and/or methodologist (MB, DF, DP, MO, and DH) applied a global assessment to each review.
We calculated an overall agreement score using the weighted Cohen's kappa, as well as one for each item.
Item | Kappa (95% CI) | Phi (Φ) |
1. Was an ‘a priori’ design provided? | 0.75 (0.55 to 0.96) | 0.76 |
2. Was there duplicate study selection and data extraction? | 0.81 (0.63 to 0.99) | 0.83 |
3. Was a comprehensive literature search performed? | 0.88 (0.73 to 1.00) | 0.89 |
4. Was the status of publication (i.e. grey literature) used as an inclusion criterion? | 0.64 (0.40 to 0.88) | 0.64 |
5. Was a list of studies (included and excluded) provided? | 0.84 (0.67 to 1.00) | 0.84 |
6. Were the characteristics of the included studies provided? | 0.76 (0.55 to 0.96) | 0.76 |
7. Was the scientific quality of the included studies assessed and documented? | 0.90 (0.77 to 1.00) | 0.91 |
8. Was the scientific quality of the included studies used appropriately in formulating conclusions? | 0.51 (0.25 to 0.78) | 0.56 |
9. Were the methods used to combine the findings of studies appropriate? | 0.80 (0.63 to 0.99) | 0.80 |
10. Was the likelihood of publication bias assessed? | 0.85 (0.64 to 1.00) | 0.85 |
11. Were potential conflicts of interest included? | 1.00 (100% no) | 1.00 |
Total score | 0.84 (0.67 to 1.00) | 0.85 |
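The per-item statistics in the table can be illustrated with a minimal sketch that computes Cohen's kappa and the phi coefficient from a 2×2 agreement table for one dichotomous item. The function name and the counts are hypothetical and are not taken from the study data. For an item with only two categories, weighted and unweighted kappa coincide, so the unweighted form is used here. When both raters give the same answer to every review (as with item 11), chance agreement equals 1 and kappa is mathematically undefined; the sketch returns NaN in that case, which is one possible convention and not necessarily the one the authors applied.

```python
import math

def agreement_2x2(both_yes, only_a_yes, only_b_yes, both_no):
    """Return (kappa, phi) for two raters scoring one yes/no item.

    The four arguments are the cell counts of the 2x2 agreement table.
    """
    n = both_yes + only_a_yes + only_b_yes + both_no
    a_yes = both_yes + only_a_yes      # rater A marginal 'yes'
    b_yes = both_yes + only_b_yes      # rater B marginal 'yes'
    a_no, b_no = n - a_yes, n - b_yes

    observed = (both_yes + both_no) / n
    expected = (a_yes * b_yes + a_no * b_no) / n**2   # chance agreement

    # Kappa is undefined when chance agreement is already perfect.
    if expected == 1.0:
        kappa = float("nan")
    else:
        kappa = (observed - expected) / (1 - expected)

    denom = math.sqrt(a_yes * a_no * b_yes * b_no)
    if denom == 0:
        phi = float("nan")
    else:
        phi = (both_yes * both_no - only_a_yes * only_b_yes) / denom
    return kappa, phi

# Hypothetical counts for 42 reviews (illustration only).
kappa, phi = agreement_2x2(both_yes=20, only_a_yes=2, only_b_yes=3, both_no=17)
print(f"kappa = {kappa:.2f}, phi = {phi:.2f}")  # kappa = 0.76, phi = 0.76
```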
We assessed construct validity (i.e. evaluation of a hypothesis about the expected performance of an instrument) by converting the total mean score (mean of the two assessors) for each of the 42 reviews to a percentage of the maximum score for AMSTAR and of the maximum score of the global assessment instrument. We used Pearson's Rank correlation coefficients, Pearson's R and Kruskal-Wallis test to further explore the impact of the following items on the construct validity of AMSTAR: a) Cochrane systematic review vs. non-Cochrane systematic reviews
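The conversion and correlation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names and the example scores are hypothetical, the correlation is the ordinary product-moment coefficient, and the confidence interval uses the Fisher z-transformation, which the text does not name as the authors' method. With r = 0.72 and n = 42 this interval is approximately 0.53 to 0.84, which agrees with the interval reported for the convergence between AMSTAR and the global assessment.

```python
import math

AMSTAR_MAX = 11   # maximum AMSTAR total score (from the text)
GLOBAL_MAX = 7    # maximum global assessment score (from the text)

def percent_of_max(score_a, score_b, max_score):
    """Mean of two assessors' totals, expressed as a percentage of the maximum."""
    return 100.0 * ((score_a + score_b) / 2) / max_score

def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r via the Fisher z-transformation (assumed method)."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical assessor totals for five reviews (illustration only).
amstar_pairs = [(4, 5), (8, 8), (2, 3), (6, 7), (10, 9)]
global_pairs = [(3, 4), (6, 6), (2, 2), (4, 5), (7, 6)]

amstar_pct = [percent_of_max(a, b, AMSTAR_MAX) for a, b in amstar_pairs]
global_pct = [percent_of_max(a, b, GLOBAL_MAX) for a, b in global_pairs]
print("r on hypothetical data:", round(pearson_r(amstar_pct, global_pct), 3))

low, high = fisher_ci(0.72, 42)
print(f"95% CI for r = 0.72, n = 42: {low:.2f} to {high:.2f}")
```

Expressing both instruments as a percentage of their maximum puts the 11-point and 7-point scales on a common footing; the correlation itself is unchanged by this linear rescaling.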
We assessed the practicability of the new instrument by recording the time it took to complete scoring and the instances where scoring was difficult. We interviewed assessors (N = 6) to obtain data on clarity, ambiguity, completeness and user-friendliness.
We used SPSS (versions 13 and 15) and MedCalc for Windows, version 8.1.0.0.
The 42 reviews included in the study had a wide range of quality scores. The overall scores estimated by the AMSTAR instrument ranged from 0 to 10 (out of a maximum of 11) with a mean of 4.6 (95% CI: 3.7 to 5.6) and median 4.0 (range 2.0 to 6.0). The overall scores for the global assessment instrument ranged from 2 to 7 (out of a maximum score of seven) with a mean of 4.43 (95% CI: 3.6 to 5.3) and median 4.0 (range 2.5 to 5.3).
The reliability of the total AMSTAR score between two assessors (the sum of all items answered ‘yes’ scored as 1, all others as 0) was kappa 0.84 (95% CI: 0.67 to 1.00; Φ = 0.85) and Pearson's R 0.96 (95% CI: 0.92 to 0.98). The inter-rater agreement (kappa) between two raters for the global assessment was 0.63 (95% CI: 0.40 to 0.88).
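The scoring rule for the total score (each item answered ‘yes’ counts 1, every other answer counts 0) can be written as a short sketch. The function name and the non-‘yes’ answer labels used in the example are illustrative assumptions; only the ‘yes’ = 1, otherwise 0 rule and the 11-item length come from the text.

```python
N_ITEMS = 11  # AMSTAR has 11 items (from the text)

def amstar_total(item_answers):
    """Total AMSTAR score: number of items answered 'yes'; all others score 0."""
    if len(item_answers) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} answers, got {len(item_answers)}")
    return sum(1 for answer in item_answers if answer.strip().lower() == "yes")

# Hypothetical answers for one review (labels other than 'yes' are illustrative).
answers = ["yes", "no", "yes", "no", "yes", "yes", "no",
           "can't answer", "no", "not applicable", "no"]
print(amstar_total(answers))  # 4
```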
Items in AMSTAR displayed levels of agreement that ranged from moderate to almost perfect; nine items scored a kappa of >0.75 (95% CI: 0.55 to 0.96) and Φ >0.76. Item 4 had a kappa of 0.64 (95% CI: 0.40 to 0.88; Φ = 0.64) and item 8 a kappa of 0.51 (95% CI: 0.25 to 0.78; Φ = 0.56). The reliability of the total AMSTAR score was excellent: kappa 0.84 (95% CI: 0.67 to 1.00) and Pearson's R 0.96 (95% CI: 0.92 to 0.98). For the AMSTAR total score, the limits of agreement were −0.19±1.38.
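Limits of agreement of the form "bias ± half-width" are commonly obtained from the paired differences between two raters: the mean difference gives the bias, and a multiple of the standard deviation of the differences gives the half-width. The sketch below assumes the conventional multiplier of 1.96 and uses hypothetical scores; the text does not state which multiplier or software routine produced the reported −0.19±1.38.

```python
import statistics

def limits_of_agreement(scores_a, scores_b, multiplier=1.96):
    """Return (bias, half_width) for paired scores from two raters.

    bias is the mean of the paired differences (A minus B); half_width is
    multiplier times the sample standard deviation of those differences.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    bias = statistics.mean(diffs)
    half_width = multiplier * statistics.stdev(diffs)
    return bias, half_width

# Hypothetical total scores from two assessors for five reviews.
rater_a = [4, 5, 6, 3, 8]
rater_b = [5, 5, 5, 4, 8]
bias, half_width = limits_of_agreement(rater_a, rater_b)
print(f"limits of agreement: {bias:.2f} ± {half_width:.2f}")
```

A bias close to zero indicates that neither assessor scores systematically higher than the other, while the half-width indicates how far apart two assessors' totals for the same review can be expected to fall.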
The mean age of our reviewers was 40.57 years (median 43). Fifty-seven percent were identified as experts in methodology and 43% as content experts in the field.
Expressed as a percentage of the maximum score, the results of AMSTAR converged with the results of the global assessment instrument [Pearson's Rank Correlation Coefficient 0.72 (95% CI: 0.53 to 0.84)]. AMSTAR scoring also upheld our other
The journals had the following overall summary statistics for impact factor: mean 5.88 (95% CI: 3.9 to 7.9), median 3.3 (lowest value 1.4, highest value 23.9). There was no statistical association between AMSTAR score and impact factor (Pearson's R 0.555, P = 0.7922). There was, however, a significant association between the number of pages and AMSTAR score (Pearson's R 0.5623, P = 0.0001, n = 42). We found no association (R 0.1773, P = 0.0308) when we removed the outliers (i.e. systematic reviews with higher page numbers).
Conflict of interest was poorly reported. Of the 42 reviews assessed, none had appropriately declared conflicts of interest. Therefore, we were unable to assess whether funding had a positive or negative effect on the AMSTAR score.
Both AMSTAR and the global assessment required on average 15 minutes to complete, but with the latter, assessors expressed difficulty in reaching a final decision in the absence of comprehensive guidelines. In contrast, AMSTAR was well received.
This paper describes an external validation of AMSTAR. This new measurement tool to assess methodological quality of systematic reviews showed satisfactory inter-observer agreement, reliability and construct validity in this study. Items in AMSTAR displayed levels of agreement that ranged from moderate to almost perfect. The reliability of the total AMSTAR score was excellent. Construct validity was shown by AMSTAR convergence with the results of the global assessment instrument.
We found a significant association between the number of published pages and the overall AMSTAR score, suggesting that the longer the manuscript, the higher the quality score. This finding should be interpreted with caution because only a couple of the longer reviews largely drive the hypothesis tests: we found no association when the outliers were removed from the dataset. We did not find an association between AMSTAR score and impact factor.
The AMSTAR instrument was developed pragmatically using previously published tools and expert consensus. The original 37 items were reduced to an 11-item instrument addressing key domains; the resulting instrument was judged by the expert panel to have face and content validity.
This is a prospective external validation study. We compared the new instrument to an independent and reliable gold standard designed for assessing the quality of systematic reviews, allowing multiple testing of convergent validity.
The analytical methods for assessing quality and measuring agreement amongst assessors need further discussion and development. We calculated chance-corrected agreement, using the kappa statistic.
We were unable to test our convergent validity hypothesis about conflict of interest because of missing data in the systematic reviews and primary studies. This highlights the need for journals and journal editors to require that the information is provided.
Our results are based on a small sample of systematic reviews in a particular clinical area and a relatively small number of AMSTAR assessors. There is a need for replication in larger and different data sets with more diverse appraisers.
Existing systematic review appraisal instruments did not reflect current evidence on potential sources of bias in systematic reviews and were generally not validated. The best available instrument prior to the development of AMSTAR was OQAQ, which was formally validated. However, users of OQAQ frequently had to develop their own rules for operationalizing the instrument, and OQAQ does not reflect current evidence on sources of potential bias in systematic reviews (for example, funding source and conflict of interest).
Quality assessment instruments can focus on either
Decision-makers have spent the last ten years trying to work out the best way to use the enormous number of systematic reviews available to them. They can hardly know where to start when deciding whether the relevant literature is valid and of the highest quality. AMSTAR is a user-friendly methodological quality assessment tool that has the potential to standardize the appraisal of systematic reviews. Early experience suggests that relevant groups are finding the instrument useful.
Further validation of AMSTAR is needed to assess its validity, reliability and perceived utility by appraisers and end users of reviews across a broader range of systematic reviews. We need to assess the responsiveness of AMSTAR looking at its sensitivity to discriminate between high and low methodological quality reviews.
We need to assess the applicability of AMSTAR for reviews of observational (diagnostic, etiological and prognostic) studies and if necessary develop AMSTAR extensions for these reviews.
We plan to update AMSTAR as new evidence regarding sources of bias within systematic reviews becomes available.
AMSTAR is a measurement tool created to assess the methodological quality of systematic reviews.
Global assessment rating
We would like to thank our international panel of assessors: Daniel Francis, David Henry, Marisol Betancourt, Dana Paul, Martin Olmos; and our local team of assessors: Sumeet Singh, Avtar Lal, Changhua Yu, Fida Ahmed. We also thank Dr. Giuseppe G.L. Biondi-Zoccai and Crystal Huntly-Ball for their helpful suggestions on this manuscript.