Quality Assessment of Clinical Practice Guidelines on the Treatment of Hepatocellular Carcinoma or Metastatic Liver Cancer

Objectives To assess the quality of the currently available clinical practice guidelines (CPGs) for hepatocellular carcinoma, and provide a reference for clinicians in selecting the best available clinical protocols. Methods The databases of PubMed, MEDLINE, Web of Science, Chinese Biomedical Literature database (CBM), China National Knowledge Infrastructure (CNKI), WanFang, and relevant CPG websites were systematically searched through March 2014. CPG quality was appraised using the Appraisal of Guidelines for Research & Evaluation (AGREE) II instrument, and data analysis was performed using SPSS 13.0 software. Results A total of 20 evidence-based and 20 expert consensus-based guidelines were included. The mean domain scores were: scope and purpose 83% (95% confidence interval (CI), 81% to 86%), clarity of presentation 79% (95% CI, 73% to 86%), stakeholder involvement 39% (95% CI, 30% to 49%), editorial independence 58% (95% CI, 52% to 64%), rigor of development 39% (95% CI, 31% to 46%), and applicability 16% (95% CI, 10% to 23%). Evidence-based guidelines were superior to those established by consensus for the domains of rigor of development (p < 0.001), clarity of presentation (p = 0.01) and applicability (p = 0.021). Conclusions The overall methodological quality of CPGs for hepatocellular carcinoma and metastatic liver cancer is moderate, with poor applicability and potential conflict of interest issues. Evidence-based guidelines have become mainstream for high-quality CPG development; however, there is still a need to further increase the transparency and quality of evidence rating, as well as of the recommendation process, and to address potential conflicts of interest.


Introduction
Hepatocellular carcinoma (HCC) is the seventh most common cancer worldwide [1], and the third most common cause of death from cancer with an overall mortality-to-incidence ratio of 0.93 [2]. Most of the burden is in developing countries, where almost 85% of cases occur [1,2]. The annual cost of HCC in the United States is $454.9 million, with an average cost per patient of $32,907. Healthcare costs and lost productivity account for 89.2% and 10.8% of the total, respectively [3]. A survey showed that the cost for patients with HCC is approximately 6 to 8 fold higher than for those without this cancer, with the mean per-patient-per-month (PPPM) cost of $7,863 for cases and $1,243 for controls [4]. It is estimated that the number of disability-adjusted life years (DALYs) lost and medical costs due to HCC will gradually increase as the incidence of HCC rises in younger people.
The Institute of Medicine (IOM) has defined clinical practice guidelines (CPGs) as "systematically developed statements to assist practitioner and patient decisions about appropriate health care for specific clinical circumstances" [5]. CPGs provide doctors with detailed and authoritative recommendations and can replace customary or outdated clinical methods, which improves healthcare consistency, promotes health service equity and reduces healthcare costs for the government [6]. Although the quantity and quality of CPGs have improved, guidelines formulated by different institutes or researchers still differ widely. Therefore, a rigorous evaluation of the quality of CPGs is urgently needed. The Appraisal of Guidelines for Research & Evaluation II (AGREE II) instrument is recognized as a preferred tool for the quality appraisal of guidelines [7,8]. It can provide a methodological strategy for the development of guidelines, and inform authors on the type of information and the manner in which the information should be reported in the guidelines, thereby ultimately improving the level of healthcare [9].
Schmidt et al [10] evaluated the quality of 32 guidelines on the diagnosis and treatment of HCC in 2011. They concluded that most guidelines lacked appropriate methodological quality. However, all guidelines they included were published before 2010 and were assessed using the original four-point scale of the AGREE instrument published in 2003, which does not comply with current methodological standards of health measurement design. In particular, this noncompliance might threaten the performance and reliability of the instrument [8]. The aim of the present study is to systematically assess the quality of currently available CPGs for HCC or metastatic liver cancer using the AGREE II instrument, and provide a reference for clinicians in selecting the best clinical protocols.

Inclusion criteria
The available guidelines on the treatment of primary or metastatic liver cancer published in English or Chinese were included.
Exclusion criteria
a) HCC guidelines for diagnosis (e.g., ultrasound, enhanced computerized tomography (CT)); b) Chinese or other-language versions of overseas CPGs; c) Quality improvement guidelines, position statements or guideline summaries; d) National Institute for Health and Care Excellence interventional procedure guidance (NICE IPG) or overviews; e) Conference abstracts, overviews, primary studies, systematic reviews or letters.

Guideline sources and search strategy
The electronic databases of PubMed, MEDLINE, Web of Science, Chinese Biomedical Literature database (CBM), China National Knowledge Infrastructure (CNKI), and WanFang were systematically searched through March 2014. The MeSH terms and free-text terms were combined as follows: (Liver Neoplasms OR Carcinoma, Hepatocellular) AND (Guideline OR Practice Guideline OR Consensus). We also searched the relevant CPG websites, including Guidelines International Network (G-I-N), National Guideline Clearinghouse (NGC), Clinical Practice Guideline Network (CPGN), National electronic Library for Medicines (NeLM), and NICE.

Selection of Guidelines
The PRISMA (preferred reporting items for systematic reviews and meta-analyses) statement was followed to search and select guidelines [11]. Two reviewers (WYQ, WSY) independently screened guidelines by browsing titles and abstracts against the predefined inclusion and exclusion criteria. The same two reviewers then carefully read the full text of the guidelines retained after this initial screening to determine their eligibility for inclusion in the study. Discrepancies between the two reviewers were resolved by discussion or with a third person (LYP).
If a guideline clearly stated the quality of evidence on which a recommendation was based, or graded its recommendations and statements, it was judged as evidence-based. If a guideline was developed by consensus (i.e., consensus meeting or expert panel), without stating the source of evidence and the grade of recommendation, it was judged as consensus-based.

Quality appraisal
Three appraisers (WYQ, WSY and WHQ) independently rated the included CPGs using the AGREE II instrument that consisted of 23 key items organized within six quality domains followed by two global rating items ("Overall Assessment"). Each of the items was rated on a 7-point scale (1-strongly disagree to 7-strongly agree). The appraisers scored each guideline independently using the rating scale. If the three appraisers rated items with a difference of more than two points, a consensus discussion was held to obtain the final rating [10]. Observed scores of individual items in a domain were calculated by summing up all scores of the three appraisers, and each domain score was standardized as a percentage according to the following formula [9]:
$$\text{Scaled domain score} = \frac{\text{Observed score} - \text{Minimum possible score}}{\text{Maximum possible score} - \text{Minimum possible score}} \times 100\%$$
[Maximum possible score = 7 (strongly agree) × No. of items within a domain × No. of appraisers; Minimum possible score = 1 (strongly disagree) × No. of items within a domain × No. of appraisers].
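As an illustration only, the scaled domain score and the two-point disagreement check described above can be sketched in Python. The function names and the example ratings are hypothetical and are not taken from the study data; the sketch assumes every appraiser rates every item in the domain on the 1 to 7 scale.

```python
from typing import List, Sequence

MIN_RATING = 1  # strongly disagree
MAX_RATING = 7  # strongly agree


def scaled_domain_score(ratings: Sequence[Sequence[int]]) -> float:
    """Return the scaled domain score as a percentage.

    `ratings` holds one inner sequence per appraiser, containing that
    appraiser's 1-7 ratings for every item in the domain.
    """
    n_appraisers = len(ratings)
    n_items = len(ratings[0])
    if any(len(row) != n_items for row in ratings):
        raise ValueError("every appraiser must rate the same items")
    if any(not MIN_RATING <= x <= MAX_RATING for row in ratings for x in row):
        raise ValueError("ratings must lie between 1 and 7")

    observed = sum(sum(row) for row in ratings)
    minimum = MIN_RATING * n_items * n_appraisers
    maximum = MAX_RATING * n_items * n_appraisers
    return (observed - minimum) / (maximum - minimum) * 100


def items_needing_consensus(
    ratings: Sequence[Sequence[int]], max_spread: int = 2
) -> List[int]:
    """Return indices of items whose ratings differ by more than `max_spread`."""
    n_items = len(ratings[0])
    flagged = []
    for i in range(n_items):
        item_ratings = [row[i] for row in ratings]
        if max(item_ratings) - min(item_ratings) > max_spread:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Hypothetical ratings: three appraisers, a three-item domain.
    example = [[6, 7, 5], [6, 6, 6], [7, 6, 2]]
    print(round(scaled_domain_score(example), 1))  # 77.8
    print(items_needing_consensus(example))        # [2]
```

In this hypothetical example the observed score is 51, the minimum possible score is 9 and the maximum is 63, so the scaled score is (51 − 9) / (63 − 9) × 100 ≈ 77.8%. The third item would be referred to a consensus discussion because its ratings span four points.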
A domain score of 60% was considered a threshold value of the AGREE instrument for rating the overall quality of CPGs. A guideline was 'strongly recommended' if the majority of domains (more than five) were scored above 60%. A guideline was 'weakly recommended' if more than four domains were scored above 30%. A guideline was 'not recommended' if more than three domains were scored below 30% [10].

Statistical analysis
The mean score and 95% confidence interval (CI) were calculated for each AGREE II domain. Kendall's coefficient of concordance [12] was applied to estimate the reliability among appraisers. The independent-samples Student's t-test was applied when Levene's test gave p > 0.05. Data analysis and graphics were performed using SPSS version 13.0 for Windows (LEAD Technologies, Inc., IL, USA) and SigmaPlot version 12.0 for Windows (Systat Software, Inc., Chicago, IL), respectively. A p value of less than 0.05 was considered significant.
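For readers unfamiliar with Kendall's coefficient of concordance (W), the following is a minimal pure-Python sketch of the standard tie-corrected formula, W = 12S / (m²(n³ − n) − mT), where m is the number of appraisers, n the number of rated subjects, S the sum of squared deviations of the rank sums from their mean, and T the tie-correction term. It is not the SPSS procedure used in the study; the function names and the example scores are hypothetical.

```python
from typing import List, Sequence


def rank_with_ties(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def kendalls_w(scores: Sequence[Sequence[float]]) -> float:
    """Kendall's W for m appraisers (rows) scoring the same n subjects (columns)."""
    m = len(scores)
    n = len(scores[0])
    if any(len(row) != n for row in scores):
        raise ValueError("every appraiser must score the same subjects")

    rank_rows = [rank_with_ties(row) for row in scores]
    rank_sums = [sum(row[j] for row in rank_rows) for j in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)

    # Tie correction: sum of (t^3 - t) over every group of tied scores.
    ties = 0
    for row in scores:
        counts = {}
        for value in row:
            counts[value] = counts.get(value, 0) + 1
        ties += sum(t ** 3 - t for t in counts.values())

    denominator = m ** 2 * (n ** 3 - n) - m * ties
    if denominator == 0:
        raise ValueError("W is undefined when every appraiser ties all subjects")
    return 12 * s / denominator


if __name__ == "__main__":
    # Hypothetical domain scores from three appraisers for four guidelines.
    example = [[83, 40, 61, 20], [80, 45, 58, 22], [85, 38, 60, 30]]
    print(kendalls_w(example))  # 1.0, identical rank order for all appraisers
```

W ranges from 0 (no agreement) to 1 (complete agreement in rank order), which is why a mean value above 0.9 is read as high inter-appraiser reliability.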

Appraisal of guidelines
Guideline evaluation results using the AGREE II instrument are detailed in Table 2. Three appraisers independently evaluated these guidelines with a mean Kendall's coefficient of concordance of 0.935 (95% CI, 0.928 to 0.941), which indicates a high level of reliability among evaluators.
Among the six domains of AGREE II, 40 guidelines were scored ≥60% with a mean of 79% to 83% for two domains, namely scope and purpose, and clarity of presentation. Sixteen guidelines were scored ≥60% for the stakeholder involvement domain and the remaining twenty-four had scores ranging from 33% to 59%. For the rigor of development domain, eight guidelines were scored ≥60% with a range of 63% to 90%, sixteen were scored 30% to 59% and the last sixteen were scored 3% to 22%. For the domain of applicability, only three guidelines were scored ≥60% with a range of 64% to 76%, and four others ranged from 39% to 53%, with 33 being scored below 30%. For the domain of editorial independence, nine guidelines were scored from 61% to 100%, and thirteen ranged from 33% to 58%, while the other eighteen were scored below 30%. Therefore, five guidelines were 'strongly recommended' according to AGREE II including three for HCC [20,25,28] and two for CLM [30,31], and 27 additional guidelines were 'weakly recommended'. Eight guidelines were not recommended because of poor quality [33,34,36,40,41,46,48,49].
Evidence-based guidelines were superior to those established by consensus for the domains of rigor of development (p < 0.001), clarity of presentation (p = 0.01), and applicability (p = 0.021). However, there was no significant difference for the other three domains (p > 0.05) (Figure 3).

Discussion
There has been a sharp increase in the number of CPGs worldwide since the 1980s [54]. As of June 2013, the Guidelines International Network (G-I-N) contains more than 6,400 guidelines, evidence reports and related documents (http://www.g-i-n.net/library), and the National Guideline Clearinghouse (NGC) currently includes 2,549 individual guideline summaries (http://www.guideline.gov). However, guidelines established by different governments, associations, companies and other organizations vary greatly, especially with respect to their quality [6,55,56]. A systematic review by Alonso-Coello et al [54] analyzed the quality of CPGs published from 1980 to 2010 and showed that the quality scores measured with the AGREE instrument were moderate to low.
Zheng et al [57] and Chen et al [58] analyzed the status of Chinese CPG development and concluded that considerable progress has been achieved for Chinese CPGs over time; however, all domain scores were lower than the world average, especially in rigor of development and editorial independence. There is no doubt that recommendations from low-quality CPGs may mislead clinical decisions, resulting in harm to the patient. Therefore, screening for high-quality CPGs is particularly vital to guide clinical practice.
In this study, it was found that the domain scores that received the highest marks as measured with AGREE II were 'scope and purpose' (mean 83%; 95% CI, 81% to 86%) and 'clarity of presentation' (mean 79%; 95% CI, 73% to 86%), which is similar to the research of Schmidt et al [10]. Furthermore, evidence-based guidelines are superior to consensus-based ones in terms of language, structure and layout. Because evidence-based guidelines have combined level of clinical evidence with strength of recommendations, these guidelines are more accurate and reflect a higher scientific standard.
However, there were some disappointing results regarding evidence-based guidelines in the domain of 'stakeholder involvement'. Although the average quality score measured with AGREE II is 58%, there were 24 guidelines (60%), including eleven evidence-based guidelines, that were scored less than 60%, which reflected the dearth of multidisciplinary teams and the failure to account for the views and experiences of the targeted patient population during the development of these guidelines [54]. There were various stakeholders involved, including those in steering groups, research groups involved in selecting and rating the evidence, individuals involved in formulating final recommendations, public and private funding bodies, managers, healthcare professionals, patients, employers and manufacturers, but not independent individuals involved externally in reviewing the guideline [9,59]. The engagement of the latter group is required for various reasons, such as including overlooked evidence, transparency and democracy principles, ownership, and potential policy implications [59]. Therefore, they play a vital role during guideline development, review and modification, but their involvement can also be very complex, and it needs to be inclusive, equitable, and sufficiently resourced [59].
The quality of a guideline largely depends on whether or not its methodology is rigorous and scientific. However, most guidelines received a low score for the domain of 'rigor of development' (mean 39%). Five consensus-based guidelines scored less than 30% for this domain. Although evidence-based guidelines are superior to consensus-based ones with respect to evidence gathering, quality assessment or strength of recommendations, there are still 12 evidence-based guidelines which were only scored between 30% and 60%. It is common for guidelines to include references to published studies, but few of them clearly describe the search strategy, the methodology used to formulate the final recommendations, or the dates on which guidelines were updated [10]. Possible reasons include the lack of methodological experts in guideline development teams, the lack of resources needed to search for high-quality systematic reviews, and the poor reporting quality of guidelines [54].
The domain of applicability mainly evaluates implementation barriers, cost factors, and monitoring criteria [9]. However, most guidelines included in this study neither discussed this field nor highlighted the tools required for facilitating or promoting guidelines, resulting in the lowest average domain score (16%), especially for 15 evidence-based guidelines, which were scored less than 30%. Similarly, the domain of editorial independence addresses whether the recommendations are influenced by the funding body and by conflict of interest (COI) issues that may arise from within the guideline-developing organization [9]. Potential COIs may greatly affect the content of guidelines and their recommendations. COIs were highly prevalent (150/288, 52%) among guidelines established by Canadian and US specialty societies, but a large proportion of guidelines did not publicly disclose COIs [60]. A study published by Choudhry et al [61] showed that 87% of guideline developers had some form of interaction with a pharmaceutical company, 58% of whom had received funding support to conduct their research, and 38% of whom had served as employees or consultants in the pharmaceutical industry. In our study, 20 (50%) guidelines did not publicly disclose COIs, and 18 (45%), including seven evidence-based guidelines, were scored less than 30% for this domain. Three of the five guidelines that we 'strongly recommend' reported the COIs of their authors in detail. In the EASL-EORTC guideline, the authors reported their COIs at the end of the guideline; however, a number of the affiliated authors had received research support and/or lecture fees from, and/or took part in clinical trials for, Bayer (a pharmaceutical company) [28], which may to some extent bias the independence of their recommendations and the reliability of the guideline.
Table 1. General information of guidelines included in our analysis.
Therefore, recommendations based on the AGREE II 'strongly recommend' guidelines still need to be revised and updated according to the conclusions of properly conducted systematic reviews.
We based our recommendations of the guidelines on the AGREE II instrument as previously described [9,10]. However, we would like to question the validity of this approach. First, such recommendations may lead clinicians to depend too much on and believe in the individual recommendations of guidelines that have achieved 'strongly recommend'. Second, such recommendations may falsely overrate the evidence because, in our experience, the bar is set too low. In short, even the 'strongly recommend' guidelines are not sufficiently evidence based. Third, we lack evidence of any patient benefits from adopting such coarse recommendations. Therefore, the recommendations should be seen as a consequence of adopting the AGREE II methodology rather than a quality stamp on some of the guidelines as being of high methodological quality. If it is a quality stamp, it is relative to the guidelines that achieved lower ratings.
The ultimate goal of the present guideline evaluation is to recognize the faults of existing guidelines so that the necessary steps are taken to improve their quality. We found that most authors had increasingly emphasized evidence gathering and synthesis, and formulated the final recommendations when they developed their guidelines. The evidence-based guideline has become mainstream for high-quality guideline development. However, the transparency of guidelines with respect to quality appraisal of evidence, formulation of recommendations, and the COIs of authors is still insufficient, and this has become a prominent problem affecting the quality of guidelines. Some guidelines have simply classified evidence according to the study design, ignoring quality assessment of evidence, which makes it difficult to know on which specific evidence, or type of evidence, a recommendation was based.
Although some guidelines use GRADE (the Grading of Recommendations Assessment, Development and Evaluation) as a tool for evaluating the quality of evidence and formulating the final recommendations, GRADE evidence profiles and summary of findings (SoF) tables were not presented or linked in the guidelines. Therefore, the GRADE working group has suggested that the guideline-developing committee should summarize evidence in simple, transparent and informative SoF tables and evidence profiles that provide detailed information about the reason for the quality of evidence rating [62]. Before developing a guideline, it is necessary to limit funding sources coming from industries or other institutions, or provide a formal process for discussion and public disclosure of financial COIs for authors [61,63,64]. When developing or updating guidelines, the AGREE II instrument is a tool that provides the methodological strategy and standard procedure [9]. When considering guideline recommendations, however, high-quality evidence (i.e., RCTs) should not always be blindly pursued [53]. Patient and societal values or preferences should be considered and incorporated with the evidence to formulate final recommendations [53,62].
Table 2 note: *The overall recommendations are based on the AGREE II evaluations according to references [9] and [10]. This recommendation should be seen as a recommendation between the currently available guidelines rather than a quality stamp of the individual guideline. Please see discussion. doi:10.1371/journal.pone.0103939.t002

Limitations
The study is based on guidelines published in Chinese and English journals. However, most institutions have local guidelines or rely on national guidelines (e.g., those issued in books, pamphlets and government documents) that are not published in journals. The guidelines used in most clinical settings might therefore be of lower quality than the published guidelines, which introduces some degree of selection bias. The AGREE II tool mainly focuses on methodology and quality of reporting, but not on the nature of the supporting evidence. Therefore, the quality of evidence on which the recommendations are based in the 'strongly recommended' guidelines still needs to be systematically reviewed and amended accordingly.

Conclusion
Although much progress has been achieved with respect to the quality of HCC and metastatic liver cancer guidelines, the overall methodological quality is moderate, with poor applicability and potential conflicts of interest (COIs). Evidence-based guidelines have become mainstream for high-quality guideline development, such as the Japanese Ministry of Health (JMH) guideline, the American Association for the Study of Liver Diseases (AASLD) guideline, and the European Association for the Study of the Liver/European Organization for Research and Treatment of Cancer (EASL-EORTC) guideline; however, there is still a need to further increase transparency, the quality of evidence rating, and the quality of the recommendation process, and to address COI issues.