Screening for osteoporosis: A systematic assessment of the quality and content of clinical practice guidelines, using the AGREE II instrument and the IOM Standards for Trustworthy Guidelines

Background Numerous clinical practice guidelines (CPGs) are published to guide management of osteoporosis. Little is known about their quality or how recommendations have changed over time. Objective To systematically assess the quality and content of guidelines on screening for osteoporosis, using the Appraisal of Guidelines for Research and Evaluation (AGREE II) tool and the Institute of Medicine (IOM) standards for trustworthy guidelines. Methods We conducted a systematic search for osteoporosis CPGs published between 2002 and 2016, using multiple databases and guideline websites. Two reviewers appraised the quality of eligible CPGs using the AGREE II. CPGs were considered high quality if they scored ≥ 60% in four or more domains, including the domain for rigor of development. Non-parametric tests were used to test for change in quality over time. One reviewer assessed the guidelines against the IOM standards. We summarized the different evidence grading systems and extracted and compared the recommendations. Results A total of 33 CPGs were identified. The mean AGREE II scores differed by domain (range: 42% to 71%). CPGs scored higher on the domains for clarity of presentation, scope and purpose, and rigor of development, and lower on the domains for stakeholder involvement, editorial independence and applicability. Assessment by the IOM standards showed that CPGs scored better on the standards for systematic review, establishing evidence foundation and rating strength of recommendation, articulation of recommendation, and establishing transparency, and lower on the standards for updating, external review, and development group composition. There was no difference in guideline quality, as defined by AGREE II and the IOM standards, before and after the introduction of the two tools (P values > 0.05). The IOM standards identified four guidelines as high quality that the AGREE II did not. Examining these additional guidelines indicated that the two tools may give conflicting results, especially for the rigor of development domain. Recommendations in certain areas showed substantial differences between guidelines. Conclusion Osteoporosis screening CPGs are of variable quality, and their recommendations often differ. Guideline quality as measured by AGREE II and the IOM standards has not improved over time. Guideline developers should work together to improve the quality and consistency of recommendations, to increase the likelihood that their guidelines will be used in practice.

Background
Osteoporosis is a disease characterised by low bone mass and deterioration of bone tissue structure, leading to increased bone fragility and susceptibility to fractures [1]. These fractures usually result from low mechanical forces that would not ordinarily cause a fracture, such as a fall from standing height or less [2]. The most common fracture sites are the spine, hip, and wrist [2]. Worldwide, osteoporosis leads to nearly 9 million fractures annually [1].
Low bone mass is a major risk factor for fractures; however, many other factors contribute to osteoporosis, including age, sex, previous fractures, family history of osteoporosis, use of systemic glucocorticoids, excessive alcohol intake and smoking [3]. The prevalence of osteoporosis rises rapidly with age; thus, the incidence of fractures is predicted to increase as population longevity increases [4].
Patients who suffer osteoporotic fractures are at higher risk of morbidity and mortality; in the Canadian Multicentre Osteoporosis Study (CaMOS), both men and women had an increased incidence of death of 23.5% (20/85) after hip fracture [5]. Additionally, these fractures cause acute pain and loss of function, and hip fractures nearly always lead to hospitalisation. Recovery is slow and rehabilitation is usually insufficient, leading to decreased quality of life and a burden on caregivers [6]. The economic cost of these fractures is high; Hopkins et al. (2016) estimated that they cost the Canadian health system more than $4 billion per year [7].
Osteoporosis is usually diagnosed based on bone mineral density (BMD) measured by dual-energy x-ray absorptiometry (DXA). A T-score of -2.5 or lower, that is, a BMD at least 2.5 standard deviations below the expected mean value for a young white female adult, is considered diagnostic [8]. Since low bone density is not the only risk factor for osteoporosis, a variety of tools have been developed to aid diagnosis and the decision to initiate pharmacological treatment. These tools incorporate the main risk factors for osteoporosis, with or without the DXA T-score; examples include CAROC, FRAX, QFracture, and others [9].
The aim of screening is to identify those at risk of fractures in order to prevent fractures from occurring, and to reduce the risk of re-fracture in patients who have sustained a previous fragility fracture. Many pharmacological and non-pharmacological treatments are available and have proven efficacy [10].
Clinical practice guidelines (CPGs) are defined by the Institute of Medicine as "Statements that include recommendations intended to optimize patient care that are informed by a systematic review of evidence and an assessment of the benefits and harms of alternative care options" [11]. Guidelines should consist of recommendations for the assessment and management of specific diseases based on the latest evidence. Clinicians make countless decisions each day and do not have the time to consider all the underlying evidence for these decisions; CPGs can do this for them [12]. Hence, CPGs are intended to transfer evidence into practice, decrease variability in clinical practice and decrease costly and avoidable harms or mistakes [13].
The number of guidelines has increased substantially; at the time of this study, the Canadian Medical Association (CMA) Infobase included approximately 1,200 CPGs [14]. However, the effect of CPGs on improving the process and outcome of care has varied widely [15]. Furthermore, even for well-developed guidelines, adoption and use are not automatic and depend greatly on how the guidelines are disseminated and implemented [16,17].
In past years, there have been many efforts to improve and standardize guideline development. The AGREE II instrument was developed by an international team of researchers to define the essential components of a good guideline [18]. A review by Vlayen et al. (2005) reported that AGREE II is the most validated compared with 24 other tools, with easily scored numerical scales [19]. Additionally, different frameworks to grade the level of evidence have been released; the Grading of Recommendations Assessment, Development and Evaluation (GRADE) system has been adopted by many organizations and guideline developers, such as NICE and WHO [20]. Recently, the GRADE Working Group developed the Evidence to Decision (EtD) framework to assist the process of progressing from evidence to making clinical recommendations, coverage decisions, and health system or public health recommendations and decisions [21].
Guidelines for screening of osteoporosis have been developed by many agencies and organizations. Previous literature reported that they conveyed mixed messages to primary care physicians [22]. A systematic review by Cranney et al. (2002) of CPGs for postmenopausal osteoporosis released between 1998 and 2001 found that these guidelines were of low quality [23]. Those of acceptable quality were most commonly developed by researchers in the United States; one was from Ontario, Canada, and two were from the United Kingdom.
Most studies from North American and European countries indicate that a high proportion of individuals who are at risk of fragility fractures are not being screened, which reflects low physician adherence to CPGs [4,6,24-26]. Thus, determining the quality of current guidelines is important, and to our knowledge there is no recent review of the quality of CPGs relevant to osteoporosis screening. Therefore, we conducted a systematic review to assess the quality of guidelines using the AGREE II and the IOM standards, determine whether the quality of osteoporosis guidelines has improved over time, summarize the grading systems for level of evidence and strength of recommendations, and compare the recommendations for consistency and concordance.

Methods
This systematic review followed the Cochrane methodology [27] to identify and select the CPGs, and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement to guide reporting [28]. Ethics approval was not required, as this work was based on a systematic literature review.

Search strategy and data extraction
A systematic search for relevant guidelines, covering January 2002 to September 2016, was performed using the following databases: Embase, Medline, PubMed, Canadian Medical Association Infobase, National Guidelines Clearinghouse, and Guidelines International Network (http://www.g-i-n.net/). The websites of some well-known guideline developers, the National Institute for Health and Care Excellence (NICE) and the Scottish Intercollegiate Guidelines Network (SIGN), were also searched, and the references of each guideline were reviewed for other relevant guidelines.
Key words used for the MEDLINE search can be found in S1 Table; they were modified according to the indexing system of each database. Inclusion criteria were: CPGs with recommendations for the adult population; guidelines for screening of osteoporosis, with or without treatment; publication between 2002 and 2016; and guidelines intended for health professionals. Language restrictions were not applied to the search; however, non-English texts were later excluded. Guidelines were also excluded if they only addressed glucocorticoid-induced osteoporosis or specific diseases or conditions such as hyperthyroidism, inflammatory bowel disease, celiac disease, and post-gastrectomy states. Additionally, we excluded position papers and consensus papers, since they are not equivalent to guidelines.
Screening of titles and abstracts and then of full texts was carried out by one reviewer (LA); a second reviewer (SY) screened a sample of 100 full-text articles to check the accuracy of screening. Guidelines that did not meet the inclusion criteria were excluded. One reviewer (LA) extracted the following information: guideline title, authors, publication year, country, the organization that produced the guideline, and the main key recommendations with the systems used for assigning level of evidence and strength of recommendations.

Guidelines appraisal and data analysis
Quality assessment. The AGREE instrument was developed in 2003 and updated in 2010 to the AGREE II instrument [18]. The purpose of the tool is to provide a systematic framework for assessing the methodological rigour and quality of guidelines, and a methodological strategy for developing guidelines [18]. The tool consists of 23 items grouped into 6 domains (scope and purpose, stakeholder involvement, rigour of development, clarity of presentation, applicability, and editorial independence) [S2 Table]. Items are rated on a 7-point scale ranging from 1 (item absent) to 7 (exceptional quality of the item).
Two appraisers (LA and SY) appraised each guideline independently using the AGREE II. Only information included in the released guideline or its references was used for the appraisal, i.e., the reviewers did not refer to additional supporting documents that were published separately unless explicitly indicated by the CPG. Domain scores were calculated by summing the item scores within each domain for each reviewer, then standardising the total as a percentage of the maximum possible score:

$$\text{Scaled domain score} = \frac{\text{obtained score} - \text{minimum possible score}}{\text{maximum possible score} - \text{minimum possible score}} \times 100$$

Descriptive statistical analyses were conducted, and agreement between reviewers was assessed using two-way, random, single-unit, absolute-agreement intra-class correlation coefficients (ICC) [...] [11]. The resulting instrument has eight standards (establishing transparency, management of conflict of interest, guideline development group composition, systematic review intersection, establishing evidence foundation, articulation of recommendation, external review, and updating) with 20 sub-criteria or attributes, and no proposed numerical scoring (the IOM standards with all sub-criteria are listed in S3 Table) [31].
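As an illustration of the AGREE II scaled domain score formula given earlier in this section, the following Python sketch computes the score from raw item ratings. The function name and the example ratings are hypothetical and are not taken from the study. Pooling the ratings of all appraisers before scaling is an assumption that follows the usual AGREE II convention; passing a single appraiser's ratings gives that appraiser's own scaled score.

```python
def scaled_domain_score(ratings_by_appraiser):
    """Return the scaled domain score (0-100) for one AGREE II domain.

    ratings_by_appraiser: one list of item ratings (each 1-7) per appraiser,
    all for the same domain and in the same item order.
    """
    n_appraisers = len(ratings_by_appraiser)
    n_items = len(ratings_by_appraiser[0])
    for ratings in ratings_by_appraiser:
        if len(ratings) != n_items:
            raise ValueError("every appraiser must rate the same items")
        if any(not 1 <= r <= 7 for r in ratings):
            raise ValueError("AGREE II items are rated from 1 to 7")

    obtained = sum(sum(ratings) for ratings in ratings_by_appraiser)
    minimum = 1 * n_items * n_appraisers  # every item rated 1
    maximum = 7 * n_items * n_appraisers  # every item rated 7
    return (obtained - minimum) / (maximum - minimum) * 100


# Hypothetical ratings: two appraisers, a domain with three items.
print(round(scaled_domain_score([[5, 6, 6], [6, 6, 7]]), 1))  # 83.3
```

A domain in which every item is rated 1 scales to 0%, and one in which every item is rated 7 scales to 100%, which is why scores from domains with different numbers of items can be compared on the same scale.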
Differences from the AGREE II tool domains are summarized in S4 Table. The IOM has separate standards for external review and for updating guidelines, whereas it has no standards that assess applicability or resource implications.
We scored each of the 20 sub-criteria as 1 or 0, and a standard was considered met if more than half of its sub-criteria were fulfilled. For example, for standards with 3 sub-criteria, 2 or more had to be fulfilled for the standard to be considered met. One reviewer (LA) used this tool to appraise the guidelines.
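The "more than half" rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function names, the dictionary layout, and the example scores are hypothetical.

```python
def standard_met(subcriteria_scores):
    """Return True if more than half of the 0/1 sub-criteria scores are 1."""
    if any(score not in (0, 1) for score in subcriteria_scores):
        raise ValueError("sub-criteria are scored 0 or 1")
    fulfilled = sum(subcriteria_scores)
    # Integer comparison avoids rounding issues: fulfilled > total / 2.
    return 2 * fulfilled > len(subcriteria_scores)


def count_standards_met(scores_by_standard):
    """Count the standards met, given {standard number: [0/1 scores]}."""
    return sum(standard_met(scores) for scores in scores_by_standard.values())


# Hypothetical guideline with three standards scored.
example = {1: [1, 1], 2: [1, 0, 0], 3: [1, 1, 0]}
print(count_standards_met(example))  # 2
```

Note that under a strict "more than half" rule, a standard with two sub-criteria is met only when both are fulfilled.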
We hypothesized that the quality of guidelines has changed and improved over time, especially after the update of the AGREE II tool in 2010 and the introduction of the IOM standards in 2011. We tested this hypothesis using a non-parametric test, the Wilcoxon rank-sum test (Mann-Whitney test), to test for statistically significant differences in domain scores between CPGs published before 2010 and in or after 2010 (AGREE II update), and in total IOM scores before and after the IOM standards were developed [32]. For the IOM standards, a chi-square test was used to determine whether the proportion of guidelines meeting each IOM standard improved after the standards were developed. If expected cell counts were less than five, Fisher's exact test was used instead [32]. P values less than 0.05 were considered statistically significant.
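The choice between the chi-square test and Fisher's exact test described above can be illustrated with a small pure-Python sketch. The analyses in this review were run in SAS; the code below is not the authors' code, the function names are hypothetical, and the example counts are invented for illustration only.

```python
from math import comb


def expected_counts(table):
    """Expected cell counts of a 2x2 table under independence."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[r * col / n for col in cols] for r in rows]


def choose_test(table, limit=5):
    """Return 'fisher' if any expected count is below the limit, else 'chi-square'."""
    small = any(e < limit for row in expected_counts(table) for e in row)
    return "fisher" if small else "chi-square"


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test P value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d

    def weight(x):
        # Unnormalised hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x)

    low = max(0, row1 - (n - col1))
    high = min(row1, col1)
    observed = weight(a)
    # Sum over all tables that are no more probable than the observed one.
    extreme = sum(w for w in map(weight, range(low, high + 1)) if w <= observed)
    return extreme / comb(n, row1)


# Hypothetical counts, NOT data from this review:
# rows = published before / after the standards, columns = standard met / not met.
example = [[8, 2], [1, 5]]
print(choose_test(example))
print(round(fisher_exact_two_sided(example), 4))
```

For the hypothetical table, one expected count is below five, so Fisher's exact test is selected, and the two-sided P value is about 0.035.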
Comparison of AGREE II tool and IOM. Because the two tools do not have the same items, we compared them by determining whether both identify the same guidelines as being of high quality. For that purpose, with the AGREE II, guidelines were considered of high quality if they scored ≥ 60% in 4 or more domains, including domain 3 (rigor of development), since we consider this domain an important part of guideline quality. This approach to identifying high quality guidelines has been reported in other studies assessing the quality of CPGs [33-35]. Similarly, with the IOM standards, we defined guidelines as high quality if 5 or more standards were met, including standards 4 and 5 (CPG systematic review intersection, and establishing evidence foundations for and rating strength of recommendations, respectively). We did not test statistically whether the difference between the two tools in identifying high quality guidelines was significant, since our sample size is small. Instead, we examined the high quality guidelines on which the tools disagreed, to find out which domains or areas differ between the two tools and which tool may give more trustworthy results. We were able to identify only one study that compared guideline quality using both instruments: Bennett et al. 2016 [36] compared the IOM instrument with the AGREE II, but they changed the method of AGREE II scoring. We opted not to change the scoring of the AGREE II, as this may decrease the validity of the results.
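The two high-quality definitions above can be written as simple rules. The following Python sketch is illustrative only; the function names, data layout, and example values are hypothetical and do not come from the study data.

```python
def agree_high_quality(domain_scores, threshold=60.0, min_domains=4, required_domain=3):
    """AGREE II rule: >= threshold in at least min_domains domains,
    one of which must be the required domain (3, rigor of development).

    domain_scores: {domain number (1-6): scaled domain score in percent}
    """
    passed = {domain for domain, score in domain_scores.items() if score >= threshold}
    return len(passed) >= min_domains and required_domain in passed


def iom_high_quality(standards_met, min_standards=5, required=(4, 5)):
    """IOM rule: at least min_standards standards met, including 4 and 5.

    standards_met: iterable of the standard numbers (1-8) that were met
    """
    met = set(standards_met)
    return len(met) >= min_standards and all(s in met for s in required)


# Hypothetical guideline: four domains reach 60%, but domain 3 does not,
# while five IOM standards are met, including standards 4 and 5.
scores = {1: 80.0, 2: 50.0, 3: 55.0, 4: 75.0, 5: 65.0, 6: 70.0}
print(agree_high_quality(scores))           # False
print(iom_high_quality({1, 2, 4, 5, 6}))    # True
```

The hypothetical example shows how one guideline can be rated differently by the two rules, because each rule makes a different component mandatory.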
Analyses were performed using Microsoft Excel and the SAS 9.4 statistical package, except for inter-rater reliability (ICC and weighted kappa), which was calculated using R statistical software [37].

Search results
A total of 5,818 records were identified from our electronic systematic search of databases, of which 3,143 were excluded as duplicates and 2,448 were found to be irrelevant after screening the titles and abstracts. We found an additional 17 records from national guideline websites and hand searching the references of identified guidelines. Thus, a total of 244 records were screened as full text, and 211 were excluded for a variety of reasons. Finally, 33 guidelines were eligible for assessment (21 from databases and 12 from other sources); they are presented in Table 1. The screening process for CPGs is presented in the PRISMA flow diagram [Fig 1].

Quality of CPGs based on the AGREE II Score
The standardized domain scores, the weighted kappa, and the intra-class correlation coefficient (ICC) values for each guideline are shown in Table 2. Agreement between the two reviewers ranged from ICC = 0.50 to 0.93, except for the Australian guidelines 2010 [38]; according to Cicchetti's cut-off points, this indicates moderate to excellent agreement [30]. We also present the weighted kappa agreement scores, which agree with the ICC; however, the ICC is a preferred method for assessing inter-rater reliability for ordinal scales [39].
The descriptive statistics for each domain are presented in Table 3, with the mean standardized scores for each domain shown in [...]
Domain 2: Stakeholder involvement. This domain focuses on the participation of professional experts, the preferences of the target population in guideline development, and whether target users are clearly defined. The mean score ± SD was 54.57% ± 25.96%, with a very wide range (8% for the British Columbian CPG [44] to 100% for NICE, the Institute for Clinical Systems Improvement (ICSI) and SIGN [9,45,46]). Two thirds of guidelines (21/33) scored below 60% in this domain, as most guidelines did not seek the views of other stakeholders such as patients, the public, payers, and policy makers in guideline development.
Domain 3: Rigour of development. This domain includes eight items that assess the systematic methods used for gathering and synthesizing the evidence and formulating the recommendations, the external peer review process, and the procedure for updating the guideline. The mean score ± SD for this domain was 63% ± 25.53%. Some important guidelines that are frequently cited in the field of osteoporosis, such as the National Osteoporosis Foundation guidelines of 2008 and 2014 [47,48], scored low (33% and 22%, respectively), as they did not report any systematic approach for developing their guideline. Most guidelines did not clearly describe the process of updating.
Domain 4: Clarity of presentation. This domain covers the language, structure and format of the guideline, and emphasizes the clarity of the recommendations. The mean score ± SD was 71.30% ± 16.52%, indicating that most guidelines had clear recommendations.
Domain 5: Applicability. This domain considers the barriers and facilitators to implementation of the guideline, approaches to increase uptake, resource implications of applying the guideline, and monitoring of uptake or adherence to the guideline. Across the CPGs, this was the lowest scored domain, with a mean score ± SD of 43.00% ± 24.45% and a range of 8% to 88%. Only 3 guidelines (SIGN [46], NICE [9] and ICSI [45]) reported monitoring and auditing criteria.
Domain 6: Editorial independence. This domain relates to the formulation of recommendations without undue bias from the funding body or from competing interests of the developers. The mean score ± SD was 53.57% ± 32.57%; almost half of the guidelines did not provide a statement about funding or competing interests. Some guidelines [42,51,69] scored 0% on this domain.

Quality of CPGs determined by the IOM standards
The eight IOM standards with their scores are presented in S5 Table. The mean overall score was 4.18 out of 8 major standards. Four guidelines fulfilled all eight standards (Australia CPGs 2010 [38], NICE 2012 [9], ICSI CPG 2013 [45], and SIGN CPG 2015 [46]). Fig 3 shows that more than 60% of guidelines met the standards for systematic review intersection (67%), establishing evidence and rating strength of recommendation (64%), articulation of recommendation (64%), and establishing transparency (61%). Standard 2, management of conflict of interest, which assesses the disclosure of conflicts of interest of the guideline development group, was met by 55% of CPGs. Less than half of the guidelines fulfilled the standards for external review (42%) and updating (39%). The least fulfilled standard was development group composition, which only 9 CPGs (27%) met; most guidelines did not involve patients or public representatives or include strategies to increase participation of patients or consumers.
Table 5 shows that there was no statistically significant difference in the proportion of guidelines that met each IOM standard after its introduction, except for the systematic review standard, for which the result of the significance test was borderline (difference = -32%, P = 0.05). Furthermore, we assessed the difference in the total IOM score (out of 8) for guidelines published before and after the IOM publication and found no statistically significant change (mean difference = -1.08, P = 0.74, 95% CI = -2 to 2). In essence, the quality of osteoporosis guidelines as assessed by the IOM standards has not changed since the release of the standards in 2011.

High Quality CPGs Identified by both AGREE II Instrument and IOM Standards
When applying our criteria for determining a high quality guideline, AGREE II identified 13 such guidelines (39%) and the IOM standards identified 15 (45%); 11 CPGs were identified by both tools. We examined the four additional high quality guidelines that were identified by the IOM standards only, and found that the two tools may produce different conclusions regarding the domains of systematic review and strength of evidence, which most clinicians may consider most important in deciding which recommendation to follow. For instance, in two guidelines (American College of Physicians 2008 [52] and the Malaysian guidelines 2015 [66]), the AGREE II domain for rigour of development scored < 60%, while the matching IOM standards 4 and 5 were fulfilled. This is because the content of the items that assess these domains differs between the two tools (Table 1); external review and updating are included as part of rigour of development in AGREE II, while each has a separate standard in the IOM. Additionally, the items for the systematic review and the quality of evidence for formulating the recommendations are more detailed in the AGREE II than in the IOM standards. Therefore, the AGREE II gives lower scores if the systematic review process or methodology is not reported in detail in the guideline.
We tested whether the percentage of high quality guidelines differed before and after the introduction of AGREE II and the IOM standards (30.77% before vs 40% after). The difference of 9.23 percentage points was not statistically significant (P = 0.72).

Evidence and recommendations grading systems in use by guideline developers
The guidelines used different grading systems for determining the level of evidence and different systems for assigning the strength of recommendations. Table 6 summarises these systems. [...] [45], and the Scottish Intercollegiate Guidelines Network 2015 (for recommendation strength) [46]. There is worldwide agreement to use the GRADE system, and many respected organizations have moved their systems to GRADE, such as SIGN [46], ICSI [45], and NICE [9].

Summary of comparison between the recommendations
We reviewed the recommendations of the 21 most recent and updated guidelines (2010 onward) in major areas of osteoporosis management: "fracture risk estimation tool before BMD testing", "BMD testing before risk estimation", "when to start treatment", "considering 2 sites for BMD testing", and "BMD testing after treatment".
Use of fracture risk estimation tool before BMD testing. Many tools incorporating multiple risk factors have been developed to estimate fracture risk over 5 or 10 years [72]. FRAX was developed by the WHO in 2008 and is one of the most widely used and validated tools [72]. It estimates the 10-year absolute risk of fracture based on multiple risk factors, with or without bone mineral density testing. Because fracture probability varies by country, FRAX is calibrated using country-specific data where fracture and death rates are known [73].
There was substantial variation between recommendations on whether to use a tool (FRAX) for risk assessment before BMD testing. This variation did not show any pattern related to high or low guideline quality, and mostly differed by region or country. Four guidelines were not clear in their recommendations on whether to use FRAX or BMD testing first (Society of Obstetricians and Gynaecologists of Canada 'SOGC' 2014 [64], the Saudi Arabia CPG [67], the Italian CPG 2016 [69], and the American Association of Clinical Endocrinologists 2016 [71]).
We also found variation between guidelines within the same country. For instance, in Canada, variability between provinces is evident: British Columbia recommends using FRAX to determine the need for DXA [44], whereas in Ontario BMD testing is performed first and FRAX is used afterwards to estimate fracture risk [40], and in Alberta the osteoporosis self-assessment tool (OST) is used to decide the need for BMD testing [70].
BMD testing before FRAX. Only four CPGs (Osteoporosis Canada 2010 [40], Australia 2010 [38], National osteoporosis society of South Africa 2010 [56], and Greece 2011 [57]) recommend BMD testing before FRAX risk estimation, especially for those < 65 years of age. However, the recommendation for using FRAX depends on the availability of country-specific data; in countries like India, where such data are not available, FRAX cannot be recommended for use either before or after BMD testing.
When to start treatment. There was no pattern of consistency or similarity between high and low quality guidelines, or in relation to the date of publication. In general, we found four approaches for setting intervention thresholds. In the first approach, used in countries with no country-specific FRAX data, treatment is based on the T-score from BMD testing, as in the India and South Africa CPGs [42,56]. In the second approach, some guidelines (all the Canadian CPGs) apply a fixed FRAX probability threshold that can be used for men and women irrespective of age. A 10-year risk of major osteoporotic fracture of ≥ 20% is the intervention threshold in most guidelines (Osteoporosis Canada 2010 [40], Taiwan 2011 [60], British Columbia 2012 [44], SOGC 2014 [64], and Alberta 2016 [70]). The third approach uses the T-score threshold first; if the T-score is at the osteopenia level (between -1 and -2.5), a FRAX score threshold is then applied to decide on treatment. CPGs that used this approach are: Greece CPG 2011 [57], Endocrine Society 2012 [62], Institute for Clinical Systems Improvement guideline 2013 [45], NOF 2014 [48], the Malaysian Osteoporosis Society 2015 [66], and the Italian Society 2016 [69].

The fourth and last approach applies an intervention threshold that depends on age; the National Osteoporosis Guideline Group (NOGG) has set an intervention threshold at each age from 40 years onward, provided through a chart with or without BMD testing [65,68].
Sites for BMD testing. There was little discrepancy between guideline recommendations in this area.
BMD testing after treatment. The interval for reassessing BMD varied between CPGs, ranging from 1-2 years or 2-3 years to, in some cases, up to 8 years. However, most guidelines agreed that there is a lack of evidence on the optimum interval or the benefit of repeating BMD testing, and that this area is controversial.

[Table 6, continued (fragment recovered from the extracted text; guideline: level of evidence system; strength of recommendation system). American College of Physicians 2008 [52]: ACP grading system adapted from the GRADE system [52] (High, Moderate, Low); the same ACP system (Strong or Weak). First Update of the Lebanese Guidelines 2008 [53]: not assigned; not assigned. USPSTF 2011 [58]: USPSTF levels of certainty (High, Moderate, Low); USPSTF grades A, B, C, D, I (Insufficient). University of Michigan Health System Guideline 2011 [59]: rating assigned (I, II, III) but the source of the system was not reported; grades assigned but the source was not reported.]

[Table 6, continued (fragment recovered from the extracted text): NICE guidelines 2012 [9] used a modified GRADE system [80], i.e., GRADE plus a review of the quality of cost-effectiveness studies, and do not provide summary labels for the quality of evidence across all outcomes.]

Discussion
We systematically identified and assessed 33 guidelines for screening for osteoporosis published between 2002 and 2016 from 13 countries, using the AGREE II instrument and the IOM standards for trustworthiness, which were developed to appraise the quality of CPGs. Our findings reveal marked variability in the compliance of these guidelines with the criteria of the AGREE II tool and the IOM standards. An examination of the mean AGREE II domain scores showed that the highest mean scores were for the Clarity of Presentation and Scope and Purpose domains, while the lowest were for the Applicability and Editorial Independence domains. This was consistent with reviews on other topics [34,35,82,83]. The applicability domain reflects the implementation of the guideline; however, most guidelines did not give advice on how the guideline should be implemented. One reason for this might be that most guideline development groups do not include experts in knowledge translation or economists, who could advise on strategies for implementation and on assessing economic barriers.
Regarding the domain of stakeholder involvement; most guideline developers didn't seek the views and preferences of their target population especially patients in guidelines development and even when they did they were vague about the process. This is worth considering by all guideline developers, since a patient centred approach to health care with shared decision making is associated with better application of the guidelines and improved health care [84]. The domain of rigor of development which is considered of importance to guidelines quality scored only 60% of the guideline, thus, one third of CPGs had poor development methodology. This differs from the results of the previous review of postmenopausal osteoporosis CPGs by Cranney et al 2002 [23], in which the average score was almost 23%. Nonetheless, this review covered 2001-2002 published guidelines, so including more guidelines in our review may resulted in higher and more accurate mean score. Leslie & Schousboe 2011 reviewed 8 osteoporosis guidelines in an illustrative way rather than a systematic approach [85]. They assessed the quality of these guidelines using 7 items from the 23 AGREE II items and a different method for scoring, therefore, we could not compare our AGREE II results to theirs [85]. Nevertheless, we agree with their findings of conflicting recommendations in the same areas in these guidelines. Similarly, our review agrees with a review by Lewiecki M. in regards to the variability and conflicting recommendations especially in areas of evaluation and treatment of osteoporosis [86].
By assessing the compliance of guidelines with the criteria of the IOM standards, we found that 64%-67% of guidelines fulfilled the standards for establishing evidence, strength of recommendations, and systematic review. However, most guidelines fell short in involving patients and public representatives in guideline development and did not adequately describe the method for external review. Although the IOM standards were developed in 2011, we found few studies that assessed the quality of CPGs using these standards [36,87,88]. These studies used different methodologies, making it difficult to compare results. Reams et al 2013 [88] assessed the quality of oncology guidelines using the IOM tool; their findings were similar to ours, particularly regarding the lower score for guideline development group composition, yet our study found better performance on standards 3, 4 and 5 compared to theirs. We found that applying the IOM standards to assess guideline quality is challenging, since no scoring system is assigned to the criteria and some of the sub-criteria are vague or can be only partially fulfilled. This was also found in the study by Kung et al using the IOM standards, in which many items were excluded because they were "vague and subjective" [87].
In our study, we found that the compliance of guidelines with the criteria of both tools (AGREE II and IOM) showed no change between 2002-2010 and 2011-2016, taking the time of release of both tools as the comparison point. This finding aligns with previous studies evaluating CPGs over time [82,87,89,90,91], while it conflicts with others [92,93,94]. Armstrong et al. 2016 conducted a quality assessment and structured analysis of recommendations on physical activity and safe movement in osteoporosis guidelines [93,35]. They found an improvement in the quality of guidelines over time. Nevertheless, they reported only the average AGREE II scores, without a statistical test of whether this improvement was statistically significant. In our review, we also found more guidelines of high quality after 2010, yet the improvement was not statistically significant (P value >0.05). A more recent review that assessed the quality of guidelines in a variety of health topics reported that quality has improved over time, in contrast to our finding [93]. Since it is a review of many health topics, the quality of guidelines might have improved in these topics but not in osteoporosis. This lack of improvement in osteoporosis guidelines should be examined; we think that it is perhaps due to the lack of studies with direct evidence on screening for osteoporosis, which was reported in many guidelines (NICE 2012 [9], and USPSTF 2011 [58]).
In comparing the AGREE II tool with the IOM standards, the AGREE II was more comprehensive, as it covered implementation and dissemination issues of the guidelines, while the IOM standards did not cover this area. Both instruments identified a lack of compliance in domains related to multidisciplinary development group composition, including the involvement of patients and public representatives. Both tools also showed that most guidelines fell short on the criteria for the influence of the funding body and conflicts of interest, as very few CPGs met these criteria.
We used certain criteria to identify high quality guidelines; AGREE II identified fewer high quality guidelines than the IOM standards. By examining the items or criteria for the domains of rigor of development and systematic review methodology in the four extra high quality guidelines identified by the IOM standards, we found that the AGREE II gave lower scores for these guidelines. This is because the AGREE II has more detailed quality items for this section, while the IOM has fewer criteria in this section, which may have resulted in higher scores for the guidelines. This may have important implications for clinicians and stakeholders in deciding which guideline to implement when only one of the two tools is used to assess the quality and rigor of the guidelines.
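To make the AGREE II side of this comparison concrete, the following is a minimal sketch of how a high quality classification can be computed. It assumes the scaled domain score formula from the AGREE II user manual and the cut-off stated in our methods (a score of at least 60 in four or more domains, including rigor of development). The function names, domain labels and example scores are hypothetical and are not data from this review.

```python
from typing import Dict, List

RIGOR = "rigor_of_development"  # hypothetical label for AGREE II domain 3


def scaled_domain_score(ratings: List[List[int]]) -> float:
    """Scaled AGREE II domain score in percent.

    ratings[a][i] is the 1-7 rating given by appraiser a to item i
    of a single domain.
    """
    n_appraisers = len(ratings)
    n_items = len(ratings[0])
    obtained = sum(sum(row) for row in ratings)
    minimum = 1 * n_items * n_appraisers  # every item rated 1
    maximum = 7 * n_items * n_appraisers  # every item rated 7
    return 100.0 * (obtained - minimum) / (maximum - minimum)


def is_high_quality(domain_scores: Dict[str, float],
                    threshold: float = 60.0,
                    min_domains: int = 4) -> bool:
    """Apply the assumed rule: >= threshold in at least min_domains
    domains, one of which must be rigor of development."""
    passed = [d for d, s in domain_scores.items() if s >= threshold]
    return RIGOR in passed and len(passed) >= min_domains


# Hypothetical example: two appraisers rating the 8 rigor items.
rigor_score = scaled_domain_score([[5] * 8, [6] * 8])

# Hypothetical guideline with made-up scores for the other domains.
example = {
    "scope_and_purpose": 80.0,
    "stakeholder_involvement": 40.0,
    RIGOR: rigor_score,
    "clarity_of_presentation": 85.0,
    "applicability": 30.0,
    "editorial_independence": 65.0,
}
print(rigor_score, is_high_quality(example))
```

Because the rule requires the rigor of development domain to pass, a guideline that scores well on presentation and scope but poorly on development methodology is not classified as high quality, which is one way the more detailed AGREE II items can yield fewer high quality guidelines than a checklist with fewer criteria.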
Our systematic review emphasizes the variability in the use of different grading systems to aggregate the level of evidence and to rate the strength of recommendations. Establishing the level of evidence that underlies the recommendations is essential in guideline development. Without clarity about the evidence grading system used, guideline users cannot decide whether recommendations are built on strong or weak evidence. Additionally, determining the strength of recommendations influences the applicability and implementation of the guideline. Different frameworks or grading systems have been developed, yet the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) system is now considered one of the best and has been adopted by many organizations [20]. Surprisingly, nine guidelines did not use any system for level of evidence or strength of recommendation, and only seven guidelines used the GRADE system (Table 6).
The content analysis of selected areas of recommendations for screening individuals without previous fractures revealed considerable variability. This variability did not differ between high and low quality guidelines but differed mainly by region or country. Guidelines varied widely in their approach to screening, in particular in whether BMD testing or risk assessment is used first. The other conflicting area relates to choosing the intervention threshold for those at risk of fracture. Some guidelines use bone mineral density T-score diagnostic criteria. Others use BMD testing with FRAX scores rather than the T-score results alone, and this approach is the most widely adopted in more recent guidelines.
Concordant recommendations were found for BMD testing sites. We expect some variation between guideline recommendations, since they are based on country-specific data and cost-effectiveness. However, we found that recommendations differ even within the same country. For instance, in Canada, British Columbia CPGs recommend using FRAX before DXA [44], while in Ontario DXA is recommended before FRAX [40], and in Alberta the osteoporosis self-assessment tool is used to decide the need for DXA [70]. The lack of uniformity between guidelines probably creates confusion for clinicians, and may subsequently affect adherence to the guidelines and diminish the quality of care for patients.

Study limitations and strengths
There are several limitations to this review. First, we included only CPGs in English, so guidelines in other languages were not assessed. Second, the AGREE website suggests two to four reviewers, with four preferable to increase the reliability of the assessment, whereas we used two. Third, in scoring the AGREE II, only the published information about the guidelines was used in the assessment, i.e. we did not look at the methodology documents of the organizations, which could have been available on their websites. Thus, we may have underscored some domains. Fourth, assessing quality using the IOM standards was challenging because we could not find any suitable published methodology that was used in previous studies, and very few studies have used these standards to assess the quality of guidelines (discussed in the methods section). Thus, we are not certain about the validity of our methodology. This may also affect the results of the comparison between the AGREE II and the IOM standards, which, as a result of our IOM methodology, could be liable to bias.
The main strength of this review is that we assessed the quality of osteoporosis screening guidelines over a 14-year period to determine changes in guideline quality over time. Our search was systematic and comprehensive, including all the general databases, guideline websites and major guideline developer groups, as well as hand searching of the references of all identified guidelines. Another strength is that we used two recognised tools to assess guideline quality and compared the results from both tools, which to our knowledge has been done by only one study [36]. We had good interrater agreement between the two reviewers, which increases the reliability of the study. The AGREE II tool does not assess the content of guideline recommendations; therefore, another strength is that we examined the content of the recommendations and provided a summary comparison between the guidelines. In addition, we summarised the different grading systems for level of evidence and strength of recommendations.

Conclusion
The quality of CPGs for screening of osteoporosis, as defined by the AGREE II and the IOM standards, is variable, and there is considerable room to improve the guideline development process in this field as well as the reporting of guideline development. Guideline developers should develop their guidelines paying attention to the criteria and standards included in the AGREE II instrument and the IOM standards for trustworthy guidelines. The reporting of applicability considerations and editorial independence appears weak. The inclusion of patients, economists, and knowledge translation experts, as well as other stakeholders, should be considered as a means of improving the quality of guidelines and their likelihood of implementation. The lack of consensus on specific guideline recommendations for osteoporosis screening is problematic and creates confusion for clinicians and patients about what exactly is best practice.
Supporting information S1