Using and Reporting the Delphi Method for Selecting Healthcare Quality Indicators: A Systematic Review

Objective: The Delphi technique is a structured process commonly used to develop healthcare quality indicators, but little guidance is available for researchers who wish to use it. This study aimed 1) to describe the reporting of the Delphi method when used to develop quality indicators, 2) to discuss methodological aspects specific to quality indicator selection, and 3) to give guidance about this practice.
Methodology and Main Findings: Three electronic databases were searched over a 30-year period (1978–2009). All articles that used the Delphi method to select quality indicators were identified. A standardized data extraction form was developed. Four domains (questionnaire preparation, expert panel, progress of the survey, and Delphi results) were assessed. Among the 80 included studies, the quality of reporting varied considerably between items (from 9% for the experts' number of years of experience to 98% for the type of Delphi used). Reporting of the methodological aspects needed to evaluate the reliability of the survey was insufficient: only 39% (31/80) of studies reported response rates for all rounds, 60% (48/80) that feedback was given between rounds, 77% (62/80) the method used to achieve consensus, and 57% (48/80) listed the quality indicators selected at the end of the survey. A modified Delphi procedure, with a physical meeting of the panel members (usually between Delphi rounds), was used in 49/78 (63%) studies. The median number of panel members was 17 (Q1: 11; Q3: 31). In 40/70 (57%) studies, the panel included multiple stakeholders, who were healthcare professionals in 95% (38/40) of cases. Among the 75 studies describing criteria to select quality indicators, 28 (37%) used validity and 17 (23%) feasibility.
Conclusion: The use and reporting of the Delphi method for quality indicator selection need to be improved. We provide guidance to help investigators improve the use and reporting of the method in future surveys.


Introduction
The Institute of Medicine defines healthcare quality as "the degree to which health services for individuals and populations increase the likelihood of desired health outcomes and care consistent with current professional knowledge" [1]. Improving the quality and safety of healthcare has generated considerable attention in recent years [2]. As part of this thrust, authorities and healthcare professionals have used a wide range of methods and tools to promote quality improvement. During the past decade, the development and implementation of quality indicators (also known as performance indicators or quality measures) has been largely driven by the arrival of computerised administrative and clinical databases and the desire to make performance data publicly available [3]. Many governmental associations and professional bodies have developed quality indicators for different areas in order to improve the quality of care and detect suboptimal care in structure, process, or outcome [4].
The information required to develop quality indicators can be obtained using systematic or nonsystematic methods. Nonsystematic approaches such as case studies are based on data availability and real-time monitoring of critical incidents [5]. Although these approaches play an important role, they fail to exploit much of the available scientific evidence. In systematic approaches, in contrast, indicator selection relies directly on the available evidence, complemented when necessary with expert opinion [6,7]. Experts examine the evidence and reach a consensus. Systematic methods enhance decision making [8]; facilitate the development of quality indicators or review criteria for areas where the evidence alone is insufficient [9] or controversial [10,11]; and synthesize accumulated expert opinion. Among these methods, the Delphi technique has been widely used for quality-indicator development in healthcare.
The Delphi technique is a structured process that uses a series of questionnaires or 'rounds' to gather information. Rounds are held until group consensus is reached [10,12]. One of the main reasons for the popularity of the Delphi technique is that a large number of individuals across diverse locations and areas of expertise can be included anonymously, thus avoiding domination of the consensus process by one or a few experts [13]. Adler et al. [14] defined the Delphi technique as an exercise in group communication that brings together and synthesizes the knowledge of a group of geographically scattered participants who never meet.
The Delphi technique is among the methods used to develop prescribing indicators [9], indicators reflecting patient and general practitioner perspectives of chronic illness [15], performance indicators for emergency medicine [16], and indicators for cardiovascular disease [17]. Currently, there are no universally accepted requirements for using the Delphi technique [8]. Considerable confusion, disagreement, and uncertainty exist concerning the parameters of the Delphi technique such as the definition of group consensus, Delphi technique variants, expert selection, number of rounds, and reporting of the method and results [18].
The main objective of this study was to describe and discuss the use of the Delphi technique for quality indicator selection and to assess the reporting of the method and results. We sought to identify specific methodological criteria regarding the use of Delphi techniques for quality indicator selection. Finally, we developed best-practice guidance.

Article Selection
We searched MEDLINE via PubMed, EMBASE, and the Cochrane Library using the search terms "Delphi" AND "Healthcare", with no date limits. We chose these broad terms because using restrictive terms might have failed to retrieve all the articles of interest. We identified all reports of studies in which Delphi techniques were used to select quality indicators.
Retrieved articles were assessed by one of us (RB), who read the titles and abstracts to identify the relevant studies. Articles were included only if the study assessed the use of Delphi techniques to select quality indicators in healthcare and was published as a full-text article. We excluded studies reported only in abstract form, editorials, methodological studies, comments, and duplicate publications.
A further search was conducted in both PubMed and EMBASE using the more specific terms ("Quality Indicators, Health Care"[Mesh]) AND ("Delphi Technique"[Mesh]). We then compared the results of the two search strategies.
To evaluate the use and reporting of Delphi techniques and results, we developed a standardized data extraction form (APPENDIX S1). The items in the form were selected based on information from articles identified through a literature search [6,8,19]. These items pertained to preparation of the Delphi questionnaire, selection of the experts, characteristics of the survey, and reporting of the results.
Before the study, two of us (RB and ML) independently evaluated 10 articles taken at random among the articles selected for the study. They met to discuss the interpretation of the items and to resolve any differences in scoring. Global reproducibility was high, with a median κ of 1 (Q1: 0.8; Q3: 1).
One of us (RB) recorded the data from each selected article on the standardized data extraction form. For each article, we recorded the date of publication, the name of the journal, and the medical specialty of the study.
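The reproducibility check described above can be illustrated with a minimal sketch of Cohen's κ for one extraction item scored by two raters. The function name `cohen_kappa` and the example ratings are hypothetical and are not taken from the study; the article does not state which κ variant was used, so the unweighted form is assumed here.

```python
from collections import Counter
import statistics

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same articles on one item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of articles on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical category throughout; kappa is
        # undefined, and reporting 1.0 is a convention assumed for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical scores for two extraction items over 10 articles (1 = reported).
item_1 = cohen_kappa([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
item_2 = cohen_kappa([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0, 0, 1])
median_kappa = statistics.median([item_1, item_2])
```

Computing one κ per item and then summarizing the items by their median and quartiles mirrors the way the global reproducibility is reported in the text.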
In addition, the following data were extracted.

Data on Delphi Questionnaire Preparation
Quality indicators were divided into three categories based on whether they related to structure, process, or outcome. Structure refers to static or technical aspects of care (e.g., attributes of service providers or of the healthcare institution). Process refers to steps taken in caring for the patient and outcome to the impact of care on the health status of patients or populations [20].
We recorded the method used to select the indicators included in the first questionnaire and the criteria used to select indicators in each round. We checked whether the article included a clear definition of the selection criteria and/or the definition used in the Delphi questionnaire. We also recorded whether the selection criteria used in the first round were the same as those used in the next round. We extracted the number of quality indicators at the beginning of the survey and determined whether the experts could add indicators they felt deserved evaluation in subsequent rounds.

Data on the Expert Panel Size and Composition
For each selected article, we recorded the number of experts invited to participate and whether these experts were first asked about their willingness to participate. We recorded the data supplied in the article about the experts (i.e., specialty, age and years of experience), the composition of the panel (e.g., patients, informal care providers, healthcare professionals, managers), and whether the panel included professionals from a single specialty or from multiple specialties. We determined how the experts were chosen (e.g., willingness to participate, expertise, or membership in an organization). We evaluated the relationship between the response rate and the use of specific methods to encourage the experts to respond (e.g., stamped addressed envelope for returning the questionnaire and financial compensation).

Data on Progress of the Survey
We evaluated the type of Delphi technique used in each study. We defined the basic Delphi technique as any type of self-administered questionnaire with no meetings and modified Delphi techniques as the combined use of a self-administered questionnaire and of a physical meeting of the experts to discuss the results or rate the indicators [21,22]. When a modified Delphi technique was used, we determined whether the meeting was held before, after, or between Delphi rounds and what the participants did during the meeting. We recorded the number of rounds. For the basic Delphi method, each round consisted of the completion of a structured questionnaire with the goal of achieving a consensus. For modified Delphi methods, in addition to questionnaire-based rounds, the physical meeting was counted as a round. The time taken to complete the Delphi procedure was recorded, as well as the geographic scope of the survey. We recorded the main methods used to send the questionnaires (e.g., mail, e-mail, or fax). For each study, we checked the formulation of the questionnaire items (e.g., open questions, rating of quality indicators, or both) and whether the quality indicators were rated (in which case, we recorded the minimum and maximum values on the rating scale). We recorded the method used to define a consensus among panel members, whether the percentage of agreement was determined, and whether a cut-off (e.g., median value) was used to select indicators.
We evaluated the methods used by the Delphi procedure organisers to send the responses back to the panel. More specifically, we determined whether the experts were informed of both the response of the group and their own individual response (individual feedback) to each item. For each study, we recorded the type of feedback, which was defined as qualitative when a summary of the panel's comments was sent to each participant and quantitative when simple statistical summaries illustrating the collective opinion (e.g., central tendency and variance) were sent to each participant.
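As an illustration of quantitative and individual feedback as defined above, the following minimal sketch builds the summary that could be returned to one panel member for one indicator. The function name `build_feedback`, the field names, and the example ratings are hypothetical; the choice of the population variance is an assumption, since the text only mentions "central tendency and variance".

```python
import statistics

def build_feedback(ratings, participant):
    """Summarize one indicator's ratings for one panel member.

    ratings: dict mapping an expert identifier to that expert's rating.
    participant: identifier of the expert receiving the feedback.
    """
    values = list(ratings.values())
    return {
        # Quantitative feedback: simple statistics describing the group opinion.
        "group_median": statistics.median(values),
        "group_min": min(values),
        "group_max": max(values),
        "group_variance": statistics.pvariance(values),
        # Individual feedback: the participant's own previous response.
        "own_rating": ratings.get(participant),
    }

# Hypothetical round-1 ratings of one indicator on a 1-9 scale.
round_1 = {"expert_1": 7, "expert_2": 8, "expert_3": 9, "expert_4": 4}
feedback = build_feedback(round_1, "expert_2")
```

Qualitative feedback, that is, a summary of the panel's comments, would be attached alongside this dictionary and is not modelled here.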

Data on Delphi Results
We recorded the number of quality indicators selected at the end of the Delphi procedure. We searched each article for a list of quality indicators and, when such a list was found, we determined whether it included all the quality indicators used for the first round or only the indicators selected at the end of the last round. We looked for a flow chart of quality indicators (figure showing the output and input indicators at each round) and/or for a written description of indicator flow, as well as the availability of the questionnaires in the article itself or in an appendix. Finally, we recorded the response rate for each round if available.

Statistical Analysis
We computed the medians and the first and third quartiles for continuous variables and the number (%) of articles for categorical variables. Percentages for each characteristic were computed using the total number of articles reporting that characteristic as the denominator. Statistical analyses were performed using SAS version 9.1 (SAS Institute Inc, Cary, NC, USA).
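The descriptive statistics above can be sketched in a few lines. This is an illustrative Python equivalent rather than the SAS code used in the study; the function names are hypothetical, and the "inclusive" quartile method is an assumption that may not match the percentile definition SAS applies by default.

```python
import statistics

def summarize_continuous(values):
    """Median with first and third quartiles for a continuous variable."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"median": statistics.median(values), "q1": q1, "q3": q3}

def percent_reporting(count, n_reporting):
    """Percentage whose denominator is the number of articles that
    reported the characteristic, not the total number of articles."""
    if n_reporting <= 0:
        raise ValueError("no article reported this characteristic")
    return round(100 * count / n_reporting)

# 49 modified Delphi procedures among the 78 articles specifying the type.
modified_delphi_pct = percent_reporting(49, 78)
# Hypothetical panel sizes from five studies.
panel_summary = summarize_continuous([9, 11, 17, 31, 60])
```

Using the reporting articles as the denominator explains why denominators in the Results differ from item to item (for example, 78, 75, or 70 rather than 80).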

Results
Of the 1241 articles retrieved by our database search, 91 were selected based on the titles and abstracts (FIGURE 1); of these, 80 were included in the final analysis. The included articles are described in TABLE 1. All were published between 1978 and 2009; however, most of them (n = 64, 80%) were published recently, i.e., after 2000 (FIGURE S1). The search strategy based on restrictive terms retrieved the same articles. The most often used criteria were validity (28/75, 37%) and feasibility (17/75, 23%). However, a substantial proportion of studies used their own selection criteria (TABLE S1).

Delphi Questionnaire Preparation
Among articles that listed the indicator selection criteria, only 61/75 (81%) clearly defined these criteria. Examples of definitions are given in BOX S1. Selection criteria changed between rounds in 13/73 (18%) studies. For example, in one study, indicators were selected based on "applicability" in the first round and based on "validity" and "importance" in subsequent rounds. Only 31/70 (44%) studies allowed the experts to add indicators during the Delphi procedure.

Characteristics of the Delphi Participants (TABLE 3)
The number of individuals invited to participate was reported in 76/80 (95%) articles. Authors reported that they asked participants about their willingness to participate in the survey before the first Delphi round in only 21/80 (26%) studies. Only 10/80 (13%) studies described the use of specific techniques to encourage participation, and there was no statistically significant difference in first-round response rates between studies where such techniques were reported and other studies (89.5% vs. 90.0%, p = 0.6).

Data on the Progress of the Delphi Procedure
Of the 80 studies, 49 were modified Delphi procedures and 29 were basic Delphi procedures; procedure type was not specified in 2 articles (TABLE 4).
The number of rounds was reported in 66/80 (83%) studies. The methods used to define a consensus were not described in 18/80 (23%) studies and were unclear in 3/62 (5%) studies. Five main methods were used to achieve a consensus about the selected indicators. (a) In 22/62 (35%) studies, indicators with median scores above a predefined threshold and a high level of agreement among panel members were selected; an example is selection of indicators having a median score of 8 or more with 75% or more of the ratings being in the lowest or highest tertile. (b) In 10/62 (16%) studies, selection was based only on a median score greater than a predefined threshold (e.g., indicators having a median score of 7 or more were selected). (c) In 9/62 (15%) studies, the proportion of experts who rated the indicator within the highest region of the scale had to be greater than a predefined threshold (e.g., 75% or more of the experts giving the indicator scores of 7, 8, or 9). (d) In 8/62 (13%) studies, RAND/UCLA criteria for agreement were used (for a 9-member panel using a 9-point Likert scale, no more than 2 members rate the indication outside the 3-point region (1-3; 4-6; 7-9) containing the median) [23]. (e) Finally, in 2/62 (3%) studies, indicator selection relied on the interpercentile range (IPR) and interpercentile range adjusted for symmetry (IPRS), with an IPR value greater than the IPRS value indicating that the indicator was rated with disagreement [23]. Concerning the methods used by organisers to send the responses back to the panel, 40% (32/80) of studies did not report that feedback was given to panel members between rounds and 61% (49/80) did not report that each member's own responses were fed back to him or her.
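To make rules (a) to (d) concrete, the following minimal sketch applies each of them to the ratings of one indicator on a 1-9 scale. The function names and example ratings are hypothetical, the thresholds are the examples quoted above, and rule (a) is interpreted here as requiring the ratings to cluster in the highest tertile (7-9), which is an assumption. Rule (e) is omitted because the text does not give the IPRS formula.

```python
import statistics

HIGH_REGION = (7, 9)  # highest tertile of a 1-9 rating scale

def share_in_region(ratings, low, high):
    """Fraction of ratings falling in the closed interval [low, high]."""
    return sum(low <= r <= high for r in ratings) / len(ratings)

def median_with_agreement(ratings, median_cutoff=8, min_share=0.75):
    """Rule (a): median at or above a cut-off AND enough ratings in the top tertile."""
    return (statistics.median(ratings) >= median_cutoff
            and share_in_region(ratings, *HIGH_REGION) >= min_share)

def median_only(ratings, median_cutoff=7):
    """Rule (b): median at or above a cut-off."""
    return statistics.median(ratings) >= median_cutoff

def share_only(ratings, min_share=0.75):
    """Rule (c): enough experts rating the indicator 7, 8, or 9."""
    return share_in_region(ratings, *HIGH_REGION) >= min_share

def rand_ucla_agreement(ratings, max_outside=2):
    """Rule (d), 9-member panel: at most `max_outside` ratings fall outside
    the 3-point region (1-3, 4-6, 7-9) that contains the median."""
    if len(ratings) != 9:
        raise ValueError("this sketch only covers a 9-member panel")
    # With an odd panel size the median is one of the observed integer ratings.
    median = int(statistics.median(ratings))
    region_low = 1 + 3 * ((median - 1) // 3)  # 1, 4, or 7
    outside = sum(not (region_low <= r <= region_low + 2) for r in ratings)
    return outside <= max_outside

# Hypothetical ratings from a 9-member panel.
clustered = [7, 8, 8, 9, 9, 9, 8, 7, 5]
dispersed = [1, 2, 3, 5, 5, 6, 8, 9, 9]
```

Note that rule (d) only tests agreement; whether the indicator is then retained still depends on where the median lies. The same ratings can also pass one rule and fail another, which is why the consensus definition needs to be fixed and reported before the first round.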

Data on Delphi Procedure Results
Response rates for all rounds were reported in only 39% (31/80) of studies. For these, the median response rate was 90% (Q1: 80%; Q3: 100%) in the first round (87% for basic Delphi and 92% for modified Delphi studies) and 88% (Q1: 69%; Q3: 96%) in the last round (90% for basic Delphi and 87% for modified Delphi studies). The number of indicators selected at the end of the survey was reported in 68/80 (85%) articles, in which the median was 29 (Q1: 18; Q3: 52.5). The lists of indicators were available in 69/80 (86%) reports but the final set of indicators was given in only 46/69 (67%) reports. The list of indicators included in the first questionnaire was available in 23/69 (33%) articles and additional information on selection in 8/69 (12%) articles (discarded indicators, 2 articles; sample of selected indicators, 2 articles; indicators included in the next round, 2 articles; indicators given high scores, 1 article; and indicators included after external peer review, 1 article). Finally, 37/80 (46%) articles included a flow chart of the indicators. A single study provided the Delphi questionnaires in an appendix.

Discussion
We appraised the use and reporting of Delphi procedures for selecting healthcare quality indicators. We included 80 articles published as of December 2009. Most studies used a modified Delphi procedure with a physical meeting, usually between Delphi rounds. Considerable variability was noted across studies in the characteristics of the Delphi procedure. Moreover, study reports did not consistently provide details that are important for interpreting the results. For example, only 39% of studies reported that individual feedback was given between rounds and the method used to define a consensus was specified in only 77% of studies. Moreover, response rates for all rounds were reported in only 39% (31/80) of studies. Information on both points is needed to evaluate the validity and credibility of the results. If the Delphi method is incompletely described, the overall quality of the final consensus cannot be judged, and the selected indicators are unlikely to gain the level of credibility needed for adoption in clinical practice. Our results are supported by those of Sinha and colleagues [24], who identified considerable variability in the methodology and reporting of this method when used to select core outcomes in clinical trials.
To our knowledge, this is the first systematic review of the use and reporting of Delphi procedures for selecting healthcare quality indicators. The strengths of the study include the retrieval of studies published over a 30-year period (1978-2009) and the use of a standardized data extraction form based on data from a literature search. However, our study has limitations. No consensus exists on how to assess the applicability of a Delphi procedure. Consequently, we identified applicability items based on a literature review, and these items may vary in relevance. Several modifications of the original Delphi method have been described in the literature, but standardized definitions of these modifications are not available. We defined a modified Delphi procedure as Delphi rounds plus a physical meeting, in keeping with the definition given in most of the included articles. Finally, a single investigator screened all retrieved articles for eligibility and collected all the data. However, a quality assurance procedure was performed.

Criteria Used to Select QI Depend on the Survey Objective
The Delphi technique has been used since the late 1970s for quality indicator selection in the field of healthcare [25,26]. Ideally, quality indicators would be based on evidence from rigorously conducted empirical studies. In practice, however, such evidence is rarely available in adequate amounts [27]. Therefore, quality indicators for healthcare must be selected partly or largely based on the opinions and experience of clinicians and others with knowledge of the relevant topic [28,29].
In healthcare, several criteria are used to select indicators via the Delphi method. We found that the most commonly used criterion was validity. Validity is defined as the extent to which the characteristics of the indicator are appropriate for the concept being assessed. Generally, this criterion is used when the objective is to develop new indicators in a given field. Indicators selected via consensus methods such as the Delphi procedure have high face validity, which is a prerequisite for any quality indicator. However, validity is not enough: like any measuring instrument, such as health measurement scales or analytical methods, quality indicators should exhibit other characteristics and the required metrological properties [30]. Indeed, an indicator is considered a good measure if it meets criteria including reliability, sensitivity, specificity, and feasibility (or applicability) [20,31]. The common use of these characteristics can facilitate acceptance and implementation of the indicators developed. For example, it has been shown that the validity and feasibility of a specific guideline predict implementation of the guideline in the clinical setting [32].
We noted that selection criteria changed between rounds in 13/73 (18%) studies. According to the rules of the Delphi procedure, the selection criteria should be the same in all rounds. Changing the selection criteria is the equivalent of conducting a distinct Delphi procedure, in which case achieving a consensus is extremely difficult.

Simple Means to Increase Adhesion
Selection of the panel members may be crucial if the group consensus technique is to work properly [33]. Participants should be chosen based on their willingness to participate and knowledge of the relevant topic [34]. To maintain a high response rate throughout the Delphi procedure, participants should be asked whether they want to commit to the project. For instance, they could be sent an information letter explaining the method and the reasons their participation in the whole process is necessary, as well as a form for collecting their consent to complete the entire Delphi process.

A Heterogeneous Panel Member
Studies have shown that panel composition influences ratings [35]. Indeed, ratings vary across specialties [36] and between mixed and single-specialty panels [37,38]. Studies in psychology [39] suggest that heterogeneity in a decision-making group may lead to better performance than homogeneity. To enhance the credibility and acceptance of quality indicators, the panel should reflect the full range of stakeholders who have an interest in the results of the study. Moreover, different stakeholders often have very different points of view about quality of care [40], which enriches the results of the Delphi procedure. Therefore, depending on the study objective, inclusion in the panel of healthcare-quality professionals, patients or patient representatives, and methodologists should be encouraged. To obtain a panel that is representative of all stakeholders concerned by the study results, the study design must specify the characteristics of the participants, such as gender, professional experience, education, or employment.
The Delphi procedure can be a long process. The participants must complete the questionnaire despite their busy schedules, and nonrespondents must be contacted. Duffield [33] reported that each round can take up to 8 weeks to complete. This is probably due to the need to follow up nonrespondents and the time needed to adequately analyze the results and prepare feedback for the next round.

Administration of Questionnaire: Which Ways?
Delphi questionnaires are usually sent by mail, although Internet-based questionnaires are being increasingly used [41-43] to save time and to increase dissemination. However, a study showed significantly lower response rates with Internet-based questionnaires than with mailed questionnaires [44]. Conceivably, using both mail and the Internet might improve questionnaire dissemination and increase response rates. Moreover, the advantage of the Delphi procedure is that experts who live and work far apart from each other can participate. However, we found that only 11/71 (13%) studies included participants from different countries. In 47/73 (64%) studies in our review, the panellists rated the indicators on a Likert scale (usually ranging from 1 to 9) and were able to make comments. This method allows panellists to explain their choices and to express their views on the indicators, thus supplying the investigators with useful information for developing the questionnaires for subsequent rounds.

Delphi vs Modified Delphi
Delphi participants are polled individually, usually via self-administered questionnaires with no physical meeting, over two or more rounds. After each round, the results are reported to the group.
In more than half the studies included in our review, at least one physical meeting of panel participants was held. Having a physical meeting contradicts one of the basic rules of the Delphi procedure, which is avoidance of situations that might allow one of the panel members to dominate the consensus process. Conversely, absence of a meeting may deprive the Delphi procedure of benefits related to face-to-face exchange of information, such as clarification of reasons for disagreements [45]. For example, other formal consensus methods such as the nominal group technique [46] and the RAND/UCLA Appropriateness Method [23] use a highly structured meeting to gather information from relevant experts. In the RAND/UCLA Appropriateness Method, no indications are discarded between rounds and, consequently, no potential information is lost. In the nominal group method, the meeting involves rating, discussing, and finally re-rating a series of items. A panel meeting at the end of the Delphi procedure may be useful when reaching a consensus is difficult. The best strategy would be a physical meeting at the end of the last round to exchange views and resolve uncertainties. However, the meeting should be well structured and should take place under favourable conditions (good surroundings and general environment) [47] with a moderator to contain the influence of dominant personalities. Studies on methods involving face-to-face interaction show that the way a meeting is structured and organized affects the group interactions [48] and influences the manner in which the group produces results.

No Consensual Definition of "Consensus"
As previously mentioned, the most sensitive methodological issue with the Delphi method is the definition of a consensus among participants. The investigators must decide how agreement among participants will be measured and, if the agreement rate is used, what cut-off will be used to define a consensus. Our review shows that the method used to define a consensus varied across the included studies. The definition proposed by the RAND researchers [23] was widely used, although in some cases the number of panellists was not a multiple of 9. In one study [49] involving two Delphi rounds, the agreement rate used to define a consensus was higher in the second round than in the first. In another study, the procedure was stopped when the last two rounds showed no significant difference in results as assessed using the Wilcoxon signed rank test [43]. Since as many rounds are held as needed to achieve a consensus or the 'point of diminishing returns', the criterion used to define a consensus influences the number of rounds. Stopping the Delphi procedure too soon may lead to results that are invalid or not meaningful, but a large number of rounds may cause participant fatigue with steep dropout rates [50]. The recommended number of rounds is two or three, in keeping with our results. However, there is very little scientific evidence on which to base decisions about the optimal number of rounds.

Feedback between Rounds: Important Aspect of the Methodology
It has been recommended that feedback should include qualitative comments and statistical measures [51]. After each round, each participant should be given the panel results (median, lowest, and highest ratings), the participant's own response, and a summary of all comments received. These data inform each participant of his or her position relative to the rest of the group, thus assisting in decisions about replies during future Delphi rounds.
In conclusion, the Delphi procedure is valuable for achieving a consensus about issues where none existed previously. However, our findings indicate a need for improving the use and reporting of this technique. In TABLE 5, we outline practical guidance that may improve the use and reporting of the Delphi method in quality indicator research. We are aware that the Delphi procedure is used in many other settings, in which an appraisal of Delphi practices should also be performed. Nevertheless, our review provides helpful information on the use of Delphi in our field, and additional research is needed to investigate its use in other settings. Determining when to stop the Delphi procedure is also a major issue. The optimal size and composition of the panel need to be determined. Authors must strive to provide sufficient detail on the method they use.

Supporting Information
Appendix S1 Data extraction form.  Box S1 Example of definition of selection criteria. (DOC)