
Matching the right study design to decision-maker questions: Results from a Delphi study

  • Cristián Mansilla,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software

    camansil@gmail.com

    Affiliations McMaster Health Forum, McMaster University, Hamilton, Ontario, Canada, Health Policy PhD program, McMaster University, Hamilton, Ontario, Canada

  • Gordon Guyatt,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Supervision, Validation, Visualization, Writing – review & editing

    Affiliation Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada

  • Arthur Sweetman,

    Roles Data curation, Formal analysis, Investigation, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation Department of Economics, Faculty of Social Sciences, McMaster University, Hamilton, Ontario, Canada

  • John N Lavis

    Roles Conceptualization, Data curation, Investigation, Methodology, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations McMaster Health Forum, McMaster University, Hamilton, Ontario, Canada, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada

Abstract

Research evidence can play an important role in each stage of decision-making, and evidence-support systems play a key role in aligning the demand for and supply of evidence. This paper provides guidance on which types of study design most suitably address the questions asked by decision-makers. The study used a two-round online Delphi approach, including methodological experts from different areas, disciplines, and geographic locations. Participants prioritized study designs for each of 40 different types of question, with a Kendall’s W greater than 0.6 that reached statistical significance (p<0.05) considered as indicating consensus. For each type of question, we sorted the final rankings by their median ranks and interquartile ranges, and listed the four top-ranked study designs. Participants provided 29 answers across the two rounds of the Delphi and reached a consensus for 28 (out of the 40) questions (eight in the first round and 20 in the second). Participants achieved a consensus for 8 of 15 questions in stage I (clarifying a societal problem, its causes, and potential impacts), 12 of 13 in stage II (finding options to address a problem), and four of six in each of stages III (implementing or scaling-up an option) and IV (monitoring implementation and evaluating impact). This paper provides guidance on which study designs are most suitable for providing insights into 28 different types of question. Decision-makers, evidence intermediaries, researchers and funders can use this guidance to make better decisions about which type of study design to commission, use or fund when addressing specific needs.

Introduction

There has been a growing recognition of the potential for evidence to support many aspects of decision-making processes. Evidence can play a significant role in clarifying problems and their causes, analyzing potential solutions, supporting implementation, monitoring implementation, and evaluating impacts [1–3].

During the COVID-19 pandemic, evidence provided multiple insights critical to the questions asked by decision-makers [4]. The lessons learned from this global crisis led several organizations to call for formalizing and strengthening evidence-support systems and the global evidence architecture [5–7].

Robust evidence-support mechanisms require alignment between decision-makers’ needs (evidence demand) and the work of evidence producers and intermediaries (i.e., people working between researchers and decision-makers) (evidence supply). Typically, a given need is best met with a combination of different forms of evidence: when answering complex questions, decision-makers often require several forms or lines of evidence [8]. For example, a policymaker may use quantitative data from a randomized controlled trial to describe the likely benefits and harms of implementing a particular policy intervention, and may also rely on qualitative data to gain a deeper understanding of how the public might perceive the intervention.

Following the policy cycle used in Update 2023 from the Global Commission on Evidence to Address Societal Challenges, previous research efforts generated a taxonomy of decision-makers’ questions grouped into four decision-making stages (S1 Fig). This taxonomy can help decision-makers to frame their need for evidence in a form that an evidence intermediary or producer can answer with different forms of evidence.

After identifying a clear decision-making need in the form of a question, evidence intermediaries and producers can tailor their evidence support to address specific issues and share evidence that is more likely to be used in decision-making processes. The current study builds on the previously created taxonomy of decision-maker questions and matches questions to study designs. While some have suggested a single evidence hierarchy with randomized trials and evidence syntheses at the top of an evidence pyramid [9], our approach aligns with work recognizing that the preferred study design differs according to the specific type of question being asked [10, 11].

This paper provides guidance on which types of study design can address each decision-maker question. Specifically, it:

  1. lists specific study designs that would provide some insights to address different types of question;
  2. provides a ranking of study designs that would be most suitable to answer these questions.

Materials and methods

This is a Delphi study of methodological experts in different areas and disciplines.

Participants

Methodological experts from the six WHO regions (Africa, the Americas, the Eastern Mediterranean, Europe, South-East Asia, and the Western Pacific), representing the eight different forms of evidence identified by the Global Commission on Evidence to Address Societal Challenges (data analytics, modelling, evaluation, behavioural/implementation research, qualitative insights, evidence syntheses, technology assessment/cost-effectiveness analysis, and guidelines) and with experience in the health sector and in other sectors, were identified through global networks (e.g., EVIPNet, Cochrane and GIN). The study team obtained contact information through publicly available sources and purposively sampled experts using the above criteria. Between 11 April and 16 May 2023, sampled experts were invited to participate in this online Delphi study.

Data collection

We used an online questionnaire to ask participants to rank the suitability of study designs to answer different types of question. The questionnaire covered 40 different types of question (S1 Fig presents the entire taxonomy of types of question used in this study).

In both Delphi rounds, participants considered a list of study designs potentially relevant to each type of question, and prioritized the designs’ suitability to answer the question. To facilitate understanding, a brief explanation of each type of question and some examples were also provided. A glossary of study designs was also made available for participants, and they could declare that they did not have enough expertise to answer a given question.

In the first round, participants prioritized at least four study designs and suggested any additional study designs they considered to be missing from the original list. They could choose to work through the complete list of types of question or, depending on their expertise in the eight forms of evidence mentioned above, work through a subset of questions.

Using a ranking-type Delphi process [12, 13], we used the level of consensus reached in round one to prioritize the questions included in round two. Questions that reached a consensus in the first round were not included in the second round. In the second round, participants either confirmed the rank order from round one or suggested an alternative ranking. All participants in the second round answered the complete list of question types.

Statistical analysis

Responses were analyzed in each round using median ranks and their distributions (interquartile ranges). Additionally, in each round, Kendall’s W (a statistical measure of consensus in ranking-type surveys) was used to measure consensus [12–15]. Answers were considered if the participant ranked at least two options and did not declare insufficient expertise.
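For reference, the standard formula for Kendall’s W with the correction for tied ranks (following Kendall & Gibbons [14]; the notation below is ours and does not appear in the original article) is:

$$W \;=\; \frac{12\sum_{i=1}^{n}\bigl(R_i-\bar{R}\bigr)^{2}}{m^{2}\bigl(n^{3}-n\bigr)-m\sum_{j=1}^{m}T_j},\qquad T_j=\sum_{g}\bigl(t_{j,g}^{3}-t_{j,g}\bigr),$$

where $m$ is the number of participants answering the question, $n$ is the number of study designs being ranked, $R_i$ is the sum of the ranks given to design $i$, $\bar{R}=m(n+1)/2$ is their mean, and $t_{j,g}$ is the size of the $g$-th group of tied ranks in participant $j$’s answer. Statistical significance can be approximated by treating $m(n-1)W$ as a chi-square statistic with $n-1$ degrees of freedom.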

The ranking given by each participant was coded ordinally (e.g., first priority received a 1, second priority received a 2, etc.). Since not every participant ranked all the options (they were asked to prioritize at least four), non-ranked options were imputed with the average of the remaining rank values (e.g., if a participant ranked four of the six available options, the two unranked options each received a value of 5.5, the average of ranks 5 and 6). From these rankings, medians and interquartile ranges were calculated for each study design, and Kendall’s W with ties (considering missing values as ties, as suggested by Kendall & Gibbons [14]) was calculated, along with its statistical significance, using the Real Statistics Resource Pack for Microsoft Excel [15].
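To make the imputation and consensus calculation concrete, the sketch below re-implements them in Python. The study itself used the Real Statistics Resource Pack for Excel, so this is only an illustrative approximation of the procedure described above, not the authors’ code; the toy data and variable names are hypothetical.

```python
import numpy as np
from scipy import stats

def complete_ranking(assigned, options):
    """Return a full rank vector for one participant.

    assigned: dict mapping option name -> rank (1 = highest priority);
    options: the full list of study-design options for the question.
    Unranked options each receive the mean of the unused rank values
    (e.g., 4 of 6 options ranked -> the 2 unranked options get (5 + 6) / 2 = 5.5).
    """
    n, k = len(options), len(assigned)
    fill = (k + 1 + n) / 2.0  # mean of the remaining ranks k+1, ..., n
    return np.array([assigned.get(opt, fill) for opt in options], dtype=float)

def kendalls_w(rank_matrix):
    """Kendall's W with the tie correction (Kendall & Gibbons), plus the
    usual chi-square approximation of its p-value.

    rank_matrix: m x n array (m participants, n options); imputed values
    appear as tied (equal) ranks within a row.
    """
    m, n = rank_matrix.shape
    col_sums = rank_matrix.sum(axis=0)
    s = ((col_sums - col_sums.mean()) ** 2).sum()
    tie_term = 0.0  # sum over participants of sum(t^3 - t) per tied group
    for row in rank_matrix:
        _, counts = np.unique(row, return_counts=True)
        tie_term += (counts ** 3 - counts).sum()
    w = 12 * s / (m ** 2 * (n ** 3 - n) - m * tie_term)
    chi2 = m * (n - 1) * w          # approximate chi-square statistic
    p = stats.chi2.sf(chi2, n - 1)  # df = n - 1
    return w, p

# Hypothetical example: three participants each rank 4 of 6 study designs.
options = ["scoping review", "Delphi study", "RCT", "cohort study",
           "cross-sectional study", "modelling"]
answers = [
    {"scoping review": 1, "Delphi study": 2, "RCT": 3, "cohort study": 4},
    {"Delphi study": 1, "scoping review": 2, "modelling": 3, "RCT": 4},
    {"scoping review": 1, "Delphi study": 2, "modelling": 3, "cohort study": 4},
]
ranks = np.vstack([complete_ranking(a, options) for a in answers])
w, p = kendalls_w(ranks)
print(f"Kendall's W = {w:.2f}, p = {p:.3f}")  # consensus if W > 0.6 and p < 0.05
```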

The study designs were sorted for each type of question by their median ranks (from smallest to largest). If more than one option had the same median rank, ties were broken by the smaller 25th percentile and then by the smaller 75th percentile.

A question was considered to have reached consensus when Kendall’s W was greater than 0.6 and statistically significant (p-value < 0.05). Options ranked at the bottom of the list whose IQR suggested no change in their ranking position (e.g., for a question with seven study design options, a cross-sectional study with a median rank of 7 and an IQR from 7 to 7) were not included in the second round. Additional study designs suggested in the first round were, when appropriate, included among the study designs to be ranked in the second round.

For each question, a study design with a median rank of 1 whose interquartile range did not alter its first-place ranking was identified and designated the gold standard. Similarly, a study design whose median rank was last and whose interquartile range did not alter its last-place ranking was also identified.
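Continuing the illustrative Python sketch above (again only as an approximation of the rules described in the text, not the authors’ spreadsheet), the sorting, consensus and gold-standard/last-priority rules might be operationalized as follows; reading “the interquartile range does not alter the ranking” as both quartiles sitting at the extreme rank is our assumption.

```python
def summarize_question(rank_matrix, options, w, p, w_threshold=0.6, alpha=0.05):
    """Sort study designs by median rank (ties broken by the 25th, then the
    75th percentile), flag consensus, and flag clear first/last priorities."""
    n = rank_matrix.shape[1]
    rows = []
    for j, opt in enumerate(options):
        col = rank_matrix[:, j]
        med = np.median(col)
        q25, q75 = np.percentile(col, [25, 75])
        gold = (med == 1) and (q75 == 1)  # IQR keeps the design ranked first
        last = (med == n) and (q25 == n)  # IQR keeps the design ranked last
        rows.append((opt, med, q25, q75, gold, last))
    rows.sort(key=lambda r: (r[1], r[2], r[3]))  # median, then P25, then P75
    consensus = (w > w_threshold) and (p < alpha)
    return rows, consensus

# Using ranks, options, w and p from the sketch above.
rows, consensus = summarize_question(ranks, options, w, p)
for opt, med, q25, q75, gold, last in rows:
    flag = " (gold standard)" if gold else (" (clear last priority)" if last else "")
    print(f"{opt:22s} median={med:.1f}  IQR=({q25:.1f}, {q75:.1f}){flag}")
print("Consensus reached" if consensus else "Carry the question to the next round")
```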

Ethics statement

The Hamilton Integrated Research Ethics Board (HiREB; Project ID 14691) approved the study. Participants received an information sheet describing the ethical considerations of taking part in this study, and their response was considered implicit consent, as responses were treated anonymously.

Results

Seventy-three methodological experts were initially identified as potentially eligible to participate and, to balance the sampling across WHO regions and forms of evidence, an invitation was sent to 46 individuals in the first round. Two invitees declined to participate and were replaced with two others from the original list. Twenty-one experts participated in round 1 (response rate 46%), and 8 participated in round 2 (response rate 18%). In total, 29 answers were received across the two rounds of the Delphi process.

Table 1 presents the participants’ characteristics, including their WHO region and the forms of evidence assigned to them for each Delphi round. In the first round, most of the participants were from the Americas and Europe, while no experts participated from South-East Asia, and only one from the Eastern Mediterranean. In the second round, most of the participants were based in the Americas, while no participants were from South-East Asia or the Eastern Mediterranean.

Table 1. Characteristics of the participants included in the study for each round (n = 21 for round 1; n = 8 for round 2), ranked by frequency in round 1.

https://doi.org/10.1371/journal.pgph.0002752.t001

Regarding forms of evidence, the distribution of participants was similar in both rounds. More than 80% of participants declared expertise in evidence syntheses in both rounds, and approximately 60% reported expertise in guidelines. The number of participants with expertise in data analytics, modelling or technology assessment was lower in the second round. All forms of evidence were represented by at least one participant in both rounds.

In round 1, the number of answers received for each of the 40 types of question varied from 3 to 10, and the response rate varied from 13% to 83%. In round 2, the number of answers received varied from 6 to 8, and the response rate was consistently around 15% per question.

Table 2 shows the response rate for each type of question and the Kendall’s W values reached in the first or second round of the Delphi process, for the types of question in which a consensus was reached. For 28 (out of 40) types of question, there was evidence of consensus among the answers (Kendall’s W > 0.6 and p-value < 0.05) in either round 1 or round 2. S1 Table presents the Kendall’s W for each round, the rank correlation (calculated from the Kendall’s W), and the number of participants per question (including questions for which a consensus was not reached).

Table 2. Response rate and Kendall’s W for the types of question in which consensus was reached.

https://doi.org/10.1371/journal.pgph.0002752.t002

Eight of the original types of question were not carried forward to the second round because their Kendall’s W was greater than 0.6 and statistically significant in the first round (i.e., consensus was reached). A number of study designs were removed after the first round because there was agreement that they ranked last. One new study design suggested by a participant was added in the second round.

Tables 3 to 6 show the four highest-ranked study designs for each of the 28 types of question for which a consensus was reached. The tables are organized by the decision-making stage in which each question was classified in the original taxonomy. S2 Table shows the full ranking of study designs for all 40 questions (including the questions for which a consensus was not reached).

Table 3. Ranking of the first four study designs to address the types of question in which consensus was reached in stage 1.

https://doi.org/10.1371/journal.pgph.0002752.t003

Table 4. Ranking of the first four study designs to address the types of question in which consensus was reached in stage 2.

https://doi.org/10.1371/journal.pgph.0002752.t004

Table 5. Ranking of the first four study designs to address the types of question in which consensus was reached in stage 3.

https://doi.org/10.1371/journal.pgph.0002752.t005

Table 6. Ranking of the first four study designs to address the types of question in which consensus was reached in stage 4.

https://doi.org/10.1371/journal.pgph.0002752.t006

In stage I (clarifying a societal problem, its causes, and potential impacts), shown in Table 3, 8 (out of 15) types of question reached a consensus. A clear gold standard (i.e., a study design with a median rank of 1 whose interquartile range did not alter its first-place ranking) was identified for five of the questions included, whereas for all of the types of question one or more study designs were clearly identified as a last priority (i.e., a study design whose median rank was last and whose interquartile range did not alter its last-place ranking).

Stage I had the most types of question for which a consensus was not reached (seven). These questions related to understanding stakeholders’ values regarding outcomes, describing a problem and its magnitude at a point in time, understanding the role of context in a given problem, assessing the variability of the problem across populations and locations and relative to other problems, and understanding the causes and/or aggravating factors of a problem.

In stage II (finding options to address a problem), shown in Table 4, 12 (out of 13) types of question reached a consensus. A clear gold standard was identified for 7 of the questions, and for all but one at least one study design was clearly identified as a last priority. The only type of question for which a consensus was not reached related to adjusting options to maximize their impacts.

In stage III (implementing or scaling-up an option), shown in Table 5, four (out of six) types of question reached a consensus. A clear gold standard was identified for all four questions included. In all but one question, at least one study design was identified as a clear last priority. The two types of question that did not reach a consensus related to the identification and prioritization of facilitators and of implementation strategies to take advantage of them.

In stage IV (monitoring implementation and evaluating impacts), shown in Table 6, four (out of six) types of question reached a consensus. A clear gold standard was identified for three of the questions, and for two of them at least one study design was identified as a clear last priority.

The two types of question for which a consensus could not be reached related to the first goal (identifying measurement strategies for populations and outcomes). While the two other questions under this goal reached a consensus, the questions related to choosing the most accurate approach, and determining the best instruments, to measure/ascertain populations and outcomes did not.

Discussion

Principal findings and findings in relation to the existing literature

This study used a two-round Delphi process to produce a list of preferred study designs that investigators can use to address 28 different types of question, drawn from a mutually exclusive and collectively exhaustive taxonomy of the types of question for which evidence can provide support (S1 Fig). To reach a consensus on preferred study designs, the study elicited 29 expert responses across the two rounds.

In stage 1 (clarifying a problem, its causes and potential impacts), five types of question had a clear preferred study design: scoping reviews to identify outcomes, Delphi studies to prioritize outcomes, reviews of frameworks to find conceptual approaches, qualitative inductive designs to understand stakeholders’ perceptions, and retrospective cohort studies to assess the variability of a problem over time. In stage 2 (finding and selecting options), seven types of question met this criterion: scoping reviews to scope a list of potential options, randomized controlled studies to assess benefits and early-and-frequently occurring harms, modelling to assess costs, economic evaluations to assess the efficiency in the use of resources, Delphi studies to identify equity, ethical, social and human rights impacts, discrete choice experiments to assess acceptability, and evidence syntheses to create packages of options. In stage 3 (implementing an option), four types of question had a preferred study design: scoping reviews to identify and understand barriers, Delphi studies to identify who has to do what, jurisdictional scans to identify the context in which the option could be implemented, and cross-sectional studies to describe whether implementation is underway. In stage 4 (monitoring implementation and evaluating impacts), three types of question met this criterion: scoping reviews to identify measurement instruments, randomized controlled studies to measure the impact of an option, and cross-sectional studies to interpret the findings of measuring the impact.

A consensus was not reached for 12 of the 40 types of question, the majority of which were part of the first decision-making stage (clarifying a societal problem, its causes, and potential impacts). In the first round of the Delphi, eight types of question reached a consensus and were not included in the subsequent round, while 20 that had not reached a consensus in the first round did so in the second.

Previous research efforts have generated evidence hierarchies or guidance for some types of questions or study designs [16, 17]; this study adds a more pragmatic approach, building guidance on the suitability of study designs for addressing demand-driven types of question. In this sense, the study moves beyond a static hierarchy or guidance of study designs, assuming that different study designs may more suitably answer different types of question. Regardless of whether one accepts a hierarchy, a typology or a list of study designs, it is widely accepted that the traditional hierarchy of evidence (with RCTs and systematic reviews at the top of a pyramid) can only provide insights for one type of question (represented twice in this paper, as “Assessing the benefits and early-and-frequently occurring harms of an option” and “Measuring the impact of an option or implementation strategy”) and that some study designs will be more appropriate than others depending on the specific type of question to be addressed [9, 10].

Strengths and limitations

This paper has a number of strengths. First, it creates a list of preferred study designs that can answer specific types of question collected using a demand-driven approach (and hence more likely to reflect the questions that decision-makers might ask). Second, it does so in a way that gave voice to a range of methodological experts, sampled across eight different forms of evidence and the WHO regions. Finally, it uses a robust methodology, a two-round Delphi process conducted online to reach broader audiences, to reach a strong consensus and to identify critical areas in which further work is required to reach a consensus (if possible).

This study also has limitations. First, Delphi studies are prone to acquiescence bias, the tendency of participants to agree with the existing result and thus fail to express their true preference. Second, since we prioritized the questions each expert received based on their methodological expertise, experts answered different sets of questions in the first round of the Delphi; however, this contributed to a higher response rate, particularly in round 1. Third, despite efforts to increase it, the response rate in round two was low. Finally, while we used Kendall’s W and rank correlation to quantify the consensus reached in ranking-type surveys, these statistical measures have potential biases, such as not discriminating the distance between ranks (e.g., the disagreement arising from ranking an option sixth versus seventh is treated no differently from the disagreement arising from ranking it first versus second).

Implications for policy and practice

Evidence intermediaries can use the results of this paper in their role of promoting the use of evidence in decision-making processes. By identifying the type of question that decision-makers are asking, they can either search for or commission a study using a design that is well suited to address that particular need. Researchers can use the results of this paper to guide their choice of the preferred study design for the type of question at hand. Finally, funders can use this guidance to decide what type of study design to fund or use, depending on the specific need at hand.

Implications for future research

Future research could explore several different areas. First, subsequent efforts could explore reaching a consensus on the 12 types of question for which a consensus was not reached. While a third round of the Delphi process could have been conducted, it would likely have yielded a very limited number of responses. Alternatively, we suggest conducting a series of structured meetings (following the approach used by the GRADE Working Group to reach a consensus [18]) in which methodological experts can discuss which approaches are better suited to specific types of questions.

Second, future studies could explore how considerations beyond study design (e.g., the data analysis for any given study design) could play a role in answering specific types of question. Third, further research could weigh the role of domestic versus global evidence, and of primary studies versus evidence syntheses, as complementary approaches to providing an evidence-informed answer to specific types of question. Finally, while qualitative study designs were included in this study, they were grouped only into high-level categories based on their paradigm (deductive/inductive) and general aim (to describe/to critically interpret). Future research could go beyond this and examine which specific qualitative study designs are best suited to address a given type of question.

Supporting information

S1 Fig. A demand-driven taxonomy of the types of question that could be addressed by evidence.

https://doi.org/10.1371/journal.pgph.0002752.s001

(PDF)

S1 Table. Response rates and consensus levels reached by all the questions included in the study.

https://doi.org/10.1371/journal.pgph.0002752.s002

(DOCX)

S2 Table. Complete prioritized list of study designs used to address each type of question.

https://doi.org/10.1371/journal.pgph.0002752.s003

(DOCX)

Acknowledgments

We are grateful to the many methodological experts who were part of this study and helped to prioritize the different study designs.

References

  1. Lavis JN, Wilson MG, Oxman AD, Lewin S, Fretheim A. SUPPORT Tools for evidence-informed health Policymaking (STP) 4: Using research evidence to clarify a problem. Health Research Policy and Systems. 2009;7:S4. pmid:20018111
  2. Lavis JN, Wilson MG, Oxman AD, Grimshaw J, Lewin S, Fretheim A. SUPPORT Tools for evidence-informed health Policymaking (STP) 5: Using research evidence to frame options to address a problem. Health Research Policy and Systems. 2009;7:S5. pmid:20018112
  3. Fretheim A, Munabi-Babigumira S, Oxman AD, Lavis JN, Lewin S. SUPPORT Tools for Evidence-informed Policymaking in health 6: Using research evidence to address how an option will be implemented. Health Res Policy Sys. 2009;7:S6.
  4. Pearson H. How COVID broke the evidence pipeline. Nature. 2021;593:182–5. pmid:33981057
  5. Cochrane Convenes. Preparing for and responding to global health emergencies: Learnings from the COVID-19 evidence response and recommendations for the future. February 2022 [Internet]. 2022. https://figshare.com/articles/book/Preparing_for_and_responding_to_global_health_emergencies_Learnings_from_the_COVID-19_evidence_response_and_recommendations_for_the_future/19115849
  6. Global Commission on Evidence to Address Societal Challenges. The Evidence Commission report: A wake-up call and path forward for decision-makers, evidence intermediaries, and impact-oriented evidence producers [Internet]. 2022 [cited 2023 Jan 4]. https://www.mcmasterforum.org/docs/default-source/evidence-commission/evidence-commission-report.pdf?Status=Master&sfvrsn=2fb92517_5/Evidence-Commission-report
  7. Kuchenmüller T, Lavis J, Kheirandish M, Reveiz L, Reinap M, Okeibunor J, et al. Time for a new global roadmap for supporting evidence into action. Robinson J, editor. PLOS Glob Public Health. 2022;2:e0000677.
  8. Global Commission on Evidence to Address Societal Challenges. Evidence Commission update 2023: Strengthening domestic evidence-support systems, enhancing the global evidence architecture, and putting evidence at the centre of everyday life [Internet]. McMaster Health Forum; 2023 [cited 2023 Feb 27]. https://www.mcmasterforum.org/docs/default-source/evidence-commission/update-2023.pdf?sfvrsn=e81cbf_8
  9. Evans D. Hierarchy of evidence: a framework for ranking evidence evaluating healthcare interventions. J Clin Nurs. 2003;12:77–84. pmid:12519253
  10. Petticrew M. Evidence, hierarchies, and typologies: horses for courses. Journal of Epidemiology & Community Health. 2003;57:527–9. pmid:12821702
  11. Agoritsas T, Vandvik PO, Neumann I, Rochwerg B, Jaeschke R, Hayward R, et al. Finding Current Best Evidence. In: Guyatt G, Rennie D, Meade MO, Cook DJ, editors. Users’ Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice, 3rd ed [Internet]. New York, NY: McGraw-Hill Education; 2015 [cited 2023 Jul 27]. jamaevidence.mhmedical.com/content.aspx?aid=1183875650
  12. Strasser A. Design and evaluation of ranking-type Delphi studies using best-worst-scaling. Technology Analysis & Strategic Management. 2019;31:492–501.
  13. Kobus J, Westner M. Ranking-type Delphi studies in IS research: step-by-step guide and analytical extension. 2016.
  14. Kendall M, Gibbons JD. Rank correlation methods. 5th ed. Oxford University Press; 1990.
  15. Zaiontz C. Real Statistics Using Excel [Internet]. 2020 [cited 2023 May 3]. www.real-statistics.com
  16. Parkhurst JO, Abeysinghe S. What Constitutes “Good” Evidence for Public Health and Social Policy-making? From Hierarchies to Appropriateness. Social Epistemology. 2016;30:665–79.
  17. Daly J, Willis K, Small R, Green J, Welch N, Kealy M, et al. A hierarchy of evidence for assessing qualitative health research. Journal of Clinical Epidemiology. 2007;60:43–9. pmid:17161753
  18. Guyatt GH, Oxman AD, Vist GE, Kunz R, Falck-Ytter Y, Alonso-Coello P, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ. 2008;336:924–6. pmid:18436948