A Systematic Review of Studies Comparing Diagnostic Clinical Prediction Rules with Clinical Judgment

Background Diagnostic clinical prediction rules (CPRs) are developed to improve diagnosis or decrease diagnostic testing. Whether, and in what situations, diagnostic CPRs improve upon clinical judgment is unclear.
Methods and Findings We searched MEDLINE, Embase and CINAHL, with supplementary citation and reference checking, for studies comparing CPRs and clinical judgment against a current objective reference standard. For both approaches, we report 1) the proportion of study participants classified as not having disease, who hence may avoid further testing and/or treatment, and 2) the proportion, among those classified as not having disease, who do have disease (missed diagnoses). 31 studies of 13 medical conditions were included, with 46 comparisons between CPRs and clinical judgment. In 2 comparisons (4%), CPRs reduced the proportion of missed diagnoses, but this was offset by classifying a larger proportion of study participants as having disease (more false positives). In 36 comparisons (78%) the proportion of diagnoses missed by CPRs and clinical judgment was similar, and in 9 of these, the CPRs classified a larger proportion of participants as not having disease (fewer false positives). In 8 comparisons (17%) the proportion of diagnoses missed by the CPRs was greater. This was offset by classifying a smaller proportion of participants as having the disease (fewer false positives) in 2 comparisons. There were no comparisons where the CPR missed a smaller proportion of diagnoses than clinical judgment and classified more participants as not having the disease. The design of the included studies allows evaluation of CPRs when their results are applied independently of clinical judgment. The performance of CPRs, when implemented by clinicians as a support to their judgment, may be different.
Conclusions In the limited studies to date, CPRs are rarely superior to clinical judgment and there is generally a trade-off between the proportion classified as not having disease and the proportion of missed diagnoses. Differences between the two methods of judgment are likely the result of different diagnostic thresholds for positivity. Which is the preferred judgment method for a particular clinical condition depends on the relative benefits and harms of true positive and false positive diagnoses.



Introduction
Diagnostic clinical prediction rules (CPRs) are tools designed to improve clinical decision making [1]. Theoretically, CPRs can improve the accuracy of diagnosis and/or decision making by providing objective estimates of the probability of the presence or absence of disease, derived from the statistical analysis of cases with known outcomes, and/or by suggesting a clinical course of action.
Understanding whether and in what situations CPRs improve upon clinical judgment is an important step in the evaluation of CPRs and for the acceptance of CPRs by clinicians [2]. Existing research, which has focused on the comparative performance of CPRs and clinical judgment when both judgment methods are viewed as competing alternatives, is difficult to interpret. One body of research on the relative merits of clinical and statistical prediction has consistently reported the superior accuracy of statistical models over a clinician's ability to integrate the same data and to collect and integrate their preferred data [3][4][5], while another, more recent body of research has found that heuristics, proposed as models of human judgment, are on occasion more accurate than statistical models [6]. It is also difficult to know how to apply the general findings of this research to clinical practice. Many of the reviews of comparative accuracy have summarised findings from diverse professional fields including finance, medicine, psychology and education. Further, judging the clinical utility of clinical judgment and CPRs requires consideration of not just overall accuracy but also the consequences of missed diagnoses (false negative results) and false positive results. Results of the existing comparative research are generally not reported in a way that allows such evaluation.
We conducted a systematic review of studies that compared the performance of diagnostic CPRs with clinical judgment or the performance of the combination of CPR and clinical judgment versus either alone in the same study participants against a current and objective reference standard.

Methods
This review was performed following methods detailed in the systematic review protocol (S1 Table - Study protocol) and is reported in line with the PRISMA Statement (S2 Table - PRISMA checklist).

Study selection
We included studies that compared CPRs with clinical judgment in the same participants using a current and objective reference standard. We also included studies that compared a CPR or clinical judgment alone with the combination of CPR and clinical judgment, and modelling studies to determine the added value of CPRs above clinical judgment. The CPR had to have been developed using a method of statistical analysis and tested against clinical judgment in a population different (by time, location or domain) to that from which it was derived. Studies were excluded if the CPR and clinical judgment were applied to different individuals (for example, in randomised trials) or were not applied at approximately the same point in the diagnostic pathway (for example, if the result of a CPR was determined using data collected at first presentation and this was compared to clinical judgment made after further consultation, testing and observation). We also excluded studies of CPRs for the diagnosis of disorders across multiple body systems, studies of CPRs that were not applied to actual patients, studies of CPRs used for the interpretation of tests such as ECGs, and studies performed in selected samples of patients not consistent with the populations for whom use of the CPR is intended.
Titles and abstracts identified by the searches were screened by one reviewer and obviously irrelevant articles excluded. A second reviewer independently screened 15% of the titles and abstracts to ensure that no further studies met the inclusion criteria. After screening, potentially relevant studies were obtained in full text and independently assessed by two reviewers against the review inclusion criteria. Discrepancies were discussed and resolved with a third reviewer.

Data extraction and risk of bias assessment
Two reviewers (SS and JD) independently extracted data on the characteristics of the study, the risk of bias and the results using a piloted data collection form. QUADAS-2 [7] was used to assess the risk of bias and concerns regarding applicability in each of the included studies. We added a signalling question to identify whether clinical judgment and the CPR were determined independently. Discrepancies between reviewers were resolved by discussion with a third reviewer.

Data synthesis and analysis
We grouped studies according to whether a probability estimate, clinical diagnosis or decision was made by:
a. Clinical judgment alone;
b. Clinical judgment with a method of structured data collection. Clinicians may have collected data on variables contained in the CPR as per the study protocol, but calculation of the results of a CPR by the clinician was not anticipated or expected, or occurred after the clinician had provided their probability estimate or diagnosis; or
c. A combination of clinical judgment and clinical prediction rule, where the clinician had access to the results of the CPR but could also use their own judgment or override the CPR.
We also recorded whether the result of the CPR was calculated by the examining doctor or a researcher, the method used to elicit clinical judgment, and whether clinical judgment was a clinician's probability or risk assessment (e.g. low or high risk), a diagnosis or a clinical decision.
Because many clinical prediction rules are developed either to increase the proportion of individuals with a suspected disease classified as not having the disease (thereby decreasing the number of participants undergoing further testing, referral or treatment) or to reduce the number of cases of disease missed by the current diagnostic protocol, the main outcome measures of the review were 1) the percentage of study participants classified as not having the disease by the CPR or clinical judgment ((false negatives (FN) + true negatives (TN))/total number of participants in the study (total N)); the higher this proportion, the fewer individuals who may undergo further testing, referral and/or treatment; and 2) the percentage of study participants, among those classified by the CPR or clinical judgment as not having the disease, who actually have the disease (FN/(FN + TN), or 1 - negative predictive value); it is desirable that this be as close to 0% as possible. We also report measures of diagnostic accuracy, including the sensitivity (true positives (TP)/(TP + FN)) and specificity (TN/(false positives (FP) + TN)) of CPRs and clinical judgment, and present graphically the proportion of all study participants who are classified by CPRs and clinical judgment as having disease who do (TP/total N) and do not (FP/total N) have disease, and the proportion of all participants who are classified as not having disease who do (FN/total N) and do not (TN/total N) have disease.
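The outcome measures defined above can be computed directly from the four cells of a 2x2 table. The following is a minimal sketch in Python, not the review's own analysis code; the class name `TwoByTwo`, its method names and the example counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from cross-classifying one judgment method against the reference standard."""

    tp: int  # classified as having disease, disease present
    fp: int  # classified as having disease, disease absent
    fn: int  # classified as not having disease, disease present (missed diagnoses)
    tn: int  # classified as not having disease, disease absent

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    def proportion_classified_negative(self) -> float:
        """Main outcome 1: (FN + TN) / total N."""
        return (self.fn + self.tn) / self.total

    def proportion_missed(self) -> float:
        """Main outcome 2: FN / (FN + TN), i.e. 1 - negative predictive value."""
        return self.fn / (self.fn + self.tn)

    def sensitivity(self) -> float:
        """TP / (TP + FN)."""
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        """TN / (FP + TN)."""
        return self.tn / (self.fp + self.tn)


# Hypothetical counts for illustration only (not taken from any included study).
example = TwoByTwo(tp=45, fp=155, fn=5, tn=295)
print(f"Classified as not having disease: {example.proportion_classified_negative():.1%}")
print(f"Missed diagnoses among those:     {example.proportion_missed():.1%}")
print(f"Sensitivity: {example.sensitivity():.1%}, specificity: {example.specificity():.1%}")
```

Computing both main outcomes from the same table makes the trade-off visible: a method can classify many participants as not having disease while still missing few diagnoses only if both FN is small and TN is large.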
We did not perform a meta-analysis due to clinical and statistical heterogeneity. Instead, we synthesised the results of the included studies overall, and by clinical condition (where there were 2 or more studies available), by determining the number of comparisons in which the proportion of participants classified as not having disease and the proportion of missed cases of disease (missed diagnoses) in participants classified as not having disease were similar, greater or lesser for CPRs than for clinical judgment. To determine whether there was a difference in the proportion classified as not having disease between CPRs and clinical judgment, we conducted a statistical test of the difference between two proportions from dependent samples. To obtain the statistical significance of the relative difference in the proportion of those classified by CPRs and clinical judgment as not having disease who do have disease, we conducted a test of the strength of association between two proportions (false negative rates) from dependent samples. If studies reported different thresholds for clinical judgment or the CPR, and if the proportions (i.e. those classified as not having disease and the proportion of missed diagnoses) were similar at the different thresholds (this only occurred in 1 study included in this review), we reported only the comparison for the threshold with the highest Youden's index ((sensitivity + specificity) - 1).
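As an illustration of the two calculations mentioned above, the sketch below computes Youden's index and an exact two-sided p-value for a difference between two proportions measured in the same participants. The text does not name the specific paired test used; McNemar's exact test is shown here as one common choice and is an assumption, as are the function names and the example counts.

```python
from math import comb


def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden's index: (sensitivity + specificity) - 1."""
    return sensitivity + specificity - 1.0


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    Assumption: McNemar's test stands in for the unnamed paired test.
    b: participants classified as not having disease by the CPR only
    c: participants classified as not having disease by clinical judgment only
    Under the null hypothesis, b follows a Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts for illustration only.
p_value = mcnemar_exact_p(b=15, c=5)
print(f"Exact McNemar p-value: {p_value:.4f}")
print(f"Youden's index: {youden_index(0.90, 0.65):.2f}")
```

Only the discordant pairs enter the paired test, because participants classified the same way by both methods carry no information about which method classifies more participants as not having disease.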

Literature search
Of 10,155 titles and abstracts screened against review eligibility criteria, 330 were obtained in full text and assessed for eligibility by two reviewers. 31 studies were included in the review (

Risk of bias
87% (27/31) of studies were judged to be at high or unclear risk of bias on two or more domains of the QUADAS-2 tool (Fig 2 -Summary QUADAS-2 risk of bias and applicability judgments). The most common risk of bias was due to interpretation of the reference standard occurring with knowledge of the index test result. For most studies in which the CPR was applied retrospectively to the data, it was not possible to determine whether researchers were blind to the result of the reference standard test. This is likely to bias results in favour of the CPR. 55% (17/31) of studies were judged to be at high risk of bias on the flow and timing domain. Studies commonly failed to include all enrolled cases in the data analysis or incorporated one of the

Study results
Results of the included studies are tabulated in Table 3 - Characteristics and results of included studies, and Table 4 - Characteristics and results of included studies for conditions with 2 studies, and presented graphically in Figs 3 and 4. There were 41 comparisons between CPRs and clinical judgment [8-12, 14-16, 18-28, 30-33, 35-38] (Table 3, Table 4, Fig 3 and Fig 4). In 2 (5%) comparisons [10,37], CPRs reduced the proportion of missed diagnoses in those classified as not having the disease, but this was offset by classifying a larger proportion of study participants as having disease (more false positives). In 33 (80%) comparisons [8, 9, 11, 12, 14, 15, 18, 19, 21, 23-28, 30-33, 35, 36, 38] the proportion of diagnoses missed by the CPR and clinical judgment was similar; in 7 of these comparisons [15,18,19,27,32,36] CPRs classified a larger proportion of participants as not having disease (fewer false positives), and in 16 [8, 9, 11, 12, 21, 23-26, 30, 32, 35, 36, 38] a similar proportion. In 6 (15%) comparisons [8,16,20,22,25] the proportion of diagnoses missed by the CPR was greater. This was offset by classifying a smaller proportion of participants as having the disease (fewer false positives) in 2 comparisons [8,25]. In 3 of the 6 comparisons [16,20,22] the CPRs classified a similar proportion of participants as having the disease. There was 1 comparison [16] where the CPR both missed more diagnoses and classified a larger proportion of participants as having the disease (more false positives), but no comparisons where the CPR missed fewer diagnoses and classified a larger proportion of participants as not having disease. There were 5 comparisons between CPRs and the combination of CPR and clinical judgment [13,17,29,34] (Table 3, Table 4, Fig 3 and Fig 4).
In 3 (60%) comparisons the proportion of diagnoses missed was similar [13,17,34] and in 2 [17,34] of these comparisons, CPRs classified a larger proportion of study participants as not having disease (fewer false positives) than the combination of CPR and clinical judgment. In 2 (40%) comparisons [13,29], the proportion of diagnoses missed by the CPRs was greater while the proportion classified as not having disease by the CPRs and the combination of CPR and clinical judgment was similar. There were no comparisons between the combination of CPR and clinical judgment and clinical judgment alone.
There were 5 studies [11,12,17,25,36], contributing 10 comparisons, that used different thresholds for the CPR or clinical judgment (for example, Kabrhel et al, 2005 [11] compared clinical judgment to the Wells PE score at thresholds of <2 and 4). We report the results of 9 of these comparisons, excluding 1 comparison [17] where the proportions of interest (that is, the proportion classified as having disease or the proportion of missed diagnoses) were similar at the different thresholds. This means that for a small number of comparisons (n = 4) clinical judgment is counted twice [11,12,25,36].

Pulmonary embolism
From 9 studies in pulmonary embolism, there were 9 comparisons between the Wells PE score (original 3 level or 2 level score) and clinical judgment [8,9,11,12,13,15,16]. In 8 (89%) comparisons [8,9,11-13,15], the proportion of diagnoses missed by the score and clinical judgment was similar. In 1 of these [15], the score classified a larger proportion of all participants as not having the disease (fewer false positives), a similar proportion in 5 comparisons [8,9,11,12,13] and a larger proportion of participants as having the disease (more false positives) in 2 [11,12]. In 1 (11%) comparison [16], the proportion of diagnoses missed by the Wells PE score was greater, while the proportion of participants classified as not having the disease was similar. In 2 comparisons between the PERC Rule and clinical judgment [10,14], the rule reduced the proportion of missed diagnoses in 1 [10], but this was offset by classifying a larger proportion of participants as having the disease (more false positives). In the other comparison [14], the proportion of diagnoses missed by the PERC rule and clinical judgment was similar. In 1 comparison [16] the Revised Geneva Score both missed more diagnoses and classified a larger proportion of participants as having the disease than clinical judgment. In 1 comparison [13] between the Geneva score and the combination of clinical judgment and score, the proportion of diagnoses missed by the CPR was greater.

* In studies where the CPR is applied retrospectively to the data by the researcher using predictor data collected by the clinician, if there was no statement that researchers were blind to the reference standard, the risk of bias was considered to be unclear. If predictor data were collected by the researcher and there was no statement that researchers were blind to the reference standard, the risk of bias was considered to be high.
† When the reference standard comprised subjective tests, if there was no statement that those interpreting the reference standard tests were blind to the results of either the CPR or clinician, the risk of bias was considered to be unclear.
‡ If the method of determining disease status involved a combination of different tests in which some tests were applied to some patients and one test applied to all patients (differential verification), then the risk of bias was considered to be unclear. If performance of any of the reference standard tests was dependent upon the results of the index test, the risk of bias was considered to be high. If it was not possible to determine whether all eligible patients had been included in the analysis, the risk of bias was considered to be unclear. If it was clear that not all patients had been included in the analysis (due to missing outcome data or because data from the clinician's estimate or data necessary to derive the results of the CPR were not available) and these studies reported results for the comparisons in different numbers of cases or only presented the results for cases on which data for both comparisons were available, the risk of bias was considered to be high. Risk of bias was recorded as high if either of the issues relating to the reference standard test or analysis were high.
doi:10.1371/journal.pone.0128233.t002

Deep vein thrombosis
From 6 studies of DVT, there were 6 comparisons between the Wells DVT score and clinical judgment [18-22]. There were no comparisons in which the score reduced the proportion of missed diagnoses. In 4 (67%) comparisons the proportion of diagnoses missed by the score and clinical judgment was similar [18,19,21]. In 3 of these [18,19] the score classified a larger proportion of all participants as not having disease (fewer false positives) and in 1 [21] the proportion was similar. In 2 comparisons [20,22] the proportion of diagnoses missed by the CPR was greater, with a similar proportion classified as not having the disease. In 1 comparison [17] between the Oudega Rule and the combination of clinical judgment and Oudega Rule, the proportion of diagnoses missed was similar, with the rule classifying a larger proportion of participants as not having the disease (fewer false positives).

Streptococcal throat infection
There were 3 studies of streptococcal throat infection. In 2 comparisons [23,24] between the Centor Score (Modified and Original score combined with Tomkins Management Rule) and 1 comparison between the Walsh score and clinical judgment [23] the proportion of diagnoses missed and the proportion of all participants

Foot and/or ankle fracture
From 3 studies of foot and/or ankle fracture, there were 3 comparisons between the Ottawa ankle and foot rules (OAR) and clinical judgment [26-28]. In all 3 comparisons (100%) the proportion of diagnoses missed by the CPR and clinical judgment was similar. In 1 of these [27] the rule classified a larger proportion of study participants as not having disease (fewer false positives) and in 2 comparisons [26,28] the CPR classified a larger proportion of participants as having disease (more false positives). In the 2 comparisons from 2 studies [26,28] in which the OAR classified a larger proportion of participants as having disease than clinical judgment, the clinicians, when making a decision or diagnosis, would likely have been aware that all participants would be x-rayed as per study protocol [26] or would have known that an x-ray could be ordered at their discretion [28]. This may lead to an overestimate of the proportion of study participants classified as not having disease by clinical judgment.

Acute appendicitis
There were 2 studies of acute appendicitis. In 1 comparison [29] between the Fenyo Score and the combination of score and clinical judgment, the proportion of diagnoses missed by the score was greater while the proportion classified as not having disease was similar. In 1 comparison [30] between the Modified Alvarado Score and clinical judgment, the proportion of diagnoses missed and the proportion of all study participants classified as not having disease were similar.

Acute coronary syndrome, pneumonia, head injury in children, cervical spine injury, active pulmonary tuberculosis, malaria, bacteremia and influenza
Of 8 studies (11 comparisons) addressing a variety of conditions, the CPRs showed either an improvement in the proportion of missed diagnoses or the proportion classified as not having disease, but this was often offset by a worsening of the other measure.

Discussion
In this review, CPRs were rarely superior to clinical judgment and there was generally a trade-off between the proportion of study participants classified as not having disease and, among those classified as not having disease, the proportion of missed diagnoses. CPRs for the diagnosis of DVT generally classified a larger proportion of all participants as not having disease than clinical judgment, but this was often at the expense of missed diagnoses. In other disease areas, CPRs showed either an improvement in the proportion classified as not having disease or the proportion of missed diagnoses, but often with the trade-off of worsening the other measure. These findings, however, are limited by the small number of studies for many of the conditions, and by the design features and generally unclear or high risk of bias in many of the included studies.
Trade-offs between the proportion classified as not having disease and the proportion of missed diagnoses by CPRs and clinical judgment seen in this review probably represent differences in the diagnostic threshold for positivity of the two judgment methods. For example, CPRs might be developed to avoid missing people with disease and as such the threshold for positivity is set very low. The CPR would therefore likely be safer than clinical judgment, where the threshold for positivity is implicitly set and variable between and within clinicians, but this is often at the expense of classifying fewer participants as not having disease (who would thereby avoid further testing or treatment). Whether clinical judgment or a CPR is the preferred judgment method for a particular clinical condition will therefore depend on the relative benefits and harms arising from true positive and false positive diagnoses.
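The effect of the threshold for positivity on the two outcome measures can be illustrated with a toy example. The sketch below applies two thresholds to the same set of scores; the scores, disease labels, thresholds and the function name `summarise` are all hypothetical and do not come from any study in this review.

```python
def summarise(scores, diseased, threshold):
    """Classify score >= threshold as 'having disease'.

    Returns (proportion of all participants classified as not having disease,
             proportion of those participants who actually have disease).
    """
    negatives = [d for s, d in zip(scores, diseased) if s < threshold]
    if not negatives:
        return 0.0, 0.0  # nobody classified as not having disease
    prop_negative = len(negatives) / len(scores)
    prop_missed = sum(negatives) / len(negatives)
    return prop_negative, prop_missed


# Hypothetical participants: a score per participant and the reference standard
# result (1 = disease present, 0 = disease absent).
scores = [0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9]
diseased = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1]

for threshold in (2, 6):
    prop_negative, prop_missed = summarise(scores, diseased, threshold)
    print(
        f"threshold {threshold}: {prop_negative:.0%} classified as not having disease, "
        f"{prop_missed:.1%} of those are missed diagnoses"
    )
```

In this toy data the low threshold misses no diagnoses but classifies few participants as not having disease, while the higher threshold classifies many more as not having disease at the cost of some missed diagnoses, which is the pattern of trade-off described above.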
Variability in the proportion classified as not having disease and the proportion of missed diagnoses of CPRs compared with clinical judgment, even amongst studies of the same CPR, may be explained in part by features of the clinical setting of the studies. Differences in study design and methodology may also explain the variation in performance seen in the studies included in this review. These include the type of CPR tested (logistic regression model or other statistical technique), the rigour with which it was developed, the case-mix of the study population, 'modifications' to clinical judgment (with or without structured data collection), who made the clinical judgment (novice or experienced clinicians) and the way in which the result of the CPR was derived (calculation by clinician or researcher). In many studies, clinicians collected diagnostic data on a structured data collection form. This systematic collection of diagnostic information may improve the observed diagnostic accuracy of the clinicians [39]. Clinician experience has also been shown to improve the accuracy of diagnosis [40].
Variability in the outcomes of clinical judgment and CPRs within conditions may also be explained by the method used to elicit clinical judgment, as the method used will likely be associated with the implicit threshold for positivity. In studies of appendicitis, for example, clinical judgment was a clinician's diagnosis of appendicitis or the clinician's actual action to perform surgery or not. In studies of ankle fracture, clinical judgment was either a clinician's diagnosis of fracture or their intention to x-ray a patient, and for studies of sore throat, clinical judgment may have been a clinician's actual action to prescribe antibiotics or not, or a clinician's statement of their intention to treat with antibiotics. The clinician's threshold for positivity will likely be higher, for instance, if asked to provide a diagnosis (diagnostic threshold) than when asked of their intention to do further definitive testing (testing threshold). Where clinical judgment was elicited by obtaining a clinician's probability estimate on a continuous scale, there was also variation in the thresholds applied by study researchers. For studies of pulmonary embolism, for example, thresholds were applied at probabilities of 15 or 20%.
The design of the studies included in this review allows comparison of the performance of CPRs and clinical judgment when applied independently. In practice, however, CPRs are likely to be used as tools to support or complement clinical judgment. When used in this manner, the performance of the diagnostic CPRs may vary from that shown in this review. The effect of a CPR when used in conjunction with clinical judgment can only be fully tested in a study design in which participants are assigned (ideally randomly) to apply or receive clinical judgment alone or clinical judgment with access to a CPR. However, studies of diagnostic accuracy or incremental value [41,42] provide a useful and less costly interim step in the evaluation of CPRs prior to a randomised controlled trial and can guide future research.
Our study shows that, in the context of medical diagnosis, CPRs do not consistently classify more individuals as not having disease or miss fewer diagnoses among those classified as not having disease than clinical judgment. This is in contrast to several reviews comparing clinical and statistical methods of prediction, often combining studies from fields as diverse as education, criminology and healthcare, which have generally found statistical methods to be superior [3][4][5]. A more recent body of research, however, has found that when formally tested, heuristics proposed as models of human judgment are, in some situations, as accurate as or more accurate than statistical models [6]. A review comparing the diagnostic accuracy of doctors and statistical tools for acute appendicitis [43] found that statistical tools had greater specificity than clinicians. However, most of the studies included in that review were excluded from the present review because a) the statistical tools and clinical judgment were not applied at the same time point or b) the statistical tools and clinical judgment were not applied to the same participants.
Due to variation in the design and purpose of the included studies, we did not attempt meta-analysis across or within study conditions. Instead, we compare CPRs and clinical judgment using two measures: 1) the proportion of all study participants classified as not having disease (a measure of efficiency) and 2) the proportion of participants, among those classified as not having disease, who actually have the disease (false negative rate, a measure of safety). Because many CPRs seek either to improve diagnosis or to identify a group of patients who do not require additional testing, we believe these are the most clinically relevant measures. Though these measures are dependent on the prevalence of the disease in the study population, the studies were judged to have been undertaken in relevant clinical settings. Traditional measures of diagnostic accuracy, such as sensitivity, specificity and area under a receiver operating characteristic curve, are often favoured accuracy metrics because they are commonly believed to be unaffected by disease prevalence, though this has recently been shown not to be the case [44]. The proportion of participants classified as having disease and the proportion with false positive results can also be obtained from Figs 3 and 4, and the traditional measures of diagnostic accuracy from Tables 3 and 4.
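The dependence of the two review measures on disease prevalence can be shown with simple arithmetic. The sketch below derives both measures from sensitivity, specificity and prevalence; it holds sensitivity and specificity fixed purely as a simplifying assumption for illustration (which, as noted above, may not hold in practice), and the function name and input values are hypothetical.

```python
def review_measures(sensitivity: float, specificity: float, prevalence: float):
    """Return (proportion classified as not having disease,
               proportion of missed diagnoses among those classified as not having disease).

    All quantities are expressed as fractions of the whole study population.
    """
    fn = prevalence * (1.0 - sensitivity)   # false negatives / total N
    tn = (1.0 - prevalence) * specificity   # true negatives / total N
    classified_negative = fn + tn
    if classified_negative == 0.0:
        return 0.0, 0.0  # nobody classified as not having disease
    return classified_negative, fn / classified_negative


# Hypothetical inputs: the same sensitivity and specificity at two prevalences.
for prevalence in (0.10, 0.30):
    negative, missed = review_measures(0.90, 0.60, prevalence)
    print(
        f"prevalence {prevalence:.0%}: {negative:.0%} classified as not having disease, "
        f"{missed:.1%} missed among those"
    )
```

With identical sensitivity and specificity, the higher-prevalence population yields both a smaller proportion classified as not having disease and a larger proportion of missed diagnoses, which is why comparisons of these measures are most informative within the same study population.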
The majority of included studies were judged to be at high or unclear risk of bias on 2 or more of the 4 risk of bias domains assessed. Differential verification (the results of clinical judgment or the CPR influence the performance of reference tests) and incorporation bias (the results of the CPR are used to make the final diagnosis) affected many studies, particularly studies of DVT and PE. Further, studies commonly did not include all eligible cases in the analysis and often it was not clear whether researchers applying a CPR retrospectively to a dataset were blind to the results of the reference standard. The design of studies of ankle fracture and streptococcal throat infection may also have led to inaccurate estimates of the diagnostic accuracy of clinical judgment. In these studies, the clinicians' diagnosis or decision that x-ray or antibiotics were necessary may have been influenced by knowledge that all or most study participants would undergo confirmatory testing with an x-ray or throat swab. In this review, in two of the three studies of ankle and/or foot fracture, the Ottawa Ankle Rules were considerably less efficient than clinical judgment that a fracture was present or that an x-ray was necessary. This finding conflicts with that of a multicentre randomised controlled trial in which application of the rules led to x-rays for 79% of study participants compared to 99.6% of participants when the decision was made by emergency department physicians [45].
The database searches to identify studies for the review were conducted up to March 2013 and eligible studies may have been published since this time. Because of the size of the search, not all titles and abstracts identified in electronic searches were screened by 2 reviewers. However, a second reviewer screened a subset of titles and did not find any additional studies. The search terms used may not have located all eligible studies, but manual searches of systematic reviews of CPRs and comprehensive reference and citation checking minimise this possibility. An assessment of the risk of bias in the studies deriving the CPRs, or of the 'useability' features of the CPRs evaluated in this review, was not conducted; updates to this review should seek to do this. Such information may assist in the interpretation of the results of the review.
While CPRs show promise as a way of improving clinical decision making, to date there have been limited studies comparing, in the same participants, the accuracy of CPRs and clinical judgment, and those studies often had design issues that raised the potential for bias and made interpretation of their results difficult. Though detailed guidance on the validation and evaluation of prediction models and rules is available [46,47], guidance on issues specific to studies comparing the diagnostic performance of CPRs and clinical judgment may improve this situation. To assess the potential of diagnostic CPRs to improve diagnosis and patient outcomes when used in combination with clinical judgment, particularly in situations where the clinician has a high degree of uncertainty, an analysis of studies comparing care provided when clinicians have access to a diagnostic CPR with usual care would be useful.

In Summary
The limited studies included in this review show that none of the CPRs evaluated to date are clearly superior to clinical judgment across a range of medical conditions. They also show variation in the comparative performance of clinical judgment and CPRs between studies for the same condition and between studies of the same CPR. There is generally a trade-off between the proportion classified as not having disease and the proportion of missed diagnoses that is most likely due to different thresholds for positivity associated with clinical judgment and CPRs. The current review highlights some of the methodological issues relating to the conduct of studies comparing CPRs and clinical judgment, with design features of many of the included studies increasing the potential for bias.
Supporting Information S1