
A Systematic Review of Studies Comparing Diagnostic Clinical Prediction Rules with Clinical Judgment

  • Sharon Sanders ,

    ssanders@bond.edu.au

    Affiliation The Centre for Research in Evidence Based Practice, Bond University, Gold Coast, 4226, Australia

  • Jenny Doust,

    Affiliation The Centre for Research in Evidence Based Practice, Bond University, Gold Coast, 4226, Australia

  • Paul Glasziou

    Affiliation The Centre for Research in Evidence Based Practice, Bond University, Gold Coast, 4226, Australia


Abstract

Background

Diagnostic clinical prediction rules (CPRs) are developed to improve diagnosis or decrease diagnostic testing. Whether, and in what situations, diagnostic CPRs improve upon clinical judgment is unclear.

Methods and Findings

We searched MEDLINE, Embase and CINAHL, with supplementary citation and reference checking, for studies comparing CPRs and clinical judgment against a current objective reference standard. We report 1) the proportion of study participants classified as not having disease, who hence may avoid further testing and/or treatment, and 2) the proportion, among those classified as not having disease, who in fact have the disease (missed diagnoses), for both approaches. 31 studies of 13 medical conditions were included, with 46 comparisons between CPRs and clinical judgment. In 2 comparisons (4%), CPRs reduced the proportion of missed diagnoses, but this was offset by classifying a larger proportion of study participants as having disease (more false positives). In 36 comparisons (78%), the proportion of diagnoses missed by CPRs and clinical judgment was similar, and in 9 of these the CPRs classified a larger proportion of participants as not having disease (fewer false positives). In 8 comparisons (17%), the proportion of diagnoses missed by the CPRs was greater; this was offset by classifying a smaller proportion of participants as having the disease (fewer false positives) in 2 comparisons. There were no comparisons in which the CPR missed a smaller proportion of diagnoses than clinical judgment and classified more participants as not having the disease. The design of the included studies allows evaluation of CPRs when their results are applied independently of clinical judgment. The performance of CPRs when implemented by clinicians as a support to their judgment may be different.

Conclusions

In the limited studies to date, CPRs are rarely superior to clinical judgment and there is generally a trade-off between the proportion classified as not having disease and the proportion of missed diagnoses. Differences between the two methods of judgment are likely the result of different diagnostic thresholds for positivity. Which is the preferred judgment method for a particular clinical condition depends on the relative benefits and harms of true positive and false positive diagnoses.

Introduction

Diagnostic clinical prediction rules (CPRs) are tools designed to improve clinical decision making [1]. Theoretically, by providing objective estimates of the probability of the presence or absence of disease, derived from the statistical analysis of cases with known outcomes, and/or by suggesting a clinical course of action, CPRs can improve the accuracy of diagnosis and/or decision making.

Understanding whether and in what situations CPRs improve upon clinical judgment is an important step in the evaluation of CPRs and for the acceptance of CPRs by clinicians [2]. Existing research, which has focused on the comparative performance of CPRs and clinical judgment when both judgment methods are viewed as competing alternatives, is difficult to interpret. One body of research on the relative merits of clinical and statistical prediction has consistently reported the superior accuracy of statistical models over a clinician's ability to integrate the same data, and to collect and integrate their preferred data [3–5], while another, more recent body of research has found that heuristics, proposed as models of human judgment, are on occasion more accurate than statistical models [6]. It is also difficult to know how to apply the general findings of this research to clinical practice. Many of the reviews of comparative accuracy have summarised findings from diverse professional fields including finance, medicine, psychology and education. Further, judging the clinical utility of clinical judgment and CPRs requires consideration of not just overall accuracy but the consequences of missed diagnoses (false negatives) and false positive results. Results of the existing comparative research are generally not reported in a way that allows such evaluation.

We conducted a systematic review of studies that compared the performance of diagnostic CPRs with clinical judgment or the performance of the combination of CPR and clinical judgment versus either alone in the same study participants against a current and objective reference standard.

Methods

This review was performed following methods detailed in the systematic review protocol (S1 Table– Study protocol) and is reported in line with the PRISMA Statement (S2 Table – PRISMA checklist).

Data sources and searches

We searched MEDLINE, Embase and CINAHL from inception to January 2012, with an updated MEDLINE search to March 2013 (S3 Table – Electronic database search strategies). No limits were applied to the database searches. We also searched for systematic reviews of diagnostic CPRs using PubMed Clinical Queries. The reference lists of systematic reviews and the included studies were checked. We conducted forward searches of included studies using Science Citation Index Expanded in Web of Science and checked related citations using PubMed's Related Citations link.

Study selection

We included studies that compared CPRs with clinical judgment in the same participants using a current and objective reference standard. We also included studies that compared a CPR or clinical judgment alone with the combination of CPR and clinical judgment, and modelling studies designed to determine the added value of CPRs above clinical judgment. The CPR had to have been developed using a method of statistical analysis and tested against clinical judgment in a population different (by time, location or domain) to that from which it was derived. Studies where the CPR and clinical judgment were applied to different individuals (for example, in randomised trials) or were not applied at approximately the same point in the diagnostic pathway were excluded (for example, if the result of a CPR was determined using data collected at first presentation and this was compared to clinical judgment made after further consultation, testing and observation). We excluded studies of CPRs for the diagnosis of disorders across multiple body systems, CPRs that were not applied to actual patients, CPRs used for the interpretation of tests such as ECGs, and studies performed in selected samples of patients not consistent with the populations for whom use of the CPR is intended.

Titles and abstracts identified by the searches were screened by one reviewer and obviously irrelevant articles excluded. A second reviewer independently screened 15% of the titles and abstracts to ensure that no further studies met the inclusion criteria. After screening, potentially relevant studies were obtained in full text and independently assessed by two reviewers against the review inclusion criteria. Discrepancies were discussed and resolved with a third reviewer.

Data extraction and risk of bias assessment

Two reviewers (SS and JD) independently extracted data on the characteristics of the study, the risk of bias and the results using a piloted data collection form. QUADAS-2 [7] was used to assess the risk of bias and concerns regarding applicability in each of the included studies. We added an additional signalling question to identify if clinical judgment and the CPR were determined independently. Discrepancies between reviewers were discussed and resolved by discussion with a third reviewer.

Data synthesis and analysis

We grouped studies where a probability estimate, clinical diagnosis or decision was made by:

  1. Clinical judgment alone;
  2. Clinical judgment with a method of structured data collection. Clinicians may have collected data on variables contained in the CPR as per the study protocol but calculation of the results of a CPR by the clinician was not anticipated or expected, or occurred after the clinician had provided their probability estimate or diagnosis; or
  3. A combination of clinical judgment and clinical prediction rule, where the clinician had access to the results of the CPR but could also use their own judgment or override the CPR.

We also recorded whether the result of the CPR was calculated by the examining doctor or a researcher, the method used to elicit clinical judgment, and whether clinical judgment was a clinician's probability or risk assessment (e.g. low or high risk), a diagnosis or a clinical decision.

Because many clinical prediction rules are developed either to increase the proportion of individuals with a suspected disease who are classified as not having the disease (thereby decreasing the number of participants undergoing further testing, referral or treatment), or to reduce the number of cases of disease missed by the current diagnostic protocol, the main outcome measures of the review were 1) the percentage of study participants classified as not having the disease by the CPR or clinical judgment ((false negative (FN) + true negative (TN))/total number of participants in the study (total N)); the higher this proportion, the fewer individuals that may undergo further testing, referral and/or treatment; and 2) the percentage of study participants, among those classified by the CPR or clinical judgment as not having the disease, who actually have the disease (FN/(FN+TN), or 1 − negative predictive value); ideally this is as close to 0% as possible. We also report measures of diagnostic accuracy, including the sensitivity (true positive (TP)/(TP+FN)) and specificity (TN/(false positive (FP)+TN)) of CPRs and clinical judgment, and present graphically the proportion of all study participants classified as having disease who do (TP/total N) and do not (FP/total N) have it, and the proportion of all participants classified as not having disease who do (FN/total N) and do not (TN/total N).
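All of these measures derive from a single 2×2 table of each judgment method against the reference standard. As an illustrative sketch (the counts below are hypothetical and not taken from any included study):

```python
def review_measures(tp, fp, fn, tn):
    """Compute the review's outcome measures from a 2x2 table.

    tp/fp/fn/tn: true/false positive and negative counts for one
    judgment method (CPR or clinical judgment) against the reference
    standard.
    """
    total = tp + fp + fn + tn
    return {
        # 1) proportion classified as not having disease ("efficiency")
        "classified_negative": (fn + tn) / total,
        # 2) missed diagnoses among those classified negative
        #    (1 - negative predictive value; the "safety" measure)
        "missed_in_negative": fn / (fn + tn),
        # traditional accuracy measures, also reported in the review
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }

# Hypothetical example: 100 participants, 25% disease prevalence
m = review_measures(tp=20, fp=30, fn=5, tn=45)
```

In this hypothetical table, half the participants are classified as not having disease, and 10% of that group are missed diagnoses.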

We did not perform a meta-analysis due to clinical and statistical heterogeneity. Instead, we synthesised the results of the included studies overall, and by clinical condition (where 2 or more studies were available), by determining the number of comparisons in which the proportion of participants classified as not having disease, and the proportion of missed cases of disease (missed diagnoses) among participants classified as not having disease, was similar, greater or lesser for CPRs than for clinical judgment. To determine whether there was a difference in the proportion classified as not having disease between CPRs and clinical judgment, we conducted a statistical test of the difference between two proportions from dependent samples. To obtain the statistical significance of the relative difference in the proportion, among those classified by CPRs and clinical judgment as not having disease, who do have disease, we conducted a test of the strength of association between two proportions (false negative rates) from dependent samples. If studies reported different thresholds for clinical judgment or the CPR, and the proportions (i.e. those classified as not having disease and the proportion of missed diagnoses) were similar at the different thresholds (this occurred in only 1 study included in this review), we reported only the comparison for the threshold with the highest Youden's index ((sensitivity + specificity) − 1).
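The protocol describes a test of the difference between two proportions from dependent samples; the specific procedure is not named in the text, but McNemar's test is the standard choice when the same participants are classified by both the CPR and clinical judgment. A minimal sketch under that assumption, using the chi-square approximation with continuity correction, together with the Youden's index used for threshold selection:

```python
import math


def mcnemar(b, c):
    """McNemar's test for paired proportions (chi-square with
    continuity correction, 1 degree of freedom).

    b: participants classified negative by the CPR but positive by
       clinical judgment; c: the reverse. Only these discordant
       pairs contribute to the statistic.
    """
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square with 1 df: P(X > chi2)
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


def youden(sensitivity, specificity):
    """Youden's index: (sensitivity + specificity) - 1."""
    return sensitivity + specificity - 1


# Hypothetical discordant counts: 15 vs 5
chi2, p = mcnemar(15, 5)
```

With 15 and 5 discordant pairs, the statistic is 4.05 and the difference is significant at the 5% level; with the same counts in either direction the result is identical, since only |b − c| enters the formula.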

Results

Literature search

Of 10,155 titles and abstracts screened against the review eligibility criteria, 330 were obtained in full text and assessed for eligibility by two reviewers. 31 studies [8–38] were included in the review (Fig 1 – PRISMA flow diagram of article selection process).

Study characteristics

The studies addressed a variety of conditions: 9 for pulmonary embolism (PE) [8–16], 6 for deep vein thrombosis (DVT) [17–22], 3 for streptococcal throat infection [23–25], 3 for ankle and/or foot fracture [26–28], 2 for acute appendicitis [29, 30] and one each for acute coronary syndrome [31], pneumonia [32], head injury in children [33], cervical spine injury [34], active pulmonary tuberculosis [35], malaria [36], bacteraemia [37] and influenza [38] (Table 1 – Clinical conditions and study comparisons). Twenty-five different CPRs were evaluated. The majority (n = 16) were derived from logistic regression analysis and the remainder from recursive partitioning analysis (n = 3), discriminant analysis (n = 2), neural networks (n = 1), simple Bayesian analysis (n = 2) and an unspecified multivariable analysis (n = 1). In just over half of the included studies (n = 17), clinical judgment was a clinician's estimate of the probability of the presence of disease or categorisation of a study participant into a risk group (e.g. low, intermediate or high risk). In the remaining studies, clinical judgment was a clinician's diagnosis (n = 8), intended management (n = 3) or the clinical action taken (n = 3). In half of the included studies (n = 15), the experience of the clinicians estimating the probability of the target disorder or making a diagnosis or management decision was not reported. Ten studies included clinicians with varying levels of experience (e.g. ‘post graduates’ and ‘confirmed emergency physicians’), 3 included specialists only and 3 junior staff only.

Risk of bias

87% (27/31) of studies were judged to be at high or unclear risk of bias on two or more domains of the QUADAS-2 tool (Fig 2 – Summary QUADAS-2 risk of bias and applicability judgments). The most common risk of bias was due to interpretation of the reference standard occurring with knowledge of the index test result. For most studies in which the CPR was applied retrospectively to the data, it was not possible to determine whether researchers were blind to the result of the reference standard test. This is likely to bias results in favour of the CPR. 55% (17/31) of studies were judged to be at high risk of bias on the flow and timing domain. Studies commonly failed to include all enrolled cases in the data analysis or incorporated one of the index tests in the reference standard. Risk of bias assessments for individual studies are shown in Table 2 – Risk of bias and applicability concerns for individual studies included in the review.

Table 2. Risk of bias and applicability concerns for individual studies included in the review.

https://doi.org/10.1371/journal.pone.0128233.t002

Study results

Results of the included studies are tabulated in Table 3 – Characteristics and results of included studies and Table 4 – Characteristics and results of included studies for conditions with ≤2 studies, and presented graphically in Fig 3 – Accuracy estimates of clinical judgment versus CPRs for the included studies and Fig 4 – Accuracy estimates of clinical judgment versus CPRs for conditions with ≤2 studies.

Fig 4. Results of the included studies for conditions with ≤ 2 studies.

https://doi.org/10.1371/journal.pone.0128233.g004

Table 4. Characteristics and results of included studies for conditions with ≤2 studies.

https://doi.org/10.1371/journal.pone.0128233.t004

There were 41 comparisons between CPRs and clinical judgment [8–12, 14–16, 18–28, 30–33, 35–38] (Table 3, Table 4, Fig 3 and Fig 4). In 2 (5%) comparisons [10, 37], CPRs reduced the proportion of missed diagnoses in those classified as not having the disease, but this was offset by classifying a larger proportion of study participants as having disease (more false positives). In 33 (80%) comparisons [8, 9, 11, 12, 14, 15, 18, 19, 21, 23–28, 30–33, 35, 36, 38], the proportion of diagnoses missed by the CPR and clinical judgment was similar; in 7 of these comparisons [15, 18, 19, 27, 32, 36] CPRs classified a larger proportion of participants as not having disease (fewer false positives), and in 16 [8, 9, 11, 12, 21, 23–26, 30, 32, 35, 36, 38] the proportion was similar. In 6 (15%) comparisons [8, 16, 20, 22, 25], the proportion of diagnoses missed by the CPR was greater. This was offset by classifying a smaller proportion of participants as having the disease (fewer false positives) in 2 comparisons [8, 25]. In 3 of the 6 comparisons [16, 20, 22], the CPRs classified a similar proportion of participants as having the disease. There was 1 comparison [16] where the CPR both missed more diagnoses and classified a larger proportion of participants as having the disease (more false positives), but there were no comparisons where the CPR missed fewer diagnoses and classified a larger proportion of participants as not having disease.

There were 5 comparisons between CPRs and the combination of CPR and clinical judgment [13, 17, 29, 34] (Table 3, Table 4, Fig 3 and Fig 4). In 3 (60%) comparisons the proportion of diagnoses missed was similar [13, 17, 34] and in 2 [17, 34] of these comparisons, CPRs classified a larger proportion of study participants as not having disease (fewer false positives) than the combination of CPR and clinical judgment. In 2 (40%) comparisons [13, 29], the proportion of diagnoses missed by the CPRs was greater while the proportion classified as not having disease by the CPRs and the combination of CPR and clinical judgment was similar. There were no comparisons between the combination of CPR and clinical judgment and clinical judgment alone.

There were 5 studies [11, 12, 17, 25, 36], contributing 10 comparisons, that used different thresholds for the CPR or clinical judgment (for example, Kabrhel et al., 2005 [11] compared clinical judgment to the Wells PE score at thresholds <2 and ≤4). We report the results of 9 of these comparisons, excluding 1 comparison [17] where the proportions of interest (that is, the proportion classified as having disease and the proportion of missed diagnoses) were similar at the different thresholds. This means that for a small number of comparisons (n = 4), clinical judgment is counted twice [11, 12, 25, 36].

Pulmonary embolism

From 9 studies in pulmonary embolism, there were 9 comparisons between the Wells PE score (original 3-level or 2-level score) and clinical judgment [8, 9, 11–13, 15, 16]. In 8 (89%) comparisons [8, 9, 11–13, 15], the proportion of diagnoses missed by the score and clinical judgment was similar. In 1 of these [15], the score classified a larger proportion of all participants as not having the disease (fewer false positives), a similar proportion in 5 comparisons [8, 9, 11–13], and a larger proportion of participants as having the disease (more false positives) in 2 [11, 12]. In 1 (11%) comparison [16], the proportion of diagnoses missed by the Wells PE score was greater, while the proportion of participants classified as not having the disease was similar. In 2 comparisons between the PERC rule and clinical judgment [10, 14], the rule reduced the proportion of missed diagnoses in 1 [10], but this was offset by classifying a larger proportion of participants as having the disease (more false positives). In the other comparison [14], the proportion of diagnoses missed by the PERC rule and clinical judgment was similar. In 1 comparison [16], the Revised Geneva Score both missed more diagnoses and classified a larger proportion of participants as having the disease than clinical judgment. In 1 comparison [13] between the Geneva score and the combination of clinical judgment and score, the proportion of diagnoses missed by the CPR was greater.

Deep vein thrombosis

From 6 studies of DVT, there were 6 comparisons between the Wells DVT score and clinical judgment [18–22]. There were no comparisons in which the score reduced the proportion of missed diagnoses. In 4 (67%) comparisons [18, 19, 21], the proportion of diagnoses missed by the score and clinical judgment was similar. In 3 of these [18, 19], the score classified a larger proportion of all participants as not having disease (fewer false positives), and in 1 [21] the proportion was similar. In 2 comparisons [20, 22], the proportion of diagnoses missed by the CPR was greater, with a similar proportion classified as not having the disease. In 1 comparison [17] between the Oudega Rule and the combination of clinical judgment and Oudega Rule, the proportion of diagnoses missed was similar, with the rule classifying a larger proportion of participants as not having the disease (fewer false positives).

Streptococcal throat infection

There were 3 studies of streptococcal throat infection.

In 2 comparisons [23, 24] between the Centor Score (Modified and Original score combined with Tomkins Management Rule) and clinical judgment, and 1 comparison between the Walsh score and clinical judgment [23], the proportion of diagnoses missed and the proportion of all participants classified as not having disease were similar. In these studies, clinicians would likely have been aware that all study participants would have pharyngeal swabs taken for testing as per the study protocol. This may lead to an overestimate of the proportion of participants classified as not having disease by clinical judgment.

Foot and/or ankle fracture

From 3 studies of foot and/or ankle fracture, there were 3 (100%) comparisons between the Ottawa ankle and foot rules (OAR) and clinical judgment [26–28]. In all 3 comparisons, the proportion of diagnoses missed by the CPR and clinical judgment was similar. In 1 of these [27], the rule classified a larger proportion of study participants as not having disease (fewer false positives), and in 2 comparisons [26, 28] the CPR classified a larger proportion of participants as having disease (more false positives). In the 2 comparisons from 2 studies [26, 28] in which the OAR classified a larger proportion of participants as having disease than clinical judgment, the clinicians, when making a decision or diagnosis, would likely have been aware that all participants would be x-rayed as per the study protocol [26] or would have known that an x-ray could be ordered at their discretion [28]. This may lead to an overestimate of the proportion of study participants classified as not having disease by clinical judgment.

Acute appendicitis

There were 2 studies of acute appendicitis.

In 1 comparison [29] between the Fenyo Score and the combination of score and clinical judgment, the proportion of diagnoses missed by the score was greater while the proportion classified as not having disease was similar. In 1 comparison [30] between the Modified Alvarado Score and clinical judgment, the proportion of diagnoses missed and the proportion of all study participants classified as not having disease was similar.

Acute coronary syndrome, pneumonia, head injury in children, cervical spine injury, active pulmonary tuberculosis, malaria, bacteraemia and influenza

Of 8 studies (11 comparisons) addressing a variety of conditions, the CPRs showed either an improvement in the proportion of missed diagnoses or the proportion classified as not having disease, but this was often offset by a worsening of the other measure.

Discussion

In this review, CPRs were rarely superior to clinical judgment and there was generally a trade-off between the proportion of study participants classified as not having disease and, among those classified as not having disease, the proportion of missed diagnoses. CPRs for the diagnosis of DVT generally classified a larger proportion of all participants as not having disease than clinical judgment, but this was often at the expense of missed diagnoses. In other disease areas, CPRs showed either an improvement in the proportion classified as not having disease or the proportion of missed diagnoses, but often with the trade-off of worsening the other measure. These findings, however, are limited by the small number of studies for many of the conditions, the design features, and the generally unclear or high risk of bias in many of the included studies.

Trade-offs between the proportion classified as not having disease and the proportion of missed diagnoses by CPRs and clinical judgment seen in this review probably represent differences in the diagnostic threshold for positivity of the two judgment methods. For example, CPRs might be developed to avoid missing people with disease, and as such the threshold for positivity is set very low. The CPR would therefore likely be safer than clinical judgment, where the threshold for positivity is implicitly set and variable between and within clinicians, but this is often at the expense of classifying fewer participants as not having disease (and thereby avoiding further testing or treatment). Whether clinical judgment or a CPR is the preferred judgment method for a particular clinical condition will therefore depend on the relative benefits and harms arising from true positive and false positive diagnoses.

Variability in the proportion classified as not having disease and the proportion of missed diagnoses of CPRs compared with clinical judgment, even amongst studies of the same CPR, may be explained in part by features of the clinical setting of the studies. Differences in study design and methodology, including the type of CPR tested (logistic regression model or other statistical technique), the rigour with which it was developed, the case-mix of the study population, ‘modifications’ to clinical judgment (with or without structured data collection), by whom judgment was exercised (novice or experienced clinicians) or the way in which the result of the CPR was derived (calculation by clinician or researcher), may also explain the variation in performance seen in the studies included in this review. In many studies, clinicians collected diagnostic data on a structured data collection form. This systematic collection of diagnostic information may improve the observed diagnostic accuracy of the clinicians [39]. Clinician experience has also been shown to improve the accuracy of diagnosis [40].

Variability in the outcomes of clinical judgment and CPRs within conditions may also be explained by the method used to elicit clinical judgment, as the method used will likely be associated with the implicit threshold for positivity. In studies of appendicitis, for example, clinical judgment was a clinician's diagnosis of appendicitis or the clinician's actual decision to perform surgery or not. In studies of ankle fracture, clinical judgment was either a clinician's diagnosis of fracture or their intention to x-ray a patient, and in studies of sore throat, clinical judgment may have been a clinician's actual action to prescribe antibiotics or not, or a clinician's statement of their intention to treat with antibiotics. A clinician's threshold for positivity will likely be higher, for instance, when asked to provide a diagnosis (diagnostic threshold) than when asked about their intention to do further definitive testing (testing threshold). Where clinical judgment was elicited by obtaining a clinician's probability estimate on a continuous scale, there was also variation in the thresholds applied by study researchers. For studies of pulmonary embolism, for example, thresholds were applied at probabilities of 15 or 20%.

The design of the studies included in this review allows comparison of the performance of CPRs and clinical judgment when applied independently. In practice, however, CPRs are likely to be used as tools to support or complement clinical judgment. When used in this manner, the performance of diagnostic CPRs may differ from that shown in this review. The effect of a CPR when used in conjunction with clinical judgment can only be fully tested in a study design in which participants are assigned (ideally randomly) to receive clinical judgment alone or clinical judgment with access to a CPR. However, studies of diagnostic accuracy or incremental value [41, 42] provide a useful and less costly interim step in the evaluation of CPRs prior to a randomised controlled trial and can guide future research.

Our study shows that, in the context of medical diagnosis, CPRs do not consistently classify more individuals as not having disease, or miss fewer diagnoses among those classified as not having disease, than clinical judgment. This is in contrast to several reviews comparing clinical and statistical methods of prediction, often combining studies from fields as diverse as education, criminology and healthcare, which have generally found statistical methods to be superior [3–5]. A more recent body of research, however, has found that when formally tested, heuristics, proposed as models of human judgment, are in some situations as accurate as, or more accurate than, statistical models [6]. A review comparing the diagnostic accuracy of doctors and statistical tools for acute appendicitis [43] found that statistical tools had greater specificity than clinicians. However, most of the studies included in that review were excluded from the present review because a) the statistical tools and clinical judgment were not applied at the same time point or b) the statistical tools and clinical judgment were not applied to the same participants.

Due to variation in the design and purpose of the included studies, we did not attempt meta-analysis across or within study conditions. Instead, we compared CPRs and clinical judgment using two measures: 1) the proportion of all study participants classified as not having disease (a measure of efficiency) and 2) the proportion of participants, among those classified as not having disease, who actually have the disease (false negative rate, a measure of safety). Because many CPRs seek to either improve diagnosis or identify a group of patients who do not require additional testing, we believe these are the most clinically relevant measures. Though these measures are dependent on the prevalence of the disease in the study population, the studies were judged to have been undertaken in relevant clinical settings. Traditional measures of diagnostic accuracy, such as sensitivity, specificity and area under a receiver operating characteristic curve, are often favoured accuracy metrics because they are commonly believed to be unaffected by disease prevalence, though this has recently been shown not to be the case [44]. The proportion of participants classified as having disease and the proportion with false positive results can also be obtained from Figs 3 and 4, and the traditional measures of diagnostic accuracy from Tables 3 and 4.

The majority of included studies were judged to be at high or unclear risk of bias on 2 or more of the 4 risk of bias domains assessed. Differential verification (the results of clinical judgment or the CPR influence the performance of reference tests) and incorporation bias (the results of the CPR are used to make the final diagnosis) affected many studies, particularly studies of DVT and PE. Further, studies commonly did not include all eligible cases in the analysis, and often it was not clear whether researchers applying a CPR retrospectively to a dataset were blind to the results of the reference standard. The design of studies of ankle fracture and streptococcal throat infection may also have led to inaccurate estimates of the diagnostic accuracy of clinical judgment. In these studies, the clinicians' diagnosis or decision that x-ray or antibiotics were necessary may have been influenced by knowledge that all or most study participants would undergo confirmatory testing with an x-ray or throat swab. In this review, in two of the three studies of ankle and/or foot fracture, the Ottawa Ankle Rules were considerably less efficient than clinical judgment that a fracture was present or that an x-ray was necessary. This finding conflicts with that of a multicentre randomised controlled trial in which application of the rules led to x-rays for 79% of study participants, compared with 99.6% of participants when the decision was made by emergency department physicians [45].

The database searches to identify studies for the review were conducted up to March 2013, and eligible studies may have been published since this time. Because of the size of the search, not all titles and abstracts identified in the electronic searches were screened by 2 reviewers. However, a second reviewer screened a subset of titles and did not find any additional studies. The search terms used may not have located all eligible studies, but manual searches of systematic reviews of CPRs and comprehensive reference and citation checking minimise this possibility. An assessment of the risk of bias in the studies deriving the CPRs, and of the ‘usability’ features of the CPRs evaluated in this review, was not conducted; updates to this review should seek to do this, as such information may assist in the interpretation of the results of the review.

While CPRs show promise as a way of improving clinical decision making, to date there have been few studies comparing, in the same participants, the accuracy of CPRs and clinical judgment, and those studies often had design issues that raised the potential for bias and made interpretation of their results difficult. Though detailed guidance on the validation and evaluation of prediction models and rules is available [46, 47], guidance on issues specific to studies comparing the diagnostic performance of CPRs and clinical judgment may improve this situation. To assess the potential of diagnostic CPRs to improve diagnosis and patient outcomes when used in combination with clinical judgment, particularly in situations where the clinician has a high degree of uncertainty, an analysis of studies comparing care provided when clinicians have access to a diagnostic CPR with usual care would be useful.

In Summary

The limited studies included in this review show that none of the CPRs evaluated to date is clearly superior to clinical judgment across a range of medical conditions. They also show variation in the comparative performance of clinical judgment and CPRs between studies of the same condition and between studies of the same CPR. There is generally a trade-off between the proportion classified as not having disease and the proportion of missed diagnoses, most likely due to the different thresholds for positivity associated with clinical judgment and CPRs. The current review highlights some of the methodological issues relating to the conduct of studies comparing CPRs and clinical judgment, with design features of many of the included studies increasing the potential for bias.

Acknowledgments

The authors thank Rae Thomas for help with the screening of titles and abstracts.

Author Contributions

Conceived and designed the experiments: SS JD PG. Performed the experiments: SS JD. Analyzed the data: SS. Wrote the paper: SS JD PG.

References

  1. Steyerberg EW. Clinical prediction models. A practical approach to development, validation and updating. New York: Springer; 2009.
  2. Toll DB, Janssen KJ, Vergouwe Y, Moons KG. Validation, updating and impact of clinical prediction rules: a review. Journal of clinical epidemiology. 2008;61(11):1085–94. pmid:19208371.
  3. Dawes RM, Faust D, Meehl PE. Clinical versus actuarial judgment. Science. 1989;243(4899):1668–74. pmid:2648573.
  4. Ægisdóttir S, White MJ, Spengler PM, Maugherman AS, Anderson LA, Cook RS, et al. The meta-analysis of clinical judgment project: Fifty-six years of accumulated research on clinical versus statistical prediction. The Counseling Psychologist. 2006;34:341–82.
  5. Grove WM, Zald DH, Lebow BS, Snitz BE, Nelson C. Clinical versus mechanical prediction: a meta-analysis. Psychological assessment. 2000;12(1):19–30. pmid:10752360.
  6. Gigerenzer G, Todd PM, & the ABC Research Group. Simple heuristics that make us smart. New York: Oxford University Press; 1999.
  7. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Annals of internal medicine. 2011;155(8):529–36. pmid:22007046.
  8. Runyon MS, Webb WB, Jones AE, Kline JA. Comparison of the unstructured clinician estimate of pretest probability for pulmonary embolism to the Canadian score and the Charlotte rule: a prospective observational study. Academic emergency medicine: official journal of the Society for Academic Emergency Medicine. 2005;12(7):587–93. pmid:15995088.
  9. Kabrhel C, Mark Courtney D, Camargo CA Jr., Moore CL, Richman PB, Plewa MC, et al. Potential impact of adjusting the threshold of the quantitative D-dimer based on pretest probability of acute pulmonary embolism. Academic emergency medicine: official journal of the Society for Academic Emergency Medicine. 2009;16(4):325–32. pmid:19298619.
  10. Kline JA, Courtney DM, Kabrhel C, Moore CL, Smithline HA, Plewa MC, et al. Prospective multicenter evaluation of the pulmonary embolism rule-out criteria. Journal of thrombosis and haemostasis: JTH. 2008;6(5):772–80. pmid:18318689.
  11. Kabrhel C, McAfee AT, Goldhaber SZ. The contribution of the subjective component of the Canadian Pulmonary Embolism Score to the overall score in emergency department patients. Academic emergency medicine: official journal of the Society for Academic Emergency Medicine. 2005;12(10):915–20. pmid:16204134.
  12. Carrier M, Wells PS, Rodger MA. Excluding pulmonary embolism at the bedside with low pre-test probability and D-dimer: safety and clinical utility of 4 methods to assign pre-test probability. Thrombosis research. 2006;117(4):469–74. pmid:15893807.
  13. Chagnon I, Bounameaux H, Aujesky D, Roy PM, Gourdier AL, Cornuz J, et al. Comparison of two clinical prediction rules and implicit assessment among patients with suspected pulmonary embolism. The American journal of medicine. 2002;113(4):269–75. pmid:12361811.
  14. Penaloza A, Verschuren F, Dambrine S, Zech F, Thys F, Roy PM. Performance of the Pulmonary Embolism Rule-out Criteria (the PERC rule) combined with low clinical probability in high prevalence population. Thrombosis research. 2012;129(5):e189–93. pmid:22424852.
  15. Sanson BJ, Lijmer JG, Mac Gillavry MR, Turkstra F, Prins MH, Buller HR. Comparison of a clinical probability estimate and two clinical models in patients with suspected pulmonary embolism. ANTELOPE-Study Group. Thrombosis and haemostasis. 2000;83(2):199–203. pmid:10739372.
  16. Penaloza A, Verschuren F, Meyer G, Quentin-Georget S, Soulie C, Thys F, et al. Comparison of the unstructured clinician gestalt, the wells score, and the revised Geneva score to estimate pretest probability for suspected pulmonary embolism. Annals of emergency medicine. 2013;62(2):117–24 e2. pmid:23433653.
  17. Geersing GJ, Janssen KJ, Oudega R, van Weert H, Stoffers H, Hoes A, et al. Diagnostic classification in patients with suspected deep venous thrombosis: physicians' judgement or a decision rule? The British journal of general practice: the journal of the Royal College of General Practitioners. 2010;60(579):742–8. pmid:20883623; PubMed Central PMCID: PMC2944933.
  18. Bigaroni A, Perrier A, Bounameaux H. Is clinical probability assessment of deep vein thrombosis by a score really standardized? Thrombosis and haemostasis. 2000;83(5):788–9. pmid:10823281.
  19. Miron MJ, Perrier A, Bounameaux H. Clinical assessment of suspected deep vein thrombosis: comparison between a score and empirical assessment. Journal of internal medicine. 2000;247(2):249–54. pmid:10692088.
  20. Blattler W, Martinez I, Blattler IK. Diagnosis of deep venous thrombosis and alternative diseases in symptomatic outpatients. European journal of internal medicine. 2004;15(5):305–11. pmid:15450988.
  21. Cornuz J, Ghali WA, Hayoz D, Stoianov R, Depairon M, Yersin B. Clinical prediction of deep venous thrombosis using two risk assessment methods in combination with rapid quantitative D-dimer testing. The American journal of medicine. 2002;112(3):198–203. pmid:11893346.
  22. Wang B, Lin Y, Pan FS, Yao C, Zheng ZY, Cai D, et al. Comparison of empirical estimate of clinical pretest probability with the Wells score for diagnosis of deep vein thrombosis. Blood coagulation & fibrinolysis: an international journal in haemostasis and thrombosis. 2013;24(1):76–81. pmid:23103729.
  23. Cebul RD, Poses RM. The comparative cost-effectiveness of statistical decision rules and experienced physicians in pharyngitis management. JAMA: the journal of the American Medical Association. 1986;256(24):3353–7. pmid:3097339.
  24. Rosenberg P, McIsaac W, Macintosh D, Kroll M. Diagnosing streptococcal pharyngitis in the emergency department: Is a sore throat score approach better than rapid streptococcal antigen testing? Cjem. 2002;4(3):178–84. pmid:17609003.
  25. Attia MW, Zaoutis T, Klein JD, Meier FA. Performance of a predictive model for streptococcal pharyngitis in children. Archives of pediatrics & adolescent medicine. 2001;155(6):687–91. pmid:11386959.
  26. Glas AS, Pijnenburg BA, Lijmer JG, Bogaard K, de RM, Keeman JN, et al. Comparison of diagnostic decision rules and structured data collection in assessment of acute ankle injury. CMAJ: Canadian Medical Association journal = journal de l'Association medicale canadienne. 2002;166(6):727–33. pmid:11944759; PubMed Central PMCID: PMC99451.
  27. Singh-Ranger G, Marathias A. Comparison of current local practice and the Ottawa Ankle Rules to determine the need for radiography in acute ankle injury. Accident and emergency nursing. 1999;7(4):201–6. pmid:10808759.
  28. Al Omar MZ, Baldwin GA. Reappraisal of use of X-rays in childhood ankle and midfoot injuries. Emergency radiology. 2002;9(2):88–92. pmid:15290584.
  29. Fenyo G. Routine use of a scoring system for decision-making in suspected acute appendicitis in adults. Acta chirurgica Scandinavica. 1987;153(9):545–51. pmid:3321809.
  30. Meltzer AC, Baumann BM, Chen EH, Shofer FS, Mills AM. Poor sensitivity of a modified Alvarado score in adults with suspected appendicitis. Annals of emergency medicine. 2013;62(2):126–31. pmid:23623557.
  31. Mitchell AM, Garvey JL, Chandra A, Diercks D, Pollack CV, Kline JA. Prospective multicenter study of quantitative pretest probability assessment to exclude acute coronary syndrome for patients evaluated in emergency department chest pain units. Annals of emergency medicine. 2006;47(5):447. pmid:16631984.
  32. Emerman CL, Dawson N, Speroff T, Siciliano C, Effron D, Rashad F, et al. Comparison of physician judgment and decision aids for ordering chest radiographs for pneumonia in outpatients. Annals of emergency medicine. 1991;20(11):1215–9. pmid:1952308.
  33. Crowe L, Anderson V, Babl FE. Application of the CHALICE clinical prediction rule for intracranial injury in children outside the UK: impact on head CT rate. Archives of disease in childhood. 2010;95(12):1017–22. pmid:20573733.
  34. Vaillancourt C, Stiell IG, Beaudoin T, Maloney J, Anton AR, Bradford P, et al. The out-of-hospital validation of the Canadian C-Spine Rule by paramedics. Annals of emergency medicine. 2009;54(5):663–71 e1. pmid:19394111.
  35. El-Solh AA, Hsiao CB, Goodnough S, Serghani J, Grant BJ. Predicting active pulmonary tuberculosis using an artificial neural network. Chest. 1999;116(4):968–73. pmid:10531161.
  36. Bojang KA, Obaro S, Morison LA, Greenwood BM. A prospective evaluation of a clinical algorithm for the diagnosis of malaria in Gambian children. Tropical medicine & international health: TM & IH. 2000;5(4):231–6. pmid:10810013.
  37. Leibovici L, Greenshtain S, Cohen O, Mor F, Wysenbeek AJ. Bacteremia in febrile patients. A clinical model for diagnosis. Archives of internal medicine. 1991;151(9):1801–6. pmid:1888246.
  38. Stein J, Louie J, Flanders S, Maselli J, Hacker JK, Drew WL, et al. Performance characteristics of clinical diagnosis, a clinical decision rule, and a rapid influenza test in the detection of influenza infection in a community sample of adults. Annals of emergency medicine. 2005;46(5):412–9. pmid:16271670.
  39. Korner H, Sondenaa K, Soreide JA, Andersen E, Nysted A, Lende TH. Structured data collection improves the diagnosis of acute appendicitis. The British journal of surgery. 1998;85(3):341–4. pmid:9529488.
  40. Kabrhel C, Camargo CA Jr., Goldhaber SZ. Clinical gestalt and the diagnosis of pulmonary embolism: does experience matter? Chest. 2005;127(5):1627–30. pmid:15888838.
  41. Broekhuizen BD, Sachs A, Janssen K, Geersing GJ, Moons K, Hoes A, et al. Does a decision aid help physicians to detect chronic obstructive pulmonary disease? The British journal of general practice: the journal of the Royal College of General Practitioners. 2011;61(591):e674–9. pmid:22152850; PubMed Central PMCID: PMC3177137.
  42. Moons KG, de Groot JA, Linnet K, Reitsma JB, Bossuyt PM. Quantifying the added value of a diagnostic test or marker. Clinical chemistry. 2012;58(10):1408–17. pmid:22952348.
  43. Liu JL, Wyatt JC, Deeks JJ, Clamp S, Keen J, Verde P, et al. Systematic reviews of clinical decision tools for acute abdominal pain. Health technology assessment. 2006;10(47):1–167, iii-iv. pmid:17083855.
  44. Leeflang MM, Rutjes AW, Reitsma JB, Hooft L, Bossuyt PM. Variation of a test's sensitivity and specificity with disease prevalence. CMAJ: Canadian Medical Association journal = journal de l'Association medicale canadienne. 2013;185(11):E537–44. pmid:23798453; PubMed Central PMCID: PMC3735771.
  45. Auleley GR, Ravaud P, Giraudeau B, Kerboull L, Nizard R, Massin P, et al. Implementation of the Ottawa ankle rules in France. A multicenter randomized controlled trial. JAMA: the journal of the American Medical Association. 1997;277(24):1935–9. pmid:9200633.
  46. Collins GS, de Groot JA, Dutton S, Omar O, Shanyinde M, Tajar A, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC medical research methodology. 2014;14:40. pmid:24645774; PubMed Central PMCID: PMC3999945.
  47. Moons KG, Kengne AP, Woodward M, Royston P, Vergouwe Y, Altman DG, et al. Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker. Heart. 2012;98(9):683–90. pmid:22397945.