Fig 1.
Example illustrating the problem of noise in a comparator.
Fig 2.
Example of the effect of misclassification by a comparator on the apparent performance of a diagnostic test.
A total of 100 Ground Truth negative patients and 100 Ground Truth positive patients were considered. In Panel A, there is no error in patient classification (i.e., the comparator is perfectly concordant with the Ground Truth). In Panel B, a random 5% of the comparator's classifications are assumed to diverge incorrectly from the Ground Truth. The difference in the distribution of test scores (y-axis) between the two panels leads to substantial underestimates of diagnostic performance, as shown in Table 1.
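The setup described in this caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' simulation: the normal score distributions, the decision threshold of 1.5, the random seed, and the helper names `flip_labels` and `agreement` are all assumptions introduced here.

```python
import numpy as np

def flip_labels(truth, rate, rng):
    """Return a copy of `truth` with a random fraction `rate` of labels flipped."""
    noisy = truth.copy()
    n_flip = int(round(rate * truth.size))
    idx = rng.choice(truth.size, size=n_flip, replace=False)
    noisy[idx] = 1 - noisy[idx]
    return noisy

def agreement(test_positive, reference):
    """Positive and negative percent agreement of binary test calls with a reference."""
    ref_pos = reference == 1
    ppa = np.mean(test_positive[ref_pos])    # fraction of reference positives called positive
    npa = np.mean(~test_positive[~ref_pos])  # fraction of reference negatives called negative
    return ppa, npa

rng = np.random.default_rng(0)  # assumed seed, for repeatability only

# 100 Ground Truth negative (0) and 100 Ground Truth positive (1) patients.
truth = np.repeat([0, 1], 100)

# Assumed test-score distributions (hypothetical; the caption does not specify them).
scores = np.where(truth == 1,
                  rng.normal(3.0, 1.0, truth.size),
                  rng.normal(0.0, 1.0, truth.size))
calls = scores > 1.5  # assumed decision threshold

# Panel A: comparator identical to Ground Truth. Panel B: 5% of labels flipped.
comparator = flip_labels(truth, 0.05, rng)

sens, spec = agreement(calls, truth)
ppa, npa = agreement(calls, comparator)
print(f"vs Ground Truth: sensitivity={sens:.2f}, specificity={spec:.2f}")
print(f"vs 5% noisy comparator: PPA={ppa:.2f}, NPA={npa:.2f}")
```

The test calls are identical in both comparisons; only the reference labels change, so any drop from sensitivity to PPA (or specificity to NPA) comes from the comparator rather than from the test.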
Table 1.
Effect of uncertainty in the comparator on estimates of test performance.
Fig 3.
Degradation of apparent performance of a perfect diagnostic test, as a function of error in the comparator.
In this scenario, Ground Truth positive patients and Ground Truth negative patients are equally likely to be misclassified by the comparator. (A) Comparator with no classification error, perfectly representing the Ground Truth for 100 negative patients and 100 positive patients. (B) Apparent performance of diagnostic test, as a function of the misclassification rate of the comparator. The error bars describe 95% empirical confidence intervals about medians, computed over 100 simulation cycles. True test performance is indicated when the FP and FN rates are each 0%. The terms Sensitivity and Specificity are appropriate when there is no misclassification in the comparator (FP rate = FN rate = 0%). The terms Positive Percent Agreement (PPA) and Negative Percent Agreement (NPA) should be used in place of Sensitivity and Specificity, respectively, when the comparator is known to contain uncertainty.
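As a rough illustration of how a curve like the one in panel B could be generated, the following Python sketch applies a perfect test to 100 negative and 100 positive patients, flips each comparator label independently with a given probability, and summarizes 100 simulation cycles by the median and an empirical 95% interval. The independent per-patient flipping, the grid of rates, and the names `simulate_ppa_npa` and `summarize` are assumptions made for this sketch; the source does not state how the flips were drawn.

```python
import numpy as np

def simulate_ppa_npa(n_neg, n_pos, rate, cycles, rng):
    """Apparent PPA and NPA of a perfect test against a noisy comparator."""
    truth = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
    calls = truth == 1  # a perfect test always agrees with the Ground Truth
    results = np.empty((cycles, 2))
    for c in range(cycles):
        # Assumption: each patient is misclassified independently with probability `rate`,
        # so positives and negatives are equally likely to be misclassified.
        flip = rng.random(truth.size) < rate
        comparator = np.where(flip, 1 - truth, truth)
        results[c, 0] = np.mean(calls[comparator == 1])   # PPA
        results[c, 1] = np.mean(~calls[comparator == 0])  # NPA
    return results

def summarize(values):
    """Median with an empirical 95% interval (2.5th and 97.5th percentiles)."""
    low, median, high = np.percentile(values, [2.5, 50, 97.5])
    return median, low, high

rng = np.random.default_rng(1)  # assumed seed, for repeatability only
for rate in (0.0, 0.05, 0.10, 0.20):
    res = simulate_ppa_npa(100, 100, rate, cycles=100, rng=rng)
    ppa_med, ppa_lo, ppa_hi = summarize(res[:, 0])
    npa_med, npa_lo, npa_hi = summarize(res[:, 1])
    print(f"rate={rate:.2f}  PPA={ppa_med:.3f} [{ppa_lo:.3f}, {ppa_hi:.3f}]  "
          f"NPA={npa_med:.3f} [{npa_lo:.3f}, {npa_hi:.3f}]")
```

With equal class sizes and a perfect test, the apparent PPA and NPA in this sketch fall to roughly one minus the misclassification rate, even though the true sensitivity and specificity remain 100%.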
Fig 4.
Degradation of apparent performance of a near-perfect diagnostic test, as a function of error in the comparator.
(A) Representation of Ground Truth for 100 negative patients (grey points) and 100 positive patients (blue points). A slight overlap between the Ground Truth negative and Ground Truth positive distributions is assumed, leading to an AUC of 0.98 with the Ground Truth as reference. (B) Apparent performance of the diagnostic test, as a function of the misclassification rate of the comparator. The error bars describe 95% empirical confidence intervals about medians, computed over 100 simulation cycles. True test performance is indicated when the FP and FN rates are each 0%. The terms Sensitivity and Specificity are appropriate when there is no misclassification in the comparator (FP rate = FN rate = 0%). The terms Positive Percent Agreement (PPA) and Negative Percent Agreement (NPA) should be used in place of Sensitivity and Specificity, respectively, when the comparator is known to contain uncertainty.
Fig 5.
A simulated inaccurate screening test in a moderately low prevalence setting.
In this scenario, Ground Truth positive patients and Ground Truth negative patients are equally likely to be misclassified by the comparator. (A) Representation of Ground Truth for 250 negative patients and 50 positive patients, with significant overlap between the positive and negative Ground Truth distributions. (B) Apparent performance of the diagnostic test, as a function of the misclassification rate of the comparator. The error bars describe 95% empirical confidence intervals about medians, computed over 100 simulation cycles. True test performance is indicated when the FP and FN rates are each 0%. The terms Sensitivity and Specificity are appropriate when there is no misclassification in the comparator (FP rate = FN rate = 0%). The terms Positive Percent Agreement (PPA) and Negative Percent Agreement (NPA) should be used in place of Sensitivity and Specificity, respectively, when the comparator is known to contain uncertainty.
Fig 6.
A simulated screening test in a low prevalence setting, for example, for a relatively uncommon infectious disease.
In this scenario, Ground Truth positive patients and Ground Truth negative patients are equally likely to be misclassified by the comparator. (A) Representation of Ground Truth for 1950 negative patients and 50 positive patients, with some overlap between the positive and negative Ground Truth distributions. (B) Apparent performance of the diagnostic test, as a function of the misclassification rate of the comparator. The error bars describe 95% empirical confidence intervals about medians, computed over 100 simulation cycles. True test performance is indicated when the FP and FN rates are each 0%. The terms Sensitivity and Specificity are appropriate when there is no misclassification in the comparator (FP rate = FN rate = 0%). The terms Positive Percent Agreement (PPA) and Negative Percent Agreement (NPA) should be used in place of Sensitivity and Specificity, respectively, when the comparator is known to contain uncertainty.
Fig 7.
(A) Real data from a clinical trial for a new sepsis diagnostic test, conducted at 8 sites in the USA and the Netherlands [25]. (B) The apparent performance of the test (y-axis) decreases as uncertainty is introduced into the comparator (x-axis). 95% confidence intervals are shown. The difference between the apparent test performance at a given comparator misclassification rate and at a comparator misclassification rate of zero indicates the degree of underestimation of true test performance due to uncertainty in the comparator. The vertical lines mark the observed misclassification rates for various patient subsets within the same trial, as described in the text. Misclassification rates are based on quantifying the discordance between independent expert opinions. Solid triangles show the observed measurements for the trial for each of these groups, without correction for comparator uncertainty. Sensitivity/PPA and Specificity/NPA are each marked with an asterisk (*) to emphasize that these measures assume no misclassification in the comparator. Positive Percent Agreement (PPA) and Negative Percent Agreement (NPA) are the correct terms to use when the comparator is known to contain uncertainty, as in this case.
Table 2.
Testing the models: Simulated vs. observed effect of comparator noise on test performance.
Table 3.
Classification of patients in a trial on a new sepsis diagnostic test.
Fig 8.
(A) Subset of pneumonia/LRTI-specific data (N = 93) from a clinical trial for a new sepsis diagnostic test, conducted at 8 sites in the USA and the Netherlands [25]. (B) The apparent performance of the test (y-axis) decreases as uncertainty is introduced into the comparator (x-axis). 95% confidence intervals are shown. The difference between the apparent test performance at a given comparator misclassification rate and at a comparator misclassification rate of zero indicates the degree of underestimation of true test performance due to uncertainty in the comparator. Solid triangles show the observed measurements for the trial for each of these groups, without correction for comparator uncertainty. Misclassification rates are based on quantifying the discordance between independent expert opinions. Sensitivity/PPA and Specificity/NPA are each marked with an asterisk (*) to emphasize that these measures assume no misclassification in the comparator. Positive Percent Agreement (PPA) and Negative Percent Agreement (NPA) are the correct terms to use when the comparator is known to contain uncertainty, as in this case.
Table 4.
Demonstration of the effect of comparator uncertainty on estimates of test performance, for the pneumonia/LRTI patient subset.