Computer-aided detection (CAD) was recently recommended by the WHO for TB screening and triage based on several evaluations, but unlike traditional diagnostic tests, software versions are updated frequently and require constant evaluation. Since then, newer versions of two of the evaluated products have already been released. We used a case control sample of 12,890 chest X-rays to compare performance and model the programmatic effect of upgrading to newer versions of CAD4TB and qXR. We compared the area under the receiver operating characteristic curve (AUC), overall, and with data stratified by age, TB history, gender, and patient source. All versions were compared against radiologist readings and WHO’s Target Product Profile (TPP) for a TB triage test. Both newer versions significantly outperformed their predecessors in terms of AUC: CAD4TB version 6 (0.823 [0.816–0.830]), version 7 (0.903 [0.897–0.908]) and qXR version 2 (0.872 [0.866–0.878]), version 3 (0.906 [0.901–0.911]). Newer versions met WHO TPP values, whereas older versions did not. All products equalled or surpassed the human radiologist performance with improvements in triage ability in newer versions. Humans and CAD performed worse in older age groups and among those with TB history. New versions of CAD outperform their predecessors. Prior to implementation CAD should be evaluated using local data because underlying neural networks can differ significantly. An independent rapid evaluation centre is needed to provide implementers with performance data on new versions of CAD products as they are developed.
The World Health Organization recommended the use of artificial intelligence (AI)-powered computer-aided detection (CAD) for TB screening and triage in 2021. One year on, we comprehensively compare the performance of the newest versions of two CAD products (CAD4TB and qXR) to their WHO-evaluated predecessors. We found that both newer versions significantly improved upon their predecessors’ ability to detect TB, performing better than the human readers. We also showed that the AI underlying new software versions can differ remarkably from the old and resemble an entirely new product altogether. We further demonstrate that, unlike laboratory diagnostic tools, CAD software updates could significantly impact the selection of appropriate threshold scores, the number of people with TB detected and cost-effectiveness. With newer CAD versions being rolled out almost annually, our results therefore underscore the need for rapid evidence generation to evaluate newer CAD versions in the fast-growing medical AI industry.
Citation: Qin ZZ, Barrett R, Ahmed S, Sarker MS, Paul K, Adel ASS, et al. (2022) Comparing different versions of computer-aided detection products when reading chest X-rays for tuberculosis. PLOS Digit Health 1(6): e0000067. https://doi.org/10.1371/journal.pdig.0000067
Editor: Gilles Guillot, WHO: Organisation mondiale de la Sante, SWITZERLAND
Received: March 6, 2022; Accepted: May 15, 2022; Published: June 14, 2022
Copyright: © 2022 Qin et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All numeric data and codes used in this manuscript are available here: https://github.com/ZZQin/MachineBGD/tree/master/2.0%20Version%20Comparison.
Funding: This project was funded by Global Affairs Canada through the Stop TB Partnership’s TB REACH Initiative (grant number STBP/TBREACH/GSA/W5-24). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Several computer-aided detection (CAD) products for TB have emerged and can provide an automated and standardized interpretation of digital chest X-ray (CXR) based on artificial intelligence. Recent evaluations of CAD’s ability to detect TB-related abnormalities report performance comparable to (or better than) human readers. In March 2021, the World Health Organization (WHO) reviewed impartial evaluations of three CAD products, and made the landmark decision to update international TB screening policy to include the use of CAD on CXR of individuals ≥15 years. Under the WHO guidance, other CAD products may be utilized providing their performance matches those reviewed in the guideline.
The emergence of CAD as a high-performing tool for screening and triage has unfolded differently from that of new laboratory diagnostics, with newer software versions becoming available rapidly. The speed of this progress presents a challenge to the relevance of current CAD literature and the policy it informs. Two of the products reviewed during the WHO guideline development process published in 2021, CAD4TB V6 (Delft Imaging Systems, the Netherlands) and qXR V2 (Qure.ai, India), have already been updated. Further, modern CAD software is developed using neural networks, an AI technique that works by mimicking the human brain. However, the inner workings of commercial CAD software are challenging to understand for both general audiences and developers, because neural networks are by nature akin to a black box and the underlying algorithms are closely guarded business secrets. Therefore, for medical professionals, who will not know any real commercial AI software’s inner workings, confidence in the ability of CAD software to detect TB should be earned by comprehensive and unbiased software evaluations that measure different performance indicators on real-world datasets. Only one study, also outdated by new software versions, assessed and compared consecutive versions of a single CAD product. More broadly, there is a lack of research quantifying differences in programmatic impact between software versions to advise users and TB programmes on whether, and what, adjustment is needed when using software tools that update on a yearly or more rapid basis. We therefore compare the performance of two WHO-evaluated CAD product versions with the subsequent versions, using bacteriological evidence as the reference standard.
Materials and methods
The study evaluated CAD4TB versions 6 and 7, and qXR versions 2 and 3. Both CAD products read CXR images and calculate an abnormality score representing the likelihood that TB-associated abnormalities are present in an image. A dichotomous result (TB-associated abnormalities present or absent) is arrived at by setting a threshold abnormality score, above which the algorithm suggests that TB-associated abnormalities are present, and that the individual should undergo further confirmatory testing. The outputs also include heat maps indicating the location of abnormalities. Both products had been reviewed by the WHO Guidelines Development Group and approved for use in TB triage and screening (in individuals ≥15 years) in 2021. The dataset used in this evaluation is taken from the Stop TB Partnership’s TB REACH CXR Evaluation Centre.
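The dichotomisation described above can be illustrated with a minimal sketch. It assumes the rule "score strictly above the threshold is flagged"; the function names and the six scores are hypothetical and are not taken from either product or from the study dataset.

```python
def flags_tb(score: float, threshold: float) -> bool:
    """True when the abnormality score is above the threshold (refer for confirmatory testing)."""
    return score > threshold


def sensitivity_specificity(scores, bac_positive, threshold):
    """Sensitivity and specificity of the dichotomised score against the reference standard."""
    tp = fn = tn = fp = 0
    for score, is_positive in zip(scores, bac_positive):
        flagged = flags_tb(score, threshold)
        if is_positive and flagged:
            tp += 1
        elif is_positive:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores for six images, not study data.
scores = [97.2, 81.0, 40.0, 9.6, 53.0, 88.0]
bac_positive = [True, True, True, False, False, False]
sens, spec = sensitivity_specificity(scores, bac_positive, threshold=60)
```

Raising the threshold trades sensitivity for specificity, which is why the same threshold can behave differently after a software update changes the score distribution.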
CXR sample collection
Every individual ≥15 years old visiting one of three TB screening centres set up by icddr,b in Dhaka, Bangladesh was verbally screened for TB symptoms (cough, shortness of breath, weight loss, haemoptysis) and received a CXR. The image was then read by one of three radiologists registered with the Bangladesh Medical and Dental Council. The radiologists were blinded to any information except age and sex. They classified each image as ‘normal’ or ‘abnormal’ (including any abnormality, whether consistent with TB or not). Regardless of the CXR results, all individuals were asked to submit a fresh spot sputum sample for testing with Xpert MTB/RIF (Xpert) assay. Xpert provided a bacteriological reference standard, confirming the presence (Bac+) or absence (Bac-) of Mycobacterium tuberculosis.
For this study, this dataset was sampled using case control sampling, with a 2 to 1 match of 8,582 Bac- and 4,308 Bac+ CXR according to the reference standard, resulting in a dataset of 12,890 CXRs which were read by all four software versions.
CAD reading was performed retrospectively during sessions where CAD4TB and qXR were installed on Stop TB Partnership’s Secure File Transfer Protocol server storing the de-identified CXR images. CAD developers were not granted access to the evaluation dataset before or after the reading, and reading was performed blind of all clinical and demographic information, and without any prior AI training. Unique identifiers were used to group server datasets for analytical purposes. Only co-authors had access to the dataset.
To compare the accuracy of newer against older versions, receiver operating characteristic (ROC) curves were plotted and the area under the ROC curves (AUC) was calculated as a general indication of product version accuracy over the entire abnormality score range.
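As an illustration of what the AUC summarises, the sketch below uses the standard equivalence between the AUC and the probability that a randomly chosen Bac+ image receives a higher score than a randomly chosen Bac- image, with ties counted as one half. This is not necessarily how the study computed its AUCs; the function name and the scores are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """AUC as the share of (Bac+, Bac-) pairs in which the Bac+ score is higher; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores, not study data.
example_auc = auc([90, 80, 40], [10, 50, 80])
```

The pairwise loop is quadratic in the number of images, so a rank-based formulation would be preferable for a dataset of the size used here.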
A paired one-sided t-test was performed to test whether the average CAD4TB v7 score was less than the average CAD4TB v6 score. The same test was used to assess whether the average qXR v3 score was less than the average qXR v2 score. We also constructed histograms of the abnormality scores of the different software versions disaggregated by bacteriological status. To examine how performance changes across threshold scores, we evaluated the cost-saving of each product version in a hypothetical triage situation in which CXRs from 20,000 adults would be interpreted by each CAD version and only those with an abnormality score above a threshold value would receive an Xpert diagnostic test. We assumed the prevalence of Bac+ TB in the population was 19%, as in the principal study, then calculated the sensitivity of each version and number of Xpert assays hypothetically needed.[2,7]
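The hypothetical triage scenario can be sketched as follows. The population size (20,000) and prevalence (19%) come from the text; the function name, the returned fields, and the example sensitivity and specificity are illustrative assumptions. The sketch also assumes that everyone flagged by CAD receives exactly one Xpert test.

```python
def triage_model(sensitivity, specificity, population=20_000, prevalence=0.19):
    """Xpert tests needed and TB detected when only CAD-flagged people are tested."""
    with_tb = population * prevalence
    without_tb = population - with_tb
    true_pos = with_tb * sensitivity            # Bac+ people flagged by CAD
    false_pos = without_tb * (1 - specificity)  # Bac- people flagged by CAD
    xpert_needed = true_pos + false_pos
    return {
        "xpert_needed": xpert_needed,
        "xpert_saved_fraction": 1 - xpert_needed / population,
        "tb_detected": true_pos,
        "tb_missed": with_tb - true_pos,
    }


# Illustrative operating point (the WHO TPP minimums), not a measured result.
result = triage_model(sensitivity=0.90, specificity=0.70)
```

At this illustrative operating point the model tests 8,280 of 20,000 people, saving about 58.6% of Xpert assays while missing 380 of 3,800 people with TB.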
To compare human with AI performance, we calculated the sensitivity and specificity of the Bangladeshi radiologists and the threshold score each version would need to match this sensitivity. We then compared the difference in specificity between human readers and each CAD version using the McNemar test for paired proportions.
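A minimal sketch of the McNemar test for paired proportions follows, applied to whether two readers classified the same images correctly. The continuity correction shown is a common default but is an assumption here, as the text does not say whether it was used; the function name and the counts are hypothetical.

```python
import math


def mcnemar(reader_a_correct, reader_b_correct, continuity=True):
    """McNemar chi-square statistic (1 df) and p-value from paired correct/incorrect calls."""
    pairs = list(zip(reader_a_correct, reader_b_correct))
    only_a = sum(1 for a, b in pairs if a and not b)  # A right, B wrong
    only_b = sum(1 for a, b in pairs if b and not a)  # B right, A wrong
    discordant = only_a + only_b
    if discordant == 0:
        return 0.0, 1.0
    diff = abs(only_a - only_b)
    if continuity:
        diff = max(diff - 1, 0)
    statistic = diff ** 2 / discordant
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical Bac- images: human correct only on 10, CAD correct only on 30.
human = [True] * 10 + [False] * 30 + [True] * 50 + [False] * 10
cad = [False] * 10 + [True] * 30 + [True] * 50 + [False] * 10
statistic, p_value = mcnemar(human, cad)
```

Only the discordant pairs carry information about the difference in specificity, which is why the test suits a design in which every image is read by both the radiologist and the software.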
We also compared version performance at target sensitivity and specificity values according to the WHO’s target product profile (TPP) for a TB triage test of sensitivity ≥90% and specificity ≥70%. For this comparison, the threshold of each version was chosen to match the sensitivity target and the resulting specificity was reported; the same was done for the specificity target. Finally, subgroup analysis was performed by stratifying AUCs by gender, patient source, age group, and history of TB. For the same subgroups, we also calculated human reader sensitivity and specificity. All calculations were done using the statistical software R, v 3.6.0 (R Foundation for Statistical Computing, Vienna, Austria).
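Matching a threshold to the TPP sensitivity target can be sketched as a search for the highest threshold whose sensitivity is still at least 90%, followed by reading off the specificity at that threshold. The search strategy, function names, and scores below are assumptions for illustration and are not the study's code.

```python
def sens_spec_at(pos_scores, neg_scores, threshold):
    """Sensitivity and specificity when scores strictly above the threshold are flagged."""
    sens = sum(s > threshold for s in pos_scores) / len(pos_scores)
    spec = sum(s <= threshold for s in neg_scores) / len(neg_scores)
    return sens, spec


def threshold_for_sensitivity(pos_scores, neg_scores, target=0.90):
    """Highest observed threshold that keeps sensitivity at or above the target."""
    candidates = sorted(set(pos_scores) | set(neg_scores))
    candidates.insert(0, candidates[0] - 1)  # a threshold below every score flags everyone
    chosen = candidates[0]
    for t in candidates:
        sens, _ = sens_spec_at(pos_scores, neg_scores, t)
        if sens >= target:
            chosen = t
        else:
            break  # sensitivity can only fall as the threshold rises
    return (chosen, *sens_spec_at(pos_scores, neg_scores, chosen))


# Hypothetical scores, not study data.
pos = [95, 90, 85, 80, 75, 70, 65, 60, 55, 30]
neg = [10, 20, 30, 40, 50, 60, 70, 80, 20, 15]
threshold, sens, spec = threshold_for_sensitivity(pos, neg)
```

The same search run on specificity (lowest threshold with specificity of at least 70%) gives the other half of the TPP comparison.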
All enrolled participants provided informed written consent; those under 18 years of age gave assent in addition to a parent’s or guardian’s consent. Their medical data were anonymized, and ethical approval was obtained from the Research Review Committee and the Ethical Review Committee at icddr,b.
The median age of the 12,890 participants was 42.0 [29.0, 57.0]; fewer than one third (32.0%) were female; and 1,991 individuals (15.5%) had a history of TB (Table 1). All individuals reported TB-related symptoms, the most common being a cough, reported by 11,651 individuals (90.5%), followed by fever (10,323; 80.2%), weight loss (8,440; 65.6%), shortness of breath (6,742; 52.4%) and haemoptysis (1,625; 12.6%). 777 (18.0%) had a high bacterial burden and rifampicin resistance was detected in 206 (1.6%) people. Most were referred from private (9,538 [75.9%]), public (231 [1.8%]), or DOTS (1,284 [10.2%]) facilities. 1,341 (10.7%) were walk-ins, while 86 (0.7%) came from community screening and 82 (0.7%) were contacts.
4,308 individuals (33.4%) were Bac+ according to the reference standard. Radiologists graded 4,147 (32.2%) of all CXRs as normal, but only 5% of the CXRs of Bac+ individuals as normal; and 3,932 (45.8%) of the CXRs of Bac- individuals as normal.
The median score allocated to Bac+ individuals by CAD4TB increased from 81.0 to 97.2 between version 6 and 7, while the median score for Bac- individuals decreased dramatically from 53.0 to 9.6. qXR version 2 attributed slightly higher scores to Bac+ individuals than v3, with a median of 91.5 compared to 89.0. qXR v3 allocated lower scores to Bac- people (median = 12.0) than v2. Ten percent of Bac+ individuals had a CAD4TB v6 score less than 61, a CAD4TB v7 score less than 49.8, a qXR v2 score less than 61.7, and a qXR v3 score less than 59. Ten percent of Bac- individuals had a CAD4TB v6 score greater than 84, a CAD4TB v7 score greater than 89.9, a qXR v2 score greater than 89.9, and a qXR v3 score greater than 84.
Overall performance and modelled programmatic impacts
CAD4TB v7 had a significantly higher AUC than v6, 0.903 (95% CI: 0.897–0.908) compared to 0.823 (0.816–0.830). qXR version 3 significantly outperformed v2, with AUCs of 0.906 (95% CI: 0.901–0.911) and 0.872 (0.866–0.878) respectively (S1 Fig Right, S1 Table). The improvement in CAD4TB was greater, but qXR v2 was significantly better than CAD4TB V6 at baseline.
(S1 Fig Left) Receiver Operating Characteristic (ROC) Curves of CAD4TB v6 and v7. (S1 Fig Right) ROC curve of qXR v2 and v3. (S2A Fig) Sensitivity versus threshold abnormality score for qXRv2 and v3. (S2B Fig) Xpert tests saved versus threshold abnormality score for qXRv2 and v3. (S2C Fig) Sensitivity versus threshold abnormality score for CAD4TBv6 and v7. (S2D Fig) Xpert tests saved versus threshold abnormality score for CAD4TBv6 and v7.
The sensitivity of qXR versions 2 and 3 was similar across different threshold scores. When the threshold abnormality score was between 0 and 75, the sensitivity of both qXR versions was high, displaying a similar gradual reduction as the threshold increased (S2A Fig, S2 Table). In the modelling population, both qXR versions begin saving large numbers of diagnostic tests at low threshold scores (S2B Fig). For example, at a threshold score of 50, there is no significant difference in the sensitivities of qXR versions 2 and 3 (93.0% [95% CI: 92.2–93.8%] and 92.5% [91.7–93.3%], respectively). V3, however, was more specific and as a result saved 57.8% of diagnostic tests compared to 50.9% saved by v2.
Overall CAD4TB versions 6 and 7 show a vastly altered relationship between abnormality score, sensitivity, and number of Xpert tests saved (S2C and S2D Fig). At a low threshold score (until approximately 48), CAD4TB v6 remains close to 100% sensitive while v7 maintains high sensitivity (80–100%) over most of its threshold score range, only at a threshold of 81 or higher falling below 80%. Similarly, S2D Fig indicates that versions 6 and 7 offer vastly different cost savings and scores needed to achieve them. Until thresholds of approximately 45 are reached, fewer than 20% of diagnostic tests are saved by v6 because of the linear initial relationship between abnormality score and diagnostic test saving. In contrast, v7 results in a steep initial increase. Until a threshold of 75, greater numbers of Xpert tests can be saved using CAD4TB v7.
Abnormality score distributions
The average CAD4TB v7 score was significantly less than the average CAD4TB v6 score with a mean difference of -15.0 (p-value < 2.2e-16). The average qXR v3 score was also significantly less than the average qXR v2 score with a mean difference of -7.7 (p-value < 2.2e-16).
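For readers unfamiliar with the test, a paired one-sided comparison of this kind can be sketched as below. The p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable only for large samples such as the 12,890 images here and is poor for the tiny hypothetical example. The function name and scores are illustrative.

```python
import math
import statistics


def paired_one_sided_test(new_scores, old_scores):
    """Mean paired difference (new - old), t statistic, and approximate P(new < old)."""
    diffs = [n - o for n, o in zip(new_scores, old_scores)]
    mean_diff = statistics.fmean(diffs)
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    t_stat = mean_diff / std_error
    # Lower-tail p-value; normal approximation to the t distribution (large-sample assumption).
    p_approx = statistics.NormalDist().cdf(t_stat)
    return mean_diff, t_stat, p_approx


# Hypothetical scores for the same six images under two versions, not study data.
old = [53, 60, 81, 45, 70, 66]
new = [10, 20, 97, 12, 40, 30]
mean_diff, t_stat, p_approx = paired_one_sided_test(new, old)
```

A reported value of "< 2.2e-16" is the smallest p-value R prints by default, so it indicates only that the true p-value lies below that floor.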
The histograms of the abnormality scores for Bac+ and Bac- individuals show clear overlap for CAD4TB v6 but none for v7 (S3 Fig), indicating that CAD4TB v7 could differentiate most Bac+ from Bac- individuals at a score between 80 and 85. Similar observations were noted for qXR, with the newer version providing improved separation of Bac+ and Bac- individuals and fewer false negative and false positive cases.
Bac. Pos.–individuals with TB according to the Xpert reference standard. Bac. Neg.–individuals without TB according to the Xpert reference standard.
All versions allocated higher scores to Bac- people with a history of TB than to those without. However, large numbers of outliers remain–many Bac+ individuals have extremely low CAD4TB v7 scores or qXR v3 scores.
Comparison against human readers
The human radiologists’ sensitivity was 88.2% (87.2–89.1%) and specificity was 62.8% (61.8–63.9%) (Table 2). When matching this sensitivity, all versions had significantly greater specificity than human radiologists except for CAD4TB v6, which was similar. CAD4TB v7 significantly improved on its predecessor’s specificity, outperforming human radiologists with specificity of 76.0% (75.1–76.9%), compared to 62.8% (61.8–63.9%) for human readers and 64.1% (63.1–65.2%) for CAD4TB v6.
While version 2 of qXR was significantly more specific than human radiologists, V3 improved more and was 13.7% (12–15%) more specific than Bangladeshi radiologists while matching sensitivity.
Comparison against WHO TPP
The earlier versions of both products did not meet the WHO TPP, whereas the newly released versions exceeded the target.
At 90% sensitivity, the newer versions of CAD4TB and qXR significantly improved upon their predecessors and obtained specificities of 72.8% (95% CI: 71.9–73.8%) for CAD4TB and 74.2% (73.3–75.1%) for qXR (Table 3). For CAD4TB v7 a lower threshold score yielded 90% sensitivity compared to v6, while no such change was observed between qXR versions.
Although qXR v2 came close, only the updated versions met the 90% sensitivity target at 70% specificity. CAD4TB v7 had sensitivity of 91.5% (90.6–92.3%) compared to 80.9% (79.7–82.1%) of v6; qXR v3 met the TPP with sensitivity of 92.3% (91.5–93.1%), while v2 came close at 88.2% (87.2–89.2%) (Table 3).
The AUCs of both newer CAD products are higher than the previous versions across all subgroups. Overall, the AUCs of all CAD product versions were significantly higher in new cases compared to people with a history of TB: ranging from 0.846–0.918 for new cases and 0.706–0.841 for those who had TB previously. Despite comparable sensitivity, human readers also performed worse in this group, with specificity of 37.62% (34.95–40.34%) compared to 67.2% (66.1–68.3%) where there was no TB history (Table 4). All product versions also performed significantly worse in older populations, as did human readers. No significant gender difference was noted for CAD, though human readers were less specific in males than females.
Newer product versions were more proficient at accurately classifying CXRs from people with a history of TB and older individuals, especially CAD4TB v7 compared to v6 (S4 Fig, S3 Table). V3 of qXR performed significantly better than its predecessor in older and middle-aged groups, while v7 of CAD4TB significantly outperformed its predecessor in all age groups and was the only algorithm not to perform worse in middle-aged than in younger age groups.
Patient source was a conspicuous factor. All versions performed significantly better among walk-ins than among DOTS-retested and private referrals, and human readers showed the same bias. qXR versions 2 and 3, and CAD4TB v6 also performed worse in private referrals in general, but this shortcoming was not carried forward into CAD4TB v7. Human reader specificity was also slightly lower in this group: 59.1% (57.8–60.3%) compared to 62.8% (61.8–63.9%) overall. Similarly, all products except CAD4TB v7, as well as human readers, performed significantly worse among public referrals than among walk-ins. Between other patient sources, no significant differences were observed for CAD, whereas human readers displayed slightly higher specificity in community screening (72.9% [60.9–82.8%]) than in private referrals (59.1% [57.8–60.3%]).
This is the first study that compares the newer versions of the WHO-reviewed CAD products, qXR and CAD4TB. Both new software versions exceeded the performance of their WHO-evaluated previous versions and met the TPP targets. Our findings illustrate measurable improvements achieved by new versions of software. However, the opacity of the technology makes it difficult to predict how these changes will impact programmes. New versions of products can involve significant changes in the underlying neural network and should therefore be evaluated as if they were new products altogether, to verify that their performance maintains the level of those in the WHO guideline update.
A given threshold score deployed with different versions of the same CAD product will not always be associated with the same sensitivity and Xpert saving, as exemplified by CAD4TB v7 compared to v6. The improvement seen with v7 may be attributed to a large difference in the underlying neural network, demonstrated by the box plots of the abnormality scores of the two versions. In contrast, the two versions of qXR showed more nuanced improvement and the underlying classification algorithm remains largely similar between versions, although the newer can save more confirmatory tests while keeping the sensitivity the same. For example, using 60 as the threshold score with CAD4TB v6 achieved 92% sensitivity and saved about 43% of Xpert tests. If the software was then updated to v7 and the same threshold used, sensitivity would reduce to 88% and the programme would now save 55% of diagnostic tests. New software updates will likely necessitate the adjustment of the threshold score to maintain performance analogous to that of the previous version.
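The CAD4TB example above can be worked through numerically. The sketch assumes the quoted figures refer to the modelled population of 20,000 adults with 19% Bac+ prevalence described in the methods, and it treats the rounded percentages ("about 43%") as exact, so the outputs are approximate. The function name and the implied specificity are derived for illustration and are not reported values.

```python
POPULATION = 20_000   # modelled population from the methods
PREVALENCE = 0.19     # assumed Bac+ prevalence from the methods


def implied_outcomes(sensitivity, saved_fraction):
    """People with TB missed, Xpert tests used, and the specificity these figures imply."""
    with_tb = POPULATION * PREVALENCE
    without_tb = POPULATION - with_tb
    xpert_used = POPULATION * (1 - saved_fraction)
    true_pos = with_tb * sensitivity
    false_pos = xpert_used - true_pos
    return {
        "tb_missed": with_tb - true_pos,
        "xpert_used": xpert_used,
        "implied_specificity": 1 - false_pos / without_tb,
    }


v6 = implied_outcomes(sensitivity=0.92, saved_fraction=0.43)  # threshold 60 with CAD4TB v6
v7 = implied_outcomes(sensitivity=0.88, saved_fraction=0.55)  # same threshold after updating to v7
extra_missed = v7["tb_missed"] - v6["tb_missed"]
tests_avoided = v6["xpert_used"] - v7["xpert_used"]
```

Under these assumptions, keeping the threshold at 60 after the update would avoid roughly 2,400 Xpert tests but miss roughly 150 more people with TB (about 456 compared with 304), which illustrates why the threshold may need to be re-selected after an update.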
In general, both the older and newer versions of qXR and CAD4TB outperformed human readers, except CAD4TB v6 which performed similarly. These findings are in line with previous research.[9,10] The improvement in performance we observed in CAD4TB agrees with a previous study describing improvement in version 6 compared to predecessors.
However, algorithms can be further refined to improve performance for subgroups such as older age groups and those with a history of TB. Current weaknesses suggest a flaw in current training practices that may be limiting CAD accuracy, even in newer versions. However, human reader bias mirrored that of CAD when it came to older age groups and those with a history of TB. As new versions are automatically rolled out to users globally, their programmatic implications should be routinely monitored to ensure they serve all populations in need. A rapid evaluation centre, with access to diverse datasets from different regions of the world, will be key to meeting this need.
This study has a few limitations. Firstly, owing to logistic and budgetary constraints, we did not use culture as the reference standard, meaning that some people with Xpert-negative, culture-positive TB might have been incorrectly labelled as not having the disease. We also did not have access in Bangladesh to Xpert Ultra, which is more sensitive than Xpert. Because the number of asymptomatic individuals and of individuals tested for HIV was small, subgroup analysis was not performed on these groups. The study population also excludes children under 15 owing to protocol limitations.
Updated versions of CAD4TB and qXR outperform their predecessors, meeting the standard set in the WHO guideline. Version updates arise rapidly, can involve large changes in the underlying neural network, and are rolled out globally. Independent, evidence-based guidance is urgently needed to help end users prepare for updated technology.
S1 Fig. Receiver Operating Characteristic (ROC) Curves of CAD4TB v6 and v7 (left) and qXR v2 and v3 (right).
S2 Fig. (A) Sensitivity versus threshold abnormality score for qXRv2 and v3; (B) Xpert tests saved versus threshold abnormality score for qXRv2 and v3; (C) Sensitivity versus threshold abnormality score for CAD4TBv6 and v7; (D) Xpert tests saved versus threshold abnormality score for CAD4TBv6 and v7.
S3 Fig. Histograms showing the distribution of abnormality scores of CAD4TB versions 6 and 7 and qXR versions 2 and 3 disaggregated by bacteriological status and by history of TB.
S4 Fig. The performance of CAD software versions as Area Under the Receiver Operating Characteristic curve (AUC) stratified by age group, patient source, history of TB, and gender.
S1 Table. Comparison of the AUCs of two versions of CAD4TB and qXR.
S2 Table. How threshold score impacts sensitivity and number of diagnostic tests saved for different product versions.
Delft Imaging Systems and Qure.ai allowed us to use all included CAD products free of charge, but they had no influence on any aspects of our work.
1. Qin ZZ, Naheyan T, Ruhwald M, Denkinger CM, Gelaw S, Nash M, et al. A new resource on artificial intelligence powered computer automated detection software products for tuberculosis programmes and implementers. Tuberculosis. 2021;127:102049. pmid:33440315
2. Qin ZZ, Ahmed S, Sarker MS, Paul K, Ahammad SSA, Naheyan T, et al. Tuberculosis detection from chest x-rays for triaging in a high tuberculosis-burden setting: an evaluation of five artificial intelligence algorithms. The Lancet Digital Health. 2021;3(9):e543–e554. pmid:34446265
3. World Health Organization. Module 2: Screening WHO Operational Handbook on Tuberculosis Systematic Screening for Tuberculosis Disease. 2021. [Cited: March 26, 2021]. https://apps.who.int/iris/bitstream/handle/10665/340256/9789240022614-eng.pdf
4. Murphy K, Habib SS, Zaidi SMA, Khowaja S, Khan A, Melendez J, et al. Computer aided detection of tuberculosis on chest radiographs: An evaluation of the CAD4TB v6 system. Scientific Reports. 2020;10(1). pmid:32218458
5. ai4hlth.org [Internet]. AI Products for Tuberculosis Healthcare | AI4HLTH. [Cited: March 26, 2021]. https://www.ai4hlth.org/
6. World Health Organization. Tuberculosis prevalence surveys: a handbook. 2011. [Cited: March 26, 2021]. https://apps.who.int/iris/bitstream/handle/10665/44481/9789241548168_eng.pdf?sequence=1&isAllowed=y
7. Banu S, Haque F, Ahmed S, Sultana S, Rahman M, Khatun R, et al. Social Enterprise Model (SEM) for private sector tuberculosis screening and care in Bangladesh. PLOS ONE. 2020;15(11):e0241437. pmid:33226990
8. World Health Organization. High-Priority Target Product Profiles for New Tuberculosis Diagnostics: Report of a Consensus Meeting. 2014. [Cited: December 21, 2021]. https://apps.who.int/iris/bitstream/handle/10665/135617/WHO_HTM_TB_2014.18_eng.pdf?sequence=1&isAllowed=y
9. Rahman T, Codlin AJ, Rahman M, Nahar A, Reja M, Islam T, et al. An evaluation of automated chest radiography reading software for tuberculosis screening among public- and private-sector patients. The European Respiratory Journal. 2017;49(5). pmid:28529202
10. Qin ZZ, Sander MS, Rai B, Collins TN, Sudrungrot S, Laah SN, et al. Using artificial intelligence to read chest radiographs for tuberculosis detection: A multi-site evaluation of the diagnostic accuracy of three deep learning systems. Scientific Reports. 2019;9(1):1–10.