Comparing different versions of computer-aided detection products when reading chest X-rays for tuberculosis

Computer-aided detection (CAD) was recently recommended by the WHO for TB screening and triage based on several evaluations, but unlike traditional diagnostic tests, software versions are updated frequently and require constant re-evaluation. Since those evaluations, newer versions of two of the evaluated products have already been released. We used a case-control sample of 12,890 chest X-rays to compare performance and model the programmatic effect of upgrading to newer versions of CAD4TB and qXR. We compared the area under the receiver operating characteristic curve (AUC) overall and with data stratified by age, TB history, gender, and patient source. All versions were compared against radiologist readings and the WHO's Target Product Profile (TPP) for a TB triage test. Both newer versions significantly outperformed their predecessors in terms of AUC: CAD4TB version 6 (0.823 [0.816–0.830]) vs version 7 (0.903 [0.897–0.908]), and qXR version 2 (0.872 [0.866–0.878]) vs version 3 (0.906 [0.901–0.911]). The newer versions met WHO TPP values; the older versions did not. All products equalled or surpassed human radiologist performance, with newer versions improving triage ability. Both humans and CAD performed worse in older age groups and among those with a history of TB. New versions of CAD outperform their predecessors. Prior to implementation, CAD should be evaluated using local data because the underlying neural networks can differ significantly. An independent rapid evaluation centre is needed to provide implementers with performance data on new versions of CAD products as they are developed.

Introduction
Several computer-aided detection (CAD) products for TB have emerged that can provide an automated and standardized interpretation of digital chest X-rays (CXR) based on artificial intelligence.
[1] Recent evaluations of CAD's ability to detect TB-related abnormalities report performance comparable to (or better than) that of human readers. [2] In March 2021, the World Health Organization (WHO) reviewed impartial evaluations of three CAD products and made the landmark decision to update international TB screening policy to include the use of CAD on CXRs of individuals ≥15 years. [3] Under the WHO guidance, other CAD products may be used provided their performance matches that of those reviewed in the guideline.
The emergence of CAD as a high-performing tool for screening and triage has differed from that of new laboratory diagnostics, with newer software versions becoming available rapidly. The speed of this progress challenges the relevance of the current CAD literature and the policy it informs. Two of the products reviewed during the WHO guideline development process published in 2021, CAD4TB v6 (Delft Imaging Systems, the Netherlands) and qXR v2 (Qure.ai, India), have already been updated. Further, modern CAD software is built on deep learning, an AI technique based on artificial neural networks that mimic the human brain.
[1] However, the inner workings of commercial CAD software are difficult to scrutinize for general audiences and developers alike: neural networks are akin to a black box, and the underlying algorithms are closely guarded trade secrets. Therefore, for medical professionals, who cannot inspect the inner workings of any commercial AI software, confidence in the ability of CAD software to detect TB must be earned through comprehensive, unbiased evaluations that measure a range of performance indicators on real-world datasets. Only one study, itself outdated by new software versions, has assessed and compared consecutive versions of a single CAD product. [4] More broadly, there is little research quantifying differences in programmatic impact between software versions to advise users and TB programmes on what adjustments, if any, are needed when using software tools that update on a yearly or more rapid basis. We therefore compare the performance of two WHO-evaluated CAD product versions with their subsequent versions, using bacteriological evidence as the reference standard.

Materials and methods
The study evaluated CAD4TB versions 6 and 7, and qXR versions 2 and 3.
[5] Both CAD products read CXR images and calculate an abnormality score representing the likelihood that TB-associated abnormalities are present in an image. A dichotomous result (TB-associated abnormalities present or absent) is obtained by setting a threshold abnormality score, above which the algorithm suggests that TB-associated abnormalities are present and that the individual should undergo further confirmatory testing.
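The dichotomization described above can be sketched as follows. This is a minimal illustration, not vendor code: the threshold of 50 and the 0–100 score range are assumptions for the example only.

```python
def triage_decision(abnormality_score, threshold=50.0):
    """Suggest confirmatory TB testing when the CAD abnormality score
    reaches the configured threshold (hypothetical 0-100 scale)."""
    return abnormality_score >= threshold

# Individuals at or above the threshold are referred for Xpert testing.
scores = [12.0, 63.5, 97.2]
referrals = [s for s in scores if triage_decision(s)]
```

In practice the threshold is a programme-level choice that trades sensitivity against the number of confirmatory tests, which is why the analysis below sweeps across the whole score range.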
[1] The outputs also include heat maps indicating the location of abnormalities. Both products had been reviewed by the WHO Guidelines Development Group and approved for use in TB triage and screening (in individuals ≥15 years) in 2021. [3] The dataset used in this evaluation is taken from the Stop TB Partnership's TB REACH CXR Evaluation Centre. [5]

CXR sample collection
Every individual ≥15 years old visiting one of three TB screening centres set up by icddr,b in Dhaka, Bangladesh was verbally screened for TB symptoms (cough, shortness of breath, weight loss, haemoptysis) and received a CXR. The image was then read by one of three radiologists registered with the Bangladesh Medical and Dental Council. The radiologists were blinded to all information except age and sex. They classified each image as 'normal' or 'abnormal' (including any abnormality, whether consistent with TB or not). [6] Regardless of the CXR results, all individuals were asked to submit a fresh spot sputum sample for testing with the Xpert MTB/RIF (Xpert) assay. Xpert provided a bacteriological reference standard, confirming the presence (Bac+) or absence (Bac-) of Mycobacterium tuberculosis.
For this study, the dataset was drawn using case-control sampling, with a 2:1 match of 8,582 Bac- and 4,308 Bac+ CXRs according to the reference standard, resulting in a dataset of 12,890 CXRs, which were read by all four software versions.
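As a rough sketch of this sampling step (ignoring any covariate matching, and using hypothetical identifier lists), a 2:1 control-to-case draw could look like:

```python
import random

def case_control_sample(bac_pos_ids, bac_neg_ids, controls_per_case=2, seed=42):
    """Keep every Bac+ CXR and randomly draw roughly two Bac- CXRs
    per Bac+ CXR from the screening-centre archive."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    controls = rng.sample(bac_neg_ids, controls_per_case * len(bac_pos_ids))
    return list(bac_pos_ids), controls

# Hypothetical ID ranges standing in for the real archive.
cases, controls = case_control_sample(range(4308), range(100000, 160000))
```

The actual study sample (8,582 controls) is slightly under an exact 2:1 ratio, so this sketch only illustrates the principle.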
CAD reading was performed retrospectively during sessions in which CAD4TB and qXR were installed on the Stop TB Partnership's Secure File Transfer Protocol server storing the de-identified CXR images. CAD developers were not granted access to the evaluation dataset before or after the reading, and reading was performed blind to all clinical and demographic information and without any prior training of the AI on the dataset. Unique identifiers were used to group server datasets for analytical purposes. Only co-authors had access to the dataset.

Data analysis
To compare the accuracy of newer against older versions, receiver operating characteristic (ROC) curves were plotted and the area under the ROC curves (AUC) was calculated as a general indication of product version accuracy over the entire abnormality score range.
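For illustration, the AUC can be computed directly from the two score distributions via the Mann–Whitney formulation: the probability that a randomly chosen Bac+ score exceeds a randomly chosen Bac- score (ties counting one half). The scores below are invented; the study's own computation was done in R.

```python
def auc_from_scores(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney U formulation:
    the fraction of (Bac+, Bac-) pairs where the Bac+ score is higher,
    counting ties as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy example with made-up abnormality scores:
bac_pos = [97, 81, 60, 45]
bac_neg = [10, 25, 53, 70]
auc = auc_from_scores(bac_pos, bac_neg)
```

A version with a higher AUC separates the two score distributions better across the entire threshold range, which is exactly the property compared between product versions here.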
A paired one-sided t-test was performed to test whether the average CAD4TB v7 score was less than the average CAD4TB v6 score. The same was done to test whether the average qXR v3 score was less than the average qXR v2 score. We also constructed histograms of the abnormality scores of the different software versions, disaggregated by bacteriological status. To examine how performance changes across threshold scores, we evaluated the cost saving of each product version in a hypothetical triage situation in which CXRs from 20,000 adults would be interpreted by each CAD version and only those with an abnormality score above a threshold value would receive an Xpert diagnostic test. We assumed the prevalence of Bac+ TB in the population was 19%, as in the principal study, then calculated the sensitivity of each version and the number of Xpert assays hypothetically needed. [2,7] To compare human with AI performance, we calculated the sensitivity and specificity of the Bangladeshi radiologists and the threshold score each version would need to match this sensitivity. We then compared the difference in specificity between human readers and each CAD version using the McNemar test for paired proportions.
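The hypothetical triage calculation can be sketched as below. The toy scores are invented; the function simply scales the empirical sensitivity and false-positive rate at a given threshold to the 20,000-person population at 19% prevalence.

```python
def triage_model(scores_pos, scores_neg, threshold, population=20_000, prevalence=0.19):
    """Only individuals scoring above `threshold` receive a confirmatory
    Xpert test; returns sensitivity, Xpert tests needed, and the
    fraction of tests saved relative to testing everyone."""
    sens = sum(s > threshold for s in scores_pos) / len(scores_pos)
    fpr = sum(s > threshold for s in scores_neg) / len(scores_neg)
    n_pos = population * prevalence      # expected Bac+ individuals
    n_neg = population - n_pos           # expected Bac- individuals
    tests = sens * n_pos + fpr * n_neg   # Xpert assays triggered
    return sens, tests, 1 - tests / population

# Toy score lists; real analyses used the full 12,890-CXR dataset.
sens, tests, saved = triage_model([90, 80, 70, 20], [10, 30, 60, 85], threshold=50)
```

Sweeping `threshold` over the score range reproduces the sensitivity-versus-tests-saved trade-off curves discussed in the results.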
We also compared version performance at target sensitivity and specificity values according to the WHO's target product profile (TPP) for a TB triage test: sensitivity ≥90% and specificity ≥70%. [8] The threshold of each version was chosen to match the sensitivity target value, and likewise for the specificity target. Finally, subgroup analysis was performed by stratifying AUCs by gender, patient source, age group, and history of TB. For the same subgroups, we also calculated human reader sensitivity and specificity. All calculations were done using the statistical software R, v3.6.0 (R Foundation for Statistical Computing, Vienna, Austria).
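Choosing a version-specific threshold to hit the TPP sensitivity target can be sketched as a scan over candidate scores (toy data below; the study's analysis used R):

```python
def threshold_for_sensitivity(scores_pos, scores_neg, target=0.90):
    """Highest threshold whose sensitivity still meets `target`,
    together with the sensitivity and specificity achieved there."""
    best = None
    for t in sorted(set(scores_pos) | set(scores_neg)):
        sens = sum(s >= t for s in scores_pos) / len(scores_pos)
        if sens >= target:  # candidates are scanned in ascending order,
            spec = sum(s < t for s in scores_neg) / len(scores_neg)
            best = (t, sens, spec)  # so the last hit is the highest threshold
    return best

# Invented score lists for illustration only.
bac_pos = [95, 90, 85, 80, 75, 70, 65, 60, 55, 5]
bac_neg = [10, 20, 30, 40, 50, 58, 62, 88, 92, 96]
t, sens, spec = threshold_for_sensitivity(bac_pos, bac_neg)
```

The specificity reported at the selected threshold is what gets compared against the TPP's ≥70% target; the symmetric procedure fixes specificity and reports sensitivity.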

Ethics
All enrolled participants provided written informed consent; those under 18 years of age gave assent in addition to a parent's or guardian's consent. Medical data were anonymized, and ethical approval was obtained from the Research Review Committee and the Ethical Review Committee at icddr,b.

Role of CAD developers
AI developers had no role in study design, data collection, analysis plan, or writing of the publication.

Results
The median age of the 12,890 participants was 42.0 years [29.0, 57.0]; fewer than one third (32.0%) were female; and 1,991 individuals (15.5%) had a history of TB (Table 1). All individuals reported TB-related symptoms, the most common being cough, reported by 11,651 individuals (90.5%), followed by fever.

The median score allocated to Bac+ individuals by CAD4TB increased from 81.0 to 97.2 between versions 6 and 7, while the median score for Bac- individuals decreased dramatically from 53.0 to 9.6. qXR version 2 attributed slightly higher scores to Bac+ individuals than v3, with a median of 91.5 compared to 89.0. qXR v3 allocated lower scores to Bac- people (median = 12.0) than v2. Ten percent of Bac+ individuals had a CAD4TB v6 score less than 61, a CAD4TB v7 score less than 49.8, a qXR v2 score less than 61.7, and a qXR v3 score less than 59. Ten percent of Bac- individuals had a CAD4TB v6 score greater than 84, a CAD4TB v7 score greater than 89.9, a qXR v2 score greater than 89.9, and a qXR v3 score greater than 84.
(S1 Fig, left: receiver operating characteristic (ROC) curves of CAD4TB v6 and v7.)

The sensitivity of qXR versions 2 and 3 was similar across different threshold scores. At threshold abnormality scores between 0 and 75, the sensitivity of both qXR versions was high, showing a similar gradual reduction as the threshold increases (S2A Fig, S2 Table). In the modelling population, both qXR versions begin saving large numbers of diagnostic tests at low thresholds (S2B Fig). Overall, CAD4TB versions 6 and 7 show a vastly altered relationship between abnormality score, sensitivity, and number of Xpert tests saved (S2C and S2D Fig). At low threshold scores (up to approximately 48), CAD4TB v6 remains close to 100% sensitive, while v7 maintains high sensitivity (80–100%) over most of its threshold score range, falling below 80% only at a threshold of 81 or higher. Similarly, S2D Fig indicates that versions 6 and 7 offer vastly different cost savings and require different scores to achieve them. Until thresholds of approximately 45 are reached, fewer than 20% of diagnostic tests are saved by v6 because of the initially linear relationship between abnormality score and diagnostic tests saved. In contrast, v7 shows a steep initial increase. Up to a threshold of 75, greater numbers of Xpert tests can be saved using CAD4TB v7.

Abnormality score distributions
The average CAD4TB v7 score was significantly less than the average CAD4TB v6 score with a mean difference of -15.0 (p-value < 2.2e-16). The average qXR v3 score was also significantly less than the average qXR v2 score with a mean difference of -7.7 (p-value < 2.2e-16).
All versions allocated higher scores to Bac- people with a history of TB than to those without. However, large numbers of outliers remain: many Bac+ individuals have extremely low CAD4TB v7 or qXR v3 scores.
While version 2 of qXR was already significantly more specific than human radiologists, v3 improved further and was 13.7% (12–15%) more specific than the Bangladeshi radiologists while matching their sensitivity.

Comparison against WHO TPP
The earlier versions of both products did not meet the WHO TPP, whereas the newly released versions exceeded the targets.

Subgroup analysis
The AUCs of both newer CAD products were higher than those of the previous versions across all subgroups. Overall, the AUCs of all CAD product versions were significantly higher in new cases than in people with a history of TB: 0.846–0.918 for new cases versus 0.706–0.841 for those who previously had TB. Despite comparable sensitivity, human readers also performed worse in this group, with a specificity of 37.62% (34.95–40.34%) compared to 67.2% (66.1–68.3%) where there was no TB history (Table 4). All product versions also performed significantly worse in older populations, as did human readers. No significant gender difference was noted for CAD, though human readers were less specific in males than in females.
Newer product versions were more proficient at accurately classifying CXRs from people with a history of TB and from older individuals, especially CAD4TB v7 compared to v6 (S4 Fig, S3 Table). qXR v3 performed significantly better than its predecessor in older and middle-aged groups, while CAD4TB v7 significantly outperformed its predecessor in all age groups and was the only algorithm that did not perform worse in middle-aged than in younger age groups.
Patient source was a conspicuous factor. All versions performed significantly better among walk-ins than among DOTS-retested and private referrals, and human readers showed the same bias. qXR versions 2 and 3 and CAD4TB v6 also performed worse in private referrals in general, but this shortcoming was not carried forward into CAD4TB v7. Human reader specificity was

Discussion
This is the first study to compare the newer versions of the WHO-reviewed CAD products qXR and CAD4TB. Both new software versions exceeded the performance of their WHO-evaluated predecessors and met the TPP targets. Our findings illustrate the measurable improvements achieved by new versions of software. However, the opacity of the technology makes it difficult to predict how these changes will affect programmes: new versions of products can involve significant changes in the underlying neural network and should therefore be evaluated as if they were new products altogether, to verify that their performance maintains the level of those in the WHO guideline update. A given threshold score deployed with different versions of the same CAD product will not always be associated with the same sensitivity and Xpert saving, as exemplified by CAD4TB v7 compared to v6. The improvement seen with v7 may be attributed to a large difference in the underlying neural network, demonstrated by the box plots of the abnormality scores of the two versions. In contrast, the two versions of qXR showed more nuanced improvement, and the underlying classification algorithm remains largely similar between versions, although the newer version can save more confirmatory tests while keeping sensitivity the same. For example, using 60 as the threshold score with CAD4TB v6 achieved 92% sensitivity and saved about 43% of Xpert tests. If the software were then updated to v7 and the same threshold used, sensitivity would fall to 88% and the programme would now save 55% of diagnostic tests. New software updates will therefore likely necessitate adjusting the threshold score to maintain performance analogous to that of the previous version. In general, both the older and newer versions of qXR and CAD4TB outperformed human readers, except CAD4TB v6, which performed similarly. These findings are in line with previous research. [9,10] The improvement in performance we observed in CAD4TB agrees with a previous study describing improvement in version 6 compared to its predecessors. [4] However, algorithms can be further refined to improve performance for subgroups such as older age groups and those with a history of TB. [2] Current weaknesses suggest a flaw in current training practices that may be limiting CAD accuracy, even in newer versions. However, human reader bias mirrored that of CAD for older age groups and those with a history of TB. As new versions are automatically rolled out to users globally, their programmatic implications should be routinely monitored to ensure they serve all populations in need. A rapid evaluation centre, with access to diverse datasets from different regions of the world, will be key to meeting this need.

Table 4. The sensitivity and specificity of human readers in these subgroups. (Columns: Subgroup, Human Reader Sensitivity, Human Reader Specificity.)
This study has a few limitations. First, owing to logistical and budgetary constraints, we did not use culture as the reference standard, meaning that some people with Xpert-negative, culture-positive TB might have been incorrectly labelled as not having the disease. We also did not have access in Bangladesh to Xpert Ultra, which is more sensitive than Xpert. Owing to the small numbers of individuals who were asymptomatic or tested for HIV, subgroup analyses were not performed for these groups. The study population also excludes children under 15 owing to protocol limitations.

Conclusion
Updated versions of CAD4TB and qXR outperform their predecessors, meeting the standard set in the WHO guideline. Version updates arise rapidly, can involve large changes in the underlying neural network, and are rolled out globally. Independent, evidence-based guidance is urgently needed to help end users prepare for updated technology.

Table. The AUCs of CAD product versions in sub-analyses that showed differences in performance. (XLSX)