Diagnostic performance evaluation of different TI-RADS using ultrasound computer-aided diagnosis of thyroid nodules: An experience with adjusted settings

Background. Thyroid cancer diagnosis has evolved to include computer-aided diagnosis (CAD) approaches to overcome the limitations of human ultrasound feature assessment. This study aimed to evaluate the diagnostic performance of a CAD system in thyroid nodule differentiation using varied settings. Methods. Ultrasound images of 205 thyroid nodules from 198 patients were analysed in this retrospective study. AmCAD-UT software was used at default settings and 3 adjusted settings to diagnose the nodules. Six risk-stratification systems in the software were used to classify the thyroid nodules: the American Thyroid Association (ATA), American College of Radiology Thyroid Imaging, Reporting, and Data System (ACR-TIRADS), British Thyroid Association (BTA), European Union (EU-TIRADS), Kwak (2011) and the Korean Society of Thyroid Radiology (KSThR). The diagnostic performance of CAD was determined relative to the histopathology and/or cytology diagnosis of each nodule. Results. At the default setting, EU-TIRADS yielded the highest sensitivity (82.6%) and lowest specificity (42.1%), while ATA-TIRADS yielded the highest specificity (66.4%). Kwak had the highest AUROC (0.74), which was comparable to that of ACR, ATA, and KSThR TIRADS (0.72, 0.73, and 0.70, respectively). At a hyperechoic foci setting of 3.5 with other settings at median values, ATA had the best-balanced sensitivity and specificity and a good AUROC (70.4%, 67.3% and 0.71, respectively). Conclusion. The default setting achieved the best diagnostic performance with all TIRADS and was best for maximizing the sensitivity of EU-TIRADS. Adjusting the settings by only reducing the sensitivity to echogenic foci may be most helpful for improving specificity with minimal change in sensitivity.


Introduction
Thyroid cancer is the most common endocrine malignancy, constituting about 5% of all cancers [1,2]. With the advancement and increased sensitivity of diagnostic imaging tools such as ultrasound, the incidence of thyroid cancers is rising, particularly for subclinical cases [3]. Ultrasound-guided fine-needle aspiration cytology (FNAC) is the pre-operative reference standard; however, its major drawback is the indeterminate results category, which carries about a 25% possibility of malignancy [4]. Although ultrasound is the recommended primary imaging modality for thyroid nodule assessment, its drawbacks are operator dependence and subjective interpretation of results. Various thyroid malignancy risk classification guidelines have been designed to assist with categorizing the risk of malignancy based on several predictive sonographic features. Some of the commonly used guidelines are the American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS) [5], American Thyroid Association (ATA) [6], British Thyroid Association (BTA) [7], the Korean Society of Thyroid Radiology (KSThR) [8], Kwak-TI-RADS [9], and the European Thyroid Association (EU-TIRADS) [10]. The diversity of sonographic features considered highly predictive of malignancy in the different guidelines increases the dependence on the experience and clinical approach of the clinician [11].
Computer-aided diagnosis (CAD) systems have been proposed to offer a more objective and consistent interpretation of sonographic features in comparison with human visual assessment due to their computational analysis of sonographic textural features [12,13]. Recent studies have shown that thyroid CAD systems have a diagnostic performance that is comparable to that of experienced radiologists with combined techniques having more potential for superior performance [14,15]. One globally approved thyroid ultrasound CAD system that allows for simultaneous diagnosis of thyroid nodules with different TIRADS is AmCAD-UT (AmCad Biomed, Taipei, Taiwan). This CAD software has been evaluated for diagnostic performance in differentiating malignant and benign thyroid nodules in a few studies. Some studies demonstrated its comparable sensitivity in comparison with clinical experts and radiologists and its role in improving sonographers' interpretations of space-occupying thyroid lesions [16,17]. Although AmCAD-UT settings can be adjusted for optimised diagnostic performance, these previous studies only assessed the diagnostic performance using the default settings with a very limited comparison of multiple TIRADS.
The aim of the present study was to evaluate the diagnostic performance of AmCAD-UT at varied detection sensitivity settings of different ultrasound features for thyroid nodule differentiation based on six different TIRADS within the software. To the best of our knowledge, the value of adjusting AmCAD-UT settings in comparison with the default setting for thyroid nodule assessment has not yet been explored.

Study type and data sources
This retrospective study was approved by the Human Subjects Ethics Subcommittee of The Hong Kong Polytechnic University (Registration Number: HSEARS20190123004). A consecutive case analysis approach was used for the data collection of thyroid nodule ultrasound images. Due to the retrospective nature of this study individual informed consent was waived.
Images of thyroid nodules were obtained from image archives of thyroid ultrasound studies previously conducted by our research group and from an open access thyroid ultrasound image database, the Digital Database of Thyroid Ultrasound Images (DDTI) (Universidad Nacional de Colombia, CIM@LAB and Instituto de Diagnostico Medico (IDIME), Bogota, Colombia) [18]. Images from the previous studies by our research group were all acquired using a Supersonic Aixplorer ultrasound machine (SuperSonic Imagine, Aix-en-Provence, France) and a 4-15 MHz linear transducer. These images have not been used in any previous grey-scale ultrasound CAD analysis study. For the images obtained from the online database, the ultrasound machines used were a TOSHIBA Nemio 30 and a TOSHIBA Nemio MX (Canon Medical Systems, Tochigi, Japan) with 12 MHz linear and convex transducers [18].

Image selection criteria
A sonographer with more than 15 years' experience in thyroid ultrasound reviewed the ultrasound images individually. 263 images from 198 patients (94-DDTI; 104-research group files) were retrieved from both thyroid ultrasound image sources for the initial selection process. 129 of these were from our previous research group studies done between February 2013 and December 2014, and 134 images were from the DDTI database. The inclusion criteria were diagnostically acceptable thyroid nodule B-mode ultrasound images from adult patients with thyroid cancer suspicion and confirmatory cytological and/or histopathological results. Incomplete ultrasound images without clear boundaries, indeterminate cytology and/or no histopathology results were excluded from the study. Two thyroid surgeons with extensive experience conducted the fine-needle aspiration cytology of thyroid nodules and provided cytological and histopathological results. Images from the DDTI database had a cytological diagnosis as had been determined by experts [18]. The standard of reference that was used to differentiate benign and malignant nodules was a cytological diagnosis and/or histopathology results. A total of 205 images (104-research group files; 101-DDTI files) met the inclusion criteria and were evaluated in this study. Fig 1 shows the ultrasound image selection process.
Areas clearly demonstrating the nodule were selected and separated from the entire image and the new nodule-specific images were coded and saved in JPEG format.

CAD analysis of the thyroid nodule images
A radiographer with 2 years' thyroid ultrasound experience performed the CAD analysis using the AmCAD-UT thyroid CAD software after a month of training in using the software. The user was blinded to the cytology and/or histopathology results.
CAD ROI-selection. The coded JPEG images were uploaded onto the AmCAD-UT software user interface for analysis. Of the 3 methods for selecting the region of interest (ROI) within the software, manual outlining was adopted for this study as it ensured a more standardized approach than the semi-automated and automated nodule recognition methods, which missed some nodule areas during the training period. After selecting the ROI, the user adjusted the different settings for ultrasound feature analysis before confirming the analysis for the diagnosis output.
CAD settings selection. The AmCAD-UT software can be adjusted for detection sensitivity within pre-determined ranges for margins (1-5), hyperechoic foci (2.0-4.0) and anechoic areas (0-0.5), while visualization can be modified for echogenicity (-50 to 50) and texture (10-100). Detection sensitivity increases with an increase in the settings for the different ultrasound features, except for the hyperechoic foci setting, which has an inverse relationship. The standalone diagnostic performance of AmCAD-UT established during its development phase testing showed that the detection of "hyperechoic foci" is dependent on detected "anechoic areas". The highest diagnostic performance of over 90% was achieved at a hyperechoic foci setting of 3.5 for different ranges of "anechoic areas", with comparably high performance at the 0.2 and 0.5 settings, whereas margins had the best performance at the 2.0 and 3.0 settings, based on images from 3 different ultrasound machines [19]. The commonly used default setting uses median values for all the parameter settings. Based on this background, the present study sought to determine the setting for optimised diagnostic performance by comparing the default settings with settings in which "hyperechoic foci" was maintained at 3.5 and "anechoic areas" and "margins" were varied among the values that previously achieved the highest diagnostic performance during the development phase testing. "Echogenicity" and "texture" parameter settings were consistently maintained at median values to allow objective comparison with the default setting. During our pilot testing of the software, these two parameter settings mainly influenced subjective visualization of the images without changing the CAD diagnosis output. The different sonographic settings used for the comparisons in this study are tabulated in Table 1.
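The parameter ranges described above can be written down as a small data structure. The following is only an illustrative sketch and not an AmCAD-UT interface: the names RANGES, CadSettings, midpoint_settings and validate are hypothetical, and taking the "median" default to be the midpoint of each reported range is an assumption. Only the Adjusted 1 setting (hyperechoic foci at 3.5, other parameters at median values) is shown, because the remaining adjusted settings are defined in Table 1.

```python
from dataclasses import dataclass, replace

# Adjustable ranges as reported in the text: (minimum, maximum).
RANGES = {
    "margins": (1.0, 5.0),
    "hyperechoic_foci": (2.0, 4.0),
    "anechoic_areas": (0.0, 0.5),
    "echogenicity": (-50.0, 50.0),
    "texture": (10.0, 100.0),
}


@dataclass(frozen=True)
class CadSettings:
    """Hypothetical container for one combination of parameter settings."""
    margins: float
    hyperechoic_foci: float
    anechoic_areas: float
    echogenicity: float
    texture: float


def midpoint_settings(ranges=RANGES):
    """Assumption: the 'median' default is the midpoint of each range."""
    return CadSettings(**{name: (lo + hi) / 2 for name, (lo, hi) in ranges.items()})


def validate(settings, ranges=RANGES):
    """Raise ValueError if any parameter falls outside its reported range."""
    for name, (lo, hi) in ranges.items():
        value = getattr(settings, name)
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} outside [{lo}, {hi}]")
    return settings


default = validate(midpoint_settings())
# Adjusted 1: hyperechoic foci at 3.5, everything else left at the median values.
adjusted_1 = validate(replace(default, hyperechoic_foci=3.5))
```

Keeping the ranges in one place makes it straightforward to reject a setting combination that falls outside the limits stated for the software before any comparison is run.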
To determine the diagnostic performance of the CAD software, the CAD risk stratification output for each nodule based on 6 risk stratification systems (ACR, ATA, BTA, EU, Kwak and KSThR TIRADS) was compared to the ground truth, which was the final cytological or histopathological diagnosis. AACE/ACE/AME and Seo et al., 2015 TIRADS were excluded from the analysis as this study evaluated TIRADS with 5 or more risk stratification categories.
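As a minimal sketch of this comparison, the ordinal category assigned by one TIRADS can be dichotomised at a cut-off and tallied against the reference diagnosis. The function name confusion_counts and the toy data are hypothetical and do not come from the study; treating categories at or above the cut-off as test-positive is an assumption, with category 4 used as the default because it is the cut-off the Results report as optimal.

```python
def confusion_counts(categories, malignant, cutoff=4):
    """Tally TP/FP/FN/TN, calling a nodule test-positive when category >= cutoff.

    categories: ordinal risk category per nodule (e.g. 1-5) from one TIRADS.
    malignant:  reference diagnosis per nodule (True = malignant).
    """
    if len(categories) != len(malignant):
        raise ValueError("categories and malignant must have the same length")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for category, is_malignant in zip(categories, malignant):
        test_positive = category >= cutoff
        if test_positive and is_malignant:
            counts["tp"] += 1
        elif test_positive:
            counts["fp"] += 1
        elif is_malignant:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Toy data for illustration only (not study data).
toy_categories = [5, 4, 4, 3, 2, 5, 3, 2]
toy_malignant = [True, True, False, False, False, True, True, False]
counts = confusion_counts(toy_categories, toy_malignant)
```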

Data analysis and statistical analysis
The statistical analysis was performed using the SPSS software package (version 26.0, SPSS Inc., Chicago, IL, USA). Categorical variables were expressed as percentages and continuous variables were expressed as mean values ± standard deviation. The diagnostic performance measures, namely sensitivity (SEN), specificity (SPEC), negative predictive value (NPV), positive predictive value (PPV) and diagnostic accuracy (DA), were calculated with their corresponding 95% confidence intervals (CI). The Cochran's Q test and McNemar test were used for comparative analysis of the risk stratification systems, and results with the Bonferroni correction for multiple comparisons were adopted for the Cochran's Q test. Receiver operating characteristic (ROC) curves were generated and the areas under the ROC curve (AUROC) were calculated to determine the diagnostic performance of AmCAD-UT for the 6 risk stratification systems; the z-test was used to compare the AUROC of different TIRADS. Precision-recall curves were also generated to complement the ROC results [20]. The optimal cut-off points were obtained from the ROC curves: the cut-off point that gave a compromise between sensitivity and specificity, with the least difference between the two at a higher sensitivity, was deemed optimal [21,22]. The tests were two-sided and P < 0.05 denoted statistical significance.
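The five performance measures follow directly from the counts of a 2x2 table. The sketch below is illustrative rather than a reproduction of the SPSS analysis: the confidence interval method is not stated in the text, so the Wilson score interval is used here as an assumption, and the function names and toy counts are hypothetical.

```python
from math import sqrt


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a proportion (about 95% coverage for z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def diagnostic_measures(tp, fp, fn, tn):
    """Return {measure: (estimate, (ci_low, ci_high))} from 2x2 table counts."""
    pairs = {
        "SEN": (tp, tp + fn),    # malignant nodules called positive
        "SPEC": (tn, tn + fp),   # benign nodules called negative
        "PPV": (tp, tp + fp),    # positive calls that are malignant
        "NPV": (tn, tn + fn),    # negative calls that are benign
        "DA": (tp + tn, tp + fp + fn + tn),
    }
    return {name: (k / n, wilson_ci(k, n)) for name, (k, n) in pairs.items()}


# Toy counts for illustration only (not study data).
measures = diagnostic_measures(tp=40, fp=30, fn=10, tn=20)
```

With these toy counts the sensitivity is 0.80 and the PPV about 0.57, which illustrates how a PPV can sit well below the sensitivity when false positives are frequent.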

Diagnostic performance of AmCAD-UT at different adjusted settings
The diagnostic performance measures for the 205 nodules were analysed at the different adjusted settings. Table 2 shows the results. The optimal TIRADS cut-off point was determined to be category 4, which corresponds to moderate suspicion with ACR, intermediate suspicion with ATA, Kwak, EU and KSThR, and the suspicious level with BTA TIRADS. The best diagnostic performance for all diagnostic measures was achieved at the default setting with all TIRADS. At this setting, Kwak TIRADS had the highest AUROC (0.74). EU TIRADS achieved the highest sensitivity and NPV and the lowest specificity (SEN: 82.7%, NPV: 72.6%, SPEC: 42.1%) (Fig 3). ACR, ATA, Kwak and KSThR TIRADS had good diagnostic accuracy based on an AUROC of 0.70 and above. At the chosen TIRADS cut-off category, the optimal precision and recall were derived from Kwak and ATA TIRADS (Fig 4). All TIRADS generally had high precision at low recall at different cut-off points, such that even at the optimal cut-off category the PPV was substantially lower than the sensitivity. The diagnostic performances at the default setting and the Adjusted 1 setting were comparable for all TIRADS, whereas Adjusted 2 and Adjusted 3 had lower diagnostic performances. Lowering only the sensitivity to hyperechoic foci to 3.5 (Adjusted 1) resulted in a slight increase in sensitivity and specificity and a good AUROC (0.71) with ATA TIRADS. The other TIRADS had a slightly lower sensitivity and AUROC with a slight increase in specificity, except for KSThR, which maintained an AUROC of 0.7. Conversely, the sensitivity of BTA was increased while the specificity was reduced (SEN: 69.4%, SPEC: 58.9%). The AUROC at the adjusted settings was generally lower than at the default setting for all TIRADS.
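For an ordinal risk category, the AUROC can be read as the probability that a randomly chosen malignant nodule receives a higher category than a randomly chosen benign one, with ties counted as one half. The sketch below computes it by direct pairwise comparison (the Mann-Whitney formulation); it is an illustrative stand-in rather than the SPSS procedure used in the study, and the function name and toy data are hypothetical.

```python
def auroc_ordinal(categories, malignant):
    """AUROC from ordinal categories via pairwise comparison, ties scored 0.5."""
    positives = [c for c, m in zip(categories, malignant) if m]
    negatives = [c for c, m in zip(categories, malignant) if not m]
    if not positives or not negatives:
        raise ValueError("need at least one malignant and one benign nodule")
    score = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                score += 1.0      # malignant nodule ranked higher
            elif p == n:
                score += 0.5      # tie shares the credit
    return score / (len(positives) * len(negatives))


# Toy data for illustration only (not study data).
toy_categories = [5, 4, 4, 3, 2, 5, 3, 2]
toy_malignant = [True, True, False, False, False, True, True, False]
auc = auroc_ordinal(toy_categories, toy_malignant)
```

Because TIRADS categories take only a few distinct values, ties between malignant and benign nodules are common, so how ties are scored has a visible effect on the resulting AUROC.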
The Cochran's Q test indicated differences among the different TIRADS at different settings. Table 3 illustrates these results. At the default setting, the most significant differences in sensitivity were between EU and 2 TIRADS (BTA and ATA, p < 0.05), whereas for specificity they were between ACR and KSThR, ACR and BTA, and ATA and EU (p < 0.001). The z-test for AUROC paired differences showed the most statistically significant differences between BTA and Kwak, and EU and Kwak (p < 0.001). The AUROC of Kwak was not significantly different from that of ACR (p = 0.289) and ATA (p = 0.795) at the default setting. At all the adjusted settings there were no significant differences between the AUROC of different TIRADS (p > 0.05). The most significant difference was between ATA and EU TIRADS at all adjusted settings for both sensitivity and specificity (p < 0.05). The post-adjustment diagnostic performance measures of EU-TIRADS were not significantly different from the default setting results (p > 0.05), except at Adjusted 2. EU and ACR TIRADS sensitivity and specificity showed no statistically significant differences at any setting (p > 0.05).
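A pairwise comparison of two TIRADS on the same nodules depends only on the discordant pairs, that is, nodules that one system calls positive and the other negative. The sketch below uses the exact binomial form of the McNemar test, which is an assumption because the text does not state which variant SPSS applied; the function names and toy calls are hypothetical. To compare sensitivity the calls would be restricted to malignant nodules, and to compare specificity, to benign nodules.

```python
from math import comb


def discordant_counts(calls_a, calls_b):
    """Count nodules positive by A only (b) and positive by B only (c)."""
    if len(calls_a) != len(calls_b):
        raise ValueError("both systems must classify the same nodules")
    b = sum(1 for a, other in zip(calls_a, calls_b) if a and not other)
    c = sum(1 for a, other in zip(calls_a, calls_b) if other and not a)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) lower tail up to k, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Toy test-positive calls from two systems on the same nodules (not study data).
calls_a = [True, True, False, True, False, True]
calls_b = [True, False, False, False, True, False]
b, c = discordant_counts(calls_a, calls_b)
p_value = mcnemar_exact(b, c)
```

When several pairs of TIRADS are compared in this way, a multiple-comparison correction such as the Bonferroni adjustment mentioned in the statistical analysis would be applied to the resulting p-values.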

Discussion
CAD approaches have been found to be more objective and to perform with accuracy comparable to expert human assessment of ultrasound features. This study sought to evaluate the diagnostic performance of adjusted settings of AmCAD-UT in comparison with the default setting for thyroid nodule differentiation based on six TIRADS. In the present CAD study, the Adjusted 1 setting gave results comparable with the default setting, with a slight improvement in specificity and sensitivity for ATA TIRADS and a minimal increase in specificity for most TIRADS. The differences in performance with the adjustment of settings may be explained by the difference in malignancy risk stratification criteria for the different TIRADS, which are based on pattern-based approaches (ATA, EU, BTA and KSThR) or score-based approaches (ACR and Kwak) [23,27,28]. Furthermore, there are inconsistencies, mainly in the categorisation of echogenic foci and echogenicity, among the different TIRADS [29]. While ATA may fail to classify nodules with mixed calcifications, Kwak will interpret them as having microcalcifications, thereby resulting in a higher fitted malignancy probability for calcifications [30,31]. Reducing only the detection sensitivity for hyperechoic foci likely hindered the detection of subtle calcifications for the malignancy risk computation, thereby slightly lowering the overall sensitivity while improving specificity. This suggests that AmCAD-UT detection sensitivity adjustments are most advisable for the individual analysis of problematic suspicious sonographic features that affect the malignancy risk estimation, depending on the TIRADS choice. An example is the separate adjustment of the hyperechoic foci and anechoic areas settings for a hypoechoic nodule with mixed echogenic foci and no other suggestive features or corresponding clinical history. Focusing on calcifications and hypoechogenicity features separately could help ascertain the extent to which each feature influences the CAD output, given the classification disparities among the different TIRADS.
Although the current study did not involve multiple users, the sensitivity at the default and Adjusted 1 settings using Kwak, KSThR, and ATA TIRADS was comparable to a previous CAD study's findings for less experienced ultrasound users, while EU TIRADS had higher sensitivity (82.7%) than that same study, which yielded an average sensitivity of about 72% [32]. For the same TIRADS category, our study had similar sensitivity (79.6%) to that of an ACR-based CAD development study for sole CAD and for a junior radiologist using CAD (80.6% and 78.1%, respectively), although that study had a higher diagnostic performance for all other measures [33]. Similarly, at the default setting, the AUROC of above 0.70 with Kwak, KSThR, ACR and ATA TIRADS in our study corresponded with that of a recent multicentre and multi-reader AmCAD-UT study, which demonstrated an average AUROC of 0.792 regardless of user experience [34]. However, the multi-reader study outcomes were not stated as specific to any TIRADS. Because the evaluation of CAD diagnostic performance using multiple TIRADS and readers remains limited, future studies are warranted to verify the influence of CAD user experience based on different TIRADS and settings.
AmCAD-UT ultrasound feature impression analysis in the current study showed that most of the misdiagnosed nodules had some typical features of suspicion for malignancy or benignity. The interpretations of solid, homogeneous nodules with irregular margins and/or echogenic foci features (such as the presence of colloid) were likely the key contributors to the false-positive diagnosis of some benign nodules. This can be attributed to the high thyroid malignancy prediction in the presence of multiple suspicious features established in several non-CAD studies [35-37]. Furthermore, the presence of punctate echogenic foci with a comet-tail artefact in a hypoechoic solid nodule and the presence of multiple calcifications can result in a high malignancy rate and PPV (77.8% and 96%, respectively) [38]. This may account for the misdiagnosis of benign nodules interpreted as having mixed calcifications in the present CAD study. The TIRADS category 4 cut-off criteria likewise contributed to the misdiagnosis findings, because in some TIRADS this category denotes intermediate suspicion, which presents diagnostic challenges even with human assessment. These misdiagnoses confirm the need for clinical correlation for accurate diagnosis even with the complementary use of thyroid CAD.
This study had several limitations. Due to its retrospective nature, histopathological diagnosis data were not available for some nodules, which prevented the analysis of pathological factors. Furthermore, selection bias cannot be excluded because patients' images were selected on the basis of available FNAC and/or histopathology results rather than drawn from the general population. The study design was not typical of a hospital setting, in which CAD results complement those acquired by a clinician, and only one user conducted the CAD analysis, which precluded inter-rater agreement analysis. However, this is a first study to compare the sole diagnostic performance of AmCAD-UT at different adjusted settings, and the study findings may help guide future studies with multiple CAD users. Future standardized prospective studies with larger sample sizes and comparative approaches may be useful in increasing the validity of the findings and improving generalizability.

Conclusion
Based on this study, the diagnostic performance of AmCAD-UT was best for all 6 TIRADS at the default setting. The default setting was best for maximising sensitivity for all TIRADS, with EU-TIRADS having the highest sensitivity. However, there may be potential for improved specificity without compromising the sensitivity at a hyperechoic foci detection setting of 3.5 with other settings maintained at median values. Further large prospective studies are warranted to validate these findings.