The Diagnostic Performance of Thyroid US in Each Category of the Bethesda System for Reporting Thyroid Cytopathology

We aimed to evaluate the diagnostic performance of thyroid ultrasonography (US) in each category of the Bethesda system and to analyze false positive and false negative US findings. This retrospective study included 622 thyroid nodules in 592 patients. The sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and accuracy of US in each category of the Bethesda system were evaluated, and false positive and false negative US cases were analyzed. Of the 622 thyroid FNAs, 179 (28.8%) were malignant. The malignancy rates for the 6 categories were as follows: I (nondiagnostic): 9.7%, II (benign): 2.5%, III (atypia/follicular lesion of undetermined significance): 37.5%, IV (suspicious for follicular neoplasm): 5.7%, V (suspicious for malignancy): 100%, and VI (malignancy): 100%. The accuracies of US for the 6 categories were 92.5%, 95.6%, 70.8%, 94.3%, 95%, and 92.4% in category order. US showed the lowest sensitivity (50%) in category IV. Category III demonstrated relatively low sensitivity (66.7%) and specificity (73.3%) due to a high incidence of follicular variant of papillary thyroid carcinoma and a low number of category III nodules. The best performance of US was observed in category I, with 88.9% sensitivity and 92.9% specificity. In the 22 US false positive cases, the most frequent finding was marked hypoechogenicity and the least frequent finding was a noncircumscribed margin. The most common US features of the 19 false negative cases were circumscribed iso- or hypoechoic nodules. These results highlight the excellent diagnostic performance of US in category I of the Bethesda system and the lowest sensitivity of US in category IV. Awareness of these pitfalls among US interpreters can minimize false positive and false negative diagnoses and prevent unnecessary interventions.


Introduction
Thyroid ultrasound (US) is the mainstay technique by which thyroid nodules are evaluated. Moreover, a general consensus has now been reached regarding the best US criteria for differentiating benign from malignant nodules [1][2][3]. The established US features of a malignant nodule include a taller-than-wide shape, an irregular margin, microcalcifications, and marked hypoechogenicity [2,4,5]. Ultrasound-guided fine needle aspiration (FNA) for suspicious thyroid nodules is an accurate and widely used diagnostic method. Cytological results can indicate whether surgery or follow-up is most appropriate for the thyroid nodules. The Bethesda System for Reporting Thyroid Cytopathology (BSRTC) was developed in 2009 to standardize the terminology for interpreting aspiration cytology results [6][7][8]. The application of BSRTC improved diagnostic accuracy for indeterminate thyroid nodules, leading to higher rates of malignancy detection despite lower rates of thyroidectomies [9]. To manage thyroid nodules, clinicians typically receive two kinds of categorical reports: morphological category, which is based on US, and cytological category, which is based on FNA. It is not yet known whether ultrasound diagnoses have limitations for any of the BSRTC categories, or whether they can provide helpful information for some cytological results. Thus, the purpose of our study was to evaluate the diagnostic performance of thyroid US in each category of the Bethesda System for Reporting Thyroid Cytopathology and analyze false positive/negative findings using US.

Materials and Methods

Patients
This retrospective study was approved by the institutional review board of Samsung Medical Center, and the requirement for informed consent was waived. This study included thyroid nodules assessed by US-guided FNA between August and October 2010 at our institution. During the study period, radiologists at our institution performed approximately 8000 US-guided FNAs per year. The malignancy rate of thyroid nodules undergoing FNA at our hospital is generally about 25-30%. This rate is relatively high because our hospital is a tertiary referral center. All nodules were categorized based on the Bethesda system. We retrospectively reviewed the pathology and US reports of each patient. Patient records were analyzed anonymously. A total of 1353 FNAs in 1345 patients were performed during the study period. Of these, nonthyroidal lesions (n = 93), nodules smaller than 0.5 cm (n = 155), and nodules with no acceptable follow-up or operation (n = 483) were excluded. Finally, 622 nodules were selected from 592 patients who were followed up for at least 2 years or underwent surgery. Statistical analysis was performed on these 622 thyroid nodules.
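The exclusion arithmetic in this paragraph can be checked directly. The following is a minimal Python sketch that uses only the counts reported above; the variable names are illustrative and not part of the study.

```python
# Counts are taken from the Patients paragraph; names are illustrative.
total_fnas = 1353
exclusions = {
    "nonthyroidal lesions": 93,
    "nodules smaller than 0.5 cm": 155,
    "no acceptable follow-up or operation": 483,
}

# Nodules remaining after all three exclusion criteria are applied.
included = total_fnas - sum(exclusions.values())
print(included)  # 622
```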

Thyroid Ultrasound and Image Analysis
Thyroid US was performed at a frequency range of 7 to 15 MHz on an iU22 (Vision 2010; Philips, Seattle, WA, USA) by one of 7 radiologists. All radiologists had 1 to 11 years of experience in thyroid imaging.
The US features of the thyroid nodules were prospectively analyzed by the radiologist who performed the US examination. All nodules were classified into one of three categories (benign, indeterminate, and malignant) according to the Korean Society of Thyroid Radiology (KSThR) guidelines [2]. The KSThR guidelines take into account the internal components, echogenicity, margin, calcification, shape, and orientation of the thyroid nodule and categorize thyroid nodules into three US diagnoses (Table 1). A taller-than-wide shape, a spiculated or irregular margin, marked hypoechogenicity, microcalcifications, and macrocalcifications are all findings suggestive of malignancy [2]. The presence of at least one of these findings defined a nodule as malignant. In contrast, simple cysts, predominantly cystic or cystic nodules with reverberating artifacts, and nodules with a spongiform appearance (especially with intervening isoechoic parenchyma) were defined as benign nodules. Indeterminate nodules had neither malignant nor benign features; their features included iso-, hypo-, or hyperechogenicity, an ovoid-to-round shape, an irregular shape, a smooth or ill-defined margin, and rim calcification.
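The three-way rule described above can be sketched in Python as follows. This is only an illustration of the logic as stated in the text: the feature labels are hypothetical names chosen here, not an official KSThR vocabulary, and checking malignant features before benign ones is an assumption based on the "at least one of these findings" wording.

```python
# Hypothetical feature labels paraphrasing the text; not an official KSThR schema.
MALIGNANT_FEATURES = {
    "taller_than_wide",
    "spiculated_or_irregular_margin",
    "marked_hypoechogenicity",
    "microcalcification",
    "macrocalcification",
}
BENIGN_FEATURES = {
    "simple_cyst",
    "cystic_with_reverberating_artifact",
    "spongiform",
}


def classify_nodule(features):
    """Return 'malignant', 'benign', or 'indeterminate' for a set of feature labels."""
    features = set(features)
    # Assumption: any single suspicious finding takes precedence.
    if features & MALIGNANT_FEATURES:
        return "malignant"
    if features & BENIGN_FEATURES:
        return "benign"
    # Neither malignant nor benign features present.
    return "indeterminate"


print(classify_nodule({"isoechoic", "rim_calcification"}))
```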

Cytological Analysis
US-FNA was performed by one of the seven trained radiologists who conducted the US examinations. US-FNA was performed manually with a 23-gauge needle attached to a 2-mL disposable syringe. On average, 1-2 passes were performed for each nodule. Aspirates were smeared onto a glass slide and immediately fixed in 95% alcohol for Papanicolaou and hematoxylin and eosin staining. No Giemsa staining, liquid-based cytology, or cytoblock preparation was performed. One of six cytopathologists interpreted the FNA specimens. All cases were reported using a six-tiered diagnostic system according to the Bethesda System for Reporting Thyroid Cytopathology [6]. Nodules were classified into the following cytological categories: (1) nondiagnostic or unsatisfactory (Bethesda System I), (2) benign (Bethesda System II), (3) atypia of undetermined significance (AUS)/follicular lesion of undetermined significance (FLUS) (Bethesda System III), (4) follicular neoplasm or suspicious for a follicular neoplasm (Bethesda System IV), (5) suspicious for malignancy (Bethesda System V), and (6) malignant (Bethesda System VI).

Data and Statistical Analysis
All thyroid nodules were categorized according to their US features and also according to their cytopathologic results. Although nodules were classified into three groups according to their US results, nodules identified as indeterminate by US were treated as benign for all subsequent statistical analysis. Thus, statistical analysis was performed on two US categories, probably benign and malignant.
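The collapse from three US categories to two can be written as a small mapping. The sketch below is illustrative only; the category strings and the function name are assumptions chosen to mirror the wording above.

```python
def dichotomize(us_category):
    """Collapse the three US categories into the two used for statistical analysis."""
    if us_category == "malignant":
        return "malignant"
    # Indeterminate nodules are treated as benign, as stated in the text.
    if us_category in ("benign", "indeterminate"):
        return "probably benign"
    raise ValueError(f"unknown US category: {us_category!r}")


print([dichotomize(c) for c in ("benign", "indeterminate", "malignant")])
```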
Lesions were considered to be cytopathologically benign if they met at least one of the following conditions: (1) pathologically confirmed as benign by thyroidectomy or core needle biopsy; (2) US follow-up for at least 2 years with either no interval change or a decrease in size after an initial benign cytology finding; or (3) benign cytology by more than two FNAs. Nodules were defined as malignant if they were confirmed as malignant thyroid carcinoma by two serial FNAs or by thyroidectomy.
Data were analyzed using the Statistical Package for the Social Sciences for Windows (Version 17.0.1, SPSS, Chicago, IL, USA). In each category of the Bethesda system, the sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy of ultrasonography were calculated using the McNemar test. Fisher's exact test was used to determine whether the differences in accuracy between the 6 Bethesda groups were significant. A P value < 0.05 was considered statistically significant.
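For readers who want the definitions of the five measures spelled out, the following Python sketch computes them from a 2 x 2 table of US diagnosis against the final diagnosis. It is not the SPSS procedure used in the study; the function name and the example counts are hypothetical. A measure is returned as None when its denominator is zero, which would happen, for example, for specificity in a category that contains no benign nodules.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic measures from true/false positive/negative counts."""

    def ratio(numerator, denominator):
        # Undefined when the denominator is zero.
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # malignant nodules called malignant on US
        "specificity": ratio(tn, tn + fp),  # benign nodules called benign on US
        "ppv": ratio(tp, tp + fp),          # US-malignant nodules that are malignant
        "npv": ratio(tn, tn + fn),          # US-benign nodules that are benign
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Hypothetical counts, for illustration only (not study data).
example = diagnostic_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```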

Results
Classification of a nodule as benign was determined by operation or core needle biopsy in 55 nodules, two FNAs in 112 nodules, and US follow-up after FNA with a benign result in 276 nodules. Of the malignant nodules, 172 were confirmed by surgery and 7 were confirmed by two serial FNAs or core needle biopsy. These 3 patients did not undergo surgery due to refusal of the operation and the presence of an aggressive malignancy of another organ. Of the 179 malignant nodules, 171 were papillary thyroid carcinomas (PTC), 4 were follicular thyroid carcinomas (FTC), 2 were medullary thyroid carcinomas (MTC), and 2 were diffuse large B-cell lymphomas.
We calculated the sensitivity, specificity, PPV, NPV, and accuracy of the US diagnosis in each Bethesda category (Table 4). The sensitivities of US in categories I, V and VI were 88.9%, 95% and 92.4%, respectively, while the [...] 19. The reasons for FNA in the 19 US false negative cases that were later confirmed were as follows: nodules larger than 1 cm with indeterminate US features (11 cases), an interval increase in size (6 cases), and PET uptake (2 cases). Three FTCs showed probably benign US features, with FNA results of category I, II and IV. One PTC arising from follicular adenoma and 4 follicular variants of PTC showed probably benign US features, with FNA results of category II and III. Eleven PTCs with FNA results of category V and VI showed probably benign US features. The most common US features of the false negative cases were circumscribed iso- or hypoechoic nodules.

Discussion
Thyroid US and US-guided FNA are the two leading diagnostic tools for evaluating thyroid nodular disease. The decision of whether to conduct surgery or to perform follow-up is taken based on thyroid US results together with cytological findings [10].
Due to the lack of information regarding the extent to which US and cytological reports are correlated, it may be difficult for physicians and surgeons to make treatment decisions. Several studies have evaluated the extent to which US diagnoses correlate with cytological results [11][12][13][14][15][16]. Lee et al. evaluated the usefulness of a combined categorical reporting system, including both US and cytological results, for deciding when repeat US-guided FNA should be performed [11]. Some investigators reported the incidence of thyroid cancer among cases with non-diagnostic (Bethesda category I) cytology and additionally evaluated criteria for selecting nodules for repeat FNA according to US features [12,16]. Kim et al. and Rosario et al. reported the diagnostic efficacy of US in evaluating thyroid nodules, especially Bethesda category III nodules [13,15]. In our study, we evaluated the diagnostic performance of US in each of the 6 Bethesda categories.
In our study, US showed the best performance in Bethesda category I, with a sensitivity of 88.9% and a specificity of 92.9%. Only one Bethesda category I nodule, which was confirmed as FTC after operation, was counted as a US false negative. This finding indicates that US can play an important role in determining further management for cytologically non-diagnostic thyroid nodules. Specifically, given the high sensitivity, specificity and accuracy of US in Bethesda category I thyroid nodules, US follow-up rather than re-aspiration is recommended for US benign-looking nodules with Bethesda I results. This recommendation is consistent with that of Lee et al., who also recommended follow-up for US benign non-diagnostic nodules. Their conclusion was based on the high possibility of further non-diagnostic FNA results, which are of little clinical relevance [11]. Moon et al. also recommended follow-up rather than re-aspiration if Bethesda category I nodules have one or fewer suspicious US features or a cystic portion greater than 50% [16].
Bethesda category III is cytologically indeterminate and its rate of malignancy has been reported to range from 5% to 22.6% [15,17]. The recommended management for category III is clinical correlation and a repeated FNA at an appropriate interval. However, many reports have cited a risk of malignancy that should be considered in the management of Bethesda III nodules. This risk depends on the particular physician and institution and is influenced by clinical observations, repeat FNAs, core-needle biopsies, and surgeries [15,17]. To overcome the diagnostic limitations of cytology in indeterminate categories, many studies have evaluated the ability of thyroid US to predict malignancy in Bethesda category III nodules, with the aim of identifying management guidelines for these lesions [2,17-20]. In our study, US had a sensitivity of 66.7%, a specificity of 73.3%, a PPV of 60.0%, and an NPV of 78.6% in Bethesda category III nodules. Rosario et al. recently reported a prospective study of clinical, laboratory, ultrasonographic, and cytological predictors of malignancy in Bethesda category III thyroid nodules, in which US sensitivity, specificity, PPV and NPV were 79.4%, 90.5%, 71% and 93.5%, respectively [15]. Several other studies also suggested the usefulness of US for evaluating malignancy in category III nodules [11,13,17]. Sensitivity and specificity were relatively low in our study compared with previous studies, probably because of the low number of category III nodules (4.8%). Three category III nodules showed false negative US results, which led to the relatively low sensitivity in our study, and all 3 nodules were follicular variants of PTC. Follicular variants of PTC are reported to have a relatively benign appearance on sonography that is more similar to that of follicular neoplasms than to that of PTCs, and this might explain the false negative US findings [21].
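The category III percentages can be cross-checked against each other. The sketch below searches for small integer 2 x 2 tables that reproduce the reported sensitivity, specificity, PPV, NPV, and accuracy (70.8%) to one decimal place, fixing the number of false negatives at 3 as stated above. The resulting counts are back-calculated here as an assumption and are not read from Table 4; the search range of 50 per cell is also an arbitrary choice.

```python
# Percentages reported in the text for Bethesda category III.
REPORTED = {
    "sensitivity": 66.7,
    "specificity": 73.3,
    "ppv": 60.0,
    "npv": 78.6,
    "accuracy": 70.8,
}
FALSE_NEGATIVES = 3  # stated in the text


def matches(tp, fp, fn, tn, tol=0.05):
    """True if the table reproduces every reported percentage within rounding."""
    values = {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }
    return all(abs(100 * values[k] - REPORTED[k]) < tol for k in REPORTED)


# Exhaustive search over small tables; each tuple is (TP, FP, FN, TN).
candidates = [
    (tp, fp, FALSE_NEGATIVES, tn)
    for tp in range(1, 51)
    for fp in range(0, 51)
    for tn in range(1, 51)
    if matches(tp, fp, FALSE_NEGATIVES, tn)
]
print(candidates)  # [(6, 4, 3, 11)]
```

Under these assumptions a single table fits: 6 true positives, 4 false positives, 3 false negatives, and 11 true negatives, which also agrees with the 37.5% malignancy rate reported for category III (9 of 24).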
The accuracy of US in Bethesda category III was relatively low (70.8%) compared with that in categories I (92.5%), II (95.6%), IV (94.3%), V (95.0%) and VI (92.4%). We speculate that the low accuracy of US in Bethesda category III nodules resulted from our decision to treat US indeterminate nodules as US probably benign nodules. If US indeterminate nodules had been considered US malignant nodules, better sensitivity and accuracy would have been achieved. However, in contrast to nodules with malignant features, US indeterminate nodules should not be managed under strict guidelines because most thyroid carcinomas are not aggressive and have a good prognosis.
Meanwhile, the sensitivity and PPV in category IV were both 50%, although the specificity and NPV of US for nodules cytologically suspicious for follicular neoplasm (category IV) were high. This result indicates that the current US morphologic guidelines are of limited value for follicular neoplasms. The current US features mainly reflect papillary thyroid cancer, which limits the sensitivity of US in Bethesda category IV nodules. Further studies with pathology-radiology correlation are needed for follicular neoplasms of the thyroid gland.
Among the 22 US false positive cases, the most frequent finding was marked hypoechogenicity and the least frequent finding was a noncircumscribed margin. The major reason for the false positive cases was that nodules were interpreted as markedly hypoechoic because of improperly adjusted sonic gain. At follow-up, these nodules were not markedly hypoechoic when the gain was properly adjusted. Careful adjustment of sonic gain is crucial for the appropriate diagnosis of thyroid nodules and increases the efficacy of US. Benign thyroid nodules showed an irregular or noncircumscribed margin when a previously existing fluid component had disappeared, causing shrinkage of the nodule.
Comparing the 2011 KSThR guidelines with the 2015 American Thyroid Association (ATA) guidelines, KSThR US "probable benign" corresponds to US "very low suspicion" and "benign" in the ATA guidelines, KSThR US "indeterminate" corresponds to ATA "intermediate suspicion" and "low suspicion", and KSThR US "suspicious malignant" corresponds to ATA "high suspicion". The major difference between the two guidelines is that the ATA guidelines divide the indeterminate and probable benign categories into more detailed categories. The ATA recommends FNA at different sizes for "intermediate suspicion" and "low suspicion" nodules (> 1 cm and > 1.5 cm, respectively), whereas the KSThR recommends FNA for US "suspicious malignant" nodules > 0.5 cm and US indeterminate nodules 1 cm. Another difference is the emphasis on rim-calcified lesions. The KSThR guidelines place rim-calcified nodules in the indeterminate category and comment that the presence of a hypoechoic halo and rim disruption is more suggestive of malignancy. In the ATA guidelines, rim-calcified nodules with a small extrusive soft tissue component are placed in the "high suspicion" category. A further difference concerns macrocalcifications. The KSThR guidelines categorize macrocalcifications as US "suspicious malignant". In the ATA guidelines, macrocalcifications carry the same malignancy risk as microcalcifications if they are combined with microcalcifications [2,22].
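The correspondence between the two guideline vocabularies described above can be summarized as a lookup table. The Python sketch below encodes only what this paragraph states; the dictionary and function names are illustrative, and the structure is not taken from either guideline document.

```python
# KSThR US category -> corresponding ATA sonographic patterns, as described in the text.
KSTHR_TO_ATA = {
    "probable benign": ["very low suspicion", "benign"],
    "indeterminate": ["intermediate suspicion", "low suspicion"],
    "suspicious malignant": ["high suspicion"],
}

# ATA size thresholds for FNA (cm) mentioned in the text; FNA is recommended above these sizes.
ATA_FNA_SIZE_CM = {
    "intermediate suspicion": 1.0,
    "low suspicion": 1.5,
}


def ata_equivalents(ksthr_category):
    """Return the ATA patterns that correspond to a KSThR US category."""
    return KSTHR_TO_ATA[ksthr_category]


for pattern in ata_equivalents("indeterminate"):
    print(pattern, "-> FNA if larger than", ATA_FNA_SIZE_CM[pattern], "cm")
```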
Our present study did not investigate the genetic abnormalities of thyroid nodules. Many investigators have proposed the detection of RET/PTC, TRK and BRAF(V600E) in FNAB specimens as an adjunctive diagnostic tool in the evaluation of thyroid nodules with suspicious cytological findings [23][24][25]. Among the current options, the combination of US findings, biopsy and genetic study is the most reliable approach to triaging undetermined thyroid nodules.
Our study had several limitations. First, this was a retrospective study based on radiologic and pathologic reports. Second, the US and FNA analyses of the 622 nodules were not performed by a single radiologist. Although all the radiologists who performed the US and FNA analyses were extensively trained, inter-assessor variation could have led to different US findings for nodules that are not easily classified. We did not calculate interobserver variability. However, we have a system that can minimize such differences. Inexperienced radiologists (less than 2 years of experience) obtained confirmation of all cases with equivocal or indeterminate US features from experienced radiologists, either in the next room or in real time. Each radiologist also had the opportunity to calibrate his or her diagnostic threshold through intradepartmental conferences. Moreover, differences in FNA skill levels could have increased the number of nondiagnostic results. However, this scenario is representative of clinical practice.
In conclusion, these results highlight the excellent diagnostic performance of US in category I of the Bethesda system and the lowest sensitivity of US in category IV. Awareness of US interpreters regarding these pitfalls can minimize false positive/negative diagnoses and prevent unnecessary interventions.