Performance of a deep-learning algorithm for referable thoracic abnormalities on chest radiographs: A multicenter study of a health screening cohort

Purpose This study evaluated the performance of a commercially available deep-learning algorithm (DLA) (Insight CXR, Lunit, Seoul, South Korea) for referable thoracic abnormalities on chest X-ray (CXR) using a consecutively collected multicenter health screening cohort. Methods and materials A consecutive health screening cohort of participants who underwent both CXR and chest computed tomography (CT) within 1 month was retrospectively collected from three institutions’ health care clinics (n = 5,887). Referable thoracic abnormalities were defined as any radiologic findings requiring further diagnostic evaluation or management, including DLA-target lesions of nodule/mass, consolidation, or pneumothorax. We evaluated the diagnostic performance of the DLA for referable thoracic abnormalities in terms of the area under the receiver operating characteristic (ROC) curve (AUC), sensitivity, and specificity, using ground truth based on chest CT (CT-GT). In addition, for CT-GT-positive cases, three independent radiologist readings were performed on CXR, and clear visible (when more than two radiologists called) and visible (at least one radiologist called) abnormalities were defined as CXR-GTs (clear visible CXR-GT and visible CXR-GT, respectively) to evaluate the performance of the DLA. Results Among 5,887 subjects (4,329 males; mean age 54±11 years), referable thoracic abnormalities were found in 618 (10.5%) based on CT-GT. DLA-target lesions were observed in 223 (4.0%), nodule/mass in 202 (3.4%), consolidation in 31 (0.5%), pneumothorax in one (<0.1%), and DLA-non-target lesions in 409 (6.9%). For referable thoracic abnormalities based on CT-GT, the DLA showed an AUC of 0.771 (95% confidence interval [CI], 0.751–0.791), a sensitivity of 69.6%, and a specificity of 74.0%. Based on CXR-GT, the prevalence of referable thoracic abnormalities decreased, with visible and clear visible abnormalities found in 405 (6.9%) and 227 (3.9%) cases, respectively.
The performance of the DLA increased significantly when using CXR-GTs, with an AUC of 0.839 (95% CI, 0.829–0.848), a sensitivity of 82.7%, and a specificity of 73.2% based on visible CXR-GT and an AUC of 0.872 (95% CI, 0.863–0.880; P <0.001 for the AUC comparison of CT-GT vs. clear visible CXR-GT), a sensitivity of 83.3%, and a specificity of 78.8% based on clear visible CXR-GT. Conclusion The DLA provided fair-to-good stand-alone performance for the detection of referable thoracic abnormalities in a multicenter consecutive health screening cohort. The DLA showed varied performance according to the method used to establish the ground truth.


Introduction
Chest X-ray (CXR) can assist in the diagnosis and management of cardiothoracic disorders; however, in asymptomatic outpatients or the general population, CXR has limited benefit and can lead to unnecessary additional examinations, with their attendant risks of harm and added costs. In a cohort study of primary care outpatients who received a CXR despite the absence of respiratory symptoms, only 1.2% of CXRs detected a major abnormality; 93% of these findings proved to be false positives, and none required treatment on further inspection [1].
Nonetheless, CXR is widely used as a component of periodic health examinations for asymptomatic outpatients or the general population because the examination has many advantages in terms of easy accessibility, low cost, and negligible radiation exposure. In Korea, the National Health Service has offered a free CXR screening biennially to all residents aged 40 years or older [2]. Furthermore, CXR has been widely performed for pre-employment and pre-military service medical screening.
However, the interpretation of CXRs is subject to human error and depends on reader expertise. Approximately 20% of errors in diagnostic radiology occurred during the interpretation of radiography, half of which were related to CXR [3]. The low diagnostic yield and substantial inter- and intra-reader variability remain persistent weaknesses of CXR as a screening tool. For CXR to become an effective screening tool for an asymptomatic general population with a low pre-test probability for chest disease, the method needs to show high sensitivity and a low rate of false-positive results. The limitations of human expert-based diagnosis have provided a strong motivation for the use of computer technology to improve the speed and accuracy of the diagnostic process. Recent advances in deep-learning algorithms (DLA) are expected to improve the diagnostic performance for the screening of lung cancer, pneumonia, and pulmonary tuberculosis on CXR [4][5][6][7][8][9][10].
The purpose of the present study was to evaluate the standalone performance of a commercially available DLA for thoracic abnormalities on CXR in a consecutively collected multicenter health screening cohort.

Materials and methods
This retrospective cohort study was approved by the institutional review boards of three participating institutions (approval number: GFIRB2019-175 for Gil Medical Center, 10-2019-48 for Boramae Medical Center, 2019-05-022 for Konyang University Hospital). All data were deidentified and the requirement for written informed consent was waived. Lunit in Seoul, Korea, provided corporate support to build an image annotation tool. None of the authors have any financial interests or conflicts of interest with the industry or the product used in this study. The authors maintained full control of the data. We present the following article in accordance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting checklist (S1 Appendix).

Study population for the diagnostic cohort study
Data from a total of 5,887 consecutive subjects who visited the health screening center of the three institutions and underwent CXR and chest CT in 2018 were retrospectively investigated from the radiology database and medical records system. Subjects whose CXR and chest CT were performed 1 month or more apart were excluded. Data on age, sex, smoking history (pack-years), and the exam dates of CXR and chest CT were retrospectively collected. Based on age and smoking history, the cohort was classified as having a high risk of lung cancer (aged 55-74 years with ≥30 pack-years of smoking history) or an average risk (general population).

DLA for chest radiographs
We used a commercially available DLA (Lunit INSIGHT for Chest Radiography Version 2.5.7.4; Lunit, Seoul, South Korea) approved by the Korean Ministry of Food and Drug Safety. This version of the DLA was developed for the detection of three major radiologic findings (the target lesion types are nodule/mass, consolidation, and pneumothorax) using a deep convolutional neural network [7]. Further detailed information about its development and validation is presented in S1 Fig. DLA-detected thoracic lesions are marked on a color map with an abnormality score (%). The abnormality score indicates the probability value (0-100%) that the CXR contains malignant nodule/mass, consolidation, or pneumothorax. We used a predefined cutoff value of 15%, as it showed high sensitivity (95%) in the internal validation dataset [11].

Reference standards for referable thoracic abnormalities
After the de-identification of all CXRs, the images were uploaded and annotated for ground truth (GT) using a customized web-based labeling tool provided by Lunit. With labeled GT, the system automatically classified the DLA results as true-positive when there was overlap of at least one pixel with the GT; otherwise, the lesion was classified as false-positive or false-negative.
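The one-pixel overlap rule can be illustrated with a minimal sketch. It assumes, purely for illustration, that both the GT annotations and the DLA markings are represented as axis-aligned pixel boxes with exclusive upper bounds; the actual labeling tool and the DLA color map are not described at this level of detail in the text, and the names `Box`, `overlaps`, and `classify_marks` are hypothetical.

```python
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    """Axis-aligned pixel box; x1 and y1 are exclusive (assumed convention)."""
    x0: int
    y0: int
    x1: int
    y1: int


def overlaps(a: Box, b: Box) -> bool:
    """Return True when the two boxes share at least one pixel."""
    return (min(a.x1, b.x1) > max(a.x0, b.x0)
            and min(a.y1, b.y1) > max(a.y0, b.y0))


def classify_marks(dla_marks: Sequence[Box], gt_boxes: Sequence[Box]) -> dict:
    """Count lesion-level outcomes under the one-pixel overlap rule.

    A GT lesion touched by any DLA mark is a true positive; an untouched
    GT lesion is a false negative; a DLA mark touching no GT lesion is a
    false positive.
    """
    tp = sum(any(overlaps(g, m) for m in dla_marks) for g in gt_boxes)
    fn = len(gt_boxes) - tp
    fp = sum(not any(overlaps(m, g) for g in gt_boxes) for m in dla_marks)
    return {"tp": tp, "fp": fp, "fn": fn}


# Hypothetical example: two GT lesions, two DLA marks.
gt = [Box(10, 10, 20, 20), Box(50, 50, 60, 60)]
marks = [Box(19, 19, 30, 30), Box(100, 100, 110, 110)]
result = classify_marks(marks, gt)
```

In this hypothetical case the first mark shares a single pixel with the first GT box, so that lesion counts as detected, while the second lesion is missed and the second mark is a false positive.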
The reference standard for referable thoracic abnormalities on CXR was determined by three adjudicators (C.Y.J., J.K.N., K.E.Y., with 19, 13, and 12 years of experience in thoracic imaging, respectively), primarily based on the findings of the nearest chest CT. They also reviewed follow-up CXR images and medical records to determine the clinical diagnosis.
Referable thoracic abnormalities, defined as any CXR findings requiring further diagnostic evaluation or management, were classified into 10 lesion types, and the lesions were annotated as box regions of interest (ROIs). They included three DLA-target lesion types (nodule/mass, consolidation, and pneumothorax) and seven DLA-non-target lesion types (atelectasis or fibrosis, bronchiectasis, cardiomegaly, diffuse interstitial lung opacities, mediastinal lesion, pleural effusion, and others). These imaging findings were adapted and partially modified from the labeling standards of the ChestX-ray14 or MIMIC-CXR databases [8,12] and the Fleischner Society glossary of terms for thoracic imaging [13]. Furthermore, final clinical diagnoses were categorized based on the 10th edition of the International Classification of Diseases (ICD-10) [14] or radiologic descriptions for thoracic lesions [13].
The original GT was made based on chest CT, which is considered the most precise method as a reference standard for CXR. However, CT-based GT (CT-GT) is not practical and does not reflect real-world clinical situations. CXR examinations infrequently accompany chest CT examinations; CT examination is performed for suspicious or ambiguous CXR findings for which further evaluation is needed under clinical suspicion. Furthermore, when the adjudicators annotated referable thoracic abnormalities on CXR based on retrospective inspection of chest CT findings, very subtle lesions were labeled on CXR, which were difficult to identify on CXR without CT guidance. To overcome this limitation, we established additional GTs based on consensus CXR readings. For cases with any referable thoracic abnormalities on the original CT-GT, we asked three radiologists (K.R.H, S.Y.S, and H.S.H with 7, 10, and 13 years of experience in thoracic radiology, respectively) to evaluate the existence of referable thoracic abnormalities on the CXR. Finally, we made subsequent GTs based on consensus CXR readings (CXR-GTs); namely, clear visible CXR-GT (for more than two calls) and visible CXR-GT (for at least one call).
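The derivation of the two CXR-GTs from the three readers' calls can be sketched as follows. This is an illustrative sketch only: the function name `consensus_labels` is hypothetical, and the number of calls required for "clear visible" is deliberately left as a parameter (`clear_min_calls`) rather than fixed, so that it can be set to match the study definition.

```python
from typing import Sequence


def consensus_labels(reader_calls: Sequence[bool], clear_min_calls: int) -> dict:
    """Derive case-level CXR-GT labels from independent reader calls.

    reader_calls    -- one boolean per radiologist (True = abnormality called)
    clear_min_calls -- minimum number of calls needed for "clear visible"
                       (an assumption supplied by the caller)
    """
    n_calls = sum(bool(call) for call in reader_calls)
    return {
        "visible": n_calls >= 1,                      # at least one call
        "clear_visible": n_calls >= clear_min_calls,  # caller-defined threshold
    }


# Hypothetical CT-GT-positive case called by only one of three readers.
labels = consensus_labels([True, False, False], clear_min_calls=3)
```

A case that no reader calls is negative under both CXR-GTs, which is why the prevalence of referable abnormalities falls when moving from CT-GT to the CXR-GTs.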

Statistical analysis
The results are presented as percentages for categorical variables and as means (± standard deviation) for continuous variables. Primarily, we evaluated the diagnostic performance of the DLA for referable thoracic abnormalities based on CT-GT, in terms of the area under the receiver operating characteristic (ROC) curve (AUC), sensitivity, specificity, positive predictive value (precision), negative predictive value, and F1 score (the harmonic mean of precision and recall). To evaluate lesion-wise localization performance, the area under the alternative free-response ROC curve (AUAFROC) was used as the performance measure of jackknife alternative free-response ROC (JAFROC) analysis; the curve was plotted with the lesion localization fraction (LLF) against the probability of at least one false-positive (FP) marking per normal CXR. The total number of false-positive markings divided by the total number of CXRs was defined as the number of false-positive markings per image (FPPI). In addition, the true detection rate (the number of correctly localized lesions divided by the total number of lesions) was also evaluated. Finally, we evaluated the performance of the DLA using CXR-GTs (clear visible and visible CXR-GTs). To assess AUC differences when evaluating the DLA using different reference standard methods, we used either the paired or unpaired version of DeLong's test for ROC curves, as appropriate. Statistical analyses were performed using MedCalc version 19.5.1 or R version 3.5.3.
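The threshold-based measures defined above follow directly from confusion-matrix counts. The sketch below is a minimal illustration, not the MedCalc or R code used in the study; the function names and the example counts are hypothetical, and all denominators are assumed to be non-zero.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Image-level diagnostic metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


def fppi(n_false_positive_marks: int, n_images: int) -> float:
    """False-positive markings per image: FP markings / total CXRs."""
    return n_false_positive_marks / n_images


def true_detection_rate(n_localized: int, n_lesions: int) -> float:
    """Correctly localized lesions / total number of lesions."""
    return n_localized / n_lesions


# Hypothetical counts, not taken from the study.
metrics = screening_metrics(tp=80, fp=20, tn=880, fn=20)
rate = fppi(n_false_positive_marks=250, n_images=1000)
detection = true_detection_rate(n_localized=45, n_lesions=60)
```

Because positive and negative predictive values depend on prevalence, the same sensitivity and specificity yield a much lower positive predictive value in a low-prevalence screening cohort than in a symptomatic cohort.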

Baseline characteristics and lesion types of the referable thoracic abnormalities
The proportion of normal cases differed significantly among the three institutions (Bonferroni-corrected Ps <0.001); the prevalence of normal cases was lowest at institution G (85.8%), followed by institution K (90.1%) and institution B (92.7%). Furthermore, the proportions of target and non-target lesions also differed significantly: institution B had fewer target lesions than institution K (B vs. K: 3% vs. 4.9%, Bonferroni-corrected P = 0.002), and institution G had more non-target lesions than the other two institutions (G vs. K: 11% vs. 4.4% and G vs. B: 11% vs. 5.4%, Bonferroni-corrected Ps < 0.001) (Fig 2).
Two institutions (B and K) showed significantly better performance when using CXR-GTs compared to CT-GT. However, institution G did not show significantly better performance when using clear visible CXR-GT (Fig 3).

Discussion
This study evaluated the standalone performance of a commercial DLA for CXR using a consecutively collected multicenter health screening cohort. In a health screening cohort, we can expect a low prevalence of chest disorders compared to an inpatient or outpatient cohort with symptoms and risk factors for respiratory disorders. In this low pre-test probability setting, CXR needs to show high sensitivity and a low rate of false-positive results to become an effective screening tool for the asymptomatic general population. Therefore, we selected a threshold of 0.16 because the primary purpose of screening lies in sensitively detecting thoracic abnormalities, including early lung cancer and tuberculosis.
Our study results showed fair to good diagnostic performance of the DLA for CXR and revealed significantly different performance results for different reference standard methods. Based on CT-GT, the performance of the DLA for referable thoracic abnormalities was fair. However, the performance increased significantly when CXR-GTs were used, with the DLA showing the best performance based on clear visible CXR-GT. Compared with the CXR-GTs, CT-GT included subtle lesions as abnormalities. Among cases with referable thoracic abnormalities primarily based on chest CT (n = 618, 10.5%), the number of patients with visible and clear visible abnormalities on consensus CXR reading decreased to 405 (6.9%) and 227 (3.9%), respectively. When the adjudicators annotated abnormalities originally based on CT, they inevitably tended to call very subtle lesions that are difficult to detect on prospective inspection of CXR.
Interestingly, the performance did not increase significantly in one institution that had a higher number of non-target lesions than the other institutions. The prevalence of lesion types depends on the clinical setting (inpatient, outpatient, emergency room, and health care clinic), hospital level (tertiary academic hospitals: institutions G and K; secondary general hospital: institution B), and location (institutions G and K are located in Incheon and Daejeon in Korea, respectively, while institution B is located in the capital city of Korea, Seoul). Institution G showed the lowest AUC when evaluated using CT-GT, and no performance improvement was observed when the subsequent CXR-GTs were used. In deep-learning modeling, the DLA is trained to detect and classify using a training dataset. Although the imaging findings of DLA-target and DLA-non-target lesions overlapped to some extent (both can appear as increased opacity), the DLA did not perform well for lesion types that were not included in the initial training process (DLA-non-target lesions). Therefore, the interpretation of DLA results requires care, as the performance of the DLA could depend on the disease prevalence and lesion characteristics (target and non-target lesion distribution) as well as the reference standard methods. In previous studies, DLA for CXR showed excellent performance, similar to expert radiologist readings, for the diagnosis of lung cancer, tuberculosis, and multiple abnormal findings [7,11,15]. These studies used previous versions of the DLA with slightly different architectures, and the evaluations were conducted on experimentally designed datasets of prepared lung cancer, tuberculosis, and normal cases, in which each image either had a single abnormal finding or was purely normal.
While these studies confirmed the technical validity of DLAs, in the real-world setting, the incidence of disease differs between clinical settings, and mixed abnormal findings of DLA-target and non-target lesions are common. Furthermore, image quality and comorbidities are obstacles to DLA-based diagnosis from CXR. Therefore, the performance evaluation of DLA in a consecutively collected cohort in a real clinical situation is important to prove the clinical validity of this approach. Unlike the previous version, the DLA used in the present study does not use the lung segmentation module, and the baseline architecture has been changed to ResNet34 [16]. An Attend-and-Compare Module was used in the intermediate layers to improve detection performance [17], and the AutoAugment algorithm [18], combined with conventional image processing techniques such as brightness and contrast adjustment, blurring, and random cropping, was applied to augment the training dataset. Furthermore, the final layer outputs four different abnormality-specific channels (mass/nodule, pneumothorax, consolidation, and abnormalities), each representing the probability map for the corresponding abnormality (S1 Fig). To verify differences in diagnostic capabilities according to DLA architecture differences, further investigation with different DLAs using the diagnostic cohort is needed.
Our study has several limitations. First, subjects who underwent only CXR without chest CT in health clinics were excluded, which may lead to selection bias. Most of the subjects who visited the health clinics did not undergo chest CT. Second, the performance of the DLA was evaluated using a specific version of a commercial product with a predefined cut-off value set for high sensitivity. Therefore, the results were obtained under specific conditions, and care is required when extrapolating them to other products or other clinical settings. Third, the results of our study are limited to one country, so their generalizability to populations in other countries, including those of different racial composition, is uncertain.
In conclusion, the present study demonstrated overall fair to good stand-alone performance of the DLA in determining the presence of referable thoracic abnormalities in a multicenter consecutive health screening cohort. The DLA showed varying performance depending on the type of reference standard method and the frequency of specific lesion types.