Deep learning in chest radiography: Detection of findings and presence of change

Background: Deep learning (DL) based solutions have been proposed for the interpretation of several imaging modalities including radiography, CT, and MRI. For chest radiographs, DL algorithms have found success in the evaluation of abnormalities such as lung nodules, pulmonary tuberculosis, cystic fibrosis, and pneumoconiosis, and in localizing peripherally inserted central catheters. Chest radiography is the most commonly performed radiological test for a multitude of emergent and non-emergent clinical indications. This study aims to assess the accuracy of a DL algorithm for detection of abnormalities on routine frontal chest radiographs (CXR), and for assessment of stability or change in findings over serial radiographs.

Methods and findings: We processed 874 de-identified frontal CXR from 724 adult patients (> 18 years) with a DL algorithm (Qure AI). Scores and prediction statistics from DL were generated and recorded for the presence of pulmonary opacities, pleural effusions, hilar prominence, and enlarged cardiac silhouette. To establish a standard of reference (SOR), two thoracic radiologists assessed all CXR for these abnormalities. Four other radiologists (test radiologists), unaware of the SOR and DL findings, independently assessed the presence of these abnormalities. A total of 724 radiographs were assessed for detection of findings; a subset of 150 radiographs with follow-up examinations was used to assess change over time. Data were analyzed with receiver operating characteristic (ROC) analyses and post-hoc power analysis.

Results: About 42% (305/724) of CXR had no findings according to the SOR; single and multiple abnormalities were seen in 23% (168/724) and 35% (251/724) of CXR, respectively. There was no statistical difference between DL and the SOR for any abnormality (p = 0.2–0.8). The area under the curve (AUC) for DL and the test radiologists ranged between 0.837–0.929 and 0.693–0.923, respectively. DL had its lowest AUC (0.758) for assessing change in pulmonary opacities over follow-up CXR. The presence of implanted chest wall devices negatively affected the accuracy of the DL algorithm for evaluation of pulmonary and hilar abnormalities.

Conclusions: The DL algorithm can aid in the interpretation of CXR findings and their stability over follow-up CXR. However, in its present version, it is unlikely to replace radiologists due to its limited specificity for categorizing specific findings.


Introduction
In 2010, approximately 183 million radiographic procedures were performed in the United States on 15,900 conventional and digital radiography units. Chest radiographs (CXR) represent close to half (44%) of the radiographs in the United States. Most radiographs were acquired in outpatient clinics (48%), followed by hospital-based radiography [1]. An estimated annual growth rate of about 5.5% per year has been reported in prior surveys on radiography. On the one hand, there is tremendous volume and a resultant interpretive burden; on the other, several past studies have reported challenges and inaccuracies associated with radiographic interpretation [2][3][4]. The Centers for Medicare and Medicaid Services (CMS) announced a 7% decrease in reimbursement for computed radiography from calendar year 2018 to address the issue of rising healthcare expenses.
Deep learning (DL) algorithms have been proposed as a solution to expedite, automate, and improve the interpretation of several imaging examinations including CXR. Prior studies have reported encouraging results of various DL algorithms for assessment of specific conditions on CXR, such as pulmonary tuberculosis, cystic fibrosis, lines and tubes (position of peripherally inserted central catheters and endotracheal tubes), pneumoconiosis, and lung nodules [5][6][7][8][9][10][11][12][13][14][15][16]. Another DL algorithm, now a commercially available application, subtracts ribs from single-energy CXR to aid and expedite their interpretation by radiologists.
To support and encourage research and development of DL in imaging, the National Institutes of Health (NIH) has released more than 100,000 deidentified CXR for free and open access [17]. We used these deidentified CXR datasets to assess the accuracy of a commercial deep learning (DL) algorithm for detection of abnormalities and to assess change in findings over serial radiographs.

Methods and materials
Our retrospective study was performed on de-identified NIH image data of adult frontal CXR. Institutional review board (IRB) approval was waived.

CXR data
The publicly available, de-identified ChestX-ray8 database (images and data entry file) was downloaded from the National Institutes of Health website (https://nihcc.app.box.com/v/ChestXray-NIHCC, accessed on January 30, 2018). This database contains labels for CXR based on the presence or absence of 14 radiographic abnormalities. Several subjects have more than one CXR, which enables evaluation of stability or change in abnormalities over serial radiographs.
All CXR were selected randomly from the ChestX-ray8 datasheets (Microsoft EXCEL, Microsoft Inc., Redmond, Wash.) without looking at the accompanying CXR to ensure unbiased inclusion of CXR in the study. None of the standards of reference or test radiologists were part of the selection process.
Of the total 874 CXR included in the study, there were 574 single CXR (one CXR from each patient) from 574 patients and 300 CXR (two CXR from each patient) from the remaining 150 patients. Data labels of mass, nodules, pneumonia, atelectasis, or infiltration were grouped as pulmonary opacity; 374 CXR were selected from this group. Additional CXR were selected with data labels of effusion (n = 75), enlarged cardiac silhouette (n = 75), and those without any assessed findings (n = 200). Next, we identified CXR from 150 patients with (n = 100 patients) and without (n = 50 patients) any change in radiographic findings between the baseline and follow up CXR.
We excluded lateral radiographs, oblique views, patients status post total pneumonectomy, and patients with a metal prosthesis, since the applied DL algorithm has not been trained for these views and situations. The final dataset included 874 CXR belonging to 724 patients (394 M; 330 F) with a mean age of 54 ± 16 years. The radiographs in the ChestX-ray8 database are provided in PNG file format at 1024 × 1024 pixels with 8-bit gray-scale values.

DL algorithm
The DL algorithm (Qure AI) assessed in our study has been installed in hospitals in India. The algorithm did not have United States Food and Drug Administration (FDA) approval at the time of writing this manuscript.
The DL algorithm is based on a set of convolutional neural networks (CNNs), each trained to identify a specific abnormality on frontal CXR. A dataset of 1,150,084 CXR and the corresponding radiology reports from various centers in India were used to develop the algorithm. Natural language processing algorithms were used to parse unstructured radiology reports and extract information about the presence of abnormalities in the chest X-ray. These extracted findings were used as labels when training the CNNs. Individual networks were trained to identify normal radiographs as well as those with the following findings: blunting of costophrenic angle, pleural effusion, pulmonary opacity, consolidation, cavitation, emphysema, interstitial fibrosis, parenchymal calcification, enlarged cardiac silhouette, hilar prominence, and degenerative changes in the thoracic spine.
During algorithm development, radiographs were resized to a fixed size and normalized to reduce source-dependent variance. The CNNs were then trained to detect abnormal findings on the resized radiographs. The network architectures, modified versions of either densenets [18] or resnets [19], were pre-trained on the task of separating CXR from radiographs of other body regions. For each abnormality, multiple models, both densenets and resnets, were trained with different parameters and augmentations. A majority ensembling scheme combines the predictions from these models to decide the presence or absence of an abnormality. The algorithms were validated using an independent validation dataset of 93,972 frontal CXR.
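The majority ensembling described above can be illustrated with a minimal sketch. This is not the vendor's implementation; the model probabilities, the 0.5 per-model threshold, and the function name are illustrative assumptions only.

```python
# Minimal sketch of majority-vote ensembling over several CNN outputs.
# The probabilities below are illustrative, not outputs of the Qure AI system.

def majority_ensemble(probabilities, threshold=0.5):
    """Combine per-model probabilities for one finding into a single
    present (1) / absent (0) decision by majority vote."""
    votes = [1 if p >= threshold else 0 for p in probabilities]
    return 1 if sum(votes) > len(votes) / 2 else 0

# Hypothetical outputs of five trained models (densenet/resnet variants)
# for one finding on one CXR:
model_probs = [0.62, 0.55, 0.48, 0.71, 0.44]
decision = majority_ensemble(model_probs)  # 3 of 5 models vote "present"
```

In such a scheme, each constituent network contributes one vote, so a single poorly calibrated model cannot flip the final call on its own.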
Specific abnormalities (cavitation, emphysema, interstitial fibrosis) were excluded from our study due to their lower representation and reporting heterogeneity. The 874 randomly selected CXR in our study were used neither for training nor for validation of the algorithm (Fig 1). For these CXR, the algorithm provided probability statistics (percentage likelihood) for the presence of findings on a continuous scale (values between 0 and 1). Separate binary prediction scores (0 for values < 0.5; 1 for values ≥ 0.5) were also generated. Heat maps with annotated findings were displayed on duplicate copies of the CXR. The following four findings were assessed with the algorithm: pulmonary opacities, pleural effusions, hilar prominence, and enlarged cardiac silhouette.

Standard of reference (SOR) and test radiologists
To establish the SOR, two experienced, fellowship-trained, thoracic subspecialty radiologists (SD with 16 years of subspecialty experience; MK with 12 years of subspecialty experience) assessed all 874 CXR in consensus for the absence (score 0) or presence (score 1) of pulmonary opacities, pleural effusions, hilar prominence, and enlarged cardiac silhouette. The SOR radiologists also recorded the presence of any lines, tubes, and drains projecting over the CXR. Separately, four thoracic subspecialty radiologists (JP with 35 years of experience; CN with 5 years of experience; AS with 25 years of experience; and VM with 30 years of experience) served as test radiologists for the study. The test radiologists independently evaluated the presence of the previously described four findings in the 724 CXR on the same two-point scale (absence of a finding, score 0; presence, score 1). The test radiologists were unaware of the DL and SOR findings. A total of 150 pairs of serial CXR were assessed for change in findings by the SOR and test radiologists. Microsoft EXCEL worksheets were used for data entry and analysis of the findings.

Statistical analysis
Microsoft EXCEL and SPSS statistical software (IBM SPSS Statistics, Armonk, NY) were used for data analysis. Receiver operating characteristic (ROC) analyses were performed to determine the separate area under the curve (AUC) for the DL algorithm and for each of the four test radiologists against interpretation of the same findings by the SOR. Pair-wise, two-tailed t-tests (Microsoft EXCEL) were performed to compare detection of the four radiographic abnormalities (pulmonary opacities, hilar prominence, pleural effusion, and enlarged cardiac silhouette) between the SOR and the DL algorithm, as well as between the SOR and the four test radiologists. A p-value of less than 0.05 was deemed statistically significant.
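The AUC reported by an ROC analysis has a simple probabilistic reading: it is the chance that a randomly chosen abnormal CXR receives a higher score than a randomly chosen normal one (with ties counting one half). A minimal sketch of that computation, with purely illustrative labels and scores rather than study data:

```python
# AUC as the probability that a random positive outscores a random negative
# (the Mann-Whitney formulation); ties count 1/2. Data below are illustrative.

def auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical SOR labels (0 = finding absent, 1 = present) and DL scores:
sor = [0, 0, 1, 1, 0, 1]
dl = [0.10, 0.60, 0.80, 0.55, 0.30, 0.90]
print(auc(sor, dl))
```

An AUC of 1.0 would mean every abnormal CXR scored above every normal one; 0.5 corresponds to chance-level discrimination.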
A post-hoc analysis (http://clincalc.com/stats/Power.aspx) was performed to determine if the number of radiographs with and without abovementioned abnormalities were adequate for assessing the DL algorithm.

Changes in findings
A pair of radiographs was available for 150 patients within the subset of 724 CXR. These 150 pairs (150 × 2 = 300 radiographs) were used to assess change in findings.

Post-hoc power calculation
A post-hoc power analysis (http://clincalc.com/stats/Power.aspx) revealed that the study was adequately powered for detecting up to 95% of the assessed abnormalities in the 421 abnormal CXR, and up to 1% false positive abnormalities in the 303 CXR without any assessed abnormalities, with a type 1 error rate of 0.01.
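A calculation similar in spirit to the online calculator used here can be sketched with the normal approximation for comparing two proportions. This is an assumed formulation, not the calculator's exact method, and the inputs mirror the figures above only for illustration:

```python
# Sketch of a post-hoc power calculation for a two-proportion comparison
# (normal approximation, two-sided test). Not the exact method of the
# online calculator; inputs are illustrative.
from statistics import NormalDist

def power_two_proportions(p1, n1, p2, n2, alpha=0.01):
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)           # two-sided critical value
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    return nd.cdf(abs(p1 - p2) / se - z_crit)    # approximate power

# e.g. 95% detection among 421 abnormal CXR versus a 1% false positive
# rate among 303 CXR without assessed abnormalities, alpha = 0.01:
print(round(power_two_proportions(0.95, 421, 0.01, 303), 3))
```

With a difference this large between the two proportions, the approximate power is essentially 1.0, consistent with the study being adequately powered.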

Detection of CXR findings
The overall accuracy of the DL algorithm was better than or equal to that of test radiologists with different levels of experience. The DL algorithm had similar accuracy for detection of enlarged cardiac silhouette, pleural effusion, and pulmonary opacities on CXR. Prior studies [20,21], including Rajpurkar et al (AUC 0.9248) [21], have also reported high accuracy of DL algorithms for detection of cardiomegaly. In line with the radiology reports of CXR at our institution, we chose "enlarged cardiac silhouette" as a descriptor for the "cardiomegaly" used in previously published studies [17,20,21].
The accuracy of our DL algorithm for detecting pleural effusions (AUC 0.872) is better than the accuracies reported by Wang et al and others [17,20,21]. These studies [17,20,21] used the same ChestX-ray8 datasets that were used in our study. The AUC of DL for detection of pleural effusions and pulmonary opacities in our study was lower than the near-perfect accuracy (AUC 0.95–1.00) reported by Becker et al [22]. These differences can be attributed to variations in DL techniques, sample size, patient population, the gamut of radiographic findings, and quality of CXR between the two studies.
Detection of hilar prominence was more accurate with the DL algorithm than with any of the test radiologists. To our best knowledge, this finding has not been assessed in prior studies on DL algorithms for CXR interpretation. The higher performance of DL relative to the radiologists may be explained by the relatively subjective nature of this finding compared to the other findings. Conceivably, with different sets of SOR or test radiologists, the accuracy for hilar prominence might have been different. This suggests that, rather than radiologists, a better SOR for CXR may be chest CT from the same patient, since CT can provide robust and objective verification of ground truth for complex radiographic findings such as hilar prominence, which can be overcalled due to differences in technique, patient rotation, and lung volumes.
Pulmonary opacities such as atelectasis, infiltration, pneumonia, consolidation, fibrosis, mass, and nodules were analyzed separately in prior publications [17,20,21]. The DL algorithm in our study was not trained to classify the opacities as pulmonary fibrosis, masses, and nodules, and thus these findings were not separately assessed in our study. Furthermore, atelectasis, infiltration, pneumonia, and consolidation can co-exist in the same anatomic region of the lung, and their radiographic appearance can be similar. Therefore, we assessed these findings jointly as pulmonary opacities. The accuracy of detection of pulmonary opacities in our study was not different from those reported in prior studies [17,20,21] for similar abnormalities.

Change in CXR findings
Although DL was substantially better than the four test radiologists at detecting the presence or absence of change in pulmonary opacities, its accuracy for this task, as noted from the ROC analyses (Table 2), was the lowest of its assessed tasks when compared to the SOR. This may be due to the effect of variations in radiographic technique or patient-related factors (such as differences in inspiratory effort and patient rotation over serial radiographs) on the appearance of pulmonary opacities. To our best knowledge, prior studies have not assessed the accuracy of DL for change or stability in findings on serial CXR.

Implications
Our study implies that the DL based algorithm has high accuracy for detection of specific radiographic abnormalities such as pulmonary opacities, pleural effusions, enlarged cardiac silhouette, and hilar prominence. More work is needed to optimize its performance on follow-up CXR, particularly for pulmonary opacities. The study also raises questions regarding the validity of radiologists as an appropriate SOR, given the subjectivity associated with detection and classification of radiographic findings. Differences in accuracy between the test and SOR radiologists raise concern about the consistency and subjectivity of radiographic interpretation, suggesting the need for a better SOR such as chest CT. Therefore, the results of our study should be interpreted with caution. However, the substantially better sensitivity of CT compared to CXR would also confound such comparisons. Further improvements in the assessed DL algorithm are needed to classify pulmonary opacities and elucidate their anatomic location. DL algorithms for CXR interpretation should also detect, segment, and exclude implanted devices and external foreign objects projecting over the lung fields, which can otherwise result in false positive findings, as noted in our study. Despite these limitations, our study adds to a growing body of evidence regarding the utility of DL algorithms for detection of findings on CXR.
There are additional limitations in our study. We did not perform an a priori power analysis to determine the number of CXR and test radiologists required to assess the performance of the DL algorithm. However, the number of CXR included in our study was higher than in prior studies and was found to be adequate for assessing the DL algorithm on post-hoc analysis. Another limitation of our study pertains to the combined evaluation of different types of pulmonary opacities, rather than the separate categories reported in prior studies [17,20,21]. Evaluation of specific CXR findings not included in the DL algorithm, such as the position of lines and tubes, pneumothorax, pulmonary nodules, masses, and fibrosis, was not possible because the algorithm was not trained for those findings. Although the DL algorithm provided prediction statistics (percentage likelihood) of findings on a continuous scale (values between 0 and 1), we relied on the binary scores (that is, finding absent for prediction percentage < 0.5 and finding present for prediction percentage ≥ 0.5) as recommended by the vendor. It is possible that the specificity and accuracy of DL findings may have been different with the use of cut-off values other than 0.5. There are, however, no published guidelines on the most appropriate cut-off values for DL algorithms or on how such deviations would affect the performance of DL.
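The trade-off behind the cut-off limitation noted above can be made concrete with a short sketch: sweeping the binary cut-off moves sensitivity and specificity in opposite directions. The labels and scores below are illustrative only, not study data:

```python
# Sketch of how the binary cut-off trades sensitivity against specificity,
# illustrating why values other than the vendor-recommended 0.5 could change
# reported performance. Labels and scores are illustrative only.

def sens_spec(labels, scores, cutoff):
    """Sensitivity and specificity of 'score >= cutoff means present'."""
    tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= cutoff)
    fn = sum(1 for l, s in zip(labels, scores) if l == 1 and s < cutoff)
    tn = sum(1 for l, s in zip(labels, scores) if l == 0 and s < cutoff)
    fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

labels = [0, 0, 0, 1, 1, 1, 1]            # 0 = finding absent, 1 = present
scores = [0.20, 0.45, 0.60, 0.40, 0.55, 0.70, 0.90]
for cutoff in (0.3, 0.5, 0.7):
    sens, spec = sens_spec(labels, scores, cutoff)
    print(f"cutoff={cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Lowering the cut-off catches more true findings at the cost of more false positives, and vice versa; the best operating point depends on the clinical use case.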

Conclusions
In conclusion, a DL algorithm can aid in the interpretation of CXR findings such as pulmonary opacities, hilar prominence, cardiomegaly, and pleural effusion. It can also help in assessing the change or stability of these findings on follow-up CXR. Though helpful in improving the accuracy of interpretation, the assessed DL algorithm is unlikely to replace radiologists due to limitations in the categorization of findings (such as pulmonary opacities) and the lack of interpretation of specific findings (such as lines and tubes, pneumothorax, fibrosis, pulmonary nodules, and masses). However, a DL algorithm can expedite image interpretation in emergent situations where a trained radiologist is either unavailable or overburdened in busy clinical practices. It may also serve as a second reader for radiologists to improve their accuracy.