DeBoNet: A deep bone suppression model ensemble to improve disease detection in chest radiographs

Automatic detection of some pulmonary abnormalities using chest X-rays (CXRs) may be adversely impacted by obscuring bony structures such as the ribs and clavicles. Automated bone suppression methods would increase soft-tissue visibility and enhance automated disease detection. We evaluate this hypothesis using a custom ensemble of convolutional neural network models, which we call DeBoNet, that suppresses bones in frontal CXRs. First, we train and evaluate variants of U-Nets, Feature Pyramid Networks, and other proposed custom models using a private collection of CXR images and their bone-suppressed counterparts. DeBoNet, constructed using the top-3 performing models, outperformed the individual models in terms of peak signal-to-noise ratio (PSNR) (36.7977 ± 1.6207), multi-scale structural similarity index measure (MS-SSIM) (0.9848 ± 0.0073), and other metrics. Next, the best-performing bone-suppression model is applied to CXR images, pooled from several sources, showing either no abnormality or findings consistent with COVID-19. The impact of bone suppression is demonstrated by evaluating the gain in performance in detecting pulmonary abnormalities consistent with COVID-19. We observe that the model trained on bone-suppressed CXRs (MCC: 0.9645, 95% confidence interval (0.9510, 0.9780)) significantly outperformed (p < 0.05) the model trained on non-bone-suppressed images (MCC: 0.7961, 95% confidence interval (0.7667, 0.8255)) in detecting findings consistent with COVID-19, indicating the benefits of automatic bone suppression for disease classification. The code is available at https://github.com/sivaramakrishnan-rajaraman/Bone-Suppresion-Ensemble.


Introduction
Chest X-ray (CXR) is a commonly performed radiological examination to visualize various abnormalities in the thoracic cavity [1]. However, accurate interpretation of pulmonary abnormalities can be confounded by superimposed bony structures such as the ribs and clavicles [16]. Ensemble learning is widely used in medical computer vision tasks such as segmentation, object detection, and classification [17]. To the best of our knowledge, no existing literature evaluates the performance of DL model ensembles for bone suppression in CXRs.
In this study, we propose DeBoNet, an ensemble of DL models, for suppressing bones in frontal CXRs. Through its use, we aim to improve disease classification and interpretation performance, which is demonstrated through the detection of findings that are consistent with COVID-19 on CXRs [18]. We train several state-of-the-art architectures such as U-Nets [19] and Feature Pyramid Networks (FPNs) [20], using several ImageNet classifiers as backbones, and also propose custom models for bone suppression. DeBoNet is constructed by (i) measuring the multi-scale structural similarity index measure (MS-SSIM) score between the sub-blocks of the bone-suppressed image predicted by each of the top-3 performing bone-suppression models and the corresponding sub-blocks in the respective ground-truth soft-tissue image, and (ii) performing a majority voting of the MS-SSIM scores computed in each sub-block to identify the sub-block with the maximum MS-SSIM score and use it in constructing the final bone-suppressed image. We empirically determine the sub-block size that delivers superior bone suppression performance. The performances of the individual models and DeBoNet are evaluated using several metrics such as average peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), MS-SSIM, correlation, intersection, and chi-square and Bhattacharyya distances. Next, the best-performing bone suppression model is selected, truncated, and appended with classification layers. This is done to transfer CXR modality-specific knowledge and improve performance in the task of classifying CXRs as showing normal lungs or findings consistent with COVID-19. The performance of the classification model trained on non-bone-suppressed CXRs and bone-suppressed CXRs is compared through several metrics such as accuracy, AUROC, precision, recall, the area under the precision-recall curve (AUPRC), F-score, and MCC. Additionally, we used our in-house class-selective relevance map (CRM)
algorithm [21] to interpret model predictions. Fig 1 shows the graphical abstract of our proposed approach.
Our novel contributions are highlighted as follows: (i) To the best of our knowledge, this is the first study to develop a model ensemble for suppressing bones in CXRs, that we call DeBoNet, and demonstrate its effectiveness through extensive qualitative and quantitative analyses.
(ii) We train and evaluate variants of U-Nets, Feature Pyramid Networks, and other proposed custom models toward the bone suppression task.
(iii) The individual constituent models and the DeBoNet proposed in this study are not restricted to the task of CXR bone suppression but can be potentially applied to other image denoising applications.

Datasets
The following datasets are used in this study: (i) COVID-19 CXR collection: A total of 3016 de-identified, publicly available CXR images showing findings consistent with COVID-19, which serve as the set of cases in our study, are pooled from several sources. A majority of these CXRs come from the BIMCV-COVID19+ CXR data collection, which contains 2473 CXRs showing COVID-19-like manifestations [22]. A set of 183 CXR images showing findings consistent with COVID-19 is collected from a GitHub repository hosted by the Institute for Diagnostic and Interventional Radiology, Hannover Medical School, Hannover, Germany [23]. These CXR images are accompanied by metadata such as admission status and patient demographics. The authors of [17] collected 226 CXRs manifesting COVID-19 from a public GitHub repository hosted by the authors of [24]. This CXR collection is accompanied by metadata including sex, age, finding, and intubation status. They also used a collection of 134 CXRs acquired from SARS-CoV-2 PCR+ patients at a hospital in Spain and posted by a radiologist in a public Twitter thread [25]. Ground-truth COVID-19 disease-specific region-of-interest (ROI) annotations, verified by two expert radiologists, for a subset of this collection (n = 36) are used by the authors of [16] in interpreting model performance.
(ii) RSNA CXR dataset: To serve as experimental controls, we randomly select an equal number (3016) of de-identified CXR images showing no abnormalities from the publicly available RSNA CXR dataset, released for the RSNA pneumonia detection challenge hosted by Kaggle [26]. The full collection includes 26,684 CXR images, of which 8851 show no abnormalities, 6012 show pneumonia-related lung opacities, and 11,821 show other pulmonary abnormalities.
(iii) NIH-CC-DES-Set 1: This set consists of 27 de-identified DES CXR images [15] that were acquired at the National Institutes of Health (NIH) Clinical Center (CC) as a part of routine clinical care. A GE Discovery XR656 digital radiography system was used to acquire the DES images at 120 and 50 kilovoltage peak (kVp), respectively, to capture the soft-tissue and bone-only images. This dataset is used to evaluate the performance of the bone suppression models proposed in this study.
(iv) NIH-CC-DES-Set 2: Another set of 100 de-identified DES CXRs was acquired in the same manner as NIH-CC-DES-Set 1. This collection contains DES images of 54 females and 46 males; the average age ± standard deviation of the males and females is 48.9 ± 14.5 and 45.4 ± 13.6 years, respectively. This dataset is augmented and used to train the bone suppression models.
The NIH-CC-DES-Set 1 and NIH-CC-DES-Set 2 data were selected samples of adult subjects with no radiological findings from the NIH archives that were de-identified and manually verified before use. The NIH Institutional Review Board (IRB) exempted their use from full review. The total number of CXRs pooled from different sources is given in Table 1.

Bone suppression models
The set of 100 grayscale DES CXR images (i.e., the original CXRs and their soft-tissue counterparts) from the NIH-CC-DES-Set 2 dataset is augmented using affine transformations such as rotations (-10 to 10 degrees), horizontal and vertical shifting (-5 to 5 pixels), horizontal mirroring, and zooming, as well as median filtering, Gaussian blurring, and unsharp masking, resulting in 1000 DES CXRs. The augmented images are resized to 256×256 pixels to reduce computational complexity. The contrast of the images is enhanced by clipping the top and bottom 1% of all pixel values. The pixel values are then normalized.
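The contrast-enhancement and normalization steps above can be sketched as follows. This is a minimal sketch: the use of percentile-based clipping and rescaling to the [0, 1] range are our assumptions about the exact implementation.

```python
import numpy as np

def enhance_and_normalize(img):
    """Clip the darkest and brightest 1% of pixel values to enhance
    contrast, then rescale the clipped intensities to [0, 1]."""
    lo, hi = np.percentile(img, [1, 99])       # 1st and 99th percentiles
    clipped = np.clip(img.astype(np.float32), lo, hi)
    return (clipped - lo) / (hi - lo + 1e-8)   # epsilon avoids divide-by-zero
```

In the pipeline described above, this step would be applied to each augmented image after resizing to 256×256.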
We propose the following model architectures for the task of bone suppression in CXRs: Autoencoder with separable convolutions (Autoencoder-BS). The Autoencoder-BS model is a denoising autoencoder with symmetrical encoder and decoder layers. The encoder consists of four separable convolutional blocks. Each convolutional block, except for the last, contains two separable convolutional layers. Separable convolutions are used to reduce computational complexity, thereby facilitating faster convergence and real-time deployment [27]. The number of filters in the separable convolutional blocks of the encoder is 64, 128, 256, and 512, respectively. Except for the last block, a max-pooling layer follows each separable convolutional block to compute the maximum value over individual patches of the feature map. Upsampling layers are used correspondingly in the symmetric decoder blocks to restore the spatial resolution of the input.
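A minimal Keras sketch of the Autoencoder-BS layout described above is given below. The kernel sizes, activation functions, and single-channel sigmoid output layer are assumptions not fixed by the text.

```python
from tensorflow.keras import layers, Model

def build_autoencoder_bs(input_shape=(256, 256, 1)):
    """Sketch of Autoencoder-BS: four separable-convolutional encoder
    blocks (64/128/256/512 filters, two layers per block except the
    last) with max-pooling, mirrored by an upsampling decoder."""
    inp = layers.Input(shape=input_shape)
    x = inp
    # Encoder: blocks 1-3 use two separable conv layers plus max-pooling.
    for f in (64, 128, 256):
        x = layers.SeparableConv2D(f, 3, padding='same', activation='relu')(x)
        x = layers.SeparableConv2D(f, 3, padding='same', activation='relu')(x)
        x = layers.MaxPooling2D(2)(x)
    # Last encoder block: single separable conv layer, no pooling.
    x = layers.SeparableConv2D(512, 3, padding='same', activation='relu')(x)
    # Decoder mirrors the encoder, upsampling to restore resolution.
    for f in (256, 128, 64):
        x = layers.UpSampling2D(2)(x)
        x = layers.SeparableConv2D(f, 3, padding='same', activation='relu')(x)
        x = layers.SeparableConv2D(f, 3, padding='same', activation='relu')(x)
    out = layers.SeparableConv2D(1, 3, padding='same', activation='sigmoid')(x)
    return Model(inp, out, name='autoencoder_bs')
```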
ResNet-based model with residual scaling (ResNet-BS). The architecture of the proposed ResNet-BS model is shown in Fig 3. The first and last convolutional layers contain 128 filters of dimension 3×3. We used residual blocks with shortcut connections that skip over layers. This approach helps overcome convergence issues caused by vanishing gradients in deeper models. Skipping layers allows the activations of earlier layers to be reused until the weights in the succeeding layers are updated. Each residual block consists of two convolutional layers with 3×3 filters and 128 feature maps.
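A Keras sketch of the ResNet-BS building blocks follows: two 3×3 convolutions with 128 feature maps per residual block and a shortcut connection, with the 0.1 residual scaling and the absence of batch normalization that the paper adopts from [28]. The ReLU placement and the 1-filter sigmoid output are our assumptions.

```python
from tensorflow.keras import layers, Model

def residual_block(x, filters=128, scale=0.1):
    """One modified residual block: two 3x3 convolutions, no batch
    normalization, residual scaled before the shortcut addition."""
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.Lambda(lambda t: t * scale)(y)   # residual scaling (0.1)
    return layers.Add()([x, y])                 # shortcut connection

def build_resnet_bs(input_shape=(256, 256, 1), n_blocks=16):
    """Sketch of ResNet-BS: a first 3x3/128 convolution, 16 scaled
    residual blocks, and a sigmoid layer predicting a grayscale image."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(128, 3, padding='same')(inp)
    for _ in range(n_blocks):
        x = residual_block(x)
    out = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)
    return Model(inp, out, name='resnet_bs')
```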
Inspired by [28], we used a modified residual block in which (i) the batch normalization layers are removed, since they are reported to adversely restrict range flexibility through the normalization process, and (ii) activations are not used outside the residual blocks or in the final layer. The network consists of 16 residual blocks with an identical layout. We used zero padding to preserve the spatial dimensions of the input image. The residuals after the deepest convolutional layer in each residual block are scaled by an empirically determined scaling factor (0.1) before being added back to the convolutional path. This scaling approach stabilizes training in deeper models with high computational complexity [28]. The deepest convolutional layer, with a sigmoidal activation function, predicts a grayscale bone-suppressed image. U-Net and FPN-based models. U-Net models are widely used in image segmentation tasks [19]. The U-Net is composed of an encoder and a decoder. The encoder, or contracting path, extracts image features at multiple scales, and the decoder, or expanding path, semantically projects the features learned by the encoder onto the pixel space.
Feature Pyramid Networks (FPNs) are widely used as feature extractors to aid object detection [20]. Fig 4 shows the general architecture of the U-Net and FPN models. The FPN is composed of bottom-up and top-down pathways. The bottom-up pathway constitutes the encoder backbone that extracts image features at multiple scales (with a scaling step of 2). A convolutional layer with a 1×1 filter is used to reduce the feature dimensions of the deepest convolutional layer in the bottom-up pathway to 256. This constitutes the first layer of the top-down pathway. Going deeper, the preceding layer is up-sampled by a factor of 2 using nearest-neighbor up-sampling. A 1×1 convolutional filter is applied to the corresponding feature maps in the bottom-up pathway, and the result is added elementwise. A 3×3 convolution is then applied to all the merged layers to reduce aliasing effects. This helps to generate high-resolution features at each scale.
The grayscale CXR is duplicated across three channels and fed into the U-Net and FPN models. This is because we use ImageNet-pretrained models, trained on RGB images, as the encoder backbones. We experimented with several encoder backbones for the U-Net and FPN models [29] for the task of bone suppression in CXRs. These backbones include (i) EfficientNet-B0 [30], (ii) ResNet-18 [31], (iii) SE-ResNet-18 [32], (iv) DenseNet-121 [33], (v) Inception-V3 [34], and (vi) MobileNet-V2 [35]. We are motivated by the fact that these ImageNet-pretrained models have demonstrated superior performance in medical visual recognition tasks [17]. The final layer of the U-Net and FPN models consists of a convolutional layer with sigmoidal activation to predict grayscale bone-suppressed CXRs.
The proposed bone-suppression models are trained on the augmented NIH-CC-DES-Set 2 dataset and tested on the NIH-CC-DES-Set 1 dataset. We allocated 10% of the training data for validation using a fixed seed. We compiled the models using the Adam optimizer with an initial learning rate of 1e-3 and monitored the following validation performance metrics: (i) loss, (ii) PSNR, (iii) SSIM, and (iv) MS-SSIM. We propose a mixed loss function that benefits from the combination of the mean absolute error (MAE) and MS-SSIM losses, given by L_mix = α · L_MS-SSIM + (1 − α) · L_MAE. We empirically set the value of α to 0.84. The MS-SSIM loss is given a higher weight since the bone-suppressed image should closely match the ground truth. The MAE loss is given comparatively lower significance because it focuses on contrast and luminance, which are expected to change while suppressing the bones. We reduced the learning rate whenever the validation performance ceased to improve. Early stopping with a patience of 10 epochs is used. The best-performing models (with the least validation loss) are then used to predict bone-suppressed CXR images on the test set. An Ubuntu Linux system with an NVIDIA GeForce GTX 1080 graphics card and the Keras framework with a TensorFlow backend is used for model training and evaluation.
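The mixed loss can be written compactly as follows. The MS-SSIM computation itself is abstracted behind a callable (in a Keras pipeline this could be, e.g., tf.image.ssim_multiscale), and expressing the MS-SSIM loss as 1 − MS-SSIM is our assumption.

```python
import numpy as np

def mixed_loss(y_true, y_pred, ms_ssim_fn, alpha=0.84):
    """Mixed loss: L = alpha * (1 - MS-SSIM) + (1 - alpha) * MAE,
    with alpha = 0.84 weighting the structural term more heavily.
    `ms_ssim_fn` is any callable returning the MS-SSIM score of the
    image pair."""
    mae = np.mean(np.abs(y_true - y_pred))
    return alpha * (1.0 - ms_ssim_fn(y_true, y_pred)) + (1.0 - alpha) * mae
```

For identical images the MS-SSIM term is 1 and the MAE term is 0, so the loss vanishes, as expected.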
DeBoNet-bone suppression model ensemble. The bone suppression model ensemble, which we call DeBoNet, is constructed using the top-3 performing models, i.e., those demonstrating markedly improved performance in terms of the MS-SSIM metric on the NIH-CC-DES-Set 1 test set. Each of the top-3 performing models predicts a bone-suppressed image for an input CXR. The image predicted by each model is divided into sub-blocks of M×M dimensions. The optimal value of M ∈ {4, 8, 16, 32, 64, 128, 256} is determined through extensive empirical evaluations. For a given sub-block size and in each sub-block, the following are performed: (i) we measured the MS-SSIM score between the sub-block of the bone-suppressed image predicted by each of the top-3 performing models and the corresponding sub-block in the respective ground-truth soft-tissue image; (ii) we performed a majority voting of the MS-SSIM scores to find the image sub-block with the maximum MS-SSIM score and use it in constructing the final bone-suppressed image. The algorithm below discusses these steps. DeBoNet evaluation. We performed evaluations using the histograms of the ground truths and the bone-suppressed images predicted by the individual bone-suppression models and DeBoNet. Several metrics such as correlation, intersection, chi-square distance, and Bhattacharyya distance are measured to assess similarity. The higher the correlation and intersection values, the closer (more similar) the histograms of the image pairs. For distance-based metrics such as chi-square and Bhattacharyya, a smaller value indicates a superior match between the histogram pairs, implying that the histograms of the predicted bone-suppressed images closely match their respective ground truths. The mathematical formulations of these metrics can be found in the literature [36]. The average values of the aforementioned metrics are computed for each model and for DeBoNet and compared for statistical significance.
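The sub-block voting procedure above can be sketched as follows. Note that the fusion rule scores each tile against the ground-truth soft-tissue image, which is why DeBoNet is only applicable where DES ground truth exists. The MS-SSIM scorer is abstracted behind a callable; the stand-in scorer used here is for illustration only.

```python
import numpy as np

def debonet_ensemble(predictions, ground_truth, block=4, score_fn=None):
    """Sketch of the DeBoNet fusion rule: split each predicted image
    into block x block tiles, score every tile against the matching
    ground-truth tile (MS-SSIM in the paper), and copy the best-scoring
    tile among the models into the output image."""
    if score_fn is None:
        # Stand-in for MS-SSIM: negative mean absolute difference
        # (higher is better), used only for illustration.
        score_fn = lambda a, b: -np.mean(np.abs(a - b))
    h, w = ground_truth.shape
    fused = np.zeros_like(ground_truth)
    for r in range(0, h, block):
        for c in range(0, w, block):
            tiles = [p[r:r + block, c:c + block] for p in predictions]
            gt = ground_truth[r:r + block, c:c + block]
            best = max(tiles, key=lambda t: score_fn(t, gt))
            fused[r:r + block, c:c + block] = best
    return fused
```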
Classification model. For classification, we initially used a custom U-Net model proposed in [17] to segment the lung ROI in the CXRs. This approach ensures that the models learn relevant features from the lung ROI and not the surrounding context. The U-Net model is trained to generate 256×256-dimension lung masks. The generated masks are overlaid on the input CXRs to delineate the lung boundaries. The delineated regions are cropped to a bounding box containing the lung pixels. The lung-cropped CXRs are further preprocessed to enhance image contrast by clipping the top and bottom 1% of all pixel values. We further performed pixel normalization, centering, and standardization to reduce computational complexity during model training.
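The mask-guided cropping step can be sketched as follows; the binary mask format and the thresholding are assumptions.

```python
import numpy as np

def crop_to_lung_roi(cxr, mask):
    """Crop the CXR to the bounding box of the predicted lung mask,
    so the classifier sees only the lung region."""
    rows = np.any(mask > 0, axis=1)          # rows containing lung pixels
    cols = np.any(mask > 0, axis=0)          # columns containing lung pixels
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return cxr[r0:r1 + 1, c0:c1 + 1]
```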
The encoder of the best-performing bone-suppression model is truncated and appended with the following layers: (i) zero padding (ZP); (ii) a convolutional layer with 512 filters, each of size 3×3; (iii) global average pooling (GAP); and (iv) a dense layer with two nodes and softmax activation to classify the CXRs as showing normal lungs or findings consistent with COVID-19. This approach is followed to transfer the CXR modality-specific knowledge learned from the bone suppression task and improve performance in a related classification task. A study of the literature reveals several works that used CXR modality-specific models to transfer knowledge and improve classification and localization performance in a related task [17,37,38].
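The truncate-and-append step can be sketched in Keras as follows. The padding width and the exact truncation point of the encoder are assumptions; the ZP, 3×3/512 convolution, GAP, and two-node softmax layers follow the listing above.

```python
from tensorflow.keras import layers, Model

def add_classification_head(encoder):
    """Append the classification layers to a truncated bone-suppression
    encoder: zero padding, a 3x3 convolution with 512 filters, global
    average pooling, and a two-node softmax dense layer."""
    x = layers.ZeroPadding2D(1)(encoder.output)
    x = layers.Conv2D(512, 3)(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(2, activation='softmax')(x)   # normal vs. COVID-19
    return Model(encoder.input, out, name='bs_classifier')
```

The encoder weights carry the CXR modality-specific knowledge learned during bone suppression; the appended head is then trained on the classification data.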
Recall that we use the COVID-19 CXR collection as cases and the RSNA CXR collection as controls for the classification task. Since the ground-truth soft-tissue images are not available for these CXRs, DeBoNet could not be directly used. Instead, the best-performing bone suppression model is selected and applied to these CXR collections. We used 90% of these data for training and 10% for hold-out testing. For consistency, we used a fixed seed and allocated 10% of the training data for validation. The model is then retrained individually on the non-bone-suppressed and bone-suppressed CXR images to classify them as showing no abnormalities or findings consistent with COVID-19. We performed augmentation with random affine transformations such as rotations (-10 to 10 degrees), horizontal and vertical pixel shifting (-5 to 5 pixels), zooming, and horizontal mirroring, to introduce variability into the training process and reduce model overfitting to the training data. The model is compiled using a stochastic gradient descent optimizer with an initial learning rate of 1e-3. The learning rate is reduced whenever the validation performance does not improve. We used callbacks to store model weights and early stopping to prevent overfitting, and stored the best weights for further analysis. The best model is used to predict on the test set and output class probabilities.
The MCC is computed as MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)). Here, TP, TN, FP, and FN denote the true positive, true negative, false positive, and false negative counts, respectively. Additionally, we used our in-house class-selective relevance map (CRM) algorithm [21] to interpret the predictions of the models trained on non-bone-suppressed and bone-suppressed images and to ensure they learned to highlight regions containing findings consistent with COVID-19. Statistical analyses. We performed statistical analyses to identify whether a significant difference existed in the performance achieved by the bone suppression and classification models. For bone suppression, we performed a one-way Analysis of Variance (ANOVA) to analyze whether a significant difference existed in the MS-SSIM and chi-square distance values obtained using the top-3 performing bone-suppression models and DeBoNet. We performed Shapiro-Wilk and Levene tests to analyze whether the prerequisite conditions of data normality and homogeneity of variances were satisfied for one-way ANOVA analyses. For classification, we measured the 95% binomial confidence intervals (CI), as the Exact Clopper-Pearson interval, for the MCC metric to compare the classification performance achieved by the models trained on non-bone-suppressed and bone-suppressed images. We used R statistical software (version 4.1.1) to perform these evaluations.
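The MCC referenced above can be computed directly from the confusion-matrix counts; the zero-denominator convention (returning 0) is a common choice and an assumption here.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from TP, TN, FP, and FN counts.
    Ranges from -1 (total disagreement) to +1 (perfect prediction)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```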

Bone suppression
Recall that the proposed bone suppression models are trained on the augmented NIH-CC-DES-Set 2 dataset and tested on the NIH-CC-DES-Set 1 collection. The performance achieved by the bone suppression models is shown in Table 2. Fig 6 shows the bone-suppressed images predicted by the proposed bone suppression models for an input CXR instance from the test set.
It is observed from Table 2 that the FPN model with the EfficientNet-B0 encoder backbone (F-EB0-BS) demonstrated superior performance for all metrics compared to the other models. We observed from Fig 6 that all models predicted images demonstrating substantial suppression of the bony structures. We further performed a quantitative evaluation to differentiate model performance. In this regard, we observed that the F-EB0-BS model demonstrated the smallest chi-square and Bhattacharyya distances and the highest correlation and intersection values, signifying that, compared to the other models, the bone-suppressed images predicted by the F-EB0-BS model most closely match the ground-truth soft-tissue images. This performance is followed by the FPN model with the ResNet-18 encoder backbone (F-Res18-BS) and the U-Net model with the ResNet-18 encoder backbone (U-Res18-BS), which demonstrated markedly improved values for the PSNR, SSIM, MS-SSIM, correlation, intersection, chi-square, and Bhattacharyya distance measures compared to the remaining models. These top-3 performing models are further considered to construct the ensemble.
The bone-suppressed images predicted by the top-3 performing models are divided into sub-blocks of M×M dimensions. We empirically determined the value of M ∈ {4, 8, 16, 32, 64, 128, 256} that delivers superior bone suppression performance. For a given sub-block size, and in each sub-block, (i) we measured the MS-SSIM score between the sub-block of the bone-suppressed image predicted by each of the top-3 performing models and the corresponding sub-block in the respective ground truth, and (ii) performed a majority voting of the MS-SSIM scores in each sub-block to identify the sub-block with the maximum MS-SSIM score and use it in constructing the final bone-suppressed image. Table 3 shows the performance achieved while constructing DeBoNet using varying sub-block sizes. It is observed from Table 3 that DeBoNet's performance with the various sub-block sizes is superior to that achieved by the top-3 performing models (from Table 2). We observed that with a sub-block size of 4×4, DeBoNet achieved superior performance in terms of PSNR, SSIM, MS-SSIM, correlation, intersection, and the chi-square and Bhattacharyya distances compared to the other sub-block sizes and the top-3 performing models. Curiously, we also note relatively high performance at 256×256 grid dimensions. Studying the correlation between granularity and the MS-SSIM score is left as future work.
We performed a one-way ANOVA analysis to observe whether a statistically significant difference existed in the MS-SSIM and chi-square values obtained using DeBoNet with sub-block size 4×4 and the top-3 performing bone-suppression models, namely the F-EB0-BS, F-Res18-BS, and U-Res18-BS models. Fig 7 shows the mean plots for the MS-SSIM and chi-square values, respectively, obtained by the models. One-way ANOVA requires that the assumptions of data normality and homogeneity of variances are satisfied; we performed the Shapiro-Wilk and Levene tests to verify these assumptions. Recall that the best-performing F-EB0-BS bone suppression model is used to suppress the bones in the CXRs used in the classification task. This is because the ground-truth soft-tissue images are not available for these CXRs; hence, DeBoNet could not be used.

Classification
Recall that the encoder of the best-performing F-EB0-BS bone suppression model is truncated and appended with the classification layers to classify the CXRs as showing normal lungs or COVID-19-consistent findings. This approach is followed to transfer CXR modality-specific knowledge and improve classification performance. The classification model is retrained on the non-bone-suppressed and bone-suppressed CXR images, and the measured performance is shown in Table 4 and Fig 9. We observed from Table 4 and Fig 9 that the classification model trained on bone-suppressed images demonstrated superior performance in terms of the accuracy, AUROC, AUPRC, sensitivity, precision, F-score, and MCC metrics compared to the model trained on non-bone-suppressed images. The 95% binomial CI obtained for the MCC metric using the model trained on bone-suppressed images demonstrated a tighter error margin and higher precision, and is found to be significantly superior (p < 0.05) to the MCC achieved by the model trained on non-bone-suppressed images.
We qualitatively evaluated the models trained on non-bone-suppressed and bone-suppressed images to verify whether they learned to highlight regions containing COVID-19-consistent findings rather than the surrounding context. We used the CRM localization tool to interpret model behavior.

Discussion and conclusions
The observations made from this study underscore the need for (i) customizing a model for the problem under study, (ii) constructing a model ensemble for bone suppression, and (iii) interpreting model behavior.
Our proposed approach directly predicts a bone-suppressed image for a given input CXR. This is more computationally effective than other studies in the literature [5][6][7][8][9] that propose a series of steps to generate bone-only images and subtract them from the input CXRs to increase soft-tissue visibility. A limitation of that approach is that sub-optimal generation of bone-only images introduces noise and distortion into the process and may adversely impact decision-making. We proposed several custom models for this task.
Fig 2 illustrates the architecture of the proposed Autoencoder-BS model.

Fig 2. The architecture of the Autoencoder-BS model. The input to the model is a grayscale CXR image. The model has a symmetrical separable convolutional encoder and decoder architecture. https://doi.org/10.1371/journal.pone.0265691.g002

Fig 7. Statistical analyses using one-way ANOVA. (a) and (b) show the mean plots for the MS-SSIM and chi-square values, respectively, obtained by the DeBoNet (4×4), F-EB0-BS, F-Res18-BS, and U-Res18-BS models. https://doi.org/10.1371/journal.pone.0265691.g007

Fig 8 shows the bone-suppressed images predicted by the F-EB0-BS model for instances of CXRs showing findings consistent with COVID-19. Note that the F-EB0-BS model generalizes to unseen CXRs from the classification data that were not used during bone-suppression model training and validation. We observed superior suppression of the bones, and the image resolution is preserved.
Fig 10 shows instances of CXRs and the CRM-based disease ROI localization obtained using the trained models. Fig 10A, 10D and 10G show instances of CXRs from the Twitter COVID-19 CXR collection, with expert annotations shown as blue bounding boxes. Fig 10B, 10E and 10H show the localization achieved using the model trained on non-bone-suppressed images. It can be observed that this model highlights the surrounding context rather than COVID-19-consistent manifestations, demonstrating that it has not learned relevant features regarding findings consistent with COVID-19. Fig 10C, 10F and 10I show the localization achieved using the model trained on bone-suppressed images. We observe that this model precisely highlighted regions specific to findings consistent with COVID-19, demonstrating that it learned task-specific features and confirming the experts' knowledge about the disease.

Table 2. Performance achieved by the proposed bone suppression models using the NIH-CC-DES-Set 1 test set.
The values are given as mean ± standard deviation. The best performances are denoted by bold numerical values in the corresponding columns.

Table 3. Performance achieved by DeBoNet using various sub-block sizes.
The values are given as mean ± standard deviation. The best performances are denoted by bold numerical values in the corresponding columns.

Table 4. Classification performance achieved by the models trained on non-bone-suppressed and bone-suppressed images.
Data in parentheses denote the 95% binomial CI, measured as the Exact Clopper-Pearson interval, for the MCC metric. Bold numerical values denote superior performance in the respective columns.
https://doi.org/10.1371/journal.pone.0265691.t004

improve model confidence, performance, and generalization to real-world data. (iii) This is not a classification-related study, but we wanted to evaluate if bone suppression would