Calibrated bagging deep learning for image semantic segmentation: A case study on COVID-19 chest X-ray image

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causes coronavirus disease 2019 (COVID-19). Imaging tests such as chest X-ray (CXR) and computed tomography (CT) can provide useful information to clinical staff for facilitating a diagnosis of COVID-19 in a more efficient and comprehensive manner. As a breakthrough of artificial intelligence (AI), deep learning has been applied to perform COVID-19 infection region segmentation and disease classification by analyzing CXR and CT data. However, prediction uncertainty of deep learning models for these tasks, which is very important to safety-critical applications like medical image processing, has not been comprehensively investigated. In this work, we propose a novel ensemble deep learning model through integrating bagging deep learning and model calibration to not only enhance segmentation performance, but also reduce prediction uncertainty. The proposed method has been validated on a large dataset that is associated with CXR image segmentation. Experimental results demonstrate that the proposed method can improve the segmentation performance, as well as decrease prediction uncertainty.


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causes coronavirus disease 2019 (COVID-19), which was first identified in 2019 in Wuhan, Central China [1]. It has spread globally, resulting in more than 458 million confirmed infections and 6 million deaths and causing huge economic loss. Although the global economy appears to be recovering gradually, early and accurate tests for this disease, such as reverse transcription-polymerase chain reaction (RT-PCR), antigen tests, and medical imaging tests, must be improved to be ready for future pandemics [2,3]. Compared to RT-PCR tests, medical imaging tests such as chest X-ray (CXR) and computed tomography (CT) are more effective and efficient [4,5], especially for severe patients, which is of great help to physicians. For instance, in Italy, the United States, and China, the majority of serious COVID-19 cases have been identified through the manifestation characteristics in CT images [6]. Therefore, effective extraction of COVID-related information from medical images will play an important role in fighting a new round of the pandemic caused by mutated COVID variants [7]. Deep learning (DL) has played an important role in promoting COVID-related information extraction through COVID-19 infection region segmentation and disease classification on CXR and CT data [8,9]. Compared with CT images, CXR images are easier to obtain in radiological inspections. Currently, most DL models, especially convolutional neural networks (CNNs), have been employed to classify entire CXR images to detect COVID-19 cases [10,11]. For example, Hemdan et al. proposed COVIDX-Net to assist radiologists in diagnosing COVID-19 based on CXR features [12]. It integrated various deep convolutional neural network (DCNN) models with different structures, such as DenseNet201 [13], Xception [14], and MobileNetV2 [15]. Sethy et al. integrated different DCNN models with a support vector machine (SVM) classifier to recognize COVID-19 [16].
In addition, to address the shortcomings of training data, Castiglioni et [21]. Moreover, Lucy et al. [22] developed a two-path semi-supervised deep learning model to implement COVID-19 classification by using huge amounts of unlabeled data.
Compared with CXR classification, CXR semantic segmentation is a more challenging task: it classifies each pixel into predefined classes [23] to recognize regions of interest (ROIs) on CXR images, and only a few previous works have explored it [24][25][26]. However, the prediction uncertainty of DL models for this task has not been comprehensively investigated, since most DL models focus on performance improvements such as increasing detection accuracy. For safety-critical applications like medical image processing, the prediction uncertainty of DL models is a key evaluation metric for the reliability of model predictions, since high prediction uncertainty means low prediction reliability. For example, in COVID-19 applications, applying uncertain predictions to clinical processes could result in disastrous consequences such as missed severe COVID cases or delayed treatments. This paper proposes a novel ensemble deep learning model that integrates bagging deep learning [27] and model calibration [28] to enhance the performance of semantic segmentation as well as reduce prediction uncertainty. It includes three stages, shown in Fig 1: 1) training multiple state-of-the-art DL models such as fully convolutional networks (FCN) [29], FCN combined with ResNet [30], FCN combined with MobileNet [29], PSPNet [31], and UNet [32] on training CXR datasets; 2) calculating calibration errors to measure the prediction uncertainties of these DL models on validation CXR datasets, where expected calibration error (ECE) and maximum calibration error (MCE) [28] are employed as the measures; 3) implementing calibrated bagging deep learning with weighted voting, where the weight of each DL model is inversely proportional to its calibration error. The proposed model is validated on a large-scale CXR dataset to examine its effectiveness.
Experimental results demonstrate that the proposed method not only enhances the performance of semantic segmentation, but also improves the prediction certainty on CXR data.
The contributions in this study are below.
• We systematically compared performance of various state-of-the-art DL models for semantic segmentation on COVID-19 CXR data with different evaluation metrics. Moreover, the prediction uncertainties of these DL models were investigated by measuring expected calibration error (ECE) and maximum calibration error (MCE).
• We implemented a novel ensemble deep learning model based on model calibration and bagging deep learning, which is to calibrate bagging deep learning models through weighted summation of predictions generated by individual models. The proposed approach is easily implemented and scalable to various tasks.
• We validate the proposed method with semantic segmentation on a large COVID-19 CXR dataset based on different evaluation metrics. Experimental results demonstrate its effectiveness on improving performance and prediction certainty, simultaneously.

Methodology
The proposed method is built based on calibration error [33][34][35] and bagging deep learning [27] to enhance image segmentation with higher prediction certainty.

Calibration error
In model calibration, the expected calibration error (ECE) and the maximum calibration error (MCE) can be employed to measure the quality of the uncertainty estimates of machine learning models in terms of prediction accuracy [36], which is critical for high-risk applications such as medical diagnosis [34,35] and self-driving [37].
• Expected Calibration Error (ECE). It estimates the calibration error in expectation in three steps: 1) discretizing the prediction probability range into a fixed number of bins; 2) assigning each predicted probability to one of these bins; 3) calculating the difference between the fraction of predictions in each bin that are correct (accuracy) and the mean of the probabilities in that bin (confidence) by
$$\mathrm{ECE} = \sum_{k=1}^{K} \frac{n_k}{N} \left| \mathrm{acc}(k) - \mathrm{conf}(k) \right| \quad (1)$$
where K is the number of bins, n_k is the number of predictions in bin k, N is the total number of samples predicted, and acc(k) and conf(k) denote the accuracy and confidence in bin k, respectively. It is a weighted average of the differences between accuracy and confidence in these bins.
• Maximum Calibration Error (MCE). It measures an upper bound of ECE, namely the maximum difference between accuracy and confidence across all bins:
$$\mathrm{MCE} = \max_{k \in \{1, \dots, K\}} \left| \mathrm{acc}(k) - \mathrm{conf}(k) \right| \quad (2)$$
In summary, MCE measures the largest calibration gap across all bins, whereas ECE measures a weighted average of all gaps. Both MCE and ECE equal 0 if the model is perfectly calibrated.
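The binning procedure behind ECE and MCE can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function name `calibration_errors`, the choice of 10 equal-width bins, and the convention that a pixel's confidence is the probability of its predicted class are all assumptions made for the example.

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) from confidences in [0, 1] and 0/1 correctness flags."""
    confidences = np.asarray(confidences, dtype=float).ravel()
    correct = np.asarray(correct, dtype=float).ravel()
    n_total = confidences.size
    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 falls in the last bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    ece, mce = 0.0, 0.0
    for k in range(n_bins):
        in_bin = bin_ids == k
        n_k = int(in_bin.sum())
        if n_k == 0:
            continue  # empty bins contribute nothing to either error
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += (n_k / n_total) * gap   # weighted average of the gaps
        mce = max(mce, gap)            # largest gap over all bins
    return ece, mce

# Hypothetical binary-segmentation usage: lung probabilities for four pixels.
probs = np.array([[0.95, 0.55], [0.05, 0.45]])
truth = np.array([[1, 0], [0, 0]])
pred = (probs >= 0.5).astype(int)
conf = np.where(pred == 1, probs, 1.0 - probs)  # confidence of the predicted class
ece, mce = calibration_errors(conf, pred == truth)
```

Because MCE is the maximum of the same per-bin gaps that ECE averages, the returned MCE is never smaller than the returned ECE.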

Bagging learning
Ensemble deep learning combines several individual deep models to improve generalization performance through ensemble strategies such as bagging and boosting, integrating the advantages of both deep learning and ensemble learning [27]. Bagging (or bootstrap aggregating) generates a series of independent subsets from the training data and trains one individual predictor per subset to build an ensemble model [38]. In detail, it generates the bagging samples, passes each bag of samples to a base model to build multiple predictors, and then combines the predictions of these predictors with a specific strategy such as majority voting. Fig 2 presents a diagram for building and testing bagging deep learning with majority voting, where multiple training sets can be generated by sampling with or without replacement.
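The two mechanical pieces of bagging, drawing the bags and aggregating the votes, can be sketched as follows. The helper names `bootstrap_indices` and `majority_vote` are hypothetical, the training of the base models is omitted, and sampling with replacement is chosen only for illustration.

```python
import numpy as np

def bootstrap_indices(n_samples, n_bags, rng):
    """Draw n_bags index sets of size n_samples, sampling with replacement."""
    return [rng.integers(0, n_samples, size=n_samples) for _ in range(n_bags)]

def majority_vote(masks):
    """Combine binary masks of shape (n_models, H, W) by per-pixel majority."""
    masks = np.asarray(masks)
    votes = masks.sum(axis=0)
    # A pixel is labeled 1 only when strictly more than half of the models vote 1,
    # so ties (possible with an even number of models) resolve to 0.
    return (2 * votes > masks.shape[0]).astype(np.uint8)

rng = np.random.default_rng(0)
bags = bootstrap_indices(n_samples=5, n_bags=3, rng=rng)  # one index set per base model

# Binary predictions from three hypothetical base models on a 2x2 image.
masks = np.array([
    [[1, 0], [1, 1]],
    [[1, 0], [0, 1]],
    [[0, 0], [0, 1]],
])
fused = majority_vote(masks)
```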

Proposed model
We propose a calibrated bagging deep learning model to enhance generalization performance as well as reduce prediction uncertainty for COVID-19 semantic segmentation, which here means recognizing the lung region of CXR images. Fig 1 presents the flow for building the proposed approach. It includes three stages: 1) Training various state-of-the-art deep learning models such as UNet [32], PSPNet [31], and MobileNet [29] on identical training data for COVID-19 image segmentation, which differs from the standard bagging strategy of generating a bag of training sets from the original training data; 2) Estimating the calibration error (CE) of these models. First, COVID-19 semantic segmentation is performed on validation data by running these DL models to obtain prediction probabilities and accuracy. Then, CE, including ECE and MCE, is calculated to evaluate the uncertainties of these DL models; 3) Testing via weighted-voting bagging deep learning. We perform calibrated bagging prediction on testing data through weighted voting, where the weights are built from the CE of these DL models. The approach assumes that a lower CE means higher certainty, and DL models with higher certainty are assigned larger weights. Therefore, we define the weight of the i-th model to be inversely proportional to its calibration error, $w_i \propto 1/\mathrm{CE}_i$, where CE_i is the calibration error (ECE or MCE) of the i-th model on the validation data.
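The weighted-voting stage can be illustrated with a short sketch. Several details here are assumptions rather than statements about the original implementation: the weights are normalized to sum to 1, a small `eps` guards against a zero calibration error, the models' per-pixel lung probabilities (rather than hard labels) are fused, and the fused map is thresholded at 0.5. The function names are hypothetical.

```python
import numpy as np

def calibration_weights(calibration_errors, eps=1e-8):
    """Weights inversely proportional to each model's calibration error."""
    ce = np.asarray(calibration_errors, dtype=float)
    w = 1.0 / (ce + eps)   # eps (assumption) avoids division by zero
    return w / w.sum()     # normalization to sum 1 is an assumption

def weighted_vote(probs, weights, threshold=0.5):
    """Fuse per-model probabilities of shape (n_models, H, W) into one mask."""
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    fused = np.tensordot(weights, probs, axes=1)  # weighted sum over models -> (H, W)
    return (fused >= threshold).astype(np.uint8), fused

# Two hypothetical models with validation ECE of 0.01 and 0.03.
weights = calibration_weights([0.01, 0.03])
probs = np.array([
    [[0.9, 0.2]],   # better-calibrated model
    [[0.1, 0.6]],   # worse-calibrated model
])
mask, fused = weighted_vote(probs, weights)
```

In this toy case the better-calibrated model receives three times the weight of the other, so its prediction decides both pixels.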

Dataset
We employed the COVID-19 chest X-ray dataset (https://github.com/v7labs/covid-19-xraydataset) to validate the effectiveness of the proposed method. It includes 6,402 images of AP/PA chest X-rays/CT scans with pixel-level polygonal lung segmentations. Each image has a corresponding ground truth with two "Lung" segmentation masks (rendered as polygons, including the posterior region behind the heart), where the masks include most of the heart, revealing lung opacities behind the heart which may be relevant for assessing the severity of viral infection. Fig 3 shows one example of a CXR image and the corresponding ground truth. As the example shows, semantic segmentation on CXR images classifies the pixels of the original image into two classes: Lung (white region in the ground truth) and NonLung (black region in the ground truth). We split the dataset into training (70% of the data), validation (10%), and testing (20%) sets.
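A 70/10/20 split of this kind can be produced as in the sketch below. The text does not state how the split was drawn, so the random shuffling, the seed, and the helper name `split_indices` are assumptions for illustration only.

```python
import numpy as np

def split_indices(n_images, train_frac=0.7, val_frac=0.1, seed=0):
    """Shuffle image indices and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)   # fixed seed (assumption) for reproducibility
    order = rng.permutation(n_images)
    n_train = int(round(train_frac * n_images))
    n_val = int(round(val_frac * n_images))
    # Whatever remains after the train and validation cuts becomes the test set.
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(6402)
```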
We implemented two versions of the proposed approach including Ensemble (Weighted Voting (ECE), EECE) and Ensemble (Weighted Voting (MCE), EMCE). EECE is a weighted bagging learning method, where the weights are obtained by calculating expected calibration error (ECE). Similarly, EMCE is a weighted bagging learning method, where the weights are obtained by calculating maximum calibration error (MCE). Moreover, we combine the predictions of Ensemble (Majority Voting (MV), EMV), EECE, and EMCE by majority voting to build Ensemble (Majority Voting + ECE + MCE (MVEM), EMVEM).
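The final combination step, EMVEM, takes a majority vote over the three ensemble outputs. A minimal sketch, assuming each ensemble has already produced a binary lung mask of the same shape (the function name `emvem` and the toy masks are hypothetical):

```python
import numpy as np

def emvem(emv_mask, eece_mask, emce_mask):
    """Per-pixel majority vote over the EMV, EECE, and EMCE binary masks."""
    stacked = np.stack([emv_mask, eece_mask, emce_mask]).astype(np.uint8)
    # With three voters, a pixel is Lung when at least two of them agree on it.
    return (stacked.sum(axis=0) >= 2).astype(np.uint8)

emv = np.array([[1, 0], [1, 0]])
eece = np.array([[1, 1], [0, 0]])
emce = np.array([[0, 1], [0, 1]])
combined = emvem(emv, eece, emce)
```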

Evaluation metric
Various evaluation metrics are employed to evaluate the performance of our proposed model, including accuracy, F1score, sensitivity, and specificity. Accuracy is calculated by dividing the number of pixels identified correctly by the total number of pixels in chest X-ray images:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
F1score is the harmonic mean of precision and recall:
$$\mathrm{F1score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where Precision measures the capability of a model to report only correct pixels and Recall measures its ability to find all corresponding correct pixels:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
Here TP (True Positive) counts the pixels that match the annotated pixels of ROIs, FP (False Positive) counts the pixels that do not belong to ROIs but are recognized as pixels of ROIs, and FN (False Negative) counts the pixels of ROIs that are recognized as not belonging to ROIs. The main goal for binary classification is to improve the recall without hurting the precision. However, recall and precision goals are often conflicting, since increasing the true positives (TP) for the minority class (True) can also increase the number of false positives (FP), which reduces the precision [40]. Moreover, we employed sensitivity and specificity to evaluate the performance of semantic segmentation [41], where sensitivity measures how good a test is at detecting the ROIs while specificity refers to how good a test is at avoiding false alarms:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}$$
where TN (True Negative) counts the pixels that do not belong to ROIs and are recognized as not belonging to ROIs. Finally, we employ expected calibration error (ECE) and MCE (https://www.tensorflow.org/probability/api_docs/python/tfp/stats/expected_calibration_error) to measure the calibration errors [28] for evaluating the prediction uncertainty, where ECE and MCE are defined as equations (1) and (2), respectively. The lower ECE and MCE are, the higher the prediction certainty is.
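The pixel-level metrics follow directly from the four confusion counts, as the sketch below illustrates. The function name `segmentation_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for the example.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Pixel-wise metrics for binary masks (1 = Lung/ROI, 0 = NonLung)."""
    pred = np.asarray(pred).astype(bool).ravel()
    truth = np.asarray(truth).astype(bool).ravel()
    tp = int(np.sum(pred & truth))     # ROI pixels predicted as ROI
    tn = int(np.sum(~pred & ~truth))   # non-ROI pixels predicted as non-ROI
    fp = int(np.sum(pred & ~truth))    # non-ROI pixels predicted as ROI
    fn = int(np.sum(~pred & truth))    # ROI pixels predicted as non-ROI

    def ratio(num, den):
        return num / den if den else 0.0  # zero-denominator convention (assumption)

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)        # recall and sensitivity share one formula
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

metrics = segmentation_metrics(pred=[1, 1, 0, 0, 1, 0], truth=[1, 0, 0, 0, 1, 1])
```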

Experimental results
We validate the proposed method from two perspectives: comprehensive performance comparison between the baselines and the proposed method, and hyper-parameter examination. Table 2 presents the performance comparison between the state-of-the-art individual models and the proposed method in terms of various evaluation metrics and corresponding standard deviations. We can observe that these individual models can perform well on COVID-19 image segmentation regarding F1scores and Accuracy. Moreover, prediction uncertainties of most of them are promising with respect to ECE and MCE.

Performance comparison.
Among the individual models, FCN32_ResNet50 outperforms the others with higher certainty. In addition, as one baseline, EMV performs better than the individual models and has the highest prediction certainty among them in terms of ECE and MCE. This means that combining the predictions of these individual models can effectively improve performance and prediction certainty with regard to F1score and ECE.
For the proposed method, EECE performs better than the baselines, including the individual models and EMV, in terms of accuracy, recall, and F1score. Moreover, EECE improves the prediction certainty. This means that using appropriate calibration errors as weights in weighted bagging deep learning can effectively improve prediction certainty as well as performance. In other words, combining predictions with calibration-error-based weights is an effective way to calibrate models. Furthermore, EMVEM obtains the best performance with the highest prediction certainty, which indicates that an ensemble strategy such as majority voting is effective for combining predictions to further improve performance and prediction certainty. Moreover, EMVEM is more stable, since the standard deviations of its performance and calibration errors are lower than those of the baselines.
In addition to the performance comparison, we show an example of prediction visualization on semantic segmentation generated by the baselines and the proposed models. When we examine the prediction visualization for the individual models, we can observe that they miss some key components (yellow regions) for detecting the lung. Taking UNet as an example, comparing its predictions with the ground truth shows that the key components highlighted in yellow are missed in subfigure (g). In contrast, ensemble models such as EMV, EECE, and EMVEM perform better in this respect, since the yellow regions in their predictions are smaller, and the proposed EECE and EMVEM outperform the other baselines. This means that the proposed method can effectively improve recall in detecting the lung by distributing the contributions of individual predictions based on calibration errors such as ECE and MCE.

Hyper-parameter examination.
Fine-tuning hyper-parameters is an imperative step in building deep learning models with optimal performance. The process of building the proposed method involves various hyper-parameters. For example, for each individual DL model, we have to fine-tune the learning rate, batch size, and number of epochs to achieve optimal performance. Specifically, for the proposed bagging deep learning, how many individual models to involve is still an open question. Here, we examine whether the number of individual models significantly affects the performance of the proposed method. Table 3 presents the performance comparison for various bagging deep learning models built with different numbers of individual models. Generally speaking, more individual models enhance performance and improve prediction certainty regarding F1score and ECE. When we employ five individual models (Ensemble 5 (FCN32_RESNET50 + FCN32 + UNet + FCN32_MOBILENET + PSPNet)), we obtain the optimal performance and the highest prediction certainty regarding the values of accuracy, F1score, and ECE for EECE and EMVEM, where the values of F1score are improved more significantly than the other evaluation metrics.
Additionally, Fig 5 shows a comparison of the prediction visualizations produced by the proposed methods built with different numbers of individual models. It is observed that involving more individual models in the proposed approach reduces the size of the missing components. Moreover, EMVEM outperforms the other ensemble methods, which means that majority voting based on more individual DL models can further enhance the recognition of ROIs. In summary, based on the observations above, the proposed method can effectively improve semantic segmentation as well as reduce the prediction uncertainty by using the calibration errors as weights of the DL models to combine their predictions. Moreover, involving more individual DL models in the proposed approach can further enhance the performance and prediction certainty, which matches the intuition of majority voting for bagging deep learning. To some extent, it is an effective way to combine the advantages of these individual DL models to improve task performance without complex implementations.

Related work
This paper aims to build a novel bagging learning method to implement COVID-19 semantic segmentation by combining bagging deep learning and model calibration. Semantic segmentation has achieved significant successes through deep learning models such as U-Net [32] and V-Net [42]. In the biomedical domain, there have been numerous techniques for lung segmentation with different purposes [43,44]. The U-Net is an effective technique for segmenting both lung regions and lung lesions in COVID applications [45]. The U-Net, built with a fully convolutional network [32], has a U-shape architecture with two symmetric paths: an encoding path and a decoding path. The layers at the same level in the two paths are connected by shortcut connections, which helps to learn better visual semantics as well as detailed context. Zhou et al. [46] proposed the UNet++, which inserts a nested convolutional structure between the encoding and decoding paths. In addition, Milletari et al. [42] built V-Net using residual blocks as the basic convolutional block, and optimized the network with a Dice loss. Furthermore, Shan et al. [47] built VB-Net for more efficient segmentation by equipping the convolutional blocks with so-called bottleneck blocks. Moreover, U-Net and its variants have achieved reasonable segmentation results in COVID-19 diagnosis [48]. In recent years, attention mechanisms have been used to learn the most discriminative parts of the features in deep learning models. Oktay et al. [49] proposed an Attention U-Net to capture fine structures in medical images, making it suitable for segmenting lesions and lung nodules in COVID-19 applications.
Safety-critical applications like medical image processing [50], autonomous driving [51], and precipitation forecasting [52] require not only high accuracy but also low prediction uncertainty, as measured by model calibration. Two categories of methods have been proposed to estimate it, namely Bayesian-based and non-Bayesian-based methods. Bayesian-based methods refer to Bayesian neural networks that estimate prediction/model uncertainty based on a Bayesian process. The main concerns with such methods are their high computational complexity and the prior assumptions on model weights. To reduce the computational complexity and enhance the scalability of Bayesian neural networks on larger datasets, Hernández-Lobato et al. [53] proposed probabilistic back-propagation for learning Bayesian neural networks. Non-Bayesian-based methods develop various strategies such as model ensembles [54] and prior assumptions on predictions [55] to estimate the prediction uncertainty at a lower cost. To reduce computation cost and training difficulty, Lakshminarayanan et al. [54] proposed deep ensembles, which are simple to implement, can be trained in parallel, require less hyper-parameter tuning, and yield high-quality predictive uncertainty estimates. However, it is difficult to obtain the optimal number of individual models for building a deep ensemble across applications. Moreover, to reduce the memory and inference cost of Bayesian neural networks and deep ensembles, Liu et al. [56] proposed estimating uncertainty with a single neural network in two steps: 1) measuring the distance between testing samples and training samples; 2) implementing a spectral-normalized neural Gaussian process (SNGP), which improves the measurement of the distance by adding a weight normalization step during training and replacing the output layer with a Gaussian process.
However, experimental results on dialog intent detection indicated that deep ensembles performed better than SNGP on many evaluation metrics such as accuracy. Recently, Wilson et al. [57] systematically summarized Bayesian deep learning and claimed that deep ensembles can be treated as approximate Bayesian marginalization of model parameters. On the other hand, they also claimed that Bayesian methods are not perfect with regard to prior assumptions on model weights.
Building on this previous work on model calibration and semantic segmentation, we propose the calibrated ensemble model to not only enhance performance on semantic segmentation, but also reduce the prediction uncertainty.

Conclusion and future work
In this paper, a novel bagging deep learning model is proposed for COVID-19 image segmentation on chest x-ray images. It combines the model calibration and traditional bagging learning to not only enhance the segmentation performance, but also improve the prediction certainty that is extremely important to high-risk applications in biomedical domain. We validate the proposed method on a large chest x-ray dataset that is associated with COVID-19.
Experimental results demonstrate that the proposed model could recognize the lung region more effectively through comparing with state-of-the-art baselines. For the future work, we plan to extend the proposed model for building an end-to-end model for both COVID-19 image classification and image segmentation.