Analyzing inter-reader variability affecting deep ensemble learning for COVID-19 detection in chest radiographs

doi:10.1371/journal.pone.0242301

Table 1.

Demographic study.

Fig 1.

The architecture of the custom U-Net with dropout and its performance curves.

Fig 2.

Segmentation workflow showing U-Net-based mask generation and lung ROI cropping.

Fig 3.

The workflow of the proposed repeated CXR-specific pretraining and fine-tuning.

Table 2.

Datasets and their distribution used in various stages of learning.

Fig 4.

A custom wide residual network (WRN) with dropout regularization.

Fig 5.

The architecture of the CNNs used in the first stage of repeated CXR-specific pretraining.

I/P = Input, I-PCNN = truncated ImageNet-pretrained CNNs, ZP = Zero-padding, CONV = Extra convolution layer, GAP = Global Average Pooling, DO = Dropout, D = Final dense layer with Softmax activation.

Fig 6.

The architecture of the CNNs used in the second stage of pretraining.

I/P = Input, CXR-Pre-CNN = CXR-specific CNNs from the first stage of pretraining, truncated at their deepest convolutional layer, GAP = Global Average Pooling, DO = Dropout, D = Final dense layer with Softmax activation.

Fig 7.

The architecture of the CNNs fine-tuned toward COVID-19 detection.

I/P = Input, CXR-Pre-CNN = CXR-pretrained CNNs from the second stage of pretraining, truncated at their deepest convolutional layer, GAP = Global Average Pooling, DO = Dropout, D = Final dense layer with Softmax activation.

Fig 8.

Examples showing inter-reader variability in annotating COVID-19 disease ROI.

(A) and (B) show the annotations (bounding boxes in blue) of Rad-1 and Rad-2, respectively, for a given image labeled as COVID-19; (C) and (D) show the ground truth (GT) annotations of Rad-1 and Rad-2, respectively, for another image labeled as COVID-19.

Table 3.

Performance metrics achieved during the first stage of CXR-specific pretraining.

Fig 9.

Performance achieved using the VGG-19 model during the first stage of CXR-specific pretraining.

(A) Confusion matrix; (B) ROC curves; (C) Normalized Sankey flow diagram.

Table 4.

Performance metrics achieved by the models during the second stage of CXR-specific pretraining.

Fig 10.

Performance achieved using the DenseNet-121 model during the second stage of CXR-specific pretraining.

(A) Confusion matrix; (B) ROC curves; (C) Normalized Sankey flow diagram.

Table 5.

Performance metrics achieved by fine-tuning the second-stage pretrained models for COVID-19 detection.

Fig 11.

Performance achieved using the ResNet-18 model during fine-tuning for COVID-19 detection.

(A) Confusion matrix; (B) ROC curves; (C) Normalized Sankey flow diagram.

Table 6.

Performance metrics achieved by fine-tuning the second-stage pretrained models for COVID-19 detection, compared with the baseline.

Fig 12.

CRM-based localization of the COVID-19 viral disease ROI achieved using the fine-tuned models and their baseline counterparts.

(A) Original CXR with STAPLE-generated consensus ROI (shown as blue box ROI); (B) Baseline VGG-16; (C) Baseline VGG-19; (D) Baseline MobileNet-V2; (E) Baseline ResNet-18; (F) Baseline Inception-V3; (G) Fine-tuned VGG-16; (H) Fine-tuned VGG-19; (I) Fine-tuned MobileNet-V2; (J) Fine-tuned ResNet-18; (K) Fine-tuned Inception-V3.

Fig 13.

Performance achieved through weighted averaging of the top-3 fine-tuned CNNs toward COVID-19 detection.

(A) Confusion matrix; (B) ROC curves; (C) Normalized Sankey flow diagram.
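
As a rough illustration of what weighted averaging of model outputs can look like, the sketch below combines per-model class probabilities using normalized weights. The function name, the array layout, and the example weights are assumptions made here for illustration; they are not taken from the article.

```python
import numpy as np

def weighted_average_ensemble(probs, weights):
    """Combine per-model class probabilities with a weighted average.

    probs:   array-like of shape (n_models, n_samples, n_classes)
    weights: array-like of shape (n_models,); normalized to sum to 1 here
    (both the layout and the weights are illustrative assumptions)
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Contract the model axis: result has shape (n_samples, n_classes).
    return np.tensordot(weights, probs, axes=1)

# Hypothetical softmax outputs from three models for two images.
example_probs = [
    [[0.9, 0.1], [0.3, 0.7]],
    [[0.6, 0.4], [0.4, 0.6]],
    [[0.2, 0.8], [0.1, 0.9]],
]
example_weights = [0.5, 0.3, 0.2]  # hypothetical weights
ensemble_probs = weighted_average_ensemble(example_probs, example_weights)
predicted_classes = ensemble_probs.argmax(axis=1)
```

Because the weights are normalized, each row of the combined output still sums to 1 when every model's row does.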

Table 7.

Performance achieved with an ensemble of top-3, top-5, and top-7 fine-tuned models toward COVID-19 detection.

Table 8.

Performance achieved in terms of CRM-based IoU and mAP values by the individual fine-tuned CNNs using the radiologists’ annotations and STAPLE-generated ROI consensus annotation.
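
For readers unfamiliar with IoU (intersection over union), the following minimal sketch computes it for two axis-aligned bounding boxes. The box format `(x_min, y_min, x_max, y_max)` and the function name are assumptions for illustration; the article's own implementation is not shown here.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this format is an assumption.
    """
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A predicted box can then be counted as a match for a reference box when `box_iou` meets a chosen threshold.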

Table 9.

IoU and mAP values obtained with top-3, top-5, and top-7 ensembles using annotations of Rad-1, Rad-2, and STAPLE-generated consensus ROI annotations.

Fig 14.

Sample CXRs from two different patients (rows A-D and E-H, respectively) showing the generated ROI annotations.

(A) and (E) Rad-1 (in blue); (B) and (F) Rad-2 (in green); (C) and (G) Top-3 ensemble using STAPLE-generated consensus ROI (program) (in yellow); (D) and (H) STAPLE-generated consensus ROI annotation (in red).

Fig 15.

Instances of ensemble CRMs combining top-N ensemble ROI predictions.

(A) top-3 CNNs using STAPLE-generated consensus ROI annotation; (B) top-5 CNNs using Rad-2 annotations. The green box denotes reference ROI annotation and the blue box denotes ensemble CRM localization.

Fig 16.

Statistical analyses.

(A) Mean plot for the mAP scores obtained by the top-N ensembles using Rad-1, Rad-2, and STAPLE-generated consensus ROI annotations; error bars represent standard errors, and the differences are not statistically significant; (B) Residual plot showing that the data follow a normal distribution.

Table 10.

Consolidated results of Shapiro–Wilk, Levene, and one-way ANOVA analyses.

Fig 17.

Assessing inter-reader variability and program performance.

The following performance metrics are measured and plotted for 10 different IoU thresholds in the range (0.1–0.7): (A) Kappa statistic; (B) Sensitivity; (C) Specificity; (D) PPV.
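
As background on the Kappa statistic in panel (A), the sketch below computes Cohen's kappa for two raters who each assign one label per item. How items are labeled at each IoU threshold is not described in this listing, so the input format and function name are assumptions for illustration only.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    labels_a, labels_b: equal-length sequences of hashable labels
    (for example 1 = detected, 0 = not detected; an assumed encoding).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )

    if p_expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance.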

Table 11.

Performance level assessment and inter-reader variability analysis using STAPLE-generated consensus ROI.
