Table 1.

Comparative summary of prior studies and the proposed DL framework. CI: Confidence Interval; pn: pneumonia; CV: cross-validation.

Fig 1.

Diagram illustrating the patient cohort included in this study, showing the selection process for COVID-19, non-COVID-19 viral pneumonia, and normal cases.

Fig 2.

Examples of CXR severity score evaluations by two radiologists.

The black vertical and horizontal lines indicate the four lung zones. Red circles highlight lung abnormalities. (A) CXR shows a mild reticular pattern in both upper lobes with an extent of <24% and ground glass opacities (GGO) in both lower lobes with an extent of 25−49% on each lobe. A total score of 10 was assigned by both readers. This patient was discharged home from the emergency department. (B) CXR shows more involvement of the lower lobes (75−100%), with consolidation in the left lower lobe. A total score of 34 was assigned by both readers. The patient died 23 days after testing positive for COVID-19.

Fig 3.

The proposed DL system pipeline.

(A) Thorax segmentation module: predicts a thorax mask for each CXR using a multi-task learning approach. The masks are used to extract the thorax regions from the original CXRs, which serve as inputs to the subsequent modules. (B) Diagnosis module. Stage I: distinguishes viral pneumonia CXRs from normal ones. Stage II: classifies cases predicted as viral pneumonia in stage I as either COVID-19 or other viral pneumonia. (C) Severity scoring module: estimates disease severity from the input CXRs.
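The two-stage routing described for the diagnosis module can be sketched as a small decision function. This is a minimal illustration, not the study's implementation: the function name, probability inputs, and the 0.5 threshold are all assumptions.

```python
# Hypothetical sketch of the two-stage diagnosis routing (module B).
# `p_pneumonia` and `p_covid` stand in for the stage I and stage II
# classifier probabilities; the 0.5 threshold is illustrative only.

def two_stage_diagnosis(p_pneumonia: float, p_covid: float,
                        threshold: float = 0.5) -> str:
    """Route a CXR through stage I (pneumonia vs. normal); only cases
    flagged as pneumonia proceed to stage II (COVID-19 vs. other)."""
    if p_pneumonia < threshold:      # stage I: looks normal, stop here
        return "normal"
    if p_covid >= threshold:         # stage II: COVID-19 positive
        return "covid-19"
    return "other viral pneumonia"
```

A case never reaches the COVID-19 vs. other-pneumonia decision unless stage I first flags it as pneumonia, which mirrors the cascade in panel (B).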

Fig 4.

Hierarchical transfer learning process.

Step I: Pre-training on a public COVID-19 segmentation dataset. Step II: Fine-tuning using a subset of 80 COVID-19 CXRs from our in-house dataset. The pre-trained encoder was used as the foundation of the diagnosis and severity scoring modules.

Table 2.

Demographic characteristics of the study population across diagnostic groups, including the number of CXRs, mean patient age ± standard deviation, and percentage of female patients in each group.

Table 3.

Number of cases by non-SARS-CoV-2 virus type.

Fig 5.

Number of male and female patients across the three diagnostic groups.

Fig 6.

Age distribution of patients across the three diagnostic groups.

Fig 7.

The distribution of severity scores in our COVID-19 CXRs based on the average evaluations of four radiologists.

Table 4.

Performance metrics for hyperparameter tuning of the diagnostic module. Results are reported as mean ± standard deviation over three runs. “*” indicates experiments repeated with the same data split, while “**” indicates runs with three randomized train/test splits.

Table 5.

Performance comparison of U-Net models with different backbone architectures for thorax segmentation. Results are reported as mean ± standard deviation of Dice and IoU scores across four randomized splits. MT: multi-task.

Fig 8.

Examples of thorax segmentation results for COVID-19 and other viral pneumonia CXRs on the held-out test set, with columns showing the original chest X-ray, the ground-truth mask, the predicted thorax mask, and the predicted mask overlaid on the original image.

Table 6.

The performance of individual radiologists, their consensus, and the DL model in the first diagnostic scenario (excluding COVID-19 CXRs with a severity score of zero as assessed by at least three radiologists).

Fig 9.

Receiver operating characteristic (ROC) curves of the two-stage DL diagnostic system across five-fold cross-validation.

(a) ROC curves of stage one (diagnosis of pneumonia CXRs vs. normal CXRs). (b) ROC curves of stage two (diagnosis of COVID-19 CXRs vs. other viral pneumonia CXRs). AUC (Area Under the Curve) quantifies the overall diagnostic performance for each stage.
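The per-stage AUC reported with the ROC curves can be computed directly from model scores via the rank-based (Mann-Whitney) formulation: AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch, with illustrative data rather than the study's:

```python
# Rank-based AUC: the fraction of (positive, negative) score pairs in
# which the positive case outranks the negative one (ties count 0.5).
# Labels and scores below are illustrative, not from the study.

def auc_score(labels, scores):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` yields 0.75, since three of the four positive/negative pairs are correctly ordered.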

Table 7.

F1-scores and accuracy (95% CI) for individual radiologists, consensus, and the DL model.

Fig 10.

Frontal CXRs and corresponding Grad-CAM heatmaps for the two-stage DL diagnosis system.

Top row: stage one (diagnosis of viral pneumonia vs. normal CXRs); (a) normal CXR (P = 0.24), (b) other viral pneumonia CXR (P = 0.99). Bottom row: stage two (diagnosis of COVID-19 vs. other viral pneumonia); (c) other viral pneumonia CXR (P = 0.03). (d) COVID-19 CXR (P = 0.99). P denotes the probability assigned by the DL model that a test CXR belongs to the target class.

Table 8.

The performance of individual radiologists, their consensus, and the DL model in the second diagnostic scenario (including all 808 COVID-19 cases).

Table 9.

COVID-19 sensitivity (%) of the proposed DL diagnostic model compared to four individual radiologists and their consensus, stratified by severity level. Severity was categorized by radiographic severity score: low (SI ≤ 8), medium (8 < SI ≤ 16), and severe (SI > 16). Sensitivity is reported as mean ± standard deviation across eight randomized test splits.
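The severity stratification used here can be expressed as a simple bucketing rule. This sketch assumes the half-open bands low (SI ≤ 8), medium (8 < SI ≤ 16), and severe (SI > 16), consistent with the ranges given for Fig 11, so every score falls in exactly one group; the function name is hypothetical.

```python
# Illustrative assignment of a severity score (SI) to one of the three
# strata used for the sensitivity breakdown. Band edges are assumptions
# reconciling the overlapping ranges quoted in the text.

def severity_group(si: float) -> str:
    if si <= 8:
        return "low"
    if si <= 16:
        return "medium"
    return "severe"
```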

Table 10.

Performance comparison across architectures for differentiating COVID-19 from other viral pneumonia and normal cases in the first diagnostic scenario. Results are reported as mean ± SD over five-fold cross-validation; the best results are highlighted in bold.

Table 11.

Design of experiments. TL: Transfer Learning.

Table 12.

Ablation study in the first diagnostic scenario (excluding COVID-19 CXRs with no evidence of infection), using five-fold cross-validation.

Table 13.

Ablation study in the second diagnostic scenario (including all 808 COVID-19 CXRs), using five-fold cross-validation. In all experiments, the segmented thorax regions were used as the model input.

Table 14.

External validation results of the diagnostic model on the COVIDGR-1.0 dataset, compared with previously published models (COVIDNet-CXR and COVID-CAPS). Performance metrics include class-wise recall, precision, F1-score, and overall accuracy.

Table 15.

Evaluation of the severity scoring model's robustness to inter-observer variability. The model was trained using the average scores from four radiologists. During testing, one radiologist was excluded at a time, and performance was compared against the remaining consensus. Results include the overall Pearson correlation and MAE, along with subgroup MAEs across three severity levels: Group 1 (SI ≤ 8), Group 2 (9 ≤ SI < 16), and Group 3 (SI ≥ 16). Pearson Co.: Pearson correlation; MAE: mean absolute error; SI: severity index.
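The two overall agreement metrics named here, Pearson correlation and mean absolute error between predicted and consensus scores, can be sketched in a few lines. The inputs are illustrative; the function names are assumptions, not the study's code.

```python
# Minimal agreement metrics between model-predicted severity scores and a
# radiologist consensus: mean absolute error and Pearson correlation.
import math

def mae(pred, ref):
    """Mean absolute error between paired score lists."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)

def pearson(pred, ref):
    """Pearson correlation coefficient between paired score lists."""
    n = len(pred)
    mp, mr = sum(pred) / n, sum(ref) / n
    cov = sum((p - mp) * (r - mr) for p, r in zip(pred, ref))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sr = math.sqrt(sum((r - mr) ** 2 for r in ref))
    return cov / (sp * sr)
```

Note that the two metrics capture different failure modes: a model that is perfectly correlated with the consensus can still carry a constant offset, which shows up in the MAE but not in the Pearson correlation.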

Fig 11.

Representative examples of COVID-19 CXRs at varying severity levels alongside corresponding model attention maps.

The first row shows thoracic regions extracted from the CXRs, while the second row presents Grad-CAM heatmaps highlighting areas that most influenced the model’s severity predictions. From left to right: a low severity case (SI ≤ 8; GT = 0; MP = 2.4), a medium severity case (8 < SI ≤ 16; GT = 9; MP = 9.5), and a high severity case (SI > 16; GT = 31; MP = 28.1). The heatmaps demonstrate an increasing focus on widespread pathological regions with rising severity, indicating that the model’s attention aligns with clinical patterns of disease progression. (SI: severity index; GT: ground-truth severity score; MP: model prediction severity score).

Fig 12.

DL-predicted severity scores versus consensus radiologists' severity scores.

Consensus radiologists’ severity scores are the average of severity scores evaluated by four radiologists.

Fig 13.

External validation of the DL severity scoring module on 86 COVID-19 CXRs from a publicly available dataset.

Table 16.

Summary of reported AUC values from previous studies and the proposed model across different CXR-based pneumonia detection tasks. All studies had close collaboration with radiologists for data interpretation and validation.

Table 17.

Comparison of severity scoring performance across prior studies and the proposed severity scoring module. MAE: Mean Absolute Error.
