Encoder-decoder models for chest X-ray report generation perform no better than unconditioned baselines

doi:10.1371/journal.pone.0259639

Fig 1.

Example of impression and findings sections of a radiology report (Indiana U. Chest X-ray dataset).

Occurrences of "xxxx" mark keywords that were removed during de-identification.

Fig 2.

An encoder-decoder model.

A CNN is used as the encoder and an LSTM as the decoder.

Fig 3.

Illustration of unconditioned Baseline 1, which optimizes the BLEU-4 score.

Fig 4.

Illustration of unconditioned Baseline 2, which optimizes BLEU-1 and BLEU-2.

Table 1.

Average performance results on the IU chest X-ray dataset over test sets.

Table 2.

Average performance results on the IU chest X-ray dataset over test sets.

Table 3.

Results on the IU chest X-ray dataset of permuted models.

Table 4.

Average validation scores for abnormal and normal reports on the IU chest X-ray dataset (Impressions and Findings).

Table 5.

Average performance results over test sets on the MIMIC-CXR dataset using NLP validation metrics.

Fig 5.

Typical reports generated by Baseline 1 and 2 (over one train/test split).

Fig 6.

Example of reports generated with encoder-decoder methods.

We show SA&T, MRA, and CDGPT2 outputs for an input image in the IU chest X-ray test set, and the reference ground truth report.

Fig 7.

Number of unique n-grams in reports generated by SA&T, MRA, CDGPT2 and in ground truth reports for the IU chest X-ray data.

Although MRA generates longer reports than SA&T and CDGPT2, its reports are more similar to one another. The average report length is 32 for SA&T, 32 for CDGPT2, and 47 for MRA.
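
As a rough illustration of the quantity plotted in Fig 7, the sketch below counts distinct word n-grams across a set of reports. The function name `unique_ngrams`, the lowercase whitespace tokenization, and the two toy reports are assumptions made for this example only; the tokenization used in the paper is not stated here.

```python
def unique_ngrams(reports, n):
    """Return the set of distinct word n-grams found across all reports."""
    seen = set()
    for report in reports:
        # Assumption: lowercase and split on whitespace.
        tokens = report.lower().split()
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
    return seen


# Hypothetical toy reports, not taken from the dataset.
reports = [
    "no acute cardiopulmonary abnormality",
    "no acute cardiopulmonary disease",
]

# Number of unique n-grams for n = 1, 2, 3.
counts = {n: len(unique_ngrams(reports, n)) for n in (1, 2, 3)}
print(counts)
```

A model that keeps emitting near-identical reports yields a small count that barely grows as more reports are added, which is the kind of behavior such a figure can reveal.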

Fig 8.

Bootstrap distributions of BLEU-1 values for various pairs of methods.

The 95% confidence intervals are shown in light yellow.
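
For readers unfamiliar with bootstrap comparisons, here is a minimal sketch of a paired percentile bootstrap for the difference in mean score between two methods. It is not necessarily the procedure used in the paper: the function name `bootstrap_ci`, the use of per-report scores averaged over a resample, the number of resamples, and the placeholder score lists are all assumptions made for illustration.

```python
import random


def bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    The two lists are assumed to be paired: index i holds the score of
    each method on the same test report.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired and equally long")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample report indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        diffs.append(mean_a - mean_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder per-report scores, invented for this example.
scores_a = [0.31, 0.28, 0.35, 0.30, 0.27, 0.33]
scores_b = [0.29, 0.30, 0.31, 0.28, 0.29, 0.30]
lower, upper = bootstrap_ci(scores_a, scores_b)
print(f"95% CI for the mean difference: [{lower:.3f}, {upper:.3f}]")
```

If the resulting interval contains zero, this kind of analysis gives no evidence that one method scores higher than the other.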
