Fig 1.
Example of impression and findings sections of a radiology report (Indiana U. Chest X-ray dataset).
Each "xxxx" marks a keyword removed during de-identification.
Fig 2.
A CNN is used as encoder and an LSTM as decoder.
Fig 3.
Illustration of unconditioned Baseline 1, which optimizes the BLEU-4 score.
Fig 4.
Illustration of unconditioned Baseline 2, which optimizes BLEU-1 and BLEU-2.
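For reference, the BLEU-N score that the two unconditioned baselines optimize can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes a single reference report, whitespace tokenization, uniform n-gram weights, and no smoothing, and the function names `ngram_counts` and `bleu` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n):
    """Sentence-level BLEU-N against one reference (no smoothing)."""
    cand = candidate.split()  # assumption: whitespace tokenization
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions with uniform weights.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

Under these assumptions, `bleu(candidate, reference, 4)` corresponds to BLEU-4 and `bleu(candidate, reference, 1)` to BLEU-1. Because an unconditioned baseline ignores the input image, it returns the same text for every case, which is why a single fixed report can be chosen to score well on average.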
Table 1.
Average performance results on the IU chest X-ray dataset over test sets.
Table 2.
Average performance results on the IU chest X-ray dataset over test sets.
Table 3.
Results on the IU chest X-ray dataset of permuted models.
Table 4.
Average validation scores for abnormal and normal reports on the IU chest X-ray dataset (Impressions and Findings).
Table 5.
Average performance results over test sets on the MIMIC-CXR dataset using NLP validation metrics.
Fig 5.
Typical reports generated by Baselines 1 and 2 (over one train/test split).
Fig 6.
Example of reports generated with encoder-decoder methods.
We show SA&T, MRA, and CDGPT2 outputs for an input image in the IU chest X-ray test set, and the reference ground truth report.
Fig 7.
Number of unique n-grams in reports generated by SA&T, MRA, CDGPT2 and in ground truth reports for the IU chest X-ray data.
Although MRA generates longer reports than SA&T and CDGPT2, its reports are more similar to one another. The average report length is 32 for SA&T, 32 for CDGPT2, and 47 for MRA.
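The unique n-gram count used as a diversity measure here can be sketched as below. This is an illustrative version only: it assumes lowercased, whitespace-tokenized reports, the helper name `unique_ngrams` is hypothetical, and the toy reports are invented for the example rather than taken from the dataset.

```python
from typing import Iterable, Set, Tuple


def unique_ngrams(reports: Iterable[str], n: int) -> Set[Tuple[str, ...]]:
    """Collect the distinct word n-grams that occur anywhere in `reports`."""
    seen: Set[Tuple[str, ...]] = set()
    for report in reports:
        tokens = report.lower().split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
    return seen


# Hypothetical toy reports, for illustration only.
toy_reports = [
    "no acute cardiopulmonary abnormality",
    "no acute cardiopulmonary disease",
    "heart size is normal",
]

# Number of distinct n-grams for n = 1, 2, 3.
counts = {n: len(unique_ngrams(toy_reports, n)) for n in (1, 2, 3)}
print(counts)  # {1: 9, 2: 7, 3: 5}
```

A model whose reports reuse the same phrases produces few distinct n-grams even when each report is long, which is how longer reports can still be more similar to one another.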
Fig 8.
Bootstrap distributions of BLEU-1 values for various pairs of methods.
95% confidence intervals are shown in light yellow.
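One common way to obtain such a distribution and interval is a percentile bootstrap, sketched below. The source does not state the exact procedure, so this rests on several assumptions: per-report BLEU-1 scores are available, the statistic is their mean, and the interval is taken from percentiles of the resampled means. The names `bootstrap_means` and `percentile_interval` are hypothetical.

```python
import random
from statistics import mean


def bootstrap_means(scores, n_resamples=2000, seed=0):
    """Resample `scores` with replacement and return the mean of each resample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    return [mean(rng.choices(scores, k=n)) for _ in range(n_resamples)]


def percentile_interval(samples, level=0.95):
    """Percentile bootstrap interval covering `level` of the resampled values."""
    ordered = sorted(samples)
    alpha = (1.0 - level) / 2.0
    lower = ordered[int(alpha * (len(ordered) - 1))]
    upper = ordered[int((1.0 - alpha) * (len(ordered) - 1))]
    return lower, upper
```

Running `bootstrap_means` on the per-report scores of each method in a pair gives two distributions that can be plotted together, with `percentile_interval` supplying the 95% bounds for each. Note that corpus-level BLEU is not the mean of sentence-level scores, so a faithful reproduction would recompute the corpus score on every resample.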