Fig 1.
Encoder-decoder model for image captioning.
Fig 2.
Overview of the CNN-LSTM-based image captioning architecture.
Fig 3.
Taxonomy of deep learning-based image captioning approaches.
Table 1.
Comparison of vanilla transformer-based image captioning models.
Table 2.
Comparison of Vision-Language Pre-training (VLP) models for image captioning.
Fig 4.
Architectural comparison between BLIP-2, CoCa, Flamingo, and the proposed gated cross-attention fusion model.
Fig 5.
Comparison of common datasets for image captioning.
Table 3.
Common datasets for image captioning.
Fig 6.
Dataset size and captions per image.
Fig 7.
Comparison of caption density per image in benchmark image captioning datasets.
Fig 8.
Distribution of specialized datasets in image captioning research.
Fig 9.
Comparison of image quality challenge levels in captioning benchmarks.
Table 4.
Baseline model CIDEr performance and SCST support status.
Table 5.
Summary of model configuration and training setup.
Fig 10.
Normalized model performance evaluation metrics.
Table 6.
Common evaluation metrics for image captioning.
Fig 11.
Component intensity scores of image captioning models.
Fig 12.
Performance comparison of image captioning models across multiple metrics.
Table 7.
Comparison of methods with image captioning models across multiple datasets.
Table 8.
Proposed methodology results.
Fig 13.
Training and validation loss trends for image captioning models.