
UICD: A new dataset and approach for Urdu image captioning

Abstract

Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, although Urdu is a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study fills this gap by creating a new Urdu image captioning dataset, UC-23-RY, containing 158,915 Urdu captions derived from the Flickr30k dataset. Additionally, it proposes deep learning architectures designed specifically for Urdu image captioning, namely NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84, respectively, as demonstrated through an evaluation assessing each model's impact on caption quality. The study thus provides a useful dataset and shows how well-suited sophisticated deep learning models are for improving automatic Urdu image captioning.

Introduction

At the intersection of natural language processing and computer vision is the dynamic field of image captioning, which focuses on generating descriptive text for images. Computer vision involves using algorithms and models to interpret and analyze visual information, transforming it into meaningful data. This field encompasses various techniques, such as vehicle identification [1], text segmentation [2], text recognition [3,4], and human detection [5], which are crucial for extracting detailed features from images. These techniques contribute to the process of converting visual content into vector representations. Following this, image captioning uses a natural language processing (NLP) model to decode these vectors and generate textual descriptions of the images, as illustrated in Fig 1. It involves training a machine to look at images and generate textual descriptions of their content, just as humans do. This approach ensures that the captions accurately reflect the content and context of the images.

Fig 1. The generic architecture for automatic captioning illustrates how the encoder merges images and vectors, and how the decoder generates sequence predictions.

https://doi.org/10.1371/journal.pone.0320701.g001

This emerging field has seen significant research efforts tailored to different regional languages like Arabic [6], Chinese [7], Uyghur [8], Hindi [9], and English [10–13]. Researchers working on these languages developed their corpora or datasets in their regional languages, either from scratch or from existing datasets like Flickr8k [12], Flickr30k [14], and MS COCO [10]. Most research in this field has focused on English, with limited attention to other languages like Urdu. To bridge this gap, our work develops models specifically for Urdu, addressing its unique linguistic challenges. A sample from Flickr30k is shown in Fig 2.

Fig 2. Captions from the Flickr30k dataset: English (3rd row) and Urdu (2nd row, ours).

https://doi.org/10.1371/journal.pone.0320701.g002

Urdu is the national language of Pakistan and is widely spoken and understood in Pakistan, India, Nepal, and Bangladesh. According to the 2022 edition of Ethnologue [16], with almost 230 million speakers, Urdu is the tenth most spoken language in the world. Additionally, speakers of Urdu can be found in the Middle East, Europe, India, Australia, and the United States. The Urdu script employs the Nastaliq style, a calligraphic writing system derived from Arabic and Persian scripts, known for its distinct aesthetic and flowing structure. Linguistically, Urdu follows the abjad system, in which consonants and long vowels are explicitly written while short vowels are typically omitted. Nastaliq is a visual representation style, while abjad refers to the phonetic writing system. This combination affects tokenization and preprocessing in NLP, as the omission of short vowels can lead to ambiguities. Additionally, Urdu is bidirectional: numbers are written left-to-right and text right-to-left, with letters changing shape based on their position (initial, medial, final, or isolated). These features highlight the script's unique linguistic and structural complexity [17]. Understanding them is essential for developing robust preprocessing methods and ensuring that automatic translation and captioning respect the nuances of the Urdu language.

The BLEU score is a widely used metric for evaluating caption generation models. It evaluates the quality of generated captions by comparing them to reference captions using BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores [18,19]. This technology has numerous potential applications, including making images and videos more accessible to people with visual impairments [20], improving image search and retrieval [21], and enhancing social media [22] and e-commerce platforms [23]. Recent developments in deep learning [24], particularly the use of Convolutional Neural Networks (CNNs) [25] and Recurrent Neural Networks (RNNs) [26], have produced notable advancements in the precision and fluency of machine-generated image captions. However, the challenges of generating captions that are diverse, creative, and contextually relevant remain, and researchers are actively working on more sophisticated models that better comprehend the visual content of images and generate captions that are not only accurate but also compelling and engaging. This paper emphasizes the need to expand resources and solutions for Urdu to promote research in the language, especially in image captioning, which has gained prominence recently. Table 1 illustrates the most recent developments in Urdu image captioning, with the majority of work using the Flickr8k dataset of 8,000 images. Our study is the first to use 30,000 images for an Urdu image captioning system.
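As a concrete illustration, BLEU-1 (clipped unigram precision with a brevity penalty) can be sketched in plain Python. This is a simplified single-sentence version for illustration only; the evaluation in this paper also uses BLEU-2 through BLEU-4, which extend the same idea to higher-order n-grams:

```python
from collections import Counter
import math

def bleu1(candidate, references):
    """Simplified BLEU-1: clipped unigram precision times a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    cand_counts = Counter(cand)
    # Clip each unigram count by its maximum count in any single reference.
    max_ref = Counter()
    for r in refs:
        for w, c in Counter(r).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty: penalize candidates shorter than the closest reference.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision

print(round(bleu1("a dog runs in the park",
                  ["a dog is running in the park"]), 2))  # → 0.71
```

In practice, libraries such as NLTK provide full multi-order BLEU implementations; the sketch above only shows why a caption sharing most unigrams with its reference scores high on BLEU-1.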

In this paper, the Flickr30k dataset, an extended version of Flickr8k, was translated into Urdu. Our model uses two deep learning architectures, ResNet-50 and NASNetLarge, as encoders to extract key visual features from images. These features are then processed by an LSTM (Long Short-Term Memory) [31] network, which serves as the decoder to generate captions. The encoder extracts features from images, and these features, together with vector files derived from the captions, are passed to the decoder. The decoder generates a caption sequentially using a greedy search approach. The generated captions are evaluated using BLEU and its variants, achieving effective outcomes.
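The greedy search decoding mentioned above can be sketched as follows. The `score_next` callback and the toy language model are stand-ins for the trained CNN-LSTM decoder's softmax output, not the paper's implementation:

```python
def greedy_decode(score_next, start="<s>", end="</s>", max_len=20):
    """Greedy search: at each step pick the single highest-scoring next word.

    `score_next(seq)` is assumed to return a dict mapping candidate words to
    scores (in the real model this comes from the LSTM's softmax output).
    """
    seq = [start]
    for _ in range(max_len):
        probs = score_next(seq)
        word = max(probs, key=probs.get)  # greedy choice, no backtracking
        if word == end:
            break
        seq.append(word)
    return seq[1:]

# Toy language model standing in for the trained CNN-LSTM decoder.
toy_lm = {
    ("<s>",): {"a": 0.9, "the": 0.1},
    ("<s>", "a"): {"dog": 0.8, "cat": 0.2},
    ("<s>", "a", "dog"): {"</s>": 1.0},
}
print(greedy_decode(lambda seq: toy_lm[tuple(seq)]))  # → ['a', 'dog']
```

Greedy search is fast but commits to the locally best word at each step; beam search is a common alternative when more diverse candidates are needed.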

Despite its significance, Urdu remains an under-resourced language with complex grammar and morphology [32]. Additionally, there are difficulties in text tokenization and language modeling, and little Urdu data is available on the internet. While benchmark corpora have been established for almost all English research problems, Urdu lacks comparable resources, which poses a significant challenge for developing an Urdu image captioning system. Expanding resources and solutions for Urdu is necessary to promote research in the language. Several challenges are associated with generating captions in Urdu, including:

  1. Lack of a sizeable Urdu dataset for the development, comparison, and evaluation of deep learning and transfer learning strategies for generating captions from images in Urdu.
  2. Previous models provided limited technical results for accurate prediction, relying primarily on score metrics.
  3. The absence of a robust, large-scale, and reliable model for generating Urdu captions from images.

The contribution of this work is given below:

  1. In this paper, a new dataset for image captioning in Urdu is introduced, leveraging the well-known Flickr30k dataset. The Urdu image captioning dataset UC-23-RY comprises 31,783 images, each accompanied by multiple captions, for a total of 158,915 Urdu captions. These captions were generated through a semi-automated translation approach. The creation of UC-23-RY addresses a critical gap in Urdu language resources and provides a robust resource for training and evaluating image captioning models.
  2. A deep learning-based approach is proposed, using ResNet-50-LSTM and NASNetLarge-LSTM models for Urdu image captioning. This approach is designed to effectively capture and integrate visual and textual features, facilitating the generation of coherent and contextually relevant captions in Urdu.
  3. An extensive evaluation of the developed models was conducted using the BLEU score metric and its variants to assess the quality of the generated captions. The ResNet-50-LSTM and NASNetLarge-LSTM models demonstrated superior performance compared to state-of-the-art methods.

The rest of the paper is structured as follows: The Related Work section discusses previously developed datasets and methods. The Corpus Generation Process section describes the construction of the UC-23-RY dataset. The Materials and Methods section explains the implementation workflow and details the proposed approach. The Experimental Results section presents and analyzes the outcomes of the experiments conducted. The Discussion and Limitations section provides an evaluation of the findings and limitations of the study. Finally, the Conclusion and Future Work section summarizes the key findings of the study and suggests potential directions for future research.

Related Work

The literature can be organized in different ways; here it is divided into image captioning corpora and techniques used for image captioning, each of which is further divided into sub-parts. More details of this subdivision are provided in the following paragraphs and illustrated in Fig 3.

Fig 3. Image Captioning Taxonomy: An Overview of Corpora and Techniques.

https://doi.org/10.1371/journal.pone.0320701.g003

Image Captioning Corpora

Existing corpora used for captions for images can be separated into two main subcategories as shown in Fig 3:

  1. Captioning corpora for the English language.
  2. Captioning corpora for non-English languages.

Captioning Corpora for the English Language.

The majority of image captioning research has been conducted in English. The following image captioning corpora for the English language have been constructed in different studies; they are discussed below and summarized in Table 2.

Flickr-8k Corpus

In 2013, Hodosh et al. [12] provided two datasets in English to develop retrieval-based image captioning. The Flickr8k dataset has been extensively utilized in studies [33–37] to advance the state of the art in image captioning. To enhance accuracy, modern techniques such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, attention mechanisms, and other approaches have been employed. Consequently, image captioning models have improved noticeably over time, contributing to the development of various real-world applications such as image tagging, retrieval, and description.

Flickr 30k Corpus

In 2015, Plummer et al. [14] developed the Flickr30k corpus as a benchmark for automatic image captioning. It extended the Flickr8k corpus and contained 31,783 images, each with five crowd-sourced captions. The annotations were highly structured, but due to the complexity of a large variety of images, ambiguity in object identification, and unrealistic crowdsourcing annotations, the authors proposed a multi-stage pipeline technique for the annotation process. According to the aggregated information gathered during the annotation process across all five captions, humans, animals, clothing, body parts, vehicles, instruments, and various other objects were identified in 94.2%, 12.0%, 69.9%, 28.0%, 13.8%, 4.3%, and 91.8% of the images, respectively. Flickr30k has been used by various papers for image captioning, including a captioning model with a part-of-speech guidance module [38], a dual-model transformer for image captioning [39], and image caption generation with a caption-to-speech mechanism [40].

IAPR TC-12 Corpus

The IAPR TC-12 Corpus [41] comprised 20,000 images, each with at least one English sentence, drawn from a private collection of photographs. The annotation guidelines allowed for words in English, Spanish, and Portuguese. This image collection was created for ImageCLEF (the CLEF Cross-Language Image Retrieval Track), and the IAPR TC-12 image retrieval benchmark was established as a result.

MS COCO Corpus

MS COCO [15] (Microsoft Common Objects in Context) is a popular dataset for computer vision tasks involving object detection and captioning. The MS COCO 15 collection comprises 330,000 images with half a million caption labels. The images in MS COCO belong to 80 different classes, such as people, animals, vehicles, and household items. The MS COCO 14 collection consists of 164,062 images with 995,684 captions. MS COCO has been utilized by numerous researchers to create advanced models for image captioning [42–49].

Captioning Corpora for Non-English Language.

Most existing corpora are in English, leading to a lack of diversity in other languages. Many language corpora developed for image captioning are simply translated versions of the popular English-based Flickr8k corpus. The Flickr30k dataset is primarily available in English, with limited translations into other languages [50]. The dataset includes 31,783 images, each with five English captions. While researchers may have translated these captions into other languages for their experiments, such translations are not part of the official Flickr30k dataset. The number of images and captions in each dataset is given in Table 3.

Chinese Language Corpus (F-8k)

In 2016, a bilingual version of the Flickr8k [51] corpus with Chinese captions was developed, known as the Flickr8k-CN corpus. To create the Chinese captions, the authors used machine translation tools such as Google and Baidu. Vocabulary statistics reveal differences between the two languages: for example, "black" appeared only 116 times in Chinese captions compared to 3,832 times in English captions.

Hindi Language Corpus

In 2020, Rathi et al. developed the Flickr8k-Hindi corpus [52] using the Flickr8k English corpus as the base. Due to budget and time constraints, they opted for a machine translation approach to construct the Hindi edition. They used the Google Cloud Translation API, which supports more than 100 languages.

Arabic Language Corpus

In 2018, Al-Muzaini et al. [6] developed an Arabic image-captioning corpus by combining portions of both the Flickr-8k [12] and MS COCO corpora [15]. In total, the Arabic corpus included 3,247 images with 16,663 Arabic captions and 9,854 words in the vocabulary. The lengthiest caption in the corpus consisted of 27 words.

Japanese Language Corpus

In 2020, Nakayama et al. [53] developed a Japanese corpus from Flickr30k [14]. The study presents a multilingual multimodal corpus, named Flickr30k Entities JP (F30kEnt-JP), extending the original Flickr30k Entities dataset with Japanese translations. This was the first image caption dataset with simultaneous captions in two languages. The study investigated the value of the new Japanese labels through phrase localization experiments and found improved performance.

Chinese Language Corpus (F-30k)

In 2017, Lan Weiyu et al. [54] created a bilingual version of the Flickr30k [14] English corpus, known as the Flickr30k-CN corpus, which contains Chinese captions. To create them, the authors used machine translation tools such as Google and Baidu. The Flickr30k-CN corpus includes both machine-translated and human-generated Chinese captions, the latter collected through crowdsourcing.

German Language Corpus

In 2016, Elliott et al. [55] proposed the Multi30K benchmark to promote research in bilingual multimodal capabilities beyond the English language. The German translations, produced by expert translators, differ from the original crowd-sourced English descriptions. The dataset expands upon Flickr30k and is useful for various tasks, including multilingual image description and multimodal machine translation.

Urdu Language Corpus

In 2020, Ilahi, I. et al. [27] prepared a dataset in the Urdu language by translating the Flickr8k dataset. Specific syntactical and grammatical norms for the Urdu language were followed to generate the dataset. Following that, Afzal, M. K., et al. [28] improved the dataset and made it publicly available for use in 2023.

Image Captioning Techniques

With the impressive advancements in computer vision, several techniques have been created that have achieved remarkable success in Image Captioning and set new benchmarks.

There are three approaches, as shown in Fig 3, currently being utilized in the development of image captioning systems:

  1. Classical Machine Learning (CML)
  2. Deep Learning (DL)
  3. Transfer Learning (TL)

Classical Machine Learning.

Classical Machine Learning (CML) [56] is a traditional approach that has been in use for several decades. CML models are typically based on handcrafted features designed to capture specific aspects of images, for example, color, texture, and shape. These features are used as input to machine learning algorithms, such as Support Vector Machines (SVMs) [57] or Random Forests [58], which learn from the data to generate new captions.

In 2012, Mitchell et al. [59] utilized contextual and syntactical techniques to produce English captions. They employed a template-based technique, whereby syntactic trees provided information about the image's visual content as interpreted by the computer. Mason et al. [60] introduced a non-parametric density estimation method for image captioning in 2014, using the SBU-Flickr dataset, which includes defined labels. The method assigns to an unseen image the label words of seen images with similar contextual characteristics. The scores of existing sentences on the query image were calculated using a word probability distribution, and the sentence with the highest score was selected. In 2013, Hodosh et al. [12] created two English-language corpora to develop retrieval-based image captioning. They utilized a retrieval-based technique, which involves selecting captions from a pre-existing pool of sentences, and implemented ranking in their proposed method.

Deep Learning.

In 2024, Humayun S. et al. [29] proposed a caption generation model for the Urdu language by applying deep learning techniques such as InceptionV3 and LSTM to a Flickr8k dataset translated into Urdu, with impressive outcomes. In 2017, Sidra [30] suggested a deep neural network-based automatic textual description model for the Urdu language based on 999 natural images with 999 captions in both Urdu and English. In 2014, Socher et al. [61] incorporated both retrieval-based and template-based techniques in image captioning using deep learning (DL). They employed the English-annotated Amazon Mechanical Turk dataset [62], which consisted of 1,000 images with five English captions per image. In 2016, Wang et al. [63] used deep learning to caption images on the Flickr8k English corpus, proposing a unique method that merges RNN and LSTM architectures in parallel. In 2014, Kiros et al. [64] introduced a multimodal neural approach for language modeling by proposing a neural implementation. In 2015, J. Mao et al. [65] utilized several English corpora to propose a multimodal RNN (m-RNN) model for generating novel English captions. In 2015, Vinyals et al. [66] applied neural machine translation ideas to image captioning, using an LSTM-based RNN as a decoder to produce a caption after a CNN had encoded the input image. In 2018, Shaung et al. [67] conducted a comprehensive survey of image captioning research and noted that RNNs suffer from vanishing gradient issues; the surveyed attention-based approaches prioritize the most salient words when generating the names and relationships of significant objects within the image. In 2019, a classification of image captioning methods based on different attributes was presented by Z. Hossain [68]. In 2018, Srinivasan et al. proposed a hybrid system [69] that applies an LSTM to a multilayer CNN to generate vocabulary describing images and builds complete phrases from the generated keywords.

In 2015, compositional image captioning was proposed by Fang et al. [70], which had independent functional units: the visual models focused on the image's visual content and passed features to the language model, which then composed the caption. In another approach, Mao et al. [71] proposed a novel object-based image captioning method. Aneja et al. [72], in 2018, developed a CNN-based module that used masked convolutions instead of RNNs and LSTMs due to the vanishing gradient issue. In 2020, Rathi [52] applied deep learning methods to the Flickr8k Hindi dataset, training a CNN-LSTM encoder-decoder architecture. Castro investigated the influence of different hyperparameter configurations [73] for image captioning tasks using an encoder-decoder visual attention framework.

Transfer Learning.

In 2020, Wang et al. [74] proposed a cross-lingual approach in which they added an independent recurrent structure at the caption generation stage, utilizing the Flickr8k [12] English dataset and its Chinese counterpart, Flickr8k-CN [51]. Degadwala [75] utilized transfer learning for image captioning in 2021; to extract image features, they used a pre-trained Inception-V3 model and added a fully connected layer. In 2019, Perdana et al. suggested a method for cross-domain image captioning termed Multimodal Instance-Based Deep Transfer Learning (MIBTL). They performed instance-based transfer learning to identify which data improved the training process on the target-domain dataset [76]. In 2022, Ayoub et al. suggested a methodology for automatically creating captions for images [77] that integrates a pre-trained CNN, the Bahdanau attention mechanism, and transfer learning to predict image captions; their study compared the performance of VGG16 and InceptionV3, two pre-trained CNNs. In 2021, Banerjee et al. introduced a transfer-learning model for annotating images [78] that, given a target dataset with a small number of style-based ground-truth captions, produces style-based captions. After training on the source dataset, the model is fine-tuned to provide style-based captions for the target dataset. The model is currently being tested as a proof of concept at Myntra [79] to gather user feedback, with the potential to enhance the customer experience and increase the add-to-cart ratio by providing additional style-based captions for fashion apparel.

Corpus Generation Process (UC-23-RY development)

Our main objective was to create a sizeable benchmark corpus for the Urdu language as part of research at MUST (Mirpur University of Science and Technology). The steps involved in creating the dataset are described below:

Data Collection

The images for the development of the dataset were collected from Flickr30k [80] for Urdu image captioning. For each image, five Urdu sentences were written as captions, resulting in 158,916 captions for 31,783 images. The result is called the Urdu Captioning dataset (UC-23-RY).

Data Cleaning and Pre-Processing

To prepare them for processing by the model, the Urdu captions were cleaned by removing punctuation marks and special characters. Next, the captions were tokenized to generate the vocabulary and identify correct word boundaries, yielding 18,804 unique words. Following preprocessing and data cleaning, 158,915 captions were assembled.
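A minimal sketch of this cleaning and tokenization step follows. The exact punctuation set and tokenizer used for UC-23-RY are not specified, so the regex and whitespace tokenization below are illustrative assumptions:

```python
import re

# Urdu full stop (۔), comma (،), and question mark (؟), plus common Latin
# punctuation and special characters (an assumed, illustrative set).
PUNCT = re.compile(r"[۔،؟!.,;:\"'()\[\]{}-]")

def clean_caption(text):
    """Remove punctuation/special characters and collapse whitespace."""
    return re.sub(r"\s+", " ", PUNCT.sub(" ", text)).strip()

def build_vocab(captions):
    """Tokenize cleaned captions on whitespace and collect unique words."""
    vocab = set()
    for c in captions:
        vocab.update(clean_caption(c).split())
    return vocab

caps = ["ایک کتا دوڑ رہا ہے۔", "ایک بلی بیٹھی ہے۔"]
print(len(build_vocab(caps)))  # → 7 unique words across the two captions
```

For the full corpus, the same procedure over all 158,915 captions yields the 18,804-word vocabulary reported above.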

Sentences Creation Process and Guidelines for UC-23-RY Corpus

For sentence creation, both automatic and manual methods were used.

Automatic Translation.

Initially, the English captions were translated into Urdu using the Bulk Translator tool [81], chosen for its ability to handle large volumes of text. However, some limitations were observed:

  • Contextual Errors: Automatic translation tools lack contextual knowledge, leading to inaccurate translations.
  • Grammatical Errors: The difference between English and Urdu grammar led to errors that automatic tools could not resolve.
  • Limited Resources for Urdu: As Urdu is an under-resourced language, bulk translators struggled to maintain quality and consistency across translations.

Manual Inspection and Correction.

Due to errors in automatic translation, manual inspection and correction were performed on the translated Urdu text. An Urdu language expert was tasked with:

  • Ensuring translation matched the contextual meaning of the original English captions.
  • Correcting any grammatical errors and improving sentence fluency to maintain semantic accuracy.
  • Ensuring consistent terminology across the dataset.

Challenges.

The translation and correction process was time-consuming, with over 158,000 captions requiring attention. The process took 2–3 months, largely due to the need for manual inspection and the limitations of translation tools.

Rounds of Quality Control

Initial Automatic Translation.

  • The initial translation of English captions into Urdu was performed using online tools such as Google Translator and Bulk Translator.
  • Bulk translator was chosen due to its ability to handle large datasets, but it faced limitations in accuracy and context preservation.

Manual Inspection and Correction.

After automatic translation, a round of manual inspection and correction was conducted on the entire dataset to rectify grammatical and contextual errors, with particular attention given to the testing data.

Final Review.

The final review involved ensuring the accuracy of the corrected captions. Special attention was given to maintaining syntactic and semantic alignment between the original English captions and their Urdu translations.

Translation Quality Measurement

Human Evaluation.

A linguistic expert manually reviewed and corrected the automatically translated captions. The focus was on grammatical correctness, fluency, and contextual accuracy.

Error Correction Based on Observations.

A subset of captions with significant translation errors was identified during manual inspection. This phase ensured that any inaccuracies were fixed, particularly in critical testing data.

Consistency Check.

Throughout the process, consistency in terminology and translation style was maintained. Continuous observation ensured that corrections followed the guidelines established for the UC-23-RY corpus.

Corpus Normalization

The UC-23-RY corpus is composed of 31,783 images and 158,915 Urdu captions. The corpus has been normalized into CSV format and manually inspected and corrected, facilitating a comparison between automatic and manual translation, as shown in Table 4.

Corpus Characteristics

Table 5 summarizes the key characteristics of the dataset. The entire dataset has been used, comprising 31,783 images, each with an average of five English captions. A native speaker of Urdu translated the captions by hand, and a second native Urdu speaker then carried out several rounds of quality control. The UC-23-RY dataset includes a total of 18,804 unique words and a maximum caption length of 431, as shown in Fig 4, which plots the distribution of text lengths for the captions in the dataset. The x-axis corresponds to the caption index, while the y-axis indicates the caption length in terms of the number of words or characters. The visualization helps in understanding the variation in caption length across the dataset.

Fig 4. Distribution of caption lengths (index vs text length).

https://doi.org/10.1371/journal.pone.0320701.g004

Materials and Methods

An encoder-decoder mechanism is adopted for automatic image captioning, where ResNet-50 [82] and NASNetLarge [83] act as encoders and an LSTM [31] acts as the decoder. ResNet-50 and NASNetLarge use pre-trained ImageNet weights for feature extraction from visual data. The LSTM is an advanced type of RNN that can capture long-term dependencies, enabling sequence prediction for query images, as shown in Fig 5.

Fig 5. Flowchart of Urdu image caption generation process.

https://doi.org/10.1371/journal.pone.0320701.g005

Implementation Workflow

This pseudocode outlines the steps, as shown in Fig 5, involved in generating Urdu captions for images using a deep learning model that integrates CNN feature extraction (via ResNet-50 or NASNetLarge) with LSTM sequence modeling.

Inputs: Image dataset (UC-23-RY)

Outputs: Urdu captions for input images.

  a. Preprocess Stage
  • Resize images and normalize pixel values.
  • Tokenize and preprocess Urdu captions.
  • Apply word embedding to Urdu captions.
  b. Visual Feature Extraction Stage
  • Use a CNN (ResNet-50 or NASNetLarge) to extract visual features from input images.
  • Train the model for n epochs.
  • Save the trained model and monitor its performance.
  c. Caption Generation Stage
  • Extract image features from the CNN (ResNet-50 or NASNetLarge) and feed them to the LSTM.
  • Train the LSTM model with image features and tokenized caption sequences.
  • Update LSTM hidden states in both forward and backward directions to generate the sequence of words for captions.
  • Optimize LSTM hyperparameters during training.
  • Convert the generated caption sequence to Urdu text using word embeddings to obtain the predicted caption for an input image.
  d. Testing Phase
  • Evaluate the trained model by passing query images.
  • Compute performance metrics (BLEU scores).
  • Compare predictions to assess accuracy.
  • Analyze results to fine-tune model parameters and improve performance.

Dataset

For the experiment presented in this paper, the entire UC-23-RY dataset was used (158,915 Urdu captions for 31,783 images). A train set of 25,426 images (approximately 80%) and a test set of 6,357 images (approximately 20%) were randomly created from the dataset. With five captions per image, this yields 127,130 training and 31,785 test captions. After the model has been trained on the 25,426 training images, the experiment uses the remaining 6,357 test images, as shown in Table 6.
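The image-level split described above can be sketched as follows. The paper does not specify its splitting code or random seed, so this is an illustrative reconstruction that reproduces the reported counts:

```python
import random

def split_dataset(image_ids, train_frac=0.8, seed=42):
    """Split at the image level so all five captions of an image stay together.

    The seed is an assumption for reproducibility; the paper does not state one.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

train, test = split_dataset(range(31783))
print(len(train), len(test))            # → 25426 6357 images
print(5 * len(train), 5 * len(test))    # → 127130 31785 captions
```

Splitting by image rather than by caption prevents captions of the same image from leaking between the train and test sets.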

Table 6. Division of training and testing images based on a split ratio.

https://doi.org/10.1371/journal.pone.0320701.t006

Image and Textual Feature Extraction

In our developed model, ResNet-50 and NASNetLarge are used separately, not together, for feature extraction in different experiments. The input image size fed into the network is 224 × 224 pixels for ResNet-50 and 331 × 331 for NASNetLarge. Both networks were pre-trained on the ImageNet dataset, which contains over a million images spanning 1,000 object classes, such as animals, pencils, keyboards, and more. Utilizing the pre-trained ImageNet weights of ResNet-50 and NASNetLarge constitutes transfer learning. The final fully connected (dense) prediction layer is removed so that the network outputs an image feature vector, which is later combined with the Urdu caption features. For textual features, Urdu word embeddings that map Urdu tokens to 300-dimensional vectors were utilized.

Multimodal Approach

In the proposed multimodal approach, illustrated in Fig 6, visual and textual data are combined to generate Urdu image captions, hence the name multimodal. The image is processed through one of two CNN models, ResNet-50 or NASNetLarge, each used independently with an LSTM as the decoder. ResNet-50 extracts image features using a 50-layer deep network, while NASNetLarge offers more complex feature extraction due to its larger architecture and ability to generalize across a wide range of objects. The extracted image features are converted into vectors, which are then passed to the LSTM to generate captions word by word. Fusion in our model is achieved through concatenation, where the image embedding vector and the caption/text embedding vector are combined along a specific dimension to form a single input vector. Mathematically, if I represents the image embedding vector of dimension d1 and T represents the caption/text embedding vector of dimension d2, the fused vector F in Eq 1 is defined as:

F = [I; T]  (1)

where [I; T] denotes the concatenation of vectors I and T, giving F dimension d1 + d2. This fused vector F is then passed to the subsequent layers, including the LSTM decoder, for processing. Both models were chosen for their strong performance in feature extraction tasks, with ResNet-50 providing efficient processing and NASNetLarge offering detailed feature extraction, ensuring robust caption generation across various contexts.
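The concatenation fusion of Eq 1 is a one-liner. The dimensions below follow the text (a 256-dimensional projected image embedding and the 300-dimensional Urdu word embedding); the constant-filled vectors are placeholders for real embeddings.

```python
import numpy as np

def fuse(image_vec, text_vec):
    """Eq 1: F = [I; T] - concatenate the image and text embeddings along the
    feature dimension to form a single input vector for the decoder."""
    return np.concatenate([image_vec, text_vec])

I = np.ones(256)   # image embedding (d1 = 256, the projected CNN features)
T = np.ones(300)   # caption embedding (d2 = 300, the Urdu word-embedding size)
F = fuse(I, T)
print(F.shape)     # (556,) = d1 + d2
```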

ResNet-50-LSTM based captioning model.

For the language model (ResNet-50-LSTM), the image input shape is 2048, projected to a dimension of 256; the maximum caption length is 431 tokens, and the vocabulary size is 4,732 after removing low-frequency words. Text is passed to the LSTM model through 300-dimensional embeddings with a dropout rate of 0.3. A fully connected neural layer with a cell size of 256 and a hidden layer of 256 is used. This dense network is applied to each annotation vector ai, which depends on the decoder's last hidden state ht-1. The projected hidden space is combined with each input vector using summation, producing a ReLU-activated shape of (4732, 256). This vector then attends to the (2048, 300) shape vector, yielding the context vector representation of the visual features. Because the LSTM needs fixed-length sequences, sentences of varying lengths are padded to the maximum size of 86. This uniformity allows for consistent processing across the dataset.
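The fixed-length padding mentioned above can be sketched without Keras (whose `pad_sequences` utility performs the same job on token-ID sequences):

```python
def pad_sequence(tokens, max_len=86, pad_id=0):
    """Right-pad a token-ID sequence to max_len (truncating longer ones),
    so every caption has the same length for the LSTM."""
    return (tokens + [pad_id] * max_len)[:max_len]

seq = [12, 7, 99, 3]        # illustrative token IDs for a 4-word caption
padded = pad_sequence(seq)
print(len(padded))          # 86
print(padded[:6])           # [12, 7, 99, 3, 0, 0]
```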

A custom embedding layer of size 256 is added and, during training, learns a fixed-length continuous representation of words. The LSTM decoder uses this as its final word representation. The LSTM has a hidden state size of 256. To forecast the next word, the updated hidden state is projected by a fully connected layer, which maps each word's 300-dimensional vector into the vocabulary. This is followed by a softmax for word prediction. Categorical cross-entropy loss (multi-category) is used to backpropagate the gradient.
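The final dense-plus-softmax step, projecting the LSTM hidden state onto the 4,732-word vocabulary, looks like this in NumPy; the weight matrix is randomly initialized purely for illustration, standing in for the trained dense layer.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

hidden_size, vocab_size = 256, 4732
rng = np.random.default_rng(1)
W = rng.standard_normal((hidden_size, vocab_size)) * 0.01  # dense layer weights
b = np.zeros(vocab_size)                                   # dense layer bias

h_t = rng.standard_normal(hidden_size)   # LSTM hidden state at step t
probs = softmax(h_t @ W + b)             # distribution over the vocabulary
next_word_id = int(np.argmax(probs))     # greedy choice of the next word
print(probs.shape)                       # (4732,), sums to 1
```

During training, the categorical cross-entropy loss is computed between this distribution and the one-hot encoding of the true next word.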

In the benchmark approach, the word embedding (256) of the last prediction St-1 is utilized as input for the subsequent time step. In the LSTM, the hidden state ht is updated at each step. The resulting vector is then concatenated and fed into a single LSTM decoder, which produces the next word.

The Adam optimizer is used with its default learning rate of 1e-3. A dropout rate of 0.3 was applied, randomly dropping 30% of units during training. The deep-learning model is trained for 500 epochs with a batch size of 5. The training data is fed into the model using a generator (data_generator) for each loop, and the model is updated using the ‘fit_generator’ method with 32 steps per epoch. Each trained model file is then saved with a unique name, so that if training is interrupted at any stage for any reason, it can be resumed from the last saved file.
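The save-and-resume behavior described above can be sketched as a checkpoint loop. The file-naming scheme and the per-epoch stub are assumptions for illustration, not the paper's actual training code; in Keras the same effect is typically achieved with the `ModelCheckpoint` callback.

```python
import glob
import os
import tempfile

def latest_checkpoint(ckpt_dir):
    """Return (path, epoch) of the most recent 'model_<epoch>.ckpt' file,
    or (None, 0) if training has not started yet."""
    files = glob.glob(os.path.join(ckpt_dir, "model_*.ckpt"))
    if not files:
        return None, 0
    epoch_of = lambda p: int(p.rsplit("_", 1)[1].split(".")[0])
    best = max(files, key=epoch_of)
    return best, epoch_of(best)

def train(ckpt_dir, total_epochs=5):
    _, start = latest_checkpoint(ckpt_dir)   # resume after any interruption
    for epoch in range(start + 1, total_epochs + 1):
        # a real run would execute the 32 generator steps of the epoch here,
        # then save weights; we just create the checkpoint file
        open(os.path.join(ckpt_dir, f"model_{epoch}.ckpt"), "w").close()
    return latest_checkpoint(ckpt_dir)[1]

ckpt_dir = tempfile.mkdtemp()
train(ckpt_dir, total_epochs=3)            # simulate a run interrupted at epoch 3
final = train(ckpt_dir, total_epochs=5)    # resumes from epoch 3, finishes at 5
print(final)  # 5
```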

Categorical cross-entropy loss and accuracy were tracked. The loss decreases and the accuracy improves continuously after each epoch. Over 500 epochs, the accuracy increased from 0.079 to 0.937, and the loss decreased from 6.239 to 0.139. The encoder is trained using transfer learning with its weights frozen, while only the decoder is trained. The first convolutional block would typically have learned low-level image features such as lines, edges, and curves. Only the later convolutional blocks of ResNet-50 are fine-tuned, while the initial block remains intact. As a result, the foundations remain the same, as shown in Table 7, which displays the trained model’s hyperparameters.

NASNetLarge-LSTM based captioning model.

For the language model (NASNetLarge-LSTM), the image input shape is 4032, projected to a dimension of 256; the maximum caption length is 431 tokens, and the vocabulary size is 4,732 after removing low-frequency words. Text is passed to the LSTM model through 300-dimensional embeddings with a dropout rate of 0.3. A fully connected neural layer with a cell size of 256 and a hidden layer of 256 is used. This dense network is applied to each annotation vector ai, which depends on the decoder's final hidden state ht-1. The projected hidden space is combined with each input vector using summation, producing a ReLU-activated shape of (4732, 256). To obtain the context vector representation of the image features, this vector then attends to the (4032, 300) shape vector. Because the LSTM needs fixed-length sequences, sentences of varying lengths are padded to the maximum size of 86. This uniformity allows for consistent processing across the dataset.

The LSTM decoder uses this as its final word representation. The LSTM has a hidden state size of 256. To forecast the next word, the updated hidden state is projected by a fully connected layer, which maps each word's 300-dimensional vector into the vocabulary. This is followed by a softmax for word prediction. Categorical cross-entropy loss (multi-category) is used to backpropagate the gradient.

In the benchmark model, the word embedding (256) of the most recent prediction St-1 is utilized exclusively as input for the subsequent time step. In the LSTM, the hidden state ht is updated at each step. The resulting vector is concatenated and fed into a single LSTM decoder to generate the next word.

The Adam optimizer is used with its default learning rate of 1e-3. A dropout rate of 0.3 was applied, randomly dropping 30% of units during training. The deep-learning model is trained for 700 epochs with a batch size of 5. The training data is fed into the model using a generator (data_generator) for each loop, and the model is updated using the ‘fit_generator’ method with 32 steps per epoch.

Each trained model file is then saved with a unique name; if the training process is interrupted for any reason, it resumes from the most recently saved file. Categorical cross-entropy loss and accuracy were tracked. The loss decreases and the accuracy improves continuously after each epoch. Over 700 epochs, the accuracy increased from 0.0806 to 0.9425, and the loss decreased from 6.3577 to 0.1218. Transfer learning is applied to the encoder, keeping its weights frozen, and only the decoder is trained. The first convolutional block would normally have learned low-level image features, including lines, edges, and curves. Only the later convolutional blocks of NASNetLarge are fine-tuned, while the initial block remains intact. As a result, the foundations remain the same, as shown in Table 7, which displays the hyperparameters of the trained model.

Computational Resources and Time Analysis

The experiment was conducted on a system with an Intel Core i5-6300HQ CPU @ 2.0 GHz, 8 GB of RAM, and a 64-bit Windows operating system. The deep learning models are implemented in Python and TensorFlow, with Keras as the high-level neural networks API. While the system supplies sufficient resources for model training and evaluation, the relatively modest hardware specifications necessitated careful management of computational resources, particularly when dealing with large datasets and more complex models like NASNetLarge. After dataset translation, the deep learning models were trained with careful monitoring of the time spent on each epoch. The ResNet-50-LSTM model was trained for 500 epochs, and the NASNetLarge-LSTM model for 700 epochs. The training time per epoch varied based on model complexity and the number of training steps. The total time required to complete the training of each model is summarized in Table 8.

thumbnail
Table 8. Time elapsed during training of ResNet-50-LSTM and NASNetLarge-LSTM model.

https://doi.org/10.1371/journal.pone.0320701.t008

In the case of ResNet-50-LSTM, the training of 500 epochs required a total of approximately 1 hour 18 minutes. For NASNetLarge-LSTM, the training of 700 epochs took roughly 1 hour 33 minutes. Both models showed varying time durations per epoch due to the difference in the number of steps and model complexity.

Experimental Results

The experiment is conducted using multimodal approaches that combine image and text processing techniques. Our steps include translating English captions into Urdu with various models trained on different datasets.

We compute accuracy as a percentage derived from BLEU scores to provide an intuitive representation of model performance. BLEU is a parameterized metric whose value varies with its parameters, such as the n-gram order. It provides a quantitative score of how well a given prediction matches the reference captions.

Accuracy in Eq 2 is calculated as:

Accuracy (%) = BLEU score × 100  (2)

The BLEU score is based on n-gram precision combined with a brevity penalty to account for length differences; it ranges from 0 to 1 (or 0% to 100%), even though natural language contains significantly more flexible formulations, where multiple words or their combinations may convey the same semantic idea. Predictions are generated automatically using greedy search, and the results are evaluated using the BLEU-1, BLEU-2, BLEU-3, and BLEU-4 metrics, as shown in Fig 8.
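A simplified BLEU-1 (unigram precision with a brevity penalty, with clipped counts as in Papineni et al. [18]) can be computed directly, including the ×100 percentage conversion used for accuracy. This standalone sketch handles a single reference; production evaluation would typically use a library implementation such as NLTK's `corpus_bleu`.

```python
import math
from collections import Counter

def bleu1(reference, candidate):
    """Unigram-precision BLEU with a brevity penalty, on whitespace tokens.
    Counts are clipped so a candidate word scores at most as often as it
    appears in the reference."""
    ref, cand = reference.split(), candidate.split()
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = clipped / len(cand)
    # brevity penalty: 1 if the candidate is longer than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

ref = "a man rides a brown horse"
hyp = "a man rides a horse"
score = bleu1(ref, hyp)
print(round(score * 100, 1))  # 81.9 - perfect unigram precision, short caption
```

BLEU-2 through BLEU-4 extend the same idea to the geometric mean of 2- to 4-gram precisions.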

thumbnail
Fig 7. Captions in Urdu using ResNet-50 and NASNetLarge models with their transliterations.

https://doi.org/10.1371/journal.pone.0320701.g007

thumbnail
Fig 8. ResNet-50 and NASNetLarge results for BLEU – (1,2,3,4) Scores.

https://doi.org/10.1371/journal.pone.0320701.g008

For comparison, the models were also examined using the Flickr8k dataset. Our model, which uses ResNet-50 as the encoder and LSTM as the decoder, achieved BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 75.3, 63.2, 50.9, and 40.2 respectively on Flickr8k. These scores are lower than those obtained with Flickr30k. Using the more extensive Flickr30k dataset with ResNet-50, our model demonstrates improved performance, highlighting its effectiveness over these benchmarks.

Table 9 presents a comparative analysis of our proposed results against those reported in prior research using different languages (English, Hindi, and Chinese), namely the Dual-Modal Transformer [39], CNN-LSTM [52], and a Chinese captioning model [51], highlighting the improved performance achieved by our methodology. In our study, the Flickr30k dataset is used to translate English captions into Urdu. Our approach combines an LSTM decoder with CNN-based encoders: ResNet-50 and NASNetLarge. The ResNet-50-LSTM model achieved a BLEU-1 score of 86, while NASNetLarge achieved 84, significantly outperforming previous Urdu-specific models. This improvement highlights the effectiveness of our approach and the richness of the UC-23-RY dataset. Additionally, our model outperforms studies in other languages such as Hindi and Chinese, establishing the robustness and generalization of our method.

thumbnail
Table 9. BLEU scores for different datasets in different languages.

https://doi.org/10.1371/journal.pone.0320701.t009

The remaining models, such as the attention-driven inject model [27], generative image captioning [28], and deep learning-based Urdu image captioning [29], are compared with the two models proposed in this research, emphasizing their strengths and differences within the same language. The proposed Urdu image captioning system uses deep transfer learning methods that differ in their feature extraction approaches. Image features are extracted using a CNN model, while an LSTM generates the sequence prediction. Pre-trained Urdu word embeddings, such as urduvec, are used for textual embedding, as shown in Table 10.

thumbnail
Table 10. Feature and textual extraction model with random weights and Urdu vector for a multimodal approach.

https://doi.org/10.1371/journal.pone.0320701.t010

From the findings of the deep learning models, it was observed that the BLEU-1 score of the ResNet-50-LSTM model outperformed that of the NASNetLarge-LSTM model: ResNet-50-LSTM obtained the highest BLEU-1 score of 0.86, compared to 0.843 for NASNetLarge-LSTM. A pre-trained CNN model with ImageNet weights, fine-tuned to our needs, was utilized for image feature extraction, with a sequence model (RNN) used for processing the textual data.

Discussion and Limitations

The UC-23-RY dataset and the application of deep learning models, including ResNet-50-LSTM and NASNetLarge-LSTM, represent notable contributions to Urdu image captioning. In our study, the accuracy of the model was evaluated through a combination of automated and manual validation methods. The captions generated by the model were manually validated by both native and non-native speakers to ensure linguistic fluency, contextual accuracy, and naturalness. This manual review was conducted in multiple rounds to maintain consistency. Additionally, the captions were cross-verified against high-quality reference images and benchmarks to ensure their relevance and alignment with the ground truth. For automated evaluation, BLEU scores (BLEU-1, BLEU-2, BLEU-3, BLEU-4) were used to measure the accuracy of the predicted captions, and the achieved BLEU-1 scores validate the effectiveness of these models in generating high-quality captions. By combining BLEU (1-4) scores for automated evaluation, manual checks for linguistic accuracy, and cross-validation with high-quality benchmarks, we ensured a comprehensive and reliable validation of the model’s performance. However, the study highlights several areas for improvement. Regarding limitations, adding more images and their corresponding captions would increase linguistic diversity and enhance the model’s ability to produce captions for a wider variety of images, and more robust models with additional evaluation metrics could be employed to compare and assess models on large datasets. Future research should therefore focus on increasing the dataset size, integrating advanced language models, and optimizing computational efficiency to further advance Urdu image captioning.

Conclusion and Future Work

Urdu image captioning is the task of producing an appropriate caption in the Urdu language. Urdu lacks resources, and no benchmark corpus has been available for Urdu image captioning. Therefore, we developed the UC-23-RY corpus from the Flickr30k dataset for Urdu image captioning, adopting a semi-automatic translation approach. This is the first research to generate an Urdu dataset from Flickr30k images; most prior work was done on Flickr8k, often with a limited selection of images. This research also applies cutting-edge deep learning models that use a CNN for images and an LSTM for hybrid feature learning, with best BLEU-1 scores of 0.86 for ResNet-50 and 0.84 for NASNetLarge.

Throughout our research, we identified key areas for the future development of image captioning, which will be expanded upon in future work:

  • Expanding the size of the UC-23-RY corpus by including other corpora such as MS COCO.
  • Implementing advanced transformer models such as BERT, GPT, and T5 as language models to capture the contextual relationships in caption generation tasks.

References

  1. Hasanvand M, Nooshyar M, Moharamkhani E, Selyari A. Machine learning methodology for identifying vehicles using image processing. Artif Intell Appl. 2023;1(3):154–62.
  2. Preethi P, Mamatha HR. Region-based convolutional neural network for segmenting text in epigraphical images. Artif Intell Appl. 2022;1(2):103–11.
  3. Arafat SY, Ashraf N, Iqbal MJ, Ahmad I, Khan S, Rodrigues JJPC. Urdu signboard detection and recognition using deep learning. Multimed Tools Appl. 2022;81(9):11965–87.
  4. Arafat SY, Iqbal MJ. Urdu-text detection and recognition in natural scene images using deep learning. IEEE Access. 2020;8:96787–803.
  5. Mokayed H, Quan TZ, Alkhaled L, Sivakumar V. Real-time human detection and counting system using deep learning computer vision techniques. Artif Intell Appl. 2023;1(4):205–13.
  6. Al-muzaini HA, Al-Yahya TN, Benhidour H. Automatic Arabic image captioning using RNN-LSTM-based language model and CNN. IJACSA. 2018;9(6).
  7. Liu M, Hu H, Li L, Yu Y, Guan W. Chinese image caption generation via visual attention and topic modeling. IEEE Trans Cybern. 2022;52(2):1247–57. pmid:32568717
  8. Yan C, Xie H, Liu S, Yin J, Zhang Y, Dai Q. Effective Uyghur language text detection in complex background images for traffic prompt identification. IEEE Trans Intell Transport Syst. 2018;19(1):220–9.
  9. Kaur J, Josan GS. English to Hindi multi modal image caption translation. J Sci Res. 2020;64(02):274–81.
  10. Ding S, Qu S, Xi Y, Wan S. Stimulus-driven and concept-driven analysis for image caption generation. Neurocomputing. 2020;398:520–30.
  11. Stefanini M, Cornia M, Baraldi L, Cascianelli S, Fiameni G, Cucchiara R. From show to tell: A survey on deep learning-based image captioning. IEEE Trans Pattern Anal Mach Intell. 2023;45(1):539–59. pmid:35130142
  12. Hodosh M, Young P, Hockenmaier J. Framing image description as a ranking task: data, models and evaluation metrics. J Artif Intell Res. 2013;47:853–99. https://doi.org/10.1613/jair.3994
  13. Agrawal H, Desai K, Wang Y, Chen X, Jain R, Johnson M, Batra D, Parikh D, Lee S, Anderson P. nocaps: novel object captioning at scale. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019;8948–57.
  14. Plummer BA, Wang L, Cervantes CM, Caicedo JC, Hockenmaier J, Lazebnik S. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE International Conference on Computer Vision. 2015;2641–9.
  15. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: Common objects in context. In: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014;740–55.
  16. Wikipedia contributors. "Urdu Wikipedia," Wikipedia. [Online]. [cited 26 Apr 2024]. Available from: https://en.wikipedia.org/wiki/Urdu_Wikipedia
  17. Kanwal S, Malik K, Shahzad K, Aslam F, Nawaz Z. Urdu named entity recognition: Corpus generation and deep learning applications. ACM Trans Asian Low-Resour Lang Inf Process. 2019;19(1):1–13.
  18. Papineni K, Roukos S, Ward T, Zhu W-J. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002;311–8.
  19. Yang M, Zhu J, Li J, Wang L, Qi H, Li S, et al. Extending BLEU evaluation method with linguistic weight. In: 2008 The 9th International Conference for Young Computer Scientists. IEEE. 2008;1683–8. https://doi.org/10.1109/icycs.2008.362
  20. Rane C, Lashkare A, Karande A, Rao YS. Image captioning based smart navigation system for visually impaired. In: 2021 International Conference on Communication Information and Computing Technology (ICCICT). IEEE. 2021;1–5. https://doi.org/10.1109/iccict50803.2021.9510102
  21. Sharma H, Jalal AS. Image captioning improved visual question answering. Multimed Tools Appl. 2021;81(24):34775–96.
  22. You Q, Jin H, Wang Z, Fang C, Luo J. Image captioning with semantic attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016;4651–9.
  23. Sharma A, Aggarwal M. A holistic review of image-to-text conversion: techniques, evaluation metrics, multilingual captioning, storytelling and integration. SN Comput Sci. 2025;6(3):1–26.
  24. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44. pmid:26017442
  25. Gu J, et al. Recent advances in convolutional neural networks. Pattern Recognit. 2018;77:354–77. https://doi.org/10.1016/j.patcog.2017.10.013
  26. Medsker LR, Jain L. Recurrent neural networks. Design Appl. 2001;5:64–7.
  27. Ilahi I, Zia HMA, Ahsan MA, Tabassam R, Ahmed A. Efficient Urdu caption generation using attention based LSTM. 2020. arXiv preprint arXiv:2008.01663.
  28. Afzal MK, Shardlow M, Tuarob S, Zaman F, Sarwar R, Ali M, et al. Generative image captioning in Urdu using deep learning. J Ambient Intell Human Comput. 2023;14(6):7719–31.
  29. Khan HS, Muzaffar R, Arafat SY, Irshad Z. Deep learning-based Urdu image captioning. In: 2024 International Conference on Engineering & Computing Technologies (ICECT). IEEE. 2024;1–8. https://doi.org/10.1109/icect61618.2024.10581206
  30. Sidra S. Automatic textual description generation of natural images using DNN. Master’s thesis. Mirpur University of Science and Technology (MUST), Mirpur (AJK), Pakistan, 2017.
  31. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. pmid:9377276
  32. Mahmood Z, Safder I, Nawab RMA, Bukhari F, Nawaz R, Alfakeeh AS, et al. Deep sentiments in Roman Urdu text using Recurrent Convolutional Neural Network model. Inform Process Manag. 2020;57(4):102233.
  33. Singh V, Shankar Singh A, Anandhan K. Image captioning using machine/deep learning. In: 2022 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N). IEEE. 2022;849–54. https://doi.org/10.1109/icac3n56670.2022.10074220
  34. Mandal S, Lele N, Kunawar C. Automatic image caption generation system. Int J Innov Sci Res Technol. 2021;6(6).
  35. Saba S. Image captioning system using artificial intelligence. Graduate J Pak Rev (GJPR). 2023;3(1):1–41.
  36. Bhavana D, Krishna KC, Tejaswini K, Vikas NV, Sahithya AN. Image captioning using deep learning. In: Handbook of Research on Innovations and Applications of AI, IoT, and Cognitive Technologies. 2021;381–95.
  37. Al-Sammarraie YQ, et al. Image caption and hashtags generation using deep learning approach. In: 2022 International Engineering Conference on Electrical, Energy, and Artificial Intelligence (EICEEAI). 2022;1–5.
  38. Bae JW, Lee SH, Kim WY, Seong JH, Seo DH. Image captioning model using part-of-speech guidance module for description with diverse vocabulary. IEEE Access. 2022;10:45219–29.
  39. Kumar D, Srivastava V, Popescu DE, Hemanth JD. Dual-modal transformer with enhanced inter- and intra-modality interactions for image captioning. Appl Sci. 2022;12(13):6733.
  40. Rawale S, Ghotkar M, Sonavane K, Surve P, Khonde S, Patil D. Image captioning generator system with caption to speech conversion mechanism. Int Res J Mod Eng Technol Sci. 2021;1262–7.
  41. Grubinger M, Clough P, Müller H, Deselaers T. The IAPR TC-12 benchmark: A new evaluation resource for visual information systems. In: International Workshop OntoImage. 2006.
  42. Nam H, Ha JW, Kim J. Dual attention networks for multimodal reasoning and matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017;299–307.
  43. Anderson P, et al. Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018;6077–86.
  44. Karpathy A, Fei-Fei L. Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015;3128–37.
  45. Kilickaya M, Erdem A, Ikizler-Cinbis N, Erdem E. Re-evaluating automatic metrics for image captioning. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 2017;199–209.
  46. Chu Y, Yue X, Yu L, Sergei M, Wang Z. Automatic image captioning based on ResNet50 and LSTM with soft attention. Wireless Commun Mob Comput. 2020;2020:1–7.
  47. Diwakar P. Automatic image captioning using deep learning. In: Proceedings of the International Conference on Innovative Computing & Communication (ICICC). 2021.
  48. Yu J, Li J, Yu Z, Huang Q. Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans Circuits Syst Video Technol. 2019;30(12):4467–80.
  49. Zhou L, Palangi H, Zhang L, Hu H, Corso J, Gao J. Unified vision-language pre-training for image captioning and VQA. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2020;13041–9.
  50. Marcelo V, et al. The case for perspective in multimodal datasets. In: Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022. 2022;108–16.
  51. Li X, Lan W, Dong J, Liu H. Adding Chinese captions to images. In: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. ACM. 2016;271–5. https://doi.org/10.1145/2911996.2912049
  52. Rathi A. Deep learning approach for image captioning in Hindi language. In: 2020 International Conference on Computer, Electrical & Communication Engineering (ICCECE). 2020.
  53. Nakayama H, Tamura A, Ninomiya T. A visually-grounded parallel corpus with phrase-to-region linking. In: Proceedings of the Twelfth Language Resources and Evaluation Conference. 2020;4204–10.
  54. Lan W, Li X, Dong J. Fluency-guided cross-lingual image captioning. In: Proceedings of the 25th ACM International Conference on Multimedia. 2017;1549–57. https://doi.org/10.1145/3123266.3123366
  55. Elliott D, Frank S, Sima’an K, Specia L. Multi30K: multilingual English-German image descriptions. In: Proceedings of the 5th Workshop on Vision and Language. 2016;70–4.
  56. Faouzi J, Colliot O. Classic machine learning methods. In: Machine Learning for Brain Disorders. 2023;25–75.
  57. Pisner DA, Schnyer DM. Support vector machine. In: Machine Learning. 2020;101–21. https://doi.org/10.1016/b978-0-12-815739-8.00006-7
  58. Rigatti SJ. Random forest. J Insur Med. 2017.
  59. Mitchell M, et al. Midge: generating image descriptions from computer vision detections. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. 2012;747–56.
  60. Mason R, Charniak E. Nonparametric method for data-driven image captioning. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. 2014;592–8.
  61. Socher R, Karpathy A, Le QV, Manning CD, Ng AY. Grounded compositional semantics for finding and describing images with sentences. Trans Assoc Comput Linguist. 2014;2:207–18.
  62. Rashtchian C, Young P, Hodosh M, Hockenmaier J. Collecting image annotations using Amazon’s Mechanical Turk. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. 2010;139–47.
  63. Wang M, Song L, Yang X, Luo C. A parallel-fusion RNN-LSTM architecture for image caption generation. In: 2016 IEEE International Conference on Image Processing (ICIP). 2016;4448–52.
  64. Kiros R, Salakhutdinov R, Zemel R. Multimodal neural language models. In: International Conference on Machine Learning. PMLR. 2014;595–603.
  65. Mao J, Xu W, Yang Y, Wang J, Huang Z, Yuille A. Deep captioning with multimodal recurrent neural networks (m-RNN). 2014. arXiv preprint arXiv:1412.6632.
  66. Vinyals O, Toshev A, Bengio S, Erhan D. Show and tell: A neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015;3156–64.
  67. Bai S, An S. A survey on automatic image caption generation. Neurocomputing. 2018;311:291–304.
  68. Hossain MZ, et al. A comprehensive survey of deep learning for image captioning. ACM Comput Surv (CSUR). 2019;51(6):1–36.
  69. Srinivasan L, Sreekanthan D, et al. Image captioning - a deep learning approach. Int J Appl Eng Res. 2018;13(9):7239–42.
  70. Fang H, Gupta S, Iandola F, et al. From captions to visual concepts and back. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015;1473–82.
  71. Mao J, et al. Learning like a child: fast novel visual concept learning from sentence descriptions of images. In: Proceedings of the IEEE International Conference on Computer Vision. 2015;2533–41.
  72. Aneja J, Deshpande A, Schwing AG. Convolutional image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018;5561–70.
  73. Castro R, Pineda I, Lim W, Morocho-Cayamcela ME. Deep learning approaches based on transformer architectures for image captioning tasks. IEEE Access. 2022;10:33679–94.
  74. Wang B, Wang C, Zhang Q, Su Y, Wang Y, Xu Y. Cross-lingual image caption generation based on visual attention model. IEEE Access. 2020;8:104543–54.
  75. Degadwala S, Vyas D, Biswas H, Chakraborty U, Saha S. Image captioning using Inception V3 transfer learning model. In: 2021 6th International Conference on Communication and Electronics Systems (ICCES). IEEE. 2021;1103–8. https://doi.org/10.1109/icces51350.2021.9489111
  76. Perdana RS, Ishida Y. Instance-based deep transfer learning on cross-domain image captioning. In: 2019 International Electronics Symposium (IES). IEEE. 2019;24–30. https://doi.org/10.1109/elecsym.2019.8901660
  77. Ayoub S, Gulzar Y, Reegu FA, Turaev S. Generating image captions using Bahdanau attention mechanism and transfer learning. Symmetry. 2022;14(12):2681.
  78. Banerjee RH, Ravi A, Dutta U. Attr2Style: A transfer learning approach for inferring fashion styles via apparel attributes. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2021;15255–61.
  79. Myntra. Online shopping India - shop online for branded shoes, clothing & accessories in India | Myntra.com. Myntra. 2007. [Online]. [cited 18 Feb 2024]. Available from: https://www.myntra.com/.
  80. Young P, Lai A, Hodosh M, Hockenmaier J. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans Assoc Comput Linguist. 2014;2:67–78.
  81. BulkTranslator. Translate one text into multiple languages. [Online]. [cited 18 Feb 2024]. Available from: https://bulktranslator.com
  82. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016;770–8.
  83. Zoph B, Vasudevan V, Shlens J, Le QV. Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018;8697–710.