
EIM: An effective solution for improving multi-modal large language models

  • Yuting Bai ,

    Contributed equally to this work with: Yuting Bai, Zixing Bai

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation School of Software, Harbin Institute of Technology, Harbin, China

  • Tonghua Su ,

    Roles Writing – review & editing

    thsu@hit.edu.cn (TS); 2322459556@qq.com (ZB)

    Affiliation School of Software, Harbin Institute of Technology, Harbin, China

  • Zixing Bai

    Contributed equally to this work with: Yuting Bai, Zixing Bai

    Roles Conceptualization, Investigation, Project administration, Software, Writing – original draft, Writing – review & editing

    thsu@hit.edu.cn (TS); 2322459556@qq.com (ZB)

    Affiliation School of Computer Science and Technology, Fudan University, Shanghai, China

Abstract

Enabling large language models (LLMs) to acquire multi-modal capabilities, such as vision-language learning, has become a research hotspot and the next milestone in LLM development with the advent of models like GPT4. The basic structure of current multi-modal LLMs usually includes three parts: an image encoder for extracting visual features, a semantic space transformation network ST for aligning the multi-modal semantic spaces, and an LLM for generating text. Current work on multi-modal LLMs primarily focuses on enhancing performance by utilizing larger image encoders and LLMs and designing more complex fine-tuning methods and STs, which escalates the number of model parameters. In this paper, we propose EIM, a novel and effective solution for improving the performance of multi-modal large language models from the perspective of the training process, an angle that is ignored and less explored in current research; it reduces the need to introduce new parameters or modify the model structure. EIM includes corresponding improvement measures for the image encoder, ST, and LLM. To validate EIM, we first apply it to ClipCap and conduct experiments on the COCO Caption dataset. Second, we extend EIM to multi-modal LLMs such as LLaMA-Adapter and LaVIN and evaluate them on the ScienceQA dataset. Finally, we conduct multi-modal chatbot experiments with the EIM-enhanced LaVIN and evaluate it on the MME benchmark. On the COCO Caption dataset, the model that applies EIM achieves a 1.75% performance improvement compared to a baseline with 3.13 times as many parameters. The experimental results on the ScienceQA dataset and MME benchmark show that EIM achieves competitive performance with 7B model parameters when compared to 13B multi-modal LLMs, which confirms the effective performance improvement of EIM for multi-modal LLMs.

Introduction

In recent years, with the increase in model parameter size, large language models (LLMs), such as the GPT series [1–4] and LLaMA series [5–8], have continuously pushed the upper limit of natural language understanding. Enabling multi-modal capabilities is usually considered the next milestone for LLMs. For instance, GPT4 [4] not only has excellent text comprehension ability but also supports multi-modal data such as images. The trend of equipping LLMs with multi-modal capabilities, such as image captioning and visual Q&A, has stimulated research and development in the field of multi-modal LLMs [9–28].

CLIP [29] has become the de facto foundation model in the current multi-modal field due to its excellent performance [30–40]. For instance, the combination of CLIP and diffusion models [41–44] can generate stunning images solely from text instructions [45–47]. With the rapid development of LLMs [9, 48], how to combine CLIP and LLMs to support multi-modal tasks, such as visual language generation, has become a research hotspot [9–15, 17–20, 22–24, 26, 57, 58].

ClipCap [49] is the earliest work to combine CLIP and language models; it converts images into text through a three-segment structure: the image encoder CLIP for extracting visual features, the semantic space transformation network ST for aligning the multi-modal semantic spaces, and the language model GPT-2 [2] for generating text. With the development of LLMs [4, 6, 48, 50] and CLIP-like models [29, 51–56], the three-segment structural design similar to ClipCap, which connects two models of different modalities through ST, has shown strong versatility and has been widely used in recent multi-modal LLMs [11, 12, 14, 19, 22, 24]. For instance, LLaMA-Adapter [19] uses LLaMA [5] as the LLM, and PandaGPT [16] uses ImageBind [56] as the image encoder and Vicuna [7, 8] as the LLM. Following the three-segment structure, some recent works, like Emu3 [59], Show-o [60], InternVL3 [61], and Janus-Pro [62], extend multi-modal LLMs from visual understanding to visual understanding and generation by introducing an additional image decoder. However, the training costs of this new structure are very high, and its biggest highlight, image generation, is underwhelming because it can only generate an image that is the same as the model's input image, lacking practical application scenarios. Therefore, we only consider the three-segment structure in this paper. In summary, current research on multi-modal LLMs is still in its early stage and mainly focuses on using larger image encoders or LLMs to improve performance.

We summarize the problems and shortcomings of current research in the field of multi-modal LLMs as below:

  • Current research on multi-modal LLMs primarily concentrates on enhancing performance by utilizing larger image encoders [29, 56] and LLMs [5, 7], designing more complex fine-tuning methods like MMA [22] and STs like Q-Former [23], and using a huge number of high-quality image-text pairs to pre-train ST to align the semantic spaces between vision and language [10–13, 25, 28], which results in an escalation of model parameters and a surge in training costs.
  • Although text features have been proven to help improve the performance of CLIP [31, 63–65], there is currently a lack of research in the field of multi-modal LLMs that utilizes the text features of CLIP. Some studies have even suggested that fine-tuning CLIP may be detrimental to downstream-task training [36, 66–68].
  • Current research on multi-modal LLMs usually only uses the auto-regressive training objective [11, 19, 22]. Optimizing multi-modal LLMs from the perspective of the training process is ignored and less explored.

In this paper, we argue that the training process of multi-modal LLMs should differ from that of traditional autoregressive models because the input is multi-modal rather than uni-modal. Moreover, optimizing the training process can improve model performance, reduces the need to introduce new parameters or modify the original model structure, is orthogonal to the mainstream solutions, and enables the utilization of the text features of CLIP. Therefore, we propose EIM, an effective solution for improving multi-modal LLMs that enhances model performance from the perspective of the training process. Our approach utilizes the text features of CLIP and adds contrastive losses on CLIP, ST, and the LLM. We conduct ablation studies on ClipCap and quantitatively analyze the effects in detail. Furthermore, we extend EIM to LLaMA-Adapter [19] and LaVIN [22] for validation. The experimental results confirm the effective performance improvement of EIM for multi-modal LLMs. Overall, our paper makes the following contributions:

  • We propose three contrastive losses that improve performance from the perspective of the training process. The proposed losses are applied to CLIP, ST, and the LLM, respectively. We further provide a solution that combines these losses to achieve stable performance improvement.
  • We propose to enhance CLIP’s capability in multi-modal LLMs by introducing textual information. Although CLIP has the ability to extract both image and text features, as it undergoes image-text alignment during the pre-training stage, previous studies primarily focus on utilizing CLIP’s visual feature extraction capability while neglecting its text feature extraction capability. To address this limitation, the CLIP text encoder is introduced to encode the textual information and help to guide the CLIP fine-tuning during the training.
  • Based on the above improvements, we propose EIM, a novel effective solution for improving the performance of multi-modal large language models from the perspective of the training process. We apply EIM to ClipCap and achieve a 1.75% performance improvement on the COCO Caption dataset compared to a baseline with 3.13 times as many parameters. Furthermore, we extend EIM to representative multi-modal LLMs, such as LLaMA-Adapter and LaVIN, and evaluate them on the ScienceQA dataset, achieving accuracy improvements of 2.76% and 2.05%, respectively. Finally, we apply EIM to LaVIN-7B-lite and evaluate it on the MME benchmark, achieving comparable performance to LaVIN-13B.

Related work

Due to the three-segment structural design widely used in current multi-modal LLMs, we introduce the related multi-modal LLMs in terms of the image encoder, ST, and LLM.


Image encoder.

PandaGPT [16] extracts visual features using ImageBind [56]. Compared to CLIP [29], ImageBind supports a wider variety of modal data. Similar to PandaGPT, ImageBind-LLM [28] also uses ImageBind to encode more modal data. The EVA-CLIP series [51–53], which achieves better performance by improving the training techniques of CLIP, is broadly adopted in models like BLIP-2 [23], InstructBLIP [24], MiniGPT-4 [12], Lynx [25], Ziya-Visual [17], and LLaVA-1.5 [10]. Unlike existing works, we improve the performance of CLIP by using contrastive learning, introducing textual information, removing CLIP's output layer, and using the full visual features, rather than replacing CLIP with larger image encoders like ImageBind and the EVA-CLIP series.

ST.

Most current multi-modal LLMs, such as the LLaVA series [10, 11], LLaMA-Adapter [19, 20], Otter [15], MultiModal-GPT [18], MiniGPT-4 [13], VisualGLM-6B [69, 70], and InstructBLIP [24], use a large number of high-quality image-text pairs to pre-train ST to align the semantic spaces between vision and language, which results in a surge in training costs. Although LaVIN [22] reduces the training costs through Mixture-of-Modality Training without a pre-training stage, it has no alignment design for ST. Therefore, this paper proposes a novel contrastive loss for ST that reduces the training costs while also taking the alignment between ST and the LLM into account.

LLM.

LLaVA [10, 11], LLaMA-Adapter [19, 20], and LaVIN [21, 22] use the LLaMA series [5–8] and use PETL (Parameter-Efficient Transfer Learning) [19–22, 71–79] to optimize the model. Unlike existing works, we introduce contrastive learning to help improve the performance of LLMs rather than solely relying on PETL and replacing LLMs with larger ones.

Method

As shown in Fig 1, EIM includes the following improvement measures. Firstly, the CLIP text encoder CLIP.text is introduced to encode the textual information and help to guide the CLIP fine-tuning through lossIE during the training. Secondly, the contrastive loss lossST is proposed to align the visual and text semantics. Finally, the contrastive loss lossLM is proposed to help to guide the text generation.

Fig 1. The overview of EIM.

EIM is an effective solution for improving the performance of multi-modal large language models from the perspective of the training process. EIM includes three contrastive losses, lossIE, lossST, and lossLM, and introduces CLIP.text. lossIE is the contrastive loss for the image encoder CLIP.visual. lossST is the contrastive loss for the semantic space transformation network ST. lossLM is the contrastive loss for the LLM. lossIE and lossLM should be used together with fine-tuning methods. There is no limitation on the CLIP and LLM fine-tuning methods, which depend on the original model implementations. If the original multi-modal LLM does not provide a fine-tuning method for CLIP, we can skip lossIE or use the prompt-tuning method by default.

https://doi.org/10.1371/journal.pone.0329590.g001

Following the previous contrastive learning works [49, 80–84], a queue stores key vectors for all samples; the positive sample refers to the input sample itself, and the negative samples refer to all other samples in the dataset. The details are shown in Fig 2. The contrastive loss aims to maximize the similarity between the query vector qi and its key vector ki, while minimizing the similarity between qi and all other kj, where j ≠ i.

Fig 2. The overview of contrastive learning.

For a query qi, the positive sample is the key ki, and the negative samples are kj, where j ≠ i. The larger the queue K, the richer the visual information and features it can represent, and thus the more image features can be learned when using queries for contrastive learning. Therefore, previous works usually use all samples to build the queue K.

https://doi.org/10.1371/journal.pone.0329590.g002
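The contrastive objective described above can be sketched as a standard InfoNCE-style loss over a pre-built key queue. The following is a minimal NumPy sketch under our own naming (the paper does not publish reference code); the positive key for query qi sits at index i of the queue, and all other entries act as negatives.

```python
import numpy as np

def info_nce(queries, keys, tau=0.07):
    """InfoNCE-style contrastive loss against a pre-built key queue.

    queries: (N, d) query vectors q_i; keys: (N, d) queue K, where K[i]
    is the positive key for q_i and every K[j], j != i, is a negative.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / tau                        # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal
```

When queries and keys are already aligned, the diagonal of the similarity matrix dominates and the loss approaches zero; mismatched pairs drive it toward log N.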

It is worth noting that in current multi-modal LLMs, using the output layer of CLIP is detrimental to downstream-task training due to the differences between the pre-training task and downstream tasks, and using only partial visual features loses information. Therefore, this paper improves the CLIP image encoding capability by modifying the usage of CLIP features: removing CLIP's output layer and using the full visual features. CLIP's output layer is a linear layer that changes the hidden dimension to the output dimension, ensuring that the output dimension of CLIP.visual is consistent with that of CLIP.text. The full visual features are the [CLS] token and the patch tokens extracted by CLIP.
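The two modifications can be illustrated with a shape-level sketch (the sizes below are illustrative ViT-B-like values, not the exact CLIP configuration): instead of projecting only the [CLS] hidden state through CLIP's output layer, EIM keeps the unprojected hidden states of the [CLS] token and all patch tokens.

```python
import numpy as np

hidden_dim, n_patches, embed_dim = 768, 49, 512    # assumed ViT-B/32-like sizes
proj = np.random.randn(hidden_dim, embed_dim)      # CLIP's output layer (dropped by EIM)

hidden_states = np.random.randn(1 + n_patches, hidden_dim)  # [CLS] + patch tokens

standard_feature = hidden_states[0] @ proj  # usual CLIP output: only [CLS], projected
full_features = hidden_states               # EIM: all tokens, no output projection

assert standard_feature.shape == (embed_dim,)
assert full_features.shape == (1 + n_patches, hidden_dim)
```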

Base structure

Fig 3 illustrates the basic structure of multi-modal LLMs, which includes three parts: the image encoder CLIP.visual for extracting visual features, ST for aligning the multi-modal semantic spaces, and LLM for generating text.

Fig 3. The overview of multi-modal LLMs.

The structure of multi-modal LLMs contains three modules: the image encoder CLIP.visual for extracting visual features, ST for aligning the multi-modal semantic spaces, and LLM for generating text.

https://doi.org/10.1371/journal.pone.0329590.g003

Current models usually use the auto-regressive training objective, which is called lossbase in this paper. Given the response r = (r1, …, rm) and the ST output features fST, the training objective lossbase is defined by:

lossbase = - Σ_{j=1}^{m} log p(rj | fST, r<j) (1)

Here, m represents the length of the response.
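Concretely, lossbase is the usual token-level negative log-likelihood over the response. Below is a minimal sketch (our own helper, operating on pre-computed next-token logits rather than a real LLM):

```python
import numpy as np

def autoregressive_loss(logits, response_ids):
    """loss_base: summed negative log-likelihood of the response tokens.

    logits: (m, V) next-token logits, each row conditioned on f_ST and the
    previous response tokens; response_ids: (m,) ground-truth token ids.
    """
    logits = logits - logits.max(axis=1, keepdims=True)        # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    m = len(response_ids)
    return -log_probs[np.arange(m), response_ids].sum()        # -sum log p(r_j | ...)
```

With uniform logits over a vocabulary of size V, the loss is exactly m·log V, a useful sanity check.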

CLIP text encoder and contrastive loss for image encoder

Previous studies primarily focus on utilizing CLIP’s visual feature extraction capability while neglecting its text feature extraction capability. Therefore, as illustrated in Fig 4, CLIP.text is introduced to encode the textual information and help to guide the CLIP fine-tuning through lossIE during the training.

Fig 4. CLIP.text and lossIE.

The semantic spaces of CLIP.text and CLIP.visual are aligned through lossIE. Kt is the pre-built queue, which contains all [EOS] token features extracted by CLIP.text from the image captions in the training dataset. vi denotes the i-th image's [CLS] token vector output by CLIP.visual. In addition, lossIE should be used together with fine-tuning methods. If the original multi-modal LLM does not provide a fine-tuning method for CLIP, we can skip lossIE or use the prompt-tuning method by default.

https://doi.org/10.1371/journal.pone.0329590.g004

Given the pre-built queue Kt, which contains all [EOS] token features extracted by CLIP.text from the image captions in the training dataset, and vi, the i-th image's [CLS] token vector output by CLIP.visual, lossIE is defined by:

lossIE = -(1/N) Σ_{i=1}^{N} log ( exp(sim(vi, Kt,i)/τ) / Σ_{j=1}^{N} exp(sim(vi, Kt,j)/τ) ) (2)

Here, τ is the temperature coefficient, N is the number of image-text pairs in the training dataset, Kt,i denotes the i-th vector in Kt, and sim(·, ·) denotes cosine similarity.
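The queue Kt is built once before training with the frozen CLIP text encoder. A sketch of that pre-building step follows; the `text_encoder` callable and its per-token output shape are our assumptions, not an actual CLIP API.

```python
import numpy as np

def build_text_queue(token_id_lists, text_encoder, eos_id):
    """Pre-build K_t for loss_IE: one [EOS] feature per training caption.

    token_id_lists: list of token-id lists, one per caption;
    text_encoder: callable returning per-token features of shape (len, dim).
    """
    feats = []
    for ids in token_id_lists:
        hidden = text_encoder(ids)                  # (len(ids), dim)
        feats.append(hidden[ids.index(eos_id)])     # keep the [EOS] feature
    return np.stack(feats)                          # K_t: (N, dim)
```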

Contrastive loss for ST

Most of current multi-modal LLMs use a large number of high-quality image-text pairs for pre-training ST to align the semantic spaces between vision and language, which results in a surge in training costs. Therefore, as illustrated in Fig 5, lossST is proposed to help align the multi-modal semantic spaces between vision and language with affordable training costs.

Fig 5. lossST, which is used to align the semantic spaces of images and texts by maximizing the similarity between si and the i-th vector of the queue Ke.

Ke is the pre-built queue containing the averaged sentence token features, which are extracted by LMe from the instructions and responses in the training dataset. si is the vector that averages the token features of the i-th image output by ST. LMe represents the LLM's embedding layer. LMblock represents the hidden layers of the LLM. LMproj is the output layer of the LLM.

https://doi.org/10.1371/journal.pone.0329590.g005

Given the pre-built queue Ke, which contains the averaged sentence token features extracted by LMe from the instructions and responses in the training dataset, and si, the vector that averages the i-th image's token features output by ST, lossST is defined by:

lossST = -(1/N) Σ_{i=1}^{N} log ( exp(sim(si, Ke,i)/τ) / Σ_{j=1}^{N} exp(sim(si, Ke,j)/τ) ) (3)

Here, τ is the temperature coefficient, N is the number of image-text pairs in the training dataset, Ke,i denotes the i-th vector in Ke, and sim(·, ·) denotes cosine similarity.
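For one sample, the lossST query/key pair can be sketched as follows (identifiers are ours): the key, stored in the pre-built queue Ke, is the mean of the LM embedding-layer (LMe) features of the instruction + response tokens, and the query is the mean of the ST output tokens, so ST is pushed toward the LLM's own embedding space.

```python
import numpy as np

def st_query_and_key(st_tokens, text_ids, lm_embedding):
    """Build the loss_ST query/key pair for one sample (a sketch).

    st_tokens: (k, dim) tokens output by ST for the image;
    text_ids: token ids of the instruction + response;
    lm_embedding: (vocab, dim) LM_e embedding table.
    """
    query = st_tokens.mean(axis=0)              # mean over ST output tokens
    key = lm_embedding[text_ids].mean(axis=0)   # mean over LM_e features
    return query, key
```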

Contrastive loss for LLM

Unlike existing works, as illustrated in Fig 6, the contrastive loss lossLM is proposed to help improve the performance of LLMs, rather than solely relying on using PETL and replacing LLMs with larger ones.

Fig 6. lossLM, which is used to align the semantic spaces of the tuning LLM and the original LLM by maximizing the similarity between hi and the i-th vector of the queue Kb.

Kb is the pre-built queue, which contains all [EOS] token features extracted by the original LLM's LMblock from the instructions and responses in the training dataset. hi is the i-th text's last token vector output by the tuning LLM's LMblock. LMblock represents the hidden layers of the LLM. LMproj is the output layer of the LLM.

https://doi.org/10.1371/journal.pone.0329590.g006

Given the pre-built queue Kb, which contains all [EOS] token features extracted by the original LLM's LMblock from the instructions and responses in the training dataset, and hi, the i-th text's last token vector output by the tuning LLM's LMblock, lossLM is defined by:

lossLM = -(1/N) Σ_{i=1}^{N} log ( exp(sim(hi, Kb,i)/τ) / Σ_{j=1}^{N} exp(sim(hi, Kb,j)/τ) ) (4)

Here, τ is the temperature coefficient, N is the number of image-text pairs in the training dataset, Kb,i denotes the i-th vector in Kb, and sim(·, ·) denotes cosine similarity.
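lossLM has the same InfoNCE form as the other losses; what differs is where the vectors come from. A sketch (names are ours): the queue Kb is pre-built from the frozen original LLM and carries no gradient, so only the tuning LLM's last-token hidden states are optimized, which regularizes fine-tuning toward the original model's semantic space.

```python
import numpy as np

def loss_lm(tuned_last_hidden, K_b, tau=0.07):
    """tuned_last_hidden: (N, d) last-token LM_block states of the tuning
    LLM; K_b: (N, d) frozen [EOS] states from the original LLM's LM_block
    (pre-built once, so no gradient flows into it)."""
    h = tuned_last_hidden / np.linalg.norm(tuned_last_hidden, axis=1, keepdims=True)
    k = K_b / np.linalg.norm(K_b, axis=1, keepdims=True)
    logits = h @ k.T / tau
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))   # positive pair: (h_i, K_b[i])
```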

Segmented training

In experiments, we find that although the three contrastive losses lossIE, lossST, and lossLM can improve the performance of multi-modal LLMs when applied individually, using them simultaneously in one training stage may lead to training instability. To avoid the training instability, we use the segmented training to isolate the influence between different losses, thus enabling the stable performance improvements when using the combination of the three contrastive losses. In each stage, the training data is the training part of the dataset. Table 1 shows the details of segmented training.
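The schedule above can be sketched as a simple driver loop. The stage names, the final-stage objective, and the `run_stage` callback are our own abstractions; each stage sees the same training split and optimizes only its own loss, which is what isolates the objectives from each other.

```python
# Each tuple: (stage name, objective, modules unfrozen in that stage).
STAGES = [
    ("st", "loss_ST", ["ST"]),                   # align ST first
    ("lm", "loss_LM", ["LLM"]),                  # then preserve the LLM
    ("final", "loss_base + loss_IE", ["all"]),   # end-to-end fine-tuning
]

def segmented_training(stages, run_stage):
    """Run the stages strictly in order; run_stage(name, objective,
    trainable) performs one full stage of training over the train split."""
    history = []
    for name, objective, trainable in stages:
        run_stage(name, objective, trainable)
        history.append(name)
    return history
```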

Experiments

Following LLaMA-Adapter and LaVIN, we first apply EIM to ClipCap and evaluate the model on the traditional image captioning dataset COCO, then extend EIM to the representative multi-modal LLMs LaVIN and LLaMA-Adapter and evaluate them on the first large-scale multi-modal dataset ScienceQA, and finally apply EIM to LaVIN and evaluate the model on the first multi-modal LLM evaluation benchmark MME. We choose ClipCap because it still serves as a baseline in LLaMA-Adapter V2. We choose LaVIN and LLaMA-Adapter because they are representative works among current multi-modal LLMs, with significant differences in fine-tuning methods and ST structure, which provides a more comprehensive test of our solution. LLaMA-Adapter does not fine-tune CLIP, uses a Transformer structure for ST, and uses the prompting method to fine-tune the LLM. In contrast, LaVIN uses an adapter to fine-tune CLIP, uses a lightweight MLP as ST, and uses the adapter MMA to fine-tune the LLM.

On the COCO Caption dataset, the EIM-enhanced ClipCap achieves a 1.75% performance improvement compared to a baseline with 3.13 times as many parameters. Furthermore, we extend EIM to the representative multi-modal LLMs, including LLaMA-Adapter and LaVIN, and evaluate them on the ScienceQA dataset, achieving accuracy improvements of 2.76% and 2.05%, respectively, which confirms the effective performance improvement of EIM for multi-modal LLMs. Finally, we apply EIM to LaVIN-7B and evaluate it on the MME benchmark, achieving comparable performance to LaVIN-13B. The rest of this section is organized in the order of the datasets and metrics, the implementation details, the experimental results of ClipCap on the COCO dataset, and the experimental results of two representative multi-modal LLMs on ScienceQA.

Datasets and metrics

COCO caption.

COCO Caption dataset [85] contains 0.6M training image caption data (120k images with 5 captions per image) over a wide range of distributions. We split the dataset according to the Karpathy [86] split. Similar to Oscar [87], we validate ClipCap over the COCO Caption dataset using the common metrics BLEU4 [88], METEOR [89], CIDEr [90], SPICE [91] and ROUGE_L [92], and the main reference metric is BLEU4.

ScienceQA.

ScienceQA [93] is the first large-scale multi-modal dataset designed for science question answering, covering multiple fields, including 3 subjects, 26 topics, 127 categories, and 379 skills. The dataset includes pure text and text-image examples, which are divided into three parts: train, validation, and test, with 12,726, 4,241, and 4,241 examples, respectively. Following LaVIN [22] and LLaMA-Adapter [19], we evaluate the models on the ScienceQA dataset using the average accuracy.

Alpaca-52k & LLaVA-158k & MME.

Alpaca-52k [94] contains 52k text-only instruction-following examples generated by GPT-3.5 [95]. LLaVA-158k [11] is a large-scale image-text instruction dataset whose answers are automatically generated by GPT-4 [4]. Following LaVIN [22], we train the multi-modal chatbot model on Alpaca-52k & LLaVA-158k and evaluate the model on the first multi-modal LLM evaluation benchmark MME [96], which is free and widely used in 50+ recent multi-modal LLMs.

MME includes two major categories: perception and cognition. The former, with 10 subtasks, involves recognizing specific objects in images, while the latter, with 4 subtasks, is more challenging, requiring complex answers to be deduced from visual information. MME manually designs the annotations of instruction-answer pairs to avoid the data leakage that may arise from directly using public datasets for evaluation. For each test image, MME adopts an instruction consisting of a question and the description "Please answer yes or no", which prompts LLMs to answer "yes" or "no". The full score of MME is 2800.

Implementation details

The performance of EIM applied to ClipCap is evaluated on the COCO Caption dataset with a single RTX-3090 graphics card. Following the settings in ClipCap, the epochs, batch size, and learning rate are set to 10, 40, and 2E-5, respectively. To apply EIM to ClipCap, we modify ST because the full visual features are used instead of a single visual feature, and we use segmented training to ensure the stability of the training process. The segmented training sequence is lossST, lossLM, and the final stage, which combines lossbase and lossIE. To reduce the training costs and ensure a fair comparison, only the final stage trains the entire model. In the lossST stage, only ST is trained through the training objective lossST. In the lossLM stage, only the LLM is trained through the training objective lossLM.

The performance of EIM applied to the representative models is evaluated on the ScienceQA dataset with four A100 graphics cards. Following the settings in LaVIN and LLaMA-Adapter, the epochs, batch size, and learning rate are 20, 32, and 9E-3, respectively. Because the adapter and prompt methods used in LaVIN and LLaMA-Adapter harm training stability, a new training stage, Adapter, which fine-tunes the multi-modal LLM solely with lossbase, is added after the lossLM stage. Therefore, the segmented training sequence is lossST, lossLM, Adapter, and the final stage, which combines lossbase and lossIE.

We also apply EIM to LaVIN-7B-lite, train the chatbot model on four RTX-3090 graphics cards, and evaluate on the MME benchmark. Following LaVIN, the training settings are the same, except that the training epochs are 15 in the original LaVIN and 10 in ours.

Following previous works [80, 84], the temperature coefficient τ is set to 0.07. It is worth noting that in our solution, lossIE uses the image caption, while lossST and lossLM use the instruction and response. These losses are not applied in the inference stage, so they place no additional restrictions on model deployment or application scenarios. In practice, the instruction is empty and the response is the caption in the COCO Caption dataset. In the ScienceQA dataset, the instruction includes the question, context, and options, and the response includes the answer, lecture, and solution. We think that both visual question answering and image captioning can be seen as tasks that generate a response given an image and an instruction as input, and that the combination of the instruction and response describes the image from another perspective. In particular, we observe that the solution in ScienceQA contains a lot of image-related information. Therefore, if the proposed losses are effective on image captioning datasets like COCO Caption, they should also be effective on visual question answering datasets like ScienceQA.
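The text seen by each loss can be summarized in a small helper (the dict layout and field names are our own; the pairing follows the description above):

```python
def texts_for_losses(sample):
    """loss_IE sees the image caption; loss_ST and loss_LM see the
    concatenated instruction + response."""
    instr_resp = (sample["instruction"] + " " + sample["response"]).strip()
    return {
        "loss_IE": sample["caption"],
        "loss_ST": instr_resp,
        "loss_LM": instr_resp,
    }

# COCO Caption: the instruction is empty and the response is the caption,
# so all three losses end up seeing the caption text.
coco = {"caption": "a dog runs on the beach",
        "instruction": "",
        "response": "a dog runs on the beach"}
```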

Experimental results on COCO

We choose ClipCap as the baseline model and conduct experiments to quickly validate the feasibility of our ideas. ClipCap uses CLIP ViT-B/32 as the visual encoder and GPT-2 as the language model. The ST in ClipCap is an MLP with one hidden layer and is used to translate a single 512-dimensional visual token into 10 tokens of 768 dimensions. ClipCap does not provide a fine-tuning method for CLIP and fully fine-tunes GPT-2.

From Table 2, applying EIM to ClipCap achieves a 2.7% performance improvement in BLEU4 compared to the original ClipCap. Furthermore, the EIM-enhanced model achieves a 1.75% performance improvement with only 31.99% of the total parameters of the larger baseline. In practice, the parameter size of the EIM-enhanced model is smaller than that of the original ClipCap because of a structural change to ST, caused by using the full visual features instead of a single visual feature and by removing CLIP's output layer. In the original ClipCap, the input dimension of ST is 512, the hidden dimension is 3840, and the output dimension is 7680. Because this ST converts one token into multiple tokens, using the full visual features would change the input, hidden, and output dimensions to 197376, 98688, and 197376, respectively, which is far too large (about 39B parameters). Therefore, we change the ClipCap ST structure to a simple FFN with an input dimension of 768, a hidden dimension of 3072, and an output dimension of 768. The ST parameters are thereby reduced from the original 31.47M to 4.72M. While the prompting method introduces 26.07M parameters, the number of trainable parameters is still reduced by 0.68M. In total, the parameters are reduced by 1.08M: the sum of the 0.68M reduction in trainable parameters and the 0.4M of the removed CLIP output layer. Moreover, we do not need to modify the ST to this extent in current representative models such as LaVIN and LLaMA-Adapter. For instance, we only modify the input dimension of ST from 768 to 1024 in LLaMA-Adapter and do not need to modify the ST in LaVIN at all.

Table 2. The experimental results (%) of applying EIM to ClipCap on the COCO Caption dataset.

https://doi.org/10.1371/journal.pone.0329590.t002
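The ST parameter arithmetic described above can be checked directly, counting each linear layer as in·out weights plus an out-dimensional bias:

```python
def linear_params(d_in, d_out):
    return d_in * d_out + d_out   # weights + bias

# Original ClipCap ST: 512 -> 3840 -> 7680 (one token -> 10 x 768 tokens).
orig_st = linear_params(512, 3840) + linear_params(3840, 7680)
# Naive full-feature ST: 197376 -> 98688 -> 197376 (infeasible, ~39B params).
naive_st = linear_params(197376, 98688) + linear_params(98688, 197376)
# EIM's replacement FFN: 768 -> 3072 -> 768.
eim_st = linear_params(768, 3072) + linear_params(3072, 768)

print(round(orig_st / 1e6, 2))   # 31.47 (M)
print(round(naive_st / 1e9))     # 39 (B)
print(round(eim_st / 1e6, 2))    # 4.72 (M)
```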

Ablation study.

To demonstrate the reliability of EIM in detail, ablation experiments are conducted. As shown in Table 3, using lossST and lossLM increases the performance by 0.8% and 0.6%, respectively. Using lossIE improves the performance by 1.6%, with the parameters increased by the default prompt network. In contrast, if the original model provides a fine-tuning method for CLIP, lossIE can be used directly without increasing parameters. We also present case studies in Fig 7, which show that the EIM-enhanced model generates more accurate and detailed captions than the original ClipCap.

Fig 7. Case study of applying EIM to ClipCap on the COCO Caption dataset.

We show the caption generated by the original ClipCap, the caption generated by the EIM-enhanced model, and the ground truth. The EIM-enhanced model generates more accurate and detailed captions than the original ClipCap. Incorrect text is highlighted in red.

https://doi.org/10.1371/journal.pone.0329590.g007

Segmented training.

The experimental results of the segmented training are detailed in Table 4. The results show that removing CLIP's output layer and using the full visual features on the base model improves the performance by 1.7%. After introducing the lossST stage, the performance is improved by 1.9%; after introducing the lossLM stage, by 2.3%; and after introducing the final stage, which combines lossbase and lossIE, by 2.7%.

Why use segmented training?

We have tried combining different losses, but the performance is unstable. As shown in Table 5, when we use lossbase, lossST, lossLM, and lossIE together in one training stage, the performance is 33.0%. When we use segmented training, the performance rises to 33.9%, a 0.9% improvement. As shown in Table 6, when we use both lossST and lossLM simultaneously, the performance is 32%. The result is lower than we expected. We speculate that this is due to the inconsistent optimization goals among the different losses: lossIE is designed to help the visual encoder adapt to downstream tasks, lossST is designed to help ST align images and text, and lossLM is designed to preserve the performance of the LLM. Therefore, we completely eliminate the interference between these three losses through gradient truncation and segmented training to achieve stable performance improvement.

Table 5. Experimental results (%) of applying lossST, lossLM, and lossIE in one training stage and segmented training.

https://doi.org/10.1371/journal.pone.0329590.t005

Table 6. Experimental results (%) with the combination of lossST and lossLM.

https://doi.org/10.1371/journal.pone.0329590.t006

We follow the common training approach of multi-modal LLMs, that is, first pre-training ST and then fine-tuning the LLM. lossIE is placed last as a supplement because models like LLaMA-Adapter do not fine-tune CLIP. We also conduct experiments to confirm the order of these losses. As shown in Table 7, we verify whether the pre-training order of lossST and lossLM is reasonable. We also verify whether lossIE, as a supplement, should be placed at the beginning or at the end of the training sequence, and the results are shown in Table 8. In practice, we think lossIE should be used together with lossbase to prevent excessive fine-tuning of CLIP. We can only provide such a stable version right now and will try to integrate these losses in future work.

Table 7. Experimental results (%) under different orders of lossST and lossLM.

https://doi.org/10.1371/journal.pone.0329590.t007

Table 8. Experimental results (%) of lossIE at both ends of the sequence.

https://doi.org/10.1371/journal.pone.0329590.t008

Modifying the usage of CLIP features.

In this paper, modifying the usage of CLIP features includes removing CLIP's output layer and using the full visual features. According to Table 9, when CLIP's output layer is removed or the full visual features are used alone, the performance is 31.6% and 31.8%, respectively. However, when both are applied simultaneously, the performance improves to 32.9%. We speculate that the visual features obtained after removing the output layer contain more information, which benefits the use of the full visual features. Our findings indirectly confirm the speculation that CLIP's output layer is detrimental to downstream-task training due to the differences between the pre-training task and downstream tasks.
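The two modifications can be illustrated with a toy ViT-style encoder. This is a sketch under our own assumptions (the dimensions and module names are hypothetical), showing where the output layer sits and what "full visual features" means:

```python
import torch
import torch.nn as nn

class ToyCLIPVision(nn.Module):
    """Minimal stand-in for a CLIP vision tower: transformer-like blocks
    followed by an output projection that pools to one global vector."""
    def __init__(self, dim=32, out_dim=8):
        super().__init__()
        self.blocks = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                    nn.Linear(dim, dim))
        self.proj = nn.Linear(dim, out_dim)  # CLIP-style output layer

    def forward(self, patches):
        h = self.blocks(patches)       # (B, n_patches, dim): full features
        pooled = self.proj(h[:, 0])    # (B, out_dim): projected global vector
        return h, pooled

model = ToyCLIPVision()
patches = torch.randn(2, 16, 32)
full_tokens, global_vec = model(patches)

# EIM-style usage: bypass the output projection and pass ALL token
# features to the ST, rather than only the pooled, projected vector.
print(full_tokens.shape)  # torch.Size([2, 16, 32])
print(global_vec.shape)   # torch.Size([2, 8])
```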

Table 9. Experimental results (%) of modifying the usage of CLIP features.

https://doi.org/10.1371/journal.pone.0329590.t009

Comparison of training efficiency.

According to Table 2, EIM improves performance with fewer parameters, which means there is no additional hardware overhead. However, due to the segmented training method used in this paper, the training time is increased compared to . One epoch of training for , , and takes 23 minutes, 75 minutes, and 51 minutes, respectively. The training time for the three stages of segmented training is 0.9 minutes, 6.4 minutes, and 43.7 minutes, respectively. The increase in training time mainly comes from the stage, due to differences in CLIP usage. In , lossIE necessitates running CLIP during the training phase. In contrast, ClipCap uses visual features pre-extracted from CLIP rather than running CLIP during training. This is confirmed in the subsequent LLaMA-Adapter and LaVIN experiments: both models already run CLIP during training instead of using pre-extracted visual features, so lossIE does not increase their training time.
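The difference in CLIP usage can be made concrete with a small sketch (a hypothetical encoder and dataset, not the paper's code): pre-extraction pays the encoder cost once, while lossIE forces the encoder into every training step.

```python
import torch

encoder = torch.nn.Linear(8, 4)  # stand-in for a frozen CLIP encoder
dataset = [torch.randn(8) for _ in range(4)]

# ClipCap-style pre-extraction: encode the whole dataset once, offline,
# so the training loop only ever reads cached features.
with torch.no_grad():
    cache = [encoder(x) for x in dataset]

# With lossIE the encoder must instead stay inside the training loop
# (so it can be fine-tuned), which is what lengthens that stage.
print(len(cache), cache[0].shape)  # 4 torch.Size([4])
```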

Results on ScienceQA

We choose LaVIN and LLaMA-Adapter because they are representative works among current multi-modal LLMs, with significant differences in fine-tuning methods and ST structure, which provides a more comprehensive test of our solution. LLaMA-Adapter uses CLIP ViT-L/14 as the visual encoder, a Transformer-based ST, and LLaMA as the LLM. In terms of fine-tuning, LLaMA-Adapter does not fine-tune CLIP and fine-tunes the LLM through prompting. In contrast, LaVIN uses CLIP ViT-L/14 as the visual encoder, a lightweight MLP as the ST, and LLaMA as the LLM. In terms of fine-tuning, LaVIN uses adapters to fine-tune both CLIP and the LLM.

As shown in Tables 10 and 11, we extend EIM to the representative multi-modal LLMs LLaMA-Adapter and LaVIN and evaluate them on the ScienceQA dataset, achieving accuracy improvements of 2.76% and 2.05%, respectively, which confirms the effective performance improvement of EIM for multi-modal LLMs. It should be noted that because LaVIN takes the output features of the ST as the input of the LLM, and LaVIN's LLM setting imposes a maximum input length, we choose 50 visual token vectors instead of using the full visual features. In contrast, LLaMA-Adapter uses the full visual features.
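The token-budget constraint can be sketched as follows. Taking the first 50 tokens here is our own simplifying assumption; this passage does not specify the selection rule, and the token-grid size is only an illustrative CLIP ViT-L/14-like shape.

```python
import torch

# Hypothetical CLIP ViT-L/14-like output: 1 CLS token + 256 patch tokens.
full_tokens = torch.randn(1, 257, 1024)

# Keep only 50 visual token vectors so the ST output still fits within
# the LLM's maximum input length (the truncation rule is illustrative).
visual_tokens = full_tokens[:, :50]
print(visual_tokens.shape)  # torch.Size([1, 50, 1024])
```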

Table 10. The experimental results (%) of applying EIM to LLaMA-Adapter on the ScienceQA dataset.

https://doi.org/10.1371/journal.pone.0329590.t010

Table 11. The experimental results (%) of applying EIM to LaVIN on the ScienceQA dataset.

https://doi.org/10.1371/journal.pone.0329590.t011

Results on LLaMA-Adapter.

Table 10 shows the experimental results of applying EIM to LLaMA-Adapter. The performance improves by 2.76%. However, due to the default prompt-tuning method, the number of trainable parameters increases by nearly 55M. Therefore, we also provide results for only the first three training stages, which improve performance by 2.43% without increasing the number of parameters. It is worth noting that when we apply lossST and lossLM, performance increases significantly on all tasks except LAN and NO. After lossIE is applied, performance on the LAN and NO tasks increases by 1.01% and 1.12%, respectively, which indicates that lossIE and the other two losses are complementary. Besides, the G7-12 task improves from 86.82% to 87.74%, which indicates that lossIE can further improve complex reasoning ability. However, we also notice a slight performance decrease on the IMG and SOC tasks, which rely more on image understanding. We think this is because the ScienceQA dataset contains text-only samples, which are better excluded from lossIE.

Results on LaVIN.

From Table 11, LaVIN-7B- and LaVIN-7B- achieve optimal or suboptimal performance on almost all tasks even when compared to the best model, LaVIN-13B. LaVIN-7B- achieves a 2.05% performance improvement. It does not increase the number of model parameters or trainable parameters compared to LaVIN-7B-lite(llama) because LaVIN already provides an Adapter to fine-tune CLIP. Furthermore, LaVIN-7B- achieves a 1.18% performance improvement with only 7B total parameters compared to the 13B-parameter model LaVIN-13B-lite(llama). In addition, we also provide results for LaVIN-7B-, which applies EIM to LaVIN-7B-lite(llama) with the LLM adapter network trained in the Adapter stage and improves performance by 1.65%. Compared to the 13B model LaVIN-13B-lite(llama), LaVIN-7B- also achieves a 0.78% improvement. The improvement of EIM on LaVIN is weaker than on ClipCap and LLaMA-Adapter. We speculate this is mainly because we use only 20% of the visual features instead of the full visual features.

Comparison of training efficiency.

The training time of one epoch increases by 8 minutes after applying EIM to LLaMA-Adapter: 20 seconds from the lossST stage, 1 minute from the lossLM stage, and nearly 7 minutes from the Adapter stage. After applying EIM to LaVIN, the training time of one epoch increases by 8.5 minutes: 27 seconds, 1.05 minutes, and 7 minutes from the lossST stage, the lossLM stage, and the Adapter stage, respectively. Introducing lossIE in the stage does not increase the training time in either experiment, which confirms the analysis that the extra time in the ClipCap experiments stems from the different use of CLIP. Although the training time in both experiments increases by nearly 120%, primarily due to the Adapter stage, we believe this is acceptable given the performance improvements.

Experimental results on MME

In this paper, we apply EIM to LaVIN-7B-lite, follow the original LaVIN model [22] to train the chatbot model on the Alpaca-52k and LLaVA-158k datasets, and evaluate it on the first multi-modal LLM evaluation benchmark, MME [96].

As shown in Tables 12 and 13, our model LaVIN-7B- achieves comparable performance in the Perception and Cognition tasks compared to LaVIN-13B. LaVIN-7B- achieves significantly better performance in the Celebrity, Scene, Landmark, Artwork, and Text Translation tasks, and comparable performance in the Position, Color, Poster, Commonsense Reasoning, and Code Reasoning tasks. This shows that our solution EIM can improve image understanding and complex reasoning while preserving the knowledge of the LLM. For instance, compared to LaVIN-13B, LaVIN-7B- achieves improvements of 32.36, 7.25, and 4.5 points in the Celebrity, Landmark, and Artwork tasks, which rely more on image understanding and the knowledge of the LLM. LaVIN-7B- also achieves a 14.5-point improvement in the Scene task, which relies more on image understanding, and a 10-point improvement in the Text Translation task, which relies more on complex reasoning. However, the performance of LaVIN-7B- on the Existence, Count, and OCR tasks is lower than that of LaVIN-13B. We speculate this is because we use only 20% of the visual features in LaVIN-7B- rather than the full visual features, which reduces performance on the Existence, Count, and OCR tasks, where partial visual features have a significant impact on the results. Therefore, we think introducing more partial visual features in future work can improve the performance of LaVIN-7B- on tasks like OCR, Count, and Existence that are sensitive to local image details. This will also benefit tasks like Position, which require local image details to better understand the image.

Table 12. Perception performance comparison on MME benchmark.

https://doi.org/10.1371/journal.pone.0329590.t012

Table 13. Cognition performance comparison on MME benchmark.

https://doi.org/10.1371/journal.pone.0329590.t013

Conclusion

In this paper, we propose EIM, a novel and effective solution for improving the performance of multi-modal large language models from the perspective of the training process. EIM contains three losses: lossIE, lossST, and lossLM, proposed for CLIP, the ST, and the LLM, respectively. lossIE is designed to help the visual encoder adapt to downstream tasks, lossST is designed to help the ST align image and text, and lossLM is designed to preserve the performance of the LLM. These losses can be used separately, and we further provide a stable solution with segmented training to combine them. In particular, if the original model, like LLaMA-Adapter, does not fine-tune CLIP, we can omit lossIE to achieve a 2.43% improvement, or use lossIE to achieve a further 2.76% improvement with certain new trainable parameters introduced as needed.

To validate EIM, we first apply it to and conduct experiments on the COCO Caption dataset. The experimental results show that we achieve a 1.75% performance improvement with only 31.99% of the total parameters compared to . Secondly, we extend EIM to multi-modal LLMs such as LLaMA-Adapter and LaVIN and evaluate them on the ScienceQA dataset. Finally, we conduct multi-modal chatbot experiments with the EIM-enhanced LaVIN and evaluate it on the MME benchmark. The experimental results on the ScienceQA dataset and MME benchmark show that EIM achieves competitive performance with 7B model parameters compared to the 13B multi-modal LLMs, which confirms the effective performance improvement of EIM for multi-modal LLMs. However, the segmented training used to ensure the stability of the training process inevitably increases the training time and may limit performance. We will make improvements in future work.

Acknowledgments

We thank all the anonymous reviewers for their suggestions.

References

1. Radford A, Narasimhan K. Improving language understanding by generative pre-training. OpenAI blog; 2018.
2. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI blog; 2019.
3. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems. 2020;33:1877–901.
4. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL. GPT-4 technical report. arXiv preprint 2023.
5. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T. LLaMA: open and efficient foundation language models. arXiv preprint 2023. https://arxiv.org/abs/2302.13971
6. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y. Llama 2: open foundation and fine-tuned chat models. arXiv preprint 2023. https://arxiv.org/abs/2307.09288
7. Chiang WL, Li Z, Lin Z, Sheng Y, Wu Z, Zhang H. Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. Vicuna. 2023;2(3):6.
8. Zheng L, Chiang WL, Sheng Y, Zhuang S, Wu Z, Zhuang Y, et al. Judging LLM-as-a-judge with MT-bench and chatbot arena. Advances in Neural Information Processing Systems. 2023;36:46595–623.
9. Yin S, Fu C, Zhao S, Li K, Sun X, Xu T. A survey on multimodal large language models. arXiv preprint 2023. https://arxiv.org/abs/2306.13549
10. Liu H, Li C, Li Y, Lee YJ. Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. p. 26296–306.
11. Liu H, Li C, Wu Q, Lee YJ. Visual instruction tuning. Advances in Neural Information Processing Systems. 2023;36:34892–916.
12. Chen J, Zhu D, Shen X, Li X, Liu Z, Zhang P, et al. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint 2023. https://arxiv.org/abs/2310.09478
13. Zhu D, Chen J, Shen X, Li X, Elhoseiny M. Minigpt-4: enhancing vision-language understanding with advanced large language models. arXiv preprint 2023. https://arxiv.org/abs/2304.10592
14. Ye Q, Xu H, Xu G, Ye J, Yan M, Zhou Y. Mplug-owl: modularization empowers large language models with multimodality. arXiv preprint 2023. https://arxiv.org/abs/2304.14178
15. Li B, Zhang Y, Chen L, Wang J, Pu F, Cahyono JA. Otter: a multi-modal model with in-context instruction tuning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2025.
16. Su Y, Lan T, Li H, Xu J, Wang Y, Cai D. Pandagpt: one model to instruction-follow them all. arXiv preprint 2023. https://arxiv.org/abs/2305.16355
17. Lu J, Zhang D, Wu X, Gao X, Gan R, Zhang J. Ziya-visual: bilingual large vision-language model via multi-task instruction tuning. arXiv preprint 2023. https://arxiv.org/abs/2310.08166
18. Gong T, Lyu C, Zhang S, Wang Y, Zheng M, Zhao Q. Multimodal-gpt: a vision and language model for dialogue with humans. arXiv preprint 2023. https://arxiv.org/abs/2305.04790
19. Zhang R, Han J, Liu C, Zhou A, Lu P, Qiao Y. LLaMA-adapter: efficient fine-tuning of large language models with zero-initialized attention. 2024.
20. Gao P, Han J, Zhang R, Lin Z, Geng S, Zhou A. LLaMA-adapter V2: parameter-efficient visual instruction model. arXiv preprint 2023. https://arxiv.org/abs/2304.15010
21. Luo G, Huang M, Zhou Y, Sun X, Jiang G, Wang Z. Towards efficient visual adaption via structural re-parameterization. arXiv preprint 2023. https://arxiv.org/abs/2302.08106
22. Luo G, Zhou Y, Ren T, Chen S, Sun X, Ji R. Cheap and quick: efficient vision-language instruction tuning for large language models. Advances in Neural Information Processing Systems. 2023;36:29615–27.
23. Li J, Li D, Savarese S, Hoi S. Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning. 2023. p. 19730–42.
24. Dai W, Li J, Li D, Tiong AMH, Zhao J, Wang W, et al. InstructBLIP: towards general-purpose vision-language models with instruction tuning; 2023.
25. Zeng Y, Zhang H, Zheng J, Xia J, Wei G, Wei Y. What matters in training a gpt4-style language model with multimodal inputs?. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2024. p. 7930–57.
26. Awadalla A, Gao I, Gardner J, Hessel J, Hanafy Y, Zhu W. OpenFlamingo: an open-source framework for training large autoregressive vision-language models. arXiv preprint 2023. https://arxiv.org/abs/2308.01390
27. Alayrac JB, Donahue J, Luc P, Miech A, Barr I, Hasson Y. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems. 2022;35:23716–36.
28. Han J, Zhang R, Shao W, Gao P, Xu P, Xiao H. Imagebind-llm: multi-modality instruction tuning. arXiv preprint 2023. https://arxiv.org/abs/2309.03905
29. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. PMLR; 2021. p. 8748–63.
30. Shen S, Li LH, Tan H, Bansal M, Rohrbach A, Chang KW. How much can clip benefit vision-and-language tasks? arXiv preprint 2021.
31. Patashnik O, Wu Z, Shechtman E, Cohen-Or D, Lischinski D. Styleclip: text-driven manipulation of stylegan imagery. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. p. 2085–94.
32. Wu HH, Seetharaman P, Kumar K, Bello JP. Wav2clip: learning robust audio representations from clip. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2022. p. 4563–7.
33. Frans K, Soros L, Witkowski O. Clipdraw: exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems. 2022;35:5207–18.
34. Xu H, Ghosh G, Huang PY, Okhonko D, Aghajanyan A, Metze F. Videoclip: contrastive pre-training for zero-shot video-text understanding. arXiv preprint 2021.
35. Gu X, Lin TY, Kuo W, Cui Y. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint 2021. https://arxiv.org/abs/2104.13921
36. Gao P, Geng S, Zhang R, Ma T, Fang R, Zhang Y, et al. CLIP-adapter: better vision-language models with feature adapters. Int J Comput Vis. 2023;132(2):581–95.
37. Zhu X, Zhang R, He B, Guo Z, Zeng Z, Qin Z, et al. PointCLIP V2: prompting CLIP and GPT for powerful 3D open-world learning. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). 2023. p. 2639–50. https://doi.org/10.1109/iccv51070.2023.00249
38. Zhang R, Guo Z, Zhang W, Li K, Miao X, Cui B, et al. Pointclip: point cloud understanding by clip. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 8552–62.
39. Wang C, Chai M, He M, Chen D, Liao J. Clip-nerf: text-and-image driven manipulation of neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 3835–44.
40. Huang T, Dong B, Yang Y, Huang X, Lau RW, Ouyang W. Clip2point: transfer clip to point cloud classification with image-depth pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. p. 22157–67.
41. Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems. 2020;33:6840–51.
42. Nichol AQ, Dhariwal P. Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning. 2021. p. 8162–71.
43. Dhariwal P, Nichol A. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems. 2021;34:8780–94.
44. Song Y, Sohl-Dickstein J, Kingma DP, Kumar A, Ermon S, Poole B. Score-based generative modeling through stochastic differential equations. arXiv preprint 2020. https://arxiv.org/abs/2011.13456
45. Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 10684–95.
46. Ramesh A, Dhariwal P, Nichol A, Chu C, Chen M. Hierarchical text-conditional image generation with clip latents. arXiv preprint 2022.
47. Kim G, Kwon T, Ye JC. Diffusionclip: text-guided diffusion models for robust image manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 2426–35.
48. Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y. A survey of large language models. arXiv preprint 2023.
49. Mokady R, Hertz A, Bermano AH. Clipcap: clip prefix for image captioning. arXiv preprint 2021. https://arxiv.org/abs/2111.09734
50. Anil R, Dai AM, Firat O, Johnson M, Lepikhin D, Passos A. Palm 2 technical report. arXiv preprint 2023. https://arxiv.org/abs/2305.10403
51. Fang Y, Wang W, Xie B, Sun Q, Wu L, Wang X, et al. EVA: exploring the limits of masked visual representation learning at scale. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023. p. 19358–69. https://doi.org/10.1109/cvpr52729.2023.01855
52. Fang Y, Sun Q, Wang X, Huang T, Wang X, Cao Y. EVA-02: a visual representation for neon genesis. Image and Vision Computing. 2024;149:105171.
53. Sun Q, Fang Y, Wu L, Wang X, Cao Y. Eva-clip: improved training techniques for clip at scale. arXiv preprint 2023. https://arxiv.org/abs/2303.15389
54. Cherti M, Beaumont R, Wightman R, Wortsman M, Ilharco G, Gordon C. Reproducible scaling laws for contrastive language-image learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 2818–29.
55. Schuhmann C, Beaumont R, Vencu R, Gordon C, Wightman R, Cherti M. Laion-5b: an open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems. 2022;35:25278–94.
56. Girdhar R, El-Nouby A, Liu Z, Singh M, Alwala KV, Joulin A. Imagebind: one embedding space to bind them all. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 15180–90.
57. Bai Z, Bai Y. Improving multimodal large language models through combining resampler and MLP projections. In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2025. p. 1–5.
58. Bai Z, Bai Y. Exploring the role of CLIP global visual features in multimodal large language models. In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2025. p. 1–5.
59. Wang X, Zhang X, Luo Z, Sun Q, Cui Y, Wang J. Emu3: next-token prediction is all you need. arXiv preprint 2024. https://arxiv.org/abs/2409.18869
60. Xie J, Mao W, Bai Z, Zhang DJ, Wang W, Lin KQ, et al. Show-o: one single transformer to unify multimodal understanding and generation. arXiv preprint 2024.
61. Zhu J, Wang W, Chen Z, Liu Z, Ye S, Gu L. Internvl3: exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint 2025. https://arxiv.org/abs/2504.10479
62. Chen X, Wu Z, Liu X, Pan Z, Liu W, Xie Z. Janus-pro: unified multimodal understanding and generation with data and model scaling. arXiv preprint 2025. https://arxiv.org/abs/2501.17811
63. Lin Z, Yu S, Kuang Z, Pathak D, Ramanan D. Multimodality helps unimodality: cross-modal few-shot learning with multimodal models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 19325–37.
64. Gal R, Patashnik O, Maron H, Bermano AH, Chechik G, Cohen-Or D. Stylegan-nada: clip-guided domain adaptation of image generators. ACM Trans Graph. 2022;41(4):1–13.
65. Liu X, Park DH, Azadi S, Zhang G, Chopikyan A, Hu Y. More control for free! image synthesis with semantic diffusion guidance. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023. p. 289–99.
66. Zhou K, Yang J, Loy CC, Liu Z. Conditional prompt learning for vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 16816–25.
67. Zhou K, Yang J, Loy CC, Liu Z. Learning to prompt for vision-language models. Int J Comput Vis. 2022;130(9):2337–48.
68. Wortsman M, Ilharco G, Kim JW, Li M, Kornblith S, Roelofs R. Robust fine-tuning of zero-shot models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 7959–71.
69. Du Z, Qian Y, Liu X, Ding M, Qiu J, Yang Z. Glm: general language model pretraining with autoregressive blank infilling. arXiv preprint 2021. https://arxiv.org/abs/2103.10360
70. Ding M, Yang Z, Hong W, Zheng W, Zhou C, Yin D. Cogview: mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems. 2021;34:19822–35.
71. Houlsby N, Giurgiu A, Jastrzebski S, Morrone B, De Laroussilhe Q, Gesmundo A, et al. Parameter-efficient transfer learning for NLP. In: International Conference on Machine Learning. PMLR; 2019. p. 2790–9.
72. Li XL, Liang P. Prefix-tuning: optimizing continuous prompts for generation. arXiv preprint 2021. https://arxiv.org/abs/2101.00190
73. Jia M, Tang L, Chen BC, Cardie C, Belongie S, Hariharan B. Visual prompt tuning. In: European Conference on Computer Vision. 2022. p. 709–27.
74. Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S. LoRA: low-rank adaptation of large language models. arXiv preprint 2021. https://arxiv.org/abs/2106.09685
75. Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. Qlora: efficient finetuning of quantized llms. Advances in Neural Information Processing Systems. 2023;36:10088–115.
76. Sung YL, Cho J, Bansal M. Vl-adapter: parameter-efficient transfer learning for vision-and-language tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 5227–37.
77. Zhang Y, Zhou K, Liu Z. Neural prompt search. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2024.
78. Liu X, Zheng Y, Du Z, Ding M, Qian Y, Yang Z. GPT understands, too. AI Open. 2024;5:208–15.
79. Lester B, Al-Rfou R, Constant N. The power of scale for parameter-efficient prompt tuning. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. p. 3045–59.
80. He K, Fan H, Wu Y, Xie S, Girshick R. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. p. 9729–38.
81. Chen X, Fan H, Girshick R, He K. Improved baselines with momentum contrastive learning. arXiv preprint 2020. https://arxiv.org/abs/2003.04297
82. Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning. PMLR; 2020. p. 1597–607.
83. Chen X, Xie S, He K. An empirical study of training self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. p. 9640–9.
84. Wu Z, Xiong Y, Yu SX, Lin D. Unsupervised feature learning via non-parametric instance discrimination. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018. p. 3733–42. https://doi.org/10.1109/cvpr.2018.00393
85. Chen X, Fang H, Lin TY, Vedantam R, Gupta S, Dollár P. Microsoft coco captions: data collection and evaluation server. arXiv preprint 2015.
86. Karpathy A, Fei-Fei L. Deep visual-semantic alignments for generating image descriptions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015. p. 3128–37. https://doi.org/10.1109/cvpr.2015.7298932
87. Li X, Yin X, Li C, Zhang P, Hu X, Zhang L. Oscar: object-semantics aligned pre-training for vision-language tasks. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. 2020. p. 121–37.
88. Papineni K, Roukos S, Ward T, Zhu WJ. Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. p. 311–8.
89. Denkowski M, Lavie A. Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation. 2014. p. 376–80.
90. Vedantam R, Zitnick C, Parikh D. Cider: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. p. 4566–75.
91. Anderson P, Fernando B, Johnson M, Gould S. Spice: semantic propositional image caption evaluation. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part V 14. 2016. p. 382–98.
92. Lin CY, Och FJ. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04). 2004. p. 605–12.
93. Lu P, Mishra S, Xia T, Qiu L, Chang KW, Zhu SC. Learn to explain: multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems. 2022;35:2507–21.
94. Taori R, Gulrajani I, Zhang T, Dubois Y, Li X, Guestrin C. Stanford alpaca: an instruction-following llama model. 2023.
95. Chen T, Kornblith S, Swersky K, Norouzi M, Hinton GE. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems. 2020;33:22243–55.
96. Fu C, Chen P, Shen Y, Qin Y, Zhang M, Lin X, et al. MME: a comprehensive evaluation benchmark for multimodal large language models. arXiv preprint 2023. https://arxiv.org/abs/2306.13394