
Long-text caption generation for surgical image with a concept retrieval augmented large multimodal model

  • Jiquan Liu,

    Roles Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Supervision, Validation, Writing – review & editing

    Affiliation Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, Zhejiang, China

  • Yichen Zhu,

    Roles Data curation, Investigation, Methodology, Software, Validation, Writing – original draft, Writing – review & editing

    Affiliation Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, Zhejiang, China

  • Jingyi Feng ,

    Roles Conceptualization, Formal analysis, Project administration, Supervision

    feng.jingyi@zju.edu.cn

    Affiliation The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China

  • Xiaoyan Zhang,

    Roles Data curation, Investigation, Writing – review & editing

    Affiliation Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, Zhejiang, China

  • Ziyu Zhou,

    Roles Data curation, Investigation, Writing – review & editing

    Affiliation Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, Zhejiang, China

  • Ye Tao,

    Roles Data curation, Investigation, Writing – review & editing

    Affiliation Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, Zhejiang, China

  • Huilong Duan

    Roles Conceptualization, Formal analysis, Project administration, Supervision

    Affiliation Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, Zhejiang, China

Abstract

Surgical image captioning is critical for automated reporting and education but is currently limited by a lack of long-text datasets and the tendency of generic Multimodal Large Language Models (MLLMs) to hallucinate medical details. To address this, we present a comprehensive framework for long-text surgical captioning. First, we construct a verified long-text benchmark extending the EndoVis2018 dataset, utilizing an automated pipeline with expert-in-the-loop validation to transform brief triplets into rich narratives. Second, we investigate domain-specific adaptation strategies for MLLMs. We implement a surgical concept retrieval-augmented generation (RAG) mechanism that dynamically injects specialized knowledge (instruments, actions) into the visual encoder, effectively mitigating domain-specific hallucinations common in generic models. Finally, recognizing the inadequacy of n-gram metrics for long medical text, we establish a robust evaluation protocol using clinically-aligned metrics. Extensive experiments demonstrate that our data-centric and retrieval-enhanced approach significantly outperforms baselines in producing clinically accurate, coherent long descriptions.

Introduction

Surgical image captioning involves the automatic generation of natural language captions based on the current surgical view. It is crucial for understanding complex surgical operations and serves as a prerequisite for generating surgical instructions during the procedure and surgical reports after the operation [1]. Surgical instruction generation is integral to the development of a context-aware surgical system, which aims to use the information available in the operating room to provide clinicians with contextual support at appropriate times [2]. Surgical report generation can reduce surgeon workload and the time devoted to postoperative report writing [3], and serve as a training tool for junior surgeons [4].

Despite its potential, current research is hindered by three critical challenges: 1) Data Scarcity: Existing datasets like EndoVis2018 [5] and DAISI [6] predominantly contain brief captions (averaging 7.9 words) resembling simple surgical triplets (“instrument-action-target”). These lack the positional details, procedural nuances, and narrative context required for clinical utility. 2) Domain Gap and Hallucinations: While emerging Multimodal Large Language Models (MLLMs) offer powerful generative capabilities, they often lack specialized medical knowledge. Direct application of generic MLLMs to surgical scenes frequently leads to “hallucinations”—fabricating non-existent instruments or misinterpreting anatomical relationships—which is unacceptable in clinical settings. 3) Inadequate Evaluation: Traditional n-gram metrics fail to evaluate the semantic coherence of long-form captions. Furthermore, metrics like the Semantic Propositional Image Caption Evaluation (SPICE) [7] are typically limited to single-sentence analysis and do not correlate well with professional clinical judgment for complex narratives.

Accurate and detailed surgical documentation is clinically vital not only for postoperative review and audit but also for surgical skill education and the development of AI-based assistance systems. Short, triplet-style captions have been shown to be insufficient for capturing the full extent of surgical activities and the spatial relationships required by intelligent support systems. Recent works [8,9] and interviews with experienced surgical professionals demonstrate that longer, context-rich descriptions better support clinical, educational, and technological goals.

To bridge these gaps, we present a holistic framework focusing on data quality and domain-specific knowledge integration. First, we address the data bottleneck by constructing a verified long-text benchmark extending the EndoVis2018 dataset. We design an automated pipeline utilizing GPT-4o with structural prompts (incorporating tools, targets, and spatial data) to transform brief annotations into rich narratives, followed by expert verification. Second, rather than relying on generic model architectures, we investigate domain-specific adaptation strategies for MLLMs. We implement a surgical concept retrieval-augmented generation (RAG) mechanism within the BLIP2 architecture. By constructing a dedicated surgical concept vector database, we enable the model to retrieve and inject explicit domain knowledge (e.g., specific instrument types and actions) into the visual encoding process via a dual-path attention mechanism. This strategy effectively grounds the generation process, significantly mitigating the risk of concept hallucination. The overview of our proposed architecture is illustrated in Fig 1.

Fig 1. Overview of the proposed framework.

Addressing the scarcity of long-text surgical data, we introduce an automated pipeline for dataset construction. Furthermore, to mitigate medical hallucinations in MLLMs, we incorporate a surgical concept retrieval mechanism that aligns visual features with domain-specific knowledge.

https://doi.org/10.1371/journal.pone.0343823.g001

Finally, to rigorously assess performance in a clinically relevant manner, we establish a robust evaluation protocol with three dedicated metrics: Long-SPICE, an adaptation using scene graph matching to measure conceptual completeness in complex texts; Weighted BERTScore, tailored to assess semantic similarity of medical terminology; and CLAIR (Criterion using Language Models for Image Caption Rating), which employs LLMs to verify logical coherence. Extensive experiments demonstrate that our data-centric and retrieval-enhanced approach generates clinically accurate descriptions, and our proposed metrics achieve significantly higher correlation with human evaluation compared to traditional methods. The code is available at https://github.com/jiquan/SC-RACap.

Related work

Image captioning constitutes a critical task in computer vision. Numerous techniques developed for captioning natural images are adaptable for surgical image captioning. In this review, we focus on the latest advancements in surgical image captioning with regard to datasets, models, and evaluation methods.

Datasets

Existing studies on surgical image captioning datasets include public and private sources, which encompass a range of procedures, such as nephrectomy, transoral robotic surgery, sleeve gastrectomy, and ventral hernia repair. The EndoVis2018 dataset [5], specifically designed for the robotic-assisted surgical scene segmentation challenge, includes endoscopic images of 15 robotic nephrectomy surgeries. These images, captured by the da Vinci X or Xi systems, involve nine distinct surgical instruments and regions. This dataset was extended by Xu et al. [10]. Each sequence is annotated with detailed captions by experienced surgeons that describe the interactions between surgical instruments and tissues.

The DAISI [6] dataset (Database for AI Surgical Instruction) contains 17,255 color images of 290 different medical procedures, ranging from external fetal monitoring to laparoscopic surgery. Each image is accompanied by the corresponding surgical instruction text, providing extensive natural language captions for automatic surgical instruction generation. This dataset supports the development of surgical instruction generation technology and offers an important resource to understand surgical scenes and research surgical assistance systems.

The neurosurgical video captioning dataset, collected by Chen et al. [11], comprises 41 endoscopic skull base neurosurgery videos documenting procedures at Prince of Wales Hospital, The Chinese University of Hong Kong. These videos are segmented into 11,004 video clips, each with a duration of 30 seconds, and annotated in detail following the Tool-Tissue Interaction (TTI) principle [12].

Regardless of whether the dataset is public or private, the representations in surgical image captioning typically follow the structure of “Instrument exerts action on target.” This format can fundamentally be viewed as a specific type of tripartite relationship. Most of the time, these captions involve one or more categorical sets, encompassing instruments, anatomical parts, and actions. However, they generally lack more nuanced or advanced semantic information, such as details about location, function, or surgical stage.

Models

Existing image captioning models have evolved from early encoder-decoder frameworks, where Convolutional Neural Networks (CNNs) encoded visual features and Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs) [13] generated captions, to advanced architectures integrating attention mechanisms and transformers for better focus, coherence, and efficiency. Modern methods incorporate reinforcement learning to enhance sentence-level quality and leverage pre-trained models on large-scale datasets for improved accuracy and diversity. Retrieval-augmented techniques have also gained prominence, combining memory mechanisms [14] and external knowledge [15] to mitigate hallucination and improve zero-shot performance. These advances provide a robust foundation for developing domain-specific captioning models, including those for surgical image captioning.

In the context of surgical image captioning, existing models also utilize the encoder-decoder architecture. Xu et al. [10] introduced a Class-Incremental Domain Adaptation (CIDA) method for the generation of surgical captions, employing multilayer Transformer models to address challenges related to novel classes and domain shifts. CIDA incorporates class-incremental learning and supervised contrastive loss to facilitate adaptation to new classes and extraction of domain-invariant features. The method also employs one-dimensional Gaussian smoothing and label smoothing techniques to produce a well-calibrated network, enhance feature representation, and improve performance under domain shifts and when encountering unseen classes.

In subsequent work, Xu et al. [1] proposed an end-to-end surgical caption generation model termed SwinMLP-TranCAP, which eliminates the need for detectors and feature extractors. This model utilizes a window-based multi-head Multi-Layer Perceptron (MLP) to replace the multi-head attention module, thereby reducing computational complexity and enhancing inference speed. By directly employing image patches as input representations, the model circumvents intermediate modules such as detectors and feature extractors, resulting in a streamlined architectural design.

However, these approaches do not consider explicit learning of surgical concepts. To address this deficiency, Chen et al. [11] developed SCA-Net, a network that bridges visual and textual modalities by integrating surgical concepts. Using contrastive learning of the text and image, SCA-Net reduces the semantic gap between these modalities and incorporates a classification task to enhance the model’s understanding of surgical concepts. This alignment strategy enables SCA-Net to align surgical concepts across modalities, resulting in more accurate captions informed by multimodal knowledge.

Similarly, SGT++ [16] explicitly captures the structured and detailed semantic relationships within surgical scenes. It accomplishes this by homogenizing the heterogeneous scene graphs of inputs and introducing an implicit relationship attention mechanism. By leveraging prior knowledge stored in an interaction prototype memory, SGT++ enhances the representation of surgical instruments, tissues, and their interactions, thereby facilitating a deeper understanding of surgical processes and the automatic generation of high-quality reports.

Despite these advances, significant limitations persist. SCA-Net requires category labels for all training data, which is labor-intensive. Additionally, multi-task networks frequently encounter task competition, especially during the feature extraction phase. Features optimized for text generation may lack sensitivity for classification tasks, adversely affecting overall model performance. In addition, SGT++ relies on extracting scene graphs from images before text generation. This reliance poses challenges: the performance of the scene graph parsing model directly influences the outcomes, and there is potential for loss of image detail during the parsing process.

Evaluation metrics

In previous research, several metrics have been extensively utilized to evaluate image captioning performance. BLEU (Bilingual Evaluation Understudy) [17] is straightforward to calculate and implement; however, it does not account for grammatical precision and semantic coherence. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [18] offers multiple variants that support n-gram matching across different lengths, but predominantly emphasizes recall, neglecting precision. In particular, ROUGE-L relies heavily on the longest common subsequence, often overlooking semantic nuances.

METEOR (Metric for Evaluation of Translation with Explicit ORdering) [19] offers a more nuanced assessment compared to BLEU by incorporating synonym matching, morphological variations, and sentence structure considerations. It also utilizes a weighted F-score to enhance the precision of the evaluation. However, METEOR is computationally intensive and focuses primarily on lexical matching. CIDEr (Consensus-based Image Caption Evaluation) [20], designed for image captioning tasks, employs Term Frequency-Inverse Document Frequency (TF-IDF) weighted n-grams [21] to compute cosine similarity, allowing it to capture semantic content more effectively and assess descriptive accuracy. However, its computation requires at least two reference captions. SPICE (Semantic Propositional Image Caption Evaluation) [7] evaluates deep semantic content through the semantic representation of scene graphs. It considers objects, attributes, and relationships within images, providing a comprehensive semantic evaluation metric. Despite its advantages, SPICE is computationally complex, challenging to optimize, and relies on scene graph construction, which inherently supports only single-sentence evaluation.

Materials and methods

Dataset construction

Medical image captions should ideally integrate information in multiple dimensions [9]: Indication, which provides a concise statement of clinical information about the image; Tags, representing a collection of medical terms encapsulating key insights extracted from the images; Findings, offering detailed observational descriptions of various regions within the image to aid in identifying abnormalities; and Impression, a summary statement synthesizing information from the Indication and Findings to provide a comprehensive overview of the image.

However, the captions in existing public datasets are predominantly expressed in a simplistic format, which can be distilled into a triple-relationship structure. These captions generally refer to one or more categories, including devices, anatomical locations, and actions, but lack finer-grained or higher-level semantic details, such as spatial relationships, functional roles, or surgical stages.

To fully utilize the capabilities of multimodal large language models, we propose a unified framework for prompt construction to organize surgical images and their associated annotation information, as shown in Fig 2. Specifically, we incorporate the original concise captions and the coordinates of surgical concept bounding boxes as auxiliary information. This approach encodes the spatial positions of objects using standardized coordinates while requiring the model to integrate these details into its narrative without explicitly referencing “bounding boxes” or coordinate data. This ensures the generation of coherent and contextually enriched captions.
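As a rough illustration of this prompt-construction step, the sketch below assembles a structural prompt from a short triplet caption and normalized bounding boxes; the wording, field names, and function name are illustrative stand-ins rather than the exact prompt shown in Fig 2, and the resulting text (together with the image) would then be submitted to the GPT-4o API.

```python
def build_caption_prompt(short_caption: str, boxes: list[dict]) -> str:
    """Assemble a structural prompt from a short triplet caption and
    normalized bounding boxes of surgical concepts (illustrative wording)."""
    box_lines = "\n".join(
        f"- {b['label']}: x1={b['x1']:.2f}, y1={b['y1']:.2f}, "
        f"x2={b['x2']:.2f}, y2={b['y2']:.2f}"
        for b in boxes
    )
    return (
        "You are given an endoscopic surgical image.\n"
        f"Short reference caption: \"{short_caption}\"\n"
        "Surgical concepts with normalized coordinates:\n"
        f"{box_lines}\n"
        "Write a detailed, clinically coherent paragraph describing the "
        "instruments, their positions, and their interactions with tissue. "
        "Describe spatial locations in natural language (e.g. 'upper left of "
        "the field') and never mention bounding boxes or coordinates."
    )

# Example usage with a hypothetical annotation record.
prompt = build_caption_prompt(
    "prograsp forceps manipulate tissue",
    [{"label": "prograsp_forceps", "x1": 0.12, "y1": 0.30, "x2": 0.45, "y2": 0.62},
     {"label": "tissue",           "x1": 0.40, "y1": 0.55, "x2": 0.90, "y2": 0.95}],
)
print(prompt)
```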

Fig 2. Prompt for long-text caption construction.

The black text represents general instructions. The yellow text provides the introduction rules for short-text captions. The blue text provides the introduction rules for the coordinates of surgical concept bounding boxes.

https://doi.org/10.1371/journal.pone.0343823.g002

After constructing the prompts, the GPT-4o API is used to generate an initial dataset, followed by manual reviews conducted by three personnel trained in surgical knowledge. As shown in Fig 3, the extended long-text caption provides significantly richer details regarding instrument categories, positions, and interactions compared to the original short text. Despite the inclusion of short text captions and bounding box coordinates, the review process identified occasional factual inaccuracies or fabrications in the model-generated text. To address these issues, we incorporate an additional layer of surgical knowledge, supplementing images, brief captions, and bounding boxes. This enhanced knowledge base assists the model in understanding the professional and domain-specific intricacies of surgical contexts.

Fig 3. Example of constructing long-text data.

The original short text caption only contained triplet information, with both the verb and target being vague. The extended long-text caption accurately details the categories and positions of the instrument, verb, and target, as well as their interactions.

https://doi.org/10.1371/journal.pone.0343823.g003

Model

Multimodal architecture.

BLIP2 [22] is a multimodal model consisting of three primary components: the visual encoder, the Q-Former, and the large language model. Its training process is executed in two stages: representation learning and generation learning. A Vision Transformer (ViT) [23] model functions as a visual encoder in both stages. The Q-Former module constitutes the core of BLIP2. During the representation learning stage, Q-Former is trained to align image features with text features. In the subsequent generation learning stage, Q-Former employs learned queries and a cross-attention mechanism to extract pertinent visual features for text generation.

Despite BLIP2’s ability to generate captions for surgical images, it occasionally produces outputs that do not align with visual content, a phenomenon known as hallucination [24]. In image captioning, special attention is paid to object hallucination, which is classified into three types: category hallucination, attribute hallucination, and relationship hallucination [25]. Category hallucination occurs when the model incorrectly identifies or fabricates nonexistent object categories; attribute hallucination refers to inaccurate descriptions of object attributes such as color, shape, or size; and relationship hallucination arises when the model’s depiction of object relationships does not align with the actual image.

Retrieval-Augmented Generation (RAG) is a generation method that integrates parametric and nonparametric memory and has gained widespread adoption to improve knowledge-intensive natural language processing tasks [26]. RAG is essential in mitigating model hallucinations by allowing direct access to and utilization of external knowledge sources. This capability reduces inaccuracies and the generation of incorrect information, resulting in outputs that are more factually grounded and diverse. We propose the integration of a retrieval-augmented module built on the BLIP2 framework, inspired by EVCAP [15].

Vector database.

To implement a retrieval-augmented mechanism, the initial step involves constructing a vector database that supports efficient retrieval. In multimodal tasks, the network input commonly consists of an image, and the information used for retrieval also derives from visual data. In contrast to natural language tasks, it is crucial to employ a visual encoder to generate high-dimensional vector representations for each image. These vectors encapsulate rich visual features, which are subsequently stored within an index structure. Each vector is associated with the information related to the corresponding surgical concept in the image, such as instrument categories, anatomical locations, surgical procedures, and complex relationships such as triplet structures or text captions. During both training and inference, the visual feature vectors of the image in question are retrieved and matched, facilitating the identification of semantically similar images. This integration of relevant surgical concept information into the model enhances its performance.

The construction process is illustrated in Fig 4. Specifically, the visual encoder employs ViT and Q-Former networks, retaining the architecture and weights from the training phase. A query vector is employed to extract visual features from the image. After passing through layers of self-attention, cross-attention, and feed-forward operations, this vector undergoes extensive interactions with the surgical image. It ultimately serves as the embedding vector for the image, which is subsequently input into the surgical concept database.

Fig 4. The process of constructing the surgical concept vector database.

The surgical image is processed by ViT and Q-Former to extract an image embedding. The image embedding, along with the corresponding surgical concept text, is stored in the vector database.

https://doi.org/10.1371/journal.pone.0343823.g004

During database construction, the learnable vectors in Q-Former are initialized with pre-trained weights from BLIP2. During subsequent downstream task training for generating surgical image captions, these weights are updated as training progresses. Consequently, the weights of the visual encoder used in training will differ from those used in database construction, leading to discrepancies in the distribution of extracted vectors. To address this issue, we implement a strategy of periodically updating the surgical concept database. After several training epochs for the downstream task, the vector database is re-extracted using the updated weights of the visual encoder. This ensures that the distribution of visual features remains consistent both before and after matching, thereby mitigating potential inconsistencies between the stages.
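A minimal sketch of how such a database could be implemented is given below, assuming pooled Q-Former image embeddings and using FAISS for the index; the embedding dimension, concept schema, and class name are illustrative, and the periodic refresh described above simply rebuilds the same object from the updated visual encoder.

```python
import faiss
import numpy as np

class SurgicalConceptDB:
    """Minimal sketch of the surgical concept vector database: image
    embeddings (from ViT + Q-Former) indexed for cosine-similarity search,
    each paired with its concept annotations."""

    def __init__(self, dim: int):
        self.index = faiss.IndexFlatIP(dim)   # inner product on unit vectors = cosine similarity
        self.concepts: list[dict] = []        # e.g. {"instruments": [...], "verbs": [...], ...}

    def add(self, embedding: np.ndarray, concept: dict) -> None:
        vec = embedding.astype("float32").reshape(1, -1)
        faiss.normalize_L2(vec)               # store unit-norm vectors
        self.index.add(vec)
        self.concepts.append(concept)

    def search(self, query: np.ndarray, k: int = 5):
        vec = query.astype("float32").reshape(1, -1)
        faiss.normalize_L2(vec)
        scores, idx = self.index.search(vec, k)
        return [(float(s), self.concepts[i]) for s, i in zip(scores[0], idx[0]) if i != -1]

# Example: index a few (hypothetical) 768-d pooled Q-Former image embeddings.
db = SurgicalConceptDB(dim=768)
rng = np.random.default_rng(0)
db.add(rng.standard_normal(768), {"instruments": ["prograsp forceps"], "verbs": ["grasp"]})
db.add(rng.standard_normal(768), {"instruments": ["monopolar curved scissors"], "verbs": ["cut"]})
print(db.search(rng.standard_normal(768), k=2))
```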

Surgical concept retrieval augmented captioning network.

Once the surgical concept vector database has been constructed, the complete surgical concept retrieval-augmented network operates as illustrated in Fig 5. Initially, surgical images from the training set are segmented into a predefined number of patches, which are subsequently input into the image encoder to extract visual features. These visual features are then input into the same Q-Former utilized for constructing the database. To distinguish it from the surgical concept Q-Former used later, we refer to it here as the Image Q-Former, with associated learnable query vectors designated as Image Queries. This process results in the extraction of image embeddings for the training images. Cosine similarity is employed as the metric to assess the similarity between vectors, facilitating efficient retrieval within the vector database:

Fig 5. Surgical concept retrieval-augmented image captioning network architecture.

The ViT and Q-Former are employed to extract surgical image embeddings. Similar images and associated surgical concepts are retrieved from a vector database. The image embeddings and surgical concepts interact within the Surgical Concept Q-Former to produce the final text embedding. The text and image embeddings are then concatenated to generate captions using the LLMs.

https://doi.org/10.1371/journal.pone.0343823.g005

\[
\mathrm{sim}(\mathbf{v}_q,\mathbf{v}_k) = \frac{\mathbf{v}_q \cdot \mathbf{v}_k}{\lVert \mathbf{v}_q \rVert\,\lVert \mathbf{v}_k \rVert} \tag{1}
\]

Here, \(\mathbf{v}_q\) denotes the embedding vector of the current input image, while \(\mathbf{v}_k\) denotes the embedding vector of a stored image. The term \(\mathrm{sim}(\mathbf{v}_q,\mathbf{v}_k)\) refers to the cosine similarity between these two vectors. By computing this similarity, the top-k most similar images are identified, and their associated category information is retrieved. Given that these images may belong to overlapping categories, a voting mechanism is employed to extract surgical concepts that occur more frequently than a predefined threshold T. All qualifying surgical concepts, such as surgical instruments, anatomical locations, procedures, and more intricate triplet relationships, are concatenated into a single string using a special delimiter. This string is then encoded through the tokenizer and embedding layers and, along with the visual embeddings, is passed into the Surgical Concept Q-Former.
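The retrieval-and-voting step can be sketched as follows, with neighbour records in the same illustrative format as the database sketch above; the threshold, delimiter, and concept fields are assumptions rather than the exact values used in our experiments.

```python
from collections import Counter

def vote_concepts(retrieved: list[tuple[float, dict]], threshold: int, delimiter: str = " | ") -> str:
    """Aggregate concepts from the top-k retrieved neighbours and keep those
    occurring more often than a predefined threshold T (illustrative sketch)."""
    counts = Counter()
    for _score, concept in retrieved:
        for field_values in concept.values():          # instruments, verbs, targets, triplets, ...
            counts.update(field_values)
    kept = [c for c, n in counts.most_common() if n > threshold]
    return delimiter.join(kept)                         # single string fed to the tokenizer

# Example with hypothetical neighbours (T = 1: keep concepts seen at least twice).
neighbours = [
    (0.91, {"instruments": ["prograsp forceps"], "verbs": ["grasp"]}),
    (0.88, {"instruments": ["prograsp forceps"], "verbs": ["retract"]}),
    (0.85, {"instruments": ["suction instrument"], "verbs": ["grasp"]}),
]
print(vote_concepts(neighbours, threshold=1))   # -> "prograsp forceps | grasp"
```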

The network architecture of the Surgical Concept Q-Former is illustrated in Fig 6. Initially, a textual query vector is defined and concatenated with the image embedding output generated by the image query transformer, forming a joint image-text query. This query is processed using the self-attention mechanism, enabling it to capture relevant information from the current image. The mathematical formulation for the self-attention mechanism is as follows:

Fig 6. Architecture of surgical concept Q-Former.

The randomly initialized text queries and image embeddings are concatenated and interact through self-attention. The text queries are then subjected to cross-attention interaction with the textual embeddings of surgical concepts. This process is stacked for N blocks.

https://doi.org/10.1371/journal.pone.0343823.g006

\[
\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{2}
\]

Here, \(Q\) denotes the query vector, \(K\) denotes the key vector, and \(V\) refers to the value vector, with \(\sqrt{d_k}\) serving as the scaling factor. In this step, the text query interacts with the image embeddings to extract key information pertinent to the current image. Subsequently, the text query vector is taken from the concatenated vector and serves as the query, while the surgical concept embedding vector obtained in the preceding step functions as both the key and value. These vectors are integrated via a cross-attention mechanism, computed as follows:

\[
\mathrm{CrossAttn}(Q_t,K_c,V_c) = \mathrm{softmax}\!\left(\frac{Q_t K_c^{\top}}{\sqrt{d_k}}\right)V_c \tag{3}
\]

Here, \(Q_t\) denotes the textual query, while \(K_c\) and \(V_c\) denote the embedding vectors corresponding to the surgical concepts. These steps enable the textual query to effectively integrate surgical concept information with image features. Following the cross-attention mechanism, the resulting vectors are subjected to a nonlinear transformation via a feed-forward layer. This procedure is repeated for N blocks, thereby facilitating comprehensive interaction between the surgical concept information and the image features. Finally, the two query vectors are concatenated and processed through a linear projection, as delineated in the equation below:

\[
\mathbf{z} = W\,[\,\mathbf{q}_{img};\,\mathbf{q}_{txt}\,] + \mathbf{b} \tag{4}
\]

Here, \(\mathbf{q}_{img}\) denotes the image query vector, while \(\mathbf{q}_{txt}\) denotes the text query vector; \([\,\cdot\,;\,\cdot\,]\) denotes concatenation, \(W\) is the linear projection matrix, and \(\mathbf{b}\) refers to the bias. The projected outputs are subsequently input into a pre-trained large language model to generate the final textual caption, with the training process guided by the cross-entropy loss function. Importantly, during initialization, the text query vector is randomly initialized, whereas the weights from the Image Q-Former are utilized in the self-attention, cross-attention, and feed-forward layers of the Surgical Concept Q-Former.
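A simplified PyTorch sketch of this dual-path Surgical Concept Q-Former is given below. The hidden size, number of queries, block count, and LLM dimension are placeholder values, and the initialization from the pre-trained Image Q-Former weights is omitted for brevity.

```python
import torch
import torch.nn as nn

class ConceptQFormerBlock(nn.Module):
    """One block of the dual-path attention described above (simplified sketch)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text_q, image_emb, concept_emb):
        # Self-attention over the joint [text queries; image embeddings] sequence (Eq. 2).
        joint = torch.cat([text_q, image_emb], dim=1)
        joint = self.norm1(joint + self.self_attn(joint, joint, joint)[0])
        text_q = joint[:, : text_q.size(1)]                      # keep the text-query slots
        # Cross-attention: text queries attend to surgical concept embeddings (Eq. 3).
        text_q = self.norm2(text_q + self.cross_attn(text_q, concept_emb, concept_emb)[0])
        return self.norm3(text_q + self.ffn(text_q))

class SurgicalConceptQFormer(nn.Module):
    def __init__(self, dim: int = 768, n_text_queries: int = 32, n_blocks: int = 2, llm_dim: int = 2560):
        super().__init__()
        self.text_queries = nn.Parameter(torch.randn(1, n_text_queries, dim) * 0.02)  # randomly initialized
        self.blocks = nn.ModuleList(ConceptQFormerBlock(dim) for _ in range(n_blocks))
        self.proj = nn.Linear(dim, llm_dim)                      # Eq. 4: projection into the LLM space

    def forward(self, image_emb, concept_emb):
        text_q = self.text_queries.expand(image_emb.size(0), -1, -1)
        for blk in self.blocks:
            text_q = blk(text_q, image_emb, concept_emb)
        # Concatenate image and text query outputs and project for the frozen LLM.
        return self.proj(torch.cat([image_emb, text_q], dim=1))

# Example: 32 image queries, 32 text queries, 12 concept tokens (all shapes hypothetical).
image_emb = torch.randn(2, 32, 768)
concept_emb = torch.randn(2, 12, 768)
out = SurgicalConceptQFormer()(image_emb, concept_emb)
print(out.shape)   # torch.Size([2, 64, 2560])
```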

Experiments and results

Semantic evaluation metrics

To rigorously evaluate the quality of long-text captions for surgical images, we propose three assessment metrics that address distinct aspects of semantic and linguistic accuracy. First, the Long-SPICE metric, based on scene graph matching, decomposes complex sentences into triplet relationships, enabling precise evaluation of surgical concept representation. This approach minimizes the impact of expressive variations while focusing on the correctness of key concepts. Second, the BERTScore metric employs word vector matching, leveraging the global attention mechanism to compute semantic similarities. By capturing approximate semantic alignments through feature space distributions, it effectively measures overall semantic similarity. Finally, the CLAIR metric, which relies on LLM-based scoring, capitalizes on the advanced natural language understanding capabilities of LLMs. This metric facilitates detailed comparisons of surgical concept representations and ensures captions are fluent and coherent. Together, these metrics enable the evaluation of the accuracy, semantic integrity, and linguistic quality of long-text captions in the surgical domain.

Long-SPICE.

SPICE (Semantic Propositional Image Caption Evaluation) [7] assesses the semantic alignment between generated captions and reference texts by concentrating on their underlying semantic relationships. It utilizes a scene graph representation to transform textual captions into structured graphs comprising objects, attributes, and relationships. This methodology captures the core semantic content of the text while disregarding specific vocabulary and syntax. By comparing the semantic propositions encoded in the scene graphs of both the candidate and reference texts, SPICE quantifies the quality of the generated captions using an F1 score, providing a robust measure of semantic consistency.

While SPICE demonstrates exceptional proficiency in capturing semantic content, it is not without limitations. First, SPICE presupposes that the generated text is grammatically coherent, thereby overlooking textual fluency evaluation. Second, its performance is inherently dependent on the quality of scene graph parsing. The original SPICE parsing methodology is constrained to extracting scene graphs from individual sentences, rendering it inadequate for multi-sentence or complex scenarios characteristic of longer texts. The first limitation can be alleviated by complementing SPICE with additional evaluation metrics, while addressing the latter necessitates the development of more advanced and robust semantic parsing techniques.

Existing research has investigated methods for extracting scene graphs from complex, long texts. While some approaches accomplish this through further training tailored to surgical scenarios [27], others employ zero-shot learning to transfer capabilities directly [28]. However, the latter are typically limited to parsing short text data. To enable document-level zero-shot relationship extraction, large language models have emerged as a promising solution [29,30]. By leveraging the unique characteristics of surgical image captions, we employ large language models to extract relationships from extended textual content. This process involves identifying multiple semantic relationships within the text and aligning them with relevant elements, such as instruments and anatomical structures, derived from reference captions for evaluation purposes. Our method effectively addresses the challenges posed by complex, long texts and demonstrates flexibility across diverse scenarios without requiring additional training. Furthermore, this approach extends the SPICE metric’s capability to process lengthy textual captions more efficiently, thereby enhancing its accuracy in evaluating critical information within intricate captions.
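In outline, Long-SPICE first prompts an LLM to extract (instrument, action, target) relations from each caption and then scores the overlap of the extracted relation sets; the prompt wording below is illustrative, and exact triplet matching is shown in place of the more tolerant matching used in practice.

```python
EXTRACTION_PROMPT = (
    "Extract all (instrument, action, target) relations from the following "
    "surgical image caption. Answer as a JSON list of 3-element lists.\n"
    "Caption: {caption}"
)  # illustrative wording; the deployed prompt may differ

def long_spice_f1(cand_triplets: set[tuple], ref_triplets: set[tuple]) -> float:
    """F1 over scene-graph triplets extracted by an LLM from the candidate and
    reference captions (exact matching shown for simplicity)."""
    if not cand_triplets or not ref_triplets:
        return 0.0
    matched = len(cand_triplets & ref_triplets)
    p = matched / len(cand_triplets)
    r = matched / len(ref_triplets)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Example with hypothetical extracted relations.
cand = {("prograsp forceps", "grasp", "tissue"), ("scissors", "cut", "vessel")}
ref = {("prograsp forceps", "grasp", "tissue"), ("suction", "suck", "blood")}
print(round(long_spice_f1(cand, ref), 3))   # 0.5
```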

Weighted BERTScore.

BERTScore [31] is an innovative evaluation metric designed to address the limitations of traditional methods in assessing semantic similarity. In tasks involving text generation, such as machine translation and image captioning, conventional metrics like BLEU and METEOR primarily rely on n-gram matching. However, these metrics frequently fail to account for synonymous word substitutions, phrase reordering, and complex semantic relationships, thereby often undervaluing the quality of generated text.

The fundamental principle of BERTScore is its use of contextual embeddings derived from the pretrained BERT model. Each token in the candidate and reference texts is transformed into a high-dimensional vector, and their semantic similarity is assessed using cosine similarity. A greedy matching strategy is employed, in which each token in the candidate text is matched with the most semantically similar token in the reference text. The overall quality of the candidate text is evaluated through the integration of precision, recall, and F1 scores. Let \(\hat{x}=\{\hat{x}_1,\dots,\hat{x}_m\}\) and \(x=\{x_1,\dots,x_n\}\) denote the sets of token embeddings for the generated and reference texts, and let \(x_i^{\top}\hat{x}_j\) denote the cosine similarity between (pre-normalized) token embeddings. The BERTScore formula is expressed as follows:

\[
R_{\mathrm{BERT}} = \frac{1}{|x|}\sum_{x_i \in x}\max_{\hat{x}_j \in \hat{x}} x_i^{\top}\hat{x}_j,\qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|}\sum_{\hat{x}_j \in \hat{x}}\max_{x_i \in x} x_i^{\top}\hat{x}_j,\qquad
F_{\mathrm{BERT}} = \frac{2\,P_{\mathrm{BERT}}\,R_{\mathrm{BERT}}}{P_{\mathrm{BERT}}+R_{\mathrm{BERT}}} \tag{5}
\]

To enhance the sensitivity of BERTScore to surgical concepts, a weighted variant was developed, specifically tailored for the evaluation of surgical image captions. This method assigns increased weights to keywords pertinent to surgical contexts, derived from the tokenization results of the BERT tokenizer. Within this framework, the weights of particular surgical terms are augmented to N times those of other tokens. This modification strengthens the metric’s ability to assess the semantic alignment of surgical image captions more effectively. The formula for the weighted BERTScore is presented as follows:

\[
R_{\mathrm{w\text{-}BERT}} = \frac{\sum_{x_i \in x} w_i \max_{\hat{x}_j \in \hat{x}} x_i^{\top}\hat{x}_j}{\sum_{x_i \in x} w_i} \tag{6}
\]

Here, \(w_i\) denotes the weight assigned to each token, while \(\sum_{x_i \in x} w_i\) serves as a normalization factor, ensuring comparability across diverse weight distributions; the weighted precision and F1 score are defined analogously. During the weight calculation process, surgical domain-specific keywords or phrases are tokenized using the BERT tokenizer.
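The following sketch illustrates the weighted recall of Eq. (6) using Hugging Face BERT embeddings; the keyword list, weight value, and per-wordpiece weighting scheme are simplifying assumptions rather than our exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

SURGICAL_KEYWORDS = {"forceps", "scissors", "tissue", "vessel", "kidney"}  # illustrative list

def embed(text: str):
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        h = bert(**enc).last_hidden_state[0]            # (T, 768) contextual embeddings
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return tokens, torch.nn.functional.normalize(h, dim=-1)

def weighted_recall(candidate: str, reference: str, n: float = 10.0) -> float:
    """Weighted BERTScore recall: reference tokens matching surgical keywords
    receive weight n, all other tokens weight 1."""
    _, cand_e = embed(candidate)
    ref_toks, ref_e = embed(reference)
    sim = ref_e @ cand_e.T                               # pairwise cosine similarities
    best = sim.max(dim=1).values                         # greedy match per reference token
    w = torch.tensor([n if t in SURGICAL_KEYWORDS else 1.0 for t in ref_toks])
    return float((w * best).sum() / w.sum())

print(weighted_recall(
    "The forceps grasp the tissue in the upper left of the field.",
    "A prograsp forceps retracts tissue near a small vessel.",
))
```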

BERTScore effectively addresses the limitations of metrics such as Long-SPICE, which often fail to adequately capture semantic similarity. Furthermore, it evaluates the precision of surgical concepts through keyword weighting. However, as this method computes token-level vector similarities independently, it encounters difficulties in intuitively assessing the coherence of the overall contextual meaning.

CLAIR.

The core methodology of CLAIR (Criterion using Language Models for Image Caption Rating) [32] utilizes the zero-shot capabilities of large language models (LLMs) to reconceptualize the task of image caption evaluation as a natural language understanding problem. By employing a human-readable prompt, the model generates a score (on a scale from 0 to 100) that reflects the degree of alignment between candidate and reference captions. Furthermore, the model provides an explanation to justify the assigned score. This approach obviates the necessity for supplementary contextual examples or complex parameter fine-tuning, offering a design that is both conceptually straightforward and computationally efficient.

By constructing prompts specifically designed for surgical image captions, this method enhances the comprehension of semantic relationships between predicted and reference texts. It emphasizes critical elements, including instruments, anatomical locations, procedures, and surgical concepts, while minimizing the influence of variations in word order or expression style. Furthermore, in addition to providing evaluation scores, large language models offer detailed justifications for their assessments, thereby significantly increasing the interpretability of the evaluation results.
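A surgical-domain CLAIR prompt along these lines is sketched below; the wording is illustrative, and the formatted prompt would be sent to an LLM whose JSON response provides the score and its justification.

```python
CLAIR_PROMPT = """You are evaluating a machine-generated caption for a surgical image.

Reference caption:
{reference}

Candidate caption:
{candidate}

On a scale of 0 to 100, how well does the candidate convey the same surgical
instruments, anatomical locations, actions, and their relationships as the
reference? Ignore differences in word order or phrasing. Respond as JSON:
{{"score": <int>, "reason": "<one-sentence justification>"}}"""

prompt = CLAIR_PROMPT.format(
    reference="The prograsp forceps retracts tissue in the upper left of the field.",
    candidate="A pair of forceps is grasping tissue on the left side.",
)
print(prompt)   # this prompt would then be sent to an LLM and the returned JSON score parsed
```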

Human judgment relevance assessment.

To demonstrate the effectiveness of the proposed metrics, we utilized Spearman’s rank correlation coefficient, Pearson’s correlation coefficient, and Kendall’s tau coefficient to evaluate the association between the evaluation metrics and human judgments. We generated 100 sets of long-text captions using a large language model, with each set containing four distinct captions of varying quality levels, along with corresponding human-assigned ratings.
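The three correlation coefficients can be computed directly with SciPy, as in the toy example below (the scores shown are made-up placeholders, not results from our study).

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr, kendalltau

# Hypothetical data: one metric value and one human rating per caption.
metric_scores = np.array([72.0, 55.5, 90.0, 40.0, 63.0, 81.5])
human_ratings = np.array([70.0, 50.0, 95.0, 35.0, 68.0, 80.0])

rho, _ = spearmanr(metric_scores, human_ratings)
r, _ = pearsonr(metric_scores, human_ratings)
tau, _ = kendalltau(metric_scores, human_ratings)
print(f"Spearman rho={rho:.3f}, Pearson r={r:.3f}, Kendall tau={tau:.3f}")
```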

As shown in Table 1, the correlation between various evaluation metrics and human judgments varied significantly. Traditional metrics, including BLEU-4, METEOR, and CIDEr, consistently demonstrated lower correlation scores across all three measures. Notably, BLEU-4 exhibited particularly weak correlations in every test. These findings underscore the inherent limitations of traditional metrics in effectively evaluating long-text captions.

Table 1. Comparison of different metrics for evaluating long-text captions. Spearman’s ρ, Pearson’s r, and Kendall’s τ are used to measure correlation. CLAIR achieves the best performance across all correlation metrics.

https://doi.org/10.1371/journal.pone.0343823.t001

CLAIR demonstrates superior correlation with human judgments across all metrics (Spearman/Pearson/Kendall) by leveraging LLMs’ contextual understanding to directly assess text quality without predefined rules, particularly effective for long texts. Long-SPICE enhances semantic consistency evaluation through LLM-powered scene graph parsing, overcoming original SPICE’s sentence-level limitations but remaining constrained by surgical image semantic parsing accuracy. Weighted BERTScore (w = 10) achieves high correlation via contextualized word-vector matching, though domain-specific terminology recognition challenges persist, requiring optimal keyword weighting to balance surgical term emphasis and fluency (Fig 7).

Fig 7. BERTScore performance with varying weight factors in human judgment of relevance.

https://doi.org/10.1371/journal.pone.0343823.g007

As the weight factor increases, the performance of BERTScore in human judgment of relevance improves. However, excessive weight factors can lead to overemphasis on specific words, neglecting other critical information in the caption and weakening the model’s ability to assess overall fluency. This outcome highlights the importance of careful keyword weight setting. A moderate weight factor effectively enhances the model’s sensitivity to key surgical terms without compromising its ability to assess the overall semantic quality of the generated caption. This suggests that the weighted BERTScore method is promising for evaluating lengthy textual captions of surgical images, but requires appropriate adjustments to keyword weight factors to balance the importance of global semantics with specific local terms.

Datasets and implementation details

We extended the EndoVis2018 [5] dataset using the method outlined above. The dataset consists of 15 surgical sequences. In accordance with the setup of Xu et al. [10], sequences 1, 5, and 16 were designated as the test set, while the remaining sequences were allocated to the training and validation sets. Specifically, the training and validation sets contained 1,560 image-text pairs, and the test set included 447 image-text pairs, for a total of 2,007 image-text pairs. The image resolution is 1280 × 1024 pixels.

The training and evaluation are conducted using four NVIDIA RTX A6000 GPUs. The input images are resized to a fixed dimension of 224 × 224 pixels using bicubic interpolation. During the training process, the parameters of the image encoder, query transformer, and large language model remain frozen, and training is focused exclusively on the image query vector, text query vector, and the final linear layer.
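As a rough illustration of this freezing scheme, the snippet below reuses the SurgicalConceptQFormer sketch from the model section and leaves only the text query vectors and the final projection trainable; the image query vector lives in the Image Q-Former and would be handled analogously, and the actual parameter names in the released code may differ.

```python
import torch.nn as nn

def freeze_except(model: nn.Module, trainable_keys: tuple[str, ...]) -> None:
    """Freeze every parameter whose name does not contain one of the given
    substrings (a simplified stand-in for the training setup described above)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keys)

# Only the text query vectors and the final linear projection remain trainable.
model = SurgicalConceptQFormer()
freeze_except(model, ("text_queries", "proj"))
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable parameters")
```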

Evaluation results

We employed widely used natural language generation (NLG) evaluation metrics, along with the semantic evaluation metrics outlined before. Our approach was compared with several representative multimodal large models and state-of-the-art methods in the field of surgical image captioning, including SwinMLP-TranCAP [1], GIT [33], Qwen-VL [34], and BLIP2 [22]. Although SCA-Net [11] and SGT++ [16] surpass SwinMLP-TranCAP on the short-text dataset, their implementations have not been fully open-sourced. Consequently, we restricted our comparison to SwinMLP-TranCAP.

Multi-Task Learning (MTL) is a framework that extends BLIP2 with a multi-label classification head specifically designed for surgical instruments. Retrieval-Augmented Generation (RAG) is a network that enhances BLIP2 with an external surgical concept vector database and a retrieval-augmented module, with the surgical concept vector database restricted to categories of surgical instruments.

As shown in Tables 2 and 3, both GIT and Qwen-VL achieved superior performance across several traditional short-text evaluation metrics, surpassing TranCap on various other metrics. In contrast, BLIP2 demonstrated near-optimal performance on BERTScore, CLAIR, and Long-SPICE. This finding suggests that employing vision encoders alongside large language models pretrained on extensive natural image and text datasets can significantly enhance a model’s ability to comprehend and represent surgical images. Notably, BLIP2 outperformed several other pretrained multimodal models, which can be attributed to the query transformer’s efficacy in extracting visual information from the image, thereby facilitating the downstream language decoder in producing more accurate captions.

Table 2. Performance comparison of representative models on traditional short-text evaluation metrics. Metrics include BLEU-4 (B4), METEOR (MET), and CIDEr (CID), where higher scores indicate better performance in generating accurate and fluent surgical image captions.

https://doi.org/10.1371/journal.pone.0343823.t002

Table 3. Evaluation results for BERTScore, CLAIR, and Long-SPICE metrics across different models. BERTScore includes precision, recall, and F1-score for semantic similarity; CLAIR measures language accuracy and informativeness; Long-SPICE evaluates long-text generation accuracy for surgical image captions.

https://doi.org/10.1371/journal.pone.0343823.t003

The surgical concept retrieval-augmented model achieved superior results on most metrics, whereas the multi-task network with a classification head underperformed compared to the baseline network on certain metrics, such as CLAIR. This discrepancy suggests potential interference between the classification and text generation tasks within the multi-task network. Conversely, the retrieval-augmented approach, serving as a feature enhancement module, integrates surgical concept-related information into the original image query vector. This enhancement likely enables the model to more effectively capture the essential semantic features within surgical images. A qualitative comparison between our method and state-of-the-art models is presented in Fig 8, which visually demonstrates the superior accuracy of the captions generated by our RAG approach.

Fig 8. Qualitative comparison between our RAG method and state-of-the-art models.

With surgical concept retrieval augmentation, our method generates more accurate surgical captions.

https://doi.org/10.1371/journal.pone.0343823.g008

Ablation analysis

To elucidate further the contribution of the retrieval-augmented module to the network’s performance, we conducted experiments using the surgical concept database, which encompasses a diverse range of surgery-related information. BLIP2 serves as the baseline network, devoid of surgical concepts. RAG-t (Target) refers to the retrieval-augmented model integrating anatomical location categories. RAG-i (Instrument) denotes the retrieval-augmented model incorporating surgical instrument categories. RAG-v (Verb) corresponds to the retrieval-augmented model encompassing procedural action categories. Finally, RAG-ivt refers to the retrieval-augmented model integrating triplet categories.

As shown in Tables 4 and 5, the mere inclusion of the anatomical location category does not significantly enhance surgical concept recognition performance. In contrast, models incorporating surgical instruments, surgical procedures, and triplets as enhanced surgical concepts consistently surpass the baseline model’s performance. Notably, models including surgical instruments and triplets routinely achieve the highest rankings across most evaluation metrics, underscoring the critical importance of the surgical instrument category in interpreting surgical images. This is likely due to the relatively simplistic anatomical location annotations in the dataset, restricted to three categories: tissue, blood, and vessels. Consequently, the voting mechanism predominantly identifies “tissue,” rendering the anatomical location category insufficient for contributing substantial information. Conversely, although the surgical procedures category encompasses 11 valid classifications, the voting results are frequently dominated by the term “manipulate,” which is semantically ambiguous. From the standpoint of surgical image interpretation, surgical instruments are pivotal, functioning as prerequisites for performing surgical procedures. The regions where surgical instruments interact are the anatomical locations warranting focused attention. Conversely, the ambiguous classifications within the surgical procedures and anatomical location categories may hinder the model’s capacity to fully comprehend the image, potentially diminishing the attention mechanism’s focus on critical features.

Table 4. Performance comparison of various models on common language generation metrics, including BLEU-4 (B4), METEOR (MET), and CIDEr (CID). Metrics are evaluated across different configurations of the RAG model, showcasing the impact of retrieval-augmented components on caption quality.

https://doi.org/10.1371/journal.pone.0343823.t004

Table 5. Evaluation of retrieval-augmented models using advanced metrics, including BERTScore (Precision, Recall, F1), CLAIR, and Long-SPICE (Precision, Recall, F1). The analysis highlights the contribution of different retrieval strategies (e.g., anatomical locations, instruments, actions, and triplets) to the overall model performance.

https://doi.org/10.1371/journal.pone.0343823.t005

Discussion

This work addresses the dual challenges of data scarcity and model hallucination in the field of surgical image captioning.

First, regarding data construction, we present a systematic methodology for generating extensive textual captions. By automating the expansion of brief triplets into narrative descriptions and incorporating expert-in-the-loop validation, we bridge the gap between sparse technical annotations and the clinical need for comprehensive reporting. While our current expansion primarily targets visual facts—such as instrument interactions—and lacks higher-level anatomical reasoning (e.g., identifying critical vessels requiring protection), this pipeline establishes a foundational framework. Future work can leverage this infrastructure, collaborating with professional clinicians to incorporate deeper physiological insights and iterative medical knowledge updates.

At the model level, our experiments demonstrate that generic multimodal pre-training is insufficient for the specialized surgical domain. It is crucial to explicitly integrate domain-specific concepts to facilitate effective knowledge transfer. Instead of proposing a purely architectural novelty, we validate the Surgical Concept Retrieval-Enhanced strategy as a robust mechanism for domain adaptation. By incorporating a surgical concept vector database and employing cross-attention for concept injection, we enable the model to access external medical knowledge dynamically. This mechanism allows the network to selectively focus on salient surgical features, proving that retrieval-augmented generation (RAG) is highly effective in grounding MLLMs in medical reality.

Comparative analysis highlights the practical advantages of this retrieval-based paradigm. Compared with SGT++ [16], our approach eliminates the computational burden of pre-extracting scene graphs, enabling the model to capture subtle inter-image variations without heavy structural priors. Unlike SCA-Net [11], which relies on labor-intensive category annotations across the entire dataset, our method obviates the necessity for full-set classification supervision. This reduces the risk of task interference often seen in multi-task learning. Most importantly, this strategy directly addresses the issue of object and relationship hallucinations. By directing the model to retrieve and articulate verified key elements, our method demonstrates superior interpretability and precision in generating clinically coherent long-form captions.

Nevertheless, this retrieval-augmented method exhibits specific limitations. The effectiveness of the knowledge injection depends on the quality of the underlying vector database. Current databases rely on annotated concepts (instruments and sites), which restricts generalization when encountering long-tail distributions (e.g., rare surgical steps) or transferring across distinct procedures (e.g., from laparoscopic to neurosurgery). Moreover, the current evaluation is limited to the EndoVis2018 dataset. To better assess the model’s robustness and cross-domain generalization, future research must expand the scope of surgical caption datasets and incorporate diverse benchmarks, such as CholecSeg8k [35] or large-scale synthetic datasets.

Conclusion

This work addresses the challenges associated with generating long-text captions for laparoscopic surgical images, which are critical for tasks such as surgical skill training and operative report generation. To overcome the limitations of existing methods, which rely on short captions and generally extract surgical triplets from images, we propose an innovative approach for constructing a surgical image long-text caption dataset. This method significantly enhances the descriptive capabilities of the generated captions by expanding the EndoVis2018 dataset. On the model level, we introduce a surgical concept retrieval-augmented image captioning model. By integrating a surgical concept vector database and employing cross-attention mechanisms within the Q-Former module, the model is directed to focus on key features of the surgical images. Furthermore, we refine a set of semantic evaluation metrics tailored for long-text captions, enhancing their alignment with human judgment. Extensive experiments on the extended dataset demonstrate that our surgical concept retrieval-augmented model excels in generating long-text captions for surgical images. Additionally, our proposed evaluation metrics significantly improve the correlation with human judgment compared to traditional methods. This work not only advances the state of surgical image captioning but also provides several robust metrics for more accurate and meaningful model evaluation, thereby laying the groundwork for future developments in the field.

References

  1. 1. Xu M, Islam M, Ren H. Rethinking surgical captioning: end-to-end window-based mlp transformer using patches. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2022. p. 376–86.
  2. 2. Madani A, Namazi B, Altieri MS, Hashimoto DA, Rivera AM, Pucher PH, et al. Artificial intelligence for intraoperative guidance: using semantic segmentation to identify surgical anatomy during laparoscopic cholecystectomy. Ann Surg. 2022;276(2):363–9. pmid:33196488
  3. 3. Bieck R, Wildfeuer V, Kunz V, Sorge M, Pirlich M, Rockstroh M, et al. Generation of surgical reports using keyword-augmented next sequence prediction. Curr Dir Biomed Eng. 2021;7(2):387–90.
  4. 4. Elnikety S, Badr E, Abdelaal A. Surgical training fit for the future: the need for a change. Postgrad Med J. 2022;98(1165):820–3. pmid:33941663
  5. 5. Allan M, Kondo S, Bodenstedt S, Leger S, Kadkhodamohammadi R, Luengo I, et al. 2018 robotic scene segmentation challenge. arXiv:200111190 [Preprint]. 2020.
  6. 6. Rojas-Muñoz E, Couperus K, Wachs J. DAISI: Database for AI surgical instruction. arXiv:200402809 [Preprint]. 2020.
  7. 7. Anderson P, Fernando B, Johnson M, Gould S. SPICE: Semantic propositional image caption evaluation. European Conference on Computer Vision. Springer; 2016. p. 382–98.
  8. 8. Kolbinger FR, Bodenstedt S, Carstens M, Leger S, Krell S, Rinner FM, et al. Artificial Intelligence for context-aware surgical guidance in complex robot-assisted oncological procedures: an exploratory feasibility study. Eur J Surg Oncol. 2024;50(12):106996. pmid:37591704
  9. 9. Beddiar D-R, Oussalah M, Seppänen T. Automatic captioning for medical imaging (MIC): a rapid review of literature. Artif Intell Rev. 2023;56(5):4019–76. pmid:36160365
  10. 10. Xu M, Islam M, Lim CM, Ren H. Class-incremental domain adaptation with smoothing and calibration for surgical report generation. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2021. p. 269–78.
  11. 11. Chen Z, Guo Q, Yeung LK, Chan DT, Lei Z, Liu H, et al. Surgical video captioning with mutual-modal concept alignment. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2023. p. 24–34.
  12. 12. Nwoye CI, Alapatt D, Yu T, Vardazaryan A, Xia F, Zhao Z, et al. CholecTriplet2021: a benchmark challenge for surgical action triplet recognition. Med Image Anal. 2023;86:102803. pmid:37004378
  13. 13. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. pmid:9377276
  14. 14. Zeng Z, Xie Y, Zhang H, Chen C, Chen B, Wang Z. Meacap: Memory-augmented zero-shot image captioning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2024. p. 14100–10.
  15. 15. Li J, Vo DM, Sugimoto A, Nakayama H. Evcap: Retrieval-augmented image captioning with external visual-name memory for open-world comprehension. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2024. p. 13733–42.
  16. 16. Lin C, Zhu Z, Zhao Y, Zhang Y, He K, Zhao Y. SGT++: Improved scene graph-guided transformer for surgical report generation. IEEE Trans Med Imaging. 2024;43(4):1337–46. pmid:38015688
  17. 17. Papineni K, Roukos S, Ward T, Zhu WJ. Bleu: a method for automatic evaluation of machine translation. Proceedings of the 40th annual meeting of the Association for Computational Linguistics; 2002. p. 311–8.
  18. 18. Lin CY. Rouge: a package for automatic evaluation of summaries. In: Text summarization branches out; 2004. p. 74–81.
  19. 19. Banerjee S, Lavie A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; 2005. p. 65–72.
  20. 20. Vedantam R, Lawrence Zitnick C, Parikh D. Cider: Consensus-based image description evaluation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 4566–75.
  21. 21. Singhal A, et al. Modern information retrieval: a brief overview. IEEE Data Eng Bull. 2001;24(4):35–43.
  22. 22. Li J, Li D, Savarese S, Hoi S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. International Conference on Machine Learning. PMLR; 2023. p. 19730–42.
  23. 23. Dosovitskiy A. An image is worth 16x16 words: transformers for image recognition at scale. arXiv:201011929 [Preprint]. 2020.
  24. 24. Huang L, Yu W, Ma W, Zhong W, Feng Z, Wang H, et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans Inf Syst. 2025;43(2):1–55.
  25. 25. Bai Z, Wang P, Xiao T, He T, Han Z, Zhang Z, et al. Hallucination of multimodal large language models: a survey. arXiv:240418930 [preprint]. 2024.
  26. 26. Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv Neural Inf Process Syst. 2020;33:9459–74.
  27. 27. Ma Y, Wang A, Okazaki N. DREEAM: Guiding attention with evidence for improving document-level relation extraction. arXiv:230208675 [Preprint]. 2023.
  28. 28. Picco G, Galindo MM, Purpura A, Fuchs L, Lopez V, Hoang TL. Zshot: an open-source framework for zero-shot named entity recognition and relation extraction. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations); 2023. p. 357–68.
  29. 29. Sun Q, Huang K, Yang X, Tong R, Zhang K, Poria S. Consistency guided knowledge retrieval and denoising in llms for zero-shot document-level relation triplet extraction. Proceedings of the ACM Web Conference 2024; 2024. p. 4407–16.
  30. 30. Yuan C, Xie Q, Ananiadou S. Zero-shot temporal relation extraction with chatgpt. arXiv:230405454 [Preprint]. 2023.
  31. 31. Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y. Bertscore: Evaluating text generation with bert. arXiv:190409675 [Preprint]. 2019.
  32. 32. Chan D, Petryk S, Gonzalez J, Darrell T, Canny J. Clair: Evaluating image captions with large language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; 2023. p. 13638–46.
  33. 33. Wang J, Yang Z, Hu X, Li L, Lin K, Gan Z, et al. Git: A generative image-to-text transformer for vision and language. arXiv:220514100 [Preprint]. 2022.
  34. 34. Bai J, Bai S, Yang S, Wang S, Tan S, Wang P, et al. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv:230812966 [Preprint]. 2023;1(2):3.
  35. 35. Hong WY, Kao CL, Kuo YH, Wang JR, Chang WL, Shih CS. Cholecseg8k: a semantic segmentation dataset for laparoscopic cholecystectomy based on cholec80. arXiv:201212453 [Preprint]. 2020.