Abstract
Purpose: Detection of crucial components is a fundamental problem in surgical scene understanding. Limited by the high cost of spatial annotation, current studies mainly focus on recognizing the three surgical elements ⟨instrument, verb, target⟩, while the detection of surgical components ⟨instrument, target⟩ remains highly challenging. Some efforts have been made to detect surgical components, yet their limitations include: (1) detection performance depends heavily on the amount of manual spatial annotation; (2) no previous study has investigated the detection of targets.
Methods: We introduce a weakly supervised method for detecting key components by combining a surgical triplet recognition model with the Segment Anything Model (SAM) foundation model in a novel way. First, by setting appropriate prompts, we use SAM to generate candidate regions for surgical components. Then, we preliminarily localize components by extracting positive activation areas in the class activation maps from the recognition model. However, using the instrument's class activation as a position attention guide for target recognition leads to positional deviations in the target's resulting positive activation. To tackle this issue, we propose RDV-AGC by introducing an Attention Guide Correction (AGC) module. This module adjusts the attention guidance for the target according to the instrument's forward direction. Finally, we match the initial localization of instruments and targets with the candidate regions generated by SAM, achieving precise detection of components in the surgical scene.
Results: Through ablation studies and comparisons to similar works, our method has achieved remarkable performance without requiring any spatial annotations.
Conclusion: This study introduced a novel weakly supervised method for detecting surgical components by integrating a surgical triplet recognition model with a visual foundation model.
Citation: Zhang X, Feng J, Zhang Q, Wu L, Zhu Y, Zhou Z, et al. (2025) A weakly supervised method for surgical scene components detection with visual foundation model. PLoS One 20(5): e0322751. https://doi.org/10.1371/journal.pone.0322751
Editor: Xu Yanwu, South China University of Technology, CHINA
Received: September 19, 2024; Accepted: March 27, 2025; Published: May 27, 2025
Copyright: © 2025 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data that support the findings of this study are available at https://www.kaggle.com/datasets/xiaoyanzhang3/surgical-scene-components-spatial-annotations/data.
Funding: This work was supported by the National Key Research and Development Program of China (No. 2022YFC2407301) and two Key Research and Development Programs of Zhejiang Province (No. 2023C03172 and No. 2022C03147). All three programs provided support in study design as well as data collection and analysis.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Understanding surgical scene activities in endoscopic surgery is a crucial issue in surgical data analysis. AI-based methods have facilitated the automatic identification of actions during surgery, assisting surgeons in decision-making, planning, and skill teaching [1–3]. The surgical action triplet [4], defined as ⟨instrument, verb, target⟩, is a new paradigm for understanding surgical scene activities in a fine-grained manner. A considerable number of studies [5,6] focus on predicting the presence of surgical triplets. Nonetheless, few studies investigate the spatial localization of surgical scene components ⟨instrument, target⟩ in surgical video frames.
In the past few years, several works [7–9] have made efforts to detect components in surgical scenes. However, these methods share common limitations: (1) The performance of surgical component localization is significantly influenced by the amount of spatial annotations used for training. The outcomes from the Endoscopic Vision Challenge [8] show that fully supervised methods, dependent on extensive spatial annotations, significantly outperform weakly supervised methods lacking such annotations. Results in [9] also indicate a positive correlation with the amount of spatial annotations (a 60.2% increase in the amount of annotations yields a 13.2% improvement in performance). (2) The localization of targets is highly challenging due to their ambiguous boundaries and low discriminative features, leading to a focus solely on the localization of instruments while ignoring the targets. Following these observations, we consider two research questions: (1) In addition to detecting instruments in surgical scenes, how can we achieve precise localization of the targets they act upon? (2) Considering the high cost of manually annotating the positions of instruments and targets, how can we achieve excellent localization performance without relying on spatial annotations?
The Segment Anything Model (SAM) [10], a pioneering foundation model for promptable segmentation, has recently attracted significant attention. Many recent studies [11–13] utilize SAM for downstream medical tasks, using its zero-shot learning capabilities to improve the training efficiency of medical vision models. The emergence of such visual foundation models [14,15] offers the potential to detect surgical components without spatial annotation. To address the research questions raised above, we introduce a novel weakly supervised method that combines a surgical triplet recognition model with SAM. First, by setting appropriate prompts, we construct SAM's Automatic Mask Generation pipeline to generate candidate regions for surgical components. Then, similar to RDV-det [4], the detection version of the surgical triplet recognition model Rendezvous (RDV), we preliminarily locate instruments and targets based on the positive activation areas in the Class Activation Map (CAM) [16]. However, because Rendezvous [4] uses the instrument's class activation as a position attention guide for the target, the positive activation area of the target's resulting CAM may converge towards the instrument, as shown in Fig 1 (a). This issue becomes more intractable when multiple instruments operate on the same target: different instruments' CAMs may mislead the attention of the target, so that the target's positive activation fails to "focus" on the correct area, as shown in Fig 1 (b). To tackle this problem, we propose an Attention Guide Correction (AGC) module. By shifting the original attention guide for the target, i.e., the instrument's CAM, along the forward direction of the instrument, this module aligns the target's resulting positive activation with its actual location, thus achieving more accurate preliminary localization. Finally, using the Hungarian algorithm [17] with appropriately designed matching costs, we match the preliminary localization of instruments and targets with the candidate regions generated by SAM, thereby enabling precise detection of surgical scene components without the need for spatial annotations. To evaluate our method, we annotate the spatial bounds of instruments and targets on a subset of CholecT50 [18]. These annotations are only utilized for model evaluation and will be publicly available with this paper. Ablation studies validate the effectiveness of the proposed modules, and comparisons with similar works demonstrate that our method achieves superior detection of surgical components without relying on spatial annotations. To summarize, we make the following contributions in this paper:
- A novel weakly supervised method for detecting surgical components that combines a surgical triplet recognition model with the Segment Anything Model (SAM) foundation model.
- We present the first exploration of locating the target acted upon by the instrument. The proposed Attention Guide Correction (AGC) module significantly enhances target localization performance through the automatic correction of position attention guides.
- We match candidate regions from SAM without class labels to class activation maps. We utilize the strong generalization capabilities of the foundation model, achieving more accurate weakly supervised localization than CAM-based methods.
- The proposed method eliminates the need for spatial annotations during training, thereby facilitating its extension to other laparoscopic surgery scenes.
- Our approach achieves state-of-the-art results among all publicly available weakly supervised methods for detecting instruments. Moreover, its performance is comparable to that of fully supervised methods trained on extensive spatial annotations.
Related work
Surgical action triplet recognition
The concept of the surgical action triplet originates from existing surgical ontologies, where each triplet is described as a combination of the instrument used, a verb representing the action performed, and the anatomy acted upon [19]. Surgical action triplets were introduced into surgical workflow analysis by Katić et al. [20] to improve surgical phase recognition. Nwoye et al. developed a deep learning model, Tripnet, for the automatic recognition of action triplets in surgical video frames [21]. This model introduced a mechanism called the Class Activation Guide (CAG), which uses the class activation of the instrument as appearance cues to guide the recognition of verbs and targets. Additionally, it projects the components of the triplet into a 3D interaction space (3Dis) to learn their association, thus achieving more accurate results. In this work, Nwoye et al. also introduced CholecT40, an endoscopic video dataset annotated with action triplets.
In a recent work, Nwoye et al. improved upon Tripnet [4]. The proposed model, Rendezvous (RDV), leverages the attention mechanism at two different levels to enhance the recognition performance of triplets. Firstly, it integrates the spatial attention mechanism with the Class Activation Guide module from Tripnet, introducing a new form of spatial attention called the Class Activation Guided Attention Mechanism (CAGAM). It focuses on using the instrument’s resulting class activation as a position attention guide for recognizing verbs and targets. To model the final triplet association, the RDV model adds a new form of semantic attention called Multi-Head of Mixed Attention (MHMA). This technique employs several cross and self-attentions to effectively capture the relationships between the three components within the surgical action triplet.
Instrument detection
Utilizing region-based convolutional neural networks, Jin et al. investigated the localization of instruments in laparoscopic surgery videos for the first time in [7]. They also introduced a new dataset, m2cai16-tool-locations, which extends the m2cai16-tool dataset with spatial bounds of instruments. An endoscopic vision challenge named CholecTriplet2022 [8] introduced a detection task, which required the localization of instruments in video frames and their accurate association with triplets. Since the challenge provides no instrument spatial annotations, the majority of the submitted methods used weak supervision to learn instrument locations by exploiting the model's class activation map (CAM), while a few other methods took advantage of external datasets that provide complementary spatial annotations for instruments. The results indicate that, compared to fully supervised methods, the performance of weakly supervised methods for instrument detection is unsatisfactory due to the imprecision of CAM-based localization. Among these, the best weakly supervised method achieves only 11.0% mean average precision (mAP), while the mAP of the top fully supervised approach is 41.9%.
A recent work [9] proposed a two-stage mixed supervised learning strategy for instrument localization and triplet association. This approach begins by learning target embeddings that fuse instrument spatial semantics and image features, then builds associations between detected instrument instances and target embeddings based on interaction graphs. Compared to the top methods in CholecTriplet2022 [8], this method achieves higher instrument detection accuracy with fewer bounding box instances. However, its detection performance is also highly dependent on the amount of spatial annotations used for training: to achieve an mAP of over 50% in instrument detection, more than 15,000 spatially annotated frames are needed for model training.
Adaptation of SAM for medical images analysis
A recent work introduced a foundation model called SAM for the task of promptable segmentation [10]. Leveraging its powerful zero-shot learning ability, SAM can segment images given prompts in the form of points, bounding boxes, and masks. However, directly applying SAM to medical images does not produce good results [22,23], because the training data of SAM primarily consist of natural images. Some studies have investigated how to adapt the SAM model for downstream medical image tasks. Instead of fine-tuning the SAM model, Wu et al. proposed the Medical SAM Adapter (Med-SA) [12], which incorporates domain-specific medical knowledge into the segmentation model using a light yet effective adaptation technique. Li et al. proposed a fine-tuned SAM model for polyp segmentation named Polyp-SAM [24]. They fine-tune the SAM model on a collection of multi-center colonoscopy images, freezing SAM's encoder and fine-tuning only the mask decoder.
In the field of laparoscopic surgery, Wang et al. examined SAM's robustness and zero-shot generalizability in robotic laparoscopic surgery [23]. The evaluation results reveal that although SAM shows remarkable zero-shot learning ability with bounding box prompts, it struggles to segment instruments with point prompts and in unprompted settings. Another work introduced AdaptiveSAM [13], which adaptively modifies SAM; its fine-tuning approach, called bias-tuning, requires a significantly smaller number of trainable parameters than SAM (less than 2%). Currently, the application of SAM in laparoscopic surgery typically requires experts to provide prompt-based inputs.
Materials and methods
We design a weakly supervised deep learning model for the detection of crucial surgical components ⟨instrument, target⟩, as shown in Fig 2 and Fig 3. Our method is implemented through the following steps.
SAM candidate regions generation
For each video frame, we use a modified version of SAM's automatic mask generation pipeline [10] to generate candidate regions. First, we prompt SAM to generate all regions with an N × N regular grid of foreground points. Second, we filter the regions by predicted Intersection over Union (IoU) and stability, and remove redundant areas using Non-Maximum Suppression (NMS). We also observe that the masks from SAM contain minor, spurious components caused by reflections or water stains. To ignore these areas of no interest, we remove regions smaller than 2000 pixels. Finally, the remaining regions are ranked based on the average of their confidence and stability scores and truncated to the top-10 region proposals. Ultimately, each frame yields a collection of candidate regions, denoted as Regs.
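For concreteness, the sketch below shows one way this candidate-generation step could be implemented with the publicly released segment_anything package. The grid density and the IoU/stability thresholds are not stated in the text and are shown as placeholder values; only the 2000-pixel minimum area and the top-10 truncation come from the description above.

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load SAM (ViT-H weights pre-trained on SA-1B), as in the implementation details.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# Automatic mask generation with a regular grid of foreground point prompts.
# points_per_side, pred_iou_thresh and stability_score_thresh are assumed values.
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # N x N grid of point prompts (N assumed)
    pred_iou_thresh=0.88,         # filter by predicted IoU (threshold assumed)
    stability_score_thresh=0.95,  # filter by stability score (threshold assumed)
    box_nms_thresh=0.7,           # remove redundant regions with NMS
)

def generate_candidate_regions(frame_rgb: np.ndarray, top_k: int = 10):
    """Return the top-k candidate regions (Regs) for one video frame."""
    masks = mask_generator.generate(frame_rgb)
    # Drop spurious regions smaller than 2000 pixels (reflections, water stains).
    masks = [m for m in masks if m["area"] >= 2000]
    # Rank by the average of predicted IoU (confidence) and stability score.
    masks.sort(key=lambda m: 0.5 * (m["predicted_iou"] + m["stability_score"]),
               reverse=True)
    return masks[:top_k]
```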
Target class activation correction
The instrument's class activation is utilized as a position attention guide for targets in RDV [4], inevitably causing the positive activation areas in the target's CAMs to converge towards the instrument, as shown in Fig 1 (a). To align the target's positive activation with its true position, we develop a recognition model named RDV-AGC by introducing an Attention Guide Correction (AGC) module, as illustrated in Fig 2 and Fig 3. It channel-wisely translates the original attention guide (the instrument's CAM) in the forward direction of the instrument by (Kh, Kv). The absolute values of Kh and Kv represent the shift distances in the horizontal and vertical directions, while their signs indicate the direction (positive for right/upward, negative for left/downward). The forward direction is inferred from the instrument's current position, since instruments typically move from outside to inside the scene in laparoscopic surgery. For example, when the center of the instrument's positive activation lies in the upper-right area, we hypothesize that the instrument will move towards the bottom-left; the operated target is then most likely located at the lower left of the instrument, hence Kh < 0 and Kv < 0. In scenes where multiple positive activation areas exist in the same CAM, the forward direction is determined by the area with the maximal average activation. For each activation point (x, y) in the original attention guide, the new coordinate (x', y') is given by Eq 1,
where W and H respectively denote the number of activation points in the horizontal and vertical directions of the CAM. Our model RDV-AGC is trained end-to-end. We adopt a training strategy similar to fine-tuning to accelerate the model's convergence. Specifically, we initialize the Feature Extractor module and the Weakly Supervised Localization (WSL) module [4] with the corresponding parameters from RDV, which is pre-trained on the same dataset. The loss function follows the definition in RDV, given as Eq 2, where Lcomp represents the multi-task loss for multi-label classification of each triplet component (the detail of Lcomp is provided in S1 Appendix), Lassoc represents the triplet association loss, modeled as a sigmoid cross-entropy and scaled by a warm-up parameter, and the final term is an L2 regularization loss with weight decay.
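As a concrete illustration of the correction described above, the following PyTorch sketch translates a CAM channel-wise by (Kh, Kv). The padding scheme (zero-filling the vacated border) and the coordinate convention are assumptions, since the text does not spell out how out-of-range activation points are handled.

```python
import torch

def shift_attention_guide(cam: torch.Tensor, k_h: int, k_v: int) -> torch.Tensor:
    """Channel-wise translation of a CAM used as a position attention guide.

    cam: tensor of shape (B, C, H, W), e.g. the 8x14 instrument CAM.
    k_h: horizontal shift (positive = right, negative = left).
    k_v: vertical shift (positive = up, negative = down), assuming row 0 is the top.
    Vacated positions are zero-filled (an assumption; wrapping would keep torch.roll as is).
    """
    # Positive k_v (upward) moves content towards smaller row indices.
    shifted = torch.roll(cam, shifts=(-k_v, k_h), dims=(-2, -1))

    # Zero out the rows/columns that wrapped around, so the shift acts like zero-padding.
    _, _, H, W = cam.shape
    if k_v > 0:
        shifted[..., H - k_v:, :] = 0
    elif k_v < 0:
        shifted[..., : -k_v, :] = 0
    if k_h > 0:
        shifted[..., :, : k_h] = 0
    elif k_h < 0:
        shifted[..., :, W + k_h:] = 0
    return shifted
```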
Preliminary localization
Utilizing CAM to identify objects’ discriminative regions is a commonly used method for weakly supervised localization [25–27]. The pipeline of preliminary locating surgical components from CAM is outlined as follows:
- Extracting the CAMs from the global max pooling (GMP) layer in both the instrument and target branches.
- Identifying all positive and locally maximal activation points from each CAM.
- Constructing bounding boxes centered on these maximal activation points to enclose the surrounding positive activation areas.
- Removing the redundant boxes by NMS.
We ultimately generate a collection of positive activation areas of instruments and targets in each frame, denoted as Boxr.
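A minimal sketch of this CAM-to-box procedure is given below, assuming the CAM has already been up-sampled to frame resolution and using torchvision's NMS. The activation threshold and the box-growing rule (here, the bounding box of the connected positive region around each local maximum) are illustrative choices, not the paper's exact settings.

```python
import numpy as np
import torch
from scipy import ndimage
from torchvision.ops import nms

def cam_to_boxes(cam: np.ndarray, act_thresh: float = 0.0, iou_thresh: float = 0.5):
    """Preliminary localization from one CAM channel (H x W, frame resolution).

    Returns bounding boxes (x1, y1, x2, y2) around positive activation areas.
    """
    positive = cam > act_thresh                # positive activation mask
    labels, num = ndimage.label(positive)      # connected positive regions

    boxes, scores = [], []
    for region_id in range(1, num + 1):
        ys, xs = np.where(labels == region_id)
        # Each region is scored by its locally maximal activation point,
        # and the box encloses the surrounding positive activation area.
        peak = cam[ys, xs].max()
        boxes.append([xs.min(), ys.min(), xs.max(), ys.max()])
        scores.append(peak)

    if not boxes:
        return np.empty((0, 4))
    boxes_t = torch.as_tensor(boxes, dtype=torch.float32)
    scores_t = torch.as_tensor(scores, dtype=torch.float32)
    keep = nms(boxes_t, scores_t, iou_thresh)  # remove redundant boxes
    return boxes_t[keep].numpy()
```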
Precise localization
We match each preliminary localization result boxi with a SAM candidate region regj through a mechanism denoted Candidate Region Matching (CRM). Guided by the criterion of minimum matching cost, the matching is performed with the Hungarian algorithm [17]. The matching cost is determined by the overlapping area of boxi with regj and the average activation value within boxi. The reciprocal of Ci,j, an element of the cost matrix, is defined in Eq 3, where Ni denotes the number of activation points in boxi and CAMC represents the CAM from which boxi is extracted, with C being its channel category.
However, the close proximity between the instrument and the target it acts upon can cause confusion in matching when only the Intersection over Union (IoU) between boxi and regj is used as the overlap metric. To tackle this, we use a ResNet18-based network to classify each regj into Instrument or Tissue. This lightweight model is trained on approximately 300 SAM candidate regions labeled as Instrument and Tissue. Our definition of Overlap is given in Eq 4:
I(box,reg) and T(box,reg) are defined in Eq 5, where cbox is the category of box, preg is the classification prediction for reg, and K is a positive constant that keeps CIoU(box,reg) + K always positive. Through this approach, the risk of mismatching is reduced; for example, a reg categorized as Instrument and a box of the Target class will not be matched together, since in this case their Overlap would be 0. CIoU(b,r), referring to the simplified CIoU [28], is defined as CIoU(b,r) = IoU(b,r) - d^2/c^2, where d represents the distance between the central points of box and reg, and c denotes the diagonal length of their minimum enclosing rectangle.
Ultimately, we derive a collection of matched pairs (boxm, regn). For each matched regn, its bounding box is extracted as the localization result, with its class inherited from the category of boxm's source CAM.
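To make the CRM step concrete, the sketch below builds a cost matrix from overlap and average activation and solves the assignment with scipy's Hungarian solver. The exact cost formula (Eq 3) and the category-gated Overlap (Eqs 4–5) are not fully reproduced here; the IoU-based overlap, the class gate, and the small constant eps are simplified stand-ins.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(b1, b2):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2, y2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter + 1e-9)

def crm_match(boxes, box_classes, box_avg_act, reg_boxes, reg_classes):
    """Candidate Region Matching (simplified sketch).

    boxes:       CAM-based preliminary boxes, list of (x1, y1, x2, y2)
    box_classes: 'instrument' or 'target' for each preliminary box
    box_avg_act: average activation value inside each preliminary box
    reg_boxes:   bounding boxes of SAM candidate regions
    reg_classes: 'instrument' or 'tissue' predicted by the ResNet18 classifier
    Returns a list of (box_index, region_index) matched pairs.
    """
    eps = 1e-6
    cost = np.full((len(boxes), len(reg_boxes)), 1.0 / eps)
    for i, box in enumerate(boxes):
        for j, rbox in enumerate(reg_boxes):
            # Gate the overlap by category: an Instrument region should not be
            # matched with a Target box (their Overlap is treated as zero).
            compatible = (box_classes[i] == "instrument") == (reg_classes[j] == "instrument")
            overlap = box_iou(box, rbox) if compatible else 0.0
            score = overlap * box_avg_act[i]     # larger score -> smaller cost
            cost[i, j] = 1.0 / (score + eps)
    rows, cols = linear_sum_assignment(cost)     # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))
```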
Results and discussion
Datasets and evaluation metrics
Training dataset.
CholecT50 [18] is a publicly available dataset of endoscopic videos of laparoscopic cholecystectomy surgery, introduced to support research on fine-grained actions in laparoscopic surgery. The dataset consists of 50 videos, each annotated by two surgeons with triplet information in the form of ⟨instrument, verb, target⟩ for each frame. It contains binary presence labels for 6 instruments, 10 verbs, 15 targets, and 100 triplet categories. Our RDV-AGC is trained on 40 videos and validated on 5 videos from the CholecT50 dataset. The video IDs are detailed in Table 1.
Evaluation dataset.
To evaluate localization for surgical components, we construct a dataset based on the validation set from the CholecTriplet 2022 Challenge [8] used for the instrument localization task. This dataset involves 5 video clips from CholecT50 [18] with the labels of triplet binary presence and instruments’ bounding boxes. The specific video IDs are shown in Table 1. The uniqueness of our evaluation dataset lies in: (1) We not only draw bounding boxes for the 6 types of instruments in the video clips but also annotate the spatial bounds for 12 types of targets, thus addressing the lack of target localization annotations in [8,18]. (2) Instead of only considering the effector of the instrument, we annotate for the whole instrument to achieve complete localization.
The spatial annotations in the form of bounding boxes are drawn using the LabelImg [29] annotation tool by two professionals, who have been trained by a team of surgeons. These spatial annotations are merged with the existing triplet binary presence labels and stored in JSON format. Our evaluation dataset contains bounding box annotations for surgical components from a total of 902 video frames and will be publicly available with this paper. Table 2 presents the bounding box instance counts for surgical components in our evaluation dataset.
Evaluation metrics.
The method performance is assessed using average precision (AP) and average recall (AR). These metrics are computed from the precision (p) and recall (r) scores, defined as p = TP / (TP + FP) and r = TP / (TP + FN), where TP, FP, and FN refer to true positives, false positives, and false negatives, respectively. In the instrument and target detection task, a detection is considered a TP if it satisfies both conditions: the predicted class matches the ground-truth class, and the IoU between the predicted bounding box and the ground truth exceeds a certain threshold. If either condition is violated, the detection is assigned as an FP, while missed detections corresponding to the ground truth are marked as FN. For each IoU threshold, p and r are computed, and p-r curves and r-IoU curves are plotted. AP and AR are obtained by calculating the area under the p-r curve and twice the area under the r-IoU curve, respectively. In our evaluation, we compute the final performance metrics by averaging the AP and AR values across all K classes.
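The sketch below shows how AP and AR could be computed from these curves with numpy, mirroring the area-under-curve definitions above. The IoU range for AR (0.5 to 1.0, hence the factor of 2) is an assumption based on the common COCO-style convention.

```python
import numpy as np

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Area under the precision-recall curve at a fixed IoU threshold."""
    order = np.argsort(recalls)
    return float(np.trapz(precisions[order], recalls[order]))

def average_recall(ious: np.ndarray, recalls: np.ndarray) -> float:
    """Twice the area under the recall-IoU curve (IoU assumed to span [0.5, 1.0])."""
    order = np.argsort(ious)
    return float(2.0 * np.trapz(recalls[order], ious[order]))

def mean_over_classes(per_class_values) -> float:
    """Final metric: mean AP or mean AR over all K classes."""
    return float(np.mean(per_class_values))
```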
Implementation details
First, we train the recognition network RDV-AGC on our training dataset. Then, we use RDV-AGC to extract positive activations for each input video frame. These positive activation areas are finally matched with the candidate regions generated by SAM. During the training of RDV-AGC, we first unify the spatial dimensions of the input frames by resizing them to 256 × 448. We also employ random horizontal/vertical flips and random brightness/contrast shifts as data augmentation. The resolution of the output CAMs from both the instrument and target branches in RDV-AGC is set to 8 × 14. We initialize the Feature Extractor module and WSL module with parameters from RDV, which was pre-trained on our training set for 100 epochs. All modules in RDV-AGC are trained using stochastic gradient descent with momentum as the optimizer. We apply a sequential combination of learning rate schedules: a linear warm-up scheduler before the milestones, followed by exponential decay after the milestones. Our model is trained for 50 epochs with a batch size of 32. For the generation of SAM candidate regions, we initialize SAM with parameters pre-trained on the SA-1B dataset [10] at the 'vit_h' scale.
Our network is implemented in PyTorch and runs on an NVIDIA RTX A6000. Full training of RDV-AGC takes approximately 65-80 hours on a single RTX A6000. The total storage consumption for RDV-AGC, SAM, input data, output weights, and SAM candidate masks is about 15 GB.
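As an illustration of this training setup, the following sketch combines SGD with momentum and a warm-up-then-decay schedule using standard PyTorch schedulers. The momentum value, warm-up length, milestone, and decay factor are placeholders, since their exact values are not given in the text above.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LinearLR, ExponentialLR, SequentialLR

model = torch.nn.Linear(10, 10)          # stand-in for the RDV-AGC modules

# SGD with momentum as the optimizer (momentum and weight decay values assumed).
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-5)

# Linear warm-up before the milestone, exponential decay afterwards
# (warm-up length, milestone, and gamma are placeholder values).
warmup = LinearLR(optimizer, start_factor=0.1, total_iters=5)
decay = ExponentialLR(optimizer, gamma=0.95)
scheduler = SequentialLR(optimizer, schedulers=[warmup, decay], milestones=[5])

for epoch in range(50):                  # 50 epochs, as stated in the text
    # ... one training epoch over the CholecT50 training split ...
    scheduler.step()
```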
Comparison against similar works
We compare the proposed method with other works focused on instrument detection. These include both supervised methods [7,9] that rely on spatially annotated training data and weakly supervised methods [8] that do not require such annotations. The results of these methods on our evaluation dataset (902 frames) are presented in Table 3. However, it is important to note that supervised methods, such as Faster R-CNN [7] and MCIT+IG [9], use large-scale training datasets where only the effectors (tips) of the instruments were spatially annotated, like the m2cai16-tool-locations dataset [7] used in [7,9]. This causes the trained models to focus more on locating the effectors of the instruments. In contrast, our proposed method and evaluation dataset are both designed for locating the complete instrument regions in the frame. The inconsistency between these two types of annotations is illustrated in Fig 4. Given this, directly comparing detection performance using the entire evaluation dataset may not be entirely fair.
Fig 4. The blue bounding boxes are from the dataset in [8], and the red ones are from ours.
To ensure a more reasonable comparison, we create an additional evaluation subset containing 279 frames, where the instrument annotations are suitable for evaluation across all methods. Specifically, we compare our evaluation dataset against the original dataset from the CholecTriplet 2022 Challenge [8], which also annotates only the effector parts of the instruments. We select frames where the bounding box annotations for the instruments are highly consistent (average IoU > 0.85) between the two datasets. As shown in Fig 4 (2) and (3), when only the effector part of the instrument is visible, the localization boxes from both annotation types overlap significantly. We perform additional evaluations on this selected subset for all similar methods, and the results are also presented in Table 3. Notably, as other similar methods have not yet explored target localization, we do not present target detection results in this section.
According to Table 3, our method shows outstanding performance on instrument detection, notably exceeding other weakly supervised approaches: it outperforms DualMFFNet [8] by +46.6% mean Average Precision (mAP) on the evaluation set (902 frames) and +46.7% on the subset (279 frames), and exceeds MTTT [8] by +39.5% mAP on the evaluation set and +39.4% mAP on the subset. Compared to supervised methods that require spatial annotations, our method surpasses MCIT+IG [9], trained on approximately 4,000 spatially annotated frames, by +29.2% mAP on the evaluation set and +17.1% mAP on the subset. On the evaluation subset, it is comparable in accuracy to both MCIT+IG [9] trained on about 15,000 annotated frames and Faster R-CNN [7] trained on approximately 23,000 frames. Although it underperforms MCIT+IG [9] trained with around 25,000 frames by 9.8% mAP on the evaluation subset, our method significantly reduces the need for extensive spatial annotations.
Qualitative experiment results
The localization of surgical components on evaluation dataset, depicted by bounding boxes overlaid on the frames, is shown in Fig 5. The proposed method accurately locates instruments and targets in most cases. It occasionally fails in several situations, such as only partially detecting an instrument as shown in Fig 5 (d), or incorrectly identifying targets as instruments in Fig 5 (f). The latter error occurs when the model fails to distinguish candidate regions of instruments and targets during the matching stage.
Fig 5. Frames (a)-(b) contain only one pair of surgical components. Frames (c)-(d) each contain two pairs of surgical components with the same target for both pairs. Frames (e)-(f) also contain two pairs of surgical components, but with different targets for each pair. Images highlighted with yellow borders show incorrect detection results.
Ablation studies
Ablation study on crucial modules.
To demonstrate the superiority of our method in localizing instruments and targets, we begin with an ablation study on the crucial modules in our model. The presence or absence of the AGC and CRM modules corresponds to the following four methods:
1. RDV-det: This is a detection version of RDV [4] model, which learns the location of the 6 distinct instruments and 15 distinct targets in the CholecT50 dataset from the last 6-channel convolutional layer of the WSL and the last 15-channel convolutional layer of the target branch in CAGAM. Specifically, it extracts bounding boxes for every positive activation in the CAMs channel-wisely and applies non-maximum suppression (NMS) to remove redundant objects.
2. RDV-AGC-det: It integrates our AGC module into RDV, following the description in Target Class Activation Correction. Then the RDV-AGC model is used to locate surgical components in the same manner as RDV-det.
3. RDV + CRM: It first generates SAM candidate regions following the pipeline in SAM Candidate Regions Generation, then preliminarily locates surgical components by extracting the positive activation of the CAMs from RDV. Finally it uses CRM to match candidate regions and preliminarily locations as described in Precise Localization.
4. RDV-AGC + CRM: Our proposed method combines the RDV-AGC recognition network with the CRM mechanism.
For fair comparison, we select the best weights after 150 epochs of training for RDV, and the best weights after 50 epochs for RDV-AGC. Both RDV and RDV-AGC are trained on our training set. The matched SAM candidate regions are extracted in the form of bounding boxes as the final localization results.
Table 4 shows the quantitative results of the four above-mentioned methods in terms of Average Precision and Average Recall for instrument and target detection. Drawing on SAM's powerful zero-shot learning capability, our method provides more accurate localization than the CAM-based methods, improving instrument detection performance by +48.86% mAP, a 23.5-fold increase over the RDV-AGC-det method. Furthermore, by correcting the attention guide for targets, the AGC module improves our method's target detection performance by +16.51% mAP, a 4.04-fold increase for the RDV-AGC + CRM method.
Some predicted results in Fig 6 intuitively show that the AGC module brings the positive activation of targets closer to their true positions. As shown in Fig 6 (a) and Fig 6 (b), after the introduction of the AGC module, the positive activation in resulting CAM for target (gallbladder) is precisely adjusted to the upper left side of the instrument (clipper). Moreover, a comparison between Fig 6 (a) and Fig 6 (c) also reveals that the CRM mechanism significantly enhances the accuracy of localization by correctly matching the SAM candidate regions with the CAM-based localization boxes.
Fig 6. Each subfigure represents: (a) RDV-det; (b) RDV-AGC-det; (c) RDV + CRM; (d) Ours. Different colored boxes represent: white = ground truth, yellow = final localization result, green (dashed line) = preliminary localization box, red = bound of the positive activation region in the CAM.
Additionally, Fig 7 provides an intuitive illustration of the AGC module’s operation and its effectiveness in offering a more accurate guide for target detection. As shown in Fig 7, the AGC module brings the positive activation regions of multiple instruments closer to the actual positions of their corresponding targets.
Selection of important hyper-parameters.
This section analyses the absolute values of the crucial parameters Kh and Kv in the AGC module. According to the description of the AGC module in Target Class Activation Correction, the correction of the target's position attention guide becomes more effective as the translation distance of the instrument's CAM more closely approximates the actual distance between the instrument and the target projected onto the CAM. Therefore, we first perform a statistical analysis of the spatial distances between the instrument and the target within the same triplet in laparoscopic cholecystectomy video frames. This analysis is aimed at determining the typical range and central tendency of these distances in such surgical scenes, which helps refine the range of interest for |Kh| and |Kv|.
Our analysis is conducted on the evaluation data annotated with spatial locations. We calculate the horizontal distance Dh and vertical distance Dv between the central points of the bounding boxes for instruments and targets within each triplet, scaled by the dimensions of the video frame. We randomly divide the evaluation data into five folds. Fig 8 shows the Kernel Density Estimation (KDE) [30] heat maps of Dh and Dv for both the total data and each individual fold, illustrating their distribution characteristics. Fig 8 (a)-(f) reveal that the density distributions of Dh and Dv are highly consistent across all folds, with the highest density areas concentrated within narrow ranges of Dh and Dv.
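A minimal sketch of this distance analysis, assuming per-triplet bounding boxes are available as (x1, y1, x2, y2) in pixel coordinates, is shown below; scipy's gaussian_kde stands in for the KDE heat maps reported in Fig 8.

```python
import numpy as np
from scipy.stats import gaussian_kde

def center(box):
    """Center (cx, cy) of a box given as (x1, y1, x2, y2)."""
    return (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0

def instrument_target_distances(pairs, frame_w, frame_h):
    """Horizontal (Dh) and vertical (Dv) center distances, scaled by frame size."""
    dh, dv = [], []
    for inst_box, targ_box in pairs:
        ix, iy = center(inst_box)
        tx, ty = center(targ_box)
        dh.append((tx - ix) / frame_w)
        dv.append((ty - iy) / frame_h)
    return np.array(dh), np.array(dv)

def kde_density(dh, dv, grid_size=100):
    """2D KDE over (Dh, Dv), evaluated on a regular grid (as in a heat map)."""
    kde = gaussian_kde(np.vstack([dh, dv]))
    xs = np.linspace(dh.min(), dh.max(), grid_size)
    ys = np.linspace(dv.min(), dv.max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)
    return kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(grid_size, grid_size)
```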
Considering the ranges of highest distribution density for Dh and Dv, together with the dimensions of the CAMs in RDV-AGC, we select the ranges of interest for |Kh| and |Kv|. The model performance across various combinations of |Kh| and |Kv| within these ranges is shown in Fig 9. The model achieves its highest accuracy for target detection when both |Kh| and |Kv| are set to 2, with performance diminishing as the values deviate from this optimal setting. Therefore, both |Kh| and |Kv| are set to 2 in the AGC module to achieve the best performance in laparoscopic cholecystectomy scenes. Notably, the optimal values of Kh and Kv correspond to the observed trends in the distances between instruments and targets in the visual field of laparoscopic cholecystectomy.
Ablation study on CAM resolution.
We apply CAMs with different resolutions (8 × 14, 16 × 28, and 32 × 56) for Target Class Activation Correction and Preliminary Localization. Similarly, (|Kh|, |Kv|) is set to (2, 2), (4, 4), and (8, 8), respectively. The models with these varying CAM resolutions are evaluated on the evaluation dataset, and the corresponding detection performance is shown in Table 5.
As presented in Table 5, increasing the resolution of the output CAM does not lead to significant improvements in the detection performance for either instruments or targets. Notably, the AP and AR for target detection even decline slightly, possibly because the increased resolution introduces redundant information when localizing the positive activation areas. More importantly, higher-resolution CAMs require more computational resources, resulting in lower training efficiency. Therefore, we set the output CAM resolution to 8 × 14 for all experiments conducted in this study.
Conclusion
In this work, we propose an innovative method for the detection of critical surgical components. For CAM-based preliminary localization, we introduce RDV-AGC, which incorporates an attention guide correction module to achieve more accurate localization of targets. For precise localization, we propose the CRM mechanism. Through accurate matching of the candidate regions generated by SAM with the CAM-based preliminary localization, we facilitate precise detection of surgical components with no need for spatial annotations. To evaluate our method, we also annotate spatial bounds for instruments and targets in 902 frames. Both quantitative and qualitative results validate the superiority of our method. While these initial results are encouraging, some limitations remain:
- Although SAM can generate relatively accurate segmentation masks for surgical components with appropriate prompts, issues such as unstable segmentation edges persist. We have not yet achieved pixel-level segmentation of surgical components.
- The detection for targets has considerable room for improvement due to their ambiguous boundaries and low discriminative features.
- The translation distances |Kh| and |Kv| in the AGC module require re-evaluation and adjustment for different domains, which impacts the model's adaptability.
Based on this work, future work will consider the pixel-level segmentation of surgical components. Moreover, we will study the localization and segmentation of richer anatomy in surgical scenes in the future.
Supporting information
S1 Appendix. The detail of the multi-task loss Lcomp.
https://doi.org/10.1371/journal.pone.0322751.s001
(PDF)
Acknowledgments
Throughout the writing of this dissertation I have received a great deal of support and assistance. I would first like to thank my supervisor, Prof. Liu, whose expertise was invaluable in formulating the research questions and methodology. I would particularly like to acknowledge my teammates for their wonderful collaboration and patient support. I would also like to thank my tutors, Prof. Duan, Prof. Lv, and Prof. Deng, for their valuable guidance throughout my studies. You provided me with the tools that I needed to choose the right direction and successfully complete my dissertation. In addition, I would like to thank my parents for their wise counsel and sympathetic ear. You are always there for me. Finally, I could not have completed this dissertation without the support of my friends, Jingyao, Yuechen and Zhengwen, who provided stimulating discussions as well as happy distractions to rest my mind outside of my research.
References
- 1. Shademan A, Decker RS, Opfermann JD, Leonard S, Krieger A, Kim PCW. Supervised autonomous robotic soft tissue surgery. Sci Transl Med. 2016;8(337):337ra64. pmid:27147588
- 2. Liu D, Li Q, Jiang T, Wang Y, Miao R, Shan F, et al. Towards unified surgical skill assessment. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 9522–31.
- 3. Jin Y, Long Y, Chen C, Zhao Z, Dou Q, Heng P-A. Temporal Memory Relation Network for Workflow Recognition From Surgical Video. IEEE Trans Med Imaging. 2021;40(7):1911–23. pmid:33780335
- 4. Nwoye CI, Yu T, Gonzalez C, Seeliger B, Mascagni P, Mutter D, et al. Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos. Med Image Anal. 2022;78:102433. pmid:35398658
- 5. Zou X, Liu W, Wang J, Tao R, Zheng G. ARST: auto-regressive surgical transformer for phase recognition from laparoscopic videos. Comput Meth Biomech Biomed Eng: Imag Visual. 2022;11(4):1012–8.
- 6. Yamlahi A, Tran TN, Godau P, Schellenberg M, Michael D, Smidt FH, et al. Self-distillation for surgical action recognition. International conference on medical image computing and computer-assisted intervention. Springer; 2023. p. 637–46.
- 7. Jin A, Yeung S, Jopling J, Krause J, Azagury D, Milstein A, et al. Tool detection and operative skill assessment in surgical videos using region-based convolutional neural networks. 2018 IEEE winter conference on applications of computer vision (WACV); 2018. p. 691–9.
- 8. Nwoye CI, Yu T, Sharma S, Murali A, Alapatt D, Vardazaryan A, et al. CholecTriplet2022: Show me a tool and tell me the triplet – An endoscopic vision challenge for surgical action triplet detection. Med Image Anal. 2023;89:102888. pmid:37451133
- 9. Sharma S, Nwoye CI, Mutter D, Padoy N. Surgical action triplet detection by mixed supervised learning of instrument-tissue interactions. International conference on medical image computing and computer-assisted intervention. Springer; 2023. p. 505–14.
- 10. Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, et al. Segment anything. Proceedings of the IEEE/CVF international conference on computer vision; 2023. p. 4015–26.
- 11. Yue W, Zhang J, Hu K, Xia Y, Luo J, Wang Z. SurgicalSAM: Efficient class promptable surgical instrument segmentation. AAAI. 2024;38(7):6890–8.
- 12. Wu J, Ji W, Liu Y, Fu H, Xu M, Xu Y, et al. Medical SAM adapter: Adapting segment anything model for medical image segmentation. arXiv preprint arXiv:2304.12620. 2023.
- 13. Paranjape JN, Nair NG, Sikder S, Vedula SS, Patel VM. AdaptiveSAM: Towards efficient tuning of SAM for surgical scene segmentation. arXiv preprint arXiv:2308.03726. 2023.
- 14. Yuan L, Chen D, Chen YL, Codella N, Dai X, Gao J, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432. 2021.
- 15. Oquab M, Darcet T, Moutakanni T, Vo H, Szafraniec M, Khalidov V, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193. 2023.
- 16. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 2921–29.
- 17. Kuhn HW. The Hungarian method for the assignment problem. Naval Research Logistics. 1955;2(1–2):83–97.
- 18. Nwoye CI. CholecT50 homepage; 2022. Available from: https://github.com/CAMMA-public/cholect50
- 19. Neumuth T, Strauß G, Meixensberger J, Lemke HU, Burgert O. Acquisition of process descriptions from surgical interventions. Database and expert systems applications: 17th international conference, DEXA 2006, Kraków, Poland, September 4–8, 2006. Proceedings 17. Springer; 2006. p. 602–11.
- 20. Katić D, Wekerle AL, Gärtner F, Kenngott H, Müller-Stich BP, Dillmann R, et al. Knowledge-driven formalization of laparoscopic surgeries for rule-based intraoperative context-aware assistance. Information processing in computer-assisted interventions: 5th international conference, IPCAI 2014, Fukuoka, Japan, June 28, 2014. Proceedings 5. Springer; 2014. p. 158–67.
- 21. Nwoye CI, Gonzalez C, Yu T, Mascagni P, Mutter D, Marescaux J, et al. Recognition of instrument-tissue interactions in endoscopic videos via action triplets. Medical image computing and computer assisted intervention–MICCAI 2020: 23rd international conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23. Springer; 2020. p. 364–74.
- 22. Ji G, Fan D, Xu P, Cheng M, Zhou B, Van Gool L. SAM struggles in concealed scenes – empirical study on segment anything. 2023.
- 23. Wang A, Islam M, Xu M, Zhang Y, Ren H. SAM meets robotic surgery: An empirical study on generalization, robustness and adaptation. International conference on medical image computing and computer-assisted intervention. Springer; 2023. p. 234–44.
- 24. Li Y, Hu M, Yang X. Polyp-SAM: Transfer SAM for polyp segmentation. In: Medical Imaging 2024: Computer-Aided Diagnosis. vol. 12927. SPIE; 2024. p. 759–65.
- 25. Choe J, Shim H. Attention-based dropout layer for weakly supervised object localization. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019. p. 2219–28.
- 26. Tan C, Gu G, Ruan T, Wei S, Zhao Y. Dual-gradients localization framework for weakly supervised object localization. Proceedings of the 28th ACM international conference on multimedia; 2020. p. 1976–84.
- 27. Gao W, Wan F, Pan X, Peng Z, Tian Q, Han Z, et al. TS-CAM: Token semantic coupled attention map for weakly supervised object localization. Proceedings of the IEEE/CVF international conference on computer vision; 2021. p. 2886–95.
- 28. Zheng Z, Wang P, Liu W, Li J, Ye R, Ren D. Distance-IoU Loss: Faster and better learning for bounding box regression. AAAI. 2020;34(07):12993–3000.
- 29. LabelImg Homepage. Available from: https://github.com/HumanSignal/labelImg
- 30. Parzen E. On estimation of a probability density function and mode. Ann Math Stat. 1962;33(3):1065–76.