Fig 1.
The proposed model has three main components: the backbone, the triple attention-guided multi-resolution fusion (TAMF) module, and the feature refinement (FR) module. TAMF coarsely localizes salient objects and improves the fusion and complementarity of cross-scale features, while FR recovers object details. (The input images and ground truth (GT) annotations are sourced from the DUTS dataset (official download link: http://saliencydetection.net/duts/; source code repository: https://github.com/scott89/WSS, [30]). The dataset is licensed under the 3-Clause BSD Open Source License (compatible with the CC BY 4.0 license). The feature maps are generated by code developed independently for this paper, and the model architecture diagram was created in Microsoft PowerPoint. All of the above elements are original content by the authors and are licensed under the CC BY 4.0 license.)
Fig 2.
Structure of the two key sub-modules, TA and MF.
GAP, SA, and CA denote global average pooling, the spatial attention mechanism, and the channel attention mechanism, respectively; each includes convolution, BatchNorm, and ReLU layers.
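To make the roles of GAP, SA, and CA concrete, the following is a minimal NumPy sketch of the two attention gates as the caption describes them at a high level. It is an assumption-laden simplification: the actual TA module additionally contains learnable convolution, BatchNorm, and ReLU layers, which are omitted here, and all function names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def global_avg_pool(x):
    # GAP: (C, H, W) feature map -> (C,) channel descriptor.
    return x.mean(axis=(1, 2))

def channel_attention(x):
    # CA (simplified): GAP -> per-channel sigmoid gate -> rescale channels.
    w = sigmoid(global_avg_pool(x))          # (C,)
    return x * w[:, None, None]

def spatial_attention(x):
    # SA (simplified): channel-wise mean -> per-pixel sigmoid gate -> rescale locations.
    g = sigmoid(x.mean(axis=0))              # (H, W)
    return x * g[None, :, :]
```

Both gates preserve the feature-map shape and merely reweight it, which is why they can be inserted between fusion stages without changing tensor dimensions.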
Fig 3.
Structure of the feature refinement (FR) module.
GAP, SA, and CA are defined as in Fig 2.
Table 1.
Ablation study of the proposed model on ECSSD, PASCAL-S, and DUTS.
Fig 4.
Visualization of feature maps with and without TA.
(a) Input image. (b) Ground truth. (c) Without TA. (d) With TA. TA suppresses background noise more effectively while highlighting salient objects. (All feature maps are generated by the code in this paper. The input images and ground truth annotations are derived from the DUTS dataset (official download link: http://saliencydetection.net/duts/; source code repository: https://github.com/scott89/WSS, [30]). The dataset is licensed under the 3-Clause BSD Open Source License (compatible with the CC BY 4.0 license), and its licensing terms have been strictly followed in this work.)
Fig 5.
Visual comparison of our method with other methods at six different scales.
(a) Input image; (b) Ground truth; (c) F3Net; (d) ITSD; (e) MINet; (f) GCPANet; (g) Ours. Our approach suppresses background noise more effectively and better handles large scale variation, to which SOD is particularly sensitive.
Table 2.
Ablation study on TA vs. CBAM.
Table 3.
Ablation study on the FR module with and without inter-branch skip connections.
Table 4.
Ablation and sensitivity analysis on kernel/dilation (K/D) sizes of the FR module (on ECSSD).
Table 5.
Experimental results of different models on five datasets.
Fig 6.
Precision-recall curves, F-measure curves, and FNR↓ results.
Fig 7.
Visual comparison of our method with advanced methods.
(1) Input image; (2) Ground truth; (3) BASNet; (4) EGNet; (5) F3Net; (6) ITSD; (7) MINet; (8) GCPANet; (9) DFI; (10) ICON; (11) MGuidNet; (12) DIPONet; (13) DSLRDNet; (14) Ours. (To ensure the fairness of comparison, all visualization results are regenerated by running the official code of each method under a unified experimental setup. The use of input images complies with the licensing terms of the DUTS dataset (which is licensed under the 3-Clause BSD Open Source License, compatible with the CC BY 4.0 license; official download link: http://saliencydetection.net/duts/; source code repository: https://github.com/scott89/WSS, [30]).)
Fig 8.
Illustration of failure cases.
(a) Input image; (b) Ground truth; (c) ITSD; (d) MINet; (e) GCPANet; (f) ICON; (g) Ours.