
MSF-DETR: A small target detection algorithm for sonar images based on spatial-frequency domain collaborative feature fusion

Abstract

Side-scan sonar imaging is essential for underwater target detection in marine exploration and engineering applications, yet small target detection faces significant challenges including limited frequency domain feature utilization, insufficient multi-scale feature fusion, and high computational complexity. This study develops the Multi-Scale Spatial-Frequency Collaborative Detection Transformer (MSF-DETR), a novel end-to-end automatic detection algorithm specifically designed for small targets in side-scan sonar images. The method integrates three core innovations: a Multi-domain Adaptive Spatial-frequency Network (MASNet) backbone employing Cascaded dual-domain Mamba-enhanced Spatial-frequency Synergistic Convolution that simultaneously captures spatial geometric and frequency domain texture features; a Hierarchical Multi-scale Adaptive Feature Pyramid Network implementing intelligent weight allocation across different scales; and an Efficient Sparse Attention Transformer Encoder utilizing a Window-based Adaptive Sparse Self-Attention mechanism that reduces computational complexity from quadratic to linear. Experimental validation was conducted on the self-built SSST-3K (Side-Scan Sonar Target Detection 3K) dataset containing approximately 3000 high-quality sonar images and the public KLSG dataset. Results demonstrate that MSF-DETR achieves 78.5% mAP50 and 38.5% mAP50-95 on the SSST-3K dataset, representing improvements of 2.8% and 3.3% respectively compared to the baseline RT-DETR, while reducing computational complexity by 12.0% and achieving 71.2 FPS inference speed. The proposed MSF-DETR provides an effective solution for small target detection in complex marine environments, significantly advancing underwater sonar image processing technology.

Introduction

The automatic detection technology for small targets in side-scan sonar images holds significant application value and strategic importance in marine exploration, underwater navigation, marine engineering, and marine biological monitoring. As an acoustic sensor capable of overcoming the limitations of optical imaging in turbid and low-light underwater environments, side-scan sonar provides reliable technical means for underwater target detection [1]. With the growing demands for marine resource development, underwater infrastructure maintenance, and marine security monitoring, accurate and efficient small target detection algorithms have become key technologies for advancing intelligent marine equipment development. Zhou et al. [2] proposed an automatic detection method for small targets in side-scan sonar images that demonstrated excellent detection performance in complex marine environments, proving the important value of this technology in practical engineering applications. Therefore, developing high-precision small target detection algorithms tailored to the characteristics of side-scan sonar images holds significant theoretical meaning and practical value for enhancing the intelligence level of underwater autonomous operation systems, ensuring marine engineering safety, and promoting marine scientific research.

Traditional side-scan sonar image target detection methods primarily rely on manually designed feature extraction and pattern recognition techniques, achieving target identification by analyzing statistical characteristics, texture features, and geometric shapes of sonar images. Abu et al. [3] proposed a statistical feature extraction method based on weighted likelihood ratios, combined with support vector machine classifiers to achieve sonar image target recognition, which achieved good detection results in specific scenarios. Chen et al. [4] developed an underwater target detection method based on spectral residual and three-frame algorithms, enhancing target detection robustness by combining frequency domain analysis with temporal information. However, these traditional methods generally suffer from limited feature expression capabilities, poor adaptability to complex marine environments, and requirements for extensive manual parameter tuning, making them difficult to meet modern marine engineering demands for detection accuracy and real-time performance.

In recent years, the rapid development of deep learning technology has brought new breakthroughs to side-scan sonar image target detection. Wang et al. [5] proposed a weak small target detection method for side-scan sonar images based on multi-branch shuttle neural networks, significantly improving small target detection accuracy through multi-scale feature fusion. Zhang et al. [6] developed a side-scan sonar image target detection model based on improved YOLOv5, introducing transfer learning techniques to overcome the problem of scarce sonar image data. Li et al. [7] proposed an advanced deep learning framework based on YOLOv7 architecture, achieving high-precision target detection for multibeam side-scan sonar through optimization of data preprocessing, feature fusion, and loss functions. Fan et al. [8] designed a side-scan sonar image target detection and segmentation algorithm based on improved Mask R-CNN, effectively handling target boundary problems in complex underwater environments.

Recently, deep learning-based object detection methods have achieved remarkable progress in computer vision, demonstrating mature technical systems and powerful real-time processing capabilities. Modern object detection frameworks such as YOLO series and DETR series excel in handling complex scenes, multi-scale targets, and dense target distributions, providing strong technical support for small target detection in side-scan sonar images. Therefore, this study adopts object detection methods as the technical approach, aiming to fully utilize their advantages in accuracy and efficiency to solve key technical challenges in small target detection for side-scan sonar images.

In the field of sonar image processing based on object detection, researchers have conducted extensive exploratory work. Li et al. [9] proposed CCW-YOLOv5, a side-scan sonar target detection method based on coordinate convolution and improved bounding box loss, significantly enhancing small target detection capability by introducing position information-rich feature extraction. Yu et al. [10] developed a side-scan sonar image target detection model based on Transformer-YOLOv5, improving model adaptability to complex seafloor terrain through self-attention mechanisms. With the introduction of DETR (Detection Transformer) architecture, researchers began exploring applications of end-to-end detection methods in sonar image processing. Wang et al. [11] proposed US-DETR, an underwater sonar image target detection method based on improved RT-DETR, significantly improving detection performance through enhanced feature interaction modules and non-local attention feature fusion mechanisms. Chen et al. [12] developed NAS-DETR, a method based on zero-shot neural architecture search, achieving excellent performance in sonar target detection tasks. Recent research work also includes ProNet proposed by Wang et al. [13], a network based on progressive sensitivity capture, and multiple improvement works based on RT-DETR architecture optimization, providing important technical accumulation and theoretical foundation for RT-DETR [14] applications in sonar image processing.

Although object detection-based methods and advanced algorithms like RT-DETR perform excellently in general object detection tasks, they still face numerous technical challenges in small target side-scan sonar image automatic detection applications. First, side-scan sonar images have unique imaging mechanisms and signal characteristics, making traditional spatial domain feature extraction methods unable to fully utilize frequency domain information of sonar signals, resulting in insufficient and inaccurate feature representation of small targets. Second, small targets in sonar images often exhibit characteristics such as large scale variations, blurred boundaries, and susceptibility to seafloor reverberation interference, while existing feature fusion networks lack adaptive integration mechanisms for multi-scale small target features, making it difficult to achieve stable detection performance in complex marine environments. Additionally, existing public sonar datasets are limited in scale and variable in quality, lacking dedicated high-quality datasets for side-scan sonar small target detection, which severely restricts algorithm development and performance evaluation. Finally, traditional Transformer encoders suffer from high computational complexity and insufficient spatial structure modeling capabilities when processing high-resolution sonar images, affecting model real-time performance and effective capture of small target spatial features. These technical bottlenecks severely restrict the application effectiveness of existing algorithms in practical marine engineering.

To address the aforementioned technical challenges, this paper proposes MSF-DETR, a small target side-scan sonar image automatic detection algorithm that effectively solves the technical limitations of traditional methods in complex marine environments through constructing a dedicated dataset and three core innovative modules working collaboratively. The main contributions of this paper are as follows:

(1) Construction of a high-quality side-scan sonar small target detection dataset named SSST-3K (Side-Scan Sonar Target Detection 3K Dataset). Through data collection by surface vessels towing side-scan sonar in real marine environments, a dedicated dataset containing approximately 3000 high-quality sonar images was constructed. The dataset employs high-low frequency dual-mode operation to obtain multi-frequency target features, covering 3 typical small target types, and provides important benchmark data support for side-scan sonar small target detection algorithm development and evaluation through professional annotation and scientific partitioning.

(2) Proposal of Multi-domain Adaptive Spatial-frequency Network (MASNet) backbone network that employs Cascaded dual-domain Mamba-enhanced Spatial-frequency Synergistic Convolution (CMSSC) feature extraction modules and Dual-domain Spatial-frequency Synergistic Convolution (DSSC) collaborative convolution mechanisms to achieve simultaneous capture and fusion of spatial geometric features and frequency domain texture information in sonar images, significantly enhancing multi-modal feature representation capabilities for small targets.

(3) Design of Hierarchical Multi-scale Adaptive Feature Pyramid Network (HMAFPN), a multi-scale feature fusion network that achieves intelligent weight allocation and optimal combination of different scale features through Multi-input Adaptive Fusion Module (MAFM), breaking through the limitations of traditional FPN unidirectional information flow and effectively improving the expression richness and fusion effects of small target features.

(4) Proposal of Efficient Sparse Attention Transformer Encoder (ESATE) that employs Window-based Adaptive Sparse Self-Attention (WASSA) mechanism to reduce computational complexity from quadratic to linear scale, while using Spatial-Enhanced Feed-Forward Network (SEFFN) to replace traditional feed-forward networks, significantly improving spatial structure modeling capabilities and computational efficiency for sonar images, achieving optimal balance between accuracy and speed.

Related work

Introduction to self-built dataset SSST-3K

This study employs the self-built SSST-3K dataset to validate the effectiveness of the proposed algorithm. The dataset was obtained through side-scan sonar surveys of small targets conducted by surface vessels towing the sonar at sea, giving it high practicality and representativeness. To improve survey efficiency and data quality, the side-scan sonar employs high-low frequency dual-mode operation, displaying high-frequency and low-frequency sonar images separately and thus capturing target feature information in different frequency bands. Meanwhile, to enlarge the receptive field and obtain more complete target information, this study captures full-screen waterfall images as detection images, ensuring that complete target contours and surrounding environmental information are captured. Dataset examples are shown in Fig 1.

Fig 1. Examples of SSST-3K dataset containing cone, cylinder, and sphere targets.

https://doi.org/10.1371/journal.pone.0336468.g001

After standardized processing, the SSST-3K dataset contains approximately 3000 sonar images with targets, covering three different types of small targets: cone (MT), cylinder (C), and sphere (M). All targets were precisely marked using professional labelImg annotation software, constructing a complete sonar image dataset. To ensure good generalization capability of the detection model, the dataset was divided into training, validation, and test sets according to a 7:1:2 ratio, with the training set containing 2182 images, validation set containing 312 images, and test set containing 626 images. To further improve detection result reliability and model discrimination capability for negative samples, over 600 background images without targets were added to the test set in a 1:1 ratio, making the actual test set scale reach 1252 images, thus constructing a balanced and challenging evaluation benchmark.

RT-DETR baseline framework

RT-DETR (Real-Time Detection Transformer) is the first real-time end-to-end object detection algorithm that successfully addresses the problems of high computational cost and slow inference speed in traditional DETR series algorithms while maintaining high detection accuracy and achieving real-time performance. The core innovation of RT-DETR lies in designing an efficient hybrid encoder architecture that significantly reduces computational complexity by decoupling intra-scale interaction and cross-scale fusion operations. The algorithm employs convolutional neural networks as the backbone network for feature extraction, extracting multi-scale features from the last three stages of the backbone network as encoder inputs. This design ensures feature richness while controlling computational overhead.

The encoder part of RT-DETR contains two key modules: Attention-based Intra-scale Feature Interaction (AIFI) module and CNN-based Cross-scale Feature Fusion (CCFF) module. The AIFI module is specifically responsible for processing high-level semantic features from the S5 layer, performing intra-scale feature interaction through single-scale Transformer encoders to effectively capture semantic associations in high-level features. The CCFF module is responsible for fusion between different scale features, achieving cross-scale information transfer through convolution operations. The decoder part adopts standard Transformer decoder structure equipped with auxiliary prediction heads, providing high-quality initial target queries for the decoder through IoU-aware query selection mechanisms, and finally generating final detection results through iterative optimization. This end-to-end design eliminates non-maximum suppression (NMS) post-processing steps, not only simplifying the detection pipeline but also improving inference speed and detection accuracy.

Related work on feature fusion networks

Feature Pyramid Network (FPN), as a classic architecture for multi-scale feature fusion, plays an important role in object detection. Lin et al. initially proposed FPN through top-down pathways and lateral connections to construct multi-scale feature representations [15], but using only unidirectional information flow has certain limitations. To address this problem, Liu et al. proposed Path Aggregation Network (PANet), enhancing information transfer from low-level to high-level features by adding bottom-up pathways [16]. Subsequently, Ghiasi et al. proposed NAS-FPN based on neural architecture search, discovering better feature fusion topological structures through automatic search [17]. Tan et al. proposed Bidirectional Feature Pyramid Network (BiFPN) in EfficientDet, achieving better balance between efficiency and accuracy through weighted bidirectional feature fusion and removal of redundant connections [18]. In recent years, researchers have further explored applications of attention mechanisms in feature fusion. Dang et al. proposed Hierarchical Attention Feature Pyramid Network (HA-FPN), enhancing feature expression capabilities by introducing Transformer and channel attention modules [19]. Chen et al. proposed Improved Feature Pyramid Network (ImFPN) that further optimizes feature fusion effects through similarity fusion modules and attention mechanisms [20]. These works provide important theoretical foundation and technical reference for the hierarchical multi-scale adaptive feature pyramid network proposed in this paper.

Methods

This paper proposes the MSF-DETR (Multi-Scale Spatial-Frequency Collaborative Detection Transformer) algorithm, an end-to-end detection framework targeting the particular challenges of small target detection in side-scan sonar images. The framework includes three core innovations: the MASNet backbone network, which enhances small target feature representation through a spatial-frequency dual-domain collaborative processing mechanism; the HMAFPN feature fusion network, which optimizes multi-scale feature integration using dense cross-layer connections and adaptive weight fusion strategies; and the ESATE encoder, which improves feature encoding efficiency through sparse attention mechanisms and spatially-aware feed-forward networks. The overall framework structure is shown in Fig 2. The three innovative modules work collaboratively to solve the problems of insufficient detection accuracy and low computational efficiency that traditional methods exhibit for small targets in complex marine environments.

Fig 2. Overall framework architecture of MSF-DETR.

The framework integrates MASNet backbone, HMAFPN feature fusion network, and ESATE encoder to achieve end-to-end small target detection in side-scan sonar images.

https://doi.org/10.1371/journal.pone.0336468.g002

Multi-domain Adaptive Spatial-frequency Network (MASNet) backbone design

Traditional ResNet [21] backbone networks face technical bottlenecks in side-scan sonar image small target detection. First, ResNet’s fixed convolution kernel design cannot effectively handle frequency domain feature distribution differences of targets in sonar images. Under high-low frequency hybrid working modes, single spatial domain feature extraction strategies have difficulty capturing complete spectral information of sonar signals, leading to insufficient frequency domain feature representation of small targets. Second, ResNet lacks adaptive gating mechanisms for sonar image noise characteristics. In complex seafloor reverberation and multipath interference environments, networks cannot dynamically adjust weight allocation of feature channels, making them susceptible to strong background noise interference. Therefore, this paper proposes MASNet backbone network based on CMSSC feature extraction modules. This network introduces gated spatial-frequency dual-domain feature fusion mechanisms and adaptive channel selection strategies, enabling simultaneous feature extraction and fusion in both spatial and frequency domains, effectively solving frequency domain information loss problems in traditional networks for sonar image processing. Through collaborative action of DSSC modules, detection accuracy and robustness for complex small targets in marine environments are significantly improved.

The overall structure of the CMSSC module is shown in Fig 3, employing hierarchical feature extraction and gated fusion design concepts. The module first performs a channel expansion transformation on the input features, expanding the input channels from C1 to 2C through convolution. The expanded features are equally divided into two sub-features Xs1 and Xs2 in the channel dimension, where Xs1 directly participates in the final feature concatenation operation, while Xs2 serves as the recursive input for the subsequent n Gated Dual-domain Synergistic Convolution Block (GDSCB) modules. The module progressively refines feature representations through cascading multiple gated DSSC blocks. Its overall mathematical expression can be described as:

Y = Conv_out( Concat( X_s1, F_n(F_{n-1}( ... F_1(X_s2) ... )) ) )    (1)

where Conv_out represents the output convolution transformation, Concat(·) represents the channel concatenation operation, F_i represents the i-th GDSCB module, X_s2 is the initial recursive input, and n is the number of module repetitions.
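The cascaded split-and-concatenate flow of Eq (1) can be sketched in NumPy. The GDSCB block, channel expansion, and output convolution are replaced by toy stand-ins (a tanh, duplication, and a channel mean, all hypothetical), so the sketch shows only the data flow, not the learned operators:

```python
import numpy as np

def toy_gdscb(x):
    # Hypothetical stand-in for a GDSCB block: any shape-preserving transform.
    return np.tanh(x)

def cmssc_forward(x, n_blocks=2):
    """Sketch of the CMSSC flow: expand channels C1 -> 2C, split into
    Xs1/Xs2, cascade n GDSCB blocks on Xs2, concatenate, project out."""
    c1, h, w = x.shape
    # Channel expansion (stand-in for a learned convolution): duplicate features.
    expanded = np.concatenate([x, x], axis=0)      # (2*c1, h, w)
    xs1, xs2 = np.split(expanded, 2, axis=0)       # two (c1, h, w) halves
    for _ in range(n_blocks):                      # recursive GDSCB cascade
        xs2 = toy_gdscb(xs2)
    fused = np.concatenate([xs1, xs2], axis=0)     # channel concatenation
    return fused.mean(axis=0, keepdims=True)       # stand-in for Conv_out

feat = np.random.rand(4, 8, 8)
out = cmssc_forward(feat)
```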

The GDSCB module performs layer normalization on input features to improve training stability, mapping features to a high-dimensional representation space through fully connected layers with a preset expansion ratio. The expanded features are decomposed into three functionally distinct sub-feature branches according to preset channel ratios: a gating signal branch G, an identity mapping branch I, and a DSSC convolution branch C. After GELU activation function processing, the gating signal performs gated fusion with the identity mapping branch and the convolution branch processed by DSSC. The mathematical expression for this process is:

[G, I, C] = Split( FC_in( LN(X) ) )
Y = X + DropPath( FC_out( σ(G) ⊙ (I + DSSC(C)) ) )    (2)

where Split(·) represents the channel splitting operation, LN(·) represents layer normalization, FC_in and FC_out represent the input and output fully connected layers respectively, σ is the GELU activation function, DSSC(·) represents the DSSC spatial-frequency collaborative convolution operation, DropPath(·) represents DropPath random depth regularization, and ⊙ represents Hadamard element-wise multiplication.
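One plausible reading of the gated fusion can be sketched as follows. The three-way channel split uses equal ratios and a tanh stands in for the DSSC branch (both hypothetical simplifications); the sketch only illustrates the gating pattern, GELU(G) applied elementwise to the sum of the identity and convolution branches:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def gdscb_gate(x):
    """Sketch of the GDSCB gated fusion: split into gate G, identity I,
    and conv branch C, then fuse as gelu(G) * (I + DSSC(C))."""
    G, I, C = np.split(x, 3, axis=0)   # preset channel ratios (equal here)
    dssc_c = np.tanh(C)                # hypothetical stand-in for DSSC(C)
    return gelu(G) * (I + dssc_c)      # gated Hadamard fusion

x = np.random.rand(6, 4, 4)
y = gdscb_gate(x)
```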

The DSSC module adopts dual-branch parallel processing architecture: Spatial Processing Unit (SPU) is responsible for extracting spatial domain geometric features, while Gabor-enhanced Frequency Processing Unit (GFPU) specifically handles frequency domain texture and edge information. Output features from both branches are integrated through adaptive weight fusion mechanisms, finally achieving dynamic feature selection through soft attention mechanisms. The overall mathematical expression for DSSC can be described as:

S = SPU( PW_in(X) ),  F = GFPU( PW_in(X) )
[α_S, α_F] = Softmax( GAP( Concat(S, F) ) )
Y = PW_out( α_S ⊙ S + α_F ⊙ F )    (3)

where PW_in and PW_out represent the input and output pointwise convolution layers respectively, SPU(·) and GFPU(·) represent the spatial processing unit and Gabor frequency processing unit respectively, Concat(·) represents the concatenation operation, GAP(·) represents the adaptive average pooling operation, and Softmax is the softmax normalization function. This design enables simultaneous capture of spatial geometric information and spectral texture features in sonar images through parallel dual-domain processing and adaptive fusion strategies.
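The adaptive soft-attention fusion of the two DSSC branches can be illustrated with a minimal sketch: each branch is pooled to a scalar descriptor, the descriptors are softmax-normalized into weights, and the branches are blended with those weights. The surrounding learned pointwise convolutions are omitted, so this shows the weighting step only:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def dssc_fuse(spatial_feat, freq_feat):
    """Sketch of the DSSC soft-attention fusion: pool each branch to a
    scalar descriptor, softmax the descriptors into weights, blend."""
    desc = np.array([spatial_feat.mean(), freq_feat.mean()])  # GAP per branch
    w = softmax(desc)                                         # adaptive weights
    return w[0] * spatial_feat + w[1] * freq_feat

s = np.ones((2, 4, 4)) * 2.0   # toy spatial-branch output
f = np.zeros((2, 4, 4))        # toy frequency-branch output
y = dssc_fuse(s, f)
```

Because the weights come from a softmax, the fused output always lies between the two branch responses.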

The SPU module is specifically responsible for extracting spatial domain features of sonar images, employing branch cascading and progressive fusion processing strategies. The module first equally divides input features into two sub-features X1 and X2 in the channel dimension, then employs depth-wise separable convolution kernels of different sizes for feature extraction. X1 extracts local spatial features through depth convolution, while X2 first performs residual connection with X1 output, then extracts larger receptive field spatial features through depth convolution. This cascading design effectively captures spatial geometric information at different scales. The mathematical expression for SPU is:

[X_1, X_2] = Split(X)
Y_1 = DW_{k1}(X_1),  Y_2 = DW_{k2}(X_2 + Y_1)
Y = PW( Concat(Y_1, Y_2) ) + X_res    (4)

where DW_{k1} and DW_{k2} represent depth-wise separable convolutions with different kernel sizes, PW(·) represents the pointwise convolution for feature fusion, and X_res represents the residual connection term. This design effectively extracts geometric shape and spatial structure features of targets in sonar images through cascaded processing of multi-scale spatial convolutions.
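A minimal sketch of the SPU cascade follows, assuming 3 × 3 kernels and a single smoothing kernel shared across channels and stages (illustrative choices, not the paper's learned filters):

```python
import numpy as np

def depthwise3x3(x, k):
    """Apply the same 3x3 kernel to every channel independently
    (stride 1, zero padding) -- a simplified depthwise convolution."""
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * xp[:, i:i + h, j:j + w]
    return out

def spu(x):
    """Sketch of the SPU cascade: split channels, convolve X1, add its
    output to X2 before a second convolution, then concatenate with a
    residual connection back to the input."""
    x1, x2 = np.split(x, 2, axis=0)
    k = np.full((3, 3), 1 / 9.0)        # hypothetical smoothing kernel
    y1 = depthwise3x3(x1, k)            # local spatial features
    y2 = depthwise3x3(x2 + y1, k)       # cascaded, larger receptive field
    return np.concatenate([y1, y2], axis=0) + x   # residual connection

feat = np.random.rand(4, 6, 6)
out = spu(feat)
```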

The GFPU module is an innovative component specifically designed for processing frequency domain features of sonar images, using the multi-directional and multi-scale characteristics of Gabor filters to extract texture and edge information. The module first equally divides input features into four sub-channel groups, with each group processed through an independent GaborSingle filter to capture frequency domain features at different orientation angles and scales. The output features of the four sub-channels are concatenated and then passed through fully connected layers for feature integration and dimensionality adjustment. The mathematical expression for GFPU is:

Y = FC( Concat( G_1(X_1), G_2(X_2), G_3(X_3), G_4(X_4) ) ) + X_res    (5)

where G_i(·) represents the single Gabor filter operation based on multi-angle, multi-scale Gabor kernel convolution calculations: G_i(X) = Σ_j w_j (K(θ_i, σ_j) * X), where K(θ_i, σ_j) is the parameterized Gabor filter kernel, w_j are learnable weight parameters, and * represents the convolution operation. FC(·) represents fully connected layers for feature integration, and X_res is the residual connection term. This design effectively extracts frequency domain texture features and edge contour information of targets in sonar images, and is particularly suitable for processing underwater small targets with complex scattering characteristics.
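A Gabor filter bank like the one GFPU builds on can be generated directly. The standard real-valued Gabor formulation is used below; the parameter values (σ, λ, γ, kernel size) are illustrative, not taken from the paper:

```python
import numpy as np

def gabor_kernel(theta, sigma=2.0, lam=4.0, gamma=0.5, size=7):
    """Real-valued Gabor kernel at orientation theta: a Gaussian envelope
    modulated by a cosine carrier along the rotated x-axis."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    env = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
    return env * np.cos(2 * np.pi * xr / lam)

# Four orientations, one per GFPU sub-channel group
thetas = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
bank = [gabor_kernel(t) for t in thetas]
```

Convolving each channel group with its oriented kernel responds strongly to edges and textures aligned with that orientation, which is what makes the bank useful for scattering-dominated sonar returns.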

By introducing gated dual-domain feature fusion strategies and adaptive channel selection mechanisms, the proposed MASNet backbone network effectively solves the fundamental problem of insufficient frequency domain information utilization when traditional ResNet processes sonar images, significantly improving multi-modal feature representation capabilities for small targets in complex marine environments. In particular, through the spatial-frequency parallel processing design of the DSSC modules, the network can simultaneously capture geometric spatial features and spectral texture information of sonar images, effectively exploiting the physical characteristics of sonar signals for feature enhancement. Through gating mechanisms and soft attention strategies, the weight contributions of features from different domains are dynamically adjusted, greatly improving the accuracy and robustness of small target detection.

Hierarchical Multi-scale Adaptive Feature Pyramid Network (HMAFPN) design

Traditional Cross-scale Cascade Feature Merging (CCFM) feature fusion networks employ a simple upsampling-concatenation-convolution linear fusion strategy, exhibiting significant limitations when processing multi-scale features of side-scan sonar images. First, CCFM adopts direct feature concatenation for multi-scale information fusion, lacking adaptive evaluation mechanisms for the importance of different scale features, so that high-value small target features are easily masked by large-scale background features. Second, the traditional top-down unidirectional information flow design ignores the feedback enhancement effect of bottom-level detail features on high-level semantic features, failing to fully exploit the complementary relationships between multi-layer features. Finally, CCFM lacks dense connection mechanisms across hierarchical levels, resulting in insufficient information interaction between distant feature layers and limiting the network's expression capability for complex sonar scenes. To address these problems, this paper proposes the HMAFPN feature fusion network architecture. By introducing MAFM and dense cross-layer connection strategies, it constructs an adaptive multi-scale feature fusion mechanism that intelligently evaluates and integrates the contribution weights of features at different hierarchical levels. This significantly enhances feature representation capabilities for small targets in sonar images, effectively solves the problem of insufficient feature information utilization in traditional fusion networks under complex marine environments, and provides a more powerful and robust feature representation foundation for side-scan sonar image small target detection. The structure of HMAFPN is shown in Fig 4.

Fig 4. Architecture of HMAFPN.

The network implements dense cross-layer connections and MAFM.

https://doi.org/10.1371/journal.pone.0336468.g004

HMAFPN adopts densely connected feature pyramid architecture, achieving efficient integration of multi-scale features through multiple feature fusion pathways and adaptive attention mechanisms. The network breaks through limitations of traditional FPN unidirectional information flow, constructing bidirectional multi-path feature propagation mechanisms that enable each feature layer to simultaneously receive enhancement from upper-level semantic information and lower-level detail information. The core innovation of the network lies in introducing MAFM modules as basic units for feature fusion. These modules can adaptively learn importance weights of different input features, achieving optimal feature combinations through attention mechanisms. The overall feature fusion process of HMAFPN can be described through hierarchically recursive mathematical expressions that reflect the network’s multi-path information aggregation characteristics:

P_l = MAFM( C_l, Up(P_{l+1}), Down(P_{l-1}), {P_{l±k}} )    (6)

where P_l represents the output feature of the l-th layer, MAFM(·) represents the operation function of the multi-input adaptive fusion module, Up(·) and Down(·) represent upsampling and downsampling operations respectively, C_l represents lateral connections from the backbone network, and k represents the distance of cross-layer connections. This design enables each feature layer to receive information enhancement from multiple directions and multiple scales, significantly improving feature expression richness.
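One HMAFPN fusion node can be sketched as follows, with nearest-neighbour upsampling, average-pool downsampling, and equal weights standing in for the learned MAFM weighting (hypothetical simplifications; the real module learns the weights adaptively):

```python
import numpy as np

def up2(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def down2(x):
    """2x average-pool downsampling of a (C, H, W) feature map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def fuse_layer(lateral, finer, coarser):
    """Sketch of one HMAFPN node: a layer receives its backbone lateral,
    the coarser level (upsampled) and the finer level (downsampled),
    then blends them; equal weights stand in for learned MAFM weights."""
    inputs = [lateral, up2(coarser), down2(finer)]
    return sum(inputs) / len(inputs)

lat = np.random.rand(8, 16, 16)     # current level
fine = np.random.rand(8, 32, 32)    # finer level below
coarse = np.random.rand(8, 8, 8)    # coarser level above
p = fuse_layer(lat, fine, coarse)
```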

HMAFPN builds a dense cross-layer connection topology, achieving comprehensive information interaction between all hierarchical levels of the feature pyramid. The network not only maintains the traditional FPN top-down pathway but also adds bottom-up feedback pathways and cross-hierarchical direct connection pathways, forming a highly interconnected feature fusion structure. This dense connection design can be described mathematically using graph theory, where each node represents a feature layer and edges represent information propagation paths:

G = (V, E),  E = E_td ∪ E_bu ∪ E_skip ∪ E_cross    (7)

where V represents the set of feature nodes, and E_td, E_bu, E_skip, E_cross represent top-down edges, bottom-up edges, skip connection edges, and cross-layer connection edges respectively. This graph structure design ensures that each feature layer in the network can receive information from all other hierarchical levels. Through the adaptive fusion mechanisms of the MAFM modules, the network can dynamically adjust the information weights of different pathways according to the specific input content, thereby achieving optimal feature representation effects.

The MAFM module can handle an arbitrary number of input feature maps, achieving intelligent feature selection and fusion by learning an importance weight for each input feature. MAFM first unifies the dimensions of the different input features through a series of convolutions, ensuring all input features have the same channel dimension for subsequent fusion operations. The unified feature maps are then stacked along a new dimension, forming a tensor whose second dimension indexes the different input feature sources. Next, the module extracts global context information for each feature map through global average pooling and generates corresponding attention weights through multi-layer perceptrons. This process can be described as a multi-input feature adaptive weighted combination:

X_stack = Stack( φ_1(X_1), φ_2(X_2), ..., φ_N(X_N) ) ∈ R^{B×N×C×H×W}    (8)

where φ_i(·) represents the projection function for the i-th input feature, which is a convolution when the input channel number differs from the target dimension C and otherwise an identity mapping, Stack(·) represents the stacking operation in a new dimension, N is the number of input features, and B, C, H, W represent batch size, channel number, height, and width respectively. This unification processing ensures that features from different sources can be effectively fused in the same representation space.

The MAFM mechanism learns importance weights for each input feature through global information aggregation and nonlinear transformation. The module first sums the stacked features across the feature dimension to obtain initial fused features, then compresses the spatial dimensions to $1\times1$ through global average pooling to extract global context information. Next, attention weights are generated through a multi-layer perceptron with a dimensionality reduction-activation-dimensionality expansion structure, and finally normalized using the softmax function so that all weights sum to 1. The mathematical expression for this process is:

$w = \mathrm{Softmax}\Big(\mathrm{MLP}\big(\mathrm{GAP}\big(\textstyle\sum_{i=1}^{N} \phi_i(F_i)\big)\big)\Big) \in \mathbb{R}^{N}$ (9)

where $\mathrm{GAP}(\cdot)$ represents the global average pooling operation and $\mathrm{MLP}(\cdot)$ represents a multi-layer perceptron, typically a Conv-ReLU-Conv structure in which the first convolution layer reduces the channel number from C to C/r (where r is the reduction ratio), the second convolution layer expands the channel number to N, and the softmax function normalizes over the feature dimension N. The final fused feature is obtained through weighted summation: $F_{out} = \sum_{i=1}^{N} w_i \odot \phi_i(F_i)$, where $\odot$ represents broadcast element-wise multiplication.
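As an illustrative sketch of Eqs. (8)-(9), the following NumPy code emulates the MAFM weighting, with dense matrices standing in for the two 1×1 convolutions of the MLP; the function and variable names are ours, not from any released implementation:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def mafm_fuse(features, W1, W2):
    """Sketch of MAFM adaptive fusion (Eqs. 8-9).

    features: list of N arrays, each (C, H, W), already projected
              to a common channel dimension C.
    W1: (C/r, C) reduction weights; W2: (N, C/r) expansion weights
    (stand-ins for the two 1x1 convolutions of the MLP).
    """
    stacked = np.stack(features, axis=0)           # (N, C, H, W)
    summed = stacked.sum(axis=0)                   # initial fused feature
    gap = summed.mean(axis=(1, 2))                 # global average pooling -> (C,)
    hidden = np.maximum(W1 @ gap, 0.0)             # reduction + ReLU
    weights = softmax(W2 @ hidden)                 # (N,), sums to 1
    fused = (weights[:, None, None, None] * stacked).sum(axis=0)
    return fused, weights
```

The softmax guarantees the N weights are non-negative and sum to 1, so the fusion is a convex combination of the input feature maps.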

The proposed HMAFPN multi-scale feature fusion network significantly improves feature representation capabilities for small target detection in side-scan sonar images through innovative dense connection architecture and adaptive attention fusion mechanisms. The network breaks through limitations of traditional FPN unidirectional information flow by constructing bidirectional multi-path feature propagation networks and introducing MAFM adaptive fusion modules, achieving full utilization and intelligent integration of multi-scale feature information.

Efficient Sparse Attention Transformer Encoder (ESATE) design

The traditional TransformerEncoderLayer faces significant performance bottlenecks when encoding side-scan sonar image features. First, standard global self-attention has O(N²) computational complexity, generating enormous overhead on high-resolution sonar images and severely limiting real-time detection. Second, traditional feed-forward networks (FFN) rely only on simple linear transformations and pointwise activation functions, lacking effective modeling of spatial geometric structure, and thus perform poorly on the complex seafloor terrain and target shapes in sonar images. Finally, traditional encoders lack multi-scale feature adaptive fusion mechanisms and cannot fully utilize semantic information from different hierarchical levels, resulting in insufficient and inaccurate feature representation of small targets. Addressing these limitations, this paper proposes the ESATE encoder, which reduces computational complexity from quadratic to linear scale by introducing the WASSA mechanism, while the SEFFN module significantly enhances spatial structure modeling for sonar images through multi-scale spatial branches and adaptive feature fusion strategies, effectively solving the efficiency and accuracy problems of traditional encoders in small target detection tasks. The structure of ESATE is shown in Fig 5.

Fig 5. Structure of ESATE.

The encoder incorporates WASSA mechanism and SEFFN.

https://doi.org/10.1371/journal.pone.0336468.g005

ESATE encoder adopts dual-stage feature processing architecture, achieving efficient encoding of sonar image features through serial collaboration of adaptive sparse attention and spatial-enhanced feed-forward networks. The encoder first converts input feature maps from spatial format to sequence format for attention mechanism processing, then performs windowed sparse attention computation through WASSA module, effectively modeling local and global spatial dependencies. Subsequently, the encoder employs SEFFN module to replace traditional feed-forward networks, enhancing spatial awareness capabilities of features by introducing original spatial feature branches. The entire processing strictly follows Transformer design principles, including residual connections and layer normalization operations, ensuring training stability and effective gradient propagation. The mathematical expression of the encoder embodies its dual-stage processing core concept:

$F' = \mathrm{LN}_1\big(F + D_1(\mathrm{WASSA}(F))\big), \qquad F_{out} = \mathrm{LN}_2\big(F' + D_2(\mathrm{SEFFN}(F', F))\big)$ (10)

where $F'$ represents the intermediate features produced by adaptive sparse attention and the first normalization, $\mathrm{WASSA}(\cdot)$ represents the windowed adaptive sparse self-attention function, $\mathrm{SEFFN}(\cdot,\cdot)$ represents the dual-input processing function of the spatial-enhanced feed-forward network, $\mathrm{LN}_1$ and $\mathrm{LN}_2$ represent the two layer normalization operations respectively, and $D_1$ and $D_2$ represent the corresponding stochastic depth (drop-path) operations.
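The dual-stage structure of Eq. (10) can be sketched as follows. Layer normalization is implemented directly, the drop-path operations are omitted (as at inference time), and the attention and feed-forward stages are passed in as callables; this is a structural sketch, not the authors' implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Per-token layer normalization over the last (feature) axis."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def esate_layer(F, wassa, seffn):
    """Dual-stage ESATE layer (Eq. 10): sparse attention, then the
    spatial-enhanced FFN, each wrapped in a residual connection and
    layer normalization. Drop-path (D1, D2) is treated as identity.
    wassa: callable F -> attention output
    seffn: callable (F', F) -> feed-forward output
    """
    F_mid = layer_norm(F + wassa(F))               # stage 1: WASSA + residual + LN1
    return layer_norm(F_mid + seffn(F_mid, F))     # stage 2: SEFFN + residual + LN2
```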

The WASSA mechanism balances computational efficiency and modeling capability through windowed partitioning and cyclic shifting strategies. The mechanism first reshapes input feature maps to sequence format, then divides the feature sequence into multiple non-overlapping window blocks according to a predefined window size. To maintain information exchange between different windows, the mechanism introduces cyclic shifting operations, achieving cross-window feature interaction by periodically moving window boundaries. This design reduces global attention computational complexity from $O(H^2W^2)$ to $O(HW \cdot win^2)$, where $win$ represents the window size. The mathematical expressions for window partitioning and cyclic shifting are:

$F_{shift} = \mathrm{Shift}(F, -s_{shift}, -s_{shift}), \qquad \{W_i\} = \mathrm{Partition}(F_{shift}, win)$ (11)

where $\mathrm{Shift}(\cdot)$ represents the cyclic shifting function, shifting $-s_{shift}$ pixels along the height and width dimensions respectively, $\mathrm{Partition}(\cdot)$ represents the window partitioning operation, dividing the shifted feature map into non-overlapping $win \times win$ windows, and $s_{shift}$ is the shifting distance, typically set to $\lfloor win/2 \rfloor$ to achieve optimal information exchange.
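A minimal NumPy sketch of the cyclic shift and window partition of Eq. (11), together with the inverse merge-and-unshift of Eq. (13). Here `np.roll` plays the role of the cyclic shift, and H and W are assumed divisible by the window size:

```python
import numpy as np

def shift_and_partition(fmap, win, s):
    """Cyclic shift then non-overlapping window partition (Eq. 11).
    fmap: (H, W, C) with H, W divisible by win; s is typically win // 2."""
    shifted = np.roll(fmap, shift=(-s, -s), axis=(0, 1))
    H, W, C = shifted.shape
    wins = shifted.reshape(H // win, win, W // win, win, C)
    wins = wins.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)
    return wins                                    # (num_windows, win*win, C)

def merge_and_unshift(wins, H, W, win, s):
    """Inverse transformation: window merge + inverse cyclic shift (Eq. 13)."""
    C = wins.shape[-1]
    x = wins.reshape(H // win, W // win, win, win, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)
    return np.roll(x, shift=(s, s), axis=(0, 1))
```

The two functions are exact inverses, which is what guarantees the spatial continuity claimed for the inverse transformation.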

Within each window, WASSA employs sparsified attention computation strategy, retaining only the most important attention connections through top-k selection mechanism, further reducing computational overhead. This mechanism not only reduces computation but also improves attention focus, enabling models to better attend to key feature regions. The computation formula for sparse attention within windows is:

$A = \mathrm{Softmax}\Big(\mathrm{TopK}_k\big(\tfrac{Q_W K_W^{\top}}{\sqrt{d_k}} + B_{mask}\big)\Big)$ (12)

where $Q_W$ and $K_W$ represent the query and key matrices within windows respectively, $\mathrm{TopK}_k(\cdot)$ represents the sparsification function, retaining only the k largest attention logits in each row (the remainder are masked out before the softmax), $B_{mask}$ is the relative position bias mask for encoding relative position relationships within windows, and $d_k$ is the dimension of the key vectors. The choice of the sparsification parameter k directly affects the balance between computational efficiency and modeling capability; it is typically set as $k = \lceil \rho \cdot win^2 \rceil$, where $\rho$ is the sparsity control parameter.
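The row-wise top-k sparsification of Eq. (12) can be sketched as follows for a single window and single head; the relative position bias is passed as a plain additive term, and ties are assumed absent (true for continuous-valued logits):

```python
import numpy as np

def sparse_window_attention(Q, K, k, bias=0.0):
    """Row-wise top-k sparsified attention weights within one window (Eq. 12).
    Q, K: (L, d_k) with L = win*win. Only the k largest logits per row
    survive; the rest are set to -inf so the softmax assigns them zero."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k) + bias         # (L, L) scaled dot products
    thresh = np.sort(logits, axis=-1)[:, -k][:, None]  # k-th largest per row
    masked = np.where(logits >= thresh, logits, -np.inf)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)       # rows sum to 1, k nonzeros each
```

Each output row is a valid attention distribution supported on exactly k keys, which is the source of the computational saving.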

To restore complete feature representation, WASSA needs to recombine windowed attention results back to original feature map format. This process includes window merging and inverse cyclic shifting steps, ensuring output features maintain the same spatial dimensions and semantic consistency as inputs. The mathematical expressions for window merging and inverse shifting are:

$F_{merged} = \mathrm{Merge}\big(\{A_i V_{W,i}\}\big), \qquad F_{out} = \mathrm{Shift}^{-1}(F_{merged}, s_{shift}, s_{shift})$ (13)

where $V_W$ represents the value matrix within windows, $\mathrm{Merge}(\cdot)$ represents the inverse window transformation operation, recombining the windowed attention outputs into complete feature maps, and $\mathrm{Shift}^{-1}(\cdot)$ represents the inverse cyclic shifting operation, restoring the original spatial position relationships. This inverse transformation ensures the spatial continuity and semantic integrity of the features.

Finally, WASSA performs post-processing through multi-layer perceptrons, enhancing nonlinear expression capabilities of features. This process follows standard Transformer post-processing patterns, including residual connections, layer normalization, and random depth regularization techniques, ensuring training stability and model generalization capability. The complete WASSA output expression is:

$F_{out} = F_{attn} + \mathrm{DropPath}\big(\mathrm{MLP}(\mathrm{LN}(F_{attn}))\big)$ (14)

where $\mathrm{MLP}(\cdot)$ represents the multi-layer perceptron operation, typically containing two linear layers and one GELU activation function, $\mathrm{LN}(\cdot)$ represents the layer normalization operation, and $\mathrm{DropPath}(\cdot)$ represents the path dropout operation for stochastic depth regularization. This design ensures WASSA maintains powerful feature modeling capabilities while significantly reducing computational complexity, making it particularly suitable for high-resolution sonar images with complex spatial relationships.

SEFFN adopts dual-branch parallel processing architecture, achieving enhanced modeling of spatial geometric information through collaborative action of main branch and spatial branch. The main branch is responsible for standard feature transformation and nonlinear mapping, while the spatial branch specifically processes multi-scale spatial context information, providing rich spatial prior knowledge for the main branch. SEFFN first expands feature dimensions to hidden space through input projection layers, then uses depth-wise separable convolution for preliminary spatial feature extraction, finally splitting processed features into two equal-dimensional sub-features in the channel dimension for subsequent gating processing and spatial fusion. The mathematical expressions for input projection and feature splitting are:

$F_{exp} = \mathrm{DWConv}\big(W_{in}(F)\big), \qquad [F_{x1}, F_{x2}] = \mathrm{Split}(F_{exp})$ (15)

where $W_{in}$ represents the projection convolution, expanding the input features from dimension d to $\alpha d$, $\alpha$ is the expansion factor (typically set to 2.0), $\mathrm{DWConv}(\cdot)$ represents the depth-wise separable convolution for extracting local spatial features, and $\mathrm{Split}(\cdot)$ represents the channel splitting operation, evenly dividing the expanded features into two sub-features $F_{x1}$ and $F_{x2}$, each of dimension $\alpha d/2$.

The spatial branch uses average pooling to downsample original inputs by 2x, reducing spatial resolution while maintaining important spatial structure information. Then multi-layer convolution networks extract spatial patterns from downsampled features, with each convolution layer followed by layer normalization and ReLU activation functions to enhance nonlinear expression capabilities of features. Finally, bilinear interpolation upsamples processed features to original resolution, providing multi-scale spatial context information for the main branch. The processing flow of the spatial branch can be expressed as:

$F_s = \mathrm{Up}_{2\times}\Big(\mathrm{ReLU}\big(\mathrm{LN}\big(\mathrm{Conv}_2\big(\mathrm{ReLU}(\mathrm{LN}(\mathrm{Conv}_1(\mathrm{AvgPool}(F))))\big)\big)\big)\Big)$ (16)

where $\mathrm{AvgPool}(\cdot)$ represents the average pooling operation with kernel size 2 and stride 2, $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ represent the two convolution layers respectively, $\mathrm{LN}(\cdot)$ represents the layer normalization operation, $\mathrm{ReLU}(\cdot)$ represents the ReLU activation function, and $\mathrm{Up}_{2\times}(\cdot)$ represents the 2x bilinear upsampling operation. This multi-scale processing strategy enables the network to capture spatial features over different receptive fields, enhancing its perception of targets of different sizes in sonar images.
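A simplified sketch of the spatial branch of Eq. (16): 2x average-pool downsampling via reshape, the Conv-LN-ReLU stages reduced to a ReLU placeholder, and nearest-neighbour repetition standing in for bilinear upsampling (all simplifications are ours, for illustration only):

```python
import numpy as np

def spatial_branch(F):
    """Downsample-process-upsample flow of the SEFFN spatial branch (Eq. 16).
    F: (C, H, W) with even H and W."""
    C, H, W = F.shape
    # AvgPool with kernel 2, stride 2, implemented as a blockwise mean
    pooled = F.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))
    # Placeholder for the two Conv-LN-ReLU stages
    feat = np.maximum(pooled, 0.0)
    # 2x upsampling back to the original resolution (nearest-neighbour here)
    return feat.repeat(2, axis=1).repeat(2, axis=2)
```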

The SEFFN mechanism intelligently fuses outputs from spatial branch with first sub-features from main branch, then controls information flow through gating mechanisms. The fusion process first combines spatial features and main branch features using channel concatenation, then performs feature integration through convolution, reducing channel dimensions after fusion. Next, depth convolution further extracts spatial patterns from fused features, finally achieving adaptive feature selection and enhancement through GELU-activated gating mechanisms with second sub-features via element-wise multiplication. The mathematical expression for this fusion mechanism is:

$F_{fused} = \mathrm{DWConv}_f\big(W_f(\mathrm{Concat}(F_{x1}, F_s))\big), \qquad F_{out} = W_{out}\big(\mathrm{GELU}(F_{fused}) \odot F_{x2}\big)$ (17)

where $\mathrm{Concat}(\cdot)$ represents the channel concatenation operation, $W_f$ represents the feature fusion convolution, reducing the concatenated features back to dimension $\alpha d/2$, $\mathrm{DWConv}_f(\cdot)$ represents the depth convolution after fusion, $\mathrm{GELU}(\cdot)$ represents the GELU activation function, $\odot$ represents the Hadamard product (element-wise multiplication), and $W_{out}$ represents the output projection convolution, restoring the feature dimension from $\alpha d/2$ to the original dimension d.
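The gated fusion of Eq. (17) can be sketched with channel-mixing matrices in place of the 1×1 convolutions; the depth-wise convolution is omitted for brevity, and all shapes and names are our assumptions:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def seffn_gate(Fx1, Fx2, Fs, Wf, Wout):
    """Sketch of the SEFFN gated fusion (Eq. 17).
    Fx1, Fx2: (d_h, H, W) main-branch sub-features; Fs: (d_h, H, W) spatial branch.
    Wf: (d_h, 2*d_h) fusion weights; Wout: (d, d_h) output projection."""
    cat = np.concatenate([Fx1, Fs], axis=0)        # channel concat -> (2*d_h, H, W)
    fused = np.einsum('oc,chw->ohw', Wf, cat)      # 1x1 fusion convolution
    gated = gelu(fused) * Fx2                      # gating via Hadamard product
    return np.einsum('oc,chw->ohw', Wout, gated)   # restore output dimension d
```

The GELU-activated branch acts as a soft gate: where the fused spatial evidence is weak, the corresponding entries of `Fx2` are suppressed.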

The proposed ESATE encoder, through its adaptive sparse attention mechanism and spatial-enhanced feed-forward network, solves the problems of insufficient computational efficiency and feature modeling capability that traditional Transformer encoders face in side-scan sonar image processing. The encoder reduces attention computational complexity from $O(N^2)$ to $O(N \cdot win^2)$ while maintaining global modeling capability, significantly improving computational efficiency and making real-time processing of high-resolution sonar images possible. The SEFFN module significantly enhances the network's perception of spatial geometric information through multi-scale spatial branches and adaptive fusion mechanisms, excelling particularly on irregularly shaped small targets in sonar images. The integration of multi-scale spatial context in the spatial-enhanced feed-forward network further strengthens the encoder's ability to model complex underwater scenes.

Experiments

Public dataset

The KLSG dataset (SeabedObjects-KLSG) [22] is a comprehensive underwater target recognition dataset specifically designed for seabed object detection in side-scan sonar images, established through long-term accumulation. The dataset contains five main categories: 385 wreck images, 36 drowning victim images, 62 airplane images, 129 mine images, and 578 seafloor images, totaling 1190 real side-scan sonar images. This dataset is primarily used for detecting drowning victims, wrecks, and aircraft in underwater search and rescue missions, effectively assisting sonar operators in avoiding missed targets due to fatigue during long search processes. Due to characteristics of side-scan sonar images such as low resolution, sparse target features, and complex backgrounds, combined with class imbalance issues in the dataset itself, the KLSG dataset provides researchers with an important benchmark platform for validating adaptability of deep learning models to special challenges of sonar image processing and complex underwater environments.

Experimental environment and parameter settings

To ensure reproducibility and comparability of experimental results, this study conducted all experiments under a unified hardware and software environment. The hardware platform is equipped with an Intel Core i5-14400F 2.50 GHz processor, 32 GB of memory, and an NVIDIA GeForce RTX 4060Ti 8 GB graphics card as the acceleration device for deep learning model training and inference. The software environment is built on a Win11 64-bit operating system, with PyTorch as the deep learning framework, CUDA for GPU-accelerated computing, and Python 3.10. All experiments were conducted under the same software configuration, and all random seeds and experimental conditions were strictly controlled.

Key hyperparameter settings during training are as follows: RT-DETR-ResNet18 was used as the baseline model, batch size set to 8, AdamW optimizer selected, lr0 set to 0.0001, momentum set to 0.9, weight decay set to 0.0001 to prevent model overfitting. Input images were uniformly resized to 640×640 pixels. Other network structure and training-related parameters adopted RT-DETR default configurations to ensure reproducibility of experimental results and fair comparison with baseline methods.
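For reference, the hyperparameters above can be collected into a configuration dictionary; the key names below follow common Ultralytics/RT-DETR-style conventions and are our assumption, not the authors' exact configuration file:

```python
# Hypothetical training configuration mirroring the hyperparameters in the text.
train_cfg = {
    "model": "rt-detr-r18",   # baseline; MSF-DETR swaps in MASNet/HMAFPN/ESATE
    "batch": 8,               # batch size
    "optimizer": "AdamW",
    "lr0": 1e-4,              # initial learning rate
    "momentum": 0.9,
    "weight_decay": 1e-4,     # regularization against overfitting
    "imgsz": 640,             # input images resized to 640x640
}
```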

Evaluation metrics

This study employs widely recognized evaluation systems in the object detection field for systematic performance assessment of models. Specific evaluation metrics are as follows: for accuracy aspects, precision (P) is used to quantify reliability of model prediction results, and recall (R) measures model coverage rate for target objects; for comprehensive performance aspects, mAP50 (mean average precision at IoU = 0.5) and mAP50-95 (mean average precision at IoU thresholds from 0.5-0.95) are used as core metrics, comprehensively reflecting overall performance of detection algorithms. Meanwhile, to quantify practical application value of models, computational complexity metric GFLOPS (giga floating-point operations) is introduced to evaluate algorithm computational load, and model parameters (Parameters) analyze network storage overhead, thereby multi-dimensionally validating effectiveness of proposed methods in achieving lightweight design while maintaining high detection performance.
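As a concrete reminder of how the IoU thresholds behind mAP50 and mAP50-95 operate, a minimal box-IoU routine (the box format is our choice for illustration):

```python
def box_iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2).
    mAP50 counts a detection as a true positive when its IoU with a same-class
    ground-truth box is at least 0.5; mAP50-95 averages AP over IoU thresholds
    0.50, 0.55, ..., 0.95."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```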

Ablation experiments

Ablation experiments of CMSSC module in MASNet.

To validate the effectiveness of our proposed CMSSC module, we conducted ablation experiments on its components. Experimental results are shown in Table 1. Four configurations were designed: the baseline model (base), adding only the SPU module, adding only the GFPU module, and the complete CMSSC module (including both SPU and GFPU). Through comparative analysis of each component's contribution to model performance and computational efficiency, we systematically evaluated the effectiveness of the dual-domain collaborative feature fusion mechanism.

Table 1. Ablation study results of CMSSC module components.

Evaluation of individual contributions of SPU and GFPU modules to overall detection performance.

https://doi.org/10.1371/journal.pone.0336468.t001

Experimental results demonstrate that each component of the CMSSC module positively impacts model performance. Compared to the baseline model, adding only the SPU module improves mAP50 by 0.6%, while adding only the GFPU module achieves more significant improvement with mAP50 increasing by 0.8%. More importantly, the complete CMSSC module achieves optimal performance with mAP50 reaching 77.1%, representing a 1.4% improvement over the baseline model, and mAP50-95 metric showing a remarkable 2.1% improvement. In terms of computational efficiency, the complete CMSSC module achieves model lightweighting while significantly improving detection accuracy, reducing parameters from 19.97M to 14.52M (27.3% reduction) and decreasing GFLOPS from 57.3 to 49.0 (14.5% computational complexity reduction). These quantitative analysis results fully validate effectiveness of our proposed spatial-frequency dual-domain collaborative feature fusion mechanism, proving that the CMSSC module can significantly enhance small target feature representation capabilities while maintaining computational efficiency.

Ablation experiments of MAFM module in HMAFPN.

To validate effectiveness of our proposed MAFM, we designed ablation experiments targeting different components within MAFM. Experimental results are shown in Table 2. We evaluated performance of baseline method (Concat), simple addition fusion (Addition), MAFM with only attention mechanism, MAFM with only 1×1 convolution, and complete MAFM module on the COCO dataset, analyzing impacts of each component on model accuracy, parameters, and computational complexity through comparative analysis.

Table 2. Ablation experiments of MAFM module in HMAFPN.

Evaluation of attention mechanism and 1×1 convolution contributions to adaptive fusion performance with per-size AP analysis.

https://doi.org/10.1371/journal.pone.0336468.t002

As evident from Table 2, the complete MAFM module achieves a 1.8% improvement in mAP50-95 and 1.2% in mAP50 compared to the baseline concatenation method. Notably, MAFM demonstrates exceptional performance on small target detection, with AP_S improving by 4.6%. This improvement magnitude significantly exceeds the overall mAP improvement, demonstrating that HMAFPN’s design advantages indeed concentrate on small target detection tasks.

Simple addition fusion performs worse on small targets, 2.3% below baseline, validating the necessity of intelligent adaptive fusion mechanisms. Using attention mechanisms alone improves AP_S to 33.5%, while using 1×1 convolution alone reaches 32.8%. The complete MAFM combining both achieves optimal 35.8%. This indicates attention mechanisms and channel alignment convolutions work synergistically, jointly achieving effective preservation and fusion of small target features. Through adaptive weight distribution, MAFM can emphasize high-resolution features more strongly when detecting small targets, while relying more on deep semantic features for large targets. This context-aware fusion strategy is key to the significant performance improvement for small targets.

Adaptive fusion weight distribution analysis.

To address concerns regarding potential weight degradation in the MAFM modules, we conducted a comprehensive analysis of the learned attention weight distributions across the entire test set, as shown in Table 3. We analyzed the attention weights from all MAFM modules in the HMAFPN network during test-set inference. For each fusion operation combining N input features, we calculated the Shannon entropy of the normalized weight distribution, $H = -\sum_{i=1}^{N} w_i \log w_i$, where $w_i$ denotes the attention weight for the i-th input feature. Higher entropy values (approaching $\log N$) indicate uniform weight distributions, while lower entropy values (approaching 0) suggest weight collapse onto a single dominant input. To verify whether temperature tuning is unnecessary, we performed ablation experiments comparing our standard MAFM configuration with temperature-scaled variants, as presented in Table 3. We tested three temperature values $T$ applied to the softmax operation in the attention weight computation. Results demonstrate that $T = 1.0$ (our default value) achieves optimal performance: mAP50 = 76.9%, mAP50-95 = 37.0%. Lower temperatures ($T < 1$) create sharper weight distributions but slightly reduce performance, while higher temperatures ($T > 1$) produce more uniform weights yet also decrease performance. These findings validate that MAFM achieves genuine adaptive multi-scale fusion rather than degenerate single-scale selection, confirming the effectiveness of our hierarchical multi-scale feature integration strategy.
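The entropy diagnostic above can be computed as follows (natural logarithm, so the maximum is ln N for N uniform inputs and the minimum is 0 for a collapsed one-hot weight vector):

```python
import numpy as np

def weight_entropy(weights):
    """Shannon entropy of a (re-)normalized fusion-weight vector.
    Zero-weight entries are dropped, following the convention 0*log(0) = 0."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    nz = w[w > 0]
    return float(-(nz * np.log(nz)).sum())
```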

Table 3. Temperature scaling ablation experiment results.

Evaluation of different temperature values on MAFM module performance and weight distribution characteristics.

https://doi.org/10.1371/journal.pone.0336468.t003

Inter-module synergistic effects ablation experiment.

To validate the effectiveness of our proposed MSF-DETR, we designed comprehensive ablation experiments to evaluate the contribution of each innovative module. Experimental results are shown in Table 4. We evaluated the performance of MASNet (A), HMAFPN (B), the ESATE encoder (C), and their different combinations on the COCO dataset, validating the impact of each component on model accuracy, computational efficiency, and parameters through systematic comparative analysis.

Experimental results demonstrate that each innovative module significantly contributes positively to model performance. MASNet module (A) achieves significant model lightweighting while improving detection accuracy, with mAP50-95 improving by 2.1%, computational complexity reducing by 14.5%, and parameters decreasing by 27.3%, validating high efficiency of multi-scale feature adaptive fusion. HMAFPN (B) contributes 1.8% mAP50-95 improvement with almost no computational overhead increase, reflecting efficiency advantages of hierarchical multi-scale feature fusion. ESATE encoder (C) provides 1.2% accuracy gain, validating effectiveness of sparse attention mechanisms. Finally, the complete MSF-DETR model achieves optimal performance through synergistic action of three innovations, realizing 3.3% mAP50-95 improvement and 2.8% mAP50 improvement compared to baseline model while maintaining good computational efficiency, fully proving excellent performance and practical value of our proposed MSF-DETR in side-scan sonar image small target detection tasks.

Comparative experiments

Comparison experiments of different backbone networks.

To validate the effectiveness of our proposed MASNet backbone network, we conducted comparative experiments against other backbone improvements. Experimental results are shown in Table 5. With the same detection head and training strategy, we comprehensively compared the proposed MASNet with current mainstream lightweight backbone networks, including efficient architectures such as Fasternet, MobileNetV4, and EfficientViT, and advanced attention-based networks such as Swin Transformer and MambaOut, evaluating the backbones across multiple dimensions including detection accuracy, computational complexity, and parameters.

Table 4. Ablation study of inter-module synergistic effects.

Comprehensive evaluation of individual and combined contributions of MASNet, HMAFPN, and ESATE modules.

https://doi.org/10.1371/journal.pone.0336468.t004

Experimental results show that, compared to the baseline ResNet18 network, our proposed MASNet improves mAP50 by 1.4% and mAP50-95 by 2.1%. Compared to the lightweight network Fasternet, MASNet achieves a substantial 3.6% detection accuracy improvement with only a 34.3% parameter increase. Notably, compared to the computationally intensive Swin Transformer, MASNet achieves comparable detection performance with only 50% of the computational overhead, demonstrating an excellent efficiency-accuracy balance. These results validate that the proposed MASNet backbone, through its CMSSC modules and spatial-frequency collaborative convolution mechanism, can significantly improve the accuracy and robustness of small target detection in side-scan sonar images while maintaining a lightweight design.

Comparison experiments of different feature fusion networks.

To validate the effectiveness of our proposed HMAFPN multi-scale feature fusion network, we conducted comparative experiments against other feature fusion architectures. Experimental results are shown in Table 6. With the same backbone network and detection head configuration, we comprehensively compared the proposed HMAFPN with current mainstream feature pyramid networks, including the traditional CCFM baseline, the lightweight SlimNeck network, the advanced MAFPN and BIFPN architectures, and the efficient HSFPN network, systematically evaluating the fusion strategies across multiple dimensions including detection accuracy, computational complexity, and parameter efficiency.

Table 5. Comparison of different backbone networks.

Performance evaluation of MASNet against current mainstream backbone architectures.

https://doi.org/10.1371/journal.pone.0336468.t005

Experimental results prove that compared to baseline CCFM network, our proposed HMAFPN improves mAP50 and mAP50-95 by 1.2% and 1.8% respectively. Compared to HSFPN with similar parameter count, HMAFPN achieves 2.0% detection accuracy improvement with only 25.7% parameter increase, demonstrating excellent parameter efficiency. Notably, compared to computationally intensive BIFPN, HMAFPN achieves 3.1 percentage point accuracy improvement with only 86.7% computational overhead, showing remarkable efficiency-accuracy balance characteristics. These results fully validate that proposed HMAFPN, through MAFM multi-feature fusion modules and dense cross-layer connection strategies, can significantly improve feature representation capabilities and detection accuracy for small target detection in side-scan sonar images while maintaining computational efficiency.

Comparison with different mainstream SOTA network models.

To validate the effectiveness of our proposed MSF-DETR, we conducted comprehensive performance comparison experiments between MSF-DETR and mainstream SOTA models. Experimental results are shown in Table 7. The experiments covered multiple mainstream detection frameworks, including the latest YOLO series models, specially designed lightweight detectors (hyper-yolo-m, Mamba-YOLO-b), and advanced detection models based on the DETR architecture (DEIM-D-Fine-m, the RT-DETR series, etc.). Through in-depth comparative analysis against these mainstream SOTA models, we can fully validate the practical application value and technical advantages of MSF-DETR in complex detection tasks.

Table 6. Comparison of different feature fusion networks.

Performance evaluation of HMAFPN against mainstream feature pyramid network architectures.

https://doi.org/10.1371/journal.pone.0336468.t006

Experimental results fully demonstrate significant advantages and excellent performance of MSF-DETR in balancing accuracy and efficiency. From quantitative analysis perspective, MSF-DETR achieves significant improvements in multiple key metrics with only 50.4 GFLOPS computational complexity, 20.26M parameters, and 71.2 FPS inference speed. Compared to baseline model RT-DETR-r18 using the same DETR architecture, MSF-DETR achieves 12.2% computational complexity reduction while maintaining comparable parameters, with mAP50 improving by 2.8% and mAP50-95 improving by 3.3%, and inference speed improving by 2.7%, fully demonstrating effectiveness of architectural optimization. Overall, MSF-DETR performs excellently in multiple dimensions including lightweight degree, detection accuracy, and computational efficiency, achieving optimal balance between accuracy and efficiency.

We also conducted visual analysis of detection accuracy for different models on our dataset, as shown in Fig 6. The comparative experiment selects current mainstream object detection algorithms as baseline methods, including YOLOv12 and RT-DETR, to comprehensively evaluate the detection capability and robustness of each method in complex scenarios.

Table 7. Comprehensive performance comparison with mainstream SOTA models.

Evaluation of MSF-DETR against current state-of-the-art object detection algorithms.

https://doi.org/10.1371/journal.pone.0336468.t007

Accuracy-throughput analysis and Pareto optimality.

To comprehensively characterize accuracy-efficiency tradeoffs, we generated Pareto curves. All methods were evaluated on identical hardware (RTX 4060Ti, batch size 1).Experimental results are shown in Table 8. The Pareto curve analysis diagram of precision - throughput is shown in Fig 7.

Table 8. Key method performance comparison.

Accuracy-throughput analysis showing Pareto-optimal methods across different operating regions for real-time underwater detection systems.

https://doi.org/10.1371/journal.pone.0336468.t008

MSF-DETR achieves 78.5% mAP50 and 38.5% mAP50-95 at 71.2 FPS, occupying a favorable position on the Pareto frontier, particularly in the real-time region (≥60 FPS). No method simultaneously provides higher accuracy and speed. RT-DETR-r18, while similar in speed, has 2.8% lower accuracy. More accurate RT-DETR-r34 (75.9% mAP50) achieves this at reduced speed. Comparison with YOLO methods reveals competitive tradeoffs. Lightweight variants like Mamba-YOLO-b achieve higher throughput (112.6 FPS) but sacrifice accuracy. More accurate YOLOv8m and YOLOv12m still fall short of MSF-DETR.

Pareto analysis reveals distinct operating regions: a high-speed region dominated by YOLOv10m with moderate accuracy, and a balanced real-time region dominated by MSF-DETR with the highest accuracy. The key finding is that MSF-DETR is the only Pareto-optimal method in the ≥60 FPS region, providing the best balance of accuracy and real-time performance for autonomous underwater systems.

Detection results analysis.

From row-by-row visualization analysis results, we can clearly observe typical detection problems of baseline methods and significant advantages of our proposed MSF-DETR. In the first row test sample, YOLOv12 exhibits obvious false detection problems, incorrectly identifying background regions as target objects, while RT-DETR avoids false detection but has limited detection accuracy. In contrast, MSF-DETR accurately identifies all real targets without false detection phenomena. The second row sample reveals more serious missed detection problems, where both YOLOv12 and RT-DETR fail to detect key targets in images, which could lead to serious consequences in practical applications, while MSF-DETR successfully detects all target instances with accurate localization results. The third row results show RT-DETR has false detection problems, generating fake detection results, while MSF-DETR maintains good detection accuracy and low false positive rates. The fourth and fifth rows further expose limitations of existing methods, where both YOLOv12 and RT-DETR exhibit serious missed detection phenomena in these two test scenarios, particularly performing poorly when processing complex backgrounds and multi-scale targets. In contrast, MSF-DETR demonstrates excellent detection performance in all test samples, not only effectively avoiding false detection and missed detection problems but also excelling in detection confidence and bounding box accuracy.

Generalization experiments

To validate the effectiveness and cross-domain generalization capability of MSF-DETR, we conducted comprehensive generalization experiments on the public KLSG dataset (SeabedObjects-KLSG). Experimental results are shown in Table 9, and the precision–throughput Pareto analysis is shown in Fig 8. As an authoritative benchmark for seabed object detection in side-scan sonar images, KLSG exhibits typical sonar characteristics such as low resolution, sparse target features, complex backgrounds, and class imbalance, making it a challenging platform for assessing detection performance and cross-domain adaptation under these imaging conditions. Through comparative experiments with mainstream SOTA models on this dataset, we evaluate the practical value, technical advantages, and generalization of MSF-DETR relative to traditional detection architectures in complex marine environments.

Table 9. Generalization experiment results on KLSG dataset.

Cross-domain evaluation demonstrating MSF-DETR’s superior generalization capabilities.

https://doi.org/10.1371/journal.pone.0336468.t009

The experimental results demonstrate the strong performance of MSF-DETR in underwater sonar target detection and validate the effectiveness of the proposed method. Quantitatively, MSF-DETR achieves the best scores on all key evaluation metrics with 50.4 GFLOPs of computation, 20.26M parameters, and 71.2 FPS inference speed, demonstrating a good accuracy–efficiency–speed balance. Compared to RT-DETR-r18, the baseline with the most similar computational budget, MSF-DETR improves mAP50 by 1.39% and mAP50-95 by 3.3% while reducing computational complexity by 12.2% and improving inference speed by 2.7%, underscoring its advantage in precise localization.

Heatmap analysis

To further examine MSF-DETR, we performed heatmap visualization of different models on our dataset (Fig 7). Heatmaps intuitively show each model's attention distribution and feature extraction behavior. By comparing the activation patterns of YOLOv12, RT-DETR, and MSF-DETR on the same test samples, we can better understand the internal mechanisms of each algorithm and its ability to perceive key target regions.

The heatmap results clearly show MSF-DETR's advantages in attention focusing. YOLOv12's heatmaps exhibit relatively scattered attention, with strong activation responses in non-target regions, which can lead to false detections and accuracy degradation. RT-DETR, although showing reasonable attention distribution in some regions, has low overall activation intensity and insufficient focus on key target regions, consistent with its missed detections. In contrast, MSF-DETR's heatmaps present precise, concentrated activation patterns, with attention highly focused on real target regions and background noise effectively suppressed. In complex-background and multi-target scenes in particular, MSF-DETR accurately identifies and focuses on each target instance with a more reasonable activation intensity distribution, directly reflecting its superior feature extraction and target localization. These results support the design of MSF-DETR from an interpretability perspective and help explain its performance in practice.
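CAM-style figures such as these are typically produced by collapsing a feature map to a single channel and normalizing it before overlaying it on the image. The following is a generic sketch of that recipe; the paper does not specify its exact visualization code, so the function and shapes here are illustrative assumptions:

```python
import numpy as np

def activation_heatmap(feat, out_size):
    """Turn a C x h x w feature map into a normalized H x W heatmap.

    Channel-wise mean -> ReLU -> min-max normalization -> nearest-neighbor
    upsampling to the image size: the usual recipe behind CAM-style overlays.
    """
    cam = np.maximum(feat.mean(axis=0), 0.0)                  # h x w activation map
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # scale to [0, 1]
    H, W = out_size
    h, w = cam.shape
    rows = np.arange(H) * h // H                              # nearest source rows
    cols = np.arange(W) * w // W                              # nearest source cols
    return cam[np.ix_(rows, cols)]                            # H x W heatmap

# Mock backbone output standing in for a real feature map (hypothetical shapes).
feat = np.random.default_rng(0).normal(size=(256, 20, 20))
heatmap = activation_heatmap(feat, (160, 160))                # ready to overlay
```

Strongly localized activations in such a map correspond to the "concentrated" patterns described above, while diffuse high values correspond to the scattered attention attributed to the baselines.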

Discussion

Experimental results demonstrate that MSF-DETR achieves strong performance in side-scan sonar small target detection through its three-module collaborative architecture. The improvements across multiple evaluation metrics validate the combination of spatial-frequency domain features, hierarchical multi-scale fusion, and efficient sparse attention. The MASNet backbone shows that jointly processing spatial and frequency domain features yields a superior target representation, improving mAP50-95 by 2.1% while reducing parameters by 27.3%. HMAFPN with MAFM modules shows that attention-based feature fusion significantly outperforms plain concatenation, contributing a 1.8% accuracy improvement with minimal computational overhead. The ESATE encoder shows that sparse attention can reduce computational complexity from quadratic to linear while maintaining global modeling capability, reaching 71.2 FPS real-time inference, a 2.7% speedup over the RT-DETR-r18 baseline, and making the Transformer architecture practical for high-resolution sonar processing.
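The quadratic-to-linear claim can be sanity-checked with a back-of-envelope count of scored query–key pairs. This is a generic window-attention estimate under assumed sizes, not the exact cost model of the paper's WASSA mechanism:

```python
def attn_token_pairs(n_tokens, window=None):
    """Count query-key pairs scored by self-attention.

    Full attention scores every pair: n^2. Window attention restricts each
    query to a local window of w keys: n * w, linear in n for fixed w.
    """
    if window is None:
        return n_tokens * n_tokens
    return n_tokens * min(window, n_tokens)

n = 64 * 64                                  # a 64x64 feature map, 4096 tokens
full = attn_token_pairs(n)                   # 4096^2 = 16,777,216 pairs
sparse = attn_token_pairs(n, window=8 * 8)   # 4096 * 64 = 262,144 pairs
# full // sparse == 64: a 64x reduction at this resolution, growing with n.
```

Because the sparse cost scales as n·w rather than n², the gap widens at the higher feature-map resolutions where full attention becomes impractical.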

The proposed framework addresses key challenges in underwater sensing and provides a solid foundation for advanced marine engineering applications. Results on both the self-built SSST-3K dataset and the public KLSG dataset show consistent improvements under different acoustic conditions, with MSF-DETR achieving 78.5% mAP50 and 38.5% mAP50-95, state-of-the-art performance in sonar small target detection. Its balance of accuracy, computational efficiency, and real-time inference (71.2 FPS) makes it well suited to deployment in autonomous underwater systems, supporting safer and more effective underwater operations.

Limitations

Despite encouraging results, several limitations should be acknowledged.

First, the current evaluation covers a limited number of datasets and lacks systematic cross-domain generalization experiments. While our SSST-3K dataset is comprehensive within its scope, it represents specific acoustic conditions and geographic regions. Future work should validate the method across different acoustic environments and target categories, including zero-shot and few-shot transfer learning experiments to evaluate MSF-DETR's domain adaptation and generalization, with cross-validation across different sonar systems, operating frequencies, and marine environments.

Additionally, we plan to systematically introduce and evaluate data augmentation strategies in follow-up work: designing a sonar-specific augmentation library (speckle noise, TVG simulation, banding artifacts, acoustic shadows, reverberation effects), combining it with generic augmentations (geometric transformations, color perturbations, mosaics), running ablation experiments to quantify the contribution of each technique, and measuring the impact of augmentation on cross-domain generalization. This should yield empirical guidance for the sonar image detection community on which augmentation strategies are most effective.

Second, although MSF-DETR improves inference speed over the baseline models, reaching 71.2 FPS, deployment on resource-constrained underwater platforms still requires further optimization. A gap remains relative to traditional YOLO architectures (e.g., YOLOv10m at 108.5 FPS), but given the characteristics of the DETR architecture and its accuracy advantages, the current inference speed meets real-time requirements for most underwater detection tasks. Future work should explore model quantization, knowledge distillation, and related techniques to further improve inference efficiency on embedded systems.

Potential Enhancement through Decomposition-plus-Sparsity Preprocessing

Singular Spectrum Analysis (SSA) combined with hierarchical hyper-Laplacian priors could potentially further stabilize small-target signals in sonar imagery. SSA decomposes acoustic backscatter into target, texture, reverberation, and noise components through eigendecomposition of a trajectory matrix. Hyper-Laplacian priors promote sparsity at multiple scales while preserving edge structures, potentially enhancing small-target visibility [40].
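For intuition, the core SSA steps (embed, decompose, truncate, Hankelize back) can be sketched in one dimension. This is a generic illustration of the decomposition idea only, not the hierarchical hyper-Laplacian pipeline of [40], which operates on 2-D sonar imagery; the signal and parameters below are synthetic:

```python
import numpy as np

def ssa_decompose(x, L, rank):
    """Minimal 1-D singular spectrum analysis (illustrative sketch).

    Embeds the series in an L x K trajectory (Hankel) matrix, takes its SVD,
    keeps the leading `rank` components, and maps back to a series by
    averaging anti-diagonals. In sonar, leading components would approximate
    the target/texture part; the residual holds reverberation and noise.
    """
    N = len(x)
    K = N - L + 1
    X = np.column_stack([x[i:i + L] for i in range(K)])  # L x K trajectory matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]         # rank-truncated trajectory
    recon = np.zeros(N)                                  # diagonal averaging:
    counts = np.zeros(N)                                 # entry (i, j) -> index i+j
    for j in range(K):
        recon[j:j + L] += X_low[:, j]
        counts[j:j + L] += 1
    return recon / counts

# A noisy sinusoid: its trajectory matrix is rank 2, so a rank-2 SSA
# reconstruction recovers the clean oscillation and sheds most of the noise.
t = np.linspace(0, 4 * np.pi, 200)
noisy = np.sin(t) + 0.3 * np.random.default_rng(0).normal(size=200)
smooth = ssa_decompose(noisy, L=40, rank=2)
```

The quadratic cost in the window length L visible in the trajectory-matrix construction is exactly the O(L²MN) overhead discussed below for 2-D images.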

However, implementation challenges exist. SSA's O(L²MN) complexity introduces significant overhead for images; its parameters (window length, component count) are sensitive to acoustic conditions and would require adaptive tuning; and aggressive preprocessing might discard the subtle features our dual-domain architecture is designed to capture.

Future integration strategies could explore conditional preprocessing (enabled only under high noise), learnable sparse decomposition layers (optimized end-to-end), and hyper-Laplacian regularization integrated into MASNet feature extraction. These directions hold potential value for small-target detection under extreme conditions (high sea states, strong multipath, low-frequency sonar) but require balancing accuracy gains against computational cost.

Conclusion

This paper proposed MSF-DETR, a novel end-to-end algorithm for small target detection in side-scan sonar images. Its three core innovations work in concert: the MASNet backbone with dual-domain spatial-frequency collaborative convolution improves mAP50-95 by 2.1% while reducing parameters by 27.3%; the HMAFPN feature fusion network with adaptive multi-input fusion (MAFM) modules contributes a 1.8% accuracy improvement with minimal overhead; and the ESATE encoder's sparse attention reduces computational complexity from quadratic to linear while maintaining global modeling capability, reaching 71.2 FPS, 2.7% faster than the RT-DETR-r18 baseline.

On both the self-built SSST-3K dataset and the public KLSG dataset, MSF-DETR delivers consistent improvements under different acoustic conditions, achieving 78.5% mAP50 and 38.5% mAP50-95, state-of-the-art performance in sonar small target detection. Its balance of accuracy, efficiency, and real-time inference (71.2 FPS) makes it well suited to deployment in autonomous underwater systems, facilitating safer and more effective underwater operations. Future research directions include expanding target categories, further optimizing inference speed and embedded deployment, and exploring temporal consistency in video sonar sequences.

References

  1. Steiniger Y, Kraus D, Meisen T. A study on modern deep learning detection algorithms for automatic target recognition in side-scan sonar images. In: Proc Meet Acoust. 2021. 070004.
  2. Wang Z, Zhang S, Huang W, Guo J, Zeng L. Sonar image target detection based on adaptive global feature enhancement network. IEEE Sensors J. 2022;22(2):1509–30.
  3. Abu A, Diamant R. A statistically-based method for the detection of underwater objects in sonar imagery. IEEE Sensors J. 2019;19(16):6858–71.
  4. Chen Z, Wang H, Shen J. Underwater object detection by combining the spectral residual and three-frame algorithm. Adv Comput Sci Appl. 2014;279:1109–14.
  5. Wang J, Feng C, Wang L, Li G, He B. Detection of weak and small targets in forward-looking sonar image using multi-branch shuttle neural network. IEEE Sensors J. 2022;22(7):6772–83.
  6. Zhang H, Tian M, Shao G, Cheng J, Liu J. Target detection of forward-looking sonar image based on improved YOLOv5. IEEE Access. 2022;10:18023–34.
  7. Li L, Li Y, Yue C, Xu G, Wang H, Feng X. Real-time underwater target detection for AUV using side scan sonar images based on deep learning. Applied Ocean Research. 2023;138:103630.
  8. Fan Z, Xia W, Liu X, Li H. Detection and segmentation of underwater objects from forward-looking sonar based on a modified Mask RCNN. SIViP. 2021;15(6):1135–43.
  9. Li C, Ye X, Cao D. CCW-YOLOv5: coordinate convolution and weighted loss based YOLOv5 for side-scan sonar target detection. IEEE Journal of Oceanic Engineering. 2022;48:233–47.
  10. Yu Y, Zhao J, Gong Q, Huang C, Zheng G, Ma J. Real-time underwater maritime object detection in side-scan sonar images based on transformer-YOLOv5. Remote Sensing. 2021;13(18):3555.
  11. Wang H, Zhang P, You M. Underwater sonar image targets detection based on improved RT-DETR. J Mar Sci Eng. 2024;12:1384.
  12. Chen J, Li W, Zhang H. Underwater object detection in sonar imagery with detection transformer and zero-shot neural architecture search. arXiv preprint 2024. https://arxiv.org/abs/2505.06694
  13. Wang L, Chen Y, Liu S. Progressive sensitivity capturing network for sonar target detection. IEEE Trans Geosci Remote Sens. 2023;61:4203415.
  14. Zhao Y, Lv W, Xu S. DETRs beat YOLOs on real-time object detection. arXiv preprint 2023. https://arxiv.org/abs/2304.08069
  15. Lin T, Dollar P, Girshick R. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. p. 2117–25.
  16. Liu S, Qi L, Qin H, Shi J, Jia J. Path aggregation network for instance segmentation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018. p. 8759–68. https://doi.org/10.1109/cvpr.2018.00913
  17. Ghiasi G, Lin T, Le Q. NAS-FPN: learning scalable feature pyramid architecture for object detection. In: Proc IEEE/CVF Conf Comput Vis Pattern Recognit. 2019. p. 7036–45.
  18. Tan M, Pang R, Le Q. EfficientDet: scalable and efficient object detection. In: Proc IEEE/CVF Conf Comput Vis Pattern Recognit. 2020. p. 10781–90.
  19. Dang J, Tang X, Li S. HA-FPN: hierarchical attention feature pyramid network for object detection. Sensors (Basel). 2023;23(9):4508. pmid:37177710
  20. Zhu L, Lee F, Cai J, Yu H, Chen Q. An improved feature pyramid network for object detection. Neurocomputing. 2022;483:127–39.
  21. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proc IEEE Conf Comput Vis Pattern Recognit. 2016. p. 770–8.
  22. Steiniger Y, Kraus D, Meisen T. SeabedObjects-KLSG: a large-scale dataset for seabed object detection in side-scan sonar images. Sci Data. 2022;9:719.
  23. Chen J, Kao S, He H. Run, don't walk: chasing higher FLOPS for faster neural networks. In: Proc IEEE/CVF Conf Comput Vis Pattern Recognit. 2023. p. 12021–31.
  24. Qin D, Leichner C, Delakis M. MobileNetV4: universal models for the mobile ecosystem. arXiv preprint 2024. https://arxiv.org/abs/2404.10518
  25. Liu Z, Lin Y, Cao Y. Swin transformer: hierarchical vision transformer using shifted windows. In: Proc IEEE/CVF Int Conf Comput Vis. 2021. p. 10012–22.
  26. Liu X, Peng H, Zheng N, Yang Y, Hu H, Yuan Y. EfficientViT: memory efficient vision transformer with cascaded group attention. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023. p. 14420–30. https://doi.org/10.1109/cvpr52729.2023.01386
  27. Yu W, Luo M, Zhou P. MambaOut: do we really need mamba for vision?. arXiv preprint 2024. https://arxiv.org/abs/2405.07992
  28. Li H, Li J, Wei H. SlimNeck by PAI-Blade: a more efficient design for object detection. arXiv preprint 2022.
  29. Zhang H, Li F, Liu S. Multi-scale aggregation feature pyramid network for object detection. Pattern Recognit. 2022;125:108508.
  30. Tan M, Pang R, Le Q. EfficientDet: scalable and efficient object detection. In: Proc IEEE/CVF Conf Comput Vis Pattern Recognit. 2020. p. 10781–90.
  31. Yang G, Lei J, Zhu Z. Hierarchical spatial feature pyramid network for object detection. Neurocomputing. 2023;544:126265.
  32. Jocher G, Chaurasia A, Qiu J. Ultralytics YOLOv8. GitHub repository. 2023. https://github.com/ultralytics/ultralytics
  33. Wang C, Yeh I, Liao H. YOLOv9: learning what you want to learn using programmable gradient information. arXiv preprint 2024.
  34. Wang A, Chen H, Liu L. YOLOv10: real-time end-to-end object detection. arXiv preprint 2024. https://arxiv.org/abs/2405.14458
  35. Jocher G, Qiu J, Chaurasia A. Ultralytics YOLO11. https://github.com/ultralytics/ultralytics
  36. Tian Y, Ye Q, Doermann D. YOLOv12: attention-centric real-time object detectors. arXiv preprint 2025. https://arxiv.org/abs/2502.12524
  37. Liu H, Sun F, Gu J. Hyper-YOLO: an efficient and powerful YOLO architecture. Comput Vis Image Underst. 2023;234:103751.
  38. Zhu X, Lyu S, Wang X. Mamba-YOLO: SSM-based YOLO for object detection. arXiv preprint 2024. https://arxiv.org/abs/2406.05835
  39. Huang S, Lu Z, Cun X. DEIM: DETR with improved matching for fast convergence. arXiv preprint 2024.
  40. Algburi RNA, Aljibori HSS, Al-Huda Z, Gu YH, Al-antari MA. Advanced fault diagnosis in industrial robots through hierarchical hyper-laplacian priors and singular spectrum analysis. Complex Intell Syst. 2025;11(6).