Abstract
Ship object detection and fine-grained recognition in remote sensing images are active topics in remote sensing image processing, with applications in fishing vessel operation command, merchant ship route planning, and other fields. To improve detection accuracy for different types of remote sensing ship objects, this paper proposes a ship object perception and feature refinement method based on an improved ReDet, called Mamba-ReDet (M-ReDet). First, this paper designs a fine-grained feature extraction backbone for ship objects (Mamba-ReResNet, M-ReResNet), which selects and reconstructs the distinctive features of different types of ship objects through Mamba's selective memory to improve the algorithm's ability to extract fine-grained features. Second, M-ReDet contains a Ship Object Perception Module (SOPM) and a Ship Feature Refinement Module (SFRM), which extract the ship's spatial position information from the feature map, fuse spatial position information across scales, and use this information to refine the fine-grained features, improving detection accuracy for different categories of ships. Finally, we use KFIoU and Focal Loss as the algorithm's regression and classification losses to improve training accuracy. Experimental results show that the mAP0.5 of M-ReDet on the FAIR1M(ship) and DOTAv1.0 visible light (RGB) remote sensing image datasets reaches 43.29% and 82.09%, respectively, which is 2.78% and 3.34% higher than ReDet.
Citation: Liu X, Feng C, Zi S, Qin Z, Guan Q (2025) M-ReDet: A mamba-based method for remote sensing ship object detection and fine-grained recognition. PLoS One 20(8): e0330485. https://doi.org/10.1371/journal.pone.0330485
Editor: Fatih Uysal, Kafkas University: Kafkas Universitesi, TÜRKIYE
Received: April 17, 2025; Accepted: August 1, 2025; Published: August 21, 2025
Copyright: © 2025 Liu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The curated data and code can be accessed at the following link: https://github.com/LG973641114/M-ReDet/releases.
Funding: This work is supported by the Department of Science and Technology of Jilin Province, China [YDZJ202501ZYTS600]. There was no additional external funding received for this study. Funded studies: Initials of the author who received the award: Liu Xuhui Grant numbers awarded to the author: YDZJ202501ZYTS600 Full name of the funder: Jilin Provincial Department of Science and Technology URL of the funder website: http://kjt.jl.gov.cn/. The sponsors (Jilin Provincial Department of Science and Technology) provided financial support through the grant YDZJ202501ZYTS600, managed by Prof. Wang Jia. However, they did not participate in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. Although Prof. Wang Jia oversaw the funding allocation and supported the research infrastructure, her contributions were limited to administrative and financial management, and she is not listed as an author in this paper.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Remote sensing ship object detection and fine-grained recognition have important research value in fishery monitoring, navigation planning, and other fields. However, using optical remote sensing images for these tasks is complicated by inconsistent ship sizes and rotation angles, as well as complex backgrounds. Therefore, a remote sensing object detection algorithm must solve many problems in order to extract the fine-grained features of ship objects and to strengthen spatial positioning and feature refinement capabilities.
Commonly used remote sensing object detection algorithms can be categorized into three classes: Convolutional Neural Network (CNN)-based, Transformer-based [1], and Mamba-based [2] algorithms. The first class is mainly improved from classical networks such as YOLO [3], SSD [4], Faster R-CNN [5], and ResNet [6]. In 2021, Xue Yang et al. proposed the GWD algorithm [7], which uses the Gaussian Wasserstein Distance to characterize the distance between rotated boxes and solve the problem of a discontinuous rotation angle range. The same year, Xue Yang et al. proposed the R3Det algorithm [8], which designed a feature refinement module to obtain the object's position information and realize feature alignment. Jiaming Han et al. proposed the ReDet algorithm [9], which encodes rotation-equivariant and rotation-invariant features to improve the detection accuracy of remote sensing objects. In 2022, Liping Hou et al. proposed the SASM algorithm [10], which uses two strategies, Shape-Adaptive Selection (SAS) and Shape-Adaptive Measurement (SAM), to realize the selection and evaluation of positive samples.
Remote sensing objects differ significantly in size, and their backgrounds carry rich information, so it is important to separate an object's distinctive information from the background; the Transformer can effectively address this problem. In 2022, Li Qingyun et al. proposed the TRD algorithm [11], which reconstructed the Transformer and effectively extracted objects' spatial position information and the correlation information between instances. In 2023, Wei Liu et al. proposed the AMTNet algorithm [12], which combines CNN and Transformer to reconstruct the backbone network; through feature exchange, feature maps of different scales are used to improve the network's feature extraction performance for changed regions. In 2024, Mingji Yang et al. proposed the Hybrid DETR algorithm [13], which extracts alienation features between remote sensing objects and then uses them, through the SODM module, to distinguish small objects in complex backgrounds, improving the algorithm's detection ability for small remote sensing objects.
Mamba-based remote sensing object detection algorithms are a new class that has grown popular in recent years [14]; Mamba's selective structure reduces the computational complexity of the Transformer and can highlight a remote sensing object's effective feature information to improve the algorithm's robustness. In 2024, Yue Zhan et al. proposed the MambaSOD algorithm [15], which uses a dual Mamba-driven feature extractor for RGB and depth information to model long-range dependencies in multimodal inputs with linear complexity, and designs a cross-modal fusion Mamba model to capture multimodal features. In the same year, Tushar Verma et al. proposed the SOAR algorithm [16], which combines Mamba and YOLOv9 [17], reduces the loss of useful information, expands the receptive field, and can effectively detect remote sensing objects.
The above three types of remote sensing rotated object detection algorithms have good detection results in their respective fields but have the following problems [18–19]:
- Many algorithms do not handle ship objects of different sizes well; some are improved only for small objects, while others are improved only for large or occluded objects.
- As network layers deepen, the fine-grained features of remote sensing ship objects are gradually lost, and some algorithms do not retain these fine-grained features in long-term memory.
- The complementary information fusion between feature maps of different scales is critical, but this process also requires selective learning to reduce the interference of redundant contextual information on remote sensing ship detection and fine-grained recognition of different sizes.
To address the three problems mentioned above, this paper proposes the M-ReDet algorithm, which has the following main innovations:
- We propose the M-Bottleneck to construct a new backbone network that selectively retains the fine-grained features of small ships in shallow feature maps and maintains long-term memory of the semantic features of large ships in deep feature maps, while expanding the receptive field to reduce false detections of objects such as berths.
- This article designs the SFRM module to reconstruct feature maps of different resolutions and selectively supplement the information difference between feature maps of different levels to improve the algorithm’s detection and fine-grained recognition capabilities for ships of different sizes.
- This paper compares the effects of different combinations of CrossEntropyLoss, SmoothL1Loss, Focal Loss, and KFIoU used in the algorithm’s training and designs several groups of comparative experiments to find the optimal loss function configuration, improve the algorithm’s regression and classification accuracy, and reduce the loss of accuracy due to the imbalance in the number of different ship categories.
- This article proposes a new remote sensing ship object detection algorithm called M-ReDet and conducts multiple comparative and ablation experiments on the FAIR1M(ship) [20] and DOTA datasets [21] to verify the effectiveness of the SOPM, SFRM modules, and optimized loss functions.
Related work
CNN
ReDet is a classic remote sensing object detection algorithm based on convolutional neural networks. The design of ReDet [22] mainly aims to solve two problems. First, building on the rotation-variance of convolutional neural networks, ReDet proposes a rotation-invariant backbone network to extract the rotation-invariant features of remote sensing objects. Second, RRoI Align only performs spatial alignment, with no alignment in the channel dimension; the RiRoI Align module, working with the RPN (Region Proposal Network) and RT (RoI Transformer), extracts the features of rotated proposal regions for classification and regression.
The overall structure of ReDet mainly consists of a backbone network (ReResNet50) and a Neck (ReFPN) [23], which can effectively extract the rotation-invariant features of remote sensing objects. After a remote sensing image passes through ReResNet50, the algorithm first obtains rotation-equivariant features of the object; the computations for multiple orientations share weights, dramatically reducing the number of parameters required per rotation, so rotation-equivariant features of multiple orientations can be obtained from a remote sensing image of a fixed orientation. The feature maps of different layers are then fused in ReFPN, and after the RiRoI Align module, the rotation-invariant features of the same remote sensing object can be extracted from the rotation-equivariant features. These features, such as the tail and wing of an airplane or the aspect ratio of a ship, can enhance the accuracy of remote sensing object detection. Fig 1 shows the network structure of ReDet.
Transformer
Two of the most widespread Transformer attention mechanisms in 2025, GQA (Group Query Attention) [24] and MLA (Multi-Head Latent Attention) [25], have driven the development of large models such as Qwen [26] and DeepSeek [27], and have also provided CNN networks with a variety of optimization schemes. For example, MSTrans [28] uses the Q, K, and V of GQA to reorganize the input vectors and applies its MST module to multiple hierarchical feature maps to enhance the model's ability to extract pixel-level features of buildings. The basis of attention in GQA is self-attention, which serializes data such as text or images and reconstructs the sequence by computing the correlations within the sequence, thereby extracting the features of specific objects. The computation process of self-attention is shown below:
$$Q = XW_Q,\qquad K = XW_K,\qquad V = XW_V$$

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $X$ is the input vector; $W_Q$, $W_K$, and $W_V$ are the learnable parameters; $Q$, $K$, and $V$ are the sequence computation units; and $d_k$ is the data dimension. With the increase of parallel computation and the demand to reduce computation, self-attention gradually evolved into Multi-Head Attention (MHA), Group Query Attention (GQA), and Multi-Head Latent Attention (MLA); the most important difference between them is how they avoid redundant computation through the KV-cache technique. Their computational formulas are, respectively:

$$O = \mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)\,W^{O}$$

$$\text{MHA: }\ \mathrm{head}_i=\mathrm{Attention}\!\left(XW^{Q}_i,\, XW^{K}_i,\, XW^{V}_i\right)$$

$$\text{GQA: }\ \mathrm{head}_i=\mathrm{Attention}\!\left(XW^{Q}_i,\, XW^{K}_{\lceil i g/h\rceil},\, XW^{V}_{\lceil i g/h\rceil}\right)$$

where $O$ is the output of the attention mechanism; $W^{Q}_i$, $W^{K}_i$, and $W^{V}_i$ are per-head projections playing the same roles as $W_Q$, $W_K$, and $W_V$; and $g$ is the number of groups. GQA shares one $K$/$V$ projection pair within each group of query heads, while MLA instead compresses $K$ and $V$ into a low-rank latent representation.
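To make the grouping concrete, the following is a minimal NumPy sketch of self-attention and GQA as described above; all dimensions and weight values are arbitrary toy choices for illustration, not settings from this paper or from any specific model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Q = X Wq, K = X Wk, V = X Wv; out = softmax(Q K^T / sqrt(d_k)) V
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def gqa(X, Wq_heads, Wk_groups, Wv_groups):
    # Group Query Attention: each query head reuses the K/V projections of
    # its group, shrinking the KV-cache by a factor of h/g versus MHA.
    h, g = len(Wq_heads), len(Wk_groups)
    heads = [self_attention(X, Wq_heads[i],
                            Wk_groups[i * g // h], Wv_groups[i * g // h])
             for i in range(h)]
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
T, d_model, d_k, h, g = 5, 8, 4, 4, 2   # toy sizes: 4 heads, 2 K/V groups
X = rng.standard_normal((T, d_model))
Wq_heads = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
Wk_groups = [rng.standard_normal((d_model, d_k)) for _ in range(g)]
Wv_groups = [rng.standard_normal((d_model, d_k)) for _ in range(g)]
out = gqa(X, Wq_heads, Wk_groups, Wv_groups)
print(out.shape)  # (5, 16): h heads of width d_k concatenated
```

With `g = h` this reduces to MHA, and with `g = 1` every head shares a single K/V pair (multi-query attention); the output projection $W^{O}$ is omitted here for brevity.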
Mamba
Mamba is a selective attention mechanism that can reduce the computation of the Transformer, selectively extract object features, and break through the bottleneck of traditional models in content awareness and long-range modeling. The parameters of the traditional state space model (SSM) are static; Mamba introduces the gating parameter Δ, which discretizes the sequence. Mamba's gating parameter, hidden state, and output are calculated as follows:
$$\bar{A} = \exp(\Delta A),\qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$$

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t$$

$$y_t = C\,h_t$$

where $x_t$ is the input sequence; $A$, $B$, $C$, and $\Delta$ are the learnable parameters; $\bar{A}$ and $\bar{B}$ are the discretization parameters; $h_t$ is the hidden state; and $y_t$ is the output sequence. Mamba's memorability can capture the contextual information of smaller targets, and its selectivity can dynamically adjust the proportion of tail-category features to reduce the accuracy loss caused by the long-tailed distribution.
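The recurrence above can be sketched in a few lines of NumPy. This is a simplified illustration: the state matrix is diagonal, the discretized input matrix is approximated by the Euler step $\bar{B}\approx\Delta B$ rather than the full zero-order-hold formula, and in real Mamba layers $\Delta$ is produced from the input itself (which is what makes the scan selective); all sizes are toy values:

```python
import numpy as np

def ssm_scan(x, A, B, C, delta):
    # Discretized state-space recurrence:
    #   A_bar = exp(delta * A); B_bar ~= delta * B (Euler simplification)
    #   h_t = A_bar * h_{t-1} + B_bar * x_t;  y_t = C . h_t
    # x: (T,) input sequence; A, B, C: (N,) diagonal/state parameters;
    # delta: (T,) per-step gating parameter (input-dependent in Mamba).
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        A_bar = np.exp(delta[t] * A)   # discretized state transition
        B_bar = delta[t] * B           # simplified discretized input matrix
        h = A_bar * h + B_bar * x[t]   # hidden state carries long-range memory
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(1)
T, N = 6, 4
x = rng.standard_normal(T)
A = -np.abs(rng.standard_normal(N))      # negative entries keep the scan stable
B, C = rng.standard_normal(N), rng.standard_normal(N)
delta = np.abs(rng.standard_normal(T))   # positive step sizes
y = ssm_scan(x, A, B, C, delta)
print(y.shape)  # (6,)
```

Note how `delta[t] = 0` makes the step ignore its input entirely while preserving the state — this per-step gate is the mechanism by which a Mamba layer can selectively retain or discard features along the sequence.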
Our work
M-ReDet
This paper proposes the M-ReDet algorithm, which improves the Backbone, Neck, and Head of the ReDet algorithm to strengthen its detection and fine-grained recognition ability for ships of all sizes. The overall framework of M-ReDet is shown in Fig 2 and consists of the M-ReResNet50, the M-ReFPN, and the Head. The specific structure of each improved module is described in detail in the subsequent subsections.
SOPM
The ship object perception module in this paper better extracts the edge, texture, semantic, and other features of remote sensing ship objects and improves the network's ability to detect and recognize ships at a fine-grained level in complex sea areas such as ports. The SOPM, as the main component of the backbone, can selectively extract and highlight the features of the object's area. Its memorability also passes the features extracted at one level of the feature map to the next layer, so that the next layer can selectively extract the contextual information of remote sensing ship objects of different sizes. The specific structure of the SOPM is shown in Fig 3.
Before the feature map enters the SOPM, the remote sensing image passes through a 7 × 7 convolutional layer that initially extracts features such as ship texture and edges. The feature map then enters the multilayer ResLayers, which are composed of M-Bottlenecks and Bottlenecks. Inside the M-Bottleneck, the feature map first passes through convolution, an MLP, and a Sigmoid to compute per-channel weights, which weight the semantic, edge, and other features of the remote sensing ships to obtain the fine-grained features needed for ship detection and recognition. Then, the SOPM module serializes the feature map. After the serialized features pass through Mamba's SSM module, the module can selectively extract the features of ship objects matching the scale of that level's feature map and pass other features to the next level through memory, supplementing the contextual information and fine-grained features of ships at the scale of the next level.
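The channel-weighting step described above (convolution/MLP/Sigmoid producing per-channel weights) can be sketched as follows. This is a squeeze-and-excitation-style illustration of the idea, not the paper's exact M-Bottleneck layer; the pooling choice, reduction ratio `r`, and weight shapes are all assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_weighting(fmap, W1, W2):
    # Per-channel attention: pool each channel to a scalar, run a small MLP,
    # squash with Sigmoid, and reweight the channels of the feature map.
    # fmap: (C, H, W); W1: (C, C//r); W2: (C//r, C).
    pooled = fmap.mean(axis=(1, 2))                       # (C,) channel summary
    weights = sigmoid(np.maximum(pooled @ W1, 0) @ W2)    # MLP + Sigmoid -> (C,)
    return fmap * weights[:, None, None]                  # emphasize useful channels

rng = np.random.default_rng(2)
C, H, W, r = 8, 4, 4, 2
fmap = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C, C // r))
W2 = rng.standard_normal((C // r, C))
out = channel_weighting(fmap, W1, W2)
print(out.shape)  # (8, 4, 4)
```

Because every weight lies in (0, 1), the block can only attenuate channels; training pushes the weights of channels carrying ship texture and edge cues toward 1 and background channels toward 0.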
Corresponding to the four ResLayers of ReResNet50, M-ReResNet50 also has four ResLayers. However, the number of M-Bottlenecks in each layer differs from ReResNet50, and ResLayers with different numbers of M-Bottlenecks have different capabilities for feature extraction, object perception, and localization of remote sensing ship objects. To find the optimal ratio while keeping the total number of M-Bottlenecks and Bottlenecks at 3-4-6-3, this paper sets the number of M-Bottlenecks in each layer to 0-0-0-0, 0-1-1-0, 1-1-1-1, 1-2-2-1, and 2-2-2-2, respectively, and observes the effect of the M-Bottleneck configuration on the algorithm. The configurations and experimental results are shown in Table 1.
According to Table 1, when the number of M-Bottlenecks in each ResLayer is set to 1-1-1-1, the algorithm's mAP0.5 is highest, at 41.52%, a 1.01% improvement over the original ResLayers of ReResNet50. As the number of M-Bottlenecks increases further, M-ReDet's accuracy does not continue to rise but stabilizes around 41.52%; since adding too many M-Bottlenecks introduces extra computation, this paper chose 1-1-1-1 as the ResLayer configuration.
SFRM
Compared with ReResNet's stacking of a large number of small convolutional kernels, stacking a small number of M-Bottlenecks yields a substantial enlargement of the receptive field; in theory, the M-Bottleneck can obtain a global receptive field, which effectively improves the algorithm's ability to extract the contextual information of ship objects. After the backbone's feature maps pass through the SOPM module, the algorithm obtains the rotation-equivariant, rotation-invariant, and fine-grained features of various ship objects. The SFRM module in this section makes full use of these features, fusing the upper-layer low-resolution fine-grained semantic features of ships with the lower-layer high-resolution localization information to improve the algorithm's ship detection and fine-grained recognition accuracy. The structure of the SFRM module is shown in Fig 4.
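The cross-scale fusion that SFRM builds on can be illustrated with a minimal FPN-style sketch: the deep, low-resolution semantic map is upsampled and blended with the shallow, high-resolution localization map. The fixed blend weight `alpha` and nearest-neighbour upsampling are illustrative assumptions; the actual SFRM performs learned, selective fusion:

```python
import numpy as np

def upsample2x(fmap):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fuse(deep, shallow, alpha=0.5):
    # Blend upsampled deep semantics with shallow localization detail.
    # deep: (C, H, W); shallow: (C, 2H, 2W); returns (C, 2H, 2W).
    return alpha * upsample2x(deep) + (1 - alpha) * shallow

rng = np.random.default_rng(3)
deep = rng.standard_normal((8, 4, 4))     # low-res, semantically rich
shallow = rng.standard_normal((8, 8, 8))  # high-res, spatially precise
fused = top_down_fuse(deep, shallow)
print(fused.shape)  # (8, 8, 8)
```

Replacing the scalar `alpha` with learned per-channel or per-position weights is one way such a module could "selectively" supplement the information difference between levels, as the paper describes.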
Changing the loss function
The ReDet algorithm uses CrossEntropy Loss [29] and SmoothL1 Loss [30] as the classification and regression losses in the Head, respectively. The selection of the loss function has a particular impact on the algorithm’s classification and regression accuracy. In this subsection, the classification and regression losses of the M-ReDet are modified as the Focal Loss and KFIoU Loss.
Focal Loss can deal with the long-tailed distribution problem in ship object detection and fine-grained recognition tasks by adjusting the balance factor and focus factor, which change the magnitude of the classification loss for each ship category. In the FAIR1M(ship) dataset, there are nine categories of ships: Dry Cargo Ship, Engineering Ship, Fishing Boat, Liquid Cargo Ship, Motorboat, Passenger Ship, Tugboat, W-ship, and other-ship; the number of ships in each category varies, as shown in Fig 5.
KFIoU Loss [31] is an approximation of SkewIoU that represents the overlapping region of rotated boxes via Kalman filtering; it essentially calculates an overlap rate to replace the IoU without introducing additional parameters and has a complete derivation and computation for non-overlapping scenarios, which is effective and improves accuracy in rotated object detection and fine-grained recognition. Focal Loss [32] and KFIoU Loss are calculated as shown below:
$$\mathrm{FL}(p_t) = -\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)$$

$$\mathrm{KFIoU} = \frac{S_o}{S_p + S_g - S_o}$$

where $p_t$ is the class label prediction probability, $\alpha_t$ is the balance factor, $\gamma$ is the focus factor, and $\alpha_t$ and $\gamma$ adjust the effect of different categories of samples on the classification loss; $S_p$, $S_g$, and $S_o$ are the areas of the prediction box, the ground-truth box, and their overlapping region, respectively. This subsection also investigates the effect of different combinations of loss functions on the accuracy of ship object detection and fine-grained recognition, and Table 2 shows the results.
As Table 2 shows, among the four loss function combinations, the mAP0.5 of the Focal Loss + KFIoU is 41.39%, which is 0.88%, 1.07% and 0.52% higher than the CrossEntropy Loss + SmoothL1 Loss, Focal Loss + SmoothL1 Loss, and CrossEntropy Loss + KFIoU combinations, respectively. Thus, it is suitable for training the M-ReDet algorithm.
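To make the role of the two Focal Loss factors concrete, here is a minimal pure-Python sketch of the binary focal loss defined above; the default values of `alpha` and `gamma` follow the original Focal Loss paper and are not the tuned settings of this work:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
    # p: predicted probability of the positive class; y: label (0 or 1).
    # alpha balances positive/negative classes; gamma down-weights easy examples.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy, well-classified positive contributes far less than a hard one,
# which is how the loss keeps abundant head-category ships from dominating:
easy = focal_loss(0.95, 1)
hard = focal_loss(0.30, 1)
print(easy < hard)  # True
```

With `gamma = 0` and `alpha = 1` the expression collapses back to the standard cross-entropy term, which is the baseline classification loss that ReDet uses.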
Experiment and result analysis
Experimental environment and parameter configuration
The experimental environment for the M-ReDet remote sensing ship object detection and fine-grained recognition algorithm designed in this paper is CUDA 11.1, an Intel i7-11700 CPU, and an NVIDIA GeForce RTX 3080Ti, with PyTorch as the deep learning framework. The comparison and ablation experiments use the DOTAv1.0 and FAIR1M(ship) datasets to evaluate the detection accuracy of M-ReDet and the effectiveness of each module. The experiments use SGD as the optimizer with lr = 0.0025, momentum = 0.9, weight_decay = 0.0001, warmup_iters = 500, and warmup_ratio = 1.0/3, and the learning rate strategy is linear. Training runs for 100 epochs overall; if the algorithm converges early, training ends early. Fig 6 shows the loss convergence curves of M-ReDet on the DOTAv1.0 and FAIR1M(ship) datasets. Table 3 shows the experimental environment, parameter configuration, and model resource consumption.
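The warmup parameters above determine a simple linear schedule. The sketch below shows how such a schedule is commonly implemented (e.g. in mmdetection-style training loops); whether the paper applies later step decay is not stated, so only the warmup phase is modeled here:

```python
def warmup_lr(iteration, base_lr=0.0025, warmup_iters=500, warmup_ratio=1.0 / 3):
    # Linear warmup: the learning rate ramps from base_lr * warmup_ratio
    # up to base_lr over the first warmup_iters iterations, then holds.
    if iteration >= warmup_iters:
        return base_lr
    frac = iteration / warmup_iters
    return base_lr * (warmup_ratio + (1.0 - warmup_ratio) * frac)

print(warmup_lr(0))    # starts at base_lr / 3
print(warmup_lr(500))  # reaches the full base_lr once warmup finishes
```

Starting at a third of the target rate keeps early gradient updates small while batch statistics and the newly added M-Bottleneck parameters settle, which is the usual motivation for warmup.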
Datasets
The DOTAv1.0 dataset has a total of 2806 images containing 15 types of remote sensing objects: plane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, large vehicle, small vehicle, helicopter, roundabout, soccer ball field, and swimming pool, with 188282 instances in total. The DOTAv1.0 dataset has 1411 training images, 458 validation images, and 937 test images. To facilitate training, this paper crops the remote sensing images in the DOTAv1.0 dataset into 1024 × 1024 patches, yielding 21046 remote sensing images in total. Fig 7 shows examples from the DOTAv1.0 dataset, and Fig 8 illustrates the number of instances of each category in the DOTAv1.0 dataset.
Fig 7 is attributed to the DOTA open-source database and is available from the DOTA database (URL (s): https://captain-whu.github.io/DOTA/dataset.html).
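The 1024 × 1024 cropping can be sketched as a sliding-window computation of crop origins. The 200-pixel overlap below is a common choice in DOTA preprocessing pipelines, not a value stated in this paper, and images smaller than the patch get a single origin at (0, 0) with padding assumed:

```python
def crop_windows(width, height, patch=1024, overlap=200):
    # Compute top-left origins of patch x patch tiles covering the image.
    # overlap keeps objects cut by one tile boundary whole in a neighbour.
    stride = patch - overlap
    xs = list(range(0, max(width - patch, 0) + 1, stride))
    ys = list(range(0, max(height - patch, 0) + 1, stride))
    # Make sure the right and bottom borders are fully covered.
    if xs[-1] + patch < width:
        xs.append(width - patch)
    if ys[-1] + patch < height:
        ys.append(height - patch)
    return [(x, y) for y in ys for x in xs]

wins = crop_windows(2048, 2048)
print(len(wins))  # number of 1024x1024 tiles for a 2048x2048 image
```

Each returned `(x, y)` pair is the origin of one tile; annotations whose boxes fall inside a tile are shifted into tile coordinates during the same pass.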
The FAIR1M(ship) dataset consists of all remote sensing images in the FAIR1M2.0 dataset that contain ship objects. It contains 13238 remote sensing ship images with 58982 instances across nine types of ship: Dry Cargo Ship, Engineering Ship, Fishing Boat, Liquid Cargo Ship, Motorboat, Passenger Ship, Tugboat, W-ship, and other-ship, with 17474, 3897, 9031, 3883, 15445, 1924, 1897, 1178, and 4253 ships per category, respectively. The FAIR1M(ship) dataset is rich in all kinds of ships and is well suited to research on remote sensing ship object detection and fine-grained recognition. Fig 9 shows examples from the FAIR1M(ship) dataset.
Fig 9 is attributed to the FAIR1M open-source database and is available from the FAIR1M database (URL (s): https://gaofen-challenge.com/benchmark).
Experimental evaluation indicators
We use Precision ($P$), Recall ($R$), Average Precision ($AP$), and mean Average Precision ($mAP$) as experimental evaluation indicators to validate the M-ReDet algorithm and the performance enhancement of each module in the comparison and ablation experiments. The formulas for $P$ and $R$ are as follows:

$$P = \frac{TP}{TP + FP},\qquad R = \frac{TP}{TP + FN}$$

where $TP$, $FP$, and $FN$ represent the number of true positives, false positives, and false negatives, respectively. Average Precision averages the precision over different recall rates; in general, the higher the model's average accuracy for a specific category of object detection, the larger the AP value. The formula is as follows:

$$AP = \int_{0}^{1} P(R)\,dR$$

where $R$ is Recall. To fully reflect the model's overall accuracy, mAP is the average AP over all object categories:

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

where $N$ is the number of object categories and $AP_i$ is the average precision of category $i$.
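The AP integral is computed in practice as the area under the discrete precision–recall curve built by sweeping detections in descending score order. A minimal pure-Python sketch (using rectangle integration; standard benchmarks add interpolation on top of this) with toy data:

```python
def average_precision(scored, num_gt):
    # scored: list of (confidence, is_true_positive) detections for one class;
    # num_gt: number of ground-truth boxes of that class.
    scored = sorted(scored, key=lambda s: -s[0])  # sweep by descending score
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for score, is_tp in scored:
        tp, fp = tp + is_tp, fp + (not is_tp)
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # rectangle under P(R)
        prev_recall = recall
    return ap

# Toy example: 3 detections for a class with 2 ground-truth boxes.
dets = [(0.9, True), (0.8, False), (0.7, True)]
ap = average_precision(dets, num_gt=2)
print(round(ap, 3))  # 0.833
```

mAP is then simply the mean of the per-category AP values, and mAP0.5 means a detection counts as a true positive when its (rotated) IoU with a ground-truth box exceeds 0.5.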
Analysis of experimental results
Comparison experiments on the DOTA dataset.
This subsection compares the M-ReDet algorithm with the commonly used rotated object detection algorithms such as the GWD, R3Det, RoI Transformer [33], Rotated Faster RCNN, Rotated RetinaNet, Rotated Reppoints [34], S2ANet [35], SASM, KFIoU, and ReDet to verify its effectiveness on the DOTA dataset. Fig 10 shows the detection results of the M-ReDet algorithm on the DOTA dataset.
Fig 10 is attributed to the DOTA open-source database and is available from the DOTA database (URL (s): https://captain-whu.github.io/DOTA/dataset.html).
The detection results show that M-ReDet can detect small, medium, and large remote sensing objects well. For simple remote sensing objects such as tennis courts and object-dense scenarios such as harbors, M-ReDet's detection accuracies are mostly above 90%, although some occluded and small objects still suffer false and missed detections. Table 4 shows the experimental results comparing M-ReDet with the above 10 algorithms.
M-ReDet's AP-max is similar to that of the other algorithms, but it has the highest AP-min, which verifies its effectiveness in enhancing the classification and detection accuracy of remote sensing objects with few samples. Table 5 shows the detection accuracy of M-ReDet for each class of remote sensing object on the DOTA dataset.
M-ReDet has the highest detection accuracy for the tennis court, with an AP of 90.8%, and remote sensing objects such as planes, ships, basketball courts, harbors, large vehicles, and helicopters also have high detection accuracies of around 90%. Since M-ReDet uses the M-Bottleneck to expand the receptive field, when facing remote sensing objects with large sizes and high aspect ratios, such as bridges, the algorithm can draw on contextual information, for example selectively using the road information on both sides of a bridge to make the detection judgment. As a result, there is a 4% accuracy improvement in AP-min (bridge). Fig 11 shows the training results of the above algorithms.
All 11 algorithms are trained from pre-trained models. As seen from Fig 11, the SASM and Rotated RepPoints algorithms converge relatively slowly compared to the others, and most algorithms converge to their optimal mAP0.5 in about 10 epochs. The M-ReDet algorithm converges after 10 epochs, and its final mAP0.5 stabilizes at about 82.09%. Its training curve also sits in the top left corner, indicating the best convergence speed and accuracy.
Comparison experiments on the FAIR1M(ship) dataset.
The M-ReDet algorithm performs well in dense ship distribution, sparse ship distribution, simple remote sensing background, and complex remote sensing background. Fig 12 shows its ship detection and fine-grained recognition results on the FAIR1M(ship) dataset. Since the appearance and aspect ratio of these nine types of ship objects are similar, it is necessary to extract the fine-grained features of each type of ship from the texture and semantic information in order to recognize the type of ship accurately. The M-ReDet algorithm first extracts the fine-grained features of the spatial location of the ship by using the SOPM and then extracts the different features between the different types of ships by using the SFRM.
Fig 12 is attributed to the FAIR1M open-source database and is available from the FAIR1M database (URL (s): https://gaofen-challenge.com/benchmark).
In the ship detection and fine-grained recognition comparison experiments, M-ReDet and the RoI Transformer, SASM, ReDet, R3Det, Faster RCNN, Rotated RetinaNet, GWD, S2ANet, KFIoU, and LSKNet [37] algorithms use the same experimental hyperparameters, and the remote sensing ship images used for training are cropped to 1024 × 1024 size. The mAP0.5 of M-ReDet is 43.29%, higher than the above algorithms by 7.85%, 13.71%, 2.78%, 13.46%, 12.11%, 18.61%, 13.57%, 9.59%, 7.7%, and 3.17%, respectively. M-ReDet also has the highest AP-max and AP-min, which indirectly reflects that the SOPM and SFRM modules enlarge the receptive field so that M-ReDet can extract more contextual information and thereby improve detection accuracy for ships of all sizes. Table 6 shows the detection results of the various algorithms on the FAIR1M(ship) dataset.
There are nine categories of ship objects in the FAIR1M(ship) dataset. The detection and fine-grained recognition accuracy of each category is roughly proportional to its number of instances; the exception is the "other-ship" category, which contains many ships without subdivided categories or with ambiguous category information, resulting in a detection and fine-grained recognition accuracy of only 15.9%. The M-ReDet algorithm has the highest detection accuracy for "Dry-Cargo-Ship", with an AP of 71.20%. The "Motorboat" category has 7921 instances, but because of its small object size, its overall AP of 64.7% is lower than that of "Dry-Cargo-Ship". Table 7 shows the training results for each of the nine ship categories.
Table 8 demonstrates the AP values of 9 types of ships detected by the 11 algorithms, respectively, to show more clearly the impact of the M-ReDet algorithm on the detection and fine-grained recognition of various types of ships. As seen from the table, the M-ReDet algorithm achieves the optimal AP in the detection results of 7 classes of ships. It is lower than the LSKNet and ReDet only in Passenger-Ship and W-ship detection results.
After 15 epochs of training, the mAP0.5 of the M-ReDet algorithm finally stabilizes near 43.29%. Because of the added M-Bottlenecks, stabilizing the module's internal parameters requires more training epochs, so its convergence is slightly slower than that of the other algorithms; however, it achieves the highest ship detection and fine-grained recognition accuracy. Fig 13 shows the training results of M-ReDet and the other 10 algorithms.
Ablation experiments
The M-ReDet mainly consists of the M-ReResNet50, the M-ReFPN, and the detection head, where the SFRM adds the M-Bottleneck and initializes it as configured in subsection 3.2. The SFRM and the FPN together constitute the M-ReFPN module. Table 9 shows the results of the ablation experiments of M-ReDet, which mainly investigate the effects of the SOPM, SFRM, KFIoU, and Focal Loss, used individually or in combination, on the mAP0.5 of the M-ReDet algorithm.
As Table 9 shows, using each module alone improves the mAP0.5 of the algorithm: the mAP0.5 values are 41.52%, 41.45%, and 41.39%, improvements of 1.01%, 0.94%, and 0.88% over the baseline model. Pairwise combinations of the improvements also raise the algorithm's mAP0.5. Finally, M-ReDet using SOPM, SFRM, KFIoU, and Focal Loss together achieves the optimal ship detection and fine-grained recognition accuracy, with an mAP0.5 of 43.29%, which is 2.78% higher than the baseline ReDet. Fig 14 shows the difference between ReDet and M-ReDet in ship detection and fine-grained recognition results on the FAIR1M(ship) dataset.
(a) Detection and fine-grained recognition results of ReDet; (b) Detection and fine-grained recognition results of M-ReDet. Fig 14 is attributed to the FAIR1M open-source database and is available from the FAIR1M database (URL (s): https://gaofen-challenge.com/benchmark).
From Fig 14, it can be seen that the SOPM module in M-ReDet expands the receptive field, enabling the algorithm to extract more contextual information, such as the sea surface and ports. With that information, M-ReDet can detect remote sensing ship objects that lie at image edges or are obscured; for example, in the first row of remote sensing images, M-ReDet successfully detects the incomplete ship objects at the top, right, and bottom left, respectively. Moreover, the SOPM and SFRM modules can selectively memorize the fine-grained features of remote sensing ship objects of different sizes, minimizing the loss of small-ship features during downsampling and improving the algorithm's ability to detect small ships; for example, in the first image of the second row, M-ReDet detects a tiny "other-ship" object in the center of the image, although its size is very small and its classification confidence is low. In addition, M-ReDet's enlarged receptive field avoids false detection of large structures such as bridges and dock berths; for example, in the second row, ReDet recognizes the dock berth in the second image and the bridge in the third image as ships.
Discussion
The main innovation of this paper is the design of the M-ReDet algorithm framework. Two improved modules, SOPM and SFRM, are proposed to form a new backbone network (M-ReResNet50) and a new neck (M-ReFPN). They enable the algorithm to selectively learn and retain fine-grained information about ship objects of different sizes and to fuse the memorized contextual information, improving the algorithm’s detection and fine-grained recognition ability for ships of different sizes. Several sets of comparison and ablation experiments show that the detection robustness of the proposed M-ReDet algorithm is optimal.
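The selective-memory idea that SOPM and SFRM borrow from Mamba can be illustrated with a minimal sketch. The code below is a simplified, hypothetical gated state-space recurrence in the spirit of Mamba’s selective scan, not the paper’s actual M-Bottleneck implementation; the function and parameter names (`selective_scan`, `w_a`, `w_b`, `w_c`) are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_scan(x, w_a, w_b, w_c):
    """Toy input-dependent gated recurrence over a feature sequence.

    x: (T, D) sequence of T feature vectors with D channels.
    w_a, w_b, w_c: (D,) per-channel gate/read-out weights.
    Because the keep and write gates depend on the current input,
    the state can retain or overwrite information token by token --
    the 'selective memory' that lets distinctive fine-grained ship
    features survive while background features are forgotten.
    """
    T, D = x.shape
    h = np.zeros(D)                  # memory state, one slot per channel
    y = np.empty_like(x)
    for t in range(T):
        a = sigmoid(x[t] * w_a)      # keep gate: how much old memory survives
        b = sigmoid(x[t] * w_b)      # write gate: how much new input is stored
        h = a * h + b * x[t]         # selective state update
        y[t] = h * w_c               # read-out of the memorized features
    return y
```

With an all-zero input the gates open halfway but nothing is ever written, so the output stays zero; a real Mamba block additionally discretizes a continuous state-space system and runs the scan in parallel, which this loop omits for clarity.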
Small-ship feature information is easily lost during downsampling, degrading detection and fine-grained recognition accuracy. The M-Bottleneck in the SOPM module extracts and memorizes the fine-grained features of small ships and passes them to subsequent layers for further feature extraction. In the SFRM module, feature maps from different layers exchange complementary selective information, so that information on ships of different sizes is better retained. The SOPM module also expands the receptive field and, together with the SFRM module, supplements the contextual information required by feature maps at different scales, reducing the probability of misdetecting similar ship objects. Optimizing the classification and regression losses is also essential for training. However, this paper only permutes and combines four commonly used loss functions to find the best loss scheme, without further comparison or analysis of more recent loss functions. Due to hardware memory limitations, this paper also did not further increase the number of M-Bottleneck configurations in the backbone network or optimize the M-Bottleneck module structure and training hyperparameters based on the experimental results. In future work, we will improve the structure and configuration of the M-Bottleneck described in subsection 3.2, continue to optimize the training loss function, and try to incorporate fine-grained feature information into the loss function for ship detection and fine-grained recognition to further improve the regression and classification accuracy of the algorithm.
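Of the losses compared, the classification loss is the standard Focal Loss of Lin et al. [32]. The snippet below is a minimal NumPy sketch of the binary form, included only to illustrate how the (1 − p_t)^γ factor down-weights easy samples so that training focuses on hard, easily confused ship categories; it is not the paper’s training code, and the defaults γ = 2, α = 0.25 are the values from the original Focal Loss paper.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted foreground probability in (0, 1).
    y: ground-truth label, 1 for foreground, 0 for background.
    gamma > 0 shrinks the loss of well-classified (easy) samples;
    alpha balances the foreground/background class frequencies.
    """
    p_t = np.where(y == 1, p, 1.0 - p)            # prob. of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    # Clip to avoid log(0) for numerically saturated predictions.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(np.clip(p_t, 1e-12, None))
```

Setting γ = 0 and α = 1 recovers plain cross-entropy, which makes the down-weighting effect easy to verify: a confident correct prediction contributes almost nothing, while a misclassified ship keeps a large gradient.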
In addition, this study can be extended with other remote sensing data in the future: multimodal remote sensing image fusion modules that combine visible light, SAR, and infrared data could enhance the algorithm’s ship detection capability in harsh weather conditions, such as cloudy and foggy scenes. Other fields, such as agricultural object detection [38–39], can also benefit from this work.
Conclusion
Remote sensing ship detection and fine-grained recognition face significant challenges due to high inter-class similarity in aspect ratios, ambiguous appearance features among vessel categories, arbitrary orientation variations, and multi-scale object characteristics. This paper proposes M-ReDet, a memory-augmented ship perception network with feature refinement mechanisms, to address these issues. The optimization of the loss function of M-ReDet further improves the classification and regression accuracy of the algorithm. Comparative and ablation experiments on the FAIR1M(ship) and DOTA datasets prove the effectiveness of the improved modules. Finally, in the remote sensing ship detection and fine-grained recognition task, M-ReDet achieves a mAP0.5 of 43.29%, validating the effectiveness of our algorithm in complex maritime scenarios.
Acknowledgments
The authors would like to thank the anonymous reviewers for their careful review of the paper and their kind suggestions, which improved the overall quality of the manuscript.
References
- 1. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A. Attention is all you need. Adv Neural Inform Process Syst. 2017;30.
- 2. Hu J, Cao A, Feng Z, Zhang S, Wang Y, Jia L, et al. Vision mamba mender. Adv Neural Inform Process Syst. 2024;37:51905–29.
- 3. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: 2016 IEEE Conference on computer vision and pattern recognition (CVPR). 2016. 779–88.
- 4. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C, et al. SSD: single shot multibox detector. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I. 2016. 21–37.
- 5. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–49. pmid:27295650
- 6. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. 770–8.
- 7. Yang X, Yan J, Ming Q, Wang W, Zhang X, Tian Q. Rethinking rotated object detection with Gaussian Wasserstein distance loss. In: International conference on machine learning. 2021. 11830–41.
- 8. Yang X, Yan J, Feng Z, He T. R3Det: refined single-stage detector with feature refinement for rotating object. In: Proceedings of the AAAI conference on artificial intelligence. 2021;35(4):3163–71.
- 9. Han J, Ding J, Xue N, Xia G. ReDet: a rotation-equivariant detector for aerial object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021. 2786–95.
- 10. Hou L, Lu K, Xue J, Li Y. Shape-adaptive selection and measurement for oriented object detection. In: Proceedings of the AAAI conference on artificial intelligence. 2022. 923–32.
- 11. Li Q, Chen Y, Zeng Y. Transformer with transfer CNN for remote-sensing-image object detection. Remote Sensing. 2022;14(4):984.
- 12. Liu W, Lin Y, Liu W, Yu Y, Li J. An attention-based multiscale transformer network for remote sensing image change detection. ISPRS J Photogram Remote Sens. 2023;202:599–609.
- 13. Yang M, Xu R, Yang C, Wu H, Wang A. Hybrid-DETR: a differentiated module-based model for object detection in remote sensing images. Electronics. 2024;13(24):5014.
- 14. Liu Y, Tian Y, Zhao Y, Yu H, Xie L, Wang Y. VMamba: visual state space model. Adv Neural Inform Process Syst. 2024;37:103031–63.
- 15. Zhan Y, Zeng Z, Liu H, Tan X, Tian Y. MambaSOD: dual mamba-driven cross-modal fusion network for RGB-D salient object detection. Neurocomputing. 2025;631:129718.
- 16. Verma T, Singh J, Bhartari Y, Jarwal R, Singh S, Singh S. SOAR: advancements in small body object detection for aerial imagery using state space models and programmable gradients. arXiv preprint. 2024.
- 17. Wang CY, Yeh IH, Liao HY. YOLOv9: learning what you want to learn using programmable gradient information. In: European conference on computer vision. 2024. 1–21.
- 18. Shi H, Yang W, Chen D, Wang M. ASG-YOLOv5: improved YOLOv5 unmanned aerial vehicle remote sensing aerial images scenario for small object detection based on attention and spatial gating. PLoS One. 2024;19(6):e0298698. pmid:38829850
- 19. Wu W, Liu H, Li L, Long Y, Wang X, Wang Z, et al. Application of local fully convolutional neural network combined with YOLO v5 algorithm in small target detection of remote sensing image. PLoS One. 2021;16(10):e0259283. pmid:34714878
- 20. Sun X, Wang P, Yan Z, Xu F, Wang R, Diao W, et al. FAIR1M: a benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J Photogram Remote Sens. 2022;184:116–30.
- 21. Xia GS, Bai X, Ding J, Zhu Z, Belongie S, Luo J. DOTA: a large-scale dataset for object detection in aerial images. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. 3974–83.
- 22. Han J, Ding J, Xue N, Xia G. ReDet: a rotation-equivariant detector for aerial object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021. 2786–95.
- 23. Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. 2117–25.
- 24. Ainslie J, Lee-Thorp J, De Jong M, Zemlyanskiy Y, Lebrón F, Sanghai S. GQA: training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint. 2023.
- 25. Meng F, Tang P, Tang X, Yao Z, Sun X, Zhang M. TransMLA: multi-head latent attention is all you need. arXiv preprint. 2025.
- 26. Yang A, Yu B, Li C, Liu D, Huang F, Huang H. Qwen2.5-1M technical report. arXiv. 2025.
- 27. Liu A, Feng B, Wang B, Wang B, Liu B, Zhao C. DeepSeek-V2: a strong, economical, and efficient mixture-of-experts language model. arXiv preprint. 2024.
- 28. Yang F, Jiang F, Li J, Lu L. MSTrans: multi-scale transformer for building extraction from HR remote sensing images. Electronics. 2024;13(23):4610.
- 29. Mao A, Mohri M, Zhong Y. Cross-entropy loss functions: theoretical analysis and applications. In: International conference on machine learning. 2023. 23803–28.
- 30. Liu C, Yu S, Yu M, Wei B, Li B, Li G. Adaptive smooth L1 loss: a better way to regress scene texts with extreme aspect ratios. In: 2021 IEEE symposium on computers and communications (ISCC). 2021. 1–7.
- 31. Yang X, Zhou Y, Zhang G, Yang J, Wang W, Yan J. The KFIoU loss for rotated object detection. arXiv preprint. 2022;2201.12558.
- 32. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE International conference on computer vision. 2017. 2980–8.
- 33. Ding J, Xue N, Long Y, Xia G, Lu Q. Learning RoI transformer for oriented object detection in aerial images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. 2849–58.
- 34. Yang Z, Liu S, Hu H, Wang L, Lin S. RepPoints: point set representation for object detection. In: Proceedings of the IEEE/CVF International conference on computer vision. 2019. 9657–66.
- 35. Han J, Ding J, Li J, Xia G-S. Align deep features for oriented object detection. IEEE Trans Geosci Remote Sensing. 2022;60:1–11.
- 36. Guan Q, Liu Y, Chen L, Li G, Li Y. A deformable split fusion method for object detection in high-resolution optical remote sensing image. Remote Sensing. 2024;16(23):4487.
- 37. Li Y, Li X, Dai Y, Hou Q, Liu L, Liu Y, et al. LSKNet: a foundation lightweight backbone for remote sensing. Int J Comput Vis. 2024;133(3):1410–31.
- 38. Lu D, Wang Y. MAR-YOLOv9: a multi-dataset object detection method for agricultural fields based on YOLOv9. PLoS One. 2024;19(10):e0307643. pmid:39471150
- 39. Shi M, Zheng D, Wu T, Zhang W, Fu R, Huang K. Small object detection algorithm incorporating swin transformer for tea buds. PLoS One. 2024;19(3):e0299902. pmid:38512917