Fig 1.
The network comprises three parts: the Backbone, which is responsible for feature extraction; the Neck, which aggregates and fuses multi-scale features; and the Head, which performs the final prediction.
Fig 2.
Compared with the baseline network, the backbone and neck are enhanced by incorporating GSConv and VoVGSCSP modules, while the C2f-RFAConv module is introduced to improve feature representation and information fusion efficiency.
Fig 3.
RFAConv combines receptive-field spatial features, which expand contextual perception, with group convolution, which improves computational efficiency and reduces parameter redundancy.
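The parameter saving attributed to group convolution can be checked with a short count. The following is a minimal sketch, assuming a bias-free convolution with a square kernel; the function name `conv_params` and the channel sizes are illustrative and are not taken from the model.

```python
def conv_params(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    """Weight count of a bias-free k x k convolution split into `groups` groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must both be divisible by groups")
    # Each output channel only connects to c_in / groups input channels.
    return c_out * (c_in // groups) * k * k


# Hypothetical layer sizes, chosen only to illustrate the ratio.
standard = conv_params(64, 64, 3)            # 64 * 64 * 9 = 36864
grouped = conv_params(64, 64, 3, groups=4)   # 64 * 16 * 9 = 9216
```

With all other sizes fixed, the weight count shrinks by a factor equal to the number of groups.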
Fig 4.
The CRFAConv module integrates a CBS block with RFAConv, enabling effective feature extraction through receptive-field spatial features and efficient convolution operations.
Fig 5.
C2f is a feature fusion module employed in the backbone and neck of YOLOv8 to enhance information flow and multi-scale feature representation.
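The information flow in C2f can be followed through its channel counts. The sketch below assumes the split-and-concatenate layout of the publicly available YOLOv8 C2f block (a 1x1 convolution, a two-way channel split, `n` bottlenecks whose outputs are all kept, and a final 1x1 convolution); the function name and the example sizes are hypothetical.

```python
def c2f_channels(c_in: int, c_out: int, n: int = 1, e: float = 0.5) -> dict:
    """Channel bookkeeping for a C2f-style block (assumed layout, not measured)."""
    c = int(c_out * e)  # hidden width of each branch
    return {
        "in": c_in,
        "after_first_conv": 2 * c,   # later split into two branches of c channels
        "per_branch": c,
        "concat": (2 + n) * c,       # two split branches plus n bottleneck outputs
        "out": c_out,                # restored by the final 1x1 convolution
    }


# Hypothetical example: 128 -> 128 channels with two bottlenecks.
info = c2f_channels(128, 128, n=2)
```

Because every bottleneck output is concatenated rather than only the last one, the fused tensor grows with `n` before the final convolution compresses it.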
Fig 6.
The original C2f module is enhanced by incorporating CRFAConv, aiming to strengthen feature representation while maintaining computational efficiency.
Fig 7.
GSConv consists of a depthwise convolution (DWConv) for lightweight spatial feature extraction and a channel shuffle operation to promote information exchange across channel groups.
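The channel shuffle step can be illustrated without any tensor library. This is a minimal sketch, assuming the common reshape-transpose-flatten form of channel shuffle; channels are represented by labels in a plain list, and the function name is illustrative.

```python
def channel_shuffle(channels: list, groups: int) -> list:
    """Interleave channels across groups: view as (groups, n), transpose, flatten."""
    if len(channels) % groups:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    # Walk position-by-position, taking the i-th channel of each group in turn.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Two groups of three channels each, e.g. the outputs of two branches.
mixed = channel_shuffle(["a0", "a1", "a2", "b0", "b1", "b2"], groups=2)
# mixed == ["a0", "b0", "a1", "b1", "a2", "b2"]
```

After the shuffle, neighbouring channels come from different groups, so a subsequent layer sees information from both.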
Fig 8.
The VoVGSCSP module incorporates GSConv to achieve efficient feature extraction.
Fig 9.
YOLOv8n detection head structure.
The detection head adopts a decoupled design to separately perform bounding box regression and classification on multi-scale feature maps.
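The decoupled, multi-scale design can be made concrete by listing the output shape of each branch at each scale. The sketch below is illustrative only: the 640-pixel input, the strides 8, 16 and 32, the `reg_max` value of 16 and the class count of 3 are assumptions based on common YOLOv8 defaults, not values stated in this document.

```python
def head_outputs(img_size: int, strides=(8, 16, 32),
                 num_classes: int = 1, reg_max: int = 16) -> list:
    """Per-scale output shapes of a decoupled, anchor-free head (assumed layout)."""
    shapes = []
    for s in strides:
        h = w = img_size // s  # feature-map size at this stride
        shapes.append({
            "stride": s,
            "cells": h * w,
            "box_branch": (4 * reg_max, h, w),  # regression branch output
            "cls_branch": (num_classes, h, w),  # classification branch output
        })
    return shapes


# Hypothetical setting: 640 x 640 input, 3 classes.
shapes = head_outputs(640, num_classes=3)
total = sum(s["cells"] for s in shapes)  # 6400 + 1600 + 400 = 8400
```

Each scale runs its own regression and classification branches, so the two tasks never share output channels.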
Fig 10.
The SCConv module is composed of a Spatial Reconstruction Unit (SRU) and a Channel Reconstruction Unit (CRU), which collaboratively enhance spatial and channel-wise feature representation.
Fig 11.
SCConv is embedded into the coupled detection head to refine shared features for both classification and localization through spatial and channel reconstruction.
Table 1.
Training configuration and experimental settings.
Table 2.
Comparison of RGS-YOLO with classic detection methods.
Table 3.
Results of the ablation experiments.
Table 4.
Ablation experiments on the public dataset.
Table 5.
Model comparison experiments on the public dataset.
Fig 12.
Each row represents a different input image, and each column corresponds to the detection results produced by a different model.