Abstract
With the rapid development of intelligent transportation systems, especially in traffic image detection tasks, the introduction of the transformer architecture has greatly advanced model performance. However, traditional transformer models incur high computational costs during training and deployment due to the quadratic complexity of their self-attention mechanism, which limits their application in resource-constrained environments. To overcome this limitation, this paper proposes a novel hybrid architecture, Mamba Hybrid Self-Attention Vision Transformers (MHS-VIT), which combines the advantages of the Mamba state-space model (SSM) and the transformer to improve the efficiency and accuracy of the model in processing traffic images. Mamba, as an SSM with linear time complexity, can effectively reduce the computational burden without sacrificing performance, while the self-attention mechanism of the transformer excels at capturing long-distance spatial dependencies in images, which is crucial for understanding complex traffic scenes. Experimental results showed that MHS-VIT exhibited excellent performance in traffic image detection tasks: whether in vehicle detection, pedestrian detection, or traffic sign recognition, the model could accurately and quickly identify target objects. Compared with backbone networks of the same scale, MHS-VIT achieved significant improvements in accuracy and model parameter quantity.
Citation: Zhang X, Ou W, Wu X, Zhang C (2025) MHS-VIT: Mamba hybrid self-attention vision transformers for traffic image detection. PLoS One 20(6): e0325962. https://doi.org/10.1371/journal.pone.0325962
Editor: Hirenkumar Kantilal Mewada, Prince Mohammad Bin Fahd University, SAUDI ARABIA
Received: January 26, 2025; Accepted: May 22, 2025; Published: June 30, 2025
Copyright: © 2025 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data used in this study are publicly available on the Kaggle platform. The datasets include: (1) Traffic Lights Data: https://www.kaggle.com/datasets?search=traffic+lights, (2) Traffic Data: https://www.kaggle.com/datasets?search=traffic. The datasets comprise various image sets used for model training, validation, and evaluation. All relevant data are available without restrictions, ensuring reproducibility of the study.
Funding: This work was supported by the Foundation of the Micro-Nano and Intelligent Manufacturing Engineering Research Centre of the Ministry of Education (No. 2024WZG03), the Foundation Research Project of Kaili University (No. 2024YB010), and the Integration Research Project of Kaili University (No. YTH-XM2025002). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
With the acceleration of urbanization and the vigorous development of the automotive industry, the complexity and challenges of the transportation system, as the lifeline of urban operation, are becoming increasingly prominent. However, traditional traffic image detection methods often rely on manually designed features and rules, which are difficult to adapt to complex and constantly changing traffic environments and demands [1]. For complex traffic images, continuous tracking of object trajectories can be achieved through Mamba's time-series modeling, while adding self-attention enables real-time capture of relative position changes between vehicles and improves the response speed to emergencies. For multi-scale traffic images, local textures can be extracted by self-attention at the low levels, while Mamba integrates global semantics at the high levels by adjusting the state transition matrix, prioritizing key objects and suppressing background noise. When an image is affected by complex illumination, Mamba's continuous state transfer smooths local illumination changes and reduces noise sensitivity, and the self-attention module can dynamically strengthen illumination-invariant features and weaken illumination-dependent pixels. When high-frame-rate video streams must be processed with low latency, Mamba's parallel scanning algorithm makes full use of GPU parallelism and achieves higher throughput than other methods. Therefore, Mamba hybrid self-attention vision transformers play an indispensable role in traffic image analysis.
In recent years, the rapid development of deep learning technology, especially its widespread application in the field of computer vision, has yielded unprecedented visual understanding and processing capabilities. Models have evolved from the classic convolutional neural network (CNN) [2–4] to the revolutionary transformer [5–8] and then to the innovative Mamba architecture [9–12]. However, a limitation of CNNs is their restricted receptive field, where each neuron can only perceive a small area of the input image. In the field of object detection, transformers [13–15] have successfully solved the problem of the limited CNN receptive field with their global modeling ability. The detection transformer (DETR) [16–19] series of models use the transformer's self-attention mechanism to transform object detection into an end-to-end sequence prediction problem, thereby achieving accurate localization and classification of objects in images. However, transformer models also face challenges such as high computational complexity and high memory consumption. The recently developed Mamba model introduces a selective state-space model (SSM) that combines the recurrent framework of recurrent neural networks (RNNs) [20–22] with the parallel computing capability of transformers while maintaining the linear characteristics of the SSM [23]. Mamba can selectively propagate or forget information along the sequence length dimension based on the current token, thereby improving its modeling ability for discrete modalities. Mamba achieves efficient computational performance through hardware-aware parallel algorithms, scales linearly with the sequence length, and is capable of processing sequences up to millions of elements in length.
In this article, we propose a traffic image detection model called Mamba Hybrid Self-Attention Vision Transformers (MHS-VIT), which combines the advantages of the Mamba model and vision transformers (ViT). The overall architecture of the model adopts a layered design, aiming to achieve accurate detection of targets in traffic images through multi-stage feature extraction and global context capture. In the field of deep learning, especially when processing image data, SSMs are mainly used to capture dependencies in time series or text sequences, but they lack effective modeling capabilities for the spatial structures and multi-channel features of image data. To compensate for this deficiency, we innovatively propose SVT Block and DLS Block components to provide better adaptability and improve the performance of image processing. The SVT Block can focus on each pixel in the input image, calculate their correlation, capture long-distance spatial dependencies in the image, and enhance the model’s ability to capture global information. The DLS Block uses dot multiplication combined with high-dimensional expressions to enhance the correlation between channels, which can enhance the interactions between channels by adjusting the eigenvalues on each channel without changing the dimensionality of the data space. MHS-VIT aims to build a new backbone network that combines the advantages of Mamba and vision transformers. This architecture, which uses a state-space transformation model, was applied for traffic image detection, effectively capturing global dependencies and using the strength of local convolution to improve the detection accuracy and model understanding of complex scenes. In addition, plug-and-play SVT blocks and DLS blocks can be integrated into deep neural networks to achieve effective image detection. The main contributions of this article can be summarized as follows:
- We propose the SSM-based MHS-VIT model to establish a new baseline for object detection in traffic images, laying a solid foundation for future development of more efficient image detection based on SSM.
- We propose the SVT Block, which can capture global information and long-range dependencies between pixels in images to compensate for the local modeling ability of SSM. Combined with residual connections, the input information is directly propagated across layers, simplifying the learning task of the network and making the learning of deep networks more efficient and stable.
- We propose the DLS Block to increase the complexity of the data in the spatial dimension and improve the model’s ability to capture complex features by optimizing the information flow and interactions between channels.
- A traffic image dataset was used for comprehensive experiments. The results showed that our MHS-VIT model has strong competitiveness compared with state-of-the-art CNN and transformer-based methods.
Related work
CNNs perform well in image feature extraction, while transformers achieve global context awareness through the self-attention mechanism. Mamba improves the processing of long sequence data through selective mechanisms and hardware-aware algorithms. Building on these lines of work, we propose a new method to further improve the performance and efficiency of traffic image detection.
Convolutional neural network-based image detection
CNNs are a key technology in the field of computer vision. The powerful feature extraction ability of CNNs is used to achieve automatic recognition and localization of target objects in images. R-CNN [24] first uses a selective search algorithm to generate candidate regions, then uses a CNN to extract features from each candidate region, and finally uses a support vector machine classifier and bounding box regressor for object classification and localization. However, R-CNN suffers from a high computational complexity and slow detection speed, as it performs independent CNN feature extraction on each candidate region. To improve the speed of R-CNN, the Fast R-CNN model [25] uses a region-of-interest pooling layer, which allows the entire image to pass through the CNN only once and then extracts the features of the candidate regions on the feature map. This improvement significantly improves the detection speed and maintains a high detection accuracy. Faster R-CNN [26] further introduces a region proposal network (RPN) to generate high-quality candidate regions, thereby further improving the detection speed and accuracy. The RPN and the detection network share convolutional features, achieving end-to-end training.
YOLOv1 [27], as the pioneering work of the YOLO series, detects all objects in the image through a single forward propagation, divides the image into grids, and predicts bounding boxes and categories for each grid. YOLOv2 [28] introduces anchor boxes and batch normalization to improve the detection accuracy and speed for small targets. YOLOv3 [29] adopts multi-scale feature extraction, improves the network structure, uses Darknet-53 as the backbone network, and introduces a feature pyramid network (FPN) to further enhance the detection performance. YOLOv4 [30] improves the detection accuracy by introducing various technologies while maintaining its speed advantage, but the introduction of these technologies correspondingly increases the demand for computing resources. YOLOv5 and subsequent versions [31–35] continue to seek a balance between speed and accuracy, introducing new network architectures [36] and optimization strategies to meet the needs of different application scenarios [37]. A CNN can automatically learn and extract effective feature representations from raw images, avoiding the tedious process of manually designing features in traditional methods. It can achieve certain robustness to image translation, rotation, scaling, and other transformations and can adapt to object detection tasks in different scenarios. However, CNNs mainly focus on local features of images, and they may not perform well in tasks that require global information. When recognizing objects, a CNN may not fully consider the relative spatial direction and hierarchical structure between features, leading to misjudgments in some cases.
Transformer-based image detection
Transformer-based object detection has been an important research direction in the field of computer vision in recent years. The powerful global context modeling and parallel computing capabilities of transformer models have introduced new ideas for object detection tasks. DETR [38] is an object detection model based on transformers, which achieves end-to-end training through the encoder–decoder architecture of transformers. The key to the DETR model lies in its unique training strategy and object matching mechanism. It optimizes the loss function through a bipartite graph matching algorithm to improve the accuracy and robustness of detection. The deformable DETR [39] introduces a deformable attention mechanism to reduce the complexity of self-attention computations. Through a multi-scale deformable attention module, it achieves accurate detection of objects of different scales in images. RT-DETR [16] includes an efficient hybrid encoder that decouples an attention-based intrascale feature interaction (AIFI) module and a CNN-based cross-scale feature fusion module (CCFM) to efficiently process multi-scale features, reduce the computational costs, and enable the model to achieve real-time detection while maintaining high accuracy. However, transformer models have numerous parameters and complex structures, requiring large-scale datasets and powerful computing resources for training and optimization. Small target images occupy fewer pixels and provide limited feature information, resulting in certain limitations of transformer-based models in small target detection.
Mamba-based image detection
Recently, the SSM [23,40,41] has played an important role in the research of artificial intelligence. Its unique modeling ability and efficient computational characteristics make it an important tool for processing complex dynamic systems, time series data, and long series data. In this context, Mamba [9], as an innovative architecture based on SSM, has attracted considerable attention for its linear complexity and efficient solution to the computational efficiency problems faced by transformers when processing long sequences. Vision Mamba [11] is a pure visual backbone model design based on an SSM, which introduces a cross-scan module (CSM) to achieve selective scanning of two-dimensional images, thereby enhancing the model’s ability to process image data. Specifically, the cross-scanning module can scan images in different directions, extract richer feature information, and model and analyze them through the SSM. This design enables Vision Mamba to effectively capture global contextual information and local detail features in images without the need for additional self-attention mechanisms. VMamba [10] introduces a selective scanning mechanism to reduce the quadratic complexity of traditional attention computations to linear while retaining the advantage of a global receptive field. Additionally, a CSM is introduced to solve the problem of directional sensitivity in two-dimensional image scanning and achieve efficient extraction of multi-scale features. Applying Mamba to object detection tasks can effectively improve the detection performance. Through its bidirectional scanning mechanism and global modeling capability, Mamba can more accurately capture target features in images and reduce the interference of background noise. This helps to improve the accuracy and robustness of object detection, enabling the model to maintain a good performance even in complex scenarios. 
Mamba provides strong support for object detection tasks with its efficient modeling ability and computational efficiency [42]. In the future, with the continuous development of technology and the expansion of application scenarios, Mamba’s application prospects in the field of object detection will be even broader.
Method
Preliminaries
The self-attention mechanism of the standard transformer can be formalized as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q, K, V \in \mathbb{R}^{n \times d}$ denote the query, key, and value matrices, $n$ is the sequence length, and $d_k$ is the key dimension. The computational complexity is $O(n^2 d)$. When the sequence length $n$ is large, this quadratic complexity leads to heavy consumption of computing resources and low computational efficiency.
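The quadratic cost can be seen directly in a minimal NumPy sketch (illustrative only; function and variable names are our own): the attention score matrix has $n \times n$ entries, so memory and compute scale quadratically with the sequence length.

```python
import numpy as np

def self_attention(Q, K, V):
    """Naive self-attention: the n-by-n score matrix is the source of the
    quadratic cost discussed above."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n): O(n^2 * d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d)

n, d = 16, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = self_attention(Q, K, V)
print(out.shape)  # (16, 8)
```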
In this section, we briefly introduce the fundamentals of SSMs [23]. The SSM-based structured state-space sequence model S4 and Mamba are derived from a continuous system that maps a univariate input sequence $x(t) \in \mathbb{R}$ to an output sequence $y(t) \in \mathbb{R}$ via an implicit latent intermediate state $h(t) \in \mathbb{R}^{N}$. The mathematical definition of the system can be summarized as

$$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t), \qquad y(t) = \mathbf{C}h(t),$$

where $\mathbf{A} \in \mathbb{R}^{N \times N}$ represents the state matrix, and $\mathbf{B} \in \mathbb{R}^{N \times 1}$ and $\mathbf{C} \in \mathbb{R}^{1 \times N}$ represent the projection parameters. The computational complexity is $O(n)$.
The Mamba model introduces a key time-scale parameter $\Delta$ when discretizing the continuous SSM to adapt it to deep learning scenarios. The parameter $\Delta$ represents the time step of the discretization and determines how finely the continuous dynamics are approximated as a discrete sequence. To achieve the discretization, Mamba uses a zero-order hold to convert $\mathbf{A}$ and $\mathbf{B}$ into the discrete parameters $\overline{\mathbf{A}}$ and $\overline{\mathbf{B}}$, respectively, using the fixed discretization rules

$$\overline{\mathbf{A}} = \exp(\Delta \mathbf{A}), \qquad \overline{\mathbf{B}} = (\Delta \mathbf{A})^{-1}\left(\exp(\Delta \mathbf{A}) - \mathbf{I}\right)\Delta \mathbf{B},$$

where $\overline{\mathbf{A}}$ and $\overline{\mathbf{B}}$ represent the discrete counterparts of the continuous parameters over the given time interval, and $\mathbf{I}$ represents the identity matrix.
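A minimal sketch of the zero-order-hold discretization and the resulting linear-time recurrence (assuming a diagonal state matrix, as S4 and Mamba use in practice, so the matrix exponential reduces to an elementwise one; all names and values are our own):

```python
import numpy as np

def zoh_discretize(A_diag, B, delta):
    """Zero-order hold for a diagonal state matrix:
    A_bar = exp(delta*A), B_bar = (delta*A)^-1 (exp(delta*A) - I) delta*B."""
    dA = delta * A_diag
    A_bar = np.exp(dA)
    B_bar = (A_bar - 1.0) / dA * delta * B  # elementwise since A is diagonal
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x):
    """Linear-time recurrence: h_k = A_bar*h_{k-1} + B_bar*x_k, y_k = C.h_k."""
    h = np.zeros_like(A_bar)
    ys = []
    for x_k in x:
        h = A_bar * h + B_bar * x_k
        ys.append(C @ h)
    return np.array(ys)

N = 4                                          # state size (illustrative)
A_diag = -np.arange(1, N + 1, dtype=float)     # stable diagonal state matrix
B = np.ones(N); C = np.ones(N) / N; delta = 0.1
A_bar, B_bar = zoh_discretize(A_diag, B, delta)
y = ssm_scan(A_bar, B_bar, C, np.sin(np.linspace(0, 3, 50)))
print(y.shape)  # (50,)
```

In Mamba, the sequential loop above is replaced by a hardware-aware parallel scan, but the per-step cost stays constant, which is the source of the linear scaling with sequence length.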
Mamba’s linear complexity makes it superior to transformers in computational efficiency and memory usage, especially when dealing with long sequences of traffic scene data. At the same time, through the state-space model, Mamba can effectively capture the spatio-temporal correlations of traffic data and improve the model’s detection ability in traffic scenes.
Overall architecture
In this section, we mainly introduce the structure of MHS-VIT, as shown in Fig 1. The model consists of the MHS-VIT Backbone, PAFPN, and Detect Head. The MHS-VIT Backbone is composed of Conv Block and Downsample Block components for feature extraction, while the PAFPN is mainly composed of VISSBlock, Concatenate, and Upsample components. Multi-scale feature fusion helps the model better capture the details and contextual relationships of the target, thereby improving the accuracy of detection. The Detect Head component is responsible for predicting the bounding boxes, categories, and confidence levels of targets in images.
Focus block
Focus Block, as shown in Fig 2, cuts the input tensor along specific dimensions to form multiple smaller tensor fragments, which are then rearranged and stacked together to form tensors with richer information. By changing the arrangement of features, additional spatial- or channel-level information interactions are introduced, which can enhance the model’s perception ability and make it easier to capture key feature information, thereby improving the accuracy and robustness of detection.
Focus Block provides a new perspective for the learning process of the model by changing the presentation of data, which helps the model learn effective feature representations more easily during training and accelerates the convergence process. As the Focus Block module can enhance the model’s feature extraction ability, it can improve the model’s generalization ability to a certain extent, enabling the model to perform better when facing new and unseen data.
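The slicing-and-stacking operation described above can be sketched in a few lines of NumPy (a space-to-depth-style rearrangement; the exact fragment layout in MHS-VIT may differ):

```python
import numpy as np

def focus_slice(x):
    """Focus-style rearrangement: slice an (H, W, C) tensor into four
    spatially interleaved fragments and stack them on the channel axis,
    halving the spatial resolution while quadrupling the channels.
    No pixel values are lost, only rearranged."""
    return np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1,
    )

x = np.arange(4 * 4 * 3).reshape(4, 4, 3)
y = focus_slice(x)
print(x.shape, "->", y.shape)  # (4, 4, 3) -> (2, 2, 12)
```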
VISSBlock
As shown in Fig 3, VISSBlock is the core of MHS-VIT, playing a key role in improving the model’s feature extraction capability and maintaining a high processing efficiency and throughput. VISSBlock introduces the 2D-Selective-Scan for Vision Data (SS2D) model using a four-way scanning strategy, scanning simultaneously from all four corners of the feature map to ensure that each element in the feature can integrate information from all other positions in different directions, achieving efficient extraction of multi-scale features. This not only preserves the spatial structure information of the image but also enhances the model’s ability to capture image details by scanning in different directions. The SS2D module also merges multiple one-dimensional vectors into a two-dimensional feature output through a scanning merging step, further improving the compactness and effectiveness of the feature representation.
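A toy NumPy sketch of the four-direction scan-and-merge idea (the scan expand and merge steps only; the actual SS2D applies a selective SSM to each sequence between these steps, and the exact scan orders are our assumption):

```python
import numpy as np

def four_way_scan(x):
    """Flatten an (H, W, C) feature map along four scan directions:
    row-major, column-major, and both of their reverses."""
    h, w, c = x.shape
    lr = x.reshape(h * w, c)                     # row-major, left-to-right
    tb = x.transpose(1, 0, 2).reshape(h * w, c)  # column-major, top-to-bottom
    return [lr, lr[::-1], tb, tb[::-1]]          # plus both reversed orders

def merge_scans(seqs, h, w):
    """Scan-merge step: undo each ordering and sum the four results
    back into a single (H, W, C) map."""
    c = seqs[0].shape[-1]
    lr = seqs[0].reshape(h, w, c)
    rl = seqs[1][::-1].reshape(h, w, c)
    tb = seqs[2].reshape(w, h, c).transpose(1, 0, 2)
    bt = seqs[3][::-1].reshape(w, h, c).transpose(1, 0, 2)
    return lr + rl + tb + bt

x = np.random.default_rng(0).normal(size=(4, 6, 8))
merged = merge_scans(four_way_scan(x), 4, 6)
print(np.allclose(merged, 4 * x))  # True: with no per-sequence processing,
                                   # the merge recovers 4x the input
```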
An SVT Block component based on convolutional additive self-attention is added [43]. Q and K are transformed by channel-attention and spatial-attention mappings to construct an additive similarity function that computes the similarity between queries and keys. This similarity function is based on sigmoid-activated outputs and is implemented through convolution operations, preserving the original feature dimension while avoiding information loss. Finally, the result of the similarity function is weighted and summed with V to obtain the final output, which contains global context information and has a low computational complexity.
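A highly simplified sketch of the additive-similarity idea (elementwise only; the actual SVT Block uses learned convolutional channel- and spatial-attention mappings rather than the raw sum used here). The key point is that no $n \times n$ score matrix is formed, so the cost stays linear in the number of tokens:

```python
import numpy as np

def additive_attention(Q, K, V):
    """Convolutional-additive-style attention sketch: queries and keys are
    combined additively and passed through a sigmoid, then used to weight
    the values elementwise. Shapes are preserved; cost is O(n*d)."""
    sim = 1.0 / (1.0 + np.exp(-(Q + K)))  # additive similarity, (n, d)
    return sim * V                        # weighted values, (n, d)

n, d = 16, 8
rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = additive_attention(Q, K, V)
print(out.shape)  # (16, 8)
```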
The input feature information of the DLS Block is subjected to convolution and LS Block for feature extraction. The LS Block consists of a depthwise separable convolution, residual module, and activation function. The output feature information is effectively enhanced by combining dot multiplication and a high-dimensional expression, which can effectively enhance the correlation between channels without changing the dimension of the data space.
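As a hypothetical illustration of a dot-multiplication channel interaction that preserves the data dimensionality (the mixing weights and gating form are our assumptions, not the paper's exact DLS design):

```python
import numpy as np

def channel_gate(x, w):
    """Hypothetical channel-interaction sketch: a sigmoid gate computed
    from a learned channel mix re-weights each channel by elementwise
    (dot) multiplication, leaving the output shape unchanged."""
    gate = 1.0 / (1.0 + np.exp(-(w @ x)))  # sigmoid of a channel mix
    return x * gate                        # elementwise multiplication

rng = np.random.default_rng(1)
x = rng.normal(size=8)          # (C,) channel descriptors
w = rng.normal(size=(8, 8))     # assumed learned mixing weights
y = channel_gate(x, w)
print(y.shape)  # (8,)
```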
LSDetect
As shown in Fig 4, LSDetect is the detection head of MHS-VIT, consisting of convolution, GroupNorm, and an activation function. Each detection head passes through a common module and a shared module: the common module has separate parameters for each of the three detection heads during feature extraction, while the shared module uses the same parameters across all three. Using GroupNorm improves the localization and classification performance of the detection head. By using shared convolution, the number of parameters is significantly reduced and the model becomes lightweight. To address the inconsistent target scales detected by each head when using shared convolution, a Scale layer is used to scale the features.
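The role of the Scale layer can be illustrated with a minimal sketch (names and initial values are our own; in a real framework the scalar would be a learnable parameter):

```python
import numpy as np

class ScaleLayer:
    """Minimal Scale-layer sketch: when detection heads share convolution
    weights, a single per-head scalar rescales the features to compensate
    for the different target scales each head handles."""
    def __init__(self, init=1.0):
        self.scale = init  # learnable in a real framework

    def __call__(self, x):
        return self.scale * x

heads = [ScaleLayer(s) for s in (0.5, 1.0, 2.0)]  # one scalar per head
feat = np.ones((4, 4))
outs = [h(feat) for h in heads]
print([float(o[0, 0]) for o in outs])  # [0.5, 1.0, 2.0]
```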
Experiments
We conducted comprehensive experiments on traffic image detection tasks. Specifically, we evaluated the performance of MHS-VIT in image detection tasks on signal light datasets and traffic datasets. For the experimental environment, we selected Ubuntu 20.04, which is stable and widely supported, as the operating system to ensure compatibility and repeatability. To fully exploit the computational potential of the model, we employed an NVIDIA RTX 4090D GPU, whose computing power combined with CUDA 11.8 provides a solid hardware foundation for processing high-resolution, high-complexity traffic image data. Before model training, the width and height of all input images were unified to 640. The hyperparameters were set to 100 epochs and a batch size of 8, using SGD for optimization with an initial learning rate of 1e-2 and a final learning rate of 1e-4. The warm-up period was 3 epochs, the learning rate was adjusted using a cosine annealing schedule, and the model weights were saved only when the validation set reached a new optimal performance.
Datasets
This study used the Traffic Road Object Dataset (TROD), which consists of 6633 images covering various traffic scenarios, including but not limited to urban roads, highways, rural roads, and complex intersections. Each image underwent strict screening and annotation to ensure the accuracy and reliability of the data. The objects in the dataset are divided into five categories: cycle, bus, car, motorbike, and person. Considering the dynamic changes in the transportation environment, the TROD aims to include images from different periods and weather conditions to simulate complex real-world situations. The dataset can be further expanded, and researchers can add new images and annotation information as needed to meet different research needs.
With the rapid development of autonomous driving technology, the detection and recognition of traffic signals have become an important component of the perception systems of autonomous vehicles. Accurately and quickly identifying the status of traffic signals is crucial for vehicles to make correct driving decisions. This study used the Traffic Signal Light Dataset (TSLD), which consists of 4564 images with four categories: red, green, off, and yellow. The TSLD contains various types of traffic signal light images, including images of different shapes, colors, and lighting conditions, which help to train more robust recognition models. By training recognition models based on the TSLD, autonomous vehicles can accurately recognize the status of traffic signals and make corresponding driving decisions according to traffic rules.
Baseline methods
To verify the effectiveness of the proposed method, we chose Mamba YOLO as our baseline model. MHS-VIT has the same variants as Mamba YOLO, namely T/B/L, and all models were trained under the same hyperparameter configuration, with the same number of training iterations and batch size. MHS-VIT was directly compared with the other benchmark methods (YOLOv5 [32], YOLOv6 [34], YOLOv8 [33], YOLOv10 [35], and Mamba YOLO [42]) using the evaluation metrics to determine the performance of the model.
In traffic image detection, the selection of evaluation indicators not only reflects the accuracy of model detection but also reveals the performance of the model in terms of detail capture, robustness, and sensitivity to outliers. When evaluating the performance of MHS-VIT and the baseline methods, to measure the detection performance, we chose parameters including the accuracy, recall, and mean average precision (mAP), with higher values indicating a better performance.
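For reference, the metrics above can be computed as follows (a minimal sketch; real mAP implementations additionally match predictions to ground truth by IoU and interpolate the precision-recall curve):

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

def average_precision(recalls, precisions):
    """Trapezoidal area under an (already monotonic) precision-recall curve.
    mAP is the mean of this value over classes: at IoU 0.5 for mAP50, and
    averaged over IoU thresholds 0.5:0.95 for mAP50-95."""
    widths = np.diff(recalls)
    heights = (precisions[:-1] + precisions[1:]) / 2.0
    return float(np.sum(widths * heights))

p, r = precision_recall(tp=80, fp=20, fn=10)
ap = average_precision(np.array([0.0, 0.5, 1.0]), np.array([1.0, 1.0, 0.5]))
print(round(p, 3), round(r, 3), ap)  # 0.8 0.889 0.875
```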
Ablation study on MHS-VIT
We independently examined each module in VISSBlock to evaluate its effect on traffic image detection and conducted ablation experiments on the TROD using MHS-VIT. Table 1 shows the effect of the SVT Block, DLS Block, and LS Block components on the image detection process; YES indicates that the module was used, and NO indicates that it was not. The ablation experiments were evaluated using the precision, recall, and mAP, where higher values indicate a better detection effect. The comparison of the MHS-VIT ablation results shows that model 6 (ours) achieved nearly the best precision, recall, mAP50, and mAP50-95.
We also evaluated each module in the DLS Block (Fig 5) through ablation experiments on the TROD dataset. Table 2 shows the impact of the 3Conv-Block, 3ConvDw-Block, 3ConvDwMu-Block, 3ConvDwMMu-Block, and DLS Block (ours) on traffic image detection. The ablation experiments were evaluated by the precision, recall, and mAP, where higher values indicate a better detection effect. The comparison of the DLS Block ablation results shows that the DLS Block (ours) achieved the best precision, recall, mAP50, and mAP50-95.
Quantitative comparisons with state-of-the-art image detection methods
We compared the results of MHS-VIT and the other image detection models on the TROD based on various evaluation metrics, including the precision, recall, mAP, floating point operations (FLOPs) (G), and parameters (M). The performance of MHS-VIT on the TROD was superior to almost all of the reference models. Fig 6 shows the comparison of detection results for the different models on the TROD dataset.
As shown in Table 3, our MHS-VIT achieved state-of-the-art performances at various model scales. We first compared MHS-VIT with the baseline model, Mamba YOLO. With the T/B/L variants, our MHS-VIT achieved 0.7%/0.3%/0.6% improvements in mAP50, respectively, with improvements of 0.9%/0.6%/1.6% in mAP50-95, parameter reductions of 1.1%/7%/4.5%, and computational reductions of 9.1%/5%/2.3%, respectively. Compared with other YOLOs, MHS-VIT also demonstrated a superior trade-off between the accuracy and the computational cost. Compared with YOLOv5-N/YOLOv6-N/YOLOv8-N/YOLOv10-N, MHS-VIT-T improved the mAP50 by 7.7%/5.5%/4.4%/5% and the mAP50-95 by 10.6%/7.3%/6.4%/8.4%, respectively. Compared with YOLOv5-M/YOLOv6-M/YOLOv8-M/YOLOv10-M, MHS-VIT-B achieved mAP50 improvements of 0.2%/2%/0.4%/2.5%, respectively; mAP50-95 improvements of 0.5%/2.4%/1.2%/5.2%, respectively; parameter reductions of 8.8%/56%/11.7%/-48.8%, respectively; and computation reductions of 22.3%/69%/36.8%/15.5%. Compared with YOLOv5-L/YOLOv6-L/YOLOv8-L/YOLOv10-L, MHS-VIT-L achieved 1%/2.3%/0.4%/2.4% improvements in mAP50 and improvements of 1.7%/3%/1.8%/7.2% in mAP50-95, respectively. These comparison results indicate that the proposed model provided significant improvements in MHS-VIT at different scales compared with existing state-of-the-art methods. Fig 7 compares the performance of the different models on the TROD dataset: the left panel shows that MHS-VIT (ours) performs best for the same number of parameters, and the right panel shows that MHS-VIT (ours) performs best for the same FLOPs.
Table 4 shows the comparison of the results between MHS-VIT and the other image detection models on the TSLD. Our MHS-VIT achieved state-of-the-art performances at various model scales. We first compared MHS-VIT with the baseline model, Mamba YOLO. With the T/B/L variants, our MHS-VIT achieved 0.8%/4%/4.3% improvements in mAP50, 0.2%/2%/2.9% improvements in mAP50-95, 1.1%/7%/4.5% reductions in the parameters, and 9.1%/5%/2.3% reductions in the computations, respectively. Compared to other YOLOs, MHS-VIT also demonstrated a superior trade-off between the accuracy and the computational cost. Compared with YOLOv5-N/YOLOv6-N/YOLOv8-N/YOLOv10-N, MHS-VIT-T improved the mAP50 by 2.9%/6.5%/1.1%/0.4% and the mAP50-95 by 4%/6.4%/3.6%/3.1%, respectively. Compared with YOLOv5-M/YOLOv6-M/YOLOv8-M/YOLOv10-M, MHS-VIT-B achieved mAP50 improvements of 3%/7.6%/2%/2.3%, mAP50-95 improvements of 1.7%/6.6%/3%/0.4%, parameter reductions of 8.8%/56%/11.7%/-48%, and computation reductions of 22.3%/69%/36.8%/15.5%, respectively. These comparison results indicate that the proposed model provided significant improvements over existing state-of-the-art methods at different scales of MHS-VIT. Fig 8 shows the comparison of detection results for the different models on the TSLD dataset. Fig 9 compares the performance of the different models on the TSLD dataset: the left panel shows that MHS-VIT (ours) performs best for the same number of parameters, and the right panel shows that MHS-VIT (ours) performs best for the same FLOPs.
Conclusions
This article proposed a traffic image detection model called MHS-VIT. To address the difficulty of CNNs in capturing long-distance dependencies in global image information and the high computational complexity of ViTs, MHS-VIT employs the Mamba model, whose state-space representation update mechanism and linear time complexity enhance the model's understanding of traffic scenes. This also enables the model to operate efficiently when processing large datasets, reducing the computational complexity. Our MHS-VIT showed superiority over other methods on the TROD and TSLD. An ablation study also demonstrated that the proposed SVT Block and DLS Block components played crucial roles in improving the model performance and reducing the computational complexity. The proposed model provides new ideas and methods for research in the field of intelligent transportation, and its superior performance and computational efficiency demonstrate its enormous potential in practical applications. We look forward to continuing to explore the potential of the Mamba model in future research and contributing further to the construction of intelligent transportation systems.
References
- 1. Chen X, Zheng J, Li C, Wu B, Wu H, Montewka J. Maritime traffic situation awareness analysis via high-fidelity ship imaging trajectory. Multimed Tools Appl. 2023;83(16):48907–23.
- 2. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. https://doi.org/10.1109/cvpr.2016.90
- 3. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4700–8. https://doi.org/10.1109/CVPR.2017.243
- 4. Tan M, Le Q. EfficientNetV2: smaller models and faster training. In: International Conference on Machine Learning. 2021, pp. 10096–106. https://doi.org/10.48550/arXiv.2104.00298
- 5. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in Neural Information Processing Systems 30. 2017.
- 6.
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–22. https://doi.org/10.48550/arXiv.2103.14030
- 7.
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H. Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, 2021, pp. 10347–57. https://doi.org/10.48550/arXiv.2012.12877
- 8.
Shi D. Transnext: robust foveal visual perception for vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; 2024, pp. 17773–83. https://doi.org/10.1109/CVPR52733.2024.01683
- 9. Gu A, Dao T. Mamba: linear-time sequence modeling with selective state spaces. arXiv, preprint, 2023.
- 10. Liu Y, Tian Y, Zhao Y, Yu H, Xie L, Wang Y, et al. Vmamba: visual state space model. Advances in Neural Information Processing Systems 37. 2024, pp. 103031–63.
- 11. Zhu L, Liao B, Zhang Q, Wang X, Liu W, Wang X. Vision Mamba: efficient visual representation learning with bidirectional state space model. arXiv, preprint, 2024. https://arxiv.org/abs/2401.09417
- 12. Huang T, Pei X, You S, Wang F, Qian C, Xu C. Localmamba: visual state space model with windowed selective scan. arXiv, preprint, 2024. https://arxiv.org/abs/2403.09338
- 13. Song H, Bang J. Prompt-guided transformers for end-to-end open-vocabulary object detection. arXiv, preprint, 2023. https://arxiv.org/abs/2303.14386
- 14. Kim B, Mun J, On KW, Shin M, Lee J, Kim ES. MSTR: multi-scale transformer for end-to-end human-object interaction detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 19578–87. https://doi.org/10.48550/arXiv.2203.14709
- 15. Zhou Q, Li X, He L, Yang Y, Cheng G, Tong Y, et al. TransVOD: end-to-end video object detection with spatial-temporal transformers. IEEE Trans Pattern Anal Mach Intell. 2023;45(6):7853–69. pmid:36417746
- 16. Zhao Y, Lv W, Xu S, Wei J, Wang G, Dang Q, et al. DETRs beat YOLOs on real-time object detection. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024, pp. 16965–74. https://doi.org/10.1109/cvpr52733.2024.01605
- 17. Gordeev A, Dokholyan V, Tolstykh I, Kuprashevich M. Saliency-guided detr for moment retrieval and highlight detection. arXiv, preprint, 2024.
- 18. Ramezani H, Aleman D, Létourneau D. Lung-DETR: deformable detection transformer for sparse lung nodule anomaly detection. arXiv, preprint, 2024.
- 19. Hou X, Liu M, Zhang S, Wei P, Chen B, Lan X. Relation DETR: exploring explicit position relation prior for object detection. In: European Conference on Computer Vision. 2024, pp. 89–105. https://arxiv.org/abs/2407.11699
- 20. Zhao G, Zhou L, Wang X. ISMRNN: an implicitly segmented RNN method with Mamba for long-term time series forecasting. arXiv, preprint, 2024. https://arxiv.org/abs/2407.10768
- 21. Alomar K, Aysel HI, Cai X. RNNs, CNNs and transformers in human action recognition: a survey and a hybrid model. 2024.
- 22. Deng F, Park J, Ahn S. Facing off world model backbones: Rnns, transformers, and s4. In: Advances in Neural Information Processing Systems 36. 2023, pp. 72904–30. https://arxiv.org/abs/2307.02064
- 23. Gu A, Goel K, Ré C. Efficiently modeling long sequences with structured state spaces. arXiv, preprint, 2021.
- 24. Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–7. https://doi.org/10.1109/cvpr.2014.81
- 25. Girshick R. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1440–8. https://arxiv.org/abs/1504.08083
- 26. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–49. pmid:27295650
- 27. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 779–88. https://arxiv.org/abs/1506.02640
- 28. Redmon J, Farhadi A. YOLO9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 7263–71. https://arxiv.org/abs/1612.08242
- 29. Redmon J, Farhadi A. YOLOv3: an incremental improvement. arXiv, preprint, 2018.
- 30. Bochkovskiy A, Wang CY, Liao HYM. YOLOv4: optimal speed and accuracy of object detection. arXiv, preprint, 2020.
- 31. Wang C-Y, Bochkovskiy A, Liao H-YM. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023, pp. 7464–75. https://doi.org/10.1109/cvpr52729.2023.00721
- 32. Jocher G, Chaurasia A, Stoken A, Borovec J, Kwon Y, Michael K, et al. Ultralytics/yolov5: v7.0-yolov5 sota realtime instance segmentation. Zenodo. 2022.
- 33. Jocher Glenn J, Ayush C, Jing Q. Ultralytics YOLO. 2023. Available from: https://github.com/ultralytics/ultralytics
- 34. Li C, Li L, Jiang H, Weng K, Geng Y, Li L. YOLOv6: a single-stage object detection framework for industrial applications. arXiv, preprint, 2022. https://arxiv.org/abs/2209.02976
- 35. Wang A, Chen H, Liu L, Chen K, Lin Z, Han J, et al. YOLOv10: real-time end-to-end object detection. In: Advances in Neural Information Processing Systems 37. 2024, pp. 107984–8011. https://arxiv.org/abs/2405.14458
- 36. Zhu J, Qiu J, Chen S, Chen S, Zhang H. An application of YOLOv8 integrated with attention mechanisms for detection of grape leaf black rot spots. PLoS One. 2025;20(4):e0321788. pmid:40233077
- 37. Chen X, Wu H, Han B, Liu W, Montewka J, Liu RW. Orientation-aware ship detection via a rotation feature decoupling supported deep learning approach. Eng Appl Artif Intell. 2023;125:106686.
- 38. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: European Conference on Computer Vision. Springer; 2020, pp. 213–29. https://doi.org/10.1007/978-3-030-58452-8_13
- 39. Zhu X, Su W, Lu L, Li B, Wang X, Dai J. Deformable transformers for end-to-end object detection. arXiv, preprint, 2021. https://arxiv.org/abs/2010.04159
- 40. Gu A, Johnson I, Goel K, Saab K, Dao T, Rudra A. Combining recurrent, convolutional, and continuous-time models with linear state space layers. In: Advances in Neural Information Processing Systems 34. 2021, pp. 572–85. https://doi.org/10.48550/arXiv.2110.13985
- 41. Smith JT, Warrington A, Linderman SW. Simplified state space layers for sequence modeling. arXiv, preprint, 2022.
- 42. Wang Z, Li C, Xu H, Zhu X. Mamba YOLO: SSMs-based YOLO for object detection. arXiv, preprint, 2024. https://arxiv.org/abs/2406.05835
- 43. Zhang T, Li L, Zhou Y, Liu W, Qian C, Hwang JN. CAS-ViT: convolutional additive self-attention vision transformers for efficient mobile applications. arXiv, preprint, 2024. https://arxiv.org/abs/2408.03703