
Road pedestrian detection and tracking algorithm based on improved YOLOv5s and DeepSORT

Abstract

To address the challenges of low accuracy, high miss detection rate, and poor tracking stability in pedestrian detection and tracking under dense occlusion and small object scenarios on traffic roads, this paper proposes a pedestrian detection and tracking algorithm based on improved YOLOv5s and DeepSORT. For the improvements in the YOLOv5s detection network, first, the Focal-EIoU loss function is used to replace the CIoU loss function. Second, a 160 × 160-pixel Small Object (SO) detection layer is added to the Neck structure. Finally, the Multi-Head Self-Attention (MHSA) mechanism is introduced into the Backbone network to enhance the model’s detection performance. Regarding the improvements in the DeepSORT tracking framework, a lightweight ShuffleNetV2 network is integrated into the appearance feature extraction network, reducing the number of model parameters while maintaining accuracy. Experimental results show that the improved YOLOv5s achieves an mAP0.5 of 80.8% and an mAP0.5:0.95 of 49.7%, representing increases of 4.4% and 3.9%, respectively, compared to the original YOLOv5s. The enhanced YOLOv5s-DeepSORT achieves an MOTA of 50.7% and an MOTP of 77.3%, improving by 3.3% and 0.5%, respectively, over the original YOLOv5s-DeepSORT. Additionally, the number of identity switches (IDs) is reduced by 11.3%, and the model size is reduced to 20% of the original algorithm, enhancing its portability. The proposed method demonstrates strong robustness and can effectively track targets of different sizes.

1. Introduction

With the increase in urban population and traffic volume, pedestrian detection and tracking have become increasingly important in the fields of intelligent transportation and public safety. Accurate and efficient pedestrian detection and tracking contribute to pedestrian behavior analysis, improving the efficiency and reliability of road safety monitoring and Intelligent Transportation Systems [1] (ITS). In pedestrian traffic management [2], real-time detection and tracking of pedestrians in monitored areas assist security personnel in promptly identifying abnormal behaviors or potential threats. In ITS, pedestrian detection and tracking can be applied to traffic flow statistics and accident early warning, providing crucial support for urban traffic management. Additionally, in the field of autonomous driving [3], accurately perceiving and tracking pedestrians on the road is one of the core technologies for ensuring driving safety, enabling autonomous driving systems to make reasonable decisions based on pedestrian behavior. Therefore, pedestrian detection and tracking have become one of the key research areas in computer vision and intelligent transportation.

Traditional object detection methods typically rely on manually designed features [4] and perform object recognition through multiple independent steps. However, they involve high computational costs and slow detection speeds, making it difficult to meet the requirements of real-time object detection. The development of deep learning has significantly advanced object detection technology. YOLOv5s is a deep-learning-based object detection algorithm that achieves end-to-end training through a multi-scale feature fusion strategy, enabling simultaneous optimization of bounding box prediction and category recognition. Compared with traditional methods, YOLOv5s has significantly improved both feature representation ability and computational efficiency, allowing it to perform object detection at a relatively high frame rate and meet real-time detection requirements. However, its detection accuracy still has certain limitations in complex scenarios such as low-light conditions, dense occlusion, and small pedestrian targets.

Currently, multi-object tracking (MOT) methods are mainly categorized into Tracking-by-Detection [5] (TBD) and Joint Detection and Embedding [6] (JDE) methods. Among these, detection-based tracking is the mainstream approach. This method first utilizes an object detection algorithm to identify target regions in video frames and then matches detection boxes belonging to the same target through an association model to construct a complete motion trajectory. However, the tracking performance of such methods largely depends on the quality of target feature extraction. As a widely used detection-based tracking algorithm, DeepSORT [7] has significantly improved in terms of stability and robustness compared to SORT [8]. However, it still has certain limitations in practical applications. Firstly, its appearance feature extraction network employs a relatively complex deep neural network, resulting in high computational overhead, which limits its application in real-time pedestrian tracking tasks. When running on resource-constrained edge devices, it may lead to reduced processing efficiency. Secondly, in scenarios involving dense pedestrians, severe occlusion, or small targets, the target association robustness of DeepSORT decreases, making identity switches (IDs) more likely to occur, thereby affecting tracking stability and accuracy. Therefore, optimizing the pedestrian appearance feature extraction network of DeepSORT to enhance feature representation ability while reducing computational complexity will contribute to improving overall tracking performance.

To address the above issues, this paper proposes a pedestrian detection and tracking algorithm based on an improved YOLOv5s-DeepSORT, optimizing both the detection network and the feature extraction network. This method employs an improved YOLOv5s as the pedestrian detector and inputs the detection results into the DeepSORT algorithm to achieve end-to-end pedestrian tracking. The optimized model enhances lightweight characteristics while maintaining excellent tracking performance, making it easier to deploy on edge devices. The main contributions of this paper are as follows:

  (1) The original CIoU loss function in YOLOv5s is replaced with the Focal-EIoU loss function to improve the localization accuracy of bounding box regression, thereby enhancing pedestrian detection accuracy.
  (2) A Small Object (SO) detection layer is added to the Neck structure of YOLOv5s, introducing a 160 × 160 detection feature map for detecting pedestrian targets larger than 4 × 4 pixels, thereby improving the model’s ability to detect small-sized pedestrians.
  (3) The Multi-Head Self-Attention (MHSA) mechanism is integrated into the backbone network of YOLOv5s, enabling the model to fully capture global contextual information and enhancing pedestrian detection performance and robustness.
  (4) In the appearance feature extraction network of DeepSORT, the original deep neural network is replaced with the lightweight ShuffleNetV2 network, and the appearance feature extraction model is retrained. While maintaining good accuracy, this effectively reduces model parameters and computational complexity.

The remainder of this paper is organized as follows. Section 2 introduces related work on pedestrian detection and tracking. Section 3 elaborates on our improvements to the pedestrian detection network and tracking algorithm. Section 4 analyzes the experimental results. Finally, Section 5 concludes this paper and discusses future research directions.

2. Related work

Pedestrian detection and tracking technology based on deep learning leverages the fundamental principles and methods of computer vision, utilizing deep learning algorithms to detect and track pedestrian targets.

2.1. Object detection

In recent years, deep learning models represented by convolutional neural networks (CNN) [5] have achieved significant progress in the field of pedestrian detection. For example, object detection models such as the R-CNN series [9–11], SSD [12], and the YOLO series [13–15] have demonstrated excellent performance. Zhang et al. [16] incorporated a cross-channel attention mechanism into the Faster R-CNN network structure, improving the localization accuracy of bounding box regression and thereby enhancing the detection accuracy of occluded pedestrians. Li et al. [17] improved the SSD algorithm by introducing a Feature Pyramid Network (FPN) and proposed the Feature Fusion SSD (FSSD) algorithm, which improved detection accuracy. Yin et al. [18] proposed an improved YOLOv5 algorithm that enhances long-term attention and dependencies in image processing by integrating a large-kernel attention module and the C3 module. Additionally, they optimized the loss function, leading to improved pedestrian detection accuracy in road scenes. Wang et al. [14] introduced the YOLOv7 algorithm, which significantly enhances object detection accuracy through a dynamic label assignment strategy and an extended efficient layer aggregation network, without increasing inference costs. Dou et al. [19] proposed an improved YOLOv8 algorithm, which integrates a multi-scale feature fusion mechanism and an optimized non-maximum suppression (NMS) algorithm, effectively reducing duplicate detections and missed detections in pedestrian detection.

2.2. Multi-object tracking

Bewley et al. [8] proposed the SORT algorithm, a lightweight multi-object tracking method based on Kalman filtering and the Hungarian algorithm. It utilizes detection results provided by an object detector for inter-frame object matching, enabling efficient online tracking. Wojke et al. [7] introduced the DeepSORT algorithm, which extends SORT by incorporating person re-identification (ReID) technology as an appearance model. It also employs a cascade matching strategy that prioritizes high-confidence targets, thereby improving matching accuracy and tracking stability. Wang et al. [20] proposed the JDE algorithm, which integrates object detection and person re-identification (ReID) feature extraction within a single neural network. This unified approach reduces inference time and enhances the real-time performance of multi-object tracking. Zhang et al. [21] introduced the YOLOX detector and the ByteTrack tracker. ByteTrack improves the detection result filtering strategy by retaining high-confidence detections while re-matching low-confidence detections in subsequent frames. This approach reduces missed detections and enhances the continuity of object trajectories. Hu et al. [22] proposed an algorithm based on an improved SSD and DeepSort, which enhances pedestrian detection capability in low-visibility scenarios by optimizing the SSD model and achieves stable pedestrian tracking under complex nighttime interference using DeepSort. Zhang et al. [23] proposed a pedestrian tracking algorithm designed to address occlusion issues. This algorithm integrates a dual-path self-attention mechanism and a cyclic-shift negative sample generation strategy. Additionally, it employs the least squares algorithm and Kalman filtering to predict the trajectories of tracked targets, thereby improving pedestrian tracking accuracy in occluded scenarios.

2.3. Object detection and tracking

The object detector can provide the initial position information of the target [24], defining the starting point for the tracker. Multi-object tracking [25] focuses on analyzing the motion trajectory of targets over time, closely intertwining with and complementing the object detection task. Through collaborative integration, these two tasks leverage each other’s strengths, enhancing the system’s robustness and real-time performance, making it well-suited for pedestrian tracking tasks in various complex scenarios.

Deep learning-based pedestrian detection and tracking algorithms are increasingly applied in traffic scenarios and have achieved remarkable success. However, numerous challenges remain in different pedestrian environments, such as low-light conditions, high pedestrian density, occlusion, and small pedestrian targets, which lead to high false detection and missed detection rates. Additionally, achieving a comprehensive optimization of the algorithm to balance accuracy, real-time performance, and lightweight deployment for practical applications remains an ongoing challenge. To address these challenges, we employ an improved YOLOv5s-DeepSORT algorithm for detecting and tracking densely occluded pedestrians.

3. Proposed methods

With the continuous optimization of object detection performance, detection-based tracking methods have been driven to develop in the field of multi-object tracking and have achieved significant results in pedestrian detection and tracking.

3.1. The YOLOv5s pedestrian detection model

The YOLOv5 model is a real-time and accurate algorithm widely used in the field of object detection. Based on the size of the model, it is divided into variants such as s, m, l, and x, with each variant differing in network depth and width. Although the overall average precision of YOLOv5s may be relatively lower, it has fewer parameters and lower computational complexity, achieving a good balance between accuracy and efficiency. Therefore, in this study, we adopted the YOLOv5s model, as shown in Fig 1, with the improved YOLOv5s model shown in Fig 2.

The YOLOv5s network consists of four components: the input stage (Input), the backbone network (Backbone), the neck structure (Neck), and the detection network (Head). The input stage is responsible for tasks such as Mosaic data augmentation, adaptive anchor box calculation, and adaptive image scaling. The backbone is used to extract features from the image while continuously reducing the size of the feature maps. The CBS module, consisting of a 2D convolutional layer (Conv), a batch normalization layer (BN), and the SiLU (Sigmoid-Weighted Linear Unit) activation function, is a commonly used module in convolutional neural networks. The neck structure, located between the backbone and the head, combines the SPP (Spatial Pyramid Pooling) module and the FPN [26] (Feature Pyramid Network) with a PAN [27] (Path Aggregation Network) module. This design helps to fuse features of different scales, enhancing detection performance for multi-scale object features. The detection network is responsible for predicting image features, generating bounding boxes, and predicting classes, recording the classification and location information of the objects.

3.1.1. Loss function improvement.

YOLOv5s employs the CIoU [28] (Complete Intersection over Union) loss function to compute the localization error between the target box and the predicted box. The calculation formula for CIoU is as follows:

$$\mathrm{IoU}=\frac{\left|B\cap B^{gt}\right|}{\left|B\cup B^{gt}\right|} \tag{1}$$

$$L_{IoU}=1-\mathrm{IoU} \tag{2}$$

$$\mathcal{R}_{CIoU}=\frac{\rho^{2}\left(b,b^{gt}\right)}{c^{2}}+\alpha v \tag{3}$$

$$v=\frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}}-\arctan\frac{w}{h}\right)^{2} \tag{4}$$

$$\alpha=\frac{v}{\left(1-\mathrm{IoU}\right)+v} \tag{5}$$

$$L_{CIoU}=1-\mathrm{IoU}+\frac{\rho^{2}\left(b,b^{gt}\right)}{c^{2}}+\alpha v \tag{6}$$

where B and B^{gt} denote the predicted and ground-truth boxes, b and b^{gt} their center points, ρ(·) the Euclidean distance, c the diagonal length of the smallest box enclosing both boxes, and w, h (w^{gt}, h^{gt}) the width and height of the predicted (ground-truth) box.

The EIoU [29] (Efficient Intersection over Union) loss function addresses the shortcomings of the CIoU loss function by dividing the loss into three components: IoU (Intersection over Union) loss, distance loss, and aspect (width-height) loss. It replaces the aspect-ratio term defined by the parameters α and v in CIoU with direct width and height penalties, where Cw and Ch denote the width and height of the smallest box covering the ground-truth box and the anchor box. This resolves the ambiguity caused by CIoU's use of aspect ratios. By combining the EIoU loss function with the FocalL1 [29] loss function, the final Focal-EIoU [29] loss function is obtained.

To enhance model performance and accuracy, this experiment adopts the Focal-EIoU loss function to replace the original CIoU loss function in YOLOv5s. The calculation formula for Focal-EIoU is as follows:

$$L_{EIoU}=L_{IoU}+L_{dis}+L_{asp}=1-\mathrm{IoU}+\frac{\rho^{2}\left(b,b^{gt}\right)}{c^{2}}+\frac{\rho^{2}\left(w,w^{gt}\right)}{C_{w}^{2}}+\frac{\rho^{2}\left(h,h^{gt}\right)}{C_{h}^{2}} \tag{7}$$

$$f(\chi)=\begin{cases}-\dfrac{\alpha\chi^{2}\left(2\ln(\beta\chi)-1\right)}{4}, & 0<\chi\le 1\\[4pt] -\alpha\ln(\beta)\,\chi+C, & \chi>1\end{cases} \tag{8}$$

$$L_{Focal\text{-}EIoU}=\mathrm{IoU}^{\gamma}\cdot L_{EIoU} \tag{9}$$

In the formula, the variable χ signifies the disparity between the actual and predicted values, while e represents the mathematical constant. The parameter β governs the curvature of the curve, and C stands as a constant within the equation. Additionally, the parameter γ plays a role in controlling the extent to which outlier values are suppressed.

The Focal-EIoU loss function is particularly suitable for dense pedestrian scenarios and is compatible with the YOLOv5s algorithm. Integrating it into YOLOv5s can improve the model’s detection accuracy and enhance its robustness.
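As an illustrative sketch, the loss described above can be written in plain Python for a single pair of axis-aligned (x1, y1, x2, y2) boxes. The γ default of 0.5 and the scalar, non-batched form are simplifying assumptions; a real detector computes this over batched tensors, and degenerate (zero-area) boxes are assumed not to occur.

```python
def focal_eiou(pred, gt, gamma=0.5):
    """Focal-EIoU loss for one (x1, y1, x2, y2) box pair (illustrative sketch)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    # intersection and union areas
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = pw * ph + gw * gh - inter
    iou = inter / union if union > 0 else 0.0
    # smallest enclosing box: width Cw, height Ch, squared diagonal c^2
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2
    # squared distance between box centers
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # EIoU = IoU loss + center-distance loss + width/height losses
    l_eiou = 1 - iou + rho2 / c2 + (pw - gw) ** 2 / cw ** 2 + (ph - gh) ** 2 / ch ** 2
    # focal weighting suppresses gradients from low-quality (low-IoU) examples
    return (iou ** gamma) * l_eiou
```

For identical boxes the loss vanishes, and it grows as the predicted box drifts away from the ground truth.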

3.1.2. Improvement in small object detection.

One of the reasons for the poor performance of dense occluded pedestrian detection is the small size of small target samples. Due to the high downsampling rate of YOLOv5s, it is difficult for deep feature maps to capture the feature information of small targets. Although the original YOLOv5s model has detection layers with grid sizes of 80 × 80, 40 × 40, and 20 × 20, enabling multi-scale detection, it still suffers from missed and false detections when detecting dense and occluded small pedestrian targets. To address this issue, we propose adding a 160 × 160 SO (Small Object) detection layer to the Neck module. By improving the FPN combined with PAN operations, the shallow and deep feature maps are concatenated for detection, thereby enhancing the model’s detection capability.

As shown in Fig 2, after the 18th layer, upsampling and other operations are applied to further enlarge the feature map. At the 21st layer, a 160 × 160 feature map collected from earlier layers is fused with the second-layer feature map from the backbone network. This fusion results in larger feature maps, which improve small object detection performance. The newly added 160 × 160 detection feature map is used to detect objects larger than 4 × 4 pixels. This method is simple and effective, improving the model’s ability to detect objects at different sizes and resolutions in images. The improved FPN combined with PAN structure is shown in Fig 3.
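The stride arithmetic behind this design can be made explicit with a small helper; the 640 × 640 input resolution is an assumption based on YOLOv5's common default, not a value stated in the text.

```python
def detection_grids(img_size=640, strides=(4, 8, 16, 32)):
    # each stride s yields an (img_size // s)-cells-per-side detection grid;
    # the new stride-4 level is the 160 x 160 map, whose cells cover 4 x 4 pixels,
    # which is why it can resolve pedestrians larger than about 4 x 4 pixels
    return {s: img_size // s for s in strides}
```

With the default input, the added level produces the 160 × 160 grid alongside the original 80 × 80, 40 × 40, and 20 × 20 grids.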

3.1.3. Introduction of the multi-head self-attention mechanism.

The Multi-Head Self-Attention (MHSA) mechanism originates from the Transformer [30] architecture. By constructing a globally correlated attention weight matrix, it enables global modeling of input features, thereby capturing global information more effectively. Additionally, each attention head in MHSA can be computed independently and supports parallel processing, improving the efficiency of model training and inference.

In the YOLOv5s network model, convolution operations are typically local, focusing mainly on specific regions of the input sequence and failing to directly capture global contextual information. This limitation results in insufficient semantic information integration in dense occlusion pedestrian detection tasks. To address this issue, this paper introduces the MHSA mechanism at the end of the YOLOv5s backbone network, allowing the model to perform global self-attention computation on low-resolution feature maps. This helps reduce redundant computations in shallow feature maps, enabling more efficient processing and integration of semantic information related to small-scale targets in densely occluded scenarios. Consequently, the model’s detection performance and generalization capability in dense occlusion pedestrian detection tasks are significantly improved. The structural diagram of the MHSA mechanism is shown in Fig 4.

The computational formula for the MHSA mechanism is as follows:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V \tag{10}$$

$$\mathrm{head}_{i}=\mathrm{Attention}\left(QW_{i}^{Q},\,KW_{i}^{K},\,VW_{i}^{V}\right) \tag{11}$$

$$\mathrm{MHSA}(Q,K,V)=\mathrm{Concat}\left(\mathrm{head}_{1},\ldots,\mathrm{head}_{h}\right)W^{O} \tag{12}$$

Here, Q represents the query set, K represents the key set, V represents the value set, h represents the number of self-attention heads, headh represents the computation of the h-th self-attention head, and Wo represents the output weight.
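The computation can be traced in plain Python on toy matrices. Using identity projections in place of the learned query, key, value, and output weight matrices is a deliberate simplification to keep the sketch short; a real implementation learns these projections and operates on batched tensors.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    dk = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(dk) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)

def mhsa(x, n_heads=2):
    # split channels across heads, attend per head, then concatenate;
    # learned projection matrices are replaced by the identity for brevity
    d = len(x[0]) // n_heads
    heads = []
    for h in range(n_heads):
        sub = [row[h * d:(h + 1) * d] for row in x]
        heads.append(attention(sub, sub, sub))
    return [sum((heads[h][i] for h in range(n_heads)), []) for i in range(len(x))]
```

With a single token, attention degenerates to returning the value vector, which is a quick sanity check on the implementation.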

3.2. The DeepSORT tracking algorithm

DeepSORT is a target tracking algorithm capable of continuously tracking objects in a video sequence. The main process of the DeepSORT algorithm builds upon the SORT algorithm by incorporating appearance information. It utilizes models from the pedestrian re-identification (ReID) domain to extract features, estimates the tracking target state through Kalman filtering, associates the current frame’s detected targets with the tracking targets from the previous frame using the Hungarian algorithm, and finally updates the target states in real-time using Kalman filtering.
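The association step can be approximated in a short sketch. Note that DeepSORT combines Mahalanobis and appearance-feature distances and solves the assignment with the Hungarian algorithm, so the greedy IoU-only matching below is a simplified stand-in, and the threshold value is illustrative.

```python
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, iou_threshold=0.3):
    # build a (1 - IoU) cost for every track/detection pair and greedily take
    # the lowest-cost pairs; DeepSORT uses the Hungarian algorithm plus
    # appearance costs here, so this greedy pass is only a stand-in
    pairs = sorted(
        (1.0 - iou(t, d), ti, di)
        for ti, t in enumerate(tracks)
        for di, d in enumerate(detections)
    )
    used_t, used_d, matches = set(), set(), []
    for cost, ti, di in pairs:
        if ti in used_t or di in used_d or cost > 1.0 - iou_threshold:
            continue
        matches.append((ti, di))
        used_t.add(ti)
        used_d.add(di)
    return matches
```

Each matched pair would then update the corresponding Kalman filter state, while unmatched detections spawn new tracks.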

3.2.1. Incorporating the lightweight ShuffleNetV2 network.

ShuffleNetV2 [31] is a lightweight convolutional neural network structure that achieves information exchange within network layers through group convolution and channel shuffling. It is used for feature extraction and correlation matching, enhancing feature extraction capabilities while keeping the model lightweight. YOLOv5s is a deep learning-based object detection model with high detection accuracy and fast inference speed. YOLOv5s-DeepSORT combines the YOLOv5s object detection model with the DeepSORT object tracking algorithm. This integration involves replacing the appearance feature extraction network in DeepSORT with the lightweight ShuffleNetV2 network, as illustrated in Fig 5. This approach reduces the overall model complexity and parameter count, improving computational efficiency while maintaining high accuracy. By utilizing features extracted with ShuffleNetV2 for correlation matching, there is a significant improvement in the ID switch metric of the tracking model, a slight increase in tracking accuracy, and the provision of real-time object tracking capabilities.
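The channel-shuffle operation at the heart of ShuffleNetV2 can be shown on a plain list of channel indices; a real implementation applies the same reshape-transpose-flatten pattern to 4-D feature tensors.

```python
def channel_shuffle(channels, groups):
    # reshape the channel list into (groups, channels_per_group), transpose,
    # and flatten, so information mixes across groups after grouped convolution
    per = len(channels) // groups
    grouped = [channels[g * per:(g + 1) * per] for g in range(groups)]
    return [grouped[g][i] for i in range(per) for g in range(groups)]
```

For six channels in two groups, channels from the two groups end up interleaved, which is exactly the cross-group information exchange the text describes.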

4. Experiments

Under the same experimental environment and training conditions, this paper conducts improvement experiments on the YOLOv5s and DeepSORT algorithms, followed by a comparison and result analysis.

4.1. CrowdHuman dataset

The following is a comparison of commonly used pedestrian detection datasets, including Caltech, KITTI, CityPersons, COCOPersons, and CrowdHuman [32]; the comparison results are shown in Table 1. CrowdHuman contains 15,000 images in the training set, 5,000 images in the test set, and 4,370 images in the validation set. Compared with other pedestrian detection datasets, CrowdHuman is larger in scale and better represents real-world scenarios, covering variations in scale, pose, occlusion, density, and lighting, which makes it more diverse and challenging. In this study, the YOLOv5s experiments used 15,000 images from the CrowdHuman dataset, divided into training, validation, and test sets at a ratio of 6:2:2.

Table 1. Volume, density and diversity of different human detection training datasets.

https://doi.org/10.1371/journal.pone.0334786.t001
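A deterministic 6:2:2 split like the one used here can be sketched as follows; the fixed seed and integer-ratio bookkeeping are implementation choices for the example, not details taken from the paper.

```python
import random

def split_dataset(items, ratios=(6, 2, 2), seed=0):
    # shuffle deterministically, then carve out train/val/test
    # at the 6:2:2 boundaries using exact integer arithmetic
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

Applied to 15,000 image identifiers, this yields 9,000 training, 3,000 validation, and 3,000 test samples with no overlap.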

4.2. Market1501 dataset

The Market1501 [33] dataset was collected and publicly released in 2015 on the campus of Tsinghua University. It includes 32,668 labeled images of 1,501 pedestrians captured from 6 cameras, comprising 5 high-definition cameras and 1 low-definition camera. The images are of size 64 × 128, featuring pedestrians in various poses and perspectives. Each pedestrian appears multiple times across different cameras. The training set consists of 12,936 images of 751 different pedestrians, while the test set includes 19,732 images of 750 different identities. In this study, the DeepSORT appearance feature extraction model was experimentally retrained on this dataset to enable it to extract highly discriminative pedestrian appearance information.

4.3. Experimental results and analysis

4.3.1. Model training.

The hardware configuration used in the experiment includes an AMD 3990x CPU, NVIDIA RTX3090 24GB GPU, and 64GB of memory. The operating system is Ubuntu 18.04.6 LTS, and the experiment is conducted using the PyTorch framework. For YOLOv5s, the hyperparameters are set as follows: batch size is 64, epochs is 300, initial learning rate is 0.01, weight decay coefficient is 0.0005, and the SGD optimization algorithm is employed. For DeepSORT, the hyperparameters are set as follows: batch size is 64, epochs is 40, initial learning rate is 0.1, weight decay coefficient is 0.0005, and the SGD optimization algorithm is used.

4.3.2. Evaluation metrics.

This article uses precision (P), recall (R), average precision (AP), and mean average precision (mAP) as the evaluation metrics for the YOLOv5s model. The formulas for calculating these performance metrics are as follows:

$$\mathrm{IoU}=\frac{\mathrm{area}\left(B_{p}\cap B_{gt}\right)}{\mathrm{area}\left(B_{p}\cup B_{gt}\right)} \tag{13}$$

$$P=\frac{TP}{TP+FP} \tag{14}$$

$$R=\frac{TP}{TP+FN} \tag{15}$$

$$AP=\int_{0}^{1}P(R)\,dR \tag{16}$$

$$mAP=\frac{1}{N}\sum_{i=1}^{N}AP_{i} \tag{17}$$

Where AP is the area under the Precision-Recall curve; TP is the number of target boxes with IoU ≥ threshold; FP is the number of target boxes with IoU < threshold; FN is the number of missed targets. The experimental results for YOLOv5s are shown in Table 2.

Table 2. Performance Comparison of Different Improvement Methods.

https://doi.org/10.1371/journal.pone.0334786.t002
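The detection metrics defined above can be computed from raw counts; the trapezoidal integration and the convention of starting the precision-recall curve at precision 1.0 are assumptions of this sketch, and evaluation toolkits may interpolate differently.

```python
def precision_recall(tp, fp, fn):
    # P = TP / (TP + FP), R = TP / (TP + FN)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(pr_points):
    # pr_points: (recall, precision) pairs sorted by increasing recall;
    # trapezoidal area under the P-R curve, assuming precision 1.0 at recall 0
    ap, prev_r, prev_p = 0.0, 0.0, 1.0
    for r, p in pr_points:
        ap += (r - prev_r) * (p + prev_p) / 2.0
        prev_r, prev_p = r, p
    return ap
```

Averaging the per-class AP values then gives mAP; with a single pedestrian class, mAP equals the pedestrian AP.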

The evaluation metrics used in the YOLOv5s-DeepSORT experiment [34] are as follows:

MOTA [35]: Multiple Object Tracking Accuracy. This metric considers three types of errors: false positives, missed targets, and identity switches.

MOTP: Multiple Object Tracking Precision. This metric summarizes the overall tracking precision by calculating the overlap between the true bounding boxes and predicted locations.

IDF1 [35]: ID F1 score. The ratio of correctly identified detections to the average number of ground-truth and computed detections.

IDs: Total number of identity switches.

ML: Mostly Lost targets. The proportion of ground-truth trajectories that are correctly tracked for no more than 20% of their lifespan.

MT: Mostly Tracked targets. The proportion of ground-truth trajectories that are correctly tracked for at least 80% of their lifespan.

FP: Total number of false positives.

FN: Total number of missed targets.
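The headline tracking metrics reduce to simple arithmetic over the error counts listed above. Treating MOTP as a mean bounding-box overlap follows the CLEAR-MOT convention, and the counts in the test are hypothetical examples.

```python
def mota(fp, fn, ids, gt_total):
    # MOTA penalizes false positives, missed targets, and identity switches
    # relative to the total number of ground-truth objects across all frames
    return 1.0 - (fp + fn + ids) / gt_total

def motp(match_overlaps):
    # MOTP: mean bounding-box overlap (IoU) over all matched
    # target-prediction pairs
    return sum(match_overlaps) / len(match_overlaps)
```

For example, 10 false positives, 20 misses, and 5 identity switches over 100 ground-truth objects give a MOTA of 0.65.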

We evaluated the performance of the improved DeepSORT and improved YOLOv5s-DeepSORT on the MOT16 [36] challenge sequences 02, 04, 05, 09, 10, 11, and 13, and compared them with the original algorithms. The experimental results are shown in Tables 3 and 4, respectively.

Table 3. Comparison of Tracking Performance Among Different Improvement Methods with YOLOv5s (↓ represents “Lower is better”, ↑ represents “Higher is better”).

https://doi.org/10.1371/journal.pone.0334786.t003

Table 4. Comparison of detection performance of different detection methods.

https://doi.org/10.1371/journal.pone.0334786.t004

4.3.3. Ablation experiment.

The comparison of detection performance for different YOLOv5s improvement methods is shown in Table 2. It can be observed that Improvement 1, which replaces the original CIoU loss function with the Focal-EIoU loss function, results in a 1.4% and 0.9% increase in mAP0.5 and mAP0.5:0.95, respectively, while keeping the model’s parameter size and computational cost unchanged. Improvement 2, based on Improvement 1, adds an SO small object detection layer, which increases the model’s parameter size and computational complexity by 0.66M and 11G, respectively, but improves mAP0.5 and mAP0.5:0.95 by 2.3% and 2.5%, respectively. Improvement 3, based on Improvement 2, introduces the MHSA mechanism, which increases the model’s parameter size and computational complexity by 0.79M and 0.6G, respectively, but further improves mAP0.5 and mAP0.5:0.95 by 0.7% and 0.5%, respectively, demonstrating the effectiveness of the improved model’s detection capabilities.

As shown in Table 3, DeepSORT-ShuffleNetV2, the improved version of DeepSORT, reduces the number of parameters and the computational complexity to 2.02M and 0.05G, respectively, reductions of 9.47M and 1.97G relative to the original algorithm. MOTA and MOTP also show slight improvements. The model size decreases from 41.6MB to 8.3MB, 20% of the original size, significantly reducing the computational load and demonstrating the effectiveness of the lightweight design.

4.3.4. Comparative experiment.

To verify the superiority of the improved algorithm, this paper compares the improved YOLOv5s model with mainstream object detection models, including Faster-RCNN-ResNet50, SSD512, YOLOv7-tiny, and YOLOv8s. As shown in Table 4, the improved YOLOv5s algorithm achieved an mAP0.5 of 0.808, which is a 4.4% improvement over the original mAP0.5 of 0.764. The inference time for a single image is only 8.9ms, demonstrating a clear advantage over other mainstream object detection models, effectively proving the rationality and effectiveness of the proposed improvements.

As shown in Table 5, the improved YOLOv5s-DeepSORT algorithm achieves MOTA, MOTP, and IDF1 of 50.7%, 77.3%, and 58.4%, respectively, representing improvements of 3.3%, 0.5%, and 8.6% compared to the original YOLOv5s-DeepSORT algorithm. In addition, the number of identity switches (IDs) is reduced by 57, an 11.3% decrease compared to the original algorithm. The experimental results demonstrate that the improved YOLOv5s-DeepSORT algorithm effectively addresses the problem of missed and false detections of densely occluded pedestrians on traffic roads to some extent, better meeting the requirements of pedestrian detection and tracking tasks.

Table 5. Comparison of tracking performance of different methods.

https://doi.org/10.1371/journal.pone.0334786.t005

5. Conclusion

This paper addresses the pedestrian detection issues in dense occlusion and small targets by improving the loss function, neck structure, and backbone network of the YOLOv5s model. Additionally, lightweight modifications were made to the feature extraction network of the DeepSORT model. The improved YOLOv5s-DeepSORT algorithm significantly enhances the accuracy of pedestrian detection and tracking, reduces false positives and false negatives, and lowers the identity switch error rate, making the pedestrian tracking process more efficient. Future research can explore more efficient lightweight network structures to improve the real-time performance of pedestrian detection and tracking algorithms, thereby better adapting to various application scenarios.

References

  1. Chowdhury A, Kaisar S, Khoda ME, Naha R, Khoshkholghi MA, Aiash M. IoT-Based Emergency Vehicle Services in Intelligent Transportation System. Sensors (Basel). 2023;23(11):5324. pmid:37300051
  2. Tahir NUA, Long Z, Zhang Z, Asim M, ELAffendi M. PVswin-YOLOv8s: UAV-Based Pedestrian and Vehicle Detection for Traffic Management in Smart Cities Using Improved YOLOv8. Drones. 2024;8(3):84.
  3. Jang J, Kim D, Jin D, Kim C-S. Contour-based object forecasting for autonomous driving. Journal of Visual Communication and Image Representation. 2025;106:104343.
  4. Cao T, Song K, Xu L, Feng H, Yan Y, Guo J. Balanced multi-scale target score network for ceramic tile surface defect detection. Measurement. 2024;224:113914.
  5. Lee S-H, Park D-H, Bae S-H. Decode-MOT: How Can We Hurdle Frames to Go Beyond Tracking-by-Detection? IEEE Trans Image Process. 2023;32:4378–92. pmid:37506023
  6. Xu L, Huang Y. Rethinking Joint Detection and Embedding for Multiobject Tracking in Multiscenario. IEEE Trans Ind Inf. 2024;20(6):8079–88.
  7. Wojke N, Bewley A, Paulus D. Simple online and realtime tracking with a deep association metric. 2017 IEEE international conference on image processing (ICIP). 2017:3645–9. https://doi.org/10.1109/icip.2017.8296962
  8. Bewley A, Ge Z, Ott L, Ramos F, Upcroft B. Simple online and realtime tracking. 2016 IEEE international conference on image processing (ICIP). 2016:3464–8. https://doi.org/10.1109/ICIP.2016.7533003
  9. Ren S, He K, Girshick R, Sun J. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems. 2015;28.
  10. Cai Z, Vasconcelos N. Cascade r-cnn: Delving into high quality object detection. Proceedings of the IEEE conference on computer vision and pattern recognition. 2018:6154–62. https://doi.org/10.1109/CVPR.2018.00644
  11. Xie X, Cheng G, Wang J, Li K, Yao X, Han J. Oriented R-CNN and Beyond. Int J Comput Vis. 2024;132(7):2420–42.
  12. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, et al. Ssd: Single shot multibox detector. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. 2016. p. 21–37.
  13. Zhu X, Lyu S, Wang X, Zhao Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. Proceedings of the IEEE/CVF international conference on computer vision. 2021. p. 2778–88.
  14. Wang C-Y, Bochkovskiy A, Liao H-YM. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023. p. 7464–75.
  15. Rasheed AF, Zarkoosh M. Optimized YOLOv8 for multi-scale object detection. J Real-Time Image Proc. 2024;22(1).
  16. Zhang S, Yang J, Schiele B. Occluded pedestrian detection through guided attention in cnns. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2018. p. 6995–7003.
  17. Li Z, Yang L, Zhou F. FSSD: feature fusion single shot multibox detector. arXiv preprint arXiv:1712.00960. 2017.
  18. Yin Y, Zhang Z, Wei L, Geng C, Ran H, Zhu H. Pedestrian detection algorithm integrating large kernel attention and YOLOV5 lightweight model. PLoS One. 2023;18(11):e0294865. pmid:38019827
  19. Dou H, Chen S, Xu F, Liu Y, Zhao H. Analysis of vehicle and pedestrian detection effects of improved YOLOv8 model in drone-assisted urban traffic monitoring system. PLoS One. 2025;20(3):e0314817. pmid:40100905
  20. Wang Z, Zheng L, Liu Y, Li Y, Wang S. Towards real-time multi-object tracking. European conference on computer vision; 2020. p. 107–22.
  21. Zhang Y, Sun P, Jiang Y, Yu D, Weng F, Yuan Z, et al. Bytetrack: Multi-object tracking by associating every detection box. European conference on computer vision; 2022. p. 1–21.
  22. Hu X, Zhang Q. Nighttime trajectory extraction framework for traffic investigations at intersections based on improved SSD and DeepSort. SIViP. 2023;17(6):2907–14.
  23. Zhang L, Ding G, Li G, Jiang Y, Li Z, Li D. An anti-occlusion optimization algorithm for multiple pedestrian tracking. PLoS One. 2024;19(1):e0291538. pmid:38295135
  24. 24. Yuan Y, Wu Y, Zhao L, Chen H, Zhang Y. Multiple object detection and tracking from drone videos based on GM-YOLO and multi-tracker. Image and Vision Computing. 2024;143:104951.
  25. 25. Zhang Y, Wang T, Zhang X. Motrv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023. p. 22056–65.
  26. 26. Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. p. 2117–25.
  27. 27. Liu S, Qi L, Qin H, Shi J, Jia J. Path aggregation network for instance segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. p. 8759–68.
  28. 28. Zheng Z, Wang P, Liu W, Li J, Ye R, Ren D. Distance-IoU loss: Faster and better learning for bounding box regression. Proceedings of the AAAI conference on artificial intelligence. 2020;34(07):12993–3000.
  29. 29. Zhang Y-F, Ren W, Zhang Z, Jia Z, Wang L, Tan T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing. 2022;506:146–57.
  30. 30. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in neural information processing systems. 2017;30.
  31. 31. Ma N, Zhang X, Zheng H-T, Sun J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. Proceedings of the European conference on computer vision (ECCV). 2018. p. 116–31.
  32. 32. Shao S, Zhao Z, Li B, Xiao T, Yu G, Zhang X, et al. Crowdhuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123. 2018.
  33. 33. Zheng L, Shen L, Tian L, Wang S, Wang J, Tian Q. Scalable person re-identification: A benchmark. Proceedings of the IEEE international conference on computer vision. 2015:1116–24.
  34. 34. Yang F, Zhang X, Liu B. Video object tracking based on YOLOv7 and DeepSORT. arXiv preprint arXiv:2207.12202. 2022. doi:
  35. 35. Alikhanov J, Obidov D, Abdurasulov M, Kim H. Practical Evaluation Framework for Real-Time Multi-Object Tracking: Achieving Optimal and Realistic Performance. IEEE Access. 2025;13:34768–88.
  36. 36. Milan A, Leal-Taixé L, Reid I, Roth S, Schindler K. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831. 2016.