
MDCFVit-YOLO: A model for nighttime infrared small target vehicle and pedestrian detection

  • Huiying Zhang ,

    Roles Conceptualization, Methodology, Supervision

    yingzi1313@163.com

    Affiliation College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin, China

  • Qinghua Zhang,

    Roles Investigation, Software, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliation College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin, China

  • Yifei Gong,

    Roles Resources, Software

    Affiliation College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin, China

  • Feifan Yao,

    Roles Validation, Visualization

    Affiliation College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin, China

  • Pan Xiao

    Roles Data curation, Formal analysis

    Affiliation College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin, China

Abstract

An MDCFVit-YOLO model based on the YOLOv8 algorithm is proposed to address issues in nighttime infrared object detection such as low visibility, high interference, and low precision in detecting small objects. The backbone network uses the lightweight RepViT model, improving detection performance and reducing model weight through transfer learning. The proposed MPA module integrates multi-scale contextual information, capturing complex dependencies between spatial and channel dimensions and thereby enhancing the representation capability of the neural network. The CSM module dynamically adjusts the weights of feature maps, enhancing the model's sensitivity to small targets. The dynamic automated detection head DAIH improves the accuracy of infrared target detection by dynamically adjusting regression feature maps. Additionally, three innovative loss functions (focalerDIoU, focalerGIOU, and focalerShapeIoU) are proposed to reduce losses during training. Experimental results show that the MDCFVit-YOLO model achieves a detection accuracy of 78% for small nighttime infrared targets, a recall rate of 58.6%, an mAP of 67%, and a parameter count of 20.9M. Compared to the baseline YOLOv8 model, mAP increased by 6.4%, while accuracy and recall improved by 4.5% and 5.7%, respectively. This research provides new ideas and methods for infrared target detection, enhancing detection accuracy and real-time performance.

2. Introduction

With the rapid advancement of technology, fields such as autonomous driving, intelligent transportation systems, and security monitoring are placing increasing demands on the accuracy and real-time performance of target detection. Infrared target detection, as a crucial technological tool, has unique advantages in nighttime and low-light environments, effectively overcoming the limitations of visible light imaging in such conditions. Consequently, infrared vehicle and pedestrian detection has become a key focus of current research. In the field of target detection, the YOLO (You Only Look Once) series of algorithms is highly regarded for its efficiency and accuracy. As a member of the YOLO series, YOLOv8 inherits the advantages of its predecessors and introduces innovations in the backbone network, detection head, and loss functions, greatly enhancing the flexibility and performance of the model. This makes YOLOv8 a powerful tool for achieving high-precision infrared target detection. However, despite the significant accomplishments of YOLOv8 in target detection, challenges remain in the realm of infrared target detection. Infrared images often have lower contrast and higher noise levels, making it challenging for traditional target detection algorithms to effectively extract and identify target features. Additionally, the lack of clear distinction between vehicle and pedestrian targets and the background in infrared images further complicates detection. To address these challenges, this paper proposes a method for infrared vehicle and pedestrian detection based on the YOLOv8 model. This method enhances YOLOv8's backbone network and detection head to better suit the characteristics of infrared images, improving the extraction and recognition of target features. Additionally, a self-developed module is introduced to reduce noise interference and enhance the contextual relationships within images, thereby further improving detection performance.

Nighttime environments pose a great challenge to target detection due to the lack of visible light, and infrared technology, with its all-weather operation, has become an important means of solving this problem. Infrared target detection captures target information under no-light or low-light conditions by sensing the infrared radiation emitted by objects, and is widely used in fields such as security, autonomous driving, and rescue. Especially in night scenes, infrared target detection not only compensates for the shortcomings of visible light systems but is also notable for its ability to penetrate haze and smoke. This paper systematically discusses the research and application of nighttime infrared target detection in terms of technical background, current status, challenges, and future directions.

The basis of infrared target detection lies in the imaging principle of infrared radiation. Infrared sensors achieve target identification by detecting the radiation difference between the target and the background and converting it into a grey-scale image. While traditional visible light cameras struggle to operate in nighttime environments due to the lack of natural light, infrared imaging technology is not subject to this limitation. According to wavelength range, infrared technology is mainly classified into near infrared (0.75–1.4 μm), mid-infrared (3–5 μm), and long-wave infrared (8–14 μm), of which long-wave infrared is the most widely used in nighttime detection due to its high sensitivity to thermal radiation. Traditional infrared target detection relies on signal processing methods, such as threshold-based segmentation and filter enhancement. However, these methods have limited effect under complex-background or low signal-to-noise ratio conditions. With the development of artificial intelligence, especially deep learning, infrared target detection has entered a new stage: neural networks can automatically extract target features and significantly improve detection accuracy and robustness, which has become a hot spot in current research.
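As a toy illustration of the threshold-based segmentation mentioned above, the sketch below flags pixels that stand out from the background statistics of a synthetic frame. The k-sigma rule and the synthetic data are illustrative choices, not taken from the paper:

```python
import numpy as np

def detect_hot_spots(ir_frame, k=3.0):
    """Classic threshold segmentation: flag pixels whose intensity
    exceeds the frame mean by k standard deviations."""
    mu, sigma = ir_frame.mean(), ir_frame.std()
    return ir_frame > mu + k * sigma

# Synthetic 32x32 frame: cool background with one hot 2x2 target.
frame = np.full((32, 32), 20.0)
frame[10:12, 10:12] = 200.0
mask = detect_hot_spots(frame)
```

As the paper notes, such global statistics break down in cluttered scenes, where the background itself contains bright heat sources; this is exactly the regime in which learned detectors take over.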

In recent years, nighttime infrared target detection has made great progress, and detection algorithms based on models such as convolutional neural networks (CNNs) and Transformers have significantly improved performance. Frameworks such as YOLO and Faster R-CNN have been widely adapted for infrared image processing, and a series of improvements have been proposed for detecting weak targets at night, such as multi-scale feature fusion and attention mechanisms. A single infrared modality may not be sufficient to meet all the challenges of a night scene, so fusion with multi-sensor data, such as visible light and radar, has become a trend. The fusion of infrared and visible light images exploits the thermal imaging properties of infrared and the detailed information of visible light to further improve detection accuracy.

With the growth of hardware computing power, such as the widespread availability of GPUs and TPUs, lightweight network designs such as MobileNet have made real-time infrared target detection possible. Some algorithms can achieve processing speeds of more than a hundred frames per second even in complex scenarios, meeting the needs of night surveillance and autonomous driving. As for datasets, the lack of infrared target detection datasets has long been a bottleneck, but in recent years several public datasets (FLIR, NUDT-SIRST) have been released, providing support for algorithm training and validation. In application fields such as security, urban safety, and border protection, infrared cameras are widely used to achieve 24/7 all-weather monitoring, as infrared technology can penetrate darkness and haze and provide clear target images. In autonomous driving, at night and in bad weather, infrared thermal imaging provides reliable sensing for autonomous vehicles, helping to identify potential obstacles such as pedestrians and vehicles. In emergency rescue, infrared technology can locate trapped people in fire, earthquake, and other scenarios to assist rescue operations.

Current research on nighttime infrared target detection focuses mainly on three directions. The first is detecting weak targets: targets in nighttime infrared images are often small, dim, and lacking in texture and shape information, and researchers have improved detection through multi-scale feature extraction and attention mechanisms. The second is suppressing complex backgrounds: night scenes may contain background noise such as clouds and heat-source interference, so effectively separating the target from the background is a research focus. The third is coping with low signal-to-noise ratios: infrared images have low resolution and poor contrast, and data augmentation and loss-function optimisation (e.g., Focal Loss) have become directions for improvement.

However, nighttime infrared target detection still faces many problems. Infrared images lack colour and detail information, and targets usually appear as fuzzy hot spots, making it difficult to distinguish between target types, especially in long-distance night scenes. In nighttime environments, ground heat reflection, natural heat sources, and even atmospheric scattering can introduce noise and increase the false alarm rate. Compared to the visible light domain, infrared datasets have small sample sizes and limited scene variety, and annotation accuracy needs improvement. Although deep learning models perform well, they are computationally intensive, and their deployment on embedded devices still needs optimisation to meet real-time requirements. Weather changes at night can also affect infrared radiation, leading to fluctuations in detection performance.

In response to these challenges, the main contributions of this paper are as follows. First, we analyse in detail the challenges and difficulties of infrared target detection and explore the strengths and limitations of the YOLOv8 model in this context. Second, by introducing the new lightweight backbone network RepViT, the Dynamic Automated Inspection Head (DAIH), and the self-developed Context-Sensitive Modulation (CSM) and Multipath Interactive Attention (MIA) modules, we propose the MDCFVit-YOLO model based on the YOLOv8 architecture; to further improve detection, a new loss function, focalerShapeIoU, is also used. MDCFVit-YOLO significantly improves the accuracy and real-time performance of infrared vehicle and pedestrian detection. Finally, we experimentally validate the effectiveness of the proposed method and compare it with other advanced infrared target detection algorithms. The experimental results show that MDCFVit-YOLO achieves 78% small-target detection accuracy, 58.6% recall, and 67% mAP on the FLIR_ADAS_v2 dataset, improvements of 4.5%, 5.7%, and 6.4%, respectively, over the baseline YOLOv8. The parameter count is 20.9M and the frame rate is 65.2 FPS, indicating that the model maintains real-time performance while improving accuracy. These metrics validate the claimed performance improvements and provide new ideas and methods for infrared target detection.

Compared with existing methods, the MDCFVit-YOLO model proposed in this paper is novel in the following respects: the MPA module overcomes the limitation of traditional attention mechanisms, which struggle to capture interactions between spatial and channel dimensions simultaneously; the CSM module enhances feature expression through multi-level contextual aggregation; the combination of the DAIH detection head and a dynamic tuning strategy significantly improves detection performance in complex scenarios; and the newly proposed focalerShapeIoU loss overcomes the inadequacy of traditional IoU losses for small-target localisation. Together these improvements address the problems of low visibility, high interference, and low small-target detection accuracy in infrared detection. This research provides new insights and methods in the field of infrared vehicle and pedestrian detection, contributing to technological advancement and social development in related fields.

3. Related work

Nighttime driving infrared image target detection is a crucial topic, aiming to accurately identify and locate vehicles and pedestrians in images using advanced image processing techniques.

Deep learning technology has become dominant in the field of infrared vehicle and pedestrian detection. This technology constructs complex neural network models and uses back-propagation algorithms to optimize network parameters, allowing computers to automatically learn and extract object features from images. Currently, deep learning-based object detection algorithms can be broadly divided into two categories: two-stage detection algorithms and single-stage detection algorithms.

Two-stage detection algorithms, such as the R-CNN series (including R-CNN [1], Fast R-CNN [2], and Faster R-CNN [3]) are renowned for their high accuracy. These algorithms first extract image features through neural networks, then generate candidate regions, and classify and localize these regions. Although they excel in detection accuracy, they are generally slower in processing speed, which may make them unsuitable for real-time detection scenarios.

To address the speed deficiencies of two-stage detection algorithms, researchers have proposed single-stage detection algorithms, such as the YOLO series (including YOLOv1 [4], YOLOv2 [5], etc.) and the SSD series [6]. These techniques use neural networks to directly extract picture information and simultaneously carry out object categorization and localization, significantly enhancing detection speed. While single-stage detection algorithms may be slightly less accurate than two-stage algorithms, they excel in real-time detection tasks.

In infrared target detection, the attention mechanism enables the model to focus on key parts of the input information, significantly improving its ability to handle complex tasks. Zhang [7] proposed an attention-guided pyramid context network (AGPCNet), which enhances the accuracy of infrared small target detection through a context pyramid module and an asymmetric fusion module. Experimental results show that AGPCNet achieves an mIoU of 0.7288 and an F1 score of 0.8431 on the SIRST Aug dataset, and an mIoU of 0.4843 and an F1 score of 0.6525 on the MDFA dataset, with AUCs as high as 0.9344 and 0.8682, respectively, showing excellent detection of small targets in complex backgrounds. Su [8] designed an Attention Fusion Feature Pyramid Network (AFFPN) by introducing an expansive spatial pyramid pooling module in the feature extraction phase and enhancing infrared target feature representation through an attention fusion module; it achieved an F1 score of 0.892 and an AUC of 0.956 on the SIRST dataset, and an F1 score of 0.788 and an AUC of 0.901 on the MDFA dataset, proving its superiority in feature enhancement.

In infrared vehicle and pedestrian detection, small targets present challenges such as scarce feature information, complex and variable backgrounds, and unpredictable motion trajectories, which significantly increase detection difficulty. Scholars have nevertheless continued to explore new methods and theories to break through the bottleneck of small target detection and improve detection accuracy and robustness. Ming [9] proposed an attention-based context-aware network (ACANet), which utilises the attention mechanism to extract features from local and global contexts to improve infrared small target detection; it outperforms traditional methods in detection probability and mIoU, demonstrating the strong potential of context awareness. Zuo [10] designed an Attention Fusion Feature Pyramid Network (AFFPN), which significantly improves infrared small target detection through expansive spatial pyramid pooling and an attention fusion module; the results are consistent with Su [8], with an F1 score of 0.892 and an AUC of 0.956 on the SIRST dataset, verifying the robustness of the method. Ling [11] proposed an attention-based context-aware network to improve the accuracy of infrared small target detection by extracting features from local and global contexts; although specific figures are not available, its design is similar to that of Ming [9] and can be expected to perform well. Thin [12] proposed a Dense Nested Attention Network (DNAN), which significantly enhances infrared small target detection through a hierarchical attention mechanism and feature fusion strategy, outperforming traditional methods on datasets such as NUDT-SIRST, with mIoU and F1 scores showing the efficiency of hierarchical feature fusion.

In infrared vehicle and pedestrian detection, the design of the loss function is crucial to detection performance. Conventional loss functions may perform poorly on infrared small target detection, so researchers have proposed improvements to optimise target detection and significantly raise overall performance. Among these, the focal loss function (Focal Loss) significantly improves the detection of difficult-to-classify targets by down-weighting easy-to-classify samples [13]. Experiments show that focal loss excels at handling class imbalance, and in infrared small target detection in particular it can effectively improve the model's focus on small targets. Yu [14] proposed a method combining multi-dimensional feature fusion with local detection enhancement for infrared small target detection, and improved the loss function to enhance performance; although specific values are not available, the method is expected to improve detection accuracy by optimising feature extraction and loss calculation. Another approach replaces pixel-level discrimination with a masked post-detection distribution model and employs a diffusion model to enhance the accuracy of small-target infrared detection [15]. It has been shown that the diffusion model significantly improves detection performance on public datasets through data enhancement, e.g., missed-detection and false-detection rates are reduced in complex backgrounds.

In the infrared target detection process, modification of the backbone and optimisation of the loss function are also crucial to detection performance. Wang [16] proposed a modified backbone feature extraction network and replaced the original GIoU loss with an Inner-GIoU loss, improving both the speed and accuracy of small target detection. Experiments show that Inner-GIoU, through auxiliary bounding-box regression, accelerates the convergence of the GIoU loss in small target detection by about 10%, with detection accuracy also improved.

In the field of infrared target detection, improving the YOLO algorithm is a commonly used and efficient way to enhance detection performance. YOLO-ISTD, a YOLOv5-S-based method, significantly improves infrared small target detection by replacing the original 8x, 16x, and 32x downsampling detection heads with 4x, 8x, and 16x downsampling detection heads. On the IRSTD-1K dataset, the method yielded a 2.3% improvement in mAP and a 2.7% improvement in F1 score, helping the model better capture small target features and thus improve detection accuracy. Another study proposed a YOLOv5 infrared small target detection algorithm (YOLO-FR) based on a feature reorganisation sampling method [17]. By adjusting the feature map size without increasing or decreasing the amount of feature information, the algorithm introduces an STD module and the CARAFE operation to optimise the detection head structure, achieving 97.4% mAP50 in experiments, 7.4% higher than the original network. The method effectively uses the feature information in infrared images to significantly enhance the model's ability to detect small targets.

The YOLOv7-based infrared small target network (YOLO-SDLUWD) better adapts the model to infrared image characteristics and improves small target detection performance through an optimised detection head and feature fusion method [18]. On a public dataset, the method achieves an mAP of 85.3%, 2.7% higher than the original YOLOv7, at a frame rate of 68.9 FPS. The YOLO and ConvNeXt-based Infrared Target Detection Network (YOLO-CIR) improves YOLO's detection head structure and feature implementation module to adapt to high bit-width infrared images [19]. On the FLIR dataset, its mAP50 is 3% higher than YOLOv5 and 5.6% higher than Faster R-CNN, and the number of parameters and computation are also significantly reduced, enabling the model to better capture target features in infrared images and thus improve detection accuracy.

Addressing the practical limitations of edge devices, a lightweight infrared target detection technique, Edge-YOLO, has been proposed [20]. This method optimises detection for edge devices by replacing the backbone network of the YOLOv5m model and introducing a lightweight visual attention mechanism. AP50:95 reaches 50.6% on the COCO2017 dataset with an AP50 of 69.8%, and 44.8% on VisDrone2019-DET, at a frame rate that meets real-time requirements (FPS ≥ 30). This approach reduces computational complexity and model size while maintaining high detection accuracy, providing an effective solution for infrared target detection on edge devices.

To address the safety of vehicles travelling at night in the field of autonomous driving, the MDCFVit-YOLO algorithm is proposed to reduce the false and missed detections in nighttime infrared images that result from low brightness and small target size. The innovations are as follows:

  1. To solve the problem that traditional attention mechanisms struggle to capture long-range dependencies and interactions between spatial and channel dimensions simultaneously, this paper proposes the MPA module, which generates more accurate attention maps by integrating operations such as global average pooling and horizontal and vertical average pooling, thereby improving the network's ability to capture feature-map details and its representational capacity.
  2. To improve the network's ability to detect complex backgrounds and tiny targets, this paper proposes the Contextual Focus Modulation mechanism (CSM), which dynamically adjusts the weights of the feature maps so that the network pays more attention to important contextual information.
  3. A new detection head, DAIH, is proposed. Through its task-decomposition dynamic convolution strategy and unique architectural design, it can capture critical information in complex scenarios, produce accurate detection results, and improve the adaptability and robustness of the regression features.
  4. Three loss functions are innovatively proposed: focalerGIOU combines IoU with the minimum enclosing box area to improve global positioning accuracy; focalerDIoU introduces centroid distance to significantly improve localisation accuracy; and focalerShapeIoU combines IoU, centroid distance, and shape matching to optimise the objective across multiple dimensions.
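The exact formulas are not given at this point in the text, but a plausible sketch of the focalerDIoU idea, combining a Focaler-style rescaled IoU term with the standard DIoU centre-distance penalty, might look like the following. The interval bounds d and u are assumed hyperparameters, not values from the paper:

```python
import numpy as np

def focaler_diou_loss(box1, box2, d=0.0, u=0.95):
    """Hedged sketch of a 'focaler' DIoU loss: the plain IoU term is
    linearly rescaled to the interval [d, u] (Focaler-IoU style) and
    the DIoU centre-distance penalty is added.
    Boxes are (x1, y1, x2, y2)."""
    # Intersection and plain IoU.
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    a1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    a2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    iou = inter / (a1 + a2 - inter + 1e-9)
    # Focaler rescaling: emphasises hard (low-IoU) samples.
    iou_f = np.clip((iou - d) / (u - d), 0.0, 1.0)
    # DIoU penalty: squared centre distance over enclosing-box diagonal.
    cx1, cy1 = (box1[0] + box1[2]) / 2, (box1[1] + box1[3]) / 2
    cx2, cy2 = (box2[0] + box2[2]) / 2, (box2[1] + box2[3]) / 2
    ex1, ey1 = min(box1[0], box2[0]), min(box1[1], box2[1])
    ex2, ey2 = max(box1[2], box2[2]), max(box1[3], box2[3])
    rho2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
    diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
    return 1.0 - iou_f + rho2 / diag2
```

A perfectly overlapping prediction yields a loss near zero, while disjoint boxes are penalised both for zero IoU and for centre distance; focalerShapeIoU would additionally add a shape-matching term along the same lines.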

4. Model architecture and methodology

4.1 YOLOv8 overview

The YOLOv8 algorithm introduces a new SOTA model and offers models of different scales (N/S/M/L/X) to meet various application needs. The YOLOv8 architecture consists of three parts: the backbone network, the neck network, and the head network. As shown in Fig 1, the backbone network is mainly composed of Conv modules, C2f modules, and SPPF modules. The neck network adopts a Feature Pyramid Network (FPN) structure [21] to achieve top-down multi-scale object detection. The head network includes three object detectors, using grid-based anchors to perform detection on feature maps of different scales. As the feature map increases in size, it contains more contour and geometric information, which is beneficial for detecting small objects.

The Conv module is the basic component of many important modules. The C2f module includes Conv and Bottleneck modules. The SPPF module consists of Conv modules, Maxpool layers, and Concat layers.
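The SPPF pattern mentioned above (chained max-pools whose outputs are concatenated with the input) can be sketched as follows; this is a simplified numpy illustration in which the surrounding Conv layers are omitted:

```python
import numpy as np

def maxpool5(x):
    """5x5 max pooling, stride 1, padding 2, on a (C, H, W) map."""
    c, h, w = x.shape
    p = np.pad(x, ((0, 0), (2, 2), (2, 2)), constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = p[:, i:i + 5, j:j + 5].max(axis=(1, 2))
    return out

def sppf(x):
    """Sketch of the SPPF pattern: three chained max-pools whose outputs
    are concatenated with the input along the channel axis (the Conv
    layers before and after the pooling chain are omitted here)."""
    y1 = maxpool5(x)
    y2 = maxpool5(y1)
    y3 = maxpool5(y2)
    return np.concatenate([x, y1, y2, y3], axis=0)

x = np.arange(4 * 8 * 8, dtype=float).reshape(4, 8, 8)
y = sppf(x)
```

Chaining three 5x5 pools is equivalent to pooling at effective receptive fields of 5, 9, and 13, which is why SPPF aggregates multi-scale context cheaply.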

4.2 MDCFVit-YOLO network architecture

The study adheres to all relevant ethical standards and publication guidelines. It has not been previously published or disseminated in any form, and all appropriate publication ethics have been strictly observed. Furthermore, the research complies with the requirements set forth by the Ethics Review Board or other relevant ethics committees, and no ethical or moral concerns are associated with the conduct of this study.

The backbone network of the MDCFVit-YOLO architecture uses Tsinghua University's lightweight RepViT network [22] as the backbone to keep the network lightweight. The MDCFVit-YOLO network architecture is shown in Fig 2. Context-Sensitive Modulation (CSM) is also employed to enhance the network's expressive capabilities. To integrate spatial and channel attention and enhance the representational capacity of the neural network, Multipath Interactive Attention (MIA) is used to simultaneously capture long-range dependencies and interactions between spatial and channel dimensions. For the detection head, the Dynamic Automated Inspection Head (DAIH) is used, enabling the detection head to provide efficient and accurate results in complex detection tasks. The DyDCNV2 dynamic convolution module used in the detection head makes regression features more adaptive and robust, as shown in the upper right part of Fig 2.

Fig 2. Block Diagram of MDCFVit-YOLO Network Architecture.

https://doi.org/10.1371/journal.pone.0324700.g002

The MDCFVit-YOLO model processes infrared images for high-precision nighttime small target detection as follows. The input images are first normalised and optimised through pre-processing techniques such as resizing, normalisation, and data augmentation to enhance the robustness and generalisation capability of the model. The pre-processed images are fed into the lightweight RepViT backbone network to extract multi-scale feature maps that capture key information. These feature maps then enter a neck network, such as a Feature Pyramid Network (FPN), for multi-level feature fusion to enhance the detection of targets of different sizes. The fused feature maps pass through the multipath interactive attention (MPA) module, which generates an attention map to highlight key regions using global pooling and horizontal and vertical average pooling, after which the weights are dynamically adjusted through the context-sensitive modulation (CSM) module to focus on important contextual information such as small targets and complex backgrounds. The enhanced feature maps are then fed into the Dynamic Automated Inspection Head (DAIH), which generates classification and regression predictions via task decomposition: the regression branch uses dynamic convolution (DyDCNV2) to adjust the features and output the target bounding box, and the classification branch generates category probabilities. Finally, the innovative loss functions (e.g., focalerDIoU) are used to optimise the parameters during training, and redundant boxes are removed by non-maximum suppression (NMS) in the post-processing stage to filter out high-quality detection results. This modular design and innovative technology effectively overcome the challenges of low visibility and high interference in nighttime infrared imagery, achieving efficient and accurate small target detection.
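The non-maximum suppression step used in the post-processing stage above is standard; a minimal greedy version for reference is:

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression. boxes: (N, 4) as (x1, y1, x2, y2);
    scores: (N,). Returns the indices of the kept boxes."""
    order = np.argsort(scores)[::-1]   # highest score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # IoU of the current best box against all remaining boxes.
        ix1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        iy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        ix2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        iy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou < iou_thr]    # suppress heavy overlaps
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

Here the second box overlaps the first heavily and is suppressed, while the distant third box survives.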

Research on multipath interactive attention.

The proposed Multipath Interactive Attention (MIA) module aims to enhance the representational capability of neural networks by integrating spatial and channel attention. Traditional attention mechanisms often struggle to simultaneously capture long-range dependencies and interactions between spatial and channel dimensions. MIA addresses this issue with an innovative approach, allowing the network to perform better in complex visual tasks. The architectural diagram of the MIA module is shown in Fig 3.

Fig 3. Architecture of the Multipath Interactive Attention Mechanism Model.

https://doi.org/10.1371/journal.pone.0324700.g003

The MIA module mainly consists of global average pooling (GAP), horizontal and vertical average pooling, and a series of convolution operations to generate spatial and channel attention maps. The main function of the GAP path is to capture the global context information of the input feature map. Specifically, adaptive average pooling is applied to the input feature map, reducing it to one spatial point (1x1) per channel, then a convolution layer transforms the pooled features to enhance their representational ability for global dependencies. This step effectively extracts global features, providing critical information for subsequent feature fusion. The horizontal and vertical pooling paths capture spatial dependencies along the height and width dimensions, respectively. The input feature map is first subjected to adaptive average pooling along the height and width directions. To align the dimensions, the vertically pooled features need to undergo a dimension permutation. Then, the horizontally and vertically pooled features are concatenated along the channel dimension and processed through a convolution layer to generate a combined feature map. The combined feature map is then split back into separate horizontal and vertical components. This step aims to capture the spatial dependencies of the input feature map in different directions, thereby increasing the sensitivity of the attention mechanism to spatial features. To refine the spatial attention map, MIA further generates and processes attention weights. Specifically, another convolution layer is applied to the combined feature map, and the attention weights are generated through a sigmoid activation function. These weights are then split into horizontal and vertical weights. The original pooled features are element-wise multiplied with their corresponding weights, thereby modulating these features. Simultaneously, the global context features are modulated by the average value of the spatial weights. 
This process ensures the effective fusion of spatial and channel features, allowing the attention mechanism to more accurately capture long-range dependencies and global context information. The multi-path processed feature map is obtained by element-wise multiplying the input feature map with the sigmoid-activated horizontal and vertical features, and the globally modulated features, resulting in the final attention-modulated feature map. This process achieves an effective combination of spatial and channel attention, significantly enhancing feature representation ability.
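As an illustration of the flow just described, the following NumPy sketch reproduces the pooling-and-modulation structure of the MIA module on a single (C, H, W) feature map. This is a simplification under stated assumptions: the learned convolution layers of the real module are replaced with identity mappings, so only the pooling paths, the sigmoid weight generation, and the element-wise modulation are shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mia_sketch(x):
    """Sketch of the MIA pooling/modulation flow for one (C, H, W) map.

    Convolutions from the paper are replaced with identity mappings;
    only the pooling/weighting structure is shown.
    """
    C, H, W = x.shape
    # Global average pooling path: (C, 1, 1) global context.
    g = x.mean(axis=(1, 2), keepdims=True)
    # Horizontal pooling (average over width) -> (C, H, 1).
    ph = x.mean(axis=2, keepdims=True)
    # Vertical pooling (average over height), permuted to align -> (C, W, 1).
    pv = x.mean(axis=1, keepdims=True).transpose(0, 2, 1)
    # Concatenate the pooled features, then split back into horizontal and
    # vertical weights (stands in for the conv over the combined map).
    combined = np.concatenate([ph, pv], axis=1)           # (C, H+W, 1)
    wh, wv = sigmoid(combined[:, :H]), sigmoid(combined[:, H:])
    # Modulate pooled features with their weights; modulate the global
    # context with the mean spatial weight.
    ah = ph * wh                                          # (C, H, 1)
    av = (pv * wv).transpose(0, 2, 1)                     # (C, 1, W)
    ag = g * combined.mean(axis=(1, 2), keepdims=True)    # (C, 1, 1)
    # Final attention-modulated map via broadcast element-wise products.
    return x * sigmoid(ah) * sigmoid(av) * sigmoid(ag)
```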

The MIA mechanism effectively enhances the representational capability of neural networks by integrating multi-scale contextual information and capturing complex dependencies between spatial and channel dimensions. Through adaptive pooling and convolution operations, MIA generates precise attention maps, thus performing excellently in various visual tasks. This method provides a powerful solution for addressing the challenges faced by deep learning models in capturing long-range dependencies.

Proposed dynamic automated inspection head.

Existing object detection algorithms such as YOLOv8 perform excellently on visible-light images, but their detection performance on infrared images still needs improvement. To better meet the needs of infrared object detection, this study therefore modifies the detection head of YOLOv8 to enhance its performance on infrared images.

The architecture of the DAIH (dynamic automated inspection head) module, as shown in Fig 4, primarily consists of feature extraction, task decomposition, dynamic convolution, feature alignment, and output modules. Multi-scale feature maps (P14, P18, P22) undergo feature extraction through a shared convolutional layer. This shared layer contains two convolutional layers (Conv_GN 3x3), which process the input feature maps separately; the purpose of this stage is to reduce noise and extract more representative features. After feature extraction, the extracted feature maps are concatenated during the feature fusion phase to generate a composite feature map. The composite feature map is then decomposed by the task decomposition module (TaskDecomposition) into classification and regression feature maps, which adapts the feature maps based on the global average pooling results of the input, generating feature maps suitable for the classification and regression tasks.

thumbnail
Fig 4. The architecture of dynamic automated Inspection head.

https://doi.org/10.1371/journal.pone.0324700.g004

Subsequently, the regression feature map is processed by the dynamic convolution module (DyDCNV2). DyDCNV2 dynamically adjusts the regression feature map using offsets and masks generated by a convolutional layer, enabling the feature map to better adapt to the regression task. Meanwhile, the classification feature map is processed through two convolutional layers to generate the classification probability map, which is then activated by the sigmoid function. The processed classification and regression feature maps are output through convolutional layers (Conv_Cl, Conv_Reg) as classification and regression results. These results are adaptively adjusted through the Scale module, ensuring consistency in feature maps across different resolutions.

Each module in the DAIH detection head has a clear function and collaborates to achieve efficient object detection. The Conv_GN 3x3 convolutional layers improve feature extraction and reduce noise in the feature maps through Group Normalization. The task decomposition module adaptively adjusts and decomposes the composite feature map into feature maps suitable for classification and regression tasks. The DyDCNV2 dynamic convolution module utilizes offsets and masks for dynamic adjustment of the feature map, enhancing the adaptability and robustness of the regression features. Finally, the Scale module adjusts the output feature maps, ensuring their consistency across different resolutions.

In the actual workflow, the input multi-scale feature maps (P14, P18, P22) first undergo feature extraction through the shared convolutional layer. After the feature fusion phase, a composite feature map is generated. This composite feature map is decomposed by the task decomposition module into classification and regression feature maps. The regression feature map generates offsets and masks through a convolutional layer and undergoes dynamic convolution with the DyDCNV2 module, producing the final regression feature map. Simultaneously, the classification feature map generates a classification probability map through two convolutional layers, followed by sigmoid activation. The classification and regression feature maps are output through convolutional layers, and adjusted by the Scale module. During the training phase, the output is a set of feature maps; during inference, the feature maps are concatenated, and the final detection results are generated based on preset dynamic or fixed shapes.
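The workflow above can be summarized with a toy forward pass. In the NumPy sketch below, every learned convolution is replaced by a cheap stand-in (normalization, channel reweighting, channel means), so only the data flow of DAIH is illustrated: shared extraction, fusion, GAP-driven task decomposition, then separate classification and regression outputs. The function names are illustrative, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def daih_sketch(feats):
    """Toy forward pass mirroring the DAIH workflow.

    feats: list of (C, H, W) maps already resized to a common (H, W).
    Learned operators are replaced by simple stand-ins.
    """
    # Shared "Conv_GN" stage: per-scale normalization stands in for the
    # shared convolutional layers that denoise each input map.
    shared = [(f - f.mean()) / (f.std() + 1e-6) for f in feats]
    # Feature fusion: concatenate along the channel axis.
    fused = np.concatenate(shared, axis=0)                # (sum(C), H, W)
    # Task decomposition: reweight channels by their GAP response.
    gap = fused.mean(axis=(1, 2), keepdims=True)
    cls_feat = fused * sigmoid(gap)       # classification branch input
    reg_feat = fused * sigmoid(-gap)      # regression branch input
    # Output heads standing in for Conv_Cl / Conv_Reg: a score map (with
    # the sigmoid activation mentioned in the text) and a 4-channel box map.
    cls_out = sigmoid(cls_feat.mean(axis=0))              # (H, W)
    reg_out = np.stack([reg_feat.mean(axis=0)] * 4)       # (4, H, W)
    return cls_out, reg_out
```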

The DAIH module significantly improves the model’s performance in object detection through innovative task decomposition and dynamic alignment techniques. Its reasonable architectural design and clear module functions enable the detection head to provide efficient and accurate detection results in complex detection tasks. Future research can further optimize the parameters and design of each module to enhance the detection head’s efficiency and accuracy, offering more possibilities for the development of object detection.

Design of context-sensitive modulation.

To further enhance the network’s representation capability, we have innovatively proposed a context-sensitive modulation (CSM) module. This module achieves deep enhancement of the input features through multi-level context aggregation and modulation.

The core of context-sensitive modulation lies in its unique architectural design, as illustrated in the algorithmic framework diagram shown in Fig 5. The input feature x is decomposed into a query Q, context C, and gating coefficients G through a preliminary linear projection layer, Conv_linear. This decomposition allows the model to process and modulate different types of information separately. In the multi-level focal convolution layer, we introduce several convolution layers with progressively larger kernels. Through this design, the contextual information at each level can be effectively aggregated and multiplied by the corresponding gating coefficients, achieving layer-by-layer accumulation. The focal convolution layers in the diagram demonstrate how context information is aggregated at each level. Specifically, to capture global context information, we obtain global context through average pooling and multiply it by the final gating coefficient before adding it to the final context representation. After context aggregation, normalization is performed to balance the importance of context from different levels, ensuring a more uniform overall context. Next, the query Q is modulated with the normalized context through element-wise multiplication, effectively integrating the multi-level context information into the query. To further improve the model’s performance, we introduce a posterior linear projection layer after modulation. This layer performs a linear transformation on the modulated output and increases the model’s robustness and generalization ability through a dropout layer [23]. If posterior layer normalization is enabled, layer normalization will be applied at this stage to further stabilize the output.
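The gated, layer-by-layer aggregation at the heart of CSM can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the linear projections and the multi-level focal convolutions are replaced by cheap stand-ins (a copy and repeated box blurs), so only the gated accumulation across levels, the global-pooling level, the normalization, and the final query modulation are shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def box_filter(x):
    # Cheap 3x3 box blur via shifted sums (wrap padding, for brevity only).
    return sum(np.roll(np.roll(x, i, axis=1), j, axis=2)
               for i in (-1, 0, 1) for j in (-1, 0, 1)) / 9.0

def csm_sketch(x, levels=3):
    """Sketch of the context-sensitive modulation flow on a (C, H, W) map."""
    # "Conv_linear" projection -> query, context, and per-level gates.
    q, ctx = x, x.copy()
    gates = sigmoid(np.stack([x.mean(axis=0) + k for k in range(levels + 1)]))
    agg = np.zeros_like(x)
    for k in range(levels):
        # Focal level k: progressively larger receptive field, approximated
        # here by one more pass of box smoothing over the context.
        ctx = box_filter(ctx)
        agg = agg + ctx * gates[k]                    # gated accumulation
    # Global level: average-pooled context times the final gate.
    agg = agg + ctx.mean(axis=(1, 2), keepdims=True) * gates[levels]
    agg = agg / (levels + 1)                          # balance the levels
    return q * agg                                    # modulate the query
```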

The context-sensitive modulation achieves deep enhancement of the input features by introducing multi-level context aggregation and modulation. This approach not only significantly enriches the feature representation but also enhances the overall expressive capability of the network. The CSM mechanism not only excels in visual tasks but also shows its potential and advantages in a wide range of applications. The architecture diagram illustrates the complete process from input to output, intuitively showing how the focal modulation mechanism progressively enhances the input features through various steps. This design is not only theoretically innovative but also demonstrates excellent performance and application prospects in practice.

Backbone network repvit.

The backbone network RepViT primarily employs transfer learning methods, leveraging the knowledge of pre-trained models to achieve rapid convergence and significantly enhance performance on new tasks [24]. In infrared target detection tasks, the introduction of RepViT significantly improved YOLOv8’s detection accuracy and real-time performance. Specifically, RepViT optimized the macro-architectural elements of the network, such as the initial layers, downsampling layers, and classifiers, achieving more efficient feature processing and target recognition. This is particularly important for infrared images in complex backgrounds and diverse target scenarios, enabling the system to excel in various challenging application environments.

By adopting RepViT as the backbone network, YOLOv8 demonstrated high accuracy and low inference latency in practical applications, particularly excelling on mobile devices and embedded systems. This innovation not only expands YOLOv8’s application prospects in infrared target detection but also provides new insights and directions for research in related fields.

Three loss functions are proposed.

In infrared target detection, to improve the YOLOv8 model’s detection accuracy and localization precision, three novel loss functions—focalerDIoU, focalerGIOU, and focalerShapeIoU—were proposed. FocalerGIOU integrates IoU with the smallest enclosing box area, enabling effective optimization even with non-overlapping boxes by providing a broader global measure. FocalerDIoU combines IoU with center point distance, enhancing box positioning accuracy. FocalerShapeIoU extends this by incorporating shape matching alongside IoU and center distance, optimizing the predicted box across overlap, position, and shape dimensions. These enhancements boost accuracy and robustness in infrared target detection. Developed from the Focaler-IoU concept [25], these functions innovatively blend bounding box adjustment strategies from GIoU [26], DIoU [27], and ShapeIoU [28]. The principle of focalerShapeIoU is outlined below:

$$\mathrm{IoU} = \frac{\left|B \cap B^{gt}\right|}{\left|B \cup B^{gt}\right|} \tag{1}$$

$$\mathrm{IoU}^{focaler} = \begin{cases} 0, & \mathrm{IoU} < d \\ \dfrac{\mathrm{IoU} - d}{u - d}, & d \le \mathrm{IoU} \le u \\ 1, & \mathrm{IoU} > u \end{cases} \tag{2}$$

$$distance^{shape} = hh \cdot \frac{(x_c - x_c^{gt})^2}{c^2} + ww \cdot \frac{(y_c - y_c^{gt})^2}{c^2} \tag{3}$$

$$\Omega^{shape} = \sum_{t=w,h} \left(1 - e^{-\omega_t}\right)^{\theta}, \quad \theta = 4 \tag{4}$$

$$L_{ShapeIoU} = 1 - \mathrm{IoU} + distance^{shape} + \frac{1}{2}\,\Omega^{shape} \tag{5}$$

$$L_{focalerShapeIoU} = L_{ShapeIoU} + \mathrm{IoU} - \mathrm{IoU}^{focaler} \tag{6}$$

Formula (1) calculates IoU to measure overlap between the predicted and target boxes. Formula (2) defines Focaler-IoU, where the adjustment parameters d and u rescale IoU, clamped between 0 and 1. Formulas (3) and (4) compute the distance and shape loss components of ShapeIoU, assessing box shape similarity via width and height differences, with exponential functions and higher powers enhancing sensitivity. Formula (5) yields the ShapeIoU loss, while formula (6) details the derivation of focalerShapeIoU. This loss function integrates shape differences for precise bounding box similarity assessment and leverages the Focaler-IoU strategy to focus on difficult samples, boosting training effectiveness.
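The formulas above can be expressed compactly in code. The sketch below implements focalerShapeIoU for one axis-aligned box pair, following the cited Shape-IoU and Focaler-IoU definitions; the hyper-parameter values (d, u, scale, theta) are illustrative assumptions, not values reported by this paper.

```python
import numpy as np

def iou_xyxy(b1, b2):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def focaler_iou(iou, d=0.0, u=0.95):
    """Focaler-IoU rescaling with interval [d, u] (formula (2))."""
    return float(np.clip((iou - d) / (u - d), 0.0, 1.0))

def focaler_shape_iou_loss(pred, gt, d=0.0, u=0.95, scale=1.0, theta=4):
    """Sketch of focalerShapeIoU for one (x1, y1, x2, y2) box pair."""
    iou = iou_xyxy(pred, gt)
    # Enclosing-box diagonal squared (c^2).
    ex1, ey1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    ex2, ey2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    # Shape-aware weights ww, hh from the ground-truth box shape.
    ww = 2 * wg ** scale / (wg ** scale + hg ** scale)
    hh = 2 * hg ** scale / (wg ** scale + hg ** scale)
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cxg, cyg = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    # Distance term (formula (3)).
    dist = hh * (cx - cxg) ** 2 / c2 + ww * (cy - cyg) ** 2 / c2
    # Shape term (formula (4)): width/height mismatch via (1 - e^-w)^theta.
    ow = hh * abs(w - wg) / max(w, wg)
    oh = ww * abs(h - hg) / max(h, hg)
    omega = (1 - np.exp(-ow)) ** theta + (1 - np.exp(-oh)) ** theta
    l_shape = 1 - iou + dist + 0.5 * omega         # formula (5)
    return l_shape + iou - focaler_iou(iou, d, u)  # formula (6)
```

For a perfect prediction the loss goes to zero, while misaligned or mis-shaped boxes are penalized through the distance and shape terms in addition to IoU.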

The focalerDIoU loss function combines IoU with the center point distance, aiming to more effectively assess and optimize the localization of the detection box. The principle of DIoU is shown in formula (7), where ρ is the Euclidean distance between the center points of the predicted box and the target box, and c is the diagonal length of the minimum enclosing box surrounding the predicted and target boxes.

$$\mathrm{DIoU} = \mathrm{IoU} - \frac{\rho^2\!\left(b, b^{gt}\right)}{c^2} \tag{7}$$

$$L_{focalerDIoU} = L_{DIoU} + \mathrm{IoU} - \mathrm{IoU}^{focaler} \tag{8}$$

The principle of focalerDIoU is shown in formula (8). Building on the original DIoU’s consideration of the minimum enclosing box for bounding boxes, focalerDIoU enhances the focus on difficult samples using the Focaler-IoU focal loss strategy, effectively improving the model’s detection performance by more accurately assessing the similarity of bounding boxes.

$$\mathrm{GIoU} = \mathrm{IoU} - \frac{A_c - A_u}{A_c} \tag{9}$$

$$L_{focalerGIoU} = L_{GIoU} + \mathrm{IoU} - \mathrm{IoU}^{focaler} \tag{10}$$

The principle of the GIoU loss function is shown in Formula (9). Here, Ac is the area of the minimum enclosing box surrounding the predicted and target boxes, and Au is the area of the union of the predicted and target boxes. FocalerGIOU considers global information by incorporating the area of the minimum enclosing box, providing a more meaningful metric when the boxes do not completely overlap. The principle of focalerGIOU is shown in Formula (10).
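Both variants follow the same pattern as focalerShapeIoU: take the base loss and add the Focaler-IoU rescaling term. The self-contained sketch below illustrates focalerDIoU and focalerGIOU for one axis-aligned box pair; d and u are illustrative values, and the helper names are assumptions for this example only.

```python
import numpy as np

def _iou_parts(pred, gt):
    """Intersection, areas, and enclosing box for two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    a1 = (pred[2] - pred[0]) * (pred[3] - pred[1])
    a2 = (gt[2] - gt[0]) * (gt[3] - gt[1])
    enc = (min(pred[0], gt[0]), min(pred[1], gt[1]),
           max(pred[2], gt[2]), max(pred[3], gt[3]))
    return inter, a1, a2, enc

def focaler_iou(iou, d=0.0, u=0.95):
    """Focaler-IoU rescaling over [d, u]; d and u are illustrative values."""
    return float(np.clip((iou - d) / (u - d), 0.0, 1.0))

def focaler_diou_loss(pred, gt, d=0.0, u=0.95):
    """focalerDIoU sketch: DIoU loss plus the Focaler-IoU rescaling term."""
    inter, a1, a2, enc = _iou_parts(pred, gt)
    iou = inter / (a1 + a2 - inter)
    c2 = (enc[2] - enc[0]) ** 2 + (enc[3] - enc[1]) ** 2  # enclosing diag^2
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) / 2) ** 2 \
         + (((pred[1] + pred[3]) - (gt[1] + gt[3])) / 2) ** 2  # center dist^2
    l_diou = 1 - (iou - rho2 / c2)
    return l_diou + iou - focaler_iou(iou, d, u)

def focaler_giou_loss(pred, gt, d=0.0, u=0.95):
    """focalerGIOU sketch: GIoU loss plus the Focaler-IoU rescaling term."""
    inter, a1, a2, enc = _iou_parts(pred, gt)
    au = a1 + a2 - inter                                   # union area
    ac = (enc[2] - enc[0]) * (enc[3] - enc[1])             # enclosing area
    iou = inter / au
    l_giou = 1 - (iou - (ac - au) / ac)
    return l_giou + iou - focaler_iou(iou, d, u)
```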

5. Experimental results and analysis

5.1 Datasets and training configurations

The dataset used in this experiment is FLIR_ADAS_v2, released in 2022. It contains 9,233 visible-spectrum images, 9,711 thermal-spectrum images, and 7,498 matched thermal-visible video frames. FLIR ADAS v2 is an enhanced and expanded thermal imaging dataset designed for the development and testing of Advanced Driver Assistance Systems (ADAS) and autonomous driving technologies, comprising over 26,000 images captured under various conditions. This experiment focuses on the thermal-image splits: the training set (images_thermal_train) contains 8,862 images, the validation set (images_thermal_val) contains 1,366 images, and the test set (images_thermal_test) contains 3,674 images. Nine categories were selected from the training set. Fig 6 shows the data distribution for the detection task: Fig 6-A is a bar chart in which the category 'person' has the highest number of instances, close to 70,000, followed by 'bike' with about 50,000; Fig 6-B illustrates the distribution, size, and position of the bounding boxes; Fig 6-C is a two-dimensional scatter plot showing a dense distribution of target widths and heights, with dark banded areas in the middle; Fig 6-D is similar to Fig 6-C but more dispersed, with points concentrated in the lower left corner and in the middle, indicating the diversity of target sizes under different conditions.

thumbnail
Fig 6. Number of categories and distribution of categories in the dataset.

https://doi.org/10.1371/journal.pone.0324700.g006

For the classification of small, medium, and large objects in the dataset, this paper defines a small object as one whose bounding box area is less than 32×32 pixels, a medium object as one whose area is between 32×32 and 96×96 pixels, and a large object as one whose area exceeds 96×96 pixels. Table 1 reports the statistics from the validation set. From Table 1, it can be seen that small objects account for a large proportion of every category: small objects make up 87.9% of the dataset overall and exceed 70% within each category, indicating that the dataset is predominantly composed of small objects.
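The size buckets defined above amount to a simple area threshold on each bounding box, which can be written directly as:

```python
def size_class(width, height):
    """Size bucket for one bounding box, following the paper's definition:
    small < 32*32 px, medium between 32*32 and 96*96 px, large > 96*96 px
    (all thresholds applied to the bounding-box area)."""
    area = width * height
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"
```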

For dataset preprocessing, the FLIR annotations are not in YOLO format, so a script converts the JSON annotation file (index.json) provided with the dataset into YOLO labels for each image. The script creates a folder named labels inside the FLIR dataset folder, containing one label .txt file per image (named after the image), with bounding boxes and classes in YOLO format. It also writes a file called all_image_names.txt listing every image in the dataset with the input directory prepended to its name. During conversion, the script previews the images with class-name tags so the user can visually check the validity of the data. Since the detection categories can be set in advance in the script, the nine categories selected in this paper are extracted and the instance count of each category is recorded; these operations complete the preprocessing of the FLIR dataset into YOLO format.
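The core of such a conversion is mapping each annotation's pixel box to YOLO's normalized center format and filtering to the selected categories. The sketch below assumes COCO-style annotation records (a `category_id` and a `bbox` given as [x, y, w, h] in pixels with a top-left origin), which is the usual layout of FLIR's JSON; the function names and the `class_map` parameter are illustrative, not the paper's script.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO-style [x, y, w, h] box (top-left corner, pixels)
    to YOLO format: normalized (cx, cy, w, h)."""
    x, y, w, h = bbox
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)

def annotation_lines(annotations, img_w, img_h, class_map):
    """Build the per-image label-file lines: 'class cx cy w h'.

    class_map maps dataset category ids to contiguous YOLO indices;
    categories outside the map (the unselected ones) are skipped."""
    lines = []
    for ann in annotations:
        if ann["category_id"] not in class_map:
            continue
        cx, cy, w, h = coco_to_yolo(ann["bbox"], img_w, img_h)
        lines.append(
            f'{class_map[ann["category_id"]]} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}')
    return lines
```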

Several parameters need to be configured during the model training process, as shown in Table 2. The SGD optimizer was used, with a momentum value of 0.937, a learning rate of 0.01, a batch size of 16, 100 epochs, and a weight_decay value of 0.0005. The server configuration includes 100 GiB of memory, 30 GiB of video memory, and one NVIDIA A10 GPU. The environment configuration includes modelscope 1.13.1, pytorch 2.1.2, tensorflow 2.14.0, and ubuntu 22.04.

thumbnail
Table 2. Experimental environment and parameter configuration strategy.

https://doi.org/10.1371/journal.pone.0324700.t002

SGD offers stability and strong generalization ability in object detection tasks, efficient convergence when combined with momentum, lower computational overhead, compatibility with the YOLO series, and adaptability to the noise and complex backgrounds encountered in infrared small target detection tasks. Although ADAM is favored in some tasks for its rapid convergence, in the context of MDCFVit-YOLO, SGD is more suitable for maintaining a balance between accuracy and real-time performance while effectively supporting the optimization of innovative modules (MPA, CSM, and DAIH) and loss functions.

5.2 Evaluation of indicators

In this paper, Precision (P), Recall (R) and mean Average Precision (mAP) are used as metrics to evaluate the performance of the model. The evaluation index formula is as follows:

$$P = \frac{TP}{TP + FP} \tag{11}$$

$$R = \frac{TP}{TP + FN} \tag{12}$$

$$AP = \int_0^1 P(R)\,dR \tag{13}$$

$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i \tag{14}$$

In Eqs. (11)-(14), TP denotes the number of positive samples correctly recognized as positive, FP the number of negative samples incorrectly recognized as positive, FN the number of positive samples incorrectly recognized as negative, and n the number of classes in the dataset. FLOPs reflect the complexity of the model to some extent, and FPS indicates the number of images detected per second; these two metrics also serve as important parameters for evaluating the model.
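The counting relations above can be checked with a few lines of code; the function names here are illustrative, and the per-class AP values would in practice come from integrating each precision-recall curve as in Eq. (13).

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts (Eqs. (11) and (12))."""
    return tp / (tp + fp), tp / (tp + fn)

def mean_average_precision(ap_per_class):
    """mAP as the mean of per-class AP values (Eq. (14))."""
    return sum(ap_per_class) / len(ap_per_class)
```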

5.3 Ablation experiments

To verify the effect of the MDCFVit-YOLO model, ablation experiments were carried out for each improvement under the same training strategy; the results are shown in Table 3, with FPS added as a reference metric. To help the reader interpret Table 3: each model is obtained by improving YOLOv8, and a tick marks which components are added on top of the baseline. For example, Model 1 is YOLOv8 with the RepViT backbone; the other rows follow the same convention.

thumbnail
Table 3. Experimental results of the ablation experiment.

https://doi.org/10.1371/journal.pone.0324700.t003

Model 1 replaces the YOLOv8 backbone with RepViT. Compared with the baseline model, mAP and recall improve by 3.2% and 3.3%, respectively. RepViT employs a complex network structure to enhance learning during training and a simplified structure to improve computational efficiency at inference, thereby reducing misjudgments and improving detection accuracy. Model 2 adds the multiplexed interactive attention mechanism: detection accuracy increases by 1.6% over Model 1 while the parameter count decreases slightly, and mAP increases by 1.6% over YOLOv8. Model 3 improves the YOLOv8 detection head with the dynamic automated detection head, reaching an mAP of 63.3%; mAP and precision improve by 2.7% and 1.2% over YOLOv8, respectively. Model 4 replaces the traditional SPPF with the CSM module and optimizes its structure, improving mAP, precision, and recall by 1.2%, 1%, and 1.3%, respectively. Model 5 uses the new focalerShapeIoU loss function; by optimizing the bounding boxes, mAP and recall improve by 1.2% and 1%, respectively. Its parameter count is lower than YOLOv8's, and it achieves the fastest processing speed at 251.2 FPS. The full MDCFVit-YOLO network reaches a detection precision of 75.1%, 1.6% higher than the baseline, with a recall of 58.8% and an mAP of 66.7%, higher than the baseline by 5.9% and 6.1%, respectively. Given the significant performance improvement, the parameter count rising to 20.9M and the 65.2 FPS decrease in processing speed are acceptable.

MDCFVit-YOLO performs best in detecting people and car categories, with mAP50 reaching 77.3% and 85.1% respectively. The mAP50-95 for the car category even reaches 60.7%, demonstrating high detection accuracy and stability. However, the performance is poorer in detecting bicycles and traffic lights, with a significant drop in accuracy, where mAP is only 48.1% and 57.1% respectively. Detailed detection data for each category can be seen in Table 4. Overall, the model performs well in common traffic scenarios, but there is room for improvement in the recognition of details and small objects.

thumbnail
Table 4. Detection Performance of the MDCFVit-YOLO Model on Various Categories.

https://doi.org/10.1371/journal.pone.0324700.t004

5.4 Loss function comparison experiment

The focalerDIoU, focalerGIOU, focalerShapeIoU, GIoU, and DIoU loss functions are compared to determine the most suitable loss function for infrared target detection. The metrics of each loss function are shown in Table 5.

thumbnail
Table 5. Comparative experimental data of loss functions.

https://doi.org/10.1371/journal.pone.0324700.t005

As can be seen from Table 5, the mAP of GIoU is 61.7%, almost the same as that of focalerGIOU and focalerDIoU; DIoU has the smallest mAP at 60.6%, while focalerShapeIoU achieves the best mAP at 62.6%. The detection precision of focalerShapeIoU is also optimal at 78.9%, an improvement of 5.1%, 5%, 8.1%, and 6.2% over focalerDIoU, focalerGIOU, GIoU, and DIoU, respectively. Therefore, focalerShapeIoU performs best overall. To show clearly the effects of the different loss functions on infrared target detection, the loss curves during training and the resulting evaluation metrics are shown in Fig 7 and Fig 8.

thumbnail
Fig 7. Loss function curve graphs.

Figs 7(A), 7(B), and 7(C) show the training set losses; Figs 7(D), 7(E), and 7(F) show the validation set loss curves. Figs 7(A) and 7(D) represent localization loss, 7(B) and 7(E) confidence loss, and 7(C) and 7(F) classification loss.

https://doi.org/10.1371/journal.pone.0324700.g007

thumbnail
Fig 8. Evaluation metrics graphs for different loss functions.

Fig 8(A) is precision, Fig 8(B) is recall, Fig 8(C) is mAP50, Fig 8(D) is mAP95.

https://doi.org/10.1371/journal.pone.0324700.g008

From Fig 7(B), 7(C), 7(E), and 7(F), it can be seen that the classification and confidence losses of the five loss functions follow the same trend; however, Fig 7(A) and 7(D) show that the localization loss curves of GIoU and DIoU fluctuate relatively strongly. Therefore, GIoU and DIoU are discarded and not considered as the loss function of the improved YOLOv8.

As shown in Fig 8, all curves rise rapidly and eventually stabilize. The green line represents focalerShapeIoU, which lies near the top of the charts for precision, recall, and mAP. Therefore, after comprehensive consideration, focalerShapeIoU is used as the loss function for the improved YOLOv8. From the above results, focalerShapeIoU is clearly well suited as the loss for infrared target detection. However, the proposed focalerGIOU and focalerDIoU loss functions also perform well, only slightly below focalerShapeIoU on this task; these two functions might outperform focalerShapeIoU in other detection tasks or computer vision experiments.

5.5 Comparative experiments and their visualization

To further validate the effectiveness and superiority of the proposed algorithm, we tested several state-of-the-art target detection algorithms on the FLIR dataset, with experimental data and detection results visualized for analysis. The performance metrics of each model are presented in Table 6. These models were evaluated based on key indicators, including precision, recall, mAP, and the number of parameters.

Comparative experiments on infrared detection were conducted using YOLOv3, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv10, and MDCFVit-YOLO. Table 6 shows that MDCFVit-YOLO achieves a precision of 0.78, surpassing other models such as YOLOv5 (0.751), YOLOv6 (0.7525), YOLOv7 (0.741), and YOLOv8 (0.735), with improvements ranging from 3.86% to 6.12%. This indicates superior accuracy in identifying infrared targets. In terms of recall, MDCFVit-YOLO reaches 0.586, outperforming YOLOv5 (0.521), YOLOv6 (0.523), YOLOv7 (0.522), and YOLOv8 (0.529) by 6.25% to 12.47%, effectively reducing missed detections. The mAP@0.5 of MDCFVit-YOLO is 0.67, a significant improvement over YOLOv5 (0.61), YOLOv6 (0.588), YOLOv7 (0.612), and YOLOv8 (0.606), with a 9.48% gain compared to YOLOv7, highlighting enhanced overall detection performance. Although MDCFVit-YOLO has 20.9M parameters—higher than YOLOv5 (5M), YOLOv6 (8.3M), YOLOv7 (12.3M), and YOLOv8 (6.2M)—the substantial performance gains justify the increased complexity, striking a balance between model size and effectiveness. These results underscore MDCFVit-YOLO’s superiority in infrared target detection, providing a robust foundation for further research and applications.

To clearly show the detection effect, the mAP of all compared models is visualized in Fig 9, from which it is clear that the MDCFVIT-YOLO model performs best in infrared target detection. To show the per-epoch precision, recall, mAP50, and mAP95 of each comparative model, the results of 100 training epochs of the YOLO series models on infrared target detection are visualized in Fig 10. As can be seen from Fig 10, all evaluation metrics of the proposed model far exceed those of the other YOLO series detection algorithms, especially precision and recall.

thumbnail
Fig 9. mAP results of YOLO series algorithms in infrared target detection.

https://doi.org/10.1371/journal.pone.0324700.g009

thumbnail
Fig 10. Variation in precision, recall, mAP50, and mAP95 of YOLO series models over 100 epochs.

https://doi.org/10.1371/journal.pone.0324700.g010

To highlight the detection effect of the MDCFVIT-YOLO model, a visual comparison of the detection results of the YOLO series algorithms was carried out; the results are shown in Fig 11.

thumbnail
Fig 11. Visualization results of the comparative experiment.

https://doi.org/10.1371/journal.pone.0324700.g011

Infrared object detection performance varies significantly across models. Comparing YOLOv3, YOLOv5, YOLOv7, YOLOv8, and the proposed MDCFVit-YOLO model, as shown in Fig 11, reveals distinct differences. YOLOv3 struggles with frequent missed and false detections, particularly in complex backgrounds, resulting in unstable and poor performance. YOLOv5 and YOLOv7 improve accuracy and reduce errors compared to YOLOv3, but they still fall short in target boundary precision and small target detection. YOLOv8 enhances accuracy and speed, yet it cannot fully eliminate missed detections or boundary issues in certain infrared scenarios. In contrast, the MDCFVit-YOLO model excels in these test cases, effectively capturing multi-scale target features and maintaining high accuracy in complex backgrounds. It minimizes missed and false detections, offering superior boundary precision and small target detection capabilities. Overall, MDCFVit-YOLO outperforms the others, delivering more accurate, stable, and reliable results, marking a significant advancement in infrared target detection.

thumbnail
Fig 12. Model thermal diagrams (a) for the YOLOv8 model (b) for the MDCFVit-YOLO model.

https://doi.org/10.1371/journal.pone.0324700.g012

Fig 12 shows the feature visualization of the detection results of YOLOv8 and the proposed model in infrared scenes. The experimental results clearly show the feature-extraction effectiveness of each network on the images; among them, the MDCFVit-YOLO model performs better in target recognition and achieves a higher recognition rate.

6. Conclusion

This paper proposes the MDCFVit-YOLO model by introducing new backbone networks and detection heads, along with a self-developed model, enhancing the accuracy and real-time performance of infrared vehicle and pedestrian detection. Experimental results demonstrate that the MDCFVit-YOLO model achieves a precision of 78%, a recall of 58.6%, and an mAP of 67%, significantly outperforming other YOLO series models. For small target detection in complex nighttime environments, the model excels in addressing challenges such as low visibility and high interference. Under conditions of varying illumination, multi-target detection, and small target scenarios, it markedly improves small target detection accuracy. This study offers new ideas and methods for advancing infrared vehicle and pedestrian detection. Future research can focus on optimizing the model structure, enhancing detection speed and efficiency, and extending its application to more complex infrared target detection scenarios.

References

  1. Bharati P, Pramanik A. Deep learning techniques—R-CNN to mask R-CNN: a survey. Computational Intelligence in Pattern Recognition: Proceedings of CIPR 2019. 2020:657–68.
  2. Girshick R. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, 2015. 1440–8.
  3. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–49. pmid:27295650
  4. Redmon J, Divvala S, Girshick R. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 779–88.
  5. Shafiee MJ, Chywl B, Li F. Fast YOLO: A fast you only look once system for real-time embedded object detection in video. arXiv preprint. 2017.
  6. Liu W, Anguelov D, Erhan D. SSD: Single shot multibox detector. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I, 2016. 21–37.
  7. Zhang T, Cao S, Pu T. AGPCNet: Attention-guided pyramid context networks for infrared small target detection. arXiv preprint. 2021.
  8. Zuo Z, Tong X, Wei J. AFFPN: Attention fusion feature pyramid network for small infrared target detection. Remote Sensing. 2022;14(14):3412.
  9. Zhang Y, Zhang F, Zhou Y, et al. ACA-Net: adaptive context-aware network for basketball action recognition. Frontiers in Neurorobotics. 2024;18:1471327.
  10. Zuo Z, Tong X, Wei J. AFFPN: Attention fusion feature pyramid network for small infrared target detection. Remote Sensing. 2022;14(14):3412.
  11. Zhang H, Dai C, Chen C, et al. One stage multi-scale efficient network for underwater target detection. Review of Scientific Instruments. 2024;95(6).
  12. Li B, Xiao C, Wang L, Wang Y, Lin Z, Li M, et al. Dense nested attention network for infrared small target detection. IEEE Trans Image Process. 2023;32:1745–58. pmid:35994532
  13. Lin TY, Goyal P, Girshick R. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, 2017. 2980–8.
  14. Yu C, Liu Y, Wu S. Infrared small target detection based on multiscale local contrast learning networks. Infrared Physics & Technology. 2022;123:104107.
  15. Li H, Yang J, Xu Y. Mitigate target-level insensitivity of infrared small target detection via posterior distribution modeling. arXiv preprint. 2024.
  16. Wang S, Jiang H, Li Z. PHSI-RTDETR: A lightweight infrared small target detection algorithm based on UAV aerial photography. Drones. 2024;8(6):240.
  17. Hao Z, Wang Z, Xu X, Jiang Z, Sun Z. YOLO-ISTD: An infrared small target detection method based on YOLOv5-S. PLoS One. 2024;19(6):e0303451. pmid:38870195
  18. Mou X, Lei S, Zhou X. YOLO-FR: A YOLOv5 infrared small target detection algorithm based on feature reassembly sampling method. Sensors (Basel). 2023;23(5):2710. pmid:36904912
  19. Zhou J, Zhang B, Yuan X. YOLO-CIR: The network based on YOLO and ConvNeXt for infrared object detection. Infrared Physics & Technology. 2023;131:104703.
  20. Liang S, Wu H, Zhen L, et al. Edge YOLO: Real-time intelligent object detection system based on edge-cloud cooperation in autonomous vehicles. IEEE Transactions on Intelligent Transportation Systems. 2022;23(12):25345–60.
  21. Gong Y, Yu X, Ding Y. Effective fusion factor in FPN for tiny object detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021. 1160–8.
  22. Wang A, Chen H, Lin Z. RepViT: Revisiting mobile CNN from ViT perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 15909–20.
  23. Srivastava N, Hinton G, Krizhevsky A. Dropout: a simple way to prevent neural networks from overfitting. J Machine Learning Research. 2014;15(1):1929–58.
  24. 24. Zhuang F, Qi Z, Duan K. A comprehensive survey on transfer learning. Proceedings of the IEEE. 2020;109(1):43–76.
  25. 25. Zhang H, Zhang S. Focaler-IoU: more focused intersection over union loss. arXiv preprint. 2024.
  26. 26. Hou Z, Liu X, Chen L. Object detection algorithm for improving non-maximum suppression using GIoU. In: IOP Conference Series: Materials Science and Engineering, 2020. 012062.
  27. 27. Zheng Z, Wang P, Liu W. Distance-IoU loss: Faster and better learning for bounding box regression. In: Proceedings of the AAAI conference on artificial intelligence, 2020. 12993–3000.
  28. 28. Zhang H, Zhang S. Shape-IoU: more accurate metric considering bounding box shape and scale. arXiv preprint. 2023.