
FCMI-YOLO: An efficient deep learning-based algorithm for real-time fire detection on edge devices

  • Junjie Lu,

    Roles Conceptualization, Writing – original draft

    Affiliation College of Photonic and Electronic Engineering, Fujian Normal University, Fujian, China

  • Yuchen Zheng,

    Roles Conceptualization, Resources, Validation, Writing – original draft

    Affiliation College of Photonic and Electronic Engineering, Fujian Normal University, Fujian, China

  • Liwei Guan ,

    Roles Methodology, Project administration, Supervision, Writing – review & editing

    guanlw@fjnu.edu.cn

    Affiliation College of Physics and Energy, Fujian Normal University, Fujian, China

  • Bing Lin,

    Roles Writing – review & editing

    Affiliation College of Physics and Energy, Fujian Normal University, Fujian, China

  • Wenzao Shi,

    Roles Formal analysis, Funding acquisition, Project administration

    Affiliation College of Photonic and Electronic Engineering, Fujian Normal University, Fujian, China

  • Junyan Zhang,

    Roles Data curation

    Affiliation College of Photonic and Electronic Engineering, Fujian Normal University, Fujian, China

  • Yunping Wu

    Roles Funding acquisition, Investigation, Project administration, Writing – review & editing

    Affiliation College of Photonic and Electronic Engineering, Fujian Normal University, Fujian, China

Abstract

The rapid development of Internet of Things (IoT) technology and deep learning has propelled the deployment of vision-based fire detection algorithms on edge devices, sharpening the trade-off between accuracy and inference speed under hardware resource constraints. To address this issue, this paper proposes FCMI-YOLO, a real-time fire detection algorithm optimized for edge devices. Firstly, the FasterNext module is proposed to reduce computational cost and enhance detection precision through lightweight design. Secondly, the Cross-Scale Feature Fusion Module (CCFM) and the Mixed Local Channel Attention (MLCA) mechanism are incorporated into the neck network to improve detection performance for small fire targets and reduce resource consumption. Finally, the Inner-DIoU loss function is proposed to optimize bounding box regression. On a custom fire dataset, FCMI-YOLO increases mAP@0.5 by 1.5%, reduces parameters by 40%, and lowers GFLOPs to 28.9% of YOLOv5s, demonstrating its practical value for real-time fire detection in edge scenarios with limited computational resources. The core code and dataset are available at https://github.com/JunJieLu20230823/code.git.

1 Introduction

Fire is a frequent disaster that poses a significant threat to public safety and social development. Its progression is typically divided into four stages: ignition, growth, full development, and decay [1]. Each stage has distinct characteristics that require corresponding prevention and specific control measures to effectively manage the fire. The most effective strategy for fire prevention and suppression is to detect and extinguish fires during the incipient stage, preventing the fire from escalating into rapid growth or full development stages.

Fire detection systems have transitioned from traditional manual observation and routine patrols to IoT-based technologies [2]. By deploying temperature sensors [3], smoke sensors [4], and light sensors [5], these systems enable real-time monitoring of environmental changes, such as temperature, smoke particles, and spectral shifts, to achieve automated and intelligent fire detection and alerts. However, the limited coverage of sensor deployments leads to monitoring blind spots, and the poor sensor density directly impacts the success of fire detection in the ignition stage.

With the development of advanced object detection approaches, vision-based fire detection has gained significant attention [6,7]. These approaches utilize cameras or other image acquisition devices to capture real-time video streams and detect fires through image processing and analysis algorithms [8–10]. Compared to IoT sensor-based technologies, vision-based approaches offer significant advantages in spatial coverage, sensitivity, and cost efficiency, particularly in large-scale monitoring and dynamic environments [7].

Deep learning-based object detection approaches have revolutionized fire detection as a novel advancement in computer vision and supervised learning. These approaches leverage neural network models to automatically learn and extract multi-layered abstract high-dimensional features from extensive datasets, offering superior robustness and adaptability. This enables them to effectively handle complex environmental interferences while achieving exceptional detection accuracy and generalization capabilities. However, as the accuracy of deep learning-based object detection approaches continues to improve, the computational complexity and resource demands of object detection models have increased significantly, severely constraining their feasibility for deployment on low-power embedded devices. This technical bottleneck has become even more pronounced in the IoT era, where the large-scale proliferation of embedded devices necessitates more efficient solutions. Cao et al. [11] proposed the YOLO-SF algorithm, which combines instance segmentation technology with YOLOv7 while incorporating the MobileViTv2 module and the Convolutional Block Attention Module (CBAM) to enhance fire feature extraction capabilities. Although the accuracy of detection increased by 4%, the number of parameters doubled, and the Frames Per Second (FPS) nearly halved, which poses a significant obstacle to deployment on mobile embedded devices. To enhance multi-scale fire feature capture, Wang et al. [12] optimized the feature pyramid network in YOLOv8 using an FSPPF structure, introduced an additional small-object detection layer to extend multi-scale perception, and incorporated Dynamic Snake Convolution (DSC) to enhance feature fusion. While this algorithm improves fire detection mAP@0.5 by 1.9%, it also increases model parameters by 1.6 times and FLOPs by 2.1 times, imposing a significant burden on real-time processing in edge computing environments. He et al. 
[13] removed the FPN-PAN structure from the YOLOv5 neck network and merged the three detection heads into a single-head prediction head, significantly reducing model parameters and achieving a 29ms per-frame inference speed on edge devices. However, this algorithm leads to a 3.6 percentage point decrease in Average Precision (AP), making it less robust in complex environments.

Although existing deep learning methods have been widely applied to fire detection, they often face a trade-off between accuracy and inference speed. The proposed FCMI-YOLO, an enhancement of YOLOv5s, not only improves the accuracy of fire detection but also optimizes inference speed on edge devices. In comparison to the studies mentioned above, our method strikes a superior balance between accuracy and inference speed, featuring a lightweight design, efficient deployment, and real-time detection capabilities. The contributions are as follows:

  • A lightweight FasterNext module was designed to replace the C3 module in the backbone network, which reduces both the number of parameters and computational load while enhancing the model’s feature extraction capability in complex environments.
  • The neck network was optimized by integrating the CCFM and the MLCA mechanism, which first employed lightweight convolutions to extract features from the deep network and then utilized the MLCA mechanism to focus on key information, effectively enhancing the model’s detection performance for small targets.
  • The Inner-IoU loss function was introduced to optimize bounding box regression. By incorporating auxiliary bounding boxes, the model’s sensitivity to fire scale variations was improved, resulting in enhanced localization accuracy.
  • A dedicated fire dataset was constructed to support the algorithm’s training and evaluation. Experimental results demonstrated that FCMI-YOLO outperforms other YOLO-based algorithms, achieved superior comprehensive performance, and enabled real-time monitoring of medium- and long-range fires on edge devices.
  • FCMI-YOLO was deployed on the Orange Pi 5 Plus edge device, utilized an asynchronous multi-threading strategy to accelerate inference speed, and achieve efficient real-time detection of medium- and long-range fire sources.

The remainder of this paper is organized as follows: Sect 2 provides a comprehensive review of existing deep learning-based methods for fire detection. Sect 3 presents a detailed description of the improvements made to the YOLOv5s algorithm and introduces the FCMI-YOLO algorithm, specifically designed for enhanced fire detection. Sect 4 evaluates the performance metrics of the FCMI-YOLO algorithm on both personal computers (PC) and edge devices. Finally, Sect 5 summarizes the research contributions and concludes the paper.

2 Related work

Vision-based fire detection methods can be broadly divided into two categories: methods based on handcrafted feature extraction and deep learning. Handcrafted feature extraction methods detect fire by extracting static features such as color, texture, and blur level, as well as dynamic features like shape changes, flickering, and motion direction. These methods rely on classical computer vision methods, including probability density functions, color space analysis (e.g., YUV [14] and HSV [15]), and texture analysis (e.g., SIFT [16] and HOG [17]), combined with classifiers such as SVM and AdaBoost for fire detection. However, they are susceptible to environmental conditions, such as variations in lighting and airflow, leading to high false positives and limited robustness in long-range detection or complex scenarios. Deep learning methods have addressed these limitations through end-to-end feature learning mechanisms. Convolutional neural network (CNN)-based methods can automatically extract multi-scale abstract features from large-scale data and construct hierarchical representations that are robust to environmental interference [18,19]. As a result, they have been widely applied to domains including autonomous driving [20], smart agriculture [21], security surveillance [22], and industrial IoT [23–26].

In the context of fire detection, CNN-based object detection algorithms can be divided into two categories: two-stage and one-stage algorithms [8]. Two-stage detection algorithms, represented by Faster-RCNN [27] and Fast-RCNN [28], generate candidate regions and progressively optimize target recognition results, achieving high detection accuracy. However, these algorithms involve extensive redundant computations during region generation and classification, resulting in slower inference speeds that fail to meet real-time detection requirements. In contrast, one-stage detection algorithms, such as YOLO [29] and SSD [30], simplify the detection pipeline by performing target classification and bounding box regression directly on the image, bypassing redundant steps like region proposal generation and classification optimization. These algorithms significantly reduce computational overhead and shorten information processing pathways, providing a notable advantage in processing speed. Among these, YOLO stands out for its superior inference speed and detection accuracy, making it the preferred choice for deployment on edge devices, particularly in fire detection applications with high demands for real-time performance and accuracy [3134].

Recent YOLO-based studies have focused on improving both accuracy and efficiency in fire detection tasks. Jiang et al. [35] proposed the DG-YOLO model based on YOLOv8, which integrates a deformable attention mechanism, a lightweight feature extraction module (GSC2f) for cross-scale edge information fusion, and a dedicated small-target detector to enhance texture detail capture. Under complex background interference, the model achieves a 10.7% improvement in mAP@0.5. To address the limitations of single model feature representation, Xu et al. [36] proposed a fire detection method that integrates YOLOv5 and EfficientDet, with EfficientDet capturing global features to reduce false positives caused by overemphasis on local details, resulting in a 2.5%-10.9% improvement in detection performance and a 51.3% reduction in false positives. For complex and dynamic scenarios, Luan et al. [37] proposed a lightweight fire detection algorithm based on YOLOX, which integrates a multi-level feature extraction structure (CSP-ML) and the CBAM attention mechanism to enhance small-target recognition. By leveraging multi-branch feature fusion, the model captures both the positional sensitivity of shallow high-resolution features and the semantic abstraction of deep features, resulting in a 6.4% improvement in mAP@0.5 under smoke occlusion and dynamic background interference. Zhao et al. [38] proposed a Fire Segmentation-Detection Framework (FSDF), which enhances fire representation by jointly extracting image color and texture features, and integrates YOLOv8 with Vector Quantized Variational Autoencoders (VQ-VAE) to enable supervised target localization and unsupervised fire feature learning, respectively, thereby improving recognition accuracy in complex scenarios.

To enable efficient deployment on edge devices, Wang et al. [39] proposed Light-YOLOv4, which replaces the original backbone network with MobileNetv3 to reduce parameters, incorporates a depthwise separable attention module that combines depthwise separable convolutions and a coordinate attention mechanism to minimize computational redundancy and introduces a Bidirectional Feature Pyramid Network (BiFPN) to enhance multi-scale feature interaction, resulting in an 80.9% reduction in parameters while maintaining 85.64% accuracy on resource-constrained hardware devices. To further enhance lightweight fire detection, Huang et al. [40] proposed YOLO-ULNet, which employs Grouped Channel Shuffle (GCS) units within Light-weight Feature Extraction (LFE) units to reduce parameters, incorporates a Multipath Aggregation Feature Pyramid (MAFP) structure to enable efficient multi-dimensional feature fusion, and integrates channel pruning and feature distillation techniques to compress the model, resulting in a detection speed of 24.57 FPS and 74.50% accuracy on a Raspberry Pi 4B, with only 0.19M parameters and 0.4 GFLOPs.

The studies above provide valuable insights into the development of fire detection models but also reveal certain limitations. Models that emphasize feature enhancement often introduce additional complexity, leading to increased parameter counts and computational costs that hinder real-time deployment. In contrast, lightweight models tend to suffer from reduced detection precision, especially when dealing with small or distant targets. Striking a balance between detection accuracy and inference speed remains a core challenge, particularly under the resource constraints of edge devices. To address these challenges, this paper proposes FCMI-YOLO, which achieves a trade-off between accuracy and inference speed on edge devices, effectively mitigating the issues of high resource consumption and low detection precision.

3 Proposed methods and model architecture

This section provides an overview of the FCMI-YOLO algorithm, highlighting its key improvements and optimizations for real-time fire detection on edge devices. Based on YOLOv5s architecture, FCMI-YOLO aims to enhance detection accuracy while minimizing computational complexity, making it suitable for resource-constrained environments.

Sect 3.1 introduces the overall architecture, describing how the input layer, backbone, neck, and detection head work together for efficient fire detection. In Sect 3.2, the FasterNext module is presented, replacing the C3 module in the backbone to reduce complexity while maintaining feature extraction performance. Sect 3.3 details the integration of the Cross-Scale Feature Fusion Module (CCFM) and the Mixed Local Channel Attention (MLCA) mechanism into the neck network to improve detection accuracy, particularly for small fire targets, while minimizing resource usage. In Sect 3.4, the Inner-IoU loss function is discussed, optimizing bounding box regression to enhance the model's sensitivity to variations in fire target size and shape. These improvements collectively form a robust, efficient, and real-time fire detection solution for edge devices.

3.1 FCMI-YOLO

The FCMI-YOLO algorithm is based on the YOLOv5s architecture, which consists of four main components: the input layer, backbone network, neck network, and detection head, working together to achieve end-to-end object detection [41]. The input layer enhances the model’s generalization ability through data normalization and augmentation; the backbone network extracts multi-scale features using hierarchical down-sampling; the neck network fuses both detailed and semantic features from the backbone network; and the detection head uses an anchor-based mechanism to generate multi-scale feature maps for precise detection of objects of different sizes.

To enhance the performance of YOLOv5s on resource-constrained edge devices, while meeting the requirements for high accuracy and real-time fire detection, FCMI-YOLO introduces several key optimizations to the overall architecture. As shown in Fig 1, the improvements include the following key aspects. First, in the backbone network, all C3 modules are replaced with the newly designed FasterNext module. Based on the design philosophy of FasterNet, this module significantly reduces the model’s complexity while maintaining strong feature extraction capabilities. Second, at the neck network, a Cross-Scale Feature Fusion Module (CCFM) is introduced, which strengthens the fusion of multi-level features through lightweight convolution operations, effectively optimizing the representation of multi-scale features. Furthermore, a Mixed Local Channel Attention (MLCA) mechanism is added, combining both spatial and channel attention, enabling the model to focus more accurately on key regions, thus improving detection performance for small fire targets. Finally, in terms of the loss function, the traditional CIoU loss is replaced with Inner-DIoU, which introduces an auxiliary bounding box mechanism to enhance the model’s sensitivity to variations in fire target size and irregular shapes, significantly improving bounding box localization accuracy.

3.2 FasterNext network

The backbone network of YOLOv5s uses the C3 module, which consists of three CBS modules and several Bottleneck modules. Although the Bottleneck module helps mitigate the issue of gradient vanishing, it still results in some loss of feature information in complex scenarios. Additionally, its large number of parameters poses challenges for deployment on edge devices. To address these issues, the design principles of the FasterNet network [42] were adopted, leading to the development of a novel network structure named FasterNext, which replaces the C3 module in the backbone network. The structures of FasterNet, S-FasterNet, and FasterNext are shown in Fig 2(a), 2(b), and 2(c), with S-FasterNet being a key component of the FasterNext network.

Fig 2. The structure of FasterNext.

(a) FasterNet. (b) S-FasterNet. (c) FasterNext.

https://doi.org/10.1371/journal.pone.0329555.g002

As shown in Fig 2(a), the FasterNet network consists of a PConv layer and two PWConv layers, forming a reverse residual block. The working principle of the PConv layer is illustrated in Fig 3. Unlike conventional convolution, PConv only performs convolution operations on a subset of input channels, while the remaining channels remain unchanged, making more efficient use of computational resources. To achieve continuous or regular memory access, PConv uses either the first or last set of consecutive channels as representatives of the entire feature map during computation. Assuming the input and output feature maps have the same number of channels, the FLOPs (floating point operations) of PConv can be calculated as h × w × k² × c_p², where h and w represent the height and width of the feature map, k is the kernel size, and c_p is the number of channels involved in the partial convolution. When PConv uses only one-quarter of the channels, its FLOPs are reduced to 1/16 of a regular convolution. To ensure feature diversity and low latency, batch normalization and ReLU activation layers are placed exclusively between the two PWConv layers. However, this design might be insufficient for handling complex feature extraction tasks.
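The 1/16 figure follows directly from the quadratic dependence of convolution cost on the channel count. A minimal arithmetic check (the feature-map dimensions below are illustrative, not from the paper):

```python
# Sanity check of the PConv FLOPs claim: convolving only c/4 of the
# channels costs 1/16 of a regular convolution over all c channels.
def conv_flops(h, w, k, c_in, c_out):
    # multiply-accumulate count for a dense k x k convolution, same-size output
    return h * w * k * k * c_in * c_out

h, w, k, c = 40, 40, 3, 128
full = conv_flops(h, w, k, c, c)               # regular conv: h*w*k^2*c^2
partial = conv_flops(h, w, k, c // 4, c // 4)  # PConv: h*w*k^2*(c/4)^2
print(partial / full)  # 0.0625, i.e. 1/16
```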

According to Fig 2(b), the S-FasterNet module replaces the original ReLU activation function with SiLU and applies batch normalization and SiLU after each PWConv layer to enhance nonlinear representation and feature extraction. This design enhances the network’s nonlinear transformation capability and improves feature extraction performance under complex scenarios. ReLU is one of the most common activation functions in deep neural networks, favored for its computational simplicity and sparse activation properties. It is defined in Eq 1.

ReLU(x) = max(0, x)  (1)

However, ReLU has several limitations. When the input is negative, the gradient becomes zero, which may lead to the permanent deactivation of certain neurons during training. Moreover, its non-differentiability at the origin and complete suppression of negative values can hinder gradient propagation and reduce the network’s expressive capacity. To overcome these issues, the SiLU activation function is employed, defined in Eq 2.

SiLU(x) = x · σ(x) = x / (1 + e^(−x))  (2)

As shown in Fig 4, SiLU offers a smooth and continuous derivative, enabling a more stable gradient flow. Unlike ReLU, it provides a moderated response to negative inputs rather than discarding them entirely, thereby improving feature retention and model expressiveness. In fire detection tasks, the nonlinear behavior of SiLU allows more accurate modeling of fire characteristics, leading to better detection performance in complex environments.
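The difference between the two activations is easy to see numerically; a negative input is discarded by ReLU but retained in damped form by SiLU (the sample value below is illustrative):

```python
import math

def relu(x):
    return max(0.0, x)

def silu(x):
    # SiLU(x) = x * sigmoid(x): smooth, with a damped response to negatives
    return x / (1.0 + math.exp(-x))

# ReLU zeroes negative inputs outright; SiLU preserves a small signal.
print(relu(-1.0), round(silu(-1.0), 4))  # 0.0 -0.2689
```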

Fig 4. Comparison curve of ReLU and SiLU activation functions.

https://doi.org/10.1371/journal.pone.0329555.g004

According to Fig 2(c), the FasterNext network mainly consists of two parallel branches. The first branch contains one CBS module and multiple stacked S-FasterNet modules, while the second branch passes through a single CBS module. The two branches are merged using a Concat operation, and finally, the output feature map is generated through a CBS module. As shown in Table 1, FasterNext significantly reduces the number of parameters compared to the C3 module, achieving approximately a 31% reduction while maintaining the same input image size and number of channels.
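The two-branch layout can be sketched at the shape level as follows. This is an illustration under stated assumptions, not the paper's implementation: cbs stands in for a Conv-BN-SiLU block as a 1 × 1 channel projection followed by SiLU, the stacked S-FasterNet modules are approximated by residual projections, and the half-channel branch widths are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def cbs(x, c_out):
    # Stand-in for a Conv-BN-SiLU block: 1x1 channel projection + SiLU.
    # (Illustrative only; kernel sizes, stride, and BN are omitted.)
    wgt = rng.standard_normal((c_out, x.shape[1])) * 0.1
    y = np.einsum('oc,bchw->bohw', wgt, x)
    return y / (1.0 + np.exp(-y))  # SiLU: y * sigmoid(y)

def fasternext(x, n_blocks=2):
    c = x.shape[1]
    branch1 = cbs(x, c // 2)
    for _ in range(n_blocks):
        branch1 = branch1 + cbs(branch1, c // 2)  # stand-in for S-FasterNet
    branch2 = cbs(x, c // 2)                      # second (shortcut) branch
    merged = np.concatenate([branch1, branch2], axis=1)  # Concat on channels
    return cbs(merged, c)                         # final CBS restores channels

x = rng.standard_normal((1, 64, 8, 8))
print(fasternext(x).shape)  # (1, 64, 8, 8)
```

The Concat along the channel axis is the key step: the output keeps the input's channel count while most computation happens at half width.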

3.3 Improvements in the neck network

3.3.1 Cross-scale feature fusion module.

In the neck network of YOLOv5s, the Path Aggregation Network (PAN) structure is employed to fuse target features at different scales, thereby enhancing detection performance. However, this feature fusion approach has limitations, including high computational complexity, excessive memory consumption, and insufficient capability to capture small target information. To address these issues, a Cross-Scale Feature Fusion Module (CCFM) [43] was introduced.

CCFM constructs multi-scale feature interaction paths using lightweight convolution operations, achieving cross-layer feature reuse while reducing computational complexity. This design effectively preserves high-resolution spatial details, mitigating the loss of small-object information caused by feature map downsampling. Additionally, it employs a strategy based on learned weights to automatically adjust the feature fusion approach, granting the model a high degree of flexibility to adapt the integration of features from different scales to the demands of fire detection tasks. By enabling collaborative perception of local details and global semantics, CCFM enhances the model's capability to distinguish complex fire characteristics. Its lightweight architecture preserves detection accuracy on resource-constrained edge devices, providing an efficient solution for real-time fire detection tasks.

3.3.2 Mixed local channel attention mechanism.

The attention mechanism is designed to enable models to focus more precisely on critical information, thereby enhancing performance and efficiency. However, traditional channel attention mechanisms, while effective in amplifying feature representation along the channel dimension, exhibit limitations in capturing spatial information, which can compromise detection accuracy. In contrast, spatial attention mechanisms can capture local details of images but often fail to focus accurately on critical regions due to overly uniform attention distribution. Additionally, their high computational complexity and large parameter count restrict their application in edge devices.

As illustrated in Fig 5, the Mixed Local Channel Attention (MLCA) mechanism [44] addresses these issues by dividing input feature maps into multiple local regions, combining spatial and channel information processing, and utilizing 1 × 1 convolutions to significantly reduce computational overhead and parameter count. Compared to other attention mechanisms, MLCA incorporates both spatial and channel information, optimizing resource utilization while improving model robustness. Consequently, MLCA is introduced after the CCFM to further enhance the model's performance. The synergy between feature fusion and the attention mechanism substantially improves the model's ability to learn fire-specific features and detect small fire targets, providing robust technical support for mid- to long-range fire detection.

The structure of the MLCA mechanism is shown in Fig 6. First, the input feature map undergoes Local Average Pooling (LAP), which reduces its dimensions to 1 × 128 × 5 × 5. Then, the processed feature map is split into two paths. The first path extracts global features through Global Average Pooling (GAP), while the second path reshapes the feature map to extract local features. Next, the features from both paths undergo 1D convolution processing, and the feature maps are then restored to 1 × 128 × 5 × 5 dimensions using Unpooling and reshaping operations. Finally, the features from both paths are fused, and the resulting feature map is restored to the original dimensions of the input feature map via an Unpooling operation to generate the final output.
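The data flow above can be traced at the shape level with a simplified sketch. This is an illustration, not the exact MLCA operator: the per-path 1D convolutions are omitted, and the fusion of the two paths is reduced to an additive combination followed by a sigmoid, both assumptions made for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlca_sketch(x, k=5):
    """Shape-level sketch of MLCA. x: (1, C, H, W), H and W divisible by k."""
    _, c, h, w = x.shape
    # Local average pooling (LAP): (1, C, H, W) -> (1, C, k, k)
    lap = x.reshape(1, c, k, h // k, k, w // k).mean(axis=(3, 5))
    # Path 1: global average pooling (GAP) -> one descriptor per channel
    gap = lap.mean(axis=(2, 3), keepdims=True)   # (1, C, 1, 1)
    # Path 2: local per-region descriptors
    local = lap                                   # (1, C, k, k)
    # Fuse the two paths (simplified) and unpool back to input resolution
    attn = sigmoid(local + gap)                   # (1, C, k, k)
    attn = attn.repeat(h // k, axis=2).repeat(w // k, axis=3)
    return x * attn                               # same shape as the input

x = np.ones((1, 128, 20, 20))
print(mlca_sketch(x).shape)  # (1, 128, 20, 20)
```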

3.4 Improved loss function

The YOLOv5s algorithm originally uses the CIoU loss function to evaluate the quality of predicted bounding boxes. This function enhances localization accuracy by evaluating the overlap area between the predicted box and the ground truth, the distance between their center points, and the consistency of their aspect ratios. The definition of CIoU is shown in Eqs (3)–(6).

IoU = |B ∩ B^gt| / |B ∪ B^gt|  (3)

L_CIoU = 1 − IoU + ρ²(b, b^gt) / c² + αv  (4)

α = v / ((1 − IoU) + v)  (5)

v = (4 / π²) · (arctan(w^gt / h^gt) − arctan(w / h))²  (6)

Where IoU represents the ratio between the intersection and the union of the predicted bounding box and the ground truth box, measuring the degree of spatial overlap. The ρ(b, b^gt) denotes the Euclidean distance between the center points of the predicted box and the GT box, used to penalize positional deviation, and c is the diagonal length of the smallest box enclosing both. The α is a dynamic weighting factor that adaptively adjusts the contribution of the aspect ratio term based on the IoU and shape difference. The v quantifies the discrepancy in aspect ratio between the predicted and GT boxes by calculating the difference in the arctangent of their respective width-to-height ratios. Specifically, when the IoU is low, α becomes smaller to focus more on improving the overlap region, and when the IoU is high, α increases to emphasize shape refinement. By jointly optimizing the overlap area, center distance, and aspect ratio, the CIoU loss function achieves improved localization accuracy and overall object detection performance.
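As a sanity check on the CIoU terms described above, the following minimal sketch computes the loss for axis-aligned boxes. The corner format (x1, y1, x2, y2) is an implementation choice for illustration, not prescribed by the paper.

```python
import math

def ciou_loss(b1, b2):
    # b = (x1, y1, x2, y2); sketch of the standard CIoU loss terms
    x1, y1, x2, y2 = b1
    X1, Y1, X2, Y2 = b2
    inter = (max(0.0, min(x2, X2) - max(x1, X1))
             * max(0.0, min(y2, Y2) - max(y1, Y1)))
    union = (x2 - x1) * (y2 - y1) + (X2 - X1) * (Y2 - Y1) - inter
    iou = inter / union
    # squared center distance over the enclosing-box diagonal (penalty term)
    rho2 = ((x1 + x2 - X1 - X2) ** 2 + (y1 + y2 - Y1 - Y2) ** 2) / 4.0
    c2 = (max(x2, X2) - min(x1, X1)) ** 2 + (max(y2, Y2) - min(y1, Y1)) ** 2
    # aspect-ratio consistency term v and its dynamic weight alpha
    v = (4 / math.pi ** 2) * (math.atan((x2 - x1) / (y2 - y1))
                              - math.atan((X2 - X1) / (Y2 - Y1))) ** 2
    alpha = v / ((1.0 - iou) + v) if v else 0.0
    return 1.0 - iou + rho2 / c2 + alpha * v

print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0 for identical boxes
```

For identical boxes all three penalty terms vanish, so the loss is exactly zero; any offset or shape mismatch makes it positive.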

However, the CIoU loss function shows limitations in fire detection, as its reliance on aspect ratio consistency makes it less effective at handling small or irregularly shaped fire targets, leading to insufficient sensitivity to scale variations, which in turn results in detection errors and slows down model convergence. To address these issues, the width-height ratio adjustment factor in CIoU is removed, which leads to the adoption of the DIoU loss function, further optimized by incorporating the Inner-IoU [45], resulting in the formulation of the Inner-DIoU loss function. The Inner-DIoU is defined in Eqs (7)–(9).

DIoU = IoU − ρ²(b, b^gt) / c²  (7)

L_DIoU = 1 − DIoU  (8)

L_Inner-DIoU = L_DIoU + IoU − IoU_inner  (9)

The Inner-IoU loss function optimizes detection performance by introducing a scaling factor, ratio, which adjusts the size of the auxiliary bounding box. As illustrated in Fig 7, when the ratio equals 1, the auxiliary bounding box is equal to the actual bounding box. If the ratio is less than 1, the auxiliary bounding box is smaller than the actual bounding box, facilitating faster convergence for high IoU samples. Conversely, when the ratio is greater than 1, the auxiliary bounding box is larger than the actual bounding box, which accelerates the regression of low IoU samples. Eqs (10)–(13) define the boundary parameters of the Inner Box.

b_l^gt = x_c^gt − (w^gt · ratio) / 2,  b_r^gt = x_c^gt + (w^gt · ratio) / 2  (10)

b_t^gt = y_c^gt − (h^gt · ratio) / 2,  b_b^gt = y_c^gt + (h^gt · ratio) / 2  (11)

b_l = x_c − (w · ratio) / 2,  b_r = x_c + (w · ratio) / 2  (12)

b_t = y_c − (h · ratio) / 2,  b_b = y_c + (h · ratio) / 2  (13)

Where (x_c^gt, y_c^gt) represent the center coordinates of the GT box, and (x_c, y_c) represent the center coordinates of the anchor box. b_l^gt, b_r^gt, b_t^gt, and b_b^gt represent the left, right, top, and bottom boundary coordinates of the Inner GT Box, respectively. Similarly, b_l, b_r, b_t, and b_b represent the left, right, top, and bottom boundary coordinates of the Inner Anchor Box, respectively.

In Eqs (14)–(16), the definition of Inner-IoU is provided.

inter = (min(b_r^gt, b_r) − max(b_l^gt, b_l)) · (min(b_b^gt, b_b) − max(b_t^gt, b_t))  (14)

union = w^gt · h^gt · (ratio)² + w · h · (ratio)² − inter  (15)

IoU_inner = inter / union  (16)

Where inter denotes the intersection of the Inner GT box and Inner anchor box, while union represents their union. The Inner-IoU (IoU_inner) is defined as the ratio of inter to union.
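The boundary and ratio definitions above can be sketched in a few lines. The center-size box format (xc, yc, w, h) follows the boundary parameterization in the text; the specific box values below are illustrative.

```python
def inner_iou(gt, anchor, ratio=1.0):
    # boxes given as (xc, yc, w, h); sketch of the Inner Box construction
    (xg, yg, wg, hg), (xa, ya, wa, ha) = gt, anchor
    # Inner GT box boundaries, scaled by `ratio`
    gl, gr = xg - wg * ratio / 2, xg + wg * ratio / 2
    gt_, gb = yg - hg * ratio / 2, yg + hg * ratio / 2
    # Inner anchor box boundaries
    al, ar = xa - wa * ratio / 2, xa + wa * ratio / 2
    at, ab = ya - ha * ratio / 2, ya + ha * ratio / 2
    inter = (max(0.0, min(gr, ar) - max(gl, al))
             * max(0.0, min(gb, ab) - max(gt_, at)))
    union = wg * hg * ratio ** 2 + wa * ha * ratio ** 2 - inter
    return inter / union

# With ratio = 1 the auxiliary boxes coincide with the originals (plain IoU).
print(inner_iou((0, 0, 2, 2), (1, 0, 2, 2), ratio=1.0))  # ~0.3333
```

Shrinking the ratio below 1 makes the auxiliary boxes stricter (overlap drops faster with offset), which is what sharpens the gradient for high-IoU samples.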

4 Experiments and analysis

4.1 Dataset

The performance of fire image recognition algorithms is heavily dependent on the quality of the dataset. However, the field of fire detection faces significant challenges, including insufficient sample size, imbalanced sample distribution, and limited background diversity. Additionally, publicly available video and image datasets are scarce, and there is a lack of authoritative standard datasets [10].

To address this, the fire image dataset was collected through three approaches:

  1. Selecting high-resolution images from publicly available datasets such as BoWFire [46], FASDD [47], and COCO2017 [48].
  2. Collecting high-quality fire images from the internet using web scraping techniques.
  3. Creating custom fire videos and extracting images from each frame.

After data collection, manual filtering and data augmentation techniques, including scaling and stitching, were applied to enhance data diversity. The dataset was then annotated using the LabelImg tool. Finally, the dataset was split into training, validation, and test sets with a ratio of 7:1.5:1.5. Fig 8 illustrates the data distribution, and Table 2 provides detailed statistical information on the images.
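A 7:1.5:1.5 split can be produced by shuffling once and slicing, as in the hypothetical helper below (the paper does not publish its split script; `split_dataset` and the fixed seed are illustrative assumptions):

```python
import random

def split_dataset(items, ratios=(0.70, 0.15, 0.15), seed=0):
    # Shuffle once with a fixed seed, then slice into train/val/test
    # according to the 7:1.5:1.5 ratio.
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = int(n * ratios[0]), int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```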

4.2 Evaluation metrics

In the experiment, the following metrics were used to evaluate the model: precision (P), recall (R), parameters, floating point operations (FLOPs), frames per second (FPS), mAP@0.5, and mAP@0.5:0.95. The calculation formulas are as follows:

P = TP / (TP + FP)  (17)

R = TP / (TP + FN)  (18)

AP = ∫₀¹ P(R) dR  (19)

FPS = N_frames / T_total  (20)

mAP = (1/n) · Σ_{i=1}^{n} AP_i  (21)

Specifically, TP (True Positive) represents the number of samples correctly predicted as positive, FP (False Positive) represents the number of samples predicted as positive but actually negative, and FN (False Negative) represents the number of samples predicted as negative but actually positive. The parameters metric reflects the total number of model parameters, indicating the model’s size and complexity. FLOPs measure the number of operations required for one forward inference pass, offering an estimate of the computational efficiency and resource demand. FPS indicates the number of image frames the model can process in one second, providing insight into its real-time processing capability and overall responsiveness during deployment.
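Following the definitions of TP, FP, and FN above, precision and recall reduce to simple ratios; the counts in the example are illustrative, not experimental results.

```python
def precision_recall(tp, fp, fn):
    # precision P = TP / (TP + FP); recall R = TP / (TP + FN)
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

# e.g. 80 correct detections, 20 false alarms, 10 missed fires:
p, r = precision_recall(80, 20, 10)
print(p, round(r, 3))  # 0.8 0.889
```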

In Eq (21), n represents the number of classes. When n = 1, as in this experiment, which focuses solely on the "fire" class, the mAP (mean Average Precision) is equivalent to the AP (Average Precision). Additionally, mAP@0.5 refers to the mean average precision at an IoU (Intersection over Union) threshold of 0.5, which measures detection accuracy at that specific overlap between predicted and ground-truth bounding boxes. Meanwhile, mAP@0.5:0.95 averages the precision over IoU thresholds from 0.5 to 0.95 in steps of 0.05, providing a more comprehensive assessment of the model's performance across different degrees of overlap.
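For concreteness, precision, recall, AP, and mAP as defined above can be computed with a minimal NumPy sketch (illustrative only; AP is approximated here by trapezoidal integration of the precision-recall curve):

```python
import numpy as np

def precision(tp, fp):
    # P = TP / (TP + FP)
    return tp / (tp + fp)

def recall(tp, fn):
    # R = TP / (TP + FN)
    return tp / (tp + fn)

def average_precision(recalls, precisions):
    # AP: area under the precision-recall curve (trapezoidal rule)
    order = np.argsort(recalls)
    r = np.asarray(recalls, dtype=float)[order]
    p = np.asarray(precisions, dtype=float)[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

def mean_average_precision(ap_per_class):
    # mAP = (1/n) * sum(AP_i); with the single "fire" class, mAP == AP
    return sum(ap_per_class) / len(ap_per_class)
```

For example, with 80 true positives, 20 false positives, and 10 false negatives, precision is 0.8 and recall is 8/9; averaging AP over the single fire class returns the AP itself.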

4.3 Experimental results and analysis on PC

The FCMI-YOLO algorithm was first trained and validated on a PC. Subsequently, the trained model was converted to the RKNN format using the RKNN-Toolkit2 and deployed onto edge devices for further validation and evaluation.

4.3.1 Experimental environment.

The training environment is detailed in Table 3, and the main training parameters for the fire detection model are listed in Table 4. The experiment utilizes the pre-trained weights of YOLOv5s and employs the SGD optimizer for training the model.

4.3.2 Performance comparison of different YOLOv5 model versions.

Fast and accurate fire detection is crucial for fire source control and early detection during the ignition stage. As shown in Table 5, there are significant performance differences between versions of the YOLOv5 model in terms of detection accuracy and inference speed. YOLOv5s strikes a good balance between accuracy (87.2%) and inference speed (98 FPS) while maintaining low FLOPs (15.9G), making it suitable for reliable detection at low computational cost on resource-constrained edge devices. In contrast, although YOLOv5n achieves the fastest inference speed (105 FPS) and the lowest FLOPs (4.6G), its mAP is only 83.2%, which is insufficient for the accuracy requirements of fire detection. On the other hand, YOLOv5l delivers the best detection accuracy (89.0% mAP), but its relatively slow inference speed (56.3 FPS) and large FLOPs (109.6G) limit its application on edge devices. Overall, YOLOv5s, with its strong performance and computational efficiency, emerges as the preferred model for edge-based fire detection tasks and has been selected as the base network for this study.

Table 5. Performance of fire detector based on different model versions of YOLOv5.

https://doi.org/10.1371/journal.pone.0329555.t005

4.3.3 FasterNext comparison experiment.

To evaluate the performance of the FasterNext module, the C3 modules in the backbone, neck, and head networks of YOLOv5s were replaced with FasterNext modules, and their performance was compared on the constructed fire dataset. As shown in Table 6, when only replacing the C3 module in the backbone network, the model’s precision increased by 1.9% to 89.1%, recall improved by 4.7% to 83.4%, and mAP@0.5 increased by 1.3% to 87.8%. Furthermore, as illustrated in Fig 9, the model demonstrated greater robustness under various challenging conditions, such as daytime, nighttime, rain, and fog. These improvements primarily originate from two key innovations in FasterNext. First, using PConv convolution reduces computational costs to 1/16 of conventional convolutions and significantly enhances computational efficiency through optimized memory access. Second, adopting the SiLU activation function ensures that normalization and activation are sequentially applied after each PWConv, constructing a more refined nonlinear feature transformation pathway while mitigating potential feature loss issues associated with the C3 module. Although the inference speed slightly decreased from 98.8 FPS to 93.3 FPS, FasterNext achieves a better balance between accuracy improvement and computational efficiency, making it particularly suitable for complex object detection tasks in resource-constrained environments. Experimental results indicate that FasterNext not only overcomes the primary limitations of the C3 module but also achieves comprehensive performance enhancements through structural innovations.

Table 6. Performance comparison of YOLOv5s with FasterNext module replacement.

https://doi.org/10.1371/journal.pone.0329555.t006

Fig 9. Comparison of detection results of different methods in FasterNext.

(a) YOLOv5s. (b) YOLOv5s + FasterNext (Backbone). (c) YOLOv5s + FasterNext (Neck). (d) YOLOv5s + FasterNext (Backbone + Neck).

https://doi.org/10.1371/journal.pone.0329555.g009

4.3.4 Loss function comparison experiment.

Based on the introduction of the FasterNext module, CCFM, and MLCA mechanism, this experiment adjusts the ratio of Inner-DIoU within the range of [0.6, 1.0]. According to Fig 10, the experimental results indicate that the model achieves optimal performance when the ratio is set to 0.8, with a mAP@0.5 reaching 88.0%.

To evaluate the detection performance of the proposed Inner-DIoU loss function, comprehensive comparative experiments were conducted on the constructed fire dataset. As shown in Table 7, comparing Inner-DIoU with benchmark methods such as CIoU, DIoU, EIoU, and GIoU, along with their Inner and Focal variants [49], reveals that Inner-DIoU achieved the best performance in terms of mAP@0.5 and recall. Specifically, Inner-DIoU attained a mAP@0.5 of 88.0% and a recall of 84.4%, improvements of 0.7% and 2.0%, respectively, over the baseline CIoU. These results demonstrate that Inner-DIoU dynamically adjusts the scaling ratio of the auxiliary bounding box based on the actual scale of fire targets, effectively enhancing the model's ability to perceive sudden variations in fire target size.
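The Inner-DIoU computation can be sketched as follows, assuming (per the Inner-IoU formulation [45]) that the auxiliary boxes share the original box centres and are scaled by `ratio`, combined with the standard DIoU centre-distance penalty. This is a simplified single-box illustration, not the training implementation:

```python
def _iou(b1, b2):
    # IoU of two boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def _scale(b, ratio):
    # Auxiliary box: same centre, width/height scaled by `ratio`
    cx, cy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    w, h = (b[2] - b[0]) * ratio, (b[3] - b[1]) * ratio
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def inner_diou_loss(pred, gt, ratio=0.8):
    # Inner-IoU on the scaled auxiliary boxes
    iou_inner = _iou(_scale(pred, ratio), _scale(gt, ratio))
    # DIoU penalty: centre distance over enclosing-box diagonal
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gcx, gcy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    d2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2
    ex1, ey1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    ex2, ey2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    return 1 - iou_inner + d2 / c2
```

A perfectly regressed box gives a loss of zero; shrinking `ratio` below 1 makes the auxiliary IoU stricter, which is the knob swept over [0.6, 1.0] in the experiment above.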

Table 7. Performance comparison of different loss functions.

https://doi.org/10.1371/journal.pone.0329555.t007

4.3.5 Ablation experiments.

In this ablation experiment, each improvement stage of FCMI-YOLO was evaluated to assess its effectiveness in fire detection tasks. The “✓” symbol indicates the incorporation of a specific method.

As shown in Table 8, introducing each module individually results in a certain improvement in mAP@0.5, with the FasterNext module achieving the most significant increase of 1.3%. FasterNext, by incorporating a lightweight structure combining PConv and PWConv, reduces redundant computations while enhancing feature extraction capabilities, enabling the network to extract more discriminative fire features with a lower computational cost. However, since the PConv and PWConv combination primarily focuses on the central region of the input, it may have limitations in capturing edge details of fire targets, which could lead to a slight decrease in recall. Additionally, the module significantly reduces the number of model parameters and FLOPs, making it more suitable for edge computing environments.

Introducing the CCFM and MLCA mechanisms in the YOLOv5s neck network resulted in a 4.7% improvement in recall and a noticeable reduction in false positives, especially in scenarios with small fire targets or complex backgrounds, where the detection performance became more stable. This improvement is largely due to CCFM enhancing the interaction between high and low-level features through cascaded feature fusion, which improves the representation of small fire targets. Meanwhile, MLCA, by combining spatial and channel information, allows the network to focus more effectively on key fire features while suppressing background noise. Furthermore, this optimization reduced the model parameters by 30% and FLOPs by 15.9%, demonstrating higher computational efficiency that meets the stringent resource constraints of edge devices.

After introducing the Inner-DIoU loss function in YOLOv5s, the model’s precision, recall, and mAP@0.5 were improved by 2%, 3.2%, and 0.9%, respectively. This improvement dynamically adjusts the scaling parameters of the auxiliary bounding box, effectively enhancing the network’s adaptability to changes in the scale of fire targets, particularly in scenarios with significant variations in target size, where recall benefits were particularly evident. Inner-DIoU, by reconstructing the bounding box regression mechanism, strengthens the model’s ability to capture multi-scale fire features, improving target differentiation while alleviating the bounding box localization bias caused by scale sensitivity in traditional methods, thus achieving a synergistic optimization of model robustness and detection accuracy.

When FasterNext, CCFM, MLCA, and Inner-DIoU are combined, the model achieves an optimal balance. The number of parameters is reduced by 40%, and the floating-point operations are reduced by 29%. While maintaining real-time performance, mAP@0.5 increases by 1.5% and recall improves by 5.7%. The experiments demonstrate that FCMI-YOLO achieves the best balance between accuracy and inference speed, meeting the stringent requirements of edge devices for high-precision, fast-response, and resource-efficient fire detection, providing essential support for efficient real-time fire monitoring on edge platforms.

4.3.6 Visualization analysis.

To comprehensively verify the performance of the FCMI-YOLO model in fire detection tasks, visualization experiments were conducted under different exposure conditions, including normal exposure, underexposure, and overexposure, simulating complex scenarios that may arise in practical applications.

In the experiments, a detailed comparison was conducted between the FCMI-YOLO and the YOLOv5s. According to Fig 11, the FCMI-YOLO model not only accurately detects fire sources at close range (Fig 11 (a)), at long distances (Fig 11 (b)), and in nighttime environments (Fig 11 (c)), but also outperforms the YOLOv5s model in accuracy across all scenarios. Notably, under nighttime conditions, the YOLOv5s model tends to exhibit missed detections and decreased accuracy when detecting multiple or small fire targets as the exposure levels fluctuate. In contrast, FCMI-YOLO effectively reduces missed detections, demonstrates higher accuracy in small-object detection, and improves overall detection precision, exhibiting strong anti-interference capability and robustness.

Fig 11. Detection results of YOLOv5s and FCMI-YOLO under different exposure levels.

(a) Fire in the close interior. (b) Remote outdoor fire. (c) Fire in the Night.

https://doi.org/10.1371/journal.pone.0329555.g011

Overall, the FCMI-YOLO model demonstrates significant advantages in managing complex scenarios in fire detection tasks. Its sensitivity and accuracy under various exposure conditions provide more reliable technical support for practical applications in fire warning and monitoring.

4.3.7 Comparison of mainstream algorithms.

To further evaluate the proposed algorithm, a comparative analysis with mainstream methods was conducted using the constructed fire dataset. As shown in Fig 12, FCMI-YOLO achieves the highest recall rate (84.4%) among all models, demonstrating strong adaptability in complex environments. This high recall is particularly important for fire safety applications, where missed detections may delay early warning and compromise response in industrial or long-range monitoring scenarios.

Fig 12. Comparison of mAP@0.5 and Recall for mainstream algorithms.

https://doi.org/10.1371/journal.pone.0329555.g012

As shown in Table 9, the parameters of Faster R-CNN, SSD, PP-YOLOEs, YOLOv3, YOLOv4, YOLOv5s, YOLOv6s, YOLOv7, YOLOv7-Tiny, YOLOv8s, YOLOv8s-World, YOLOv8s-FasterNet, YOLOv9s, YOLOv10s, YOLOv11s, YOLOv11s-MobileNetv4, and YOLOv11s-EMO are 32.55, 5.62, 1.88, 14.86, 15.33, 1.67, 4.12, 8.86, 1.43, 2.64, 3.19, 2.05, 2.36, 1.93, 2.24, 1.24, and 2.02 times those of the proposed FCMI-YOLO, respectively, and their FLOPs are 32.72, 15.47, 1.54, 5.83, 5.36, 1.41, 3.90, 9.30, 1.15, 2.53, 2.88, 1.93, 3.59, 2.19, 1.90, 0.93, and 1.02 times those of FCMI-YOLO, respectively. Moreover, FCMI-YOLO achieves a balanced performance of 88.0% mAP@0.5 at 91.2 FPS, surpassing most mainstream algorithms in both inference speed and accuracy. While its precision slightly trails PP-YOLOEs (88.2%), YOLOv6s (88.5%), YOLOv8s-World (89.4%), YOLOv9s (89.0%), and YOLOv11s (88.3%), these models carry significantly higher computational costs that hinder embedded deployment. Specifically, YOLOv8s-World requires 13.4M parameters (more than three times FCMI-YOLO's 4.2M) and 32.6 GFLOPs (nearly three times FCMI-YOLO's 11.3 GFLOPs), yet delivers only 36.3 FPS, 60.1% slower than our model. Similarly, YOLOv6s reaches 77.8 FPS with 17.3M parameters (over four times ours) and 44.1 GFLOPs (almost four times higher), while YOLOv9s attains only 44.2 FPS, 51.5% slower, despite 9.9M parameters (about two and a half times ours) and 40.6 GFLOPs (nearly four times greater).

Table 9. Performance comparison of mainstream algorithms.

https://doi.org/10.1371/journal.pone.0329555.t009

To intuitively verify the effectiveness of FCMI-YOLO, a visual comparison analysis was conducted using the four optimal models from Table 9. As shown in Fig 13, under challenging conditions such as strong light, occlusion and long-distance scenarios, YOLOv6, YOLOv9, and YOLOv11 exhibit varying degrees of missed detections. In contrast, FCMI-YOLO consistently achieves the highest detection accuracy across different test scenarios.

Fig 13. Detection results of FCMI-YOLO, YOLOv6s, YOLOv9s, and YOLOv11s.

https://doi.org/10.1371/journal.pone.0329555.g013

In summary, FCMI-YOLO demonstrates a distinct advantage over mainstream detection algorithms by achieving an optimal balance between precision, inference speed, and lightweight design. With only 4.2M parameters and 11.3 GFLOPs, it significantly reduces computational overhead while delivering competitive accuracy and recall, enabling deployment on edge devices without sacrificing detection quality. Unlike larger models such as YOLOv8s-World and YOLOv9s, which require over twice the parameters and FLOPs while operating at much lower FPS, FCMI-YOLO offers both real-time responsiveness and robust detection performance. Furthermore, its superior recall is particularly valuable for safety-critical applications, where failure to detect early-stage fires could lead to uncontrollable spread and escalated hazards. These results affirm the practical applicability and technical strength of the proposed method in resource-constrained fire detection scenarios.

4.4 Model deployment

4.4.1 Hardware specifications and environment configuration.

The experiment utilizes the Orange Pi 5 Plus as the edge device testing platform. According to Fig 14, the system is primarily composed of five components: the image acquisition module, image processing module, display module, network communication module, and storage module. The platform is powered by the Rockchip RK3588 processor, which integrates a Neural Network Processing Unit (NPU) with a computational power of 6 TOPS. The system uses an IMX577 camera to capture images.

First, the image acquisition module receives video stream data from an external source through the IMX577 camera and transmits it to the image processing module, where it is temporarily stored in LPDDR4 memory. Next, the RK3588 processor utilizes the CPU and GPU in the multimedia processing unit to perform image enhancement on the video stream, runs the object detection algorithm, and accelerates inference speed using the NPU. Finally, the detection results are optimized through video stream compression and encoding, then transmitted in real-time to the display module via the HDMI interface for displaying the detection results.

4.4.2 Asynchronous multi-threaded processing.

In detection tasks, the Orange Pi 5 Plus leverages NPU acceleration, with its utilization directly affecting inference speed. To maximize NPU performance, this study employs an asynchronous multithreading method, which allows multiple tasks to be executed in parallel, thereby reducing idle time and improving resource utilization efficiency. In this method, asynchronous operations, such as submitting inference requests to the NPU, are initiated by the main thread without waiting for their completion. This non-blocking mechanism ensures that the main thread remains responsive and can continue handling other tasks, such as preprocessing input frames or managing communication with peripheral devices. Meanwhile, worker threads (or background threads) are responsible for executing time-consuming inference operations. Once the NPU completes an inference task, the results are passed back to the main thread, enabling prompt post-processing and visualization. As shown in Table 10, the asynchronous multithreading method significantly improves overall NPU utilization by enabling concurrent task execution and more balanced resource scheduling. Consequently, the system’s frame rate (FPS) increased from 5.5 to 23.4, demonstrating substantial improvements in responsiveness and real-time performance.
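The producer/worker pattern described above can be illustrated with a plain-Python sketch; CPU threads and a dummy `infer` callable stand in here for the actual RKNN NPU submission, which this code does not reproduce:

```python
import queue
import threading

def run_pipeline(frames, infer, num_workers=2):
    """Asynchronous pipeline: the main thread enqueues frames without
    blocking on inference; worker threads run the (NPU) inference and
    push results back, which are then reassembled in frame order."""
    tasks, results = queue.Queue(), queue.Queue()

    def worker():
        while True:
            item = tasks.get()
            if item is None:          # sentinel: shut this worker down
                break
            idx, frame = item
            results.put((idx, infer(frame)))   # time-consuming inference
            tasks.task_done()

    workers = [threading.Thread(target=worker, daemon=True)
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for i, frame in enumerate(frames):         # non-blocking submission
        tasks.put((i, frame))
    tasks.join()                               # wait for all inferences
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()
    out = [None] * len(frames)
    while not results.empty():                 # restore frame order
        idx, r = results.get()
        out[idx] = r
    return out
```

The main thread stays free between `put` calls to handle preprocessing or peripheral I/O, which is the mechanism credited above for raising NPU utilization and lifting the frame rate from 5.5 to 23.4 FPS.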

Table 10. NPU utilization and FPS Under different processing methods.

https://doi.org/10.1371/journal.pone.0329555.t010

4.4.3 Deployment results.

To evaluate the detection performance of the FCMI-YOLO algorithm on edge devices, experiments were conducted on the Orange Pi 5 Plus using the constructed fire dataset. As shown in Fig 15, the algorithm achieved a mAP of 81.45%, demonstrating excellent detection accuracy.

Fig 15. mAP performance of FCMI-YOLO on the Orange Pi 5 Plus.

https://doi.org/10.1371/journal.pone.0329555.g015

Additionally, outdoor fire detection experiments were conducted to further assess the real-world performance of the proposed algorithm on resource-constrained edge devices. As shown in Fig 16, experiments were performed at distances of 30 m and 75 m from the fire source, with FCMI-YOLO and YOLOv5s deployed on the OrangePi 5 Plus. At 30 m, FCMI-YOLO achieved a precision of 88% and an inference speed of 23.4 FPS, outperforming YOLOv5s, which recorded 72% precision at 25.8 FPS. At 75 m, FCMI-YOLO maintained a precision of 70% with the same FPS, while YOLOv5s dropped to 59% precision. Although FCMI-YOLO operates at a slightly lower FPS, it consistently delivers higher detection accuracy, especially under long-range and low-resolution fire conditions. These results indicate that FCMI-YOLO demonstrates superior robustness and detection stability compared to YOLOv5s in outdoor environments, validating its suitability for real-time fire detection on edge devices.

Fig 16. Detection results of FCMI-YOLO and YOLOv5s at 30m and 75m on the OrangePi 5 Plus.

https://doi.org/10.1371/journal.pone.0329555.g016

5 Conclusion

To achieve the optimal trade-off between accuracy and inference speed for fire detection, a real-time fire detection algorithm, FCMI-YOLO, is proposed and successfully deployed on edge devices for real-time detection tasks involving fire images, videos, and camera streams. First, a lightweight feature extraction module, FasterNext, is introduced, which combines PConv and PWConv to reduce model complexity while incorporating the nonlinear activation function SiLU to enhance fire feature representation. Second, the CCFM and MLCA mechanisms are integrated to improve recall through hierarchical feature fusion and the utilization of spatial and channel information. Finally, the Inner-DIoU loss function is proposed, introducing an auxiliary bounding box constraint mechanism to optimize multi-scale object localization and enhance scale variation awareness. Experimental results demonstrate that FCMI-YOLO achieves 88.0% mAP@0.5 and 91.2 FPS on a PC, exhibiting significant advantages over other YOLO variants regarding model parameters, detection accuracy, and inference speed. When deployed on an edge device, it maintains an mAP of 81.45% and a real-time performance of 23.4 FPS, providing an efficient solution for real-time fire monitoring.

However, the proposed algorithm still presents several limitations. First, in environments with strong wind or reflective surfaces, flame deformation and mirror reflections of fire may compromise feature extraction and increase false positives, reducing detection accuracy. Second, although the model has been made lightweight enough to achieve real-time detection on edge devices, there is still room to improve inference speed in practical applications. Lastly, performance on edge devices varies with the underlying hardware, making model optimization for different platforms a challenge for future work. Future research will focus on enhancing the generalization ability of FCMI-YOLO to adapt to a wider range of detection scenarios, and further lightweight strategies will be explored to improve detection speed in terms of FPS.

References

  1. Madrzykowski D. Fire dynamics: the science of fire fighting. Int Fire Service J Leadership Manag. 2016;10.
  2. Perilla FS, Villanueva Jr GR, Cacanindin NM, Palaoag TD. Fire safety and alert system using arduino sensors with IoT integration. In: Proceedings of the 2018 7th International Conference on Software and Computer Applications. 2018. p. 199–203.
  3. Wu J, Wu Z, Ding H, Wei Y, Yang X, Li Z, et al. Multifunctional and high-sensitive sensor capable of detecting humidity, temperature, and flow stimuli using an integrated microheater. ACS Appl Mater Interfaces. 2019;11(46):43383–92. pmid:31709789
  4. Shi M, Bermak A, Chandrasekaran S, Amira A, Brahim-Belhouari S. A committee machine gas identification system based on dynamically reconfigurable FPGA. IEEE Sens J. 2008;8(4):403–14.
  5. Xu L, Yan Y. A new flame monitor with triple photovoltaic cells. IEEE Trans Instrument Measur. 2006;55(4):1416–21.
  6. Agirman AK, Tasdemir K. BLSTM based night-time wildfire detection from video. PLoS One. 2022;17(6):e0269161. pmid:35657931
  7. He Y, Hu J, Zeng M, Qian Y, Zhang R. DCGC-YOLO: the efficient dual-channel bottleneck structure YOLO detection algorithm for fire detection. IEEE Access. 2024;12:65254–65.
  8. Jin C, Wang T, Alhusaini N, Zhao S, Liu H, Xu K. Video fire detection methods based on deep learning: datasets, methods, and future directions. Fire. 2023;6(8):315.
  9. Foggia P, Saggese A, Vento M. Real-time fire detection for video-surveillance applications using a combination of experts based on color, shape, and motion. IEEE Trans Circuits Syst Video Technol. 2015;25(9):1545–56.
  10. Qureshi WS, Ekpanyapong M, Dailey MN, Rinsurongkawong S, Malenichev A, Krasotkina O. QuickBlaze: early fire detection using a combined video processing approach. Fire Technol. 2016;52(5):1293–317.
  11. Cao X, Su Y, Geng X, Wang Y. YOLO-SF: YOLO for fire segmentation detection. IEEE Access. 2023;11:111079–92.
  12. Wang S, Wu M, Wei X, Song X, Wang Q, Jiang Y. An advanced multi-source data fusion method utilizing deep learning techniques for fire detection. Eng Appl Artif Intell. 2025;142:109902.
  13. He H, Zhang Z, Jia Q, Huang L, Cheng Y, Chen B. Wildfire detection for transmission line based on improved lightweight YOLO. Energy Rep. 2023;9:512–20.
  14. Chang HC, Hsu YL, Hsiao CY, Chen YF. Design and implementation of an intelligent autonomous surveillance system for indoor environments. IEEE Sens J. 2021;21(15):17335–49.
  15. Mueller M, Karasev P, Kolesov I, Tannenbaum A. Optical flow estimation for flame detection in videos. IEEE Trans Image Process. 2013;22(7):2786–97. pmid:23613042
  16. Ghassempour N, Zou JJ, He Y. A SIFT-based forest fire detection framework using static images. In: 2018 12th International Conference on Signal Processing and Communication Systems (ICSPCS). 2018. p. 1–7.
  17. Yan M, Yan Y. Fire detection based on improved HOG. In: 2018 17th International Symposium on Distributed Computing and Applications for Business Engineering and Science (DCABES). IEEE; 2018. p. 111–4.
  18. Zhou Z, Wu R. Stock price prediction model based on convolutional neural networks. J Indust Eng Appl Sci. 2024;2(4):1–7.
  19. Wu R, Zhang T, Xu F. Cross-market arbitrage strategies based on deep learning. Acad J Sociol Manag. 2024;2(4):20–6.
  20. Muhammad K, Ullah A, Lloret J, Del Ser J, De Albuquerque VHC. Deep learning for safe autonomous driving: current challenges and future directions. IEEE Trans Intell Transp Syst. 2020;22(7):4316–36.
  21. Muruganantham P, Wibowo S, Grandhi S, Samrat NH, Islam N. A systematic literature review on crop yield prediction with deep learning and remote sensing. Remote Sens. 2022;14(9):1990.
  22. Duong H-T, Le V-T, Hoang VT. Deep learning-based anomaly detection in video surveillance: a survey. Sensors (Basel). 2023;23(11):5024. pmid:37299751
  23. Younan M, Houssein EH, Elhoseny M, Ali AA. Challenges and recommended technologies for the industrial internet of things: a comprehensive review. Measurement. 2020;151:107198.
  24. Alakbari FS, Mohyaldinn ME, Ayoub MA, Muhsan AS. Deep learning approach for robust prediction of reservoir bubble point pressure. ACS Omega. 2021;6(33):21499–513. pmid:34471753
  25. Alakbari FS, Mohyaldinn ME, Ayoub MA, Muhsan AS, Abdulkadir SJ, Hussein IA. Prediction of critical total drawdown in sand production from gas wells: machine learning approach. Canad J Chem Eng. 2023;101(5):2493–509.
  26. Alakbari FS, Mohyaldinn ME, Ayoub MA, Hussein IA, Muhsan AS, Ridha S. A gated recurrent unit model to predict Poisson's ratio using deep learning. J Rock Mech Geotech Eng. 2024;16(1):123–35.
  27. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst. 2015;28.
  28. Girshick R. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. 2015. p. 1440–8.
  29. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 779–88.
  30. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, et al. SSD: single shot multibox detector. In: European Conference on Computer Vision. Springer; 2016. p. 21–37.
  31. Johnston J, Zeng K, Wu N. An evaluation and embedded hardware implementation of YOLO for real-time wildfire detection. In: 2022 IEEE World AI IoT Congress (AIIoT). 2022. p. 138–44.
  32. Zheng H, Dembélé S, Wu Y, Liu Y, Chen H, Zhang Q. A lightweight algorithm capable of accurately identifying forest fires from UAV remote sensing imagery. Frontiers in Forests and Global Change. 2023;6:1134942.
  33. Li Y, Shang J, Yan M, Ding B, Zhong J. Real-time early indoor fire detection and localization on embedded platforms with fully convolutional one-stage object detection. Sustainability. 2023;15(3):1794.
  34. Chan Y-W, Liu J-C, Kristiani E, Lien K-Y, Yang C-T. Flame and smoke detection using Kafka on edge devices. Internet of Things. 2024;27:101309.
  35. Jiang X, Xu L, Fang X. DG-YOLO: a novel efficient early fire detection algorithm under complex scenarios. Fire Technol. 2024:1–25.
  36. Xu R, Lin H, Lu K, Cao L, Liu Y. A forest fire detection system based on ensemble learning. Forests. 2021;12(2):217.
  37. Luan T, Zhou S, Zhang G, Song Z, Wu J, Pan W. Enhanced lightweight YOLOX for small object wildfire detection in UAV imagery. Sensors (Basel). 2024;24(9):2710. pmid:38732816
  38. Zhao H, Jin J, Liu Y, Guo Y, Shen Y. FSDF: a high-performance fire detection framework. Exp Syst Appl. 2024;238:121665.
  39. Wang Y, Hua C, Ding W, Wu R. Real-time detection of flame and smoke using an improved YOLOv4 network. Signal Image Video Process. 2022;16(4):1109–16.
  40. Huang L, Ding Z, Zhang C, Ye R, Yan B, Zhou X. YOLO-ULNet: ultralightweight network for real-time detection of forest fire on embedded sensing devices. IEEE Sens J. 2024;24(15):25175–85.
  41. Yan B, Fan P, Lei X, Liu Z, Yang F. A real-time apple targets detection method for picking robot based on improved YOLOv5. Remote Sens. 2021;13(9):1619.
  42. Chen J, Kao SH, He H, Zhuo W, Wen S, Lee CH, et al. Run, don't walk: chasing higher FLOPS for faster neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 12021–31.
  43. Zhao Y, Lv W, Xu S, Wei J, Wang G, Dang Q, et al. DETRs beat YOLOs on real-time object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. p. 16965–74.
  44. Wan D, Lu R, Shen S, Xu T, Lang X, Ren Z. Mixed local channel attention for object detection. Eng Appl Artif Intell. 2023;123:106442.
  45. Zhang H, Xu C, Zhang S. Inner-IoU: more effective intersection over union loss with auxiliary bounding box. arXiv preprint 2023. https://arxiv.org/abs/2311.02877
  46. Chino DY, Avalhais LP, Rodrigues JF, Traina AJ. BoWFire: detection of fire in still images by integrating pixel color and texture analysis. In: 2015 28th SIBGRAPI Conference on Graphics, Patterns and Images. IEEE; 2015. p. 95–102.
  47. Wang M, Yue P, Jiang L, Yu D, Tuo T, Li J. An open flame and smoke detection dataset for deep learning in remote sensing based fire detection. Geo-spatial Inf Sci. 2024:1–16.
  48. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D. Microsoft COCO: common objects in context. In: European Conference on Computer Vision. 2014. p. 740–55.
  49. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. 2017. p. 2980–8.
  50. Xu S, Wang X, Lv W, Chang Q, Cui C, Deng K. PP-YOLOE: an evolved version of YOLO. arXiv preprint 2022. https://arxiv.org/abs/2203.16250
  51. Redmon J, Farhadi A. YOLOv3: an incremental improvement. arXiv preprint 2018. https://arxiv.org/abs/1804.02767
  52. Bochkovskiy A, Wang CY, Liao HYM. YOLOv4: optimal speed and accuracy of object detection. arXiv preprint 2020. https://arxiv.org/abs/2004.10934
  53. Li C, Li L, Jiang H, Weng K, Geng Y, Li L. YOLOv6: a single-stage object detection framework for industrial applications. arXiv preprint 2022. https://arxiv.org/abs/2209.02976
  54. Wang C-Y, Bochkovskiy A, Liao H-YM. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023. p. 7464–75. https://doi.org/10.1109/cvpr52729.2023.00721
  55. Zhu J, Zhang J, Wang Y, Ge Y, Zhang Z, Zhang S. Fire detection in ship engine rooms based on deep learning. Sensors (Basel). 2023;23(14):6552. pmid:37514845
  56. Cheng T, Song L, Ge Y, Liu W, Wang X, Shan Y. YOLO-World: real-time open-vocabulary object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. p. 16901–11.
  57. Wang CY, Yeh IH, Liao HY. YOLOv9: learning what you want to learn using programmable gradient information. In: European Conference on Computer Vision. 2024. p. 1–21.
  58. Wang A, Chen H, Liu L, Chen K, Lin Z, Han J. YOLOv10: real-time end-to-end object detection. Adv Neural Inf Process Syst. 2024;37:107984–8011.
  59. Khanam R, Hussain M. YOLOv11: an overview of the key architectural enhancements. arXiv preprint 2024. https://arxiv.org/abs/2410.17725
  60. Qin D, Leichner C, Delakis M, Fornoni M, Luo S, Yang F. MobileNetV4: universal models for the mobile ecosystem. In: European Conference on Computer Vision. 2024. p. 78–96.
  61. Zhang J, Li X, Li J, Liu L, Xue Z, Zhang B, et al. Rethinking mobile block for efficient attention-based models. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society; 2023. p. 1389–400.