A deep learning-based dynamic deformable adaptive framework for locating the root region of the dynamic flames

Traditional optical flame detectors (OFDs) are susceptible to environmental interference, which inevitably causes detection errors and miscalculations in complex environments. Conventional deep learning-based models can mitigate this interference through flame image feature extraction, which significantly improves the precision of flame recognition. However, these models focus on identifying the general profile of a static flame and neglect to locate the source of a dynamic flame effectively. Therefore, this paper proposes a novel dynamic flame detection method named Dynamic Deformable Adaptive Framework (DDAF) for locating the flame root region dynamically. Specifically, to address the limitations of existing detection models in flame feature extraction, the Deformable Convolution Network v2 (DCNv2) is introduced for more flexible adaptation to the deformations and scale variations of target objects. The Context Augmentation Module (CAM) is used to convey flame features into the Dynamic Head (DH), which extracts features from different aspects. Subsequently, Layer-Adaptive Magnitude-based Pruning (LAMP), in which the connection with the smallest LAMP score is pruned sequentially, is employed to further enhance the detection speed of the model. More importantly, both coarse- and fine-grained location techniques are designed in the Inductive Modeling (IM) to accurately delineate the flame root region for effective fire control. Additionally, the Temporal Consistency-based Detection (TCD) improves the robustness of model detection by leveraging the temporal information present in consecutive frames of a video sequence. Compared with the classical deep learning method, the experimental results on the custom flame dataset demonstrate that the AP0.5 value is improved by 4.4%, while parameters and FLOPs are reduced by 25.3% and 25.9%, respectively.
The framework of this research extends applicability to a variety of flame detection scenarios, including industrial safety and combustion process control.


Introduction
The lives and property of individuals are directly impacted by fire safety. In complex outdoor environments, the rapid spread of fire causes casualties and property loss. Thus, timely detection of flames is crucial. Optical flame detectors (OFDs) employ sensors to monitor specific light frequencies for flame detection. OFDs combine feature fusion algorithms, such as scale invariant feature transform (SIFT) [1], flame detection algorithm based on multi-feature fusion (FDAMF) [2], and fire color feature extraction (FCFE) [3], with optical sensor hardware, such as [4,5]. These OFDs, i.e., infrared [6], ultraviolet [7], and infrared/ultraviolet [8], are frequently used in firefighting [9], medical [10], and agricultural [11] applications. In short, OFDs integrate optical sensors and algorithms for flame detection in diverse applications, and the valuable signals obtained from the optical sensors can be effectively combined by these feature fusion algorithms, enhancing the overall capability of flame detection systems.
However, OFDs face limitations regarding their detection process, susceptibility to interference in complex environments, and compatibility with advanced devices. First, due to characteristics such as signal transmission and communication interface requirements, OFDs adopt multi-stage detection methods. These methods include sequential processes such as feature detection, feature-to-signal transformation, and target recognition, which collectively lead to a low detection speed. Second, false detections in outdoor environments may be caused by objects with flame-like characteristics (e.g., red roses, fire engines, and the sun). Third, the algorithms of traditional OFDs may struggle to locate the root region of dynamic flames that exhibit rapid changes in shape, intensity, or movement. Finally, the computationally complex algorithms of OFDs are not well suited to contemporary advanced devices, leading to a sluggish detection process.
To overcome the low robustness and slow detection speed of existing OFDs, one-stage deep learning (DL)-based object detection methods, which recognize objects directly, have gained wide attention. They are effective in many fields, such as intelligent security [12], intelligent transportation [13], and intelligent firefighting [14]. Specifically, a DL model automatically discovers and extracts the most relevant and salient features of each object class by training neural networks on data [15]. Moreover, DL-based methods are generally end-to-end models, e.g., [13,16,17], where the target features are extracted directly from the input data, and thus the computational efficiency is greatly improved. Currently, the availability of abundant computational resources, such as GPUs and TPUs, enables faster inference for deep learning algorithms, making their application on devices feasible [18]. Therefore, DL-based methods are more suitable for integration with existing optical sensing devices. Additionally, DL-based methods are becoming increasingly mature in various industrial applications [19,20], further confirming the effectiveness of such methods.
In many studies, the You Only Look Once (YOLO) family of DL-based algorithms [21][22][23], such as YOLOv5 [24], offers end-to-end and multi-scale object detection. These characteristics make it exceptional in terms of speed and real-time detection. However, DL-based YOLO models have limitations in efficiently detecting dynamic flames in real time, where the landscape of the dynamic flame is uncertain. Concretely, current DL-based YOLO research in flame recognition has primarily focused on balancing model accuracy and efficiency, often overlooking the enhancement of robustness in flame detection models. More importantly, recent studies [25,26] suggest that the robustness of object detection models is a critical factor in designing new models; however, existing flame detection methods have not explicitly considered how to improve robustness. Thus, the research on flame detection has not been designed based on the two types of key features of flame, i.e., dynamic recognition and flame root region localization. Therefore, the deformable convolution network v2 (DCNv2) [27], context augmentation module (CAM) [28], dynamic head (DH) [29], and temporal consistency-based detection (TCD) [30] are integrated into YOLOv5 to address these challenges. Specifically, DCNv2 employs deformable convolutional offsets to enhance the extraction of flame image features across different scales. CAM employs an adaptive fusion method to combine the features obtained from multi-scale dilated convolutions. DH integrates three self-attention mechanisms, i.e., scale-aware, spatial-aware, and task-aware, to generate real-time results. TCD determines whether targets are real by examining their consistency across consecutive frames. These techniques mitigate the complexities of the environment and significantly enhance the precision of dynamic flame detection. After integrating the four techniques, the pruning technique [31] is introduced to accelerate detection. Then, with the precision enhanced, the ability to localize the flame root region is improved by the inductive modeling (IM) approach [32,33]. Research findings indicate that targeting the flame root region for fire suppression enables more effective control of the fire [34][35][36]. Therefore, this paper proposes a novel dynamic flame detection method named Dynamic Deformable Adaptive Framework (DDAF) for locating the flame root region dynamically.

Competing interests:
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

The main contributions of this paper can be summarized as follows.
1. The proposed Dynamic Deformable Adaptive Framework localizes the flame root region by detecting the flame. This improvement is essential for achieving more effective fire suppression and control.
2. To overcome the limitations of existing algorithms in OFDs, this paper introduces four techniques, i.e., deformable convolution network v2 (DCNv2), context augmentation module (CAM), dynamic head (DH), and temporal consistency-based detection (TCD), and integrates them into YOLOv5 to improve dynamic feature extraction capability and detection robustness. These techniques enhance the precision of real-time detection for dynamic flames, even in uncertain flame landscape conditions.
3. In this study, a pruning technique is applied to remove redundant parameters from the model, thus improving the flame detection speed.
4. This paper successfully combines the inductive modeling (IM) method with YOLOv5 to model the flame landscape, enabling the effective location of the flame root region.
The remainder of this paper is organized as follows. Section 2 briefly introduces the related work and background techniques. Section 3 introduces the DDAF. Section 4 presents the experimental results and analysis. Finally, Section 5 concludes this paper.

Related works
This study builds on several previous works: conventional algorithms in OFDs, YOLO-based end-to-end algorithms, and object detection algorithms for temporal consistency. The YOLO-based end-to-end algorithms have successfully tackled the limitations of conventional algorithms in OFDs, and the TCD technique offers a solution to the specific limitations of the YOLO-based method in flame detection.
OFDs are significant devices for ensuring fire safety. The algorithms integrated into OFDs play a crucial role in enhancing their flame detection capabilities. During the past decades, various algorithms for OFDs have been proposed, including those based on infrared [4,6,37], ultraviolet [7,38,39], and infrared/ultraviolet [5,8,40] technologies. Liu et al. proposed a flame/smoke video image detection system that includes an infrared camera among other components [6]. Truong et al. proposed a low-cost and reliable smart fire alarm system that utilizes ultraviolet detection technology [7]. Genovese et al. proposed an image processing system for the detection of wildfire smoke based on computational intelligence techniques using infrared/ultraviolet cameras [8]. They all detect flames by the characteristic wavelengths of the fire. These wavelengths are transformed into characteristic signals. Then, the detector combines the signals from both sensors and determines whether the target is a flame based on a pre-defined threshold. However, the quantification of flame features with two-dimensional data [41] and the dependence on threshold settings [6] result in limited detection speed and inadequate adaptability to different environments, respectively. This highlights the need for advancements in flame detection technologies, motivating the exploration of alternative approaches.
Over the years, the YOLO [21][22][23][42][43][44] family has been one of the most popular one-stage real-time object detectors. YOLO detectors can be found in many hardware platforms and application scenarios, meeting different requirements. After years of development, YOLO has evolved into a series of high-speed models demonstrating strong performance. Compared to multi-stage detection algorithms, the YOLO-based object detection algorithms, such as YOLOv4 [21], YOLOv5 [24], and YOLOv7 [44], demonstrate significantly faster object detection speed. In recent years, some YOLO-based algorithms have been applied to flame detection. Zhao et al. [45] proposed an improved Fire-YOLO deep learning algorithm for the detection of fire targets in forest fire images. Lestari et al. [46] proposed a method that can monitor an area of fire in a building. Goyal et al. [47] used both deep learning and infrared cameras to monitor the forest and surrounding area. Xiao et al. utilized the YOLOv5 deep neural network to develop a detection system for early fire warning in monitoring substations [48]. These methods are very effective in detecting static flames. However, when confronted with dynamic flames, these YOLO-based algorithms perform poorly in terms of detection precision because the landscape of flames varies over time.
Recently, a growing number of scholars have shifted their focus to dynamic flames. For instance, Avazov et al. [49] proposed a method that relies on a lightweight CNN model and an enhanced version of YOLOv3 for detecting dynamic flames in shipyard areas. Li et al. [50] concentrated on utilizing diverse motion detection methods, such as adaptive background subtraction and motion history images, for the effective identification of dynamic flames. Wang et al. [51] employed a diverse set of methodologies, incorporating visualized heat release rate prediction, root mean square error and mean absolute error comparisons, as well as an analysis of detection time, to enhance the accuracy and efficiency of dynamic flame detection systems. Despite the effectiveness of these methods in detecting dynamic flames, they often overlook the crucial aspect of precisely locating the root region of the flame, which is a more effective area for firefighting purposes.
Based on existing literature, current research in flame recognition has primarily focused on balancing model accuracy and efficiency, often overlooking the enhancement of robustness in flame detection models. However, recent studies [52][53][54] suggest that the robustness of object detection models is a critical factor in designing new models, and temporal consistency-based detection (TCD) algorithms perform excellently in improving robustness. This type of algorithm ensures that the recognized flames do not change abruptly in successive frames. TCD algorithms are commonly used in computer vision [30,55]. These algorithms aim to enhance the precision and robustness of object detection by leveraging the temporal information present in consecutive frames of a video sequence. Nishimura et al. [52] proposed a semi-supervised cell-detection method that uses a time-lapse sequence. Jeong et al. [54] introduced a consistency-based semi-supervised learning approach for object detection. Xiao et al. [53] presented a method for detecting adversarial frames based on the temporal consistency property of videos. These methods focus on ensuring that detected objects demonstrate consistent appearance and motion behavior across multiple frames. Compared to other object detection algorithms, they exhibit an improved capability to differentiate between true objects and false detections by taking the temporal context into account.
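The temporal consistency idea described above can be illustrated with a minimal sketch. It is not the TCD algorithm of [30], whose details are not given here; it only assumes that a detection counts as consistent when a box with sufficient overlap (IoU) exists in each of the preceding frames. The function names, the `(x1, y1, x2, y2)` box format, and the thresholds `min_frames` and `iou_threshold` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def confirm_detections(frames, min_frames=3, iou_threshold=0.5):
    """Keep boxes of the latest frame that can be traced back through the
    previous (min_frames - 1) frames (assumed consistency criterion).

    frames: list of per-frame box lists, oldest first.
    """
    if len(frames) < min_frames:
        return []  # not enough history to judge consistency
    confirmed = []
    for box in frames[-1]:
        current = box
        consistent = True
        # Walk backwards in time, following the best-overlapping box.
        for previous in reversed(frames[-min_frames:-1]):
            matches = [p for p in previous if iou(current, p) >= iou_threshold]
            if not matches:
                consistent = False
                break
            current = max(matches, key=lambda p: iou(current, p))
        if consistent:
            confirmed.append(box)
    return confirmed


frames = [
    [(10, 10, 50, 50)],
    [(12, 11, 52, 51)],
    [(14, 12, 54, 52), (200, 200, 220, 220)],  # second box appears only once
]
print(confirm_detections(frames))
```

In this toy sequence the slowly drifting box is kept, while the box that appears in a single frame is rejected as a likely false detection.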
To further demonstrate the characteristics of the related studies, a summary of the algorithms related to the proposed DDAF is given in Table 1. It can be seen that the research on flame detection has not been designed based on the two types of key features of flame, i.e., dynamic recognition and flame root region localization. More importantly, the existing flame detection methods have not explicitly considered how to improve robustness, although research on robustness receives widespread attention in other object detection scenarios. Therefore, this paper takes these three limitations as the starting points to design an effective object detection framework tailored for dynamic flame root region detection. Specifically, the integration of DCNv2, CAM, and DH effectively extracts flame features. Moreover, the pruning technique significantly improves flame detection speed. IM achieves notable success in flame root region localization, and TCD enhances flame detection robustness.

Background techniques

Upsample.
Upsample is a technique used to map low-resolution images to higher resolutions. This study employs nearest-neighbor interpolation for the up-sampling process (details are shown in Fig 1). Given an original image F with dimensions M × N and a desired up-sampled size P × Q (P > M, Q > N), the up-sampled image is denoted as G and its pixel at position (i, j) is denoted as G(i, j). This process is indispensable in DDAF, particularly in handling intricate image details such as those encountered in flame detection, and is expressed mathematically as follows:

$$G(i,j) = F\left(\operatorname{round}\left(i \cdot \frac{M}{P}\right),\ \operatorname{round}\left(j \cdot \frac{N}{Q}\right)\right)$$

where round(·) denotes rounding to the nearest integer, and i and j represent the row and column indices of the image after upsampling.
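As an illustration of nearest-neighbor upsampling, the following minimal sketch maps each output pixel G(i, j) back to the source pixel at the rounded scaled indices round(i·M/P) and round(j·N/Q). Zero-based indexing, rounding halves up, and clamping at the image border are assumptions of this sketch, and `upsample_nearest` is a hypothetical helper name.

```python
def upsample_nearest(image, out_rows, out_cols):
    """Nearest-neighbor upsampling of a 2D list F (M x N) to G (P x Q)."""
    in_rows, in_cols = len(image), len(image[0])

    def source_index(index, in_size, out_size):
        # round(index * in_size / out_size), with halves rounded up and the
        # result clamped so it never points past the last source pixel.
        return min(int(index * in_size / out_size + 0.5), in_size - 1)

    return [
        [
            image[source_index(i, in_rows, out_rows)][source_index(j, in_cols, out_cols)]
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


small = [[1, 2],
         [3, 4]]
for row in upsample_nearest(small, 4, 4):
    print(row)
```

Every output value is copied from an existing source pixel, so no new intensity values are introduced by the interpolation.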

Conv blocks.
Conv blocks consist of residual convolutional blocks, where the convolution operation extracts features from the input data, the activation function introduces non-linearity, and the batch normalization (BN) layer aids the training of the network. These blocks are defined as:

To begin with, SPPF takes the input feature map, denoted as x, and applies a convolution operation that halves the channel dimension to alleviate the computational load. Then, MaxPool2d operations with kernel sizes of k = 5, k = 9, and k = 13 are applied to the reduced feature maps to generate y1, y2, and y3, respectively. These feature maps capture semantic information at various scales, corresponding to different levels of detail. Subsequently, a concat operation concatenates the original feature map x with y1, y2, and y3. Finally, a 1×1 convolution is performed to adjust the channel dimensions of the concatenated feature maps.
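The pooling and concatenation steps of the SPPF description can be sketched on a single-channel feature map as follows. The two convolutions are omitted, stride-1 pooling with border-clipped windows (equivalent to padding with negative infinity) is an assumption, and the function names are hypothetical.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling with a k x k window; windows are clipped at the
    border, so the output keeps the input size."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    return [
        [
            max(
                fmap[a][b]
                for a in range(max(0, i - r), min(h, i + r + 1))
                for b in range(max(0, j - r), min(w, j + r + 1))
            )
            for j in range(w)
        ]
        for i in range(h)
    ]


def spp_branches(fmap):
    """Return [x, y1, y2, y3]: the input map and its pooled versions with
    kernel sizes 5, 9, and 13, stacked as a list in place of a channel concat."""
    y1 = max_pool_same(fmap, 5)
    y2 = max_pool_same(fmap, 9)
    y3 = max_pool_same(fmap, 13)
    return [fmap, y1, y2, y3]


fmap = [[(3 * i + 7 * j) % 11 for j in range(8)] for i in range(8)]
x, y1, y2, y3 = spp_branches(fmap)
# Under these assumptions, chaining 5 x 5 pools reproduces the larger kernels.
print(max_pool_same(y1, 5) == y2, max_pool_same(y2, 5) == y3)
```

The final check suggests why the three pooled maps can be computed by repeatedly applying a single 5 × 5 pool, which is cheaper than running three separate large-kernel pools.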

Dynamic Deformable Adaptive Framework (DDAF)
Xu et al. (2019) proposed a method that includes adaptive spatial feature selection and temporal consistency constraints, enabling joint spatial-temporal filter learning in a lower-dimensional discriminative manifold. Inspired by this work, this paper proposes the DDAF framework. It aims to mitigate the slow and false detections of conventional OFDs, as well as the difficulty of detecting dynamic flames whose landscape varies over time.

Architecture
As shown in Fig 3, an image of size 640 × 640 is fed to the input. The DDAF framework consists of three parts: Backbone, Neck, and Head. They are responsible for extracting flame features, enhancing and fusing these features, and generating target detection results, respectively. In the Backbone network, the DCNv2 structure enables the network to extract features at different scales, followed by the concatenate (Concat) operation, and the output feature vector is finally processed by SPPF. The Neck network uses an FPN+PAN+CAM structure for the adaptive fusion of deep and shallow network features, thus improving the quality of the features extracted from the target. The feature pyramid network (FPN) [56] structure passes deep semantic features from top to bottom (top-down). It utilizes multi-scale feature maps for improved precision and robustness in detecting objects of different sizes within an image. The pixel aggregation network (PAN) [57] structure works from bottom to top (bottom-up) and complements the FPN by passing low-level localization features upward. Then, the CAM integrated into the FPN performs an adaptive fusion of features by using dilated convolutions with varying rates. The integration of FPN, PAN, and CAM into the Neck network thus leverages multi-scale features, complements them with low-level features, and merges them adaptively to improve the accuracy and robustness of flame detection. The Head network employs dynamic head (DH) detection to improve the detection of dynamic flames, then combines IM and temporal consistency-based detection to achieve stable root region localization. Eventually, a pruning algorithm is used to improve the detection speed of the model.

Deformable convolution network v2
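DCNv2 [27] introduces a modulation mechanism into the standard deformable module [58]. This mechanism allows convolutional kernels to dynamically adjust their shape based on the input features, thereby improving flexibility in capturing spatial details and enhancing feature extraction capabilities. The modulated deformable convolution is formulated as:

$$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k$$

where x(p) and y(p) denote the features at location p of the input and output feature maps, K is the number of sampling locations of the kernel, w_k and p_k are the weight and pre-specified offset of the k-th location, and Δp_k and Δm_k are the learnable offset and modulation scalar of the k-th location, respectively. Compared with the DCNv1 setting, DCNv2 replaces 10 more plain convolutional layers in ResNet [59] with deformable convolutions.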
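To make the modulation mechanism concrete, the sketch below evaluates a modulated deformable convolution at a single output location, assuming the standard form from [27], y(p) = Σ_k w_k · x(p + p_k + Δp_k) · Δm_k, with a 3 × 3 kernel and bilinear interpolation at fractional positions. In the real module the offsets and modulation scalars are predicted by an extra convolution; here they are passed in by hand, and all function names are hypothetical.

```python
import math


def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2D list at fractional (y, x).
    Positions outside the map contribute zero."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * fmap[yy][xx]
    return value


def modulated_deform_conv_at(fmap, p, weights, offsets, modulations):
    """Output of a 3 x 3 modulated deformable convolution at location p.

    weights:     9 kernel weights w_k
    offsets:     9 learned offsets (dy, dx), i.e. delta p_k
    modulations: 9 modulation scalars delta m_k, expected in [0, 1]
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # p_k
    out = 0.0
    for (gy, gx), w_k, (oy, ox), m_k in zip(grid, weights, offsets, modulations):
        sample = bilinear_sample(fmap, p[0] + gy + oy, p[1] + gx + ox)
        out += w_k * sample * m_k
    return out


fmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
zero_offsets = [(0.0, 0.0)] * 9
# With zero offsets and unit modulation this reduces to a plain convolution.
print(modulated_deform_conv_at(fmap, (1, 1), [1.0] * 9, zero_offsets, [1.0] * 9))
```

Setting a modulation scalar to zero removes the corresponding sampling point entirely, which is what lets the module ignore irrelevant regions in addition to shifting where it samples.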

Context augmentation module
Flame targets with a small landscape are usually difficult to detect and require a wealth of feature information. In this paper, in order to enhance the information fusion of different feature layers for tiny objects, the CAM [28] structure is added. The CAM structure is derived from the FPN structure, and its main function is to adaptively learn the weights of feature fusion across various levels. The details of CAM can be found in Fig 5.
The CAM structure is incorporated to enhance information fusion. By employing 3×3 dilated convolutions with rates of 1, 3, and 5, the CAM broadens its receptive field. Dilated convolutions enlarge the receptive field of the convolutional kernel, thereby enhancing the model's capacity to capture contextual information. This deliberate expansion of the receptive field is essential for overcoming the challenge of detecting small flames, as it enables the model to gather spatial context and intricate details associated with varying flame sizes. Three fusion methods, labeled (a), (b), and (c), are considered. Methods (a) and (b) are weighted fusion and concatenation along the channel dimension, respectively, while method (c) is adaptive fusion. Specifically, for method (c), assuming the input has a size of (bs, C, H, W), convolution operations produce spatially adaptive weights with a shape of (bs, 3, H, W). In this way, the CAM addresses the challenge of detecting small flames by capturing spatial context through varying receptive fields.
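Two aspects of the CAM description can be sketched in a few lines. First, a k × k kernel with dilation rate r covers an effective window of k + (k − 1)(r − 1) pixels per side. Second, adaptive fusion combines the three branch outputs with per-pixel weights. Normalizing the three weights with a softmax is an assumption of this sketch (the text only specifies the weight shape), and the function names are hypothetical.

```python
import math


def effective_kernel_size(kernel_size, rate):
    """Side length of the window covered by a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def adaptive_fuse(branches, weight_logits):
    """Fuse three H x W branch maps with per-pixel weights.

    branches:      list of 3 maps (outputs of the rate-1, rate-3, rate-5 convs)
    weight_logits: list of 3 maps of unnormalized weights (one (3, H, W) slice)
    """
    h, w = len(branches[0]), len(branches[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            logits = [wl[i][j] for wl in weight_logits]
            peak = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(value - peak) for value in logits]
            total = sum(exps)
            fused[i][j] = sum(e / total * b[i][j] for e, b in zip(exps, branches))
    return fused


print([effective_kernel_size(3, r) for r in (1, 3, 5)])

branches = [[[1.0]], [[2.0]], [[6.0]]]
equal_logits = [[[0.0]], [[0.0]], [[0.0]]]
print(adaptive_fuse(branches, equal_logits))
```

With equal logits the fusion reduces to a plain average of the branches; when one logit dominates at a pixel, that pixel effectively selects a single receptive field.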

Dynamic head
The FPN is a detection structure that combines multi-scale convolution features. However, during the down-sampling process, information may be lost when detecting flame targets with a small landscape. In contrast, a method known as DH [29] can effectively mitigate this issue. The feature tensor is $F \in \mathbb{R}^{L \times S \times C}$, where L denotes the number of pyramid layers, S denotes the size of the feature, and C denotes the number of channels. Moreover, S = H × W, where H and W denote the height and width of the feature. DH can be expressed as:

$$W(F) = \pi_C\left(\pi_S\left(\pi_L(F) \cdot F\right) \cdot F\right) \cdot F$$

where π_L(·), π_S(·), and π_C(·) correspond to scale-aware attention, spatial-aware attention, and task-aware attention, respectively. π_L enables dynamic feature fusion based on the importance of the features in each layer. The expression is shown in Eq (3).

$$\pi_L(F) \cdot F = \sigma\left(f\left(\frac{1}{SC}\sum_{S,C} F\right)\right) \cdot F \tag{3}$$

where f(·) is a linear function approximated by a 1×1 convolutional layer and σ(·) is a hard-sigmoid function.
For the spatial-aware attention, the sparsity is first learned using deformable convolution v2 [27], and then the cross-level features are aggregated at the same spatial locations. The expression is shown in Eq (4).

$$\pi_S(F) \cdot F = \frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} w_{l,k} \cdot F(l;\ p_k + \Delta p_k;\ c) \cdot \Delta m_k \tag{4}$$

where K is the number of sparse sampling positions, p_k + Δp_k is the position shifted by the self-learned spatial offset Δp_k to focus on a specific region, and Δm_k is the self-learned importance scalar at position p_k. The task-aware attention module dynamically switches feature channels on or off to favor different tasks, with the expression shown in Eq (5).

$$\pi_C(F) \cdot F = \max\left(\alpha^1(F) \cdot F_c + \beta^1(F),\ \alpha^2(F) \cdot F_c + \beta^2(F)\right) \tag{5}$$

where F_c is the feature slice at the c-th channel and $[\alpha^1, \alpha^2, \beta^1, \beta^2]^T$ is a hyper function that learns to control the activation thresholds. It first performs global pooling on the L×S dimensions to reduce dimensionality, then uses two fully connected layers and a normalization layer, and finally normalizes the output with the sigmoid activation function. Global pooling aggregates feature maps across spatial dimensions, capturing global context while reducing spatial information. The ensuing fully connected layers further transform these features into a vector, catering to classification or regression tasks. This sequence of operations enhances the attention mechanisms, providing improved focus on specific perspectives, including scale, spatial relationships, and task-specific details. Consequently, this refined method contributes to enhanced performance in flame detection. This type of attention integration module can be stacked based on Eq (5). The feature pyramid levels can be rescaled to a common size and represented as a 3D tensor of size L×S×C. This tensor is then fed to the dynamic detection head, which consists of several DH blocks, as shown in Fig 7. The output of the DH can be used for a variety of tasks, including classification and bounding box regression. The DH blocks are arranged in the order of L, S, and C. Based on the number of DH blocks, this study compares the AP0.5, precision, and AP0.5:0.95, as shown in Table 9.

Layer-adaptive magnitude-based pruning (LAMP)
LAMP [31] proposes a novel importance score for global pruning, derived from the perspective of model-level distortion minimization. Specifically, each neural network layer can be considered as an operator, which allows the model-level distortion produced by the pruned layers to be studied. For each unrolled weight vector, the weights are assumed, without loss of generality, to be sorted in ascending order according to the index map.
Informally, the LAMP score (Eq (8)) measures the relative importance of the target connection among all surviving connections belonging to the same layer, where connections with smaller magnitudes (in the same layer) are treated as already pruned. Consequently, two connections with the same weight magnitude can have different LAMP scores. Once the LAMP scores are calculated, the algorithm globally prunes the connections with the smallest LAMP scores until the desired global sparsity constraint is reached. The details are shown in Fig 8.
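The scoring and global pruning steps can be sketched as follows. The score used here follows the definition in the LAMP paper [31] as we understand it: with a layer's weights sorted by ascending magnitude, the score of the u-th weight is W[u]² divided by the sum of W[v]² over all v ≥ u. The function names and the list-of-lists layer representation are hypothetical, and fine-tuning after pruning is omitted.

```python
def lamp_scores(weights):
    """LAMP scores for one layer's flattened weights, in the original order."""
    order = sorted(range(len(weights)), key=lambda idx: abs(weights[idx]))
    scores = [0.0] * len(weights)
    surviving_sq_sum = 0.0
    # Walk from the largest magnitude down, so the running sum always covers
    # the weight itself plus every weight at least as large in this layer.
    for idx in reversed(order):
        surviving_sq_sum += weights[idx] ** 2
        if surviving_sq_sum > 0:
            scores[idx] = weights[idx] ** 2 / surviving_sq_sum
    return scores


def lamp_prune(layers, sparsity):
    """Return 0/1 masks that remove the globally lowest-scoring connections."""
    scored = [
        (score, layer_idx, weight_idx)
        for layer_idx, layer in enumerate(layers)
        for weight_idx, score in enumerate(lamp_scores(layer))
    ]
    scored.sort()
    masks = [[1] * len(layer) for layer in layers]
    for _, layer_idx, weight_idx in scored[: int(sparsity * len(scored))]:
        masks[layer_idx][weight_idx] = 0
    return masks


layers = [[1.0, 2.0], [10.0, 20.0]]
print([lamp_scores(layer) for layer in layers])
print(lamp_prune(layers, sparsity=0.5))
```

In this toy example both layers receive the same scores even though their magnitudes differ by a factor of ten, so one connection is pruned from each layer. Plain global magnitude pruning at the same sparsity would instead remove the whole first layer, which is the layer-collapse behavior that a layer-adaptive score is meant to avoid.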

Inductive modeling
In this paper, an inductive modeling method [32] based on the position of the detection box is proposed to address the flame root region localization challenge with the following steps, as shown in Fig 9:
1. Feature extraction and bounding box generation: Through the feature fusion algorithm, which includes DCNv2, CAM, and DH, the flame is detected to obtain its position and bounding box information.
2. Ratio-based coarse-grained location: Given the typical location of flames near the bottom of images, a preliminary localization of the root region within the bounding box is approximated based on this assumption.
3. Fine-grained location based on inductive modeling: Considering both the position and size of the bounding box, a more precise localization of the flame root region is determined by Eq (9),
where (x, y) are the coordinates of the upper-left corner of the detection box, and w and h are the width and height of the bounding box. Scaling factors 1/κ and 1/λ are applied to w and h, respectively. The flame root region is defined by coordinates (m, n), with points set at a radius of 5 pixels.
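The fine-grained step can be illustrated with a short sketch. Since Eq (9) is not reproduced here, the sketch assumes the simplest reading of the definitions above, namely m = x + w/κ and n = y + h/λ in image coordinates with y growing downward; this form, the example values of κ and λ, and the function names are assumptions rather than the authors' exact formulation.

```python
def locate_flame_root(x, y, w, h, kappa, lam):
    """Estimate the flame root point (m, n) from a bounding box.

    (x, y): upper-left corner of the box; w, h: its width and height.
    kappa, lam: scaling parameters, applied as 1/kappa and 1/lam (assumed form).
    """
    m = x + w / kappa
    n = y + h / lam
    return m, n


def in_root_region(px, py, m, n, radius=5.0):
    """True if pixel (px, py) lies within the circular root region."""
    return (px - m) ** 2 + (py - n) ** 2 <= radius ** 2


# A box at (100, 50) with size 40 x 80; kappa = 2 centers the point
# horizontally and lam = 1 places it on the bottom edge of the box.
m, n = locate_flame_root(100, 50, 40, 80, kappa=2.0, lam=1.0)
print(m, n)
print(in_root_region(123, 134, m, n))
```

Under this reading, κ controls the horizontal position of the root point inside the box and λ controls how far down the box it sits, so the coarse "near the bottom" assumption corresponds to λ close to 1.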

Experimental setup
The dataset utilized in this study is a combination of contributions from four works, i.e., [60][61][62][63], from which images of forest fires and images with flames filling the entire image are removed to validate the reliability of the flame detection method for urban fires.

Evaluation indexes
As shown in Eqs (10-13), the evaluation indexes are defined as follows: TP represents correctly identified positive instances, FP indicates falsely identified positives, TN denotes correctly identified negatives, and FN signifies falsely identified negatives. The confusion matrix synthesizes these indexes, enabling a comprehensive evaluation of model performance by computing accuracy, recall, precision, and F1 score. Recall (R) and precision (P) are calculated from the confusion matrix shown in Table 3. R is the ability of the model to successfully detect all real flames, while P indicates how many of the flames detected by the model are real flames. The F1 score (F1) is the harmonic mean of precision and recall. Compared to F1, the average precision (AP) reflects the overall detection performance of the model. It is obtained by calculating the area under the corresponding precision-recall curve. These indexes measure the comprehensiveness and accuracy of the model in identifying flames. This paper also uses frames per second (FPS) to characterize the detection speed of the algorithm. In general, 24 FPS must be achieved to guarantee real-time detection [64]. To minimize the effect of potential outliers or fluctuations, the reported FPS value is the average over 30 separate runs with a batch size of 1, i.e., the FPS values of the 30 runs are summed and the total is divided by 30.
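The evaluation indexes can be computed with a few lines of code. The sketch below uses the standard definitions P = TP/(TP+FP), R = TP/(TP+FN), and F1 = 2PR/(P+R). The AP function integrates the precision-recall curve with the trapezoidal rule, which is one of several common conventions and an assumption here, since the interpolation scheme is not specified. All function names and the numbers in the usage lines are illustrative only.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_precision(recalls, precisions):
    """Area under the precision-recall curve (trapezoidal rule).
    recalls must be sorted in ascending order."""
    area = 0.0
    for k in range(1, len(recalls)):
        width = recalls[k] - recalls[k - 1]
        area += width * (precisions[k] + precisions[k - 1]) / 2
    return area


def mean_fps(fps_runs):
    """Average FPS over separate runs (the paper uses 30 runs)."""
    return sum(fps_runs) / len(fps_runs)


print(precision_recall_f1(tp=8, fp=2, fn=2))
print(average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5]))
print(mean_fps([20.0, 30.0, 40.0]))
```

AP0.5 corresponds to running this computation with detections counted as true positives at an IoU threshold of 0.5, while AP0.5:0.95 averages the result over thresholds from 0.5 to 0.95.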

Comparison of different object detection algorithms
In order to verify the flame detection algorithm, a comparative analysis of various object detection algorithms was conducted. In this study, we selected previous versions of YOLO [21][22][23][42][43][44] as well as the state-of-the-art (SOTA) YOLOv7-tiny [44] as baselines to assess the performance of our proposed model. Moreover, existing models derived from these baselines are employed to further validate the effects of our proposed contributions on YOLOv5s. Table 4 presents the comparison results between the proposed DDAF model and these compared models trained under the same settings.
In Table 4, among all the algorithms, DDAF has the best performance on AP0.5 and AP0.5:0.95, with 0.814 and 0.493, respectively, which reflects its advantage in flame detection precision. Moreover, on classical indexes, DDAF obtains the best F1 value, performs only slightly worse than YOLOv7-tiny on P, and is superior to most methods on R. This shows that DDAF is highly competitive on classical indexes. Although DDAF is clearly lower than the other compared algorithms in FPS, it still satisfies the basic speed requirement (i.e., FPS ≥ 24) for real-time firefighting scenarios [64].
As shown in Table 6, to further improve detection precision, the data augmentation (AD) technique is integrated into DDAF*, and the resulting model is called DDAF+. To further enhance model performance, a series of pilot pruning experiments are conducted to systematically optimize several key hyperparameters, such as speed-up, fine-tuning epochs, and learning rate. Other detailed experimental configurations are provided in Table 5. Table 6 shows the comparison results for DDAF*, DDAF+, and DDAF.
In Table 6, the detection precision of DDAF+ is improved to 0.82. However, the limitation on the FPS value has not been alleviated. Building on this groundwork, we further incorporated the pruning technique, resulting in a detection speed of 89.6 FPS, with substantial reductions in both parameters and FLOPs. This pruned algorithm is referred to as DDAF. Compared to DDAF*, the FPS of DDAF is improved by 261.8%. The results show that the AP0.5 on our custom dataset is increased to 0.826 while parameters and FLOPs are reduced by 49.0% and 33.5%, respectively. In summary, the DDAF algorithm achieves a good balance between detection precision and speed.

Ablation study
In order to verify the effectiveness of each method proposed in this paper, ablation experiments were conducted and the results are shown in Table 7. DCNv2, CAM and DH are the three main components of DDAF.We add them to the baseline incrementally to compare the effectiveness of each component.
Different components have similar effects.For example, both DCNv2 and CAM are scaleaware.Compared to DCNv2 and CAM, their combination improved 2.14% AP 0.5 and 2.01% AP 0.5:0.95, respectively.DCNv2 and DH are both sensitive to different objects, especially medium and large objects, and the combination of the two reached the second highest 0.804 AP 0.5 , which is a little bit lower than that of DH.The combination of CAM and DH achieved not only a 0.808 AP 0.5 but also the second-highest precision.Each of these three components has its own strengths and weaknesses.It is evident that the overall performance, particularly in terms of AP, is optimized when all three components are utilized together.The experiment proves that compared with the baseline, DDAF comprehensively improves the average accuracy of targets at different scales, and can effectively improve the detection of small-scale targets.In conclusion, DDAF can effectively improve the real-time detection precision of dynamic flames.Compared to the baseline, DCNv2 improves AP 0.5 and AP 0.5:0.95 by 1.66% and 0.42%, respectively.The reason is that DCNv2 allows each pixel in the input feature map to have an adaptive receptive field.This adaptation allows for better capture of complex object details and handling variations in object shape and size.However, above the FPS, it is about 14 frames lower than the baseline.On the one hand, the increased model parameters result in the need for more computation and storage requirements.On the other hand, increased memory and bandwidth requirements lead to slower data transfer.
The CAM also performs well: it improves AP0.5 by 1.79% and P by 3.21%, with an FPS only slightly lower than the baseline, while its FLOPs are minimal. The mechanism of adaptive expanded convolution fusion reduces the computational burden of the model. However, the AP0.5:0.95 of CAM is only about the same as that of the baseline. The reason is that performing expanded convolution with different rates discards some of the valid information, which reduces part of the model's ability to capture details in the image.
Of all the components, DH brings the most significant benefit. Adding DH to the baseline improves AP0.5 and AP0.5:0.95 by 3.96% and 2.51%, respectively. This is because its three attention mechanisms adaptively fuse multiple layers of features, which enhances the ability of the detection head to discriminate different feature points and thus improves the perceptual ability of the model. At the same time, DH also causes a serious reduction in FPS, which is due to the additional learnable parameters, i.e., newly added weights and biases, which require more memory to store.

Ablation experiments for CAM fusion models.
To verify that the adaptive fusion module is better suited to the Head, this study compares it with the weighting and splicing fusion modules on both the Backbone and the Head. The experimental results are shown in Table 8. Although the adaptive fusion model performs best on the Backbone, its parameters and FLOPs are too large, which decreases the detection speed of the algorithm. The performance of the adaptive fusion model on the Head is only slightly lower than that on the Backbone. Therefore, we choose to use the adaptive fusion model on the Head.

Ablation experiments for the number of DH block stacks.
This section explores how the number of stacked DH blocks affects the detection results. We designed experiments using 2, 4, 6, and 8 DH blocks for comparison on the custom dataset. As shown in Table 9, the highest performance improvement is obtained with 8 DH blocks. Therefore, DDAF uses 8 DH blocks.

Ablation experiments for data augmentation types.
As shown in Table 10, our experiments indicate that the model performance continues to improve with the moderate data augmentation approach in YOLOv5s. The hyperparameters for all three data augmentation types are the YOLOv5s defaults.

Experiment for temporally consistent video processing
The aim of this section is to determine whether a detected target is real by checking the consistency of its appearance and motion across consecutive video frames. Specifically, if the same target is detected continuously for a given number of frames (consecutive frames) within a given time window (jumping frames), it is recognized as the flame to be detected; conversely, if it is not detected continuously for that number of frames within the window, the check restarts on the next window. We tuned the number of jumping frames and the number of consecutive frames by trial and error on the flame video, and the experiments showed that the best flame detection performance in the outdoor scene was achieved when the number of jumping frames and the number of consecutive frames were set to 30 and 10, respectively. The experimental results are shown in Fig 11.
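The window-based check described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: it assumes that an upstream detector and tracker have already reduced each frame to a single boolean ("the same target was detected in this frame"), that windows do not overlap, and the function name `confirm_flame` is hypothetical.

```python
from typing import Iterable, List


def confirm_flame(detections: Iterable[bool],
                  window: int = 30,
                  min_consecutive: int = 10) -> List[bool]:
    """Return one decision per non-overlapping window of `window` frames.

    A window is confirmed as flame when the target is detected in at least
    `min_consecutive` consecutive frames inside it; otherwise the check
    simply moves on to the next window. The defaults (30 and 10) are the
    settings reported in the text; everything else is an assumption.
    """
    frames = list(detections)
    decisions = []
    for start in range(0, len(frames), window):
        run = best = 0
        for hit in frames[start:start + window]:
            run = run + 1 if hit else 0   # length of the current detection streak
            best = max(best, run)         # longest streak seen in this window
        decisions.append(best >= min_consecutive)
    return decisions


# A flickering false alarm (never two hits in a row), then a steady flame.
flicker = [True, False] * 15                       # 30 frames
steady = [False] * 5 + [True] * 12 + [False] * 13  # 30 frames
print(confirm_flame(flicker + steady))  # [False, True]
```

Requiring a streak rather than a simple hit count is what rejects intermittent false positives such as reflections, at the cost of a delay of up to one window before a flame is confirmed.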

Trial and error for IM
To systematically explore the flame root region, we drew on the previous flame literature on inductive modelling to derive candidate ranges for κ and λ. We then used a stepwise approach to experiment iteratively with values in these ranges. Specifically, κ was taken from the range [1.8, 2.2] in increments of 0.1, and λ was taken from the range [1, 2], again in increments of 0.1. This exhaustive search strategy provided a comprehensive assessment of root region localization across all parameter combinations. We conducted experiments on the 437 images of the validation set, and the results indicated that κ = 2, λ = 1.3 is the most reasonable setting. With this setting, the method was able to precisely mark the flame root area during the detection stage, and the picture below shows the test we did in a real scenario. The experimental results are shown in Fig 11.
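The stepwise search over κ and λ amounts to a small grid search, which can be sketched as below. The grid bounds and step come from the text; the scoring function is not specified there, so `toy_score` is a purely hypothetical placeholder (it is deliberately peaked at the reported setting and says nothing about how the authors judged the candidates), and `grid` and `best_pair` are illustrative names.

```python
from itertools import product
from typing import Callable, List, Tuple


def grid(start: float, stop: float, step: float = 0.1) -> List[float]:
    """Inclusive range, rounded to one decimal to avoid floating-point drift."""
    n = round((stop - start) / step)
    return [round(start + i * step, 1) for i in range(n + 1)]


KAPPAS = grid(1.8, 2.2)   # 1.8, 1.9, 2.0, 2.1, 2.2
LAMBDAS = grid(1.0, 2.0)  # 1.0, 1.1, ..., 2.0


def best_pair(score: Callable[[float, float], float]) -> Tuple[float, float]:
    """Evaluate every (kappa, lambda) combination and return the best one."""
    return max(product(KAPPAS, LAMBDAS), key=lambda kl: score(*kl))


def toy_score(kappa: float, lam: float) -> float:
    """Hypothetical stand-in for a root-localization quality measure."""
    return -((kappa - 2.0) ** 2 + (lam - 1.3) ** 2)


print(len(KAPPAS) * len(LAMBDAS))  # 55
print(best_pair(toy_score))        # (2.0, 1.3)
```

With 5 values of κ and 11 values of λ, the search covers only 55 combinations, which is why an exhaustive sweep over the 437 validation images is affordable.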

Conclusions
This paper proposed an algorithm called DDAF to mitigate the limitations of the algorithms in traditional optical flame detectors (OFDs). Dynamic real-time flame detection was achieved by integrating four techniques, i.e., deformable convolution network v2 (DCNv2), context augmentation module (CAM), dynamic head (DH), and temporal consistency-based detection (TCD), into YOLOv5s. The LAMP pruning method was also used to lighten the model and improve its detection speed. This study compared the performance of different object detection algorithms. The results showed that, under the same setup conditions, the detection accuracy of DDAF was better than that of all the compared algorithms, and the number of parameters and FLOPs were reduced by 49.0% and 33.5%, respectively. The method achieved a good balance between detection performance and detection speed. The proposed framework is also applicable to other tasks that have special requirements for flame detection, such as wildfire monitoring, industrial safety, and combustion process control.
In the future, constructing standard flame datasets for different firefighting scenarios may be more suitable for validating the generalization ability of the algorithm. The hyperparameters can be further optimized to improve the accuracy of flame detection. Some special noise images should be added to the training set to further validate the robustness of the algorithm. Moreover, iterative improvement techniques could be merged into DDAF to adapt efficiently to changing technological environments or emerging challenges. Furthermore, other advanced pruning or knowledge distillation techniques may also be able to make the model more lightweight.
In Eq (2), BN denotes batch normalization, Siou represents the activation function, and Conv 3×3 denotes a convolution operation with a kernel size of 3×3. Following the design principles, these convolutions are dense.
2.2.3 Spatial pyramid pooling features. SPPF performs a multi-scale pooling operation on the input feature map to capture semantic information at different scales. Details are shown in Fig 2.

Fig 2. Illustration of SPPF. Best viewed in color. https://doi.org/10.1371/journal.pone.0301839.g002

Δp_k and Δm_k are the learnable offset and modulation scalar for the k-th location, respectively. As shown in Fig 4, both the offset Δp_k and the modulation Δm_k are obtained via a separate convolutional layer applied over the same input feature maps x, with 2K and K output channels, respectively. Additionally, to enhance the model's ability for geometric transformation,

Fig 3. Illustration of the DDAF. This study primarily uses DCNv2 for feature extraction, CAM for fusion, and DH for dynamic flame detection. Additional details are presented in the backbone, neck, and head. Best viewed in color. https://doi.org/10.1371/journal.pone.0301839.g003

Fig 4. Illustration of deformable convolution. Input feature map with 2K and K output channels. The 2K portion represents sampling offsets, and the K portion represents channel weights. The final aggregation results in the output feature map. Best viewed in color. https://doi.org/10.1371/journal.pone.0301839.g004
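For reference, the modulated deformable convolution that Fig 4 depicts is usually written as follows in the DCNv2 literature. The symbols y, x, p, w_k, and p_k are taken from that standard formulation and are an assumption here, since the corresponding equation is not reproduced in this part of the text.

```latex
y(p) = \sum_{k=1}^{K} w_k \cdot x\left(p + p_k + \Delta p_k\right) \cdot \Delta m_k
```

Here K is the number of sampling locations of the kernel (K = 9 for a 3×3 kernel), w_k and p_k are the weight and the fixed grid offset of the k-th location, Δp_k is an unconstrained real-valued offset, and Δm_k lies in [0, 1]. The 2K offset channels hold the horizontal and vertical components of each Δp_k, and the K remaining channels hold the Δm_k values.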

Fig 8. Illustration of the pruning process of LAMP. First, structured pruning is performed using the layer-adaptive magnitude-based pruning (LAMP) method, where the connection with the smallest LAMP score is pruned sequentially until the required global sparsity constraint is satisfied. This method significantly reduces the number of parameters. Then the pruned model is trained again, i.e., fine-tuned. https://doi.org/10.1371/journal.pone.0301839.g008

The flame dataset has a total of 4,248 images, with 3,811 images in the training set and 437 images in the validation set. The distribution ratio of the custom dataset across different categories is shown in Fig 10. All experiments are first trained using the training set and validated using the validation set. The validation set helps to determine whether the model generalizes well and to assess the effectiveness of the flame detection application. All experiments are conducted under consistent environmental conditions and hyperparameters. The experimental environment is shown in Table

Table 1. Summary of the related works.
Goyal et al. [47] | YOLO-based | Yes | Static | None | Using both deep learning and infrared cameras to monitor the forest.
Li et al. [50] | YOLO-based | Yes | Dynamic | None | Using GMM-based background subtraction and motion history images; using color models in YCbCr and RGB spaces.
Xiao et al. [48] | YOLO-based | Yes | Static | None | Using multi-feature integration technology, adaptive mixture Gaussian model, and motion memory matrix.
Zhao et al. [45] | YOLO-based | Yes | Static | None | Extending feature extraction networks in three dimensions for small target object detection.
Wang et al. [51] | YOLO-based | Yes | Dynamic | None | Using vHRR prediction, comparison of RMSE and MAE, and detection time analysis to detect dynamic flames.
Note: Conventional: conventional algorithm in OFDs; Static: static flame; Dynamic: dynamic flame.
https://doi.org/10.1371/journal.pone.0301839.t001

Table 4. The results of different object detection algorithms.
Model | AP0.5 | AP0.5:0.95
Note: Assuming the proposed and compared methods are denoted as A and B, respectively, the percentage can be calculated using the formula: (A-B)/B.
https://doi.org/10.1371/journal.pone.0301839.t004

Table 6. The results of different object detection algorithms.
Note: Assuming the proposed and compared methods are denoted as A and B, respectively, the percentage can be calculated using the formula: (A-B)/B.
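The (A-B)/B formula in the note can be made concrete with a short sketch. The function names and the first two input pairs are hypothetical and chosen only for illustration. The last call inverts the formula using two figures reported in the text (89.6 FPS and a 261.8% improvement); its result is an implied value and not a number read from Table 6.

```python
def relative_change(a: float, b: float) -> float:
    """Percentage change of the proposed method A over the compared method B."""
    if b == 0:
        raise ValueError("reference value B must be non-zero")
    return (a - b) / b * 100.0


def implied_reference(a: float, percent: float) -> float:
    """Invert (A - B) / B: recover B from A and the reported percentage."""
    return a / (1.0 + percent / 100.0)


# Hypothetical values: a positive result means A is higher than B.
print(round(relative_change(110.0, 100.0), 1))  # 10.0
print(round(relative_change(75.0, 100.0), 1))   # -25.0

# Reference FPS implied by 89.6 FPS and a 261.8% improvement.
print(round(implied_reference(89.6, 261.8), 1))  # 24.8
```

Because the change is expressed relative to B, a reduction in parameters or FLOPs appears as a negative value, and an improvement above 100% means that A is more than twice B.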