Abstract
In the current environmental context, significant emissions generated by industrial and transportation activities, coupled with an unreasonable energy structure, have resulted in recurrent haze phenomena. This degrades the contrast and resolution of captured images, significantly hindering subsequent mid- and high-level visual tasks. These technical challenges have positioned image dehazing as a pivotal research frontier in computer vision. Nevertheless, current image dehazing approaches exhibit notable limitations. Deep learning-based methodologies demand extensive paired hazy-clean training datasets, which remain particularly challenging to acquire. Furthermore, synthetically generated data frequently exhibit marked disparities from authentic scenarios, limiting model generalizability. Although diffusion-based approaches demonstrate superior image reconstruction performance, their data-driven implementations face comparable limitations. To overcome these challenges, we propose HazeDiff, a training-free dehazing method based on the diffusion model, which provides a novel perspective for image dehazing research. Unlike existing approaches, it eliminates the need for hard-to-obtain paired training data; this not only reduces computational costs but also improves generalization ability and stability across different datasets, ultimately making the dehazing results more reliable and effective. The proposed Pixel-Level Feature Inject (PFI) is implemented through the self-attention layer: it integrates the pixel-level feature representation of the reference image into the initial noise of the dehazed image, effectively guiding the diffusion process to achieve dehazing.
As a supplement, the Structure Retention Model (SRM), incorporated in the cross-attention, performs dynamic feature enhancement through adaptive attention re-weighting. This ensures the retention of key structural features during restoration while reducing detail loss. We have conducted comprehensive experiments on both real-world and synthetic datasets. Experimental results demonstrate that HazeDiff surpasses state-of-the-art dehazing methods, achieving better scores on both no-reference (e.g., NIQE) and full-reference (e.g., PSNR) evaluation metrics. It shows stronger generalization ability and practicality, restoring high-quality images with natural visual features and clear structural content from low-quality hazy images.
Citation: Lin X, Li Z, Huang D, Feng W, An X, Sun L, et al. (2025) Hazediff: A training-free diffusion-based image dehazing method with pixel-level feature injection. PLoS One 20(10): e0329759. https://doi.org/10.1371/journal.pone.0329759
Editor: Xiaohui Zhang, Bayer Crop Science United States: Bayer CropScience LP, UNITED STATES OF AMERICA
Received: March 30, 2025; Accepted: July 21, 2025; Published: October 28, 2025
Copyright: © 2025 Lin et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: 1. The RESIDE dataset can be downloaded from https://sites.google.com/view/reside-dehaze-datasets/ 2. NH-HAZE dataset can be downloaded from https://data.vision.ee.ethz.ch/cvl/ntire20/nh-haze/. These datasets contain various sets of foggy images used for verification and evaluation, covering diverse scenes. All relevant data are available without restrictions to ensure the reproducibility of this study. The data are owned by a third-party.
Funding: This work was supported in part by the Shandong Province Science and Technology Small and Medium-sized Enterprises Innovation Ability Improvement Project in the form of a grant [2024TSGC0285].
Competing interests: The authors have declared that no competing interests exist.
Introduction
In the various applications of computer vision, image dehazing is an important and challenging research area. Whether in daily image capturing, security monitoring, or key fields like autonomous driving assistance, clear images are the foundation for accurate analysis and decision-making. However, airborne particles such as dust and smoke are prevalent in the atmosphere, exhibiting strong light absorption and scattering properties. This phenomenon results in haze-obscured images with severely degraded contrast and clarity, substantially impairing their utility for tasks such as object detection and scene recognition, as shown in Fig 1. This underscores the need for effective haze removal.
Currently, the main dehazing methods are generally based on deep learning and diffusion models. Among deep learning-based dehazing methods, DehazeNet [1] dehazes images by learning the medium transmission in the haze degradation model; FFA-Net [2] introduces a feature attention mechanism (FAM) into the dehazing network; GDN [3] proposes an end-to-end trainable convolutional neural network with attention-based multi-scale estimation; MSBDN [4] proposes a multi-scale boosted dehazing network based on the U-Net architecture; and Dehamer [5] generates high-quality dehazed images through convolutional blocks and multi-scale residual blocks. These methods typically require considerable effort and resources to train models on large amounts of data, but obtaining suitable training data is quite challenging, because it is nearly impossible to capture paired images of the same scene in both hazy and clear conditions in the real world. Additionally, due to the lack of effective use of prior knowledge, the stability and consistency of deep learning-based dehazing results are difficult to ensure. Diffusion model-based dehazing methods can leverage the prior knowledge of large pre-trained models, demonstrating unique advantages in dehazing tasks. However, studies that optimize dehazing by training additional plugins again face challenges such as the scarcity of real hazy scene training data and the mismatch between synthetic and real data distributions. For example, RSHazeDiff [6] applies contrastive constraints during Fourier-perceived condition diffusion model training, while MonoWAD [7] uses hazy-clear paired features during the training phase, generating weather reference features through convolutional layers and a quantization process.
In dehazing methods that require training, the datasets used for model training are mainly constructed in two ways. First, most studies simulate hazy images by artificially adding haze to clear-day images to create synthetic hazy data. Second, some researchers use haze machines in real-world scenes to directly capture hazy images along with their corresponding real clear images. However, these artificially created hazy images have fundamental differences from images captured in real hazy conditions. These differences limit the generalization ability of deep learning models trained on these datasets. As a result, when the model encounters entirely new, untrained real hazy images, it lacks targeted learning of the specific features of these true hazy images. The dehazing model cannot accurately extract and process haze information in the images, making it difficult to effectively remove the haze. Consequently, dehazed images exhibit deficiencies in clarity, color fidelity, and detail preservation, thereby diminishing the reliability and practicality of such models in real-world scenarios.
We have observed significant progress in using reference images [8–11] to guide diffusion models in image synthesis, editing, and stylization. However, image synthesis and editing often struggle to achieve transformations in both content and tone simultaneously, while stylization tends to disrupt the structure of the original hazy image. Despite this, using reference images to guide diffusion models still holds value for dehazing, as it can effectively reference two real datasets (one normal dataset and one hazy dataset).
Against this backdrop, we propose an innovative HazeDiff framework. This framework uses reference images to guide the diffusion model for dehazing, without the need for complex training processes. The core idea is to cleverly integrate the distribution features of clear-weather images into hazy images while preserving the original structure of the hazy image, thereby achieving high-quality dehazing. Specifically, we select a clear-weather image as a reference and use a specific algorithm to fuse the latent noise of the two images, obtaining the initial noise for the dehazed image and performing a reverse diffusion process. In this process, we introduce a PFI method. This method feeds the hazy data into a self-attention mechanism, and by replacing the corresponding parts with key-value pairs from the reference image, it successfully transfers the pixel-level features of the reference image to the hazy image. Furthermore, to ensure that the model not only preserves the original structure of the hazy image but also accurately focuses on key regions and details during the dehazing process, we design an attention-based SRM, which effectively overcomes the problem of blurred structural details in the generated image.
We conduct comprehensive experiments on both reference-free real-world datasets and synthetic datasets with corresponding clear references. The experimental results show that, compared to existing advanced dehazing methods, HazeDiff achieves the best performance in dehazing synthetic hazy images, artificially generated hazy images, and real-world hazy images. It demonstrates stronger generalization ability and practicality, successfully recovering high-quality images with natural visual features and clear content structure from low-quality, blurry hazy images.
In summary, our contributions can be listed as follows:
- We propose HazeDiff, a novel diffusion-based dehazing framework. Using a training-free strategy, it achieves accurate dehazing while effectively avoiding the challenges of hazy data.
- We propose a pixel injection method, PFI, which better guides the content image to achieve pixel-level transformation by replacing the key and value in the model’s self-attention.
- We propose an attention-based object edge retention module, SRM, which helps the model focus on the key areas and details of the hazy image, solving the problem of losing original structural details during the generation process.
- We conducted extensive experiments on real hazy datasets. The results show that the method proposed in this paper significantly outperforms previous methods.
Related work
Deep learning-based image dehazing methods
Deep learning has been widely applied in fields such as computer vision, image classification, and object detection, and is gradually being applied to image dehazing tasks. The DehazeNet [1] dehazing network includes a feature extraction layer combined with traditional handcrafted features, a multi-scale mapping layer, a local extrema layer, and a nonlinear regression layer, which learns the medium transmittance in the haze degradation model to perform dehazing. IA-YOLO [12] relies on a large amount of high-quality labeled data for training. Through multi-scale feature fusion, it utilizes deep-layer features (containing high-level semantic information) to help restore the details blurred by haze. FFA-Net [2] introduces the Feature Attention Mechanism (FAM) into the dehazing network to handle different types of information. GDN [3] proposes an end-to-end trainable convolutional neural network that introduces attention-based multi-scale estimation to guide single-image dehazing. Dehamer [5] uses a Transformer to establish long-range dependencies, guided by haze density, to generate high-quality dehazed images through convolutional blocks and multi-scale residual blocks. MSBDN [4] proposes a multi-scale enhanced dehazing network based on the U-Net architecture, with a dense feature fusion method that uses boosting strategies and back-projection techniques to enhance feature fusion; the decoder gradually restores the dehazed image, improving dehazing performance.
Deep learning-based methods often require training a completely new architecture from scratch. On one hand, these methods lack prior knowledge, making it difficult to guarantee effective dehazing. On the other hand, they commonly face the issue that model performance depends on large amounts of hazy scene data for training. Factors such as data quality, quantity, and diversity all impact the model’s generalization ability, and paired hazy and clear weather data from the same scene are nearly impossible to obtain. Some methods add haze to clear weather data and use these augmented data for training. However, the distribution of haze-added data differs fundamentally from that of real hazy scene data, which may lead to a decrease in the model’s adaptability to different hazy conditions in practical applications.
Diffusion model-based image dehazing methods
Diffusion-based generative models [13,14] gradually introduce noise to obtain the latent noise of an image, and then iteratively denoise from the noise distribution to recover a clear image. This approach effectively leverages the prior knowledge of large models. Some works choose to train a plug-in, but in doing so they encounter the same difficulties of obtaining training data and the disparity between haze-added and real data. Frequency [15] trains a dehazing framework based on a conditional diffusion model, using paired data to learn the model parameters so that the model learns the mapping from hazy images to clear images. RSHazeDiff [6] likewise requires coarse recovery results and clear real image patches for contrastive constraints during Fourier perceptual condition diffusion model training. MonoWAD [7] receives paired clear-hazy features in the training phase and generates weather-reference features through a convolution layer and a quantization process. It is trained with a clear knowledge recalling (CKR) loss, composed of a clear knowledge embedding (CKE) loss and a weather-invariant guiding (WIG) loss, enabling it to remember the knowledge of clear weather and generate reference features for different weather conditions.
Some studies on image restoration using conditional diffusion models, in addition to requiring a large amount of training data, may also be limited in dehazing scenarios by their reliance on physical models, and risk significantly damaging the original structure of the image. Diff-Retinex [16] puts forward a low-light image enhancement method based on physical interpretability and generative diffusion models, transforming the low-light enhancement problem into Retinex decomposition and conditional image generation problems. In [17], a unified conditional framework based on diffusion models for image restoration is proposed. Dehaze-DDPM [18] proposes an innovative two-stage image dehazing model. In the first stage, the model introduces a physical model to perform initial processing of the image, estimating key parameters to generate a new image. In the second stage, the model uses the output of the first stage as a condition to guide the diffusion model for sampling, and a visual-textual dual-encoder is constructed to map abstract text entities to specific image regions. This type of model has limited understanding of text semantics, which may lead to poor consistency between the generated image and the text semantics, and also poses a significant risk of damaging the original structure of the image.
We avoid these limitations from a new perspective: we use a reference image to guide the diffusion model without any training. This improves the model's generalization ability and makes it better suited to the complexity of real hazy scenes.
Research methods
Based on the diffusion model, our training-free strategy focuses on the pixel and structural features in the reference image and the hazy image to achieve effective dehazing. In section 3.1, we introduce the background of denoising diffusion implicit models; in section 3.2, we describe the overall pipeline of HazeDiff; in section 3.3, we explain the Pixel-Level Feature Inject method and its implementation; in section 3.4, we provide an overview of the design and implementation of the Structure Retention Model, which preserves the original structural details of hazy images.
Preliminary: Denoising diffusion implicit models
The Denoising Diffusion Implicit Model is a non-Markovian latent variable model that achieves efficient generation through deterministic sampling trajectories. Its core framework consists of two key processes:
Forward Diffusion Process: The input image x0 is encoded into a latent variable z0, and noise is gradually added via a predefined scheduling mechanism. The mathematical formulation is:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$

Here, $\bar{\alpha}_t$ is a monotonically decreasing noise scheduling parameter that controls the rate of noise accumulation, and T denotes the total diffusion steps. By directly linking the initial latent variable z0 to any intermediate state zt, this process supports non-Markovian jumps, enabling flexible trajectory design.
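As an illustration, the closed-form jump from z0 to any zt can be sketched as follows (a minimal NumPy sketch treating latents as plain arrays; the function name `q_sample` is our own, not from the paper's code):

```python
import numpy as np

def q_sample(z0, alpha_bar_t, eps=None):
    """Closed-form forward noising: jump from z_0 directly to z_t.

    alpha_bar_t is the cumulative noise-schedule product at step t.
    """
    if eps is None:
        eps = np.random.randn(*z0.shape)
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
```

Because the jump is closed-form, no intermediate steps z1..z(t-1) need to be materialized, which is what enables the flexible, non-Markovian trajectory design mentioned above.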
Reverse Denoising Process: The reverse process follows a deterministic generation path, reconstructing the latent variables step-by-step using a noise-prediction model $\epsilon_\theta$. Specifically, the update from zt to zt−1 is governed by:

$$z_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{z_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(z_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_\theta(z_t, t) + \sigma_t \epsilon_t$$

The parameter $\sigma_t$ adjusts stochasticity (typically set to 0 for fully deterministic generation). Unlike traditional approaches, this reverse process eliminates the need to learn covariance matrices, relying solely on the noise-prediction model $\epsilon_\theta$ to drive the dynamics.
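The deterministic reverse update can be sketched as follows (an illustrative NumPy sketch; `ddim_step` is our own name, and `eps_pred` stands in for the output of the noise-prediction network):

```python
import numpy as np

def ddim_step(z_t, eps_pred, alpha_bar_t, alpha_bar_prev, sigma_t=0.0):
    """One DDIM update from z_t to z_{t-1}.

    With sigma_t = 0 the trajectory is fully deterministic,
    matching the setting used in this work.
    """
    # Estimate the clean latent z_0 from the predicted noise
    z0_pred = (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Deterministic direction pointing towards z_t
    direction = np.sqrt(1.0 - alpha_bar_prev - sigma_t ** 2) * eps_pred
    # Optional stochastic term (zero when sigma_t = 0)
    noise = sigma_t * np.random.randn(*z_t.shape)
    return np.sqrt(alpha_bar_prev) * z0_pred + direction + noise
```

Running this step with the exact noise and the final target $\bar{\alpha}_{t-1} = 1$ recovers z0, which is how DDIM inversion and reconstruction round-trip in the pipeline below.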
HazeDiff pipeline
The overall architecture of the proposed method is shown in Fig 2. Inspired by the work of [11,19,20], we use the U-Net [21] architecture and leverage the pre-trained Stable Diffusion model. At time step 0, we invert the reference image and the hazy image into their respective latent noises $z_T^{ref}$ and $z_T^{hazy}$. Then, we combine the two latent noises using AdaIN [22] to obtain the initial noise $z_T^{dh}$ of the dehazed image, and perform the reverse diffusion process on this initial noise.
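As a sketch of the AdaIN fusion step (a minimal NumPy illustration assuming channel-first latents of shape (C, H, W); the function name and shapes are illustrative, not taken from the released code):

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Re-normalize the hazy latent (content) so that each channel's
    mean and std match the reference latent (style)."""
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    # Whiten the content statistics, then re-color with the style statistics
    return s_std * (content - c_mu) / (c_std + eps) + s_mu
```

The spatial arrangement of the hazy latent is untouched, which is why the content structure survives while the reference's per-channel distribution (and hence its clear-weather tone) is carried over.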
DDIM Diffusion Process (right diagram): The hazy image and reference image are separately mapped to the latent noise space through the DDIM diffusion process, producing their corresponding noise latents $z_T^{hazy}$ and $z_T^{ref}$.
The overall framework diagram (middle part of the left diagram) shows the proposed image dehazing workflow:
1. Initial noise generation for the dehazed image: Adaptive instance normalization (AdaIN) combines the noise $z_T^{hazy}$ from the hazy image with the noise $z_T^{ref}$ from the reference image. This produces the initial noise $z_T^{dh}$ for the dehazed image, preserving the hazy image's content structure while incorporating the reference image's color distribution.
2. Reverse diffusion process: Starting from the initial noise of the dehazed image, we execute a reverse diffusion process, during which pixel-level features from the reference image are injected through the following operations. Pixel-Level Feature Inject (upper part of the left diagram): in the self-attention layer, the key $K^{hazy}$ and value $V^{hazy}$ of the hazy image are replaced with the corresponding $K^{ref}$ and $V^{ref}$ from the reference image, and the query $Q^{hazy}$ of the hazy image is mixed with the query $Q^{dh}$ of the dehazed image to retain content structure. Structure Retention Model (lower part of the left diagram): the module combines channel attention and spatial attention mechanisms. It dynamically weights feature channels through global pooling operations and a shared MLP, while a convolution integrates channel-pooled features to enhance key spatial regions.
Pixel-level feature inject
Inspired by some generative methods [11,23,24], we found that injecting pixel-level features into the K and V of self-attention can better guide the content image to achieve pixel-level transformations. The specific operation of self-attention is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \qquad Q = W_Q\,\phi(z_t),\; K = W_K\,\phi(z_t),\; V = W_V\,\phi(z_t)$$

Here, d denotes the dimension of the projected query, $W_Q$, $W_K$, and $W_V$ are the projection layers, and $\phi(z_t)$ is the feature after the residual block.
We adjust the self-attention mechanism by using the features extracted from the reference image as conditions. Specifically, in the self-attention, we replace the keys and values of the hazy image with the keys and values from the reference image at the same time step t, without requiring any textual supervision. As a result, the dehazed image retains the original structure of the hazy image while effectively removing the haze and injecting the pixel distribution of normal weather.
As shown in Fig 2, in order to achieve this goal, at each noisy time step t, we collect the query features $Q^{hazy}$ from the hazy image and the key features $K^{ref}$ and value features $V^{ref}$ from the reference image. During the reverse denoising process, we inject the collected $K^{ref}$ and $V^{ref}$ into the self-attention layer, replacing the original key features $K^{hazy}$ and value features $V^{hazy}$. This transfers the clear, haze-free pixel distribution from the reference image to the latent representation $z_t$ of the dehazed image.
In Stable Diffusion, the query vector Q determines the range or object that needs to be focused on and queried. We aim to perform dehazing on hazy images with varying levels of haze by controlling Q. Therefore, during the denoising process, we mix the query $Q^{dh}$ from the latent representation of the dehazed image with the query features $Q^{hazy}$ from the hazy image. The operation of the self-attention block can be expressed as follows:

$$\tilde{Q} = \gamma\, Q^{hazy} + (1-\gamma)\, Q^{dh}, \qquad \mathrm{Attention} = \mathrm{softmax}\!\left(\frac{\tilde{Q}\,(K^{ref})^{\top}}{\sqrt{d}}\right) V^{ref}$$
Here, γ is the mixing degree parameter, with a value range of [0,1]. We adjust the dehazing strength by changing the value of γ. Specifically, the higher the value of γ, the weaker the dehazing effect, and more content from the hazy image, including structure and pixel-level haze, will be preserved. Conversely, the lower the value of γ, the stronger the dehazing effect, but it may also cause the dehazed image to lose more key details of the original content. Ablation studies (section 4.4) demonstrate that γ = 0.5 balances haze removal and structural preservation.
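The query mixing and key/value replacement can be sketched as follows (a NumPy illustration of single-head attention; the name `pfi_attention` is ours, and the real operation runs inside Stable Diffusion's U-Net attention layers):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pfi_attention(q_dh, q_hazy, k_ref, v_ref, gamma=0.5):
    """Self-attention with reference K/V injection and query mixing.

    gamma -> 1: keep more of the hazy image (weaker dehazing).
    gamma -> 0: stronger dehazing, at the cost of structural detail.
    """
    d = q_dh.shape[-1]
    q_mix = gamma * q_hazy + (1.0 - gamma) * q_dh
    attn = softmax(q_mix @ k_ref.T / np.sqrt(d))
    return attn @ v_ref
```

Note that the keys and values come only from the reference branch, so the output is always assembled from clear-weather pixel statistics; the mixed query only decides where those statistics are placed.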
Structure retention model
Preserving the original structural details of the input during dehazing is challenging. Important features such as object contours and key textures in the hazy image should therefore be assigned relatively higher weights, while the weights of areas that are heavily affected by haze and contribute little to content understanding should be appropriately reduced.
Some diffusion-based image editing works [23,25] show that the cross-attention map directly influences the generation results. Inspired by [26–28], we design an SRM within the cross-attention framework. Leveraging the SRM's ability to adaptively focus on and filter features across both channel and spatial dimensions, we further optimize the feature representations involved in the cross-attention process, ensuring better preservation of key structural elements such as object contours and relative spatial relationships during dehazing. For example, when processing hazy images containing various objects such as buildings and vehicles, structural information like the shape of buildings and the form of vehicles, which is typically prone to blurring or loss, is more effectively preserved in the final dehazed image through the SRM's attention and enhancement across both dimensions.
The SRM combines channel attention and spatial attention to adaptively focus on haze-relevant features and structural regions. Channel attention prioritizes haze-relevant features by weighting channels based on their importance, while spatial attention focuses on structural regions (e.g., edges and contours) to preserve key details during dehazing.
As shown in Fig 2, we first apply global average pooling and global max pooling operations to the input content structure feature map (i.e., the hazy image features) $F \in \mathbb{R}^{C \times H \times W}$, where C represents the number of channels, H is the height of the feature map, and W is the width of the feature map:

$$F_{avg}^{c} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} F_c(i,j), \qquad F_{max}^{c} = \max_{i,j}\, F_c(i,j)$$

We obtain a C-dimensional vector $F_{avg}$ and another C-dimensional vector $F_{max}$, where $F_c(i,j)$ represents the pixel value at position (i,j) in channel c.
Then, these two vectors are passed through a shared multilayer perceptron (MLP) with first-layer weights $W_1$ and second-layer weights $W_2$ to obtain the channel attention weights. The process of applying the channel attention-weighted feature map is shown as follows:

$$M_c = \sigma\big(W_2\,\delta(W_1 F_{avg}) + W_2\,\delta(W_1 F_{max})\big), \qquad F' = M_c \otimes F$$

where $\sigma$ denotes the sigmoid function and $\delta$ a ReLU activation.
After the feature map is weighted by the channel attention weights, it undergoes average pooling and max pooling along the channel dimension in the spatial attention mechanism to obtain feature maps. These maps are then concatenated and convolved to obtain a 1-channel feature map. After passing through a sigmoid function, this map represents the attention target location information, highlighting the key regions of the hazy image features in the spatial dimension. The process is shown as follows:

$$M_s = \sigma\Big(f^{k \times k}\big([\mathrm{AvgPool}(F');\, \mathrm{MaxPool}(F')]\big)\Big), \qquad F'' = M_s \otimes F'$$
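The two attention steps can be sketched as follows (a naive NumPy illustration; the weight shapes and the explicit convolution loop are for clarity only and are our own assumptions, not an efficient or official implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """Channel attention: shared two-layer MLP over avg- and max-pooled
    channel descriptors, then sigmoid gating. F has shape (C, H, W)."""
    f_avg = F.mean(axis=(1, 2))                    # (C,)
    f_max = F.max(axis=(1, 2))                     # (C,)
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)   # ReLU hidden layer
    m_c = sigmoid(mlp(f_avg) + mlp(f_max))         # (C,)
    return F * m_c[:, None, None]

def spatial_attention(F, kernel):
    """Spatial attention: concat channel-wise avg/max maps, convolve with a
    (2, k, k) kernel using 'same' padding, sigmoid, and re-weight."""
    pooled = np.stack([F.mean(axis=0), F.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    H, W = F.shape[1:]
    m_s = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            m_s[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return F * sigmoid(m_s)[None]
```

Because both gates are sigmoids, the module can only re-weight (never amplify beyond the gated input), which is consistent with its role of emphasizing contours and suppressing haze-dominated regions rather than synthesizing new content.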
Experiment
This section first presents the datasets and evaluation metrics. Then, we compare our method with several recent dehazing methods to verify its effectiveness. Finally, we conduct an ablation study on the proposed method to further demonstrate the necessity of each component.
Datasets
We conducted comprehensive experiments using the publicly available large-scale dataset RESIDE [29] (REalistic Single Image DEhazing) to evaluate the effectiveness of our proposed dehazing model. This dataset is divided into subsets with different purposes (training or evaluation) or sources (indoor or outdoor) based on various data sources and image content. In the experiments, we selected synthetic hazy images and their corresponding clear images from the ITS (Indoor Training Set) subset, as well as highly challenging real-world hazy images (without corresponding clear images) from the RTTS (Real-world Task-Driven Testing Set) subset, for dehazing experiments. Additionally, we evaluated the performance and generalization ability of our method using non-homogeneous hazy images created with haze machines from the NH-HAZE dataset [30]. The specific details of the dataset are shown in Table 1.
Evaluation metrics
For the dehazing results of synthetic data selected from the ITS subset, we used widely adopted full-reference evaluation metrics in dehazing tasks: Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) [31]. However, synthetic hazy images and real hazy images have fundamental differences in haze distribution, and real hazy data lack clean, clear images as references. Therefore, in addition to PSNR and SSIM, we adopted no-reference metrics, including the Naturalness Image Quality Evaluator (NIQE) [32] and the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) [33], to evaluate the dehazing results of real hazy images selected from the RTTS subset. These no-reference metrics complement the limitations of PSNR and SSIM. Additionally, we used object detection performance on dehazed images as a task-specific no-reference evaluation standard, employing a task-driven evaluation approach to assess the effectiveness of the dehazing algorithm.
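For reference, the full-reference PSNR used above can be computed as follows (the standard definition, not code from the paper; `max_val` assumes 8-bit images):

```python
import numpy as np

def psnr(reference, dehazed, max_val=255.0):
    """Peak Signal-to-Noise Ratio (dB) between ground truth and a result."""
    diff = reference.astype(np.float64) - dehazed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher PSNR indicates a closer match to the ground truth, whereas for the no-reference NIQE metric lower values indicate better perceptual quality.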
Implementation details
The experiment was conducted on an Ubuntu 18.04 computing platform with an NVIDIA RTX4090 GPU that has 24GB of memory. In terms of software environment, to improve experimental efficiency and code maintainability, we used CUDA version 11.1, Python version 3.8.5, and PyTorch version 1.11.0. During the experiment, most of the configurations were set to default, with adjustments made only to some key parameters.
DDIM [13] inversion steps were set to 50, and the number of steps for saving features was also set to 50. The downsampling factor was set to 8. The hyperparameter γ, which controls query retention, was selected between 0 and 1 based on an ablation study to find the optimal value. The layers where attention features were injected are layers 6, 7, 8, 9, 10, and 11. All experiments were repeated 5 times with different random seeds to ensure stability. The reported results are averaged values, with standard deviations below 0.5% for all metrics.
Experimental results
In this section, we compare the proposed method with existing advanced dehazing methods and conduct experimental analysis. These methods include IA-YOLO [12], FFA-Net [2], GDN [3], Dehamer [5], and MSBDN [4]. Among them, except for IA-YOLO, which uses the atmospheric scattering model, the remaining models are trained on the synthetic hazy images in the RESIDE [29] dataset and their corresponding clear images. In 4.3.1, we perform a qualitative comparison of the dehazed images obtained with the aforementioned methods and our method. In 4.3.2, we use the evaluation metrics widely adopted in dehazing performance assessment, as mentioned earlier, for a quantitative comparison. In 4.3.3, to further verify the effectiveness and applicability of the proposed method in the real world, we compare the mean Average Precision (mAP) obtained from YOLO [34] detection on the dehazed RTTS hazy dataset with other related works on hazy-weather object detection.
A qualitative comparison of dehazed images under various conditions.
In this part, we will present the dehazing results produced by the aforementioned recent methods and the proposed method, and conduct a qualitative comparative analysis. We show the results obtained from artificially simulated hazy images and real-world hazy images.
First, based on the synthetic hazy data selected in ITS, this study applied the dehazing methods mentioned earlier and the innovative dehazing method proposed in this paper respectively, and carried out in-depth analysis and detailed comparison of the processed results. The result images of various methods are shown in Fig 3.
(a) Hazy image, (b) Ground Truth, (c) FFA [2], (d) GDN [3], (e) IA-YOLO [12], (f) Dehamer [5], (g) MSBDN [4], (h) Ours.
It can be observed that FFA-Net [2], GDN [3], Dehamer [5], and MSBDN [4] all exhibit excellent dehazing effects on the datasets they were trained on. Compared with these trained models, our method also achieves satisfactory color restoration, clarity, and detail preservation.
Next, based on the real hazy data (without corresponding real-clear images) selected in RTTS, we conduct an analysis and comparison of the processed results. The dehazing result images of various methods are shown in Fig 4.
(a) Hazy image, (b) FFA [2], (c) GDN [3], (d) IA-YOLO [12], (e) Dehamer [5], (f) MSBDN [4], (g) Ours.
As shown in Fig 4, FFA [2] and GDN [3] exhibit severe noise artifacts and ghosting on the RTTS real-world hazy dataset, which significantly affect the dehazing results. IA-YOLO [12], on the other hand, displays noticeable graininess and overexposure, with clear white borders on both sides. Although Dehamer [5] and MSBDN [4] remove haze to some extent, they only eliminate haze that is relatively close, with little effect on the haze in the distance. Additionally, during the experiment, we observed that MSBDN [4] sometimes introduced blurry spots and scale distortion. In the real hazy scenes and under the condition of lacking reference images, the above-mentioned models expose relatively obvious limitations. Their dehazing effects are not satisfactory, and it is difficult for them to effectively deal with complex and changeable scenes, reflecting insufficient adaptability to different scenes. In contrast, the method proposed in this study demonstrates significant advantages.
Finally, we conduct a comparative experiment on the NH-HAZE dataset. This dataset uses a professional haze generator to simulate real-world hazy scene conditions, generating images with non-uniform haze and forming image pairs with the corresponding haze-free images. We analyze and compare the processed results. The dehazing result images of various methods are shown in Figs 5 and 6.
(a) Hazy image, (b) FFA [2], (c) Dehamer [5], (d) GDN [3], (e) MSBDN [4], (f) IA-YOLO [12], (g) Ours, (h) Ground truth.
(a) Hazy image, (b) FFA [2], (c) Dehamer [5], (d) GDN [3], (e) MSBDN [4], (f) IA-YOLO [12], (g) Ours, (h) Ground truth.
As can be observed from Fig 5, for non-uniform hazy images our method achieves a higher degree of scene restoration, recovering high-quality images with natural visual features and clear content structures from low-quality, blurry hazy inputs. In contrast, models that have not been trained on this dataset perform poorly when dehazing these images.
We can observe from Fig 6 that for dense hazy images, apart from the parts of the image rendered completely invisible by the dense haze (marked by the red box), our method can still restore clear, sharp, high-contrast, and detailed haze-free images.
To demonstrate the robustness of the method, we conduct evaluations on more diverse real-world datasets and challenging scenarios, as shown in Fig 7.
Images (a) and (c) are captured under various adverse weather conditions, including heavy rain, blizzards, thick fog, and nighttime haze; images (b) and (d) show the corresponding results after restoration with HazeDiff.
Fig 7 further demonstrates the adaptability of HazeDiff in extreme scenarios, including heavy rain, heavy snow, thick fog, and nighttime haze, highlighting its robustness under severe weather conditions.
Meanwhile, we conducted supplementary experiments using reference images from different scenes and lighting conditions. The visualization results are shown in Fig 8.
As Fig 8 shows, when the similarity between the reference image and the image to be dehazed is very low, the dehazing result leans toward the reference image in terms of color distribution, yet the key structures (such as building contours and road signs) are fully retained; the method prioritizes structural integrity and avoids the risk of distortion. This verifies that our method still achieves a good dehazing effect even when the reference image is only weakly similar to the input.
To more intuitively demonstrate the effectiveness of our dehazing method, we compare the visualization results of YOLOv8 detecting the original image and the dehazed image, as shown in Fig 9.
(a) Hazy image, (b) dehazed image.
We can observe that the images dehazed by our method produce higher confidence detection results with fewer missed instances. These examples demonstrate that under real-world hazy conditions, our method can adaptively output clearer images with sharper edges, thereby helping the object detection model accurately recognize and locate objects in hazy images.
Quantitative analysis of the experimental data.
In this section, we perform a quantitative comparison and rank the compared methods, including ours. As stated previously, for image dehazing without ground-truth references, we employ the reference-free metrics NIQE [32] and BRISQUE [33] to assess the dehazing results on the real hazy images selected from RTTS, compensating for the limitations of PSNR/SSIM. All results are reported as mean ± standard deviation (SD) over five independent runs to ensure the stability and statistical reliability of the evaluation. The SD reflects the variability of model performance across image samples, while the 95% confidence interval (CI) indicates the plausible range of each metric. For the metrics in this table, NIQE and BRISQUE, lower values correspond to better image quality and hence better dehazing performance. The detailed results are shown in Table 2.
The quantitative evaluation results of our method and other recent methods are shown in Table 2. Our method achieves the lowest NIQE and BRISQUE values, outperforming the other dehazing methods and indicating the most satisfactory dehazing of hazy images.
We evaluate and compare the dehazed images from the NH-HAZE dataset and the ITS subset of the RESIDE [29] dataset. All results are reported as mean ± SD over five independent runs to ensure the stability and statistical reliability of the evaluation; the SD reflects the variability of model performance across image samples, while the 95% CI indicates the plausible range of each metric. For the metrics in this table, PSNR and SSIM, higher values correspond to better image quality and hence better dehazing performance.
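The mean ± SD and 95% CI reporting described above can be reproduced with a short script. The sketch below assumes per-run metric scores are available as plain lists and uses the two-sided t critical value for n = 5 runs; the example PSNR values are hypothetical, not the paper's data.

```python
import math

# Two-sided 95% t critical value for n - 1 = 4 degrees of freedom.
T_CRIT_95_DF4 = 2.776

def summarize_metric(runs):
    """Return (mean, sample SD, 95% CI half-width) for per-run scores."""
    n = len(runs)
    mean = sum(runs) / n
    # Sample standard deviation (Bessel's correction, n - 1 in the denominator).
    sd = math.sqrt(sum((x - mean) ** 2 for x in runs) / (n - 1))
    half_width = T_CRIT_95_DF4 * sd / math.sqrt(n)
    return mean, sd, half_width

# Hypothetical PSNR scores from five independent runs.
psnr_runs = [24.1, 24.5, 23.9, 24.3, 24.2]
mean, sd, hw = summarize_metric(psnr_runs)
print(f"PSNR: {mean:.2f} ± {sd:.2f} (95% CI: [{mean - hw:.2f}, {mean + hw:.2f}])")
```

With normally distributed run-to-run noise, the t-based interval is the standard choice at such a small sample size.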
By analyzing Table 3, our proposed model achieved the best values for all SSIM and PSNR metrics. The experimental results indicate that, compared with other dehazing methods, the dehazing effect of the proposed method is better, and it has good structural awareness and color preservation capabilities.
Comparison with the state-of-the-art object detection methods under hazy conditions.
We recognize that in real-world applications, image dehazing is often used as a preprocessing step for higher-level vision tasks such as object detection. Therefore, we use object detection performance on dehazed images as a no-reference, task-specific evaluation standard for real-world hazy images that lack clean references. Specifically, following [20], we adopt a task-driven evaluation: several state-of-the-art hazy-scene object detection models (DS-Net [35], IA-YOLO [12], YOLOX [36], SWDA [37], TogetherNet [38], and MS-DAYOLO [39]) detect objects in the RTTS dataset, while our dehazed images are detected with YOLOv8. We rank all algorithms by mAP (%) to evaluate the impact of the dehazing algorithm on downstream task performance. The mAP results are shown in Table 4.
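For readers unfamiliar with the mAP metric used in this comparison, the sketch below shows one common way it is aggregated: 11-point interpolated average precision per class (PASCAL VOC style), averaged over classes. This is an illustrative implementation of the metric, not the evaluation code used by the cited detectors.

```python
def interpolated_ap(recalls, precisions):
    """11-point interpolated average precision (PASCAL VOC style).

    recalls/precisions: parallel lists tracing the detector's
    precision-recall curve for one class.
    """
    ap = 0.0
    for t in [i / 10.0 for i in range(11)]:
        # Highest precision achieved at any recall >= t (0 if unreachable).
        p = max((p for r, p in zip(recalls, precisions) if r >= t), default=0.0)
        ap += p / 11.0
    return ap

def mean_ap(per_class_curves):
    """mAP: the mean of per-class APs over all object classes."""
    aps = [interpolated_ap(r, p) for r, p in per_class_curves]
    return sum(aps) / len(aps)
```

A detector that keeps precision 1.0 up to full recall scores AP = 1.0 for that class; classes where haze causes missed distant objects cap recall early and drag the mAP down, which is exactly why dehazing quality shows up in this metric.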
As shown in Table 4, the RTTS images dehazed by our method achieve better object detection results than the current state-of-the-art hazy-weather detection algorithms. Specifically, RTTS images dehazed with the proposed method and then processed by YOLOv8 achieve the highest mAP. This indicates that our method effectively removes haze while improving image quality, making objects in the image clearer and more distinguishable, thereby enhancing the accuracy and robustness of object detection. This further demonstrates the reliability and effectiveness of our method in real-world scenarios, providing an effective solution for image dehazing.
We conducted rigorous testing on an NVIDIA RTX 4090 (24 GB VRAM) GPU, with key metrics shown in Table 5. Compared to the base Stable Diffusion (SD) model, our method achieves faster single-image inference and lower VRAM usage, enabling efficient inference with significant computational cost advantages. In future work, we plan to employ parameter-efficient fine-tuning techniques such as LoRA to reduce the parameter count of the UNet, further optimizing the computational efficiency of our method.
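The LoRA idea mentioned as future work replaces full fine-tuning of a weight matrix W with two trainable low-rank factors. The sketch below illustrates the mechanism on a single linear layer; the shapes and initialization are illustrative assumptions, not part of HazeDiff.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Forward pass with a LoRA-style low-rank update.

    Instead of fine-tuning the full d x d weight W, only the low-rank
    factors A (d x r) and B (r x d) are trained, so the effective weight
    is W + alpha * A @ B with far fewer trainable parameters.
    """
    return x @ W + alpha * (x @ A) @ B

rng = np.random.default_rng(0)
d, r = 8, 2
x = rng.standard_normal((1, d))
W = rng.standard_normal((d, d))    # frozen pretrained weight
A = rng.standard_normal((d, r))    # trainable down-projection
B = np.zeros((r, d))               # B starts at zero: update begins as a no-op
```

Initializing B to zero is the standard LoRA convention: training starts exactly at the pretrained model and the low-rank path grows from there.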
Ablation experiments
In this section, we conduct several ablation studies to analyze the effectiveness of the proposed HazeDiff on real-world hazy weather datasets. These studies cover the full model HazeDiff, HazeDiff (without SRM), HazeDiff (without PFI), and the determination of the final parameter settings.
We applied the same reference image to guide the image dehazing for both HazeDiff and HazeDiff (without SRM), and the experimental results are shown in Fig 10.
(a) Hazy images, (b) HazeDiff, (c) HazeDiff (without SRM).
It can be observed that, through SRM, key structural features in hazy images, such as object contours and relative spatial relationships, are better preserved during the dehazing process. For instance, the shapes of buildings and the appearance of vehicles are more effectively presented in the final dehazed images.
We also conducted experiments on the blending degree parameter γ in the model, quantitatively evaluating the generated images under different parameter values to find the optimal setting for avoiding the potential blurring of original structural details in hazy images during generation. The quantitative analysis results are shown in Table 6, and the visualization results in Fig 11.
We observed that when the mixing degree parameter γ is less than 0.4, the dehazing effect is significant and the colors are clear and bright, but the structural content of the original image is not well preserved. The generated images then differ greatly in content from the original hazy images, failing to accurately convey their semantic information and visual effect. When γ is greater than 0.6, the structure of the hazy image is well preserved, but the haze is not effectively removed. When γ lies between 0.4 and 0.6, however, the haze can be reasonably removed without introducing additional noise, while the structural content of the hazy image is retained. Therefore, we select a mixing degree parameter γ within the range 0.4 to 0.6.
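One plausible reading of the mixing degree γ is a linear blend of two diffusion latents, which the sketch below illustrates. This is an interpretation for illustration only; the function name and the exact blending site in the pipeline are assumptions, not the paper's implementation.

```python
import numpy as np

def blend_latents(content_latent, guided_latent, gamma=0.5):
    """Linear latent blending controlled by the mixing degree gamma.

    gamma close to 1 preserves the hazy image's structure; gamma close
    to 0 follows the reference-guided latent more strongly. The paper's
    ablation favors gamma in [0.4, 0.6] as the balance point.
    """
    assert 0.0 <= gamma <= 1.0
    return gamma * content_latent + (1.0 - gamma) * guided_latent
```

At γ = 1 the blend returns the content latent unchanged (structure kept, haze kept); at γ = 0 it discards the content latent entirely, matching the failure modes described at both ends of the ablation.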
To validate the effectiveness of the proposed PFI method, we performed haze removal using HazeDiff with the PFI removed. The visualization results are shown in Fig 12. Columns (a) and (d) present the hazy input images requiring processing. Panels (b) and (e) show the dehazing results from the PFI-ablated HazeDiff variant, where structural distortions (highlighted in red dashed boxes) appear in critical regions. In contrast, panels (c) and (f) show the restoration outputs from the full HazeDiff framework with PFI intact.
The results revealed that in the absence of PFI to effectively guide the content images, the generated images exhibited significant randomness. This forced the model to autonomously interpret the definition of “haze removal,” leading to severe image degradation with substantial loss of fine details, while failing to achieve the intended dehazing effect. These findings demonstrate that the proposed PFI provides superior guidance for pixel-level transformation of content images.
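The guidance role of PFI can be pictured as a self-attention step in which the keys and values are taken from reference-image features, so that the content queries attend to reference pixels. The single-head sketch below is a simplified illustration of this injection pattern, based on the abstract's description; the actual layer placement and feature sources in HazeDiff may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inject_reference_attention(q_content, k_ref, v_ref):
    """Self-attention where keys/values come from reference features.

    q_content: (n, d) queries from the image being dehazed.
    k_ref, v_ref: (m, d) keys/values from the reference image, so each
    content pixel's output is a convex combination of reference features.
    """
    d = q_content.shape[-1]
    attn = softmax(q_content @ k_ref.T / np.sqrt(d), axis=-1)
    return attn @ v_ref
```

Because each output row is a convex combination of reference values, the injected features stay within the reference image's feature distribution, which is consistent with the observed behavior that ablating PFI leaves the model to "invent" content on its own.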
To comprehensively assess the individual contributions and parameter sensitivity of the PFI and SRM modules, we conducted quantitative experimental analyses, as shown in Table 6.
Values are reported as mean ± SD. For PSNR and SSIM, higher is better; for NIQE and BRISQUE, lower is better.
The experimental results demonstrate that the PFI module plays a decisive role in improving image reconstruction quality, and that incorporating the SRM module further improves the metrics to their best values, confirming the individual contributions of both PFI and SRM.
Conclusion
In this paper, we propose a Diffusion Model-based method, HazeDiff, to address several challenging image dehazing problems. Specifically, we propose PFI, which extracts pixel-level features from the reference image and injects them into the latent noise of the dehazed image, guiding the diffusion model to generate the dehazed image in an unsupervised manner. Next, we design an attention-based SRM to focus on the key regions and details of hazy images, addressing the issue where diffusion models guided by reference images may blur the original structural details of the hazy image during the generation process. Compared to previous dehazing methods, HazeDiff has stronger generalization ability and does not require a large number of paired hazy-clear images from the same scene for training. Extensive experiments show that our HazeDiff outperforms other state-of-the-art dehazing methods, achieving the best results.
References
- 1. Cai B, Xu X, Jia K, Qing C, Tao D. DehazeNet: an end-to-end system for single image haze removal. IEEE Trans Image Process. 2016;25(11):5187–98. pmid:28873058
- 2. Qin X, Wang Z, Bai Y, Xie X, Jia H. FFA-net: feature fusion attention network for single image dehazing. AAAI. 2020;34(07):11908–15.
- 3. Liu X, Ma Y, Shi Z. GridDehazeNet: attention-based multi-scale network for image dehazing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. p. 7314–23.
- 4. Dong H, Pan J, Xiang L. Multi-scale boosted dehazing network with dense feature fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. p. 2157–67.
- 5. Guo CL, Yan Q, Anwar S. Image dehazing transformer with transmission-aware 3D position embedding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 5812–20.
- 6. Xiong J, Yan X, Wang Y. RSHazeDiff: a unified Fourier-aware diffusion model for remote sensing image dehazing. IEEE Transactions on Intelligent Transportation Systems. 2024.
- 7. Oh Y, Kim HI, Kim ST, et al. MonoWAD: weather-adaptive diffusion model for robust monocular 3D object detection. In: European Conference on Computer Vision. 2024. p. 326–45.
- 8. Qi T, Fang S, Wu Y. DEADiff: an efficient stylization diffusion model with disentangled representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. p. 8693–702.
- 9. Choi J, Kim S, Jeong Y. ILVR: conditioning method for denoising diffusion probabilistic models. arXiv preprint. 2021. https://doi.org/10.48550/arXiv.2108.02938
- 10. Cho H, Lee J, Chang S. One-shot structure-aware stylized image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. p. 8302–11.
- 11. Chung J, Hyun S, Heo JP. Style injection in diffusion: a training-free approach for adapting large-scale diffusion models for style transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. p. 8795–805.
- 12. Liu W, Ren G, Yu R. Image-adaptive YOLO for object detection in adverse weather conditions. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2022. p. 1792–800.
- 13. Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems. 2020;33:6840–51.
- 14. Sohl-Dickstein J, Weiss E, Maheswaranathan N. Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning. 2015. p. 2256–65.
- 15. Wang J, Wu S, Yuan Z, Tong Q, Xu K. Frequency compensated diffusion model for real-scene dehazing. Neural Netw. 2024;175:106281. pmid:38579573
- 16. Yi X, Xu H, Zhang H. Diff-Retinex: rethinking low-light image enhancement with a generative diffusion model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. p. 12302–11.
- 17. Zhang Y, Shi X, Li D. A unified conditional framework for diffusion-based image restoration. Advances in Neural Information Processing Systems. 2023;36:49703–14.
- 18. Yu H, Huang J, Zheng K. High-quality image dehazing with diffusion model. arXiv preprint. 2023. https://doi.org/10.48550/arXiv.2308.11949
- 19. Gatys LA, Ecker AS, Bethge M. Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 2414–23.
- 20. Song J, Meng C, Ermon S. Denoising diffusion implicit models. arXiv preprint. 2020. https://doi.org/10.48550/arXiv.2010.02502
- 21. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III. 2015. p. 234–41.
- 22. Huang X, Belongie S. Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE International Conference on Computer Vision. 2017. p. 1501–10.
- 23. Hertz A, Mokady R, Tenenbaum J. Prompt-to-prompt image editing with cross attention control. arXiv preprint. 2022. https://doi.org/10.48550/arXiv.2208.01626
- 24. Cao M, Wang X, Qi Z. MasaCtrl: tuning-free mutual self-attention control for consistent image synthesis and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. p. 22560–70.
- 25. Tumanyan N, Geyer M, Bagon S. Plug-and-play diffusion features for text-driven image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 1921–30.
- 26. Lyu H, Sha N, Qin S. Advances in Neural Information Processing Systems. 2019;32.
- 27. Woo S, Park J, Lee JY. CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018. p. 3–19.
- 28. Lin H, Cheng X, Wu X, et al. CAT: cross attention in vision transformer. In: 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE; 2022. p. 1–6.
- 29. Li B, Ren W, Fu D. Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing. 2018;28(1):492–505.
- 30. Ancuti CO, Ancuti C, Timofte R. NH-HAZE: an image dehazing benchmark with non-homogeneous hazy and haze-free images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020. p. 444–5.
- 31. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process. 2004;13(4):600–12. pmid:15376593
- 32. Mittal A, Soundararajan R, Bovik AC. Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters. 2012;20(3):209–12.
- 33. Mittal A, Moorthy AK, Bovik AC. No-reference image quality assessment in the spatial domain. IEEE Trans Image Process. 2012;21(12):4695–708. pmid:22910118
- 34. Redmon J, Divvala S, Girshick R. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 779–88.
- 35. Huang S-C, Le T-H, Jaw D-W. DSNet: joint semantic learning for object detection in inclement weather conditions. IEEE Trans Pattern Anal Mach Intell. 2021;43(8):2623–33. pmid:32149681
- 36. Ge Z, Liu S, Wang F, et al. YOLOX: exceeding YOLO series in 2021. arXiv preprint. 2021.
- 37. Saito K, Ushiku Y, Harada T. Strong-weak distribution alignment for adaptive object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. p. 6956–65.
- 38. Wang Y, Yan X, Zhang K. TogetherNet: bridging image restoration and object detection together via dynamic enhancement learning. In: Computer Graphics Forum. 2022. p. 465–76.
- 39. Hnewa M, Radha H. Multiscale domain adaptive YOLO for cross-domain object detection. In: 2021 IEEE International Conference on Image Processing (ICIP). 2021. p. 3323–7.