
UltraStyle: Best of both worlds in style and content for style transfer

  • Yongqian Tan

    Roles Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Software, Writing – original draft, Writing – review & editing

    tanyongqian@kluniv.edu.cn

    Affiliations School of Big Data Engineering, Kaili University, Kaili, Guizhou, China, Guizhou Miao Embroidery Culture Protection and Development Research Center, Kaili, Guizhou, China

  • Fanju Zeng

    Roles Writing – original draft

    Affiliation School of Big Data Engineering, Kaili University, Kaili, Guizhou, China

Abstract

Achieving an ideal balance between style fidelity and content preservation remains a critical challenge in style transfer. In this work, we present UltraStyle, a new framework that reconciles these two objectives through a reformulation of the learning process. Unlike prior LoRA-based methods that rely on noise prediction, UltraStyle adopts a reconstruction-centered optimization paradigm, allowing the diffusion model to better retain global structural features while faithfully reproducing stylistic patterns. We propose a dual-phase training method that first isolates content representations before specializing style learning, minimizing cross-interference. To further refine detail preservation without sacrificing structure, we introduce a progressive loss transition strategy during training. Moreover, we develop a flexible inference control mechanism that enables smooth adjustment of content and style influences in the generation phase. Experimental results demonstrate that UltraStyle consistently delivers stylized outputs with superior structural integrity and stylistic authenticity, significantly mitigating issues such as content drift and feature entanglement found in existing methods.

Introduction

Style transfer [1–10] has long been recognized as a pivotal task in computer vision, where the goal is to seamlessly blend the structural semantics of a content image with the visual characteristics of a style exemplar. From early methods leveraging convolutional neural networks (CNNs) [11–13] based on feature reconstruction losses to more recent advancements employing generative adversarial networks (GANs) [14–20], the field has witnessed significant progress. However, the emergence of diffusion models has fundamentally transformed the landscape, offering new levels of flexibility, controllability, and fidelity in image generation. Diffusion-based approaches [21–26] have unlocked unprecedented potential for detailed content preservation while allowing for rich stylistic transformations, leading to a surge of interest in applying these models to artistic and personalized image synthesis.

To efficiently adapt large-scale diffusion models for style transfer tasks, Low-Rank Adaptation (LoRA) [27] techniques have gained considerable attention. LoRA-based methods [28–32] enable lightweight fine-tuning by inserting learnable low-rank matrices into pre-trained model architectures, significantly reducing the computational and memory overhead compared to full fine-tuning. They have been successfully applied across a variety of domains, from text-to-image personalization to domain-specific generation. In the context of style transfer, LoRA offers an attractive solution: capturing the essence of a target style with only minimal updates to the base model. Yet, despite these advantages, fundamental challenges persist.

Existing LoRA-based style transfer methods [32–34] often struggle with three major issues: First, structural distortions frequently occur, where the global layout and semantics of the content image are poorly preserved, leading to unnatural outputs. Second, style misalignment remains prevalent, where the transferred style appears incomplete, diluted, or inaccurately applied. Third, and perhaps most critically, content leakage—where elements from the style reference image undesirably intrude into the generated result—undermines the clarity and intent of the stylization. These failures can be traced back to the training objectives typically employed, which are centered around noise prediction losses that inherently prioritize low-level detail recovery over the modeling of high-level structural information critical for effective style transfer.

Addressing these limitations requires a fundamental rethinking of the style transfer process within diffusion models. In this work, we propose UltraStyle, a novel framework designed to bridge the gap between content integrity and style fidelity (see Fig 1). Unlike conventional approaches that fine-tune models based on noise prediction, UltraStyle introduces a reconstruction-centered optimization paradigm, where the model is trained to recover the original latent representation of the input rather than the noise perturbations. This adjustment reorients the model’s focus towards capturing global semantics and stylistic patterns more holistically, leading to outputs that better honor both the source content and the target style.

To further disentangle the learning dynamics between content and style, we introduce a dual-phase training strategy. In the first phase, we exclusively optimize a content-preserving LoRA by focusing on maintaining the global structure and semantics of the content image. In the second phase, building upon the frozen content representation, we train a separate style LoRA, allowing the model to specialize in stylistic attributes without corrupting content fidelity. This separation mitigates feature entanglement and reduces the risk of content leakage, resulting in more faithful stylized images.

Moreover, we recognize that different stages of the diffusion process emphasize different aspects of image generation. To exploit this property, we propose a progressive loss transition mechanism: starting with an emphasis on low-level feature recovery and gradually shifting the optimization focus towards high-level structural reconstruction. This dynamic adjustment ensures that both fine-grained textures and global arrangements are accurately preserved, addressing the dual needs of style expressiveness and content consistency.

Finally, to enhance user controllability during generation, UltraStyle incorporates a flexible inference guidance mechanism. Inspired by classifier-free guidance principles, our method enables dynamic modulation of content and style strengths independently at inference time, allowing for fine-grained control over the degree of stylization versus content preservation without requiring retraining. This feature is particularly valuable in real-world applications where users may desire varying levels of stylization intensity based on different creative needs.

Fig 1. Results generated by SDXL with our approach, UltraStyle.

Reprinted from [http://xhslink.com/o/nfgAm1FZ5w] under a CC BY license, with permission from Xiaoming Huang, original copyright 2024. Reprinted from [https://pan.baidu.com/s/10_CjlBAaXZ6vB_RONzfU3g?pwd=b97i] under a CC BY license, with permission from Yi Yang, original copyright 2025.

https://doi.org/10.1371/journal.pone.0346260.g001

Through extensive experiments on diverse benchmarks and styles, we demonstrate that UltraStyle consistently surpasses existing state-of-the-art methods across multiple metrics, including structural similarity, style adherence, and user preference studies. Our framework substantially reduces content leakage, improves style transfer fidelity, and preserves content structure with greater robustness. Beyond empirical performance, UltraStyle also provides a conceptual advance: showing that restructuring the learning objectives and training procedures within diffusion models can lead to significantly better trade-offs between style and content in generative tasks.

In summary, this work presents a comprehensive solution to the enduring challenge of achieving high-quality style transfer that does not sacrifice content integrity. We believe that UltraStyle offers not only practical improvements but also new perspectives for future research on controllable, reliable, and high-fidelity image generation systems.

Related work

Style transfer

Style transfer [2,3,6–9], the task of merging the structural essence of a content image with the aesthetic patterns of a style reference, has seen remarkable evolution over the past decade. Early pioneering works [5,35–38] demonstrated that pre-trained convolutional neural networks (CNNs) could separate high-level content features from low-level style textures. By matching Gram matrices of activations between images, these methods achieved impressive stylization effects, marking a significant conceptual breakthrough.

However, initial optimization-based methods were computationally intensive and limited to transferring predefined styles. To address this, feed-forward stylization networks [39–43] were introduced, enabling real-time inference by directly predicting stylized outputs given a content image. Yet, each network typically handled only one or a limited number of styles, restricting their practical utility. Later advancements introduced techniques such as conditional instance normalization and adaptive instance normalization, allowing networks to generalize across multiple styles with a single model. These models greatly improved flexibility but still faced trade-offs between maintaining detailed content structure and achieving strong stylization, often leading to spatial distortions or artifacts.

The adoption of generative adversarial networks (GANs) further expanded the potential of style transfer. GAN-based frameworks [17,44–54] learned mappings from content to style domains through adversarial objectives, improving the realism and diversity of generated images. Multi-domain translation models emerged, capable of transferring styles from arbitrary domains onto target images. Despite their expressiveness, GAN-based methods [3,4,55–64] introduced instability during training, susceptibility to mode collapse, and challenges in finely controlling the degree of style influence without affecting content integrity.

More recently, diffusion models [65–73] have ushered in a new era for style transfer by leveraging a gradual denoising process that inherently supports flexible, high-quality generation. Diffusion-based approaches allow for more robust manipulation of content and style signals during the iterative generation process, offering improved trade-offs between fidelity and expressiveness. Nonetheless, while diffusion models provide a strong foundation, achieving seamless blending of style characteristics without compromising the semantic structure of the content image remains an open challenge, particularly when adapting to diverse styles or when style references are limited.

Diffusion models

Diffusion models [74–78] have emerged as the dominant paradigm for high-fidelity image synthesis. By progressively refining a noise signal toward a coherent image through a series of denoising steps, these models capture complex data distributions with remarkable stability and diversity. Unlike GANs, which directly generate images in a single pass, diffusion models capture the entire generation trajectory, offering better coverage of the target distribution and resilience against training instabilities.

Personalization within diffusion models has become increasingly important, particularly for tasks where users require models to learn new concepts, objects, or styles from a small number of examples [79–85]. Early personalization techniques relied on full model fine-tuning, adapting the backbone networks to new data. While effective, this approach was computationally prohibitive and prone to overfitting, making it impractical for scenarios requiring frequent or lightweight personalization. StyleAligned [86] introduces a training-free shared attention mechanism that ensures stylistic consistency by synchronizing attention keys and values across the diffusion process. StyleID [8] adopts a training-free approach that leverages DDIM inversion for robust content reconstruction while manipulating attention queries and keys to inject reference style attributes. Alternative strategies focused on optimizing specialized embeddings or prompt tokens, allowing the model to incorporate new concepts with minimal architectural modifications. Methods based on text inversion techniques introduced new embeddings into the text encoder space, representing novel styles or subjects. These approaches provided a lightweight mechanism for personalization but often lacked the capacity to capture more intricate style details, particularly those involving global spatial structures or fine texture variations.

To address these limitations, parameter-efficient fine-tuning methods such as LoRA (Low-Rank Adaptation) [27] were introduced. LoRA injects trainable low-rank matrices into specific layers of the diffusion network, enabling the model to adapt to new concepts or styles with a minimal number of additional parameters. This dramatically reduces training time and memory consumption while preserving the original model’s generative capacity.

Despite these advances, personalization in the context of style transfer introduces unique challenges. Unlike object personalization, where preserving identity and appearance is paramount, style transfer demands altering the global appearance of an image while faithfully maintaining its content structure. Standard personalization approaches often fall short in this regard, either overfitting to style textures at the expense of content or preserving structure but failing to convincingly translate style. As a result, there is a pressing need for more sophisticated personalization methods that can balance these competing objectives within diffusion models, ensuring both content preservation and effective style adaptation.

LoRA-based style transfer

Low-Rank Adaptation (LoRA) [27] has proven to be a highly effective method for efficient fine-tuning, particularly within large-scale diffusion models. By decomposing weight updates into low-rank components and strategically placing them within the network’s architecture, LoRA allows for targeted adaptation while leaving the majority of the pre-trained parameters unchanged. This enables fast, memory-efficient training with strong adaptability, making it particularly attractive for style transfer tasks where only a few style references are available.

Several works have explored the application of LoRA in style transfer [32–34,87–89], typically by designing separate LoRA modules for content and style components. The idea is to disentangle the structure-preserving features from the style-altering features, enabling flexible recombination and better generalization across different style-content pairs. Methods leveraging dual-LoRA architectures have shown that it is possible to train independent content and style adapters, which can later be recombined at inference time to generate diverse stylizations. For instance, B-LoRA [34] focuses on the implicit separation of style and content by training separate LoRA adapters on specific layers of the diffusion backbone, enabling lightweight disentanglement through low-rank weight updates. Nevertheless, applying LoRA to style transfer comes with significant challenges. Standard training objectives for LoRA-based modules often involve noise prediction tasks inherited from the original diffusion training, which prioritize the accurate recovery of local textures rather than the preservation of global structures. This misalignment between the training objective and the needs of style transfer leads to common issues such as content distortion, where object shapes and layouts are warped, and style misalignment, where style patterns are inconsistently or incompletely applied.

Moreover, the separation between content and style is rarely perfect. Style LoRA modules can inadvertently capture content-specific features, leading to content leakage where elements of the style reference’s structure contaminate the generated output. Conversely, content LoRA modules may unintentionally encode stylistic attributes, diluting the clarity of the final stylized image. These entanglement issues are exacerbated when strong stylization is desired, pushing the model to its limits. Another limitation of existing LoRA-based style transfer methods lies in their lack of adaptive control during inference. Most approaches apply static LoRA weights during generation, providing limited flexibility to adjust the strength of style or content emphasis dynamically. This rigidness restricts user control and adaptability, especially in applications where nuanced adjustments are needed for different stylistic intents.

Addressing these challenges requires rethinking both the training objectives and the adaptation strategies for LoRA in diffusion-based style transfer. Specifically, better separation of content and style signals, improved emphasis on high-level structural preservation, and the development of flexible inference control mechanisms are crucial for advancing the capabilities of LoRA-based stylization frameworks.

Preliminaries

In this section, we review the foundational concepts underlying our approach, including denoising diffusion probabilistic models, latent diffusion models (LDMs) [90], and Low-Rank Adaptation (LoRA) [27] as applied in diffusion-based generation. These components are critical for understanding the design and rationale of our proposed method.

Denoising diffusion models

Denoising diffusion probabilistic models (DDPMs) are a class of generative models that learn data distributions by reversing a fixed stochastic process that gradually adds noise to data. Given an input image $x_0$, the forward process generates noisy observations through a sequence of Gaussian transitions:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right), \quad t = 1, \dots, T,$$

where $\{\beta_t\}_{t=1}^{T}$ is a predefined variance schedule. After $T$ steps, the sample $x_T$ approximates a standard Gaussian distribution.

To train the model, a neural network $\epsilon_\theta$ is optimized to predict the noise added during the forward process. The standard training objective, known as the epsilon-prediction loss, is formulated as:

$$\mathcal{L}_{\epsilon} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0, \mathbf{I}),\ t}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|_2^2 \right].$$

At inference time, sampling is performed by initializing $x_T \sim \mathcal{N}(0, \mathbf{I})$ and recursively applying the learned denoising process to generate $x_0$.
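The closed-form forward process and the epsilon-prediction objective above can be sketched in a few lines of NumPy, writing $\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)$; the linear schedule and the zero stand-in for the network output $\epsilon_\theta$ are illustrative choices, not the paper's configuration:

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear variance schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Closed-form forward process: x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4, 4))          # a toy "image"
xt, eps = forward_diffuse(x0, t=999, alpha_bar=alpha_bar, rng=rng)

# Epsilon-prediction loss against a placeholder network output eps_hat.
eps_hat = np.zeros_like(eps)              # stands in for eps_theta(x_t, t)
loss = np.mean((eps - eps_hat) ** 2)
```

At the final timestep `alpha_bar` is close to zero, so `xt` is dominated by noise, matching the statement that $x_T$ approximates a standard Gaussian.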

Latent diffusion models

To reduce computational cost and memory consumption, Latent Diffusion Models (LDMs) perform the diffusion process in a lower-dimensional latent space rather than pixel space. We utilize a pre-trained encoder $E$ that maps high-dimensional images from the pixel space to a lower-dimensional latent space. For a given content or style image $x_0$, the latent representation is computed as $z_0 = E(x_0)$. This compression step effectively filters out high-frequency pixel-level noise while retaining the essential structural and semantic information required for style transfer. After the diffusion model completes the denoising process in the latent space, the resulting clean latent variable $z_0$ is passed through a pre-trained decoder $D$ to reconstruct the final image in the pixel space: $\hat{x}_0 = D(z_0)$. The diffusion model operates on latent variables, where the training loss becomes:

$$\mathcal{L}_{LDM} = \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, \mathbf{I}),\ t}\left[ \left\| \epsilon - \epsilon_\theta(z_t, t) \right\|_2^2 \right].$$

By modeling generation in the latent space, LDMs preserve the visual quality of outputs while substantially improving training and inference efficiency, making them suitable for large-scale or resource-constrained applications.
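To make the encode–diffuse–decode pipeline concrete, here is a toy sketch in which a random projection and its pseudo-inverse stand in for the learned VAE encoder $E$ and decoder $D$ (purely illustrative; the real $E$ and $D$ are deep networks and the round trip is only approximately lossless):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for E and D: a 64-dim "pixel" space compressed to a
# 16-dim "latent" space via a random projection and its pseudo-inverse.
P = rng.standard_normal((16, 64))
P_pinv = np.linalg.pinv(P)

def encode(x):
    return P @ x          # z0 = E(x0): diffusion runs on these latents

def decode(z):
    return P_pinv @ z     # x_hat = D(z0): back to pixel space

# An "image" lying in the decoder's range, so this toy round trip is exact.
x0 = decode(rng.standard_normal(16))
z0 = encode(x0)

assert z0.shape[0] < x0.shape[0]              # latents are lower-dimensional
assert np.allclose(decode(encode(x0)), x0)    # exact on this toy example
```

The point is dimensional: all denoising steps and the training loss operate on `z0`-shaped tensors, which is what makes LDMs cheaper than pixel-space diffusion.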

Low-rank adaptation in diffusion models

Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning strategy that introduces trainable rank-decomposed matrices into the weights of pre-trained models. Specifically, LoRA modifies a weight matrix $W \in \mathbb{R}^{d \times k}$ via a low-rank update:

$$W' = W + \Delta W = W + BA, \quad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k).$$

During fine-tuning, only $A$ and $B$ are updated while $W$ remains fixed.

In diffusion models, LoRA modules are typically injected into the U-Net backbone, which predicts the noise in each denoising step. This allows for fast and modular adaptation to new styles or domains using a limited number of parameters. Additionally, content- and style-specific LoRAs can be trained separately and later composed at inference time to flexibly control the generation behavior.
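A minimal NumPy sketch of the low-rank update $W' = W + BA$; the zero initialization of $B$ (so the adapter starts as a no-op) follows the original LoRA recipe, while the dimensions here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                      # layer dims; rank r << min(d, k)

W = rng.standard_normal((d, k))          # frozen pre-trained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init

def lora_forward(x):
    """y = (W + B @ A) @ x -- W stays fixed; only A and B are updated."""
    return W @ x + B @ (A @ x)

x = rng.standard_normal(k)
# With B zero-initialised, the adapted layer matches the base layer exactly.
assert np.allclose(lora_forward(x), W @ x)
# The adapter adds far fewer trainable parameters than W itself.
assert A.size + B.size < W.size
```

Because only `A` and `B` are trained, separately learned content and style adapters of this form can be stored independently and composed at inference time, as described above.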

Method

We present UltraStyle, a novel diffusion-based framework designed to achieve high-fidelity style transfer with explicit decoupling of content and style pathways. In contrast to prior LoRA-based stylization approaches, UltraStyle introduces three innovations: (1) a structure-oriented adaptation strategy using reconstruction-based optimization, (2) an independently trained style enrichment module that avoids content interference, and (3) a dynamic inference controller that allows user-controllable interpolation between style and content representations. Furthermore, we incorporate an adaptive training scheduler and provide a lightweight yet effective training recipe. The overall pipeline is depicted in Fig 2.

Fig 2. Overview of our UltraStyle.

To facilitate the training of both style and content LoRA modules, we replace the conventional noise-prediction paradigm with a reconstruction-based prediction formulation. For content LoRA training, we design a loss transition mechanism that simultaneously captures the global structural layout and the fine-grained local details of the content image. To effectively disentangle the style and content information encoded in the style image, we adopt a two-stage training framework: first, we optimize a content-consistent LoRA module via the proposed loss transition; subsequently, we freeze the content LoRA and train a dedicated style LoRA to encode stylistic variations independently. Reprinted from [https://pan.baidu.com/s/10_CjlBAaXZ6vB_RONzfU3g?pwd=b97i] under a CC BY license, with permission from Yi Yang, original copyright 2025.

https://doi.org/10.1371/journal.pone.0346260.g002

Structure-preserving adaptation

The Structure-Preserving Adaptation (SPA) module is designed to address the challenge of maintaining the semantic and spatial integrity of content images during stylization. In conventional diffusion-based models, the optimization target often focuses on noise prediction, which encourages the model to capture low-level pixel statistics but neglects the preservation of object structure, spatial arrangements, and semantic boundaries.

To overcome this limitation, SPA redefines the training objective using a latent-space reconstruction loss. Rather than predicting noise, the model learns to reconstruct the original latent representation of the content image, ensuring that high-level semantic information is retained. Formally, the loss is defined as:

$$\mathcal{L}_{SPA} = \mathbb{E}_{z_0,\ \epsilon,\ t}\left[ \left\| z_0 - f_\theta(z_t, t) \right\|_2^2 \right],$$

where $f_\theta$ denotes the denoising function equipped with SPA-specific LoRA modules.

In implementation, LoRA adapters are strategically inserted only into mid-to-deep U-Net layers that influence spatial resolution and semantic coherence. By avoiding shallow layers, the SPA module minimizes interference from early-stage texture learning and ensures stability across diverse content domains. Furthermore, we introduce semantic-aware augmentation strategies during SPA training. These include random occlusion, aspect ratio variation, and depth-preserving distortions, encouraging the model to generalize structural priors across varied viewing conditions. The SPA model effectively acts as a structure regularizer, providing a strong backbone for subsequent stylization.
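The shift from noise prediction to latent reconstruction can be checked numerically: under the closed-form forward process the two targets are algebraically linked, so the sketch below (variable names are ours, not the paper's) verifies that a perfect noise estimate implies zero reconstruction loss, while an imperfect prediction is penalized:

```python
import numpy as np

rng = np.random.default_rng(0)

def spa_loss(z0, z0_hat):
    """Latent reconstruction objective: penalise || z0 - prediction ||^2."""
    return np.mean((z0 - z0_hat) ** 2)

# Closed-form forward step: z_t = sqrt(a_bar)*z0 + sqrt(1 - a_bar)*eps.
a_bar = 0.5
z0 = rng.standard_normal((8, 8))
eps = rng.standard_normal((8, 8))
zt = np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps

# Inverting the forward step with the true noise recovers z0 exactly ...
z0_from_eps = (zt - np.sqrt(1.0 - a_bar) * eps) / np.sqrt(a_bar)
assert np.isclose(spa_loss(z0, z0_from_eps), 0.0)
# ... while any deviation in the prediction yields a positive loss.
assert spa_loss(z0, np.zeros_like(z0)) > 0.0
```

The two losses thus share an optimum but weight errors differently: reconstruction scores mistakes directly in $z_0$-space, which is why SPA emphasizes global structure over local noise statistics.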

Style-specific enrichment

The Style-Specific Enrichment (SSE) module is responsible for encoding and applying artistic features, including color distributions, textural patterns, stroke geometry, and lighting cues. Stylization in generative models often suffers from two extremes: either it overwhelms the content with excessive texture or applies only superficial changes, leading to weak style identity. SSE addresses this by isolating style learning in a dedicated phase with specialized LoRA modules.

Unlike SPA, SSE leverages full U-Net coverage, injecting LoRA into both encoder and decoder blocks. This enables holistic propagation of style across spatial and semantic dimensions. The optimization follows the standard epsilon-prediction loss:

$$\mathcal{L}_{SSE} = \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, \mathbf{I}),\ t}\left[ \left\| \epsilon - \epsilon_\theta^{\,s}(z_t, t) \right\|_2^2 \right],$$

where $\epsilon_\theta^{\,s}$ denotes the style-adaptive noise estimator.

To facilitate rich style generalization, SSE is trained using a curated set of diverse style exemplars with aligned content masks. Each style domain includes multiple intra-class variations to capture both coarse and fine-grained stylistic cues. We also use feature histogram loss and patch-level diversity regularization to ensure that the learned LoRA captures not only global style context but also local artistic variations. In practice, we find that SSE learns to emphasize global color mood in early timesteps and detailed textures in later steps. This progressive stylization mirrors traditional artistic workflows and enhances human perceptual satisfaction.
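The text names a feature histogram loss among the SSE regularizers without specifying its form; one plausible instantiation, sketched below under our own assumptions (bin count, value range, and L1 distance are all illustrative choices), compares normalized histograms of two feature maps:

```python
import numpy as np

def histogram_loss(feat_a, feat_b, bins=16, value_range=(-3.0, 3.0)):
    """L1 distance between normalised histograms of two feature maps.
    One plausible form of the 'feature histogram loss'; details assumed."""
    ha, _ = np.histogram(feat_a, bins=bins, range=value_range, density=True)
    hb, _ = np.histogram(feat_b, bins=bins, range=value_range, density=True)
    return float(np.abs(ha - hb).mean())

rng = np.random.default_rng(0)
f = rng.standard_normal((32, 32))
assert histogram_loss(f, f) == 0.0         # identical feature statistics
assert histogram_loss(f, f + 1.0) > 0.0    # shifted distribution is penalised
```

A loss of this shape matches style statistics (e.g., global color mood) without constraining spatial layout, which is consistent with SSE's division of labor against SPA.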

Decoupled inference controller

The Decoupled Inference Controller (DIC) allows UltraStyle to dynamically control the contribution of content and style pathways during the sampling process. Traditional stylization models provide fixed outputs once trained, limiting their usability in interactive or user-driven applications. DIC resolves this by enabling real-time trade-off adjustment using a scalar interpolation strategy:

$$\hat{\epsilon} = \lambda\,\epsilon_{c} + (1 - \lambda)\,\epsilon_{s},$$

where $\epsilon_c$ and $\epsilon_s$ are the latent-level predictions of the content (SPA) and style (SSE) pathways, and $\lambda \in [0, 1]$ controls the balance between content fidelity and stylistic intensity.

In practical applications, users can specify $\lambda$ interactively or map it spatially to different regions of the image. For example, one can use content-preserving SPA predictions in foreground objects while applying strong SSE stylization to the background. We further extend DIC to support temporal interpolation for video stylization, where $\lambda$ evolves across frames to ensure smooth style transitions. To support this flexibility, we design DIC to operate independently of the core diffusion sampling loop. It functions as a plug-and-play fusion layer, requiring only latent-level predictions from pre-trained SPA and SSE modules. This design keeps DIC modular, efficient, and highly extensible.
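The scalar interpolation behind DIC can be sketched as a small fusion function; the name `dic_blend` and the idea of broadcasting a per-pixel weight map for regional control are our illustrative reading of the description above, not the paper's implementation:

```python
import numpy as np

def dic_blend(pred_content, pred_style, lam):
    """Interpolate content- and style-pathway predictions.
    `lam` may be a scalar or a per-pixel map broadcast over the latents."""
    lam = np.asarray(lam)
    return lam * pred_content + (1.0 - lam) * pred_style

rng = np.random.default_rng(0)
pc = rng.standard_normal((8, 8))   # stands in for the SPA pathway prediction
ps = rng.standard_normal((8, 8))   # stands in for the SSE pathway prediction

assert np.allclose(dic_blend(pc, ps, 1.0), pc)   # pure content preservation
assert np.allclose(dic_blend(pc, ps, 0.0), ps)   # pure stylization

# Spatial control: keep the top half content-faithful, stylize the bottom.
lam_map = np.vstack([np.ones((4, 8)), np.zeros((4, 8))])
mixed = dic_blend(pc, ps, lam_map)
assert np.allclose(mixed[:4], pc[:4]) and np.allclose(mixed[4:], ps[4:])
```

Because the blend touches only the two pathways' predictions, it slots in after each denoising step without modifying the sampling loop itself, which is what makes the controller plug-and-play.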

Adaptive transition scheduling

The Adaptive Transition Scheduling (ATS) module bridges the training phases of SPA and SSE. While disjoint training ensures clean decoupling, it may miss opportunities for shared learning. ATS resolves this by smoothly interpolating the training objectives over time:

$$\mathcal{L}_{ATS} = w(t)\,\mathcal{L}_{SPA} + \bigl(1 - w(t)\bigr)\,\mathcal{L}_{SSE},$$

where $w(t)$ is a cosine annealing function that transitions from 1 to 0 over the course of training.

ATS enables early learning of spatial coherence from SPA while gradually allowing SSE to refine appearance. To stabilize this joint regime, we introduce curriculum sampling: in early iterations, only content-rich samples are used; later, stylized samples with greater variance are introduced. This adaptive regime avoids catastrophic forgetting of structure during SSE optimization and leads to smoother convergence and better joint alignment. Experiments show that ATS improves final performance by up to 2.4% on user preference tests and reduces flickering in sequential stylization.
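A minimal sketch of the cosine-annealed transition, assuming the common form $w(t) = \tfrac{1}{2}\bigl(1 + \cos(\pi t / T)\bigr)$, which equals 1 at the start of training and 0 at the end (the exact schedule used in the paper is not specified beyond "cosine annealing"):

```python
import math

def ats_weight(step, total_steps):
    """Cosine annealing from 1 (pure SPA objective) to 0 (pure SSE objective)."""
    return 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

def ats_loss(loss_spa, loss_sse, step, total_steps):
    """Blend the two phase losses according to the current schedule weight."""
    w = ats_weight(step, total_steps)
    return w * loss_spa + (1.0 - w) * loss_sse

assert ats_weight(0, 1000) == 1.0              # training starts fully on SPA
assert abs(ats_weight(1000, 1000)) < 1e-9      # and ends fully on SSE
assert abs(ats_weight(500, 1000) - 0.5) < 1e-9 # equal weighting at the midpoint
```

The smooth hand-off means structural gradients dominate early, matching the claim that ATS learns spatial coherence first and appearance refinement later.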

Experiments

Datasets and settings

Each LoRA module is trained for 500 steps per phase using a batch size of 4 and the AdamW optimizer. During training, we apply mild geometric augmentations to content images to improve robustness. For inference, we employ DDIM sampling with 50 steps and enable the Decoupled Inference Controller for dynamic style-content interpolation. We compare UltraStyle to a wide range of baselines, such as StyleAligned [86], StyleID [8] and B-LoRA [34]. For fair comparison, all methods are trained or tuned on the same training sets and evaluated with identical preprocessing and sampling parameters where applicable.

To evaluate the effectiveness of UltraStyle, we conduct experiments across different domains and stylization complexities following existing works [90–92] based on SDXL [93]. All training and evaluation images are resized to a fixed resolution and evaluated using multiple quantitative metrics (e.g., DreamSim distance [94], CLIP score [95], and DINO score [96]) as well as human-centric metrics.

Qualitative evaluation

To comprehensively assess the visual effectiveness of our method, we conduct a qualitative comparison with three state-of-the-art approaches: StyleID, StyleAligned, and B-LoRA. As illustrated in Fig 3, we present representative results on a range of content and style image pairs, highlighting both content preservation and style alignment.

StyleID employs DDIM inversion to achieve faithful content reconstruction. While this approach excels at preserving the spatial structure of the content image, it often struggles to capture the distinctive stylistic features from the reference image. As a result, the generated outputs tend to exhibit limited stylistic diversity, and the visual impact of the style is frequently diminished. This observation is consistent with quantitative metrics, where StyleID achieves relatively high content similarity but lower style alignment.

StyleAligned focuses on enforcing shared attention between content and style, aiming to improve the consistency of stylization. However, in practice, this method often suffers from significant structural inconsistencies. The generated images sometimes introduce unintended structural elements from the style image, leading to distortions in the reconstructed content. In addition, StyleAligned may fail to effectively disentangle content and style, resulting in artifacts or partial content leakage.

B-LoRA attempts to separate content and style representations by jointly optimizing two distinct LoRA modules within the diffusion backbone. While this design mitigates some degree of content leakage, B-LoRA often fails to preserve the global structure of the content image and occasionally exhibits style misalignment. In some cases, content details from the style reference inadvertently appear in the output, further undermining the consistency and visual fidelity.

Our method distinctly outperforms the above approaches by achieving superior content preservation and style fidelity. Leveraging the proposed reconstruction loss, our method more effectively captures both the global structure and fine-grained details of the content image, while accurately applying the desired style. The two-step training strategy facilitates a clean disentanglement of content and style, significantly reducing content leakage and style misalignment. Furthermore, our inference guidance mechanism enables continuous and precise control over the strengths of both content and style, allowing for flexible stylization according to user preference. As demonstrated in the qualitative results, our method consistently generates stylized images that not only adhere closely to the structural integrity of the content input but also faithfully embody the characteristics of the style reference. In summary, our approach achieves a robust balance between content consistency and style expressiveness, substantially surpassing the competing methods in visual quality and controllability.

Fig 3. We present style transfer results of our method and three baseline methods.

Reprinted from [https://pan.baidu.com/s/10_CjlBAaXZ6vB_RONzfU3g?pwd=b97i] under a CC BY license, with permission from Yi Yang, original copyright 2025. Reprinted from [http://xhslink.com/o/9GlJdtObddi] under a CC BY license, with permission from Wuwei Zhang, original copyright 2025. Reprinted from [http://xhslink.com/o/54Eeo1PlMG] under a CC BY license, with permission from Tao Pu, original copyright 2025.

https://doi.org/10.1371/journal.pone.0346260.g003

Quantitative evaluation

To objectively compare UltraStyle with state-of-the-art baselines, we conduct quantitative evaluations on both style and content alignment using established perceptual metrics. Specifically, we adopt DreamSim (DS) distance and cosine similarity calculated over CLIP and DINO features, which comprehensively reflect the alignment between generated images and their respective style and content references. Table 1 reports the evaluation results for UltraStyle, StyleID, StyleAligned, and B-LoRA across 400 style-content pairs.

In terms of style alignment, UltraStyle achieves the lowest DreamSim distance (0.567) and the highest CLIP (0.659) and DINO (0.629) scores, indicating superior style fidelity. While B-LoRA attains a comparable CLIP score (0.654), this is partially attributed to content leakage, which can inflate feature similarities without genuine style transfer. StyleID and StyleAligned, on the other hand, exhibit higher DS distances and lower CLIP/DINO scores, revealing difficulties in faithfully capturing the intended style.

Regarding content alignment, UltraStyle again demonstrates clear advantages, with a DreamSim distance of 0.524 and a CLIP similarity of 0.671, both outperforming all baselines. Notably, StyleID records a marginally higher DINO score (0.679), reflecting its strong preservation of content structure, but this often comes at the expense of stylistic expressiveness, as shown in qualitative results. In contrast, StyleAligned and B-LoRA fail to simultaneously maintain both content and style, resulting in lower overall alignment scores.

These quantitative findings confirm that UltraStyle delivers the best balance between style and content preservation, significantly surpassing StyleID, StyleAligned, and B-LoRA. Our method effectively avoids the typical trade-off between content and style fidelity, achieving robust performance across all evaluation metrics.

thumbnail
Table 1. Quantitative comparison of style and content alignment. Lower DS and higher CLIP/DINO indicate better performance.

https://doi.org/10.1371/journal.pone.0346260.t001

Ablation studies

To further validate and analyze the contributions of individual components in UltraStyle, we perform comprehensive ablation studies focusing on key design choices. Specifically, we examine the impact of our dual-phase training strategy, progressive loss transition mechanism, and the decoupled inference controller. All ablations are conducted under identical experimental setups, using the same dataset and evaluation metrics as detailed above. Visualization results are shown in Fig 4.

thumbnail
Fig 4. Visualization results of the ablation study.

Reprinted from [https://pan.baidu.com/s/10_CjlBAaXZ6vB_RONzfU3g?pwd=b97i] under a CC BY license, with permission from Yi Yang, original copyright 2025. Reprinted from [http://xhslink.com/o/9GlJdtObddi] under a CC BY license, with permission from Wuwei Zhang, original copyright 2025.

https://doi.org/10.1371/journal.pone.0346260.g004

Impact of Dual-Phase Training. In this ablation, we evaluate the importance of our dual-phase training paradigm by comparing it against a single-phase joint training approach. For the joint training, we simultaneously optimize both content and style LoRA modules without the proposed separation. Results summarized in Table 2 clearly indicate that the dual-phase training approach consistently outperforms joint training across all metrics. The separation notably reduces content leakage and improves style alignment significantly, as indicated by the lower DS score (0.567 vs. 0.592) and higher CLIP score (0.659 vs. 0.641). This confirms that independently specializing content and style representations significantly mitigates feature entanglement issues.
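The dual-phase schedule can be sketched as a simple gate on which LoRA module receives gradient updates at each step. This is a hypothetical sketch: the function name and the even 50/50 phase split are assumptions, not the paper's exact configuration.

```python
def trainable_modules(step: int, total_steps: int) -> dict:
    """Phase 1 optimizes only the content LoRA; phase 2 freezes it and
    optimizes only the style LoRA, so the two representations are never
    updated jointly (the source of cross-interference in joint training)."""
    in_phase_1 = step < total_steps // 2
    return {
        "content_lora": in_phase_1,      # isolate content first
        "style_lora": not in_phase_1,    # then specialize style
    }

# Early step: content LoRA trains, style LoRA frozen.
early = trainable_modules(10, 100)
# Late step: content LoRA frozen, style LoRA trains.
late = trainable_modules(80, 100)
```

In a real training loop, this gate would set `requires_grad` on the corresponding parameter groups before each optimizer step.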

thumbnail
Table 2. Ablation study on the dual-phase training strategy.

https://doi.org/10.1371/journal.pone.0346260.t002

Effectiveness of Progressive Loss Transition. To validate our progressive loss transition strategy, we compare it with static loss weighting approaches. In the static variant, the balance between style and content loss remains constant throughout training. As shown in Table 3, progressively transitioning from low-level detail emphasis to high-level structural coherence yields significantly better results. Specifically, the dynamic strategy outperforms the static approach by improving structural integrity (DINO score improved from 0.608 to 0.629) and style fidelity (DS improved from 0.584 to 0.567), demonstrating that adjusting the training objective adaptively is critical for effectively capturing both detailed style textures and robust content structures.
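One plausible realization of the progressive transition is a smooth ramp that shifts weight from the detail loss to the structural loss over training. A cosine ramp is assumed here for illustration; the paper's exact schedule may differ.

```python
import math

def loss_weights(step: int, total_steps: int):
    """Weights for (detail, structure) losses: starts at (1, 0) and
    ends at (0, 1), transitioning smoothly via a cosine ramp."""
    t = min(max(step / total_steps, 0.0), 1.0)
    w_struct = 0.5 * (1.0 - math.cos(math.pi * t))  # 0 -> 1
    w_detail = 1.0 - w_struct                        # 1 -> 0
    return w_detail, w_struct

def total_loss(detail_loss: float, struct_loss: float,
               step: int, total_steps: int) -> float:
    """Combined objective with progressively shifting emphasis."""
    w_d, w_s = loss_weights(step, total_steps)
    return w_d * detail_loss + w_s * struct_loss
```

The static baseline in Table 3 corresponds to holding these weights fixed (e.g., 0.5/0.5) for all steps.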

thumbnail
Table 3. Ablation study on the progressive loss transition strategy.

https://doi.org/10.1371/journal.pone.0346260.t003

Influence of Decoupled Inference Controller. Finally, we assess the impact of our proposed Decoupled Inference Controller (DIC) by comparing inference performance with and without dynamic content-style interpolation. Without DIC, inference is performed using fixed, averaged module weights. As indicated in Table 4, incorporating DIC substantially improves both quantitative scores and user satisfaction. The flexibility provided by DIC enhances the model’s adaptability to diverse stylization intensities, significantly reducing content misalignment (DS improved from 0.579 to 0.567) and increasing style coherence (CLIP improved from 0.648 to 0.659). User preference tests also strongly favor the dynamic approach, demonstrating the practical value of adaptive inference control. Furthermore, we conduct an ablation study on the influence of the scaling factor in Fig 5. We observe that when the scaling factor is small, the generated images exhibit stronger stylistic characteristics, whereas larger values lead to better preservation of content information. These findings are consistent with our expectations and support the effectiveness of the scaling factor in balancing style and content fidelity.
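The controller's blending behavior can be illustrated by interpolating per-module content and style weight updates at inference. Linear interpolation and the function/key names below are assumptions for illustration, not the paper's exact formulation; the semantics follow the observation above (larger scaling factor preserves more content, smaller strengthens style).

```python
def merge_lora_weights(content_w: dict, style_w: dict, alpha: float) -> dict:
    """Blend content and style LoRA updates per module at inference.

    alpha plays the role of the scaling factor: alpha=1 keeps only the
    content update, alpha=0 keeps only the style update.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return {name: [alpha * c + (1.0 - alpha) * s
                   for c, s in zip(content_w[name], style_w[name])]
            for name in content_w}

# Toy one-module example: content update is all ones, style all zeros.
content = {"attn.q": [1.0, 1.0, 1.0]}
style = {"attn.q": [0.0, 0.0, 0.0]}
merged = merge_lora_weights(content, style, alpha=0.7)
```

A fixed averaged baseline (the "without DIC" setting in Table 4) corresponds to pinning alpha at 0.5 for every input.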

thumbnail
Table 4. Ablation study on the Decoupled Inference Controller.

https://doi.org/10.1371/journal.pone.0346260.t004

thumbnail
Fig 5. Visualization results for different values of the scaling factor.

Reprinted from [https://pan.baidu.com/s/10_CjlBAaXZ6vB_RONzfU3g?pwd=b97i] under a CC BY license, with permission from Yi Yang, original copyright 2025.

https://doi.org/10.1371/journal.pone.0346260.g005

Collectively, these ablation studies clearly demonstrate that each component in UltraStyle contributes significantly to achieving superior performance in high-quality style transfer, justifying our proposed design choices.

Conclusion

In this paper, we introduced UltraStyle, a novel diffusion-based style transfer framework enhanced by a dual-phase training strategy, a progressive loss transition mechanism, and a Decoupled Inference Controller. By addressing the fundamental issues of feature entanglement, content leakage, and structural distortion that commonly plague existing methods, UltraStyle provides significant improvements in both style alignment and content preservation.

Comprehensive qualitative and quantitative experiments consistently demonstrate our method’s superior performance over leading baselines such as StyleAligned, StyleID, and B-LoRA. Detailed ablation studies further validate the individual effectiveness and collective contributions of each key component in UltraStyle: dual-phase training effectively reduces feature entanglement, the progressive loss transition significantly enhances the balance between detail preservation and structural coherence, and the DIC provides crucial adaptive control during inference, leading to improved user satisfaction and performance metrics.

Looking forward, several promising avenues exist for further research. Integrating UltraStyle with advanced generative paradigms such as transformer-based diffusion models may further enhance the robustness and diversity of generated stylizations. Additionally, exploring more sophisticated adaptive inference strategies capable of region-specific or temporally coherent stylization represents a valuable direction, particularly for dynamic content such as videos. We anticipate that UltraStyle will serve as a foundational framework, enabling continued innovation in style transfer and broader generative applications.

Acknowledgments

This work was supported by the Guizhou High-Level Innovative Talent Support Program - 'Thousand Talents' Project (Grant Qian Talent [2025] 202308) and the Foundation Research Project of Kaili University (Grant No. YTH-TD2024001).

References

  1. Zhang Y, Huang N, Tang F, Huang H, Ma C, Dong W, et al. Inversion-based style transfer with diffusion models. In: 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2023. 10146–56. https://doi.org/10.1109/cvpr52729.2023.00978
  2. Deng Y, Tang F, Dong W, Ma C, Pan X, Wang L, et al. StyTr2: image style transfer with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2022. 11326–36.
  3. Kwon G, Ye JC. CLIPstyler: image style transfer with a single text condition. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 18041–50. https://doi.org/10.1109/cvpr52688.2022.01753
  4. Yang S, Jiang L, Liu Z, Loy CC. Pastiche master: exemplar-based high-resolution portrait style transfer. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2022. 7683–92. https://doi.org/10.1109/cvpr52688.2022.00754
  5. Liu S, Lin T, He D, Li F, Wang M, Li X, et al. AdaAttN: revisit attention mechanism in arbitrary neural style transfer. In: 2021 IEEE/CVF International conference on computer vision (ICCV), 2021. 6629–38. https://doi.org/10.1109/iccv48922.2021.00658
  6. Zhang Y, Tang F, Dong W, Huang H, Ma C, Lee TY. Domain enhanced arbitrary image style transfer via contrastive learning. In: ACM SIGGRAPH 2022 conference proceedings. 2022. 1–8.
  7. An J, Huang S, Song Y, Dou D, Liu W, Luo J. ArtFlow: unbiased image style transfer via reversible neural flows. In: 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2021. 862–71. https://doi.org/10.1109/cvpr46437.2021.00092
  8. Chung J, Hyun S, Heo J-P. Style injection in diffusion: a training-free approach for adapting large-scale diffusion models for style transfer. In: 2024 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2024. 8795–805. https://doi.org/10.1109/cvpr52733.2024.00840
  9. Zhang Z, Zhang Q, Xing W, Li G, Zhao L, Sun J, et al. ArtBank: artistic style transfer with pre-trained diffusion model and implicit style prompt bank. Art Intellig. 2024;38(7):7396–404.
  10. Zhang C, Xu X, Wang L, Dai Z, Yang J. S2WAT: image style transfer via hierarchical vision transformer using strips window attention. Art Intellig. 2024;38(7):7024–32.
  11. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90.
  12. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint. 2014.
  13. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), 2016. 770–8. https://doi.org/10.1109/cvpr.2016.90
  14. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial networks. Commun ACM. 2020;63(11):139–44.
  15. Karras T, Aittala M, Laine S, Härkönen E, Hellsten J, Lehtinen J, et al. Alias-free generative adversarial networks. Adv Neural Inform Process Syst. 2021;34:852–63.
  16. Gan J, Wang W, Leng J, Gao X. HiGAN+: handwriting imitation GAN with disentangled representations. ACM Trans Graph. 2022;42(1):1–17.
  17. Zhou T, Li Q, Lu H, Cheng Q, Zhang X. GAN review: models and medical image fusion applications. Inform Fusion. 2023;91:134–48.
  18. Gokaslan A, Huang Y, Kuleshov V, Tompkin J. The GAN is dead; long live the GAN! a modern GAN baseline. In: Advances in neural information processing systems 37, 2024. 44177–215. https://doi.org/10.52202/079017-1402
  19. Liu H, Wan Z, Huang W, Song Y, Han X, Liao J. PD-GAN: probabilistic diverse GAN for image inpainting. In: 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2021. 9367–76. https://doi.org/10.1109/cvpr46437.2021.00925
  20. Tan C, Zhao Y, Wei S, Gu G, Wei Y. Learning on gradients: generalized artifacts representation for GAN-generated images detection. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 12105–14. https://doi.org/10.1109/cvpr52729.2023.01165
  21. Everaert MN, Bocchio M, Arpa S, Süsstrunk S, Achanta R. Diffusion in style. In: 2023 IEEE/CVF International conference on computer vision (ICCV), 2023. 2251–61. https://doi.org/10.1109/iccv51070.2023.00214
  22. Sun Z, Zhou Y, He H, Mok PY. SGDiff: a style guided diffusion model for fashion synthesis. In: Proceedings of the 31st ACM International conference on multimedia, 2023. 8433–42. https://doi.org/10.1145/3581783.3613806
  23. Lu H, Tunanyan H, Wang K, Navasardyan S, Wang Z, Shi H. Specialist diffusion: plug-and-play sample-efficient fine-tuning of text-to-image diffusion models to learn any unseen style. In: 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2023. 14267–76. https://doi.org/10.1109/cvpr52729.2023.01371
  24. Zhong L, Xie Y, Jampani V, Sun D, Jiang H. SMooDi: stylized motion diffusion model. In: Lecture notes in computer science. Springer Nature Switzerland; 2024. 405–21. https://doi.org/10.1007/978-3-031-73232-4_23
  25. Li YA, Han C, Raghavan VS, Mischler G, Mesgarani N. StyleTTS 2: towards human-level text-to-speech through style diffusion and adversarial training with large speech language models. Adv Neural Inf Process Syst. 2023;36:19594–621. pmid:39866554
  26. Somepalli G, Gupta A, Gupta K, Palta S, Goldblum M, Geiping J, et al. Investigating style similarity in diffusion models. In: European conference on computer vision. Springer; 2024. 143–60.
  27. Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S. LoRA: low-rank adaptation of large language models. In: International conference on learning representations (ICLR), 2022.
  28. Chen P-Y, Hsu C-Y, Huang C-Y, Lin C-H, Tsai Y-L, Yu C-M. Safe LoRA: the silver lining of reducing safety risks when finetuning large language models. In: Advances in neural information processing systems 37, 2024. 65072–94. https://doi.org/10.52202/079017-2078
  29. Guo Z, Li L, Shi Z, Tian C, Xu C. HydraLoRA: an asymmetric LoRA architecture for efficient fine-tuning. In: Advances in neural information processing systems 37, 2024. 9565–84. https://doi.org/10.52202/079017-0304
  30. Li J, Wang S, Yu L. LoRA-GA: low-rank adaptation with gradient approximation. In: Advances in neural information processing systems 37, 2024. 54905–31. https://doi.org/10.52202/079017-1741
  31. Sheng Y, Cao S, Li D, Hooper C, Lee N, Yang S. S-LoRA: scalable serving of thousands of LoRA adapters. In: Proceedings of machine learning and systems, 2024. 296–311.
  32. Lin L, Fan H, Zhang Z, Wang Y, Xu Y, Ling H. Tracking meets LoRA: faster training, larger model, stronger performance. In: Lecture notes in computer science. Springer Nature Switzerland; 2024. 300–18. https://doi.org/10.1007/978-3-031-73232-4_17
  33. Hartley ZKJ, Lind RJ, Pound MP, French AP. Domain targeted synthetic plant style transfer using stable diffusion, LoRA and ControlNet. In: 2024 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW), 2024. 5375–83. https://doi.org/10.1109/cvprw63382.2024.00546
  34. Frenkel Y, Vinker Y, Shamir A, Cohen-Or D. Implicit style-content separation using B-LoRA. In: European conference on computer vision. Springer; 2024.
  35. Singh A, Jaiswal V, Joshi G, Sanjeeve A, Gite S, Kotecha K. Neural style transfer: a critical review. IEEE Access. 2021;9:131583–613.
  36. Kadish D, Risi S, Lovlie AS. Improving object detection in art images using only style transfer. In: 2021 International joint conference on neural networks (IJCNN), 2021. 1–8. https://doi.org/10.1109/ijcnn52387.2021.9534264
  37. Chandran P, Zoss G, Gotardo P, Gross M, Bradley D. Adaptive convolutions for structure-aware style transfer. In: 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2021. 7968–77. https://doi.org/10.1109/cvpr46437.2021.00788
  38. Wang P, Li Y, Vasconcelos N. Rethinking and improving the robustness of image style transfer. In: 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2021. 124–33. https://doi.org/10.1109/cvpr46437.2021.00019
  39. Wu X, Hu Z, Sheng L, Xu D. Styleformer: real-time arbitrary style transfer via parametric style composition. In: Proceedings of the IEEE/CVF International conference on computer vision, 2021. 14618–27.
  40. Kim G, Youwang K, Oh T-H. FPRF: feed-forward photorealistic style transfer of large-scale 3D neural radiance fields. Art Intellig. 2024;38(3):2750–8.
  41. Chen H, Wang Z, Zhang H, Zuo Z, Li A, Xing W, et al. Artistic style transfer with internal-external learning and contrastive learning. Adv Neural Inform Process Syst. 2021;34:26561–73.
  42. Yang F, Chen H, Zhang Z, Zhao L, Lin H. Gating PatternPyramid for diversified image style transfer. J Electron Imag. 2022;31(06).
  43. Huang S, Xiong H, Wang T, Wen B, Wang Q, Chen Z, et al. Parameter-free style projection for arbitrary image style transfer. In: ICASSP 2022 - 2022 IEEE International conference on acoustics, speech and signal processing (ICASSP), 2022. 2070–4. https://doi.org/10.1109/icassp43922.2022.9746290
  44. Fang F, Zhang P, Zhou B, Qian K, Gan Y. Atten-GAN: pedestrian trajectory prediction with GAN based on attention mechanism. Cogn Comput. 2022;14(6):2296–305.
  45. Yang T, Ren P, Xie X, Zhang L. GAN prior embedded network for blind face restoration in the wild. In: 2021 IEEE/CVF Conference on computer vision and pattern recognition (CVPR), 2021. 672–81. https://doi.org/10.1109/cvpr46437.2021.00073
  46. Pan X, Tewari A, Leimkühler T, Liu L, Meka A, Theobalt C. Drag your GAN: interactive point-based manipulation on the generative image manifold. In: ACM SIGGRAPH 2023 conference proceedings. 2023. 1–11.
  47. Casanova A, Careil M, Verbeek J, Drozdzal M, Romero Soriano A. Instance-conditioned GAN. Adv Neural Inform Process Syst. 2021;34:27517–29.
  48. Sargsyan A, Navasardyan S, Xu X, Shi H. MI-GAN: a simple baseline for image inpainting on mobile devices. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 7301–11. https://doi.org/10.1109/iccv51070.2023.00674
  49. Lang O, Gandelsman Y, Yarom M, Wald Y, Elidan G, Hassidim A, et al. Explaining in style: training a GAN to explain a classifier in StyleSpace. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 673–82. https://doi.org/10.1109/iccv48922.2021.00073
  50. Zhao Z, Kunar A, Birke R, Chen LY. CTAB-GAN: effective table data synthesizing. In: Asian conference on machine learning. 2021. 97–112.
  51. Kumari N, Zhang R, Shechtman E, Zhu J-Y. Ensembling off-the-shelf models for GAN training. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2022. 10641–52. https://doi.org/10.1109/cvpr52688.2022.01039
  52. Wang T, Zhang Y, Fan Y, Wang J, Chen Q. High-fidelity GAN inversion for image attribute editing. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2022. 11369–78. https://doi.org/10.1109/cvpr52688.2022.01109
  53. Song B, Liu P, Li J, Wang L, Zhang L, He G, et al. MLFF-GAN: a multilevel feature fusion with GAN for spatiotemporal remote sensing images. IEEE Trans Geosci Remote Sensing. 2022;60:1–16.
  54. Li Y, Peng X, Zhang J, Li Z, Wen M. DCT-GAN: dilated convolutional transformer-based GAN for time series anomaly detection. IEEE Trans Knowl Data Eng. 2023;35(4):3632–44.
  55. Zhou Y, Yu K, Wang M, Ma Y, Peng Y, Chen Z, et al. Speckle noise reduction for OCT images based on image style transfer and conditional GAN. IEEE J Biomed Health Inform. 2022;26(1):139–50. pmid:33882009
  56. Han X, Wu Y, Wan R. A method for style transfer from artistic images based on depth extraction generative adversarial network. Appl Sci. 2023;13(2):867.
  57. Kong F, Pu Y, Lee I, Nie R, Zhao Z, Xu D, et al. Unpaired artistic portrait style transfer via asymmetric double-stream GAN. IEEE Trans Neural Netw Learn Syst. 2023;34(9):5427–39. pmid:37459266
  58. Zuo Z, Zhao L, Lian S, Chen H, Wang Z, Li A. Style fader generative adversarial networks for style degree controllable artistic style transfer. In: Proceedings of the International Joint Conference on Artificial Intelligence, 2022. 5002–9.
  59. Qin Z, Liu Z, Zhu P, Ling W. Style transfer in conditional GANs for cross-modality synthesis of brain magnetic resonance images. Comput Biol Med. 2022;148:105928. pmid:35952543
  60. Yang S, Wang Z, Liu J. Shape-matching GAN: scale controllable dynamic artistic text style transfer. IEEE Trans Pattern Anal Mach Intell. 2021;44(7):3807–20.
  61. Liu M, Maiti P, Thomopoulos S, Zhu A, Chai Y, Kim H, et al. Style transfer using generative adversarial networks for multi-site MRI harmonization. Med Image Comput Comput Assist Interv. 2021;12903:313–22. pmid:35647615
  62. Garg M, Ubhi JS, Aggarwal AK. Neural style transfer for image steganography and destylization with supervised image to image translation. Multimed Tools Appl. 2022;82(4):6271–88.
  63. Qian K, Zhang Y, Chang S, Xiong J, Gan C, Cox D. Global prosody style transfer without text transcriptions. In: International conference on machine learning, 2021. 8650–60.
  64. Wang Z, Zhang Z, Zhao L, Zuo Z, Li A, Xing W, et al. AesUST: towards aesthetic-enhanced universal style transfer. In: Proceedings of the 30th ACM International Conference on Multimedia, 2022. 1095–106. https://doi.org/10.1145/3503161.3547939
  65. Chan W, Fleet D, Gritsenko A, Ho J, Norouzi M, Salimans T. Video diffusion models. In: Advances in neural information processing systems 35, 2022. 8633–46. https://doi.org/10.52202/068431-0628
  66. Kingma D, Salimans T, Poole B, Ho J. Variational diffusion models. Adv Neural Inform Process Syst. 2021;34:21696–707.
  67. Gandikota R, Materzyńska J, Fiotto-Kaufman J, Bau D. Erasing concepts from diffusion models. In: 2023 IEEE/CVF International conference on computer vision (ICCV), 2023. 2426–36. https://doi.org/10.1109/iccv51070.2023.00230
  68. Xia B, Zhang Y, Wang S, Wang Y, Wu X, Tian Y, et al. DiffIR: efficient diffusion model for image restoration. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 13049–59. https://doi.org/10.1109/iccv51070.2023.01204
  69. Xu X, Wang Z, Zhang E, Wang K, Shi H. Versatile diffusion: text, images and variations all in one diffusion model. In: 2023 IEEE/CVF International conference on computer vision (ICCV), 2023. 7720–31. https://doi.org/10.1109/iccv51070.2023.00713
  70. Aila T, Aittala M, Karras T, Kynkäänniemi T, Laine S, Lehtinen J. Guiding a diffusion model with a bad version of itself. In: Advances in neural information processing systems 37, 2024. 52996–3021. https://doi.org/10.52202/079017-1679
  71. Zhang M, Guo X, Pan L, Cai Z, Hong F, Li H, et al. ReMoDiffuse: retrieval-augmented motion diffusion model. In: 2023 IEEE/CVF International conference on computer vision (ICCV), 2023. 364–73. https://doi.org/10.1109/iccv51070.2023.00040
  72. Xu Z, Zhang J, Liew JH, Yan H, Liu J-W, Zhang C, et al. MagicAnimate: temporally consistent human image animation using diffusion model. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 1481–90. https://doi.org/10.1109/cvpr52733.2024.00147
  73. Guo Z, Liu J, Wang Y, Chen M, Wang D, Xu D, et al. Diffusion models in bioinformatics and computational biology. Nat Rev Bioeng. 2024;2(2):136–54. pmid:38576453
  74. Nichol AQ, Dhariwal P. Improved denoising diffusion probabilistic models. In: International conference on machine learning, 2021. 8162–71.
  75. Lugmayr A, Danelljan M, Romero A, Yu F, Timofte R, Van Gool L. RePaint: inpainting using denoising diffusion probabilistic models. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 11451–61. https://doi.org/10.1109/cvpr52688.2022.01117
  76. Choi J, Kim S, Jeong Y, Gwon Y, Yoon S. ILVR: conditioning method for denoising diffusion probabilistic models. In: 2021 IEEE/CVF International conference on computer vision (ICCV), 2021. 14347–56. https://doi.org/10.1109/iccv48922.2021.01410
  77. Khader F, Müller-Franzes G, Tayebi Arasteh S, Han T, Haarburger C, Schulze-Hagen M, et al. Denoising diffusion probabilistic models for 3D medical image generation. Sci Rep. 2023;13(1):7303. pmid:37147413
  78. Gong K, Johnson K, El Fakhri G, Li Q, Pan T. PET image denoising based on denoising diffusion probabilistic model. Eur J Nucl Med Mol Imaging. 2024;51(2):358–68. pmid:37787849
  79. Cao J, Wang Z, Guo H, Cheng H, Zhang Q, Xu R. Spiking denoising diffusion probabilistic models. In: 2024 IEEE/CVF Winter conference on applications of computer vision (WACV), 2024. 4900–9. https://doi.org/10.1109/wacv57701.2024.00484
  80. Dorjsembe Z, Odonchimed S, Xiao F. Three-dimensional medical image synthesis with denoising diffusion probabilistic models. In: Medical imaging with deep learning. 2022.
  81. Jiang H, Imran M, Zhang T, Zhou Y, Liang M, Gong K, et al. Fast-DDPM: fast denoising diffusion probabilistic models for medical image-to-image generation. IEEE J Biomed Health Inform. 2025;29(10):7326–35. pmid:40293895
  82. Azqadan E, Jahed H, Arami A. Predictive microstructure image generation using denoising diffusion probabilistic models. Acta Materialia. 2023;261:119406.
  83. Peng Y, Hu D, Wang Y, Chen K, Pei G, Zhang W. StegaDDPM: generative image steganography based on denoising diffusion probabilistic model. In: Proceedings of the 31st ACM International Conference on Multimedia, 2023. 7143–51. https://doi.org/10.1145/3581783.3612514
  84. Meng Q, Shi W, Li S, Zhang L. PanDiff: a novel pansharpening method based on denoising diffusion probabilistic model. IEEE Trans Geosci Remote Sensing. 2023;61:1–17.
  85. Nair NG, Mei K, Patel VM. AT-DDPM: restoring faces degraded by atmospheric turbulence using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023. 3434–43.
  86. Hertz A, Voynov A, Fruchter S, Cohen-Or D. Style aligned image generation via shared attention. In: 2024 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2024. 4775–85. https://doi.org/10.1109/cvpr52733.2024.00457
  87. Liu C, Shah V, Cui A, Lazebnik S. UnZipLoRA: separating content and style from a single image. 2024.
  88. Jones M, Wang S-Y, Kumari N, Bau D, Zhu J-Y. Customizing text-to-image models with a single image pair. In: SIGGRAPH Asia 2024 conference papers, 2024. 1–13. https://doi.org/10.1145/3680528.3687642
  89. Ouyang Z, Li Z, Hou Q. K-LoRA: unlocking training-free fusion of any subject and style LoRAs. 2025.
  90. Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-resolution image synthesis with latent diffusion models. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 10674–85. https://doi.org/10.1109/cvpr52688.2022.01042
  91. Barber J, Blok I, Castro Chin D, Chang H, Entis G, Essa I, et al. StyleDrop: text-to-image synthesis of any style. In: Advances in Neural Information Processing Systems 36, 2023. 66860–89. https://doi.org/10.52202/075280-2920
  92. Wang H, Spinelli M, Wang Q, Bai X, Qin Z, Chen A. InstantStyle: free lunch towards style-preserving in text-to-image generation. 2024. https://arxiv.org/abs/2404.02733
  93. Podell D, English Z, Lacey K, Blattmann A, Dockhorn T, Müller J. SDXL: improving latent diffusion models for high-resolution image synthesis. In: The Twelfth International Conference on Learning Representations, 2024.
  94. Chai L, Dekel T, Fu S, Isola P, Sundaram S, Tamir N, et al. DreamSim: learning new dimensions of human visual similarity using synthetic data. In: Advances in neural information processing systems 36, 2023. 50742–68. https://doi.org/10.52202/075280-2208
  95. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S. Learning transferable visual models from natural language supervision. In: International conference on machine learning, 2021. 8748–63.
  96. Caron M, Touvron H, Misra I, Jegou H, Mairal J, Bojanowski P, et al. Emerging properties in self-supervised vision transformers. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 9630–40. https://doi.org/10.1109/iccv48922.2021.00951