Fig 1.
In Stage 1, the VAE latent channels in the existing Stable Diffusion model are expanded from 4 to 8, and the VAE is pre-trained to enhance representational capacity. In Stage 2, the pre-trained 8-channel VAE is used to configure the input and output layers. For detailed representation learning, a pre-trained CLIP image encoder extracts multi-level embeddings at five scales. These embeddings are injected into the denoising UNet during the diffusion process through additional adapter layers, leading to the generation of the final synthesized images.
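The injection mechanism can be illustrated with a minimal sketch. The adapter form below (a linear projection of the CLIP embedding added as a per-channel bias to a UNet feature map) and all shapes are assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

def adapter_inject(unet_feat, clip_emb, W):
    """Illustrative adapter: project a CLIP image embedding (B, D)
    through a learned weight W (D, C) and add the result as a
    per-channel bias to a UNet feature map (B, C, H, W)."""
    bias = clip_emb @ W                        # (B, C)
    return unet_feat + bias[:, :, None, None]  # broadcast over H, W

# One such adapter would be attached at each of the five scales.
feat = np.zeros((1, 320, 64, 64))          # placeholder UNet features
emb = np.random.randn(1, 1024)             # placeholder CLIP embedding
out = adapter_inject(feat, emb, np.random.randn(1024, 320))
```

In practice `W` would be trained jointly with the diffusion objective while the CLIP encoder stays frozen.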
Fig 2.
An example of the HAM10000 dataset, which consists of 10,015 images across seven skin tumor classes.

Fig 3.
Multi-level feature visualization of skin lesion images.
PCA projection of features extracted from all 32 layers (0-31) of the OpenCLIP ViT-H/14 visual encoder into RGB space. Each layer’s feature map is projected using the first three principal components mapped to RGB channels. The pronounced color variations between adjacent patches indicate the model’s ability to distinguish different regions such as lesion boundaries and morphological structures, with deeper layers showing more prominent inter-patch differences for fine-grained detail representation.
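The per-layer visualization follows a standard recipe; a minimal numpy sketch (the patch-grid size and feature dimension below are placeholders) looks like:

```python
import numpy as np

def features_to_rgb(feats, grid=(16, 16)):
    """Project per-patch features (N, D) onto their first three
    principal components and min-max map the scores to RGB in [0, 1]."""
    centered = feats - feats.mean(axis=0, keepdims=True)
    # SVD yields the principal directions without forming the covariance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T              # (N, 3) PC scores
    proj -= proj.min(axis=0)                # normalize each channel
    proj /= proj.max(axis=0) + 1e-8         # to the [0, 1] range
    return proj.reshape(*grid, 3)           # (H, W, 3) pseudo-color image

rgb = features_to_rgb(np.random.randn(256, 1280))  # one layer's patches
```

Repeating this per layer produces one pseudo-color map for each of the 32 transformer blocks.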
Fig 4.
Latent space interpolation-based sampling workflow.
Disease images are generated from noise using a diffusion model and then segmented to isolate lesion regions. These regions are inpainted to obtain pseudo-normal counterparts while preserving skin context. Both disease and pseudo-normal images are mapped to the latent space via a pre-trained VAE. Latent interpolation is then performed between the two extremes. While this produces visually smooth transitions, the interpolation axis reflects a heuristic binary lesion–normal direction rather than a clinically grounded severity concept.
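The interpolation step itself reduces to a convex combination of the two latents; a linear sketch is shown below (spherical interpolation would be a plausible alternative, and the latent shape is assumed):

```python
import numpy as np

def interpolate_latents(z_normal, z_disease, alpha):
    """Blend pseudo-normal and disease latents: alpha = 0 returns the
    pseudo-normal latent, alpha = 1 the disease latent."""
    return (1.0 - alpha) * z_normal + alpha * z_disease

z_n = np.zeros((8, 32, 32))   # 8-channel VAE latent, size assumed
z_d = np.ones((8, 32, 32))
samples = [interpolate_latents(z_n, z_d, a)
           for a in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

Each interpolated latent is then decoded by the VAE to obtain an image between the two extremes.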
Fig 5.
Visual comparison of reconstruction results: 4-Channel VAE vs. 8-Channel VAE.
In the areas marked with red circles, the 4-channel VAE exhibits subtle artifacts, whereas the 8-channel VAE mitigates these issues.
Table 1.
Quantitative comparison between the 4-channel and 8-channel VAEs.
In the comparison of reconstructed image quality, the 8-channel VAE showed superior results across MSE, LPIPS, and MS-SSIM metrics.
Fig 6.
Visual comparison between the baseline model and our method.
In the “VASC” class, which has relatively fewer training samples, the samples generated by Stable Diffusion (baseline model) fail to accurately reflect lesion boundaries and color information when compared to the original images. Similarly, in the “DF” class, the baseline model produces unnatural textures, such as the keratinized surface of the lesion. In contrast, our method, utilizing multi-level embeddings, effectively learns and represents boundaries, color, and texture, resulting in more natural and faithful representations.
Fig 7.
Generated samples for seven classes.
The synthetic images for each class are visually compared to demonstrate the quality and diversity of the generated data.
Table 2.
Quantitative comparison of image quality between the base model and the proposed method.
This table presents quantitative image-quality results (FID, IS) for the base model (Stable Diffusion) and the proposed method. The proposed method improved FID, while IS showed no significant difference between the models, indicating that the proposed method achieves better distributional similarity to real data than the baseline.
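For reference, FID compares Gaussian fits of Inception features of real images ($\mu_r, \Sigma_r$) and generated images ($\mu_g, \Sigma_g$); lower is better:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
             - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```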
Table 3.
Evaluation of synthetic data effectiveness in downstream classification tasks.
Downstream classification tasks were conducted using four data configurations composed of original and synthetic data. Training with synthetic images alone resulted in lower classification accuracy than training with original images. However, combining synthetic and original data into a larger dataset achieved the highest average classification accuracy.
Fig 8.
Example images from the PAD-UFES-20 dataset.
Unlike the HAM10000 dataset, which consists of professional dermoscopic images, PAD-UFES-20 contains clinical photographs captured using smartphone cameras, introducing variations in lighting, resolution, and background.
Table 4.
Cross-dataset zero-shot evaluation results on PAD-UFES-20.
Models were trained on HAM10000 and evaluated zero-shot on PAD-UFES-20 without fine-tuning.
Fig 9.
Visual ablation comparison. Rows, from top to bottom, show (1) the original image, (2) the synthesis produced by the 4-channel VAE (4ch), (3) the synthesis from the 8-channel VAE (8ch), and (4) the synthesis from the 8-channel VAE with multi-level embeddings (8ch+ML). Comparing each column reveals progressively sharper vessel patterns, keratin scales, and overall texture fidelity as channel width is increased and semantic guidance is introduced.
Table 5.
Quantitative ablation study on VAE channel width and multi-level embeddings.
We evaluated three variants using FID and IS metrics: 4-channel VAE (4ch), 8-channel VAE (8ch), and 8ch model with multi-level CLIP embeddings (8ch+ML). Both 8ch and 8ch+ML variants demonstrated improved FID and IS scores compared to 4ch. However, the IS score for 8ch+ML showed a slight decrease compared to the standalone 8ch model, suggesting that excessive learning of fine-grained features through multi-level embeddings may reduce sample diversity.
Fig 10.
Effect of class-conditioning and CFG scale on skin lesion synthesis.
Class-wise comparison of non-conditioned generation (top row) and class-conditioned generation at different classifier-free guidance (CFG) scales (1, 3, 5) using identical random seeds. Without class-conditioning (non-cond), images lose distinctive lesion-specific characteristics, resulting in visually similar patterns across classes. Increasing CFG strengthens class-specific morphological features while enhancing overall discriminability between classes.
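Classifier-free guidance combines the conditional and unconditional noise predictions at each denoising step; a minimal sketch of the standard update (array shapes are placeholders) is:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the class-conditional one. scale = 1
    recovers the purely conditional prediction; larger scales
    strengthen class-specific features."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = np.zeros((4, 64, 64))   # unconditional prediction (placeholder)
eps_c = np.ones((4, 64, 64))    # class-conditional prediction
guided = cfg_combine(eps_u, eps_c, 3.0)  # pushed beyond eps_c
```

This is why the CFG scales 1, 3, and 5 in the figure trade sample diversity for increasingly pronounced class-specific morphology.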
Fig 11.
Comparison of generated samples based on adjustments to the interpolation coefficient α.
This figure shows samples whose visual characteristics change gradually as the interpolation coefficient α is varied in latent space. While normal images were generally well generated, completely normal samples could not be achieved when the differences between the normal and lesion regions were large. Nevertheless, the results demonstrate that adjusting α yields diverse reference samples that can enhance training data variability.
Table 6.
Semantic variance comparison between A and AI using CLIP embeddings and PCA projection.
Across all seven classes, the AI set—containing partially interpolated samples—consistently exhibited higher variance than the A set, suggesting a potential contribution of interpolation-based samples to enhancing semantic diversity.
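One plausible way to compute such a variance score, consistent with the CLIP-plus-PCA setup, is the total variance of the embeddings along their leading principal components; the component count below is an assumption:

```python
import numpy as np

def semantic_variance(embeddings, n_components=2):
    """Total variance captured by the top principal components of a
    set of embeddings (N, D); larger values indicate a wider semantic
    spread in the projected space."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values
    return float((s[:n_components] ** 2).sum() / (len(embeddings) - 1))

rng = np.random.default_rng(0)
narrow = rng.normal(scale=0.1, size=(100, 512))  # tightly clustered set
wide = rng.normal(scale=1.0, size=(100, 512))    # widely spread set
```

Under this measure, the AI set's higher variance corresponds to a wider spread of its PCA-projected CLIP embeddings.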
Fig 12.
PCA projection of CLIP embeddings for synthetic image samples A (randomly generated) and AI (partially interpolated).
The left panel shows all classes combined, while the right panels show each class separately. Triangular markers indicate interpolated samples (AI), and circular markers indicate non-interpolated samples (A). In most classes, AI samples exhibit a wider spatial spread in the embedding space, suggesting a broader coverage of semantic characteristics compared to A.
Fig 13.
Failure cases. When the influence of fine details, such as hairy regions or color information, is excessively reflected, unrealistic samples are generated.
Fig 14.
Hair removal preprocessing using the DullRazor algorithm before image generation.
The DullRazor algorithm systematically removes hair artifacts through a multi-step process: (1) converting the original hairy input image to grayscale, (2) applying a black hat filter to detect and isolate hair structures, (3) creating a binary mask of detected hair regions, and (4) removing hair regions and inpainting the underlying skin texture. This preprocessing step significantly improves the quality of generated images by eliminating hair-related artifacts that could otherwise dominate the synthesis process, resulting in cleaner and more clinically relevant synthetic skin lesion images.
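Steps (2)–(3) hinge on a morphological black-hat transform. The pure-numpy sketch below is a simplification: a square structuring element and a fixed threshold stand in for DullRazor's directional kernels, and the inpainting of step (4) is omitted:

```python
import numpy as np

def black_hat(gray, k=7):
    """Black-hat = morphological closing minus the image; it responds
    to thin dark structures (hairs) lying on brighter skin."""
    pad = k // 2
    p = np.pad(gray, pad, mode='edge')
    # closing = erosion of the dilation (square k x k element)
    dil = np.lib.stride_tricks.sliding_window_view(p, (k, k)).max(axis=(2, 3))
    p2 = np.pad(dil, pad, mode='edge')
    closing = np.lib.stride_tricks.sliding_window_view(p2, (k, k)).min(axis=(2, 3))
    return closing - gray

def hair_mask(gray, k=7, thresh=20.0):
    """Binary mask of candidate hair pixels (step 3); the masked
    regions would then be inpainted from surrounding skin (step 4)."""
    return black_hat(gray.astype(float), k) > thresh

# Synthetic example: a bright skin patch crossed by one dark hair line.
skin = np.full((32, 32), 200.0)
skin[16, :] = 50.0
mask = hair_mask(skin)
```

On this toy input the mask flags exactly the dark line, which is the behavior the preprocessing relies on before synthesis.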