Abstract
Unsupervised image-to-image translation (UI2I) tasks aim to find a mapping between the source and the target domains from unpaired training data. Previous methods cannot effectively capture the differences between the source and target domains at different scales, which often leads to poor quality of the generated images, with noise, distortion, and other artifacts that do not match human visual perception, and to high time complexity. To address these problems, we propose a multi-scale training structure and a progressive-growth generator method for the UI2I task. Our method refines the generated images from global structures to local details by continuously adding new convolution blocks, and shares the network parameters both across different scales and within the same scale of the network. Finally, we propose a new Cross-CBAM mechanism (CRCBAM), which uses a multi-layer spatial-attention and channel-attention cross structure to generate more refined style images. Experiments on our collected Opera Face dataset and on the open datasets Summer↔Winter, Horse↔Zebra, and Photo↔Van Gogh show that the proposed algorithm is superior to other state-of-the-art algorithms.
Citation: Feng L, Geng G, Li Q, Jiang Y, Li Z, Li K (2023) CRPGAN: Learning image-to-image translation of two unpaired images by cross-attention mechanism and parallelization strategy. PLoS ONE 18(1): e0280073. https://doi.org/10.1371/journal.pone.0280073
Editor: Xiangjie Kong, Zhejiang University of Technology, CHINA
Received: September 27, 2022; Accepted: December 20, 2022; Published: January 6, 2023
Copyright: © 2023 Feng et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The Summer2Winter datasets can be obtained on website (https://people.eecs.berkeley.edu/~taesung_park/CycleGAN/datasets/summer2winter_yosemite.zip) The Horse2Zebra datasets can be obtained on website (https://people.eecs.berkeley.edu/~taesung_park/CycleGAN/datasets/horse2zera.zip) The Photo2Van Gogh datasets can be obtained on website (https://people.eecs.berkeley.edu/~taesung_park/CycleGAN/datasets/vangogh2photo.zip) The Grumpifycat datasets can be obtained on website (https://people.eecs.berkeley.edu/~taesung_park/CycleGAN/datasets/grumpifycat.zip).
Funding: This research was funded by the National Key Research and Development Program of China (2020YFC1523301 and 2019YFC1521103), National Natural Science Foundation of China(62271393), Key Research and Development Program of Shaanxi Province (2019ZDLSF07-02, 2019ZDLGY10-01 and 2021GY-171), National Natural Science Foundation of China (61731015), Key Research and Development Program of Qinghai Province (2020-SF-142). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Unsupervised image-to-image translation (UI2I) translates an image from one domain to another, changing the appearance of a given image while keeping its geometry unchanged: for example, from a horse to a zebra, from a low-resolution image to a high-resolution image, from a photograph to an art painting, and vice versa [1, 2]. UI2I has received a lot of attention due to its excellent performance in areas such as image style transfer [3–7], colourisation [8], super-resolution [9, 10], dehazing [11], denoising [12], image synthesis [13], text-to-image synthesis [14], image generation [15, 16], and underwater image restoration [17].
In recent years, with the emergence of Generative Adversarial Networks (GANs) [18], many GAN-based methods have been proposed to solve UI2I tasks [19–24]. In UI2I tasks without paired training data, the main problem of GANs is that the adversarial loss [18] is unconstrained and many mapping functions exist between the source and target domains, which may lead to unstable training and failure of image translation. To solve this problem, CycleGAN [3], DiscoGAN [25] and DualGAN [26] introduce the cycle-consistency loss [3] into the network model and learn the reverse mapping from the target to the source domain with a reconstruction-consistency constraint.
The above methods usually require a large number of unpaired images for training. However, massive sets of unpaired images are difficult to obtain. Therefore, few-shot and one-shot learning have attracted more and more researchers' interest [27–29]. In one-shot unsupervised learning, the source and target domains each have only one image, and these two images are unpaired. Unfortunately, one-shot and few-shot learning usually lead to severe overfitting of the model. Therefore, solving the UI2I task with a small number of training samples remains a great challenge.
The recently proposed SinGAN [30] shows that the patches of one image contain enough information to train a GAN model, so a large amount of information can be extracted from a single image. Unfortunately, SinGAN is limited to generating a specific data distribution and is not suitable for the UI2I task.
Furthermore, due to its lack of constraints, SinGAN is a serial multi-level model structure with slow training and is limited to generating specific data distributions, resulting in blurred translation results. ConSinGAN [31] uses parallelism for the first time to improve training speed, but still does not solve the translation blurring problem of UI2I. TuiGAN [32] takes full advantage of SinGAN's multi-scale learning of translated images and uses a consistency loss to limit the structural difference between two images, which achieves UI2I between two unpaired images.
However, as with SinGAN, TuiGAN is limited by the serial structure of the model, resulting in slow training, and it cannot effectively capture the differences between the source and target domains at different scales because it constantly changes the receptive field to extract the underlying relationship between the two images. Therefore, this learning process can generate a large amount of noise, resulting in poor translation quality, distortion, and other artifacts that do not accord with human vision.
To overcome the above problems, we propose a new one-shot image translation network framework, named CRPGAN, which adopts a multi-scale training structure and progressively growing generators that can effectively capture the differences in distribution between the source and target domains. In the multi-scale learning process, each scale has two generators, which generate the target image and restore the source image, and one discriminator, which captures the domain distribution of the source and target domains. We divide the initial generator structure into three convolutional blocks, H, B, and T, and keep adding new B blocks while keeping the H and T blocks unchanged in higher-scale training, continuously refining the details of the image translation. That is, a new B block is added to the generator of scale N − 1 to form the generator of scale N. In addition, to improve the training speed, we make use of a parallel structure that does not require repeated training of the current generator. Finally, we use two parameter-sharing structures to reduce the training time: the first shares the generator network parameters between different scales, and the second shares parameters between the convolutional B blocks within the same-scale generator. Moreover, in order to avoid overfitting, a gradually decreasing learning rate is applied to the additional B blocks, which can more accurately capture the different domain distributions at each scale.
During multi-scale training, excessive noise and blurred translation results can easily occur due to repetitive training on a single sample image. CBAM [33] is often effective in many tasks such as object detection [34], image classification [35], and image translation, capturing local information from the image in terms of channels and space and thus improving image quality. However, we find that adding CBAM to the one-shot image translation module still produces images with a lot of noise.
To address this problem, we propose a novel Cross-CBAM mechanism (CRCBAM), which uses a two-layer spatial-attention and channel-attention crossover structure to fully learn local and global information by learning the semantics and location of the target style multiple times in space and channels, so as to better capture the differences between the source and target domains at different scales and generate more finely stylised images.
Our CRPGAN can significantly reduce the training time (60 minutes versus the 240–300 minutes of TuiGAN [32]). We have conducted extensive experiments on various UI2I tasks including Summer→Winter, Horse→Zebra, Photo→Van Gogh and so on. The results show that our method effectively solves the problem of low translation quality in one-shot learning. Compared with the existing UI2I models, CRPGAN achieves better performance. Some experimental results of our method are shown in Fig 1.
Here we show object transformation ((a), (b)) and image style transfer ((c)–(f)). The three images from left to right are the source image (which provides the main content), the target image (which provides style and high-level semantic information), and the translated image.
The main contributions of this paper are as follows:
- We propose a new one-shot image translation network framework CRPGAN, which introduces a progressive growth generator and a multi-scale training approach to efficiently capture the differences between two different distributions in the source and target domains to learn image target styles from global information.
- We use a parallel structure and a two-layer parameter sharing mechanism to reduce the training time for one-shot image translation.
- We propose a Cross-CBAM attention mechanism (CRCBAM) to fully learn the local information of a single-sample image to generate finer stylized images.
The paper is structured as follows. Section 1 is a general introduction providing the necessary background and problems of unsupervised image translation. Section 2 provides current work related to style transfer, unsupervised image translation, and one-shot image translation. Section 3 provides the network structure, attention mechanism and loss function of the methods in this paper. Section 4 provides experimental setup, baseline methodology and evaluation metrics. Section 5 provides qualitative and quantitative experiments and the corresponding parametric and ablation studies. Section 6 provides a conclusion of our approach and future work.
Related work
Image style transfer
Image style transfer [1, 2, 36–38] aims to transfer the style of a target image to the source image. Early works on image style transfer include [1, 2]. Commonly, these methods are based on convolutional neural networks and transfer style by minimizing a loss built on the Gram matrix of pretrained deep features. [1] combines the content and style of an image to transfer the image style; however, the cost of its training process is very high. Considering that the correlation matrix between deep features extracted by a neural network is helpful for image transfer, [2] minimizes a loss constructed from the Gram matrix or covariance matrix and captures the visual style of the target image.
[36, 39, 40] attempt to transfer style with a single trained feed-forward neural network. The shortcoming of these feed-forward methods is that each network is limited to one image style, the optimization is slow, and the model is not flexible enough. The work of Ioffe and Szegedy [41] introduces a batch normalisation (BN) layer, which significantly simplifies the training of feed-forward networks by normalizing the statistical information of the features.
Image-to-image translation
The work of Pix2Pix [42] achieves impressive results in image-to-image translation tasks using paired images based on conditional generative adversarial networks (CGANs), and has been extended to other tasks such as super-resolution [9, 10] and video generation [43, 44]. Although these methods achieve good results, all of them need paired data for training, which limits their practical application. Unlike Pix2Pix [42], CycleGAN [3], DiscoGAN [25] and DualGAN [26] solve unsupervised image-to-image translation (UI2I) by introducing a cycle-consistency loss and two opposite domain-transformation generators. Liu [27] further extends the idea of cycle consistency and proposes FUNIT for few-shot UI2I, which replaces the domain-specific space with a domain-shared latent space. DRIT [5] embeds images into a domain-invariant content space and a domain-specific style space to translate images. However, the above UI2I methods not only require large amounts of data and computational resources, but also cannot effectively capture the differences in distribution between the source and target domains.
Benaim [29] and Cohen [45] propose two methods to solve the one-shot cross-domain translation problem by learning a unidirectional mapping between a source domain with one image and a target domain with a set of images. By adopting a generative model and a multi-scale structure, Lin [32] uses a bidirectional function to map an image from the source to the target domain and then back to the source domain to solve the one-shot cross-domain translation task. The disadvantage of these methods is that they either need a large number of training images, or have high time complexity and lead to low image translation quality, distortion, and other conditions that do not match human visual perception.
One-shot image-to-image translation
Shaham et al. propose SinGAN [30], an unconditional pyramid generation model that learns patch distributions from images at different scales for one-shot image translation. ConSinGAN [31] adopts a multi-level parallel training method to improve the speed of model training, but it can neither capture the domain distribution between images nor solve the image blur problem. Recently, Lin et al. propose TuiGAN [32], which continues to capture the domain distribution between two images using a multi-level approach to one-shot cross-domain translation. This approach constantly changes the receptive field to extract the potential relationship between the two images and cannot effectively capture the differences between the source and target domains at different scales. Therefore, this learning process produces a lot of noise, resulting in poor image translation quality, distortion, and other phenomena that do not conform to human vision. Moreover, it takes 4–5 GPU hours, which does not allow for better control of image translation results.
In this paper, in order to generate high-quality UI2I translation images quickly, we propose parallel multi-scale training and a progressively growing generator to obtain the mapping function between two unpaired images. This method overcomes the shortcomings of the existing UI2I methods, such as large datasets, large memory requirements, low translation quality, and long training time.
Methods
Our method is to learn the mapping function from source domain A to target domain B through parallel multi-scale training and progressive growth generators, where A and B denote two image domains, e.g. summer and winter.
We summarize our approach from four aspects. 1) We replace the multi-scale serial network model proposed by TuiGAN [32] with a parallel multi-scale network model, which does not need to use the images generated at the previous scale and shares the network parameters between adjacent scales as well as within the same scale. This parallel multi-scale network model can be trained independently without affecting the image transfer performance and greatly improves the training speed. 2) We adopt a progressive generator that continuously adds new convolutional layers during the training process, which can capture the differences in domain distribution at different scales and obtain the detailed features of image translation. 3) We use two network parameter-sharing mechanisms to speed up training: the first shares the network parameters between the previous scale N − 1 and the current scale N, and the second shares them between the newly added and the kept B blocks. 4) We use the newly proposed CRCBAM (Cross-CBAM) attention mechanism to fully extract local image information and generate finer stylized images. Moreover, to avoid training overfitting, a gradually decreasing learning rate is used to capture different domain distributions more accurately at different scales.
Unlike traditional UI2I methods, our method requires only two unpaired images to complete the UI2I task, and can produce high-quality translation images with fast training speed. Below, we provide a detailed description of our proposed network.
A, Network architecture
The network architecture of the proposed CRPGAN is shown in Fig 2. The overall framework consists of two symmetric pyramid generators (G and F) and discriminators (DA and DB). It contains two main translation modules: generator G translates IA into IAB (Fig 2(a)) and generator F translates IB into IBA (Fig 2(b)); DA distinguishes whether the input image is the real image IA or the generated image IBA, and DB distinguishes whether the input image is the real image IB or the generated image IAB. In Fig 2(a), generator G translates IA into IAB, after which generator F reconstructs the translation result IAB into IABA. In Fig 2(b), generator F translates IB into IBA, after which generator G reconstructs the translation result IBA into IBAB. Generators G and F have the same network structure and form a symmetric structure in the network with different weight parameters. The learning model contains four losses: LADV, LCYC, LContent, and LTV, in which the adversarial loss LADV ensures the generated images are similar to the target images, the cycle-consistency loss LCYC solves the collapse problem in GANs, the content loss LContent maintains the content information of the source image, and the total variation loss LTV avoids the noise and artifacts occurring in the translated image.
In (a) and (b), our model consists of two symmetric pyramids of GANs that gradually refine the generated images from global structure to local details. We start training at 'scale 0' with the lowest-resolution image and the smallest generator. As the scale increases, the size of the generator is gradually increased and the resolution of the image is also changed from low to high. (c) is our Cross-CBAM attention mechanism, which extracts global and local image information.
Since there are only two images (IA and IB) in the training sample, in order to make full use of these two images and extract as much image information as possible, we adopt a multi-scale structure, which takes different resolutions of the image as inputs to our model. The whole network is divided into N + 1 scales, namely Scale 0, Scale 1, …, Scale N, as shown in Fig 2. We downsample IA and IB to N + 1 different scales to obtain {I0A, I1A, …, INA} and {I0B, I1B, …, INB}, where InA and InB are downsampled from IA and IB by a scale factor rn (0 < r < 1), respectively.
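The multi-scale input described above can be sketched as follows. This is an illustrative numpy-only reconstruction, not the paper's code: nearest-neighbour resampling stands in for proper bilinear/bicubic resizing, and the scale factor and scale count are placeholder values rather than the paper's exact settings.

```python
import numpy as np

def build_pyramid(img, num_scales=5, scale_factor=0.75):
    """Downsample an image into a multi-scale pyramid.

    Scale 0 is the lowest resolution; the last scale is the original
    resolution. Nearest-neighbour index selection stands in for real
    interpolation.
    """
    h, w = img.shape[:2]
    pyramid = []
    for n in range(num_scales):
        s = scale_factor ** (num_scales - 1 - n)  # factor for scale n
        nh, nw = max(1, round(h * s)), max(1, round(w * s))
        rows = (np.arange(nh) * h / nh).astype(int)
        cols = (np.arange(nw) * w / nw).astype(int)
        pyramid.append(img[rows][:, cols])
    return pyramid

# both training images would be downsampled with the same factors
pyr = build_pyramid(np.zeros((250, 250, 3)), num_scales=5, scale_factor=0.75)
```

Each generator at scale n then consumes the image at the matching resolution, so coarse scales learn global structure and fine scales learn detail.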
To learn the mapping between the source and target domains at different scales, we introduced progressively growing generators into the network. At Scale 0, our initialised generator consists of Head block (H), Body block (B) and Tail block (T), where the Head block network structure is Conv − BN − ReLU, the Body block network structure is composed of three Conv − BN − ReLU and CRCBAM attention mechanisms, and the Tail block network structure is Conv − Tanh. At the next scale, our generator keeps the number of Head blocks (H) and Tail blocks (T) constant and dynamically adds a Body block (B) to fully learn the image local features.
Instead, we use a parallel multi-scale training structure to speed up the training process without using the image translation results from the previous scale, i.e. the translated image of the previous scale N − 1 is not used as input for the current scale N, which structurally speeds up training.
InAB = Gn(InA),  InBA = Fn(InB)(1)
Secondly, since the multi-scale generators used in our method have similar structures, repeatedly initializing the generators for training would increase the training time cost. Therefore, we use parameter sharing to assign the weights trained at the previous scale to the current scale, which means that Gn−1 and Fn−1 in Eq (2) do not need to be initialised and directly use the weight parameters of the previous scale's training; only the parameters of the newly added Body block Bnew need to be initialised.
Gn = {Gn−1, Bnew},  Fn = {Fn−1, Bnew}(2)
where Gn is a generator from source domain A to target domain B at scale n, and Fn is a generator from source domain B to target domain A at scale n. Gn−1 is a generator from source domain A to target domain B at scale n − 1, and Fn−1 is a generator from source domain B to target domain A at scale n − 1. Bnew is the newly added Body block.
In addition, since initialising each newly added Body block from scratch would likewise increase the training time, we continue to use parameter sharing so that the parameters of the newly added Body block Bnew at the current scale are the same as those of the adjacent Bn−2, further increasing the training speed.
Gn, Fn: H → B0 → B1 → ⋯ → Bn−2 → Bnew → T,  with Bnew ← Bn−2(3)
where H is a Head block, T is a Tail block, B0, B1, …, Bn−2 are Body blocks already trained by generator Gn−1, Fn−1, and Bnew is a new Body block added to Gn−1, Fn−1 to form the new generators Gn and Fn.
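The growth-and-sharing rules of Eqs (2) and (3) can be sketched as follows. This is a schematic reconstruction, not the paper's implementation: blocks are stood in for by plain weight dictionaries, whereas a real model would use convolutional modules with the Conv-BN-ReLU structure described above.

```python
import copy
import numpy as np

def grow_generator(prev_gen):
    """Grow the scale-(n-1) generator into the scale-n generator.

    Two sharing rules are illustrated: (1) H, T and the already-trained
    Body blocks B_0..B_{n-2} are reused as-is (no re-initialisation);
    (2) the new Body block B_new starts from a copy of the adjacent
    B_{n-2} instead of fresh random weights.
    """
    new_gen = {
        "head": prev_gen["head"],        # H shared across scales
        "body": list(prev_gen["body"]),  # trained B_0 .. B_{n-2} reused
        "tail": prev_gen["tail"],        # T shared across scales
    }
    # B_new initialised from the last trained Body block
    new_gen["body"].append(copy.deepcopy(new_gen["body"][-1]))
    return new_gen

g0 = {"head": {"w": np.ones(3)},
      "body": [{"w": np.full(3, 2.0)}],
      "tail": {"w": np.ones(3)}}
g1 = grow_generator(g0)  # scale-1 generator: H, B0, B_new, T
```

The same growth step would be applied to both G and F, since the two pyramids are symmetric.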
B, Cross-CBAM mechanism
The traditional CBAM attention mechanism is designed to infer attention weights along both the channel and spatial dimensions and multiply them with the original feature map to adaptively adjust features; it improves accuracy and reduces training costs in many tasks (classification, object detection, etc.).
As shown in Fig 3(a), given an intermediate feature map F as input, CBAM sequentially infers a channel attention map Mc(F) (the 1D channel attention map) and a spatial attention map Ms(F′) (the 2D spatial attention map); the channel attention Mc and spatial attention Ms mechanisms are shown in Fig 4.
F′ = Mc(F) ⊗ F,  F″ = Ms(F′) ⊗ F′(4)
where ⊗ denotes element-wise multiplication, F′ is the feature map after the channel attention calculation, and F″ is the final refined output.
(a) is CBAM attention mechanism including a channel attention and spatial attention mechanism. (b) is the mechanism of two channel followed by two spatial attentions. (c) is the mechanism of two spatial attentions followed by two channel attentions. (d) is our CRCBAM attention mechanism.
The above describes the CBAM process. These results hold for tasks with large datasets, but in one-shot learning CBAM cannot adequately learn semantic and location information; local and global information are severely missing, so the one-shot image translation task generates significant noise (Fig 5(a)).
To address this, we hypothesize that multiple crossed channel and spatial attentions can overcome this shortcoming, and design the three strategies shown in Fig 3(b)–3(d). (b) uses the fused features from two parallel channel attentions as the input to two parallel spatial attentions, and multiplies the result with the original feature map to adaptively adjust the features; (c) uses the fused features from two parallel spatial attentions as the input to two parallel channel attentions, and multiplies the result with the original feature map to adaptively adjust the features; (d) fuses the results of a parallel channel- and spatial-attention cross structure, feeds them into a second cross structure, and finally multiplies the result by the original feature map to adaptively adjust the features.
Fig 3(d) is the Cross-CBAM attention mechanism adopted in this paper: through the cross structure of two channel and two spatial attention mechanisms, it learns the semantic and position information of a single image from the channel and spatial dimensions multiple times, optimizing the local information of single-sample image translation.
In Fig 3(d), our overall attention process can be summarized as:
F′ = (Mc(F) ⊕ Ms(F)) ⊗ F,  F″ = (Mc(F′) ⊕ Ms(F′)) ⊗ F′(5)
where ⊗ denotes element-wise multiplication, ⊕ denotes element-wise summation. F″ is the final refined output. Mc is the channel attention mechanism, Ms is the spatial attention mechanism.
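One possible reading of the cross structure can be sketched in a few lines of numpy. This is an illustrative simplification, not the paper's CRCBAM: the shared MLP of channel attention and the 7×7 convolution of spatial attention are replaced by direct sums of the pooled descriptors, and the exact wiring of Fig 3(d) is an assumption based on Eq (5).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F):
    """Mc(F): channel attention map of shape (C, 1, 1).
    Simplified: avg- and max-pooled channel descriptors are summed
    directly (the original CBAM passes them through a shared MLP)."""
    return sigmoid(F.mean(axis=(1, 2)) + F.max(axis=(1, 2)))[:, None, None]

def spatial_attention(F):
    """Ms(F): spatial attention map of shape (1, H, W).
    Simplified: the 7x7 conv over stacked avg/max maps is replaced
    by their sum."""
    return sigmoid(F.mean(axis=0) + F.max(axis=0))[None, :, :]

def cross_cbam(F):
    """Eq (5) as read here: fuse parallel channel and spatial maps by
    element-wise (broadcast) summation, scale the feature map, then
    repeat the cross once more on the refined features."""
    F1 = (channel_attention(F) + spatial_attention(F)) * F
    F2 = (channel_attention(F1) + spatial_attention(F1)) * F1
    return F2

out = cross_cbam(np.random.rand(8, 16, 16))  # (C, H, W) feature map
```

The broadcast sum of a (C, 1, 1) channel map and a (1, H, W) spatial map yields a full (C, H, W) attention volume, which is what allows the two attention types to be fused before rescaling the features.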
The results of applying each attention strategy to a single image are shown in Fig 5 and Table 1. (a) is the stylized image with CBAM added; we find the generated image is blurred. In (b) and (c), the clarity and style of the translation results gradually improve, but local information remains relatively blurred. (d) With our attention mechanism, the content and style of the translation results are relatively clear, with the lowest SIFID values.
The best scores are in bold.
C, Loss functions
CRPGAN contains four loss functions: adversarial loss, cycle-consistency loss, content loss, and total variation loss. The details are described below.
(1) Total loss. For any n ∈ {0, 1, ⋯, N}, the overall loss function of the n-th scale is defined as follows:
Ln = LnADV + λCYC LnCYC + λIDT LnIDT + λContent LnContent + λTV LnTV(6)
where LnADV, LnCYC, LnIDT, LnContent and LnTV refer to the adversarial loss, cycle-consistency loss, identity loss, content loss, and total variation loss at the n-th scale, respectively. λCYC, λIDT, λContent and λTV are the hyperparameters that balance the loss functions.
(2) Adversarial loss. We use an adversarial loss to encourage the generator to transfer images that are visually similar to the target domain images. At each scale n of image translation, there are two discriminators, DnA and DnB, which take InBA and InAB as inputs, respectively, and output the probability that the input is a natural image in the corresponding domain. In this paper, we choose the WGAN-GP loss [46] as the adversarial loss, which effectively increases training stability by replacing weight clipping with a gradient penalty.
LnADV = E[DnB(InB)] − E[DnB(InAB)] − λPEN E[(‖∇Î DnB(Î)‖2 − 1)2](7)
where Î = αInB + (1 − α)InAB, α ∼ U(0, 1), and λPEN is the penalty coefficient; the loss for DnA is defined symmetrically.
(3) Cycle-consistency loss. One of the training problems of conditional GAN is mode collapse. To mitigate the mode collapse problem, we impose cycle-consistency loss [3] on the generator, which can constrain the model to retain the inherent properties of input image after translation: ∀n ∈ {0, 1, ⋯, N},
LnCYC = ‖InA − InABA‖1 + ‖InB − InBAB‖1(8)
where InABA = Fn(Gn(InA)) and InBAB = Gn(Fn(InB)).
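The round-trip constraint of Eq (8) can be made concrete with a toy sketch. The generators here are stand-in lambdas (an exactly invertible pair, chosen so the cycle loss is provably zero), not the paper's convolutional networks.

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference between two images."""
    return np.abs(a - b).mean()

def cycle_consistency_loss(I_A, I_B, G, F):
    """L_CYC at one scale: translate A -> B -> A and B -> A -> B, then
    penalise the L1 distance between each input and its reconstruction."""
    I_ABA = F(G(I_A))  # round trip through the target domain
    I_BAB = G(F(I_B))  # round trip through the source domain
    return l1(I_A, I_ABA) + l1(I_B, I_BAB)

# toy generators: F exactly inverts G, so the cycle loss is zero
G = lambda x: x + 1.0
F = lambda x: x - 1.0
loss = cycle_consistency_loss(np.zeros((4, 4)), np.ones((4, 4)), G, F)
```

In training, of course, G and F only approximate inverses of each other, and this term is what pushes them toward that behaviour and away from mode collapse.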
(4) Content loss. To maintain the content information of the input image, we include the content loss LCONTENT, calculated as the mean-square error between the features of the content and output images extracted from a pre-trained VGG-16 network, similar to the existing work by Gatys et al. [2].
LnCONTENT = ‖ϕi(InA) − ϕi(InAB)‖22 + ‖ϕi(InB) − ϕi(InBA)‖22(9)
where ϕi denotes the features extracted from the i-th layer of the pretrained VGG-16 model.
(5) Total variation loss. To avoid the effect of noise on the image, we introduce the total variation (TV) loss [47], which helps to remove the rough texture in the translated image to smooth the image, eliminate noise, induce spatial continuity in the translated image, and avoid over-pixelation of the result. It encourages images to consist of several patches by calculating the differences of neighboring pixel values in the image. Let x[i, j] denote the pixel of image x located in the i-th row and j-th column, the n-th stage TV loss is calculated as follows.
LnTV(x) = Σi,j ((x[i, j + 1] − x[i, j])2 + (x[i + 1, j] − x[i, j])2)1/2(10)
where x ∈ {InAB, InBA} and the sum runs over all valid pixel positions.
Experiments
Our training details, dataset, evaluation metrics, and all baselines are described below.
A, Training details
We train the network using the Adam [48] optimizer, where β1 = 0.5 and β2 = 0.999. The initial learning rate δ of CRPGAN is 0.0005, the scale factor η is 0.1, our model contains 5 scales with 100 epochs per scale, and the generator learning rate decays exponentially. In addition, we adopt the generator based on Resnet [49] and discriminator based on PatchGAN [50]. We set the batch size to 1, the maximum image resolution to 250 × 250, and the minimum resolution to 100 × 100. All experiments are set with the weight parameters λCYC = 1, λCONTENT = 0.08, λTV = 0.1 and λPEN = 0.1. We train our model by using a single 2080Ti GPU and the training costs 60 minutes.
As mentioned before, all generators in the CRPGAN framework share the same architecture and are all fully convolutional networks. In detail, the generator consists of 5 conv-blocks in the form of 3×3 Conv-BatchNorm-LeakyReLU with stride 1. Whenever scale N − 1 converges, we add three convolution layers to the body block of the generator. For each discriminator, we choose the Markovian discriminator [50], which has the same receptive field as the generator, with a patch size of 11 × 11.
B, BaseLines
In this paper, the proposed method is compared qualitatively and quantitatively with the latest UI2I methods. We choose the following baselines:
- SinGAN [30], which is a pyramidal unconditional generative model trained on only one image from the target domain.
- TuiGAN [32], which uses two unpaired images for image translation through a multi-stage training structure.
- CycleGAN [3], which introduces cycle-consistency loss to learn the reverse mapping from the target domain to the source domain.
- TSIT [52], which provides a carefully designed two-stream generative model for image translation.
- DCLGAN [19], which is based on contrastive learning and a dual learning setting to infer an efficient mapping between unpaired data.
- lrwGAN [23], which solves the image translation task between two unaligned domains by importance re-weighted image selection.
- StyTR2 [53], which is a style transfer model using transformers as encoders.
- Qs-Attn [24], which designs a query-selected attention (QS-Attn) module to ensure that the source image learns the target image features at the corresponding locations for image translation.
For all the above baselines, we use their official released code to produce the results.
C, Evaluation
Metrics. In this paper, we use Single Image Fréchet Inception Distance (SIFID) [30] to evaluate the quality of translated images. SIFID estimates the difference in the internal distribution of two images by calculating the Fréchet Inception Distance (FID) [51] between the depth features of the two images. A lower FID indicates a smaller Fréchet distance between the real image and the generated image. That is, a lower FID means that the translated image is more realistic. Therefore, a lower SIFID score means that the style of the two images is more similar and the quality of the translated image is higher. In this paper, the SIFID between the translated image and the target image is calculated.
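To make the metric concrete, the Fréchet distance underlying FID/SIFID has a closed form between Gaussian fits. The sketch below is illustrative only: real (SI)FID fits multivariate Gaussians to deep Inception features and requires a matrix square root, whereas in one dimension the formula collapses to (μ1 − μ2)² + (σ1 − σ2)².

```python
import numpy as np

def frechet_distance_1d(feats_a, feats_b):
    """Fréchet distance between 1-D Gaussian fits of two feature sets.

    Toy stand-in for (SI)FID: identical distributions give 0, and the
    score grows as the feature statistics drift apart, which is why a
    lower score means the two images' styles are more similar.
    """
    mu1, s1 = feats_a.mean(), feats_a.std()
    mu2, s2 = feats_b.mean(), feats_b.std()
    return (mu1 - mu2) ** 2 + (s1 - s2) ** 2

same = frechet_distance_1d(np.arange(10.0), np.arange(10.0))
far = frechet_distance_1d(np.arange(10.0), np.arange(10.0) + 5.0)
```

SIFID applies this idea to the internal patch statistics of a single image pair rather than to whole datasets.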
Results
In this section, we compare CRPGAN with all baselines on different datasets. In addition, we only use the SIFID score as an evaluation metric.
A, General UI2I tasks
Table 2 shows the translation results of CRPGAN compared with the latest UI2I models on Summer↔Winter, Horse↔Zebra, and Photo→Van Gogh. Clearly, our method outperforms all the baselines. The corresponding qualitative results for randomly selected samples are given in Figs 6–8.
Among them, SinGAN is trained using one target domain image, TuiGAN and CRPGAN in this paper are trained using two unpaired images, others are trained using the complete dataset.
Overall, CRPGAN generates better translation images than the baseline models, with a substantial reduction in training time. SinGAN [30] changes the global color of the source image in its translation results, but fails to transfer high-level semantic structures when the difference between image patches is large; it is unable to learn good image distributions and is prone to generating unrealistic images, e.g. Horse↔Zebra in Fig 7. TuiGAN [32], trained with only two images, achieves comparable results, but the translated images are usually unclear and of poor quality, with noise, distortion, and other parts that do not match human vision, e.g. Summer↔Winter in Fig 6 and Horse↔Zebra in Fig 7. TSIT [52] only changes the color of the source image and does not capture the prominent painting style of the target domain. CycleGAN [3], DCLGAN [19] and Qs-Attn [24] can preserve source content features well, but cannot learn one-shot target image styles. StyTr2 [53] can change the image style while maintaining the source image content well, but its training time cost is too high due to its transformer structure, e.g. Photo→Van Gogh in Fig 8. CRPGAN learns the global and local structure of the source image through CRCBAM attention, from coarse to fine, via a multi-scale progressive generation structure, so it can better preserve architectural contours while taking on the style of the target image.
B, Painting-to-image translation
The painting-to-image translation task converts a rough drawing into a realistic image. In this paper, two samples provided by SinGAN [30] are used for training, and the results are shown in Fig 9. Although the two images contain similar elements (trees and roads), their styles are completely different. SinGAN and TSIT [52] are unable to transfer the target style or generate specific details such as leaves on trees. StyTr2 [53] is good at global style transfer, but local information (e.g. trees) is not transformed. TuiGAN [32] is able to transfer the target style, but its local details are not as rich as our model's, as shown in the second row of Fig 9.
We magnify the green box in the translated image in the second row to show more detail.
C, Parametric study
To evaluate the effect of the scaling factor η in the generator and of the Head, Body, and Tail blocks on the image translation results in our network framework, we designed parameter study experiments on Horse↔Zebra. In the experiments, our network architecture consists of five scales, denoted Scale 0 − 4. In the experimental results below, Scale 0 − M means the current block is trained from Scale 0 to Scale M; the weights obtained from training at Scale M are directly applied at Scale M + 1. Since the Tail block, which consists of a single convolution layer and an activation function, directly outputs the image translation result, it is trained at all scales by default.
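The schedule above can be sketched as a toy training loop. The `weights` dict and the `+= 1.0` update are placeholders for real parameters and gradient steps; the point is the hand-off of weights from Scale M to Scale M + 1 and the per-scale choice of which blocks remain trainable:

```python
def train_progressively(num_scales=5, body_stop_scale=None):
    """Toy multi-scale schedule: Head and Tail train at every scale; the Body
    block trains only up to body_stop_scale (None = all scales). Weights are
    carried over unchanged from each scale to the next, finer one."""
    weights = {"head": 0.0, "body": 0.0, "tail": 0.0}  # stand-in for parameters
    history = []
    for scale in range(num_scales):
        trainable = {"head", "tail"}
        if body_stop_scale is None or scale <= body_stop_scale:
            trainable.add("body")
        for block in trainable:
            weights[block] += 1.0  # placeholder for one scale's training updates
        history.append(dict(weights))  # snapshot after this scale
    return history
```

With `body_stop_scale=2`, this reproduces the Scale 0 − 2 setting examined for the Body block in Experiment 3 below.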
We design three parameter study experiments: Experiment 1 verifies the effect of the scaling factor η on our model, while Experiments 2 and 3 observe the effect of the parallel training mode on the Head and Body blocks, respectively.
In Experiment 1, we set the Head, Body, and Tail blocks to participate in training at all scales by default, vary the value of the scaling factor η, and observe its effect on model performance. To avoid overfitting at lower scales caused by an excessively high learning rate, we scale the learning rate by the factor η. We set η to 0.05, 0.10, 0.30, 0.50, and 1.0 to observe its influence on translation; the results are shown in Fig 10 and Table 3. When η is 0.05, the generator is unable to learn the local features of the target image and can only generate the outline of the horse, not its specific texture. When η is 0.30, 0.50, or 1.0, the horse's head is distorted and texture is missing, which we attribute to overfitting. When η is 0.10, the image translation results are better and the SIFID value is minimal, generating a more realistic image. Therefore, in CRPGAN we fix the scaling factor η to 0.10.
The best scores are in bold.
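One plausible reading of this schedule — the paper does not spell out the exact formula — is that blocks revisited at coarser scales use a learning rate shrunk by η, while the block being trained at the current scale keeps the base rate:

```python
def scaled_learning_rate(base_lr, eta, block_scale, current_scale):
    """Hypothetical schedule: a block introduced at an earlier (coarser) scale
    is fine-tuned with a learning rate shrunk by eta, to avoid overfitting the
    coarse stages; the block at the current scale uses the full base rate."""
    if block_scale < current_scale:
        return base_lr * eta
    return base_lr
```

Under this reading, η = 0.05 barely updates the coarse blocks (under-fitting local texture), while η near 1.0 re-trains them as aggressively as the new block (overfitting), matching the behaviour observed in Fig 10.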
In Experiment 2, we fixed the scaling factor η to 0.10, trained the Body and Tail blocks at all scales, and varied the parallel training positions of the Head block at different scales. The results of Experiment 2 are shown in Fig 11(a) and Table 4. From Fig 11(a), we can see that Scale 0 and Scale 0 − 1 cannot guarantee object integrity (the translated zebra's head is missing), and Scale 0 − 2 produces incorrect color and texture (the translated zebra's head retains a horse texture). Scale 0 − 4 captures the content features of the image and transfers the style features of the target image (the horse is fully translated into a zebra). Table 4 also shows that Scale 0 − 4 has the smallest SIFID value, indicating a more realistic generated image. As the scale of parallel training increases, the image content features are better maintained and the target image style is better transferred. Therefore, when the scaling factor η is 0.10 and the Body and Tail blocks are trained at all scales, training the Head block at all scales ensures object integrity and translation accuracy.
The best scores are in bold.
Similar to Experiment 2, in Experiment 3 we fixed the scaling factor η to 0.10, trained the Head and Tail blocks at all scales by default, and varied the parallel training positions of the Body block at different scales. From Fig 11(b), we can see that Scale 0 and Scale 0 − 1 cannot guarantee object integrity (the translated zebra's head is missing), and as the scale increases, Scale 0 − 3 and Scale 0 − 4 show incorrect colors and textures (the translated zebra's head retains a horse texture). Scale 0 − 2 captures the image content features and transfers the style features (the horse is translated into a zebra). This is also verified by the experimental results in Table 3. Combining Fig 11(b) and Table 3, it can be seen that when the scaling factor η is 0.10 and the Head and Tail blocks are trained at all scales, stopping the training of the Body block at Scale 3 ensures object integrity and translation accuracy. The reason is that the Body block learns the global information of the image at Scale 0 − 2 to ensure image integrity, but over-training the Body block results in overfitting, causing the translation results to exhibit features of the source image.
D, Ablation study
To evaluate the effect of individual loss functions and CRCBAM attention in the CRPGAN on the image translation results, we design ablation experiments based on Summer↔Winter, as shown in Fig 12.
Fixing N = 6 and epochs = 100, we removed the content loss (CRPGAN w/o LContent), the cycle consistency loss (CRPGAN w/o LCYC), the total variation loss (CRPGAN w/o LTV), and the CRCBAM attention mechanism (CRPGAN w/o CRCBAM), and compared the differences.
On image style transfer tasks such as Summer↔Winter, the qualitative results are shown in Fig 12 and the quantitative results in Table 5. Without the content loss LContent, the content information of the generated image is lost (Fig 12(b)). Without the cycle consistency loss LCYC, the style information of the generated result is lost (Fig 12(c)). Without the total variation loss LTV, our model generates images with noise (Fig 12(d)). Without CRCBAM, our model also loses location information (Fig 12(e)). Our full method generates finer stylized images while preserving the source image content, achieving the lowest SIFID value (Fig 12(a)).
The best scores are in bold.
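For reference, the three losses removed in the ablation above can be sketched in a few lines. These are pixel-level L1 stand-ins assuming the usual definitions; the paper's actual content loss may instead compare deep features:

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two arrays."""
    return float(np.mean(np.abs(a - b)))

def content_loss(x, g_x):
    """Keeps the translation G(x) close to the source content x
    (here a pixel-level stand-in)."""
    return l1(x, g_x)

def cycle_loss(x, f_g_x):
    """Cycle consistency: translating to the target domain and back,
    F(G(x)), should recover the source image x."""
    return l1(x, f_g_x)

def tv_loss(img):
    """Total variation: penalizes differences between neighbouring pixels,
    suppressing the kind of noise seen in Fig 12(d)."""
    dh = np.abs(img[1:, :] - img[:-1, :]).sum()
    dw = np.abs(img[:, 1:] - img[:, :-1]).sum()
    return float(dh + dw)
```

A constant image has zero total variation, and a perfect reconstruction has zero cycle loss, so each term is minimized exactly when the corresponding property holds.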
E, Object transformation
To verify the generalizability of our method on other datasets, we conduct experiments on the animal and OperaFace datasets.
In addition, we show the results of CRPGAN on four object transformation tasks in Fig 13: dog face translation, cat face translation, wild face translation, and OperaFace translation. The experimental results verify the generality of our model on UI2I tasks; it performs well and generates more realistic, higher-quality translated images.
Conclusions and discussion
In this paper, we propose CRPGAN, a new image-to-image translation framework with two unpaired images. Specifically, CRPGAN uses a multi-scale training process to learn the global and local structures (texture and style features) of images from coarse to fine. Meanwhile, we use a progressive growth generator to grow the generator size at each scale and adjust the learning rate at lower scales and the number of layers in parallel training stages so that the model can accurately capture the differences in the distribution between the source and target domains and improve the quality of translated images. Next, the model training speed is improved by using a twice-parameter sharing structure. Finally, the newly proposed CRCBAM can fully extract the local and global information of single-sample images to generate finer stylized images. The experimental results show that in image translation tasks with extremely limited data, our method can make better use of image information to generate detailed and realistic image translation results, and the framework can be widely applied to image translation tasks.
However, our method still has shortcomings. The first is generalization: for example, a model trained on one horse-zebra pair translates other images from the same horse dataset poorly. We consider introducing data augmentation to improve this in the future. The second is the incomplete parallel strategy. The parallel strategy avoids repeatedly training the weights of the same Body layer within the current stage, but it still depends on the training weights of the previous stage; studying truly parallel strategies for training single-sample image translation is left for future work. Finally, the transformer [54] can learn global image information better, which helps preserve the original information in image style transfer and translation, but it needs to be trained on large datasets; how to use transformers in single-sample image translation is a research point worth exploring.
References
- 1.
Gatys LA, Ecker AS, Bethge M. A neural algorithm of artistic style. arXiv preprint arXiv:150806576. 2015.
- 2.
Gatys LA, Ecker AS, Bethge M. Image style transfer using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 2414–2423.
- 3.
Zhu JY, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision; 2017. p. 2223–2232.
- 4.
Huang X, Liu MY, Belongie S, Kautz J. Multimodal unsupervised image-to-image translation. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 172–189.
- 5.
Lee HY, Tseng HY, Huang JB, Singh M, Yang MH. Diverse image-to-image translation via disentangled representations. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 35–51.
- 6.
Park T, Efros AA, Zhang R, Zhu JY. Contrastive learning for unpaired image-to-image translation. In: European Conference on Computer Vision. Springer; 2020. p. 319–345.
- 7.
Benaim S, Wolf L. One-Sided Unsupervised Domain Mapping. In: NIPS; 2017. p. 752–762.
- 8.
Zhang R, Isola P, Efros AA. Colorful image colorization. In: Computer Vision—14th European Conference, ECCV 2016, Proceedings. Springer; 2016. p. 649–666.
- 9.
Kim J, Lee JK, Lee KM. Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 1646–1654.
- 10.
Yuan Y, Liu S, Zhang J, Zhang Y, Dong C, Lin L. Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; 2018. p. 701–710.
- 11.
Li R, Pan J, Li Z, Tang J. Single image dehazing via conditional generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. p. 8202–8211.
- 12.
Chen J, Chen J, Chao H, Yang M. Image blind denoising with generative adversarial network based noise modeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. p. 3155–3164.
- 13.
Zhang X, Zheng Z, Gao D, Zhang B, Pan P, Yang Y. Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022. p. 18429–18438.
- 14. Yuan M, Peng Y. Bridge-GAN: Interpretable Representation Learning for Text-to-Image Synthesis. IEEE Transactions on Circuits and Systems for Video Technology. 2020;30(11):4258–4268.
- 15.
Han L, Min MR, Stathopoulos A, Tian Y, Gao R, Kadav A, et al. Dual Projection Generative Adversarial Networks for Conditional Image Generation. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021. p. 14418–14427.
- 16. Peng Y, Qi J. CM-GANs: Cross-Modal Generative Adversarial Networks for Common Representation Learning. ACM Trans Multimedia Comput Commun Appl. 2019;15(1).
- 17.
Han J, Shoeiby M, Malthus T, Botha E, Anstee J, Anwar S, et al. Single underwater image restoration by contrastive learning. 2021; p. 2385–2388.
- 18.
Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative Adversarial Nets. In: NIPS; 2014. p. 2672–2680.
- 19.
Han J, Shoeiby M, Petersson L, Armin MA. Dual Contrastive Learning for Unsupervised Image-to-Image Translation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); 2021. p. 746–755.
- 20.
Pizzati F, Cerri P, de Charette R. CoMoGAN: continuous model-guided image-to-image translation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021. p. 14283–14293.
- 21. Zheng Z, Bin Y, Lu X, Wu Y, Yang Y, Shen HT. Asynchronous generative adversarial network for asymmetric unpaired image-to-image translation. IEEE Transactions on Multimedia. 2022; p. 1–1.
- 22. Li X, Du Z, Huang Y, Tan Z. A deep translation (GAN) based change detection network for optical and SAR remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing. 2021;179:14–34.
- 23.
Xie S, Gong M, Xu Y, Zhang K. Unaligned Image-to-Image Translation by Learning to Reweight. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021. p. 14154–14164.
- 24.
Hu X, Zhou X, Huang Q, Shi Z, Sun L, Li Q. QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022. p. 18291–18300.
- 25.
Kim T, Cha M, Kim H, Lee JK, Kim J. Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. In: Proceedings of the 34th International Conference on Machine Learning—Volume 70; 2017. p. 1857–1865.
- 26.
Yi Z, Zhang H, Tan P, Gong M. Dualgan: Unsupervised dual learning for image-to-image translation. In: Proceedings of the IEEE international conference on computer vision; 2017. p. 2849–2857.
- 27.
Liu MY, Huang X, Mallya A, Karras T, Aila T, Lehtinen J, et al. Few-shot unsupervised image-to-image translation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. p. 10551–10560.
- 28.
Wang Y, Khan S, Gonzalez-Garcia A, Weijer Jvd, Khan FS. Semi-supervised learning for few-shot image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020. p. 4453–4462.
- 29.
Benaim S, Wolf L. One-Shot Unsupervised Cross Domain Translation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18. Curran Associates Inc.; 2018. p. 2108–2118.
- 30.
Shaham TR, Dekel T, Michaeli T. SinGAN: Learning a Generative Model From a Single Natural Image. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV); 2019. p. 4569–4579.
- 31.
Hinz T, Fisher M, Wang O, Wermter S. Improved techniques for training single-image gans. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2021. p. 1300–1309.
- 32.
Lin J, Pang Y, Xia Y, Chen Z, Luo J. Tuigan: Learning versatile image-to-image translation with two unpaired images. In: European Conference on Computer Vision. Springer; 2020. p. 18–35.
- 33.
Woo S, Park J, Lee JY, Kweon IS. Cbam: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 3–19.
- 34.
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft coco: Common objects in context. In: European conference on computer vision. Springer; 2014. p. 740–755.
- 35.
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition; 2009. p. 248–255.
- 36.
Johnson J, Alahi A, Fei-Fei L. Perceptual losses for real-time style transfer and super-resolution. In: European conference on computer vision. Springer; 2016. p. 694–711.
- 37.
Huang X, Belongie S. Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE International Conference on Computer Vision; 2017. p. 1501–1510.
- 38.
Sanakoyeu A, Kotovenko D, Lang S, Ommer B. A style-aware content loss for real-time hd style transfer. In: proceedings of the European conference on computer vision (ECCV); 2018. p. 698–714.
- 39.
Ulyanov D, Lebedev V, Vedaldi A, Lempitsky VS. Texture networks: Feed-forward synthesis of textures and stylized images. In: ICML. vol. 1; 2016. p. 4.
- 40.
Li C, Wand M. Precomputed real-time texture synthesis with markovian generative adversarial networks. In: European conference on computer vision. Springer; 2016. p. 702–716.
- 41.
Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In: Bach F, Blei D, editors. Proceedings of the 32nd International Conference on Machine Learning. vol. 37 of Proceedings of Machine Learning Research. Lille, France: PMLR; 2015. p. 448–456.
- 42.
Isola P, Zhu JY, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1125–1134.
- 43.
Huang Y, He M, Jin L, Wang Y. RD-GAN: few/zero-shot chinese character style transfer via radical decomposition and rendering. In: European Conference on Computer Vision. Springer; 2020. p. 156–172.
- 44.
Vondrick C, Pirsiavash H, Torralba A. Generating Videos with Scene Dynamics. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16. Curran Associates Inc.; 2016. p. 613–621. https://dl.acm.org/doi/abs/10.5555/3157096.3157165
- 45.
Cohen T, Wolf L. Bidirectional one-shot unsupervised domain mapping. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. p. 1784–1792.
- 46.
Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC. Improved Training of Wasserstein GANs. In: Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc.; 2017. https://dl.acm.org/doi/10.5555/3295222.3295327
- 47.
Pumarola A, Agudo A, Martinez AM, Sanfeliu A, Moreno-Noguer F. Ganimation: Anatomically-aware facial animation from a single image. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 818–833.
- 48.
Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. In: ICLR (Poster); 2015.
- 49.
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 770–778.
- 50.
Demir U, Unal G. Patch-based image inpainting with generative adversarial networks. arXiv preprint arXiv:180307422. 2018.
- 51.
Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS; 2017. p. 6629–6640. https://dl.acm.org/doi/abs/10.5555/3295222.3295408
- 52.
Jiang L, Zhang C, Huang M, Liu C, Shi J, Loy CC. Tsit: A simple and versatile framework for image-to-image translation. In: European Conference on Computer Vision. Springer; 2020. p. 206–222.
- 53.
Deng Y, Tang F, Dong W, Ma C, Pan X, Wang L, et al. StyTr2: Image Style Transfer with Transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. p. 11326–11336.
- 54.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is All you Need. In: NIPS; 2017. p. 6000–6010.