
Multi-attention bidirectional contrastive learning method for unpaired image-to-image translation

  • Benchen Yang,

    Roles Conceptualization, Formal analysis, Resources, Software, Supervision, Validation, Writing – review & editing

    Affiliation Software College, Liaoning Technical University, Huludao, China

  • Xuzhao Liu ,

    Roles Conceptualization, Investigation, Methodology, Project administration, Resources, Software, Validation, Writing – original draft, Writing – review & editing

    chivlxz@163.com

    Affiliation Software College, Liaoning Technical University, Huludao, China

  • Yize Li,

    Roles Data curation, Formal analysis, Methodology, Supervision, Writing – original draft

    Affiliation Software College, Liaoning Technical University, Huludao, China

  • Haibo Jin,

    Roles Data curation, Funding acquisition, Validation, Visualization

    Affiliation Software College, Liaoning Technical University, Huludao, China

  • Yetian Qu

    Roles Project administration, Visualization

    Affiliation Software College, Liaoning Technical University, Huludao, China

Abstract

Unpaired image-to-image translation (I2IT) involves establishing an effective mapping between the source and target domains to enable cross-domain image transformation. Previous contrastive learning methods inadequately accounted for the variation in features between the two domains and the interrelatedness of elements within the features, which can lead to model instability and blurred image edge features. To this end, we propose a multi-attention bidirectional contrastive learning method for unpaired I2IT, referred to as MabCUT. We design separate embedding blocks for each domain based on depthwise separable convolutions and train them simultaneously from both the source and target domains. We then use a pixel-level multi-attention extractor to query images from the embedding blocks and select feature blocks carrying crucial information, thus preserving essential features from the source domain. To enhance the feature representation capability of the model, we incorporate depthwise separable convolutions into the generator. Comprehensive evaluations on three datasets demonstrate that our approach enhances the quality of unpaired I2IT while avoiding mode collapse-related image blurring.

Introduction

In recent years, image-to-image translation (I2IT) has been widely used in a number of domains, including style transfer, image restoration, dehazing, super-resolution reconstruction, and more. The aim of I2IT is to achieve precise transformations between the source domain S and the target domain T while preserving the fundamental content of the source domain. A common style-transfer task is converting horse images into zebra images: the objective is to turn a horse into a zebra while preserving the original image's background and structure.

CycleGAN [1], 1D Cycle-GAN [2], and AttenCGAN [3] address unique challenges by combining Generative Adversarial Networks (GAN) with cycle consistency. These approaches aim to constrain and improve the resemblance between generated and real images. However, cycle consistency [1] can lead to distorted generated images. To generate more vibrant images, CUT [4], RRUIT [5], and SRC [6] attempt to relax the constraints of cycle consistency [1]. CUT [4] introduces patch-based contrastive learning: to efficiently exploit the properties of the generator's input and output domains, it maximizes the mutual information between input and output patches. By constraining features from the same position using the encoder, it preserves important content from the source domain, thus improving the quality of image translation.

Nevertheless, CUT [4] lacks the capability to differentiate the effectiveness of negative samples, leading to a high level of randomness in the translation process. Qs-Attn [7] emphasizes the importance of features with significant information in cross-domain translation and considers the relationship between contrastive blocks and surrounding features. Although Qs-Attn [7] achieves feature extraction from both global contextual and local information aspects of images, it fails to account for the details and overall texture of multi-layer features in a single image, resulting in blurred edge features in the generated images. MCL [8] proposes to enhance the discriminator by applying contrastive loss constraint on the output layer features of the discriminator. However, it does not consider the overfitting issue of the generator and embedding blocks.

This paper leverages the advantages of contrastive learning and proposes a fused multi-attention bidirectional contrastive learning method to enhance the details and textures of translated images. The workflow is illustrated in Fig 1. First, MabCUT establishes independent embedding blocks for the source and target domains; in these blocks, the encoding layer adopts a depthwise-separable-convolution architecture. The models are trained simultaneously in both directions, effectively avoiding overfitting and improving model stability. Then, pixel-level global and local attention extractors are used to query the multi-level features of a single image. Entropy [9] provides a way to gauge the significance of features: by calculating the entropy of the attention extractor's rows, important multi-level feature blocks are extracted and the loss is computed through contrastive learning. Simultaneously, depthwise separable convolutions are used in the generator's architecture to further increase model efficiency and improve generalization.

Fig 1. Example of the workflow for the MabCUT method.

Train and generate images from both the ST and TS directions, extract features using independent embedding blocks, and perform contrastive learning by querying key points with the attention extractor.

https://doi.org/10.1371/journal.pone.0301580.g001

This article’s primary contributions include the following aspects:

  • We propose the MabCUT method for I2IT tasks, which effectively establishes independent embedding blocks based on depthwise separable convolutions for two domains, avoids the problem of mode collapse, and improves the quality of translated images.
  • We combine global and local attention extractors to simultaneously query and integrate pixel-level key information from multiple layers of a single image, effectively refining edge features. In addition, we utilize a generator based on depthwise separable convolutions.
  • We evaluate our model on three datasets, Horse2Zebra, Cat2Dog, and Cityscapes, against current state-of-the-art techniques. The experimental results show that our method produces clearer images.

Related works

Deep generative models

Deep generative models are generative models built with deep learning methods. Their primary objective is to learn and generate highly realistic data samples. These models use deep neural networks to model the data's latent distribution and produce new instances from it. By expanding the dimensions of representation and extrapolating information, generative models can significantly enhance their generation capabilities. Several deep generative models are described below:

Variational Autoencoder (VAE) [10] integrates the concepts of autoencoders and variational inference, allowing it to produce new samples and effectively capture the underlying data distribution. There are two parts of a VAE: an encoder and a decoder. The input data is transformed by the encoder to a latent space containing latent variables. Similarly, the decoder utilizes these latent variables to reconstruct the original data space, producing samples that are almost identical to the original data.

Diffusion models (DMs) [11] comprise two processes: forward and reverse diffusion. The image is gradually subjected to noise during the forward diffusion stage, which ends with a total conversion to random noise. Subsequently, reverse diffusion is applied to gradually eliminate the anticipated noise, thereby generating samples that closely resemble the distribution of real data.

Normalizing Flows (NFs) [12] are employed for modeling complex data distributions. The primary goal of NFs is to establish a mapping from a simple prior distribution, to a more intricate posterior distribution, which represents the target data distribution. This mapping is achieved using invertible transformation functions, enabling the generation of samples that closely resemble the target data.

Generative Adversarial Network (GAN) [13] is composed of a generator and a discriminator. The fundamental principle underlying GAN is adversarial learning, where the discriminator is intended to distinguish between genuine and created images, while the generator aims to produce images that are convincingly realistic. Through an adversarial interplay between the generator and discriminator, they continually enhance each other, ultimately leading to a scenario where the discriminator becomes incapable of distinguishing whether an image is sourced from reality or generated. GAN is a powerful generative model that has found extensive use in applications including super-resolution reconstruction [14], image denoising [15], and image style transfer for image synthesis. However, mode collapse and unstable training plague the original GAN. To address these issues, researchers have proposed various improved GAN models, such as CycleGAN [1], PalGAN [16], and ActFormer [17], to improve the training’s stability and the quality of images that are created.

Unpaired image-to-image translation

I2IT is often divided into two categories: paired image-to-image translation [18] and unpaired image-to-image translation [19]. Paired I2IT, often known as supervised translation, involves establishing a direct correspondence between images in domain S and domain T. However, obtaining paired images in practical applications is challenging, so many I2IT tasks are performed in unpaired scenarios. Unpaired I2IT currently encompasses various methods, including cycle consistency [1], shared latent space [20], and the integration of knowledge from different domains to achieve effective translation. At the outset, researchers endeavored to establish a stable mapping relationship for unpaired I2IT, introducing the notion of the cycle consistency constraint [1]. Concurrently, the UNIT [20] framework put forth the hypothesis of a shared latent space, under which a matched pair of images from different domains can be mapped to a common representation in the latent space. Later, researchers endeavored to improve the performance of I2IT by integrating methodologies from diverse domains. For instance, InstaFormer [21], Ittr [22] and UVCGAN [23] introduced the transformer architecture into the realm of I2IT, aiming to enhance the quality of generated images.

Contrastive learning

In unpaired I2IT tasks, the goal of contrastive learning [7, 24, 25] is to find the mapping between corresponding regions in different domains, to connect correlated features, and to impose constraints that maintain vital content throughout the I2IT process. One of the early adopters of contrastive learning in I2IT was CUT [4]. It maximizes the mapping relationship between input and output patches through Noise Contrastive Estimation (NCE). Nevertheless, the conventional contrastive learning objective may cause confusion, as it fails to differentiate the degrees of similarity between negative examples and key points, instead treating them equally by pushing them all apart. To address this issue, Chen et al. [26] established Large-margin Contrastive Learning to distinguish intra-cluster and inter-cluster pairings, with the goal of driving the inter-cluster pairs apart. However, none of these methods address the issue of mode collapse, nor do they select meaningful feature blocks during contrastive learning.

The proposed method

Bidirectional contrastive learning

Due to the inefficiency of one-way contrastive learning in capturing the mapping information between two domains, our method establishes a separate embedding for each domain based on the instance sets S = {s ∈ S} and T = {t ∈ T}. Each embedding block has its own weights, which do not interfere with each other. By using separate encoders and projection layers for the source and target domains, multi-layer features are extracted. Attention matrices are used to select feature blocks that reflect important domain information, maximizing mutual information and achieving high-quality unpaired I2IT.

Contrastive learning primarily aims to establish associations between a query and its relevant positive example, as well as the irrelevant negative examples. In this process, it maps the query, the positive example, and N negative examples to K-dimensional vectors. It is formulated as an (N + 1)-way classification problem, and the following equation is used to determine the cross-entropy loss:

\ell(q, k^{+}, k^{-}) = -\log \frac{\exp(q \cdot k^{+} / \tau)}{\exp(q \cdot k^{+} / \tau) + \sum_{n=1}^{N} \exp(q \cdot k_{n}^{-} / \tau)} (1)

where q represents a crucial point selected from the set A(S_s), while both k^{+} and k^{-} are derived from S_s. Specifically, k^{+} represents a positive example, k_{n}^{-} denotes the n-th negative example, and \tau signifies a temperature hyperparameter set to 0.07.
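As an illustrative sketch (not the authors' released code), the (N + 1)-way cross-entropy loss of Eq (1) can be written in NumPy as follows; the function and argument names are our own:

```python
import numpy as np

def info_nce_loss(q, k_pos, k_negs, tau=0.07):
    """(N + 1)-way cross-entropy contrastive loss of Eq (1).

    q      : (K,)   query vector
    k_pos  : (K,)   positive key
    k_negs : (N, K) negative keys
    """
    # logits: similarity to the positive first, then to each negative
    logits = np.concatenate([[q @ k_pos], k_negs @ q]) / tau
    logits = logits - logits.max()  # numerical stability
    # cross-entropy with the positive as the correct class (index 0)
    return -(logits[0] - np.log(np.exp(logits).sum()))
```

A query aligned with its positive key yields a near-zero loss, while a query aligned with a negative key is penalized heavily.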

The MabCUT architecture comprises two generators, A and B. Generator A translates source images into target images, while generator B performs the reverse operation. The initial portion of each generator is the encoder, and the latter section is the decoder. Simultaneously, discriminators D_S and D_T assess the authenticity of images, using the GAN loss as the evaluation metric. The comprehensive architecture is depicted in Fig 2. The embedding block S incorporates generator A's encoding layer A_enc and a two-layer MLP (H_S), while the embedding block T comprises generator B's encoding layer B_enc and a two-layer MLP (H_T). Using these embedding blocks, multi-layer features are extracted from both the source and target images. The attention matrix then takes these extracted features as inputs, enabling the querying and selection of relevant, meaningful feature blocks.

Fig 2. MabCUT framework.

The framework achieves bidirectional mapping by utilizing the mappings A : S → T and B : T → S, effectively enabling I2IT between unpaired images while avoiding the strict cycle consistency constraint. In this paper, we define generators A and B, where A_enc and B_enc serve as encoders, and A_dec and B_dec serve as decoders. A_enc and H_S form the embedding block that extracts features from various layers of the source domain, while B_enc and H_T form the embedding block for the target domain. The attention matrix selects multiple layers of features through queries and calculates the PatchNCE loss. Additionally, discriminators D_S and D_T compute the GAN loss.

https://doi.org/10.1371/journal.pone.0301580.g002

Attention extractor

In this paper, the attention extractor comprises two main types: global and local. The global attention extractor takes into account the features of all positions in the image, allowing for a comprehensive mapping of the overall image texture. However, it may overlook the correlation between neighboring features. Conversely, the local attention extractor measures the correlation between neighboring features using a fixed-size sliding window, mitigating the limitations of the global attention extractor and improving computational efficiency. Our method employs both global and local attention extractors to combine pixel-level information from multiple layers of a single image. This enables us to capture both the overarching global contextual information and the intricate relationship among neighboring elements.

Global attention extractor.

The workflow of the global attention extractor is illustrated in Fig 3. Initially, we utilize the embedding blocks S and T to derive three-dimensional feature matrices F_s, F_t \in R^{C \times H \times W}. Subsequently, F_s is reshaped into the two-dimensional query and value matrices Q_s, V_s \in R^{HW \times C}. Simultaneously, the matrix Q_s undergoes a transpose operation, resulting in a new two-dimensional key matrix K_s = Q_s^T \in R^{C \times HW}. Then, the matrices Q_s and K_s are multiplied together. By applying the softmax function, each row of the multiplied matrix is activated, leading to the formation of the global attention matrix M_g \in R^{HW \times HW}. Significant features can be identified by measuring the entropy H_g of each row in M_g. The formula is defined as follows:

H_g(i) = -\sum_{j=1}^{HW} M_g(i, j) \log M_g(i, j) (2)

where i and j correspond to the rows and columns of matrix M_g, representing the query and the key. As H_g(i) approaches 0, it indicates that only a few features in row i are similar to query i. Therefore, the smaller H_g(i) becomes, the more salient the features represented by query i are, and the more they are worth retaining. Sort the rows of matrix M_g in ascending order of their entropy values and select the top N rows, resulting in the matrix M_att.
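A minimal NumPy sketch of the global attention extractor with entropy-based row selection (Eq (2)) may clarify the flow; the function names and toy feature shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax with max-subtraction for stability."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def select_salient_queries(F_s, n_rows):
    """Rank queries of a C x H x W feature map by the entropy of their
    global attention rows; low entropy marks salient queries (Eq (2))."""
    C, H, W = F_s.shape
    Qs = F_s.reshape(C, H * W).T            # (HW, C) query matrix
    Ks = Qs.T                               # (C, HW) keys: transposed queries
    Mg = softmax_rows(Qs @ Ks)              # (HW, HW) global attention matrix
    Hg = -(Mg * np.log(Mg + 1e-12)).sum(axis=1)  # per-row entropy
    return np.argsort(Hg)[:n_rows], Mg      # indices of the N salient rows
```

The selected indices would then be used to gather the corresponding rows of the value matrices for contrastive learning.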

Fig 3. The operational principle of global attention extractor.

Image features are extracted from Ss and A(Ss) using embedding blocks S and T, respectively. These features are then mapped to three-dimensional matrices Fs and Ft. Various operations, including reshaping and transposing, are applied to matrix Fs to derive a two-dimensional attention matrix Mg. N rows are selected based on the importance of each row in the matrix. These rows are then matched with the value matrices of the target and source domains to find the relevant important points, negative examples, and positive examples. The contrast loss is subsequently calculated. Among them, the feature blocks inside the blue, red, and green boxes represent key points, positive examples, and negative examples respectively.

https://doi.org/10.1371/journal.pone.0301580.g003

Local attention extractor.

Using a square sliding window of size w \times w with a stride of 1, the local attention extractor calculates the similarity between each query i and its neighboring keys by sliding the window. These calculations yield the value matrix V_s and the key matrix K_s. The reconstructed query matrix Q_s is multiplied by the key matrix K_s and then activated with the softmax function to obtain the local attention matrix M_l. Additionally, the entropy of each row is calculated. The formula is defined as follows:

H_l(i) = -\sum_{j} M_l(i, j) \log M_l(i, j) (3)

where i and j correspond to the query and key. Sort the rows of M_l in ascending order of entropy, select the N smallest rows, and obtain the matrix M_att.
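The sliding-window attention and per-row entropy of Eq (3) can be sketched as follows; the window size `w`, zero padding at the border, and the function name are illustrative assumptions:

```python
import numpy as np

def local_attention_entropy(F, w=3):
    """Entropy of local attention for each spatial query of a C x H x W map:
    each pixel attends only to its w x w neighbourhood (Eq (3) sketch)."""
    C, H, W = F.shape
    pad = w // 2
    Fp = np.pad(F, ((0, 0), (pad, pad), (pad, pad)))  # zero-pad borders
    ent = np.empty(H * W)
    for i in range(H):
        for j in range(W):
            q = F[:, i, j]                                  # query (C,)
            keys = Fp[:, i:i + w, j:j + w].reshape(C, -1)   # w*w local keys
            s = q @ keys
            a = np.exp(s - s.max())
            a /= a.sum()                                    # local attention row
            ent[i * W + j] = -(a * np.log(a + 1e-12)).sum()
    return ent
```

As with the global extractor, the rows with the smallest entropy would be kept as the salient queries.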

The attention matrix Matt is obtained through the use of both global and local attention mechanisms. It is then multiplied, in a routing manner, with the value feature matrices Vs and Vt derived from generated and real images, respectively. This process generates positive examples, negative examples, and key points for the purpose of contrastive learning, which are subsequently employed to calculate the contrastive loss. We extract four layers of features from each embedding block. The first two layers are computed using the global attention extractor, while the last two layers are computed using the local attention extractor.

Depthwise separable convolutions ResNet

The workflow of the ResNet generator is illustrated in module (a) of Fig 4. During the upsampling and downsampling of the feature map, we incorporate depthwise convolution and pointwise convolution. This integration decreases the number of parameters used in model training while also enhancing the handling of local features in the data. Module (b) provides a comprehensive explanation of the operational principle of depthwise convolution. It is defined as follows:

F_2(c) = F_1(c) * K_c, \quad c = 1, \dots, C (4)

where F_1 \in R^{C \times H \times W} signifies the input image features. Each channel's features F_1(c) undergo processing by their own convolutional kernel K_c \in R^{k \times k}, ultimately yielding F_2 \in R^{C \times H \times W}. Subsequently, pointwise convolution is executed, wherein the working principle is exemplified in module (c). This process is based on a 1 \times 1 convolution kernel, and it is defined as follows:

F_3 = P * F_2 (5)

where F_2 represents the output of the depthwise convolution, while P \in R^{1 \times 1 \times C \times C'} represents a 1 \times 1 convolution kernel. The final output is F_3 \in R^{C' \times H \times W}.
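A naive, loop-based NumPy sketch of Eqs (4) and (5) (with 'valid' padding and stride 1; all names are ours, not the paper's code) illustrates the two stages:

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """Depthwise then pointwise convolution, 'valid' padding, stride 1.

    x          : (C, H, W)     input features
    dw_kernels : (C, k, k)     one spatial kernel per input channel (Eq (4))
    pw_kernels : (C_out, C)    1x1 kernels mixing channels (Eq (5))
    """
    C, H, W = x.shape
    _, k, _ = dw_kernels.shape
    Ho, Wo = H - k + 1, W - k + 1
    # depthwise stage: each channel convolved with its own k x k kernel
    f2 = np.empty((C, Ho, Wo))
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                f2[c, i, j] = (x[c, i:i + k, j:j + k] * dw_kernels[c]).sum()
    # pointwise stage: a 1x1 convolution mixes the C channels into C_out
    return np.einsum('oc,chw->ohw', pw_kernels, f2)
```

In a real generator this would of course run as two fused convolution layers on the GPU; the sketch only shows how the spatial and channel mixing are separated.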

Fig 4.

(a) ResNet generator structure. (b) Depthwise convolution. (c) Pointwise convolution.

https://doi.org/10.1371/journal.pone.0301580.g004

The kernel of a conventional convolution has size k \times k \times C \times C', requiring a parameter count of P_s = k^2 \times C \times C'. In contrast, the parameter count for depthwise separable convolution is P_d = C \times C' + k \times k \times C = C \times (k^2 + C'). The ratio between the two is P_d / P_s = 1/k^2 + 1/C'. As the network's depth grows, both the kernel dimension k and the number of output channels C' greatly surpass 1; consequently, the enhancement in model efficiency becomes increasingly evident.
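The parameter counts P_s and P_d can be checked with a few lines of Python (the helper name is ours):

```python
def dsc_param_counts(k, c_in, c_out):
    """Parameter counts of standard vs. depthwise separable convolution."""
    standard = k * k * c_in * c_out            # Ps = k^2 * C * C'
    separable = c_in * c_out + k * k * c_in    # Pd = C*C' + k^2*C
    return standard, separable, separable / standard
```

For k = 3 and C = C' = 256, the separable variant needs 67,840 parameters instead of 589,824, a ratio of 265/2304 ≈ 0.115 = 1/9 + 1/256.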

Loss function

Adversarial loss.

The discriminator is continuously updated through backpropagation, while the generator produces increasingly realistic images, keeping the differences between translated and original images discernible. The discriminator D_T imposes constraints on the translation of images in the direction S → T. The loss function is described as follows:

L_GAN(A, D_T, S, T) = E_{t \sim T}[\log D_T(t)] + E_{s \sim S}[\log(1 - D_T(A(s)))] (6)

where the discriminator D_T is designed to differentiate between authentic images, denoted as t, and translated images, denoted as A(s). Conversely, the generator A is tasked with producing images that are increasingly indistinguishable from real images, thereby enhancing their realism. The loss for the discriminator D_S is defined by the following formula:

L_GAN(B, D_S, T, S) = E_{s \sim S}[\log D_S(s)] + E_{t \sim T}[\log(1 - D_S(B(t)))] (7)
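As a sketch of the value in Eq (6), the adversarial objective can be evaluated on batches of discriminator scores; the scores are assumed to lie in (0, 1), e.g. after a sigmoid, and the names are ours:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Value of Eq (6): d_real holds D_T's scores on real targets t,
    d_fake holds D_T's scores on translated images A(s)."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))
```

The discriminator ascends this value (scoring real images near 1 and fakes near 0), while the generator descends it by making its fakes score higher.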

PatchNCE loss.

In order to map the image into a feature set {z_l}_{l=1}^{L} = {H_S^l(A_enc^l(s))}_{l=1}^{L}, we send the L layers that we chose from encoder A_enc to H_S. The target-domain feature set is correspondingly represented as {\hat{z}_l}_{l=1}^{L} = {H_T^l(B_enc^l(A(s)))}_{l=1}^{L}. The multi-layer features extracted from the two embedding blocks are input into the attention extractor, facilitating the selection of suitable contrast blocks for each layer of features. By leveraging this approach, we represent the spatial positions of each chosen layer as p \in {1, \dots, P_l}, where P_l denotes the number of spatial locations in layer l. Every query \hat{z}_l^p is associated with a positive example z_l^p, whereas all other positions are considered negatives, denoted z_l^{P \setminus p}. Furthermore, the notation C_l is used to indicate the count of channels in every layer. The loss for the mapping A : S → T is defined as follows:

L_PatchNCE(A, H_S, S) = E_{s \sim S} \sum_{l=1}^{L} \sum_{p=1}^{P_l} \ell(\hat{z}_l^p, z_l^p, z_l^{P \setminus p}) (8)

The loss for the mapping B : T → S is defined analogously:

L_PatchNCE(B, H_T, T) = E_{t \sim T} \sum_{l=1}^{L} \sum_{p=1}^{P_l} \ell(\hat{z}_l^p, z_l^p, z_l^{P \setminus p}) (9)
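A compact NumPy sketch of the per-layer PatchNCE term in Eqs (8) and (9): the positive for each query sits at the same spatial index, and every other position serves as a negative. The vectorized formulation is our own, not the authors' code:

```python
import numpy as np

def patch_nce(z_hat, z, tau=0.07):
    """PatchNCE loss for one feature layer.

    z_hat, z : (P, C) features of the translated and source image at the
               same P spatial positions; z[p] is the positive for z_hat[p],
               all other rows of z are negatives.
    """
    logits = (z_hat @ z.T) / tau                     # (P, P) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # positives on the diagonal
```

Summing this term over the L selected layers (and averaging over the batch) gives the expectation in Eq (8); swapping the roles of the two domains gives Eq (9).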

Identity loss.

To retain the essential texture and features of the original image throughout the I2IT process, and to mitigate the occurrence of substantial distortion in the generated images, we introduce the concept of Identity Loss as a constraint. This constraint aims to enhance the authenticity of I2IT. The definition is provided below:

L_idt = E_{t \sim T}[\lVert A(t) - t \rVert_1] + E_{s \sim S}[\lVert B(s) - s \rVert_1] (10)

General objective

Our primary goal is to accomplish unpaired I2IT with specific features while maintaining the integrity of essential texture and features. To achieve this, the comprehensive loss function of the framework incorporates the Adversarial Loss, PatchNCE Loss, and Identity Loss:

L = \lambda_GAN (L_GAN(A, D_T, S, T) + L_GAN(B, D_S, T, S)) + \lambda_NCE (L_PatchNCE(A, H_S, S) + L_PatchNCE(B, H_T, T)) + \lambda_IDT L_idt (11)

In this paper, we set the hyperparameters λ_GAN = 1, λ_NCE = 2, and λ_IDT = 1, and compute the loss with the corresponding weights.
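Combining the already-computed loss terms with the stated weights is then a one-liner (the helper name and argument names are ours):

```python
def total_loss(l_gan_a, l_gan_b, l_nce_a, l_nce_b, l_idt,
               lam_gan=1.0, lam_nce=2.0, lam_idt=1.0):
    """Weighted objective of Eq (11) with the paper's hyperparameters
    (lambda_GAN = 1, lambda_NCE = 2, lambda_IDT = 1) as defaults."""
    return (lam_gan * (l_gan_a + l_gan_b)
            + lam_nce * (l_nce_a + l_nce_b)
            + lam_idt * l_idt)
```

With all five terms equal to 1, the objective evaluates to 1·2 + 2·2 + 1 = 7, making the relative weighting of the PatchNCE terms explicit.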

Experiment

Datasets

The experimental results were assessed on three separate datasets. Within the Horse2Zebra [1] dataset, the images featuring horses serve as the representatives of the source domain set, whereas zebras represent the target domain set. The Cat2Dog [27] dataset consists of 1000 test images and 9892 training images, where the source domain set is composed of cat images and the target domain set is composed of dog images. The Cityscapes [28] dataset is sourced from urban streets, with images from two domains representing German urban street scenes and semantic segmentation labels. The experiment is built upon the three aforementioned datasets. Table 1 displays the quantity of images contained in each dataset. We consider regular horse, semantic segmentation, and cat images as the source domain set, and zebra, street scenes, and dog images as the target domain set.

Training details

Our model and all baseline methods were trained on an RTX 3080Ti GPU. The Adam optimizer is used for training, with the initial learning rate set to 0.0001 and parameters β1 = 0.5 and β2 = 0.999. The model is trained for a total of 400 epochs; for the Cat2Dog training set, however, the epoch count is limited to 200. We utilize a PatchGAN [29] discriminator and a depthwise separable ResNet [30] generator, with a batch size of 1. The image dimensions for testing and training are 256 × 256; for particular images, we employ center cropping to obtain a 256 × 256 pixel size. Similarly, we set the number of rows selected by the attention extractor to 256 in order to best reflect the features of the source image.

Evaluation

Metrics.

Kernel Inception Distance (KID) [31] and Fréchet Inception Distance (FID) [32] are frequently utilized to evaluate the quality of generated images in I2IT tasks. FID [32] assesses the similarity between generated and real images by calculating the distance between features derived from a pretrained image recognition network; the smaller the FID [32] value, the more alike the generated and real images are, indicating superior image quality. KID [31] computes the squared maximum mean discrepancy between the feature vectors of generated and real images obtained from the Inception network; a smaller KID [31] value indicates greater visual resemblance between the generated and real images. Specifically, for the Cityscapes dataset we employed the pre-trained DRN [33] model to perform segmentation. The correlation between the generated maps and the ground-truth maps was assessed, and evaluation metrics including mean class accuracy (classAcc), pixel-wise accuracy (pixAcc), and mean average precision (mAP) were computed.
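For intuition, KID is the unbiased squared MMD with a polynomial kernel computed on Inception features. A NumPy sketch on pre-extracted feature arrays follows; the cubic kernel and function name follow common practice for KID, not this paper's exact evaluation code:

```python
import numpy as np

def kid(x, y, degree=3):
    """Unbiased squared MMD with kernel k(a, b) = (a.b/d + 1)^degree,
    the statistic behind KID, on feature arrays x, y of shape (n, d)."""
    d = x.shape[1]
    k = lambda a, b: (a @ b.T / d + 1.0) ** degree
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    n, m = len(x), len(y)
    # diagonal terms are excluded from the within-set averages (unbiased)
    return ((kxx.sum() - np.trace(kxx)) / (n * (n - 1))
            + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            - 2.0 * kxy.mean())
```

Features drawn from the same distribution yield a value near zero, while a clear distribution shift yields a large positive value.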

Baselines.

To ascertain whether the proposed approach is effective in unpaired I2IT, we conducted a comparative analysis of the proposed model using a number of state-of-the-art unsupervised techniques, namely UNSB [34], CycleGAN [1], CUT [4], DCLGAN [35], MCL [8], Qs-Attn [7] and ASGIT [36].

UNSB [34] avoids the curse of dimensionality in unpaired I2IT tasks through regularization and adversarial learning, achieving the transformation between two distributions. Different NFE values in the range of 1 to 5 have distinct impacts on the model's results; we choose the result with the highest image quality for comparison with our model.

CycleGAN [1] employs the concept of cycle consistency to enforce correspondence between synthesized and actual images. This entails incorporating adversarial loss, cycle consistency loss, and identity loss into the overall loss function.

CUT [4] includes two methods: CUT and FastCUT. CUT uses traditional contrastive learning with the parameter settings λX = 1 and λY = 1. On the other hand, FastCUT is a faster method that improves computational efficiency by using larger parameters λX = 10 in order to compensate for the absence of identity loss λY = 0.

DCLGAN [35] proposes two methods, DCLGAN and its variant SimDCL, which splits the target domain’s training from the source domain’s, therefore stabilizing the training process. SimDCL effectively avoids the problem of mode collapse. Similar to our method, DCLGAN has a starting learning rate of η = 0.0001 and is trained for 400 epochs. Specifically, SimDCL sets the learning rate to η = 0.0002 and is trained for 200 epochs.

MCL [8] utilizes the discriminator’s output layer to the fullest extent possible, using its feature information to compute the contrastive loss. Specifically, the initial learning rate is η = 0.0002.

Qs-Attn [7] proposes a query-selection attention module that constrains the preservation of important features during the image translation process, including both the global and local modes.

ASGIT [36] introduces an attention mechanism in the discriminator, generating attention maps for predicted images and transmitting them to the generator. This approach enhances the optimization of the generator.

Results

Unpaired I2IT tasks

Through conducting quantitative experiments, we compare the results of our model on the Horse2Zebra, Cat2Dog, and Cityscapes datasets with those of several baseline models. The evaluation of different models is based on the FID [32] and KID [31] metrics. To gain a more intuitive understanding of the model’s performance on different datasets, we conduct separate comparisons of the three result sets. This allows us to analyze the strengths and weaknesses of our model in comparison to other baseline models.

The results depicted in Fig 5 indicate that other models display distortion when translating from horses to zebras, causing the zebra’s details to become blurred and the textures to appear unnatural. In contrast, our model-generated images not only retain the characteristics of the horses present in the source images but also exhibit a broader coverage of zebra stripes and smoother textures. The quantitative experimental results shown in Table 2 show that our model performs best in both metrics, boasting a significant 2.7 points lead over the MCL [8] model, which holds second place in terms of FID [32].

Table 2. FID and KID×100 scores on the Horse2Zebra dataset, with the best performance indicated in bold.

https://doi.org/10.1371/journal.pone.0301580.t002

Fig 6 provides an objective assessment of the performance exhibited by various models in the task of translating Cat2Dog images. Through qualitative comparison, it has been observed that, while Qs-Attn [7], MCL [8], DCLGAN [35], and CUT [4] are capable of generating more vivid facial features compared to other baseline methods, they still exhibit deficiencies in terms of visual effects when compared to our model. Specifically, the dog images generated by these models lack smoothness. The two scores presented in Table 3 serve to reinforce our analysis of the performance displayed by various advanced models. Our approach achieved scores of 59.8 and 2.2698 for FID [32] and KID [31], respectively, outperforming all baseline methods.

From Fig 7, it is not difficult to observe that our model can produce realistic content based on semantic labels, including cars, pedestrians, houses, and trees, among others. In contrast, models such as DCLGAN [35] and CUT [4] produce indistinct content that fails to accurately depict the information in the target domain. Our model exhibits significant advantages in generating street scenes, as demonstrated in Table 4. Our model outperforms other methods in terms of classAcc, mAP, KID [31], and FID [32] metrics, but slightly underperforms compared to Qs-Attn [7] in the pixAcc metric. However, upon observing Fig 7, it becomes evident that the image realism of our model surpasses that of Qs-Attn [7].

Table 4. FID and KID×100 scores on the Cityscapes dataset.

https://doi.org/10.1371/journal.pone.0301580.t004

In this paper, we conducted quantitative comparative experiments and presented qualitative results to compare our model with nine advanced baseline models, considering factors such as texture, visual effects, clarity, and detail, and analyzed its advantages over the other models from multiple perspectives.

Ablation study

Our method demonstrates superior translation performance when compared to all baseline models. To delve deeper into the influence of each contribution on the efficacy of our model, we conducted ablation experiments. These experiments were based on the three datasets utilized in the aforementioned experiments, enabling comprehensive ablation comparisons.

Our model incorporates a bidirectional contrastive learning mechanism and integrates a multi-attention extractor into the feature extraction process. Consequently, we establish the following comparisons: (A) using unidirectional contrastive learning with an identical embedding block for both domains; (B) eliminating the attention extractor and selecting contrast blocks at random; (C) computing all feature layers with the global attention extractor; (D) computing all feature layers with the local attention extractor; (E) removing depthwise separable convolutions from the generator and embedding blocks.

  (A) To investigate the effect of unidirectional contrastive learning, we employ a unilateral attention extractor to extract feature blocks. The results depicted in Fig 8 indicate a substantial degradation in performance on the Horse2Zebra dataset. Additionally, the quantitative findings in Table 5 show that MabCUT improves on the unidirectional contrastive learning method by 28.6%, 6%, and 12.4% across the three datasets, which confirms the effectiveness of establishing independent embedding blocks for different domains.
  (B) To examine the effectiveness of actively selecting meaningful keypoints, we adopted the random contrast-block selection strategy of the CUT [4] framework in our experimental setup. The results in Table 5 show higher (i.e., worse) scores across all three datasets; specifically, Cat2Dog and Cityscapes score 74.6 and 57.1, respectively. Among all the ablation variants, this approach performs worst, indicating that MabCUT's attention matrix method learns a superior contrastive loss and achieves stable training.
  (C) We use the global attention extractor to query and select key points from all feature layers. The global attention extractor focuses more on the overall features of the image. Fig 8 shows that the translated visual quality is relatively low, and Table 5 shows that the scores are worse than MabCUT's across the datasets, indicating that the global attention extractor pays too little attention to neighboring features.
  (D) We query and select key points from all feature layers using the local attention extractor. This variant scores 46.9 on the Cityscapes dataset, close to the score of the MabCUT model, but significantly underperforms MabCUT on the other two datasets, highlighting the efficacy of combining local and global attention: the MabCUT model captures both the correlations between adjacent features and the overall texture of the image.
  (E) We removed the depthwise separable convolutions from the generator and embedding blocks, which leads to inferior results compared to MabCUT. Table 5 shows a noticeable increase in FID on the Horse2Zebra and Cat2Dog datasets, by 9.4 and 7.6 respectively, indicating a deterioration in overall image quality. Moreover, the parameter count rose to 29.274M, surpassing MabCUT's 28.508M. Depthwise separable convolutions therefore not only reduce the parameter count and improve computational efficiency but also effectively capture data features, improving the model's generalization ability.
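The parameter savings in (E) follow directly from the structure of a depthwise separable convolution: a per-channel k×k depthwise filter followed by a 1×1 pointwise convolution, instead of one dense k×k filter per input/output channel pair. The arithmetic below uses a hypothetical 3×3 layer with 256 input and output channels, not the actual MabCUT layer configuration.

```python
def standard_conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel)
    followed by a 1x1 pointwise conv (bias omitted)."""
    return c_in * k * k + c_in * c_out

# Hypothetical 3x3 layer with 256 input and 256 output channels.
std = standard_conv_params(256, 256, 3)        # 589,824 parameters
dws = depthwise_separable_params(256, 256, 3)  # 67,840 parameters
print(std, dws, round(std / dws, 1))           # roughly 8.7x fewer
```

For large channel counts the ratio approaches k², which is why swapping these layers into the generator trims parameters with little loss of representational capacity.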
Fig 8. Qualitative ablation experiment.

Here, MabCUT denotes the results of our model, and (A)–(E) show the I2IT results of each ablation setting in turn.

https://doi.org/10.1371/journal.pone.0301580.g008

Table 5. The quantitative comparison results from ablation experiments.

The table demonstrates the effect of each of our contributions on I2IT.

https://doi.org/10.1371/journal.pone.0301580.t005

User study

To substantiate our model's efficacy in unpaired I2IT, we recruited 30 volunteers and compared MabCUT with the other baseline models on three datasets in terms of visual perception. From each dataset, we chose 20 images at random. Volunteers were asked to evaluate the quality of image translation from a visual perception perspective and rank the translation results of the different models. As shown in Fig 9, MabCUT exhibited a significant quality advantage over the baseline models, ranking first in 58% of user evaluations.

Fig 9. User study results.

In this paper, we consolidate and compute the proportional rankings provided by users across the various models, and analyze the quality of these models using detailed graphical representations. The horizontal axis shows the percentage of rankings, while the vertical axis lists the models.

https://doi.org/10.1371/journal.pone.0301580.g009

Conclusions and discussion

In this paper, we propose a bidirectional contrastive learning method based on a multi-attention extractor. The method extracts features from the source and target domains separately using independent embedding blocks, and integrates global and local attention extractors to identify important feature blocks for contrastive learning. At the same time, while maintaining translation quality, we incorporate depthwise separable convolutions into the generator to reduce training costs. Comprehensive comparison experiments, ablation experiments, and user studies validate the effectiveness of the MabCUT method for unpaired I2IT. However, our model alters the background color of the image during translation, which affects the results; background preservation during the I2IT process should be the main focus of future study.

References

  1. Zhu JY, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision; 2017. p. 2223–2232.
  2. Wang YQ, Wang Q, Lu WK, Ge Q, Yan XF. Seismic impedance inversion based on cycle-consistent generative adversarial network. Petroleum Science. 2022;19(1):147–161.
  3. Wen L, Wang Y, Li X. A new cycle-consistent adversarial networks with attention mechanism for surface defect classification with small samples. IEEE Transactions on Industrial Informatics. 2022;18(12):8988–8998.
  4. Park T, Efros AA, Zhang R, Zhu JY. Contrastive learning for unpaired image-to-image translation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16. Springer; 2020. p. 319–345.
  5. Zhang X, Fan C, Xiao Z, Zhao L, Chen H, Chang X. Random reconstructed unpaired image-to-image translation. IEEE Transactions on Industrial Informatics. 2022;19(3):3144–3154.
  6. Jung C, Kwon G, Ye JC. Exploring patch-wise semantic relation for contrastive learning in image-to-image translation tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2022. p. 18260–18269.
  7. Hu X, Zhou X, Huang Q, Shi Z, Sun L, Li Q. Qs-attn: Query-selected attention for contrastive learning in i2i translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. p. 18291–18300.
  8. Gou Y, Li M, Song Y, He Y, Wang L. Multi-feature contrastive learning for unpaired image-to-image translation. Complex & Intelligent Systems. 2023;9(4):4111–4122.
  9. Jianbiao M, Xinzui W, Zhaobo L, Juan L, Zhongwei Z, Hui F. EEG signal classification of tinnitus based on SVM and sample entropy. Computer Methods in Biomechanics and Biomedical Engineering. 2023;26(5):580–594. pmid:35850561
  10. Hu Q, Zhang G, Qin Z, Cai Y, Yu G, Li GY. Robust semantic communications with masked VQ-VAE enabled codebook. IEEE Transactions on Wireless Communications. 2023.
  11. Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2022. p. 10684–10695.
  12. Gudovskiy D, Ishizaka S, Kozuka K. Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2022. p. 98–107.
  13. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. Advances in neural information processing systems. 2014;27.
  14. Liang J, Zeng H, Zhang L. Details or artifacts: A locally discriminative learning approach to realistic image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. p. 5657–5666.
  15. Zhu ML, Zhao LL, Xiao L. Image denoising based on GAN with optimization algorithm. Electronics. 2022;11(15):2445.
  16. Wang Y, Xia M, Qi L, Shao J, Qiao Y. PalGAN: Image colorization with palette generative adversarial networks. In: European Conference on Computer Vision. Springer; 2022. p. 271–288.
  17. Xu L, Song Z, Wang D, Su J, Fang Z, Ding C, et al. ActFormer: A GAN-based Transformer towards General Action-Conditioned 3D Human Motion Generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023. p. 2228–2238.
  18. Wang TC, Liu MY, Zhu JY, Tao A, Kautz J, Catanzaro B. High-resolution image synthesis and semantic manipulation with conditional gans. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 8798–8807.
  19. Zhao M, Bao F, Li C, Zhu J. Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Advances in Neural Information Processing Systems. 2022;35:3609–3623.
  20. Liu MY, Breuel T, Kautz J. Unsupervised image-to-image translation networks. Advances in neural information processing systems. 2017;30.
  21. Kim S, Baek J, Park J, Kim G, Kim S. InstaFormer: Instance-aware image-to-image translation with transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. p. 18321–18331.
  22. Zheng W, Li Q, Zhang G, Wan P, Wang Z. Ittr: Unpaired image-to-image translation with transformers. arXiv preprint arXiv:2203.16015. 2022.
  23. Torbunov D, Huang Y, Yu H, Huang J, Yoo S, Lin M, et al. Uvcgan: Unet vision transformer cycle-consistent gan for unpaired image-to-image translation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2023. p. 702–712.
  24. Zhang Y, Li M, Cai W, Gou Y, Shi S. SARCUT: Contrastive learning for optical-SAR image translation with self-attention and relativistic discrimination. In: International Workshop on Frontiers of Graphics and Image Processing (FGIP 2022). vol. 12644. SPIE; 2023. p. 88–99.
  25. Zhuang Y, Liu H, Song E, Xu X, Liao Y, Ye G, et al. A 3D Anatomy-Guided Self-Training Segmentation Framework for Unpaired Cross-Modality Medical Image Segmentation. IEEE Transactions on Radiation and Plasma Medical Sciences. 2023.
  26. Chen S, Niu G, Gong C, Li J, Yang J, Sugiyama M. Large-margin contrastive learning with distance polarization regularizer. In: International Conference on Machine Learning. PMLR; 2021. p. 1673–1683.
  27. Choi Y, Uh Y, Yoo J, Ha JW. Stargan v2: Diverse image synthesis for multiple domains. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020. p. 8188–8197.
  28. Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, et al. The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 3213–3223.
  29. Isola P, Zhu JY, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1125–1134.
  30. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 770–778.
  31. Bińkowski M, Sutherland DJ, Arbel M, Gretton A. Demystifying mmd gans. arXiv preprint arXiv:1801.01401. 2018.
  32. Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems. 2017;30.
  33. Yu F, Koltun V, Funkhouser T. Dilated residual networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 472–480.
  34. Kim B, Kwon G, Kim K, Ye JC. Unpaired Image-to-Image Translation via Neural Schrödinger Bridge. arXiv preprint arXiv:2305.15086. 2023.
  35. Han J, Shoeiby M, Petersson L, Armin MA. Dual contrastive learning for unsupervised image-to-image translation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 746–755.
  36. Lin Y, Wang Y, Li Y, Gao Y, Wang Z, Khan L. Attention-based spatial guidance for image-to-image translation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2021. p. 816–825.