
RISNet: A variable multi-modal image feature fusion adversarial neural network for generating specific dMRI images

  • Guolan Wang,

    Roles Writing – original draft

    Affiliation College of Computer and Information Engineering, Shanxi Technology and Business University, Taiyuan, China

  • Xiaohong Xue,

    Roles Data curation

    Affiliation College of Computer and Information Engineering, Shanxi Technology and Business University, Taiyuan, China

  • Yifei Chen,

    Roles Methodology

    Affiliation College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Taiyuan, China

  • Hao Liu,

    Roles Visualization

    Affiliation College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Taiyuan, China

  • Haifang Li ,

    Roles Supervision

    lihaifang@tyut.edu.cn (H.L.); wangqianshan0203@gmail.com (Q.W.)

    Affiliations College of Computer and Information Engineering, Shanxi Technology and Business University, Taiyuan, China, College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Taiyuan, China

  • Qianshan Wang

    Roles Writing – review & editing

    lihaifang@tyut.edu.cn (H.L.); wangqianshan0203@gmail.com (Q.W.)

    Affiliation College of software, Taiyuan University of Technology, Taiyuan, China

Abstract

The b-value in diffusion magnetic resonance imaging (dMRI) reflects the degree to which water molecules in tissue are affected by the magnetic field gradient pulses, and different b-values affect not only the image contrast but also the accuracy of subsequent computations. The imbalance between the lower and higher b-value image categories in macaque dMRI brain imaging datasets dramatically affects the accuracy of computational neuroscience. Medical image conversion methods based on generative adversarial networks can generate images at different b-values. However, the macaque brain dataset suffers from multi-center and small-sample problems, which restrict the training effect of general models. To increase macaques’ lower b-value dMRI data, we propose a variable multi-modal image feature fusion adversarial neural network called RISNet. The network uses the proposed rapid insertion structure (RIS) to feed features from different modalities into a general residual decoding structure, enhancing the model’s generalization ability. The RIS combines the advantages of multi-modal data and allows the network to be quickly rewritten to extract and fuse the features of multi-modal data. We used a T1 image and a higher b-value image of the brain as model inputs to generate high-quality, lower b-value images. Experimental results show that our method improves the PSNR index by 1.8211 on average and the SSIM index by 0.0111 compared with other methods. In addition, in terms of qualitative observation and DTI estimation, our method also shows good visual quality and strong generalization ability. These advantages make our method an effective means to solve the problem of dMRI brain image conversion in macaques and provide strong support for future neuroscience research.

Introduction

As non-human primates, macaques have become a key model in neuroscience research due to their genetic and anatomical similarities with humans. In particular, in terms of exploring the brain’s structure, function, and disease mechanisms, the results of macaque research are of great value for understanding the working mechanism of the human brain. Therefore, accurate and efficient analysis of brain imaging data of macaques is of great significance to promote the development of neuroscience [1,2].

Diffusion magnetic resonance imaging (dMRI), a non-invasive medical imaging technique, can delineate the fine structure of the white matter fiber tracts of the brain, providing an unprecedented perspective for neuroscience research [3,4]. The b-value is a key parameter during dMRI acquisition that quantifies the intensity of the diffusion-sensitive gradient field applied to the tissue [5]. The b-value determines the sensitivity of the image to the diffusion of water molecules. A higher b-value means greater diffusion sensitivity and can more effectively detect the influence of the internal tissue structure on the diffusion path of water molecules, which helps reveal microstructural features such as nerve fiber bundles [6]. However, as the b-value increases, the signal-to-noise ratio (SNR) decreases, because the diffusion-weighted signal is attenuated more strongly relative to the non-diffusion-weighted signal.
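For context, the conventional mono-exponential diffusion model (a standard textbook relation, not a formula stated in this paper) makes the role of the b-value concrete: with $S_0$ the non-diffusion-weighted signal (b = 0) and $D$ the apparent diffusion coefficient of water in the voxel, the measured signal attenuates as

$$S(b) = S_{0}\, e^{-bD},$$

so a higher b-value amplifies the contrast between tissues with different diffusivities while lowering the absolute signal, which is why the SNR drops.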

Different b-values need to be set during dMRI acquisition. Typically, one lower b-value image corresponds to multiple higher b-value images, and this ratio can ideally reach 5-10 times [7,8]. The acquisition and processing of images at different b-values is essential for subsequent fiber tract tracing and structural analysis. However, due to differences in early acquisition protocols and significant differences in scanner parameters across centers, various factors affect the quality of publicly available macaque dMRI data. The number of images with low b-values is often insufficient, or such images are missing entirely, which poses a challenge for further neurocomputing [9].

In recent years, generative adversarial networks (GAN) have made remarkable progress in the field of image generation and conversion [10,11], which provides new ideas for solving this problem [12–14]. GAN can learn the features and distribution of low-b-value images from existing modal images and generate high-quality low-b-value images. This method can not only solve the problem of the insufficient number of low-b-value images in the dataset but also improve the quality and information integrity of dMRI images and provide a more reliable data basis for subsequent fiber bundle tracing and structural analysis.

At present, GAN has been widely used in the field of medical image conversion. The Synb0-DisCo method applied the pix2pix method to correct distorted b0 images [15–17]. pGAN uses a residual network as the generator architecture and introduces a VGG loss [18,19]. EA-GAN starts from edge information to improve the texture details of the generated images [20]. MedGAN uses a cascaded U-Net as its generator for a variety of medical image conversion tasks [21]. SwinUnet converts magnetic resonance brain images based on the Transformer and has been applied to medical image conversion [22,23]. ResViT combines attention mechanisms and convolutional networks to further improve image quality [24,25].

Although GAN has a wide range of applications in image conversion, most medical brain image conversion methods are based on human brain image data. Compared with human brain images, macaque images from different centers differ considerably, the sample size is small, and the brain structure of the macaque differs from that of the human brain [26]. Multi-center joint training can increase the data sample size, but applying methods designed for human brain image transformation to multi-center joint training may suffer from insufficient generalization ability. In addition, existing conversion methods are often limited to uni-modal conversion; that is, only a single type of medical imaging data is used for conversion. Multi-modal medical imaging data usually provides more comprehensive and accurate brain information in neuroscience. Therefore, combining multi-modal data with multi-center data training can expand the training samples and enhance the generative ability of the neural network model [27]. However, a key challenge is how to ensure that the features extracted from different modalities can be fused without producing excessive bias toward a specific modality [28].

To address the above problems, this paper proposes a rapid insertion structure (RIS) that combines the advantages of multi-modal data on top of a residual decoder within a GAN framework. Our residual decoder has powerful feature extraction and expression capabilities and realizes accurate conversion from high-b-value images and T1 images to low-b-value images. Through this method, we can generate high-quality images to make up for the insufficient number of low-b-value images in dMRI data and make full use of multi-modal medical imaging information to improve the accuracy and reliability of macaque brain image analysis. The key contributions of this study are outlined as follows:

  • Innovative Architecture for Multi-Modal Integration: We propose a rapid insertion structure that requires only simple modifications to the model to extract and integrate data from additional modalities. The structure can be rewritten effortlessly to combine features from various data types, which enables multi-center joint training to expand the macaque dMRI dataset in our experiments.
  • Residual Decoder-Based Multi-Modal Fusion: A novel residual decoder framework is proposed for multi-modal dMRI image conversion. The residual decoder carries out multi-modal fusion and decoder feature fusion, which avoids the information loss caused by redundancy, makes full use of multi-modal information, and generates more accurate diffusion MRI images.

Materials and methods

Dataset

The PRIMatE Data Exchange (PRIME-DE) data center has made a rich dataset of primate brain images public [29]. We selected four datasets containing dMRI images from PRIME-DE, provided by Aix-Marseille Universite (AMU), Mount Sinai School of Medicine-Siemens, University of California, Davis (UCDavis), and the University of Wisconsin-Madison (UWM). These four sites provide dMRI, fMRI, and T1-weighted images, offering a multi-modal dataset resource for our research. Detailed acquisition parameters for the dMRI images can be found in Table 1, and those for the T1-weighted images are listed in Table 2.

Preprocessing

This study preprocesses dMRI images and T1-weighted images separately. The specific steps are as follows:

  • Perform head motion and eddy current correction on dMRI images and T1 image data using the eddy tool in FSL software [30,31].
  • Remove non-brain tissue from T1-weighted images and dMRI images using the brain tissue segmentation tool HC-Net [32].
  • Extract paired high-b-value images and low-b-value (b = 0) images from the dMRI data using FSL tools. The combination of T1-weighted images and high-b-value images serves as the model input, and the low-b-value images serve as the ground truth.
  • Normalize all pixel values of the images to the range of 0 to 1 using the maximum-minimum normalization method.
  • Resample all images to 256 × 256 × 256 and slice all images into 2D slices in the coronal plane.

Finally, the 2D slices of the T1-weighted images and high-b-value images serve as input for the multi-modal generator, while the 2D slices of the low-b-value images serve as input for the discriminator. We use multi-center data for joint model training to enhance model generalization and increase data diversity. Within each site, we divide the data into training and testing sets. The specific partition results are shown in Table 3.
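The normalization, resampling, and coronal slicing steps can be sketched as follows. This is a minimal illustration using nibabel, NumPy, and SciPy with a hypothetical file name; it is not the authors' released pipeline, and details such as interpolation order and axis conventions are assumptions.

```python
# Minimal sketch of the max-min normalization, 256^3 resampling, and
# coronal 2D slicing described above (illustrative only).
import nibabel as nib
import numpy as np
from scipy.ndimage import zoom

def to_coronal_slices(nifti_path, target_shape=(256, 256, 256)):
    vol = nib.load(nifti_path).get_fdata().astype(np.float32)

    # Max-min normalization of all pixel values to [0, 1].
    vol = (vol - vol.min()) / (vol.max() - vol.min() + 1e-8)

    # Resample to 256 x 256 x 256 with linear interpolation.
    factors = [t / s for t, s in zip(target_shape, vol.shape)]
    vol = zoom(vol, factors, order=1)

    # Slice along the coronal axis (assumed to be axis 1 here).
    return [vol[:, i, :] for i in range(vol.shape[1])]

slices = to_coronal_slices("sub-01_T1w_brain.nii.gz")  # hypothetical file
```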

Overall framework design

During model training, T1, b0, and high-b-value magnetic resonance imaging (MRI) data are used as input. Before the data are fed into the model, they are downsampled and their original dimensions are recorded. The resampled T1 and high-b-value data are then fed into the generator. The b0 images produced by the generator, along with the resampled real b0 data, are used as inputs for the discriminator. The generator output undergoes concatenation and upsampling operations to restore it to the same dimensions as the original real data. Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mutual Information (MI) are employed as evaluation metrics to assess the model’s performance after each training epoch. The model with the best performance is selected as the final result. The overall system architecture is shown in Figure A of Fig 1.
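As a rough sketch of this workflow (all names are hypothetical and the adversarial update itself is deferred to the Loss function section), one could organize training and model selection as follows:

```python
# Hypothetical sketch of the per-epoch workflow described above (not the
# authors' released code): downsample inputs, run one adversarial update,
# and after each epoch keep the checkpoint with the best SSIM/PSNR/MI score.
import torch.nn.functional as F

def train_and_select(generator, loader, train_step, evaluate, epochs=80):
    best_score, best_state = -float("inf"), None
    for epoch in range(epochs):
        for t1, high_b, b0 in loader:               # (N, C, H, W) tensors assumed
            # Downsample inputs; the original size is implicit in b0 itself.
            t1_ds = F.interpolate(t1, size=(256, 256), mode="bilinear", align_corners=False)
            hb_ds = F.interpolate(high_b, size=(256, 256), mode="bilinear", align_corners=False)
            b0_ds = F.interpolate(b0, size=(256, 256), mode="bilinear", align_corners=False)
            train_step(t1_ds, hb_ds, b0_ds)         # one G/D adversarial update

        # After each epoch, upsample generated outputs back to the original
        # dimensions inside `evaluate` and score them (e.g. SSIM + PSNR + MI).
        score = evaluate(generator, loader)
        if score > best_score:
            best_score, best_state = score, generator.state_dict()
    return best_state
```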

Generative adversarial network based on residual decoders

Our team proposes a generative adversarial network (GAN) that utilizes residual decoders to manage multi-modal data effectively. The most significant advantage of the RIS lies in its non-interference with pre-trained models: all features are processed into an average feature vector before the network starts learning. As a result, when multi-modal data is incorporated, there is no need to retrain the pre-trained model, and data from other modalities can be added for training at any time. The generator’s architecture features two rapid insertion structure (RIS) encoders designed to process multi-modal data and a shared residual decoder. The discriminator employs a classical PatchGAN network to enhance the model’s discriminative capabilities.

For data input, the T1 and high-b-value images are fed simultaneously into two distinct RIS encoders. Each encoder processes the data of its respective modality, producing eight multi-scale feature maps (16 in total across the two encoders).

The residual decoder then integrates information from these 16 multi-scale feature maps through a series of complex fusion operations. This integration allows the model to fully utilize the complementary aspects of the multi-modal data, thereby generating more accurate and realistic low-b-value images.

Ultimately, after being processed by the residual decoder and output through the Tanh activation function, the desired low-b-value image is produced. This entire process takes full advantage of the residual decoder, not only improving the quality of the generated images but also enhancing the stability and robustness of the model. Figure B of Fig 1 shows the design details.

Encoder.

T1 images and high-b-value images are used as multi-modal data inputs, represented by $x_m$, where $m$ indexes the modality. The data from each modality enters the corresponding encoder. In each encoder, the data is first processed through a convolutional block and the LeakyReLU activation function, as shown in Eq (1):

$$f_1 = \mathrm{LReLU}\left(\mathrm{Conv}(x_m)\right) \quad (1)$$

where LReLU represents the LeakyReLU activation function, which is used to increase the nonlinear capacity; Conv represents the convolution operation with a 3 × 3 kernel, which is used to extract image features; and $f_1$ represents the first feature map, which serves as the starting point for subsequent operations.

The feature maps are then processed by six identical downsampled convolution blocks. Each block performs convolution, batch normalization, and downsampling operations on the input feature map to generate feature representations at various scales. The output from each downsampled convolution block serves as the input for the subsequent block, creating a hierarchical mechanism for feature extraction, as demonstrated in Eq (2):

$$f_i = \mathrm{LReLU}\left(\mathrm{BN}\left(\mathrm{Conv}(f_{i-1})\right)\right), \quad i = 2, \dots, 7 \quad (2)$$

where BN indicates the batch normalization operation and $f_i$ represents the $i$-th feature map. At each stage, the size of the feature map gradually decreases while the number of channels gradually increases to capture image information at different scales. Specifically, the height H and width W are halved after each downsampling, and the number of channels C increases from 64 to 512 with the depth of the convolutional block and then remains constant. Finally, the last feature map is obtained through a convolution block and the ReLU activation function, as shown in Eq (3):

$$f_8 = \mathrm{ReLU}\left(\mathrm{Conv}(f_7)\right) \quad (3)$$

where ReLU indicates the ReLU activation function.

After processing this series of downsampled convolution blocks, each encoder obtained several feature maps of different scales. These feature maps not only contain the rich details of the image but also reveal its internal structure across various scales. By fully extracting the multi-scale features of multi-modal data, our method offers robust support for subsequent image generation and detail enhancement.

It is worth noting that our encoders are designed to be flexible and scalable. If the dataset contains images of additional modalities, we only need to add a corresponding RIS encoder path to extract the new multi-modal information. This design allows our method to accommodate datasets of varying sizes and diversity, supporting multi-modal image generation tasks and also permitting generation directly from single-modal data.
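A hedged PyTorch sketch of one RIS encoder path, as described above, is given below: an initial 3 × 3 convolution with LeakyReLU, six downsampling convolution blocks (convolution, batch normalization, LeakyReLU, halving H and W), and a final convolution with ReLU, yielding eight multi-scale feature maps per modality. The channel widths follow the description (64 growing to 512); the exact hyperparameters are assumptions, not the authors' released code.

```python
# Sketch of one RIS encoder path (assumed layer widths, not the exact model).
import torch
import torch.nn as nn

class RISEncoder(nn.Module):
    def __init__(self, in_ch=1):
        super().__init__()
        chans = [64, 128, 256, 512, 512, 512, 512]
        self.first = nn.Sequential(nn.Conv2d(in_ch, chans[0], 3, 1, 1),
                                   nn.LeakyReLU(0.2, inplace=True))
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),  # halves H and W
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.2, inplace=True)))
        self.down = nn.ModuleList(blocks)
        self.last = nn.Sequential(nn.Conv2d(chans[-1], chans[-1], 3, 1, 1),
                                  nn.ReLU(inplace=True))

    def forward(self, x):
        feats = [self.first(x)]              # f_1
        for block in self.down:              # f_2 ... f_7
            feats.append(block(feats[-1]))
        feats.append(self.last(feats[-1]))   # f_8
        return feats                         # eight multi-scale feature maps
```

A second, structurally identical RISEncoder would consume the high-b-value image, and adding a third modality would amount to instantiating one more encoder path.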

Residual decoder.

The residual decoder is a core component of the proposed method, designed to efficiently fuse key information from multi-modal data and generate high-quality images. The decoder consists of eight residual fusion modules, each responsible for fusing two multi-scale feature maps from different encoders [33].

In the fusion process, each residual fusion module makes full use of the multi-modal feature map from the two encoders as well as the feature map of the previous residual fusion module. The first residual fusion module only fuses multi-modal feature maps to ensure the completeness and accuracy of the initial information. As the decoding process progresses, subsequent modules gradually incorporate more contextual information to enrich the details of the generated images. It is shown in Eq (4):

$$d_i = \mathrm{RFM}\left(d_{i-1}, f^{1}_{i}, f^{2}_{i}\right) \quad (4)$$

where $d_i$ represents the $i$-th feature map of the decoder, RFM represents the residual fusion module, $f^{1}_{i}$ represents the $i$-th feature map of the first encoder, and $f^{2}_{i}$ represents the $i$-th feature map of the second encoder.

To fuse multi-modal information more fully, we introduce a phased fusion strategy in the residual decoder. First, we assume that the two input modalities contribute differently to the feature maps at different scales. Therefore, when fusing multi-modal information, the model selects appropriate features for fusion according to the scale of the feature map and reduces the dimensionality of the multi-modal feature map through a linear layer to extract the most effective information, as shown in Eq (5):

$$f^{L}_{i} = \mathrm{Linear}\left(\mathrm{Concat}\left(f^{1}_{i}, f^{2}_{i}\right)\right) \quad (5)$$

where $f^{L}_{i}$ represents the $i$-th feature map after linear-layer processing, Concat denotes channel concatenation, and Linear represents the linear layer.

In the second stage of the residual decoder, we use two residual convolutional layers to mine the filtered features further. These residual convolutional layers retain the original feature information while enhancing the fusion effect by learning deeper features. In particular, we use a 3 × 3 convolutional layer to finely learn multi-modal features to capture subtle changes in the image. It is shown in Eq (6):

$$f^{R}_{i} = f^{L}_{i} \oplus \mathrm{RConv}\left(f^{L}_{i}\right) \quad (6)$$

where RConv is the residual convolution and $\oplus$ is tensor addition. While retaining the $f^{L}_{i}$ information, deeper features are learned through the residual convolutional layer. The specific residual convolutional layer is shown in Eq (7):

$$\mathrm{RConv}\left(f^{L}_{i}\right) = \mathrm{Conv}_{3\times 3}\left(f^{L}_{i}\right) \quad (7)$$

where $f^{R}_{i}$ denotes the $i$-th feature map processed by the residual convolutional layer and $\mathrm{Conv}_{3\times 3}$ denotes the convolutional layer; it is worth noting that, unlike the downsampling convolutional layers, a 3 × 3 convolution is used here to learn the multi-modal features more finely.

Finally, in the third stage, we stitch the features of the residual fusion module of the previous layer with the multi-modal features and perform the fusion and upsampling of the feature map through the upsampling layer. This process not only preserves the information of the previous layer but also gradually restores the original size of the image through an upsampling operation. Specifically, it is shown in Eq (8):

$$d_{i} = \mathrm{ConvT}\left(\mathrm{Concat}\left(f^{R}_{i}, d_{i-1}\right)\right) \quad (8)$$

where Concat represents channel splicing and ConvT represents the transposed convolution operation. With each upsampling convolutional layer, the H and W of the feature map are doubled, while the number of channels C gradually decreases from 512 to 1 with the depth of the transposed convolutional block.

Finally, after processing with the Tanh activation function, we get the resulting image as shown in Eq (9):

$$\hat{y} = \mathrm{Tanh}\left(d_{8}\right) \quad (9)$$

where $\hat{y}$ represents the generated image and Tanh represents the Tanh activation function.

In general, the residual decoder effectively avoids information redundancy and loss problems by fusing the multi-modal information first and then fusing the sorted multi-modal details with the information transmitted from the previous layer. At the same time, introducing residual connections allows the model to retain most of the obtained information while extracting depth features, thereby improving the quality of the generated images.
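A hedged PyTorch sketch of one residual fusion module, following the three stages above (Eqs (5)–(8)), is given below. A 1 × 1 convolution stands in for the paper's "linear layer", and all layer sizes are illustrative assumptions rather than the authors' exact architecture.

```python
# Sketch of one residual fusion module of the decoder (assumed sizes).
import torch
import torch.nn as nn

class ResidualFusionModule(nn.Module):
    def __init__(self, enc_ch, dec_ch, out_ch):
        super().__init__()
        # Stage 1: fuse the two modal maps and reduce dimensionality, Eq (5).
        self.reduce = nn.Conv2d(2 * enc_ch, enc_ch, kernel_size=1)
        # Stage 2: two 3x3 residual convolutions, Eqs (6)-(7).
        self.res1 = nn.Sequential(nn.Conv2d(enc_ch, enc_ch, 3, 1, 1),
                                  nn.BatchNorm2d(enc_ch),
                                  nn.LeakyReLU(0.2, inplace=True))
        self.res2 = nn.Sequential(nn.Conv2d(enc_ch, enc_ch, 3, 1, 1),
                                  nn.BatchNorm2d(enc_ch),
                                  nn.LeakyReLU(0.2, inplace=True))
        # Stage 3: splice with the previous decoder map and upsample, Eq (8).
        self.up = nn.ConvTranspose2d(enc_ch + dec_ch, out_ch,
                                     kernel_size=4, stride=2, padding=1)

    def forward(self, f1_i, f2_i, d_prev=None):
        fused = self.reduce(torch.cat([f1_i, f2_i], dim=1))
        fused = fused + self.res1(fused)          # residual connection
        fused = fused + self.res2(fused)
        if d_prev is not None:                    # first module fuses only modal maps
            fused = torch.cat([fused, d_prev], dim=1)
        return self.up(fused)
```

The full decoder would chain eight such modules over the eight encoder scales and apply a final Tanh as in Eq (9); the first module would be built with dec_ch = 0 since it fuses only the two modal feature maps.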

Discriminator.

The discriminator comprises eight convolutional layers, and the overall structure follows the classic PatchGAN architecture [34]. As a mature discriminator framework, PatchGAN is widely used in CNN-based generative adversarial networks. To effectively distinguish between modalities, we concatenate the image to be judged (the generated image output by the generator or the real image of the target modality) with the source modal image to form a 256 × 256 × 2 feature image. The feature image is then downsampled and convolved by five convolution modules, each of which reduces the feature map size to 1/4 of its previous size. After these convolutional layers, we obtain a single-channel feature map, presented as a two-dimensional matrix, in which each element corresponds to a specific area of the original image. The goal of the discriminator is to push the matrix values of the generated image as close to 0 as possible and the matrix values of the real image as close to 1 as possible, achieving accurate image discrimination. With this design, the discriminator can effectively identify the difference between generated and real images, thereby helping the generator produce more realistic images.
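A hedged sketch of a PatchGAN-style discriminator consistent with this description follows; the layer count and widths are assumptions, not the exact configuration.

```python
# PatchGAN-style discriminator sketch: the candidate b0 image is concatenated
# with the source-modality image (2 input channels) and mapped to a 2D matrix
# of patch scores (real patches should score near 1, generated ones near 0).
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=2, base=64):
        super().__init__()
        layers, c = [], in_ch
        for width in [base, base * 2, base * 4, base * 8, base * 8]:
            layers += [nn.Conv2d(c, width, 4, stride=2, padding=1),
                       nn.BatchNorm2d(width),
                       nn.LeakyReLU(0.2, inplace=True)]
            c = width
        layers += [nn.Conv2d(c, 1, 4, stride=1, padding=1)]  # patch score map (logits)
        self.net = nn.Sequential(*layers)

    def forward(self, candidate_b0, source_img):
        x = torch.cat([candidate_b0, source_img], dim=1)     # 2-channel input
        return self.net(x)                                   # 2D matrix of scores
```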

Loss function.

The loss function consists of several components, including generator adversarial loss, discriminator adversarial loss, and pixel reconstruction loss [35]. These loss functions work together to ensure that the generator can produce a high-quality image and that the discriminator can distinguish between the real image and the generated image.

Precisely, generator adversarial loss measures how the image generated by the generator performs in the discriminator. When the discriminator identifies the generated image as a real image (i.e., the output is close to 1), the generator’s adversarial loss is minimized. This encourages the generator to produce a more realistic image to trick the discriminator.

$$\mathcal{L}_{G} = -\,\mathbb{E}_{x}\left[\log D\left(x, G(x)\right)\right] \quad (10)$$

where $\mathcal{L}_{G}$ represents the generator adversarial loss, $\mathbb{E}$ represents the expectation, $D$ represents the discriminator output, $x$ represents the input image, and $G(x)$ represents the generator output. When the discriminator identifies the generated result as real (output close to 1), the adversarial loss of the generator is minimized.

The discriminator adversarial loss is used to optimize the performance of the discriminator. It contains two parts: one ensures that the real image, after being concatenated with the source image and passed through the discriminator, produces an output close to 1 so that real images are correctly identified; the other ensures that the generated image, after being concatenated with the source image and passed through the discriminator, produces an output close to 0 so that generated images are accurately distinguished. In this way, the discriminator continuously improves its discriminative capability.

$$\mathcal{L}_{D} = -\,\mathbb{E}_{x,y}\left[\log D(x, y)\right] - \mathbb{E}_{x}\left[\log\left(1 - D\left(x, G(x)\right)\right)\right] \quad (11)$$

where $\mathcal{L}_{D}$ represents the discriminator adversarial loss and $y$ represents the real image.

We have also introduced pixel reconstruction losses to further enhance the realism of the generated image. This loss function evaluates the quality of image reconstruction by calculating the pixel difference between the generated and target real images. By minimizing pixel reconstruction losses, we encourage generators to reconstruct target images more accurately at the pixel level.

$$\mathcal{L}_{L1} = \mathbb{E}_{x,y}\left[\lVert y - G(x) \rVert_{1}\right] \quad (12)$$

where $\mathcal{L}_{L1}$ represents the pixel reconstruction loss and $\lVert \cdot \rVert_{1}$ represents the L1 norm.

Ultimately, the overall loss is the weighted sum of these loss functions. By adjusting the pixel reconstruction loss factor and the adversarial loss factor, we can balance the impact of different loss terms on the overall performance.

$$\mathcal{L} = \lambda_{pix}\,\mathcal{L}_{L1} + \lambda_{adv}\,\mathcal{L}_{G} \quad (13)$$

where $\mathcal{L}$ represents the overall loss, $\lambda_{pix}$ represents the pixel reconstruction loss coefficient, and $\lambda_{adv}$ represents the adversarial loss coefficient.

In the training process, the generator and discriminator alternately fix one side to train the other to achieve the purpose of adversarial training. In this way, our method generates high-quality, photorealistic multi-modal images.
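An illustrative PyTorch sketch of this alternating update with the three losses is shown below. The binary cross-entropy formulation and the loss weights are assumptions consistent with Eqs (10)–(13), not the authors' exact settings.

```python
# Alternating adversarial training step: update D with G fixed, then G with
# D fixed, using adversarial losses plus an L1 pixel reconstruction loss.
import torch
import torch.nn as nn

bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()

def train_step(G, D, opt_G, opt_D, t1, high_b, y, lam_pix=100.0, lam_adv=1.0):
    # --- update discriminator with generator fixed, Eq (11) ---
    opt_D.zero_grad()
    fake = G(t1, high_b).detach()
    d_real = D(y, high_b)                                   # real b0 + source image
    d_fake = D(fake, high_b)                                # generated b0 + source image
    loss_D = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    loss_D.backward()
    opt_D.step()

    # --- update generator with discriminator fixed, Eqs (10), (12), (13) ---
    opt_G.zero_grad()
    fake = G(t1, high_b)
    d_fake = D(fake, high_b)
    loss_G = lam_adv * bce(d_fake, torch.ones_like(d_fake)) + \
             lam_pix * l1(fake, y)
    loss_G.backward()
    opt_G.step()
    return fake.detach()
```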

We use Peak Signal Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Mutual Information (MI) as evaluation metrics. PSNR quantifies image difference, with higher values indicating better quality. SSIM assesses image structural similarity, and in our research, it specifically indicates the model’s prediction accuracy—values closer to 1 mean higher accuracy. MI measures the information overlap between images, with higher values showing stronger correlation.
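These three metrics can be computed on a pair of 2D slices as sketched below, using scikit-image for PSNR and SSIM and a simple joint-histogram estimate for mutual information (the bin count is an arbitrary choice for illustration).

```python
# Evaluation metrics on a generated/reference slice pair normalized to [0, 1].
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def mutual_information(a, b, bins=64):
    # Histogram-based MI estimate between two images.
    hist_2d, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = hist_2d / hist_2d.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0                                       # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])))

def evaluate_pair(generated, reference):
    psnr = peak_signal_noise_ratio(reference, generated, data_range=1.0)
    ssim = structural_similarity(reference, generated, data_range=1.0)
    mi = mutual_information(generated, reference)
    return psnr, ssim, mi
```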

The author-generated code supporting the findings of this study is available in the GitHub repository at https://github.com/opsliuhao/RIS. The code is released under the MIT License.

Experiments and results

To verify the quality of the images generated by the proposed multi-modal network, we designed a series of experiments, including comparisons with uni-modal methods, ablation experiments, and DTI estimation.

Comparison experiments

We selected five representative models for the comparative experiments: pix2pix, CycleGAN, pGAN, SwinUnet, and ResViT. We strictly followed each method’s experimental parameters, network architecture, and loss function settings as provided in its publicly released code. All experiments follow the same multi-center joint training protocol, and the dataset division is kept identical so that the models can be compared fairly under the same conditions. The experiments were run on an NVIDIA GeForce RTX 3090 graphics card, and training was unified to 80 epochs.

To comprehensively evaluate each model’s performance, we selected peak signal-to-noise ratio, structural similarity, and mutual information as evaluation indicators. These indicators can quantitatively evaluate the images generated by each model in terms of image quality, structure, and information relevance.

Table 4 shows the quantitative results. Among the uni-modal methods, those using high-b-value images as input generally outperform those using T1 images as input. Our RISNet performs excellently on all datasets when multi-modal data is used as input. It achieves the best results, especially on the UCDavis, MountSinai, and UWM datasets, which fully demonstrates its advantages in multi-modal image generation tasks.

Table 4. The results of evaluation indicators under different datasets by different methods.

https://doi.org/10.1371/journal.pone.0329653.t004

When a single-modal image is used as input at test time, RISNet can still produce predictions and outperforms the other models, particularly when high-b-value images are used. This indicates that the model benefits from the richer details obtained through joint training with multi-modal data. Fig 2 presents the qualitative results of the various methods when high-b-value images are used as input.

In summary, our comparative experiments demonstrate the superiority of the proposed multi-modal network in image generation quality. Both quantitative evaluations and qualitative assessments show that our method performs exceptionally well. It effectively leverages the benefits of multi-modal images, resulting in the generation of high-quality low-b-value images.

Ablation experiments

We carefully designed an ablation experiment to verify the critical role of the residual decoder in improving model performance. In this set of experiments, we replaced the residual decoder with an ordinary upsampling convolutional block to observe the specific impact on model performance. In the substitution, we kept the channel concatenation of the multi-modal feature maps and of the previous layer’s upsampling convolutional block unchanged to ensure consistent experimental conditions. The design of this decoder is shown in Eq (14):

$$d_{i} = \mathrm{ConvT}\left(\mathrm{Concat}\left(f^{1}_{i}, f^{2}_{i}, d_{i-1}\right)\right) \quad (14)$$

The experimental results are shown in Fig 3 and Table 5. After replacing the residual decoder on the UWM dataset, the number and quality of the generated image texture features are significantly reduced. This is directly reflected in the decline of various evaluation indicators, which fully proves the effectiveness of the residual decoder in improving the texture detail of the image.

Although both methods can only roughly learn texture details on the AMU dataset, the image generated by the model with the residual decoder is closer in color to the reference image. In terms of evaluation indicators, however, the results after removing the residual decoder improve, possibly because of the quality and quantity of the AMU dataset itself and its low similarity to the data from other sites. This does not mean that replacing the residual decoder is the better option, since the residual decoder has better generalization capability.

After replacing the residual decoder, the model performs poorly on the MountSinai and UCDavis datasets. On the MountSinai dataset, the model failed to effectively learn the texture information, resulting in a significant decline in various indicators. On the UCDavis dataset, the model even learns unreal texture information after replacing the residual decoder, which further widens the gap with the reference image and reduces the evaluation index.

In summary, this set of ablation experiments has identified the residual decoder’s crucial role in enhancing the model’s texture detail and generalization ability.

DTI estimation experiments

DTI estimates the diffusion tensor by applying gradient pulses in different directions to measure the diffusion of water molecules along each direction. Based on the measured diffusion-weighted imaging data, the diffusion tensor parameters for each voxel, such as the principal diffusion direction and the eigenvalues and eigenvectors of the diffusion tensor, can be calculated using linear regression. DTI provides a variety of diffusion parameters, the most commonly used of which is Fractional Anisotropy (FA), which indicates the degree of directionality of water diffusion in the tissue. The experimental results are shown in Fig 4. The texture of the FA images estimated from each dMRI dataset is clear, but they contain a significant amount of noise. By incorporating low-b-value images generated by the various methods, the noise levels in the FA images are reduced to varying extents. Notably, our RISNet approach effectively removes noise while better preserving the sharp, highlighted texture details of the original FA image. In contrast, although the SwinUnet method demonstrates some denoising capability, it inadvertently introduces additional unwanted noise.
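For reference, the tensor fit and FA map can be obtained as sketched below with DIPY; the file names are hypothetical and the authors may have used FSL tooling instead, but the underlying voxel-wise tensor estimation is the same.

```python
# Hedged sketch of FA estimation from a dMRI series that includes a
# generated b0 volume (hypothetical file names).
import nibabel as nib
from dipy.core.gradients import gradient_table
from dipy.io.gradients import read_bvals_bvecs
from dipy.reconst.dti import TensorModel

dwi_img = nib.load("dwi_with_generated_b0.nii.gz")
data = dwi_img.get_fdata()
bvals, bvecs = read_bvals_bvecs("dwi.bval", "dwi.bvec")
gtab = gradient_table(bvals, bvecs)

tensor_fit = TensorModel(gtab).fit(data)   # voxel-wise diffusion tensor fit
fa = tensor_fit.fa                         # fractional anisotropy map
nib.save(nib.Nifti1Image(fa, dwi_img.affine), "fa_from_generated_b0.nii.gz")
```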

To further validate the quality of the low-b-value images generated by our method, we performed in-depth DTI estimation experiments. In these experiments, we added the low-b-value volumes generated by the different methods to the dMRI images. On the UCDavis dataset, we obtained dMRI images enhanced by the different methods and estimated DTI from them, while on the MountSinai-S dataset, we replaced the original low-b-value images with the low-b-value images generated by ResViT and RISNet for Xtract tracking experiments. The experimental results are shown in Fig 5.

In the Xtract tracking experiment, we found that replacing the original low-b-value images with those generated by our proposed method produced Xtract outputs that closely resemble those obtained from the original diffusion MRI (dMRI) data. In contrast, the results from the ResViT method deviated significantly from those of the original dMRI images.

Through experiments involving DTI estimation, we confirmed the advantages of our RISNet method in generating high-quality low-b-value images. Our approach enhances the visual quality of FA images and more accurately reflects the characteristics of the tissue’s microstructure.

Discussion

In this study, we propose RISNet for macaque dMRI image conversion and data enhancement. The RIS encoder is extensible: a new encoder path with the same structure can be added to accept one more modal input. In addition, the core of this approach lies in the design of its residual decoder, which fuses information effectively in three steps. First, the decoder fuses the multi-modal feature maps from the encoders and reduces the dimensionality of the multi-modal features. Then, the residual convolutional structure is used to learn deeper information while retaining the existing multi-modal information. Finally, the decoder fuses the refined multi-modal information with the previous layer’s feature map, filtering the key, detailed information out of the redundant information.

The greatest advantage of the RIS is that it does not interfere with the pre-trained model: all features are processed into an average feature vector before network learning. Therefore, when multi-modal data is added, there is no need to retrain the pre-trained model, and data of other modalities can be added for training at any time.

Considering that the sample size of macaque data at each site is relatively small, but the data from each site show intraspecific consistency, we adopted a multi-center joint training method to alleviate the problem caused by sample size. However, the differences in acquisition parameters between different sites put forward high requirements for the model’s generalization ability. Experimental results show that the traditional uni-modal method performs poorly in image generation details. Specifically, Pix2pix retains multi-scale information to a certain extent through the U-Net network structure. CycleGAN’s cyclic consistency loss performs well in unsupervised scenarios. Although pGAN can learn information about deep networks by using ResNet as a generator, it ignores the advantages of U-Net in retaining multi-scale information. SwinUnet and ResViT capture the contextual relationship between pixels through an attention mechanism. However, as shown in the figure, due to the limitations of single-modal data, these methods are relatively general in performance, and their generalization ability is weak under the condition of multi-center joint training.

In contrast, RISNet is based on the U-Net architecture, introducing two encoders to read data in two modalities. The residual decoder fuses multi-scale and multi-modal information hierarchically, generating an image that performs better in detail texture. At the same time, it can still maintain good generalization ability in joint training.

Ablation experiments further demonstrated the effectiveness of the residual decoder. When we replaced the multi-modal residual fusion module with simple dimension stitching and upsampling convolution blocks, both the qualitative and quantitative results of the model significantly declined, and its generalization ability was greatly diminished. This indicates that the basic upsampled convolutional block is not effective in extracting important features from the multi-modal feature map, nor does it manage the redundant information present in the previous convolutional blocks. In contrast, our designed residual decoder substantially enhances the efficiency of multi-modal fusion by first reducing the dimensionality of the multi-modal features to extract relevant information. It then fuses these reduced features with those extracted from the convolutional blocks of the previous layer, facilitating the extraction of more detailed information.

In addition, the experimental results of DTI estimation indirectly prove that the quality of the low-b-value images generated by our method is high, which is of positive significance for subsequent analyses such as nerve fiber tract tracing. It is worth noting that, unlike the images mapped to the RGB color space, the evaluation of 3D medical images should not be limited to visual observation and the calculation of quantitative evaluation indicators, and further analysis of the generated images is also an effective evaluation method [36].

As we move forward, we plan to broaden our approach to human brain imaging. While most existing human brain imaging data do not typically suffer from an imbalance between low-b-value and high-b-value images, this area remains crucial for medical research. Presently, most methods for converting human brain images rely on single-modal images. Although some progress has been made, these methods do not fully leverage the benefits of multi-modal data. Therefore, future research in multi-modal human brain image conversion will be essential. We aim to achieve more innovations and breakthroughs in medical image analysis by continuously refining and enhancing our methodology.

Conclusion

In this paper, we propose a variable multi-modal image feature fusion adversarial neural network called RISNet, which uses the rapid insertion structure (RIS) to feed features from different modalities into a general residual decoding structure for multi-modal macaque brain image conversion and data enhancement. In the encoder part of the generator, we downsample and extract multi-scale feature maps from T1 and high-b-value images through two identical convolutional downsampling paths. In particular, our method is not limited to two input modalities; a new encoder path can be added to receive an additional modal image. In the decoder part of the generator, we first fuse the multi-modal information and reduce its dimensionality through the three-step fusion method, then further mine the multi-modal information through the residual network, and finally fuse it with the information obtained by the previous decoding layer, thereby eliminating redundant information and learning more image detail. Finally, we verified the method’s effectiveness through comparative, ablation, and DTI estimation experiments. Our method can be applied to generate low-b-value images of macaques and to enhance dMRI image data, providing more high-quality dMRI images for brain science research.

Acknowledgments

We would like to express our gratitude to Professor Li Haifang from Taiyuan University of Technology for providing financial support for this research and offering meticulous guidance and rigorous review during the research design, experiment implementation, and paper writing processes. Professor Li’s professional insights and rigorous attitude have significantly improved the quality of this study, and we hereby extend our sincere thanks.

References

  1. 1. Wang QS, Wang Y, Chai JW. Review the research of homologous brain regions on human and macaque. J Taiyuan Univ Technol. 2021;52:274–81.
  2. 2. Wang Q, Fei H, Abdu Nasher SN, Xia X, Li H. A Macaque brain extraction model based on U-net combined with residual structure. Brain Sci. 2022;12(2):260. pmid:35204023
  3. 3. Lu X, Wang Q, Li X, Wang G, Chen Y, Li X, et al. Connectivity reveals homology between the visual systems of the human and macaque brains. Front Neurosci. 2023;17:1207340. pmid:37476839
  4. 4. Li BQ, Wang QS, Yao R. Research on the connectivity method of human and macaque brain regions based on DTI. Chin J Magn Reson Imaging 2022;13:43–8.
  5. 5. Hao J-G, Wang J-P, Gu Y-L, Lu M-L. Importance of b value in diffusion weighted imaging for the diagnosis of pancreatic cancer. World J Gastroenterol. 2013;19(39):6651–5. pmid:24151395
  6. 6. Wongkornchaovalit P, Feng M, He H, Zhong J. Diffusion MRI with high to ultrahigh b-values: How it will benefit the discovery of brain microstructure and pathological changes. Investig Magn Reson Imaging. 2022;26(4):200.
  7. 7. Soares JM, Marques P, Alves V, Sousa N. A hitchhiker’s guide to diffusion tensor imaging. Front Neurosci. 2013;7:31. pmid:23486659
  8. 8. Koirala N, Kleinman D, Perdue MV, Su X, Villa M, Grigorenko EL, et al. Widespread effects of dMRI data quality on diffusion measures in children. Hum Brain Mapp. 2022;43(4):1326–41. pmid:34799957
  9. 9. Cetin Karayumak S, Bouix S, Ning L, James A, Crow T, Shenton M, et al. Retrospective harmonization of multi-site diffusion MRI data acquired with different acquisition parameters. Neuroimage. 2019;184:180–200. pmid:30205206
  10. 10. Goodfellow I, Pouget-Abadie J, Mirza M. Generative adversarial nets. Adv Neural Inf Process Syst. 2014;27:1–9.
  11. 11. Thambawita V, Salehi P, Sheshkal SA, Hicks SA, Hammer HL, Parasa S, et al. SinGAN-Seg: Synthetic training data generation for medical image segmentation. PLoS One. 2022;17(5):e0267976. pmid:35500005
  12. 12. Valliani AA, Gulamali FF, Kwon YJ, Martini ML, Wang C, Kondziolka D, et al. Deploying deep learning models on unseen medical imaging using adversarial domain adaptation. PLoS One. 2022;17(10):e0273262. pmid:36240135
  13. 13. Li , Cao Q, Liu CM. Image super-resolution based on no match generative adversarial network. J Zhengzhou Univ (Eng Sci). 2021;42:1–6.
  14. 14. Onakpojeruo EP, Mustapha MT, Ozsahin DU, Ozsahin I. A comparative analysis of the novel conditional deep convolutional neural network model, using conditional deep convolutional generative adversarial network-generated synthetic and augmented brain tumor datasets for image classification. Brain Sci. 2024;14(6):559. pmid:38928561
  15. 15. Isola P, Zhu JY, Zhou T. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1125–34.
  16. 16. Schilling KG, Blaber J, Huo Y, Newton A, Hansen C, Nath V, et al. Synthesized b0 for diffusion distortion correction (Synb0-DisCo). Magn Reson Imaging. 2019;64:62–70. pmid:31075422
  17. 17. Sun J, Du Y, Li C, Wu T-H, Yang B, Mok GSP. Pix2Pix generative adversarial network for low dose myocardial perfusion SPECT denoising. Quant Imaging Med Surg. 2022;12(7):3539–55. pmid:35782241
  18. 18. Dar SU, Yurt M, Karacan L, Erdem A, Erdem E, Cukur T. Image synthesis in multi-contrast MRI with conditional generative adversarial networks. IEEE Trans Med Imaging. 2019;38(10):2375–88. pmid:30835216
  19. 19. Mahapatra D, Bozorgtabar B. Progressive generative adversarial networks for medical image super resolution. arXiv preprint arXiv:1902.02144; 2019.
  20. 20. Yu B, Zhou L, Wang L, Shi Y, Fripp J, Bourgeat P. Ea-GANs: Edge-aware generative adversarial networks for cross-modality MR image synthesis. IEEE Trans Med Imaging. 2019;38(7):1750–62. pmid:30714911
  21. 21. Armanious K, Jiang C, Fischer M, Küstner T, Hepp T, Nikolaou K, et al. MedGAN: Medical image translation using GANs. Comput Med Imaging Graph. 2020;79:101684. pmid:31812132
  22. 22. Cao H, Wang YY, Chen J. Swin-Unet: Unet-like pure transformer for medical image segmentation. Eur. Conf. Comput. Vis. 2022, 205–218.
  23. 23. Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, et al. Swin-Unet: Unet-like pure transformer for medical image segmentation. In: Computer vision – ECCV 2022 workshops: Tel Aviv, Israel, October 23–27, 2022, proceedings, Part III; 2022. p. 205–18.
  24. 24. Dalmaz O, Yurt M, Cukur T. ResViT: Residual vision transformers for multimodal medical image synthesis. IEEE Trans Med Imaging. 2022;41(10):2598–614. pmid:35436184
  25. 25. Dalmaz O, Yurt M, Cukur T. ResViT: Residual vision transformers for multimodal medical image synthesis. IEEE Trans Med Imaging. 2022;41(10):2598–614. pmid:35436184
  26. 26. Haberl D, Spielvogel CP, Jiang Z, Orlhac F, Iommi D, Carrió I, et al. Multicenter PET image harmonization using generative adversarial networks. Eur J Nucl Med Mol Imaging. 2024;51(9):2532–46. pmid:38696130
  27. 27. Ha J, Park JS, Crandall D, Garyfallidis E, Zhang X. Multi-resolution guided 3D GANs for medical image translation. arXiv preprint arXiv:2412.00575; 2024.
  28. 28. Xie X, Chen J, Li Y, Shen L, Ma K, Zheng Y. Generative adversarial network for medical image domain adaptation using mutual information constraint. In: Medical image computing and computer assisted intervention – MICCAI 2020 ; 2020. p. 516–25.
  29. 29. Milham MP, Ai L, Koo B, Xu T, Amiez C, Balezeau F, et al. An open resource for non-human primate imaging. Neuron. 2018;100(1):61-74.e2. pmid:30269990
  30. 30. Jenkinson M, Beckmann CF, Behrens TE, et al. FSL. NeuroImage. 2012;62:782–90.
  31. 31. Andersson JLR, Sotiropoulos SN. An integrated approach to correction for off-resonance effects and subject movement in diffusion MR imaging. Neuroimage. 2016;125:1063–78. pmid:26481672
  32. 32. Fei H, Wang Q, Shang F, Xu W, Chen X, Chen Y, et al. HC-Net: A hybrid convolutional network for non-human primate brain extraction. Front Comput Neurosci. 2023;17:1113381. pmid:36846727
  33. 33. Zhou M, Zhang Y, Xu X, Wang J, Khalvati F. Edge-enhanced dilated residual attention network for multimodal medical image fusion; 2024. https://arxiv.org/abs/2411.11799
  34. 34. Tahmid M, Alam MdS, Rao N, Ashrafi KMA. Image-to-image translation with conditional adversarial networks. In: 2023 IEEE 9th international women in engineering (WIE) conference on electrical and computer engineering (WIECON-ECE); 2023. p. 1–5. https://doi.org/10.1109/wiecon-ece60392.2023.10456447
  35. 35. Longfei L, Sheng L, Yisong C, Guoping W. X-GANs: Image reconstruction made easy for extreme cases. arXiv preprint arXiv:1808.04432; 2018.
  36. 36. Lévêque L, Outtas M, Liu H, Zhang L. Comparative study of the methodologies used for subjective medical image quality assessment. Phys Med Biol. 2021;66(15):10.1088/1361-6560/ac1157. pmid:34225264