
Accurate and lightweight MRI super-resolution via multi-scale bidirectional fusion attention network

Abstract

High-resolution magnetic resonance (MR) imaging has attracted much attention due to its contribution to clinical diagnosis and treatment. However, because of noise interference and the limitations of imaging equipment, generating a satisfactory image is expensive. Super-resolution (SR) is a technique that enhances an imaging system's resolution and is an effective and cost-efficient option for MR imaging. In recent years, deep learning-based SR methods have made remarkable progress on natural images but not on medical images. Most existing medical image SR algorithms focus on the spatial information of a single image but ignore the temporal correlation within a medical image sequence. We propose two novel architectures, for a single medical image and for sequential medical images, respectively. The multi-scale back-projection network (MSBPN) is constructed of several back-projection units of different scales, which consist of iterative up- and down-sampling layers. The multi-scale mechanism extracts spatial information at different scales and strengthens information fusion for a single image. Based on MSBPN, we propose an accurate and lightweight Multi-Scale Bidirectional Fusion Attention Network (MSBFAN) that combines temporal information iteratively. The supplementary temporal information is extracted from the image sequence adjacent to the target image. The MSBFAN can effectively learn both the spatio-temporal dependencies and the iterative refinement process with only a small number of parameters. Experimental results demonstrate that our MSBPN and MSBFAN outperform current SR methods in terms of reconstruction accuracy and model parameter count.

Introduction

Magnetic resonance imaging (MRI) is a non-invasive medical imaging technique that offers outstanding spatio-temporal resolution and clear soft-tissue contrast. Since its invention in 1972, MRI has proven to be a versatile imaging technique and is widely used in hospitals and clinics. Compared with other imaging techniques such as computed tomography (CT) and positron emission tomography (PET), MRI does not involve X-rays or the use of ionizing radiation. However, to acquire high-quality MR images clinically, patients usually need to remain still in a narrow tube for a long time, which aggravates their discomfort and unavoidably introduces motion artefacts that compromise image quality. Long acquisition times and the sustained increase in demand for MRI within health systems have led to concerns about cost-effectiveness.

To accelerate acquisition and ensure MR image quality without any hardware update, a large number of published studies consider adopting super-resolution (SR) algorithms, which have been widely studied and applied in the natural image domain. Interpolation-based SR algorithms estimate the value of the current pixel from adjacent pixels [1, 2]. Reconstruction-based SR algorithms incorporate prior information to generate the high-resolution image [3]. Compressed sensing (CS) proved that the sparsity of a signal can be exploited to recover it from far fewer samples than required by the Nyquist-Shannon sampling theorem, and CS-based work has achieved preferable performance in MR imaging [4, 5]. Lingala et al. [6] show that exploiting spatio-temporal redundancy in MR image sequences can immensely improve reconstruction quality. However, one of the most significant challenges of these traditional approaches is that the reconstructed image exhibits over-smoothing and aliasing artefacts that severely degrade image quality. Furthermore, the regularization functions and their hyper-parameters are sensitive and must be selected carefully, which makes practical application difficult.

In recent years, multiple advanced SR models [7–10] have been proposed with the significant development of deep learning, attracting increasing attention due to their superior performance on natural images. Unlike traditional algorithms, deep learning methods directly learn an end-to-end mapping between low/high-resolution image pairs without specifying prior information or regularization in the training process. Dong et al. [11] is the pioneering work that introduced convolutional neural networks (CNNs) [12] to the SR field, confirming the advantage of CNNs in image feature extraction. Subsequent works mainly focus on increasing model depth and width to construct more complex structures that better extract and merge feature maps. However, those methods are mainly aimed at natural image SR rather than medical images. Meanwhile, training and applying deeper and wider models is difficult due to the large number of parameters and high computational cost.

Benefiting from the convenience of medical image dataset acquisition, researchers have employed various kinds of neural networks to enhance the quality of MR images directly [13–19]. These deep learning-based methods assimilate the characteristics of MR images and perform better than natural image models. Although existing medical image SR methods have achieved significant improvements, they still suffer from several limitations. Firstly, almost all existing methods extract information at only a single scale and neglect the information of the other scales, which often requires many parameters due to large kernel sizes and harms model accuracy. Secondly, most existing deep learning research for MR images is based on a single image or separate image sequences, exploiting inherent image redundancy to recover lost high-frequency details but ignoring the temporal correlations of the medical image sequence; the spatio-temporal dependencies are not fully exploited. Finally, in pursuit of better SR performance, SR models are becoming more and more complex. However, non-attention methods [20–22] treat all image features equally, which hinders training deeper models and is detrimental to image reconstruction. Meanwhile, for medical image sequences, the abundant spatial information of the target slice and the supplementary temporal information from a set of adjacent slices have different effects on target slice reconstruction.

In this paper, we propose two novel networks to resolve the issues mentioned above. For a single medical image, we present a multi-scale back-projection network (MSBPN) to extract information from different scales, which reduces the number of parameters and further improves SR performance. For medical image sequences, we integrate the benefits of the MSBPN and propose an accurate and lightweight multi-scale bidirectional fusion attention network (MSBFAN) to explore the spatio-temporal dependencies iteratively. Specifically, we employ MSBPN to explore the abundant spatial information of the target slice and adopt ResNet [23] to extract the supplementary temporal information from a set of adjacent slices; fusion attention is then employed to filter and combine the spatial and temporal information to further improve the quality of the target slice. Our contributions include the following key innovations:

Multi-Scale Back-Projection Network for single target MR image: We propose MSBPN for extracting details of different scales through multiple up- and down-sampling layers. We combine back-projection and multi-scale to expose residual features of multiple scales, and thus better performance and computational efficiency are achieved.

Iteratively integrating spatial and temporal information: For MR image sequences, spatial and temporal information is extracted from different sources. The spatial exploration block outputs feature maps of the target slice, and the temporal exploration block extracts multiple sets of feature maps from adjacent slices. These different sources are fused into the HR slice iteratively. To the best of our knowledge, this is the first work in MR image SR to adequately investigate the supplementary temporal information provided to the target slice by adjacent slices.

Multi-Scale Bidirectional Fusion Attention Network for sequence MR images: We propose MSBFAN, a bidirectional recurrent neural network based on MSBPN, which uses only a modest number of parameters to achieve state-of-the-art performance on the SR task (Fig 1). Our MSBFAN effectively boosts performance by iteratively integrating spatial and temporal information.

Fig 1. Performance comparison among several SR models on the IXI dataset for 4x SR.

Test results vs. the number of model parameters. The symbols ⋆, ∘, □ and △ represent models with less than 5M, 10M, 15M and more than 15M parameters, respectively. Note that RBPN, BasicVSR and our MSBFAN treat 7 slices of a 3D volume as one training sample.

https://doi.org/10.1371/journal.pone.0277862.g001

Related work

The key problem of image super-resolution is how to perform upsampling [24]. Based on the employed upsampling operations and their locations, the architectures of existing models can be divided into the following types. Pre-upsampling models [11] utilize traditional upsampling algorithms to obtain intermediate higher-resolution images and then refine them using residual learning [25, 26] and recursive layers [27]. Pre-upsampling makes model learning much easier; however, this approach often introduces extra noise and blurring while increasing time and memory costs. Post-upsampling [22, 28–32] performs most of the computation in low-dimensional space to improve computational efficiency and increases resolution automatically at the end of the model. However, these models also require a large number of parameters to learn the complicated LR-to-HR mapping. Progressive upsampling [24, 33] cascades upsampling modules to decompose a complex task into several simple tasks and progressively reconstruct multiple SR images, which dramatically reduces the learning difficulty. Iterative up- and down-sampling [20, 34, 35] applies back-projection [36] to compute the reconstruction error and fuse it back to tune the HR image intensity. This framework can better explore the deep relationships between LR-HR image pairs. Recently, some works [37–41] have adopted multiple scales to fully exploit image features, but there is no research on the fusion of iterative projection and multi-scale design.

Benefiting from the development of deep learning, more and more researchers have presented 2D and 3D CNN models for medical images. 3D CNNs generally outperform 2D CNNs in spatio-temporal feature extraction. However, 3D CNNs are more difficult to train due to the small number of high-quality training samples and the large number of parameters. Schlemper et al. [16] used a deep cascade of CNNs to reconstruct dynamic sequences of 2D cardiac MR images. Qin et al. [15] combined traditional iterative algorithms with CNNs and proposed a convolutional recurrent neural network. Zhao et al. [18] proposed a deep channel splitting network (CSN) with two branches used for different information transmissions. Lyu et al. [17] found that combining multi-contrast information contributes to reconstruction quality. Zhang et al. [42] proposed a squeeze-and-excitation reasoning attention network for accurate 2D MR image SR. Although these 2D CNN models have shown excellent ability to reconstruct 2D MR images, they still lack the ability to extract temporal information. To extract temporal features of a 3D MR volume more fully, 3D CNN models extract features from the 3D MR volume directly. One kind of 3D CNN model [43] converts existing state-of-the-art deep 2D super-resolution models into 3D versions and improves some structures, which is the most convenient and fastest way to apply 3D CNNs to MR images. However, such models struggle to strike a balance between the number of parameters and performance. Recently, Li et al. [44] presented VolumeNet, a lightweight parallel network using parallel connections and group convolution to treat features on different channels unequally, reducing the number of network parameters and the computational complexity significantly while maintaining accuracy.

Proposed method

Multi-scale back-projection

For the 4x case, as shown in Fig 2, we construct an end-to-end trainable architecture based on four scales (1x, 2x, 3x and 4x), and a channel selection mechanism selects and outputs the concatenated HR features. Here, k = 1, 2, …, n and j = 2, 3, …, m index the kth projection unit in each scale and the jth layer in the projection unit, respectively. Stacked up- and down-sampling layers output the synthesized LR feature for each scale and map it to HR features, so that details of different scales can be fused into the HR image. With this design, the MSBPN integrates details of different scales, further improves performance, and reduces the model's parameters.
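
To make the back-projection and multi-scale fusion concrete, below is a minimal PyTorch sketch (the framework stated in the implementation details) of one up-projection unit and the multi-scale channel selection. The per-scale kernel/stride/padding values follow the settings reported later in this paper; the PReLU activations, the bilinear resizing of each scale's output to the common 4x grid, and all class names are our own illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UpProjection(nn.Module):
    """One back-projection up-unit: map LR features to HR features, project
    back to LR space, and use the projection error to refine the HR estimate."""
    def __init__(self, channels, kernel, stride, padding):
        super().__init__()
        self.up1 = nn.ConvTranspose2d(channels, channels, kernel, stride, padding)
        self.down = nn.Conv2d(channels, channels, kernel, stride, padding)
        self.up2 = nn.ConvTranspose2d(channels, channels, kernel, stride, padding)
        self.act = nn.PReLU()

    def forward(self, lr):
        hr0 = self.act(self.up1(lr))    # initial HR estimate
        lr0 = self.act(self.down(hr0))  # back-project to LR space
        err = lr0 - lr                  # projection (reconstruction) error
        hr1 = self.act(self.up2(err))   # map the error back to HR space
        return hr0 + hr1                # refined HR features


class MultiScaleProjection(nn.Module):
    """Run one up-projection per scale and fuse the resulting HR features
    with a 1x1 'channel selection' convolution, as described for the MSBPN."""
    def __init__(self, channels=64):
        super().__init__()
        # per-scale (kernel, stride, padding), taken from the implementation details
        self.cfgs = {1: (3, 1, 1), 2: (6, 2, 2), 3: (7, 3, 2), 4: (8, 4, 2)}
        self.units = nn.ModuleList(
            [UpProjection(channels, *cfg) for cfg in self.cfgs.values()])
        self.select = nn.Conv2d(channels * len(self.cfgs), channels, kernel_size=1)

    def forward(self, lr):
        feats = [unit(lr) for unit in self.units]
        target = feats[-1].shape[-2:]   # the 4x branch defines the common HR grid
        feats = [f if f.shape[-2:] == target
                 else F.interpolate(f, size=target, mode='bilinear',
                                    align_corners=False)
                 for f in feats]
        return self.select(torch.cat(feats, dim=1))


# quick shape check: a 48x48 LR patch yields 192x192 fused HR features
# out = MultiScaleProjection()(torch.randn(1, 64, 48, 48))
```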

Fig 2. The diagram of the proposed MSBPN model.

The overall structure consists of two main parts: cascaded projections of different scales and a channel selection layer. Each scale projection module is composed of k sub-projection modules that employ dense connections to encourage feature reuse. Each sub-projection module contains j alternating up- and down-layer groups to generate the projection error and iteratively refine the LR features.

https://doi.org/10.1371/journal.pone.0277862.g002

Overall network architecture

The overall structure of the proposed MSBFAN model is illustrated in Fig 3. The operation of MSBFAN can be divided into three parts: initial feature extraction, spatio-temporal attention modules and reconstruction. Firstly, the initial feature extraction module extracts the shallow features of the input sequence of LR images {…, It−1, It, It+1, …}, where It is the target slice. Subsequently, these shallow features are transmitted to the spatio-temporal attention modules iteratively to generate the hierarchical features and output the HR features {Ht}. Finally, the output features are collected by the reconstruction module to generate the final SR image.

Fig 3. Overview of the proposed MSBFAN for sequence MR image SR.

The overall structure consists of three parts: initial feature extraction FE(.), spatio-temporal attention modules FSTAM(.) and reconstruction FR(.). The horizontal path is based on our MSBPN and explores the spatial information of the target slice. The vertical path computes the residual features from a pair of target and neighbouring slices to explore the temporal information. In each spatio-temporal attention module, the spatial information and the temporal information are concatenated and enhanced to recover the missing details.

https://doi.org/10.1371/journal.pone.0277862.g003

Initial feature extraction.

The initial feature extraction module consists of a 3×3 convolutional layer and an activation layer. Denote FE(⋅) as the feature extraction function; then for the target slice It, the extracted shallow features St can be represented as:

St = FE(It). (1)

For each neighbouring slice in {…, It−1, It+1, …}, we simply concatenate It with It+k, and the extracted shallow features Tt+k can be represented as:

Tt+k = FE([It, It+k]), (2)

where [⋅, ⋅] denotes channel-wise concatenation.
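
A minimal PyTorch sketch of this module, assuming single-channel MR slices and 64 feature maps (matching ct = cs = 64 in the implementation details); the PReLU activation and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class InitialFeatureExtraction(nn.Module):
    """Shallow feature extraction: a 3x3 convolution followed by an activation.
    The target slice is processed alone (Eq. 1); each neighbouring slice is
    first concatenated with the target slice along the channel axis (Eq. 2)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv_s = nn.Conv2d(1, channels, kernel_size=3, padding=1)  # spatial stream
        self.conv_t = nn.Conv2d(2, channels, kernel_size=3, padding=1)  # temporal stream
        self.act = nn.PReLU()

    def forward(self, target, neighbors):
        # target: (B, 1, H, W); neighbors: list of (B, 1, H, W) tensors
        s_t = self.act(self.conv_s(target))
        t_feats = [self.act(self.conv_t(torch.cat([target, nb], dim=1)))
                   for nb in neighbors]
        return s_t, t_feats
```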

Spatio-temporal attention module.

Our proposed STAM is illustrated in Fig 4. The STAM is composed of a temporal exploration block (TEB), a spatial exploration block (SEB), a spatio-temporal attention block (STAB) and a downsampling block (DB). Abundant spatial information is extracted by the SEB, and temporal information is extracted by the TEB. The STAB, which integrates the SEB and TEB paths, extracts the missing details of the target slice and produces refined HR features. Each module receives the spatial and temporal features and outputs the LR features for the next module together with the HR features Ht.

Fig 4. The proposed spatio-temporal attention module.

The spatial features of the target slice are explored by the spatial exploration block (SEB), and the temporal features of the target and neighbouring slices are explored by the temporal exploration block (TEB). Spatial and temporal features are concatenated and enhanced to construct better HR features, and the next LR features are produced by the downsampling block (DB) for the next module.

https://doi.org/10.1371/journal.pone.0277862.g004

Temporal exploration block. Similar to ResNet, we stack several residual groups, each containing two residual layers, to form a very lightweight network. Denote FT(⋅) as the TEB function; then for shallow temporal features Tt+k, the output of the k-th TEB is given by FT(Tt+k), k ∈ [1, n].

Spatial exploration block. We propose a multi-scale back-projection network that stacks multi-scale projections containing several up- and down-sampling layers to expose the projection errors at different scales. Denote FS(⋅) as the SEB function; then for shallow spatial features St+k, the output of the k-th SEB is given by FS(St+k), k ∈ [1, n].

Spatio-temporal attention block. The STAB receives and concatenates FT(Tt+k) and FS(St+k), then produces refined periodical HR features through the spatio-temporal fusion attention. Denote FA(⋅) as the spatio-temporal fusion attention function; the produced HR features Ht+k can be represented as:

Ht+k = FA([FT(Tt+k), FS(St+k)]). (3)

Downsampling block. The DB downsamples the HR features Ht+k and outputs the LR features for the next module. Denote FD(⋅) as the downsampling function; the LR features for the next module are then obtained as FD(Ht+k). (4)

Therefore, supposing the medical image sequence has n + 1 slices It+k, k ∈ [1, n], with It the target slice, the output of the last STAM can be formulated iteratively as follows: (5) (6)

These periodical HR features constitute the final output of our STAMs.
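
The iterative flow of the STAMs can be summarized by the following sketch, where `teb`, `seb`, `stab` and `db` stand for the four blocks described above and are passed in as placeholder callables; weight sharing across modules and the bidirectional traversal of the neighbours are omitted here for brevity.

```python
import torch


def run_stams(s_t, t_feats, teb, seb, stab, db):
    """Iterate the spatio-temporal attention modules: at step k the temporal
    features of the k-th neighbour and the current spatial LR features are
    fused into HR features, which are then downsampled into the LR input of
    the next module (Eqs. 3-6)."""
    hr_feats = []
    lr = s_t                              # shallow spatial features of the target slice
    for t_k in t_feats:                   # one STAM per neighbouring slice
        temporal = teb(t_k)               # F_T: temporal exploration block
        spatial = seb(lr)                 # F_S: spatial exploration block (MSBPN)
        h_k = stab(torch.cat([temporal, spatial], dim=1))  # F_A: fusion attention
        hr_feats.append(h_k)
        lr = db(h_k)                      # F_D: LR features for the next module
    return hr_feats                       # periodical HR features for reconstruction
```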

Reconstruction.

The final SR output is generated by feeding the concatenated HR features from all STAMs into the reconstruction module Frec. (7)

In our model, Frec is a single convolutional layer with a kernel size of 3×3.
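
A sketch of this reconstruction step, assuming single-channel output slices and six STAMs (a 7-slice sample has six neighbours); the channel counts and class name are our assumptions.

```python
import torch
import torch.nn as nn


class Reconstruction(nn.Module):
    """Concatenate the HR features produced by all STAMs along the channel
    axis and map them to the SR image with a single 3x3 convolution (Eq. 7)."""
    def __init__(self, channels=64, n_stams=6, out_channels=1):
        super().__init__()
        self.conv = nn.Conv2d(channels * n_stams, out_channels,
                              kernel_size=3, padding=1)

    def forward(self, hr_feats):
        return self.conv(torch.cat(hr_feats, dim=1))
```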

Experimental results

In this section, we first introduce the training dataset and implementation details. Then we compare different configurations of MSBPN and of the whole network in terms of SR performance. Finally, our MSBFAN model is compared with several state-of-the-art SR algorithms. We evaluate the quantitative SR results with PSNR and SSIM. In all our experiments, we focus on the 4x SR factor.
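
For reference, PSNR can be computed as in the standard definition below (not code from the paper); SSIM is typically taken from an existing implementation such as skimage.metrics.structural_similarity.

```python
import torch


def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio between a super-resolved slice and its
    ground truth, assuming intensities are scaled to [0, max_val]."""
    mse = torch.mean((sr - hr) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```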

Dataset and implementation details

Our training data are constructed from the IXI dataset, which contains three subsets of MR images: 578 PD volumes, 581 T1 volumes and 578 T2 volumes. For each subset, we split the data into training, testing and validation sets in a ratio of approximately 100:10:1. We select and clip these three types of 3D volumes to a size of 240 x 240 x 91 (height x width x depth) and then generate 47985, 51548 and 51184 2D training examples and 6855, 7364 and 7312 7-slice training examples, respectively. We also apply augmentation such as flipping and rotation. The LR images are generated by downscaling the HR images with bicubic interpolation.
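
A sketch of how the LR inputs and augmented pairs might be generated under these settings; the exact set of flips/rotations and the function names are our assumptions (the paper only states "flipping and rotation" and bicubic downscaling).

```python
import random
import torch
import torch.nn.functional as F


def make_lr(hr, scale=4):
    """Generate the LR counterpart of an HR slice (a (B, 1, H, W) tensor)
    by bicubic downscaling, matching the stated degradation model."""
    return F.interpolate(hr, scale_factor=1.0 / scale,
                         mode='bicubic', align_corners=False)


def augment(slices):
    """Random flips and a 90-degree rotation applied consistently to every
    slice of a training sample (an assumed augmentation set)."""
    if random.random() < 0.5:
        slices = [torch.flip(s, dims=[-1]) for s in slices]   # horizontal flip
    if random.random() < 0.5:
        slices = [torch.flip(s, dims=[-2]) for s in slices]   # vertical flip
    if random.random() < 0.5:
        slices = [torch.rot90(s, 1, dims=[-2, -1]) for s in slices]
    return slices
```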

For the TEB, we construct nine blocks, each consisting of two 3×3 convolutional layers. The up- and down-layers in the TEB and DB use an 8×8 kernel with stride 4 and padding of 2 pixels. For the SEB, we construct projection units at four scales (1x, 2x, 3x, 4x), where each projection unit consists of three up-sampling layers and two down-sampling layers (n = 1, m = 2). For the 1x projection unit, the up- and down-sampling layers use a 3×3 kernel with stride 1 and padding of 1 pixel; for the 2x projection unit, a 6×6 kernel with stride 2 and padding of 2 pixels; for the 3x projection unit, a 7×7 kernel with stride 3 and padding of 2 pixels; and for the 4x projection unit, the configuration of the up- and down-sampling layers is the same as in the TEB and DB. The number of feature maps is set to ct = cs = 64.
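
As a sanity check, these configurations produce the expected magnification at each scale; a small verification sketch (the 48x48 test tensor and variable names are ours, the channel count follows ct = cs = 64):

```python
import torch
import torch.nn as nn

# scale: (kernel, stride, padding), as listed above
cfgs = {1: (3, 1, 1), 2: (6, 2, 2), 3: (7, 3, 2), 4: (8, 4, 2)}
x = torch.randn(1, 64, 48, 48)
for scale, (k, s, p) in cfgs.items():
    up = nn.ConvTranspose2d(64, 64, k, stride=s, padding=p)
    down = nn.Conv2d(64, 64, k, stride=s, padding=p)
    hr = up(x)     # expected spatial size: (48*scale, 48*scale)
    lr = down(hr)  # expected spatial size: (48, 48)
    print(scale, tuple(hr.shape[-2:]), tuple(lr.shape[-2:]))
```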

We train the models with a patch size of 48 × 48, cropped randomly from 60 × 60 LR images. All models are trained end-to-end using the L1 loss; the learning rate is initialized to 10−4 for all layers and decreased by a factor of 10 after half of the total 100 epochs. For optimization, we use Adam with β1 = 0.9, β2 = 0.999 and ϵ = 10−8. All experiments were conducted with Python 3.8.5 and PyTorch 1.6.0 on an NVIDIA GeForce GTX 1080 Ti GPU.
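
A training-loop sketch following the stated hyper-parameters (L1 loss, Adam with β1 = 0.9, β2 = 0.999, ϵ = 10−8, initial learning rate 1e-4 decayed by 10x at the halfway point); `model` and `loader` are placeholders for the network and the data pipeline, which are not shown here.

```python
import torch
import torch.nn as nn


def train(model, loader, epochs=100, device='cuda'):
    """Minimal end-to-end training loop under the stated settings."""
    criterion = nn.L1Loss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                                 betas=(0.9, 0.999), eps=1e-8)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                step_size=epochs // 2, gamma=0.1)
    model.to(device)
    for epoch in range(epochs):
        for lr_seq, hr_target in loader:       # 7-slice LR input, HR target slice
            sr = model(lr_seq.to(device))
            loss = criterion(sr, hr_target.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                       # decay the learning rate by 10x at epoch 50
```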

Model analysis

Multi-scale back-projection network.

The proposed MSBPN can be configured in several ways. For comparison, we have verified the structure of different MSBPN modules from the following aspects:

Back-projection. To study the impact of different back-projection configurations, we construct multiple modules to show the trade-off between performance and the number of network parameters. Specifically, we create two kinds of modules, M1,n and M2,n, to investigate the impact of the number of convolutional layers in the projection unit. We also create three further kinds of modules, Mm,1, Mm,2 and Mm,3, to investigate the impact of the number of projection units. The training and testing results are shown in Fig 5(a) and Table 1. The performance of the model improves as the network depth increases, and the depth is mainly determined by the number of projection units m and the number of convolutional layers per projection unit n. We can infer that the performance improvement of our MSBPN is mainly due to the increase in model depth. However, the depth of the model cannot be increased indefinitely: when m = 2, n = 3, training began to be unstable, showing that models with complex structures and plenty of parameters are promising for improving performance but are more challenging to train fully on MR images.

Fig 5. The performance comparison between different configures of our MSBPN for SR 4x.

(a) Impact of projection units and the convolutional layers of the projection unit, validated on PD. (b) Dense connection, validated on T1. (c) Multi-scale mechanism, validated on T2.

https://doi.org/10.1371/journal.pone.0277862.g005

Table 1. The testing performance of different configurations of projection units and the convolutional layers of projection unit on PD for SR 4x.

https://doi.org/10.1371/journal.pone.0277862.t001

Dense connection. We remove the dense connections of the MSBPN to show how they influence model performance in three cases, as shown in Fig 5(b) and Table 2. The dense connections stabilize the training of deeper networks and adaptively reuse information extracted from the current and preceding back-projection units.

Table 2. Test results of the models with different connection approximations on T1 for SR 4x (PSNR/SSIM).

https://doi.org/10.1371/journal.pone.0277862.t002

Multi-scale. To demonstrate the advantage of our multi-scale mechanism, we build two kinds of networks: SS, which adopts single-scale projection units (four 4x units), and MS, which adopts multi-scale projection units (1x, 2x, 3x and 4x units). The two networks are compared in terms of performance and the number of parameters. The results for 4x enlargement are shown in Fig 5(c) and Table 3. The multi-scale mechanism helps reduce the model's parameters significantly without compromising performance.

Table 3. Test results of the model with multi-scale configuration on T2 for SR 4x (PSNR/Parameters).

https://doi.org/10.1371/journal.pone.0277862.t003

Multi-scale bidirectional fusion attention network.

In this part, we validate several components of the proposed MSBFAN, focusing mainly on the use of temporal information.

Baselines. We consider three baselines with different spatial and temporal information fusion. First, we simply concatenate all slices (7 slices) as the input of the SEB, which introduces temporal information but does not fully explore it. Second, we keep only the forward stream, which turns off the backward temporal connection. Third, we keep only the backward stream, which turns off the forward temporal connection. The testing results are shown in Table 4. The results of SEB (1 slice) and SEB (7 slices) show that information extracted from neighbouring slices contributes to image reconstruction. The combination of spatial and temporal information is also important. The full MSBFAN model achieves 31.41 dB, which is 0.37 dB, 0.08 dB and 0.09 dB higher than SEB (7 slices), MSBFAN (forward) and MSBFAN (backward), respectively.

Table 4. Baseline comparison on PD for SR 4x (PSNR/SSIM).

https://doi.org/10.1371/journal.pone.0277862.t004

Slice length. We evaluated MSBFAN with different lengths of MR image sequences. Fig 6 shows that performance improves as the sequence becomes longer. The model achieves the largest improvement when increasing the number of slices from one to two and three, because the slices closest to the target slice have the highest correlation. The performance of MSBFAN/4 is even better than that of RBPN/7, on which our design is based. Predictably, the performance of our MSBFAN would be further improved as the number of slices increases.

Fig 6. Performance comparison of the model with different numbers of slices, 4x SR on PD.

MSBFAN/s: MSBFAN trained/tested with s slices. Note: MSBFAN/1 is equivalent to MSBPN (n = 1, m = 2).

https://doi.org/10.1371/journal.pone.0277862.g006

Slice order. When selecting the MR image sequence that accompanies the target slice It, we have a choice of how to compose it. We consider three cases: use only the past 2 slices (It−2, It−1), named P; use only the future 2 slices (It+1, It+2), named F; use both the past slice (It−1) and the future slice (It+1), named PF. P denotes that the network is trained and tested on P; P → F denotes that the network is trained on P and tested on F. The results are shown in Table 5. Our intuition suggests, and the results confirm, that PF is better than P and F by 0.19 dB, since the nearest slices carry the most supplementary information about the target slice. P is better than P → F by 0.25 dB and F is better than F → P by 0.21 dB, indicating that the MR image sequence is not symmetric. P and F achieve similar performance, which indicates that the model is robust and insensitive to the order of the MR image sequence.

Table 5. Effect of temporal order of slice on PD for SR 4x.

https://doi.org/10.1371/journal.pone.0277862.t005

Ablation study. To verify the superiority of our MSBFAN, we investigate its basic network modules: the SEB and the STAB. To demonstrate the effect of our SEB, we use DBPN instead of our MSBPN (denoted M→D for short). To demonstrate the effect of our STAB, we use residual learning instead of our spatio-temporal fusion attention (denoted FA→RL for short). We further show the effect of optical flow (OF) on the performance of our model. Table 6 shows the ablation investigation on the three components described above. Comparing the second row and the last row, the model with MSBPN performs better than the one with DBPN: benefiting from our multi-scale mechanism, the performance improves from 30.71 dB to 30.82 dB while the number of parameters is reduced by nearly 16.6%. This comparison firmly demonstrates the effectiveness of MSBPN and indicates that adaptively fusing image details at different scales improves performance. From the comparison between the third row and the last row, we conclude that adopting our spatio-temporal fusion attention block better promotes the fusion of spatio-temporal information, improving performance by 0.16 dB with a similar number of parameters. When both the SEB and the STAB are replaced, the performance drops significantly, by up to 0.25 dB. Just as we predicted, the optical flow information of the slice sequence compromised the performance of the model, reducing it by 0.02 dB while increasing the computational cost.

Table 6. Investigations of SEB, STAB and optical flow on T2 for SR 4x.

https://doi.org/10.1371/journal.pone.0277862.t006

Comparison with other methods

To verify the effectiveness of the proposed MSBPN and MSBFAN more rigorously, we compare them with several advanced SR algorithms: VDSR [25], LapSRN [24], EDSR [21], DBPN [20], RDN [22], CSN [18], RCAN [28], SAN [45], HAN [46], Swin [47], LBNet [48], RBPN [34] and BasicVSR [49]. All models are retrained with the same training configuration on the three generated datasets. Each dataset has a different focus and characteristics.

Table 7 shows that our MSBPN performs less than satisfactorily at small scale factors. This is because fewer scales are used at small scale factors; for example, when the scale factor is 3, only the x1, x2 and x3 scales are used. This is done to effectively reduce the number of model parameters. Compared with RDN, our MSBPN achieves comparable performance with a small number of parameters for scale factor 4. However, all these single-slice-based methods perform worse than the proposed multi-slice-based MSBFAN, indicating the proposed method's superiority. Specifically, for scale factor 4 the PSNR values achieved by our model on the three datasets are higher than those of HAN by 0.22 dB, 0.21 dB and 0.32 dB, respectively. This is because different scales of spatial information are well explored by our MSBPN and more supplementary temporal information is aggregated into the features of the target slice, which greatly helps in reconstructing high-quality images. Our MSBFAN achieves better accuracy than the multi-slice-based models RBPN and BasicVSR, even though RBPN has more than twice as many parameters. Note that RBPN and BasicVSR fail to train for scale factor 2, which shows that simply applying natural image algorithms to medical images is not feasible.

Table 7. Quantitative comparison between the state-of-the-art SR algorithms on 3 test datasets.

https://doi.org/10.1371/journal.pone.0277862.t007

Fig 7 displays the qualitative results on three scenarios from the PD, T1 and T2 datasets, respectively. It can be observed from the zoomed-in regions that our model reconstructs more plentiful and more authentic details and is, overall, most similar to the ground truth. The first and second rows show the result on a PD image. There is a lot of texture at the position indicated by the red arrow, which is not completely reconstructed by the other models, whereas our model gives a relatively comprehensive and clear reconstruction. The third and fourth rows show the result on a T1 image. Similar to PD, there is a black ridge at the position indicated by the red arrow, which divides the region into smaller areas. Only our MSBFAN restores this region well. The last two rows show the result on a T2 image. It can be observed that our MSBFAN almost completely restores the black and white area, whereas several other methods, such as CSN, HAN, Swin and BasicVSR, lose the corresponding area.

Fig 7.

The visual effect of the compared methods on a PD (top), T1 (middle) and T2 (bottom) image with SR x4.

https://doi.org/10.1371/journal.pone.0277862.g007

Conclusion

In this work, we have proposed a novel multi-scale back-projection network (MSBPN) for a single target MR image, primarily made up of back-projection units of different scales to extract abundant spatial information. Inspired by video super-resolution, we also presented a multi-scale bidirectional fusion attention network (MSBFAN) to integrate the spatial and temporal information of sequential medical images. The temporal information is explored from the medical image sequence surrounding the target slice and iteratively integrated with the spatial information, yielding gradual refinement of the high-resolution features that are eventually used to reconstruct the high-resolution target slice. In extensive experiments, we verify the contribution of the various design choices to the ultimate performance of our model and demonstrate that, on the IXI dataset, MSBFAN achieves significant performance advantages over most existing SR methods.

References

1. Meijering E H W, Niessen WJ, Viergever MA. Quantitative evaluation of convolution-based methods for medical image interpolation[J]. Medical image analysis, 2001, 5(2): 111–126. pmid:11516706
2. Lee W L, Yang C C, Wu H T, et al. Wavelet-based interpolation scheme for resolution enhancement of medical images[J]. Journal of Signal Processing Systems, 2009, 55(1): 251–265.
3. Farsiu S, Robinson M D, Elad M, et al. Fast and robust multiframe super resolution[J]. IEEE transactions on image processing, 2004, 13(10): 1327–1344. pmid:15462143
4. Shi F, Cheng J, Wang L, et al. LRTV: MR image super-resolution with low-rank and total variation regularizations[J]. IEEE transactions on medical imaging, 2015, 34(12): 2459–2466. pmid:26641727
5. Tourbier S, Bresson X, Hagmann P, et al. An efficient total variation algorithm for super-resolution in fetal brain MRI with adaptive regularization[J]. NeuroImage, 2015, 118: 584–597. pmid:26072252
6. Lingala S G, Hu Y, DiBella E, et al. Accelerated dynamic MRI exploiting sparsity and low-rank structure: kt SLR[J]. IEEE transactions on medical imaging, 2011, 30(5): 1042–1054. pmid:21292593
7. Diederich B, Then P, Jügler A, et al. cellSTORM—Cost-effective super-resolution on a cellphone using dSTORM[J]. PloS one, 2019, 14(1): e0209827. pmid:30625170
8. Zhao C, Shao M, Carass A, et al. Applications of a deep learning method for anti-aliasing and super-resolution in MRI[J]. Magnetic resonance imaging, 2019, 64: 132–141. pmid:31247254
9. Dai Y, Zhuang P. Compressed sensing MRI via a multi-scale dilated residual convolution network[J]. Magnetic resonance imaging, 2019, 63: 93–104. pmid:31362047
10. Xie D, Li Y, Yang H, et al. Denoising arterial spin labeling perfusion MRI with deep machine learning[J]. Magnetic resonance imaging, 2020, 68: 95–105. pmid:31954173
11. Dong C, Loy C C, He K, et al. Image super-resolution using deep convolutional networks[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 38(2): 295–307.
12. LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278–2324.
13. Chen Y, Xie Y, Zhou Z, et al. Brain MRI super resolution using 3D deep densely connected neural networks[C]//2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE, 2018: 739–742.
14. Chen Y, Shi F, Christodoulou A G, et al. Efficient and accurate MRI super-resolution using a generative adversarial network and 3D multi-level densely connected network[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2018: 91–99.
15. Qin C, Schlemper J, Caballero J, et al. Convolutional recurrent neural networks for dynamic MR image reconstruction[J]. IEEE transactions on medical imaging, 2018, 38(1): 280–290. pmid:30080145
16. Schlemper J, Caballero J, Hajnal J, et al. A deep cascade of convolutional neural networks for dynamic MR image reconstruction[J]. IEEE Transactions on Medical Imaging, 2017, 37: 491–503.
17. Lyu Q, Shan H, Steber C, et al. Multi-contrast super-resolution MRI through a progressive network[J]. IEEE transactions on medical imaging, 2020, 39(9): 2738–2749. pmid:32086201
18. Zhao X, Zhang Y, Zhang T, et al. Channel splitting network for single MR image super-resolution[J]. IEEE transactions on image processing, 2019, 28(11): 5649–5662. pmid:31217110
19. Lyu Q, Shan H, Wang G. MRI super-resolution with ensemble learning and complementary priors[J]. IEEE Transactions on Computational Imaging, 2020, 6: 615–624.
20. Haris M, Shakhnarovich G, Ukita N. Deep back-projection networks for super-resolution[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 1664–1673.
21. Lim B, Son S, Kim H, et al. Enhanced deep residual networks for single image super-resolution[C]//Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2017: 136–144.
22. Zhang Y, Tian Y, Kong Y, et al. Residual dense network for image super-resolution[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 2472–2481.
23. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770–778.
24. Lai W S, Huang J B, Ahuja N, et al. Deep laplacian pyramid networks for fast and accurate super-resolution[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 624–632.
25. Kim J, Lee J K, Lee K M. Accurate image super-resolution using very deep convolutional networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 1646–1654.
26. Tai Y, Yang J, Liu X. Image super-resolution via deep recursive residual network[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 3147–3155.
27. Kim J, Lee J K, Lee K M. Deeply-recursive convolutional network for image super-resolution[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 1637–1645.
28. Zhang Y, Li K, Li K, et al. Image super-resolution using very deep residual channel attention networks[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 286–301.
29. Dong C, Loy C C, Tang X. Accelerating the super-resolution convolutional neural network[C]//European conference on computer vision. Springer, Cham, 2016: 391–407.
30. Shi W, Caballero J, Huszár F, et al. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 1874–1883.
31. Ledig C, Theis L, Huszár F, et al. Photo-realistic single image super-resolution using a generative adversarial network[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 4681–4690.
32. Han W, Chang S, Liu D, et al. Image super-resolution via dual-state recurrent networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 1654–1663.
33. Wang Y, Perazzi F, McWilliams B, et al. A fully progressive approach to single-image super-resolution[C]//Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2018: 864–873.
34. Haris M, Shakhnarovich G, Ukita N. Recurrent back-projection network for video super-resolution[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 3897–3906.
35. Li Z, Yang J, Liu Z, et al. Feedback network for image super-resolution[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 3867–3876.
36. Irani M, Peleg S. Improving resolution by image registration[J]. CVGIP: Graphical models and image processing, 1991, 53(3): 231–239.
37. Li J, Fang F, Li J, et al. MDCN: Multi-scale dense cross network for image super-resolution[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 31(7): 2547–2561.
38. Li J, Fang F, Mei K, et al. Multi-scale residual network for image super-resolution[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 517–532.
39. Zamir S W, Arora A, Khan S, et al. Learning enriched features for real image restoration and enhancement[C]//European Conference on Computer Vision. Springer, Cham, 2020: 492–511.
40. Kim S Y, Oh J, Kim M. FISR: Deep joint frame interpolation and super-resolution with a multi-scale temporal loss[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(07): 11278–11286.
41. Xiao Y, Su X, Yuan Q, et al. Satellite video super-resolution via multiscale deformable convolution alignment and temporal grouping projection[J]. IEEE Transactions on Geoscience and Remote Sensing, 2021, 60: 1–19.
42. Zhang Y, Li K, Li K, et al. MR image super-resolution with squeeze and excitation reasoning attention network[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 13425–13434.
43. Pham C H, Ducournau A, Fablet R, et al. Brain MRI super-resolution using deep 3D convolutional networks[C]//2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017). IEEE, 2017: 197–200.
44. Li Y, Iwamoto Y, Lin L, et al. VolumeNet: a lightweight parallel network for super-resolution of MR and CT volumetric data[J]. IEEE Transactions on Image Processing, 2021, 30: 4840–4854. pmid:33945478
45. Dai T, Cai J, Zhang Y, et al. Second-order attention network for single image super-resolution[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 11065–11074.
46. Niu B, Wen W, Ren W, et al. Single image super-resolution via a holistic attention network[C]//European conference on computer vision. Springer, Cham, 2020: 191–207.
47. Liang J, Cao J, Sun G, et al. SwinIR: Image restoration using swin transformer[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 1833–1844.
48. Gao G, Wang Z, Li J, et al. Lightweight bimodal network for single-image super-resolution via symmetric CNN and recursive transformer[J]. arXiv preprint arXiv:2204.13286, 2022.
49. Chan K C K, Wang X, Yu K, et al. BasicVSR: The search for essential components in video super-resolution and beyond[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 4947–4956.